Publications
Discover highlights from our published research addressing challenges in Digital Humanities and Natural Language Processing.
Small language models (SLMs) are increasingly utilized for on-device applications due to their ability to ensure user privacy, reduce inference latency, and operate independently of cloud infrastructure. However, their performance is often limited when processing complex data structures such as graphs, which are ubiquitous in real-world datasets like social networks and system interactions. Graphs inherently encode intricate structural dependencies, requiring models to effectively capture both local and global relationships. Traditional language models, designed primarily for text data, struggle to address these requirements, leading to suboptimal performance in graph-related tasks. To overcome this limitation, we propose a novel graph encoder-based prompt tuning framework that integrates a graph convolutional network (GCN) with a graph transformer. By leveraging the complementary strengths of the GCN for local structural modeling and the graph transformer for capturing global relationships, our method enables SLMs to effectively process graph data. This integration significantly enhances the ability of SLMs to handle graph-centric tasks while maintaining the efficiency required for resource-constrained devices. The experimental results show that our approach not only improves the performance of SLMs on various graph benchmarks but also achieves results that closely approach the performance of a large language model (LLM). This work highlights the potential of extending SLMs for graph-based applications and advancing the capabilities of on-device artificial intelligence.
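As a rough illustration of the kind of graph encoder described above, the sketch below combines a single GCN propagation step (local structure) with a transformer encoder (global relations) and pools the node states into a small set of soft-prompt vectors. Module names, dimensions, and the pooling scheme are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's exact architecture): a graph encoder that combines a
# GCN layer with a transformer encoder and emits soft-prompt vectors for a small LM.
import torch
import torch.nn as nn

class GraphPromptEncoder(nn.Module):
    def __init__(self, node_dim: int, hidden_dim: int, lm_dim: int, num_prompts: int = 8):
        super().__init__()
        self.gcn_weight = nn.Linear(node_dim, hidden_dim, bias=False)   # GCN: local neighborhoods
        self.global_enc = nn.TransformerEncoder(                        # transformer: global relations
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.prompt_queries = nn.Parameter(torch.randn(num_prompts, hidden_dim))
        self.to_lm_space = nn.Linear(hidden_dim, lm_dim)                # project into SLM embedding space

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, node_dim), adj: (num_nodes, num_nodes) with self-loops
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        h = torch.relu(self.gcn_weight((adj / deg) @ x))                # one GCN propagation step
        h = self.global_enc(h.unsqueeze(0)).squeeze(0)                  # attend across all nodes
        attn = torch.softmax(self.prompt_queries @ h.T, dim=-1)         # pool nodes into prompt slots
        prompts = self.to_lm_space(attn @ h)                            # (num_prompts, lm_dim)
        return prompts  # prepend these to the SLM's input embeddings during prompt tuning
```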
This study proposes a new benchmark to evaluate the cultural understanding and natural language processing capabilities of large language models based on Sino-Korean words and four-character idioms, which are essential linguistic and cultural assets in Korea. Reflecting the official question types of the Korean Hanja Proficiency Test, we constructed four question categories—four-character idioms, synonyms, antonyms, and homophones—and systematically compared the performance of GPT-based and non-GPT LLMs. GPT-4o showed the highest accuracy and explanation quality. However, challenges remain in distinguishing the subtle nuances of individual characters and in adapting to uniquely Korean meanings as opposed to standard Chinese character interpretations. Our findings reveal a gap in LLMs’ understanding of Korea-specific Hanja culture and underscore the need for evaluation tools reflecting these cultural distinctions.
Keywords: large language models evaluation, cultural contextual understanding, Sino-Korean vocabulary, four-character idioms, cross-lingual semantic shift
The evaluation of creative writing has long been a complex and subjective process, made even more intriguing by the rise of advanced Artificial Intelligence (AI) tools like Large Language Models (LLMs). This study evaluates the potential of LLMs as reliable and consistent evaluators of creative texts, directly comparing their performance with traditional human evaluations. The analysis focuses on key creative criteria, including fluency, flexibility, elaboration, originality, usefulness, and specific creativity strategies. Results demonstrate that LLMs provide consistent and objective evaluations, achieving higher Inter-Annotator Agreement (IAA) compared with human evaluators. However, LLMs face limitations in recognizing nuanced, culturally specific, and context-dependent aspects of creativity. Conversely, human evaluators, despite lower consistency and higher subjectivity, exhibit strengths in capturing deeper contextual insights. These findings highlight the need for the further refinement of LLMs to address the complexities of creative writing evaluation.
Pre-trained language models (PrLMs) trained via contrastive learning methods achieved state-of-the-art performance on various natural language processing (NLP) tasks. Most PrLMs for sentence embedding focus on context similarity as the objective of contrastive learning. However, we found that these PrLMs, including recently released large language models (LLMs) like LLaMA 2, underperform when analyzing syntactic information on probing tasks. This limitation becomes particularly noticeable in applications that depend on nuanced sentence understanding, such as the Retrieval-Augmented Generation (RAG) framework for LLMs. This paper introduces a new sentence embedding model named SynCSE: Syntax Graph-based Contrastive Learning of Sentence Embeddings. Our approach enables language models to produce meaningful sentence embeddings by learning syntactic features. To accomplish this, we train a PrLM with graph neural networks (GNNs) that receive a directed syntax graph. We then detach the additional GNN layers from the PrLM for inference, so no syntax graph is required at inference time. The proposed model improves on the baselines in semantic textual similarity (STS) tasks, transfer tasks, and especially probing tasks. Additionally, we observe that our model has improved alignment and competitive uniformity compared with the baseline.
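For context, the in-batch contrastive (InfoNCE) objective typically used to train sentence-embedding models of this kind looks roughly like the sketch below; the temperature value and pairing scheme are illustrative assumptions, not the exact SynCSE setup.

```python
# Compact sketch of an in-batch contrastive objective for sentence embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb: torch.Tensor, positive_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """anchor_emb, positive_emb: (batch, dim) embeddings of paired sentences."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.T / temperature            # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)                # match each anchor to its own positive
```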
Recently, large language models (LLMs) have made significant progress through retrieval-augmented generation (RAG) and preference learning. However, they still exhibit issues such as confirmation bias, the tendency to favor information that confirms one’s beliefs, which remains largely unexplored in current research. In this paper, we propose a novel approach to mitigate confirmation bias-induced hallucination in LLMs through a synthetic data construction pipeline and Direct Preference Optimization (DPO) training. Our method enhances the integration of diverse and complementary information from multiple passages retrieved by RAG, enabling more balanced and accurate reasoning. Experimental results demonstrate significant improvements in response accuracy and reduced hallucination on benchmarks such as Natural Questions Open and HaluBench. These findings suggest that our approach effectively mitigates confirmation bias in long-context question answering, with potential applications to other NLP tasks. We release our data and evaluation/training code for public access.
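The standard DPO objective referenced above can be written in a few lines; the sketch below is illustrative (variable names and the beta value are assumptions, not the paper's code).

```python
# Illustrative sketch of the Direct Preference Optimization (DPO) loss.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Each argument is a tensor of summed log-probabilities of a response under the
    trainable policy or the frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the preferred (balanced, multi-passage) response
    # and the rejected (confirmation-biased) response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```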
This study utilizes network analysis to explore structural unity in Renaissance plays, tracing the influence of medieval touring companies on 16th-century dramatic structures. Employing digital humanities methodologies, the research applies community detection algorithms and silhouette scores to analyze bipartite and multipartite structures across 38 Shakespearean plays. These touring companies, characterized by actors taking on multiple roles, left an enduring imprint on the narrative and character dynamics within Renaissance drama. The study investigates how logistical and theatrical practices influenced the dramatic transformation from the Middle Ages through the Renaissance. Community detection algorithms identify clusters of frequently interacting characters, revealing underlying narrative frameworks. Silhouette scores quantitatively assess the distinctness of these clusters, illuminating the residual dualistic nature of Renaissance drama, potentially inherited from medieval traditions. Additionally, degree centrality measures the influence of central characters on narrative unity, helping to ascertain whether a play's structure revolves around a single protagonist or features a more dispersed model with multiple focal points. While this integration of network theory and literary analysis seeks to understand the interplay between character relationships and narrative structures, the reliance on quantitative methods can oversimplify complex literary details. This experimental approach underscores the need for further research across a broader spectrum of medieval and Renaissance plays and for the development of refined computational methods for literary studies, thus broadening the scope and depth of digital humanities in understanding historical narratives.
Keywords: Shakespeare, Character Network Analysis, Digital Humanities, Mediality, Dramatic Structure, Silhouette Score
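A minimal sketch of the kind of analysis described above: detect character communities in a co-occurrence network and score their separation with a silhouette coefficient. The toy edge list and library choices (networkx, scikit-learn) are illustrative assumptions, not the study's data or pipeline.

```python
# Toy character-network analysis: community detection plus silhouette scoring.
import networkx as nx
import numpy as np
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.metrics import silhouette_score

# Character interaction network: edge weight = number of shared scenes (toy data).
G = nx.Graph()
G.add_weighted_edges_from([
    ("Lear", "Fool", 5), ("Lear", "Kent", 4), ("Kent", "Fool", 3),
    ("Goneril", "Regan", 6), ("Regan", "Edmund", 4), ("Goneril", "Edmund", 5),
    ("Lear", "Goneril", 2),
])

# Community detection: clusters of frequently interacting characters.
communities = greedy_modularity_communities(G, weight="weight")
labels = {node: i for i, com in enumerate(communities) for node in com}

# Silhouette score over shortest-path distances measures how distinct the clusters are.
nodes = list(G.nodes)
dist = np.array([[nx.shortest_path_length(G, u, v) for v in nodes] for u in nodes], dtype=float)
score = silhouette_score(dist, [labels[n] for n in nodes], metric="precomputed")
print(list(communities), round(score, 3))
```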
As the population ages in many countries, the prevalence of neovascular age-related macular degeneration (nAMD) is expected to increase. Morphological parameters such as intraretinal fluid (IRF), subretinal fluid (SRF), subretinal hyperreflective material (SHRM), and pigment epithelium detachment (PED) in spectral-domain optical coherence tomography (SD-OCT) images are vital markers for the proper treatment of nAMD, especially for assessing treatment response to determine the appropriate treatment interval and when to switch anti-vascular endothelial growth factor (VEGF) agents. For precise evaluation of changes in nAMD lesions and patient-specific treatment, quantitative evaluation of the lesions in OCT volume scans is necessary. However, manual segmentation requires substantial resources, and the number of studies on automatic segmentation is increasing rapidly. Improving automated segmentation performance on SD-OCT images requires long-range contextual inference of spatial information between retinal lesions and layers. This paper proposes GAGUNet (graph convolutional network (GCN)-assisted attention-guided UNet), a model with a novel global reasoning module designed with these points in mind. The dataset used in the main experiment underwent rigorous review by a retinal specialist at Konkuk University Hospital in Korea, who contributed to both data preprocessing and validation to ensure qualitative assessment. We also conducted experiments on the RETOUCH dataset to demonstrate the scalability of the proposed model. Overall, our model demonstrates superior performance over the baseline models in both quantitative and qualitative evaluations.
Keywords: Graph convolution network, Transformer, Multiscale skip connection, Medical image segmentation, Retinopathy
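As a generic point of reference for the attention-guided skip connections mentioned above, the sketch below shows an attention gate of the kind used in attention U-Nets; it is not the paper's GAGUNet module or its global reasoning block, only an illustration of gating encoder features with decoder context.

```python
# Generic attention-gated skip connection (illustrative, not the GAGUNet module).
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, enc_ch: int, dec_ch: int, inter_ch: int):
        super().__init__()
        self.theta = nn.Conv2d(enc_ch, inter_ch, kernel_size=1)
        self.phi = nn.Conv2d(dec_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
        # enc_feat, dec_feat: (batch, channels, H, W) at the same spatial resolution
        attn = torch.sigmoid(self.psi(torch.relu(self.theta(enc_feat) + self.phi(dec_feat))))
        return enc_feat * attn  # gated skip features passed on to the decoder
```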
Large language models (LLMs) have demonstrated competitive performance across various domains, particularly in tasks requiring creativity, and thus offer a wide range of applications. This study evaluates the performance of LLMs, such as GPT-4, in generating creative writing through a comparison with text authored by humans. Specifically, it evaluates 100 creative writing pieces generated by GPT-4 and 100 texts authored by humans, using parameters such as fluency, flexibility, originality, elaboration, usefulness, and specific creative strategies. The findings reveal that GPT-4 is able to closely emulate the performance of human authors, producing high-quality and creative content. Despite inconsistencies among the evaluators, GPT-4 demonstrates significant potential to enhance human creativity and improve the quality of creative writing. However, the limitations inherent to the training data of GPT-4, including its dependence on factual and historical background information, indicate critical differences from human creativity.
The advent of artificial intelligence (AI), particularly Large Language Models (LLMs), is poised to bring about profound societal changes. Despite the risks associated with AI, such as the production of inaccurate information, labor market shifts, and the potential for AI to escape human control, ongoing regulatory efforts may not sufficiently curb its pervasive spread. The US federal government, in collaboration with key figures in the AI industry, has focused on the long-term risks of AI, without intending to stifle the industry’s growth. AI’s potential to automate aspects of writing implies that its inevitable introduction into educational settings will have immediate impacts on composition courses. Instances of students using AI tools like ChatGPT to write and submit assignments have been reported. Despite these concerns, universities are positioned to adapt to the changing environment and explore the potential benefits of AI in education. The relevance of AI writing technologies in language classes could be likened to the relevance of calculators in math classes 50 years ago, assisting humans with laborious aspects of writing. As Ted Underwood argues, AI demonstrates that writing takes place in “a multi-dimensional space in which a variety of writings, none of them original, blend and clash.” Over 50 years after Barthes’s “The Death of the Author,” we are confronted with “Death of an Author,” a work largely written by AI. Shakespeare’s work conveys the notion of artiginality, or ‘the workly character of the work,’ and after LLMs, it is compelling to understand artiginality in all forms of writing.
In the printed texts of early modern plays, scholars have observed a number of lines bracketed by a set of duplicate lines. In 1918, J. Dover Wilson called this type of textual error a “repetition bracket” and argued that it is evidence for the insertion of additional text. In 1930, W. W. Greg adduced pieces of evidence in early modern playhouse manuscripts in support of Wilson’s addition (or “plus”) hypothesis, but he also proposed an omission (or “minus”) hypothesis. However, Greg’s footnoted reference to a single instance in The Second Maiden’s Tragedy was his sole empirical evidence for the latter hypothesis. In this article, I examine Greg’s evidence and review fifty-one extant early modern playhouse manuscripts to argue that Greg’s omission hypothesis is untenable. Duplications in manuscripts are associated with false starts, marginal additions, or text on addition leaves. Based on a thorough study of these manuscripts, I conclude that repetition brackets in early printings are a strong sign of revision and not omission. Included in an appendix is a list of all omission and addition markings in extant manuscripts.
The construction of high-quality word embeddings is essential in natural language processing. In existing approaches that use a large text corpus, word embeddings learn only sequential patterns in the context; thus, accurate learning of the syntactic and semantic relationships between words is limited. Several methods have been proposed for constructing word embeddings using syntactic information. However, these methods are not trained on the semantic relationships between words in sentences or on external knowledge. In this paper, we present a method for improving word embeddings using symbolic graphs of external knowledge and of the syntactic and semantic-role relationships between words in sentences. The proposed model sequentially learns two symbolic graphs with different properties through a graph convolutional network (GCN) model. A new symbolic graph representation is generated to capture sentences grammatically and semantically. This graph representation includes comprehensive information combining dependency parsing and semantic role labeling. Subsequently, word embeddings are constructed through the GCN model. The same GCN model initializes the word representations created in the first step and trains the relationships of ConceptNet using the relationships between words. The proposed word embeddings outperform the baselines on benchmarks and extrinsic tasks.
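A small sketch of one ingredient described above: turning a sentence's dependency parse into a symbolic graph that a GCN could consume. The spaCy model name is an assumption, and the semantic-role and ConceptNet edges used in the paper are not shown here.

```python
# Build a directed dependency graph for one sentence (illustrative only).
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")
doc = nlp("The committee approved the proposal after a long debate.")

G = nx.DiGraph()
for token in doc:
    G.add_node(token.i, text=token.text)
    if token.head.i != token.i:                      # skip the root's self-reference
        G.add_edge(token.head.i, token.i, label=token.dep_)

print([(G.nodes[u]["text"], G.nodes[v]["text"], d["label"]) for u, v, d in G.edges(data=True)])
```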
A commonsense question answering (CSQA) system predicts the right answer based on a comprehensive understanding of the question. Previous research has developed models that use QA pairs, the corresponding evidence, or a knowledge graph as input. Each method executes QA tasks with representations from pre-trained language models. However, whether pre-trained language models fully comprehend the question remains debatable. In this study, adversarial attack experiments were conducted on question understanding. We examined the restrictions on the question-reasoning process of the pre-trained language model and then demonstrated the need for models to use the logical structure of abstract meaning representations (AMRs). Additionally, the experimental results demonstrated that the method performed best when the AMR graph was extended with ConceptNet. With this extension, our proposed method outperformed the baseline in diverse commonsense-reasoning QA tasks.
Unlike previous dialogue-based question-answering (QA) datasets, DREAM, a multiple-choice Dialogue-based REAding comprehension exaMination dataset, requires a deep understanding of dialogue. Many problems require multi-sentence reasoning, while others require commonsense reasoning. However, most pre-trained language models (PTLMs) do not consider commonsense. In addition, because the maximum number of tokens that a language model (LM) can handle is limited, the entire dialogue history cannot be included, and the resulting information loss has an adverse effect on performance. To address these problems, we propose a Dialogue-based QA model with Common-sense Reasoning (DQACR), a language model that exploits Semantic Search and continual learning. We used Semantic Search to compensate for the information lost from truncated dialogue, and we used Semantic Search and continual learning to improve the PTLM’s commonsense reasoning. Our model achieves an improvement of approximately 1.5% over the baseline method and can thus facilitate QA-related tasks. It contributes not only to dialogue-based QA tasks but also to other forms of QA datasets for future tasks.
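The semantic-search step above could look roughly like the sketch below, which retrieves the most relevant past utterances from a dialogue history that no longer fits in the model's input window. The sentence-transformers model name and the toy dialogue are assumptions, not the paper's exact setup.

```python
# Retrieve relevant utterances from a truncated dialogue history via semantic search.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

history = [
    "I finally adopted a dog last month.",
    "Work has been exhausting lately.",
    "She is a two-year-old beagle named Daisy.",
    "I might take a vacation in June.",
]
question = "What breed is the speaker's dog?"

hist_emb = encoder.encode(history, convert_to_tensor=True)
q_emb = encoder.encode([question], convert_to_tensor=True)

# Prepend the retrieved utterances to the truncated input to recover lost context.
hits = util.semantic_search(q_emb, hist_emb, top_k=2)[0]
retrieved = [history[h["corpus_id"]] for h in hits]
print(retrieved)
```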
PU-GEN: Enhancing generative commonsense reasoning for language models with human-centered knowledge
Generative commonsense reasoning refers to the ability of a language model to generate a sentence with a given concept-set based on compositional generalization and commonsense reasoning. In the CommonGen challenge, which evaluates the capability of generative commonsense reasoning, language models continue to exhibit low performance and struggle to leverage knowledge representations from humans. Therefore, we propose PU-GEN, which leverages human-centered knowledge in language models to enhance compositional generalization and commonsense reasoning in light of the human language generation process. To incorporate human-centered knowledge, PU-GEN reinterprets two linguistic philosophies from Wittgenstein: picture theory and use theory. First, we retrieve scene knowledge to reflect picture theory, such that a model can describe a general situation as if it were being painted. Second, we extend relational knowledge to consider use theory for understanding various contexts. PU-GEN demonstrates superior performance in qualitative and quantitative evaluations over baseline models on CommonGen and generates convincing evidence for CommonsenseQA. Moreover, it outperforms the state-of-the-art model used in the previous CommonGen challenge.
Recent pre-trained language models (PLMs) have achieved great success on many natural language processing tasks by learning linguistic features and contextualized sentence representations. Since the attributes captured in the stacked layers of PLMs are not clearly identified, straightforward approaches such as embedding the last layer are commonly preferred for deriving sentence representations from PLMs. This paper introduces an attention-based pooling strategy, which enables the model to preserve layer-wise signals captured in each layer and learn digested linguistic features for downstream tasks. The contrastive learning objective can adapt the layer-wise attention pooling to both unsupervised and supervised settings. This regularizes the anisotropic space of pre-trained embeddings, making it more uniform. We evaluate our model on standard semantic textual similarity (STS) and semantic search tasks. As a result, our method improves the performance of the contrastive-learned BERT_base baseline and its variants.
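A rough sketch of layer-wise attention pooling of the kind described above: learn a weight over every encoder layer's pooled state and combine them into one sentence embedding. The pooling granularity (mean over tokens) and dimensions are illustrative assumptions, not the paper's exact design.

```python
# Layer-wise attention pooling over all encoder layers (illustrative sketch).
import torch
import torch.nn as nn

class LayerAttentionPooling(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, all_hidden_states):
        # all_hidden_states: tuple of (batch, seq_len, hidden) tensors, one per layer
        layer_reps = torch.stack([h.mean(dim=1) for h in all_hidden_states], dim=1)  # (batch, layers, hidden)
        weights = torch.softmax(self.scorer(layer_reps).squeeze(-1), dim=-1)          # (batch, layers)
        return (weights.unsqueeze(-1) * layer_reps).sum(dim=1)                        # (batch, hidden)

# With a Hugging Face encoder run with output_hidden_states=True, outputs.hidden_states
# would be fed into this module and trained with a contrastive objective.
```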
Humans usually have conversations by making use of prior knowledge about a topic and background information about the people they are talking to. However, existing conversational agents and datasets do not consider such comprehensive information, and thus they are limited in generating utterances in which knowledge and persona are fused properly. To address this issue, we introduce a call For Customized conversation (FoCus) dataset, in which the customized answers are built with the user's persona and Wikipedia knowledge. To evaluate the abilities of pre-trained language models to make informative and customized utterances, we utilize BART and GPT-2 as well as transformer-based models. We assess their generation abilities with automatic scores and conduct human evaluations for qualitative results. We examine whether the model reflects adequate persona and knowledge with our two proposed sub-tasks, persona grounding (PG) and knowledge grounding (KG). Moreover, we show that the utterances of our data are constructed with the proper knowledge and persona through a grounding quality assessment.
Keywords: Speech & Natural Language Processing (SNLP)
The relationship between Shakespeare’s First Folio and early printings published in his lifetime has been a matter of dispute for centuries. A computer program that I have developed visualizes the fluctuating quality of textual correspondences between the Folio texts, Henry the Sixth, Part Two and Three, and the texts that have been suspected of being memorial reconstructions of the Folio, The First Part of the Contention and The True Tragedy of Richard Duke of York. The memorial reconstruction hypothesis assumes increased similarity between the two texts when an alleged actor–reporter is on the stage or speaking, and vice versa. The visualization of similarity between two texts, based on the Dice similarity metric, does not show a strong association between the fluctuation of similarity and actor–reporter factors, which challenges the memorial reconstruction hypothesis on statistical grounds. In addition, the distribution of line-by-line similarity scores suggests that scene division is a considerable explanatory factor for fluctuating similarity, which is not inexplicable, considering the practice of collaborative writing in the early modern playhouse.
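The line-level measure mentioned above, the Dice coefficient, can be computed as below; the tokenization and the sample line pair are simplified, hypothetical illustrations rather than data from the study.

```python
# Dice similarity between the word sets of two corresponding passages.
def dice_similarity(line_a: str, line_b: str) -> float:
    a, b = set(line_a.lower().split()), set(line_b.lower().split())
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

folio_line = "And in that hope I throw mine eyes to heaven"    # illustrative pair,
quarto_line = "And in that hope I throw my eyes to heaven"      # not from the corpus
print(round(dice_similarity(folio_line, quarto_line), 3))
```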
In this paper, we introduce a novel knowledge-based word-sense disambiguation (WSD) system. In particular, the main goal of our research is to find an effective way to filter out unnecessary information by using word similarity. For this, we adopt two methods in our WSD system. First, we propose a novel encoding method for word vector representation by considering the graphical semantic relationships from the lexical knowledge bases, and the word vector representation is utilized to determine the word similarity in our WSD system. Second, we present an effective method for extracting the contextual words from a text for analyzing an ambiguous word based on word similarity. The results demonstrate that the suggested methods significantly enhance the baseline WSD performance in all corpora. In particular, the performance on nouns is similar to those of the state-of-the-art knowledge-based WSD models, and the performance on verbs surpasses that of the existing knowledge-based WSD models.
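As a simple illustration of the similarity-based filtering idea above, the sketch below keeps only context words whose vectors are sufficiently similar to the target word's vector. The threshold and the cosine formulation are assumptions, not the paper's exact procedure.

```python
# Filter context words by cosine similarity to the ambiguous target word.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_context(target_vec: np.ndarray, context_words, embeddings, threshold: float = 0.35):
    """Keep only context words similar enough to the target; drop the rest as noise."""
    return [w for w in context_words
            if w in embeddings and cosine(target_vec, embeddings[w]) >= threshold]
```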
In this paper, we study the task of selecting the optimal response given a user and system utterance history in retrieval-based multi-turn dialog systems. Recently, pre-trained language models (e.g., BERT, RoBERTa, and ELECTRA) have shown significant improvements in various natural language processing tasks. This and similar response selection tasks can also be solved using such language models by formulating them as dialog–response binary classification tasks. Although existing works using this approach successfully obtained state-of-the-art results, we observe that language models trained in this manner tend to make predictions based on the relatedness of history and candidates, ignoring the sequential nature of multi-turn dialog systems. This suggests that the response selection task alone is insufficient for learning temporal dependencies between utterances. To this end, we propose utterance manipulation strategies (UMS) to address this problem. Specifically, UMS consist of several strategies (i.e., insertion, deletion, and search) that aid the response selection model in maintaining dialog coherence. Further, UMS are self-supervised methods that do not require additional annotation and thus can be easily incorporated into existing approaches. Extensive evaluation across multiple languages and models shows that UMS are highly effective in teaching dialog consistency, which leads to models pushing the state-of-the-art with significant margins on multiple public benchmark datasets.
Keywords: Conversational AI/Dialog Systems
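The self-supervised data construction behind such manipulation strategies could look roughly like the sketch below; the exact task formulations of insertion, deletion, and search in the paper may differ, so this only shows the general pattern of perturbing a dialog and deriving a position label.

```python
# Illustrative construction of utterance-manipulation training examples.
import random

def manipulate_dialog(dialog, foreign_utterances):
    """dialog: ordered utterances of one conversation; foreign_utterances: pool from other dialogs."""
    examples = {}

    # Insertion-style example: remove one utterance and ask where it belongs.
    pos = random.randrange(len(dialog))
    context = dialog[:pos] + dialog[pos + 1:]
    examples["insertion"] = {"context": context, "target": dialog[pos], "label": pos}

    # Deletion-style example: inject a foreign utterance and ask which turn should be removed.
    pos = random.randrange(len(dialog) + 1)
    corrupted = dialog[:pos] + [random.choice(foreign_utterances)] + dialog[pos:]
    examples["deletion"] = {"context": corrupted, "label": pos}

    # Search-style example: shuffle the turns and ask for the original position of one utterance.
    shuffled = dialog.copy()
    random.shuffle(shuffled)
    target = random.choice(dialog)
    examples["search"] = {"context": shuffled, "target": target, "label": dialog.index(target)}

    return examples
```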
CommonsenseQA is a task in which a correct answer is predicted through commonsense reasoning with pre-defined knowledge. Most previous works have aimed to improve performance with distributed representations without considering the process of predicting the answer from the semantic representation of the question. To shed light on the semantic interpretation of the question, we propose an AMR-ConceptNet-Pruned (ACP) graph. The ACP graph is pruned from a fully integrated graph encompassing the Abstract Meaning Representation (AMR) graph generated from the input question and an external commonsense knowledge graph, ConceptNet (CN). The ACP graph is then exploited to interpret the reasoning path as well as to predict the correct answer on the CommonsenseQA task. This paper presents the manner in which the commonsense reasoning process can be interpreted with the relations and concepts provided by the ACP graph. Moreover, ACP-based models are shown to outperform the baselines.
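A schematic sketch of the pruning step described above: merge the question's AMR graph with ConceptNet and keep only concepts within a few hops of the question's nodes. Node identifiers, the hop limit, and the use of networkx are illustrative assumptions, not the paper's construction.

```python
# Prune an integrated AMR + ConceptNet graph down to question-relevant concepts.
import networkx as nx

def build_acp_graph(amr_graph: nx.DiGraph, conceptnet: nx.DiGraph, max_hops: int = 2) -> nx.DiGraph:
    merged = nx.compose(amr_graph, conceptnet)
    keep = set()
    for seed in amr_graph.nodes:                     # concepts mentioned in the question's AMR parse
        if seed in merged:
            # keep ConceptNet neighbors within a small number of hops of the question concepts
            keep |= set(nx.single_source_shortest_path_length(merged, seed, cutoff=max_hops))
    return merged.subgraph(keep).copy()
```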
We focus on multi-turn response selection in a retrieval-based dialog system. In this paper, we utilize the powerful pre-trained language model Bi-directional Encoder Representations from Transformers (BERT) for a multi-turn dialog system and propose a highly effective post-training method on a domain-specific corpus. Although BERT is easily adapted to various NLP tasks and outperforms previous baselines on each task, it still has limitations if a task corpus is too focused on a certain domain. Post-training on a domain-specific corpus (e.g., Ubuntu Corpus) helps the model learn contextualized representations and words that do not appear in a general corpus (e.g., English Wikipedia). Experimental results show that our approach achieves a new state-of-the-art on two response selection benchmarks (i.e., Ubuntu Corpus V1 and Advising Corpus), with performance improvements of 5.9% and 6% on R@1.
Keywords: Response selection, Human computer dialog system, Spoken language processing
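A minimal sketch of one form of domain-adaptive post-training, continued masked language modeling on a domain corpus before fine-tuning for response selection. The file path, model name, and hyperparameters are placeholders, and the sketch shows only the MLM part, not necessarily every objective used in the paper.

```python
# Continue masked-language-model training on a domain corpus with Hugging Face Transformers.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# "ubuntu_domain_corpus.txt" is a placeholder path for the domain-specific text.
corpus = load_dataset("text", data_files={"train": "ubuntu_domain_corpus.txt"})["train"]
corpus = corpus.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
                    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="post_trained_bert", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()  # the post-trained encoder is then fine-tuned for response selection
```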
Since 1928, The First Part of the Contention and Richard Duke of York (printed separately in the 1590s) have been regarded as memorial reconstructions of two texts in the Folio edition of Shakespeare’s Comedies, Histories, and Tragedies (printed in 1623), where they are instead identified as Henry the Sixth, Part Two and Henry the Sixth, Part Three. Although recent scholarship has questioned the validity of the memorial reconstruction hypothesis, demonstrating aesthetic differences between the “bad quartos” and the Folio as a sign of distinctive authorial engagements, most reference works and critical editions of the Henry VI plays accept a variety of textual evidence in support of the memorial reconstruction hypothesis. The hypothesis assumes that mangled historical details should be attributed not to a playwright who consulted chronicle sources but to non-authorial agents who trusted their memory when reconstructing the Folio version. This article challenges the textual evidence for the memorial reconstruction hypothesis adduced by Peter Alexander and recent textual scholars, arguing that demonstrable verbal links between the suspect texts and chronicle sources in several passages unique to Contention or Duke of York substantiate the authorial consultation of chronicle sources.