Research Interests
Understanding issues with appropriate methods is my starting point, because there are still many issues to uncover. Oftentimes we are confined to existing methods. My interest lies in extending previous work with less explored methods, so that our conversations can be about the studies, not the methods.
Linguistic analysis with computational modeling is to make use of stochastic hypotheses, or models, to perform linguistic analysis, particularly in syntax-semantics and sociolinguistics. From Support Vector Machines to the latest language models with auto-regressive decoders, the past decades have shown the usefulness of stochastic methods in exploiting language patterns that no theory had captured. At the same time, I maintain that human language is not composed of patterns alone; if it were, we would be drifting further and further from the question of how we acquire a language. Thus, my interest is in applying stochastic methods, not to fault theories but to aid findings by thinking with data, in the same manner that language experiments and corpus analyses have aided the syntactic analysis of “marginal” cases.
Symbolic use of stochastic processes is to make extensive use of domain findings in systems built around stochastic modules whenever appropriate. From the emergence of Word2Vec and sequence-to-sequence models to GPTs and other commercial services, the past decade has been a process of generalization in processing textual data. From the perspective of modeling, solving general problems, even with massively costly methods, can be intriguing. From the perspective of system building, an uncontrolled model with enormous building and running costs but little guarantee is not ideal. The two axes of symbolic processing and stochastic processing should be pursued together in building systems that are general yet effective. Thus, my interest is in distilling symbolic wisdom into modern and older stochastic processing to build effective systems.
Making my neighborhood a slightly better place is the ultimate goal of my life. Wherever I find myself, I am in a society with problems. Certainly, I cannot lead revolutionary work that will change humankind. But if I can make the little society around me a slightly better place, I will find my life meaningful. Whether it is linguistics, language education, applying the fanciest technologies, sometimes fighting the tyranny of local powers, or sometimes just a few lines of code, my interest lies wherever my ability reaches. For in the end, what is “good” is this: to do justice, to love kindness, and to walk humbly with your god.
Publications
For the full bibliography, see my CV.
Understanding issues with appropriate methods
Lee, et al. (2021) A study of an analysis… is an analysis of the Rodong Paper of the Democratic People’s Republic of Korea, conducted in cooperation with Hyungjong, an expert on the North Korean regime. Specifically, the comparison covered articles from 2016 to 2019: the years 2016 and 2017 were marked by the presidency and parliamentary majority of a conservative party in the Republic of Korea, with some anti-communist theses, while the later two years were marked by the president’s impeachment and the succeeding liberal party, which led to the inter-Korean summits and ultimately the US-DPRK summit at Hanoi. Here, the analyses were backed with word embeddings to provide data-driven cues for an expert in North Korea studies, complementing previous findings.
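The kind of data-driven cue mentioned above can be illustrated with a minimal sketch: ranking a word’s nearest neighbors by cosine similarity in an embedding space. The vectors and vocabulary below are toy values of my own, not the paper’s actual embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-d "embeddings" (illustrative numbers only, not from the study).
embeddings = {
    "economy":  [0.9, 0.1, 0.2],
    "sanction": [0.8, 0.2, 0.3],
    "festival": [0.1, 0.9, 0.1],
}

def nearest(word):
    """Rank the other words by similarity to `word` -- the kind of
    data-driven cue an area expert could then inspect."""
    return sorted((w for w in embeddings if w != word),
                  key=lambda w: cosine(embeddings[word], embeddings[w]),
                  reverse=True)

print(nearest("economy"))  # → ['sanction', 'festival']
```

In practice the same query would run over real vectors trained on the newspaper corpus; the ranking itself is what the expert reads.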
Lee & Song (2022) The Korean Coronavirus Corpus is, in turn, an analysis of the press media in the Republic of Korea. As the COVID-19 pandemic faded, the Republic of Korea stood among the least damaged countries, with active testing yet no state-issued lockdowns. However, it was common for the media (particularly the press, but also certain sections of the new media on YouTube) to speak ill of the situation, particularly of the government’s handling of it, from the late import of vaccines to the promotion of vaccinations (one of the few points on which several of the same outlets diverged). The research gathered a massive number of articles published by press media from the left to the right and applied several techniques, combining corpus linguistics and natural language processing: sentiment analysis was performed on the concordances of a few selected keywords to estimate each outlet’s overall “stance”. With the results, the information-theoretic concept of entropy was used to demonstrate that media stance had little to no impact on actual health statistics, from vaccination rates to mortality rates.
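To make the entropy step concrete, here is a minimal sketch of Shannon entropy over per-article stance labels. The labels and outlet names are invented for illustration; this is not the paper’s pipeline, only the measure it draws on.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (in bits) of a sequence of categorical labels,
    e.g. per-article stance labels ('positive'/'neutral'/'negative')."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy stance labels for two hypothetical outlets (illustrative only).
outlet_a = ["negative"] * 8 + ["neutral"] * 1 + ["positive"] * 1  # one-sided
outlet_b = ["negative"] * 4 + ["neutral"] * 3 + ["positive"] * 3  # mixed

print(round(shannon_entropy(outlet_a), 3))  # → 0.922 (consistent stance)
print(round(shannon_entropy(outlet_b), 3))  # → 1.571 (mixed stance)
```

A low entropy signals a uniform editorial stance; comparing such figures against health statistics is how the paper argues for the stance’s lack of impact.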
Linguistic analysis with computational modeling
Lee, et al. (2021) DeepKLM is an early work intended to give linguists an easy entry point to token probabilities from masked language models such as BERT. Building on the studies often referred to as BERTology, the paper presented a Python library ready to be used in a Jupyter Notebook environment, requiring minimal variable setting to obtain surprisal metrics through an interactive interface. Two exemplar studies were also included to demonstrate surprisal-based analyses of the Korean language, along with a few issues to consider, including tokenization problems. Song, et al. (2021) Probing the unbound reflexives… and Lee, et al. (2022) Does the Korean… extend this by implementing BERT-based metrics for English and Korean, respectively, simulating language experiments for analyses of both language models and human language.
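The surprisal metric itself is simple to state. The sketch below is not DeepKLM’s API; it only illustrates what surprisal measures, with made-up probabilities standing in for a masked language model’s output.

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: -log2(p). Low probability -> high surprisal."""
    return -math.log2(prob)

# Hypothetical masked-LM probabilities for fillers of
# "The cat sat on the [MASK]." (illustrative numbers, not model output).
candidates = {"mat": 0.25, "sofa": 0.10, "moon": 0.001}

for word, p in sorted(candidates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {surprisal(p):.2f} bits")
# mat: 2.00 bits, sofa: 3.32 bits, moon: 9.97 bits
```

In the actual library, the probability comes from a model such as BERT predicting the masked token; the experiments then compare surprisal across minimally different sentences, as human reading-time studies do.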
Hong, et al. (2023) Reasoning ability… is a study slightly different from the surprisal-based studies above. Here, both encoder-based and decoder-based models were used to see whether various language models can handle the semantically intriguing phenomenon of tautology (e.g., The weather is getting warmer and warmer, but winter is winter; it was a cold December). The focus was on the inability of language models that are otherwise claimed to be good at catching patterns from sequential textual input. While such studies may only demonstrate a language model’s poor performance on low-frequency data, they may also provide arguments for the existence of lexical patterns underlying such linguistic phenomena.
Now working on…
Enhancing Large Language Model Inferences with Logical Rules. While large language models generally have good inference capacity, guiding them toward complex or controlled inferences remains a challenge. I am working on guiding such inferences with logical rules, potentially implementing the interpreter design pattern with large language model agents.
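One way such a combination could look is sketched below: logical connectives are interpreter-pattern nodes that evaluate themselves, while atomic yes/no judgments are delegated to an LLM agent. Every name here (`Atom`, `And`, `Not`, `call_llm`) is hypothetical, and `call_llm` is a hard-coded stub standing in for a real model call; this is an assumption-laden sketch, not the working system.

```python
from dataclasses import dataclass

def call_llm(question: str) -> bool:
    """Stand-in for an LLM agent answering a yes/no question.
    A real implementation would query a model; this stub is fixed."""
    answers = {"Is a penguin a bird?": True, "Can a penguin fly?": False}
    return answers[question]

@dataclass
class Atom:
    """Leaf node: an atomic fact judged by the LLM."""
    question: str
    def eval(self) -> bool:
        return call_llm(self.question)

@dataclass
class And:
    """Composite node: logical conjunction of two sub-expressions."""
    left: "Atom | And | Not"
    right: "Atom | And | Not"
    def eval(self) -> bool:
        return self.left.eval() and self.right.eval()

@dataclass
class Not:
    """Composite node: logical negation of a sub-expression."""
    inner: "Atom | And | Not"
    def eval(self) -> bool:
        return not self.inner.eval()

# "A penguin is a bird, and it is not the case that a penguin can fly."
rule = And(Atom("Is a penguin a bird?"),
           Not(Atom("Can a penguin fly?")))
print(rule.eval())  # → True
```

The design choice is that the rule structure stays symbolic and auditable, while only the atomic judgments are stochastic; swapping the stub for a real agent leaves the logical layer untouched.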