Our current research interests are as follows.
String Pattern Matching Algorithms
- Formal Interpretation of Practical Regex Features: Practical regexes often incorporate advanced features, such as the counting operation, look-around, and back-references, that are not accounted for in the classical regular expression model. Some of these features even allow regexes to express non-regular languages (a back-reference sketch follows this list). We formalize these extensions within the framework of automata and formal language theory and develop efficient algorithms for them.
- Parikh Matrix Equivalence: The M-equivalence test checks whether two strings are equivalent by comparing their Parikh matrices, which record the numbers of occurrences of letters and of certain ordered scattered subwords in the strings. This work involves characterizing M-equivalence classes as well as designing algorithms for efficient matrix computation and comparison (see the Parikh matrix sketch below).
- Simon’s Congruence Matching: Given an integer k, two strings are Simon's k-congruent if their sets of subsequences of length at most k are equal. We study pattern matching problems under Simon's congruence, where two strings match if they are Simon's k-congruent. We are also interested in the approximate version of the matching problem, as well as in finding a string inside a given language that is Simon's k-congruent to a given string (a brute-force illustration appears below).
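As a concrete illustration of why these practical features matter, the Python sketch below (our own toy example, not taken from any specific project) uses a single back-reference to match the copy language { ww : w in {a,b}+ }, which is not regular and not even context-free:

```python
import re

# A back-reference (\1) forces an exact repeat of the captured group.
# The resulting "copy language" { ww : w in {a,b}+ } is not regular,
# so this single practical feature already exceeds the classical model.
COPY = re.compile(r"([ab]+)\1")

for s in ["abab", "aabaab", "aba", "abba"]:
    print(s, bool(COPY.fullmatch(s)))
# abab   -> True  (w = "ab")
# aabaab -> True  (w = "aab")
# aba    -> False
# abba   -> False
```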
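The following minimal sketch, assuming the standard ordered-alphabet definition of Parikh matrices, computes the matrix of a word and tests M-equivalence; the example words are our own:

```python
import numpy as np

# Parikh matrix over an ordered alphabet a1 < ... < ak: each letter a_q
# maps to the (k+1)x(k+1) identity matrix plus an extra 1 at position
# (q, q+1), and a word maps to the product of its letters' matrices.
# Entry (i, j+1) then counts the scattered subword a_i a_{i+1} ... a_j.
def parikh_matrix(word, alphabet="ab"):
    index = {c: i for i, c in enumerate(alphabet)}
    k = len(alphabet)
    m = np.eye(k + 1, dtype=int)
    for c in word:
        letter = np.eye(k + 1, dtype=int)
        letter[index[c], index[c] + 1] = 1
        m = m @ letter
    return m

def m_equivalent(u, v, alphabet="ab"):
    return np.array_equal(parikh_matrix(u, alphabet),
                          parikh_matrix(v, alphabet))

print(m_equivalent("ab", "ba"))      # False: "ab" contains subword ab once, "ba" never
print(m_equivalent("baab", "abba"))  # True: same counts of a, b, and subword ab
```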
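For intuition, here is a brute-force check of Simon's k-congruence. It enumerates all subsequences and is exponential in the worst case, whereas the algorithms we study aim for efficiency; all names and example strings are illustrative:

```python
from itertools import combinations

def subseq_profile(w, k):
    """All distinct subsequences of w of length at most k (brute force)."""
    return {"".join(w[i] for i in idxs)
            for n in range(1, k + 1)
            for idxs in combinations(range(len(w)), n)}

def simon_congruent(u, v, k):
    """u ~_k v iff u and v have the same subsequences of length <= k."""
    return subseq_profile(u, k) == subseq_profile(v, k)

print(simon_congruent("abab", "abba", 2))  # True:  same subsequences up to length 2
print(simon_congruent("abab", "abba", 3))  # False: "bba" occurs only in "abba"
```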
Formal Grammars for Deep Learning
- PCFG-based Parsing Technique for Data Augmentation: We use probabilistic context-free grammars (PCFGs) to generate diverse, syntactically correct variations of input data, enlarging the dataset and improving model generalization (a toy sampler appears after this list).
- Analysis and Interpretation of Neural Networks via Probabilistic Automata: We employ probabilistic automata to model, understand, and interpret the behavior and decisions of neural networks, yielding insights into their internal processes and improving their transparency and explainability (see the sketch below).
- Neuro-Symbolic AI for Logical Reasoning: We combine neural networks with symbolic reasoning to enhance decision-making and problem-solving, bridging the gap between statistical learning and logical inference. Leveraging the strengths of both approaches lets us tackle complex reasoning tasks with more robust and interpretable solutions.
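A minimal sketch of PCFG-based generation, with a made-up toy grammar (the symbols, words, and probabilities are purely illustrative, not an actual augmentation grammar):

```python
import random

# A toy PCFG as a dict: nonterminal -> list of (probability, rhs) pairs.
PCFG = {
    "S":   [(1.0, ["NP", "VP"])],
    "NP":  [(0.6, ["Det", "N"]), (0.4, ["N"])],
    "VP":  [(0.7, ["V", "NP"]), (0.3, ["V"])],
    "Det": [(1.0, ["the"])],
    "N":   [(0.5, ["model"]), (0.5, ["dataset"])],
    "V":   [(0.5, ["augments"]), (0.5, ["parses"])],
}

def sample(symbol="S"):
    """Sample a sentence by expanding nonterminals top-down."""
    if symbol not in PCFG:  # terminal symbol
        return [symbol]
    probs, rules = zip(*PCFG[symbol])
    rhs = random.choices(rules, weights=probs)[0]
    return [tok for part in rhs for tok in sample(part)]

for _ in range(3):
    print(" ".join(sample()))  # e.g. "the dataset parses the model"
```

Every sampled sentence is syntactically valid by construction, which is what makes grammar-based generation attractive for augmentation.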
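A minimal sketch of a probabilistic (weighted) automaton assigning a probability to a string via matrix products; automata of this kind are commonly extracted from trained networks as interpretable surrogates. The matrices below are made up for illustration:

```python
import numpy as np

# Two-state probabilistic automaton: the weight of a string c1...cn is
# initial @ M[c1] @ ... @ M[cn] @ final.
initial = np.array([1.0, 0.0])
final = np.array([0.0, 1.0])
M = {
    "a": np.array([[0.2, 0.8],
                   [0.5, 0.5]]),
    "b": np.array([[0.9, 0.1],
                   [0.3, 0.7]]),
}

def string_probability(w):
    v = initial
    for c in w:
        v = v @ M[c]
    return float(v @ final)

print(string_probability("ab"))  # 0.58
```

Inspecting the states and transition weights of such a surrogate gives a finite, human-readable summary of the network's behavior on sequences.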
Deep Learning for Software Codes
- Code Time Complexity Prediction: We predict the time complexity of code by analyzing its structure, identifying key factors such as loops and recursion, and expressing the runtime in Big-O notation (a structural-feature sketch follows this list).
- Small-scale Code LLMs: We develop small-scale code LLMs that run on a single GPU with typical memory capacity. We aim to build models that perform well on various code-related tasks at a small model size, using techniques such as model merging and instruction tuning (a weight-merging sketch appears below).
- Natural Language Code Search: NL-code search retrieves code snippets, such as function bodies, from a code base using natural language queries. It combines techniques from NLP and code analysis to understand both the query and the code (a toy retrieval example follows this list).
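As a toy illustration of structural features for complexity prediction, the sketch below (a crude heuristic, not our actual predictor) measures the maximum loop-nesting depth of Python code with the standard ast module:

```python
import ast

def max_loop_depth(source):
    """Maximum nesting depth of for/while loops: a crude structural
    feature suggesting O(n^depth) for simple counting loops. A real
    predictor must also handle recursion, loop bounds, library calls, etc."""
    def depth(node, d=0):
        if isinstance(node, (ast.For, ast.While)):
            d += 1
        return max([d] + [depth(child, d) for child in ast.iter_child_nodes(node)])
    return depth(ast.parse(source))

code = """
for i in range(n):
    for j in range(n):
        total += a[i][j]
"""
print(max_loop_depth(code))  # 2 -> heuristic guess: O(n^2)
```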
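A minimal sketch of one merging technique, uniform weight interpolation of two same-architecture checkpoints in PyTorch; the function and variable names are hypothetical, and practical merging methods are more involved:

```python
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Linear weight interpolation of two checkpoints that share an
    architecture; alpha=0.5 gives uniform averaging."""
    assert sd_a.keys() == sd_b.keys()
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# Hypothetical usage with two fine-tuned copies of the same base model:
# merged = merge_state_dicts(model_a.state_dict(), model_b.state_dict())
# model.load_state_dict(merged)
```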
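A toy retrieval example using bag-of-words cosine similarity as a stand-in for the learned encoders actually used in NL-code search; the snippets and tokenization are illustrative:

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercased word tokens, splitting camelCase before lowering."""
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return re.findall(r"[a-zA-Z]+", text.lower())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def search(query, snippets):
    """Rank code snippets by similarity to a natural language query."""
    q = Counter(tokens(query))
    return sorted(snippets, key=lambda s: cosine(q, Counter(tokens(s))),
                  reverse=True)

snippets = [
    'def bsearch(a, x):\n    """Binary search for x in a sorted array."""',
    'def qsort(a):\n    """Sort an array with quicksort."""',
]
print(search("search a sorted array", snippets)[0])  # the bsearch snippet
```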
Deep Learning for NLP
- Hate Speech Detection: We develop algorithms and AI models that automatically identify and categorize language expressing hatred or prejudice toward specific individuals or groups, aiding the moderation of online content and fostering a safer digital environment (a baseline classifier sketch follows this list).
- Few-Shot Text Classification using Self-Training: We leverage a small labeled dataset together with a larger unlabeled dataset, iteratively pseudo-labeling confident predictions and retraining the model to improve classification performance (see the self-training loop below).
- Machine-Generated Text Detection: Machine-generated text detection identifies text produced by AI models rather than humans, which is crucial for combating fake news and misinformation. It analyzes linguistic patterns and inconsistencies to distinguish human-written from machine-generated content.
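A minimal baseline sketch using scikit-learn's TF-IDF features and logistic regression; the inline dataset is made up, and this linear baseline only stands in for the models actually studied:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative dataset; label 1 marks hateful text.
texts = ["you are wonderful", "I despise people like you",
         "have a great day", "those people are vermin"]
labels = [0, 1, 0, 1]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["people like you are great"]))
```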
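A sketch of the classic self-training loop, assuming numeric feature matrices (NumPy arrays) and integer labels 0..C-1; the threshold, round count, and model choice are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, rounds=5):
    """Train on the labeled pool, then repeatedly promote unlabeled
    examples whose predicted-class probability exceeds `threshold`
    into the pool with their pseudo-labels, and retrain."""
    X_pool, y_pool = X_lab.copy(), y_lab.copy()
    unlab = X_unlab.copy()
    model = LogisticRegression()
    for _ in range(rounds):
        model.fit(X_pool, y_pool)
        if len(unlab) == 0:
            break
        proba = model.predict_proba(unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        X_pool = np.vstack([X_pool, unlab[confident]])
        y_pool = np.concatenate([y_pool, proba[confident].argmax(axis=1)])
        unlab = unlab[~confident]
    return model
```

The confidence threshold controls the trade-off between how much unlabeled data is absorbed and how much pseudo-label noise is introduced.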