We conduct research on the foundations and applications of approaches that make AI explainable and controllable.
Interpretable AI
We work to understand the internal mechanisms of deep neural networks (DNNs) and reveal how those mechanisms give rise to model behavior, especially behavior related to safety. We advance the frontier of mechanistic interpretability, probing language models and other DNNs. We analyze DNNs at multiple levels of granularity: layers, modules, attention heads, neurons, SAE features, and more.
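As a minimal sketch of what multi-granularity analysis can look like in practice (the model choice and module paths below are illustrative assumptions, not our actual pipeline), activations at different levels can be captured with PyTorch forward hooks:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: capture activations at several granularities with forward hooks.
# The model ("gpt2") and module paths are assumptions for illustration only.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Some modules (e.g., attention blocks) return tuples; keep the main tensor.
        activations[name] = output[0] if isinstance(output, tuple) else output
    return hook

# Layer-wise: a whole transformer block; module-wise: its MLP;
# attention-wise: its attention sub-module. Neuron-wise analysis would
# index into these tensors; SAE features would decompose them further.
handles = [
    model.transformer.h[5].register_forward_hook(make_hook("layer_5")),
    model.transformer.h[5].mlp.register_forward_hook(make_hook("layer_5_mlp")),
    model.transformer.h[5].attn.register_forward_hook(make_hook("layer_5_attn")),
]

with torch.no_grad():
    model(**tok("Interpretability starts with looking inside.", return_tensors="pt"))

for name, act in activations.items():
    print(name, tuple(act.shape))  # e.g., layer_5 (1, seq_len, hidden_dim)

for h in handles:
    h.remove()  # detach hooks when done
```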
Related works include:
Control of AI
We develop foundational methods to control DNNs. These methods span two types: (1) behavioral control, where we treat the model as a black box and intervene on, e.g., the prompts, data, and external memories; and (2) mechanistic control, where we intervene on the "steering wheels" inside the model, leveraging the knowledge gained from interpretability.
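As a toy illustration of mechanistic control (a sketch, not our published method: the layer index, the scale, and the random steering direction are all assumptions), one can nudge a model's residual stream along a chosen direction with a forward hook:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of activation steering on GPT-2. In practice the steering vector
# might be a difference of mean activations between contrastive prompts,
# or a direction found by a probe or an SAE; here it is random for brevity.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer = model.transformer.h[6]          # assumed intervention point
steer = torch.randn(model.config.n_embd)
steer = steer / steer.norm()

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steer        # nudge the residual stream
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = layer.register_forward_hook(steering_hook)
ids = tok("The weather today is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```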
Related works include:
- Distribution Prompting: Understanding the Expressivity of Language Models Through the Next-Token Distributions They Can Produce (2025)
- Sheaf Discovery with Joint Computation Graph Pruning and Flexible Granularity (2025)
- Plug and Play with Prompts: A Prompt Tuning Approach for Controlling Text Generation (2024)
Reasoning AI for Research and Education
As AI grows more capable, a theme of "machine teaching" starts to emerge alongside "machine learning". We are interested in enhancing AI for researchers and students. Such AI should be truthful, efficient at teaching, and self-evolving as new evidence emerges.
Related works include:
- Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning
- ACCORD: Closing the Commonsense Measurability Gap (2025)
- What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning (2025)
- LLM-Generated Black-box Explanations can be Adversarially Helpful (2024)
Methods and Applications of AI Agents
We make foundational methodological advances and explore critical application scenarios for AI agents driven by strong, general-purpose foundation models (including but not limited to language models and vision-language models). We consider problems with profound real-world impact, including cybersecurity, finance, sports, and education. We explore innovative architectures for these agents and develop benchmarks that rigorously evaluate their performance.
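As a schematic of the perceive-reason-act core shared by many such agents (the `llm` callable, the action format, and the toy tool below are hypothetical stand-ins, not a specific system of ours):

```python
# Sketch of a minimal tool-using agent loop. Any chat-completion API can
# play the role of `llm`; any domain tools can populate the registry.
from typing import Callable

def calculator(expression: str) -> str:
    """Toy tool: evaluate a small arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS: dict[str, Callable[[str], str]] = {"calculator": calculator}

def run_agent(llm: Callable[[list[dict]], str], task: str, max_steps: int = 5) -> str:
    """Loop: ask the model for an action; execute tools until it answers."""
    messages = [
        {"role": "system", "content":
         "Solve the task. To use a tool, reply 'TOOL:<name>:<input>'. "
         f"Available tools: {', '.join(TOOLS)}. Otherwise reply 'ANSWER:<text>'."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = llm(messages)
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        if reply.startswith("TOOL:"):
            _, name, tool_input = reply.split(":", 2)
            observation = TOOLS.get(name, lambda _: "unknown tool")(tool_input)
            messages.append({"role": "assistant", "content": reply})
            messages.append({"role": "user", "content": f"OBSERVATION: {observation}"})
    return "No answer within step budget."
```

The step budget and the plain-text action protocol are deliberate simplifications; real agents typically use structured tool-calling APIs, richer memory, and task-specific evaluation harnesses.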
Related works include:
Special thanks to our sponsors for supporting our research: