Yufan Zhuang

CSE@UC San Diego


y5zhuang AT ucsd.edu

CSE 2232, 9500 Gilman Drive

Hey, I’m Yufan Zhuang 庄宇凡

I’m a final-year CS PhD student at UC San Diego, advised by Jingbo Shang, working on making language models better reasoners and stronger agents. I develop methods that push the boundaries of how AI systems understand and generate language, from helping them reason through complex problems and long-form content to enabling them to learn from any data modality in continuous space.

👋 I’m on the market for full-time AI research positions starting Dec 2025.

🔍 Reasoning in Continuous Space

Mixture of Inputs (arXiv’25) - Beyond Discrete Sampling
A training-free decoding approach that helps language models reason through complex problems in continuous vector space, feeding back a probability-weighted mixture of token embeddings rather than committing to a single discrete sample.
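As a rough illustration (not the paper’s exact estimator), the core move can be sketched in a few lines of PyTorch. The `beta` knob below is a hypothetical simplification; the actual method derives its mixture from the output distribution itself rather than using a fixed weight:

```python
import torch
import torch.nn.functional as F

def mixed_input_embedding(logits: torch.Tensor,
                          embed_table: torch.Tensor,
                          beta: float = 0.5) -> torch.Tensor:
    """Blend the sampled token's embedding with the probability-weighted
    mixture of all token embeddings, yielding a continuous next input.

    logits:      (vocab_size,) next-token logits from the LM head
    embed_table: (vocab_size, d_model) the model's input embedding matrix
    beta:        hypothetical mixing weight between discrete and continuous
    """
    probs = F.softmax(logits, dim=-1)                  # output distribution
    sampled = torch.multinomial(probs, num_samples=1)  # usual discrete sample
    discrete = embed_table[sampled.item()]             # (d_model,)
    mixture = probs @ embed_table                      # expected embedding
    return beta * discrete + (1.0 - beta) * mixture
```

Because the change only touches what is fed back as the next input embedding, it slots into existing inference stacks without any retraining.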

Vector-ICL (ICLR’25) - Cross-Modal In-Context Learning
Enabling models to learn from any data modality in-context through continuous vector representations, breaking down traditional barriers between text, images, and other data types.
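The general recipe can be pictured with a short sketch; the linear projector, dimensions, and random stand-ins below are illustrative assumptions, not the paper’s exact configuration:

```python
import torch
import torch.nn as nn

class VectorProjector(nn.Module):
    """Lightweight projector mapping an external encoder's embedding
    (image, time series, text, ...) into the LLM's input-embedding space,
    so it can sit in-context like a soft token."""
    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.proj(z)

proj = VectorProjector(encoder_dim=512, llm_dim=4096)
z = torch.randn(1, 512)                  # stand-in: any modality's embedding
soft_token = proj(z).unsqueeze(1)        # (1, 1, 4096), in the LLM's space
prompt_embeds = torch.randn(1, 8, 4096)  # stand-in: embedded prompt tokens
inputs_embeds = torch.cat([prompt_embeds, soft_token], dim=1)
# Most Hugging Face causal LMs accept this via model(inputs_embeds=...).
```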

MetaTree (TMLR’24) - Tabular Data Processing with Transformers
Training transformers via meta-learning to directly produce strong decision trees, bridging classical ML algorithms with modern neural approaches for superior generalization.
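One way to picture the interface: the trained transformer looks at a (sub)dataset and proposes a split, and a plain recursive loop assembles the tree. A hypothetical sketch of that loop, with a toy stand-in for the learned model:

```python
import numpy as np

def build_tree(predict_split, X, y, depth=2):
    """Recursively assemble a decision tree from model-proposed splits.
    `predict_split` stands in for the trained transformer: it maps a
    (sub)dataset to a (feature, threshold) pair. Hypothetical interface."""
    if depth == 0 or len(np.unique(y)) <= 1:
        return {"leaf": int(np.bincount(y).argmax()) if len(y) else 0}
    feat, thresh = predict_split(X, y)
    mask = X[:, feat] <= thresh
    return {"feature": feat, "threshold": thresh,
            "left":  build_tree(predict_split, X[mask], y[mask], depth - 1),
            "right": build_tree(predict_split, X[~mask], y[~mask], depth - 1)}

# Toy stand-in for the learned splitter: median split on the most variable feature.
def toy_model(X, y):
    f = int(X.var(axis=0).argmax())
    return f, float(np.median(X[:, f]))

X, y = np.random.randn(100, 5), np.random.randint(0, 2, 100)
tree = build_tree(toy_model, X, y)
```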

🤖 Agentic Learning & Long-Context

AgenticLU (ACL’25) - Agentic Long Context Learning
A self-taught agentic framework that improves long-context understanding, letting models plan and reason step by step over complex, extended content.

ViperVLMs (HuggingFace) - High-Quality Mamba-based Vision Language Models
Training efficient vision-language models that maintain high performance while being more computationally accessible.

WavSpA (NeurIPS’23 UniReps) - Adaptive Long Context
Boosting Transformers’ long-sequence learning by computing attention in an adaptive wavelet space, making long-context processing more efficient.
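The shape of the idea, in a toy sketch: transform the sequence into wavelet space, attend over the coefficients, and transform back. A single-level Haar transform and one unlearned attention head stand in here for the paper’s adaptive, learnable wavelets:

```python
import torch
import torch.nn.functional as F

def haar_forward(x):
    """One level of a Haar wavelet transform along the sequence axis.
    x: (seq_len, d_model), seq_len even."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / 2 ** 0.5, (even - odd) / 2 ** 0.5

def haar_inverse(approx, detail):
    even = (approx + detail) / 2 ** 0.5
    odd = (approx - detail) / 2 ** 0.5
    return torch.stack([even, odd], dim=1).reshape(-1, approx.shape[-1])

x = torch.randn(128, 64)                      # (seq_len, d_model)
approx, detail = haar_forward(x)
coeffs = torch.cat([approx, detail], dim=0)   # attention acts in wavelet space
attn = F.softmax(coeffs @ coeffs.T / 64 ** 0.5, dim=-1) @ coeffs
y = haar_inverse(attn[:64], attn[64:])        # back to sequence space
```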

📊 Trustworthy Model Evaluation

Deep Contamination (EMNLP’24) - Cross-lingual Data Contamination Study
Showing that benchmark contamination can cross language barriers and evade standard detection methods, raising critical issues for how we evaluate contemporary LLMs.

🔬 AI for Software Engineering & Sociology

Signal-Aware Code Understanding (ACM TOSEM’23) - Trustworthy AI for Source Code
Developing signal-aware AI models for software vulnerability detection that focus on learning task-relevant source code features rather than spurious correlations, improving model robustness and reliability.

Prediction-Preserving Input Minimization (ESEC/FSE’21) - Probing AI Model Understanding
Introducing the P2IM approach, which systematically evaluates AI models’ signal awareness by reducing source code to the minimal snippet needed to maintain a prediction, revealing models’ reliance on incorrect signals.

Computational Social Science (Annals AAG’22) - Machine Learning for Sociological Analysis
Applying machine learning approaches to decipher heterogeneous representations and images of Chinese communities in North America, bridging computational methods with sociological research.


Prior to my PhD, I worked at IBM T. J. Watson Research Center as a research engineer, helping to enhance software engineering with the power of AI and vice versa. I received my MS in Data Science from Columbia University, and my BSc in Applied Mathematics with a minor in CS (with First Class Honors) from The Hong Kong Polytechnic University.

news

Jun 14, 2025 I’ve started my research internship on Agents at Apple AIML, Siri Team!
May 21, 2025 Check out our latest work, Mixture of Inputs: a training-free reasoning improvement that you can use directly with vLLM!
May 15, 2025 Our paper Agentic Long Context Understanding has been accepted at ACL 2025!
Jan 28, 2025 Our paper Vector-ICL has been accepted at ICLR 2025!
Sep 01, 2024 Our paper Data Contamination Can Cross Language Barriers has been published at EMNLP 2024!

selected publications

  1. arXiv
    Text Generation Beyond Discrete Token Sampling
    Yufan Zhuang, Liyuan Liu, Chandan Singh, and 2 more authors
    arXiv preprint arXiv:2505.14827, 2025
  2. ACL
    Self-Taught Agentic Long Context Understanding
    Yufan Zhuang, Xiaodong Yu, Jialian Wu, and 7 more authors
    Annual Meeting of the Association for Computational Linguistics, 2025
  3. ICLR
    Vector-ICL: In-context Learning with Continuous Vector Representations
    Yufan Zhuang, Chandan Singh, Liyuan Liu, and 2 more authors
    International Conference on Learning Representations, 2025
  4. EMNLP
    Data Contamination Can Cross Language Barriers
Feng Yao*, Yufan Zhuang*, Zihao Sun, and 3 more authors
    Empirical Methods in Natural Language Processing, 2024
  5. TMLR
    Learning a Decision Tree Algorithm with Transformers
    Yufan Zhuang, Liyuan Liu, Chandan Singh, and 2 more authors
    Transactions on Machine Learning Research, 2024
  6. UniReps@NeurIPS
    WavSpA: Wavelet Space Attention for Boosting Transformers’ Long Sequence Learning Ability
    Yufan Zhuang, Zihan Wang, Fangbo Tao, and 1 more author
    NeurIPS 1st UniReps Workshop, 2023
  7. TOSEM
    Incorporating Signal Awareness in Source Code Modeling: an Application to Vulnerability Detection
    Sahil Suneja, Yufan Zhuang, Yunhui Zheng, and 3 more authors
    ACM Transactions on Software Engineering and Methodology, 2023
  8. AAG
    Sleeping Lion or Sick Man? Machine Learning Approaches to Deciphering Heterogeneous Images of Chinese in North America
    Qiang Fu, Yufan Zhuang, Yushu Zhu, and 1 more author
    Annals of the American Association of Geographers, 2022
  9. FSE
    Probing model signal-awareness via prediction-preserving input minimization
Sahil Suneja*, Yunhui Zheng*, Yufan Zhuang*, and 2 more authors
    ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021
  10. BDR
    Agreeing to disagree: choosing among eight topic-modeling methods
    Qiang Fu, Yufan Zhuang, Jiaxin Gu, and 2 more authors
    Big Data Research, 2021