Yufan Zhuang

y5zhuang AT ucsd.edu
CSE 2232, 9500 Gilman Drive
Hey, I’m Yufan Zhuang 庄宇凡
I’m a final-year CS PhD student at UC San Diego advised by Jingbo Shang, working on making language models reason better and more agentic. I develop methods that push the boundaries of how AI systems understand and generate language—from helping them reason through complex problems and long-form content to enabling them to learn from any data modality in continuous space.
👋 I’m on the market for full-time AI research positions starting Dec 2025.
🔍 Reasoning in Continuous Space
Mixture of Inputs (arXiv’25) - Beyond Discrete Sampling
Exploring continuous mixture approaches that improve how language models reason through complex problems by operating in continuous vector space rather than over discrete tokens.
Vector-ICL (ICLR’25) - Cross-Modal In-Context Learning
Enabling models to learn from any data modality in-context through continuous vector representations, breaking down traditional barriers between text, images, and other data types.
MetaTree (TMLR’24) - Tabular Data Processing with Transformers
Training transformers via meta-learning to directly produce strong decision trees, bridging classical ML algorithms with modern neural approaches for superior generalization.
🤖 Agentic Learning & Long-Context
AgenticLU (ACL’25) - Agentic Long Context Learning
Improving long-context capabilities through intelligent agentic approaches that can dynamically reason through complex, extended content with strategic planning.
ViperVLMs (HuggingFace) - High-Quality Mamba-based Vision Language Models
Training efficient vision-language models that maintain high performance while being more computationally accessible.
WavSpA (NeurIPS’23 UniReps) - Adaptive Long Context
Extending context length by computing attention in wavelet space, making long-sequence processing more efficient.
📊 Trustworthy Model Evaluation
Deep Contamination (EMNLP’24) - Cross-lingual Data Contamination Study
Uncovering critical issues in how we evaluate contemporary LLMs, particularly around cross-lingual contamination that affects model reliability.
🔬 AI for Software Engineering & Sociology
Signal-Aware Code Understanding (ACM TOSEM’23) - Trustworthy AI for Source Code
Developing signal-aware AI models for software vulnerability detection that focus on learning task-relevant source code features rather than spurious correlations, improving model robustness and reliability.
Prediction-Preserving Input Minimization (ESEC/FSE’21) - Probing AI Model Understanding
Introducing the P2IM approach to systematically evaluate AI models’ signal awareness by reducing source code to the minimal snippets needed to maintain predictions, revealing models’ reliance on incorrect signals.
Computational Social Science (Annals AAG’22) - Machine Learning for Sociological Analysis
Applying machine learning approaches to decipher heterogeneous representations and images of Chinese communities in North America, bridging computational methods with sociological research.
Before my PhD, I worked at the IBM T. J. Watson Research Center as a research engineer, helping to enhance software engineering with the power of AI and vice versa. I received my MS in Data Science from Columbia University and my BSc in Applied Mathematics with a minor in CS (with First Class Honors) from The Hong Kong Polytechnic University.
news
Jun 14, 2025 | I’ve started my research internship on Agents at Apple AIML, Siri Team!
May 21, 2025 | Check out our latest work, Mixture of Inputs: a training-free method that improves reasoning, usable directly with vLLM!
May 15, 2025 | Our paper Agentic Long Context Understanding has been accepted at ACL 2025!
Jan 28, 2025 | Our paper Vector-ICL has been accepted at ICLR 2025!
Sep 01, 2024 | Our paper Data Contamination Can Cross Language Barriers has been published at EMNLP 2024!
selected publications
- arXiv | Mixture of Inputs: Beyond Discrete Sampling. arXiv preprint, 2025
- ACL | Self-Taught Agentic Long Context Understanding. Annual Meeting of the Association for Computational Linguistics, 2025
- ICLR | Vector-ICL: In-context Learning with Continuous Vector Representations. International Conference on Learning Representations, 2025
- EMNLP | Data Contamination Can Cross Language Barriers. Empirical Methods in Natural Language Processing, 2024
- TMLR | Learning a Decision Tree Algorithm with Transformers. Transactions on Machine Learning Research, 2024
- UniReps@NeurIPS | WavSpA: Wavelet Space Attention for Boosting Transformers’ Long Sequence Learning Ability. NeurIPS 1st UniReps Workshop, 2023
- TOSEM | Incorporating Signal Awareness in Source Code Modeling: An Application to Vulnerability Detection. ACM Transactions on Software Engineering and Methodology, 2023
- AAG | Sleeping Lion or Sick Man? Machine Learning Approaches to Deciphering Heterogeneous Images of Chinese in North America. Annals of the American Association of Geographers, 2022
- FSE | Probing Model Signal-Awareness via Prediction-Preserving Input Minimization. ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021