Yufan Zhuang

CSE@UC San Diego


y5zhuang AT ucsd.edu

CSE 2232, 9500 Gilman Drive

Hey, I’m Yufan Zhuang 庄宇凡

I’m a final-year CS PhD student at UC San Diego, advised by Jingbo Shang, working on making language models better reasoners and stronger agents. I develop methods that push the boundaries of how AI systems understand and generate language, from helping them reason through complex problems and long-form content to enabling them to learn from any data modality in continuous space.

👋 I’m on the market for full-time AI research positions starting Dec 2025.

🔍 Reasoning in Continuous Space

Mixture of Inputs (arXiv’25) - Beyond Discrete Sampling
A training-free decoding approach that helps language models reason through complex problems in continuous vector space, feeding back a probability-weighted mixture of token embeddings rather than committing to a single discrete sample.
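As a rough illustration (not the paper’s exact estimator), the core move can be sketched in a few lines of PyTorch. The `beta` knob below is a hypothetical simplification; the actual method derives its mixture from the output distribution itself rather than using a fixed weight:

```python
import torch
import torch.nn.functional as F

def mixed_input_embedding(logits: torch.Tensor,
                          embed_table: torch.Tensor,
                          beta: float = 0.5) -> torch.Tensor:
    """Blend the sampled token's embedding with the probability-weighted
    mixture of all token embeddings, yielding a continuous next input.

    logits:      (vocab_size,) next-token logits from the LM head
    embed_table: (vocab_size, d_model) the model's input embedding matrix
    beta:        hypothetical mixing weight between discrete and continuous
    """
    probs = F.softmax(logits, dim=-1)                  # output distribution
    sampled = torch.multinomial(probs, num_samples=1)  # usual discrete sample
    discrete = embed_table[sampled.item()]             # (d_model,)
    mixture = probs @ embed_table                      # expected embedding
    return beta * discrete + (1.0 - beta) * mixture
```

Because the change only touches what is fed back as the next input embedding, it slots into existing inference stacks without any retraining.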

Vector-ICL (ICLR’25) - Cross-Modal In-Context Learning
Enabling models to learn from any data modality in-context through continuous vector representations, breaking down traditional barriers between text, images, and other data types.
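The general recipe can be pictured with a short sketch; the linear projector, dimensions, and random stand-ins below are illustrative assumptions, not the paper’s exact configuration:

```python
import torch
import torch.nn as nn

class VectorProjector(nn.Module):
    """Lightweight projector mapping an external encoder's embedding
    (image, time series, text, ...) into the LLM's input-embedding space,
    so it can sit in-context like a soft token."""
    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.proj(z)

proj = VectorProjector(encoder_dim=512, llm_dim=4096)
z = torch.randn(1, 512)                  # stand-in: any modality's embedding
soft_token = proj(z).unsqueeze(1)        # (1, 1, 4096), in the LLM's space
prompt_embeds = torch.randn(1, 8, 4096)  # stand-in: embedded prompt tokens
inputs_embeds = torch.cat([prompt_embeds, soft_token], dim=1)
# Most Hugging Face causal LMs accept this via model(inputs_embeds=...).
```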

MetaTree (TMLR’24) - Tabular Data Processing with Transformers
Training transformers via meta-learning to directly produce strong decision trees, bridging classical ML algorithms with modern neural approaches for superior generalization.
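One way to picture the interface: the trained transformer looks at a (sub)dataset and proposes a split, and a plain recursive loop assembles the tree. A hypothetical sketch of that loop, with a toy stand-in for the learned model:

```python
import numpy as np

def build_tree(predict_split, X, y, depth=2):
    """Recursively assemble a decision tree from model-proposed splits.
    `predict_split` stands in for the trained transformer: it maps a
    (sub)dataset to a (feature, threshold) pair. Hypothetical interface."""
    if depth == 0 or len(np.unique(y)) <= 1:
        return {"leaf": int(np.bincount(y).argmax()) if len(y) else 0}
    feat, thresh = predict_split(X, y)
    mask = X[:, feat] <= thresh
    return {"feature": feat, "threshold": thresh,
            "left":  build_tree(predict_split, X[mask], y[mask], depth - 1),
            "right": build_tree(predict_split, X[~mask], y[~mask], depth - 1)}

# Toy stand-in for the learned splitter: median split on the most variable feature.
def toy_model(X, y):
    f = int(X.var(axis=0).argmax())
    return f, float(np.median(X[:, f]))

X, y = np.random.randn(100, 5), np.random.randint(0, 2, 100)
tree = build_tree(toy_model, X, y)
```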

🤖 Agentic Learning & Long-Context

AgenticLU (ACL’25) - Agentic Long Context Learning
A self-taught agentic framework that improves long-context understanding, letting models plan and reason step by step over complex, extended content.

ViperVLMs (HuggingFace) - High-Quality Mamba-based Vision Language Models
Training efficient vision-language models that maintain high performance while being more computationally accessible.

WavSpA (NeurIPS’23 UniReps) - Adaptive Long Context
Boosting Transformers’ long-sequence learning by computing attention in an adaptive wavelet space, making long-context processing more efficient.
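The shape of the idea, in a toy sketch: transform the sequence into wavelet space, attend over the coefficients, and transform back. A single-level Haar transform and one unlearned attention head stand in here for the paper’s adaptive, learnable wavelets:

```python
import torch
import torch.nn.functional as F

def haar_forward(x):
    """One level of a Haar wavelet transform along the sequence axis.
    x: (seq_len, d_model), seq_len even."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / 2 ** 0.5, (even - odd) / 2 ** 0.5

def haar_inverse(approx, detail):
    even = (approx + detail) / 2 ** 0.5
    odd = (approx - detail) / 2 ** 0.5
    return torch.stack([even, odd], dim=1).reshape(-1, approx.shape[-1])

x = torch.randn(128, 64)                      # (seq_len, d_model)
approx, detail = haar_forward(x)
coeffs = torch.cat([approx, detail], dim=0)   # attention acts in wavelet space
attn = F.softmax(coeffs @ coeffs.T / 64 ** 0.5, dim=-1) @ coeffs
y = haar_inverse(attn[:64], attn[64:])        # back to sequence space
```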

📊 Trustworthy Model Evaluation

Deep Contamination (EMNLP’24) - Cross-lingual Data Contamination Study
Showing that benchmark contamination can cross language barriers and evade standard detection methods, raising critical issues for how we evaluate contemporary LLMs.

🔬 AI for Software Engineering & Sociology

Signal-Aware Code Understanding (ACM TOSEM’23) - Trustworthy AI for Source Code
Developing signal-aware AI models for software vulnerability detection that focus on learning task-relevant source code features rather than spurious correlations, improving model robustness and reliability.

Prediction-Preserving Input Minimization (ESEC/FSE’21) - Probing AI Model Understanding
Introducing the P2IM approach, which systematically evaluates AI models’ signal awareness by reducing source code to the minimal snippet needed to maintain a prediction, revealing models’ reliance on incorrect signals.

Computational Social Science (Annals AAG’22) - Machine Learning for Sociological Analysis
Applying machine learning approaches to decipher heterogeneous representations and images of Chinese communities in North America, bridging computational methods with sociological research.


Prior to my PhD, I worked at IBM T. J. Watson Research Center as a research engineer, helping to enhance software engineering with the power of AI and vice versa. I received my MS in Data Science from Columbia University, and my BSc in Applied Mathematics with a minor in CS (with First Class Honors) from The Hong Kong Polytechnic University.

news

Jun 14, 2025 I’ve started my research internship on Agents at Apple AIML, Siri Team!
May 21, 2025 Check out our latest work, Mixture of Inputs: a training-free reasoning improvement that you can use directly with vLLM!
May 15, 2025 Our paper Agentic Long Context Understanding has been accepted at ACL 2025!
Jan 28, 2025 Our paper Vector-ICL has been accepted at ICLR 2025!
Sep 01, 2024 Our paper Data Contamination Can Cross Language Barriers has been published at EMNLP 2024!

selected publications

  1. arXiv
    Text Generation Beyond Discrete Token Sampling
    Yufan Zhuang, Liyuan Liu, Chandan Singh, and 2 more authors
    arXiv preprint arXiv:2505.14827, 2025
  2. ACL
    Self-Taught Agentic Long Context Understanding
    Yufan Zhuang, Xiaodong Yu, Jialian Wu, and 7 more authors
    Annual Meeting of the Association for Computational Linguistics, 2025
  3. ICLR
    Vector-ICL: In-context Learning with Continuous Vector Representations
    Yufan Zhuang, Chandan Singh, Liyuan Liu, and 2 more authors
    International Conference on Learning Representations, 2025
  4. EMNLP
    Data Contamination Can Cross Language Barriers
Feng Yao*, Yufan Zhuang*, Zihao Sun, and 3 more authors
    Empirical Methods in Natural Language Processing, 2024
  5. TMLR
    Learning a Decision Tree Algorithm with Transformers
    Yufan Zhuang, Liyuan Liu, Chandan Singh, and 2 more authors
    Transactions on Machine Learning Research, 2024
  6. UniReps@NeurIPS
    WavSpA: Wavelet Space Attention for Boosting Transformers’ Long Sequence Learning Ability
    Yufan Zhuang, Zihan Wang, Fangbo Tao, and 1 more author
    NeurIPS 1st UniReps Workshop, 2023
  7. TOSEM
    Incorporating Signal Awareness in Source Code Modeling: an Application to Vulnerability Detection
    Sahil Suneja, Yufan Zhuang, Yunhui Zheng, and 3 more authors
    ACM Transactions on Software Engineering and Methodology, 2023
  8. AAG
    Sleeping Lion or Sick Man? Machine Learning Approaches to Deciphering Heterogeneous Images of Chinese in North America
    Qiang Fu, Yufan Zhuang, Yushu Zhu, and 1 more author
    Annals of the American Association of Geographers, 2022
  9. FSE
    Probing model signal-awareness via prediction-preserving input minimization
Sahil Suneja*, Yunhui Zheng*, Yufan Zhuang*, and 2 more authors
    ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021
  10. BDR
    Agreeing to disagree: choosing among eight topic-modeling methods
    Qiang Fu, Yufan Zhuang, Jiaxin Gu, and 2 more authors
    Big Data Research, 2021