Publications
My research focuses on reinforcement learning, large language model post-training, imitation learning, and optimization.
* indicates equal contribution or alphabetical ordering where applicable. See also Google Scholar.
Publications by year
2026
-
Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning. The 43rd International Conference on Machine Learning (ICML), 2026
-
The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL. The 43rd International Conference on Machine Learning (ICML), 2026
-
Trust Region Masking for Long-Horizon LLM Reinforcement Learning. The 43rd International Conference on Machine Learning (ICML), 2026
-
OnePO: Direct One-stage Policy Optimization for SFT-free Domain Adaptation. The 43rd International Conference on Machine Learning (ICML), 2026
-
Knapsack RL: Compute-Efficient Reinforcement Learning via Heterogeneous Rollout Allocation. The 43rd International Conference on Machine Learning (ICML), 2026
-
TreePO: Enhancing Policy Efficacy and Inference Efficiency with Tree Modeling. The 43rd International Conference on Machine Learning (ICML), 2026
-
Non-Adversarial Imitation Learning Provably Free of Compounding Errors: The Value Flow Mechanism. The 43rd International Conference on Machine Learning (ICML), 2026
-
SpeechJudge: Towards Human-Level Judgment for Speech Naturalness. The 14th International Conference on Learning Representations (ICLR), 2026
-
Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward. The 14th International Conference on Learning Representations (ICLR), 2026
-
A Survey on Large Language Models for Mathematical Reasoning. ACM Computing Surveys (CSUR), 2026
-
Knowledge Index of Noah's Ark. arXiv:2606.05104
-
Schedule-Level Shared-Prefix Reuse for LLM RL Training. arXiv:2606.01143
-
Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO. arXiv:2605.04077
-
Do Phone-Use Agents Respect Your Privacy? arXiv:2604.00986
-
Off-Policy Value-Based Reinforcement Learning for Large Language Models. arXiv:2603.23355
-
Understanding Adversarial Imitation Learning in Small Sample Regime: A Stage-coupled Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2026
-
Seed1.8 Model Card: Towards Generalized Real-World Agency. arXiv:2603.20633
2025
-
Dynamic Vocabulary Pruning: Stable LLM-RL by Taming the Tail. arXiv:2512.23087
-
Scaling Latent Reasoning via Looped Language Models. arXiv:2510.25741
-
ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling. arXiv:2510.27610
-
Teaching Language Models to Reason with Tools. Conference on Neural Information Processing Systems (NeurIPS) 39, 2025
-
On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization. Accepted by Journal of the American Statistical Association (JASA), 2025
-
Self-Evolving Critique Abilities in Large Language Models. Conference on Language Modeling (COLM), 2025
-
Spectral Policy Optimization: Coloring your Incorrect Reasoning in GRPO. Transactions on Machine Learning Research (TMLR), 2025
-
Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment. The 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025
-
Controlling Large Language Model with Latent Actions. The 42nd International Conference on Machine Learning (ICML), 2025
-
Adam-mini: Use Fewer Learning Rates To Gain More. The 13th International Conference on Learning Representations (ICLR), 2025
-
Preserving Diversity in Supervised Fine-tuning of Large Language Models. The 13th International Conference on Learning Representations (ICLR), 2025 🏆 Best Paper Runner-up at NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning
-
Understanding and Mitigating Hallucination in Large Vision-Language Models via Modular Attribution and Intervention. The 13th International Conference on Learning Representations (ICLR), 2025
2024
-
Pruning for Robust Concept Erasing in Diffusion Models. NeurIPS Workshop on Safe Generative AI, 2024
-
Unlocking Black-Box Prompt Tuning Efficiency via Zeroth-Order Optimization. The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Findings), 2024
-
Sensing Jamming Strategy from Limited Observations: An Imitation Learning Perspective. IEEE Transactions on Signal Processing (TSP)
-
ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models. The 41st International Conference on Machine Learning (ICML), 2024
-
Why Transformers Need Adam: A Hessian Perspective. Conference on Neural Information Processing Systems (NeurIPS) 38, 2024
-
When is RL better than DPO in RLHF? A Representation and Optimization Perspective. The 12th International Conference on Learning Representations (ICLR) (Tiny Paper Track), 2024 🏆 Oral presentation, with an early version at arXiv:2312.10584
2023
-
Imitation Learning from Imperfection: Theoretical Justifications and Algorithms. Conference on Neural Information Processing Systems (NeurIPS) 37, 2023 🏆 Spotlight presentation
-
Provably Efficient Adversarial Imitation Learning with Unknown Transitions. The 39th Conference on Uncertainty in Artificial Intelligence (UAI), 2023 🏆 Oral presentation, with an early version at arXiv:2106.10424v2
-
Deploying Offline Reinforcement Learning with Human Feedback. arXiv:2303.07046
2022
-
Rethinking ValueDice: Does It Really Improve Performance? The 10th International Conference on Learning Representations (ICLR) (Blog Track), 2022
-
HyperDQN: A Randomized Exploration Method for Deep Reinforcement Learning. The 10th International Conference on Learning Representations (ICLR), 2022 🏆 Oral presentation at Workshop on Ecological Theory of Reinforcement Learning at NeurIPS, 2021
2021
-
A Concise Introduction to Imitation Learning. Available online
-
Error Bounds of Imitating Policies and Environments for Reinforcement Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
2020
-
Error Bounds of Imitating Policies and Environments. Conference on Neural Information Processing Systems 34 (NeurIPS), 2020
-
Efficient Exploration by Novelty-pursuit. The 2nd International Conference on Distributed Artificial Intelligence (DAI), 2020
-
Self-Guided Evolution Strategies with Historical Estimated Gradients. The 29th International Joint Conference on Artificial Intelligence (IJCAI), 2020