Ziniu Li


Research Scientist (Qingyun Talent Program)
Tencent Hunyuan Team

Email: liziniu1997@gmail.com

[Google Scholar]
[Twitter] [Zhihu]

About me

I hold a Ph.D. from The Chinese University of Hong Kong, Shenzhen, where my research focused on large-scale reinforcement learning training and its applications in large language models.

I was advised by Prof. Tom Luo, a prominent applied mathematician in optimization and signal processing. My academic lineage extends to Prof. John Tsitsiklis of MIT (my advisor's own advisor), who pioneered foundational reinforcement learning theory and co-introduced the actor-critic algorithm in 1999.

Experience

  • 05/2025 - 01/2026: Top Seed Intern @ Bytedance Seed, Beijing

  • 10/2021 - 08/2022: Intern @ Tencent AI Lab, Shenzhen

  • 07/2019 - 06/2022: Research Assistant @ Nanjing University, Nanjing

Recent Highlights

*: denotes equal contribution or alphabetical ordering.

Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
Ziniu Li, Congliang Chen, Tianyun Yang, Tian Ding, Ruoyu Sun, Ge Zhang, Wenhao Huang, Zhi-Quan Luo
arXiv:2509.25849

TL;DR: This work introduces a knapsack-based exploration framework for RL training of LLMs, unlocking their ability to solve hard tasks and expanding their performance frontier.

Preserving Diversity in Supervised Fine-tuning of Large Language Models
Ziniu Li, Congliang Chen, Tian Xu, Zeyu Qin, Jiancong Xiao, Zhi-Quan Luo, Ruoyu Sun
(The 13th International Conference on Learning Representations (ICLR), 2025)
(Best Paper Runner-up at NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability)

TL;DR: This work introduces a game-theoretic distribution matching method to address the diversity-reducing and knowledge-forgetting issues in SFT.

ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, Zhi-Quan Luo
(The 41st International Conference on Machine Learning (ICML), 2024)

TL;DR: This work lays the foundation for REINFORCE-style methods in LLM training and introduces ReMax, a method that is more computationally efficient than PPO.