Wenyi Hong

I am a fourth-year PhD student in Computer Science at Tsinghua University, supervised by Prof. Jie Tang since 2022. Before that, I received my Bachelor’s degree in Computer Science and Technology from Tsinghua University with a GPA of 4.00/4.00 (rank 1/238).

My research primarily focuses on multimodal foundation models, including vision-language models, multimodal agents, and vision generation models.

Email  /  Google Scholar  /  Github

profile photo

Honors & Awards

  • 2024: CVPR 2024 Highlight Paper for CogAgent
  • 2024: ICLR 2024 Spotlight Paper for RelayDiffusion
  • 2023: Selected to Tsinghua University's Future Scholars Scholarship Program
  • 2022: Outstanding Graduate of Tsinghua University
  • 2022: Outstanding Graduate of Beijing
  • 2019 & 2020: National Scholarship
  • 2017: Gold Medal of the 34th Chinese Physics Olympiad (Finals)

Selected Research

(* indicates equal contribution).

Visual Language Foundation Models

Framework of GLM-4.5V and GLM-4.1V-Thinking
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingdao Liu, Mingde Xu, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianyu Tong, Wenkai Li, Wei Jia, Xiao Liu, Xiaohan Zhang, Xin Lyu, Xinyue Fan, Xuancheng Huang, Yanling Wang, Yadong Xue, Yanfeng Wang, Yanzi Wang, Yifan An, Yifan Du, Yiming Shi, Yiheng Huang, Yilin Niu, Yuan Wang, Yuanchang Yue, Yuchen Li, Yutao Zhang, Yuting Wang, Yu Wang, Yuxuan Zhang, Zhao Xue, Zhenyu Hou, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang (87 authors)

ArXiv / GitHub / Models 4.1V & 4.5V / API 4.1V (free) & 4.5V / bibtex

We present GLM-4.5V (106B-A12B) and GLM-4.1V-Thinking (9B), a series of open-source VLMs designed to advance general-purpose multimodal understanding and reasoning. With an enhanced pre-trained base model and a carefully optimized multi-domain RL procedure, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size in a comprehensive evaluation across 42 public benchmarks.

CogVLM2 architecture
CogVLM2: Visual Language Models for Image and Video Understanding
Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang

ArXiv / GitHub (CogVLM2 & GLM-4V) / bibtex

We propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video, and GLM-4V.

CogAgent Demo
CogAgent: A Visual Language Model for GUI Agents
Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang
CVPR, 2024 (Highlight)
ArXiv / GitHub / Models / Models (new version-241220) / bibtex

One of the first GUI agents built on pre-trained VLMs.
We introduce CogAgent, an open-source, 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation.

CogVLM Demo
CogVLM: Visual Expert for Pretrained Language Models
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang
NeurIPS, 2024
ArXiv / GitHub / Models / bibtex

We introduce CogVLM, a powerful open-source visual language foundation model. With its visual expert design, CogVLM enables deep fusion of vision and language features without sacrificing any performance on NLP tasks.
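The visual expert described in the paper adds a parallel, trainable set of QKV and FFN weights in every transformer layer that is applied only at image-token positions, while text positions keep the frozen language-model weights and attention still mixes all tokens. Below is a minimal PyTorch sketch of that routing idea for a single attention layer; it is my own simplification, the class and variable names are mine, and the FFN counterpart and other details of the released model are omitted.

```python
# Minimal sketch of CogVLM's visual-expert routing (simplified illustration, not the released code).
import torch
import torch.nn as nn

class VisualExpertAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv_text = nn.Linear(dim, 3 * dim)   # text projection (in CogVLM: frozen pretrained LLM weights)
        self.qkv_image = nn.Linear(dim, 3 * dim)  # trainable visual-expert projection for image tokens
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); is_image: (batch, seq) bool mask marking image-token positions.
        # Each position is projected by its modality-specific weights...
        qkv = torch.where(is_image.unsqueeze(-1), self.qkv_image(x), self.qkv_text(x))
        q, k, v = qkv.chunk(3, dim=-1)
        shape = (*x.shape[:2], self.n_heads, self.head_dim)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        # ...but attention is still computed jointly over the full (image + text) sequence.
        attn = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(x.shape))
```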

Vision Generation Foundation Models

CogVideo Demo
CogVideo: Large-Scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong*, Ming Ding*, Wendi Zheng, Xinghan Liu, Jie Tang
ICLR, 2023  
ArXiv / GitHub / HuggingFace / bibtex

As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.

CogVideoX Demo
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, Jie Tang
ICLR, 2025
ArXiv / GitHub / Models / bibtex

We present CogVideoX, the second generation of CogVideo: a large-scale text-to-video generation model based on a diffusion transformer, which can generate 10-second continuous videos aligned with the text prompt, at a frame rate of 16 fps and a resolution of 768 × 1360 pixels.

CogView2 Demo
CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang
NeurIPS, 2022
ArXiv / GitHub / bibtex

To enable faster and higher-resolution image generation, we propose hierarchical transformers and local parallel auto-regressive generation, built on the CogLM (cross-modal general language model) architecture.

CogView Architecture
CogView: Mastering Text-to-Image Generation via Transformers
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang
NeurIPS, 2021
ArXiv / GitHub / bibtex

We present CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer for general-domain text-to-image generation, developed contemporaneously with OpenAI's DALL·E.

Evaluation of Vision Language Models

MotionBench intro
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Wenyi Hong*, Yean Cheng*, Zhuoyi Yang*, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang
CVPR, 2025
ArXiv / Project Page / Dataset / bibtex

We propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. We further propose a novel and efficient Through-Encoder (TE) Fusion method to enhance VLMs' ability to perceive fine-grained motion within a limited sequence-length budget.

LVBench intro
LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang
CVPR, 2025
ArXiv / Project Page / Dataset / bibtex

We introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset contains 6 major capability categories and 21 subcategories, with an average video length of 1.14 hours, approximately four times longer than the longest existing dataset.

Vision Generation Algorithms

Inf-DiT Architecture
Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer
Zhuoyi Yang, Heyang Jiang, Wenyi Hong, Jiayan Teng, Wendi Zheng, Yuxiao Dong, Ming Ding, Jie Tang
ECCV, 2024
ArXiv / GitHub / bibtex

We propose a unidirectional block attention mechanism for image diffusion models that can adaptively adjust the memory overhead during the inference process and handle global dependencies.

Pipeline of Relay Diffusion
Relay Diffusion: Unifying diffusion process across resolutions for image synthesis
Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, Jie Tang
ICLR, 2024 (Spotlight)
ArXiv / GitHub / bibtex

Through the lens of the discrete cosine transform, we find that the main reason high-resolution generation is difficult for diffusion models is that the same noise level at a higher resolution results in a higher signal-to-noise ratio in the frequency domain. In this work, we present the Relay Diffusion Model (RDM), in which the diffusion process can continue seamlessly at any new resolution or in any new model without restarting from pure noise or low-resolution conditioning.
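The frequency-domain observation can be checked numerically. The sketch below is my own illustration rather than code from the paper (the image, upsampling factor, noise level, and band size are arbitrary choices): it compares the energy in the lowest-frequency DCT block of an image and of its nearest-neighbor upsampled version against the expected energy contributed there by i.i.d. Gaussian noise of the same per-pixel standard deviation. Under an orthonormal DCT, the noise adds sigma^2 of expected energy per coefficient at either resolution, while the low-frequency signal energy grows roughly with the square of the upsampling factor, so the same noise level yields a higher effective SNR at the higher resolution.

```python
# Numerical sketch of the SNR-vs-resolution observation (illustration only, not from the paper).
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(0)
x = rng.random((64, 64))                      # stand-in "image"
x_up = np.kron(x, np.ones((4, 4)))            # 4x nearest-neighbor upsample -> 256 x 256

def lowfreq_snr(img, sigma, band=8):
    """Signal energy in the lowest band x band DCT block, relative to the expected noise energy there."""
    coeffs = dctn(img, norm="ortho")          # orthonormal DCT: i.i.d. N(0, sigma^2) noise stays i.i.d. per coefficient
    signal_energy = float(np.sum(coeffs[:band, :band] ** 2))
    expected_noise_energy = band * band * sigma ** 2
    return signal_energy / expected_noise_energy

sigma = 0.5                                   # same per-pixel noise std at both resolutions
print(lowfreq_snr(x, sigma))                  # low-resolution SNR
print(lowfreq_snr(x_up, sigma))               # roughly 16x larger, i.e. the square of the upsampling factor
```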

