I am a fourth-year PhD student in Computer Science at Tsinghua University (since 2022), supervised by
Prof. Jie Tang. Before that, I received my Bachelor's degree in Computer Science and Technology from Tsinghua University with a GPA of 4.00/4.00 (rank 1/238).
My research primarily focuses on multimodal foundation models, including vision-language models, multimodal agents, and
vision generation models.
We present GLM-4.5V (106B-A12B) and GLM-4.1V-Thinking (9B), a series of open-source VLMs designed to advance general-purpose multimodal understanding and reasoning. With an enhanced pre-trained base model and a carefully optimized multi-domain RL procedure, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size in a comprehensive evaluation across 42 public benchmarks.
CogVLM2: Visual Language Models for Image and Video Understanding
Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng,
Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da
Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang
ArXiv / GitHub (CogVLM2 & GLM-4V) / bibtex
We propose the CogVLM2 family, a new generation of visual language models for image and video
understanding, including CogVLM2, CogVLM2-Video, and GLM-4V.
One of the first GUI agents built on pre-trained VLMs.
We introduce CogAgent, an open-source 18-billion-parameter visual language model (VLM) specializing
in GUI understanding and navigation.
We introduce CogVLM, a powerful open-source visual language foundation model. With the design of a vision expert module, CogVLM enables deep fusion of vision and language features without sacrificing any performance on NLP tasks.
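As a rough illustration of the vision expert idea (my own PyTorch sketch with simplified names and shapes, not the released CogVLM code): each attention layer keeps a separate set of QKV and output projections for image tokens, text tokens reuse the language model's projections, and attention itself runs jointly over the mixed sequence.

```python
# Illustrative sketch of a "vision expert" attention layer (not the released CogVLM code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionExpertAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        # Text tokens reuse the language model's projections; image tokens
        # get their own, trainable "expert" projections.
        self.qkv_text = nn.Linear(dim, 3 * dim)
        self.qkv_image = nn.Linear(dim, 3 * dim)
        self.out_text = nn.Linear(dim, dim)
        self.out_image = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); image_mask: (batch, seq) bool, True for image tokens.
        b, s, d = x.shape
        # Route every token to the projections of its modality.
        qkv = torch.where(image_mask[..., None], self.qkv_image(x), self.qkv_text(x))
        q, k, v = (t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in qkv.chunk(3, dim=-1))
        # Attention itself is shared, so image and text tokens attend to each other.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, s, d)
        return torch.where(image_mask[..., None], self.out_image(out), self.out_text(out))

x = torch.randn(1, 8, 256)
mask = torch.tensor([[True] * 5 + [False] * 3])       # 5 image tokens, 3 text tokens
print(VisionExpertAttention(256, 8)(x, mask).shape)   # torch.Size([1, 8, 256])
```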
As (probably) the first open-source large-scale pretrained text-to-video model,
CogVideo outperforms all publicly available models by a large margin in machine and human
evaluations.
We present the second generation of CogVideo, CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer, which can generate 10-second continuous videos aligned with the text prompt, at a frame rate of 16 fps and a resolution of 768 × 1360 pixels.
To enable faster and higher-resolution image generation, we propose a hierarchical transformer design and local parallel autoregressive generation, achieved by the CogLM (cross-modal general language model) architecture.
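As a toy illustration of the decoding schedule (my own sketch with a random stand-in for the transformer, not the CogView2/CogLM implementation): a coarse token grid is generated autoregressively, then the upsampled grid is refined window by window, with all tokens of a local window predicted in parallel rather than one at a time.

```python
import torch

VOCAB, COARSE, SCALE = 1024, 4, 2   # toy sizes: a 4x4 grid upsampled to 8x8

def toy_logits(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for a transformer: random logits over the vocabulary."""
    return torch.randn(*tokens.shape, VOCAB)

# Stage 1: plain autoregressive generation of the coarse token grid.
coarse = torch.zeros(COARSE * COARSE, dtype=torch.long)
for i in range(coarse.numel()):
    coarse[i] = toy_logits(coarse[: i + 1])[-1].argmax()

# Stage 2: nearest-neighbour upsample the grid, then refine it one local window
# at a time, predicting all tokens of a window in parallel (in the real model,
# conditioned on the coarse grid and previously refined windows).
fine = coarse.view(COARSE, COARSE).repeat_interleave(SCALE, 0).repeat_interleave(SCALE, 1)
for wi in range(COARSE):
    for wj in range(COARSE):
        win = fine[wi * SCALE:(wi + 1) * SCALE, wj * SCALE:(wj + 1) * SCALE]
        win[:] = toy_logits(win.flatten()).argmax(-1).view(SCALE, SCALE)

print(fine.shape)  # torch.Size([8, 8])
```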
We present CogView, a 4 billion-parameter Transformer with a VQ-VAE tokenizer for general-domain text-to-image generation, developed contemporaneously with DALL·E by OpenAI.
We propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. Further, we propose a novel and efficient Through-Encoder (TE) Fusion method to enhance VLMs' ability to perceive fine-grained motion within a limited sequence length budget.
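A hedged sketch of the intuition behind TE Fusion (illustrative PyTorch with toy shapes, not the paper's implementation): groups of neighbouring frames are fused inside the visual encoder rather than after it, so cross-frame attention happens in every encoder layer while the token count handed to the language model stays fixed.

```python
import torch
import torch.nn as nn

class ToyThroughEncoderFusion(nn.Module):
    """Toy encoder that fuses each group of `group` consecutive frames *inside*
    the encoder, so the output token count is (num_frames / group) * 196."""
    def __init__(self, dim: int = 256, depth: int = 2, group: int = 4):
        super().__init__()
        self.group = group
        self.patch_embed = nn.Linear(3 * 16 * 16, dim)            # toy 16x16 patchify
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, 224, 224); num_frames must be divisible by `group`.
        n = frames.shape[0]
        patches = frames.unfold(2, 16, 16).unfold(3, 16, 16)       # (n, 3, 14, 14, 16, 16)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(n, 14 * 14, -1)
        tokens = self.patch_embed(patches)                          # (n, 196, dim)

        # Through-encoder fusion: concatenate each group of frames into one
        # sequence *before* the encoder, so cross-frame attention happens in
        # every layer, then average over the group to keep the budget fixed.
        g = self.group
        grouped = tokens.reshape(n // g, g * tokens.shape[1], -1)   # (n/g, g*196, dim)
        fused = self.encoder(grouped)
        return fused.reshape(n // g, g, -1, fused.shape[-1]).mean(dim=1)

video = torch.randn(8, 3, 224, 224)                # 8 frames
print(ToyThroughEncoderFusion()(video).shape)      # torch.Size([2, 196, 256])
```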
We introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset contains 6 major capability categories and 21 subcategories, with an average video length of 1.14 hours, approximately four times longer than that of the longest existing dataset.
We propose a unidirectional block attention mechanism for image diffusion models that can adaptively adjust the memory overhead during the inference process and handle global dependencies.
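A minimal sketch of the block-level attention mask this implies (my own illustration; the actual design additionally controls how many earlier blocks remain visible, which is what bounds memory): tokens attend within their own block and to earlier blocks only, so blocks can be generated sequentially while earlier blocks supply only cached keys and values.

```python
import torch

def block_causal_mask(num_blocks: int, block_size: int) -> torch.Tensor:
    """True = attention allowed. Tokens attend to their own block and to
    preceding blocks only, never to later ones, so blocks can be generated
    one at a time with cached keys/values for everything already produced."""
    block_ids = torch.arange(num_blocks).repeat_interleave(block_size)
    return block_ids[:, None] >= block_ids[None, :]

print(block_causal_mask(num_blocks=3, block_size=2).long())
# tensor([[1, 1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 1],
#         [1, 1, 1, 1, 1, 1]])
```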
Through the lens of the discrete cosine transform, we find that the main reason high-resolution generation is difficult for diffusion models is that the same noise level at a higher resolution results in a higher signal-to-noise ratio (SNR) in the frequency domain. In this work, we present the Relay Diffusion Model (RDM), where the diffusion process can continue seamlessly at any new resolution or in any new model without restarting from pure noise or low-resolution conditioning.
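A small numerical illustration of the frequency-domain argument (my own sketch, not an experiment from the paper): under an orthonormal DCT, i.i.d. pixel noise contributes the same variance to every frequency, while nearest-neighbour upsampling boosts the low-frequency coefficients, so the same noise level leaves the low-frequency band with a much higher SNR at the higher resolution.

```python
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))          # stand-in "image"
x_hi = np.kron(x, np.ones((4, 4)))         # 4x nearest-neighbour upsampling
sigma = 0.5                                # identical per-pixel noise std at both resolutions

def low_freq_snr(img: np.ndarray, sigma: float, k: int = 16) -> float:
    """Mean SNR over the k x k lowest-frequency orthonormal DCT coefficients;
    i.i.d. pixel noise has variance sigma**2 on every such coefficient."""
    coeffs = dctn(img, norm="ortho")[:k, :k]
    return float(np.mean(coeffs ** 2) / sigma ** 2)

print(low_freq_snr(x, sigma))     # low resolution
print(low_freq_snr(x_hi, sigma))  # roughly an order of magnitude higher at 4x resolution
```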