Publications
You can also find my publications on my Google Scholar profile.
Code LLM & Agents
Building intelligent coding assistants and autonomous agents that understand and generate code
Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang, arXiv preprint, 2025
Abstract
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising process operates over the entire sequence rather than token by token. We train a 7B model on 130B code tokens and propose coupled-GRPO, a novel reinforcement learning scheme built on a coupled sampling strategy. Our approach achieves a +4.4% gain on EvalPlus and shows how diffusion models can reduce their dependence on autoregressive decoding bias during code generation.
778 GitHub stars - Masked diffusion for code generation with Coupled-GRPO, achieving +4.4% on EvalPlus
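As a rough illustration of the decoding paradigm at play here, the sketch below shows generic confidence-based unmasking for a masked diffusion language model; it is not the DiffuCoder or coupled-GRPO implementation, and `model` is a hypothetical stand-in for a trained denoiser.

```python
# Generic masked-diffusion decoding sketch (not the DiffuCoder or coupled-GRPO
# implementation): start from an all-[MASK] sequence and unmask the most
# confident positions over a fixed number of denoising steps.
import torch


def masked_diffusion_decode(model, mask_id: int, length: int, steps: int = 16):
    tokens = torch.full((1, length), mask_id)            # fully masked start
    for step in range(steps):
        logits = model(tokens)                           # (1, length, vocab)
        probs = logits.softmax(-1)
        confidence, candidates = probs.max(-1)
        still_masked = tokens.eq(mask_id)
        # Reveal roughly 1/(remaining steps) of the masked positions,
        # choosing those the model is most confident about.
        n_reveal = max(1, int(still_masked.sum().item() / (steps - step)))
        confidence = confidence.masked_fill(~still_masked, -1.0)
        reveal = confidence.topk(n_reveal, dim=-1).indices
        tokens[0, reveal[0]] = candidates[0, reveal[0]]
    return tokens
```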
Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang, ICML, 2025
Abstract
We introduce SWE-Gym, a new training environment containing 2,438 real-world Python tasks, each with an executable codebase, unit tests, and natural language specifications. Fine-tuning language model-based software engineering agents on this dataset achieves up to 19% absolute gains in resolve rate on SWE-Bench Verified and Lite test sets. We further explore inference-time scaling via verifiers trained on agent trajectories, reaching state-of-the-art results for open-weight agents: 32.0% and 26.0% on the respective benchmarks. SWE-Gym, trained models, and agent trajectories are publicly available to support future research.
602 GitHub stars - Training framework for software engineering agents with real-world GitHub tasks
Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji, ICML, 2024
Abstract
Large language model (LLM) agents have demonstrated remarkable capabilities in automating complex tasks across diverse domains. However, most existing approaches rely on generating natural language actions or constrained predefined action spaces, which can be ambiguous or inflexible for real-world applications. We propose CodeAct, a framework that enables LLM agents to express and execute actions through executable Python code. This approach offers several advantages: (1) Python's expressiveness allows agents to combine multiple primitive actions flexibly, (2) code execution provides deterministic and verifiable action outcomes, and (3) the structured nature of code facilitates better error handling and debugging. We evaluate CodeAct on diverse interactive tasks spanning web browsing, database querying, and embodied control. Our experiments show that CodeAct agents consistently outperform both natural language and predefined action baselines, achieving state-of-the-art results while being more sample-efficient and robust to distribution shifts.
CodeAct agent achieves state-of-the-art on diverse interactive tasks using executable Python code
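To make the idea concrete, here is a minimal sketch of a CodeAct-style agent loop in which every action is a Python program executed in an interpreter; `query_llm` and the `TASK_COMPLETE` stop signal are hypothetical placeholders, not the paper's API.

```python
# Minimal sketch of a code-acting agent loop (illustrative only).
import contextlib
import io


def query_llm(history: list[str]) -> str:
    """Placeholder for an LLM call that returns a Python code action."""
    raise NotImplementedError


def run_code_action(code: str, env: dict) -> str:
    """Execute the generated code and capture stdout as the observation."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, env)  # in practice this runs inside a sandbox
    except Exception as exc:  # surface errors so the agent can self-correct
        return f"Error: {exc!r}"
    return buffer.getvalue()


def codeact_loop(task: str, max_turns: int = 5) -> list[str]:
    history, env = [f"Task: {task}"], {}
    for _ in range(max_turns):
        code = query_llm(history)           # agent emits an executable action
        observation = run_code_action(code, env)
        history += [f"Action:\n{code}", f"Observation:\n{observation}"]
        if "TASK_COMPLETE" in observation:  # hypothetical stop signal
            break
    return history
```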
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al., ICLR, 2024
Abstract
Recent advances in large language models (LLMs) have enabled AI systems to perform increasingly complex software engineering tasks. However, building generalist AI agents that can operate effectively across diverse real-world software development scenarios remains challenging. We present OpenHands, an open platform designed to enable the development and evaluation of AI agents that can act as generalist software developers. OpenHands provides a unified framework for agents to interact with software development environments through standardized actions, observations, and sandboxed execution. The platform supports diverse agent architectures and enables systematic evaluation on comprehensive benchmarks spanning code generation, debugging, issue resolution, and repository understanding tasks. We demonstrate that agents built on OpenHands can effectively tackle real-world GitHub issues and compete with state-of-the-art proprietary systems while being fully open-source. Our platform enables the research community to collaboratively advance towards more capable and generalizable AI software developers.
65.8k GitHub stars - Open platform enabling AI agents to perform complex software engineering tasks
Long-Horizon Planning
Enabling LLMs to perform complex, multi-step reasoning and planning over extended sequences
Yizhe Zhang, Jiarui Lu, Navdeep Jaitly, ACL, 2024
Abstract
Large language models (LLMs) have shown impressive capabilities in various NLP tasks, but their ability to perform multi-step reasoning and strategic planning in conversational settings remains unclear. We introduce the Entity-Deduction Arena, a benchmark designed to probe the conversational reasoning and planning capabilities of LLMs through an entity deduction game. In this game, an AI agent must identify a hidden entity by asking strategic yes/no questions, requiring the model to maintain context, reason about information gain, and plan question sequences effectively. Our comprehensive evaluation of state-of-the-art LLMs reveals that while they can handle short reasoning chains, they struggle significantly with long-horizon planning, often failing to ask informative questions or properly utilize previous answers. We analyze common failure modes including premature convergence, circular reasoning, and inefficient information gathering. Our findings highlight fundamental limitations in current LLMs' ability to perform sustained strategic reasoning in interactive settings and provide insights for developing more capable conversational AI systems.
Benchmark revealing LLMs struggle with long-horizon conversational reasoning and strategic information gathering
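The gist of the benchmark's game loop can be sketched as follows, assuming two hypothetical LLM roles (`ask_questioner` and `answer_judge`); this is illustrative pseudocode made runnable, not the released evaluation harness.

```python
# Illustrative entity-deduction game loop in the spirit of the benchmark.
def ask_questioner(dialogue: list[str]) -> str:
    """Questioner LLM proposes the next yes/no question or a final guess."""
    raise NotImplementedError


def answer_judge(secret_entity: str, question: str) -> str:
    """Judge LLM answers 'Yes', 'No', or 'Maybe' about the hidden entity."""
    raise NotImplementedError


def play_game(secret_entity: str, max_turns: int = 20) -> bool:
    dialogue = []
    for turn in range(max_turns):
        question = ask_questioner(dialogue)
        if question.lower().startswith("is it "):        # final guess
            guess = question[6:].rstrip("?").strip()
            return guess.lower() == secret_entity.lower()
        reply = answer_judge(secret_entity, question)
        dialogue += [f"Q{turn + 1}: {question}", f"A{turn + 1}: {reply}"]
    return False  # ran out of turns without a correct guess
```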
Yizhe Zhang, Jiatao Gu, Zhuofeng Wu, Shuangfei Zhai, Joshua Susskind, Navdeep Jaitly, NeurIPS, 2023
Abstract
Autoregressive models for text sometimes generate repetitive and low-quality output because errors accumulate during generation. We propose PLANNER, which combines latent semantic diffusion with autoregressive generation to produce fluent long-form text while maintaining paragraph-level control. The approach pairs a decoding module with a planning module that generates semantic embeddings progressively, demonstrating effectiveness on semantic generation, text completion, and summarization tasks.
Latent diffusion with planning module for controlled and diverse paragraph generation
RAG & Reasoning with Continuous Tokens
Retrieval-augmented generation and reasoning systems using continuous token representations
Jie He, Richard He Bai, Sinead Williamson, Jeff Z. Pan, Navdeep Jaitly, Yizhe Zhang, arXiv preprint, 2025
Abstract
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long-context overhead and from retrieval and generation being optimized separately. We propose CLaRa, a framework that performs embedding-based compression and joint optimization in a shared continuous space. It introduces SCP, a data synthesis framework, and trains all components end-to-end with a language modeling loss and a differentiable top-k estimator, achieving state-of-the-art compression and reranking performance on QA benchmarks.
Continuous latent reasoning bridges retrieval and generation with joint optimization in shared continuous space
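As one way to picture a differentiable top-k, the sketch below uses a successive-softmax relaxation for passage selection; this is a common relaxation chosen purely for illustration and is not necessarily the estimator used in CLaRa.

```python
# Hedged sketch of a differentiable top-k relaxation for passage selection.
import torch


def soft_topk_weights(scores: torch.Tensor, k: int, tau: float = 0.5):
    """Return (batch, n_passages) weights that sum to k and are differentiable."""
    weights = torch.zeros_like(scores)
    masked = scores.clone()
    for _ in range(k):                       # iteratively peel off k soft picks
        pick = torch.softmax(masked / tau, dim=-1)
        weights = weights + pick
        # Suppress already-picked items in subsequent rounds.
        masked = masked + torch.log1p(-pick.clamp(max=1 - 1e-6))
    return weights


scores = torch.randn(2, 8, requires_grad=True)
w = soft_topk_weights(scores, k=3)
w.sum().backward()                           # gradients flow back to the scores
```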
Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, Lianhui Qin, arXiv preprint, 2025
Abstract
Large language models (LLMs) demonstrate reasoning through chain-of-thought (CoT) generation. However, their autoregressive decoding may limit the ability to revisit and refine earlier tokens holistically, leading to inefficient exploration of diverse solutions. We propose LaDiR, which combines a variational autoencoder with a latent diffusion model to enable iterative refinement of reasoning. It encodes reasoning steps into thought tokens and uses blockwise bidirectional attention to generate diverse reasoning trajectories in parallel, showing improvements in accuracy, diversity, and interpretability across mathematical reasoning and planning benchmarks.
Latent diffusion with VAE for iterative refinement and diverse reasoning trajectories
Text Diffusion Models
Advancing non-autoregressive generation through diffusion-based approaches
Shansan Gong, Zijing Ou, Yizhe Zhang, Navdeep Jaitly, Mukai Li, arXiv preprint, 2025
Abstract
Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. We introduce FS-DFM (Few-Step Discrete Flow-Matching), a diffusion language model designed to accelerate text generation. The model maintains consistent quality across different sampling step budgets with a stable update rule using teacher guidance from long-run trajectories. It achieves perplexity parity with 1,024-step baselines using only 8 steps and delivers up to 128× faster sampling for 1,024-token generation.
Achieves perplexity parity with 1024-step baselines using only 8 steps, 128× faster
Ruixiang Zhang, Shuangfei Zhai, Linh Tran, Yizhe Zhang, Tao Wang, Navdeep Jaitly, Joshua Susskind, arXiv preprint, 2025
Abstract
Standard discrete diffusion models treat all unobserved states identically by mapping them to an absorbing [MASK] token, creating an information void where semantic information that could be inferred from unmasked tokens is lost between denoising steps. We present CADD, which augments discrete state spaces with paired continuous latent space diffusion. This approach represents masked tokens as noisy yet informative vectors rather than collapsed states. At sampling time, the continuous latent can guide discrete denoising while enabling trade-offs between diverse outputs and contextually precise generation, demonstrating improvements over mask-based diffusion across text, image synthesis, and code modeling tasks.
Augments discrete states with continuous latents to avoid information void in masked tokens
Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, Lianhui Qin, arXiv preprint, 2025
Abstract
Large language models (LLMs) demonstrate reasoning through chain-of-thought (CoT) generation. However, their autoregressive decoding may limit the ability to revisit and refine earlier tokens holistically, leading to inefficient exploration of diverse solutions. We propose LaDiR, which combines a variational autoencoder with a latent diffusion model to enable iterative refinement of reasoning. It encodes reasoning steps into thought tokens and uses blockwise bidirectional attention to generate diverse reasoning trajectories in parallel, showing improvements in accuracy, diversity, and interpretability across mathematical reasoning and planning benchmarks.
Latent diffusion with VAE for iterative refinement and diverse reasoning trajectories
Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang, arXiv preprint, 2025
Abstract
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising process operates over the entire sequence rather than token by token. We train a 7B model on 130B code tokens and propose coupled-GRPO, a novel reinforcement learning scheme built on a coupled sampling strategy. Our approach achieves a +4.4% gain on EvalPlus and shows how diffusion models can reduce their dependence on autoregressive decoding bias during code generation.
778 GitHub stars - Masked diffusion for code generation with Coupled-GRPO, achieving +4.4% on EvalPlus
Ruixiang Zhang, Shuangfei Zhai, Yizhe Zhang, James Thornton, Zijing Ou, Joshua Susskind, Navdeep Jaitly, ICML, 2025
Abstract
We present Target Concrete Score Matching (TCSM), a novel training objective for discrete diffusion models. TCSM provides a general framework with broad applicability, supporting pre-training discrete diffusion models directly from data samples. Many existing discrete diffusion approaches naturally emerge as special cases of our more general TCSM framework. The framework enables fine-tuning using reward functions, preference data, and knowledge distillation from autoregressive models by estimating the concrete score of the target distribution in the original clean data space.
Unified theoretical framework for discrete diffusion supporting pre-training and fine-tuning with rewards
Ruixiang Zhang, Shuangfei Zhai, Jiatao Gu, Yizhe Zhang, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Josh Susskind, Navdeep Jaitly, NeurIPS, 2025
Abstract
The paper presents TarFlowLM, a framework that reimagines language modeling by operating in continuous latent space rather than discrete tokens. We propose using transformer-based autoregressive normalizing flows to model these continuous representations, enabling bidirectional context capture through alternating-direction transformations and block-wise generation with variable token patch sizes. The approach introduces mixture-based coupling transformations to handle complex dependencies within the latent space and establishes theoretical links to conventional discrete autoregressive models. Experimental results demonstrate competitive likelihood performance while showcasing the framework's flexible modeling capabilities.
TarFlowLM reimagines language modeling in continuous latent space with bidirectional context capture
Jiatao Gu, Ying Shen, Shuangfei Zhai, Yizhe Zhang, Navdeep Jaitly, Joshua M. Susskind, NeurIPS, 2024
Abstract
We introduce Kaleido, a method for enhancing image generation diversity in conditional diffusion models. Diffusion models often exhibit limited diversity in the sampled images, particularly when sampling with a high classifier-free guidance weight. Our solution integrates an autoregressive language model that processes captions and generates intermediate latent representations—including textual descriptions, bounding boxes, object blobs, and visual tokens. These diverse latent variables serve as enriched conditioning signals for the diffusion process. Our experimental findings demonstrate that Kaleido successfully increases the variety of generated images while preserving quality and maintaining fidelity to the generated latent guidance signals, thereby enabling improved control over image generation outcomes.
Integrates autoregressive language model to generate enriched conditioning signals for diverse image generation
Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al., ICLR, 2024
Abstract
Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. We demonstrate how to convert existing autoregressive models (GPT2 and LLaMA, ranging from 127M to 7B parameters) into diffusion models called DiffuGPT and DiffuLLaMA. Using less than 200B tokens, we achieve models competitive with their autoregressive counterparts while enabling unique capabilities like fill-in-the-middle generation without prompt reordering.
Convert GPT2 and LLaMA (127M-7B) to diffusion models competitive with AR counterparts
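A hedged sketch of the kind of masked-denoising objective involved in adapting an AR model into a diffusion LM is shown below; the actual recipe (attention-mask annealing, shift operations, schedules) follows the paper, not this toy, and `model` is assumed to run with bidirectional attention.

```python
# Toy masked-denoising training step: sample a masking rate, corrupt the input,
# and score only the masked positions with full (bidirectional) attention.
import torch
import torch.nn.functional as F


def diffusion_adaptation_loss(model, input_ids: torch.Tensor, mask_id: int):
    batch, length = input_ids.shape
    rate = torch.rand(batch, 1, device=input_ids.device)   # per-sequence mask rate
    is_masked = torch.rand(batch, length, device=input_ids.device) < rate
    corrupted = torch.where(
        is_masked, torch.full_like(input_ids, mask_id), input_ids
    )
    logits = model(corrupted)                               # (batch, length, vocab)
    loss = F.cross_entropy(
        logits[is_masked],                                  # (n_masked, vocab)
        input_ids[is_masked],                               # (n_masked,)
        reduction="mean",
    )
    return loss
```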
Jiatao Gu, Yuyang Wang, Yizhe Zhang, Qihang Zhang, Dinghuai Zhang, Navdeep Jaitly, Josh Susskind, Shuangfei Zhai, ICLR, 2024
Abstract
We introduce DART, a transformer-based approach to text-to-image generation that unifies autoregressive (AR) modeling and diffusion within a non-Markovian framework, allowing it to iteratively denoise image patches using an architecture similar to standard language models. Unlike traditional diffusion models limited by their Markovian property, DART overcomes this constraint without requiring image quantization, enabling more effective image modeling. A single architecture handles both text and image data under a unified training scheme. DART demonstrates competitive performance on class-conditioned and text-to-image tasks, providing a scalable, efficient alternative to traditional diffusion models and setting a new benchmark for high-quality image synthesis.
Unifies autoregressive and diffusion in non-Markovian framework for scalable text-to-image generation
Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, Navdeep Jaitly, ICLR, 2023
Abstract
We introduce an innovative framework for high-resolution image and video generation. Our approach proposes a diffusion process that processes inputs across multiple resolutions simultaneously, utilizing a NestedUNet architecture in which features and parameters for small-scale inputs are nested within those of larger scales. A key innovation is a progressive training schedule that moves from lower to higher resolutions, substantially improving optimization for high-resolution outputs. The method demonstrates capabilities across diverse applications including class-conditioned image generation, high-resolution text-to-image synthesis, and text-to-video tasks. Notably, the approach enables training a single pixel-space model at resolutions up to 1024×1024, achieving strong zero-shot generalization using the CC12M dataset, which contains only 12 million images.
Multi-resolution diffusion with NestedUNet achieves 1024x1024 generation with strong zero-shot generalization
Yizhe Zhang, Jiatao Gu, Zhuofeng Wu, Shuangfei Zhai, Joshua Susskind, Navdeep Jaitly, NeurIPS, 2023
Abstract
Autoregressive models for text sometimes generate repetitive and low-quality output because errors accumulate during generation. We propose PLANNER, which combines latent semantic diffusion with autoregressive generation to produce fluent long-form text while maintaining paragraph-level control. The approach pairs a decoding module with a planning module that generates semantic embeddings progressively, demonstrating effectiveness on semantic generation, text completion, and summarization tasks.
Latent diffusion with planning module for controlled and diverse paragraph generation
Coding-Based AI Scientist
Developing AI systems that can autonomously discover knowledge through code
All Publications (Chronological)
Preprint
Shansan Gong, Zijing Ou, Yizhe Zhang, Navdeep Jaitly, Mukai Li, arXiv preprint, 2025
Abstract
Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. We introduce FS-DFM (Few-Step Discrete Flow-Matching), a diffusion language model designed to accelerate text generation. The model maintains consistent quality across different sampling step budgets with a stable update rule using teacher guidance from long-run trajectories. It achieves perplexity parity with 1,024-step baselines using only 8 steps and delivers up to 128× faster sampling for 1,024-token generation.
Ruixiang Zhang, Shuangfei Zhai, Linh Tran, Yizhe Zhang, Tao Wang, Navdeep Jaitly, Joshua Susskind, arXiv preprint, 2025
Abstract
Standard discrete diffusion models treat all unobserved states identically by mapping them to an absorbing [MASK] token, creating an information void where semantic information that could be inferred from unmasked tokens is lost between denoising steps. We present CADD, which augments discrete state spaces with paired continuous latent space diffusion. This approach represents masked tokens as noisy yet informative vectors rather than collapsed states. At sampling time, the continuous latent can guide discrete denoising while enabling trade-offs between diverse outputs and contextually precise generation, demonstrating improvements over mask-based diffusion across text, image synthesis, and code modeling tasks.
Jie He, Richard He Bai, Sinead Williamson, Jeff Z. Pan, Navdeep Jaitly, Yizhe Zhang, arXiv preprint, 2025
Abstract
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long-context overhead and from retrieval and generation being optimized separately. We propose CLaRa, a framework that performs embedding-based compression and joint optimization in a shared continuous space. It introduces SCP, a data synthesis framework, and trains all components end-to-end with a language modeling loss and a differentiable top-k estimator, achieving state-of-the-art compression and reranking performance on QA benchmarks.
Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, Lianhui Qin, arXiv preprint, 2025
Abstract
Large language models (LLMs) demonstrate reasoning through chain-of-thought (CoT) generation. However, their autoregressive decoding may limit the ability to revisit and refine earlier tokens holistically, leading to inefficient exploration of diverse solutions. We propose LaDiR, which combines a variational autoencoder with a latent diffusion model to enable iterative refinement of reasoning. It encodes reasoning steps into thought tokens and uses blockwise bidirectional attention to generate diverse reasoning trajectories in parallel, showing improvements in accuracy, diversity, and interpretability across mathematical reasoning and planning benchmarks.
Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang, arXiv preprint, 2025
Abstract
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising process operates over the entire sequence rather than token by token. We train a 7B model on 130B code tokens and propose coupled-GRPO, a novel reinforcement learning scheme built on a coupled sampling strategy. Our approach achieves a +4.4% gain on EvalPlus and shows how diffusion models can reduce their dependence on autoregressive decoding bias during code generation.
Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, Weizhu Chen, arXiv preprint, 2022
Abstract
This paper explores applying GPT-3's few-shot learning to the SemEval 2021 MeasEval task, which involves identifying measurements and their associated attributes in scientific literature. Despite initial promise, we found that GPT-3 underperformed compared to our prior multi-turn question-answering approach. We identified several limitations: technical constraints included limits on the size of the prompt and answer, restricting the training signal available. More fundamentally, we discovered that generative models struggle with factual retention, and prompt modifications produced unpredictable results that hindered systematic performance improvement.
Deng Cai, Yizhe Zhang, Yichen Huang, Wai Lam, Bill Dolan, arXiv preprint, 2022
Yizhe Zhang, Deng Cai, arXiv preprint, 2022
Xiang Gao, Yizhe Zhang, Michel Galley, Bill Dolan, arXiv preprint, 2022
2025
Ruixiang Zhang, Shuangfei Zhai, Yizhe Zhang, James Thornton, Zijing Ou, Joshua Susskind, Navdeep Jaitly, ICML, 2025
Abstract
We present Target Concrete Score Matching (TCSM), a novel training objective for discrete diffusion models. TCSM provides a general framework with broad applicability, supporting pre-training discrete diffusion models directly from data samples. Many existing discrete diffusion approaches naturally emerge as special cases of our more general TCSM framework. The framework enables fine-tuning using reward functions, preference data, and knowledge distillation from autoregressive models by estimating the concrete score of the target distribution in the original clean data space.
Ruixiang Zhang, Shuangfei Zhai, Jiatao Gu, Yizhe Zhang, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Josh Susskind, Navdeep Jaitly, NeurIPS, 2025
Abstract
The paper presents TarFlowLM, a framework that reimagines language modeling by operating in continuous latent space rather than discrete tokens. We propose using transformer-based autoregressive normalizing flows to model these continuous representations, enabling bidirectional context capture through alternating-direction transformations and block-wise generation with variable token patch sizes. The approach introduces mixture-based coupling transformations to handle complex dependencies within the latent space and establishes theoretical links to conventional discrete autoregressive models. Experimental results demonstrate competitive likelihood performance while showcasing the framework's flexible modeling capabilities.
2024
Jiatao Gu, Ying Shen, Shuangfei Zhai, Yizhe Zhang, Navdeep Jaitly, Joshua M. Susskind, NeurIPS, 2024
Abstract
We introduce Kaleido, a method for enhancing image generation diversity in conditional diffusion models. Diffusion models often exhibit limited diversity in the sampled images, particularly when sampling with a high classifier-free guidance weight. Our solution integrates an autoregressive language model that processes captions and generates intermediate latent representations—including textual descriptions, bounding boxes, object blobs, and visual tokens. These diverse latent variables serve as enriched conditioning signals for the diffusion process. Our experimental findings demonstrate that Kaleido successfully increases the variety of generated images while preserving quality and maintaining fidelity to the generated latent guidance signals, thereby enabling improved control over image generation outcomes.
Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji, ICML, 2024
Abstract
Large language model (LLM) agents have demonstrated remarkable capabilities in automating complex tasks across diverse domains. However, most existing approaches rely on generating natural language actions or constrained predefined action spaces, which can be ambiguous or inflexible for real-world applications. We propose CodeAct, a framework that enables LLM agents to express and execute actions through executable Python code. This approach offers several advantages: (1) Python's expressiveness allows agents to combine multiple primitive actions flexibly, (2) code execution provides deterministic and verifiable action outcomes, and (3) the structured nature of code facilitates better error handling and debugging. We evaluate CodeAct on diverse interactive tasks spanning web browsing, database querying, and embodied control. Our experiments show that CodeAct agents consistently outperform both natural language and predefined action baselines, achieving state-of-the-art results while being more sample-efficient and robust to distribution shifts.
Yizhe Zhang, Jiarui Lu, Navdeep Jaitly, ACL, 2024
Abstract
Large language models (LLMs) have shown impressive capabilities in various NLP tasks, but their ability to perform multi-step reasoning and strategic planning in conversational settings remains unclear. We introduce the Entity-Deduction Arena, a benchmark designed to probe the conversational reasoning and planning capabilities of LLMs through an entity deduction game. In this game, an AI agent must identify a hidden entity by asking strategic yes/no questions, requiring the model to maintain context, reason about information gain, and plan question sequences effectively. Our comprehensive evaluation of state-of-the-art LLMs reveals that while they can handle short reasoning chains, they struggle significantly with long-horizon planning, often failing to ask informative questions or properly utilize previous answers. We analyze common failure modes including premature convergence, circular reasoning, and inefficient information gathering. Our findings highlight fundamental limitations in current LLMs' ability to perform sustained strategic reasoning in interactive settings and provide insights for developing more capable conversational AI systems.
Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al., ICLR, 2024
Abstract
Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. We demonstrate how to convert existing autoregressive models (GPT2 and LLaMA, ranging from 127M to 7B parameters) into diffusion models called DiffuGPT and DiffuLLaMA. Using less than 200B tokens, we achieve models competitive with their autoregressive counterparts while enabling unique capabilities like fill-in-the-middle generation without prompt reordering.
Jiatao Gu, Yuyang Wang, Yizhe Zhang, Qihang Zhang, Dinghuai Zhang, Navdeep Jaitly, Josh Susskind, Shuangfei Zhai, ICLR, 2024
Abstract
We introduce DART, a transformer-based approach to text-to-image generation that unifies autoregressive (AR) modeling and diffusion within a non-Markovian framework, allowing it to iteratively denoise image patches using an architecture similar to standard language models. Unlike traditional diffusion models limited by their Markovian property, DART overcomes this constraint without requiring image quantization, enabling more effective image modeling. A single architecture handles both text and image data under a unified training scheme. DART demonstrates competitive performance on class-conditioned and text-to-image tasks, providing a scalable, efficient alternative to traditional diffusion models and setting a new benchmark for high-quality image synthesis.
2023
Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Joshua M Susskind, ICML, 2023
Abstract
We investigate training instability in Transformers by analyzing attention layer dynamics. Our research found that low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We propose σReparam, a technique that reparametrizes linear layers using spectral normalization plus a learned scalar to prevent entropy collapse. We provide theoretical grounding by proving that attention entropy decreases exponentially with the spectral norm of attention logits. Experimental validation spans multiple domains—vision, machine translation, speech recognition, and language modeling—demonstrating that σReparam enables competitive performance while eliminating common training requirements like warmup, weight decay, layer normalization, and adaptive optimizers.
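Below is a minimal sketch of the reparametrization described above: each weight matrix is scaled by a learned scalar divided by a power-iteration estimate of its spectral norm. This illustrates the idea rather than reproducing the authors' implementation.

```python
# σReparam-style linear layer: W_hat = (gamma / sigma(W)) * W, with sigma(W)
# estimated by one power-iteration step per forward pass.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SigmaReparamLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.gamma = nn.Parameter(torch.ones(1))          # learned scalar
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                             # power-iteration step
            v = F.normalize(self.weight.t() @ self.u, dim=0)
            self.u = F.normalize(self.weight @ v, dim=0)
        sigma = torch.dot(self.u, self.weight @ v)        # spectral norm estimate
        return F.linear(x, (self.gamma / sigma) * self.weight)
```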
Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, Navdeep Jaitly, ICLR, 2023
Abstract
We introduce an innovative framework for high-resolution image and video generation. Our approach proposes a diffusion process that processes inputs across multiple resolutions simultaneously, utilizing a NestedUNet architecture in which features and parameters for small-scale inputs are nested within those of larger scales. A key innovation is a progressive training schedule that moves from lower to higher resolutions, substantially improving optimization for high-resolution outputs. The method demonstrates capabilities across diverse applications including class-conditioned image generation, high-resolution text-to-image synthesis, and text-to-video tasks. Notably, the approach enables training a single pixel-space model at resolutions up to 1024×1024, achieving strong zero-shot generalization using the CC12M dataset, which contains only 12 million images.
2022
Tianyu Liu, Yizhe Zhang, Chris Brockett, Yi Mao, Zhifang Sui, Weizhu Chen, Bill Dolan, ACL, 2022
Yizhe Zhang, Siqi Sun, Xiang Gao, Yuwei Fang, Chris Brockett, Michel Galley, Jianfeng Gao, Bill Dolan, AAAI, 2022
2021
Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Yi Mao, Weizhu Chen, Noah A Smith, EMNLP, 2021
Yizhe Zhang, Xiang Gao, Sungjin Lee, Chris Brockett, Michel Galley, Jianfeng Gao, Bill Dolan, SIGDIAL, 2021
Dianqi Li, Yizhe Zhang, Hao Peng, Liqun Chen, Chris Brockett, Ming-Ting Sun, Bill Dolan, NAACL, 2021
Woon Sang Cho, Yizhe Zhang, Sudha Rao, Asli Celikyilmaz, Chenyan Xiong, Jianfeng Gao, Mengdi Wang, Bill Dolan, EACL, 2021
Ramakanth Pasunuru, Asli Celikyilmaz, Michel Galley, Chenyan Xiong, Yizhe Zhang, Mohit Bansal, Jianfeng Gao, AAAI, 2021
Zeqiu Wu, Michel Galley, Chris Brockett, Yizhe Zhang, Xiang Gao, Chris Quirk, Rik Koncel-Kedziorski, Jianfeng Gao, Hannaneh Hajishirzi, Mari Ostendorf, Bill Dolan, AAAI, 2021
2020
Shuyang Dai, Yu Cheng, Yizhe Zhang, Zhe Gan, JJ Liu, Lawrence Carin, ACCV, 2020
Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, Bill Dolan, EMNLP, 2020
Abstract
We introduce POINTER, a model designed for text generation under lexical constraints. The approach operates through progressive insertion of new tokens between existing tokens in a parallel manner, applied recursively until completion. This generates a coarse-to-fine hierarchy that enhances interpretability. We pre-train on Wikipedia and achieve state-of-the-art results on constrained generation tasks. The non-autoregressive decoding strategy produces logarithmic time complexity during inference, offering efficiency advantages over traditional methods.
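A toy, hypothetical illustration of the progressive insertion process: at each stage one token (or a no-insertion marker) is proposed for every gap between existing tokens, so the sequence grows coarse-to-fine; `propose_insertions` stands in for the trained insertion model.

```python
# Toy POINTER-style progressive insertion (illustrative placeholder code).
NONE = "<none>"


def propose_insertions(tokens: list[str]) -> list[str]:
    """Return one proposed token per gap (len(tokens) + 1 gaps)."""
    raise NotImplementedError


def pointer_generate(constraints: list[str], max_stages: int = 6) -> list[str]:
    tokens = list(constraints)                 # start from the lexical constraints
    for _ in range(max_stages):
        proposals = propose_insertions(tokens)
        if all(p == NONE for p in proposals):  # converged: nothing left to insert
            break
        merged = []
        for gap, tok in enumerate(tokens):
            if proposals[gap] != NONE:
                merged.append(proposals[gap])  # insert before the existing token
            merged.append(tok)
        if proposals[-1] != NONE:              # final gap after the last token
            merged.append(proposals[-1])
        tokens = merged
    return tokens
```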
Chunyuan Li, Xiang Gao, Yuan Li, Xiujun Li, Baolin Peng, Yizhe Zhang, Jianfeng Gao, EMNLP, 2020
Abstract
We introduce Optimus, a large-scale Variational Autoencoder (VAE) for natural language processing. The model serves both as a powerful generative model and as an effective representation learning framework for natural language. Key contributions include: a universal latent embedding space for sentences, pre-trained on large text corpora and fine-tunable for various tasks; guided language generation superior to GPT-2, with abstract-level control via latent vectors; improved generalization on low-resource tasks compared to BERT, thanks to the smooth latent space structure; and state-of-the-art performance on VAE language modeling benchmarks. We aim to revitalize interest in deep generative models in the era of large-scale pre-training and make these methods more practical for the NLP community.
Jianqiao Li, Chunyuan Li, Guoyin Wang, Hao Fu, Yuhchen Lin, Liqun Chen, Yizhe Zhang, Chenyang Tao, Ruiyi Zhang, Wenlin Wang, Dinghan Shen, Qian Yang and Lawrence Carin, EMNLP, 2020
Xiang Gao, Yizhe Zhang, Michel Galley, Chris Brockett and Bill Dolan, EMNLP, 2020
Yu Cheng, Zhe Gan, Yizhe Zhang, Oussama Elachqar, Dianqi Li, Jingjing Liu, Findings of EMNLP, 2020
Siyang Yuan, Ke Bai, Liqun Chen, Yizhe Zhang, Chenyang Tao, Chunyuan Li, Guoyin Wang, Ricardo Henao, Lawrence Carin, BMVC (oral presentation), 2020
Xinnuo Xu, Yizhe Zhang, Lars Liden, Sungjin Lee, Interspeech, 2020
Pengyu Cheng, Renqiang Min, Dinghan Shen, Christopher Malon, Yizhe Zhang, Yitong Li and Lawrence Carin, ACL, 2020
Yichen Huang, Yizhe Zhang, Oussama Elachqar, Yu Cheng, ACL, 2020
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan, system demonstration, ACL, 2020
Abstract
We present DialoGPT, a conversational response generation model trained on 147 million conversation-like exchanges extracted from Reddit spanning 2005 to 2017. The system extends the Hugging Face PyTorch transformer and attains performance close to human in both automatic and human evaluation for single-turn dialogue. We demonstrate that our approach produces responses superior to baseline systems in relevance, content quality, and contextual consistency. We have made both the pretrained model and the training pipeline publicly available to advance research in neural dialogue systems.
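Since the pretrained checkpoints are public, a minimal single-turn usage sketch with Hugging Face transformers looks like this (using the released `microsoft/DialoGPT-medium` checkpoint):

```python
# Minimal single-turn generation with the released DialoGPT checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

# DialoGPT separates dialogue turns with the EOS token.
prompt = "Does money buy happiness?" + tokenizer.eos_token
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(
    input_ids,
    max_length=100,
    pad_token_id=tokenizer.eos_token_id,
)
reply = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(reply)
```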
Xinjie Fan, Yizhe Zhang, Zhendong Wang, Mingyuan Zhou, ICLR, 2020
Liqun Chen, Ke Bai, Chenyang Tao, Yizhe Zhang, Guoyin Wang, Wenlin Wang, Ricardo Henao, Lawrence Carin, AAAI, 2020
Yuan Li, Chunyuan Li, Yizhe Zhang, Xiujun Li, Guoqing Zheng, Lawrence Carin, Jianfeng Gao, AAAI, 2020
Chen Qu, Chenyan Xiong, Yizhe Zhang, Corby Rosset, W. Bruce Croft and Paul Bennett, SIGIR, 2020
2019
Xinnuo Xu, Yizhe Zhang, Lars Liden and Sungjin Lee, SIGDIAL (Best paper nomination), 2019
Woon Sang Cho, Pengchuan Zhang, Yizhe Zhang, Xiujun Li, Michel Galley, Chris Brockett, Mengdi Wang, Jianfeng Gao, Workshop on Narrative Understanding, NAACL, 2019
Dinghan Shen, Asli Celikyilmaz, Yizhe Zhang, Liqun Chen, Xin Wang, Jianfeng Gao, Lawrence Carin, ACL, 2019
Xiang Gao, Yizhe Zhang, Sungjin Lee, Michel Galley, Chris Brockett, Jianfeng Gao and Bill Dolan, EMNLP, 2019
Ping Yu, Ruiyi Zhang, Chunyuan Li, Yizhe Zhang, Changyou Chen, Imitation, Intent, and Interaction(I3), ICML, 2019
Vighnesh Leonardo Shiv, Chris Quirk, Anshuman Suri, Xiang Gao, Khuram Shahid, Nithya Govindarajan, Yizhe Zhang, Jianfeng Gao, Michel Galley, Chris Brockett, Tulasi Menon, Bill Dolan, system demonstration, ACL, 2019
Liqun Chen, Guoyin Wang, Chenyang Tao, Dinghan Shen, Yizhe Zhang and Lawrence Carin, ACL, 2019
Liqun Chen, Yizhe Zhang, Ruiyi Zhang, Chenyang Tao, Zhe Gan, Haichao Zhang, Bai Li, Dinghan Shen, Changyou Chen, Lawrence Carin, ICLR, 2019
Xiang Gao, Sungjin Lee, Yizhe Zhang, Chris Brockett, Michel Galley, Jianfeng Gao, Bill Dolan, NAACL, 2019
Woon Sang Cho, Yizhe Zhang, Sudha Rao, Chris Brockett and Sungjin Lee, WNGT, EMNLP, 2019
Dianqi Li, Yizhe Zhang, Zhe Gan, Yu Cheng, Chris Brockett, Ming-Ting Sun and Bill Dolan, EMNLP, 2019
2018
Wenlin Wang, Yunchen Pu, Vinay Kumar Verma, Kai Fan, Yizhe Zhang, Changyou Chen, Piyush Rai, Lawrence Carin, AAAI, 2018
Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao and Lawrence Carin, ACL, 2018
Yunchen Pu, Shuyang Dai, Zhe Gan, Weiyao Wang, Guoyin Wang, Yizhe Zhang, Ricardo Henao, Lawrence Carin, ICML, 2018
Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao and Lawrence Carin, ACL, 2018
Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, Bill Dolan, NIPS, 2018
Dinghan Shen, Yizhe Zhang, Ricardo Henao, Qinliang Su, Lawrence Carin, AAAI, 2018
Liqun Chen, Shuyang Dai, Chenyang Tao, Dinghan Shen, Zhe Gan, Haichao Zhang, Yizhe Zhang, Lawrence Carin, NIPS, 2018
2017
Zhe Gan, Liqun Chen, Weiyao Wang, Yunchen Pu, Yizhe Zhang, Lawrence Carin, NIPS, 2017
Yizhe Zhang, Changyou Chen, Zhe Gan, Ricardo Henao, Lawrence Carin, ICML, 2017
Yizhe Zhang, Dinghan Shen, Guoyin Wang, Ricardo Henao, Zhe Gan, Lawrence Carin, NIPS, 2017
Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Lawrence Carin, ICML, 2017
2016
Kai Fan, Yizhe Zhang, Lawrence Carin, Katherine Heller, ICDM, 2016
Yizhe Zhang, Xiangyu Wang, Changyou Chen, Lawrence Carin, NIPS, 2016
Changyou Chen, Nan Ding, Chunyuan Li, Yizhe Zhang, Lawrence Carin, NIPS, 2016
Yizhe Zhang, Ricardo Henao, Jianling Zhong, Lawrence Carin, Alexander Hartemink, AAAI, 2016
Yizhe Zhang, Changyou Chen, Ricardo Henao, Lawrence Carin, ECML, 2016
Yizhe Zhang, Zhe Gan, Lawrence Carin, Workshop on Adversarial Training, NIPS, 2016
Yizhe Zhang, Changyou Chen, Ricardo Henao, Lawrence Carin, ICDM, 2016
Yizhe Zhang, Ricardo Henao, Chunyuan Li, Lawrence Carin, IJCAI, 2016
2015
Yizhe Zhang, Yupeng He and Chaochun Wei, BMC Genomics, 2015
Yizhe Zhang, Ricardo Henao, Chunyuan Li, Lawrence Carin., Workshop on representation learning, NIPS, 2015
2012
Jiemeng Liu, Haifeng Wang, Hongxing Yang, Yizhe Zhang, Jinfeng Wang, Fangqing Zhao and Ji Qi, Nucleic Acids Research, 2012
Yupeng He, Yizhe Zhang, Guangyong Zheng and Chaochun Wei, BMC Genomics, 2012