Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.

Pages

Posts

portfolio

publications

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan, ACL (system demonstration), 2020
Abstract

We present DialoGPT, a conversational response generation model trained on 147 million Reddit exchanges spanning 2005 to 2017. The system extends the Hugging Face PyTorch transformer and attains performance close to human level in both automatic and human evaluation in single-turn dialogue settings. We demonstrate that our approach produces responses superior to baseline systems in relevance, content quality, and contextual consistency. We have made both the pretrained model and training pipeline publicly available to advance research in neural dialogue systems.
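
As a quick usage illustration (not taken from the paper), the released checkpoints can be loaded through the Hugging Face transformers library; a minimal sketch, assuming the microsoft/DialoGPT-medium model id and illustrative decoding settings:

```python
# Minimal sketch: single-turn response generation with a released DialoGPT
# checkpoint via Hugging Face transformers. Decoding settings are illustrative,
# not the paper's configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

# DialoGPT separates dialogue turns with the end-of-sequence token.
prompt = "Does money buy happiness?" + tokenizer.eos_token
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output_ids = model.generate(
    input_ids,
    max_length=128,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_p=0.9,
)
# Keep only the newly generated tokens (the model's reply).
response = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```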

Chunyuan Li, Xiang Gao, Yuan Li, Xiujun Li, Baolin Peng, Yizhe Zhang, Jianfeng Gao, EMNLP, 2020
Abstract

We introduce Optimus, a large-scale Variational Autoencoder (VAE) for natural language processing. The model serves as both a powerful generative model and an effective representation learning framework for natural language. Key contributions include a universal latent embedding space for sentences, pre-trained on large text corpora and fine-tunable for various tasks; guided language generation superior to GPT-2 through abstract-level control via latent vectors; improved generalization on low-resource tasks compared to BERT due to the smooth latent space structure; and state-of-the-art performance on VAE language-modeling benchmarks. We aim to revitalize interest in deep generative models in the era of large-scale pre-training and make these methods more practical for the NLP community.
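
For reference, the training objective underlying any such text VAE is the standard evidence lower bound shown below; this is the generic formulation rather than anything specific to Optimus, whose KL weighting and architecture choices are described in the paper:

```latex
\log p_\theta(x) \;\ge\;
\mathcal{L}(\theta,\phi;x) \;=\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```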

Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, Bill Dolan, EMNLP, 2020
Abstract

We introduce POINTER, a model designed for text generation under lexical constraints. The approach progressively inserts new tokens between existing tokens in parallel, applied recursively until completion. This generates a coarse-to-fine hierarchy that enhances interpretability. We pre-train on Wikipedia and achieve state-of-the-art results on constrained generation tasks. The non-autoregressive decoding strategy yields inference time that grows only logarithmically with sequence length, offering efficiency advantages over left-to-right decoding.
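
A toy sketch of the progressive-insertion idea follows, with a stub predictor standing in for the trained insertion model; the helper names and the example sentence are illustrative only, not the released implementation:

```python
# Toy illustration of insertion-based decoding: at each stage a predictor
# proposes, in parallel, at most one new token for every gap between existing
# tokens (or None for no insertion). Because the sequence can roughly double
# per stage, the number of stages grows logarithmically with length.

def insertion_decode(keywords, predict_gap_tokens, max_stages=10):
    """keywords: initial lexical constraints (kept in order);
    predict_gap_tokens(tokens) -> list of len(tokens) + 1 proposals,
    one per gap (a token to insert, or None for no insertion)."""
    tokens = list(keywords)
    for _ in range(max_stages):
        proposals = predict_gap_tokens(tokens)
        merged = []
        for gap, proposal in enumerate(proposals):
            if proposal is not None:
                merged.append(proposal)
            if gap < len(tokens):
                merged.append(tokens[gap])
        if merged == tokens:  # no gap received an insertion: done
            break
        tokens = merged
    return tokens

# Stub standing in for the trained insertion transformer.
def stub_predictor(tokens):
    canned = {
        3: ["the", "is", "of", None],
        6: [None, None, None, "a", None, None, None],
    }
    return canned.get(len(tokens), [None] * (len(tokens) + 1))

print(insertion_decode(["sun", "source", "energy"], stub_predictor))
# -> ['the', 'sun', 'is', 'a', 'source', 'of', 'energy']
```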

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, Weizhu Chen, arXiv preprint, 2022
Abstract

This paper explores applying GPT-3's few-shot learning to the SemEval 2021 MeasEval task, which involves identifying measurements and their associated attributes in scientific literature. Despite initial promise, GPT-3 underperformed our prior multi-turn question-answering approach. We identify several limitations: technically, limits on the size of the prompt and answer restrict the training signal available; more fundamentally, the generative model struggles with factual retention, and prompt modifications produce unpredictable results that hinder systematic performance improvement.
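
To make the few-shot setup concrete, here is a hypothetical prompt-construction sketch; the field names and example sentences are invented for illustration and are not drawn from the MeasEval data or the paper's prompts:

```python
# Illustrative few-shot prompt builder for a measurement-extraction query.
def build_prompt(examples, query_sentence):
    parts = ["Extract the quantity, unit, and measured entity from the sentence."]
    for ex in examples:
        parts.append(f"Sentence: {ex['sentence']}\nAnswer: {ex['answer']}")
    parts.append(f"Sentence: {query_sentence}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_prompt(
    [{"sentence": "The sample was heated to 450 K for two hours.",
      "answer": "quantity=450, unit=K, entity=sample temperature"}],
    "The reactor operated at a pressure of 2.5 MPa.",
)
print(prompt)
```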

Yizhe Zhang, Jiatao Gu, Zhuofeng Wu, Shuangfei Zhai, Joshua Susskind, Navdeep Jaitly, NeurIPS, 2023
Abstract

Autoregressive models for text sometimes generate repetitive and low-quality output because errors accumulate during generation. We propose PLANNER, which merges latent semantic diffusion with autoregressive generation to produce fluent, lengthy text while maintaining paragraph-level control. The approach combines a decoding module with a planning module that generates semantic embeddings progressively, demonstrating effectiveness on semantic generation, text completion, and summarization tasks.

Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, Navdeep Jaitly, ICLR, 2023
Abstract

We introduce an innovative framework for high-resolution image and video generation. Our approach is a diffusion process that denoises inputs at multiple resolutions jointly, utilizing a NestedUNet architecture in which the features and parameters for small-scale inputs are nested within those of larger scales. A key innovation is the progressive training schedule that moves from lower to higher resolutions, substantially improving optimization for high-resolution outputs. The method demonstrates capabilities across diverse applications including class-conditioned image generation, high-resolution text-to-image synthesis, and text-to-video tasks. Notably, the approach enables training a single pixel-space model at resolutions up to 1024x1024 pixels, achieving strong zero-shot generalization using the CC12M dataset, which contains only 12 million images.

Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Joshua M Susskind, ICML, 2023
Abstract

We investigate training instability in Transformers by analyzing attention layer dynamics. Our research found that low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We propose σReparam, a technique that reparametrizes linear layers using spectral normalization plus a learned scalar to prevent entropy collapse. We provide theoretical grounding by proving that attention entropy decreases exponentially with the spectral norm of attention logits. Experimental validation spans multiple domains—vision, machine translation, speech recognition, and language modeling—demonstrating that σReparam enables competitive performance while eliminating common training requirements like warmup, weight decay, layer normalization, and adaptive optimizers.
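
A minimal sketch of the reparametrization described above, assuming a standard PyTorch setup: the effective weight is the raw weight scaled by a learned scalar over its spectral norm, with the norm estimated by power iteration. Initialization and placement details follow the paper, not this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SigmaReparamLinear(nn.Module):
    """Linear layer whose effective weight is (gamma / sigma(W)) * W."""
    def __init__(self, in_features, out_features, n_power_iter=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) / in_features ** 0.5)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.gamma = nn.Parameter(torch.ones(()))  # learned scalar
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))
        self.n_power_iter = n_power_iter

    def forward(self, x):
        # Power iteration estimates the top singular value (spectral norm) of W.
        with torch.no_grad():
            u = self.u
            for _ in range(self.n_power_iter):
                v = F.normalize(self.weight.t() @ u, dim=0)
                u = F.normalize(self.weight @ v, dim=0)
            self.u.copy_(u)
        sigma = torch.dot(u, self.weight @ v)  # gradients flow through the weight only
        w_hat = (self.gamma / sigma) * self.weight
        return F.linear(x, w_hat, self.bias)
```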

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al., ICLR, 2024
Abstract

Recent advances in large language models (LLMs) have enabled AI systems to perform increasingly complex software engineering tasks. However, building generalist AI agents that can operate effectively across diverse real-world software development scenarios remains challenging. We present OpenHands, an open platform designed to enable the development and evaluation of AI agents that act as generalist software developers. OpenHands provides a unified framework for agents to interact with software development environments through standardized actions, observations, and sandboxed execution. The platform supports diverse agent architectures and enables systematic evaluation on comprehensive benchmarks spanning code generation, debugging, issue resolution, and repository understanding tasks. We demonstrate that agents built on OpenHands can effectively tackle real-world GitHub issues and compete with state-of-the-art proprietary systems while being fully open-source. Our platform enables the research community to collaboratively advance towards more capable and generalizable AI software developers.

Jiatao Gu, Yuyang Wang, Yizhe Zhang, Qihang Zhang, Dinghuai Zhang, Navdeep Jaitly, Josh Susskind, Shuangfei Zhai, ICLR, 2024
Abstract

We introduce DART, a transformer-based approach for text-to-image generation that unifies autoregressive (AR) modeling and diffusion within a non-Markovian framework, allowing it to iteratively denoise image patches using an architecture similar to standard language models. Unlike traditional diffusion models limited by their Markovian property, DART overcomes this constraint without requiring image quantization, enabling more effective image modeling. The model handles both text and image data in a single architecture through unified training. DART demonstrates competitive performance on class-conditioned and text-to-image tasks, providing an efficient alternative to traditional diffusion models and setting a new benchmark for scalable, high-quality image synthesis.

Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al., ICLR, 2024
Abstract

Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. We demonstrate how to convert existing autoregressive models (GPT-2 and LLaMA, ranging from 127M to 7B parameters) into diffusion models called DiffuGPT and DiffuLLaMA. Using fewer than 200B training tokens, we obtain models competitive with their autoregressive counterparts while enabling unique capabilities such as fill-in-the-middle generation without prompt reordering.

Yizhe Zhang, Jiarui Lu, Navdeep Jaitly, ACL, 2024
Abstract

Large language models (LLMs) have shown impressive capabilities in various NLP tasks, but their ability to perform multi-step reasoning and strategic planning in conversational settings remains unclear. We introduce the Entity-Deduction Arena, a benchmark designed to probe the conversational reasoning and planning capabilities of LLMs through an entity deduction game. In this game, an AI agent must identify a hidden entity by asking strategic yes/no questions, requiring the model to maintain context, reason about information gain, and plan question sequences effectively. Our comprehensive evaluation of state-of-the-art LLMs reveals that while they can handle short reasoning chains, they struggle significantly with long-horizon planning, often failing to ask informative questions or properly utilize previous answers. We analyze common failure modes including premature convergence, circular reasoning, and inefficient information gathering. Our findings highlight fundamental limitations in current LLMs' ability to perform sustained strategic reasoning in interactive settings and provide insights for developing more capable conversational AI systems.
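
Schematically, each evaluation episode is a question-answer loop like the sketch below; the guesser and judge interfaces are hypothetical stand-ins for the benchmark's model calls, not the released code:

```python
def play_game(guesser, judge, hidden_entity, max_turns=20):
    """guesser.ask(history) returns either a yes/no question or a final guess
    of the form "Final guess: <entity>"; judge.answer(entity, question)
    returns "Yes", "No", or "Maybe". Both are hypothetical interfaces."""
    history = []
    for turn in range(1, max_turns + 1):
        question = guesser.ask(history)
        if question.lower().startswith("final guess:"):
            guess = question.split(":", 1)[1].strip()
            return {"won": guess.lower() == hidden_entity.lower(), "turns": turn}
        # Record the question together with the judge's answer as dialogue context.
        history.append((question, judge.answer(hidden_entity, question)))
    return {"won": False, "turns": max_turns}
```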

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji, ICML, 2024
Abstract

Large language model (LLM) agents have demonstrated remarkable capabilities in automating complex tasks across diverse domains. However, most existing approaches rely on generating natural-language actions or on constrained, predefined action spaces, which can be ambiguous or inflexible for real-world applications. We propose CodeAct, a framework that enables LLM agents to express and execute actions as executable Python code. This approach offers several advantages: (1) Python's expressiveness allows agents to combine multiple primitive actions flexibly, (2) code execution provides deterministic and verifiable action outcomes, and (3) the structured nature of code facilitates better error handling and debugging. We evaluate CodeAct on diverse interactive tasks spanning web browsing, database querying, and embodied control. Our experiments show that CodeAct agents consistently outperform both natural-language and predefined-action baselines, achieving state-of-the-art results while being more sample-efficient and robust to distribution shifts.
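
A minimal sketch of a CodeAct-style interaction loop, with a hypothetical llm callable and a deliberately simplified executor; the actual framework runs code actions in a sandbox with richer tooling:

```python
import io, contextlib, traceback

def execute(code: str) -> str:
    """Run a code action and capture stdout, or the traceback on failure."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # illustration only; a real agent sandboxes execution
    except Exception:
        buffer.write(traceback.format_exc())
    return buffer.getvalue()

def codeact_loop(llm, task: str, max_turns: int = 10) -> str:
    """llm(messages) is a hypothetical callable returning either
    ("code", python_source) or ("answer", final_text)."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        kind, content = llm(messages)
        if kind == "answer":
            return content
        observation = execute(content)  # feed execution results back as the next observation
        messages += [{"role": "assistant", "content": content},
                     {"role": "user", "content": "Observation:\n" + observation}]
    return "Stopped: maximum number of turns reached."
```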

Jiatao Gu, Ying Shen, Shuangfei Zhai, Yizhe Zhang, Navdeep Jaitly, Joshua M. Susskind, NeurIPS, 2024
Abstract

We introduce Kaleido, a method for enhancing image generation diversity in conditional diffusion models. Diffusion models often exhibit limited diversity in the sampled images, particularly when sampling with a high classifier-free guidance weight. Our solution integrates an autoregressive language model that processes captions and generates intermediate latent representations—including textual descriptions, bounding boxes, object blobs, and visual tokens. These diverse latent variables serve as enriched conditioning signals for the diffusion process. Our experimental findings demonstrate that Kaleido successfully increases the variety of generated images while preserving quality and maintaining fidelity to the generated latent guidance signals, thereby enabling improved control over image generation outcomes.
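
For context, the classifier-free guidance referenced above takes its standard form, with guidance weight w; large w sharpens fidelity to the condition c but, as noted, tends to reduce sample diversity:

```latex
\hat{\epsilon}_\theta(x_t, c)
= \epsilon_\theta(x_t, \varnothing)
+ w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big)
```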

Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang, ICML, 2025
Abstract

We introduce SWE-Gym, a new training environment containing 2,438 real-world Python tasks, each with an executable codebase, unit tests, and natural language specifications. Fine-tuning language model-based software engineering agents on this dataset achieves up to 19% absolute gains in resolve rate on SWE-Bench Verified and Lite test sets. We further explore inference-time scaling via verifiers trained on agent trajectories, reaching state-of-the-art results for open-weight agents: 32.0% and 26.0% on the respective benchmarks. SWE-Gym, trained models, and agent trajectories are publicly available to support future research.

Ruixiang Zhang, Shuangfei Zhai, Jiatao Gu, Yizhe Zhang, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Josh Susskind, Navdeep Jaitly, NeurIPS, 2025
Abstract

The paper presents TarFlowLM, a framework that reimagines language modeling by operating in continuous latent space rather than discrete tokens. We propose using transformer-based autoregressive normalizing flows to model these continuous representations, enabling bidirectional context capture through alternating-direction transformations and block-wise generation with variable token patch sizes. The approach introduces mixture-based coupling transformations to handle complex dependencies within the latent space and establishes theoretical links to conventional discrete autoregressive models. Experimental results demonstrate competitive likelihood performance while showcasing the framework's flexible modeling capabilities.
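
As background for the likelihood claim above, normalizing flows rely on the generic change-of-variables identity, with f the invertible flow mapping the continuous representation x to the base variable z; this is the standard formulation, not a result specific to TarFlowLM:

```latex
\log p_X(x) = \log p_Z\big(f(x)\big)
+ \log \left| \det \frac{\partial f(x)}{\partial x} \right|
```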

Ruixiang Zhang, Shuangfei Zhai, Yizhe Zhang, James Thornton, Zijing Ou, Joshua Susskind, Navdeep Jaitly, ICML, 2025
Abstract

We present Target Concrete Score Matching (TCSM), a novel training objective for discrete diffusion models. TCSM provides a general framework with broad applicability, supporting pre-training discrete diffusion models directly from data samples. Many existing discrete diffusion approaches naturally emerge as special cases of our more general TCSM framework. The framework enables fine-tuning using reward functions, preference data, and knowledge distillation from autoregressive models by estimating the concrete score of the target distribution in the original clean data space.

Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang, arXiv preprint, 2025
Abstract

Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. We train a 7B model on 130B code tokens and propose coupled-GRPO, a novel reinforcement learning sampling scheme. Our approach achieves a +4.4% gain on EvalPlus and demonstrates how diffusion models can reduce reliance on autoregressive bias during code generation.

Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, Lianhui Qin, arXiv preprint, 2025
Abstract

Large Language Models (LLMs) demonstrate reasoning through chain-of-thought (CoT) generation. However, LLMs' autoregressive decoding may limit their ability to revisit and refine earlier tokens holistically, leading to inefficient exploration of diverse solutions. We propose LaDiR, which combines a Variational Autoencoder with a latent diffusion model to enable iterative refinement for reasoning. It encodes reasoning steps into thought tokens and uses blockwise bidirectional attention for parallel generation of diverse reasoning trajectories, showing improvements in accuracy, diversity, and interpretability across mathematical reasoning and planning benchmarks.

Jie He, Richard He Bai, Sinead Williamson, Jeff Z. Pan, Navdeep Jaitly, Yizhe Zhang, arXiv preprint, 2025
Abstract

Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and from retrieval and generation being optimized separately. We propose CLaRa, a framework that performs embedding-based compression and joint optimization in a shared continuous space. It introduces SCP, a data synthesis framework, and trains all components end-to-end via a language-modeling loss with a differentiable top-k estimator, achieving state-of-the-art compression and reranking performance on QA benchmarks.

Ruixiang Zhang, Shuangfei Zhai, Linh Tran, Yizhe Zhang, Tao Wang, Navdeep Jaitly, Joshua Susskind, arXiv preprint, 2025
Abstract

Standard discrete diffusion models treat all unobserved states identically by mapping them to an absorbing [MASK] token, creating an information void where semantic information that could be inferred from unmasked tokens is lost between denoising steps. We present CADD, which augments discrete state spaces with paired continuous latent space diffusion. This approach represents masked tokens as noisy yet informative vectors rather than collapsed states. At sampling time, the continuous latent can guide discrete denoising while enabling trade-offs between diverse outputs and contextually precise generation, demonstrating improvements over mask-based diffusion across text, image synthesis, and code modeling tasks.

Shansan Gong, Zijing Ou, Yizhe Zhang, Navdeep Jaitly, Mukai Li, arXiv preprint, 2025
Abstract

Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. We introduce FS-DFM (Few-Step Discrete Flow-Matching), a diffusion language model designed to accelerate text generation. The model maintains consistent quality across different sampling step budgets with a stable update rule using teacher guidance from long-run trajectories. It achieves perplexity parity with 1,024-step baselines using only 8 steps and delivers up to 128× faster sampling for 1,024-token generation.

talks

teaching