Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.

Pages

Posts

portfolio

publications

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan, ACL (system demonstration), 2020
Abstract

We present DialoGPT, a conversational response generation model trained on 147 million Reddit exchanges spanning 2005 to 2017. The system extends the Hugging Face PyTorch transformer and attains performance close to human level in both automatic and human evaluation in single-turn dialogue settings. We demonstrate that our approach produces responses superior to baseline systems in relevance, content quality, and contextual consistency. We have made both the pretrained model and training pipeline publicly available to advance research in neural dialogue systems.
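
As a quick usage illustration (not taken from the paper), the released checkpoints can be loaded through the Hugging Face transformers library; a minimal sketch, assuming the microsoft/DialoGPT-medium model id and illustrative decoding settings:

```python
# Minimal sketch: single-turn response generation with a released DialoGPT
# checkpoint via Hugging Face transformers. Decoding settings are illustrative,
# not the paper's configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

# DialoGPT separates dialogue turns with the end-of-sequence token.
prompt = "Does money buy happiness?" + tokenizer.eos_token
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output_ids = model.generate(
    input_ids,
    max_length=128,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_p=0.9,
)
# Keep only the newly generated tokens (the model's reply).
response = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```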

Chunyuan Li, Xiang Gao, Yuan Li, Xiujun Li, Baolin Peng, Yizhe Zhang, Jianfeng Gao, EMNLP, 2020
Abstract

We introduce Optimus, a large-scale Variational Autoencoder (VAE) for natural language processing. The model serves as both a powerful generative model and an effective representation learning framework for natural language. Key contributions include a universal latent embedding space for sentences, pre-trained on large text corpora and fine-tunable for various tasks; guided language generation superior to GPT-2 through abstract-level control via latent vectors; improved generalization on low-resource tasks compared to BERT due to the smooth latent space structure; and state-of-the-art performance on VAE language-modeling benchmarks. We aim to revitalize interest in deep generative models in the era of large-scale pre-training and make these methods more practical for the NLP community.
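
For reference, the training objective underlying any such text VAE is the standard evidence lower bound shown below; this is the generic formulation rather than anything specific to Optimus, whose KL weighting and architecture choices are described in the paper:

```latex
\log p_\theta(x) \;\ge\;
\mathcal{L}(\theta,\phi;x) \;=\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```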

Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, Bill Dolan, EMNLP, 2020
Abstract

We introduce POINTER, a model designed for text generation under lexical constraints. The approach progressively inserts new tokens between existing tokens in parallel, applied recursively until completion. This generates a coarse-to-fine hierarchy that enhances interpretability. We pre-train on Wikipedia and achieve state-of-the-art results on constrained generation tasks. The non-autoregressive decoding strategy yields inference time that grows only logarithmically with sequence length, offering efficiency advantages over left-to-right decoding.
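
A toy sketch of the progressive-insertion idea follows, with a stub predictor standing in for the trained insertion model; the helper names and the example sentence are illustrative only, not the released implementation:

```python
# Toy illustration of insertion-based decoding: at each stage a predictor
# proposes, in parallel, at most one new token for every gap between existing
# tokens (or None for no insertion). Because the sequence can roughly double
# per stage, the number of stages grows logarithmically with length.

def insertion_decode(keywords, predict_gap_tokens, max_stages=10):
    """keywords: initial lexical constraints (kept in order);
    predict_gap_tokens(tokens) -> list of len(tokens) + 1 proposals,
    one per gap (a token to insert, or None for no insertion)."""
    tokens = list(keywords)
    for _ in range(max_stages):
        proposals = predict_gap_tokens(tokens)
        merged = []
        for gap, proposal in enumerate(proposals):
            if proposal is not None:
                merged.append(proposal)
            if gap < len(tokens):
                merged.append(tokens[gap])
        if merged == tokens:  # no gap received an insertion: done
            break
        tokens = merged
    return tokens

# Stub standing in for the trained insertion transformer.
def stub_predictor(tokens):
    canned = {
        3: ["the", "is", "of", None],
        6: [None, None, None, "a", None, None, None],
    }
    return canned.get(len(tokens), [None] * (len(tokens) + 1))

print(insertion_decode(["sun", "source", "energy"], stub_predictor))
# -> ['the', 'sun', 'is', 'a', 'source', 'of', 'energy']
```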

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, Weizhu Chen, arXiv preprint, 2022
Abstract

This paper explores applying GPT-3's few-shot learning to the SemEval 2021 MeasEval task, which involves identifying measurements and their associated attributes in scientific literature. Despite initial promise, GPT-3 underperformed our prior multi-turn question-answering approach. We identify several limitations: technically, limits on the size of the prompt and answer restrict the training signal available; more fundamentally, the generative model struggles with factual retention, and prompt modifications produce unpredictable results that hinder systematic performance improvement.
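
To make the few-shot setup concrete, here is a hypothetical prompt-construction sketch; the field names and example sentences are invented for illustration and are not drawn from the MeasEval data or the paper's prompts:

```python
# Illustrative few-shot prompt builder for a measurement-extraction query.
def build_prompt(examples, query_sentence):
    parts = ["Extract the quantity, unit, and measured entity from the sentence."]
    for ex in examples:
        parts.append(f"Sentence: {ex['sentence']}\nAnswer: {ex['answer']}")
    parts.append(f"Sentence: {query_sentence}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_prompt(
    [{"sentence": "The sample was heated to 450 K for two hours.",
      "answer": "quantity=450, unit=K, entity=sample temperature"}],
    "The reactor operated at a pressure of 2.5 MPa.",
)
print(prompt)
```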

Yizhe Zhang, Jiatao Gu, Zhuofeng Wu, Shuangfei Zhai, Joshua Susskind, Navdeep Jaitly, NeurIPS, 2023
Abstract

Autoregressive models for text sometimes generate repetitive and low-quality output because errors accumulate during generation. We propose PLANNER, which merges latent semantic diffusion with autoregressive generation to produce fluent, lengthy text while maintaining paragraph-level control. The approach combines a decoding module with a planning module that generates semantic embeddings progressively, demonstrating effectiveness on semantic generation, text completion, and summarization tasks.

Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, Navdeep Jaitly, ICLR, 2023
Abstract

We introduce an innovative framework for high-resolution image and video generation. Our approach is a diffusion process that denoises inputs at multiple resolutions jointly, utilizing a NestedUNet architecture in which the features and parameters for small-scale inputs are nested within those of larger scales. A key innovation is the progressive training schedule that moves from lower to higher resolutions, substantially improving optimization for high-resolution outputs. The method demonstrates capabilities across diverse applications including class-conditioned image generation, high-resolution text-to-image synthesis, and text-to-video tasks. Notably, the approach enables training a single pixel-space model at resolutions up to 1024x1024 pixels, achieving strong zero-shot generalization using the CC12M dataset, which contains only 12 million images.

Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Joshua M Susskind, ICML, 2023
Abstract

We investigate training instability in Transformers by analyzing attention layer dynamics. Our research found that low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We propose σReparam, a technique that reparametrizes linear layers using spectral normalization plus a learned scalar to prevent entropy collapse. We provide theoretical grounding by proving that attention entropy decreases exponentially with the spectral norm of attention logits. Experimental validation spans multiple domains—vision, machine translation, speech recognition, and language modeling—demonstrating that σReparam enables competitive performance while eliminating common training requirements like warmup, weight decay, layer normalization, and adaptive optimizers.
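
A minimal sketch of the reparametrization described above, assuming a standard PyTorch setup: the effective weight is the raw weight scaled by a learned scalar over its spectral norm, with the norm estimated by power iteration. Initialization and placement details follow the paper, not this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SigmaReparamLinear(nn.Module):
    """Linear layer whose effective weight is (gamma / sigma(W)) * W."""
    def __init__(self, in_features, out_features, n_power_iter=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) / in_features ** 0.5)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.gamma = nn.Parameter(torch.ones(()))  # learned scalar
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))
        self.n_power_iter = n_power_iter

    def forward(self, x):
        # Power iteration estimates the top singular value (spectral norm) of W.
        with torch.no_grad():
            u = self.u
            for _ in range(self.n_power_iter):
                v = F.normalize(self.weight.t() @ u, dim=0)
                u = F.normalize(self.weight @ v, dim=0)
            self.u.copy_(u)
        sigma = torch.dot(u, self.weight @ v)  # gradients flow through the weight only
        w_hat = (self.gamma / sigma) * self.weight
        return F.linear(x, w_hat, self.bias)
```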

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al., ICLR, 2024
Abstract

Recent advances in large language models (LLMs) have enabled AI systems to perform increasingly complex software engineering tasks. However, building generalist AI agents that can operate effectively across diverse real-world software development scenarios remains challenging. We present OpenHands, an open platform designed to enable the development and evaluation of AI agents that act as generalist software developers. OpenHands provides a unified framework for agents to interact with software development environments through standardized actions, observations, and sandboxed execution. The platform supports diverse agent architectures and enables systematic evaluation on comprehensive benchmarks spanning code generation, debugging, issue resolution, and repository understanding tasks. We demonstrate that agents built on OpenHands can effectively tackle real-world GitHub issues and compete with state-of-the-art proprietary systems while being fully open-source. Our platform enables the research community to collaboratively advance towards more capable and generalizable AI software developers.

Jiatao Gu, Yuyang Wang, Yizhe Zhang, Qihang Zhang, Dinghuai Zhang, Navdeep Jaitly, Josh Susskind, Shuangfei Zhai, ICLR, 2024
Abstract

We introduce DART, a transformer-based approach for text-to-image generation that unifies autoregressive (AR) modeling and diffusion within a non-Markovian framework, allowing it to iteratively denoise image patches using an architecture similar to standard language models. Unlike traditional diffusion models limited by their Markovian property, DART overcomes this constraint without requiring image quantization, enabling more effective image modeling. The model handles both text and image data in a single architecture through unified training. DART demonstrates competitive performance on class-conditioned and text-to-image tasks, providing an efficient alternative to traditional diffusion models and setting a new benchmark for scalable, high-quality image synthesis.

Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al., ICLR, 2024
Abstract

Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. We demonstrate how to convert existing autoregressive models (GPT-2 and LLaMA, ranging from 127M to 7B parameters) into diffusion models called DiffuGPT and DiffuLLaMA. Using fewer than 200B training tokens, we obtain models competitive with their autoregressive counterparts while enabling unique capabilities such as fill-in-the-middle generation without prompt reordering.

Yizhe Zhang, Jiarui Lu, Navdeep Jaitly, ACL, 2024
Abstract

Large language models (LLMs) have shown impressive capabilities in various NLP tasks, but their ability to perform multi-step reasoning and strategic planning in conversational settings remains unclear. We introduce the Entity-Deduction Arena, a benchmark designed to probe the conversational reasoning and planning capabilities of LLMs through an entity deduction game. In this game, an AI agent must identify a hidden entity by asking strategic yes/no questions, requiring the model to maintain context, reason about information gain, and plan question sequences effectively. Our comprehensive evaluation of state-of-the-art LLMs reveals that while they can handle short reasoning chains, they struggle significantly with long-horizon planning, often failing to ask informative questions or properly utilize previous answers. We analyze common failure modes including premature convergence, circular reasoning, and inefficient information gathering. Our findings highlight fundamental limitations in current LLMs' ability to perform sustained strategic reasoning in interactive settings and provide insights for developing more capable conversational AI systems.
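
Schematically, each evaluation episode is a question-answer loop like the sketch below; the guesser and judge interfaces are hypothetical stand-ins for the benchmark's model calls, not the released code:

```python
def play_game(guesser, judge, hidden_entity, max_turns=20):
    """guesser.ask(history) returns either a yes/no question or a final guess
    of the form "Final guess: <entity>"; judge.answer(entity, question)
    returns "Yes", "No", or "Maybe". Both are hypothetical interfaces."""
    history = []
    for turn in range(1, max_turns + 1):
        question = guesser.ask(history)
        if question.lower().startswith("final guess:"):
            guess = question.split(":", 1)[1].strip()
            return {"won": guess.lower() == hidden_entity.lower(), "turns": turn}
        # Record the question together with the judge's answer as dialogue context.
        history.append((question, judge.answer(hidden_entity, question)))
    return {"won": False, "turns": max_turns}
```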

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji, ICML, 2024
Abstract

Large language model (LLM) agents have demonstrated remarkable capabilities in automating complex tasks across diverse domains. However, most existing approaches rely on generating natural-language actions or on constrained, predefined action spaces, which can be ambiguous or inflexible for real-world applications. We propose CodeAct, a framework that enables LLM agents to express and execute actions as executable Python code. This approach offers several advantages: (1) Python's expressiveness allows agents to combine multiple primitive actions flexibly, (2) code execution provides deterministic and verifiable action outcomes, and (3) the structured nature of code facilitates better error handling and debugging. We evaluate CodeAct on diverse interactive tasks spanning web browsing, database querying, and embodied control. Our experiments show that CodeAct agents consistently outperform both natural-language and predefined-action baselines, achieving state-of-the-art results while being more sample-efficient and robust to distribution shifts.
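
A minimal sketch of a CodeAct-style interaction loop, with a hypothetical llm callable and a deliberately simplified executor; the actual framework runs code actions in a sandbox with richer tooling:

```python
import io, contextlib, traceback

def execute(code: str) -> str:
    """Run a code action and capture stdout, or the traceback on failure."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # illustration only; a real agent sandboxes execution
    except Exception:
        buffer.write(traceback.format_exc())
    return buffer.getvalue()

def codeact_loop(llm, task: str, max_turns: int = 10) -> str:
    """llm(messages) is a hypothetical callable returning either
    ("code", python_source) or ("answer", final_text)."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        kind, content = llm(messages)
        if kind == "answer":
            return content
        observation = execute(content)  # feed execution results back as the next observation
        messages += [{"role": "assistant", "content": content},
                     {"role": "user", "content": "Observation:\n" + observation}]
    return "Stopped: maximum number of turns reached."
```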

Jiatao Gu, Ying Shen, Shuangfei Zhai, Yizhe Zhang, Navdeep Jaitly, Joshua M. Susskind, NeurIPS, 2024
Abstract

We introduce Kaleido, a method for enhancing image generation diversity in conditional diffusion models. Diffusion models often exhibit limited diversity in the sampled images, particularly when sampling with a high classifier-free guidance weight. Our solution integrates an autoregressive language model that processes captions and generates intermediate latent representations—including textual descriptions, bounding boxes, object blobs, and visual tokens. These diverse latent variables serve as enriched conditioning signals for the diffusion process. Our experimental findings demonstrate that Kaleido successfully increases the variety of generated images while preserving quality and maintaining fidelity to the generated latent guidance signals, thereby enabling improved control over image generation outcomes.
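
For context, the classifier-free guidance referenced above takes its standard form, with guidance weight w; large w sharpens fidelity to the condition c but, as noted, tends to reduce sample diversity:

```latex
\hat{\epsilon}_\theta(x_t, c)
= \epsilon_\theta(x_t, \varnothing)
+ w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big)
```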

Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang, ICML, 2025
Abstract

We introduce SWE-Gym, a new training environment containing 2,438 real-world Python tasks, each with an executable codebase, unit tests, and natural language specifications. Fine-tuning language model-based software engineering agents on this dataset achieves up to 19% absolute gains in resolve rate on SWE-Bench Verified and Lite test sets. We further explore inference-time scaling via verifiers trained on agent trajectories, reaching state-of-the-art results for open-weight agents: 32.0% and 26.0% on the respective benchmarks. SWE-Gym, trained models, and agent trajectories are publicly available to support future research.

Ruixiang Zhang, Shuangfei Zhai, Jiatao Gu, Yizhe Zhang, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Josh Susskind, Navdeep Jaitly, NeurIPS, 2025
Abstract

The paper presents TarFlowLM, a framework that reimagines language modeling by operating in continuous latent space rather than discrete tokens. We propose using transformer-based autoregressive normalizing flows to model these continuous representations, enabling bidirectional context capture through alternating-direction transformations and block-wise generation with variable token patch sizes. The approach introduces mixture-based coupling transformations to handle complex dependencies within the latent space and establishes theoretical links to conventional discrete autoregressive models. Experimental results demonstrate competitive likelihood performance while showcasing the framework's flexible modeling capabilities.
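
As background for the likelihood claim above, normalizing flows rely on the generic change-of-variables identity, with f the invertible flow mapping the continuous representation x to the base variable z; this is the standard formulation, not a result specific to TarFlowLM:

```latex
\log p_X(x) = \log p_Z\big(f(x)\big)
+ \log \left| \det \frac{\partial f(x)}{\partial x} \right|
```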

Ruixiang Zhang, Shuangfei Zhai, Yizhe Zhang, James Thornton, Zijing Ou, Joshua Susskind, Navdeep Jaitly, ICML, 2025
Abstract

We present Target Concrete Score Matching (TCSM), a novel training objective for discrete diffusion models. TCSM provides a general framework with broad applicability, supporting pre-training discrete diffusion models directly from data samples. Many existing discrete diffusion approaches naturally emerge as special cases of our more general TCSM framework. The framework enables fine-tuning using reward functions, preference data, and knowledge distillation from autoregressive models by estimating the concrete score of the target distribution in the original clean data space.

Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang, arXiv preprint, 2025
Abstract

Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. We train a 7B model on 130B code tokens and propose coupled-GRPO, a novel reinforcement learning sampling scheme. Our approach achieves a +4.4% gain on EvalPlus and demonstrates how diffusion models can reduce reliance on autoregressive bias during code generation.

Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, Lianhui Qin, arXiv preprint, 2025
Abstract

Large Language Models (LLMs) demonstrate reasoning through chain-of-thought (CoT) generation. However, LLMs' autoregressive decoding may limit their ability to revisit and refine earlier tokens holistically, leading to inefficient exploration of diverse solutions. We propose LaDiR, which combines a Variational Autoencoder with a latent diffusion model to enable iterative refinement for reasoning. It encodes reasoning steps into thought tokens and uses blockwise bidirectional attention for parallel generation of diverse reasoning trajectories, showing improvements in accuracy, diversity, and interpretability across mathematical reasoning and planning benchmarks.

Jie He, Richard He Bai, Sinead Williamson, Jeff Z. Pan, Navdeep Jaitly, Yizhe Zhang, arXiv preprint, 2025
Abstract

Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and from retrieval and generation being optimized separately. We propose CLaRa, a framework that performs embedding-based compression and joint optimization in a shared continuous space. It introduces SCP, a data synthesis framework, and trains all components end-to-end via a language-modeling loss with a differentiable top-k estimator, achieving state-of-the-art compression and reranking performance on QA benchmarks.

Ruixiang Zhang, Shuangfei Zhai, Linh Tran, Yizhe Zhang, Tao Wang, Navdeep Jaitly, Joshua Susskind, arXiv preprint, 2025
Abstract

Standard discrete diffusion models treat all unobserved states identically by mapping them to an absorbing [MASK] token, creating an information void where semantic information that could be inferred from unmasked tokens is lost between denoising steps. We present CADD, which augments discrete state spaces with paired continuous latent space diffusion. This approach represents masked tokens as noisy yet informative vectors rather than collapsed states. At sampling time, the continuous latent can guide discrete denoising while enabling trade-offs between diverse outputs and contextually precise generation, demonstrating improvements over mask-based diffusion across text, image synthesis, and code modeling tasks.

Shansan Gong, Zijing Ou, Yizhe Zhang, Navdeep Jaitly, Mukai Li, arXiv preprint, 2025
Abstract

Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. We introduce FS-DFM (Few-Step Discrete Flow-Matching), a diffusion language model designed to accelerate text generation. The model maintains consistent quality across different sampling step budgets with a stable update rule using teacher guidance from long-run trajectories. It achieves perplexity parity with 1,024-step baselines using only 8 steps and delivers up to 128× faster sampling for 1,024-token generation.

talks

teaching