Diffusion Forcing: Next-Token Prediction Meets Full-Sequence Diffusion



Keywords: diffusion, sequence modeling, decision making, planning

TL;DR: A novel way of training next-token prediction models to diffuse whole sequences at once enables long-horizon guidance, stable infinite rollout of continuous signals, and new planning & policy techniques.

Abstract: This paper presents Diffusion Forcing, a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens without fully diffusing past ones. Our approach is shown to combine the strengths of next-token prediction models, such as variable-length generation, with the strengths of full-sequence diffusion models, such as the ability to guide sampling to desirable trajectories. Our method offers a range of additional capabilities, such as (1) rolling out sequences of continuous tokens, such as video, with lengths past the training horizon, where baselines diverge, and (2) new sampling and guiding schemes that uniquely profit from Diffusion Forcing's variable-horizon and causal architecture, and which lead to marked performance gains in decision-making and planning tasks. In addition to its empirical success, our method is proven to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution.
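The training recipe the abstract describes is simple to sketch. Below is a minimal, illustrative PyTorch version of the per-token-noise objective, assuming a standard DDPM forward process with epsilon-prediction loss and a user-supplied causal denoiser `model` (e.g. an RNN or causal Transformer) that conditions on each token's noise level; the function and argument names are illustrative, not the authors' implementation.

```python
# Minimal sketch of a Diffusion Forcing-style training step (assumptions:
# DDPM forward process, epsilon-prediction loss; names are illustrative).
import torch

def diffusion_forcing_loss(model, x, alphas_cumprod):
    """x: clean token sequence, shape (batch, seq_len, dim).
    alphas_cumprod: (num_steps,) cumulative product of the DDPM schedule."""
    b, t, _ = x.shape
    num_steps = alphas_cumprod.shape[0]
    alphas_cumprod = alphas_cumprod.to(x.device)

    # Key idea: sample an *independent* noise level per token,
    # rather than one shared level for the whole sequence.
    k = torch.randint(0, num_steps, (b, t), device=x.device)
    a = alphas_cumprod[k].unsqueeze(-1)                # (b, t, 1)

    # Per-token forward diffusion: each token is corrupted to its own level.
    eps = torch.randn_like(x)
    x_noisy = a.sqrt() * x + (1.0 - a).sqrt() * eps

    # The causal denoiser sees the partially noised past and each token's
    # noise level, and must predict every token's noise.
    eps_pred = model(x_noisy, k)
    return torch.mean((eps_pred - eps) ** 2)
```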

Next-Token Prediction Meets Full-Sequence Diffusion

Diffusion Forcing combines the strengths of full-sequence diffusion models and next-token prediction models, acting as either, or a mix of both, at sampling time for different applications without retraining. Watch this video to learn more.
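One way to see how a single trained model can act as either kind of generator (or a mix) is through the schedule of per-token noise levels swept at sampling time. The sketch below constructs the two extreme schedules under the same assumptions as the training sketch above; it is a simplification for intuition rather than the paper's exact sampling scheme, and intermediate schedules (e.g. keeping later tokens noisier than earlier ones) interpolate between the two behaviors.

```python
# Illustrative per-token noise-level schedules for sampling (a simplification;
# a real sampler would run the denoiser once per row of the returned matrix).
import torch

def noise_schedule(seq_len, num_steps, mode="full_sequence"):
    """Returns an (iterations, seq_len) integer matrix; row m gives each
    token's target noise level at sampling iteration m (0 = fully clean)."""
    if mode == "full_sequence":
        # All tokens denoised in lockstep: full-sequence diffusion behavior.
        steps = torch.arange(num_steps - 1, -1, -1)
        return steps.unsqueeze(1).expand(num_steps, seq_len).clone()
    if mode == "next_token":
        # Token t is fully denoised before token t+1 starts: autoregressive
        # next-token behavior; future tokens wait at maximum noise.
        rows = []
        for t in range(seq_len):
            for s in range(num_steps - 1, -1, -1):
                row = torch.full((seq_len,), num_steps - 1)
                row[:t] = 0   # already-generated past tokens stay clean
                row[t] = s    # current token is progressively denoised
                rows.append(row)
        return torch.stack(rows)
    raise ValueError(f"unknown mode: {mode}")
```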

“With Diffusion Forcing, we are taking a step toward bringing video generation and robotics closer together,” says senior author Vincent Sitzmann, MIT assistant professor and member of CSAIL, where he leads the Scene Representation Group.





Transcript

00:00:01 (air whooshes) (bright music) - With Diffusion Forcing, the problem we tackled is the one of trying to learn how the world works and how to accomplish certain tasks, first by just watching how other people accomplish these tasks and then afterwards, you know, being immersed in the world yourself and interacting with it. The specific problem we're focusing on

00:00:20 in Diffusion Forcing is called sequence prediction. And the goal there is to basically be able to, given a set of observations, try to predict what would have to happen next in a sequence to get to a certain goal. And specifically, Diffusion Forcing
