TiDAR: Think in Diffusion, Talk in Autoregression

* Equal Contribution, XD as the Project Lead

NVIDIA

1 Affiliated with the University of Chicago. Work done during Jingyu's internship at NVIDIA.

2 Affiliated with the Georgia Institute of Technology. Work done during Zhifan's internship at NVIDIA.

(SGLang inference code will be released soon)

TiDAR's Parallel Diffusion Drafting and Autoregressive Sampling
(We illustrate with draft length = 3; TiDAR 1.5B and 8B use draft length = 16 in practice.)

What is the project about?

  • Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality. But why choose? TiDAR gives you the best of both worlds in one model with zero overhead, by putting the GPU's otherwise idle compute (relative to its memory bandwidth) to work.
  • We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) in AutoRegression - all within a single forward pass using a structured attention mask.
  • We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at 1.5B and 8B scales. Thanks to the parallel drafting and sampling as well as exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and Llada in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.

The Secret Sauce: Free Token Slots

Figure 1: Free Token Slots

Modern GPUs are fully utilized only when compute density and memory IO are balanced. Increasing the batch size raises compute density by reusing the model weights, but it also requires loading more KV cache per sample and more aggressive model sharding. It turns out that passing a modest number of extra tokens through the forward pass results in similar latency. In TiDAR, we use these slots to append mask tokens for diffusion pre-drafting, along with the tokens from the last step to be verified for autoregressive sampling. When the draft length is chosen wisely, computing these extra tokens in parallel introduces no extra latency, and we therefore call them "free token slots".
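
To make the free-slot effect concrete, here is a minimal, hypothetical microbenchmark in PyTorch; the model handle, vocabulary size, and slot counts are illustrative placeholders, not TiDAR's actual configuration. On a memory-bound decode step, latency should stay nearly flat as extra slots are appended, until the step becomes compute-bound.

    import time

    import torch

    @torch.inference_mode()
    def step_latency(model, n_extra: int, vocab_size: int = 32000, iters: int = 20) -> float:
        # One newly sampled token plus `n_extra` appended slots
        # (draft mask tokens + tokens awaiting verification).
        ids = torch.randint(vocab_size, (1, 1 + n_extra), device="cuda")
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(ids)  # KV cache handling for the prefix omitted for brevity
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

    # Sweep the number of extra slots; `model` is any causal-LM forward callable.
    # for k in (0, 4, 8, 16, 32, 64):
    #     print(f"{k:3d} extra slots: {step_latency(model, k) * 1e3:.2f} ms")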

Sequence-level Hybrid Attention

Figure 2: Sequence-level Hybrid Attention Patterns

We perform parallel diffusion drafting and autoregressive sampling in a single forward pass using a structured attention mask: the prefix tokens and the tokens from the last step awaiting verification are attended causally, while the draft tokens attend bidirectionally. A minimal sketch of such a mask is below.
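
The sketch below builds this mask in PyTorch under a simplifying assumption: the sequence is laid out as [prefix + tokens awaiting verification | draft (mask) tokens], and we show only one draft block (the paper's full mask also drafts from every prefix position).

    import torch

    def tidar_mask(n_causal: int, n_draft: int) -> torch.Tensor:
        """Boolean attention mask (True = may attend) for one forward pass.

        Simplified sketch: [causal region | one block of draft tokens].
        """
        n = n_causal + n_draft
        mask = torch.zeros(n, n, dtype=torch.bool)
        # Prefix and to-be-verified tokens: standard causal attention.
        mask[:n_causal, :n_causal] = torch.tril(torch.ones(n_causal, n_causal)).bool()
        # Draft tokens see the whole causal region...
        mask[n_causal:, :n_causal] = True
        # ...and attend to one another bidirectionally.
        mask[n_causal:, n_causal:] = True
        return mask

    # Example: 5 causal positions followed by 3 draft slots.
    print(tidar_mask(5, 3).int())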

Diffusion Self-Speculation with Autoregressive Verification

Figure 3: Comparing TiDAR with other frameworks.

With the help of this customized structured mask, a single forward pass both verifies the proposals from the last step and pre-drafts new proposals starting at every prefix position. As in speculative decoding, at least one token is guaranteed to be accepted per forward pass, but TiDAR accepts many more tokens thanks to its superior drafting ability (a verification sketch follows). With the right draft length, we can fully exploit the free token slots and achieve large decoding speedups from an exceptionally high acceptance rate.
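
Here is a greedy-verification sketch of the accept rule; TiDAR's exact sampler may differ, and the names and shapes are illustrative. The key property is that the longest matching prefix of the draft is accepted, plus one token from the AR head, so every forward pass commits at least one token.

    import torch

    def greedy_verify(ar_logits: torch.Tensor, drafts: torch.Tensor) -> torch.Tensor:
        """ar_logits: [k + 1, vocab] autoregressive logits from the same forward
        pass; row i is conditioned on the prefix plus drafts[:i], so the last
        row yields a "bonus" token beyond the drafts.
        drafts: [k] tokens proposed by the diffusion drafter last step.
        """
        ar_tokens = ar_logits.argmax(dim=-1)  # AR model's choice at each position
        accepted = []
        for i in range(drafts.shape[0]):
            if drafts[i] == ar_tokens[i]:
                accepted.append(drafts[i])     # draft confirmed; keep going
            else:
                accepted.append(ar_tokens[i])  # first mismatch: take the AR token, stop
                break
        else:
            accepted.append(ar_tokens[-1])     # every draft accepted: append bonus token
        return torch.stack(accepted)           # always >= 1 token per forward pass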

Why is TiDAR special?

Figure 4: TiDAR 8B decoding speed on SGLang using a single H100. TiDAR achieves up to 5.8x higher decoding T/s than AR and 2.5x+ higher than EAGLE-v3.
  • TiDAR achieves the best inference latency at small batch sizes, beating SOTA speculative decoding methods such as EAGLE-v3.
  • TiDAR offers significantly better quality-efficiency trade-offs than open-source DLMs such as Dream and Llada.
  • TiDAR is a standalone model that is easy to train and deploy with minimal hyperparameters to tune and low serving overhead.
  • Training TiDAR is straightforward and data-efficient: the dual objective of diffusion (mask-token prediction) and autoregression (next-token prediction) yields richer training signals than either traditional AR or diffusion models alone (see the loss sketch after this list).
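
To make the dual objective concrete, here is a minimal loss sketch, assuming a single forward pass returns next-token logits from the causal part and mask-token logits from the draft part for the same targets; `diffusion_weight` is a hypothetical knob, not a documented TiDAR hyperparameter.

    import torch
    import torch.nn.functional as F

    def dual_objective(causal_logits: torch.Tensor,
                       draft_logits: torch.Tensor,
                       targets: torch.Tensor,
                       diffusion_weight: float = 1.0) -> torch.Tensor:
        """causal_logits, draft_logits: [batch, seq, vocab]; targets: [batch, seq]."""
        ntp = F.cross_entropy(causal_logits.flatten(0, 1), targets.flatten())  # AR: next-token prediction
        mtp = F.cross_entropy(draft_logits.flatten(0, 1), targets.flatten())   # diffusion: mask-token prediction
        return ntp + diffusion_weight * mtp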

Main Results

Quality-Efficiency Trade-offs


TiDAR achieves impressive speedups over traditional AR models and stays competitive with the SOTA speculative decoding method EAGLE-v3, delivering a wall-clock throughput boost of 4.71x to 5.91x tokens per second with our PyTorch + Flex Attention implementation.

Likelihood & Generative Performance


(Tables: Generative and Likelihood benchmark results.)

We compare TiDAR with AR models (Llama, SmolLM, Qwen) and diffusion models (Dream, Llada, Block Diffusion) on downstream tasks. With only continual pretraining, TiDAR not only outperforms diffusion models (achieving the best quality at 1 token per NFE) but also matches, or significantly closes the gap to, AR models.

Ablation Studies

In our paper, we also conducted detailed ablation studies to better understand the design choices:
  • Pareto Frontier under the same training recipe.
  • Comparing TiDAR's parallel decoding with other popular diffusion denoising strategies (such as threshold-based and left-to-right).
  • TiDAR gives the flexibility to verify with either AR or diffusion (or a mixture of both) predictions, and we showcase their effects on downstream performance.
  • A critical advantage of TiDAR is its use of one-step diffusion for efficient drafting. To support this, we apply a simple full-masking strategy that keeps training consistent with test-time behavior and increases loss signal density.

BibTeX


        @misc{liu2025tidarthinkdiffusiontalk,
          title={TiDAR: Think in Diffusion, Talk in Autoregression}, 
          author={Jingyu Liu and Xin Dong and Zhifan Ye and Rishabh Mehta and Yonggan Fu and Vartika Singh and Jan Kautz and Ce Zhang and Pavlo Molchanov},
          year={2025},
          eprint={2511.08923},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          url={https://arxiv.org/abs/2511.08923}, 
        }