Diffusion-MPC in Discrete Domains: Feasibility Constraints, Horizon Effects, and Critic Alignment: A Tetris Case Study

Researchers applied Diffusion-based Model Predictive Control (Diffusion-MPC) to Tetris, demonstrating that feasibility-constrained sampling is essential for discrete combinatorial domains. The study found that enforcing game-legal placements during sampling improved game scores by 6.8% and survival time by 5.6%, while revealing systematic misalignment between pretrained DQN critics and actual rollout quality. Contrary to assumptions, shorter planning horizons outperformed longer ones in Tetris's sparse reward environment.

Diffusion-MPC for Tetris: A Case Study in Discrete Combinatorial Control

Researchers have introduced a novel application of diffusion models for planning in complex, discrete environments, using the classic game of Tetris as a benchmark. The study, detailed in the paper "Diffusion-MPC for Discrete Domains: A Tetris Case Study" (arXiv:2603.02348v1), investigates Diffusion-based Model Predictive Control (Diffusion-MPC), a method that samples potential action sequences with a discrete denoiser and selects the best via reranking. This work provides critical insights into the structural challenges and practical requirements for deploying such planners in domains with combinatorial action spaces, where traditional continuous-space assumptions break down.

Core Methodology: Sampling and Reranking Actions

The planner employs a MaskGIT-style discrete denoiser to generate candidate sequences of Tetris piece placements. A key innovation is the use of feasibility-constrained sampling, where logit masking is applied to ensure only valid, game-legal placements are considered during the denoising process. After sampling multiple candidate action sequences, the system uses a reranking step to select the optimal action to execute. The study rigorously analyzes three distinct reranking strategies: a simple heuristic score, a pretrained Deep Q-Network (DQN) critic, and a hybrid approach combining both.
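
The feasibility-constrained sampling step can be illustrated with a minimal sketch of logit masking: infeasible placements have their logits forced to negative infinity so that, after the softmax, they carry zero probability mass. The function and variable names here (`masked_sample`, `logits`, `feasible`) are hypothetical stand-ins, not the paper's actual API.

```python
import numpy as np

def masked_sample(logits, feasible, rng):
    """Sample an action index from logits, restricted to feasible actions.

    Hypothetical interface: `logits` is a 1-D array of unnormalized scores
    over all placements; `feasible` is a boolean mask marking the game-legal
    placements for the current board and piece.
    """
    masked = np.where(feasible, logits, -np.inf)  # logit masking
    probs = np.exp(masked - masked.max())         # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 1.0, 3.0])
feasible = np.array([True, False, True, False])  # only placements 0 and 2 are legal
a = masked_sample(logits, feasible, rng)
assert feasible[a]  # the sampled placement is always game-legal
```

In a MaskGIT-style denoiser the same masking would be applied at every denoising iteration, so no intermediate or final sample can contain an illegal placement.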

Critical Findings on Feasibility, Critics, and Horizon

The analysis reveals that feasibility masking is not just beneficial but necessary for effective planning in discrete domains: unconstrained sampling produced invalid actions 46% of the time. By enforcing feasibility, the system achieved a 6.8% improvement in game score and a 5.6% increase in survival time. The evaluation of reranking strategies uncovered a major pitfall: the pretrained DQN critic was systematically misaligned with actual rollout quality, leading to high decision regret (mean 17.6, 90th percentile 36.6). This indicates that an off-the-shelf value function may not be a reliable reward proxy for the diffusion planner's imagined trajectories.
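
Decision regret here measures how much return is lost by trusting the critic's ranking: the true return of the best available candidate minus the true return of the candidate the critic actually picked. A small illustrative sketch, with hypothetical per-candidate lists (`true_returns`, `critic_scores`) rather than the paper's data:

```python
def decision_regret(true_returns, critic_scores):
    """Regret of executing the critic's top-ranked candidate instead of
    the candidate with the best actual return.

    true_returns: actual return observed for each candidate plan.
    critic_scores: the critic's predicted value used for reranking.
    """
    chosen = max(range(len(critic_scores)), key=critic_scores.__getitem__)
    return max(true_returns) - true_returns[chosen]

# A misaligned critic ranks a poor plan highest:
print(decision_regret([10.0, 40.0, 25.0], [0.9, 0.1, 0.5]))  # → 30.0
```

A perfectly aligned critic yields zero regret on every decision; the paper's reported mean of 17.6 indicates the DQN's ranking frequently diverged from true rollout quality.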

Furthermore, the research challenged the assumption that longer planning horizons are always better. In Tetris's environment of sparse and delayed rewards, shorter planning horizons (H) outperformed longer ones. This suggests that uncertainty compounds in long imagined rollouts, leading to degraded decision quality. The study concludes that compute budget parameters—specifically the number of candidates (K) and the planning horizon (H)—determine the dominant failure mode: small K limits the quality of discovered options, while large H amplifies model mismatch and misranking errors.
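
The interplay of candidate count K and horizon H can be made concrete with a skeletal receding-horizon loop: sample K plans of length H, rerank, execute only the first action, then replan. The callables `sample_plan` and `score_plan` are hypothetical stand-ins for the discrete denoiser and the reranking score described in the paper.

```python
import random

def diffusion_mpc_step(sample_plan, score_plan, K, H, rng):
    """One receding-horizon step: draw K candidate plans of length H,
    rerank them with score_plan, and return the best plan's first action.
    Small K limits the options discovered; large H gives score_plan more
    imagined steps over which model error can compound.
    """
    candidates = [sample_plan(H, rng) for _ in range(K)]
    best = max(candidates, key=score_plan)
    return best[0]  # execute only the first action, then replan next step

# Toy stand-ins: plans are lists of ints; the score prefers a small first action.
rng = random.Random(0)
action = diffusion_mpc_step(
    sample_plan=lambda H, r: [r.randrange(10) for _ in range(H)],
    score_plan=lambda plan: -plan[0],
    K=8, H=4, rng=rng,
)
```

In this framing the paper's finding reads naturally: raising K widens the max over candidates, while raising H stretches the imagined rollout that `score_plan` must evaluate, amplifying mismatch between predicted and actual quality.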

Why This Matters for AI Planning

This case study moves diffusion models beyond continuous domains like image generation and into the realm of sequential decision-making. The findings have significant implications for the future of AI planning and control, particularly for real-world applications with discrete, constrained action spaces.

  • Validity is Paramount: In discrete combinatorial spaces, planners must explicitly enforce action feasibility; assuming the model will learn constraints is insufficient and costly.
  • Critic Alignment is a Key Challenge: Integrating a learned value function into a diffusion planner is non-trivial. A misaligned critic can severely degrade performance, necessitating careful diagnostics and potentially hybrid scoring methods.
  • Compute Budget Dictates Failure Modes: The trade-off between sampling breadth (K) and planning depth (H) is critical. Practitioners must tune these parameters based on the reward structure and model reliability of their specific domain.
  • Foundation for Real-World Applications: The diagnostics and architectural choices explored, such as logit masking, provide a practical blueprint for applying diffusion planners to complex problems like logistics, chip design, or molecular synthesis, where actions are discrete and constraints are rigid.

By dissecting the performance of Diffusion-MPC in the controlled but complex environment of Tetris, this research provides a crucial framework for understanding and overcoming the hurdles of bringing advanced generative planning models to challenging discrete-world tasks.
