LLM: Diffusion, MBO, and LLM Pretraining

Overview

This theme studies the intersection of diffusion and flow-based generative models, offline black-box / model-based optimization (MBO), and LLM pretraining. A current focus is Design-Bench 2.0, an LLM-oriented benchmark that adapts offline MBO algorithms to LLM-related tasks, alongside diffusion- and flow-based methods for black-box and multi-objective optimization.

Motivation

Three lines of work increasingly reinforce one another: diffusion and flow-based generative modeling, offline model-based optimization (MBO), and large language model pretraining. Offline MBO methods learn to propose high-performing designs purely from a static dataset of past evaluations, without querying the true objective during optimization. Diffusion and flow models are effective tools for representing and editing the distribution of such designs. The theme explores how these optimization ideas transfer to the LLM setting, where the “design space” becomes language- and sequence-structured.
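To make the offline MBO setting concrete, the following is a minimal sketch of a generic baseline: fit a surrogate model to the static dataset, then run gradient ascent on the surrogate starting from the best logged design. It is not the method of any paper listed on this page. The toy objective, the quadratic surrogate, and all function names (`true_score`, `make_offline_dataset`, `propose_design`, and so on) are assumptions introduced purely for illustration.

```python
import random


def true_score(x):
    # Hypothetical black-box objective with its optimum at x = 2.
    # Used only to build the dataset and to check the final proposal.
    return -(x - 2.0) ** 2


def make_offline_dataset(n=50, seed=0):
    # Static dataset of past evaluations; all designs lie in [-1, 1],
    # so the true optimum is outside the logged data.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_score(x) for x in xs]
    return xs, ys


def solve_linear_system(A, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    sol = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(M[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (M[i][n] - rest) / M[i][i]
    return sol


def fit_quadratic_surrogate(xs, ys):
    # Least-squares fit of y ~ a + b*x + c*x^2 via the normal equations.
    feats = [[1.0, x, x * x] for x in xs]
    A = [[sum(f[i] * f[j] for f in feats) for j in range(3)] for i in range(3)]
    rhs = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(3)]
    return solve_linear_system(A, rhs)


def propose_design(xs, ys, steps=50, lr=0.1):
    # Gradient ascent on the surrogate, starting from the best logged design.
    a, b, c = fit_quadratic_surrogate(xs, ys)
    x = max(zip(ys, xs))[1]
    for _ in range(steps):
        x += lr * (b + 2.0 * c * x)  # derivative of a + b*x + c*x^2
    return x


if __name__ == "__main__":
    xs, ys = make_offline_dataset()
    proposed = propose_design(xs, ys)
    print("best logged score:", max(ys))
    print("proposed design:", proposed, "true score:", true_score(proposed))
```

The surrogate extrapolates well here only because the toy objective is itself quadratic. In realistic tasks a learned surrogate is unreliable far from the logged data, which is one reason offline MBO methods constrain or regularize the search rather than following surrogate gradients freely.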

Project Goals

Develop diffusion- and flow-based estimators and samplers for offline black-box and multi-objective optimization.
Bridge offline MBO algorithms with LLM-related tasks, treating language and sequence generation as an optimization problem.
Build an LLM-oriented benchmark that standardizes evaluation of these methods.

Recent Progress

The team has started work on Design-Bench 2.0, an LLM-oriented benchmark intended to adapt offline MBO algorithms to LLM-related tasks. It extends the classic offline black-box optimization benchmarking setup to language-model settings and provides common ground for testing diffusion-, flow-, and optimization-based methods on LLM tasks.

Recent results include diffusion estimation for offline black-box optimization (SPADE, ICML 2026), training diffusion language models directly for black-box optimization (ICML 2026 Spotlight), and a preprint on diffusion large language models for black-box optimization. These build on earlier work in design editing (TMLR), guided flows for multi-objective optimization (ICLR 2025), and importance-aware co-teaching (NeurIPS 2023). See the Related Publications list below for details.

Related Publications

Support-Proximity Augmented Diffusion Estimation for Offline Black-Box Optimization

Yonghan Yang, Ye Yuan, Zipeng Sun, Linfeng Du, Bowei He, Haolun Wu, Can Chen, Xue Liu

ICML 2026 · 2026

Training Diffusion Language Models for Black-Box Optimization

Zipeng Sun, Can Chen, Ye Yuan, Haolun Wu, Jiayao Gu, Christopher Pal, Xue Liu

ICML 2026 (Spotlight) · 2026

Diffusion Large Language Models for Black-Box Optimization

Ye Yuan, Can Chen, Zipeng Sun, Dinghuai Zhang, Christopher Pal, Xue Liu

arXiv preprint · 2026

Design Editing for Offline Model-based Optimization

Ye Yuan, Youyuan Zhang, Can Chen, Haolun Wu, Zixuan Li, Jianmo Li, James J. Clark, Xue Liu

TMLR · 2025

ParetoFlow: Guided Flows in Multi-Objective Optimization

Ye Yuan, Can Chen, Christopher Pal, Xue Liu

ICLR 2025 · 2025

Importance-aware Co-teaching for Offline Model-based Optimization

Ye Yuan, Can Chen, Zixuan Liu, Willie Neiswanger, Xue Liu

NeurIPS 2023 · 2023

Impact Holders

Impact holders and user communities will be added as the project scope becomes clearer.
