../

proj5/
├── overview/  Project Overview
├── partA/     The Power of Diffusion Models!
│   ├── part0/ Setup
│   └── part1/ Sampling Loops
└── partB/     Flow Matching from Scratch!
    ├── part1/ Single-Step Denoising UNet
    └── part2/ Flow Matching Model

CS180 Project 5: Diffusion Models

Overview

This project explores diffusion models, a powerful class of generative models that learn to reverse a gradual noising process to generate high-quality images from text prompts. We work with the DeepFloyd IF model, implementing diffusion sampling loops and experimenting with various text-to-image generation tasks.

Part A: The Power of Diffusion Models!

Exploring diffusion models with DeepFloyd IF

Part 0: Setup

Getting familiar with DeepFloyd diffusion model

Generated Prompt Embeddings

I generated embeddings for a set of text prompts spanning a variety of subjects and artistic styles.

Selected Images and Analysis

From my prompt collection, I selected the following three for image generation and analysis:

Random Seed: 100 (used consistently across all parts of the project)

Prompts:
"a photo of a horse running through a meadow"
"an oil painting of a violin resting on sheet music"
"a high quality picture of a dinosaur in a forest"

Horse meadow 20 steps
20 inference steps
Violin 20 steps
20 inference steps
Dinosaur 20 steps
20 inference steps
Horse meadow 40 steps
40 inference steps
Violin 40 steps
40 inference steps
Dinosaur 40 steps
40 inference steps

Reflection on Model Output Quality

The DeepFloyd model demonstrates an impressive ability to understand text descriptions and translate them into coherent visual representations.

All images were generated using the same random seed (100) for consistency. The model shows strong semantic understanding and generates visually coherent results that align well with the textual descriptions.

Part 1: Sampling Loops

Implementing and modifying diffusion sampling loops

1.1 Implementing the Forward Process

The forward process adds noise to a clean image progressively. I implemented this using the equation:

$$x_t = \sqrt{\bar{\alpha}_t} \cdot x_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon$$

where $\epsilon$ is random noise sampled from a standard normal distribution $\mathcal{N}(0, I)$.
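A minimal NumPy sketch of this forward step, assuming $\bar{\alpha}_t$ has already been looked up from the model's noise schedule (the schedule lookup itself is omitted, and the all-ones array is a stand-in for a real image):

```python
import numpy as np

def forward(x0, alpha_bar_t, rng):
    """Noise a clean image x0 to level alpha_bar_t:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return xt, eps

rng = np.random.default_rng(100)   # project-wide seed
x0 = np.ones((64, 64))             # stand-in for the Campanile image
xt, eps = forward(x0, alpha_bar_t=0.5, rng=rng)
```

Returning `eps` alongside `xt` is convenient later, since the denoising UNet is trained to predict exactly this noise.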

Here's the Berkeley Campanile at different noise levels (timesteps):

Original Campanile
Original (t = 0)
Campanile with noise at t=250
t = 250
Campanile with noise at t=500
t = 500
Campanile with noise at t=750
t = 750

1.2 Classical Denoising

I attempted to denoise the noisy Campanile images using classical Gaussian blur filtering. As expected, this approach struggles significantly with higher noise levels and doesn't produce satisfactory results.
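As a toy illustration of why blurring fails, here is a hedged sketch using SciPy's `gaussian_filter`; the white-square "image" and noise level are stand-ins, not the actual project data:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(100)
clean = np.zeros((64, 64))
clean[16:48, 16:48] = 1.0                            # toy "image": a white square
noisy = clean + 0.8 * rng.standard_normal(clean.shape)

# Blurring averages away some noise, but it blurs true edges just as
# readily, so it cannot recover detail that the noise destroyed.
blurred = gaussian_filter(noisy, sigma=2.0)
```

The blur lowers the pixel-wise error relative to the noisy input, but the result is soft and washed out, which mirrors what happens on the Campanile images above.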

Noisy image at t=250
Noisy Image (t=250)
Noisy image at t=500
Noisy Image (t=500)
Noisy image at t=750
Noisy Image (t=750)
Gaussian denoised at t=250
Gaussian Blur Denoising (t=250)
Gaussian denoised at t=500
Gaussian Blur Denoising (t=500)
Gaussian denoised at t=750
Gaussian Blur Denoising (t=750)

1.3 One-Step Denoising

Using the pretrained DeepFloyd UNet, I implemented one-step denoising. The model estimates the noise in the image and removes it to recover an approximation of the original clean image. This approach works significantly better than Gaussian blur, especially at lower noise levels.
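Given the UNet's noise estimate $\hat{\epsilon}$, recovering the clean-image estimate just inverts the forward equation. A sketch with the model call abstracted away:

```python
import numpy as np

def estimate_x0(xt, eps_hat, alpha_bar_t):
    """Invert x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps for x0:
    x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)."""
    return (xt - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
```

With the true noise this inversion is exact; in practice the quality of the estimate is limited by how well the UNet predicts the noise, which is why the t = 750 result is the blurriest.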

Original Campanile
Original (t=250)
Original Campanile
Original (t=500)
Original Campanile
Original (t=750)
Noisy image at t=250
Noisy Image (t=250)
Noisy image at t=500
Noisy Image (t=500)
Noisy image at t=750
Noisy Image (t=750)
One-step denoised at t=250
Estimate of Original (t=250)
One-step denoised at t=500
Estimate of Original (t=500)
One-step denoised at t=750
Estimate of Original (t=750)

1.4 Iterative Denoising

I implemented iterative denoising, which removes noise over multiple steps and produces much higher-quality results than single-step denoising.
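Each step blends the current noisy image with the clean estimate. A sketch of one DDPM-style posterior-mean update (one common formulation; the random variance term is omitted for clarity, and `abar_t`, `abar_prev` come from the noise schedule):

```python
import numpy as np

def iterative_step(xt, x0_hat, abar_t, abar_prev):
    """Move from timestep t to the less-noisy timestep t':
    x_t' = sqrt(abar_t') * beta_t / (1 - abar_t) * x0_hat
         + sqrt(alpha_t) * (1 - abar_t') / (1 - abar_t) * x_t,
    where alpha_t = abar_t / abar_t' and beta_t = 1 - alpha_t."""
    alpha_t = abar_t / abar_prev
    beta_t = 1.0 - alpha_t
    return (np.sqrt(abar_prev) * beta_t / (1.0 - abar_t)) * x0_hat \
         + (np.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)) * xt
```

Repeating this over a strided list of timesteps gives the progressive denoising shown below.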

Here's the progressive denoising process shown every 5th step:

Denoising t=690
t = 690
Denoising t=540
t = 540
Denoising t=390
t = 390
Denoising t=240
t = 240
Denoising t=90
t = 90

Final comparison of different denoising approaches:

Original Campanile
Original
Iteratively denoised
Iterative Denoising
One-step denoised
One-Step Denoising
Gaussian blurred
Gaussian Blur

1.5 Diffusion Model Sampling

Using the iterative denoising function starting from pure noise (i_start = 0), I generated 5 samples with the prompt "a high quality photo". The results demonstrate the model's ability to generate diverse images from random noise.

Sample 1
Sample 1
Sample 2
Sample 2
Sample 3
Sample 3
Sample 4
Sample 4
Sample 5
Sample 5

1.6 Classifier-Free Guidance (CFG)

To improve image quality, I implemented Classifier-Free Guidance with $\gamma = 7$. This technique computes both an unconditional noise estimate $\epsilon_u$ and a conditional estimate $\epsilon_c$, then combines them as $\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)$; with $\gamma > 1$, this extrapolates past the conditional estimate, producing higher-quality images.
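The combination step itself is one line; a sketch (the two estimates would come from two UNet forward passes, abstracted away here):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, gamma=7.0):
    """Classifier-free guidance: start at the unconditional noise estimate
    and extrapolate toward (and, for gamma > 1, past) the conditional one."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

Note that `gamma = 1` recovers the plain conditional estimate and `gamma = 0` ignores the prompt entirely.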

CFG Sample 1
CFG Sample 1
CFG Sample 2
CFG Sample 2
CFG Sample 3
CFG Sample 3
CFG Sample 4
CFG Sample 4
CFG Sample 5
CFG Sample 5

1.7 Image-to-image Translation

Following the SDEdit algorithm, I added noise to the original Campanile image and then denoised it using the prompt "a high quality photo". Different noise levels create different degrees of editing, with higher noise levels allowing for more dramatic changes.
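The whole algorithm is just the forward process followed by the usual denoising loop, entered partway through. A structural sketch, where the `denoise_step` callback is a hypothetical stand-in for one CFG-guided UNet update:

```python
import numpy as np

def sdedit(x_orig, i_start, timesteps, alpha_bars, denoise_step, rng):
    """Noise x_orig to timesteps[i_start], then iteratively denoise.
    timesteps is ordered from most to least noisy, so a smaller i_start
    means more noise and therefore a more dramatic edit."""
    abar = alpha_bars[timesteps[i_start]]
    xt = np.sqrt(abar) * x_orig \
       + np.sqrt(1 - abar) * rng.standard_normal(x_orig.shape)
    for i in range(i_start, len(timesteps) - 1):
        xt = denoise_step(xt, timesteps[i], timesteps[i + 1])
    return xt
```

This makes clear why `i_start` controls edit strength: it decides how much of the original image's structure survives into the denoising loop.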

Original Campanile
Original Campanile
SDEdit noise 1
i_start = 1
SDEdit noise 3
i_start = 3
SDEdit noise 5
i_start = 5
SDEdit noise 7
i_start = 7
SDEdit noise 10
i_start = 10
SDEdit noise 20
i_start = 20

Results on my own test images:

Soccer Ball Image

Original Soccer
Original Soccer Ball
Soccer noise 1
i_start = 1
Soccer noise 3
i_start = 3
Soccer noise 5
i_start = 5
Soccer noise 7
i_start = 7
Soccer noise 10
i_start = 10
Soccer noise 20
i_start = 20

Violin Image

Original Violin
Original Violin
Violin noise 1
i_start = 1
Violin noise 3
i_start = 3
Violin noise 5
i_start = 5
Violin noise 7
i_start = 7
Violin noise 10
i_start = 10
Violin noise 20
i_start = 20

1.7.1 Editing Hand-Drawn and Web Images

This procedure works particularly well when starting with non-realistic images and projecting them onto the natural image manifold. I experimented with both web images and hand-drawn sketches.

Painting Web Image

Original avocado
Original Web Image
Avocado noise 1
i_start = 1
Avocado noise 3
i_start = 3
Avocado noise 5
i_start = 5
Avocado noise 7
i_start = 7
Avocado noise 10
i_start = 10
Avocado noise 20
i_start = 20

Hand-Drawn Image 1

Original drawing 1
Original Hand-Drawn Image
Drawing 1 noise 1
i_start = 1
Drawing 1 noise 3
i_start = 3
Drawing 1 noise 5
i_start = 5
Drawing 1 noise 7
i_start = 7
Drawing 1 noise 10
i_start = 10
Drawing 1 noise 20
i_start = 20

Hand-Drawn Image 2

Original drawing 2
Original Hand-Drawn Image
Drawing 2 noise 1
i_start = 1
Drawing 2 noise 3
i_start = 3
Drawing 2 noise 5
i_start = 5
Drawing 2 noise 7
i_start = 7
Drawing 2 noise 10
i_start = 10
Drawing 2 noise 20
i_start = 20

1.7.2 Inpainting

Following the RePaint algorithm, I implemented inpainting by running the diffusion denoising loop while forcing pixels outside the edit mask to match the original image (with appropriate noise added for each timestep).
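The per-step constraint can be sketched as follows (with mask = 1 inside the region to inpaint):

```python
import numpy as np

def repaint_constrain(xt, x_orig, mask, abar_t, rng):
    """After each denoising step, keep the diffusion sample only inside
    the mask; outside it, substitute the original image noised to the
    current timestep so both regions share the same noise level."""
    eps = rng.standard_normal(x_orig.shape)
    x_orig_t = np.sqrt(abar_t) * x_orig + np.sqrt(1 - abar_t) * eps
    return mask * xt + (1 - mask) * x_orig_t
```

Re-noising the original image (rather than pasting it in clean) is the key detail: it keeps the masked and unmasked regions statistically consistent at every timestep.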

Campanile original
Original
Campanile inpainted
Inpainted Result
Campanile mask
Mask

Billboard Inpainting

Billboard original
Original
Billboard mask
Mask
Billboard inpainted
Inpainted Result

Tree Inpainting

Tree original
Original
Tree mask
Mask
Tree inpainted
Inpainted Result

1.7.3 Text-Conditional Image-to-image Translation

This is similar to SDEdit, but now guided by a specific text prompt. The result is a more controlled edit that not only projects the image onto the natural image manifold but also incorporates semantic guidance from the prompt.

Campanile → Golden Temple at Sunrise

Original Campanile
Original Campanile
Campanile to Golden Temple noise 1
i_start = 1
Campanile to Golden Temple noise 3
i_start = 3
Campanile to Golden Temple noise 5
i_start = 5
Campanile to Golden Temple noise 7
i_start = 7
Campanile to Golden Temple noise 10
i_start = 10
Campanile to Golden Temple noise 20
i_start = 20

Soccer Ball → Golden Temple at Sunrise

Original Soccer Ball
Original Soccer Ball
Soccer to Golden Temple noise 1
i_start = 1
Soccer to Golden Temple noise 3
i_start = 3
Soccer to Golden Temple noise 5
i_start = 5
Soccer to Golden Temple noise 7
i_start = 7
Soccer to Golden Temple noise 10
i_start = 10
Soccer to Golden Temple noise 20
i_start = 20

Violin → Golden Temple at Sunrise

Original Violin
Original Violin
Violin to Golden Temple noise 1
i_start = 1
Violin to Golden Temple noise 3
i_start = 3
Violin to Golden Temple noise 5
i_start = 5
Violin to Golden Temple noise 7
i_start = 7
Violin to Golden Temple noise 10
i_start = 10
Violin to Golden Temple noise 20
i_start = 20

1.8 Visual Anagrams

Visual anagrams are optical illusions, created with diffusion models, that show one image right-side up and a different image upside down. I implemented this by averaging noise estimates from two different text prompts: one applied to the image in its normal orientation and one applied to the vertically flipped image.
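A sketch of the combined noise estimate, where `eps_model(x, prompt)` is a hypothetical stand-in for the UNet's CFG noise prediction:

```python
import numpy as np

def anagram_eps(xt, eps_model, prompt_up, prompt_down):
    """Estimate noise for prompt_up on the upright image and for
    prompt_down on the vertically flipped image, flip the latter back,
    and average so both readings of the image are denoised at once."""
    eps1 = eps_model(xt, prompt_up)
    eps2 = np.flipud(eps_model(np.flipud(xt), prompt_down))
    return 0.5 * (eps1 + eps2)
```

Flipping the second estimate back before averaging is what ties the two orientations to the same pixels.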

Campfire People ↔ Old Man

Visual anagram 1
Image 1
Visual anagram 2
Image 2
"An oil painting of an old man" ↔ "An oil painting of people around a campfire"

Horse ↔ Dinosaur

Horse dinosaur 1
Image 1
Horse dinosaur 2
Image 2
"A high quality picture of a dinosaur in a forest" ↔ "A photo of a horse running through a meadow"

Pumpkins ↔ Porsche

Pumpkin Porsche 1
Image 1
Pumpkin Porsche 2
Image 2
"A high quality photo of a Porsche on a mountain road" ↔ "A still life painting of pumpkins on a wooden table"

1.9 Hybrid Images

Following Factorized Diffusion, I created hybrid images by combining the low frequencies of one noise estimate with the high frequencies of another. The resulting images change apparent content with viewing distance: the high-frequency prompt dominates up close, while the low-frequency prompt dominates from afar.
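A sketch of the frequency split, using a Gaussian blur as the low-pass filter (`eps_model` is again a hypothetical stand-in for the UNet's noise prediction):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hybrid_eps(xt, eps_model, prompt_low, prompt_high, sigma=2.0):
    """Factorized Diffusion sketch: low frequencies of one prompt's noise
    estimate (Gaussian blur) plus high frequencies of the other's
    (estimate minus its own blur)."""
    eps_low = eps_model(xt, prompt_low)
    eps_high = eps_model(xt, prompt_high)
    return gaussian_filter(eps_low, sigma) + (eps_high - gaussian_filter(eps_high, sigma))
```

When the two estimates happen to agree, the low- and high-pass parts recombine into the original estimate, so the split is lossless by construction.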

Hybrid image: toilet paper and sushi
Toilet Paper ↔ Sushi
Low-pass: "A piece of sushi on a plate with chopsticks"
High-pass: "A roll of toilet paper"
Hybrid image: crown and tree
Crown ↔ Tree
Low-pass: "A crown sitting on a table"
High-pass: "A single lone tree in a field"

Part B: Flow Matching from Scratch!

Training your own flow matching model on MNIST

In this part, we'll train our own flow matching model on MNIST from scratch. Flow matching is a powerful technique for generative modeling that learns to map noise to data through a continuous flow. We'll implement and train a UNet to perform this transformation iteratively.

Part 1: Training a Single-Step Denoising UNet

Building and training a basic denoising model

1.1 Implementing the UNet

I implemented a UNet architecture for denoising noisy MNIST digits. The UNet consists of downsampling and upsampling blocks with skip connections, designed to map noisy images back to clean images in a single step.

1.2 Denoising Process Visualization

To understand how noise affects the input images, I visualized the noising process using $\sigma \in \{0, 0.2, 0.4, 0.5, 0.6, 0.8, 1\}$. This shows how different noise levels progressively corrupt the original MNIST digits.
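This noising is simpler than the diffusion forward process: the clean image is not scaled, only additive Gaussian noise is applied. A sketch (the random array is a stand-in for an MNIST digit):

```python
import numpy as np

rng = np.random.default_rng(100)
x = rng.random((28, 28))                      # stand-in for an MNIST digit in [0, 1]
sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]

# x_tilde = x + sigma * eps with eps ~ N(0, I); sigma = 0 leaves x unchanged.
noised = [x + s * rng.standard_normal(x.shape) for s in sigmas]
```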

Noise visualization process
Noising process visualization with different sigma values

1.2.1 Training Results

I trained the denoising UNet for 5 epochs using the MNIST dataset with $\sigma = 0.5$. The model learns to map noisy images $\tilde{x}$ back to clean images $x$. Here are the results after 1 and 5 epochs of training:

Epoch 1 results
After 1 Epoch
Epoch 5 results
After 5 Epochs

Training Loss Curve

Training loss curve
Training loss over iterations for denoising task

1.2.2 Out-of-Distribution Testing

I tested the trained denoiser (which was trained on σ = 0.5) on different noise levels that it hadn't seen during training. This evaluates the model's generalization to different noise conditions.

Sigma 0.0
σ = 0.0
Sigma 0.2
σ = 0.2
Sigma 0.4
σ = 0.4
Sigma 0.5
σ = 0.5
Sigma 0.6
σ = 0.6
Sigma 0.8
σ = 0.8
Sigma 1.0
σ = 1.0

The results show that the model performs best at its training noise level (σ = 0.5) and below; quality degrades as σ increases beyond that level, and at σ = 1.0 the model struggles significantly.

1.2.3 Denoising Pure Noise

I trained a separate model to denoise pure random Gaussian noise to generate MNIST-like digits. This is essentially a generative task where we start with pure noise and try to produce realistic digits.

Pure noise epoch 1
After 1 Epoch
Pure noise epoch 5
After 5 Epochs

Training Loss for Pure Noise Denoising

Pure noise training loss
Training loss for pure noise denoising task

Observations on Pure Noise Denoising

When training to denoise pure noise, the model generates images that look like a combination of all digits, resembling an '8' shape. This makes sense because with MSE loss, the model learns to predict the average of all training examples to minimize the squared distance to every possible target digit. Since an '8' contains features common to most digits (curves, loops, vertical lines), it represents a reasonable average across the entire MNIST dataset.

This demonstrates why single-step denoising from pure noise is insufficient for high-quality generation: the model converges to a blurry centroid rather than learning to generate diverse, sharp individual digits.
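The centroid argument is easy to verify numerically: among constant predictions, the dataset mean minimizes MSE. A toy check (random arrays stand in for MNIST images):

```python
import numpy as np

rng = np.random.default_rng(100)
targets = rng.random((1000, 28 * 28))         # stand-in for flattened MNIST digits
mean_img = targets.mean(axis=0)

# Compare the mean image against two other constant predictors.
candidates = [mean_img, targets[0], np.zeros(28 * 28)]
losses = [np.mean((targets - c) ** 2) for c in candidates]
```

The mean image attains the lowest loss of the three, which is exactly the '8'-like average the pure-noise denoiser collapses to.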

Part 2: Training a Flow Matching Model

Implementing iterative denoising with time conditioning

Flow matching addresses the limitations of single-step denoising by learning to iteratively remove noise over multiple timesteps. We condition the UNet on time t to predict the flow (velocity) needed to move from noisy data toward clean data.

2.1 Time-Conditioned UNet Architecture

I modified the UNet to accept time conditioning through FCBlocks that inject the scalar timestep t into the network. The time signal is normalized to [0,1] and embedded through fully connected layers before modulating the feature maps.

2.2 Training the Time-Conditioned UNet

The time-conditioned UNet is trained to predict the flow at various timesteps $t$. During training, we sample random timesteps and train the model to predict the velocity field that moves from the noisy distribution toward the data distribution.
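A sketch of constructing one training example, under the common linear-interpolation convention (noise at $t = 0$, data at $t = 1$; if the project uses the opposite convention, the signs flip accordingly):

```python
import numpy as np

def training_example(x1, rng):
    """One flow-matching training pair: x0 ~ N(0, I) is noise, x1 is a
    data sample, t ~ U[0, 1], x_t = (1 - t) x0 + t x1, and the regression
    target is the constant velocity u = x1 - x0 of that straight-line path."""
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform()
    xt = (1 - t) * x0 + t * x1
    return xt, t, x1 - x0

rng = np.random.default_rng(100)
x1 = rng.random((28, 28))                     # stand-in for an MNIST digit
xt, t, u = training_example(x1, rng)
```

The UNet is then trained with MSE between its prediction $v_\theta(x_t, t)$ and the target $u$.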

Time-conditioned training loss
Training loss for time-conditioned UNet

2.3 Sampling from the Time-Conditioned UNet

Using the trained time-conditioned UNet, I generated samples by starting from pure noise and iteratively applying the predicted flow. The results show significant improvement over single-step denoising, with recognizable digits emerging by epoch 10.
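The sampling loop is a simple Euler integration of the learned flow. A sketch under the same noise-at-$t{=}0$ convention, with `velocity_model` standing in for the trained time-conditioned UNet:

```python
import numpy as np

def sample(velocity_model, shape, num_steps=50, seed=100):
    """Generate by integrating the learned flow with Euler steps from
    t = 0 (pure noise) to t = 1 (data): x <- x + dt * v(x, t)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_model(x, i * dt)
    return x
```

More steps means a finer integration of the flow, at the cost of more UNet forward passes.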

Time-conditioned epoch 1
After 1 Epoch
Time-conditioned epoch 5
After 5 Epochs
Time-conditioned epoch 10
After 10 Epochs

2.4 Class-Conditioned UNet

To improve generation quality and enable controlled generation, I extended the UNet to also condition on digit class (0-9). The class information is encoded as a one-hot vector and injected into the network through additional FCBlocks for class conditioning. This also requires implementing classifier-free guidance during both training and sampling.

2.5 Training the Class-Conditioned UNet

I trained the class-conditioned model with different optimization strategies to compare their effectiveness. During training, 10% of samples use unconditional generation (class dropout) to enable classifier-free guidance.
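The class-dropout step can be sketched as follows (the 10% rate matches the setup above; the helper name is mine):

```python
import numpy as np

def one_hot_with_dropout(labels, num_classes=10, p_drop=0.1, seed=100):
    """Encode digit labels as one-hot vectors, zeroing a random fraction
    p_drop to the all-zeros vector so the model also learns unconditional
    generation -- the ingredient classifier-free guidance needs later."""
    rng = np.random.default_rng(seed)
    onehot = np.eye(num_classes)[np.asarray(labels)]
    onehot[rng.random(len(labels)) < p_drop] = 0.0
    return onehot
```

At sampling time, the all-zeros vector then serves as the "null" class for the unconditional forward pass.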

Training with Learning Rate Scheduler

I trained the class-conditioned model with an exponential learning rate scheduler for improved convergence.

Scheduler training loss
Training loss with exponential learning rate scheduler

Training Without Learning Rate Scheduler

I also experimented with removing the exponential learning rate scheduler to see whether similar performance could be achieved without it. Matching the scheduled run required lowering the initial learning rate to $lr = 0.0001$.

No scheduler training loss
Training Loss (No Scheduler)
Scheduler training loss
Training Loss (With Scheduler)

2.6 Sampling from the Class-Conditioned UNet

Using classifier-free guidance with $\gamma = 0.01$, I generated 4 instances of each digit (0-9). The class conditioning allows for much more controlled and higher-quality generation compared to the time-only model.

Results with Learning Rate Scheduler

Class-conditioned epoch 1
After 1 Epoch (with scheduler)
Class-conditioned epoch 5
After 5 Epochs (with scheduler)
Class-conditioned epoch 10
After 10 Epochs (with scheduler)

Results without Learning Rate Scheduler

No scheduler epoch 1
After 1 Epoch (no scheduler)
No scheduler epoch 5
After 5 Epochs (no scheduler)
No scheduler epoch 10
After 10 Epochs (no scheduler)

Even without the learning rate scheduler, by adjusting the initial learning rate to $lr = 0.0001$, the model achieved recognizable digit generation by epoch 10. Both approaches demonstrate the effectiveness of class conditioning for controlled generation.

© 2025 Sukhamrit Singh. All rights reserved.