the power of diffusion models
part a: playing around with pre-trained diffusion models
overview
for this project, we will be using the deepfloyd if diffusion model. trained by stability ai, it is a two-stage model: the first stage produces 64 x 64 images, and the second stage takes the outputs of the first stage as input to produce 256 x 256 images.
link to DeepFloyd/IF-I-XL-v1.0.
the seed i will be using for the project: 34224309
below are some images i generated using different prompts.
prompt: ‘a man wearing a hat’
prompt: ‘an oil painting of a snowy mountain village’
prompt: ‘a rocket ship’
reflection of the images
for all of the different prompts, the images with a higher number of inference steps tend to be higher quality. though the trade-off is efficiency (time to generate), the images are more intricate. for example, for the prompt ‘a man wearing a hat’, the man for num_inference_steps=20 looks hyper-realistic and slightly artificial, whereas the man for num_inference_steps=100 looks more real.
implementing the forward process
i implemented the noisy_im = forward(im, t) function, which adds noise to an image for timestep t. here, t is in [0, 999], where a higher t produces a noisier image than a lower t.
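a minimal sketch of what this looks like, assuming alphas_cumbar is the cumulative product of alphas pulled from the deepfloyd scheduler (the name is mine, not the library's):

```python
import torch

def forward(im, t, alphas_cumbar):
    """add noise to a clean image `im` for timestep `t` in [0, 999]."""
    alpha_bar = alphas_cumbar[t]
    eps = torch.randn_like(im)  # fresh gaussian noise
    # standard forward process: x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * eps
    return alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps
```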
using the function, i run an image of the berkeley campanile through it for t=[250, 500, 750]
original image of berkeley campanile
noisy images of berkeley campanile
you can see that as the timestep t grows larger, we get more noise on the image!
classical denoising
the classical way to denoise an image is to apply gaussian blurring on the images. let’s try that:
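a minimal sketch, where the kernel size and sigma are illustrative choices and noisy_ims holds the noisy campaniles from above:

```python
import torchvision.transforms.functional as TF

# classical denoising: just gaussian-blur the noisy images
blurred = {t: TF.gaussian_blur(noisy_ims[t], kernel_size=5, sigma=2.0)
           for t in [250, 500, 750]}
```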
gaussian blur denoising
you can see that it didn’t really work. the images are just blurry and we do not get the original image back.
one-step denoising
now, let’s use the pretrained diffusion model to help us denoise. we can use the unet to estimate the noise and try denoising it. this can be done by passing the noisy image through stage_1.unet.
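roughly, the idea looks like this, where prompt_embeds is the text embedding for "a high quality photo" and the exact unet call may differ slightly from this sketch:

```python
with torch.no_grad():
    # the unet predicts the noise (plus extra variance channels we discard)
    noise_est = stage_1.unet(noisy_im, t, encoder_hidden_states=prompt_embeds).sample
    noise_est = noise_est[:, :3]

# invert the forward process to estimate the clean image:
# x_0 = (x_t - sqrt(1 - abar) * eps) / sqrt(abar)
alpha_bar = alphas_cumbar[t]
clean_est = (noisy_im - (1 - alpha_bar).sqrt() * noise_est) / alpha_bar.sqrt()
```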
t=250
t=500
t=750
you can see that the results are way better using this one-step approach compared to the gaussian blur! however, we notice that the estimated original image is quite similar to the ground truth but it’s not quite there. this is not unexpected, since the noise we retrieved from the unet is only an estimate!
another observation is that the higher the timestep, the more the estimated original image deviates from the ground truth image. this is because with more noise, it is harder for us to retrieve the original image, since we are working with a harder problem.
iterative denoising
the one-step approach is a significant improvement over the classical denoising method, but we can do better. instead of estimating the noise in ‘one shot’, we can iteratively denoise, stepping from a timestep t to a less noisy timestep t’ in smaller steps, to get a more accurate estimate of the clean image.
one way to do this is to start with the noisiest image at the largest timestep, denoise for one step, and continue all the way down to t=0. however, this would require running the diffusion model 1000 times, which is slow and expensive. to fix this, we use strided timesteps (a stride of 30 in our case).
to achieve this, we use equations 6 and 7 from the ddpm paper to help us with this task.
in this part, i implement the iterative_denoise(image, i_start) function, which takes a noisy image image and a starting index i_start. this function denoises the image starting at timestep timestep[i_start].
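a rough sketch of the loop, under my assumptions that timesteps is the strided schedule (e.g. [990, 960, ..., 0]), alphas_cumbar is the cumulative product of alphas, and estimate_noise wraps the unet call from the previous part (the extra variance term from the ddpm equations is omitted here):

```python
def iterative_denoise(image, i_start):
    for i in range(i_start, len(timesteps) - 1):
        t, t_prev = timesteps[i], timesteps[i + 1]
        abar_t, abar_prev = alphas_cumbar[t], alphas_cumbar[t_prev]
        alpha = abar_t / abar_prev
        beta = 1 - alpha

        noise_est = estimate_noise(image, t)  # unet's noise prediction
        x0_est = (image - (1 - abar_t).sqrt() * noise_est) / abar_t.sqrt()

        # blend the clean estimate with the current noisy image
        # (equations 6 and 7 from the ddpm paper)
        image = (abar_prev.sqrt() * beta / (1 - abar_t)) * x0_est \
              + (alpha.sqrt() * (1 - abar_prev) / (1 - abar_t)) * image
    return image
```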
below are the results of denoising the noisy campanile, shown every 5th loop of the denoising process (it should gradually become less noisy as t gets smaller)
noisy campanile images with iterative denoising
diffusion model sampling
other than using iterative_denoise to denoise an image, we can use it to generate images from scratch: we set i_start=0 and pass in pure random noise. below are the results of 5 images of “a high quality photo”:
you can see that we get really cool images! though they are not necessarily “high quality” per se, we do get something from just pure noise, which is awesome.
classifier-free guidance (cfg)
the issue with generating images from pure noise is that some images don’t really mean anything - you can’t really tell what’s going on in the images. we can improve the quality of the images (at the expense of diversity) through a technique called classifier-free guidance.
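the core of the technique is to compute two noise estimates per step, one conditioned on the prompt and one on the empty prompt, and push the final estimate past the conditional one. a minimal sketch, where estimate_noise is the same assumed unet wrapper as before, now taking prompt embeddings:

```python
def cfg_noise_estimate(image, t, cond_embeds, uncond_embeds, scale=7):
    eps_cond = estimate_noise(image, t, cond_embeds)      # prompt-conditioned
    eps_uncond = estimate_noise(image, t, uncond_embeds)  # "" (unconditional)
    # scale > 1 extrapolates past the conditional estimate, strengthening the prompt
    return eps_uncond + scale * (eps_cond - eps_uncond)
```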
we implement the iterative_denoise_cfg() function and show 5 images of “a high quality photo” with scale=7:
you can see that the images are a lot more realistic (and make more sense)!
image-to-image translation
previously, we took a real image, added noise, and denoised it, essentially making edits to the existing image. this works because the diffusion model "hallucinates" when denoising an image, which forces it to "edit" the image in creative ways; the larger the noise, the larger the edits.
now, let’s take an original test image, noise it a little, and force it back onto the natural image manifold without any conditioning; we should get an image similar to the original image. this follows the SDEdit algorithm.
to do this, we:
- run the forward process to get a noisy test image
- run the iterative_denoise_cfg function with starting indices i_start = [1, 3, 5, 7, 10, 20]. the result should be a series of “edits” to the original image which match the original image closer and closer as i_start grows (see the sketch after this list).
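a rough sketch of the procedure, reusing the assumed helpers from earlier (forward, iterative_denoise_cfg, timesteps, alphas_cumbar):

```python
edits = []
for i_start in [1, 3, 5, 7, 10, 20]:
    noisy = forward(test_im, timesteps[i_start], alphas_cumbar)  # add some noise
    edits.append(iterative_denoise_cfg(noisy, i_start))          # project back onto the manifold
```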
campanile
sather gate
rock
editing hand-drawn and web images
we can try the same thing with images from the web and our own drawings
web image: drake’s album cover ‘for all the dogs’
hand-drawn image 1: plane in the sky
hand-drawn image 2: two lovely flowers
inpainting
using the same procedure, we can perform inpainting, which follows the RePaint paper. to do this, we create a mask over an area of the image and set it to 1 (the portion we want to edit/generate new content in) and 0 elsewhere. we then run the diffusion denoising loop, but after every step we force the pixels where the mask m is 0 back to the original image (noised to the appropriate amount for that timestep).
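the per-step correction is tiny; a sketch using the earlier assumed names:

```python
def inpaint_step(x_t, original, mask, t):
    # keep the generated content where mask == 1, and force a freshly-noised
    # copy of the original image everywhere else (mask == 0)
    return mask * x_t + (1 - mask) * forward(original, t, alphas_cumbar)
```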
campanile
sather gate
rock
i had to run the diffusion model a couple of times since it was not trained for this specific task. after a few tries, i got some nice results!
text-conditioned image-to-image translation
we can do the same thing, but instead of giving it the prompt "a high quality photo", we can use different prompts! this guides the generated image with the text prompt rather than just projecting it onto the natural image manifold.
we use noise levels [1, 3, 5, 7, 10, 20] for each prompt, which should give us images that resemble the original image more and more closely as the noise level (i_start) grows.
prompt: “a rocket ship”
prompt: “a man wearing a hat”
this one is slightly harder to tell, especially for higher noise levels, due to the mask being smaller compared to the rocket ship. it is expected that for higher noise levels the figure of the man slowly disappears, since the result is meant to resemble the original picture more and more.
however, we do see in the first few images that we get a figure standing in the center of the image - almost as if they are posing for a picture under sather gate!
it is really hard to tell if the man is actually wearing a hat due to the scale of the images.
prompt: “a pencil”
overall, good results!
visual anagrams
one cool thing we can do with diffusion models is create visual anagrams - images that look like one thing right-side up and like another when flipped upside down.
to do this, at each denoising step we denoise the image x_t with the first prompt to obtain a first noise estimate. we then flip x_t upside down and denoise it with the second prompt to get a second noise estimate, which we flip back so it is upright. we then average the two noise estimates and use the average to perform the usual denoising/diffusion step.
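a rough sketch of the combined noise estimate (estimate_noise is the same assumed cfg-style wrapper; torch.flip on the height dimension does the flipping):

```python
def anagram_noise_estimate(x_t, t, embeds_1, embeds_2):
    eps_1 = estimate_noise(x_t, t, embeds_1)                              # first prompt, upright
    flipped = torch.flip(x_t, dims=[-2])                                  # flip the image upside down
    eps_2 = torch.flip(estimate_noise(flipped, t, embeds_2), dims=[-2])   # second prompt, flipped back
    return (eps_1 + eps_2) / 2                                            # average the two estimates
```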
prompts: “an oil painting of people around a campfire” + “an oil painting of an old man”
prompts: “an oil painting of people around a campfire” + “an oil painting of an old man” (v2)
prompts: “an oil painting of people around a campfire” + “a photo of a dog”
prompts: “an oil painting of people around a campfire” + “a photo of a dog” (v2)
prompts: “an oil painting of a snowy mountain village” + “an oil painting of an old man”
prompts: “an oil painting of a snowy mountain village” + “an oil painting of an old man” (v2)
hybrid images
we can create hybrid images in a similar way to the anagrams above. to do this, we implement factorized diffusion: instead of averaging the two noise estimates, we combine the low frequencies of the estimate from one prompt with the high frequencies of the estimate from the other.
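a rough sketch, with an illustrative gaussian blur as the low-pass filter (the kernel size and sigma are not necessarily the exact values i used):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x_t, t, embeds_low, embeds_high):
    eps_low = estimate_noise(x_t, t, embeds_low)    # prompt seen from far away
    eps_high = estimate_noise(x_t, t, embeds_high)  # prompt seen up close
    low_pass = TF.gaussian_blur(eps_low, kernel_size=33, sigma=2.0)
    high_pass = eps_high - TF.gaussian_blur(eps_high, kernel_size=33, sigma=2.0)
    return low_pass + high_pass
```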
prompt: “a lithograph of waterfalls” + “a lithograph of a skull” x2
prompt: “a peacock” + “a forest”
prompt: “a lithograph of an elephant” + “a lithograph of istanbul”
reflection
part a of the project was incredibly cool, especially the visual anagrams. i feel like even for a human artist, it is quite hard to come up with these visual anagrams, but it takes the diffusion model just a couple of seconds. it is also really interesting what sorts of new content the diffusion model comes up with!
part b: training a diffusion model
overview
now that we have played around with a pre-trained diffusion model, let’s train our own!
training a single-step denoising UNet
to begin, i will implement a simple one-step denoiser. we aim to train a denoiser D_theta such that a noisy image z is mapped to a clean image x. we can do this by optimizing over the L2 loss:
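written out, this is the standard denoising objective:

$$L = \mathbb{E}_{z,x}\left[\lVert D_\theta(z) - x \rVert^2\right]$$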
implementing the UNet
for this project, we will implement the denoiser as a unet.
training a denoiser with the UNet
to do this, we generate training data pairs of (z, x) where x represents a clean MNIST digit. for each training batch, we generate z from x with the following noising process:
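in other words, we add scaled gaussian noise to the clean image:

$$z = x + \sigma\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$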
using this equation, we can add noise to MNIST digits with varying sigma values [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0].
training
i first trained a denoiser to denoise a noisy image z with sigma=0.5 applied to a clean image x. this was done with the MNIST dataset via torchvision.datasets.MNIST, using the recommended batch size of 256, and trained for 5 epochs. the adam optimizer with a learning rate of 1e-4 was used.
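a minimal sketch of the training loop, assuming UNet is the denoiser implemented above (the hidden-dimension argument name and value here are illustrative):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

dataset = datasets.MNIST(root="data", train=True, download=True,
                         transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=256, shuffle=True)

model = UNet(hidden_dim=128).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(5):
    for x, _ in loader:                        # labels are unused here
        x = x.cuda()
        z = x + 0.5 * torch.randn_like(x)      # noising process with sigma = 0.5
        loss = nn.functional.mse_loss(model(z), x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```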
the denoised results of the 1st and 5th epoch of the digits are the following:
you can see that the results after 5 epochs of training are sharper and less blurry compared to the digits with only 1 epoch of training. this entire procedure took ~7 minutes to run on a Colab T4 GPU.
out-of-distribution testing
we trained the denoiser on sigma=0.5. let's see it in action with varying levels of noise. for the testing process, i used sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0].
you can see that as the noise gets larger, the denoised output gets progressively worse. this is expected since it is harder for the denoiser to map back to the original, clean image.
training a diffusion model
instead of using the UNet to predict the clean image, we can instead use it to predict the noise.
adding time conditioning to UNet
to implement the ddpm, we can inject timestep t into the architecture.
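one common way to do this (and roughly the approach assumed in the sketch below) is to embed the normalized timestep with a small fully-connected block and use it to modulate feature maps inside the unet:

```python
import torch
from torch import nn

class FCBlock(nn.Module):
    """tiny mlp that turns a scalar timestep into a conditioning vector."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, t):
        return self.net(t)

# inside the unet's forward pass (shapes illustrative; t is already normalized to [0, 1]):
#   t_embed = self.fc_t(t.view(-1, 1))
#   features = features * t_embed.view(-1, C, 1, 1)   # broadcast over H x W
```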
training the unet
we train the time-conditioned unet using the following approach. rather than predicting the clean image directly from a noisy image, we focus on estimating the amount of noise added to the image. in this process, we start with clean images, add noise to them, predict the added noise, and then calculate the loss by comparing it to the actual noise introduced.
we train with hidden dimension=64, batch_size=128, and over 20 epochs. we use an adam optimizer with an initial learning rate of 1e-3 and apply an exponential learning rate decay scheduler, stepping it every epoch.
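a rough sketch of a single training step, assuming betas defines a ddpm noise schedule of length T, alphas_cumbar is the cumulative product of (1 - betas), and the unet takes the normalized timestep as its second argument:

```python
import torch
from torch import nn

def train_step(model, x, optimizer, alphas_cumbar, T):
    t = torch.randint(0, T, (x.shape[0],), device=x.device)   # random timestep per image
    abar = alphas_cumbar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = abar.sqrt() * x + (1 - abar).sqrt() * eps            # forward (noising) process
    loss = nn.functional.mse_loss(model(x_t, t / T), eps)      # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```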
here are our training losses:
sampling from the unet
to sample from the unet, we use the following algorithm:
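roughly, the loop looks like this - a sketch under my earlier assumptions about the schedule tensors (alphas, betas, alphas_cumbar), not the exact code i ran:

```python
import torch

@torch.no_grad()
def sample(model, T, shape=(10, 1, 28, 28), device="cuda"):
    x = torch.randn(shape, device=device)                          # start from pure noise
    for t in range(T - 1, -1, -1):
        ts = torch.full((shape[0],), t, device=device)
        eps = model(x, ts / T)                                     # predicted noise
        a, abar, b = alphas[t], alphas_cumbar[t], betas[t]
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        # ddpm update: remove the predicted noise, then add back a bit of fresh noise
        x = (x - (1 - a) / (1 - abar).sqrt() * eps) / a.sqrt() + b.sqrt() * z
    return x
```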
we can see that the results after the 20th epoch come out better than the results after the 5th epoch.
adding class conditioning to UNet
our goal is to extend the time-conditioned unet by introducing class-conditioning, allowing it to generate images of specific digits. this involves embedding class information into the network at the same locations as the timestep data. we obtain class labels from the mnist dataset and feed a one-hot encoded vector for each class into the model. to ensure flexibility, we set the class conditioning vector to zero with a probability of p_uncond = 0.1, enabling the unet to operate independently of class information.
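the conditioning-vector construction is small; a sketch:

```python
import torch
import torch.nn.functional as F

def make_class_vector(labels, num_classes=10, p_uncond=0.1):
    c = F.one_hot(labels, num_classes).float()                     # one-hot class labels
    drop = torch.rand(labels.shape[0], device=labels.device) < p_uncond
    c[drop] = 0.0                                                  # drop class info with prob p_uncond
    return c
```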
here’s how the algorithm works:
training the unet
we train the class-conditioned unet in the same way as before: start with clean images, add noise, predict the added noise (now also feeding in the class-conditioning vector), and compute the loss against the actual noise that was introduced.
here are our training losses:
sampling from the unet
to sample from the unet, we use the following algorithm:
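a rough sketch, assuming classifier-free guidance is applied at sampling time (the guidance scale of 5 is an illustrative choice, and the schedule tensors are the same assumptions as before):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_class(model, digit, T, scale=5.0, shape=(10, 1, 28, 28), device="cuda"):
    c = F.one_hot(torch.full((shape[0],), digit, device=device), 10).float()
    x = torch.randn(shape, device=device)
    for t in range(T - 1, -1, -1):
        ts = torch.full((shape[0],), t, device=device) / T
        eps_cond = model(x, ts, c)                             # class-conditioned estimate
        eps_uncond = model(x, ts, torch.zeros_like(c))         # unconditional estimate
        eps = eps_uncond + scale * (eps_cond - eps_uncond)     # classifier-free guidance
        a, abar, b = alphas[t], alphas_cumbar[t], betas[t]
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = (x - (1 - a) / (1 - abar).sqrt() * eps) / a.sqrt() + b.sqrt() * z
    return x
```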
we can see that the results after the 20th epoch come out better than the results after the 5th epoch. specifically, for the 5th epoch, we can see that a 0 looks weird and a 9 looks funky as well.