How to Generate Images with Stable Diffusion in Seconds, for Pennies | by Lak Lakshmanan | Aug, 2022

By Jessie Hobb On Aug 25, 2022

And the limitations (today!) of this approach

The authors of Stable Diffusion, a latent text-to-image diffusion model, have released the weights of the model and it runs quite easily and cheaply on standard GPUs. This article shows you how you can generate images for pennies (it costs about 65c to generate 30–50 images).
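To make the "pennies" claim concrete, here is the arithmetic as a small sketch. It assumes the 65c figure covers the whole session in which the 30 to 50 images are generated; the function name is only for illustration.

```python
def cost_per_image_cents(total_cents, num_images):
    """Average cost of one generated image, in cents."""
    return total_cents / num_images

TOTAL_CENTS = 65  # figure quoted in this article

low = cost_per_image_cents(TOTAL_CENTS, 50)   # if the session yields 50 images
high = cost_per_image_cents(TOTAL_CENTS, 30)  # if the session yields 30 images
print(f"{low:.1f} to {high:.1f} cents per image")
```

So each image works out to roughly one to two cents.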

Start a Vertex AI Notebook

The Stable Diffusion model is written in PyTorch and works best if you have more than 10 GB of RAM and a reasonably modern GPU.

On Google Cloud, go to the Vertex AI Workbench by opening the link https://console.cloud.google.com/vertex-ai/workbench

Creating a PyTorch notebook in Google Cloud

Then, create a new Vertex AI PyTorch notebook with an NVIDIA Tesla T4. Accept the defaults. This instance cost about 65c an hour when I did it.

Remember to stop the notebook or delete it once you are done with it. The difference? If you stop the notebook, you will be charged for the disk (a few cents a month, but it allows you to start back faster next time). If you delete the notebook, you will have to start afresh. In either case, you won’t have to pay for the GPU, which is the bulk of that 65c/hr expense.

While the instance is starting, do the next step.

Register for a Hugging Face account

The weights are released on Hugging Face Hub, and so you will need to create an account and accept the terms under which the weights are released. Please do that by:

Clone my notebook and create token.txt

I have conveniently put the code in this article on GitHub, so simply clone my notebook, which is in this repository:

https://github.com/lakshmanok/lakblogs

and open the notebook stablediffusion/stable_diffusion.ipynb

Right-click on the navigation pane and create a new text file. Call it token.txt and paste your access token (from the previous section) into that file.

Install packages

The first cell of the notebook simply installs the Python packages needed (run the cells in the notebook one by one):

pip install --upgrade --quiet diffusers transformers scipy

Restart the IPython kernel once you do this using the button on the notebook:
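After the restart, a quick check like the one below can confirm that the new kernel actually sees the packages. This is a minimal sketch: the helper name is made up for illustration, and it assumes that torch is already provided by the PyTorch notebook image.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be found in this kernel."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# The three packages from the pip command, plus torch (assumed preinstalled).
required = ["diffusers", "transformers", "scipy", "torch"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

If anything is listed as missing, rerun the install cell and restart the kernel again.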

Read the access token

Remember the access token you pasted into token.txt? Let’s read it:

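# Read the Hugging Face access token saved in token.txt
with open('token.txt') as ifp:
    # strip() drops the trailing newline that readline() would otherwise keep
    access_token = ifp.readline().strip()
    print('Read a token of length {}'.format(len(access_token)))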
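A stray space or newline is easy to introduce when pasting the token into a file, and an empty file fails much later with a less obvious error. A slightly more defensive reader might look like this; `read_token` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path

def read_token(path="token.txt"):
    """Read an access token from a text file, ignoring surrounding whitespace."""
    token = Path(path).read_text().strip()
    if not token:
        # Fail early with a clear message instead of at model-download time.
        raise ValueError(f"{path} is empty; paste your access token into it")
    return token
```

In the notebook this would be used as `access_token = read_token()`.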

Load the model weights

To load the model weights, use a Hugging Face library called diffusers:

def load_pipeline(access_token):
    import torch
    from diffusers import StableDiffusionPipeline

    model_id = "CompVis/stable-diffusion-v1-4"
    device = "cuda"  # requires an NVIDIA GPU
    # Load the half-precision (fp16) weights to reduce memory use and run faster
    pipe = StableDiffusionPipeline.from_pretrained(model_id,
                                                   torch_dtype=torch.float16,
                                                   revision="fp16",
                                                   use_auth_token=access_token)
    pipe = pipe.to(device)
    return pipe

I’m using the half-precision (fp16) weights here, a slightly worse version of the model, so that it executes fast. Read the Hugging Face documentation for other options.
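One reason the `torch.float16` setting in the loader helps is simple arithmetic: each weight takes 2 bytes instead of 4. The sketch below shows the effect on the size of the weights. The parameter count is a round number chosen purely for illustration, not a figure from this article.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weights_size_gb(num_params, dtype):
    """Approximate memory taken by the weights alone, in gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

NUM_PARAMS = 1_000_000_000  # assumption for illustration only

for dtype in ("float32", "float16"):
    print(dtype, weights_size_gb(NUM_PARAMS, dtype), "GB")
```

Halving the storage per weight halves the memory the weights need, which leaves more GPU memory for the generation itself.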

Create an image for a text prompt

To create an image for a text prompt, you simply call the pipeline created above passing in a text prompt.

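def generate_image(pipe, prompt):
    from torch import autocast

    # autocast runs the pipeline in mixed precision on the GPU
    with autocast("cuda"):
        # ["sample"] matches the diffusers release this was written against;
        # newer releases expose the images as .images instead
        image = pipe(prompt.lower(), guidance_scale=7.5)["sample"][0]
    # Name the output file after the prompt
    outfilename = prompt.replace(' ', '_') + '.png'
    image.save(outfilename)
    return outfilename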
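Building the filename by replacing spaces works for simple prompts, but a prompt containing a slash, a question mark, or a very long sentence can produce a filename the file system rejects. A possible alternative is sketched below; `prompt_to_filename` and its `max_len` default are hypothetical, not something the notebook defines.

```python
import re

def prompt_to_filename(prompt, max_len=80):
    """Turn a free-text prompt into a file-system-safe .png filename."""
    # Keep ASCII letters and digits; collapse every other run of
    # characters (spaces, punctuation, slashes) into one underscore.
    stem = re.sub(r"[^A-Za-z0-9]+", "_", prompt).strip("_")
    # Truncate long prompts and fall back to a fixed name if nothing is left.
    return (stem[:max_len] or "image") + ".png"

print(prompt_to_filename("Bald guy being easily impressed by a robot"))
```

Note that this drops non-ASCII characters, which may not be what you want for prompts in other scripts.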

Here, I’m passing in the prompt “Bald guy being easily impressed by a robot”:

pipeline = load_pipeline(access_token)
outfilename = generate_image(pipeline, prompt="Bald guy being easily impressed by a robot")

This took less than a minute, and is good enough quality for presentations, storyboards, and the like. Not bad, eh?

Limited to its training set

AI models are limited by what they are trained on. Let’s pass in a cultural reference that the model is unlikely to have seen much training data on:

outfilename = generate_image(pipeline, prompt="Robots in the style of Hindu gods creating new images")

The result?

Robots in the style of Hindu gods creating new images

Well, it’s kinda picked the pose of Ganesha and endowed him with machine-like limbs, and used Tibetan prayer-wheels for the images. There is no magic here: ML models simply regurgitate bits and pieces of what they have seen in the training dataset, and that is what is going on.

My cultural reference here was to the gods churning the ocean of milk, and that flew completely over the model’s head:

Google Image Search knows all about the Hindu creation myth of churning the ocean of milk

Let’s see if we can jog the model’s memory by explicitly passing in the specific term that allowed Google Image Search to retrieve all those images:

outfilename = generate_image(pipeline, prompt="Robots churning the ocean of milk to create the world")

Does this look like robots churning an ocean of milk?

That doesn’t help either. The Hindu creation myths must not have been part of the dataset used in training the model.

Other limitations

So cultural references are out. What else? The model won’t generate realistic faces or textual signs — I’ll let you try these out. Each instantiation starts from a random set of points, so there is no way to build a set of images that have consistency (like a comic book).
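To see what "starts from a random set of points" means in practice, here is a toy sketch. It uses Python's `random` module rather than the model's own sampler, and `starting_noise` is a made-up name, so treat it as an illustration of the idea only.

```python
import random

def starting_noise(seed=None, size=4):
    """Toy stand-in for the random starting points of one generation run."""
    rng = random.Random(seed)  # seed=None gives a fresh, unpredictable state
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

# The same seed reproduces the same starting points ...
print(starting_noise(seed=42) == starting_noise(seed=42))
# ... while a different seed (or no seed) starts somewhere else.
print(starting_noise(seed=42) == starting_noise(seed=43))
```

Fixing a seed makes a single run repeatable, but that is not the same as getting the same character or scene across different prompts.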

Also, these are simply the limitations today. Someone’s eventually going to be able to train on a larger dataset, and figure out how to keep it from generating toxic content.

Still, image generation used to require serious horsepower, and we can now do it on a bog-standard GPU and 15 GB of RAM. This is essentially Cloud Functions territory: you can easily imagine taking my code above and putting it into a Cloud Function so that it becomes an image generation API.
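As a rough sketch of what the core of such an API could look like, the handler below validates a request and delegates to an image generator. Everything here is an assumption for illustration: the payload shape, the function names, and the stand-in generator. Wiring it to an actual Cloud Functions entry point and loading the model there are left out.

```python
def handle_generate_request(payload, generate_fn):
    """Validate a JSON-like payload and return (response_body, http_status)."""
    prompt = (payload or {}).get("prompt", "")
    if not isinstance(prompt, str) or not prompt.strip():
        return {"error": "missing 'prompt'"}, 400
    prompt = prompt.strip()
    # generate_fn is expected to write the image and return its filename,
    # for example: lambda p: generate_image(pipeline, p)
    outfilename = generate_fn(prompt)
    return {"prompt": prompt, "image": outfilename}, 200

# Stand-in generator so the sketch runs without a GPU or the model weights.
def fake_generate(prompt):
    return prompt.replace(" ", "_") + ".png"

print(handle_generate_request({"prompt": "a robot painting"}, fake_generate))
```

Passing the generator in as an argument keeps the request handling testable on a machine that has no GPU.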

Conclusion

To finish off, here are a couple more images generated by the model, along with the prompts that generated them:

How cool is it that you are able to generate images corresponding to text prompts in seconds for pennies?

My notebook is on GitHub at https://github.com/lakshmanok/lakblogs/blob/main/stablediffusion/stable_diffusion.ipynb

Enjoy!