
Stable Diffusion as an API: Make a Person-Removing Microservice
by Mason McGough | February 2023



Landscape image produced using Stable Diffusion 2 (by author).

Remove people from photos with a Stable Diffusion microservice

Stable Diffusion is a cutting-edge open-source tool for generating images from text. The Stable Diffusion Web UI opens up many of these features with an API as well as the interactive UI. We will first introduce how to use this API, then set up an example using it as a privacy-preserving microservice to remove people from images.

So many innovations in machine learning-based generative models happened last year that you could call 2022 the “Year of Generative AI.” We had DALL-E 2, the text-to-image generation model from OpenAI that produced strikingly realistic images of astronauts riding horses and dogs wearing people’s clothes. GitHub Copilot, the powerful code completion tool that will autocomplete statements, write documentation, and implement entire functions for you from a single comment, was released to the public as a subscription service. We had Dream Fields, DreamFusion, and Magic3D, a series of groundbreaking models capable of producing textured 3D models from text alone. Last but certainly not least we had ChatGPT, the cutting-edge AI chatbot which these days needs no introduction.

This list barely even scratches the surface. In just the world of generative image models like DALL-E 2 we also have Midjourney, Google Imagen, StarryAI, WOMBO Dream, NightCafe, InvokeAI, Lexica Aperture, Dream Studio, Deforum… I think you get the picture. 😉 📷 It seems like no exaggeration to say that generative AI has captured the imagination of the whole world.

While many of the popular generative AI tools like ChatGPT, GitHub Copilot, and DALL-E 2 are proprietary and paywalled, the open-source community has not skipped a beat. Last year, LMU Munich, Runway, and Stability AI collaborated to publicly share Stable Diffusion, a powerful text-to-image model efficient enough to run on consumer hardware. This means that anyone with a decent GPU and an internet connection can download the Stable Diffusion code and model weights, bringing low-cost image generation to the world.

The Stable Diffusion Web UI, one of the most popular tools leveraging Stable Diffusion, exposes a wide range of the settings and features of Stable Diffusion in an interactive browser-based user interface. A lesser-known feature of this project is that you can use it as an HTTP API, allowing you to request images from your own applications.

The Stable Diffusion Web UI with an example generation (photo by author).

It has a metric truckload of features, such as inpainting, outpainting, resizing, upscaling, variations, and many more. The project wiki provides a great overview of all the features. In addition, it provides scripting for extensibility.

Setup

Before beginning, ensure that you have a GPU (NVIDIA preferably but AMD is also supported) with at least 8GB of VRAM to play with on your system. That will ensure that you can load the model into memory. Next, you want to clone the repo to your system (for instance via HTTPS):

git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git

Follow the installation instructions for your system as they may be different from mine. I used an install of Ubuntu 18.04 to set this up, but it should also work on Windows and Apple Silicon. These instructions will include setting up a Python environment, so make sure that whichever environment you set up is active when you launch the server later.

Once that is done, we need a copy of the model weights. I am using Stable Diffusion 2.0, but Stable Diffusion 2.1 is now available as well. Whichever option you pick, be sure to download the weights for the stablediffusion repository. Lastly, copy those weights to the models/Stable-diffusion folder like so:

cp 768-v-ema.ckpt models/Stable-diffusion

Now you should be ready to start generating images! To launch the server, execute the following from the root directory (be sure that the environment you set up is activated):

python launch.py

The server will take some time to get set up, as it likely needs to install requirements, load the model weights into memory, and check for embeddings, among other things. When it is ready, you should see a message in your terminal that looks like this:

Running on local URL:  http://127.0.0.1:7860

The UI is browser-based, so navigate to “127.0.0.1:7860” in your favorite web browser. If it is working, it should look something like this:

The Stable Diffusion Web UI when first opened (photo by author).

Usage

You should now be ready to generate some images! Go ahead and generate something by entering text into the “Prompt” field and clicking “Generate.” If this is your first time using this UI, take a second to explore and learn some of its features and settings. Refer to the wiki if you have any questions. This knowledge will come in handy later when designing your API.

I will not delve too deep into how to use the web UI since many others before me have done so. However, I will provide the following cheat sheet of basic settings for reference.

  • Sampling method: The sampling algorithm. This can greatly affect the content of the generated image and overall appearance. The execution time and the results can differ greatly between methods. Ideally experiment with this option first.
  • Sampling steps: The number of denoising steps during the image generation process. Some results will change drastically with the number of steps whereas others will quickly lead to diminishing returns. A value of 20–50 is ideal for most samplers.
  • Width, Height: The output image dimensions. For SD 2.0, 768×768 is the preferred resolution. The resolution can affect the generated content.
  • CFG scale: The Classifier-Free Guidance (CFG) scale. Increasing this increases how much the image is impacted by the prompt. Lower values produce more creative results.
  • Denoising strength: Determines how much variation on the original image to allow for. A value of 0.0 results in no change. A value of 1.0 disregards the original image entirely. Starting with a value between 0.4–0.6 is generally a safe option.
  • Seed: The random seed value. Useful when you want to compare the effect of a setting with as little variation as possible. If you like a particular generation but want to modify it a bit, copy the seed.

The web UI is meant for a single user and works great as an interactive art tool for making your own creations. However, if we want to build applications using this as the engine then we will want an API. A lesser-known (and lesser-documented) feature of the stable-diffusion-webui project is that it also has a built-in API. The web UI is built with Gradio but there is also a FastAPI app that can be launched with the following:

python launch.py --nowebui

This gives us an API that exposes many of the features we had in the web UI. We can send POST requests with our prompt and parameters and receive responses that contain output images.

As an example, we will now set up a simple microservice that removes people from photos. This has many applications, such as preserving the privacy of individuals. We can use Stable Diffusion as a rudimentary privacy-preserving filter that removes people from photos without any unsightly mosaicking or pixel blocking.

Note that this is a basic setup; it does not include encryption, load-balancing, multitenancy, RBAC, or any other features. This setup may not be suitable for production, but it can be useful for setting up applications on a home or private server.

Start application in API mode

The following instructions will use the server in API mode, so go ahead and stop the web UI for now with CTRL+C. Start it up again in API mode with the --nowebui option:

python launch.py --nowebui

The server should print something like this when it is ready:

INFO:     Uvicorn running on http://127.0.0.1:7861 (Press CTRL+C to quit)
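Note that in API mode the server listens on port 7861 rather than 7860. The FastAPI app should also serve interactive documentation at http://127.0.0.1:7861/docs, which is a convenient way to browse the available endpoints. As a quick sanity check, the sketch below queries the samplers endpoint; the endpoint path and response shape are assumptions based on recent versions of the web UI and may differ in yours.

import requests

if __name__ == '__main__':
    base_url = 'http://127.0.0.1:7861'

    # List the available samplers; any successful response means the API is up.
    # Each entry is expected to include a 'name' field.
    response = requests.get(f'{base_url}/sdapi/v1/samplers')
    response.raise_for_status()
    print([sampler['name'] for sampler in response.json()])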

Send a request to the API

First, let us demonstrate how to make a request to the API. We will send a POST request to the txt2img (i.e. “text-to-image”) endpoint of the application to generate an image.

We will use the requests package, so install that if you have not already:

pip install requests

We can send a request containing a prompt as a simple string. The server will return an image as a base64-encoded PNG file, which we will need to decode. To decode a base64 image, we simply use base64.b64decode(b64_image). The following script should be all you need to test this out:

import json
import base64

import requests

def submit_post(url: str, data: dict):
    """
    Submit a POST request to the given URL with the given data.
    """
    return requests.post(url, data=json.dumps(data))

def save_encoded_image(b64_image: str, output_path: str):
    """
    Save the given image to the given output path.
    """
    with open(output_path, "wb") as image_file:
        image_file.write(base64.b64decode(b64_image))

if __name__ == '__main__':
    txt2img_url = 'http://127.0.0.1:7861/sdapi/v1/txt2img'
    data = {'prompt': 'a dog wearing a hat'}
    response = submit_post(txt2img_url, data)
    save_encoded_image(response.json()['images'][0], 'dog.png')

Copy the contents to a file and name it sample-request.py. Now execute this with:

python sample-request.py

If it worked, it should save a copy of the image to the file dog.png. Mine looked like this dapper fellow:

Image created with ‘sample-request.py’ (photo by author).

Keep in mind that your results will vary from mine. If you encounter issues, double-check the output from the terminal running the stable diffusion app. It could be that the server was not finished setting up yet. If you get an issue like “404 Not Found,” double-check that the URL was typed correctly and is pointing to the correct address (e.g. 127.0.0.1).
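The request body is not limited to the prompt. Most of the settings from the cheat sheet above map onto fields of the txt2img payload. The field names in the sketch below (negative_prompt, steps, cfg_scale, width, height, sampler_name, seed) are assumptions based on recent versions of the web UI API; if one is rejected, check your server’s interactive docs.

import json

import requests

# A txt2img payload with more of the settings exposed (field names assumed
# from the web UI API; verify against your version's /docs page)
txt2img_url = 'http://127.0.0.1:7861/sdapi/v1/txt2img'
data = {
    'prompt': 'a dog wearing a hat',
    'negative_prompt': 'blurry, low quality',
    'steps': 30,               # sampling steps
    'cfg_scale': 7.0,          # classifier-free guidance scale
    'width': 768,              # SD 2.0 prefers 768x768
    'height': 768,
    'sampler_name': 'Euler a',
    'seed': 42,                # fix the seed for reproducible output
}
response = requests.post(txt2img_url, data=json.dumps(data))

As before, the images list in the JSON response contains base64-encoded PNGs that can be decoded and saved with save_encoded_image.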

Masking an image

If all is working so far, then great! But how can we use this to modify images we already have? For that we will want to use the img2img (i.e. “image-to-image”) API. This API uses stable diffusion to modify an image that you submit. We will use the inpainting feature: given an image and a mask, the inpainting technique will try to replace the masked portion of the image with content generated by stable diffusion. The mask acts as a weight that smoothly interpolates between the original image and a generation to blend the two together.
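In terms of the API, inpainting is simply an img2img request that includes a mask. A sketch of such a request is shown below; the init_images, mask, denoising_strength, mask_blur, and inpainting_fill fields, along with the mask.png filename and the encode_file_to_base64 helper, are my assumptions for illustration, so double-check the field names against your server’s interactive docs.

import base64
import json

import requests

def encode_file_to_base64(path: str) -> str:
    # Read a file from disk and return its contents as a base64 string
    with open(path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

img2img_url = 'http://127.0.0.1:7861/sdapi/v1/img2img'
data = {
    'prompt': 'mountain scenery, landscape, trail',
    'init_images': [encode_file_to_base64('woman-on-trail.png')],
    'mask': encode_file_to_base64('mask.png'),  # white = inpaint, black = keep (hypothetical mask file)
    'denoising_strength': 0.75,                 # how far to stray from the original pixels
    'mask_blur': 16,                            # feather the mask edges
    'inpainting_fill': 1,                       # 1 should correspond to "original" masked content
    'width': 1152,
    'height': 768,
}
response = requests.post(img2img_url, data=json.dumps(data))
with open('inpainted.png', 'wb') as f:
    f.write(base64.b64decode(response.json()['images'][0]))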

Rather than make a mask by hand, we will attempt to generate one using one of the many pre-trained computer vision models available to us. We will use the “person” class of the model outputs to generate a mask. While an object detection model would work, I chose to use a segmentation model so that you can experiment with using either dense masks or bounding boxes.

We will need a sample image to test with. We could download one from the Internet, but in the spirit of preserving privacy (and copyright), why not make one with stable diffusion? The following is one I generated with the prompt “beautiful mountain landscape, a woman walking away from the camera.”

Image generated by stable diffusion (photo by author).

You can download this one, but I encourage you to try to generate one yourself. Of course, you can use real photos as well. The following is minimal code that applies a stock segmentation model from torchvision to this image and overlays the predicted person mask.

import torch
from torchvision.models.segmentation import fcn_resnet50, FCN_ResNet50_Weights
from torchvision.io.image import read_image
from torchvision.utils import draw_segmentation_masks
import matplotlib.pyplot as plt

if __name__ == '__main__':
    img_path = 'woman-on-trail.png'

    # Load the pre-trained segmentation model
    weights = FCN_ResNet50_Weights.DEFAULT
    model = fcn_resnet50(weights=weights, progress=False)
    model = model.eval()

    # Load the image as a uint8 tensor of shape (3, H, W)
    img = read_image(img_path)

    # Run the model on a batch containing the single image
    input_tform = weights.transforms(resize_size=None)
    batch = torch.stack([input_tform(img)])
    output = model(batch)['out']

    # Map category names (e.g. 'person') to output channel indices and
    # convert the raw class scores to per-pixel probabilities
    sem_class_to_idx = {cls: idx for (idx, cls) in enumerate(weights.meta['categories'])}
    normalized_mask = torch.nn.functional.softmax(output, dim=1)

    # Keep the pixels whose most likely class is 'person' and overlay them
    class_dim = 1
    binary_masks = (normalized_mask.argmax(class_dim) == sem_class_to_idx['person'])
    img_masked = draw_segmentation_masks(img, masks=binary_masks, alpha=0.7)
    plt.imshow(img_masked.permute(1, 2, 0).numpy())
    plt.show()

Like before, copy this to a file named segment-person.py. Execute the code with the following:

python segment-person.py

The resulting prediction should look something like this:

Result of segmentation mask applied to image (photo by author).

We now have the machinery to make requests to the API and to segment the people in an image. Now we can start building out our microservice.
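One piece that is still missing is converting the boolean mask tensor into something the API understands: a black-and-white PNG, base64-encoded like the images. A sketch of that conversion, continuing from the variables in segment-person.py, is shown below; the mask.png filename is hypothetical and the actual preprocessing in the Gist may differ.

import base64
import io

import numpy as np
from PIL import Image

# Continuing from segment-person.py: binary_masks has shape (1, H, W) with
# True wherever a person was detected. Convert it to a white-on-black mask.
mask_array = binary_masks[0].numpy().astype(np.uint8) * 255
mask_image = Image.fromarray(mask_array, mode='L')
mask_image.save('mask.png')  # hypothetical filename

# Base64-encode the mask in memory so it can be sent in an img2img request
buffer = io.BytesIO()
mask_image.save(buffer, format='PNG')
b64_mask = base64.b64encode(buffer.getvalue()).decode('utf-8')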

Person removal microservice

Let us now turn to our practical example: removing people from images. The microservice should do the following:

  1. Read a number of input arguments
  2. Load an image from a file
  3. Apply a segmentation model with the class “person” to the image to create a mask
  4. Convert the image and mask to base64 encoding
  5. Send a request containing the base64-encoded image, the base64-encoded mask, the prompt, and any arguments to the img2img API of the local server
  6. Decode and save the output image as a file

Since we already covered all of these steps individually, the microservice has already been implemented for you in this GitHub Gist. Now download the script and execute it on the image “woman-on-trail.png” (or whichever image you like) using the following command:

python inpaint-person.py woman-on-trail.png -W 1152 -H 768

The -W and -H indicate the desired output width and height, respectively. It will save the generated image as inpaint-person.png and the corresponding mask as mask_inpaint-person.png. Yours will be different, but this is the output I received:

Result of API call using the raw segmentation mask (image by author).

Hmm, not quite what we are looking for. It seems that much of the person still remains, particularly the silhouette. We may need to mask a larger area. For this, let us try converting the mask to a bounding box, which we can do using the -B flag.
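Internally, converting the segmentation mask to a bounding-box mask can be as simple as finding the extremes of the nonzero pixels and filling that rectangle. A rough sketch is below; it is not necessarily how the Gist implements it.

import numpy as np

def mask_to_bbox_mask(mask: np.ndarray) -> np.ndarray:
    # mask: 2D uint8 array with nonzero pixels where a person was detected.
    # Returns a mask of the same shape with the bounding box filled in white.
    ys, xs = np.nonzero(mask)
    bbox_mask = np.zeros_like(mask)
    if len(ys) > 0:
        bbox_mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 255
    return bbox_mask

With the bounding-box option enabled, the call becomes: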

python inpaint-person.py woman-on-trail.png -W 1152 -H 768 -B

The output I received is this:

Result of API call using a bounding box as mask (photo by author).

That is also not quite right! A concrete column is not something we would expect to find in the middle of a trail. Perhaps bringing in a prompt will help steer things in the right direction. We use the -p flag to add the prompt “mountain scenery, landscape, trail” to the request. We also dilate the bounding box with -D 32 to remove some of the edge effects and blur the bounding box with -b 16 to blend the mask with the background a bit.
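Dilating and blurring a mask are standard image operations; a sketch using Pillow is below, with a dilation radius of 32 and blur radius of 16 to mirror the -D 32 and -b 16 flags. The Gist may implement these steps differently, the mask.png filename is the hypothetical one from earlier, and the web UI’s own mask_blur parameter is another way to get the blur.

from PIL import Image, ImageFilter

mask_image = Image.open('mask.png').convert('L')  # hypothetical mask file

# Dilate: a max filter grows the white region; the kernel size must be odd,
# so a dilation radius of 32 pixels becomes a 65-pixel window (2 * 32 + 1)
dilated = mask_image.filter(ImageFilter.MaxFilter(2 * 32 + 1))

# Blur: soften the mask edge so the inpainted region blends with the background
blurred = dilated.filter(ImageFilter.GaussianBlur(16))
blurred.save('mask_dilated_blurred.png')

The full invocation then becomes: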

python inpaint-person.py woman-on-trail.png \
-W 1152 -H 768 \
-b 16 -B -D 32 \
-p "mountain scenery, landscape, trail"

With this I received the following output:

Result of final API call (photo by author).

Now that is looking plausible! Keep playing around with different images, settings, and prompts to get it working for your use case. To see a complete list of the arguments and options available with this script, enter python inpaint-person.py -h.

Your images will very likely look quite different from the ones above. Because the process is inherently stochastic, Stable Diffusion can produce radically different outputs even with the same settings unless the seed is fixed. There is quite a steep learning curve to understanding all of the features and proper prompt design, and even then the results can be finicky. Making an image look exactly the way you want is difficult and requires much trial and error.

To aid in your quest, keep the following tips in mind:

  • Use the web UI to find the right parameters that work for your use case before moving to the API.
  • Rely on the prompt matrix and X/Y plot features when finetuning an image to your liking. These will help you rapidly explore the parameter search space.
  • Be mindful of the seeds. If you like a specific output but want to iterate on it, copy the seed.
  • Try a different generator like Midjourney! Every tool is slightly different.
  • Use Internet resources like Lexica as inspiration and to find good prompts.
  • Use the “Create a text file next to every image with generation parameters” option in the settings menu to keep track of the prompts and settings you use to make every image.

Most importantly, have fun!

