ControlNet with Stable Diffusion Part 1

By Lauren Jarrett

ControlNet with Stable Diffusion

Often when I have been using image generators like Stable Diffusion or DALL·E 2, I already have an idea in my head of what I want the image to be. And then I find myself getting annoyed that I can’t just slightly tweak the input to get it to show what I want. It turns out I am not alone, and ControlNet was born to do just that. It enables the user to pass in a hand drawing, a sketch or some rough outlines, along with a text prompt, to better guide the output from Stable Diffusion.

Each algorithm first produces a binary map of the input image, normally a grayscale image showing the edge lines of the objects (to varying degrees) or indicators of depth or texture in the image. ControlNet then generates output from the text prompt while identifying which areas within that binary mask should be retained in the generated image. Sound confusing? Wait until you see the examples and see it in action.

To start off, I discuss three different algorithms used in ControlNet in this article and will cover the next three in another article. Before we jump into controlling the output, let us first try a prompt and see what the output is like initially.

What I wanted the output to generate was a realistic photograph of a boutique shopfront, with a sale poster in the window and a scooter out the front. The prompt I used initially for Stable Diffusion is listed below, incorporating tips from various Reddit threads to get images more consistent with photography rather than paintings.

Original output

  • Prompt: boutique shopfront with blank poster in window and scooter out the front, realistic, photograph, cannon eos r3, high quality photography, medium shot, sharp, 3 point lighting, flash with softbox, by Annie Leibovitz, 80mm
  • Negative prompt: low quality, low resolution, render, oversaturated, low contrast, abstract, blurry, haze, glow
Original output 1
Original output 2
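
For reference, here is a minimal sketch of how these prompts can be run through Stable Diffusion with the diffusers library; the checkpoint name and settings are my assumptions rather than the exact setup used for the images above.

```python
import torch
from diffusers import StableDiffusionPipeline

# Any SD 1.5-compatible checkpoint works the same way (this one is an assumption).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = (
    "boutique shopfront with blank poster in window and scooter out the front, "
    "realistic, photograph, cannon eos r3, high quality photography, medium shot, "
    "sharp, 3 point lighting, flash with softbox, by Annie Leibovitz, 80mm"
)
negative_prompt = (
    "low quality, low resolution, render, oversaturated, low contrast, "
    "abstract, blurry, haze, glow"
)

image = pipe(prompt, negative_prompt=negative_prompt).images[0]
image.save("original_output.png")
```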

Evidently, the image generator picked up about half of the prompt. Stable Diffusion is known to struggle with composition, so missing the scooter in front of the shop or the poster in the shopfront window is not 100% surprising. As a result, this is a good use case for passing an example image into the model so it can better identify what we want.

Doing a quick Google search, I found this image from BannerBuzz that better aligned with what I was looking for.

Input image

How do we pass this in and what is the output we get? Let’s dive into the three different approaches I will discuss today, starting with Canny edge detection.


Canny edge

Canny Edge is an edge detection algorithm developed in 1986 by John Canny to help detect a wide range of edges in images. Wikipedia has a great walkthrough of the algorithm steps, but at a high level they are:

  1. Grayscale conversion and smoothing via a Gaussian filter
  2. Intensity gradients computed on the smoothed image
  3. Non-max suppression applied to the intensity gradient image
  4. Double threshold applied to the non-max-suppressed image, where gradient values indicate edge strength
    1. Filters out edge pixels that exist only because of noise and colour variation
    2. Applies low and high threshold values to classify weak and strong edges
  5. Hysteresis applied to the previous image

From the input image above, the first result below uses Canny edge detection with a low threshold of 100 and a high threshold of 200. If we reduce the thresholds to 40 and 60, you can see how many more edges are detected within the original image.

Canny higher thresholds
Canny lower thresholds
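
For anyone wanting to reproduce these maps, a minimal sketch using OpenCV is below; the file name is a placeholder and the blur kernel size is my assumption, since OpenCV’s Canny handles the gradient, suppression and hysteresis steps internally.

```python
import cv2

# Load the reference image and convert it to grayscale.
image = cv2.imread("shopfront.jpg")  # placeholder file name
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Optional smoothing before edge detection (kernel size is an assumption).
blurred = cv2.GaussianBlur(gray, (5, 5), 0)

# Higher thresholds (100/200) keep only the strong edges;
# lower thresholds (40/60) pick up many more edges.
edges_high = cv2.Canny(blurred, 100, 200)
edges_low = cv2.Canny(blurred, 40, 60)

cv2.imwrite("canny_high.png", edges_high)
cv2.imwrite("canny_low.png", edges_low)
```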

Applying either of these edge maps as a ControlNet over Stable Diffusion, with the same prompts as above, produces:

Canny higher thresholds output image
Canny lower thresholds output image
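
A minimal sketch of how this conditioning step might look with the diffusers library is below; the ControlNet checkpoint name is the publicly released Canny model, but the rest of the settings are assumptions, and the prompts are abbreviated versions of the ones above.

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Load the Canny-conditioned ControlNet and attach it to a standard SD pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The edge map produced earlier becomes the conditioning image.
canny_image = load_image("canny_high.png")  # placeholder file name

# Abbreviated versions of the full prompt and negative prompt from earlier.
prompt = "boutique shopfront with blank poster in window and scooter out the front, realistic, photograph"
negative_prompt = "low quality, low resolution, render, oversaturated, blurry"

image = pipe(prompt, negative_prompt=negative_prompt, image=canny_image).images[0]
image.save("controlnet_canny_output.png")
```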

Given how strictly the output mimics the original image, it is really useful for applying a new style over an existing photograph. It makes sense that the ControlNet example for Canny edge detection is a cyborg bird: it applies a cyborg style exactly over the image of the bird passed in.

Looking at the binary masks output by the Canny edge detection algorithm, we can see exactly which lines it looks to retain in any generated output.

Reducing the conditioning scale factor should reduce how closely the output adheres to the binary mask. Given the large number of edges in either edge map, even at the lowest setting the generated output was still very much in line with the input image.
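
In the diffusers pipeline sketched above, this corresponds to the controlnet_conditioning_scale argument; the specific value below is just an illustration.

```python
# A lower conditioning scale (the default is 1.0) loosens adherence to the edge map.
image = pipe(
    prompt,
    negative_prompt=negative_prompt,
    image=canny_image,
    controlnet_conditioning_scale=0.3,
).images[0]
```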

A frustrating aspect of Canny edge detection is having to manually set the upper and lower thresholds for each image. It quickly becomes evident that the right thresholds for one image are not the same as for another, which makes batch processing images a slow and tedious process. These limitations were the driving force behind the next algorithm we are going to look at.


HED

HED (Holistically-Nested Edge Detection), introduced in 2015, was developed to overcome a few limitations of Canny edge detection, including the number of preprocessing steps and the need to manually play around with the upper and lower threshold values.

Using the same original shopfront image, the HED algorithm outputs the following mask automatically.

HED Map
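
A minimal sketch of producing this map with the controlnet_aux preprocessors is below; the annotator checkpoint name follows the publicly released weights, but treat the exact identifier and file names as assumptions.

```python
from controlnet_aux import HEDdetector
from diffusers.utils import load_image

# HED produces its soft edge map automatically - no thresholds to tune.
hed = HEDdetector.from_pretrained("lllyasviel/Annotators")
hed_image = hed(load_image("shopfront.jpg"))  # placeholder file name
hed_image.save("hed_map.png")
```

The generation step is the same as in the Canny example, just swapping in the HED-conditioned ControlNet checkpoint (for example lllyasviel/sd-controlnet-hed).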

With a prompt of “Cheerful, modern shopfront”, the output generated was:

HED output image
Canny lower thresholds

Looking at the output, it does not capture the intricate detail of Canny edge at the lower thresholds, but is closer to the higher-threshold image. The image is consistent with the look and feel of the original one and, as a bonus, we didn’t have to manually input any threshold values, meaning batch processing images is much simpler now.

The generated output has the style of a painting, directly influenced by Stable Diffusion’s underlying dataset. The output has picked up most of the detail in the poster and maintained the original diagonal section lines and the two key headings. While the heading words are wrong and the 50% off now appears to be replaced with Japanese characters, these are minor fixes, as Stable Diffusion does not render text reliably.

However, the main concern in the generated output is the building edge where it joins the building on the right, which is inconsistent in both style and appearance. It doesn’t belong to either building, looks more like the interior of the store, and completely stands out. Looking at the HED map, you can see the same section is rendered differently there too. In this case a quick edit of the map to straighten the edge would most likely fix it.


MLSD

MLSD (Mobile Line Segment Detection) looks to find straight lines and edges within images in a resource-constrained environment. Altering the value threshold tells the model what should count as a line, while the distance threshold changes how far back from the detected data points the generated image appears.

Continuing with the shopfront picture, the mlsd algorithm detects the following image:

MLSD Map
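
A minimal sketch of producing this map with the controlnet_aux MLSD preprocessor is below; the value and distance threshold parameter names (thr_v, thr_d) and the values used are my assumptions based on that library, so treat them as illustrative.

```python
from controlnet_aux import MLSDdetector
from diffusers.utils import load_image

mlsd = MLSDdetector.from_pretrained("lllyasviel/Annotators")

# thr_v controls how confident a detection must be to count as a line,
# thr_d is the distance threshold (both values here are illustrative).
mlsd_image = mlsd(load_image("shopfront.jpg"), thr_v=0.1, thr_d=0.1)
mlsd_image.save("mlsd_map.png")
```

As with HED, the generation step just swaps in the MLSD-conditioned ControlNet checkpoint (for example lllyasviel/sd-controlnet-mlsd).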

It is evident that the straight lines pick up the shadows and reflections in the windows and screen, while the poster outline is lost, so in the generated images it blends into the background. As the scooter has round edges, we lose it entirely from the picture. Given the difference between the detailed Canny and HED maps and this one, we expect the output to be more loosely inspired by the reference rather than an exact copy of the original image. MLSD also gives us a really useful feature to shift the point of view back from the original image: altering the distance moves the reference point back, and Stable Diffusion extends the frame of reference.

MLSD output image
MLSD distance setting of 10

MLSD is well suited to use cases where the input image is a looser reference for the image generator and where the input image has primarily straight edges. The frustrating aspect was losing the sense of depth in the image, as the poster now appears as part of the store background. To control for this, the advantage would be to use it in conjunction with a depth ControlNet, which we discuss in the next article.


Conclusion:

These three edge detection algorithms show, to differing degrees, how to guide the output from Stable Diffusion. As a quick overview, Canny edge and HED maps help preserve the original details from the image and carry them into the new image. For batch processing images, HED is better, as the user does not need to identify the best threshold values for each image. MLSD is best suited to images with primarily straight lines and to use cases where the original image is a loose inspiration point.

In the next article I will discuss normal, depth and segmentation maps in more detail. Thanks for following along!