ControlNet with Stable Diffusion Part 2

By Lauren Jarrett

Continuing from the previous article, we will explore more of the ControlNet algorithms that can be used to better control the output from Stable Diffusion. In this article, I will walk through normal, depth and segmentation maps and show how each guides the output from Stable Diffusion.

Original output refresher

  • Prompt: boutique shopfront with blank poster in window and scooter out the front, realistic, photograph, cannon eos r3, high quality photography, medium shot, sharp, 3 point lighting, flash with softbox, by Annie Leibovitz, 80mm
  • Negative prompt: low quality, low resolution, render, oversaturated, low contrast, abstract, blurry, haze, glow
Original output 1
Original output 2

Doing a quick Google search, I found this image from BannerBuzz that better aligned with what I was looking for.

Input image

Now let's jump into normal maps.

Normal maps

Normal maps are a texture-mapping technique used to add simulated detail to a 3D surface. They are encoded in RGB, with each channel representing one component of the surface normal: red for the x axis, green for the y axis and blue for the z axis. For a completely level surface facing the camera, we would expect the map to be blue. Looking at the original shopfront image, its normal map is shown below.

Normal Map
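For reference, a map like this can be produced with an off-the-shelf annotator. Below is a minimal sketch using the controlnet_aux package and its BAE normal estimator; the package, model name and file paths are assumptions for illustration rather than the exact preprocessor used to produce the map above.

```python
# Minimal sketch: extracting a normal map from the input photo.
# Assumes the controlnet_aux package is installed (pip install controlnet_aux).
from controlnet_aux import NormalBaeDetector
from diffusers.utils import load_image

# The shopfront photo (the path is a placeholder).
input_image = load_image("shopfront.png")

# The estimator returns an RGB image whose channels encode the x, y and z
# components of the surface normals.
normal_estimator = NormalBaeDetector.from_pretrained("lllyasviel/Annotators")
normal_map = normal_estimator(input_image)
normal_map.save("normal_map.png")
```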

The poster in the window is highlighted really well, and the shopfront outline is detected cleanly from the original image. We can also see sharp edges between the storefront, scooter and plant, and a clear distinction from the store to the right. The poster is blue, showing it is a level surface, while green indicates movement along the y axis into the store and along the footpath. Carrying this into Stable Diffusion, the output is guided to match the normal map.

Normal Map Output

The output here maintains the window poster and the depth of the rest of the store. The door is clearly outlined, as are the edges of the sidewalk, the pillar and the store next door. This input has clearly helped Stable Diffusion recognise that there are multiple stores, as it carried the same grey edge colouring into the next shopfront too.
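For anyone wanting to reproduce this step, here is a minimal sketch of conditioning Stable Diffusion on the normal map with Hugging Face diffusers. The model IDs, scheduler and parameter values are assumptions for illustration, not necessarily the setup used to generate the images in this article.

```python
# Minimal sketch: Stable Diffusion guided by a normal map through ControlNet.
import torch
from diffusers import (
    ControlNetModel,
    StableDiffusionControlNetPipeline,
    UniPCMultistepScheduler,
)
from diffusers.utils import load_image

# Normal-map ControlNet paired with the Stable Diffusion 1.5 base checkpoint.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-normal", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

normal_map = load_image("normal_map.png")  # the map extracted earlier

prompt = (
    "boutique shopfront with blank poster in window and scooter out the front, "
    "realistic, photograph, cannon eos r3, high quality photography, medium shot, "
    "sharp, 3 point lighting, flash with softbox, by Annie Leibovitz, 80mm"
)
negative_prompt = (
    "low quality, low resolution, render, oversaturated, low contrast, abstract, "
    "blurry, haze, glow"
)

image = pipe(
    prompt,
    negative_prompt=negative_prompt,
    image=normal_map,
    num_inference_steps=30,
).images[0]
image.save("normal_guided_output.png")
```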

If we turn the normal map guidance scale down to 1 out of 30, Stable Diffusion clearly gets more creative freedom when generating the image. The output below has different colouring as well as a different overall look and feel. The reflections in the windows are present but contradictory, obscuring the rest of the store.

Normal Map Output with less guidance

If we then regenerate the output with a guidance scale of 25 out of 30, as expected the result adheres very strictly to the input image. The door and the window above it are exceptionally consistent, with Stable Diffusion giving the borders the same look and feel. It also added the same border around the poster in the window, tying everything together. We have lost the sidewalk, which has been replaced with what looks like a step up into the building, but it is consistent so it still fits the output image.

Normal Map Output with strict guidance
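The equivalent knob in diffusers is controlnet_conditioning_scale, which weights how strongly the normal map constrains generation; note that it is usually expressed on roughly a 0 to 2 range rather than the 1-to-30 slider discussed above. A small sketch, reusing the pipeline from the previous example with illustrative values:

```python
# Minimal sketch: sweeping the ControlNet conditioning strength.
# Reuses pipe, prompt, negative_prompt and normal_map from the previous sketch.
# The values below are illustrative; diffusers uses roughly 0-2 rather than 1-30.
for scale in (0.3, 1.0, 1.8):
    image = pipe(
        prompt,
        negative_prompt=negative_prompt,
        image=normal_map,
        controlnet_conditioning_scale=scale,
        num_inference_steps=30,
    ).images[0]
    image.save(f"normal_guided_scale_{scale}.png")
```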

Overall, I was very impressed with how well normal maps guided the output. They helped achieve a consistent result regardless of the scale selected and enabled Stable Diffusion to generate a consistent theme within the image and carry it across the different elements. The window poster, scooter and plant were each maintained individually and all remained cohesive in the output from Stable Diffusion.

Depth2Image

Moving on to the next algorithm, Depth2Image. Given the composition of the original image and prompt, with the poster in the shop window and the scooter in front of the shop, my hypothesis was that this algorithm would generate the best output from Stable Diffusion. Why? Stable Diffusion normally generates an image with a 64x64 depth map, but Depth2Image passes in a 512x512 depth map, meaning we can preserve more detail from the original input image. Looking at what the algorithm detects, the depth map from the original image is shown below.

Depth map

The map keeps the entire poster perfectly in front of the shop window, with sharp edges, at the same depth as the door and shopfront frame. Darker areas represent parts of the scene that are further away, with lighter areas being items located in front. Looking at the pillar and plant, it has correctly identified the different levels, with the pillar in front of the shopfront and the plant in front of the pillar.
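For reference, this roughly corresponds to the depth-to-image pipeline in diffusers, which estimates a depth map from the input photo (or accepts one you pass in) and conditions generation on it. The model ID, strength and prompt below are assumptions for illustration, not necessarily the exact setup behind the outputs that follow.

```python
# Minimal sketch: Stable Diffusion 2 depth-to-image.
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("shopfront.png")  # the original photo (path is a placeholder)

image = pipe(
    prompt="boutique shopfront with blank poster in window and scooter out the front, "
    "realistic, photograph, high quality photography, medium shot, sharp",
    negative_prompt="low quality, low resolution, render, oversaturated, low contrast, blurry",
    image=init_image,
    strength=0.7,     # how far the result is allowed to move away from the input image
    # depth_map=...,  # optionally pass a precomputed depth map instead of the built-in estimate
).images[0]
image.save("depth2img_output.png")
```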

Passing that into Stable Diffusion, the output examples are below. In output 1, Stable Diffusion changed the poster from the original input into a store window display and created different levels in the store by adding downlights and stock in the background. The scooter has been transformed into an odd-looking plant holder with a scooter wheel on the bottom. Going back to the map, it is pretty evident that it has some square edges from the scooter seat and the raised back wheel, but that has been entirely lost in the newly generated output.

Output 2 was even worse. We lost all the cohesive elements from the original image and got strange imagery in the door window, the shop poster and even the stock within the store. The scooter is again lost, turned into some sort of plant holder with square edges. The pillar and the top of the shopfront are such different tones that it looks odd, with a peculiar white ornament near the top of the pillar.

Depth output 1

Depth output 2

This was a bit of a disappointment given how well normal maps worked and generated a cohesive story within the image. The first output with a Depth2Image map was OK, with some odd elements, but the second one is absolutely rubbish. I want to stubbornly dig more into why the output doesn't seem to be making the most of the accurate depth map passed in. The map clearly shows the scooter seat as an individual element, but Stable Diffusion has gotten lost with the repeated square elements. Overall, there is clearly more to research here to make the most of the accurate depth map. On to the next and final algorithm for today.

Semantic segmentation

I have already written an article explaining semantic segmentation in detail, but at a high level it seeks to segment all of the pixels in an image into classes. Looking at the map, the same colours indicate the same class. It is interesting to note that the raised scooter wheel at the back is dark teal while the wheel on the ground is pink, indicating that the segmentation model has not identified them both as wheels. Something to check in the input.
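For reference, a segmentation map like the one described here can be produced with an off-the-shelf semantic segmentation model. Below is a minimal sketch using an ADE20K-trained UperNet from transformers; the model ID and the simplified colour palette are assumptions for illustration, not necessarily the preprocessor used for the maps in this article.

```python
# Minimal sketch: producing a semantic segmentation map to condition ControlNet on.
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, UperNetForSemanticSegmentation

processor = AutoImageProcessor.from_pretrained("openmmlab/upernet-convnext-small")
segmenter = UperNetForSemanticSegmentation.from_pretrained("openmmlab/upernet-convnext-small")

image = Image.open("shopfront.png").convert("RGB")  # path is a placeholder
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = segmenter(**inputs)

# One class id per pixel, resized back to the original resolution.
seg = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]

# Colour-code the classes. A real control image should use the ADE20K palette the
# segmentation ControlNet was trained with; a fixed random palette is used here
# purely to keep the sketch short.
palette = np.random.RandomState(0).randint(0, 255, size=(256, 3), dtype=np.uint8)
seg_map = Image.fromarray(palette[seg.numpy().astype(np.uint8)])
seg_map.save("segmentation_map.png")
```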

Segmentation output 1

Segmentation output 2

This was a bit of a disappointment again, but looking at the semantic segmentation map it was not unexpected: the output is pretty much in line with the segmentation map that was passed in. With the advances in segmentation led by Meta, my theory is that we need a newer image segmentation model to better capture the input and generate a better output. A classic example of garbage in, garbage out.

Looking around on [Hugging Face](https://huggingface.co/mfidabel/controlnet-segment-anything), another model uses the Segment Anything Model (SAM) from Meta as a ControlNet over Stable Diffusion. This is what I love about AI: everything moves so rapidly and advances are happening every day. Using the same prompts, we will try again.
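Here is a minimal sketch of swapping that SAM-conditioned ControlNet into the same kind of pipeline. The base model, loading call and parameters are assumptions; the model card linked above should be checked for the intended base checkpoint and the expected format of the SAM segmentation map.

```python
# Minimal sketch: the SAM-based ControlNet checkpoint linked above.
# Assumes the checkpoint loads with ControlNetModel.from_pretrained and pairs with
# a Stable Diffusion 1.5 base model; check the model card before relying on this.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "mfidabel/controlnet-segment-anything", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

# A SAM segmentation map of the shopfront photo, generated separately with
# Meta's segment-anything package (file name is a placeholder).
sam_seg_map = load_image("sam_segmentation_map.png")

image = pipe(
    "boutique shopfront with blank poster in window and scooter out the front, "
    "realistic, photograph, high quality photography, medium shot, sharp",
    negative_prompt="low quality, low resolution, render, oversaturated, low contrast, blurry",
    image=sam_seg_map,
    num_inference_steps=30,
).images[0]
image.save("sam_guided_output.png")
```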

Segmentation map

SAM segmentation output 1

SAM segmentation output 2

Instantly we can see that the generated output is better; SAM's advances carry into Stable Diffusion and produce better results. Fine-tuning SAM for a specific instance and then generating that specific instance in the output can be done, but bear in mind that until computer vision models stitch these two ends together, Stable Diffusion would need to be fine-tuned on the new class as well.

Conclusion

Overall, we can see how each algorithm controls Stable Diffusion in a different way. Normal maps were the best surprise and performed better than expected in terms of the images generated: they maintained the input elements and also produced cohesive images.

Depth2Image was a surprise because, while the elements all maintained the depth from the input image, the generated output seemed to lose all consistency and cohesion. Segmentation maps worked well, especially when using the more recent SAM model rather than the original ControlNet segmentation version, which gave us a better input to pass into Stable Diffusion.

It will be interesting to see whether the issues with segmentation and depth maps are fixed in the new Stable Diffusion XL model, which I will be writing about in my next article.