
Lesson Description
The "Stable Diffusion Overview" Lesson is part of the full, Open Source AI with Python & Hugging Face course featured in this preview video. Here's what you'd learn in this lesson:
Steve discusses stable diffusion, which transforms chaotic noise into recognizable images like cats or hippos. The process involves iteratively refining random noise towards a desired image. Various parameters such as num_inference_steps, GPU capacity, and guidance_scale influence the quality and speed of image generation.
Transcript from the "Stable Diffusion Overview" Lesson
[00:00:00]
>> Steve Kinney: So we talked about that with text. We had the Transformers library. Great library. We did all this stuff with the Transformers library. And, yeah, I made some charts at one point with a different library, and that BertViz thing, which was kind of cool but confusing. That was neat, but it was all built on the Transformers library.
[00:00:22]
Hugging Face has another, kind of similar library called Diffusers, and it helps us use a different model architecture called Stable Diffusion, in which we turn chaotic noise into Moo Deng. I don't care if you remember what a transformer block is. Honestly, whatever. But if you're not aware of what Moo Deng is, you owe it to yourself to go see that delightful, ungovernable pygmy hippo.
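(For reference, here's roughly what that looks like with Diffusers. This is a minimal sketch; the checkpoint ID is an assumed example, not necessarily the one used in the course.)

```python
# Minimal Diffusers text-to-image sketch. The checkpoint is an assumed
# example; any Stable Diffusion checkpoint from the Hub works the same way.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",  # assumed checkpoint, not from the lesson
    torch_dtype=torch.float16,        # half precision to fit on a cheap GPU
).to("cuda")

# One prompt in, one image out: random noise gets denoised into a hippo.
image = pipe("a delightful, ungovernable pygmy hippo").images[0]
image.save("hippo.png")
```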
[00:00:58]
And again, if you imagine the cats walking across Colab at the very beginning, and me generating pictures of a baby hippo on a weekend, you can understand why my wife was not convinced that I was working. But I used it to get out of lots of stuff, so it's fine.
[00:01:15]
And so, effectively, there's, is it a Michelangelo quote? There's a lot of those quotes where we think they're attributed to somebody, and then you find out half the Oscar Wilde quotes are misattributed. I long, when I'm long dead, to have tons of quotes that I didn't say attributed to me.
[00:01:31]
But I think the Michelangelo thing is: how do you make a sculpture? You chip away all the parts of the rock that are not the David, or what have you. And that is effectively what's happening here. Again, Transformers and generative AI, which seems like magic, is simply guess-the-next-word.
[00:01:50]
It's just statistical guessing. Stable Diffusion is the same idea: you start out with chaotic randomness, just TV static but with more colors, and you keep taking away all the parts that aren't what you wanted to see. So when we see all of those crazy AI images or whatever, all that's really happening is you start with absolute chaos.
[00:02:19]
And then each time, you try to get a little bit closer to a hippo, right? Or to a cat, cuz we all know the Internet is powered by cats. The model has seen a bunch of real images that are labeled: this is a cat, this is a hippo.
[00:02:38]
This is a whatever. And each time it goes, that's still not a hippo, all right? Still not a hippo, all right? Keep going. Keep doing stuff until we think we have a hippo, right? And maybe with the weaker models, you end up with a hand with 19 fingers on it, right?
[00:03:01]
And that's how you end up with that. 'Cause everyone's like, well, it's so wrong. And I'm like, yeah, if you think about the fact that it just started with random pixels and guessed its way there, honestly, a six-fingered hand is pretty impressive to me. And to be clear, again, we are gonna try to keep this on the cheapest hardware we can.
[00:03:20]
So prepare to be underwhelmed [LAUGH], right? We'll use small models that will render quickly. If only I had the time to do the thing I did in high school when I worked on VHS videos, where I would just choose the most ridiculous render and then go hang out in the cafeteria during my video editing class so I didn't have to go.
[00:03:42]
And it doesn't matter. But since we're trying to do this quickly, we're gonna use small models, we're gonna run on the cheap GPUs and stuff like that. Still kinda cool. So we take the random static. The model knows what real images look like. And every pass, step by step, it tries to get you one iteration closer to a cat wearing sunglasses.
[00:04:05]
There's a few pieces to this. One we already know and maybe love, unclear, which is the Text Encoder. You can do image-to-image too, right? But usually it is: take the string of text, turn it into an image. So the part where it figures out what the text means is the same as before.
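(As a sketch of just that text-encoder piece, reusing the `pipe` object from the earlier snippet; the shape in the comment assumes an SD v1.x checkpoint.)

```python
# The text-encoder half on its own: prompt -> token IDs -> embeddings.
# Reuses `pipe` (and the torch import) from the earlier sketch.
prompt = "a cat wearing sunglasses"

tokens = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    text_embeddings = pipe.text_encoder(tokens.input_ids.to("cuda"))[0]

print(text_embeddings.shape)  # torch.Size([1, 77, 768]) for SD v1.x
```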
[00:04:29]
Then there's a few terms that will appear, so they're worth explaining. They are not worth stressing out over, because the best way to learn any of this stuff is to learn just enough to be dangerous and throw yourself at it, right? Feeling like you need to have a PhD before you can do anything is silly.
[00:04:50]
So you'll see the term U-Net, which is the core neural network that predicts the noise in an image at each step. So it's asking: what parts are still noise? The Scheduler decides how much noise we can remove at each interval. This is basically a question of how much GPU you got, right?
[00:05:07]
How much can I use? Let me dole it out until we get there, right? The more GPU you have, the fewer iterations you have to go through. And I will explain some of these other terms in this next one, cuz this one, even as I was typing it, stressed me out.
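(To make the U-Net/Scheduler split concrete, here's a rough sketch of the loop a pipeline runs internally. It reuses `pipe` and `text_embeddings` from the earlier snippets and, to stay short, skips classifier-free guidance.)

```python
# The denoising loop, by hand. The U-Net guesses the noise; the Scheduler
# decides how much of it to remove at each step.
num_inference_steps = 25
pipe.scheduler.set_timesteps(num_inference_steps)

# Start from pure chaos: random latents for a 512x512 image (64x64 in latent space).
latents = torch.randn(
    (1, pipe.unet.config.in_channels, 64, 64),
    device="cuda",
    dtype=torch.float16,
) * pipe.scheduler.init_noise_sigma

for t in pipe.scheduler.timesteps:
    latent_input = pipe.scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        # U-Net: "what part of this is still noise?"
        noise_pred = pipe.unet(
            latent_input, t, encoder_hidden_states=text_embeddings
        ).sample
    # Scheduler: peel off that much noise and hand the latents back.
    latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample
```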
[00:05:22]
The Variational Autoencoder compresses the image down to a lower-dimensional latent space. I'll tell you what that is, don't worry. It does that for efficient processing, and then decodes the final representation back into a visible image. The point that I want you to remember is that it's not unlike the encoding/decoding transformer piece, which is: there were words, there is an image.
[00:05:43]
Turn it into numbers that we can play with, and then turn it back into an image that I can enjoy, right? Those pieces are kind of the pillars; you will see them in the code. There will be no quiz where you memorize them. You will get a feel for them as time goes on.
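(And here's the decode half of that round trip, sketched with the same `pipe` and the `latents` from the loop above. The scaling factor comes from the VAE's own config.)

```python
# VAE decode: latent-space numbers back into an image I can enjoy.
with torch.no_grad():
    decoded = pipe.vae.decode(
        latents / pipe.vae.config.scaling_factor  # undo the latent scaling
    ).sample

# Convert the raw tensor into a PIL image.
image = pipe.image_processor.postprocess(decoded)[0]
image.save("denoised.png")
```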
[00:06:03]
A whole bunch of knobs to turn. As you can imagine, I picked the ones that, at the moment I was making this slide, felt like the best ones. So how many steps do you wanna do to denoise it? Fewer steps, faster; fewer steps, worse. More steps, longer; more steps, better [LAUGH], right?
[00:06:28]
How big is your GPU, honestly? The guidance scale is how strictly the model should adhere to your prompt. This is effectively not unlike the temperature, especially if you're doing image-to-image: how much creative license do you want to give it? So a lot of these concepts might have different names, cuz they weren't developed by the same people at the same time, but the fundamental principles are the same.
[00:06:54]
And, turns out, shocker to everyone, bigger images take more [LAUGH], right? So keep them small if you want them fast, or if you don't want to, I don't know, have a $1,000 electricity bill. I don't know.
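(Here are those knobs in one call, as a sketch; the values are illustrative, not recommendations from the lesson.)

```python
# The knobs from this section in one place. Values are illustrative.
image = pipe(
    "a cat wearing sunglasses",
    num_inference_steps=20,  # fewer = faster, more = (usually) better
    guidance_scale=7.5,      # how strictly to adhere to the prompt
    height=512,              # bigger images take more time and VRAM
    width=512,
).images[0]
```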