
Lesson Description
The "Image-to-Image Generation" Lesson is part of the full, Open Source AI with Python & Hugging Face course featured in this preview video. Here's what you'd learn in this lesson:
Steve demonstrates image-to-image transformation, which uses an existing image as a baseline for further adjustment. By combining text-to-image and image-to-image pipelines, images can be transformed at different strengths, which control how far the result departs from the original.
Transcript from the "Image-to-Image Generation" Lesson
[00:00:00]
>> Steve Kinney: So text to image is interesting, but we've all seen it before. What about image to image? This involves taking two pipelines, a text to image pipeline in our case, and then you can actually use an image as the baseline to further adjust the image as well.
[00:00:24]
So in this case, in the example we're gonna see, we're gonna have a stable diffusion pipeline, which is a text to image model. And then we will have an image to image pipeline where we'll instead, I don't know why I'm explaining to you what those words mean.
[00:00:37]
Text to image takes text to image. Image to image does image to image, clear? And again, strength is kinda the parameter that we want here, which is like our temperature earlier and our guidance scale later. And actually, in this one, it was earlier in the day on whatever weekend I was working on it.
[00:01:01]
So I actually will show you across three different ones without having to generate them. And yeah, a low strength will stay very close to the original image, to the point where it's like, why did I even bother doing this, right? And as you approach 1.0, it will take greater and greater liberties and, like, arguably get you closer or farther, depending on what exactly you're trying to do.
[00:01:27]
Let's talk about it first, I promise to tell you what latent space meant other than just seeming like a cool album name for a 90s band where they're just sitting on a couch with their guitars. So the input image is encoded in latent space. Then a certain amount of noise is added to this representation, and how much is determined by the strength.
[00:01:47]
So basically we take an image, we put back some noise, right? And then we take away the noise that we added. It's not wrong to think about latent space like, again, when we took all that text and turned it into numbers. Not totally right, don't @ me. But it's not wrong either, and so yeah, basically, effectively what we're gonna do is we're gonna turn it into that kind of ones-and-zeros weird GPU land that we barely understand.
[00:02:20]
We're gonna add some more chaos with the prompt, and then we're gonna remove that chaos and, like, find what we want underneath. So yeah, in that space, we're basically working with it in that kind of little GPU area, and we're not, like, worrying about the full image.
[00:02:35]
So we're kinda moving the original image back into, like, the ones-and-zeros land of numbers and then doing everything in there. So it's not actually editing the original image per se, it's kind of like taking it back into its weird GPU essence, right?
[00:02:53]
Adding the chaos back in and then producing another image. All right, so image stuff, which I think is super fascinating. As if there's not enough interesting stuff happening here, I'm pulling in matplotlib, which is a graphing library for Python. It's not important to the actual making of the images.
[00:03:12]
It is important for making pretty graphs, which, as a JavaScript engineer, I find really easy to use. We're not gonna get to it, but there is this library also for Python that Hugging Face supports called Gradio, where it'll make a UI where you can adjust all these parameters and share the full-on UI.
[00:03:30]
And I'm just like, why do I have a job? Now, they're very simple UIs. To be clear, I have been very happy writing JavaScript for a very long time. I started playing with Python to play with this stuff. It's pretty good. I like it. I like me some Python.
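(We don't actually demo Gradio in this lesson, but for reference, the idea looks roughly like this. It's a minimal sketch, assuming an image-to-image pipeline and a GPU runtime; the checkpoint id, slider range, and labels are illustrative, not taken from the course notebook.)

```python
# Hypothetical Gradio sketch: a tiny shareable UI around an image-to-image
# pipeline with a slider for strength. Checkpoint id and defaults are assumptions.
import gradio as gr
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base",  # assumed checkpoint, not necessarily the one in the notebook
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")

def transform(image, prompt, strength):
    # Run img2img on the uploaded image with the chosen strength
    return pipe(prompt=prompt, image=image, strength=strength).images[0]

gr.Interface(
    fn=transform,
    inputs=[
        gr.Image(type="pil", label="Base image"),
        gr.Textbox(label="Prompt"),
        gr.Slider(0.1, 1.0, value=0.5, label="Strength"),
    ],
    outputs=gr.Image(label="Result"),
).launch()
```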
[00:03:47]
Okay, we've seen all this stuff before. Then we come down here. All right, I'm gonna run this while we talk about it, cuz I'm not an idiot. Grabbing. There is a Stable Diffusion 2 that's on Hugging Face too, and I think there are even some bigger ones, so on and so forth.
[00:04:08]
I grabbed this one because I knew I was going to be doing this live. Go ahead, grab a bigger one. You can even use some of the bigger ones, I think, on the free tier as well. I was also just going for speed in this case as well.
[00:04:20]
Okay, so we're going to grab two pipelines this time. We're gonna have the text to image pipeline that we had before. So that's gonna be stable diffusion pipeline, makes sense? The model id, we're going to use the same model for both, so on and so forth. Again, the safe tensors are just ones that make sure they can't execute arbitrary code.
[00:04:43]
They are tensors that are safe. That's like, everyone likes it when you explain a term using the words in the term. And then we've got this image to image pipeline which is, yeah, just using again a different set of abstractions, same model, right? It's just, like, the extra stuff, like tokenizing the inputs, all that kind of stuff.
[00:05:00]
Hugging Face is just wrapping it up for you. You could do all this stuff by hand. You could also make your own HTTP server and roll your own crypto too. You know, like, do whatever you want. So then I have, and you can see the code for this.
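(For reference, the two-pipeline setup being described looks roughly like this. The exact checkpoint id is an assumption, not necessarily the one used in the lesson notebook.)

```python
# Minimal sketch of the two-pipeline setup: same model id, two different abstractions.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "stabilityai/stable-diffusion-2-base"  # assumed Stable Diffusion 2 checkpoint

# Text-to-image: text in, image out
text2img = StableDiffusionPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

# Image-to-image: same model weights, but a wrapper that also accepts a base image
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
```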
[00:05:18]
You can see the code. It's two strings, I did break that out pretty well, where we can change them. So I've got a sketch of a house. Like a literal sketch, a simple sketch of a house, and then a futuristic glass house with neon lights, cyberpunk style. I think I was looking at those Mudang pictures too much, and that's how I ended up there.
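(The two strings look something like this; the exact wording in the notebook may differ slightly.)

```python
# The two prompts described above: one to generate the base image, one to transform it.
base_prompt = "A simple pencil sketch of a house"
transform_prompt = "A futuristic glass house with neon lights, cyberpunk style"
```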
[00:05:39]
And so then we'll go ahead and we'll use the text to image pipeline, and I will run that, I gotta put those strings into memory. So let's do that as soon as it's done loading these models, and then we're gonna create that base image. So we're gonna start with just what we saw in the previous section where we're gonna take some text.
[00:06:01]
We're going to make an image. I'm gonna start the next one while I'm at it. Intentionally, I chose a sketch. It should look like a little pencil drawing. It's really like, if live coding wasn't scary enough, live random "will the AI do what you want?" is a new level of fear that I've engaged in this week.
[00:06:24]
And so it will theoretically give me a sketch. And then what I'm doing is I'm gonna show you it at strengths 0.3, 0.5, and 0.7, I think. And you can change any of these numbers, like I keep telling you. At one point I was at 0.8, but I think I made my point at 0.7, so that's where I left it.
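(Generating the base image with the text-to-image pipeline looks roughly like this, assuming the text2img pipeline and base_prompt from the sketches above; the step count is illustrative, not from the notebook.)

```python
# Create the base image with the text-to-image pipeline defined earlier,
# then pick the strengths to compare.
base_image = text2img(base_prompt, num_inference_steps=25).images[0]  # step count is an assumption

strengths = [0.3, 0.5, 0.7]  # low stays close to the sketch, high takes more liberties
```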
[00:06:40]
Okay, that's our sketch of a house. Neat. And I will say, when you play with this, play with these prompts, did I make it a haunted house at some point? Of course I did, right? Did I make it like Princess Bubblegum's castle from Adventure Time? Yes. Right, do all of those things.
[00:07:04]
It is part of the great joy that I have not felt since I first started learning how to program. And I put something in a database and took it back out again. And that no longer fills me with joy. And this does. So you should experience that joy too.
[00:07:15]
The first time I did Ajax and changed something in the DOM, I felt like a wizard. I didn't feel like a wizard anymore until now. All right, so we got our house, effectively. The loop here is basically because I'm going through the three strengths. The important part here is we've got that image to image pipeline where we're passing in the base image, our transformation prompt, and then some stuff that I'm doing to put it on a pretty graph for you.
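(The loop has roughly this shape, assuming the img2img pipeline, base_image, transform_prompt, and strengths from the earlier sketches; the matplotlib layout and guidance scale are just one way to do it, not necessarily what's in the notebook.)

```python
# One img2img pass per strength, plotted next to the original for comparison.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, len(strengths) + 1, figsize=(16, 4))
axes[0].imshow(base_image)
axes[0].set_title("Original")
axes[0].axis("off")

for ax, strength in zip(axes[1:], strengths):
    # Higher strength adds more noise back onto the encoded image, so the
    # result drifts further from the original sketch.
    result = img2img(
        prompt=transform_prompt,
        image=base_image,
        strength=strength,
        guidance_scale=7.5,  # assumed value
    ).images[0]
    ax.imshow(result)
    ax.set_title(f"strength={strength}")
    ax.axis("off")

plt.show()
```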
[00:07:48]
Don't worry about that, that's not important. But you can see that's the original 0.3, I guess. Right. There are some changes. 0.5, we're getting a little futuristic cyberpunk. And then 0.7, we are, we're there. I can see that it came from the image, but I'm starting to lose the thread a little bit.
[00:08:17]
You know what I mean?
>> Speaker 2: There's a big difference between those last two.
>> Steve Kinney: Yeah. And let's play a fun game, I don't know if I wrote the code to be flexible enough to take more than 3. So what we're going to do is, 0.3 was a waste of our time.
[00:08:31]
Let's turn this to, let's do 6. Let's see the difference between 5 and 6, and let's do, do I do 8 or 9?
>> Speaker 2: 9.
>> Steve Kinney: 9, all right, all right, all right.
>> Steve Kinney: So we've got the one we did at 5. Again, not particularly slow, except the fact that we're doing three of them.
[00:08:57]
And the nice part too is with this notebook, if one wanted, you can go and regenerate that base image. It's still going to use the same base image as before because that's still in memory. I didn't rerun that part of the code block. I think six was the sweet spot.
[00:09:13]
I gotta say, six is definitely the sweet spot. Nine looks cool. But again, if you think about the way that this works, it took the original image, added back noise, and denoised it again. So nine is obviously adding more back. Like, the closer you get to one, the more noise you added back and, like, noised over the original.
[00:09:31]
So it makes sense. But yeah, I think even compared to like seven, I mean five, you could argue that if we're truly taking the nature of the assignment, which is take the original image, it's that cyberpunk thing that I think makes it a little like nuanced here. I would argue that 5 understood the assignment the best.
[00:09:53]
I kinda like 6 though too.
>> Speaker 2: It looks like the same house over a 10 year period. Every 10 years.
>> Steve Kinney: Yeah. Yeah. And then who knows what happened here.
>> Speaker 2: This is when that's the portal that they walk through.
>> Steve Kinney: Yeah. This is when the neighborhood really got a wine bar at this point.
[00:10:12]
Yeah. So you have the ability to not only create images from text, but also manipulate images. And I would just want to say that we only have so much time today, and we're trying to stay on the free tier. There are also video models and audio models on Hugging Face as well that are worth looking at.