
Lesson Description
The "Summarization & Named Entity Recognition" lesson is part of the full Open Source AI with Python & Hugging Face course featured in this preview video. Here's what you'd learn in this lesson:
Steve explains that in-browser summarization is limited by factors like model size, staged rollouts, and progressive-enhancement constraints, while NER can identify and label entities such as people, places, and organizations in unstructured text, making it valuable for structuring data.
Transcript from the "Summarization & Named Entity Recognition" Lesson
[00:00:00]
>> Steve Kinney: Not totally related to Hugging Face, but one thing that's kind of cool that's coming out, depending on what version of Chrome you're on: Google has three Gemini Nano models that are accessible from JavaScript in the browser, a Summarizer API, a Language Detector API, and a language model, like an LLM, for text generation.
[00:00:30]
So not only do you have Hugging Face for a lot of this stuff, both in JavaScript as well as Python, but for stuff like summarization and even text generation, these are multi-gigabyte downloads the user has to pull down on top of your website. So you can only really use them for progressive enhancement.
[00:00:49]
And they're only available in the browser, but some of these are also things that you will be able to do right there. It's like tiers of rollout: some of them are just available, some are only available in Chrome extensions, some only behind a feature flag, some only in Chrome 137 and higher.
[00:01:05]
So you can't totally rely on them just yet, but you will. So yeah, we have this idea of summarization, where it will summarize the text down a little bit as well. Nothing particularly fancy. And you gotta kind of play around with where the stop token is, because you can see it kind of stops here.
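On the Hugging Face side, a summarization call might look like the sketch below. The model name and length parameters are illustrative, not from the lesson:

```python
from transformers import pipeline

# Illustrative checkpoint; any summarization model from the Hub would work.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

text = (
    "Hugging Face provides thousands of pretrained models for tasks such as "
    "summarization, translation, and question answering. These models can be "
    "downloaded, fine-tuned on custom datasets, and deployed in production "
    "systems across the industry."
)

# max_length / min_length are where the quality-versus-length trade-off lives
result = summarizer(text, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```

Where the summary cuts off depends on those length caps, which is the "where does it stop" fiddling mentioned above.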
[00:01:22]
And so you can figure out either how you wanna massage it or where you wanna cut sentences, so on and so forth. So summarization, it exists. Nothing super exciting about that one. It does what you think it's gonna do as you go along, right? And again, everything.
[00:01:39]
Quality-versus-length trade-offs, so on and so forth. The one that is somewhat interesting, though, is named entity recognition, right? Where it will look at a piece of text and try to pull out the nouns, effectively, right? Like the places, and it will categorize and label them: New York City, that is a location, right?
[00:02:05]
John Smith is a person, Apple Incorporated is a company, right? It will pull out multiple things and kind of show them to you and label them. So let's run that, because I think it'll take a second or two to download, but then we'll look at the text while that happens.
[00:02:21]
So here we have a bunch of strings of text, like we've had the whole way through, and we run through them all and basically pass them to this pipeline. And then we're just going to try to get a sense of the various different pieces that it knows about and will pull out.
[00:02:41]
These are written in a way where they should all be somewhat rich, with a few things in them, right? Most of them have some kind of location and some kind of company or person in them, right? And you'll see that, just based on strings of text.
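The loop described here might look like this sketch. The checkpoint name is illustrative, since the lesson doesn't name the exact model it uses:

```python
from transformers import pipeline

# Illustrative checkpoint; aggregation_strategy="simple" merges word pieces
# back into whole entities like "New Mexico".
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

sentences = [
    "Bill Gates founded Microsoft in Albuquerque, New Mexico.",
    "John Smith joined Apple Inc. in Cupertino, California.",
]

for sentence in sentences:
    for entity in ner(sentence):
        # each entity has word, entity_group, score, start, end
        print(entity["word"], entity["entity_group"], round(float(entity["score"]), 3))
```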
[00:02:57]
I should just go down and see how big the model is. This one's a little bit bigger, hence the talking, because it needs to have effectively a dictionary of these terms. And it's not that far off from the idea of the question answering that we saw earlier.
[00:03:12]
It's like a few of these kind of mixed together, if you think about it, where it will pull out the word and the label it gave it from a known set. I have a table down here that'll show them to you. For most models, whether it's PER or PERSON depends on the model; org, location, or geopolitical entities, dates, times. You can start to pull out these core things, which is, again, a super interesting thing.
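A rough sketch of that label table, assuming common tag sets; the exact tags vary by model:

```python
# CoNLL-style models typically use PER/ORG/LOC/MISC, while OntoNotes-style
# models add GPE, DATE, TIME, and others. Illustrative only.
ENTITY_LABELS = {
    "PER": "Person",
    "ORG": "Organization",
    "LOC": "Location",
    "GPE": "Geopolitical entity",
    "DATE": "Date",
    "TIME": "Time",
    "MISC": "Miscellaneous",
}
print(ENTITY_LABELS["GPE"])  # Geopolitical entity
```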
[00:03:44]
I was saying the other day that one of the interesting things that we can build with AI are a lot of things that people want to build. The hard part is not the code. The hard part is taking unstructured data and making it structured. If you wanted to make a list of every lunch special in Minneapolis, writing the code to have a website of every lunch special in Minneapolis is not the hard part.
[00:04:10]
It is pulling out the locations and the times and the places and so on and so forth. That's the hard part. So tools that let you take arbitrary unstructured data, pull it out in a structured way, and then turn it into an array, a database row, or what have you, I think that's where a whole bunch of stuff that would have been hard and needed a lot of human power becomes a lot easier, and it opens up a whole new set of things.
[00:04:36]
So you can kind of see in these outputs. See, I talked long enough; now the answers are here. So, yeah, 1.3 gigs, not bad. It also did not take that long on servers that are not mine. So you can see Microsoft, that's an organization. It's pretty confident about that.
[00:04:53]
It will also give you the start and end point of where that showed up in the string, right? So you could do stuff where you grab something and highlight it, if you wanted to. Bill Gates, that sounds like a person. It's pretty confident about that.
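Since the pipeline returns start/end offsets, highlighting an entity in the source string is just slicing. The entity dict below is hypothetical example output, not real pipeline results:

```python
# Hypothetical NER output for one entity, with character offsets into `text`
text = "Bill Gates founded Microsoft in Albuquerque."
entity = {"word": "Microsoft", "entity_group": "ORG", "start": 19, "end": 28}

# Wrap the matched span in brackets using the offsets
highlighted = (
    text[: entity["start"]]
    + "[" + text[entity["start"] : entity["end"]] + "]"
    + text[entity["end"] :]
)
print(highlighted)  # Bill Gates founded [Microsoft] in Albuquerque.
```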
[00:05:05]
Albuquerque and New Mexico, it pulled those out as places. So it kind of just takes arbitrary text and categorizes it through a known set of things. With the zero-shot classification that we saw before, you get to define any category you want, right? And it could pull out or label a given string of text the whole way through.
[00:05:25]
The question answering that we saw earlier could answer a question and give you the start and end point. This kind of does a little bit of both, where you have a known set of categories and also give it an arbitrary string of text, and it'll pull out those things exactly where they were found, so on and so forth.
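For contrast with NER's fixed tag set, a zero-shot sketch where you supply the candidate labels yourself; the model and labels here are illustrative:

```python
from transformers import pipeline

# facebook/bart-large-mnli is a common checkpoint for this task
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "Apple opened a new office in Cupertino.",
    candidate_labels=["business", "sports", "politics"],
)

# Labels come back sorted by score, highest first
print(result["labels"][0], round(result["scores"][0], 3))
```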
[00:05:43]
John Smith, Apple, Cupertino, California, so on and so forth: it begins to pull out all of these kind of core and important things as well. So named entity recognition, depending on the model, has a set of things that it recognizes, and that's all. You'll notice that one was like several times bigger than the previous ones.
[00:06:06]
So if you wanted one that could find dates, phone numbers, email addresses, maybe it'd be a different model. Yeah. And with some of these, I think these are the common ones, but you can see some of them vary based on the model, right? I would wonder if there are specialty ones that could theoretically have a few more categories as well, right?
[00:06:30]
Phone numbers are interesting too, because at what point is that actually just a regex? Unclear. But as I joked about earlier, if I revert back to my younger self, I know how to regex that one.
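As a nod to that regex joke, a toy phone-number pattern; this handles US-style formats only, and real-world numbers vary far more:

```python
import re

# Matches forms like (555) 123-4567, 555.987.6543, 555 123 4567
PHONE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

text = "Call (555) 123-4567 or 555.987.6543 to reserve a table."
print(PHONE.findall(text))
```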
>> Male: What's the smart way to store entities after you harvest them or retrieve them out of text?
[00:06:47]
>> Steve Kinney: So I guess what are you trying to do, right?
>> Male: Let's say you have more of an interactive app and you wanna make sure you're not doing all of the same work over and over again. Like, is that something you'd handle at the app level, or
[00:06:56]
are there constructs that work with LLMs where it's like, hey, these are the entities we were talking about previously, in context? Is it kind of flagging the data?
>> Steve Kinney: Yeah, I mean, I think that becomes like a data storage caching thing. Right? Like, that's where my mind immediately jumps to.
[00:07:10]
Right? I would probably, and don't hold me to this, my immediate answer is: I would hash the initial input. Have I processed this before? If so, pull what I have already, so on and so forth. That's the initial place my mind jumps to: hash it, cache it, and then pull it out again if I need it.
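That hash-it, cache-it idea could be sketched like this; the `fake_ner` function here stands in for a real pipeline and is purely hypothetical:

```python
import hashlib

# Cache keyed by a hash of the input text, so repeated inputs skip the model
cache = {}

def extract_entities(text, ner_fn):
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:          # only run the model on a cache miss
        cache[key] = ner_fn(text)
    return cache[key]

calls = []
def fake_ner(text):
    """Hypothetical stand-in for an NER pipeline; records each invocation."""
    calls.append(text)
    return [{"word": "Microsoft", "entity_group": "ORG"}]

extract_entities("Microsoft was founded in 1975.", fake_ner)
extract_entities("Microsoft was founded in 1975.", fake_ner)
print(len(calls))  # 1 -- the model ran only once for the repeated input
```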
[00:07:33]
And then what do you do with that data? Are you trying to count how many times the United Nations is mentioned across the thing? I think it depends on what application you're trying to build or what data you're processing, so on and so forth. But yeah, as you can see, even stuff like "half" counts as a percent, right?
[00:07:55]
And likely when we look at how attention works, it will be half in that context, and we'll see how that works in a little bit as well.