Partial transcript of Lesson 4: Deep Learning 2019 - NLP; Tabular data; Collaborative filtering; Embeddings

(A cleaned-up transcription of this section of the video.)

Introduction to ULMFiT

The problem we'll solve

This series of three steps is what we're going to start by digging into. We're going to start out with a movie review like this one and decide whether it expresses positive or negative sentiment about the movie. That's the problem we're solving.


In the training set, we have 25,000 movie reviews, and for each one we have just one bit of information: they liked it or they didn't like it. That's what we're going to look into in a lot more detail today and in the coming lessons. Our neural networks, remember, are just a bunch of matrix multiplies and simple nonlinearities, in particular replacing negatives with zeros, and those weight matrices start out random. So if you start out with some random parameters and try to train them to recognize positive versus negative movie reviews, you literally have only 25,000 ones and zeros to tell you "I liked this one, I didn't like that one." That's clearly not enough information to learn how to speak English, let alone to speak English well enough to recognize whether they liked something or not. And sometimes that can be pretty nuanced. Particularly with movie reviews, since these are online movie reviews on IMDb, people often use sarcasm. It can be really quite tricky.
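To make that concrete, here is a minimal sketch (in PyTorch, with made-up layer sizes, not the model from the lesson) of what "a bunch of matrix multiplies and simple nonlinearities with random weights" looks like. The point is how little structure is there before training: with only 25,000 labels to push against, these random matrices have very little signal to learn English from.

```python
import torch

# A toy forward pass: two random weight matrices, a matrix multiply each,
# and a nonlinearity that just replaces negatives with zeros (a ReLU).
x  = torch.randn(1, 100)        # one example with 100 input features (stand-in for a review)
w1 = torch.randn(100, 50)       # weight matrix, initialised randomly
w2 = torch.randn(50, 1)         # weight matrix, initialised randomly

h    = (x @ w1).clamp(min=0)    # matrix multiply, then zero out the negatives
pred = torch.sigmoid(h @ w2)    # squash to a 0-1 score: "probability they liked it"
print(pred)
```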

Our approach

So for a long time, in fact until very recently (like this year), neural nets didn't do a good job at all on this kind of classification problem. And that was why: there wasn't enough information available. So the trick, as hopefully you can all guess, is to use transfer learning. It's always the trick. Last year in this course, I tried something crazy: I thought, "What if I try transfer learning, to demonstrate that it can work for NLP as well?" I tried it out, and it worked extraordinarily well. And so here we are a year later, and transfer learning in NLP is absolutely the hot thing now. So I'm going to describe to you what happens.

The key thing is we're going to start with the same kind of thing we used for computer vision: a pre-trained model that's been trained to do something different from what we're going to do with it. ImageNet, for example, was originally built as a model to predict which of a thousand categories each photo falls into, and people then fine-tune it for all kinds of different things, as we've seen. So we're going to start with a pre-trained model that does something else, not movie review classification. We're going to start with a pre-trained model called a language model.

Language Models and Pre-Training

A language model has a very specific meaning in NLP, and it's this: a language model is a model that learns to predict the next word of a sentence. And to predict the next word of a sentence, you actually have to know quite a lot about English (assuming you're doing it in English) and quite a lot of world knowledge. By world knowledge, I mean things like this: here's your language model, and it reads "I'd like to eat a hot ___." What? Obviously "dog". "It was a hot ___." What? Probably "day". Previous approaches to NLP largely used something called n-grams, which basically count how often particular pairs or triplets of words tend to appear next to each other. And n-grams are terrible at this kind of thing; as you can see, the last couple of words alone don't carry enough information to decide what the next word probably is. But with a neural net, you absolutely can.
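Here is a toy illustration (not from the lesson, just a sketch on a made-up two-sentence corpus) of why a counting-based n-gram model struggles with exactly those two examples:

```python
from collections import Counter, defaultdict

# A toy trigram model: count which word follows each pair of words in a tiny corpus.
corpus = "i would like to eat a hot dog . it was a hot day .".split()

following = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    following[(w1, w2)][w3] += 1

# The n-gram model only sees the last two words, so "a hot" stays ambiguous:
print(following[("a", "hot")])   # Counter({'dog': 1, 'day': 1})
# A neural language model can condition on the rest of the sentence
# ("eat" versus "was") and so can tell the two cases apart.
```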


So here's the nice thing: if you train a neural net to predict the next word of a sentence, then you actually have a lot of information. Rather than a single bit for an entire 2,000-word movie review (liked it or didn't like it), you can try to predict the next word at every single position. So in a 2,000-word movie review, there are 1,999 opportunities to predict the next word. Better still, you don't have to look only at movie reviews, because really the hard thing isn't so much "does this person like the movie or not" but "how do you speak English?" And you can learn roughly how to speak English from a much bigger set of documents.
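In other words, the targets come straight from the text itself, just shifted along by one token. A tiny sketch of that idea, with a made-up sentence standing in for a real review:

```python
# Every token in a document provides a training example: the label is simply the next token.
tokens = "i thought this movie was absolutely wonderful".split()   # imagine 2,000 of these

inputs  = tokens[:-1]   # tokens 1 .. n-1
targets = tokens[1:]    # tokens 2 .. n, i.e. what the model should predict at each step

for x, y in zip(inputs, targets):
    print(f"after {x!r}, predict {y!r}")
# A 2,000-token review therefore gives 1,999 (input, target) pairs, with no human labelling.
```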

And so what we did was start with Wikipedia. Stephen Merity and some of his colleagues built something called the WikiText-103 dataset, which is simply a subset of the largest articles from Wikipedia, with a little bit of preprocessing, available for download. So you're basically grabbing Wikipedia, and then I built a language model on all of it. That is, I built a neural net that would predict the next word in every significantly sized Wikipedia article. And that's a lot of information; if I remember correctly, it's something like a billion tokens. So we've got a billion separate things to predict. Every time we make a mistake on one of those predictions, we get the loss, we get gradients from that, and we can update our weights and make them better and better, until we get pretty good at predicting the next word of Wikipedia.
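A minimal sketch of that training signal, using a toy recurrent model and random token IDs standing in for real WikiText-103 text (the actual ULMFiT language model is an AWD-LSTM, not this little network; the sizes here are invented):

```python
import torch
import torch.nn as nn

vocab_size, emb_size, hidden = 10_000, 64, 128

class TinyLM(nn.Module):
    """A toy next-word predictor: embed tokens, run an LSTM, project back to the vocabulary."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.LSTM(emb_size, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)                      # a next-word prediction at every position

model   = TinyLM()
opt     = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

xb = torch.randint(0, vocab_size, (32, 70))     # a batch of 70-token sequences (fake "Wikipedia")
yb = torch.roll(xb, shifts=-1, dims=1)          # target at each position = the next token

loss = loss_fn(model(xb).reshape(-1, vocab_size), yb.reshape(-1))
loss.backward()                                 # gradients from every mistaken prediction
opt.step(); opt.zero_grad()                     # nudge the weights, then repeat across ~a billion tokens
```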

Why is that useful? Because at that point, I've got a model that probably knows how to complete sentences like these, so it knows quite a lot about English and quite a lot about how the world works: what kinds of things tend to be hot in different situations, for instance. Ideally it would learn things like "In 1996, in a speech to the United Nations, United States President ___ said." That would be a really good language model, because it would actually have to know who the United States president was in that year. So getting really good at training language models is a great way to teach a neural net a lot about what is in our world and how things in our world work. It's a really fascinating topic, and it's one that philosophers have been studying for hundreds of years; there's a whole branch of philosophy about what can be learned from studying language alone. It turns out, empirically, the answer is quite a lot.

Fine-Tuning

And so here's the interesting thing: you can start by training a language model on all of Wikipedia, and then we can make that available to everybody, just like a pre-trained ImageNet model for vision. We've now made available a pre-trained WikiText model for NLP, not because it's particularly useful in itself (predicting the next word of sentences is somewhat useful, but not normally what we want to do), but because it means we have a model that understands a lot about language and a lot about what language describes.

So then we can take that and do transfer learning to create a new language model that's specifically good at predicting the next word of movie reviews. If we can build a language model that's good at predicting the next word of movie reviews, pre-trained from the WikiText model, then it's going to understand a lot about "my favorite actor is Tom ___," or "I thought the photography was fantastic, but I wasn't really so happy about the director," and so on. It's going to learn a lot about how movie reviews specifically are written. It'll even learn things like the names of some popular movies.
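This language-model fine-tuning step looks roughly like the following with the fastai v1 API used in this course. It's a sketch, not the full IMDb notebook from the lesson: the small IMDb sample dataset and the hyperparameters here are just illustrative.

```python
from fastai.text import *

# Build a language-model data bunch from a CSV of movie review texts.
path    = untar_data(URLs.IMDB_SAMPLE)
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')

# language_model_learner starts from the AWD-LSTM pre-trained on WikiText-103 and
# fine-tunes it to predict the next word of movie reviews; no sentiment labels needed here.
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.unfreeze()
learn_lm.fit_one_cycle(1, 1e-3)

learn_lm.save_encoder('ft_enc')   # keep the fine-tuned encoder for the classifier step
```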

That means we can use a huge corpus of movie reviews, even if we don't know whether they're positive or negative, to learn a lot about how movie reviews are written. So for all of this pre-training and all of this language model fine-tuning, we don't need any labels at all. It's what the researcher Yann LeCun calls self-supervised learning. In other words, it's a classic supervised model (we have labels), but the labels aren't things that somebody else has created; they're built into the dataset itself.

So this is really neat, because at this point we've got something that's good at understanding movie reviews, and we can fine-tune that with transfer learning to do the thing we actually want to do, which in this case is to classify movie reviews as positive or negative. And my hope, when I tried this last year, was that at that point 25,000 ones and zeros would be enough feedback to fine-tune that model. It turned out it absolutely was.
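Continuing the fastai v1 sketch above (reusing `path`, `data_lm`, and the saved `'ft_enc'` encoder, with illustrative hyperparameters), the classification step transfers the fine-tuned encoder into a classifier and trains it on the labelled reviews:

```python
# Build a classification data bunch that shares the language model's vocabulary.
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', vocab=data_lm.train_ds.vocab)

learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('ft_enc')          # reuse everything the language model learned
learn_clas.fit_one_cycle(1, 1e-2)

learn_clas.predict("I really loved that movie!")   # -> (predicted class, class index, probabilities)
```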


Question: Does the language model approach work for text in forums that use informal English, misspelled words, slang, or short forms—like "S6" instead of "Samsung S6"?

Answer: Yes, absolutely it does, particularly if you start with your WikiText model and then fine-tune it with what we call a target corpus. Your target corpus is just a bunch of documents: it could be emails or tweets or medical reports or whatever. You fine-tune it so it can learn a bit about the specifics of the slang, abbreviations, and so on that didn't appear in the original corpus.

And interestingly, this is one of the things that surprised people when we did this research last year. People thought that learning from something like Wikipedia wouldn't be that helpful, because it's not that representative of how people tend to write. But it turns out it's extremely helpful, because there's a much bigger difference between Wikipedia and random words than there is between Wikipedia and, say, Reddit. So it gets you 99% of the way there.

So these language models themselves can be quite powerful.
