Yoav Goldberg, Jan 23, 2021.
The FAccT paper "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" by Bender, Gebru, McMillan-Major and Shmitchell has been the center of a controversy recently. The final version is now out, and, owing a lot to this controversy, will undoubtedly become very widely read. I read an earlier draft of the paper, and I think that the new and updated final version is much improved in many ways: kudos to the authors for this upgrade. I also agree with and endorse most of the content. This is important stuff, and you should read it.
However, I do find some aspects of the paper (and the resulting discourse around it and around the technology) to be problematic. These weren't clear to me when I initially read the first draft several months ago, but they became very clear to me now. These points are for the most part not major disagreements with the content, but in some ways they go against the very core premise of the paper. I think they are also important voices in the debate. This short piece is an attempt to concisely list them.
The criticism has two parts:
- The paper is attacking the wrong target.
- The paper takes one-sided political views, without presenting them as such and without presenting the alternative views.
Let's handle them in turn. We'll start with the first one.
The argument as a one-liner: The real criticism is not about model size, it's about any language model. Framing it around size is harmful.
The paper's title asks a direct question: "can language models be too big?" This question directly connects the dangers and concerns with the size of the language models. This is already manifested in numerous online discussions and various popular media pieces decrying the dangers of large language models, calling for regulating the size of language models, for stopping big tech companies from monopolizing large language models, and so on.
But the paper doesn't really deal with the dangers of large language models at all. The title question "can language models be too big?" is never answered. And for a good reason: it is the wrong question to ask. Size has nothing to do with it. Indeed, not a single criticism or concern in the paper is actually about model size. Yet the framing is that of size, and I think this is harmful and dangerous, as I will explain below. The harm is already done: the media and the public took to a size-centric debate, and equate dangers with size. I am afraid this trend will be hard to reverse. This is an attempt to try to do so.
The paper raises three main lines of concern:
- Environmental cost of training large models
- Unfathomable training data
- Models acting as stochastic parrots that repeat and manifest issues in the data.
Note that none of these are actually about model size per se. The first is about computational efficiency. The second and third are intertwined, but the core issues they raise are training data quality (which relates to some extent to training data size) and output quality. There is also an underlying issue of lack of transparency and lack of interpretability.
All of these concerns apply just as well to small and efficient language models. Size is simply irrelevant.
Smaller models can still be inefficient and have a high environmental cost, especially if the smaller models are not as effective as the larger ones and so cannot offset that cost. More importantly, model size is not directly linked to computational efficiency. Already in the list of models in the paper, some of the larger models (in terms of parameter count) are also more computationally efficient (specifically the Switch Transformer). On the other side, some models use heavy parameter sharing across layers, which reduces the parameter count while remaining high in computational cost and carbon footprint. Or a model can just be small and inefficient. There is really no good way, and no good reason, to equate size with efficiency. The question that should be asked here, then, is not "can LMs be too big?" but "can LMs be too environmentally costly?". These are different questions. While the harm in asking the wrong question in this case is not that big, it still exists. It may detract from looking into architectures that are both big and efficient (like the Switch Transformer, or architectures based on specialized hardware), and it may cause more waste by shifting focus to smaller models that resort to other forms of expensive computation (training and inference with algorithms that are polynomial in the number of parameters rather than linear?), or that are simply more costly in aggregate.
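To make the parameter-count-vs-compute point concrete, here is a back-of-the-envelope sketch in Python. All layer shapes and sizes here are hypothetical, loosely in the spirit of ALBERT-style weight sharing and Switch-Transformer-style sparse experts, not the numbers of any real model:

```python
def ffn_params(d_model, d_ff):
    # Parameters in the two dense projections of a transformer feed-forward block.
    return 2 * d_model * d_ff

def dense_stack(layers, d_model, d_ff):
    # Standard stack: every layer has its own weights.
    params = layers * ffn_params(d_model, d_ff)
    flops_per_token = 2 * params  # roughly 2 FLOPs per parameter per token
    return params, flops_per_token

def shared_stack(layers, d_model, d_ff):
    # ALBERT-style sharing: one set of weights, stored once...
    params = ffn_params(d_model, d_ff)
    # ...but still applied at every layer, so compute doesn't shrink.
    flops_per_token = 2 * layers * ffn_params(d_model, d_ff)
    return params, flops_per_token

def switch_stack(layers, d_model, d_ff, experts):
    # Switch-style sparsity: many experts stored per layer...
    params = layers * experts * ffn_params(d_model, d_ff)
    # ...but each token is routed through only one expert per layer.
    flops_per_token = 2 * layers * ffn_params(d_model, d_ff)
    return params, flops_per_token

for name, (p, f) in [
    ("dense, 12 layers", dense_stack(12, 1024, 4096)),
    ("shared, 12 layers", shared_stack(12, 1024, 4096)),
    ("switch, 12 layers, 64 experts", switch_stack(12, 1024, 4096, 64)),
]:
    print(f"{name}: {p / 1e6:.0f}M params, {f / 1e6:.0f}M FLOPs/token")
```

Under these toy assumptions, the shared model has a twelfth of the parameters of the dense one and the sparse model has sixty-four times as many, yet all three do the same amount of computation per token. Parameter count and compute cost simply measure different things.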
Turning to the other issues, here focusing on the size argument really becomes dangerous and harmful: the described concerns are for the most part valid and important, but they are just as valid for smaller models as they are for larger ones. We can feed unfathomable (or just plain bad) training data to a smaller model as well. Smaller models are also stochastic parrots. Smaller models are also not interpretable. And the harms remain. A smaller model can still exhibit the same undesired behaviors; it can still be racist, sexist, biased, a status-quo amplifier, and so on. And it is just as uninterpretable as the larger ones. The described dangers and concerns are dangers and concerns of language models, not specifically of large language models, and they do not grow or shrink with size. By framing the issue around size, people may conclude that small models are fine, or somehow less dangerous with respect to the concerns raised in the paper. This is totally wrong. People should be just as responsible when using smaller LMs as they are when using large ones.
[Update, Jan 24, 2021 --- added the following two paragraphs] Gebru, on social media, stated that they consider "data size" to be part of "model size" as well, and that they say so in the paper. I didn't read it this way, and it sounds odd to me to say "can models be too big" when you mean "can training data size be too big". But even under this interpretation, the paper does not say why a large training set is bad, and it certainly doesn't say why training data can be "too big". The argument the paper does make is that data size is not enough to ensure properties such as diversity, quality, etc. I agree with this, and I agree that such properties should be looked at. All of section 4 in the paper is an important read. But the argument it makes is "size is not enough", not "size is bad and can be too big". Maybe large amounts of high-quality data will be hard to collect. Fine, so it's a challenge. Still, there is currently no reason to believe that if we manage to collect large amounts of high-quality data, it will be a priori worse than using small amounts of high-quality data. Size is not the issue. Quality is, and focusing on size is a distraction.
(Side note: it may very well be that we will realize that beyond some data size, model quality deteriorates. This has been observed before. But this is an empirical question that should be verified. It does not mean that large data is a priori bad. Similarly, authors like Tal Linzen argue that people learn from much smaller data samples than models do, and hence researching models that use less data is worthwhile. Again, full agreement here, but this is unrelated to the potential dangers of language models.)
The argument as a one-liner: The authors suggest that good (= not dangerous) language models are language models which reflect the world as they think the world should be. This is a political argument, which packs within it an even larger political argument. However, an alternative view by which language models should reflect language as it is being used in a training corpus is at least as valid, and should be acknowledged.
The paper takes several assumptions as given, without stating them as assumptions, and without considering the alternatives. This is mostly centered in section 6.2 (Risks and Harms), though it is also manifested in other parts of the paper. A similar critique has been expressed by Michael Lissack. My arguments here are somewhat different from his. Lissack also goes into much greater depth on several aspects which I don't touch (and some that I don't fully agree with).
I will focus on section 6.2 (Risks and Harms). This section states several potential harms, and in doing so states how the authors think a language model should behave and, more broadly, how a machine-learning system should model the world. The views expressed in this section are opinions, and very one-sided at that. However, the fact that they are merely opinions, or that there is a valid debate to be had around them, is never acknowledged or even hinted at. While I agree with many of the opinions, I also disagree with some. And regardless of my personal opinion, I think there is an important debate here that should be made explicit. I will focus on the major issue I see.
A major question to be asked is "do we want our models to reflect the data as it is, or the world as we believe it should be?". The authors take a very conclusive stance here in favor of the second, but the first option is also valid, and must at least be considered. This is to a large extent a political question, and it becomes even more political when taking the "world as we believe it should be" stance that the authors take: different groups believe in different things. The paper reflects a set of beliefs that is very much North American and left-leaning.
If we take language models as models of human language, do we want the model to be aware of slurs? The paper very clearly argues that no, it definitely should not. But one could easily argue that, yes, we certainly do want the model to be aware of slurs. Slurs are part of language. If we don't want the model to generate slurs, this is a valid request in some use-cases. But restricting them outright? This could be undesired. As a simple example, consider a model that does not know any slur or profanity words. Such words are not in the model's vocabulary, and it never saw slurs or profanities in its training. Not only is this model now not modeling human language (because language does have slurs and profanities), it will also not be able to recognize unwanted behaviors when encountering them. If we want to classify text for toxicity, such a model will let very toxic texts pass, because it will not recognize them as such. This also ties into debates about censorship, use-vs-mention, the validity of having "taboo words", etc.
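The toxicity-classification point can be illustrated with a deliberately simplistic toy sketch: a bag-of-words scorer where the placeholder token "SLUR" stands in for an actual slur and the weights are invented. A model whose training data never contained slurs cannot have learned a weight for them, so the word contributes nothing to the score:

```python
# Weights a model could plausibly learn *if* slurs were present in its
# training data ("SLUR" is a stand-in token, weights are made up):
full_weights = {"hate": 0.9, "SLUR": 0.95}

# The same model trained on sanitized data: the slur was never seen,
# so no weight was ever learned for it.
sanitized_weights = {"hate": 0.9}

def toxicity_score(text, weights):
    # Bag-of-words scoring: words with no learned weight contribute 0.
    return sum(weights.get(word, 0.0) for word in text.split())

text = "go away you SLUR"
print(toxicity_score(text, full_weights))       # flagged as toxic
print(toxicity_score(text, sanitized_weights))  # scores 0.0: the toxic text passes
```

Real toxicity classifiers are of course far more sophisticated, but the underlying failure mode is the same: a model cannot flag what it has no representation of.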
Similarly for other linguistic forms that the authors list as undesirable, such as microaggressions, dog-whistles, or subtle patterns such as referring to "woman doctors" or "both genders". Again, if we want our models to actually model human language use, we want these patterns in the model. If we use language models to, for example, compare bodies of texts from different sources, or to study societies based on the texts they left behind (as many digital humanities scholars are now doing), we do want to have these encoded in the model. If we study political discourse, we want these things in the model. And so on and so on. Even if we just want a stochastic parrot that generates fanfiction or stories in some genre, we want it to accurately reflect this genre. Literature has profanities, slurs and microaggressions, even if just as literary devices. If Charles Bukowski can write misogynistic stories, why can't a model write such stories? If Salinger can use the word "fuck" in a story, why can't a model? If the Wu-Tang Clan can use the n-word in their rap lyrics, why can't a model? Yes, there are cases where this behavior is inappropriate. Maybe even most occasions. But it is far from clear to me that the solution should be in the language model itself, rather than in the larger application. And it is even less clear that the solution should be all-encompassing, rather than applied on a case-by-case basis.
These are just two examples, but there are many good reasons to argue that a model of language use should reflect how the language is actually being used. I find this view to be highly non-controversial. However, this is part of a much larger debate that I cannot do justice to in this short piece. My point is that this debate is valid, it must take place, and it should have been acknowledged in the paper. At the least, we should acknowledge the option that there could be two kinds of LMs, and that both are valid, perhaps depending on the final use-case or occasion. The paper does not acknowledge that. This is unscientific and, in my opinion, also harmful. (And all of this without even touching the issue of "who gets to decide what are the slurs, microaggressions and behaviors that should be avoided", which is a huge political issue on its own, and on which the authors take a very opinionated stand.)
[Update, Jan 24, 2021 --- added the following text]
Based on some conversations on twitter with Gebru and others, I would like to clarify the following point: I read section 6.2 of the paper as prescribing how a language model should behave. That is, I read it as advocating for language models that, among other things:
- do not replicate the hegemonic world view they pick up from their training data.
- do not produce slurs or other forms of language that may seem derogatory, even if present in their training data.
- do not produce utterances picked up from the training data which can be perceived as microaggressions, abusive language, biased language, etc.
- in particular, do not produce patterns such as the phrases "both genders" or "woman doctors" in the same frequency as these appear in the data.
- and so on.
Another possible reading of section 6.2 is that it merely lists these as potential things a careful user should be aware of, along with their implications, before deciding whether they want to include them in their language model or not. That is, it is merely advice, not a prescription. Some language models CAN produce such behavior and still be considered good. This reading was not natural to me, but if this is your reading of section 6.2 and the rest of the paper, then great. It means that you probably also agree with everything you have read so far by me in this section, sans the "one-sidedness" remark, and can easily reconcile it with the world-view presented in the paper. That's great.
[end update]
A growing list of criticisms of this piece has been raised on twitter; for most of them, my responses are included in the twitter thread. If you want me to link to a specific tweet (or any other URL of your choosing) which is not listed yet, either ask me to, or create a PR.
Hello Yoav!
I found the paper to be quite an interesting read. I also think I have some semi-rebuttals to your analysis and/or scenarios for you to consider which you might find interesting. I'd be interested to understand how you square these with your feedback on the paper.
Attacking the wrong target
I broadly disagree with this statement:
I think I do agree with you on the environmental point. So this comment does not address this at all.
I think my comment broadly revolves around the idea of model architectures and model size. Naturally, there are some models which are more interpretable (for a given range of definitions of that word, e.g. faithfulness), such as linear models. But equally, I could design a linear model with 1 billion parameters, trained on a classification task with L2 regularization, that would be extremely difficult to interpret via any standard method for linear model interpretability, due to its size. So in the case that we care deeply about classes of model interpretability, there are still ways to scale up models in less complex classes, in terms of their parameter count, which make them more difficult to interpret. For this reason I do not completely buy your argument that scale is not relevant.
Secondly, would you stand by your statement that smaller models are just as hard to interpret when taken to an extreme? As you suggest here:
Suppose there existed a 100-parameter language model that could generate the same level of output quality as GPT-3. Do you believe that in this scenario users (and by users here I probably mean NLP practitioners) would not understand how it works? E.g. there are methods available in certain size regimes, such as brute-force search, which would allow (in my opinion) a much higher degree of interpretability.
So in these two examples, I believe that model size is a valid direction of analysis, even in the case that a model may be considered to be classically interpretable.
Finally, from a practical perspective, the larger models are, the harder they are to interpret in practice: you need multiple GPUs to run many of the larger models, and TPUs to train them. I guess that this is also likely to drive research into smaller versions of them, but it also dissuades research into their interpretability/biases relative to other models, given their enormous size. Of course, there are other things that discourage this too, such as models not being released, etc.
One Sided Political view
I agree with the analysis that the paper describes what "can" happen in such language models, rather than the prescriptive reading. In particular, I found that section 6.2 explicitly describes several conditions for the commentary which follows, e.g.:
In my opinion, all of these points are a-political, given that they do not consider the content of what the model may be societally biased toward.
Additionally, I wonder about this sentence:
I broadly agree with you! But this is not the question that your rebuttal needs to address, which is actually: does a language model trained on un-altered/cleaned web text represent such a model of language use? For example, it is extremely unclear to me that any large generative language model should not suffer from the problems with bias amplification described in Men Also Like Shopping:
Reducing Gender Bias Amplification using Corpus-level Constraints. For this reason, I find your anecdotes describing literary works using offensive language unconvincing. Perhaps a model can use offensive language like this, but currently we have no clear way of asserting that it is not doing so with a non-distributional level of frequency with respect to its training corpus. Edit: I subsequently saw that you corrected this on twitter, but it is not clear from the text as presented.
Also, I am convinced by Emily Bender's comment on twitter, that web text certainly does not represent a natural distribution of language use, and is to me, a much clearer source of statistical bias in the various training corpora than cleaning/removal of bad language. But your commentary does not really address this either.
Anyway, thanks for the read - it was a thought-provoking accompaniment to the paper, which made me think more about the points it made - even if I did end up disagreeing with you on several parts of it. Also, perhaps consider that people who haven't met you in real life may not know the "Yoav Goldberg tone". I distinctly remember the vitriol you directed at the authors of that GAN language generation paper at NeurIPS and found it extremely distasteful when I didn't know you, whereas after having met you in person I could imagine how you intended it to be received! What I'm trying to say is - your points could have been delivered better, with a little more grace, and a little less snark. But I know that may be too much to ask 😅 .
Mark