Yoav Goldberg, Jan 23, 2021.
The FAccT paper "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" by Bender, Gebru, McMillan-Major and Shmitchell has been at the center of a controversy recently. The final version is now out and, owing a lot to this controversy, will undoubtedly become very widely read. I read an earlier draft of the paper, and I think that the new and updated final version is much improved in many ways: kudos to the authors for this upgrade. I also agree with and endorse most of the content. This is important stuff, you should read it.
However, I do find some aspects of the paper (and the resulting discourse around it and around technology) to be problematic. These weren't clear to me when initially reading the first draft several months ago, but they have become very clear to me now. These points are for the most part not major disagreements with the content, but they also go in some ways against the very core premise of the paper. I think they are also important voices in the debate. This short piece is an attempt to concisely list them.
The criticism has two parts:
- The paper is attacking the wrong target.
- The paper takes one-sided political views, without presenting them as such and without presenting the alternative views.
Let's handle them in turn. We'll start with the first one.
The argument as a one-liner: The real criticism is not about model size, it's about any language model. Framing it around size is harmful.
The paper's title asks a direct question: "can language models be too big?" This question directly connects the dangers and concerns with the size of the language models. This is already manifested in numerous online discussions and various popular media pieces attacking the dangers in large language models, calling for regulating the size of language models, to stop big tech companies from monopolizing large language models, etc, etc.
But the paper doesn't really deal with the dangers of large language models at all. The title question "can language models be too big?" is not answered. And for a good reason: it is the wrong question to ask. Size has nothing to do with it. Indeed, not a single criticism or concern in the paper is actually about model size. Yet, the framing is that of size, and I think this is harmful and dangerous, as I will explain below. The harm is already done: the media and the public have taken to a size-centric debate, and equate dangers with size. I am afraid this trend will be hard to reverse. This is an attempt to try to do so.
The paper raises three main lines of concern:
- Environmental cost of training large models
- Unfathomable training data
- Models acting as stochastic parrots that repeat and manifest issues in the data.
Note that none of these is actually about model size per se. The first is about computational efficiency. The second and third are intertwined, but the core issues they raise are training data quality (which relates to some extent to training data size), and output quality. There is also an underlying issue of lack of transparency and lack of interpretability.
All of these concerns hold just as well also for small and efficient language models. Size is just irrelevant.
Smaller models can still be inefficient and have a high environmental cost, especially if they are not as effective as the larger ones and so cannot offset the cost. More importantly, model size is not directly linked to computational efficiency. Already in the list of models in the paper, some of the larger models (in terms of parameter count) are also more computationally efficient (specifically the Switch Transformer). On the other side, some models use heavy parameter sharing across layers, which reduces the parameter count, while still remaining computationally inefficient and high in carbon costs. Or a model can just be small and inefficient. There is really no good way, and no good reason, to equate size with efficiency. The question that should be asked here then is not "can LMs be too big?" but "can LMs be too environmentally costly?". These are different questions. While the harm in asking the wrong question in this case is not that big, it still exists. It may detract from looking into architectures that are both big and efficient (like the Switch Transformer, or ones based on specialized hardware), and it may cause more waste by shifting focus to smaller models that will resort to other forms of expensive computation (training and inference with algorithms that are polynomial in number of parameters rather than linear?), or just will be more costly in aggregate.
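To make the gap between parameter count and compute concrete, here is a minimal back-of-the-envelope sketch. It is a toy accounting, not taken from the paper or from any real model: it counts only feed-forward weights, ignores attention, embeddings, and routing overhead, and the `FFNStack` class and all dimensions are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FFNStack:
    """Toy stack of feed-forward blocks (attention and embeddings are ignored)."""
    d_model: int
    d_ff: int
    n_layers: int
    n_experts: int = 1          # >1: sparse mixture-of-experts, one expert active per token
    share_layers: bool = False  # True: every layer reuses the same set of weights

    def params(self) -> int:
        per_block = 2 * self.d_model * self.d_ff  # two weight matrices, biases ignored
        unique_layers = 1 if self.share_layers else self.n_layers
        return per_block * self.n_experts * unique_layers

    def flops_per_token(self) -> int:
        # A token passes through every layer, but through only one expert per layer.
        # A matrix-vector product costs roughly 2 * rows * cols FLOPs.
        # The (small) cost of routing a token to an expert is ignored here.
        return 2 * (2 * self.d_model * self.d_ff) * self.n_layers


dense = FFNStack(d_model=512, d_ff=2048, n_layers=12)
shared = FFNStack(d_model=512, d_ff=2048, n_layers=12, share_layers=True)
sparse = FFNStack(d_model=512, d_ff=2048, n_layers=12, n_experts=8)

for name, m in [("dense", dense), ("shared", shared), ("sparse", sparse)]:
    print(f"{name:>6}: params={m.params():>12,}  flops/token={m.flops_per_token():,}")
```

Under these assumptions the three variants differ in parameter count by roughly two orders of magnitude (the shared stack is the smallest, the sparse one the largest), while the per-token compute is identical. Parameter count alone says little about cost.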
Turning to the other issues, here focusing on the size argument really becomes dangerous and harmful: the described concerns are for the most part valid and important, but they are just as valid for smaller models as they are for larger ones. We can feed unfathomable (or just plain bad) training data also to a smaller model. Smaller models are also stochastic parrots. Smaller models are also not interpretable. And the harms remain. A smaller model can still exhibit the same undesired behaviors: it can still be racist, sexist, biased, status-quo-amplifying, etc, etc. And it is just as uninterpretable as the larger ones. The described dangers and concerns are dangers and concerns of language models, not specifically of large language models, and they do not grow or shrink with size. By framing the issue around size, people may conclude that small models are fine, or somehow less dangerous w.r.t. the concerns raised in the paper. This is totally wrong. People should be just as responsible when using smaller LMs as they are when using large ones.
[Update, Jan 24, 2021 --- added the following 2 paragraphs] Gebru, on social media, stated that they consider "data size" to be part of "model size" as well, and that they say so in the paper. I didn't read it this way, and it sounds odd to me to say "can models be too big" when you mean "can training data size be too big". But even under this interpretation, the paper does not say why large training size is bad, and it certainly doesn't say why training data can be "too big". The argument the paper does make is that data size is not enough to ensure properties such as diversity, quality, etc. I agree with this, and I agree that such properties should be looked at. All of section 4 in the paper is an important read. But the argument it makes is "size is not enough", not "size is bad and can be too big". Maybe large amounts of high-quality data will be hard to collect. Fine, so it's a challenge. Still, there is currently no reason to believe that if we manage to collect large amounts of high-quality data, it will be a priori worse than using small amounts of high-quality data. Size is not the issue. Quality is, and focusing on size is a distraction.
(Side notes: it may very well be that we will realize that after some data size, model quality may deteriorate. It has been observed before. But this is an empirical question that should be verified. It does not mean that large data is a priori bad. Similarly, authors like Tal Linzen argue that people learn from much smaller data samples than models, and hence researching models that use less data is worthwhile. Again, full agreement here, but this is unrelated to the potential dangers of language models.)
The argument as a one-liner: The authors suggest that good (= not dangerous) language models are language models which reflect the world as they think the world should be. This is a political argument, which packs within it an even larger political argument. However, an alternative view by which language models should reflect language as it is being used in a training corpus is at least as valid, and should be acknowledged.
The paper takes several assumptions as given, without stating them as assumptions, and without considering the alternatives. This is mostly centered in section 6.2 (Risks and Harms) though it is also manifested in other parts of the paper. A similar critique has been expressed by Michael Lissack. My arguments here are somewhat different from his. Lissack also goes into much greater depth in several aspects which I don't touch (and some that I don't fully agree with).
I will focus on section 6.2 (Risks and Harms). This section states several potential harms, and in doing so states how the authors think a language model should behave, and, more broadly, how a machine-learning system should model the world. The views expressed in this section are opinions, and very one-sided at that. However, the fact that they are merely opinions, or that there is a valid debate to be had around them, is never acknowledged or even hinted at. While I agree with many of the opinions, I also disagree with some. And regardless of my personal opinion, I think there is an important debate that should be made explicit. I will focus on the major issue I see.
A major question to be asked is "do we want our models to reflect the data as it is, or the world as we believe it should be". The authors take a very conclusive stance here for the second, but the first option is also valid, and must at least be considered. This is to a large extent a political question, and it becomes even more political when taking the "world as we believe it should be" stance that the authors take: different groups believe in different things. The paper reflects a set of beliefs that is very much North American and left-leaning.
If we take language models as models of human language, do we want the model to be aware of slurs? The paper very clearly argues that "no it definitely should not". But one could easily argue that, yes, we certainly do want the model to be aware of slurs. Slurs are part of language. If we don't want the model to generate slurs, this is a valid request in some use-cases. But restricting them outright? This could be undesired. As a simple example, consider a model that does not know any slur or profanity words. Such words are not in the model's vocabulary, and it never saw slurs or profanities in its training. Not only is this model no longer modeling human language (because language does have slurs and profanities), it will also not be able to recognize unwanted behaviors when encountering them. If we want to classify text for toxicity, such a model will let very toxic texts pass, because it will not recognize them as such. This also ties into debates about censorship, use-vs-mention, the validity of having "taboo words", etc.
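The toxicity example can be illustrated with a deliberately simplistic sketch. Everything here is an assumption made for illustration: the scorer is a toy lexicon-weighted sum rather than a real classifier, `slurword` is a placeholder token standing in for an actual slur, and the weights and threshold are made up.

```python
# Toy lexicon-based toxicity scorer. "slurword" is a hypothetical placeholder token,
# and the weights and threshold are invented for illustration only.
UNK = "<unk>"


def tokenize(text, vocab):
    """Lowercase, split on whitespace, and map out-of-vocabulary words to UNK."""
    return [w if w in vocab else UNK for w in text.lower().split()]


def toxicity_score(text, vocab, weights):
    """Sum the per-token toxicity weights; tokens without a weight contribute 0."""
    return sum(weights.get(tok, 0.0) for tok in tokenize(text, vocab))


full_vocab = {"you", "are", "a", "slurword", "doctor"}
sanitized_vocab = full_vocab - {"slurword"}

full_weights = {"slurword": 1.0}
# The sanitized model never saw the word, so it has nothing to learn a weight from.
sanitized_weights = {}

text = "You are a slurword"
THRESHOLD = 0.5

flag_full = toxicity_score(text, full_vocab, full_weights) >= THRESHOLD
flag_sanitized = toxicity_score(text, sanitized_vocab, sanitized_weights) >= THRESHOLD
print(flag_full, flag_sanitized)  # prints: True False
```

The model with the sanitized vocabulary sees the offending word only as `<unk>`, carries no signal for it, and lets the text through.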
Similarly for other linguistic forms that the authors list as undesirable, such as microaggressions, dog-whistles, or subtle patterns such as referring to "woman doctors" or "both genders". Again, if we want our models to actually model human language use, we want these patterns in the model. If we use language models to, for example, compare bodies of texts from different sources, or to study societies based on the texts they left behind (as many digital humanities scholars are now doing) we do want to have these encoded in the model. If we study political discourse, we want these things in the model. And so on and so on. Even if we just want a stochastic parrot that generates fanfiction or stories in some genre, we want to accurately reflect this genre. Literature has profanities, slurs and microaggressions even if just as literary devices. If Charles Bukowski can write misogynist stories, why can't a model write such stories? If Salinger can use the word "fuck" in a story, why can't a model? If the Wu-Tang Clan can use the n-word in their rap lyrics, why can't a model? Yes, there are places where this behavior is inappropriate. Maybe even most occasions. But it is far from clear to me that the solution should be in the language model itself, rather than in the larger application. And it is even less clear that the solution should be all-encompassing, and not on a case-by-case basis.
These are just two examples, but there are many good reasons to argue that a model of language use should reflect how the language is actually being used. I find this view to be highly non-controversial. However, this is part of a much larger debate that I cannot do justice to in this short piece. My point is that this debate is valid, it must take place, and it should have been acknowledged in the paper. At the least, we should acknowledge the option that there could be two kinds of LMs, and that both are valid, maybe depending on the final use-case or occasion. The paper does not acknowledge that. This is unscientific and, in my opinion, also harmful. (And all of this without even touching the issue of "who gets to decide what are the slurs, microaggressions and behaviors that should be avoided", which is a huge political issue on its own, and on which the authors take a very opinionated stand).
[Update, Jan 24, 2021 --- added the following text]
Based on some conversations on Twitter with Gebru and others, I would like to clarify the following point: I read section 6.2 of the paper as prescribing how a language model should behave. That is, I read it as advocating for language models that, among other things:
- do not replicate the hegemonic world view they pick up from their training data.
- do not produce slurs or other forms of language that may seem derogatory, even if present in their training data.
- do not produce utterances that are picked up from the training data which can be perceived as microaggressions, abusive language, biased language, etc.
- in particular, do not produce patterns such as the phrases "both genders" or "woman doctors" at the same frequency with which these appear in the data.
- and so on.
Another possible reading of section 6.2 is that it merely lists these as potential things a careful user should be aware of, and aware of their implications, and then decide whether they want to include them in their language model or not. That is, as merely advice, not a prescription. Some language models CAN produce such behavior and be considered good. This reading was not natural to me, but if this is your reading of section 6.2 and the rest of the paper, then great. It means that you probably also agree with all you read so far by me in this section, sans the "one-sidedness" remark, and can easily reconcile it with the world-view presented in the paper. That's great.
[end update]
A growing list of criticisms of this piece raised on Twitter, most of them including my responses in the Twitter thread. If you want me to link to a specific tweet (or any other URL of your choosing) which is not listed yet, either ask me to, or create a PR.
I'm late to this game but would add two points:
I guess my point is that there is no silver bullet: no methodology which, if followed, would guarantee unbiased models. Do large LMs have problems? Of course they do. Would models produced according to the methodology recommended by the "Stochastic parrots" paper have problems? I'm pretty sure they would too.
What I think is the biggest merit of "Stochastic parrots" is making the point that building language models is not just a computer science and mathematical problem, it's also a social science / humanities problem and therefore requires social science / humanities expertise as well. And this one has been sorely lacking in the development of modern AI.