Yoav Goldberg, Jan 23, 2021.
The FAccT paper "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" by Bender, Gebru, McMillan-Major and Shmitchell has been at the center of a controversy recently. The final version is now out, and, owing a lot to this controversy, will undoubtedly become very widely read. I read an earlier draft of the paper, and I think that the new and updated final version is much improved in many ways: kudos to the authors for this upgrade. I also agree with and endorse most of the content. This is important stuff; you should read it.
However, I do find some aspects of the paper (and the resulting discourse around it and around the technology) to be problematic. These weren't clear to me when I initially read the first draft several months ago, but they became very clear to me now. These points are for the most part not major disagreements with the content, but they do in some ways go against the very core premise of the paper. I think they are also important voices in the debate. This short piece is an attempt to list them concisely.
The criticism has two parts:
- The paper is attacking the wrong target.
- The paper takes one-sided political views, without presenting them as such and without presenting the alternative views.
Let's handle them in turn. We'll start with the first one.
The argument as a one-liner: The real criticism is not about model size, it's about any language model. Framing it as a question of size is harmful.
The paper's title asks a direct question: "can language models be too big?" This question directly connects the dangers and concerns with the size of the language models. This is already manifested in numerous online discussions and various popular-media pieces warning of the dangers of large language models, calling for regulating the size of language models, for stopping big tech companies from monopolizing large language models, etc, etc.
But the paper doesn't really deal with the dangers of large language models at all. The title question "can language models be too big?" is not answered. And for a good reason: it is the wrong question to ask. Size has nothing to do with it. Indeed, not a single criticism or concern in the paper is actually about model size. Yet, the framing is that of size, and I think this is harmful and dangerous, as I will explain below. The harm is already done: the media and the public took to a size-centric debate, and equate dangers with size. I am afraid this trend will be hard to reverse. This is an attempt to try to do so.
The paper raises three main lines of concern:
- Environmental cost of training large models
- Unfathomable training data
- Models acting as stochastic parrots that repeat and manifest issues in the data.
Note that none of these is actually about model size per se. The first is about computational efficiency. The second and third are intertwined, but the core issues they raise are training-data quality (which relates to some extent to training-data size) and output quality. There is also an underlying issue of lack of transparency and lack of interpretability.
All of these concerns apply just as well to small and efficient language models. Size is simply irrelevant.
Smaller models can still be inefficient and have a high environmental cost, especially if they are not as effective as the larger ones and so cannot offset the cost. More importantly, model size is not directly linked to computational efficiency. Already in the list of models in the paper, some of the larger models (in terms of parameter count) are also more computationally efficient (specifically the Switch Transformer). On the other side, some models use heavy parameter sharing across layers, which reduces the parameter count while remaining just as high in computational cost and carbon footprint. Or a model can just be small and inefficient. There is really no good way, and no good reason, to equate size with efficiency. The question that should be asked here, then, is not "can LMs be too big?" but "can LMs be too environmentally costly?". These are different questions. While the harm in asking the wrong question is not that big in this case, it still exists. It may detract from looking into architectures that are both big and efficient (like the Switch Transformer, or ones based on specialized hardware), and it may cause more waste by shifting focus to smaller models that resort to other forms of expensive computation (training and inference with algorithms that are polynomial in the number of parameters rather than linear?), or that are simply more costly in aggregate.
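The decoupling of parameter count from compute can be made concrete with a back-of-the-envelope sketch. The dimensions below are hypothetical, chosen only for illustration (this is not an accounting of any particular published model): a model that shares one set of weights across all layers, ALBERT-style, has a fraction of the parameters of its unshared counterpart, yet runs exactly the same per-token computation.

```python
# Back-of-the-envelope sketch: parameter count vs. per-token compute for a
# transformer stack, with and without cross-layer parameter sharing.
# All dimensions here are hypothetical, for illustration only.

def layer_params(d_model: int, d_ff: int) -> int:
    """Approximate parameters in one transformer layer:
    4 attention projections (d_model x d_model) + 2 FFN matrices."""
    attention = 4 * d_model * d_model
    ffn = 2 * d_model * d_ff
    return attention + ffn

def stack_params(n_layers: int, d_model: int, d_ff: int, shared: bool) -> int:
    # With full cross-layer sharing, one set of weights is reused n_layers times,
    # so it is stored (and counted) only once.
    copies = 1 if shared else n_layers
    return copies * layer_params(d_model, d_ff)

def stack_flops_per_token(n_layers: int, d_model: int, d_ff: int) -> int:
    """Per-token multiply-accumulates: every layer runs regardless of sharing,
    at roughly 2 FLOPs per weight use."""
    return n_layers * 2 * layer_params(d_model, d_ff)

n_layers, d_model, d_ff = 24, 1024, 4096

unshared = stack_params(n_layers, d_model, d_ff, shared=False)
shared = stack_params(n_layers, d_model, d_ff, shared=True)
flops = stack_flops_per_token(n_layers, d_model, d_ff)

print(f"unshared params: {unshared:,}")   # 24x the shared count
print(f"shared params:   {shared:,}")
print(f"per-token FLOPs, identical either way: {flops:,}")
```

The shared model reports 24 times fewer parameters, yet its training and inference cost per token is unchanged, which is exactly why "number of parameters" is a poor proxy for environmental cost.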
Turning to the other issues, here focusing on the size argument really becomes dangerous and harmful: the described concerns are for the most part valid and important, but they are just as valid for smaller models as they are for larger ones. We can feed unfathomable (or just plain bad) training data also to a smaller model. Smaller models are also stochastic parrots. Smaller models are also not interpretable. And the harms remain. A smaller model can still exhibit the same undesired behaviors, it can still be racist, sexist, biased, a status-quo amplifier, etc, etc. And it is just as uninterpretable as the larger ones. The described dangers and concerns are dangers and concerns of language models, not specifically of large language models, and they do not grow or shrink with size. By framing the issue around size, people may conclude that small models are fine, or somehow less dangerous with respect to the concerns raised in the paper. This is totally wrong. People should be just as responsible when using smaller LMs as they are when using large ones.
[Update, Jan 24, 2021 --- added the following two paragraphs] Gebru, on social media, stated that they consider "data size" to be part of "model size" as well, and that they say so in the paper. I didn't read it this way, and it sounds odd to me to say "can models be too big" when you mean "can training-data size be too big". But even under this interpretation, the paper does not say why a large training-data size is bad, and it certainly doesn't say why training data can be "too big". The argument the paper does make is that data size is not enough to ensure properties such as diversity, quality, etc. I agree with this, and I agree that such properties should be looked at. All of section 4 in the paper is an important read. But the argument it makes is "size is not enough", not "size is bad and can be too big". Maybe large amounts of high-quality data will be hard to collect. Fine, so it's a challenge. Still, there is currently no reason to believe that if we manage to collect large amounts of high-quality data, it will be a priori worse than using small amounts of high-quality data. Size is not the issue. Quality is, and focusing on size is a distraction.
(Side note: it may very well be that we will realize that beyond some data size, model quality deteriorates. This has been observed before. But it is an empirical question that should be verified; it does not mean that large data is a priori bad. Similarly, authors like Tal Linzen argue that people learn from much smaller data samples than models do, and hence researching models that use less data is worthwhile. Again, full agreement here, but this is unrelated to the potential dangers of language models.)
The argument as a one-liner: The authors suggest that good (= not dangerous) language models are language models which reflect the world as they think the world should be. This is a political argument, which packs within it an even larger political argument. However, an alternative view by which language models should reflect language as it is being used in a training corpus is at least as valid, and should be acknowledged.
The paper takes several assumptions as given, without stating them as assumptions and without considering the alternatives. This is mostly centered in section 6.2 (Risks and Harms), though it is also manifested in other parts of the paper. A similar critique has been expressed by Michael Lissack. My arguments here are somewhat different from his. Lissack also goes into much greater depth on several aspects which I don't touch (and some of which I don't fully agree with).
I will focus on section 6.2 (Risks and Harms). This section states several potential harms, and in doing so states how the authors think a language model should behave and, more broadly, how a machine-learning system should model the world. The views expressed in this section are opinions, and very one-sided ones at that. However, the fact that they are merely opinions, or that there is a valid debate to be had around them, is never acknowledged or even hinted at. While I agree with many of the opinions, I also disagree with some. And regardless of my personal opinion, I think there is an important debate that should be made explicit. I will focus on the major issue I see.
A major question to be asked is "do we want our models to reflect the data as it is, or the world as we believe it should be". The authors take a very conclusive stance here for the second, but the first option is also valid, and must at least be considered. This is to a large extent a political question, and it becomes even more political when taking the "world as we believe it should be" stance that the authors take: different groups believe in different things. The paper reflects a set of beliefs that is very much North American and left-leaning.
If we take language models as models of human language, do we want the model to be aware of slurs? The paper very clearly argues "no, it definitely should not". But one could easily argue that, yes, we certainly do want the model to be aware of slurs. Slurs are part of language. If we don't want the model to generate slurs, this is a valid request in some use-cases. But restricting them outright? This could be undesired. As a simple example, consider a model that does not know any slur or profanity words: such words are not in the model's vocabulary, and it never saw slurs or profanities in its training. Not only is this model not modeling human language (because language does have slurs and profanities), it will also not be able to recognize unwanted behaviors when encountering them. If we want to classify text for toxicity, such a model will let very toxic texts pass, because it will not recognize them as such. This also ties into debates about censorship, use-vs-mention, the validity of having "taboo words", etc.
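The toxicity-classification failure mode above can be sketched in a few lines. This is a deliberately minimal toy (the vocabulary, the `tokenize` helper, and the `[slur]` placeholder are all hypothetical, standing in for a real tokenizer and a real offensive word): when offensive words are simply absent from the vocabulary, they collapse to the same unknown token as any other out-of-vocabulary word, and everything downstream is blind to the difference.

```python
# Minimal toy sketch (hypothetical vocabulary and helper, not a real tokenizer):
# words excluded from the vocabulary all collapse to "<unk>", so a classifier
# built on top of such a model cannot distinguish toxic from merely unusual text.

CLEAN_VOCAB = {"you", "are", "a", "wonderful", "person", "<unk>"}

def tokenize(text: str, vocab: set) -> list:
    """Map every out-of-vocabulary word to the unknown token."""
    return [w if w in vocab else "<unk>" for w in text.lower().split()]

toxic = "you are a [slur]"       # "[slur]" is a placeholder for an actual slur
harmless = "you are a flamingo"  # harmless word, also out of vocabulary

print(tokenize(toxic, CLEAN_VOCAB))
print(tokenize(harmless, CLEAN_VOCAB))
# Both inputs yield ['you', 'are', 'a', '<unk>'] -- identical token sequences,
# so no downstream model can score one as toxic and the other as benign.
```

The point is not about any specific system: any pipeline whose first step erases the offending words has, by construction, thrown away the evidence a toxicity detector would need.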
Similarly for other linguistic forms that the authors list as undesirable, such as microaggressions, dog-whistles, or subtle patterns such as referring to "woman doctors" or "both genders". Again, if we want our models to actually model human language use, we want these patterns in the model. If we use language models to, for example, compare bodies of texts from different sources, or to study societies based on the texts they left behind (as many digital-humanities scholars are now doing), we do want to have these encoded in the model. If we study political discourse, we want these things in the model. And so on and so on. Even if we just want a stochastic parrot that generates fanfiction or stories in some genre, we want to accurately reflect that genre. Literature has profanities, slurs and microaggressions, even if just as literary devices. If Charles Bukowski can write misogynist stories, why can't a model write such stories? If Salinger can use the word "fuck" in a story, why can't a model? If the Wu-Tang Clan can use the n-word in their rap lyrics, why can't a model? Yes, there are occasions when this behavior is inappropriate. Maybe even most occasions. But it is far from clear to me that the solution should be in the language model itself, rather than in the larger application. And it is even less clear that the solution should be all-encompassing, rather than made on a case-by-case basis.
These are just two examples, but there are many good reasons to argue that a model of language use should reflect how the language is actually being used. I find this view to be highly non-controversial. However, this is part of a much larger debate that I cannot do justice to in this short piece. My point is that this debate is valid, it must take place, and it should have been acknowledged in the paper. At the least, we should acknowledge the option that there could be two kinds of LMs, and that both are valid, perhaps depending on the final use-case or occasion. The paper does not acknowledge that. This is unscientific and, in my opinion, also harmful. (And all of this without even touching the issue of "who gets to decide what are the slurs, microaggressions and behaviors that should be avoided", which is a huge political issue on its own, and on which the authors take a very opinionated stand.)
[Update, Jan 24, 2021 --- added the following text]
Based on some conversations on twitter with Gebru and others, I would like to clarify the following point: I read section 6.2 of the paper as prescribing how a language model should behave. That is, I read it as advocating for language models that, among other things:
- do not replicate the hegemonic world view they pick up from their training data.
- do not produce slurs or other forms of language that may seem derogatory, even if present in their training data.
- do not produce utterances that are picked up from the training data which can be perceived as microaggressions, abusive language, biased language, etc.
- in particular, do not produce patterns such as the phrases "both genders" or "woman doctors" with the same frequency as these appear in the data.
- and so on.
Another possible reading of section 6.2 is that it merely lists these as potential things a careful user should be aware of, together with their implications, before deciding whether they want to include them in their language model or not. That is, as merely advice, not a prescription: some language models CAN produce such behavior and still be considered good. This reading was not natural to me, but if this is your reading of section 6.2 and the rest of the paper, then great. It means that you probably also agree with everything you have read so far by me in this section, sans the "one-sidedness" remark, and can easily reconcile it with the world-view presented in the paper. That's great.
[end update]
A growing list of criticisms of this piece has been raised on Twitter; for most of them, my responses are included in the Twitter thread. If you want me to link to a specific tweet (or any other URL of your choosing) which is not listed yet, either ask me to, or create a PR.
From Yoav Goldberg's response: "A major question to be asked is "do we want our models to reflect the data as it is, or the world as we believe it should be". The authors take a very conclusive stance here for the second, but the first option is also valid, and must at least be considered. This is to a large extent a political question, and it becomes even more political when taking the "world as we believe it should be" stance that the authors take: different groups believe in different things. The paper reflects a set of beliefs that is very much north-american and left-leaning."
With all due respect to Yoav Goldberg, an accomplished and thoughtful computer scientist, the above paragraph reflects a set of beliefs that is very much that of a privileged white man who is not in the habit of writing about politics. Or, to put the point more accurately, it is that of a privileged white man who believes that what the world is is simply what he understands it to be--based on his own observations and experience--and that any less familiar view must be partial and "political" in some way (perhaps North American and left-leaning). To be sure, the distinction between "the world as it is" and "the world as it ought to be" introduces a minimal awareness of philosophical argument. But it introduces it only to dismiss the validity of any normative view (what ought to be) as distinct from a heavy presumption in favor of "what is" in which, moreover, "what is" implicitly is defined as "what I (very neutrally and apolitically) assume it to be." By this logic, if there is too much gun violence in the United States (a claim about "what is") and I argue that it doesn't have to be that way and shouldn't (what ought to be), I can be dismissed as a North American lefty by anyone who doesn't want to think seriously about what can be done to limit gun violence. My view, after all, is a particular political stance and "different groups believe in different things"--so the evidence can't possibly speak in favor of my argument for less violence; gun violence just is what it is.
Even more problematic, however, given its importance to Goldberg's own research, is the faulty assumption that the largest datasets accurately represent what is; that, in other words, such data offers an empirical account of the totality of language as we know it. Of course, Goldberg never makes this assumption explicit (and undoubtedly would not wish to prove it). But his implicit assumption is obvious. For example, notice the slippage between data and world in the following sentence: "A major question to be asked is "do we want our models to reflect the data as it is, or the world as we believe it should be". Observe how "the DATA AS IT IS" (which in the Gebru et al. paper is something like the scrapable internet as it is) is contrasted to "THE WORLD AS WE BELIEVE IT SHOULD BE." Here the scrapable internet and the people and language that this data best represents get to stand for the "world as it is"; and get to stand in opposition to the world as "we believe it should be." The scrapable internet thus becomes a neutral signifier of empirical reality: everything else is just some group's pipedream or axe to grind. (And that group is surely not Goldberg's.)
Yet, whatever one's view of the paper in question, Gebru et al. quite clearly establish that THE DATA as it is (the scrapable internet) does not represent THE WORLD as it is. But this point, which Goldberg does not contest, seems to have been lost entirely in the haste to paint the paper's authors as activists rather than scientists.
Interestingly, Goldberg's summary of the paper accurately points to the "unfathomability" of training data as a chief feature of the argument against so-called stochastic parrots. But Goldberg never returns to the question of unfathomability. To the contrary, via slippages in logic like that illustrated above, he turns unfathomable training data into authoritative, legible, and transparent data (the "world as it is").
I think Goldberg is actually conflating several questions, but here are two key ones: First, given that there is a lot of biased language on the Internet, a great deal of which represents the prejudices of the most vocal people or texts, how can we design models and/or curate data sets to account for these asymmetries, exclusions, and idiosyncrasies? Second, since models use this data for diverse purposes that have material impact on all kinds of people in all sorts of situations, what can be done--or rather, what must be done in the interest of fairness--to assure that those who are under-represented in or misrepresented by these unfathomable data sets do not become victims of what purports to be objective assessment?
I must say that Goldberg's response to the paper is quite telling. Here we have a knowledgeable and thoughtful scientist who begins with the premise that the paper he has read is "important stuff" only, in the end, to reject its central premise on the question of ethics. I'm willing to bet that if we asked Yoav Goldberg whether he believes that the scrapable internet is an objective and inclusive proxy for knowledge of the world in its totality, he would say, "No of course not." And yet his response, whether he realizes it or not, sets that problem aside by invoking an insupportable distinction between the alleged neutrality of data and the alleged politicization of any argument that unmasks that illusion of neutrality.
If Goldberg's performance is an indication, the field has a long way to go before it can even begin to discuss, much less to address, the serious problems that beset it.