Japanese "bad words" lists for wiki vandalism detection.
- bad_words.txt: racial slurs, offensive expressions, etc. that would not be permitted in articles and are inappropriate in discussions.
- informal_words.txt [1]: expressions in a conversational tone, inappropriate in articles, but acceptable in discussions.
- informal_patterns.txt: substrings typically found in written conversations. They may or may not be aligned with the beginnings and ends of segments/tokens/words.
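
The difference between the word lists and the pattern list can be sketched as follows. This is only an illustration under stated assumptions: the sample entries are hypothetical and not taken from the actual files, the token list stands in for the output of whatever Japanese tokenizer is used, and the helper names `find_word_hits` and `find_pattern_hits` are invented for this example.

```python
def find_word_hits(tokens, words):
    """Return the tokens that exactly match an entry in a word list."""
    word_set = set(words)
    return [t for t in tokens if t in word_set]


def find_pattern_hits(text, patterns):
    """Return the patterns that occur anywhere in the raw text.

    Matching is done on the unsegmented string, so a pattern may start
    or end in the middle of a token.
    """
    return [p for p in patterns if p in text]


# Hypothetical entries for illustration only (not from the real lists).
informal_words = ["ありがとう"]
informal_patterns = ["ですよね"]

text = "そうですよね、ありがとう"
# Assumed tokenizer output; a real tokenizer may segment differently.
tokens = ["そう", "です", "よ", "ね", "、", "ありがとう"]

word_hits = find_word_hits(tokens, informal_words)
pattern_hits = find_pattern_hits(text, informal_patterns)
# Token-level lookup misses the pattern because it spans several tokens.
pattern_as_word_hits = find_word_hits(tokens, informal_patterns)

print(word_hits)
print(pattern_hits)
print(pattern_as_word_hits)
```

In this sketch the pattern is found by substring search but not by token lookup, which is why the patterns are kept in a separate file from the word lists.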
Intended to be used for:
- https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/ja
- https://phabricator.wikimedia.org/T117997
Footnotes
[1] "informal" might be a misnomer. I noticed that discussions between users in Japanese tend to contain writing styles marked with "high politeness" more often than encyclopedic pages, academic articles, etc. do. I imagine the same can be said, to some extent, of English: you don't normally write "I would be delighted if you __" or "could you please __" in an encyclopedic article.