Did an Internet search for a list of common English words. Here we go.
Link: https://eslforums.com/list-of-words/
List of 700+ Most Common English Words Everyone Should Learn!
Sure. Let's extract the 5, 6, and 7-letter length groups.
$ curl https://eslforums.com/list-of-words/ > eslforums.html
$ grep "^<li>" eslforums.html | cut -c 5-10 | grep "<" | grep -v "/" | cut -d "<" -f 1 | grep -v "^$" > eslforums-5.txt
$ grep "^<li>" eslforums.html | cut -c 5-11 | grep "<" | grep -v "/" | cut -d "<" -f 1 | grep -v "^$" > eslforums-6.txt
$ grep "^<li>" eslforums.html | cut -c 5-12 | grep "<" | grep -v "/" | cut -d "<" -f 1 | grep -v "^$" > eslforums-7.txt
I happened to notice that there were some place names (like "Sweden") in the eslforums
files, so I removed them manually.
That gives us word lists of the following lengths:
$ for i in {5..7}; do; wc -l eslforums-$i.txt; done
131 eslforums-5.txt
117 eslforums-6.txt
82 eslforums-7.txt
Which is a good start but clearly we need to go deeper.
Link: https://github.com/dolph/dictionary
What's this? A bunch of text files? Now we're talking!
The repo's popular.txt is 25,322 words that everyone should be familiar with.
Sounds good. Let's break that into the 5, 6, and 7-letter length groups:
$ grep -o '\<.\{5\}\>' popular.txt > popular-5.txt
$ grep -o '\<.\{6\}\>' popular.txt > popular-6.txt
$ grep -o '\<.\{7\}\>' popular.txt > popular-7.txt
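The \< and \> are word-boundary anchors (not available in every grep), so each pattern pulls out only the words of exactly that length. On a grep without those escapes, an awk filter on line length would be an equivalent sketch, assuming popular.txt really is one word per line:
$ for i in {5..7}; do awk -v n=$i 'length($0) == n' popular.txt > popular-$i.txt; done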
That gives us word lists of the following lengths:
$ for i in {5..7}; do; wc -l popular-$i.txt; done
3088 popular-5.txt
4080 popular-6.txt
4269 popular-7.txt
Aww yiss, that's the stuff.
These two groups of words are likely to overlap, but the amount is unknown, and we wouldn't want to leave good words behind. So let's get these sorted.
First, how many words—including potential duplicates—are in each length list?
$ for i in {5..7}; do; cat popular-$i.txt eslforums-$i.txt | wc -l; done
3219
4197
4351
Not bad, but we need to be careful here...
We want unique entries only, so in principle a straight string comparison is the way to go; however, the eslforums entries all start with CAPITAL letters, so we need to deal with that first.
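Just to gauge the problem before fixing it, a case-insensitive duplicate count shows how much real overlap is hiding behind those capitals; a sketch relying on GNU sort's -f and uniq's -d/-i flags:
$ for i in {5..7}; do cat popular-$i.txt eslforums-$i.txt | sort -f | uniq -di | wc -l; done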
There are any number of ways to fix the capitalization itself. Let's use awk, because why not?
$ for i in {5..7}; do; cp eslforums-$i.txt f; awk '{ print tolower($0) }' f > eslforums-$i.txt; done; rm f
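tr is the more traditional tool for this kind of case folding; an equivalent sketch (the .lc suffix is just a throwaway name for the intermediate file):
$ for i in {5..7}; do tr '[:upper:]' '[:lower:]' < eslforums-$i.txt > eslforums-$i.lc && mv eslforums-$i.lc eslforums-$i.txt; done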
Super, let's extract only the unique entries and compare:
$ for i in {5..7}; do; cat popular-$i.txt eslforums-$i.txt | sort | uniq | wc -l; done
3090
4086
4273
Ultimately, for all of that work that eslforums made us do, it only netted us a handful of extra words beyond the Dolph lists: 2 five-letter, 6 six-letter, and 4 seven-letter words.
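If you're curious which words those are, comm will list exactly what the eslforums side contributed; a sketch (comm needs sorted input, and the <() process substitution works in bash or zsh):
$ for i in {5..7}; do comm -13 <(sort popular-$i.txt) <(sort eslforums-$i.txt); done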
Now we can merge the two:
$ for i in {5..7}; do; cat popular-$i.txt eslforums-$i.txt | sort | uniq > wordlist-$i.txt; done
Then merge those into a single file for Wordlem:
$ cat wordlist-5.txt wordlist-6.txt wordlist-7.txt > en.txt
Which gives us over 11,000 words to play with, total:
$ wc -l en.txt
11449 en.txt
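As a final sanity check, the three per-length counts do add up to that total (3090 + 4086 + 4273 = 11449), and since each per-length list was already deduplicated and words of different lengths can never collide, a duplicate scan of the merged file should come back empty; a sketch:
$ sort en.txt | uniq -d | wc -l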