My attempt to find a good starter guess for Wordle
{
"cells": [
{
"cell_type": "markdown",
"id": "9fb33443",
"metadata": {},
"source": [
"# What is the best Wordle starter word?\n", | |
"\n", | |
"> By [Eddie Antonio Santos](https://eddieantonio.ca/) ([@\\_eddieantonio](https://twitter.com/_eddieantonio))\n", | |
"\n", | |
"Since about the second time I played [Wordle][], I have been using **THEIR** as my first guess every day. My reasoning is that it has some pretty [high frequency English letters][letter frequencies], namely, \"e\", \"r\", \"i\", and \"t\". Plus it starts with \"th\", which — at least from a glance — seems like a very English-y way to start a word.\n", | |
"\n", | |
"But then the Twittersphere settled on a different opener:\n", | |
"\n", | |
"> Favorite Wordle starter word, go! (Mine is “farts” or “farty.” Started because I’ll always be the 12-year-old version of me, and, duh, it’s hilarious, but continues because it’s been weirdly effective.)\n", | |
">\n", | |
"> — [@AmyBartner](https://twitter.com/AmyBartner/status/1483079705095987204)\n", | |
"\n", | |
"**FARTS**.\n", | |
"\n", | |
"**FARTS** also has some good, high-frequency letters: \"a\", \"r\", \"t\", and \"s\". \"f\", not so much. Also, it would really be much better with an \"e\". Or at least, that's what I assume.\n", | |
"\n", | |
"But then I got thinking... What if I can find a good Wordle starter word with a tiny bit of data science?\n", | |
"\n", | |
"\n", | |
"[Wordle]: https://www.powerlanguage.co.uk/wordle/\n", | |
"[letter frequencies]: https://www3.nd.edu/~busiforc/handouts/cryptography/letterfrequencies.html" | |
]
},
{
"cell_type": "markdown",
"id": "23e4f6b0",
"metadata": {},
"source": [
"Let's begin! 😄"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "1b479b44",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "c7fd8285",
"metadata": {},
"source": [
"Let's ignore Python for a second, since getting a list of 5-letter English words can be accomplished entirely on the \\*nix command line.\n",
"\n",
"I started by downloading [a list of English words mined from Wikipedia][wordlist]:\n",
"\n",
"\n",
"```sh\n",
"curl -L 'https://github.com/IlyaSemenov/wikipedia-word-frequency/blob/c7b079e2e46ce735812b901a52c27681ab20550d/results/enwiki-20190320-words-frequency.txt?raw=true' -o wordlist.txt\n",
"```\n",
"\n",
"...and filtering for only five-letter words containing the basic English alphabet with no accents:\n",
"\n",
"\n",
"```sh\n",
"awk '$1 ~ /^[a-z]{5}$/' wordlist.txt > wordle.txt\n",
"```\n",
"\n",
"[wordlist]: https://github.com/IlyaSemenov/wikipedia-word-frequency/\n",
"\n",
"We can then load this with Pandas's absolutely bomb-diggity `read_csv()` function:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b23a6f9c",
"metadata": {},
"outputs": [],
"source": [
"wordle = pd.read_csv(\"wordle.txt\", header=None, delim_whitespace=True)" | |
] | |
}, | |
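{
"cell_type": "markdown",
"id": "f00dbead",
"metadata": {},
"source": [
"(Aside: roughly the same filtering step could be done in pandas instead of `awk`. The cell below is just a sketch that assumes `wordlist.txt` has already been downloaded; it was not run here.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f00dc0de",
"metadata": {},
"outputs": [],
"source": [
"# Sketch of the awk one-liner in pandas: keep only rows whose word is\n",
"# exactly five lowercase ASCII letters, then write them back out.\n",
"raw = pd.read_csv(\"wordlist.txt\", header=None, delim_whitespace=True)\n",
"five_letter = raw[raw[0].str.fullmatch(r\"[a-z]{5}\", na=False)]\n",
"five_letter.to_csv(\"wordle.txt\", sep=\" \", header=False, index=False)"
]
},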
{
"cell_type": "markdown",
"id": "052cf87a",
"metadata": {},
"source": [
"Next, let's create the canonical `DataFrame` called `df` with columns of the data types we want.\n",
"\n",
"The list downloaded from Wikipedia is sorted in descending frequency, so I arbitrarily chose to keep the top 10,000 most frequent words, because the words after around 10,000 get pretty funky."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "631e8cdf",
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(\n", | |
" {\n", | |
" \"Word\": pd.Categorical(wordle[0]),\n", | |
" \"Count\": wordle[1]\n", | |
" }\n", | |
")\n", | |
"df = df.head(10_000)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "7d954c9b", | |
"metadata": {}, | |
"source": [ | |
"Sadly, the top 10,000 words does not include **FARTS** or **SOARE** (from [somebody else][Kubaryk] who had the same idea, but used a different method to find the optimal word), so let's append those to the end of our dataframe:\n", | |
"\n", | |
"[Kubaryk]: https://www.theringer.com/2022/1/7/22870249/what-to-do-when-playing-the-word-game-wordle-isnt-enough-solve-it" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"id": "9c98b2c9", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df = df.append([{\"Word\": \"farts\", \"Count\": 203}, {\"Word\": \"soare\", \"Count\": 115}])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "a70d6653", | |
"metadata": {}, | |
"source": [ | |
"This is what our data looks like:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"id": "fa69437a", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Word</th>\n", | |
" <th>Count</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>which</td>\n", | |
" <td>6412646</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>first</td>\n", | |
" <td>4840311</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>their</td>\n", | |
" <td>4339413</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>after</td>\n", | |
" <td>4204053</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>other</td>\n", | |
" <td>3004434</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Word Count\n", | |
"0 which 6412646\n", | |
"1 first 4840311\n", | |
"2 their 4339413\n", | |
"3 after 4204053\n", | |
"4 other 3004434" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.head()" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "eb3ab1b5",
"metadata": {},
"source": [
"Now that we have a decent list of words, how do we go about selecting a good one?\n",
"\n",
"## My idea\n",
"\n",
"The idea is that we want words that **maximize the probability** that **a letter at any given position** is a likely letter in the day's puzzle.\n",
"\n",
"I wrote \"maximize the probability\" in that last sentence as if I know anything about statistics, which unfortunately I do not. What I do know, however, is how to:\n",
"\n",
" 1. count things\n",
" 2. rank things\n",
"\n",
"## Counting things\n",
"\n",
"My initial idea on how to get a good word is based on one key insight: **uniform letter frequency**—that is, how frequently a letter appears _anywhere_ in a word—**is too restrictive** for the task of finding a good five-letter word.\n",
"\n",
"Instead, let's count letter frequencies **for each of the five positions independently**.\n",
"\n",
"Time to use an absolute gem in the Python standard library, [`collections.Counter`][Counter]:\n",
"\n",
"[Counter]: https://docs.python.org/3/library/collections.html#collections.Counter"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "362cbd0f",
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n", | |
"\n", | |
"letter_counters = [Counter() for _ in range(5)]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "79cc468c", | |
"metadata": {}, | |
"source": [ | |
"That created five empty counter corresponding to each of the five letters of any given Wordle puzzle.\n", | |
"\n", | |
"Let's count all the letters from the wordlist:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"id": "ea224c92", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"for word in df[\"Word\"]:\n", | |
" for i, letter in enumerate(word):\n", | |
" letter_counters[i][letter] += 1" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "c58c1fa6", | |
"metadata": {}, | |
"source": [ | |
"Now we can ask questions such as: Given a five letter word, **what are the top 5 most common letters** that occur at the start of a word?" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 8,
"id": "eb42b578",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('s', 1075), ('b', 727), ('m', 676), ('a', 675), ('c', 663)]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"letter_counters[0].most_common(5)"
]
},
{
"cell_type": "markdown",
"id": "da1d7f73",
"metadata": {},
"source": [
"...how about the top 5 second letters in a word?"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "06d6bb47",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('a', 2271), ('o', 1397), ('e', 1245), ('i', 1031), ('u', 795)]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"letter_counters[1].most_common(5)"
]
},
{
"cell_type": "markdown",
"id": "36a3c7a4",
"metadata": {},
"source": [
"...middle letter?\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "b0a467eb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('r', 1031), ('a', 963), ('n', 901), ('i', 800), ('l', 755)]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"letter_counters[2].most_common(5)"
]
},
{
"cell_type": "markdown",
"id": "590b4cf9",
"metadata": {},
"source": [
"And so on..."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "5d876894",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('e', 1413), ('a', 1075), ('i', 830), ('n', 738), ('t', 682)]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"letter_counters[3].most_common(5)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "d1e45577",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('s', 1605), ('a', 1248), ('e', 1234), ('n', 844), ('y', 705)]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"letter_counters[4].most_common(5)"
]
},
{
"cell_type": "markdown",
"id": "ace93056",
"metadata": {},
"source": [
"Eyeballing this data, it looks like **the distribution is different for each letter position**. I won't go into English phonotactics and orthography in this notebook, but this result makes sense from a linguistic standpoint."
]
},
{
"cell_type": "markdown",
"id": "6f8ab02a",
"metadata": {},
"source": [
"Now that we have letter frequencies, we can **rank each letter** for each position in a word."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "d4252446",
"metadata": {},
"outputs": [],
"source": [
"letter_ranks = [{letter: i + 1 for i, (letter, _) in enumerate(letters.most_common())}\n", | |
" for letters in letter_counters]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "6e99600c", | |
"metadata": {}, | |
"source": [ | |
"This gives a data structure that allows us to ask, \"what is rank of the letter 't' at the beginning of a word\"?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"id": "9a1a3303", | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"6" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"letter_ranks[0][\"t\"]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "2502ed9b", | |
"metadata": {}, | |
"source": [ | |
"So \"t\" is the **sixth most common** letter that can start a word.\n", | |
"\n", | |
"Another example, what rank does \"s\" have at the end of a word? Intuitively, \"s\" appears in most English plurals, so it should be ranked pretty high, right? It might even be number 1!" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"id": "4c7c27ee", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"1" | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"letter_ranks[-1][\"s\"]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "8a776c76", | |
"metadata": {}, | |
"source": [ | |
"And indeed it is! So maybe a good Wordle starter is a plural word? Maybe a word like **FARTS**?" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "6eff55db", | |
"metadata": {}, | |
"source": [ | |
"## Ranking things\n", | |
"\n", | |
"That brings me to my second idea: we can use the metric of [**mean-recipricoal rank**][MRR] (MRR) to **evaluate each word** by finding how well the _overall_ rank of the letters in any given word do.\n", | |
"\n", | |
"\n", | |
"A **good word** will have an MRR of close to **1.0** — that would mean all of its letters are the top ranked in each position. Looking at the letter frequencies above, that word would be **SARES**, which to my knowledge, is not an English word!\n", | |
"\n", | |
"The **worst word** possible would have an MRR **aproaching zero** if it uses the very bottom ranked letter in each position.\n", | |
"\n", | |
"We'll create a new column `MRR` that contains the MRR of the each word in the wordlist:\n", | |
"\n", | |
"[MRR]: https://en.wikipedia.org/wiki/Mean_reciprocal_rank" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"id": "64db9b91", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from statistics import mean\n", | |
"\n", | |
"def mrr_words(row):\n", | |
" return mean([1 / letter_ranks[i][letter] for i, letter in enumerate(row[\"Word\"])])\n", | |
" \n", | |
"df[\"MRR\"] = df.apply(mrr_words, axis=1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "18168aab", | |
"metadata": {}, | |
"source": [ | |
"Let's look at our top ranked words!" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 17,
"id": "5ae0ce13",
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Word</th>\n",
"      <th>Count</th>\n",
"      <th>MRR</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>5914</th>\n",
"      <td>sores</td>\n",
"      <td>1031</td>\n",
"      <td>0.900000</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>7392</th>\n",
"      <td>saris</td>\n",
"      <td>712</td>\n",
"      <td>0.866667</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>7442</th>\n",
"      <td>saree</td>\n",
"      <td>705</td>\n",
"      <td>0.866667</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3051</th>\n",
"      <td>mares</td>\n",
"      <td>3408</td>\n",
"      <td>0.866667</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>6041</th>\n",
"      <td>sires</td>\n",
"      <td>994</td>\n",
"      <td>0.850000</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"       Word  Count       MRR\n",
"5914  sores   1031  0.900000\n",
"7392  saris    712  0.866667\n",
"7442  saree    705  0.866667\n",
"3051  mares   3408  0.866667\n",
"6041  sires    994  0.850000"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort_values(by=\"MRR\", ascending=False).head()"
]
},
{
"cell_type": "markdown",
"id": "d4ca2261",
"metadata": {},
"source": [
"Gross! The top-ranked word is **SORES**. Yuck! 🤮\n",
"\n",
"But hold on: **SORES** contains the letter \"S\" twice. Repeated letters in the first guess of a Wordle puzzle are a waste! You could be using that slot to test yet another letter!"
]
},
{
"cell_type": "markdown",
"id": "0d52bf7f",
"metadata": {},
"source": [
"## We can do better!\n",
"\n",
"We'll throw out all words that have **duplicated letters**.\n",
"\n",
"Let's create a function that returns `True` if a row's word has one or more duplicated letters:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "734320cc",
"metadata": {},
"outputs": [],
"source": [
"def has_duplicate_letters(row):\n", | |
" word = row[\"Word\"]\n", | |
" unique_letters = set(word)\n", | |
" return len(word) > len(unique_letters)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "107e4b84", | |
"metadata": {}, | |
"source": [ | |
"We create a new, filtered dataframe (creatively called `df2`) that lacks words with duplicated letters:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"id": "f1236751", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df2 = df.loc[df.apply(lambda row: has_duplicate_letters(row) == False, axis=1)]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "fd67ba2a", | |
"metadata": {}, | |
"source": [ | |
"Let's sort _this_ dataframe (and reset its indexing while we're at it!):" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"id": "cae8fe45", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df2 = df2.sort_values(by=\"MRR\", ascending=False).reset_index(drop=True)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "646a67da", | |
"metadata": {}, | |
"source": [ | |
"And let's look at our top results!" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 21,
"id": "2ded2c20",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Word</th>\n",
"      <th>Count</th>\n",
"      <th>MRR</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>mares</td>\n",
"      <td>3408</td>\n",
"      <td>0.866667</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>cares</td>\n",
"      <td>6288</td>\n",
"      <td>0.840000</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>pares</td>\n",
"      <td>622</td>\n",
"      <td>0.828571</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>lares</td>\n",
"      <td>899</td>\n",
"      <td>0.825000</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>dares</td>\n",
"      <td>1434</td>\n",
"      <td>0.822222</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"    Word  Count       MRR\n",
"0  mares   3408  0.866667\n",
"1  cares   6288  0.840000\n",
"2  pares    622  0.828571\n",
"3  lares    899  0.825000\n",
"4  dares   1434  0.822222"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.head()"
]
},
{
"cell_type": "markdown",
"id": "6db1c92c",
"metadata": {},
"source": [
"**MARES**\n",
"\n",
"🐴🐴🐴\n",
"\n",
"**MARES** is our top Wordle starter word—at least according to letter frequency.\n",
"\n",
"Other good first words are **CARES**, **PARES**, **LARES** (is that even a word?!?), and **DARES** (whew!).\n",
"\n",
"Notably, [others][Miller] have also found that **LARES** and **CARES** are top words.\n",
"\n",
"[Miller]: https://www.theringer.com/2022/1/7/22870249/what-to-do-when-playing-the-word-game-wordle-isnt-enough-solve-it"
]
},
{
"cell_type": "markdown",
"id": "204c9369",
"metadata": {},
"source": [
"## Appendix: How good are the other guesses?\n",
"\n",
"How does **THEIR** hold up in my ranking? Does **FARTS** cut the mustard? What about that **SOARE**, found by brute force? **FIRST** is pretty high in the regular wordlist, but is it a good starter word?"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "78605a68",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Word</th>\n",
"      <th>Count</th>\n",
"      <th>MRR</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>90</th>\n",
"      <td>farts</td>\n",
"      <td>203</td>\n",
"      <td>0.655385</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>623</th>\n",
"      <td>soare</td>\n",
"      <td>115</td>\n",
"      <td>0.500000</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3007</th>\n",
"      <td>first</td>\n",
"      <td>4840311</td>\n",
"      <td>0.309829</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>5048</th>\n",
"      <td>their</td>\n",
"      <td>4339413</td>\n",
"      <td>0.186905</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"       Word    Count       MRR\n",
"90    farts      203  0.655385\n",
"623   soare      115  0.500000\n",
"3007  first  4840311  0.309829\n",
"5048  their  4339413  0.186905"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.loc[df2[\"Word\"].isin([\"their\", \"farts\", \"soare\", \"first\"])]"
]
},
{
"cell_type": "markdown",
"id": "6cd58c50",
"metadata": {},
"source": [
"Well, how about that? **FARTS** is the winner in this race! **SOARE** does not do well by my metric, but it's the second best in this particular group. That said, **SOARE** was chosen experimentally and my ranking is based on a metric I pulled out of my butt. As for my old standby **THEIR**? Well, I should have used **FARTS** all along..."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
} |