Last active
March 18, 2018 20:51
-
-
Save codefisher/c33da67b6c7f84398cc1 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from collections import Counter | |
boys = ["mark", "john", "mark", "fred", "paul", "john"] | |
girls = ["mary", "joan", "joan", "emma", "mary"] | |
b = Counter(boys) | |
g = Counter(girls) | |
c = b + g | |
#result: c = Counter({'mark': 2, 'joan': 2, 'john': 2, 'mary': 2, 'fred': 1, 'paul': 1, 'emma': 1}) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
names = ['mark', 'henry', 'matthew', 'paul', | |
'luke', 'robert', 'joseph', 'carl', 'michael'] | |
# len is our 'key' function here | |
d = {k: [i for x, i in v] for k, v in itertools.groupby(sorted((len(x), x) for x in names), key=operator.itemgetter(0))} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from collections import defaultdict | |
d = defaultdict(int) | |
for name in names: | |
key = len(name) | |
d[key] += 1 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from collections import defaultdict | |
names = ["mark", "john", "mark", "fred", "paul", "john"] | |
d = defaultdict(int) | |
for name in names: | |
d[name] += 1 | |
#result: d = {'mark': 2, 'john': 2, 'fred': 1, 'paul': 1} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from collections import Counter | |
names = ["mark", "john", "mark", "fred", "paul", "john"] | |
d = Counter(names) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from collections import Counter | |
names = ['mark', 'henry', 'matthew', 'paul', | |
'luke', 'robert', 'joseph', 'carl', 'michael'] | |
d = Counter(len(name) for name in names) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from collections import defaultdict | |
d = defaultdict(list) | |
for name in names: | |
key = len(name) | |
d[key].append(name) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Python How To: Group and Count with Dictionaries | |
<p> | |
In this post I want to have a look at the different possible solutions to two rather simple and closely related problems. How to group objects into a dictionary, or count them. It something that at least I have found I do every now an then in various forms. I want to start from the simplest good solution, and work towards better and faster solutions. | |
</p> | |
<h2>Grouping</h2> | |
<p> | |
So the problem that we want to solve is that we have some items that we want to group according to some criteria. So in python that is going to be turning a list or some other iterable into a dictionary of lists. We are also going to want some function to create the value we group by, and for the sake of simplicity we will use <code>len()</code> since that is the simplest built in function that gives a key we might group on. In your own code, you would replace that with some logic that gives you the key or tag that you want to group things by. | |
</p> | |
<p> | |
Or to put that all in pseudo code (also working python), that would be the following: | |
</p> | |
<code codelang="python"> | |
names = ['mark', 'henry', 'matthew', 'paul', | |
'luke', 'robert', 'joseph', 'carl', 'michael'] | |
d = {} | |
for name in names: | |
key = len(name) | |
if key not in d: | |
d[key] = [] | |
d[key].append(name) | |
# result: d = {4: ['mark', 'paul', 'luke', 'carl'], 5: ['henry'], 6: ['robert', 'joseph'], 7: ['matthew', 'michael']} | |
</code> | |
<p> | |
So that is a nice start, loop over the data, create a key value for each. Then we do a check to see if it already there or not. This here is the best way to check if a key is in a dictionary. The alternatives is either doing a <code>try except</code> around a key look up, which is really ugly and slow. Or checking the return value of <code>d.get(key)</code>, but that prevents putting <code>None</code> values in the dictionary. | |
</p> | |
<p> | |
There is a down side to this, and that is there the key has to be hashed two or three times (python dictionaries are internally a kind of hash map, that gives them their almost linear lookup time). First in the if statement, and a possible second in the assignment to a empty <code>list()</code> and finally in the lookup for the append. That python has to hash the value has an overhead. So how might we do better? The following is one possible better solution. | |
</p> | |
<code codelang="python"> | |
d = {} | |
for name in names: | |
key = len(name) | |
d.setdefault(key, []).append(name) | |
</code> | |
<p> | |
This uses the <code>setdefault()</code> method. This is a function, that even the developers of Python admit freely is confusingly named. The problem is any descriptive alternatives look like <code>do_a_get_lookup_but_if_not_found_assign_the_second_argument()</code>. So more or less, the same code as we wrote ourselves before, but since it is done by the dictionary itself, the key is only hashed once. It will be faster when we have lots of values. | |
</p> | |
</p> | |
This is still not the best code that we could do, there is a nicer way. It involves using a data structure called <code>defaultdict</code> that lives in the <code>collections</code> module. If you have not checked out the collections module recently, I recommend you <a href="https://docs.python.org/3/library/collections.html">read its docs</a>, there are a number of very useful utilities in it. With that aside, <code>defaultdict</code> lets us create a dictionary like object that is different only in that if a lookup fails, it uses the argument passed to it during creation (in our case <code>list</code>) to fill that key in. It lets us now write code like this: | |
</p> | |
<code codelang="python"> | |
from collections import defaultdict | |
d = defaultdict(list) | |
for name in names: | |
key = len(name) | |
d[key].append(name) | |
</code> | |
<p> | |
So now we can just look up the key and append to it, not worrying about if it exists or not. If it does not, the defaultdict will create the value for us. | |
</p> | |
<h2>Counting</h2> | |
<p> | |
Now we have mastered grouping, counting should be simple. We just have to know that <code>int()</code> when called returns the value <code>0</code>, so that can be passed to <code>defaultdict</code>. So here we have: | |
</p> | |
<code codelang="python"> | |
from collections import defaultdict | |
d = defaultdict(int) | |
for name in names: | |
key = len(name) | |
d[key] += 1 | |
</code> | |
<p> | |
Here a common use case is not even to use a key, but to count just the number of times something appears. In that case, we could do the following simplified version. | |
</p> | |
<code codelang="python"> | |
from collections import defaultdict | |
names = ["mark", "john", "mark", "fred", "paul", "john"] | |
d = defaultdict(int) | |
for name in names: | |
d[name] += 1 | |
#result: d = {'mark': 2, 'john': 2, 'fred': 1, 'paul': 1} | |
</code> | |
<p> | |
This was considered common enough that there is even a built in way to do this using <code>Counter</code>, so the above can be reduced to this. | |
</p> | |
<code codelang="python"> | |
from collections import Counter | |
names = ["mark", "john", "mark", "fred", "paul", "john"] | |
d = Counter(names) | |
</code> | |
<p> | |
Counter comes with some nice little extras, such as being able to add, or subtract results. So we could do something like the following. | |
</p> | |
<code codelang="python"> | |
from collections import Counter | |
boys = ["mark", "john", "mark", "fred", "paul", "john"] | |
girls = ["mary", "joan", "joan", "emma", "mary"] | |
b = Counter(boys) | |
g = Counter(girls) | |
c = b + g | |
#result: c = Counter({'mark': 2, 'joan': 2, 'john': 2, 'mary': 2, 'fred': 1, 'paul': 1, 'emma': 1}) | |
</code> | |
<p> | |
But what happens if you want to use <code>Counter</code> but need to pass the result though some key function first? How would you do it? The solution would be to put a generator inside of it like the following. | |
</p> | |
<code codelang="python"> | |
from collections import Counter | |
names = ['mark', 'henry', 'matthew', 'paul', | |
'luke', 'robert', 'joseph', 'carl', 'michael'] | |
d = Counter(len(name) for name in names) | |
</code> | |
<h2>Useful key functions</h2> | |
<p> | |
Some possible common cases when grouping or counting, is you might want to do so based on some item in or attribute of the items you are grouping. So for examples, your data might be a tuple of first and last names, or a dictionaries with first and last name keys, or a class with first and last name attributes. If that is what you group or count by, there are two built in functions that can help do this, without needing to write our own functions. Those are <code>itemgetter()</code> and <code>attrgetter</code> from the <code>operator</code> module. Some examples might help. | |
</p> | |
<code codelang="python"> | |
from collections import defaultdict | |
from operator import itemgetter, attrgetter | |
# if names used tuples | |
names = [('mary', 'smith'), ('mark', 'davis')] | |
# the last_name function would look like | |
last_name = itemgetter(1) | |
# if names are dictionaries | |
names = [{'first': 'mary', 'last':'smith'), ('first': 'mark', 'last': 'davis')] | |
# the last_name function would look like | |
last_name = itemgetter('last') | |
# if names are classes with first and last as attributes | |
names = [Person('mary', 'smith'), Person('mark', 'davis')] | |
# the last_name function would look like | |
last_name = attrgetter('last') | |
d = defaultdict(list) | |
for name in names: | |
key = last_name(name) | |
d[key].append(name) | |
</code> | |
<h2>Bonus</h2> | |
<p> | |
When I was studying Software Engineering I got a job tutoring for the first year programming course, which was in python and had 200-300 students depending on semester (hence the need for tutors to help with questions during practicals). One the challengers some of more curious students used to ask, is how I would do certain things in one line (I ended up doing their whole first in a single 1500 character line). Often really bad code, but also often rather interesting trying to reduce a problem to a single statement. I had a shot at doing it for this, and this was the solution that I came up with in a few minutes. I leave working out how it works as an exercise to the reader. I would never use it in production code. | |
</p> | |
<code codelang="python"> | |
names = ['mark', 'henry', 'matthew', 'paul', | |
'luke', 'robert', 'joseph', 'carl', 'michael'] | |
# len is our 'key' function here | |
d = {k: [i for x, i in v] for k, v in itertools.groupby(sorted((len(x), x) for x in names), key=operator.itemgetter(0))} | |
</code> |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from collections import defaultdict | |
from operator import itemgetter, attrgetter | |
# if names used tuples | |
names = [('mary', 'smith'), ('mark', 'davis')] | |
# the last_name function would look like | |
last_name = itemgetter(1) | |
# if names are dictionaries | |
names = [{'first': 'mary', 'last':'smith'), ('first': 'mark', 'last': 'davis')] | |
# the last_name function would look like | |
last_name = itemgetter('last') | |
# if names are classes with first and last as attributes | |
names = [Person('mary', 'smith'), Person('mark', 'davis')] | |
# the last_name function would look like | |
last_name = attrgetter('last') | |
d = defaultdict(list) | |
for name in names: | |
key = last_name(name) | |
d[key].append(name) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
d = {} | |
for name in names: | |
key = len(name) | |
d.setdefault(key, []).append(name) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
names = ['mark', 'henry', 'matthew', 'paul', | |
'luke', 'robert', 'joseph', 'carl', 'michael'] | |
d = {} | |
for name in names: | |
key = len(name) | |
if key not in d: | |
d[key] = [] | |
d[key].append(name) | |
# result: d = {4: ['mark', 'paul', 'luke', 'carl'], 5: ['henry'], 6: ['robert', 'joseph'], 7: ['matthew', 'michael']} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment