Created
March 12, 2020 16:39
-
-
Save JnyJny/39e6cea5fec4bb523aab297c472bfb5c to your computer and use it in GitHub Desktop.
Comments for PyBites 178
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from dateutil.parser import parse | |
import re | |
import datetime | |
def get_min_max_amount_of_commits( | |
commit_log: str = commits, year: int = None | |
) -> (str, str): | |
""" | |
Calculate the amount of inserts / deletes per month from the | |
provided commit log. | |
Takes optional year arg, if provided only look at lines for | |
that year, if not, use the entire file. | |
Returns a tuple of (least_active_month, most_active_month) | |
""" | |
with open(commits) as commit: | |
commit = commit.readlines() | |
# 1. The variable 'commit' is cobbered here, going from a file object | |
# to list of strings. | |
# | |
# 2. The context manager will hold the commit log file open until | |
# the manager exits, but you are reading all the data out of the | |
# file in one go. | |
data = [i.strip("\n").split("|") for i in commit] | |
# 3. Because we read all the data out of the file in one blob, | |
# we are now forced to perform several interations thru the | |
# rather than interacting with a line of data at a time. | |
# When the size of the file becomes large, this will become | |
# a performance issue that will be difficult to address with | |
# this design. | |
# | |
# 4. Many old-school programmers (C and Fortran especially) will | |
# use single letter variables to denote index variables. It is | |
# tempting even in comprehensions, but I would use a more descriptive | |
# name, e.g substitue 'line' for 'i' in the list comprehension above. | |
date = [parse(i[0], fuzzy=True).strftime("%Y-%m-%d %H:%M:%S") for i in data] | |
# 5. Here we are again traversing 'data' to parse the date out of each | |
# line in the file and have increased the programs memory demands by | |
# keeping all those dates around in memory. | |
# 6. Using indices (0 and 1) to reference the date section and | |
# the payload section of a split line is confusing to the | |
# reader and the programmer who needs to fix a bug. | |
# 7. I don't think it's necesary to use strftime on the returned | |
# datetime.datetime object. | |
changes = [i[1] for i in data] | |
# 7. We've just made a copy of the payload section of all the lines, | |
# increasing our memory footprint again. The semantic content | |
# has improved, 'changes' versus 'i[1]', it comes at a price. | |
change_count = [] | |
for i in changes: | |
temp = re.findall(r"\d+", i) | |
change = list(map(int, temp)) | |
change_count.append(sum(change[1:])) | |
# 8. If ever you have a problem and you say "I could use regex | |
# to solve that" you will soon find that you have two (or | |
# more) problems. The first problem is regular expression | |
# matching is slow. The next problem is crafting a regex that | |
# reliably matches the text you trying to parse out. The | |
# third problem is they can be very sensitive to changes in | |
# the input text. | |
# 9. We are relying on date, changes and change_count lists having | |
# the same number of items. If one of the operations that builds | |
# those lists results in a shorter or longer list, it will make | |
# matching counts to dates more error prone. | |
# 10. While map is part of the language, I would use a list comprehension | |
# instead. | |
commit_dict = dict(zip(date, change_count)) | |
# 11. We've expanded our memory usage again, creating a new dictionary | |
# with the computed date string. | |
key_list = [] | |
for key in commit_dict.keys(): | |
if key[0:7] not in key_list: | |
key_list.append(key[0:7]) | |
# 12. This construct loops thru all the keys in the dictionary | |
# we just made to 'unique'-ify the keys. It's confusiong | |
# and not obvious why would want to do this since we are | |
# using a slice to pull the first seven characters off the | |
# front of the key. We'd have to go back to the strftime | |
# to figure out what the slice refers to. | |
commit_totals = {} | |
for item in key_list: | |
val = sum([commit_dict.get(i) for i in commit_dict.keys() if item in i]) | |
commit_totals[item] = val | |
# 13. We're building another dictionary (memory usage) and | |
# iterating thru the key_list and the list comprehension | |
# is pretty complicated (and another embedded interation | |
# which is a performance concern). While not incorrect to | |
# use the dictionary keys method, iterating on the bare | |
# dictionary returns the keys. | |
if year == None: | |
least_active_month = min(commit_totals, key=commit_totals.get) | |
most_active_month = max(commit_totals, key=commit_totals.get) | |
# 14. While not wrong, using the 'key' keyword argument with | |
# the get method is redundant. Min and Max both do the | |
# right thing for dictionaries and arrays. | |
return (least_active_month, most_active_month) | |
else: | |
year_dict = {k: v for (k, v) in commit_totals.items() if str(year) in k} | |
least_active_month = min(year_dict, key=year_dict.get) | |
most_active_month = max(year_dict, key=year_dict.get) | |
return (least_active_month, most_active_month) | |
# 15. If the user asked for a specific year range, then we | |
# create another dictionary filtered for the requested year. | |
# The comprehension has a performance concern where the | |
# the year is converted to a string for every item in | |
# commit_totals. This is called a 'loop invariant' since | |
# the value doesn't change in the body of the loop. A | |
# better practice would have been to convert year to a string | |
# value and then use that in the comprehension. | |
# 16. In my opinion, the else clause isn't necessary here | |
# since the if clause returns. I like it since it brings | |
# the indention level back to base level of the function | |
# body. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment