
@amontalenti
Last active August 29, 2015 13:57

Hello again, growing Pythonistas!

Pleased and impressed

So, first of all, I have to say that I was simultaneously pleased & impressed with the quality of submissions I got from you guys for our little Python take-home assignment. Pleased, because the problem seemed to be accessible enough that each of you could work on it with little background in Python beyond the bits you've been exposed to over the last few months. Impressed, because all of your submissions passed all of my test cases!

I was also happy with how different the submissions were. We Pythonistas often like to brag that, unlike in other languages, in Python there is "rarely more than one way to do it". However, this was a simple example, yet the solutions provided varied widely. (And indeed, in programming, no matter how simple the language, there's always more than one way to do it.) This therefore gives us a nice window into various understandings of Python code style, architecture, and idioms.

Patterns among code

Some patterns that seemed to apply to all the submissions regardless of who wrote them:

  • no private/helper functions used to make the code read more easily

  • no helpful labels (named variables or constants) for important values

  • no comments beyond the occasional edge case or point of confusion

  • magic strings & magic numbers used too liberally, sometimes without explanation (e.g. items[2] or "://")

  • extra-long, not-very-readable code lines

  • code repetition (violating the "DRY" principle)

  • no new test cases written to validate assumptions
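To make a couple of these concrete, here is a hypothetical before/after sketch (get_host is an invented example, not taken from any submission): labeling the magic separator and intermediate values makes the intent self-explanatory.

```python
# Before: a magic number and a magic string, unexplained
def get_host_v1(url):
    return url.split("/")[2]

# After: the same logic, with the magic values labeled
SCHEME_SEPARATOR = "://"

def get_host(url):
    # drop the scheme, then keep everything before the first "/"
    _, _, rest = url.partition(SCHEME_SEPARATOR)
    return rest.split("/")[0]
```

Both versions pass the happy-path case, but only the second tells the reader *why* it works.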

Line-by-line comments

However, on the whole, your submissions are easy to correct and improve, and I've provided line-by-line code comments on a GitHub commit linked here:

https://github.com/Parsely/python-adv-slides/commit/9fc5ba261ae370c70b79054303d0a5e10052a28c

I put all the code under test using a shared set of test cases in tests.py.

Also, I've provided my own implementation of url_parse and url_join (am.py). I made sure to write it before looking over your submissions, so that I didn't bias my view of what to improve.

How my code compares

From an at-a-glance look-and-feel standpoint, here are some differences I notice about mine:

  • It is much flatter. And not just because I follow PEP8's indentation rules: when I happen upon nesting in my code, I tend to break things out into separate helper functions for readability, which flattens the code. This comes from a principle I hold that flatter code is easier to read, and it also comes from the Zen of Python, which says, "Flat is better than nested".

  • It has more function-level documentation. This also adds test cases (in this case, via doctest). Function-level documentation can help clarify the design of a function and its inputs and outputs. In my case, I used doctest format (which has the added benefit that the test cases in your code comments can, themselves, be tested automatically!) and this helped me iron out other bugs.

  • It is a little longer (in terms of source-lines-of-code, or SLOC) -- stripped of comments, we're at 40 SLOC on the low end and 60 SLOC on the high end (mine & Emmett's) for this problem. I am generally a fan of shorter code (obviously, not absurdly shorter, like one-character variable names, but shorter as in "more expressive" / "less repetitive"). Certainly, the above two points make my code longer -- comments & utility function definitions add a little to the line count, but then their re-use re-balances it. I think the other reason it is longer is that I try not to pack too much into single lines of code, I sprinkle stepwise comments in for clarity, and I also break what could be "very long lines" into shorter, hopefully clearer, lines. This is partially a style thing on my part -- even KB/TM/DD don't necessarily follow this style -- but I think it's worth mentioning that labeling things in Python is very cheap, and often has two side benefits: making your code faster and making your code more readable.

  • It avoids repetition like the plague. In fact, the few cases where I use helper functions, it's typically to remove nesting and reduce repetition.
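To illustrate the doctest point concretely, here is a minimal, hypothetical sketch (deliberately simpler than my am.py -- no port, query, or fragment handling) of a function documented this way. Running `python -m doctest yourfile.py` checks the embedded examples automatically.

```python
def url_parse(url):
    """Parse a URL string into its scheme, host, and path.

    >>> parsed = url_parse("http://g.co/search")
    >>> parsed["scheme"]
    'http'
    >>> parsed["host"]
    'g.co'
    >>> parsed["path"]
    '/search'
    """
    scheme, _, rest = url.partition("://")
    host, slash, path = rest.partition("/")
    return {"scheme": scheme, "host": host, "path": slash + path}

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```

The docstring doubles as design documentation (inputs and outputs are spelled out) and as a regression test.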

Architecture overview

One thing we can all agree on -- functions were the right level of abstraction for this task. And the inputs and outputs were pretty clear: a string URL in, parsed URL as dict out; then a parsed URL as dict in, and a string URL out.

Some of you might not even know how to create classes in Python yet, but rest assured, even if you did, they would not be a right fit for this problem. When you have inputs and outputs so well defined and you're dealing with "immutable algorithms" -- that is, algorithms where the inputs solely determine the outputs over all time -- then your best abstraction is the function.

Alternate approaches

Another thing worth pondering is what other data structures might make a better return type than a dictionary for url_parse. Well, actually, we just gave ourselves a hint. Ideally, the return value would be immutable, but we'd still like it to have named properties for things like scheme, host, and port. There is actually a stdlib data structure that does just that -- it's called namedtuple. Defining one looks like this:

from collections import namedtuple
ParsedUrl = namedtuple("ParsedUrl", 
              "scheme host port path query fragment")

And then we could create it like so:

url = ParsedUrl(scheme="http://",
          host="g.co",
          path="/",
          port=80, query=None, fragment=None)
assert url.host == 'g.co'

This data structure would have the added benefit of being more memory-efficient than a dictionary and being immutable (so, nicely usable as the input to other functions with a guarantee that your data won't be messed with).
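A quick demonstration of that immutability guarantee, reusing the ParsedUrl definition from above (_replace is namedtuple's way of building a modified copy instead of mutating in place):

```python
from collections import namedtuple

ParsedUrl = namedtuple("ParsedUrl",
                       "scheme host port path query fragment")

url = ParsedUrl(scheme="http://", host="g.co", path="/",
                port=80, query=None, fragment=None)

# attempting to mutate a namedtuple field raises AttributeError
try:
    url.host = "evil.example"
except AttributeError:
    pass

# to "change" a field, you build a new value with _replace
https_url = url._replace(port=443)
assert url.port == 80          # the original is untouched
assert https_url.port == 443
```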

Doing this "for real"?

One last question you may have is -- how would you go about implementing a parser for the full URL spec if you had to and weren't limited to the built-in libraries I hinted at?

Well, as Emmett hinted at, you'd probably do well to actually read the spec, rather than relying upon my hand-picked test cases. You can check out Emmett's more fleshed out implementation here:

https://github.com/Parsely/python-adv-slides/commit/c8dab2eb5745e6e1c3572b902d0ab9ad3c465c77

(along with my comments on it)

You'd also be wise to see if other languages have already implemented it, and perhaps have written some good, comprehensive test cases you could reuse.

Going deeper on the implementation approach: parsing is actually an entire field of study in Computer Science, because of its importance to areas like Natural Language Processing and programming language design. One option would be to use the venerable Regular Expressions language, provided through Python's re module. In fact, people have implemented efficient RegEx patterns for URLs that work across programming languages. Another idea would be to use a lexing and parsing framework -- for example, PLY or rply in Python land. Interestingly enough, a Python open source developer has also released a library called parsley, which helps with the generation of parsers from "grammar rules" and goes a little deeper into the theory. The reason you might take these alternative approaches when writing a "real parser" is that there are a lot of edge cases -- none of which our submissions covered, but which, for example, a web browser has to be very careful to implement correctly -- and without the help of a better abstraction (e.g. a parser generator), it can be hard to cover them all.
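As a taste of the RegEx option: the URI spec itself (RFC 3986, Appendix B) ships a reference regular expression that splits a URI into its components. A sketch of using it via Python's re module:

```python
import re

# Regular expression taken verbatim from RFC 3986, Appendix B
URI_RE = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

match = URI_RE.match("http://g.co:80/search?q=python#top")
scheme = match.group(2)     # 'http'
authority = match.group(4)  # 'g.co:80'
path = match.group(5)       # '/search'
query = match.group(7)      # 'q=python'
fragment = match.group(9)   # 'top'
```

Note that this only *splits* a URI; validating each component (legal characters, percent-encoding, and so on) is where the real edge-case work lives.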

Finally, it's this level of complexity in a problem as simple as "parsing and formatting a URL" that points to one of Python's greatest virtues: "batteries included". Thanks to the urllib, urlparse, & urllib2 modules, all of this is handled for you by the standard library. And thanks to the open source community, you have libraries like requests, which wrap all of this up in a nice, simple facade that makes it possible for you to just go, "requests.get(my_url)" and get back an HTTP response object to play with. Ah, the joy of abstraction.
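For reference, in Python 3 those modules were consolidated: urlparse became urllib.parse, and urllib2 became urllib.request. A quick sketch of the stdlib doing this assignment's whole job -- both the parse and the round trip back to a string:

```python
from urllib.parse import urlparse, urlunparse

parts = urlparse("http://g.co:80/search?q=python#top")
assert parts.scheme == "http"
assert parts.hostname == "g.co"
assert parts.port == 80
assert parts.path == "/search"

# and urlunparse is the stdlib's url_join: the round trip is lossless
assert urlunparse(parts) == "http://g.co:80/search?q=python#top"
```

Note that urlparse returns a (named) tuple subclass -- the same immutable, field-accessible design we arrived at with namedtuple above.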

@amontalenti (Author)
relevant (private) video for Parse.ly team folks: https://www.youtube.com/watch?v=TaXe5T_Qs5s
