Environment: I think we will do the coding on Linux. I use a Mac, but I can just work from a Vagrant machine or from an EC2 instance.
robots-txt-scanner takes in a robots.txt file and outputs a list of (token, lexeme) tuples, for example:
```python
print(scanner.scan("User-agent: Google\nDisallow: *"))
# (('\\USER_AGENT_VALUE/', 'User-agent: Google'), ('\\DISALLOW_VALUE/', 'Disallow: *'))
```
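To make the (token, lexeme) idea concrete, here is a minimal sketch of a scanner of this shape. It is my own illustration, not the library's actual code; only the token strings are taken from the example output above.

```python
import re

# Minimal illustration of a (token, lexeme) scanner; not the library's actual code.
RULES = [
    ('\\USER_AGENT_VALUE/', re.compile(r'^User-agent:\s*\S+$', re.IGNORECASE)),
    ('\\DISALLOW_VALUE/',   re.compile(r'^Disallow:\s*\S+$',   re.IGNORECASE)),
]

def scan(text):
    tokens = []
    for line in text.splitlines():
        for token, pattern in RULES:
            if pattern.match(line):
                tokens.append((token, line))
                break
    return tuple(tokens)

print(scan("User-agent: Google\nDisallow: *"))
# (('\\USER_AGENT_VALUE/', 'User-agent: Google'), ('\\DISALLOW_VALUE/', 'Disallow: *'))
```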
Features:
- Add support for understanding `Disallow:` (empty), `Disallow: /admin?`, and `Disallow: /login/` (see https://google.com/robots.txt). A possible value pattern is sketched after this list.
- Rules can be applied to specific or ALL (`*`) robots, so we can create different robot objects, each holding its own rules (e.g. `google.disallow == ['/admin']`). This means we need to get the actual value out of the scanner.
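For Feature 1, I think the work is mostly in the value pattern. A rough sketch of what I have in mind (the pattern name is made up, and the real scanner's regex may look different):

```python
import re

# Hypothetical pattern: accept an empty value, '/admin?' style values, and plain paths.
DISALLOW_LINE = re.compile(r'^Disallow:\s*(?P<value>\S*)\s*$', re.IGNORECASE)

for line in ["Disallow:", "Disallow: /admin?", "Disallow: /login/"]:
    value = DISALLOW_LINE.match(line).group('value')
    print(repr(value))  # '' then '/admin?' then '/login/'
```

For Feature 2, the usage I have in mind looks roughly like this: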
```python
from robot_scanner import RobotTxt

robots_txt = RobotTxt("robots.txt")
for ua in robots_txt:
    print(ua)

print(hasattr(robots_txt.ua, 'google'))  # case-insensitive
print(robots_txt.ua.google.disallow)     # case-insensitive
```
^ Actually, I am not quite sure of the best way to construct this API. Since `google` is not always in every robots.txt, how can I make this API more user-friendly? Or am I worrying too much, since anyone who wants to iterate over user agents without knowing what is in the file will probably end up using the iterator version anyway. Would be a nice discussion during pair programming, I guess.
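One idea I might float during the session (purely a sketch, nothing implemented): drop the attribute chain (`robots_txt.ua.google`) in favour of dict-style lookups with an explicit default, so a missing agent never raises. The constructor input and method names below are made up for illustration:

```python
class RobotTxt:
    """Hypothetical friendlier API; the constructor input here is made up."""

    def __init__(self, rules_by_agent):
        # e.g. {'Google': ['/admin'], '*': ['/login/']}
        self._rules = {ua.lower(): rules for ua, rules in rules_by_agent.items()}

    def __iter__(self):
        # iterate over user agents without needing to know what is in the file
        return iter(self._rules)

    def __contains__(self, ua):
        return ua.lower() in self._rules        # case-insensitive membership test

    def get(self, ua, default=None):
        # explicit lookup with a default, so a missing agent never raises
        return self._rules.get(ua.lower(), default)


robots_txt = RobotTxt({'Google': ['/admin'], '*': ['/login/']})
print('google' in robots_txt)                        # True
print(robots_txt.get('google'))                      # ['/admin']
print(robots_txt.get('bing', robots_txt.get('*')))   # ['/login/'] - fall back to the wildcard rules
```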
Bugs:
- Even the starting example has a bug!

```python
scanner.scan("User-agent: Google\nDisallow: *")
# (('\\USER_AGENT_VALUE/', 'User-agent: Google'), ('\\DISALLOW_VALUE/', 'Disallow: '))
```

I would expect the `*` in the output, but it is not there.
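I have not traced the cause yet, but my guess is that the value pattern simply does not allow `*`. The patterns below are made up; they only illustrate how a too-narrow character class would reproduce the symptom, and how widening it would fix it:

```python
import re

line = "Disallow: *"

# A value class limited to path-like characters drops the '*', like the buggy output:
narrow = re.compile(r'^Disallow:\s*[\w/.?-]*', re.IGNORECASE)
print(repr(narrow.match(line).group(0)))  # 'Disallow: '

# Taking any non-whitespace run keeps it:
wide = re.compile(r'^Disallow:\s*\S*', re.IGNORECASE)
print(repr(wide.match(line).group(0)))    # 'Disallow: *'
```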
I think Feature 1 and Bug 1 are fairly quick, and Feature 2 is also doable. Even if we don't finish implementing Feature 2, I think some discussion around it would be useful, but I am optimistic about the pace.