Environment: I think we will do the coding on Linux. I use a Mac, but I can just work from a Vagrant machine or from an EC2 instance.
robots-txt-scanner takes in a robots.txt file and outputs a list of (token, lexeme) tuples, for example:
```python
print(scanner.scan("User-agent: Google\nDisallow: *"))
# (('\\USER_AGENT_VALUE/', 'User-agent: Google'), ('\\DISALLOW_VALUE/', 'Disallow: *'))
```
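To make the (token, lexeme) idea concrete, here is a minimal sketch of a scanner of this shape. It is my own illustration, not the library's actual code; only the token strings are taken from the example output above.

```python
import re

# Minimal illustration of a (token, lexeme) scanner; not the library's actual code.
RULES = [
    ('\\USER_AGENT_VALUE/', re.compile(r'^User-agent:\s*\S+$', re.IGNORECASE)),
    ('\\DISALLOW_VALUE/',   re.compile(r'^Disallow:\s*\S+$',   re.IGNORECASE)),
]

def scan(text):
    tokens = []
    for line in text.splitlines():
        for token, pattern in RULES:
            if pattern.match(line):
                tokens.append((token, line))
                break
    return tuple(tokens)

print(scan("User-agent: Google\nDisallow: *"))
# (('\\USER_AGENT_VALUE/', 'User-agent: Google'), ('\\DISALLOW_VALUE/', 'Disallow: *'))
```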
Features:
- Add support for understanding `Disallow:` (empty), `Disallow: /admin?`, and `Disallow: /login/` (see https://google.com/robots.txt). A possible value pattern is sketched after this list.
- Rules can be applied to specific or ALL (`*`) robots, so we can create different robot objects, each holding its own rules (e.g. `google.disallow == ['/admin']`). This means we need to get the actual value out of the scanner.
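For Feature 1, I think the work is mostly in the value pattern. A rough sketch of what I have in mind (the pattern name is made up, and the real scanner's regex may look different):

```python
import re

# Hypothetical pattern: accept an empty value, '/admin?' style values, and plain paths.
DISALLOW_LINE = re.compile(r'^Disallow:\s*(?P<value>\S*)\s*$', re.IGNORECASE)

for line in ["Disallow:", "Disallow: /admin?", "Disallow: /login/"]:
    value = DISALLOW_LINE.match(line).group('value')
    print(repr(value))  # '' then '/admin?' then '/login/'
```

For Feature 2, the usage I have in mind looks roughly like this: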
```python
from robot_scanner import RobotTxt

robots_txt = RobotTxt("robots.txt")
for ua in robots_txt:
    print(ua)

print(hasattr(robots_txt.ua, 'google'))  # case-insensitive
print(robots_txt.ua.google.disallow)     # case-insensitive
```
^ Actually, I am not quite sure of the best way to construct this API. Since `google` is not always in every robots.txt, how can I make this API more user-friendly? Or am I worrying too much, since anyone who wants to iterate over user agents without knowing what is in the file will probably end up using the iterator version anyway. Would be a nice discussion during pair programming, I guess.
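One idea I might float during the session (purely a sketch, nothing implemented): drop the attribute chain (`robots_txt.ua.google`) in favour of dict-style lookups with an explicit default, so a missing agent never raises. The constructor input and method names below are made up for illustration:

```python
class RobotTxt:
    """Hypothetical friendlier API; the constructor input here is made up."""

    def __init__(self, rules_by_agent):
        # e.g. {'Google': ['/admin'], '*': ['/login/']}
        self._rules = {ua.lower(): rules for ua, rules in rules_by_agent.items()}

    def __iter__(self):
        # iterate over user agents without needing to know what is in the file
        return iter(self._rules)

    def __contains__(self, ua):
        return ua.lower() in self._rules        # case-insensitive membership test

    def get(self, ua, default=None):
        # explicit lookup with a default, so a missing agent never raises
        return self._rules.get(ua.lower(), default)


robots_txt = RobotTxt({'Google': ['/admin'], '*': ['/login/']})
print('google' in robots_txt)                        # True
print(robots_txt.get('google'))                      # ['/admin']
print(robots_txt.get('bing', robots_txt.get('*')))   # ['/login/'] - fall back to the wildcard rules
```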
Bugs:
- Even the starting example has a bug!

```python
scanner.scan("User-agent: Google\nDisallow: *")
# (('\\USER_AGENT_VALUE/', 'User-agent: Google'), ('\\DISALLOW_VALUE/', 'Disallow: '))
```

I would expect the `*` in the output, but it is not there.
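I have not traced the cause yet, but my guess is that the value pattern simply does not allow `*`. The patterns below are made up; they only illustrate how a too-narrow character class would reproduce the symptom, and how widening it would fix it:

```python
import re

line = "Disallow: *"

# A value class limited to path-like characters drops the '*', like the buggy output:
narrow = re.compile(r'^Disallow:\s*[\w/.?-]*', re.IGNORECASE)
print(repr(narrow.match(line).group(0)))  # 'Disallow: '

# Taking any non-whitespace run keeps it:
wide = re.compile(r'^Disallow:\s*\S*', re.IGNORECASE)
print(repr(wide.match(line).group(0)))    # 'Disallow: *'
```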
I think Feature 1 and Bug 1 are fairly quick, and Feature 2 is also doable. Even if we don't finish implementing Feature 2, I think some discussion around it would be useful, but I am optimistic about the pace.