Created
July 21, 2009 17:12
'''
@author: Michael Bommarito
@contact [email protected]
@date Jul 21, 2009
'''
"""
# Go to the NFL website and find the page that lists all teams: http://www.nfl.com/teams/
# Pick your favorite team and select the team roster.
# Now, pick a few of your favorite players and check out their profile pages.
# Do you notice any patterns in the data or structure on each player's page?
# Pay special attention to the URL for each player's profile page. Do you notice any patterns at the end of the URL?
# Describe the URL pattern in words. Are there a certain number of letters or numbers in any particular order?
    id=CAR356737
    id=COU714650
    id=GAN308500
    id=JOH338168
    id=AAA000000
These are identifiers for each player. Each sample id is three uppercase letters followed by six digits.
In the example URLs below, the letters appear to be the first three letters of the player's last name (ZBI for Zbikowski, ROD for Rodgers).
Regular expressions are a simple way to extract text that follows a pattern like this.
Regular Expression: id=([A-Z0-9]+)
    id=         the literal text that precedes the identifier
    [A-Z]       match one uppercase letter, A through Z
    [0-9]       match one digit, 0 through 9
    [A-Z0-9]    match one uppercase letter or one digit
    [A-Z0-9]+   match one or more uppercase letters or digits
    (...)       capture group: keep only this part of the match
"""
# re is the module that provides support for regular expressions.
# 'import re' makes the module available to your program.
import re
# This is an example string with three real URLs.
exampleText = 'http://www.nfl.com/players/tomzbikowski/profile?id=ZBI355964 http://www.nfl.com/players/stefanrodgers/profile?id=ROD526034 http://www.nfl.com/players/dawanlandry/profile?id=LAN144473'
# This line compiles the regular expression into a reusable pattern object.
idFinder = re.compile(r'id=([A-Z0-9]+)')
# findall returns only the captured group, i.e. the identifier from each URL.
# print(...) with parentheses runs under both Python 2 and Python 3.
print(idFinder.findall(exampleText))
"""
You should see the following output:
['ZBI355964', 'ROD526034', 'LAN144473']
"""
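The pattern id=([A-Z0-9]+) accepts any run of uppercase letters and digits, while every sample id in the docstring has a fixed shape. The following is a minimal sketch of a stricter pattern and is not part of the original script. The names STRICT_ID and extract_ids are hypothetical, and the "three letters then six digits" shape is an assumption inferred only from the sample ids, not a documented NFL.com rule.

```python
import re

# Assumption: an id is exactly three uppercase letters followed by six digits,
# as in the sample ids (CAR356737, ZBI355964, ...). \b rejects longer digit runs.
STRICT_ID = re.compile(r'id=([A-Z]{3}[0-9]{6})\b')


def extract_ids(text):
    """Return every id in text that matches the assumed 3-letter, 6-digit shape."""
    return STRICT_ID.findall(text)


urls = ('http://www.nfl.com/players/tomzbikowski/profile?id=ZBI355964 '
        'http://www.nfl.com/players/stefanrodgers/profile?id=ROD526034 '
        'http://www.nfl.com/players/dawanlandry/profile?id=LAN144473')

print(extract_ids(urls))             # ['ZBI355964', 'ROD526034', 'LAN144473']
# The looser pattern would capture 'X1' here; the strict one finds nothing.
print(extract_ids('profile?id=X1'))  # []
```

A stricter pattern like this trades flexibility for safety: it ignores malformed ids, but it would also silently skip valid ids if the site ever used a different shape.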