Created
May 28, 2014 09:37
python regex
I know I'm doing all kinds of wrong here:
Source HTML file here: http://mdpi.com/1420-3049/19/4/5150/htm
I want the text for the dc.source:
Molecules 2014, Vol. 19, Pages 5150-5162
I'm using Beautiful Soup, so it's probably best to do it in that, BUT it should also be regex-able. I can do this in bash no problem!
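import re

originalurl = 'http://mdpi.com/1420-3049/19/4/5150/htm'  # assumed: set elsewhere in the full script
hand = open('1420-3049.19.4.5150.htm')
for ling in hand:
    ling = ling.rstrip()
    if re.search('name="dc.source"', ling):
        # strip() only trims the listed characters from the two ends of the line
        bibinfo = ling.strip('<').strip('>')
        print(bibinfo + " " + originalurl)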
output:
<meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162" http://mdpi.com/1420-3049/19/4/5150/htm
#NotWhatIWanted / nor expected
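For what it's worth, the regex route could probably look something like the sketch below. `strip()` only removes characters from the ends of a string, so a capture group around the `content` value seems closer to what's wanted. The sample line is modelled on the output above (the closing `/>` is a guess), and `extract_dc_source` is just an illustrative name. It assumes `name` comes before `content` and that both use double quotes, which may not hold for every page.

```python
import re

# Sample line modelled on the output above; the trailing " />" is an assumption.
line = '<meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162" />'

# Capture whatever sits between the double quotes of the content attribute.
# Assumes name precedes content and both attributes are double-quoted.
pattern = re.compile(r'<meta\s+name="dc\.source"\s+content="([^"]*)"')

def extract_dc_source(text):
    """Return the dc.source content string, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None

print(extract_dc_source(line))
# Molecules 2014, Vol. 19, Pages 5150-5162
```

The same function could be called on each line inside the `for ling in hand:` loop in place of the two `strip()` calls.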
Oops - didn't notice you'd already put the beautiful soup version up. A better way to skin this particular cat :)
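For anyone landing here without Beautiful Soup installed, a parser-based version along the same lines might look like this. It is only a sketch: it swaps in the standard library's `html.parser` as a stand-in for Beautiful Soup, the class and function names (`MetaGrabber`, `get_meta_content`) are made up for illustration, and the sample HTML is a cut-down guess at the relevant part of the page.

```python
from html.parser import HTMLParser

class MetaGrabber(HTMLParser):
    """Remember the content attribute of the first <meta> with a given name."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name
        self.content = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta" or self.content is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == self.wanted_name:
            self.content = attr_map.get("content")

def get_meta_content(html, name):
    """Return the content of <meta name=...>, or None if it is absent."""
    parser = MetaGrabber(name)
    parser.feed(html)
    return parser.content

# Hypothetical, cut-down sample of the page's <head>.
sample = ('<html><head><meta name="dc.source" '
          'content="Molecules 2014, Vol. 19, Pages 5150-5162" /></head></html>')
print(get_meta_content(sample, "dc.source"))
# Molecules 2014, Vol. 19, Pages 5150-5162
```

Unlike the regex, this does not care about attribute order or quoting style, which is the main reason to prefer a parser for this kind of job.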