Created
May 28, 2014 09:37
python regex
I know I'm doing all types of wrong here:
Source HTML file here: http://mdpi.com/1420-3049/19/4/5150/htm
I want the text for the dc.source:
Molecules 2014, Vol. 19, Pages 5150-5162
Am using beautiful soup, so probably best to do it in that BUT it should also be regex-able. I can do this in bash no problem!
hand = open('1420-3049.19.4.5150.htm')
for ling in hand:
    ling = ling.rstrip()
    if re.search('name="dc.source"', ling) :
        bibinfo = ling.strip('\<').strip('>')
        print bibinfo+" "+originalurl
output:
<meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162" http://mdpi.com/1420-3049/19/4/5150/htm
#NotWhatIWanted / nor expected
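A likely reason for that output: str.strip(chars) treats its argument as a set of characters to trim from both ends of the string, and stops at the first character not in the set. It never touches anything in the middle. Here is a minimal Python 3 sketch of that behaviour. The sample line and its leading indentation are assumptions, not copied from the real file.

```python
# Hypothetical sample line; the leading spaces are an assumption about the file.
line = '    <meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162">'

# strip('\\<') trims backslashes and '<' from the ends only. The line starts
# with a space and ends with '>', so this first call removes nothing.
# strip('>') then removes just the trailing '>'.
stripped = line.strip('\\<').strip('>')
print(repr(stripped))

# To isolate the attribute value with plain string methods, slice between
# the opening 'content="' and the next double quote instead.
marker = 'content="'
start = line.find(marker) + len(marker)
end = line.find('"', start)
content = line[start:end]
print(content)
```

So the tag name and attribute text survive because strip() only ever works on the ends of the string.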
Oops - didn't notice you'd already put the beautiful soup version up. A better way to skin this particular cat :)
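The Beautiful Soup version isn't reproduced in this thread, so as a rough stand-in here is a parser-based sketch in Python 3 using only the standard library's html.parser. The class name MetaContentParser and the sample HTML string are made up for illustration. This is not the code referred to above.

```python
from html.parser import HTMLParser


class MetaContentParser(HTMLParser):
    """Collect the content attribute of <meta> tags with a given name."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name.lower()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == self.wanted_name:
            content = attr_map.get("content")
            if content is not None:
                self.values.append(content)


# Hypothetical snippet standing in for the downloaded page.
sample_html = (
    '<html><head>'
    '<META NAME="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162">'
    '</head></html>'
)

parser = MetaContentParser("dc.source")
parser.feed(sample_html)
bibinfo = parser.values[0] if parser.values else None
print(bibinfo)
```

A parser handles attribute order, quoting style, and upper-case tags without extra effort, which is the main reason to prefer it over line-by-line matching.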
For the record (Hacky McHack) - this should get the string you want in bibinfo. I think you might also want to pass flags=re.I to re.search, as HTML is supposed to be case insensitive in tag and attribute names.
for ling in hand:
    match = re.search('<.*meta.*dc\.source.*content\=[\"\'](.*)[\"\']',ling)
    if match:
        print ling, match.group(1)
        bibinfo = match.group(1)
Output with that test file:
<meta name="dc.source" content="Molecules 2014, Vol. 19, Pages 5150-5162">
Molecules 2014, Vol. 19, Pages 5150-5162
>>> bibinfo
'Molecules 2014, Vol. 19, Pages 5150-5162'
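For anyone on Python 3, here is a slightly tightened sketch of the same idea with the case-insensitive flag actually applied. The names META_RE and extract_dc_source and the sample lines are invented for this example. The pattern assumes the name attribute comes before content and that the whole tag sits on one line, which may not hold for every page.

```python
import re

# Lazy capture, with each closing quote forced to match its opening quote.
# re.IGNORECASE covers upper-case tag and attribute names.
META_RE = re.compile(
    r'<meta\b[^>]*\bname=(["\'])dc\.source\1[^>]*\bcontent=(["\'])(.*?)\2',
    re.IGNORECASE,
)


def extract_dc_source(lines):
    """Return the first dc.source content value found, or None."""
    for line in lines:
        match = META_RE.search(line)
        if match:
            return match.group(3)
    return None


# Hypothetical lines standing in for the HTML file.
sample_lines = [
    '<title>Example</title>',
    '<META NAME="dc.source" CONTENT="Molecules 2014, Vol. 19, Pages 5150-5162">',
]
bibinfo = extract_dc_source(sample_lines)
print(bibinfo)
```

The lazy (.*?) stops at the first matching quote, whereas the greedy (.*) in the snippet above would run on to the last quote on the line if more attributes followed.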