Last active
August 11, 2016 15:41
-
-
Save ZempTime/08496304ab15d1c8d075b8ddc98ec1ec to your computer and use it in GitHub Desktop.
Page Scraping Logic/Ideas
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# from http://blog.rubyroidlabs.com/2016/04/web-scraping-2/ | |
def strip_bad_chars(text) | |
text.gsub!(/"/, "'"); | |
text.gsub!(/\u2018/, "'"); | |
text.gsub!(/[”“]/, '"'); | |
text.gsub!(/’/, "'"); | |
text | |
end | |
def clean_body(text) | |
text.gsub!(/(\r)?\n/, ""); | |
text.gsub!(/\s+/, ' '); | |
end | |
text = Sanitize.clean(text, :elements => ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'a', 'b', 'strong', 'em', 'img', 'iframe'], | |
:attributes => { | |
'a' => ['href', 'title', 'name'], | |
'img' => ['src', 'title', 'alt'], | |
'iframe' => ['src', 'url', 'class', 'id', 'width', 'height', 'name'], | |
}, | |
:protocols => { | |
'a' => { | |
'href' => ['http', 'https', 'mailto'] | |
}, | |
'iframe' => { | |
'src' => ['http', 'https'] | |
} | |
}) | |
# clean start and end whitespace | |
text = text.strip; | |
return text | |
end | |
# Principles: | |
# - clean the cruft, rid the nonessential from the page at the start | |
# - make specific parts on the page easily and clearly callable | |
# - there is data cleaning, data organization, then data extraction. these three phases of logic are clearer if they're distinct | |
# - goal is to write logic that you will understand when you return to it in 6 months. | |
# - unlike most other places in code, comments are probably necessary for wonky web pages. | |
# - lean upon CSS selectors where possible |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment