Created
May 4, 2020 12:17
Function used to pull a single letter page for the Epistulae Morales Ad Lucilium extraction.
import re

import requests
from bs4 import BeautifulSoup


# create function to pull letter from webpage (pulls text within <p> elements)
def pull_letter(http):
    # get html from webpage given by 'http'
    html = requests.get(http).text
    # parse into a beautiful soup object
    soup = BeautifulSoup(html, "html.parser")
    # build text contents within all p elements
    txt = '\n'.join(x.text for x in soup.find_all('p'))
    # collapse runs of spaces, tabs and non-breaking spaces into a single space
    txt = re.sub(r'[^\S\n]+', ' ', txt)
    # remove webpage references ('[1]', '[2]', etc)
    txt = re.sub(r'\[\d+\]', '', txt)
    # remove the numbered bullet points that Seneca uses ('1. ', '2. ', etc);
    # the dot is escaped so that only a literal period matches
    txt = re.sub(r'\d+\. ', '', txt)
    # collapse consecutive newlines into one
    txt = re.sub(r'\n{2,}', '\n', txt)
    # and return the result
    return txt
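
The cleanup steps in pull_letter can be tried on a plain string, without fetching a page. The following is a minimal sketch, not part of the original gist: `clean_letter_text` is a hypothetical helper name, the sample string is made up for illustration, and the patterns are one reasonable reading of the comments in the function (the dot in the bullet pattern is escaped so that it matches only a literal period).

```python
import re


def clean_letter_text(txt):
    # hypothetical helper: applies the same cleanup steps to text that has
    # already been extracted from the <p> elements
    # collapse runs of non-newline whitespace into a single space
    txt = re.sub(r'[^\S\n]+', ' ', txt)
    # remove footnote-style references such as '[1]'
    txt = re.sub(r'\[\d+\]', '', txt)
    # remove numbered bullet points such as '1. '
    txt = re.sub(r'\d+\. ', '', txt)
    # collapse consecutive newlines into one
    txt = re.sub(r'\n{2,}', '\n', txt)
    return txt


# made-up sample input, for illustration only
sample = "1. Ita fac, mi Lucili:[1]  vindica te tibi.\n\n2. Persuade tibi hoc sic esse."
cleaned = clean_letter_text(sample)
print(cleaned)
```

Note that the bullet pattern removes any digits followed by a period and a space, wherever they occur, so a sentence ending in a number would also lose that number.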