@jamescalam
Created May 4, 2020 12:17
Function used to pull a single letter page for the Epistulae Morales Ad Lucilium extraction.
import re

import requests
from bs4 import BeautifulSoup


# create function to pull letter from webpage (pulls text within <p> elements)
def pull_letter(http):
    # get html from webpage given by 'http'
    html = requests.get(http).text
    # parse into a beautiful soup object
    soup = BeautifulSoup(html, "html.parser")
    # build text contents within all p elements
    txt = '\n'.join([x.text for x in soup.find_all('p')])
    # collapse runs of spaces, tabs and non-breaking spaces into a single space
    txt = re.sub(r'[ \t\xa0]+', ' ', txt)
    # remove webpage references ('[1]', '[2]', etc)
    txt = re.sub(r'\[\d+\]', '', txt)
    # remove all number bullet points that Seneca uses ('1. ', '2. ', etc);
    # the dot is escaped so it matches a literal '.' rather than any character
    txt = re.sub(r'\d+\. ', '', txt)
    # collapse repeated newlines into a single newline
    txt = re.sub(r'\n{2,}', '\n', txt)
    # and return the result
    return txt
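
The clean-up stage can be tried without any network access. The following is a minimal sketch, assuming raw-string patterns with an escaped dot for the numbered bullets; the helper name `clean_letter_text` and the sample string are hypothetical and only serve as an illustration.

```python
import re


def clean_letter_text(txt):
    # hypothetical helper: applies the text clean-up steps to an already extracted string
    # collapse runs of spaces, tabs and non-breaking spaces into a single space
    txt = re.sub(r'[ \t\xa0]+', ' ', txt)
    # remove webpage references such as '[1]'
    txt = re.sub(r'\[\d+\]', '', txt)
    # remove numbered bullet points such as '1. ' (literal dot, then a space)
    txt = re.sub(r'\d+\. ', '', txt)
    # collapse repeated newlines into a single newline
    txt = re.sub(r'\n{2,}', '\n', txt)
    return txt


# made-up sample text, not taken from the letters themselves
sample = "1. Greetings  from Seneca.[1]\n\n\n2. Farewell."
cleaned = clean_letter_text(sample)
print(cleaned)
```

Note that the bullet pattern also strips any other number followed by a full stop and a space (for example a year at the end of a sentence), so it may need tightening depending on the page being scraped.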