@eliasdabbas
Created October 21, 2023 14:14
Get the main headline story from the homepages of news websites.
import advertools as adv

# Map each homepage URL to a (column name, XPath expression) pair.
url_xpath_selectors = {
    'https://www.ft.com': ('main_story_headline', '//span[contains(@class, "text text--color-black text-display--scale-7 text--weight-500")]/text()'),
    'https://www.nytimes.com': ('main_story_headline', '//h3[@class="indicate-hover css-si8ren"]/text()'),
    'https://www.economist.com': ('main_story_headline', '//a[@data-analytics="top_stories:headline_1"]/text()'),
    'https://edition.cnn.com': ('main_story_headline', '//h2[@class="container__title_url-text container_lead-package__title_url-text"]/text()'),
    'https://www.nbcnews.com': ('main_story_headline', '//h2[@class="storyline__headline founders-cond fw6 important large headlineOnly"]/text()'),
    'https://www.bbc.com': ('main_story_headline', '//a[@rev="hero1|headline"]/text()'),
    'https://www.foxnews.com': ('main_story_headline', '(//header[@class="info-header"])[1]//a/text()'),
}

# Crawl each homepage separately so that each one uses its own selector.
for url, selector in url_xpath_selectors.items():
    adv.crawl(
        url,
        '/home/user_name/news_headlines.jl',
        xpath_selectors={
            selector[0]: selector[1]
        },
        custom_settings={
            'LOG_FILE': '/home/user_name/news_headlines.log'
        })

eliasdabbas commented Oct 21, 2023

Sample output data:

[Image: screenshot of the sample output, 2023-10-21 at 3:22:04 PM]

Link to data: https://bit.ly/492QaVO
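
Once the script has run a few times, the output file holds one JSON object per crawled page. The following is a minimal sketch for pulling the headlines back out using only the standard library. It assumes each row has a `url` key and, when the XPath matched, a `main_story_headline` key (the column name used in the script). The function name `headlines_by_site` is made up for this example.

```python
import json
from collections import defaultdict


def headlines_by_site(jl_path, column="main_story_headline"):
    """Group extracted headlines by URL, in the order the rows appear in the file."""
    headlines = defaultdict(list)
    with open(jl_path, encoding="utf-8") as jl_file:
        for line in jl_file:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            url = row.get("url")
            headline = row.get(column)
            # Skip rows where the selector matched nothing (assumed: key absent or empty).
            if not url or not headline:
                continue
            # Keep a headline only when it differs from the previous one for that URL.
            if not headlines[url] or headlines[url][-1] != headline:
                headlines[url].append(headline)
    return dict(headlines)


# Example usage (path taken from the script above):
# for url, seen in headlines_by_site("/home/user_name/news_headlines.jl").items():
#     print(url, seen[-1], sep="\t")
```

Because repeated headlines are collapsed, the list for each site reads as a history of headline changes, and the last item is the most recent headline crawled.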

Create a virtual environment, let's say virtual_env

From the command line, run:

crontab -e

Then add the following line to the end of the file:

@hourly PATH=/path/to/virtual_env/bin ;  /path/to/virtual_env/bin/python /path/to/news_headlines_automated.py

Note: Make sure you use the full paths to your environment, its Python executable, and your script.

More on how to automate Python scripts on a Linux server: https://bit.ly/476BSlt
In addition to @hourly, you can use @daily, @weekly, or @monthly.
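
To avoid typos in the full paths, the crontab line can be generated rather than typed. This is a rough sketch, not part of the original script: the helper `cron_line` is a hypothetical name, it only accepts the four schedule shortcuts mentioned here, and it assumes it is run with the virtual environment's Python so that `sys.executable` points into `virtual_env/bin`.

```python
import os
import sys

# Only the shortcuts mentioned in this comment; cron itself accepts other forms too.
SCHEDULES = {"@hourly", "@daily", "@weekly", "@monthly"}


def cron_line(script_path, python_path=sys.executable, schedule="@hourly"):
    """Build a crontab entry that uses absolute paths for Python and the script."""
    if schedule not in SCHEDULES:
        raise ValueError(f"unsupported schedule: {schedule}")
    python_path = os.path.abspath(python_path)
    script_path = os.path.abspath(script_path)
    # The environment's bin directory goes on PATH, as in the crontab line above.
    bin_dir = os.path.dirname(python_path)
    return f"{schedule} PATH={bin_dir} ; {python_path} {script_path}"


# Example usage, run from inside the activated environment:
# print(cron_line("news_headlines_automated.py", schedule="@daily"))
```

Paste the printed line at the end of the file opened by crontab -e.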
