Created October 21, 2023 14:14
Get the main headline story from the homepages of news websites.
import advertools as adv

# Map each homepage to a (column name, XPath expression) pair for its lead headline.
url_xpath_selectors = {
    'https://www.ft.com': ('main_story_headline', '//span[contains(@class, "text text--color-black text-display--scale-7 text--weight-500")]/text()'),
    'https://www.nytimes.com': ('main_story_headline', '//h3[@class="indicate-hover css-si8ren"]/text()'),
    'https://www.economist.com': ('main_story_headline', '//a[@data-analytics="top_stories:headline_1"]/text()'),
    'https://edition.cnn.com': ('main_story_headline', '//h2[@class="container__title_url-text container_lead-package__title_url-text"]/text()'),
    'https://www.nbcnews.com': ('main_story_headline', '//h2[@class="storyline__headline founders-cond fw6 important large headlineOnly"]/text()'),
    'https://www.bbc.com': ('main_story_headline', '//a[@rev="hero1|headline"]/text()'),
    'https://www.foxnews.com': ('main_story_headline', '(//header[@class="info-header"])[1]//a/text()'),
}

# Crawl each homepage on its own, since each site needs a different selector.
# Every crawl writes to the same output and log files.
for url, selector in url_xpath_selectors.items():
    adv.crawl(
        url,
        '/home/user_name/news_headlines.jl',
        xpath_selectors={
            selector[0]: selector[1]
        },
        custom_settings={
            'LOG_FILE': '/home/user_name/news_headlines.log'
        })
Sample output data:
Link to data: https://bit.ly/492QaVO
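To inspect the collected headlines, the output file can be read back line by line, since a `.jl` file holds one JSON object per line. The following is a minimal sketch using only the standard library. It assumes each row has a `url` field next to the `main_story_headline` column defined in the script, and that later rows come from later crawls; `latest_headlines` is a hypothetical helper name, not part of advertools.

```python
import json
from pathlib import Path


def latest_headlines(jl_path, column="main_story_headline"):
    """Return {url: headline}, keeping the last non-empty row seen per URL.

    Assumes each line is a JSON object with a "url" key (an assumption
    about the crawl output) and an optional headline column.
    """
    headlines = {}
    for line in Path(jl_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        text = row.get(column)
        if not text:
            continue  # the selector matched nothing for this row
        # Collapse runs of whitespace left over from the page markup.
        headlines[row.get("url")] = " ".join(str(text).split())
    return headlines


if __name__ == "__main__":
    output_file = Path("/home/user_name/news_headlines.jl")
    if output_file.exists():
        for url, headline in latest_headlines(output_file).items():
            print(url, headline, sep="\t")
```

A row with no headline usually means the site changed its markup and the XPath for that site needs updating.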
Create a virtual environment, let's say virtual_env
From the command line, open your crontab for editing (typically with crontab -e).
Then add the following line to the end of the file:
@hourly PATH=/path/to/virtual_env/bin ; /path/to/virtual_env/bin/python /path/to/news_headlines_automated.py
Note: Make sure you use the full paths to your environment, Python, and your script.
More on how to automate Python scripts on a Linux server: https://bit.ly/476BSlt
In addition to @hourly, you can use @daily, @weekly, or @monthly.