In this tutorial we'll learn, step by step, how to create a simple web scraper with Python 3. By the end you'll be able to get a list of TED Talks from a Python script that accepts parameters. You'll be executing something like:
$ python tedscraper.py -s "Artificial Intelligence" --page 1 --results-per-page 5
1 - Playlist: Artificial intelligence (10 talks)
2 - Gil Weinberg: Can robots be creative?
3 - Peter Norvig | TED Speaker
4 - Dan Finkel: Can you solve the rogue AI riddle?
5 - Margaret Mitchell | TED Speaker
- Set the environment
- Check what we want to scrape
- Check the libraries we're going to use
- Get the data
- Parse the content
- Improve the execution interface
For this tutorial we're going to use PyCharm with Python 3 on Ubuntu 19.10.
The first thing we need to do is create our new project. If you just installed PyCharm or closed all your open projects, you should see the welcome screen.
Insert screenshot here
If you already have a project open, go to File > New Project instead.
- Set the project name (I suggest `tedscraper`)
- Unfold the Project Interpreter options and make sure the `New Environment` option is selected and the base interpreter is Python 3
- Click the Create button
Insert screenshot here
Now let's make sure that the virtual environment was created successfully. Open the Terminal tab at the bottom; you should see the `(venv)` prefix in the prompt.
Insert screenshot here and highlight the venv prefix
Now we have PyCharm working with a virtualenv. For more information about virtual environments in Python, check HERE
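If you prefer the terminal over PyCharm's wizard, you can create and activate the same kind of environment by hand. A minimal sketch, assuming python3 is already installed:

$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ pip --version

Either way, the `(venv)` prefix tells you the environment is active.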
Now let's explore our target. In the browser, go to ted.com and run a search:
- Go to the search bar at the top right and search for a term (e.g. "artificial intelligence")
- Copy the resulting URL. It will look something like this:
https://www.ted.com/search?page=2&q=artificial+intelligence
Insert screenshot here
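Note the query string: page and q are plain query parameters. Later we'll let requests build this URL for us from a dict; a quick sketch of how that works:

import requests

# requests URL-encodes the params dict into the query string for us
response = requests.get(
    "https://www.ted.com/search",
    params={"page": 2, "q": "artificial intelligence"},
)
print(response.url)  # https://www.ted.com/search?page=2&q=artificial+intelligence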
Now activate the Developer Tools and take a look at any result's title in order to find a unique identifier, maybe a class or an id. In this case we'll use the `<article>` element with the class `m1 search__result`, and inside it the `<h3>` and `<a>` elements.
Insert screenshot here
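The relevant markup looks roughly like this (a simplified sketch: the `m1 search__result` class comes from the page, the `h7 m4` classes on the `<h3>` are the ones our script will target below, and everything else is illustrative):

<article class="m1 search__result">
  <h3 class="h7 m4">
    <a href="/talks/...">Can robots be creative?</a>
  </h3>
</article>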
We'll use two libraries:
- requests, to make the HTTP requests: https://requests.readthedocs.io/en/master/
- Beautiful Soup 4, to parse the HTML: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Now let's install the packages we'll use.
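Create a requirements.txt at the project root. A minimal version, assuming only the two libraries above:

requests
beautifulsoup4

Then install it from the PyCharm terminal, so the packages land in the venv: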
pip install -r requirements.txt
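To confirm everything installed into the venv, a quick sanity check:

$ python -c "import requests, bs4; print(requests.__version__, bs4.__version__)"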
Now, let's create our script (insert screenshot)
Let's make a quick test of the request:

import requests

url = "https://www.ted.com"
response = requests.get(url)
print(response.status_code)
Let's execute our script (insert screenshot) and we'll see a 200 status code, which means success.
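If you want the script to fail loudly instead of silently continuing on an error status, requests can raise for you. A small variant of the test above, not part of the final script:

import requests

response = requests.get("https://www.ted.com")
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
print("OK:", response.status_code)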
Now let's make the real request
import requests
from bs4 import BeautifulSoup
import argparse

# Command-line interface: search term, page number and page size
parser = argparse.ArgumentParser(description="Scrape TED Talks")
parser.add_argument('-s', '--search-term', required=True)
parser.add_argument('-p', '--page', type=int, default=1)
parser.add_argument('-rp', '--results-per-page', type=int, default=10)
args = parser.parse_args()

search_term = args.search_term
page_number = args.page
RESULTS_PER_PAGE = args.results_per_page

url = "https://www.ted.com/search"
params = {'page': page_number, 'per_page': RESULTS_PER_PAGE, 'q': search_term}

def scrape():
    response = requests.get(url, params=params)
    soup = BeautifulSoup(response.content, "html.parser")
    # each search result title is an <h3 class="h7 m4"> containing an <a>
    return soup.find_all('h3', {'class': 'h7 m4'})

if __name__ == '__main__':
    print("Start")
    articles = scrape()
    for idx, article in enumerate(articles, 1):
        article_title = article.a.text
        print(f"{idx} - {article_title}")
Insert commit here
At this point we've made a parameterized GET request, parsed the HTML with Beautiful Soup, and printed the title of each result. We can also pull more content out of the same elements, such as each talk's link.
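For example, here's a sketch that prints each talk's link along with its title (it assumes the same `h7 m4` elements, and that hrefs on the search page are relative paths, so we prepend the domain):

# Replace the loop in the script above with this one
for idx, article in enumerate(articles, 1):
    link = article.a                      # the <a> inside the <h3>
    title = link.text.strip()
    href = "https://www.ted.com" + link.get("href", "")
    print(f"{idx} - {title} ({href})")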