In this tutorial we'll learn, step by step, how to create a simple web scraper with Python 3. By the end you'll be able to get a list of TED Talks from a Python script that accepts parameters. You'll be executing something like:
$ python tedscraper.py -s "Artificial Intelligence" --page 1 --results-per-page 5
1 - Playlist: Artificial intelligence (10 talks)
2 - Gil Weinberg: Can robots be creative?
3 - Peter Norvig | TED Speaker
4 - Dan Finkel: Can you solve the rogue AI riddle?
5 - Margaret Mitchell | TED Speaker
- Set the environment
- Check what we want to scrape
- Check the libraries we're going to use
- Get the data
- Parse the content
- Improve the execution interface
For this tutorial we're going to use PyCharm with Python 3 on Ubuntu 19.10.
The first thing we need to do is create our new project. If you just installed PyCharm or closed all your open projects, you should see the welcome screen.
Insert screenshot here
If you already have a project open, go to File > New Project instead.
- Set the project name (I suggest `tedscraper`)
- Unfold the Project Interpreter options and make sure the `New Environment` option is selected and the base interpreter is Python 3
- Click the Create button
Insert screenshot here
Now let's make sure that the virtual environment was created successfully. Open the Terminal tab at the bottom; you should see the `(venv)` prefix in the prompt.
Insert screenshot here and highlight the venv prefix
Now we have PyCharm working with a virtualenv. For more information about virtual environments in Python, check HERE
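If you prefer the terminal over PyCharm's wizard, you can create and activate the same kind of environment by hand. A minimal sketch, assuming python3 is already installed:

$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ pip --version

Either way, the `(venv)` prefix tells you the environment is active.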
Now let's explore our target. In the browser, go to ted.com and run a search:
- Go to the search bar at the top right and search for a term (e.g. "artificial intelligence")
- Copy the resulting URL. It will look something like this:
https://www.ted.com/search?page=2&q=artificial+intelligence
Insert screenshot here
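Note the query string: page and q are plain query parameters. Later we'll let requests build this URL for us from a dict; a quick sketch of how that works:

import requests

# requests URL-encodes the params dict into the query string for us
response = requests.get(
    "https://www.ted.com/search",
    params={"page": 2, "q": "artificial intelligence"},
)
print(response.url)  # https://www.ted.com/search?page=2&q=artificial+intelligence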
Now activate the Developer Tools and take a look at any result's title in order to find a unique identifier, maybe a class or an id. In this case we'll use the `<article>` element with the class `m1 search__result`, and inside it the `<h3>` and `<a>` elements.
Insert screenshot here
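The relevant markup looks roughly like this (a simplified sketch: the `m1 search__result` class comes from the page, the `h7 m4` classes on the `<h3>` are the ones our script will target below, and everything else is illustrative):

<article class="m1 search__result">
  <h3 class="h7 m4">
    <a href="/talks/...">Can robots be creative?</a>
  </h3>
</article>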
We'll use two libraries:
- requests, to make the HTTP requests: https://requests.readthedocs.io/en/master/
- Beautiful Soup 4, to parse the HTML: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Now let's install the packages we'll use.
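Create a requirements.txt at the project root. A minimal version, assuming only the two libraries above:

requests
beautifulsoup4

Then install it from the PyCharm terminal, so the packages land in the venv: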
pip install -r requirements.txt
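To confirm everything installed into the venv, a quick sanity check:

$ python -c "import requests, bs4; print(requests.__version__, bs4.__version__)"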
Now, let's create our script (insert screenshot)
Let's make a quick test of the request:

import requests

url = "https://www.ted.com"
response = requests.get(url)
print(response.status_code)
Let's execute our script (insert screenshot) and we'll see a 200 status code, which means success.
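If you want the script to fail loudly instead of silently continuing on an error status, requests can raise for you. A small variant of the test above, not part of the final script:

import requests

response = requests.get("https://www.ted.com")
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
print("OK:", response.status_code)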
Now let's make the real request
import requests
from bs4 import BeautifulSoup
import argparse

# Command-line interface: search term, page number and page size
parser = argparse.ArgumentParser(description="Scrape TED Talks")
parser.add_argument('-s', '--search-term', required=True)
parser.add_argument('-p', '--page', type=int, default=1)
parser.add_argument('-rp', '--results-per-page', type=int, default=10)
args = parser.parse_args()

search_term = args.search_term
page_number = args.page
RESULTS_PER_PAGE = args.results_per_page

url = "https://www.ted.com/search"
params = {'page': page_number, 'per_page': RESULTS_PER_PAGE, 'q': search_term}

def scrape():
    response = requests.get(url, params=params)
    soup = BeautifulSoup(response.content, "html.parser")
    # each search result title is an <h3 class="h7 m4"> containing an <a>
    return soup.find_all('h3', {'class': 'h7 m4'})

if __name__ == '__main__':
    print("Start")
    articles = scrape()
    for idx, article in enumerate(articles, 1):
        article_title = article.a.text
        print(f"{idx} - {article_title}")
Insert commit here
At this point we've made a parameterized GET request, parsed the HTML with Beautiful Soup, and printed the title of each result. We can also pull more content out of the same elements, such as each talk's link.
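For example, here's a sketch that prints each talk's link along with its title (it assumes the same `h7 m4` elements, and that hrefs on the search page are relative paths, so we prepend the domain):

# Replace the loop in the script above with this one
for idx, article in enumerate(articles, 1):
    link = article.a                      # the <a> inside the <h3>
    title = link.text.strip()
    href = "https://www.ted.com" + link.get("href", "")
    print(f"{idx} - {title} ({href})")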