Last active
November 18, 2019 23:51
A simple crawler example to build a course material downloader
import os
from urllib.parse import urljoin

import requests
import wget
from lxml import etree

# prepare
download_directory = 'slides/'
url = 'http://inst.eecs.berkeley.edu/~cs61a/fa18/'
os.makedirs(download_directory, exist_ok=True)  # make sure the target folder exists

# make request
r = requests.get(url)
r.raise_for_status()  # stop early if the page could not be fetched
html = etree.HTML(r.text)

# extract links
slide_links = html.xpath('//li/a[text()="8pp"]/@href')
slide_links = list(set(slide_links))  # remove duplicate links
print(len(slide_links))

# download
for slide in slide_links:
    print(slide)
    download_link = urljoin(url, slide)  # complete download link
    file_name = os.path.basename(slide)
    download_path = os.path.join(download_directory, file_name)  # local file path
    wget.download(download_link, download_path)
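
The download loop above turns each extracted href into a full URL and a local file path. As a minimal sketch, that step can be pulled into a pure function that is easy to check without touching the network. The function name `plan_downloads` and the sample hrefs are hypothetical, not taken from the real course page. The sketch also de-duplicates with `dict.fromkeys`, which keeps the first-seen order, whereas `set` does not guarantee any order.

```python
import os
from urllib.parse import urljoin


def plan_downloads(base_url, hrefs, download_directory):
    """Return (download_url, local_path) pairs, one per unique href."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique_hrefs = list(dict.fromkeys(hrefs))
    plan = []
    for href in unique_hrefs:
        # urljoin resolves both relative and root-relative hrefs against the page URL.
        download_url = urljoin(base_url, href)
        local_path = os.path.join(download_directory, os.path.basename(href))
        plan.append((download_url, local_path))
    return plan


if __name__ == "__main__":
    base = "http://inst.eecs.berkeley.edu/~cs61a/fa18/"
    # Hypothetical hrefs for illustration only.
    sample = [
        "assets/slides/01-Intro_8pp.pdf",
        "assets/slides/01-Intro_8pp.pdf",
        "/~cs61a/fa18/assets/slides/02-Functions_8pp.pdf",
    ]
    for download_url, local_path in plan_downloads(base, sample, "slides"):
        print(download_url, "->", local_path)
```

Each pair returned by `plan_downloads` could then be passed to `wget.download(download_url, local_path)`, as in the loop above.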
Here is the explanation: Download Course Materials with A Simple Python Crawler