#!/usr/bin/env python
"""
Download all the pdfs linked on a given webpage

Usage -
    python grab_pdfs.py url <path/to/directory>

    url is required
    path is optional. Path needs to be absolute
    will save in the current directory if no path is given
    will save in the current directory if given path does not exist

Requires - requests >= 1.0.4
           beautifulsoup >= 4.0.0

Download and install using

    pip install requests
    pip install beautifulsoup4
"""

__author__ = 'elssar <[email protected]>'
__license__ = 'MIT'
__version__ = '1.0.0'

# Note: written for Python 2 (urlparse and print-statement syntax).
from requests import get
from urlparse import urljoin
from os import path, getcwd
from bs4 import BeautifulSoup as soup
from sys import argv


def get_page(base_url):
    # Fetch the webpage and return its HTML, or fail loudly on a bad status.
    req = get(base_url)
    if req.status_code == 200:
        return req.text
    raise Exception('Error {0}'.format(req.status_code))


def get_all_links(html):
    # Collect every anchor tag on the page.
    bs = soup(html)
    links = bs.findAll('a')
    return links


def get_pdf(base_url, base_dir):
    html = get_page(base_url)
    links = get_all_links(html)
    if len(links) == 0:
        raise Exception('No links found on the webpage')
    n_pdfs = 0
    for link in links:
        href = link.get('href', '')  # some anchors have no href attribute
        if href[-4:] == '.pdf':
            n_pdfs += 1
            content = get(urljoin(base_url, href))
            if content.status_code == 200 and content.headers.get('content-type') == 'application/pdf':
                with open(path.join(base_dir, link.text + '.pdf'), 'wb') as pdf:
                    pdf.write(content.content)
    if n_pdfs == 0:
        raise Exception('No pdfs found on the page')
    print "{0} pdfs downloaded and saved in {1}".format(n_pdfs, base_dir)


if __name__ == '__main__':
    if len(argv) not in (2, 3):
        print 'Error! Invalid arguments'
        print __doc__
        exit(-1)

    url = argv[1]
    arg = argv[2] if len(argv) == 3 else ''
    # Save into the given directory only if it exists, otherwise the current one.
    base_dir = arg if path.isdir(arg) else getcwd()

    try:
        get_pdf(url, base_dir)
    except Exception, e:
        print e
        exit(-1)
Me too.
I tried the following modifications, which solved the "pdf_get() requires exactly 2 args" problem:
Change line 41 to html = get_page(base_url)
Change line 68 to get_pdf(url, base_dir)
However, the script now gives a new error: "An exception has occurred, use %tb to see the full traceback. SystemExit: -1".
I traced the error back but cannot find a solution to get this working.
Help would be appreciated. Thanks.
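That SystemExit: -1 is most likely not a new bug in the script. When the file is run inside IPython or a notebook with no command-line arguments, len(argv) is 1, so the __main__ block prints the usage text and calls exit(-1); IPython reports that as a SystemExit exception instead of quietly terminating. Running the script from a regular terminal avoids it, or the functions can be called directly. A minimal sketch, assuming the corrected two-argument call shown in the script above and that the gist is saved as grab_pdfs.py (the URL is borrowed from the wget example later in this thread; the output directory here is only an illustration):

from os import getcwd
from grab_pdfs import get_pdf  # assumes the script above is saved as grab_pdfs.py

# Download the pdfs linked from the page into the current working directory.
get_pdf('http://kea.kar.nic.in/', getcwd())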
Nice code, worked like a charm! A couple of tweaks and I was able to download all the pdf files.
I also get the "pdf_get() requires exactly 2 args" error whatever I do.
If I got it right, the point of the "2 args" error would be an approach to test whether both arguments, base_url and base_dir, were present when the function is called? But that is strange: Python would immediately raise an exception if we tried to run this code without providing the arguments. I made some modifications to this code and it is running:
https://gist.github.com/Felipe-UnB/5c45ea5a8a7910b35dc31fbc750dad58
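For what it's worth, the "pdf_get() requires exactly 2 args" message does not come from a deliberate argument check; it appears to be a paraphrase of Python's own TypeError, raised because the original __main__ block called get_pdf(base_dir) with a single argument while the function is defined to take two (the corrected call, get_pdf(url, base_dir), is shown in the script above). A minimal sketch of the same failure, with the function body omitted:

def get_pdf(base_url, base_dir):  # same two-argument signature as in the script above
    pass  # body omitted for illustration

get_pdf('pdfs')  # TypeError: get_pdf() takes exactly 2 arguments (1 given)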
The easiest solution to this is to just use the wget command in the terminal (-r crawls the site recursively, -P sets the output directory, -A pdf restricts the downloads to pdf files). For example:
wget -r -P ./pdfs -A pdf http://kea.kar.nic.in/
@danny311296 Your code returns an error
>>> wget -r -P ./pdfs -A pdf http://kea.kar.nic.in/
File "<stdin>", line 1
wget -r -P ./pdfs -A pdf http://kea.kar.nic.in/
^
SyntaxError: invalid syntax
@Adisain It should work on Ubuntu and most Unix systems.
Maybe try wget -r -P pdfs -A pdf http://kea.kar.nic.in/ instead on other systems.
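Judging by the >>> prompt in the traceback above, that SyntaxError comes from typing the command at the Python interpreter rather than in a shell: wget is a terminal program, not Python code. To run it from inside Python anyway, one option is to shell out to it, a rough sketch assuming wget is installed and on the PATH:

import subprocess

# Runs the same wget command suggested above as an external process.
subprocess.call(['wget', '-r', '-P', './pdfs', '-A', 'pdf', 'http://kea.kar.nic.in/'])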
Thanks @danny311296
@danny, the command works amazingly for the website above. For https://nclt.gov.in, however, it throws a "cannot verify certificate" error, so should I try the Python code instead?
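A "cannot verify certificate" error means wget could not validate the site's TLS certificate. wget can be told to skip the check with its --no-check-certificate option (at the cost of security), and the Python script above could hit a similar problem through requests, where verification can likewise be disabled. A sketch for illustration only, not a recommendation:

from requests import get

# Illustration only: verify=False disables TLS certificate verification,
# which is insecure and should be a last resort.
response = get('https://nclt.gov.in', verify=False)
print(response.status_code)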
I get
pdf_get() requires exactly 2 args