#!/usr/bin/env python
"""
Download all the pdfs linked on a given webpage

Usage -
    python grab_pdfs.py url <path/to/directory>

    url is required
    path is optional. Path needs to be absolute
    will save in the current directory if no path is given
    will save in the current directory if given path does not exist

Requires - requests >= 1.0.4
           beautifulsoup >= 4.0.0

Download and install using

    pip install requests
    pip install beautifulsoup4
"""

__author__ = 'elssar <[email protected]>'
__license__ = 'MIT'
__version__ = '1.0.0'

from requests import get
from urlparse import urljoin
from os import path, getcwd
from bs4 import BeautifulSoup as soup
from sys import argv


def get_page(base_url):
    req = get(base_url)
    if req.status_code == 200:
        return req.text
    raise Exception('Error {0}'.format(req.status_code))


def get_all_links(html):
    bs = soup(html)
    links = bs.findAll('a')
    return links


def get_pdf(base_url, base_dir):
    html = get_page()
    links = get_all_links(html)
    if len(links) == 0:
        raise Exception('No links found on the webpage')
    n_pdfs = 0
    for link in links:
        if link['href'][-4:] == '.pdf':
            n_pdfs += 1
            content = get(urljoin(base_url, link['href']))
            if content.status == 200 and content.headers['content-type'] == 'application/pdf':
                with open(path.join(base_dir, link.text + '.pdf'), 'wb') as pdf:
                    pdf.write(content.content)
    if n_pdfs == 0:
        raise Exception('No pdfs found on the page')
    print "{0} pdfs downloaded and saved in {1}".format(n_pdfs, base_dir)


if __name__ == '__main__':
    if len(argv) not in (2, 3):
        print 'Error! Invalid arguments'
        print __doc__
        exit(-1)
    arg = ''
    url = argv[1]
    if len(argv) == 3:
        arg = argv[2]
    base_dir = [getcwd(), arg][path.isdir(arg)]
    try:
        get_pdf(base_dir)
    except Exception, e:
        print e
        exit(-1)
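For example (both the URL and the save directory below are just placeholders), the script is invoked as described in the docstring:

python grab_pdfs.py http://example.com/papers /home/user/pdfs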
Nice code, worked like a charm! A couple of tweaks and I was able to download all the PDF files.
I also get the "pdf_get() requires exactly 2 args" error, no matter what I do.
If I got it right, the point of the "2 args" error is to check whether both arguments, base_url and base_dir, were passed to the function? But it is strange: Python would immediately raise an exception if we tried to run this code without providing the arguments. I made some modifications to this code and it is running now.
https://gist.github.com/Felipe-UnB/5c45ea5a8a7910b35dc31fbc750dad58
The easiest solution to this is to just use the wget command in the terminal.
For example:
wget -r -P ./pdfs -A pdf http://kea.kar.nic.in/
@danny311296 Your command returns an error:
>>> wget -r -P ./pdfs -A pdf http://kea.kar.nic.in/
File "<stdin>", line 1
wget -r -P ./pdfs -A pdf http://kea.kar.nic.in/
^
SyntaxError: invalid syntax
@Adisain It should work on Ubuntu and most Unix systems.
Maybe try
wget -r -P pdfs -A pdf http://kea.kar.nic.in/
instead on other systems
Thanks @danny311296
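A note on the error above: the traceback (File "<stdin>", line 1) shows that the wget line was typed into the Python interpreter, but wget is a shell command and has to be run in a terminal. If you do want to launch it from inside Python, the standard library's subprocess module works; a minimal sketch, reusing the example command from this thread:

from subprocess import call

# Run the same wget command from Python; wget itself must be installed.
# The URL is the example used earlier in this thread.
call(['wget', '-r', '-P', 'pdfs', '-A', 'pdf', 'http://kea.kar.nic.in/'])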
@danny, the command works great for the website above. For https://nclt.gov.in, however, it throws a "cannot verify certificate" error. Should I try the Python script instead?
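Regarding the certificate error: it means the site's TLS certificate could not be verified. wget has a --no-check-certificate flag that skips verification, and the requests library used by this script accepts verify=False for the same (insecure) workaround. A minimal sketch, using the site mentioned above:

from requests import get

# Skip TLS certificate verification (insecure; use only if you accept the risk).
response = get('https://nclt.gov.in', verify=False)
# response.content then holds the page body, as in the script above.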
I tried the following modification, which solved the "pdf_get() requires exactly 2 args" problem:
Change line 41 to html = get_page(base_url)
Change line 68 to get_pdf(url, base_dir)
However, the script now gives a new error: "An exception has occurred, use %tb to see the full traceback. SystemExit: -1".
I traced the error back but cannot find a solution to get this working.
Help would be appreciated. Thanks.
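On the SystemExit: -1 message: the "use %tb" wording suggests the script is being run inside IPython or a Jupyter notebook, which reports exit(-1) exactly like that because exit() raises SystemExit. The script calls exit(-1) both when the arguments are invalid and after printing any exception, so the real cause is whatever message is printed just before it; when the script is not launched from the command line, sys.argv will not contain the URL, which is a common culprit. For reference, a sketch of the entry point with the two fixes above applied (still Python 2, like the original; inside get_pdf() the first line becomes html = get_page(base_url)):

if __name__ == '__main__':
    if len(argv) not in (2, 3):
        print 'Error! Invalid arguments'
        print __doc__
        exit(-1)  # exit() raises SystemExit, which IPython reports as "SystemExit: -1"
    url = argv[1]
    arg = argv[2] if len(argv) == 3 else ''
    base_dir = arg if path.isdir(arg) else getcwd()
    try:
        get_pdf(url, base_dir)  # pass both arguments, as in the fix above
    except Exception, e:
        print e
        exit(-1)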