@jsfenfen
Last active February 18, 2017 06:40
Test whether a URL serves a PDF by reading only the first 4 bytes of the response body. Checking the HTTP headers may be the better approach, but this helps if you think those may be wrong.
import requests


def test_for_pdf(url):
    """Return True if the response body starts with the PDF magic bytes."""
    # stream=True defers the body download until we ask for it.
    with requests.get(url, stream=True) as r:
        # iter_content yields bytes, so compare against a bytes literal;
        # the default b'' covers an empty response body.
        first_chunk = next(r.iter_content(chunk_size=4), b'')
    return first_chunk.startswith(b'%PDF')


# See pdf file spec, p. 92: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
if __name__ == '__main__':
    # a pdf:
    url = 'https://fremont.gov/DocumentCenter/View/31856'
    print(url + " is pdf? " + str(test_for_pdf(url)))
    # another pdf
    url = 'https://fremont.gov/DocumentCenter/Home/View/584'
    print(url + " is pdf? " + str(test_for_pdf(url)))
    # not a pdf
    url = 'https://fremont.gov/27/Departments'
    print(url + " is pdf? " + str(test_for_pdf(url)))
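The description notes that HTTP headers may be the better check when they can be trusted. A minimal sketch of comparing the two signals side by side follows. It uses the standard library's `urllib.request` instead of `requests` so it runs without third-party packages. The names `content_type_is_pdf` and `check_pdf` are hypothetical, and the `Range` header is only a hint: it assumes the server honors byte ranges, and servers that ignore it simply send the full body, of which only the first 4 bytes are read.

```python
from email.message import Message
from urllib.request import Request, urlopen

PDF_MAGIC = b"%PDF"


def content_type_is_pdf(content_type):
    """Return True if a Content-Type header value names application/pdf."""
    if not content_type:
        return False
    # Parse with the stdlib so parameters such as "; charset=..." are ignored
    # and the media type is compared case-insensitively.
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_type() == "application/pdf"


def check_pdf(url, timeout=10):
    """Return (header_says_pdf, body_starts_with_magic) for a URL.

    Hypothetical helper, not part of the original gist.
    """
    # Ask for bytes 0-3 only; this is an assumption, servers may ignore it.
    req = Request(url, headers={"Range": "bytes=0-3"})
    with urlopen(req, timeout=timeout) as resp:
        header_ok = content_type_is_pdf(resp.headers.get("Content-Type"))
        magic_ok = resp.read(len(PDF_MAGIC)) == PDF_MAGIC
    return header_ok, magic_ok
```

When the two values disagree, the header is likely wrong or the file is not what its URL suggests; which signal to trust is a judgment call for the caller.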