Download Perusall readings as PDF
data.py
title = "The title of the article" | |
urls=""" | |
<image URLs scraped from the page> | |
""" |
download_perusall.py
# dependencies: imagemagick, img2pdf
import os
import requests

from data import title, urls

folder = title.replace(' ', '-')
if not os.path.exists(folder):
    os.mkdir(folder)

# download each image chunk listed in data.py
i = 0
for u in urls.splitlines():
    if u:
        print('Downloading chunk', i, 'of', title)
        with open('{}/{:0>2}.png'.format(folder, i), 'wb') as out:
            out.write(requests.get(u.strip()).content)
        i += 1

# stack every 6 chunks vertically into one page image
pgno = 1
for j in range(0, i, 6):
    chunks = ' '.join(['{}/{:0>2}.png'.format(folder, k) for k in range(j, min(i, j + 6))])
    print('Converting page', pgno)
    os.system('convert -append %s %s/page_%s.png' % (chunks, folder, pgno))
    pgno += 1

# wrap the page images into a single PDF
print('Converting to pdf')
pages = ' '.join(['{}/page_{}.png'.format(folder, k) for k in range(1, pgno)])
os.system('img2pdf %s -o %s.pdf' % (pages, title))
print('Done')
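One caveat: because os.system builds shell strings, a title containing spaces or quotes can break the convert and img2pdf calls. A possible variant, a sketch rather than part of the original script, is to use subprocess with list arguments so no shell quoting is involved:

# sketch: shell-safe replacements for the two os.system calls above,
# assuming chunk_files and page_files are lists of PNG paths built as in the script
import subprocess

def stack_chunks(chunk_files, out_png):
    # ImageMagick: stack the chunk images vertically into one page image
    subprocess.run(['convert', '-append'] + chunk_files + [out_png], check=True)

def pages_to_pdf(page_files, out_pdf):
    # img2pdf: wrap the page PNGs into a single PDF without re-encoding
    subprocess.run(['img2pdf'] + page_files + ['-o', out_pdf], check=True)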
get_urls.js
/*
 * Click on a reading in the Perusall web interface,
 * and run this script in the developer console.
 * Copy-and-paste the console.info output to data.py.
 */
var len = 0;
var times = 0;
var i = setInterval(() => {
  // scroll to the last loaded chunk so Perusall lazy-loads the next ones
  var img = document.querySelectorAll("img.chunk");
  img[img.length - 1].scrollIntoView();
  if (len < img.length) {
    // new chunks appeared since the last tick; keep waiting
    len = img.length;
    times = 0;
  } else if (times > 3) {
    // no new chunks for several ticks; assume the document is fully loaded
    var urls = [];
    img.forEach((e) => urls.push(e.src));
    var spl = location.pathname.split('/');
    console.info('urls = """\n' + urls.join('\n') + '\n"""\n\ntitle="' + spl[spl.length - 1] + '"\n');
    clearInterval(i);
  } else {
    times++;
  }
}, 2000);
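Before running the downloader, it can help to verify that the console output pasted into data.py cleanly; a minimal sketch, assuming data.py sits in the same directory:

# quick check that data.py pasted correctly before downloading anything
from data import title, urls

url_list = [u.strip() for u in urls.splitlines() if u.strip()]
print('title:', title)
print('chunks found:', len(url_list))
assert url_list, 'no URLs found; re-run the console script and paste again'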
Yo, this was cool and works well, just a warning:
If you're running on Linux, make sure you have the following packages installed:
sudo apt install imagemagick img2pdf
Otherwise you'll spend too much time wondering why the os.system() calls aren't working properly.
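If you want to fail fast instead of debugging that, something like this at the top of the script would work (shutil.which returns None when a command isn't on your PATH):

# abort early if the external tools the script shells out to are missing
import shutil, sys

for tool in ('convert', 'img2pdf'):
    if shutil.which(tool) is None:
        sys.exit('missing dependency: {} (install imagemagick / img2pdf)'.format(tool))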
Also, make sure the entire document is loaded in Perusall before running the JS script that pulls the PNG URLs.
Other than that, it works well.
I've tried to run the script and it works, but not well.
I'll explain my issue.
The problem is, I think, in the part where I scrape the reading on Perusall to get the URLs.
The get_urls script works and returns URLs, but not all of them, so when download_perusall.py runs it creates a PDF showing only the first and last pages.
The problem is probably that Perusall doesn't load all of the pages in time for them to be captured.
Any suggestions on what to do? Thanks.