#!/usr/bin/env python3
#
# THIS SCRIPT REQUIRES PYTHON 3
#
# Install requirements via:
#   pip3 install docopt pillow reportlab
#
# Dedicated to the public domain where possible.
# See: https://creativecommons.org/publicdomain/zero/1.0/
"""
Download a pocketmags magazine in PDF format from the HTML5 reader.

Usage:
    pmdown.py (-h | --help)
    pmdown.py [options] <pdf> <url>

Options:
    -h, --help      Print brief usage summary.
    --dpi=DPI       Set image resolution in dots per inch.
                    [default: 150]

    <pdf>           Save output to this file.
    <url>           A URL to one image from the magazine.

Notes:
    PLEASE USE THIS SCRIPT RESPONSIBLY. THE MAGAZINE PUBLISHING INDUSTRY RELIES
    HEAVILY ON INCOME FROM SALES WITH VERY SLIM PROFIT MARGINS.

    URLs for pocketmag images can be found by using the HTML 5 reader and
    right-clicking on a page and selecting "inspect element". Look for URLs of
    the form:

        http://magazines.magazineclonercdn.com/<uuid1>/<uuid2>/high/<num>.jpg

    where <uuid{1,2}> are strings of letters and numbers with dashes separating
    them and <num> is some 4-digit number.
"""
import itertools
import re
from contextlib import contextmanager
from urllib.error import HTTPError
from urllib.parse import urlparse, urlunparse
from urllib.request import urlopen

import docopt
from PIL import Image
from reportlab.pdfgen import canvas
from reportlab.lib.units import inch

# The pattern of the URL path for a magazine
URL_PATH_PATTERN = re.compile(r'(?P<prefix>^[a-f0-9\-/]*/high/)[0-9]{4}.jpg')

@contextmanager
def saving(thing):
    """Context manager which ensures save() is called on thing."""
    try:
        yield thing
    finally:
        thing.save()

def main():
    opts = docopt.docopt(__doc__)
    pdf_fn, url = (opts[k] for k in ('<pdf>', '<url>'))
    url = urlparse(url)
    dpi = float(opts['--dpi'])

    m = URL_PATH_PATTERN.match(url.path)
    if not m:
        raise RuntimeError('URL path does not match expected pattern')
    prefix = m.group('prefix')

    c = canvas.Canvas(pdf_fn)
    with saving(c):
        for page_num in itertools.count(0):
            page_url = list(url)
            page_url[2] = '{}{:04d}.jpg'.format(prefix, page_num)
            page_url = urlunparse(page_url)
            print('Downloading page {} from {}...'.format(page_num, page_url))

            try:
                with urlopen(page_url) as f:
                    im = Image.open(f)
            except HTTPError as e:
                if e.code == 404:
                    print('No image found => stopping')
                    break
                raise e

            w, h = tuple(dim / dpi for dim in im.size)
            print('Image is {:.2f}in x {:.2f}in at {} DPI'.format(w, h, dpi))
            c.setPageSize((w*inch, h*inch))
            c.drawInlineImage(im, 0, 0, w*inch, h*inch)
            c.showPage()

if __name__ == '__main__':
    main()
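The page-size arithmetic at the end of main() can be checked on its own. The following is a minimal sketch and not part of the original script: `page_size_points` is a hypothetical helper, and `POINTS_PER_INCH` stands in for reportlab's `inch` unit (72 points) so the example runs without reportlab installed. The 1240x1754 pixel size is an illustrative value, not taken from a real magazine.

```python
# Stand-in for reportlab.lib.units.inch (1 inch = 72 PDF points).
POINTS_PER_INCH = 72.0


def page_size_points(size_px, dpi=150.0):
    """Convert a (width, height) in pixels to a PDF page size in points."""
    w_in, h_in = (dim / dpi for dim in size_px)
    return w_in * POINTS_PER_INCH, h_in * POINTS_PER_INCH


if __name__ == '__main__':
    # Illustrative image size only; real page images may differ.
    w_pt, h_pt = page_size_points((1240, 1754), dpi=150.0)
    print('{:.2f}pt x {:.2f}pt'.format(w_pt, h_pt))
```

Under this arithmetic, a larger --dpi value produces a physically smaller page from the same pixels; it does not change the downloaded image data.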
I'm trying to find a URL right now. @Numbr6, could you point me in the right direction? I have been searching for a while but can only find the thumbnails.
Is this still working? I tried, but I wasn't able to find the URL. Can you explain a bit more about exactly where to look for the JPG image?
Thanks
Great little script...!
I had to edit the regex for the url. Not sure if my magazine is different or if they have changed how they are constructed.
URL_PATH_PATTERN = re.compile(r'(?P<prefix>/mcmags/[a-f0-9\-/]*/mid/)[0-9]{4}.jpg')
Note the addition of /mcmags/
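Rather than editing the pattern for each URL layout, one option is a pattern that tolerates an optional leading segment and captures the quality directory. This is only a sketch under assumptions: `FLEXIBLE_PATH_PATTERN`, `split_page_path` and `page_path` are hypothetical names that do not appear in the original script, and the layout it accepts is inferred solely from the URLs quoted in this thread.

```python
import re

# Hypothetical variant of URL_PATH_PATTERN (not from the original script).
# It allows one optional leading segment such as "mcmags" and captures the
# quality directory instead of hard-coding "high" or "mid".
FLEXIBLE_PATH_PATTERN = re.compile(
    r'^(?P<base>/(?:[a-z]+/)?[a-f0-9\-]+/[a-f0-9\-]+/)'
    r'(?P<quality>[a-z]+)/[0-9]{4}\.jpg$'
)


def split_page_path(path):
    """Return (base, quality) for a page image path, or None if no match."""
    m = FLEXIBLE_PATH_PATTERN.match(path)
    if m is None:
        return None
    return m.group('base'), m.group('quality')


def page_path(base, quality, page_num):
    """Build the path of page `page_num` at the given quality."""
    return '{}{}/{:04d}.jpg'.format(base, quality, page_num)


if __name__ == '__main__':
    # Placeholder identifiers, not a real magazine.
    base, quality = split_page_path('/mcmags/0a1b-2c3d/4e5f-6a7b/mid/0007.jpg')
    print(quality, page_path(base, quality, 0))
```

Whether a given quality directory actually exists on the server is a separate question that this pattern cannot answer.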
I've added /mcmags/ as @ear9mrn mentioned above, but it's still not working for me.
I keep getting the following error:
/Applications/Python\ 3.8/pmdown.py testing.pdf https://mcdatastore.blob.core.windows.net/mcmags/3db0b440-0324-44c8-8200-027ab05a34cd/a40ae4de-81a4-46b5-a0c9-8f2205421129/extralow/0003.jpg
Traceback (most recent call last):
  File "/Applications/Python 3.8/pmdown.py", line 101, in <module>
    main()
  File "/Applications/Python 3.8/pmdown.py", line 73, in main
    raise RuntimeError('URL path does not match expected pattern')
RuntimeError: URL path does not match expected pattern
Any ideas? Thanks!
Got it, I needed to manually change "extralow" to "mid" in the image URL. Superb, thanks!
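The manual edit described above can also be done programmatically. This is a minimal sketch, assuming the quality directory is always the second-to-last path segment, as in the URLs quoted in this thread; `with_quality` is a hypothetical helper, not part of the original script.

```python
from urllib.parse import urlparse, urlunparse


def with_quality(url, quality):
    """Return `url` with its second-to-last path segment replaced.

    Assumes paths of the form .../<quality>/<num>.jpg.
    """
    parts = urlparse(url)
    segments = parts.path.split('/')
    if len(segments) < 3:
        raise ValueError('URL path has no quality directory: {}'.format(url))
    segments[-2] = quality
    return urlunparse(parts._replace(path='/'.join(segments)))


if __name__ == '__main__':
    # Placeholder host and identifiers, for illustration only.
    print(with_quality('https://example.invalid/mcmags/aaa/bbb/extralow/0003.jpg', 'mid'))
```

The rewritten URL can then be passed to the script in place of the one copied from the browser.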
I want a script for downloading Magzter magazines.
Could you post your edited script? I can't get it to accept my URL even after following your steps. I get "expected string or bytes-like object"
at line 66: opts = docopt.docopt(__doc__)
I'm not sure if you're struggling with the same problem I had. I didn't need to edit the script, I just had to edit the URL. If that didn't work for you then there could be something else amiss.
Can someone do a Zinio downloader, please?! :)
So I'm running the following command in the terminal and it doesn't seem to do anything:
python3 pmdown.py -h test.pdf https://mcdatastore.blob.core.windows.net/mcmags/a8123f62-3fab-4a47-9702-a2e521a8c829/4f8f60e2-c901-4ce6-af4d-21dc14e0e5d8/mid/0000.jpg
Is there anything I could have missed? After pressing Enter in the terminal, I just get the instructions contained in """ | """. This is my first time with Python.
So the default is currently "extralow" and we can change it to "mid", but does anyone know how to get the higher-quality JPG?
I know there is a higher quality available, but I tried "high" and "extrahigh" and each just gives an error page. Does anyone know the right directory name for the high-quality images?
It looks like the format has changed: the extension is no longer .jpg but .bin, and the files don't appear to be JPEGs.
https://mcdatastore.blob.core.windows.net/mcmags/{ ... }/{ ... }/high/0018.bin
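One quick way to test whether a downloaded file is a JPEG despite its extension is to look at its first bytes, since JPEG data starts with the bytes FF D8 FF. The sketch below is only that check; `looks_like_jpeg` is a hypothetical helper, and nothing here says what the .bin files actually contain, which has not been verified.

```python
# JPEG data begins with the start-of-image marker FF D8, followed by FF.
JPEG_MAGIC = b'\xff\xd8\xff'


def looks_like_jpeg(data):
    """Return True if `data` (bytes) starts with the JPEG signature."""
    return data[:3] == JPEG_MAGIC


if __name__ == '__main__':
    # 'page.bin' is a placeholder for a file you have already downloaded.
    try:
        with open('page.bin', 'rb') as f:
            print(looks_like_jpeg(f.read(3)))
    except FileNotFoundError:
        print('page.bin not found; download a page first')
```

A False result only rules out a plain JPEG; it does not identify the actual format.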
I'm getting this error:
Traceback (most recent call last):
  File "/Users/greg/Desktop/pmdown.py", line 60, in <module>
    main()
  File "/Users/greg/Desktop/pmdown.py", line 25, in main
    opts = docopt.docopt(__doc__)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/docopt.py", line 558, in docopt
    DocoptExit.usage = printable_usage(doc)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/docopt.py", line 466, in printable_usage
    usage_split = re.split(r'([Uu][Ss][Aa][Gg][Ee]:)', doc)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/re.py", line 231, in split
    return _compile(pattern, flags).split(string, maxsplit)
TypeError: expected string or bytes-like object
Any ideas?
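The last frames of that traceback show `re.split` being handed `doc`, which the script passes in as `__doc__`. One possible cause, which can't be confirmed from the traceback alone, is that `__doc__` is None: Python only sets a module docstring when the triple-quoted usage text is the first statement in the file (comments may precede it, but imports or other code may not). The sketch below reproduces just the failing call using the standard library; `explain` is a hypothetical helper and docopt itself is not needed to run it.

```python
import re

# The pattern shown in the traceback above.
USAGE_SPLIT = r'([Uu][Ss][Aa][Gg][Ee]:)'


def explain(doc):
    """Run the re.split call from the traceback and report what happens."""
    try:
        re.split(USAGE_SPLIT, doc)
    except TypeError as exc:
        return 'TypeError: {}'.format(exc)
    return 'ok'


if __name__ == '__main__':
    # What a script with no module docstring would pass (assumed cause).
    print(explain(None))
    # What the original script passes when its docstring is intact.
    print(explain('Usage:\n    pmdown.py [options] <pdf> <url>\n'))
```

If this is the cause, adding `print(repr(__doc__))` as the first line of main() should print None, and restoring the usage text as the first statement of the file should fix it.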
How can I get uuid1 and uuid2 for the magazine, please?
I've tried everything I can think of and I can't get a better quality than "mid." It's a shame, because when you download the allowed 2 pages via Pocketmags, the quality is far superior.
Perhaps the 2-page print is the solution 🤔. My coding days were when BASIC was a new thing and have progressed little since then, but isn't it possible to write code that repeatedly prints two pages at a time until all are done? Then we could combine those into one PDF pretty easily, I'd have thought.
let numberOfPages = 71;
for (let index = 0; index < numberOfPages; index += 2) {
    document.getElementById('print_menu').click();
    setTimeout(() => {
        let pages = document.querySelectorAll('[pagenum="' + (index + 1) + '"]');
        pages[0].click();
        if (index + 2 <= numberOfPages)
        {
            pages = document.querySelectorAll('[pagenum="' + (index + 2) + '"]');
            pages[0].click();
        }
        document.getElementById('printPages').click();
    }, 500);
}
I've modified this script to enable downloading of magazines in "high" quality and have created an option to add a magazine title to the generated PDF's metadata. I've published my new version in a separate GitHub repo as Gists don't seem to support pull requests. You can find it here: https://github.com/RichardJRL/pocketmagstopdf
The original author, rjw57, is welcome to include my changes in his Gist here if he wishes
I've now further modified the script to download the whole magazine at the same quality that the restricted 2-page print option on the website offers.
As before, I've published my modified version on my GitHub page: https://github.com/RichardJRL/pocketmagstopdf
Python neophyte here. I was able to find the various IDs and to get the latest script running, but after finding the last good page of the mag, the script terminates with ERROR - Unable to download magazine: HTTP error code 405. Any guidance would be appreciated.
Sorry, new to Github, too. This is in reference to pocketmagstopdf. If I need to post elsewhere, please let me know.
Never mind. I'll post to the Issues of that repository.
<"https://magazineclonerepub.blob.core.windows.net/mcepub/1004/185936/image/d42477b6-7790-4ef4-891d-d7bc64ab6213.jpg"> is the only URL ending with .jpg I could locate. Thank you.