Last active June 10, 2024 08:32
Elsevier PDF "hashes"
[
  "FCi27mtaKod38ztmGndn-y8NNz.r.lt6SndqGztz_ztr-ngqQm9aMo9eOnMeJntuNntu",
  "D2ei2mgqJz9b-m.mGmPqRyLNNnwmOlt7.ywiGmt-Kndr9otqRywv8o9ePmtiNmd2Sn92Tma",
  "6U7vcmPuOn9uLnMaGyM7-nLNNntv9lt6RmtaGmweOyMmJnMmSmgmOo9eOnM6LnMaRmM-Tma",
  "lXLf8owyQztiMzwqGnMz7zcNNotb7lwf.m9qGzt6Km.qMngqLndqLo9eOotaNm96Mmt6Tma",
  "FCi27y9qOnd-Ny96GmPmOmcNNzwf-lwj-m9mGztz7ytaMnM78n9v-o9ePmM6Rm9-Qn9eTma",
  "XlEDumMz7nM7-m9iGogmRmLNNyt_8lwiKz9eGm9-Pm.v7ztiLztz_o9eOnMeQnd-Sodm",
  "lXLf8yt-JywmNmPeGm9n9n8NNzgn.lt_8zwqGogz7zgn7zt6SyPr-o9eOnM6Pot2Mn9qTma",
  "FCi27zgf8mdqMmMeGnMmMy8NNz9eQlweNy.eGmMiMm96Qmgr9nMb-o9ePmtuRmt6JotmTma",
  "FCi27nwmKnMeSodeGm.z.y8NNntz.lt-PywmGy9__ngqQmtiPmtb7o9ePmteJotyJoduTma",
  "HIoniz.qOnd-Nmt-GmteNn8NNot7.lt-QndaGnPv.mdaMmt6RnMqMo9ePmdmOmdiKod-Tma",
  "ZtV1wntuPyPn9z.qGyPv7msNNytz7lwiKyM6GntmJnt_-nteRm.mRo9eOnM6Pot2MnMyTma",
  "d2UUdywiJmtz7zt-Gm9eQmcNNzt2Qlwf7m9uGzd_7zdf7owr9yMqOo9ePmtaKnM2NmduTma",
  "tprDsnMeJn9iOnweGnPuQnsNNz.eMlt-Qm.mGotz.ytiNz.yRmd-Mo9eOnM6Pot2OmM6Tma",
  "tprDsyPiNn9iQn9-GmMiSy8NNn96Llwf9owiGowqQyMiRzwv_ngqPo9eOnM6Pot2OndyTma",
  "ZIFNOztmRotn9owiGzduNmsNNnd-Rlt_8otiGot-Oy92QnMeSyMqKo9eOnM6Pot2OntaTma",
  "D2ei2nMb_zwmSowyGzwv8mLNNotj8lt-My9yGmtaModaNm92RytySo9ePmtaKn92Qmt2Tma",
  "d2UUdot__owr-y9mGodqLocNNn.eOlwmPmtaGmgj7ndn_nMiMndiNo9ePmdiLnMmPotmTmq",
  "6U7vcmtuSndmSntqGmdiMy8NNnPz7lt_7ndeGmtv7n9eLndj_zduJo9ePmtiOntmNntmTma",
  "ZtV1wn9mMnd2MzwiGz9eRysNNmgySlt7_ot-Gy97.mgiKotqKnt_.o9eOnM6Pot2Mn96Tma",
  "XlEDuyweNmtz9ntqGm9aMocNNodr9lt__z9iGmdj_n9yNnt6Sm9-Lo9ePmd6KotmRnM2Tma",
  "HIonintn-z9uPogmGnMeSzsNNogf-lwj.z.qGmgqSn9yPndf7mdmLo9eOotuLm9aNodqTma",
  "ZlkjsyMj7mPr.ndiGowuMmcNNy.mNlwj9m.yGmtb7z.qRz.iKyt38o9eOnM6Pot2MnMeTma",
  "Dpairmdj9mPr8nwmGn.r7z8NNnMb7lwj8otiGyt-MzwuKzd__nt39o9ePmtaPotaJm9-Tma",
  "6mIUqngiNzduNn9iGmgeJnsNNot2Rlt-SzguGzt2Oodf_n.eNodz.o9eOn9mQnMqOm9e",
  "FCi27mwr_mPn-m.mGmPuKncNNmduOlweOytuGogj.yMv-z92Pyt6Mo9eOnM6Pot2Mn9yTma",
  "6U7vcngj-zt2Ln.uGodr8mcNNmdeSlweKmd2Gzdz9nM3_mgf7yt2Ro9ePmt6Sn9qLntyTma",
  "zjJBNmPn.mdiRntiGzgmPnLNNmM2Klt6JmMqGy9aNz9aMmdv_mwuNo9ePm96Qm9iRndiTma",
  "FCi27mPmRnPiKngeGngqJzcNNogj8lwj-zwiGnPiLmtb7y9qKzgeMo9eOnMeLn9aNm9m"
]
import exiftool
from pathlib import Path
import json
import re

# Root directory that is searched recursively for PDFs
paper_root = Path.home() / 'location/of/papers'

hashes = []
processed = 0  # number of PDFs in which at least one "hash" was found

# The "hash" shows up in the XMP metadata as an empty tag of 40+ characters
rehash = re.compile(r'<([0-9A-Za-z_.-]{40,})/>')

try:
    with exiftool.ExifTool() as et:
        for path in paper_root.glob('**/*.pdf'):
            # Dump the raw XMP packet; this script assumes execute() returns bytes
            md = et.execute(b'-b', b'-xmp', str(path).encode('utf-8'))
            try:
                md = md.decode('utf-8')
            except UnicodeDecodeError:
                print(f"Couldn't decode {path}")
                continue
            ahash = rehash.findall(md)
            hashes.extend(ahash)
            if len(ahash) > 0:
                processed += 1
finally:
    # Save whatever was collected, even if the scan was interrupted
    with open('elsev_hashes.json', 'w') as hashfile:
        json.dump(hashes, hashfile, indent=2)
    print(f'processed {processed} files')
Some more examples here with associated DOIs: https://gist.github.com/Aariq/a23958e168e347f1bacf9dfa777b911f
I managed to get hashes that are very close for the same paper (https://doi.org/10.1016/j.ijhydene.2021.11.149):
lXLf8 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMoti Tma
FCi27 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMot2 Tma
LMfns mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmdq Tma
w8arl mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmd- Tma
I put some spaces in the hashes because I think there are patterns at those positions.
Hashes obtained later still look very different, though.
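To make the spacing above easier to reproduce, here is a minimal Python sketch that cuts a hash at the same positions and reports which segments differ between two hashes. The segment widths are only read off the four spaced examples above; they are an assumption, not a known format, and the names `WIDTHS`, `split_segments` and `differing_segments` are made up for illustration.

```python
from itertools import accumulate

# Segment widths inferred from the spaced examples (an assumption, not a spec).
WIDTHS = [5, 11, 1, 6, 2, 4, 1, 6, 1, 16, 3, 12, 3]

def split_segments(h, widths=WIDTHS):
    """Cut a hash into fixed-width segments; empty trailing segments are dropped."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return [h[s:e] for s, e in zip(starts, ends) if h[s:e]]

def differing_segments(a, b):
    """Indices of the segments that differ between two hashes."""
    pairs = zip(split_segments(a), split_segments(b))
    return [i for i, (x, y) in enumerate(pairs) if x != y]

# The first two hashes from the comment above, spaces included for readability.
A_SPACED = "lXLf8 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMoti Tma"
B_SPACED = "FCi27 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMot2 Tma"
A = A_SPACED.replace(" ", "")
B = B_SPACED.replace(" ", "")

print(split_segments(A))
print(differing_segments(A, B))  # -> [0, 11]
```

For these two downloads only the 5-character prefix and the 12-character block before `Tma` change. The 68-character hashes in the JSON list simply come out one segment shorter.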
Here is some information about the files, in the same order as the hashes:
File: 1-s2.0-S0360319921045377-main.pdf
Size: 5225391 Blocks: 7833 IO Block: 131072 regular file
Device: 0,37 Inode: 1067528 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ remy) Gid: ( 100/ users)
Access: 2022-01-29 14:19:53.072211357 +0100
Modify: 2022-01-29 14:19:53.185217711 +0100
Change: 2022-01-29 14:19:53.325225583 +0100
Birth: 2022-01-29 14:19:53.072211357 +0100
File: 1-s2.0-S0360319921045377-main(1).pdf
Size: 5225391 Blocks: 7833 IO Block: 131072 regular file
Device: 0,37 Inode: 1067359 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ remy) Gid: ( 100/ users)
Access: 2022-01-29 14:19:57.310442520 +0100
Modify: 2022-01-29 14:19:57.493452096 +0100
Change: 2022-01-29 14:19:57.539454503 +0100
Birth: 2022-01-29 14:19:57.310442520 +0100
File: 1-s2.0-S0360319921045377-main(2).pdf
Size: 5225391 Blocks: 7833 IO Block: 131072 regular file
Device: 0,37 Inode: 1067360 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ remy) Gid: ( 100/ users)
Access: 2022-01-29 14:20:04.484795768 +0100
Modify: 2022-01-29 14:20:04.608801481 +0100
Change: 2022-01-29 14:20:04.663804016 +0100
Birth: 2022-01-29 14:20:04.484795768 +0100
File: 1-s2.0-S0360319921045377-main(3).pdf
Size: 5225391 Blocks: 7833 IO Block: 131072 regular file
Device: 0,37 Inode: 1067005 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ remy) Gid: ( 100/ users)
Access: 2022-01-29 14:20:09.293007869 +0100
Modify: 2022-01-29 14:20:09.448014381 +0100
Change: 2022-01-29 14:20:09.492016229 +0100
Birth: 2022-01-29 14:20:09.293007869 +0100
WOW that looks like they might just be timestamps, that is LAZY on their part. I'll try and systematically sample across time and see if i can get repeating patterns/match subsections with times. I think you're right, those do seem to be independent and repeatable sections.
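One cheap way to look for repeating patterns across a larger sample is to check which character positions never change. The sketch below does that for the four hashes of the same paper quoted in this thread; `constant_positions` is a hypothetical helper name, and nothing here decodes the hashes or confirms the timestamp idea.

```python
def constant_positions(hashes):
    """Return the indices at which every hash has the same character."""
    shortest = min(len(h) for h in hashes)
    return [i for i in range(shortest) if len({h[i] for h in hashes}) == 1]

# The four hashes for the same paper, with the readability spaces removed.
samples = [s.replace(" ", "") for s in [
    "lXLf8 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMoti Tma",
    "FCi27 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMot2 Tma",
    "LMfns mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmdq Tma",
    "w8arl mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmd- Tma",
]]

fixed = constant_positions(samples)
print(fixed)
```

The same function could be run on the list saved in `elsev_hashes.json`. With only four samples from one paper, some positions will match by coincidence, so a larger and more varied set is needed before reading much into the result.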
The few I downloaded from open access were visible to grep, usually toward the end of the file in an XML stream.
A variation on https://twitter.com/Jofkos/status/1486244612960366593.
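The grep approach can be sketched in Python without exiftool by scanning the raw file bytes with the same pattern the script above uses. This assumes the XML stream is stored uncompressed, as it was in the files mentioned here; `find_hashes_in_bytes` and `find_hashes_in_file` are hypothetical helper names, and the sample bytes are a synthetic stand-in, not a real PDF.

```python
import re
from pathlib import Path

# Same pattern as the exiftool script, applied to bytes instead of text.
HASH_RE = re.compile(rb'<([0-9A-Za-z_.-]{40,})/>')

def find_hashes_in_bytes(data):
    """Return every hash-like empty tag found in raw bytes."""
    return [m.decode('ascii') for m in HASH_RE.findall(data)]

def find_hashes_in_file(path):
    """Read a file from disk and scan its raw bytes."""
    return find_hashes_in_bytes(Path(path).read_bytes())

# Synthetic stand-in for a PDF with an uncompressed XML stream near the end.
known = "FCi27mtaKod38ztmGndn-y8NNz.r.lt6SndqGztz_ztr-ngqQm9aMo9eOnMeJntuNntu"
sample = b"%PDF-1.7\n<x:xmpmeta>\n<" + known.encode('ascii') + b"/>\n</x:xmpmeta>\n%%EOF"

print(find_hashes_in_bytes(sample))
```

If the stream were compressed, this would find nothing, and extracting the XMP packet with exiftool as in the script above would be the safer route.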