@Lewiscowles1986
Forked from kafran/extract_har.py
Last active August 21, 2024 03:26
Python 3 script to extract images from HTTP Archive (HAR) files
import json
import base64
import os
import pathlib
from urllib.parse import urlparse

# list of supported image mime-types
# Special thanks to https://gist.github.com/FurloSK/0477e01024f701db42341fc3223a5d8c
# Special mention, and thanks to MDN
# https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types
# Each mime-type is listed once: a repeated dict key silently overwrites the earlier entry.
mimetypes = {
    "image/webp": ".webp",
    "image/jpeg": ".jpeg",  # *.jpg files have two possible extensions (.jpg, .jpeg); .jpeg is official and thus preferred
    "image/png": ".png",
    "image/svg+xml": ".svg",
    "image/avif": ".avif",
    "image/bmp": ".bmp",
    "image/gif": ".gif",
    "image/vnd.microsoft.icon": ".ico",
    "image/tiff": ".tiff",  # *.tiff files have two possible extensions (.tif, .tiff); .tiff is what I know and prefer
}

# images are written beneath ./imgs; sub-directories are created as needed below
folder = os.path.join(os.getcwd(), "imgs")

with open("src.har", "rb") as f:
    har = json.loads(f.read())

entries = har["log"]["entries"]

for entry in entries:
    mimetype = entry["response"]["content"]["mimeType"]
    url = urlparse(entry["request"]["url"])
    path = pathlib.Path(url.path)
    filename = path.stem
    response_text = entry["response"]["content"].get("text")
    # when no encoding is given, treat the body as literal text; anything else is handled as base64
    encoding = entry["response"]["content"].get("encoding", "literal")
    if not response_text:
        continue
    # the in keyword checks whether a key is present in a dictionary
    if mimetype in mimetypes:
        ext = mimetypes[mimetype]
        # mirror the URL's directory structure (minus the leading "/") beneath the output folder
        subfolder = os.path.join(folder, str(path.parent)[1:])
        file = os.path.join(subfolder, f"{filename}{ext}")
        os.makedirs(subfolder, exist_ok=True)
        print(file)
        with open(file, "wb") as out:
            out.write(
                response_text.encode(encoding="UTF-8", errors="strict")
                if encoding == "literal"
                else base64.b64decode(response_text)
            )
@Lewiscowles1986
Author

Lewiscowles1986 commented Nov 4, 2023

Technically, we could get even more fruity using pathlib:

Permissive to unknown image mime types

  1. If the mimetype begins with "image/"
  2. Check to see if it matches one of our mimetypes [preferences of ext]
    a. [True] Take the filename without its extension, then add our own
    b. [False] Check the filename for an extension
      i. [True] use that extension
      ii. [False] sub-process / function call
        1. extension replaces image/ from mime-type
        2. extension splits string remaining by .
        3. use either 3 or 4 characters of the leading and trailing string parts after the split
        4. allows .vnd.microsoft.ico or .vnd.microsoft.icon
  3. Save file
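
A minimal sketch of the extension-picking steps above, offered as one possible reading, not the gist's actual code. The names `PREFERRED` and `pick_extension` are hypothetical, the table is a deliberately small subset of the gist's `mimetypes`, and the fallback rule (take the last dot-separated part of the subtype, capped at 4 characters) is an assumption about what steps 1 to 4 of the sub-process intend.

```python
import pathlib

# Hypothetical subset of the gist's mimetypes table (preferred extensions).
PREFERRED = {
    "image/jpeg": ".jpeg",
    "image/png": ".png",
    "image/svg+xml": ".svg",
}


def pick_extension(mimetype, url_path):
    """Return a file extension for an image response, or None if not an image."""
    # Drop any parameters such as "; charset=utf-8" and normalise case.
    mimetype = mimetype.split(";")[0].strip().lower()
    if not mimetype.startswith("image/"):
        return None
    # Known type: use our preferred extension, ignoring the one in the URL.
    if mimetype in PREFERRED:
        return PREFERRED[mimetype]
    # Unknown type: trust the extension in the URL path if there is one.
    suffix = pathlib.PurePosixPath(url_path).suffix
    if suffix:
        return suffix.lower()
    # Last resort (assumed rule): derive the extension from the mime subtype,
    # using the last dot-separated part, capped at 4 characters.
    subtype = mimetype[len("image/"):].split("+")[0]
    tail = subtype.split(".")[-1][:4]
    return "." + tail if tail else None
```

With this reading, `pick_extension("image/vnd.microsoft.icon", "/favicon")` falls through to the last branch and gives `.icon`, while the same type served from `/favicon.ico` keeps `.ico` from the URL.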

Respect / enforce some level of path hierarchy within imgs

  1. Parse the url as a url
  2. Send the path to a pathlib Path object (if not already one)
  3. Use the basename for the file part, but capture the parent parts
  4. Ensure a directory exists for the url domain (maybe needs sanitising?)
  5. os.path.join becomes a series of Path concatenations
    a. use whatever the extension manipulation algorithm produces (de-couples this from the above idea on permissive mime-types)
    b. {domain}/{path.parent}/{basename}.{ext}
  6. Save file
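
The path steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the author's implementation: `output_path` is a hypothetical helper, the sanitising rule (replace anything outside letters, digits, dots and hyphens with `_`, and drop `..` parts) is a guess at what "maybe needs sanitising" might cover, and `ext` is expected to carry its leading dot, as in the gist's `mimetypes` table.

```python
import pathlib
import re
from urllib.parse import urlparse


def output_path(root, url, ext):
    """Build {root}/{domain}/{path.parent}/{basename}{ext} for a request URL."""
    parsed = urlparse(url)
    # Assumed sanitising rule: keep letters, digits, dots and hyphens only.
    domain = re.sub(r"[^A-Za-z0-9.-]", "_", parsed.hostname or "unknown-host")
    url_path = pathlib.PurePosixPath(parsed.path)
    # Capture the parent parts, dropping the root "/" and any "." or ".." parts
    # so the result cannot climb out of the output directory.
    parents = [p for p in url_path.parent.parts if p not in ("/", ".", "..")]
    # URLs such as https://example.com/ have no file part; "index" is a placeholder.
    stem = url_path.stem or "index"
    return pathlib.Path(root).joinpath(domain, *parents, f"{stem}{ext}")
```

Saving then becomes `target.parent.mkdir(parents=True, exist_ok=True)` followed by `target.write_bytes(data)`, where `target` is the returned path and `data` is the decoded response body. Query strings are ignored here, so two URLs that differ only in their query would map to the same file.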

@MrCheatEugene

@Lewiscowles1986
Author

Lewiscowles1986 commented Nov 4, 2023

@MrCheatEugene you say "tried"; it's not the worst start. If I may ask, did they explain why C?

@Lewiscowles1986
Author

@MrCheatEugene I've uploaded a working C example https://github.com/Lewiscowles1986/har-img-extract
I've only built it on OSX, but I'd welcome contributions and feedback if your friend struggles to get it to build or encounters errors that I have not.

@Lewiscowles1986
Author

Another update. Tracking this via https://github.com/Lewiscowles1986/har-img-extract/tree/python from now on. The latest update will likely be the last via Gist.

It doesn't have linting, instructions, automated checks or anything else I'd really like. Lovely little hack though.

@MrCheatEugene

@MrCheatEugene you say tried; It's not the worst start. If I may. Did they explain why C?

IDK, just for fun

@MrCheatEugene

@MrCheatEugene I've uploaded a working C example https://github.com/Lewiscowles1986/har-img-extract I've only built it on OSX, but I'd welcome contributions and feedback if your friend struggles to get it to build or encounters errors that I have not.

he managed to get it working yesterday, https://github.com/OhMyCatile/ExtractHar

@Lewiscowles1986
Author

Yeah, I saw the Rust edition. I don't know enough Rust to comment on it. Very cool that there are now so many forks of this code.

@dnk8n

dnk8n commented Nov 11, 2023

Would you like to please include an open-source or other type of license so that we know how we are legally allowed to use your code?

@Lewiscowles1986
Author

dnk8n, however the heck you like; use it to burn baby sheep for all I care.
