@kowalcj0
Created October 26, 2018 23:28
Extract links to music websites from Mastodon's outbox.json and download them with youtube-dl in parallel with covers and metadata
#!/usr/bin/env python3
"""Extract links to music websites from Mastodon's outbox.json.

outbox.json contains all of your toots.
"""
import json

from bs4 import BeautifulSoup as Soup

# Only links that start with one of these prefixes are kept.
PREFIXES = ("https://youtu.be", "https://youtube.com",
            "https://www.youtube.com", "https://soundcloud.com",
            "https://m.soundcloud.com", "https://vimeo.com")


def extract_music_urls():
    with open("outbox.json", encoding="utf-8") as f:
        outbox = json.load(f)
    links = []
    for item in outbox["orderedItems"]:
        obj = item["object"]
        # Skip entries whose "object" is not a dict with HTML content
        # (for example, a bare URL string).
        if isinstance(obj, dict) and "content" in obj:
            html = Soup(obj["content"], "html.parser")
            # href=True skips anchors that have no href attribute.
            hrefs = [a["href"] for a in html.find_all("a", href=True)
                     if a["href"].startswith(PREFIXES)]
            if hrefs:
                # Keep only the first matching link of each toot.
                links.append(hrefs[0])
    return sorted(links)


if __name__ == "__main__":
    print("\n".join(extract_music_urls()))
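
To try the filtering step without an export at hand, here is a minimal sketch that runs the same prefix check over a small in-memory sample. The sample dict, its URLs, and the names `LinkCollector` and `first_music_link` are made up for illustration. It uses the standard library's `html.parser` instead of BeautifulSoup, so it approximates the script above rather than reproducing it.

```python
from html.parser import HTMLParser

PREFIXES = ("https://youtu.be", "https://youtube.com",
            "https://www.youtube.com", "https://soundcloud.com",
            "https://m.soundcloud.com", "https://vimeo.com")


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def first_music_link(content):
    """Return the first link in an HTML fragment matching PREFIXES, or None."""
    parser = LinkCollector()
    parser.feed(content)
    matches = [h for h in parser.hrefs if h.startswith(PREFIXES)]
    return matches[0] if matches else None


# Hypothetical sample that mimics only the keys the script reads.
sample_outbox = {
    "orderedItems": [
        {"object": {"content": '<p>Nice track <a href="https://youtu.be/abc123">link</a></p>'}},
        {"object": {"content": '<p>No music <a href="https://example.com/">here</a></p>'}},
        {"object": "https://example.social/users/someone/statuses/1"},
    ]
}

links = []
for item in sample_outbox["orderedItems"]:
    obj = item["object"]
    if isinstance(obj, dict) and "content" in obj:
        link = first_music_link(obj["content"])
        if link:
            links.append(link)

print(sorted(links))
```

Only the first item contributes a link: the second has no matching prefix, and the third has no HTML content to parse.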

kowalcj0 commented Oct 26, 2018

To download audio tracks from all of the extracted music URLs, run:
./extract_music_urls.py | parallel 'LC_ALL=en_US.UTF-8 youtube-dl -f bestaudio --extract-audio --audio-format mp3 --audio-quality 0 --embed-thumbnail --add-metadata {}'

To download the files without transcoding them to MP3, run:
./extract_music_urls.py | parallel 'LC_ALL=en_US.UTF-8 youtube-dl -f bestaudio --extract-audio --embed-thumbnail --add-metadata {}'
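
If GNU parallel is not available, the same two invocations could be driven from Python. The following is a rough sketch, not the author's method: `build_command`, `download_all`, the `to_mp3` switch, and the default of 4 concurrent jobs are all assumptions made for illustration. Only the youtube-dl flags come from the commands above.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(url, to_mp3=True):
    """Build the youtube-dl argument list for one URL.

    to_mp3=True mirrors the first command above, to_mp3=False the second.
    """
    cmd = ["youtube-dl", "-f", "bestaudio", "--extract-audio"]
    if to_mp3:
        cmd += ["--audio-format", "mp3", "--audio-quality", "0"]
    cmd += ["--embed-thumbnail", "--add-metadata", url]
    return cmd


def download_all(urls, to_mp3=True, jobs=4):
    """Run one youtube-dl process per URL, at most `jobs` at a time.

    jobs=4 is an arbitrary default chosen for this sketch.
    Returns the exit code of each process, in the order of `urls`.
    """
    # Same locale override as in the shell commands.
    env = dict(os.environ, LC_ALL="en_US.UTF-8")

    def run(url):
        return subprocess.run(build_command(url, to_mp3), env=env).returncode

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run, urls))


# Show the command that would run for a placeholder URL (nothing is downloaded).
print(" ".join(build_command("https://youtu.be/abc123")))
```

Calling `download_all(urls)` with the list returned by `extract_music_urls()` would start the downloads. Threads are enough here because each worker only waits on an external process.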

@kowalcj0

To strip special characters from the downloaded filenames, you can transliterate them to ASCII with iconv:
find . -type f -exec bash -c 'mv "$1" "${1%/*}/$(iconv -f UTF-8 -t ASCII//TRANSLIT <<< "${1##*/}")"' -- {} \;
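
A comparable cleanup can be sketched in Python for systems without iconv. This is an approximation under stated assumptions, not an equivalent. It decomposes characters with `unicodedata` and drops whatever has no ASCII form, whereas iconv's `//TRANSLIT` substitutes its own replacements. The names `to_ascii` and `rename_tree` are hypothetical.

```python
import unicodedata
from pathlib import Path


def to_ascii(name):
    """Strip accents and drop characters that have no ASCII decomposition."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def rename_tree(root="."):
    """Rename every file under root whose name changes after to_ascii()."""
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        new_name = to_ascii(path.name)
        # Skip unchanged names and names that would become empty.
        if not new_name or new_name == path.name:
            continue
        target = path.with_name(new_name)
        # Never overwrite an existing file.
        if not target.exists():
            path.rename(target)


print(to_ascii("Café Müller.mp3"))
```

Unlike the `find` one-liner, this sketch leaves a file alone when the cleaned name would collide with an existing one.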
