@thcipriani
Created March 28, 2016 18:20
Create m3u files from Archive.org search results
#!/usr/bin/env python2
# coding: utf-8
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#
# Making a playlist from archive.org's search
# ---
# Archive.org has a massive selection of music. Many individual ogg files
# may exist under a single identifier:
# https://archive.org/details/The_Open_Goldberg_Variations-11823
#
# This script automates creating a list of individual files.
#
# As input, this script takes a search list generated via wget following
# the steps here: https://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/
#
# Usage is like:
#
# ./archive-org-m3u.py < search.csv > 'The Open Goldberg Variations.m3u'
#
# -- OR --
#
# echo 'https://archive.org/download/The_Open_Goldberg_Variations-11823' | \
# ./archive-org-m3u.py > 'The Open Goldberg Variations.m3u'
import sys

import requests
from pyquery import PyQuery as pq

# One URL per line on stdin
URLS = sys.stdin.read()
playlist = []

for url in URLS.splitlines():
    r = requests.get(url)
    r.raise_for_status()
    page = pq(r.text)
    # Candidate file links: every <a> inside a <pre> element
    pres = page('pre a')
    for i in range(0, len(pres)):
        link = pres.eq(i).text()
        if not link.endswith('.ogg'):
            continue
        playlist.append(url + '/' + link)

sys.stdout.write('\n'.join(playlist))
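The filtering step in the script above can be tried offline. This is a minimal sketch, not the script itself: it uses the standard library's `html.parser` in place of pyquery, reads each link's `href` rather than its text, and runs against a made-up listing. `PreLinkParser`, `playlist_entries`, `SAMPLE` and the `example.org` URL are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PreLinkParser(HTMLParser):
    """Collect href values of <a> tags that appear inside <pre> blocks."""

    def __init__(self):
        super().__init__()
        self.pre_depth = 0
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'pre':
            self.pre_depth += 1
        elif tag == 'a' and self.pre_depth > 0:
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == 'pre' and self.pre_depth > 0:
            self.pre_depth -= 1


def playlist_entries(html, base_url, ext='.ogg'):
    """Return absolute URLs for <pre> links in `html` whose href ends with `ext`."""
    parser = PreLinkParser()
    parser.feed(html)
    # Force a trailing slash so urljoin appends instead of replacing the
    # last path segment of the base URL.
    base = base_url.rstrip('/') + '/'
    return [urljoin(base, h) for h in parser.hrefs if h.lower().endswith(ext)]


# Hypothetical listing (assumption): shaped like a plain directory index.
SAMPLE = '''<html><body><pre>
<a href="01-Aria.ogg">01-Aria.ogg</a>
<a href="01-Aria.mp3">01-Aria.mp3</a>
<a href="02-Variatio_1.ogg">02-Variatio_1.ogg</a>
</pre><a href="outside.ogg">outside.ogg</a></body></html>'''

if __name__ == '__main__':
    print('\n'.join(playlist_entries(SAMPLE, 'https://example.org/download/item')))
```

Normalising the base URL means input lines with and without a trailing slash produce the same playlist entries.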
@ernstki
ernstki commented Jan 5, 2022

This was super-helpful, thanks! I didn't even know pyquery was a thing that exists, and it was the perfect tool for the job!

Here's the "I'm Feeling Lucky" version. ;)

#!/usr/bin/env python3
"""
m3uify - create a .m3u playlist from archive.org search results

  usage:
    $ chmod a+x m3uify
    $ ./m3uify COUNT EXT TERM [TERM...] > playlist.m3u
    $ mpv playlist.m3u
  
  where:
    COUNT  is how many search results to return
    EXT    is the media file extension, e.g., "mp3"
    TERM   is a search term or terms

  example:
    $ ./m3uify 5 mp3 biodata sonification
"""
import sys, csv, pyquery, requests

# Show usage when arguments are missing, or when -h / --help is given
if len(sys.argv) < 4 or sys.argv[1].startswith(('-h', '--h')):
    print(__doc__)
    sys.exit()

u = 'https://archive.org'
c = int(sys.argv[1])
e = sys.argv[2]
t = " ".join(sys.argv[3:])
r = requests.get(u + '/advancedsearch.php',
    params={
        'q': 'subject:' + t,
        'fl': 'identifier',
        'sort': 'publicdate asc',
        'output': 'csv'
    }
)
rows = csv.reader(r.text.splitlines())
next(rows)  # skip the header

for row in rows:
    f = pyquery.PyQuery(u + '/details/' + row[0])('a[href$=".%s"]' % e).attr.href
    if f is None:
        continue  # no link with this extension on the details page
    print(u + f)
    c -= 1
    if c == 0:
        break
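The search-and-parse half of the script above can be separated from the network calls, which makes it easy to try offline. This is a rough sketch under a few assumptions: `search_url`, `detail_urls` and `SAMPLE_CSV` are hypothetical names, the identifiers in the sample are made up, and the query parameters are copied from the script rather than checked against the archive.org API.

```python
import csv
from itertools import islice
from urllib.parse import urlencode

BASE = 'https://archive.org'


def search_url(terms, base=BASE):
    """Build the advancedsearch.php URL with the same parameters as the script."""
    params = {
        'q': 'subject:' + ' '.join(terms),
        'fl': 'identifier',
        'sort': 'publicdate asc',
        'output': 'csv',
    }
    return base + '/advancedsearch.php?' + urlencode(params)


def detail_urls(csv_text, count, base=BASE):
    """Return up to `count` details-page URLs from a CSV search response."""
    rows = csv.reader(csv_text.splitlines())
    next(rows, None)  # skip the header row; tolerate an empty response
    idents = (row[0] for row in rows if row)
    return [base + '/details/' + ident for ident in islice(idents, count)]


# Hypothetical response body (assumption): one quoted identifier per row.
SAMPLE_CSV = '"identifier"\n"item-one"\n"item-two"\n"item-three"\n'

if __name__ == '__main__':
    print(search_url(['biodata', 'sonification']))
    for url in detail_urls(SAMPLE_CSV, 2):
        print(url)
```

Using `islice` replaces the manual countdown, with one behavioural difference: a count of 0 returns nothing here, whereas the loop above never reaches its `break` in that case.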
