Last active
April 8, 2025 13:24
Scrapes the YouTube video IDs for the YouTube-8M data set. Probably buggy. Could be threaded.
import requests
from collections import defaultdict

csv_prefix = "https://research.google.com/youtube8m/csv"

# verticals.json maps each category to a list of block file URLs.
r = requests.get("{0}/verticals.json".format(csv_prefix))
verticals = r.json()

# Build the list of .js block file URLs per category and count them.
block_urls = defaultdict(list)
count = 0
for cat, urls in verticals.items():
    for url in urls:
        jsurl = "{0}/j/{1}.js".format(csv_prefix, url.split("/")[-1])
        block_urls[cat[1:]].append(jsurl)  # cat[1:] drops the first character
        count += 1  # lazy.

ids_by_cat = defaultdict(list)
downloaded = 0
for cat_name, block_file_urls in block_urls.items():
    for block_file_url in block_file_urls:
        print("[{0:.1f}%] Downloading block file: {1} {2}".format(100.0 * downloaded / count, block_file_url, cat_name))
        try:
            r = requests.get(block_file_url)
            # The ID list is the second double-quoted string in the response.
            # r.text (not r.content) so the split also works on Python 3.
            idlist = r.text.split("\"")[3]
            ids = [n for n in idlist.split(";") if len(n) > 3]
            ids_by_cat[cat_name] += ids
        except (IndexError, IOError):  # tuple, so both exception types are caught
            print("Failed to download or process block at {0}".format(block_file_url))
        downloaded += 1  # increment even if we've failed.

    # Write one ID per line to <category>.txt.
    with open("{0}.txt".format(cat_name), "w") as idfile:
        print("Writing ids to {0}.txt".format(cat_name))
        for vid in ids_by_cat[cat_name]:
            idfile.write("{0}\n".format(vid))

print("Done.")
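The description notes that the script could be threaded. Below is a minimal sketch of one way to do that with the standard library's `ThreadPoolExecutor`, not a tested replacement for the script. The names `parse_block` and `fetch_ids` and the `fetch` parameter are hypothetical; `parse_block` simply copies the script's split-on-quote logic, so it assumes block files still have the shape the script expects.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_block(text):
    """Extract video IDs using the same split-on-quote logic as the script."""
    idlist = text.split('"')[3]
    return [n for n in idlist.split(";") if len(n) > 3]


def fetch_ids(urls, fetch, max_workers=8):
    """Download and parse block files concurrently.

    `fetch` is a hypothetical callable that maps a URL to the response text,
    for example: lambda url: requests.get(url).text
    """
    def worker(url):
        try:
            return parse_block(fetch(url))
        except (IndexError, IOError):
            return []  # like the script, skip blocks that fail

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `urls`.
        results = pool.map(worker, urls)
        return [vid for ids in results for vid in ids]
```

Passing the download function in as `fetch` keeps the threading logic separate from `requests`, which also makes it possible to try the function against canned responses before pointing it at the real server.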
Hi @naveenv2,
Ahh, these are base64-encoded strings and need decoding first. For example, using the base64 utility (https://linux.die.net/man/1/base64):
(base) ~ ❯❯❯ echo "bklKdg==" | base64 -d
nIJv%
This yields nIJv (the % represents the lack of a newline at the end of the string), which is a valid video: https://data.yt8m.org/2/j/i/nI/nIJv.js
In Python you can use the base64 module's b64decode function (https://docs.python.org/3/library/base64.html), though there may be some method on TFRecord that can do this for you.
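As a rough sketch of the Python route, the snippet below decodes the same example string and builds the lookup URL from the result. The helper names `decode_video_id` and `lookup_url` are made up for illustration, and the URL pattern is inferred from the single example URL above, so treat it as an assumption.

```python
import base64


def decode_video_id(encoded_id):
    """Decode a base64-encoded ID (str or bytes) into the short ID string."""
    return base64.b64decode(encoded_id).decode("ascii")


def lookup_url(short_id):
    """Build the lookup URL; pattern assumed from the example URL above."""
    return "https://data.yt8m.org/2/j/i/{0}/{1}.js".format(short_id[:2], short_id)


short_id = decode_video_id("bklKdg==")
print(short_id)              # nIJv
print(lookup_url(short_id))  # https://data.yt8m.org/2/j/i/nI/nIJv.js
```

`b64decode` accepts both `str` and `bytes`, so the same call should work if the value read from the tfrecord comes back as bytes.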
Ah yes. Thanks for pointing it out.
This is precisely what I was looking for.
Thanks a lot! :)
Glad I could help!
Hi @JosephRedfern,
I don't think it's a case of missing videos. I checked a couple of tfrecords. The IDs are 8 characters long (compared to the 4-character ID mentioned), something like this:
(jlist is a list of JSON outputs extracted for a tfrecord file using this)
I suspect that the URL format mentioned on the website (/AB/ABCD.js) isn't compatible with these IDs. I also tried various combinations (like dropping the recurring '==' and 'dg==' text from the ID), but none of them got a hit.
I hope I'm looking at the right values though. Please correct me if I missed anything.