import requests
from collections import defaultdict

csv_prefix = "https://research.google.com/youtube8m/csv"

# Fetch the mapping of vertical (category) -> list of block page URLs.
r = requests.get("{0}/verticals.json".format(csv_prefix))
verticals = r.json()

# Build the list of block-file URLs for each category.
block_urls = defaultdict(list)
count = 0
for cat, urls in verticals.items():
    for url in urls:
        jsurl = "{0}/j/{1}.js".format(csv_prefix, url.split("/")[-1])
        block_urls[cat[1:]].append(jsurl)  # cat[1:] drops the first character of the key
        count += 1  # lazy.

ids_by_cat = defaultdict(list)
downloaded = 0.0
for cat_name, block_file_urls in block_urls.items():
    for block_file_url in block_file_urls:
        print("[{0}%] Downloading block file: {1} {2}".format((100.0*downloaded/count), block_file_url, cat_name))
        try:
            r = requests.get(block_file_url)
            # The ID list is the second double-quoted string in the response body.
            # r.text (str) is used instead of r.content (bytes) so this runs on Python 3.
            idlist = r.text.split("\"")[3]
            ids = [n for n in idlist.split(";") if len(n) > 3]
            ids_by_cat[cat_name] += ids
        except (IndexError, IOError):  # a tuple is needed to catch both exception types
            print("Failed to download or process block at {0}".format(block_file_url))
        downloaded += 1  # increment even if we've failed.
    # Write one file of IDs per category once all of its blocks have been processed.
    with open("{0}.txt".format(cat_name), "w") as idfile:
        print("Writing ids to {0}.txt".format(cat_name))
        for vid in ids_by_cat[cat_name]:
            idfile.write("{0}\n".format(vid))
print("Done.")
Hi @JosephRedfern,
I don't think it's a case of missing videos. I checked a couple of tfrecords. The IDs there are 8 characters long (rather than the 4-character IDs mentioned), something like this:
>>> jlist[483]['features']['feature']['id']['bytesList']['value']
['bklKdg==']
>>> jlist[123]['features']['feature']['id']['bytesList']['value']
['bGVKdg==']
>>> jlist[892]['features']['feature']['id']['bytesList']['value']
['eFVKdg==']
>>> jlist[928]['features']['feature']['id']['bytesList']['value']
['TW1Kdg==']
(jlist is a list of json outputs extracted for a tfrecord file using this)
I suspect that the URL format mentioned on the website (/AB/ABCD.js) isn't compatible with these IDs. I also tried various combinations (like dropping the recurring '==' and 'dg==' text from the ID), but none of them got a hit.
I hope I'm looking at the right values though. Please correct me if I've missed anything.
Hi @naveenv2,
Ahh, these are base64-encoded strings, and need decoding first. For example, using the base64 utility (https://linux.die.net/man/1/base64):
(base) ~ ❯❯❯ echo "bklKdg==" | base64 -d
nIJv%
This yields nIJv (the % represents the lack of newline at the end of the string), which is a valid video: https://data.yt8m.org/2/j/i/nI/nIJv.js
In Python you can use the b64decode function from the base64 module (https://docs.python.org/3/library/base64.html), though there may be some method on TFRecord that can do this for you.
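For reference, here is a minimal sketch of that decoding step in Python, applied to the four IDs quoted earlier in the thread. The helper names `decode_record_id` and `lookup_url` are made up for illustration, and the URL template is only an assumption generalised from the single `nIJv` link above (first two characters as the directory), so treat it as a guess rather than a documented scheme.

```python
import base64

# Assumed pattern, generalised from https://data.yt8m.org/2/j/i/nI/nIJv.js
LOOKUP_URL = "https://data.yt8m.org/2/j/i/{prefix}/{vid}.js"


def decode_record_id(encoded):
    """Decode a base64 ID from the TFRecord JSON dump into its short text form."""
    return base64.b64decode(encoded).decode("ascii")


def lookup_url(vid):
    """Build the (assumed) lookup URL for a decoded ID."""
    return LOOKUP_URL.format(prefix=vid[:2], vid=vid)


if __name__ == "__main__":
    for encoded in ["bklKdg==", "bGVKdg==", "eFVKdg==", "TW1Kdg=="]:
        vid = decode_record_id(encoded)
        print(encoded, "->", vid, lookup_url(vid))
```

The first of these, `bklKdg==`, decodes to `nIJv`, matching the output of the `base64 -d` command above.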
Ah yes. Thanks for pointing it out.
This is precisely what I was looking for.
Thanks a lot! :)
Glad I could help!
@naveenv2 It would be fairly easy to script up a tool that requested the verticals list, pulled out the links to the different pages for each category, then made a request to the URL that provides the translation to non-anonymised Video ID (https://research.google.com/youtube8m/video_id_conversion.html).
However, unlike the previous method, doing it this way would require a request for every url, which would take a while and feels a bit abusive.
As for the issue of video ids in the tfrecords file -- is this the case for all videos, or just some of them? As noted in the video id conversion page, "When a video gets deleted, or made private by its uploader, the lookup URL becomes invalid", so I'd expect at least some lookups to return an error.
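A rough sketch of what such a per-ID lookup loop might look like, with a pause between requests and failed lookups collected rather than treated as fatal. Everything named here (`fetch_lookup`, `lookup_all`, the `delay` parameter) is hypothetical, the URL template is an assumption based on the `nIJv` link earlier in the thread, and the response body is returned unparsed because its exact format isn't shown here. It uses `urllib` from the standard library rather than `requests` purely to avoid a dependency.

```python
import time
import urllib.request

# Assumed pattern, generalised from https://data.yt8m.org/2/j/i/nI/nIJv.js
LOOKUP_URL = "https://data.yt8m.org/2/j/i/{prefix}/{vid}.js"


def fetch_lookup(vid):
    """Return the raw lookup response for a short ID, or None if the request fails."""
    url = LOOKUP_URL.format(prefix=vid[:2], vid=vid)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        # urllib's HTTPError and URLError are both OSError subclasses, so this
        # covers error responses (e.g. deleted/private videos) and network failures.
        return None


def lookup_all(vids, fetch=fetch_lookup, delay=0.5):
    """Look up each ID in turn, sleeping between requests to keep the load light."""
    found, missing = {}, []
    for vid in vids:
        body = fetch(vid)
        if body is None:
            missing.append(vid)
        else:
            found[vid] = body
        time.sleep(delay)
    return found, missing
```

Calling `lookup_all(ids)` returns a dict of raw responses plus a list of IDs whose lookup failed. Passing a different `fetch` callable makes it possible to try the loop without touching the network.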