@chaonan99
Last active August 10, 2016 17:26
An amazingly short script to download a dataset from a list of URLs. It works on TGIF out of the box but can be adapted to other datasets.
# [Description] Download a dataset from a list of URLs
# [Author] chaonan99
# [Date] 08/09/2016
# [Email] [email protected]
import os
import re
import optparse
import urllib.request

parser = optparse.OptionParser()
parser.add_option('-i', '--input', dest='infile', help='input txt file name')
parser.add_option('-o', '--output_folder', dest='outfolder', help='output folder name')
(opts, args) = parser.parse_args()

if not os.path.isdir(opts.outfolder):
    os.mkdir(opts.outfolder)

# Change the regular expression if you want to use this on other datasets,
# but it is a reasonably general solution for most application scenarios.
url_pattern = re.compile(r'(https?:\/\/)?([\w\-\_\.]+)+(\/[\w\-\_]*)*(?![\w\-\_]+\.)(\/[\w\-\_\.]*)')
for url in url_pattern.findall(open(opts.infile).read()):
    # Each match is a tuple of capture groups; the last group is the
    # trailing path component (starting with '/'), used as the file name.
    filename = opts.outfolder + url[-1]
    if not os.path.isfile(filename):
        print("Download: " + url[-1])
        urllib.request.urlretrieve(''.join(url), filename)
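To sanity-check the URL pattern before pointing the script at a real list, it can be exercised on a made-up line in the style of the TGIF file (the URL below is a hypothetical example, not a real TGIF entry):

```python
import re

# The same pattern used by the download script above
url_pattern = re.compile(r'(https?:\/\/)?([\w\-\_\.]+)+(\/[\w\-\_]*)*(?![\w\-\_]+\.)(\/[\w\-\_\.]*)')

# Hypothetical line: URL, tab, caption
line = "https://example.com/media/tumblr_abc123.gif\ta man is talking"

for url in url_pattern.findall(line):
    print(''.join(url))  # https://example.com/media/tumblr_abc123.gif
    print(url[-1])       # /tumblr_abc123.gif
```

One caveat when adapting the regex: a repeated capture group keeps only its last iteration, so for a deeper path such as `/a/b/file.gif` the middle group captures only `/b` and `''.join(url)` silently drops `/a`. URLs with at most one intermediate path segment, as in TGIF, reconstruct correctly.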