tmbdev · September 2, 2020 17:36
diff --git a/ytsamples-split.ipynb b/ytsamples-split.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sharded YouTube Files\n",
    "\n",
    "We have about 6M videos downloaded from YouTube (part of the YouTube 8m dataset released by Google).\n",
    "\n",
    "These videos are stored as 3000 shards, each containing about 2000 videos and each about 56 Gbytes large."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gs://lpr-yt8m-lo-sharded/videos-lo\n",
      "gs://lpr-yt8m-lo-sharded/yt8m-lo-000000.tar\n",
      "gs://lpr-yt8m-lo-sharded/yt8m-lo-000001.tar\n",
      "gs://lpr-yt8m-lo-sharded/yt8m-lo-000002.tar\n",
      "gs://lpr-yt8m-lo-sharded/yt8m-lo-000003.tar\n",
      "gs://lpr-yt8m-lo-sharded/yt8m-lo-000004.tar\n",
      "gs://lpr-yt8m-lo-sharded/yt8m-lo-000005.tar\n",
      "gs://lpr-yt8m-lo-sharded/yt8m-lo-000006.tar\n",
      "gs://lpr-yt8m-lo-sharded/yt8m-lo-000007.tar\n",
      "gs://lpr-yt8m-lo-sharded/yt8m-lo-000008.tar\n",
      "Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>\n",
      "BrokenPipeError: [Errno 32] Broken pipe\n"
     ]
    }
   ],
   "source": [
    "!gsutil ls gs://lpr-yt8m-lo-sharded | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each shard contains the video itself (`.mp4`) plus a lot of associated metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---2pGwkL7M.annotations.xml\n",
      "---2pGwkL7M.description\n",
      "---2pGwkL7M.dllog\n",
      "---2pGwkL7M.info.json\n",
      "---2pGwkL7M.mp4\n",
      "---2pGwkL7M.txt\n",
      "-2cScG5TqjQ.annotations.xml\n",
      "-2cScG5TqjQ.description\n",
      "-2cScG5TqjQ.dllog\n",
      "-2cScG5TqjQ.info.json\n",
      "^C\n",
      "Caught CTRL-C (signal 2) - exiting\n"
     ]
    }
   ],
   "source": [
    "!gsutil cat gs://lpr-yt8m-lo-sharded/yt8m-lo-000000.tar | tar tf - | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Splitting the Videos\n",
    "\n",
    "For training, we usually don't want to use the entire YouTube video at once; they are variable length and are hard to fit into the GPU. Instead, we want to split up the video into frames or short clips.\n",
    "\n",
    "Here, we are using a script based on ffmpeg to split up each video into a set of clips. We also rescale all the images to a common size. The input video for this script is assumed to be `sample.mp4`, and the output clips will be stored in files like `sample-000000.mp4`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting extract-segments.sh\n"
     ]
    }
   ],
   "source": [
    "%%writefile extract-segments.sh\n",
    "#!/bin/bash\n",
    "exec > $(dirname $0)/_extract-segments-last.log 1>&2\n",
    "#ps auxw --sort -vsz | sed 5q\n",
    "\n",
    "set -e\n",
    "set -x\n",
    "set -a\n",
    "\n",
    "size=${size:-256:128}\n",
    "duration=${duration:-2}\n",
    "count=${count:-999999999} \n",
    "\n",
    "# get mp4 metadata (total length, etc.)\n",
    "ffprobe sample.mp4 -v quiet -print_format json -show_format -show_streams > sample.mp4.json\n",
    "\n",
    "# perform the rescaling and splitting\n",
    "ffmpeg -loglevel error -stats -i sample.mp4 \\\n",
    "    -vf \"scale=$size:force_original_aspect_ratio=decrease,pad=$size:(ow-iw)/2:(oh-ih)/2\" \\\n",
    "    -c:a copy -f segment -segment_time $duration -reset_timestamps 1  \\\n",
    "    -segment_format_options movflags=+faststart \\\n",
    "    sample-%06d.mp4\n",
    "\n",
    "# copy the metadata into each video fragment\n",
    "for s in sample-??????.mp4; do\n",
    "    b=$(basename $s .mp4)\n",
    "    cp sample.mp4.json $b.mp4.json || true\n",
    "    cp sample.info.json $b.info.json || true\n",
    "done"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "!chmod 755 ./extract-segments.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Running the Script over All Videos in a Shard"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we use the `tarp` command to iterate the above script over each `.mp4` file in shard `000000`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting splitit.sh\n"
     ]
    }
   ],
   "source": [
    "%%writefile splitit.sh\n",
    "gsutil cat gs://lpr-yt8m-lo-sharded/yt8m-lo-000000.tar |\n",
    "tarp proc -m $(pwd)/extract-segments.sh - -o - |\n",
    "tarp split - -o yt8m-clips-%06d.tar"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now run this script using:\n",
    "\n",
    "```Bash\n",
    "$ bash ./splitit.sh\n",
    "```\n",
    "\n",
    "It's best to do this outside Jupyter since Jupyter doesn't work with long running shell jobs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course, if you want to run this code over all 3000 shards, you probably will want to submit jobs based on `splitit.sh` to some job queuing system.\n",
    "\n",
    "Also note that the `--post` option to `tarp split` lets us upload output shards as soon as they have been created, allowing the script to work with very little local storage."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Output\n",
    "\n",
    "Let's have a look at the output to make sure it's OK."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---2pGwkL7M/000000.mp4.json\n",
      "---2pGwkL7M/000000.info.json\n",
      "---2pGwkL7M/000000.mp4\n",
      "---2pGwkL7M/000001.info.json\n",
      "---2pGwkL7M/000001.mp4\n",
      "---2pGwkL7M/000001.mp4.json\n",
      "---2pGwkL7M/000002.info.json\n",
      "---2pGwkL7M/000002.mp4\n",
      "---2pGwkL7M/000002.mp4.json\n",
      "---2pGwkL7M/000003.info.json\n",
      "tar: write error\n"
     ]
    }
   ],
   "source": [
    "!tar tf yt8m-clips-000000.tar | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These clips have been uploaded to `gs://nvdata-ytsamples/yt8m-clips-{000000..000009}.tar`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Processing Many Shards in Parallel\n",
    "\n",
    "One of the benefits of sharding is that you can process them in parallel. The YT8m-lo dataset consists of 3000 shards with about 2000 videos each. To process them in parallel, you can use standard tools. You need to modify the script a little to take a shard name and to upload the result into a common bucket.\n",
    "\n",
    "```Bash\n",
    "# splitshard.sh\n",
    "gsutil cat gs://lpr-yt8m-lo-sharded/yt8m-lo-$1.tar |\n",
    "tarp proc -m $(pwd)/extract-segments.sh - -o - |\n",
    "tarp split - -o yt8m-clips-$1-%06d.tar -p 'gsutil cp %s gs://my-output/%s && rm -f %s`\n",
    "```\n",
    "\n",
    "On a single machine, you could run run this script with:\n",
    "\n",
    "```Bash\n",
    "for shard in {000000..003000}; do echo $shard; done |\n",
    "xargs -i bash splitshard.sh $shard\n",
    "```\n",
    "\n",
    "More likely, you are going to run a job of this size in the cluster, using Kubernetes or some other job queuing program. Generally, you need to create a Docker container containing the above script (and all dependencies), and then submit it with a simple loop again:\n",
    "\n",
    "```Bash\n",
    "for shard in {000000..003000}; do echo $shard; done |\n",
    "submit-job --container=splitshard-container $1\n",
    "```\n",
    "\n",
    "If you store your shards on AIStore, you can also simply tell AIStore to transform a set of shards using the `splitshard-container` for you."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
 }
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Sharded YouTube Files\n",
	"\n",
	"We have about 6M videos downloaded from YouTube (part of the YouTube 8m dataset released by Google).\n",
	"\n",
	"These videos are stored as 3000 shards, each containing about 2000 videos and each about 56 Gbytes large."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"gs://lpr-yt8m-lo-sharded/videos-lo\n",
	"gs://lpr-yt8m-lo-sharded/yt8m-lo-000000.tar\n",
	"gs://lpr-yt8m-lo-sharded/yt8m-lo-000001.tar\n",
	"gs://lpr-yt8m-lo-sharded/yt8m-lo-000002.tar\n",
	"gs://lpr-yt8m-lo-sharded/yt8m-lo-000003.tar\n",
	"gs://lpr-yt8m-lo-sharded/yt8m-lo-000004.tar\n",
	"gs://lpr-yt8m-lo-sharded/yt8m-lo-000005.tar\n",
	"gs://lpr-yt8m-lo-sharded/yt8m-lo-000006.tar\n",
	"gs://lpr-yt8m-lo-sharded/yt8m-lo-000007.tar\n",
	"gs://lpr-yt8m-lo-sharded/yt8m-lo-000008.tar\n",
	"Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>\n",
	"BrokenPipeError: [Errno 32] Broken pipe\n"
	]
	}
	],
	"source": [
	"!gsutil ls gs://lpr-yt8m-lo-sharded \| head"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Each shard contains the video itself (`.mp4`) plus a lot of associated metadata."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"---2pGwkL7M.annotations.xml\n",
	"---2pGwkL7M.description\n",
	"---2pGwkL7M.dllog\n",
	"---2pGwkL7M.info.json\n",
	"---2pGwkL7M.mp4\n",
	"---2pGwkL7M.txt\n",
	"-2cScG5TqjQ.annotations.xml\n",
	"-2cScG5TqjQ.description\n",
	"-2cScG5TqjQ.dllog\n",
	"-2cScG5TqjQ.info.json\n",
	"^C\n",
	"Caught CTRL-C (signal 2) - exiting\n"
	]
	}
	],
	"source": [
	"!gsutil cat gs://lpr-yt8m-lo-sharded/yt8m-lo-000000.tar \| tar tf - \| head"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Splitting the Videos\n",
	"\n",
	"For training, we usually don't want to use the entire YouTube video at once; they are variable length and are hard to fit into the GPU. Instead, we want to split up the video into frames or short clips.\n",
	"\n",
	"Here, we are using a script based on ffmpeg to split up each video into a set of clips. We also rescale all the images to a common size. The input video for this script is assumed to be `sample.mp4`, and the output clips will be stored in files like `sample-000000.mp4`."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Overwriting extract-segments.sh\n"
	]
	}
	],
	"source": [
	"%%writefile extract-segments.sh\n",
	"#!/bin/bash\n",
	"exec > $(dirname $0)/_extract-segments-last.log 1>&2\n",
	"#ps auxw --sort -vsz \| sed 5q\n",
	"\n",
	"set -e\n",
	"set -x\n",
	"set -a\n",
	"\n",
	"size=${size:-256:128}\n",
	"duration=${duration:-2}\n",
	"count=${count:-999999999} \n",
	"\n",
	"# get mp4 metadata (total length, etc.)\n",
	"ffprobe sample.mp4 -v quiet -print_format json -show_format -show_streams > sample.mp4.json\n",
	"\n",
	"# perform the rescaling and splitting\n",
	"ffmpeg -loglevel error -stats -i sample.mp4 \\\n",
	" -vf \"scale=$size:force_original_aspect_ratio=decrease,pad=$size:(ow-iw)/2:(oh-ih)/2\" \\\n",
	" -c:a copy -f segment -segment_time $duration -reset_timestamps 1 \\\n",
	" -segment_format_options movflags=+faststart \\\n",
	" sample-%06d.mp4\n",
	"\n",
	"# copy the metadata into each video fragment\n",
	"for s in sample-??????.mp4; do\n",
	" b=$(basename $s .mp4)\n",
	" cp sample.mp4.json $b.mp4.json \|\| true\n",
	" cp sample.info.json $b.info.json \|\| true\n",
	"done"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {},
	"outputs": [],
	"source": [
	"!chmod 755 ./extract-segments.sh"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Running the Script over All Videos in a Shard"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Next, we use the `tarp` command to iterate the above script over each `.mp4` file in shard `000000`."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Overwriting splitit.sh\n"
	]
	}
	],
	"source": [
	"%%writefile splitit.sh\n",
	"gsutil cat gs://lpr-yt8m-lo-sharded/yt8m-lo-000000.tar \|\n",
	"tarp proc -m $(pwd)/extract-segments.sh - -o - \|\n",
	"tarp split - -o yt8m-clips-%06d.tar"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We can now run this script using:\n",
	"\n",
	"```Bash\n",
	"$ bash ./splitit.sh\n",
	"```\n",
	"\n",
	"It's best to do this outside Jupyter since Jupyter doesn't work with long running shell jobs."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Of course, if you want to run this code over all 3000 shards, you probably will want to submit jobs based on `splitit.sh` to some job queuing system.\n",
	"\n",
	"Also note that the `--post` option to `tarp split` lets us upload output shards as soon as they have been created, allowing the script to work with very little local storage."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Output\n",
	"\n",
	"Let's have a look at the output to make sure it's OK."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 41,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"---2pGwkL7M/000000.mp4.json\n",
	"---2pGwkL7M/000000.info.json\n",
	"---2pGwkL7M/000000.mp4\n",
	"---2pGwkL7M/000001.info.json\n",
	"---2pGwkL7M/000001.mp4\n",
	"---2pGwkL7M/000001.mp4.json\n",
	"---2pGwkL7M/000002.info.json\n",
	"---2pGwkL7M/000002.mp4\n",
	"---2pGwkL7M/000002.mp4.json\n",
	"---2pGwkL7M/000003.info.json\n",
	"tar: write error\n"
	]
	}
	],
	"source": [
	"!tar tf yt8m-clips-000000.tar \| head"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"These clips have been uploaded to `gs://nvdata-ytsamples/yt8m-clips-{000000..000009}.tar`."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Processing Many Shards in Parallel\n",
	"\n",
	"One of the benefits of sharding is that you can process them in parallel. The YT8m-lo dataset consists of 3000 shards with about 2000 videos each. To process them in parallel, you can use standard tools. You need to modify the script a little to take a shard name and to upload the result into a common bucket.\n",
	"\n",
	"```Bash\n",
	"# splitshard.sh\n",
	"gsutil cat gs://lpr-yt8m-lo-sharded/yt8m-lo-$1.tar \|\n",
	"tarp proc -m $(pwd)/extract-segments.sh - -o - \|\n",
	"tarp split - -o yt8m-clips-$1-%06d.tar -p 'gsutil cp %s gs://my-output/%s && rm -f %s`\n",
	"```\n",
	"\n",
	"On a single machine, you could run run this script with:\n",
	"\n",
	"```Bash\n",
	"for shard in {000000..003000}; do echo $shard; done \|\n",
	"xargs -i bash splitshard.sh $shard\n",
	"```\n",
	"\n",
	"More likely, you are going to run a job of this size in the cluster, using Kubernetes or some other job queuing program. Generally, you need to create a Docker container containing the above script (and all dependencies), and then submit it with a simple loop again:\n",
	"\n",
	"```Bash\n",
	"for shard in {000000..003000}; do echo $shard; done \|\n",
	"submit-job --container=splitshard-container $1\n",
	"```\n",
	"\n",
	"If you store your shards on AIStore, you can also simply tell AIStore to transform a set of shards using the `splitshard-container` for you."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": []
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.6.9"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 4
	}
No results found