## S3 to Mechanical Turk

### Processing the S3 bucket to get a list of files

First, we used the AWS S3 CLI to list all of the objects in the bucket and save them in a JSON file:

```
aws s3api list-objects --bucket <bucketname> > results.json
```

Next, we used `jq` to convert the file into a CSV:

```
cat results.json | jq -r '.["Contents"][] | [.Key, .ETag] | @csv' > results.csv
```

Then, we extract only the photo (\*.jpg) files and save them as a new CSV:

```
sed -r 's/"""//g' results.csv | grep .jpg > images.csv
```

Now, we can compare the ETags to find duplicates and drop them.
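If you'd rather stay in Python, here is a minimal sketch of the same listing step with `boto3`, assuming credentials are already configured and using a placeholder bucket name. It folds the `jq`, `sed`, and `grep` steps into a single pass, and the paginator transparently handles the 1,000-keys-per-call limit of `list_objects_v2`:

```py
# Sketch: list Key/ETag pairs for .jpg objects straight into results.csv.
# "<bucketname>" is a placeholder, as in the CLI command above.
import csv
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

with open("results.csv", "w", newline="") as fp:
    writer = csv.writer(fp)
    for page in paginator.paginate(Bucket="<bucketname>"):
        for obj in page.get("Contents", []):
            # ETag comes back wrapped in literal double quotes; stripping
            # them here removes the need for the sed cleanup step.
            key, etag = obj["Key"], obj["ETag"].strip('"')
            if key.endswith(".jpg"):
                writer.writerow([key, etag])
```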
### Determining the number of duplicates

#### First pass

Next, we write a bash script called `checksums.sh` to skip duplicates in the S3 bucket. This is necessary because we have 36172 images, and we want to make sure we're not asking turkers to look at the same image twice.

```
#!/bin/bash

# Usage: ./checksums.sh

# Initialize the checksums array...
checksums=()

# Loop through the BMR files
IFS=$'\n';for file in $(cat images.csv | grep .jpg); do
    hash=${file#*.jpg\",}

    # Check to see if it's in the checksum list
    match=0;

    # Note: This is not efficient at *all*, because every file triggers a
    # scan of the *entire* list, making the whole loop O(n^2).

    for i in "${checksums[@]}"; do
        # If in checksum list, say "MATCH!" and skip
        if [ "$i" == "$hash" ] ; then
            echo "MATCH! Found $hash and therefore skipping upload to s3"
            match=1
        fi;
    done;

    # If it was not in the checksum list, do whatever we *planned* to do
    # with the file. In this case, load to S3...
    if [ "$match" -eq 0 ]; then
        checksums+=("$hash")

        # S3 code to go here..

    fi;

done
```

From this, we run `bash checksums.sh | wc -l` to count the duplicates. This turned out to be a massive *fail* because of the quadratic complexity of the operation. Happily, there's a better way.
"#### Second pass\n", | |
"\n", | |
"Given the complexity, we decide to use a different command line tool, `awk`. This \n", | |
"\n", | |
"```\n", | |
"awk -F ',' '{filename=$1;line=$2;if(dup[line]++)print line}' images.csv | wc -l\n", | |
"```\n", | |
"\n", | |
"We this operation, we found that there 15293 duplicates! That seemed improbable until I realized that the S3 Bucket including a Testing folder that was a duplicate folder. After removing that Testing folder, we found 276 duplicates. *Phew*.\n", | |
"\n", | |
"Ok. So, now that's done, let's run the following (note the `!` character right before the `dup`:\n", | |
"\n", | |
"```\n", | |
"cat images.csv | awk '!/Testing/' | awk -F ',' '{filename=$1;line=$2;if(!dup[line]++)print filename}' | awk -F '[\"|\"]' '{print \"https://s3.amazonaws.com/beamstudyfoodphotos/\"$2}' > image_urls.csv\n", | |
"```\n", | |
"\n", | |
"Now, we have 20674 files ready for turking!" | |
] | |
}, | |
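For completeness, the same pipeline sketched in Python, under the same assumptions about the `images.csv` layout. The bucket URL is taken from the awk command above; the `image_url` header line is an assumption that matches the line skipped when the file is read back in the next section:

```py
# Sketch of the awk pipeline: drop the Testing folder, keep the first
# occurrence of each ETag, and emit full S3 URLs.
import csv

seen = set()
with open("images.csv", newline="") as src, open("image_urls.csv", "w") as dst:
    dst.write("image_url\n")  # assumed header; skipped on read-back below
    for key, etag in csv.reader(src):
        if "Testing" in key or etag in seen:
            continue
        seen.add(etag)
        dst.write("https://s3.amazonaws.com/beamstudyfoodphotos/%s\n" % key)
```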
### Selecting the BMRs

Now, we ultimately don't want to run *every* BMR at the moment, because Tracy's research primarily involves WIC data. So we want to filter for the WIC-only data. To do this, I use some Python:

```py
import csv
import re

# Collect the BMR ids whose WIC_STATUS flag is set.
with open('bmrs.csv') as csvfile:
    bmrreader = csv.DictReader(csvfile)
    wic = []
    for bmr in bmrreader:
        if bmr["WIC_STATUS"] == "1":
            wic.append(bmr["BMR"])

with open("image_urls.csv", "r") as fp:
    fp.readline()  # skip the header line
    urls = fp.readlines()

# Each URL is assumed to embed a "BMR_<number>" segment; dropping the
# underscore makes it match the format of the BMR column.
results = [url for url in urls if re.findall(r"BMR_\d+", url)[0].replace("_", "") in wic]
print("There are %s files that are in the %s WIC-related BMR files. There are %s total files." % (len(results), len(wic), len(urls)))

# There are 7992 files that are in the 84 WIC-related BMR files. There are 20674 total files.

with open("wic_images_urls.csv", "w") as fp:
    fp.write("image_url\n")
    fp.write("".join(results))
```
```py
# At $0.08 per HIT and 3 HITs per file, that is:
print("The total cost for %s files at $0.24 per file ($0.08 per HIT, 3 HITs per file) is: $%s." % (7992, 0.24 * 7992))
```

```
The total cost for 7992 files at $0.24 per file ($0.08 per HIT, 3 HITs per file) is: $1918.08.
```