{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## S3 to Mechanical Turk\n",
"\n",
"\n",
"### Processing the S3 bucket to get a list of files\n",
"\n",
"First, we used the AWS S3 CLI to list all of the objects and save them in a json file.\n",
"\n",
"```\n",
"aws s3api list-objects --bucket <bucketname> > results.json\n",
"```\n",
"\n",
"Next, we used `jq` to convert the file into a CSV\n",
"\n",
"```\n",
"cat results.json | jq -r '.[\"Content\"][] | [.Key, .ETag] | @csv' > results.csv\n",
"```\n",
"\n",
"Then, we access only the photos (\\*.jpg) files, and save it as a new CSV:\n",
"\n",
"```\n",
"sed -r s/\\\"\\\"\\\"//g results.csv | grep .jpg > images.csv\n",
"```\n",
"\n",
"Now, we can compare the ETags to find duplicates, and drop them."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Determining the number of duplicates\n",
"\n",
"#### First pass\n",
"\n",
"Next, we write a bash script to skip duplicates in the S3 bucket, called `checksums.sh`. This is necessary because we have 36172 images, and we want to make sure we're not asking turkers to look at the same image twice.\n",
"\n",
"```\n",
"#!/bin/bash\n",
"\n",
"# Usage: ./checksums.sh\n",
"\n",
"# Initialize the checksums array...\n",
"checksums=()\n",
"\n",
"# Loop through the BMR files\n",
"IFS=$'\\n';for file in $(cat images.csv | grep .jpg); do\n",
" hash=${file#*.jpg\\\",}\n",
"\n",
" # Check to see if it's in the checksum list\n",
" match=0;\n",
" \n",
" # Note: This is not very efficient at *all* because you have to loop through\n",
" # the *entire* list. Which is O(log n!) or something like that.\n",
" \n",
" for i in \"${checksums[@]}\"; do\n",
" # If in checksum list, say \"MATCH!\" and skip\n",
" if [ \"$i\" == \"$hash\" ] ; then\n",
" echo \"MATCH! Found $hash and therefore skipping upload to s3\"\n",
" match=1\n",
" fi;\n",
" done;\n",
"\n",
" # If it was not in the checksum list, do whatever we *planned* to do\n",
" # with file. In this case, load to S3...\n",
" if [ \"$match\" -eq 0 ]; then\n",
" checksums+=(\"$hash\")\n",
"\n",
" # S3 code to go here..\n",
"\n",
" fi;\n",
"\n",
"done\n",
"```\n",
"\n",
"From this, we run `bash checksums.sh | wc -l` to find duplicates. This turned out to be a massive *fail* because of the complexity of the operation. Happily, there's a better way.\n",
"\n",
"#### Second pass\n",
"\n",
"Given the complexity, we decide to use a different command line tool, `awk`. This \n",
"\n",
"```\n",
"awk -F ',' '{filename=$1;line=$2;if(dup[line]++)print line}' images.csv | wc -l\n",
"```\n",
"\n",
"We this operation, we found that there 15293 duplicates! That seemed improbable until I realized that the S3 Bucket including a Testing folder that was a duplicate folder. After removing that Testing folder, we found 276 duplicates. *Phew*.\n",
"\n",
"Ok. So, now that's done, let's run the following (note the `!` character right before the `dup`:\n",
"\n",
"```\n",
"cat images.csv | awk '!/Testing/' | awk -F ',' '{filename=$1;line=$2;if(!dup[line]++)print filename}' | awk -F '[\"|\"]' '{print \"https://s3.amazonaws.com/beamstudyfoodphotos/\"$2}' > image_urls.csv\n",
"```\n",
"\n",
"Now, we have 20674 files ready for turking!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Selecting the BMRs\n",
"\n",
"Now, we ultimately don't want to run *every* BMR at the moment, because Tracy's research primarily involves WIC data. So, now, we want to filter for the WIC only data. To do this, I use some python.\n",
"\n",
"```py\n",
"import csv\n",
"import re\n",
"\n",
"with open('bmrs.csv') as csvfile:\n",
" bmrreader = csv.DictReader(csvfile)\n",
" wic = []\n",
" for bmr in bmrreader:\n",
" if bmr[\"WIC_STATUS\"] == \"1\":\n",
" wic.append(bmr[\"BMR\"])\n",
"with open(\"image_urls.csv\", \"r\") as fp:\n",
" fp.readline() #skip the first line\n",
" urls = fp.readlines()\n",
"\n",
"results = [url for url in urls if re.findall(\"BMR_\\d+\", url)[0].replace(\"_\",\"\") in wic]\n",
"print(\"There are %s files that are in the %s WIC-related BMR files. There are %s total files.\" % (len(results), len(wic), len(urls)))\n",
"\n",
"# There are 7992 files that are in the 84 WIC-related BMR files. There are 20674 total files.\n",
"\n",
"with open(\"wic_images_urls.csv\", \"w\") as fp:\n",
" fp.write(\"image_url\\n\")\n",
" fp.write(\"\".join(results))\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The total cost for 7992 files at $0.24 per file ($0.08 per HIT, 3 HITs per file) is: $1918.08.\n"
]
}
],
"source": [
"# At , that is:\n",
"print(\"The total cost for %s files at $0.24 per file ($0.08 per HIT, 3 HITs per file) is: $%s.\" % (7992, 0.24 * 7992))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}