
@michael-erasmus
Last active October 19, 2021 17:59
Speeding up the deletion of an S3 bucket with millions of nested files

I had a really interesting journey today with a thorny little challenge: deleting all the files in an S3 bucket with tons of nested files. The bucket path (s3://buffer-data/emr/logs/) contained log files created by ElasticMapReduce jobs that ran every day over a couple of years (from early 2015 to early 2018).

Each EMR job would run hourly every day, firing up a cluster of machines, and each machine would output its logs. That resulted in thousands of nested paths (one for each job), each containing thousands of other files. I estimated that the total number of nested files was somewhere between 5 and 10 million.

I had to estimate this number by looking at sample counts of some of the nested directories, because getting the true count would mean recursing through the whole S3 tree, which was just too slow. This is also exactly why it was challenging to delete all the files.
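
Something like this rough spot check, run against one of the nested job directories (the directory name here is just one example), gives a per-directory object count you can extrapolate from:

aws s3 ls --recursive s3://buffer-data/emr/dp-logs/df-Y43SNR3SQOJ4/ | wc -l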

Deleting all the files under an S3 prefix like this is pretty challenging, since S3 doesn't really work like a true file system. What we think of as a file's parent 'directory' in S3 is basically just a prefix that's associated with that stored object.

The parent directory object has no knowledge of the files it 'contains', so you can't just delete the parent directory and clean up all the files within it.
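
You can see this flat structure with the lower-level s3api commands, which deal in raw keys rather than 'directories'. Something like this (the prefix and item count are just illustrative) returns full object keys, not folders:

aws s3api list-objects-v2 --bucket buffer-data --prefix emr/dp-logs/ --max-items 3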

To delete all the files in an S3 'directory', you can use the aws command line with the --recursive flag:

aws s3 rm --recursive s3://buffer-data/emr/logs

When I tried running this command on my bucket, I left it running for over 24 hours, only to find that it had deleted just a fraction of the data.

The problem was that the aws command only deletes a maximum of 1,000 objects at a time, and it all happens in sequence. I didn't know exactly how long it would take to finish, since I couldn't even accurately tell how many log files there were, but I knew it would take days, and I couldn't wait that long.
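
That 1,000-object ceiling comes from the underlying DeleteObjects API, which accepts at most 1,000 keys per request. You can see the same batch shape if you call it directly with s3api (the keys below are just placeholders):

aws s3api delete-objects --bucket buffer-data --delete '{"Objects": [{"Key": "emr/dp-logs/placeholder-log-1"}, {"Key": "emr/dp-logs/placeholder-log-2"}]}'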

So I had to find another way. After some digging in the documentation, it didn't seem like there was any way to force the aws command to execute in parallel, but luckily the shell has us covered.

To hack together a way to delete nested files faster, I used a combination of the aws s3 ls command and xargs (with a bit of sed to help with some text formatting).

Here is the one-liner I came up with.

aws s3 ls s3://buffer-data/emr/dp-logs/ | grep df | sed -e 's/PRE /s3:\/\/buffer-data\/emr\/dp-logs\//g' | xargs -L1 -P 0 aws s3 rm --recursive

Let me break that down a bit. The aws s3 ls command will just list all the nested objects with the dp-logs prefix (because I don't specify the --recursive flag, it won't recurse into those any further, which would also take a really long time to finish).

All the directories with logs in them started with a df prefix, which is why I pipe the output of the ls command through grep df to keep only those.

To actually run an aws s3 rm command for each one of the nested directories, I used the xargs command. But to get that to work, I first had to do a little cleanup of the output of the ls command. The output looks like this:

PRE df-Y43SNR3SQOJ4/

Notice that it just contains the object name without the full prefix. That is easy to fix with sed:

sed -e 's/PRE /s3:\/\/buffer-data\/emr\/dp-logs\//g'

This turns the output into this:

s3://buffer-data/emr/dp-logs/df-Y43SNR3SQOJ4/

Finally, I can pipe this output into xargs to run an aws s3 rm command for each of the nested directories. But why go through all of that? The key reason is that although xargs will by default run each command in sequence, you can change that by specifying the -P flag.

xargs -L1 -P 0 aws s3 rm --recursive
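
If you want to sanity-check a pipeline like this before letting it loose on real data, you could prefix the command with echo, so that xargs just prints each aws s3 rm invocation instead of executing it:

aws s3 ls s3://buffer-data/emr/dp-logs/ | grep df | sed -e 's/PRE /s3:\/\/buffer-data\/emr\/dp-logs\//g' | xargs -L1 echo aws s3 rm --recursive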

Setting -P 0 will run as many processes as it can at once. When I first ran this on my laptop, it brought everything else on my machine to a halt, so I fired up a beefy EC2 machine (with 8 cores) instead, set up the aws command line on it, and let it run from there.
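
If you'd rather not max out your machine, you can also cap the parallelism instead of using 0, for example matching it to the number of cores:

xargs -L1 -P 8 aws s3 rm --recursive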

And presto! That's all I needed to do to turn a job that could have taken days into one that finished within a couple of hours!
