I went on a really interesting journey today with a thorny little challenge: deleting all the files in an S3 bucket with tons of nested files.
The bucket path (s3://buffer-data/emr/logs/) contained log files created by Elastic MapReduce (EMR) jobs that ran every day over a couple of years (from early 2015 to early 2018).
Each EMR job ran hourly every day, firing up a cluster of machines, and each machine would output its logs. That resulted in thousands of nested paths (one for each job), each containing thousands of other files. I estimated that the total number of nested files was somewhere between 5 and 10 million.
I had to estimate this number by looking at sample counts of some of the nested directories, because getting the true count would mean recursing through the whole S3 tree, which was just too slow. This is also exactly why it was challenging to delete all the files.
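To give a rough idea of what that sampling looks like, here's a sketch (my illustration, not the exact commands I ran at the time) that counts the files under a single job directory and then counts the job directories themselves; the dp-logs prefix and the df-Y43SNR3SQOJ4 directory name are borrowed from the one-liner later in this post.
# Count the files under one sample job directory...
aws s3 ls --recursive s3://buffer-data/emr/dp-logs/df-Y43SNR3SQOJ4/ | wc -l
# ...and count how many job directories sit at the top level.
aws s3 ls s3://buffer-data/emr/dp-logs/ | wc -l
Multiplying the two numbers gives a ballpark total without having to recurse through everything.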
Deleting all the files in an S3 bucket like this is pretty challenging, since S3 doesn't really work like a true file system. What we think of as a file's parent 'directory' in S3 is basically just a prefix that's associated with that stored object.
The parent 'directory' object has no knowledge of the files it 'contains', so you can't just delete the parent directory and clean up all the files within it.
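A quick way to see this (just an illustration, not part of the original cleanup) is to list a few objects with the s3api commands and look at their keys: every object carries its full 'path' as part of its key, and there is no separate directory object anywhere.
# Print a handful of full keys under the prefix; there are no directory entries, just keys.
aws s3api list-objects-v2 --bucket buffer-data --prefix emr/logs/ --max-items 3 --query 'Contents[].Key' --output json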
To delete all the files in an S3 'directory', you can use the aws command line with the --recursive flag:
aws s3 rm --recursive s3://buffer-data/emr/logs
When I tried running this command on my bucket, I left it running for over 24 hours, only to find that it had deleted only a fraction of the data.
The problem was that the aws command would only delete at most 1,000 objects at a time, and that it all happened in sequence. I didn't know exactly how long it would take to finish, since I couldn't even accurately tell how many log files there were, but I knew it would take days, and I couldn't wait that long.
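That limit comes from the underlying multi-object delete API, which accepts at most 1,000 keys per request. As a rough sketch (assumed for illustration, not something from the original run), deleting a single batch by hand with the s3api commands looks roughly like this:
# Grab up to 1,000 keys under the prefix...
aws s3api list-objects-v2 --bucket buffer-data --prefix emr/logs/ --max-items 1000 --query 'Contents[].{Key: Key}' --output json > batch.json
# ...then delete that whole batch in a single call. Repeat until nothing is left.
aws s3api delete-objects --bucket buffer-data --delete "{\"Objects\": $(cat batch.json), \"Quiet\": true}"
aws s3 rm --recursive does this paging for you, but one batch after another, which is why it crawls on millions of objects.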
So I had to find another way. After some digging in the documentation, it didn't seem like there was any way to force the aws command to execute in parallel, but luckily the shell has us covered.
To hack together a way to delete nested files faster, I used a combination of the aws s3 ls command and xargs (with a bit of sed to help with some text formatting).
Here is the one-liner I came up with:
aws s3 ls s3://buffer-data/emr/dp-logs/ | grep df | sed -e 's/PRE /s3:\/\/buffer-data\/emr\/dp-logs\//g' | xargs -L1 -P 0 aws s3 rm --recursive
Let me break that down a bit. The aws s3 ls command will just list all the nested objects with the dp-logs prefix (because I don't specify the --recursive flag, it won't recurse into those any further, which would also take a really long time to finish).
All the directories with logs in them started with a df prefix, which is why I pipe the output of the ls command through grep df to pick out just those lines.
To actually run an aws s3 rm command for each one of the nested directories, I used the xargs command. But to get that to work, I first had to do a little cleanup of the output of the ls command.
The output looks like this:
PRE df-Y43SNR3SQOJ4/
Notice that it just contains the object name without the full prefix. That is easy to fix with sed:
sed -e 's/PRE /s3:\/\/buffer-data\/emr\/dp-logs\//g'
This turns the output into this:
s3://buffer-data/emr/dp-logs/df-Y43SNR3SQOJ4/
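Before wiring in the delete, it's worth running just the listing half of the pipeline to sanity-check the output (a suggestion on my part, not a step from the original run):
aws s3 ls s3://buffer-data/emr/dp-logs/ | grep df | sed -e 's/PRE /s3:\/\/buffer-data\/emr\/dp-logs\//g' | head
(aws s3 rm also supports a --dryrun flag if you want to go one step further and preview the deletes themselves.)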
Finally, I can pipe that output into xargs to run an aws s3 rm command for each of the nested directories.
But why go through all of that? The key reason is that although xargs will by default run each command in sequence, you can change that by specifying the -P flag.
xargs -L1 -P 0 aws s3 rm --recursive
Setting -P 0 tells xargs to run as many processes at once as it can.
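If you want to see what -L1 and -P do without touching S3, here's a toy stand-in (purely illustrative): each input line becomes one command invocation, and up to the given number of invocations run in parallel.
# One echo per input line, with up to 4 running at once; -P 0 removes the cap entirely.
printf 's3://bucket/a/\ns3://bucket/b/\ns3://bucket/c/\n' | xargs -L1 -P 4 echo would delete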
When I ran this on my laptop at first, it brought everything else on my machine to a halt, so I fired up a beefy machine on EC2 (with 8 cores) instead, set up the aws command line on it, and let it run from there.
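A gentler variant (a sketch, not what I actually ran) would be to cap the parallelism at roughly the number of cores instead of letting -P 0 spawn processes without limit; something like -P 8 would likely have been much kinder to the laptop.
aws s3 ls s3://buffer-data/emr/dp-logs/ | grep df | sed -e 's/PRE /s3:\/\/buffer-data\/emr\/dp-logs\//g' | xargs -L1 -P 8 aws s3 rm --recursive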
And presto! That's all it took to turn a job that could have taken days into one that finished within a couple of hours!