I had a really interesting journey today with a thorny little challenge: deleting all the files in an s3 bucket with tons of nested files.
The bucket path (s3://buffer-data/emr/logs/) contained log files created by ElasticMapReduce jobs that ran every day over a couple of years (from early 2015 to early 2018).
Each EMR job would run hourly every day, firing up a cluster of machines, and each machine would output its logs.
That resulted in thousands of nested paths (one for each job), each of which contained thousands of other files.
I estimated that the total number of nested files was somewhere between 5 and 10 million.
I had to estimate this number by looking at sample counts from some of the nested directories, because getting the true count would have meant recursing through the whole s3 tree, which was just too slow. That's also exactly why deleting all the files was so challenging.
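For context, here's a minimal sketch of how you might sample the object count under a single job's prefix with boto3 and then extrapolate. The job-id prefix below is hypothetical; only the bucket and the emr/logs/ path come from the description above.

```python
import boto3

s3 = boto3.client("s3")

def count_objects(bucket, prefix):
    """Count every object under a prefix by paging through list_objects_v2."""
    paginator = s3.get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        total += page.get("KeyCount", 0)
    return total

# Count one sampled job directory, then multiply by the rough number of jobs.
sample = count_objects("buffer-data", "emr/logs/j-EXAMPLEJOBID/")  # hypothetical job id
print(sample)
```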
Deleting all the files under an s3 path like this is pretty challenging, since s3 doesn't really work like a true filesystem.
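To make that concrete: there's no single "delete this directory" call in the s3 API. Every key under the prefix has to be listed and then deleted, at most 1,000 keys per batch request. Here's a rough sketch of that shape with boto3, just to illustrate the problem (not necessarily the approach that works well at millions of objects):

```python
import boto3

s3 = boto3.client("s3")

def delete_prefix(bucket, prefix):
    """Delete every object under a prefix, up to 1,000 keys per request (the s3 batch limit)."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})

delete_prefix("buffer-data", "emr/logs/")
```

With 5 to 10 million nested files, even just the listing side of this loop takes a very long time, which is where the real challenge begins.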