Let's say you have a model with files attached using Paperclip. You have a couple of million of those files, and you're not sure that every one of them (and all its thumbnails) is still used by a database record.
You could use this rake task to recursively scan all the directories and check if the files need to be kept or destroyed.
In this example, the model is called Picture, the attachment is image, and the path is partitioned like images/001/412/497/actual_file.jpg
The task walks down the directory tree. Each time a path ends with 3 triplets of digits ("001/412/497" for example), it looks for a record with the ID 1412497. If no such record exists, the whole directory is moved to a parallel images_deleted directory. At the end you can delete the files if you like, or move them to an archive location.
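The lookup described above can be sketched in plain Ruby as follows. This is only an illustration, not the actual rake task: the names ID_PARTITION, id_from_partitioned_path and orphan_directories are hypothetical, the list of existing IDs is passed in instead of being queried from the Picture model, and IDs are assumed to fit in 9 digits.

```ruby
require "find"

# Matches a path ending in three 3-digit segments, e.g. "images/001/412/497".
ID_PARTITION = %r{(?:\A|/)(\d{3})/(\d{3})/(\d{3})\z}

# Returns the record ID encoded in the path, or nil when the path is not a
# complete ID partition (hypothetical helper, assumes IDs of at most 9 digits).
def id_from_partitioned_path(path)
  match = ID_PARTITION.match(path)
  return nil unless match
  match.captures.join.to_i # "001412497" -> 1412497
end

# Walks `root` and returns every partition directory whose ID is not in
# `existing_ids`. In a real task the check would be a database lookup.
def orphan_directories(root, existing_ids)
  orphans = []
  Find.find(root) do |path|
    next unless File.directory?(path)
    id = id_from_partitioned_path(path)
    next if id.nil?
    orphans << path unless existing_ids.include?(id)
    Find.prune # nothing below a partition directory needs to be scanned
  end
  orphans
end
```

With millions of files, checking IDs one by one against the database would be slow, so a real task would probably want to load the existing IDs in batches or into a Set first.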
You can use the "dry run" mode to print which files would be removed:
rake paperclip:clean_orphan_files DRY_RUN=1
You'd get a line for each orphan attachment with its ID. You can also redirect this into a file for later inspection:
rake paperclip:clean_orphan_files DRY_RUN=1 > clean_orphan_files.out
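For illustration, here is one way the dry-run switch could be handled. This is a sketch under assumptions, not the task's real code: the helpers dry_run? and quarantine are hypothetical, and the message format is made up (the real task prints the attachment ID).

```ruby
require "fileutils"

# Hypothetical helper: true when the task was invoked with DRY_RUN=1.
def dry_run?(env = ENV)
  env["DRY_RUN"] == "1"
end

# Hypothetical helper: moves an orphan directory from under `root` to the same
# relative location under `deleted_root`, or only reports it in dry-run mode.
# Returns the target path in both cases.
def quarantine(dir, root:, deleted_root:, dry_run:, out: $stdout)
  relative = dir.delete_prefix(root).delete_prefix("/")
  target = File.join(deleted_root, relative)
  if dry_run
    out.puts "would move #{dir} -> #{target}"
  else
    FileUtils.mkdir_p(File.dirname(target))
    FileUtils.mv(dir, target)
  end
  target
end
```

Keeping the relative path identical under images_deleted is what makes it possible to copy everything back later.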
If you think you've made a huge mistake, you can revert it:
cp -r images_deleted/* images/
rm -r images_deleted
and you'll be back to normal.
NB: this code has run on a "production" server without any issue, but it has no automated tests. For the moment it's still a couple of methods in a rake task, and it would really benefit from being extracted into a class. It's also not particularly well coded. I'm pretty sure some parts could be improved and made more readable. Feel free to comment.