NOTE: The script and some information in this tutorial may be inaccurate. I'm under the impression my files were found because they were cached by vscode; nonetheless, parts of the tutorial will work, specifically if you know some of the contents of the file (strings, grep, dd).
This is a last resort after trying testdisk's undelete feature, which I find works quite well. In this situation, however, testdisk was unable to identify the files in the directory I was looking for. It helps to know a unique string in the file, but that is not strictly necessary.
If the file is really important, stop writing to that disk. If it's only kind of important, then depending on how full your disk is it may not matter if you're lazy (I recovered this file a few days after the fact while still using my disk as normal, but you should not do this). You should also at least work from a different drive during the process (saving these dumps and such to a different drive).
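If you do want to play it safer, here is a minimal sketch of imaging the affected partition to a file on a different drive first (here /dev/sdXN and /mnt/other are placeholders for the partition holding the lost file and a mount point on another disk); everything below can then be run against the image instead of the live device:
sudo dd if=/dev/sdXN of=/mnt/other/disk.img bs=4M status=progress conv=noerror,sync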
strings -t d your_device_or_raw_dump | grep -i "your_hopefully_rare_identifiable_string_or_file_uri" > saved_strings
It will take a while for strings to run, as it finds every printable string on the device and outputs it along with its byte offset, line by line, something like:
12345 your_contiguous_string
...
You can then grep this output to find your string (-i makes it case-insensitive). Your saved strings will be a subset of the full strings output. The offset is what matters: it is the exact number of bytes from the start of your device or dump.
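As a possible shortcut, GNU grep can also report byte offsets on its own: -b prints the byte offset of each match, -a treats the binary device as text, and -o prints only the matching part. I haven't compared it against the strings approach, so treat this as an alternative sketch:
grep -aibo "your_hopefully_rare_identifiable_string_or_file_uri" your_device_or_raw_dump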
Note that remembering only the filename or directory might work too in certain cases; specifically, I was able to find files I had been working on with vscode on ext4, where the file path was stored right before the file contents. Here is an example from when I was recovering with file:///home/u8sand/Programs/work/covid19/
13562414972 "file:///home/u8sand/Programs/work/covid19/2020-05-20",
13562416593 "file:///home/u8sand/Programs/work/covid19/2020-05-06",
13562421577 "file:///home/u8sand/Programs/work/covid19/2020-04-23/figure_4.py", # <- file i was looking for
13562421657 "file:///home/u8sand/Programs/work/covid19/2020-05-21/drugs.json",
13562421736 "file:///home/u8sand/Programs/work/covid19/2020-04-23/figure_5_2.png",
13562421819 "file:///home/u8sand/Programs/work/covid19/2020-04-23/data.tsv",
13562424093 "file:///home/u8sand/Programs/work/covid19/2020-05-06/03_signature_unification.py",
13562426081 "file:///home/u8sand/Programs/work/covid19/figure_5.py", # <- file i was looking for
13563069911 "file:///home/u8sand/Programs/work/covid19/2020-05-22",
In these strings dumps you can subsequently filter by the filename followed by a "{" (or just search using less).
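For instance, something along these lines against the saved strings output (assuming your earlier grep pattern was broad enough, e.g. the directory URI, for these metadata lines to have been captured; adjust the pattern to your own filenames):
grep '\.py {' saved_strings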
75631951872 file:///home/u8sand/Programs/work/covid19/figure_5.py {"mtime":1587662271099,"ctime":1587567723331,"size":2397,"etag":"3513od2ee2fa","orphaned":false}
75632504832 file:///home/u8sand/Programs/work/covid19/figure_4.py {"mtime":1587667729883,"ctime":1587567584887,"size":4554,"etag":"351434q9c4ms","orphaned":false}
75632582656 file:///home/u8sand/Programs/work/covid19/figure_4.py {"mtime":1587667729883,"ctime":1587567584887,"size":4554,"etag":"351434q9c4ms","orphaned":false}
These were the files I was looking for; these strings occurred directly before the complete file contents.
Because most sane filesystems write files contiguously (when they aren't too big), you should be able to find your file in the contiguous section near the offset you found. It helps to have a rough idea of how large your file is: if it's, say, 1KB, then we can grab 4KB of padded data just to be sure we get all of it. If you're searching by this filename technique you know you're at the start, so you could pad just 100 bytes in front and more on the end.
WARNING: always be careful with dd. If you mix up the if and of direction you will likely be very sad, because you will be writing over what you wanted to be reading. Recovering data that was overwritten with dd is ridiculously difficult, if possible at all; you won't find a tutorial on that, since it involves physical techniques.
dd if=your_device_or_raw_dump of=filtered_dump.dd skip=$(expr your_strings_offset - 2000) count=4000 bs=1
dd is perfect for this; it lets you dump the part of your choice very rapidly. With your dump file, you need to make sure it contains what you're looking for and then trim the padding.
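A note on speed: bs=1 is fine for a 4KB window, but if you ever need to pull a much larger region, GNU dd can keep byte-granular skip/count while reading in big blocks via iflag=skip_bytes,count_bytes. A sketch, assuming a reasonably recent coreutils (the sizes here are arbitrary):
dd if=your_device_or_raw_dump of=filtered_dump.dd bs=4M iflag=skip_bytes,count_bytes skip=$(expr your_strings_offset - 2000) count=1048576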
Now that it's a workable size, we can just hex dump it and check for ourselves. xxd is a very simple hex dump utility, but any hex editor will do.
xxd filtered_dump.dd | less
00000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000060: 0000 0000 6669 6c65 3a2f 2f2f 686f 6d65 ....file:///home
00000070: 2f75 3873 616e 642f 5072 6f67 7261 6d73 /u8sand/Programs
00000080: 2f77 6f72 6b2f 636f 7669 6431 392f 6669 /work/covid19/fi
00000090: 6775 7265 5f35 2e70 7920 7b22 6d74 696d gure_5.py {"mtim
000000a0: 6522 3a31 3538 3736 3632 3237 3130 3939 e":1587662271099
000000b0: 2c22 6374 696d 6522 3a31 3538 3735 3637 ,"ctime":1587567
000000c0: 3732 3333 3331 2c22 7369 7a65 223a 3233 723331,"size":23
000000d0: 3937 2c22 6574 6167 223a 2233 3531 336f 97,"etag":"3513o
000000e0: 6432 6565 3266 6122 2c22 6f72 7068 616e d2ee2fa","orphan
000000f0: 6564 223a 6661 6c73 657d 0a23 2525 0a69 ed":false}.#%%.i
00000100: 6d70 6f72 7420 7265 0a69 6d70 6f72 7420 mport re.import
00000110: 7061 6e64 6173 2061 7320 7064 0a69 6d70 pandas as pd.imp
00000120: 6f72 7420 6e75 6d70 7920 6173 206e 700a ort numpy as np.
00000130: 6672 6f6d 206d 6174 706c 6f74 6c69 6220 from matplotlib
...
00000a40: 6727 2c20 6470 693d 3330 3029 0a70 6c74 g', dpi=300).plt
00000a50: 2e73 6176 6566 6967 2827 6669 6775 7265 .savefig('figure
00000a60: 5f35 2e70 6e67 272c 2064 7069 3d33 3030 _5.png', dpi=300
00000a70: 290a 706c 742e 7368 6f77 2829 0a00 0000 ).plt.show()....
00000a80: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000a90: 0000 0000 0000 0000 0000 0000 0000 0000 ................
There is my complete file. To grab it and throw away the rest of the junk, just note the hex offsets of the start and end; you can also remove some leftover junk afterwards, like the extra zeros. Alternatively, you can try to use the "size" value in the metadata just before the file (I may make a script for parsing a file once you've found the offset of the metadata). Span: 00000060 to 00000a80.
Now we use dd again, this time on the filtered dump:
dd if=filtered_dump.dd of=fig5.py skip=$((16#00000060)) count=$(expr $((16#00000a80)) - $((16#00000060))) bs=1
The $((16#...)) is a bash conversion from hex to decimal, so we can do the calculations inline.
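Alternatively, if you trust the "size" field from the metadata (2397 bytes here), you can cut exactly that many bytes instead of eyeballing the end. In this particular dump the file content appears to start right after the metadata's trailing newline, at 0xfb; double-check that offset against your own hex dump before using something like:
dd if=filtered_dump.dd of=fig5.py skip=$((16#fb)) count=2397 bs=1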
We can now open it in vim or another decent text editor, eliminate the extra characters at the top and bottom, and save it. Done, our file is restored!
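If you'd rather not clean up by hand, here is a quick sketch for stripping the NUL padding from a plain-text recovery (the metadata line at the top, if you kept it, still has to be deleted manually; don't do this to binary files):
tr -d '\0' < fig5.py > fig5_clean.py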
The curses-based hexedit package can literally open a hard disk in hex mode and search for ASCII strings (note: you have to press Tab to switch the search mode). You can try that as well.
I wrote this script after making this tutorial because it seemed pretty simple and useful to do; in retrospect it probably only works for the unique vscode case. It's in this gist as well and can be used for the last step, or in fact for the entire process, albeit a bit more slowly. Perhaps it could be rewritten in Rust for speed. It works very well on the filtered dump but could technically work on the original disk as well.
python lostfiles.py --file=filtered_dump.dd --search=file:///home/u8sand/Programs/work/covid19/figure_5\.py
python3 lostfiles.py --file=dump_fig5 --search=file:///home/u8sand/Programs/work/covid19/figure_5\.py
Found file:///home/u8sand/Programs/work/covid19/figure_5.py {"mtime":1587662271099,"ctime":1587567723331,"size":2397,"etag":"3513od2ee2fa","orphaned":false} at 100 parse (to view/save)? [Y/n]
File size meta: 2397 (2.397kb), eof: 2434 (2.434kb), view? [Y/n]
#%%
import re
import pandas as pd
...
plt.savefig('figure_5.png', dpi=300)
plt.show()
Save to file (leave blank to not save)? fig5.py
Done. avg 0.000b per second
On the whole disk:
python3 lostfiles.py --file=/dev/sdc1 --search=file:///home/u8sand/Programs/work/covid19/figure_5\.py --bs=1g --mpr=25g
Currently searching at 25000000000 (25.000gb) avg 123.181mb per second...
Currently searching at 50000000000 (50.000gb) avg 116.401mb per second...
Currently searching at 75000000000 (75.000gb) avg 115.780mb per second...
Currently searching at 100000000000 (100.000gb) avg 118.150mb per second...
...
Currently searching at 275000000000 (275.000gb) avg 111.233mb per second...
Currently searching at 300000000000 (300.000gb) avg 110.508mb per second...
Found file:///home/u8sand/Programs/work/covid19/2020-05-06/compare.py {"mtime":1588776710489,"ctime":1588770431049,"size":4571,"etag":"353053ces4ne","orphaned":false} at 303790122240 parse (to view/save)? [Y/n] n
Found file:///home/u8sand/Programs/work/covid19/2020-05-22/00_process.py {"mtime":1590162721521,"ctime":1590157118618,"size":7648,"etag":"3559llmjo7tm","orphaned":false} at 304167609600 parse (to view/save)? [Y/n] n
Found file:///home/u8sand/Programs/work/covid19/figure_5.py {"mtime":1587667607872,"ctime":1587567723331,"size":2476,"etag":"35142sq742hr","orphaned":false} at 331504145664 parse (to view/save)? [Y/n]
File size meta: 2476 (2.476kb), eof: 2537 (2.537kb), view? [Y/n]
#%%
import re
import pandas as pd
...
plt.savefig('figure_5.png', dpi=300)
plt.show()
Save to file (leave blank to not save)? fig5.py
...
If it keeps that rate it'll take around 2.5 hours to search a 1TB disk; I guess that's not too bad.