A gist describing a simple way to recover lost files from the contents of your disk with standard unix tools, when all else fails.

NOTE: The script and some information in this tutorial may be inaccurate; I'm under the impression my files were found because they were cached by vscode. Nonetheless, parts of the tutorial will work regardless, specifically if you know some of the contents of the file (strings, grep, dd).

Recovering important lost files from disk with unix tools

This is a last resort after trying testdisk's un-delete feature, which I find works quite well. In this situation, however, testdisk was unable to identify the files in the directory I was looking for. It helps to know a unique string in the file, but that is not strictly necessary.

If the file is really important, stop writing to that disk. If it's only kind of important, then depending on how full your disk is it may not matter if you're super lazy (I recovered this file a few days after the fact while still using my disk as normal, but you should not do this). You should also probably at least work from a different drive during the process (saving these dumps and such to a different drive).

Step 1: Locating an identifiable string in your file dump or disk

strings -t d your_device_or_raw_dump | grep -i "your_hopefully_rare_identifiable_string_or_file_uri" > saved_strings

It will take a while for strings to run, as it finds every printable string on the device and outputs it along with its decimal byte offset (-t d), line by line, something like:

12345 your_contiguous_string
...

Meaning you can then just grep this output to find your string (-i for case-insensitive). Your saved strings will be a subset of the strings output. The offset is what's important: it is the exact number of bytes from the start of your device or dump.

Dealing with file names / directory names

Note that if you remember only the filename or directory, that might work too in certain cases. Specifically, I was able to find files I was working on with vscode on ext4, where the file path was stored right before the file contents (probably thanks to vscode's cache, per the note above). Here is an example from when I was recovering with file:///home/u8sand/Programs/work/covid19/

13562414972             "file:///home/u8sand/Programs/work/covid19/2020-05-20",
13562416593             "file:///home/u8sand/Programs/work/covid19/2020-05-06",
13562421577             "file:///home/u8sand/Programs/work/covid19/2020-04-23/figure_4.py", # <- file i was looking for
13562421657             "file:///home/u8sand/Programs/work/covid19/2020-05-21/drugs.json",
13562421736             "file:///home/u8sand/Programs/work/covid19/2020-04-23/figure_5_2.png",
13562421819             "file:///home/u8sand/Programs/work/covid19/2020-04-23/data.tsv",
13562424093             "file:///home/u8sand/Programs/work/covid19/2020-05-06/03_signature_unification.py",
13562426081             "file:///home/u8sand/Programs/work/covid19/figure_5.py", # <- file i was looking for
13563069911             "file:///home/u8sand/Programs/work/covid19/2020-05-22",

In these strings dumps you can subsequently filter by the filename followed by a "{" (or just search interactively with less).
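
For example, a quick filter over the saved_strings file from step 1 (this assumes, as in my case, that the path is followed by a space and a "{"):

grep 'figure_[45]\.py {' saved_strings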

75631951872 file:///home/u8sand/Programs/work/covid19/figure_5.py {"mtime":1587662271099,"ctime":1587567723331,"size":2397,"etag":"3513od2ee2fa","orphaned":false}
75632504832 file:///home/u8sand/Programs/work/covid19/figure_4.py {"mtime":1587667729883,"ctime":1587567584887,"size":4554,"etag":"351434q9c4ms","orphaned":false}
75632582656 file:///home/u8sand/Programs/work/covid19/figure_4.py {"mtime":1587667729883,"ctime":1587567584887,"size":4554,"etag":"351434q9c4ms","orphaned":false}

These were the files I was looking for; these strings occurred directly before the entire file contents.

Step 2: Dumping a small contiguous part of your disk

Because most sane file-systems write files contiguously (when they aren't too big), you should be able to find your file in the contiguous region near the offset you got. It helps to have a rough idea of how large your file is: if it's, say, 1KB, then we can grab, say, 4KB of padded data just to be sure we get all of it. If you're searching by the filename technique, you know you're at the start, so you could pad just 100 bytes in front and more on the end.

WARNING: always be careful with dd; if you mix up the if and of directions you will likely be very sad, because you'll be writing over what you wanted to be reading. Recovering data that has been overwritten is ridiculously difficult, if possible at all; you won't find a tutorial on that: it involves physical techniques.

dd if=your_device_or_raw_dump of=filtered_dump.dd skip=$(expr your_strings_offset - 2000) count=4000 bs=1

dd is perfect for this; it lets you dump just the part you choose very quickly. With your dump file, you then need to make sure it has what you're looking for and trim the padding.
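
If bs=1 gets slow for a bigger window, a GNU coreutils dd can address exact byte offsets while still reading in larger blocks; a sketch of the same command with those flags (assuming GNU dd, where skip_bytes/count_bytes make skip and count byte counts rather than block counts):

dd if=your_device_or_raw_dump of=filtered_dump.dd iflag=skip_bytes,count_bytes skip=$(expr your_strings_offset - 2000) count=4000 bs=64K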

Step 3: Viewing your dump in Hex

Now that it's a workable size, we can just hex dump it and check it ourselves. xxd is a very simple hex dump utility, but any hex editor will do.

xxd filtered_dump.dd | less 
00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 0000 0000 6669 6c65 3a2f 2f2f 686f 6d65  ....file:///home
00000070: 2f75 3873 616e 642f 5072 6f67 7261 6d73  /u8sand/Programs
00000080: 2f77 6f72 6b2f 636f 7669 6431 392f 6669  /work/covid19/fi
00000090: 6775 7265 5f35 2e70 7920 7b22 6d74 696d  gure_5.py {"mtim
000000a0: 6522 3a31 3538 3736 3632 3237 3130 3939  e":1587662271099
000000b0: 2c22 6374 696d 6522 3a31 3538 3735 3637  ,"ctime":1587567
000000c0: 3732 3333 3331 2c22 7369 7a65 223a 3233  723331,"size":23
000000d0: 3937 2c22 6574 6167 223a 2233 3531 336f  97,"etag":"3513o
000000e0: 6432 6565 3266 6122 2c22 6f72 7068 616e  d2ee2fa","orphan
000000f0: 6564 223a 6661 6c73 657d 0a23 2525 0a69  ed":false}.#%%.i
00000100: 6d70 6f72 7420 7265 0a69 6d70 6f72 7420  mport re.import 
00000110: 7061 6e64 6173 2061 7320 7064 0a69 6d70  pandas as pd.imp
00000120: 6f72 7420 6e75 6d70 7920 6173 206e 700a  ort numpy as np.
00000130: 6672 6f6d 206d 6174 706c 6f74 6c69 6220  from matplotlib 
...
00000a40: 6727 2c20 6470 693d 3330 3029 0a70 6c74  g', dpi=300).plt
00000a50: 2e73 6176 6566 6967 2827 6669 6775 7265  .savefig('figure
00000a60: 5f35 2e70 6e67 272c 2064 7069 3d33 3030  _5.png', dpi=300
00000a70: 290a 706c 742e 7368 6f77 2829 0a00 0000  ).plt.show()....
00000a80: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000a90: 0000 0000 0000 0000 0000 0000 0000 0000  ................

Step 4: Restoring your file

There is my complete file. To grab it and throw away the rest of the junk, just take the hex start and end offsets; you can also remove any leftover junk afterwards, like the extra zeros. Alternatively you can try to use the "size" value in the metadata just before the file; I may make a script for parsing a file once you've found the offset of the metadata. span: 00000060 00000a80

Now we use dd again on the file dump

dd if=filtered_dump.dd of=fig5.py skip=$((16#00000060)) count=$(expr $((16#00000a80)) - $((16#00000060))) bs=1

The $((16#...)) is a bash conversion from hex to decimal, so we can do the calculations inline.
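
You can sanity-check the conversion by hand first:

echo $((16#00000060)) $((16#00000a80))   # prints 96 2688, so count is 2592 bytes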

We can now open it in vim or another decent text editor, remove the extra characters at the top and bottom, and save it. Done. Our file is restored!

Using hexedit

The curses-based 'hexedit' package can literally open a hard disk in hex mode and search for ASCII strings (note: you have to press Tab to switch the search mode). You can try that as well.
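
For example, pointing it straight at the device from earlier:

hexedit /dev/sdc1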

Using the lostfiles.py script

I wrote this script after making this tutorial because it seemed pretty simple and useful to do; in retrospect it probably only works for the unique vscode case. It's in this gist as well and can be used for the last step, or in fact the entire process (albeit a bit more slowly). Perhaps it could be rewritten in rust for speed. It works very well on the filtered_dump but could technically work on the original disk as well.

python lostfiles.py --file=filtered_dump.dd --search=file:///home/u8sand/Programs/work/covid19/figure_5\.py
python3 lostfiles.py --file=dump_fig5 --search=file:///home/u8sand/Programs/work/covid19/figure_5\.py
Found file:///home/u8sand/Programs/work/covid19/figure_5.py {"mtime":1587662271099,"ctime":1587567723331,"size":2397,"etag":"3513od2ee2fa","orphaned":false} at 100 parse (to view/save)? [Y/n] 
File size meta: 2397 (2.397kb), eof: 2434 (2.434kb), view? [Y/n] 
#%%
import re
import pandas as pd
...
plt.savefig('figure_5.png', dpi=300)
plt.show()

Save to file (leave blank to not save)? fig5.py
Done. avg 0.000b per second

On the whole disk

python3 lostfiles.py --file=/dev/sdc1 --search=file:///home/u8sand/Programs/work/covid19/figure_5\.py --bs=1g --mpr=25g
Currently searching at 25000000000 (25.000gb) avg 123.181mb per second...
Currently searching at 50000000000 (50.000gb) avg 116.401mb per second...
Currently searching at 75000000000 (75.000gb) avg 115.780mb per second...
Currently searching at 100000000000 (100.000gb) avg 118.150mb per second...
...
Currently searching at 275000000000 (275.000gb) avg 111.233mb per second...
Currently searching at 300000000000 (300.000gb) avg 110.508mb per second...
Found file:///home/u8sand/Programs/work/covid19/2020-05-06/compare.py {"mtime":1588776710489,"ctime":1588770431049,"size":4571,"etag":"353053ces4ne","orphaned":false} at 303790122240 parse (to view/save)? [Y/n] n
Found file:///home/u8sand/Programs/work/covid19/2020-05-22/00_process.py {"mtime":1590162721521,"ctime":1590157118618,"size":7648,"etag":"3559llmjo7tm","orphaned":false} at 304167609600 parse (to view/save)? [Y/n] n
Found file:///home/u8sand/Programs/work/covid19/figure_5.py {"mtime":1587667607872,"ctime":1587567723331,"size":2476,"etag":"35142sq742hr","orphaned":false} at 331504145664 parse (to view/save)? [Y/n]  
File size meta: 2476 (2.476kb), eof: 2537 (2.537kb), view? [Y/n] 
#%%
import re
import pandas as pd
...
plt.savefig('figure_5.png', dpi=300)
plt.show()

Save to file (leave blank to not save)? fig5.py
...

If it keeps up that rate it'll take around 2.5 hours to search a 1 TB disk; I guess that's not too bad.

''' lostfiles.py
This is part of a last-resort method to recover files on a disk. It has only been tested with ext4.
Though I use this myself to *recover* files, who knows if this script can cause data loss; I will not be held responsible, you've been warned.
The idea is:
1. you can open data dumps or disks in binary read mode
2. you can seek through that huge file searching for a file by file name; ext4 (or, more likely, vscode's backup cache) seems to store files along with their names like so:
   file:///your_path/your_file.txt {"mtime":1587662271099,"ctime":1587567723331,"size":2397,"etag":"3513od2ee2fa","orphaned":false}
3. if you provide `file:///your_path/your_file.txt`, we can attempt to locate this section, then parse the metadata and the file right after it,
   based on the reported 'size' and the EOF (the first null byte at least as far away as the reported file size; in practice I've found the overshoot to be small).
Because of this, it works best for raw text files. I never tried it on things like images, though it *might* work (just don't "view" it).
Python isn't the fastest language for seeking through your hard drive, so you may want to use `strings` + `grep` beforehand to find a close location on your hard disk.
That being said, I've not benchmarked it; it does report its current speed. If you do choose to use this on your entire hard disk, make sure you make
your buffer size pretty high (1MB ~ 1GB).
See `python3 lostfiles.py --help` or just read the script to see what parameters it accepts.
Byte values can be written in human form (base 10), e.g. --bs=64m --skip=100g
'''
import re
import sys
import time
import json
import click

def file_chunk_generator(fh, buf_size=4096):
    ''' Read a `file` in chunks of `buf_size`
    '''
    with fh as fr:
        while True:
            buf = fr.read(buf_size)
            if not buf:
                break
            yield buf

def human_bytes(b):
    if b >= int(1e15):
        return f"{b/1e15:.3f}pb"
    elif b >= int(1e12):
        return f"{b/1e12:.3f}tb"
    elif b >= int(1e9):
        return f"{b/1e9:.3f}gb"
    elif b >= int(1e6):
        return f"{b/1e6:.3f}mb"
    elif b >= int(1e3):
        return f"{b/1e3:.3f}kb"
    else:
        return f"{b:.3f}b"

human_byte_matcher = re.compile(r'^(\d+(\.\d+)?)(\w+)?$')

def from_human_bytes(s):
    m = human_byte_matcher.match(s)
    num = float(m.group(1))
    spec = m.group(3) or ''
    if spec.lower() in {'pb', 'p'}:
        mul = 1e15
    elif spec.lower() in {'tb', 't'}:
        mul = 1e12
    elif spec.lower() in {'gb', 'g'}:
        mul = 1e9
    elif spec.lower() in {'mb', 'm'}:
        mul = 1e6
    elif spec.lower() in {'kb', 'k'}:
        mul = 1e3
    elif spec.lower() in {'b', ''}:
        mul = 1
    else:
        raise Exception(f"Unrecognized {spec}")
    return int(num * mul)

@click.command()
@click.option('--file', help='File to search for lost files')
@click.option('--search', help='Unique file/directory file:// path to search for')
@click.option('--skip', default='0', help='Skip ahead number of bytes')
@click.option('--bs', default='4096', help='Buffer size to use when searching')
@click.option('--mpr', default='1g', help='Multiple of buffer size to report progress, default to every GB (0 to disable)')
def search(file, search, skip='0', bs='4096', mpr='1g'):
    # setup search: the file:// path followed by a json metadata blob
    matcher = re.compile(f"(?P<search>{search}) (?P<meta>{{.+}})".encode())
    # prepare args
    skip = from_human_bytes(skip) if type(skip) == str else skip
    bs = from_human_bytes(bs) if type(bs) == str else bs
    mpr = from_human_bytes(mpr) if type(mpr) == str else mpr
    # prepare file
    fr = open(file, 'rb', buffering=0)
    if skip:
        fr.seek(skip)
        offset = skip
    else:
        offset = 0
    # read in half-buffers and search last+current so matches spanning two reads are found
    it = iter(file_chunk_generator(fr, buf_size=bs // 2))
    # start searching
    last_chunk = b''
    input_time_wasted = 0
    start = time.time()
    try:
        while True:
            chunk = next(it)
            offset += len(chunk)
            full_chunk = last_chunk + chunk
            if mpr and (offset - skip) % int(mpr) == 0:
                print(f"Currently searching at {offset} ({human_bytes(offset)}) avg {human_bytes(float(offset - skip) / (time.time() - start - input_time_wasted))} per second...")
            m = matcher.search(full_chunk)
            if m is None:
                last_chunk = chunk
                continue
            else:
                start_input = time.time()
                # use separate names so we don't clobber the `start` timer above
                match_start, match_stop = m.span()
                # `offset` points just past `full_chunk`, so rewind to get the absolute position
                match_offset = offset - len(full_chunk) + match_start
                print(f"Found {m.group('search').decode()} {m.group('meta').decode()} at {match_offset} parse (to view/save)? [Y/n] ", end='')
                yn = input().strip()
                if yn != 'n':
                    meta = json.loads(m.group('meta'))
                    size = meta['size']
                    # the file contents follow the metadata; EOF is the first null byte
                    # at least `size` bytes in
                    remaining_chunk = full_chunk[match_stop+1:]
                    eof = None
                    while eof is None or eof < size:
                        try:
                            eof = remaining_chunk.index(b'\0', 0 if eof is None else eof + 1)
                        except ValueError:
                            # no null byte yet, keep reading
                            chunk = next(it)
                            offset += len(chunk)
                            remaining_chunk += chunk
                    file_contents = remaining_chunk[:eof]
                    print(f"File size meta: {size} ({human_bytes(size)}), eof: {eof} ({human_bytes(eof)}), view? [Y/n] ", end='')
                    yn = input().strip()
                    if yn != 'n':
                        try:
                            print(file_contents.decode())
                        except UnicodeDecodeError:
                            print(file_contents)
                    #
                    print('Save to file (leave blank to not save)? ', end='')
                    fn = input().strip()
                    if fn:
                        open(fn, 'wb').write(file_contents)
                    last_chunk = remaining_chunk[eof:]
                else:
                    # skipped this match, keep scanning just past it
                    last_chunk = full_chunk[match_stop:]
                # exclude time spent waiting on prompts from the reported throughput
                input_time_wasted += time.time() - start_input
    except StopIteration:
        pass
    except KeyboardInterrupt:
        print('Stopping...')
    #
    print(f"Done. avg {human_bytes(float(offset - skip) / (time.time() - start - input_time_wasted))} per second")

if __name__ == '__main__':
    search()