@yarko
Created August 7, 2020 04:30
Start looking at large mbox / gmail file
#!/usr/bin/env python
'''
separate_mbox [input_mbox] [filter_label]
'''
from collections import Counter
from pprint import pprint
from sys import argv, stdin, stdout, stderr
# write based on first label
LABELS = 'X-Gmail-Labels: '
after_tag = len(LABELS)
is_labels = lambda s: s.startswith(LABELS)
NEWMAIL = 'From '
is_newmail = lambda s: s.startswith(NEWMAIL)
# split mbox on stdin,
# filtering remaining to stdout
# and writing <<label_name>>.mbox as the split-out piece
# if there is no arg, then just do count & label dump on current input
filter_out = None
f_filtered = None
if len(argv) > 1:
    filter_out = argv.pop()   # use the last argv
    f_filtered = open(filter_out+".mbox", "x")
# if there is still an arg, use this as the input file
if len(argv) > 1:
    fin = open(argv[1], "r")
else:
    fin = stdin
labels = Counter()
n = 0
this_mail = []
filter_this = False   # defined up front so a mail without a labels header still works
for line in fin:
    #
    if is_newmail(line):
        n += 1
        if filter_out and this_mail:
            # write out this mail to the appropriate file
            outfile = f_filtered if filter_this else stdout
            outfile.writelines(this_mail)
        # start collecting the next mail; do not inherit the previous decision
        this_mail = []
        filter_this = False
    this_mail.append(line)
    if is_labels(line):   # count up all the labels
        these_labels = line[after_tag:].strip().split(',')
        labels.update(these_labels)
        filter_this = filter_out in these_labels
    '''
    if n & 0xfff == 0:
        print(f'\r{n:7}', end='', file=stderr)
    '''
# write out the last mbox entry:
if filter_out:
    outfile = f_filtered if filter_this else stdout
    outfile.writelines(this_mail)
print(f'\n{n:7} emails, {len(labels)} labels', file=stderr)
pprint(labels, stream=stderr)
if f_filtered: f_filtered.close()
if fin != stdin: fin.close()

yarko commented Aug 7, 2020

I had never processed an mbox file before. After many, many years, my Gmail archive had grown to nearly 10 GB:

 -rw-r--r-- 1 yarko yarko 9978202677 Apr  8 17:58 All_mail_Including_Spam_and_Trash.mbox

and was overloading my Gmail account.

I wanted to scan the file to identify which items I could delete and, more interestingly, which filters I could set up so that I could manage "expiring" semi-important mail myself after some amount of time. (Google expires and deletes the "spam" folder after 30 days, so that seemed like a good start.)
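
The age check behind that kind of expiry might look roughly like the sketch below. It is not part of the script above: `is_expired`, `max_age_days`, and `now` are names made up for illustration, and it assumes the value of a message's `Date:` header has already been extracted as a string.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def is_expired(date_header, max_age_days=30, now=None):
    """Return True if a Date: header value is older than max_age_days.

    Hypothetical helper; unparseable dates are treated as "keep".
    """
    now = now or datetime.now(timezone.utc)
    try:
        sent = parsedate_to_datetime(date_header.strip())
    except (TypeError, ValueError):
        return False  # cannot tell the age, so do not expire it
    if sent.tzinfo is None:
        # headers with a "-0000" zone parse as naive; assume UTC
        sent = sent.replace(tzinfo=timezone.utc)
    return now - sent > timedelta(days=max_age_days)
```

The 30-day default only mirrors the spam-folder behaviour mentioned above; a real policy would probably pick a different age per label.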

My idea was to start by splitting the mail up by the many labels I had accumulated over the years.

But the processing either hung or was very slow. Which?
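
One cheap way to tell "hung" from "slow" is a progress counter on stderr, along the lines of the disabled block in the script. The following is only a sketch under assumptions: `iter_with_progress` and its `every`, `out`, and `clock` parameters are invented here, and it counts a message each time a line starts with `From `, as the script does.

```python
import sys
import time


def iter_with_progress(lines, every=4096, out=sys.stderr, clock=time.monotonic):
    """Yield lines unchanged, printing a running mail count and rate to `out`."""
    start = clock()
    n = 0
    for line in lines:
        if line.startswith('From '):
            n += 1
            if n % every == 0:
                elapsed = clock() - start
                rate = n / elapsed if elapsed > 0 else 0.0
                # '\r' rewrites the same terminal line on each update
                print(f'\r{n:9} mails, {rate:8.0f}/s', end='', file=out, flush=True)
        yield line
```

Wrapping the input, as in `for line in iter_with_progress(fin):`, would leave the rest of the loop untouched; a counter that keeps moving means slow, one that stops means stuck.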

I tried an assortment of Python profilers. They either made things worse and provided no useful results, or crashed themselves.

I found scalene, and finally started to get some answers.

(I have not completed this email filtering / expiry project to date. Because of imminent space constraints, I manually deleted some millions of old emails.)

In any case, scalene worked when I used it. Maybe I'll pick this up again "real soon".
