Created August 7, 2020 04:30
Start looking at large mbox / gmail file
#!/usr/bin/env python
'''
separate_mbox [input_mbox] [filter_label]
'''
from collections import Counter
from pprint import pprint
from sys import argv, stdin, stdout, stderr

# write based on first label
LABELS = 'X-Gmail-Labels: '
after_tag = len(LABELS)
is_labels = lambda s: s.startswith(LABELS)
NEWMAIL = 'From '
is_newmail = lambda s: s.startswith(NEWMAIL)

# split mbox on stdin,
#   filtering remaining to stdout
#   and writing <<label_name>>.mbox as the split-out piece
# if there is no arg, then just do count & label dump on current input
filter_out = None
f_filtered = None
if len(argv) > 1:
    filter_out = argv.pop()  # use the last argv
    # mode "x" fails if <<label_name>>.mbox already exists
    f_filtered = open(filter_out+".mbox", "x")
# if there is still an arg, use this as the input file
if len(argv) > 1:
    fin = open(argv[1], "r")
else:
    fin = stdin

labels = Counter()
n = 0
this_mail = []
filter_this = False  # stays False for mail without an X-Gmail-Labels header
for line in fin:
    #
    if is_newmail(line):
        n += 1
        if filter_out and this_mail:
            # write out this mail to the appropriate file
            outfile = f_filtered if filter_this else stdout
            outfile.writelines(this_mail)
        # start collecting the next mail
        this_mail = []
        filter_this = False
    this_mail.append(line)
    if is_labels(line):  # count up all the labels
        these_labels = line[after_tag:].strip().split(',')
        labels.update(these_labels)
        filter_this = filter_out in these_labels
    '''
    if n & 0xfff == 0:
        print(f'\r{n:7}', end='', file=stderr)
    '''

# write out the last mbox entry:
if filter_out:
    outfile = f_filtered if filter_this else stdout
    outfile.writelines(this_mail)

print(f'\n{n:7} emails, {len(labels)} labels', file=stderr)
pprint(labels, stream=stderr)

if f_filtered:
    f_filtered.close()
if fin is not stdin:
    fin.close()
I had never processed an mbox file before. After many, many years, my Gmail mailbox had grown to nearly 10 GB and was overloading my account.
I wanted to scan the file to identify which items I could delete and, more interestingly, which filters I could set up, so that I could manage "expiring" semi-important mail myself after some amount of time (Google expires and deletes the "spam" folder after 30 days, so that seemed like a good start).
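A minimal sketch of what such an expiry check might look like, assuming each message carries a standard `Date:` header. The function name `is_expired` and the `max_age_days` parameter are hypothetical, not part of the script above, and the 30-day default only mirrors the spam-folder behavior mentioned here.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def is_expired(date_header, now, max_age_days=30):
    """Return True if the Date header value is older than max_age_days.

    Hypothetical helper: name and parameters are illustrative only.
    """
    sent = parsedate_to_datetime(date_header)
    if sent.tzinfo is None:
        # A "-0000" zone parses as a naive datetime; treat it as UTC here.
        sent = sent.replace(tzinfo=timezone.utc)
    return now - sent > timedelta(days=max_age_days)


now = datetime(2020, 8, 7, tzinfo=timezone.utc)
print(is_expired("Mon, 06 Jul 2020 10:00:00 +0000", now))
print(is_expired("Mon, 03 Aug 2020 10:00:00 +0000", now))
```

Malformed `Date:` headers make `parsedate_to_datetime` raise an exception, so a real pass over an old mailbox would probably need to catch that and skip or flag the message.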
My idea was to start by splitting the mail up by the many labels I had accumulated over the years.
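The label-counting step can be tried on a few lines before running it against the full export. This sketch uses the same `X-Gmail-Labels: ` prefix as the script; the function name `count_labels` and the sample messages are made up for illustration.

```python
from collections import Counter

LABELS = 'X-Gmail-Labels: '


def count_labels(lines):
    """Count label occurrences across all X-Gmail-Labels header lines.

    Hypothetical helper; works on any iterable of lines, so a file
    object can be streamed through it without loading it into memory.
    """
    labels = Counter()
    for line in lines:
        if line.startswith(LABELS):
            labels.update(line[len(LABELS):].strip().split(','))
    return labels


# Made-up sample data in mbox layout: each mail starts with a "From " line.
sample = [
    'From 123@xxx Fri Aug 07 04:30:00 2020\n',
    'X-Gmail-Labels: Inbox,Important\n',
    'Subject: hello\n',
    '\n',
    'body\n',
    'From 456@xxx Fri Aug 07 04:31:00 2020\n',
    'X-Gmail-Labels: Inbox,Spam\n',
    '\n',
]
print(count_labels(sample).most_common())
```

Like the script, this does not handle a label header folded across several lines or a body line that happens to start with the same prefix.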
But the processing either hung or was very slow. Which?
I tried an assortment of profilers for Python. They either made things worse and provided no useful results, or crashed themselves.
I found scalene, and finally started to get some answers. (I have not completed this email filtering / expiry project to date: because of imminent space constraints, I manually deleted some millions of old emails.)
In any case, scalene worked when I used it. Maybe I'll pick this up again "real soon".