Created: February 17, 2017 05:42
Source: http://www.winwaed.com/blog/2012/04/09/calculating-word-statistics-from-the-gutenberg-corpus/
Following on from the previous article about scanning text files for word statistics, I shall extend this to real, large corpora. First, we shall use this script to create statistics for the entire Gutenberg English language corpus. Next, I shall do the same with the entire English language Wikipedia.
Project Gutenberg is a non-profit project to digitize books whose copyright has expired. It is noted for including large numbers of classic texts and novels, all of which are available for free online. Although the site could be spidered, this is strongly discouraged due to the project's limited resources. Instead, a CD or DVD image can be downloaded (or purchased), or you can create a mirror. Instructions for doing this are here, and include the following recommended shell command:
rsync -avHS --delete --delete-after --exclude '*cache/generated' ftp@ftp.ibiblio.org::gutenberg /home/ftp/pub/mirrors/gutenberg
I would strongly recommend using an even more restrictive command that excludes everything except directories and files with the ".txt" suffix. If you do not, the rsync command will download a wide range of miscellaneous binary files, including bitmaps, ISOs, and RARs; some of these are very large. Even if you have the time, disk space, and bandwidth, it is definitely not polite to stress Project Gutenberg's servers unnecessarily.
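The article does not give the restrictive command itself, so the following is only a sketch of one possible "text only" fetch (an assumption, not the author's exact invocation). rsync filter rules match in order, so the cache exclusion comes first, directories are admitted so rsync can recurse, `.txt` files are admitted, and everything else is rejected:

```shell
# Sketch of a text-only Gutenberg mirror - verify against the current
# mirroring instructions before running; this transfers a lot of data.
rsync -avHS --delete --delete-after \
      --exclude '*cache/generated' \
      --include '*/' --include '*.txt' --exclude '*' \
      ftp@ftp.ibiblio.org::gutenberg /home/ftp/pub/mirrors/gutenberg
```

Because the first matching rule wins, reordering these filters (e.g. putting `--exclude '*'` first) would silently skip everything, so the order above matters.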
A restrictive "text only" fetch will probably still take a few hours to run. It will also create a complex directory hierarchy. This includes a range of unnecessary text files (e.g. readme files, and the complete Human Genome), and many texts are present in multiple copies in different formats (typically ASCII and 8 bit ISO-8859-1). The next job is to tidy this up by copying all unique texts into one large, flat directory. I chose to use the ISO-8859-1 format where available, and plain ASCII where it wasn't. This enables the correct form to be used for accented words that have been adopted into the English language (e.g. "déjà vu"). This copy-and-filter process is performed using a Python script.
Each Gutenberg text also includes a standard header and footer. We do not wish to include this information in the statistics, so it needs to be stripped off. For efficiency, the header and footer are removed by the same script. This also has the advantage that descriptive files (e.g. readme files) lack the header/footer markers and are automatically skipped.
IMPORTANT NOTE: The Gutenberg license prohibits the distribution of their texts without the header and footer. Do not distribute the texts in this form. If there is any chance you will be sharing your files (e.g. with colleagues), then the headers and footers should be kept intact.
Here is the script. Compared to the download, it runs quite quickly, and the final directory should be much easier to process.
The resulting frequency table (created using the word frequency table scripts) can be downloaded as a GZIP file (28.5 MB).
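The word frequency table scripts themselves are not reproduced here, but their core idea can be sketched in a few lines. This is a minimal sketch under stated assumptions (naive lower-case, letters-plus-apostrophe tokenisation); the real scripts are more careful about punctuation and encodings:

```python
import collections
import os
import re

def count_words(text, counter=None):
    # Add the words of one text to a frequency counter.
    # Tokenisation is a deliberate simplification: lower-case the text
    # and keep runs of letters, optionally with an internal apostrophe.
    if counter is None:
        counter = collections.Counter()
    counter.update(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    return counter

def count_directory(dirpath):
    # Build one frequency table over every .txt file in the flat
    # gt_raw directory produced by the script below.
    counter = collections.Counter()
    for fname in os.listdir(dirpath):
        if fname.endswith(".txt"):
            with open(os.path.join(dirpath, fname), encoding="latin-1") as f:
                count_words(f.read(), counter)
    return counter

# Example on a small string:
freqs = count_words("The cat sat on the mat. The cat slept.")
print(freqs.most_common(2))  # [('the', 3), ('cat', 2)]
```

Accumulating into one `collections.Counter` keeps memory proportional to the vocabulary, not the corpus, which is why a single in-memory table is feasible for plain words.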
Note that although the word frequency table scripts could easily be modified to process N-grams, the sheer size of the Gutenberg dataset will prove a challenge. I shall address this in a future article. A similar process can be used to create a word frequency table for the English language Wikipedia database.
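For reference, the N-gram modification itself is mechanically simple; the sketch below (assuming the same naive tokenisation as the frequency example above) shows why the difficulty is purely one of scale - the number of distinct N-grams grows far faster than the number of distinct words:

```python
import collections
import re

def count_ngrams(text, n=2):
    # Count N-grams (as word tuples) in a single text.
    # Naive lower-case, letters-only tokenisation. A production script
    # would need to spill partial counts to disk, since the full
    # Gutenberg N-gram table will not fit in memory.
    words = re.findall(r"[a-z]+", text.lower())
    return collections.Counter(
        tuple(words[i:i + n]) for i in range(len(words) - n + 1)
    )

bigrams = count_ngrams("the cat sat on the mat")
print(bigrams[("the", "cat")])  # 1
```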
# Loops over all of the Gutenberg text files in /var/bigdisk/gutenberg/gt_text,
# extracting unique texts and copying them to ./gt_raw.
# Removes header and footer information - leaving just the text, ready for
# statistical processing.
#
# Usage: Be sure to change these paths to point to the relevant directories on your system.

import os

# The logic for keeping a file based on its name is put
# into a function to improve readability.
# fname = name of file without path or file extension
def keep_file(fname):
    # Filter out readme, info, notes, etc.
    if fname.lower().find("readme") != -1:
        return False
    if fname.find(".zip.info") != -1:
        return False
    if fname.find("pnote") != -1:
        return False
    # Filter out the Human Genome (etexts 2201-2224)
    if len(fname) == 4:
        try:
            n = int(fname)
            if 2201 <= n <= 2224:
                print("*** Genome skipped:", n)
                return False  # Human Genome
        except ValueError:
            pass  # not a numeric name - fall through and keep it
    # Looks good => keep this file
    return True

# Empty the output directory
outputdir = "/var/bigdisk/gutenberg/gt_raw"
for f in os.listdir(outputdir):
    fpath = os.path.join(outputdir, f)
    try:
        if os.path.isfile(fpath):
            os.unlink(fpath)
    except Exception as e:
        print(e)

# Recursively walk the entire directory tree, finding all .txt files which
# are not in 'old' or '-h' (HTML) sub-directories.
for dirname, dirnames, filenames in os.walk('/var/bigdisk/gutenberg/gt_text'):
    if dirname.find('old') == -1 and dirname.find('-h') == -1:
        # Some texts are present in several encodings; copy only one.
        # The -8 suffix (8 bit ISO-8859-1) takes priority over files with
        # no suffix or a -0 suffix (simple ASCII).
        # Auxiliary files (names containing pnote or .zip.info) were
        # already rejected by keep_file().
        flist = []
        flist_toremove = []
        for fname in filenames:
            fbase, fext = os.path.splitext(fname)
            if fext == '.txt':
                if keep_file(fbase):
                    flist.append(fname)
                if fname.endswith("-8.txt"):
                    # -8 takes priority => remove any duplicates
                    flist_toremove.append(fname[:-6] + ".txt")
                    flist_toremove.append(fname[:-6] + "-0.txt")
        flist_to_proc = [i for i in flist if i not in flist_toremove]

        # flist_to_proc now contains the files to copy.
        # Loop over them, copying line-by-line between the Gutenberg
        # header/footer markers. A file with no header marker is skipped.
        for f in flist_to_proc:
            infile = os.path.join(dirname, f)
            outfile = os.path.join(outputdir, f)
            bCopying = False
            # latin-1 passes ISO-8859-1 bytes through unchanged and
            # never raises a decoding error on plain ASCII files
            for line in open(infile, encoding="latin-1"):
                if not bCopying:
                    if line.startswith("*** START OF THIS PROJECT GUTENBERG EBOOK"):
                        fout = open(outfile, "w", encoding="latin-1")
                        print("Copying: " + f)
                        bCopying = True
                else:
                    if line.startswith("*** END OF THIS PROJECT GUTENBERG EBOOK"):
                        fout.close()
                        bCopying = False
                    else:
                        fout.write(line)