Created: February 17, 2017 05:42
Source: http://www.winwaed.com/blog/2012/04/09/calculating-word-statistics-from-the-gutenberg-corpus/
Following on from the previous article about scanning text files for word statistics, I shall extend this to real, large corpora. First, we shall use this script to create statistics for the entire Gutenberg English language corpus. Next, I shall do the same with the entire English language Wikipedia.
Project Gutenberg is a non-profit project to digitize books whose copyright has expired. It is noted for including large numbers of classic texts and novels, all of which are available for free online. Although the site could be spidered, this is strongly discouraged due to the project's limited resources. Instead, a CD or DVD image can be downloaded (or purchased), or you can create a mirror. Instructions for doing this are here, and include the following recommended shell command:
rsync -avHS --delete --delete-after --exclude '*cache/generated' ftp@ftp.ibiblio.org::gutenberg /home/ftp/pub/mirrors/gutenberg
I would strongly recommend using an even more restrictive command that excludes everything except directories and files with the ".txt" suffix. If you do not, the rsync command will download a wide range of miscellaneous binary files, including bitmaps, ISOs, and RARs; some of these are very large. Even if you have the time, disk space, and bandwidth, it is definitely not polite to stress Project Gutenberg's servers unnecessarily.
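The article does not give the restrictive command itself, so the following is only a sketch of one possible "text only" fetch (an assumption, not the author's exact invocation). rsync filter rules match in order, so the cache exclusion comes first, directories are admitted so rsync can recurse, `.txt` files are admitted, and everything else is rejected:

```shell
# Sketch of a text-only Gutenberg mirror - verify against the current
# mirroring instructions before running; this transfers a lot of data.
rsync -avHS --delete --delete-after \
      --exclude '*cache/generated' \
      --include '*/' --include '*.txt' --exclude '*' \
      ftp@ftp.ibiblio.org::gutenberg /home/ftp/pub/mirrors/gutenberg
```

Because the first matching rule wins, reordering these filters (e.g. putting `--exclude '*'` first) would silently skip everything, so the order above matters.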
A restrictive "text only" fetch will probably still take a few hours to run. It will also create a complex directory hierarchy. This includes a range of unnecessary text files (e.g. readme files, and the complete Human Genome), and many texts are present in multiple copies in different formats (typically ASCII and 8 bit ISO-8859-1). The next job is to tidy this up by copying all unique texts into one large, flat directory. I chose to use the ISO-8859-1 format where available, and plain ASCII where it wasn't. This enables the correct form to be used for accented words that have been adopted into the English language (e.g. "déjà vu"). This copy-and-filter process is performed using a Python script.
Each Gutenberg text also includes a standard header and footer. We do not wish to include this information in the statistics, so it needs to be stripped off. For efficiency, the header and footer are removed by the same script. This also has the advantage that descriptive files (e.g. readme files) lack the header/footer markers and are automatically skipped.
IMPORTANT NOTE: The Gutenberg license prohibits the distribution of their texts without the header and footer. Do not distribute the texts in this form. If there is any chance you will be sharing your files (e.g. with colleagues), then the headers and footers should be kept intact.
Here is the script. Compared to the download, it runs quite quickly, and the final directory should be much easier to process.
The resulting frequency table (created using the word frequency table scripts) can be downloaded as a GZIP file (28.5 MB).
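The word frequency table scripts themselves are not reproduced here, but their core idea can be sketched in a few lines. This is a minimal sketch under stated assumptions (naive lower-case, letters-plus-apostrophe tokenisation); the real scripts are more careful about punctuation and encodings:

```python
import collections
import os
import re

def count_words(text, counter=None):
    # Add the words of one text to a frequency counter.
    # Tokenisation is a deliberate simplification: lower-case the text
    # and keep runs of letters, optionally with an internal apostrophe.
    if counter is None:
        counter = collections.Counter()
    counter.update(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    return counter

def count_directory(dirpath):
    # Build one frequency table over every .txt file in the flat
    # gt_raw directory produced by the script below.
    counter = collections.Counter()
    for fname in os.listdir(dirpath):
        if fname.endswith(".txt"):
            with open(os.path.join(dirpath, fname), encoding="latin-1") as f:
                count_words(f.read(), counter)
    return counter

# Example on a small string:
freqs = count_words("The cat sat on the mat. The cat slept.")
print(freqs.most_common(2))  # [('the', 3), ('cat', 2)]
```

Accumulating into one `collections.Counter` keeps memory proportional to the vocabulary, not the corpus, which is why a single in-memory table is feasible for plain words.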
Note that although the word frequency table scripts could easily be modified to process N-grams, the sheer size of the Gutenberg dataset will prove a challenge. I shall address this in a future article. A similar process can be used to create a word frequency table for the English language Wikipedia database.
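For reference, the N-gram modification itself is mechanically simple; the sketch below (assuming the same naive tokenisation as the frequency example above) shows why the difficulty is purely one of scale - the number of distinct N-grams grows far faster than the number of distinct words:

```python
import collections
import re

def count_ngrams(text, n=2):
    # Count N-grams (as word tuples) in a single text.
    # Naive lower-case, letters-only tokenisation. A production script
    # would need to spill partial counts to disk, since the full
    # Gutenberg N-gram table will not fit in memory.
    words = re.findall(r"[a-z]+", text.lower())
    return collections.Counter(
        tuple(words[i:i + n]) for i in range(len(words) - n + 1)
    )

bigrams = count_ngrams("the cat sat on the mat")
print(bigrams[("the", "cat")])  # 1
```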
# Loops over all of the Gutenberg text files in /var/bigdisk/gutenberg/gt_text,
# extracting unique texts and copying them to ./gt_raw.
# Removes header and footer information - leaving just the text, ready for
# statistical processing.
#
# Usage: Be sure to change these paths to point to the relevant directories on your system.

import os

# The logic for keeping a file based on its name is put
# into a function to improve readability.
# fname = name of file without path or file extension
def keep_file(fname):
    # Filter out readme, info, notes, etc.
    if fname.lower().find("readme") != -1:
        return False
    if fname.find(".zip.info") != -1:
        return False
    if fname.find("pnote") != -1:
        return False
    # Filter out the Human Genome (etexts 2201-2224)
    if len(fname) == 4:
        try:
            n = int(fname)
            if 2201 <= n <= 2224:
                print("*** Genome skipped:", n)
                return False  # Human Genome
        except ValueError:
            pass  # not a numeric name - fall through and keep it
    # Looks good => keep this file
    return True

# Empty the output directory
outputdir = "/var/bigdisk/gutenberg/gt_raw"
for f in os.listdir(outputdir):
    fpath = os.path.join(outputdir, f)
    try:
        if os.path.isfile(fpath):
            os.unlink(fpath)
    except Exception as e:
        print(e)

# Recursively walk the entire directory tree, finding all .txt files which
# are not in 'old' or '-h' (HTML) sub-directories.
for dirname, dirnames, filenames in os.walk('/var/bigdisk/gutenberg/gt_text'):
    if dirname.find('old') == -1 and dirname.find('-h') == -1:
        # Some texts are present in several encodings; copy only one.
        # The -8 suffix (8 bit ISO-8859-1) takes priority over files with
        # no suffix or a -0 suffix (simple ASCII).
        # Auxiliary files (names containing pnote or .zip.info) were
        # already rejected by keep_file().
        flist = []
        flist_toremove = []
        for fname in filenames:
            fbase, fext = os.path.splitext(fname)
            if fext == '.txt':
                if keep_file(fbase):
                    flist.append(fname)
                if fname.endswith("-8.txt"):
                    # -8 takes priority => remove any duplicates
                    flist_toremove.append(fname[:-6] + ".txt")
                    flist_toremove.append(fname[:-6] + "-0.txt")
        flist_to_proc = [i for i in flist if i not in flist_toremove]

        # flist_to_proc now contains the files to copy.
        # Loop over them, copying line-by-line between the Gutenberg
        # header/footer markers. A file with no header marker is skipped.
        for f in flist_to_proc:
            infile = os.path.join(dirname, f)
            outfile = os.path.join(outputdir, f)
            bCopying = False
            # latin-1 passes ISO-8859-1 bytes through unchanged and
            # never raises a decoding error on plain ASCII files
            for line in open(infile, encoding="latin-1"):
                if not bCopying:
                    if line.startswith("*** START OF THIS PROJECT GUTENBERG EBOOK"):
                        fout = open(outfile, "w", encoding="latin-1")
                        print("Copying: " + f)
                        bCopying = True
                else:
                    if line.startswith("*** END OF THIS PROJECT GUTENBERG EBOOK"):
                        fout.close()
                        bCopying = False
                    else:
                        fout.write(line)