simplerethink

Creating Shakespearean n-grams with just the command-line and regexes

This is a quick example showing how to use regexes to find trigrams in Shakespeare... well, 570,872 of them, anyway, after some basic filtering of non-dialogue.

Though tokenization and n-gram extraction should typically be done with a proper natural language processing framework, both can be done in a jiffy from the command line, using standard Unix tools and ack, the better-than-grep utility.
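As a rough illustration of the idea, here is a minimal sketch that builds trigrams with nothing but `tr` and `awk`. It is not the ack-based regex approach this post describes: the `trigrams` function name is made up for the example, the tokenization (letters and apostrophes only, lowercased) is an assumption, and it does no filtering of non-dialogue.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): read text on stdin
# and print one lowercase trigram per line.
trigrams() {
  # Turn every run of characters that are not letters or apostrophes
  # into a single newline, giving one word per line.
  tr -cs "[:alpha:]'" '\n' |
    # Lowercase everything so "To" and "to" count as the same token.
    tr '[:upper:]' '[:lower:]' |
    # Slide a three-word window over the stream: a and b hold the two
    # previous words, $0 is the current one. NF skips empty lines.
    awk 'NF { if (a != "" && b != "") print a, b, $0; a = b; b = $0 }'
}

# Usage: count the trigrams in a sample line, most frequent first.
printf 'To be, or not to be: that is the question.\n' |
  trigrams | sort | uniq -c | sort -rn | head
```

Because the words are treated as one continuous stream, trigrams here can span line breaks and speaker changes, which a per-line regex would not do. Filtering out stage directions and character names beforehand would be needed for a meaningful count.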

What are n-grams?