Skip to content

Instantly share code, notes, and snippets.

@t-flora
Created April 15, 2020 03:04
Show Gist options
  • Save t-flora/42503fa1d74dc809e16e8db88b6fd377 to your computer and use it in GitHub Desktop.
Save t-flora/42503fa1d74dc809e16e8db88b6fd377 to your computer and use it in GitHub Desktop.
CS110 - Huffman Codes & File Compression
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\\rightarrow$Run All).\n",
"\n",
"Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\", as well as your name and collaborators below:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"NAME = \"Tiago Flora\"\n",
"COLLABORATORS = \"Chloe Go\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "66e4a5cae0e361c97d6c83e84754f7bb",
"grade": false,
"grade_id": "cell-aedfb9351c4dd419",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"# CS110 Pre-class Work 14.1\n",
"\n",
"In this Pre-class work we will apply Huffman's algorithm in file compression. \n",
"## Question 1.\n",
"Below is the utility function for downloading a text file from a URL. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "b4cfc6776ba647ea042e50771b3d8483",
"grade": false,
"grade_id": "cell-b84ee6c499c87372",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"from urllib.request import urlopen\n",
"import shutil\n",
"import gzip\n",
"import os\n",
"\n",
"\n",
"# Download the file if need be:\n",
"def download_file(url, filename):\n",
" if not os.path.exists(filename):\n",
" response = urlopen(url + filename)\n",
" shutil.copyfileobj(\n",
" gzip.GzipFile(fileobj=response), open(filename+'.txt', 'wb'))\n",
"\n",
"url = \"http://www.gutenberg.org/ebooks/\"\n",
"filename = \"100.txt.utf-8\"\n",
"\n",
"download_file(url, filename) "
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "8a077d16d8c9d48ba47b01467d9fd9c2",
"grade": false,
"grade_id": "cell-c2e379e275938e68",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"Your tasks:\n",
"\n",
"1. Run the cell so that the file \"100.txt.utf-8\" is downloaded to your local machine. Please allow some time for the code to complete.\n",
"2. Check that the file \"100.txt.utf-8\" has been downloaded to your computer.\n",
"3. Open and view the file with your favorite text editor. \n",
"4. In the cell below, write down the size of the downloaded file (for example, 1.2GB)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "f18750524ace65719b91a60e8d9a43b8",
"grade": true,
"grade_id": "cell-b83cd656878410be",
"locked": false,
"points": 0,
"schema_version": 1,
"solution": true
}
},
"source": [
"5.59 MB"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "80849f8ac566ae2fe64a35cc8c8e882d",
"grade": false,
"grade_id": "cell-5ffca685c11b6b4d",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Question 2. \n",
"Now, as a bit of an interlude, we will get familiar with the `bitarray` Python library, which is helpful for completing this pre-class work. Go to this [link](https://pypi.org/project/bitarray/) and read the examples in the first three cell of section **Using the module**. Once you complete the reading task, please complete the function `get_bit_array` in the cell below."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "ec3c2a880c1c14e2d359fac0dada1c28",
"grade": false,
"grade_id": "cell-7ba9ec08bcd38455",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"from bitarray import bitarray\n",
"\n",
"class Node(object):\n",
" \"\"\"A node in a binary tree that represents a prefix code.\"\"\"\n",
" def __init__(self, freq, symb, parent = None, lchild = None, rchild = None):\n",
" \"\"\"\n",
" - freq: float, frequency of the character\n",
" - symb: string, a character in the file \n",
" - parent: a node, parent of the current node in the tree\n",
" - lchild, rchild: left child node and right child node of the current node \n",
" in the tree.\n",
" \"\"\"\n",
" self.freq = freq\n",
" self.symb = symb\n",
" self.parent = parent\n",
" self.lchild = lchild\n",
" self.rchild = rchild\n",
" \n",
" def __lt__(self, other):\n",
" \"\"\"\n",
" nodeA < nodeB returns True if nodeA.freq < nodeB.freq. This enables\n",
" us to use the module heapq on nodes. In other words, with this function,\n",
" we can now push/insert a node into a heap as well as extract/pop the \n",
" minimum node from a heap. We studied the module heapq before. You can brush\n",
" up your memory by visiting this link: \n",
" https://docs.python.org/3.0/library/heapq.html\n",
" \n",
" \"\"\"\n",
" return self.freq < other.freq\n",
" \n",
"def get_bitarray(node):\n",
" \"\"\"\n",
" Given a node in the tree, determines the corresponding codeword for character\n",
" node.symb, using the rule in Cormen et al.: the binary codeword for a character \n",
" is the simple path from the root to that character, where 0 means “go to the \n",
" left child” and 1 means “go to the right child.\n",
" \n",
" Inputs:\n",
" - node: a node, whose codeword represented by the tree is of interest\n",
" \n",
" Outputs:\n",
" - a: a bit array that represents the codeword. For example, if the codeword is \n",
" 01001, then a is bitarray('01001')\n",
" \n",
" \"\"\"\n",
" # Initialize a list that we will later turn into a bitarray\n",
" a = []\n",
" \n",
" # We traverse the binary tree from the leaves to the root, and so we stop the loop when we reach the latter\n",
" while node.parent:\n",
" if node == node.parent.rchild:\n",
" # If the node is a righ child, append a value to 'a' that will evaluate to 1\n",
" a.append(\"1\")\n",
" else:\n",
" # If the node is a left child, append a value that will evaluate to 0, like an empty string\n",
" a.append(\"\")\n",
" node = node.parent\n",
" \n",
" # Becayse we traversed the tree from the leaves to the root, we reverse the order of the list to find the correct bitarray\n",
" a = a[::-1]\n",
" return bitarray(a)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "2fe3268d072490748502ac5657baf039",
"grade": false,
"grade_id": "cell-30c0c74327992b26",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Question 3. \n",
"Complete the following function that builds a Huffman code, making use of `get_bitarray` and the module `heapq`. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "ff946fc33af51c13fefcf767c4f91292",
"grade": false,
"grade_id": "cell-a99a125161e64aac",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"import heapq\n",
"def encode(symb2freq):\n",
" \"\"\"\n",
" Huffman encode the given dict mapping symbols to weights. \n",
" \n",
" Inputs:\n",
" - symb2freq: a dictionary that maps a symbol/character to\n",
" the probability frequency in the text file. For example,\n",
" {'a': .3, 'b':.6, 'c': .1}. This example symb2freq means \n",
" that symbol 'a' appears with frequency 30%, symbol 'b' 60%, \n",
" and symbol 'c' 10%.\n",
" \n",
" Outputs:\n",
" - out: a dictionary that maps a symbol/charcater to a bitarray\n",
" that represents the codeword for that symbol. For example,\n",
" out = {'a': bitarray('01'), 'b': bitarray('11'), 'c': bitarray('101')}.\n",
" \"\"\" \n",
" # Define a list containing nodes for all entries in the input dictionary\n",
" heap_t = [Node(freq, char) for char, freq in symb2freq.items()]\n",
" \n",
" # Heapify the list according to their frequencies\n",
" heapq.heapify(heap_t)\n",
" \n",
" # Create a copy of our heap to work as priority queue we will keep adding to and removing from\n",
" Q = heap_t[:]\n",
" \n",
" # While there are items in the priority queue, we remove the nodes with add the parents of the nodes with \n",
" # the smallest values to the queue\n",
" while len(Q) > 1:\n",
" left = heapq.heappop(Q)\n",
" right = heapq.heappop(Q)\n",
" \n",
" z = Node(left.freq + right.freq, left.symb + right.symb, lchild = left, rchild = right)\n",
" \n",
" left.parent = z\n",
" right.parent = z\n",
" \n",
" heapq.heappush(Q, z)\n",
" \n",
"# bitarrs = [get_bitarray(i) for i in heap_t] \n",
"# out = dict(zip(heap_t, bitarrs))\n",
"\n",
"# The lines above didn't work for some reason\n",
"\n",
" # Initialize the dictionary, and convert all characters in the heap to bitarrays\n",
" out = {}\n",
" for i in heap_t:\n",
" out[i.symb] = get_bitarray(i)\n",
" \n",
" return out"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "cfdc434c6f29f0186e0bcadcec736d83",
"grade": false,
"grade_id": "cell-fa651be15a7f3559",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"# Question 4. \n",
"Below you are given three functions to 1) build a frequency table for a file, 2) compress a file, and 3) decompress a file. Make use of these functions to do the following:\n",
"\n",
"1. Create a compressed version of file `100.txt.utf-8.txt` that is named `100.txt.utf-8.txt.huff`.\n",
"2. Decompress `100.txt.utf-8.txt.huff` to file `100.txt.utf-8.txt.huff.dehuff.txt`. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "f1a7d2cdc3ec444c75a311ce72630ee3",
"grade": false,
"grade_id": "cell-7c911f3552280b25",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"from collections import defaultdict \n",
"import pickle\n",
"# build a frequency table:\n",
"def build_freq(filename):\n",
" freq = defaultdict(int)\n",
" with open(filename, 'rb') as f:\n",
" for line in f:\n",
" for char in line:\n",
" freq[char] += 1\n",
" total = float(sum(freq.values()))\n",
" return {char: count / total for (char, count) in freq.items()}\n",
"\n",
"# Now compress the file:\n",
"def compress(filename, encoding, compressed_name=None):\n",
" if compressed_name is None:\n",
" compressed_name = filename + \".huff\"\n",
" output = bitarray()\n",
" with open(filename, 'rb') as f:\n",
" for line in f:\n",
" for char in line:\n",
" output.extend(encoding[char])\n",
" N = len(output)\n",
" with open(compressed_name, 'wb') as f:\n",
" pickle.dump(N, f)\n",
" pickle.dump(encoding, f)\n",
" output.tofile(f)\n",
"\n",
"\n",
"# Now decompress the file:\n",
"def decompress(filename, decompressed_name=None):\n",
" if decompressed_name is None:\n",
" decompressed_name = filename + \".dehuff.txt\"\n",
" with open(filename, 'rb') as f:\n",
" N = pickle.load(f)\n",
" encoding = pickle.load(f)\n",
" bits = bitarray()\n",
" bits.fromfile(f)\n",
" bits = bits[:N]\n",
"\n",
" # Totally cheating here and using a builtin method:\n",
" output = bits.decode(encoding)\n",
" with open(decompressed_name, 'wb') as f:\n",
" f.write(bytes(output))\n",
"\n",
"\n",
"file = '100.txt.utf-8.txt'\n",
"\n",
"# We encode the file using the encode() function we built and the build_freq() function, which provides the frequency of \n",
"# each character in the file\n",
"encoding_dict = encode(build_freq(file))\n",
"\n",
"# Use the frequency dict needed to compress the file\n",
"compress(file, encoding_dict, compressed_name = '100.txt.utf-8.txt.huff')\n",
"\n",
"decompress('100.txt.utf-8.txt.huff', decompressed_name = '100.txt.utf-8.txt.huff.dehuff.txt')"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "4df4ea196f6952f4b7e15d7ed2ea9430",
"grade": false,
"grade_id": "cell-b564d21e754a0461",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Question 5. \n",
"Give your answer in the cell below:\n",
"1. Report the size of the compressed file and the decompressed file in the cell below.\n",
"2. How does the size of the decompressed file compare to the size of the original file (`100.txt.utf-8.txt`)?\n",
"3. Visually skim the decompressed file and the original file. Do they appear identical?"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "4c171a626010799721ff7c9e3ac9a22a",
"grade": true,
"grade_id": "cell-b5988d6026475a88",
"locked": false,
"points": 0,
"schema_version": 1,
"solution": true
}
},
"source": [
"1. The size of the compressed file is 3.28 MB, while the decompressed file has the same size as the original file, 5.59 MB.\n",
"2. The size of the decompressed file is the same as the original's.\n",
"3. Yes, the files do appear identical."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "499835f219f42c8fb92f0ac6136707a4",
"grade": false,
"grade_id": "cell-b9ea500bebcac4dd",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Question 6.\n",
"Compute and print out:\n",
"1. The percentage of 1’s in the compressed version\n",
"2. The percentage of 1’s in the uncompressed version"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "e1287afdf6be4303ec9293e1894b9643",
"grade": true,
"grade_id": "cell-a1f9f58458cf80d5",
"locked": false,
"points": 0,
"schema_version": 1,
"solution": true
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.0025859681105624152"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# If we retrieve the entry of the dict returned by build_freq for the file for the key 1, we obtain the frequency\n",
"# of the key, which stands at around 0.26%\n",
"build_freq('100.txt.utf-8.txt.huff')[1]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"None\n",
"There are 0 1s in the uncompressed version of the file\n"
]
}
],
"source": [
"print(build_freq('100.txt.utf-8.txt.huff.dehuff.txt').get(1))\n",
"print(\"There are 0 1s in the uncompressed version of the file\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment