LeopoldTal · September 14, 2021 09:02
diff --git a/tough-dough.html b/tough-dough.html
 <!DOCTYPE html>
 <html lang="en">

 <head>
 	<meta charset="utf-8" />
 	<title>tf-idf tutorial</title>
 	<script src="https://d3js.org/d3.v5.min.js"></script>
 	<script src="./set-documents.js"></script>
 </head>

 <body>
 	<h1>tf-idf</h1>

 	<figure style="float: right;">
 		<img src="machine-learning.png" alt="A robot reads a book" />
 		<figcaption>Language processing</figcaption>
 	</figure>

 	<h2>Goal</h2>
 	
 	<p>You have a whole bunch of documents: articles, product descriptions, projects, etc. You want to find
 		<strong>similarities</strong> between them.</p>

 	<p>Examples: A search engine. A recommendation engine.</p>

 	<p>For this tutorial, I'll use the 5917 featured articles on Wikipedia.</p>

 	<h2>Principle</h2>

 	<p>What does it mean for two documents to be similar, or related? Can we give a formal definition?</p>

 	<p>We're looking for a concept of <strong>distance</strong>: two copies of the same document should be at
 		distance 0. Similar documents should be closer together than totally unrelated documents.</p>

 	<p>A distance… in what? In what vector space?</p>

 	<p>Machines are stupid. They can't understand the documents, just see what words appear in them. Hey, what
 		if we used word count?</p>

 	<p>That defines dimensions: The more times the word "cheese" appears in the document, the further the
 		document is on the "cheese" axis.</p>

 	<p>We're using the <strong>frequency of the term</strong> "cheese": the "tf" in "tf-idf".</p>

 	<figure>
 		<figcaption>Frequency of the word "cheese" in all documents</figcaption>
 		<div id="singleTerm1d"></div>
 	</figure>

 	<p>That defines an absolutely enormous space: one dimension per unique word.</p>
 	
 	<p>There are thousands of words in the English language but if I draw diagrams with thousands of dimensions,
 		they won't be very clear. So I'll just show 2: "horse" and "island".</p>

 	<h2>Term frequency</h2>

 	<p>I lied to you! Counting occurrences of the word isn't the only way to do it. Other ways:</p>

 	<dl>
 		<dt>simple</dt>
 		<dd>Count occurrences.
 			<pre><code>const freqSimple = (document, term) => document.terms[term] || 0;</code></pre>
 		</dd>
 		<dt>normalised</dt>
 		<dd>Count occurrences then divide by total word count in the document, so it's actually a frequency.
 			<pre><code>const totalTermCount = document => Object.values(document.terms).reduce((a, b) => a + b);
 const freqNorm = (document, term) => freqSimple(document, term) / totalTermCount(document);</code></pre>
 		</dd>
 		<dt>booleean</dt>
 		<dd>Count whether the word appears or not.
 			<pre><code>const freqBool = (document, term) => document.terms[term] ? 1 : 0;</code></pre>
 		</dd>
 		<dt>logarithmic</dt>
 		<dd>Logarithm of the number of occurrences (+ 1 so that it's 0 if missing).
 			<pre><code>const freqLog = (document, term) => Math.log(1 + freqSimple(document, term));</code></pre>
 		</dd>
 		<dt>augmented</dt>
 		<dd>Instead of dividing by total word count, divide by the number of occurrences of the most frequent
 			word. For very long documents, that helps identify which words matter most.
 			<pre><code>const maxWordCount = document => Math.max(...Object.values(document.terms));
 const freqAug = (document, term) => 1 + freqSimple(document, term) / maxWordCount(document);</code></pre>
 		</dd>
 	</dl>

 	<p><strong>Mouse over</strong> any point to see the other terms.</p>

 	<figure>
 		<figcaption>
 			<select id="tf2d-select">
 				<option value="simple">Simple</option>
 				<option value="norm">Normalised</option>
 				<option value="bool">Booleean</option>
 				<option value="log">Logarithmic</option>
 				<option value="aug">Augmented</option>
 			</select>
 			frequency of the terms "horse" and "island" in all documents
 		</figcaption>
 		<div id="tf2d"></div>
 	</figure>

 	<h2>Inverse frequency</h2>

 	<p>Okay, that makes a pretty picture for those 2 words, but it doesn't really work: very common words
 		like "the" always come out on top.</p>

 	<p>I could build a stopword list, but I'm lazy. Also, some words will still be more common.</p>

 	<p>The secret ingredient (Spärck Jones, 1972): divide everything by the log of the number of
 		documents that contain the term.</p>

 	<p><strong>The rarer the term is, the more it counts.</strong></p>

 	<figure style="float: right;">
 		<img src="spaerck_jones.jpg" alt="Karen Spärck Jones in 2002" />
 		<figcaption>Professor Karen Spärck Jones</figcaption>
 	</figure>

 	<p>That's the "idf" ("inverse document frequency") in "tf-idf".</p>

 	<pre><code>const docCount = (allDocuments, term) => allDocuments.filter(document => document.terms[term]).length;
 const idf = (allDocuments, term) => Math.log(allDocuments.length / (1 + docCount(allDocuments, term)));
 const tfIdf = (allDocuments, document, term, freq) => freq(document, term) * idf(allDocuments, term);</code></pre>

 	<figure>
 		<figcaption>
 			<select id="tfIdf2d-select">
 				<option value="simple">Simple</option>
 				<option value="norm">Normalised</option>
 				<option value="bool">Boolean</option>
 				<option value="log">Logarithmic</option>
 				<option value="aug">Augmented</option>
 			</select>
 			tf-idf of the terms "horse" and "island" in all documents
 			(mouse over to see other terms)
 		</figcaption>
 		<div id="tfIdf2d"></div>
 	</figure>

 	<p>Success! We get terms strongly related to the topic of each article.</p>

 	<h2>Similarities</h2>

 	<p>So we've transformed each document into (term, tf-idf) pairs. How do you measure distance between
 		two of those?</p>

 	<p>If a term is very frequent in document 1, you want its frequency in document 2 to matter a lot, and
 		vice versa. To model this, <strong>multiply</strong> the tf-idf frequencies of the term in the
 		two documents.</p>

 	<p>The contribution of each term is independent, so we can just <strong>add</strong> them together.</p>

 	<pre><code>const cosineSimilarity = (allDocuments, document1, document2, freq) => {
 	const byTerms = Object.keys(document1.terms).map(term =>
 		tfIdf(allDocuments, document1, term, freq) * tfIdf(allDocuments, document2, term, freq)
 	);
 	return byTerms.reduce((a, b) => a + b);
 };</code></pre>

 	<p>Say, that's a scalar product! It's the cosine of the angle between the vectors of the two
 		documents. Hence the name "cosine similarity".</p>

 	<pre><code>const getMostSimilar = (allDocuments, toDocument, freq) => {
 	const otherDocuments = allDocuments.filter(document => document !== toDocument);
 	const similarities = otherDocuments.map(document => ({
 		document,
 		similarity: cosineSimilarity(allDocuments, document, toDocument, freq)
 	}));
 	similarities.sort((document1, document2) => document2.similarity - document1.similarity);
 	return similarities.slice(0, 10);
 };</code></pre>

 	<p>A cool feature of tf-if: you can tell <strong>why</strong> two documents are similar:
 		look at the terms with the biggest product.</p>

 	<div id="similarities">
 		<caption>
 			Articles similar to
 			<cite id="document-name"></cite>
 			<button type="button" id="random-article">change</button>
 			by
 			<select id="similarity-select">
 				<option value="simple">simple</option>
 				<option value="norm">normalised</option>
 				<option value="bool">booleean</option>
 				<option value="log">logarithmic</option>
 				<option value="aug">augmented</option>
 			</select>
 			tf-idf
 		</caption>
 		<table id="similarity-table"></table>
 	</div>

 	<p>Question time!</p>

 	<script src="./tough-dough.js"></script>
 </body>

 </html>
diff --git a/tough-dough.js b/tough-dough.js
 // Terms for demo

 const TERM_SINGLE = 'cheese';
 const TERM_X = 'horse';
 const TERM_Y = 'island';
 const NB_TOP_TERMS = 3;

 // Term frequency functions

 const freqSimple = (document, term) => document.terms[term] || 0;

 const totalTermCount = document => { // memoised
 	document.totalTermCount = document.totalTermCount || Object.values(document.terms).reduce((a, b) => a + b);
 	return document.totalTermCount;
 };
 const freqNorm = (document, term) => freqSimple(document, term) / totalTermCount(document);

 const freqBool = (document, term) => document.terms[term] ? 1 : 0;

 const freqLog = (document, term) => Math.log(1 + freqSimple(document, term));

 const maxWordCount = document => { // memoised
 	document.maxWordCount = document.maxWordCount === undefined
 		? Math.max(...Object.values(document.terms))
 		: document.maxWordCount;
 	return document.maxWordCount;
 };
 const freqAug = (document, term) => 1 + freqSimple(document, term) / maxWordCount(document);

 // Inverse document frequency

 const idfMap = {}; // memoised
 const docCount = (allDocuments, term) => allDocuments.filter(document => document.terms[term]).length;
 const idf = (allDocuments, term) => {
 	idfMap[term] = idfMap[term] || Math.log(allDocuments.length / (1 + docCount(allDocuments, term)));
 	return idfMap[term];
 };

 // tf-idf

 const tfIdf = (allDocuments, document, term, freq) => freq(document, term) * idf(allDocuments, term);

 // Similarity

 // pre-compile tfIdfs for target document
 const cosineSimilarity = (allDocuments, targetTfIdf, otherDocument, freq) => {
 	const byTerms = targetTfIdf.map(({ term, tdIdfValue }) => ({
 		term,
 		weight: tdIdfValue * tfIdf(allDocuments, otherDocument, term, freq)
 	}));
 	byTerms.sort((a, b) => b.weight - a.weight);
 	const similarity = byTerms.reduce((acc, term) => acc + term.weight, 0);
 	const why = byTerms.slice(0, NB_TOP_TERMS);
 	return { similarity, why };
 };

 const getMostSimilar = (allDocuments, toDocument, freq) => {
 	const otherDocuments = allDocuments.filter(document => document !== toDocument);

 	const precompiledTfIdf = Object.keys(toDocument.terms).map(term => ({
 		term,
 		tdIdfValue: tfIdf(allDocuments, toDocument, term, freq)
 	}));
 	precompiledTfIdf.sort( ({ tdIdfValue: val1 }, { tdIdfValue: val2 }) => val2 - val1 );
 	const targetTfIdf = precompiledTfIdf.slice(0, 500); // only use most-significant terms

 	const similarities = otherDocuments.map(document => ({
 		document,
 		...cosineSimilarity(allDocuments, targetTfIdf, document, freq)
 	}));
 	similarities.sort((document1, document2) => document2.similarity - document1.similarity);
 	return similarities.slice(0, 10);
 };

 // Interactive examples

 const getTicks = values => {
 	const maxValue = Math.max(...values);
 	const maxTick = parseFloat(maxValue.toPrecision(1));
 	const nbTicks = 5;
 	const ticks = [];
 	for (let ii = 0; ii < nbTicks; ii++) {
 		const tickValue = ii / nbTicks * maxTick;
 		const tickLabel = (Math.round(tickValue * 100) / 100).toString();
 		ticks.push(tickLabel);
 	}
 	return ticks;
 };

 const drawXAxis = (
 	axisGroup,
 	label,
 	data,
 	scale,
 	{ width, height }
 ) => {
 	axisGroup.append('svg:line')
 		.attr('x1', 0)
 		.attr('y1', height)
 		.attr('x2', width)
 		.attr('y2', height)
 		.attr('stroke', 'black')
 		.attr('class', 'xTicks');

 	const labelHeight = height + 15;

 	const ticks = getTicks(data.map(point => point.x));

 	axisGroup.selectAll('text.xAxisBottom')
 		.data(ticks)
 		.enter()
 		.append('svg:text')
 		.text(count => count)
 		.attr('x', scale)
 		.attr('y', labelHeight)
 		.attr('text-anchor', 'middle')
 		.attr('class', 'xAxisBottom');

 	axisGroup.append('svg:text')
 		.text(label)
 		.attr('x', width - 20)
 		.attr('y', labelHeight)
 		.attr('text-anchor', 'middle');
 };

 const drawYAxis = (
 	axisGroup,
 	label,
 	data,
 	scale,
 	{ width, height }
 ) => {
 	axisGroup.append('svg:line')
 		.attr('x1', width)
 		.attr('y1', 0)
 		.attr('x2', width)
 		.attr('y2', height)
 		.attr('stroke', 'black')
 		.attr('class', 'yTicks');

 	const labelLeft = width - 30;

 	const ticks = getTicks(data.map(point => point.y));

 	axisGroup.selectAll('text.yAxisLeft')
 		.data(ticks)
 		.enter()
 		.append('svg:text')
 		.text(count => count)
 		.attr('x', labelLeft)
 		.attr('y', scale)
 		.attr('text-anchor', 'right')
 		.attr('class', 'yAxisLeft');

 	axisGroup.append('svg:text')
 		.text(label)
 		.attr('x', labelLeft)
 		.attr('y', 20)
 		.attr('text-anchor', 'right');
 };

 const makeTooltip = visRoot => {
 	const tooltip = visRoot.append('div')
 		.style('display', 'none')
 		.style('position', 'absolute')
 		.style('background-color', 'white')
 		.style('border', 'solid')
 		.style('border-width', '1px')
 		.style('padding', '3px')
 		.attr('class', 'tooltip');

 	const showTooltip = () => tooltip.style('display', 'block');
 	const setTooltipText = point => {
 		tooltip
 			.html(point.getTitle())
 			.style('left', (d3.event.pageX + 5) + 'px')
 			.style('top', d3.event.pageY + 'px')
 	};

 	return { showTooltip, setTooltipText };
 };

 // move points slightly to avoid overlap
 const jiggle = coord => coord - 2 + 4 * Math.random();

 const drawCircles = (nodes, { scaleX, scaleY }, { showTooltip, setTooltipText }) => nodes
 	.append('svg:circle')
 	.attr('class', 'nodes')
 	.attr('cx', point => jiggle(scaleX(point.x)))
 	.attr('cy', point => jiggle(scaleY(point.y)))
 	.attr('r', '6px')
 	.attr('stroke', 'black')
 	.attr('fill', 'white')
 	.on('mouseover', showTooltip)
 	.on('mousemove', setTooltipText);

 // 1D: simple frequency of a single term

 const set1DExample = () => {
 	const points = window.allDocuments.map(document => ({
 		getTitle: () => `${document.title} (${freqSimple(document, TERM_SINGLE)})`,
 		x: freqSimple(document, TERM_SINGLE),
 		y: 40
 	}));

 	const visRoot = d3.select('#singleTerm1d');

 	const vis = visRoot
 		.append('svg:svg')
 		.attr('width', 620)
 		.attr('height', 80);

 	const scale = coord => 10 + 33 * coord;

 	const axisGroup = vis.append('svg:g');
 	drawXAxis(axisGroup, TERM_SINGLE, points, scale, {
 		width: 600,
 		height: 45
 	});

 	drawCircles(
 		vis.selectAll('circle .nodes').data(points).enter(),
 		{ scaleX: scale, scaleY: y => y },
 		makeTooltip(visRoot)
 	);
 };
 set1DExample();

 // 2D example: raw term frequencies

 const getFreq = shortName => {
 	const freqMap = {
 		simple: freqSimple,
 		norm: freqNorm,
 		bool: freqBool,
 		log: freqLog,
 		aug: freqAug
 	};
 	return freqMap[shortName];
 };

 const getTopTerms = (document, freq) => {
 	const withFreqs = Object.keys(document.terms).map(term => ({
 		term,
 		termFreq: freq(document, term)
 	}));
 	withFreqs.sort((term1, term2) => term2.termFreq - term1.termFreq);
 	return withFreqs.slice(0, 5);
 };

 const setTf2DExample = freq => {
 	const points = window.allDocuments.map(document => ({
 		getTitle: () => `<p>${document.title}</p><table>${
 			getTopTerms(document, freq)
 				.map(({ term, termFreq }) => `<tr><td>${term}</td><td>${termFreq}</td></tr>`)
 				.join('')
 			}</table>`,
 		x: freq(document, TERM_X),
 		y: freq(document, TERM_Y)
 	}));

 	const visRoot = d3.select('#tf2d');
 	visRoot.selectAll('*').remove();

 	const vis = visRoot
 		.append('svg:svg')
 		.attr('width', 650)
 		.attr('height', 590);

 	const maxCoord = Math.max(
 		...points.map(point => point.x),
 		...points.map(point => point.y)
 	);
 	const scaleX = coord => 40 + 500 * coord / maxCoord;
 	const scaleY = coord => 545 - 500 * coord / maxCoord;

 	const axisGroup = vis.append('svg:g');
 	drawXAxis(axisGroup, TERM_X, points, scaleX, {
 		width: 600,
 		height: 550
 	});
 	drawYAxis(axisGroup, TERM_Y, points, scaleY, {
 		width: 45,
 		height: 550
 	});

 	drawCircles(
 		vis.selectAll('circle .nodes').data(points).enter(),
 		{ scaleX, scaleY },
 		makeTooltip(visRoot)
 	);
 };

 const updateTf2DExample = e => setTf2DExample(getFreq(e.target.value));
 document.getElementById('tf2d-select').addEventListener('change', updateTf2DExample);
 setTf2DExample(freqSimple);

 // 2D example: tf-idf

 const setTfIdf2DExample = freq => {
 	const toTfIdf = (document, term) => tfIdf(window.allDocuments, document, term, freq);

 	const points = window.allDocuments.map(document => ({
 		getTitle: () => `<p>${document.title}</p><table>${
 			getTopTerms(document, toTfIdf)
 				.map(({ term, termFreq }) => `<tr><td>${term}</td><td>${termFreq}</td></tr>`)
 				.join('')
 			}</table>`,
 		x: toTfIdf(document, TERM_X),
 		y: toTfIdf(document, TERM_Y)
 	}));

 	const visRoot = d3.select('#tfIdf2d');
 	visRoot.selectAll('*').remove();

 	const vis = visRoot
 		.append('svg:svg')
 		.attr('width', 650)
 		.attr('height', 590);

 	const maxCoord = Math.max(
 		...points.map(point => point.x),
 		...points.map(point => point.y)
 	);
 	const scaleX = coord => 40 + 500 * coord / maxCoord;
 	const scaleY = coord => 545 - 500 * coord / maxCoord;

 	const axisGroup = vis.append('svg:g');
 	drawXAxis(axisGroup, TERM_X, points, scaleX, {
 		width: 600,
 		height: 550
 	});
 	drawYAxis(axisGroup, TERM_Y, points, scaleY, {
 		width: 45,
 		height: 550
 	});

 	drawCircles(
 		vis.selectAll('circle .nodes').data(points).enter(),
 		{ scaleX, scaleY },
 		makeTooltip(visRoot)
 	);
 };

 const updateTfIdf2DExample = e => setTfIdf2DExample(getFreq(e.target.value));
 document.getElementById('tfIdf2d-select').addEventListener('change', updateTfIdf2DExample);
 setTfIdf2DExample(freqSimple);

 // Similarities

 const setSimilarities = (toDocument, freq) => {
 	const nameDisplay = window.document.getElementById('document-name');
 	nameDisplay.innerText = toDocument.title;

 	const table = window.document.getElementById('similarity-table');
 	table.innerHTML = '';
 	const headerRow = window.document.createElement('tr');
 	headerRow.innerHTML = `<th>Article</th><th>Similarity</th><th>Top terms</th>`;
 	table.appendChild(headerRow);

 	const similarities = getMostSimilar(window.allDocuments, toDocument, freq);
 	similarities.forEach(({ document: { title }, similarity, why }) => {
 		const row = window.document.createElement('tr');
 		const topTerms = why.map(
 			({ term, weight }) => `${term}<small> (${weight.toPrecision(5)})</small>`
 		).join(', ');
 		row.innerHTML = `<td>${title}</td><td>${similarity.toPrecision(6)}</td><td>${topTerms}</td>`;
 		table.appendChild(row);
 	});
 };

 const getRandomDocument = () => window.allDocuments[
 	Math.floor(window.allDocuments.length * Math.random())
 ];

 window.selectedDocument = getRandomDocument();
 const changeSelectedDocument = () => {
 	window.selectedDocument = getRandomDocument();
 	setSimilarities(
 		window.selectedDocument,
 		getFreq(document.getElementById('similarity-select').value)
 	);
 };
 document.getElementById('random-article').addEventListener('click', changeSelectedDocument);

 const updateSimilarities = e => setSimilarities(window.selectedDocument, getFreq(e.target.value));
 document.getElementById('similarity-select').addEventListener('change', updateSimilarities);

 setSimilarities(window.selectedDocument, freqSimple);
	<!DOCTYPE html>
	<html lang="en">

	<head>
	<meta charset="utf-8" />
	<title>tf-idf tutorial</title>
	<script src="https://d3js.org/d3.v5.min.js"></script>
	<script src="./set-documents.js"></script>
	</head>

	<body>
	<h1>tf-idf</h1>

	<figure style="float: right;">
	<img src="machine-learning.png" alt="A robot reads a book" />
	<figcaption>Language processing</figcaption>
	</figure>

	<h2>Goal</h2>

	<p>You have a whole bunch of documents: articles, product descriptions, projects, etc. You want to find
	<strong>similarities</strong> between them.</p>

	<p>Examples: A search engine. A recommendation engine.</p>

	<p>For this tutorial, I'll use the 5917 featured articles on Wikipedia.</p>

	<h2>Principle</h2>

	<p>What does it mean for two documents to be similar, or related? Can we give a formal definition?</p>

	<p>We're looking for a concept of <strong>distance</strong>: two copies of the same document should be at
	distance 0. Similar documents should be closer together than totally unrelated documents.</p>

	<p>A distance… in what? In what vector space?</p>

	<p>Machines are stupid. They can't understand the documents, just see what words appear in them. Hey, what
	if we used word count?</p>

	<p>That defines dimensions: The more times the word "cheese" appears in the document, the further the
	document is on the "cheese" axis.</p>

	<p>We're using the <strong>frequency of the term</strong> "cheese": the "tf" in "tf-idf".</p>

	<figure>
	<figcaption>Frequency of the word "cheese" in all documents</figcaption>
	<div id="singleTerm1d"></div>
	</figure>

	<p>That defines an absolutely enormous space: one dimension per unique word.</p>

	<p>There are thousands of words in the English language but if I draw diagrams with thousands of dimensions,
	they won't be very clear. So I'll just show 2: "horse" and "island".</p>

	<h2>Term frequency</h2>

	<p>I lied to you! Counting occurrences of the word isn't the only way to do it. Other ways:</p>

	<dl>
	<dt>simple</dt>
	<dd>Count occurrences.
	<pre><code>const freqSimple = (document, term) => document.terms[term] \|\| 0;</code></pre>
	</dd>
	<dt>normalised</dt>
	<dd>Count occurrences then divide by total word count in the document, so it's actually a frequency.
	<pre><code>const totalTermCount = document => Object.values(document.terms).reduce((a, b) => a + b);
	const freqNorm = (document, term) => freqSimple(document, term) / totalTermCount(document);</code></pre>
	</dd>
	<dt>booleean</dt>
	<dd>Count whether the word appears or not.
	<pre><code>const freqBool = (document, term) => document.terms[term] ? 1 : 0;</code></pre>
	</dd>
	<dt>logarithmic</dt>
	<dd>Logarithm of the number of occurrences (+ 1 so that it's 0 if missing).
	<pre><code>const freqLog = (document, term) => Math.log(1 + freqSimple(document, term));</code></pre>
	</dd>
	<dt>augmented</dt>
	<dd>Instead of dividing by total word count, divide by the number of occurrences of the most frequent
	word. For very long documents, that helps identify which words matter most.
	<pre><code>const maxWordCount = document => Math.max(...Object.values(document.terms));
	const freqAug = (document, term) => 1 + freqSimple(document, term) / maxWordCount(document);</code></pre>
	</dd>
	</dl>

	<p><strong>Mouse over</strong> any point to see the other terms.</p>

	<figure>
	<figcaption>
	<select id="tf2d-select">
	<option value="simple">Simple</option>
	<option value="norm">Normalised</option>
	<option value="bool">Booleean</option>
	<option value="log">Logarithmic</option>
	<option value="aug">Augmented</option>
	</select>
	frequency of the terms "horse" and "island" in all documents
	</figcaption>
	<div id="tf2d"></div>
	</figure>

	<h2>Inverse frequency</h2>

	<p>Okay, that makes a pretty picture for those 2 words, but it doesn't really work: very common words
	like "the" always come out on top.</p>

	<p>I could build a stopword list, but I'm lazy. Also, some words will still be more common.</p>

	<p>The secret ingredient (Spärck Jones, 1972): divide everything by the log of the number of
	documents that contain the term.</p>

	<p><strong>The rarer the term is, the more it counts.</strong></p>

	<figure style="float: right;">
	<img src="spaerck_jones.jpg" alt="Karen Spärck Jones in 2002" />
	<figcaption>Professor Karen Spärck Jones</figcaption>
	</figure>

	<p>That's the "idf" ("inverse document frequency") in "tf-idf".</p>

	<pre><code>const docCount = (allDocuments, term) => allDocuments.filter(document => document.terms[term]).length;
	const idf = (allDocuments, term) => Math.log(allDocuments.length / (1 + docCount(allDocuments, term)));
	const tfIdf = (allDocuments, document, term, freq) => freq(document, term) * idf(allDocuments, term);</code></pre>

	<figure>
	<figcaption>
	<select id="tfIdf2d-select">
	<option value="simple">Simple</option>
	<option value="norm">Normalised</option>
	<option value="bool">Boolean</option>
	<option value="log">Logarithmic</option>
	<option value="aug">Augmented</option>
	</select>
	tf-idf of the terms "horse" and "island" in all documents
	(mouse over to see other terms)
	</figcaption>
	<div id="tfIdf2d"></div>
	</figure>

	<p>Success! We get terms strongly related to the topic of each article.</p>

	<h2>Similarities</h2>

	<p>So we've transformed each document into (term, tf-idf) pairs. How do you measure distance between
	two of those?</p>

	<p>If a term is very frequent in document 1, you want its frequency in document 2 to matter a lot, and
	vice versa. To model this, <strong>multiply</strong> the tf-idf frequencies of the term in the
	two documents.</p>

	<p>The contribution of each term is independent, so we can just <strong>add</strong> them together.</p>

	<pre><code>const cosineSimilarity = (allDocuments, document1, document2, freq) => {
	const byTerms = Object.keys(document1.terms).map(term =>
	tfIdf(allDocuments, document1, term, freq) * tfIdf(allDocuments, document2, term, freq)
	);
	return byTerms.reduce((a, b) => a + b);
	};</code></pre>

	<p>Say, that's a scalar product! It's the cosine of the angle between the vectors of the two
	documents. Hence the name "cosine similarity".</p>

	<pre><code>const getMostSimilar = (allDocuments, toDocument, freq) => {
	const otherDocuments = allDocuments.filter(document => document !== toDocument);
	const similarities = otherDocuments.map(document => ({
	document,
	similarity: cosineSimilarity(allDocuments, document, toDocument, freq)
	}));
	similarities.sort((document1, document2) => document2.similarity - document1.similarity);
	return similarities.slice(0, 10);
	};</code></pre>

	<p>A cool feature of tf-if: you can tell <strong>why</strong> two documents are similar:
	look at the terms with the biggest product.</p>

	<div id="similarities">
	<caption>
	Articles similar to
	<cite id="document-name"></cite>
	<button type="button" id="random-article">change</button>
	by
	<select id="similarity-select">
	<option value="simple">simple</option>
	<option value="norm">normalised</option>
	<option value="bool">booleean</option>
	<option value="log">logarithmic</option>
	<option value="aug">augmented</option>
	</select>
	tf-idf
	</caption>
	<table id="similarity-table"></table>
	</div>

	<p>Question time!</p>

	<script src="./tough-dough.js"></script>
	</body>

	</html>
	// Terms for demo

	const TERM_SINGLE = 'cheese';
	const TERM_X = 'horse';
	const TERM_Y = 'island';
	const NB_TOP_TERMS = 3;

	// Term frequency functions

	const freqSimple = (document, term) => document.terms[term] \|\| 0;

	const totalTermCount = document => { // memoised
	document.totalTermCount = document.totalTermCount \|\| Object.values(document.terms).reduce((a, b) => a + b);
	return document.totalTermCount;
	};
	const freqNorm = (document, term) => freqSimple(document, term) / totalTermCount(document);

	const freqBool = (document, term) => document.terms[term] ? 1 : 0;

	const freqLog = (document, term) => Math.log(1 + freqSimple(document, term));

	const maxWordCount = document => { // memoised
	document.maxWordCount = document.maxWordCount === undefined
	? Math.max(...Object.values(document.terms))
	: document.maxWordCount;
	return document.maxWordCount;
	};
	const freqAug = (document, term) => 1 + freqSimple(document, term) / maxWordCount(document);

	// Inverse document frequency

	const idfMap = {}; // memoised
	const docCount = (allDocuments, term) => allDocuments.filter(document => document.terms[term]).length;
	const idf = (allDocuments, term) => {
	idfMap[term] = idfMap[term] \|\| Math.log(allDocuments.length / (1 + docCount(allDocuments, term)));
	return idfMap[term];
	};

	// tf-idf

	const tfIdf = (allDocuments, document, term, freq) => freq(document, term) * idf(allDocuments, term);

	// Similarity

	// pre-compile tfIdfs for target document
	const cosineSimilarity = (allDocuments, targetTfIdf, otherDocument, freq) => {
	const byTerms = targetTfIdf.map(({ term, tdIdfValue }) => ({
	term,
	weight: tdIdfValue * tfIdf(allDocuments, otherDocument, term, freq)
	}));
	byTerms.sort((a, b) => b.weight - a.weight);
	const similarity = byTerms.reduce((acc, term) => acc + term.weight, 0);
	const why = byTerms.slice(0, NB_TOP_TERMS);
	return { similarity, why };
	};

	const getMostSimilar = (allDocuments, toDocument, freq) => {
	const otherDocuments = allDocuments.filter(document => document !== toDocument);

	const precompiledTfIdf = Object.keys(toDocument.terms).map(term => ({
	term,
	tdIdfValue: tfIdf(allDocuments, toDocument, term, freq)
	}));
	precompiledTfIdf.sort( ({ tdIdfValue: val1 }, { tdIdfValue: val2 }) => val2 - val1 );
	const targetTfIdf = precompiledTfIdf.slice(0, 500); // only use most-significant terms

	const similarities = otherDocuments.map(document => ({
	document,
	...cosineSimilarity(allDocuments, targetTfIdf, document, freq)
	}));
	similarities.sort((document1, document2) => document2.similarity - document1.similarity);
	return similarities.slice(0, 10);
	};

	// Interactive examples

	const getTicks = values => {
	const maxValue = Math.max(...values);
	const maxTick = parseFloat(maxValue.toPrecision(1));
	const nbTicks = 5;
	const ticks = [];
	for (let ii = 0; ii < nbTicks; ii++) {
	const tickValue = ii / nbTicks * maxTick;
	const tickLabel = (Math.round(tickValue * 100) / 100).toString();
	ticks.push(tickLabel);
	}
	return ticks;
	};

	const drawXAxis = (
	axisGroup,
	label,
	data,
	scale,
	{ width, height }
	) => {
	axisGroup.append('svg:line')
	.attr('x1', 0)
	.attr('y1', height)
	.attr('x2', width)
	.attr('y2', height)
	.attr('stroke', 'black')
	.attr('class', 'xTicks');

	const labelHeight = height + 15;

	const ticks = getTicks(data.map(point => point.x));

	axisGroup.selectAll('text.xAxisBottom')
	.data(ticks)
	.enter()
	.append('svg:text')
	.text(count => count)
	.attr('x', scale)
	.attr('y', labelHeight)
	.attr('text-anchor', 'middle')
	.attr('class', 'xAxisBottom');

	axisGroup.append('svg:text')
	.text(label)
	.attr('x', width - 20)
	.attr('y', labelHeight)
	.attr('text-anchor', 'middle');
	};

	const drawYAxis = (
	axisGroup,
	label,
	data,
	scale,
	{ width, height }
	) => {
	axisGroup.append('svg:line')
	.attr('x1', width)
	.attr('y1', 0)
	.attr('x2', width)
	.attr('y2', height)
	.attr('stroke', 'black')
	.attr('class', 'yTicks');

	const labelLeft = width - 30;

	const ticks = getTicks(data.map(point => point.y));

	axisGroup.selectAll('text.yAxisLeft')
	.data(ticks)
	.enter()
	.append('svg:text')
	.text(count => count)
	.attr('x', labelLeft)
	.attr('y', scale)
	.attr('text-anchor', 'right')
	.attr('class', 'yAxisLeft');

	axisGroup.append('svg:text')
	.text(label)
	.attr('x', labelLeft)
	.attr('y', 20)
	.attr('text-anchor', 'right');
	};

	const makeTooltip = visRoot => {
	const tooltip = visRoot.append('div')
	.style('display', 'none')
	.style('position', 'absolute')
	.style('background-color', 'white')
	.style('border', 'solid')
	.style('border-width', '1px')
	.style('padding', '3px')
	.attr('class', 'tooltip');

	const showTooltip = () => tooltip.style('display', 'block');
	const setTooltipText = point => {
	tooltip
	.html(point.getTitle())
	.style('left', (d3.event.pageX + 5) + 'px')
	.style('top', d3.event.pageY + 'px')
	};

	return { showTooltip, setTooltipText };
	};

	// move points slightly to avoid overlap
	const jiggle = coord => coord - 2 + 4 * Math.random();

	const drawCircles = (nodes, { scaleX, scaleY }, { showTooltip, setTooltipText }) => nodes
	.append('svg:circle')
	.attr('class', 'nodes')
	.attr('cx', point => jiggle(scaleX(point.x)))
	.attr('cy', point => jiggle(scaleY(point.y)))
	.attr('r', '6px')
	.attr('stroke', 'black')
	.attr('fill', 'white')
	.on('mouseover', showTooltip)
	.on('mousemove', setTooltipText);

	// 1D: simple frequency of a single term

	const set1DExample = () => {
	const points = window.allDocuments.map(document => ({
	getTitle: () => `${document.title} (${freqSimple(document, TERM_SINGLE)})`,
	x: freqSimple(document, TERM_SINGLE),
	y: 40
	}));

	const visRoot = d3.select('#singleTerm1d');

	const vis = visRoot
	.append('svg:svg')
	.attr('width', 620)
	.attr('height', 80);

	const scale = coord => 10 + 33 * coord;

	const axisGroup = vis.append('svg:g');
	drawXAxis(axisGroup, TERM_SINGLE, points, scale, {
	width: 600,
	height: 45
	});

	drawCircles(
	vis.selectAll('circle .nodes').data(points).enter(),
	{ scaleX: scale, scaleY: y => y },
	makeTooltip(visRoot)
	);
	};
	set1DExample();

	// 2D example: raw term frequencies

	const getFreq = shortName => {
	const freqMap = {
	simple: freqSimple,
	norm: freqNorm,
	bool: freqBool,
	log: freqLog,
	aug: freqAug
	};
	return freqMap[shortName];
	};

	const getTopTerms = (document, freq) => {
	const withFreqs = Object.keys(document.terms).map(term => ({
	term,
	termFreq: freq(document, term)
	}));
	withFreqs.sort((term1, term2) => term2.termFreq - term1.termFreq);
	return withFreqs.slice(0, 5);
	};

	const setTf2DExample = freq => {
	const points = window.allDocuments.map(document => ({
	getTitle: () => `<p>${document.title}</p><table>${
	getTopTerms(document, freq)
	.map(({ term, termFreq }) => `<tr><td>${term}</td><td>${termFreq}</td></tr>`)
	.join('')
	}</table>`,
	x: freq(document, TERM_X),
	y: freq(document, TERM_Y)
	}));

	const visRoot = d3.select('#tf2d');
	visRoot.selectAll('*').remove();

	const vis = visRoot
	.append('svg:svg')
	.attr('width', 650)
	.attr('height', 590);

	const maxCoord = Math.max(
	...points.map(point => point.x),
	...points.map(point => point.y)
	);
	const scaleX = coord => 40 + 500 * coord / maxCoord;
	const scaleY = coord => 545 - 500 * coord / maxCoord;

	const axisGroup = vis.append('svg:g');
	drawXAxis(axisGroup, TERM_X, points, scaleX, {
	width: 600,
	height: 550
	});
	drawYAxis(axisGroup, TERM_Y, points, scaleY, {
	width: 45,
	height: 550
	});

	drawCircles(
	vis.selectAll('circle .nodes').data(points).enter(),
	{ scaleX, scaleY },
	makeTooltip(visRoot)
	);
	};

	const updateTf2DExample = e => setTf2DExample(getFreq(e.target.value));
	document.getElementById('tf2d-select').addEventListener('change', updateTf2DExample);
	setTf2DExample(freqSimple);

	// 2D example: tf-idf

	const setTfIdf2DExample = freq => {
	const toTfIdf = (document, term) => tfIdf(window.allDocuments, document, term, freq);

	const points = window.allDocuments.map(document => ({
	getTitle: () => `<p>${document.title}</p><table>${
	getTopTerms(document, toTfIdf)
	.map(({ term, termFreq }) => `<tr><td>${term}</td><td>${termFreq}</td></tr>`)
	.join('')
	}</table>`,
	x: toTfIdf(document, TERM_X),
	y: toTfIdf(document, TERM_Y)
	}));

	const visRoot = d3.select('#tfIdf2d');
	visRoot.selectAll('*').remove();

	const vis = visRoot
	.append('svg:svg')
	.attr('width', 650)
	.attr('height', 590);

	const maxCoord = Math.max(
	...points.map(point => point.x),
	...points.map(point => point.y)
	);
	const scaleX = coord => 40 + 500 * coord / maxCoord;
	const scaleY = coord => 545 - 500 * coord / maxCoord;

	const axisGroup = vis.append('svg:g');
	drawXAxis(axisGroup, TERM_X, points, scaleX, {
	width: 600,
	height: 550
	});
	drawYAxis(axisGroup, TERM_Y, points, scaleY, {
	width: 45,
	height: 550
	});

	drawCircles(
	vis.selectAll('circle .nodes').data(points).enter(),
	{ scaleX, scaleY },
	makeTooltip(visRoot)
	);
	};

	const updateTfIdf2DExample = e => setTfIdf2DExample(getFreq(e.target.value));
	document.getElementById('tfIdf2d-select').addEventListener('change', updateTfIdf2DExample);
	setTfIdf2DExample(freqSimple);

	// Similarities

	const setSimilarities = (toDocument, freq) => {
	const nameDisplay = window.document.getElementById('document-name');
	nameDisplay.innerText = toDocument.title;

	const table = window.document.getElementById('similarity-table');
	table.innerHTML = '';
	const headerRow = window.document.createElement('tr');
	headerRow.innerHTML = `<th>Article</th><th>Similarity</th><th>Top terms</th>`;
	table.appendChild(headerRow);

	const similarities = getMostSimilar(window.allDocuments, toDocument, freq);
	similarities.forEach(({ document: { title }, similarity, why }) => {
	const row = window.document.createElement('tr');
	const topTerms = why.map(
	({ term, weight }) => `${term}<small> (${weight.toPrecision(5)})</small>`
	).join(', ');
	row.innerHTML = `<td>${title}</td><td>${similarity.toPrecision(6)}</td><td>${topTerms}</td>`;
	table.appendChild(row);
	});
	};

	const getRandomDocument = () => window.allDocuments[
	Math.floor(window.allDocuments.length * Math.random())
	];

	window.selectedDocument = getRandomDocument();
	const changeSelectedDocument = () => {
	window.selectedDocument = getRandomDocument();
	setSimilarities(
	window.selectedDocument,
	getFreq(document.getElementById('similarity-select').value)
	);
	};
	document.getElementById('random-article').addEventListener('click', changeSelectedDocument);

	const updateSimilarities = e => setSimilarities(window.selectedDocument, getFreq(e.target.value));
	document.getElementById('similarity-select').addEventListener('change', updateSimilarities);

	setSimilarities(window.selectedDocument, freqSimple);