This post will teach you how to visualize higher-dimensional datasets in lower dimensions using Principal Component Analysis. And guess what?! It's all in Clojure! I was inspired to write this because I was curious how Principal Component Analysis worked, and there aren't many data analysis resources out there for Clojure.
The best one I could find was from Data Sorcery https://data-sorcery.org/category/pca/.
Now that blog post was very informative on how to do Principal Component Analysis
(which I'll be referring to as PCA) in Clojure. However, when I tried to use it on a larger dataset I got an out of memory exception, because the pca
function Incanter provides requires a matrix as input, and that input matrix takes a lot of memory if the dataset is large. So I decided to write my own implementation, which can calculate the covariance matrix from a lazy seq as input. That way my input can be as big as I want. And learning how to do it was fun! The product of my endeavor is a tiny library
that you can download from Clojars! So let's get started! :D
Now I am a real big fan of Emacs, but critics will argue it has a learning curve, which I don't disagree with. However, I found it to be totally worth it and not much harder than other IDEs. Learning a tool takes time. The downside with Emacs, though, is the lack of clear documentation. The upside with Emacs is that it was built for Lisps, so it handles Clojure super smoothly. The CIDER package makes the development experience very interactive. I've been using Emacs for a couple of years and have really enjoyed it.
Recently I saw Jason Gilman's talk about proto-repl. I was really impressed and decided to check it out. Getting started with Atom is much easier than with Emacs. There is a lot more documentation on the interwebs, and writing your own extensions in JavaScript is more intuitive than Emacs Lisp. I personally feel proto-repl's dev experience can rival that of Emacs, even though it is in its infancy. I'm really excited to see the future of proto-repl as well as Atom. Now if I could only get rainbow parentheses to work properly!
We are going to use Leiningen to set up our project and install the dependencies:
lein new iris-viz
This will create a directory called iris-viz and it contains all the stuff we need to get started.
We'll need to add our dependencies to the project.clj file. It should look something like this:
(defproject iris-viz "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.8.0"]
                 [hswick/pca "0.1.0"]
                 [hswick/plotly-clj "0.1.3"]
                 [org.clojure/data.csv "0.1.3"]])
We are going to use a tiny pca library that I created to do the Principal Component Analysis. Once we have our principal components, we are going to compress our original dataset into 2d. Finally, we will plot our compressed dataset using another library I made called plotly-clj, which leverages the power of plotly from Clojure.
After we've added our list of dependencies to our project, we'll need to install them. Once again, Leiningen to the rescue. Run lein deps
in your terminal and you'll see the libraries get added.
The next thing we need is the iris data. The iris dataset is a classic toy dataset that is very useful for learning and testing. You can use it to ensure that an algorithm works properly, because the results are known and easy to interpret. If you want a real deep discussion of the dataset, go here. Basically, it is a small table of different properties of iris flowers and their species. The data has a nice, obvious clustering that will be evident when we do our visualization. You can download the dataset from here. Once downloaded, make sure to place it in the iris-viz directory.
Notice that there are 6 columns:
- Column 1: Index for each row (we won't need this)
- Column 2: Sepal.Length (float)
- Column 3: Sepal.Width (float)
- Column 4: Petal.Length (float)
- Column 5: Petal.Width (float)
- Column 6: Species (labels)
There are three different species in this dataset: setosa, versicolor, and virginica. We can think of each of these species as a class or a cluster. Each datapoint is a row with 4 floating point numbers and a string for the species name. In more mathematical terms, each datapoint is a 4-dimensional vector with an associated label. The point of our project is to visualize these datapoints. However, we cannot visualize 4 dimensions, because our eyes can only see 2 and 3 dimensions. Perhaps if we were some kind of super-organism alien, then we could visualize it in 4 dimensions (woah dude, heady).
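To make that concrete, here is a single datapoint written out as Clojure data (the measurements shown are a typical setosa row):
;; A single datapoint: a 4-dimensional vector of measurements plus a label.
;; Order: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
[5.1 3.5 1.4 0.2] ;; => labeled "setosa"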
Instead of being aliens we are going to use PCA to compress these 4 dimensional vectors into 2 dimensions. That is what reducing high dimensional data to lower dimensions means. Once we do that we can plot the output using plotly!
Now, with our preferred IDE, let's open up core.clj, located in the src directory. The first thing we need to do is add our dependencies to the namespace so we can access the functions.
(ns iris-viz.core
  (:require [pca.core :refer :all]
            [plotly-clj.core :refer :all]
            [clojure.data.csv :as csv]
            [clojure.java.io :as io]))
Once all of our dependencies are included in our namespace, we need to do the most important step: load the data! Without data, we have nothing!
We are going to load our data using the clojure.data.csv package's read-csv function. This is a convenient way to load the dataset from a csv file.
;; Read the whole csv eagerly (doall) so the file can be closed safely
(def iris-lines
  (with-open [in-file (io/reader "iris.csv")]
    (doall
     (csv/read-csv in-file))))
The data is loaded, but it is not quite ready to be used. We'll need to parse each row and format it properly. We can achieve this by writing a function to parse a row into a clean and usable form. Then simply map over the data with our new function and we get a nice clean dataset.
(defn format-row [row]
  ;; Drop the index column, then parse the four measurement columns as floats
  (mapv #(Float/parseFloat %) (take 4 (rest row))))

(def dataset (map format-row (rest iris-lines))) ;; Skip the header row
If we were working with a larger dataset, we could do the reading and formatting in one loop, and then pass the lazy seq to the covariance function, which will iterate over it without storing the entire matrix in memory. I could have shown that here, but I was being lazy (ha, get it?!)
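Still, as a rough hypothetical sketch (assuming a file named big-dataset.csv with the same column layout), the streaming version might look like this:
;; Stream rows straight into covariance so the full matrix never
;; lives in memory. The reader must stay open while the lazy seq is
;; consumed, hence the call sits inside with-open.
(with-open [in-file (io/reader "big-dataset.csv")]
  (covariance (map format-row (rest (csv/read-csv in-file)))))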
Once our dataset is prepared we can do some math! The first thing we need to do is calculate the covariance matrix. We can do this by using the covariance function my pca library provides:
(def covar (covariance dataset))
I'll give my high level understanding of what the covariance matrix means and why we need it, but be aware that I am no expert. If you want a more in depth discussion of covariance, I highly suggest this blog post.
It is very detailed and actually inspired my implementation of the covariance function.
My interpretation of the covariance matrix is that it contains the variance between the different pairs of features within the dataset, features being the columns of the dataset. More abstractly, it contains the "information" about the dataset, or a summary of the relations within it. If you are familiar with some machine learning, then you can think of it as "weights". This isn't an entirely accurate picture, but I think it is important to realize that the covariance matrix and the weights of a neural net both contain "knowledge" about the dataset they are trained on. They both gain more knowledge when given more data.
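To make that a little less hand-wavy, here is a minimal sketch of what a covariance computation looks like, assuming the input is a seq of equal-length numeric rows. This is just an illustration, not the pca library's actual implementation:
;; Covariance sketch for a seq of equal-length numeric rows.
;; Entry [j k] is the sample covariance between columns j and k;
;; the diagonal entries are simply the variances of each feature.
(defn covariance-sketch [rows]
  (let [n (count rows)
        dims (count (first rows))
        means (mapv (fn [j] (/ (reduce + (map #(nth % j) rows)) n))
                    (range dims))]
    (vec
     (for [j (range dims)]
       (vec
        (for [k (range dims)]
          (/ (reduce + (map (fn [row]
                              (* (- (nth row j) (nth means j))
                                 (- (nth row k) (nth means k))))
                            rows))
             (dec n))))))))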
The next step is to extract the principal components from the calculated matrix. Hence the name principal component analysis.
(def iris-components (principal-components covar 2))
The pca library I wrote makes getting these components very simple. We pass in the covariance matrix as the first argument. Then pass in the number of components we want. In our case we will be doing a 2d visualization of the dataset, so we only want 2 components. One component for each dimension. We are compressing the dataset by reducing 4 dimensions to 2.
The way principal-components
works is by first calculating the eigenvectors and eigenvalues of the covariance matrix. This is called eigendecomposition.
Now I'm not the one to explain to you what eigenvectors and eigenvalues are, so if you are curious I would suggest reading these blog posts:
- https://deeplearning4j.org/eigenvector
- http://www.visiondummy.com/2014/04/geometric-interpretation-covariance-matrix/
Eigendecomposition is absolutely fascinating and can be used for many different applications. My library uses the clatrix library to do the eigendecomposition, which returns a map of eigenvalues and eigenvectors. These key-value pairs are sorted by highest eigenvalue. A higher eigenvalue means the associated eigenvector contains more information about the variance of the data, or more "knowledge" of the data, than eigenvectors with lower eigenvalues.
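To make that ranking concrete, here is a hypothetical sketch of selecting the top n eigenvectors from an eigendecomposition result. The {:values ... :vectors ...} shape is an assumption for illustration, not necessarily clatrix's exact output:
;; Hypothetical: pair each eigenvalue with its eigenvector, then keep
;; the n eigenvectors that have the largest eigenvalues.
(defn top-components-sketch [{:keys [values vectors]} n]
  (->> (map vector values vectors)
       (sort-by first >) ;; Largest eigenvalues first
       (take n)
       (map second))) ;; Keep only the eigenvectors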
The two principal components are the two eigenvectors with the highest eigenvalues. So our principal component analysis is complete. Now to utilize the extracted components.
We will be using the principal components to visualize our dataset onto a 2d scatter plot. Be aware though that PCA can be used for other things like anomaly detection, data compression, and even classification.
Even though we are doing a data visualization, the easiest way to think about PCA is as a data compression algorithm. We reduce the dimensionality of the dataset, thus compressing it. Undoubtedly there is information loss, but oftentimes the compressed version is all that is needed. We are going to compress our formatted dataset with the principal components.
(def compressed-dataset (map (partial compress iris-components) dataset))
The pca library contains a compress function which accepts the components as the first argument and a vector as the second argument. The "compression" is basically the dot product of the components with the vector passed in, the vectors passed in being the rows of the dataset.
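Based on that description, a minimal sketch of the operation could look like this (an illustration, not the library's exact code):
;; Project one row onto each principal component via a dot product,
;; yielding one coordinate per component (here, a 2d point).
(defn compress-sketch [components v]
  (mapv (fn [component] (reduce + (map * component v)))
        components))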
Plotly.js has a fairly simple API for making just about any kind of graph, and the best part is they all look very professional too. plotly-clj is a simple wrapper I wrote that allows the user to pass EDN data instead of JSON into plotly's API. The result is a beautiful graph popping up in your default browser.
plotly-clj exposes a plot
function which accepts 4 arguments: the width of the graph, the height of the graph, the data for the graph, and the layout. I'm going to go into more detail about the data argument; note that we won't be using the layout in this post.
Plotly wants the data argument to be in the form of an array of 'traces'. A trace is basically a map of data and settings for how to render it. There are three different species, and we want the data points for each species to have a corresponding color. Since we want three different colors, we will need three traces. Before we can get to the traces though, we need to split our compressed data up by species.
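Before we do the splitting, here is roughly what a single finished trace will look like, so the target shape is clear (the coordinate values here are made up for illustration):
;; One plotly trace: the data plus rendering settings for one species
{:name "setosa" ;; Legend label for this species
 :type "scatter"
 :x [0.5 0.7 0.6] ;; Illustrative first-component coordinates
 :y [1.2 1.1 1.3] ;; Illustrative second-component coordinates
 :mode "markers"}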
Remember that when we formatted our dataset we did not include the species column, the reason being that we can't calculate a covariance matrix of a dataset that includes our labels. However, we now need those labels to split our dataset.
(def labels (map #(get % 5) (rest iris-lines))) ;; Skip the header row so labels align with dataset
In order to split our dataset up by species, each label has to be associated with its corresponding compressed vector. Then we filter out each class by its label:
(def species ["setosa" "versicolor" "virginica"])
(def split-dataset
  (let [rows (map (fn [a b] (flatten [a b])) compressed-dataset labels) ;; Append each label to its 2d point
        filter-fn (fn [specie] (filter #(= specie (nth % 2)) rows))] ;; Keep only the rows for one species
    (map filter-fn species)))
Now we can build our traces and pass them into our plot:
(defn build-trace [data name]
  {:name name
   :type "scatter" ;; Render as a scatter plot
   :x (map first data)
   :y (map second data)
   :mode "markers"}) ;; Points only, no connecting lines

(plot 800
      800
      (map build-trace split-dataset species)
      {})
Our plot is 800 x 800 pixels. You might be wondering how we got the different colors, even though we didn't pass in any color arguments. We don't have to specify any colors because plotly will choose them for us. We tell it we want the type of the plot to be a scatter plot. The mode is set to markers so there are no lines drawn between the points.
What can we conclude from our visualization? Well, we can easily see that there are three clusters in the dataset, just like we talked about earlier. Even though there is information loss when converting the vectors from 4d to 2d, our compression still preserved the clustering. Thus the beauty of PCA!
PCA is a very useful tool for initial data exploration. When using other data visualization tools such as autoencoders or t-SNE, it is always helpful to run PCA first. It allows you to see whether the dataset has any structure or apparent properties, while being relatively cheap compared to other algorithms. This way you don't waste hours or days training on a bad dataset.