- A scatter plot uses Cartesian coordinates to display the values of a dataset as points. It helps us visualize the relationship between different dimensions (features).
- The basic algorithm for creating a scatter plot is as follows:

      # Reference: Interactive Data Visualization: Foundations, Techniques, and Applications, Second Edition by Daniel A. Keim, Georges G. Grinstein, and Matthew O. Ward
      def scatterplot(xDim, yDim, cDim, rDim, rMin, rMax):
          for each record i:
              x = NORMALIZE(i, xDim)
              y = NORMALIZE(i, yDim)
              r = NORMALIZE(i, rDim, rMin, rMax)
              MAPCOLOR(i, cDim)  # Maps the value of the dimension of the data point to a color
              DRAW_CIRCLE(x, y, r)
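The pseudocode above can be sketched as runnable Python. This is only a minimal illustration, not the book's implementation: the helper `normalize`, the `palette` argument, and the choice to return circle descriptions instead of drawing them are all my own assumptions, and records are assumed to be dictionaries keyed by dimension name.

```python
def normalize(value, lo, hi, out_min=0.0, out_max=1.0):
    """Linearly map value from [lo, hi] onto [out_min, out_max]."""
    if hi == lo:
        # Degenerate dimension: place every record in the middle of the range.
        return (out_min + out_max) / 2.0
    t = (value - lo) / (hi - lo)
    return out_min + t * (out_max - out_min)


def scatterplot(records, x_dim, y_dim, c_dim, r_dim, r_min, r_max,
                palette=("#1f77b4", "#ff7f0e", "#2ca02c")):
    """Return one circle description (x, y, r, color) per record."""
    if not records:
        return []

    def extent(dim):
        values = [rec[dim] for rec in records]
        return min(values), max(values)

    x_lo, x_hi = extent(x_dim)
    y_lo, y_hi = extent(y_dim)
    r_lo, r_hi = extent(r_dim)

    # MAPCOLOR step: one palette entry per distinct value, cycling if needed.
    categories = sorted({rec[c_dim] for rec in records})
    color_of = {cat: palette[i % len(palette)]
                for i, cat in enumerate(categories)}

    circles = []
    for rec in records:
        circles.append({
            "x": normalize(rec[x_dim], x_lo, x_hi),
            "y": normalize(rec[y_dim], y_lo, y_hi),
            "r": normalize(rec[r_dim], r_lo, r_hi, r_min, r_max),
            "color": color_of[rec[c_dim]],
        })
    return circles
```

The returned `x` and `y` values lie in the range 0 to 1, so a drawing layer still has to scale them to pixel coordinates before calling its own circle primitive.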
- A scatter plot yields several insights. It shows the correlation between different dimensions (features) of the data points, and it lets us formulate hypotheses from the data; for example, in our data the MPG decreases as horsepower increases. We can also fit a curve through the data points. Scatter plots also help us visualize multiple dimensions at once by encoding extra features as size, color, shape, etc. Finally, the patterns visible in a scatter plot can lead to new discoveries and can serve as evidence when arguing a point.
- The following features of a scatter plot can be controlled:
- Color of the point: This is generally used to show categorical features. We control it by choosing a feature and assigning a color to each of its values.
- Radius/size of the data point: This can show an extra dimension, either continuous data or some categorical data such as the number of cylinders. We control it by normalizing the dimension's values and mapping them to a radius range.
- Shape of the data point: We can use different shapes for different categories of the data.
- Legends: Legends give additional information about a dimension by providing a quick lookup map. They are derived from the values of that dimension in the data.
- Opacity/transparency of a data point: Varying the degree of opacity can encode different values of a dimension.
- Tooltip/additional info on mouse over: Additional information about a data point's dimensions can be shown when the mouse hovers over it.
- Drop-downs for changing the x and y dimensions: The dimensions shown on the x and y axes can be made selectable through drop-downs.
- 3D scatter plot: Adding a third axis lets us visualize one more dimension in 3D space.
- Regression/best-fit line: A regression (best-fit) line summarizes the trend in the data. Any linear regression technique, optionally with regularization, can be used to compute it.
- Filters: Filters restrict the data on a dimension. A filter can be a set of check boxes listing the values of a dimension, or a finer-grained control such as greater than/less than.
- Tick marks: Tick marks show the values at intermediate points on an axis. Extending them across the whole graph as grid lines helps the user read off the values of data points.
- Highlighting data points on mouse over: We can highlight data points on mouse over to make the data point explicitly stand out in a scatter plot.
- X and Y axis labels: X and Y axis labels should be provided to understand the dimensions being represented.
- Selector brushes: We can use a circular or lasso selector to select a subset of points and compute some metric over it.
- Reset button: A reset button can be provided to reset values to default.
- Segregated clusters: Clusters can be marked with well-defined boundaries, for example by drawing a circle around each cluster.
- Derived dimensions: A derived dimension is computed from base dimensions, for example velocity = distance/time. We can allow the user to define such derived dimensions and plot them.
- Zoom and pan in a scatter plot: We can let users zoom and pan across the data points, enabling more fine-grained analysis of the data.
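A few of the controls listed above (filters, derived dimensions, and the best-fit line) reduce to small data transformations. The following is a minimal sketch in plain Python. The helper names `filter_records`, `add_derived`, and `best_fit_line` are hypothetical, records are assumed to be dictionaries keyed by dimension name, and ordinary least squares without regularization is used as one possible regression technique.

```python
def filter_records(records, dim, predicate):
    """Keep only the records whose value on `dim` satisfies `predicate`."""
    return [rec for rec in records if predicate(rec[dim])]


def add_derived(records, name, fn):
    """Return copies of the records with a derived dimension `name` added."""
    return [{**rec, name: fn(rec)} for rec in records]


def best_fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A "greater than" filter is then `filter_records(records, "horsepower", lambda v: v > 100)`, and the velocity example becomes `add_derived(records, "velocity", lambda r: r["distance"] / r["time"])`.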
- In its basic form, a scatter plot is a 2D or 3D representation of data, but most real-life problems have many more dimensions. I suggest a scatter plot that uses PCA dimensionality reduction to project the data into 2D or 3D. In the projected plot, clusters of data can be easier to identify. On mouse over, we should be able to see the original data point with its constituent dimensions.
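The PCA projection suggested above could be prototyped as follows. This is a sketch under stated assumptions, not a production implementation: it uses power iteration with deflation on the covariance matrix in place of a library eigensolver, the function name `pca_project` is hypothetical, the data is assumed to have at least `k` directions of non-zero variance, and features are only centered, not scaled (with columns in different units, scaling each to unit variance first is usually advisable).

```python
import math


def _matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def _unit(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def pca_project(rows, k=2, iterations=200):
    """Project each row onto the top-k principal components."""
    n, d = len(rows), len(rows[0])
    if n < 2 or not 1 <= k <= d:
        raise ValueError("need at least two rows and 1 <= k <= dimensions")

    # Center every column on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]

    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    components = []
    for _component in range(k):
        # Power iteration: repeated multiplication converges to the
        # eigenvector with the largest eigenvalue, unless the start
        # vector happens to be orthogonal to it.
        v = _unit([float(j + 1) for j in range(d)])
        for _step in range(iterations):
            w = _matvec(cov, v)
            if not any(w):
                break  # no variance left in the deflated matrix
            v = _unit(w)
        eigenvalue = sum(a * b for a, b in zip(v, _matvec(cov, v)))
        components.append(v)
        # Deflation: remove the found component so that the next
        # iteration converges to the next-largest eigenvector.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]

    # Coordinates of each centered row along each component.
    return [[sum(a * b for a, b in zip(row, comp)) for comp in components]
            for row in centered]
```

Each output row is a 2D (or 3D, with `k=3`) point that can be drawn in the scatter plot, while the original row stays available for the mouse-over tooltip because the output order matches the input order.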
- I obtained the data from Prof. Curran's video: https://github.com/curran/d3-in-motion/blob/master/data/cars.csv. The dataset describes cars with the following features: mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin, name. It has 392 data points.
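For experimenting outside the browser, the file could be loaded with Python's standard `csv` module. This is a sketch under assumptions: the function name `load_cars` is hypothetical, the file is assumed to have a header row with the feature names listed above, and I am assuming that every column except `origin` and `name` is numeric.

```python
import csv

# Assumed numeric columns; origin and name are kept as text.
NUMERIC_COLUMNS = ("mpg", "cylinders", "displacement", "horsepower",
                   "weight", "acceleration", "year")


def load_cars(lines):
    """Parse CSV lines (for example an open file) into a list of dicts."""
    records = []
    for row in csv.DictReader(lines):
        record = dict(row)
        try:
            for col in NUMERIC_COLUMNS:
                record[col] = float(row[col])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values
        records.append(record)
    return records


# Usage, assuming the file was saved locally as cars.csv:
# with open("cars.csv", newline="") as f:
#     cars = load_cars(f)
```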
- Please refer to the scatter plot.
- Two other variables, the color and the radius (size) of the circle, can be changed.
- Mouse over/tooltip is enabled.
- An interesting trend to visualize is the relationship between miles per gallon and horsepower in the data set. We can see a clear decline in miles per gallon as horsepower increases. This suggests that a buyer has to trade off power against fuel economy when choosing a vehicle. The Datsun 280-zx is an outlier, with a very good mpg of 32.7 for 132 horsepower. It is also interesting that most of the high-power vehicles are from the USA and that all of them have 8 cylinders, which suggests a V8 engine, as expected. I believe a scatter plot captures the trend between MPG and horsepower best, because it is one of the best visualizations for discovering the relationship between two variables. A line graph of the best-fit curve through the data is another visualization that could be used.
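The strength of a trend like the one described above could be quantified with the Pearson correlation coefficient; a value close to -1 would indicate a strong linear decrease. The sketch below only defines the computation and reports no value for the cars data. The function name `pearson_r` is my own, and the inputs are assumed to be two equal-length lists of numbers such as the mpg and horsepower columns.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

Pearson's r only measures linear association, so a curved relationship can give a weaker coefficient than the plot suggests. This is one reason to look at the scatter plot itself rather than rely on a single number.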
References:
- https://www.manning.com/livevideo/d3-js-in-motion
- Interactive Data Visualization: Foundations, Techniques, and Applications, Second Edition by Daniel A. Keim, Georges G. Grinstein, and Matthew O. Ward
- Lab 3 590V Files