Created
December 1, 2014 01:13
{
"metadata": {
"name": "Untitled0"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": "import io\nfrom urllib.request import urlopen\n\nimport pandas as pd\n\n# Get the assignment data\nlink = 'http://math.usask.ca/~laverty/S245/Assignments/Assignments%20Fall%202012/CompAsst/Asst2Computer2013data.xls'\nresponse = urlopen(link)\n\n# Read the last sheet of the Excel workbook into a pandas DataFrame,\n# skipping the first column of the file because it is redundant.\nxd = pd.ExcelFile(io.BytesIO(response.read()))\ndf = xd.parse(xd.sheet_names[-1], header=0, usecols=list(range(1, 11)))\n",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 56
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Matthew Galbraith - mpg317 - 11138371\n-------------------------------------\nSTAT-245-01 (Laverty) - Assignment 2 - Due Oct. 31, 2014\n--------------------------------------------------------\n\n<u>**Compute the correlation between each pair of events**</u>\n\nFrom the data, I created a table showing the correlation between each pair of events in the Olympic decathlon. Two things to note about the table: the diagonal entries pair each event with itself, so they are always 1 and carry no information, and the table is symmetric, so RiCj (Row i, Column j) equals RjCi, because the pair of events x1, x2 is the same as the pair x2, x1.\n\nThe values are computed using *Pearson's product-moment correlation coefficient* **r**.\n\n\n<u>**Determine which events are most highly correlated and which events are least correlated.**</u>\n\nThe events with the highest correlation are x3 and x7 (shot put and discus), with a correlation coefficient of approximately 0.704722. The events with the lowest correlation are x1 and x10 (100m and 1500m), with a correlation coefficient of approximately -0.045854, a slight negative correlation.\n\n\n<u>**Comment.**</u>\n\nThe 100m and 1500m are the least correlated pair, and they are also the only pair of events in the decathlon with a negative correlation, although a very slight one. Both are running events, but the 100m is a sprint while the 1500m is a longer-distance run: sprinters may not have the endurance to perform well over a longer distance, and distance runners may not have the speed to perform well in a sprint. The most highly correlated events, shot put and discus, are similar: shot put involves throwing a \"shot\" (a heavy ball) and discus involves throwing a disc. This could explain the strong positive correlation between performance in the two events, as both call for similar upper-body training. Other pairs are worth a comment as well. The second least correlated pair, x9 and x10 (javelin and 1500m), has a correlation coefficient of approximately 0.056698, which makes sense because throwing a javelin is very different from running 1500m. The second most highly correlated pair, x1 and x5 (100m and 400m), has a correlation coefficient of approximately 0.643485. These are two shorter-distance running events (the 400m is one full lap of a standard track and the 100m is one quarter of a lap), so it makes sense that performance in one is somewhat correlated with performance in the other."
},
{
"cell_type": "code",
"collapsed": false,
"input": "correlation_matrix = df.corr(method='pearson')\ncorrelation_matrix",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>x1</th>\n <th>x2</th>\n <th>x3</th>\n <th>x4</th>\n <th>x5</th>\n <th>x6</th>\n <th>x7</th>\n <th>x8</th>\n <th>x9</th>\n <th>x10</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>x1</th>\n <td> 1.000000</td>\n <td> 0.585335</td>\n <td> 0.280897</td>\n <td> 0.223010</td>\n <td> 0.643485</td>\n <td> 0.329727</td>\n <td> 0.182300</td>\n <td> 0.151807</td>\n <td> 0.117975</td>\n <td>-0.045854</td>\n </tr>\n <tr>\n <th>x2</th>\n <td> 0.585335</td>\n <td> 1.000000</td>\n <td> 0.322786</td>\n <td> 0.559354</td>\n <td> 0.502929</td>\n <td> 0.560442</td>\n <td> 0.270514</td>\n <td> 0.317594</td>\n <td> 0.207177</td>\n <td> 0.070008</td>\n </tr>\n <tr>\n <th>x3</th>\n <td> 0.280897</td>\n <td> 0.322786</td>\n <td> 1.000000</td>\n <td> 0.358365</td>\n <td> 0.396357</td>\n <td> 0.234075</td>\n <td> 0.704722</td>\n <td> 0.214011</td>\n <td> 0.523221</td>\n <td> 0.116284</td>\n </tr>\n <tr>\n <th>x4</th>\n <td> 0.223010</td>\n <td> 0.559354</td>\n <td> 0.358365</td>\n <td> 1.000000</td>\n <td> 0.268387</td>\n <td> 0.548898</td>\n <td> 0.290969</td>\n <td> 0.474351</td>\n <td> 0.129864</td>\n <td> 0.266634</td>\n </tr>\n <tr>\n <th>x5</th>\n <td> 0.643485</td>\n <td> 0.502929</td>\n <td> 0.396357</td>\n <td> 0.268387</td>\n <td> 1.000000</td>\n <td> 0.315019</td>\n <td> 0.386719</td>\n <td> 0.289851</td>\n <td> 0.403909</td>\n <td> 0.414685</td>\n </tr>\n <tr>\n <th>x6</th>\n <td> 0.329727</td>\n <td> 0.560442</td>\n <td> 0.234075</td>\n <td> 0.548898</td>\n <td> 0.315019</td>\n <td> 1.000000</td>\n <td> 0.194883</td>\n <td> 0.479401</td>\n <td> 0.184994</td>\n <td> 0.063186</td>\n </tr>\n <tr>\n <th>x7</th>\n <td> 0.182300</td>\n <td> 0.270514</td>\n <td> 0.704722</td>\n <td> 0.290969</td>\n <td> 0.386719</td>\n <td> 0.194883</td>\n <td> 1.000000</td>\n <td> 0.343120</td>\n <td> 0.393074</td>\n <td> 0.244814</td>\n </tr>\n <tr>\n <th>x8</th>\n <td> 0.151807</td>\n <td> 0.317594</td>\n <td> 0.214011</td>\n <td> 0.474351</td>\n <td> 0.289851</td>\n <td> 0.479401</td>\n <td> 0.343120</td>\n <td> 1.000000</td>\n <td> 0.153606</td>\n <td> 0.297925</td>\n </tr>\n <tr>\n <th>x9</th>\n <td> 0.117975</td>\n <td> 0.207177</td>\n <td> 0.523221</td>\n <td> 0.129864</td>\n <td> 0.403909</td>\n <td> 0.184994</td>\n <td> 0.393074</td>\n <td> 0.153606</td>\n <td> 1.000000</td>\n <td> 0.056698</td>\n </tr>\n <tr>\n <th>x10</th>\n <td>-0.045854</td>\n <td> 0.070008</td>\n <td> 0.116284</td>\n <td> 0.266634</td>\n <td> 0.414685</td>\n <td> 0.063186</td>\n <td> 0.244814</td>\n <td> 0.297925</td>\n <td> 0.056698</td>\n <td> 1.000000</td>\n </tr>\n </tbody>\n</table>\n<p>10 rows \u00d7 10 columns</p>\n</div>",
"metadata": {},
"output_type": "pyout",
"prompt_number": 61,
"text": " x1 x2 x3 x4 x5 x6 x7 \\\nx1 1.000000 0.585335 0.280897 0.223010 0.643485 0.329727 0.182300 \nx2 0.585335 1.000000 0.322786 0.559354 0.502929 0.560442 0.270514 \nx3 0.280897 0.322786 1.000000 0.358365 0.396357 0.234075 0.704722 \nx4 0.223010 0.559354 0.358365 1.000000 0.268387 0.548898 0.290969 \nx5 0.643485 0.502929 0.396357 0.268387 1.000000 0.315019 0.386719 \nx6 0.329727 0.560442 0.234075 0.548898 0.315019 1.000000 0.194883 \nx7 0.182300 0.270514 0.704722 0.290969 0.386719 0.194883 1.000000 \nx8 0.151807 0.317594 0.214011 0.474351 0.289851 0.479401 0.343120 \nx9 0.117975 0.207177 0.523221 0.129864 0.403909 0.184994 0.393074 \nx10 -0.045854 0.070008 0.116284 0.266634 0.414685 0.063186 0.244814 \n\n x8 x9 x10 \nx1 0.151807 0.117975 -0.045854 \nx2 0.317594 0.207177 0.070008 \nx3 0.214011 0.523221 0.116284 \nx4 0.474351 0.129864 0.266634 \nx5 0.289851 0.403909 0.414685 \nx6 0.479401 0.184994 0.063186 \nx7 0.343120 0.393074 0.244814 \nx8 1.000000 0.153606 0.297925 \nx9 0.153606 1.000000 0.056698 \nx10 0.297925 0.056698 1.000000 \n\n[10 rows x 10 columns]"
}
],
"prompt_number": 61
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Unstack the table into one value per (row, column) pair and sort ascending\ndf2 = correlation_matrix.unstack()\ndf2 = df2.sort_values(kind='quicksort')\nprint(\"Sorted correlation coefficient values; ignore the last 10, which pair each event with itself\")\nprint(df2)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Sorted correlation coefficient values; ignore the last 10, which pair each event with itself\nx10 x1 -0.045854\nx1 x10 -0.045854\nx9 x10 0.056698\nx10 x9 0.056698\nx6 x10 0.063186\nx10 x6 0.063186\nx2 x10 0.070008\nx10 x2 0.070008\nx3 x10 0.116284\nx10 x3 0.116284\nx1 x9 0.117975\nx9 x1 0.117975\nx4 x9 0.129864\nx9 x4 0.129864\nx1 x8 0.151807\n...\nx1 x2 0.585335\nx5 x1 0.643485\nx1 x5 0.643485\nx7 x3 0.704722\nx3 x7 0.704722\nx1 x1 1.000000\nx8 x8 1.000000\nx7 x7 1.000000\nx6 x6 1.000000\nx5 x5 1.000000\nx4 x4 1.000000\nx3 x3 1.000000\nx2 x2 1.000000\nx9 x9 1.000000\nx10 x10 1.000000\nLength: 100, dtype: float64\n"
}
],
"prompt_number": 71
},
{
"cell_type": "code",
"collapsed": false,
"input": "",
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
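
The sorted listing in the notebook repeats every pair twice (x1 x10 and x10 x1) and ends with the ten self-pairs. Below is a minimal sketch of one way the most and least correlated distinct pairs could be picked out directly, by keeping only the cells above the diagonal of the correlation matrix. It is not part of the original assignment: the helper name `extreme_pairs` and the `toy` data frame are hypothetical and stand in for the decathlon data.

```python
import numpy as np
import pandas as pd


def extreme_pairs(corr):
    """Return (max pair, max r, min pair, min r) over distinct pairs.

    `corr` is a square correlation matrix such as the result of
    DataFrame.corr(). Hypothetical helper, for illustration only.
    """
    # Boolean mask that is True strictly above the diagonal. This drops the
    # self-pairs (always 1.0) and the mirrored duplicates (RiCj == RjCi).
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Cells outside the mask become NaN; stack into one value per pair
    # and drop the NaN entries explicitly.
    pairs = corr.where(upper).stack().dropna()
    return pairs.idxmax(), pairs.max(), pairs.idxmin(), pairs.min()


# Made-up scores for three hypothetical events, not the decathlon data.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

hi_pair, hi_r, lo_pair, lo_r = extreme_pairs(toy.corr(method="pearson"))
print(hi_pair, hi_r)
print(lo_pair, lo_r)
```

Applied to the notebook's data, the call would be `extreme_pairs(correlation_matrix)`. Masking before stacking leaves each pair exactly once, so no trailing self-pairs need to be ignored by eye.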