-
-
Save Btibert3/817607 to your computer and use it in GitHub Desktop.
############################################################################### | |
# Author: @BrockTibert | |
# Purpose: Collect Historical NHL Skater Stats 1960 - 2011 (in progress) | |
# Date: February 2011 | |
# | |
# Used: R Version 2.12.1, Windows 7 Pro, StatET Plugin for Eclipse | |
# | |
# # Copyright (c) 2011, under the Simplified BSD License. | |
# For more information on FreeBSD see: http://www.opensource.org/licenses/bsd-license.php | |
# All rights reserved. | |
############################################################################### | |
#----------------------------------------------------------------------- | |
# set up script level basics | |
#----------------------------------------------------------------------- | |
## libraries | |
library(XML) | |
## directory for the project | |
DIR <- "C:/Users/Brock/Documents/My Dropbox/Projects/NHL" | |
setwd(DIR) | |
#----------------------------------------------------------------------- | |
# Create a function that will take a year and return a dataframe | |
#----------------------------------------------------------------------- | |
GrabSkaters <- function(S) { | |
# The function takes parameter S which is a string and represents the Season | |
# Returns: data frame | |
## create the URL | |
URL <- paste("http://www.hockey-reference.com/leagues/NHL_", | |
S, "_skaters.html", sep="") | |
## grab the page -- the table is parsed nicely | |
tables <- readHTMLTable(URL) | |
ds.skaters <- tables$stats | |
## determine if the HTML table was well formed (column names are the first record) | |
## can either read in directly or need to force column names | |
## and | |
## I don't like dealing with factors if I don't have to | |
## and I prefer lower case | |
for(i in 1:ncol(ds.skaters)) { | |
ds.skaters[,i] <- as.character(ds.skaters[,i]) | |
names(ds.skaters) <- tolower(colnames(ds.skaters)) | |
} | |
## fix a couple of the column names | |
colnames(ds.skaters) | |
names(ds.skaters)[10] <- "plusmin" | |
names(ds.skaters)[17] <- "spct" | |
## finally fix the columns - NAs forced by coercion warnings | |
for(i in c(1, 3, 6:18)) { | |
ds.skaters[,i] <- as.numeric(ds.skaters[, i]) | |
} | |
## convert toi to seconds, and seconds/game | |
ds.skaters$seconds <- (ds.skaters$toi*60)/ds.skaters$gp | |
## remove the header and totals row | |
ds.skaters <- ds.skaters[!is.na(ds.skaters$rk), ] | |
ds.skaters <- ds.skaters[ds.skaters$tm != "TOT", ] | |
## add the year | |
ds.skaters$season <- S | |
## return the dataframe | |
return(ds.skaters) | |
} | |
#----------------------------------------------------------------------- | |
# Use the function to loop over the seasons and piece together | |
#----------------------------------------------------------------------- | |
## define the seasons -- 2005 dataset doesnt exist | |
## if I was a good coder I would trap the error, but this works | |
SEASON <- as.character(c(1960:2004, 2006:2011)) | |
## create an empy dataset that we will append to | |
dataset <- data.frame() | |
## loop over the seasons, use the function to grab the data | |
## and build the dataset | |
for (S in SEASON) { | |
temp <- GrabSkaters(S) | |
dataset <- rbind(dataset, temp) | |
print(paste("Completed Season ", S, sep="")) | |
## pause the script so we don't kill their servers | |
Sys.sleep(3) | |
} | |
## save the dataset | |
write.table(dataset, "Historical Skater Stats.csv", sep=",", | |
row.names=F) |
This is the first time I've seen the XML library in use (new to R), so thanks for sharing your code!
THANK YOU ,
so if I wanted to do this by month (M), day (D), year (Y), what alterations would I make? New to coding...
Facing this error :-
Error: failed to load external entity "http://www.hockey-reference.com/leagues/NHL_1960_skaters.html"
Hi, can you help me? I'm facing this error:
Error in ds.skaters$toi * 60 : non-numeric argument to binary operator
Calls: ... withCallingHandlers -> withVisible -> eval -> eval -> GrabSkaters
In addition: Warning message:
In in_dir(input_dir(), evaluate(code, envir = env, new_device = FALSE, :
You changed the working directory to C:/Users/SV SEPAT/Documents/Tes R (probably via setwd()). It will be restored to C:/Users/SV Sepat/Documents/Tes R. See the Note section in ?knitr::knit
Nice post! You can considerably simplify your code by using functions from the package plyr. See the following gist for reference: https://gist.github.com/817883
Cheers,
Ramnath