Created
January 18, 2013 22:32
-
-
Save tcash21/4569252 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
library(XML) | |
library(RCurl) | |
library(ggplot2) | |
results <- c() | |
## Loop through the 30 pages of player data | |
for(i in 1:30){ | |
theURL <- paste("http://www.nhl.com/ice/playerstats.htm?fetchKey=20122ALLSASAll&viewName=summary&sort=points&pg=", i, sep="") | |
webpage <- getURL(theURL) | |
h<-htmlParse(webpage) | |
stats <- xmlToDataFrame(nodes = getNodeSet(h, "//tbody//tr"))[,-1] | |
## Grab the column names only on the first iteration | |
if(i == 1){ | |
nodes<-getNodeSet(h, "//table [@summary='2011-2012 - Regular Season - Skater - Summary - Points'] | |
//thead//tr//th//a[@title]") | |
cols <- as.character(xmlToDataFrame(nodes)[,1]) | |
column.names <- gsub("\\n", "", cols) | |
## Append columns since any sorted column and Team column do not appear in a structured format in the HTML | |
column.names<-append(column.names, "Team", after=1) | |
column.names<-append(column.names, "P", after=6) | |
## Clean up column names so they are R-friendly | |
column.names[8] <- "Plus.Minus" | |
column.names[15] <- "Shooting.Percentage" | |
column.names[16] <- "Time.On.Ice.Per.Game" | |
column.names[17] <- "Avg.Shifts.Per.Game" | |
column.names[18] <- "Faceoff.Win.Percentage" | |
results <- rbind(results, stats) | |
colnames(results) <- column.names | |
} | |
colnames(stats) <- column.names | |
results <- rbind(results, stats) | |
} | |
## Remove plus signs from +/- so we can treat it as a number | |
results$Plus.Minus <- as.numeric(gsub("\\+", "", results$Plus.Minus)) | |
## Format factors as numeric data types | |
results[,c(4:15, 17:18)] <- apply(results[,c(4:15, 17:18)], 2, function(x) as.numeric(as.character(x))) | |
results <- results[match(unique(results$Player ), results$Player),] | |
## We only care about the first Team listed and not if that player was on multiple teams in 2011-12 | |
results$Team <- gsub("\\,\\s+\\w+", "", as.character(s.results$Team)) | |
## Pull out a team to visualize | |
t.results <- subset(s.results, Team == "BOS") | |
## Plot the data and save in a PDF | |
pdf(file="Bruins.pdf", width=11, height=8) | |
ggplot(t.results, aes(x=Plus.Minus, y=P, size=Avg.Shifts.Per.Game, colour=Pos, label=Player)) + geom_text() + | |
labs(x="+/-", y="Points", title= t.results$Team) | |
dev.off() |
I am somewhat new to R and am trying to learn how to scrape data. Having looked at a few other examples, your code made a lot more sense to me.
However, I am getting an error saying XML content does not seem to be XML
The error is occuring in
stats <- xmlToDataFrame(nodes = getNodeSet(h, "//tbody//tr"))[,-1]
Do you know what I could do to get this to run properly on my computer? I am running RStudio.
Also, what is the significance of the second argument in getNodeSet ?
Hi all, this URL is no longer active on NHL's site, but if you want a much easier way to get hockey data you should check out www.stattleship.com. There is an R wrapper and it is currently in free beta mode.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Check all the lines involving column.names. I had:
"Error in colnames<-(tmp, value = c("Player", "Team", "Team", "Pos", :
'names' attribute [22] must be the same length as the vector [21]"
and I was taking stats from 2001-2002. I commented out this line:
column.names<-append(column.names, "Team", after=1)
and the vectors then had the same dimensions...though for some reason results doesn't seem to rbind correctly and only takes the last page of the index in the for loop. No idea why that is happening.
EDIT: Wow, rookie mistake. I copied the code snippet at a time to learn how each part works and accidentally put:
results <- c()
inside the for loop