@tcash21
Created January 18, 2013 22:32
library(XML)
library(RCurl)
library(ggplot2)
results <- c()
## Loop through the 30 pages of player data
for(i in 1:30){
theURL <- paste("http://www.nhl.com/ice/playerstats.htm?fetchKey=20122ALLSASAll&viewName=summary&sort=points&pg=", i, sep="")
webpage <- getURL(theURL)
h<-htmlParse(webpage)
stats <- xmlToDataFrame(nodes = getNodeSet(h, "//tbody//tr"))[,-1]
## Grab the column names only on the first iteration
if(i == 1){
nodes<-getNodeSet(h, "//table [@summary='2011-2012 - Regular Season - Skater - Summary - Points']
//thead//tr//th//a[@title]")
cols <- as.character(xmlToDataFrame(nodes)[,1])
column.names <- gsub("\\n", "", cols)
## Append columns since any sorted column and Team column do not appear in a structured format in the HTML
column.names<-append(column.names, "Team", after=1)
column.names<-append(column.names, "P", after=6)
## Clean up column names so they are R-friendly
column.names[8] <- "Plus.Minus"
column.names[15] <- "Shooting.Percentage"
column.names[16] <- "Time.On.Ice.Per.Game"
column.names[17] <- "Avg.Shifts.Per.Game"
column.names[18] <- "Faceoff.Win.Percentage"
}
## Name the columns of this page, then append it once to the running results
colnames(stats) <- column.names
results <- rbind(results, stats)
}
## Remove plus signs from +/- so we can treat it as a number
results$Plus.Minus <- as.numeric(gsub("\\+", "", results$Plus.Minus))
## Format factors as numeric data types
results[,c(4:15, 17:18)] <- apply(results[,c(4:15, 17:18)], 2, function(x) as.numeric(as.character(x)))
results <- results[match(unique(results$Player ), results$Player),]
## We only care about the first Team listed and not if that player was on multiple teams in 2011-12
results$Team <- gsub("\\,\\s+\\w+", "", as.character(results$Team))
## Pull out a team to visualize
t.results <- subset(results, Team == "BOS")
## Plot the data and save in a PDF
pdf(file="Bruins.pdf", width=11, height=8)
## Wrap in print() so the plot is also drawn when the script is run with source()
print(ggplot(t.results, aes(x=Plus.Minus, y=P, size=Avg.Shifts.Per.Game, colour=Pos, label=Player)) + geom_text() +
labs(x="+/-", y="Points", title=t.results$Team[1]))
dev.off()
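
Because the NHL URL may not always be reachable, the clean-up steps in the script can be tried offline. The following is only a minimal sketch: the `toy` data frame and its rows are made up for illustration and are not real NHL data. It applies the same `gsub()` and `match(unique())` calls the script uses.

```r
# Made-up rows standing in for the scraped table (hypothetical, not real NHL data)
toy <- data.frame(
  Player = c("A. Skater", "B. Winger", "A. Skater"),
  Team = c("BOS, TOR", "BOS", "BOS, TOR"),
  Plus.Minus = c("+12", "-3", "+12"),
  stringsAsFactors = FALSE
)

# Strip the plus sign so the column converts cleanly to numeric
toy$Plus.Minus <- as.numeric(gsub("\\+", "", toy$Plus.Minus))

# Keep only the first row found for each player name
toy <- toy[match(unique(toy$Player), toy$Player), ]

# Keep only the first team listed for players with several teams
toy$Team <- gsub("\\,\\s+\\w+", "", as.character(toy$Team))

print(toy)
```

Note that de-duplicating on `Player` alone would also merge two different players who happen to share a name.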
@paulmotz

paulmotz commented May 5, 2014

Check all the lines involving column.names. I had:

"Error in `colnames<-`(`*tmp*`, value = c("Player", "Team", "Team", "Pos", :
'names' attribute [22] must be the same length as the vector [21]"

and I was taking stats from 2001-2002. I commented out this line:

column.names<-append(column.names, "Team", after=1)

and the vectors then had the same dimensions...though for some reason results doesn't seem to rbind correctly and only takes the last page of the index in the for loop. No idea why that is happening.

EDIT: Wow, rookie mistake. I copied the code one snippet at a time to learn how each part works and accidentally put:

results <- c()

inside the for loop
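
The pitfall described above can be reproduced without any scraping. This is a small sketch with made-up one-column data frames (`pages`, `wrong`, and `right` are hypothetical names), comparing an accumulator reset inside the loop with one initialised before it.

```r
# Three made-up "pages" of two rows each (hypothetical data)
pages <- list(data.frame(x = 1:2), data.frame(x = 3:4), data.frame(x = 5:6))

# Buggy: the accumulator is reset on every pass, so only the last page survives
for (p in pages) {
  wrong <- c()
  wrong <- rbind(wrong, p)
}

# Correct: initialise the accumulator once, before the loop
right <- c()
for (p in pages) {
  right <- rbind(right, p)
}

nrow(wrong)  # rows from the last page only
nrow(right)  # rows from all three pages
```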

@DPotter555

I am somewhat new to R and am trying to learn how to scrape data. Having looked at a few other examples, your code made a lot more sense to me.
However, I am getting an error saying "XML content does not seem to be XML".
The error is occurring in

stats <- xmlToDataFrame(nodes = getNodeSet(h, "//tbody//tr"))[,-1]

Do you know what I could do to get this to run properly on my computer? I am running RStudio.
Also, what is the significance of the second argument in getNodeSet ?

@tcash21
Author

tcash21 commented Jan 23, 2016

Hi all, this URL is no longer active on NHL's site, but if you want a much easier way to get hockey data you should check out www.stattleship.com. There is an R wrapper and it is currently in free beta mode.
