Calculate Internal PageRank from Screaming Frog Crawl
library("igraph") | |
# Swap out path to your Screaming Frog All Outlink CSV. For Windows, remember to change backslashes to forward slashes. | |
links <- read.csv("C:/Documents/screaming-frog-all-outlinks.csv", skip = 1) # CSV Path | |
# This line of code is optional. It filters out JavaScript, CSS, and Images. Technically you should keep them in there. | |
links <- subset(links, Type=="AHREF") # Optional line. Filter. | |
links <- subset(links, Follow=="true") | |
links <- subset(links, select=c(Source,Destination)) | |
g <- graph.data.frame(links) | |
pr <- page.rank(g, algo = "prpack", vids = V(g), directed = TRUE, damping = 0.85) | |
values <- data.frame(pr$vector) | |
values$names <- rownames(values) | |
row.names(values) <- NULL | |
values <- values[c(2,1)] | |
names(values)[1] <- "url" | |
names(values)[2] <- "pr" | |
# Swap out 'domain' and 'com' to represent your website address. | |
values <- values[grepl("https?:\\/\\/(.*\\.)?domain\\.com.*", values$url),] # Domain filter. | |
# Replace with your desired filename for the output file. | |
write.csv(values, file = "output-pagerank.csv") # Output file. |
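For a quick sanity check before opening the CSV, you can preview the highest-ranked URLs directly in R. A minimal sketch, assuming the values data frame produced by the script above:

# Sort descending by internal PageRank and preview the top pages.
values <- values[order(-values$pr), ]
head(values, 10)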
Resolved: for the all_outlinks.csv file, set skip = 0. The first line ("All Outlinks") is no longer in the export, so there is nothing to skip. ;-)
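If you are not sure whether your export still contains that banner row, you can detect it before reading the file. A minimal sketch with a placeholder path (not part of the original script):

# Set skip automatically depending on whether the first line is the
# "All Outlinks" banner or the real column header.
path <- "/YOUR/FILEPATH/all_outlinks.csv"  # placeholder path
first_line <- readLines(path, n = 1)
skip_rows <- if (grepl("All Outlinks", first_line)) 1 else 0
links <- read.csv(path, skip = skip_rows)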
Any updates for Screaming Frog v13?
Any updates for this script?
Hi guys, I'm getting these errors:
> links <- subset(links, Type=="HREF") # Optional line. Filter.
Error in eval(e, x, parent.frame()) : object 'Type' not found
> links <- subset(links, Follow=="true")
Error in eval(e, x, parent.frame()) : object 'Follow' not found
> links <- subset(links, select=c(Source,Destination))
Error in eval(substitute(select), nl, parent.frame()) :
object 'Source' not found
Any help?
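Those "object not found" errors usually mean the data frame has no columns named Type, Follow, or Source, either because the column names changed between Screaming Frog versions or because skip = 1 was applied to a file that no longer has the banner row, so the real header line was discarded. A quick diagnostic sketch, with a placeholder path:

# Inspect what the export actually contains before calling subset().
links <- read.csv("/YOUR/FILEPATH/all_outlinks.csv")  # placeholder path
colnames(links)   # should include Type, Follow, Source, Destination
head(links, 3)    # if the column names look like URL data, the real header was skipped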
There were some slight changes in the CSV file that Screaming Frog outputs. The version below should work, and you can change Type=="Hyperlink" to look at different types of links.
# Swap out path to your Screaming Frog All Outlink CSV. For Windows, remember to change backslashes to forward slashes.
links <- read.csv("/YOUR/FILEPATH/all_outlinks.csv") # CSV Path
# This line of code is optional. It filters out JavaScript, CSS, and Images. Technically you should keep them in there.
links <- subset(links, Type=="Hyperlink") # Optional line. Filter.
links <- subset(links, Follow=="true")
links <- subset(links, select=c(Source,Destination))
g <- graph.data.frame(links)
pr <- page.rank(g, algo = "prpack", vids = V(g), directed = TRUE, damping = 0.85)
values <- data.frame(pr$vector)
values$names <- rownames(values)
row.names(values) <- NULL
values <- values[c(2,1)]
names(values)[1] <- "url"
names(values)[2] <- "pr"
# Swap out 'domain' and 'com' to represent your website address.
values <- values[grepl("https?:\\/\\/(.*\\.)?domain\\.com.*", values$url),] # Domain filter.
# Replace with your desired filename for the output file.
write.csv(values, file = "output-pagerank.csv") # Output file.
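A side note on igraph itself: recent releases deprecate the dotted function names used here (graph.data.frame(), page.rank()) in favour of underscore names, so depending on your igraph version you may see warnings. An equivalent sketch with the newer names, assuming the same links data frame:

library(igraph)
g <- graph_from_data_frame(links)   # newer name for graph.data.frame()
pr <- page_rank(g, algo = "prpack", directed = TRUE, damping = 0.85)
values <- data.frame(url = names(pr$vector), pr = pr$vector, row.names = NULL)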
Hi @pshapiro,
could you please update your useful code?
With the new Screaming Frog version 12.6, there are some errors when you try to read all_outlinks.csv.
Thanks a lot