@MarkEdmondson1234
Last active November 11, 2023 00:25
Find pages with 0 pageviews by comparing sitemap.xml URLs with Google Analytics visits.
library(googleAnalyticsR)
library(xml2)
library(dplyr)
ga_auth()
## date range of URLs to test
dates <- c(Sys.Date() - 30, Sys.Date())
## GA View ID
id <- 11111111
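## function to get sitemap URLs
## Returns the text of every <field> element (default <loc>) as a character vector.
## XPath is used instead of as_list(), whose output shape changed across xml2 versions.
get_sitemap <- function(sitemap, field = "loc"){
  out <- try({
    sm <- xml_ns_strip(read_xml(sitemap))
    xml_text(xml_find_all(sm, paste0("//", field)), trim = TRUE)
  })
  ## fail softly if the file cannot be read or has no matching elements
  if(inherits(out, "try-error") || length(out) == 0){
    message("Problem with sitemap: ", sitemap)
    return(NULL)
  }
  out
}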
## make google SEO filter
google_seo <- filter_clause_ga4(
  list(
    dim_filter("source", "EXACT", "google"),
    dim_filter("medium", "EXACT", "organic")
  ),
  operator = "AND")
## get the pages
pages <- google_analytics_4(id,
                            date_range = dates,
                            dimensions = "pagePath",
                            metrics = c("pageviews","totalEvents"),
                            dim_filters = google_seo,
                            max = -1,
                            anti_sample = TRUE)
## get the sitemap index file
url_si <- "http://www.example.com/sitemap.xml"
sitemap_index <- get_sitemap(url_si)
## get all the sitemaps (maybe you only need the call above if you have no sitemap index)
many_sitemaps <- lapply(sitemap_index, get_sitemap)
## all the urls in all the sitemaps
all_urls <- Reduce(c, many_sitemaps)
## Compare and get the URLs that are in XML but not in Google Analytics
## dplyr transformations (tibble() replaces as.tbl(), which is deprecated in current dplyr)
sitemap_urls <- tibble(all_urls = all_urls)
sitemap_urls <- sitemap_urls %>% mutate(path = paste0("/", urltools::path(all_urls)))
sitemap_not_in_ga <- anti_join(sitemap_urls, pages, by = c(path = "pagePath"))
## write out to CSV (create the output folder first if it does not exist)
dir.create("./data", showWarnings = FALSE)
write.csv(sitemap_not_in_ga, file = "./data/sitemap_urls_not_in_ga.csv", row.names = FALSE)
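
The comparison step at the end of the script is an anti-join: it keeps the sitemap paths that have no matching pagePath in the Google Analytics data. A minimal sketch on made-up data (the paths and pageview counts below are invented for illustration, not taken from any real site) shows what the result looks like:

```r
library(dplyr)

# Made-up sitemap paths and GA rows, for illustration only
sitemap_urls <- tibble(path = c("/", "/about", "/old-page"))
pages <- tibble(pagePath = c("/", "/about"), pageviews = c(120, 8))

# anti_join() keeps the rows of the first table that have no match in the second
sitemap_not_in_ga <- anti_join(sitemap_urls, pages, by = c(path = "pagePath"))
print(sitemap_not_in_ga$path)
```

Only "/old-page" is left, because it is listed in the sitemap but received no visits in the GA data.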
@withetu
withetu commented Jan 30, 2017

Hello,

I am a beginner in R. When I run your R code for my GA account, I get stuck at this step:

## get the sitemap index file

url_si <- "http://www.my-domain.com/sitemap.xml"
sitemap_index <- get_sitemap(url_si)

error:
Error in x[[field]] : subscript out of bounds
Problem with sitemap:http://www.my-domain.com/sitemap.xml

Please help me!

Thank you

@MarkEdmondson1234
Author

Sorry, I get no notifications for this, so I missed it. It's saying the sitemap has no "loc" field. Is it a correctly configured one? I realise you may never see this, for the same reason I didn't.
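
One way to check is to look at the structure of the sitemap before calling get_sitemap(). The following is only a sketch: the inline XML is a hypothetical stand-in for a real sitemap, and with a real site you would pass the sitemap URL to read_xml() instead.

```r
library(xml2)

# Hypothetical inline sitemap standing in for a real sitemap URL
sitemap_xml <- '<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/about</loc></url>
</urlset>'

doc <- read_xml(sitemap_xml)

# Root element: "urlset" for a plain sitemap, "sitemapindex" for a sitemap index
root_name <- xml_name(doc)

# Element names nested under each top-level entry; "loc" should be among them
child_names <- unique(xml_name(xml_children(xml_children(doc))))

# The values that a lookup of the "loc" field should return
locs <- xml_text(xml_find_all(xml_ns_strip(doc), "//loc"))

print(root_name)
print(child_names)
print(locs)
```

If "loc" does not show up in child_names, the file is probably not a standard sitemap, or the URL is returning something else, such as an HTML error page.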

@stringbenderb5

Where would I place this code in a WordPress site? And will it work in WordPress?

@MarkEdmondson1234
Author

Just saw this, sorry. It won't work in WordPress, which is PHP. This is a script to run in R, for example locally on your laptop.
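
As a rough sketch of what running it locally involves: the script needs the four packages it refers to, so you can check which of them are missing from your R installation first. The file name in the last comment is hypothetical; use whatever name you saved the gist under.

```r
# Packages the script loads or calls
pkgs <- c("googleAnalyticsR", "xml2", "dplyr", "urltools")

# TRUE for each package that can be loaded on this machine
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]
print(not_installed)

# Install anything missing, then run the saved script:
# install.packages(not_installed)
# source("sitemap_check.R")  # hypothetical file name
```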
