@nwstephens
Created August 19, 2016 01:02
When I load data into Spark with sparklyr, I get an error that I don't get when I read the same file with readr directly.
library(sparklyr)
library(dplyr)
# Install spark and hadoop dependencies
spark_install(version="2.0.0-preview", hadoop_version="2.7")
# Download file
fileIn <- "https://nycopendata.socrata.com/api/views/h9gi-nx95/rows.csv?accessType=DOWNLOAD"
fileOut <- "NYPD_Motor_Vehicle_Collisions_RAW.csv"
download.file(fileIn, fileOut)
# Create the cluster connection and load data
sc <- spark_connect(master = "local",
                    version = "2.0.0-preview",
                    hadoop_version = "2.7")
# Load RAW data into Spark
nypd_raw <- spark_read_csv(sc, "nypd_raw", "NYPD_Motor_Vehicle_Collisions_RAW.csv", overwrite=TRUE)
nypd_raw # !!! Error: Variables must be length 1 or 10. !!!
# Remove problematic columns
dat <- readr::read_csv("NYPD_Motor_Vehicle_Collisions_RAW.csv")
col_names_remove <- c('OFF_STREET_NAME',
                      'CONTRIBUTING_FACTOR_VEHICLE_3',
                      'CONTRIBUTING_FACTOR_VEHICLE_4',
                      'CONTRIBUTING_FACTOR_VEHICLE_5',
                      'VEHICLE_TYPE_CODE_3',
                      'VEHICLE_TYPE_CODE_4',
                      'VEHICLE_TYPE_CODE_5')
col_ind_remove <- match(col_names_remove, gsub(" ", "_", names(dat)))
readr::write_csv(select(dat, -col_ind_remove), "NYPD_Motor_Vehicle_Collisions.csv")
# Load modified data into Spark
nypd <- spark_read_csv(sc, "nypd", "NYPD_Motor_Vehicle_Collisions.csv", overwrite=TRUE)
nypd
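# Optional diagnostic (a minimal sketch, not part of the original workaround):
# count missing values per column in the raw file to see how sparse each column is.
# Assumption: sparsity may be related to the error; the cause is not confirmed here.
# `dat_raw` and `na_counts` are illustrative names introduced for this check only.
dat_raw <- readr::read_csv("NYPD_Motor_Vehicle_Collisions_RAW.csv")
na_counts <- colSums(is.na(dat_raw))
sort(na_counts, decreasing = TRUE)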