Skip to content

Instantly share code, notes, and snippets.

@jmclawson
Last active July 8, 2023 18:59
Show Gist options
  • Save jmclawson/864b37c40465495abc597780d536e1b9 to your computer and use it in GitHub Desktop.
Save jmclawson/864b37c40465495abc597780d536e1b9 to your computer and use it in GitHub Desktop.
---
title: "Special cases of data importing in R"
---
The typical methods of importing data make it straightforward to import a single CSV file. But data will often be prepared in some other format. Some common scenarios include a folder of smaller CSV files or data prepared for use with SAS.
## Importing many CSV files
It is common to read in multiple CSV files and combine their data frames. Beyond a certain number, the process should be automated.
### Prepare sample data
For the purposes of this section, the first step is to prepare a sample data set to make this document replicable anywhere. This code divides data from the `nycflights13` package into a folder of 16 smaller CSV files to be read.
```{r}
library(dplyr)
library(readr)
if(!dir.exists("carriers")){dir.create("carriers")}
nycflights13::flights %>%
group_by(carrier) %>%
group_walk(~ write_csv(.x, file = paste0("carriers/", .y$carrier, ".csv")))
```
Each of the CSV files in the "carriers" directory now has a subset of the full data, minus its carrier name. These files can be imported all at once to recreate the original data set.
### Identify the files for importing
To read a list of files, it's necessary first to have a list. The `list.files()` function makes it easy to find all the files in a folder that match any given pattern, specified with regular expressions.
```{r}
all_filenames <- list.files(path = "carriers", pattern = "*.csv", full.names = TRUE)
```
### Import all the files
The `map()` function from `purrr` makes it easy to use `read_csv()` on every item in the list of file names. Especially when reading a large number of files, the `read_csv()` function can be noisy, but it can be quietened by setting `show_col_types` to `FALSE`. The resulting object is a list of data frames:
```{r}
library(purrr)
library(readr)
imported_data <- map(all_filenames, read_csv, show_col_types = FALSE)
class(imported_data)
```
Use `bind_rows()` to combine all of these data frames into a single large data frame.
```{r}
combined_data <- imported_data |>
bind_rows()
class(combined_data)
dim(combined_data)
```
Because the column was encoded into the file names of the CSVs, the table lacks the distinguishing "carrier" information. It can be added by setting names with `set_names()` before appropriating these names for a named id column in `bind_rows()`.
```{r}
library(stringr)
carrier_names <- all_filenames |>
str_remove_all("carriers/") |>
str_remove_all("[.]csv")
combined_data <- imported_data |>
set_names(carrier_names) |>
bind_rows(.id = "carrier")
dim(combined_data)
```
Relocating the "carrier" column and rearranging the rows are all that's necessary to get back to something like the original.
```{r}
combined_data <-
combined_data |>
relocate(carrier, .before = flight) |>
arrange(year, month, day, dep_time)
```
If the process is used regularly, it might be wise to create a function.
```{r}
combine_csvs <- function(dir, id = ".id"){
full_paths <- list.files(dir,
pattern = "*.csv",
full.names = TRUE)
simple_names <- full_paths |>
stringr::str_remove_all(paste0(dir,"[/]")) |>
stringr::str_remove_all("[.]csv")
full_paths |>
purrr::map(readr::read_csv, show_col_types = FALSE) |>
purrr::set_names(simple_names) |>
dplyr::bind_rows(.id = {{id}})
}
```
## Importing SAS data with `haven`
Reading other data types such as data from SAS is easily done with the `haven` package.
### Prepare sample data
For the purposes of this section, the first step is to download a sample data set once so that it can later be imported.
```{r}
if(!file.exists("medical.sas7bdat")) {
download.file("http://www.principlesofeconometrics.com/sas/medical.sas7bdat",
destfile = "medical.sas7bdat")
}
```
### Load the package and read the data with read_sas()
The `read_sas()` function is straightforward.
```{r}
library(haven)
medical <- read_sas("medical.sas7bdat")
class(medical)
attributes(medical)$label
```
### Access column labels
Data imported from SAS might have labels attached to columns. These are shown in RStudio's data viewer, but they can also be accessed with code using the `attr()` function to get the "label" attribute.
```{r}
attr(medical$ID, "label")
```
To see the attributes for multiple columns at once, either use the `purrr` package's `map()` function, or write a simple loop.
```{r}
for(i in 1:ncol(medical)){
this_column <- paste(colnames(medical)[i],
attr(medical[[i]], "label"))
message(this_column)
}
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment