Last active
July 8, 2023 18:59
-
-
Save jmclawson/864b37c40465495abc597780d536e1b9 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Special cases of data importing in R" | |
--- | |
The typical methods of importing data make it straightforward to import a single CSV file. But data will often be prepared in some other format. Some common scenarios include a folder of smaller CSV files or data prepared for use with SAS. | |
## Importing many CSV files | |
It is common to read in multiple CSV files and combine their data frames. Beyond a certain number, the process should be automated. | |
### Prepare sample data | |
For the purposes of this section, the first step is to prepare a sample data set to make this document replicable anywhere. This code divides data from the `nycflights13` package into a folder of 16 smaller CSV files to be read. | |
```{r} | |
library(dplyr) | |
library(readr) | |
if(!dir.exists("carriers")){dir.create("carriers")} | |
nycflights13::flights %>% | |
group_by(carrier) %>% | |
group_walk(~ write_csv(.x, file = paste0("carriers/", .y$carrier, ".csv"))) | |
``` | |
Each of the CSV files in the "carriers" directory now has a subset of the full data, minus its carrier name. These files can be imported all at once to recreate the original data set. | |
### Identify the files for importing | |
To read a list of files, it's necessary first to have a list. The `list.files()` function makes it easy to find all the files in a folder that match any given pattern, specified with regular expressions. | |
```{r} | |
all_filenames <- list.files(path = "carriers", pattern = "*.csv", full.names = TRUE) | |
``` | |
### Import all the files | |
The `map()` function from `purrr` makes it easy to use `read_csv()` on every item in the list of file names. Especially when reading a large number of files, the `read_csv()` function can be noisy, but it can be quietened by setting `show_col_types` to `FALSE`. The resulting object is a list of data frames: | |
```{r} | |
library(purrr) | |
library(readr) | |
imported_data <- map(all_filenames, read_csv, show_col_types = FALSE) | |
class(imported_data) | |
``` | |
Use `bind_rows()` to combine all of these data frames into a single large data frame. | |
```{r} | |
combined_data <- imported_data |> | |
bind_rows() | |
class(combined_data) | |
dim(combined_data) | |
``` | |
Because the column was encoded into the file names of the CSVs, the table lacks the distinguishing "carrier" information. It can be added by setting names with `set_names()` before appropriating these names for a named id column in `bind_rows()`. | |
```{r} | |
library(stringr) | |
carrier_names <- all_filenames |> | |
str_remove_all("carriers/") |> | |
str_remove_all("[.]csv") | |
combined_data <- imported_data |> | |
set_names(carrier_names) |> | |
bind_rows(.id = "carrier") | |
dim(combined_data) | |
``` | |
Relocating the "carrier" column and rearranging the rows are all that's necessary to get back to something like the original. | |
```{r} | |
combined_data <- | |
combined_data |> | |
relocate(carrier, .before = flight) |> | |
arrange(year, month, day, dep_time) | |
``` | |
If the process is used regularly, it might be wise to create a function. | |
```{r} | |
combine_csvs <- function(dir, id = ".id"){ | |
full_paths <- list.files(dir, | |
pattern = "*.csv", | |
full.names = TRUE) | |
simple_names <- full_paths |> | |
stringr::str_remove_all(paste0(dir,"[/]")) |> | |
stringr::str_remove_all("[.]csv") | |
full_paths |> | |
purrr::map(readr::read_csv, show_col_types = FALSE) |> | |
purrr::set_names(simple_names) |> | |
dplyr::bind_rows(.id = {{id}}) | |
} | |
``` | |
## Importing SAS data with `haven` | |
Reading other data types such as data from SAS is easily done with the `haven` package. | |
### Prepare sample data | |
For the purposes of this section, the first step is to download a sample data set once so that it can later be imported. | |
```{r} | |
if(!file.exists("medical.sas7bdat")) { | |
download.file("http://www.principlesofeconometrics.com/sas/medical.sas7bdat", | |
destfile = "medical.sas7bdat") | |
} | |
``` | |
### Load the package and read the data with read_sas() | |
The `read_sas()` function is straightforward. | |
```{r} | |
library(haven) | |
medical <- read_sas("medical.sas7bdat") | |
class(medical) | |
attributes(medical)$label | |
``` | |
### Access column labels | |
Data imported from SAS might have labels attached to columns. These are shown in RStudio's data viewer, but they can also be accessed with code using the `attr()` function to get the "label" attribute. | |
```{r} | |
attr(medical$ID, "label") | |
``` | |
To see the attributes for multiple columns at once, either use the `purrr` package's `map()` function, or write a simple loop. | |
```{r} | |
for(i in 1:ncol(medical)){ | |
this_column <- paste(colnames(medical)[i], | |
attr(medical[[i]], "label")) | |
message(this_column) | |
} | |
``` |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment