@fomightez
Last active September 27, 2020 18:42
A few references for column and row conventions in data tables

A few references for column and row tidy data conventions in data tables and spreadsheets

from the abstract of Data organization in spreadsheets by Broman and Woo (2017), The American Statistician:

"Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization. Focusing on the data entry and storage aspects, this paper offers practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, don't leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, don't include calculations in the raw data files, don't use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plain text file."
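A few of these principles (YYYY-MM-DD dates, no empty cells, a single rectangle with one header row) can be checked mechanically. The following is only a minimal sketch using the Python standard library; the sample data, the column names, and the `check_rectangle` helper are made up for illustration and do not come from the paper.

```python
import csv
import io
import re
from datetime import date

# Hypothetical spreadsheet export; column names are illustrative only.
SAMPLE = """subject_id,visit_date,weight_kg
s01,2017-03-01,61.2
s02,03/02/2017,58.9
s03,2017-03-03,
"""

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value):
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def check_rectangle(text, date_columns=("visit_date",)):
    """Return (line_number, column, problem) tuples for a CSV string."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        line = reader.line_num  # physical line of the row just read
        for column, value in row.items():
            if column is None:
                # DictReader files surplus cells under the key None.
                problems.append((line, None, "more cells than the header"))
            elif value is None or value.strip() == "":
                problems.append((line, column, "empty cell"))
            elif column in date_columns and not is_iso_date(value):
                problems.append((line, column, "date is not YYYY-MM-DD"))
    return problems


problems = check_rectangle(SAMPLE)
for problem in problems:
    print(problem)
```

With the sample above, the check flags the slash-formatted date on line 3 and the empty weight cell on line 4.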

From page 257 of Practical Computing for Biologists by Haddock and Dunn:

"In general, all data in a column should hold the same type of values and each row should hold information that corresponds to a particular measurement."

from page 3 of Good Enough Practices in Scientific Computing:

"3. Create analysis-friendly data (1c): Analysis can be much easier if you are working with so-called “tidy” data [5]. Two key principles are: Make each column a variable: Don’t cram two variables into one, e.g., “male treated” should be split into separate variables for sex and treatment status. Store units in their own variable or in metadata, e.g., “3.4” instead of “3.4kg”. Make each row an observation: Data often comes in a wide format, because that facilitated data entry or human inspection. Imagine one row per field site and then columns for measurements made at each of several time points. Be prepared to gather such columns into a variable of measurements, plus a new variable for time point. Fig 1 presents an example of such a transformation."
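The gathering step described in that passage can be sketched in plain Python. This is only an illustration under assumptions: the wide table, its column names (`subject`, `group`, `t1`, `t2`), and the `tidy` helper are hypothetical, and the unit is assumed to be a fixed suffix on every measurement.

```python
# Hypothetical wide table: one row per subject, one column per time point,
# with two variables crammed into "group" and units stuck to the values.
wide_rows = [
    {"subject": "s01", "group": "male treated", "t1": "3.4kg", "t2": "3.9kg"},
    {"subject": "s02", "group": "female control", "t1": "2.8kg", "t2": "3.1kg"},
]


def tidy(rows, id_columns=("subject", "group"), unit="kg"):
    """Gather time-point columns into rows, splitting group and unit."""
    long_rows = []
    for row in rows:
        # Split the combined variable into sex and treatment status.
        sex, treatment = row["group"].split(" ", 1)
        for column, value in row.items():
            if column in id_columns:
                continue
            # Strip the unit suffix so the measurement is a plain number.
            number = value[: -len(unit)] if value.endswith(unit) else value
            long_rows.append({
                "subject": row["subject"],
                "sex": sex,
                "treatment": treatment,
                "time_point": column,
                "mass": float(number),
                "unit": unit,
            })
    return long_rows


long_rows = tidy(wide_rows)
for row in long_rows:
    print(row)
```

Each output row is one observation (one subject at one time point), and each key is one variable.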

from Tidy data by Hadley Wickham:

"Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
This is Codd’s 3rd normal form, but with the constraints framed in statistical language, and the focus put on a single dataset rather than the many connected datasets common in relational databases. Messy data is any other arrangement of the data.
Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset."
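As a rough illustration of the third rule (each type of observational unit forms a table), here is a small sketch that splits one messy table into a table of sites and a table of measurements. The data, the column names, and the `split_by_unit` helper are hypothetical and are not taken from the paper.

```python
# Hypothetical messy table: facts about the site (latitude) are repeated
# on every measurement row, mixing two observational units.
messy = [
    {"site": "A", "latitude": 36.6, "date": "2017-03-01", "temp_c": 11.5},
    {"site": "A", "latitude": 36.6, "date": "2017-03-02", "temp_c": 12.1},
    {"site": "B", "latitude": 41.5, "date": "2017-03-01", "temp_c": 7.9},
]


def split_by_unit(rows):
    """Return (sites, measurements): one table per observational unit."""
    sites = {}
    measurements = []
    for row in rows:
        # One row per site, keyed by site name so repeats collapse.
        sites[row["site"]] = {"site": row["site"], "latitude": row["latitude"]}
        # One row per measurement, linked back to its site by name.
        measurements.append(
            {"site": row["site"], "date": row["date"], "temp_c": row["temp_c"]}
        )
    return list(sites.values()), measurements


sites, measurements = split_by_unit(messy)

# With one variable per column, extracting a variable is a single pass.
temps = [m["temp_c"] for m in measurements]
print(sites)
print(temps)
```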

from tidyr package description:

"The goal of tidyr is to help you create tidy data. Tidy data is data where:
Each variable is in a column.
Each observation is a row.
Each value is a cell.
Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis."

from https://www.datacamp.com/community/tutorials/15-easy-solutions-data-frame-problems-r#gs.oWtYIpo:

"With the data frame, R offers you a great first step by allowing you to store your data in overviewable, rectangular grids. Each row of these grids corresponds to measurements or values of an instance, while each column is a vector containing data for a specific variable."

from https://quizlet.com/147917387/data-science-flash-cards/:

"Most commonly... the table represents a particular variable, and each row corresponds to a given member of the data set in question. The data set lists values for each of the variables, such as height and weight of an object, for each member of the data set. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows."

from https://www.mathworks.com/help/matlab/matlab_prog/advantages-of-using-tables.html?s_tid=gn_loc_drop:

"For example, you can use a table to store experimental data, with rows representing different observations and columns representing different measured variables."

I've seen Data Organization in Spreadsheets (2018) by Broman and Woo recommended, and it says in the abstract:

"The basic principles are: ... put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row) ..."

I've seen Data Organization in Spreadsheets (2018) by Broman and Woo recommended recently, and it says under section #7, Make it a Rectangle:

"The best layout for your data within a spreadsheet is as a single big rectangle with rows corresponding to subjects and columns corresponding to variables."
