@neilpanchal
Created June 24, 2015 08:07
# Function Reference Guide
## DataFrames
#### `DataFrame(cols::Vector, colnames::Vector{ByteString})`
Construct a DataFrame from the columns given by `cols` with the index
generated by `colnames`. A DataFrame inherits from
`Associative{Any,Any}`, so Associative operations should work. Columns
are vector-like objects. Normally these are AbstractDataVectors (DataVectors
or PooledDataVectors), but they can also (currently) include standard
Julia Vectors.
#### `DataFrame(cols::Vector)`
Construct a DataFrame from the columns given by `cols` with default
column names.
#### `DataFrame()`
An empty DataFrame.
#### `copy(df::DataFrame)`
A shallow copy of `df`. Columns are referenced, not copied.
#### `deepcopy(df::DataFrame)`
A deep copy of `df`. Copies of each column are made.
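
The difference between the two copies can be sketched in base Julia. This is only a model: a plain `Dict` of column vectors stands in for a DataFrame (an assumption, chosen because a DataFrame behaves like an Associative), and the names `cols`, `shallow`, and `deep` are hypothetical.

```julia
# A Dict of column vectors stands in for a DataFrame (assumption).
cols = Dict("a" => [1, 2, 3], "b" => [4, 5, 6])

shallow = copy(cols)      # new container, same column vectors
deep = deepcopy(cols)     # new container, new column vectors

cols["a"][1] = 100        # mutate a column in place

println(shallow["a"][1])  # the shallow copy shares the column, so it sees the change
println(deep["a"][1])     # the deep copy keeps the old value
```

If the package follows the same semantics, in-place edits to a column would be visible through a `copy` but not through a `deepcopy`.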
#### `similar(df::DataFrame, nrow)`
A new DataFrame with `nrow` rows and the same column names and types as `df`.
### Basics
#### `size(df)`, `ndims(df)`
Same meanings as for Arrays.
#### `has(df, key)`, `get(df, key, default)`, `keys(df)`, and `values(df)`
Same meanings as Associative operations. `keys` are column names;
`values` are column contents.
#### `start(df)`, `done(df,i)`, and `next(df,i)`
Methods to iterate over columns.
#### `ncol(df::AbstractDataFrame)`
Number of columns in `df`.
#### `nrow(df::AbstractDataFrame)`
Number of rows in `df`.
#### `length(df::AbstractDataFrame)`
Number of columns in `df`.
#### `isempty(df::AbstractDataFrame)`
Whether the number of columns equals zero.
#### `head(df::AbstractDataFrame)` and `head(df::AbstractDataFrame, i::Int)`
First `i` rows of `df`. Defaults to 6.
#### `tail(df::AbstractDataFrame)` and `tail(df::AbstractDataFrame, i::Int)`
Last `i` rows of `df`. Defaults to 6.
#### `show(io, df::AbstractDataFrame)`
Standard pretty-printer of `df`. Called by `print()` and the REPL.
#### `dump(df::AbstractDataFrame)`
Show the structure of `df`. Like R's `str`.
#### `describe(df::AbstractDataFrame)`
Show a description of each column of `df`.
#### `complete_cases(df::AbstractDataFrame)`
A `Vector{Bool}` that is `true` for each complete case in `df` (a row with no
NAs).
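
As a rough illustration of what such a mask contains, here is a base Julia sketch. It assumes `missing` as a stand-in for `NA` and plain vectors as columns; it is not the package's implementation.

```julia
# `missing` stands in for NA; columns are plain vectors (assumptions).
a = [1, missing, 3, 4]
b = ["x", "y", missing, "w"]
columns = (a, b)

# A row is complete when no column holds a missing value at that position.
complete = [all(col -> !ismissing(col[i]), columns) for i in eachindex(a)]

println(complete)
```

The resulting mask can then be used to select rows, in the same way the real `Vector{Bool}` would index a DataFrame along rows.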
#### `duplicated(df::AbstractDataFrame)`
A `Vector{Bool}` that is `true` for each row that duplicates a prior
row.
#### `unique(df::AbstractDataFrame)`
DataFrame with unique rows in `df`.
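
The relationship between `duplicated` and `unique` can be sketched in base Julia. Rows are modelled as tuples here, which is an assumption made for brevity; the names `rows`, `dup`, and `unique_rows` are hypothetical.

```julia
# Rows modelled as tuples (assumption); the package stores columns instead.
rows = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]

seen = Set{eltype(rows)}()
dup = Bool[]
for r in rows
    push!(dup, r in seen)   # true if an identical row appeared earlier
    push!(seen, r)
end

# Keeping the rows that are not flagged yields the unique rows.
unique_rows = rows[.!dup]

println(dup)
println(unique_rows)
```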
### Indexing, Assignment, and Concatenation
DataFrames are indexed like a Matrix and like an Associative. Columns
may be indexed by column name. Rows do not have names. Referencing
with one argument normally indexes by columns: `df["col"]`,
`df[["col1","col3"]]` or `df[i]`. With two arguments, rows and columns
are selected. Indexing along rows works like Matrix indexing. Indexing
along columns works like Matrix indexing with the addition of column
name access.
#### `getindex(df::DataFrame, ind)` or `df[ind]`
Returns a subset of the columns of `df` as specified by `ind`, which
may be an `Int`, a `Range`, a `Vector{Int}`, `ByteString`, or
`Vector{ByteString}`. Columns are referenced, not copied. For a
single-element `ind`, the column by itself is returned.
#### `getindex(df::DataFrame, irow, icol)` or `df[irow,icol]`
Returns a subset of `df` as specified by `irow` and `icol`. `irow` may
be an `Int`, a `Range`, or a `Vector{Int}`. `icol` may be an `Int`, a
`Range`, a `Vector{Int}`, a `ByteString`, or a
`Vector{ByteString}`. For a single-element `icol`, the column subset by
itself is returned.
#### `index(df::DataFrame)`
Returns the column `Index` for `df`.
#### `set_group(df::DataFrame, newgroup, names::Vector{ByteString})`
#### `get_groups(df::DataFrame)`
#### `set_groups(df::DataFrame, gr::Dict)`
See the Index section for these operations on column indexes.
#### `colnames(df::DataFrame)` or `names(df::DataFrame)`
The column names as an `Array{ByteString}`.
#### `setindex!(df::DataFrame, newcol, colname)` or `df[colname] = newcol`
Replace or add a new column with name `colname` and contents `newcol`.
Arrays are converted to DataVectors. Values are recycled to match the
number of rows in `df`.
#### `insert!(df::DataFrame, index::Integer, item, name)`
Insert a column of name `name` and with contents `item` into `df` at
position `index`.
#### `insert!(df::DataFrame, df2::DataFrame)`
Insert columns of `df2` into `df`.
#### `del!(df::DataFrame, cols)`
Delete columns in `df` at positions given by `cols`, which may be given
in any form that can reference columns.
#### `del(df::DataFrame, cols)`
Nondestructive version. Return a DataFrame based on the columns in
`df` after deleting columns specified by `cols`.
#### `deleterows!(df::DataFrame, inds)`
Delete rows at positions specified by `inds` from the given DataFrame.
#### `cbind(df1, df2, ...)` or `hcat(df1, df2, ...)` or `[df1 df2 ...]`
Concatenate columns. Duplicated column names are adjusted.
#### `rbind(df1, df2, ...)` or `vcat(df1, df2, ...)` or `[df1, df2, ...]`
Concatenate rows.
### I/O
#### `csvDataFrame(filename, o::Options)`
Return a DataFrame from file `filename`. Options `o` include
`colnames` (`"true"`, `"false"`, or `"check"` (the default)) and
`poolstrings` (`"check"` (default) or `"never"`).
### Expression/Function Evaluation in a DataFrame
#### `with(df::AbstractDataFrame, ex::Expr)`
Evaluate expression `ex` with the columns in `df`.
#### `within(df::AbstractDataFrame, ex::Expr)`
Return a copy of `df` after evaluating expression `ex` with the
columns in `df`.
#### `within!(df::AbstractDataFrame, ex::Expr)`
Modify `df` by evaluating expression `ex` with the columns in `df`.
#### `based_on(df::AbstractDataFrame, ex::Expr)`
Return a new DataFrame based on evaluating expression `ex` with the
columns in `df`. Often used for summarizing operations.
#### `colwise(f::Function, df::AbstractDataFrame)`
#### `colwise(f::Vector{Function}, df::AbstractDataFrame)`
Apply `f` to each column of `df`, and return the results as an
Array{Any}.
#### `colwise(df::AbstractDataFrame, s::Symbol)`
#### `colwise(df::AbstractDataFrame, s::Vector{Symbol})`
Apply the function specified by Symbol `s` to each column of `df`, and
return the results as a DataFrame.
### SubDataFrames
#### `sub(df::DataFrame, r, c)`
#### `sub(df::DataFrame, r)`
Return a SubDataFrame with references to rows and columns of `df`.
#### `sub(sd::SubDataFrame, r, c)`
#### `sub(sd::SubDataFrame, r)`
Return a SubDataFrame with references to rows and columns of `sd`.
#### `getindex(sd::SubDataFrame, r, c)` or `sd[r,c]`
#### `getindex(sd::SubDataFrame, c)` or `sd[c]`
Referencing should work the same as DataFrames.
### Grouping
#### `groupby(df::AbstractDataFrame, cols)`
Return a GroupedDataFrame based on unique groupings indicated by the
columns with one or more names given in `cols`.
#### `start(gd)`, `done(gd,i)`, and `next(gd,i)`
Methods to iterate over GroupedDataFrame groupings.
#### `getindex(gd::GroupedDataFrame, idx)` or `gd[idx]`
Reference a particular grouping. Referencing returns a SubDataFrame.
#### `with(gd::GroupedDataFrame, ex::Expr)`
Evaluate expression `ex` with the columns in `gd` in each grouping.
#### `within(gd::GroupedDataFrame, ex::Expr)`
#### `within!(gd::GroupedDataFrame, ex::Expr)`
Return a DataFrame with the results of evaluating expression `ex` with
the columns in `gd` in each grouping.
#### `based_on(gd::GroupedDataFrame, ex::Expr)`
Sweeps along groups and applies `based_on` to each group. Returns a
DataFrame.
#### `map(f::Function, gd::GroupedDataFrame)`
Apply `f` to each grouping of `gd` and return the results in an Array.
#### `colwise(f::Function, gd::GroupedDataFrame)`
#### `colwise(f::Vector{Function}, gd::GroupedDataFrame)`
Apply `f` to each column in each grouping of `gd`, and return the
results as an Array{Any}.
#### `colwise(gd::GroupedDataFrame, s::Symbol)`
#### `colwise(gd::GroupedDataFrame, s::Vector{Symbol})`
Apply the function specified by Symbol `s` to each column in each
grouping of `gd`, and return the results as a DataFrame.
#### `by(df::AbstractDataFrame, cols, s::Symbol)` or `groupby(df, cols) |> s`
#### `by(df::AbstractDataFrame, cols, s::Vector{Symbol})`
Return a DataFrame with the results of grouping on `cols` and
`colwise` evaluation based on `s`. Equivalent to `colwise(groupby(df,
cols), s)`.
#### `by(df::AbstractDataFrame, cols, e::Expr)` or `groupby(df, cols) |> e`
Return a DataFrame with the results of grouping on `cols` and
evaluation of `e` in each grouping. Equivalent to `based_on(groupby(df,
cols), e)`.
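
The split, apply, and combine steps behind `groupby` and `by` can be sketched in base Julia. Plain vectors stand in for DataFrame columns (an assumption), and `key`, `value`, `groups`, and `sums` are hypothetical names; the package's actual grouping code may differ.

```julia
# Plain vectors stand in for DataFrame columns (assumption).
key = ["a", "b", "a", "b", "a"]
value = [1.0, 2.0, 3.0, 4.0, 5.0]

# Split: collect the row positions that belong to each unique key.
groups = Dict{String,Vector{Int}}()
for (i, k) in enumerate(key)
    push!(get!(groups, k, Int[]), i)
end

# Apply and combine: one summary value per grouping.
sums = Dict(k => sum(value[idx]) for (k, idx) in groups)

println(sums)
```

Each entry of `groups` plays the role of one grouping (a SubDataFrame in the package), and `sums` plays the role of the combined result.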
### Reshaping / Merge
#### `stack(df::DataFrame, cols)`
For conversion from wide to long format. Returns a DataFrame with
stacked columns indicated by `cols`. The result has column `"key"`
with column names from `df` and column `"value"` with the values from
`df`. Columns in `df` not included in `cols` are duplicated along the
stack.
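
A wide-to-long conversion of this kind can be sketched in base Julia. A NamedTuple of vectors stands in for the DataFrame (an assumption), and the row order produced by the real `stack` may differ from this sketch.

```julia
# A NamedTuple of columns stands in for a wide DataFrame (assumption).
wide = (id = [1, 2], x = [10, 20], y = [30, 40])
stack_cols = (:x, :y)

key = String[]
value = Int[]
id = Int[]
for c in stack_cols
    col = getproperty(wide, c)
    append!(key, fill(String(c), length(col)))  # column name repeated per row
    append!(value, col)                         # the stacked values
    append!(id, wide.id)                        # non-stacked column is duplicated
end

println(key)
println(value)
println(id)
```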
#### `unstack(df::DataFrame, ikey, ivalue, irefkey)`
For conversion from long to wide format. Returns a DataFrame. `ikey`
indicates the key column; unique values in column `ikey` will be
column names in the result. `ivalue` indicates the value column.
`irefkey` is the column with a unique identifier for each row of the
result. Columns not given by `ikey`, `ivalue`, or `irefkey` are currently ignored.
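
The long-to-wide direction can be sketched in base Julia as well. The vectors `key`, `value`, and `refkey` play the roles of the `ikey`, `ivalue`, and `irefkey` columns; this is an assumed model in which every key and reference pair occurs exactly once, not the package's algorithm.

```julia
# Long-format columns as plain vectors (assumption).
key = ["x", "x", "y", "y"]
value = [10, 20, 30, 40]
refkey = [1, 2, 1, 2]

ids = unique(refkey)              # one output row per reference key
wide = Dict{String,Vector{Int}}()
for k in unique(key)              # one output column per unique key
    wide[k] = [value[findfirst(i -> key[i] == k && refkey[i] == r, eachindex(key))]
               for r in ids]
end

println(ids)
println(wide)
```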
#### `merge(df1::DataFrame, df2::DataFrame, bycol)`
#### `merge(df1::DataFrame, df2::DataFrame, bycol, jointype)`
Return the database join of `df1` and `df2` based on the column `bycol`.
Currently only a single merge key is supported. Supports `jointype` of
"inner" (the default), "left", "right", or "outer".
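
The difference between the "inner" and "left" join types can be sketched in base Julia. Rows are modelled as NamedTuples and `missing` stands in for `NA` (both assumptions); the left join below also assumes that keys in `right` are unique.

```julia
# Rows modelled as NamedTuples (assumption).
left = [(id = 1, x = "a"), (id = 2, x = "b"), (id = 3, x = "c")]
right = [(id = 2, y = 20.0), (id = 3, y = 30.0), (id = 4, y = 40.0)]

# Inner join: keep only the keys present on both sides.
inner = [(id = l.id, x = l.x, y = r.y) for l in left for r in right if l.id == r.id]

# Left join: keep every left row; unmatched rows get a missing value.
left_join = map(left) do l
    i = findfirst(r -> r.id == l.id, right)
    (id = l.id, x = l.x, y = i === nothing ? missing : right[i].y)
end

println(inner)
println(left_join)
```

A "right" join mirrors the left join with the roles swapped, and an "outer" join keeps unmatched rows from both sides.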
## Index
#### `Index()`
#### `Index(s::Vector{ByteString})`
An Index with names `s`. An Index is like an Associative type. An
Index is used for column indexing of DataFrames. An Index maps a
`ByteString` or `Vector{ByteString}` to column positions.
#### `length(x::Index)`, `copy(x::Index)`, `has(x::Index, key)`, `keys(x::Index)`, `push!(x::Index, name)`
Normal meanings.
#### `del(x::Index, idx::Integer)`, `del(x::Index, s::ByteString)`
Delete the name `s` or name at position `idx` in `x`.
#### `names(x::Index)`
A Vector{ByteString} with the names of `x`.
#### `names!(x::Index, nm::Vector{ByteString})`
Set names `nm` in `x`.
#### `rename(x::Index, f::Function)`
#### `rename(x::Index, nd::Associative)`
#### `rename(x::Index, from::Vector, to::Vector)`
Replace names in `x`, by applying function `f` to each name,
by mapping old to new names with a dictionary (Associative), or using
`from` and `to` vectors.
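
The three renaming styles can be sketched in base Julia on a plain vector of names. Leaving unmapped names unchanged is an assumption about the package's behavior, and `colnames`, `mapping`, `from`, and `to` are hypothetical values.

```julia
colnames = ["a", "b", "c"]

# 1. Apply a function to every name.
by_function = map(uppercase, colnames)

# 2. Map old names to new names with a dictionary; unmapped names stay as-is.
mapping = Dict("a" => "alpha")
by_dict = [get(mapping, n, n) for n in colnames]

# 3. Use parallel `from` and `to` vectors.
from, to = ["b", "c"], ["beta", "gamma"]
pairs_map = Dict(zip(from, to))
by_vectors = [get(pairs_map, n, n) for n in colnames]

println(by_function)
println(by_dict)
println(by_vectors)
```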
#### `getindex(x::Index, idx)` or `x[idx]`
This does the mapping from name(s) to Indices (positions). `idx` may
be ByteString, Vector{ByteString}, Int, Vector{Int}, Range{Int},
Vector{Bool}, AbstractDataVector{Bool}, or AbstractDataVector{Int}.
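
The name-to-position mapping can be modelled in base Julia with a dictionary. This is a sketch of the idea only; `lookup` is a hypothetical name and the real `Index` type may store its data differently.

```julia
colnames = ["a", "b", "c"]

# Name -> position lookup, the core of an Index-like mapping.
lookup = Dict(n => i for (i, n) in enumerate(colnames))

single = lookup["b"]                        # one name gives one position
several = [lookup[n] for n in ["c", "a"]]   # a vector of names gives positions
from_mask = findall([true, false, true])    # a Bool mask gives positions of `true`

println(single)
println(several)
println(from_mask)
```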
#### `set_group(idx::Index, newgroup, names::Vector{ByteString})`
Add a group to `idx` with name `newgroup` that includes the names in
the vector `names`.
#### `get_groups(idx::Index)`
A Dict that maps the name of each group to the names in the group.
#### `set_groups(idx::Index, gr::Dict)`
Set groups in `idx` based on the mapping given by `gr`.
## Missing Values
Missing value behavior is implemented by subtypes of the `AbstractDataVector`
abstract type.
#### `NA`
A constant indicating a missing value.
#### `isna(x)`
Return a `Bool` or `Array{Bool}` (if `x` is an `AbstractDataVector`)
that is `true` for elements with missing values.
#### `nafilter(x)`
Return a copy of `x` after removing missing values.
#### `nareplace(x, val)`
Return a copy of `x` after replacing missing values with `val`.
#### `naFilter(x)`
Return an object based on `x` such that future operations like `mean`
will not include missing values. This can be an iterator or other
object.
#### `naReplace(x, val)`
Return an object based on `x` such that future operations like `mean`
will replace NAs with `val`.
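
The distinction between the copying functions and the deferred ones can be sketched in base Julia. Here `missing` stands in for `NA`, and `skipmissing` and `coalesce` stand in for the package functions; these are analogies, not the package's own API.

```julia
# `missing` stands in for NA (assumption).
x = [1.0, missing, 3.0]

filtered = collect(skipmissing(x))   # a copy without missing values, like nafilter
replaced = coalesce.(x, 0.0)         # a copy with missing values replaced, like nareplace

lazy = skipmissing(x)                # an iterator, in the spirit of naFilter
total = sum(lazy)                    # later operations skip the missing values

println(filtered)
println(replaced)
println(total)
```

The lazy form avoids allocating a new vector, which is the likely motivation for offering both variants.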
#### `na(x)`
Return an `NA` value appropriate for the type of `x`.
#### `nas(x, dim)`
Return an object like `x` filled with `NA` values with size `dim`.
## DataVectors
#### `DataArray(x::Vector)`
#### `DataArray(x::Vector, m::Vector{Bool})`
Create a DataVector from `x`, with `m` optionally indicating which values
are NA. DataVectors are like Julia Vectors with support for NAs. `x` may
be any type of Vector.
#### `PooledDataArray(x::Vector)`
#### `PooledDataArray(x::Vector, m::Vector{Bool})`
Create a PooledDataVector from `x`, with `m` optionally indicating which
values are NA. PooledDataVectors contain a pool of values with references
to those values. This is useful in a similar manner to an R array of
factors.
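
The pool-plus-references layout can be sketched in base Julia. The names `pool` and `refs` are hypothetical, and the ordering of the pool (first appearance here) is an assumption; the package may order it differently.

```julia
x = ["lo", "hi", "lo", "mid", "hi"]

pool = unique(x)                             # each distinct value stored once
refs = [findfirst(==(v), pool) for v in x]   # each element is a position in the pool

# The original vector is recovered by indexing the pool with the references.
restored = pool[refs]

println(pool)
println(refs)
```

Storing small integer references instead of repeated values is what makes this layout compact for columns with few distinct values.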
#### `size`, `length`, `ndims`, `ref`, `assign`, `start`, `next`, `done`
All normal Vector operations including array referencing should work.
#### `isna(x)`, `nafilter(x)`, `nareplace(x, val)`, `naFilter(x)`, `naReplace(x, val)`
All NA-related methods are supported.
## Utilities
#### `cut(x::Vector, breaks::Vector)`
Returns a PooledDataVector with length equal to `x` that divides values in `x`
based on the divisions given by `breaks`.
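
The binning idea can be sketched in base Julia with `searchsortedfirst`. Treating each interval as closed on the right is an assumption, as is returning integer bin numbers rather than labelled levels; the real `cut` may differ on both points.

```julia
x = [1, 5, 10, 15]
breaks = [4, 9]          # must be sorted

# Bin 1 holds values <= 4, bin 2 holds (4, 9], bin 3 holds values > 9.
bin = [searchsortedfirst(breaks, v) for v in x]

println(bin)
```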
## Formulas and Models
#### `Formula(ex::Expr)`
Return a Formula object based on `ex`. Formulas are two-sided
expressions separated by `~`, like `:(y ~ w*x + z + i&v)`.
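
A quoted formula is an ordinary Julia `Expr`, so its two sides can be inspected directly. The sketch below uses only base Julia and reflects how current Julia versions parse `~` (as a call); the Julia version this guide was written for may have represented it differently.

```julia
ex = :(y ~ w*x + z + i&v)

op = ex.args[1]    # the separator, the symbol ~
lhs = ex.args[2]   # left-hand side, the symbol y
rhs = ex.args[3]   # right-hand side, itself an Expr

println(op)
println(lhs)
println(rhs)
```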
#### `model_frame(f::Formula, d::AbstractDataFrame)`
#### `model_frame(ex::Expr, d::AbstractDataFrame)`
A ModelFrame.
#### `model_matrix(mf::ModelFrame)`
#### `model_matrix(f::Formula, d::AbstractDataFrame)`
#### `model_matrix(ex::Expr, d::AbstractDataFrame)`
A ModelMatrix based on `mf`, `f` and `d`, or `ex` and `d`.
#### `lm(ex::Expr, df::AbstractDataFrame)`
Linear model results (type OLSResults) based on formula `ex` and `df`.