Created
June 24, 2015 08:07
# Function Reference Guide
## DataFrames
#### `DataFrame(cols::Vector, colnames::Vector{ByteString})`
Construct a DataFrame from the columns given by `cols` with the index
generated by `colnames`. A DataFrame inherits from
`Associative{Any,Any}`, so Associative operations should work. Columns
are vector-like objects. Normally these are AbstractDataVectors (DataVectors
or PooledDataVectors), but they can also (currently) include standard
Julia Vectors.
#### `DataFrame(cols::Vector)`
Construct a DataFrame from the columns given by `cols` with default
column names.
#### `DataFrame()`
Construct an empty DataFrame.
#### `copy(df::DataFrame)`
A shallow copy of `df`. Columns are referenced, not copied.
#### `deepcopy(df::DataFrame)`
A deep copy of `df`. Copies of each column are made.
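
The difference between the two copies matters once a column is mutated in place. The sketch below illustrates it in plain Julia, using a `Dict` of column vectors as a hypothetical stand-in for a DataFrame; it does not call the package API described here.

```julia
# Hypothetical stand-in for a DataFrame: column name => column vector.
cols = Dict{String,Any}("x" => [1, 2, 3], "y" => ["a", "b", "c"])

shallow = copy(cols)      # new container, same column vectors
deep    = deepcopy(cols)  # new container, independent column vectors

cols["x"][1] = 100        # mutate a column of the original in place

shallow["x"][1]           # 100: the shallow copy shares the column
deep["x"][1]              # 1: the deep copy is unaffected
```

Prefer the shallow copy when columns are only read or replaced wholesale, and the deep copy when columns will be modified element by element.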
#### `similar(df::DataFrame, nrow)`
A new DataFrame with `nrow` rows and the same column names and types as `df`.
### Basics
#### `size(df)`, `ndims(df)`
Same meanings as for Arrays.
#### `has(df, key)`, `get(df, key, default)`, `keys(df)`, and `values(df)`
Same meanings as Associative operations. `keys` are column names;
`values` are column contents.
#### `start(df)`, `done(df,i)`, and `next(df,i)`
Methods to iterate over columns.
#### `ncol(df::AbstractDataFrame)`
Number of columns in `df`.
#### `nrow(df::AbstractDataFrame)`
Number of rows in `df`.
#### `length(df::AbstractDataFrame)`
Number of columns in `df`.
#### `isempty(df::AbstractDataFrame)`
Whether the number of columns equals zero.
#### `head(df::AbstractDataFrame)` and `head(df::AbstractDataFrame, i::Int)`
First `i` rows of `df`. Defaults to 6.
#### `tail(df::AbstractDataFrame)` and `tail(df::AbstractDataFrame, i::Int)`
Last `i` rows of `df`. Defaults to 6.
#### `show(io, df::AbstractDataFrame)`
Standard pretty-printer of `df`. Called by `print()` and the REPL.
#### `dump(df::AbstractDataFrame)`
Show the structure of `df`. Like R's `str`.
#### `describe(df::AbstractDataFrame)`
Show a description of each column of `df`.
#### `complete_cases(df::AbstractDataFrame)`
A `Vector{Bool}` with one element per row that is `true` for complete
cases in `df` (rows with no NAs).
#### `duplicated(df::AbstractDataFrame)`
A `Vector{Bool}` with one element per row that is `true` for rows that
duplicate a prior row.
#### `unique(df::AbstractDataFrame)`
A DataFrame with the unique rows of `df`.
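
The row masks described above can be modeled in plain Julia. This is a sketch only: it represents a table as a tuple of equal-length column vectors and uses `missing` as a stand-in for `NA`. The names `a`, `b`, and `columns` are hypothetical, and none of this is the package's implementation.

```julia
# Hypothetical two-column table; `missing` plays the role of NA.
a = [1, 2, missing, 2]
b = ["u", "v", "w", "v"]
columns = (a, b)
n = length(a)

# A row is complete when no column holds a missing value in that row.
complete = [all(col -> !ismissing(col[i]), columns) for i in 1:n]

# A row is a duplicate when an identical row appeared earlier.
rows = [map(col -> col[i], columns) for i in 1:n]
seen = Set{Any}()
dup = Bool[]
for r in rows
    push!(dup, r in seen)
    push!(seen, r)
end

unique_rows = rows[.!dup]   # the rows that a `unique` would keep
```

Here `complete` is `[true, true, false, true]` and `dup` is `[false, false, false, true]`, so the fourth row is dropped from `unique_rows`.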
### Indexing, Assignment, and Concatenation
DataFrames are indexed like a Matrix and like an Associative. Columns
may be indexed by column name. Rows do not have names. Referencing
with one argument normally indexes by columns: `df["col"]`,
`df[["col1","col3"]]` or `df[i]`. With two arguments, rows and columns
are selected. Indexing along rows works like Matrix indexing. Indexing
along columns works like Matrix indexing with the addition of column
name access.
#### `getindex(df::DataFrame, ind)` or `df[ind]`
Returns a subset of the columns of `df` as specified by `ind`, which
may be an `Int`, a `Range`, a `Vector{Int}`, a `ByteString`, or a
`Vector{ByteString}`. Columns are referenced, not copied. For a
single-element `ind`, the column by itself is returned.
#### `getindex(df::DataFrame, irow, icol)` or `df[irow,icol]`
Returns a subset of `df` as specified by `irow` and `icol`. `irow` may
be an `Int`, a `Range`, or a `Vector{Int}`. `icol` may be an `Int`, a
`Range`, a `Vector{Int}`, a `ByteString`, or a
`Vector{ByteString}`. For a single-element `icol`, the column subset by
itself is returned.
#### `index(df::DataFrame)`
Returns the column `Index` for `df`.
#### `set_group(df::DataFrame, newgroup, names::Vector{ByteString})`
#### `get_groups(df::DataFrame)`
#### `set_groups(df::DataFrame, gr::Dict)`
See the Indexing section for these operations on column indexes.
#### `colnames(df::DataFrame)` or `names(df::DataFrame)`
The column names as an `Array{ByteString}`.
#### `setindex!(df::DataFrame, newcol, colname)` or `df[colname] = newcol`
Replace or add a new column with name `colname` and contents `newcol`.
Arrays are converted to DataVectors. Values are recycled to match the
number of rows in `df`.
#### `insert!(df::DataFrame, index::Integer, item, name)`
Insert a column of name `name` and with contents `item` into `df` at
position `index`.
#### `insert!(df::DataFrame, df2::DataFrame)`
Insert the columns of `df2` into `df`.
#### `del!(df::DataFrame, cols)`
Delete the columns of `df` given by `cols`, which may be specified by
any means by which columns can be referenced.
#### `del(df::DataFrame, cols)`
Nondestructive version. Return a DataFrame based on the columns in
`df` after deleting columns specified by `cols`.
#### `deleterows!(df::DataFrame, inds)`
Delete rows at positions specified by `inds` from the given DataFrame.
#### `cbind(df1, df2, ...)` or `hcat(df1, df2, ...)` or `[df1 df2 ...]`
Concatenate columns. Duplicated column names are adjusted.
#### `rbind(df1, df2, ...)` or `vcat(df1, df2, ...)` or `[df1, df2, ...]`
Concatenate rows.
### I/O
#### `csvDataFrame(filename, o::Options)`
Return a DataFrame from file `filename`. Options `o` include
`colnames` (`"true"`, `"false"`, or `"check"` (the default)) and
`poolstrings` (`"check"` (default) or `"never"`).
### Expression/Function Evaluation in a DataFrame
#### `with(df::AbstractDataFrame, ex::Expr)`
Evaluate expression `ex` with the columns in `df`.
#### `within(df::AbstractDataFrame, ex::Expr)`
Return a copy of `df` after evaluating expression `ex` with the
columns in `df`.
#### `within!(df::AbstractDataFrame, ex::Expr)`
Modify `df` by evaluating expression `ex` with the columns in `df`.
#### `based_on(df::AbstractDataFrame, ex::Expr)`
Return a new DataFrame based on evaluating expression `ex` with the
columns in `df`. Often used for summarizing operations.
#### `colwise(f::Function, df::AbstractDataFrame)`
#### `colwise(f::Vector{Function}, df::AbstractDataFrame)`
Apply `f` to each column of `df`, and return the results as an
`Array{Any}`.
#### `colwise(df::AbstractDataFrame, s::Symbol)`
#### `colwise(df::AbstractDataFrame, s::Vector{Symbol})`
Apply the function specified by Symbol `s` to each column of `df`, and
return the results as a DataFrame.
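
As a rough model of column-wise application, the sketch below applies one function to each column of a plain vector of columns and collects the results in an `Array{Any}`. The helper `colwise_sketch` and the sample data are hypothetical; the package's own `colwise` may differ in detail.

```julia
# Hypothetical table: parallel vectors of column names and column contents.
colnames = ["x", "y"]
columns  = [[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]]

# Apply `f` to every column and gather one result per column.
colwise_sketch(f::Function, columns) = Any[f(c) for c in columns]

totals = colwise_sketch(sum, columns)       # one sum per column
peaks  = colwise_sketch(maximum, columns)   # one maximum per column
```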
### SubDataFrames
#### `sub(df::DataFrame, r, c)`
#### `sub(df::DataFrame, r)`
Return a SubDataFrame with references to rows and columns of `df`.
#### `sub(sd::SubDataFrame, r, c)`
#### `sub(sd::SubDataFrame, r)`
Return a SubDataFrame with references to rows and columns of `sd`.
#### `getindex(sd::SubDataFrame, r, c)` or `sd[r,c]`
#### `getindex(sd::SubDataFrame, c)` or `sd[c]`
Referencing should work the same as for DataFrames.
### Grouping
#### `groupby(df::AbstractDataFrame, cols)`
Return a GroupedDataFrame based on unique groupings indicated by the
columns with one or more names given in `cols`.
#### `start(gd)`, `done(gd,i)`, and `next(gd,i)`
Methods to iterate over GroupedDataFrame groupings.
#### `getindex(gd::GroupedDataFrame, idx)` or `gd[idx]`
Reference a particular grouping. Referencing returns a SubDataFrame.
#### `with(gd::GroupedDataFrame, ex::Expr)`
Evaluate expression `ex` with the columns in `gd` in each grouping.
#### `within(gd::GroupedDataFrame, ex::Expr)`
#### `within!(gd::GroupedDataFrame, ex::Expr)`
Return a DataFrame with the results of evaluating expression `ex` with
the columns in `gd` in each grouping.
#### `based_on(gd::GroupedDataFrame, ex::Expr)`
Sweeps along groups and applies `based_on` to each group. Returns a
DataFrame.
#### `map(f::Function, gd::GroupedDataFrame)`
Apply `f` to each grouping of `gd` and return the results in an Array.
#### `colwise(f::Function, gd::GroupedDataFrame)`
#### `colwise(f::Vector{Function}, gd::GroupedDataFrame)`
Apply `f` to each column in each grouping of `gd`, and return the
results as an `Array{Any}`.
#### `colwise(gd::GroupedDataFrame, s::Symbol)`
#### `colwise(gd::GroupedDataFrame, s::Vector{Symbol})`
Apply the function specified by Symbol `s` to each column in each
grouping of `gd`, and return the results as a DataFrame.
#### `by(df::AbstractDataFrame, cols, s::Symbol)` or `groupby(df, cols) |> s`
#### `by(df::AbstractDataFrame, cols, s::Vector{Symbol})`
Return a DataFrame with the results of grouping on `cols` and
`colwise` evaluation based on `s`. Equivalent to `colwise(groupby(df,
cols), s)`.
#### `by(df::AbstractDataFrame, cols, e::Expr)` or `groupby(df, cols) |> e`
Return a DataFrame with the results of grouping on `cols` and
evaluation of `e` in each grouping. Equivalent to `based_on(groupby(df,
cols), e)`.
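
The grouping functions above follow a split, apply, combine pattern. The plain-Julia sketch below illustrates that pattern on two hypothetical parallel columns, `key` and `value`. It is a conceptual model, not the package's `groupby` or `by`.

```julia
# Hypothetical columns: a grouping key and a value to summarize.
key   = ["a", "b", "a", "b", "a"]
value = [1, 2, 3, 4, 5]

# Split: collect the row positions of each distinct key,
# keeping keys in order of first appearance.
order  = unique(key)
groups = Dict(k => findall(==(k), key) for k in order)

# Apply and combine: compute one summary per group.
sums = [k => sum(value[groups[k]]) for k in order]
```

Each entry of `groups` plays the role of one grouping (a set of row references, much like a SubDataFrame), and `sums` is the combined per-group result: `"a" => 9` and `"b" => 6`.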
### Reshaping / Merge
#### `stack(df::DataFrame, cols)`
For conversion from wide to long format. Returns a DataFrame with
stacked columns indicated by `cols`. The result has column `"key"`
with column names from `df` and column `"value"` with the values from
`df`. Columns in `df` not included in `cols` are duplicated along the
stack.
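
A wide-to-long conversion of this kind can be sketched in plain Julia as follows. The columns `id`, `h`, and `w` are hypothetical, and the row order produced by the package's `stack` is an assumption here rather than something this guide states.

```julia
# Hypothetical wide table: one identifier column and two measured columns.
id = [1, 2]
h  = [170, 180]
w  = [65, 80]

stacked_cols = ["h" => h, "w" => w]   # the columns chosen for stacking

long_id    = Int[]      # the non-stacked column, repeated along the stack
long_key   = String[]   # name of the source column for each value
long_value = Int[]      # the stacked values
for (name, col) in stacked_cols
    append!(long_id, id)
    append!(long_key, fill(name, length(col)))
    append!(long_value, col)
end
```

After the loop the long table has four rows: `long_key` is `["h", "h", "w", "w"]`, `long_value` is `[170, 180, 65, 80]`, and `long_id` repeats `[1, 2]` once per stacked column.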
#### `unstack(df::DataFrame, ikey, ivalue, irefkey)`
For conversion from long to wide format. Returns a DataFrame. `ikey`
indicates the key column: unique values in column `ikey` will be
column names in the result. `ivalue` indicates the value column.
`irefkey` is the column with a unique identifier. Columns
not given by `ikey`, `ivalue`, or `irefkey` are currently ignored.
#### `merge(df1::DataFrame, df2::DataFrame, bycol)`
#### `merge(df1::DataFrame, df2::DataFrame, bycol, jointype)`
Return the database join of `df1` and `df2` based on the column `bycol`.
Currently only a single merge key is supported. Supports `jointype` of
`"inner"` (the default), `"left"`, `"right"`, or `"outer"`.
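
To make the join types concrete, the sketch below computes an inner and a left join on a single key in plain Julia. The data and variable names are hypothetical, keys in the right table are assumed to be unique, and `missing` stands in for `NA`. It models the semantics only, not the package's `merge`.

```julia
# Hypothetical tables, each with a key column and one value column.
left_key  = [1, 2, 3]
left_val  = ["a", "b", "c"]
right_key = [2, 3, 4]
right_val = [20.0, 30.0, 40.0]

# Map each right-hand key to its row position (assumes unique keys).
lookup = Dict(k => i for (i, k) in enumerate(right_key))

# Inner join: keep only keys present in both tables.
inner = [(k, left_val[i], right_val[lookup[k]])
         for (i, k) in enumerate(left_key) if haskey(lookup, k)]

# Left join: keep every left row; unmatched rows get a missing value.
left = [(k, left_val[i], haskey(lookup, k) ? right_val[lookup[k]] : missing)
        for (i, k) in enumerate(left_key)]
```

A right join is the same as a left join with the tables swapped, and an outer join keeps the unmatched rows from both sides.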
## Index
#### `Index()`
#### `Index(s::Vector{ByteString})`
An Index with names `s`. An Index is like an Associative type. An
Index is used for column indexing of DataFrames. An Index maps
`ByteString` and `Vector{ByteString}` keys to indices (positions).
#### `length(x::Index)`, `copy(x::Index)`, `has(x::Index, key)`, `keys(x::Index)`, `push!(x::Index, name)`
Normal meanings.
#### `del(x::Index, idx::Integer)`, `del(x::Index, s::ByteString)`
Delete the name `s` or name at position `idx` in `x`.
#### `names(x::Index)`
A `Vector{ByteString}` with the names of `x`.
#### `names!(x::Index, nm::Vector{ByteString})`
Set names `nm` in `x`.
#### `rename(x::Index, f::Function)`
#### `rename(x::Index, nd::Associative)`
#### `rename(x::Index, from::Vector, to::Vector)`
Replace names in `x`, by applying function `f` to each name,
by mapping old to new names with a dictionary (Associative), or using
`from` and `to` vectors.
#### `getindex(x::Index, idx)` or `x[idx]`
This does the mapping from name(s) to indices (positions). `idx` may
be `ByteString`, `Vector{ByteString}`, `Int`, `Vector{Int}`, `Range{Int}`,
`Vector{Bool}`, `AbstractDataVector{Bool}`, or `AbstractDataVector{Int}`.
#### `set_group(idx::Index, newgroup, names::Vector{ByteString})`
Add a group to `idx` with name `newgroup` that includes the names in
the vector `names`.
#### `get_groups(idx::Index)`
A Dict that maps the name of each group to the names in the group.
#### `set_groups(idx::Index, gr::Dict)`
Set groups in `idx` based on the mapping given by `gr`.
## Missing Values
Missing value behavior is implemented by instantiations of the `AbstractDataVector`
abstract type.
#### `NA`
A constant indicating a missing value.
#### `isna(x)`
Return a `Bool` or `Array{Bool}` (if `x` is an `AbstractDataVector`)
that is `true` for elements with missing values.
#### `nafilter(x)`
Return a copy of `x` after removing missing values.
#### `nareplace(x, val)`
Return a copy of `x` after replacing missing values with `val`.
#### `naFilter(x)`
Return an object based on `x` such that future operations like `mean`
will not include missing values. This can be an iterator or other
object.
#### `naReplace(x, val)`
Return an object based on `x` such that future operations like `mean`
will replace NAs with `val`.
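
The distinction between the eager functions (`nafilter`, `nareplace`) and the lazy ones (`naFilter`, `naReplace`) can be illustrated by analogy with the `missing` value in current Base Julia. The mapping below is an analogy chosen for this sketch, not the API this guide documents.

```julia
x = [1.0, missing, 3.0]

mask     = ismissing.(x)             # elementwise test, comparable to isna
filtered = collect(skipmissing(x))   # eager copy without missing values
replaced = coalesce.(x, 0.0)         # eager copy with missing values replaced

lazy  = skipmissing(x)               # lazy wrapper: no copy is made
total = sum(lazy)                    # reductions skip the missing entries
```

The lazy wrapper avoids allocating a filtered copy when the only goal is a reduction such as a sum or a mean.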
#### `na(x)`
Return an `NA` value appropriate for the type of `x`.
#### `nas(x, dim)`
Return an object like `x` filled with `NA` values with size `dim`.
## DataVectors
#### `DataArray(x::Vector)`
#### `DataArray(x::Vector, m::Vector{Bool})`
Create a DataVector from `x`, with `m` optionally indicating which values
are NA. DataVectors are like Julia Vectors with support for NAs. `x` may
be any type of Vector.
#### `PooledDataArray(x::Vector)`
#### `PooledDataArray(x::Vector, m::Vector{Bool})`
Create a PooledDataVector from `x`, with `m` optionally indicating which
values are NA. PooledDataVectors contain a pool of values with references
to those values. This is useful in a similar manner to an R array of
factors.
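
The pool-plus-references layout can be sketched in plain Julia. The names `raw`, `pool`, `slot`, and `refs` are hypothetical, and the package's internal representation may differ.

```julia
raw = ["lo", "hi", "lo", "lo", "hi"]

pool = unique(raw)                                  # each distinct value stored once
slot = Dict(v => i for (i, v) in enumerate(pool))   # value => position in the pool
refs = [slot[v] for v in raw]                       # one small integer per element

decoded = pool[refs]                                # reconstructs the original vector
```

Storing `refs` instead of repeated strings saves memory when a column has few distinct values, which is the same motivation as R's factors.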
#### `size`, `length`, `ndims`, `ref`, `assign`, `start`, `next`, `done`
All normal Vector operations including array referencing should work.
#### `isna(x)`, `nafilter(x)`, `nareplace(x, val)`, `naFilter(x)`, `naReplace(x, val)`
All NA-related methods are supported.
## Utilities
#### `cut(x::Vector, breaks::Vector)`
Returns a PooledDataVector with length equal to `x` that divides values in `x`
based on the divisions given by `breaks`.
## Formulas and Models
#### `Formula(ex::Expr)`
Return a Formula object based on `ex`. Formulas are two-sided
expressions separated by `~`, like `:(y ~ w*x + z + i&v)`.
#### `model_frame(f::Formula, d::AbstractDataFrame)`
#### `model_frame(ex::Expr, d::AbstractDataFrame)`
A ModelFrame based on `f` and `d`, or `ex` and `d`.
#### `model_matrix(mf::ModelFrame)`
#### `model_matrix(f::Formula, d::AbstractDataFrame)`
#### `model_matrix(ex::Expr, d::AbstractDataFrame)`
A ModelMatrix based on `mf`, `f` and `d`, or `ex` and `d`.
#### `lm(ex::Expr, df::AbstractDataFrame)`
Linear model results (type OLSResults) based on formula `ex` and `df`.