draft thunder api docs

data types

images

count()

Explicit count of the number of items.

For lazy or distributed data, will force a computation.

first()

Return the first element.

toblocks(size='150')

Convert to Blocks, each representing a subdivision of the larger Images data.

  • size str or tuple of block sizes per dimension

    String interpreted as memory size (in megabytes, e.g. "64"). Tuple of ints interpreted as pixels per dimension. Only valid in spark mode.

totimeseries(size='150')

Converts this Images object to a TimeSeries object.

This method is equivalent to images.toblocks(size).toseries().totimeseries().

  • size string memory size optional default = "150M"

    String interpreted as memory size (e.g. "64M").

  • units string either "pixels" or "splits" default = "pixels"

    What units to use for a tuple size.

toseries(size='150')

Converts this Images object to a Series object.

This method is equivalent to images.toblocks(size).toseries().

  • size string memory size optional default = "150M"

    String interpreted as memory size (e.g. "64M").
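
A minimal usage sketch of these conversions; it assumes the package namespace import thunder as td and uses fromrandom (documented under reading below) to create small example data:

```python
import thunder as td  # assumed package namespace

# small random example data: 10 images of shape (50, 50)
images = td.images.fromrandom(shape=(10, 50, 50))

# convert to a Series (one record per pixel, one value per image),
# controlling the intermediate block size with a memory-size string
series = images.toseries(size='64')

# or go straight to a TimeSeries for time-based analyses
ts = images.totimeseries(size='64')
```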

tolocal()

Convert to local representation.

tospark(engine=None)

Convert to spark representation.

foreach(func)

Execute a function on each image

sample(nsamples=100, seed=None)

Extract a random sample of images.

  • nsamples int optional default = 100

    The number of data points to sample.

  • seed int optional default = None

    Random seed.

map(func, dims=None, with_keys=False)

Map an array -> array function over each image.

filter(func)

Filter images by applying a function to each image.

reduce(func)

Reduce over images using a pairwise function.
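
A short sketch of the functional operators, assuming an Images object created as above; each image is passed to the function as an ndarray:

```python
import thunder as td  # assumed package namespace

images = td.images.fromrandom(shape=(10, 50, 50))

# map: apply an array -> array function to every image
centered = images.map(lambda im: im - im.mean())

# filter: keep only images whose mean intensity is positive
bright = images.filter(lambda im: im.mean() > 0)

# reduce: combine images pairwise, here by summing (same result as sum())
total = images.reduce(lambda a, b: a + b)
```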

mean()

Compute the mean across images

var()

Compute the variance across images

std()

Compute the standard deviation across images

sum()

Compute the sum across images

max()

Compute the max across images

min()

Compute the min across images

squeeze()

Remove single-dimensional axes from images.

max_projection(axis=2)

Compute maximum projections of images / volumes along the specified dimension.

  • axis int optional default = 2

    Which axis to compute projection along

max_min_projection(axis=2)

Compute maximum-minimum projections of images / volumes along the specified dimension. This computes the sum of the maximum and minimum values along the given dimension.

  • axis int optional default = 2

    Which axis to compute projection along

subsample(factor)

Downsample an image volume by an integer factor.

  • factor positive int or tuple of positive ints

    Stride to use in subsampling. If a single int is passed, each dimension of the image will be downsampled by this same factor. If a tuple is passed, it must have the same dimensionality as the image, and the strides it contains will be applied to the corresponding image dimensions.
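
A sketch of projections and subsampling on volumetric data; fromarray is documented under reading below, and the shapes here are arbitrary:

```python
import numpy as np
import thunder as td  # assumed package namespace

# hypothetical example data: 10 volumes of shape (20, 50, 50)
volumes = td.images.fromarray(np.random.randn(10, 20, 50, 50))

# maximum projection of each volume along its third axis
proj = volumes.max_projection(axis=2)

# downsample every dimension by a factor of 2
small = volumes.subsample(2)

# or use a different stride per dimension (tuple length must match the volume)
anisotropic = volumes.subsample((1, 2, 2))
```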

gaussian_filter(sigma=2, order=0)

Spatially smooth images with a gaussian filter.

Filtering will be applied to every image in the collection.

  • sigma scalar or sequence of scalars default=2

    Size of the filter, given as the standard deviation in pixels. A sequence is interpreted as the standard deviation for each axis. A single scalar is applied equally to all axes.

  • order choice of 0 / 1 / 2 / 3 or sequence from same set optional default = 0

    Order of the gaussian kernel, 0 is a gaussian, higher numbers correspond to derivatives of a gaussian.

uniform_filter(size=2)

Spatially filter images using a uniform filter.

Filtering will be applied to every image in the collection.

  • size int or sequence of ints optional default=2

    Size of the filter neighborhood in pixels. A sequence is interpreted as the neighborhood size for each axis. A single scalar is applied equally to all axes.

median_filter(size=2)

Spatially filter images using a median filter.

Filtering will be applied to every image in the collection.

  • size int or sequence of ints optional default=2

    Size of the filter neighborhood in pixels. A sequence is interpreted as the neighborhood size for each axis. A single scalar is applied equally to all axes.
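
A sketch of the spatial filters, assuming example data as above; every image in the collection is filtered independently:

```python
import thunder as td  # assumed package namespace

images = td.images.fromrandom(shape=(10, 50, 50))

# gaussian smoothing with a standard deviation of 2 pixels on every axis
blurred = images.gaussian_filter(sigma=2)

# uniform and median filtering over a 3-pixel neighborhood
uniform = images.uniform_filter(size=3)
median = images.median_filter(size=3)
```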

localcorr(neighborhood=2)

Correlate every pixel to the average of its local neighborhood.

This algorithm computes, for every spatial record, the correlation coefficient between that record's series, and the average series of all records within a local neighborhood with a size defined by the neighborhood parameter. The neighborhood is currently required to be a single integer, which represents the neighborhood size in both x and y.

  • neighborhood int optional default=2

    Size of the correlation neighborhood (in both the x and y directions), in pixels.

subtract(val)

Subtract a constant value or an image / volume from all images / volumes in the data set.

  • val int float or ndarray

    Value to subtract

topng(path, prefix="image", overwrite=False)

Write 2d or 3d images as PNG files.

Files will be written into a newly-created directory. Three-dimensional data will be treated as RGB channels.

  • path string

    Path to output directory, must be one level below an existing directory.

  • prefix string

    String to prepend to filenames.

  • overwrite bool

    If true, the directory given by path will first be deleted if it exists.

totif(path, prefix="image", overwrite=False)

Write 2d or 3d images as TIF files.

Files will be written into a newly-created directory. Three-dimensional data will be treated as RGB channels.

  • path string

    Path to output directory, must be one level below an existing directory.

  • prefix string

    String to prepend to filenames.

  • overwrite bool

    If true, the directory given by path will first be deleted if it exists.

tobinary(path, prefix="image", overwrite=False)

Write out images or volumes as flat binary files.

Files will be written into a newly-created directory.

  • path string

    Path to output directory, must be one level below an existing directory.

  • prefix string

    String to prepend to filenames.

  • overwrite bool

    If true, the directory given by path will first be deleted if it exists.
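
A sketch of the export methods; the output paths are hypothetical, and per the notes above each must sit one level below an existing directory:

```python
import thunder as td  # assumed package namespace

images = td.images.fromrandom(shape=(10, 50, 50))

# write each image as a PNG into a newly created directory
images.topng('output/pngs', prefix='frame')

# write TIFs, deleting the target directory first if it already exists
images.totif('output/tifs', prefix='frame', overwrite=True)

# write raw flat binary files
images.tobinary('output/binary', prefix='frame')
```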

map_as_series(func, value_size=None, block_size='150')

Efficiently apply a function to each time series

Applies a function to each time series without transforming all the way to a Series object, but using a Blocks object instead for increased efficiency in the transformation back to Images.

  • func function

Function to apply to each time series. Should take a one-dimensional ndarray and return a transformed one-dimensional ndarray.

  • value_size int optional default=None

    Size of the one-dimensional ndarray resulting from application of func. If not supplied, will be automatically inferred for an extra computational cost.

  • block_size str or tuple of block sizes per dimension

    String interpreted as memory size (in megabytes, e.g. "64"). Tuple of ints interpreted as pixels per dimension.
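
For example, subtracting the temporal mean from every pixel's series (a sketch, assuming example data as above):

```python
import thunder as td  # assumed package namespace

images = td.images.fromrandom(shape=(10, 50, 50))

# subtract the temporal mean from each pixel's series without materializing
# a full Series object; the result is still an Images object
centered = images.map_as_series(lambda ts: ts - ts.mean())

# supplying value_size (here, the number of images) skips the inference pass
centered = images.map_as_series(lambda ts: ts - ts.mean(), value_size=10)
```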

series

flatten()

Reshape all dimensions but the last into a single dimension

count()

Explicit count of the number of items.

For lazy or distributed data, will force a computation.

first()

Return the first element.

tolocal()

Convert to local representation.

tospark(engine=None)

Convert to spark representation.

sample(nsamples=100, seed=None)

Extract random sample of series.

  • nsamples int optional default = 100

    The number of data points to sample.

  • seed int optional default = None

    Random seed.

map(func, index=None, with_keys=False)

Map an array -> array function over each series

filter(func)

Filter by applying a function to each series.

reduce(func)

Reduce over series.

mean()

Compute the mean across series

var()

Compute the variance across series

std()

Compute the standard deviation across series

sum()

Compute the sum across series

max()

Compute the max across series

min()

Compute the min across series

between(left, right)

Select subset of values within the given index range

Inclusive on the left; exclusive on the right.

  • left int

    Left-most index in the desired range

  • right int

    Right-most index in the desired range

select(crit)

Select subset of values that match a given index criterion

  • crit function list str int

    Criterion function to map to indices, specific index value, or list of indices
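
A sketch of index-based subsetting, using random example data from fromrandom (documented under reading below):

```python
import thunder as td  # assumed package namespace

series = td.series.fromrandom(shape=(100, 10))  # 100 records of length 10

# keep index positions 0 (inclusive) through 5 (exclusive)
window = series.between(0, 5)

# keep only even index values, via a criterion function
evens = series.select(lambda idx: idx % 2 == 0)
```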

center(axis=1)

Center series data by subtracting the mean either within or across records

  • axis int optional default = 1

    Which axis to center along, within (1) or across (0) records

standardize(axis=1)

Standardize series data by dividing by the standard deviation either within or across records

  • axis int optional default = 1

    Which axis to standardize along, within (1) or across (0) records

zscore(axis=1)

Zscore series data by subtracting the mean and dividing by the standard deviation either within or across records

  • axis int optional default = 1

    Which axis to zscore along, within (1) or across (0) records
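
A sketch of the normalization methods, showing both axis conventions:

```python
import thunder as td  # assumed package namespace

series = td.series.fromrandom(shape=(100, 10))

# subtract each record's own mean (axis=1, within records)
centered = series.center(axis=1)

# z-score across records instead: each index position is normalized using
# the mean and standard deviation taken over all records
normalized = series.zscore(axis=0)
```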

squelch(threshold)

Set all records that do not exceed the given threshold to 0

  • threshold scalar

    Level below which to set records to zero

correlate(signal)

Correlate series data against one or many one-dimensional arrays.

  • signal array or str

    Signal(s) to correlate against, can be a numpy array or a MAT file containing the signal as a variable

series_max()

Compute the value maximum of each record in a Series

series_min()

Compute the value minimum of each record in a Series

series_sum()

Compute the value sum of each record in a Series

series_mean()

Compute the value mean of each record in a Series

series_median()

Compute the value median of each record in a Series

series_percentile(q)

Compute the value percentile of each record in a Series.

  • q scalar

    Floating point number between 0 and 100 inclusive, specifying percentile.

series_std()

Compute the value standard deviation of each record in a Series

series_stat(stat)

Compute a simple statistic for each record in a Series

  • stat str

    Which statistic to compute

series_stats()

Compute many statistics for each record in a Series

mean_by_panel(length)

Compute the mean across fixed sized panels of each record.

Splits each record into panels of size length, and then computes the mean across panels. Panel length must subdivide record exactly.

  • length int

    Fixed length with which to subdivide.

select_by_index(val, level=0, squeeze=False, filter=False, return_mask=False)

Select or filter elements of the Series by index values (across levels, if multi-index).

The index is a property of a Series object that assigns a value to each position within the arrays stored in the records of the Series. This function returns a new Series where, within each record, only the elements indexed by a given value(s) are retained. An index where each value is a list of a fixed length is referred to as a 'multi-index', as it provides multiple labels for each index location. Each of the dimensions in these sublists is a 'level' of the multi-index. If the index of the Series is a multi-index, then the selection can proceed by first selecting one or more levels, and then selecting one or more values at each level.

  • val list of lists

    Specifies the selected index values. List must contain one list for each level of the multi-index used in the selection. For any singleton lists, the list may be replaced with just the integer.

  • level list of ints optional default=0

    Specifies which levels in the multi-index to use when performing selection. If a single level is selected, the list can be replaced with an integer. Must be the same length as val.

  • squeeze bool optional default=False

    If True, the multi-index of the resulting Series will drop any levels that contain only a single value because of the selection. Useful if indices are used as unique identifiers.

  • filter bool optional default=False

    If True, selection process is reversed and all index values EXCEPT those specified are selected.

  • return_mask bool optional default=False

    If True, return the mask used to implement the selection.
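
A sketch of simple single-level selection; multi-index selection follows the same pattern, with one list of values per level:

```python
import thunder as td  # assumed package namespace

series = td.series.fromrandom(shape=(100, 10))  # default index is 0, 1, ..., 9

# keep only elements whose index value is 0, 1, or 2
subset = series.select_by_index([0, 1, 2])

# reverse the selection: keep everything except index values 0, 1, and 2
rest = series.select_by_index([0, 1, 2], filter=True)
```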

aggregate_by_index(function, level=0)

Aggregate data in each record, grouping by index values.

For each unique value of the index, applies a function to the group indexed by that value. Returns a Series indexed by those unique values. For the result to be a valid Series object, the aggregating function should return a simple numeric type. Also allows selection of levels within a multi-index. See select_by_index for more info on indices and multi-indices.

  • function function

    Aggregating function to map to Series values. Should take a list or ndarray as input and return a simple numeric value.

  • level list of ints optional default=0

    Specifies the levels of the multi-index to use when determining unique index values. If only a single level is desired, can be an int.

stat_by_index(stat, level=0)

Compute the desired statistic for each unique index value (across levels, if multi-index)

  • stat string

    Statistic to be computed: sum, mean, median, stdev, max, min, count

  • level list of ints optional default=0

    Specifies the levels of the multi-index to use when determining unique index values. If only a single level is desired, can be an int.

sum_by_index(level=0)

Compute sums for each unique index value (across levels, if multi-index)

mean_by_index(level=0)

Compute means for each unique index value (across levels, if multi-index)

median_by_index(level=0)

Compute medians for each unique index value (across levels, if multi-index)

std_by_index(level=0)

Compute standard deviations for each unique index value (across levels, if multi-index)

max_by_index(level=0)

Compute maximum values for each unique index value (across levels, if multi-index)

min_by_index(level=0)

Compute minimum values for each unique index value (across levels, if multi-index)

count_by_index(level=0)

Count the number of elements for each unique index value (across levels, if multi-index)
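
A sketch of grouping by index values; the repeated index here is hypothetical and supplied through fromarray (documented under reading below):

```python
import numpy as np
import thunder as td  # assumed package namespace

# 10 records of length 6, with an index that repeats: two groups of three
series = td.series.fromarray(np.random.randn(10, 6), index=[0, 0, 0, 1, 1, 1])

# average the values sharing each unique index value
averaged = series.mean_by_index()

# the same computation through the generic forms
averaged2 = series.stat_by_index('mean')
averaged3 = series.aggregate_by_index(np.mean)
```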

cov()

Compute covariance of a distributed matrix.

  • axis int optional default = None

    Axis for performing mean subtraction, None (no subtraction), 0 (rows) or 1 (columns)

gramian()

Compute gramian of a distributed matrix.

The gramian is defined as the product of the matrix with its transpose, i.e. A^T * A.

times(other)

Multiply a matrix by another one.

Other matrix must be a numpy array, a scalar, or another matrix in local mode.

  • other Matrix scalar or numpy array

    A matrix to multiply with

totimeseries()

Convert Series to TimeSeries, a subclass for time series computation.

toimages(size='150')

Converts Series to Images.

Equivalent to calling series.toblocks(size).toimages()

  • size str optional default = "150M"

    String interpreted as memory size.
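
A sketch of converting back and forth between representations:

```python
import thunder as td  # assumed package namespace

images = td.images.fromrandom(shape=(10, 50, 50))
series = images.toseries()

# back to Images, controlling the intermediate block size
images_again = series.toimages(size='64')

# promote to a TimeSeries for time-specific methods
ts = series.totimeseries()
```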

tobinary(path, prefix='series', overwrite=False, credentials=None)

Write data to binary files.

  • path string path or URI to directory to be created

    Output files will be written underneath path. Directory will be created as a result of this call.

  • prefix str optional default = 'series'

    String prefix for files.

  • overwrite bool

If true, path and all its contents will be deleted and recreated as part of this call.

reading

images

fromrdd(rdd, dims=None, nrecords=None, dtype=None)

Load Images object from a Spark RDD.

Must be a collection of key-value pairs where keys are singleton tuples indexing images, and values are 2d or 3d ndarrays.

  • rdd SparkRDD

    An RDD containing images

  • dims tuple or array optional default = None

    Image dimensions (if provided will avoid check).

  • nrecords int optional default = None

    Number of images (if provided will avoid check).

  • dtype string default = None

    Data numerical type (if provided will avoid check)

fromarray(values, npartitions=None, engine=None)

Load Images object from a local array-like.

First dimension will be used to index images, so remaining dimensions after the first should be the dimensions of the images/volumes, e.g. (3, 100, 200) for 3 x (100, 200) images

  • values array-like

    The array of images

  • npartitions int default = None

    Number of partitions for parallelization (Spark only)

  • engine object default = None

    Computational engine (e.g. a SparkContext for Spark)
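
A sketch of loading images from a local array; the array contents are arbitrary:

```python
import numpy as np
import thunder as td  # assumed package namespace

# three 100 x 200 images stacked along the first dimension
arr = np.random.randn(3, 100, 200)
images = td.images.fromarray(arr)

# in spark mode, pass a SparkContext as the engine along with a partition count
# images = td.images.fromarray(arr, npartitions=4, engine=sc)
```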

fromlist(items, accessor=None, keys=None, dims=None, dtype=None, npartitions=None, engine=None)

Load images from a list of items using the given accessor.

  • accessor function

    Apply to each item from the list to yield an image

  • keys list optional default=None

    An optional list of keys

  • dims tuple optional default=None

    Specify a known image dimension to avoid computation.

  • npartitions int

    Number of partitions for computational engine

frompath(path, accessor=None, ext=None, start=None, stop=None, recursive=False, npartitions=None, dims=None, dtype=None, recount=False, engine=None, credentials=None)

Load images from a path using the given accessor.

Supports both local and remote filesystems.

  • accessor function

    Apply to each item after loading to yield an image.

  • ext str optional default=None

    File extension.

  • npartitions int optional default=None

    Number of partitions for computational engine, if None will use default for engine.

  • dims tuple optional default=None

    Dimensions of images.

  • dtype str optional default=None

    Numerical type of images.

  • start, stop nonnegative int optional default=None

    Indices of files to load, interpreted using Python slicing conventions.

  • recursive boolean optional default=False

    If true, will recursively descend directories from path, loading all files with an extension matching 'ext'.

  • recount boolean optional default=False

    Force subsequent record counting.

frombinary(path, shape=None, dtype=None, ext='bin', start=None, stop=None, recursive=False, nplanes=None, npartitions=None, conf='conf.json', order='C', engine=None, credentials=None)

Load images from flat binary files.

Assumes one image per file, each with the shape and ordering as given by the input arguments.

  • path str

    Path to data files or directory, specified as either a local filesystem path or in a URI-like format, including scheme. May include a single '*' wildcard character.

  • shape tuple of positive int

    Dimensions of input image data.

  • ext string optional default="bin"

    Extension required on data files to be loaded.

  • start, stop nonnegative int optional default=None

    Indices of the first and last-plus-one file to load, relative to the sorted filenames matching path and ext. Interpreted using python slice indexing conventions.

  • recursive boolean optional default=False

    If true, will recursively descend directories from path, loading all files with an extension matching 'ext'.

  • nplanes positive integer optional default=None

    If passed, will cause single files to be subdivided into nplanes separate images. Otherwise, each file is taken to represent one image.

  • npartitions int optional default=None

    Number of partitions for computational engine, if None will use default for engine.

fromtif(path, ext='tif', start=None, stop=None, recursive=False, nplanes=None, npartitions=None, engine=None, credentials=None)

Load images from single or multi-page TIF files.

  • path str

    Path to data files or directory, specified as either a local filesystem path or in a URI-like format, including scheme. May include a single '*' wildcard character.

  • ext string optional default="tif"

    Extension required on data files to be loaded.

  • start, stop nonnegative int optional default=None

    Indices of the first and last-plus-one file to load, relative to the sorted filenames matching 'path' and 'ext'. Interpreted using python slice indexing conventions.

  • recursive boolean optional default=False

    If true, will recursively descend directories from path, loading all files with an extension matching 'ext'.

  • nplanes positive integer optional default=None

    If passed, will cause single files to be subdivided into nplanes separate images. Otherwise, each file is taken to represent one image.

  • npartitions int optional default=None

    Number of partitions for computational engine, if None will use default for engine.
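
A sketch of loading TIFs; the paths are hypothetical:

```python
import thunder as td  # assumed package namespace

# load all TIFs under a directory, subdividing each multi-page file
# into nplanes separate images
images = td.images.fromtif('data/tifs', nplanes=10)

# load only the first 100 files, descending into subdirectories
subset = td.images.fromtif('data/tifs', stop=100, recursive=True)
```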

frompng(path, ext='png', start=None, stop=None, recursive=False, npartitions=None, engine=None, credentials=None)

Load images from PNG files.

  • path str

    Path to data files or directory, specified as either a local filesystem path or in a URI-like format, including scheme. May include a single '*' wildcard character.

  • ext string optional default="png"

    Extension required on data files to be loaded.

  • start, stop nonnegative int optional default=None

    Indices of the first and last-plus-one file to load, relative to the sorted filenames matching path and ext. Interpreted using python slice indexing conventions.

  • recursive boolean optional default=False

    If true, will recursively descend directories from path, loading all files with an extension matching 'ext'.

  • npartitions int optional default=None

    Number of partitions for computational engine, if None will use default for engine.

fromrandom(shape=(10, 50, 50), npartitions=1, seed=42, engine=None)

Generate random image data.

  • shape tuple optional default=(10 50 50)

    Dimensions of images.

  • npartitions int optional default=1

    Number of partitions.

  • seed int optional default=42

    Random seed.

fromexample(name=None, engine=None)

Load example image data.

Data must be downloaded from S3, so this method requires an internet connection.

  • name str

    Name of dataset, if not specified will print options.

series

fromrdd(rdd, nrecords=None, shape=None, index=None, dtype=None)

Load Series object from a Spark RDD.

Assumes keys are tuples with increasing and unique indices, and values are 1d ndarrays. Will try to infer properties that are not explicitly provided.

  • rdd SparkRDD

    An RDD containing series data.

  • shape tuple or array optional default = None

    Total shape of data (if provided will avoid check).

  • nrecords int optional default = None

    Number of records (if provided will avoid check).

  • index array optional default = None

    Index for records, if not provided will use (0, 1, ...)

  • dtype string default = None

    Data numerical type (if provided will avoid check)
    

fromarray(values, index=None, npartitions=None, engine=None)

Load Series object from a local numpy array.

Assumes that all but final dimension index the records, and the size of the final dimension is the length of each record, e.g. a (2, 3, 4) array will be treated as 2 x 3 records of size (4,)

  • values array-like

    An array containing the data.

  • index array optional default = None

    Index for records, if not provided will use (0,1,...,N) where N is the length of each record.

  • npartitions int default = None

    Number of partitions for parallelization (Spark only)

  • engine object default = None

    Computational engine (e.g. a SparkContext for Spark)

fromlist(items, accessor=None, index=None, dtype=None, npartitions=None, engine=None)

Create a Series object from a list of items and optional accessor function.

Will call accessor function on each item from the list, providing a generic interface for data loading.

  • items list

    A list of items to load.

  • accessor function optional default = None

    A function to apply to each item in the list during loading.

  • index array optional default = None

    Index for records, if not provided will use (0,1,...,N) where N is the length of each record.

  • dtype string default = None

    Data numerical type (if provided will avoid check)
    
  • npartitions int default = None

    Number of partitions for parallelization (Spark only)

  • engine object default = None

    Computational engine (e.g. a SparkContext for Spark)

fromtext(path, ext='txt', dtype='float64', skip=0, shape=None, index=None, npartitions=None, engine=None, credentials=None)

Load Series data from text files.

Assumes data are formatted as rows, where each record is a row of numbers separated by spaces e.g. 'v v v v v'. You can optionally specify a fixed number of initial items per row to skip / discard.

  • path string

    Directory to load from, can be a URI string with scheme (e.g. "file://", "s3n://", or "gs://"), or a single file, or a directory, or a directory with a single wildcard character.

  • ext str optional default = 'txt'

    File extension.

  • dtype dtype or dtype specifier optional default = 'float64'

    Numerical type to use for data after converting from text.

  • skip int optional default = 0

    Number of items in each record to skip.

  • shape tuple or list optional default = None

    Shape of data if known, will be inferred otherwise.

  • index array optional default = None

    Index for records, if not provided will use (0, 1, ...)

  • npartitions int default = None

    Number of partitions for parallelization (Spark only)

  • engine object default = None

    Computational engine (e.g. a SparkContext for Spark)

  • credentials dict default = None

    Credentials for remote storage (e.g. S3) in the form {access: ***, secret: ***}
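
A sketch of loading from text; the path is hypothetical:

```python
import thunder as td  # assumed package namespace

# load series data from whitespace-delimited text files, skipping the
# first item of each row (e.g. a record label)
series = td.series.fromtext('data/series', ext='txt', skip=1)
```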

frombinary(path, ext='bin', conf='conf.json', dtype=None, shape=None, skip=0, index=None, engine=None, credentials=None)

Load a Series object from flat binary files.

  • path string URI or local filesystem path

    Directory to load from, can be a URI string with scheme (e.g. "file://", "s3n://", or "gs://"), or a single file, or a directory, or a directory with a single wildcard character.

  • ext str optional default = 'bin'

    Optional file extension specifier.

  • conf str optional default = 'conf.json'

    Name of conf file with type and size information.

  • dtype dtype or dtype specifier optional default = None

    Numerical type to use for the data; if not provided, will be taken from the conf file.

  • shape tuple or list optional default = None

    Shape of data if known, will be inferred otherwise.

  • skip int optional default = 0

    Number of items in each record to skip.

  • index array optional default = None

    Index for records, if not provided will use (0, 1, ...)

  • engine object default = None

    Computational engine (e.g. a SparkContext for Spark)

  • credentials dict default = None

    Credentials for remote storage (e.g. S3) in the form {access: ***, secret: ***}
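
A sketch of loading from flat binary files; the paths and credentials are hypothetical:

```python
import thunder as td  # assumed package namespace

# type and size information is read from the conf.json alongside the data
series = td.series.frombinary('data/series-binary')

# reading from remote storage with explicit credentials
# series = td.series.frombinary('s3n://bucket/series',
#                               credentials={'access': '...', 'secret': '...'})
```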

fromrandom(shape=(100, 10), npartitions=1, seed=42, engine=None)

Generate gaussian random series data.

  • shape tuple

    Dimensions of data.

  • npartitions int

    Number of partitions with which to distribute data.

  • seed int

    Randomization seed.

fromexample(name=None, engine=None)

Load example series data.

Data must be downloaded from S3, so this method requires an internet connection.

  • name str

    Name of dataset, options include 'iris' | 'mouse' | 'fish'. If not specified will print options.
