When to use a specific tool

Brief list describing scenarios calling for specific tools.

Spark
If you have petabytes of (larger-than-memory) JSON/XML/CSV files, a simple workflow, and a thousand-node cluster

Dask
If you have (larger-than-memory) 10s-1000s of gigabytes of binary or numeric data (e.g., HDF5, netCDF4, CSV.gz), complex algorithms, and a (single) large multi-core workstation

SQLite
If you have less than a terabyte of content (larger-than-memory or not) and one writer at a time; if you need local (on-disk) data storage, permanent or temporary, for individual applications or devices; if you need to query/analyze a large dataset of text files such as CSV or XML without loading it all into memory; if you want to stick to the standard library (the sqlite3 module is built into Python)
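
As a rough illustration of the CSV-querying case, the sketch below loads rows from a CSV source into SQLite using only the standard library and then aggregates them with SQL. The table name `readings`, the columns `station` and `temp_c`, and the sample values are hypothetical; an in-memory database stands in for what would normally be a file path on disk.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice this would be a (large) file on disk
csv_text = "station,temp_c\nA,1.5\nB,-3.0\nA,2.5\n"

# ":memory:" keeps the sketch self-contained; pass a file path for on-disk storage
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, temp_c REAL)")

# Stream rows from the CSV reader straight into the table,
# so the whole file never has to sit in memory at once
reader = csv.DictReader(io.StringIO(csv_text))
conn.executemany(
    "INSERT INTO readings (station, temp_c) VALUES (:station, :temp_c)",
    reader,
)
conn.commit()

# Let SQLite do the aggregation instead of Python
rows = conn.execute(
    "SELECT station, AVG(temp_c) FROM readings GROUP BY station ORDER BY station"
).fetchall()
conn.close()

print(rows)
```

Because `executemany` consumes the reader lazily, the same pattern should carry over to files much larger than available memory, with SQLite handling the storage and the query.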

MongoDB, PostgreSQL
If you have a terabyte or less of (larger-than-memory) JSON/XML/CSV, if you have multiple writers at a time, if you need/want a client/server scheme

Cassandra
If you have lots of data coming in very quickly (from different locations), of heterogeneous types (schema-less), many terabytes or petabytes in size; if you need multiple servers or a distributed system (with potential expansion in the future); if you need constant availability (fault tolerance) while keeping things simple

Cython
If your code will be deployed by others (distributing a package with the optimized extensions), if you need to accelerate code that uses advanced Python features (e.g., list, dict, recursion, array allocation), if you need to directly call C, if your function operates on a pre-defined (fixed) number of dimensions

Numba
If you don’t need to distribute your code beyond your computer or your team (especially if you use Conda), if you need to accelerate code that uses scalars or (N-dimensional) arrays, if you want to write functions that work (automatically) on N-dimensional arrays
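
A minimal sketch of the scalar case, assuming Numba may or may not be installed: the `njit` decorator compiles a plain Python loop, and the fallback decorator (an addition for this example, not part of Numba) lets the same code run uncompiled. The function name `sum_of_squares` is hypothetical.

```python
try:
    from numba import njit  # JIT compiler, used if Numba is installed
except ImportError:
    def njit(func=None, **kwargs):
        # Assumed fallback: a no-op decorator so the sketch runs without Numba
        # (same results, no speedup)
        if func is None:
            return lambda f: f
        return func


@njit
def sum_of_squares(n):
    # Plain scalar loop: the kind of code Numba can compile to machine code
    total = 0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(10))
```

The first call with Numba includes compilation time, so any speedup typically shows up on later calls or on large `n`.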
