Miller is streaming when possible (exceptions noted below) -- most verbs:
- operate on, and hold in memory, a single independent record at a time
- retain no state from one record to the next
- don't wait for complete ingestion of input before producing any output
- (likewise, "statements" (e.g. `$z = $x + $y`) in the Miller programming language are implicit callbacks executed once per record)
- can operate on files which are larger than the system's memory
- consume other programs' output via a pipe, e.g. `tail -f some-file | mlr --some-flags`
- pipe output to other streaming tools (like `cat`, `grep`, `sed`, etc.)
One disadvantage: streaming sometimes requires accumulating results across records (rows) as they arrive, rather than looping through them explicitly.
| Streaming | In-memory records | Memory-friendly | Output after end of input | Verbs |
|---|---|---|---|---|
| Fully | None | Yes | No | `altkv`, `bar` (if not auto-mode), `cat`, `check`, `clean-whitespace`, `cut`, `decimate`, `fill-down`, `fill-empty`, `flatten`, `format-values`, `gap`, `grep`, `having-fields`, `head`, `json-parse`, `json-stringify`, `label`, `merge-fields`, `nest` (if not implode-values-across-records), `nothing`, `regularize`, `rename`, `reorder`, `repeat`, `reshape` (if not long-to-wide), `sec2gmt`, `sec2gmtdate`, `seqgen`, `skip-trivial-records`, `sort-within-records`, `step`, `tee`, `template`, `unflatten`, `unsparsify` (if invoked with `-f`) |
| Half | Input files are streamed; the join file (given with `-f`) is loaded into memory at start | | | `join` |
| No | All | No | Yes | `bar` (if auto-mode), `bootstrap`, `count-similar`, `fraction`, `group-by`, `group-like`, `least-frequent`, `most-frequent`, `nest` (if implode-values-across-records), `remove-empty-columns`, `reshape` (if long-to-wide), `sample`, `shuffle`, `sort`, `tac`, `uniq` (if `mlr uniq -a -c`), `unsparsify` (if invoked without `-f`) |
| No | Bounded number of records | Yes | Yes | `tail`, `top` |
| No | An amount of state, less than all | Variably yes | Yes | `count-distinct`, `count`, `histogram`, `stats1` (except `mlr stats1 -s` for incremental stats before end of stream), `stats2`, `uniq` (if not `mlr uniq -a -c`) |
| Variable; simple operations are fully streaming | Allows for logic to retain all | Yes, except for logic retaining all records | End blocks are executed after end of stream | `filter`, `put` |
Streaming
- Fully-streaming
- Half-streaming
- Variable
State
Records in memory
For operations requiring deeper retention, Miller retains only as much data
as needed. For example, `sort` and `tac` must ingest and retain all records
in memory before emitting any -- the last input record may well end up
being the first one to be emitted.
Other verbs, such as `tail` and `top`, need to retain only a fixed number
of records -- 10, perhaps, even if the input data has a million records.
Yet other verbs, such as `stats1` and `stats2`, retain only summary
arithmetic on the records they visit. These are memory-friendly: memory
usage is bounded. However, they only produce output at the end of the
record stream.
Output