Code examples are in ngn/k, a dialect where int null is 0N and a list find miss is 0N, not #list.
It would be nice if a grouped list gave an empty int list for an outdex, and outdexing a frequency count gave zero. (See here for more.)
x:="abracadabra"
x
"abrcd"!(0 3 5 7 10;1 8;2 9;,4;,6)
x "z"
0N 0N 0N 0N 0N
x:#'x
x
"abrcd"!5 2 2 1 1
x "z"
0N
Existing workarounds are not good enough. Prepending a dummy value to set the prototype won't always work, because that determines structure, but not the fill. And dict dyad dict will do the dyad before you get a chance to fill any outdexing nulls that arise during the element match-up process.
x:="abracadabra"
x:((," ")!,!0),x
x "z"
!0 / good
x:#'x
x "z"
0N / bad
Perhaps we could append an extra element to a dict:
x:="abracadabra"
x,:,!0
x
"abrcd"!(0 3 5 7 10;1 8;2 9;,4;,6;!0)
Under the hood, the default value would literally be catted to the end of the dict's values list.
The extra element is used as the result of outdexing (hence 'default dict'):
x "z"
!0
The extra element is part of the dict's range, so it's included in dict range processing:
x:#'x
x
"abrcd"!5 2 2 1 1 0
x "z"
0
x+1
"abrcd"!6 3 3 2 2 1
Instead of catting the element, it can be included at the time of dict construction. Notice the lengths are different!
x:"abc"!1 2 3 4
Group would now return a default dict. For =list, the default is !0. For =dict, it's 0#!dict.
Sound simple? That was the easy part. How should the rest of k's primitives work on default dicts?
Some languages just treat dicts as "values@keys?, the data structure", but in k they're a lot more than that.
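To spell that model out, a dict lookup can be written by hand (a rough sketch in current ngn/k; d is a throwaway name):
d:"ab"!1 2
d "b"
2
(.d)@(!d)?"b" / the same lookup, spelled as values@keys?lookups
2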
On one view, k dicts are functions:
- domain and range are infinite - it's just that not all points are explicitly specified
- they are atomic - they do not join to lists
- we can do operations on the dict's range while keeping it as a dict, like 1+dict, per "(f g)x is f(g x)"
- two dicts can be merged: f,g with g overwriting the domain and range of f.
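For instance, with a throwaway dict d (a sketch of the last two points in current ngn/k):
d:`a`b!1 2
1+d / operate on the range, keep it a dict
`a`b!2 3
d,`b`c!30 40 / merge: the right-hand dict wins
`a`b`c!1 30 40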
Primitives that treat dicts as functions include indexing, find and cat (@ . ? ,). A lot of dict dyad dict operations also implicitly treat dicts as functions because they match up the elements of the ranges based on the dicts' domains:
d:`a`b!1 2
e:`b`a!3 4
d+e
`a`b!5 5
d:`b`a!("world";"hello")
e:`a`b!3 -1
e#'d
`a`b!("hel";,"d")
If your code could result in a dict outdex, you're probably using the dict as a function.
On another view, k dicts are collections:
- they have a finite number of elements
- they can be reduced without including the implicit elements in the range: +/"ab"!1 2 is 3, even though values corresponding to "cde"... are implicitly in the range
- they can be reduced without first extracting their values; the Python equivalent would be sum({"a":1}) working without doing .values() on the dict
- they have order, which is relevant for sequential primitives like scans and each-prior.
Primitives that treat dicts as collections include keys, values, count, take, drop, where, group, first, last, string, odometer and reverse. Each-priors require sequence, so they also fall into this bucket, along with dyadic operations that require a 1:1 correspondence between lists and dicts (say list + dict or list dyad' dict).
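For instance, a few collection-style results in current ngn/k (my own illustration, with d as a throwaway name):
d:"ab"!1 2
#d / count of the explicit elements
2
+/d / reduce over the values without extracting them
3
*d / first value
1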
If you find yourself removing a dict's keys with .:, or a default element would just get in the way, you're probably using the dict as a collection.
In fact, that might be the answer: redefine .: to remove a dict's keys and default element. Then:
- if you're using the dict as a function (eg indexing in or dict dyad dict), keep the dict as a dict
- if you're not, you're probably just using the values as a collection, so you can strip the keys and default element with .:.
For example:
1 2 3 + `a`b!1 2 3 / L+D with ranges conformant including the default
1 2 + . `a`b!1 2 3 / L+L with . stripping the keys and the default from the dict
I reckon most cases where you're stripping off the keys are the result of group anyway, as it's the only 'raw' primitive that would return a default dict. And even then, the default item will be benign most of the time (because it's an empty list or zero).
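For example, assuming the proposal (so the output here is hypothetical), summing the group counts of "abracadabra" still gives the right answer because the trailing default is 0:
+/#'="abracadabra" / 5 2 2 1 1 plus a default of 0
11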
Some more cases:
- #"ab"!1 2 3 is 3.
- If we take zero elements or drop #dict or more elements of a dict, the default remains. Override with , if you must.
- +/"ab"!1 2 3 is 6. You're losing the keys anyway, so convert to a list before summing if this bothers you.
- &"ab"!1 2 3 is "abb ".
The ngn/k test suite is once again a good place to look:
- In ngn's Euler p60, the group is indexed into, and outdexing occurs. While the result is still correct because issues with the prototypes are handled via a set intersection ({x^x^y} strips out any 0Ns), it's still handling more data than necessary. If group returned a default dict, it'd remove those 0Ns because outdexing would now return an empty list.
- In ngn's 'classify a region by its slope', we're answering questions about groups of points (specifically the max and min of concentric rings of numbers, calculated via |/' and &/' on the group result). ngn immediately strips the group results of their keys using .: and does a &/ to see whether a condition holds for all subsets. If .: kept the default element, it'd ruin the &/.
Default dicts are a handy way to give a function full domain coverage without explicitly containing all the points of that type - especially useful when it's not possible to specify all the points because there's an infinite number of them.
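For example (assuming the proposal, with vowel as a throwaway name), a five-key dict plus a default of 0 covers every possible character:
vowel:("aeiou"!1 1 1 1 1),0
vowel "abracadabra"
1 0 0 1 0 1 0 1 0 0 1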
But what if the input list already covers the full domain you intended? For example, let's say you want to group numbers by whether they are greater than 5, and you don't know which numbers they'll be or how many there will be (if any).
{x@=5<x} / not good if your code assumes there will be two categories!
{x@&'1~:\5<x} / better, but a lot longer
If group always left a default value on the end, that default would get in the way if the list already had 0s and 1s in it and your code assumed there were only two groups. Maybe we could recognise that the set of booleans is finite and omit the default if 0 and 1 are already present?
- Perhaps a reason to make booleans a separate type?
- It would be awkward if the list was ints but just happened to comprise booleans, but maybe int lists don't auto-squeeze down that far.
- Chars may also be a finite type, as could certain subsets like digits or upper/lowercase letters, or bytes (thanks coltim!).
So if there were a separate boolean type, it would mean:
=0#0b
(0#0b)!,!0 / default covers both 0 and 1. Still need to use 0 1# to force presence of two groups
=1 1b
(,1b)!(0 1;!0) / default covers 0
=1 0 1b
0 1b!(,1;0 2) / no default required - domain is already covered
So for finite types, the interpreter's code for group would need to know how many elements there could be in the domain, and if the count equals that, no default is needed.
Or if we really wanted to avoid the 0 1# 'force all' step, we could drop defaults entirely and instead have the group result's domain explicitly be a sorted list of all points of the type?
- This might create problems for 'unintended finite' types like chars, but maybe an (infinite) Unicode type would come to the rescue.
- It would speed up indexing: you could skip the 'find' step of range@domain? and just do indexing into the range.
=0#0b
0 1b!(!0;!0)
=1 1b
0 1b!(!0;0 1)
=1 0 1b
0 1b!(,1;0 2)
In ngn/k, catting a dict with a list or atom treats the dict as an atom and forms a list - arguably not useful. This proposal changes that because dict,atomOrList is how a default gets set. Slightly more useful is dict,table, but that's not as common as table,dict, so again, we don't lose much (and we can still do it with (,dict),table).
d:`a`b!1 2
d,3
(`a`b!1 2
3)
d,,3
(`a`b!1 2
3)
Make-dict ! doesn't do a strict length check anymore: the values list may be one element longer than the keys, with the extra element becoming the default.
This change has no impact on the way prototypes work for lists.
It might change how find works for some k dialects (see the efficiency section below).
From an efficiency perspective, this may be nicer than existing solutions/workarounds. The reasons depend on the dialect:
In dialects where a find miss is 0N, situations where a default value is needed are ordinarily handled via fill (default^x?y). But if there are misses, x?y may create an intermediate list with a wider type than it needs once filled (eg if it would otherwise all be small integers). Handling this via a default dict also saves an interpreter loop and a second pass of the data: find-then-fill becomes a single dict lookup.
These situations may also be handled via a dict merge - (alldomain!defaultval),exceptionsdict - but that is not possible if the domain is large or unknown. The other alternative is checking if the lookup values are in the domain before indexing, but that does two finds, so it would be nice if we had a better way.
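A small sketch of the difference (freq is a throwaway name; the last two lines assume the proposal):
freq:"ab"!5 2
0^freq "abz" / today: the outdex gives 0N, which needs a second pass to fill
5 2 0
freq:"ab"!5 2 0 / proposed: the default is part of the dict
freq "abz" / one lookup, no fill
5 2 0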
The biggest issue with 0N-returning dialects is that this proposal would change how dict lookups work - values@keys?lookups may not be the right model anymore.
In dialects where a find miss is #list, the status quo is a little better than if the miss returned 0N, since you can do (values,default)keys?lookups. However:
- that still has two interpreter steps (find-then-index), whereas dict lookup is one
- appending a default value to values may create a copy of the values list (maybe not an issue since values likely has refcount 1 anyway?)
- it removes the possibility that under the hood a dict lookup could do something different to indexing (eg hashing) that is faster.
But the biggest advantage is that if 'find miss' is already #x, you don't need to change anything about how dict indexing works to add default dicts: it just falls out naturally from values@keys?lookups.
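A sketch of why it falls out (this assumes a dialect where a find miss is #list, so the outputs below are not ngn/k's; vals is a throwaway name):
vals:1 2 0 / values with a default appended
"ab"?"abz" / the miss returns 2, ie #"ab"
0 1 2
vals@"ab"?"abz" / the miss lands exactly on the appended default
1 2 0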
- Should the default value be used for binary searches: dict'v where v is less than all values in dict?
On balance I think this is a good idea, particularly for k dialects where find-miss is #list.
I wonder why this shouldn't also work for lists, and perhaps the reason is that lists are much closer to collections than dicts are.
I like that this is just taking the existing (values,default)keys?lookups pattern and extending it to dicts, and it was a good exercise to clarify my understanding of how dicts should work.
It was interesting that default elements are useful when the dict is treated like a function, but are less useful when the dict is treated like a collection.
Still, I have a feeling that dicts sometimes enhance readability at the expense of speed.
I've been playing with a pure-k CSV parser. Not exactly new ground - here's one in ngn/k by DiscoDoug and another in BQN by Marshall - but I thought it would be a good learning experience.
One of the requirements of RFC 4180 is that:
- you can ignore row and field separators by wrapping a field's contents in ", and
- you can also include a " within an ignored region by putting another " before it.
So if you're moving left-to-right, you need to figure out whether a quote in a quoted region is ending that region, or if it's the escape character for another ".
We can do this via a state machine. With default dicts, a state machine keytable looks like this:
Note: this code does not work! It assumes default dicts are implemented.
/ Classes are rows.
/ States are cols: 0) _Q_uote begin
/                  1) _K_ept char (deactivated if field or row char)
/                  2) _E_scape (could be an escaped quote char OR end of quoted section)
/                  3) Other states (F,R,space)
{classes:",\n\""
 states:"QKE"
 sm:("KKFF"  / Field ,
     "KKRR"  / Row \n
     "EEKQ"  / Escape "
     "KK  ") / Other
 sm:states!/:sm / Default dicts! 3=#states, 4=#'sm
 sm:classes!sm  / Default dict! 3=#classes, 4=#sm
 " "{x[z;y]}[sm]\x}
Notice:
- the scan still captures state info about which Field and Row characters are active - the default dict allows these states to share the same transition column
- Quote begins and Kept chars also share the same transitions, but unfortunately dicts don't let you assign multiple keys to the same entry, so there's some unavoidable duplication. (Makes me yearn for Rust-style match.)
The problem is that while this is short and arguably clear, it's not as fast as it could be, primarily because we can't batch all of the dict lookups that result from indexing into the sm keytable. We can batch one step of indexing the state machine by indexing in with the characters (change the last line to " "{y x}\sm x), but that generates a table, so it's wasteful.
It's faster if we move away from dicts:
- we can batch the conversion of the input file into character classes: 3^classes?x. This isn't as wasteful because it's just an int list, not a full-blown table
- we can similarly convert the states into numbers, and flatten sm into a single list where the class is a jump and the state is an offset right. We couldn't flatten sm if it was a list of dicts because we'd have duplicate keys (though maybe the dict keys could be tuples...)
- the downside is that if we want to preserve info about field and row separators, we need to explicitly give them columns.
{classes:",\n\"" / FLD, ROW, ESC
 states:"FREQK " / Field, Row, Escape, Quote begin, Kept, other (space)
 sm:("FFFKKF"
     "RRRKKR"
     "QQKEEQ"
     "   KK ")
 sm:states?,/sm
 5(sm@+)\6*3^classes?x}
This cuts the run time by roughly a quarter.
We could simplify this a lot if we just cared about identifying escaped regions: strip sm to a two-row list (one for " and one for others), then combine it with field and row locations obtained from char=x to get the split locations (omitted below):
{states:"KQE " / Keep, Quote begin, Escape, other (space)
sm:("KK " / other char
"EEKQ") / quote char
sm:states?,/sm / flatten state machine
3(sm@+)\4*x="\""}
It might be faster if we replaced the find step with an index into a 128-element list of character classes. And there are faster approaches than using a state machine anyway. (Future work: compare the approach taken here to the others mentioned above.)
So while I don't think k dicts are a dead end, and they have lots of benefits (such as avoiding the creation of intermediate indices), they may not be appropriate in situations where it's better to do the keys? and values@ steps separately.
If you liked this, follow @kcode on Bluesky