-
-
Save nathanmarz/9d8805d5e7d9c689cd629bd0551e84f1 to your computer and use it in GitHub Desktop.
Benchmark code: https://github.com/nathanmarz/specter/blob/master/scripts/benchmarks.clj#L176 | |
---------------------------------------------------------------------------------------------- | |
Benchmark: first value of a size 10 vector (10000000 iterations) | |
Avg(ms) vs best Code | |
261.98 1.00 (select-any FIRST data) | |
277.11 1.06 (select-any ALL data) | |
337.43 1.29 (select-first ALL data) | |
561.89 2.14 (first data) |
I'm trying to remember why APersistentVector
doesn't implement ISeq
directly. Is conj
guaranteed to add to the front of the collection if seq?
is true?
I might be missing something here but doesn't the above show that first
could be made fast for vectors, and if so why wouldn't one want that?
@ztellman vectors aren't sequences - they don't natively implement first
and rest
with the same perf characteristics expected (in particular rest
seems like the problematic one). they are seqable and can provide a logical sequence view when requested.
Regarding your assertion, not sure - maybe? I'd say cons
is the append op for sequences and conj
is a polymorphic data structure append op. Not sure if those things connect tightly enough to answer your assertion.
@maacl doing so would require breaking the abstraction in the implementation. There are several reasons that is bad. One is that it ruins the integrity of those abstractions (the Clojure runtime, and core library, and even the compiler to a large degree) work only in terms of the abstractions (like "persistent collection" or "persistent vector" or "sequence"), not in terms of concrete collection types. This is hugely beneficial when you want to create your own data structures and expect them to work in core as if they were provided data structures. Additionally, special-casing things like this would require adding conditional checks in the implementation which would make those ops slower for all data structures.
The design in Clojure differs philosophically from the collections design in other languages (Java is a good comparison). Rather than making a generic set of ops and implementing them in any way possible on each collection implementation, Clojure instead sets performance expectations for each op and sticks to those expectations, either by not providing them at all (linear search in sequential collections being the poster child) or by providing them via abstractions (like sequence) that commit only to what you would expect from that abstraction. This has the down side of requiring you to think more about which data structure is the best choice for the operations you will need to perform. It has the upside of guiding you towards choosing the best data structure and letting you program to known performance expectations once you start paying attention to this stuff. Having used both kinds of systems extensively, I vastly prefer Clojure's approach.
@puredanger that PersistentVector
doesn't implement first
natively seems to be a property of the implementation, not the data structure or underlying concept. With some effort, rest
could also be amortized O(1) (see https://github.com/lacuna/bifurcan/blob/master/src/io/lacuna/bifurcan/List.java#L112 for an example). To be clear, I'm not advocating this, I know the inertia that surrounds this part of the Clojure implementation, I was just trying to remember if there was a more fundamental reason than that. Thank you for your quick response.
@ztellman it's the amortized part of that that seems like the problematic part to me. I get what you're saying and it could easily be done - I presume that significant effort went into NOT doing that (and creating fast seqs over vectors instead) b/c amortized is not the same as actual O(1). I'm somewhat working backwards as (obviously) I wasn't around when these decisions were made.
Well, it's only amortized in the sense that conj
on a vector is amortized O(1): every 32nd element requires you to allocate a new buffer, put the old buffer onto the tree, etc. I think the real reason is that the RRB tree, which is what the above link demonstrates, wasn't around 10 years ago (the paper actually references Clojure's implementation). If there ever is a desire to update Clojure's data structures to reflect the last decade's worth of research, I'm happy to chat about it, but I understand that's probably not on the roadmap.
I was talking about amortized rest
actually as the most problematic part.
Not really a matter of interest, just prioritization. Rich has obviously invested a lot of time and interest in data structures and performance. :)
Re: prioritization, understood. As far as the performance, I haven't looked at this code in a bit, but looks like the seq implementation is just using offsets to nth
: https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/APersistentVector.java#L464. This means that first
is always O(log n), and the entire collection stays in memory as you iterate over it.
In the alternate implementation I linked to, first
is always O(1), since it peels off a 32-wide buffer from the front of the collection whenever there's an underflow. I acknowledge this doesn't make a big difference most of the time, but I spent a bunch of time weighing different approaches so I'm going to take this opportunity to be pedantic.
You can get the first element of a vector with optimal performance with (nth v 0)
, but first
is the higher-level way of expressing that. Saying use a different data structure if you care about first
is strange to me. This isn't like prepending an element where a list is clearly the superior data structure (I think that should be supported too but that's a totally separate discussion where we have major philosophical differences). This is an operation which the data structure is perfectly capable of doing extremely fast, as (nth v 0)
shows.
I also question whether or not there's even a tradeoff here if implemented carefully. If first()
is factored out of ISeq
into its own interface, why can't vectors implement that? Are there concerns about inline caches becoming megamorphic?
@ztellman: vectors are persistent collections and are always fully realized, so having it "stay in memory" is naturally part of the deal. I'm not arguing with you that first
could be always O(1) on vectors. The point I made above is that the sequence abstraction concerns both first
and rest
and the latter is the one that seems harder to me.
@nathanmarz the last suggestion is interesting. would need some criteria and tests to evaluate more.
@puredanger Thanks for the explanation.
@reborg this is for 10 million iterations. With criterium:
In other words, this is 48.5 ns/invocation.
first
is a sequence function that operates on the seq abstraction. It implies to me as a Clojure user that the passed data structure will be used to produce a sequence, then take the first item.However, vectors are indexed data structures and if you are using them that way, things are much faster:
Vectors are designed with a tail insertion point - accessing them at the end is fast (not the beginning):
If you want
first
to be as fast as possible, use a data structure that is designed for it, like a list:The problem here is not that
first
is slow, the problem is that vector is the wrong data structure if you care aboutfirst
.all timings done on a 2013 MBP, 2.7 Ghz Intel Core i7