Skip to content

Instantly share code, notes, and snippets.

@nathanmarz
Last active October 9, 2017 19:34
Show Gist options
  • Save nathanmarz/9d8805d5e7d9c689cd629bd0551e84f1 to your computer and use it in GitHub Desktop.
Save nathanmarz/9d8805d5e7d9c689cd629bd0551e84f1 to your computer and use it in GitHub Desktop.
Specter vs "first" function
Benchmark code: https://github.com/nathanmarz/specter/blob/master/scripts/benchmarks.clj#L176
----------------------------------------------------------------------------------------------
Benchmark: first value of a size 10 vector (10000000 iterations)
Avg(ms) vs best Code
261.98 1.00 (select-any FIRST data)
277.11 1.06 (select-any ALL data)
337.43 1.29 (select-first ALL data)
561.89 2.14 (first data)
@ztellman
Copy link

@puredanger that PersistentVector doesn't implement first natively seems to be a property of the implementation, not the data structure or underlying concept. With some effort, rest could also be amortized O(1) (see https://github.com/lacuna/bifurcan/blob/master/src/io/lacuna/bifurcan/List.java#L112 for an example). To be clear, I'm not advocating this, I know the inertia that surrounds this part of the Clojure implementation, I was just trying to remember if there was a more fundamental reason than that. Thank you for your quick response.

@puredanger
Copy link

@ztellman it's the amortized part of that that seems like the problematic part to me. I get what you're saying and it could easily be done - I presume that significant effort went into NOT doing that (and creating fast seqs over vectors instead) b/c amortized is not the same as actual O(1). I'm somewhat working backwards as (obviously) I wasn't around when these decisions were made.

@ztellman
Copy link

Well, it's only amortized in the sense that conj on a vector is amortized O(1): every 32nd element requires you to allocate a new buffer, put the old buffer onto the tree, etc. I think the real reason is that the RRB tree, which is what the above link demonstrates, wasn't around 10 years ago (the paper actually references Clojure's implementation). If there ever is a desire to update Clojure's data structures to reflect the last decade's worth of research, I'm happy to chat about it, but I understand that's probably not on the roadmap.

@puredanger
Copy link

I was talking about amortized rest actually as the most problematic part.

Not really a matter of interest, just prioritization. Rich has obviously invested a lot of time and interest in data structures and performance. :)

@ztellman
Copy link

Re: prioritization, understood. As far as the performance, I haven't looked at this code in a bit, but looks like the seq implementation is just using offsets to nth: https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/APersistentVector.java#L464. This means that first is always O(log n), and the entire collection stays in memory as you iterate over it.

In the alternate implementation I linked to, first is always O(1), since it peels off a 32-wide buffer from the front of the collection whenever there's an underflow. I acknowledge this doesn't make a big difference most of the time, but I spent a bunch of time weighing different approaches so I'm going to take this opportunity to be pedantic.

@nathanmarz
Copy link
Author

You can get the first element of a vector with optimal performance with (nth v 0), but first is the higher-level way of expressing that. Saying use a different data structure if you care about first is strange to me. This isn't like prepending an element where a list is clearly the superior data structure (I think that should be supported too but that's a totally separate discussion where we have major philosophical differences). This is an operation which the data structure is perfectly capable of doing extremely fast, as (nth v 0) shows.

I also question whether or not there's even a tradeoff here if implemented carefully. If first() is factored out of ISeq into its own interface, why can't vectors implement that? Are there concerns about inline caches becoming megamorphic?

@puredanger
Copy link

puredanger commented Aug 28, 2017

@ztellman: vectors are persistent collections and are always fully realized, so having it "stay in memory" is naturally part of the deal. I'm not arguing with you that first could be always O(1) on vectors. The point I made above is that the sequence abstraction concerns both first and rest and the latter is the one that seems harder to me.

@nathanmarz the last suggestion is interesting. would need some criteria and tests to evaluate more.

@maacl
Copy link

maacl commented Aug 30, 2017

@puredanger Thanks for the explanation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment