Please don't rely on these numbers, since they are too tied to the test environment. Instead, run this code in an environment similar to your production environment.
The goal is to find the optimal value for `bulkSize`, that is, how many operations (OPs) are better performed on the same thread than offloaded to another one. Since this depends on the size and the time required to execute the operation, we test it with the fastest operation we can measure: get the current time and store it in a variable.
First, we do it twice and take the delta, to learn how much time our OP takes.
We will collect the time before calling `parMap` and use it as EPOCH, instead of the system EPOCH.
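A minimal sketch of this measurement idea in Nim (variable names are illustrative, not the project's actual code):

```nim
import std/monotimes

# Measure the cheapest OP we have: getMonoTime itself.
# Two back-to-back samples give a rough delta for one OP.
let s0 = getMonoTime()
let s1 = getMonoTime()
echo "one OP takes ~", s1.ticks - s0.ticks, " ns"

# EPOCH: taken right before calling parMap, so every thread's
# timestamps are relative to the same starting point.
let epoch = getMonoTime()
# Inside each parallel element we would record:
#   getMonoTime().ticks - epoch.ticks
```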
We run 5 operations in 3 threads to make sure we cover the following scenarios:
- T0E0: First element in the first thread after EPOCH, initial scenario
- T0E1: Second element in the first thread after EPOCH, serial scenario
- T1E0: First element in the second thread after EPOCH, parallel scenario
- T1E1: Second element in the second thread after EPOCH, parallel serial scenario
- T2E0: First element in the third thread after EPOCH, odd parallel scenario
In 3.7% of runs T1E0 outperforms T0E1. While this is good, the average gap being greater than the average OP means our system got stuck on the first thread creation. We couldn't guarantee that this only happens to the first thread, and the percentage is not expressive, so the next observations are on the absolute value |T1E0 - T0E1|.
See the file `percentil.csv`.
At the ~75th percentile, T1E0 completed 1100ns (0.0011ms) after T0E1; in that window T0 could execute our OP 15 times when the OP takes 77ns, or 55 times when it takes 20ns.
At the ~50th percentile, T1E0 completed 800ns (0.0008ms) after T0E1; in that window T0 could execute our OP 11 times when the OP takes 77ns, or 40 times when it takes 20ns.
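The ops-per-gap numbers above come from simple division, using the values already measured (75th percentile shown):

```nim
import std/math

let gapNs = 1100.0        # 75th percentile gap between T1E0 and T0E1
echo ceil(gapNs / 77.0)   # ~15 OPs at 77ns each
echo ceil(gapNs / 20.0)   # 55 OPs at 20ns each
```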
For `bulkSize` we have the following scenarios:
- len(OPs)/no new thread, serial monothread
- len(OPs)/one new thread, serial offload
- len(OPs)/two new threads, parallel minimum
- len(OPs) new threads, parallel maximum
- Something between maximum and minimum
1º means we would not use `parMap` at all, which makes sense at the 75th percentile for len(OPs) < 40: with `bulkSize` < 40 it would be better to run in serial monothread. Update: in another test, the 75th percentile showed ~10,000 nanoseconds to start the first job in the first thread, which means serial monothread can do len(OPs) < ~250.
2º means we only care that it won't block the current thread; I'm not sure the current implementation can take advantage of this, and we didn't collect data to measure how fast it could be compared to staying monothread. Update: another test puts the cost of using 1 thread at ~10,000 nanoseconds (75th percentile).
3º, taking 1º as input, only makes sense for `bulkSize` > 40.
4º has the bottleneck of the processor's max threads, 16 in the test environment, though some production HPC machines can have 256 threads. Anyway, per 1º we affirm it would not be efficient at much less than 40 ops/thread.
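For reference, the machine's thread count can be queried from Nim's standard library:

```nim
import std/cpuinfo

# 16 in the test environment; HPC machines may report 256.
echo countProcessors()
```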
5º, respecting CPU limits and efficiency limits, would be the largest number between `len(OPs)/(40 - (len(OPs) mod 40)) + 40` and `len(OPs)/(len(cputhreads) - (len(OPs) mod len(cputhreads))) + len(cputhreads)`.
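A sketch of the 5º heuristic, taking the two formulas above verbatim (the proc name is hypothetical, and I'm reading integer division into the formulas as written; the formulas themselves may not be in their final form):

```nim
# Hypothetical helper: pick the largest of the two bounds from 5º.
# Note neither divisor can be zero, since (x mod m) < m.
proc suggestedBulkSize(ops, cpuThreads: int): int =
  let byEfficiency = ops div (40 - (ops mod 40)) + 40
  let byCpu = ops div (cpuThreads - (ops mod cpuThreads)) + cpuThreads
  max(byEfficiency, byCpu)

echo suggestedBulkSize(1000, 16)
```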
Still, everything depends on the size of the OP: there are operations faster than `getMonoTime` and there are slower ones.
As PR 18 vs PR 19 shows, the cost of splitting tasks decreased a lot. From PR 21 and its data I could conclude a cost of ~400ms, which is around the 50th percentile of PR 19. That reduces my recommendation from ~40x to ~20x.
I did this using the cost of a `getMonoTime` call, ~25ns, but there are reference values for other operations:
Latency Numbers Every Programmer Should Know
I would expect math operations to take ~1ns and RAM operations to take ~100ns (currently less); anything 4 times slower than this may benefit from a smaller `bulkSize`.
In a different benchmark we try to measure the actual cost of thread-related operations.
See the results
4 scenarios (a measurement sketch follows the values list):
- 1º: new lock/cond, operated N times
- 2º: new lock/cond, reused N times, with a sleep to clear the CPU cache
- 3º: new lock/cond, reused N/10 times
- 4º: new lock/cond, reused N/10 times, with a sleep to clear the CPU cache
Values:
- case: scenario
- op: max precision in nanoseconds
- lock: time required to acquire lock
- rele: time required to release the lock
- tryT: time required to tryAcquire lock with success
- tryF: time required to tryAcquire lock with failure
- sign: time required to send signal
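A minimal sketch of how the columns above could be measured with Nim's `std/locks` (illustrative, not the benchmark's actual code):

```nim
import std/[locks, monotimes]

var lock: Lock
var cond: Cond
initLock(lock)
initCond(cond)

template timeIt(label: string; body: untyped) =
  let t0 = getMonoTime()
  body
  echo label, ",", getMonoTime().ticks - t0.ticks  # nanoseconds

timeIt("lock"): acquire(lock)             # "lock" column
timeIt("rele"): release(lock)             # "rele" column
timeIt("tryT"): discard tryAcquire(lock)  # succeeds -> "tryT" column
timeIt("tryF"): discard tryAcquire(lock)  # already held -> "tryF" column
timeIt("sign"): signal(cond)              # "sign" column
release(lock)                             # release the tryT acquisition

deinitCond(cond)
deinitLock(lock)
```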
Flaws:
- We need a less invasive, better way to clear the CPU cache, though a real scenario could indeed come after a sleep or a long wait.
- Since this is a monothread test, the values may be affected by CPU affinity.
I'm not sure I could measure `wait` or a contended `acquire`, but I can say that, with cache, operations are cheap even for new ones (4º).