-
Right now we need to compute histogram data on large files, with cache support. This works well by having a reducer process aggregating results from mappers. It works because we have one type of reducer and one type of mapper, with a minimum map/reduce ratio among the workers.
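A minimal sketch of the current setup, assuming mappers each histogram one chunk and a single reducer merges the partial counts (names like `map_histogram` and the bin parameters are illustrative, not from the actual system):

```python
from collections import Counter

BINS = 10  # assumed bin count for illustration

def map_histogram(chunk, lo=0, hi=100):
    """Mapper: histogram the values of one chunk into BINS buckets."""
    width = (hi - lo) / BINS
    h = Counter()
    for v in chunk:
        idx = min(int((v - lo) / width), BINS - 1)  # clamp top edge
        h[idx] += 1
    return h

def reduce_histograms(partials):
    """Reducer: merge per-chunk histograms into one global histogram."""
    total = Counter()
    for h in partials:
        total.update(h)  # Counter.update adds counts
    return total

# one reducer aggregating two mapper outputs
chunks = [[5, 15, 15], [95, 50]]
result = reduce_histograms(map_histogram(c) for c in chunks)
```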
-
What happens if in the future we need something like A <- B <- C? This requires at least three processes, two of them waiting. We need to make sure we can never have all processes waiting and none computing. How do we provide a guarantee for this?
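One possible (illustrative, not decided) guarantee: before scheduling a dependency chain, check that at least one worker remains free to do leaf computation, so the waiting stages can never consume every worker:

```python
def can_schedule(chain_depth, total_workers):
    """Sufficient condition sketch (assumption, not the actual scheduler):
    a chain of `chain_depth` waiting stages occupies that many workers,
    so at least one extra worker must be left to compute leaf tasks."""
    return total_workers > chain_depth
```

With A <- B <- C (two waiting stages), three workers pass the check; with only two workers the chain would be refused rather than risk everyone waiting.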
-
Can a reducer provide a list of its requirements beforehand? Does that solve our problem? It doesn't, since the reducer will still need to be aggregating results while it waits. NOT A SOLUTION
-
Another option is to not have reducers do nothing while waiting: they should compute the things they need themselves. At least this way, even if no one is helping, work still gets done.
Careful not to accumulate too much incoming data while doing so. We should keep a stack of tasks being processed. Do we need to check incoming messages for task N+1 when checking notifications for task N? -> Yes, because we need to know about cancellations. Also recursive cancellations, WTF. My brain hurts.
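A hedged sketch of such a reducer loop, assuming a LIFO stack of pending sub-tasks and a notification queue that can carry both partial results and cancellations (all the names and the message format are made up for illustration):

```python
import queue

def reducer_loop(pending, notifications, merge, compute):
    """pending: stack (list) of sub-tasks the reducer can compute itself.
    notifications: queue of ("result", value) / ("cancel", task) messages.
    merge(acc, value) folds a value into the accumulator (acc starts None).
    compute(task) does the work locally instead of idling."""
    acc = None
    while pending:
        # Drain notifications first: a cancellation may invalidate tasks
        # still sitting on the stack (the recursive-cancellation case).
        try:
            while True:
                msg = notifications.get_nowait()
                if msg[0] == "cancel":
                    pending[:] = [t for t in pending if t != msg[1]]
                elif msg[0] == "result":
                    acc = merge(acc, msg[1])
        except queue.Empty:
            pass
        if pending:
            task = pending.pop()  # LIFO: most recently pushed first
            acc = merge(acc, compute(task))
    return acc

# example: tasks 1,2,3 pending; task 2 gets cancelled and a partial
# result 10 arrives from another worker
inbox = queue.Queue()
inbox.put(("cancel", 2))
inbox.put(("result", 10))
total = reducer_loop([1, 2, 3], inbox,
                     lambda a, b: (a or 0) + b,  # merge = sum
                     lambda t: t)                # compute = identity
```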
The cache serves to accelerate computations by re-using past results. The problem we have is that we do not expect identical requests to be issued multiple times. Therefore it is pointless to cache those results exactly.
However, because most requests can, and will, be split up into several chunks, we can cache the results for those, since they can be the same if we make them the same (e.g. align them on block boundaries and give them the same size).
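A sketch of the block-alignment idea: split a byte range on block boundaries so that interior chunks are identical for any overlapping request and therefore cacheable (`BLOCK` is an assumed block size, not from these notes):

```python
BLOCK = 4096  # assumed block size for illustration

def aligned_chunks(start, end):
    """Yield (lo, hi) sub-ranges of [start, end) cut on BLOCK boundaries.
    Interior chunks land on the same boundaries for any request that
    covers them, so their cached results can be shared."""
    lo = start
    while lo < end:
        hi = min((lo // BLOCK + 1) * BLOCK, end)
        yield (lo, hi)
        lo = hi
```

Only the first and last chunks of a request are ragged; everything in between hits the shared cache keys.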
If a request is smaller than the block size it won't profit from this. On the other hand, if a request is way bigger, there will be a lot of aggregation to do.
We probably want something like: for a request, try to split it into e.g. 10 pieces, re-split each of those into 10, etc. This way it should generate caches suitable for the kinds of requests made at those locations. The question is how many times we should split recursively. Two options: (1) until we reach a min-size for the mappers, or (2) a fixed number of times.
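A sketch of option (1), assuming a fan-out of 10 and a minimum mapper size (`MIN_SIZE` and `FANOUT` are illustrative parameters); each level of the recursion is a set of candidate cache entries:

```python
MIN_SIZE = 1000  # assumed minimum chunk size for a mapper
FANOUT = 10      # assumed split factor per level

def split_levels(start, end):
    """Recursively split [start, end) into FANOUT pieces per level,
    stopping once children would fall below MIN_SIZE.
    Returns the list of levels, outermost split first."""
    levels = []
    spans = [(start, end)]
    while True:
        next_spans = []
        for lo, hi in spans:
            if hi - lo <= MIN_SIZE * FANOUT:
                continue  # children would be below min-size: stop here
            step = (hi - lo) // FANOUT
            next_spans.extend(
                (lo + i * step,
                 hi if i == FANOUT - 1 else lo + (i + 1) * step)
                for i in range(FANOUT)
            )
        if not next_spans:
            break
        levels.append(next_spans)
        spans = next_spans
    return levels
```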
-
Problem with (1) is that it generates a lot of cache files and takes a huge amount of storage space.
-
Option (2) generates less re-usable cache (for smaller requests), and if we had an unlimited amount of workers it wouldn't scale optimally.
Maybe a solution would be (1) with an option not to commit results to disk, while still splitting work for the workers down to min-size. But then we lose time in the brokers. So maybe a mix between (1) and (2).