@mgpradeepa
Last active August 7, 2018 04:42
Think on parallelism

In the distributed world, while writing map-reduce code, there are many situations where the input data does not seem to partition well. In such cases, even though the data is picked up by multiple mappers, it all gets mapped to the same key. Once the mappers have run, if we end up with one key carrying a huge list of values, we burden the reducers. By burdening the reducers I mean burdening every step after the map phase, up until the data enters the reducer nodes.

Let's take an example to explore this situation and see how it can be resolved.

Challenge: finding the average of n natural numbers.

Bird-view:

At first glance this sounds simple. But consider the complexity when every number in the given input file is the same number.

Sparrow-view:

There would be one key, say the number itself, and the list of values would be n 1's.

Implementation-view: While copying the data from the mapper nodes to the reducer nodes, there would be one key with every value appended to that key's list. Shuffle and sort, since they operate on keys, may not take much time, but the copy takes very long. And even if you have the time to let it copy, infrastructural issues can occur; never forget that in the Hadoop stack we run on cheap commodity hardware.

Let's see how the problem of finding the average of natural numbers can be addressed.

  1. Create custom keys deduced from the data, i.e. the numbers that are already available as Text in the map function. For simplicity I deduce the keys as shown in the code below, inside map(....).

    // class-level field: track cycles through 1..11 across map(...) calls
    static int track = 1;

    // inside map(...):
    if (track > 11) {
        track = 1;
    } else {
        track++;
    }

    // derive a synthetic key (0..10) from the number itself
    Integer convKey = Integer.parseInt(val) % track;
    outKey.set(String.valueOf(convKey));
    context.write(outKey, value);

With the above code, we have solved the problem of one key getting associated with all the values.
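
To make this concrete, here is a minimal, self-contained sketch of how the whole map side might look. The class name AverageMapper, the field names, and the assumption that each input line carries exactly one number are mine for illustration; check the linked repository for the actual code.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: spreads the numbers across up to 11 synthetic keys.
    public class AverageMapper extends Mapper<LongWritable, Text, Text, Text> {

        // cycles through 1..11 across map() calls
        private static int track = 1;
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String val = value.toString().trim();
            if (val.isEmpty()) {
                return; // skip blank lines
            }
            if (track > 11) {
                track = 1;
            } else {
                track++;
            }
            // derive a synthetic key (0..10) from the number itself
            int convKey = Integer.parseInt(val) % track;
            outKey.set(String.valueOf(convKey));
            context.write(outKey, value);
        }
    }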

To make the Reducer's job easier, let's do part of the work in the Combiner. Since there is no assurance that the Combiner will execute, what we write in the Combiner is simply a partial implementation of the Reducer.

The Combiner code:

	// partial aggregation for one key: sum the values and count them
	Double sum = 0D;
	Integer count = 0;
	Iterator<Text> itr = values.iterator();
	while (itr.hasNext()) {
		Double val = Double.parseDouble(itr.next().toString());
		sum += val;
		count++;
	}

	// collapse every key into the single key "avg", carrying the
	// partial sum and count as "sum_count"
	context.write(new Text("avg"), new Text(sum + "_" + count));

With the above code, the Combiner reduces the 11 keys to the single key 'avg'. The values are also already condensed, since the partial sum and count are computed at this stage. For example, a Combiner call that sees the values 5, 5 and 5 emits the single record ('avg', '15.0_3').
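
For reference, here is a sketch of that Combiner as a complete class: in Hadoop a combiner is simply a Reducer subclass, and the class name AverageCombiner is a placeholder of mine.

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical combiner: partial sums per synthetic key, all re-keyed to "avg".
    public class AverageCombiner extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0D;
            int count = 0;
            for (Text value : values) {
                sum += Double.parseDouble(value.toString());
                count++;
            }
            // emit the partial sum and count as "sum_count" under one key
            context.write(new Text("avg"), new Text(sum + "_" + count));
        }
    }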

In the Reducer the code becomes quite concise: we only need to work on the values. The real aggregation happens in the Reducer.

Reducer code:

	// aggregate the partial "sum_count" pairs emitted by the combiners
	Double sum = 0D;
	Integer count = 0;
	Iterator<Text> itr = values.iterator();
	while (itr.hasNext()) {
		String[] inter = itr.next().toString().split("_");
		sum += Double.parseDouble(inter[0]);
		count += Integer.parseInt(inter[1]);
		log.info(sum + " " + count);
	}

	// final average over all partial sums and counts
	Double average = sum / count;
	context.write(outValue, new Text(average.toString()));

With this we are able to find the average of n numbers efficiently.
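
To tie the pieces together, here is a sketch of a driver under the same assumptions (the class names AverageMapper, AverageCombiner and AverageReducer are placeholders for whatever the repository actually uses). The key point is that the Combiner is only a hint: Hadoop may run it zero, one or several times per map output.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AverageDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "average-of-numbers");
            job.setJarByClass(AverageDriver.class);

            job.setMapperClass(AverageMapper.class);
            // optional optimization: Hadoop decides when (and whether) to run it
            job.setCombinerClass(AverageCombiner.class);
            job.setReducerClass(AverageReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }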

Conclusion:

Achievements of this approach:

  1. Not too much burden in the copy phase from the local disk (after the map phase) to the reducer nodes.
  2. If issues occur during the data transfer, speculative execution has to happen; with far less data to copy, that is much less likely.
  3. When the data is huge, multiple copy phases get initiated; with this approach they are reduced to a large extent.

It is always good to look at a challenge in the Hadoop stack with the following considerations:

  1. Though we use cheap commodity hardware, we need to utilize it efficiently.
  2. Sometimes the intended challenge seems simple, but think about it from the design perspective of the code, the infrastructure, and efficiency.

For complete code check:

https://github.com/mgpradeepa/hadoop-mgp-mr/tree/master/AON/AverageOfNumbers
