One way to control I/O for large Hadoop jobs, especially those whose mappers produce many records with relatively fewer distinct keys, is to introduce a combiner phase between the mapper and reducer. I have not seen a simple diagram explaining what the types should be for the combiner and what properties the algorithm must exhibit, so here goes. If your mapper extends Mapper< K1, V1, K2, V2 >
and your reducer extends Reducer< K2, V2, K3, V3 >
, then the combiner must be an extension of Reducer< K2, V2, K2, V2 >
, and the following diagram must commute:
The triangle in the center of the diagram represents distributivity of the combiner function (i.e., reduce
(k, combine
(k, M)) = reduce
(k, combine
(k, ∪icombine
(k,σi(M))) for any partition σ = {σi | i ∈ I} of M), because Hadoop does not guarantee how many times it will combine intermediate outputs, if at all.
A common pattern is to use the same function for combiner and reducer, but for this pattern to work, we must have K2 = K3
and V2 = V3
(and of course, the reducer itself must be distributive).
I should also mention that if you use a grouping comparator for your Reducer that is different from your sorting comparator, the above diagram is not correct. Unfortunately, and I'm pretty sure this is an outright bug in Hadoop, the sorting comparator is always used to group inputs for the Combiner's reduce()
calls (see MAPREDUCE-3310).