Friday, August 05, 2011

Requirements for using a Hadoop combiner

One way to control I/O for large Hadoop jobs, especially those whose mappers produce many records with relatively few distinct keys, is to introduce a combiner phase between the mapper and reducer. I have not seen a simple diagram explaining what the types should be for the combiner and what properties the algorithm must exhibit, so here goes. If your mapper extends Mapper< K1, V1, K2, V2 > and your reducer extends Reducer< K2, V2, K3, V3 >, then the combiner must be an extension of Reducer< K2, V2, K2, V2 >, and the following diagram must commute:

The triangle in the center of the diagram represents distributivity of the combiner function, i.e., reduce(k, combine(k, M)) = reduce(k, combine(k, ∪_i combine(k, σ_i(M)))) for any partition σ = {σ_i | i ∈ I} of M. This property is required because Hadoop does not guarantee how many times it will combine intermediate outputs, if at all.

A common pattern is to use the same function for combiner and reducer, but for this pattern to work, we must have K2 = K3 and V2 = V3 (and of course, the reducer itself must be distributive).
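To make the distributivity requirement concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class and method names (CombinerSketch, sum, mean, combineEachChunk) are made up for illustration, and the lists stand in for the values Hadoop would hand to reduce() for a single key. It checks a simplified version of the condition above: reducing all values directly should give the same answer as reducing the per-chunk combiner outputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for combiner/reducer logic; not Hadoop API code.
public class CombinerSketch {

    // Sum: the result does not depend on how the values are chunked,
    // so the same function can serve as combiner and reducer.
    public static Long sum(List<Long> values) {
        long total = 0L;
        for (Long v : values) {
            total += v;
        }
        return total;
    }

    // Mean: fine as a final reducer, but not safe to reuse as a combiner,
    // because a mean of partial means weights the chunks incorrectly.
    public static Double mean(List<Double> values) {
        double total = 0.0;
        for (Double v : values) {
            total += v;
        }
        return total / values.size();
    }

    // Simulates one combiner pass: apply the combiner to each chunk of the
    // values for a single key and collect the partial results.
    public static <V> List<V> combineEachChunk(List<List<V>> chunks,
                                               Function<List<V>, V> combiner) {
        List<V> partials = new ArrayList<>();
        for (List<V> chunk : chunks) {
            partials.add(combiner.apply(chunk));
        }
        return partials;
    }

    public static void main(String[] args) {
        List<Long> all = Arrays.asList(1L, 2L, 3L, 4L, 5L);
        List<List<Long>> chunks = Arrays.asList(
                Arrays.asList(1L, 2L), Arrays.asList(3L), Arrays.asList(4L, 5L));
        System.out.println(sum(all));                                            // 15
        System.out.println(sum(combineEachChunk(chunks, CombinerSketch::sum)));  // 15

        List<Double> allD = Arrays.asList(1.0, 2.0, 3.0, 4.0, 5.0);
        List<List<Double>> chunksD = Arrays.asList(
                Arrays.asList(1.0, 2.0), Arrays.asList(3.0, 4.0, 5.0));
        System.out.println(mean(allD));                                            // 3.0
        System.out.println(mean(combineEachChunk(chunksD, CombinerSketch::mean))); // 2.75
    }
}
```

The sum survives any chunking, while the mean does not. A mean job would instead need a combiner that emits something like (sum, count) pairs as its V2, with the division left to the reducer.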

I should also mention that if you use a grouping comparator for your Reducer that is different from your sorting comparator, the above diagram is not correct. Unfortunately, and I'm pretty sure this is an outright bug in Hadoop, the sorting comparator is always used to group inputs for the Combiner's reduce() calls (see MAPREDUCE-3310).

Thursday, June 16, 2011

Escaping property placeholders in Spring XML config

The problem

You might have encountered the awkward situation in which you are
  1. using Spring and XML config
  2. substituting properties into that config via some PropertyPlaceholderConfigurer
  3. needing to set some value in the config to a literal string of the form "${identifier}"

By default, any string of the form in item 3 above is treated as a placeholder, and if you have no value for that placeholder, you get an exception. Spring JIRA issue SPR-4953, which recognizes the fact that there is no simple escaping syntax for placeholders, is still open as of this writing.

A snippet such as the following will cause the exception if there is no value available to substitute for the variable customerName, or actually substitute a value for it if it is available. Neither result is desirable in our scenario; we want the "${customerName}" to remain intact when it is injected into our bean.
<bean id="aTrickyBean" class="org.anic.veggies.AreGoodForYou">
    <constructor-arg name="expression" value="Hello, ${customerName}!"/>
</bean>

Most workarounds I have seen are unsatisfactory. For example, you can use a customized placeholder configurer that sets its delimiter characters to something other than the default, but then you would have to change the look of all the actual (unescaped) placeholders just to support the ones you want escaped.

The workaround

However, in Spring 3.x, you can work around this issue in a much simpler way using the following trick with SpEL:
<bean id="aTrickyBean" class="org.anic.veggies.AreGoodForYou">
    <constructor-arg name="expression" value="#{ 'Hello, $' + '{customerName}!' }"/>
</bean>
Note that in order for this trick to work it is vital that the '$' and the '{' be physically separated (in this case, on either side of a string concatenation).
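A rough way to see why the separation matters: as far as I understand it, the placeholder configurer scans the raw attribute value for "${...}" before the SpEL expression is evaluated, so the trick only has to hide the "${" sequence from that scan. The sketch below uses a plain regular expression as a stand-in for that scan. It is not Spring's actual parser, and the class and method names (PlaceholderScanSketch, containsPlaceholder) are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; Spring's real placeholder parsing is more involved.
public class PlaceholderScanSketch {

    // Regex stand-in for the default "${...}" placeholder syntax.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Returns true if the raw config value would look like it contains a placeholder.
    public static boolean containsPlaceholder(String rawValue) {
        return PLACEHOLDER.matcher(rawValue).find();
    }

    public static void main(String[] args) {
        String plain = "Hello, ${customerName}!";
        String spel = "#{ 'Hello, $' + '{customerName}!' }";

        System.out.println(containsPlaceholder(plain)); // true
        System.out.println(containsPlaceholder(spel));  // false

        // What the SpEL concatenation yields once it is evaluated.
        String evaluated = "Hello, $" + "{customerName}!";
        System.out.println(evaluated); // Hello, ${customerName}!
    }
}
```

The raw SpEL string never contains "${", so the scan leaves it alone, and the literal only appears after the concatenation has been evaluated.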