Thursday, November 11, 2010

Random notes on Hadoop

I am talking about Hadoop 0.20 using a custom jar, not streaming or Hive or Pig.
  1. Make sure your distribution has the MAPREDUCE-1182 patch.
  2. Make sure you change the default setting of dfs.datanode.max.xcievers to something very large, like 4096. And yes, the property name is misspelled. In 0.22/0.23 the property will be called dfs.datanode.max.transfer.threads.
  3. If you've declared your key type to be T, you can't write an S, even if S is a subclass of T.

  4. There are several ways to get your dependencies visible to Hadoop tasks and tools, and they are all clunky.
    • You can bundle them all into the job jar in a folder called lib, although doing so precludes keeping custom input formats, output formats, mappers, and reducers in separate jars.
    • You can use the -libjars argument, but if you ever have to load a class via reflection (e.g., using Class.forName), you have to use Thread.currentThread().getContextClassLoader() rather than the default class loader. Also you might run into HADOOP-6103.
    • You can use DistributedCache.addFileToClassPath, but you have to be sure to put the file on HDFS and refer to it by its absolute pathname without a scheme or authority, and these files are only available to the tasks, not the tool/job.
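    The reflection caveat for -libjars can be sketched in plain Java (class and method names here are hypothetical, for illustration only):

```java
public class CtxLoad {
    // Load a class by name through the thread's context class loader,
    // which is the loader that can see -libjars entries in a Hadoop task.
    // Plain Class.forName( name ) uses the caller's defining loader instead,
    // and will throw ClassNotFoundException for -libjars classes.
    static Class< ? > load( String name ) throws ClassNotFoundException {
        return Class.forName( name, true,
                Thread.currentThread().getContextClassLoader() );
    }

    public static void main( String[] args ) throws Exception {
        // Outside Hadoop both loaders see the JDK, so this only demonstrates the call.
        System.out.println( load( "java.util.ArrayList" ).getName() );
    }
}
```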

  5. DistributedCache is just plain wonky. You must
    1. put your file on HDFS somehow
    2. call DistributedCache.addCacheFile(), using the full URI with "hdfs:///"
    3. in the mapper/reducer, use java.io.* APIs to access the files represented by the paths in DistributedCache.getLocalCacheFiles(). Incredibly, the "typical usage" example in the javadocs for DistributedCache just completely elides this crucial bit. If you try to use FileSystem.get().open(), you'll get a cryptic error message with a filename that looks like it's been mangled.
      I can't find a programmatic mapping between files added via addCacheFile() and paths retrieved by getLocalCacheFiles(), although there may be some support for sharing the name or specifying it with the "hdfs://source#target" syntax. None of this API is well-documented.
  6. You can't substitute Hadoop counters for actual aggregation in your reducer, tempting as that might be. Counters will differ from run to run, even against identical inputs, because of things that vary like speculative execution and task failures.
  7. If you configure your cluster to have 0 reduce slots (perhaps because your jobs are all map-only), and you accidentally submit a job that does require a reduce phase, that job will run all mappers to completion and then hang forever.

Sunday, October 24, 2010

10 Things I Hate About Java (or, Scala is the Way and the Light)

I've been working with Java quite extensively for about 4 years now, and it has been enjoyable for the most part. Garbage collection, the JVM, generics, anonymous classes, and superb IDE support have made my life much easier.

But a few things make me gnash my teeth on a daily basis, and funnily enough, none of them is an issue in another JVM language I have been dabbling in: Scala. It seems the developers of that language felt my pain as well.

  1. Miserable type inference. Apparently some of the problems with it are being addressed in Project Coin for Java 7. The repeated type arguments in the following code are, to any sane programmer, maddeningly superfluous, but nevertheless strictly required in Java until at least mid-2011:

    List< Integer > li1 = new ArrayList< Integer >();
    List< Integer > li2 = Arrays.asList( 1, 2, 3 );
    o.processListOfInteger( Arrays.< Integer >asList() );

    Needless to say, equivalent initializations in Scala require no such redundant information.
  2. Generic invariance. An example from last week: I'm working on an implementation FrazzleExecutorService of java.util.concurrent.ScheduledExecutorService and a refinement FrazzleFuture< T > of java.util.concurrent.ScheduledFuture< T >. Covariant return types let me get away with returning a FrazzleFuture< T > from a method like FrazzleExecutorService.submit() without violating the contract. But I can't return List< FrazzleFuture< T > > from FrazzleExecutorService.invokeAll(), because generic types like List< T > are invariant in their parameters; to permit it, the interface would have had to declare the return type List< ? extends ScheduledFuture< T > > (and the returned list should have been immutable anyway). In Scala, there are separate mutable and immutable collections hierarchies, and at least in the immutable one, S <: T implies List[ S ] <: List[ T ], because List is declared covariant in its parameter.
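    A minimal illustration of the invariance (all names here are hypothetical):

```java
import java.util.Arrays;
import java.util.List;

public class Invariance {
    // List< Integer > is not a List< Number >, even though Integer is a Number;
    // only a wildcard lets a method accept "a list of some subtype of Number".
    static double sum( List< ? extends Number > ln ) {
        double total = 0.0;
        for ( Number n : ln )
            total += n.doubleValue();
        return total;
    }

    public static void main( String[] args ) {
        List< Integer > li = Arrays.asList( 1, 2, 3 );
        // sum( li ) compiles only thanks to "? extends Number";
        // a parameter declared List< Number > would reject li outright.
        System.out.println( sum( li ) );   // 6.0
    }
}
```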
  3. Collections can't be tersely initialized. Part of the blame is the Collections framework; part of it is the goddamn language. The following code illustrates the typical verbosity:

    List< Integer > li1 = new ArrayList< Integer >( Arrays.asList( 1, 2, 3 ) );
    @SuppressWarnings( "serial" )
    Map< String, String > mss1 = new HashMap< String, String >() { {
        put( "foo", "bar" );
        put( "this", "sucks" );
    } };

    It turns out collection literals are not supported in Scala either; you have to type List( 1, 2, 3 ) or Map( "foo" -> "bar", "this" -> "rocks" ). Excuse me if I mock Java incessantly at this point. Collection improvements have been postponed until Java 8, scheduled for mid-2012.
  4. Higher-order programming is absurdly verbose. Rather than give code samples, I'll just refer you to these guys and let you see for yourself how even a library can't save you from massive boilerplate for the simplest things. And Scala? Lambda expressions are built-in as syntactic sugar for functional objects, making higher-order code simple, readable, and terse.
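    For a flavor of the boilerplate, here is pre-8 Java sorting strings by length: one line of actual logic buried in an anonymous class (a sketch; the names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class HigherOrder {
    // One line of logic ( compare by length ) costs an entire anonymous
    // Comparator class; the Scala equivalent is words.sortBy( _.length ).
    static List< String > byLength( List< String > words ) {
        Collections.sort( words, new Comparator< String >() {
            public int compare( String a, String b ) {
                return a.length() - b.length();
            }
        } );
        return words;
    }

    public static void main( String[] args ) {
        List< String > words =
            new ArrayList< String >( Arrays.asList( "ccc", "a", "bb" ) );
        System.out.println( byLength( words ) );   // [a, bb, ccc]
    }
}
```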
  5. Modeling variant types is awkward. You have to choose from among many bad options (representation → interrogation):
    • one sparsely-populated class (S + T modeled as S × T) → if-ladders based on comparing to null (see 8)
    • S × T and an enum of type labels → switch + casting
    • a hierarchy → a bunch of isS() and isT() methods and casting
    • a hierarchy → various casting attempts wrapped with ClassCastException catch blocks (ok, that's not really an option, but I get that as an answer in interviews sometimes)
    • a hierarchy → if-ladders based on instanceof and casting
    • a hierarchy → polymorphic decomposition and the inevitable bloated APIs at the base class that result
    • a hierarchy that includes Visitor → painfully verbose visitors (see 4 and 10)
    • a hierarchy of Throwables → throw and various catch blocks, which I suspect compiles to the same thing as the instanceof approach, only more expensive (but actually requires the least code!)
    Scala has case classes and pattern matching built in.
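    To make the instanceof if-ladder option concrete, a sketch with a hypothetical Shape hierarchy:

```java
public class Variant {
    static abstract class Shape { }
    static final class Circle extends Shape {
        final double r;
        Circle( double r ) { this.r = r; }
    }
    static final class Square extends Shape {
        final double side;
        Square( double side ) { this.side = side; }
    }

    // The "if-ladders based on instanceof and casting" option: every new
    // subclass means another branch, and unlike a Scala match on case
    // classes, the compiler cannot check that the ladder is exhaustive.
    static double area( Shape s ) {
        if ( s instanceof Circle ) {
            Circle c = ( Circle ) s;
            return Math.PI * c.r * c.r;
        } else if ( s instanceof Square ) {
            Square sq = ( Square ) s;
            return sq.side * sq.side;
        } else {
            throw new IllegalArgumentException( "unknown Shape: " + s );
        }
    }

    public static void main( String[] args ) {
        System.out.println( area( new Square( 3.0 ) ) );   // 9.0
    }
}
```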
  6. No tuples. One ends up either creating Pair< S, T > or dozens of throwaway classes with "And" in the name, like CountAndElapsed. Scala has tuples, although I feel they kind of screwed up by not going full ML and making multi-argument functions/methods the same as single-argument functions/methods over tuples. So to call a 2-arg function f with a pair p = ( p1, p2 ), you can either call f( p1, p2 ) or f.tupled( p ). There must be some deep reason for making the distinction.
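    The throwaway Pair one inevitably writes, sketched minimally (a hypothetical class, not anything from the standard library):

```java
public class Pair< S, T > {
    public final S first;
    public final T second;

    public Pair( S first, T second ) {
        this.first = first;
        this.second = second;
    }

    // Factory method so call sites can at least skip the explicit type arguments.
    public static < S, T > Pair< S, T > of( S s, T t ) {
        return new Pair< S, T >( s, t );
    }

    public static void main( String[] args ) {
        Pair< String, Integer > p = Pair.of( "count", 42 );
        System.out.println( p.first + " = " + p.second );   // count = 42
    }
}
```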
  7. No mixins. If you need stuff from 2 abstract classes, you will be copying, or aggregating (with loads of monkey delegation boilerplate), at least one of the two.
  8. Null. The following code should illustrate:

    private static Doohickey getDoohickey( Thingamajigger t ) {
        Whatsit w;
        Foobar f;
        if ( null == t )
            return null;
        else if ( null == ( w = t.getWhatsit() ) )
            return null;
        else if ( null == ( f = w.getFoobar() ) )
            return null;
        else
            return f.getDoohickey();
    }

    I believe the "Elvis" operator was developed to solve this annoyance (return t?.getWhatsit()?.getFoobar()?.getDoohickey();) but it did not make the cut for Java 7 or even Java 8, from what I understand. Scala's solution to this issue is to recommend that operations which might not have a value for you return Option[ T ] instead of T. You can then map your method call over the Option and get back another Option without ever seeing a null pointer exception. Option is a variant type, easily modeled in Scala but not in Java (see 5).
  9. Iterators. They can't throw checked exceptions. They have to implement remove(), often by throwing (unchecked) UnsupportedOperationException. For-each syntax can't work with iterators directly. None of these problems arises with the superb collections framework in Scala, which is designed hand-in-hand with its clean higher-order programming (see 4).
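    The for-each complaint, illustrated with a hypothetical one-shot adapter:

```java
import java.util.Arrays;
import java.util.Iterator;

public class IterForEach {
    // For-each accepts an Iterable, never a bare Iterator; this hypothetical
    // adapter bridges the gap (usable only once, since it hands out the
    // same Iterator every time it is asked).
    static < T > Iterable< T > once( final Iterator< T > it ) {
        return new Iterable< T >() {
            public Iterator< T > iterator() { return it; }
        };
    }

    public static void main( String[] args ) {
        Iterator< Integer > it = Arrays.asList( 1, 2, 3 ).iterator();
        int sum = 0;
        for ( int i : once( it ) )   // impossible with the bare Iterator
            sum += i;
        System.out.println( sum );   // 6
    }
}
```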
  10. Void. This is a holdover from C, and is obviously not anything Java can get rid of, but it's stupid. Because of void, e.g., I can never do a visitor pattern with just one kind of visitor; there has to be one whose methods return a generic type T, and another whose methods return void. And don't try to sell me on the pseudo-type Void, because you still have to accept or return null somewhere. Scala has a type Unit with a single trivial value (), and Unit-returning methods/functions can explicitly return that value or not return anything; the semantics are the same. Thus all expressions have some meaningful type, and classes with generic types can be fully general.

Wednesday, April 07, 2010

Why tabs are better

I'm tired of this stupid "tabs vs. spaces" code style debate. Tabs win hands down on just about every measure. Anyone still laboring under the misapprehension that it makes sense to indent one's source file using spaces should consider the following:
  1. Line-based comments (‘//’, ‘#’) at the head of the line don’t screw up the indentation (unless tab depth <= 2).
  2. You can change the indentation depth without editing the file. This is a huge feature, folks. If I like shallow indentation on all my source, I can make it so, and people who prefer the other extreme are not affected. The counter-argument (put forth by Checkstyle, among others) that one should not be required to set tab depth in order to read source is absurd; tab depth is always set to something, whether you like it or not (see 11), and code indented using tabs is readable regardless of the setting, unless tab depth is ridiculously high. The only code that actually does require a fixed tab depth to be legible is code that mixes tabs and spaces, which I encounter all too often. See 10.
  3. Spaces-based indentation will inevitably become inconsistent because no one can agree on his/her favorite indentation depth (see 2).
  4. Indentation mistakes are more obvious using tabs (unless tab depth = 1, which is just stupid).
  5. Tab indentation characters, when used properly, are more semantically relevant than spaces.
  6. Files are smaller (relevant especially for Javascript, CSS, HTML).
  7. Fewer keystrokes are needed to navigate within source files. Sorry, but “Ctrl+Right arrow” is two keystrokes, plus you have to hold one of them down.
  8. Making tabbed whitespace visible in an IDE is useful for eyeballing how things line up; making spaces visible is useful for “magic eye”.
  9. Tabs are unable to support the unreadable, but nevertheless default, function-call line-break style of making parameters line up with the opening ‘(’. Remember, it is a feature that this abomination is not supportable. Unfortunately it is still possible to put just the first parameter on the same line as the ‘(’, but no indentation choice can prevent that bad decision.
  10. If you have to edit a production config file using terminal-based default emacs, should you really be checking that in? I should add that the indentation used by default in Emacs (and pervasive in high-profile source such as the JDK) is a horrific hybrid of spaces and tabs which actually does force you to set your tab depth to a fixed value of 8 in order to read code thus indented. See 2.
  11. Some well-known tools (e.g., ReviewBoard) typically display tabs with a depth of 8, which is kind of high. I claim that this tab discrimination is also a feature, because it discourages deeply-nested code, which is a good thing.

The only moderately sane argument in favor of spaces is that the code "always looks the same". Isn't that nice. You can write comments that use little "^^^^" to point to something on the line above. Wow. I guess that's worth throwing out points 1-11.

I'm not going to wade into the quagmire of my other personal code style choices. But it's time this debate, which rages again and again every time I join a new team, be permanently put to bed.