Thursday, May 16, 2013

Hadoop + S3 = data loss

Amazon's Simple Storage Service is pretty awesome; you can think of it as a limitless disk drive. It is tempting to use it as a durable storage and publication mechanism for big data projects involving Apache Hadoop, and in fact Hadoop comes pre-packaged with I/O format libraries for working with S3 as a backing store. Unfortunately, those libraries are rife with assumptions about file-system consistency, a property that S3 quite consciously does not exhibit in the US Standard region. After successfully writing an object into S3, subsequent attempts to read that object are only guaranteed to succeed after some amount of time x passes; not only is there no actual bound on x, but you might get a hit at some time y << x and then no hit again between y and x. It's like a fluorescent light bulb turning on, only sometimes (once every ten thousand object writes or so) it takes about three hours. Or a few weeks. So far, I have encountered the following data-loss scenarios using Hadoop directly against S3 (on Elastic MapReduce, whose S3 I/O libraries improve on the stock ones); a small sketch of the read-after-write gap follows the list:
  1. Closed sockets on reads. S3 closes your read sockets whenever it wants to. As long as your input file is not splittable, you will get MD5 checksum protection, but if it is splittable and uncompressed, you as a reader may not be able to detect that your data has been truncated (a defensive length check is sketched after this list).
  2. Map files or multiple outputs may not be correctly committed. Map files consist of "folders", each containing a file named "data" and another named "index". On S3, it is possible that one of these files does not appear to the committer in its temporary output location during the task commit operation. This case is less dangerous than the analogous Multiple Outputs case, because consumers of a map file with only an index file will fail rather than truncate (a quick completeness check is sketched after this list).
  3. Multi-part uploads. On very rare occasions I have observed multi-part upload (which is enabled by default on recent EMR AMIs) somehow "miss" one of the parts, leading to data truncation. I have disabled multi-part uploads for all my jobs using the boolean-valued property fs.s3n.multipart.uploads.enabled, for this and other reasons (for example, I use some customized S3 client code in my implementation of the NativeFileSystemStore interface, but multi-part uploads are performed by a private S3 client instance over which I have no control); the one-line configuration change is shown after this list. Disabling them means I also have to make sure all my objects stay well under S3's 5GB single-PUT limit, and my transfers (in and out) have to report progress to avoid being killed for dead air by the jobtracker, but that is material for another post.
  4. No attempt to commit. Possibly the most insidious bug of all is the needsTaskCommit() implementation in FileOutputCommitter. If, at the time this method is called, the temporary output "folder" in S3 does not appear to the committer, no attempt is made to commit anything: the committer assumes there was no output, the task is marked as successful, and no error is raised. Because the US Standard region of S3 makes almost no guarantees at all about LIST-after-PUT consistency, as I mentioned above, this situation actually arises quite often (a retrying committer is sketched after this list).
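
To make the read-after-write gap concrete, here is a minimal probe, not taken from any of my jobs, that writes an object through the stock Hadoop FileSystem API and then polls until the key becomes visible. The bucket name is a placeholder and AWS credentials are assumed to be configured elsewhere.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Writes a small object via s3n:// and measures how long it takes for the
    // key to become visible to exists(). On US Standard this can be nonzero.
    public class ReadAfterWriteProbe {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // fs.s3n.* credentials assumed set
        FileSystem fs = FileSystem.get(URI.create("s3n://my-bucket/"), conf);
        Path probe = new Path("s3n://my-bucket/consistency-probe/"
            + System.currentTimeMillis());

        FSDataOutputStream out = fs.create(probe);
        out.writeBytes("probe");
        out.close(); // the PUT has succeeded by the time close() returns

        long start = System.currentTimeMillis();
        while (!fs.exists(probe)) { // a successful PUT does not imply a hit here
          Thread.sleep(1000L);
        }
        System.out.println("Visible after "
            + (System.currentTimeMillis() - start) + " ms");
      }
    }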
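
For scenario 1, one blunt mitigation, a sketch rather than anything from my production code, is to count the bytes you actually consumed and compare against the length S3 reported for the object, so that a socket closed mid-read on a splittable, uncompressed file cannot masquerade as a clean end-of-file.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Reads the whole object and fails loudly if fewer bytes arrive than the
    // file status claims, instead of treating an early EOF as success.
    public final class TruncationCheck {
      public static void verifyFullyRead(FileSystem fs, Path path) throws IOException {
        long expected = fs.getFileStatus(path).getLen();
        long actual = 0L;
        byte[] buf = new byte[64 * 1024];
        FSDataInputStream in = fs.open(path);
        try {
          int n;
          while ((n = in.read(buf)) != -1) {
            actual += n;
          }
        } finally {
          in.close();
        }
        if (actual != expected) {
          throw new IOException("Truncated read of " + path
              + ": expected " + expected + " bytes, got " + actual);
        }
      }
    }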
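
For scenario 2, a cheap sanity check before trusting a committed map file, again just a sketch, is to verify that both halves are visible; MapFile.DATA_FILE_NAME and MapFile.INDEX_FILE_NAME are the stock constants for the "data" and "index" names.

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;

    // Refuses to treat a map file "folder" as complete unless both the data
    // and index parts can be seen on S3.
    public final class MapFileSanity {
      public static void checkComplete(FileSystem fs, Path mapFileDir) throws IOException {
        Path data = new Path(mapFileDir, MapFile.DATA_FILE_NAME);
        Path index = new Path(mapFileDir, MapFile.INDEX_FILE_NAME);
        boolean hasData = fs.exists(data);
        boolean hasIndex = fs.exists(index);
        if (!hasData || !hasIndex) {
          throw new IOException("Incomplete MapFile at " + mapFileDir
              + " (data present: " + hasData + ", index present: " + hasIndex + ")");
        }
      }
    }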
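
For scenario 3, the property itself is all there is to it; whether it reaches the job through core-site.xml, a bootstrap action, or code like the following is a deployment detail.

    import org.apache.hadoop.conf.Configuration;

    public final class DisableMultipart {
      public static Configuration withoutMultipart() {
        Configuration conf = new Configuration();
        // false turns multi-part uploads off, which in turn means every object
        // must fit in a single PUT (S3's 5GB limit).
        conf.setBoolean("fs.s3n.multipart.uploads.enabled", false);
        return conf;
      }
    }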
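
For scenario 4, the best I can offer in a short sketch, and this is my own hypothetical wrapper, not anything shipped with Hadoop or EMR, is to make needsTaskCommit() retry the existence check instead of trusting a single eventually-consistent listing. It narrows the window, but it cannot distinguish "no output was produced" from "output is not visible yet".

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

    // Retries the check on the task's temporary output path before concluding
    // that there is nothing to commit.
    public class RetryingFileOutputCommitter extends FileOutputCommitter {
      private static final int ATTEMPTS = 10;
      private static final long SLEEP_MS = 30000L;

      public RetryingFileOutputCommitter(Path outputPath, TaskAttemptContext context)
          throws IOException {
        super(outputPath, context);
      }

      @Override
      public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
        Path work = getWorkPath();
        FileSystem fs = work.getFileSystem(context.getConfiguration());
        for (int i = 0; i < ATTEMPTS; i++) {
          if (fs.exists(work)) {
            return true; // temporary output is visible; go ahead and commit it
          }
          try {
            Thread.sleep(SLEEP_MS); // give S3 a chance to become consistent
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException(e);
          }
        }
        // Still not visible after retrying: fall back to the stock decision.
        // This narrows the window but cannot rule out invisible output.
        return super.needsTaskCommit(context);
      }
    }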