Thursday, January 30, 2014

How to Get Job Status?

There are several ways to get the status of your Hadoop jobs. I will discuss the Hadoop 2.0 options here.

  1. The YARN ResourceManager web UI on the Hadoop cluster. Example: http://serverhostingyarn:8088/cluster/app/application_1390931417866_0020
  2. The job history directory in HDFS: hadoop fs -ls /user/history/done/ | grep "1390931417866_0020"
  3. The yarn command line, for example: yarn application -status application_1390931417866_0020
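The same status can also be read programmatically. The sketch below assumes the ResourceManager REST API is reachable on the same host and port as the web UI in option 1; the function names and the sample response are made up for illustration, and only the URL path and the "app" fields follow the YARN REST API.

```python
import json


def app_status_url(rm_host, app_id, port=8088):
    """Build the ResourceManager REST URL for a single application."""
    return "http://{}:{}/ws/v1/cluster/apps/{}".format(rm_host, port, app_id)


def parse_app_status(body):
    """Pull state, final status and progress out of the REST response body."""
    app = json.loads(body)["app"]
    return app["state"], app["finalStatus"], app["progress"]


def history_job_id(app_id):
    """Map a YARN application id to the MapReduce job id seen in history file names."""
    return app_id.replace("application_", "job_", 1)


# Hypothetical response body, trimmed to the fields used above.
sample_body = json.dumps({
    "app": {
        "id": "application_1390931417866_0020",
        "state": "FINISHED",
        "finalStatus": "SUCCEEDED",
        "progress": 100.0,
    }
})

url = app_status_url("serverhostingyarn", "application_1390931417866_0020")
state, final_status, progress = parse_app_status(sample_body)
```

Against a live cluster, the body would come from an HTTP GET on `url` (for example with urllib.request.urlopen) instead of the hard-coded sample.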

Tuesday, January 28, 2014

Appending Files in Hadoop

Can I really append to a file in HDFS, or do I always need to replace the file? See http://hadoop4mapreduce.blogspot.com/2012/08/two-methods-to-append-content-to-file.html. We had a client that wanted to merge data coming from the same source into a single large file. The challenge was how to handle concurrency, because the append process is not thread safe. The solution we used was to batch all the small files into a larger file periodically.
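The batching idea can be sketched as follows. This is only an illustration on the local filesystem, not the code we ran for the client: the staging directory, the function name and the delete-after-merge step are assumptions, and on a real cluster the reads and writes would go through HDFS instead. The point is that a single periodic job is the only writer of the large file, so no two appends run at the same time.

```python
import os
import shutil


def merge_staged(staging_dir, merged_path):
    """Append every file in staging_dir to merged_path, then remove it.

    Meant to be run by one scheduled job at a time, so the merged file
    only ever has a single writer. Returns the number of files merged.
    """
    names = sorted(os.listdir(staging_dir))  # stable, repeatable order
    with open(merged_path, "ab") as out:
        for name in names:
            path = os.path.join(staging_dir, name)
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
            os.remove(path)  # drop the small file once its bytes are merged
    return len(names)
```

Producers only ever drop new small files into the staging directory, and the periodic job calls merge_staged on it.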

Compression Techniques for Hadoop

I am currently researching various options to do compression in Hadoop and found the following articles useful:
My project needs are minimal performance impact and decent compression gains. We looked at splittable LZO. However, this requires:
  • Compiling a custom codec in your environment (Linux)
  • Copying the library files to all nodes
  • Updating the Hadoop site config file
  • Referring to your codec in your MapReduce jobs
  • If Hive is used, referring to the LZO compression codec in your table creation command for both read and write
With all these requirements, my client decided to use bzip2, as it is built into Hadoop and does not require additional libraries to be installed. I have attached a quick comparison of the two solutions:
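To run a similar check on your own data, the two project needs (performance impact and compression gain) can be measured on a sample before committing to a codec. The sketch below uses Python's standard bz2 module, which writes the same bzip2 format but is not Hadoop's codec, so the timing is only indicative. The function name and the sample record are made up for illustration, and LZO is left out because it needs a third-party library.

```python
import bz2
import time


def measure_bzip2(data, level=9):
    """Return (compressed size / original size, seconds spent compressing)."""
    start = time.perf_counter()
    compressed = bz2.compress(data, level)
    elapsed = time.perf_counter() - start
    # Round-trip check: the compressed bytes must decode back to the input.
    if bz2.decompress(compressed) != data:
        raise ValueError("bzip2 round trip failed")
    return len(compressed) / len(data), elapsed


# Hypothetical sample: one log-like record repeated many times.
sample = b"2014-01-28,source-a,record payload\n" * 10000
ratio, seconds = measure_bzip2(sample)
```

A real test should use a representative slice of the actual input, since highly repetitive data like this sample compresses far better than typical records.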

Welcome to My Blog

Welcome to my blog. Here I will share real-world scenarios, issues and solutions for Hadoop-based projects.