Monday, December 15, 2014

HDInsight Essentials, 2nd Edition is coming soon

This is the 2nd edition of my book HDInsight Essentials. This one is more in-depth and goes through the journey of building an enterprise data lake. It is up to date with Hadoop 2.x and HDInsight 3.1.

I also take a real-life project and walk through the ingestion, organization, transformation, and reporting phases.

https://www.packtpub.com/big-data-and-business-intelligence/hdinsight-essentials-second-edition

Monday, December 8, 2014

Hive 0.14 released with useful features for RDBMS offload use cases

Hive 0.14 ships great features that bring it very close to an RDBMS-like solution on Hadoop: http://hortonworks.com/blog/announcing-apache-hive-0-14/

Key features (a quick sketch follows the list):

  • Transactions with ACID semantics
  • Cost Based Optimizer
  • SQL Temporary Tables
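
To make these concrete, here is a minimal HiveQL sketch. The table and column names are my own examples, and it assumes a cluster where Hive transactions have been enabled (ACID tables must be bucketed, stored as ORC, and flagged transactional):

$ hive
hive> CREATE TABLE accounts (id INT, balance DECIMAL(10,2))
    >   CLUSTERED BY (id) INTO 4 BUCKETS
    >   STORED AS ORC
    >   TBLPROPERTIES ('transactional'='true');
hive> UPDATE accounts SET balance = balance + 100 WHERE id = 1;   -- ACID update
hive> DELETE FROM accounts WHERE id = 2;                          -- ACID delete
hive> CREATE TEMPORARY TABLE accounts_staging (id INT, balance DECIMAL(10,2));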

The Hive design docs are a good read if you are interested in the details.

Saturday, October 25, 2014

Strata Hadoop World 2014 Conference Speaker Notes and Links

I attended the 2014 Strata + Hadoop World conference in NY, which was a great success, with increased participation compared to last year. Spark was a key highlight of the conference from a technology perspective. Several new products and tools are trying to capitalize on the predicted $40 billion market. Link to slides and videos: http://strataconf.com/stratany2014/public/schedule/proceedings?imm_mid=0c5096&cmp=em-strata-na-info-stny14_thankyou

Monday, October 20, 2014

HBase and Hive Integration

HBase has been the key database in the Hadoop ecosystem, providing transactional support that enables real-time applications to be built on top of HDFS. The following is a good article from Hortonworks describing the joint roadmap for HBase and Hive; it should help streamline architectures in Hadoop: http://hortonworks.com/blog/hbase-hive-better-together/
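
To see what the integration already looks like in practice, Hive can query an existing HBase table through the HBase storage handler. A minimal sketch, with made-up table and column names, assuming the Hive HBase handler jars are on the classpath:

$ hive
hive> CREATE EXTERNAL TABLE hbase_users (rowkey STRING, name STRING)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name')
    > TBLPROPERTIES ('hbase.table.name' = 'users');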

Thursday, July 3, 2014

How to kill a MapReduce job


A very common need: kill a long-running job. Following is the syntax:

$ hadoop job -kill <job-id>


Usage: hadoop job [GENERIC_OPTIONS] [-submit <job-file>] | [-status <job-id>] | [-counter <job-id> <group-name> <counter-name>] | [-kill <job-id>] | [-events <job-id> <from-event-#> <#-of-events>] | [-history [all] <jobOutputDir>] | [-list [all]] | [-kill-task <task-id>] | [-fail-task <task-id>] | [-set-priority <job-id> <priority>]
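
On Hadoop 2.x/YARN clusters, hadoop job is deprecated in favor of mapred job, and you can also kill the enclosing YARN application. A quick sketch (the IDs below are placeholders):

$ hadoop job -list                                        # find the job id
$ mapred job -kill job_1390931417866_0020
$ yarn application -kill application_1390931417866_0020   # kills the whole YARN app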

YARN Capacity Scheduler

Friday, June 27, 2014

Hadoop Summit 2014 Presentations online

Links to the keynotes and presentations of Hadoop Summit 2014. YARN was the focus this year, and classic MapReduce will soon become legacy:

  • http://hadoopsummit.org/san-jose/keynote-day1/
  • http://hadoopsummit.org/san-jose/schedule/

Thursday, June 26, 2014

How to clean up HDFS /tmp space

My jobs are failing with "no space left" errors on /tmp. Has anyone used this? https://github.com/mag-/hdfs-cleanup
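
Until I settle on a tool, a manual sweep does the job. A hedged sketch; the path below is just an example of an old job directory, so verify what you are deleting first (-skipTrash frees the space immediately instead of moving files to trash):

$ hdfs dfs -du -h /tmp                                    # see what is taking the space
$ hdfs dfs -rm -r -skipTrash /tmp/some-old-job-dir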

Monday, March 10, 2014

How to run a simple MapReduce job to verify cluster health?


Many times you want to check whether the cluster is in good shape. Run this command, which does not require much setup:

$ hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples-*.jar pi 10 1000
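
Note that the jar path above is from a Pivotal HD (gphd) install. On a stock Apache Hadoop 2.x layout the same example ships under $HADOOP_HOME; a hedged equivalent:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 10 1000

If the cluster is healthy, the job submits, map and reduce progress reach 100%, and an estimate of Pi is printed at the end.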

Tuesday, February 18, 2014

HDFS Data Federation

My client has a requirement to restrict certain Hadoop directories based on user/group. This seems to be a common need: you want to isolate portions of HDFS for, say, a department or a set of users.

A federated NameNode sounds like a perfect solution for this; however, it has issues and limitations. It does not work well with shared network storage. Additionally, I would like to understand who is actually using it in production.

One alternative is to use the POSIX-style owner-group-others permissions and restrict certain directories to groups, as sketched below. This provides basic security but is harder to manage and does not give the clean isolation of a federated NameNode.
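
A minimal sketch of the permissions approach, assuming a hypothetical finance group (run as the HDFS superuser):

$ hdfs dfs -mkdir -p /data/finance
$ hdfs dfs -chown hdfs:finance /data/finance
$ hdfs dfs -chmod 750 /data/finance     # owner rwx, group r-x, others denied

Members of the finance group can read the directory; everyone else is shut out.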


Some links to read

http://www.slideshare.net/huguk/hdfs-federation-hadoop-summit2011



Thursday, February 6, 2014

YARN Command Line

Starting from Hadoop 2.0, we can use YARN to run MapReduce as an application.


  • Running a WordCount using YARN:
         $ yarn jar $HADOOP_HOME/HadoopSamples.jar \
             mr.wordcount -libjars $MYLIBS/custom-libs.jar
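
If you do not have a custom jar handy, the stock examples jar works the same way. A hedged sketch assuming a typical Apache Hadoop 2.x layout (/user/me/input must exist and /user/me/output must not):

$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/me/input /user/me/output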

Tuesday, February 4, 2014

Avro Data Serialization in Hadoop


Thursday, January 30, 2014

How to Get Job Status?

There are several ways to get the status of your Hadoop jobs. I will discuss the Hadoop 2.0 options here:

  1. The YARN ResourceManager web UI. Example: http://serverhostingyarn:8088/cluster/app/application_1390931417866_0020
  2. hadoop fs -ls /user/history/done/ | grep "1390931417866_0020"
  3. The yarn command line (see the sketch below)
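
For option 3, a quick sketch of the yarn command line, reusing the application ID from the example above:

$ yarn application -list                                    # currently running applications
$ yarn application -status application_1390931417866_0020   # status of a single application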

Tuesday, January 28, 2014

Appending Files in Hadoop

Can I really append to a file in HDFS, or do I always need to replace the file? http://hadoop4mapreduce.blogspot.com/2012/08/two-methods-to-append-content-to-file.html

We had a client that wanted to merge data coming from the same source into a single large file. The challenge was how to handle concurrency, because the append process is not thread-safe. The solution we used was to periodically batch all the small files into a larger file, as sketched below.
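
A hedged sketch of that periodic batching step; the /data paths are made up. hdfs dfs -getmerge concatenates the small files into one local file, which is then pushed back and the originals removed:

$ hdfs dfs -getmerge /data/incoming/source1 /tmp/source1-merged.dat
$ hdfs dfs -put /tmp/source1-merged.dat /data/merged/source1-$(date +%Y%m%d).dat
$ hdfs dfs -rm -r /data/incoming/source1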

Compression Techniques for Hadoop

I am currently researching various options for compression in Hadoop and found several articles useful along the way. My project's needs are simple: minimal performance impact plus decent compression gains. We looked at splittable LZO; however, it requires:
  • A custom codec compiled for your environment (Linux)
  • The library files copied to all nodes
  • An update to the Hadoop site config file
  • A reference to the codec in your MapReduce job
  • If Hive is used, a table creation command that refers to the LZO compression codec for both READ and WRITE
With all these requirements, my client decided to use Bzip2, as it is native to Hadoop and does not require additional libraries to be installed. I have attached a quick comparison of the 2 solutions:
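
For reference, turning on Bzip2 output requires no extra installs; you just point the job at the built-in codec. A hedged sketch, where the jar and driver names are placeholders and the driver is assumed to accept generic options (-D):

$ hadoop jar my-job.jar MyDriver \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
    /data/input /data/output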

Welcome to My Blog

Welcome to my blog. Through this blog, I will share real-world scenarios, issues, and solutions for Hadoop-based projects.