Monday, December 15, 2014

HDInsight Essentials, 2nd Edition is coming soon

This is the 2nd edition of my book HDInsight Essentials. This one is more in-depth and goes through the journey of building an enterprise data lake. It is up to date with Hadoop 2.x and HDInsight 3.1.

I also take a real-life project and walk through the ingestion, organization, transformation, and reporting phases.

https://www.packtpub.com/big-data-and-business-intelligence/hdinsight-essentials-second-edition

Monday, December 8, 2014

Hive 0.14 released with useful features for RDBMS offload use cases

Hive 0.14 ships great features that bring it very close to an RDBMS-like solution on Hadoop: http://hortonworks.com/blog/announcing-apache-hive-0-14/

Key features (a quick sketch follows the list):

  • Transactions with ACID semantics
  • Cost Based Optimizer
  • SQL Temporary Tables
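
To make these concrete, here is a minimal HiveQL sketch. The table and column names are my own examples, and it assumes a cluster where Hive transactions have been enabled (ACID tables must be bucketed, stored as ORC, and flagged transactional):

$ hive
hive> CREATE TABLE accounts (id INT, balance DECIMAL(10,2))
    >   CLUSTERED BY (id) INTO 4 BUCKETS
    >   STORED AS ORC
    >   TBLPROPERTIES ('transactional'='true');
hive> UPDATE accounts SET balance = balance + 100 WHERE id = 1;   -- ACID update
hive> DELETE FROM accounts WHERE id = 2;                          -- ACID delete
hive> CREATE TEMPORARY TABLE accounts_staging (id INT, balance DECIMAL(10,2));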

The Hive design docs are a good read if you are interested in the details.

Saturday, October 25, 2014

Strata Hadoop World 2014 Conference Speaker Notes and Links

I attended the 2014 Strata + Hadoop World conference in NY, which was a great success, with increased participation compared to last year. Spark was a key highlight of the conference from a technology perspective. Several new products and tools are trying to capitalize on the predicted $40 billion market. Link to slides and videos: http://strataconf.com/stratany2014/public/schedule/proceedings?imm_mid=0c5096&cmp=em-strata-na-info-stny14_thankyou

Monday, October 20, 2014

HBase and Hive Integration

HBase has been the key database in the Hadoop ecosystem, providing transactional support that enables real-time applications to be built on top of HDFS. The following is a good article from Hortonworks describing the joint roadmap for HBase and Hive; it should help streamline architectures in Hadoop: http://hortonworks.com/blog/hbase-hive-better-together/
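
To see what the integration already looks like in practice, Hive can query an existing HBase table through the HBase storage handler. A minimal sketch, with made-up table and column names, assuming the Hive HBase handler jars are on the classpath:

$ hive
hive> CREATE EXTERNAL TABLE hbase_users (rowkey STRING, name STRING)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name')
    > TBLPROPERTIES ('hbase.table.name' = 'users');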

Thursday, July 3, 2014

How to kill a MapReduce job


A very common need: kill a long-running job. Following is the syntax:

$ hadoop job -kill <job-id>


Usage: hadoop job [GENERIC_OPTIONS] [-submit <job-file>] | [-status <job-id>] | [-counter <job-id> <group-name> <counter-name>] | [-kill <job-id>] | [-events <job-id> <from-event-#> <#-of-events>] | [-history [all] <jobOutputDir>] | [-list [all]] | [-kill-task <task-id>] | [-fail-task <task-id>] | [-set-priority <job-id> <priority>]
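
On Hadoop 2.x/YARN clusters, hadoop job is deprecated in favor of mapred job, and you can also kill the enclosing YARN application. A quick sketch (the IDs below are placeholders):

$ hadoop job -list                                        # find the job id
$ mapred job -kill job_1390931417866_0020
$ yarn application -kill application_1390931417866_0020   # kills the whole YARN app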

YARN Capacity Scheduler

Friday, June 27, 2014

Hadoop Summit 2014 Presentations online

Links to the keynotes and presentations of Hadoop Summit 2014. YARN was the focus this year, and classic MapReduce will soon become legacy:

  • http://hadoopsummit.org/san-jose/keynote-day1/
  • http://hadoopsummit.org/san-jose/schedule/

Thursday, June 26, 2014

How to clean up HDFS /tmp space

My jobs are failing with "no space left" errors on /tmp. Has anyone used this? https://github.com/mag-/hdfs-cleanup
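
Until I settle on a tool, a manual sweep does the job. A hedged sketch; the path below is just an example of an old job directory, so verify what you are deleting first (-skipTrash frees the space immediately instead of moving files to trash):

$ hdfs dfs -du -h /tmp                                    # see what is taking the space
$ hdfs dfs -rm -r -skipTrash /tmp/some-old-job-dir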

Monday, March 10, 2014

How to run a simple MapReduce job to verify cluster health?


Many times you want to check whether the cluster is in good shape. Run this command, which does not require much setup:

$ hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples-*.jar pi 10 1000
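
Note that the jar path above is from a Pivotal HD (gphd) install. On a stock Apache Hadoop 2.x layout the same example ships under $HADOOP_HOME; a hedged equivalent:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 10 1000

If the cluster is healthy, the job submits, map and reduce progress reach 100%, and an estimate of Pi is printed at the end.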

Tuesday, February 18, 2014

HDFS Data Federation

My client has a requirement to restrict certain Hadoop directories based on user/group. This seems to be a common need: you want to isolate portions of HDFS for, say, a department or a set of users.

A federated NameNode sounds like a perfect solution for this; however, it has issues and limitations. It does not work well with shared network storage. Additionally, I would like to understand who is actually using it in production.

One alternative is to use the POSIX-style owner-group-others permissions and restrict certain directories to groups, as sketched below. This provides basic security but is harder to manage and does not give the clean isolation of a federated NameNode.
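
A minimal sketch of the permissions approach, assuming a hypothetical finance group (run as the HDFS superuser):

$ hdfs dfs -mkdir -p /data/finance
$ hdfs dfs -chown hdfs:finance /data/finance
$ hdfs dfs -chmod 750 /data/finance     # owner rwx, group r-x, others denied

Members of the finance group can read the directory; everyone else is shut out.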


Some links to read

http://www.slideshare.net/huguk/hdfs-federation-hadoop-summit2011



Thursday, February 6, 2014

YARN Command Line

Starting from Hadoop 2.0, we can use YARN to run MapReduce as an application.


  • Running a WordCount using YARN:
         $ yarn jar $HADOOP_HOME/HadoopSamples.jar \
             mr.wordcount -libjars $MYLIBS/custom-libs.jar
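
If you do not have a custom jar handy, the stock examples jar works the same way. A hedged sketch assuming a typical Apache Hadoop 2.x layout (/user/me/input must exist and /user/me/output must not):

$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/me/input /user/me/output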

Tuesday, February 4, 2014

Avro Data Serialization in Hadoop


Thursday, January 30, 2014

How to Get Job Status?

There are several ways to get the status of your Hadoop jobs. I will discuss the Hadoop 2.0 options here:

  1. The YARN ResourceManager web UI. Example: http://serverhostingyarn:8088/cluster/app/application_1390931417866_0020
  2. hadoop fs -ls /user/history/done/ | grep "1390931417866_0020"
  3. The yarn command line (see the sketch below)
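
For option 3, a quick sketch of the yarn command line, reusing the application ID from the example above:

$ yarn application -list                                    # currently running applications
$ yarn application -status application_1390931417866_0020   # status of a single application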

Tuesday, January 28, 2014

Appending Files in Hadoop

Can I really append to a file in HDFS, or do I always need to replace the file? http://hadoop4mapreduce.blogspot.com/2012/08/two-methods-to-append-content-to-file.html

We had a client that wanted to merge data coming from the same source into a single large file. The challenge was how to handle concurrency, because the append process is not thread-safe. The solution we used was to periodically batch all the small files into a larger file, as sketched below.
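
A hedged sketch of that periodic batching step; the /data paths are made up. hdfs dfs -getmerge concatenates the small files into one local file, which is then pushed back and the originals removed:

$ hdfs dfs -getmerge /data/incoming/source1 /tmp/source1-merged.dat
$ hdfs dfs -put /tmp/source1-merged.dat /data/merged/source1-$(date +%Y%m%d).dat
$ hdfs dfs -rm -r /data/incoming/source1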

Compression Techniques for Hadoop

I am currently researching various options for compression in Hadoop and found several articles useful along the way. My project's needs are simple: minimal performance impact plus decent compression gains. We looked at splittable LZO; however, it requires:
  • A custom codec compiled for your environment (Linux)
  • The library files copied to all nodes
  • An update to the Hadoop site config file
  • A reference to the codec in your MapReduce job
  • If Hive is used, a table creation command that refers to the LZO compression codec for both READ and WRITE
With all these requirements, my client decided to use Bzip2, as it is native to Hadoop and does not require additional libraries to be installed. I have attached a quick comparison of the 2 solutions:
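
For reference, turning on Bzip2 output requires no extra installs; you just point the job at the built-in codec. A hedged sketch, where the jar and driver names are placeholders and the driver is assumed to accept generic options (-D):

$ hadoop jar my-job.jar MyDriver \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
    /data/input /data/output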

Welcome to My Blog

Welcome to my blog. Through this blog, I will share real-world scenarios, issues, and solutions for Hadoop-based projects.