Tips and Tricks to build a Hadoop ecosystem. References to good articles on Hadoop-based solutions. Topics include: Hadoop architecture, Hive, SQL on Hadoop, compression, metadata.
Thursday, December 18, 2014
Great slides explaining Hive 14 transactional support capabilities.
Monday, December 15, 2014
HDInsight Essentials, 2nd edition, is coming soon
This is the 2nd edition of my book HDInsight Essentials. It is more in-depth and goes through the journey of building an enterprise data lake. It is up to date with Hadoop 2.x and HDInsight 3.1.
I also take a real life project and walk through the ingestion, organization, transformation and reporting phases.
https://www.packtpub.com/big-data-and-business-intelligence/hdinsight-essentials-second-edition

Monday, December 8, 2014
Hive 14 released with useful features for RDBMS offload use cases
Great features in Hive 14 that make it really close to an RDBMS solution based on Hadoop:
http://hortonworks.com/blog/announcing-apache-hive-0-14/
Key features:
- Transactions with ACID semantics
- Cost Based Optimizer
- SQL Temporary Tables
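As a quick sketch of how two of these features look in HiveQL (the table and column names below are made up; ACID tables in Hive 0.14 must be bucketed, stored as ORC, and have transactions enabled in the Hive configuration):

```sql
-- Temporary table, visible only to the current session (new in Hive 0.14)
CREATE TEMPORARY TABLE staging_orders (id BIGINT, amount DOUBLE);

-- ACID table: bucketed, stored as ORC, and flagged transactional
CREATE TABLE orders (id BIGINT, amount DOUBLE)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level operations enabled by ACID support
UPDATE orders SET amount = 10.0 WHERE id = 1;
DELETE FROM orders WHERE id = 2;
```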
The Hive design docs are worth a read if you are interested in the details.
Saturday, October 25, 2014
Strata Hadoop World 2014 Conference Speaker Notes and Links
I attended the 2014 Strata + Hadoop World conference in NY, which was a great success, with higher participation than last year.
Spark was a key technology highlight of the conference. Several new products and tools are trying to capitalize on the predicted $40 billion market.
Link to slides and videos:
http://strataconf.com/stratany2014/public/schedule/proceedings?imm_mid=0c5096&cmp=em-strata-na-info-stny14_thankyou
Monday, October 20, 2014
Hbase and Hive Integration
HBase has been the key database in the Hadoop ecosystem, providing the transactional support that enables real-time applications to be built on top of HDFS. The following is a good article from Hortonworks describing the roadmap for HBase and Hive. This will help streamline architectures in Hadoop:
http://hortonworks.com/blog/hbase-hive-better-together/
Tuesday, September 30, 2014
HiveServer2 architecture
This is a good article on the HiveServer2 architecture.
Thursday, July 3, 2014
How to kill a MapReduce job
A very common need: killing a long-running job. The syntax is:
$ hadoop job -kill <job-id>
Usage: hadoop job [GENERIC_OPTIONS] [-submit <job-file>] | [-status <job-id>] | [-counter <job-id> <group-name> <counter-name>] | [-kill <job-id>] | [-events <job-id> <from-event-#> <#-of-events>] | [-history [all] <jobOutputDir>] | [-list [all]] | [-kill-task <task-id>] | [-fail-task <task-id>] | [-set-priority <job-id> <priority>]
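The find-then-kill flow can be sketched in shell. The job id and listing format below are made-up examples for illustration; on Hadoop 2.x, `mapred job -kill <job-id>` and `yarn application -kill <app-id>` do the same thing.

```shell
# Find the id of the long-running job, then kill it.
# On a live cluster:  hadoop job -list
# Sample listing (assumed format) used here so the sketch runs anywhere:
list_output='JobId                  State  StartTime      UserName  Priority  SchedulingInfo
job_201412081530_0042  1      1418050000000  etl       NORMAL    default'

# Pick the first job id out of the listing
job_id=$(echo "$list_output" | awk '$1 ~ /^job_/ {print $1; exit}')
echo "$job_id"

# Then kill it (commented out so the sketch runs without a cluster):
# hadoop job -kill "$job_id"
```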
YARN Capacity Scheduler
A good article explaining how to use the Capacity Scheduler:
https://support.gopivotal.com/hc/en-us/articles/201623853-How-to-configure-queues-using-YARN-capacity-scheduler-xml-
Friday, June 27, 2014
Hadoop Summit 2014 Presentations online
Links to the keynote and presentations from Hadoop Summit 2014. YARN was the focus this year, and MapReduce will soon become legacy:
- http://hadoopsummit.org/san-jose/keynote-day1/
- http://hadoopsummit.org/san-jose/schedule/
Thursday, June 26, 2014
How to cleanup hdfs /tmp space
My jobs are failing with a "no space left" error on /tmp.
Has anyone used this?
https://github.com/mag-/hdfs-cleanup
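Apart from that tool, a minimal hand-rolled sketch of the cleanup is below. The `ls` output format and paths are assumed for illustration; the date comparison works because the ISO `yyyy-mm-dd` dates in column 6 sort lexicographically.

```shell
# List /tmp entries older than a cutoff date, then delete them.
# On a live cluster, replace the sample with:  hdfs dfs -ls /tmp
ls_output='drwxrwxrwt   - mapred hadoop 0 2014-05-01 10:00 /tmp/old-job
drwxrwxrwt   - mapred hadoop 0 2014-06-25 09:30 /tmp/recent-job'

cutoff='2014-06-01'
# Column 6 of `hdfs dfs -ls` is the modification date (yyyy-mm-dd),
# so a plain string comparison doubles as a date comparison.
old_dirs=$(echo "$ls_output" | awk -v c="$cutoff" '$6 < c {print $NF}')
echo "$old_dirs"

# for d in $old_dirs; do hdfs dfs -rm -r -skipTrash "$d"; done   # uncomment to delete
```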
Tuesday, April 8, 2014
Hortonworks HDP - Hive and Pig interface
The Hortonworks HCatalog interface is pretty useful for quick ingest and query. Also check out the quick-start slides on how to start the sandbox VM.
http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/#loading
Monday, March 10, 2014
How to run a simple mapreduce job to verify the cluster health?
Many times you want to check whether the cluster is in good shape. Run this command, which does not require much setup:
$ hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples-*.jar pi 10 1000
Tuesday, February 18, 2014
HDFS Data Federation
My client has a requirement to restrict certain Hadoop directories based on user/group. This seems to be a common issue where you want to isolate portions of HDFS, say for a department or a set of users.
A federated NameNode sounds like a perfect solution for this; however, it has issues and limitations. It does not work well with shared network storage. Additionally, I would like to know who is using this in production.
One alternative is to use POSIX-style permissions (owner-group-others) and restrict certain directories to groups. This provides basic security, but it is harder to manage and does not provide clean isolation the way a federated NameNode does.
Some links to read:
http://www.slideshare.net/huguk/hdfs-federation-hadoop-summit2011
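For reference, the POSIX-permission alternative looks something like this. The department directory, owner, and group name are hypothetical, and the group must exist in the NameNode's user/group mapping:

```shell
# On a cluster (paths and names are made-up examples):
#   hadoop fs -mkdir -p /data/finance
#   hadoop fs -chown etl:finance /data/finance
#   hadoop fs -chmod 770 /data/finance
# HDFS follows the POSIX permission model, so the resulting mode
# is the familiar 770 (owner+group rwx, others: no access).
# Demonstrated locally, since the bits behave the same way:
dir=$(mktemp -d)
chmod 770 "$dir"
stat -c '%a' "$dir"   # 770 on Linux
```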
Thursday, February 6, 2014
Yarn Command Line
Starting with Hadoop 2.0, we can use YARN to run MapReduce as an application.
- Running a WordCount using YARN:
mr.wordcount -libjars $MYLIBS/custom-libs.jar
- To get a list of all running applications/MapReduce jobs:
$ yarn application -list
- To get information on a specific MapReduce job:
$ yarn application -status application_133690873084_3466
- Reference: http://www.slideshare.net/martyhall/hadoop-tutorial-mapreduce-on-yarn-part-1-overview-and-installation
- http://www.coreservlets.com/hadoop-tutorial/#Tutorial-Intro
Tuesday, February 4, 2014
Avro DataSerialization in Hadoop
- Avro is a data serialization system
- http://avro.apache.org/
- Key advantage is that it supports schema evolution.
- Schema is stored along with data.
- Schema is expressed in JSON format.
- Both writer and reader have to define a schema to access avro files.
- This allows a good way to handle schema evolution.
- Youtube link: http://www.youtube.com/watch?v=EBV4C-P3G94
- IBM article: http://www.ibm.com/developerworks/library/bd-avrohadoop/
- MapReduce and Avro example: http://avro.apache.org/docs/current/mr.html#Example%3A+ColorCount
- Other links:
- Hive has a SerDe, so it can query data that is in Avro format: http://www.michael-noll.com/blog/2013/07/04/using-avro-in-mapreduce-jobs-with-hadoop-pig-hive/
- MSDN article on how to use this with C#: http://code.msdn.microsoft.com/Schema-Evolution-In-Avro-240f0a7a
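For illustration, here is a hypothetical Avro schema in JSON (record and field names are made up). Adding the optional `email` field with a default is the kind of change schema evolution handles: a reader with this schema can still consume data written before the field existed, because the default fills the gap.

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```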
Thursday, January 30, 2014
How To get Job Status?
There are several ways to get the status of your Hadoop jobs. I will discuss the Hadoop 2.0 options here:
- Hadoop cluster UI via the YARN Application Manager. Example: http://serverhostingyarn:8088/cluster/app/application_1390931417866_0020
- Job history files in HDFS: hadoop fs -ls /user/history/done/ | grep "1390931417866_0020"
- The yarn command line to get status
Tuesday, January 28, 2014
Appending Files in Hadoop
Can I really append to a file in HDFS, or do I always need to replace the file?
http://hadoop4mapreduce.blogspot.com/2012/08/two-methods-to-append-content-to-file.html
We had a client that wanted to merge data coming from the same source into a single large file. The challenge was how to handle concurrency, since the append process is not thread safe. The solution we used was to periodically batch all the small files into a larger file.
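The batching approach can be sketched as follows. The paths are made up, and the local `cat`/`rm` calls stand in for `hadoop fs -cat`, `hadoop fs -put`, and `hadoop fs -rm` on a real cluster:

```shell
# Merge many small incoming files into one batch file, then clear the source.
in_dir=$(mktemp -d)    # stands in for /data/incoming on HDFS
out_dir=$(mktemp -d)   # stands in for /data/merged on HDFS
printf 'rec1\n' > "$in_dir/part-0"
printf 'rec2\n' > "$in_dir/part-1"

# On HDFS:  hadoop fs -cat /data/incoming/part-* | hadoop fs -put - /data/merged/batch-001
cat "$in_dir"/part-* > "$out_dir/batch-001"
rm "$in_dir"/part-*

cat "$out_dir/batch-001"
```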
Compression Techniques for Hadoop
I am currently researching various options to do compression in Hadoop and found the following articles useful:
- http://comphadoop.weebly.com/index.html
- Slide share = http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2
- Splitable LZO = http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
- Read this Microsoft article: http://download.microsoft.com/download/1/C/6/1C66D134-1FD5-4493-90BD-98F94A881626/Compression%20in%20Hadoop%20(Microsoft%20IT%20white%20paper).docx
To use the LZO codec:
- A custom codec needs to be compiled on your environment (Linux)
- The library files need to be copied to all nodes
- Update the Hadoop site config file
- Additionally, you will need to refer to your codec in your MapReduce job
- If Hive is used, your table creation command has to refer to the LZO compression codec for both READ and WRITE
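The "Hadoop site config" step above usually means registering the codec classes in core-site.xml, roughly like this (the LZO class names come from the hadoop-lzo project; the exact codec list depends on your distribution):

```xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```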
Welcome to My Blog
Welcome to my Blog. Through this Blog, I will share real world scenarios, issues and solutions for Hadoop based projects.