Tuesday, December 29, 2015

$5 for any PacktPub book


Improve your skills during the holidays... https://www.packtpub.com/all/?search=hdinsight%20essentials

Monday, December 21, 2015

Apache Kylin - OLAP for Hadoop

http://kylin.apache.org eBay open sourced this project. http://www.ebaytechblog.com/2014/10/20/announcing-kylin-extreme-olap-engine-for-big-data/ Check out the web interface... http://kylin.apache.org/docs/tutorial/web.html

Sunday, December 13, 2015

Apache Atlas on Data Governance

Apache Atlas is trying to be the data governance solution on Hadoop. Its approach is mostly about integrating various metadata stores in one place, something that is lacking in many organizations. It is still in its early days of development. Currently it integrates with Hive through a metadata bridge and with Apache Ranger for policy enforcement.

Wednesday, October 21, 2015

How to run HDP on Azure

There are a few options if you are looking at Azure to host Hadoop:

  1. HDInsight - Microsoft's flavor of Hadoop (built on top of HDP). This provides good separation of storage (Azure Blob storage) from compute.
  2. HDP on Azure is a new option, where you get the real Hortonworks distribution spun up as VMs. Each data node both serves data and manages compute. However, in this model you cannot easily use Azure Blob storage. If you shut down the cluster, the data is gone unless you script a copy out to Azure Blob storage, plus the reverse for when you want to bring the data back into Hadoop.
  3. Azure also supports another model that is useful for scenarios like an on-premises Hadoop cluster for day-to-day operations with backup to Azure Blob storage. This uses the HDFS abstraction. The demo below shows how to set up a feed with two data paths, primary and secondary. The secondary is on Azure Blob storage.

 

Tuesday, September 29, 2015

Microsoft expands Azure Data Lake

Key addition... The store in Azure Data Lake is HDFS compatible so Hadoop distributions like Cloudera, Hortonworks®, and MapR can readily access the data for processing and analytics. http://blogs.technet.com/b/dataplatforminsider/archive/2015/09/28/microsoft-expands-azure-data-lake-to-unleash-big-data-productivity.aspx

Wednesday, August 26, 2015

Managing CDC using Sqoop and Hive

http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
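
The core of this kind of incremental update is a reconcile step: combine the existing base records with the newly imported changes and keep only the latest version of each row. Below is a minimal sketch of that idea in plain Python. It is not the Sqoop or Hive code from the linked post; the record layout (id, modified date, value), the sample rows, and the function name are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (id, modified date as ISO text, value)
Record = Tuple[int, str, str]


def reconcile(base: Iterable[Record], incremental: Iterable[Record]) -> List[Record]:
    """Merge base and incremental records, keeping the newest row per id."""
    latest: Dict[int, Record] = {}
    for record in list(base) + list(incremental):
        key = record[0]
        current = latest.get(key)
        # ISO dates in the same format compare correctly as plain strings.
        # On a tie the later-seen (incremental) record wins.
        if current is None or record[1] >= current[1]:
            latest[key] = record
    return sorted(latest.values())


# Sample data (made up): id 2 was updated and id 3 is new.
base = [(1, "2015-08-01", "alice"), (2, "2015-08-01", "bob")]
incremental = [(2, "2015-08-20", "bobby"), (3, "2015-08-21", "carol")]

merged = reconcile(base, incremental)
for row in merged:
    print(row)
```

In Hive the same effect is usually achieved with a query or view over the base and incremental tables rather than in application code.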

Thursday, July 30, 2015

Full SQL on Hadoop? Splice machine

http://www.zdnet.com/article/full-sql-on-hadoop-splice-machine-opens-up-its-database-for-trials/

Hive Authorization Model

From Hive 0.13 onwards, Hive has an Oracle-style User - Role - Privileges model for authorization https://hadooptutorial.info/hive-authorization-models-and-hive-security/
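
To make the User - Role - Privileges idea concrete, here is a toy model in Python: privileges are granted to roles, roles are granted to users, and a user is allowed an action only through one of their roles. This is only a sketch of the concept, not how Hive implements it; the class, method, user, role, and table names are all hypothetical. In Hive itself the same relationships are set up with CREATE ROLE and GRANT statements.

```python
from collections import defaultdict


class Authorizer:
    """Toy User - Role - Privileges model (illustrative only, not Hive code)."""

    def __init__(self):
        self.role_privileges = defaultdict(set)  # role -> {(privilege, table)}
        self.user_roles = defaultdict(set)       # user -> {role}

    def grant_privilege(self, role, privilege, table):
        """Give a role a privilege (for example SELECT) on a table."""
        self.role_privileges[role].add((privilege.upper(), table))

    def grant_role(self, user, role):
        """Make a user a member of a role."""
        self.user_roles[user].add(role)

    def is_allowed(self, user, privilege, table):
        """A user is allowed if any of their roles holds the privilege."""
        wanted = (privilege.upper(), table)
        roles = self.user_roles.get(user, set())
        return any(wanted in self.role_privileges.get(role, set()) for role in roles)


auth = Authorizer()
auth.grant_privilege("analyst", "select", "sales")  # hypothetical role and table
auth.grant_role("maria", "analyst")                 # hypothetical user

print(auth.is_allowed("maria", "SELECT", "sales"))
print(auth.is_allowed("maria", "INSERT", "sales"))
print(auth.is_allowed("unknown_user", "SELECT", "sales"))
```

The point of the indirection is that privileges are managed per role, so adding or removing a user never requires touching table-level grants.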

Tuesday, June 9, 2015

Hadoop Summit 2015 San Jose Keynotes

Excellent kickoff and very informative keynotes. Key takeaways:

  • The Hadoop data platform is the next-generation "Data Operating System"
  • Atlas from Hortonworks is the “metadata
  • Real-time was last year; today it is about “predictive”. Can we anticipate violations and prevent them?
  • Data Scientist = Explore data and build model

Monday, June 1, 2015

Hadoop Data Lake for Healthcare

Good article from Hortonworks on Data Lake and Healthcare http://hortonworks.com/blog/hdp-for-healthcare-providers-common-data-challenges/?mkt_tok=3RkMMJWWfF9wsRonuazKdu%2FhmjTEU5z17%2BwoUKe0hIkz2EFye%2BLIHETpodcMTcNlMr7YDBceEJhqyQJxPr3AKNkNy9RxRhHqDg%3D%3D

Monday, April 13, 2015

HDFS permissions explained

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html In several scenarios, you might want to change the default behavior. The key property to change in hdfs-site.xml is fs.permissions.umask-mode = 022 (note there is a bug in Apache Hadoop where it does not accept the 4-digit form as advertised, so use 3 digits).
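
To see what a umask of 022 actually does, the small Python sketch below works through the arithmetic. It assumes the usual starting modes of 666 for new files and 777 for new directories, with the umask clearing the bits it names; the function name is just for illustration.

```python
def apply_umask(requested: int, umask: int) -> int:
    """Return the permission bits left after clearing every bit set in umask."""
    return requested & ~umask & 0o777


UMASK = 0o022  # the value suggested for fs.permissions.umask-mode

# Assumed starting modes: 666 for new files, 777 for new directories.
file_mode = apply_umask(0o666, UMASK)
dir_mode = apply_umask(0o777, UMASK)

print(f"files: {file_mode:03o}, directories: {dir_mode:03o}")
# prints "files: 644, directories: 755"
```

So with 022, the owner keeps full access while group and others lose write permission. A stricter value such as 077 would remove all access for group and others.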

Integrating Tableau, Hive and Elasticsearch

Good article on supporting a highly interactive Tableau dashboard with Hive and Elasticsearch http://ryrobes.com/systems/connecting-tableau-to-elasticsearch-read-how-to-query-elasticsearch-with-hive-sql-and-hadoop/

Monday, April 6, 2015

Microsoft buys Revolution Analytics

Microsoft is getting aggressive on Machine Learning and has acquired Revolution Analytics. They plan to integrate this with HDInsight. http://blogs.technet.com/b/machinelearning/archive/2015/04/06/microsoft-closes-acquisition-of-revolution-analytics.aspx

Saturday, April 4, 2015

Ecosystem for Data Scientists

http://www.computerworld.com/article/2902920/the-data-science-ecosystem-part-2-data-wrangling.html

Monday, March 30, 2015

How to create HTML emails in Gmail

Today I am going to discuss something that is not related to Hadoop but to Gmail: how to send rich text content using Gmail. Follow this YouTube video

Thursday, February 26, 2015

Oracle Data Integrator

I have been researching Oracle Data Integrator and its strategy on Big Data integration. I'll add further slides; the one below may be a good start.

Saturday, February 21, 2015

Azure HDInsight now runs on Linux

This is great news from Microsoft: HDInsight can now run on Linux servers. This allows easier migration of current data-center-based Hadoop implementations to Microsoft's hosted cloud solution. Below are slides from Microsoft's presentation at Strata http://cdn.oreillystatic.com/en/assets/1/event/118/Running%20Hadoop-as-a-Service%20in%20the%20Cloud%20Presentation.pptx

What's coming in Spark 2015

Architectural Considerations for Hadoop

A really great read for architects trying to understand which components to use when building a Hadoop solution http://cdn.oreillystatic.com/en/assets/1/event/118/Architectural%20Considerations%20for%20Hadoop%20Applications%20Presentation.pdf Check out this link for other conference slides and videos http://strataconf.com/big-data-conference-ca-2015/public/schedule/proceedings

Thursday, February 19, 2015

HDInsight now supports Spark

http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-spark-install/

Tuesday, February 17, 2015

Hive SQL cheat sheet

Check this site for a simple comparison between SQL and HiveQL http://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/

Tuesday, February 10, 2015

Hive on Spark

Spark is becoming the next generation MapReduce framework for Hadoop. Hive is now able to run on Tez and Spark as well. See the slides below that detail the Hive on Spark plan

Thursday, February 5, 2015

BigData startups in India

http://yourstory.com/2015/02/indian-big-data-companies-startups/

Wednesday, January 21, 2015

Hadoop and Security

This is a long-standing topic with no really good solutions; Hortonworks has published this article about Ranger + Dataguise http://hortonworks.com/blog/hadoop-security-different-paradigm/?mkt_tok=3RkMMJWWfF9wsRovuq%2FOZKXonjHpfsX66%2B8uWaW%2BlMI%2F0ER3fOvrPUfGjI4JSsJhI%2BSLDwEYGJlv6SgFT7TMMbFh1rgNUxc%3D