Tips and tricks for building a Hadoop ecosystem. References to good articles on Hadoop-based solutions. Topics include: Hadoop architecture, Hive, SQL on Hadoop, compression, metadata.
Wednesday, December 30, 2015
Tuesday, December 29, 2015
$5 for any PacktPub book
Improve your skills during the holidays... https://www.packtpub.com/all/?search=hdinsight%20essentials
Monday, December 21, 2015
Apache Kylin - OLAP for Hadoop
http://kylin.apache.org
eBay open sourced this.
http://www.ebaytechblog.com/2014/10/20/announcing-kylin-extreme-olap-engine-for-big-data/
Check out the web interface...
http://kylin.apache.org/docs/tutorial/web.html
Sunday, December 13, 2015
Apache Atlas on Data Governance
Apache Atlas is trying to be the Data Governance solution on Hadoop. Its approach is mainly about integrating various metadata stores in one place, something that is lacking in several organizations. It is still in its early days of development. Currently it integrates with Hive through a metadata bridge and with Apache Ranger for policy enforcement.
Wednesday, October 21, 2015
How to run HDP on Azure
There are a few options if you are looking at Azure to host Hadoop:
- HDInsight, which is Microsoft's flavor of Hadoop (built on top of HDP). This provides good separation of storage (Azure Blob storage) from compute.
- HDP on Azure is a new option, where you get a real Hortonworks distribution spun up as VMs. Each data node both serves data and manages compute. However, in this model you cannot easily use Azure Blob storage. If you shut down the cluster, the data is gone unless you script a copy of the storage out to Azure Blob storage, and the reverse for when you want to bring the data back into Hadoop.
- Azure also supports another model that is useful for scenarios such as an on-premises Hadoop cluster for day-to-day operations with backup to Azure Blob storage. This utilizes the HDFS abstraction. The demo below shows how to set up a feed with two data paths, primary and secondary. The secondary is on Azure Blob storage.
Tuesday, September 29, 2015
Microsoft expands Azure Data Lake
Key addition... The store in Azure Data Lake is HDFS compatible so Hadoop distributions like Cloudera, Hortonworks®, and MapR can readily access the data for processing and analytics.
http://blogs.technet.com/b/dataplatforminsider/archive/2015/09/28/microsoft-expands-azure-data-lake-to-unleash-big-data-productivity.aspx
Wednesday, August 26, 2015
Managing CDC using Sqoop and Hive
http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
Thursday, July 30, 2015
Full SQL on Hadoop? Splice machine
http://www.zdnet.com/article/full-sql-on-hadoop-splice-machine-opens-up-its-database-for-trials/
Hive Authorization Model
From Hive 0.13 onwards, Hive has an Oracle-style user, role, and privilege model for authorization.
https://hadooptutorial.info/hive-authorization-models-and-hive-security/
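To make the user, role, and privilege idea concrete, here is a toy in-memory sketch in Python. It is not Hive's implementation; the class `RoleBasedAuthorizer`, its method names, and the sample users and tables are all made up for illustration. The point is only that privileges are granted to roles, roles are granted to users, and a user is allowed an action if any of their roles carries that privilege.

```python
from collections import defaultdict


class RoleBasedAuthorizer:
    """Toy model (hypothetical, not Hive code): users get roles, roles get privileges."""

    def __init__(self):
        self.role_privileges = defaultdict(set)  # role -> {(privilege, object)}
        self.user_roles = defaultdict(set)       # user -> {role}

    def grant_privilege(self, role, privilege, obj):
        # Attach a privilege such as SELECT on a table to a role.
        self.role_privileges[role].add((privilege.upper(), obj))

    def grant_role(self, user, role):
        # Attach a role to a user.
        self.user_roles[user].add(role)

    def is_allowed(self, user, privilege, obj):
        # A user is allowed if at least one of their roles holds the privilege.
        wanted = (privilege.upper(), obj)
        return any(
            wanted in self.role_privileges.get(role, ())
            for role in self.user_roles.get(user, ())
        )


auth = RoleBasedAuthorizer()
auth.grant_privilege("analyst", "select", "sales")
auth.grant_role("alice", "analyst")

print(auth.is_allowed("alice", "SELECT", "sales"))  # True
print(auth.is_allowed("alice", "INSERT", "sales"))  # False
print(auth.is_allowed("bob", "SELECT", "sales"))    # False
```

In Hive itself the same relationships are set up with role and GRANT statements rather than method calls; see the linked article for the actual syntax.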
Tuesday, June 9, 2015
Hadoop Summit 2015 San Jose Keynotes
Excellent kickoff and very informative keynotes. Key takeaways:
- The Hadoop data platform is the next-generation "Data Operating System"
- Atlas from Hortonworks is the “metadata
- Real-time was last year; today it is about "predictive". Can we anticipate violations and prevent them?
- Data Scientist = explore data and build models
Monday, June 1, 2015
Hadoop Data Lake for Healthcare
Good article from Hortonworks on Data Lake and Healthcare
http://hortonworks.com/blog/hdp-for-healthcare-providers-common-data-challenges/?mkt_tok=3RkMMJWWfF9wsRonuazKdu%2FhmjTEU5z17%2BwoUKe0hIkz2EFye%2BLIHETpodcMTcNlMr7YDBceEJhqyQJxPr3AKNkNy9RxRhHqDg%3D%3D
Monday, April 13, 2015
HDFS permissions explained
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html
In several scenarios, you might want to change the default behavior. The key property to change in hdfs-site.xml is:
fs.permissions.umask-mode = 022 (note: there is a bug in Apache Hadoop where it does not accept the 4-digit form as advertised, so use 3 digits)
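To see what a umask of 022 does in practice, here is a minimal Python sketch. It assumes the usual behavior where new files start from mode 666 and new directories from mode 777 before the mask is applied; the helper name `apply_umask` is made up for illustration and is not part of Hadoop.

```python
def apply_umask(base_mode: int, umask: int) -> str:
    """Clear the umask bits from base_mode and return a 3-digit octal string."""
    return format(base_mode & ~umask, "03o")


UMASK = 0o022  # the fs.permissions.umask-mode value from the post

file_mode = apply_umask(0o666, UMASK)  # assumed starting mode for files: rw-rw-rw-
dir_mode = apply_umask(0o777, UMASK)   # assumed starting mode for directories: rwxrwxrwx

print("files:", file_mode)        # files: 644
print("directories:", dir_mode)   # directories: 755
```

So with 022, the owner keeps full access while group and others lose write permission. A stricter mask such as 077 would leave access to the owner only.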
Integrating Tableau, Hive and Elasticsearch
Good article on supporting a highly interactive Tableau dashboard with Hive and Elasticsearch
http://ryrobes.com/systems/connecting-tableau-to-elasticsearch-read-how-to-query-elasticsearch-with-hive-sql-and-hadoop/
Monday, April 6, 2015
Microsoft buys Revolution Analytics
Microsoft is getting aggressive on Machine Learning and has acquired Revolution Analytics. They plan to integrate this with HDInsight.
http://blogs.technet.com/b/machinelearning/archive/2015/04/06/microsoft-closes-acquisition-of-revolution-analytics.aspx
Saturday, April 4, 2015
Ecosystem for Data Scientists
http://www.computerworld.com/article/2902920/the-data-science-ecosystem-part-2-data-wrangling.html
Monday, March 30, 2015
How to create HTML emails in Gmail
Today I am going to discuss something that is related not to Hadoop but to Gmail: how to send rich text content using Gmail.
Follow this YouTube video
Thursday, February 26, 2015
Oracle Data Integrator
I have been researching Oracle Data Integrator and its strategy for Big Data integration. I'll add further slides; the one below may be a good start.
Monday, February 23, 2015
Saturday, February 21, 2015
Azure HDInsight now runs on Linux
This is great news from Microsoft: HDInsight can now run on Linux servers. This allows easier migration of current data-center-driven Hadoop implementations to Microsoft's hosted cloud solution. Below are slides from Microsoft's presentation at Strata.
http://cdn.oreillystatic.com/en/assets/1/event/118/Running%20Hadoop-as-a-Service%20in%20the%20Cloud%20Presentation.pptx
Architectural Considerations for Hadoop
A really great read for architects trying to understand which components to use when building a Hadoop solution
http://cdn.oreillystatic.com/en/assets/1/event/118/Architectural%20Considerations%20for%20Hadoop%20Applications%20Presentation.pdf
Check out this link for other conference slides and videos
http://strataconf.com/big-data-conference-ca-2015/public/schedule/proceedings
Thursday, February 19, 2015
HDInsight now supports Spark
http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-spark-install/
Tuesday, February 17, 2015
Hive SQL cheat sheet
Check this site for a simple comparison between SQL and HiveQL
http://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/
Tuesday, February 10, 2015
Hive on Spark
Spark is becoming the next-generation MapReduce framework for Hadoop. Hive is now able to run on Tez and Spark as well. See the slides below, which detail the Hive on Spark plan.
Thursday, February 5, 2015
Big Data startups in India
http://yourstory.com/2015/02/indian-big-data-companies-startups/
Friday, January 23, 2015
Wednesday, January 21, 2015
Hadoop and Security
This is a very old topic with no really good solutions; Hortonworks has published this article about Ranger + Dataguise.
http://hortonworks.com/blog/hadoop-security-different-paradigm/?mkt_tok=3RkMMJWWfF9wsRovuq%2FOZKXonjHpfsX66%2B8uWaW%2BlMI%2F0ER3fOvrPUfGjI4JSsJhI%2BSLDwEYGJlv6SgFT7TMMbFh1rgNUxc%3D