Hadoop and Azure HDInsight
Azure HDInsight is Azure’s version of Hadoop as a service. It lives in the cloud, just like other Azure services, and it’s a managed service so we don’t have to worry about some of the maintenance that’s required with Hadoop cluster.
Underneath, Azure HDInsight uses Hortonworks Data Platform (HDP)’s Hadoop components.
Each Azure HDInsight version has its own cloud distribution of HDP along with other components. Different version of HDInsight will have different version of HDP. See the reference link for technology stack and its version.
When you create Azure HDInsight, you will be asked to choose the cluster type. The cluster type is the Hadoop technology you would want to use, Hive, Spark, Storm, etc. More cluster types are being added. To see what’s currently supported, see the reference link.
Azure HDInsight can be a great data warehouse solution that lives in the cloud.
Azure HDInsight and Databricks
While Azure HDInsight is a fully managed service, there are still some management we as a user have to do. HDInsight also supports Azure Data Lake Storage and Apache Ranger integration. The sort of downside to HDInsight is it doesn’t have auto-scale and you can’t pause the deployment. This means, you will pay for the cost as long as the service lives. The typical model is to spin the service up whenever it’s needed, compute the data, store it in a permanent storage and kills the service.
This is as opposed to Databricks, which is another data warehouse solution offered by Azure, Databricks can be auto-scaled. Databricks, however, is less about ETL process and more of processing the data for analytics, machine learning and the likes. Needless to say, it has built-in library for this purpose.
The language support is also different. Language support in HDInsight depends on what cluster type you choose when you spin up the service, for example, Hive will support HiveQL (SQL-like language) in its Hive editor. Databricks supports Python, Scala, R, SQL and many others.