Data Warehouse Solutions in Azure

Data Warehousing Solutions at a Glance

With today’s big data requirements, where data can be structured or unstructured, arrive as batches or streams, and come in many other forms and sizes, a traditional data warehouse is not going to cut it.

Typically, data moves through four stages:

  • Ingest
  • Store
  • Process
  • Consume

Different technologies are required at different stages. The choice also depends heavily on the size and form of the data and the four Vs: volume, variety, velocity, and veracity.

The choice of solution sometimes also depends on:

  • Ease of management
  • Team skill sets
  • Language
  • Cost
  • Specification / requirements
  • Integration with existing / other systems

Azure Services

Azure offers many services for data warehouse solutions. Traditionally, a data warehouse has been an ETL process plus relational database storage, like SQL Data Warehouse. Today, that is not always the case.

Some of the Azure services for data warehousing:

  • Azure HDInsight
    Azure offers various cluster types that come with HDInsight. The clusters are managed by Microsoft but still require some management from users. HDInsight also supports Data Lake Storage (more about HDInsight below). HDInsight sits in the “Process” stage.
  • Azure Databricks
    Its support for machine learning, AI, analytics, and stream / graph processing makes it a go-to solution for data processing (see the PySpark sketch after this list). It’s also fully integrated with Power BI and other source / destination tools. Notebooks in Databricks allow collaboration between data engineers, data scientists, and business users. Compare it to HDInsight below.
  • Azure Data Factory
    The “Ingest” stage. Its function is to bring data in and move it between different systems. Azure Data Factory supports pipelines across Azure services to connect data, including on-premises data, and it can be used to control the flow of data.
  • Azure SQL Data Warehouse
    Typically the end destination of data, to be consumed by business users. SQL DW is a platform as a service: it requires less management from users and is great for teams already familiar with T-SQL and SSMS (SQL Server Management Studio). You can also scale it dynamically and pause / resume the compute. SQL DW uses internal storage to store data and includes the compute component. SQL Data Warehouse sits in the “Consume” stage.
  • Database services (RDBMS, Cosmos DB, etc.)
    SQL Database, other relational database systems, and Cosmos DB are part of the storage solutions offered in Azure. These are typically more expensive than Azure Storage but also offer other features. Database services are part of the “Store” stage.
  • Azure Data Lake Storage
    Built on top of Azure Storage, ADLS offers unlimited storage and an HDFS-based file system, which makes it well suited for analytics platforms like Hadoop or HDInsight. ADLS is part of the “Store” stage.
  • Azure Data Lake Analytics
    ADLA is a high-level abstraction of HDInsight. Users do not need to worry about scaling or managing clusters at all; it scales instantly per job. However, this also comes with some limitations. ADLA supports U-SQL, a SQL-like language that allows custom user-defined functions in C#. The tooling is also what developers are already familiar with: Visual Studio.
  • Azure Storage
  • Azure Analysis Services
  • Power BI
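To make the stages concrete, here is a minimal PySpark sketch (the kind of code you would run in a Databricks notebook, as referenced in the Databricks bullet above) that reads raw data from Data Lake Storage, aggregates it, and writes the result back for downstream consumers. The storage account, container, and column names are all made up:

    # Minimal "Process" stage sketch for a Databricks (PySpark) notebook.
    # Storage account, container, and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sales-aggregation").getOrCreate()

    # Read raw data that, say, Azure Data Factory landed in ADLS ("Store" stage).
    raw = spark.read.json("abfss://raw@mystorageaccount.dfs.core.windows.net/sales/")

    # Aggregate per region ("Process" stage).
    summary = (raw.groupBy("region")
                  .agg(F.sum("amount").alias("total_amount"),
                       F.count("*").alias("order_count")))

    # Persist for downstream consumers such as SQL DW or Power BI ("Consume" stage).
    summary.write.mode("overwrite").parquet(
        "abfss://curated@mystorageaccount.dfs.core.windows.net/sales_summary/")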

Which one to use?

There’s no right or wrong answer. The right solution depends on many other things, technical and non-technical, as well as the considerations mentioned above.

Simon Lidberg and Benjamin Wright Jones have a really good presentation on this topic; see the reference link for their full talk. But, basically, the decision flowchart looks like this:

[Flowchart: data-warehouse-solutions-in-azure]

Reference

https://myignite.techcommunity.microsoft.com/sessions/66581

What is Azure HDInsight?

Hadoop and Azure HDInsight

Azure HDInsight is Azure’s version of Hadoop as a service. It lives in the cloud, just like other Azure services, and it’s a managed service, so we don’t have to worry about some of the maintenance that a Hadoop cluster normally requires.

Under the hood, Azure HDInsight uses Hadoop components from the Hortonworks Data Platform (HDP).

Each Azure HDInsight version has its own cloud distribution of HDP along with other components, so different versions of HDInsight ship with different versions of HDP. See the reference links for the technology stack and versions.

When you create an Azure HDInsight instance, you will be asked to choose a cluster type. The cluster type is the Hadoop technology you want to use: Hive, Spark, Storm, etc. More cluster types are being added; to see what’s currently supported, see the reference links.

Azure HDInsight can be a great data warehouse solution that lives in the cloud.

Azure HDInsight and Databricks

While Azure HDInsight is a fully managed service, there is still some management we as users have to do. HDInsight also supports Azure Data Lake Storage and Apache Ranger integration. The downside to HDInsight is that it doesn’t auto-scale and you can’t pause the deployment, which means you pay for the service as long as it lives. The typical model is to spin the service up whenever it’s needed, process the data, store the results in permanent storage, and kill the service.

This is as opposed to Databricks, another data solution offered in Azure, which can auto-scale. Databricks, however, is less about the ETL process and more about processing data for analytics, machine learning, and the like. Needless to say, it has built-in libraries for these purposes.

The language support is also different. Language support in HDInsight depends on the cluster type you choose when you spin up the service; for example, a Hive cluster supports HiveQL (a SQL-like language) in its Hive editor. Databricks supports Python, Scala, R, SQL, and many others.
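For example, the same aggregation can be expressed either as a SQL-like query (much like you would write in HiveQL on a Hive cluster) or through the DataFrame API in a Databricks notebook, where the spark session is provided for you. The table and column names here are made up:

    from pyspark.sql import functions as F

    # SQL-like query, similar in spirit to HiveQL on an HDInsight Hive cluster.
    top_regions = spark.sql("""
        SELECT region, SUM(amount) AS total_amount
        FROM sales
        GROUP BY region
        ORDER BY total_amount DESC
    """)

    # The equivalent using the PySpark DataFrame API.
    top_regions_df = (spark.table("sales")
                           .groupBy("region")
                           .agg(F.sum("amount").alias("total_amount"))
                           .orderBy(F.desc("total_amount")))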

Reference

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning

https://docs.microsoft.com/en-us/azure/hdinsight/

What is Hadoop?

Hadoop and Distributed System

If you were a basketball player, all you would need to know is dribbling and shooting. With those two skills, you could play basketball by yourself really well. But if you were to play on a team, you would struggle. To have a successful basketball team, you have to team up with other players. But then you also need to learn a new skill: passing. And you need a coach to coordinate everyone.

The same is true of monolithic vs. distributed systems. In a monolithic system, all you have is one giant supercomputer with a large amount of memory, storage, and compute power. In a distributed system, you don’t need a supercomputer; instead you have multiple, perhaps less powerful, computers. Just as each player on a basketball team has to learn to pass, each computer now has to talk to the others, and there is software to coordinate them.

This is what Hadoop is for. It’s a system to coordinate and orchestrate a cluster of computers, called nodes, in a distributed system. Hadoop is like the coach of the basketball team.

Hadoop does a lot of heavy lifting, such as:

  • partitioning data
  • coordinating compute tasks
  • fault tolerance
  • allocating capacity to processes / jobs
  • monitoring
  • security
  • APIs

The logical components of Hadoop, HDFS, MapReduce, and YARN, are what Hadoop uses to do the heavy lifting. These components are, essentially, the storage (HDFS), the programming model (MapReduce), and the resource manager (YARN).
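To make the programming model concrete, here is the classic word count written as a Hadoop Streaming mapper and reducer in Python. This is a minimal sketch: HDFS holds the input files, YARN schedules the tasks, and Hadoop sorts the mapper output by key before it reaches the reducer.

    # mapper.py -- emit "word<TAB>1" for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- sum the counts per word. Hadoop delivers the mapper
    # output sorted by key, so all lines for one word arrive consecutively.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")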

In big data processing, some crucial requirements are to:

  • store
  • process and
  • scale

These requirements are about being able to store, process, and analyze data efficiently and in a timely manner. Hadoop is a perfect solution for big data processing.

And because Hadoop handles most of the cluster management for developers, we can focus on actually doing the work: building models, processing data, reporting, analyzing, etc. The details of cluster management are abstracted away.

What’s really cool about Hadoop is also its ecosystem. A lot of tools and technologies have been built on top of Hadoop. Some of the popular ones are Hive, HBase, Spark, Pig, Flume / Sqoop, Storm, Oozie, and many more. But that’s for another day.

Optimistic Concurrency vs. Eventual Consistency

Optimistic Concurrency

Optimistic concurrency uses less strict locking to support more simultaneous access. Multiple users are able to perform actions on the same resource without locking each other out; for example, one user can write without blocking another user who is reading the same resource. Some actions will still lock the resource exclusively, for example, a schema change.
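A common way to implement this is a version (or ETag) check at write time: a write succeeds only if the resource is still at the version the writer originally read. A minimal sketch over a made-up in-memory row:

    # Optimistic concurrency via a version check (made-up in-memory "row").
    row = {"value": "draft", "version": 1}

    def read(row):
        # Readers take no lock; they just remember the version they saw.
        return row["value"], row["version"]

    def write(row, new_value, expected_version):
        # The write succeeds only if nobody changed the row since it was read.
        if row["version"] != expected_version:
            raise RuntimeError("conflict: row was modified by another user")
        row["value"] = new_value
        row["version"] += 1

    value, version = read(row)    # user A reads version 1
    write(row, "final", version)  # user A writes; version becomes 2
    write(row, "stale", version)  # user B still holds version 1 -> conflict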

Pessimistic Concurrency

Pessimistic concurrency is the opposite: stricter locking is used. When a user is performing an action that requires a lock, other users can’t do anything that would conflict with the lock until it is released by its owner (the first user).
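A pessimistic sketch, by contrast, takes an exclusive lock up front and holds it for the whole operation, so conflicting users simply wait (names again made up):

    import threading

    row_lock = threading.Lock()    # one exclusive lock guarding the resource
    row = {"value": "draft"}

    def write_pessimistic(new_value):
        with row_lock:                  # conflicting actions block here...
            row["value"] = new_value    # ...until the owner releases the lock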

Eventual Consistency

Eventual consistency guarantees availability more than it guarantees consistent data. This is achieved by prioritizing availability (not locking the resource) over waiting for the data to replicate across the system.
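A toy sketch of this trade-off: a write is acknowledged as soon as the primary node has it, a replica catches up in the background, and a read from the replica can briefly return stale data (all names are made up):

    import queue, threading, time

    primary = {}                  # node that accepts writes
    replica = {}                  # node that serves reads
    replication_log = queue.Queue()

    def write(key, value):
        primary[key] = value                # acknowledged immediately: available
        replication_log.put((key, value))   # replication happens asynchronously

    def replicate_forever():
        while True:
            key, value = replication_log.get()
            time.sleep(0.1)                 # simulated replication lag
            replica[key] = value            # the replica eventually converges

    threading.Thread(target=replicate_forever, daemon=True).start()

    write("color", "blue")
    print(replica.get("color"))   # likely None: the replica is still stale
    time.sleep(0.3)
    print(replica.get("color"))   # "blue": consistent, eventually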

Strong Consistency

The opposite of eventual consistency: it prioritizes consistent data across the system over availability.

BASE

Eventual consistency is classified under BASE (Basically Available, Soft state, Eventual consistency) semantics, as opposed to the ACID principle.

Conclusion

Eventual / strong consistency is similar to optimistic / pessimistic concurrency. The difference is that the terms eventual / strong consistency are usually used for distributed systems, while optimistic / pessimistic concurrency applies more at a lower level, to a single entity such as a database.

Azure Cosmos DB’s consistency levels, from strongest to weakest:
  • Strong
  • Bounded staleness
  • Session
  • Consistent prefix
  • Eventual
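With the azure-cosmos Python SDK, for example, you can request a consistency level when creating the client. This is a minimal sketch: the endpoint and key are placeholders, and a client can relax, but not strengthen, the account’s default level:

    from azure.cosmos import CosmosClient

    # Placeholder endpoint and key; "Session" is one of the five levels above.
    client = CosmosClient(
        url="https://my-account.documents.azure.com:443/",
        credential="<account-key>",
        consistency_level="Session",
    )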