
Data Warehouse Solutions in Azure

Data Warehousing Solutions at a Glance

With today’s big data requirements, where data can be structured or unstructured, arrive in batches or streams, and come in many other forms and sizes, a traditional data warehouse is not going to cut it.

Typically, there are four data stages:

  • Ingest
  • Store
  • Process
  • Consume

Different technologies are required at different stages. The choice also depends heavily on the size and form of the data and the 4 Vs: Volume, Variety, Velocity, and Veracity.

Choosing a solution sometimes also depends on:

  • Ease of management
  • Team skill sets
  • Language
  • Cost
  • Specification / requirements
  • Integration with existing / other systems

Azure Services

Azure offers many services for data warehouse solutions. Traditionally, a data warehouse has been an ETL process plus relational database storage, like SQL Data Warehouse. Today, that may not always be the case.

Some of the Azure services for data warehousing:

  • Azure HDInsight
    Azure offers various cluster types through HDInsight, managed by Microsoft but still requiring some management from users. It also supports Data Lake Storage. More about HDInsight below. HDInsight sits in the “Process” stage.
  • Azure Databricks
    Its support for machine learning, AI, analytics, and stream / graph processing makes it a go-to solution for data processing. It is also fully integrated with Power BI and other source / destination tools. Notebooks in Databricks allow collaboration between data engineers, data scientists, and business users. See the comparison with HDInsight below.
  • Azure Data Factory
    The “Ingest” part of the data stages. Its function is to bring data in and move it between different systems. Azure Data Factory supports pipelines across Azure services to connect data, including on-premises data, and can be used to control the flow of data.
  • Azure SQL Data Warehouse
    Typically the end destination of data, to be consumed by business users. SQL DW is a platform as a service; it requires less management from users and is great for teams already familiar with T-SQL and SSMS (SQL Server Management Studio). You can also scale it dynamically and pause / resume the compute. SQL DW uses internal storage to store data and includes the compute component. SQL Data Warehouse sits in the “Consume” stage.
  • Database services (RDBMS, Cosmos, etc)
    SQL Database, other relational database systems, and Cosmos DB are part of the storage solutions offered in Azure. These are typically more expensive than Azure Storage but offer additional features. Database services are part of the “Store” stage.
  • Azure Data Lake Storage
    Built on top of Azure Storage, ADLS offers unlimited storage and an HDFS-based file system, allowing optimization for analytics workloads like Hadoop or HDInsight. ADLS is part of the “Store” stage.
  • Azure Data Lake Analytics
    ADLA is a high-level abstraction over HDInsight. Users do not need to worry about scaling or managing clusters at all; it scales instantly per job. However, this also comes with some limitations. ADLA supports U-SQL, a SQL-like language that allows custom user-defined functions in C#. The tooling is also what developers are already familiar with: Visual Studio.
  • Azure Storage
  • Azure Analysis Services
  • Power BI

Which one to use?

There’s no right or wrong answer. The right solution depends on many other things, technical and non-technical, as well as the considerations mentioned above.

Simon Lidberg and Benjamin Wright-Jones have a really good presentation around this topic; see the reference link for their full talk. Basically, their decision flowchart looks like this:

[Flowchart: data warehouse solutions in Azure]

Reference

https://myignite.techcommunity.microsoft.com/sessions/66581


Posted on January 20, 2019 in General

 


Entity Framework, .Net and SQL Server Table Valued Parameter

This is a step-by-step setup for using a SQL Server TVP (table-valued parameter) in a .Net application with EF (Entity Framework). In this example, I use SQL Server 2016 (SP2-CU3), .Net 4.5.1, and EF 6.2.0.

1. Create a table to store data.

CREATE TABLE [dbo].[Something] (
    [Id]            INT	IDENTITY(1,1)   NOT NULL,
    [Name]          VARCHAR(150)        NOT NULL,
    [Status]        CHAR(1)             NOT NULL,
    CONSTRAINT [PK_Something_Id] PRIMARY KEY CLUSTERED ([Id] ASC)
);
GO

2. Create a `User Defined Table Type` in SQL Server. For simplicity, in this example the type’s columns are the same as the table created in step 1. In the real world, the type’s columns could differ significantly from the table where we store the data; the type might even be used for joins with other tables.

CREATE TYPE [dbo].[udt_Something] AS TABLE (
    [Name]      VARCHAR(150)    NOT NULL,
    [Status]    CHAR(1)         NOT NULL
);
GO

3. Create a stored procedure that takes a parameter of the user-defined table type we created earlier and performs the necessary work to persist the data.

CREATE PROCEDURE [dbo].[sp_SaveSomething]
	@udt_Something [dbo].[udt_Something] READONLY
AS
BEGIN
    -- List columns explicitly: [Something] has an IDENTITY column the TVP does not
    INSERT INTO [dbo].[Something] ([Name], [Status])
        SELECT [Name], [Status]
        FROM @udt_Something
END
GO

4. Create an extension method to convert an `IEnumerable<T>` to a `DataTable`. To use a SQL TVP, we have to pass our parameter as a `DataTable`; this method converts our data to that type.
Helper.cs

using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Reflection;

namespace QC
{
    public static class Helper
    {
        public static DataTable ToDataTable<T>(this IEnumerable<T> enumerable, IEnumerable<string> orderedColumnNames)
        {
            var dataTable = new DataTable();

            // Get all properties of the object
            PropertyInfo[] properties = typeof(T).GetProperties(BindingFlags.Public | BindingFlags.Instance);
            PropertyInfo[] readableProperties = properties.Where(w => w.CanRead).ToArray();

            // Get column names: use the supplied order, or fall back to the properties' own order
            var columnNames = (orderedColumnNames ?? readableProperties.Select(s => s.Name)).ToArray();

            // Add columns to data table
            foreach (string name in columnNames)
            {
                dataTable.Columns.Add(name, readableProperties.Single(s => s.Name.Equals(name)).PropertyType);
            }

            // Add rows to data table from object
            foreach (T obj in enumerable)
            {
                dataTable.Rows.Add(columnNames.Select(s => readableProperties.Single(s2 => s2.Name.Equals(s)).GetValue(obj)).ToArray());
            }

            return dataTable;
        }
    }
}

5. For the purposes of this example, let’s say we want to save a collection of objects. This is our object definition.
Something.cs

namespace QC
{
    public class Something
    {
        public int Id { get; set; }
        public string Name { get; set; }
        public string Status { get; set; }
    }
}

6. Using EF, call the stored procedure we created and pass in a `SqlParameter` whose value is the collection of objects converted to a `DataTable`. Don’t forget to specify the parameter type as `Structured` and the type name as our `User Defined Table Type`.
DataAccess.cs

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;

namespace QC
{
    public class DataAccess
    {
        public void Save(IEnumerable<Something> data)
        {
            // Columns for ordering, the order of the columns must be the same as user defined table type
            var orderedCols = new[] { "Name", "Status" };

            // SQL parameter to pass to stored procedure
            var param = new SqlParameter("@udt_Something", SqlDbType.Structured);
            param.Value = data.ToDataTable(orderedCols);
            param.TypeName = "dbo.udt_Something";

            // QCDB is our EF entities
            using (var db = new QCDB())
            {
                // Call the stored procedure and pass in the table-valued parameter
                db.Database.ExecuteSqlCommand("EXEC dbo.sp_SaveSomething @udt_Something", param);
            }
        }
    }
}

7. Example of usage.
OpsController.cs

using System.Collections.Generic;
using System.Web.Http;

namespace QC
{
    public class OpsController : ApiController
    {
        public void SaveSomething()
        {
            var data = new List<Something>();
            data.Add(new Something { Id = 1, Name = "Chap", Status = "A" });
            data.Add(new Something { Id = 2, Name = "Stick", Status = "D" });

            var dataAccess = new DataAccess();
            dataAccess.Save(data);
        }
    }
}
 

Posted on January 18, 2019 in General

 


Unit Test Internal Members

Applies to .Net applications in Visual Studio 2010 and up.

Say I have a C# project (call it the QC project). In the project, I have a class with internal member(s), like this:

namespace QC
{
    public class PersonService
    {
        public bool PromotePerson(string userName)
        {
            // Implementation omitted
            return false;
        }

        internal bool FindPerson(string userName)
        {
            // Implementation omitted
            return false;
        }
    }
}

For unit tests, I created a separate project (call it the QCTest project). Because internal members can only be accessed within the same assembly, I won’t be able to invoke the `QC.PersonService.FindPerson(string userName)` method from the QCTest project.

To overcome this, in the `AssemblyInfo.cs` of the project being tested (in my case, the QC project), simply add:

using System.Runtime.CompilerServices;

[assembly: InternalsVisibleTo("QCTest")]

After rebuilding the solution, I am able to access `QC.PersonService.FindPerson(string userName)` from my QCTest project.
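For illustration, a minimal MS Test in the QCTest project might look like this (the assertion is a placeholder, since `FindPerson` above is only a stub):

PersonServiceTest.cs

using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace QCTest
{
    [TestClass]
    public class PersonServiceTest
    {
        [TestMethod]
        public void FindPersonIsCallableFromTestProject()
        {
            var service = new QC.PersonService();

            // This compiles only because QC declares InternalsVisibleTo("QCTest")
            var found = service.FindPerson("jdoe");

            Assert.IsFalse(found);
        }
    }
}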

As an added bonus, if you use a mocking library like Moq or Rhino Mocks and try to mock internal members, you will most likely also need to add:

[assembly: InternalsVisibleTo("DynamicProxyGenAssembly2")]

What is `DynamicProxyGenAssembly2`?

DynamicProxyGenAssembly2 is a temporary assembly built by mocking systems that use Castle DynamicProxy, such as Moq or NSubstitute. It is generated when the mock is needed and disposed of after the tests are finished.
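As a quick illustration, here is a hedged sketch of mocking an internal member with Moq; `IPersonRepository` is a hypothetical internal interface in the QC project (Moq can only mock interface or virtual members):

MockInternalTest.cs

using Microsoft.VisualStudio.TestTools.UnitTesting;
using Moq;

namespace QCTest
{
    [TestClass]
    public class MockInternalTest
    {
        [TestMethod]
        public void CanMockInternalInterface()
        {
            // QC.IPersonRepository is a hypothetical internal interface; mocking it works
            // only because QC declares InternalsVisibleTo for both "QCTest" and "DynamicProxyGenAssembly2"
            var repo = new Mock<QC.IPersonRepository>();
            repo.Setup(r => r.Exists("jdoe")).Returns(true);

            Assert.IsTrue(repo.Object.Exists("jdoe"));
        }
    }
}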

Reference: https://stackoverflow.com/a/17179923

 

Posted on January 18, 2019 in General

 


Unit Testing DTO Classes

The reason I unit test my DTO (Data Transfer Object) classes is that I pass these classes to a client app, and the client app might use them and expect certain properties in the response. Sometimes I change DTO properties used in the client app and forget to refactor the client app.
By unit testing my DTO classes, I get a reminder to refactor the client app whenever a DTO property (or its name) changes.

For example, I have Person DTO.

Person.cs

public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Age { get; set; }
}

To unit test it, I would create a PersonMirror DTO.

public class PersonMirror
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Age { get; set; }
}

Then use the following helper class to unit test the DTO.

public static class TestHelper
{
    public static void CompareType<T, TMirror>()
    {
        var errorMessage = "Type: {0}; Property: {1}";

        var instance = Activator.CreateInstance(typeof(T));
        var instanceType = instance.GetType();
        var mirror = Activator.CreateInstance(typeof(TMirror));
        var mirrorType = mirror.GetType();
        var mirrorProperties = mirrorType.GetProperties();

        // Remove 'Mirror' convention
        var mirrorTypeName = mirrorType.Name;
        if (mirrorType.Name.IndexOf("Mirror") > 0)
        {
            mirrorTypeName = mirrorType.Name.Substring(0, mirrorType.Name.IndexOf("Mirror"));
        }

        // Compare type name
        Assert.AreEqual(mirrorTypeName, instanceType.Name);

        // Compare properties
        foreach (var mirrorProperty in mirrorProperties)
        {
            var instanceProperty = instanceType.GetProperty(mirrorProperty.Name);

            Assert.IsNotNull(instanceProperty, string.Format(errorMessage, instanceType.FullName, mirrorProperty.Name));
        }
    }
}

Usage (using MS Test):
PersonTest.cs

[TestClass]
public class PersonTest
{
    [TestMethod]
    public void TestPersonDto()
    {
        TestHelper.CompareType<Person, PersonMirror>();
    }    
}
 

Posted on January 18, 2019 in General

 


WebApi .Net – Add DelegatingHandler to Get Request Body

Out of the box, the WebApi pipeline binds the request body to a parameter; therefore, in a WebApi controller (which inherits from ApiController), the `Request.Content` object is empty.

For example, passing this JSON in the request body:

[
    {
        "Name": "Test",
        "Status": "C"
    }
]

WebApi will bind it to the controller’s parameter; however, the `Request.Content` object is empty.

[Screenshot: Request.Content is empty in the controller]

Overcome this issue by creating a `DelegatingHandler` that adds the request body back.

RequestHandler.cs

using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

namespace QC
{
    public class RequestHandler : DelegatingHandler
    {
        protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
        {
            // Request body stream is emptied when WebApi bind request body to parameter, here we add original request body back so we can access it
            if (request.Content != null)
            {
                string body = await request.Content.ReadAsStringAsync();
                request.Properties["body"] = body;
            }

            return await base.SendAsync(request, cancellationToken);
        }
    }
}

Register in WebApiConfig.

WebApiConfig.cs

public static class WebApiConfig
{
    public static void Register(HttpConfiguration config)
    {
        config.MessageHandlers.Add(new RequestHandler());
    }
}

Voila! The request body is now accessible in the controller: the handler stored it in `Request.Properties["body"]`.

[Screenshot: request body available via the message handler]

Code is from here.
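For completeness, here is a minimal sketch of how a controller action could read the stored body (the action and the `Ok` payload are hypothetical):

OpsController.cs

using System.Web.Http;

namespace QC
{
    public class OpsController : ApiController
    {
        public IHttpActionResult Post()
        {
            // RequestHandler stored the raw body under the "body" key
            object rawBody;
            Request.Properties.TryGetValue("body", out rawBody);

            return Ok((string)rawBody);
        }
    }
}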

 

Posted on January 18, 2019 in General

 


What is Azure HDInsight?

Hadoop and Azure HDInsight

Azure HDInsight is Azure’s version of Hadoop as a service. It lives in the cloud, just like other Azure services, and it’s a managed service, so we don’t have to worry about some of the maintenance required with a Hadoop cluster.

Underneath, Azure HDInsight uses the Hortonworks Data Platform (HDP) Hadoop components.

Each Azure HDInsight version has its own cloud distribution of HDP along with other components. Different versions of HDInsight have different versions of HDP. See the reference link for the technology stack and versions.

When you create an Azure HDInsight cluster, you will be asked to choose the cluster type: the Hadoop technology you want to use, such as Hive, Spark, Storm, etc. More cluster types are being added; to see what’s currently supported, see the reference link.

Azure HDInsight can be a great data warehouse solution that lives in the cloud.

Azure HDInsight and Databricks

While Azure HDInsight is a fully managed service, there is still some management we as users have to do. HDInsight also supports Azure Data Lake Storage and Apache Ranger integration. The downside to HDInsight is that it doesn’t auto-scale, and you can’t pause the deployment. This means you pay the cost as long as the service lives. The typical model is to spin the service up whenever it’s needed, process the data, store the results in permanent storage, and kill the service.

This is as opposed to Databricks, another data warehouse solution offered in Azure, which can auto-scale. Databricks, however, is less about the ETL process and more about processing data for analytics, machine learning, and the like. Needless to say, it has built-in libraries for this purpose.

The language support is also different. Language support in HDInsight depends on the cluster type you choose when you spin up the service; for example, a Hive cluster supports HiveQL (a SQL-like language) in its Hive editor. Databricks supports Python, Scala, R, SQL, and many others.

Reference

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning

https://docs.microsoft.com/en-us/azure/hdinsight/

 

Posted on January 15, 2019 in General

 


Web API .Net with Basic Authentication

Github project.

Notes:
1. This works on .Net Framework 4.6.1.
2. The authorization part is not covered.

The core authentication code is in `Security/BasicAuthAttribute.cs`. This class inherits from the following:

ActionFilterAttribute
So we can use it as an attribute to decorate controllers or actions.

IAuthenticationFilter
To invoke WebApi’s authentication pipeline. Some developers like to use `IActionFilter` for authentication; while it may work, it is not best practice, as `IActionFilter` executes later in the WebApi stack.

IAuthenticationFilter implements two methods (see the sketch after this list):
1. `AuthenticateAsync`. Runs first. This is the code that authenticates the user. The caller passes credentials in the request header. First we parse the header and the user name / password credentials the caller passed in. Then we authenticate the user; in the Github project, I add the user to a generic principal, but in a production app you should validate the credentials against a security provider (e.g. ADFS, Auth0).
2. `ChallengeAsync`. Runs after `AuthenticateAsync`. This is where, when authentication fails, we can challenge the caller to prove themselves, which is done by returning a `WWW-Authenticate: Basic` header in the response.
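For illustration, a minimal sketch of such an attribute might look like the following. The header parsing and the `IsValidUser` check are simplified placeholders; the real implementation is in the Github project.

BasicAuthAttribute.cs (sketch)

using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Security.Principal;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using System.Web.Http;
using System.Web.Http.Filters;
using System.Web.Http.Results;

namespace QC.Security
{
    public class BasicAuthAttribute : ActionFilterAttribute, IAuthenticationFilter
    {
        public Task AuthenticateAsync(HttpAuthenticationContext context, CancellationToken cancellationToken)
        {
            var auth = context.Request.Headers.Authorization;
            if (auth == null || !string.Equals(auth.Scheme, "Basic", StringComparison.OrdinalIgnoreCase) || string.IsNullOrEmpty(auth.Parameter))
            {
                // No credentials; leave the request unauthenticated so ChallengeAsync can respond
                return Task.FromResult(0);
            }

            // Decode the Base64-encoded "userName:password" pair from the header
            var raw = Encoding.UTF8.GetString(Convert.FromBase64String(auth.Parameter));
            var parts = raw.Split(new[] { ':' }, 2);

            if (parts.Length == 2 && IsValidUser(parts[0], parts[1]))
            {
                context.Principal = new GenericPrincipal(new GenericIdentity(parts[0]), new string[0]);
            }
            else
            {
                context.ErrorResult = new UnauthorizedResult(new AuthenticationHeaderValue[0], context.Request);
            }

            return Task.FromResult(0);
        }

        public Task ChallengeAsync(HttpAuthenticationChallengeContext context, CancellationToken cancellationToken)
        {
            // Wrap the result so an unauthorized response carries a Basic challenge header
            context.Result = new ChallengeResult(context.Result);
            return Task.FromResult(0);
        }

        private static bool IsValidUser(string userName, string password)
        {
            // Placeholder; validate against a security provider (ADFS, Auth0, etc.) in production
            return !string.IsNullOrEmpty(userName);
        }

        private class ChallengeResult : IHttpActionResult
        {
            private readonly IHttpActionResult inner;

            public ChallengeResult(IHttpActionResult inner)
            {
                this.inner = inner;
            }

            public async Task<HttpResponseMessage> ExecuteAsync(CancellationToken cancellationToken)
            {
                var response = await inner.ExecuteAsync(cancellationToken);
                if (response.StatusCode == HttpStatusCode.Unauthorized)
                {
                    response.Headers.WwwAuthenticate.Add(new AuthenticationHeaderValue("Basic"));
                }
                return response;
            }
        }
    }
}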

Usage
There are 3 ways to use this attribute in WebApi.
1. Globally. Every action will require authentication.

WebApiConfig.cs

public static void Register(HttpConfiguration config)
{
    // Add global authentication
    config.Filters.Add(new BasicAuthAttribute());

    // Web API routes
    config.MapHttpAttributeRoutes();

    config.Routes.MapHttpRoute(
        name: "DefaultApi",
        routeTemplate: "api/{controller}/{id}",
        defaults: new { id = RouteParameter.Optional }
    );
}

2. On an entire controller. Every action under that controller will require authentication. Notice the `[BasicAuth]` decoration.

ValuesController.cs

[BasicAuth]
public class ValuesController : ApiController
{
    public IEnumerable<string> Get()
    {
        return new string[] { "value1", "value2" };
    }
            
    public string Get(int id)
    {
        return "value";
    }
}

3. On a specific action. Notice the `[BasicAuth]` decoration.

ValuesController.cs

public class ValuesController : ApiController
{
    public IEnumerable<string> Get()
    {
        return new string[] { "value1", "value2" };
    }

    [BasicAuth]
    public string Get(int id)
    {
        return "value";
    }
}
 

Posted on January 11, 2019 in General

 


 