First of the Week Reading List

How to use npx: the npm package runner

There’s so much to say about `npx`, but essentially it lets you run a package (think of the CLI tool that ships with a package) without installing it globally, or without installing it at all!


Microsoft: We want you to learn Python programming language for free

Learn Python for free! As you probably expect, though, the course also promotes the use of Azure.


Google reveals new Python programming language course: Scholarships for 2,500

Well, if Microsoft has a free Python course, Google has to launch its own. Soon, I’m sure, Amazon will follow. Google’s course is not free, however, but it will give scholarships to 2,500 students.


Third-Party Components at Their Best

This post is less educational and more of a “things you should have” checklist for anyone who wants to use third-party UI components. It also applies to non-UI components, and really to any open source project. You can even use this checklist if you want to start your own open source project.


6 Ways to Unsubscribe from Observables in Angular

Yes, the glorious Observables in Angular. They’re a double-edged sword, but they will only hurt you if you forget to unsubscribe. Luckily, there are 6 ways to do it, and this post shows you how.


Machine Learning / AI Reading List

Common Errors in Machine Learning due to Poor Statistics Knowledge

You can learn by doing, or you can learn from someone else’s mistakes. This is a case of the latter, and it’s certainly useful for avoiding the same mistakes.


Continuous Delivery for Machine Learning

This one is a pretty heavy read. But if you’re into Martin Fowler’s stuff, you know it’s ‘the standard’. It covers what it takes to apply continuous delivery to machine learning models. CD for machine learning has similarities to CD for software, but there are a few key differences as well.


Introduction to Machine Learning in C# with ML.NET

If you want to learn machine learning but don’t know where to start, ML.NET is a good starting point. If you’re familiar with C# and .NET, you can pick up ML.NET fairly quickly. This post goes into detail on how to get started with ML.NET, and even covers AutoML!


Identify guiding principles for responsible AI in your business

Everyone thinks AI is cool and futuristic and can solve _almost_ all problems. But not many think about the consequences, the side effects and what it would take to build it right. In other words, responsible AI. This mini-course goes over what we need to consider when building AI.


Multiprocessing vs. Threading in Python: What Every Data Scientist Needs to Know

I like how Sumit gives an intro to parallel computing, specifically multiprocessing vs. threading, before diving into how it applies to modeling in Python. Worth a read even if you skip the Python part.
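To give a feel for the distinction, here is a minimal sketch using Python’s standard `concurrent.futures` module. It is not taken from the article; the function names (`count_primes`, `fake_download`, `run_in_pool`) are made up for illustration, and `time.sleep` merely stands in for waiting on real I/O.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound work: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def fake_download(seconds):
    """I/O-bound work: sleeping stands in for waiting on a network call."""
    time.sleep(seconds)
    return seconds


def run_in_pool(executor_cls, func, inputs, workers=4):
    """Run `func` over `inputs` with the given executor class and time it."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(func, inputs))
    return results, time.perf_counter() - start
```

Calling `run_in_pool(ThreadPoolExecutor, fake_download, [0.2] * 4)` should finish in roughly the time of one sleep, because threads overlap while waiting. For CPU-bound work such as `count_primes`, threads in CPython generally do not run in parallel because of the global interpreter lock, so `ProcessPoolExecutor` is the usual choice; call it from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.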


What is Azure HDInsight?

Hadoop and Azure HDInsight

Azure HDInsight is Azure’s version of Hadoop as a service. It lives in the cloud, just like other Azure services, and it’s a managed service, so we don’t have to worry about some of the maintenance that a Hadoop cluster requires.

Underneath, Azure HDInsight uses the Hadoop components from Hortonworks Data Platform (HDP).

Each Azure HDInsight version has its own cloud distribution of HDP along with other components. Different versions of HDInsight come with different versions of HDP. See the reference links for the technology stack and its versions.

When you create an Azure HDInsight cluster, you will be asked to choose the cluster type. The cluster type is the Hadoop technology you want to use: Hive, Spark, Storm, etc. More cluster types are being added. To see what’s currently supported, see the reference links.

Azure HDInsight can be a great data warehouse solution that lives in the cloud.

Azure HDInsight and Databricks

While Azure HDInsight is a fully managed service, there is still some management that we as users have to do. HDInsight also supports Azure Data Lake Storage and Apache Ranger integration. The downside of HDInsight, at least at the time of writing, is that it doesn’t have auto-scale and you can’t pause the deployment. This means you pay for the service for as long as it lives. The typical model is to spin the service up whenever it’s needed, process the data, store the results in permanent storage and then kill the service.

This is in contrast to Databricks, another data warehouse solution offered on Azure, which can be auto-scaled. Databricks, however, is less about the ETL process and more about processing data for analytics, machine learning and the like. Needless to say, it has built-in libraries for this purpose.

The language support is also different. Language support in HDInsight depends on the cluster type you choose when you spin up the service. For example, Hive supports HiveQL (a SQL-like language) in its Hive editor. Databricks supports Python, Scala, R, SQL and others.
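The pay-while-it-lives point made earlier in this section comes down to simple arithmetic. Here is a rough sketch that compares a cluster left running all month with one that is created per job and deleted afterwards. Every number in it is hypothetical (the hourly rate, the job counts and the startup overhead are not real Azure prices), and the function names are made up for illustration.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8,760 hours / 12)


def monthly_cost(hourly_rate, hours_running):
    """Cost of a cluster billed for every hour it exists."""
    return hourly_rate * hours_running


def on_demand_hours(jobs_per_month, hours_per_job, startup_hours=0.5):
    """Billed hours when the cluster is created per job and deleted after.

    `startup_hours` is an assumed overhead for provisioning the cluster.
    """
    return jobs_per_month * (hours_per_job + startup_hours)


# Hypothetical example: $3.00 per hour, one 2-hour job per day.
always_on = monthly_cost(3.0, HOURS_PER_MONTH)
per_job = monthly_cost(3.0, on_demand_hours(30, 2.0))
```

With these made-up numbers, the always-on cluster is billed for 730 hours and the per-job cluster for 75, which is why the spin up, compute, store, kill pattern is the typical model.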

Reference

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning

https://docs.microsoft.com/en-us/azure/hdinsight/

Quick Glance at Hadoop Ecosystems

As mentioned in the Hadoop post, the community around Hadoop has built a tremendous number of tools and technologies to support developers. Together these form the Hadoop ecosystem. Some of the most popular ones are:

  • Hive
    Hadoop is written in Java, but not everyone knows Java. Hive is software built on top of Hadoop that exposes a SQL interface, allowing SQL developers to use the powerful Hadoop system in a familiar language. If you know SQL, you don’t need experience in Java to leverage Hadoop. Hive uses the HiveQL language, which is very SQL-like.
  • HBase
    Basically a non-relational database on top of Hadoop. Even though it’s non-relational, you can integrate it with other systems just like a traditional database.
  • Pig
    A tool in the Hadoop ecosystem used to manipulate data, transforming unstructured data into structured data. It also has an interface to query the data, just like Hive.
  • Storm
    An event stream processor that lives in the Hadoop ecosystem, used to process streams of data (as opposed to batch data). An example would be processing a stream of IoT data, where data from an IoT device keeps flowing through the system.
  • Oozie
    A workflow management system to coordinate between different Hadoop technologies.
  • Flume / Sqoop
    More of an integration system that transfers data to and from a Hadoop system. If you have data that lives outside of Hadoop and needs to be processed in Hadoop, Flume / Sqoop will do the job.
  • Spark
    A distributed compute engine within the Hadoop ecosystem. It’s used to process large amounts of data, prepping it for analytics, machine learning, etc. Needless to say, it has a lot of built-in libraries for machine learning, artificial intelligence, analytics, stream processing and graph processing. Spark also supports various languages: Scala, Python, R, etc.
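The stream vs. batch distinction mentioned under Storm can be illustrated with a small, plain-Python sketch. This is not Storm’s API and involves no Hadoop at all; `sensor_events` is a hypothetical stand-in for an IoT device, and the function names are made up for illustration.

```python
from statistics import mean


def batch_average(readings):
    """Batch style: all data is collected first, then processed in one pass."""
    return mean(readings)


def streaming_average(readings):
    """Stream style: update the result as each event arrives.

    Yields the running average after every reading, so a result is
    available without waiting for the stream to end.
    """
    total = 0.0
    count = 0
    for value in readings:
        total += value
        count += 1
        yield total / count


def sensor_events():
    """Hypothetical IoT device emitting temperature readings one at a time."""
    yield from [20.0, 22.0, 24.0]
```

Here `batch_average(list(sensor_events()))` needs the full list before it can answer, while iterating over `streaming_average(sensor_events())` produces an updated average after every event, which is the style of processing a stream processor is built for.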

This is definitely an oversimplified explanation of the Hadoop ecosystem, and there are lots of other technologies not covered here. But this should give you a quick explanation of each of them.