More Reading List

Microsoft open sources SandDance, a visual data exploration tool

Microsoft demoed this visualization tool a few years ago. It’s an amazing tool for displaying your data beautifully, and it’s now open source!

Announcing TypeScript 3.7 RC

A slightly older article, but it describes the new features in TypeScript 3.7 and the reasoning behind them. Some of them are optional chaining, nullish coalescing, and assertion functions. Head over to the page for the full list and explanations.

13 Tech Experts Predict The Next Big Trend In Software Development

Whether or not you believe the predictions, it’s a good-to-know read. You can agree or disagree. It’s a technical read, but definitely not a heavy one.

Top data visualization examples and dashboard designs

There are many, many ways to display data and there’s no right or wrong way. There are, however, more effective ways to display data, depending on your app and needs. This post gives you some ideas for good data visualization design.

13 useful JavaScript array tips and tricks you should know

Handy tricks in JavaScript. If you code in JS a lot, knowing these off the top of your head will definitely make you look like a ninja coder.

Quick Glance at Hadoop Ecosystems

As mentioned in the Hadoop post, the community around Hadoop has built a tremendous number of tools and technologies to support developers. Together they form the Hadoop ecosystem. Some of the most popular ones are:

  • Hive
    Hadoop is written in Java, but not everyone knows Java. Hive is software built on top of Hadoop that exposes a SQL interface, allowing SQL developers to use the powerful Hadoop system in a familiar language. If you know SQL, you don’t need Java experience to leverage Hadoop. Hive uses HiveQL, a very SQL-like language.
  • HBase
    Basically a non-relational database on top of Hadoop. Even though it’s non-relational, you can integrate it with other systems just like a traditional database.
  • Pig
    A tool in the Hadoop ecosystem used to manipulate data, transforming unstructured data into structured data. It also has an interface to query the data, just like Hive.
  • Storm
    An event stream processor in the Hadoop ecosystem, used to process streams of data (as opposed to batch data). An example would be processing a stream of IoT data, where data from an IoT device keeps flowing through the system.
  • Oozie
    A workflow management system to coordinate between different Hadoop technologies.
  • Flume / Sqoop
    More of an integration system that transfers data to and from a Hadoop system. If you have data that lives outside of Hadoop and needs to be processed in Hadoop, Flume / Sqoop will do the job.
  • Spark
    A distributed compute engine within the Hadoop ecosystem. It’s used to process large amounts of data, prepping it for analytics, machine learning, etc. It has a lot of built-in libraries for machine learning, artificial intelligence, analytics, stream processing and graph processing. Spark also supports various languages: Scala, Python, R, etc.

This is definitely an oversimplified explanation of the Hadoop ecosystem, and there are lots of other technologies not covered here. But this should give you a quick idea of what each of them does.
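
To make the batch versus stream distinction from the Storm entry a bit more concrete, here is a minimal sketch in plain Python. It only illustrates the idea and is not Storm’s API; the function names and the sensor readings are made up for this example.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch style: wait until the whole data set is available, then process it once."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Stream style: update the result every time a new event arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        # Emit an up-to-date answer without waiting for the stream to end.
        yield total / count


# Hypothetical temperature readings from a single IoT device.
sensor_readings = [20.0, 22.0, 24.0]

batch_result = batch_average(sensor_readings)
stream_results = list(streaming_average(iter(sensor_readings)))

print(batch_result)    # one answer after all the data is in
print(stream_results)  # one answer per incoming reading
```

The batch function needs the complete list before it can answer, while the streaming function produces a running answer after every reading, which is the style a stream processor works in.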

What is Hadoop?

Hadoop and Distributed System

If you were a basketball player playing alone, all you would need is dribbling and shooting. With those 2 skills, you can play basketball by yourself really well. But if you were to play on a team, you are going to suck. To have a successful basketball team, you have to team up with other players. That means learning a new skill, passing, and having a coach to coordinate everyone.

The same is true of monolithic vs distributed systems. In a monolithic system, all you have is one giant supercomputer with a large amount of memory, storage and compute power. In a distributed system, you don’t need a supercomputer, but you will have multiple, maybe less powerful, computers. Just like each player on a basketball team has to learn passing, the computers now have to talk to each other, and they also need software to coordinate them.

This is what Hadoop is for. It’s a system to coordinate and orchestrate a cluster of computers, called nodes, in a distributed system. Hadoop is like the coach of the basketball team.

Hadoop does a lot of heavy lifting, such as:

  • partition data
  • coordinate compute tasks
  • fault tolerance
  • allocate capacity to processes / jobs, etc.
  • monitoring
  • security
  • API
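
As a rough illustration of the first and third items (partitioning data and fault tolerance), the toy sketch below splits some data into fixed-size blocks and stores each block on more than one node. It is a simplification under stated assumptions, not Hadoop’s actual placement logic: the node names, the tiny block size and the round-robin placement are all made up for the example.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: List[str], replicas: int) -> Dict[int, List[str]]:
    """Assign every block to `replicas` different nodes, round-robin.

    Assumes replicas <= len(nodes), otherwise a node would be listed twice.
    """
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replicas)
        ]
    return placement


# Toy values: real clusters use far larger blocks and smarter placement.
nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_blocks(len(blocks), nodes, replicas=2)

# Simulate one node failing: every block should still have a live copy.
failed = "node-b"
survivors = {
    block_id: [n for n in holders if n != failed]
    for block_id, holders in placement.items()
}

print(placement)
print(survivors)
```

Because each block lives on two nodes, losing any single node still leaves at least one copy of every block, which is the basic idea behind fault tolerance through replication.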

The logical components of Hadoop (HDFS, MapReduce and YARN) are what Hadoop uses to do the heavy lifting. These components are essentially the storage (HDFS), the programming model (MapReduce) and the resource manager (YARN).
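
The MapReduce programming model can be sketched with the classic word count example. This is a single-process illustration in plain Python, not Hadoop’s API: the function names and input lines are invented here, and on a real cluster the map and reduce steps would run in parallel on different nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: turn one input line into (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: combine the grouped values for one key into a single result."""
    return word, sum(counts)


# Hypothetical input; each line could live on a different node.
lines = ["big data big cluster", "data data"]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(word, counts) for word, counts in grouped.items())

print(word_counts)
```

The useful property is that each call to `map_phase` and `reduce_phase` only looks at its own piece of data, so the work can be spread across many machines without the pieces needing to know about each other.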

In big data processing, some crucial requirements are to:

  • store
  • process and
  • scale

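These requirements make it possible to store, process and analyze data efficiently and in a timely manner. Hadoop is a good fit for big data processing because it covers all three: HDFS stores the data, MapReduce processes it, and the cluster scales by adding more nodes.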

And because Hadoop handles almost everything in cluster management for developers, we can focus on actually doing the work: building models, processing data, reporting, analyzing, etc. The details of cluster management are abstracted away.

What’s also really cool about Hadoop is its ecosystem. A lot of tools and technologies have been created on top of Hadoop. Some of the popular ones are Hive, HBase, Spark, Pig, Flume/Sqoop, Storm, Oozie, and many more. But that’s for another day.