As mentioned in the Hadoop post, the community around Hadoop has built a tremendous set of tools and technologies to support developers. Together, these form the Hadoop ecosystem. Some of the most popular ones are:
- Hive
Hadoop is built on Java, but not everyone can learn Java. Hive is software built on top of Hadoop that exposes a SQL interface, allowing SQL developers to use the powerful Hadoop system in a language they already know. If you know SQL, you don't need Java experience to leverage Hadoop. Hive uses a very SQL-like language called HiveQL.
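To get a feel for how SQL-like HiveQL is, here is a minimal sketch using the third-party PyHive client (one of several ways to talk to HiveServer2); the host, port, and `web_logs` table are hypothetical.

```python
# Minimal sketch: run a HiveQL query from Python via PyHive.
# Assumes HiveServer2 is reachable on localhost:10000 and that a
# hypothetical `web_logs` table already exists.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# Plain SQL-style HiveQL; Hive compiles this into jobs that run
# across the Hadoop cluster.
cursor.execute("""
    SELECT status_code, COUNT(*) AS hits
    FROM web_logs
    GROUP BY status_code
""")

for status_code, hits in cursor.fetchall():
    print(status_code, hits)
```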
- HBase
HBase is essentially a non-relational (NoSQL) database on top of Hadoop. Even though it's non-relational, you can integrate it with other systems just like a traditional database.
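Here is a minimal sketch of writing and reading a row using the third-party happybase client; it assumes the HBase Thrift server is running locally, and the `users` table and `profile` column family are hypothetical.

```python
# Minimal sketch: basic HBase reads/writes via happybase.
import happybase

connection = happybase.Connection("localhost")
table = connection.table("users")

# HBase stores raw bytes keyed by row key + column family:qualifier.
table.put(b"user-1001", {
    b"profile:name": b"Ada",
    b"profile:email": b"ada@example.com",
})

# Reads are key lookups rather than SQL queries.
row = table.row(b"user-1001")
print(row[b"profile:name"])
```

Note that access is by row key and column family rather than SQL, which is what "non-relational" means here in practice.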
- Pig
Pig is a tool in the Hadoop ecosystem used to manipulate data, transforming unstructured data into structured data. It also has an interface to query the data, just like Hive.
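Pig scripts are written in a language called Pig Latin. Below is a minimal word-count sketch that turns raw lines of text into structured (word, count) records, driven from Python; it assumes the `pig` CLI is installed, and the `input.txt` path is hypothetical.

```python
# Minimal sketch: write a Pig Latin script and submit it with the pig CLI.
import subprocess

PIG_SCRIPT = """
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
"""

with open("wordcount.pig", "w") as f:
    f.write(PIG_SCRIPT)

# `pig -f <script>` submits the script for execution.
subprocess.run(["pig", "-f", "wordcount.pig"], check=True)
```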
- Storm
Storm is an event stream processor that lives in Hadoop, used to process streams of data (as opposed to batch data). An example would be processing a stream of IoT data, where readings from IoT devices keep flowing through the system.
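As a sketch of what per-event processing looks like, here is a minimal bolt written with streamparse, a third-party Python binding for Storm; the stream fields and the temperature threshold are hypothetical.

```python
# Minimal sketch: a Storm bolt (via streamparse) that handles one
# tuple at a time as IoT readings stream in, instead of waiting
# for a complete batch.
from streamparse import Bolt

class OverheatAlertBolt(Bolt):

    def process(self, tup):
        device_id, temperature = tup.values
        if temperature > 90.0:
            # Emit a new tuple downstream for the alerting stage.
            self.emit([device_id, temperature])
```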
- Oozie
Oozie is a workflow management system that coordinates jobs across the different Hadoop technologies.
- Flume / Sqoop
These are integration tools that transfer data to and from the Hadoop system. If you have data that lives outside of Hadoop and needs to be processed in Hadoop, Flume and Sqoop will do the job (Flume is typically used for ingesting streaming log data, Sqoop for bulk transfers between Hadoop and relational databases).
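For example, here is a minimal Sqoop import driven from Python; the JDBC URL, table name, and HDFS target directory are made up for illustration.

```python
# Minimal sketch: pull a relational table into HDFS with the Sqoop CLI.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",  # source database (hypothetical)
    "--table", "orders",                       # table to copy
    "--target-dir", "/data/sales/orders",      # HDFS destination
    "-m", "4",                                 # parallel map tasks
], check=True)
```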
- Spark
Spark is a distributed compute engine within the Hadoop ecosystem. It's used to process large amounts of data, prepping it for analytics, machine learning, and more. Needless to say, it has a lot of built-in libraries for machine learning, artificial intelligence, analytics, stream processing, and graph processing. Spark also supports various languages: Scala, Python, R, etc.
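Here is a minimal PySpark sketch of the kind of data prep Spark is used for; the `events.csv` file and its column names are hypothetical.

```python
# Minimal sketch: read raw data and prep it for analytics with PySpark.
# The same code scales from a laptop to a full cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-counts").getOrCreate()

# Parse, filter, and aggregate raw CSV data.
events = spark.read.csv("events.csv", header=True, inferSchema=True)
daily = (
    events
    .filter(F.col("status") == "ok")
    .groupBy("event_date")
    .count()
)

daily.show()
spark.stop()
```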
This is definitely an oversimplified explanation of the Hadoop ecosystem, and there are lots of other technologies not covered here. But it should give you a quick overview of each of them.