Big data, the catchy name for massive volumes of structured, unstructured or semi-structured data, is notoriously difficult to capture, store, manage, share, analyze and visualize, at least with traditional database and software applications. That's why big data technologies emerged: to manage and process those volumes of data effectively and efficiently. And it's Apache Hadoop that provides the framework, and a family of associated technologies, for processing large data sets across clusters of computers in a distributed way. So, in order to really understand big data, you need to understand a bit about Hadoop. Here we'll take a look at the top terms you'll hear in regard to Hadoop – and what they mean.
But First, a Look at How Hadoop Works
Before going into the Hadoop ecosystem, you need to clearly understand two fundamental things: how a file is stored in Hadoop, and how stored data is processed. All Hadoop-related technologies work mainly in these two areas and make them more user-friendly. (Get the basics of how Hadoop works in How Hadoop Helps Solve the Big Data Problem.)
Now, on to the terms.
Hadoop Common
The Hadoop framework has different modules for different functionalities, and these modules can interact with each other for various reasons. Hadoop Common can be defined as a common utilities library that supports these modules in the Hadoop ecosystem. These utilities are basically Java archive (JAR) files, and they are mainly used by programmers and developers during development.
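For instance, the Configuration class in hadoop-common is the shared entry point nearly every other module uses to read cluster settings. A minimal sketch (the fallback value shown is only an illustrative default):

```java
import org.apache.hadoop.conf.Configuration;

// Configuration (from hadoop-common) loads core-site.xml and other
// resources from the classpath; HDFS, MapReduce and YARN all build on it.
public class CommonExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // "fs.defaultFS" is a standard Hadoop property; the second argument
        // is just an illustrative fallback default.
        String fsUri = conf.get("fs.defaultFS", "hdfs://localhost:9000");
        System.out.println("Default file system: " + fsUri);
    }
}
```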
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a sub-project of Apache Hadoop under the Apache Software Foundation. It is the storage backbone of the Hadoop framework: a distributed, scalable and fault-tolerant file system that spans multiple commodity machines, collectively known as a Hadoop cluster. The objective of HDFS is to store huge volumes of data reliably and to provide high-throughput access to application data. HDFS follows a master/slave architecture, where the master is known as the NameNode and the slaves are known as DataNodes.
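To make the storage side concrete, here is a minimal sketch of writing a file to HDFS through the Java FileSystem API. The NameNode address and path are assumptions for illustration; in practice they come from your cluster's configuration files:

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write a small file into HDFS using the Java FileSystem API.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode address below is an assumption; normally it is
        // picked up from core-site.xml on the classpath.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");
        try (BufferedWriter writer =
                 new BufferedWriter(new OutputStreamWriter(fs.create(file, true)))) {
            writer.write("Hello, HDFS!");
        }
        System.out.println("Wrote " + file + ", exists: " + fs.exists(file));
    }
}
```

Behind this simple call, the NameNode records the file's metadata while the actual blocks are replicated across DataNodes, which is what gives HDFS its fault tolerance.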
MapReduce
Hadoop MapReduce is also a sub-project of the Apache Software Foundation. MapReduce is a software framework written purely in Java. Its primary objective is to process large data sets in a distributed environment (composed of commodity hardware) in a completely parallel manner. The framework manages all activities such as job scheduling, monitoring, execution and the re-execution of failed tasks.
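The classic illustration is word count: the map phase tokenizes input lines and emits (word, 1) pairs, and the reduce phase sums the counts per word. This sketch uses the standard org.apache.hadoop.mapreduce API:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```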
HBase
Apache HBase is known as the Hadoop database. It is a columnar, distributed and scalable big data store, and a type of NoSQL database – that is, it is not a relational database management system. HBase applications are also written in Java, built on top of Hadoop and run on HDFS. HBase is used when you need real-time read/write and random access to big data. It is modeled on the concepts of Google's BigTable.
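Here is a sketch of that real-time read/write access using the standard HBase Java client API; the table ("users") and column names are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user1", column family "info",
            // qualifier "name".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back with a point Get, the kind of random access
            // HBase is built for.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```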
Hive
Apache Hive is an open-source data warehouse software system. Hive was originally developed by Facebook before it came under the Apache Software Foundation and was open sourced. It facilitates the management and querying of large data sets on distributed, Hadoop-compatible storage. Hive performs all its activities through an SQL-like language known as HiveQL. (Learn more in A Brief Intro to Apache Hive and Pig.)
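One common way to run HiveQL from an application is through HiveServer2's JDBC interface. In this sketch, the host, port, credentials and the web_logs table are all assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Run a HiveQL query through HiveServer2's JDBC interface.
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-server:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL, but Hive compiles it into jobs that
             // run over data stored in HDFS.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM web_logs " +
                 "GROUP BY page ORDER BY hits DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```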
Apache Pig
Pig was originally developed at Yahoo for creating and executing MapReduce jobs on large volumes of distributed data. It has since become an open source project under the Apache Software Foundation. Apache Pig can be defined as a platform for analyzing very large data sets efficiently. Pig's infrastructure layer produces sequences of MapReduce jobs to do the actual processing, while its language layer, known as Pig Latin, provides SQL-like features for performing queries on distributed data sets.
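Pig Latin can also be embedded in Java through the PigServer class. The following sketch (with a hypothetical web_logs.txt input file) registers three Pig Latin statements, which Pig compiles into MapReduce jobs when the result is stored:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // LOCAL mode for illustration; use MAPREDUCE mode on a real cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery(
            "logs = LOAD 'web_logs.txt' AS (page:chararray, user:chararray);");
        pig.registerQuery("grouped = GROUP logs BY page;");
        pig.registerQuery(
            "hits = FOREACH grouped GENERATE group AS page, COUNT(logs) AS n;");
        pig.store("hits", "hits_by_page"); // triggers the actual job execution
    }
}
```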
Apache Spark
Spark was originally developed by the AMPLab at UC Berkeley and became an Apache top-level project in February 2014. Apache Spark can be defined as an open-source, general-purpose cluster-computing framework that makes data analytics much faster. It can run on top of the Hadoop Distributed File System, but it is not tied to the MapReduce framework – and its performance is much faster than MapReduce's. Spark provides high-level APIs in Scala, Python and Java.
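For comparison with the MapReduce example above, here is the same word count written against Spark's Java RDD API; the input path and local master setting are illustrative:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // "local[*]" runs Spark in-process; on a cluster you would point
        // the master at YARN or a standalone Spark master instead.
        SparkConf conf = new SparkConf().setAppName("word-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Spark can read from HDFS ("hdfs://...") or the local file system.
            JavaRDD<String> lines = sc.textFile("hdfs://namenode-host:9000/user/demo/input.txt");
            lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                 .mapToPair(word -> new Tuple2<>(word, 1))
                 .reduceByKey(Integer::sum)
                 .collect()
                 .forEach(pair -> System.out.println(pair._1() + ": " + pair._2()));
        }
    }
}
```

The whole pipeline fits in a few lines because Spark keeps intermediate results in memory rather than writing them back to disk between stages, which is a large part of its speed advantage over MapReduce.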
Apache Cassandra
Apache Cassandra is another open-source NoSQL database. Cassandra is widely used to manage large volumes of structured, semi-structured and unstructured data spanning multiple data centers and cloud storage. Cassandra is designed on a "masterless" architecture, meaning it does not use the master/slave model. In this architecture, all nodes are equal, and data is distributed automatically and evenly across them. Cassandra's most important features are continuous availability, linear scalability, built-in and customizable replication, no single point of failure and operational simplicity.
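With the DataStax Java driver (the 3.x API is sketched below), a client can contact any node, since there is no master; the contact point, keyspace and users table here are assumptions:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                 .addContactPoint("127.0.0.1") // any node works as an entry point
                 .build();
             Session session = cluster.connect("demo_keyspace")) {

            // Writes and reads go through CQL; the driver routes them to
            // the appropriate replicas automatically.
            session.execute(
                "INSERT INTO users (id, name) VALUES (uuid(), 'Alice')");
            ResultSet rs = session.execute("SELECT id, name FROM users LIMIT 5");
            for (Row row : rs) {
                System.out.println(row.getUUID("id") + " " + row.getString("name"));
            }
        }
    }
}
```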
Yet Another Resource Negotiator (YARN)
Yet Another Resource Negotiator (YARN) is also known as MapReduce 2.0, but it actually falls under Hadoop 2.0. YARN can be defined as a job scheduling and resource management framework. The basic idea of YARN is to replace the functionality of the JobTracker with two separate daemons, one responsible for resource management and the other for scheduling and monitoring. In this framework there is a global ResourceManager (RM) and a per-application master known as the ApplicationMaster (AM). The global ResourceManager and the NodeManager (a per-node slave) form the actual data computation framework. Existing MapReduce v1 applications can also run on YARN, but they need to be recompiled with Hadoop 2.x JARs.
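As a small illustration of the client side, the YarnClient API can ask the ResourceManager for the applications it is tracking; configuration is assumed to come from yarn-site.xml on the classpath:

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

// List the applications known to the global ResourceManager. Each running
// application has its own ApplicationMaster that negotiated containers
// from the RM.
public class YarnListApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration()); // reads yarn-site.xml from the classpath
        yarnClient.start();
        try {
            List<ApplicationReport> apps = yarnClient.getApplications();
            for (ApplicationReport app : apps) {
                System.out.println(app.getApplicationId() + "\t"
                    + app.getName() + "\t" + app.getYarnApplicationState());
            }
        } finally {
            yarnClient.stop();
        }
    }
}
```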
Impala
Impala can be defined as an SQL query engine with massively parallel processing (MPP) power. It runs natively on the Apache Hadoop framework and is designed as part of the Hadoop ecosystem, sharing the same flexible file system (HDFS), metadata, resource management and security frameworks used by other Hadoop ecosystem components. It is important to note that Impala is much faster at query processing than Hive, but it is meant for queries and analysis on smaller sets of data, and is mainly designed as an analytics tool that works on processed and structured data.
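Because Impala exposes a HiveServer2-compatible interface, one common way to query it is through the Hive JDBC driver on Impala's default JDBC port, 21050. The host, authentication setting and table below are assumptions; Cloudera also ships a dedicated Impala JDBC driver:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Interactive query against Impala over the HiveServer2-compatible
// JDBC interface; ";auth=noSasl" assumes an unsecured test cluster.
public class ImpalaQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://impala-host:21050/default;auth=noSasl");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM web_logs")) {
            if (rs.next()) {
                System.out.println("rows: " + rs.getLong(1));
            }
        }
    }
}
```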
Hadoop is an important topic in IT, but there are those who are skeptical about its long-term viability. Read more in What Is Hadoop? A Cynic’s Theory.