Apache Hadoop technology has been very popular with companies for several years, both as an on-premise installation and as a cloud solution. Google, Facebook, AOL, Baidu, IBM, Yahoo and many other companies rely on the proven tools of the Hadoop ecosystem to extract, prepare and evaluate their data. A distribution such as the Hortonworks Apache Hadoop distribution bundles these individual components in much the same way that a GNU/Linux distribution bundles many individual open-source tools.
Apache Hadoop itself is a collection (framework) of open-source software components that together enable the development of software solutions for problems involving large amounts of data and high computing demands. Distributed data storage and distributed computing are usually carried out on computer clusters built from standard hardware (PCs, servers). All Hadoop modules are designed with hardware failures in mind and handle them themselves: a job that was running on a machine that suffers a defect is automatically restarted on another available node, so the end user notices at most a short delay rather than a failure.
With this blog article, we would first like to provide an overview of the Hadoop core technologies and the ecosystem built on top of them. We will then look at individual, important components and their key functions one by one. After reading the article, you will have a solid overview of the most important Apache Hadoop components.
Hadoop core components
HDFS (Hadoop Distributed File System)
HDFS makes all drives (hard disks, SSDs) of the cluster's machines appear as a single, highly available file system and is therefore the foundation for storing huge amounts of data. Individual drives or entire nodes can fail or be added without affecting the integrity of the file system: HDFS stores every data block multiple times through internal replication, which is invisible to the user. In addition, the data is distributed as evenly as possible across the cluster so that nodes already hold the data locally when computations run on them and no additional network traffic is generated.
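For a feel of how HDFS is accessed programmatically, here is a minimal sketch using the third-party Python package hdfs, which talks to the NameNode's WebHDFS REST interface; the host name, user and paths are placeholders and assume WebHDFS is enabled on the cluster.

```python
# Minimal sketch: interact with HDFS via WebHDFS using the "hdfs" Python package.
# Host, port, user and paths are placeholders for illustration.
from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint (9870 is a common default port)
client = InsecureClient('http://namenode.example.com:9870', user='hadoop')

# List a directory, upload a local file and read part of it back
print(client.list('/data'))
client.upload('/data/events/events.csv', 'events.csv', overwrite=True)
with client.read('/data/events/events.csv') as reader:
    print(reader.read(100))
```

In day-to-day work, the same operations are usually performed with the hdfs dfs command-line tools instead.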
YARN (Yet Another Resource Negotiator)
YARN manages the resources of a cluster. Using queues for individual jobs, CPU, RAM, GPU and FPGA resources can be allocated to them. In addition, YARN starts data processing tasks on the nodes where the required data is located (the principle of data locality). If a node fails, YARN delegates only the part of the job that ran on the failed node to another node for reprocessing. Users rarely interact with YARN directly, because it does most of its work in the background and serves as infrastructure for other components.
MapReduce
MapReduce is an abstract programming model for developing data processing jobs that can run on a cluster. Mappers (e.g. string conversions, tagging values) transform data in parallel across the entire cluster, and reducers (e.g. sums, filters) then aggregate these results. By chaining mappers and reducers, very complex program flows are possible that can also be executed in a distributed manner. Writing MapReduce jobs directly is now considered outdated, since more powerful and easier-to-learn interfaces (Hive, Spark) for data processing exist; MapReduce jobs are only implemented directly in exceptional cases.
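To make the model concrete, here is the classic word-count example written for Hadoop Streaming, where mapper and reducer are plain scripts that read from stdin; the script below is a sketch, and the input and output paths used when submitting it are up to you.

```python
#!/usr/bin/env python3
# Classic word count for Hadoop Streaming: the same script acts as mapper or
# reducer depending on its first argument.
import sys

def mapper():
    # Emit a (word, 1) pair for every word read from stdin
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so all counts for a word are contiguous
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Such a script is typically submitted via the Hadoop Streaming jar, with -mapper and -reducer pointing at the two modes; as noted above, though, Hive or Spark are usually the better choice today.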
Hadoop ecosystem
Based on Hadoop's core technologies, many other components and abstraction layers have been developed over the last 10 years that simplify development (e.g. Hive SQL queries instead of MapReduce programs, …) or open up Hadoop to new kinds of problems (event streaming, real-time processing, graph algorithms, …). We will now describe some components of the Hadoop ecosystem and their intended use, one by one.
Apache Hive + TEZ
Apache Hive forms a further abstraction layer above MapReduce and allows the files in the HDFS file system to be accessed with SQL queries (HiveQL), just as with a conventional SQL database (MySQL, MariaDB, Oracle, MSSQL, etc.). Since millions of developers, data analysts, controllers, etc. are already familiar with the SQL query language, which has existed for decades, the barrier to entry is very low, and thanks to Hive a Hadoop cluster can be queried like a conventional database. However, the underlying data volumes on which the queries run can be several orders of magnitude larger than would be possible with conventional SQL databases. An SQL query over several terabytes of data would overwhelm an SQL database, but for a Hadoop cluster this is an everyday scenario.
Hive can use multiple engines to execute queries. Apache TEZ is commonly used to speed up queries by planning them as an optimized directed acyclic graph (DAG) of processing steps.
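As a rough illustration, the following sketch runs a HiveQL query from Python via the PyHive package; it assumes a reachable HiveServer2 instance, and the host name and the web_logs table are made up for this example.

```python
# Minimal sketch: run a HiveQL query against HiveServer2 using PyHive.
# Host name, user and the "web_logs" table are hypothetical.
from pyhive import hive

conn = hive.connect(host='hiveserver.example.com', port=10000, username='analyst')
cursor = conn.cursor()

# HiveQL looks like ordinary SQL but is executed as distributed jobs (e.g. on TEZ)
cursor.execute("""
    SELECT status_code, COUNT(*) AS requests
    FROM web_logs
    WHERE log_date = '2020-01-01'
    GROUP BY status_code
""")
for status_code, requests in cursor.fetchall():
    print(status_code, requests)
```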
Apache Ambari
Apache Ambari is, in a sense, the cockpit of the Hadoop cluster. It displays current resource utilization, the installed cluster components and much more. The HDFS file system can also be browsed, and overviews of TEZ and Hive jobs can be opened. Ambari is therefore useful both for administrators managing the cluster and for developers writing (Hive) scripts.
Apache Spark
Similar to Apache Hive, Apache Spark builds on the ideas of the MapReduce model and forms another abstraction layer for cluster computing; however, it uses its own execution engine rather than Hadoop's MapReduce jobs.
Based on the abstraction of the Resilient Distributed Dataset (RDD), Spark offers a DataFrame API with a wide range of functions for data manipulation. Spark can hold a lot of data directly in memory and process it in parallel, which makes data processing very fast. In customer projects, Skillbyte was able to replace Hive queries with Spark programs and improve the runtime by over 80%. It is therefore worth using Spark for complex data operations on large data sets (joins, complex algorithms).
Spark is very versatile: it can execute SQL queries (Spark SQL), run machine learning algorithms across the entire cluster (MLlib), process streaming data (Spark Streaming) and perform graph processing (GraphX). Spark programs can be written in Python (PySpark), Scala or Java.
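Here is a small PySpark sketch of the kind of DataFrame and Spark SQL work described above; the HDFS path and column names are hypothetical.

```python
# Minimal PySpark sketch: SQL-style aggregation on a DataFrame read from HDFS.
# The HDFS path and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("request-stats").getOrCreate()

# Read CSV files from HDFS into a DataFrame (schema inferred here for brevity)
requests = spark.read.csv("hdfs:///data/web_logs/", header=True, inferSchema=True)

# Aggregate in parallel across the cluster
stats = (requests
         .groupBy("status_code")
         .agg(F.count("*").alias("requests"))
         .orderBy(F.desc("requests")))
stats.show(10)

# The same result expressed as a Spark SQL query
requests.createOrReplaceTempView("web_logs")
spark.sql(
    "SELECT status_code, COUNT(*) AS requests FROM web_logs GROUP BY status_code"
).show(10)

spark.stop()
```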
Apache HBase
HBase is a distributed open-source NoSQL database built on top of the HDFS file system. In contrast to Hive and Spark, HBase is optimized for fast, random data access. It is a fault-tolerant technology for handling large "sparse data" databases, i.e. large data sets with comparatively little information content per row because values repeat often and differ only slightly from one another, for example sensor data measuring temperature in a relatively constant environment. When developing HBase applications, it is important to "think outside the box", because applications must be designed around the storage model of a column-oriented database.
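The following sketch uses the happybase Python client (which talks to the HBase Thrift server) to illustrate the row-key-centred access pattern; the host, table name and column family are hypothetical, and the table is assumed to already exist.

```python
# Minimal sketch: row-key based writes and reads with the happybase client.
# Host, table and column family are placeholders; the table is assumed to exist.
import happybase

connection = happybase.Connection('hbase-thrift.example.com')
table = connection.table('sensor_readings')

# Row keys are often composite (e.g. sensor id + timestamp), because reads by
# row key or key range are exactly what HBase is optimized for.
table.put(b'sensor42#20200101120000', {b'data:temperature': b'21.4'})

# Fast random read of a single row ...
row = table.row(b'sensor42#20200101120000')
print(row[b'data:temperature'])

# ... or a range scan over all readings of one sensor
for key, data in table.scan(row_prefix=b'sensor42#'):
    print(key, data)
```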
Apache Airflow
Airflow is a job scheduler with which, among other things, several work steps can be executed automatically on a schedule, i.e. at predefined intervals. Several Hive scripts, Spark jobs and other steps can be chained one after another to build data pipelines that must be refreshed cyclically (reports, uploads to third-party systems, etc.). Airflow also monitors the execution of the individual steps, can automatically restart them if an error occurs, and can send e-mail notifications if errors cannot be resolved by re-running a job. A large number of operators, for Spark, Hive, HTTP REST calls, Sqoop and many others, are already included in the standard installation, and it is very easy to add your own operators using Python. In short: Airflow is a very flexible tool for automating and monitoring recurring tasks on Hadoop clusters.
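A minimal DAG sketch, assuming an Airflow 2.x installation; the schedule, script paths and task names are placeholders, and dedicated provider operators for Hive, Spark or Sqoop could be used instead of the generic BashOperator.

```python
# Minimal Airflow DAG sketch: run a Hive script, then a Spark job, once a day.
# Script paths and names are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_report",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    prepare = BashOperator(
        task_id="prepare_data",
        bash_command="beeline -f /opt/jobs/prepare_report.hql",
    )
    aggregate = BashOperator(
        task_id="aggregate",
        bash_command="spark-submit /opt/jobs/aggregate_report.py",
    )
    # Run the Spark job only after the Hive step has finished successfully
    prepare >> aggregate
```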
Apache Zookeeper
ZooKeeper provides distributed configuration and coordination for distributed systems and is a basic infrastructure building block for many Hadoop services. Among other things, ZooKeeper coordinates the individual Hadoop services (which nodes are available, which server is the active NameNode within the Hadoop cluster, etc.) and maintains this status information across the entire cluster. ZooKeeper itself is a distributed system and is designed to be fault-tolerant. As an infrastructure building block, Hadoop users rarely come into direct contact with it.
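Although end users rarely touch ZooKeeper directly, the following sketch with the kazoo client shows the kind of small coordination data that services keep in it; the znode path and value are made up for illustration.

```python
# Small sketch with the kazoo client: store and read a piece of coordination
# state in ZooKeeper. The znode path and value are made up.
from kazoo.client import KazooClient

zk = KazooClient(hosts='zookeeper.example.com:2181')
zk.start()

# Services typically publish small pieces of state under well-known znodes
zk.ensure_path('/demo/active-namenode')
zk.set('/demo/active-namenode', b'namenode01.example.com')

value, stat = zk.get('/demo/active-namenode')
print(value.decode(), stat.version)

zk.stop()
```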
Data delivery systems
There are many other Apache projects for integrating data from third-party systems into the Hadoop world. Below we briefly describe some systems for importing data into Hadoop (the yellow components in the bottom right of the diagram).
Apache Sqoop
Sqoop forms a bridge between JDBC data sources (relational databases from almost all SQL database vendors offer this interface) and the Hadoop cluster. Sqoop extracts data from existing databases using SQL commands and stores it in the HDFS of the Hadoop cluster so that it can be processed further there. Recurring imports can be automated with Airflow.
For example, several years' worth of start and end times of advertising campaigns can be imported from an external database into the Hadoop cluster in order to assign the respective campaign to the raw web server request data already available in Hadoop. Even huge amounts of data can then be processed in the Hadoop world, limited only by the number and equipment of the nodes (CPUs, RAM, storage space, etc.).
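A sketch of such an import, here wrapped in a small Python call so it could, for instance, be triggered from an Airflow task; the JDBC URL, credentials file, table and target directory are placeholders.

```python
# Sketch: launch a Sqoop import from Python. Connection details, table name
# and HDFS target directory are placeholders.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost.example.com/marketing",
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",  # avoid passwords on the command line
    "--table", "campaigns",                        # campaign start/end times
    "--target-dir", "/data/raw/campaigns",         # destination in HDFS
    "--num-mappers", "4",                          # parallel import tasks
], check=True)
```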
Apache Kafka
Apache Kafka is a distributed open-source system for real-time processing of data streams. Originally developed at LinkedIn, it was later donated to the Apache Software Foundation. Kafka is a distributed, scalable message broker that shows its strengths when large numbers of messages must be reliably distributed between many different sources (producers) and sinks (consumers). Web server logs, sensor events, user actions on websites: Kafka can collect data from a wide variety of systems via Kafka Connect and offers configurable delivery guarantees (at most once, at least once, exactly once), which makes it suitable for many use cases.
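A minimal producer/consumer sketch using the kafka-python package; the broker address, topic and consumer group are placeholders.

```python
# Minimal sketch: send and read messages with the kafka-python client.
# Broker address, topic name and group id are placeholders.
from kafka import KafkaProducer, KafkaConsumer

# Producer: send web server log lines (or any bytes) to a topic
producer = KafkaProducer(bootstrap_servers="broker.example.com:9092")
producer.send("web-logs", b"GET /index.html 200")
producer.flush()

# Consumer: read the same topic as part of a consumer group
consumer = KafkaConsumer(
    "web-logs",
    bootstrap_servers="broker.example.com:9092",
    group_id="log-ingest",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.offset, message.value)
```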
We hope we have been able to give you a comprehensive impression of the basic Hadoop structure and the surrounding ecosystem. We are curious to hear about your experiences and the tools you use. Please send us your recipes for success, questions or other suggestions by email to [email protected]
We look forward to your suggestions!