What is a Hadoop Cluster? Learn to Build a Cluster in Hadoop
In this blog, we will get familiar with the Hadoop cluster, the heart of the Hadoop framework. First, we will discuss what a Hadoop cluster is. Then we will look at its basic architecture and the protocols it uses for communication. Finally, we will discuss the various benefits a Hadoop cluster provides.
So, let us begin our journey of Hadoop Cluster.
1. What is Hadoop Cluster?
A Hadoop cluster is nothing but a group of computers connected together via a LAN. We use it for storing and processing large data sets. Hadoop clusters consist of a number of commodity machines connected together, which communicate with a high-end machine that acts as the master. Together, the master and slaves implement distributed computing over distributed data storage. The cluster runs open-source software to provide this distributed functionality.
2. What is the Basic Architecture of Hadoop Cluster?
A Hadoop cluster has a master-slave architecture.
i. Master in Hadoop Cluster
It is a machine with a good configuration of memory and CPU. Two daemons run on the master: the NameNode and the ResourceManager.
a. Functions of NameNode
Manages file system namespace
Regulates access to files by clients
Stores metadata of the actual data, for example: file path, number of blocks, block IDs, the location of blocks, etc.
Executes file system namespace operations like opening, closing, renaming files and directories
The NameNode stores the metadata in the memory for fast retrieval. Hence we should configure it on a high-end machine.
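To make the metadata description above concrete, here is a toy sketch (not Hadoop's actual data structures; the file path, node names, and block IDs are made up) of the kind of in-memory map the NameNode keeps for fast retrieval:

```python
# Illustrative sketch only: the kind of per-file metadata the NameNode
# holds in memory (hypothetical paths, block IDs, and DataNode names).
namenode_metadata = {
    "/user/data/sales.csv": {
        "block_ids": [1001, 1002],
        "block_size": 128 * 1024 * 1024,  # HDFS default block size: 128 MB
        "replicas": {
            1001: ["dn1", "dn3", "dn4"],
            1002: ["dn2", "dn3", "dn5"],
        },
    }
}

def locate_block(path, block_id):
    """Return the DataNodes holding a replica of the given block."""
    return namenode_metadata[path]["replicas"][block_id]
```

Because the whole map lives in RAM, lookups like `locate_block` are fast, which is exactly why the NameNode needs a machine with plenty of memory.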
b. Functions of Resource Manager
It arbitrates resources among competing nodes
Keeps track of live and dead nodes
ii. Slaves in the Hadoop Cluster
These are machines with a normal configuration. Two daemons run on the slave machines: the DataNode and the NodeManager.
a. Functions of DataNode
It stores the business data
It does read, write and data processing operations
Upon instruction from the master, it creates, deletes, and replicates data blocks.
b. Functions of NodeManager
It runs services on the node to check its health and reports the same to ResourceManager.
We can easily scale a Hadoop cluster by adding more nodes to it. Hence we call it a linearly scalable cluster. Each node added increases the throughput of the cluster.
Client nodes in the Hadoop cluster – We install and configure Hadoop on the client nodes.
c. Functions of the client node
To load the data on the Hadoop cluster.
Tells the cluster how to process the data by submitting a MapReduce job.
Collects the output from a specified location.
3. Single Node Cluster VS Multi-Node Cluster
As the name suggests, a single-node cluster is deployed on a single machine, while a multi-node cluster is deployed on several machines.
In a single-node Hadoop cluster, all the daemons like the NameNode and DataNode run on the same machine, and all the processes run on one JVM instance. The user need not make any configuration settings beyond setting the JAVA_HOME variable. The default replication factor for a single-node Hadoop cluster is one.
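For illustration, the replication factor mentioned above is controlled by the `dfs.replication` property in `hdfs-site.xml`; on a single-node cluster it is typically set (or left) at 1:

```xml
<!-- hdfs-site.xml fragment (illustrative): on a single-node cluster
     there is only one DataNode, so blocks cannot have more than one
     replica and dfs.replication stays at 1 -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```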
In multi-node Hadoop clusters, the daemons run on separate hosts or machines. A multi-node Hadoop cluster has a master-slave architecture: the NameNode daemon runs on the master machine and the DataNode daemon runs on the slave machines. In a multi-node Hadoop cluster, slave daemons like the DataNode and NodeManager run on cheap machines, while master daemons like the NameNode and ResourceManager run on powerful servers. In a multi-node Hadoop cluster, slave machines can be present in any location irrespective of the physical location of the master server.
4. Communication Protocols Used in Hadoop Clusters
The HDFS communication protocol works on top of the TCP/IP protocol. The client establishes a connection with the NameNode using a configurable TCP port and talks to it using the Client Protocol. The DataNode talks to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. The NameNode does not initiate any RPCs; instead, it only responds to RPC requests from DataNodes and clients.
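The respond-only pattern described above can be sketched as follows. This is a toy model, not the real Hadoop RPC layer; the method names and handlers are invented for illustration:

```python
# Toy sketch of the RPC pattern described above: one RPC abstraction
# exposes both the Client Protocol and the DataNode Protocol, and the
# NameNode side only *responds* to calls, never initiates them.
class NameNodeRpcServer:
    def __init__(self):
        # Hypothetical method names; both protocols sit behind the
        # same RPC dispatch table.
        self.handlers = {
            "client.open": lambda path: f"opened {path}",       # Client Protocol
            "datanode.heartbeat": lambda dn: f"ack {dn}",       # DataNode Protocol
        }

    def respond(self, method, arg):
        """Handle an incoming RPC; there is deliberately no send() here."""
        return self.handlers[method](arg)
```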
5. How to Build a Cluster in Hadoop
Building a Hadoop cluster is a non-trivial job. Ultimately the performance of our system will depend upon how we have configured our cluster. In this section, we will discuss various parameters one should take into consideration while setting up a Hadoop cluster.
For choosing the right hardware one must consider the following points
Understand the kind of workloads the cluster will be dealing with: the volume of data the cluster needs to handle, and the kind of processing required, such as CPU-bound or I/O-bound work.
Data storage methodology, such as any data compression technique used.
Data retention policy, such as how frequently we need to flush data.
Sizing the Hadoop Cluster
For determining the size of a Hadoop cluster, we need to look at how much data is in hand. We should also examine the daily data generation. Based on these factors we can decide the number of machines required and their configuration. There should be a balance between the performance and the cost of the hardware approved.
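A back-of-the-envelope version of this sizing exercise is sketched below. The formula and all numbers (per-node capacity, headroom) are illustrative assumptions, not recommendations:

```python
import math

def estimate_datanodes(current_tb, daily_growth_tb, horizon_days,
                       replication=3, usable_tb_per_node=20,
                       headroom=0.25):
    """Rough DataNode count: project data in hand plus daily growth,
    multiply by the replication factor, add free-space headroom, then
    divide by the usable capacity of one node (all figures assumed)."""
    raw = current_tb + daily_growth_tb * horizon_days
    needed = raw * replication * (1 + headroom)
    return math.ceil(needed / usable_tb_per_node)

# e.g. 100 TB today, 1 TB/day, planned one year out:
# 465 TB raw -> ~1744 TB with 3x replication and 25% headroom
print(estimate_datanodes(100, 1, 365))
```

The same arithmetic makes the performance-versus-cost trade-off explicit: raising per-node capacity lowers the machine count but concentrates more data behind each node's disks and network link.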
Configuring Hadoop Cluster
For deciding the configuration of the Hadoop cluster, run typical Hadoop jobs on the default configuration to get a baseline. We can analyze the job history log files to check whether a job takes more time than expected; if so, change the configuration. After that, repeat the same process to fine-tune the Hadoop cluster configuration so that it meets the business requirements. The performance of the cluster greatly depends upon the resources allocated to the daemons. For small to medium data volumes, the Hadoop cluster allocates one CPU core on each DataNode to the HDFS daemons; for large data sets, it allocates two CPU cores.
6. Hadoop Cluster Management
When you deploy your Hadoop cluster in production, it is apparent that it will need to scale along all dimensions: volume, velocity, and variety. To become production-ready it should be robust and offer round-the-clock availability, performance, and manageability. Hadoop cluster management is the main aspect of your big data initiative.
A good cluster management tool should have the following features:-
- It should provide diverse work-load management, security, resource provisioning, performance optimization, health monitoring. Also, it needs to provide policy management, job scheduling, back up and recovery across one or more nodes.
- Implement NameNode high availability with load balancing, auto-failover, and hot standbys
- Enabling policy-based controls that prevent any application from consuming more resources than others.
- Managing the deployment of any layers of software over Hadoop clusters by performing regression testing. This is to make sure that any jobs or data won’t crash or encounter any bottlenecks in daily operations.
7. Benefits of Hadoop Clusters
Here is a list of benefits provided by Clusters in Hadoop –
Robustness
Data disks failures, heartbeats and re-replication
Cluster Rebalancing
Data integrity
Metadata disk failure
Snapshot
i. Robustness
The main objective of Hadoop is to store data reliably even in the event of failures. The various kinds of failures are NameNode failure, DataNode failure, and network partition. Each DataNode periodically sends a heartbeat signal to the NameNode. In a network partition, a set of DataNodes gets disconnected from the NameNode, so the NameNode does not receive any heartbeats from them. It marks these DataNodes as dead and does not forward any I/O requests to them. The replication factor of the blocks stored on these DataNodes then falls below the specified value, so the NameNode initiates replication of these blocks. In this way, the cluster recovers from the failure.
ii. Data Disks Failure, Heartbeats, and Re-replication
The NameNode receives a heartbeat from each DataNode. It may fail to receive heartbeats for reasons such as a network partition. In this case it marks those nodes as dead, which decreases the replication factor of the data present on the dead nodes. Hence the NameNode initiates replication for these blocks, thereby making the cluster fault-tolerant.
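The bookkeeping just described can be modeled in a few lines. This is a simplified sketch, not Hadoop source; the timeout value and node/block names are assumptions:

```python
# Simplified model of heartbeat tracking: nodes that miss heartbeats
# are treated as dead, and blocks whose live replica count drops below
# the target replication factor are flagged for re-replication.
HEARTBEAT_TIMEOUT = 30.0  # seconds; the real timeout is configurable

def find_under_replicated(block_replicas, last_heartbeat, now,
                          replication=3):
    # A node is live if its last heartbeat arrived within the timeout.
    live = {dn for dn, t in last_heartbeat.items()
            if now - t <= HEARTBEAT_TIMEOUT}
    under = []
    for block, nodes in block_replicas.items():
        live_replicas = [dn for dn in nodes if dn in live]
        if len(live_replicas) < replication:
            under.append(block)  # NameNode would schedule new copies
    return under
```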
iii. Cluster Rebalancing
The HDFS architecture automatically does cluster rebalancing. Suppose the free space on a DataNode falls below a threshold level. Then HDFS automatically moves some data to another DataNode where enough space is available.
iv. Data Integrity
The Hadoop cluster implements a checksum on each block of a file. It does so to detect any corruption due to buggy software, faults in the storage device, etc. If it finds a block corrupted, it fetches the block from another DataNode that has a replica of it.
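The verify-then-fall-back-to-a-replica idea can be demonstrated as below. This is only an illustration: HDFS actually checksums small chunks of each block with CRC-based checksums, while this sketch hashes a whole block with SHA-256 for simplicity:

```python
import hashlib

def checksum(data: bytes) -> str:
    # Stand-in for HDFS's per-chunk CRC checksums.
    return hashlib.sha256(data).hexdigest()

def read_block(replicas, expected):
    """Return the first replica whose checksum matches the one recorded
    at write time; None if every replica is corrupt."""
    for data in replicas:
        if checksum(data) == expected:
            return data
    return None
```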
v. Metadata Disk Failure
FsImage and EditLog are the central data structures of HDFS. Corruption of these files can stop the functioning of HDFS. For this reason, we can configure the NameNode to maintain multiple copies of the FsImage and EditLog. Updating multiple copies of the FsImage and EditLog can degrade the performance of namespace operations. But this is acceptable, as Hadoop deals more with data-intensive applications than with metadata-intensive operations.
vi. Snapshot
A snapshot is nothing but a copy of the data stored at a particular instant of time. One use of snapshots is to roll back a failed HDFS instance to a good point in time. We can take snapshots of a sub-tree of the file system or of the entire file system. Some uses of snapshots are disaster recovery, data backup, and protection against user error. We can take snapshots of any directory, but only directories set as snapshottable can have snapshots, and administrators can set any directory as snapshottable. We cannot rename or delete a snapshottable directory if there are snapshots in it; after removing all the snapshots from the directory, we can rename or delete it.
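The semantics above can be modeled in miniature. (In real HDFS, an administrator runs `hdfs dfsadmin -allowSnapshot <dir>` and then `hdfs dfs -createSnapshot <dir> [name]`; the class below is only a toy model of the record-and-rollback behavior, not how HDFS implements snapshots internally.)

```python
import copy

class SnapshottableDir:
    """Toy model: capture a directory's state by name, roll back later."""
    def __init__(self, files):
        self.files = dict(files)   # filename -> contents
        self.snapshots = {}        # snapshot name -> frozen state

    def create_snapshot(self, name):
        self.snapshots[name] = copy.deepcopy(self.files)

    def rollback(self, name):
        self.files = copy.deepcopy(self.snapshots[name])
```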
8. Summary
There are several options for managing a Hadoop cluster. One of them is Ambari, which Hortonworks and many other players promote. We can manage more than one Hadoop cluster at a time using Ambari. Cloudera Manager is another tool for Hadoop cluster management. Cloudera Manager permits us to deploy and operate the complete Hadoop stack very easily. It provides us with many features like performance and health monitoring of the cluster. Hope this helped. Share your feedback through the comments.
You must explore Top Hadoop Interview Questions
Translator's note: managing a Hadoop cluster deployment with Cloudera Manager + CDH is comparatively easy.