Notes and Excerpts from the Original HDFS Paper

Original paper: The Hadoop Distributed File System (2010)

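This post is nothing more than notes and excerpts taken while reading the original paper. The only aim is to keep the ratio of paper content to my own commentary as high as possible.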

Questions

  • When do HDFS clusters become unbalanced? How does rebalancing work?
  • What does fault tolerance look like in HDFS?
  • What are some common failure modes in HDFS? Which ones are recoverable and which ones are not?
  • How is it different from GFS?
  • What's the real risk of reducing the replication factor?

Notes

Architecture

NameNode

  • Files and directories represented as inodes on NameNode.
  • NameNode maintains namespace to file location mapping.
  • HDFS client read: asks NN for locations of data blocks for file X. Then reads from blocks closest to itself.
  • HDFS client write: requests NN for three DataNodes to host block replicas.
  • Entire namespace kept in RAM.
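The read path above can be sketched as a toy in-memory model. This is only an illustration of the two mappings the NameNode keeps in RAM and of the client picking the nearest replica; `ToyNameNode`, `pick_closest`, and the distance table are hypothetical names, not HDFS APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    """A file entry in the namespace (hypothetical, heavily simplified)."""
    path: str
    block_ids: list = field(default_factory=list)


class ToyNameNode:
    def __init__(self):
        self.namespace = {}        # path -> Inode, kept entirely in memory
        self.block_locations = {}  # block id -> list of DataNode names

    def add_file(self, path, blocks):
        """blocks: dict mapping block id -> DataNodes holding a replica."""
        self.namespace[path] = Inode(path, list(blocks))
        self.block_locations.update(blocks)

    def get_block_locations(self, path):
        """What a client asks for before reading: blocks and their replicas."""
        inode = self.namespace[path]
        return [(b, self.block_locations[b]) for b in inode.block_ids]


def pick_closest(replicas, distance):
    """Client side: choose the replica with the smallest network distance."""
    return min(replicas, key=distance)
```

A client would call `get_block_locations("/x")` once, then read each block from `pick_closest(...)` of its replica list.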

DataNodes

  • Block replica:
    • File 1: the file data itself
    • File 2: metadata, i.e. checksum + generation stamp
  • On startup, the DN performs a handshake with the NN to verify the namespace ID and software version. If either does not match, the DN shuts down.
  • DN registers with NN. DNs persistently store unique storage IDs (independent of IP address or Port).
  • Block report: tells the NN which block replicas the DN holds. For each replica it lists the block id, generation stamp, and length. Sent on startup and then every hour.
  • DN sends NN a heartbeat every 3 seconds. Heartbeat contains: storage capacity, % in-use, # of data transfers in progress.
  • If the NN receives no heartbeat from a DN for 10 minutes, it considers the DN out of service and its block replicas unavailable, and schedules creation of new replicas on other DNs.
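The heartbeat timing above can be sketched as a small liveness tracker. The 3 second and 10 minute values come from the notes; the class and method names are hypothetical, and timestamps are passed in explicitly so the logic is easy to test.

```python
HEARTBEAT_INTERVAL_S = 3   # DN sends a heartbeat every 3 seconds
DEAD_AFTER_S = 10 * 60     # NN gives up on a DN after 10 minutes of silence


class HeartbeatMonitor:
    """Hypothetical NameNode-side bookkeeping of the last heartbeat per DN."""

    def __init__(self):
        self.last_seen = {}  # DataNode id -> timestamp of last heartbeat

    def heartbeat(self, dn_id, now):
        self.last_seen[dn_id] = now

    def dead_nodes(self, now):
        """DataNodes silent for longer than DEAD_AFTER_S, in sorted order."""
        return sorted(
            dn for dn, seen in self.last_seen.items()
            if now - seen > DEAD_AFTER_S
        )
```

Every block replica on a node returned by `dead_nodes` would then be a candidate for re-replication elsewhere.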

HDFS Client

  • Code library that exports the HDFS file system interface.
  • HDFS provides an API that exposes the locations of a file's blocks. Useful for optimizing data locality in MapReduce (MR) jobs.

Image and Journal

  • Journal: write-ahead commit log for file system changes.

The checkpoint file is never changed by the NameNode; it is replaced in its entirety when a new checkpoint is created during restart, when requested by the administrator, or by the CheckpointNode.

During startup the NameNode initializes the namespace image from the checkpoint, and then replays changes from the journal until the image is up-to-date with the last state of the file system. A new checkpoint and empty journal are written back to the storage directories before the NameNode starts serving clients.
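The startup sequence quoted above amounts to replaying a log onto a saved image. Here is a minimal sketch, assuming a made-up journal format of `(op, path, ...)` tuples and an image that maps each path to its list of block ids; neither is the real on-disk format.

```python
def replay(checkpoint, journal):
    """Rebuild the in-memory image from a checkpoint plus journal entries.

    checkpoint: dict path -> list of block ids (left unmodified)
    journal:    iterable of tuples such as ("create", path),
                ("add_block", path, block_id), ("delete", path)
    """
    image = dict(checkpoint)
    for op, path, *args in journal:
        if op == "create":
            image[path] = []
        elif op == "add_block":
            # Build a new list so the checkpoint's lists are never mutated.
            image[path] = image[path] + [args[0]]
        elif op == "delete":
            image.pop(path, None)
        else:
            raise ValueError(f"unknown journal op: {op}")
    return image
```

The returned image is what would be written back as the new checkpoint, paired with an empty journal. The same merge is what a CheckpointNode performs on a separate host.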

Saving a transaction to disk becomes a bottleneck since all other threads need to wait until the synchronous flush-and-sync procedure initiated by one of them is complete.

Checkpoint Node

  • This is a role that a NameNode can take on.
  • If you checkpoint often, your journal will be shorter and it will take a much shorter time to restart the NameNode.

The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. The CheckpointNode usually runs on a different host from the NameNode since it has the same memory requirements as the NameNode. It downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint back to the NameNode.

What does it mean that the Checkpoint and Backup are roles that a NameNode can take on exclusive of each other? It seems that a backup node cannot also be a namenode simultaneously.

Backup Node

Why would you ever want a Checkpoint Node instead of a Backup Node, since Backup nodes can create checkpoints but also maintain an up to date backup of the NameNode?

File System Snapshots

  • A NN snapshot is basically a new checkpoint.
  • DN snapshots are created during the startup handshake, when the NN tells each DN whether to create a local snapshot.

...each DataNode creates a copy of the storage directory and hard links existing block files into it. When the DataNode removes a block it removes only the hard link, and block modifications during appends use the copy-on-write technique.

  • Old blocks are untouched so the backed up old hard links will still work.
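The hard-link trick can be tried on a local file system. This is a sketch of the idea only, not the DataNode's actual code; the directory layout and the names `snapshot_storage_dir` and `demo` are hypothetical.

```python
import os
import tempfile


def snapshot_storage_dir(current_dir, snapshot_dir):
    """Create snapshot_dir and hard link every block file into it."""
    os.makedirs(snapshot_dir)
    for name in os.listdir(current_dir):
        src = os.path.join(current_dir, name)
        if os.path.isfile(src):
            # A hard link shares the file's data; no bytes are copied.
            os.link(src, os.path.join(snapshot_dir, name))


def demo():
    """Delete a block after the snapshot and read it back from the snapshot."""
    with tempfile.TemporaryDirectory() as root:
        current = os.path.join(root, "current")
        os.makedirs(current)
        block = os.path.join(current, "blk_1")
        with open(block, "wb") as f:
            f.write(b"data")
        snap = os.path.join(root, "snapshot")
        snapshot_storage_dir(current, snap)
        os.remove(block)  # removes only one link; the data stays on disk
        with open(os.path.join(snap, "blk_1"), "rb") as f:
            return f.read()
```

Because the snapshot costs only directory entries, it is cheap to take. An in-place write would be visible through both links, which is why appends need copy-on-write.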

If an upgraded NameNode due to a software bug purges its image then backing up only the namespace state still results in total data loss, as the NameNode will not recognize the blocks reported by DataNodes, and will order their deletion. Rolling back in this case will recover the metadata, but the data itself will be lost.

File I/O and Replica Management

File Read / Write

  • Single writer, multiple reader.
  • The HDFS client that opens a file for writing is given an exclusive lease, which it must renew periodically via heartbeat. The lease is revoked when the file is closed.
  • Soft limit expiry -- other clients can preempt the lease.
  • Hard limit expiry -- NN closes the file and revokes the lease on the writer's behalf.
  • Data is not guaranteed to be immediately visible to readers. To make it visible, use the hflush operation, which waits until all DNs in the pipeline ack the successful transmission of the packet.
  • Each data block has a checksum, which the HDFS client verifies while reading.
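The two lease limits above can be modeled as a small state function. This is a sketch under assumptions: the class name, the state names, and the limit values used in the usage note are hypothetical, so both limits are passed in as parameters.

```python
class Lease:
    """Hypothetical model of a single-writer lease with soft and hard limits."""

    def __init__(self, holder, now, soft_limit_s, hard_limit_s):
        self.holder = holder
        self.last_renewed = now
        self.soft_limit_s = soft_limit_s
        self.hard_limit_s = hard_limit_s

    def renew(self, now):
        """Called when the writer's heartbeat reaches the NameNode."""
        self.last_renewed = now

    def state(self, now):
        age = now - self.last_renewed
        if age >= self.hard_limit_s:
            return "hard_expired"   # NN closes the file, revokes the lease
        if age >= self.soft_limit_s:
            return "soft_expired"   # another client may preempt the lease
        return "active"             # the writer still has exclusive access
```

For example, with `Lease("client-1", 0, 60, 3600)` the lease is active at t=30, soft-expired at t=60, and hard-expired at t=3600 unless `renew` is called first.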

Block Placement

  1. No Datanode contains more than one replica of any block.
  2. No rack contains more than two replicas of the same block, provided there are sufficient racks on the cluster.
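The two placement rules can be checked mechanically. This is a minimal sketch: `placement_ok` and the `rack_of` mapping are hypothetical, and the "sufficient racks" condition is reduced to a flag the caller sets.

```python
from collections import Counter


def placement_ok(replicas, rack_of, enforce_rack_rule=True):
    """Check a proposed replica placement for one block.

    replicas: list of DataNode names chosen for the block
    rack_of:  dict DataNode name -> rack name
    enforce_rack_rule: set to False when the cluster has too few racks
    """
    # Rule 1: no DataNode holds more than one replica of the block.
    if any(count > 1 for count in Counter(replicas).values()):
        return False
    # Rule 2: no rack holds more than two replicas of the block.
    if enforce_rack_rule:
        per_rack = Counter(rack_of[dn] for dn in replicas)
        if any(count > 2 for count in per_rack.values()):
            return False
    return True
```

With three replicas, the rules allow two on one rack and one on another, but never all three on the same rack unless the rack rule is relaxed.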

Replication Management

Balancer

  • Balances disk space: aims to make sure the disk utilization of each DataNode is within a threshold of the disk utilization of the entire cluster.
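The balancing criterion can be written down directly. The sketch below assumes utilization is used bytes divided by capacity and uses a hypothetical default threshold of 0.10; it only identifies which nodes are out of range and does not move any blocks.

```python
def unbalanced_nodes(used, capacity, threshold=0.10):
    """Return DataNodes whose utilization deviates too far from the cluster's.

    used, capacity: dicts DataNode name -> bytes (or any consistent unit)
    threshold:      allowed absolute difference, as a fraction in (0, 1)
    """
    cluster_util = sum(used.values()) / sum(capacity.values())
    result = {}
    for dn in used:
        util = used[dn] / capacity[dn]
        if abs(util - cluster_util) > threshold:
            result[dn] = "over" if util > cluster_util else "under"
    return result
```

A balancer would then move replicas from "over" nodes to "under" nodes until this function returns an empty dict.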

Block Scanner

  • Happens on each DataNode, scans block replicas and verifies checksums.
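Checksum verification, whether by the block scanner or by a reading client, boils down to recomputing and comparing. A sketch using CRC32 from Python's standard library follows; the 512-byte chunk size and the function names are assumptions for illustration, not taken from the notes.

```python
import zlib

CHUNK_BYTES = 512  # assumed chunk size for this sketch


def compute_checksums(data, chunk=CHUNK_BYTES):
    """One CRC32 per fixed-size chunk of the block's bytes."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def verify(data, checksums, chunk=CHUNK_BYTES):
    """True if the stored checksums still match the block's current bytes."""
    return compute_checksums(data, chunk) == checksums
```

When `verify` fails, the replica would be reported as corrupt so that a good copy can be replicated in its place.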

Decommissioning

  • You can mark a DN as decommissioning, and it will take no new replicas. Its block replicas will be duplicated to other DNs, and once the NN is happy with the level of replication it will mark said DN as decommissioned.
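One plausible reading of "happy with the level of replication" is that every block on the departing node has enough replicas elsewhere. The sketch below encodes that reading; the function name, the state strings, and the exact condition are assumptions.

```python
def decommission_state(dn, block_replicas, target_replication):
    """Decide whether a DataNode marked for removal can be retired yet.

    block_replicas: dict block id -> set of DataNodes holding a replica
    """
    for holders in block_replicas.values():
        if dn in holders:
            others = holders - {dn}
            if len(others) < target_replication:
                # Some block still depends on this node's replica.
                return "decommissioning"
    return "decommissioned"
```

The NN would re-evaluate this as replicas are copied to other nodes, and the node can be removed safely once it reports "decommissioned".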

Inter-Cluster Data Copy (DistCp)

Terminology

  • MDS: Metadata Servers
  • Image: inode data and list of blocks belonging to each file (name system metadata).
  • Checkpoint: a persistent record of the image stored in the local host's native file system.
  • Journal: modification log of the image.

Further reading on Jianshu:
http://www.jianshu.com/p/556242973dbb
http://www.jianshu.com/p/cf723a856d04
http://www.jianshu.com/p/df6dfc339a91
http://www.jianshu.com/p/05a205519e6e

Other further reading:
http://www.slideshare.net/YuvalCarmel/gfs-vs-hdfs
