HDFS原论文笔记摘抄

原文：The Hadoop Distributed Filesystem (2010): pdf acm

本文纯属看原论文的笔记和摘抄。只是在尽量提高内容：阅读自述的比例而已。

Questions

When do HDFS clusters become unbalanced? How does rebalancing works?
What does fault tolerance look like in HDFS?
What are some common failure modes in HDFS? Which ones are recoverable and which ones are not?
How is it different from GFS?
What's the real risk of reducing the replication factor?

Notes

Architecture

NameNode

Files and directories represented as inodes on NameNode.
NameNode maintains namespace to file location mapping.
HDFS client read: asks NN for locations of data blocks for file X. Then reads from blocks closest to itself.
HDFS client write: requests NN for three DataNodes to host block replicas.
Entire namespace kept in RAM.

DataNodes

Block replica:
- File 1: the file data itself
- File 2: metadata, i.e. checksum + generation stamp
On startup, DN does handshake with NN, verifies namespace ID, software version, if not match, DN shuts down.
DN registers with NN. DNs persistently store unique storage IDs (independent of IP address or Port).
Block report: identifies blocks in possession to NN, has block id, generation stamp, length for each block replica. Sent on startup and at every hour.
DN sends NN a heartbeat every 3 seconds. Heartbeat contains: storage capacity, % in-use, # of data transfers in progress.
NN gets worried after 10 min and deems any non-responsive DN unavailable, will schedule creation of new replicas.

HDFS Client

Code library that exports the HDFS file system interface.
HDFS provides an API that exposes the locations of a file blocks. Useful for optimizing for locality on MR jobs.

Image and Journal

Journal: write-ahead commit log for file system changes.

The checkpoint file is never changed by the NameNode; it is replaced in its entirety when a new checkpoint is created during restart, when requested by the administrator, or by the CheckpointNode.

During startup the NameNode initializes the namespace image from the checkpoint, and then replays changes from the journal until the image is up-to-date with the last state of the file system. A new checkpoint and empty journal are written back to the storage directories before the NameNode starts serving clients.

Saving a transaction to disk becomes a bottleneck since all other threads need to wait until the synchronous flush-and-sync procedure initiated by one of them is complete.

Checkpoint Node

This is a role that a Name Node can take on.
If you checkpoint often, your journal will be shorter and it will take a much shorter time to restart the NameNode.

The CheckpointNode periodically combines the existing
checkpoint and journal to create a new checkpoint and an
empty journal. The CheckpointNode usually runs on a different host from the NameNode since it has the same memory requirements as the NameNode. It downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint back to the NameNode.

What does it mean that the Checkpoint and Backup are roles that a NameNode can take on exclusive of each other? It seems that a backup node cannot also be a namenode simultaneously.

Backup Node

Why would you ever want a Checkpoint Node instead of a Backup Node, since Backup nodes can create checkpoints but also maintain an up to date backup of the NameNode?

File System Snapshots

A NN snapshot is basically a new checkpoint.
DN snapshots can be created on handshake.

...each DataNode creates a copy of the storage directory and hard links existing block files into it. When the DataNode removes a block it removes only the hard link, and block modifications during appends use the copy-on-write technique.

Old blocks are untouched so the backed up old hard links will still work.

If an upgraded NameNode due to a software bug purges its image then backing up only the namespace state still results in total data loss, as the NameNode will not recognize the blocks reported by DataNodes, and will order their deletion. Rolling back in this case will recover the metadata, but the data itself will be lost.

File I/O and Replica Management

File Read / Write

Single writer, multiple reader.
HDFS Client opens file for writing, is given an exclusive lease that it constantly must renew via heartbeat, least revoked on file close.
Soft limit expiry --other clients can pre-empt the least.
Hard limit expiry -- NN closes file and revokes lease on your behalf.
Data is not guaranteed to be immediately visible to readers. To do so you have to use the hflush operation--it waits until all DNs in the pipeline ack the successful transmission of the packet.
Each data block has a checksum, this is verified by HDFS client while reading.

Block Placement

No Datanode contains more than one replica of any block.
No rack contains more than two replicas of the same block, provided there are sufficient racks on the cluster.

Replication Management

Balancer

Balances disk space: aims to make sure the disk utilization of each DataNode is within a threshold of the disk utilization of the entire cluster.

Block Scanner

Happens on each DataNode, scans block replicas and verifies checksums.

Decomissioning

You can mark a DN as decommissioning, and it will take no new replicas. Its block replicas will be duplicated to other DNs, and once the NN is happy with the level of replication it will mark said DN as decommissioned.

Inter-Cluster Data Copy `DistCp`

Terminology

MDS: Metadata Servers
Image: inode data and list of blocks belonging to each file (name system metadata).
Checkpoint: a persistent record of the image stored in the local host's native file system.
Journal: modification log of the image.

简书上扩展阅读：
http://www.jianshu.com/p/556242973dbb
http://www.jianshu.com/p/cf723a856d04
http://www.jianshu.com/p/df6dfc339a91
http://www.jianshu.com/p/05a205519e6e

其它扩展阅读：
http://www.slideshare.net/YuvalCarmel/gfs-vs-hdfs