HDFS原论文笔记摘抄

原文:The Hadoop Distributed Filesystem (2010): pdf acm

本文纯属看原论文的笔记和摘抄。只是在尽量提高内容:阅读自述的比例而已。

Questions

  • When do HDFS clusters become unbalanced? How does rebalancing works?
  • What does fault tolerance look like in HDFS?
  • What are some common failure modes in HDFS? Which ones are recoverable and which ones are not?
  • How is it different from GFS?
  • What's the real risk of reducing the replication factor?

Notes

Architecture

NameNode

  • Files and directories represented as inodes on NameNode.
  • NameNode maintains namespace to file location mapping.
  • HDFS client read: asks NN for locations of data blocks for file X. Then reads from blocks closest to itself.
  • HDFS client write: requests NN for three DataNodes to host block replicas.
  • Entire namespace kept in RAM.

DataNodes

  • Block replica:
    • File 1: the file data itself
    • File 2: metadata, i.e. checksum + generation stamp
  • On startup, DN does handshake with NN, verifies namespace ID, software version, if not match, DN shuts down.
  • DN registers with NN. DNs persistently store unique storage IDs (independent of IP address or Port).
  • Block report: identifies blocks in possession to NN, has block id, generation stamp, length for each block replica. Sent on startup and at every hour.
  • DN sends NN a heartbeat every 3 seconds. Heartbeat contains: storage capacity, % in-use, # of data transfers in progress.
  • NN gets worried after 10 min and deems any non-responsive DN unavailable, will schedule creation of new replicas.

HDFS Client

  • Code library that exports the HDFS file system interface.
  • HDFS provides an API that exposes the locations of a file blocks. Useful for optimizing for locality on MR jobs.

Image and Journal

  • Journal: write-ahead commit log for file system changes.

The checkpoint file is never changed by the NameNode; it is replaced in its entirety when a new checkpoint is created during restart, when requested by the administrator, or by the CheckpointNode.

During startup the NameNode initializes the namespace image from the checkpoint, and then replays changes from the journal until the image is up-to-date with the last state of the file system. A new checkpoint and empty journal are written back to the storage directories before the NameNode starts serving clients.

Saving a transaction to disk becomes a bottleneck since all other threads need to wait until the synchronous flush-and-sync procedure initiated by one of them is complete.

Checkpoint Node

  • This is a role that a Name Node can take on.
  • If you checkpoint often, your journal will be shorter and it will take a much shorter time to restart the NameNode.

The CheckpointNode periodically combines the existing
checkpoint and journal to create a new checkpoint and an
empty journal. The CheckpointNode usually runs on a different host from the NameNode since it has the same memory requirements as the NameNode. It downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint back to the NameNode.

What does it mean that the Checkpoint and Backup are roles that a NameNode can take on exclusive of each other? It seems that a backup node cannot also be a namenode simultaneously.

Backup Node

Why would you ever want a Checkpoint Node instead of a Backup Node, since Backup nodes can create checkpoints but also maintain an up to date backup of the NameNode?

File System Snapshots

  • A NN snapshot is basically a new checkpoint.
  • DN snapshots can be created on handshake.

...each DataNode creates a copy of the storage directory and hard links existing block files into it. When the DataNode removes a block it removes only the hard link, and block modifications during appends use the copy-on-write technique.

  • Old blocks are untouched so the backed up old hard links will still work.

If an upgraded NameNode due to a software bug purges its image then backing up only the namespace state still results in total data loss, as the NameNode will not recognize the blocks reported by DataNodes, and will order their deletion. Rolling back in this case will recover the metadata, but the data itself will be lost.

File I/O and Replica Management

File Read / Write

  • Single writer, multiple reader.
  • HDFS Client opens file for writing, is given an exclusive lease that it constantly must renew via heartbeat, least revoked on file close.
  • Soft limit expiry --other clients can pre-empt the least.
  • Hard limit expiry -- NN closes file and revokes lease on your behalf.
  • Data is not guaranteed to be immediately visible to readers. To do so you have to use the hflush operation--it waits until all DNs in the pipeline ack the successful transmission of the packet.
  • Each data block has a checksum, this is verified by HDFS client while reading.

Block Placement

  1. No Datanode contains more than one replica of any block.
  2. No rack contains more than two replicas of the same block, provided there are sufficient racks on the cluster.

Replication Management

Balancer

  • Balances disk space: aims to make sure the disk utilization of each DataNode is within a threshold of the disk utilization of the entire cluster.

Block Scanner

  • Happens on each DataNode, scans block replicas and verifies checksums.

Decomissioning

  • You can mark a DN as decommissioning, and it will take no new replicas. Its block replicas will be duplicated to other DNs, and once the NN is happy with the level of replication it will mark said DN as decommissioned.

Inter-Cluster Data Copy DistCp

Terminology

  • MDS: Metadata Servers
  • Image: inode data and list of blocks belonging to each file (name system metadata).
  • Checkpoint: a persistent record of the image stored in the local host's native file system.
  • Journal: modification log of the image.

简书上扩展阅读:
http://www.jianshu.com/p/556242973dbb
http://www.jianshu.com/p/cf723a856d04
http://www.jianshu.com/p/df6dfc339a91
http://www.jianshu.com/p/05a205519e6e

其它扩展阅读:
http://www.slideshare.net/YuvalCarmel/gfs-vs-hdfs

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 217,826评论 6 506
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 92,968评论 3 395
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 164,234评论 0 354
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,562评论 1 293
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,611评论 6 392
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,482评论 1 302
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,271评论 3 418
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 39,166评论 0 276
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,608评论 1 314
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,814评论 3 336
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,926评论 1 348
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,644评论 5 346
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,249评论 3 329
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,866评论 0 22
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,991评论 1 269
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 48,063评论 3 370
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,871评论 2 354

推荐阅读更多精彩内容

  • (一)分布式文件系统概述 数据量越来越多,在一个操作系统管辖的范围存不下了,那么就分配到更多的操作系统管理的磁盘中...
    时待吾阅读 1,501评论 0 0
  • 明确自己:人生观、价值观、事业观 完善自己:人生方向、心态调整、人脉积累、口才提升、工作能力、处事心计、投资理财 ...
    iChuck阅读 657评论 0 1
  • 放假后上班的第一天,朦朦胧胧中我似乎是醒着的,又似乎还在沉睡。 我能清楚的意识到现在身上盖的被单不足以抵抗十月的秋...
    咩一阅读 196评论 0 0
  • 日本是一个极其礼貌有序的国家,在其国民特质上最明显的体现就是不麻烦别人。他们在地铁车厢内会将报纸折成半个版...
    richfancy阅读 276评论 0 0
  • 最近产品上遇到了一个小需求, 大致是模仿映客直播的开始直播界面输入标题的那部分样式, 看了一下, 自带的UITex...
    codiy_huang阅读 3,747评论 0 5