2018-01-05 Hadoop Platform and Application Framework -- Lesson 1 Big Data Hadoop Stack

What is Hadoop

Hadoop was created in 2005 by Doug Cutting (then working at Yahoo!, now Chief Architect at Cloudera) and Mike Cafarella.

"Hadoop" is the name of Doug's son's elephant toy.


Hadoop is (1) Apache open-source software, (2) a framework, for (3) storage and (4) large-scale processing of data sets, on (5) clusters of commodity hardware.

The framework (2) is provided by MapReduce -- a shared and integrated foundation onto which we can bring additional tools.

Large-scale processing (4) proceeds on top of cheap storage (3); scalability across the cluster (5) is the core idea.

A new way to take in and analyze data (4): the schema-on-read style -- the schema is created while reading the raw data, instead of being defined before the data is written (schema-on-write). This keeps more granularity and supports more complex analytics on the data.


Hadoop MapReduce is derived from Google's MapReduce.

Hadoop HDFS (Hadoop Distributed File System) is derived from the Google File System (GFS).


Framework Basic Modules

Hadoop Common

Base libraries and utilities needed by the other modules.


Hadoop Distributed File System

Stores data on commodity machines across the entire cluster.

Written in Java.

An HDFS cluster = a single NameNode + a cluster of DataNodes that store the actual blocks.

NameNode = a primary NameNode plus a Secondary NameNode that builds snapshots of the primary's namespace metadata.
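
A minimal sketch (not from the lecture) of writing and reading a file through the HDFS Java API -- the NameNode address and the path are made-up assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // normally picks up core-site.xml / hdfs-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/hello.txt");       // hypothetical path
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");                     // blocks get replicated across DataNodes
        }
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());               // NameNode resolves which DataNodes hold the blocks
        }
        fs.close();
    }
}
```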


Hadoop YARN

Manages compute resources in the cluster and schedules users' applications.


Hadoop MapReduce

Programming model 

Scales data processing across a lot of different processes.

Engine:

    a JobTracker dispatches jobs to TaskTrackers across the cluster
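
The classic word-count example, sketched with the org.apache.hadoop.mapreduce Java API -- map emits (word, 1) pairs and reduce sums them per word:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);                  // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();   // all counts for one word arrive together
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);         // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```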

Zoo -- Google's original stack and its Hadoop counterparts

BigTable : Google's big data table, from which HBase is derived; handles massive data tables.

MySQL Gateway : adapted to allow querying the data.

Sawzall : high-level access to MapReduce in the cluster for submitting jobs (the role Pig and Hive play in Hadoop).

Evenflow : chains together complex workflows and coordinates events and services (the role Oozie plays in Hadoop).

Dremel : columnar storage and metadata manager, able to process very large amounts of unstructured data (the role Impala plays in Hadoop).

Chubby : coordinates all of the above (the role ZooKeeper plays in Hadoop).


Cloudera Stack


Ecosystem

Core components

Sqoop 

Transfers bulk data between Hadoop and structured datastores such as relational databases.

A CLI tool for importing individual tables or whole databases into HDFS.


HBASE

Based on Google's BigTable.

Handles massive data tables with billions of rows and millions of columns.
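
A minimal sketch of writing and reading one cell with the HBase Java client -- the table name "metrics", the column family "cf" and the row key are made-up assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("metrics"))) {
            // Rows are keyed by an arbitrary byte[] row key; cells live in column families.
            Put put = new Put(Bytes.toBytes("sensor-42#2018-01-05"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("sensor-42#2018-01-05")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("temp"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```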


Hive

Data warehouse software that facilitates querying and managing large datasets residing in distributed storage, by projecting structure on top of the data and letting us use SQL-like queries.

SQL-like query language : HiveQL
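
A minimal sketch of running a HiveQL query from Java over the Hive JDBC driver -- the server address and the web_logs table are made-up assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL but is compiled into jobs over data sitting in HDFS.
             ResultSet rs = stmt.executeQuery("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```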


Pig

Provides the scripting language Pig Latin for creating MapReduce programs on Hadoop.

It can interoperate with other languages in both directions (e.g., through user-defined functions).

Excels at describing data analysis problems as data flows.
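
A minimal sketch of a Pig Latin data flow embedded in Java via PigServer -- the input path, schema and script are made-up assumptions:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);     // or ExecType.MAPREDUCE on a cluster
        // Each statement describes one step of the data flow.
        pig.registerQuery("logs = LOAD '/user/demo/web_logs' AS (page:chararray, status:int);");
        pig.registerQuery("errors = FILTER logs BY status >= 500;");
        pig.registerQuery("grouped = GROUP errors BY page;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(errors);");
        pig.store("counts", "/user/demo/error_counts");    // triggers execution of the whole flow
    }
}
```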


Oozie

Workflow scheduler system for managing Hadoop jobs.

Supports job scheduling for MapReduce, Pig, Hive, Sqoop, etc.


Zookeeper

It provides a distributed configuration service, a synchronization service for coordinating jobs, and a naming registry for the entire distributed system.

Distributed applications use ZooKeeper to store and mediate updates to important configuration information on the cluster itself.
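
A minimal sketch of storing and reading a small configuration value in ZooKeeper with its Java client -- the connect string, znode path and value are made-up assumptions:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // connect string, session timeout (ms), and a no-op watcher
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        String path = "/demo-config";                       // hypothetical znode
        if (zk.exists(path, false) == null) {
            zk.create(path, "128".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);   // visible to the whole cluster
        } else {
            zk.setData(path, "128".getBytes(), -1);         // -1 = ignore version check
        }

        byte[] data = zk.getData(path, false, null);        // any node in the cluster can read it back
        System.out.println(new String(data));
        zk.close();
    }
}
```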


Flume

Collects, aggregates, and moves large amounts of log data.


Other components

Impala

Massively parallel processing SQL query engine


Spark

A scalable data analytics platform that incorporates primitives for in-memory computing. 

Written in Scala.

Strong support for machine learning libraries.
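
A minimal sketch of Spark's in-memory primitives from the Java API -- the log file path is a made-up assumption; cache() keeps the filtered RDD in memory so the second action reuses it instead of rereading from disk:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("log-errors").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/app.log");       // hypothetical input
            JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR")).cache();  // keep in memory

            long total = errors.count();                                            // first action materializes the RDD
            long timeouts = errors.filter(l -> l.contains("timeout")).count();      // reuses the cached data
            System.out.println(total + " errors, " + timeouts + " timeouts");
        }
    }
}
```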

URL

https://hadoop.apache.org/

Lesson1 Slides
