What is Hadoop

Hadoop was created in 2005 by Doug Cutting (who was working at Yahoo and is now Chief Architect at Cloudera) and Mike Cafarella.

"Hadoop" is the name of Doug's son's toy elephant.

Hadoop is (1) Apache open-source software and (2) a framework for (3) storage and (4) large-scale processing of data sets on (5) clusters of commodity hardware.

(2) is provided by MapReduce: a shared, integrated foundation to which additional tools can be added.

(4) runs on (3) cheap computing and storage. (5) Scalability is at the core.

New way to (4) take in and analyze data: schema-on-read style, i.e. applying a schema while reading the raw data, instead of defining the schema before the data is written (schema-on-write). Finer granularity; complex analytics on small amounts of data.
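
The schema-on-read idea can be sketched in a few lines of plain Python. This is only an illustration, not Hadoop code: the sample log text, the `SCHEMA` list, and the `read_with_schema` helper are all made up for this note.

```python
import csv
import io

# Raw data is kept exactly as it arrived; nothing is validated at write time.
RAW_LOG = "2018-01-05,alice,42\n2018-01-05,bob,17\n2018-01-06,alice,8\n"

# Hypothetical schema, chosen only when the data is read: (column name, parser).
SCHEMA = [("day", str), ("user", str), ("clicks", int)]


def read_with_schema(raw_text, schema):
    """Apply a schema to raw comma-separated text while reading it."""
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: parse(value) for (name, parse), value in zip(schema, row)}


records = list(read_with_schema(RAW_LOG, SCHEMA))
total_clicks = sum(record["clicks"] for record in records)
print(total_clicks)  # 67
```

A different analysis could read the same raw text with a different schema, without rewriting the stored data.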

Hadoop MapReduce is derived from Google's MapReduce.

HDFS (Hadoop Distributed File System) is derived from the Google File System (GFS).

Framework Basic Modules

Hadoop Common

Base libraries and tools needed by the other modules.

Hadoop Distributed File System

Stores data on commodity machines across the entire cluster.

Written in Java.

1 Hadoop instance = 1 NameNode + 1 HDFS cluster of DataNodes.

NameNode = a primary NameNode, plus a Secondary NameNode that builds snapshots of the primary's metadata.
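
The split between the NameNode (metadata) and the DataNodes (actual bytes) can be pictured with a toy model. Everything here is an assumption made for illustration: the `ToyNameNode` class, the 4-byte block size, and the round-robin placement are invented, and replication is left out. In real HDFS, clients also exchange data with DataNodes directly rather than through the NameNode.

```python
from itertools import cycle

BLOCK_SIZE = 4  # bytes; tiny on purpose, real HDFS blocks are far larger


class ToyNameNode:
    """Toy model: tracks which DataNode stores which block of which file."""

    def __init__(self, datanode_names):
        # Stand-in for the DataNodes: name -> {block_id: bytes}
        self.datanodes = {name: {} for name in datanode_names}
        # NameNode metadata: filename -> [(block_id, datanode_name), ...]
        self.block_map = {}
        self._next_node = cycle(datanode_names)

    def put(self, filename, data):
        """Split data into blocks and place them round-robin on DataNodes."""
        locations = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = f"{filename}#{offset // BLOCK_SIZE}"
            node = next(self._next_node)
            self.datanodes[node][block_id] = data[offset:offset + BLOCK_SIZE]
            locations.append((block_id, node))
        self.block_map[filename] = locations

    def get(self, filename):
        """Reassemble a file by following the block map."""
        return b"".join(
            self.datanodes[node][block_id]
            for block_id, node in self.block_map[filename]
        )


namenode = ToyNameNode(["dn1", "dn2", "dn3"])
namenode.put("notes.txt", b"hello hadoop")
print(namenode.block_map["notes.txt"])
print(namenode.get("notes.txt"))  # b'hello hadoop'
```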

Hadoop YARN

Manages compute resources in the cluster and uses them to schedule users and applications.

Hadoop MapReduce

Programming model

Scales data processing across many different processes.

Engine:

The JobTracker dispatches jobs to the TaskTrackers in the cluster.
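
The programming model itself can be shown with a single-process word count. This is a minimal sketch in plain Python, not the Hadoop API: the function names `map_phase`, `shuffle`, and `reduce_phase` are chosen for this note, and nothing here is distributed across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data node"])
print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

In a real cluster, each map call and each reduce call could run as a separate task on a different machine, which is what lets the model scale.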

Zoo

Bigtable : the design HBase is derived from; handles massive data tables.

MySQL Gateway : allows the data to be queried.

Sawzall : high-level language for accessing MapReduce in the cluster and submitting jobs.

Evenflow : chains together complex work codes and coordinates events and services.

Dremel : a metadata manager, able to process a very large amount of unstructured data.

Chubby : coordinates all of the above.

Cloudera Stack

Ecosystem

Core components

Sqoop

Transferring bulk data between Hadoop and structured datastores like relational databases.

CLI tool; imports tables or whole databases into HDFS.

HBase

Based on Google's Bigtable.

Handles massive data tables with billions of rows and millions of columns.

Hive

Data warehouse software that facilitates querying and managing large datasets residing in distributed storage, by projecting structure on top of the data and allowing SQL-like queries.

SQL-like language : HiveQL
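
To give a feel for the kind of query this enables, here is a small aggregate query run with Python's built-in `sqlite3` module. SQLite is only a stand-in so the example runs anywhere; it is not Hive, and the `page_views` table and its rows are invented. The same SELECT / GROUP BY / ORDER BY shape is what a HiveQL query over data in distributed storage would typically look like.

```python
import sqlite3

# SQLite stands in for Hive here; a real Hive table would sit on top of files in HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, visitor TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("/home", "alice"), ("/home", "bob"), ("/docs", "alice")],
)

# Count views per page, most viewed first.
QUERY = """
SELECT page, COUNT(*) AS views
FROM page_views
GROUP BY page
ORDER BY views DESC
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [('/home', 2), ('/docs', 1)]
```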

Pig

Scripting language: Pig Latin, for creating MapReduce programs on Hadoop.

It works with other languages in both directions: Pig Latin scripts can call functions written in other languages, and other programs can invoke Pig.

Excels at describing data analysis problems as data flows.

Oozie

Workflow scheduler system that manages Hadoop jobs.

Supports job scheduling for MapReduce, Pig, Hive, Sqoop, etc.
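
The core idea of a workflow scheduler, running jobs only after the jobs they depend on, can be sketched with the standard library. This is not how Oozie is configured (real Oozie workflows are defined in XML); the job names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
WORKFLOW = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_report": {"pig_clean"},
    "mapreduce_index": {"pig_clean"},
}


def run_order(workflow):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(workflow).static_order())


order = run_order(WORKFLOW)
print(order)
```

The two jobs that depend only on `pig_clean` have no ordering between them, so a scheduler would be free to run them in parallel.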

ZooKeeper

It provides a distributed configuration service, a synchronization service (so it can synchronize all these jobs), and a naming registry for the entire distributed system.

Distributed applications use ZooKeeper to store and mediate updates to important configuration information on the cluster itself.
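
The "store configuration and get told when it changes" pattern can be mimicked in a single process. The `ToyConfigRegistry` class below is invented for this note and is not the ZooKeeper API; it only shows the shape of the idea, with path-like keys and callbacks that fire on updates.

```python
class ToyConfigRegistry:
    """In-process stand-in for a coordination service; not the ZooKeeper API."""

    def __init__(self):
        self._data = {}      # path -> current value
        self._watchers = {}  # path -> list of callbacks

    def watch(self, path, callback):
        """Register a callback to be told about every update to a path."""
        self._watchers.setdefault(path, []).append(callback)

    def set(self, path, value):
        """Store a value and notify everyone watching that path."""
        self._data[path] = value
        for callback in self._watchers.get(path, []):
            callback(path, value)

    def get(self, path):
        return self._data[path]


registry = ToyConfigRegistry()
seen = []  # updates observed by a hypothetical worker process
registry.watch("/config/batch_size", lambda path, value: seen.append((path, value)))
registry.set("/config/batch_size", 128)
registry.set("/config/batch_size", 256)
print(registry.get("/config/batch_size"))  # 256
print(seen)  # [('/config/batch_size', 128), ('/config/batch_size', 256)]
```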

Flume

Collects, aggregates, and moves large amounts of log data.

Other components

Impala

Massively parallel processing SQL query engine

Spark

A scalable data analytics platform that incorporates primitives for in-memory computing.

Written in Scala.

Strong support for machine learning libraries.

URL

https://hadoop.apache.org/

Lesson1 Slides

2018-01-05 Hadoop Platform and Application Framework -- Lesson 1 Big Data Hadoop Stack