What is Hadoop
Hadoop was created in 2005 by Doug Cutting (who worked at Yahoo and later became Chief Architect at Cloudera) and Mike Cafarella.
"Hadoop" is the name of a toy elephant that belonged to Doug's son.
Hadoop is (1) Apache open source software, (2) a framework for (3) storage and (4) large-scale processing of data sets on (5) clusters of commodity hardware.
(2) The framework is provided by MapReduce: a shared, integrated foundation to which additional tools can be added.
(4) Processing runs on (3) cheap compute and storage. (5) Scalability is the core idea.
A new way to (4) take in and analyze data: schema-on-read, meaning the schema is applied while reading the raw data instead of being fixed before the data is written (schema-on-write). This allows finer granularity and complex analytics on small amounts of data.
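The idea of applying a schema only while reading raw data can be sketched in a few lines of plain Python. This is a minimal illustration, not Hadoop code: the sample log, the field names, and the helper `read_with_schema` are all assumptions made for the example.

```python
import csv
import io

# Raw data is stored exactly as it arrived; no schema is enforced at write time.
RAW_LOG = "2015-01-01,alice,200\n2015-01-02,bob,not_a_number\n2015-01-03,carol,150\n"

# Hypothetical schema chosen by the reader at query time: (field name, parser).
SCHEMA = [("date", str), ("user", str), ("amount", int)]


def read_with_schema(raw_text, schema):
    """Yield one dict per raw line, applying the schema only while reading."""
    for row in csv.reader(io.StringIO(raw_text)):
        record = {}
        for (name, parse), value in zip(schema, row):
            try:
                record[name] = parse(value)
            except ValueError:
                # Malformed values surface at read time, not at load time.
                record[name] = None
        yield record


records = list(read_with_schema(RAW_LOG, SCHEMA))
# Sum only the rows whose "amount" field parsed cleanly.
total = sum(r["amount"] for r in records if r["amount"] is not None)
print(total)
```

A different reader could apply a different schema to the same raw text, which is the flexibility the notes describe.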
Hadoop MapReduce is derived from Google's MapReduce.
HDFS (Hadoop Distributed File System) is derived from the Google File System (GFS).
Framework Basic Modules
Hadoop Common
Base libraries and tools needed by the other modules.
Hadoop Distributed File System
Stores data on commodity machines across the entire cluster.
Written in Java.
1 HDFS cluster = 1 NameNode + a set of DataNodes.
NameNode = primary NameNode, plus a Secondary NameNode that builds snapshots (checkpoints) of the primary's metadata.
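As a rough illustration of how a NameNode can track data that is spread across DataNodes, here is a toy in-memory model. It is a sketch under stated assumptions, not the HDFS implementation: the class `ToyNameNode`, the tiny block size, the replication factor of 2, and the round-robin placement are all invented for the example (real HDFS uses much larger blocks and rack-aware placement).

```python
import itertools

BLOCK_SIZE = 4    # bytes per block in this toy; real HDFS blocks are far larger
REPLICATION = 2   # copies kept of each block (an assumption for the sketch)


class ToyNameNode:
    """Keeps only metadata: which blocks make up a file and where replicas live."""

    def __init__(self, datanode_names):
        # Each "DataNode" is just a dict of block_id -> bytes.
        self.datanodes = {name: {} for name in datanode_names}
        # path -> list of (block_id, [names of nodes holding a replica])
        self.files = {}
        self._next_node = itertools.cycle(datanode_names)

    def put(self, path, data):
        """Split data into blocks and store each block on REPLICATION nodes."""
        blocks = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = f"{path}#{offset // BLOCK_SIZE}"
            chunk = data[offset:offset + BLOCK_SIZE]
            # Round-robin placement (hypothetical policy for this sketch).
            targets = [next(self._next_node) for _ in range(REPLICATION)]
            for node in targets:
                self.datanodes[node][block_id] = chunk
            blocks.append((block_id, targets))
        self.files[path] = blocks

    def get(self, path, failed=()):
        """Reassemble a file, skipping replicas on nodes listed as failed."""
        out = b""
        for block_id, nodes in self.files[path]:
            alive = [n for n in nodes if n not in failed]
            # Raises IndexError if every replica of a block is lost.
            out += self.datanodes[alive[0]][block_id]
        return out


data = b"hello hadoop"
nn = ToyNameNode(["dn1", "dn2", "dn3"])
nn.put("/demo.txt", data)
print(nn.get("/demo.txt", failed={"dn1"}))
```

Because every block has a second replica on another node, the file can still be read when one DataNode is unavailable.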
Hadoop YARN
Manages compute resources in the cluster and uses them to schedule users' applications.
Hadoop MapReduce
Programming model
Scales data processing across many different processes.
Engine:
The JobTracker dispatches tasks to the TaskTrackers in the cluster.
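The MapReduce programming model can be sketched as a single-process word count in plain Python. This only imitates the three phases (map, shuffle, reduce); it does not use the Hadoop API, and the function names are chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map over every line, shuffle by key, then reduce each group."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = run_job(["big data big cluster", "data on a cluster"])
print(counts)
```

In a real cluster the mapper and reducer calls would run as separate tasks on different machines; here they run in one loop to keep the data flow visible.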
Zoo
Bigtable : the basis for HBase; handles massive data tables.
MySQL Gateway : adapted so that the data can be queried.
Sawzall : high-level access to MapReduce in the cluster, used to submit jobs.
Evenflow : chains together complex workloads and coordinates events and services.
Dremel : together with a metadata manager, able to process very large amounts of unstructured data.
Chubby : coordinates all of the above.
Cloudera Stack
Ecosystem
Core components
Sqoop
Transfers bulk data between Hadoop and structured datastores such as relational databases.
CLI tool; imports tables or whole databases into HDFS.
HBase
Based on Google's Bigtable.
Handles massive data tables with billions of rows and millions of columns.
Hive
Data warehouse software that facilitates querying and managing large datasets residing in distributed storage, by projecting structure on top of the data and allowing SQL-like queries.
SQL-like language : HiveQL
Pig
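Scripting language (Pig Latin) for creating MapReduce programs on Hadoop.
It works in both directions with other languages: Pig scripts can call user-defined functions written in languages such as Java or Python, and Pig can be invoked from those languages.
Excels at describing data analysis problems as data flows.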
Oozie
Workflow scheduler system that manages Hadoop jobs.
Supports job scheduling for MapReduce, Pig, Hive, Sqoop, etc.
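A workflow scheduler has to run each job only after the jobs it depends on. The sketch below shows that ordering step with Python's standard `graphlib` module. It is not Oozie's configuration format or API; the action names and dependencies in `WORKFLOW` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it depends on.
WORKFLOW = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_report": {"pig_clean"},
    "mapreduce_index": {"pig_clean"},
    "notify": {"hive_report", "mapreduce_index"},
}


def execution_order(workflow):
    """Return one valid order in which every action runs after its dependencies."""
    return list(TopologicalSorter(workflow).static_order())


order = execution_order(WORKFLOW)
print(order)
```

The two middle actions have no dependency on each other, so a scheduler is free to run them in either order or in parallel.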
ZooKeeper
It provides a distributed configuration service, a synchronization service (so that jobs can be coordinated), and a naming registry for the entire distributed system.
Distributed applications use ZooKeeper to store and mediate updates to important configuration information on the cluster.
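What it means to store and mediate configuration updates can be illustrated with a toy, single-process registry. This is not the ZooKeeper client API and it differs from a real coordination service in many ways; the class `ToyConfigRegistry`, its methods, and the path `/config/replicas` are assumptions made for the example.

```python
class ToyConfigRegistry:
    """In-memory stand-in for a coordination service; not the ZooKeeper API."""

    def __init__(self):
        self._data = {}       # path -> (value, version)
        self._watchers = {}   # path -> list of callbacks

    def watch(self, path, callback):
        """Register a callback to be told about every accepted update to path."""
        self._watchers.setdefault(path, []).append(callback)

    def get(self, path):
        """Return the (value, version) pair stored at path."""
        return self._data[path]

    def set(self, path, value, expected_version=None):
        """Write a value; reject the write if another client updated it first."""
        _, current = self._data.get(path, (None, -1))
        if expected_version is not None and expected_version != current:
            return False  # stale writer: someone else changed the value
        self._data[path] = (value, current + 1)
        for callback in self._watchers.get(path, []):
            callback(path, value)
        return True


registry = ToyConfigRegistry()
seen = []
registry.watch("/config/replicas", lambda path, value: seen.append(value))

registry.set("/config/replicas", 3)                               # creates version 0
accepted = registry.set("/config/replicas", 5, expected_version=0)  # becomes version 1
stale = registry.set("/config/replicas", 7, expected_version=0)     # rejected
print(registry.get("/config/replicas"), seen, accepted, stale)
```

The version check is what "mediating" refers to here: two clients cannot silently overwrite each other, and watchers learn about each accepted change.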
Flume
Collects, aggregates, and moves large amounts of log data.
Other components
Impala
Massively parallel processing SQL query engine
Spark
A scalable data analytics platform that incorporates primitives for in-memory computing.
Written in Scala.
Strong support for machine learning libraries (e.g. MLlib).
URL
https://hadoop.apache.org/