Install ZooKeeper

Requirements

  • Java 8

cd /opt

wget http://mirrors.hust.edu.cn/apache/zookeeper/stable/zookeeper-3.4.7.tar.gz

tar -zvxf zookeeper-3.4.7.tar.gz

mv zookeeper-3.4.7 zookeeper

cd zookeeper

cd conf

cp zoo_sample.cfg zoo.cfg

vi zoo.cfg

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/hadoop/zookeeper/export
clientPort=2181
#server.1=zoo1:2888:3888
#server.2=zoo2:2888:3888
#server.3=zoo3:2888:3888
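
Before starting ZooKeeper it can help to confirm that zoo.cfg says what you expect. The following is a minimal Python sketch, not part of ZooKeeper: the function name parse_zoo_cfg and the SAMPLE text are illustrative assumptions. It splits the file into plain settings and server.N entries.

```python
def parse_zoo_cfg(text):
    """Split zoo.cfg-style text into (settings, servers).

    settings maps keys such as 'clientPort' to string values.
    servers maps the numeric id in 'server.N' to its 'host:port:port' value.
    Blank lines and lines starting with '#' are ignored.
    """
    settings, servers = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith("server."):
            servers[int(key.split(".", 1)[1])] = value
        else:
            settings[key] = value
    return settings, servers


# Sample text mirroring the zoo.cfg shown above (hypothetical inline copy).
SAMPLE = """\
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/hadoop/zookeeper/export
clientPort=2181
#server.1=zoo1:2888:3888
"""

if __name__ == "__main__":
    settings, servers = parse_zoo_cfg(SAMPLE)
    print(settings["dataDir"], settings["clientPort"], servers)
```

With the server lines commented out, as above, the servers mapping is empty, which corresponds to a standalone ZooKeeper.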

Install Kafka

cd /opt

wget http://mirrors.hust.edu.cn/apache/kafka/0.9.0.0/kafka_2.11-0.9.0.0.tgz

tar -zvxf kafka_2.11-0.9.0.0.tgz

mv kafka_2.11-0.9.0.0 kafka

Install Hadoop

cd /opt

wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz

tar -zvxf hadoop-2.7.1.tar.gz

mv hadoop-2.7.1 hadoop

cd hadoop

vi etc/hadoop/hdfs-site.xml

        <property>
                <name>dfs.datanode.max.transfer.threads</name>
                <value>4096</value>
        </property>
        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>
        <property>
                <name>dfs.name.dir</name>
                <value>file:///opt/hadoop/hadoopinfra/hdfs/namenode</value>
        </property>
        <property>
                <name>dfs.data.dir</name>
                <value>file:///opt/hadoop/hadoopinfra/hdfs/datanode</value>
        </property>

vi etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr

vi etc/hadoop/core-site.xml

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://master:9000</value>
   </property>
</configuration>
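
To double-check what a Hadoop site file actually contains after editing, a small Python sketch like the one below can read the property list into a dictionary. It uses only the standard library. The function name read_hadoop_site and the inline CORE_SITE string are assumptions made for illustration; in practice you would read the file from etc/hadoop/.

```python
import xml.etree.ElementTree as ET


def read_hadoop_site(xml_text):
    """Return {name: value} for every <property> in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# Inline copy of the core-site.xml shown above (hypothetical; normally read from disk).
CORE_SITE = """
<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://master:9000</value>
   </property>
</configuration>
"""

if __name__ == "__main__":
    print(read_hadoop_site(CORE_SITE))
```

The same function works for hdfs-site.xml, yarn-site.xml and mapred-site.xml, provided the properties are wrapped in a single configuration element.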

For a single-node setup, point fs.default.name at localhost instead:

   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>

vi etc/hadoop/yarn-site.xml

        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>

cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml

vi etc/hadoop/mapred-site.xml

        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>

Format the NameNode before first use. Formatting erases any existing HDFS data.

hdfs namenode -format

Install HBase

cd /opt

wget http://mirrors.hust.edu.cn/apache/hbase/stable/hbase-1.1.2-bin.tar.gz

tar -zvxf hbase-1.1.2-bin.tar.gz

mv hbase-1.1.2 hbase

cd hbase

vi conf/hbase-env.sh

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
export HBASE_MANAGES_ZK=false

# Configure PermSize. Only needed in JDK7. You can safely remove it for JDK8+
#export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"
#export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"

vi conf/hbase-site.xml

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>master,data2,data3</value>
    <description>Comma-separated list of hosts in the ZooKeeper quorum.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hadoop/zookeeper/export</value>
    <description>Property from ZooKeeper config zoo.cfg.
    The directory where the snapshot is stored.
    </description>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:9000/home/hadoop/hbase</value>
    <description>The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
</configuration>

vi conf/regionservers

data2
data3

Environment variables

vi ~/.bashrc

If you installed JDK 1.8 with yum, do not set export JAVA_HOME=/opt/jdk1.8.0_40

#HADOOP VARIABLES START
#export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export JAVA_HOME=/opt/jdk1.8.0_40
export HADOOP_INSTALL=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"
export HADOOP_HOME=$HADOOP_INSTALL
#HADOOP VARIABLES END

#HBASE VARIABLES
export HBASE_HOME=/opt/hbase
export HBASE_CONF=$HBASE_HOME/conf
export CLASSPATH=$CLASSPATH:$HBASE_HOME/lib/*
#HBASE VARIABLES END

export PATH=$PATH:$HBASE_HOME/bin

export CQLSH_HOST=127.0.0.1
export CQLSH_PORT=9042

source ~/.bashrc

Additional installs

yum install thrift

yum install snappy-devel

pip install sqlalchemy

pip install zmq

pip install pyzmq

Install Distributed Frontera from GIT - Recommended method

cd /opt

git clone https://github.com/scrapinghub/distributed-frontera.git

pip install /opt/distributed-frontera

Install Distributed Frontera with PIP

pip install distributed-frontera

pip install hbase-thrift

pip install PyHBase

Required Python packages:

happybase, kafka-python, msgpack-python, python-snappy, frontera, thrift
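
To see which of these packages are still missing, a short Python check can try to import each one. This is a sketch under assumptions: the import names in REQUIRED_MODULES are the usual module names for these pip packages, and the script must be run with the same interpreter that will run the Frontera workers.

```python
import importlib

# pip package name -> import name (assumed mapping for the packages listed above).
REQUIRED_MODULES = {
    "happybase": "happybase",
    "kafka-python": "kafka",
    "msgpack-python": "msgpack",
    "python-snappy": "snappy",
    "frontera": "frontera",
    "thrift": "thrift",
}


def missing_modules(module_names):
    """Return the module names that fail to import in the current interpreter."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(sorted(REQUIRED_MODULES.values()))
    print("missing: " + ", ".join(missing) if missing else "all modules importable")
```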

firewall tweaking

sudo firewall-cmd --zone=public --add-port=2181/tcp --permanent

sudo firewall-cmd --zone=public --add-port=60000/tcp --permanent

sudo firewall-cmd --zone=public --add-port=9000/tcp --permanent

sudo firewall-cmd --reload
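
Once the services are running, you can confirm that the opened ports accept connections. The sketch below is a plain TCP connect test using the standard library. The function name port_is_open is an illustrative assumption, and the host defaults to 127.0.0.1; to exercise the firewall rules rather than just the listeners, run it from another machine and pass the server's host name as the first argument.

```python
import socket
import sys


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.error:
        return False
    sock.close()
    return True


# Ports opened by the firewall rules above.
PORTS = [2181, 60000, 9000]

if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    for port in PORTS:
        print("%s:%d %s" % (host, port, "open" if port_is_open(host, port) else "closed"))
```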

start and stop services

hadoop

/opt/hadoop/sbin/start-dfs.sh

/opt/hadoop/sbin/start-yarn.sh

/opt/hadoop/sbin/stop-dfs.sh

/opt/hadoop/sbin/stop-yarn.sh

zookeeper

/opt/zookeeper/bin/zkServer.sh start

/opt/zookeeper/bin/zkServer.sh stop

view zookeeper

/opt/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181

hbase

/opt/hbase/bin/hbase-daemon.sh start master

/opt/hbase/bin/hbase-daemon.sh start regionserver

/opt/hbase/bin/hbase-daemon.sh stop master

/opt/hbase/bin/hbase-daemon.sh stop regionserver

thrift for hbase

hbase thrift start

hbase thrift -p 7777 start

kafka

/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties

Verify services are running

jps

The following processes must be running:

25571 HMaster
25764 HRegionServer
26420 Main
25110 DataNode
26519 Jps
24968 NameNode
14988 QuorumPeerMain
25310 SecondaryNameNode
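
Comparing the jps listing by eye gets tedious, so here is a small Python sketch that reports which expected daemons are absent. The REQUIRED list is taken from the listing above, leaving out Jps itself and the generic Main entry; that selection and the function name missing_daemons are assumptions. On a multi-node cluster not every daemon runs on every host, so adjust the list per machine.

```python
import subprocess

# Daemon names from the jps listing above (Jps and Main deliberately left out).
REQUIRED = ["HMaster", "HRegionServer", "DataNode", "NameNode",
            "QuorumPeerMain", "SecondaryNameNode"]


def missing_daemons(jps_output, required=REQUIRED):
    """Return the required process names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # each useful line looks like "<pid> <name>"
            running.add(parts[1])
    return [name for name in required if name not in running]


if __name__ == "__main__":
    try:
        output = subprocess.check_output(["jps"]).decode()
    except (OSError, subprocess.CalledProcessError) as exc:
        print("could not run jps: %s" % exc)
    else:
        missing = missing_daemons(output)
        print("missing: " + ", ".join(missing) if missing else "all daemons running")
```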

sample 

https://github.com/scrapinghub/distributed-frontera/blob/master/docs/source/topics/quickstart.rst

/opt/hbase/bin/hbase shell

create_namespace 'crawler'

quit

cd /var/www/html

git clone https://github.com/sibiryakov/general-spider.git

cd general-spider

vi frontier/workersettings.py

=== replace content ===
# -*- coding: utf-8 -*-
from frontera.settings.default_settings import *
#from distributed_frontera.settings.default_settings import MIDDLEWARES
from distributed_frontera.settings import default_settings

MAX_REQUESTS = 0
MAX_NEXT_REQUESTS = 128     # Size of batch to generate per partition, should be consistent with
                            # CONCURRENT_REQUESTS in spider. General recommendation is 5-7x CONCURRENT_REQUESTS
CONSUMER_BATCH_SIZE = 512   # Batch size for updates to backend storage
NEW_BATCH_DELAY = 30.0      # Makes the spider wait for the specified time after getting an empty response
                            # from the backend

#--------------------------------------------------------
# Url storage
#--------------------------------------------------------
BACKEND = 'distributed_frontera.backends.hbase.HBaseBackend'
HBASE_DROP_ALL_TABLES = False
HBASE_THRIFT_PORT = 9090
HBASE_THRIFT_HOST = 'localhost'
HBASE_QUEUE_PARTITIONS = 2  # Count of spider instances
STORE_CONTENT = True

MIDDLEWARES.extend([
    'frontera.contrib.middlewares.domain.DomainMiddleware',
    'frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware'
])

KAFKA_LOCATION = 'localhost:9092'
FRONTIER_GROUP = 'scrapy-crawler'
INCOMING_TOPIC = 'frontier-done'    # Topic where spiders send fetching results
OUTGOING_TOPIC = 'frontier-todo'    # Topic where requests to be downloaded are written
SCORING_GROUP = 'scrapy-scoring'
SCORING_TOPIC = 'frontier-score'    # Scores are written to this topic by the strategy worker and read by the
                                    # storage worker.

#--------------------------------------------------------
# Logging
#--------------------------------------------------------
LOGGING_EVENTS_ENABLED = False
LOGGING_MANAGER_ENABLED = True
LOGGING_BACKEND_ENABLED = True
LOGGING_DEBUGGING_ENABLED = False

vi frontier/spider_settings.py

=== replace content ===

# -*- coding: utf-8 -*-
from frontera.settings.default_settings import *
#from distributed_frontera.settings.default_settings import MIDDLEWARES
from distributed_frontera.settings import default_settings

SPIDER_PARTITION_ID = 0                 # Partition ID assigned
MAX_NEXT_REQUESTS = 256                 # Should be consistent with MAX_NEXT_REQUESTS set for Frontera worker
DELAY_ON_EMPTY = 5.0

MIDDLEWARES.extend([
    'frontera.contrib.middlewares.domain.DomainMiddleware',
    'frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware'
])

#--------------------------------------------------------
# Crawl frontier backend
#--------------------------------------------------------
BACKEND = 'distributed_frontera.backends.remote.KafkaOverusedBackend'
KAFKA_LOCATION = 'localhost:9092'       # Your Kafka service location
HBASE_NAMESPACE = 'crawler'

#--------------------------------------------------------
# Logging
#--------------------------------------------------------
LOGGING_ENABLED = True
LOGGING_EVENTS_ENABLED = False
LOGGING_MANAGER_ENABLED = False
LOGGING_BACKEND_ENABLED = False
LOGGING_DEBUGGING_ENABLED = False
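
The settings files end here. The following is a separate helper, not part of either file: a Python sketch for checking that the partition IDs given to the spiders match HBASE_QUEUE_PARTITIONS in the worker settings. It assumes that partition IDs are zero-based and that each partition should be served by exactly one spider; the function name check_partitions is hypothetical.

```python
def check_partitions(queue_partitions, spider_partition_ids):
    """Return a list of problems; an empty list means every ID in
    0..queue_partitions-1 is assigned to exactly one spider."""
    problems = []
    expected = set(range(queue_partitions))
    seen = set()
    for pid in spider_partition_ids:
        if pid not in expected:
            problems.append("partition id %d is outside 0..%d" % (pid, queue_partitions - 1))
        elif pid in seen:
            problems.append("partition id %d is used by more than one spider" % pid)
        seen.add(pid)
    for pid in sorted(expected - seen):
        problems.append("no spider is assigned partition id %d" % pid)
    return problems


if __name__ == "__main__":
    # HBASE_QUEUE_PARTITIONS = 2, with two spiders configured as partitions 0 and 1.
    print(check_partitions(2, [0, 1]))
```

For example, with HBASE_QUEUE_PARTITIONS = 2 and both spiders left at SPIDER_PARTITION_ID = 0, the check reports a duplicate for partition 0 and no spider for partition 1.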

Open a new terminal and start the ZeroMQ broker

cd /var/www/html/general-spider

python -m distributed_frontera.messagebus.zeromq.broker

Open a new terminal and start the DB worker

cd /var/www/html/general-spider

python -m distributed_frontera.worker.main --config frontier.workersettings

Open a new terminal and start the strategy worker

cd /var/www/html/general-spider

python -m distributed_frontera.worker.score --config frontier.strategy0 --strategy distributed_frontera.worker.strategy.bfs

Open a new terminal and start the spiders

cd /var/www/html/general-spider

scrapy crawl general -L INFO -s FRONTERA_SETTINGS=frontier.spider0 -s SEEDS_SOURCE=seeds_es_dmoz.txt

scrapy crawl general -L INFO -s FRONTERA_SETTINGS=frontier.spider1

Note:

Start order: Hadoop -> ZooKeeper -> HBase

Shutdown order: HBase -> ZooKeeper -> Hadoop
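
The ordering can be captured in a small Python sketch so that stopping is always the exact reverse of starting. The commands are the ones listed in the start and stop section of this guide; the STEPS table and the function names plan and run are illustrative assumptions, and no error recovery is attempted.

```python
import subprocess

# (name, start command, stop command), listed in start order:
# Hadoop first, then ZooKeeper, then HBase.
STEPS = [
    ("hdfs", ["/opt/hadoop/sbin/start-dfs.sh"], ["/opt/hadoop/sbin/stop-dfs.sh"]),
    ("yarn", ["/opt/hadoop/sbin/start-yarn.sh"], ["/opt/hadoop/sbin/stop-yarn.sh"]),
    ("zookeeper", ["/opt/zookeeper/bin/zkServer.sh", "start"],
     ["/opt/zookeeper/bin/zkServer.sh", "stop"]),
    ("hbase-master", ["/opt/hbase/bin/hbase-daemon.sh", "start", "master"],
     ["/opt/hbase/bin/hbase-daemon.sh", "stop", "master"]),
    ("hbase-regionserver", ["/opt/hbase/bin/hbase-daemon.sh", "start", "regionserver"],
     ["/opt/hbase/bin/hbase-daemon.sh", "stop", "regionserver"]),
]


def plan(action):
    """Return the commands for 'start' in list order, or for 'stop' in reverse order."""
    if action == "start":
        return [start for _name, start, _stop in STEPS]
    if action == "stop":
        return [stop for _name, _start, stop in reversed(STEPS)]
    raise ValueError("action must be 'start' or 'stop'")


def run(action):
    """Run each planned command in turn and abort on the first failure."""
    for command in plan(action):
        print("running: " + " ".join(command))
        subprocess.check_call(command)
```

Calling run("start") brings the stack up in order, and run("stop") shuts it down starting with the HBase region server and ending with HDFS.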
