
SparkStreaming Project in Practice (Part 1)

1. Project Introduction

This project integrates the components of the Hadoop ecosystem.

2. Environment Preparation

2.1 Provisioning Machines

Three Alibaba Cloud ECS instances are used here (2-4 GB of RAM each is recommended); local virtual machines work just as well:

Set the hostnames

  • hadoop000
  • hadoop001
  • hadoop002
Code
# vi /etc/hostname

Configure the IP mappings

Do the following on every machine:

Code
# vi /etc/hosts

Add the mapping entries according to each machine's IP address:
172.18.150.195 hadoop000
172.18.74.23 hadoop001
172.18.128.181 hadoop002

Tutorial on enabling internal network connectivity between Alibaba Cloud instances: https://blog.csdn.net/weixin_42167895/article/details/106394009
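
To confirm the mapping works, a quick check such as the following can be run on each machine (assuming ICMP is allowed by the security group):

Code
ping -c 3 hadoop000
ping -c 3 hadoop001
ping -c 3 hadoop002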

2.2 Creating a User

Create a hadoop user on each Linux machine and grant it sudo privileges. Login password: hadoop

Code
# adduser hadoop
# passwd hadoop

Grant sudo privileges

  1. Switch to the root user

  2. Make the sudoers file writable

    Code
    # chmod u+w /etc/sudoers
  3. Edit the /etc/sudoers file

    Code
    # vi /etc/sudoers

    Below the existing root entry, add the hadoop line:

    Code
    ## Allow root to run any commands anywhere
    root ALL=(ALL) ALL
    hadoop ALL=(ALL) ALL
  4. Revoke write permission on the sudoers file

    Code
    # chmod u-w /etc/sudoers

2.3 Creating Directories

In the hadoop user's home directory, create the following directories:

  • app: installation directory for software
  • data: test data
  • lib: jars developed for the project
  • software: software installation packages
  • source: framework source code
Code
mkdir app data lib software source

2.4 Software Versions

CDH release to component version mapping: https://blog.csdn.net/weixin_42286868/article/details/104817644

Versions used in this project:

apache-flume-1.9.0-bin.tar.gz

apache-maven-3.6.3-bin.tar.gz

hadoop-3.1.2-centos7.6-x64.tar.gz

hbase-2.2.4-bin.tar.gz

jdk-8u251-linux-x64.tar.gz

kafka_2.12-2.4.1.tgz

scala-2.12.11.tgz (Spark 2.4.5 can be built for Scala 2.12; note, however, that the Scala install in Section 13 below uses scala-2.11.12, which matches the Scala 2.11 build of the spark-2.4.5-bin-hadoop2.7 package)

spark-2.4.5-bin-hadoop2.7.tgz (note: Spark is meant to be built from source here, i.e. "Choose a package type: Source Code"; since my build failed, the prebuilt binary is used this time)

zookeeper-3.4.14.tar.gz

2.5 Cluster Layout

Software  | hadoop000                    | hadoop001                    | hadoop002
HDFS      | NameNode, DataNode           | NameNode, DataNode           | DataNode
YARN      | ResourceManager, NodeManager | ResourceManager, NodeManager | NodeManager
ZooKeeper | ZooKeeper                    | ZooKeeper                    | ZooKeeper
Kafka     | Kafka                        | Kafka                        | Kafka
HBase     | RegionServer                 | RegionServer                 | RegionServer, Master
Flume     | Flume                        | Flume                        | Flume
Spark     | Spark                        | Spark                        | Spark

3. SSH Configuration

  1. Generate a key pair on each of the three machines

    Code
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  2. Append each machine's own public key and set the permissions

    Code
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 644 ~/.ssh/authorized_keys
  3. Distribute the public key to the other two machines

    Code
    # on hadoop000
    scp ~/.ssh/id_rsa.pub hadoop@hadoop001:/home/hadoop/.ssh/id_rsa_hadoop000.pub
    scp ~/.ssh/id_rsa.pub hadoop@hadoop002:/home/hadoop/.ssh/id_rsa_hadoop000.pub

    # on hadoop001
    scp ~/.ssh/id_rsa.pub hadoop@hadoop000:/home/hadoop/.ssh/id_rsa_hadoop001.pub
    scp ~/.ssh/id_rsa.pub hadoop@hadoop002:/home/hadoop/.ssh/id_rsa_hadoop001.pub

    # on hadoop002
    scp ~/.ssh/id_rsa.pub hadoop@hadoop001:/home/hadoop/.ssh/id_rsa_hadoop002.pub
    scp ~/.ssh/id_rsa.pub hadoop@hadoop000:/home/hadoop/.ssh/id_rsa_hadoop002.pub
  4. Append the other machines' public keys

    Code
    # on hadoop000
    cat ~/.ssh/id_rsa_hadoop001.pub >> ~/.ssh/authorized_keys
    cat ~/.ssh/id_rsa_hadoop002.pub >> ~/.ssh/authorized_keys

    # on hadoop001
    cat ~/.ssh/id_rsa_hadoop000.pub >> ~/.ssh/authorized_keys
    cat ~/.ssh/id_rsa_hadoop002.pub >> ~/.ssh/authorized_keys

    # on hadoop002
    cat ~/.ssh/id_rsa_hadoop001.pub >> ~/.ssh/authorized_keys
    cat ~/.ssh/id_rsa_hadoop000.pub >> ~/.ssh/authorized_keys
  5. Verify that passwordless SSH works in every direction (a quick check is shown below)
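
    A minimal loop for this check, to be run from each of the three machines in turn:

    Code
    for h in hadoop000 hadoop001 hadoop002; do ssh $h hostname; done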

4. Convenience Setup

4.1 Script to Run a Command on All Hosts

  1. Create a script file on hadoop000

    Code
    sudo touch /usr/local/bin/xcall.sh
    # mark the script as executable
    sudo chmod a+x /usr/local/bin/xcall.sh
    # edit the contents
    sudo vi /usr/local/bin/xcall.sh
  2. Add the following to xcall.sh (it runs the given command on every host over SSH):

    shell
    #!/bin/bash

    params=$@
    for (( i = 0 ; i <= 2 ; i = $i + 1 )) ; do
        echo ============= hadoop00$i ==============
        ssh hadoop00$i "$params"
    done
  3. Run the script

    Code
    [hadoop@hadoop000 ~]$ xcall.sh hostname
    ============= hadoop000 ==============
    hadoop000
    ============= hadoop001 ==============
    hadoop001
    ============= hadoop002 ==============
    hadoop002

4.2 Script to Copy Files to All Hosts

  1. Create a script file on hadoop000

    Code
    sudo touch /usr/local/bin/xscp.sh
    # mark the script as executable
    sudo chmod a+x /usr/local/bin/xscp.sh
    # edit the contents
    sudo vi /usr/local/bin/xscp.sh
  2. Add the following to xscp.sh (it copies the given file or directory to the other hosts over scp):

    shell
    #!/bin/bash

    if [[ $# -lt 1 ]] ; then echo no params ; exit ; fi

    p=$1
    dir=`dirname $p`
    filename=`basename $p`
    cd $dir
    fullpath=`pwd -P .`

    user=`whoami`
    for (( i = 1 ; i <= 2 ; i = $i + 1 )) ; do
        echo ============= hadoop00$i ==============
        # copy by file name, since we have already cd'ed into its directory
        scp -r $filename ${user}@hadoop00$i:$fullpath
    done
  3. Run the script

    Code
    [hadoop@hadoop000 ~]$ mkdir tmp
    [hadoop@hadoop000 ~]$ echo test >> tmp/xscp.txt
    [hadoop@hadoop000 ~]$ xscp.sh tmp
    ============= hadoop000 ==============
    xscp.txt 100% 5 23.6KB/s 00:00
    ============= hadoop001 ==============
    xscp.txt 100% 5 4.0KB/s 00:00
    ============= hadoop002 ==============
    xscp.txt

4.3 Showing the Full Path in the Shell Prompt

Code
[hadoop@hadoop000 software]$ vi ~/.bash_profile 
[hadoop@hadoop000 software]$ source ~/.bash_profile
[hadoop@hadoop000 /home/hadoop/software]$
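
The line added to ~/.bash_profile is not shown in the original; a minimal PS1 setting that produces a prompt like the one above could look as follows (an assumption, adjust as needed):

Code
# show user@host and the full working directory in the prompt
export PS1='[\u@\h $PWD]\$'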

5. JDK Installation

  1. Download

  2. Extract to ~/app

    Code
    tar -zxvf ~/software/jdk-8u251-linux-x64.tar.gz -C ~/app/
  3. Add Java to the environment variables in ~/.bash_profile

    Code
    # JDK
    export JAVA_HOME=/home/hadoop/app/jdk1.8.0_251
    export PATH=$JAVA_HOME/bin:$PATH
  4. Apply the configuration: source ~/.bash_profile

  5. Distribute

    Code
    [hadoop@hadoop000 /home/hadoop/app]$xscp.sh jdk1.8.0_251
    [hadoop@hadoop000 /home/hadoop/app]$xscp.sh ~/.bash_profile
  6. Verify the installation

    Code
    java -version

6. Fully Distributed Hadoop Setup

Do the following on hadoop000:

  1. Download

  2. Extract

    Code
    [hadoop@hadoop000 /home/hadoop/software]$tar -zxvf hadoop-3.1.2-centos7.6-x64.tar.gz -C ~/app/
  3. Configure environment variables: vi ~/.bash_profile

    Code
    # Hadoop
    export HADOOP_HOME=/home/hadoop/app/hadoop-3.1.2
    export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
  4. Apply the configuration: source ~/.bash_profile

  5. Edit the configuration files

    • /etc/hadoop/hadoop-env.sh

      Code
      export JAVA_HOME=/home/hadoop/app/jdk1.8.0_251
    • /etc/hadoop/hdfs-site.xml

      xml
      <configuration>
        <!-- Number of HDFS replicas; do not exceed the number of nodes. -->
        <property>
          <name>dfs.replication</name>
          <value>3</value>
        </property>
      </configuration>
    • /etc/hadoop/core-site.xml

      xml
      <configuration>
        <property>
          <name>fs.defaultFS</name>
          <value>hdfs://hadoop000/</value>
        </property>
        <!-- Hadoop's temporary working directory for data; defaults to /tmp/hadoop-${user.name} -->
        <property>
          <name>hadoop.tmp.dir</name>
          <value>/home/hadoop/app/tmp/hadoop</value>
        </property>
      </configuration>
    • /etc/hadoop/mapred-site.xml

      xml
      <configuration>
        <!-- MapReduce runtime framework: run on YARN (the default, local, runs jobs locally) -->
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>
      </configuration>
    • /etc/hadoop/yarn-site.xml

      xml
      <configuration>
        <!-- Hostname of the YARN ResourceManager -->
        <property>
          <name>yarn.resourcemanager.hostname</name>
          <value>hadoop000</value>
        </property>
        <!-- How reducers fetch data (shuffle service) -->
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
      </configuration>
    • /etc/hadoop/workers

      Code
      hadoop000
      hadoop001
      hadoop002
  6. Distribute

    Code
    [hadoop@hadoop000 /home/hadoop/app]$xscp.sh hadoop-3.1.2
    $ xscp.sh ~/.bash_profile
  7. Format the file system (on the NameNode, i.e. hadoop000)

    Code
    $ hadoop namenode -format
  8. Start the Hadoop cluster (a smoke test follows the process listing below)

    Code
    $ start-dfs.sh
    $ start-yarn.sh

    [hadoop@hadoop000 /home/hadoop]$xcall.sh ~/app/jdk1.8.0_251/bin/jps
    ============= hadoop000 ==============
    20068 Jps
    19591 ResourceManager
    19031 NameNode
    19367 SecondaryNameNode
    19721 NodeManager
    19183 DataNode
    ============= hadoop001 ==============
    9156 DataNode
    9268 NodeManager
    9388 Jps
    ============= hadoop002 ==============
    8882 Jps
    8649 DataNode
    8761 NodeManager
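
A quick HDFS smoke test once the daemons are up (the paths used here are arbitrary examples):

Code
hdfs dfs -mkdir -p /test
hdfs dfs -put ~/.bash_profile /test/
hdfs dfs -ls /test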

7. ZooKeeper Installation

Do the following on hadoop000:

  1. Download

  2. Extract

    Code
    tar -zxvf  ~/software/zookeeper-3.4.14.tar.gz -C ~/app/
  3. Configure environment variables: ~/.bash_profile

    Code
    # ZK
    export ZK_HOME=/home/hadoop/app/zookeeper-3.4.14
    export PATH=$ZK_HOME/bin:$PATH
  4. Apply the configuration: source ~/.bash_profile

  5. Edit the configuration files

    • /conf/zoo.cfg

      Code
      [hadoop@hadoop000 /home/hadoop/app/zookeeper-3.4.14/conf]$cp zoo_sample.cfg zoo.cfg
      [hadoop@hadoop000 /home/hadoop/app/zookeeper-3.4.14/conf]$vi zoo.cfg

      # change this property:
      dataDir=/home/hadoop/app/tmp/zookeeper

      # append the following
      # server.n=host:port1:port2, where n must equal the value in that server's myid file
      # port1: the port followers use to connect to the leader
      # port2: the port used for leader election
      server.1=hadoop000:2888:3888
      server.2=hadoop001:2888:3888
      server.3=hadoop002:2888:3888

      # then create the data directory (on every machine):
      $mkdir -p /home/hadoop/app/tmp/zookeeper
    • /bin/zkEnv.sh

      Code
      # change the directories in the following two places so log files go under the installation directory
      if [ "x${ZOO_LOG_DIR}" = "x" ]
      then
      ZOO_LOG_DIR="${ZOOKEEPER_PREFIX}/logs"
      fi

      if [ "x${ZOO_LOG4J_PROP}" = "x" ]
      then
      ZOO_LOG4J_PROP="INFO,ROLLINGFILE"
      fi
    • /conf/log4j.properties

      Code
      # change the following settings
      zookeeper.root.logger=INFO, ROLLINGFILE
      zookeeper.log.dir=/home/hadoop/app/zookeeper-3.4.14/logs
      zookeeper.tracelog.dir=/home/hadoop/app/zookeeper-3.4.14/logs

      log4j.appender.ROLLINGFILE=org.apache.log4j.DailyRollingFileAppender
      #log4j.appender.ROLLINGFILE.MaxFileSize=10MB (comment this line out)
  6. Distribute

    Code
    [hadoop@hadoop000 /home/hadoop/app]$xscp.sh zookeeper-3.4.14
    [hadoop@hadoop000 /home/hadoop/app]$xscp.sh ~/.bash_profile
  7. On each host, write a myid file into the ZooKeeper dataDir

    Code
    [hadoop@hadoop000 /home/hadoop]$echo 1 > /home/hadoop/app/tmp/zookeeper/myid
    [hadoop@hadoop001 /home/hadoop]$echo 2 > /home/hadoop/app/tmp/zookeeper/myid
    [hadoop@hadoop002 /home/hadoop]$echo 3 > /home/hadoop/app/tmp/zookeeper/myid
  8. Start the service on every machine

    Code
    $zkServer.sh start
  9. Check the processes (a role check follows below)

    Code
    $xcall.sh ~/app/jdk1.8.0_251/bin/jps
    ============= hadoop000 ==============
    5458 QuorumPeerMain
    6405 Jps
    ============= hadoop001 ==============
    5156 QuorumPeerMain
    5944 Jps
    ============= hadoop002 ==============
    5809 Jps
    5012 QuorumPeerMain
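
Each node's role can also be checked directly; one node should report leader and the others follower:

Code
zkServer.sh status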

8. Hadoop HA Configuration

8.1 HDFS HA with Automatic Failover

  1. Edit the configuration files

    • /etc/hadoop/hdfs-site.xml

      xml
      <configuration>
      <!-- Add the following -->

      <!-- Name of the cluster service -->
      <property>
        <name>dfs.nameservices</name>
        <value>mycluster</value>
      </property>

      <!-- The two NameNode IDs under mycluster (only two); with HA there is no Secondary NameNode -->
      <property>
        <name>dfs.ha.namenodes.mycluster</name>
        <value>nn1,nn2</value>
      </property>
      <!-- RPC address of each NameNode -->
      <property>
        <name>dfs.namenode.rpc-address.mycluster.nn1</name>
        <value>hadoop000:8020</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.mycluster.nn2</name>
        <value>hadoop001:8020</value>
      </property>
      <!-- Web UI address of each NameNode -->
      <property>
        <name>dfs.namenode.http-address.mycluster.nn1</name>
        <value>hadoop000:9870</value>
      </property>
      <property>
        <name>dfs.namenode.http-address.mycluster.nn2</name>
        <value>hadoop001:9870</value>
      </property>

      <!-- Shared edits directory of the NameNodes, i.e. the JournalNodes (running on the DataNodes) -->
      <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://hadoop000:8485;hadoop001:8485;hadoop002:8485/mycluster</value>
      </property>
      <!-- Local path where the JournalNodes store the edits -->
      <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/home/hadoop/app/tmp/hadoop/journal</value>
      </property>

      <!-- Java class the client uses to determine which NameNode is active -->
      <property>
        <name>dfs.client.failover.proxy.provider.mycluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>

      <!-- Script list or Java class used to fence the active NameNode during failover -->
      <property>
        <name>dfs.ha.fencing.methods</name>
        <value>
          sshfence
          shell(/bin/true)
        </value>
      </property>
      <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/home/hadoop/.ssh/id_rsa</value>
      </property>

      <!-- Enable automatic failover -->
      <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
      </property>
      </configuration>
    • /etc/hadoop/core-site.xml

      xml
      <configuration>
      <!-- * change to the cluster name -->
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://mycluster</value>
      </property>

      <!-- * specify the ZooKeeper connection addresses -->
      <property>
        <name>ha.zookeeper.quorum</name>
        <value>hadoop000:2181,hadoop001:2181,hadoop002:2181</value>
      </property>
      </configuration>
  2. Distribute to all machines

    Code
    [hadoop@hadoop000 /home/hadoop/app/hadoop-3.1.2/etc]$xscp.sh hadoop
  3. Migrate the data

    • First stop all Hadoop processes, then start the JournalNode process on every machine

      Code
      $ hadoop-daemon.sh start journalnode
    • After the JournalNodes are up, synchronize the on-disk metadata between the two NameNodes

      • On hadoop000, copy the metadata to hadoop001

        Code
        scp -r /home/hadoop/app/tmp/hadoop/dfs hadoop@hadoop001:/home/hadoop/app/tmp/hadoop/
      • On the new (not yet formatted) NameNode, hadoop001 here, run the following command to bootstrap it into standby state.

        Code
        # hadoop000's NameNode must be running; when asked whether to format, answer N
        [hadoop@hadoop000 /home/hadoop]$hadoop-daemon.sh start namenode
        [hadoop@hadoop001 /home/hadoop]$hdfs namenode -bootstrapStandby
      • On one of the NameNodes, run the following command to push the edit log to the JournalNodes

        Code
        [hadoop@hadoop001 /home/hadoop]$hdfs namenode -initializeSharedEdits
  4. Stop all Hadoop processes, then log in to one of the NameNodes and initialize the HA state in ZooKeeper

    Code
    $ hdfs zkfc -formatZK
  5. Start the DFS processes (a state check follows the process listing below)

    Code
    $ start-dfs.sh

    [hadoop@hadoop000 /home/hadoop/app/tmp/hadoop/dfs]$xcall.sh ~/app/jdk1.8.0_251/bin/jps
    ============= hadoop000 ==============
    20128 QuorumPeerMain
    31971 DFSZKFailoverController
    31531 DataNode
    31404 NameNode
    31757 JournalNode
    32029 Jps
    ============= hadoop001 ==============
    14689 DataNode
    14599 NameNode
    14793 JournalNode
    14924 DFSZKFailoverController
    14959 Jps
    9439 QuorumPeerMain
    ============= hadoop002 ==============
    11890 DataNode
    8935 QuorumPeerMain
    11994 JournalNode
    12047 Jps
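
The active/standby state of the two NameNodes can be checked with (nn1/nn2 are the IDs configured above):

Code
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2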

8.2 ResourceManager Automatic Failover

  1. Configuration file

    • /etc/hadoop/yarn-site.xml

      xml
      <!-- Add the following -->

      <!-- Enable YARN HA -->
      <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
      </property>
      <!-- Cluster ID -->
      <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>cluster1</value>
      </property>
      <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2</value>
      </property>
      <!-- RM node hostnames -->
      <property>
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>hadoop000</value>
      </property>
      <property>
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>hadoop001</value>
      </property>
      <!-- RM web UI addresses -->
      <property>
        <name>yarn.resourcemanager.webapp.address.rm1</name>
        <value>hadoop000:8088</value>
      </property>
      <property>
        <name>yarn.resourcemanager.webapp.address.rm2</name>
        <value>hadoop001:8088</value>
      </property>
      <!-- ZooKeeper cluster -->
      <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>hadoop000:2181,hadoop001:2181,hadoop002:2181</value>
      </property>
  2. Distribute to all machines

    Code
    $xscp.sh yarn-site.xml
  3. Start YARN (a sketch follows below)
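
    A minimal sketch of starting YARN and checking which ResourceManager is active (rm1/rm2 are the IDs configured above):

    Code
    $ start-yarn.sh
    $ yarn rmadmin -getServiceState rm1
    $ yarn rmadmin -getServiceState rm2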

9. HBase Installation

Do the following on hadoop000:

  1. Download

  2. Extract

    Code
    [hadoop@hadoop000 /home/hadoop/software]$tar -zxvf hbase-2.2.4-bin.tar.gz -C ~/app/
  3. Add to the environment variables in ~/.bash_profile

    Code
    # HBase
    export HBASE_HOME=/home/hadoop/app/hbase-2.2.4
    export PATH=$HBASE_HOME/bin:$PATH
  4. Apply the configuration: source ~/.bash_profile

  5. Verify the installation

    Code
    hbase version
  6. Edit the configuration files

    • /conf/hbase-env.sh

      Code
      # set the JDK path
      export JAVA_HOME=/home/hadoop/app/jdk1.8.0_251
      # use the external ZooKeeper cluster instead of the embedded one
      export HBASE_MANAGES_ZK=false
    • /conf/hbase-site.xml

      xml
      <configuration>
      <!-- Fully distributed mode -->
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>

      <!-- Where HBase stores its data on HDFS -->
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://mycluster/hbase</value>
      </property>
      <!-- ZooKeeper quorum -->
      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>hadoop000:2181,hadoop001:2181,hadoop002:2181</value>
      </property>
      <!-- Local ZooKeeper data directory -->
      <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/home/hadoop/app/tmp/zookeeper</value>
      </property>
      </configuration>
    • /conf/regionservers

      Code
      hadoop000
      hadoop001
      hadoop002
  7. Copy Hadoop's HDFS configuration files (hdfs-site.xml and core-site.xml) into HBase's conf directory

    Code
    [hadoop@hadoop000 /home/hadoop/app/hadoop-3.1.2/etc/hadoop]$cp hdfs-site.xml ~/app/hbase-2.2.4/conf/
    [hadoop@hadoop000 /home/hadoop/app/hadoop-3.1.2/etc/hadoop]$cp core-site.xml ~/app/hbase-2.2.4/conf
  8. Distribute to the other machines

    Code
    [hadoop@hadoop000 /home/hadoop/app]$xscp.sh hbase-2.2.4
    [hadoop@hadoop000 /home/hadoop/app]$xscp.sh ~/.bash_profile
  9. Start HBase on hadoop002

    Code
    $ start-hbase.sh

    [hadoop@hadoop000 /home/hadoop/app]$xcall.sh ~/app/jdk1.8.0_251/bin/jps
    ============= hadoop000 ==============
    20128 QuorumPeerMain
    31971 DFSZKFailoverController
    1141 HRegionServer
    32394 ResourceManager
    31404 NameNode
    31757 JournalNode
    1389 Jps
    32527 NodeManager
    ============= hadoop001 ==============
    15105 ResourceManager
    15698 Jps
    14599 NameNode
    14793 JournalNode
    15594 HRegionServer
    14924 DFSZKFailoverController
    15199 NodeManager
    9439 QuorumPeerMain
    ============= hadoop002 ==============
    12148 NodeManager
    12389 HRegionServer
    8935 QuorumPeerMain
    18910 HMaster
    11994 JournalNode
    12493 Jps
  10. High availability: simply start another Master on a different machine

    Code
    hbase-daemon.sh start master
  11. If the HMaster keeps shutting itself down, check its logs. If the errors are WAL-related, add the following to hbase-site.xml:

    xml
    <!-- Workaround for HMaster shutting down automatically -->
    <property>
    <name>hbase.unsafe.stream.capability.enforce</name>
    <value>false</value>
    </property>

    If that still does not help, delete the /hbase znode in ZooKeeper (rmr /hbase), then delete Hadoop's logs and data and reformat HDFS.
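
Before moving on, a quick sanity check from the HBase shell (the table name used here is an arbitrary example):

Code
hbase shell
# inside the shell:
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'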

10. Flume Installation

Do the following on hadoop000:

  1. Download

  2. Extract to ~/app

    Code
    tar -zxvf ~/software/apache-flume-1.9.0-bin.tar.gz -C ~/app/
  3. Add to the environment variables in ~/.bash_profile

    Code
    # Flume
    export FLUME_HOME=/home/hadoop/app/apache-flume-1.9.0-bin
    export PATH=$FLUME_HOME/bin:$PATH
  4. Apply the configuration: source ~/.bash_profile

  5. Configure flume-env.sh

    Code
    cd ~/app/apache-flume-1.9.0-bin/conf/
    cp flume-env.sh.template flume-env.sh

    # add the following
    export JAVA_HOME=/home/hadoop/app/jdk1.8.0_251
  6. Distribute

    Code
    [hadoop@hadoop000 /home/hadoop/app]$xscp.sh apache-flume-1.9.0-bin
    [hadoop@hadoop000 /home/hadoop/app]$xscp.sh ~/.bash_profile
  7. Verify the installation

    Code
    flume-ng version

Run a test to confirm that Flume works across machines; a minimal sanity check is sketched below. For more usage details see the separate Flume study notes.
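
A minimal sketch, assuming a throwaway config file named ~/data/example.conf with a netcat source, a memory channel, and a logger sink (agent name a1, port 44444 and the file path are arbitrary examples):

Code
# ~/data/example.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop000
a1.sources.r1.port = 44444

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# start the agent, then send a line from another machine with: telnet hadoop000 44444
flume-ng agent --conf $FLUME_HOME/conf --conf-file ~/data/example.conf --name a1 -Dflume.root.logger=INFO,console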

11. Kafka Installation

Do the following on hadoop000:

  1. Download

  2. Extract

    Code
    [hadoop@hadoop000 /home/hadoop/software]$tar -zxvf kafka_2.12-2.4.1.tgz -C ~/app/
  3. Configure environment variables

    Code
    [hadoop@hadoop000 /home/hadoop/software]$vi ~/.bash_profile
    # KAFKA
    export KAFKA_HOME=/home/hadoop/app/kafka_2.12-2.4.1
    export PATH=$KAFKA_HOME/bin:$PATH
  4. Apply the configuration: source ~/.bash_profile

  5. Edit the configuration file

    • server.properties

      Code
      # keep a copy of the original configuration file
      [hadoop@hadoop000 /home/hadoop/app/kafka_2.12-2.4.1/config]$cp server.properties server.properties.bak

      Then edit server.properties:

      properties
      # broker ID, must be unique within the cluster (here the same number as the ZooKeeper myid is used)
      broker.id=1
      # uncomment this line (note: remember to change it on every machine)
      listeners=PLAINTEXT://hadoop000:9092
      # change the log directory
      log.dirs=/home/hadoop/app/tmp/kafka-logs
      # ZooKeeper cluster
      zookeeper.connect=hadoop000:2181,hadoop001:2181,hadoop002:2181
  6. Distribute

    Code
    [hadoop@hadoop000 /home/hadoop/app]$xscp.sh kafka_2.12-2.4.1
    [hadoop@hadoop000 /home/hadoop/app]$xscp.sh ~/.bash_profile

    Note: remember to change broker.id and listeners in server.properties on each machine.

  7. Start the service on all machines (make sure the ZooKeeper cluster is running)

    Code
    kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties
  8. Check the processes

    Code
    $xcall.sh ~/app/jdk1.8.0_251/bin/jps
    ============= hadoop000 ==============
    5458 QuorumPeerMain
    6405 Jps
    6331 Kafka
    ============= hadoop001 ==============
    5156 QuorumPeerMain
    5876 Kafka
    5944 Jps
    ============= hadoop002 ==============
    5809 Jps
    5012 QuorumPeerMain
    5741 Kafka

Run a test to confirm that Kafka works across machines; a minimal sketch follows. For more usage details see the separate Kafka study notes.
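
A minimal end-to-end check (the topic name test is an arbitrary example; run the producer and the consumer on different machines):

Code
# create a replicated topic
kafka-topics.sh --create --bootstrap-server hadoop000:9092 --replication-factor 3 --partitions 3 --topic test

# on one machine: type a few messages
kafka-console-producer.sh --broker-list hadoop000:9092 --topic test

# on another machine: the messages should appear here
kafka-console-consumer.sh --bootstrap-server hadoop001:9092 --topic test --from-beginning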

12. Maven Installation

Do the following on hadoop000:

  1. Download

  2. Extract

    Code
    [hadoop@hadoop000 /home/hadoop/software]$tar -zxvf apache-maven-3.6.3-bin.tar.gz -C ~/app/
  3. Configure environment variables: ~/.bash_profile

    Code
    # MAVEN
    export MAVEN_HOME=/home/hadoop/app/apache-maven-3.6.3
    export PATH=$MAVEN_HOME/bin:$PATH
  4. Apply the configuration: source ~/.bash_profile

  5. Add the Aliyun mirror to conf/settings.xml

    xml
    <!-- Aliyun central repository mirror -->
    <mirror>
      <id>alimaven</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  6. Check the version

    Code
    mvn -version
    Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
    Maven home: /home/hadoop/app/apache-maven-3.6.3
    Java version: 1.8.0_251

13. Scala Installation

  1. Download

  2. Extract

    Code
    [hadoop@hadoop000 /home/hadoop/software]$tar -zxvf scala-2.11.12.tgz -C ~/app/
  3. Configure environment variables

    Code
    # SCALA
    export SCALA_HOME=/home/hadoop/app/scala-2.11.12
    export PATH=$SCALA_HOME/bin:$PATH
  4. Apply the configuration: source ~/.bash_profile
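
To verify the installation:

Code
scala -version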

14. Spark Installation

14.1 Installing the Prebuilt Binary

Do the following on hadoop000:

  1. Download

  2. Extract

    Code
    [hadoop@hadoop000 /home/hadoop/software]$tar -zxvf spark-2.4.5-bin-hadoop2.7.tgz -C ~/app/
  3. Configure environment variables

    Code
    [hadoop@hadoop000 /home/hadoop/software]$vi ~/.bash_profile
    # SPARK
    export SPARK_HOME=/home/hadoop/app/spark-2.4.5-bin-hadoop2.7
    export PATH=$SPARK_HOME/bin:$PATH
  4. Apply the configuration: source ~/.bash_profile

  5. Distribute

    Code
    [hadoop@hadoop000 /home/hadoop/app]$xscp.sh spark-2.4.5-bin-hadoop2.7
    [hadoop@hadoop000 /home/hadoop/app]$xscp.sh ~/.bash_profile
  6. Test (a YARN submission example follows below)

    Code
    spark-shell
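
Beyond spark-shell, running the bundled SparkPi example on YARN can confirm the Spark-on-YARN integration. This is a sketch assuming the stock spark-2.4.5-bin-hadoop2.7 package (which ships spark-examples_2.11-2.4.5.jar) and that Spark can find the Hadoop configuration:

Code
# point Spark at Hadoop's configuration (add to ~/.bash_profile if not already set)
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn --deploy-mode client \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.5.jar 10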

14.2 Building from Source (Recommended)

Official build guide: https://spark.apache.org/docs/latest/building-spark.html

My own build failed when I tried it, so I ended up installing with the first method.

Do the following on hadoop000:

  1. Download

  2. Extract

    Code
    [hadoop@hadoop000 /home/hadoop/software]$tar -zxvf spark-2.4.5.tgz -C ~/source/
  3. Edit the pom.xml file and add a repository (important)

    xml
    <repository>
      <id>cloudera</id>
      <name>cloudera repository</name>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  4. Build with Maven, targeting a specific Hadoop version, with YARN enabled and with Hive and JDBC support

    • Option 1

      Code
      ./build/mvn -Pyarn -Phadoop-3.1 -Dhadoop.version=3.1.2 -Phive -Phive-thriftserver -DskipTests clean package

      Set Maven's memory limits; adjust them to your machine.

      Code
      export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
    • Option 2 (recommended). Building on the Alibaba Cloud instance was far too slow, so I built on a local machine instead, using a proxy.

      This builds everything into a single distributable package; naming it directly after the Hadoop version is recommended.

      Code
      ./dev/make-distribution.sh --name hadoop3.1.2 --pip --r --tgz -Pyarn -Phadoop-3.1 -Dhadoop.version=3.1.2 -Phive -Phive-thriftserver -DskipTests clean package

      Modify ./dev/make-distribution.sh to skip the version detection

      • Comment out the following block, located at lines 128-146 of the file:

        Code
        128 #VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null\
        129 # | grep -v "INFO"\
        130 # | grep -v "WARNING"\
        131 # | tail -n 1)
        132 #SCALA_VERSION=$("$MVN" help:evaluate -Dexpression=scala.binary.version $@ 2>/dev/null\
        133 # | grep -v "INFO"\
        134 # | grep -v "WARNING"\
        135 # | tail -n 1)
        136 #SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null\
        137 # | grep -v "INFO"\
        138 # | grep -v "WARNING"\
        139 # | tail -n 1)
        140 #SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
        141 # | grep -v "INFO"\
        142 # | grep -v "WARNING"\
        143 # | fgrep --count "<id>hive</id>";\
        144 # # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing\
        145 # # because we use "set -o pipefail"
        146 # echo -n)
      • Add the following instead

        Code
        VERSION=2.4.5
        SCALA_VERSION=2.11
        SPARK_HADOOP_VERSION=3.1.2
        SPARK_HIVE=1
      • Optional: the memory settings can also be raised; the default code cache is 1 GB, here it is set to 2 GB

        Code
        export MAVEN_OPTS="${MAVEN_OPTS:--Xmx8g -XX:ReservedCodeCacheSize=2g}"

Tip:

If an error message during the build is hard to understand, append -X to the build command to get more detailed output.

15. Log Generation

Author: IT小王
Link: https://wangbowen.cn/2020/05/27/SparkStreaming%E9%A1%B9%E7%9B%AE%E5%AE%9E%E6%88%98/
Copyright: Unless otherwise noted, all posts on this blog are licensed under CC BY-NC-SA 4.0. Please credit IT小王 when reposting.
