Installing Spark: Setting Up a Spark Cluster and Development Environment

2014-09-26 by guozhongxin

Preparing to Install Spark

Before installing Spark, prepare the following packages and complete the following prerequisites.

  • The Scala package, which can be downloaded from the official Scala website
  • The Spark package, which can be downloaded from the official Spark website in two forms:
    • source code package
    • pre-built package
  • Passwordless SSH login from the master node to every other node (a minimal sketch follows below).
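
A minimal sketch of the SSH prerequisite, assuming the other nodes are reachable as slave1, slave2 and slave3 and that the same user runs Spark on every node:

# generate a key pair on the master node (skip if ~/.ssh/id_rsa already exists)
ssh-keygen -t rsa

# copy the public key to every worker node
for node in slave1 slave2 slave3; do
    ssh-copy-id $node
done

# verify: this should log in without asking for a password
ssh slave1 hostname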

Install Scala

Download scala-2.10.4.tgz, extract it, and add it to your environment:

tar -zxf scala-2.10.4.tgz
vi ~/.bashrc
    export SCALA_HOME=...   
    export PATH=$PATH:$SCALA_HOME/bin
source ~/.bashrc
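
To confirm the Scala installation, check the reported version from a new shell (or after sourcing ~/.bashrc):

# should report version 2.10.4
scala -version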

Install Spark

Spark is distributed in two forms: a source package, which you have to build yourself, and a pre-built binary package.

To build from source, extract the package, edit pom.xml in the extracted directory, and change the hadoop, protobuf, hbase and hive versions to match your environment. Then run this command in that directory:

./make-distribution.sh --hadoop 2.4.0 --with-yarn --with-hive --with-tachyon --tgz --skip-java-test

If you chose a pre-built package that matches your Hadoop version, you can skip the build step above.

Extract the package you built or downloaded and configure the environment variables:

tar -zxf spark-1.0.0-bin-2.2.0.tgz
vi ~/.bashrc
    export SPARK_HOME=...
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
source ~/.bashrc
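
As a quick sanity check before touching the cluster configuration, you can start a local Spark shell (this assumes $SPARK_HOME/bin is now on your PATH, as configured above):

# run the Spark REPL locally with 2 threads; no cluster required
spark-shell --master local[2]

If the shell starts and gives you a predefined sc (SparkContext), the installation itself is fine; exit with :quit.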

Configure the Spark Cluster

  • Edit $SPARK_HOME/conf/slaves and list every node that should run a Worker, one hostname (or IP) per line:

    master
    slave1
    slave2
    slave3

  • Create and edit $SPARK_HOME/conf/spark-env.sh:

    export HADOOP_HOME=/opt/apache/hadoop-2.4.0
    export HADOOP_CONF_DIR=/opt/apache/hadoop-2.4.0/etc/hadoop
    export JAVA_HOME=/usr/local/jdk1.7.0_60
    export SCALA_HOME=/home/yarn/scala-2.10.4

    # memory each Worker can allocate to executors, and Worker processes per node
    export SPARK_WORKER_MEMORY=16g
    export SPARK_WORKER_INSTANCES=1
    # hostname or IP of the master node
    export SPARK_MASTER_IP=master

    After installation there is actually a spark-env.sh.template in the conf directory that explains each of these variables, so they are not repeated here.

  • Copy the configuration to the other nodes

Both of these files (slaves and spark-env.sh) need to be configured on every node; a sketch of copying them and starting the cluster follows below.
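
A minimal sketch of this step, assuming the worker hostnames used above and that Spark is unpacked at the same path on every node:

# push the two configuration files to every worker
# (assumes Spark lives at the same $SPARK_HOME path on every node)
for node in slave1 slave2 slave3; do
    scp $SPARK_HOME/conf/slaves $SPARK_HOME/conf/spark-env.sh $node:$SPARK_HOME/conf/
done

# start the master and all workers from the master node
$SPARK_HOME/sbin/start-all.sh

If everything is configured correctly, the standalone master's web UI is reachable on port 8080 of the master node and lists one Worker per entry in the slaves file.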

Configure Spark Application Properties

Spark offers three different ways to configure the properties of a job:

  1. Create and edit $SPARK_HOME/conf/spark-defaults.conf:

    spark.master spark://master:7077
    spark.eventLog.enabled true
    spark.eventLog.dir hdfs://master:8020/sparklog
    spark.local.dir ...

  2. Pass the properties on the command line when submitting a job through the $SPARK_HOME/bin/spark-submit script (a concrete example follows after this list):

    $SPARK_HOME/bin/spark-submit  \
    --master spark://master:7077  \
    --conf spark.eventLog.enabled=true ...  \
    ***.jar
    
  3. Assign these properties to the SparkContext programmatically in the application code.

The precedence of these three methods is: 3 overrides 2, which overrides 1.
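
As a concrete sketch of the second method, submitting the bundled SparkPi example to the cluster might look like this (the examples jar name and path below are an assumption and depend on how your package was built):

# submit the bundled SparkPi example to the standalone master,
# overriding a property on the command line
# (the examples jar path is an assumption; adjust it to your build)
$SPARK_HOME/bin/spark-submit  \
    --class org.apache.spark.examples.SparkPi  \
    --master spark://master:7077  \
    --conf spark.eventLog.enabled=true  \
    $SPARK_HOME/lib/spark-examples-1.0.0-hadoop2.4.0.jar 100

A value passed this way overrides spark-defaults.conf, and is in turn overridden by anything set on the SparkContext inside the application.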

Tips

  1. If you change SPARK_WORKER_INSTANCES, check the Worker processes on every node.
    If stale Worker processes are still running, you can kill them with this command:

    ps -ef | grep Worker | grep -v grep | cut -c 9-15 | xargs kill -s 9
    

    Then restart the Spark cluster.

  2. If you want to start the history server, you must pass the directory containing the logs:

    $SPARK_HOME/sbin/start-history-server.sh  $SPARK_HOME/logs
    
  3. If you want to persist a job's event log, you must set two properties (see the sketch after this list):

    spark.eventLog.enabled true
    spark.eventLog.dir hdfs://master:8020/sparklog
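
For the third tip, it is safest to make sure the event log directory already exists in HDFS before enabling event logging; a sketch using the path from above:

# create the event log directory referenced by spark.eventLog.dir
hadoop fs -mkdir -p hdfs://master:8020/sparklog

The history server reads these persisted event logs, so the directory passed to start-history-server.sh would typically be this same path; its web UI listens on port 18080 by default.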

