Spark has the following four run modes:
- local: local single-process mode, used for developing and testing Spark code on a local machine
- standalone: distributed cluster mode with a Master-Worker architecture; the Master handles scheduling and the Workers execute the actual tasks
- on YARN/Mesos: runs on top of a resource management framework such as YARN or Mesos, which provides resource management while Spark provides computation scheduling; Spark can then share the cluster with other computation frameworks (e.g. MapReduce/MPI/Storm)
- on cloud (EC2): runs on AWS EC2
This chapter covers the installation, deployment, and configuration of the local and standalone modes.
Local Mode
Note: this article only covers installation, deployment, and configuration on Linux. The steps on Windows and Mac are similar; please refer to the official documentation.
1. Prerequisites
Spark requires JDK 6.0 and Scala 2.9.3 or later, so first make sure suitable versions of the JDK and Scala are installed and added to PATH. This step is straightforward and not covered in detail here. After installation, verify the JDK and Scala versions:
hadoop@Aspire-5830TG:~$ echo $JAVA_HOME
/usr/local/jdk
hadoop@Aspire-5830TG:~$ java -version
java version "1.6.0_43"
Java(TM) SE Runtime Environment (build 1.6.0_43-b01)
Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01, mixed mode)
hadoop@Aspire-5830TG:~$ echo $SCALA_HOME
/usr/local/scala
hadoop@Aspire-5830TG:~$ scala
Welcome to Scala version 2.9.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_43).
Type in expressions to have them evaluated.
Type :help for more information.
scala>
2. Installing Spark
Installing Spark is simple: download the Spark release package and add it to PATH. We use the official v0.7.3 release.
cd /usr/local/spark
wget http://spark-project.org/download/spark-0.7.3-prebuilt-hadoop1.tgz
tar zxvf spark-0.7.3-prebuilt-hadoop1.tgz
ln -s spark-0.7.3 spark-release
vim /etc/profile
export SPARK_HOME=/usr/local/spark/spark-release
export PATH=$PATH:$SPARK_HOME/bin
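After editing /etc/profile, a quick sanity check (a minimal sketch; reload the profile in the current shell first):
source /etc/profile
echo $SPARK_HOME    # should print /usr/local/spark/spark-release
ls $SPARK_HOME      # the run and spark-shell scripts sit at the top level of the release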
3. Configuring Spark
Spark has a single configuration file: $SPARK_HOME/conf/spark-env.sh. The configuration for local development mode is simple; only JAVA_HOME and SCALA_HOME need to be set. For example:
export JAVA_HOME=/usr/local/jdk
export SCALA_HOME=/usr/local/scala
# location of the Spark examples jar (needed to run the bundled examples)
export SPARK_EXAMPLES_JAR=$SPARK_HOME/examples/target/scala-2.9.3/spark-examples_2.9.3-0.7.3.jar
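The exact path of the examples jar depends on how the package was built, so it is worth confirming the file exists; one way to locate it (a sketch, not a required step):
find $SPARK_HOME/examples -name "spark-examples*.jar"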
4. Testing and Verification
Spark provides two ways to run code:
1) The run script: runs code from an already built jar, such as SparkPi from the examples bundled with Spark.
hadoop@Aspire-5830TG:/usr/local/spark/spark-release$ ./run spark.examples.SparkPi local
# (lengthy output omitted...)
Pi is roughly 3.1358
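Passing local as the master runs the job with a single worker thread; a master of the form local[N] runs it with N threads, for example:
./run spark.examples.SparkPi local[2]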
2) spark-shell: used for interactive programming.
hadoop@Aspire-5830TG:/usr/local/spark/spark-release$ ./spark-shell
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 0.7.3
/_/
Using Scala version 2.9.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_43)
Initializing interpreter...
13/07/16 15:28:04 WARN Utils: Your hostname, Aspire-5830TG resolves to a loopback address: 127.0.0.1; using 172.16.239.1 instead (on interface vmnet8)
13/07/16 15:28:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Creating SparkContext...
Spark context available as sc.
Type in expressions to have them evaluated.
Type :help for more information.
scala> val days = List("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
days: List[java.lang.String] = List(Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday)
scala> val daysRDD = sc.parallelize(days)
daysRDD: spark.RDD[java.lang.String] = ParallelCollectionRDD[0] at parallelize at <console>:14
scala> daysRDD.count()
res0: Long = 7
scala>
Standalone Mode
Note: Spark uses a Master-Worker architecture, so decide which machines will serve as Master and Workers before deployment. In addition, the distributed cluster mode shares data on HDFS, so Hadoop needs to be installed on every node.
1. Prerequisites
- In cluster mode, the Master node must be able to SSH to each Worker node without a password (see the sketch after this list).
- As in local mode, the JDK and Scala must be installed on every node of the Spark cluster.
- Hadoop must be installed on every node of the Spark cluster (acting as a Hadoop client to access data on HDFS).
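A minimal sketch for setting up passwordless SSH from the Master to one Worker (the user hadoop, port 58422, and IP 10.2.6.133 are taken from the examples below; adjust them to your environment):
# on the Master node, generate a key pair if one does not exist yet
ssh-keygen -t rsa
# append the public key to the Worker's authorized_keys
cat ~/.ssh/id_rsa.pub | ssh -p 58422 hadoop@10.2.6.133 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
# verify: this should log in without a password prompt
ssh -p 58422 hadoop@10.2.6.133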
2. Installing Spark
As in the local setup, download the Spark tar.gz package on every node of the cluster, extract it, and configure the environment variables.
3. Configuring Spark
In distributed cluster mode, Spark has two configuration files:
- $SPARK_HOME/conf/slaves
slaves is a plain text file listing the IP address of each Worker node, one per line. For example:
10.2.6.133
10.2.6.134
10.2.6.154
- $SPARK_HOME/conf/spark-env.sh:
Spark's environment configuration file, which defines the Master/Worker settings and resource allocation. For example:
export JAVA_HOME=/usr/local/jdk
export SCALA_HOME=/usr/local/scala
# SSH related
export SPARK_SSH_OPTS="-p58422 -o StrictHostKeyChecking=no"
# Spark Master IP
export SPARK_MASTER_IP=10.2.6.152
# set the number of cores to use on worker
export SPARK_WORKER_CORES=16
# set how much memory to use on worker
export SPARK_WORKER_MEMORY=16g
export SPARK_EXAMPLES_JAR=/usr/local/spark/spark-0.7.3/examples/target/scala-2.9.3/spark-examples_2.9.3-0.7.3.jar
# LZO codec related
export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib
export SPARK_LIBRARY_PATH=/usr/local/hadoop/hadoop-release/lib/native/Linux-amd64-64/
Configure both of these files on every node of the Spark cluster.
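One way to push the two files from the Master to every node (a sketch assuming the Worker IPs from the slaves file above and an identical install path on each node):
for host in 10.2.6.133 10.2.6.134 10.2.6.154; do
  scp -P 58422 $SPARK_HOME/conf/slaves $SPARK_HOME/conf/spark-env.sh hadoop@$host:$SPARK_HOME/conf/
done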
4. Starting the Cluster
On the Master node, run the following commands:
[hadoop@cosmos152 spark-release]$ echo $SPARK_HOME
/usr/local/spark/spark-release
[hadoop@cosmos152 spark-release]$ bin/start-all.sh
starting spark.deploy.master.Master, logging to /usr/local/spark/spark-0.7.3/bin/../logs/spark-hadoop-spark.deploy.master.Master-1-cosmos152.hadoop.out
Master IP: 10.2.6.152
cd /usr/local/spark/spark-0.7.3/bin/.. ; /usr/local/spark/spark-release/bin/start-slave.sh 1 spark://10.2.6.152:7077
10.2.6.133: starting spark.deploy.worker.Worker, logging to /usr/local/spark/spark-0.7.3/bin/../logs/spark-hadoop-spark.deploy.worker.Worker-1-cosmos133.hadoop.out
10.2.6.134: starting spark.deploy.worker.Worker, logging to /usr/local/spark/spark-0.7.3/bin/../logs/spark-hadoop-spark.deploy.worker.Worker-1-cosmos134.hadoop.out
10.2.6.154: starting spark.deploy.worker.Worker, logging to /usr/local/spark/spark-0.7.3/bin/../logs/spark-hadoop-spark.deploy.worker.Worker-1-cosmos154.hadoop.out
[hadoop@cosmos152 spark-release]$
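To confirm that the daemons are running, a simple check (a sketch; jps ships with the JDK, and the standalone Master web UI listens on port 8080 by default):
# on the Master node
jps | grep Master
# on each Worker node
jps | grep Worker
# the Master web UI at http://10.2.6.152:8080 should list the registered workers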
5. Testing and Verification
It is recommended to set up a dedicated Client node for submitting Spark jobs, configured the same way as the nodes above. On the Client node, run the following command:
hadoop@cosmos155:/usr/local/spark/spark-release$ ./run spark.examples.SparkPi spark://10.2.6.152:7077
# (lengthy output omitted...)
Pi is roughly 3.1358
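spark-shell can also be pointed at the standalone cluster instead of local mode; in this version of Spark the master is taken from the MASTER environment variable (a sketch, run from the Client node):
MASTER=spark://10.2.6.152:7077 ./spark-shell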