

As discussed previously, Apache Spark currently supports three cluster managers: Standalone, Apache Mesos, and Hadoop YARN.
We'll look at setting these up in much more detail in Chapter 8, Operating in Clustered Mode.
Until now we have used Spark for exploratory analysis through the Scala and Python shells. Spark can also be used in standalone applications written in Java, Scala, Python, or R. As we saw earlier, the Spark shell and PySpark provide you with a SparkContext; in a standalone application, however, you need to initialize your own SparkContext. Once you have a SparkContext reference, the rest of the API remains exactly the same as for interactive analysis. After all, it's the same object, just running in a different context.
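For example, a standalone Scala application might create its own context along the following lines. This is a minimal sketch: the application name, the local[*] master, and the README.md input path are placeholders, not values prescribed by the book.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // Describe the application. "local[*]" is a placeholder master for
    // local testing; in practice the master is usually supplied via
    // spark-submit rather than hard-coded.
    val conf = new SparkConf()
      .setAppName("WordCountApp")
      .setMaster("local[*]")

    val sc = new SparkContext(conf)

    // From here on, the API is identical to the interactive shells.
    val lines = sc.textFile("README.md")
    println(s"Number of lines: ${lines.count()}")

    sc.stop()
  }
}
```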
The exact method of using Spark in your application differs based on your language of choice. All Spark artifacts are hosted in Maven Central, and you can add a Maven dependency with the following coordinates:
groupId: org.apache.spark
artifactId: spark-core_2.10
version: 1.6.1
You can use Maven to build the project, or alternatively use the Scala IDE for Eclipse to add the Maven dependency to your project.
Apache Maven is a build automation tool used primarily for Java projects. The word maven means "accumulator of knowledge" in Yiddish. Maven addresses the two core aspects of building software: first, it describes how the software is built and second, it describes its dependencies.
You can configure your IDE to work with Spark. While many Spark developers use SBT or Maven on the command line, the most common IDE is IntelliJ IDEA. The Community Edition is free, and you can then install the JetBrains Scala plugin. You can find detailed instructions on setting up either IntelliJ IDEA or Eclipse to build Spark at http://bit.ly/28RDPFy.
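If you build with SBT instead of Maven, the same coordinates can be declared in your build.sbt. This is a sketch, assuming a Scala 2.10 build so that %% resolves to the spark-core_2.10 artifact listed above:

```scala
// build.sbt -- sketch of the Spark core dependency for an SBT build
scalaVersion := "2.10.6"

// %% appends the Scala binary version, yielding spark-core_2.10
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
```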
The spark-submit script in Spark's bin directory is the most common way to submit Spark applications to a cluster, and it can launch applications on all supported cluster types. You need to package your application together with its dependent projects so that Spark can distribute them across the cluster; in practice this means building an assembly JAR (also known as an uber or fat JAR) containing your code and its dependencies.
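One common way to produce such an assembly JAR, though not one the book prescribes, is the sbt-assembly plugin; Maven's shade plugin serves the same purpose. A sketch, assuming an SBT build (the plugin version shown is an assumption):

```scala
// project/plugins.sbt -- the exact plugin version is an assumption
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
```

Running sbt assembly then produces a single JAR containing your classes and their dependencies. Spark itself is typically marked as a provided dependency so that it is not bundled into the assembly, since the cluster already supplies it.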
A Spark application with its dependencies can then be launched using the bin/spark-submit script. This script takes care of setting up the classpath and its dependencies, and it supports all the cluster managers and deploy modes that Spark supports.
Figure 1.16: Spark submission template
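The general shape of a spark-submit invocation is shown below. This is a sketch of the standard template; the options that are meaningful depend on the cluster manager you target.

```bash
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
```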
For Python applications, simply pass a .py file in the place of <application-jar>, and add .zip, .egg, or .py files to the search path with --py-files.
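For instance, a Python application with bundled dependencies might be submitted as follows. This is a sketch: my_app.py, deps.zip, and the master URL are hypothetical placeholders.

```bash
# Submit a Python application, shipping extra modules in deps.zip
./bin/spark-submit \
  --master spark://master-host:7077 \
  --py-files deps.zip \
  my_app.py
```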
The following table summarizes the combinations of cluster manager, deployment mode, and application type that are not supported or not applicable:

| Cluster Manager | Deployment Mode | Application Type | Support |
| --- | --- | --- | --- |
| Mesos | Cluster | R | Not supported |
| Standalone | Cluster | Python | Not supported |
| Standalone | Cluster | R | Not supported |
| Local | Cluster | - | Incompatible |
| - | Cluster | Spark shell | Not applicable |
| - | Cluster | SQL shell | Not applicable |
| - | Cluster | Thrift server | Not applicable |