
Answers to FAQ about SparkR for R users

Many people keep asking me whether I have tried SparkR, whether it is worth using, whether it is sexy, or WHAT it is at all. I felt that a list of frequently asked questions (FAQ) about WHAT Spark/SparkR actually is would help many R scientists understand this Big Data buzz-tool. I have gathered information from the documentation and some code from Stack Overflow questions to prepare the list below.

Q1: How do I explain what Spark is to a regular R user?

From the Spark Overview documentation:

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

So in simple words, it is a computational engine optimized for fast computations.

Q2: Can I run Spark applications from R (SparkR)?

From the Spark Overview we know that it is possible to use Spark from R without knowing Spark syntax at all. It can be done through SparkR, the R frontend for Spark, which has been included in Spark since version 1.4.0. The current Spark release is 1.6.1.

Q3: How do I install SparkR?

By now, SparkR is distributed with Spark, so after you install Spark you will find the SparkR package in the installation directory. This means that the SparkR package can be loaded from the /R/lib subdirectory of the directory in which Spark was installed, using the lib.loc parameter of library():

library(SparkR, lib.loc = "/opt/spark/R/lib")
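
If you'd rather not hard-code the path, the small sketch below does the same thing, assuming the SPARK_HOME environment variable points at your Spark installation (the /opt/spark fallback is just an example):

# assumes SPARK_HOME is set, e.g. Sys.setenv(SPARK_HOME = "/opt/spark")
spark_home <- Sys.getenv("SPARK_HOME", unset = "/opt/spark")
# SparkR lives in the R/lib subdirectory of the Spark installation
library(SparkR, lib.loc = file.path(spark_home, "R", "lib"))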

Installation under Windows and OS X is well described in this r-bloggers post.

Q4: How do I install Spark?

Spark can be downloaded from this site and can be built with the following instructions.

Q5: Can I use Spark on Windows?

Sure! The easiest solution on Windows is to build Spark from source, after you first get Maven. See this guide: http://spark.apache.org/docs/latest/building-spark.html

There is also an explanation on Stack Overflow.

Q6: How do I start a Spark application using SparkR from the shell?

Assuming that the installation directory is /opt/spark, one can start SparkR from the shell with

/opt/spark/bin/sparkR

where the /spark/ directory on your machine would probably look more like spark-ver-hadoop-ver/.

This starts R with the SparkR package attached and a Spark application running locally. Using SparkR this way has the advantage that the Spark context and SQL context are created automatically at startup, without any user involvement. You can see this in the startup message

Welcome to SparkR! Spark context is available as sc, SQL context is available as sqlContext

You can specify the number of executors and executor cores for the Spark application with additional arguments

/opt/spark/bin/sparkR --num-executors 5 --executor-cores 5

Q7: How do I start a Spark application using SparkR from RStudio?

If you’d rather use RStudio to work with the R frontend, which I really recommend, you have to load SparkR from its non-default directory. Therefore you should get familiar with the lib.loc parameter of the library() function:

library(SparkR, lib.loc = "/opt/spark/R/lib")

Then you will have to start the Spark context and SQL context manually with

# this is optional (spark.executor.instances corresponds to the --num-executors flag)
sparkEnvir <- list(spark.executor.instances='5', spark.executor.cores='5')
# initializing Spark context
sc <- sparkR.init(sparkHome = "/opt/spark", 
                  sparkEnvir = sparkEnvir)
# initializing SQL context
sqlContext <- sparkRSQL.init(sc)

There is also an extra spark.driver.memory option, described in Starting Up from RStudio.
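
For example, a sketch of an initialization that also requests more driver memory could look like this (the 2g value is purely illustrative):

# request driver memory through sparkEnvir as well (value is only an example)
sparkEnvir <- list(spark.driver.memory = '2g')
sc <- sparkR.init(sparkHome = "/opt/spark",
                  sparkEnvir = sparkEnvir)
sqlContext <- sparkRSQL.init(sc)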

Q8: What are all those Spark contexts and how can I benefit from them?

The Spark context is required to run Spark itself, and any other context is required to connect Spark with one of its extensions. The SQL context enables Spark SQL, which mainly means that one can write SQL statements instead of calling Spark functions, and the Spark SQL module will translate those statements for Spark.

For example hiveContext can be created with

hiveContext <- sparkRHive.init(sc)

and then one would be able to send SQL/HQL statements to HiveServer like

SparkResult <- sql(hiveContext, "select statement")

Note that Spark should have been built with Hive support; more details on the difference between SQLContext and HiveContext can be found in the SQL programming guide.
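
To illustrate how the SQL context is used in practice, here is a small sketch with the built-in faithful data set (the table name and query are just examples):

# register a SparkR DataFrame as a temporary table and query it with SQL
df <- createDataFrame(sqlContext, faithful)
registerTempTable(df, "faithful")
long_waits <- sql(sqlContext, "SELECT * FROM faithful WHERE waiting > 70")
head(long_waits)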

Q9: Can I run a Spark application locally, or only on a YARN Hadoop cluster?

You can do it both ways. Running on a YARN cluster requires specifying the additional master parameter.

  • shell
/opt/spark/bin/sparkR --master yarn-client --num-executors 5 --executor-cores 5
  • RStudio
sc <- sparkR.init(sparkHome = "/opt/spark", 
                  master = "yarn-client",
                  sparkEnvir = sparkEnvir)

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster.

Source: Launching Spark on YARN
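
From RStudio this means the variable has to be set before the context is initialized, for instance as in the sketch below (the /etc/hadoop/conf path is only an assumption and depends on your cluster):

# point SparkR at the Hadoop client configuration (path is an example)
Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf")
sc <- sparkR.init(sparkHome = "/opt/spark",
                  master = "yarn-client",
                  sparkEnvir = sparkEnvir)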

Q10: How do I stop a Spark application?

There’s nothing simpler:

sparkR.stop()

Q11: What are potential Spark data sources?

One can connect to HiveServer by using the hiveContext, but one can also read CSV files to work with. The latter requires an additional Spark package; a good start is the spark-csv_2.10:1.0.3 package, which can be downloaded from the Spark packages repository http://spark-packages.org/ (a short reading sketch follows the shell and RStudio examples below).

  • shell
/opt/spark/bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3 --master yarn-client --num-executors 5 --executor-cores 5
  • RStudio
sparkEnvir <- list(spark.executor.instances='5', spark.executor.cores='5')
# initializing Spark context with the spark-csv package attached
sc <- sparkR.init(sparkHome = "/opt/spark",
                  sparkPackages = "com.databricks:spark-csv_2.10:1.0.3",
                  sparkEnvir = sparkEnvir)
# initializing SQL context
sqlContext <- sparkRSQL.init(sc)
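
With the package attached, reading a CSV file into a SparkR DataFrame is a matter of pointing read.df at the file and at the csv source. A sketch (the path below is only a placeholder):

# read a CSV file into a SparkR DataFrame; path and options are illustrative
flights <- read.df(sqlContext, "data/flights.csv",
                   source = "com.databricks.spark.csv",
                   header = "true")
head(flights)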

Q12: Can I use R functions on Spark?

No. Mainly you are using Spark functions, called through their R equivalents. You cannot run a function on the Spark engine if it exists in R but has no equivalent in Spark.

But you can pull data out of the Spark application with

SparkR::collect(SparkResult) -> RResult

use R functions on the collected data set, and send it back to Spark as a so-called SparkR DataFrame

operations_and_functions(RResult) -> RResultChanged
df <- createDataFrame(sqlContext/hiveContext, RResultChanged)

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood.
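
Putting those pieces together, a minimal round trip could look like the sketch below (it assumes sqlContext has already been created and uses the built-in faithful data set as an example):

# move data from Spark to plain R, apply an R-only function, and push the result back
df <- createDataFrame(sqlContext, faithful)
RResult <- SparkR::collect(df)               # now a regular R data.frame
RResult$log_waiting <- log(RResult$waiting)  # any plain R function works here
df2 <- createDataFrame(sqlContext, RResult)  # back to a SparkR DataFrame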

Q13: What about machine learning and MLlib?

Spark provides a great machine learning library called MLlib, with algorithms based on stochastic gradient descent optimization, so Spark can be connected to streaming data and use online machine learning methods. It is not yet possible to access the whole MLlib library from R; so far the only machine learning algorithms available from SparkR are Gaussian and binomial GLM models.
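
A sketch of the Gaussian GLM that is available from SparkR, again on the built-in faithful data set, could look like this:

# fit a Gaussian GLM on a SparkR DataFrame
df <- createDataFrame(sqlContext, faithful)
model <- glm(waiting ~ eruptions, data = df, family = "gaussian")
summary(model)
predictions <- predict(model, df)
head(predictions)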

Q14: Where can I read more?! I want more examples!

There is a great documentation page about using Spark from R in the Spark Programming Guides.

If you have any more questions or feel that something is not clear, you can add a comment in the Disqus panel below. If you’d like to hear more about R integration with Spark, like dplyr.spark.hive, in the future, you can subscribe by clicking this ATOM feed link.