December 18, 2023
Connect to Celeste to run Spark
Spark is available on ERISOne. However, because Spark can consume a large amount of resources, we ask you to follow these instructions when you need to submit a Spark job. We currently recommend running Spark only on Celeste, which should have enough resources (CPU, memory) to run big data jobs.
1. Connect to Celeste: After connecting to any ERISOne node, for example through the Linux Remote Desktop, you can ssh to Celeste by typing:
ssh -Y username@celeste
2. Load the Spark module:
module use /apps/modulefiles/test
module load spark/1.4.1
3. See how busy the nodes are:
bhosts elephant_hg
There need to be enough job slots available for your request.
4. Request an interactive session (500GB memory):
bsub -Is -q interact-big -R 'rusage[mem=500000]' -M 520000 -n 8 bash
5. Set a ulimit to prevent the job from exceeding its estimated memory (add 20%):
ulimit -v 550000
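The slot check in step 3 can be sketched as a small filter over the bhosts output. This is only an illustration: it assumes the usual bhosts column layout (HOST_NAME, STATUS, JL/U, MAX, NJOBS, ...) and treats MAX minus NJOBS as the number of free slots. The function name free_slots and the host names in the sample are hypothetical.

```shell
# Sketch: list hosts that report enough free job slots for a request.
# Reads bhosts-style output on stdin; free slots are taken as MAX - NJOBS.
free_slots() {
    awk -v needed="$1" '
        NR == 1 { next }                     # skip the header row
        $2 == "ok" && $4 ~ /^[0-9]+$/ {      # open hosts with a numeric MAX only
            free = $4 - $5
            if (free >= needed) print $1, free
        }'
}

# On the cluster this would be used as: bhosts elephant_hg | free_slots 8
# Hypothetical sample output, for illustration only:
sample='HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
nodeA              ok              -     16      4      4      0      0      0
nodeB              closed          -     16     16     16      0      0      0
nodeC              ok              -     16     12     12      0      0      0'

printf '%s\n' "$sample" | free_slots 8
```

If the filter prints nothing, no host currently has enough free slots for a request of that size (here, the 8 slots asked for with -n 8).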
Connect to the Spark master
By default the Spark master runs in a standalone configuration and has one worker with a maximum of 16 cores and 500G of memory. To check the status of the Spark master, its workers, and the running applications, type the following in the console:
elinks http://localhost:8080
This command shows a simple view of the Spark status.
There are several ways to run Spark applications, depending on the language you want to use. The main supported languages are Scala, Java, Python and R. The following commands are available:
- spark-shell: An interactive Spark shell for Scala and Java.
- spark-submit: Submits Spark scripts, which can be written in Scala, Java, Python or R.
- pyspark: Python interpreter.
- sparkR: R interpreter.
- spark-sql: Mix SQL queries with Spark programs.
Connecting with the Spark shell
To run an application with the interactive Spark shell, use the --master option to point to the Spark master, the --total-executor-cores <numCores> option to control the number of cores that spark-shell uses, and the --executor-memory <mem> option to limit the amount of memory used per executor process (the default value is 512 MB). For example:
spark-shell --master spark://celeste.research.partners.org:7077 --total-executor-cores 10 --executor-memory 2G
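As a minimal sketch of how these options fit together, the helper below builds the spark-shell command line and refuses requests above the 16-core worker maximum mentioned earlier. The function name spark_shell_cmd and the variables MASTER_URL and MAX_CORES are hypothetical. The helper only prints the command, so it can be checked before anything is launched.

```shell
# Sketch: build a spark-shell command line and reject oversized requests.
# Prints the command instead of running it.
MASTER_URL="spark://celeste.research.partners.org:7077"
MAX_CORES=16   # worker maximum stated in this document

spark_shell_cmd() {
    cores="$1"
    mem="${2:-512m}"   # fall back to the documented default when omitted
    case "$cores" in
        ''|*[!0-9]*)
            echo "error: cores must be a positive integer" >&2
            return 1 ;;
    esac
    if [ "$cores" -gt "$MAX_CORES" ]; then
        echo "error: $cores cores requested, worker maximum is $MAX_CORES" >&2
        return 1
    fi
    echo "spark-shell --master $MASTER_URL --total-executor-cores $cores --executor-memory $mem"
}

spark_shell_cmd 10 2G
```

Printing the command first is a deliberate choice: on a shared node it is cheaper to catch a wrong core or memory value before the shell claims the resources.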
The application stays connected until you exit the interactive shell. If you need help getting started with interactive analysis in Spark, take a look at the following link: http://spark.apache.org/docs/latest/quick-start.html.
Launching Spark applications
If you have a script or example you want to launch, you can use the spark-submit command. There are a few examples you can try as a test.
Python script example
The following example calculates an approximation of Pi. To test it, download the script pi.py and run the application:
spark-submit --master spark://celeste.research.partners.org:7077 \
--executor-memory 200G \
--total-executor-cores 10 \
pi.py 500
It is important to limit the memory and number of cores of your application to avoid consuming all the resources.
You can also verify whether the application is running with:
elinks http://localhost:8080
This example prints a lot of information about the process to the screen.
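One way to keep the result readable is to send the log messages to a file. The sketch below assumes Spark's default logging configuration, which writes log messages to stderr while the script's own output goes to stdout. The function fake_submit is a hypothetical stand-in for spark-submit so that the redirection can be tried without a cluster, and the log file name is also an assumption.

```shell
# Hypothetical stand-in for spark-submit: result on stdout, log noise on stderr.
fake_submit() {
    echo "INFO SparkContext: Running Spark version 1.4.1" >&2
    echo "Pi is roughly 3.14"
    echo "INFO DAGScheduler: Job 0 finished" >&2
}

log_file="${TMPDIR:-/tmp}/spark-pi.$$.log"

# Keep the result on screen and send the log messages to a file.
# With the real command the same redirect would be appended to it:
#   spark-submit ... pi.py 500 2> spark-pi.log
result="$(fake_submit 2> "$log_file")"
echo "$result"
```

The log file can still be inspected afterwards if the job fails or runs slower than expected.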
R application example
You can run R scripts, or run applications interactively with the sparkR module. A basic example that creates an R data frame is available as a test. Download dataframe.R and run the R script:
sparkR --master spark://celeste.research.partners.org:7077 \
--executor-memory 2G \
--total-executor-cores 10 \
dataframe.R
Here, too, you can limit the memory and number of cores that your application uses.
If you need more help running Spark jobs, take a look at the Apache Spark documentation: http://spark.apache.org/examples.html. You can also find many more examples in the Spark GitHub repository: https://github.com/apache/spark.