Guest post originally published by Allison Richardet, Principal Software Development Engineer, Mastercard

At Mastercard, the Internal Cloud Team maintains our on-premises Kubernetes platform. Our work includes maintaining Kubernetes clusters, maintaining the core deployments we provide to our tenants – such as logging and monitoring – and giving our tenants a great experience.

One of our tenants, the Data Warehouse Team, has historically utilized Native Apache Spark on YARN, with HDFS. They approached our team with the need to move their big data workloads to Kubernetes; they wanted to go cloud native, and we had an exciting opportunity to work with Apache Spark on Kubernetes.

So, our journey began with the Spark Operator. A move to Kubernetes and Operators would open cloud native possibilities for our internal customer, the Data Warehouse Team. We had the opportunity to help them take advantage of scalability and cost improvements, and a switch to S3 would further those goals.

Background

What is an operator and why should we, or you for that matter, be interested? First, an operator extends the Kubernetes API with custom resources. An operator also defines a custom controller to monitor its resource types. Combining custom resources with a custom controller yields a declarative API where the operator reconciles differences between the declared and actual state of the cluster. In other words, an operator handles automation with respect to its resources.

With all these benefits, our team was excited to utilize the Spark Operator for Kubernetes to support our tenant. Commonly, Native Apache Spark utilizes HDFS. However, when moving to the cloud and running the Spark Operator on Kubernetes, S3 is a nice alternative to HDFS due to its cost benefits and ability to scale as needed. Interestingly enough, S3 is not available by default with the Spark Operator. We referenced the Spark Operator documentation as well as the Hadoop-AWS integration documentation. Below, we share details on the following four steps: Image Updates, SparkApplication Configuration, S3 Credentials, and S3 Flavor. Follow along with our steps to utilize S3 with your Spark jobs and the Spark Operator for Kubernetes.

Our Workflow

As with most of the applications we deploy to Kubernetes clusters, we utilize Helm charts. The Helm chart for the Kubernetes Operator for Apache Spark can be found in the spark-on-k8s-operator repository (see Resources below).

Values & Helm Template

We update the values.yaml according to our needs, and then run helm template to generate the manifests we will deploy to Kubernetes clusters. We find that having visibility into what will be created, and control over the deployment, is worth the extra step; the templates are stored in Git, and our CD tool takes care of the deployment.

The default chart values will allow you to get up and running quickly. Depending on your needs, the following are a few modifications you may want to make:

  • Enable webhook: The Mutating Admission Webhook is disabled by default. Enabling it allows customization of the SparkApplication driver and executor pods, including mounting volumes, ConfigMaps, affinity/anti-affinity, and more.
  • Define ingressUrlFormat: an optional ingress for the Spark UI.

See the Quick Start Guide and default values.yaml for additional details and options.
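As a sketch, the two modifications above might look like the following in values.yaml (field names follow the chart's documented values and may differ between chart versions; the hostname is hypothetical):

```yaml
# values.yaml overrides for the spark-on-k8s-operator Helm chart
# (a sketch -- verify field names against your chart version)
webhook:
  enable: true   # allows customizing SparkApplication driver/executor pods

# Optional ingress for the Spark UI; {{$appName}} and {{$appNamespace}}
# are template variables substituted by the operator.
ingressUrlFormat: "{{$appName}}.spark.example.com"
```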

Requirements

Running SparkApplications that utilize S3 requires additional configuration, including a custom Docker image. The Hadoop S3A connector is the tool that makes it possible to read from or write to S3.

1. Image Updates

The Docker image used by the SparkApplication requires the addition of two jars (hadoop-aws and either aws-java-sdk or aws-java-sdk-bundle); the versions to use vary based on the Spark version and Hadoop profile.

There are a few things to keep in mind during this step.

  1. User & Permissions
  2. Additional Jars

If using the Spark images as a starting point, reference their respective Dockerfiles to properly align with the user and jar location when adding the jars.

Let’s take a look at the Python Dockerfile. The user is set to root before any install tasks are carried out, and then reset to ${spark_uid}.

By inspecting the base image, you can see the jars are located in /opt/spark/jars or $SPARK_HOME/jars. Finally, update the permissions for the jars so they can be utilized.

The docs for Uploading to S3 provide information on the jars to use; however, we needed a newer Hadoop version, which included the fs.s3a.path.style.access configuration option – we will discuss this in a later section. At the time of writing, we were using Spark Operator version v1beta2-1.2.0-3.0.0, with base Spark version 3.0.0. Using the gcr.io/spark-operator/spark-py:v3.0.0-hadoop3 image as a starting point, we added the jars hadoop-aws-3.1.0.jar and aws-java-sdk-bundle-1.11.271.jar. It took a bit of experimentation to determine the correct combination for the final working image.
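A sketch of the image changes described above (the base image and jar versions are the ones that worked for us; verify the combination and Maven paths for your Spark version and Hadoop profile):

```dockerfile
# A sketch -- based on the image and jar versions from our setup
FROM gcr.io/spark-operator/spark-py:v3.0.0-hadoop3

# Switch to root for install tasks, mirroring the upstream Dockerfiles
USER root

ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.0/hadoop-aws-3.1.0.jar $SPARK_HOME/jars/
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar $SPARK_HOME/jars/

# Update permissions so the jars can be read by the non-root Spark user
RUN chmod 644 $SPARK_HOME/jars/hadoop-aws-3.1.0.jar \
    $SPARK_HOME/jars/aws-java-sdk-bundle-1.11.271.jar

# Reset to the non-root user defined by the base image
USER ${spark_uid}
```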

2. SparkApplication Configuration

The SparkApplication requires additional configuration to communicate with S3. The following is the minimum configuration required in the spec.sparkConf section:

sparkConf:
  spark.hadoop.fs.s3a.endpoint: <endpoint>
  spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem

Credentials to access S3 must also be provided. They could be set via configuration options similar to the above; however, this is highly discouraged, as the credentials would sit in plain-text string values, which goes against security best practices.

3. S3 Credentials

In lieu of providing the s3 credentials in the SparkApplication’s sparkConf, we create a Kubernetes secret and define environment variables for the driver and executor(s). The Spark Operator documentation provides several options for utilizing a secret as well as thorough examples for either Mounting Secrets or Specifying Environment Variables.
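As a sketch of the environment-variable approach, assuming a secret named s3-credentials (the secret name and key values here are hypothetical; envSecretKeyRefs is the operator's field for sourcing env vars from secret keys):

```yaml
# Hypothetical secret holding the S3 credentials
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: <access-key>
  AWS_SECRET_ACCESS_KEY: <secret-key>
---
# In the SparkApplication spec, expose the secret keys as env vars
# on both the driver and the executor(s)
spec:
  driver:
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: s3-credentials
        key: AWS_ACCESS_KEY_ID
      AWS_SECRET_ACCESS_KEY:
        name: s3-credentials
        key: AWS_SECRET_ACCESS_KEY
  executor:
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: s3-credentials
        key: AWS_ACCESS_KEY_ID
      AWS_SECRET_ACCESS_KEY:
        name: s3-credentials
        key: AWS_SECRET_ACCESS_KEY
```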

Next, since we are utilizing environment variables to authenticate to S3, we set the following option in the sparkConf:

sparkConf:
  spark.hadoop.fs.s3a.aws.credentials.provider: com.amazonaws.auth.EnvironmentVariableCredentialsProvider

This setting is not required; if it is not provided, the credentials provider classes are attempted in the following order:

  1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
  2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider
  3. com.amazonaws.auth.InstanceProfileCredentialsProvider

4. S3 Flavor

A few other options in the SparkApplication’s sparkConf to keep in mind based on your particular S3 are the following:

sparkConf:
  spark.driver.extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
  spark.executor.extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
  spark.hadoop.fs.s3a.path.style.access: "true"
  spark.hadoop.fs.s3a.connection.ssl.enabled: "true"

Path Style Access – enabling path style access disables virtual hosting (which is enabled by default). This removes the requirement to set up DNS for the default virtual hosting.

Enabling SSL – if you are utilizing TLS/SSL, be sure to enable this option in the sparkConf of the SparkApplication.

Extra Java Options – these vary based on your needs.

Using S3

Now that you have everything setup to allow you to use S3, you have two options: utilize S3 for dependencies or upload to S3.

Dependencies & S3

The mainApplicationFile and any additional dependencies used by a Spark job, including files or jars, may also be stored in and obtained from S3. They can be defined in the SparkApplication in the spec.deps field along with other dependency options. Jars or files specified in spec.deps.jars or spec.deps.files are passed to spark-submit as --jars and --files, respectively. The format for accessing a dependency in S3 is s3a://bucket/path/to/file.
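A sketch of a SparkApplication spec pulling the application file and its dependencies from S3 (the bucket and paths are hypothetical):

```yaml
spec:
  # Application file fetched from S3 via the S3A connector
  mainApplicationFile: s3a://my-bucket/jobs/my-job.py
  deps:
    jars:                # passed to spark-submit as --jars
      - s3a://my-bucket/jars/some-dependency.jar
    files:               # passed to spark-submit as --files
      - s3a://my-bucket/files/config.json
```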

Upload to S3

When uploading to S3, the format for the file location is s3a://bucket/path/to/destination. The bucket must exist or the upload will fail. If the destination file(s) already exist, the upload will also fail.

Conclusion

We touched on the four steps required to get up and running with the Spark Operator and S3: image updates, required options in the SparkApplication’s sparkConf, S3 credentials, and additional options based on your particular S3. Finally, we gave some pointers on how to utilize S3 for dependencies and how to upload to S3.

In the end, we helped our internal customer, the Data Warehouse Team, move their big data workloads from Native Apache Spark to Kubernetes. The Spark Operator on Kubernetes has great cloud native benefits, and we wanted to share our experiences with the greater community. We hope this walkthrough of the Spark Operator and S3 integration will help you and/or your team get up and running with the Spark Operator and S3.

Resources

spark-on-k8s-operator repo

spark-on-k8s-operator Helm chart

Hadoop-AWS Module docs