Improving data locality for analytics jobs on Kubernetes using Alluxio

CNCF Member Online program
Presented by: Alluxio

Recorded: Tuesday January 21, 2020


Program Speakers: Gene Pang, PMC Maintainer @Alluxio, and Adit Madan, Software Engineer @Alluxio

In the on-prem days, one key performance optimization for Apache Hadoop or Apache Spark workloads was to run tasks on nodes that held the HDFS data locally. However, while adoption of the cloud and Kubernetes makes scaling compute workloads exceptionally easy, a co-located HDFS is often not an option. Accessing data efficiently from cloud-native storage services like AWS S3, or even from on-premises HDFS, becomes harder because data locality is lost.

Originating at UC Berkeley's AMPLab, the open source project Alluxio takes a new approach to this problem: it moves data closer to compute workloads efficiently and on demand, and it unifies data across multiple or remote clouds, among other capabilities. This webinar describes the concepts and internal mechanisms that let a Spark + Alluxio stack on Kubernetes improve data locality even when the storage service is external or remote.

In particular, we will go over:

  • Why Spark can make locality-aware scheduling decisions when working with Alluxio in a Kubernetes environment that uses the host network
  • Why a pod running Alluxio can share data efficiently with a pod running Spark on the same host using a domain socket and a host path volume
  • The Alluxio roadmap for further improving analytics jobs such as Spark and Presto, including the ongoing closer integration with Presto
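
To give a feel for the first point, here is a minimal Python sketch of host-based locality matching. It is not Spark's or Alluxio's actual scheduler code. It assumes that a scheduler places a task by comparing the host names reported for a data block with the host names reported by the executors, and that with host networking both pods report the node's address, while otherwise each pod reports its own pod IP. The function name `pick_local_executor`, the node names, and the IP addresses are all hypothetical.

```python
from typing import Dict, List, Optional


def pick_local_executor(block_hosts: List[str],
                        executors: Dict[str, str]) -> Optional[str]:
    """Return the id of an executor whose reported host also holds the block.

    Hypothetical helper for illustration only: it returns None when no
    executor is co-located with the data, i.e. when locality is lost.
    """
    wanted = set(block_hosts)
    for executor_id, host in executors.items():
        if host in wanted:
            return executor_id
    return None


# Assumed case 1: host network. The cache worker and the executors all
# report the address of the node they run on, so the names can match.
host_net_block_hosts = ["node-b"]
host_net_executors = {"exec-1": "node-a", "exec-2": "node-b"}

# Assumed case 2: pod network. Every pod reports its own pod IP, so the
# worker and the executor on the same node never share an address.
pod_net_block_hosts = ["10.1.1.9"]
pod_net_executors = {"exec-1": "10.1.0.7", "exec-2": "10.1.1.4"}

local_with_host_net = pick_local_executor(host_net_block_hosts, host_net_executors)
local_with_pod_net = pick_local_executor(pod_net_block_hosts, pod_net_executors)
```

In this sketch the host-network case finds a co-located executor, while the pod-network case finds none even though the pods may sit on the same physical node.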