Day 3 - June 7th

Track 1 (Olympic Room)

09:00 - 10:00

Use Spark from Anywhere: A Spark Client in Scala Powered by Spark Connect

Over the past decade, developers, researchers, and the community at large have built tens of thousands of data applications using Spark. In that time, the use cases and requirements of data applications have evolved: today, every application wants to leverage the power of data, from web services running in application servers, to interactive environments such as notebooks and IDEs, to phones and edge devices such as smart home devices.

However, Spark’s driver architecture is monolithic, running client applications on top of a scheduler, optimizer and analyzer. This architecture makes it hard to address these new requirements: there is no built-in capability to remotely connect to a Spark cluster from languages other than SQL.

Spark Connect introduces a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters, using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be used from anywhere: it can be embedded in modern data applications, IDEs, notebooks, and programming languages.
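To make the idea of "unresolved logical plans as the protocol" concrete, here is a minimal toy model in plain Scala. It is only a sketch under stated assumptions: `Plan`, `Read`, `Filter`, `Project`, `Server`, `DataFrame`, and `table` are hypothetical names invented for illustration, and none of them are Spark Connect's actual messages or API. The client only assembles a plan as plain data, with table and column names left as unresolved strings; the server owns the catalog, resolves those names, and executes the plan.

```scala
// Toy model only: every type here is hypothetical and is NOT part of Spark Connect.
object ConnectSketch {
  // A plan is pure data. Table and column names are unresolved strings,
  // so the client needs no catalog, optimizer, or scheduler to build one.
  sealed trait Plan
  final case class Read(table: String) extends Plan
  final case class Filter(child: Plan, column: String, min: Int) extends Plan
  final case class Project(child: Plan, columns: List[String]) extends Plan

  type Row = Map[String, Int]

  // Server side: owns the catalog, resolves names, and executes the plan.
  final class Server(catalog: Map[String, List[Row]]) {
    def execute(plan: Plan): Either[String, List[Row]] = plan match {
      case Read(table) =>
        catalog.get(table).toRight(s"table not found: $table")
      case Filter(child, column, min) =>
        execute(child).flatMap { rows =>
          if (rows.forall(_.contains(column))) Right(rows.filter(r => r(column) >= min))
          else Left(s"column not found: $column")
        }
      case Project(child, columns) =>
        execute(child).flatMap { rows =>
          columns.find(c => rows.exists(r => !r.contains(c))) match {
            case Some(missing) => Left(s"column not found: $missing")
            case None =>
              Right(rows.map(r => r.filter { case (name, _) => columns.contains(name) }))
          }
        }
    }
  }

  // Client side: a thin DataFrame-like wrapper that only grows the plan tree.
  final case class DataFrame(plan: Plan) {
    def filter(column: String, min: Int): DataFrame = DataFrame(Filter(plan, column, min))
    def select(columns: String*): DataFrame = DataFrame(Project(plan, columns.toList))
  }

  def table(name: String): DataFrame = DataFrame(Read(name))

  def main(args: Array[String]): Unit = {
    val server = new Server(
      Map("users" -> List(Map("id" -> 1, "age" -> 17), Map("id" -> 2, "age" -> 42)))
    )
    // Building the plan touches no data; only execute() does any work.
    val adults = table("users").filter("age", 18).select("id")
    println(server.execute(adults.plan))        // Right(List(Map(id -> 2)))
    println(server.execute(table("nope").plan)) // Left(table not found: nope)
  }
}
```

In this sketch, name-resolution errors surface only when the server executes the plan, which is the practical consequence of sending unresolved plans. In a real deployment the plan would be serialized and sent over the network rather than passed in-process.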

This talk highlights how simple it is to connect to Spark using Spark Connect from any data application or IDE. We showcase how we built the Scala client for Spark Connect, which integrates with the server and attempts to solve one of Spark's hardest problems: isolating the client application from the server. We will do a deep dive into the architecture of Spark Connect and give an outlook on how the community can participate in extending Spark Connect to new programming languages and frameworks, to bring the power of Spark everywhere.

Denny Lee

Databricks

Denny is a Databricks Developer Advocate and a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale data platforms and predictive analytics systems.

He also holds a Master's in Biomedical Informatics from Oregon Health & Science University and has implemented powerful data solutions for enterprise healthcare customers. His current technical focuses include distributed systems, Apache Spark, deep learning, machine learning, and genomics.

Ginger Holt

Databricks

Ginger Holt is a Senior Staff Data Scientist at Databricks. She develops forecasting and predictive models for revenue, sales, capacity, and other business planning needs. She obtained her Ph.D. in Statistics from Rice University, where her research was in multivariate time series analysis. She was an Assistant Professor of Systems and Information Engineering at the University of Virginia from 2005 to 2009, where she conducted research on multivariate time series methodology.

She has researched prediction methodologies in many different applications including computational finance, retail, capacity planning, environmental networks, cyber security, ecology, and econometrics. She has also worked at Meta, Walmart Labs, and HP, forecasting capacity needs and demand, and at BP as a quantitative analyst developing forecasting methodologies used in technical trading strategies.
