Apache Spark is the engine powering many data-driven use cases, from data engineering to data science and machine learning applications. At QuantumBlack, Spark is considered a key technology and used in a number of client engagements, from a Data Engineering, Data Science and Platform Engineering point of view. This talk will be around the lessons learned after running successfully Apache Spark workloads in production in the cloud for a number of years. As public cloud adoption grows in the enterprise, more and more organizations are choosing to run Apache Spark workloads on cloud infrastructure. While the cloud presents many benefits, there are a number of challenges that aren’t obvious until you start and require sometimes different approaches or thinking.
This talk will look into a few different areas, starting with the Jigsaw pieces you face with Open Source software, balancing a platform for stability along with allowing innovation. The talk will then look at approaches used to combat the not so obvious challenges and trade-offs of using cloud scalable storage backends for storing/retrieving data. Finally, there’ll be a section on the considerations needed for reliability and manageability of robust analytic pipelines.