Apache Spark is the next-generation distributed computing framework, rapidly becoming the de facto standard for big data analytics. It provides rich, expressive APIs in multiple languages, including Scala, Java, Python, and R. However, depending on the use case (a data scientist working in a Jupyter notebook or a data engineer implementing long-running spark-submit jobs), choosing the right language can be a dilemma. This session uses a Spark application that performs sentiment analysis of Twitter data to compare and contrast the languages' feature differences, API coverage, and overall productivity. With concrete examples, it provides insight to help you decide when to use Scala, Java, Python, or perhaps a mix of these.
JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or All of Them?
David Taieb
STSM - IBM Cloud Data Services
Developer advocate
david_taieb@us.ibm.com
My boss wants me to work on Apache Spark: should I use Scala, Java, or Python, or a combination of the three?
JavaOne 2016, San Francisco
One thing to note here is how recent all of this attention is; keep in mind that Spark was only created in 2009 and open-sourced in 2010.
RDDs track lineage information so that lost data can be rebuilt by replaying the DAG (directed acyclic graph) of transformations.
If data node A fails, the Spark engine has all the information about which input splits or parent RDDs were transformed to compute the RDD partitions on node A.
Taking into account that catastrophic failures like that are rare, we can see why this approach enhances performance: Spark avoids replicating intermediate data across nodes and instead recomputes it only when needed.
The main shortcoming of this approach is the amount of recomputation required when such a failure eventually occurs.
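The notes above describe lineage only in prose; the following is a minimal Scala sketch (assuming a local SparkContext and a hypothetical tweets.txt input file) showing how each transformation extends an RDD's lineage, which Spark replays through the DAG to rebuild a lost partition:

import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LineageDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Each transformation is only recorded in the lineage; nothing runs yet.
    val tweets = sc.textFile("tweets.txt")        // hypothetical input file
    val words  = tweets.flatMap(_.split("\\s+"))
    val pairs  = words.map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)

    // toDebugString prints the lineage (the DAG) that Spark would replay
    // to recompute any lost partition of `counts`.
    println(counts.toDebugString)

    // An action finally triggers execution of the whole DAG.
    counts.take(5).foreach(println)

    sc.stop()
  }
}

When a lineage grows long, calling rdd.checkpoint() (after sc.setCheckpointDir(...)) persists the data and truncates the chain, trading storage for less recomputation after a failure.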