Leveraging the Spark-HPCC Ecosystem


  1. 2019 HPCC Systems® Community Day
     Challenge Yourself – Challenge the Status Quo
     James McMullan, Sr. Software Engineer, LexisNexis Risk Solutions
     Leveraging the Spark-HPCC Systems Ecosystem
  2. Overview
     • Spark-HPCC Plugin & Connector
     • Basics of reading from / writing to HPCC Systems
     • Brief introduction to Apache Zeppelin
     • Create a random forest model in Spark
     • Compare to the Kaggle competition leaderboard
     • Future of the Spark-HPCC Systems ecosystem
     • Closing thoughts
  3. Spark-HPCC Systems Connector
  4. Spark-HPCC Systems – Overview
     • Spark-HPCC Systems Connector
       • A Spark library
       • Allows reading from and writing to HPCC Systems
       • Can be installed on any Spark cluster (see the setup sketch below)
     • Spark Plugin – Managed Spark Cluster
       • Requires HPCC Systems 7.0+
       • Spark cluster that mirrors the Thor cluster
       • Configured through Config Manager
       • Installs the Spark-HPCC Systems connector
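     A minimal session-setup sketch for using the connector on a Spark
     cluster without the managed plugin, assuming the connector jar has
     already been downloaded; the jar path below is a placeholder, not a
     real artifact name:

        from pyspark.sql import SparkSession

        # Put the Spark-HPCC Systems connector on the classpath.
        # The path is a placeholder; use the connector build that
        # matches your Spark and HPCC Systems versions.
        spark = SparkSession.builder \
            .appName("spark-hpcc-example") \
            .config("spark.jars", "/path/to/spark-hpcc-connector.jar") \
            .getOrCreate()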
  5. Spark-HPCC Systems Connector – Progress
     • Added support for remote writing (HPCC Systems 7.2+)
     • Improved performance in Scala, Python, and R
     • Increased reliability through extensive testing and bug fixes
     • Added support for DataSource API v1, giving a unified read/write interface
  6. Spark-HPCC Systems Connector – Reading

     PySpark Read Example:

        clusterURL = "http://192.168.56.101:8010"
        fileName = "example::dataset"

        # Read dataset from HPCC Systems
        df = spark.read.load(format="hpcc", host=clusterURL,
                             password="", username="",
                             limitPerFilePart=100,
                             projectList="field1, field2",
                             fileAccessTimeout=240,
                             path=fileName)

     SparkR Read Example:

        clusterURL <- "http://192.168.56.101:8010"
        fileName <- "example::dataset"

        # Read dataset from HPCC Systems
        df <- read.df(source = "hpcc", host = clusterURL,
                      password = "", username = "",
                      limitPerFilePart = 100,
                      projectList = "field1, field2",
                      fileAccessTimeout = 240,
                      path = fileName)
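     The read above returns an ordinary Spark DataFrame, so standard
     DataFrame operations apply directly; a minimal sketch, using the
     field names from the slide's projectList:

        # Inspect the record layout as Spark sees it
        df.printSchema()

        # Standard DataFrame operations work on the result
        df.filter(df.field1.isNotNull()) \
          .groupBy("field2") \
          .count() \
          .show()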
  7. Spark-HPCC Systems Connector – Writing

     PySpark Write Example:

        clusterURL = "http://192.168.56.101:8010"
        fileName = "example::dataset"

        # Write dataset to HPCC Systems
        df.write.save(format="hpcc", mode="overwrite",
                      host=clusterURL,
                      password="", username="",
                      cluster="mythor",
                      path=fileName)

     SparkR Write Example:

        clusterURL <- "http://192.168.56.101:8010"
        fileName <- "example::dataset"

        # Write dataset to HPCC Systems
        write.df(df, source = "hpcc", host = clusterURL,
                 cluster = "mythor", path = fileName,
                 mode = "overwrite",
                 password = "", username = "",
                 fileAccessTimeout = 240)
  8. Apache Zeppelin
  9. Apache Zeppelin – Overview
     • Multi-user notebook environment
     • Front end for Spark
     • Collaborative and easy to use
     • Handles resource management, job queuing, and resource allocation
     • We do not support or package Zeppelin
  10. Apache Zeppelin – Features
      • Multi-user environment by default
      • Version control
      • Interpreters are bound at the paragraph level, allowing multiple
        languages in a single notebook (see the sketch below)
      • Built-in visualization tools
      • Ability to move data between languages
      • Credential management
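      A minimal sketch of paragraph-level interpreter binding, assuming a
      DataFrame df already read from HPCC Systems; the column names are
      illustrative, borrowed from the bulldozer dataset used later. A
      Python paragraph registers a temp view, and a SQL paragraph in the
      same notebook queries it:

         %pyspark
         # Python paragraph: expose the DataFrame to other interpreters
         df.createOrReplaceTempView("bulldozers")

         %sql
         -- SQL paragraph in the same notebook; the result can be
         -- rendered with Zeppelin's built-in visualization tools
         SELECT YearMade, avg(SalePrice) AS avg_price
         FROM bulldozers
         GROUP BY YearMade
         ORDER BY YearMade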
  11. Spark ML Model
  12. Spark ML Model – Brief Intro to Random Forests
      • A random forest is an ensemble of decision trees
      • Averaging the output of multiple decision trees gives a better
        prediction than any single tree
      • Random forests require the input data to be numeric (see the
        encoding sketch below)
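      Because Spark ML tree learners only accept numeric feature vectors,
      categorical columns have to be encoded first. A minimal PySpark
      sketch; the column names ("state", "YearMade") are illustrative
      placeholders, not the notebook's actual pipeline:

         from pyspark.ml.feature import StringIndexer, VectorAssembler

         # Index-encode a categorical string column into numeric indices
         indexer = StringIndexer(inputCol="state", outputCol="state_idx",
                                 handleInvalid="keep")
         indexed = indexer.fit(df).transform(df)

         # Assemble numeric columns into the single vector column that
         # Spark ML estimators expect
         assembler = VectorAssembler(inputCols=["state_idx", "YearMade"],
                                     outputCol="features")
         prepared = assembler.transform(indexed)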
  13. Spark ML Model – Bulldozers R US
      • Open-source bulldozer auction dataset from Kaggle:
        www.kaggle.com/c/bluebook-for-bulldozers
      • Create a random forest model to predict auction price
      • Compare our model against the Kaggle leaderboard
      • Scores are calculated with RMSLE (Root Mean Squared Log Error)
      • RMSLE penalizes relative error, so it behaves like a
        percentage-based metric (see the sketch below)
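      RMSLE is the root mean squared error of the log-transformed values,
      which is why it measures relative rather than absolute error; a
      small plain-Python sketch:

         from math import log, sqrt

         def rmsle(predictions, actuals):
             """Root Mean Squared Log Error."""
             sq_log_errs = [(log(1.0 + p) - log(1.0 + a)) ** 2
                            for p, a in zip(predictions, actuals)]
             return sqrt(sum(sq_log_errs) / len(sq_log_errs))

         # A prediction ~10% high contributes roughly the same error at
         # any price scale, which is what makes the metric feel
         # percentage-based
         print(rmsle([110000.0], [100000.0]))  # ~0.095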
  14. (Image-only slide; no text content beyond the deck footer)
  15. Spark ML Model – Results
      • Our RMSLE: ~0.26
      • Around 50th out of 450 participants
      • Not bad for little to no feature engineering
      • An RMSLE of ~0.22 is possible with random forests through
        hyper-parameter tuning and feature engineering (see the tuning
        sketch below)
      • Deep learning can do better than ~0.22
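      A sketch of the hyper-parameter tuning path mentioned above,
      assuming a "prepared" DataFrame with a "features" vector and a
      "label" column holding log(SalePrice), so that RMSE on the label
      approximates RMSLE; the grid values are illustrative, not the
      settings behind the ~0.26 result:

         from pyspark.ml.regression import RandomForestRegressor
         from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
         from pyspark.ml.evaluation import RegressionEvaluator

         rf = RandomForestRegressor(featuresCol="features",
                                    labelCol="label")

         # Illustrative search grid; wider grids trade compute for
         # accuracy
         grid = (ParamGridBuilder()
                 .addGrid(rf.numTrees, [50, 100])
                 .addGrid(rf.maxDepth, [10, 15])
                 .build())

         cv = CrossValidator(estimator=rf,
                             estimatorParamMaps=grid,
                             evaluator=RegressionEvaluator(
                                 metricName="rmse", labelCol="label"),
                             numFolds=3)
         model = cv.fit(prepared)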
  16. Spark-HPCC Systems – Future & Future Use Cases
      • Continued support and improvement
      • Leveraging libraries in Spark, Python, and R
        • Optimus – data cleaning for Spark
        • Matplotlib
      • Spark Streaming
        • IoT events
        • Telematics
      • Deep learning with Spark
        • Possible now through external libraries
        • Spark 3.0 will support TensorFlow natively
  17. Closing Thoughts
      • The Spark-HPCC Systems ecosystem provides new opportunities
      • Access to an entire ecosystem of libraries and tools
      • Apache Zeppelin is great
      • Machine learning and deep learning are accessible
        • The FastAI MOOC is a great way to learn
      • Everyone should learn ML & deep learning
  18. Questions?
      Spark Plugin: https://hpccsystems.com/download
      Spark-HPCC Systems Connector: https://github.com/hpcc-systems/Spark-HPCC
      Bulldozer Model Notebook: https://github.com/hpcc-systems/
      FastAI: https://fast.ai
  19. View this presentation on YouTube:
      https://www.youtube.com/watch?v=AQF9XP-Hd74&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=4&t=0s (4:55:00)

Editor's notes

  • We had a problem: we needed a front-end interface for Spark.
    Using the command line to submit jobs to a cluster is not a good workflow for data science.
    This is a solved problem: notebook environments like Jupyter Notebooks and Apache Zeppelin were created to address it.
    Internally we evaluated both Jupyter Notebooks and Apache Zeppelin, and found that Apache Zeppelin met our needs better.
    We have been testing Apache Zeppelin with Spark since February.
    We have also contributed some code to mainline Zeppelin to meet our needs.
    We aren't packaging Zeppelin alongside the Spark-HPCC environment.
    The reason I am discussing Zeppelin is that I will be using it during the demo portion of the talk, and I wanted to give some background.
  • Show Spark demo
