Big Data Processing with Spark and .NET - Microsoft Ignite 2019

Learn how to harness the best of Spark and .NET for your data processing pipelines with the newest support for Spark in .NET and Azure Synapse.

Published in: Data & Analytics

  1. Apache Spark is an OSS fast analytics engine for big data and machine learning
     • Improves efficiency through:
       • General computation graphs beyond map/reduce
       • In-memory computing primitives
     • Allows developers to scale out their user code and write in their language of choice:
       • Rich APIs in Java, Scala, Python, R, SparkSQL, etc.
       • Batch processing, streaming, and interactive shell
     • Available on Azure via Azure Synapse, Azure Databricks, Azure HDInsight, and IaaS/Kubernetes
  2. .NET Developers 💖 Apache Spark…
     • A lot of big data-usable business logic (millions of lines of code) is written in .NET!
     • Expensive and difficult to translate into Python/Scala/Java!
     • Locked out from big data processing due to lack of .NET support in OSS big data solutions
     • In a recently conducted .NET developer survey (> 1000 developers), more than 70% expressed interest in Apache Spark!
     • Would like to tap into the OSS ecosystem for: code libraries, support, hiring
  3. Goal: .NET for Apache Spark is aimed at providing .NET developers a first-class experience when working with Apache Spark.
     Non-goal: Converting existing Scala/Python/Java Spark developers.
  4. We are developing it in the open!
     Contributions to foundational OSS projects:
     • Apache Spark Core: SPARK-28271, SPARK-28278, SPARK-28283, SPARK-28282, SPARK-28284, SPARK-28319, SPARK-28238, SPARK-28856, SPARK-28970, SPARK-29279, SPARK-29373
     • Apache Arrow: ARROW-4997, ARROW-5019, ARROW-4839, ARROW-4502, ARROW-4737, ARROW-4543, ARROW-4435, ARROW-4503, ARROW-4717, ARROW-4337, ARROW-5887, ARROW-5908, ARROW-6314, ARROW-6682
     • Pyrolite (pickling library): Improve pickling/unpickling performance, Add a Strong Name to Pyrolite, Improve Pickling Performance, Hash set handling, Improve unpickling performance
     .NET for Apache Spark is open source:
     • Website: https://dot.net/spark
     • GitHub: https://github.com/dotnet/spark
     • Version 0.6 released Oct 2019
     Spark project improvement proposals:
     • Interop support for Spark language extensions: SPARK-26257
     • .NET bindings for Apache Spark: SPARK-27006
  5. .NET provides full-spectrum Spark support
     • Spark DataFrames with SparkSQL: works with Spark v2.3.x/v2.4.x and includes ~300 SparkSQL functions
       • Grouped Map
       • Delta Lake
     • .NET Spark UDFs
     • Batch & streaming: including Spark Structured Streaming and all Spark-supported data sources
     • .NET Standard 2.0: works with .NET Framework v4.6.1+ and .NET Core v2.1/v3.x and includes C#/F# support
     • Data science: including access to ML.NET
       • Interactive notebook with C# REPL
     • Speed & productivity: performance-optimized interop, as fast or faster than PySpark; support for HW vectorization
     https://github.com/dotnet/spark/examples
  6. .NET for Apache Spark 0.6
     • DataStreamWriter.PartitionBy()
     • RelationalGroupedDataset.Mean(), Max(), Avg(), Min(), Agg(), Count()
     • SparkSession.*Session(), Range(), Conf()
     • UDF with Row as a parameter
     • Delta Lake’s DeltaTable
     • SparkSession.Catalog
     • UDF with Array.Map as a return type
     • UDF debugging
     • Vector & GroupedMap UDF
     • spark.yarn.archives support
     • Compatibility check for Microsoft.Spark.Worker
     • AssemblyLoader enhancement for loading UDFs
     • Resolver signer fix
     • Arrow & Pickling perf improvement
     • Arcade build infrastructure
     • TPC-H update with Arrow
     • DataStreamWriter.Trigger
     • ComplexTypes.MapType
     • Support for Spark 2.3.*, Spark 2.4.[1/2/4]
     • Worker binaries for macOS
     • UDF with dependent types
     • DataFrameReader.Load()
     • Source link for NuGet package
     • SparkFile
  7. Language comparison: TPC-H Query 2

     Scala:

         val europe = region.filter($"r_name" === "EUROPE")
           .join(nation, $"r_regionkey" === nation("n_regionkey"))
           .join(supplier, $"n_nationkey" === supplier("s_nationkey"))
           .join(partsupp, supplier("s_suppkey") === partsupp("ps_suppkey"))

         val brass = part.filter(part("p_size") === 15 && part("p_type").endsWith("BRASS"))
           .join(europe, europe("ps_partkey") === $"p_partkey")

         val minCost = brass.groupBy(brass("ps_partkey"))
           .agg(min("ps_supplycost").as("min"))

         brass.join(minCost, brass("ps_partkey") === minCost("ps_partkey"))
           .filter(brass("ps_supplycost") === minCost("min"))
           .select("s_acctbal", "s_name", "n_name", "p_partkey", "p_mfgr", "s_address", "s_phone", "s_comment")
           .sort($"s_acctbal".desc, $"n_name", $"s_name", $"p_partkey")
           .limit(100)
           .show()

     C#:

         var europe = region.Filter(Col("r_name") == "EUROPE")
           .Join(nation, Col("r_regionkey") == nation["n_regionkey"])
           .Join(supplier, Col("n_nationkey") == supplier["s_nationkey"])
           .Join(partsupp, supplier["s_suppkey"] == partsupp["ps_suppkey"]);

         var brass = part.Filter(part["p_size"] == 15 & part["p_type"].EndsWith("BRASS"))
           .Join(europe, europe["ps_partkey"] == Col("p_partkey"));

         var minCost = brass.GroupBy(brass["ps_partkey"])
           .Agg(Min("ps_supplycost").As("min"));

         brass.Join(minCost, brass["ps_partkey"] == minCost["ps_partkey"])
           .Filter(brass["ps_supplycost"] == minCost["min"])
           .Select("s_acctbal", "s_name", "n_name", "p_partkey", "p_mfgr", "s_address", "s_phone", "s_comment")
           .Sort(Col("s_acctbal").Desc(), Col("n_name"), Col("s_name"), Col("p_partkey"))
           .Limit(100)
           .Show();

     Similar syntax: dangerously copy/paste friendly! Differences to watch for:
     • $"col_name" vs. Col("col_name")
     • Capitalization
     • C# vs. Scala operators (e.g., == vs. ===)
  8. Demo 2: Twitter analysis in the Cloud
  9. What is happening when you write .NET Spark code?
     • .NET program → .NET for Apache Spark (DataFrame, SparkSQL) → Spark operation tree
     • Did you define a .NET UDF?
       • No: regular execution path (no .NET runtime during execution). Same speed as with Scala Spark.
       • Yes: interop between Spark and .NET. Faster than with PySpark.
  10. Works everywhere!
     • Cross platform: Windows, Ubuntu, macOS
     • Cross cloud: Azure & AWS Databricks, AWS EMR Spark, Azure HDI Spark, Azure Synapse
     • Installed out of the box: Azure HDI Spark, Azure Synapse
     • Installation docs on GitHub
  11. What’s next?
     • More programming experiences in .NET (UDAF, UDT support, multi-language UDFs)
     • Spark data connectors in .NET (e.g., Apache Kafka, Azure Blob Store, Azure Data Lake)
     • Tooling experiences (e.g., Jupyter, VS Code, Visual Studio, others?)
     • Idiomatic experiences for C# and F# (LINQ, Type Provider)
     • Out-of-box experiences (Azure Synapse, Azure HDInsight, Azure Databricks, Cosmos DB Spark, SQL 2019 BDC, …)
     Go to https://github.com/dotnet/spark and let us know what is important to you!
  12. Call to action: Engage, use & guide us!
     Useful links:
     • http://github.com/dotnet/spark
     • https://www.nuget.org/packages/Microsoft.Spark
     • https://aka.ms/GoDotNetForSpark
     • https://docs.microsoft.com/dotnet/spark
     Website:
     • https://dot.net/spark (Request a Demo!)
     Starter videos, .NET for Apache Spark 101:
     • Watch on YouTube
     • Watch on Channel 9
     Available out of the box on Azure Synapse & Azure HDInsight Spark
     Running .NET for Spark anywhere: https://aka.ms/InstallDotNetForSpark
  13. @MikeDoesBigData, mrys@microsoft.com
     #DotNetForSpark #MSIgnite #THR3110
     I will be available at the Connect with Expert and/or the Azure Synapse booth.
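Slide 5 lists Spark DataFrames and SparkSQL as core features of .NET for Apache Spark. As a rough illustration, a minimal C# program using both might look like the sketch below. It assumes the Microsoft.Spark NuGet package and a working Spark installation. The input file `people.json` and its `name` and `age` fields are hypothetical and not taken from the talk.

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace SparkSketch
{
    internal static class Program
    {
        private static void Main(string[] args)
        {
            // Hypothetical input: a JSON-lines file with "name" and "age" fields.
            string path = args.Length > 0 ? args[0] : "people.json";

            SparkSession spark = SparkSession
                .Builder()
                .AppName("dotnet-spark-dataframe-sketch")
                .GetOrCreate();

            DataFrame people = spark.Read().Json(path);

            // DataFrame API: filter and aggregate with built-in column functions.
            people.Filter(Col("age") > 21)
                  .GroupBy(Col("name"))
                  .Agg(Avg(Col("age")).Alias("avg_age"))
                  .Show();

            // SparkSQL: query the same data through a temporary view.
            people.CreateOrReplaceTempView("people");
            spark.Sql("SELECT name, age FROM people WHERE age > 21").Show();

            spark.Stop();
        }
    }
}
```

A program like this is typically launched through `spark-submit` with the Microsoft.Spark JAR rather than run directly. The installation docs linked from slide 12 describe the exact command for each platform.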
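Slide 9 distinguishes two execution paths depending on whether the program defines a .NET UDF. The sketch below puts both side by side under the same assumptions: the Microsoft.Spark NuGet package, a working Spark installation, and a hypothetical `people.json` file with a `name` field. The first query uses only a built-in function, so per the slide it should stay on the regular execution path. The second wraps a C# lambda as a UDF, which per the slide requires interop between Spark and .NET.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace SparkSketch
{
    internal static class UdfPaths
    {
        private static void Main(string[] args)
        {
            // Hypothetical input: a JSON-lines file with a "name" field.
            string path = args.Length > 0 ? args[0] : "people.json";

            SparkSession spark = SparkSession
                .Builder()
                .AppName("dotnet-spark-udf-sketch")
                .GetOrCreate();

            DataFrame people = spark.Read().Json(path);

            // Path 1: built-in functions only, no .NET UDF is defined.
            DataFrame builtIn = people.Select(Upper(Col("name")).Alias("name_upper"));
            builtIn.Explain(); // prints the physical plan for inspection
            builtIn.Show();

            // Path 2: a .NET UDF. The lambda is ordinary C# code that Spark
            // has to call back into for every row.
            Func<Column, Column> shout = Udf<string, string>(
                s => s == null ? null : s.ToUpperInvariant() + "!");
            DataFrame withUdf = people.Select(shout(Col("name")).Alias("name_shout"));
            withUdf.Explain();
            withUdf.Show();

            spark.Stop();
        }
    }
}
```

Comparing the two `Explain()` outputs is one way to see where the UDF enters the plan. The general guidance that follows from the slide is to prefer built-in functions where one exists and reserve UDFs for logic that cannot be expressed otherwise.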