Operational Analytics Using Spark and NoSQL Data Stores

NoSQL data stores have emerged for scalable capture and real-time analysis of data. Apache Spark and Hadoop provide additional scalable analytics processing. This session looks at these technologies and how they can be used to support operational analytics to improve operational effectiveness. It also looks at an example of how operational analytics can be implemented in NoSQL environments using the Basho Data Platform with Apache Spark:
• The emergence of NoSQL, Hadoop and Apache Spark
• NoSQL use cases
• The need for operational analytics
• Types of operational analysis
• Key requirements for operational analytics
• Operational analytics using the Basho Data Platform with Apache Spark

Operational Analytics Using Spark and NoSQL Data Stores

  1. Delivering Operational Analytics Using Spark and NoSQL Data Stores
     Mike Ferguson, Managing Director, Intelligent Business Strategies
     Basho Webinar, January 2016
  2. About Mike Ferguson
     Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an analyst and consultant he specializes in business intelligence, data management and enterprise business integration. With over 34 years of IT experience, Mike has consulted for dozens of companies, spoken at events all over the world and written numerous articles. Formerly he was a principal and co-founder of Codd and Date Europe Limited (the inventors of the Relational Model), a Chief Architect at Teradata on the Teradata DBMS and European Managing Director of DataBase Associates.
     www.intelligentbusiness.biz | mferguson@intelligentbusiness.biz | Twitter: @mikeferguson1 | Tel/Fax (+44)1625 520700
     Copyright © Intelligent Business Strategies 1992-2016
  3. Topics
     • The changing landscape of operational and analytical systems
     • Scalable operational applications and NoSQL data stores
     • Big data analytics: the era of Hadoop and Spark
     • The value of operational analytics
     • Operational analytics using the Basho Data Platform and Apache Spark
     • Conclusions
  4. Topics. Current section: The changing landscape of operational and analytical systems
  5. The Application Processing Spectrum
     [Figure. Source: BI-Research, Copyright © BI-Research, 2013-Present]
  6. Big Data Processing: There Is a Growing Number of Data Stores Optimized for Operational or Analytical Workloads
     [Diagram labels: OLTP RDBMS; NoSQL DBMS; Analytical RDBMS]
     NoSQL:
     • ACID support is missing in many NoSQL DBMSs
     • Can you live with losing a transaction?
     • OK for sensor data, for example
  7. A Closed Loop Is Still Needed: It Just Now Also Includes NoSQL Technologies
     [Diagram labels: operational applications; scalable operational applications (relational & NoSQL systems); scalable analytical systems (relational & NoSQL systems); data; new data; new insights]
  8. Topics: Where Are We? Current section: Scalable operational applications and NoSQL data stores
  9. Demand For Scalable Operational Systems With High Write Processing Is Driving Demand for NoSQL DBMS
     [Diagram labels: operational applications; scalable operational applications; scalable analytical systems; data; new data; new insights]
  10. Success of Big Data Analytics Depends On Being Able To Scale To Capture High Velocity, High Volume Data
     Successful big data analytics requires:
     1. Ability to scale operational systems to capture, stream and store the required transactional and non-transactional data
        - Support peak transaction rates
        - Support peak capture of non-transactional data, e.g. shopping cart data
        - Support peak data arrival rates, e.g. sensor data
        - Support peak ingestion rates
     2. Scalable big data analytics
     3. Closed-loop integration of analytical systems back into core operational transaction processing systems
        - Make prescriptive insights available to all that need them to continuously optimise operations and maximise effectiveness
  11. E-Business And Mobile Means Operational Systems Are Having To Scale To Support Masses Of Concurrent Users
     [Diagram labels: many more users; mobile devices; WWW; operational applications; transactional applications; web logs; cluster; data; partitioned data]
  12. Example Operational Applications Requiring Scalability That Are Fuelling Demand For NoSQL DBMSs
     • Web and mobile commerce
       - Shopping cart data, session storage
     • Internet of Things (IoT) and other time series applications
       - Need to scale as the number of devices / things increases
     • Mobile gaming
       - Player profile data, session storage, game performance stats
     • Healthcare
       - Store unstructured healthcare digital imaging and video data
     • Social network applications
  13. Types Of NoSQL Database And Product Examples
     NoSQL database type: product examples
     • Key value store: Aerospike, Amazon DynamoDB, Basho Riak KV, Redis, MemcacheDB, Voldemort
     • Document database: CouchDB, IBM DB2 (XML & JSON), MongoDB, IBM Cloudant, MarkLogic, Terrastore, JackRabbit, RaptorDB
     • Column family database: Cassandra, DataStax, Google BigTable, Hadoop HBase, Hypertable, HPCC, Amazon SimpleDB
     • Graph database: AllegroGraph, GraphBase, Horton, InfiniteGraph, IBM DB2, Neo4j, Oracle Spatial and Graph, Titan, Cray Research, Teradata Aster
     • Multi-modal database: ArangoDB, CortexDB, MarkLogic, MongoDB, FoundationDB
     Notes:
     • Some NoSQL databases are aimed at write processing (data collection)
     • Others are aimed at specific big data analytical workloads
     • Issues include lack of standard APIs, a weak or missing optimizer and non-immediate consistency
  14. Global NoSQL Market Size And Forecast 2013 - 2020
     [Figure. Source: https://www.alliedmarketresearch.com/NoSQL-market]
  15. Key Value Stores Can Store Any Data: Examples
     [Table: Key | Value. Extracted cells, in order: 10034, John Smith, 82771, 93441, followed by this JSON document:]
       { "firstName": "Wayne", "lastName": "Rooney", "age": 25,
         "address": { "streetAddress": "21 Sir Matt Busby Way", "city": "Manchester", "country": "England", "postalCode": "M1 6DY" },
         "phoneNumbers": [ { "type": "home", "number": "0161-123-1234" }, { "type": "mobile", "number": "07779-123234" } ] }
     Key value store features:
     • Very simple to understand
     • Very scalable: hash partitioning
     • Data access is via the key
     • The application controls what is stored in the value
     • Very fast performance
     • Acceleration via in-memory processing
     • Eventual consistency
     • Often no support for data types
     • No built-in referential integrity
     • No understanding of data relationships
     • The application must understand any relationships in data
     • Programmer is in complete control
     • Application must navigate complex data
     Use for specific operational applications.
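The features listed on slide 15 can be illustrated with a minimal Python sketch. It assumes a plain dictionary as a stand-in for the store (a real product such as Riak KV would be reached through its client library or HTTP API), and the `put`/`get` helper names are hypothetical. Pairing key 93441 with the JSON document is also an assumption, since the slide's table was flattened in extraction.

```python
import json

# A plain dict stands in for the key-value store (assumption for illustration).
store = {}

def put(key, value):
    """Store an opaque value under a unique key."""
    store[key] = value

def get(key):
    """Data access is via the key only."""
    return store[key]

# A single data field as the value.
put("10034", "John Smith")

# A JSON document as the value; the store itself does not interpret it.
profile = {
    "firstName": "Wayne",
    "lastName": "Rooney",
    "phoneNumbers": [
        {"type": "home", "number": "0161-123-1234"},
        {"type": "mobile", "number": "07779-123234"},
    ],
}
put("93441", json.dumps(profile))

# The application must know that the value is JSON and how to navigate it.
doc = json.loads(get("93441"))
mobile = next(p["number"] for p in doc["phoneNumbers"] if p["type"] == "mobile")
print(mobile)
```

The store only sees strings; parsing the value and finding the mobile number is entirely the application's job, which is the "programmer is in complete control" point on the slide.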
  16. Key Value Stores: The Key Is Hashed To Partition The Data
     [Figure. Source: Microsoft]
     The value can be anything:
     • A single data field
     • A JSON document
     • An XML document
     • Text
     • Image...
     Properties:
     • Easy to partition (hash the key)
     • Very fast to retrieve and store data
     • The key needs to be unique
     • Can use HTTP to read and write data, e.g. curl -XPUT, curl -XGET
     The application needs to know:
     • What is stored in the value
     • How the value is structured
     • How to process the value
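Hash partitioning as described on slide 16 can be sketched as follows. This is an illustration only: the partition count of 4, the choice of SHA-1 with a modulo, and the helper names are assumptions, not any product's actual scheme.

```python
import hashlib

NUM_PARTITIONS = 4  # illustrative value, not taken from the slides

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hash the key and map the digest onto one of the partitions."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions

# One dict per partition stands in for one storage node.
partitions = [dict() for _ in range(NUM_PARTITIONS)]

def put(key: str, value) -> None:
    partitions[partition_for(key)][key] = value

def get(key: str):
    # The same hash sends the read to the partition that took the write.
    return partitions[partition_for(key)].get(key)

for k in ("10034", "82771", "93441"):
    put(k, f"value-for-{k}")

print(get("82771"))
```

Because the partition is computed from the key alone, no lookup table is needed to find a value, which is why retrieval and storage stay fast as partitions are added.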
  17. Key Value Stores: A Basho Riak KV Cluster Has Virtual Nodes Running on Physical Nodes
     [Figure. Source: Basho. Labels: the key; the value; hash the key; e.g. PUT, POST, GET...]
     • SHA-1 is a hashing function that hashes a key to determine the node
     • Riak hash partitions and replicates data (3 copies of the data are the default)
     • Nodes can be added to and removed from a Riak cluster while it is running
  18. Key Value Stores: A Basho Riak KV Ring
     [Figure: writing replicas. Source: Basho]
     Riak uses partitions (64 partitions are the default) and also replicates the partitions for high availability.
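Slides 17 and 18 describe a ring of 64 partitions, SHA-1 hashing of the key, and three replicas. The sketch below shows one simplified way those pieces can fit together. It is not Riak's actual implementation: the node names, the round-robin assignment of partitions to nodes, and the rule of placing replicas on the next partitions around the ring are all assumptions made for illustration.

```python
import hashlib

RING_SIZE = 2 ** 160          # SHA-1 produces 160-bit digests
NUM_PARTITIONS = 64           # default partition count quoted on slide 18
N_VAL = 3                     # default replica count quoted on slide 17
PARTITION_WIDTH = RING_SIZE // NUM_PARTITIONS

# Hypothetical physical nodes; each ends up owning several virtual nodes.
nodes = ["node-a", "node-b", "node-c", "node-d"]

# Assumed round-robin assignment of partitions (virtual nodes) to physical nodes.
owner = {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

def preference_list(key: str) -> list:
    """Return the partitions that hold the replicas of this key."""
    h = int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")
    first = h // PARTITION_WIDTH
    # Replicas go to the next partitions around the ring, wrapping at the end.
    return [(first + i) % NUM_PARTITIONS for i in range(N_VAL)]

def replica_nodes(key: str) -> list:
    """Map the replica partitions onto the physical nodes that own them."""
    return [owner[p] for p in preference_list(key)]

print(preference_list("user:10034"), replica_nodes("user:10034"))
```

With four nodes and round-robin ownership, three consecutive partitions always land on three different physical nodes, so losing one node still leaves two copies reachable.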
  19. Topics: Where Are We? Current section: Big data analytics, the era of Hadoop and Spark
  20. Demand For Scalable Analytical Systems Is Also Exploding
     [Diagram labels: operational applications; scalable operational applications; scalable analytical systems; data; new data; new insights]
  21. A Hadoop System
     [Diagram labels: Java, Python, Scala; Pig Latin scripts; 3rd-party SQL on Hadoop; analytic application; SQL; BI tools; Storm; YARN; MapReduce; Tez; Spark; HBase; index partitions; HDFS files; webHDFS (an HTTP interface to HDFS with REST APIs); APIs to HBase; APIs to HDFS; "executes on MR, Tez & Spark"]
  22. Faster Execution Engines For Analytic Applications: Apache Spark
     [Diagram labels: Java, Python, Scala; Pig Latin scripts; 3rd-party SQL on Hadoop; analytic application; SQL; BI tools; Storm; YARN; MapReduce; Tez; Spark; HBase; index partitions; HDFS files; webHDFS (an HTTP interface to HDFS with REST APIs); APIs to HBase; APIs to HDFS]
  23. Spark Is A General Purpose In-Memory Execution Framework That Can Run With Or Without Hadoop
     [Diagram labels: HDFS files; Storm; YARN; MapReduce; Tez; Spark; HBase; webHDFS; HDFS, S3...; Tachyon]
     [Diagram: Spark stack. Applications / BI tools; Spark SQL + DataFrames; Spark Streaming; MLlib (machine learning); GraphX (graph computation); Spark Core; languages: SQL, Python, Scala, Java, R]
     • The Spark ecosystem also includes Tachyon, an HDFS-compatible in-memory file system
     • You can use Spark with or without Tachyon
     • The Spark stack is integrated, e.g. you can use Spark Streaming, Spark SQL and MLlib together in the same application
  24. Apache Spark
     [Diagram: Spark stack. Applications / BI tools; languages: SQL, Python, Scala, Java, R]
     • Spark Core: provides distributed task dispatching, scheduling, and basic I/O
     • Spark Streaming: for analysis of real-time streaming data
     • MLlib (machine learning): a library of pre-built analytic algorithms that can run in parallel across a Spark cluster
     • GraphX (graph computation): a graph analysis engine running on Spark
     • Spark SQL + DataFrames: query structured data in Spark apps using SQL or a DataFrames API
  25. Spark In-Memory Analytic Applications Can Do A Lot More Than MapReduce Processing
     [Diagram: a Spark application reading HDFS files and chaining map, map, join, filter and reduce operators. Source: AMPLab]
     • Keep only one copy in memory in a JVM
     • Track lineage of job operators used to derive the data
     • Use the lineage to re-compute the data if there is a failure
     • No MapReduce execution needed
       - Just Spark APIs
  26. Spark Applications Operate On RDDs (Data): You Can Do A Lot More Than Map and Reduce
     • RDD = Resilient Distributed Dataset
     • An RDD is a read-only, partitioned collection of records
     • RDDs can only be created through operators on either
       1. A dataset in stable storage, or
       2. Other existing RDDs
     Spark operators listed on the slide: Map, Reduce, Sample, Filter, Count, Take, Groupby, Fold, First, Sort, Reducebykey, Partitionby, Union, groupByKey, Mapwith, Join, Cogroup, Leftouterjoin, Cross, Pipe, Rightouterjoin, Zip, Save
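The RDD properties on slides 25 and 26 (read-only, derived only from stable storage or other RDDs, recomputable from lineage) can be illustrated with a toy class in plain Python. `ToyRDD` is a hypothetical teaching aid, not the Spark API: it is neither distributed nor partitioned, and it recomputes on every call instead of caching.

```python
from functools import reduce as _reduce

class ToyRDD:
    """Read-only dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, op=None, name="source"):
        self._source = source   # records held in "stable storage" (a tuple)
        self._parent = parent   # the ToyRDD this one was derived from
        self._op = op           # function that derives rows from the parent's rows
        self.name = name

    @classmethod
    def from_storage(cls, records):
        return cls(source=tuple(records))

    def map(self, fn):
        return ToyRDD(parent=self, op=lambda rows: [fn(r) for r in rows], name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda rows: [r for r in rows if pred(r)], name="filter")

    def lineage(self):
        """List the operators that lead from storage to this dataset."""
        chain = [] if self._parent is None else self._parent.lineage()
        return chain + [self.name]

    def collect(self):
        """Recompute the rows by replaying the lineage from stable storage."""
        if self._parent is None:
            return list(self._source)
        return self._op(self._parent.collect())

    def reduce(self, fn):
        return _reduce(fn, self.collect())

readings = ToyRDD.from_storage([3, 8, 15, 4])
high = readings.map(lambda x: x * 2).filter(lambda x: x > 7)
print(high.lineage(), high.collect(), high.reduce(lambda a, b: a + b))
```

Nothing is stored for `high` except the chain of operators, so if its computed rows were lost they could be rebuilt from `readings`, which is the recovery idea the slides describe.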
  27. Simplifying Access To Data Via Spark SQL and Spark DataFrames
     [Image source: Databricks.com]
     • A DataFrame is a distributed collection of data organized into named columns
     • Conceptually equivalent to a relational DBMS table or a data frame in R/Python
     • DataFrames can be constructed from a wide array of sources:
       - Structured data files
       - Hive tables
       - External databases
       - Existing RDDs
     • Uses schema on read
     Note that Spark data sources can be relational and NoSQL DBMSs.
  28. Spark Is Going Over The Top of Multiple Data Stores For Scalable In-Memory Analytics Across The Entire Ecosystem
     [Diagram labels: streaming data; streaming analytics; Hadoop data store; data warehouse RDBMS; EDW; DW & marts; advanced analytic (multi-structured data) mart; NoSQL DBMS; operational NoSQL data stores, e.g. Cassandra, Basho Riak]
     [Diagram: Spark stack]
  29. Topics: Where Are We? Current section: The value of operational analytics
  30. Key Business Drivers And Objectives For Operational Analytics
     • Combine operational and analytical processing at scale to:
       - Improve customer engagement
       - Reduce risk
       - Avoid unplanned operational cost
       - Optimise operational effectiveness
     • Use BI/Analytics to drive and guide business operations to help achieve specific business goals and KPI targets
       - Automated analysis of operational events as they happen
       - Automated alerts
       - On-demand recommendations
     • Integrate BI/Analytics into every business process to:
       - Create an 'insight driven' employee base
       - Enable mass execution of business strategy by facilitating mass contribution towards achieving specific business goals
  31. Five Types Of Operational BI/Analytics
     1. Simple operational reporting of current position/state, e.g. session state
     2. Situational awareness via visualisation of live operational data, typically on dashboards
     3. On-demand analytics of live operational and/or historical data to improve operational decisions and effectiveness
     4. On-demand recommendations for guidance
     5. Event stream processing to monitor, automatically analyse and act on events in real time to prevent problems arising and to optimise business operations
  32. Operational Analytics: What's The Difference Between On-Demand And Event-Driven Analysis?
     [Diagram labels: application; BI/analytics apps/services; BI/analytics services; streaming data]
     • On-demand: an analytical service (query, report, model, recommendation) is invoked on request
     • Event-driven: an analytical service (query, report, model, recommendation) is invoked by a message, file arrival, pattern or trigger
  33. Analytics Need To Be Integrated Into Business Processes To Optimize Business Operations
     [Diagram labels: customers; partners & suppliers; employees; customer relationship management; operations management; supply chain management; marketing; sales; service/support; operations; finance/accounting; procurement; inventory control; shipping/distribution; human resources; integrated intelligent business operations; integrated on-demand business intelligence]
  34. High Value Application Use Cases for Streaming Analytics
     [Figure: streaming analytics use cases. Source: adapted from a slide by IBM]
  35. Responding To Events And Event Patterns Means Reducing Action Time
     The time between an event occurring and action being taken should be as close to zero as possible.
     [Diagram: action distance or action time, spanning event-driven data integration, automated analysis, and automated decision and action taking. Source: Dr Richard Hackathorn]
  36. With Event Stream Processing The Architecture Has To Change
     [Diagram contrasting "Classic Use of Analytics" with "Event / Stream processing". Labels: data cleansing & integration; store data; query/analyze (human); query/analyze (automated); act (automated or human)]
  37. Time Series Analysis: Query Processing Uses a Time Window to Look at Continuously Streaming Data
     [Diagram labels: high frequency data; stream processing server; CQs; time window from T1 to T2; pattern/correlation]
     • Time window, e.g. 5 seconds, 30 seconds or 5 minutes
     • Continuous time series queries (CQs) operate on the data as it flows by
     • A set of continuous queries resides in the data stream server to process incoming data
     • Data is pushed into the queries
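The push model on slide 37 can be sketched in plain Python as a query object that receives events one at a time and emits an average when each time window closes. The `ContinuousAverage` class is hypothetical and assumes events arrive in timestamp order with integer-second timestamps; real stream processors also handle late and out-of-order data.

```python
class ContinuousAverage:
    """A continuous query: events are pushed in, one result is emitted per window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.window_start = None
        self.total = 0.0
        self.count = 0
        self.results = []  # list of (window_start, average) pairs

    def push(self, ts, value):
        # Align the event to the start of its fixed-size (tumbling) window.
        start = (ts // self.window_seconds) * self.window_seconds
        if self.window_start is None:
            self.window_start = start
        elif start != self.window_start:
            self._emit()              # the previous window has closed
            self.window_start = start
        self.total += value
        self.count += 1

    def _emit(self):
        self.results.append((self.window_start, self.total / self.count))
        self.total, self.count = 0.0, 0

    def flush(self):
        """Emit the final, still-open window (e.g. at shutdown)."""
        if self.count:
            self._emit()

cq = ContinuousAverage(window_seconds=5)
for ts, value in [(0, 10), (2, 20), (4, 30), (5, 40), (9, 60), (11, 5)]:
    cq.push(ts, value)
cq.flush()
print(cq.results)
```

The query holds only a running total and a count per open window, so memory use stays constant however much data flows by.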
  38. Key Requirements For Operational Analytics
     • On-demand, event-driven and scheduled invocation of analytics
     • Monitor streaming events as they happen via automatic analysis
     • Automatic analysis via predictive and statistical models
     • Automatic interpretation of predictive/statistical model outcomes
     • Rule-driven automatic actions to automate decision making
       - E.g. alerts, recommendations, transaction and process invocation
     • Integrate operational analytics into operational applications
     • Operational reporting
     • Scale to support large numbers of events and concurrent users
     • Store relevant data together to speed up analytics execution
     • Run predictive and statistical models close to the data
     • Run analytics on a 24x365 basis
  39. The Importance of In-Memory Processing
     • Massively parallel in-memory processing is mission critical for scalable operational systems and operational analytics
     • Why?
       - Performance is critical
       - Large number of concurrent user requests for on-demand analytics
       - Large number of concurrent application requests for on-demand analytics
       - Event-driven operational analytics on very high velocity data needs memory
  40. Topics: Where Are We? Current section: Operational analytics using the Basho Data Platform and Apache Spark
  41. The Basho Data Platform
     [Diagram. Source: Basho. Legend: Basho developed; Basho integrated]
     • Service instances: Spark, Redis (caching), Solr, Elastic Search, web services, 3rd-party web services & integrations
     • Storage instances: Riak KV (key/value), Riak S2 (object storage), Riak TS (time series), document store, columnar, graph
     • Core services: replicate & synchronize, message routing, cluster management & monitoring, logging & analytics, internal data store
     Callouts:
     • Hash partitioning, cluster scalability, triple replication, multi-datacentre replication
     • Co-locates time-series data, high availability, scalability
     • Replicates and synchronises data within and across Riak KV, Redis and Spark clusters
     • Automated cluster management simplifies administration
     • Integrated in-memory caching for faster application performance
     • Search-based query processing on Riak data using Solr indexes
     • Integrated in-memory analytics for Riak KV and Riak TS data
  42. Riak TS Is A New Basho Storage Instance Optimised for Time Series Data And Analytics
     • A distributed NoSQL database optimised for capture, aggregation and analysis of time series sequenced, unstructured data from the Internet of Things (IoT)
     • High availability
     • Scalability: add nodes to a cluster without sharding
     • Automated and uniform data distribution across the cluster
       - Time or geohash based data co-location to ensure time series data is located on the same node
     • Data validation on input
     • APIs and client libraries for Java, Ruby, Python, Go, Erlang, Node.js and .NET
     • Spark integration for operational analysis of time series data
  43. Operational Analytics Using The Basho Data Platform And Apache Spark
     [Diagram: Operational Analytics Using The Basho Data Platform. Labels: scalable operational application; operational analytics web service; operational analytic application; BI tool; data; hash partitioned data; recent data; write back; Spark Core; Spark Streaming; BlinkDB; Spark SQL; GraphX; SparkR; MLlib]
  44. Operational Analytics Using The Basho Data Platform And Apache Spark (2)
     • Can develop Spark operational analytic applications on low latency data stored in Basho Riak KV
     • Spark-based analytical web services can be invoked on demand to analyse data in Riak KV
     • Use on-demand Spark jobs for historical analysis and predictions
     • Insights produced from analysing Riak KV data can be written back to Riak KV for use by other applications
       - A form of closed-loop processing
     • Spark Streaming can be used to calculate rollups and detect abnormalities on streaming sensor data
     • Recent data can be kept in Redis for dashboard visualization
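The closed loop on slide 44 (roll up sensor data, flag abnormal readings, write the insight back, keep recent results for a dashboard) can be sketched in plain Python. Everything here is a stand-in: the `riak_kv` dict and `recent` deque only imitate Riak KV and Redis, the `analyse` function is hypothetical, and the two-standard-deviation rule is an assumed anomaly test, not something the slides specify. A production version would use Spark Streaming and the real client libraries.

```python
from collections import deque
from statistics import mean, pstdev

riak_kv = {}               # stand-in for Riak KV (assumption for illustration)
recent = deque(maxlen=3)   # stand-in for a Redis cache of the latest insights

def analyse(sensor_id, readings, threshold=2.0):
    """Roll up a batch of readings and flag values far from the batch mean."""
    avg = mean(readings)
    sd = pstdev(readings)
    anomalies = [r for r in readings if sd > 0 and abs(r - avg) > threshold * sd]
    insight = {
        "sensor": sensor_id,
        "count": len(readings),
        "mean": avg,
        "max": max(readings),
        "anomalies": anomalies,
    }
    # Closed loop: write the insight back so other applications can read it.
    riak_kv[f"insight:{sensor_id}"] = insight
    # Keep only the most recent insights for dashboard visualisation.
    recent.append(insight)
    return insight

batch = [20.0, 21.0, 19.0, 20.0, 22.0, 20.0, 21.0, 19.0, 20.0, 60.0]
result = analyse("sensor-1", batch)
print(result["anomalies"])
```

Operational applications would then read `insight:sensor-1` by key, exactly as they read any other value, which is what makes the written-back insight usable without a separate analytics interface.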
  45. Topics: Where Are We? Current section: Conclusions
  46. Conclusions
     • As operational application processing scales, so too does the need to scale operational analytics
     • Basho is using in-memory processing to accelerate operational applications (via Redis) and to introduce scalable operational analytics (via Spark) into these applications
     • New scalable 'smart' operational applications are therefore becoming possible with careful design in a NoSQL environment
  47. Thank You!
     www.intelligentbusiness.biz | mferguson@intelligentbusiness.biz | Twitter: @mikeferguson1 | Tel/Fax (+44)1625 520700
