
Reliable Data Ingestion in Big Data / IoT


Many Big Data and IoT use cases are based on combining data from multiple data sources and making it available on a Big Data platform for analysis. The data sources are often very heterogeneous, ranging from simple files and databases to high-volume event streams from sensors (IoT devices). It is important to retrieve this data in a secure and reliable manner and to integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical Big Data processing). In the past few years, new tools have emerged that are especially capable of handling this process of integrating data from the outside, often called Data Ingestion. From the outside they look very similar to traditional Enterprise Service Bus (ESB) infrastructures, which larger organizations often use to handle message-driven and service-oriented systems. But there are also important differences: they are typically easier to scale horizontally, offer a more distributed setup, are capable of handling high volumes of data/messages, provide very detailed monitoring at the message level and integrate very well with the Hadoop ecosystem. This session presents and compares Apache Flume, Apache NiFi, StreamSets and the Kafka ecosystem and shows how they handle data ingestion in a Big Data solution architecture.

Published in: Technology

Reliable Data Ingestion in Big Data / IoT

  1. 1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH Reliable Data Ingestion in Big Data/IoT Guido Schmutz @gschmutz
  2. 2. Guido Schmutz Working for Trivadis for more than 19 years Oracle ACE Director for Fusion Middleware and SOA Co-Author of different books Consultant, Trainer, Software Architect for Java, SOA & Big Data / Fast Data Member of Trivadis Architecture Board Technology Manager @ Trivadis More than 25 years of software development experience Contact: guido.schmutz@trivadis.com Blog: http://guidoschmutz.wordpress.com Slideshare: http://www.slideshare.net/gschmutz Twitter: gschmutz Reliable Data Ingestion in Big Data/IoT
  3. 3. Our company. Reliable Data Ingestion in Big Data/IoT Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services in Switzerland, Germany, Austria and Denmark. We offer our services in several strategic business fields; among them, Trivadis Services takes over the operation of your IT systems (OPERATION).
  4. 4. COPENHAGEN MUNICH LAUSANNE BERN ZURICH BRUGG GENEVA HAMBURG DÜSSELDORF FRANKFURT STUTTGART FREIBURG BASEL VIENNA With over 600 specialists and IT experts in your region. Reliable Data Ingestion in Big Data/IoT 14 Trivadis branches and more than 600 employees 200 Service Level Agreements Over 4,000 training participants Research and development budget: CHF 5.0 million Financially self-supporting and sustainably profitable Experience from more than 1,900 projects per year at over 800 customers
  5. 5. Reliable Data Ingestion in Big Data/IoT Technology on its own won't help you. You need to know how to use it properly.
  6. 6. Reliable Data Ingestion in Big Data/IoT Introduction
  7. 7. Big Data Definition (4 Vs) + Time to action? Characteristics of Big Data: its Volume, Velocity and Variety in combination. Big Data + Real-Time = Stream Processing Reliable Data Ingestion in Big Data/IoT
  8. 8. Ever increasing volume and velocity - Internet of Things (IoT) Wave Internet of Things (IoT): Enabling communication between devices, people & processes to exchange useful information & knowledge that create value for humans Term was first proposed by Kevin Ashton in 1999 Source: The Economist Source: Ericsson, June 2016 Reliable Data Ingestion in Big Data/IoT
  9. 9. What is Data Ingestion? Acquiring data as it is produced from Data Source(s) Transforming into a consumable form Delivering the transformed data to the consuming system(s) The challenge: Doing this continuously and at scale across a wide variety of sources and consuming systems Ingress and Egress are two other terms referring to data movement in and out of a system Reliable Data Ingestion in Big Data/IoT
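The acquire, transform, deliver stages map naturally onto a few small interfaces. Below is a minimal, hypothetical Java sketch of such a pipeline; all type and method names are made up for illustration and do not come from any of the tools discussed later.

    import java.util.List;
    import java.util.stream.Collectors;

    // Hypothetical interfaces for the three ingestion stages
    interface Acquirer<T> {
        List<T> acquire();                 // read raw records from the data source
    }

    interface Transformer<T, R> {
        R transform(T raw);                // turn a raw record into a consumable form
    }

    interface Deliverer<R> {
        void deliver(List<R> records);     // hand the records to the consuming system(s)
    }

    class IngestionPipeline<T, R> {
        private final Acquirer<T> source;
        private final Transformer<T, R> transformer;
        private final Deliverer<R> sink;

        IngestionPipeline(Acquirer<T> source, Transformer<T, R> transformer, Deliverer<R> sink) {
            this.source = source;
            this.transformer = transformer;
            this.sink = sink;
        }

        // One ingestion cycle; the real challenge is running this continuously and at scale
        void runOnce() {
            List<R> out = source.acquire().stream()
                    .map(transformer::transform)
                    .collect(Collectors.toList());
            sink.deliver(out);
        }
    }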
  10. 10. Lambda Architecture for Big Data (diagram: data sources such as Location, Social, Click stream, Sensor Data, Billing & Ordering, CRM / Profile, Marketing Campaigns, Call Center, Mobile Apps, Weather Data and SQL Import feed Event Hubs; a Hadoop cluster provides Batch Analytics via a Distributed Filesystem and Parallel Processing, while Streaming Analytics uses Stream Analytics with NoSQL and Reference / Models; results are served through SQL, Search, Dashboards, BI Tools, the Enterprise Data Warehouse and Online & Mobile Apps) Reliable Data Ingestion in Big Data/IoT
  11. 11. (same Lambda Architecture diagram, highlighting the three ingestion steps: Integrate – Sanitize / Normalize – Deliver)
  12. 12. Continuous Ingestion – DataFlow Pipelines (diagram: IoT sensors, DB sources, file / log sources and social feeds reach the Event Hub topics via IoT gateways with an MQTT broker, CDC gateways and Kafka Connect, dataflow gateways, messaging gateways with queues, REST and native connections; from the Event Hub topics the data flows on to Big Data storage and Stream Processing) Reliable Data Ingestion in Big Data/IoT
  13. 13. DataFlow Pipeline Reliable Data Ingestion in Big Data/IoT • Flow-based "programming" • Ingest Data from various sources • Extract – Transform – Load • High-Throughput, straight-through data flows • Data Lineage • Batch- or Stream-Processing • Visual coding with flow editor • Event Stream Processing (ESP) but not Complex Event Processing (CEP) Source: Confluent
  14. 14. Continuous Ingestion – Integrating data sources: SQL Polling, Change Data Capture (CDC), File Stream (File Tailing), File Stream (Appender), Sensor Stream Reliable Data Ingestion in Big Data/IoT
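To make the first of these patterns concrete, here is a hedged Java/JDBC sketch of SQL polling that repeatedly selects rows with an id greater than the last one seen; database URL, credentials, table and column names are hypothetical. CDC tools avoid the polling overhead and latency by reading the database transaction log instead.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class SqlPollingSource {
        public static void main(String[] args) throws Exception {
            long lastSeenId = 0;   // offset of the last row already ingested

            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/company", "ingest", "secret")) {
                while (true) {
                    try (PreparedStatement stmt = conn.prepareStatement(
                            "SELECT id, payload FROM orders WHERE id > ? ORDER BY id")) {
                        stmt.setLong(1, lastSeenId);
                        try (ResultSet rs = stmt.executeQuery()) {
                            while (rs.next()) {
                                lastSeenId = rs.getLong("id");
                                // hand the record to the delivery stage, e.g. publish it to Kafka
                                System.out.println("row " + lastSeenId + ": " + rs.getString("payload"));
                            }
                        }
                    }
                    Thread.sleep(5_000);   // polling interval: a trade-off between latency and DB load
                }
            }
        }
    }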
  15. 15. Ingestion with/without Transformation? Reliable Data Ingestion in Big Data/IoT Zero Transformation • No transformation, plain ingest, no schema validation • Keep the original format – Text, CSV, … • Allows storing data that may not conform to the schema Format Transformation • Better described as Format Translation • Simply change the format, e.g. from Text to Avro • Does schema validation Enrichment Transformation • Add new data to the message • Do not change existing values • Convert a value from one system to another and add it to the message Value Transformation • Replaces values in the message • Convert a value from one system to another and change the value in-place • Destroys the raw data!
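The difference between an enrichment and a value transformation is easiest to see in code. The following hypothetical Java sketch (plain Map-based records, not tied to any specific tool) adds a derived Celsius field in the enrichment case, but overwrites the raw Fahrenheit reading in the value-transformation case, which is why the latter destroys the raw data.

    import java.util.HashMap;
    import java.util.Map;

    public class TransformationTypes {
        // Enrichment: add a derived field, leave all existing values untouched
        static Map<String, Object> enrich(Map<String, Object> record) {
            Map<String, Object> out = new HashMap<>(record);
            double fahrenheit = ((Number) record.get("temp_f")).doubleValue();
            out.put("temp_c", (fahrenheit - 32) * 5.0 / 9.0);   // new field, temp_f is preserved
            return out;
        }

        // Value transformation: convert the value in place -- the raw reading is lost
        static Map<String, Object> convertInPlace(Map<String, Object> record) {
            Map<String, Object> out = new HashMap<>(record);
            double fahrenheit = ((Number) record.get("temp_f")).doubleValue();
            out.put("temp_f", (fahrenheit - 32) * 5.0 / 9.0);   // overwrites the original value
            return out;
        }

        public static void main(String[] args) {
            Map<String, Object> reading = Map.of("sensorId", "s-42", "temp_f", 98.6);
            System.out.println(enrich(reading));          // keeps temp_f=98.6 and adds temp_c=37.0
            System.out.println(convertInPlace(reading));  // temp_f is now 37.0, the raw 98.6 is gone
        }
    }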
  16. 16. Reliable Data Ingestion in Big Data/IoT Challenges
  17. 17. Why is Data Ingestion Difficult? Physical and Logical Infrastructure changes rapidly Key Challenges: Infrastructure Automation Edge Deployment Infrastructure Drift Data Structures and formats evolve and change unexpectedly Key Challenges: Consumption Readiness Corruption and Loss Structure Drift Data semantics change with evolving applications Key Challenges: Timely Intervention System Consistency Semantic Drift Reliable Data Ingestion in Big Data/IoT Source: Streamsets
  18. 18. Challenges for Ingesting Sensor Data Reliable Data Ingestion in Big Data/IoT Multitude of sensors Real-Time Streaming Multiple Firmware versions Bad Data from damaged sensors Regulatory Constraints Data Quality Source: Cloudera
  19. 19. Key Elements of Data Ingestion Reliable Data Ingestion in Big Data/IoT Idempotence Batching (Bulk) Data Transformation Compression Availability and Recoverability Reliable Data Transfer and Data Validation Resource Consumption Performance Monitoring
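Idempotence and batching in particular deserve an example, because naive re-delivery after a failure easily produces duplicates. Below is a minimal JDBC sketch, assuming a hypothetical sensor_events table with the event id as primary key and PostgreSQL-style upsert syntax.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class IdempotentBatchWriter {
        // Replaying the same events is harmless: the event id is the primary key
        // and conflicting inserts are simply ignored.
        static void writeBatch(Connection conn, List<String[]> events) throws Exception {
            String upsert = "INSERT INTO sensor_events (event_id, payload) VALUES (?, ?) "
                          + "ON CONFLICT (event_id) DO NOTHING";   // PostgreSQL syntax
            try (PreparedStatement stmt = conn.prepareStatement(upsert)) {
                for (String[] e : events) {
                    stmt.setString(1, e[0]);   // globally unique event id from the source
                    stmt.setString(2, e[1]);   // raw payload
                    stmt.addBatch();
                }
                stmt.executeBatch();           // one round trip for the whole batch
            }
        }

        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/ingest", "ingest", "secret")) {
                List<String[]> batch = List.of(
                        new String[]{"evt-1", "{\"temp\": 21.5}"},
                        new String[]{"evt-2", "{\"temp\": 22.0}"});
                writeBatch(conn, batch);
                writeBatch(conn, batch);       // re-delivery of the same batch is a no-op
            }
        }
    }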
  20. 20. Reliable Data Ingestion in Big Data/IoT Implementing Event Hub – Apache Kafka
  21. 21. How to implement an Event Hub? Apache Kafka to the rescue • Distributed publish-subscribe messaging system • Designed for processing of high-volume, real-time activity stream data (logs, metrics, social media, …) • Stateless (passive) architecture, offset-based consumption • Provides Topics, but does not implement the JMS standard • Initially developed at LinkedIn, now part of Apache • Peak Load on a single cluster: 2 million messages/sec, 4.7 Gigabits/sec inbound, 15 Gigabits/sec outbound (diagram: multiple producers publish to the Kafka cluster and multiple consumers read from it) Reliable Data Ingestion in Big Data/IoT
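A minimal Java producer against such a Kafka event hub could look as follows; broker address, topic name and the sample payload are hypothetical, the API is the standard kafka-clients producer.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    public class SensorEventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all");   // wait for replication before acknowledging, for reliability

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // key = sensor id, so all events of one sensor land in the same partition (ordered)
                producer.send(new ProducerRecord<>("sensor-readings", "sensor-42", "{\"temp\": 21.5}"));
                producer.flush();
            }
        }
    }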
  22. 22. Reliable Data Ingestion in Big Data/IoT Implementing Data Flow
  23. 23. Apache Flume distributed data collection service gets flows of data (like logs) from their source and aggregates them to where they have to be processed Sources: files, syslog, avro, … Sinks: HDFS files, HBase, … Reliable Data Ingestion in Big Data/IoT Source: Flume Documentation
  24. 24. Apache Sqoop Reliable Data Ingestion in Big Data/IoT • Sqoop exchanges data between an RDBMS and Hadoop • It can import all tables, a single table, or a portion of a table into HDFS • Does this very efficiently via a Map-only MapReduce job • Result is a directory in HDFS containing comma-delimited text • Sqoop can also export data from HDFS back to the database $ sqoop import --connect jdbc:mysql://localhost/company --username twheeler --password bigsecret --warehouse-dir /mydata --table customers
  25. 25. Oracle GoldenGate Reliable Data Ingestion in Big Data/IoT • Provides a low-impact change data capture solution for Oracle and non-Oracle RDBMS • Non-intrusive • Low-Latency • Open, modular Architecture • Supports heterogeneous systems • Oracle GoldenGate for Big Data provides Hadoop and Kafka Support
  26. 26. Apache Kafka Connect • a tool for scalably and reliably streaming data between Apache Kafka and other data systems • is not an ETL framework • Pre-built connectors available for Data Sources and Data Sinks • JDBC (Source) • Oracle GoldenGate (Source) • MQTT (Source) • HDFS (Sink) • Elasticsearch (Sink) • MongoDB (Sink) • Cassandra (Source & Sink) Reliable Data Ingestion in Big Data/IoT Source: Confluent
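Kafka Connect connectors are configured declaratively rather than coded; a connector is registered by posting its configuration as JSON to the Connect REST API (by default on port 8083). The sketch below registers a JDBC source connector from Java 11+; connection URL, table and topic prefix are hypothetical, and the config keys shown are those of Confluent's JDBC source connector, so treat the exact property names as an assumption.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RegisterJdbcSourceConnector {
        public static void main(String[] args) throws Exception {
            // Connector name plus its configuration, sent as JSON to the Connect REST API
            String body = "{"
                + "\"name\": \"orders-jdbc-source\","
                + "\"config\": {"
                + "  \"connector.class\": \"io.confluent.connect.jdbc.JdbcSourceConnector\","
                + "  \"connection.url\": \"jdbc:mysql://localhost/company?user=ingest&password=secret\","
                + "  \"table.whitelist\": \"orders\","
                + "  \"mode\": \"incrementing\","
                + "  \"incrementing.column.name\": \"id\","
                + "  \"topic.prefix\": \"db-\","
                + "  \"tasks.max\": \"1\""
                + "}}";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8083/connectors"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }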
  27. 27. Apache NiFi & MiNiFi • Originated at NSA as Niagarafiles • Open sourced December 2014, Apache TLP July 2015 • Opaque, file-oriented payload • Distributed system of processors with centralized control • Based on flow-based programming concepts • Data Provenance • Web-based user interface • Apache MiNiFi focuses on the collection of data at the source of its creation Reliable Data Ingestion in Big Data/IoT
  28. 28. StreamSets Data Collector Founded by ex-Cloudera, Informatica employees Continuous open source, intent-driven, big data ingest Visible, record-oriented approach fixes combinatorial explosion Batch or stream processing • Standalone, Spark cluster, MapReduce cluster IDE for pipeline development by ‘civilians’ Relatively new - first public release September 2015 So far, vast majority of commits are from StreamSets staff Reliable Data Ingestion in Big Data/IoT
  29. 29. Other Alternatives Reliable Data Ingestion in Big Data/IoT • Spring Cloud Data Flow • Node-RED • Project Flogo • Oracle Streaming Analytics • Spark Streaming • …
  30. 30. Reliable Data Ingestion in Big Data/IoT What about existing Integration Platforms?
  31. 31. Oracle’s Service Bus as a consumer of Kafka (architecture diagram: a Service Bus 12c Kafka proxy service with pipeline / routing consumes topics fed by cloud apps, mobile apps, sensor / IoT, web apps, a database via DB CDC and stream processing, and forwards the messages to REST, WSDL, Cloud and Kafka business services towards backend and cloud apps) Reliable Data Ingestion in Big Data/IoT
  32. 32. Oracle’s Service Bus as a producer to Kafka (architecture diagram: Service Bus 12c REST and SOAP proxy services with pipeline / routing receive messages from cloud apps, mobile apps, sensor / IoT, web apps, backend apps and SOA / BPM, and publish them through Kafka business services alongside REST and Cloud business services) Reliable Data Ingestion in Big Data/IoT
  33. 33. Hybrid Integration Platforms (HIP) needed Reliable Data Ingestion in Big Data/IoT Source: Gartner
  34. 34. Trivadis @ DOAG 2016 Booth: 3rd Floor – next to the escalator Know-how, T-Shirts, Contest and Trivadis Power to go We look forward to your visit, because with Trivadis you always win! Reliable Data Ingestion in Big Data/IoT
