SlideShare une entreprise Scribd logo
1  sur  14
Always-on Ingestion
Con$nuous	
  Inges$on	
  for	
  Data	
  at	
  Scale	
  
	
  ©	
  2015	
  StreamSets	
  Inc.,	
  All	
  rights	
  reserved	
  
Arvind	
  Prabhakar	
  
Big	
  Data	
  Day	
  LA,	
  June	
  2015	
  
© 2015 StreamSets, Inc.
About me 	
  
	
  
❏  Founder/CTO	
  
	
  Apache	
  So?ware	
  FoundaBon	
  
❏  Flume	
  -­‐	
  PMC	
  Chair	
  
❏  Sqoop	
  -­‐	
  PMC	
  Chair	
  
❏  Storm	
  -­‐	
  PMC,	
  CommiFer	
  
❏  MetaModel	
  -­‐	
  Mentor	
  
❏  Sentry	
  -­‐	
  Mentor	
  
❏  NiFi	
  -­‐	
  Mentor	
  
❏  ASF	
  Member	
  
	
  Previously...	
  
❏  Cloudera	
  
❏  InformaKca	
  
	
  
@aprabhakar
© 2015 StreamSets, Inc.
Some Background
What is Data Ingestion?
Why do we need Data Ingestion?
❏  Acquiring data from various sources
❏  Storing acquired data where it can be processed
❏  Data is consumed away from where it is produced
❏  Consuming systems are often distributed and remote
❏  Manually, with scripts, with rudimentary automation
❏  Higher level frameworks like Flume, Kafka, etc
How is Data Ingestion Implemented?
Logs	
  
Files	
  
Click	
  Streams	
  
Sensors	
  
Devices	
  
Database	
  
Logs	
  
Social	
  Data	
  
Streams	
  
Feeds	
  
Other	
  
Raw	
  Storage	
  
(HDFS,	
  S3)	
  
EDW,	
  NoSQL	
  
(Hive,	
  Impala,	
  
HBase,	
  Cassandra,	
  
RedShiY)	
  
Search	
  
(Solr,	
  
ElasKcSearch)	
  
Enterprise	
  Data	
  
Infrastructure	
  
© 2015 StreamSets, Inc.
Data Ingestion Challenge
Ever Increasing data volumes
and rates...
Data sources are physically
distributed and transient...
╳
╳
╳
╳
╳
╳
Data structures and semantics
are constantly changing...
© 2015 StreamSets, Inc.
Lot more than moving data!
Data Ingest should be agile
Data Ingest should be safe and reliable
❏  Welcome new data sources as they emerge
❏  Incorporate changes to existing sources as needed
❏  Protect your downstream from silent data corruption
❏  Ensure that there is no data loss in your infrastructure
Data Ingest should scale as needed
❏  Data ingest must never become a bottleneck
❏  Data ingest must scale without significant cost or effort
RELIABLE
© 2015 StreamSets, Inc.
»	
  	
  Design	
  Wisely	
  
	
  
	
  
»	
  	
  Operate	
  CauKously	
  
	
  
	
  
»	
  	
  Update	
  Liberally	
  
What can you do?
●  Pick	
  the	
  right	
  technology	
  
and	
  toolset	
  
	
  
●  Instrument	
  and	
  monitor	
  
mercilessly	
  
	
  
●  AnKcipate	
  and	
  understand	
  
the	
  changes	
  in	
  your	
  
environment	
  
Here is how...
© 2015 StreamSets, Inc.
Picking the right technology
Manual/Scripted
Batch Transport
Micro-batching
Pipelining
Message-Queue
File copying using CLI or GUI interface Cloudera HUE, Hadoop FS client
Ingest Mode Description Example
Bulk data transport using tools Sqoop, DistCp
Transport of small batches of data Sqoop/Sqoop2 (Storm, etc...)
Flow-like transport of event streams Flume, Scribe
Publish-Subscribe like transport of events Kafka, Kinesis
© 2015 StreamSets, Inc.
Sqoop
Overview Advantages Disadvantages
❏  Propagates	
  metadata	
  
❏  Cluster	
  based	
  parallel	
  
scaling	
  capability	
  
❏  Simple	
  and	
  easy	
  to	
  
understand/operate	
  
❏  Rich	
  set	
  of	
  connectors	
  
available	
  for	
  use	
  
❏  Supports	
  popular	
  formats	
  
like	
  Avro,	
  sequence	
  file	
  
etc.	
  
❏  Not	
  a	
  service	
  
❏  Direct	
  access	
  to	
  
producKon	
  data	
  stores	
  
from	
  cluster	
  
❏  Requires	
  access	
  to	
  data	
  
store	
  credenKals	
  	
  
❏  Connector	
  funcKonality	
  is	
  
not	
  consistent	
  between	
  
different	
  connectors	
  
❏  CLI	
  Tool	
  
❏  Oriented	
  towards	
  
structured	
  data	
  stores	
  
❏  Runs	
  map-­‐only	
  job	
  to	
  
transport	
  data	
  
	
  
© 2015 StreamSets, Inc.
Sqoop 2
Overview Advantages
❏  Propagates	
  metadata	
  
❏  Cluster	
  based	
  parallel	
  
scaling	
  capability	
  
❏  Simple	
  and	
  easy	
  to	
  
understand/operate	
  
❏  Supports	
  popular	
  formats	
  
like	
  Avro,	
  sequence	
  file	
  
etc.	
  
❏  Consistent	
  funcKonality	
  
across	
  connectors	
  
❏  Secure	
  handling	
  of	
  
credenKals	
  with	
  RBAC	
  
security	
  
❏  Considered	
  pre-­‐
producKon	
  quality	
  before	
  
2.0.0	
  release.	
  Currently	
  at	
  
1.99.6.	
  
❏  May	
  not	
  have	
  connecKvity	
  
at	
  par	
  with	
  Sqoop	
  1.	
  
❏  Sqoop	
  Service	
  with	
  CLI	
  
and	
  JSON/REST	
  interface	
  
❏  Oriented	
  towards	
  
structured	
  data	
  stores	
  
❏  Runs	
  chained	
  Map-­‐
Reduce	
  jobs	
  for	
  data	
  
transport	
  and	
  conversion	
  
	
  
Disadvantages
© 2015 StreamSets, Inc.
Flume
Overview Advantages
❏  Guaranteed	
  delivery	
  
semanKcs	
  
❏  Low-­‐latency	
  reliable	
  data	
  
transfer	
  
❏  DeclaraKve	
  configuraKon	
  
with	
  no	
  coding	
  necessary	
  
for	
  common	
  use-­‐cases	
  
❏  Fully	
  extendable	
  and	
  
customizable	
  
❏  Integrates	
  with	
  most	
  
commonly	
  used	
  end-­‐
points	
  
❏  Non-­‐trivial	
  configuraKon	
  
❏  Complex	
  topology	
  
configuration	
  can	
  be	
  hard	
  
to	
  build	
  and	
  maintain	
  
❏  Custom	
  end-­‐point	
  
implementaKon	
  requires	
  
significant	
  code	
  
complexity	
  
❏  Distributed	
  pipeline	
  
system	
  for	
  efficient	
  
transport	
  of	
  large	
  
volumes	
  of	
  data	
  
❏  Built	
  in	
  support	
  for	
  
contextual	
  rouKng,	
  
filtering,	
  replicaKon	
  and	
  
mulKplexing	
  
	
  
	
  
	
  
Disadvantages
© 2015 StreamSets, Inc.
Kafka
Overview Advantages
❏  Strong	
  retenKon	
  and	
  
ordering	
  semanKcs	
  
❏  Dynamic	
  cluster	
  based	
  
scalability	
  and	
  throughput	
  
❏  Low-­‐level	
  APIs	
  for	
  building	
  
consumers	
  and	
  producers	
  
❏  Variety	
  of	
  open	
  source	
  
producers	
  and	
  consumers	
  
available	
  on	
  GitHub	
  
❏  Allows	
  reprocessing	
  of	
  
consumed	
  data	
  
❏  Distributed	
  and	
  efficient	
  
publish-­‐subscribe	
  
messaging	
  system	
  	
  
❏  Used	
  for	
  democraKzaKon	
  
of	
  data	
  between	
  
applicaKons	
  
	
  
	
  
	
  
Disadvantages
❏  Delivery	
  guarantee	
  owned	
  
by	
  producers	
  and	
  
consumers	
  
❏  Opaque	
  pub-­‐sub	
  design	
  
can	
  cause	
  applicaKons	
  to	
  
be	
  	
  highly	
  coupled	
  
❏  Minimal	
  metadata	
  
support	
  
© 2015 StreamSets, Inc.
Typical Examples
For Structured Data
Simple	
  
❏  Sqoop	
  for	
  Batch	
  transport	
  
❏  Sqoop	
  2	
  for	
  micro-­‐batch	
  transport	
  
	
  
	
  
Intermediate	
  
❏  Flume	
  for	
  Directory	
  Spooling	
  	
  
	
  
	
  
Advanced	
  
❏  Custom	
  Database	
  Log	
  Shipping	
  
implementaKon	
  
Simple	
  
❏  Flume	
  based	
  AggregaKon	
  
❏  Kaia	
  based	
  pub-­‐sub	
  for	
  applicaKons	
  
	
  
	
  
Intermediate	
  
❏  Flume	
  +	
  Kaia	
  based	
  aggregaKon	
  and	
  
pub-­‐sub	
  
	
  
	
  
Advanced	
  
❏  Kaia	
  +	
  Storm	
  for	
  pub-­‐sub	
  and	
  
preparaKon	
  
For Streaming Event Data
© 2015 StreamSets, Inc.
	
  
❏  Apache	
  Sqoop:	
  hFp://sqoop.apache.org	
  
❏  Current	
  Version:	
  Sqoop1	
  -­‐	
  1.4.6;	
  	
  Sqoop2	
  -­‐	
  1.99.6	
  
	
  
❏  Apache	
  Flume:	
  hFp://flume.apache.org	
  
❏  Current	
  Version:	
  Flume	
  1.6.0	
  
	
  
❏  Apache	
  Kaia:	
  hFp://kaia.apache.org	
  
❏  Current	
  Version:	
  Kaia	
  0.8.2.1	
  
	
  
	
  
For more information...
© 2015 StreamSets, Inc.
My	
  Contact	
  InformaKon:	
  
●  Email:	
  	
  arvind	
  at	
  streamsets	
  dot	
  com	
  
●  TwiFer:	
  @aprabhakar	
  
●  Website:	
  www.streamsets.com	
  
	
  
	
  
	
  
Thank You!

Contenu connexe

Tendances

Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Zhenxiao Luo
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudRick Bilodeau
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseMichael Stack
 
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Data Con LA
 
Hadoop data ingestion
Hadoop data ingestionHadoop data ingestion
Hadoop data ingestionVinod Nayal
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexDataWorks Summit
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingJack Gudenkauf
 
Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017
Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017
Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017Big Data Spain
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonSpark Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Data Con LA
 
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Data Con LA
 
Spark meetup - Zoomdata Streaming
Spark meetup  - Zoomdata StreamingSpark meetup  - Zoomdata Streaming
Spark meetup - Zoomdata StreamingZoomdata
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseDataStax
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedHostedbyConfluent
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Cask Data
 

Tendances (20)

Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
 
In Flux Limiting for a multi-tenant logging service
In Flux Limiting for a multi-tenant logging serviceIn Flux Limiting for a multi-tenant logging service
In Flux Limiting for a multi-tenant logging service
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
 
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
 
Hadoop data ingestion
Hadoop data ingestionHadoop data ingestion
Hadoop data ingestion
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017
Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017
Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
Active Learning for Fraud Prevention
Active Learning for Fraud PreventionActive Learning for Fraud Prevention
Active Learning for Fraud Prevention
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
 
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
 
Spark meetup - Zoomdata Streaming
Spark meetup  - Zoomdata StreamingSpark meetup  - Zoomdata Streaming
Spark meetup - Zoomdata Streaming
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax Enterprise
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 

En vedette

IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...Kay Lerch
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseDataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseAldrin Piri
 
Comparison of various streaming technologies
Comparison of various streaming technologiesComparison of various streaming technologies
Comparison of various streaming technologiesSachin Aggarwal
 
Real-Time Event & Stream Processing on MS Azure
Real-Time Event & Stream Processing on MS AzureReal-Time Event & Stream Processing on MS Azure
Real-Time Event & Stream Processing on MS AzureKhalid Salama
 
Developing Connected Applications with AWS IoT - Technical 301
Developing Connected Applications with AWS IoT - Technical 301Developing Connected Applications with AWS IoT - Technical 301
Developing Connected Applications with AWS IoT - Technical 301Amazon Web Services
 
Lightbend Fast Data Platform
Lightbend Fast Data PlatformLightbend Fast Data Platform
Lightbend Fast Data PlatformLightbend
 
How to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of ThingsHow to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of ThingsCloudera, Inc.
 
Study: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving CarsStudy: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving CarsLinkedIn
 
Getting started with Azure Event Hubs and Stream Analytics services
Getting started with Azure Event Hubs and Stream Analytics servicesGetting started with Azure Event Hubs and Stream Analytics services
Getting started with Azure Event Hubs and Stream Analytics servicesVladimir Bychkov
 
London Apache Kafka Meetup (Jan 2017)
London Apache Kafka Meetup (Jan 2017)London Apache Kafka Meetup (Jan 2017)
London Apache Kafka Meetup (Jan 2017)Landoop Ltd
 
Not Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabsNot Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabsKonrad Malawski
 
UX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesUX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesNed Potter
 
Processing IoT Data with Apache Kafka
Processing IoT Data with Apache KafkaProcessing IoT Data with Apache Kafka
Processing IoT Data with Apache KafkaMatthew Howlett
 
Designing Teams for Emerging Challenges
Designing Teams for Emerging ChallengesDesigning Teams for Emerging Challenges
Designing Teams for Emerging ChallengesAaron Irizarry
 
IoT Connected Brewery
IoT Connected BreweryIoT Connected Brewery
IoT Connected BreweryJason Hubbard
 
Five Cool Use Cases for the Spring Component of the SOA Suite 11g
Five Cool Use Cases for the Spring Component of the SOA Suite 11gFive Cool Use Cases for the Spring Component of the SOA Suite 11g
Five Cool Use Cases for the Spring Component of the SOA Suite 11gGuido Schmutz
 
Fusion Middleware Live Application Development Demo Oracle Open World 2012
Fusion Middleware Live Application Development Demo Oracle Open World 2012Fusion Middleware Live Application Development Demo Oracle Open World 2012
Fusion Middleware Live Application Development Demo Oracle Open World 2012Lucas Jellema
 

En vedette (20)

IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseDataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
 
Comparison of various streaming technologies
Comparison of various streaming technologiesComparison of various streaming technologies
Comparison of various streaming technologies
 
Real-Time Event & Stream Processing on MS Azure
Real-Time Event & Stream Processing on MS AzureReal-Time Event & Stream Processing on MS Azure
Real-Time Event & Stream Processing on MS Azure
 
Developing Connected Applications with AWS IoT - Technical 301
Developing Connected Applications with AWS IoT - Technical 301Developing Connected Applications with AWS IoT - Technical 301
Developing Connected Applications with AWS IoT - Technical 301
 
Lightbend Fast Data Platform
Lightbend Fast Data PlatformLightbend Fast Data Platform
Lightbend Fast Data Platform
 
How to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of ThingsHow to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of Things
 
Study: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving CarsStudy: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving Cars
 
Getting started with Azure Event Hubs and Stream Analytics services
Getting started with Azure Event Hubs and Stream Analytics servicesGetting started with Azure Event Hubs and Stream Analytics services
Getting started with Azure Event Hubs and Stream Analytics services
 
Blr hadoop meetup
Blr hadoop meetupBlr hadoop meetup
Blr hadoop meetup
 
London Apache Kafka Meetup (Jan 2017)
London Apache Kafka Meetup (Jan 2017)London Apache Kafka Meetup (Jan 2017)
London Apache Kafka Meetup (Jan 2017)
 
Storm over gearpump
Storm over gearpumpStorm over gearpump
Storm over gearpump
 
Kafka connect
Kafka connectKafka connect
Kafka connect
 
Not Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabsNot Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabs
 
UX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesUX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and Archives
 
Processing IoT Data with Apache Kafka
Processing IoT Data with Apache KafkaProcessing IoT Data with Apache Kafka
Processing IoT Data with Apache Kafka
 
Designing Teams for Emerging Challenges
Designing Teams for Emerging ChallengesDesigning Teams for Emerging Challenges
Designing Teams for Emerging Challenges
 
IoT Connected Brewery
IoT Connected BreweryIoT Connected Brewery
IoT Connected Brewery
 
Five Cool Use Cases for the Spring Component of the SOA Suite 11g
Five Cool Use Cases for the Spring Component of the SOA Suite 11gFive Cool Use Cases for the Spring Component of the SOA Suite 11g
Five Cool Use Cases for the Spring Component of the SOA Suite 11g
 
Fusion Middleware Live Application Development Demo Oracle Open World 2012
Fusion Middleware Live Application Development Demo Oracle Open World 2012Fusion Middleware Live Application Development Demo Oracle Open World 2012
Fusion Middleware Live Application Development Demo Oracle Open World 2012
 

Similaire à Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabhakar of Streamsets

Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaS
Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaSOpenstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaS
Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaSSadique Puthen
 
Introduction openstack-meetup-nov-28
Introduction openstack-meetup-nov-28Introduction openstack-meetup-nov-28
Introduction openstack-meetup-nov-28Sadique Puthen
 
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Impetus Technologies
 
Best Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise ClusterBest Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise ClusterInfluxData
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native PlatformSunil Govindan
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native PlatformSunil Govindan
 
Music city data Hail Hydrate! from stream to lake
Music city data Hail Hydrate! from stream to lakeMusic city data Hail Hydrate! from stream to lake
Music city data Hail Hydrate! from stream to lakeTimothy Spann
 
QCon New York 2014 - Apache Stratos
QCon New York 2014  - Apache StratosQCon New York 2014  - Apache Stratos
QCon New York 2014 - Apache StratosSamisa Abeysinghe
 
Scylla Summit 2019 Keynote - Avi Kivity
Scylla Summit 2019 Keynote - Avi KivityScylla Summit 2019 Keynote - Avi Kivity
Scylla Summit 2019 Keynote - Avi KivityScyllaDB
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Timothy Spann
 
Big data conference europe real-time streaming in any and all clouds, hybri...
Big data conference europe   real-time streaming in any and all clouds, hybri...Big data conference europe   real-time streaming in any and all clouds, hybri...
Big data conference europe real-time streaming in any and all clouds, hybri...Timothy Spann
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Hadoop / Spark Conference Japan
 
Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"Piyush Kumar
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaDataWorks Summit
 
Lookout on Scaling Security to 100 Million Devices
Lookout on Scaling Security to 100 Million DevicesLookout on Scaling Security to 100 Million Devices
Lookout on Scaling Security to 100 Million DevicesScyllaDB
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataMike Percy
 
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...Timothy Spann
 
Apache Stratos tutorial WSO2Con Europe-2014
Apache Stratos tutorial WSO2Con Europe-2014Apache Stratos tutorial WSO2Con Europe-2014
Apache Stratos tutorial WSO2Con Europe-2014Lakmal Warusawithana
 
End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming ArchitecturesCloudera, Inc.
 

Similaire à Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabhakar of Streamsets (20)

Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaS
Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaSOpenstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaS
Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaS
 
Introduction openstack-meetup-nov-28
Introduction openstack-meetup-nov-28Introduction openstack-meetup-nov-28
Introduction openstack-meetup-nov-28
 
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
 
Best Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise ClusterBest Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise Cluster
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Music city data Hail Hydrate! from stream to lake
Music city data Hail Hydrate! from stream to lakeMusic city data Hail Hydrate! from stream to lake
Music city data Hail Hydrate! from stream to lake
 
QCon New York 2014 - Apache Stratos
QCon New York 2014  - Apache StratosQCon New York 2014  - Apache Stratos
QCon New York 2014 - Apache Stratos
 
Scylla Summit 2019 Keynote - Avi Kivity
Scylla Summit 2019 Keynote - Avi KivityScylla Summit 2019 Keynote - Avi Kivity
Scylla Summit 2019 Keynote - Avi Kivity
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
 
Big data conference europe real-time streaming in any and all clouds, hybri...
Big data conference europe   real-time streaming in any and all clouds, hybri...Big data conference europe   real-time streaming in any and all clouds, hybri...
Big data conference europe real-time streaming in any and all clouds, hybri...
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
 
Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache Kafka
 
Lookout on Scaling Security to 100 Million Devices
Lookout on Scaling Security to 100 Million DevicesLookout on Scaling Security to 100 Million Devices
Lookout on Scaling Security to 100 Million Devices
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...
 
Apache Stratos tutorial WSO2Con Europe-2014
Apache Stratos tutorial WSO2Con Europe-2014Apache Stratos tutorial WSO2Con Europe-2014
Apache Stratos tutorial WSO2Con Europe-2014
 
End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming Architectures
 

Plus de Data Con LA

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA
 

Plus de Data Con LA (20)

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
 

Dernier

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 

Dernier (20)

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 

Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabhakar of Streamsets

  • 1. Always-on Ingestion Con$nuous  Inges$on  for  Data  at  Scale    ©  2015  StreamSets  Inc.,  All  rights  reserved   Arvind  Prabhakar   Big  Data  Day  LA,  June  2015  
  • 2. © 2015 StreamSets, Inc. About me     ❏  Founder/CTO    Apache  So?ware  FoundaBon   ❏  Flume  -­‐  PMC  Chair   ❏  Sqoop  -­‐  PMC  Chair   ❏  Storm  -­‐  PMC,  CommiFer   ❏  MetaModel  -­‐  Mentor   ❏  Sentry  -­‐  Mentor   ❏  NiFi  -­‐  Mentor   ❏  ASF  Member    Previously...   ❏  Cloudera   ❏  InformaKca     @aprabhakar
  • 3. © 2015 StreamSets, Inc. Some Background What is Data Ingestion? Why do we need Data Ingestion? ❏  Acquiring data from various sources ❏  Storing acquired data where it can be processed ❏  Data is consumed away from where it is produced ❏  Consuming systems are often distributed and remote ❏  Manually, with scripts, with rudimentary automation ❏  Higher level frameworks like Flume, Kafka, etc How is Data Ingestion Implemented? Logs   Files   Click  Streams   Sensors   Devices   Database   Logs   Social  Data   Streams   Feeds   Other   Raw  Storage   (HDFS,  S3)   EDW,  NoSQL   (Hive,  Impala,   HBase,  Cassandra,   RedShiY)   Search   (Solr,   ElasKcSearch)   Enterprise  Data   Infrastructure  
  • 4. © 2015 StreamSets, Inc. Data Ingestion Challenge Ever Increasing data volumes and rates... Data sources are physically distributed and transient... ╳ ╳ ╳ ╳ ╳ ╳ Data structures and semantics are constantly changing...
  • 5. © 2015 StreamSets, Inc. Lot more than moving data! Data Ingest should be agile Data Ingest should be safe and reliable ❏  Welcome new data sources as they emerge ❏  Incorporate changes to existing sources as needed ❏  Protect your downstream from silent data corruption ❏  Ensure that there is no data loss in your infrastructure Data Ingest should scale as needed ❏  Data ingest must never become a bottleneck ❏  Data ingest must scale without significant cost or effort RELIABLE
  • 6. © 2015 StreamSets, Inc. »    Design  Wisely       »    Operate  CauKously       »    Update  Liberally   What can you do? ●  Pick  the  right  technology   and  toolset     ●  Instrument  and  monitor   mercilessly     ●  AnKcipate  and  understand   the  changes  in  your   environment   Here is how...
  • 7. © 2015 StreamSets, Inc. Picking the right technology Manual/Scripted Batch Transport Micro-batching Pipelining Message-Queue File copying using CLI or GUI interface Cloudera HUE, Hadoop FS client Ingest Mode Description Example Bulk data transport using tools Sqoop, DistCp Transport of small batches of data Sqoop/Sqoop2 (Storm, etc...) Flow-like transport of event streams Flume, Scribe Publish-Subscribe like transport of events Kafka, Kinesis
  • 8. © 2015 StreamSets, Inc. Sqoop Overview Advantages Disadvantages ❏  Propagates  metadata   ❏  Cluster  based  parallel   scaling  capability   ❏  Simple  and  easy  to   understand/operate   ❏  Rich  set  of  connectors   available  for  use   ❏  Supports  popular  formats   like  Avro,  sequence  file   etc.   ❏  Not  a  service   ❏  Direct  access  to   producKon  data  stores   from  cluster   ❏  Requires  access  to  data   store  credenKals     ❏  Connector  funcKonality  is   not  consistent  between   different  connectors   ❏  CLI  Tool   ❏  Oriented  towards   structured  data  stores   ❏  Runs  map-­‐only  job  to   transport  data    
  • 9. © 2015 StreamSets, Inc. Sqoop 2 Overview Advantages ❏  Propagates  metadata   ❏  Cluster  based  parallel   scaling  capability   ❏  Simple  and  easy  to   understand/operate   ❏  Supports  popular  formats   like  Avro,  sequence  file   etc.   ❏  Consistent  funcKonality   across  connectors   ❏  Secure  handling  of   credenKals  with  RBAC   security   ❏  Considered  pre-­‐ producKon  quality  before   2.0.0  release.  Currently  at   1.99.6.   ❏  May  not  have  connecKvity   at  par  with  Sqoop  1.   ❏  Sqoop  Service  with  CLI   and  JSON/REST  interface   ❏  Oriented  towards   structured  data  stores   ❏  Runs  chained  Map-­‐ Reduce  jobs  for  data   transport  and  conversion     Disadvantages
  • 10. © 2015 StreamSets, Inc. Flume Overview Advantages ❏  Guaranteed  delivery   semanKcs   ❏  Low-­‐latency  reliable  data   transfer   ❏  DeclaraKve  configuraKon   with  no  coding  necessary   for  common  use-­‐cases   ❏  Fully  extendable  and   customizable   ❏  Integrates  with  most   commonly  used  end-­‐ points   ❏  Non-­‐trivial  configuraKon   ❏  Complex  topology   configuration  can  be  hard   to  build  and  maintain   ❏  Custom  end-­‐point   implementaKon  requires   significant  code   complexity   ❏  Distributed  pipeline   system  for  efficient   transport  of  large   volumes  of  data   ❏  Built  in  support  for   contextual  rouKng,   filtering,  replicaKon  and   mulKplexing         Disadvantages
  • 11. © 2015 StreamSets, Inc. Kafka Overview Advantages ❏  Strong  retenKon  and   ordering  semanKcs   ❏  Dynamic  cluster  based   scalability  and  throughput   ❏  Low-­‐level  APIs  for  building   consumers  and  producers   ❏  Variety  of  open  source   producers  and  consumers   available  on  GitHub   ❏  Allows  reprocessing  of   consumed  data   ❏  Distributed  and  efficient   publish-­‐subscribe   messaging  system     ❏  Used  for  democraKzaKon   of  data  between   applicaKons         Disadvantages ❏  Delivery  guarantee  owned   by  producers  and   consumers   ❏  Opaque  pub-­‐sub  design   can  cause  applicaKons  to   be    highly  coupled   ❏  Minimal  metadata   support  
  • 12. © 2015 StreamSets, Inc. Typical Examples For Structured Data Simple   ❏  Sqoop  for  Batch  transport   ❏  Sqoop  2  for  micro-­‐batch  transport       Intermediate   ❏  Flume  for  Directory  Spooling         Advanced   ❏  Custom  Database  Log  Shipping   implementaKon   Simple   ❏  Flume  based  AggregaKon   ❏  Kaia  based  pub-­‐sub  for  applicaKons       Intermediate   ❏  Flume  +  Kaia  based  aggregaKon  and   pub-­‐sub       Advanced   ❏  Kaia  +  Storm  for  pub-­‐sub  and   preparaKon   For Streaming Event Data
  • 13. © 2015 StreamSets, Inc.   ❏  Apache  Sqoop:  hFp://sqoop.apache.org   ❏  Current  Version:  Sqoop1  -­‐  1.4.6;    Sqoop2  -­‐  1.99.6     ❏  Apache  Flume:  hFp://flume.apache.org   ❏  Current  Version:  Flume  1.6.0     ❏  Apache  Kaia:  hFp://kaia.apache.org   ❏  Current  Version:  Kaia  0.8.2.1       For more information...
  • 14. © 2015 StreamSets, Inc. My  Contact  InformaKon:   ●  Email:    arvind  at  streamsets  dot  com   ●  TwiFer:  @aprabhakar   ●  Website:  www.streamsets.com         Thank You!