SlideShare une entreprise Scribd logo
1  sur  35
Télécharger pour lire hors ligne
Data encoding and 
metadata for streams
> Introduction 
Me at a glance 
• My name is Jonathan Winandy (@ahoy_jon). 
• I am a Data pipeline engineer : 
• I worked on a “DataLake” ! 
• I use tools in the larger Java ecosystem like 
Java, Scala, Clojure, Hadoop … 
• And I am an “entrepreneur”.
> Introduction 
I cofounded two companies and they use streams 
as their data backbone. 
Health care oriented software engineering. 
Provide : 
Coordination for health care professionals.
I cofounded two companies and they use streams 
as their data backbone. 
@PrimaticeData 
“Good dataviz, surreal backends.” 
> Introduction 
Provide : 
Tools and methods for Data capitalisation.
> Introduction 
What are Streams ? 
It’s an abstract data structure with the following : 
operations : 
• append(bytes) -> void? 
• readAt(int) -> null | bytes 
rule 1 : 
∀p ∈ ℕ, for some definition of ‘==‘ 
x := readAt(p) 
y := readAt(p) 
! 
x != null => x == y 
Rule 1 implies : Infinite cacheability 
once the data is available at a position.
Streams are the simplest way 
to manage data. 
0 1 2 3 4 5 6 
And they are naturally compatible with the perception of information from a 
singular observer … 
> Introduction
But be careful, streams are definitely 
not like queues, ESB, EAI, or what ever 
messaging solution comes to mind …
> Introduction 
There is a lot to tell on Streams 
• Sub events : Events are pre-projected into … 
• Quantum of action : A ‘user’ action generates zero or one event (no 
more). 
• Structural sharing for large payload (cf. Content Addressable 
Storage). 
• Garbage collection for append only data structures. 
this presentation 
! 
• Causality enforcement in asynchronous contexts : On important 
request, causality is enforced. 
• Binary encoding and Metadata.
> Introduction 
A quick note on Causality 
If you don’t ensure causality for 
web apps, some strange 
comportements may arise : 
Sometimes, as a user, I 
cannot see my own “edits”. 
Sometimes, as a client, I 
cannot buy on the website 
after I checkout my basket. 
APP APP 
“Who is the fastest 
between the Data bus 
and the client ?” 
You don’t want to bet, 
especially under load.
Data encoding and 
metadata for streams
> Content 
Content : 
• Data encoding 
• Identity 
• Metadata 
• Datagram 
• Conclusion
> Data encoding 
State of data encodings in the 
industry 
• As always worse is considered better. 
• Most of streams have data encoded in : 
• CSV/TSV 
• JSON 
• Platform specific serialisations (eg: Java 
serialisation, Kryo)
> Data encoding 
Why this is important ? 
• Some streams may contains very large amount of 
Data, the chosen encoding must be cpu and space 
efficient. 
• Streams are processed 
by many programs, 
and many intermediaries, 
for many years, 
the chosen encoding must be processable in a 
generic way.
JSON is the lower denominator 
Plus : 
• It reaches the browser, you can produce and consume data from 
inside a web page. 
A lot of Cons : 
• Inefficient, 
• No dates, no proper numerics, 
• Very basic data structures, 
• Very error Prone. 
We all need JSON, 
but we should use it 
only when we can't 
avoid it. 
> Data encoding 
Eg : In our databases, we can 
avoid JSONs ;)
> Data encoding 
How bad JSON is ? 
{“name”:"Bob", 
“age":11, 
"gender":"Male"} 
39 Bytes for 
10 Bytes of data 
:02:06:62:6f:62:02:16:02:da:01
> Data encoding 
relevant 
ones 
popular binaries low tech cognitect “papa ?!” 
Avro Thrift Proto Buf JSON CSV Fressian Transit EDN XML RDF 
binary YES NO YES OK NO ?? 
generic YES ?? NO YES YES YES 
schema 
based 
YES NO YES NO ?? meta 
specific 
encoding 
S” YES OK Literal 
YES NO “STRING 
s 
reach the 
browser 
YES NO +++++ OK NO YES OK 
easy ? NO I PASS “true” YEP 
HUM 
? 
… 
safe ? YES HUM? NO NO MISM 
ATCH 
<! 
YES 
has 
dates? Soon NO NO YES YES
Identity 
> Identity 
• Most mechanism around stream assure an “at 
most once delivery”. 
• An identity definition is necessary to ensure 
idempotency.
> Identity 
There are 2 ways to refer to a message : 
• with a fingerprint calculated from the message 
(digest). 
• with an external identifier (like UUIDs).
> Identity 
UUIDs allow : 
F0991FD1-D58A-4A5F-8D13-903F368882D1 
8AA5C612-B365-4F8F-AF3F-DF623E1F6B22 
93A87D37-0658-47C9-84F6-801E83A5821C 
• to manage things that are not encoded yet. 
• to avoid the hashing and the parsing of payloads. 
Recommandation : add an UUID (128bits) to 
every elements of the stream.
Metadata 
> Metadata 
• Metadata uses range from the very useful (like 
http headers) to the very meta meta[1]. 
• Metadata on Stream elements is most of the 
time implicit, like for example the Content-Type : 
• “It’s a stream of JSONs” then every element 
of the stream has “content-type=application/json”. 
[1] I am looking at you RDF !
What kind of metadata there are for streams element ? 
• Content-type or data-encoding : 
e.g. : application/json 
• Type or Profile : indicate that the given element 
is an instance of a given type. 
e.g. : domain.model.MessageSent 
• Provenance information : 
e.g. : {“env”:”test”, 
“application”:{“name”:”webapp”, 
“version”:{“commit”:”68546ca…”}}} 
> Metadata
> Metadata 
A quick note on provenance 
The provenance is practical in distributed systems we 
want to know : 
• from which node do a element comes. 
• on the behalf of which agent this element is created. 
• from which environment[1] a element comes. 
[1] with new architecture and Data Labs, environments are sometimes 
shared on the same infrastructure (eg : no Pre-Production platform). 
It’s then very useful to safeguard against the pollution of data.
> Metadata 
{ 
"content-type":"application/json", 
"profile":"domain.model.MessageSent", 
"provenance":{ 
"application":{ 
"name":"webapp", 
"version":"68546ca6e963981a8279aa327cc1e1362d15554e" 
}, 
"node":{ 
"environement":"test", 
"network":{ 
"interface":{ 
"en0":{ 
"addresses":{ 
"192.168.0.13":{ 
"family":"inet", 
"netmask":"255.255.255.0", 
"broadcast":"192.168.0.255" 
} 
} 
} 
} 
}, 
"hostname":["Blaze"], 
"platform_family":"mac_os_x" 
} 
} 
} 
• The metadata of an element can 
represent a significant piece of data. 
Sometimes more than the data itself. 
• !! The same piece of metadata can be 
shared across many elements. !!
> Datagram 
Anatomy of an element 
:ID :HEADERS 
:BODY 
DB7D919B-248F-4676- 
8494-2698B48C69C3 
57158663-5933-4CE6- 
A54E-8179ECFBFCCA [“ich”,“bin”,“ein”,“JSON”] 
e.g.
> Datagram 
1. Create and register your headers 
(in a distributed Key/Store for example) . 
4813EDF2-B04E-4B70- 
{ 
AB04-0F9EA456E032 
"content-type":"application/json", 
"profile":"domain.model.MessageSent", 
"provenance":{ 
"application":{ 
"name":"webapp", 
"version":"68546ca6e963981a8279aa327cc1e1362d15554e" 
}, 
"node":{ 
"environement":"test", 
"network":{ 
"interface":{ 
"en0":{ 
"addresses":{ 
"192.168.0.13":{ 
"family":"inet", 
"netmask":"255.255.255.0", 
"broadcast":"192.168.0.255" 
} 
} 
} 
} 
}, 
"hostname":["Blaze"], 
"platform_family":"mac_os_x" 
} 
} 
}
> Datagram 
2. use it in your stream ! 
5462E738-ABAA-452F- 
87E0-FD38AEB9DF81 
4813EDF2-B04E-4B70- 
AB04-0F9EA456E032 
{"cid": {"idStr": "498683D2-1192-4794-8C23-5BE49EEEC763"}, 
"userId": 
{"idStr": "BC3D8614-AF1F-48C8-B91F-0D907FD0FAF3"}, 
"content": " Contenu de message de test"} 
81C76676-7B19-428E-8 
56D-984BB67287D1 
4813EDF2-B04E-4B70- 
AB04-0F9EA456E032 
{"cid": {"idStr": "498683D2-1192-4794-8C23-5BE49EEEC763"}, 
"userId": 
{"idStr": "BC3D8614-AF1F-48C8-B91F-0D907FD0FAF3"}, 
"content": " Contenu de message de test”}
> Datagram 
Ho : You can have also have a stream of headers … 
4813EDF2-B04E-4B70- 
AB04-0F9EA456E032 :HEADERS 
4813EDF2-B04E-4B70- 
AB04-0F9EA456E032 
5462E738-ABAA-452F- 
87E0-FD38AEB9DF81 
4813EDF2-B04E-4B70- 
AB04-0F9EA456E032 
81C76676-7B19-428E-85 
6D-984BB67287D1 
4813EDF2-B04E-4B70- 
AB04-0F9EA456E032 
69DFC711-9D21-4DD6- 
A51D-C04A7A6E20A9 
0 1 2
> Conclusion 
If you don’t yet use streams instead of 
databases, start to use one next Monday 
(even with JSON and no headers…). 
If you do already use streams … Well, 
you know what to do ! ;)
Bonus :What is a CAS ? 
A Content Adressable Storage is a specific “key 
value store” : 
operations : 
• store(bytes) -> key 
• get(key) -> null | bytes 
rule 1 : 
key = h(data) 
h being a cryptographic hash 
function like md5 or sha1. 
rule 2 : 
∀data 
get(store(data)) = data 
Rule 1 and 2 imply : 
Infinite cacheability 
and scalability.
Exemple of architectures 
CLASSICAL 
APP 
APP 
DB 
WITH STREAMS 
APP 
APP 
append 
broadcast
The broadcast mechanism is equivalent to 
Exemple of architectures 
a db replication mechanism. 
CLASSICAL 
APP 
REPLICATION 
(BIN/LOG) 
APP 
APP 
DB 
DB 
WITH STREAMS 
APP 
APP 
APP 
append 
broadcast

Contenu connexe

Tendances

Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaAvinash Ramineni
 
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Hadoop User Group
 
Presentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membasePresentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membaseArdak Shalkarbayuli
 
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...Lightbend
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestKrishna Gade
 
ELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learnedELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learnedTin Le
 
Lightbend Fast Data Platform
Lightbend Fast Data PlatformLightbend Fast Data Platform
Lightbend Fast Data PlatformLightbend
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchAbhishek Andhavarapu
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentationCyanny LIANG
 
Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesUnderstanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesLightbend
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark StreamingJen Aman
 
How to manage large amounts of data with akka streams
How to manage large amounts of data with akka streamsHow to manage large amounts of data with akka streams
How to manage large amounts of data with akka streamsIgor Mielientiev
 
To Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT ToTo Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT ToSATOSHI TAGOMORI
 
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Fwdays
 
SF Front End Developers - Ember + D3
SF Front End Developers - Ember + D3SF Front End Developers - Ember + D3
SF Front End Developers - Ember + D3Ben Lesh
 
Fluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker containerFluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker containerTreasure Data, Inc.
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in NetflixDanny Yuan
 
Akka Streams And Kafka Streams: Where Microservices Meet Fast Data
Akka Streams And Kafka Streams: Where Microservices Meet Fast DataAkka Streams And Kafka Streams: Where Microservices Meet Fast Data
Akka Streams And Kafka Streams: Where Microservices Meet Fast DataLightbend
 

Tendances (20)

Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and Kibana
 
Not only SQL
Not only SQL Not only SQL
Not only SQL
 
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
 
Presentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membasePresentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membase
 
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at Pinterest
 
ELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learnedELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learned
 
Lightbend Fast Data Platform
Lightbend Fast Data PlatformLightbend Fast Data Platform
Lightbend Fast Data Platform
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and Elasticsearch
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesUnderstanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
 
How to manage large amounts of data with akka streams
How to manage large amounts of data with akka streamsHow to manage large amounts of data with akka streams
How to manage large amounts of data with akka streams
 
To Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT ToTo Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT To
 
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
 
SF Front End Developers - Ember + D3
SF Front End Developers - Ember + D3SF Front End Developers - Ember + D3
SF Front End Developers - Ember + D3
 
Fluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker containerFluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker container
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix
 
Akka Streams And Kafka Streams: Where Microservices Meet Fast Data
Akka Streams And Kafka Streams: Where Microservices Meet Fast DataAkka Streams And Kafka Streams: Where Microservices Meet Fast Data
Akka Streams And Kafka Streams: Where Microservices Meet Fast Data
 

En vedette

Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelinesLars Albertsson
 
Character Encoding - MySQL DevRoom - FOSDEM 2015
Character Encoding - MySQL DevRoom - FOSDEM 2015Character Encoding - MySQL DevRoom - FOSDEM 2015
Character Encoding - MySQL DevRoom - FOSDEM 2015mushupl
 
Data encoding techniques for reducing energyb consumption in network on-chip
Data encoding techniques for reducing energyb consumption in network on-chipData encoding techniques for reducing energyb consumption in network on-chip
Data encoding techniques for reducing energyb consumption in network on-chipLogicMindtech Nologies
 
AWS Lambda: Event-driven Code in the Cloud
AWS Lambda: Event-driven Code in the CloudAWS Lambda: Event-driven Code in the Cloud
AWS Lambda: Event-driven Code in the CloudAmazon Web Services
 
The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)Mathieu Bastian
 
Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?MapR Technologies
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
CCNA
CCNACCNA
CCNAniict
 
Encoding in Data Communication DC8
Encoding in Data Communication DC8Encoding in Data Communication DC8
Encoding in Data Communication DC8koolkampus
 
Asynchronous and synchronous
Asynchronous and synchronousAsynchronous and synchronous
Asynchronous and synchronousAkhil .B
 
Synchronous and-asynchronous-data-transfer
Synchronous and-asynchronous-data-transferSynchronous and-asynchronous-data-transfer
Synchronous and-asynchronous-data-transferAnuj Modi
 
Ccna Presentation
Ccna PresentationCcna Presentation
Ccna Presentationbcdran
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven productsLars Albertsson
 
Docker based Hadoop provisioning - anywhere
Docker based Hadoop provisioning - anywhere Docker based Hadoop provisioning - anywhere
Docker based Hadoop provisioning - anywhere Janos Matyas
 
Provisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & AmbariProvisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & AmbariDataWorks Summit/Hadoop Summit
 
Docker based Hadoop provisioning - Hadoop Summit 2014
Docker based Hadoop provisioning - Hadoop Summit 2014 Docker based Hadoop provisioning - Hadoop Summit 2014
Docker based Hadoop provisioning - Hadoop Summit 2014 Janos Matyas
 
Data Encoding
Data EncodingData Encoding
Data EncodingLuka M G
 

En vedette (20)

Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
 
Character Encoding - MySQL DevRoom - FOSDEM 2015
Character Encoding - MySQL DevRoom - FOSDEM 2015Character Encoding - MySQL DevRoom - FOSDEM 2015
Character Encoding - MySQL DevRoom - FOSDEM 2015
 
Data encoding techniques for reducing energyb consumption in network on-chip
Data encoding techniques for reducing energyb consumption in network on-chipData encoding techniques for reducing energyb consumption in network on-chip
Data encoding techniques for reducing energyb consumption in network on-chip
 
AWS Lambda: Event-driven Code in the Cloud
AWS Lambda: Event-driven Code in the CloudAWS Lambda: Event-driven Code in the Cloud
AWS Lambda: Event-driven Code in the Cloud
 
The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)
 
Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Data encoding
Data encodingData encoding
Data encoding
 
CCNA
CCNACCNA
CCNA
 
Encoding in Data Communication DC8
Encoding in Data Communication DC8Encoding in Data Communication DC8
Encoding in Data Communication DC8
 
Asynchronous and synchronous
Asynchronous and synchronousAsynchronous and synchronous
Asynchronous and synchronous
 
Synchronous and-asynchronous-data-transfer
Synchronous and-asynchronous-data-transferSynchronous and-asynchronous-data-transfer
Synchronous and-asynchronous-data-transfer
 
Ccna Presentation
Ccna PresentationCcna Presentation
Ccna Presentation
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven products
 
Docker based Hadoop provisioning - anywhere
Docker based Hadoop provisioning - anywhere Docker based Hadoop provisioning - anywhere
Docker based Hadoop provisioning - anywhere
 
Provisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & AmbariProvisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & Ambari
 
Docker based Hadoop provisioning - Hadoop Summit 2014
Docker based Hadoop provisioning - Hadoop Summit 2014 Docker based Hadoop provisioning - Hadoop Summit 2014
Docker based Hadoop provisioning - Hadoop Summit 2014
 
Cloudbreak - Technical Deep Dive
Cloudbreak - Technical Deep DiveCloudbreak - Technical Deep Dive
Cloudbreak - Technical Deep Dive
 
Data Encoding
Data EncodingData Encoding
Data Encoding
 

Similaire à Data encoding and Metadata for Streams

Document Databases & RavenDB
Document Databases & RavenDBDocument Databases & RavenDB
Document Databases & RavenDBBrian Ritchie
 
(BDT209) Launch: Amazon Elasticsearch For Real-Time Data Analytics
(BDT209) Launch: Amazon Elasticsearch For Real-Time Data Analytics(BDT209) Launch: Amazon Elasticsearch For Real-Time Data Analytics
(BDT209) Launch: Amazon Elasticsearch For Real-Time Data AnalyticsAmazon Web Services
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
Data saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewData saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewRiccardo Zamana
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCSheetal Dolas
 
Public private hybrid - cmdb challenge
Public private hybrid - cmdb challengePublic private hybrid - cmdb challenge
Public private hybrid - cmdb challengeryszardsshare
 
SQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 QuestionsSQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 QuestionsMike Broberg
 
Sherlock Homepage (Maarten Balliauw)
Sherlock Homepage (Maarten Balliauw)Sherlock Homepage (Maarten Balliauw)
Sherlock Homepage (Maarten Balliauw)Visug
 
Sherlock Homepage - A detective story about running large web services (VISUG...
Sherlock Homepage - A detective story about running large web services (VISUG...Sherlock Homepage - A detective story about running large web services (VISUG...
Sherlock Homepage - A detective story about running large web services (VISUG...Maarten Balliauw
 
Apache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsApache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsANKIT GUPTA
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About Jesus Rodriguez
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdbjixuan1989
 
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)Pat Patterson
 
Analyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudAnalyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudRobert Dempsey
 
The Right Data for the Right Job
The Right Data for the Right JobThe Right Data for the Right Job
The Right Data for the Right JobEmily Curtin
 
MyHeritage Cassandra meetup 2016
MyHeritage Cassandra meetup 2016MyHeritage Cassandra meetup 2016
MyHeritage Cassandra meetup 2016Ran Peled
 

Similaire à Data encoding and Metadata for Streams (20)

Document Databases & RavenDB
Document Databases & RavenDBDocument Databases & RavenDB
Document Databases & RavenDB
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
(BDT209) Launch: Amazon Elasticsearch For Real-Time Data Analytics
(BDT209) Launch: Amazon Elasticsearch For Real-Time Data Analytics(BDT209) Launch: Amazon Elasticsearch For Real-Time Data Analytics
(BDT209) Launch: Amazon Elasticsearch For Real-Time Data Analytics
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Data saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewData saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overview
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOC
 
Public private hybrid - cmdb challenge
Public private hybrid - cmdb challengePublic private hybrid - cmdb challenge
Public private hybrid - cmdb challenge
 
SQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 QuestionsSQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 Questions
 
Sherlock Homepage (Maarten Balliauw)
Sherlock Homepage (Maarten Balliauw)Sherlock Homepage (Maarten Balliauw)
Sherlock Homepage (Maarten Balliauw)
 
Sherlock Homepage - A detective story about running large web services (VISUG...
Sherlock Homepage - A detective story about running large web services (VISUG...Sherlock Homepage - A detective story about running large web services (VISUG...
Sherlock Homepage - A detective story about running large web services (VISUG...
 
Apache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsApache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analytics
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
 
JSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge GraphsJSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge Graphs
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
 
Analyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudAnalyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The Cloud
 
The Right Data for the Right Job
The Right Data for the Right JobThe Right Data for the Right Job
The Right Data for the Right Job
 
MyHeritage Cassandra meetup 2016
MyHeritage Cassandra meetup 2016MyHeritage Cassandra meetup 2016
MyHeritage Cassandra meetup 2016
 

Plus de univalence

Scala pour le Data Eng
Scala pour le Data EngScala pour le Data Eng
Scala pour le Data Engunivalence
 
Spark-adabra, Comment Construire un DATALAKE ! (Devoxx 2017)
Spark-adabra, Comment Construire un DATALAKE ! (Devoxx 2017) Spark-adabra, Comment Construire un DATALAKE ! (Devoxx 2017)
Spark-adabra, Comment Construire un DATALAKE ! (Devoxx 2017) univalence
 
7 key recipes for data engineering
7 key recipes for data engineering7 key recipes for data engineering
7 key recipes for data engineeringunivalence
 
7 key recipes for data engineering
7 key recipes for data engineering7 key recipes for data engineering
7 key recipes for data engineeringunivalence
 
Beyond tabular data
Beyond tabular dataBeyond tabular data
Beyond tabular dataunivalence
 
Introduction aux Macros
Introduction aux MacrosIntroduction aux Macros
Introduction aux Macrosunivalence
 
Big data forever
Big data foreverBig data forever
Big data foreverunivalence
 

Plus de univalence (7)

Scala pour le Data Eng
Scala pour le Data EngScala pour le Data Eng
Scala pour le Data Eng
 
Spark-adabra, Comment Construire un DATALAKE ! (Devoxx 2017)
Spark-adabra, Comment Construire un DATALAKE ! (Devoxx 2017) Spark-adabra, Comment Construire un DATALAKE ! (Devoxx 2017)
Spark-adabra, Comment Construire un DATALAKE ! (Devoxx 2017)
 
7 key recipes for data engineering
7 key recipes for data engineering7 key recipes for data engineering
7 key recipes for data engineering
 
7 key recipes for data engineering
7 key recipes for data engineering7 key recipes for data engineering
7 key recipes for data engineering
 
Beyond tabular data
Beyond tabular dataBeyond tabular data
Beyond tabular data
 
Introduction aux Macros
Introduction aux MacrosIntroduction aux Macros
Introduction aux Macros
 
Big data forever
Big data foreverBig data forever
Big data forever
 

Dernier

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 

Dernier (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 

Data encoding and Metadata for Streams

  • 1. Data encoding and metadata for streams
  • 2.
  • 3. > Introduction Me at a glance • My name is Jonathan Winandy (@ahoy_jon). • I am a Data pipeline engineer : • I worked on a “DataLake” ! • I use tools in the larger Java ecosystem like Java, Scala, Clojure, Hadoop … • And I am an “entrepreneur”.
  • 4. > Introduction I cofounded two companies and they use streams as their data backbone. Health care oriented software engineering. Provide : Coordination for health care professionals.
  • 5. I cofounded two companies and they use streams as their data backbone. @PrimaticeData “Good dataviz, surreal backends.” > Introduction Provide : Tools and methods for Data capitalisation.
  • 6. > Introduction What are Streams ? It’s an abstract data structure with the following : operations : • append(bytes) -> void? • readAt(int) -> null | bytes rule 1 : ∀p ∈ ℕ, for some definition of ‘==‘ x := readAt(p) y := readAt(p) ! x != null => x == y Rule 1 implies : Infinite cacheability once the data is available at a position.
  • 7. Streams are the simplest way to manage data. 0 1 2 3 4 5 6 And they are naturally compatible with the perception of information from a singular observer … > Introduction
  • 8. But be careful, streams are definitely not like queues, ESB, EAI, or what ever messaging solution comes to mind …
  • 9. > Introduction There is a lot to tell on Streams • Sub events : Events are pre-projected into … • Quantum of action : A ‘user’ action generates zero or one event (no more). • Structural sharing for large payload (cf. Content Addressable Storage). • Garbage collection for append only data structures. this presentation ! • Causality enforcement in asynchronous contexts : On important request, causality is enforced. • Binary encoding and Metadata.
  • 10. > Introduction A quick note on Causality If you don’t ensure causality for web apps, some strange comportements may arise : Sometimes, as a user, I cannot see my own “edits”. Sometimes, as a client, I cannot buy on the website after I checkout my basket. APP APP “Who is the fastest between the Data bus and the client ?” You don’t want to bet, especially under load.
  • 11. Data encoding and metadata for streams
  • 12. > Content Content : • Data encoding • Identity • Metadata • Datagram • Conclusion
  • 13. > Data encoding State of data encodings in the industry • As always worse is considered better. • Most of streams have data encoded in : • CSV/TSV • JSON • Platform specific serialisations (eg: Java serialisation, Kryo)
  • 14. > Data encoding Why this is important ? • Some streams may contains very large amount of Data, the chosen encoding must be cpu and space efficient. • Streams are processed by many programs, and many intermediaries, for many years, the chosen encoding must be processable in a generic way.
  • 15. JSON is the lower denominator Plus : • It reaches the browser, you can produce and consume data from inside a web page. A lot of Cons : • Inefficient, • No dates, no proper numerics, • Very basic data structures, • Very error Prone. We all need JSON, but we should use it only when we can't avoid it. > Data encoding Eg : In our databases, we can avoid JSONs ;)
  • 16. > Data encoding How bad JSON is ? {“name”:"Bob", “age":11, "gender":"Male"} 39 Bytes for 10 Bytes of data :02:06:62:6f:62:02:16:02:da:01
  • 17. > Data encoding relevant ones popular binaries low tech cognitect “papa ?!” Avro Thrift Proto Buf JSON CSV Fressian Transit EDN XML RDF binary YES NO YES OK NO ?? generic YES ?? NO YES YES YES schema based YES NO YES NO ?? meta specific encoding S” YES OK Literal YES NO “STRING s reach the browser YES NO +++++ OK NO YES OK easy ? NO I PASS “true” YEP HUM ? … safe ? YES HUM? NO NO MISM ATCH <! YES has dates? Soon NO NO YES YES
  • 18. Identity > Identity • Most mechanism around stream assure an “at most once delivery”. • An identity definition is necessary to ensure idempotency.
  • 19. > Identity There are 2 ways to refer to a message : • with a fingerprint calculated from the message (digest). • with an external identifier (like UUIDs).
  • 20. > Identity UUIDs allow : F0991FD1-D58A-4A5F-8D13-903F368882D1 8AA5C612-B365-4F8F-AF3F-DF623E1F6B22 93A87D37-0658-47C9-84F6-801E83A5821C • to manage things that are not encoded yet. • to avoid the hashing and the parsing of payloads. Recommandation : add an UUID (128bits) to every elements of the stream.
  • 21. Metadata > Metadata • Metadata uses range from the very useful (like http headers) to the very meta meta[1]. • Metadata on Stream elements is most of the time implicit, like for example the Content-Type : • “It’s a stream of JSONs” then every element of the stream has “content-type=application/json”. [1] I am looking at you RDF !
  • 22. What kind of metadata there are for streams element ? • Content-type or data-encoding : e.g. : application/json • Type or Profile : indicate that the given element is an instance of a given type. e.g. : domain.model.MessageSent • Provenance information : e.g. : {“env”:”test”, “application”:{“name”:”webapp”, “version”:{“commit”:”68546ca…”}}} > Metadata
  • 23. > Metadata A quick note on provenance The provenance is practical in distributed systems we want to know : • from which node do a element comes. • on the behalf of which agent this element is created. • from which environment[1] a element comes. [1] with new architecture and Data Labs, environments are sometimes shared on the same infrastructure (eg : no Pre-Production platform). It’s then very useful to safeguard against the pollution of data.
  • 24. > Metadata { "content-type":"application/json", "profile":"domain.model.MessageSent", "provenance":{ "application":{ "name":"webapp", "version":"68546ca6e963981a8279aa327cc1e1362d15554e" }, "node":{ "environement":"test", "network":{ "interface":{ "en0":{ "addresses":{ "192.168.0.13":{ "family":"inet", "netmask":"255.255.255.0", "broadcast":"192.168.0.255" } } } } }, "hostname":["Blaze"], "platform_family":"mac_os_x" } } } • The metadata of an element can represent a significant piece of data. Sometimes more than the data itself. • !! The same piece of metadata can be shared across many elements. !!
  • 25. > Datagram Anatomy of an element :ID :HEADERS :BODY DB7D919B-248F-4676- 8494-2698B48C69C3 57158663-5933-4CE6- A54E-8179ECFBFCCA [“ich”,“bin”,“ein”,“JSON”] e.g.
  • 26. > Datagram 1. Create and register your headers (in a distributed Key/Store for example) . 4813EDF2-B04E-4B70- { AB04-0F9EA456E032 "content-type":"application/json", "profile":"domain.model.MessageSent", "provenance":{ "application":{ "name":"webapp", "version":"68546ca6e963981a8279aa327cc1e1362d15554e" }, "node":{ "environement":"test", "network":{ "interface":{ "en0":{ "addresses":{ "192.168.0.13":{ "family":"inet", "netmask":"255.255.255.0", "broadcast":"192.168.0.255" } } } } }, "hostname":["Blaze"], "platform_family":"mac_os_x" } } }
  • 27. > Datagram 2. use it in your stream ! 5462E738-ABAA-452F- 87E0-FD38AEB9DF81 4813EDF2-B04E-4B70- AB04-0F9EA456E032 {"cid": {"idStr": "498683D2-1192-4794-8C23-5BE49EEEC763"}, "userId": {"idStr": "BC3D8614-AF1F-48C8-B91F-0D907FD0FAF3"}, "content": " Contenu de message de test"} 81C76676-7B19-428E-8 56D-984BB67287D1 4813EDF2-B04E-4B70- AB04-0F9EA456E032 {"cid": {"idStr": "498683D2-1192-4794-8C23-5BE49EEEC763"}, "userId": {"idStr": "BC3D8614-AF1F-48C8-B91F-0D907FD0FAF3"}, "content": " Contenu de message de test”}
  • 28. > Datagram Ho : You can have also have a stream of headers … 4813EDF2-B04E-4B70- AB04-0F9EA456E032 :HEADERS 4813EDF2-B04E-4B70- AB04-0F9EA456E032 5462E738-ABAA-452F- 87E0-FD38AEB9DF81 4813EDF2-B04E-4B70- AB04-0F9EA456E032 81C76676-7B19-428E-85 6D-984BB67287D1 4813EDF2-B04E-4B70- AB04-0F9EA456E032 69DFC711-9D21-4DD6- A51D-C04A7A6E20A9 0 1 2
  • 29. > Conclusion If you don’t yet use streams instead of databases, start to use one next Monday (even with JSON and no headers…). If you do already use streams … Well, you know what to do ! ;)
  • 30.
  • 31.
  • 32.
  • 33. Bonus :What is a CAS ? A Content Adressable Storage is a specific “key value store” : operations : • store(bytes) -> key • get(key) -> null | bytes rule 1 : key = h(data) h being a cryptographic hash function like md5 or sha1. rule 2 : ∀data get(store(data)) = data Rule 1 and 2 imply : Infinite cacheability and scalability.
  • 34. Exemple of architectures CLASSICAL APP APP DB WITH STREAMS APP APP append broadcast
  • 35. The broadcast mechanism is equivalent to Exemple of architectures a db replication mechanism. CLASSICAL APP REPLICATION (BIN/LOG) APP APP DB DB WITH STREAMS APP APP APP append broadcast