SlideShare a Scribd company logo
1 of 31
Download to read offline
Entity-Centric Indexing
Mark Harwood @elasticmark
4/6/2015
www.elastic.co
2
(or “when aggregations don’t cut it”)
Entity-centric indexes
www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing
without written permission is strictly prohibited
3
A typical “event-centric” deployment
Time-based event indexesEvent stream
www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing
without written permission is strictly prohibited
4
Problem: some aggregations are expensive
We need to join all event-level data together at query-time.
?Using web server log data,
answer the question:
"how long on average do
customers spend on my site?"
!
www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing
without written permission is strictly prohibited
5
How to cripple elasticsearch with a bucket explosion:
1. Ask a question about values that needs to be derived from multiple
documents (e.g. deriving a web session’s duration)
2. Make the joining key a high cardinality field e.g. something like “IP
address”
3. Extra points if you use no routing of your documents so that related
content is spray-gunned across multiple shards
www.elastic.co
6
A “pay-as-you-go” model to the
costs of fusing data
Solution
www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing
without written permission is strictly prohibited
7
Solution: an “entity-centric” model
Usual stream of events
Time-based event indexes
Entity-based summary indexes
Periodic extracts sorted
by entity ID and time
www.elastic.co
8
• WebSessions
• "how long on average do my customers spend on my site?”
• “which users behave like bots?”
• “what is the most common exit page?”
• Bank Accounts
• "Does this new payment match the typical spending behaviour of bank account X?”
Entity-centric queries
www.elastic.co
9
• Buyers
• "What do the users who bought product X also buy?”
• “Which buyers behave like ‘shills’ and who are they promoting?”
• Cars
• “Which cars drove long distances after failing a road worthiness test?”
Entity-centric queries
www.elastic.co
10
Web log analytics
Use case
www.elastic.co
11
• Analyses website traffic for retailers and manufacturers in the automotive
industry
• Summarising many behaviours over time e.g.
• unique numbers of visitors per month
• engagement: average session durations
• Faced scaling issues producing some results from raw events
Use case: GFORCES
www.elastic.co
12
• Data store contains 150m events generated by 26m user sessions
• Event-centric aggregations were taking ~25 seconds
• Equivalent entity-centric aggregations take <50ms
• Simplified queries for common entry pages, common exit pages etc
Results of moving to entity-centric indexing
www.elastic.co
13
Amazon marketplace reviews -
building profiles for reviewers
Worked example
Play	
  along!	
  Code	
  +	
  data	
  here:	
  bit.ly/entcent
www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing
without written permission is strictly prohibited
14
An “entity-centric” model
AmazonReviews
(an event-centric index)
reviews.csv loadEvents.sh
Review event fields
• rating
• seller
• reviewer
• date
AmazonReviewers
(an entity-centric index)
buildEntities.sh
• Drops and creates reviewers index.
• Uses Python client to query and scroll list of
reviews sorted by reviewerId and time
• Python pushes _update requests to ~400k
“Reviewer” documents each containing
bundles of their recent reviews using bulk
indexing API
• Shard-side Groovy script collapses the
multiple reviews into a single reviewer JSON
document summarising behaviour
Reviewer entity fields
• positivity
• num sellers reviewed
• last 50 reviews
• profile (“newbie”, “fanboy” etc)
www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing
without written permission is strictly prohibited
15
Anatomy of an entity indexing groovy script
Initialize	
  if	
  new	
  document
Loop	
  to	
  consolidate	
  latest	
  events
Re-­‐run	
  risk	
  profile	
  logic	
  
Load	
  stored	
  state
Store	
  the	
  script	
  in	
  ES_HOME/config/scripts/foo.groovy
www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing
without written permission is strictly prohibited
16
Insight: which sellers have a lot of fanboys?
Seller	
  #187	
  has	
  more	
  than	
  his	
  
fair	
  share	
  of	
  “fanboy”	
  reviewers	
  
…
www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing
without written permission is strictly prohibited
17
Drilling down into seller #187’s fanboys
Suspiciously	
  
synchronised	
  
behaviour
www.elastic.co
18
UK 2013 car road worthiness tests
Worked example
www.elastic.co
19
• In the UK all vehicles must pass an annual roadworthiness test, called an MOT
(named after the Ministry of Transport)
• It is illegal to drive a car that has failed an MOT (unless driving home from a
test or to a repair centre)
• Taxis and other forms of public transport have to be tested more frequently -
every 6 months.
• All data is freely available from data.gov.uk but with anonymised vehicle ID and
inexact test locations.
Example background
www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing
without written permission is strictly prohibited
20
Example background
MOTs
mots.csv loadMOTs.sh
Cars
buildEntities.sh
• Drops and creates mots
index.
• Uses Python client to
bulk load all 37m road
worthiness test results
for 2013 (data source
http://data.gov.uk/
• Drops and creates cars index.
• Registers CarProfileUpdater.groovy as a
stored script
• Uses Python client to query and scroll list of
mot test results sorted by vehicle ID and
time
• Python pushes _update requests to ~27m
“Car” documents each containing bundles
of related MOT test results using bulk
indexing API
• Shard-side Groovy script collapses the
multiple tests into a single summary JSON
document for a car, deriving summaries eg
MOT event fields
• result (pass/fail)
• vehicle ID
• Make + model +
age
• mileage
• test date
• test location
Car entity fields
• Make + model + age
• last test result, date, location
• miles driven while failed
• days between fail and fix
• complete test history
• suspected bad mileometer
readings
www.elastic.co
21
Car attributes derived from 3 test result documents
Data fusion logic
1
2
3
Test	
  date
Mile-­‐o-­‐meter	
  reading
daysForFix
badReading?
milesDrivenAfterFailure
mile-o-meterRewind
www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing
without written permission is strictly prohibited
22
Insight: who is driving failed vehicles?
Q: Why is there an
unexpected peak in
milesDrivenWithFailure
around 6-months?
A: Taxis
www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing
without written permission is strictly prohibited
23
Insight: Taxis keep on trucking after failures..
www.elastic.co
24
A user-centric index as a
recommendation engine
Recycling user behaviours
www.elastic.co
25
• A public dataset* of 10m movie ratings made by 71k users
• One elasticsearch document per user with a list of their
movie ratings
Movielens data
Example background
*	
  http://files.grouplens.org/datasets/movielens/ml-­‐10m-­‐README.html
www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing
without written permission is strictly prohibited
26
“Uncommonly common”user behaviours
www.elastic.co
27
Conclusions
www.elastic.co
28
• Efficient and simple queries
• Advanced analytics/insights
• Can provide a cheaper data retention policy (daily->weekly->monthly roll-ups)
• Can reuse existing elasticsearch APIs or build entity documents using external
technologies
Entity centric indexing: Advantages
www.elastic.co
29
• Avoid “fat entities”
• Use forgetful collections: Priority queues, circular buffers, HyperLogLog
• Avoid pointless updates
• Use ctx.op=“none” to avoid writes of insignificant changes
• Consider options for reducing event volumes:
• Use of aggregations in gathering events
• Reduce related events in event-gathering script that issues updates
• Parallelise the pull of event information
Entity centric indexing: tips
www.elastic.co
30
• Incremental entity updates can be achieved by querying all events since the
timestamp of the last run
• Data integrity - implement policies for:
• handling any failures in performing entity updates
• retiring old entities (use of TTL?)
Entity centric indexing
www.elastic.co
31
@elasticmark
Questions?

More Related Content

Similar to Mark Harwood - Building Entity Centric Indexes - NoSQL matters Dublin 2015

Colman Hackathon Webhose.io API Reference
Colman Hackathon Webhose.io API ReferenceColman Hackathon Webhose.io API Reference
Colman Hackathon Webhose.io API ReferenceOhad Flinker
 
Core web Vitals: Web Performance and Usability
Core web Vitals: Web Performance and UsabilityCore web Vitals: Web Performance and Usability
Core web Vitals: Web Performance and UsabilityIngo Steinke
 
10 Commonly Missed SEO Opportunities For Wordpress Awesomeness
10 Commonly Missed SEO Opportunities For Wordpress Awesomeness10 Commonly Missed SEO Opportunities For Wordpress Awesomeness
10 Commonly Missed SEO Opportunities For Wordpress AwesomenessJason White
 
Comment transformer vos données en informations exploitables
Comment transformer vos données en informations exploitablesComment transformer vos données en informations exploitables
Comment transformer vos données en informations exploitablesElasticsearch
 
Website Parameters.pptx
Website Parameters.pptxWebsite Parameters.pptx
Website Parameters.pptxASHAVI2
 
How Tracking Companies Circumvented Ad Blockers Using WebSockets
How Tracking Companies Circumvented Ad Blockers Using WebSocketsHow Tracking Companies Circumvented Ad Blockers Using WebSockets
How Tracking Companies Circumvented Ad Blockers Using WebSocketsSajjad "JJ" Arshad
 
Core Web Vitals - Why You Need to Pay Attention
Core Web Vitals - Why You Need to Pay AttentionCore Web Vitals - Why You Need to Pay Attention
Core Web Vitals - Why You Need to Pay AttentionTAC Marketing Group
 
SUGCON NA 2023 - Crafting Lightning Fast Composable Experiences.pptx
SUGCON NA 2023 - Crafting Lightning Fast Composable Experiences.pptxSUGCON NA 2023 - Crafting Lightning Fast Composable Experiences.pptx
SUGCON NA 2023 - Crafting Lightning Fast Composable Experiences.pptxVasiliy Fomichev
 
Increase Profits with Better Vehicle Listing Data
Increase Profits with Better Vehicle Listing DataIncrease Profits with Better Vehicle Listing Data
Increase Profits with Better Vehicle Listing DataConnotate
 
Comment transformer vos données en informations exploitables
Comment transformer vos données en informations exploitablesComment transformer vos données en informations exploitables
Comment transformer vos données en informations exploitablesElasticsearch
 
Transforming data into actionable insights
Transforming data into actionable insightsTransforming data into actionable insights
Transforming data into actionable insightsElasticsearch
 
Understanding Google Analytics: Who's Oggling My Company
Understanding Google Analytics: Who's Oggling My CompanyUnderstanding Google Analytics: Who's Oggling My Company
Understanding Google Analytics: Who's Oggling My CompanyKimberly Swetland
 
Performance Engineering - how to start!
Performance Engineering - how to start!Performance Engineering - how to start!
Performance Engineering - how to start!Yoav Weiss
 
Cómo transformar los datos en análisis con los que tomar decisiones
Cómo transformar los datos en análisis con los que tomar decisionesCómo transformar los datos en análisis con los que tomar decisiones
Cómo transformar los datos en análisis con los que tomar decisionesElasticsearch
 
WordPress Theme Performance - WP Vienna meetup 8.6.2016
WordPress Theme Performance - WP Vienna meetup 8.6.2016WordPress Theme Performance - WP Vienna meetup 8.6.2016
WordPress Theme Performance - WP Vienna meetup 8.6.2016jancbeck
 
Eli Stull STPCon Spring 2017 Keynote
Eli Stull STPCon Spring 2017 KeynoteEli Stull STPCon Spring 2017 Keynote
Eli Stull STPCon Spring 2017 KeynoteApica
 
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼Elasticsearch
 
Web Application Performance from User Perspective
Web Application Performance from User PerspectiveWeb Application Performance from User Perspective
Web Application Performance from User PerspectiveŁódQA
 
Browser Diagnostics using dynatrace Ajax Edition
Browser Diagnostics using dynatrace Ajax EditionBrowser Diagnostics using dynatrace Ajax Edition
Browser Diagnostics using dynatrace Ajax EditionDeepak Kaul
 
Adobe Digital Analytics - SiteCatalyst, Test & Target Workshop
Adobe Digital Analytics - SiteCatalyst, Test & Target WorkshopAdobe Digital Analytics - SiteCatalyst, Test & Target Workshop
Adobe Digital Analytics - SiteCatalyst, Test & Target WorkshopDigital Vidya
 

Similar to Mark Harwood - Building Entity Centric Indexes - NoSQL matters Dublin 2015 (20)

Colman Hackathon Webhose.io API Reference
Colman Hackathon Webhose.io API ReferenceColman Hackathon Webhose.io API Reference
Colman Hackathon Webhose.io API Reference
 
Core web Vitals: Web Performance and Usability
Core web Vitals: Web Performance and UsabilityCore web Vitals: Web Performance and Usability
Core web Vitals: Web Performance and Usability
 
10 Commonly Missed SEO Opportunities For Wordpress Awesomeness
10 Commonly Missed SEO Opportunities For Wordpress Awesomeness10 Commonly Missed SEO Opportunities For Wordpress Awesomeness
10 Commonly Missed SEO Opportunities For Wordpress Awesomeness
 
Comment transformer vos données en informations exploitables
Comment transformer vos données en informations exploitablesComment transformer vos données en informations exploitables
Comment transformer vos données en informations exploitables
 
Website Parameters.pptx
Website Parameters.pptxWebsite Parameters.pptx
Website Parameters.pptx
 
How Tracking Companies Circumvented Ad Blockers Using WebSockets
How Tracking Companies Circumvented Ad Blockers Using WebSocketsHow Tracking Companies Circumvented Ad Blockers Using WebSockets
How Tracking Companies Circumvented Ad Blockers Using WebSockets
 
Core Web Vitals - Why You Need to Pay Attention
Core Web Vitals - Why You Need to Pay AttentionCore Web Vitals - Why You Need to Pay Attention
Core Web Vitals - Why You Need to Pay Attention
 
SUGCON NA 2023 - Crafting Lightning Fast Composable Experiences.pptx
SUGCON NA 2023 - Crafting Lightning Fast Composable Experiences.pptxSUGCON NA 2023 - Crafting Lightning Fast Composable Experiences.pptx
SUGCON NA 2023 - Crafting Lightning Fast Composable Experiences.pptx
 
Increase Profits with Better Vehicle Listing Data
Increase Profits with Better Vehicle Listing DataIncrease Profits with Better Vehicle Listing Data
Increase Profits with Better Vehicle Listing Data
 
Comment transformer vos données en informations exploitables
Comment transformer vos données en informations exploitablesComment transformer vos données en informations exploitables
Comment transformer vos données en informations exploitables
 
Transforming data into actionable insights
Transforming data into actionable insightsTransforming data into actionable insights
Transforming data into actionable insights
 
Understanding Google Analytics: Who's Oggling My Company
Understanding Google Analytics: Who's Oggling My CompanyUnderstanding Google Analytics: Who's Oggling My Company
Understanding Google Analytics: Who's Oggling My Company
 
Performance Engineering - how to start!
Performance Engineering - how to start!Performance Engineering - how to start!
Performance Engineering - how to start!
 
Cómo transformar los datos en análisis con los que tomar decisiones
Cómo transformar los datos en análisis con los que tomar decisionesCómo transformar los datos en análisis con los que tomar decisiones
Cómo transformar los datos en análisis con los que tomar decisiones
 
WordPress Theme Performance - WP Vienna meetup 8.6.2016
WordPress Theme Performance - WP Vienna meetup 8.6.2016WordPress Theme Performance - WP Vienna meetup 8.6.2016
WordPress Theme Performance - WP Vienna meetup 8.6.2016
 
Eli Stull STPCon Spring 2017 Keynote
Eli Stull STPCon Spring 2017 KeynoteEli Stull STPCon Spring 2017 Keynote
Eli Stull STPCon Spring 2017 Keynote
 
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
 
Web Application Performance from User Perspective
Web Application Performance from User PerspectiveWeb Application Performance from User Perspective
Web Application Performance from User Perspective
 
Browser Diagnostics using dynatrace Ajax Edition
Browser Diagnostics using dynatrace Ajax EditionBrowser Diagnostics using dynatrace Ajax Edition
Browser Diagnostics using dynatrace Ajax Edition
 
Adobe Digital Analytics - SiteCatalyst, Test & Target Workshop
Adobe Digital Analytics - SiteCatalyst, Test & Target WorkshopAdobe Digital Analytics - SiteCatalyst, Test & Target Workshop
Adobe Digital Analytics - SiteCatalyst, Test & Target Workshop
 

More from NoSQLmatters

Nathan Ford- Divination of the Defects (Graph-Based Defect Prediction through...
Nathan Ford- Divination of the Defects (Graph-Based Defect Prediction through...Nathan Ford- Divination of the Defects (Graph-Based Defect Prediction through...
Nathan Ford- Divination of the Defects (Graph-Based Defect Prediction through...NoSQLmatters
 
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...NoSQLmatters
 
Adrian Colyer - Keynote: NoSQL matters - NoSQL matters Dublin 2015
Adrian Colyer - Keynote: NoSQL matters - NoSQL matters Dublin 2015Adrian Colyer - Keynote: NoSQL matters - NoSQL matters Dublin 2015
Adrian Colyer - Keynote: NoSQL matters - NoSQL matters Dublin 2015NoSQLmatters
 
Peter Bakas - Zero to Insights - Real time analytics with Kafka, C*, and Spar...
Peter Bakas - Zero to Insights - Real time analytics with Kafka, C*, and Spar...Peter Bakas - Zero to Insights - Real time analytics with Kafka, C*, and Spar...
Peter Bakas - Zero to Insights - Real time analytics with Kafka, C*, and Spar...NoSQLmatters
 
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...NoSQLmatters
 
Prassnitha Sampath - Real Time Big Data Analytics with Kafka, Storm & HBase -...
Prassnitha Sampath - Real Time Big Data Analytics with Kafka, Storm & HBase -...Prassnitha Sampath - Real Time Big Data Analytics with Kafka, Storm & HBase -...
Prassnitha Sampath - Real Time Big Data Analytics with Kafka, Storm & HBase -...NoSQLmatters
 
Akmal Chaudhri - How to Build Streaming Data Applications: Evaluating the Top...
Akmal Chaudhri - How to Build Streaming Data Applications: Evaluating the Top...Akmal Chaudhri - How to Build Streaming Data Applications: Evaluating the Top...
Akmal Chaudhri - How to Build Streaming Data Applications: Evaluating the Top...NoSQLmatters
 
Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015
Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015
Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015NoSQLmatters
 
Chris Ward - Understanding databases for distributed docker applications - No...
Chris Ward - Understanding databases for distributed docker applications - No...Chris Ward - Understanding databases for distributed docker applications - No...
Chris Ward - Understanding databases for distributed docker applications - No...NoSQLmatters
 
Philipp Krenn - Host your database in the cloud, they said... - NoSQL matters...
Philipp Krenn - Host your database in the cloud, they said... - NoSQL matters...Philipp Krenn - Host your database in the cloud, they said... - NoSQL matters...
Philipp Krenn - Host your database in the cloud, they said... - NoSQL matters...NoSQLmatters
 
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...NoSQLmatters
 
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015NoSQLmatters
 
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...NoSQLmatters
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...NoSQLmatters
 
David Pilato - Advance search for your legacy application - NoSQL matters Par...
David Pilato - Advance search for your legacy application - NoSQL matters Par...David Pilato - Advance search for your legacy application - NoSQL matters Par...
David Pilato - Advance search for your legacy application - NoSQL matters Par...NoSQLmatters
 
Tugdual Grall - From SQL to NoSQL in less than 40 min - NoSQL matters Paris 2015
Tugdual Grall - From SQL to NoSQL in less than 40 min - NoSQL matters Paris 2015Tugdual Grall - From SQL to NoSQL in less than 40 min - NoSQL matters Paris 2015
Tugdual Grall - From SQL to NoSQL in less than 40 min - NoSQL matters Paris 2015NoSQLmatters
 
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015NoSQLmatters
 
Michael Hackstein - Polyglot Persistence & Multi-Model NoSQL Databases - NoSQ...
Michael Hackstein - Polyglot Persistence & Multi-Model NoSQL Databases - NoSQ...Michael Hackstein - Polyglot Persistence & Multi-Model NoSQL Databases - NoSQ...
Michael Hackstein - Polyglot Persistence & Multi-Model NoSQL Databases - NoSQ...NoSQLmatters
 
Rob Harrop- Key Note The God, the Bad and the Ugly - NoSQL matters Paris 2015
Rob Harrop- Key Note The God, the Bad and the Ugly - NoSQL matters Paris 2015Rob Harrop- Key Note The God, the Bad and the Ugly - NoSQL matters Paris 2015
Rob Harrop- Key Note The God, the Bad and the Ugly - NoSQL matters Paris 2015NoSQLmatters
 
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...NoSQLmatters
 

More from NoSQLmatters (20)

Nathan Ford- Divination of the Defects (Graph-Based Defect Prediction through...
Nathan Ford- Divination of the Defects (Graph-Based Defect Prediction through...Nathan Ford- Divination of the Defects (Graph-Based Defect Prediction through...
Nathan Ford- Divination of the Defects (Graph-Based Defect Prediction through...
 
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...
 
Adrian Colyer - Keynote: NoSQL matters - NoSQL matters Dublin 2015
Adrian Colyer - Keynote: NoSQL matters - NoSQL matters Dublin 2015Adrian Colyer - Keynote: NoSQL matters - NoSQL matters Dublin 2015
Adrian Colyer - Keynote: NoSQL matters - NoSQL matters Dublin 2015
 
Peter Bakas - Zero to Insights - Real time analytics with Kafka, C*, and Spar...
Peter Bakas - Zero to Insights - Real time analytics with Kafka, C*, and Spar...Peter Bakas - Zero to Insights - Real time analytics with Kafka, C*, and Spar...
Peter Bakas - Zero to Insights - Real time analytics with Kafka, C*, and Spar...
 
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
 
Prassnitha Sampath - Real Time Big Data Analytics with Kafka, Storm & HBase -...
Prassnitha Sampath - Real Time Big Data Analytics with Kafka, Storm & HBase -...Prassnitha Sampath - Real Time Big Data Analytics with Kafka, Storm & HBase -...
Prassnitha Sampath - Real Time Big Data Analytics with Kafka, Storm & HBase -...
 
Akmal Chaudhri - How to Build Streaming Data Applications: Evaluating the Top...
Akmal Chaudhri - How to Build Streaming Data Applications: Evaluating the Top...Akmal Chaudhri - How to Build Streaming Data Applications: Evaluating the Top...
Akmal Chaudhri - How to Build Streaming Data Applications: Evaluating the Top...
 
Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015
Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015
Michael Hackstein - NoSQL meets Microservices - NoSQL matters Dublin 2015
 
Chris Ward - Understanding databases for distributed docker applications - No...
Chris Ward - Understanding databases for distributed docker applications - No...Chris Ward - Understanding databases for distributed docker applications - No...
Chris Ward - Understanding databases for distributed docker applications - No...
 
Philipp Krenn - Host your database in the cloud, they said... - NoSQL matters...
Philipp Krenn - Host your database in the cloud, they said... - NoSQL matters...Philipp Krenn - Host your database in the cloud, they said... - NoSQL matters...
Philipp Krenn - Host your database in the cloud, they said... - NoSQL matters...
 
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...
 
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
 
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
 
David Pilato - Advance search for your legacy application - NoSQL matters Par...
David Pilato - Advance search for your legacy application - NoSQL matters Par...David Pilato - Advance search for your legacy application - NoSQL matters Par...
David Pilato - Advance search for your legacy application - NoSQL matters Par...
 
Tugdual Grall - From SQL to NoSQL in less than 40 min - NoSQL matters Paris 2015
Tugdual Grall - From SQL to NoSQL in less than 40 min - NoSQL matters Paris 2015Tugdual Grall - From SQL to NoSQL in less than 40 min - NoSQL matters Paris 2015
Tugdual Grall - From SQL to NoSQL in less than 40 min - NoSQL matters Paris 2015
 
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
 
Michael Hackstein - Polyglot Persistence & Multi-Model NoSQL Databases - NoSQ...
Michael Hackstein - Polyglot Persistence & Multi-Model NoSQL Databases - NoSQ...Michael Hackstein - Polyglot Persistence & Multi-Model NoSQL Databases - NoSQ...
Michael Hackstein - Polyglot Persistence & Multi-Model NoSQL Databases - NoSQ...
 
Rob Harrop- Key Note The God, the Bad and the Ugly - NoSQL matters Paris 2015
Rob Harrop- Key Note The God, the Bad and the Ugly - NoSQL matters Paris 2015Rob Harrop- Key Note The God, the Bad and the Ugly - NoSQL matters Paris 2015
Rob Harrop- Key Note The God, the Bad and the Ugly - NoSQL matters Paris 2015
 
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
 

Recently uploaded

EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 

Recently uploaded (20)

EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 

Mark Harwood - Building Entity Centric Indexes - NoSQL matters Dublin 2015

  • 1. Entity-Centric Indexing Mark Harwood @elasticmark 4/6/2015
  • 2. www.elastic.co 2 (or “when aggregations don’t cut it”) Entity-centric indexes
  • 3. www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing without written permission is strictly prohibited 3 A typical “event-centric” deployment Time-based event indexesEvent stream
  • 4. www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing without written permission is strictly prohibited 4 Problem: some aggregations are expensive We need to join all event-level data together at query-time. ?Using web server log data, answer the question: "how long on average do customers spend on my site?" !
  • 5. www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing without written permission is strictly prohibited 5 How to cripple elasticsearch with a bucket explosion: 1. Ask a question about values that needs to be derived from multiple documents (e.g. deriving a web session’s duration) 2. Make the joining key a high cardinality field e.g. something like “IP address” 3. Extra points if you use no routing of your documents so that related content is spray-gunned across multiple shards
  • 6. www.elastic.co 6 A “pay-as-you-go” model to the costs of fusing data Solution
  • 7. www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing without written permission is strictly prohibited 7 Solution: an “entity-centric” model Usual stream of events Time-based event indexes Entity-based summary indexes Periodic extracts sorted by entity ID and time
  • 8. www.elastic.co 8 • WebSessions • "how long on average do my customers spend on my site?” • “which users behave like bots?” • “what is the most common exit page?” • Bank Accounts • "Does this new payment match the typical spending behaviour of bank account X?” Entity-centric queries
  • 9. www.elastic.co 9 • Buyers • "What do the users who bought product X also buy?” • “Which buyers behave like ‘shills’ and who are they promoting?” • Cars • “Which cars drove long distances after failing a road worthiness test?” Entity-centric queries
  • 11. www.elastic.co 11 • Analyses website traffic for retailers and manufacturers in the automotive industry • Summarising many behaviours over time e.g. • unique numbers of visitors per month • engagement: average session durations • Faced scaling issues producing some results from raw events Use case: GFORCES
  • 12. www.elastic.co 12 • Data store contains 150m events generated by 26m user sessions • Event-centric aggregations were taking ~25 seconds • Equivalent entity-centric aggregations take <50ms • Simplified queries for common entry pages, common exit pages etc Results of moving to entity-centric indexing
  • 13. www.elastic.co 13 Amazon marketplace reviews - building profiles for reviewers Worked example Play  along!  Code  +  data  here:  bit.ly/entcent
  • 14. www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing without written permission is strictly prohibited 14 An “entity-centric” model AmazonReviews (an event-centric index) reviews.csv loadEvents.sh Review event fields • rating • seller • reviewer • date AmazonReviewers (an entity-centric index) buildEntities.sh • Drops and creates reviewers index. • Uses Python client to query and scroll list of reviews sorted by reviewerId and time • Python pushes _update requests to ~400k “Reviewer” documents each containing bundles of their recent reviews using bulk indexing API • Shard-side Groovy script collapses the multiple reviews into a single reviewer JSON document summarising behaviour Reviewer entity fields • positivity • num sellers reviewed • last 50 reviews • profile (“newbie”, “fanboy” etc)
  • 15. www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing without written permission is strictly prohibited 15 Anatomy of an entity indexing groovy script Initialize  if  new  document Loop  to  consolidate  latest  events Re-­‐run  risk  profile  logic   Load  stored  state Store  the  script  in  ES_HOME/config/scripts/foo.groovy
  • 16. www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing without written permission is strictly prohibited 16 Insight: which sellers have a lot of fanboys? Seller  #187  has  more  than  his   fair  share  of  “fanboy”  reviewers   …
  • 17. www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing without written permission is strictly prohibited 17 Drilling down into seller #187’s fanboys Suspiciously   synchronised   behaviour
  • 18. www.elastic.co 18 UK 2013 car road worthiness tests Worked example
  • 19. www.elastic.co 19 • In the UK all vehicles must pass an annual roadworthiness test, called an MOT (named after the Ministry of Transport) • It is illegal to drive a car that has failed an MOT (unless driving home from a test or to a repair centre) • Taxis and other forms of public transport have to be tested more frequently - every 6 months. • All data is freely available from data.gov.uk but with anonymised vehicle ID and inexact test locations. Example background
  • 20. www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing without written permission is strictly prohibited 20 Example background MOTs mots.csv loadMOTs.sh Cars buildEntities.sh • Drops and creates mots index. • Uses Python client to bulk load all 37m road worthiness test results for 2013 (data source http://data.gov.uk/ • Drops and creates cars index. • Registers CarProfileUpdater.groovy as a stored script • Uses Python client to query and scroll list of mot test results sorted by vehicle ID and time • Python pushes _update requests to ~27m “Car” documents each containing bundles of related MOT test results using bulk indexing API • Shard-side Groovy script collapses the multiple tests into a single summary JSON document for a car, deriving summaries eg MOT event fields • result (pass/fail) • vehicle ID • Make + model + age • mileage • test date • test location Car entity fields • Make + model + age • last test result, date, location • miles driven while failed • days between fail and fix • complete test history • suspected bad mileometer readings
  • 21. www.elastic.co 21 Car attributes derived from 3 test result documents Data fusion logic 1 2 3 Test  date Mile-­‐o-­‐meter  reading daysForFix badReading? milesDrivenAfterFailure mile-o-meterRewind
  • 22. www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing without written permission is strictly prohibited 22 Insight: who is driving failed vehicles? Q: Why is there an unexpected peak in milesDrivenWithFailure around 6-months? A: Taxis
  • 23. www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing without written permission is strictly prohibited 23 Insight: Taxis keep on trucking after failures..
  • 24. www.elastic.co 24 A user-centric index as a recommendation engine Recycling user behaviours
  • 25. www.elastic.co 25 • A public dataset* of 10m movie ratings made by 71k users • One elasticsearch document per user with a list of their movie ratings Movielens data Example background *  http://files.grouplens.org/datasets/movielens/ml-­‐10m-­‐README.html
  • 26. www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing without written permission is strictly prohibited 26 “Uncommonly common”user behaviours
  • 28. www.elastic.co 28 • Efficient and simple queries • Advanced analytics/insights • Can provide a cheaper data retention policy (daily->weekly->monthly roll-ups) • Can reuse existing elasticsearch APIs or build entity documents using external technologies Entity centric indexing: Advantages
  • 29. www.elastic.co 29 • Avoid “fat entities” • Use forgetful collections: Priority queues, circular buffers, HyperLogLog • Avoid pointless updates • Use ctx.op=“none” to avoid writes of insignificant changes • Consider options for reducing event volumes: • Use of aggregations in gathering events • Reduce related events in event-gathering script that issues updates • Parallelise the pull of event information Entity centric indexing: tips
  • 30. www.elastic.co 30 • Incremental entity updates can be achieved by querying all events since the timestamp of the last run • Data integrity - implement policies for: • handling any failures in performing entity updates • retiring old entities (use of TTL?) Entity centric indexing