Interactive Analytics in Human Time
Supreeth Rao, Sunil Gupta | June 4, 2014
2014 Hadoop Summit, San Jose, California
Interactive – How we see it?
2 Yahoo Confidential & Proprietary
60B events, 3.5TB of compressed data: response in 400ms
Serve an ad and get insights: < 2s
Agenda:
Motivation
Approach
Problem Deep Dive: Instant Overlap
Summary
Questions
Motivation:
Lots of data
Analytics
Data restatement - batch and real time
Human time
Lots of data
~30B advertising events/day
~10s of TB of compressed data/day
Minutes to Year Grain
Multi-quarter data retention
Data Aging
Analytics
Reporting Metrics
Attribution
Multi-level hierarchical computation
Bidding/Targeting optimization
Non-additive computation
Data Restatement
Producer, Consumer
Real time: quick path, fewer checks or reconciliations, typically no lookups
Batch: high-latency path, checks and reconciliations, can have lookbacks and lookups
Human Time
< 1s (99th percentile)
Default time grain (< 300 ms)
Instant overlap (< 60s)
Data ingested, insights available (< 2s)
Approach:
Logical View - Scope
Data Ingestion or Collection
Data Pipelines
Data Warehouse/Analytics and Optimizations
Reporting Application/UI
Layers: Data Collection, Transformations/Aggs, Data Ingestion, Persistence, Runtime Compute, Caching, API, Optional Middleware, Business API, UI
Diagram legend: Impacts, Out of scope
Logical View - Characteristics
Batch processing DAG, Real-time topology, SOX, Traffic protection, Late processing, Retention, Completeness Monitoring, PII cleansing/masking
Compatible with HDFS, Performance (Indexed, Columnar, Compression, Serialization, Flexibility, Concurrency, Grain of data stored)
Distributed/Stand-alone, Caching objects vs caching results
Access to data with group by, order by, etc.; SQL or SQL-like
Translate JSON to SQL (optional)
Logical View - Choices
Hadoop MR/PIG/Oozie (Lotus)/Storm (Trident)
Druid, Shark, Hive, Oracle RAC, MySQL, HBase, Impala
memcached_y, Redis
JSON-REST API; JDBC; ODBC
How we do what we do? Components of Advertising Data Warehouse
Druid
JDBC/ODBC
Data Warehouse-Persistence
Hive
Metrics Store
JSON-API Persistence and run-time compute
Computation and Ingestion
Quick cache (using a database for now)
Upstream: API layer, MSTR, Adhoc access, Identity Service, Ad-Serving manifests
Data Producers: Serving, Scoring, Booking, 3rd/1st Party Data
Real time and batch compute engine (Hadoop/Storm)
Data filtering/transformations: Transformations, format conversions
Custom Algorithms: computing recursive uniques, indexing
Human time, How?
Druid for interactive queries
Storm-Druid for quick ingestion and index
Specialized computation and processing for quicker response
› Sketches
› Feature sequence based overlaps
› Custom indexing
Problem Deep Dive: Instant Overlap
[Venn diagram: Users, with overlapping segments Car commuters, Soccer fans, and Vegans; the intersection is labeled Overlap]
Non-additive
› Requires access to raw (user-level) data to compute non-additive metrics
• Billions of events a day
• TBs of data a day
▪ 1-1 vs 1-n vs few-n
› 1-1: Between car commuters and vegans, what is the overlap?
› 1-n: For car commuters, which are the top overlap groups?
› few-n: For vegans and car commuters, what are the top overlap groups?
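To make the non-additive point concrete, here is a minimal Python sketch. The segment names come from the slides; the user IDs, the `segments` dictionary, and the helper names are assumptions made up for illustration, not part of the original system.

```python
# Hypothetical user-level data: segment -> set of user IDs.
segments = {
    "car_commuter": {"ram", "sam"},
    "soccer_fan": {"ram", "tom", "sam"},
    "vegan": {"tom", "sam"},
}

def overlap(*names):
    """Exact overlap (set intersection) of the named segments."""
    return set.intersection(*(segments[n] for n in names))

# Per-segment counts cannot simply be added to get unique users:
sum_of_counts = sum(len(users) for users in segments.values())
unique_users = len(set.union(*segments.values()))

# 1-1: overlap between two given segments.
one_to_one = overlap("car_commuter", "vegan")

# 1-n: for one segment, rank every other segment by overlap size.
def top_overlaps(name):
    others = (n for n in segments if n != name)
    return sorted(((n, len(overlap(name, n))) for n in others),
                  key=lambda pair: pair[1], reverse=True)
```

The sum of per-segment counts (7) overstates the 3 unique users, which is why the raw identifiers are needed for an exact answer with this approach.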
Re-stating motivation
Given two sets of identifiers, how can we compute exact overlaps in close to real time (< 1 min)?
Overlap is like an AND operation or a set intersection.
Existing Approaches
● Use exact compute paradigms
o Do joins for intersections, which lead to exact results
▪ Hive, PIG, MR can all support efficient joins
▪ Exact but not real time
● Use sketches
o Approximate algorithms
▪ HLL, KMV: accuracy vs size, performance
▪ Approximate, needs heavy performance tuning
▪ Close to real time but not exact
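As a rough illustration of the trade-off between the two approaches, the following Python sketch compares an exact intersection with a textbook-style KMV (k minimum values) estimate. It is not the implementation used in the talk; the hash function (SHA-1), the value of `k`, and the synthetic identifiers are all assumptions.

```python
import hashlib

def h(value):
    """Hash a string to a float in [0, 1) using SHA-1 (an arbitrary choice)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_sketch(items, k):
    """Keep the k smallest distinct hash values of the items."""
    return sorted({h(x) for x in items})[:k]

def kmv_estimate(sketch, k):
    """Estimate the distinct count from a KMV sketch."""
    if len(sketch) < k:          # fewer than k distinct values: count is exact
        return float(len(sketch))
    return (k - 1) / sketch[k - 1]

# Hypothetical identifiers for two segments that share 5,000 users.
a = {f"user{i}" for i in range(0, 10_000)}
b = {f"user{i}" for i in range(5_000, 15_000)}

exact_overlap = len(a & b)       # exact, but needs the raw identifiers

k = 1024
sa, sb = kmv_sketch(a, k), kmv_sketch(b, k)
# Sketch of the union: the k smallest values across both sketches.
su = sorted(set(sa) | set(sb))[:k]
# Inclusion-exclusion gives an approximate overlap.
approx_overlap = kmv_estimate(sa, k) + kmv_estimate(sb, k) - kmv_estimate(su, k)
```

The sketches are small and cheap to combine, but the overlap inherits the error of three separate estimates, which matches the "close to real time but not exact" point above.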
Using Feature Sequences – 1/4
Feature sequence encoding
o Encode the sequence
▪ {Ram} - {car commuter, soccer fan, ...}
▪ {Tom} - {soccer fan, vegan, ...}
▪ {Sam} - {car commuter, soccer fan, vegan, ...}
▪ ...
Using Feature Sequences – 2/4
Eliminate the user from the encoded bitmaps
▪ {car commuter, soccer fan, vegan, ...} - count - c1 - #
▪ {soccer fan, vegan, ...} - count - c2 - #
▪ {car commuter, vegan, ...} - count - c3 - #
Counts become additive now
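One way to picture this step is the Python sketch below. The `user_features` mapping mirrors the example users from the encoding slide, with one extra user ("Pam") added as an assumption so that a sequence has a count above one. The `encode` and `overlap_count` helpers are hypothetical names.

```python
from collections import Counter

# Hypothetical user -> features mapping.
user_features = {
    "Ram": {"car commuter", "soccer fan"},
    "Tom": {"soccer fan", "vegan"},
    "Sam": {"car commuter", "soccer fan", "vegan"},
    "Pam": {"soccer fan", "vegan"},   # assumed extra user sharing Tom's sequence
}

def encode(features):
    """Encode a feature set as a canonical, hashable feature sequence."""
    return tuple(sorted(features))

# Drop the user identifier: one row per distinct sequence, with its user count.
sequence_counts = Counter(encode(f) for f in user_features.values())

def overlap_count(*features):
    """Users having all the given features: a plain sum over matching rows."""
    wanted = set(features)
    return sum(c for seq, c in sequence_counts.items() if wanted <= set(seq))
```

Each user falls into exactly one sequence row, so summing the counts of the matching rows never double-counts anyone. Here `overlap_count("soccer fan", "vegan")` adds the 2 users of one row and the 1 user of another to give 3.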
Using Feature Sequences – 3/4
● Store row qualifications in a bitmap
o Car commuter - Row1, Row3
▪ 1010000000
o Vegan - Row1, Row2, Row3
▪ 1110000000
● Load the bitmap into Druid using a custom indexer
o In-memory or memory-mapped
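The row qualification bitmaps above can be reproduced with a short Python sketch. The three rows are taken from the feature sequences on the previous slide; the fixed width of 10 and the `row_bitmap` helper are assumptions for illustration. The Druid custom indexer itself is not shown.

```python
# Rows are the distinct feature sequences (row numbers start at 1, as on the slide).
rows = [
    {"car commuter", "soccer fan", "vegan"},  # Row1
    {"soccer fan", "vegan"},                  # Row2
    {"car commuter", "vegan"},                # Row3
]

WIDTH = 10  # bitmap width as printed on the slide; an arbitrary fixed size here

def row_bitmap(feature, rows, width=WIDTH):
    """Return a '0'/'1' string whose i-th character is '1' when row i+1 has the feature."""
    bits = ["1" if feature in row else "0" for row in rows]
    return "".join(bits).ljust(width, "0")

car = row_bitmap("car commuter", rows)    # "1010000000"
vegan = row_bitmap("vegan", rows)         # "1110000000"
soccer = row_bitmap("soccer fan", rows)   # "1100000000"
```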
Using Feature Sequences – 4/4
▪ Data structures
› {feature_sequence} -> count
› Feature -> row qualification bitmaps
▪ AND is now an "AND" on bitmaps
› Supported within Druid
› Very efficient
▪ Works alongside topN and groupBys
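Putting the two data structures together, a query could look roughly like the Python sketch below. It uses plain integers as bitmaps rather than Druid's bitmap indexes, and the row counts are made-up numbers; `overlap_count` and `top_overlaps` are hypothetical helpers, not Druid APIs.

```python
# Hypothetical data: one count per feature-sequence row, one bitmap per feature.
# Bit i (least significant first) is set when row i qualifies for the feature.
row_counts = [120, 300, 45]           # made-up user counts for rows 0, 1, 2

bitmaps = {
    "car commuter": 0b101,            # rows 0 and 2
    "soccer fan":   0b011,            # rows 0 and 1
    "vegan":        0b111,            # rows 0, 1 and 2
}

def overlap_count(*features):
    """AND the feature bitmaps, then sum the counts of the surviving rows."""
    mask = -1                         # all bits set: the identity for AND
    for f in features:
        mask &= bitmaps[f]
    return sum(c for i, c in enumerate(row_counts) if (mask >> i) & 1)

def top_overlaps(feature, n=2):
    """1-n query: rank the other features by their overlap with `feature`."""
    others = [f for f in bitmaps if f != feature]
    ranked = sorted(others, key=lambda f: overlap_count(feature, f), reverse=True)
    return [(f, overlap_count(feature, f)) for f in ranked[:n]]
```

Because the counts are additive per row, the exact overlap reduces to a bitwise AND followed by a sum, with no user-level join at query time.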
Comparison with existing algorithm
● 1-n – Bulk Overlap on grid
o 19 hours on grid
o Few-n calls for a re-process
o 1-1 ( <1s)
● Instant Overlap
o < 60s ( pre-processing 3-4 hours)
o Supports “exact” AND
o Flexible ( few-n, 1-n)
o 1-1 ( < 1s)
Summary
● Yahoo’s Advertising Data Warehouse
o Petabyte scale
o Normalized view across many systems
o Analytics and optimizations with specialized algorithms
o Data restatement - batch and real time
o Human time
Thank You
@supreeth_
@_skgupta
We are hiring!
Stop by Kiosk P9
or reach out to us at
bigdata@yahoo-inc.com.
Dimension Flexibility
Many dimensions
Adding new dimensions
Time zones
Time grain
Normalized view across systems
PaidSearch
Display
Native
Programmatic buying and selling
Ad-targeting
Hardware Configs
● High-memory boxes
● SSD preferred
● Savings due to better compression

Contenu connexe

Tendances

Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
 
Introduction to Neo4j
Introduction to Neo4jIntroduction to Neo4j
Introduction to Neo4jNeo4j
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentationTao Feng
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Meta-Prod2Vec: Simple Product Embeddings with Side-Information
Meta-Prod2Vec: Simple Product Embeddings with Side-InformationMeta-Prod2Vec: Simple Product Embeddings with Side-Information
Meta-Prod2Vec: Simple Product Embeddings with Side-Informationrecsysfr
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyKishore Gopalakrishna
 
Apache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise CompanyApache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise CompanyDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...Altinity Ltd
 
Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive w...
Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive w...Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive w...
Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive w...ScyllaDB
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 
Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )
Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )
Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )SANG WON PARK
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Simplilearn
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data EngineeringDurga Gadiraju
 
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationNeo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationTamikaTannis
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data PipelineJesus Rodriguez
 
Introduction To Kibana
Introduction To KibanaIntroduction To Kibana
Introduction To KibanaJen Stirrup
 

Tendances (20)

Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Introduction to Neo4j
Introduction to Neo4jIntroduction to Neo4j
Introduction to Neo4j
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Meta-Prod2Vec: Simple Product Embeddings with Side-Information
Meta-Prod2Vec: Simple Product Embeddings with Side-InformationMeta-Prod2Vec: Simple Product Embeddings with Side-Information
Meta-Prod2Vec: Simple Product Embeddings with Side-Information
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case study
 
Apache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise CompanyApache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise Company
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
 
Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive w...
Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive w...Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive w...
Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive w...
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Graph database
Graph database Graph database
Graph database
 
Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )
Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )
Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationNeo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Introduction To Kibana
Introduction To KibanaIntroduction To Kibana
Introduction To Kibana
 

En vedette

Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxAlex Moundalexis
 
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...David Chen
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Adam Kawa
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...DataWorks Summit
 
Video Transcoding on Hadoop
Video Transcoding on HadoopVideo Transcoding on Hadoop
Video Transcoding on HadoopDataWorks Summit
 
How To Analyze Geolocation Data with Hive and Hadoop
How To Analyze Geolocation Data with Hive and HadoopHow To Analyze Geolocation Data with Hive and Hadoop
How To Analyze Geolocation Data with Hive and HadoopHortonworks
 
Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase DeploymentsMulti-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase DeploymentsDataWorks Summit
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 

En vedette (11)

Something about Kafka - Why Kafka is so fast
Something about Kafka - Why Kafka is so fastSomething about Kafka - Why Kafka is so fast
Something about Kafka - Why Kafka is so fast
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via Linux
 
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
 
Video Transcoding on Hadoop
Video Transcoding on HadoopVideo Transcoding on Hadoop
Video Transcoding on Hadoop
 
How To Analyze Geolocation Data with Hive and Hadoop
How To Analyze Geolocation Data with Hive and HadoopHow To Analyze Geolocation Data with Hive and Hadoop
How To Analyze Geolocation Data with Hive and Hadoop
 
Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase DeploymentsMulti-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments
 
Flume vs. kafka
Flume vs. kafkaFlume vs. kafka
Flume vs. kafka
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 

Similaire à Interactive Analytics in Human Time

June 2014 HUG: Interactive analytics over hadoop
June 2014 HUG: Interactive analytics over hadoopJune 2014 HUG: Interactive analytics over hadoop
June 2014 HUG: Interactive analytics over hadoopYahoo Developer Network
 
Big dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlBig dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlKhanderao Kand
 
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsReal time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsData Con LA
 
Presentation Data Council Meetup: F. Mekkenholt, R. Vlijm
Presentation Data Council Meetup: F. Mekkenholt, R. VlijmPresentation Data Council Meetup: F. Mekkenholt, R. Vlijm
Presentation Data Council Meetup: F. Mekkenholt, R. VlijmAlexander Oppel
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData
 
Big Data Architectures @ JAX / BigDataCon 2016
Big Data Architectures @ JAX / BigDataCon 2016Big Data Architectures @ JAX / BigDataCon 2016
Big Data Architectures @ JAX / BigDataCon 2016Guido Schmutz
 
JEEConf 2015 Big Data Analysis in Java World
JEEConf 2015 Big Data Analysis in Java WorldJEEConf 2015 Big Data Analysis in Java World
JEEConf 2015 Big Data Analysis in Java WorldSerg Masyutin
 
Stream Processing – Concepts and Frameworks
Stream Processing – Concepts and FrameworksStream Processing – Concepts and Frameworks
Stream Processing – Concepts and FrameworksGuido Schmutz
 
Real-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven ApplicationsReal-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven ApplicationsVMware Tanzu
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of MusicLars Albertsson
 
Swift distributed tracing method and tools v2
Swift distributed tracing method and tools v2Swift distributed tracing method and tools v2
Swift distributed tracing method and tools v2zhang hua
 
Transforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big DataTransforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big Dataplumbee
 
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018VMware Tanzu
 
High availability, real-time and scalable architectures
High availability, real-time and scalable architecturesHigh availability, real-time and scalable architectures
High availability, real-time and scalable architecturesJampp
 
Accelerate Your Analytic Queries with Amazon Aurora Parallel Query (DAT362) -...
Accelerate Your Analytic Queries with Amazon Aurora Parallel Query (DAT362) -...Accelerate Your Analytic Queries with Amazon Aurora Parallel Query (DAT362) -...
Accelerate Your Analytic Queries with Amazon Aurora Parallel Query (DAT362) -...Amazon Web Services
 
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive AnalyticsInfochimps, a CSC Big Data Business
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...Amazon Web Services
 
Building High Performance Apps with In-memory Data
Building High Performance Apps with In-memory DataBuilding High Performance Apps with In-memory Data
Building High Performance Apps with In-memory DataAmazon Web Services
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleJim Dowling
 

Similaire à Interactive Analytics in Human Time (20)

June 2014 HUG: Interactive analytics over hadoop
June 2014 HUG: Interactive analytics over hadoopJune 2014 HUG: Interactive analytics over hadoop
June 2014 HUG: Interactive analytics over hadoop
 
Big dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlBig dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosql
 
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsReal time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
 
Presentation Data Council Meetup: F. Mekkenholt, R. Vlijm
Presentation Data Council Meetup: F. Mekkenholt, R. VlijmPresentation Data Council Meetup: F. Mekkenholt, R. Vlijm
Presentation Data Council Meetup: F. Mekkenholt, R. Vlijm
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
 
Big Data Architectures @ JAX / BigDataCon 2016
Big Data Architectures @ JAX / BigDataCon 2016Big Data Architectures @ JAX / BigDataCon 2016
Big Data Architectures @ JAX / BigDataCon 2016
 
JEEConf 2015 Big Data Analysis in Java World
JEEConf 2015 Big Data Analysis in Java WorldJEEConf 2015 Big Data Analysis in Java World
JEEConf 2015 Big Data Analysis in Java World
 
Observability at Spotify
Observability at SpotifyObservability at Spotify
Observability at Spotify
 
Stream Processing – Concepts and Frameworks
Stream Processing – Concepts and FrameworksStream Processing – Concepts and Frameworks
Stream Processing – Concepts and Frameworks
 
Real-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven ApplicationsReal-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven Applications
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of Music
 
Swift distributed tracing method and tools v2
Swift distributed tracing method and tools v2Swift distributed tracing method and tools v2
Swift distributed tracing method and tools v2
 
Transforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big DataTransforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big Data
 
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
 
High availability, real-time and scalable architectures
High availability, real-time and scalable architecturesHigh availability, real-time and scalable architectures
High availability, real-time and scalable architectures
 
Accelerate Your Analytic Queries with Amazon Aurora Parallel Query (DAT362) -...
Accelerate Your Analytic Queries with Amazon Aurora Parallel Query (DAT362) -...Accelerate Your Analytic Queries with Amazon Aurora Parallel Query (DAT362) -...
Accelerate Your Analytic Queries with Amazon Aurora Parallel Query (DAT362) -...
 
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
 
Building High Performance Apps with In-memory Data
Building High Performance Apps with In-memory DataBuilding High Performance Apps with In-memory Data
Building High Performance Apps with In-memory Data
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData Seattle
 

Plus de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Interactive Analytics in Human Time

  • 1. Interactive Analytics in Human Time. Supreeth Rao, Sunil Gupta | June 4, 2014 | 2014 Hadoop Summit, San Jose, California
  • 2. Interactive – How we see it? 60B events, 3.5 TB of compressed data; response in 400 ms; serve an ad and get insights in < 2 s. (Yahoo Confidential & Proprietary)
  • 6. Lots of data; Analytics; Data restatement (batch and real time); Human time
  • 7. Lots of data: ~30B advertising events/day; ~10s of TB of compressed data/day; minutes-to-year grain; multi-quarter data retention; data aging
  • 8. Analytics: reporting metrics; attribution; multi-level hierarchical computation; bidding/targeting optimization; non-additive computation
  • 9. Data Restatement (producer to consumer). Real time: quick path, fewer checks or reconciliations, typically no lookups. Batch: high-latency path, with checks and reconciliations; can have lookbacks and lookups
  • 10. Human Time: < 1 s (99th percentile); default time grain (< 300 ms); instant overlap (< 60 s); data ingested, insights available (< 2 s)
  • 11. Lots of data; Analytics; Data restatement (batch and real time); Human time
  • 13. Logical View - Scope. Layers: Transformations/Aggs, Data Ingestion, Persistence, Runtime Compute, Caching, API, Optional Middleware, Business API, UI, Data Collection. Groupings: Data Ingestion or Collection; Data Pipelines; Data Warehouse/Analytics and Optimizations; Reporting Application/UI. Legend: Impacts; Out of scope
  • 14. Logical View - Characteristics. Layers: Transformations/Aggs, Data Ingestion, Persistence, Runtime Compute, Caching, API, Optional Middleware, Business API, UI, Data Collection. Characteristics: batch processing DAG, real-time topology, SOX, traffic protection, late processing, retention, completeness monitoring, PII cleansing/masking; compatible with HDFS, performance (indexed, columnar, compression, serialization, flexibility, concurrency, grain of data stored); distributed/stand-alone, caching objects vs caching results; access to data with group by, order by, etc., in SQL or SQL-like form; translate JSON to SQL (optional). Legend: Impacts; Out of scope
  • 15. Logical View - Choices. Layers: Transformations/Aggs, Data Ingestion, Persistence, Runtime Compute, Caching, API, Optional Middleware, Business API, UI, Data Collection. Choices: Hadoop MR/Pig/Oozie (Lotus)/Storm (Trident); Druid, Shark, Hive, Oracle RAC, MySQL, HBase, Impala; memcached_y, Redis; JSON-REST API, JDBC, ODBC. Legend: Impacts; Out of scope
  • 16. How we do what we do? Components of the Advertising Data Warehouse: Druid; JDBC/ODBC; Data Warehouse persistence; Hive; Metrics Store; JSON API; persistence and run-time compute; computation and ingestion; quick cache (using a database for now). Upstream: API layer, MSTR, ad hoc access, Identity Service, ad-serving manifests. Data producers: serving, scoring, booking, 3rd/1st-party data. Real-time and batch compute engine (Hadoop/Storm). Data filtering/transformations: transformations, format conversions. Custom algorithms: computing recursive uniques, indexing
  • 17. Human time, how? Druid for interactive queries; Storm-Druid for quick ingestion and indexing; specialized computation and processing for quicker response: sketches, feature-sequence-based overlaps, custom indexing
  • 21. Overlap is non-additive. Computing non-additive metrics requires access to raw (user-level) data: billions of events a day, TBs of data a day. 1-1 vs 1-n vs few-n: What is the overlap between car commuters and vegans? For car commuters, which are the top overlap groups? For vegans and car commuters, what are the top overlap groups?
  • 22. Re-stating motivation: given two sets of identifiers, how can we compute exact overlaps in close to real time (< 1 min)? Overlap is like an AND operation or a set
  • 23. Existing Approaches. (1) Exact compute paradigms: do joins for intersections, which give exact results; Hive, Pig and MR can all support efficient joins; exact but not real time. (2) Sketches: approximate algorithms (HLL, KMV) that trade accuracy against size and performance; approximate, and need heavy performance tuning; close to real time but not exact
  • 24. Using Feature Sequences – 1/4. Feature sequence encoding: encode each user's sequence of features. {Ram} - {car commuter, soccer fan, ...}; {Tom} - {soccer fan, vegan, ...}; {Sam} - {car commuter, soccer fan, vegan, ...}; ...
  • 25. Using Feature Sequences – 2/4. Eliminate the user from the encoded bitmaps, keeping one count per distinct sequence: {car commuter, soccer fan, vegan, ...} - count c1; {soccer fan, vegan, ...} - count c2; {car commuter, vegan, ...} - count c3. Counts become additive now
  • 26. Using Feature Sequences – 3/4. Store row qualifications in a bitmap: car commuter - Row1, Row3 - 1010000000; vegan - Row1, Row2, Row3 - 1110000000. Load the bitmaps into Druid using a custom indexer (in-memory or memory-mapped)
  • 27. Using Feature Sequences – 4/4. Data structures: {feature_sequence} -> count; feature -> row-qualification bitmap. The overlap AND is now an “AND” on bitmaps, which is supported within Druid and very efficient. Works alongside topN and groupBy queries
  • 28. Comparison with existing algorithm. Bulk overlap on grid (1-n): 19 hours on grid; few-n calls for a re-process; 1-1 (< 1 s). Instant Overlap: < 60 s (pre-processing 3-4 hours); supports “exact” AND; flexible (few-n, 1-n); 1-1 (< 1 s)
  • 29. Summary. Yahoo’s Advertising Data Warehouse: petabyte scale; normalized view across many systems; analytics and optimizations with specialized algorithms; data restatement (batch and real time); human time
  • 30. Thank You @supreeth_ @_skgupta We are hiring! Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.
  • 31. Logical View - Scope. Layers: Transformations/Aggs, Data Ingestion, Persistence, Runtime Compute, Caching, API, Optional Middleware, Business API, UI, Data Collection. Groupings: Data Ingestion or Collection; Data Pipelines; Data Warehouse/Analytics and Optimizations; Reporting Application/UI. Legend: Impacts; Out of scope
  • 32. Dimension Flexibility: many dimensions; adding new dimensions; time zones; time grain
  • 33. Normalized view across systems: paid search; display; native; programmatic buying and selling; ad targeting
  • 34. Hardware Configs: high-memory boxes; SSD preferred; savings due to better compression
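
The feature-sequence approach on slides 24 to 27 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Druid indexer: the `users` mapping, the extra user "Pam", and the function names `build_index` and `overlap` are all hypothetical, and a plain Python integer stands in for the compressed bitmaps that Druid would hold.

```python
from collections import Counter

# Hypothetical input: user -> set of features (names taken from slide 24,
# "Pam" added so that one feature sequence has a count greater than 1).
users = {
    "Ram": {"car commuter", "soccer fan"},
    "Tom": {"soccer fan", "vegan"},
    "Sam": {"car commuter", "soccer fan", "vegan"},
    "Pam": {"soccer fan", "vegan"},
}

def build_index(user_features):
    """Collapse users into feature-sequence rows and build per-feature bitmaps."""
    # Step 1: drop the user id and count users per distinct feature sequence.
    counts = Counter(frozenset(f) for f in user_features.values())
    rows = sorted(counts, key=lambda seq: sorted(seq))  # deterministic row order
    row_counts = [counts[seq] for seq in rows]
    # Step 2: one bitmap per feature; bit i is set when row i has the feature.
    bitmaps = {}
    for i, seq in enumerate(rows):
        for feature in seq:
            bitmaps[feature] = bitmaps.get(feature, 0) | (1 << i)
    return row_counts, bitmaps

def overlap(row_counts, bitmaps, *features):
    """Exact number of users having all features: AND bitmaps, sum row counts."""
    if not features:
        raise ValueError("at least one feature is required")
    mask = (1 << len(row_counts)) - 1  # start with every row qualifying
    for feature in features:
        mask &= bitmaps.get(feature, 0)  # unknown feature qualifies no row
    return sum(c for i, c in enumerate(row_counts) if (mask >> i) & 1)

row_counts, bitmaps = build_index(users)
print(overlap(row_counts, bitmaps, "car commuter", "vegan"))
```

Because each user lands in exactly one feature-sequence row, the row counts can be summed without double counting, which is why the overlap stays exact even though user identifiers are no longer stored.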

Editor's notes

  1. Logical view of a typical data-driven application architecture. SOX compliance.
  2. Quick cache store for querying all metrics in a single fetch, to support a one-page-load UI architecture. Hive for scheduled jobs and for ad hoc, long-range generic queries that are not supported on the interactive interface.
  3. Bitmap-indexed, columnar, can operate on compressed bitmaps, distributed
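
For contrast, slide 23 names KMV (k minimum values) as one of the sketch-based alternatives that are fast but approximate. The following is a rough sketch of that idea, assuming SHA-1 as the hash function and k = 256; the helper names are hypothetical and nothing here reflects the tuning used in the talk. Overlap is estimated by inclusion-exclusion, which is where the approximation error comes from.

```python
import hashlib
import heapq

def _hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1) using SHA-1."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_sketch(items, k=256):
    """Return the k smallest distinct hash values of items, in ascending order.

    For brevity this hashes everything up front; a streaming version would
    keep a bounded heap instead.
    """
    return heapq.nsmallest(k, {_hash01(x) for x in items})

def estimate(sketch, k=256):
    """Estimate the distinct count; exact when fewer than k values were seen."""
    if len(sketch) < k:
        return float(len(sketch))
    return (k - 1) / sketch[k - 1]  # (k - 1) divided by the k-th smallest hash

def union(a, b, k=256):
    """Sketch of the union: the k smallest values across both sketches."""
    return sorted(set(a) | set(b))[:k]

def estimate_overlap(a, b, k=256):
    """Inclusion-exclusion: |A and B| is roughly |A| + |B| - |A or B|."""
    return estimate(a, k) + estimate(b, k) - estimate(union(a, b, k), k)

commuters = kmv_sketch(f"user{i}" for i in range(0, 6000))
vegans = kmv_sketch(f"user{i}" for i in range(4000, 10000))
print(round(estimate_overlap(commuters, vegans)))  # true overlap is 2000
```

Each sketch holds at most k floats regardless of audience size, which is what makes the approach fast; the price is an estimate rather than the exact count that the bitmap AND provides.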