WHAT DRIVES THE CAR BUSINESS
MOVING FROM ANECDOTES TO DATA
WHO WE ARE
• TrueCar’s mission is to prove that truth and transparency is a more profitable way of doing business, starting with automotive.
• The TrueCar Platform allows for data to be dissected and transformed into easily digestible and usable purchasing tools for the consumer. So you can be a first-time car buyer (you don’t have to be an expert) and actually understand the difference between a bad price, a fair price and a great price.
• www.TrueCar.com, TRUE.com, NASDAQ: TRUE
ABOUT US
JOHN WILLIAMS, SVP PLATFORM OPERATIONS
RUSSELL FOLTZ-SMITH, VP DATA PLATFORM
Russ is the VP of Data Platform at TrueCar.com, where he creates the intelligence
systems driving TrueCar’s innovative interactive product set. Prior to TrueCar, he
held executive, product and technical leadership positions at category leaders like
IAC, Grind Networks, and Wolfram|Alpha. Russ holds a degree in mathematics
from the University of Chicago and currently lives in Marina Del Rey, CA with his
wife and two daughters.
John Williams is the SVP, Platform Operations of TrueCar. John has over 20 years
of experience designing, building and operating large scale Internet infrastructure.
John joined TrueCar in March 2011. John is responsible for the technology,
security and operations strategy that facilitates explosive growth while still meeting
strict requirements for performance, security and reliability. Before joining TrueCar,
John was retained as a consultant by numerous world-class technology, financial
services, entertainment, military and government organizations. Previously, John
was the CTO and co-founder of Preventsys (acquired by McAfee) where he
created the world’s first automated security policy compliance system for large
enterprise networks. Prior to that he founded and led the network penetration
testing team for Internet security pioneer Trusted Information Systems. At the start
of his career, John co-founded and built one of New York City’s first Internet
Service Providers.
OUR CORE SERVICE
CONSUMERS: Provide interactive transaction guidance to consumers via web and mobile
INDUSTRY: Provide interactive transaction tools to OEMs and dealers via web and mobile
REVENUE MODEL: PAY PER SALE
OUR PARTNERS
THE SITUATION
INCREASING DATA APPETITE
GROWING TECH DIVERSITY
MORE PRODUCTS
Data Movement Pressure
SQL Wizardry
= Too much time keeping it together
DATA FLOW
MULTIPLE DATA WAREHOUSES
100s of enrichment processes
1,000+ inbound data feeds
7,500+ dealers
1,500,000+ TC dealer vehicles tracked daily
8,000,000+ industry-wide vehicles tracked daily
400+ websites powered
1,000,000+ cars sold
20,000,000+ customers serviced
Industry-leading analytic products
250,000,000+ vehicle images
And more…
FEEDBACK LOOPS
*Numbers are all approximate
WHOLESALE
SHIFT NEEDED
It’s not just an economics exercise.
WE NEED NEW CAPABILITIES.
FUNDAMENTAL ROLE TRANSFORMATION
NOT THIS: SQL but faster
YES, THIS: Data scientists, database developers, programmers, and analysts become INTELLIGENCE ENGINEERS
FOCUS ON MAKING THINGS
INTELLIGENCE ENGINEERS should not have to worry about:
• COMPUTE CYCLES
• STORAGE
• SYSTEM SCALE
• MOVING DATA
THEY SHOULD BE MAKING SMARTER THINGS
DATA then APPs
EXISTING DEVELOPMENT MODEL IS BROKEN & LIMITING
1. Define app
2. Create highly tuned DB for specific app
3. Load specific data
NEW MODEL
1. GET ALL THE DATA YOU CAN
2. HDFS
3. Make and remake apps
PHILOSOPHY
DELETE DATA
MOVE DATA
DON’Ts
LEARN MAPREDUCE WELL
USE NATIVE COMPONENTS
TAKE SHORTCUTS
DO’s
NO PROOF OF CONCEPTS
POCS are:
TOO SMALL
TOO SIMPLE
TOO EASY
ONLY WAY TO BUILD LHC is to BUILD LHC
OUR DATA EVOLUTION
12-month execution path
JUNE ’13: Initiate Hadoop execution
JULY ’13: Partner with Hortonworks
AUG. ’13: Training & dev begins
NOV. ’13: 60-node, 2PB production cluster live
DEC. ’13: 3 production apps launch
JAN. ’14: 40% of dev staff proficient
FEB. ’14: 3 more production apps launch
MAY ’14: IPO
Data platform capabilities: We addressed our data platform capabilities strategically as a precursor to IPO.
OUR SETUP
TrueCar Hadoop Cluster:
• 60 nodes, 2.55PB usable HDFS, 960 Xeon CPU cores, 7.7TB RAM
- 10GbE networking, 3 racks, HDP 2.1
Final price point:
$0.23/GB hardware & software/support
$0.003/GB/mo space/power/cooling
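As a rough worked example of the price point above, the sketch below multiplies the quoted per-GB figures by the cluster's usable capacity. It assumes decimal units (1 PB = 1,000,000 GB) and that both prices are quoted per usable GB; neither assumption is stated on the slide, so the totals are order-of-magnitude estimates only.

```python
# Rough cost estimate from the slide's per-GB figures.
# Assumptions (not from the slide): 1 PB = 1,000,000 GB,
# and both prices apply to usable HDFS capacity.
USABLE_PB = 2.55
GB_PER_PB = 1_000_000

HW_SW_PER_GB = 0.23             # hardware & software/support, USD per GB
FACILITY_PER_GB_MONTH = 0.003   # space/power/cooling, USD per GB per month


def cluster_costs(usable_pb):
    """Return (hardware/software cost, monthly facility cost) in USD."""
    usable_gb = usable_pb * GB_PER_PB
    return usable_gb * HW_SW_PER_GB, usable_gb * FACILITY_PER_GB_MONTH


one_time, monthly = cluster_costs(USABLE_PB)
print(f"hardware/software: ${one_time:,.0f}, facility per month: ${monthly:,.0f}")
```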
SOME OF OUR
HADOOP BASED SYSTEMS
Vehicle Data Systems
Intelligent Image Processing
And of course… better BI
EXAMPLE SYSTEM 1: VEHICLE DATA
The Situation:
• We keep track of over 8,000,000 new and used vehicles in inventory in the marketplace every day
• We enrich and use vehicle data to power our market reports, Live Offers, value/pricing systems, industry data products and more
• Previous non-Hadoop system took 6-24 hours to complete a full processing run
The Goal with Hadoop:
• Scale up to allow reprocessing of 50 years of inventory/vehicle record data available to us
• Enable attaching additional enrichment data and processing without a massive overhaul (plug and play)
• Complete a full processing run of daily inbound data in 1 hour, with speedy one-off/small-batch CRUD operations
EXAMPLE SYSTEM 1: VEHICLE INVENTORY DATA
1. Dealer Data Feeds
• Provide daily snapshot of raw vehicle inventory
2. MapReduce – Data Loader
• Normalize into a standard record
• Filter out bad records
• Validate fields
3. MapReduce – VIN Decoder
• Identify trim/options for each vehicle
4. Hive – Data Enhancer
• Join against other data sources to enrich the vehicle information
5. MapReduce – CRUD
• Decide which entries are new, updated or should be deleted
• Put entries in a queue for exporting to SQL
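The Data Loader step (normalize, filter, validate) can be illustrated with a small single-process sketch. This is not TrueCar's loader: the feed field names (`VIN`, `DealerID`, `Price`) and the validation rules are assumptions chosen for illustration. The only outside fact relied on is that a standard VIN has 17 characters and never uses the letters I, O, or Q.

```python
import re

# A standard VIN is 17 characters and never uses the letters I, O, or Q.
VIN_PATTERN = re.compile(r"^[A-HJ-NPR-Z0-9]{17}$")


def normalize(raw):
    """Map a raw feed row onto a standard record (field names are hypothetical)."""
    return {
        "vin": str(raw.get("VIN", "")).strip().upper(),
        "dealer_id": str(raw.get("DealerID", "")).strip(),
        "price": raw.get("Price"),
    }


def validate(record):
    """Return the record with a numeric price, or None if it fails validation."""
    if not VIN_PATTERN.match(record["vin"]) or not record["dealer_id"]:
        return None
    try:
        price = float(record["price"])
    except (TypeError, ValueError):
        return None
    if price <= 0:
        return None
    return {**record, "price": price}


def load(feed):
    """Mapper-style pass: normalize each row and drop rows that fail validation."""
    for raw in feed:
        record = validate(normalize(raw))
        if record is not None:
            yield record
```

Because each row is handled independently, the same logic fits a map-only job, where every mapper processes its own slice of the feed.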
[Diagram: dealer inventory feeds land in HDFS and pass through the Hadoop stages MR filter/verify, MR VIN decode, Hive enrich, and MR Rabbit/CRUD, then go through a message queue and queue service to the database]
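The CRUD step (deciding which entries are new, updated, or deleted) amounts to comparing two snapshots. A minimal sketch follows, assuming both snapshots fit in dictionaries keyed by VIN; the message shape is hypothetical, and a production job would do the comparison as a distributed join rather than in memory.

```python
def crud_diff(previous, current):
    """Compare two snapshots keyed by VIN and classify each entry."""
    creates = [vin for vin in current if vin not in previous]
    updates = [vin for vin in current
               if vin in previous and current[vin] != previous[vin]]
    deletes = [vin for vin in previous if vin not in current]
    return {"create": creates, "update": updates, "delete": deletes}


def to_messages(diff, current):
    """Flatten a diff into queue-ready messages (the message shape is hypothetical)."""
    messages = []
    for op in ("create", "update", "delete"):
        for vin in diff[op]:
            # Deleted entries have no current record, so "record" is None for them.
            messages.append({"op": op, "vin": vin, "record": current.get(vin)})
    return messages
```

Unchanged entries produce no message, so the export to SQL only carries the day's actual changes.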
EXAMPLE SYSTEM 1: VEHICLE DATA VIN DECODER
Challenge: Understand EXACTLY what options are on all cars.
Hadoop components: just a MAPPER; Avro format for I/O
Input: Inventory or transaction data from dealers (HDFS)
Pre-staged in memory:
• VIN decode rules (general & make-specific)
• Canonical vehicle color data (HDFS)
• Canonical vehicle trim/style data (HDFS)
Mapper: Compute F1 score for matches
Output: Vehicle trim & probability
The F1 score is used to compute similarity between inventory and canonical data (http://en.wikipedia.org/wiki/F1_score)
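To make the F1-based matching concrete, here is a minimal sketch that scores one vehicle's observed features against each canonical trim and picks the best match. Representing a vehicle as a set of feature strings is an assumption, as are the trim names in the usage example. The slide's "probability" output is not reproduced here; the sketch returns the raw F1 value.

```python
def f1_score(observed, canonical):
    """F1 between the features seen on a vehicle and a canonical trim's features."""
    true_positives = len(observed & canonical)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(observed)   # share of observed features that match
    recall = true_positives / len(canonical)     # share of canonical features that were seen
    return 2 * precision * recall / (precision + recall)


def best_trim(observed, trims):
    """Return (trim name, F1) for the canonical trim with the highest score."""
    scored = ((name, f1_score(observed, feats)) for name, feats in trims.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical canonical data for illustration only.
trims = {
    "LX": {"cloth", "manual", "steel wheels"},
    "EX": {"leather", "sunroof", "alloy wheels", "automatic"},
}
name, score = best_trim({"leather", "sunroof", "automatic"}, trims)
print(name, round(score, 3))
```

F1 balances the two ways a match can go wrong: features in the dealer record that the trim lacks (low precision) and trim features missing from the dealer record (low recall).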
EXAMPLE SYSTEM 2: INTELLIGENT IMAGE PROCESSING
The Situation:
• 250,000,000+ vehicle images currently under asset management for live data
• 1,000,000,000+ images have passed through system
• 1,000,000+ images processed daily (and growing)
• Original system for processing images could take up to 1 day to fully process all daily images
The Goal with Hadoop:
• Scale to storing over 1,000,000,000 images online
• Allow for advanced image recognition, OCR
• Process a full run of the latest images in less than 2 hours, and allow for speedy one-off/small-batch real-time CRUD operations
EXAMPLE SYSTEM 2: IMAGE DOWNLOADER
Pulls images from providers into HDFS
• Downloads multiple images simultaneously
• Downloads from multiple providers simultaneously
• Download times scale with cluster size
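The idea of downloading many images at once can be sketched on a single machine with a thread pool. This stands in for the cluster-wide parallelism described above and is not the Hadoop job itself. The `fetch` callable is a hypothetical parameter (for example, a wrapper around an HTTP client) so the sketch stays independent of any provider API.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=8):
    """Fetch every URL concurrently; failed downloads are skipped."""
    results = {}

    def safe_fetch(url):
        try:
            return url, fetch(url)
        except Exception:  # a real job would log the error and retry
            return url, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs safe_fetch on worker threads and yields results in input order.
        for url, data in pool.map(safe_fetch, list(urls)):
            if data is not None:
                results[url] = data
    return results
```

On a cluster the same pattern is spread across nodes, which is why adding nodes shortens the overall download window.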
EXAMPLE SYSTEM 2: IMAGE BUNDLER
BUNDLES MILLIONS OF DAILY IMAGES INTO SINGLE HDFS FILE
[Diagram: daily image bundles stored in Hadoop, labeled "Image Bundle May 30, 2014" and "Image Bundle May 31, 2014"]
• Uses HIPI (http://hipi.cs.virginia.edu) to store multiple images in an HDFS sequence file
• Instead of millions of small daily image files (<< block size), have 1 large daily file with all images bundled inside (>> block size)
• We tag images with metadata, permanently linking images to our vehicle database (e.g., VIN, Make, Model, Model Year, etc.)
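The bundling idea (many small images plus their metadata packed into one large file) can be shown with a toy container. This is not HIPI's format or a Hadoop sequence file; the length-prefixed layout and the JSON metadata encoding below are assumptions made purely for illustration.

```python
import io
import json
import struct


def write_bundle(out, images):
    """Append (metadata, image_bytes) pairs to one stream; return the count written."""
    count = 0
    for metadata, data in images:
        header = json.dumps(metadata).encode("utf-8")
        # Two big-endian 32-bit lengths: metadata size, then image size.
        out.write(struct.pack(">II", len(header), len(data)))
        out.write(header)
        out.write(data)
        count += 1
    return count


def read_bundle(src):
    """Yield (metadata, image_bytes) pairs back out of a bundle stream."""
    while True:
        prefix = src.read(8)
        if len(prefix) < 8:
            return  # end of bundle
        header_len, data_len = struct.unpack(">II", prefix)
        metadata = json.loads(src.read(header_len).decode("utf-8"))
        yield metadata, src.read(data_len)


# Usage with an in-memory stream; the VIN and image bytes are made-up sample values.
buf = io.BytesIO()
written = write_bundle(buf, [
    ({"vin": "1HGCM82633A004352", "make": "Honda"}, b"\xff\xd8 image one"),
    ({"vin": "1HGCM82633A004353", "make": "Honda"}, b"\xff\xd8 image two"),
])
buf.seek(0)
items = list(read_bundle(buf))
```

Keeping the metadata next to each image is what lets a later job link an image back to its vehicle record without a separate lookup.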
EXAMPLE SYSTEM 2: IMAGE PROCESSOR
PROCESSES IMAGE BUNDLE THROUGH HADOOP
Thumbnailing: builds thumbnail library
Vehicle Locator: finds vehicle in image
Color Decoder: determines vehicle RGB color code (example shown: C0C0C0)
Orientation: determines image orientation (example shown: Driver Side)
• Image bundles can be processed through multiple Java MapReduce routines
• Thumbnailing is done with ImageJ
• Vehicle locator will be done with OpenCV, using edge detection and shape-based features
• Average color will be determined from pixel value ratios in the RGB layers of the JPEG
• Orientation will be determined with shape-based features and gradient algorithms (see Rybski, Huber, Morris, and Hoffman 2010)
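As an illustration of the color decoder, the sketch below averages each RGB channel over a set of pixels and formats the result as a hex color code. It assumes the JPEG has already been decoded and the vehicle's pixels already isolated. A plain per-channel mean is used here as a simple stand-in; the slide's "pixel value ratios" method is not specified in enough detail to reproduce.

```python
def average_color(pixels):
    """Mean of each RGB channel, rounded to the nearest integer."""
    totals = [0, 0, 0]
    count = 0
    for r, g, b in pixels:
        totals[0] += r
        totals[1] += g
        totals[2] += b
        count += 1
    if count == 0:
        raise ValueError("no pixels supplied")
    return tuple(round(total / count) for total in totals)


def to_hex(color):
    """Format an RGB triple as a six-digit hex color code."""
    return "{:02X}{:02X}{:02X}".format(*color)


# Two made-up gray pixels standing in for a silver vehicle.
code = to_hex(average_color([(190, 190, 190), (194, 194, 194)]))
print(code)
```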
EXAMPLE SYSTEM 3: ADVANCED BUSINESS INTELLIGENCE
The Situation:
• 8 years of web/app behavior
• 25,000+ data fields
• 50,000,000+ configured vehicles
• 1,000,000+ TrueCar car transactions
• Previous approaches spread data across 4+ data warehouses, kept only a small portion of the data online and available for query, and required extensive data movement pipelines to integrate
The Goal with Hadoop:
• All behavioral data for all time available for analytics
• Data ingested at least once per day, with most coming in near real time
• Remove worry from analysts and DBAs regarding deletion or offline archive
• Reduce data warehouses, consolidate analytic tooling
EXAMPLE SYSTEM 3: BI GROWTH
[Chart: ACCELERATING BI DATA GROWTH; vertical axis in millions, from 0 to 1,800]
EXAMPLE SYSTEM 3:
MULTI-DIMENSIONAL BI
WAS IT WORTH IT?
ECONOMIC
• Storage costs, compute costs
- FROM $19.00/GB to $0.23/GB
• Elimination of expensive proprietary tools
FUNCTIONALITY
• Development effort of complex data applications reduced by 3x
• Automated trend hunting
• Consolidation of data into immediately computable, searchable infrastructure
• Unified ETL and storage system: near-zero data movement environment
• Functional programming approach
FUTURE PREVIEW
COMPREHENSIVE DATA
REAL TIME MARKET SIMULATION
REAL TIME TRANSACTION PROCESSING
PRESCRIPTIVE MOBILE REAL TIME TOOLS
TOTAL AUTO MARKETPLACE
THANK YOU.

Contenu connexe

Similaire à What Drives the Car Business: Moving from Anecdotes to Data

Growth hacking in the age of Data
Growth hacking in the age of DataGrowth hacking in the age of Data
Growth hacking in the age of DataDaniel Saito
 
18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar 18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar Revolution Analytics
 
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...Hortonworks
 
An Innovative Big-Data Web Scraping Tech Company
An Innovative Big-Data Web Scraping Tech CompanyAn Innovative Big-Data Web Scraping Tech Company
An Innovative Big-Data Web Scraping Tech CompanyRoger Giuffre
 
An Innovative Big-Data Web Scraping Tech Company
An Innovative Big-Data Web Scraping Tech CompanyAn Innovative Big-Data Web Scraping Tech Company
An Innovative Big-Data Web Scraping Tech CompanyRoger Giuffre
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Revolution Analytics
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
DataOps: Control-M's role in data pipeline orchestration
DataOps: Control-M's role in data pipeline orchestrationDataOps: Control-M's role in data pipeline orchestration
DataOps: Control-M's role in data pipeline orchestrationpzjnjr6rsg
 
Integrating Structure and Analytics with Unstructured Data
Integrating Structure and Analytics with Unstructured DataIntegrating Structure and Analytics with Unstructured Data
Integrating Structure and Analytics with Unstructured DataDATAVERSITY
 
Real-time Visibility at Scale with Sumo Logic
Real-time Visibility at Scale with Sumo LogicReal-time Visibility at Scale with Sumo Logic
Real-time Visibility at Scale with Sumo LogicAmazon Web Services
 
WSO2Con EU 2015: Reference Architecture for EDA
WSO2Con EU 2015: Reference Architecture for EDAWSO2Con EU 2015: Reference Architecture for EDA
WSO2Con EU 2015: Reference Architecture for EDAWSO2
 
Hadoop 2.0: YARN to Further Optimize Data Processing
Hadoop 2.0: YARN to Further Optimize Data ProcessingHadoop 2.0: YARN to Further Optimize Data Processing
Hadoop 2.0: YARN to Further Optimize Data ProcessingHortonworks
 
Big Data Solutions on Cloud – The Way Forward by Kiththi Perera SLT
Big Data Solutions on Cloud – The Way Forward by Kiththi Perera SLTBig Data Solutions on Cloud – The Way Forward by Kiththi Perera SLT
Big Data Solutions on Cloud – The Way Forward by Kiththi Perera SLTKiththi Perera
 
Big data solutions on cloud – the way forward
Big data solutions on cloud – the way forwardBig data solutions on cloud – the way forward
Big data solutions on cloud – the way forwardKiththi Perera
 
How Experian increased insights with Hadoop
How Experian increased insights with HadoopHow Experian increased insights with Hadoop
How Experian increased insights with HadoopPrecisely
 
Improve your Tech Quotient
Improve your Tech QuotientImprove your Tech Quotient
Improve your Tech QuotientTarence DSouza
 
Digital marketing pharma - google event
Digital marketing   pharma - google eventDigital marketing   pharma - google event
Digital marketing pharma - google eventDaniel Viveiros
 
The Next Digital Marketing- Digital Pharma presentation by Ci&T and Google
The Next Digital Marketing- Digital Pharma presentation by Ci&T and GoogleThe Next Digital Marketing- Digital Pharma presentation by Ci&T and Google
The Next Digital Marketing- Digital Pharma presentation by Ci&T and GoogleCI&T
 

Similaire à What Drives the Car Business: Moving from Anecdotes to Data (20)

What Is Rain
What Is RainWhat Is Rain
What Is Rain
 
Growth hacking in the age of Data
Growth hacking in the age of DataGrowth hacking in the age of Data
Growth hacking in the age of Data
 
18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar 18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar
 
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
 
An Innovative Big-Data Web Scraping Tech Company
An Innovative Big-Data Web Scraping Tech CompanyAn Innovative Big-Data Web Scraping Tech Company
An Innovative Big-Data Web Scraping Tech Company
 
Big data Introduction by Mohan
Big data Introduction by MohanBig data Introduction by Mohan
Big data Introduction by Mohan
 
An Innovative Big-Data Web Scraping Tech Company
An Innovative Big-Data Web Scraping Tech CompanyAn Innovative Big-Data Web Scraping Tech Company
An Innovative Big-Data Web Scraping Tech Company
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
DataOps: Control-M's role in data pipeline orchestration
DataOps: Control-M's role in data pipeline orchestrationDataOps: Control-M's role in data pipeline orchestration
DataOps: Control-M's role in data pipeline orchestration
 
Integrating Structure and Analytics with Unstructured Data
Integrating Structure and Analytics with Unstructured DataIntegrating Structure and Analytics with Unstructured Data
Integrating Structure and Analytics with Unstructured Data
 
Real-time Visibility at Scale with Sumo Logic
Real-time Visibility at Scale with Sumo LogicReal-time Visibility at Scale with Sumo Logic
Real-time Visibility at Scale with Sumo Logic
 
WSO2Con EU 2015: Reference Architecture for EDA
WSO2Con EU 2015: Reference Architecture for EDAWSO2Con EU 2015: Reference Architecture for EDA
WSO2Con EU 2015: Reference Architecture for EDA
 
Hadoop 2.0: YARN to Further Optimize Data Processing
Hadoop 2.0: YARN to Further Optimize Data ProcessingHadoop 2.0: YARN to Further Optimize Data Processing
Hadoop 2.0: YARN to Further Optimize Data Processing
 
Big Data Solutions on Cloud – The Way Forward by Kiththi Perera SLT
Big Data Solutions on Cloud – The Way Forward by Kiththi Perera SLTBig Data Solutions on Cloud – The Way Forward by Kiththi Perera SLT
Big Data Solutions on Cloud – The Way Forward by Kiththi Perera SLT
 
Big data solutions on cloud – the way forward
Big data solutions on cloud – the way forwardBig data solutions on cloud – the way forward
Big data solutions on cloud – the way forward
 
How Experian increased insights with Hadoop
How Experian increased insights with HadoopHow Experian increased insights with Hadoop
How Experian increased insights with Hadoop
 
Improve your Tech Quotient
Improve your Tech QuotientImprove your Tech Quotient
Improve your Tech Quotient
 
Digital marketing pharma - google event
Digital marketing   pharma - google eventDigital marketing   pharma - google event
Digital marketing pharma - google event
 
The Next Digital Marketing- Digital Pharma presentation by Ci&T and Google
The Next Digital Marketing- Digital Pharma presentation by Ci&T and GoogleThe Next Digital Marketing- Digital Pharma presentation by Ci&T and Google
The Next Digital Marketing- Digital Pharma presentation by Ci&T and Google
 

Plus de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
What Drives the Car Business: Moving from Anecdotes to Data

  • 1. WHAT DRIVES THE CAR BUSINESS: MOVING FROM ANECDOTES TO DATA
  • 2. WHO WE ARE  TrueCar’s mission is to prove that truth and transparency is a more profitable way of doing business – starting with automotive.  The TrueCar Platform allows for data to be dissected and transformed into easily digestible and usable purchasing tools for the consumer. So you can be a first-time car buyer — you don’t have to be an expert — and actually understand the difference between a bad price, a fair price and a great price.  www.TrueCar.com, TRUE.com, NASDAQ: TRUE
  • 3. ABOUT US. RUSSELL FOLTZ-SMITH, VP DATA PLATFORM: Russ is the VP of Data Platform at TrueCar.com, where he creates the intelligence systems driving TrueCar’s innovative interactive product set. Prior to TrueCar, he held executive, product and technical leadership positions at category leaders like IAC, Grind Networks, and Wolfram|Alpha. Russ holds a degree in mathematics from the University of Chicago and currently lives in Marina Del Rey, CA with his wife and two daughters. JOHN WILLIAMS, SVP PLATFORM OPERATIONS: John Williams is the SVP, Platform Operations of TrueCar. John has over 20 years of experience designing, building and operating large scale Internet infrastructure. John joined TrueCar in March 2011. John is responsible for the technology, security and operations strategy that facilitates explosive growth while still meeting strict requirements for performance, security and reliability. Before joining TrueCar, John was retained as a consultant by numerous world-class technology, financial services, entertainment, military and government organizations. Previously, John was the CTO and co-founder of Preventsys (acquired by McAfee) where he created the world’s first automated security policy compliance system for large enterprise networks. Prior to that he founded and led the network penetration testing team for Internet security pioneer Trusted Information Systems. At the start of his career, John co-founded and built one of New York City’s first Internet Service Providers.
  • 4. OUR CORE SERVICE. CONSUMERS: Provide Interactive Transaction Guidance to Consumers via Web, Mobile. INDUSTRY: Provide Interactive Transaction Tools to OEMs, Dealers via Web, Mobile. Revenue Model: PAY PER SALE.
  • 6. THE SITUATION: INCREASING DATA APPETITE; GROWING TECH DIVERSITY; MORE PRODUCTS; Data Movement Pressure. SQL Wizardry = Too much time keeping it together.
  • 7. DATA FLOW: MULTIPLE DATA WAREHOUSES; 100s of enrichment processes; 1,000+ Inbound Data Feeds; 7,500+ Dealers; 1,500,000+ TC Dealer Vehicles Tracked Daily; 8,000,000+ Industry Wide Vehicles Tracked Daily; 400+ Websites Powered; 1,000,000+ Cars Sold; 20,000,000+ Customers Serviced; Industry Leading Analytic Products; 250,000,000+ Vehicle Images; And More… FEEDBACK LOOPS. *NUMBERS ARE ALL APPROXIMATE
  • 8. WHOLESALE SHIFT NEEDED It’s not just an economics exercise. WE NEED NEW CAPABILITIES.
  • 9. FUNDAMENTAL ROLE TRANSFORMATION. NOT THIS: SQL but Faster. YES, THIS: Data Scientists, Database Developers, Programmers and Analysts become INTELLIGENCE ENGINEERS.
  • 10. FOCUS ON MAKING THINGS. INTELLIGENCE ENGINEERS should not have to worry about: COMPUTE CYCLES, STORAGE, SYSTEM SCALE, MOVING DATA. THEY SHOULD BE MAKING SMARTER THINGS.
  • 11. DATA then APPs. EXISTING DEVELOPMENT MODEL IS BROKEN & LIMITING: Define app; Create highly tuned DB for specific app; Load specific data. NEW MODEL: GET ALL THE DATA YOU CAN; HDFS; Make and Remake apps.
  • 12. PHILOSOPHY DELETE DATA MOVE DATA DON’Ts LEARN MAP REDUCE WELL USE NATIVE COMPONENTS TAKE SHORTCUTS DO’s
  • 13. NO PROOFS OF CONCEPT. POCs are: TOO SMALL, TOO SIMPLE, TOO EASY. The ONLY WAY TO BUILD the LHC is to BUILD the LHC.
  • 14. OUR DATA EVOLUTION (12 months execution path; Data Platform Capabilities). JUNE ’13: Initiate Hadoop Execution. JULY ’13: Partner with Hortonworks. AUG. ’13: Training & Dev Begins. NOV. ’13: (60) Node, 2PB prod. Cluster live. DEC. ’13: (3) production apps launch. JAN. ’14: 40% Dev staff proficient. FEB. ’14: (3) more production apps launch. MAY ’14: IPO. We addressed our data platform capabilities strategically as a precursor to IPO.
  • 15. OUR SETUP. TrueCar Hadoop Cluster: 60 Nodes, 2.55PB usable HDFS, 960 Xeon CPU cores, 7.7TB RAM, 10GbE networking, 3 racks, HDP 2.1. Final price point: $0.23/GB hardware & software/support; $0.003/GB/mo space/power/cooling.
  • 16. SOME OF OUR HADOOP BASED SYSTEMS: Vehicle Data Systems; Intelligent Image Processing; And of course… better BI
  • 17. EXAMPLE SYSTEM 1: VEHICLE DATA. The Situation: We keep track of 8,000,000+ new and used vehicles in inventory in the marketplace every day. We enrich and use vehicle data to power our market reports, Live Offers, value/pricing systems, industry data products and more. Previous non-Hadoop system took 6-24 hours to complete a full processing run. The Goal with Hadoop: Scale up to allow reprocessing of 50 years of inventory/vehicle record data available to us. Enable attaching additional enrichment data and processing without a massive overhaul (plug and play). Complete a full processing run of daily inbound data in 1 hour, with speedy one-off/small-batch CRUD operations.
  • 18. EXAMPLE SYSTEM 1: VEHICLE INVENTORY DATA. 1. Dealer Data Feeds: provide daily snapshot of raw vehicle inventory. 2. MapReduce (Data Loader): normalize into a standard record; filter out bad records; validate fields. 3. MapReduce (VIN Decoder): identify trim/options for each vehicle. 4. Hive (Data Enhancer): join against other data sources to enrich the vehicle information. 5. MapReduce (CRUD): decide which entries are new, updated or should be deleted; put entries in a queue for exporting to SQL. Diagram labels: DEALER INVENTORY FEEDS; HDFS; MR – FILTER/VERIFY; MR – VIN DECODE; Hive Enrich; MR – Rabbit/CRUD; Queue Service; Message Queue; Database; HADOOP.
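The load and CRUD steps on this slide can be sketched in a few lines. This is only an illustration of the idea, not TrueCar's MapReduce code: the field names (`vin`, `price`), the validation rules, and the function names are all assumptions made for the example, and it runs in plain Python rather than on Hadoop.

```python
# Hypothetical sketch of steps 2 (load/validate) and 5 (CRUD classification).
# Field names and validation rules are assumptions, not from the slides.

def normalize(raw):
    """Map one raw feed row to a standard record, or None if it is invalid."""
    vin = str(raw.get("vin", "")).strip().upper()
    price = raw.get("price")
    # Assumed rules: a 17-character VIN and a positive numeric price.
    if len(vin) != 17 or not isinstance(price, (int, float)) or price <= 0:
        return None
    return {"vin": vin, "price": float(price)}


def load_snapshot(raw_rows):
    """Normalize, validate and filter a daily feed into {vin: record}."""
    records = (normalize(row) for row in raw_rows)
    return {rec["vin"]: rec for rec in records if rec is not None}


def diff_snapshots(previous, current):
    """Classify entries as new, updated or deleted by comparing two snapshots."""
    new = [current[v] for v in sorted(current) if v not in previous]
    updated = [current[v] for v in sorted(current)
               if v in previous and previous[v] != current[v]]
    deleted = sorted(v for v in previous if v not in current)
    return new, updated, deleted
```

In a pipeline like the one described, the three lists would be the entries placed on the queue for export to SQL; here they are simply returned to the caller.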
  • 19. EXAMPLE SYSTEM 1: VEHICLE DATA VIN DECODER. Challenge: Understand EXACTLY what options are on all cars. Inputs: Inventory or transaction data from dealers (HDFS); VIN decode rules (general & make-specific); Canonical vehicle color data (HDFS) and Canonical vehicle trim/style data (HDFS), pre-staged in memory. Mapper: Compute F1 score for matches, used to compute similarity between inventory and canonical data (http://en.wikipedia.org/wiki/F1_score). Output: Vehicle trim & probability. Hadoop Components: Just a MAPPER; Avro format for I/O.
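As a rough illustration of F1-based matching, the sketch below scores a vehicle's observed features against each canonical trim and keeps the best match. Treating features as simple sets of tokens is an assumption made for this example, as are the function names and trim names; the slides do not say how the decoder represents options, and the score returned here is an F1 value, not a calibrated probability.

```python
# Hypothetical sketch: set-based F1 between observed and canonical features.

def f1_score(observed, canonical):
    """F1 of two feature sets: harmonic mean of precision and recall."""
    observed, canonical = set(observed), set(canonical)
    true_positives = len(observed & canonical)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(observed)
    recall = true_positives / len(canonical)
    return 2 * precision * recall / (precision + recall)


def best_trim(observed, canonical_trims):
    """Return (trim name, F1 score) for the closest canonical trim."""
    scores = {trim: f1_score(observed, features)
              for trim, features in canonical_trims.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Because the canonical data is small enough to hold in memory, a comparison like this needs no reduce step, which is consistent with the slide's note that the decoder is just a mapper.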
  • 20. EXAMPLE SYSTEM 2: INTELLIGENT IMAGE PROCESSING. The Situation: 250,000,000+ vehicle images currently under asset management for live data. 1,000,000,000+ images have passed through system. 1,000,000+ images processed daily (and growing). Original system for processing images could take up to 1 day to fully process all daily images. The Goal with Hadoop: Scale to storing 1,000,000,000+ images online. Allow for advanced image recognition, OCR. Process full run of latest images in less than 2 hours; allow for speedy one-off/small-batch real-time CRUD operations.
  • 21. EXAMPLE SYSTEM 2: IMAGE DOWNLOADER. Pulls Images From Providers into HDFS. Downloads multiple images simultaneously. Downloads from multiple providers simultaneously. Download times scale with cluster size.
  • 22. EXAMPLE SYSTEM 2: IMAGE BUNDLER. BUNDLES MILLIONS OF DAILY IMAGES INTO SINGLE HDFS FILE (e.g., Image Bundle May 30, 2014; Image Bundle May 31, 2014). Uses HIPI (http://hipi.cs.virginia.edu) to store multiple images in an HDFS sequence file. Instead of millions of small daily image files (<< block size), have 1 large daily file with all images bundled inside (>> block size). We tag images with metadata, permanently linking images to our vehicle database (e.g., VIN, Make, Model, Model Year, etc.).
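The bundling idea, many small images packed with their metadata into one large file, can be shown with a toy container. The length-prefixed layout below is invented for this example; it is not HIPI's bundle format or Hadoop's SequenceFile format, and the function names are hypothetical.

```python
import io
import json
import struct

# Toy container (an assumption, not HIPI or SequenceFile): each entry is
# [4-byte metadata length][4-byte image length][JSON metadata][image bytes].

def write_bundle(stream, images):
    """Append (metadata dict, image bytes) pairs to one binary stream."""
    for meta, data in images:
        header = json.dumps(meta, sort_keys=True).encode("utf-8")
        stream.write(struct.pack(">II", len(header), len(data)))
        stream.write(header)
        stream.write(data)


def read_bundle(stream):
    """Yield (metadata dict, image bytes) pairs back out of the stream."""
    while True:
        prefix = stream.read(8)
        if not prefix:
            return
        if len(prefix) < 8:
            raise ValueError("truncated bundle entry")
        header_len, data_len = struct.unpack(">II", prefix)
        meta = json.loads(stream.read(header_len).decode("utf-8"))
        yield meta, stream.read(data_len)
```

Keeping the metadata next to each image is what lets a later job link an image back to a vehicle record (VIN, make, model) without a separate lookup.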
  • 23. EXAMPLE SYSTEM 2: IMAGE PROCESSOR. PROCESSES IMAGE BUNDLE THROUGH HADOOP. Thumbnailing builds thumbnail library. Vehicle Locator finds vehicle in image. Color Decoder determines vehicle RGB color code. Orientation determines image orientation (e.g., Driver Side). Image bundles can be processed through multiple Java MapReduce routines. Thumbnailing is done with ImageJ. Vehicle locator will be done with OpenCV, using edge detection and shape-based features. Average color will be determined from pixel value ratios in the RGB layers of the jpeg. Orientation will be determined with shape-based features and gradient algorithms (see Rybski, Huber, Morris, and Hoffman 2010).
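For the color step, a minimal sketch of averaging the RGB channels might look like the following. It assumes the pixels have already been decoded and cropped to the vehicle, and it uses a plain per-channel mean; the slides mention "pixel value ratios" without giving the exact method, so this is a stand-in rather than the presenters' algorithm. The function names are hypothetical.

```python
# Hypothetical sketch: per-channel mean over already-decoded (r, g, b) pixels.

def average_rgb(pixels):
    """Return the rounded mean (r, g, b) of an iterable of 0-255 triples."""
    totals = [0, 0, 0]
    count = 0
    for r, g, b in pixels:
        totals[0] += r
        totals[1] += g
        totals[2] += b
        count += 1
    if count == 0:
        raise ValueError("no pixels to average")
    return tuple(round(total / count) for total in totals)


def to_hex(rgb):
    """Format an (r, g, b) triple as an RGB color code such as #96140F."""
    return "#{:02X}{:02X}{:02X}".format(*rgb)
```

Summing channel totals in a single pass keeps memory use constant, which matters when each mapper walks through a large image bundle.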
  • 24. EXAMPLE SYSTEM 3: ADVANCED BUSINESS INTELLIGENCE. The Situation: 8 years of web/app behavior. 25,000+ data fields. 50,000,000+ configured vehicles. 1,000,000+ TrueCar car transactions. Previous approaches had data spread across 4+ data warehouses, with only a small portion of it online and available for query, and required extensive data movement pipelines to integrate. The Goal with Hadoop: All behavioral data for all time available for analytics. Data injected no less than once per day, with most coming in near real time. Remove worry from analysts and DBAs regarding deletion or offline archive. Reduce data warehouses, consolidate analytic tooling.
  • 25. EXAMPLE SYSTEM 3: BI GROWTH. [Chart: ACCELERATING BI DATA GROWTH; axis from 0 to 1,800, in millions]
  • 27. WAS IT WORTH IT? ECONOMIC: Storage Costs, Compute Costs FROM $19.00/GB to $0.23/GB. Elimination of expensive proprietary tools. FUNCTIONALITY: Development effort of complex data applications reduced by 3x. Automated Trend Hunting. Consolidation of data into immediately computable, searchable infrastructure. Unified ETL and Storage system: near zero data movement environment. Functional Programming Approach.
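To put the quoted per-gigabyte figures in perspective, the arithmetic below works out the ratio and percentage reduction between $19.00/GB and $0.23/GB. Only those two prices come from the slides; the function names are made up for the example, and the slides do not say exactly which costs each figure includes.

```python
# The two prices are quoted on the slide; everything else is illustrative.
OLD_COST_PER_GB = 19.00
NEW_COST_PER_GB = 0.23


def cost_ratio(old, new):
    """How many times cheaper the new price is than the old one."""
    return old / new


def percent_reduction(old, new):
    """Reduction from old to new, as a percentage of the old price."""
    return (old - new) / old * 100.0


ratio = cost_ratio(OLD_COST_PER_GB, NEW_COST_PER_GB)
reduction = percent_reduction(OLD_COST_PER_GB, NEW_COST_PER_GB)
print(f"about {ratio:.1f}x cheaper, a {reduction:.1f}% reduction per GB")
```

On these numbers the new platform comes out roughly 83 times cheaper per gigabyte, a reduction of a little under 99%.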
  • 28. FUTURE PREVIEW: COMPREHENSIVE DATA; REAL TIME MARKET SIMULATION; REAL TIME TRANSACTION PROCESSING; PRESCRIPTIVE MOBILE REAL TIME TOOLS; TOTAL AUTO MARKETPLACE