2. WHO WE ARE
TrueCar’s mission is to prove that truth and transparency are a more profitable way of doing business, starting with automotive.
The TrueCar Platform dissects data and transforms it into easily digestible, usable purchasing tools for the consumer. A first-time car buyer does not have to be an expert to understand the difference between a bad price, a fair price and a great price.
www.TrueCar.com, TRUE.com,
NASDAQ: TRUE
3. ABOUT US
JOHN WILLIAMS, SVP PLATFORM OPERATIONS
RUSSELL FOLTZ-SMITH, VP DATA PLATFORM
Russ is the VP of Data Platform at TrueCar.com, where he creates the intelligence
systems driving TrueCar’s innovative interactive product set. Prior to TrueCar, he
held executive, product and technical leadership positions at category leaders like
IAC, Grind Networks, and Wolfram|Alpha. Russ holds a degree in mathematics
from the University of Chicago and currently lives in Marina Del Rey, CA with his
wife and two daughters.
John Williams is the SVP, Platform Operations of TrueCar. John has over 20 years
of experience designing, building and operating large scale Internet infrastructure.
John joined TrueCar in March 2011. John is responsible for the technology,
security and operations strategy that facilitates explosive growth while still meeting
strict requirements for performance, security and reliability. Before joining TrueCar,
John was retained as a consultant by numerous world-class technology, financial
services, entertainment, military and government organizations. Previously, John
was the CTO and co-founder of Preventsys (acquired by McAfee) where he
created the world’s first automated security policy compliance system for large
enterprise networks. Prior to that he founded and led the network penetration
testing team for Internet security pioneer Trusted Information Systems. At the start
of his career, John co-founded and built one of New York City’s first Internet
Service Providers.
6. THE SITUATION
INCREASING DATA APPETITE
GROWING TECH DIVERSITY
MORE PRODUCTS
Data Movement Pressure
Too much time keeping it together
SQL Wizardry
7. DATA FLOW
1,000+ Inbound Data Feeds
7,500+ Dealers
1,500,000+ TC Dealers Vehicles Tracked Daily
8,000,000+ Industry Wide Vehicles Tracked Daily
MULTIPLE DATA WAREHOUSES
100s of enrichment processes
400+ Websites Powered
1,000,000+ Cars Sold
20,000,000+ Customers Serviced
Industry Leading Analytic Products
250,000,000+ Vehicle Images
And More…
FEEDBACK LOOPS
*NUMBERS ARE ALL APPROXIMATE
10. FOCUS ON MAKING THINGS
INTELLIGENCE ENGINEERS should not
have to worry about:
COMPUTE CYCLES
STORAGE
SYSTEM SCALE
MOVING DATA
THEY SHOULD BE MAKING SMARTER THINGS
11. DATA then APPs
EXISTING DEVELOPMENT MODEL IS BROKEN & LIMITING
Define app, then create highly tuned DB for specific app, then load specific data
NEW MODEL
GET ALL THE DATA YOU CAN, put it in HDFS, then make and remake apps
13. NO PROOF OF CONCEPTS
POCs are:
TOO SMALL
TOO SIMPLE
TOO EASY
The only way to BUILD the LHC is to BUILD the LHC
14. OUR DATA EVOLUTION
JUNE '13: Initiate Hadoop execution
JULY '13: Partner with Hortonworks
AUG. '13: Training & dev begins
NOV. '13: 60-node, 2 PB production cluster live
DEC. '13: 3 production apps launch
JAN. '14: 40% of dev staff proficient
FEB. '14: 3 more production apps launch
MAY '14: IPO
12-month execution path
Data Platform Capabilities
We addressed our data platform capabilities strategically as a precursor to IPO.
16. SOME OF OUR HADOOP BASED SYSTEMS
Vehicle Data Systems
Intelligent Image Processing
And of course… better BI
17. EXAMPLE SYSTEM 1: VEHICLE DATA
The Situation:
We keep track of 8,000,000+ new and used vehicles in inventory in the marketplace every day
We enrich and use vehicle data to power our market reports, Live Offers, value/pricing systems, industry data products and more
Previous non-Hadoop system took 6-24 hours to complete a full processing run
The Goal with Hadoop:
Scale up to allow reprocessing of 50 years of inventory/vehicle record data available to us
Enable attaching additional enrichment data and processing without a massive overhaul (plug and play)
Complete a full processing run of daily inbound data in 1 hour, with speedy one-off/small-batch CRUD operations
18. EXAMPLE SYSTEM 1: VEHICLE INVENTORY DATA
1. Dealer Data Feeds
Provide daily snapshot of raw vehicle inventory
2. MapReduce – Data Loader
Normalize into a standard record
Filter out bad records
Validate fields
3. MapReduce – VIN Decoder
Identify trim/options for each vehicle
4. Hive – Data Enhancer
Join against other data sources to enrich the vehicle information
5. MapReduce – CRUD
Decide which entries are new, updated or should be deleted
Put entries in a queue for exporting to SQL
Diagram: DEALER INVENTORY FEEDS land in HADOOP (HDFS, MR – FILTER/VERIFY, MR – VIN DECODE, Hive Enrich, MR – Rabbit/CRUD), which feeds a Message Queue, a Queue Service and the Database
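The loader and CRUD steps above can be sketched on a single machine as follows. This is a minimal illustration, not the production MapReduce code: the field names (`vin`, `price`), the validation rules and the function names are all assumptions made for the example.

```python
import re
from typing import Dict, List, Tuple

Record = Dict[str, str]

# 17 characters, excluding the letters I, O and Q (modern VIN alphabet).
VIN_PATTERN = re.compile(r"^[A-HJ-NPR-Z0-9]{17}$")


def normalize(raw: Record) -> Record:
    """Step 2: map a raw feed row onto a standard record (hypothetical fields)."""
    return {
        "vin": raw.get("vin", "").strip().upper(),
        "price": raw.get("price", "").strip(),
    }


def is_valid(record: Record) -> bool:
    """Step 2: reject rows with a malformed VIN or a non-numeric price."""
    return bool(VIN_PATTERN.match(record["vin"])) and record["price"].isdigit()


def load_snapshot(rows: List[Record]) -> Dict[str, Record]:
    """Normalize every row, drop the bad ones, and key the rest by VIN."""
    records = (normalize(row) for row in rows)
    return {rec["vin"]: rec for rec in records if is_valid(rec)}


def diff_snapshots(
    previous: Dict[str, Record], current: Dict[str, Record]
) -> List[Tuple[str, str]]:
    """Step 5: decide which entries are new, updated or should be deleted."""
    actions: List[Tuple[str, str]] = []
    for vin, record in current.items():
        if vin not in previous:
            actions.append(("create", vin))
        elif previous[vin] != record:
            actions.append(("update", vin))
    for vin in previous:
        if vin not in current:
            actions.append(("delete", vin))
    return sorted(actions)
```

In the pipeline described on the slide, the resulting actions would be placed on a queue for export to SQL rather than returned as a list.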
19. EXAMPLE SYSTEM 1: VEHICLE DATA VIN DECODER
Challenge:
Understand EXACTLY what options are on all cars.
Hadoop Components:
Just a MAPPER
Avro format for I/O
Mapper inputs:
Inventory or transaction data from dealers (HDFS)
VIN decode rules (general & make-specific)
Canonical vehicle color data (HDFS)
Canonical vehicle trim/style data (HDFS)
Pre-staged in memory
Mapper:
Compute F1 score for matches
F1 score is used to compute similarity between inventory and canonical data (http://en.wikipedia.org/wiki/F1_score)
Mapper output:
Vehicle trim & probability
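As an illustration of how an F1 score can rank candidate trims, here is a minimal sketch that treats the options seen on an inventory record and the options of each canonical trim as sets. The set-based representation, the option names and the `best_trim` helper are assumptions for the example; the deck does not describe the actual feature encoding or decode rules.

```python
from typing import Dict, Set, Tuple


def f1_score(observed: Set[str], canonical: Set[str]) -> float:
    """Harmonic mean of precision and recall between two option sets."""
    true_positives = len(observed & canonical)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(observed)
    recall = true_positives / len(canonical)
    return 2 * precision * recall / (precision + recall)


def best_trim(
    observed: Set[str], canonical_trims: Dict[str, Set[str]]
) -> Tuple[str, float]:
    """Return the canonical trim whose option set scores highest."""
    scored = ((name, f1_score(observed, opts)) for name, opts in canonical_trims.items())
    return max(scored, key=lambda pair: pair[1])
```

For example, an inventory record listing `{"sunroof", "leather", "v6"}` matched against a hypothetical trim with `{"sunroof", "leather", "v6", "nav"}` has precision 1.0 and recall 0.75, giving an F1 score of 6/7.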
20. EXAMPLE SYSTEM 2: INTELLIGENT IMAGE PROCESSING
The Situation:
250,000,000+ vehicle images currently under asset management for live data
1,000,000,000+ images have passed through the system
1,000,000+ images processed daily (and growing)
Original image processing system could take up to 1 day to fully process all daily images
The Goal with Hadoop:
Scale to store 1,000,000,000+ images online
Allow for advanced image recognition and OCR
Process a full run of the latest images in less than 2 hours, and allow for speedy one-off/small-batch real-time CRUD operations
21. EXAMPLE SYSTEM 2: IMAGE DOWNLOADER
Pulls images from providers into HDFS (Hadoop)
Downloads multiple images simultaneously
Downloads from multiple providers simultaneously
Download times shrink as the cluster grows
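The idea of fetching many images from several providers at once can be sketched on a single machine with a thread pool. This is only an analogy for the Hadoop job described above: `download_all`, the injected `fetch` callable and the `max_workers` parameter are hypothetical names, and a real job would write the payloads into HDFS rather than return them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def download_all(
    urls_by_provider: Dict[str, List[str]],
    fetch: Callable[[str], bytes],
    max_workers: int = 8,
) -> Dict[str, bytes]:
    """Fetch every URL from every provider concurrently and map URL to payload."""
    # Flatten so that images from different providers share the same worker pool.
    jobs = [url for urls in urls_by_provider.values() for url in urls]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        payloads = list(pool.map(fetch, jobs))  # map() preserves input order
    return dict(zip(jobs, payloads))
```

Raising `max_workers` plays the role that adding nodes plays in the cluster: more fetches run in parallel, so a fixed set of images finishes sooner, up to the limits of the network and the providers.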
22. EXAMPLE SYSTEM 2: IMAGE BUNDLER
BUNDLES MILLIONS OF DAILY IMAGES INTO SINGLE HDFS FILE
Diagram: Hadoop holding one image bundle per day (Image Bundle May 31, 2014; Image Bundle May 30, 2014)
Uses HIPI (http://hipi.cs.virginia.edu) to store multiple images in an HDFS sequence file
Instead of millions of small daily image files (<< block size), have 1 large daily file with all images bundled inside (>> block size)
We tag images with metadata, permanently linking images to our vehicle database (e.g., VIN, Make, Model, Model Year)
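To make the bundling idea concrete, here is a minimal sketch that packs many small images and their metadata into one stream of length-prefixed records. It is not the HIPI or sequence-file format; the record layout, the JSON metadata header and the function names are assumptions chosen only to show why one large file can replace millions of small ones.

```python
import json
import struct
from typing import BinaryIO, Dict, Iterable, Iterator, Tuple

ImageRecord = Tuple[Dict[str, str], bytes]


def write_bundle(stream: BinaryIO, images: Iterable[ImageRecord]) -> None:
    """Append each image as: header length, data length, JSON metadata, raw bytes."""
    for metadata, data in images:
        header = json.dumps(metadata, sort_keys=True).encode("utf-8")
        stream.write(struct.pack(">II", len(header), len(data)))
        stream.write(header)
        stream.write(data)


def read_bundle(stream: BinaryIO) -> Iterator[ImageRecord]:
    """Yield (metadata, image bytes) pairs until the stream is exhausted."""
    while True:
        prefix = stream.read(8)
        if len(prefix) < 8:
            return
        header_len, data_len = struct.unpack(">II", prefix)
        metadata = json.loads(stream.read(header_len).decode("utf-8"))
        yield metadata, stream.read(data_len)
```

Because the metadata travels inside the same record as the pixels, a downstream job can link each image back to the vehicle database without a separate lookup file.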
23. EXAMPLE SYSTEM 2: IMAGE PROCESSOR
PROCESSES IMAGE BUNDLE THROUGH HADOOP
Thumbnailing: builds thumbnail library
Vehicle Locator: finds vehicle in image
Color Decoder: determines vehicle RGB color code (diagram example: COCOCO)
Orientation: determines image orientation (diagram example: Driver Side)
Image bundles can be processed through multiple Java MapReduce routines
Thumbnailing is done with ImageJ
Vehicle locator will be done with OpenCV, using edge detection and shape-based features
Average color will be determined from pixel value ratios in the RGB layers of the JPEG
Orientation will be determined with shape-based features and gradient algorithms (see Rybski, Huber, Morris, and Hoffman 2010)
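A simplified version of the color step can be sketched as a per-channel mean over the pixels of the located vehicle. This is an assumption-laden stand-in: the deck mentions pixel value ratios in the RGB layers without giving the formula, and the function names and the in-memory pixel list below are hypothetical (a real routine would decode the JPEG first).

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]


def average_color(pixels: Iterable[Pixel]) -> Pixel:
    """Return the per-channel integer mean of a collection of RGB pixels."""
    total_r = total_g = total_b = count = 0
    for r, g, b in pixels:
        total_r += r
        total_g += g
        total_b += b
        count += 1
    if count == 0:
        raise ValueError("cannot average an empty set of pixels")
    return (total_r // count, total_g // count, total_b // count)


def to_hex(color: Pixel) -> str:
    """Format an RGB triple as an uppercase hex color code."""
    return "#{:02X}{:02X}{:02X}".format(*color)
```

For instance, two pixels `(200, 10, 10)` and `(100, 30, 50)` average to `(150, 20, 30)`, which formats as `#96141E`.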
24. EXAMPLE SYSTEM 3: ADVANCED BUSINESS INTELLIGENCE
The Situation:
8 years of web/app behavior
25,000+ data fields
50,000,000+ configured vehicles
1,000,000+ TrueCar car transactions
Previous approaches had data spread across 4+ data warehouses, with only a small portion of the data online and available for query, and required extensive data movement pipelines to integrate
The Goal with Hadoop:
All behavioral data for all time available for analytics
Data ingested at least once per day, with most arriving in near real time
Remove worry from analysts and DBAs regarding deletion or offline archive
Reduce data warehouses, consolidate analytic tooling
27. WAS IT WORTH IT?
ECONOMIC
Storage Costs, Compute Costs: FROM $19.00/GB to $0.23/GB
Elimination of expensive proprietary tools
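The size of the per-GB cost change quoted above can be worked out directly. The short calculation below uses only the two figures from the slide; the variable names are illustrative.

```python
old_cost_per_gb = 19.00  # USD per GB before Hadoop, from the slide
new_cost_per_gb = 0.23   # USD per GB after Hadoop, from the slide

# How many times cheaper each GB became.
reduction_factor = old_cost_per_gb / new_cost_per_gb

# Share of the original per-GB cost that was eliminated.
percent_saved = (1 - new_cost_per_gb / old_cost_per_gb) * 100

print(f"{reduction_factor:.1f}x cheaper, {percent_saved:.1f}% lower cost per GB")
# prints "82.6x cheaper, 98.8% lower cost per GB"
```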
FUNCTIONALITY
Development effort of complex data applications reduced by 3x
Automated Trend Hunting
Consolidation of data into immediately computable, searchable infrastructure
Unified ETL and Storage system – near zero data movement environment
Functional Programming Approach