2. Analytics at LinkedIn
• Reporting – Understanding business
performance and activity
• Product – Understanding users and
making data-informed decisions
• Customer service – Investigating
incorrect behavior on the site
• Data products – Building products for
our users, powered by data
• Research – Economics research
powered by the data we have
• Systems engineering – Analyzing
internal system performance
3. LinkedIn Analytics circa 2014
Engineering:
• Hadoop-centric
• Pig is the dominant tool, followed by Hive
• Bad addiction to Avro
Everyone else:
• “Traditional” DWH
• Data siphoned out of HDFS
• Faster, more expensive, harder to scale
4. First steps with Presto
• Separate cluster for Presto
• Replicate data in
• Avro?
• How can we manage all this?
(Diagram: data replicated from the main HDFS cluster into the separate Presto cluster)
5. Apache Gobblin
• Monitor tables for new data
• Convert any new partitions or
snapshots to ORC and
replicate to the other cluster
• Sort the data while we’re at it
• Easy to scale, easy to onboard
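The monitor-and-replicate loop can be reduced to a partition diff: anything present in the source table but missing on the Presto side still needs ORC conversion and replication. A minimal sketch of that step, with illustrative names rather than Gobblin’s actual API:

```java
import java.util.*;

// Sketch of the "monitor tables for new data" step: find partitions present
// in the source table but missing from the ORC copy on the Presto cluster.
// Class and method names are illustrative, not Gobblin's API.
public class PartitionDiff {
    // Returns partitions that still need ORC conversion and replication,
    // sorted so older partitions are copied first.
    public static List<String> pendingPartitions(Set<String> source, Set<String> replicated) {
        List<String> pending = new ArrayList<>(source);
        pending.removeAll(replicated);
        Collections.sort(pending);
        return pending;
    }

    public static void main(String[] args) {
        Set<String> source = new HashSet<>(
                Arrays.asList("ds=2014-01-01", "ds=2014-01-02", "ds=2014-01-03"));
        Set<String> replicated = new HashSet<>(Arrays.asList("ds=2014-01-01"));
        System.out.println(pendingPartitions(source, replicated));
        // → [ds=2014-01-02, ds=2014-01-03]
    }
}
```

Onboarding a new table is then just adding it to the monitored set, which is what makes the approach easy to scale.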
6. Scaling up
• Users love Presto!
• Adoption took off
• How are people using it?
• Is the experience consistent?
• What problems are people
having?
(Chart: daily users – artist’s impression)
7. Kafka logging
• Had operational metrics, needed
tracking
• Publish everything to Kafka
• Bring back to Presto for analysis
• Used to require modification,
now it’s just a plugin
• No excuse not to do this
public interface EventListener {
    default void queryCreated(…) { }
    default void queryCompleted(…) { }
    default void splitCompleted(…) { }
}
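A query logger built on this plugin interface just turns each completed query into a Kafka record. A self-contained sketch of the idea, where a `Consumer<String>` stands in for the Kafka producer and the event type is a simplified stand-in for Presto’s real event classes:

```java
import java.util.*;
import java.util.function.Consumer;

// Sketch of a query logger in the spirit of the EventListener plugin above.
// In production the publisher would be a KafkaProducer sending to a tracking
// topic; here a Consumer<String> stands in for it, and QueryCompleted is a
// simplified stand-in for the real event type.
public class QueryLogListener {
    public static class QueryCompleted {
        final String queryId;
        final String user;
        final long wallTimeMillis;
        public QueryCompleted(String queryId, String user, long wallTimeMillis) {
            this.queryId = queryId;
            this.user = user;
            this.wallTimeMillis = wallTimeMillis;
        }
    }

    private final Consumer<String> publisher;

    public QueryLogListener(Consumer<String> publisher) {
        this.publisher = publisher;
    }

    // Publish one record per completed query for later analysis in Presto.
    public void queryCompleted(QueryCompleted event) {
        publisher.accept(event.queryId + "\t" + event.user + "\t" + event.wallTimeMillis);
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        QueryLogListener listener = new QueryLogListener(sent::add);
        listener.queryCompleted(new QueryCompleted("20140101_0001", "alice", 1200));
        System.out.println(sent.get(0));
        // → 20140101_0001	alice	1200
    }
}
```

Because the events flow back through Kafka into Presto itself, the logs can be queried with the same tools they describe.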
8. Meta analysis
• Resource intensive tables – Which
tables are worth optimizing?
• HDFS locality – Are we processing
near the storage?
• Bug triaging – How many people
might be impacted?
• User support – “Oops! I didn’t save
my query”
• User growth – Who’s using Presto?
Who’s new?
• Tools used – Are people using the
tools we provide? Are they having
trouble?
• Expensive queries – Where are the
resources going?
• Lineage analysis – Who is using this
data and how?
9. That cluster?
• Got us off the ground
• Demonstrated value
• Need to get out of this mess!
10. Life in a silo
• Very hard to get new data in
• Equally hard to get data out
• Can’t interoperate with our other tools
• Don’t want to do a hard cutover
11. Other data sources
• JDBC
• Pinot
• Kafka
• Venice
• Espresso
• Elasticsearch
• OpenTSDB
• Should we burden a user with this?
12. Federation
Make many data sources look like one
• Hide the physical location of data from users
• Allow infrastructure providers to make changes behind the scenes
• Scans and writes are routed to connectors dynamically
13. Federation of connectors
• Like a multiplexer for
connectors
• Any connector can be
federated
• Flexible routing mechanism
(Diagram: read/write operations enter the federation plugin, whose pluggable routing logic directs them to the Hive, Kafka, and JDBC connectors)
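The multiplexer idea is simple: users see one connector, and a pluggable routing function picks which underlying connector serves each operation. A minimal sketch, with illustrative names rather than Presto’s actual SPI:

```java
import java.util.*;
import java.util.function.Function;

// Sketch of a federating connector: a multiplexer that routes each read to
// one of several registered connectors. The routing logic is pluggable; here
// it is just a function from table name to connector name. Names are
// illustrative, not Presto's actual SPI.
public class FederatingConnector {
    interface Connector {
        List<String> read(String table);
    }

    private final Map<String, Connector> connectors = new HashMap<>();
    private final Function<String, String> router; // pluggable routing logic

    FederatingConnector(Function<String, String> router) {
        this.router = router;
    }

    void register(String name, Connector connector) {
        connectors.put(name, connector);
    }

    // Users see one connector; the physical location stays hidden behind the router.
    List<String> read(String table) {
        return connectors.get(router.apply(table)).read(table);
    }

    public static void main(String[] args) {
        // Example routing rule: *_events tables live in Kafka, everything else in Hive.
        FederatingConnector fed = new FederatingConnector(
                t -> t.endsWith("_events") ? "kafka" : "hive");
        fed.register("hive", table -> Arrays.asList("hive:" + table));
        fed.register("kafka", table -> Arrays.asList("kafka:" + table));
        System.out.println(fed.read("page_views"));   // → [hive:page_views]
        System.out.println(fed.read("click_events")); // → [kafka:click_events]
    }
}
```

Because the router is just a function, infrastructure providers can move a table between connectors by changing the routing rule, with no change visible to users.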
14. Dali views
Encapsulating business logic
• All the goodness of views, but for Hadoop!
• An abstraction layer for data
• Manage data the same way you manage libraries and services
15. Dali tooling
• Manage your data like a service
• Allow versioned views for
compatibility
• Authoring, testing,
deployment, and life-cycle
tools
(Diagram: authoring tools in git feed validation and versioning, then deployment through Artifactory into the metastores)
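“Versioned views for compatibility” works like library dependency resolution: a consumer pins a major version of a view and transparently picks up compatible minor releases. A sketch of that resolution rule, with an illustrative naming scheme rather than Dali’s exact implementation:

```java
import java.util.*;

// Sketch of versioned-view resolution: a consumer pins a major version and
// resolves to the newest deployed minor version within it, so incompatible
// releases (a new major version) never break existing readers. Naming scheme
// and resolution rule are illustrative, not Dali's exact implementation.
public class ViewResolver {
    // deployed holds {major, minor} pairs for the deployed versions of the view,
    // e.g. {1, 2} for "member_profile_1_2".
    static String resolve(String view, int pinnedMajor, List<int[]> deployed) {
        int bestMinor = -1;
        for (int[] v : deployed) {
            if (v[0] == pinnedMajor && v[1] > bestMinor) {
                bestMinor = v[1];
            }
        }
        if (bestMinor < 0) {
            throw new NoSuchElementException(view + " has no deployed " + pinnedMajor + ".x version");
        }
        return view + "_" + pinnedMajor + "_" + bestMinor;
    }

    public static void main(String[] args) {
        List<int[]> deployed = Arrays.asList(
                new int[]{1, 0}, new int[]{1, 2}, new int[]{2, 0});
        // A consumer pinned to major version 1 keeps working after 2.0 ships.
        System.out.println(resolve("member_profile", 1, deployed));
        // → member_profile_1_2
    }
}
```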
16. HiveQL to Presto SQL conversion
• Presto can’t run HiveQL
• Translate through Apache
Calcite
• Evaluate result as a Presto view
• Coverage for most Hive
features
(Flowchart: view analysis checks whether the definition is already a Presto view; if yes, the Presto SQL is used directly, otherwise the Hive analyzer and Calcite SQL generator produce Presto SQL)
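To make concrete why a translation layer is needed at all, here is one of the many syntax differences it has to bridge: Hive quotes identifiers with backticks while Presto uses double quotes. This toy string rewrite is purely illustrative; the real pipeline translates through Apache Calcite’s relational algebra, not string manipulation.

```java
// Toy illustration of one HiveQL/Presto SQL difference: identifier quoting.
// Hive uses backticks, Presto uses double quotes. The actual conversion goes
// through Apache Calcite, not string rewriting like this.
public class IdentifierQuoting {
    static String hiveToPresto(String hiveql) {
        return hiveql.replace('`', '"');
    }

    public static void main(String[] args) {
        System.out.println(hiveToPresto("SELECT `user` FROM `events`"));
        // → SELECT "user" FROM "events"
    }
}
```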
17. UDFs
• Every platform has its own API
• Bridge across them with a common abstraction
• Zero development cost per UDF per platform
• A bit of runtime overhead
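The common-abstraction approach can be sketched as follows: the UDF logic is written once against a platform-neutral interface, and thin generic adapters bind it to each engine, so no per-UDF, per-platform code is needed. Interface and adapter names here are illustrative assumptions, not the actual framework.

```java
import java.util.Locale;
import java.util.function.Function;

// Sketch of a cross-platform UDF abstraction: write the logic once against a
// neutral interface, then adapt it generically per platform. The interface
// and adapter names are illustrative, not an actual framework's API.
public class PortableUdf {
    // Platform-neutral one-argument UDF, written once.
    interface Udf1<I, O> {
        O eval(I input);
    }

    static final Udf1<String, String> NORMALIZE_COUNTRY =
            s -> s == null ? null : s.trim().toUpperCase(Locale.ROOT);

    // Generic adapter for a platform that consumes a java.util.function.Function.
    // A Hive-side wrapper would adapt the same Udf1 to Hive's UDF API instead;
    // the extra indirection per row is the "bit of runtime overhead".
    static <I, O> Function<I, O> asFunction(Udf1<I, O> udf) {
        return udf::eval;
    }

    public static void main(String[] args) {
        System.out.println(asFunction(NORMALIZE_COUNTRY).apply(" us "));
        // → US
    }
}
```

One adapter per platform covers every UDF written against the neutral interface, which is what drives the per-UDF, per-platform cost to zero.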
19. 4 takeaways
• Gobblin – big hammer for optimizing data
• Kafka logger – measure everything, analyze later
• Federation – making many data sources look like one
• Dali Views – encapsulating business logic