2. Analytics at LinkedIn
• Reporting – Understanding business
performance and activity
• Product – Understanding users and
making data-informed decisions
• Customer service – Investigating
incorrect behavior on the site
• Data products – Building products for
our users, powered by data
• Research – Economics research
powered by the data we have
• Systems engineering – Analyzing
internal system performance
3. LinkedIn Analytics circa 2014
Engineering:
• Hadoop-centric
• Pig is the dominant tool, followed by Hive
• Bad addiction to Avro
Everyone else:
• “Traditional” DWH
• Data siphoned out of HDFS
• Faster, more expensive, harder to scale
4. First steps with Presto
• Separate cluster for Presto
• Replicate data in
• Avro?
• How can we manage all this?
(Diagram: data replicated from the main HDFS cluster into the separate Presto cluster)
5. Apache Gobblin
• Monitor tables for new data
• Convert any new partitions or
snapshots to ORC and
replicate to the other cluster
• Sort the data while we’re at it
• Easy to scale, easy to onboard
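The monitor-and-replicate loop can be reduced to a partition diff: anything present in the source table but missing on the Presto side still needs ORC conversion and replication. A minimal sketch of that step, with illustrative names rather than Gobblin’s actual API:

```java
import java.util.*;

// Sketch of the "monitor tables for new data" step: find partitions present
// in the source table but missing from the ORC copy on the Presto cluster.
// Class and method names are illustrative, not Gobblin's API.
public class PartitionDiff {
    // Returns partitions that still need ORC conversion and replication,
    // sorted so older partitions are copied first.
    public static List<String> pendingPartitions(Set<String> source, Set<String> replicated) {
        List<String> pending = new ArrayList<>(source);
        pending.removeAll(replicated);
        Collections.sort(pending);
        return pending;
    }

    public static void main(String[] args) {
        Set<String> source = new HashSet<>(
                Arrays.asList("ds=2014-01-01", "ds=2014-01-02", "ds=2014-01-03"));
        Set<String> replicated = new HashSet<>(Arrays.asList("ds=2014-01-01"));
        System.out.println(pendingPartitions(source, replicated));
        // → [ds=2014-01-02, ds=2014-01-03]
    }
}
```

Onboarding a new table is then just adding it to the monitored set, which is what makes the approach easy to scale.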
6. Scaling up
• Users love Presto!
• Adoption took off
• How are people using it?
• Is the experience consistent?
• What problems are people
having?
(Chart: daily users – artist’s impression)
7. Kafka logging
• Had operational metrics, needed
tracking
• Publish everything to Kafka
• Bring back to Presto for analysis
• Used to require modification,
now it’s just a plugin
• No excuse not to do this
public interface EventListener {
    default void queryCreated(…) { }
    default void queryCompleted(…) { }
    default void splitCompleted(…) { }
}
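A query logger built on this plugin interface just turns each completed query into a Kafka record. A self-contained sketch of the idea, where a `Consumer<String>` stands in for the Kafka producer and the event type is a simplified stand-in for Presto’s real event classes:

```java
import java.util.*;
import java.util.function.Consumer;

// Sketch of a query logger in the spirit of the EventListener plugin above.
// In production the publisher would be a KafkaProducer sending to a tracking
// topic; here a Consumer<String> stands in for it, and QueryCompleted is a
// simplified stand-in for the real event type.
public class QueryLogListener {
    public static class QueryCompleted {
        final String queryId;
        final String user;
        final long wallTimeMillis;
        public QueryCompleted(String queryId, String user, long wallTimeMillis) {
            this.queryId = queryId;
            this.user = user;
            this.wallTimeMillis = wallTimeMillis;
        }
    }

    private final Consumer<String> publisher;

    public QueryLogListener(Consumer<String> publisher) {
        this.publisher = publisher;
    }

    // Publish one record per completed query for later analysis in Presto.
    public void queryCompleted(QueryCompleted event) {
        publisher.accept(event.queryId + "\t" + event.user + "\t" + event.wallTimeMillis);
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        QueryLogListener listener = new QueryLogListener(sent::add);
        listener.queryCompleted(new QueryCompleted("20140101_0001", "alice", 1200));
        System.out.println(sent.get(0));
        // → 20140101_0001	alice	1200
    }
}
```

Because the events flow back through Kafka into Presto itself, the logs can be queried with the same tools they describe.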
8. Meta analysis
• Resource intensive tables – Which
tables are worth optimizing?
• HDFS locality – Are we processing
near the storage?
• Bug triaging – How many people
might be impacted?
• User support – “Oops! I didn’t save
my query”
• User growth – Who’s using Presto?
Who’s new?
• Tools used – Are people using the
tools we provide? Are they having
trouble?
• Expensive queries – Where are the
resources going?
• Lineage analysis – Who is using this
data and how?
9. That cluster?
• Got us off the ground
• Demonstrated value
• Need to get out of this mess!
10. Life in a silo
• Very hard to get new data in
• Equally hard to get data out
• Can’t interoperate with our other tools
• Don’t want to do a hard cutover
11. Other data sources
• JDBC
• Pinot
• Kafka
• Venice
• Espresso
• Elasticsearch
• OpenTSDB
• Should we burden a user with this?
12. Federation
Make many data sources look like one
• Hide the physical location of data from users
• Allow infrastructure providers to make changes behind the scenes
• Scans and writes are routed to connectors dynamically
13. Federation of connectors
• Like a multiplexer for
connectors
• Any connector can be
federated
• Flexible routing mechanism
(Diagram: read/write operations enter the federation plugin, whose pluggable routing logic directs them to the Hive, Kafka, and JDBC connectors)
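The multiplexer idea is simple: users see one connector, and a pluggable routing function picks which underlying connector serves each operation. A minimal sketch, with illustrative names rather than Presto’s actual SPI:

```java
import java.util.*;
import java.util.function.Function;

// Sketch of a federating connector: a multiplexer that routes each read to
// one of several registered connectors. The routing logic is pluggable; here
// it is just a function from table name to connector name. Names are
// illustrative, not Presto's actual SPI.
public class FederatingConnector {
    interface Connector {
        List<String> read(String table);
    }

    private final Map<String, Connector> connectors = new HashMap<>();
    private final Function<String, String> router; // pluggable routing logic

    FederatingConnector(Function<String, String> router) {
        this.router = router;
    }

    void register(String name, Connector connector) {
        connectors.put(name, connector);
    }

    // Users see one connector; the physical location stays hidden behind the router.
    List<String> read(String table) {
        return connectors.get(router.apply(table)).read(table);
    }

    public static void main(String[] args) {
        // Example routing rule: *_events tables live in Kafka, everything else in Hive.
        FederatingConnector fed = new FederatingConnector(
                t -> t.endsWith("_events") ? "kafka" : "hive");
        fed.register("hive", table -> Arrays.asList("hive:" + table));
        fed.register("kafka", table -> Arrays.asList("kafka:" + table));
        System.out.println(fed.read("page_views"));   // → [hive:page_views]
        System.out.println(fed.read("click_events")); // → [kafka:click_events]
    }
}
```

Because the router is just a function, infrastructure providers can move a table between connectors by changing the routing rule, with no change visible to users.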
14. Dali views
Encapsulating business logic
• All the goodness of views, but for Hadoop!
• An abstraction layer for data
• Manage data the same way you manage libraries and services
15. Dali tooling
• Manage your data like a service
• Allow versioned views for
compatibility
• Authoring, testing,
deployment, and life-cycle
tools
(Diagram: authoring tools in git feed validation and versioning, then deployment through Artifactory into the metastores)
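“Versioned views for compatibility” works like library dependency resolution: a consumer pins a major version of a view and transparently picks up compatible minor releases. A sketch of that resolution rule, with an illustrative naming scheme rather than Dali’s exact implementation:

```java
import java.util.*;

// Sketch of versioned-view resolution: a consumer pins a major version and
// resolves to the newest deployed minor version within it, so incompatible
// releases (a new major version) never break existing readers. Naming scheme
// and resolution rule are illustrative, not Dali's exact implementation.
public class ViewResolver {
    // deployed holds {major, minor} pairs for the deployed versions of the view,
    // e.g. {1, 2} for "member_profile_1_2".
    static String resolve(String view, int pinnedMajor, List<int[]> deployed) {
        int bestMinor = -1;
        for (int[] v : deployed) {
            if (v[0] == pinnedMajor && v[1] > bestMinor) {
                bestMinor = v[1];
            }
        }
        if (bestMinor < 0) {
            throw new NoSuchElementException(view + " has no deployed " + pinnedMajor + ".x version");
        }
        return view + "_" + pinnedMajor + "_" + bestMinor;
    }

    public static void main(String[] args) {
        List<int[]> deployed = Arrays.asList(
                new int[]{1, 0}, new int[]{1, 2}, new int[]{2, 0});
        // A consumer pinned to major version 1 keeps working after 2.0 ships.
        System.out.println(resolve("member_profile", 1, deployed));
        // → member_profile_1_2
    }
}
```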
16. HiveQL to Presto SQL conversion
• Presto can’t run HiveQL
• Translate through Apache
Calcite
• Evaluate result as a Presto view
• Coverage for most Hive
features
(Flowchart: view analysis checks whether the definition is already a Presto view; if yes, the Presto SQL is used directly, otherwise the Hive analyzer and Calcite SQL generator produce Presto SQL)
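To make concrete why a translation layer is needed at all, here is one of the many syntax differences it has to bridge: Hive quotes identifiers with backticks while Presto uses double quotes. This toy string rewrite is purely illustrative; the real pipeline translates through Apache Calcite’s relational algebra, not string manipulation.

```java
// Toy illustration of one HiveQL/Presto SQL difference: identifier quoting.
// Hive uses backticks, Presto uses double quotes. The actual conversion goes
// through Apache Calcite, not string rewriting like this.
public class IdentifierQuoting {
    static String hiveToPresto(String hiveql) {
        return hiveql.replace('`', '"');
    }

    public static void main(String[] args) {
        System.out.println(hiveToPresto("SELECT `user` FROM `events`"));
        // → SELECT "user" FROM "events"
    }
}
```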
17. UDFs
• Every platform has its own API
• Bridge across them with a common abstraction
• Zero development cost per UDF per platform
• A bit of runtime overhead
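The common-abstraction approach can be sketched as follows: the UDF logic is written once against a platform-neutral interface, and thin generic adapters bind it to each engine, so no per-UDF, per-platform code is needed. Interface and adapter names here are illustrative assumptions, not the actual framework.

```java
import java.util.Locale;
import java.util.function.Function;

// Sketch of a cross-platform UDF abstraction: write the logic once against a
// neutral interface, then adapt it generically per platform. The interface
// and adapter names are illustrative, not an actual framework's API.
public class PortableUdf {
    // Platform-neutral one-argument UDF, written once.
    interface Udf1<I, O> {
        O eval(I input);
    }

    static final Udf1<String, String> NORMALIZE_COUNTRY =
            s -> s == null ? null : s.trim().toUpperCase(Locale.ROOT);

    // Generic adapter for a platform that consumes a java.util.function.Function.
    // A Hive-side wrapper would adapt the same Udf1 to Hive's UDF API instead;
    // the extra indirection per row is the "bit of runtime overhead".
    static <I, O> Function<I, O> asFunction(Udf1<I, O> udf) {
        return udf::eval;
    }

    public static void main(String[] args) {
        System.out.println(asFunction(NORMALIZE_COUNTRY).apply(" us "));
        // → US
    }
}
```

One adapter per platform covers every UDF written against the neutral interface, which is what drives the per-UDF, per-platform cost to zero.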
19. 4 takeaways
• Gobblin – big hammer for optimizing data
• Kafka logger – measure everything, analyze later
• Federation – making many data sources look like one
• Dali Views – encapsulating business logic