This document summarizes a presentation given by Robert Lancaster and Jonathan Seidman about how their company, Orbitz, is extending their enterprise data warehouse with Hadoop. They discuss how Hadoop provides scalable storage and processing of large amounts of log and web analytics data. They then provide examples of how this data is used for applications like optimizing hotel search, recommendations, and user segmentation. Finally, they outline their vision of integrating Hadoop and the data warehouse to provide a unified view for business intelligence and analytics tools.
1. Extending the Enterprise Data Warehouse with Hadoop
Robert Lancaster and Jonathan Seidman
Chicago Data Summit
April 26, 2011
2. Who We Are
• Robert Lancaster
– Solutions Architect, Hotel Supply Team
– rlancaster@orbitz.com
– @rob1lancaster
• Jonathan Seidman
– Lead Engineer, Business Intelligence/Big Data Team
– Co-founder/organizer of Chicago Hadoop User Group
(http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG)
– jseidman@orbitz.com
– @jseidman
14. Cache Analysis
[Chart: y-axis 0% to 100%, x-axis 1 to 20. Series: Queries, Searches, Reverse Running Total (Queries), Reverse Running Total (Searches). Labeled values: 71.67%, 34.30%, 31.87%, 2.78%.]
• 72% of queries are singletons and make up nearly a third of total search volume.
• A small number of queries (3%) make up more than a third of search volume.
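The kind of breakdown this slide reports can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `frequency_profile`, the sample data, and the reading of "reverse running total" as "share attributable to queries issued at least f times" are all hypothetical and are not taken from the Orbitz analysis.

```python
from collections import Counter

def frequency_profile(searches, max_freq=20):
    """Summarize how often distinct queries repeat in a list of searches.

    `searches` holds one entry per search; identical entries are the same
    query. For each frequency f, the result reports the share of distinct
    queries issued exactly f times, the share of searches they account for,
    and the same two shares for queries issued at least f times (assumed
    here to be what the chart calls a "reverse running total").
    """
    counts = Counter(searches)           # query -> number of searches
    n_queries = len(counts)
    n_searches = len(searches)
    by_freq = Counter(counts.values())   # frequency -> number of distinct queries

    rows = []
    for f in range(1, max_freq + 1):
        queries_at_least = sum(n for freq, n in by_freq.items() if freq >= f)
        searches_at_least = sum(freq * n for freq, n in by_freq.items() if freq >= f)
        rows.append({
            "freq": f,
            "queries": by_freq.get(f, 0) / n_queries,
            "searches": f * by_freq.get(f, 0) / n_searches,
            "rrt_queries": queries_at_least / n_queries,
            "rrt_searches": searches_at_least / n_searches,
        })
    return rows

# Hypothetical sample: three singleton queries, one issued twice, one issued five times.
sample = ["a", "b", "c", "d", "d", "e", "e", "e", "e", "e"]
profile = frequency_profile(sample, max_freq=5)
```

In this sample the three singleton queries are 60% of the distinct queries but only 30% of the searches, the same shape of result the slide describes at much larger scale.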
16. All of this is great, but…
Most of these efforts are driven by development
teams.
The challenge now is to unlock the value in this data
by making it more available to the rest of the
organization.
17. “Given the ubiquity of data in modern organizations, a data
warehouse can keep pace today only by being ‘magnetic’:
attracting all the data sources that crop up within an
organization regardless of data quality niceties.”*
*MAD Skills: New Analysis Practices for Big Data
31. Click Data Processing – Current DW Processing
[Diagram: Web Servers → Web Server Logs → ETL → DW → Data Cleansing (stored procedure) → DW]
Annotations: 3 hours, 2 hours, ~20% original data size.
32. Click Data Processing – New Hadoop Processing
[Diagram: Web Servers → Web Server Logs → HDFS → Data Cleansing (MapReduce) → DW]
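To make the "Data Cleansing (MapReduce)" step concrete, here is a minimal sketch of a map-only cleansing pass written in the style of a Hadoop Streaming mapper, which reads raw lines on standard input and writes cleaned lines on standard output. Everything specific in it is an assumption for illustration: the tab-separated log layout, the bot filter, and the names `clean_record` and `run` are hypothetical and do not describe Orbitz's actual logs or rules.

```python
import io

# Hypothetical log layout (an assumption, not the real Orbitz format):
# timestamp <TAB> session_id <TAB> url <TAB> user_agent
EXPECTED_FIELDS = 4
BOT_MARKERS = ("bot", "crawler", "spider")  # illustrative filter only

def clean_record(line):
    """Return 'session_id<TAB>timestamp<TAB>url', or None to drop the line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        return None  # malformed line
    timestamp, session_id, url, user_agent = (f.strip() for f in fields)
    if not timestamp or not session_id or not url:
        return None  # a required field is missing
    if any(marker in user_agent.lower() for marker in BOT_MARKERS):
        return None  # assumed rule: discard automated traffic
    return "\t".join((session_id, timestamp, url))

def run(stdin, stdout):
    """Map-only pass: read raw log lines, write only the cleaned ones."""
    for line in stdin:
        cleaned = clean_record(line)
        if cleaned is not None:
            stdout.write(cleaned + "\n")

# Local demonstration with in-memory streams instead of a cluster.
sample_log = io.StringIO(
    "2011-04-26T10:00:00\ts1\t/hotels/chicago\tMozilla/5.0\n"
    "malformed line without tabs\n"
    "2011-04-26T10:00:05\ts2\t/hotels/chicago\tExampleBot/1.0\n"
)
cleaned_output = io.StringIO()
run(sample_log, cleaned_output)
```

A script used as a streaming mapper would call `run(sys.stdin, sys.stdout)` instead of the in-memory demonstration. The point of the design is the one the slides make: filtering happens in Hadoop, so only the reduced data set needs to be loaded into the data warehouse.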
33. Conclusions
• Market is still immature, but Hadoop has already become a
valuable business intelligence tool, and will become an
increasingly important part of a BI infrastructure.
• Hadoop won’t replace your EDW, but any organization with a
large EDW should at least be exploring Hadoop as a
complement to their BI infrastructure.
• Use Hadoop to offload the time- and resource-intensive
processing of large data sets so you can free up your data
warehouse to serve user needs.
• The challenge now is making Hadoop more accessible to non-
developers. Vendors are addressing this, so expect rapid
advancements in Hadoop accessibility.
34. Oh, and also…
• Orbitz is looking for a Lead Engineer for the BI/Big Data team.
• Go to http://careers.orbitz.com/ and search for IRC19035.
35. References
• MAD Skills: New Analysis Practices for Big Data, Jeffrey
Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, and
Caleb Welton, 2009