Hadoop provides the ability to extract business intelligence from extremely large, heterogeneous data sets that were previously impractical to store and process in traditional data warehouses. The challenge now is in bridging the gap between the data warehouse and Hadoop. In this talk we’ll discuss some steps that Orbitz has taken to bridge this gap, including examples of how Hadoop and Hive are used to aggregate data from large data sets, and how that data can be combined with relational data to create new reports that provide actionable intelligence to business users.
7. Hadoop Provided a Solution…
Hadoop: detailed non-transactional data (what every user sees, clicks, etc.)
Data Warehouse: transactional data (e.g. bookings) and aggregated non-transactional data
12. A View Shared Beyond Orbitz…
"We strongly believe that Hadoop is the nucleus of the next-generation cloud EDW… but that promise is still three to five years from fruition."
*James Kobielus, Forrester Research, "Hadoop, Is It Soup Yet?"
15. ETL Example: Click Data Processing (Current Processing in Data Warehouse)
Web Servers -> Web Server Logs -> ETL -> DW -> Data Cleansing (stored procedure) -> DW
Several hours of processing; cleansed output is ~20% of the original data size.
18. BI Vendors Are Working on Hadoop Integration: both big (relatively)…
Most people think of orbitz.com, but Orbitz Worldwide is really a global portfolio of leading online travel consumer brands, including Orbitz, Cheaptickets, The Away Network, ebookers, and HotelClub. Orbitz also provides business-to-business services: Orbitz Worldwide Distribution provides hotel booking capabilities to a number of leading carriers such as Amtrak, Delta, LAN, KLM, and Air France, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. Orbitz was started in 1999, and the orbitz.com site launched in 2001.
The initial motivation was to solve a particular business problem. Orbitz wanted to be able to use intelligent algorithms to optimize various site functions, for example optimizing hotel search by showing consumers hotels that more closely match their preferences, leading to more bookings.
Improving hotel search requires access to data such as which hotels users saw in search results, which hotels they clicked on, and which hotels were actually booked. Much of this data was available in web analytics logs.
Our data warehouse contains a full record of all transactions, but much of the required non-transactional data was either not stored, or stored in aggregated fields.
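To make the hotel-search use case concrete, here is a minimal sketch of the kind of per-hotel aggregation described above: counting impressions, clicks, and bookings and deriving a click-through rate. The event records and field names are illustrative assumptions, not Orbitz's actual schema; in practice this aggregation would run as a Hadoop job over the web analytics logs rather than in memory.

```python
from collections import defaultdict

# Hypothetical event stream of the kind described above: which hotels
# appeared in search results, which were clicked, which were booked.
events = [
    ("H1", "impression"), ("H1", "impression"), ("H1", "click"),
    ("H2", "impression"), ("H2", "click"), ("H2", "booking"),
    ("H1", "impression"),
]

# Tally events per hotel.
counts = defaultdict(lambda: {"impression": 0, "click": 0, "booking": 0})
for hotel, event_type in events:
    counts[hotel][event_type] += 1

# Click-through rate per hotel: the kind of signal a search-ranking
# algorithm could use to show consumers better-matching hotels.
for hotel, c in sorted(counts.items()):
    ctr = c["click"] / c["impression"] if c["impression"] else 0.0
    print(hotel, f"ctr={ctr:.2f}", f"bookings={c['booking']}")
```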
Hadoop is being used to analyze and optimize cache performance, in this case the hotel rate cache. This type of analysis will allow us to ensure that more requests can be served from the cache, optimizing the user experience and improving our "look-to-book" metrics. Hadoop is also used to crunch data for input to a system that recommends products to users. And although we use third-party sites to monitor site performance, Hadoop allows the front-end team to produce detailed reports on page download performance, providing valuable trending data not available from other sources.
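The cache-performance analysis mentioned above boils down to computing a hit rate from access logs. The sketch below shows the idea with a made-up access log (the real analysis would run over far larger logs in Hadoop); the log format is an assumption for illustration.

```python
# Hypothetical hotel-rate cache access log: True means the rate request
# was served from the cache, False means it missed and required a
# fresh lookup. (Illustrative data, not real traffic.)
accesses = [True, False, True, True, False, True, True, False, True, True]

hits = sum(accesses)
hit_rate = hits / len(accesses)
print(f"cache hit rate: {hit_rate:.0%}")  # 7 of 10 requests hit the cache
```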
The first visualization is just a plot of the latitude/longitude of hotel bookings for the month, illustrating the global nature of the business. The second is a simple price prediction for air fares. The data is also used for analysis of user segments, which can drive personalization: this chart shows that Safari users click on hotels with higher mean and median prices than other users do. These are just a handful of examples of how Hadoop is driving business value.
Recently received an email from a user seeking access to Hive. Sent him a detailed email with info on accessing Hive, etc. Received an email back basically saying “you lost me at ssh”.
Making the Hadoop team part of the BI team probably makes Orbitz unique, but it's a reflection of the importance of big data to driving BI for the company.
Probably both of these are common use cases at other companies employing Hadoop alongside an EDW.
Hadoop will be used to transform web analytics data into a dimensional model, allowing multiple business units to generate reports providing valuable intelligence to improve business results.
This is the processing of click data gathered by web servers; the click data contains marketing info. The data cleansing step is done inside the data warehouse using a stored procedure, and further downstream processing is done to generate the final data sets for reporting. Although this processing generates the required user reports, it consumes considerable time and resources on the data warehouse, resources that could otherwise be used for reports, queries, etc.
The ETL step is eliminated; instead, raw logs will be uploaded to HDFS, which is a much faster process. Moving the data cleansing to MapReduce shifts the "heavy lifting" of processing these relatively large data sets to Hadoop, taking advantage of Hadoop's efficiencies and greatly speeding up the processing.
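A natural way to implement the cleansing step described above is a Hadoop Streaming mapper. The sketch below is an assumed design, not Orbitz's actual implementation: each input line is a raw web-server log record, malformed lines are dropped, and valid ones are emitted as tab-separated fields for downstream reducers. The four-field record layout is hypothetical.

```python
#!/usr/bin/env python
import sys

def clean(line):
    """Validate and normalize one raw log line, or return None to drop it.

    Hypothetical record layout: timestamp, session_id, url, marketing_code,
    tab-separated.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4 or not fields[0]:
        return None  # discard malformed records
    return "\t".join(f.strip() for f in fields)

def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming feeds records on stdin and collects emitted
    # key/value lines from stdout.
    for line in stdin:
        cleaned = clean(line)
        if cleaned is not None:
            stdout.write(cleaned + "\n")

if __name__ == "__main__":
    main()
```

Because each record is cleansed independently, the mapper parallelizes trivially across log splits, which is where the speedup over the stored-procedure approach comes from.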
The data was apparently available in the DW, but it wasn't modeled to enable efficient querying. This points up a strength of Hadoop: it places no constraints on how data is processed.
This provides an example of a typical processing flow for the large volumes of non-transactional data we're collecting. This processing allows us to convert large volumes of unstructured data into structured data that can be queried, extracted, etc. for further processing.
This type of processing also allows us to summarize large volumes of data into a data set that can be exported to the data warehouse, allowing us to query and report on that data using all of our standard BI tools.
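The summarize-and-export step can be sketched as follows: cleaned click records are rolled up into small daily aggregates and written out as CSV for loading into the warehouse. The record shape and field names are illustrative assumptions; in practice the aggregation would run in Hadoop (e.g. via Hive) and the export via a warehouse load tool.

```python
import csv
import io
from collections import Counter

# Hypothetical cleaned click records: (date, page). Summarizing per
# day/page yields a compact data set that standard BI tools can report
# on once it is loaded into the data warehouse.
records = [
    ("2011-06-01", "hotel_search"),
    ("2011-06-01", "hotel_search"),
    ("2011-06-01", "checkout"),
    ("2011-06-02", "hotel_search"),
]

summary = Counter(records)

# Emit the aggregate as CSV, a common interchange format for DW loads.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["date", "page", "clicks"])
for (date, page), clicks in sorted(summary.items()):
    writer.writerow([date, page, clicks])

print(buf.getvalue())
```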