Gartner peer forum sept 2011 orbitz
1. Architecting for Big Data: Integrating Hadoop into an Enterprise Data Infrastructure. Raghu Kashyap and Jonathan Seidman. Gartner Peer Forum, September 14, 2011
2.
3. Launched in 2001, Chicago, IL. Over 160 million bookings.
4.
5.
6. Why We Started Using Hadoop: Optimizing hotel search…
7.
8.
9. Hadoop Was Selected as a Solution… [diagram labels: Transactional Data (e.g. bookings), Data Warehouse; Non-Transactional Data (e.g. searches), Hadoop]
10.
11. Current Big Data Infrastructure [diagram labels: Hadoop (MapReduce, HDFS); MapReduce Jobs (Java, Python, R/RHIPE); Analytic Tools (Hive, Pig); Data Warehouse (Greenplum); psql, gpload, Sqoop; External Analytical Jobs (Java, R, etc.); Aggregated Data]
22. Click Data Processing – Current Data Warehouse Processing [diagram labels: Web Servers; Web Server Logs; ETL; DW; Data Cleansing (Stored procedure); DW; 3 hours; 2 hours; ~20% original data size]
23. Click Data Processing – Proposed Hadoop Processing [diagram labels: Web Servers; Web Server Logs; HDFS; Data Cleansing (MapReduce); DW]
Welcome, everyone. I will be presenting how we are shaping web analytics and big data to optimize data-driven decisions at Orbitz Worldwide. I will also talk about the process model: how we effectively use the brains and manpower across the organization toward a common goal. Between Jonathan and me, we promise to give you some thought-provoking details about analytics and big data :-)
Most people think of orbitz.com, but Orbitz Worldwide is really a global portfolio of leading online travel consumer brands, including Orbitz, CheapTickets, The Away Network, ebookers and HotelClub. Orbitz also provides business-to-business services: Orbitz Worldwide Distribution provides hotel booking capabilities to a number of leading carriers such as Amtrak, Delta, LAN, KLM and Air France, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. Orbitz started in 1999; the Orbitz site launched in 2001.
A couple of years ago when I mentioned Hadoop I’d often get blank stares, even from developers. I think most folks now are at least aware of what Hadoop is.
This chart isn’t exactly an apples-to-apples comparison, but it provides some idea of the difference in cost per TB for the DW vs. Hadoop. Hadoop doesn’t provide the same functionality as a data warehouse, but it does allow us to store and process data that wasn’t practical before for economic and technical reasons. Putting data into a DB or DWH requires having knowledge or making assumptions about how the data will be used. Either way, you’re putting constraints around how the data is accessed and processed. With Hadoop, each application can process the raw data in whatever way is required. If you decide you need to analyze different attributes, you just run a new query.
The initial motivation was to solve a particular business problem. Orbitz wanted to be able to use intelligent algorithms to optimize various site functions, for example optimizing hotel search by showing consumers hotels that more closely match their preferences, leading to more bookings.
Improving hotel search requires access to such data as which hotels users saw in search results, which hotels they clicked on, and which hotels were actually booked. Much of this data was available in web analytics logs.
Management was supportive of anything that facilitated ML team efforts. But when we presented a hardware spec for servers with local non-RAIDed storage, etc., syseng offered us blades with attached storage.
Hadoop is used to crunch data for input to a system to recommend products to users. Although we use third-party sites to monitor site performance, Hadoop allows the front end team to provide detailed reports on page download performance, providing valuable trending data not available from other sources. Data is used for analysis of user segments, which can drive personalization. This chart shows that Safari users click on hotels with higher mean and median prices as opposed to other users. This is just a handful of examples of how Hadoop is driving business value.
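The user-segment comparison described above (mean and median clicked-hotel price per browser segment) can be sketched roughly as follows. This is a minimal illustration only: the function name, the segment labels and the sample prices are all made up for the example and are not Orbitz data or code.

```python
from collections import defaultdict
from statistics import mean, median


def price_stats_by_segment(clicks):
    """Group clicked hotel prices by segment; return {segment: (mean, median)}."""
    prices = defaultdict(list)
    for segment, price in clicks:
        prices[segment].append(price)
    return {seg: (mean(vals), median(vals)) for seg, vals in prices.items()}


# Hypothetical (segment, clicked hotel price) pairs, invented for illustration.
sample_clicks = [
    ("Safari", 200.0), ("Safari", 180.0), ("Safari", 250.0),
    ("Other", 120.0), ("Other", 150.0), ("Other", 90.0),
]

stats = price_stats_by_segment(sample_clicks)
for segment, (avg, mid) in sorted(stats.items()):
    print(f"{segment}: mean={avg:.2f} median={mid:.2f}")
```

In practice the pairs would come from the click data stored in Hadoop rather than from an in-memory list.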
Recently received an email from a user seeking access to Hive. Sent him a detailed email with info on accessing Hive, etc. Received an email back basically saying “you lost me at ssh”.
Prior to 2011, Hadoop responsibilities were split across technology teams. Moving under a single team centralized responsibility and resources for Hadoop.
Processing of click data gathered by web servers. This click data contains marketing info. The data cleansing step is done inside the data warehouse using a stored procedure. Further downstream processing is done to generate final data sets for reporting. Although this processing generates the required user reports, it consumes considerable time and resources on the data warehouse that could be used for reports, queries, etc.
The ETL step is eliminated; instead, raw logs will be uploaded to HDFS, which is a much faster process. Moving the data cleansing to MapReduce shifts the “heavy lifting” of processing the relatively large data sets to Hadoop, which takes advantage of Hadoop’s efficiencies and should greatly speed up the processing.
The bad news is that we need to significantly increase the number of servers in our cluster; the good news is that this is because teams are using Hadoop and new projects are coming online.
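As a rough idea of what a MapReduce-side cleansing step might look like, here is a minimal map-only sketch in the style of a Hadoop Streaming mapper. Everything specific in it is an assumption for illustration: the four-field tab-separated log layout, the bot filter list and the URL normalization are hypothetical, not the actual Orbitz log format or cleansing rules.

```python
import sys

# Hypothetical log layout: timestamp, session_id, url, user_agent (tab-separated).
EXPECTED_FIELDS = 4
# Hypothetical filter list for automated traffic.
BOT_MARKERS = ("bot", "crawler", "spider")


def clean_record(line):
    """Return a cleaned tab-separated record, or None if the line is dropped."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    # Drop malformed lines: wrong field count or any empty field.
    if len(fields) != EXPECTED_FIELDS or not all(fields):
        return None
    timestamp, session_id, url, user_agent = fields
    # Drop traffic whose user agent looks automated.
    if any(marker in user_agent.lower() for marker in BOT_MARKERS):
        return None
    # Normalize the URL so later grouping is case-insensitive.
    return "\t".join((timestamp, session_id, url.lower(), user_agent))


def run_mapper(lines, out=sys.stdout):
    """Write every line that survives cleansing to `out`."""
    for line in lines:
        cleaned = clean_record(line)
        if cleaned is not None:
            out.write(cleaned + "\n")
```

Under Hadoop Streaming, a script like this would call `run_mapper(sys.stdin)` and the framework would feed it splits of the raw log files; no reducer is needed for a pure filtering step.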
I met someone at the train station who asked me what I do. I said I work in the web analytics field, help shape the strategy and vision at Orbitz Worldwide, and enable our business teams to get insights on the performance of our site and act upon them. So he said, "Ah, you do reporting" :-) I started thinking about why web analytics is hard for people to get, and started evangelizing both within and outside Orbitz. I manage the web analytics team at Orbitz Worldwide. I also try to help out non-profit organizations when I am not busy with my wife and 2 sons.
So what is web analytics? Read the definition. It tells you exactly why someone came to your site and what kind of impact they had on the bottom line of your revenue. Read the definition. You need to immerse yourself in data to understand the story it's telling. Read the definition. Focus on the customer. The customer is king. You need to listen and act upon their feedback. Read the definition. Test, test and test. If you want to prove or disprove a HIPPO's opinion, you need to perform tests on your site. By the way, HIPPO is a common term in the industry. It stands for Highest Income Paid Person's Opinion :-)
So with so many brands and so much data, we had quite a few challenges. For starters, we couldn't easily do multi-dimensional analysis with the tools. With data spread across multiple tools, it was hard to picture the whole nine yards. Obviously, tools cost money. It was harder for people to understand where to look for data. With analytics you need direction rather than precision to take action and get insights.
On the big data front, we didn't have a good infrastructure where we could house all this data in a cost-effective way. Data extraction was NOT an easy task. We had to focus on the key differences between when you need testing vs. when you need reporting. Earlier I mentioned that you need to do rigorous outcome analysis. However, with all the challenges we faced, it was not an easy task.
So how do we fit the puzzle together? By learning the behavior of the customer and focusing on key attributes. Know the travel details, such as how many travelers, what kind of travelers, and any preferred carriers or hotels. Understand the shopping patterns: does he want to shop only on weekends, or only on Thursdays? Focus on visit patterns: how many times does he come to the site before he buys anything? Learn the page navigation, i.e. does he see 100 pages every time he comes, or does he know exactly what to look at? Master the demand source. Anyone who's worked on the marketing side knows that attribution is a holy war. Deciding which demand source gets the credit for a conversion is something people will argue to death, just like the IDE war between Vim, Emacs, IntelliJ and Eclipse :-)
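To make the attribution argument concrete, here is a small sketch contrasting two common crediting rules, first-touch and last-touch. These two rules are standard industry conventions chosen for illustration; the talk does not say which rule Orbitz uses, and the journey data and source names below are invented.

```python
from collections import Counter


def attribute(journeys, rule="last"):
    """Credit each converting journey to a single demand source.

    journeys: iterable of (touches, converted), where touches is the ordered
    list of demand sources seen before the visit ended.
    rule: "last" credits the final touch, anything else credits the first.
    """
    credit = Counter()
    for touches, converted in journeys:
        if not converted or not touches:
            continue
        source = touches[-1] if rule == "last" else touches[0]
        credit[source] += 1
    return credit


# Hypothetical journeys, invented for illustration.
journeys = [
    (["search", "email", "direct"], True),
    (["email", "search"], True),
    (["display"], False),
]

last_touch = attribute(journeys, "last")
first_touch = attribute(journeys, "first")
print("last touch:", dict(last_touch))
print("first touch:", dict(first_touch))
```

The same journeys give different winners under each rule, which is exactly why the choice is contested. The length of each `touches` list also answers the visit-pattern question of how many times someone came to the site before buying.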
We realized that with all the challenges we had, we had to innovate and experiment with new ways to enable successful web analytics at OWW. We generate hundreds of GB of log data per day. How can we effectively store this massive data, and how can we mine it and make sense of it? Our existing DW was not intended to store such large data sets or, more importantly, to process them. We also needed to make sure that we don't spend huge amounts of money to store this data. Big data infrastructure with Hadoop has been a huge success at Orbitz and at other organizations. So what does this buy us? We can now store data for a long period of time without worrying too much about space. Analysts and developers have access to this data set. Developers can run ad hoc queries to support our business needs, while the core web analytics team focuses on the company standards and metrics.
Here is an example of how we process our site analytics data today. We FTP the log files into our Hadoop infrastructure daily. The files are LZO-compressed for better storage utilization. Developers then write MapReduce jobs against these raw log files to output data into Hive tables. Hive is the data warehouse equivalent for Hadoop. Most of the MR jobs are written in Java and scripting languages such as Python, Ruby and Bash. Business teams, however, have the skill set to run queries against Hive tables.
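The shape of such a job can be sketched locally in a few lines of Python. This is an illustrative simulation of the map, shuffle/sort and reduce phases, not the production code: the three-field tab-separated log layout is hypothetical, LZO decompression is left out, and the output rows assume a Hive table declared with tab-delimited fields.

```python
from itertools import groupby


def map_line(line):
    """Map step: emit ((date, url), 1) for a well-formed log line."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 3:
        return []  # skip malformed lines
    timestamp, _session, url = parts[:3]
    return [((timestamp[:10], url), 1)]  # first 10 chars: YYYY-MM-DD


def reduce_counts(pairs):
    """Reduce step: sum the counts per key; pairs must be sorted by key."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def run_job(lines):
    """Simulate map, shuffle/sort and reduce; return tab-delimited rows."""
    mapped = [kv for line in lines for kv in map_line(line)]
    mapped.sort(key=lambda kv: kv[0])  # stands in for Hadoop's shuffle/sort
    return [
        "\t".join((date, url, str(total)))
        for (date, url), total in reduce_counts(mapped)
    ]


# Hypothetical log lines, invented for illustration.
rows = run_job([
    "2011-09-14T10:00:00\ts1\t/hotels",
    "2011-09-14T10:05:00\ts2\t/hotels",
    "2011-09-14T11:00:00\ts1\t/flights",
])
print("\n".join(rows))
```

Once rows like these land in a table's storage location, business users can query them with ordinary Hive SQL without touching the MapReduce code.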
Since the big data market is not that mature, there are no good ways to build visualizations on top of Hive. Due to this and for other reasons, we need to bring a subset of this data into our warehouse. So in essence the data that is in Hive will make it into the warehouse. There are companies such as Karmasphere and Datameer that are in the initial stages of bridging the gap between business needs and Hadoop access. However, it's too early to say if this will be the norm.
We focused on some key areas of our business, such as demand source and campaigns, as our pilot, and worked with our business partners to enable analytics on big data. We have developers writing MapReduce jobs which run every day and populate Hive tables. We generate more than 25 million records a month for the pilot use case that we worked on. This alone showcases the sheer magnitude and power of analytics within the big data framework.
So if you have read Avinash Kaushik’s book and follow his blog “Occam’s Razor”, then you know the two terms he always mentions: data puke and gold (insights). Here we have a nice depiction of all kinds of insights provided in a nice dashboard format to our business users. These insights were only made possible by the data that we housed in and extracted from Hadoop. Obviously I couldn’t share what these graphs mean without giving more details.
So how do you organizationally structure yourself and big data so that you can be effective, both in terms of resource utilization and in setting the platform for success? This is what we call Centralized Decentralization. With this approach, the core web analytics team controls and supports the individual teams when it comes to data extraction and modeling. This prevents one team from being the bottleneck for data extraction and analytics. If you have ever worked on the data warehouse side of the world, you will know the challenges and delays in getting the data.
With the core process of centralized decentralization and being agile, how do you succeed? You can't manage if you can't measure. But once you measure, make sure you fail fast. Every team needs to be thinking of analytics with every feature they work on. Dimensional modeling is great, but like someone wise said, "All models are wrong but some are useful" :-) My point here is that data without analysis is like a Ferrari without gas. If you make it a point to extract smaller chunks of data and tie this effort to your business objectives, you are sure to succeed.
Here are some key learnings from our experience and some thoughts for you to consider. If you have the strength of technology, go for it. This needs heavy investment from a time and resource perspective. Like I mentioned many times, data without analysis is worthless.
Thanks again for listening to our story. We are available for any further questions you may have. Also, if you know anyone who is interested in working at Orbitz, please check out the career site.