Big data analysis concepts and references

  1. Big Data Analysis
     Author: Vikram Andem, ISRM & IT GRC Conference
     Big Data Analysis: Concepts and References, with Use Cases in the Airline Industry
  2. The objective of this presentation is to familiarize a general business or management user with the terms and terminology of Big Data Analysis, and to provide references to use cases that can be (or currently are) applied in the airline industry.

     The presentation is intended to help a business or management user think through how to formulate an analytical question for Big Data Analysis, given a business situation or problem. It may also provide insight into the basic terms and concepts one needs to know, what to ask, how to evaluate and/or help solve a business problem as a potential Big Data Analysis use case, and what to expect from the work of a competent Data Scientist on such a use case.

     NOTE: Just reviewing this presentation will most likely NOT make you competent enough to instantly perform Big Data Analysis. Big Data Analysis is a new (very recent) aspect of Data Science and requires some college or university level course work in fields such as (but not limited to) mathematics, statistics, computer science, management science, econometrics, and engineering.

     The presentation is divided into three parts, following a separate presentation on Big Data Security & Governance, Risk Management & Compliance:
     Part 1. Big Data: Introduction (page 3)
     Part 2. Very quick introduction to understanding data and analysis of data (page 8). Beginner: if you are new to understanding data and the use of data, you should start here.
     Part 3. Big Data Analysis: Concepts and References to Use Cases in the Airline Industry (page 17). Advanced: if you understand data and how to use data, you may jump to this part.
  3. Part 1. Big Data: Introduction
     You may skip this section if you are familiar with Big Data and jump directly to Part 2 (page 8).
  4. Introduction

     [Figure: projected growth and use of unstructured vs. structured data, 2012 through 2020, in zettabytes. 1 zettabyte = 1,000 exabytes = 1 million petabytes = 1 billion terabytes. ¹ Source: 2013 IEEE Big Data conference; projected growth of data combined for all Fortune 500 companies only.]

     Big Data is a general term for the large, voluminous amounts (at least terabytes) of poly-structured data gleaned from traditional and non-traditional sources that continuously flows through and around organizations, including but not limited to e-mail, text, event logs, audio, video, blogs, social media, and transactional records.

     What does this information hold? What is the challenge in extracting it? It holds the promise of giving enterprises like United deeper insight into their customers, partners, and business. This data can provide answers to questions they may not have even thought to ask. Companies like United can benefit from a multidimensional view of their business when they add insight from big data to the traditional types of information they collect and analyze.

     [Figure: "long tail" of demand, number of results vs. popularity rank. The traditional EDW and classical statistics cover the short head (transactional data, e.g. reservations); Big Data covers specific spikes in the long tail (non-transactional and raw data, e.g. searches and event logs).]

     Limitations of the existing data analytics architecture (BI reports and interactive apps fed by an RDBMS of aggregated data via ETL over a compute grid, with the original raw data kept in a storage-only, mostly-append grid):
     - Limit #1: Moving data to compute doesn't scale.
     - Limit #2: Can't explore high-fidelity raw data.
     - Limit #3: Archiving = premature data death.

     Data Storage and Analysis
     The challenge of extracting value from big data is similar in many ways to the age-old problem of distilling business intelligence from transactional data. At the heart of this challenge is the process used to extract data from multiple sources, transform it to fit your analytical needs, and load it into a data warehouse for subsequent analysis, a process known as "Extract, Transform & Load" (ETL). The nature of big data requires that the infrastructure for this process scale cost-effectively.

     While the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. A typical drive from 1990 could store 1,370 MB of data with a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. That is a long time to read everything on a single drive, and writing is even slower.
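The drive-throughput figures above can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted in the text (1,370 MB at 4.4 MB/s for the 1990 drive, 1 TB at 100 MB/s for a modern one):

```python
# Worked example of the disk read-time arithmetic from the slide above.
def read_time_minutes(capacity_mb: float, speed_mb_per_s: float) -> float:
    """Time to read a full drive end to end, in minutes."""
    return capacity_mb / speed_mb_per_s / 60

t_1990 = read_time_minutes(1370, 4.4)        # ~5 minutes, as the text says
t_now = read_time_minutes(1_000_000, 100)    # 1 TB = 1,000,000 MB
print(f"1990 drive: {t_1990:.1f} min; modern 1 TB drive: {t_now / 60:.1f} h")
```

The modern drive comes out to roughly 2.8 hours, matching the "more than two and a half hours" claim.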
  5. Hadoop

     Apache Hadoop is a scalable, fault-tolerant distributed system for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop can store exabytes of unstructured and semi-structured data reliably on tens of thousands of general-purpose servers, scaling performance cost-effectively by merely adding inexpensive nodes to the cluster. Using Hadoop in this way, an organization like United gains the additional ability to store and access data that it "might" need, data that may never be loaded into the data warehouse. The key benefit: agility/flexibility.

     Schema-on-Write (RDBMS):
     - Schema must be created before any data can be loaded.
     - An explicit load operation transforms the data into the database's internal structure.
     - New columns must be added explicitly before data for those columns can be loaded.
     - Pros: read is fast; standards/governance.
     - Use when: interactive OLAP analytics (<1 sec); multistep ACID transactions; 100% SQL compliance.

     Schema-on-Read (Hadoop):
     - Data is simply copied to the file store; no transformation is needed.
     - A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding).
     - New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
     - Pros: load is fast; flexibility/agility.
     - Use when: structured or not (flexibility); scalability of storage/compute; complex data processing.

     Hadoop Architecture
     Central to the scalability of Hadoop is the distributed processing framework known as MapReduce, which splits the input data set into multiple chunks, each of which is assigned a map task that can process the data in parallel. Each map task reads its input as a set of (key, value) pairs and produces a transformed set of (key, value) pairs as output. The framework shuffles and sorts the outputs of the map tasks, sending the intermediate (key, value) pairs to the reduce tasks, which group them into final results. MapReduce uses Job Tracker and Task Tracker mechanisms to schedule tasks, monitor them, and restart any that fail.

     The Hadoop Distributed File System (HDFS) is designed for scalability and fault tolerance. HDFS stores large files by dividing them into blocks (usually 64 or 128 MB) and replicating the blocks on three or more servers. HDFS provides APIs for MapReduce applications to read and write data in parallel. Capacity and performance can be scaled by adding Data Nodes, and a single NameNode mechanism manages data placement and monitors server availability. HDFS clusters in production use today reliably hold petabytes of data on thousands of nodes.

     ² Pictures source: Intel white paper on Big Data analytics.
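The map / shuffle-and-sort / reduce flow described above can be sketched in a single Python process. This is an illustrative toy (word count, the classic MapReduce example), not Hadoop's actual API; real Hadoop distributes these phases across a cluster:

```python
# Single-process sketch of the MapReduce data flow: map tasks emit
# (key, value) pairs, the framework groups them by key (shuffle), and
# reduce tasks fold each group into a final result.
from collections import defaultdict

def map_task(line):                 # map: one input line -> [(word, 1), ...]
    return [(word, 1) for word in line.split()]

def shuffle(mapped_pairs):          # group intermediate pairs by key
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):       # reduce: (word, [1, 1, ...]) -> (word, n)
    return key, sum(values)

lines = ["big data analysis", "big data security"]
mapped = [pair for line in lines for pair in map_task(line)]
counts = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'analysis': 1, 'security': 1}
```

Because each map task touches only its own chunk of input and each reduce task only its own key group, both phases parallelize naturally across machines, which is exactly what makes the framework scale.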
  6. Use Cases

     - Data storage: collect and store unstructured data in a fault-resilient, scalable data store that can be organized and sorted for indexing and analysis.
     - Analytics: query in real time, at the speed of thought, on petabyte-scale unstructured and semi-structured data using HBase and Hive.
     - Batch processing of unstructured data: batch-process (index, analyze, etc.) tens to hundreds of petabytes of unstructured and semi-structured data.
     - Data archive: medium-term (12-36 months) archival of data from the EDW/DBMS to meet data retention policies.
     - Integration with EDW: extract, transfer, and load data in and out of Hadoop into a separate DBMS for advanced analytics.
     - Search and predictive analytics: crawl, extract, index, and transform structured and unstructured data for search and predictive analytics.

     Common Patterns of Hadoop Use
     - Pattern #1: Hadoop as a data refinery. Collect data and apply a known algorithm within a trusted operational process. (1) Capture all data, from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media); (2) Process: parse, cleanse, apply structure, and transform; (3) Exchange: push to the existing data warehouse for use with existing analytic tools.
     - Pattern #2: Data exploration with Hadoop. Collect data and perform iterative investigation for value. (1) Capture all data; (2) Process: parse, cleanse, apply structure, and transform; (3) Exchange: explore and visualize with analytics tools that support Hadoop.
     - Pattern #3: Application enrichment with Hadoop. Collect data, analyze it, and present salient results to online apps. (1) Capture all data; (2) Process: parse, cleanse, apply structure, and transform; (3) Exchange: incorporate data directly into applications.

     Top 5 general usages: obtaining a 360-degree view of customers; operations analytics; data warehouse augmentation; social media; big data exploration.

     How is the airline industry using Hadoop?³
     Top 5 airline usages: capturing sensor data to optimize maintenance; forecasting the weather to optimize fuel loads; identifying and capturing the demand signal (competitive offerings, travel partner feeds); loyalty and promotions; webpage visits and log storage.

     Prospective CTO questions:
     - When is the best time of day / day of week / time of year to fly to minimize delays?
     - Do older planes suffer more delays?
     - How does the number of people flying between different locations change over time?
     - How well does weather predict plane delays?
     - Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?

     Industry examples: Southwest Airlines uses a Hadoop-based solution for its Rapid Rewards loyalty program in customer service. American Airlines uses a Hadoop-based solution for clickstream, customer, kiosk, and data analytics. British Airways uses a Hadoop-based solution for forecasting and revenue management. Orbitz Worldwide uses a Hadoop-based solution for statistical analysis to identify the best possible promotions combining air travel with hotel stays.

     ³ Source: Internet (Google search).
  7. Cost Benefit Analysis

     Why is Hadoop a value proposition? [Figure: cost per terabyte vs. size of data. The cost of current, structured data management technologies rises steeply from gigabyte to petabyte scale, while the cost of Hadoop stays low; the gap is value captured by customers (like United) using Hadoop.]

     TCOD (Total Cost of Data), the cost of owning (and using!) data over time for analytic purposes, is a better metric than TCO (Total Cost of Ownership) for the cost-benefit analysis in this case. TCOD estimates what a company like United will really spend to reach its business goal; the focus is on total cost, not just platform cost. Here the TCOD comparison is made between an EDW platform/appliance and Hadoop for the same amount of (raw or unstructured) data. The cost comparison and benefits are based on the underlying "data management" requirements.⁴ Software development/maintenance cost components: system cost, admin cost, ETL (extract, transform and load: preparing data for analytic use), applications, queries, analytics.

     Project A: emphasis on "data refining" requirements
     1. Hundreds of TB of data per week; 500 TB of data.
     2. Raw data life: a few hours to a few days.
     3. Challenge: find the important events or trends.
     4. Analyze the raw data once or a few times.
     5. When analyzing, read entire files.
     6. Keep only the significant data.

                               EDW Appliance   Hadoop
     Total System Cost         $23M            $1.3M
     System and Data Admin     $2M             $2M
     Application Development   $5M             $6M
     Total Cost of Data        $30M            $9.3M

     The cost equation is favorable to Hadoop for data refining, data landing, and archival requirements.

     Project B: emphasis on "EDW" requirements
     1. Data volume 500 TB to start; all must be retained for at least five years.
     2. Continual growth of data and workload.
     3. Data sources: thousands.
     4. Data sources change their feeds frequently.
     5. Challenges: data must be correct and data must be integrated.
     6. Typical enterprise data lifetime: decades.
     7. Analytic application lifetime: years.
     8. Millions of data users.
     9. Hundreds of analytic applications.
     10. Thousands of one-time analyses.
     11. Tens of thousands of complex queries.

                               EDW Appliance   Hadoop
     Total System Cost         $45M            $5M
     System and Data Admin     $50M            $100M
     Application Development   $40M            $300M
     ETL                       $60M            $100M
     Complex Queries           $40M            $80M
     Analysis                  $30M            $70M
     Total Cost of Data        $265M           $655M

     The cost equation is favorable to the EDW for data warehouse appliance requirements.

     Cost comparison conclusions:
     - Each technology has large advantages in its sweet spot(s).
     - Neither platform is cost-effective in the other's sweet spot.
     - The biggest differences for the data warehouse are the development costs for complex queries and analytics.
     - Total cost is extremely sensitive to technology choice.
     - Analytic architectures will require both Hadoop and data warehouse platforms.
     - Focus on total cost, not platform cost, in making your choice for a particular application or use.
     - Many analytic processes will use both Hadoop and EDW technology, so integration cost also counts!

     Hadoop is ideal for storage of: data which is rarely needed; data which can grow rapidly, or very large; data for which it is uncertain how it will be needed in the future; data which may or may not have structure; data which may require ETL and analysis sometime in the future but just needs to be stored now, for some unknown use.

     ⁴ Source for the TCOD comparison: "The Real Cost of Big Data Spreadsheet", Winter Corp (www.wintercorp.com/tcod).
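The "total cost, not platform cost" point is just a sum over the cost components. A small sketch using the Project A figures quoted above (in $M) makes it concrete; the dictionary layout is my own, only the numbers come from the slide:

```python
# Recompute the Project A ("data refining") TCOD totals from the
# component costs above. TCOD is the sum of all components, which is
# why it can differ so sharply from the platform (system) cost alone.
project_a = {
    "EDW Appliance": {"system": 23.0, "admin": 2.0, "app_dev": 5.0},
    "Hadoop":        {"system": 1.3,  "admin": 2.0, "app_dev": 6.0},
}

tcod = {platform: sum(costs.values()) for platform, costs in project_a.items()}
print(tcod)  # {'EDW Appliance': 30.0, 'Hadoop': 9.3}
```

Note that on system cost alone the ratio is nearly 18:1 in Hadoop's favor, but once admin and development are included the TCOD ratio narrows to about 3:1, which is the spirit of the TCOD argument.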
  8. Part 2. Very quick introduction to understanding data and analysis of data
     Start here if you are new to understanding data or do not know how to analyze data.
  9. Introduction to Data

     Data Analysis Benefits
     - Descriptive analytics ("What happened?"): formulates and analyzes historical data; pattern recognition from samples for reporting of trends. Benefit: medium.
     - Predictive analytics ("What could happen?"): finds associations in data not readily apparent with customary analysis; forecasts future probabilities and trends. Benefit: high.
     - Prescriptive analytics ("What is the best action/outcome?"): targets business constraints; assesses and determines new ways to operate. Benefit: very high.

     Myth: I have large sets of data on the Hadoop file system; running powerful analytical tools (e.g., R, SAS, Tableau) on the Hadoop infrastructure will perform all the data analysis work for me and deliver useful information.
     Fact: The data by itself may not contain the answer, big or small; you need the right data. The combination of data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. Performing data analysis with an understanding (and application) of data science principles, correctly framing the analytical problem (with correct data sampling/collection methods), and using appropriate analytical tools will most likely provide useful information for statistical/analytical inference.

     Data facts:
     - Data is just data: it does not live or die, it does not offer truth or lie, and it is neither inherently large nor small. Data has always existed as both big data and small data.

     Data is the second most important thing:
     - The most important thing in data science is the question.
     - The second most important is the data.
     - Often the data will limit or enable the questions.
     - But having data is useless if you don't have an appropriate question.

     "Data are values of qualitative or quantitative variables, belonging to a set of items." (Source: Wikipedia)
  10. Data Types: Basics

      Types of data variables
      - Numerical (quantitative) variables take on numerical values. It is sensible to add, subtract, take averages, etc., with these values.
        - Continuous numerical variables are measured, and can take on any numerical value.
        - Discrete numerical variables are counted, and can take on only whole non-negative numbers.
      - Categorical (qualitative) variables take on a limited number of distinct categories. These categories can be identified with numbers, but it would not be sensible to do arithmetic operations with those values.
        - Categorical variables whose levels have an inherent ordering are called ordinal. Think of a flight survey question asking how satisfied you are with the customer service you received, with the options very unsatisfied, unsatisfied, neutral, satisfied, and very satisfied: these levels have an inherent ordering, so the variable is ordinal.
        - If the levels of a categorical variable have no inherent ordering, the variable is simply (regular) categorical (e.g., do you prefer a morning flight or an evening flight?).

      Examples:
      - gender: categorical
      - sleep: numerical, continuous
      - bedtime: categorical, ordinal
      - countries: numerical, discrete
      - dread: categorical, ordinal (could also be used as numerical)

      Observations, Variables and Data Matrices
      Data are organized in what we call a data matrix, where each row represents an observation (or a case), and each column represents a variable.

      City            no_flights   %_ontime   ...   region
      Chicago         350          90         ...   Midwest
      Houston         330          96         ...   South
      ...             ...          ...        ...   ...
      Newark          306          92         ...   Northeast
      San Francisco   310          93         ...   West

      - The first variable, City, is an identifier variable: the name of the city United serves for which the data are gathered.
      - no_flights (the number of flights served by United daily) is a discrete numerical variable.
      - %_ontime (the percentage of United flights that operated on time, on arrival or departure) is a continuous numerical variable, as it can take on any value between 0 and 100, even though the values shown here are rounded to whole numbers.
      - region (where the city is located in the USA, as designated by the US census: Northeast, Midwest, South, West) is a categorical variable.

      Relationships between variables
      - When two variables show some connection with one another, they are called associated, or dependent, variables. The association can be further described as positive or negative.
      - If two variables are not associated, they are said to be independent.
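The data-matrix idea above can be sketched directly: each row is an observation (a city) and each column a variable, with the variable types determining which operations make sense. The rows echo the example values in the slide; the list-of-dicts layout is just one convenient representation:

```python
# A tiny data matrix: rows are observations, columns are variables.
data_matrix = [
    # city (identifier), no_flights (discrete numerical),
    # pct_ontime (continuous numerical), region (categorical)
    {"city": "Chicago",       "no_flights": 350, "pct_ontime": 90, "region": "Midwest"},
    {"city": "Houston",       "no_flights": 330, "pct_ontime": 96, "region": "South"},
    {"city": "Newark",        "no_flights": 306, "pct_ontime": 92, "region": "Northeast"},
    {"city": "San Francisco", "no_flights": 310, "pct_ontime": 93, "region": "West"},
]

# Numerical columns support arithmetic (e.g., an average) ...
avg_ontime = sum(row["pct_ontime"] for row in data_matrix) / len(data_matrix)
# ... while categorical columns only support counting distinct levels.
regions = {row["region"] for row in data_matrix}
print(avg_ontime, sorted(regions))
```

Averaging `pct_ontime` is meaningful; "averaging" `region` is not, which is exactly the numerical/categorical distinction the slide draws.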
  11. Data: Observational Studies and Experiments

      Observational study
      - In an observational study you collect data in a way that does not directly interfere with how the data arise, i.e. you merely "observe".
      - We can only establish an association (or correlation) between the explanatory and response variables.
      - If an observational study uses data from the past, it is called a retrospective study; if data are collected throughout the study, it is called prospective.

      Experiment
      - In an experiment, you randomly assign subjects to various treatments, and can therefore establish a causal connection between the explanatory and response variables.

      Example: Suppose you want to evaluate the relationship between regularly working out and energy level. We can design the study as an observational study or as an experiment.
      - In an observational study, we sample two types of people from the population, those who choose to work out regularly and those who don't, then find the average energy level for the two groups and compare.
      - In an experiment, we sample a group of people from the population and then randomly assign them to two groups: those who will work out regularly throughout the course of the study and those who will not. The difference is that the decision of whether to work out is not left to the subjects, as in the observational study, but is imposed by the researcher. At the end we compare the average energy levels of the two groups.
      - Based on the observational study, even if we find a difference between the energy levels of the two groups, we really can't attribute the difference solely to working out, because there may be other variables we did not control for that contribute to the observed difference. For example, people who are in better shape might be more likely to work out and also to have high energy levels.
      - In the experiment, however, variables likely to contribute to the outcome are equally represented in the two groups due to random assignment. Therefore, if we find a difference between the two averages, we can indeed make a causal statement attributing the difference to working out.

      Question: What type of study is this, an observational study or an experiment? "Girls who regularly ate breakfast, particularly one that includes cereal, were slimmer than those who skipped the morning meal, according to a study that tracked nearly 2,400 girls for 10 years. [...] As part of the survey, the girls were asked once a year what they had eaten during the previous three days."
      Answer: This is an observational study, since the researchers merely observed the behavior of the girls (subjects) as opposed to imposing treatments on them. The study concludes there is an association between girls eating breakfast and being slimmer.

      Three possible explanations:
      1. Eating breakfast causes girls to be thinner.
      2. Being thin causes girls to eat breakfast.
      3. A third variable is responsible for both. What could it be? An extraneous variable that affects both the explanatory and the response variable, and makes it seem like there is a relationship between the two, is called a confounding variable.

      - Whether we can infer correlation or causation depends on the type of study we are basing our decision on. Observational studies for the most part only allow us to make correlation statements, while experiments allow us to infer causation.
      - Correlation does not imply causation.
  12. Data Sampling and Sources of Bias

      Data sampling
      - Think about sampling something that you are cooking: you taste (examine) a small part of what you're cooking to get an idea about the dish as a whole.
      - When you taste a spoonful of soup and decide the spoonful you tasted isn't salty enough, that's exploratory analysis of the sample at hand.
      - If you generalize and conclude that your entire soup needs salt, that's making an inference.
      - For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population).
      - If your spoonful comes only from the surface and the salt has collected at the bottom of the pot, what you tasted is probably not representative of the whole pot. On the other hand, if you first stir the soup thoroughly before you taste, your spoonful will more likely be representative of the whole pot.

      Sources of sampling bias
      - Convenience sample bias: individuals who are easily accessible are more likely to be included in the sample. Example: say you want to find out how people in your city feel about a recent increase in public transportation costs. If you only poll people in your neighborhood, rather than a representative sample from the entire city, your study will suffer from convenience sample bias.
      - Voluntary response bias: occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue; such a sample will also not be representative of the entire population. Example: say you place polling machines at all bus stops and metro stations in your city, but only those hurt by the price increase choose to take the time to vote and express their opinion on the recent increase in public transportation fares. The people who respond to such a poll are not representative of the entire population.
      - Non-response bias: if only a (non-random) fraction of the randomly sampled people choose to respond to a survey, the sample is no longer representative of the entire population. Example: say you take a random sample of individuals from your city and attempt to survey them, but certain segments of the population, say those of lower socio-economic status, are less likely to respond; then the survey is not representative of the entire population.

      Sampling bias, a historical example: Landon vs. FDR
      In 1936, Landon sought the Republican presidential nomination, opposing the re-election of FDR. A popular magazine of the time, the Literary Digest, polled about 10 million Americans and got responses from about 2.4 million. To put things in perspective, reliable polls in the USA nowadays poll about 1,500 to 3,000 people, so the 10 million poll was a very large sample. The poll showed that Landon would likely be the overwhelming winner and that FDR would get only 43% of the vote. Election result: FDR won, with 62% of the vote.

      What went wrong with the Literary Digest poll? The magazine had surveyed its own readers, registered automobile owners, and registered telephone users. These groups had incomes well above the national average of the day (remember, this was the Great Depression era), which produced lists of voters far more likely to support Republicans than a truly typical voter of the time; i.e., the sample was not representative of the American population at the time. The Literary Digest election poll was based on a sample of 2.4 million, which is huge, but since the sample was biased, it did not yield an accurate prediction.

      Back to the soup analogy: if the soup is not well stirred, it doesn't matter how large a spoon you have, it will still not taste right. If the soup is well stirred, a small spoon will suffice to test the soup.
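The Literary Digest lesson, that a biased frame beats a big sample, is easy to demonstrate with a toy simulation. The population proportions below are invented for illustration (they are not the 1936 election data): overall support for candidate A is 70%, but one easily reached subgroup leans the other way.

```python
# Toy simulation: a huge sample from a biased frame vs. a small random
# sample from the whole population. Numbers are invented for illustration.
import random

random.seed(42)
# 24,000 non-owners (~83% support A), then 6,000 "phone owners" (~17% A).
# Overall: 21,000 of 30,000 support A, i.e. a true share of 0.70.
population = (["A"] * 20_000 + ["B"] * 4_000
              + ["A"] * 1_000 + ["B"] * 5_000)

def support_a(sample):
    return sum(v == "A" for v in sample) / len(sample)

true_share = support_a(population)                              # 0.70
biased = support_a(random.sample(population[24_000:], 3_000))   # owners only
fair = support_a(random.sample(population, 1_500))              # whole population
print(true_share, biased, fair)
```

The huge biased sample lands far from 0.70, while the much smaller random sample lands close to it: a well-stirred small spoonful beats a large spoonful from an unstirred pot.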
  13. Data Sampling Methods & Experimental Design

      Obtaining good samples
      - Almost all statistical methods are based on the notion of implied randomness.
      - If observational data are not collected in a random framework from a population, the estimates produced by these statistical methods, and the errors associated with those estimates, are not reliable.
      - The most commonly used random sampling techniques are simple, stratified, and cluster sampling.
        - Simple random sample: randomly select cases from the population, with no implied connection between the points that are selected.
        - Stratified sample: strata are made up of similar observations; we take a simple random sample from each stratum.
        - Cluster sample: clusters are usually not made up of homogeneous observations; we take a simple random sample from a random sample of clusters. Usually preferred for economical reasons.

      Experimental design concepts
      - Control: compare the treatment of interest to a control group.
      - Randomize: randomly assign subjects to treatments, and randomly sample from the population whenever possible.
      - Replicate: within a study, replicate by collecting a sufficiently large sample, or replicate the entire study.
      - Block: if there are variables known or suspected to affect the response variable, first group subjects into blocks based on these variables, then randomize cases within each block to treatment groups.

      Blocking example: we would like to design an experiment to investigate whether energy gels make you run faster (treatment: energy gel; control: no energy gel). It is suspected that energy gels might affect pro and amateur athletes differently, so we block for pro status: divide the sample into pro and amateur, randomly assign pro athletes to treatment and control groups, and randomly assign amateur athletes to treatment and control groups. Pro/amateur status is then equally represented in the resulting treatment and control groups. (Note the distinction between random assignment, used within experiments, and random sampling, used to obtain representative samples.)
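Simple vs. stratified sampling can be sketched with the standard library. The strata here are the pro/amateur blocks from the example above; the athlete names are hypothetical placeholders:

```python
# Simple random sampling vs. stratified sampling (pro/amateur strata).
import random

random.seed(0)
athletes = {"pro": [f"pro_{i}" for i in range(20)],
            "amateur": [f"am_{i}" for i in range(80)]}

# Simple random sample: draw from the pooled population; by chance the
# small "pro" stratum could be over- or under-represented.
pooled = athletes["pro"] + athletes["amateur"]
simple = random.sample(pooled, 10)

# Stratified sample: a simple random sample inside each stratum,
# proportional to its size, so both groups are guaranteed representation
# (here 10%: 2 pros and 8 amateurs).
stratified = [name for members in athletes.values()
              for name in random.sample(members, len(members) // 10)]
print(len(stratified), stratified[:3])
```

The stratified draw always contains exactly 2 pros and 8 amateurs, whereas the simple draw only does so on average; this guaranteed representation is the point of stratifying.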
  14. Hypothesis Testing

      Two competing claims (example: gender discrimination in promotions)
      - Claim 1: "There is nothing going on." Promotion and gender are independent; there is no gender discrimination; the observed difference in proportions is simply due to chance. This is the null hypothesis.
      - Claim 2: "There is something going on." Promotion and gender are dependent; there is gender discrimination; the observed difference in proportions is not due to chance. This is the alternative hypothesis.

      A court trial as a hypothesis test. Hypothesis testing is very much like a court trial:
      - H0: the defendant is innocent; HA: the defendant is guilty.
      - We then present the evidence: collect data.
      - Then we judge the evidence: "Could these data plausibly have happened by chance if the null hypothesis were true?" If they were very unlikely to have occurred, then the evidence raises more than a reasonable doubt in our minds about the null hypothesis.
      - Ultimately we must make a decision: how unlikely is unlikely?
      - If the evidence is not strong enough to reject the assumption of innocence, the jury returns a verdict of "not guilty". The jury does not say the defendant is innocent, just that there is not enough evidence to convict; the defendant may, in fact, be innocent, but the jury has no way of being sure. Said statistically, we fail to reject the null hypothesis. We never declare the null hypothesis to be true, because we simply do not know whether it is true or not; therefore we never "accept the null hypothesis".
      - In a trial, the burden of proof is on the prosecution. In a hypothesis test, the burden of proof is on the unusual claim: the null hypothesis is the ordinary state of affairs (the status quo), so it is the alternative hypothesis that we consider unusual and for which we must gather evidence.
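The promotion-and-gender test above can be run as a simulation (a permutation test). Under H0, promotion is independent of gender, so shuffling the gender labels should produce differences in promotion rates as large as the observed one reasonably often. The counts below are hypothetical, chosen only to illustrate the mechanics:

```python
# Simulation-based (permutation) hypothesis test for the promotion example.
# Hypothetical data: 24 "male" files (21 promoted), 24 "female" (14 promoted).
import random

random.seed(1)
promoted = [1] * 21 + [0] * 3 + [1] * 14 + [0] * 10   # first 24 = male
observed_diff = 21 / 24 - 14 / 24                     # ~0.292

def shuffled_diff(outcomes):
    """Difference in promotion rates after randomly relabeling gender (H0)."""
    shuffled = random.sample(outcomes, len(outcomes))
    return sum(shuffled[:24]) / 24 - sum(shuffled[24:]) / 24

sims = [shuffled_diff(promoted) for _ in range(10_000)]
p_value = sum(d >= observed_diff for d in sims) / len(sims)
print(f"observed diff {observed_diff:.3f}, one-sided p ~ {p_value:.3f}")
```

If the p-value is small (conventionally below 0.05), a difference this large would rarely arise by chance alone, and we reject H0; otherwise we fail to reject it, exactly as in the court-trial analogy.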
  15. 15. Big Data Analysis 15Author: Vikram Andem ISRM & IT GRC Conference Statistical Inference and Prediction Statistical Inference Statistical inference is the process of drawing conclusions from data that are subject to random variation, for example, observational errors or sampling variation. Initial requirements of such a system of procedures for inference and induction are that the system should produce reasonable answers when applied to well-defined situations and that it should be general enough to be applied across a range of situations. Inferential statistics are used to test hypotheses and make estimations using sample data. Confidence Interval Outcome of statistical inference may be an answer to the question "what should be done next?", where this might be a decision about making further experiments or surveys, or about drawing a conclusion before implementing some organizational or governmental policy. A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. If independent samples are taken repeatedly from the same population, and a confidence interval calculated for each sample, then a certain percentage (confidence level) of the intervals will include the unknown population parameter. Confidence intervals are usually calculated so that this percentage is 95%, but we can produce 90%, 99%, 99.9% (or whatever) confidence intervals for the unknown parameter. Confidence Level Confidence level is the probability value (1-alpha) associated with a confidence interval. It is often expressed as a percentage. For example, say alpha = 0.05 = 5%, then the confidence level is equal to (1-0.05) = 0.95, i.e. a 95% confidence level. Prediction In statistics, prediction is the process of determining the magnitude of statistical variates at some future point of time. 
In a data analysis context the word may also occur in slightly different meanings; e.g. in a regression equation expressing a dependent variate y in terms of independent x’s, the value given for y by specified values of x’s is called the “predicted” value even when no temporal element is involved. Prediction vs. Inference  Using data to predict an event that has yet to occur is statistical prediction. Inferring the value of a population quantity such as the average income of a country or the proportion of eligible voters who say they will vote ‘yes’ is statistical inference.  Prediction and inference answer different types of data analysis questions. Examples of predictions (because the events have not occurred at the time of writing this content):  The probability that Chicago Bulls will win the 2018 NBA playoffs is __.  The probability that Republican Party will win the 2020 Presidential election is __. Examples of inferences (because the questions involve estimating a population value):  The proportion of NBA fans that currently believe Chicago Bulls will win the 2018 playoffs is __.  The proportion of eligible voters that currently state they will vote for Republican Party in the 2020 Presidential election is __.
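A confidence interval for a mean can be computed directly from sample data. The delay figures below are simulated stand-ins, not real airline numbers:

```python
import math
import random
import statistics

random.seed(0)

# Simulated stand-in data: 100 observed departure delays in minutes.
sample = [random.gauss(12, 5) for _ in range(100)]
n = len(sample)
mean = statistics.fmean(sample)
se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

# 95% confidence interval using the normal critical value 1.96 (a
# reasonable approximation for n = 100; use a t value for small samples).
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(round(ci[0], 2), "to", round(ci[1], 2))
```

Repeating this procedure on fresh samples from the same population would, in the long run, produce intervals that cover the true mean about 95% of the time.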
  16. 16. Big Data Analysis 16Author: Vikram Andem ISRM & IT GRC Conference Before you proceed to Part 3 : Please quickly review the Appendix section to familiarize with terms and terminology that will be used in the rest of the presentation.
  17. 17. Big Data Analysis 17Author: Vikram Andem ISRM & IT GRC Conference Big Data Analysis  Concepts and References  Use Cases in Airline Industry Jump here directly if you are an advanced user who understands data and knows how to analyze data. Part 3
  18. 18. Big Data Analysis 18Author: Vikram Andem ISRM & IT GRC Conference Big Data Analysis: Concepts and Airline Industry Use Cases Data Analysis Machine Learning & Models derive useful analytical information so humans can better understand it. Examples:  Does spending more money on marketing & sales in area “X” vs. area “Y” make the company more profitable?  What does the customer want? (e.g., customer survey). Models allow machines (software programs & applications) to make “real-time” (auto) decisions. Examples:  Google search / Amazon product recommendations, Facebook news feed, etc.  Geographic (GPS based) specific advertisements or event (holiday, weather, traffic) based promotions. Big Data application areas
  19. 19. Big Data Analysis 19Author: Vikram Andem ISRM & IT GRC Conference Bayesian Approach and Bayes Rule Bayesian Approach Differences between Bayesians and Non-Bayesians Bayes Theorem The probability that an email message is spam, given the words in the email message, is given by Bayes’ rule: P(spam | words) = P(words | spam) × P(spam) / P(words). Example: Email Spam Filtering With Bayes: • A key benefit: The ability to incorporate prior knowledge • A key weakness: The need to incorporate prior knowledge
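Bayes' rule applied to spam filtering can be sketched as below. The word probabilities and the 0.2 prior are made up for illustration; a real filter estimates them from labeled training mail:

```python
# Minimal naive Bayes spam score via Bayes' rule. All probabilities below
# are invented for illustration.
p_spam = 0.2                       # prior P(spam)
p_word_given_spam = {"free": 0.30, "meeting": 0.02}
p_word_given_ham  = {"free": 0.01, "meeting": 0.10}

def p_spam_given_words(words):
    # Naive assumption: words are conditionally independent given the class.
    # The denominator P(words) is the sum of both class likelihoods.
    likelihood_spam = p_spam
    likelihood_ham = 1 - p_spam
    for w in words:
        likelihood_spam *= p_word_given_spam[w]
        likelihood_ham *= p_word_given_ham[w]
    return likelihood_spam / (likelihood_spam + likelihood_ham)

print(round(p_spam_given_words(["free"]), 3))     # "free" pushes toward spam
print(round(p_spam_given_words(["meeting"]), 3))  # "meeting" pushes toward ham
```

The prior is exactly where the benefit and weakness noted on the slide enter: a different assumed P(spam) shifts every posterior.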
  20. 20. Big Data Analysis 21Author: Vikram Andem ISRM & IT GRC Conference Bayesian Belief Network Representation of Airline Passenger Behavior Source: Booz Allen Hamilton Bayesian Belief Network Representation of Airline Passenger Behavior The basis of this slide is the paper titled Airline Analytics: Decision Analytics Center of Excellence by Cenk Tunasar, Ph.D., and Alex Cosmas of Booz Allen Hamilton. In the above-listed paper the authors state that Booz Allen used the Big Data infrastructure of an airline client and analyzed large datasets containing more than 3 years’ worth of passenger data, approximately 100 GB in total. Booz Allen generated hypotheses to test from the Big Data set including, but not limited to: Airline Market Performance • What are the client’s natural market types and their distinct attributes? • What is the client’s competitive market health? • Where does the client capture fare premiums or fare discounts relative to other carriers? Passenger Behavior • What is the variability of booking curves by market type? • What are the intrinsic attributes of markets with the highest earn and highest burn rates? • Can predictive modeling be developed for reservation changes and no-show rates for individual passengers on individual itineraries? Consumer Choice • What is the demand impact of increasing connection time? • What is the effect of direct versus connecting itineraries on passenger preference? A use case in Airline industry (URL: http://www.boozallen.com/media/file/airline-analytics-brochure.pdf)
  21. 21. Big Data Analysis 22Author: Vikram Andem ISRM & IT GRC Conference Bayesian Ideas are very important for Big Data Analysis Bayesian Themes Prediction Average over unknowns, don't maximize. Uncertainty Probability coherently represents uncertainty. Combine Information Hierarchical models combine information from multiple sources. Source: Steve Scott (Google Inc.) Sparsity Sparsity plays an important role in modeling Big Data  Models are "big" because of a small number of factors with many levels.  Big data problems are often big collections of small data problems. Multi-armed Bandits Problem The multi-armed bandit problem is the problem a gambler faces at a row of slot machines, sometimes known as "one-armed bandits", when deciding which slot machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a random reward from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. Source: Wikipedia Bayes Rule applied to Machine Learning A use case in Airline industry Big Data Project at Southwest Airlines The URL below provides a visual (interactive graphics) presentation of the Big Data Project at Southwest Airlines and how they used the Bayesian approach and Naive Bayes classification with the WEKA ("Waikato Environment for Knowledge Analysis") tool for analysis of the following questions: 1) What are the important factors that cause delays and their weightage? 2) What kind of weather (e.g. sunny, cloudy, snow, rain, etc.) causes weather delays? 3) Are some time periods during the day (e.g. early morning, morning, noon, etc.) more prone to delays than others? (URL: http://prezi.com/f3bsv9m6yl2g/big-data-project_southwest-airlines/) Entirely driven by parameter uncertainty
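A standard Bayesian answer to the multi-armed bandit problem described above is Thompson sampling: keep a Beta posterior over each arm's success rate, draw one sample from each posterior, and play the arm with the best draw. The three click-through rates below are invented for illustration:

```python
import random

random.seed(7)

# Thompson sampling for a 3-armed Bernoulli bandit. The true success
# rates are unknown to the algorithm; it only sees win/loss feedback.
true_rates = [0.05, 0.10, 0.20]
wins = [1, 1, 1]     # Beta(1, 1) uniform priors on each arm
losses = [1, 1, 1]

pulls = [0, 0, 0]
for _ in range(5000):
    # Sample a plausible success rate for each arm from its posterior...
    samples = [random.betavariate(wins[a], losses[a]) for a in range(3)]
    # ...and play the arm whose sampled rate is highest.
    arm = samples.index(max(samples))
    pulls[arm] += 1
    if random.random() < true_rates[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1

print(pulls)  # pulls concentrate on the best arm (index 2)
```

Exploration here is "entirely driven by parameter uncertainty", as the slide puts it: while the posteriors are wide, weaker arms still get sampled; as evidence accumulates, play concentrates on the best arm.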
  22. 22. Big Data Analysis 23Author: Vikram Andem ISRM & IT GRC Conference Example: Bayesian based “Search Optimization” on Google File System (Source: Google Analytics) Source: Steve Scott (Google Inc.) Personalization as a “Big Logistic Regression" Search words: “Chicago to Houston today” Search words: “Chicago to Houston flight tomorrow” Search words: “Chicago to Houston cheapest”
  23. 23. Big Data Analysis 24Author: Vikram Andem ISRM & IT GRC Conference Meta Analysis Meta Analysis Meta-analysis refers to methods that focus on contrasting and combining results from different studies, in the hope of identifying patterns among study results, sources of disagreement among those results, or other interesting relationships that may come to light in the context of multiple studies. In its simplest form, meta-analysis is normally done by identification of a common measure of effect size. A weighted average of that common measure is the output of a meta-analysis. The weighting is related to sample sizes within the individual studies. More generally there are other differences between the studies that need to be allowed for, but the general aim of a meta-analysis is to more powerfully estimate the true effect size, as opposed to a less precise effect size derived in a single study under a given single set of assumptions and conditions. Source: Wikipedia Advantages  Results can be generalized to a larger population.  The precision and accuracy of estimates can be improved as more data is used. This, in turn, may increase the statistical power to detect an effect.  Inconsistency of results across studies can be quantified and analyzed. For instance, does inconsistency arise from sampling error, or are study results (partially) influenced by between-study heterogeneity?  Hypothesis testing can be applied on summary estimates. A use case in Airline industry Price Elasticities of Demand for Passenger Air Travel A good discussion of the topic is detailed in the paper listed below: Price Elasticities of Demand for Passenger Air Travel: A Meta-Analysis by Martijn Brons, Eric Pels, Peter Nijkamp, Piet Rietveld of Tinbergen Institute (URL: http://papers.tinbergen.nl/01047.pdf) Meta Analysis and Big Data A good discussion of the topic is detailed in the article listed below: Meta-Analysis: The Original 'Big Data‘ by Blair T. 
Johnson, Professor at University of Connecticut (URL: http://meta-analysis.ning.com/profiles/blogs/meta-analysis-the-original-big-data)
  24. 24. Big Data Analysis 25Author: Vikram Andem ISRM & IT GRC Conference Effect Size Effect Size Effect size is a measure of the strength of a phenomenon (for example, the change in an outcome after experimental intervention). An effect size calculated from data is a descriptive statistic that conveys the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population. In that way, effect sizes complement inferential statistics such as p-values. Among other uses, effect size measures play an important role in meta-analysis studies that summarize findings from a specific area of research, and in statistical power analyses. Source: Wikipedia Example: A weight loss program may boast that it leads to an average weight loss of 30 pounds. In this case, 30 pounds is the claimed effect size. If the weight loss program results in an average loss of 30 pounds, it is possible that every participant loses exactly 30 pounds, or half the participants lose 60 pounds and half lose no weight at all. “Small”, “Medium”, “Large” Effect Sizes Effect sizes apply terms such as “small”, “medium” and “large” to the size of the effect, and these labels are relative. Whether an effect size should be interpreted as small, medium, or large depends on its substantive context and its operational definition. Cohen's conventional criteria for small, medium, and large are near ubiquitous across many fields. Power analysis or sample size planning requires an assumed population effect size. For Cohen's d, an effect size of 0.2 to 0.3 might be a “small” effect, around 0.5 a “medium” effect, and 0.8 or above a “large” effect.
  25. 25. Big Data Analysis 26Author: Vikram Andem ISRM & IT GRC Conference Monte Carlo Method Monte Carlo Method Monte Carlo methods (or experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results; typically one runs simulations many times over in order to obtain the distribution of an unknown probabilistic entity. The name comes from the resemblance of the technique to the act of playing and recording results in a real gambling casino. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to obtain a closed-form expression, or infeasible to apply a deterministic algorithm. Monte Carlo methods are mainly used in three distinct problem classes: optimization, numerical integration and generation of draws from a probability distribution. Monte Carlo methods vary, but tend to follow a particular pattern:  Define a domain of possible inputs.  Generate inputs randomly from a probability distribution over the domain.  Perform a deterministic computation on the inputs.  Aggregate the results. For example: Consider a circle inscribed in a unit square. Given that circle and the square have a ratio of areas that is π/4, the value of π can be approximated using a Monte Carlo method:  Draw a square on ground, then inscribe a circle within it.  Uniformly scatter some objects of uniform size (grains of rice or sand) over the square.  Count the number of objects inside the circle and the total number of objects.  The ratio of the two counts is an estimate of the ratio of the two areas, which is π/4. Multiply the result by 4 to estimate π. 
Monte Carlo Methods for Bayesian Analysis and Big Data A good discussion of the topic is detailed in the paper listed below: A Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets by David Madigan, Professor and Dean at Columbia University and Greg Ridgeway, Deputy Director at National Institute of Justice. (URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2753529/) Source: Wikipedia A use case in Airline industry Flight Delay-Cost (Initial delay – “type I” and Propagated delay “type II”) and Dynamic Simulation Analysis for Airline Schedule Optimization Flight Delay-Cost Simulation Analysis and Airline Schedule Optimization by Duojia Yuan of RMIT University, Victoria, Australia (URL: http://researchbank.rmit.edu.au/eserv/rmit:9807/Yuan.pdf) General use case for Customer Satisfaction and Customer Loyalty Concurrent Reinforcement Learning from Customer Interactions Concurrent Reinforcement Learning from Customer Interactions by David Silver of University College London (published 2013) and Leonard Newnham, Dave Barker, Suzanne Weller, Jason McFall of Causata Ltd. (URL: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Publications_files/concurrent-rl.pdf) A good discussion of the topic is detailed in the Ph.D. thesis listed below. The reliability modeling approach developed in this project (to enhance the dispatch reliability of an Australian airline fleet) is based on probability distributions and Monte Carlo Simulation (MCS) techniques. Initial (type I) delay and propagated (type II) delay are adopted as the criterion for data classification and analysis. 
In the paper below, the authors present a framework for concurrent reinforcement learning, a new method of a company interacting concurrently with many customers with an objective function to maximize revenue, customer satisfaction, or customer loyalty, which depends primarily on the sequence of interactions between company and customer (such as promotions, advertisements, or emails) and actions by the customer (such as point-of-sale purchases, or clicks on a website). The proposed concurrent reinforcement learning framework uses a variant of temporal-difference learning to learn efficiently from partial interaction sequences. The goal is to maximize the future rewards for each customer, given their history of interactions with the company. The proposed framework differs from traditional reinforcement learning paradigms due to the concurrent nature of the customer interactions. This distinction leads to new considerations for reinforcement learning algorithms.
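The inscribed-circle recipe above translates directly into code:

```python
import random

random.seed(123)

# Monte Carlo estimate of pi: scatter random points over the unit square
# and count those falling inside the quarter circle of radius 1. The
# fraction inside estimates pi/4.
n = 100_000
inside = sum(1 for _ in range(n)
             if random.random() ** 2 + random.random() ** 2 <= 1.0)
pi_estimate = 4 * inside / n
print(round(pi_estimate, 3))
```

The error shrinks roughly as 1/sqrt(n), which is why Monte Carlo methods pair naturally with big data infrastructure: more draws, spread across machines, buy more precision.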
  26. 26. Big Data Analysis 27Author: Vikram Andem ISRM & IT GRC Conference Bayes and Big Data: Consensus Monte Carlo and Nonparametric Bayesian Data Analysis A good discussion of the topic is detailed in the article listed below: “Bayes and Big Data: The Consensus Monte Carlo Algorithm” by  Robert E. McCulloch, of University of Chicago, Booth School of Business  Edward I. George, of University of Pennsylvania, The Wharton School  Steven L. Scott, of Google, Inc  Alexander W. Blocker, of Google, Inc  Fernando V. Bonassi, Google, Inc. (URL: http://www.rob-mcculloch.org/some_papers_and_talks/papers/working/consensus-mc.pdf) Consensus Monte Carlo For Bayesian methods to work in a MapReduce / Hadoop environment, we need algorithms that require very little communication. Need: A useful definition of “big data” is data that is too big to fit on a single machine, either because of processor, memory, or disk bottlenecks. Graphics Processing Units (GPU) can alleviate the processor bottleneck, but memory or disk bottlenecks can only be alleviated by splitting “big data” across multiple machines. Communication between large numbers of machines is expensive (regardless of the amount of data being communicated), so there is a need for algorithms that perform distributed approximate Bayesian analyses with minimal communication. Consensus Monte Carlo operates by running a separate Monte Carlo algorithm on each machine, and then averaging the individual Monte Carlo draws. Depending on the model, the resulting draws can be nearly indistinguishable from the draws that would have been obtained by running a single machine algorithm for a very long time. Source: Steve Scott (Google Inc.) 
Non-Parametric Bayesian Data Analysis A use case in Airline industry Airline Delays in International Air Cargo Logistics A good discussion of the topic is detailed in the paper below: “Nonparametric Bayesian Analysis in International Air Cargo Logistics” by Yan Shang of Fuqua School of Business, Duke University (URL: https://bayesian.org/abstracts/5687) Non-Parametric Analysis refers to comparative properties (statistics) of the data, or population, which do not include the typical parameters of mean, variance, standard deviation, etc. Need / Motivation: Models are never correct for real world data. Non-Parametric Modelling of Large Data Sets What is a nonparametric model?  A parametric model where the number of parameters increases with data.  A really large parametric model.  A model over infinite dimensional function or measure spaces.  A family of distributions that is dense in some large space. Why nonparametric models in Bayesian theory of learning?  Broad class of priors that allows data to “speak for itself”.  Side-step model selection and averaging. Bayes and Big Data
  27. 27. Big Data Analysis 28Author: Vikram Andem ISRM & IT GRC Conference Homoscedasticity vs. Heteroskedasticity Homoscedasticity In regression analysis, homoscedasticity means a situation in which the variance of the dependent variable is the same for all the data. Homoscedasticity facilitates analysis because most methods are based on the assumption of equal variance. A sequence or a vector of random variables is homoscedastic if all random variables in the sequence or vector have the same finite variance. This is also known as homogeneity of variance. In regression analysis, heteroskedasticity means a situation in which the variance of the dependent variable varies across the data. Heteroskedasticity complicates analysis because many methods in regression analysis are based on an assumption of equal variance. A collection of random variables is heteroscedastic if there are sub-populations that have different variabilities from others. Here "variability" could be quantified by the variance or any other measure of statistical dispersion. Thus heteroscedasticity is the absence of homoscedasticity. Heteroskedasticity
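The difference is easy to see in simulation. Below, both data sets follow the same trend y = 2x, but one has constant error spread (homoscedastic) and the other an error spread that grows with x (heteroskedastic); the model and noise levels are arbitrary illustrations:

```python
import random
import statistics

random.seed(3)

x = [i / 100 for i in range(1, 1001)]  # predictor values in (0, 10]

# Both data sets follow the trend y = 2x. The homoscedastic errors have
# constant spread; the heteroskedastic errors grow with x.
homo = [2 * xi + random.gauss(0, 1.0) for xi in x]
hetero = [2 * xi + random.gauss(0, 0.2 * xi) for xi in x]

def residual_sd(y, lo, hi):
    # Spread of the residuals y - 2x for observations with lo <= x < hi.
    res = [yi - 2 * xi for xi, yi in zip(x, y) if lo <= xi < hi]
    return statistics.stdev(res)

# Residual spread at small vs. large x: roughly equal for the
# homoscedastic data, sharply different for the heteroskedastic data.
print(round(residual_sd(homo, 0, 2), 2), round(residual_sd(homo, 8, 10), 2))
print(round(residual_sd(hetero, 0, 2), 2), round(residual_sd(hetero, 8, 10), 2))
```
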
  28. 28. Big Data Analysis 29Author: Vikram Andem ISRM & IT GRC Conference Benford’s Law Benford’s Law Benford's Law, also called the First-Digit Law, refers to the frequency distribution of digits in many (but not all) real-life sources of data. In this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently: 9 as the first digit less than 5% of the time. Benford's Law also concerns the expected distribution for digits beyond the first, which approach a uniform distribution. This result has been found to apply to a wide variety of data sets, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature). It tends to be most accurate when values are distributed across multiple orders of magnitude. Source: Wikipedia Numerically, the leading digits have the following distribution in Benford's Law, where d is the leading digit and P(d) the probability: P(d) = log10(1 + 1/d). Benford’s Law Big Data Application: Fraud Detection Facts The graph below shows Benford's Law for base 10. There is a generalization of the law to numbers expressed in other bases (for example, base 16), and also a generalization from the leading 1 digit to the leading n digits. A set of numbers is said to satisfy Benford's Law if the leading digit d (d ∈ {1, ..., 9}) occurs with probability P(d) = log10(1 + 1/d). Benford’s Law holds true for a data set that grows exponentially (e.g., doubles, then doubles again in the same time span). It is best applied to data sets that go across multiple orders of magnitude. The theory does not hold true for data sets in which digits are predisposed to begin with a limited set of digits. The theory also does not hold true when a data set covers only one or two orders of magnitude.  Helps identify duplicates & other data pattern anomalies in large data sets. 
 Enables auditors and data analysts to focus on possible anomalies in very large data sets.  It does not "directly" prove that error or fraud exist, but identifies items that deserve further study on statistical grounds.  Mainly used for setting future auditing plans and is a low-cost entry point for continuous analysis of very large data sets.  Not good for sampling – results in very large selection sizes.  As technology matures, finding fraud will increase (not decrease).  Not all data sets are suitable for analysis. A use case in Airline industry A financial/accounting auditor can evaluate very large data sets (in a continuous monitoring or continuous audit environment) that represent a continuous stream of transactions, such as the sales made by a (third-party) online retailer or the internal airline reservation system. Fraud Detection in Airline Ticket Purchases Christopher J. Rosetti, CPA, CFE, DABFA of KPMG, in his presentation titled "SAS 99: Detecting Fraud Using Benford’s Law" presented at the FAE/NYSSCPA Technology Assurance Committee on March 13, 2003, claims that United Airlines currently uses Benford's law for fraud detection. (URL: http://www.nysscpa.org/committees/emergingtech/law.ppt)
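A quick sanity check of Benford's Law: an exponentially growing series (here, a quantity growing 50% per period, chosen arbitrarily) spans many orders of magnitude, so its leading digits should track the log10(1 + 1/d) distribution:

```python
import math
from collections import Counter

# Benford's expected first-digit frequencies: P(d) = log10(1 + 1/d).
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(v):
    # Leading digit extracted via logarithms (robust to huge magnitudes).
    return int(10 ** (math.log10(v) % 1))

# A quantity growing 50% per period spans many orders of magnitude.
values = [1.5 ** k for k in range(1, 500)]
observed = Counter(first_digit(v) for v in values)
freq = {d: observed[d] / len(values) for d in range(1, 10)}

for d in range(1, 10):
    print(d, round(benford[d], 3), round(freq[d], 3))
```

An auditor would run the same digit count over, say, transaction amounts and flag material deviations from the expected frequencies for closer review.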
  29. 29. Big Data Analysis 30Author: Vikram Andem ISRM & IT GRC Conference Multiple Hypothesis Testing Multiple Testing Problem The multiple testing problem occurs when one considers a set of statistical inferences simultaneously or infers a subset of parameters selected based on the observed values. Errors in inference, including confidence intervals that fail to include their corresponding population parameters or hypothesis tests that incorrectly reject the null hypothesis, are more likely to occur when one considers the set as a whole. Source: Wikipedia For example, one might declare that a coin was biased if in 10 flips it landed heads at least 9 times. Indeed, if one assumes as a null hypothesis that the coin is fair, then the probability that a fair coin would come up heads at least 9 out of 10 times is (10 + 1) × (1/2)^10 ≈ 0.0107. This is relatively unlikely, and under statistical criteria such as p-value < 0.05, one would declare that the null hypothesis should be rejected — i.e., the coin is unfair. A multiple-comparisons problem arises if one wanted to use this test (which is appropriate for testing the fairness of a single coin) to test the fairness of many coins. Imagine if one were to test 100 fair coins by this method. Given that the probability of a fair coin coming up 9 or 10 heads in 10 flips is 0.0107, one would expect that in flipping 100 fair coins ten times each, seeing a particular (i.e., pre-selected) coin come up heads 9 or 10 times would still be very unlikely, but seeing any coin behave that way, without concern for which one, would be more likely than not. Precisely, the likelihood that all 100 fair coins are identified as fair by this criterion is (1 − 0.0107)^100 ≈ 0.34. Therefore the application of our single-test coin-fairness criterion to multiple comparisons would be more likely than not to falsely identify at least one fair coin as unfair.
Multiple Hypothesis Testing A use case in Airline industry Predicting Flight Delays using Multiple Hypothesis Testing A good discussion of the topic is detailed in the paper listed below: Predicting Flight Delays by Dieterich Lawson and William Castillo of Stanford University (URL: http://cs229.stanford.edu/proj2012/CastilloLawson-PredictingFlightDelays.pdf) Also detailed in the book “Big Data for Chimps: A Seriously Fun Guide to Terabyte-scale Data Processing” by the same author (Dieterich Lawson) and Philip Kromer. Sample source code for modelling in Matlab is also provided by Dieterich Lawson and can be found at URL: https://github.com/infochimps-labs/big_data_for_chimps
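The coin-flip numbers quoted above can be verified directly:

```python
from math import comb

# Probability a single fair coin shows at least 9 heads in 10 flips.
p_single = (comb(10, 9) + comb(10, 10)) / 2 ** 10   # 11/1024, about 0.0107

# Probability that none of 100 fair coins trips this criterion, and the
# complementary probability that at least one does.
p_none = (1 - p_single) ** 100
p_at_least_one = 1 - p_none

print(round(p_single, 4), round(p_none, 2), round(p_at_least_one, 2))
```

Corrections such as Bonferroni address exactly this: with 100 tests, each individual test would need a far stricter threshold (e.g. 0.05/100) to keep the family-wide false-positive rate near 5%.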
  30. 30. Big Data Analysis 31Author: Vikram Andem ISRM & IT GRC Conference The German Tank Problem The German Tank Problem The problem of estimating the maximum of a discrete uniform distribution from sampling without replacement is known in English as the German tank problem, due to its application in World War II to the estimation of the number of German tanks. The analyses illustrate the difference between frequentist inference and Bayesian inference. Estimating the population maximum based on a single sample yields divergent results, while the estimation based on multiple samples is an instructive practical estimation question whose answer is simple but not obvious. Source: Wikipedia During World War II, production of German tanks such as the Panther (below photo) was accurately estimated by Allied intelligence using statistical methods. Example: Suppose an intelligence officer has spotted k = 4 tanks with serial numbers 2, 6, 7, and 14, with maximum observed serial number m = 14. The unknown total number of tanks is called N. The estimate of the total number of tanks suggested by the frequentist approach outlined is N̂ = m + m/k − 1 = m(1 + 1/k) − 1, which here gives 14 + 14/4 − 1 = 16.5. The Bayesian analysis instead yields (primarily) a probability mass function for the number of tanks, P(N = n) = ((k − 1)/k) × C(m − 1, k − 1)/C(n, k) for n ≥ m, from which we can estimate the number of tanks via the posterior mean, (m − 1)(k − 1)/(k − 2) = 19.5. This distribution has positive skewness, related to the fact that there are at least 14 tanks. During the course of the war the Western Allies made sustained efforts to determine the extent of German production, and approached this in two major ways: conventional intelligence gathering and statistical estimation. To do this they used the serial numbers on captured or destroyed tanks. The principal numbers used were gearbox numbers, as these fell in two unbroken sequences. Chassis and engine numbers were also used, though their use was more complicated. Various other components were used to cross-check the analysis. 
Similar analyses were done on tires, which were observed to be sequentially numbered (i.e., 1, 2, 3, ..., N). The analysis of tank wheels yielded an estimate for the number of wheel molds that were in use. Analysis of wheels from two tanks (48 wheels each, 96 wheels total) yielded an estimate of 270 produced in February 1944, substantially more than had previously been suspected. German records after the war showed production for the month of February 1944 was 276. The statistical approach proved to be far more accurate than conventional intelligence methods, and the phrase German tank problem became accepted as a descriptor for this type of statistical analysis. Application in Big Data Analysis  Similar to German Tank Problem we can estimate/analyze (large or small) data sets that we don’t have (or assumed that we don’t have).  There is “leaky” data all around us; all we have to do is to think outside the box. Companies very often don’t think about the data they publish publicly and we can either extrapolate from that data (as in the German Tank problem) or simply extract useful information from it.  A company’s competitors' websites (publicly available data) can be a valuable hunting ground. Think about whether you can use it to estimate some missing data (as with the serial numbers) and/or combine that data with other, seemingly innocuous, sets to produce some vital information. If that information gives your company a commercial advantage and is legal, then you should use it as part of your analysis. Source: Wikipedia
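The frequentist estimate for the example above (k = 4 tanks, maximum serial m = 14) can be computed and sanity-checked by simulation; the true tank count of 270 used in the simulation is arbitrary (borrowed from the February 1944 figure mentioned above):

```python
import random

random.seed(9)

# Frequentist estimator for the example: k serials observed, maximum m.
# N_hat = m + m/k - 1 = m(1 + 1/k) - 1.
k, m = 4, 14
n_hat = m * (1 + 1 / k) - 1
print(n_hat)  # 16.5

# Sanity check by simulation: the estimator should be unbiased for the
# true N, chosen arbitrarily here as 270.
true_n = 270
estimates = []
for _ in range(20_000):
    sample_max = max(random.sample(range(1, true_n + 1), k))
    estimates.append(sample_max * (1 + 1 / k) - 1)
print(round(sum(estimates) / len(estimates)))  # averages close to true_n
```
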
  31. 31. Big Data Analysis 32Author: Vikram Andem ISRM & IT GRC Conference Nyquist–Shannon Sampling Theorem Nyquist–Shannon Sampling Theorem The Nyquist Theorem, also known as the sampling theorem, is a principle that engineers follow in the digitization of analog signals. For analog-to-digital conversion (ADC) to result in a faithful reproduction of the signal, slices, called samples, of the analog waveform must be taken frequently. The number of samples per second is called the sampling rate or sampling frequency. Any analog signal consists of components at various frequencies. The simplest case is the sine wave, in which all the signal energy is concentrated at one frequency. In practice, analog signals usually have complex waveforms, with components at many frequencies. The highest frequency component in an analog signal determines the bandwidth of that signal. The higher the frequency, the greater the bandwidth, if all other factors are held constant. Suppose the highest frequency component, in hertz, for a given analog signal is fmax. According to the Nyquist Theorem, the sampling rate must be at least 2fmax, or twice the highest analog frequency component. The sampling in an analog-to-digital converter is actuated by a pulse generator (clock). If the sampling rate is less than 2fmax, some of the highest frequency components in the analog input signal will not be correctly represented in the digitized output. When such a digital signal is converted back to analog form by a digital-to-analog converter, false frequency components appear that were not in the original analog signal. This undesirable condition is a form of distortion called aliasing. Application in Big Data Analysis  Even though the “Nyquist–Shannon Sampling Theorem” is about the minimum sampling rate of a continuous wave, but with Big Data Analysis practice it will tell you how frequently you need to collect that Big Data from sensors like smart meters. 
 The frequency of data collection for Big Data is the “Velocity”, one of the three “V”s that define Big Data: Volume, Velocity and Variety. Left figure: X(f) (top blue) and XA(f) (bottom blue) are continuous Fourier transforms of two different functions, x(t) and xA(t) (not shown). When the functions are sampled at rate fs, the images (green) are added to the original transforms (blue) when one examines the discrete-time Fourier transforms (DTFT) of the sequences. In this hypothetical example, the DTFTs are identical, which means the sampled sequences are identical, even though the original continuous pre-sampled functions are not. If these were audio signals, x(t) and xA(t) might not sound the same. But their samples (taken at rate fs) are identical and would lead to identical reproduced sounds; thus xA(t) is an alias of x(t) at this sample rate. In this example (of a bandlimited function), such aliasing can be prevented by increasing fs such that the green images in the top figure do not overlap the blue portion. Right figure: Spectrum, Xs(f), of a properly sampled bandlimited signal (blue) and the adjacent DTFT images (green) that do not overlap. A brick-wall low-pass filter, H(f), removes the images, leaves the original spectrum, X(f), and recovers the original signal from its samples. Source: Wikipedia Source: Wikipedia
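Aliasing can also be demonstrated numerically: a 7 Hz sine sampled at 10 Hz (below its Nyquist rate of 14 Hz) yields exactly the same samples as a phase-inverted 3 Hz sine, so after sampling the two signals cannot be told apart:

```python
import math

# Sample a 7 Hz sine at fs = 10 Hz. Since 7 - 10 = -3, the samples are
# identical to those of a 3 Hz sine with inverted sign:
#   sin(2*pi*7*n/10) = sin(2*pi*7*n/10 - 2*pi*n) = -sin(2*pi*3*n/10)
fs = 10
samples_7hz = [math.sin(2 * math.pi * 7 * n / fs) for n in range(20)]
samples_3hz = [-math.sin(2 * math.pi * 3 * n / fs) for n in range(20)]

max_diff = max(abs(a - b) for a, b in zip(samples_7hz, samples_3hz))
print(max_diff < 1e-9)  # the two sample sequences match
```

For big data collection this is the practical warning behind the theorem: sensor data sampled too slowly (too little Velocity) can silently misrepresent the fast variation in the underlying signal.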
32. Simpson's Paradox: Simpson's paradox is a paradox in which a trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data. This result is particularly confounding when frequency data are unduly given causal interpretations. Simpson's Paradox disappears when causal relations are brought into consideration. Example:  It's a well accepted rule of thumb that the larger the data set, the more reliable the conclusions drawn. Simpson's paradox, however, slams a hammer down on the rule and the result is a good deal worse than a sore thumb. Simpson's paradox demonstrates that a great deal of care has to be taken when combining small data sets into a large one. Sometimes conclusions from the large data set may be the exact opposite of conclusions from the smaller sets. Unfortunately, the conclusions from the large set can (also) be wrong.  The lurking variables (or confounding variables) in Simpson's paradox are categorical. That is, they break the observations into groups, such as the city of origin for the airline flights. Simpson's paradox is an extreme form of the fact that observed associations can be misleading when there are lurking variables.

Aggregate table:
  Status    Airline A   Airline B
  On Time   718         5534
  Delayed   74          532
  Total     792         6066

From the aggregate table: Airline A is delayed 9.3% (74/792) of the time; Airline B is delayed only 8.8% (532/6066) of the time. So Airline A would NOT be preferable.

By-city table:
  City      Airline   On Time   Delayed   Total
  Chicago   A         497       62        559
  Chicago   B         694       117       811
  Houston   A         221       12        233
  Houston   B         4840      415       5255

From the by-city table: From Chicago, Airline A is delayed 11.1% (62/559) of the time, but Airline B is delayed 14.4% (117/811) of the time. From Houston, Airline A is delayed 5.2% (12/233) of the time, but Airline B is delayed 7.9% (415/5255). Consequently, Airline A would be preferable. 
This conclusion contradicts the previous one: Simpson's Paradox is how big data sets CAN go wrong. A use case in Airline industry: Airline On-Time Performance at Hub-and-Spoke Flight Networks. A good discussion of the topic is detailed in the paper listed below: Simpson's Paradox, Aggregation, and Airline On-time Performance by Bruce Brown of Cal State Polytechnic University (URL: http://www.csupomona.edu/~bbrown/Brown_SimpPar_WEAI06.pdf)  Big Data doesn't happen overnight and there's no magic to it.  Just deploying Big Data tools and analytical solutions (R, SAS, Tableau, etc.) doesn't guarantee anything, as Simpson's Paradox proves.
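The reversal in the tables above can be checked in a few lines. This sketch (illustrative, plain Python) recomputes the delay rates per city and in aggregate:

```python
# (on_time, delayed) counts from the slide's tables
per_city = {
    "Chicago": {"A": (497, 62), "B": (694, 117)},
    "Houston": {"A": (221, 12), "B": (4840, 415)},
}

def delay_rate(on_time, delayed):
    return delayed / (on_time + delayed)

# Per city, Airline A has the lower delay rate in BOTH cities:
for city, airlines in per_city.items():
    rate_a = delay_rate(*airlines["A"])
    rate_b = delay_rate(*airlines["B"])
    print(city, round(rate_a, 3), round(rate_b, 3))  # A < B in each city

# Aggregated over both cities, the trend reverses (Simpson's paradox):
total = {al: [sum(per_city[c][al][i] for c in per_city) for i in (0, 1)]
         for al in ("A", "B")}
agg_a = delay_rate(*total["A"])    # 74/792   ~ 9.3%
agg_b = delay_rate(*total["B"])    # 532/6066 ~ 8.8%
print(round(agg_a, 3), round(agg_b, 3))  # now A > B
```

The reversal happens because Airline B flies mostly out of Houston, where everyone's delay rates are low; the city is the lurking variable.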
33. Machine Learning and Data Mining: Machine learning concerns the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders. The core of machine learning deals with representation and generalization. Representation of data instances and functions evaluated on these instances are part of all machine learning systems. Generalization is the property that the system will perform well on unseen data instances. Source: Wikipedia. The two terms (machine learning and data mining) are commonly confused, as they often employ the same methods and overlap significantly. They can be roughly defined as follows:  Machine learning focuses on prediction, based on known properties learned from the training data.  Data mining focuses on the discovery of (previously) unknown properties in the data. This is the analysis step of Knowledge Discovery in Databases. Terminology  Classification: The learned attribute is categorical (“nominal”)  Regression: The learned attribute is numeric  Supervised Learning (“Training”): We are given examples of inputs and associated outputs and we learn the relationship between them.  Unsupervised Learning (sometimes: “Mining”): We are given inputs, but no outputs (such as unlabeled data) and we learn the “latent” labels (example: Clustering). Example: Document Classification • Highly accurate predictions on real-time and continuous data (based on rule sets with earlier training / learning and training / historical data). • Goal is not to uncover underlying “truth”. • Emphasis on methods that can handle very large datasets for better predictions. 
A use case in Airline industry: Southwest Airlines' use of Machine Learning for Airline Safety. The below URL details an article (published September 2013) on how Southwest Airlines uses Machine Learning algorithms for Big Data purposes to analyze vast amounts of very large data sets (which are publicly accessible from NASA's DASHlink site) to find anomalies and potential safety issues and to identify patterns to improve airline safety. URL: http://www.bigdata-startups.com/BigData-startup/southwest-airlines-uses-big-data-deliver-excellent-customer-service/ Primary Goal of Machine Learning: Why Machine Learning?  Use Case 1: Increase the barrier to entry when product / service quality is dependent on data.  Use Case 2: Customize a product / service to increase engagement and profits. Example: customize a sales page to increase conversion rates for online products.
34. Classification Rules and Rule Sets: Rule Set to Classify Data. Golf Example: To Play or Not to Play. A use case in Airline industry: Optimal Airline Ticket Purchasing (automated feature selection). A good discussion of the topic is detailed in the paper listed below: Optimal Airline Ticket Purchasing Using Automated User-Guided Feature Selection by William Groves and Maria Gini of the University of Minnesota (URL: http://ijcai.org/papers13/Papers/IJCAI13-032.pdf ) Classification Problems. Examples of Classification Problems: • Text categorization (e.g., spam filtering) • Market segmentation (e.g., predict if a customer will respond to a promotion) • Natural-language processing (e.g., spoken language understanding).
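A rule set of the kind the golf example illustrates is just an ordered list of if-then rules, where the first matching rule fires. The sketch below is a hypothetical rule set in the spirit of the classic "play golf" data set (the attribute names and thresholds are illustrative, not taken from the slide):

```python
def classify_golf(example):
    """Ordered rule list: the first matching rule determines the class."""
    if example["outlook"] == "overcast":
        return "play"
    if example["outlook"] == "sunny" and example["humidity"] > 75:
        return "no play"
    if example["outlook"] == "rainy" and example["windy"]:
        return "no play"
    return "play"  # default rule when nothing above matches

print(classify_golf({"outlook": "sunny", "humidity": 80, "windy": False}))   # no play
print(classify_golf({"outlook": "overcast", "humidity": 90, "windy": True})) # play
```

Because the rules are ordered, later rules only see examples that earlier rules did not claim, which keeps each rule simple.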
35. Decision Tree Learning. Example: Good vs. Evil. Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification trees or regression trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data but not decisions; rather the resulting classification tree can be an input for decision making. Source: Wikipedia
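A classification tree as described above can be represented directly as nested nodes: internal nodes test one attribute, leaves carry class labels. The toy tree below is purely illustrative (the attributes are made up, airline-flavored to match the deck) and predicts whether a flight will be delayed:

```python
# Internal node: (attribute, {attribute_value: subtree}); leaf: a class label.
tree = ("weather", {
    "storm": "delayed",
    "clear": ("hub_congested", {True: "delayed", False: "on time"}),
})

def predict(node, example):
    if isinstance(node, str):            # reached a leaf -> class label
        return node
    attribute, branches = node
    return predict(branches[example[attribute]], example)

print(predict(tree, {"weather": "storm"}))                           # delayed
print(predict(tree, {"weather": "clear", "hub_congested": False}))   # on time
```

Each root-to-leaf path is a conjunction of feature tests, which is exactly the "branches are conjunctions of features" reading in the definition above.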
36. Tree Size vs. Accuracy: Accuracy, Confusion Matrix, Overfitting, Good/Bad Classifiers, and Controlling Tree Size. Building an Accurate Classifier. Good and Bad Classifiers. A use case in Airline industry: Predicting Airline Customers' Future Values. A good discussion of the topic is detailed in the paper listed below: Applying decision trees for value-based customer relations management: Predicting airline customers' future values by Giuliano Tirenni, Christian Kaiser and Andreas Herrmann of the Center for Business Metrics at the University of St. Gallen, Switzerland. (URL: http://ipgo.webs.upv.es/azahar/Pr%C3%A1cticas/articulo2.pdf ) (Slide panels: Theory; Overfitting example; Accuracy and Confusion Matrix.)
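Accuracy and the confusion matrix named on this slide reduce to simple counting. A minimal sketch (plain Python; the binary labels are chosen for illustration):

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels=("delayed", "on time")):
    """Rows: actual class; columns: predicted class."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

def accuracy(actual, predicted):
    """Fraction of examples whose prediction matches the true label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

actual    = ["delayed", "delayed", "on time", "on time", "on time"]
predicted = ["delayed", "on time", "on time", "on time", "delayed"]
print(confusion_matrix(actual, predicted))  # [[1, 1], [1, 2]]
print(accuracy(actual, predicted))          # 0.6
```

The off-diagonal entries are the two kinds of error; accuracy alone hides the distinction between them, which is why the confusion matrix is reported alongside it.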
37. Entropy and Information Gain. Question: How do you determine which attribute best classifies data or a data set? Answer: Entropy. Entropy is a measure of the unpredictability of information content. Example: a poll on some political issue. Usually, such polls happen because the outcome of the poll isn't already known. In other words, the outcome of the poll is relatively unpredictable, and actually performing the poll and learning the results gives some new information; these are just different ways of saying that the entropy of the poll results is large. Now, consider the case that the same poll is performed a second time shortly after the first poll. Since the result of the first poll is already known, the outcome of the second poll can be predicted well and the results should not contain much new information; in this case the entropy of the second poll results is small. Source: Wikipedia. Entropy is a statistical quantity measuring how well an attribute classifies the data.  Calculate the information gain for each attribute.  Choose the attribute with the greatest information gain. Information Theory: Background. If there are n equally probable possible messages, then the probability p of each is 1/n. The information conveyed by a message is -log2(p) = log2(n). For example, if there are 16 messages, then log2(16) = 4 and we need 4 bits to identify/send each message. In general, if we are given a probability distribution P = (p1, p2, ..., pn), the information conveyed by the distribution (a.k.a. the Entropy of P) is: I(P) = -(p1*log2(p1) + p2*log2(p2) + ... + pn*log2(pn)). Information Gain. Largest Entropy: Boolean functions with the same number of ones and zeros have the largest entropy. In machine learning, this concept can be used to define a preferred sequence of attributes to investigate to most rapidly narrow down the state of X. 
Such a sequence (which depends on the outcome of the investigation of previous attributes at each stage) is called a decision tree. Usually an attribute with high mutual information should be preferred to other attributes. A use case in Airline industry: An Airline matching Airplanes to Routes (using Machine Learning). A good discussion of the topic is detailed in the paper listed below: Learning Permutations with Exponential Weights by David P. Helmbold and Manfred K. Warmuth of the University of California, Santa Cruz (URL: http://machinelearning.wustl.edu/mlpapers/paper_files/jmlr10_helmbold09a.pdf )
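The entropy formula and the attribute-selection rule above can be written out directly. This sketch is illustrative (plain Python; the tiny weather data set is made up to show a perfectly informative attribute):

```python
import math
from collections import Counter

def entropy(labels):
    """I(P) = -sum(p_i * log2(p_i)) over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    groups = {}
    for x, y in zip(examples, labels):
        groups.setdefault(x[attribute], []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# 16 equally probable messages need log2(16) = 4 bits:
print(entropy(list(range(16))))   # 4.0

# An attribute that splits the labels perfectly gets the full gain:
xs = [{"outlook": "sunny"}, {"outlook": "sunny"},
      {"outlook": "rainy"}, {"outlook": "rainy"}]
ys = ["play", "play", "no play", "no play"]
print(information_gain(xs, ys, "outlook"))  # 1.0
```

ID3-style tree builders apply exactly this: at each node, compute the gain for every remaining attribute and split on the largest.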
38. The Bootstrap. Concept: Bootstrapping is a method for assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error or some other such measure) to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods. Generally, it falls in the broader class of resampling methods. The basic idea of bootstrapping is that inference about a population from sample data (sample → population) can be modeled by resampling the sample data and performing inference on (resample → sample). As the population is unknown, the true error in a sample statistic against its population value is unknowable. In bootstrap resamples, the 'population' is in fact the sample, and this is known; hence the quality of inference from resample data → 'true' sample is measurable. Source: Wikipedia. Big Data and the Bootstrap: A good discussion of the topic is detailed in the article listed below: “The Big Data Bootstrap” by Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Michael I. Jordan of the University of California, Berkeley (URL: http://www.cs.berkeley.edu/~jordan/papers/blb_icml2012.pdf ) Abstract from the paper: The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets, the computation of bootstrap-based quantities can be prohibitively demanding. As an alternative, the authors present the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a robust, computationally efficient means of assessing estimator quality. BLB is well suited to modern parallel and distributed computing architectures and retains the generic applicability, statistical efficiency, and favorable theoretical properties of the bootstrap. 
The authors provide the results of an extensive empirical and theoretical investigation of BLB's behavior, including a study of its statistical correctness, its large-scale implementation and performance, selection of hyperparameters, and performance on real data. The authors claim their procedure for quantifying estimator quality is “accurate”, “automatic” and “scalable”, and that they have tested it on data sets of size exceeding 1 Terabyte. A use case in Airline industry: Modeling Demand and Supply for Domestic and International Air Travel Economics for Cost Minimization and Profit Maximization. An in-depth and excellent scholarly treatment of the application of bootstrapping to modeling domestic and international air travel economics (demand / supply) for an airline company is detailed in the Ph.D. thesis listed below (published April 1999; slightly old, but still very relevant today): Essays on Domestic and International Airline Economics with Some Bootstrap Applications by Anthony Kenneth Postert of Rice University (URL: http://scholarship.rice.edu/bitstream/handle/1911/19428/9928581.PDF?sequence=1 ) Bootstrap and Big Data
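The basic resampling idea is compact enough to sketch. The example below (illustrative, plain Python; a percentile confidence interval for the mean, with a fixed seed assumed for reproducibility) resamples the sample with replacement and reads the interval off the empirical quantiles:

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample the sample with replacement,
    recompute the statistic, and take the empirical alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(
        stat([rng.choice(sample) for _ in range(n)]) for _ in range(n_boot)
    )
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

data = list(range(100))        # toy "sample" with mean 49.5
lo, hi = bootstrap_ci(data)
print(lo, hi)                  # an interval bracketing 49.5
```

The Bag of Little Bootstraps keeps this recipe but runs it on many small subsamples in parallel, which is what makes it tractable at terabyte scale.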
39. Ensemble Learning, Bagging and Boosting. Ensemble Learning: The basis of this slide is from the original presentation titled Bayesian Ensemble Learning for Big Data by Rob McCulloch of the University of Chicago, Booth School of Business, published November 17, 2013 (URL: http://www.rob-mcculloch.org/some_papers_and_talks/talks/dsi-bart.pdf ) Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble refers only to a concrete finite set of alternative models, but typically allows for much more flexible structure to exist between those alternatives. Source: Wikipedia. Bagging. A use case in Airline industry: Air Traffic Capacity impact during Adverse Weather conditions. A good discussion of the topic is detailed in the paper listed below: Translation of Ensemble Weather Forecasts into Probabilistic Air Traffic Capacity Impact by Matthias Steiner, Richard Bateman, Daniel Megenhardt, Yubao Liu, Mei Xu, Matthew Pocernich of the National Center for Atmospheric Research, and by Jimmy Krozel of Metron Aviation (URL: http://nldr.library.ucar.edu/repository/assets/osgc/OSGC-000-000-000-687.pdf ) Bootstrap aggregating, also called Bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach. Source: Wikipedia. Boosting: Boosting is a machine learning meta-algorithm for reducing bias in supervised learning. 
Boosting is based on the question: Can a set of weak learners create a single strong learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification. Source : Wikipedia Ensemble Learning and Big Data
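Bagging can be sketched in a few lines: draw bootstrap resamples, fit one weak learner per resample, and predict by majority vote. The weak learner below is a hypothetical one-dimensional threshold "stump" (illustrative only, not taken from any of the papers cited above):

```python
import random
from collections import Counter

def train_stump(points):
    """Weak learner: threshold at the midpoint of the two class means."""
    xs0 = [x for x, y in points if y == 0]
    xs1 = [x for x, y in points if y == 1]
    if not xs0 or not xs1:                 # degenerate resample: one class
        label = 0 if xs0 else 1
        return lambda x: label
    t = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 0 if x < t else 1

def bagged_classifier(data, n_models=25, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        resample = [rng.choice(data) for _ in data]   # bootstrap sample
        models.append(train_stump(resample))
    def predict(x):                                   # majority vote
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return predict

data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(20, 30)]
predict = bagged_classifier(data)
print(predict(2), predict(27))   # 0 1
```

Averaging many such high-variance learners is what gives bagging its variance reduction; boosting instead reweights the data sequentially to attack bias.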
40. Random Forests: Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees. Source: Wikipedia. A use case in Airline industry: Network-based model for Predicting Air Traffic Delays. The authors of the below paper propose a model using Random Forest (RF) algorithms, considering both temporal and spatial (that is, network) delay states as explanatory variables. In addition to local delay variables that describe the arrival or departure delay states of the most influential airports and origin-destination (OD) pairs in the network, the authors propose new network delay variables that depict the global delay state of the entire National Airspace System (NAS) at the time of prediction. A Network-Based Model for Predicting Air Traffic Delays by Juan Jose Rebollo and Hamsa Balakrishnan of the Massachusetts Institute of Technology (URL: http://www.mit.edu/~hamsa/pubs/RebolloBalakrishnanICRAT2012.pdf) Random Forests in Big Data. Cloudera: In the URL below, Cloudera (a major Big Data vendor) shows how to implement a Poisson approximation to enable training a random forest on an enormous data set (with R, a free open-source statistical package, on the Hadoop Distributed File System). The link also provides Map and Reduce source code. URL: https://blog.cloudera.com/blog/2013/02/how-to-resample-from-a-large-data-set-in-parallel-with-r-on-hadoop/
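The Poisson trick referenced in the Cloudera post can be sketched without Hadoop. A global bootstrap sample needs coordinated random indices, but each mapper can instead independently emit every record Poisson(1) times, which approximates sampling with replacement in one streaming pass. The sketch below is illustrative (plain Python; the Poisson sampler uses Knuth's algorithm, since the standard library has none):

```python
import math
import random

def poisson(lam, rng):
    """Knuth's algorithm for Poisson draws; fine for small lam."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def poisson_resample(records, frac=1.0, seed=0):
    """One streaming pass: emit each record Poisson(frac) times.
    For frac = 1.0 this approximates a same-size bootstrap resample,
    which is what each mapper would emit for one tree of the forest."""
    rng = random.Random(seed)
    out = []
    for r in records:
        out.extend([r] * poisson(frac, rng))
    return out

resample = poisson_resample(range(10000))
print(len(resample))   # close to 10000 (stddev is sqrt(10000) = 100)
```

Because each record is handled independently, the pass parallelizes trivially across mappers, one resample per tree.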
41. k-nearest Neighbours: The k-nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:  In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.  In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms. Both for classification and regression, it can be useful to weight the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data. Example of k-NN classification. The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles. 
If k = 3 (solid line circle) it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle). Data Reduction: Data reduction is one of the most important problems when working with huge data sets. Usually, only some of the data points are needed for accurate classification. Those data are called the prototypes and can be found as follows: 1. Select the class-outliers, that is, training data that are classified incorrectly by k-NN (for a given k). 2. Separate the rest of the data into two sets: (i) the prototypes that are used for the classification decisions and (ii) the absorbed points that can be correctly classified by k-NN using prototypes, which can be removed from the training set. k-Nearest Neighbours and Big Data: A good discussion of how to execute kNN joins in a MapReduce cluster, with algorithms in MapReduce to perform efficient parallel kNN joins on large data, is presented in the paper below. The authors demonstrated their ideas on Hadoop with extensive experiments on large real and synthetic datasets, with tens or hundreds of millions of records in up to 30 dimensions, showing efficiency, effectiveness, and scalability. Efficient Parallel kNN Joins for Large Data in MapReduce by Chi Zhang of Florida State University and Jeffrey Jestes of the University of Utah (URL: http://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf ) Background: Nearest Neighbor Graph. A nearest neighbor graph (NNG) for a set of n objects P in a metric space (e.g., for a set of points in the plane with Euclidean distance) is a directed graph with P being its vertex set and with a directed edge from p to q whenever q is a nearest neighbor of p. Example: The right side image shows a nearest neighbor graph of 100 points in the Euclidean plane. 
A k-nearest neighbor graph (k-NNG) is a graph in which two vertices p and q are connected by an edge if the distance between p and q is among the k-th smallest distances from p to other objects from P. Source: Wikipedia
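The k = 3 versus k = 5 example in the figure above can be reproduced directly. A minimal sketch (plain Python; the point coordinates are made up to mirror the figure's distances):

```python
from collections import Counter

def knn_classify(train, query, k):
    """train: list of ((x, y), label). Majority vote among the k nearest."""
    nearest = sorted(
        train,
        key=lambda p: (p[0][0] - query[0]) ** 2 + (p[0][1] - query[1]) ** 2,
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Mirror the figure: two triangles closest, then squares further out.
train = [((0, 1), "triangle"), ((1, 0), "triangle"),
         ((0, 2), "square"), ((2, 1), "square"), ((0, -3), "square")]
query = (0, 0)
print(knn_classify(train, query, 3))  # triangle (2 triangles vs 1 square)
print(knn_classify(train, query, 5))  # square   (3 squares vs 2 triangles)
```

Note there is no training step at all: the work is deferred to query time, which is exactly the "lazy learning" property described above.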
42. k-nearest Neighbours (continued). A use case in Airline industry: Seating Arrangement and Inflight Purchase / Buying Behavior of Airline Customers. The below paper investigates and characterizes how social influence affects the buying behavior of airline passengers who can purchase items through an individual entertainment system located in front of them. The author used the seating configuration in the airplane as a basis for the analysis, drawing on large data sets with a sample size of about 257,000 passengers in nearly 2,000 flights, where the passengers performed 65,525 transactions, with an average of 33.3 transactions per flight. The author claims to find strong evidence of social effects and states that the average number of transactions per passenger increases by 30% upon observation of a neighbor's purchase. Analyzing within- and cross-category effects, the author found that passengers are likely to buy from the same category purchased by their neighbors. For example, a purchase of an alcoholic beverage increases the probability of same-category purchases by a neighbor by 78%. The author claims peer effects also take place at a deeper level than product category: passengers 'imitate' their peers' decisions on the type of food, alcohol and even movie genre. The paper also investigates the determinants of social influence: the author claims no support is found for informational learning as a significant mechanism in driving social influence. The main determinant of social influence is found to be the number of neighbors observed purchasing an item. The results are consistent with informational learning where consumers only learn from others' actions, but not from their inaction. Peer Effects in Buying Behavior: Evidence from In-Flight Purchases By Pedro M. 
Gardete, Assistant Professor of Marketing at Stanford University (published September 2013) (URL: http://faculty-gsb.stanford.edu/gardete/documents/SocialEffects_8_2013.pdf )
43. Stochastic Gradient Descent. Gradient Descent: Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent is also known as steepest descent, or the method of steepest descent. When known as the latter, gradient descent should not be confused with the method of steepest descent for approximating integrals. Source: Wikipedia. Stochastic Gradient Descent: Stochastic gradient descent is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions. Source: Wikipedia
44. Stochastic Gradient Descent (continued): Gradient Descent Example
45. Stochastic Gradient Descent (continued): Stochastic Gradient Descent Example. Source: Wikipedia. Stochastic Gradient Descent vs. Minibatch Gradient Descent. Stochastic Gradient Descent and Big Data: The below URL links to a presentation on “Stochastic Optimization for Big Data Analytics”: Stochastic Optimization for Big Data Analytics by Tianbao Yang and Shenghuo Zhu of NEC Laboratories America and Rong Jin of Michigan State University (URL: http://www.cse.msu.edu/~yangtia1/sdm14-tutorial.pdf) The authors of the below paper present stochastic gradient descent techniques for online learning and ensemble methods to scale out to large amounts of data at Twitter, with details on how to integrate machine learning tools into the Hadoop platform (using Pig, a programming tool): Large-Scale Machine Learning at Twitter by Jimmy Lin and Alek Kolcz of Twitter, Inc. (URL: http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf )
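The per-example update at the heart of stochastic gradient descent fits in a few lines. This sketch (illustrative, plain Python; learning rate, epoch count and the noiseless toy data are assumptions for the example) fits y ≈ w·x + b by visiting one example at a time in random order:

```python
import random

def sgd_fit(data, lr=0.01, epochs=200, seed=0):
    """Minimize the sum of per-example squared errors, one example per step."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)              # visit examples in random order
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x          # d/dw of 0.5 * err**2
            b -= lr * err              # d/db of 0.5 * err**2
    return w, b

data = [(x, 2.0 * x + 1.0) for x in range(-5, 6)]   # true w = 2, b = 1
w, b = sgd_fit(data)
print(round(w, 2), round(b, 2))   # close to 2.0 and 1.0
```

Minibatch gradient descent simply averages the gradient over a small batch of examples per step, trading per-update noise against per-update cost.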
46. Stochastic Games and Markov Perfect Equilibrium. Importance of Markov Perfect Equilibrium in Airline industry Pricing: As an example of the use of this equilibrium concept, we consider the competition between firms which have invested heavily in fixed costs and are dominant producers in an industry, forming an oligopoly. The players are taken to be committed to levels of production capacity in the short run, and the strategies describe their decisions in setting prices. Firms' objectives are modeled as maximizing the present discounted value of profits. Airfare Game / Airline Pricing Game: Often an airplane ticket for a certain route has the same price on either airline A or airline B. Presumably, the two airlines do not have exactly the same costs, nor do they face the same demand function given their varying frequent-flyer programs, the different connections their passengers will make, and so forth. Thus, a realistic general equilibrium model would be unlikely to result in nearly identical prices. Both airlines have made sunk investments into the equipment, personnel, and legal framework. In the near term we may think of them as committed to offering service. We therefore see that they are engaged, or trapped, in a strategic game with one another when setting prices. Equilibrium: Consider the following strategy of an airline for setting the ticket price for a certain route. At every price-setting opportunity:  If the other airline is charging $300 or more, or is not selling tickets on that flight, charge $300  If the other airline is charging between $200 and $300, charge the same price  If the other airline is charging $200 or less, choose randomly between the following three options with equal probability: matching that price, charging $300, or exiting the game by ceasing indefinitely to offer service on this route. 
This is a Markov strategy because it does not depend on a history of past observations. It also satisfies the Markov reaction function definition because it does not depend on other information which is irrelevant to revenues and profits. Assume now that both airlines follow this strategy exactly. Assume further that passengers always choose the cheapest flight, so that if the airlines charge different prices, the one charging the higher price gets zero passengers. Then if each airline assumes that the other airline will follow this strategy, there is no higher-payoff alternative strategy for itself, i.e. it is playing a best response to the other airline's strategy. If both airlines followed this strategy, it would form a Nash equilibrium in every proper subgame, thus a subgame-perfect Nash equilibrium. Source: Wikipedia. Stochastic (or Markov) Games: A stochastic (or Markov) game is a dynamic game with probabilistic transitions played by one or more players. The game is played in a sequence of stages. At the beginning of each stage the game is in some state. The players select actions and each player receives a payoff that depends on the current state and the chosen actions. The game then moves to a new random state whose distribution depends on the previous state and the actions chosen by the players. The procedure is repeated at the new state and play continues for a finite or infinite number of stages. The total payoff to a player is often taken to be the discounted sum of the stage payoffs or the limit inferior of the averages of the stage payoffs. Source: Wikipedia. Markov Perfect Equilibrium: A Markov perfect equilibrium is an equilibrium concept in game theory. It is the refinement of the concept of subgame perfect equilibrium to extensive form games for which a pay-off relevant state space can be readily identified. 
In extensive form games, and specifically in stochastic games, a Markov perfect equilibrium is a set of mixed strategies for each of the players which satisfy the following criteria:  The strategies have the Markov property of memorylessness, meaning that each player's mixed strategy can be conditioned only on the state of the game. These strategies are called Markov reaction functions.  The state can only encode payoff-relevant information. This rules out strategies that depend on non-substantive moves by the opponent. It excludes strategies that depend on signals, negotiation, or cooperation between players (e.g. cheap talk or contracts).  The strategies form a subgame perfect equilibrium of the game. Source: Wikipedia. Subgame Perfect Equilibrium: A subgame perfect equilibrium is a refinement of a Nash equilibrium used in dynamic games. A strategy profile is a subgame perfect equilibrium if it represents a Nash equilibrium of every subgame of the original game. Informally, this means that if (1) the players played any smaller game that consisted of only one part of the larger game and (2) their behavior represents a Nash equilibrium of that smaller game, then their behavior is a subgame perfect equilibrium of the larger game. Source: Wikipedia. A use case in Airline industry: Dynamic Airfare Pricing and Competition. The below paper details airline industry price competition for an oligopoly in a dynamic setting, where each of the sellers has a fixed number of units available for sale over a fixed number of periods. Demand is stochastic, and depending on how it evolves, sellers may change their prices at any time. Dynamic Price Competition with Fixed Capacities by Kalyan Talluri & Victor Martinez de Albeniz. Most of the work in this paper was done prior to the writing of the paper as part of both authors' Ph.D. dissertations at the Massachusetts Institute of Technology (published February 2010) (URL: www.econ.upf.edu/docs/papers/downloads/1205.pdf)
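The airfare strategy described above is a Markov reaction function: the next price depends only on the opponent's current price, not on any history. A direct encoding (illustrative, plain Python; `None` is an assumed convention standing for "not selling tickets" / exiting the route):

```python
import random

def price_response(other_price, rng):
    """The slide's airfare strategy as a Markov reaction function.
    other_price is the competitor's current price, or None if it
    has ceased to offer service on the route."""
    if other_price is None or other_price >= 300:
        return 300
    if other_price > 200:
        return other_price            # match the competitor's price
    # Competitor at $200 or less: match, jump to $300, or exit the
    # route (None), each with equal probability.
    return rng.choice([other_price, 300, None])

rng = random.Random(0)
# Both airlines at $300 is self-reinforcing: each best response to 300 is 300.
print(price_response(300, rng))                       # 300
print(price_response(250, rng))                       # 250 (match)
print(price_response(150, rng) in (150, 300, None))   # True (randomized)
```

Because the function takes only the payoff-relevant state as input, iterating it for both players simulates play under the Markov strategy, and the fixed point at $300/$300 is the equilibrium the slide describes.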
47. Stochastic Games and Markov Perfect Equilibrium (continued). A use case in Airline industry: Dynamic Revenue Management in Airline Alliances / Code Sharing. The below paper presents an excellent formalization of a Markov-game model of a two-partner airline alliance that can be used to analyze the effects of transfer pricing mechanisms on each partner's behavior. The authors show that no Markovian transfer pricing mechanism can coordinate an arbitrary alliance. Next, the authors derive the equilibrium acceptance policies under each scheme and use analytical techniques, as well as numerical analyses of sample alliances, to generate fundamental insights about partner behavior under each scheme. The analysis and numerical examples also illustrate how certain transfer price schemes are likely to perform in networks with particular characteristics. Dynamic Revenue Management in Airline Alliances by Robert Shumsky of Dartmouth College and Christopher Wright, Harry Groenevelt of the University of Rochester (published February 2009) (URL: http://www.researchgate.net/publication/220413135_Dynamic_Revenue_Management_in_Airline_Alliances/file/72e7e5215a1f91ed5b.pdf )