The Art Of Data Analysis

Karthik Shashidhar
Quant Consultant
karthik.shashidhar@gmail.com

                © Karthik Shashidhar
Introduction


Six-step process


  Case Study


Common Pitfalls



Why do you need this workshop?



We are moving to an increasingly data-driven world



The ability to use data for day-to-day decision-making
can be a major competitive advantage


This workshop equips managers with basic tools for
                dealing with data
Who needs this workshop?

Sales Managers        What is the optimal level of sales commissions
                      in order to maximize profitability?

Production Managers   How do we set daily production targets given
                      probabilities of line shutdowns?

HR Managers           What are the factors that determine employee
                      attrition?



This workshop is suitable for personnel in middle to
     senior management roles across functions
Six-step process

A structured, iterative approach to data-driven decision making

1. Frame a clear and concise problem statement
2. Break down your problem into smaller problems, and then use
   those to generate hypotheses
3. Gather, clean and prepare data
4. Test hypotheses. In the process, generate additional hypotheses
5. Consolidate results to solve the main problem
6. Make the data tell a story

Case Study

The Rs. 32 Poverty Line




 Based on data from the 66th NSSO Survey, the Planning
Commission fixed the “Poverty Line” at Rs. 32 per person
 per day for people living in urban areas. This has led to
 much controversy and protests. The Prime Minister has
   asked for your inputs. What do you recommend?




How would you frame the problem statement for this one?

• Your client may not have framed the question precisely. It is
  your job to frame a precise problem statement
• “Solving this problem” should tell you everything you want to
  know from your analysis
• Be concise, so that you remain focused on answering your question
• Frame your question such that it has an objective answer. Yes/No
  questions or questions with numerical answers are preferred

Has the poverty line been set too low at Rs. 32 per day?

• This problem statement has an objective answer (yes/no)
• Solving it is necessary and sufficient to answer the question
  our client (the PM) is asking
• The question directly addresses the situation (people complaining
  that the poverty line has been set too low)
• This problem statement is to the point and doesn’t take on
  additional responsibilities (such as defining an alternate
  poverty line)

What problems do we need to solve in order to solve the main problem?

• The set of “level two problems” must be precise and complete,
  in that:
    • The combined solutions of all level two problems lead to the
      solution of the main problem
    • The solution of each level two problem directly impacts the
      main problem
• Once again, it is key to frame problems concisely and with
  objective answers
• We need not stop at two levels. Some level two problems might
  require the solution of deeper problems. Add them to the list of
  sub-problems

What do we need to know to answer “Has the poverty line been set too low at Rs. 32 per day?”

• How is “poverty line” defined?
• What are the implications of the poverty line?
• What is the distribution of income in India?
• Does the distribution of income vary across states? If it varies
  significantly, does it make sense to have a state-wise poverty
  line?
• What are the essential goods that most people need?
• For a given income level, what essential goods can a person
  afford?

Problems generate sub-problems, and some of these will lead to hypotheses.

• Hypothesis1: There is significant difference in income level
  across states
• Hypothesis2: Essential goods are those that the poorest people
  consume. Also, their use flattens out as income goes up

Some problems, however, are direct, and don’t need hypotheses. Some are qualitative while others need data

• Question1: How is “poverty line” defined?
    • Poverty line is the minimum income level that is deemed
      adequate
    • If a family is “below poverty line” it qualifies for
      additional state benefits
• Question2: What is the distribution of incomes in each state?
• Question3: Is there some kind of a threshold about the proportion
  of population that can be below poverty line?

What data do you need here?

• It is important to frame the problem and break it down into
  components before listing data requirements, else the data could
  bias you
• Define data requirements in a general fashion, to allow you to
  easily access proxies
• Remember to gather data that both answers your questions and
  will allow you to test your hypotheses

Once you’ve identified data requirements, identify sources and gather data

• Here we need
    • Distribution of a measure of income for India
    • Distribution of a measure of income for each state
    • Spending patterns for different income levels
    • Data on household sizes in different states

Once you’ve identified data requirements, identify sources and gather data

• The National Sample Survey Organization (NSSO) conducts surveys
  every 5 years about income and expenditure, so we could perhaps
  use this
• However, income data gathered from surveys are notoriously poor
  in quality
• The poor have little savings, so their total consumption is a
  better indicator of income than the reported income data

Data cleaning is an ugly but important step

• It is important to make sure names from data procured from
  different sources match
    • For example, some government sites say “AndhraPradesh”, while
      others say “Andhra Pradesh”. A join on these names would fail
• If the data set is small, go through it once to check numbers for
  consistency. For example, if you have data on percentages, make
  sure they add up to 100%
• For larger data sets, try writing scripts to do basic cleaning

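The name-matching and percentage checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workshop's own script: the helper names `normalize_name` and `percentages_add_up`, the 0.5-point tolerance, and all figures are made up for the example.

```python
import re

def normalize_name(name: str) -> str:
    """Reduce a state name to a join key: lowercase, letters only.

    Both "AndhraPradesh" and "Andhra Pradesh" become "andhrapradesh".
    """
    return re.sub(r"[^a-z]", "", name.lower())

def percentages_add_up(shares, tolerance=0.5):
    """Return True if the percentage shares sum to 100 within a tolerance."""
    return abs(sum(shares) - 100.0) <= tolerance

# Hypothetical values from two sources that spell the state differently.
source_a = {"AndhraPradesh": 10.0}
source_b = {"Andhra Pradesh": 20.0}

# Re-key both tables on the normalized name so that a join can succeed.
a = {normalize_name(k): v for k, v in source_a.items()}
b = {normalize_name(k): v for k, v in source_b.items()}
joined = {key: (a[key], b[key]) for key in a.keys() & b.keys()}

print(joined)
print(percentages_add_up([62.5, 30.0, 7.5]))   # shares sum to 100
print(percentages_add_up([62.5, 30.0, 17.5]))  # shares sum to 110
```

Stripping everything except letters is a blunt rule chosen for brevity; real data may need a hand-built lookup table for abbreviations and renamed states.
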
Understand and prepare data before you dive into analysis

• Get a general feel for the numbers before getting into the
  analysis
• Simple visualization techniques such as scatter plots and density
  plots help
• Use simple summary statistics (mean, median, SD, quartiles) to
  get a better feel for the data
• Check out what different functional forms of your data look like

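The summary statistics mentioned above are all available in Python's standard `statistics` module. The sketch below uses a made-up list of monthly expenditure figures; the numbers are assumptions for illustration and do not come from any survey.

```python
import statistics

# Hypothetical monthly per-capita expenditure figures (Rs.), for illustration only.
spend = [620, 710, 780, 850, 900, 960, 1040, 1150, 1300, 1600, 2400, 5200]

mean = statistics.mean(spend)
median = statistics.median(spend)
sd = statistics.stdev(spend)                    # sample standard deviation
q1, q2, q3 = statistics.quantiles(spend, n=4)   # quartile cut points

print(f"mean={mean:.0f} median={median:.0f} sd={sd:.0f}")
print(f"quartiles: {q1:.0f}, {q2:.0f}, {q3:.0f}")
```

In this made-up sample the mean sits well above the median, which hints at a long right tail and is a cue to look at a density plot before fitting anything.
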
While testing hypotheses, be on the lookout for anything interesting/unusual

• It is impossible to generate all possible hypotheses before you
  begin the analysis
• Usually, as you test out some hypotheses, something in the data
  will stand out which will lead to further hypotheses
• It is ok to generate these hypotheses, which is what makes it an
  iterative process
• However, one needs to be careful to not stray from the original
  objective: each new hypothesis should directly tie in to the
  original question

Consolidate results

• Build up your case in a bottom-up manner
• Sometimes different pieces of analysis can throw up contradictory
  inferences. Check, and reconcile before you integrate
• Make sure all components of the solution that you require are
  available
• Don’t include a result in the final analysis unless it makes a
  definite contribution to the final solution

Use graphics intelligently!

• A picture is worth a thousand words, so use clear and
  easy-to-read visualizations to communicate your findings
• Use visualizations that make the solution self-evident, rather
  than something that requires a lot of explanation
• Use your graphics to communicate, not to confuse. If a graphic is
  more likely to confuse than to communicate, it is better to leave
  it out
• Sometimes all it takes to solve the problem is to visualize the
  data from a different perspective!

This graphic shows the decile in which Rs. 32 per day (Rs. 960 per month) would fall in each state

[Figure: state-wise decile chart]

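The calculation behind such a graphic can be sketched as follows. The decile cut-offs and state labels here are hypothetical placeholders, not NSSO figures, and `decile_of` is a helper name introduced only for this example. The conversion from Rs. 32 per day to Rs. 960 per month assumes a 30-day month, as the slide does.

```python
from bisect import bisect_left

DAYS_PER_MONTH = 30
poverty_line_monthly = 32 * DAYS_PER_MONTH   # Rs. 960 per month, as on the slide

# Hypothetical upper bounds (Rs. per month) of expenditure deciles 1 to 9.
# These are NOT NSSO figures; real cut-offs would come from the survey data.
decile_cutoffs = {
    "State A": [500, 620, 710, 800, 890, 990, 1120, 1320, 1750],
    "State B": [780, 930, 1060, 1190, 1330, 1500, 1720, 2080, 2850],
}

def decile_of(value, cutoffs):
    """Return the decile (1 to 10) that `value` falls in.

    `cutoffs` holds the upper bounds of deciles 1..9 in ascending order;
    a value equal to an upper bound is counted in that lower decile.
    """
    return bisect_left(cutoffs, value) + 1

for state, cutoffs in decile_cutoffs.items():
    print(state, decile_of(poverty_line_monthly, cutoffs))
```

With these placeholder cut-offs the same Rs. 960 lands in the sixth decile of one state and the third decile of the other, which is the kind of state-to-state contrast the graphic is meant to expose.
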
Common Pitfalls

Data-driven inference is fraught with pitfalls. Drawing the wrong
conclusion out of data is easier than drawing the right conclusion.

• Correlation does not imply causality
• Beware of anecdotal evidence
• Beware of outliers
• Don’t overfit models
• Contradictory inferences from same data
• Start with getting a feel for the data
• Don’t simply throw everything into the mix
• Don’t over-complicate graphics
• Models can misbehave
• Graphics can deceive

Outliers can significantly distort inferences

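A small numerical sketch of the effect, using made-up income figures: a single extreme value drags the mean far from the bulk of the data, while the median barely moves.

```python
import statistics

# Hypothetical monthly incomes (Rs.); the added value is a deliberate outlier.
incomes = [900, 950, 1000, 1050, 1100]
with_outlier = incomes + [100000]

print(statistics.mean(incomes), statistics.median(incomes))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

Here the mean jumps from 1000 to 17500 once the outlier is added, while the median only moves from 1000 to 1025.
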
“Throwing everything into the mix” may not always produce an accurate model

It could lead to multicollinearity, for example

According to this regression, the tallest person should have an
extremely large right foot and a tiny left foot! That makes no sense!

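One way to catch this before fitting is to check how strongly the candidate predictors are correlated with each other. The sketch below uses made-up left and right foot lengths (not the data behind the slide) and computes the variance inflation factor, which for exactly two predictors reduces to 1 / (1 - r^2). `pearson_r` is a helper written for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical foot lengths (cm) for six people; left and right differ by about 1 mm.
left_foot = [24.0, 25.1, 26.3, 27.0, 28.2, 29.1]
right_foot = [24.1, 25.0, 26.4, 27.1, 28.1, 29.2]

r = pearson_r(left_foot, right_foot)
# With exactly two predictors, each one's variance inflation factor is 1 / (1 - r^2).
vif = 1 / (1 - r ** 2)

print(f"r = {r:.4f}, VIF = {vif:.0f}")
```

A variance inflation factor above roughly 5 to 10 is often treated as a warning sign. With predictors this tightly correlated, the individual coefficients become unstable, which is how a regression can end up assigning a large positive weight to one foot and a negative weight to the other.
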
Over-fitting can lead to spurious models

It helps to keep your models as simple as possible. A simple rule of
thumb: a good model is one that can be easily explained in simple
English

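A minimal sketch of over-fitting, on made-up data: six training points follow the trend y = x with alternating noise of +1 and -1. A straight line captures the trend, while a degree-5 polynomial passes through every training point exactly and then fails badly on the next observation. The data and the helper names `fit_line` and `interpolate` are assumptions for this example.

```python
# Training data: the underlying trend is y = x, plus alternating noise of +1 / -1.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 0, 3, 2, 5, 4]
new_x, new_y = 6, 7          # a later observation from the same process

def fit_line(xs, ys):
    """Ordinary least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def interpolate(xs, ys, x):
    """Value at x of the unique polynomial through every point (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

intercept, slope = fit_line(train_x, train_y)
line_pred = intercept + slope * new_x
poly_pred = interpolate(train_x, train_y, new_x)

print(f"line: {line_pred:.1f}, degree-5 polynomial: {poly_pred:.1f}, actual: {new_y}")
```

The line predicts about 5.4 for the new point, against an actual value of 7. The polynomial, despite fitting the training data perfectly, predicts about -57.
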
Diving into model fitting without first understanding the data can lead to suboptimal results

People are prone to doing regressions without actually looking at the
data. Here, a simple linear regression gives a reasonable fit
(R^2 = 42%). However, a simple scatter plot would suggest a clear
Y = 1/X kind of relationship, which the regression completely misses

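The point can be reproduced on synthetic data. The sketch below generates points that follow Y = 1/X exactly (an assumption for illustration; it is not the data set behind the slide and does not reproduce its 42% figure) and compares the R^2 of a straight-line fit on X with one on the transformed variable 1/X. For a single predictor, R^2 equals the squared Pearson correlation, which is what `r_squared` computes.

```python
def r_squared(xs, ys):
    """R^2 of an ordinary least-squares straight-line fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Hypothetical data that follows Y = 1/X exactly.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1 / x for x in xs]

r2_raw = r_squared(xs, ys)                           # regress Y on X
r2_transformed = r_squared([1 / x for x in xs], ys)  # regress Y on 1/X

print(f"R^2 on X: {r2_raw:.2f}, R^2 on 1/X: {r2_transformed:.2f}")
```

The straight-line fit on X looks passable on paper, yet the fit on 1/X is essentially perfect. A scatter plot would have suggested the transformation before any model was run.
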
Contradictory inferences can be derived from the same data

Choice of axes and scales can have a significant impact on the message your graphic conveys

[Figure: two charts with x-axes from 0 to 16; one y-axis runs from 0 to 160, the other from 80 to 150]

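A rough way to quantify the effect is to ask what share of the vertical axis the data actually occupies. The axis limits below are the ones on the two charts (0 to 160 and 80 to 150); the data range of 100 to 140 and the helper `share_of_axis` are assumptions for illustration.

```python
def share_of_axis(data_min, data_max, axis_min, axis_max):
    """Fraction of the y-axis height covered by the data's range."""
    return (data_max - data_min) / (axis_max - axis_min)

# Hypothetical series whose values run from 100 to 140.
data_min, data_max = 100, 140

full_axis = share_of_axis(data_min, data_max, 0, 160)      # axis starts at zero
cropped_axis = share_of_axis(data_min, data_max, 80, 150)  # axis cropped to the data

print(f"{full_axis:.0%} of the axis vs {cropped_axis:.0%}")
```

The same variation fills a quarter of the chart on the zero-based axis and more than half of it on the cropped axis, so the cropped version makes the movement look far more dramatic.
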
Correlation does not imply causality




Mistaking correlation for causality can lead to hilarious conclusions

Readers get turned off by overly complicated graphics

Anecdotal/insufficient data can lead to false conclusions

A model is just that: a model. It is not a substitute for reality

The Art of Data Analysis will be further illustrated
by means of a detailed Case Study relevant to your
                company/industry



    For a half-day workshop on The Art of Data Analysis
  (including a case study), contact Karthik Shashidhar at
               karthik.shashidhar@gmail.com





Contenu connexe

En vedette

Information needs and user studies
Information needs and user studiesInformation needs and user studies
Information needs and user studiesChihwei Liu
 
Scenario Analysis Use Case: 3G/4G Wireless Data
Scenario Analysis Use Case: 3G/4G Wireless DataScenario Analysis Use Case: 3G/4G Wireless Data
Scenario Analysis Use Case: 3G/4G Wireless DataAugust Jackson
 
內容分析法(Content Analysis)
內容分析法(Content Analysis)內容分析法(Content Analysis)
內容分析法(Content Analysis)Chihwei Liu
 
統計的力量-SPSS的25種方法實戰2014版-三星統計張偉豪20141119
統計的力量-SPSS的25種方法實戰2014版-三星統計張偉豪20141119統計的力量-SPSS的25種方法實戰2014版-三星統計張偉豪20141119
統計的力量-SPSS的25種方法實戰2014版-三星統計張偉豪20141119Beckett Hsieh
 
Sentiment Analysis Training Guide [Simplified Chinese]
Sentiment Analysis Training Guide [Simplified Chinese]Sentiment Analysis Training Guide [Simplified Chinese]
Sentiment Analysis Training Guide [Simplified Chinese]Massolutions
 
巨量資料分析輕鬆上手_教您玩大強子對撞機公開數據
巨量資料分析輕鬆上手_教您玩大強子對撞機公開數據巨量資料分析輕鬆上手_教您玩大強子對撞機公開數據
巨量資料分析輕鬆上手_教您玩大強子對撞機公開數據Yuan CHAO
 
暴走漫画数据挖掘从0到1
暴走漫画数据挖掘从0到1暴走漫画数据挖掘从0到1
暴走漫画数据挖掘从0到1Michael Ding
 
Critical discourse analysis and an application
Critical discourse analysis and an applicationCritical discourse analysis and an application
Critical discourse analysis and an applicationSuaad Zahawi
 
Analytical Thinking Training
Analytical Thinking TrainingAnalytical Thinking Training
Analytical Thinking TrainingM Furqan Aslam
 
Machine Learning Introduction
Machine Learning IntroductionMachine Learning Introduction
Machine Learning IntroductionMark Chang
 
客戶數據分析四大難題一次解決: IBM 數據分析解決方案
客戶數據分析四大難題一次解決:  IBM 數據分析解決方案客戶數據分析四大難題一次解決:  IBM 數據分析解決方案
客戶數據分析四大難題一次解決: IBM 數據分析解決方案Randy Lin
 
優化宅的日常-數據分析篇
優化宅的日常-數據分析篇優化宅的日常-數據分析篇
優化宅的日常-數據分析篇Wanju Wang
 
Stockflare 强大的股票筛选工具嘉维证券合作伙伴独享
Stockflare 强大的股票筛选工具嘉维证券合作伙伴独享Stockflare 强大的股票筛选工具嘉维证券合作伙伴独享
Stockflare 强大的股票筛选工具嘉维证券合作伙伴独享Shane Leonard, CFA
 
20161017 R語言資料分析實務 (2)
20161017 R語言資料分析實務 (2)20161017 R語言資料分析實務 (2)
20161017 R語言資料分析實務 (2)羅左欣
 
分析路上你我他/如何學習分析
分析路上你我他/如何學習分析分析路上你我他/如何學習分析
分析路上你我他/如何學習分析Wanju Wang
 
[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務台灣資料科學年會
 
初學R語言的60分鐘
初學R語言的60分鐘初學R語言的60分鐘
初學R語言的60分鐘Chen-Pan Liao
 
[SDX2016] 網站分析工作的領悟 / 鍾喬后 Isobar 安索帕 資料分析經理
[SDX2016] 網站分析工作的領悟 / 鍾喬后 Isobar 安索帕 資料分析經理[SDX2016] 網站分析工作的領悟 / 鍾喬后 Isobar 安索帕 資料分析經理
[SDX2016] 網站分析工作的領悟 / 鍾喬后 Isobar 安索帕 資料分析經理悠識學院
 
如何社群行銷?就是不銷而銷!
如何社群行銷?就是不銷而銷!如何社群行銷?就是不銷而銷!
如何社群行銷?就是不銷而銷!綠生活 GreenLife
 
打倒程式化購買術語
打倒程式化購買術語打倒程式化購買術語
打倒程式化購買術語NT150 Com
 

En vedette (20)

Information needs and user studies
Information needs and user studiesInformation needs and user studies
Information needs and user studies
 
Scenario Analysis Use Case: 3G/4G Wireless Data
Scenario Analysis Use Case: 3G/4G Wireless DataScenario Analysis Use Case: 3G/4G Wireless Data
Scenario Analysis Use Case: 3G/4G Wireless Data
 
內容分析法(Content Analysis)
內容分析法(Content Analysis)內容分析法(Content Analysis)
內容分析法(Content Analysis)
 
統計的力量-SPSS的25種方法實戰2014版-三星統計張偉豪20141119
統計的力量-SPSS的25種方法實戰2014版-三星統計張偉豪20141119統計的力量-SPSS的25種方法實戰2014版-三星統計張偉豪20141119
統計的力量-SPSS的25種方法實戰2014版-三星統計張偉豪20141119
 
Sentiment Analysis Training Guide [Simplified Chinese]
Sentiment Analysis Training Guide [Simplified Chinese]Sentiment Analysis Training Guide [Simplified Chinese]
Sentiment Analysis Training Guide [Simplified Chinese]
 
巨量資料分析輕鬆上手_教您玩大強子對撞機公開數據
巨量資料分析輕鬆上手_教您玩大強子對撞機公開數據巨量資料分析輕鬆上手_教您玩大強子對撞機公開數據
巨量資料分析輕鬆上手_教您玩大強子對撞機公開數據
 
暴走漫画数据挖掘从0到1
暴走漫画数据挖掘从0到1暴走漫画数据挖掘从0到1
暴走漫画数据挖掘从0到1
 
Critical discourse analysis and an application
Critical discourse analysis and an applicationCritical discourse analysis and an application
Critical discourse analysis and an application
 
Analytical Thinking Training
Analytical Thinking TrainingAnalytical Thinking Training
Analytical Thinking Training
 
Machine Learning Introduction
Machine Learning IntroductionMachine Learning Introduction
Machine Learning Introduction
 
客戶數據分析四大難題一次解決: IBM 數據分析解決方案
客戶數據分析四大難題一次解決:  IBM 數據分析解決方案客戶數據分析四大難題一次解決:  IBM 數據分析解決方案
客戶數據分析四大難題一次解決: IBM 數據分析解決方案
 
優化宅的日常-數據分析篇
優化宅的日常-數據分析篇優化宅的日常-數據分析篇
優化宅的日常-數據分析篇
 
Stockflare 强大的股票筛选工具嘉维证券合作伙伴独享
Stockflare 强大的股票筛选工具嘉维证券合作伙伴独享Stockflare 强大的股票筛选工具嘉维证券合作伙伴独享
Stockflare 强大的股票筛选工具嘉维证券合作伙伴独享
 
20161017 R語言資料分析實務 (2)
20161017 R語言資料分析實務 (2)20161017 R語言資料分析實務 (2)
20161017 R語言資料分析實務 (2)
 
分析路上你我他/如何學習分析
分析路上你我他/如何學習分析分析路上你我他/如何學習分析
分析路上你我他/如何學習分析
 
[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務[系列活動] 手把手教你R語言資料分析實務
[系列活動] 手把手教你R語言資料分析實務
 
初學R語言的60分鐘
初學R語言的60分鐘初學R語言的60分鐘
初學R語言的60分鐘
 
[SDX2016] 網站分析工作的領悟 / 鍾喬后 Isobar 安索帕 資料分析經理
[SDX2016] 網站分析工作的領悟 / 鍾喬后 Isobar 安索帕 資料分析經理[SDX2016] 網站分析工作的領悟 / 鍾喬后 Isobar 安索帕 資料分析經理
[SDX2016] 網站分析工作的領悟 / 鍾喬后 Isobar 安索帕 資料分析經理
 
如何社群行銷?就是不銷而銷!
如何社群行銷?就是不銷而銷!如何社群行銷?就是不銷而銷!
如何社群行銷?就是不銷而銷!
 
打倒程式化購買術語
打倒程式化購買術語打倒程式化購買術語
打倒程式化購買術語
 

The art of data analysis

  • 1. The Art Of Data Analysis. Karthik Shashidhar, Quant Consultant, karthik.shashidhar@gmail.com. © Karthik Shashidhar
  • 2. Introduction | Six-step process | Case Study | Common Pitfalls
  • 3. Why do you need this workshop?
      • We are moving to an increasingly data-driven world.
      • The ability to use data for day-to-day decision-making can prove to be a massive competitive advantage.
      • This workshop equips managers with basic tools for dealing with data.
  • 4. Who needs this workshop?
      • Sales Managers: What is the optimal level of sales commissions in order to maximize profitability?
      • Production Managers: How do we set daily production targets given probabilities of line shutdowns?
      • HR Managers: What are the factors that determine employee attrition?
      This workshop is suitable for personnel in middle to senior management roles across functions.
  • 5. Introduction | Six-step process | Case Study | Common Pitfalls
  • 6. A structured, iterative approach to data-driven decision making:
      • Frame a clear and concise problem statement
      • Break down your problem into smaller problems, and then use those to generate hypotheses
      • Gather, clean and prepare data
      • Test hypotheses. In the process, generate additional hypotheses
      • Consolidate results to solve the main problem
      • Make the data tell a story
  • 7. Introduction | Six-step process | Case Study | Common Pitfalls
  • 8. The Rs. 32 Poverty Line. Based on data from the 66th NSSO Survey, the Planning Commission fixed the “Poverty Line” at Rs. 32 per person per day for people living in urban areas. This has led to much controversy and protests. The Prime Minister has asked for your inputs. What do you recommend?
  • 9. For your reference:
      • Frame a clear and concise problem statement
      • Break down your problem into smaller problems, and then use those to generate hypotheses
      • Gather, clean and prepare data
      • Test hypotheses. In the process, generate additional hypotheses
      • Consolidate results to solve the main problem
      • Make the data tell a story
  • 10. How would you frame the problem statement for this one?
      • Your client may not have framed the question precisely. You need to do that job and frame a precise problem statement.
      • “Solving this problem” should tell you everything you want to know from your analysis.
      • Be concise, so that you remain focused on answering your question.
      • Frame your question such that it has an objective answer. Yes/No questions or questions with numerical answers are preferred.
  • 11. Has the poverty line been set too low at Rs. 32 per day?
      • This problem statement has an objective answer (yes/no).
      • The solution to this will be necessary and sufficient to answer the question our client (the PM) demands.
      • The question directly addresses the situation (people complaining that the poverty line has been set too low).
      • This problem statement is to the point and doesn’t take on additional responsibilities (such as defining an alternate poverty line).
  • 12. What problems do we need to solve in order to solve the main problem?
      • The set of “level two problems” must be precise and complete, in that:
          • The combined solutions of all level two problems lead to the solution of the main problem.
          • The solution of each level two problem directly impacts the main problem.
      • Once again, it is key to frame problems concisely and with objective answers.
      • We need not stop at two levels. Some level two problems might require the solution of deeper problems. Add them to the list of sub-problems.
  • 13. What do we need to know to answer “Has the poverty line been set too low at Rs. 32 per day?”
      • How is “poverty line” defined?
      • What are the implications of the poverty line?
      • What is the distribution of income in India?
      • Does the distribution of income vary across states? If it varies significantly, does it make sense to have a state-wise poverty line?
      • What are the essential goods that most people need?
      • For a given income level, what essential goods can a person afford?
  • 14. Problems generate sub-problems, and some of these will lead to hypotheses.
      • Hypothesis 1: There is a significant difference in income level across states.
      • Hypothesis 2: Essential goods are those that the poorest people consume. Also, their use flattens out as income goes up.
  • 15. Some problems, however, are direct, and don’t need hypotheses. Some are qualitative while others need data.
      • Question 1: How is “poverty line” defined?
          • The poverty line is the minimum income level that is deemed adequate.
          • If a family is “below poverty line” it qualifies for additional state benefits.
      • Question 2: What is the distribution of incomes in each state?
      • Question 3: Is there some kind of threshold for the proportion of the population that can be below the poverty line?
  • 16. What data do you need here?
      • It is important to frame the problem and break it down into components before listing data requirements, else the data could bias you.
      • Define data requirements in a general fashion, to allow you to easily use proxies.
      • Remember to gather data that both answers your questions and allows you to test your hypotheses.
  • 17. Once you’ve identified data requirements, identify sources and gather data. Here we need:
      • Distribution of a measure of income for India
      • Distribution of a measure of income for each state
      • Spending patterns for different income levels
      • Data on household sizes in different states
  • 18. Once you’ve identified data requirements, identify sources and gather data.
      • The National Sample Survey Organization (NSSO) conducts surveys every 5 years about income and expenditure, so we could perhaps use this.
      • However, income data gathered from surveys are notoriously poor in quality.
      • The poor have little savings, so their total consumption is a better indicator of income than the reported income data.
  • 19. Data cleaning is an ugly but important step
      • It is important to make sure names in data procured from different sources match.
          • For example, some government sites say “AndhraPradesh”, while others say “Andhra Pradesh”. A join on the name fails unless the two are reconciled.
      • If the data set is small, go through it once to check numbers for consistency. For example, if you have data on percentages, make sure they add up to 100%.
      • For larger data sets, try writing scripts to do basic cleaning.
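The cleaning checks on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names, the sample records and the 0.5-point tolerance are assumptions made for the example, not part of the workshop material.

```python
import re

def normalise_name(name: str) -> str:
    """Lower-case a name and keep only letters, so spelling variants compare equal."""
    return re.sub(r"[^a-z]", "", name.lower())

def percentages_consistent(values, tolerance=0.5):
    """Return True if the percentages add up to 100 within the given tolerance."""
    return abs(sum(values) - 100.0) <= tolerance

# Hypothetical records from two sources that spell state names differently.
population = {"AndhraPradesh": 1, "Karnataka": 2}
spending = {"Andhra Pradesh": 10, "karnataka ": 20}

pop_by_key = {normalise_name(k): v for k, v in population.items()}
spend_by_key = {normalise_name(k): v for k, v in spending.items()}

# Join on the normalised key; names present in only one source are kept for manual review.
joined = {k: (pop_by_key[k], spend_by_key[k]) for k in pop_by_key if k in spend_by_key}
unmatched = sorted(set(pop_by_key) ^ set(spend_by_key))
```

Stripping everything except letters is a deliberately blunt rule. It would also merge names that differ only in punctuation or digits, so the `unmatched` list and a quick look at the joined keys are still worth checking by eye.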
  • 20. Understand and prepare data before you dive into analysis
      • Get a general feel for the numbers before getting into the analysis.
      • Simple visualization techniques such as scatter plots and density plots help.
      • Use simple summary statistics (mean, median, SD, quartiles) to get a better feel for the data.
      • Check out what different functional forms of your data look like.
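As a rough sketch of the summary statistics named on this slide, Python's standard `statistics` module is enough. The `summarise` helper and the spending figures below are made up for illustration.

```python
import statistics

def summarise(values):
    """Return the mean, median, sample SD and quartiles of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three cut points: Q1, median, Q3
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "q1": q1,
        "q3": q3,
    }

# Hypothetical monthly spending figures (Rs.), not taken from any survey.
monthly_spend = [620, 700, 810, 880, 960, 1040, 1200, 1500, 2400]
summary = summarise(monthly_spend)
```

A mean well above the median, as in this made-up sample, is a quick hint that the distribution is skewed and that a plot is worth drawing before any model is fitted.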
  • 21. While testing hypotheses, be on the lookout for anything interesting/unusual
      • It is impossible to generate all possible hypotheses before you begin the analysis.
      • Usually, as you test out some hypotheses, something in the data will stand out which will lead to further hypotheses.
      • It is ok to generate these hypotheses, which is what makes it an iterative process.
      • However, one needs to be careful not to stray from the original objective: each new hypothesis should tie in directly to the original question.
  • 22. Consolidate results
      • Build up your case in a bottom-up manner.
      • Sometimes different pieces of analysis can throw up contradictory inferences. Check, and reconcile before you integrate.
      • Make sure all the components of the solution that you require are available.
      • Don’t include a result in the final analysis unless it makes a definite contribution to the final solution.
  • 23. Use graphics intelligently!
      • A picture is worth a thousand words, so use clear and easy-to-use visualizations to communicate your findings.
      • Use visualizations that make the solution self-evident, rather than something that requires a lot of explanation.
      • Use your graphics to communicate, not to confuse. If the intent of a graphic is to confuse, it is better to leave out that graphic.
      • Sometimes all it takes to solve the problem is to visualize the data from a different perspective!
  • 24. This graphic shows the decile in which Rs. 32 per day (Rs. 960 per month) would fall in each state.
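The graphic itself is not reproduced in this transcript, but the calculation behind such a chart can be sketched as follows. The monthly figure is Rs. 32 × 30 days = Rs. 960. The `decile_of` helper and the sample observations are hypothetical, and the real per-state consumption data are not shown here.

```python
from bisect import bisect_left

def decile_of(value, observations):
    """Return the decile (1 to 10) of the observations in which value falls."""
    ordered = sorted(observations)
    below = bisect_left(ordered, value)          # observations strictly below value
    return min(below * 10 // len(ordered) + 1, 10)

DAILY_LINE = 32                   # Rs. per person per day, from the case study
monthly_line = DAILY_LINE * 30    # Rs. 960 per month

# Made-up monthly consumption figures for one state, for illustration only.
sample_state = list(range(100, 2100, 100))   # 100, 200, ..., 2000
line_decile = decile_of(monthly_line, sample_state)
```

Running the same function once per state, each with its own list of observations, would give the one-number-per-state summary that the slide describes.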
  • 25. Introduction | Six-step process | Case Study | Common Pitfalls
  • 26. Data-driven inference is fraught with pitfalls. Drawing the wrong conclusion out of data is easier than drawing the right conclusion.
      • Correlation does not imply causality
      • Beware of anecdotal evidence
      • Beware of outliers
      • Don’t overfit models
      • Contradictory inferences from the same data
      • Start with getting a feel for the data
      • Don’t simply throw everything into the mix
      • Don’t over-complicate graphics
      • Graphics can deceive
      • Models can misbehave
  • 27. Outliers can significantly distort inferences
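A small numerical illustration of this pitfall, with made-up income figures: a single extreme value moves the mean a long way while the median barely changes.

```python
import statistics

# Hypothetical monthly incomes (Rs.); the last value added is a deliberate outlier.
incomes = [900, 950, 1000, 1050, 1100]
with_outlier = incomes + [50000]

mean_before = statistics.mean(incomes)           # centre of the clean sample
mean_after = statistics.mean(with_outlier)       # pulled sharply upward by one value
median_after = statistics.median(with_outlier)   # stays close to the bulk of the data
```

This is one reason to look at both statistics, and at a plot, before deciding whether an extreme point is an error or a genuine observation.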
  • 28. “Throwing everything into the mix” may not always produce an accurate model
  • 29. It could lead to multicollinearity, for example. According to this regression, the tallest person should have an extremely large right foot and a tiny left foot! That makes no sense!
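One simple screen for this problem is to check how strongly the candidate predictors are correlated with each other before putting them all into a regression. The sketch below uses invented foot lengths, and the 0.95 cut-off is an illustrative assumption rather than a standard rule.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up foot lengths (cm) for five people; left and right feet are nearly identical.
left_foot = [24.0, 25.1, 26.0, 27.2, 28.1]
right_foot = [24.1, 25.0, 26.1, 27.0, 28.2]

r = pearson(left_foot, right_foot)
nearly_collinear = abs(r) > 0.95   # assumed threshold for flagging the pair
```

When two predictors carry almost the same information, as here, dropping one of them (or averaging the two) usually gives coefficients that are easier to interpret.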
  • 30. Over-fitting can lead to spurious models. It helps to keep your models as simple as possible. A simple rule of thumb: a good model is one that can be easily explained in simple English.
  • 31. Diving into model fitting without first understanding the data can lead to suboptimal results. People are prone to doing regressions without actually looking at the data. Here, a simple linear regression gives a reasonable fit (R^2 = 42%). However, a simple scatter plot would suggest a clear Y = 1/X kind of relationship which the regression completely misses out on.
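The effect can be reproduced with synthetic data. The sketch below generates points that follow Y = 1/X exactly and compares the R^2 of a straight-line fit on X with that of a fit on the transformed variable 1/X. The data are invented and are not the data behind the slide's 42% figure. For a one-variable least-squares line, R^2 equals the squared Pearson correlation, which is what `r_squared` computes.

```python
def r_squared(xs, ys):
    """R^2 of a simple least-squares line of ys on xs (the squared correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

# Synthetic data following Y = 1/X exactly.
xs = [float(x) for x in range(1, 11)]
ys = [1.0 / x for x in xs]

r2_linear = r_squared(xs, ys)                        # straight line fitted on X
r2_inverse = r_squared([1.0 / x for x in xs], ys)    # straight line fitted on 1/X
```

The straight-line fit on X looks passable on its R^2 alone, while the fit on 1/X is essentially perfect, which is the kind of gap a scatter plot would have revealed immediately.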
  • 32. Contradictory inferences can be derived from the same data
  • 33. Choice of axes and scales can have a significant impact on the message your graphic conveys
  • 34. Correlation does not imply causality
  • 35. Mistaking correlation for causality can lead to hilarious conclusions
  • 36. Readers get turned off by overly complicated graphics
  • 37. Anecdotal/insufficient data can lead to false conclusions
  • 38. A model is just that: a model. It is not a substitute for reality
  • 39. The Art of Data Analysis will be further illustrated by means of a detailed Case Study relevant to your company/industry. For a half-day workshop on The Art of Data Analysis (including a case study), contact Karthik Shashidhar at karthik.shashidhar@gmail.com