SlideShare une entreprise Scribd logo
1  sur  38
Télécharger pour lire hors ligne
Data Analysis Course
Preparing data for analysis(Version-1)
Venkat Reddy
Data Analysis Course
• Data analysis design document
• Introduction to statistical data analysis
• Descriptive statistics
•




                                                                Venkat Reddy
                                                          Data Analysis Course
•   Probability distributions examples and applications
•   Simple correlation and regression analysis
•   Multiple liner regression analysis
•   Logistic regression analysis
•   Testing of hypothesis
•   Clustering and decision trees
•   Time series analysis and forecasting
•   Credit Risk Model building-1
                                                                 2
•   Credit Risk Model building-2
Note
• This presentation is just class notes. The course notes for Data
  Analysis Training is by written by me, as an aid for myself.
• The best way to treat this is as a high-level summary; the
  actual session went more in depth and contained other




                                                                           Venkat Reddy
                                                                     Data Analysis Course
  information.
• Most of this material was written as informal notes, not
  intended for publication
• Please send questions/comments/corrections to
  venkat@trenwiseanalytics.com or 21.venkat@gmail.com
• Please check my website for latest version of this document
                                         -Venkat Reddy                      3
Contents
•   Need of data exploration
•   Data Exploration
•   Data Validation
•   Data Sanitization




                                                           Venkat Reddy
                                                     Data Analysis Course
    • Missing Value Treatment
    • Outlier Treatment Identification & Treatment




                                                            4
What is the need of data sanitization
•   Download age vs weight data from here
•   Fit a simple liner regression line
•   As age increases what happens to weight?
•   Is the inference accurate?




                                                     Venkat Reddy
                                               Data Analysis Course
                                                      5
Remember…
Data in the real world is dirty
Incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data. e.g., occupation=“ ”
       • Incomplete data may come from
          • “Not applicable” data value when collected.
          • Different considerations between the time when the data was collected and when it




                                                                                                      Venkat Reddy
                                                                                                Data Analysis Course
            is analyzed.
          • Human/hardware/software problems

Noisy: Containing errors or outliers. e.g., Salary=“-10”
       • Noisy data (incorrect values) may come from
          • Faulty data collection instruments
          • Human or computer error at data entry
          • Errors in data transmission

Inconsistent: containing discrepancies in codes or names
       • e.g., Age=“42” Birthday=“03/07/1997”, Was rating “1,2,3”, now rating “A, B,
         C”,e.g., discrepancy between duplicate records
       • Inconsistent data may come from                                                               6
          • Different data sources
          • Functional dependency violation (e.g., modify some linked data)
Lab: About the data, background and Business
       Objective:
 • Banks play a crucial role in market economies. They decide who can get finance and on what terms and
   can make or break investment decisions. For markets and society to function, individuals and companies
   need access to credit.
 • Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to
   determine whether or not a loan should be granted. This competition requires participants to improve
   on the state of the art in credit scoring, by predicting the probability that somebody will experience
   financial distress in the next two years. The objective is to build a model that borrowers can use to help
   make the best financial decisions.




                                                                                                                                                   Venkat Reddy
                                                                                                                                             Data Analysis Course
 • Historical data are provided on 250,000 borrowers.
Variable Name                          Description                                                                              Type
SeriousDlqin2yrs                       Person experienced 90 days past due delinquency or worse                                 Y/N

                                       Total balance on credit cards and personal lines of credit except real estate and no
RevolvingUtilizationOfUnsecuredLines                                                                                            percentage
                                       installment debt like car loans divided by the sum of credit limits

age                                    Age of borrower in years                                                                 integer
NumberOfTime30-                        Number of times borrower has been 30-59 days past due but no worse in the last 2
                                                                                                                                integer
59DaysPastDueNotWorse                  years.
DebtRatio                              Monthly debt payments, alimony,living costs divided by monthy gross income               percentage
MonthlyIncome                          Monthly income                                                                           real
                                       Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g.
NumberOfOpenCreditLinesAndLoans                                                                                                 integer
                                       credit cards)
NumberOfTimes90DaysLate                Number of times borrower has been 90 days or more past due.                              integer
NumberRealEstateLoansOrLines           Number of mortgage and real estate loans including home equity lines of credit           integer
                                                                                                                                                    7
NumberOfTime60-                        Number of times borrower has been 60-89 days past due but no worse in the last 2
                                                                                                                                integer
89DaysPastDueNotWorse                  years.
NumberOfDependents                     Number of dependents in family excluding themselves (spouse, children etc.)              integer
Basic contents of the data
   •   What are total number of observations
   •   What are total number of fields
   •   Each field name, Field type, Length of field
   •   Format of field, Label




                                                                                     Venkat Reddy
                                                                               Data Analysis Course
Basic Contents –Check points
• Are all variables as expected (variables names)
• Are there some variables which are unexpected say q9 r10?
• Are the data types and length across variables correct
• For known variables is the data type as expected (For example if age
  is in date format something is suspicious)
• Have labels been provided and are sensible
                                                                                      8
If anything suspicious we can further investigate it and correct accordingly
Proc Contents-SAS
• SAS code : proc contents data=<<data name>>; run;
• Useful options :
   • Short – Outputs the list of variables in a row by row format.
         Code : proc contents data=test short; run;




                                                                             Venkat Reddy
                                                                       Data Analysis Course
   • Out=filename - Creates a data set wherein each observation is a
      variable




                                                                              9
Snapshot of the data
Data Snapshot, if possible
• Printing the first few observations all fields in the data set .It helps in
  better understanding of the variable by looking at it’s assigned values.
Checkpoints for data snapshot output:
   1.   Do we have any unique identifier?          Is the unique identifier getting




                                                                                                    Venkat Reddy
                                                                                              Data Analysis Course
        repeated in different records?
   2.   Do the text variables have meaningful data?(If                text variables have
        absurd data as ‘&^%*HF’ then either the variable is meaningless or the variable has
        become corrupt or wasn’t properly created.)
   3.   Are there some coded values in the data?(if for a known variable say
        State we have category codes like 1-52 then we need definition of how they are
        coded.)
   4.   Do all the variables appear to have data? (In case variables are not
        populated with non missing meaningful value it would show in print. We can further        10
        investigates using means statistics.)
Proc print in SAS
SAS code : proc print data=<<data set>>; run;
• Useful options :
  proc print data=<<data set>> label noobs heading=vertical;
   var <<variable-list>>; by var1; run;




                                                                                                 Venkat Reddy
                                                                                           Data Analysis Course
   • Label:The label option uses variable labels as column headings rather than
      variable names (the default).
   • Obs : Restricts the number of observations in the output
   • Nobs: It omits the OBS column of output.
   • Heading=vertical: It prints the column headings vertically. This is useful when the
      names are long but the values of the variable are short.
   • Var: Specifies the variables to be listed and the order in which they will appear.
   • By: By statement produces output grouped by values of the mentioned variables
                                                                                               11
Lab: Data exploration & validation
• Import Data_explore.txt into sas
• What are basic contents of the data
• Verify the check list
• Any suspicious variables?




                                                          Venkat Reddy
                                                    Data Analysis Course
• What is var1?
• Are all the variable names correct?
• Print the first 10 observations
  •   Do we have any unique identifier?
  •   Do the text variables have meaningful data?
  •   Are there some coded values in the data?
                                                        12
  •   Do all the variables appear to have data
Categorical field frequencies
• Calculate frequency counts cross-tabulation frequencies for
  Especially for categorical, discrete & class fields
• Frequencies
     • help us understanding the variable by looking at the values it’s taking
       and data count at each value.




                                                                                       Venkat Reddy
                                                                                 Data Analysis Course
     • They also helps us in analyzing the relationships between variables
       by looking at the cross tab frequencies or by looking at association
Checkpoints for looking frequency table
1.    Are values as expected?
2.    Variable understanding : Distinct values of a particular variable,
      missing percentages
3.    Are there any extreme values or outliers?
4.    Any possibility of creating a new variable having small number of              13
      distinct category by clubbing certain categories with others.
Proc Freq in SAS
•   SAS code: Proc FREQ data =<dataset > <options> ;
             TABLES requests < / options > ; // Gives Frequency Count or Cross Tab
             BY <varl> ;                       // Grouping output based on varl
             WEIGHT variable < / option > ; //Specifying Weight (if applicable)
             OUTPUT < OUT=SAS-data-set > options ; //Output results to another data
            set         run;




                                                                                             Venkat Reddy
                                                                                       Data Analysis Course
•   Useful options :
    •   Order=Freq - sorts by descending frequency count (default is the unformatted
        value). Ex: proc freq data=test order=freq; tables X1-X5; run;
    •   Nocol/Norow/Nopercent - suppresses printing of column, row and cell
        percentages respectively of a cross tab. Ex : proc freq data=test; tables
        AGE*bad/nocol norow nopercent missing; run;
    •   Missing- interprets missing values as non-missing and includes them in % and
        statistics calculations   ex : proc freq data=test; tables CHANNEL* BAD
        /missing; run;
    •   Chisq - performs several chi-square tests. Ex: proc freq data=test; tables         14
        channel*bad/chisq; run;
Lab: Frequencies
•   Find the frequencies of all class variables in the data
•   Are there any variables with missing values?
•   Are there any default values?
•   Can you identify the variables with outliers?




                                                                    Venkat Reddy
                                                              Data Analysis Course
                                                                  15
Descriptive Statistics for continuous
fields
• Distribution of numeric variables by calculating
  •   N – Count of non missing observations
  •   Nmiss – Count of Missing observations
  •   Min, Max, Median, Mean




                                                                          Venkat Reddy
                                                                    Data Analysis Course
  •   Quartile numbers & percentiles– P1, p5,p10,q1(p25),q3(p75),
      p90,p99
  •   Stddev
  •   Var
  •   Skewness
  •   Kurtosis


                                                                        16
Descriptive Statistics Checkpoints
• Are variable distribution as expected.
• What is the central tendency of the variable? Mean, Median and
  Mode across each variable
• Is the concentration of variables as expected ? What are quartiles?
• Indicates variables which are unary I.e stddev=0 ; the variables




                                                                               Venkat Reddy
                                                                         Data Analysis Course
  which are useless for the current objective.
• Are there any outliers / extreme values for the variable?
• Are outlier values as expected or they have abnormally high values -
  for ex for Age if max and p99 values are 10000. Then should
  investigate if it’s the default value or there is some error in data
• What is the % of missing value associated with the variable?

                                                                             17
Proc Means in SAS
• Proc means data=<data set> < options>;
                Var <variable list >;
                Run;
• If variable list is not mentioned it gives results across all numeric
  variables




                                                                                 Venkat Reddy
                                                                           Data Analysis Course
• If options are not specified by default it gives stats like – n , min,
  max, mean and stddev.
• Useful Options :
   • By : Calculates statistics based on grouping across specified
     variable;
   Proc means data=check n nmiss min max;
   var age ;
   class channel;
                                                                               18
   run;
Proc Univariate in SAS
• SAS Code :    PROC UNIVARIATE data=<dataset>;
                 VAR variable(s); run;
• Useful options :
   •   PROC UNIVARIATE data=<dataset> plot normal;
       HISTOGRAM <variable(s)> </ option(s)>;




                                                                                           Venkat Reddy
                                                                                     Data Analysis Course
       By variable;
       VAR variable(s); run;
        •   Normal option produces the tests of normality ;
        •   Plot option produces the 3 plots of data(stem and leaf plot, box plot,
            normal probability plot
        •   By option is used for giving outputs separated by categories
        •   Histogram option gives the distribution of variable in a histogram


                                                                                         19
General Checks
• Mean=Median?
• Counted proportion data. If data consists of counted proportions,
  e.g. number of individuals responding out of total number of
  individuals,
• Polytomous data. If data consists of numbers falling into a number




                                                                                              Venkat Reddy
                                                                                        Data Analysis Course
  of mutually exclusive classes, do not reduce to proportions or
  percentages beforehand, but enter the integer counts
• Data Sufficiency: Data Sufficiency involves ensuring that the data
  has the required attributes to make the prediction as stated by
  objective
  • Eg1: To build a model to predict fraud, the given data doesn’t have any key
    for identifying fraud accounts or those identifiers are erased than there are
    no accounts which we can identify as ‘bad’ and build a model to predict the
    same.
  • Eg2: If we are building a response model specifically for internet channel for
    a “airline card”. Then data should have a identifier for ‘channel of acquisition’       20
    to identify the right data base on which to build the model
Lab: Data exploration & validation
•   Find N, Average, sd, minimum & maximum
•   Is N same for all the variables?
•   Any variables with unusual min & max?
•   Identify list of suspicious variables
•   Find below statistics for all the doubtful variables
    •




                                                                            Venkat Reddy
                                                                      Data Analysis Course
        N,Mean,Median,Mode
    •   Std Deviation
    •   Skewness
    •   Variance
    •   Kurtosis
    •   Interquartile Range
    •   Quantiles 100% Max, 99%,,95%,,90%, 75% Q3,50% Median,25%
        Q1,10%,5%,1%,0% Min
• See the variable definitions and possible values
• Identify variables with missing values, default values & outliers       21
Now what…?
• Some variables contain outliers
• Some variables have default values
• Some variables have missing values




                                                               Venkat Reddy
                                                         Data Analysis Course
• Shall we delete them and go ahead with our analysis?




                                                             22
Missing Values
• Data is not always available E.g., many tuples have no recorded
  value for several attributes, such as customer income in sales
  data
• Missing data may be due to
   • equipment malfunction




                                                                             Venkat Reddy
                                                                       Data Analysis Course
   • inconsistent with other recorded data and thus deleted
   • data not entered due to misunderstanding
   • certain data may not be considered important at the time of
     entry
   • not register history or changes of the data
   • Missing data may need to be inferred.
• Missing data - values, attributes, entire records, entire sections       23
• Missing values and defaults are indistinguishable
Missing Value & outlier Treatment

 Missing      No. of      % of
                                                     Treatment
Treatment   categories   Missing
                                      Impute with similar/closest bad rate group


                          <=50%       If bad rate is very different from all other
                                      categories, then assign a special value and




                                                                                             Venkat Reddy
                                                                                       Data Analysis Course
                                          include in the regression analysis
             <=25

                          >50%     Create a dummy / indicator variable with missing
                                            (as 1) vs. non missing category

 Variable
                          <=10%          Impute with the mean / median value

                                   Assign a special value to the missing category in
                         >10% &      the original variable and create an indicator
              >25        <=50%     variable with missing value as 1 and others as 0
                                                                                           24
                                   Create a dummy / indicator variable with missing
                         >50%
                                              value as 1 and others as 0
Missing Value Imputation1
• Missing Value Imputation – 1:Standalone imputation
  • Mean, median, other point estimates
  • Assume: Distribution of the missing values is the same as the non-
    missing values.
  • Does not take into account inter-relationships




                                                                               Venkat Reddy
                                                                         Data Analysis Course
  • Introduces bias
  • Convenient, easy to implement




                                                                             25
Missing Value Imputation2
Missing Value Imputation - 2
•   Better imputation - use attribute relationships
•   Assume : all prior attributes are populated
•   That is, monotonicity in missing values.     X1       X2   X3     X4    X5
•   Two techniques                               11       20   6.5    4      .




                                                                                       Venkat Reddy
                                                                                 Data Analysis Course
                                                12.1      18    8     2      .
    • Regression (parametric),                     11.9   22   4.2      .    .
                                                   10.9   15      .     .    .
    • Propensity score (nonparametric)
• Regression method
    • Use linear regression, sweep left-to-right
        X3=a+b*X2+c*X1;
        X4=d+e*X3+f*X2+g*X1, and so on
    • X3 in the second equation is estimated from the first equation if it
      is missing                                                                     26
Missing Value Imputation3
• Propensity Scores (nonparametric)
  • Let Yj=1 if Xj is missing, 0 otherwise
  • Estimate P(Yj =1) based on X1 through X(j-1) using logistic regression
  • Group by propensity score P(Yj =1)
  • Within each group, estimate missing Xjs from known Xjs using
    approximate Bayesian bootstrap.




                                                                                   Venkat Reddy
                                                                             Data Analysis Course
  • Repeat until all attributes are populated.
• Arbitrary missing pattern
  • Markov Chain Monte Carlo (MCMC)
  • Assume data is multivariate Normal, with parameter Q
  • (1) Simulate missing X, given Q estimated from observed X ; (2) Re-
    compute Q using filled in X
  • Repeat until stable.
  • Expensive: Used most often to induce monotonicity
Note that imputed values are useful in aggregates but can’t be trusted           27
individually
Default Values Treatment
• Special or default values are values like 999 or 999999 which fall
  outside the normal range of data .

• For instance a no. of bankcards variable usually has values from 0 to
  100 but 999 values in the data represent the population which does
  not have any tradelines. Including them in the regression as 999




                                                                                Venkat Reddy
                                                                          Data Analysis Course
  would skew the regression results, hence we need to treat them
  accordingly.

• Special or default values also should be treated as we are treating
  the missing values depending upon the no. of categories and the %
  of default value.

• They can also be taken care of by capping or flooring them to a
  realistic value / where the trend is being maintained ( especially if
  the % of default value is very less).                                       28
Default Values Treatment
 Default
              No. of     % of Default
  Value                                                 Treatment
            categories      Value
Treatment
                                          Club with similar/closest bad rate group
                           <=3%




                                                                                             Venkat Reddy
                                                                                       Data Analysis Course
             <=25        >3% & <50%     Keep as separate category use this as class


                           >=50%         Create a binary variable and use the binary
                                          in class along with the original variable
Variable
                           <=3%                 Substitute with correct group

                                           If x=-999 then do; flg_x=1;x=0; end;
              >25        >3% & <50%
                                                      Else flg_x=0;
                                                                                           29
                           >=50%         Create a binary variable and use the binary
                                               in class, drop original variable
Lab: Missing values & Default values
 • In the sample data, what all fields need missing value
   treatment?
 • Conduct the missing value treatment
 • Identify fields with outlier, conduct outlier treatment




                                                                   Venkat Reddy
                                                             Data Analysis Course
                                                                 30
Outlier Treatment
• Flooring
• Capping
• Treat as a separate segment




                                      Venkat Reddy
                                Data Analysis Course
                                    31
Lab: Data Cleaning step by step
• Var-1: Change the variable name to sr_no
• SeriousDlqin2yrs : Only training data has data objective variable test
  data doesn’t have objective variable in it. Subset training data & test
  from overall data
• Age: If age <21 make it 21




                                                                                  Venkat Reddy
                                                                            Data Analysis Course
                                                                                32
Lab : Data Cleaning step by step
NumberOfTime30_59DaysPastDueNotW
• Find bad rate in each category of this variable
• Replace 96 with _____? Replace 98 with_____?




                Missing      No. of




                                                                                                                        Venkat Reddy
                                                                                                                  Data Analysis Course
                                        % of Missing                            Treatment
               Treatment   categories
                                                        Impute with similar/closest bad rate
                                                                      group

                                          <=50%        If bad rate is very different from all other categories,
                                                           then assign a special value and include in the
                                                                         regression analysis
                           <=25
                                            >50%       Create a dummy / indicator variable with missing (as
                                                                   1) vs. non missing category
               30_59Days
                PastDue                    <=10%               Impute with the mean / median value


                                                       Assign a special value to the missing category in the
                                         >10% &         original variable and create an indicator variable
                             >25         <=50%               with missing value as 1 and others as 0                  33
                                                         Create a dummy / indicator variable with missing
                                           >50%
                                                                    value as 1 and others as 0
Lab :Data Cleaning step by step
• RevolvingUtilizationOfUnsecuredL: What type of variable is this? What are
  the possible values?
• Replace anything more than 1 with_____?



              Missing        No. of




                                                                                                                   Venkat Reddy
                                                                                                             Data Analysis Course
                                        % of Missing                         Treatment
             Treatment     categories

                                                          Impute with similar/closest bad rate group


                                           <=50%           If bad rate is very different from all other
                                                       categories, then assign a special value and include
                                                                    in the regression analysis
                            <=25

                                           >50%        Create a dummy / indicator variable with missing
                                                                 (as 1) vs. non missing category
             Utilization
                 pct                       <=10%             Impute with the mean / median value


                                                       Assign a special value to the missing category in
                                          >10% &         the original variable and create an indicator
                             >25          <=50%        variable with missing value as 1 and others as 0          34
                                                       Create a dummy / indicator variable with missing
                                           >50%
                                                                  value as 1 and others as 0
Lab :Data Cleaning step by step
• Monthly Income



         Missing      No. of
                                 % of Missing                         Treatment
        Treatment   categories




                                                                                                            Venkat Reddy
                                                                                                      Data Analysis Course
                                                   Impute with similar/closest bad rate group


                                    <=50%           If bad rate is very different from all other
                                                categories, then assign a special value and include
                                                             in the regression analysis
                     <=25

                                    >50%        Create a dummy / indicator variable with missing
                                                          (as 1) vs. non missing category
         Monthly
         Income                     <=10%             Impute with the mean / median value


                                                Assign a special value to the missing category in
                                   >10% &         the original variable and create an indicator
                      >25          <=50%        variable with missing value as 1 and others as 0
                                                                                                          35
                                                Create a dummy / indicator variable with missing
                                    >50%
                                                           value as 1 and others as 0
Lab :Data Cleaning step by step
• Debt Ratio: Similar Imputation
• NumberOfOpenCreditLinesAndLoans : No clear evidence
• NumberOfTimes90DaysLate: Imputation similar to
  NumberOfTime30_59DaysPastDueNotW




                                                               Venkat Reddy
                                                         Data Analysis Course
• NumberRealEstateLoansOrLines: : No clear evidence
• NumberOfTime60_89DaysPastDueNotW: Imputation similar
  to NumberOfTime30_59DaysPastDueNotW
• NumberOfDependents: Impute with equal bad rate



                                                             36
Variables & Treatment
Old Var                            Type   Treatment                   New Var
VAR1                               Num    Nothing
SeriousDlqin2yrs                   Num    Nothing
RevolvingUtilizationOfUnsecuredL   Num    Impute with the mean        Util

age                                Num    flooring                    age1
NumberOfTime30_59DaysPastDueNotW   Num    Impute with the mean        NumberOfTime30_59DaysPa




                                                                                                      Venkat Reddy
                                                                                                Data Analysis Course
                                                                      stDue1
DebtRatio                          Num    Impute with the median      DebtRatio1
MonthlyIncome                      Char   Convert to num & create a   ind_MonthlyIncome,
                                          dummy var                   MonthlyIncome1
NumberOfOpenCreditLinesAndLoans    Num    Impute with median          num_open_lines

NumberOfTimes90DaysLate            Num    Imputing & capping          delq_90
NumberRealEstateLoansOrLines       Num    Capping                     num_loans

NumberOfTime60_89DaysPastDueNotW   Num    Capping & Imputing          delq_60to89

NumberOfDependents                 Char
                                                                                                    37
obs_type                           Char   Subset training data        Obs_type
Venkat Reddy Konasani
Manager at Trendwise Analytics
venkat@TrendwiseAnalytics.com
21.venkat@gmail.com




                                          Venkat Reddy
                                    Data Analysis Course
www.trendwiseanalytics.com/venkat
+91 9886 768879




                                        38

Contenu connexe

Tendances

Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining Sulman Ahmed
 
Data Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisData Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisEva Durall
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning Gopal Sakarkar
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data miningDataminingTools Inc
 
Data Analysis and Statistics
Data Analysis and StatisticsData Analysis and Statistics
Data Analysis and StatisticsT.S. Lim
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision treeKrish_ver2
 
Outlier detection method introduction
Outlier detection method introductionOutlier detection method introduction
Outlier detection method introductionDaeJin Kim
 
Capter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberCapter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberHouw Liong The
 
Linear regression
Linear regressionLinear regression
Linear regressionMartinHogg9
 
Exploratory data analysis
Exploratory data analysis Exploratory data analysis
Exploratory data analysis Peter Reimann
 
Outlier analysis and anomaly detection
Outlier analysis and anomaly detectionOutlier analysis and anomaly detection
Outlier analysis and anomaly detectionShantanuDeosthale
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data MiningDHIVYADEVAKI
 
Introduction to Statistical Machine Learning
Introduction to Statistical Machine LearningIntroduction to Statistical Machine Learning
Introduction to Statistical Machine Learningmahutte
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPTANUSUYA T K
 
DATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptxDATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptxAbdullahAbbasi55
 

Tendances (20)

Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
 
Data Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisData Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data Analysis
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Probability Theory for Data Scientists
Probability Theory for Data ScientistsProbability Theory for Data Scientists
Probability Theory for Data Scientists
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data mining
 
Machine Learning (Classification Models)
Machine Learning (Classification Models)Machine Learning (Classification Models)
Machine Learning (Classification Models)
 
Data Analysis and Statistics
Data Analysis and StatisticsData Analysis and Statistics
Data Analysis and Statistics
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
 
Outlier detection method introduction
Outlier detection method introductionOutlier detection method introduction
Outlier detection method introduction
 
Classification
ClassificationClassification
Classification
 
supervised learning
supervised learningsupervised learning
supervised learning
 
Capter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberCapter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & Kamber
 
Linear regression
Linear regressionLinear regression
Linear regression
 
Exploratory data analysis
Exploratory data analysis Exploratory data analysis
Exploratory data analysis
 
Outlier analysis and anomaly detection
Outlier analysis and anomaly detectionOutlier analysis and anomaly detection
Outlier analysis and anomaly detection
 
Data Mining: Outlier analysis
Data Mining: Outlier analysisData Mining: Outlier analysis
Data Mining: Outlier analysis
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
Introduction to Statistical Machine Learning
Introduction to Statistical Machine LearningIntroduction to Statistical Machine Learning
Introduction to Statistical Machine Learning
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPT
 
DATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptxDATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptx
 

Similaire à Data Analysis Course Guide

Model building in credit card and loan approval
Model building in credit card and loan approval Model building in credit card and loan approval
Model building in credit card and loan approval Venkata Reddy Konasani
 
Classification & Clustering.pptx
Classification & Clustering.pptxClassification & Clustering.pptx
Classification & Clustering.pptxImXaib
 
Credit Evaluating system.pptx
Credit Evaluating system.pptxCredit Evaluating system.pptx
Credit Evaluating system.pptxetebarkhmichale
 
111-Evaluating_the_Evaluation.pptx
111-Evaluating_the_Evaluation.pptx111-Evaluating_the_Evaluation.pptx
111-Evaluating_the_Evaluation.pptxetebarkhmichale
 
The Credit Crisis and beyond: Lessons from the Financial Crisis & Sound Pract...
The Credit Crisis and beyond: Lessons from the Financial Crisis & Sound Pract...The Credit Crisis and beyond: Lessons from the Financial Crisis & Sound Pract...
The Credit Crisis and beyond: Lessons from the Financial Crisis & Sound Pract...Markus Krebsz
 
Baker Hill Prosper 2017 - CECL – Now That the Dust is Settling, What is Going...
Baker Hill Prosper 2017 - CECL – Now That the Dust is Settling, What is Going...Baker Hill Prosper 2017 - CECL – Now That the Dust is Settling, What is Going...
Baker Hill Prosper 2017 - CECL – Now That the Dust is Settling, What is Going...Baker Hill
 
Meet 1 - Introduction Data Mining - Dedi Darwis.pdf
Meet 1 - Introduction Data Mining - Dedi Darwis.pdfMeet 1 - Introduction Data Mining - Dedi Darwis.pdf
Meet 1 - Introduction Data Mining - Dedi Darwis.pdf09372002dedi
 
Statistical review of mortgage credit risk variables ica 2010
Statistical review of mortgage credit risk variables ica 2010Statistical review of mortgage credit risk variables ica 2010
Statistical review of mortgage credit risk variables ica 2010kylemrotek
 
chương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdfchương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdfphongnguyen312110237
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data miningHadi Fadlallah
 
Data architecture around risk management
Data architecture around risk managementData architecture around risk management
Data architecture around risk managementSuvradeep Rudra
 
Solving the Credit Union 'Tower of Babel' (Conference Session Slides)
Solving the Credit Union 'Tower of Babel' (Conference Session Slides)Solving the Credit Union 'Tower of Babel' (Conference Session Slides)
Solving the Credit Union 'Tower of Babel' (Conference Session Slides)NAFCU Services Corporation
 
Behavior Driven Development - WPC 2011
Behavior Driven Development - WPC 2011Behavior Driven Development - WPC 2011
Behavior Driven Development - WPC 2011Fabio Armani
 
CFO Half-Day Conference
CFO Half-Day ConferenceCFO Half-Day Conference
CFO Half-Day Conferencegppcpa
 
A picture is worth a thousand words
A picture is worth a thousand wordsA picture is worth a thousand words
A picture is worth a thousand wordsMasum Billah
 

Similaire à Data Analysis Course Guide (20)

Model building in credit card and loan approval
Model building in credit card and loan approval Model building in credit card and loan approval
Model building in credit card and loan approval
 
Classification & Clustering.pptx
Classification & Clustering.pptxClassification & Clustering.pptx
Classification & Clustering.pptx
 
Credit Evaluating system.pptx
Credit Evaluating system.pptxCredit Evaluating system.pptx
Credit Evaluating system.pptx
 
111-Evaluating_the_Evaluation.pptx
111-Evaluating_the_Evaluation.pptx111-Evaluating_the_Evaluation.pptx
111-Evaluating_the_Evaluation.pptx
 
Credit appraisal SYSTEM
Credit appraisal SYSTEMCredit appraisal SYSTEM
Credit appraisal SYSTEM
 
Data Mining Lecture_1.pptx
Data Mining Lecture_1.pptxData Mining Lecture_1.pptx
Data Mining Lecture_1.pptx
 
The Credit Crisis and beyond: Lessons from the Financial Crisis & Sound Pract...
The Credit Crisis and beyond: Lessons from the Financial Crisis & Sound Pract...The Credit Crisis and beyond: Lessons from the Financial Crisis & Sound Pract...
The Credit Crisis and beyond: Lessons from the Financial Crisis & Sound Pract...
 
Baker Hill Prosper 2017 - CECL – Now That the Dust is Settling, What is Going...
Baker Hill Prosper 2017 - CECL – Now That the Dust is Settling, What is Going...Baker Hill Prosper 2017 - CECL – Now That the Dust is Settling, What is Going...
Baker Hill Prosper 2017 - CECL – Now That the Dust is Settling, What is Going...
 
Meet 1 - Introduction Data Mining - Dedi Darwis.pdf
Meet 1 - Introduction Data Mining - Dedi Darwis.pdfMeet 1 - Introduction Data Mining - Dedi Darwis.pdf
Meet 1 - Introduction Data Mining - Dedi Darwis.pdf
 
Statistical review of mortgage credit risk variables ica 2010
Statistical review of mortgage credit risk variables ica 2010Statistical review of mortgage credit risk variables ica 2010
Statistical review of mortgage credit risk variables ica 2010
 
chương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdfchương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdf
 
datamining-lect1.pptx
datamining-lect1.pptxdatamining-lect1.pptx
datamining-lect1.pptx
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data mining
 
Data architecture around risk management
Data architecture around risk managementData architecture around risk management
Data architecture around risk management
 
Solving the Credit Union 'Tower of Babel' (Conference Session Slides)
Solving the Credit Union 'Tower of Babel' (Conference Session Slides)Solving the Credit Union 'Tower of Babel' (Conference Session Slides)
Solving the Credit Union 'Tower of Babel' (Conference Session Slides)
 
Behavior Driven Development - WPC 2011
Behavior Driven Development - WPC 2011Behavior Driven Development - WPC 2011
Behavior Driven Development - WPC 2011
 
Data analysis Design Document
Data analysis Design DocumentData analysis Design Document
Data analysis Design Document
 
CFO Half-Day Conference
CFO Half-Day ConferenceCFO Half-Day Conference
CFO Half-Day Conference
 
Loan Sanction Model
Loan Sanction ModelLoan Sanction Model
Loan Sanction Model
 
A picture is worth a thousand words
A picture is worth a thousand wordsA picture is worth a thousand words
A picture is worth a thousand words
 

Plus de Venkata Reddy Konasani

Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Venkata Reddy Konasani
 
Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniquesVenkata Reddy Konasani
 
Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Venkata Reddy Konasani
 
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau -  Data, Graphs, Filters, Dashboards and Advanced featuresLearning Tableau -  Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced featuresVenkata Reddy Konasani
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitizationVenkata Reddy Konasani
 

Plus de Venkata Reddy Konasani (20)

Transformers 101
Transformers 101 Transformers 101
Transformers 101
 
Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science
 
Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniques
 
Neural Network Part-2
Neural Network Part-2Neural Network Part-2
Neural Network Part-2
 
GBM theory code and parameters
GBM theory code and parametersGBM theory code and parameters
GBM theory code and parameters
 
Neural Networks made easy
Neural Networks made easyNeural Networks made easy
Neural Networks made easy
 
Decision tree
Decision treeDecision tree
Decision tree
 
Step By Step Guide to Learn R
Step By Step Guide to Learn RStep By Step Guide to Learn R
Step By Step Guide to Learn R
 
Credit Risk Model Building Steps
Credit Risk Model Building StepsCredit Risk Model Building Steps
Credit Risk Model Building Steps
 
Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS
 
SAS basics Step by step learning
SAS basics Step by step learningSAS basics Step by step learning
SAS basics Step by step learning
 
Testing of hypothesis case study
Testing of hypothesis case study Testing of hypothesis case study
Testing of hypothesis case study
 
L101 predictive modeling case_study
L101 predictive modeling case_studyL101 predictive modeling case_study
L101 predictive modeling case_study
 
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau -  Data, Graphs, Filters, Dashboards and Advanced featuresLearning Tableau -  Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
 
Machine Learning for Dummies
Machine Learning for DummiesMachine Learning for Dummies
Machine Learning for Dummies
 
Online data sources for analaysis
Online data sources for analaysis Online data sources for analaysis
Online data sources for analaysis
 
A data analyst view of Bigdata
A data analyst view of Bigdata A data analyst view of Bigdata
A data analyst view of Bigdata
 
R- Introduction
R- IntroductionR- Introduction
R- Introduction
 
Cluster Analysis for Dummies
Cluster Analysis for DummiesCluster Analysis for Dummies
Cluster Analysis for Dummies
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitization
 

Dernier

Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 

Dernier (20)

Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 

Data Analysis Course Guide

  • 1. Data Analysis Course Preparing data for analysis(Version-1) Venkat Reddy
  • 2. Data Analysis Course • Data analysis design document • Introduction to statistical data analysis • Descriptive statistics • Venkat Reddy Data Analysis Course • Probability distributions examples and applications • Simple correlation and regression analysis • Multiple liner regression analysis • Logistic regression analysis • Testing of hypothesis • Clustering and decision trees • Time series analysis and forecasting • Credit Risk Model building-1 2 • Credit Risk Model building-2
  • 3. Note • This presentation is just class notes. The course notes for Data Analysis Training is by written by me, as an aid for myself. • The best way to treat this is as a high-level summary; the actual session went more in depth and contained other Venkat Reddy Data Analysis Course information. • Most of this material was written as informal notes, not intended for publication • Please send questions/comments/corrections to venkat@trenwiseanalytics.com or 21.venkat@gmail.com • Please check my website for latest version of this document -Venkat Reddy 3
  • 4. Contents • Need of data exploration • Data Exploration • Data Validation • Data Sanitization Venkat Reddy Data Analysis Course • Missing Value Treatment • Outlier Treatment Identification & Treatment 4
  • 5. What is the need of data sanitization • Download age vs weight data from here • Fit a simple liner regression line • As age increases what happens to weight? • Is the inference accurate? Venkat Reddy Data Analysis Course 5
  • 6. Remember… Data in the real world is dirty Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data. e.g., occupation=“ ” • Incomplete data may come from • “Not applicable” data value when collected. • Different considerations between the time when the data was collected and when it Venkat Reddy Data Analysis Course is analyzed. • Human/hardware/software problems Noisy: Containing errors or outliers. e.g., Salary=“-10” • Noisy data (incorrect values) may come from • Faulty data collection instruments • Human or computer error at data entry • Errors in data transmission Inconsistent: containing discrepancies in codes or names • e.g., Age=“42” Birthday=“03/07/1997”, Was rating “1,2,3”, now rating “A, B, C”,e.g., discrepancy between duplicate records • Inconsistent data may come from 6 • Different data sources • Functional dependency violation (e.g., modify some linked data)
  • 7. Lab: About the data, background and Business Objective: • Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. • Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years. The objective is to build a model that borrowers can use to help make the best financial decisions. Venkat Reddy Data Analysis Course • Historical data are provided on 250,000 borrowers. Variable Name Description Type SeriousDlqin2yrs Person experienced 90 days past due delinquency or worse Y/N Total balance on credit cards and personal lines of credit except real estate and no RevolvingUtilizationOfUnsecuredLines percentage installment debt like car loans divided by the sum of credit limits age Age of borrower in years integer NumberOfTime30- Number of times borrower has been 30-59 days past due but no worse in the last 2 integer 59DaysPastDueNotWorse years. DebtRatio Monthly debt payments, alimony,living costs divided by monthy gross income percentage MonthlyIncome Monthly income real Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. NumberOfOpenCreditLinesAndLoans integer credit cards) NumberOfTimes90DaysLate Number of times borrower has been 90 days or more past due. integer NumberRealEstateLoansOrLines Number of mortgage and real estate loans including home equity lines of credit integer 7 NumberOfTime60- Number of times borrower has been 60-89 days past due but no worse in the last 2 integer 89DaysPastDueNotWorse years. NumberOfDependents Number of dependents in family excluding themselves (spouse, children etc.) integer
  • 8. Basic contents of the data • What are total number of observations • What are total number of fields • Each field name, Field type, Length of field • Format of field, Label Venkat Reddy Data Analysis Course Basic Contents –Check points • Are all variables as expected (variables names) • Are there some variables which are unexpected say q9 r10? • Are the data types and length across variables correct • For known variables is the data type as expected (For example if age is in date format something is suspicious) • Have labels been provided and are sensible 8 If anything suspicious we can further investigate it and correct accordingly
  • 9. Proc Contents-SAS • SAS code : proc contents data=<<data name>>; run; • Useful options : • Short – Outputs the list of variables in a row by row format. Code : proc contents data=test short; run; Venkat Reddy Data Analysis Course • Out=filename - Creates a data set wherein each observation is a variable 9
  • 10. Snapshot of the data Data Snapshot, if possible • Printing the first few observations all fields in the data set .It helps in better understanding of the variable by looking at it’s assigned values. Checkpoints for data snapshot output: 1. Do we have any unique identifier? Is the unique identifier getting Venkat Reddy Data Analysis Course repeated in different records? 2. Do the text variables have meaningful data?(If text variables have absurd data as ‘&^%*HF’ then either the variable is meaningless or the variable has become corrupt or wasn’t properly created.) 3. Are there some coded values in the data?(if for a known variable say State we have category codes like 1-52 then we need definition of how they are coded.) 4. Do all the variables appear to have data? (In case variables are not populated with non missing meaningful value it would show in print. We can further 10 investigates using means statistics.)
  • 11. Proc print in SAS SAS code : proc print data=<<data set>>; run; • Useful options : proc print data=<<data set>> label noobs heading=vertical; var <<variable-list>>; by var1; run; Venkat Reddy Data Analysis Course • Label:The label option uses variable labels as column headings rather than variable names (the default). • Obs : Restricts the number of observations in the output • Nobs: It omits the OBS column of output. • Heading=vertical: It prints the column headings vertically. This is useful when the names are long but the values of the variable are short. • Var: Specifies the variables to be listed and the order in which they will appear. • By: By statement produces output grouped by values of the mentioned variables 11
  • 12. Lab: Data exploration & validation • Import Data_explore.txt into sas • What are basic contents of the data • Verify the check list • Any suspicious variables? Venkat Reddy Data Analysis Course • What is var1? • Are all the variable names correct? • Print the first 10 observations • Do we have any unique identifier? • Do the text variables have meaningful data? • Are there some coded values in the data? 12 • Do all the variables appear to have data
  • 13. Categorical field frequencies • Calculate frequency counts cross-tabulation frequencies for Especially for categorical, discrete & class fields • Frequencies • help us understanding the variable by looking at the values it’s taking and data count at each value. Venkat Reddy Data Analysis Course • They also helps us in analyzing the relationships between variables by looking at the cross tab frequencies or by looking at association Checkpoints for looking frequency table 1. Are values as expected? 2. Variable understanding : Distinct values of a particular variable, missing percentages 3. Are there any extreme values or outliers? 4. Any possibility of creating a new variable having small number of 13 distinct category by clubbing certain categories with others.
  • 14. Proc Freq in SAS • SAS code: Proc FREQ data =<dataset > <options> ; TABLES requests < / options > ; // Gives Frequency Count or Cross Tab BY <varl> ; // Grouping output based on varl WEIGHT variable < / option > ; //Specifying Weight (if applicable) OUTPUT < OUT=SAS-data-set > options ; //Output results to another data set run; Venkat Reddy Data Analysis Course • Useful options : • Order=Freq - sorts by descending frequency count (default is the unformatted value). Ex: proc freq data=test order=freq; tables X1-X5; run; • Nocol/Norow/Nopercent - suppresses printing of column, row and cell percentages respectively of a cross tab. Ex : proc freq data=test; tables AGE*bad/nocol norow nopercent missing; run; • Missing- interprets missing values as non-missing and includes them in % and statistics calculations ex : proc freq data=test; tables CHANNEL* BAD /missing; run; • Chisq - performs several chi-square tests. Ex: proc freq data=test; tables 14 channel*bad/chisq; run;
  • 15. Lab: Frequencies • Find the frequencies of all class variables in the data • Are there any variables with missing values? • Are there any default values? • Can you identify the variables with outliers? Venkat Reddy Data Analysis Course 15
  • 16. Descriptive Statistics for continuous fields • Distribution of numeric variables by calculating • N – Count of non missing observations • Nmiss – Count of Missing observations • Min, Max, Median, Mean Venkat Reddy Data Analysis Course • Quartile numbers & percentiles– P1, p5,p10,q1(p25),q3(p75), p90,p99 • Stddev • Var • Skewness • Kurtosis 16
  • 17. Descriptive Statistics Checkpoints • Are variable distribution as expected. • What is the central tendency of the variable? Mean, Median and Mode across each variable • Is the concentration of variables as expected ? What are quartiles? • Indicates variables which are unary I.e stddev=0 ; the variables Venkat Reddy Data Analysis Course which are useless for the current objective. • Are there any outliers / extreme values for the variable? • Are outlier values as expected or they have abnormally high values - for ex for Age if max and p99 values are 10000. Then should investigate if it’s the default value or there is some error in data • What is the % of missing value associated with the variable? 17
  • 18. Proc Means in SAS • Proc means data=<data set> < options>; Var <variable list >; Run; • If variable list is not mentioned it gives results across all numeric variables Venkat Reddy Data Analysis Course • If options are not specified by default it gives stats like – n , min, max, mean and stddev. • Useful Options : • By : Calculates statistics based on grouping across specified variable; Proc means data=check n nmiss min max; var age ; class channel; 18 run;
  • 19. Proc Univariate in SAS • SAS Code : PROC UNIVARIATE data=<dataset>; VAR variable(s); run; • Useful options : • PROC UNIVARIATE data=<dataset> plot normal; HISTOGRAM <variable(s)> </ option(s)>; Venkat Reddy Data Analysis Course By variable; VAR variable(s); run; • Normal option produces the tests of normality ; • Plot option produces the 3 plots of data(stem and leaf plot, box plot, normal probability plot • By option is used for giving outputs separated by categories • Histogram option gives the distribution of variable in a histogram 19
  • 20. General Checks • Mean=Median? • Counted proportion data. If data consists of counted proportions, e.g. number of individuals responding out of total number of individuals, • Polytomous data. If data consists of numbers falling into a number Venkat Reddy Data Analysis Course of mutually exclusive classes, do not reduce to proportions or percentages beforehand, but enter the integer counts • Data Sufficiency: Data Sufficiency involves ensuring that the data has the required attributes to make the prediction as stated by objective • Eg1: To build a model to predict fraud, the given data doesn’t have any key for identifying fraud accounts or those identifiers are erased than there are no accounts which we can identify as ‘bad’ and build a model to predict the same. • Eg2: If we are building a response model specifically for internet channel for a “airline card”. Then data should have a identifier for ‘channel of acquisition’ 20 to identify the right data base on which to build the model
  • 21. Lab: Data exploration & validation • Find N, Average, sd, minimum & maximum • Is N same for all the variables? • Any variables with unusual min & max? • Identify list of suspicious variables • Find below statistics for all the doubtful variables • Venkat Reddy Data Analysis Course N,Mean,Median,Mode • Std Deviation • Skewness • Variance • Kurtosis • Interquartile Range • Quantiles 100% Max, 99%,,95%,,90%, 75% Q3,50% Median,25% Q1,10%,5%,1%,0% Min • See the variable definitions and possible values • Identify variables with missing values, default values & outliers 21
  • 22. Now what…? • Some variables contain outliers • Some variables have default values • Some variables have missing values Venkat Reddy Data Analysis Course • Shall we delete them and go ahead with our analysis? 22
  • 23. Missing Values • Data is not always available E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to • equipment malfunction Venkat Reddy Data Analysis Course • inconsistent with other recorded data and thus deleted • data not entered due to misunderstanding • certain data may not be considered important at the time of entry • not register history or changes of the data • Missing data may need to be inferred. • Missing data - values, attributes, entire records, entire sections 23 • Missing values and defaults are indistinguishable
  • 24. Missing Value & outlier Treatment Missing No. of % of Treatment Treatment categories Missing Impute with similar/closest bad rate group <=50% If bad rate is very different from all other categories, then assign a special value and Venkat Reddy Data Analysis Course include in the regression analysis <=25 >50% Create a dummy / indicator variable with missing (as 1) vs. non missing category Variable <=10% Impute with the mean / median value Assign a special value to the missing category in >10% & the original variable and create an indicator >25 <=50% variable with missing value as 1 and others as 0 24 Create a dummy / indicator variable with missing >50% value as 1 and others as 0
  • 25. Missing Value Imputation1 • Missing Value Imputation – 1:Standalone imputation • Mean, median, other point estimates • Assume: Distribution of the missing values is the same as the non- missing values. • Does not take into account inter-relationships Venkat Reddy Data Analysis Course • Introduces bias • Convenient, easy to implement 25
  • 26. Missing Value Imputation2 Missing Value Imputation - 2 • Better imputation - use attribute relationships • Assume : all prior attributes are populated • That is, monotonicity in missing values. X1 X2 X3 X4 X5 • Two techniques 11 20 6.5 4 . Venkat Reddy Data Analysis Course 12.1 18 8 2 . • Regression (parametric), 11.9 22 4.2 . . 10.9 15 . . . • Propensity score (nonparametric) • Regression method • Use linear regression, sweep left-to-right X3=a+b*X2+c*X1; X4=d+e*X3+f*X2+g*X1, and so on • X3 in the second equation is estimated from the first equation if it is missing 26
  • 27. Missing Value Imputation3 • Propensity Scores (nonparametric) • Let Yj=1 if Xj is missing, 0 otherwise • Estimate P(Yj =1) based on X1 through X(j-1) using logistic regression • Group by propensity score P(Yj =1) • Within each group, estimate missing Xjs from known Xjs using approximate Bayesian bootstrap. Venkat Reddy Data Analysis Course • Repeat until all attributes are populated. • Arbitrary missing pattern • Markov Chain Monte Carlo (MCMC) • Assume data is multivariate Normal, with parameter Q • (1) Simulate missing X, given Q estimated from observed X ; (2) Re- compute Q using filled in X • Repeat until stable. • Expensive: Used most often to induce monotonicity Note that imputed values are useful in aggregates but can’t be trusted 27 individually
  • 28. Default Values Treatment • Special or default values are values like 999 or 999999 which fall outside the normal range of data . • For instance a no. of bankcards variable usually has values from 0 to 100 but 999 values in the data represent the population which does not have any tradelines. Including them in the regression as 999 Venkat Reddy Data Analysis Course would skew the regression results, hence we need to treat them accordingly. • Special or default values also should be treated as we are treating the missing values depending upon the no. of categories and the % of default value. • They can also be taken care of by capping or flooring them to a realistic value / where the trend is being maintained ( especially if the % of default value is very less). 28
  • 29. Default Values Treatment Default No. of % of Default Value Treatment categories Value Treatment Club with similar/closest bad rate group <=3% Venkat Reddy Data Analysis Course <=25 >3% & <50% Keep as separate category use this as class >=50% Create a binary variable and use the binary in class along with the original variable Variable <=3% Substitute with correct group If x=-999 then do; flg_x=1;x=0; end; >25 >3% & <50% Else flg_x=0; 29 >=50% Create a binary variable and use the binary in class, drop original variable
  • 30. Lab: Missing values & Default values • In the sample data, what all fields need missing value treatment? • Conduct the missing value treatment • Identify fields with outlier, conduct outlier treatment Venkat Reddy Data Analysis Course 30
  • 31. Outlier Treatment • Flooring • Capping • Treat as a separate segment Venkat Reddy Data Analysis Course 31
  • 32. Lab: Data Cleaning step by step • Var-1: Change the variable name to sr_no • SeriousDlqin2yrs : Only training data has data objective variable test data doesn’t have objective variable in it. Subset training data & test from overall data • Age: If age <21 make it 21 Venkat Reddy Data Analysis Course 32
  • 33. Lab : Data Cleaning step by step NumberOfTime30_59DaysPastDueNotW • Find bad rate in each category of this variable • Replace 96 with _____? Replace 98 with_____? Missing No. of Venkat Reddy Data Analysis Course % of Missing Treatment Treatment categories Impute with similar/closest bad rate group <=50% If bad rate is very different from all other categories, then assign a special value and include in the regression analysis <=25 >50% Create a dummy / indicator variable with missing (as 1) vs. non missing category 30_59Days PastDue <=10% Impute with the mean / median value Assign a special value to the missing category in the >10% & original variable and create an indicator variable >25 <=50% with missing value as 1 and others as 0 33 Create a dummy / indicator variable with missing >50% value as 1 and others as 0
  • 34. Lab :Data Cleaning step by step • RevolvingUtilizationOfUnsecuredL: What type of variable is this? What are the possible values? • Replace anything more than 1 with_____? Missing No. of Venkat Reddy Data Analysis Course % of Missing Treatment Treatment categories Impute with similar/closest bad rate group <=50% If bad rate is very different from all other categories, then assign a special value and include in the regression analysis <=25 >50% Create a dummy / indicator variable with missing (as 1) vs. non missing category Utilization pct <=10% Impute with the mean / median value Assign a special value to the missing category in >10% & the original variable and create an indicator >25 <=50% variable with missing value as 1 and others as 0 34 Create a dummy / indicator variable with missing >50% value as 1 and others as 0
  • 35. Lab :Data Cleaning step by step • Monthly Income Missing No. of % of Missing Treatment Treatment categories Venkat Reddy Data Analysis Course Impute with similar/closest bad rate group <=50% If bad rate is very different from all other categories, then assign a special value and include in the regression analysis <=25 >50% Create a dummy / indicator variable with missing (as 1) vs. non missing category Monthly Income <=10% Impute with the mean / median value Assign a special value to the missing category in >10% & the original variable and create an indicator >25 <=50% variable with missing value as 1 and others as 0 35 Create a dummy / indicator variable with missing >50% value as 1 and others as 0
  • 36. Lab :Data Cleaning step by step • Debt Ratio: Similar Imputation • NumberOfOpenCreditLinesAndLoans : No clear evidence • NumberOfTimes90DaysLate: Imputation similar to NumberOfTime30_59DaysPastDueNotW Venkat Reddy Data Analysis Course • NumberRealEstateLoansOrLines: : No clear evidence • NumberOfTime60_89DaysPastDueNotW: Imputation similar to NumberOfTime30_59DaysPastDueNotW • NumberOfDependents: Impute with equal bad rate 36
  • 37. Variables & Treatment Old Var Type Treatment New Var VAR1 Num Nothing SeriousDlqin2yrs Num Nothing RevolvingUtilizationOfUnsecuredL Num Impute with the mean Util age Num flooring age1 NumberOfTime30_59DaysPastDueNotW Num Impute with the mean NumberOfTime30_59DaysPa Venkat Reddy Data Analysis Course stDue1 DebtRatio Num Impute with the median DebtRatio1 MonthlyIncome Char Convert to num & create a ind_MonthlyIncome, dummy var MonthlyIncome1 NumberOfOpenCreditLinesAndLoans Num Impute with median num_open_lines NumberOfTimes90DaysLate Num Imputing & capping delq_90 NumberRealEstateLoansOrLines Num Capping num_loans NumberOfTime60_89DaysPastDueNotW Num Capping & Imputing delq_60to89 NumberOfDependents Char 37 obs_type Char Subset training data Obs_type
  • 38. Venkat Reddy Konasani Manager at Trendwise Analytics venkat@TrendwiseAnalytics.com 21.venkat@gmail.com Venkat Reddy Data Analysis Course www.trendwiseanalytics.com/venkat +91 9886 768879 38