Working on data (cleaning, filtering, transformation, sampling, visualization)
K K Singh, Dept. of CSE, RGUKT Nuzvid
8/19/2017
Exploring data
• cd <- read.table('custData.csv', sep=',', header=TRUE)
• Once we’ve loaded the data into R, we’ll want to examine it.
• class(): Tells you what type of R object you have. In our case, cd is a data frame.
• summary(): Gives you a summary of almost any R object.
• str(): Gives the detailed structure of the data frame (column types and the first few values).
• names(): Gives the column names of the data frame.
• dim(): Gives the number of rows and columns of the data.
• Data exploration uses a combination of summary statistics (means and medians, variances, and counts) and visualization. You can spot some problems just by using summary statistics; other problems are easier to find visually.
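The inspection functions listed above can be tried on a small stand-in file. This is only a sketch: the four rows and the column names (custid, age, income, is.employed) are made up for illustration and are not the real custData.csv.

```r
# Hypothetical stand-in for custData.csv, written to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("custid,age,income,is.employed",
             "1,34,52000,TRUE",
             "2,51,,FALSE",
             "3,27,31000,NA"), csv_path)

cd <- read.table(csv_path, sep = ",", header = TRUE)

class(cd)    # type of the object
dim(cd)      # number of rows and columns
names(cd)    # column names
str(cd)      # column types and first values
summary(cd)  # per-column summary, including NA counts
```

The empty income field and the literal NA are both read as missing values, so summary() reports them in its NA's row.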
Other data formats
• .csv is not the only common data file format you’ll encounter. Other formats include:
  • .tsv (tab-separated values)
  • pipe-separated files
  • Microsoft Excel workbooks
  • JSON data
  • XML
• R’s built-in read.table() command can be made to read most separated-value formats.
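As a minimal sketch of that last point, the same read.table() call handles tab-separated and pipe-separated files once the sep argument is changed. The file contents below are hypothetical sample rows.

```r
# Hypothetical sample rows, written in two separated-value formats.
tsv_path  <- tempfile(fileext = ".tsv")
pipe_path <- tempfile(fileext = ".txt")
writeLines(c("custid\tstate", "1\tOhio", "2\tTexas"), tsv_path)
writeLines(c("custid|state", "1|Ohio", "2|Texas"), pipe_path)

# Only the separator differs between the two calls.
from_tsv  <- read.table(tsv_path,  sep = "\t", header = TRUE)
from_pipe <- read.table(pipe_path, sep = "|",  header = TRUE)

all.equal(from_tsv, from_pipe)  # both files yield the same data frame
```

Excel, JSON, and XML files are not separated-value formats and need add-on packages, which this sketch does not cover.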
• library(data.table)  # fread() comes from the data.table package
• custdata <- fread("custData.csv")
• summary(custdata)
Typical problems revealed by data summaries
• Missing values
• Invalid values and outliers
• Data range
• Units
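The sketch below shows how these problems tend to surface in a summary. The data frame is hypothetical, with problems planted on purpose; the column names follow the ones used elsewhere in these slides.

```r
# Hypothetical customer data with deliberately planted problems.
custdata <- data.frame(
  age         = c(34, 0, 51, 146, 27),               # 0 and 146 look invalid
  income      = c(52000, -8700, 31000, NA, 615000),  # negative value, NA, outlier
  is.employed = c(TRUE, NA, FALSE, NA, TRUE)
)

summary(custdata)                      # min/max and NA counts per column
colSums(is.na(custdata))               # missing values per column
range(custdata$income, na.rm = TRUE)   # data range of income
```

A unit problem (for example, income recorded in thousands rather than in single currency units) cannot be detected by code alone; it usually shows up as a range that is implausible for the variable.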
Data cleaning
• Fundamentally, there are two things you can do with missing values: drop the rows with missing values, or convert the missing values to a meaningful value.
• If the missing data represents a fairly small fraction of the dataset, it’s probably safe just to drop these customers from your analysis. But if the fraction is significant, what do you do then?
• The most straightforward solution is just to create a new category for the variable, called missing.
• f <- ifelse(is.na(custdata$is.employed), "missing", ifelse(custdata$is.employed, "employed", "not_employed"))
• summary(as.factor(f))
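Both options can be compared side by side on a toy data frame. This is a sketch under assumptions: the five rows are invented, and is.employed.fix is a hypothetical name for the recoded column.

```r
# Hypothetical data: two of five customers have an unknown employment status.
custdata <- data.frame(
  custid      = 1:5,
  is.employed = c(TRUE, NA, FALSE, NA, TRUE)
)

# Option 1: drop rows that contain any missing value.
dropped <- custdata[complete.cases(custdata), ]
nrow(dropped)

# Option 2: keep every row and recode NA as its own category.
custdata$is.employed.fix <- ifelse(
  is.na(custdata$is.employed), "missing",
  ifelse(custdata$is.employed, "employed", "not_employed")
)
table(custdata$is.employed.fix)
```

Here dropping rows would discard 40% of the data, which is the situation where the extra category is the safer choice.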
Data transformations
The purpose of data transformation is to make data easier to model and easier to understand. For example, the cost of living varies from state to state, so what would be a high salary in one region could be barely enough to scrape by in another. If you want to use income as an input to your insurance model, it might be more meaningful to normalize a customer’s income by the typical income in the area where they live.
custdata <- merge(custdata, medianincome, by.x="state.of.res", by.y="State")
summary(custdata[, c("state.of.res", "income", "Median.Income")])
custdata$income.norm <- with(custdata, income/Median.Income)
OR, if custdata is a data.table:
custdata$income.norm <- custdata[, income/Median.Income]
summary(custdata$income.norm)
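The merge-and-normalize step can be checked on a toy example. The two small tables below are hypothetical; only the column names (state.of.res, State, Median.Income, income.norm) come from the listing above.

```r
# Hypothetical customers and per-state median incomes.
custdata <- data.frame(
  custid       = 1:4,
  state.of.res = c("Ohio", "Texas", "Ohio", "Texas"),
  income       = c(40000, 90000, 60000, 45000)
)
medianincome <- data.frame(
  State         = c("Ohio", "Texas"),
  Median.Income = c(50000, 60000)
)

# Attach each customer's state median, then divide income by it.
custdata <- merge(custdata, medianincome, by.x = "state.of.res", by.y = "State")
custdata$income.norm <- with(custdata, income / Median.Income)

custdata[, c("custid", "state.of.res", "income.norm")]
```

A value above 1 means the customer earns more than the median in their own state, so the 60000 earner in Ohio ranks higher than the 45000 earner in Texas even before comparing raw incomes. Note that merge() reorders the rows by the join column.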
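Converting continuous variables to discrete
• For some continuous variables, the exact value matters less than whether it falls into a certain range. In these cases, you might want to convert the continuous age and income variables into ranges, or discrete variables.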
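One way to do this in base R is a logical comparison for a single threshold and cut() for several ranges. The data, the 20,000 income threshold, and the age bands below are all assumptions chosen for illustration.

```r
# Hypothetical ages and incomes.
custdata <- data.frame(
  age    = c(18, 30, 70, 25),
  income = c(12000, 45000, 20000, 150000)
)

# A single threshold becomes a TRUE/FALSE flag.
custdata$income.lt.20K <- custdata$income < 20000

# Several ranges: cut() assigns each age to (0,25], (25,65] or (65,Inf].
brks <- c(0, 25, 65, Inf)
custdata$age.range <- cut(custdata$age, breaks = brks,
                          labels = c("young", "middle", "senior"))
custdata
```

By default cut() closes each interval on the right, so an age of exactly 25 falls in the first band.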
Normalization and rescaling
Normalization is useful when absolute quantities are less meaningful than relative ones.
• For example, you might be less interested in a customer’s absolute age than in how old or young they are relative to a “typical” customer. Let’s take the mean age of your customers to be the typical age. You can normalize by that, as shown in the following listing.
• summary(custdata$age)
• meanage <- mean(custdata$age)
• custdata$age.normalized <- custdata$age / meanage
• summary(custdata$age.normalized)
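A small worked example makes the effect visible. The five ages are hypothetical. The second half shows one common rescaling variant, centering on the mean and dividing by the standard deviation; that variant is an addition for illustration and does not appear in the listing above.

```r
# Hypothetical ages with a mean of 40.
custdata <- data.frame(age = c(20, 30, 40, 50, 60))

# Normalize by the mean: 1 means "typical", below 1 younger, above 1 older.
meanage <- mean(custdata$age)
custdata$age.normalized <- custdata$age / meanage

# Variant (assumption): rescale to units of standard deviation from the mean.
stdage <- sd(custdata$age)
custdata$age.z <- (custdata$age - meanage) / stdage

custdata
```

After the first step the normalized column has mean 1; after the second, age.z has mean 0 and standard deviation 1.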
Data sampling
• Sampling is the process of selecting a subset of a population to represent the whole during analysis and modeling.
• It’s easier to test and debug code on small subsamples before training the model on the entire dataset. Visualization can also be easier with a subsample of the data.
• The other reason to sample your data is to create test and training splits.
A convenient way to manage random sampling is to add a sample group column to the data frame. The sample group column contains a number generated uniformly from zero to one, using the runif function. You can draw a random sample of arbitrary size from the data frame by using the appropriate threshold on the sample group column.
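The sample group column described above might look like this in practice. It is a sketch, not the author's listing: the column name gp, the 1000 hypothetical rows, the seed, and the 10% threshold are all assumptions.

```r
set.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical data frame with 1000 customers.
custdata <- data.frame(custid = 1:1000)

# Sample group column: one uniform random number in [0, 1] per row.
custdata$gp <- runif(nrow(custdata))

# A threshold of 0.1 puts roughly 10% of rows in the test set.
testSet  <- subset(custdata, gp <= 0.1)
trainSet <- subset(custdata, gp > 0.1)

c(test = nrow(testSet), train = nrow(trainSet))
```

Because the group number is stored with the data, the same split can be reproduced later, and a different sample size only requires a different threshold.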
Data visualization (refer to the lecture on graph plotting)
• Visually checking distributions for a single variable:
  • What is the peak value of the distribution?
  • How many peaks are there in the distribution (unimodality versus bimodality)?
  • How normal (or lognormal) is the data?
  • How much does the data vary? Is it concentrated in a certain interval or in a certain category?
• Is there a relationship between the two inputs age and income in my data?
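The single-variable questions above can be explored with base R graphics. This sketch uses simulated, right-skewed incomes (a lognormal draw with made-up parameters) rather than real customer data.

```r
set.seed(1)

# Hypothetical right-skewed incomes drawn from a lognormal distribution.
income <- rlnorm(500, meanlog = 10.5, sdlog = 0.8)

pdf(NULL)  # discard the plots; remove this line to see them on screen
h <- hist(income, breaks = 30, main = "Income")               # range and peaks
plot(density(income), main = "Income density")                # shape of the distribution
plot(density(log10(income)), main = "log10(income) density")  # skew is easier to judge on a log scale
invisible(dev.off())

h$mids[which.max(h$counts)]  # midpoint of the tallest bin, a rough peak value
```

If the log-scale density looks roughly bell-shaped while the raw density has a long right tail, the data is closer to lognormal than to normal.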
Graph types and their uses
1. Line plot: Shows the relationship between two continuous variables. Best when that relationship is functional.
2. Scatter plot: Shows the relationship between two continuous variables. Best when the relationship is too loose or cloud-like to be seen on a line plot.
3. Bar chart: Shows the relationship between two categorical variables (var1 and var2). Highlights the frequencies of each value of var1.
4. Bar chart with faceting: Shows the relationship between two categorical variables (var1 and var2). Best for comparing the relative frequencies of each value of var2 within each value of var1 when var2 takes on more than two values.
5. Histogram or density plot: Examines data range, checks the number of modes, checks whether the distribution is normal or lognormal, and checks for anomalies and outliers. (Use a log scale to visualize data that is heavily skewed.)
6. Box-and-whisker plot (boxplot): Presents information from a five-number summary. Useful for indicating whether a distribution is skewed and whether there are potential unusual observations (outliers). Very useful when large numbers of observations are involved and when two or more data sets are being compared.
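Most of these graph types have a direct base R counterpart, sketched below on simulated data. The column names (age, income, marital.stat, health.ins) and all values are hypothetical. Faceted bar charts are left out because they are usually drawn with an add-on plotting package, which may be what the graph plotting lecture uses.

```r
set.seed(2)

# Hypothetical customer data for the plots.
custdata <- data.frame(
  age          = round(runif(50, 20, 70)),
  income       = rlnorm(50, meanlog = 10.5, sdlog = 0.6),
  marital.stat = sample(c("married", "single"), 50, replace = TRUE),
  health.ins   = sample(c(TRUE, FALSE), 50, replace = TRUE)
)
x <- 1:50

pdf(NULL)  # discard the plots; remove this line to see them on screen
plot(x, x^2, type = "l")                                # 1. line plot
plot(custdata$age, custdata$income)                     # 2. scatter plot
counts <- table(custdata$health.ins, custdata$marital.stat)
barplot(counts, legend.text = TRUE)                     # 3. bar chart (stacked)
hist(custdata$income)                                   # 5. histogram
bp <- boxplot(income ~ marital.stat, data = custdata)   # 6. boxplot
invisible(dev.off())

bp$stats  # one five-number summary per marital status
```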
Assignments
• Load the nycflights data set, for example with data(nycflights) after attaching the package that provides it.
• 1. Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many such records are there?
• 2. Calculate the median and interquartile range of the arrival delays (arr_delay) of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the highest IQR of arrival delays?
• 3. Considering the data from all the NYC airports, which month has the highest average departure delay?
• 4. What was the worst day to fly out of NYC in 2013 if you dislike delayed flights?
• 5. Make a histogram and calculate appropriate summary statistics for arrival delays of sfo_feb_flights. Which of the following is false?
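As a hint for the filtering and grouping steps, the sketch below applies them to a tiny made-up table. It is not the nycflights data and does not answer the questions; the column names dest, month, carrier, and arr_delay are assumed to match the real data set.

```r
# Hypothetical miniature flights table (not the real nycflights data).
flights <- data.frame(
  dest      = c("SFO", "SFO", "LAX", "SFO", "SFO", "SFO"),
  month     = c(2, 2, 2, 3, 2, 2),
  carrier   = c("AA", "AA", "UA", "UA", "UA", "UA"),
  arr_delay = c(-10, 20, 5, 0, 15, 45)
)

# Filter: keep rows that satisfy both conditions.
sfo_feb <- subset(flights, dest == "SFO" & month == 2)
nrow(sfo_feb)

# Group: apply a summary function to arr_delay within each carrier.
med <- tapply(sfo_feb$arr_delay, sfo_feb$carrier, median)
iqr <- tapply(sfo_feb$arr_delay, sfo_feb$carrier, IQR)
rbind(median = med, IQR = iqr)
```

The same subset() and tapply() pattern carries over to grouping by month or by day for the departure delay questions.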