DATA MINING USING R (1).pptx

ADIKAVI NANNAYA UNIVERSITY
UNIVERSITY COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ONE DAY ORIENTATION PROGRAM ON
DATA MINING USING R PROGRAMMING
11TH Dec 2017. Dr. M. Kamala Kumari
Assoc Prof

OUR WAY…
DATA….A BASE THING
DIFFERENCES BETWEEN RELATED TERMS
OBJECTIVES OF PROCESSING THINGS
STEPS IN DATA ANALYSIS
DIFFERENT ANGLES OF DATA SCIENCE
OBJECTIVES OF ALL STORIES
WHAT IS THE ROLE OF R
DEFINITIONS OF R
VARIATIONS OF R
COMPETITORS OF R
WHY R
CRAN R
RSTUDIO
BASIC COMMANDS
PROGRAM 1 TO PROGRAM 13.

Base to anything: Data!!
Processing Data = Applying Statistics on Data
Data: 345 (423), 260, 85
Context: No. of UG colleges affiliated to AKNU, No. of PG colleges affiliated to AKNU, total No. of colleges affiliated to AKNU
Information: AKNU has more UG affiliations than PG
Analysis = Understanding Information
Decision Making: Decide whether to give affiliation for a UG college or not!!

THE ABOVE PROCESS CAN BE VIEWED WITH R –SHOWING DATA,PROCESSING AND RESULTS
ALL IN ONE ENVIRONMENT…..LET’S MAKE DECISION EASY WITH R!!!!

DATA, INFORMATION AND KNOWLEDGE
KNOWLEDGE IS USEFUL INFORMATION OBTAINED
THROUGH LEARNING AND EXPERIENCE
KNOWLEDGE DOES NOT NEED DIRECT INTERACTION
WITH DATA
PREDICTION IS POSSIBLE WITH REQUIRED
KNOWLEDGE BUT NOT WITH INFORMATION ALONE
NEED INFORMATION TO GET KNOWLEDGE
INFORMATION IS PROCESSED DATA
KNOWLEDGE IS PROCESSING PATTERNS OF
INFORMATION ASSOCIATED WITH EXPERIENCE
KNOWLEDGE REQUIRES COGNITIVE (REASONING,
PERCEPTION) ABILITY
....WHEREAS INFORMATION DOES NOT
(Figure labels: DATA, INFORMATION, KNOWLEDGE)

KNOWLEDGE==SCIENCE??
• Data ==Facts
• Statistics ==Data + Formulae
• Information==Description of Statistics(Reduce
errors)
• Analysis == Understanding Information or
Insights of Data and info
• Analytics == Algorithms/Techniques on Data
• Knowledge == Understanding information
and technical results
• Data Mining == Analytics==Querying…???...YES

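STEPS IN DATA ANALYSIS
DATA → Collect → Organize → Clean → Explore → Model → Reports/Graphics
ETL (Collect, Organize, Clean):
Organize: arrange in a particular format
Clean: remove errors and fill gaps
DATA ANALYTICS (Explore, Model, Reports/Graphics):
Explore: apply statistics and techniques
Model: apply algorithms
Reports/Graphics: visualization techniques/tools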

DATA ANALYSIS DATA ANALYTICS DATA MINING AND
DATA SCIENCE --- WE ALL ARE RELATED !!
Data Science
DATA ANALYSIS
DCD
DATA
MINING
DATA ANALYTICS
DATA
WAREHOUSING

DAWN TO DUSK=DATA SCIENCE!!
Domain
Expert
SELECT
H/W STATISTICS
ETL
Data
Modeling
Computing
data
Visualization
Prediction

THE OBJECTIVES OF ALL THE STORIES
BEHIND!!.....CONTD
• DESCRIPTION
• COMPARISON
• CLASSIFICATION
• COMBINE SIMILAR
THINGS
• GENERATE RULES
UNDERSTAND
ACQUIRE KNOWLEDGE
….AND…..
PREDICT/DECIDE

ROLE OF ‘R’…IN WHICH STORY
The R language is widely used among statisticians
and data miners for developing statistical software
and data analysis.
Instead of lengthy programs, R makes statistical
computation and visualization easy: ready-made
functions and the many packages available reduce
the amount of code to write.
R is one of the analytical tools

WE CAN DEFINE R TO BE….
R IS A PROGRAMMING LANGUAGE
R IS AN ANALYTICAL TOOL
R IS A SCRIPTING LANGUAGE
R STUDIO IS A SOFTWARE ENVIRONMENT

A B C D E …S..R..!!..?
R – A free and open source software
programming language for statistical
computing and graphics.
• Founders of R-Ross Ihaka & Robert Gentleman

R STUDIO
• RStudio is an IDE for developing in R; it was founded by JJ
Allaire
• R is an implementation of the S language, a statistical
language.
• Latest version of R = R 3.4.2 for Windows
32/64bit

VARIATIONS OF R
• R – free implementation of the S (programming
language)
• pbdR – Programming with Big Data R
• R Commander– GUI interface for R
• Rattle GUI– GUI interface for R
• Revolution Analytics – production-grade software
for the enterprise big data analytics
• RStudio – GUI interface and development
environment for R

COMPETITORS OF R
• MS Excel - Microsoft Excel Sheet
• SAS - Statistical Analysis System
• SPSS - Statistical Package for Social Science
• MATLAB -Matrix Laboratory
• OCTAVE -Helps in solving linear and nonlinear
problems numerically.
• Python -Another Programming language which
express concepts in fewer lines of code.
• Spark -Provides Interface for programming
entire cluster with implicit data parallelism
• Storm - Distributed Real time computation System

THEN WHY R??
• More powerful data manipulation capabilities
• Easier automation
• Faster computation
• It reads any type of data
• Easier project organization
• It supports larger data sets
• Reproducibility (important for detecting errors)
• Easier to find and fix errors
• It's free
• It's open source
• Advanced Statistics capabilities
• State-of-the-art graphics
• It runs on many platforms
• Anyone can contribute packages to improve its functionality

INVITE R AND RSTUDIO…
• Download and install the latest
R: http://www.r-project.org/
• Download and install RStudio, the R
IDE: http://www.rstudio.com/

CRAN R
• The “Comprehensive R Archive Network” ( CRAN ) is a
collection of sites which carry identical material, consisting
of the R distribution(s), the contributed extensions,
documentation for R, and binaries.
• R FAQ - The R Project for Statistical Computing
• CRAN is a network of ftp and web servers around the world
that store identical, up-to-date, versions of code and
documentation for R. Please use the CRAN mirror nearest
to you to minimize network load.

Get and Set working directories
>getwd()
[1] "C:/Users/My Document/Documents"
➢setwd("C:/Program Files/R/R-3.4.3/bin/i386")
➢getwd()
➢ [1] "C:/Program Files/R/R-3.4.3/bin/i386"
➢dir()
➢data()
➢ls()

SIMPLE COMMANDS
TO INSTALL ANY PACKAGE
>install.packages("package name")
We can install any package if we know its correct name and it is
available for the installed version of R
TO SEE THE LIST OF ALL DATASETS
>data()
TO LOAD AN INSTALLED PACKAGE IN R
>library(package name)
TO SEE LIST OF PACKAGES INSTALLED IN DIFFERENT
LIBRARIES
>library()

PACKAGE AND LIBRARY…???
Recently, the official repository (CRAN)
reached 10,000 packages published, and many
more are publicly available through the
internet.
"A package is like a book, a library is like a
library; you use library() to check a package out
of the library." (Hadley Wickham, Chief Scientist
at RStudio)
Functions are like pages in a package book!!

COMPLETIONS
YELLOW COLOUR ARE VARIABLES
BLUE COLOURS ARE FOR FUNCTIONS
VIOLET COLOUR AND P INSIDE WITH
TWO :: BESIDE FOR PACKAGES
VIOLET FOR FUNCTION ARGUMENTS OR
VECTORS
GRID FOR DATAFRAMES

Program 1:BASIC COMMANDS-VECTORS
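• A vector is a sequence of data elements of the same basic type. The elements of a vector are officially called
components.
> 8.5:4.5 #descending sequence: 8.5 7.5 6.5 5.5 4.5
➢ rnorm(10) #10 random numbers from the standard normal distribution
➢ c(1, 1:3, c(5, 8), 13) #all values are concatenated into a single vector
EACH vector() CALL BELOW CAN ALSO BE WRITTEN IN THE SHORTER FORM SHOWN IN ITS COMMENT
➢ vector("numeric", 5) #same as numeric(5)
➢ vector("complex", 5) #same as complex(5)
➢ vector("logical", 5) #same as logical(5)
➢ vector("list", 5) #a list of 5 NULL elements; note that list(5) is a one-element list holding 5
➢ vector("character", 5) #same as character(5)
➢ seq.int(3, 12) #same as 3:12
➢ seq.int(3, 12, 2)
➢ seq.int(0.1, 0.01, -0.01)
➢ seq_len(5)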

>n <- 5
>seq_len(n) #1 2 3 4 5
>pp <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers")
>for(i in seq_along(pp)) print(pp[i])
>length(1:5)
>length(c(TRUE, FALSE, NA))
>sn <- c("Varma", "Persis", "Kamala", "PVRao")
>length(sn)
>nchar(sn)
• Each element of an R vector can be given a name. Labeling the elements can often make your code
much more readable. You can specify names when you create a vector in the form name = value. If
the name of an element is a valid variable name, it doesn’t need to be enclosed in quotes.
c(apple = 1, banana = 2, "kiwi fruit" = 3, 4)
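A small sketch of how named elements can be used once they exist. The variable name fruit and the label "mango" are made up for illustration only.

```r
# Name elements at creation time; the last element is left unnamed.
fruit <- c(apple = 1, banana = 2, "kiwi fruit" = 3, 4)
print(names(fruit))        # unnamed elements get the empty string as a name

# Look up elements by name instead of by position.
print(fruit["banana"])
print(fruit[c("apple", "kiwi fruit")])

# Names can also be assigned (or replaced) after creation.
names(fruit)[4] <- "mango"  # "mango" is a made-up label for this sketch
print(fruit)
```

Names that contain a space, such as "kiwi fruit", must stay in quotes both when creating the vector and when indexing it.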

>x <- (1:5) ^ 2
>x[c(1, 3, 5)]
>x[c(-2, -4)]
>x[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
• Mixing positive and negative values is not allowed, and will throw an error:
>x[c(1, -1)] #This doesn't make sense!
>names(x) <- c("one", "four", "nine", "sixteen", "twenty five")
>x[c("one", "nine", "twenty five")]
>x[c(1, NA, 5)]
>x[c(TRUE, FALSE, NA, FALSE, TRUE)]
> 10/3
[1] 3.333333
> options(digits=8)
> 10/3
[1] 3.3333333
> options(digits=10)
> 10/3
[1] 3.333333333

The which function returns the locations where a logical vector is TRUE. This can be useful for switching
from logical indexing to integer indexing:
➢ x<-c(23,12,45,11,2,3,4)
➢ > which(x>10)
➢ [1] 1 2 3 4
>which.min(x)
ADDING SCALARS TO VECTORS
>1:5 + 1 # adds one to each element of the vector
>1:5 + 1:15 # the shorter vector is recycled to match the length of the longer one
>rep(1:5, 3) #repeat function
>rep(1:5, each = 3)
>rep(1:5, times = 1:5)
>rep(1:5, length.out = 7)
>rep.int(1:5, 3) #the same as rep(1:5, 3)
>rep_len(1:5, 13)
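The recycling rule used by the vector arithmetic above can be checked directly. This is only a sketch; the variable names a, b and w are assumptions made for the example.

```r
# Recycling: the shorter vector is repeated until it matches the longer one.
a <- 1:6 + 1:2          # 1:2 is used three times
print(a)

# A scalar is just a length-one vector, so it is recycled across all elements.
b <- 1:5 + 1
print(b)

# rep() makes the recycling pattern explicit.
stopifnot(identical(1:6 + 1:2, 1:6 + rep(1:2, times = 3)))

# If the longer length is not a multiple of the shorter one, R still
# recycles but emits a warning; here we capture it instead of printing it.
w <- tryCatch(1:5 + 1:2, warning = function(cond) conditionMessage(cond))
print(w)
```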

FEW MORE BASIC COMMANDS
To see any dataset in Code editor, Type
>View(women) in Console.
To list the number of rows / columns respectively
>nrow(women)
>ncol(women)
To output a summary about the dataset’s columns.
>summary(women)
To output a summary of a dataset’s structure.
>str(women)
To get the dimensions of a dataset (number of observations and columns)
>dim(women)
To access a column in a dataset
>women$height
To check the type (or class) of a variable, the class function can be used
>class(women)

COERCION
> myNum <- 5.983904798274987298
> class(myNum)
"numeric"
• You can coerce (change type of) numeric string values into
numeric types, like so:
> myString <- "5.60"
> class(myString)
"character"
> myNumber <- as.numeric(myString)
> myNumber
5.6
> class(myNumber)
"numeric"

> myInt <- 209173987
> class(myInt)
"numeric"
• To actually force them to be integers, we need to
invoke a function that manually coerces them,
called as.integer:
> myInt <- as.integer(myInt)
> class(myInt)
"integer"

>myComparison <- 5 > 6
> myComparison
FALSE
> class(myComparison)
"logical"
>myComplex <- complex(1, 3292, 8974892)
>myComplex
3292+8974892i
> class(myComplex)
"complex"

PROGRAM NO:2
IMPORT FROM AND EXPORT TO CSV FILES
• CSV files(Comma Separated Values) are intentionally designed to be
widely supported; any OS or application that imports or exports data
usually has CSV support.
• They do nothing else but hold data - no text formatting for example.
• Excel files hold the same data, but in binary format. This allows the
file to save specifc Excel features - charts, formatting, etc.
• > datacsv<-read.csv("D:/FDP/Stu Info.csv")
• > datacsv
• > s<-subset(datacsv,Sec.Lang=="Sanskrit")
• > write.csv(s,"output.csv")
• > View(read.csv("output.csv")) #read the exported file back in to view it
• > View(s)
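The import, subset and export steps above can be tried without the original file. The sketch below assumes a small made-up table in place of "Stu Info.csv" and writes to a temporary file; only the column name Sec.Lang comes from the slide.

```r
# A small made-up table standing in for the student file.
students <- data.frame(
  Name     = c("A", "B", "C", "D"),
  Sec.Lang = c("Sanskrit", "Telugu", "Sanskrit", "Hindi"),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")   # temporary file, not the real path

# row.names = FALSE stops write.csv() from adding an extra index column.
sanskrit <- subset(students, Sec.Lang == "Sanskrit")
write.csv(sanskrit, csv_path, row.names = FALSE)

# Reading the file back should return the same rows and columns.
roundtrip <- read.csv(csv_path, stringsAsFactors = FALSE)
print(roundtrip)
```

Without row.names = FALSE, the exported file gains an unnamed first column that comes back as a column called X when the file is read again.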

VECTORS AND LISTS
• The most essential of all, the vector, is a collection of elements of the
same type.
• A vector can only have elements of the exact same type. Vectors are
usually created with the shorthand c (concatenate) function:
> myVector <- c("Hello", "World", "Third Element")
> class(myVector)
"character"
> myVector
"Hello" "World" "Third Element"

>myVector <- c("One", "Two", "Three", "Four", "Five",
"Six", "Seven", "Eight", "Nine", "Ten", "Eleven",
"Twelve", "Thirteen", "Fourteen", "Fifteen")
> myVector
[1] "One" "Two" "Three" "Four" "Five" "Six" "Seven"
[8] "Eight" "Nine" "Ten" "Eleven" "Twelve" "Thirteen"
"Fourteen"
[15] "Fifteen"

• Note that vectors are strictly one-dimensional. You cannot add another
vector as an element inside an existing vector – their elements get merged
into one:
➢ > v1 <- c("a", "b", "c")
➢ > v2 <- c("d", "e", "f")
➢ > v3 <- c(v1, v2)
➢ > v3
➢ [1] "a" "b" "c" "d" "e" "f"
• You can generate entire numeric vectors by specifying a range:
• > myRange <- c(1:10)
• > myRange
[1] 1 2 3 4 5 6 7 8 9 10

LISTS
Lists are like vectors, except that their
elements need not all be of the same type.
They are built with the list function, or with
the c function if one of the elements you’re
adding is a list:

LISTS
• The following variable x is a list containing
copies of three vectors n, s, b, and a numeric
value 3.
• > n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
> x = list(n, s, b, 3) # x contains copies of n, s, b

• Pepper shaker analogy: if list x is a pepper shaker full of packets, x[1] is a shaker holding a single packet, x[[1]] is the
packet itself, and x[[1]][[1]] is the content taken out of that packet.
• A single bracket always returns another list, with as many elements as the number of indices you pass into
the bracket.
• In contrast, a double bracket always returns exactly one element.
• NOTE: THE MAJOR DIFFERENCE BETWEEN THE TWO IS THAT A SINGLE BRACKET RETURNS A LIST WITH AS MANY ELEMENTS AS YOU
WISH, WHILE A DOUBLE BRACKET RETURNS ONLY A SINGLE ELEMENT FROM THE LIST, NOT WRAPPED IN A LIST.

• Member Reference
• In order to reference a list member directly, we have to
use the double square bracket "[[]]"operator. The
following object x[[2]] is the second member of x. In
other words, x[[2]] is a copy of s, but is not a slice
containing s or its copy.
• > x[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
• We can modify its content directly.
• > x[[2]][1] = "ta"
> x[[2]]
[1] "ta" "bb" "cc" "dd" "ee"
> s
[1] "aa" "bb" "cc" "dd" "ee" # s is unaffected
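The difference between the single and the double bracket can be seen by checking the class of each result. This sketch reuses the list x from the slides; slice and member are names chosen for the example.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- list(n, s, b, 3)

# Single bracket: always a list, one element per index supplied.
slice <- x[c(2, 4)]
print(class(slice))
print(length(slice))

# Double bracket: the element itself, here a character vector.
member <- x[[2]]
print(class(member))

# Chained indexing reaches inside the extracted element.
print(x[[2]][3])
```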

MATRICES
• Matrices are vectors with a dimension attribute. The dimension
attribute is itself an integer vector of length 2 (nrow, ncol)
• > m <- matrix(nrow = 2, ncol = 3)
• > m
• [,1] [,2] [,3]
• [1,] NA NA NA
• [2,] NA NA NA
• > dim(m)
• [1] 2 3
• > attributes(m)
• $dim
• [1] 2 3

>m<-matrix(nrow=3,ncol=2,c(1,2,3,4,5,6)) # filled column by column
>m
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m<-matrix(c(1,2,3,4)) # a single column by default
➢ m<-matrix(c(1,2,3,4),7,8) # the 4 values are recycled to fill 7 rows and 8 columns
➢ m<- matrix(1:9,nrow=3,ncol=3,byrow=TRUE) # filled row by row
➢ matrix(1,nrow=10,ncol=10)
➢ A <- matrix(1:16,4,4)
➢ z <- A[2,3] # element in the 2nd row and 3rd column of matrix A, assigned to z
➢ > A[2:4,4:2] # rows 2, 3 and 4 with columns 4, 3 and 2, in that order, as a sub-matrix
➢ > A[2,2:3] # second row, 2nd and 3rd column elements
>second.column <- A[,2] # the whole second column
➢ >which(A>8) # positions (not values) of the elements greater than 8; A[A>8] gives the values
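A runnable sketch of the indexing commands above, assuming A is a 4 x 4 matrix holding 1 to 16 (an assumption chosen so that every index used is in range).

```r
A <- matrix(1:16, nrow = 4, ncol = 4)   # filled column by column

z <- A[2, 3]                 # single element: row 2, column 3
sub <- A[2:4, 4:2]           # rows 2-4, columns taken in the order 4, 3, 2
second_column <- A[, 2]      # whole column, returned as a plain vector

print(z)
print(sub)

# which() returns positions; arr.ind = TRUE turns them into row/column pairs.
print(which(A >= 15))
print(which(A >= 15, arr.ind = TRUE))

# drop = FALSE keeps the result a matrix even when one column is selected.
print(dim(A[, 2, drop = FALSE]))
```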

ARRAYS
An array is just a vector plus information on the dimensions of the array.
We can create an array from a vector:
➢ X <- array(1:24,dim=c(3,4,2)) # 24 elements arranged as 2 matrices of 3 rows and 4 columns each
➢ x <- seq(1,27)
➢ > c(3,9)
➢ [1] 3 9
➢ > dim(x)=c(3,9) # setting the dim attribute turns the vector into a 3 x 9 array
➢ > is.array(x)
➢ [1] TRUE
➢ > is.matrix(x)
➢ [1] TRUE
➢ > x
➢ [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
➢ [1,] 1 4 7 10 13 16 19 22 25
➢ [2,] 2 5 8 11 14 17 20 23 26
➢ [3,] 3 6 9 12 15 18 21 24 27

DATA FRAMES
• Data frames are used to store tabular data.
• They are represented as a special type of list where
every element of the list has to have the same length
.
• Each element of the list can be thought of as a
column and the length of each element of the list is
the number of rows.
• Unlike matrices, data frames can store different
classes of objects in each column (just like lists);

• A data frame is used for storing data tables. It
is a list of vectors of equal length.
• > n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b) # df is a data frame

• Cell value from the first row, second column of mtcars.
• > mtcars[1, 2]
[1] 6
• Can use the row and column names instead of the numeric coordinates.
• > mtcars["Mazda RX4", "cyl"]
[1] 6
• Lastly, the number of data rows in the data frame is given by
the nrow function.
• > nrow(mtcars) # number of data rows
[1] 32
• And the number of columns of a data frame is given by the ncol function.
• > ncol(mtcars) # number of columns
[1] 11
•

• We reference a data frame column with the double square bracket "[[]]" operator.
• For example, to retrieve the ninth column vector of the built-in data set mtcars, we
write mtcars[[9]].
• > mtcars[[9]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
• We can retrieve the same column vector by its name.
• > mtcars[["am"]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
• We can also retrieve with the "$" operator in lieu of the double square bracket operator.
• > mtcars$am
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
• Yet another way to retrieve the same column vector is to use the single
square bracket "[]"operator. We prepend the column name with a comma character,
which signals a wildcard match for the row position.
• > mtcars[,"am"]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...

• >x <- read.csv("data1.csv",header=T, sep=",")
• >x2 <- read.csv("data2.csv",header=T, sep=",")
•
• >x3 <- cbind(x,x2)
• >x3
• Subtype Gender Expression Age City
• 1 A m -0.54 32 New York
• 2 A f -0.80 21 Houston
• 3 B f -1.03 34 Seattle
• 4 C m -0.41 67 Houston
>which(A>=15,arr.ind=TRUE) # here A is a 4 x 4 matrix holding 1:16, e.g. A <- matrix(1:16,4,4)
row col
[1,] 3 4
[2,] 4 4
Similarly, we can assign values to part of a matrix:
>A[1,1:3] <- c(2,4,5)

EXP NO 3: GETTING AND CLEANING DATA
WITH SWIRL
Swirl is an interactive package which will teach
us and at the same time make us practice with
the exercises.
It has three types of exercises, basic,
intermediate and advanced.
Getting and cleaning data is an intermediate
exercise.

WHAT IS SWIRL() IN R
• swirl is a software package for the R programming
language that turns the Rconsole into an interactive
learning environment. Users receive immediate
feedback as they are guided through self-paced
lessons in data science and R programming.
➢install.packages("swirl")
➢library(swirl)
➢install_from_swirl("Getting and Cleaning Data")
➢

>install.packages("swirl")
>library(swirl)
➢install_course("Getting and Cleaning Data")
➢swirl()
➢

SWIRL() Flow..
• | Please choose a course, or type 0 to exit swirl.
•
• 1: Getting and Cleaning Data
• 2: R Programming
• 3: Take me to the swirl course repository!
•
• Selection: 1
•
• | Please choose a lesson, or type 0 to return to course
• | menu.
•
• 1: Manipulating Data with dplyr
• 2: Grouping and Chaining with dplyr
• 3: Tidying Data with tidyr
• 4: Dates and Times with lubridate

ABOUT PACKAGES COMING WITH GETTING
AND CLEANING DATA
• For this we use three packages: dplyr,
tidyr, lubridate.
• dplyr is a package that provides a consistent
and concise grammar for manipulating tabular
data. It makes data manipulation easier.

About dplyr package from swirl()
According to the "Introduction to dplyr"
vignette written by the package authors, "The
dplyr philosophy is to have small functions that
each do one thing well."
Specifically, dplyr supplies five 'verbs' that cover
most fundamental data manipulation tasks:
select(), filter(), arrange(), mutate(), and
summarize().

Data manipulation using dplyr
• install.packages("dplyr") ## install
• You might get asked to choose a CRAN mirror – this is basically
asking you to choose a site to download the package from. The
choice doesn’t matter too much; We recommend the RStudio
mirror.
• library("dplyr") ## load
• You only need to install a package once per computer, but you need
to load it every time you open a new R session and want to use that
package.

Selecting columns and filtering rows
• To select columns of a data frame, use select().
The first argument to this function is the data
frame (ToothGrowth), and the subsequent
arguments are the columns to keep.
• select(ToothGrowth, len, supp, dose)
>aa<-select(ToothGrowth,len,supp,dose)

• Select():
To select columns of a data frame
• select(ToothGrowth, len, supp, dose)
>plot(aa)
• Filter():
To choose rows
• filter(ToothGrowth, len==5)

• Filter():
To choose rows
• filter(ToothGrowth, len>5)
Pipes (%>%)
• Without pipes, you must nest functions (i.e. put one function inside another).
• Pipes let you take the output of one function and
send it directly to the next, which is useful when
you need to do many things to the same data set.
>ToothGrowth %>%
+ filter(len < 5) %>%
+ select(len,supp,dose)

• To create a new object with this smaller
version of the data we could do so by assigning
it a new name.
>ToothGrowth_sml <- ToothGrowth %>%
+ filter(len < 5) %>%
+ select(len,supp,dose)
➢MUTATE():
• create new columns based on the values in
existing columns

>ToothGrowth %>%
+ mutate(len = len/ 4)
• If this runs off your screen and you just want
to see the first few rows, you can use a pipe to
view the head() of the data
>ToothGrowth %>%
+ mutate(len=len/4) %>%
+ head

• ToothGrowth has no missing values, but if a
result did contain NAs, we could remove those rows by inserting
filter() in this chain:
>ToothGrowth %>%
+ mutate(len = len/ 4) %>%
+ filter(!is.na(len)) %>%
+ head

➢group_by():
• group_by() splits the data into groups upon which some operations can
be run
>ToothGrowth %>% group_by(supp) %>% tally()
➢summarize():
• group_by() is often used together with summarize(), which
collapses each group into a single-row summary of that group.
>ToothGrowth %>% group_by(supp) %>% summarize(len = mean(len,
na.rm = TRUE))
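To check what the grouped pipelines above should return, the same counts and means can be computed with base R. This is a cross-check using table(), aggregate() and tapply(), not the dplyr method itself, so it runs even where dplyr is not installed.

```r
# Count rows per group: the base-R counterpart of group_by(supp) %>% tally().
counts <- table(ToothGrowth$supp)
print(counts)

# Mean tooth length per supplement: counterpart of
# group_by(supp) %>% summarize(len = mean(len, na.rm = TRUE)).
means <- aggregate(len ~ supp, data = ToothGrowth, FUN = mean)
print(means)

# tapply() gives the same means as a named vector.
means_vec <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)
print(means_vec)
```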

Data Frame Column Slice
• We retrieve a data frame column slice with the single square bracket "[]" operator.
• Numeric Indexing
• The following is a slice containing the first column of the built-in data set mtcars.
• > mtcars[1]
mpg
Mazda RX4 21.0
Mazda RX4 Wag 21.0
Datsun 710 22.8
............
• Name Indexing
• We can retrieve the same column slice by its name.
• > mtcars["mpg"]
mpg
Mazda RX4 21.0
Mazda RX4 Wag 21.0
Datsun 710 22.8
............
• To retrieve a data frame slice with the two columns mpg and hp, we pack the column names in an index vector
inside the single square bracket operator.
• > mtcars[c("mpg", "hp")]
mpg hp
Mazda RX4 21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710 22.8 93
............
•

Exp 5. Creating Data Frame
emp.data <- data.frame( emp_id = c (1:5),
emp_name = c("Ratna","Kumar","Kamala","Prajwal","Pravachan"),
salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
"2014-05-11", "2015-03-27")), stringsAsFactors = FALSE )
>emp.data
# Add the "dept" column.
➢ emp.data$dept <- c("IT","Operations","IT","HR","Finance")
➢ v <- emp.data
➢ print(v)

Extracting rows and columns
A=emp.data$emp_id
B=emp.data$emp_name
a)C=data.frame(A,B)
b)emp.data[1:2,]
c)emp.data[c(3,5),c(2,4)]

➢emp.data[1:2,]
➢emp_id emp_name salary start_date
➢1 1 Ratna 623.3 2012-01-01
➢2 2 Kumar 515.2 2013-09-23
➢ > emp.data[c(3,5),c(2,4)]
➢ emp_name start_date
➢ 3 Kamala 2014-11-15
➢5 Pravachan 2015-03-27

PROGRAM 6: ‘apply’ group of functions

Function | Arguments             | Objective                                         | Input                      | Output
apply    | apply(x, MARGIN, FUN) | Apply a function to the rows or columns or both   | Data frame or matrix       | vector, list, array
lapply   | lapply(X, FUN)        | Apply a function to all the elements of the input | List, vector or data frame | list
sapply   | sapply(X, FUN)        | Apply a function to all the elements of the input | List, vector or data frame | vector or matrix
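A short sketch of the three functions in the table, using a small matrix and a made-up character vector (m and words are names assumed for the example); women is the built-in dataset used earlier.

```r
m <- matrix(1:6, nrow = 2)          # 2 rows, 3 columns

row_sums <- apply(m, 1, sum)        # MARGIN = 1: one result per row
col_sums <- apply(m, 2, sum)        # MARGIN = 2: one result per column
print(row_sums)
print(col_sums)

words <- c("data", "mining", "R")

as_list <- lapply(words, nchar)     # always returns a list
as_vec  <- sapply(words, nchar)     # simplifies to a named vector when it can
print(as_list)
print(as_vec)

# On a data frame, lapply/sapply iterate over the columns.
print(sapply(women, mean))
```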

PROGRAM 7- cbind-ing and rbind-ing
• Matrices can be created by column-binding or row-binding with cbind() and
rbind().
• > x <- 1:3
• > y <- 10:12
• > cbind(x, y)
• x y
• [1,] 1 10
• [2,] 2 11
• [3,] 3 12
• > rbind(x, y)
[,1] [,2] [,3]
• x 1 2 3
• y 10 11 12
>C <- cbind(1:3,4:6,5:7)
>D <- rbind(1:3,4:6)

PROGRAM 7:
Rbind() and cbind() functions.
• Matrices can be created by column-binding or row-binding with cbind() and
rbind().
• Data frames can also be appended by these functions.
• > x <- 1:3
• > y <- 10:12
• > cbind(x, y)
– x y
• [1,] 1 10
• [2,] 2 11
• [3,] 3 12
• > rbind(x, y)
• [,1] [,2] [,3]
• x 1 2 3
• y 10 11 12

Factor Variables
Factor variables are nothing but nominal variables and
also known as categorical variables.
Levels are nothing but unique values in the variable
values.
➢gender <- c(rep("male",20), rep("female", 30))
➢ gender<-factor(gender)
➢Levels: female male # Factor variables
➢summary(gender)
➢female male
30 20

PROGRAM 8: DISCRETE IRIS
➢ iris$Seplen<- cut(iris$Sepal.Length, breaks=c(4.3,5.6,6.8,7.9),
labels=c("low","medium","high"))
➢ > iris$Seplen
➢ [1] low low low low low low low low
➢ [9] low low low low low <NA> medium medium
➢ [17] low low medium low low low low low
➢ [25] low low low low low low low low
➢ [33] low low low low low low low low
➢ [41] low low low low low low low low
➢ [49] low low high medium high low medium medium
➢ [57] medium low medium low low medium medium medium
➢ [65] …..
➢ Levels: low medium high
➢ # The <NA> appears because cut() intervals are open on the left by default, so the smallest value (4.3)
falls outside the first interval; adding include.lowest=TRUE to cut() assigns it to "low".
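The effect of the lowest break on the discretised column can be checked as follows. The sketch works on separate vectors (default_cut and full_cut are names assumed here) rather than modifying iris.

```r
breaks <- c(4.3, 5.6, 6.8, 7.9)
labels <- c("low", "medium", "high")

# Default: intervals are (4.3, 5.6], (5.6, 6.8], (6.8, 7.9]; 4.3 itself is excluded.
default_cut <- cut(iris$Sepal.Length, breaks = breaks, labels = labels)
print(sum(is.na(default_cut)))

# include.lowest = TRUE closes the first interval on the left: [4.3, 5.6].
full_cut <- cut(iris$Sepal.Length, breaks = breaks, labels = labels,
                include.lowest = TRUE)
print(table(full_cut, useNA = "ifany"))
```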

PROGRAM 9 - SCATTER PLOT USING ‘DPLYR’ ON
GUINEA PIGS ‘TOOTHGROWTH’ DATA SET

➢ aa<-select(ToothGrowth,len,supp,dose)
# To choose rows we use filter()
➢ > filter(ToothGrowth,len<=14.5)
➢ > ToothGrowth %>%
➢ + group_by(supp)
• > ToothGrowth %>%
• + group_by(supp) %>%
• + summarise(meanoflen=mean(len))
• > plot(aa)

gg = grammar of graphics
➢ library(dplyr)
➢ > library(ggplot2) # the package is called ggplot2; there is no package named ggplot
➢ > ggplot(aa,aes(x=factor(dose),y=len,fill=supp))
➢ > ggplot(aa,aes(x=factor(dose),y=len,fill=supp))+geom_boxplot()
➢ # aes = aesthetic mappings

PROGRAM-10…LINEAR AND MULTIPLE
REGRESSION
Regression: A technique for determining the
statistical relationship between two or more
variables where a change in a dependent
variable is associated with, and depends on, a
change in one or more independent variables.
Linear Regression: Y = mX + c
Single predictor, X; Y is the response
(Figure: Y plotted against X)

Multiple Linear Regression: Y = a·X3 + b·X2 + c·X1 + d
3 predictors/explanatory variables: X3, X2, X1
a, b, c are coefficients
d is the intercept (bias value)
Y is the response variable
Y is estimated or predicted from the 3 X
variables.

Mtcars variables
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors

lm = linear model
> library(ggplot2)
>ggplot(mtcars,aes(wt,mpg))
>ggplot(mtcars,aes(wt,mpg))+geom_point()
>ggplot(mtcars,aes(wt,mpg))+geom_point()+geom_smooth(method="lm")

• For example in the mtcars dataset, you can
build a linear model between the fuel
efficiency (mpg) and the weight of the car
(wt):
mpg = β0 + β1·wt
• β1 is the slope; β0 is the intercept
• mpg is the dependent variable; wt is the independent variable

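• Residuals. The difference between the observed
value of the dependent variable (y) and the
predicted value (ŷ) is called the residual (e): e = y - ŷ.
• Each data point has one residual.
• Example at x = 3 with c = 5: the observed value is y = 10*3+5 = 35.
• The fitted model has slope m = 9, i.e. ŷ = 9x + c.
• The predicted value is ŷ = 9*3+5 = 32, so the residual is e = 35 - 32 = 3.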

> mfit = lm(mpg ~ wt + disp + cyl, data=mtcars)
> plot(mfit)
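The residual definition can be verified on a fitted model. The sketch below fits the single-predictor model mpg ~ wt described earlier and then summarises the three-predictor fit; fit and manual_resid are names assumed for the example.

```r
# Simple linear regression of fuel efficiency on weight.
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))                       # intercept and slope

# Residual = observed - predicted, one per car.
manual_resid <- mtcars$mpg - fitted(fit)
print(head(manual_resid))

# With an intercept in the model, least-squares residuals sum to zero
# (up to floating-point rounding).
print(sum(residuals(fit)))

# The multiple-regression fit, summarised as a coefficient table.
mfit <- lm(mpg ~ wt + disp + cyl, data = mtcars)
print(summary(mfit)$coefficients)
```

The coefficient table lists an estimate, standard error, t value and p-value for the intercept and each predictor, which is one way to judge which predictors matter.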

PROGRAM NO: 11
Major Clustering Approaches (I)
• Partitioning approach:
– Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
– Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
– Create a hierarchical decomposition of the set of data (or objects) using
some criterion
– Typical methods: Diana, Agnes, BIRCH, CHAMELEON
• Density-based approach:
– Based on connectivity and density functions
– Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
– based on a multiple-level granularity structure
– Typical methods: STING, WaveCluster, CLIQUE

K-means clustering
• names(iris)
• [1] "Sepal.Length" "Sepal.Width" "Petal.Length"
• [4] "Petal.Width" "Species"
•
• > x<-iris[,-5]
•
• > y<-iris$Species
•
• > kc<-kmeans(x,3)
•
• > kc
•
• K-means clustering with 3 clusters of sizes 38, 62, 50
•
• Cluster means:
• Sepal.Length Sepal.Width Petal.Length Petal.Width
• 1 6.850000 3.073684 5.742105 2.071053
• 2 5.901613 2.748387 4.393548 1.433871
• 3 5.006000 3.428000 1.462000 0.246000
•

• Clustering vector:
• [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
• [29] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2
• [57] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
• [85] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 1
• [113] 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1
• [141] 1 1 2 1 1 1 2 1 1 2
•
• Within cluster sum of squares by cluster:
• [1] 23.87947 39.82097 15.15100
• (between_SS / total_SS = 88.4 %)
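Because k-means starts from randomly chosen centres, cluster numbers and sizes can change from run to run. A minimal sketch for a repeatable run, assuming an arbitrary seed of 42 and 25 random starts (both values are illustrative choices, not from the slides):

```r
x <- iris[, -5]                 # the four numeric measurements

# Fix the seed for repeatable output; nstart = 25 keeps the best of
# 25 random starts.
set.seed(42)
kc <- kmeans(x, centers = 3, nstart = 25)

print(kc$size)
print(kc$tot.withinss)

# Cross-tabulate clusters against the known species to judge the grouping.
# Cluster numbers are arbitrary labels and may differ between runs.
print(table(cluster = kc$cluster, species = iris$Species))
```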

>plot(x[c("Sepal.Length","Sepal.Width")],col=kc$cluster)

K-means
>points(kc$centers[,c("Sepal.Length",
"Sepal.Width")], col=1:3, pch=23, cex=3)

• > library(fpc)
• > pamresult<-pamk(x) # x is iris[,-5], the measurements without the Species column
• > pamresult$nc # nc = number of clusters
• [1] 2
• > table(pamresult$pamobject$clustering,iris$Species)
•
• setosa versicolor virginica
• 1 50 1 0
• 2 0 49 50
• > layout(matrix(c(1,2),1,2)) # two plots side by side
> plot(pamresult$pamobject)

• The ggplot() command creates a plot object. In it
we assigned a data set.
• aes() creates what Hadley Wickham calls an
aesthetic: a mapping of variables to various parts of
the plot. ...
• Another way to split up the way we look at data is
with facets.

> ggplot(mtcars,aes(wt,mpg))
Error in ggplot(mtcars, aes(wt, mpg)) : could not find function "ggplot"
> library(ggplot2)
> ggplot(mtcars,aes(wt,mpg))
> ggplot(mtcars,aes(wt,mpg))+geom_point()
> ggplot(mtcars,aes(wt,mpg))+geom_point()+geom_smooth(method="lm")
> ggplot(mtcars,aes(wt,mpg))+geom_point()+geom_abline()

➢> library(ggplot2)
➢> ggplot(mtcars,aes(wt,mpg))
➢> ggplot(mtcars,aes(wt,mpg))+geom_point()
➢> ggplot(mtcars,aes(wt,mpg))+geom_point()+geom_smooth(method="lm")

> ggplot(mtcars, aes(x=wt, y=mpg, col=cyl, size=disp)) + geom_point()

What combination of predictors will best predict
fuel efficiency?(Slope/Coefficients and
intercepts)
Which predictors increase our accuracy by a
statistically significant amount?
We should guess which predictors are
significant, and to determine the ideal formula
for prediction….WHICH IS WHAT WE CALL
LINEAR REGRESSION.

Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as
density-connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition

Density-Based Clustering: Basic Concepts
• Two parameters:
– Eps: Maximum radius of the neighbourhood
– MinPts: Minimum number of points in an Eps-
neighbourhood of that point
• NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
• Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
– p belongs to NEps(q)
– core point condition:
|NEps (q)| ≥ MinPts
(Figure: points p and q, with MinPts = 5 and Eps = 1 cm)

Density-Reachable and Density-Connected
• Density-reachable:
– A point p is density-reachable from a
point q w.r.t. Eps, MinPts if there is a
chain of points p1, …, pn, p1 = q, pn = p
such that pi+1 is directly density-
reachable from pi
• Density-connected
– A point p is density-connected to a
point q w.r.t. Eps, MinPts if there is a
point o such that both, p and q are
density-reachable from o w.r.t. Eps
and MinPts

DBSCAN: Density-Based Spatial Clustering of Applications
with Noise
• Relies on a density-based notion of cluster: A cluster is defined
as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with
noise
(Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5)

DBSCAN: The Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p
and DBSCAN visits the next point of the database
• Continue the process until all of the points have been
processed
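The steps above can be sketched in base R. This is a minimal illustration written for these notes, not the implementation from any package: the function name dbscan_sketch, the toy points, and the parameter values are all assumptions, and it builds a full distance matrix, so it only suits small data.

```r
# Minimal DBSCAN sketch (illustrative only; O(n^2) memory and time).
dbscan_sketch <- function(pts, eps, min_pts) {
  d <- as.matrix(dist(pts))                    # pairwise Euclidean distances
  n <- nrow(d)
  # Eps-neighbourhood of every point (each point counts itself).
  neighbours <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  is_core <- lengths(neighbours) >= min_pts    # core point condition
  cluster <- integer(n)                        # 0 = noise / not yet assigned
  current <- 0L
  for (p in seq_len(n)) {
    if (cluster[p] != 0L || !is_core[p]) next  # only unassigned core points start a cluster
    current <- current + 1L
    cluster[p] <- current
    queue <- neighbours[[p]]
    while (length(queue) > 0) {
      q <- queue[1]
      queue <- queue[-1]
      if (cluster[q] != 0L) next
      cluster[q] <- current                    # q is density-reachable from p
      if (is_core[q]) queue <- c(queue, neighbours[[q]])  # border points are not expanded
    }
  }
  list(cluster = cluster, is_core = is_core)
}

# Toy data: two tight groups of five points and one isolated point.
pts <- rbind(
  cbind(c(0, 0.1, 0.2, 0.1, 0), c(0, 0.1, 0, -0.1, 0.2)),
  cbind(c(5, 5.1, 5.2, 5.1, 5), c(5, 5.1, 5, 4.9, 5.2)),
  c(20, 20)
)
res <- dbscan_sketch(pts, eps = 0.5, min_pts = 3)
print(res$cluster)
```

Points left with cluster 0 at the end are the noise (outlier) points; for real work a tested package implementation would normally be preferred over this sketch.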

Before Package ‘rpart’
Title: Recursive Partitioning and Regression Trees
A regression line is a straight line
that attempts to describe the
relationship between two variables;
it is also known as a trend line or line
of best fit.
Simple linear regression predicts
a variable (y) that depends on a second variable (x), based on the
regression equation fitted to a given set of data.

Decision trees are of two types
Classification Trees
Regression Trees
CTs are used when the target or
response variable is
categorical in nature.
RTs are used when the target
variable is continuous or
numeric.
It is the target variable that
determines the type of
decision tree needed.

DECISION TREES USING PARTY-PROGRAM
12
• > install.packages("readr")
• > library(readr)
• > install.packages("party")
• Installing package into ‘C:/Users/My Document/Documents/R/win-library/3.4’
• (as ‘lib’ is unspecified)
• trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/party_1.2-3.zip'
• Content type 'application/zip' length 719826 bytes (702 KB)
• downloaded 702 KB
•
• package ‘party’ successfully unpacked and MD5 sums checked
•
• The downloaded binary packages are in
• C:\Users\My Document\AppData\Local\Temp\RtmpOAuKaM\downloaded_packages
• > library(party)

DECISION TREE USING RPART..PROGRAM 12
rpart(formula, data, weights, subset, na.action =
na.rpart, method, model = FALSE, x = FALSE, y
= TRUE, parms, control, cost, ...)
> library(rpart)
> tree<-rpart(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=iris,method="class")

• > View(iris)
• > iris$Species<-as.factor(iris$Species) # Species is already a factor in the built-in iris data; this step is needed only if the data was read from a file as text
• > tree1<-ctree(Species~Sepal.Length, data=iris)
• > plot(tree1)

➢ tree<-rpart(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=iris,method="class")
> plot(tree)

> plot(tree, uniform=TRUE,main="Classification Tree for Iris dataset")
> text(tree, use.n=TRUE, all=TRUE, cex=.8)
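Besides plotting the tree, its predictions can be compared with the known species. A sketch under the assumption that the rpart package is available (it is normally installed together with R); pred, confusion and accuracy are names chosen for the example.

```r
library(rpart)   # normally installed together with R

# Classification tree: the target (Species) is categorical.
tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
              data = iris, method = "class")
print(tree)

# Predict the class of each training row and compare with the truth.
pred <- predict(tree, newdata = iris, type = "class")
confusion <- table(predicted = pred, actual = iris$Species)
print(confusion)

# Training accuracy only; it is an optimistic estimate because the same
# rows were used to fit the tree.
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

For an honest estimate, the data would be split into training and test rows before fitting; that step is left out here to keep the sketch short.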

ABOUT DIFFERENT TYPES OF VARIABLES

FEW GOOD WEB SITES ON R
www.kaggle.com
www.rdocumentation.org
www.statmethods.net
www.r-tutor.com
www.tutorialspoint.com
www.datacamp.com
www.github.com
https://drsimonj.svbtle.com/visualising-residuals
