ADIKAVI NANNAYA UNIVERSITY
UNIVERSITY COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ONE DAY ORIENTATION PROGRAM ON
DATA MINING USING R PROGRAMMING
11TH Dec 2017. Dr. M. Kamala Kumari
Assoc Prof
OUR WAY…
DATA….A BASE THING
DIFFERENCES BETWEEN RELATED TERMS
OBJECTIVES OF PROCESSING THINGS
STEPS IN DATA ANALYSIS
DIFFERENT ANGLES OF DATA SCIENCE
OBJECTIVES OF ALL STORIES
WHAT IS THE ROLE OF R
DEFINITIONS OF R
VARIATIONS OF R
COMPETITORS OF R
WHY R
CRAN R
RSTUDIO
BASIC COMMANDS
PROGRAM 1 TO PROGRAM 13.
Base to anything---Data!!
Processing Data =Applying Statistics on Data
Data
Context
345(423)
260
No: of UG Affiliated Colleges to
AKNU
No: of PG Affiliated Colleges to AKNU
Total No: of Affiliated Colleges to AKNU
Information
AKNU has more number of UG
Affiliations than PG
Analysis = Understanding Information
85
Decision Making Decide whether to give affiliation
for UG College or not!!
THE ABOVE PROCESS CAN BE VIEWED WITH R –SHOWING DATA,PROCESSING AND RESULTS
ALL IN ONE ENVIRONMENT…..LET’S MAKE DECISION EASY WITH R!!!!
DATA, INFORMATION AND KNOWLEDGE
KNOWLEDGE IS USEFUL INFORMATION OBTAINED
THROUGH LEARNING AND EXPERIENCE
KNOWLEDGE DOES NOT NEED DIRECT INTERACTION
WITH DATA
PREDICTION IS POSSIBLE WITH REQUIRED
KNOWLEDGE BUT NOT WITH INFORMATION ALONE
NEED INFORMATION TO GET KNOWLEDGE
INFORMATION IS PROCESSED DATA
KNOWLEDGE IS PROCESSING PATTERNS OF
INFORMATION ASSOCIATED WITH EXPERIENCE
KNOWLEDGE REQUIRES COGNITIVE (REASONING,
PERCEPTION) ABILITY
....WHEREAS INFORMATION DOES NOT
INFORMATION
KNOWLEDGE
DATA
KNOWLEDGE==SCIENCE??
• Data ==Facts
• Statistics ==Data + Formulae
• Information==Description of Statistics(Reduce
errors)
• Analysis == Understanding Information or
Insights of Data and info
• Analytics == Algorithms/Techniques on Data
• Knowledge == Understanding information
and technical results
• Data Mining == Analytics==Querying…???...YES
STEPS IN DATA ANALYSIS
ETL and DATA ANALYTICS stages, starting from DATA:
• Collect
• Organize: arrange in a particular format
• Clean: remove errors and fill gaps
• Explore: apply statistics, techniques
• Model: apply algorithms
• Reports/Graphics: visualization techniques/tools
DATA ANALYSIS DATA ANALYTICS DATA MINING AND
DATA SCIENCE --- WE ALL ARE RELATED !!
Data Science
DATA ANALYSIS
DCD
DATA
MINING
DATA ANALYTICS
DATA
WAREHOUSING
DAWN TO DUSK=DATA SCIENCE!!
Domain
Expert
SELECT
H/W STATISTICS
ETL
Data
Modeling
Computing
data
Visualization
Prediction
DATA SCIENCE ASSOCIATIONS
THE OBJECTIVES OF ALL THE STORIES
BEHIND!!.....CONTD
• DESCRIPTION
• COMPARISON
• CLASSIFICATION
• COMBINE SIMILAR
THINGS
• GENERATE RULES
UNDERSTAND
ACQUIRE KNOWLEDGE
….AND…..
PREDICT/DECIDE
ROLE OF ‘R’…IN WHICH STORY
The R language is widely used among statisticians
and data miners for developing statistical software
and data analysis.
Instead of long programs, R makes statistical computations and
their visualization easy (ready-made functions, less
programming, and many packages available)
R is one of the analytical tools
WE CAN DEFINE R TO BE….
R IS A PROGRAMMING LANGUAGE
R IS AN ANALYTICAL TOOL
R IS A SCRIPTING LANGUAGE
R STUDIO IS A SOFTWARE ENVIRONMENT
A B C D E …S..R..!!..?
R – A free and open source software
programming language for statistical
computing and graphics.
• Founders of R: Ross Ihaka & Robert Gentleman
RSTUDIO
• RStudio is an IDE for developing in R, founded by JJ
Allaire
• R is an implementation of the S language, a statistical
language.
• Latest version of R = R 3.4.2 for Windows
32/64-bit
VARIATIONS OF R
• R – free implementation of the S (programming
language)
• pbdR – Programming with Big Data R
• R Commander– GUI interface for R
• Rattle GUI– GUI interface for R
• Revolution Analytics – production-grade software
for the enterprise big data analytics
• RStudio – GUI interface and development
environment for R
COMPETITORS OF R
• MS Excel - Microsoft Excel Sheet
• SAS - Statistical Analysis System
• SPSS - Statistical Package for Social Science
• MATLAB -Matrix Laboratory
• OCTAVE -Helps in solving linear and nonlinear
problems numerically.
• Python -Another programming language, which
expresses concepts in fewer lines of code.
• Spark -Provides Interface for programming
entire cluster with implicit data parallelism
• Storm - Distributed Real time computation System
THEN WHY R??
• More powerful data manipulation capabilities
• Easier automation
• Faster computation
• It reads any type of data
• Easier project organization
• It supports larger data sets
• Reproducibility (important for detecting errors)
• Easier to find and fix errors
• It's free
• It's open source
• Advanced Statistics capabilities
• State-of-the-art graphics
• It runs on many platforms
• Anyone can contribute packages to improve its functionality
INVITE R AND RSTUDIO…
• Download and install the latest
R: http://www.r-project.org/
• Download and install RStudio, the R
IDE: http://www.rstudio.com/
CRAN R
• The “Comprehensive R Archive Network” ( CRAN ) is a
collection of sites which carry identical material, consisting
of the R distribution(s), the contributed extensions,
documentation for R, and binaries.
• R FAQ - The R Project for Statistical Computing
• CRAN is a network of ftp and web servers around the world
that store identical, up-to-date, versions of code and
documentation for R. Please use the CRAN mirror nearest
to you to minimize network load.
Welcome to RStudio..!!
Get and Set working directories
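> getwd()
[1] "C:/Users/My Document/Documents"
> setwd("C:/Program Files/R/R-3.4.3/bin/i386")
> getwd()
[1] "C:/Program Files/R/R-3.4.3/bin/i386"
> dir() # list the files in the working directory
> data() # list the available datasets
> ls() # list the objects in the workspace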
SIMPLE COMMANDS
TO INSTALL ANY PACKAGE
>install.packages("package name")
We can install any package if we know the correct name
suitable for that version
TO SEE ALL LIST OF DATASETS
>data()
TO LOAD AN INSTALLED PACKAGE IN R
>library(package name)
TO SEE LIST OF PACKAGES INSTALLED IN DIFFERENT
LIBRARIES
>library()
PACKAGE AND LIBRARY…???
Recently, the official repository (CRAN)
reached 25,000 packages published, and many
more are publicly available through the
internet.
"A package is like a book, a library is like a
library; you use library() to check a package out of
the library." (Hadley Wickham, Chief Scientist
at RStudio)
Functions are like pages in a package book!!
COMPLETIONS
YELLOW COLOUR ARE VARIABLES
BLUE COLOURS ARE FOR FUNCTIONS
VIOLET COLOUR WITH P INSIDE AND
TWO :: BESIDE FOR PACKAGES
VIOLET FOR FUNCTION ARGUMENTS OR
VECTORS
GRID FOR DATAFRAMES
Program 1:BASIC COMMANDS-VECTORS
• A vector is a sequence of data elements of the same basic type. The elements of a vector are officially called
components.
> 8.5:4.5 #descending sequence of numbers: 8.5 7.5 6.5 5.5 4.5
➢ rnorm(10) #10 random draws from the standard normal distribution
➢ c(1, 1:3, c(5, 8), 13) #nested vectors are flattened into one vector
THE SAME CAN ALSO BE WRITTEN LIKE THIS
➢ vector("numeric", 5) >numeric(5)
➢ vector("complex", 5) >complex(5)
➢ vector("logical", 5) >logical(5)
➢ vector("list", 5) #no shorthand here: list(5) is a one-element list holding the number 5
➢ vector("character", 5) >character(5)
➢ seq.int(3, 12) #same as 3:12
➢ seq.int(3, 12, 2)
➢ seq.int(0.1, 0.01, -0.01)
➢ seq_len(5)
>seq_len(n)
>pp <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers")
>for(i in seq_along(pp)) print(pp[i])
>length(1:5)
>length(c(TRUE, FALSE, NA))
>sn <- c("Varma", "Persis", "Kamala", "PVRao")
>length(sn)
>nchar(sn)
• Each element of an R vector can be given a name. Labeling the elements can often make your code
much more readable. You can specify names when you create a vector in the form name = value. If
the name of an element is a valid variable name, it doesn’t need to be enclosed in quotes.
c(apple = 1, banana = 2, "kiwi fruit" = 3, 4)
>x <- (1:5) ^ 2
>x[c(1, 3, 5)]
>x[c(-2, -4)]
>x[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
• Mixing positive and negative values is not allowed, and will throw an error:
>x[c(1, -1)] #This doesn't make sense!
>names(x) <- c("one", "four", "nine", "sixteen", "twenty five")
>x[c("one", "nine", "twenty five")]
>x[c(1, NA, 5)]
>x[c(TRUE, FALSE, NA, FALSE, TRUE)]
➢ > 10/3
➢ [1] 3.333333
➢ > options(digits=8)
➢ > 10/3
➢ [1] 3.3333333
➢ > options(digits=10)
➢ > 10/3
➢ [1] 3.333333333
The which function returns the locations where a logical vector is TRUE. This can be useful for switching
from logical indexing to integer indexing:
➢ x<-c(23,12,45,11,2,3,4)
➢ > which(x>10)
➢ [1] 1 2 3 4
>which.min(x)
>1:5 + 1 # adds one to each element of the vector
>1:5 + 1:15 # Smaller vector adds and recycles with the larger one
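The recycling mentioned in the comment above can be checked with a small sketch. The variable names are illustrative and not from the slides. The shorter vector is repeated until it matches the longer one, and R issues a warning when the longer length is not a multiple of the shorter one.

```r
# Recycling: the shorter vector is repeated to match the longer one.
recycled <- 1:5 + 1:15            # 1:5 is used three times
explicit <- rep(1:5, 3) + 1:15    # the same sum with the repetition written out
identical(recycled, explicit)     # TRUE

# If the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning (suppressed here).
partial <- suppressWarnings(1:3 + 1:5)   # 1:3 is recycled as 1 2 3 1 2
partial                                   # 2 4 6 5 7
```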
ADDING SCALARS TO VECTORS
>rep(1:5, 3) #repeat function
>rep(1:5, each = 3)
>rep(1:5, times = 1:5)
>rep(1:5, length.out = 7)
>rep.int(1:5, 3) #the same as rep(1:5, 3)
>rep_len(1:5, 13)
FEW MORE BASIC COMMANDS
To see any dataset in Code editor, Type
>View(women) in Console.
To list the number of rows / columns respectively
>nrow(women)
>ncol(women)
To output a summary about the dataset’s columns.
>summary(women)
To output a summary of a dataset’s structure.
>str(women)
To get the dimensions of a dataset (number of observations and columns)
>dim(women)
To access a column in a dataset
>women$height
To check the type (or class) of a variable, the class function can be used
>class(women)
COERCION
> myNum <- 5.983904798274987298
> class(myNum)
"numeric"
• You can coerce (change type of) numeric string values into
numeric types, like so:
> myString <- "5.60"
> class(myString)
"character"
> myNumber <- as.numeric(myString)
> myNumber
5.6
> class(myNumber)
"numeric"
> myInt <- 209173987
> class(myInt)
"numeric"
• To actually force them to be integers, we need to
invoke a function that manually coerces them,
called as.integer:
> myInt <- as.integer(myInt)
> class(myInt)
"integer"
>myComparison <- 5 > 6
> myComparison
FALSE
> class(myComparison)
"logical"
>myComplex <- complex(1, 3292, 8974892)
>myComplex
3292+8974892i
> class(myComplex)
"complex"
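A few edge cases of coercion are worth trying alongside the examples above. This is a minimal sketch with made-up values: as.integer() drops the fractional part rather than rounding, a string that is not a number becomes NA, and c() silently coerces mixed inputs to one common type.

```r
truncated <- as.integer(5.98)     # 5: the fractional part is dropped, not rounded
num_ok    <- as.numeric("5.60")   # 5.6
num_bad   <- suppressWarnings(as.numeric("five"))  # NA (R warns if not suppressed)
is.na(num_bad)                    # TRUE

class(5L)                         # "integer": the L suffix writes an integer literal

mixed <- c(1, "a", TRUE)          # c() coerces everything to the most general type
class(mixed)                      # "character"
```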
PROGRAM NO:2
IMPORT FROM AND EXPORT TO CSV FILES
• CSV files(Comma Separated Values) are intentionally designed to be
widely supported; any OS or application that imports or exports data
usually has CSV support.
• They do nothing else but hold data - no text formatting for example.
• Excel files hold the same data, but in binary format. This allows the
file to save specifc Excel features - charts, formatting, etc.
• > datacsv<-read.csv("D:/FDP/Stu Info.csv")
• > datacsv
• > s<-subset(datacsv,Sec.Lang=="Sanskrit") # keep only the rows whose Sec.Lang is Sanskrit
• > write.csv(s,"output.csv")
• > View(s) # View() takes an R object, not a file name
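The import, subset and export steps can be tried without the original file. The sketch below assumes a small made-up student table in place of "Stu Info.csv" (the Name column and its values are hypothetical) and writes to a temporary file so nothing on disk is overwritten.

```r
# Hypothetical student table standing in for "Stu Info.csv".
students <- data.frame(
  Name = c("A", "B", "C"),
  Sec.Lang = c("Sanskrit", "Hindi", "Sanskrit"),
  stringsAsFactors = FALSE
)
sanskrit <- subset(students, Sec.Lang == "Sanskrit")

out_file <- tempfile(fileext = ".csv")            # temporary path
write.csv(sanskrit, out_file, row.names = FALSE)  # row.names = FALSE avoids an extra index column
read_back <- read.csv(out_file, stringsAsFactors = FALSE)
read_back
```

Without row.names = FALSE, write.csv() also stores the row names, and they come back as an extra first column when the file is read again.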
VECTORS AND LISTS
• The most essential of all, the vector, is a collection of elements of the
same type.
• A vector can only have elements of the exact same type. Vectors are
usually created with the shorthand c (concatenate) function:
> myVector <- c("Hello", "World", "Third Element")
> class(myVector)
"character"
> myVector
"Hello" "World" "Third Element"
>myVector <- c("One", "Two", "Three", "Four", "Five",
"Six", "Seven", "Eight", "Nine", "Ten", "Eleven",
"Twelve", "Thirteen", "Fourteen", "Fifteen")
> myVector
[1] "One" "Two" "Three" "Four" "Five" "Six" "Seven"
[8] "Eight" "Nine" "Ten" "Eleven" "Twelve" "Thirteen"
"Fourteen"
[15] "Fifteen"
• Note that vectors are strictly one-dimensional. You cannot add another
vector as an element inside an existing vector – their elements get merged
into one:
➢ > v1 <- c("a", "b", "c")
➢ > v2 <- c("d", "e", "f")
➢ > v3 <- c(v1, v2)
➢ > v3
➢ [1] "a" "b" "c" "d" "e" "f"
• You can generate entire numeric vectors by specifying a range:
• > myRange <- c(1:10)
• > myRange
[1] 1 2 3 4 5 6 7 8 9 10
LISTS
Lists are just like vectors, except that they are
not limited to holding elements of a single
type. They are built with the list function, or
with the c function if one of the elements
you’re adding is a list.
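Both ways of building a list can be sketched as follows. The values are illustrative only.

```r
mixed <- list(1.5, "text", TRUE, c(2, 3, 5))  # four elements of different kinds
length(mixed)                                  # 4
sapply(mixed, class)                           # "numeric" "character" "logical" "numeric"

# c() returns a list as soon as one of its arguments is a list.
extended <- c(mixed, list("extra"))
class(extended)                                # "list"
length(extended)                               # 5
```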
LISTS VISUALIZATION
LISTS
• The following variable x is a list containing
copies of three vectors n, s, b, and a numeric
value 3.
• > n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
> x = list(n, s, b, 3) # x contains copies of n, s, b
[Figure: pepper shaker analogy for list indexing, showing x, x[1], x[[1]] and x[[1]][[1]]]
• A single bracket always returns another list, with as many elements as the number of
indices you pass into the single bracket.
• In contrast, a double bracket always returns only one element: the element itself, not a list
wrapped around it.
• NOTE: THE MAJOR DIFFERENCE BETWEEN THE TWO IS THAT A SINGLE BRACKET RETURNS A LIST WITH AS MANY ELEMENTS AS YOU
WISH, WHILE A DOUBLE BRACKET RETURNS ONLY A SINGLE ELEMENT FROM THE LIST.
• Member Reference
• In order to reference a list member directly, we have to
use the double square bracket "[[]]"operator. The
following object x[[2]] is the second member of x. In
other words, x[[2]] is a copy of s, but is not a slice
containing s or its copy.
• > x[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
• We can modify its content directly.
• > x[[2]][1] = "ta"
> x[[2]]
[1] "ta" "bb" "cc" "dd" "ee"
> s
[1] "aa" "bb" "cc" "dd" "ee" # s is unaffected
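The difference between the single and the double bracket can be checked directly with class() and length(). This sketch rebuilds the list x from the slide so that it runs on its own; slice and member are names chosen here for illustration.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- list(n, s, b, 3)

slice  <- x[2]       # single bracket: a list of length 1 that contains s
member <- x[[2]]     # double bracket: the character vector itself
class(slice)         # "list"
class(member)        # "character"
length(x[c(1, 3)])   # 2: one element per index passed to the single bracket
x[[2]][1]            # "aa": first component of the second member
```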
MATRICES
• Matrices are vectors with a dimension attribute. The dimension
attribute is itself an integer vector of length 2 (nrow, ncol)
• > m <- matrix(nrow = 2, ncol = 3)
• > m
• [,1] [,2] [,3]
• [1,] NA NA NA
• [2,] NA NA NA
• > dim(m)
• [1] 2 3
• > attributes(m)
• $dim
• [1] 2 3
>m<-matrix(nrow=3,ncol=2,c(1,2,3,4,5,6))
>m
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m<-matrix(c(1,2,3,4))
➢ m<-matrix(c(1,2,3,4),7,8)
➢ m<- matrix(1:9,nrow=3,ncol=3,byrow=TRUE)
➢ matrix(1,nrow=10,ncol=10)
➢ A <- matrix(1:16,4,4) # 4 x 4 matrix filled column by column
➢ z <- A[2,3] # element in the 2nd row and 3rd col of matrix A, assigned to z
➢ > A[2:4,4:2] # rows 2, 3, 4 and columns 4, 3, 2 (in that order), giving a sub-matrix
➢ > A[2,2:3] # second row, 2nd col and 3rd col elements
>second.column <- A[,2] # the whole second column
➢ >which(A>8) # positions (counted column by column) of the elements greater than 8
ARRAYS
An array is just a vector plus information on the dimensions of the array.
We can create an array from a vector:
➢ X <- array(1:24,dim=c(3,4,2)) # 24 elements in an array, with 3 rows, 4 cols, in 2 matrices form.
➢ x <- seq(1,27)
➢ > c(3,9)
➢ [1] 3 9
➢ > dim(x)=c(3,9)
➢ > is.array(x)
➢ [1] TRUE
➢ > is.matrix(x)
➢ [1] TRUE
➢ > x
➢ [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
➢ [1,] 1 4 7 10 13 16 19 22 25
➢ [2,] 2 5 8 11 14 17 20 23 26
➢ [3,] 3 6 9 12 15 18 21 24 27
DATA FRAMES
• Data frames are used to store tabular data.
• They are represented as a special type of list where
every element of the list has to have the same length.
• Each element of the list can be thought of as a
column and the length of each element of the list is
the number of rows.
• Unlike matrices, data frames can store different
classes of objects in each column (just like lists);
• A data frame is used for storing data tables. It
is a list of vectors of equal length.
• > n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b) # df is a data frame
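That a data frame is a list of equal-length columns can be verified on the df built above. This is a small sketch; col_classes is a name introduced here.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n, s, b)

is.list(df)                       # TRUE: a data frame is a list of columns
length(df)                        # 3: number of columns (list elements)
nrow(df)                          # 3: the common length of every column
col_classes <- sapply(df, class)  # one class per column
col_classes
```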
• Cell value from the first row, second column of mtcars.
• > mtcars[1, 2]
[1] 6
• Can use the row and column names instead of the numeric coordinates.
• > mtcars["Mazda RX4", "cyl"]
[1] 6
• Lastly, the number of data rows in the data frame is given by
the nrow function.
• > nrow(mtcars) # number of data rows
[1] 32
• And the number of columns of a data frame is given by the ncol function.
• > ncol(mtcars) # number of columns
[1] 11
•
• We reference a data frame column with the double square bracket "[[]]" operator.
• For example, to retrieve the ninth column vector of the built-in data set mtcars, we
write mtcars[[9]].
• > mtcars[[9]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
• We can retrieve the same column vector by its name.
• > mtcars[["am"]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
• We can also retrieve with the "$" operator in lieu of the double square bracket operator.
• > mtcars$am
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
• Yet another way to retrieve the same column vector is to use the single
square bracket "[]"operator. We prepend the column name with a comma character,
which signals a wildcard match for the row position.
• > mtcars[,"am"]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
• >x <- read.csv("data1.csv",header=T, sep=",")
• >x2 <- read.csv("data2.csv",header=T, sep=",")
•
• >x3 <- cbind(x,x2)
• >x3
• Subtype Gender Expression Age City
• 1 A m -0.54 32 New York
• 2 A f -0.80 21 Houston
• 3 B f -1.03 34 Seattle
• 4 C m -0.41 67 Houston
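>which(A>=15,arr.ind=TRUE) # row and column of each element >= 15 (the output assumes A is a 4 x 4 matrix holding 1:16)
row col
[1,] 3 4
[2,] 4 4
Similarly, we can assign values to a whole row; the replacement needs one value per column.
>A[1,] <- c(2,4,5,6)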
EXP NO 3: GETTING AND CLEANING DATA
WITH SWIRL
Swirl is an interactive package which will teach
us and at the same time make us practice with
the exercises.
It has three types of exercises, basic,
intermediate and advanced.
Getting and cleaning data is an intermediate
exercise.
WHAT IS SWIRL() IN R
• swirl is a software package for the R programming
language that turns the R console into an interactive
learning environment. Users receive immediate
feedback as they are guided through self-paced
lessons in data science and R programming.
➢install.packages("swirl")
➢library(swirl)
➢install_course("Getting and Cleaning Data") # or: install_from_swirl("Getting and Cleaning Data")
➢swirl()
SWIRL() Flow..
• | Please choose a course, or type 0 to exit swirl.
•
• 1: Getting and Cleaning Data
• 2: R Programming
• 3: Take me to the swirl course repository!
•
• Selection: 1
•
• | Please choose a lesson, or type 0 to return to course
• | menu.
•
• 1: Manipulating Data with dplyr
• 2: Grouping and Chaining with dplyr
• 3: Tidying Data with tidyr
• 4: Dates and Times with lubridate
ABOUT PACKAGES COMING WITH GETTING
AND CLEANING DATA
• For this we use three packages: dplyr,
tidyr and lubridate.
• dplyr is a package that provides a consistent
and concise grammar for manipulating tabular
data. It makes data manipulation easier.
About dplyr package from swirl()
According to the "Introduction to dplyr"
vignette written by the package authors, "The
dplyr philosophy is to have small functions that
each do one thing well."
Specifically, dplyr supplies five 'verbs' that cover
most fundamental data manipulation tasks:
select(), filter(), arrange(), mutate(), and
summarize().
Data manipulation using dplyr
• install.packages("dplyr") ## install
• You might get asked to choose a CRAN mirror – this is basically
asking you to choose a site to download the package from. The
choice doesn’t matter too much; We recommend the RStudio
mirror.
• library("dplyr") ## load
• You only need to install a package once per computer, but you need
to load it every time you open a new R session and want to use that
package.
Selecting columns and filtering rows
• To select columns of a data frame, use select().
The first argument to this function is the data
frame (ToothGrowth), and the subsequent
arguments are the columns to keep.
• select(ToothGrowth, len, supp, dose)
>aa<-select(ToothGrowth,len,supp,dose)
>plot(aa)
• filter():
To choose rows
• filter(ToothGrowth, len==5)
• filter(ToothGrowth, len>5)
Pipes (%>%)
• Without pipes, you have to nest functions (i.e. put one function inside of another)
• Pipes let you take the output of one function and
send it directly to the next, which is useful when
you need to do many things to the same data set.
>ToothGrowth %>%
+ filter(len < 5) %>%
+ select(len,supp,dose)
• To create a new object with this smaller
version of the data we could do so by assigning
it a new name.
>ToothGrowth_sml <- ToothGrowth %>%
+ filter(len < 5) %>%
+ select(len,supp,dose)
➢MUTATE():
• create new columns based on the values in
existing columns
>ToothGrowth %>%
+ mutate(len = len/ 4)
• If this runs off your screen and you just want
to see the first few rows, you can use a pipe to
view the head() of the data
>ToothGrowth %>%
+ mutate(len=len/4) %>%
+ head
• If the first few rows of the new column were full of NAs,
we could remove them by inserting
filter() in this chain:
>ToothGrowth %>%
+ mutate(len = len/ 4) %>%
+ filter(!is.na(len)) %>%
+ head
➢group_by():
• group_by() splits the data into groups upon which some operations can
be run
>ToothGrowth %>% group_by(supp) %>% tally()
➢summarize():
• group_by() is often used together with summarize(), which
collapses each group into a single-row summary of that group.
>ToothGrowth %>% group_by(supp) %>% summarize(len = mean(len,
na.rm = TRUE))
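As a cross-check that needs no extra package, the same per-group count and mean can be computed in base R. Note that this swaps in tapply() and table() in place of the dplyr verbs; it is a sketch for comparing results, not the dplyr method itself.

```r
# Base-R cross-check of: group_by(supp) %>% summarize(len = mean(len, na.rm = TRUE))
mean_len <- tapply(ToothGrowth$len, ToothGrowth$supp, mean, na.rm = TRUE)
mean_len                      # one mean per level of supp (OJ, VC)

# Base-R cross-check of: group_by(supp) %>% tally()
counts <- table(ToothGrowth$supp)
counts                        # number of rows in each group
```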
Data Frame Column Slice
• We retrieve a data frame column slice with the single square bracket "[]" operator.
• Numeric Indexing
• The following is a slice containing the first column of the built-in data set mtcars.
• > mtcars[1]
mpg
Mazda RX4 21.0
Mazda RX4 Wag 21.0
Datsun 710 22.8
............
• Name Indexing
• We can retrieve the same column slice by its name.
• > mtcars["mpg"]
mpg
Mazda RX4 21.0
Mazda RX4 Wag 21.0
Datsun 710 22.8
............
• To retrieve a data frame slice with the two columns mpg and hp, we pack the column names in an index vector
inside the single square bracket operator.
• > mtcars[c("mpg", "hp")]
mpg hp
Mazda RX4 21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710 22.8 93
............
•
Exp 5. Creating Data Frame
emp.data <- data.frame( emp_id = c (1:5),
emp_name = c("Ratna","Kumar","Kamala","Prajwal","Pravachan"),
salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")),
stringsAsFactors = FALSE )
>emp.data
# Add the "dept" column.
➢ emp.data$dept <- c("IT","Operations","IT","HR","Finance")
➢ v <- emp.data
➢ print(v)
Extracting rows and columns
A=emp.data$emp_id
B=emp.data$emp_name
a)C=data.frame(A,B)
b)emp.data[1:2,]
c)emp.data[c(3,5),c(2,4)]
➢ > emp.data[1:2,]
➢ emp_id emp_name salary start_date dept
➢ 1 1 Ratna 623.3 2012-01-01 IT
➢ 2 2 Kumar 515.2 2013-09-23 Operations
➢ > emp.data[c(3,5),c(2,4)]
➢ emp_name start_date
➢ 3 Kamala 2014-11-15
➢ 5 Pravachan 2015-03-27
PROGRAM 6: ‘apply’ group of functions
Function   Arguments               Objective                                            Input                         Output
apply      apply(x, MARGIN, FUN)   Apply a function to the rows or columns or both     Data frame or matrix          vector, list, array
lapply     lapply(X, FUN)          Apply a function to all the elements of the input   List, vector or data frame    list
sapply     sapply(X, …             Apply a function to all the …                        List, vector or …             vector or …
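A minimal sketch of the three functions on made-up inputs (the matrix m and the list scores are illustrative) shows how the output type differs.

```r
m <- matrix(1:6, nrow = 2)          # 2 x 3 matrix with columns (1,2), (3,4), (5,6)
row_sums <- apply(m, 1, sum)        # MARGIN = 1 works over rows: 9 12
col_sums <- apply(m, 2, sum)        # MARGIN = 2 works over columns: 3 7 11

scores <- list(a = c(1, 2, 3), b = c(10, 20))
as_list   <- lapply(scores, mean)   # always a list: $a is 2, $b is 15
as_vector <- sapply(scores, mean)   # simplified to a named numeric vector
```

lapply() is the safer choice when the results may have different lengths; sapply() is convenient when each result is a single value.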
PROGRAM 7: rbind() and cbind() functions
• Matrices can be created by column-binding or row-binding with cbind() and
rbind().
• Data frames can also be appended by these functions.
• > x <- 1:3
• > y <- 10:12
• > cbind(x, y)
• x y
• [1,] 1 10
• [2,] 2 11
• [3,] 3 12
• > rbind(x, y)
• [,1] [,2] [,3]
• x 1 2 3
• y 10 11 12
>C <- cbind(1:3,4:6,5:7)
>D <- rbind(1:3,4:6)
Factor Variables
Factor variables are nothing but nominal variables and
also known as categorical variables.
Levels are nothing but unique values in the variable
values.
➢ gender <- c(rep("male",20), rep("female", 30))
➢ gender <- factor(gender)
➢ gender # prints the 50 values, followed by: Levels: female male
➢ summary(gender)
female male
30 20
PROGRAM 8: DISCRETE IRIS
➢ iris$Seplen<- cut(iris$Sepal.Length, breaks=c(4.3,5.6,6.8,7.9), labels=c("low","medium","high"))
➢ > iris$Seplen
[1] low low low low low low low low
[9] low low low low low <NA> medium medium
[17] low low medium low low low low low
[25] low low low low low low low low
[33] low low low low low low low low
[41] low low low low low low low low
[49] low low high medium high low medium medium
[57] medium low medium low low medium medium medium
[65] …..
Levels: low medium high
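The single <NA> in the output above comes from how cut() treats the lowest break: by default the first interval is (4.3, 5.6], so an observation exactly equal to 4.3 falls outside it. The sketch below compares the default with include.lowest = TRUE; it works on separate vectors (default_bins and inclusive_bins are names chosen here) and leaves iris unchanged.

```r
breaks <- c(4.3, 5.6, 6.8, 7.9)
labels <- c("low", "medium", "high")

default_bins   <- cut(iris$Sepal.Length, breaks = breaks, labels = labels)
inclusive_bins <- cut(iris$Sepal.Length, breaks = breaks, labels = labels,
                      include.lowest = TRUE)

sum(is.na(default_bins))     # 1: the value 4.3 is outside (4.3, 5.6]
sum(is.na(inclusive_bins))   # 0: 4.3 is now counted as "low"
table(inclusive_bins)        # number of flowers in each bin
```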
PROGRAM 9 - SCATTER PLOT USING ‘DPLYR’ ON
GUINEA PIGS ‘TOOTHGROWTH’ DATA SET
➢ aa<-select(ToothGrowth,len,supp,dose)
# To choose rows we use filter()
➢ > filter(ToothGrowth,len<=14.5)
➢ > ToothGrowth %>%
➢ + group_by(supp) %>%
➢ + summarise(meanoflen=mean(len))
➢ > plot(aa)
gg = grammar of graphics
➢ library(dplyr)
➢ > library(ggplot2) # the package is called ggplot2; library(ggplot) fails
➢ > ggplot(aa,aes(x=factor(dose),y=len,fill=supp)) # empty plot: no layer added yet
➢ > ggplot(aa,aes(x=factor(dose),y=len,fill=supp))+geom_boxplot()
➢ # aes = aesthetic mapping
PROGRAM-10…LINEAR AND MULTIPLE
REGRESSION
Regression: A technique for determining the
statistical relationship between two or more
variables where a change in a dependent
variable is associated with, and depends on, a
change in one or more independent variables.
Linear Regression: Y=mX+c
Single predictor, X; m is the slope and c is the intercept
Multiple Linear Regression: Y=aX3+bX2+cX1+d
3 predictors/explanatory variables: X3, X2, X1
a, b, c are coefficients
d is the intercept (bias value)
Y is the response variable
Y is estimated or predicted from the 3 X
variables.
Mtcars variables
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (lb/1000)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
lm = linear model
> library(ggplot2)
> ggplot(mtcars,aes(wt,mpg))
> ggplot(mtcars,aes(wt,mpg))+geom_point()
> ggplot(mtcars,aes(wt,mpg))+geom_point()+geom_smooth(method="lm")
Mpg versus weight
• For example, in the mtcars dataset you can
build a linear model between the fuel
efficiency (mpg) and the weight of the car
(wt):
mpg=β0+β1wt
• β1 is the slope; β0 is the intercept
• mpg is the dependent variable; wt is the independent variable
• Residuals. The difference between the observed
value of the dependent variable (y) and the
predicted value (ŷ) is called the residual (e).
• Each data point has one residual.
• Observed: y=10*3+5=35
• Model with m=9: y=9x+c
• Predicted: ŷ=9*3+5=32
• Residual: e=y-ŷ=35-32=3
> mfit = lm(mpg ~ wt + disp + cyl, data=mtcars)
> plot(mfit)
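The slope, intercept and residuals described above can be read off a fitted model. The sketch below uses the single-predictor model mpg ~ wt rather than the three-predictor mfit, so that the coefficients map directly onto β0 and β1; fit, predicted and res are names chosen here.

```r
fit <- lm(mpg ~ wt, data = mtcars)   # simple linear regression: mpg = b0 + b1 * wt
coef(fit)                            # intercept (b0) and slope (b1)

predicted <- fitted(fit)             # the model's mpg for each car
res <- mtcars$mpg - predicted        # residual = observed - predicted
all.equal(unname(res), unname(residuals(fit)))   # TRUE: same as residuals(fit)

summary(fit)$r.squared               # share of the variance in mpg explained by wt
```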
PROGRAM NO: 11
Major Clustering Approaches (I)
• Partitioning approach:
– Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
– Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
– Create a hierarchical decomposition of the set of data (or objects) using
some criterion
– Typical methods: Diana, Agnes, BIRCH, CHAMELEON
• Density-based approach:
– Based on connectivity and density functions
– Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
– based on a multiple-level granularity structure
– Typical methods: STING, WaveCluster, CLIQUE
IRIS TYPES
K-means clustering
• names(iris)
• [1] "Sepal.Length" "Sepal.Width" "Petal.Length"
• [4] "Petal.Width" "Species"
•
• > x<-iris[,-5]
•
• > y<-iris$Species
•
• > kc<-kmeans(x,3)
•
• > kc
•
• K-means clustering with 3 clusters of sizes 38, 62, 50
•
• Cluster means:
• Sepal.Length Sepal.Width Petal.Length Petal.Width
• 1 6.850000 3.073684 5.742105 2.071053
• 2 5.901613 2.748387 4.393548 1.433871
• 3 5.006000 3.428000 1.462000 0.246000
•
• Clustering vector:
• [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
• [29] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2
• [57] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
• [85] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 1
• [113] 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1
• [141] 1 1 2 1 1 1 2 1 1 2
•
• Within cluster sum of squares by cluster:
• [1] 23.87947 39.82097 15.15100
• (between_SS / total_SS = 88.4 %)
>plot(x[c("Sepal.Length","Sepal.Width")],col=kc$cluster)
K-means
>points(kc$centers[,c("Sepal.Length","Sepal.Width")], col=1:3, pch=23, cex=3)
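Because kmeans() starts from random centres, the cluster numbers and sometimes the sizes change from run to run. A hedged sketch of a repeatable run follows: the seed value 42 and nstart = 25 are arbitrary choices made here, not settings from the slides.

```r
set.seed(42)                               # arbitrary seed, only for repeatability
x <- iris[, -5]                            # the four numeric measurements
kc <- kmeans(x, centers = 3, nstart = 25)  # try 25 random starts and keep the best
kc$size                                    # cluster sizes
table(cluster = kc$cluster, species = iris$Species)  # clusters against the known species
```

The cross-table shows how well the three clusters line up with the three species; the cluster labels themselves carry no meaning.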
• > library(fpc)
• > pamresult<-pamk(iris1) # iris1 is not defined in these slides; presumably the iris measurements without the Species column
• > pamresult$nc #nc-Number of Clusters
• [1] 2
• > table(pamresult$pamobject$clustering,iris$Species)
•
• setosa versicolor virginica
• 1 50 1 0
• 2 0 49 50
• > layout(matrix(c(1,2),1,2)) # two plots side by side
• > plot(pamresult$pamobject)
• The ggplot() command creates a plot object. In it
we assigned a data set.
• aes() creates what Hadley Wickham calls an
aesthetic: a mapping of variables to various parts of
the plot. ...
• Another way to split up the way we look at data is
with facets.
> ggplot(mtcars,aes(wt,mpg))
Error in ggplot(mtcars, aes(wt, mpg)) : could not find function "ggplot"
> library(ggplot2) # load the package first
> ggplot(mtcars,aes(wt,mpg))
> ggplot(mtcars,aes(wt,mpg))+geom_point()
> ggplot(mtcars,aes(wt,mpg))+geom_point()+geom_smooth(method="lm")
> ggplot(mtcars,aes(wt,mpg))+geom_point()+geom_abline()
> ggplot(mtcars, aes(x=wt, y=mpg, col=cyl, size=disp)) + geom_point()
What combination of predictors will best predict
fuel efficiency?(Slope/Coefficients and
intercepts)
Which predictors increase our accuracy by a
statistically significant amount?
We should guess which predictors are
significant, and to determine the ideal formula
for prediction….WHICH IS WHAT WE CALL
LINEAR REGRESSION.
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as
density-connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
Density-Based Clustering: Basic Concepts
• Two parameters:
– Eps: Maximum radius of the neighbourhood
– MinPts: Minimum number of points in an Eps-
neighbourhood of that point
• NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
• Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
– p belongs to NEps(q)
– core point condition:
|NEps (q)| ≥ MinPts
[Figure: p is directly density-reachable from q, with MinPts = 5 and Eps = 1 cm]
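The Eps-neighbourhood and the core point condition can be computed directly from a distance matrix. This is a minimal sketch on made-up points; the values of eps and min_pts are assumptions chosen for the example, and the neighbourhood count includes the point itself.

```r
# Illustrative points (not from the slides): a tight group and one far-away point.
pts <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1), c(0.5, 0.5), c(8, 8))
eps <- 1.5      # assumed neighbourhood radius
min_pts <- 4    # assumed density threshold

d <- as.matrix(dist(pts))        # pairwise Euclidean distances
n_eps <- rowSums(d <= eps)       # |NEps(p)| for every point, counting p itself
is_core <- n_eps >= min_pts      # core point condition

# Is point 5 directly density-reachable from point 1?
# It must lie in the Eps-neighbourhood of point 1, and point 1 must be a core point.
directly_reachable <- d[5, 1] <= eps && is_core[1]
```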
Density-Reachable and Density-Connected
• Density-reachable:
– A point p is density-reachable from a
point q w.r.t. Eps, MinPts if there is a
chain of points p1, …, pn, p1 = q, pn = p
such that pi+1 is directly density-
reachable from pi
• Density-connected
– A point p is density-connected to a
point q w.r.t. Eps, MinPts if there is a
point o such that both, p and q are
density-reachable from o w.r.t. Eps
and MinPts
[Figure: p density-reachable from q through p1; p and q density-connected through o]
DBSCAN: Density-Based Spatial Clustering of Applications
with Noise
• Relies on a density-based notion of cluster: A cluster is defined
as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with
noise
[Figure: core, border and outlier points, with Eps = 1cm and MinPts = 5]
DBSCAN: The Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p
and DBSCAN visits the next point of the database
• Continue the process until all of the points have been
processed
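The steps above can be sketched in base R. This is a simplified illustration under stated assumptions, not the implementation of any R package: dbscan_sketch is a hypothetical function name, neighbourhoods count the point itself, and cluster 0 marks noise.

```r
# Minimal DBSCAN sketch (base R only); 0 in the result marks noise.
dbscan_sketch <- function(points, eps, min_pts) {
  d <- as.matrix(dist(points))                  # pairwise Euclidean distances
  n <- nrow(d)
  neighbours <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  is_core <- lengths(neighbours) >= min_pts     # core point condition
  cluster <- integer(n)                         # 0 = unassigned or noise
  visited <- logical(n)
  current <- 0L
  for (p in seq_len(n)) {
    if (visited[p]) next
    visited[p] <- TRUE
    if (!is_core[p]) next        # not core: stays noise unless a core point claims it later
    current <- current + 1L      # p is a core point, so a new cluster is formed
    cluster[p] <- current
    queue <- neighbours[[p]]
    while (length(queue) > 0) {  # collect everything density-reachable from p
      q <- queue[1]
      queue <- queue[-1]
      if (cluster[q] == 0L) cluster[q] <- current
      if (!visited[q]) {
        visited[q] <- TRUE
        if (is_core[q]) queue <- c(queue, neighbours[[q]])  # only core points expand
      }
    }
  }
  cluster
}

# Two tight groups and one isolated point (made-up data).
pts <- rbind(cbind(c(0, 0, 1, 1, 0.5), c(0, 1, 0, 1, 0.5)),
             cbind(c(10, 10, 11, 11, 10.5), c(10, 11, 10, 11, 10.5)),
             c(50, 50))
cl <- dbscan_sketch(pts, eps = 1.5, min_pts = 4)
cl   # 1 1 1 1 1 2 2 2 2 2 0
```

The sketch precomputes all neighbourhoods from a full distance matrix, which is simple but needs memory quadratic in the number of points; it is meant for small examples only.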
Before Package ‘rpart’
Title: Recursive Partitioning and Regression Trees
A regression line is a straight line
that attempts to describe the
relationship between two variables,
also known as a trend line or line
of best fit.
Simple linear regression predicts a variable (y)
that depends on a second variable (x), based on the
regression equation fitted to a given set of data.
Decision trees are of two types
Classification Trees
Regression Trees
CTs are used when the target or
response variable is of
categorical in nature.
RTs are used when the target
variable is continuous or
numeric.
It is the target variable that
determines the type of
decision tree needed.
DECISION TREES USING PARTY-PROGRAM
12
• > install.packages("readr")
• > library(readr)
• > install.packages("party")
• Installing package into ‘C:/Users/My Document/Documents/R/win-library/3.4’
• (as ‘lib’ is unspecified)
• trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/party_1.2-3.zip'
• Content type 'application/zip' length 719826 bytes (702 KB)
• downloaded 702 KB
•
• package ‘party’ successfully unpacked and MD5 sums checked
•
• The downloaded binary packages are in
• C:\Users\My Document\AppData\Local\Temp\RtmpOAuKaM\downloaded_packages
• > library(party)
DECISION TREE USING RPART..PROGRAM 12
rpart(formula, data, weights, subset, na.action = na.rpart, method, model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)
tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris, method = "class")
• > # iris$class <- as.factor(iris$class)  # not needed here: the built-in iris data has no 'class' column; its label column is Species
• > View(iris)
• > iris$Species<-as.factor(iris$Species)
• > tree1<-ctree(Species~Sepal.Length, data=iris)
• > plot(tree1)
➢ tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris, method = "class")
> plot(tree)
> plot(tree, uniform=TRUE, main="Classification Tree for Iris dataset")
> text(tree, use.n=TRUE, all=TRUE, cex=.8)
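A self-contained sketch of the same classification tree, with a quick check of how well it fits, might look like this. It assumes the rpart package is available, and it evaluates the tree on the training data, so the accuracy is optimistic.

```r
library(rpart)  # assumes the 'rpart' package is installed

# Classification tree: Species is categorical, so method = "class"
tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
              data = iris, method = "class")

# Predict the class of every row and compare with the true species
pred <- predict(tree, newdata = iris, type = "class")
confusion <- table(predicted = pred, actual = iris$Species)
print(confusion)

# Share of rows classified correctly (on the training data)
accuracy <- mean(pred == iris$Species)
print(accuracy)
```

For a regression tree, the target would be numeric and method = "anova" would be used instead.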
SUPPORT VECTOR MACHINE
• X1, X2 Attributes
ABOUT DIFFERENT TYPES OF VARIABLES
FEW GOOD WEB SITES ON R
www.kaggle.com
www.rdocumentation.org
www.statmethods.net
www.r-tutor.com
www.tutorialspoint.com
www.datacamp.com
www.github.com
https://drsimonj.svbtle.com/visualising-residuals

DATA MINING USING R (1).pptx

  • 1. ADIKAVI NANNAYA UNIVERSITY UNIVERSITY COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING ONE DAY ORIENTATION PROGRAM ON DATA MINING USING R PROGRAMMING 11TH Dec 2017. Dr. M. Kamala Kumari Assoc Prof
  • 2. OUR WAY… DATA….A BASE THING DIFFERENCES BETWEEN RELATED TERMS OBJECTIVES OF PROCESSING THINGS STEPS IN DATA ANALYSIS DIFFERENT ANGLES OF DATA SCIENCE OBJECTIVES OF ALL STORIES WHAT IS THE ROLE OF R DEFINITIONS OF R VARIATIONS OF R COMPETITORS OF R WHY R CRAN R RSTUDIO BASIC COMMANDS PROGRAM 1 TO PROGRAM 13.
  • 3. Base to anything---Data!! Processing Data =Applying Statistics on Data Data Context 345(423) 260 No: of UG Affiliated Colleges to AKNU No: of PG Affiliated Colleges to AKNU Total No: of Affiliated Colleges to AKNU Information AKNU has more number of UG Affiliations than PG Analysis = Understanding Information 85 Decision Making Decide whether to give affiliation for UG College or not!!
  • 4. THE ABOVE PROCESS CAN BE VIEWED WITH R –SHOWING DATA,PROCESSING AND RESULTS ALL IN ONE ENVIRONMENT…..LET’S MAKE DECISION EASY WITH R!!!!
  • 5. DATA, INFORMATION AND KNOWLEDGE KNOWLEDGE IS USEFUL INFORMATION OBTAINED THROUGH LEARNING AND EXPERIENCE KNOWLEDGE DOES NOT NEED DIRECT INTERACTION WITH DATA PREDICTION IS POSSIBLE WITH REQUIRED KNOWLEDGE BUT NOT WITH INFORMATION ALONE NEED INFORMATION TO GET KNOWLEDGE INFORMATION IS PROCESSING DATA KNOWLEDGE IS PROCESSING PATTERNS OF INFORMATION ASSOCIATED WITH EXPERIENCE KNOWLEDGE REQUIRES COGNITIVE (REASONING, PERCEPTION) ABILITY ....WHEREAS INFORMATION NEED NOT INFORMATION KNOWLEDGE DATA
  • 6. KNOWLEDGE==SCIENCE?? • Data ==Facts • Statistics ==Data + Formulae • Information==Description of Statistics(Reduce errors) • Analysis == Understanding Information or Insights of Data and info • Analytics == Algorithms/Techniques on Data • Knowledge == Understanding information and technical results • Data Mining == Analytics==Querying…???...YES
  • 7. STEPS IN DATA ANALYSIS ETL DATA ANALYTICS Reports/Graphics Model Explore Clean Organize Collect DATA Remove errors and fill gaps Apply Statistics, Techniques Apply Algorithms Visualization Techniques/Tools Arrange in a particular format
  • 8. DATA ANALYSIS DATA ANALYTICS DATA MINING AND DATA SCIENCE --- WE ALL ARE RELATED !! Data Science DATA ANALYSIS DCD DATA MINING DATA ANALYTICS DATA WAREHOUSING
  • 9. DAWN TO DUSK=DATA SCIENCE!! Domain Expert SELECT H/W STATISTICS ETL Data Modeling Computing data Visualization Prediction
  • 12. THE OBJECTIVES OF ALL THE STORIES BEHIND!!.....CONTD • DESCRIPTION • COMPARISON • CLASSIFICATION • COMBINE SIMILAR THINGS • GENERATE RULES UNDERSTAND ACQUIRE KNOWLEDGE ….AND….. PREDICT/DECIDE
  • 13. ROLE OF ‘R’…IN WHICH STORY The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Instead of long programming, R gives visualization of statistical computations in an easy way(instant methods and less programming with many packages included) R is one of the analytical tools
  • 14. WE CAN DEFINE R TO BE…. R IS A PROGRAMMING LANGUAGE R IS AN ANALYTICAL TOOL R IS A SCRIPTING LANGUAGE R STUDIO IS A SOFTWARE ENVIRONMENT
  • 15. A B C D E …S..R..!!..? R – A free and open source software programming language for statistical computing and graphics. • Founders of R-Ross Ihaka & Robert Gentleman
  • 16. R STUDIO • R Studio is an IDE to develop R Founded by JJ Allaire • R is an extension of S Language a Statistical Language. • Latest version of R = R 3.4.2 for Windows 32/64bit
  • 17. VARIATIONS OF R • R – free implementation of the S (programming language) • pbdR – Programming with Big Data R • R Commander– GUI interface for R • Rattle GUI– GUI interface for R • Revolution Analytics – production-grade software for the enterprise big data analytics • RStudio – GUI interface and development environment for R
  • 18. COMPETITORS OF R • MS Excel - Microsoft Excel Sheet • SAS - Statistical Analysis System • SPSS - Statistical Package for Social Science • MATLAB -Matrix Laboratory • OCTAVE -Helps in solving linear and nonlinear problems numerically. • Python -Another Programming language which express concepts in fewer lines of code. • Spark -Provides Interface for programming entire cluster with implicit data parallelism • Storm - Distributed Real time computation System
  • 19. THEN WHY R?? • More powerful data manipulation capabilities • Easier automation • Faster computation • It reads any type of data • Easier project organization • It supports larger data sets • Reproducibility (important for detecting errors) • Easier to find and fix errors • It's free • It's open source • Advanced Statistics capabilities • State-of-the-art graphics • It runs on many platforms • Anyone can contribute packages to improve its functionality
  • 20. INVITE R AND RSTUDIO… • Download and install the latest R: http://www.r-project.org/ • Download and install RStudio, the R IDE: http://www.rstudio.com/
  • 21. CRAN R • The “Comprehensive R Archive Network” ( CRAN ) is a collection of sites which carry identical material, consisting of the R distribution(s), the contributed extensions, documentation for R, and binaries. • R FAQ - The R Project for Statistical Computing • CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R. Please use the CRAN mirror nearest to you to minimize network load.
  • 23. Get and set working directories
    > getwd()
    [1] "C:/Users/My Document/Documents"
    > setwd("C:/Program Files/R/R-3.4.3/bin/i386")
    > getwd()
    [1] "C:/Program Files/R/R-3.4.3/bin/i386"
    > dir()
    > data()
    > ls()
  • 24. SIMPLE COMMANDS
    TO INSTALL ANY PACKAGE
    > install.packages("package name")
    We can install any package if we know its correct name and it is available for the installed version of R.
    TO SEE THE LIST OF ALL DATASETS
    > data()
    TO LOAD AN INSTALLED PACKAGE IN R
    > library(package name)
    TO SEE THE LIST OF PACKAGES INSTALLED IN DIFFERENT LIBRARIES
    > library()
  • 25. PACKAGE AND LIBRARY…??? Recently, the official repository (CRAN) reached 25,000 packages published, and many more are publicly available through the internet. A package is a like a book, a library is like a library; you use library() to check a package in the library----Hadley Wickham Chief Scientist at Rstudio Functions are like pages in a package book!!
  • 26. COMPLETIONS: YELLOW COLOUR IS FOR VARIABLES; BLUE COLOUR IS FOR FUNCTIONS; VIOLET COLOUR WITH A P INSIDE AND TWO COLONS (::) BESIDE IT IS FOR PACKAGES; VIOLET IS ALSO FOR FUNCTION ARGUMENTS OR VECTORS; A GRID IS FOR DATA FRAMES
  • 27. Program 1: BASIC COMMANDS - VECTORS
    • A vector is a sequence of data elements of the same basic type. The elements of a vector are officially called components or members.
    > 8.5:4.5 # descending sequence of numbers
    > rnorm(10)
    > c(1, 1:3, c(5, 8), 13)
    THE SAME EMPTY VECTORS CAN BE CREATED IN TWO WAYS
    > vector("numeric", 5)    # same as numeric(5)
    > vector("complex", 5)    # same as complex(5)
    > vector("logical", 5)    # same as logical(5)
    > vector("character", 5)  # same as character(5)
    > vector("list", 5)       # a list of five NULLs; note that list(5) is different: a one-element list holding 5
    > seq.int(3, 12) # same as 3:12
    > seq.int(3, 12, 2)
    > seq.int(0.1, 0.01, -0.01)
    > seq_len(5)
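The vector-creation commands on this slide can be checked with a short script. This is only a sketch; the variable names are chosen for the example.

```r
# Descending sequence with a fractional start
down <- 8.5:4.5
print(down)

# c() flattens nested vectors into one vector
combined <- c(1, 1:3, c(5, 8), 13)
print(combined)

# vector(mode, n) and the mode-specific constructor give the same result
same_numeric <- identical(vector("numeric", 5), numeric(5))
same_logical <- identical(vector("logical", 5), logical(5))
same_character <- identical(vector("character", 5), character(5))

# Exception: list(5) is a one-element list, not a list of length 5
len_vector_list <- length(vector("list", 5))
len_list_5 <- length(list(5))

# Sequences
by_two <- seq.int(3, 12, 2)
first_five <- seq_len(5)
print(by_two)
print(first_five)
```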
  • 28.
    > seq_len(n)
    > pp <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers")
    > for(i in seq_along(pp)) print(pp[i])
    > length(1:5)
    > length(c(TRUE, FALSE, NA))
    > sn <- c("Varma", "Persis", "Kamala", "PVRao")
    > length(sn)
    > nchar(sn)
    • Each element of an R vector can be given a name. Labeling the elements can often make your code much more readable. You can specify names when you create a vector in the form name = value. If the name of an element is a valid variable name, it doesn’t need to be enclosed in quotes.
    > c(apple = 1, banana = 2, "kiwi fruit" = 3, 4)
  • 29.
    > x <- (1:5) ^ 2
    > x[c(1, 3, 5)]
    > x[c(-2, -4)]
    > x[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
    • Mixing positive and negative values is not allowed, and will throw an error:
    > x[c(1, -1)] # This doesn't make sense!
    > names(x) <- c("one", "four", "nine", "sixteen", "twenty five")
    > x[c("one", "nine", "twenty five")]
    > x[c(1, NA, 5)]
    > x[c(TRUE, FALSE, NA, FALSE, TRUE)]
    > 10/3
    [1] 3.333333
    > options(digits=8)
    > 10/3
    [1] 3.3333333
    > options(digits=10)
    > 10/3
    [1] 3.333333333
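The indexing styles on this slide all select the same elements, which a short sketch can confirm. The result variable names are assumptions for the example.

```r
x <- (1:5) ^ 2   # 1 4 9 16 25

# Three equivalent ways to pick the 1st, 3rd and 5th elements
by_position <- x[c(1, 3, 5)]
by_exclusion <- x[c(-2, -4)]
by_logical <- x[c(TRUE, FALSE, TRUE, FALSE, TRUE)]

# Mixing positive and negative indices raises an error
mixed <- tryCatch(x[c(1, -1)], error = function(e) "error")

# Indexing by name after labelling the elements
names(x) <- c("one", "four", "nine", "sixteen", "twenty five")
by_name <- x[c("one", "nine", "twenty five")]
print(by_name)

# A missing index yields a missing value in the result
with_na <- x[c(1, NA, 5)]
print(with_na)
```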
  • 30. The which function returns the locations where a logical vector is TRUE. This can be useful for switching from logical indexing to integer indexing:
    > x <- c(23, 12, 45, 11, 2, 3, 4)
    > which(x > 10)
    [1] 1 2 3 4
    > which.min(x)
    ADDING SCALARS TO VECTORS
    > 1:5 + 1 # adds one to each element of the vector
    > 1:5 + 1:15 # the shorter vector is recycled to match the longer one
    > rep(1:5, 3) # repeat function
    > rep(1:5, each = 3)
    > rep(1:5, times = 1:5)
    > rep(1:5, length.out = 7)
    > rep.int(1:5, 3) # the same as rep(1:5, 3)
    > rep_len(1:5, 13)
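A brief sketch of which(), recycling, and the rep() variants from this slide follows. The variable names are assumptions for the example.

```r
x <- c(23, 12, 45, 11, 2, 3, 4)

# Positions where the condition holds, and position of the smallest value
above_ten <- which(x > 10)
smallest_at <- which.min(x)

# Adding a scalar touches every element
plus_one <- 1:5 + 1

# Recycling: 1:5 is reused three times to match the length of 1:15
recycled <- 1:5 + 1:15

# Variants of rep()
whole_three_times <- rep(1:5, 3)
each_three_times <- rep(1:5, each = 3)
growing <- rep(1:5, times = 1:5)      # 1 once, 2 twice, and so on
cut_to_seven <- rep(1:5, length.out = 7)

print(recycled)
print(growing)
```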
  • 31. FEW MORE BASIC COMMANDS
    To see any dataset in the code editor, type in the console:
    > View(women)
    To list the number of rows / columns respectively:
    > nrow(women)
    > ncol(women)
    To output a summary of the dataset’s columns:
    > summary(women)
    To output a summary of a dataset’s structure:
    > str(women)
    To get the dimensions of a dataset (number of observations and columns):
    > dim(women)
    To access a column in a dataset:
    > women$height
    To check the type (or class) of a variable, the class function can be used:
    > class(women)
  • 32. COERCION
    > myNum <- 5.983904798274987298
    > class(myNum)
    "numeric"
    • You can coerce (change the type of) numeric string values into numeric types, like so:
    > myString <- "5.60"
    > class(myString)
    "character"
    > myNumber <- as.numeric(myString)
    > myNumber
    5.6
    > class(myNumber)
    "numeric"
  • 33.
    > myInt <- 209173987
    > class(myInt)
    "numeric"
    • To actually force them to be integers, we need to invoke a function that manually coerces them, called as.integer:
    > myInt <- as.integer(myInt)
    > class(myInt)
    "integer"
  • 34.
    > myComparison <- 5 > 6
    > myComparison
    FALSE
    > class(myComparison)
    "logical"
    > myComplex <- complex(1, 3292, 8974892)
    > myComplex
    3292+8974892i
    > class(myComplex)
    "complex"
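The coercion steps from the last three slides can be run together as one sketch. The value "abc" is a hypothetical input, added only to show what happens when a string cannot be converted.

```r
# A number typed without a suffix is stored as "numeric" (double)
myInt <- 209173987
class_before <- class(myInt)

# as.integer() converts it to a true integer
myInt <- as.integer(myInt)
class_after <- class(myInt)

# A numeric string becomes a number with as.numeric()
myNumber <- as.numeric("5.60")

# A string that is not a number becomes NA (with a warning)
notNumber <- suppressWarnings(as.numeric("abc"))

# Comparisons produce logical values; complex() builds complex numbers
myComparison <- 5 > 6
myComplex <- complex(1, 3292, 8974892)

print(c(class_before, class_after))
print(myNumber)
print(myComplex)
```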
  • 35. PROGRAM NO: 2 IMPORT FROM AND EXPORT TO CSV FILES
    • CSV (Comma Separated Values) files are intentionally designed to be widely supported; any OS or application that imports or exports data usually has CSV support.
    • They do nothing else but hold data: no text formatting, for example.
    • Excel files hold the same data, but in binary format. This allows the file to save specific Excel features such as charts and formatting.
    • > datacsv <- read.csv("D:/FDP/Stu Info.csv")
    • > datacsv
    • > s <- subset(datacsv, Sec.Lang == "Sanskrit")
    • > write.csv(s, "output.csv")
    • > View(read.csv("output.csv"))
    • > View(s)
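Since the file "Stu Info.csv" is not available here, the import, subset, and export round trip can be sketched with a small made-up table written to temporary files. The student records and the Name column below are hypothetical; only the Sec.Lang column name comes from the slide.

```r
# Hypothetical student records standing in for "Stu Info.csv"
students <- data.frame(
  Name = c("A", "B", "C", "D"),
  Sec.Lang = c("Sanskrit", "Hindi", "Sanskrit", "Telugu"),
  stringsAsFactors = FALSE
)

# Write the sample data to a temporary CSV file
in_path <- tempfile(fileext = ".csv")
out_path <- tempfile(fileext = ".csv")
write.csv(students, in_path, row.names = FALSE)

# Import, keep only the Sanskrit rows, and export the subset
datacsv <- read.csv(in_path, stringsAsFactors = FALSE)
s <- subset(datacsv, Sec.Lang == "Sanskrit")
write.csv(s, out_path, row.names = FALSE)

# Read the exported file back to confirm what was saved
back <- read.csv(out_path, stringsAsFactors = FALSE)
print(back)
```

Passing row.names = FALSE keeps write.csv from adding an extra column of row numbers to the file.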
  • 36. VECTORS AND LISTS • The most essential of all, the vector, is a collection of elements of the same type. • A vector can only have elements of the exact same type. Vectors are usually created with the shorthand c (concatenate) function: > myVector <- c("Hello", "World", "Third Element") > class(myVector) "character" > myVector "Hello" "World" "Third Element"
  • 37. >myVector <- c("One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen", "Fifteen") > myVector [1] "One" "Two" "Three" "Four" "Five" "Six" "Seven" [8] "Eight" "Nine" "Ten" "Eleven" "Twelve" "Thirteen" "Fourteen" [15] "Fifteen"
  • 38. • Note that vectors are strictly one-dimensional. You cannot add another vector as an element inside an existing vector; their elements get merged into one:
    > v1 <- c("a", "b", "c")
    > v2 <- c("d", "e", "f")
    > v3 <- c(v1, v2)
    > v3
    [1] "a" "b" "c" "d" "e" "f"
    • You can generate entire numeric vectors by specifying a range:
    > myRange <- c(1:10)
    > myRange
    [1] 1 2 3 4 5 6 7 8 9 10
  • 39. LISTS Lists are just like vectors, except that they are not limited to holding elements of a single type. They are built with the list function, or with the c function if one of the elements you’re adding is a list:
  • 41. LISTS
    • The following variable x is a list containing copies of three vectors n, s, b, and a numeric value 3.
    > n = c(2, 3, 5)
    > s = c("aa", "bb", "cc", "dd", "ee")
    > b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
    > x = list(n, s, b, 3) # x contains copies of n, s, b
  • 42. Pepper shaker analogy: the list x is a pepper shaker holding packets; x[1] is a shaker holding a single packet; x[[1]] is that packet taken out of the list; x[[1]][[1]] is the content of the packet.
    • A single bracket always returns another list, with as many elements as the number of indices you pass into the single bracket.
    • In contrast, a double bracket always returns exactly one element.
    • NOTE: THE MAJOR DIFFERENCE BETWEEN THE TWO IS THAT A SINGLE BRACKET RETURNS A LIST WITH AS MANY ELEMENTS AS YOU WISH, WHILE A DOUBLE BRACKET RETURNS ONLY A SINGLE ELEMENT FROM THE LIST, NOT WRAPPED IN A LIST.
  • 43. Member Reference
    • In order to reference a list member directly, we have to use the double square bracket "[[]]" operator. The following object x[[2]] is the second member of x. In other words, x[[2]] is a copy of s, but is not a slice containing s or its copy.
    > x[[2]]
    [1] "aa" "bb" "cc" "dd" "ee"
    • We can modify its content directly.
    > x[[2]][1] = "ta"
    > x[[2]]
    [1] "ta" "bb" "cc" "dd" "ee"
    > s
    [1] "aa" "bb" "cc" "dd" "ee" # s is unaffected
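The difference between single and double brackets on a list can be checked with a short sketch. The variables n, s, b and x follow the slides; the result names are assumptions for the example.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- list(n, s, b, 3)

# Single bracket: always a list, one element per index supplied
slice_one <- x[2]
slice_two <- x[c(1, 3)]

# Double bracket: the member itself, here a character vector
member <- x[[2]]

# Changing the member inside the list leaves the original vector alone
x[[2]][1] <- "ta"

print(class(slice_one))
print(class(member))
print(x[[2]])
print(s)
```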
  • 45. MATRICES
    • Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector of length 2 (nrow, ncol).
    > m <- matrix(nrow = 2, ncol = 3)
    > m
         [,1] [,2] [,3]
    [1,]   NA   NA   NA
    [2,]   NA   NA   NA
    > dim(m)
    [1] 2 3
    > attributes(m)
    $dim
    [1] 2 3
  • 46.
    > m <- matrix(nrow = 3, ncol = 2, c(1, 2, 3, 4, 5, 6))
    > m
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    [3,]    3    6
    > m <- matrix(1:6, nrow = 2, ncol = 3)
    > m <- matrix(c(1, 2, 3, 4))         # a 4 x 1 column matrix
    > m <- matrix(c(1, 2, 3, 4), 7, 8)   # a 7 x 8 matrix; the four values are recycled
    > m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
    > matrix(1, nrow = 10, ncol = 10)
    > A <- matrix(0, 3, 4)
    > z <- A[2, 3]   # the element in the 2nd row and 3rd column of A, assigned to z
    > A[2:3, 4:2]    # rows 2 and 3 with columns 4, 3 and 2, as a sub-matrix (A has only 3 rows, so row 4 would be out of bounds)
    > A[2, 2:3]      # second row, 2nd and 3rd column elements
    > second.column <- A[, 2]   # the second column of A
    > which(A > 8)   # positions (not values) of the elements greater than 8
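A sketch with a matrix that holds distinct values makes the indexing results easier to see than a matrix of zeros. The matrix A below is filled with 1:12 as an assumption for this example.

```r
# 3 x 4 matrix filled column by column with 1..12
A <- matrix(1:12, nrow = 3, ncol = 4)
print(A)

# Single element, sub-matrix, and a whole column
z <- A[2, 3]
sub <- A[2:3, 4:2]
second_column <- A[, 2]

# which() gives positions; arr.ind = TRUE gives row/column pairs instead
positions <- which(A > 8)
row_col <- which(A >= 11, arr.ind = TRUE)
print(row_col)

# Asking for a row that does not exist is an error
out_of_bounds <- tryCatch(A[2:4, ], error = function(e) "error")

# byrow = TRUE fills the matrix row by row
by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
print(by_row)
```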
  • 47. ARRAYS An array is just a vector plus information on the dimensions of the array. We can create an array from a vector:
    > X <- array(1:24, dim = c(3, 4, 2)) # 24 elements arranged as two matrices of 3 rows and 4 columns
    > x <- seq(1, 27)
    > c(3, 9)
    [1] 3 9
    > dim(x) = c(3, 9)
    > is.array(x)
    [1] TRUE
    > is.matrix(x)
    [1] TRUE
    > x
         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
    [1,]    1    4    7   10   13   16   19   22   25
    [2,]    2    5    8   11   14   17   20   23   26
    [3,]    3    6    9   12   15   18   21   24   27
  • 48. DATA FRAMES • Data frames are used to store tabular data. • They are represented as a special type of list where every element of the list has to have the same length . • Each element of the list can be thought of as a column and the length of each element of the list is the number of rows. • Unlike matrices, data frames can store different classes of objects in each column (just like lists);
  • 49. • A data frame is used for storing data tables. It is a list of vectors of equal length.
    > n = c(2, 3, 5)
    > s = c("aa", "bb", "cc")
    > b = c(TRUE, FALSE, TRUE)
    > df = data.frame(n, s, b) # df is a data frame
  • 50. • Cell value from the first row, second column of mtcars.
    > mtcars[1, 2]
    [1] 6
    • We can use the row and column names instead of the numeric coordinates.
    > mtcars["Mazda RX4", "cyl"]
    [1] 6
    • Lastly, the number of data rows in the data frame is given by the nrow function.
    > nrow(mtcars) # number of data rows
    [1] 32
    • And the number of columns of a data frame is given by the ncol function.
    > ncol(mtcars) # number of columns
    [1] 11
  • 51. • We reference a data frame column with the double square bracket "[[]]" operator.
    • For example, to retrieve the ninth column vector of the built-in data set mtcars, we write mtcars[[9]].
    > mtcars[[9]]
    [1] 1 1 1 0 0 0 0 0 0 0 0 ...
    • We can retrieve the same column vector by its name.
    > mtcars[["am"]]
    [1] 1 1 1 0 0 0 0 0 0 0 0 ...
    • We can also retrieve it with the "$" operator in lieu of the double square bracket operator.
    > mtcars$am
    [1] 1 1 1 0 0 0 0 0 0 0 0 ...
    • Yet another way to retrieve the same column vector is to use the single square bracket "[]" operator. We prepend the column name with a comma character, which signals a wildcard match for the row position.
    > mtcars[, "am"]
    [1] 1 1 1 0 0 0 0 0 0 0 0 ...
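The four ways of retrieving a column can be compared directly. This sketch uses the built-in mtcars data; the result names are assumptions for the example.

```r
# Four ways to pull out the same column as a plain vector
by_index <- mtcars[[9]]
by_name <- mtcars[["am"]]
by_dollar <- mtcars$am
by_comma <- mtcars[, "am"]

# Without the comma, a single bracket returns a one-column data frame
as_frame <- mtcars["am"]

# Cell lookup by position and by row/column name
cell_by_position <- mtcars[1, 2]
cell_by_name <- mtcars["Mazda RX4", "cyl"]

print(head(by_index))
print(class(as_frame))
print(dim(mtcars))
```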
  • 52.
    > x <- read.csv("data1.csv", header = TRUE, sep = ",")
    > x2 <- read.csv("data2.csv", header = TRUE, sep = ",")
    > x3 <- cbind(x, x2)
    > x3
      Subtype Gender Expression Age     City
    1       A      m      -0.54  32 New York
    2       A      f      -0.80  21  Houston
    3       B      f      -1.03  34  Seattle
    4       C      m      -0.41  67  Houston
    > which(A >= 15, arr.ind = TRUE)
         row col
    [1,]   3   4
    [2,]   4   4
    Similarly, we can assign values the other way:
    > A[1, ] <- c(2, 4, 5)
  • 53. EXP NO 3: GETTING AND CLEANING DATA WITH SWIRL Swirl is an interactive package which will teach us and at the same time make us practice with the exercises. It has three types of exercises, basic, intermediate and advanced. Getting and cleaning data is an intermediate exercise.
  • 54. WHAT IS SWIRL() IN R • swirl is a software package for the R programming language that turns the R console into an interactive learning environment. Users receive immediate feedback as they are guided through self-paced lessons in data science and R programming. ➢ install.packages("swirl") ➢ library(swirl) ➢ install_from_swirl("Getting and Cleaning Data")
  • 56. SWIRL() Flow.. • | Please choose a course, or type 0 to exit swirl. • • 1: Getting and Cleaning Data • 2: R Programming • 3: Take me to the swirl course repository! • • Selection: 1 • • | Please choose a lesson, or type 0 to return to course • | menu. • • 1: Manipulating Data with dplyr • 2: Grouping and Chaining with dplyr • 3: Tidying Data with tidyr • 4: Dates and Times with lubridate
  • 57. ABOUT PACKAGES COMING WITH GETTING AND CLEANING DATA • This course uses three packages: dplyr, tidyr, and lubridate. • dplyr is a package that provides a consistent and concise grammar for manipulating tabular data. It makes data manipulation easier.
  • 58. About dplyr package from swirl() According to the "Introduction to dplyr" vignette written by the package authors, "The dplyr philosophy is to have small functions that each do one thing well." Specifically, dplyr supplies five 'verbs' that cover most fundamental data manipulation tasks: select(), filter(), arrange(), mutate(), and summarize().
  • 59. Data manipulation using dplyr • install.packages("dplyr") ## install • You might get asked to choose a CRAN mirror. This is basically asking you to choose a site to download the package from. The choice doesn’t matter too much; we recommend the RStudio mirror. • library("dplyr") ## load • You only need to install a package once per computer, but you need to load it every time you open a new R session and want to use that package.
  • 60. Selecting columns and filtering rows • To select columns of a data frame, use select(). The first argument to this function is the data frame (ToothGrowth), and the subsequent arguments are the columns to keep. • select(ToothGrowth, len, supp, dose) >aa<-select(ToothGrowth,len,supp,dose)
  • 61. • select(): To select columns of a data frame • select(ToothGrowth, len, supp, dose) > plot(aa) • filter(): To choose rows • filter(ToothGrowth, len==5)
  • 62. • filter(): To choose rows • filter(ToothGrowth, len>5) Pipes (%>%) • Instead of nesting functions (i.e. one function inside of another), we can use pipes. • Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same data set. > ToothGrowth %>% + filter(len < 5) %>% + select(len, supp, dose)
  • 63. • To create a new object with this smaller version of the data, we assign it a new name. > ToothGrowth_sml <- ToothGrowth %>% + filter(len < 5) %>% + select(len, supp, dose) ➢ mutate(): • create new columns based on the values in existing columns
  • 64. > ToothGrowth %>% + mutate(len = len / 4) • If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head() of the data. > ToothGrowth %>% + mutate(len = len / 4) %>% + head()
  • 65. • ToothGrowth itself has no missing values, but if a column did contain NAs and we wanted to remove those rows, we could insert filter() in this chain: > ToothGrowth %>% + mutate(len = len / 4) %>% + filter(!is.na(len)) %>% + head()
  • 66. ➢ group_by(): • group_by() splits the data into groups upon which some operations can be run. > ToothGrowth %>% group_by(supp) %>% tally() ➢ summarize(): • group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. > ToothGrowth %>% group_by(supp) %>% summarize(len = mean(len, na.rm = TRUE))
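The verbs from the preceding slides can be combined into one pipeline. The following is a rough sketch, assuming the dplyr package is installed; the derived column len_quarter and the result names n and mean_len are illustrative choices, not part of the original slides.

```r
library(dplyr)

# Chain the verbs: drop the shortest teeth, add a derived column,
# then summarise each supplement group and sort the summary
result <- ToothGrowth %>%
  filter(len > 5) %>%
  mutate(len_quarter = len / 4) %>%   # hypothetical derived column
  group_by(supp) %>%
  summarize(n = n(), mean_len = mean(len, na.rm = TRUE)) %>%
  arrange(desc(mean_len))

result
```

Each step receives the output of the previous one, so the pipeline reads top to bottom in the order the operations happen.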
  • 67. Data Frame Column Slice • We retrieve a data frame column slice with the single square bracket "[]" operator. • Numeric Indexing • The following is a slice containing the first column of the built-in data set mtcars. • > mtcars[1] mpg Mazda RX4 21.0 Mazda RX4 Wag 21.0 Datsun 710 22.8 ............ • Name Indexing • We can retrieve the same column slice by its name. • > mtcars["mpg"] mpg Mazda RX4 21.0 Mazda RX4 Wag 21.0 Datsun 710 22.8 ............ • To retrieve a data frame slice with the two columns mpg and hp, we pack the column names in an index vector inside the single square bracket operator. • > mtcars[c("mpg", "hp")] mpg hp Mazda RX4 21.0 110 Mazda RX4 Wag 21.0 110 Datsun 710 22.8 93 ............ •
  • 68. Exp 5. Creating Data Frame emp.data <- data.frame( emp_id = c(1:5), emp_name = c("Ratna", "Kumar", "Kamala", "Prajwal", "Pravachan"), salary = c(623.3, 515.2, 611.0, 729.0, 843.25), start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")), stringsAsFactors = FALSE ) > emp.data # Add the "dept" column. ➢ emp.data$dept <- c("IT","Operations","IT","HR","Finance") ➢ v <- emp.data ➢ print(v)
  • 69. Extracting rows and columns A = emp.data$emp_id B = emp.data$emp_name a) C = data.frame(A, B) b) emp.data[1:2,] c) emp.data[c(3,5), c(2,4)]
  • 70. ➢ emp.data[1:2,] ➢ emp_id emp_name salary start_date ➢ 1 1 Ratna 623.3 2012-01-01 ➢ 2 2 Kumar 515.2 2013-09-23 ➢ > emp.data[c(3,5),c(2,4)] ➢ emp_name start_date ➢ 3 Kamala 2014-11-15 ➢ 5 Pravachan 2015-03-27
  • 71. PROGRAM 6: ‘apply’ group of functions
  • 72. Function | Arguments | Objective | Input | Output • apply | apply(x, MARGIN, FUN) | Apply a function to the rows or columns or both | Data frame or matrix | vector, list, array • lapply | lapply(X, FUN) | Apply a function to all the elements of the input | List, vector or data frame | list • sapply | sapply(X, … | Apply a function to all the … | List, vector or … | vector or …
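To make the table concrete, here is a small sketch of the three functions using base R only. The matrix m and the column subset cars_subset are illustrative names chosen for this example.

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns, filled column by column

apply(m, 1, sum)   # MARGIN = 1: one sum per row    -> 9 12
apply(m, 2, sum)   # MARGIN = 2: one sum per column -> 3 7 11

cars_subset <- mtcars[c("mpg", "hp", "wt")]
means_list   <- lapply(cars_subset, mean)   # always returns a list
means_vector <- sapply(cars_subset, mean)   # simplifies to a named numeric vector

means_list
means_vector
```

lapply() and sapply() compute the same values here; the difference is only in the shape of the result.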
  • 73. PROGRAM 7- cbind-ing and rbind-ing • Matrices can be created by column-binding or row-binding with cbind() and rbind(). • > x <- 1:3 • > y <- 10:12 • > cbind(x, y) • x y • [1,] 1 10 • [2,] 2 11 • [3,] 3 12 • > rbind(x, y) [,1] [,2] [,3] • x 1 2 3 • y 10 11 12 >C <- cbind(1:3,4:6,5:7) >D <- rbind(1:3,4:6)
  • 74. PROGRAM 7: rbind() and cbind() functions. • Matrices can be created by column-binding or row-binding with cbind() and rbind(). • Data frames can also be appended by these functions. • > x <- 1:3 • > y <- 10:12 • > cbind(x, y) • x y • [1,] 1 10 • [2,] 2 11 • [3,] 3 12 • > rbind(x, y) • [,1] [,2] [,3] • x 1 2 3 • y 10 11 12
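Since the slide notes that data frames can also be appended with these functions, here is a minimal sketch. The data frames first, second, and ages are made-up examples, not data from the slides.

```r
first  <- data.frame(id = 1:2, city = c("Houston", "Seattle"))
second <- data.frame(id = 3:4, city = c("New York", "Houston"))

# rbind() stacks rows; the column names of both data frames must match
all_rows <- rbind(first, second)

# cbind() adds columns; the number of rows must match
ages <- data.frame(age = c(21, 34, 32, 67))
combined <- cbind(all_rows, ages)

dim(combined)   # 4 rows, 3 columns
combined
```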
  • 75. Factor Variables Factor variables are nominal variables, also known as categorical variables. Levels are the unique values that the variable takes. ➢ gender <- c(rep("male",20), rep("female", 30)) ➢ gender <- factor(gender) ➢ Levels: female male # Factor variables ➢ summary(gender) ➢ female male 30 20
  • 76. PROGRAM 8: DISCRETE IRIS ➢ iris$Seplen<- cut(iris$Sepal.Length, breaks=c(4.3,5.6,6.8,7.9), labels=c("low","medium","high")) ➢ > iris$Seplen ➢ [1] low low low low low low low low [9] low low low low low <NA> medium medium [17] low low medium low low low low low [25] low low low low low low low low [33] low low low low low low low low [41] low low low low low low low low [49] low low high medium high low medium medium [57] medium low medium low low medium medium medium [65] ….. ➢ Levels: low medium high
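The <NA> at position 14 in the output above arises because cut() builds left-open intervals by default, so the smallest break (4.3) is excluded. A small sketch of the effect and one possible remedy, using the include.lowest argument of cut(); the variable names are illustrative, and the iris data set is left unmodified.

```r
breaks <- c(4.3, 5.6, 6.8, 7.9)
labels <- c("low", "medium", "high")

# Default: the first interval is (4.3, 5.6], so the minimum 4.3 falls outside
default_bins <- cut(iris$Sepal.Length, breaks = breaks, labels = labels)
sum(is.na(default_bins))   # 1

# include.lowest = TRUE closes the first interval: [4.3, 5.6]
closed_bins <- cut(iris$Sepal.Length, breaks = breaks, labels = labels,
                   include.lowest = TRUE)
sum(is.na(closed_bins))    # 0
table(closed_bins)
```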
  • 77. PROGRAM 9 - SCATTER PLOT USING ‘DPLYR’ ON GUINEA PIGS ‘TOOTHGROWTH’ DATA SET
  • 78. ➢ aa <- select(ToothGrowth, len, supp, dose) # To choose rows we use filter() ➢ > filter(ToothGrowth, len<=14.5) ➢ > ToothGrowth %>% + group_by(supp) • > ToothGrowth %>% • + group_by(supp) %>% • + summarise(meanoflen=mean(len)) • > plot(aa)
  • 79. gg: grammar of graphics ➢ library(dplyr) ➢ > library(ggplot2) ➢ > ggplot(aa, aes(x=factor(dose), y=len, fill=supp)) ➢ > ggplot(aa, aes(x=factor(dose), y=len, fill=supp)) + geom_boxplot() ➢ # aes = aesthetic
  • 81. PROGRAM 10: LINEAR AND MULTIPLE REGRESSION Regression: A technique for determining the statistical relationship between two or more variables, where a change in a dependent variable is associated with, and depends on, a change in one or more independent variables. Linear Regression: Y = mX + c, with a single predictor X, slope m, and intercept c.
  • 82. Multiple Linear Regression Y = aX3 + bX2 + cX + d • 3 predictors/explanatory variables: X3, X2, X • a, b, c are coefficients • d is the intercept (bias value) • Y is the response variable, estimated or predicted from the 3 X variables.
  • 83. mtcars variables • [, 1] mpg: Miles/(US) gallon • [, 2] cyl: Number of cylinders • [, 3] disp: Displacement (cu.in.) • [, 4] hp: Gross horsepower • [, 5] drat: Rear axle ratio • [, 6] wt: Weight (1000 lbs) • [, 7] qsec: 1/4 mile time • [, 8] vs: Engine (0 = V-shaped, 1 = straight) • [, 9] am: Transmission (0 = automatic, 1 = manual) • [,10] gear: Number of forward gears • [,11] carb: Number of carburetors
  • 86. • For example, in the mtcars dataset you can build a linear model between fuel economy (mpg) and the weight of the car (wt): mpg = β0 + β1·wt • β1 is the slope • β0 is the intercept • mpg is the dependent variable • wt is the independent variable
  • 87. • Residuals: the difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e), so e = y − ŷ. • Each data point has one residual. • Example with x = 3 and c = 5: the observed value is y = 10*3 + 5 = 35. • The fitted model has slope m = 9, i.e. y = 9x + c, so the predicted value is ŷ = 9*3 + 5 = 32. • The residual is e = 35 − 32 = 3.
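The simple model of mpg against wt, and its residuals, can be computed directly. This is a minimal sketch using lm() from base R's stats package; the variable names fit, predicted, and e are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)                      # intercept (β0) and slope (β1)

predicted <- fitted(fit)       # ŷ for every car
e <- mtcars$mpg - predicted    # residual = observed - predicted

# The hand-computed residuals match those stored in the model object
all.equal(unname(e), unname(residuals(fit)))

head(e)
```

The slope for wt comes out negative, which agrees with the intuition that heavier cars travel fewer miles per gallon.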
  • 88. > mfit = lm(mpg ~ wt + disp + cyl, data=mtcars) > plot(mfit)
  • 89. PROGRAM NO: 11 Major Clustering Approaches (I) • Partitioning approach: – Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors – Typical methods: k-means, k-medoids, CLARANS • Hierarchical approach: – Create a hierarchical decomposition of the set of data (or objects) using some criterion – Typical methods: DIANA, AGNES, BIRCH, CHAMELEON • Density-based approach: – Based on connectivity and density functions – Typical methods: DBSCAN, OPTICS, DenClue • Grid-based approach: – Based on a multiple-level granularity structure – Typical methods: STING, WaveCluster, CLIQUE
  • 91. K-means clustering • names(iris) • [1] "Sepal.Length" "Sepal.Width" "Petal.Length" • [4] "Petal.Width" "Species" • • > x<-iris[,-5] • • > y<-iris$Species • • > kc<-kmeans(x,3) • • > kc • • K-means clustering with 3 clusters of sizes 38, 62, 50 • • Cluster means: • Sepal.Length Sepal.Width Petal.Length Petal.Width • 1 6.850000 3.073684 5.742105 2.071053 • 2 5.901613 2.748387 4.393548 1.433871 • 3 5.006000 3.428000 1.462000 0.246000 •
  • 92. • Clustering vector: • [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 • [29] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2 • [57] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 • [85] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 1 • [113] 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 • [141] 1 1 2 1 1 1 2 1 1 2 • • Within cluster sum of squares by cluster: • [1] 23.87947 39.82097 15.15100 • (between_SS / total_SS = 88.4 %)
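Because k-means starts from random centres, cluster numbers and sizes can differ between runs. The sketch below fixes the seed and uses several random starts so the result is repeatable; the seed value and nstart = 25 are arbitrary choices for illustration, not settings from the slides.

```r
set.seed(42)                    # arbitrary seed, for repeatable clusters
x <- iris[, -5]                 # drop the Species column
kc <- kmeans(x, centers = 3, nstart = 25)

# Cross-tabulate the clusters against the known species
table(cluster = kc$cluster, species = iris$Species)

# Share of the total variance explained by the clustering
kc$betweenss / kc$totss
```

The cross-tabulation is a quick way to see how well the unsupervised clusters line up with the species labels that k-means never saw.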
  • 96. • > library(fpc) • > iris1 <- iris[, -5] # assumed: iris without the Species column • > pamresult <- pamk(iris1) • > pamresult$nc # nc: number of clusters • [1] 2 • > table(pamresult$pamobject$clustering, iris$Species) • setosa versicolor virginica • 1 50 1 0 • 2 0 49 50 • > layout(matrix(c(1,2),1,2)) # two plots on one page • > plot(pamresult$pamobject)
  • 101. • The ggplot() command creates a plot object. In it we assigned a data set. • aes() creates what Hadley Wickham calls an aesthetic: a mapping of variables to various parts of the plot. ... • Another way to split up the way we look at data is with facets.
  • 102. > ggplot(mtcars, aes(wt, mpg)) Error in ggplot(mtcars, aes(wt, mpg)) : could not find function "ggplot" > library(ggplot2) > ggplot(mtcars, aes(wt, mpg)) > ggplot(mtcars, aes(wt, mpg)) + geom_point() > ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_smooth(method="lm") > ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_abline()
  • 103. ➢ > library(ggplot2) ➢ > ggplot(mtcars, aes(wt, mpg)) ➢ > ggplot(mtcars, aes(wt, mpg)) + geom_point() ➢ > ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_smooth(method="lm")
  • 105. > ggplot(mtcars, aes(x=wt, y=mpg, col=cyl, size=disp)) + geom_point()
  • 106. What combination of predictors will best predict fuel efficiency (slope/coefficients and intercept)? Which predictors increase our accuracy by a statistically significant amount? We need to identify which predictors are significant and determine the best formula for prediction. This is the job of linear regression.
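One way to approach the significance question is to read the coefficient table of a fitted model. The sketch below refits the three-predictor model from slide 88; the 0.05 cut-off is a conventional threshold assumed here for illustration, and the variable names are ours.

```r
mfit <- lm(mpg ~ wt + disp + cyl, data = mtcars)

# One row per term: estimate, standard error, t value, p-value
coefs <- summary(mfit)$coefficients
coefs

# Terms whose p-value falls below the assumed 0.05 threshold
p_values <- coefs[, "Pr(>|t|)"]
names(p_values)[p_values < 0.05]

# Adjusted R-squared penalises predictors that add little
summary(mfit)$adj.r.squared
```

Comparing the adjusted R-squared of models with different predictor sets is one simple way to judge whether an extra predictor is worth keeping.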
  • 107. Density-Based Clustering Methods • Clustering based on density (local cluster criterion), such as density-connected points • Major features: – Discover clusters of arbitrary shape – Handle noise – One scan – Need density parameters as termination condition
  • 108. Density-Based Clustering: Basic Concepts • Two parameters: – Eps: Maximum radius of the neighbourhood – MinPts: Minimum number of points in an Eps-neighbourhood of that point • NEps(p): {q belongs to D | dist(p,q) ≤ Eps} • Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if – p belongs to NEps(q) – core point condition: |NEps(q)| ≥ MinPts (Figure: points p and q with MinPts = 5, Eps = 1 cm.)
  • 109. Density-Reachable and Density-Connected • Density-reachable: – A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi • Density-connected: – A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
  • 110. DBSCAN: Density-Based Spatial Clustering of Applications with Noise • Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points • Discovers clusters of arbitrary shape in spatial databases with noise (Figure: core, border, and outlier points with Eps = 1 cm, MinPts = 5.)
  • 111. DBSCAN: The Algorithm • Arbitrarily select a point p • Retrieve all points density-reachable from p w.r.t. Eps and MinPts • If p is a core point, a cluster is formed • If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database • Continue the process until all of the points have been processed
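The core, border, and noise definitions above can be illustrated in base R. This is only a sketch of the point classification step, not a full DBSCAN implementation (it does not grow or label clusters); the Eps and MinPts values are arbitrary choices for the iris measurements.

```r
x <- as.matrix(iris[, 1:4])
eps     <- 0.5   # assumed neighbourhood radius
min_pts <- 5     # assumed minimum neighbourhood size

# Pairwise Euclidean distances; neighbours[i, j] is TRUE when j lies in NEps(i)
d <- as.matrix(dist(x))
neighbours <- d <= eps              # each point counts as its own neighbour

# Core point condition: |NEps(p)| >= MinPts
is_core <- rowSums(neighbours) >= min_pts

# Border points are not core but lie within Eps of at least one core point
near_core <- rowSums(neighbours[, is_core, drop = FALSE]) > 0
is_border <- !is_core & near_core
is_noise  <- !is_core & !near_core

table(ifelse(is_core, "core", ifelse(is_border, "border", "noise")))
```

A complete DBSCAN would then link core points that lie within Eps of each other into clusters and attach each border point to a neighbouring cluster.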
  • 117. Before Package ‘rpart’ Title: Recursive Partitioning and Regression Trees A regression line is a straight line that attempts to describe the relationship between two variables, also known as a trend line or line of best fit. Simple linear regression predicts a variable (y) from a second variable (x) using the regression equation fitted to a given set of data.
  • 118. Decision trees are of two types: • Classification Trees (CTs) are used when the target or response variable is categorical. • Regression Trees (RTs) are used when the target variable is continuous or numeric. • It is the target variable that determines the type of decision tree needed.
  • 119. DECISION TREES USING PARTY: PROGRAM 12 • > install.packages("readr") • > library(readr) • > install.packages("party") • Installing package into ‘C:/Users/My Document/Documents/R/win-library/3.4’ • (as ‘lib’ is unspecified) • trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/party_1.2-3.zip' • Content type 'application/zip' length 719826 bytes (702 KB) • downloaded 702 KB • package ‘party’ successfully unpacked and MD5 sums checked • The downloaded binary packages are in • C:\Users\My Document\AppData\Local\Temp\RtmpOAuKaM\downloaded_packages • > library(party)
  • 120. DECISION TREE USING RPART: PROGRAM 12 • Usage: rpart(formula, data, weights, subset, na.action = na.rpart, method, model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...) • > library(rpart) • > tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris, method="class")
  • 121. • > View(iris) • > iris$Species <- as.factor(iris$Species) # already a factor in the built-in iris data • > tree1 <- ctree(Species ~ Sepal.Length, data=iris) • > plot(tree1)
  • 124. > plot(tree, uniform=TRUE, main="Classification Tree for Iris dataset") > text(tree, use.n=TRUE, all=TRUE, cex=.8)
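Beyond plotting, the fitted classification tree can be used to predict and checked against the known labels. A minimal sketch, assuming the rpart package is available (it ships with standard R installations); evaluating on the training data itself is done here only for illustration and overstates real-world accuracy.

```r
library(rpart)

tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
              data = iris, method = "class")

# Predicted class for every flower in the training data
predicted <- predict(tree, iris, type = "class")

# Confusion matrix: rows are predictions, columns are true species
table(predicted, actual = iris$Species)

# Share of flowers classified correctly (training accuracy)
mean(predicted == iris$Species)
```

For an honest estimate, the data would normally be split into training and test sets before fitting.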
  • 126. • X1, X2 Attributes
  • 129. ABOUT DIFFERENT TYPES OF VARIABLES
  • 130. FEW GOOD WEB SITES ON R www.kaggle.com www.rdocumentation.org www.statmethods.net www.r-tutor.com www.tutorialspoint.com www.datacamp.com www.github.com https://drsimonj.svbtle.com/visualising-residuals