2. Big Data: Why it matters
2
In the past, data(record) are only for important people
and events
e.g. royal family, war records, …
3. Big Data: Why it matters
3
In the past, data(record) are expensive and for very
important event or a few privileged people
e.g. royal family, war records, …
Now data is for everyone and every moment
5. Data easily go beyond Human and computer
5
Moore’s Law is a computing term which originated around 1970; the simplified
version of this law states that
processor speeds, or overall processing power for computers will double
every two years.
8. Data Science
Data Science aims to derive knowledge fr
om big data, efficiently and intelligently
Data Science encompasses the set of acti
vities, tools, and methods that enable da
ta-driven activities in science, business,
medicine, and government
Machine Learning(or Data Mining) is agai
n one of the core technologies that enabl
es Data Science
http://www.oreilly.com/data/free/what-is-data-science.csp
8
Data Science is process to get Actionable Insight from Big DATA
9. Night time Bus by Seoul Local Government
Effective Bus Route Design with Big Data
5 Million Records of Taxi Take on/off
3 Billion Night time call records(location) from telco company
10. Singapore Traffic visualization
Data from LTA
Subway, bus, taxi traffic info.
By data analytics team in Institute for Infocomm Research
12. What We Cover Here
• Data Manipulation and Visualization in R
12
13. R
• Most common tools for data scientist other than DBMS
• Cover wide range of data scientist – Data engineer, Statistician, Data main
expert, … rather than just computer engineer
• Providing thousands of ready-to-use powerful packages for data scientist
• Well documented
13
http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html
14. R How to Start
• Explanation of windows in R studio
14
15. R
• Getting help
▫ help(command) or ?command
▫ example(command) to see examples
15
16. Package Installation and loading
• install.packages(“package name”)
• to load package
▫ library(“package name”)
▫ or require(“package name”)
16
17. Variable
• Variable is a container to hold data (or information) that
we want to work with
• Variable can hold
▫ a single value: 10, 10.5, “abc”, factor, NA, NULL
▫ multiple values: vector, matrix, list
▫ specially formatted data (values): data.frame
17
18. How you assign a value to variable
• var <- value
18
my_first_variable <- 35.121
New variable is now assigned and
available in working environment
19. Operators
19
a <- 10.5
b <- 20
c <- 4
a + b ## addition
## [1] 30.5
a - c ## substraction
## [1] 6.5
a * c ## mulitiplication
## [1] 42
b / c ## division
## [1] 5
a %% c ## remainder
## [1] 2.5
a > b ## inequality
## [1] FALSE
a*2 == b ## equality
## [1] FALSE
!(a > b) ## negation
## [1] TRUE
(b > a) & (b > c) ## logical AND
## [1] TRUE
(a > b) | (a > c) ## logical OR
## [1] TRUE
20. Data Type – Missing Value (NA)
• Sometimes values are missing, and R represent the
missing values as NAs
20
21. Vector
• A vector is a sequence of data elements of the same basic type.
• All members should be of same data type
21
numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")
boolean_vector <- c(TRUE, FALSE, TRUE)
typeof(numeric_vector)
## [1] "double"
typeof(character_vector)
## [1] "character"
typeof(boolean_vector)
## [1] "logical"
length(numeric_vector) ## number of members in the vector
## [1] 3
new_vector <- c(numeric_vector, 50)
new_vector
## [1] 1 10 49 50
22. Vector
• R’s vector index starts from 1
▫ 1,2,3,4, …
• Minus Index means “except for”
22
23. Vector with named elements
• We can give name to each element of vector
• and we can use the name instead of index number
23
some_vector <- c("John Doe", "poker player")
names(some_vector) <- c("Name", "Profession")
some_vector
## Name Profession
## "John Doe" "poker player"
some_vector['Name']
## Name
## "John Doe"
some_vector['Profession']
## Profession
## "poker player"
some_vector[1]
## Name
## "John Doe"
24. Vector with named elements
• We can give name to each element of vector
• and we can use the name instead of index number
24
weather_vector <- c("Mon" = "Sunny", "Tues" = "Rainy",
"Wed" = "Cloudy", "Thur" = "Foggy",
"Fri" = "Sunny", "Sat" = "Sunny",
"Sun" = "Cloudy")
weather_vector
## Mon Tues Wed Thur Fri Sat Sun
## "Sunny" "Rainy" "Cloudy" "Foggy" "Sunny" "Sunny" "Cloudy“
names(weather_vector)
## [1] "Mon" "Tues" "Wed" "Thur" "Fri" "Sat" "Sun"
26. Basic Vector operations
26
a_vector <- c(1,5,2,7,8)
b_vector <- seq(1, 10, 2)
sum(a_vector) ## summation
## [1] 23
mean(a_vector) ## average
## [1] 4.6
# operation of Vector and Scala
a_vector + 10
## [1] 11 15 12 17 18
a_vector > 4
## [1] FALSE TRUE FALSE TRUE TRUE
sum(a_vector > 4) ## what does this mean?
## [1] 3
# operation of Vector and Vector
a_vector - b_vector
## [1] 0 2 -3 0 -1
a_vector == b_vector
## [1] TRUE FALSE FALSE TRUE FALSE
sum(a_vector == b_vector) ## what does this mean?
## [1] 2
27. Vector Indexing (Selection)
27
sample_vector <- c(1, 4, NA, 2, 1, NA, 4, NA) ## vector with some missing values
sample_vector[1:5]
## [1] 1 4 NA 2 1
sample_vector[c(1,3,5)]
## [1] 1 NA 1
sample_vector[-1]
## [1] 4 NA 2 1 NA 4 NA
sample_vector[c(-1, -3, -5)]
## [1] 4 2 NA 4 NA
sample_vector[c(T, T, F, T, F, T, F, T)]
## [1] 1 4 2 NA NA
is.na(sample_vector)
## [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
sum(is.na(sample_vector))
## [1] 3
## can you select non-NA elements from the vector?
Selection by numeric vector
Selection by logical vector
28. Data Frame
• Very commonly datasets contains variables of different kinds
▫ e.g. student dataset may contain name(character), age(integer), major(factor),
gpa(numeric, real number)…
• Vector and metric can have values of same data type
• A data frame has the variables of a data set as columns and the
observations as rows.
28
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
29. Overviewing of Data frame
• head functions shows the first n (6 by default) observation of dataframe
• tail functions shows the last n (6 by default) observation of dataframe
29
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
head(mtcars, 10) ## try to see what happens
tail(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
tail(mtcars, 10) ## try to see what happens
30. Overviewing of Data frame
• str function shows the structure of your data set, it tells you
▫ The total number of observations (e.g. 32 car types)
▫ The total number of variables (e.g. 11 car features)
▫ A full list of the variables names (e.g. mpg, cyl ... )
▫ The data type of each variable (e.g. num)
▫ The first few observations
30
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
31. Creating Data frame
• data.frame function with vectors (of same length and possibly different
type) makes you a data frame
31
# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)
planets_df
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
32. Creating Data frame
• you may specify the variables as parameters
32
my.df <- data.frame(name = c('John', 'Kim', 'Kaith'), job =
c('Teacher', 'Policeman', 'Secertary'), age = c(32, 25, 28))
my.df
## name job age
## 1 John Teacher 32
## 2 Kim Policeman 25
## 3 Keith Secretary 28
33. Selection of data frame elements
Similar to vectors and matrices, you select elements from a data frame with
the help of square brackets [ ].
By using a comma, you can indicate what to select from the rows and the
columns respectively.
33
# Print out diameter of Mercury (row 1, column 3)
planets_df[1,3]
## [1] 0.382
# Print out data for Mars (entire fourth row)
planets_df[4, ]
## name type diameter rotation rings
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
# you can use of directly variable name
# Select first 5 values of diameter column
planets_df[1:5, 'diameter']
## [1] 0.382 0.949 1.000 0.532 11.209
34. Selection of data frame elements
You will often want to select an entire column, namely one specific variable from a data
frame. If you want to select all elements of the variable diameter, for example, both of
these will do the trick:
planets_df[,3]
planets_df[,"diameter"]
However, there is a short-cut. If your columns have names, you can use the $ sign:
planets_df$diameter
34
35. Selection of data frame elements
- a tricky part
• You can use a logical vector to select from data frame
35
## find planets with rings
planets_df[planets_df$rings, ]
## name type diameter rotation rings
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
## select names of planets with rings
planets_df[planets_df$rings, 'name']
## [1] Jupiter Saturn Uranus Neptune
## Levels: Earth Jupiter Mars Mercury Neptune Saturn Uranus Venus
## find planets with larger diameter than earth
planets_df$diameter > 1
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
planets_df[planets_df$diameter > 1, ]
## name type diameter rotation rings
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
36. Handling Big Data using R
<2st> Package “dplyr”, essential for data preprocessing in R
36
HyunJae Jo
School of ICT and Global Entrepreneurship
Handong Global University
38. Why we have to do “Data preprocessing”
• Data preprocessing is the process of making raw data
suitable for data analysis.
38
• Data Cleaning
Correction of missing value,
outlier and noisy data
• Data Integration
Combining data from multiple
sources
• Data Reduction
Reduce only the data needed for
analysis
• Data Transformation
Data transformation to maximize
efficiency of data analytics
39. Data Example of “Data Integration”
39
Missing
value Noisy data
Outlier
1. Data Integration
Integrating “Female Population”
Column
“Data preprocessing”
40. Example of “Data Cleaning”
40
Missing
value Noisy data
Outlier
“Data preprocessing”
2. Data Cleaning
Correction “Male Pop.” column
Remove missing value(“others” column)
Correction outliers
41. Example of “Data Reduce”
41
Missing
value Noisy data
Outlier
“Data preprocessing”
3. Data Reduce
Top 5 Population Regions by Total Population
42. Example of “Data Transformation”
42
Missing
value Noisy data
Outlier
“Data preprocessing”
4. Data Transfromation
Ratio region population to total population
43. What is “dplyr”
• "dplyr" is an R package that specializes in data processing.
• Functions of “dplyr”
▫ mutate() adds new variables that are functions of existing variables
▫ select() picks variables based on their names.
▫ filter() picks cases based on their values.
▫ summarise() reduces multiple values down to a single summary.
▫ arrange() changes the ordering of the rows.
▫ group_by() allows to perform any operation “by group”
▫ %>%(chain) connect each operation and perform it at once
43
44. How we can use “dplyr”
• Install packages and load packages.
• If you want to see package manual in R, you can use “help” function
44
45. imdb dataset
• IMDb (Internet Movie Database) is an online database of information related to
films, television programs, home videos and video games, and streaming content
online -- including cast, production crew and personnel biographies, plot
summaries, trivia, and fan reviews and ratings. As of October 2018, IMDb has
approximately 5.3 million titles (including episodes) and 9.3 million
personalities in its database, as well as 83 million registered users. (Wikipedia)
• We use reduced imdb dataset and you can load imdb dataset using read.csv()
function with URL link:
• https://raw.githubusercontent.com/myhan0710/prac_dplyr/master/imdb.csv
45
47. Lab Example
• We will now look at the functions of the dplyr library through the imdb
dataset.
• Familiarize yourself with the roles of the given functions, and just enter
the code in your RStudio.
• Through the Lab, we can finally obtain data from the imdb dataset that
meet the following conditions:
47
48. dplyr: filter
48
• Use filter() function to extract specific
row that fits the condition.
• First, we can find recent, high rating and
• many rating counts movie like this:
49. dplyr: mutate
49
• Use mutate() function to make new column.
• Second, we can make column that indicates
whether a movie contains action and adventure
genres:
• If a certain movie’s Action column value is 1 and
Adventure column value is 0, it means that movie
includes action genre, but not Advenutre genre.
• So, ActionAdven column value 1 means a movie
includes both action and adventure genres.
50. dplyr: select
• Use select() function to extract specific column
that fits the condition.
• We only need a few columns:
- wordsInTitle
- imdbRating
- ratingCount
- Year
- ActionAven
50
51. dplyr: arrange
51
• Use the arrange() function to sort data from small to large values based on a
specified column.
• Fourth, we would like to sort the year
in ascending order and then grade in descending order.
• desc() function allows it to be sorted in descending order.
52. dplyr: summarise
• Use the summarise() function to obtain basic statistics by specifying
functions such as mean(), var(), and median()
• Now, we want to know average rating, average year, and number of
Action-Adventure genre.
52
53. dplyr: group_by
• Use the group_by() function, then you can group data by level in the
specified column.
• After group_by() function, You can associate the results with summarise()
to view the results of each level.
53
54. dplyr: %>%
• Use the chain function(%>%) to write a code with all functions at once.
• Then, the results are same. (imdb_chain, imdb7)
54
56. Excercise
① Extract movies from the original imdb dataset between 1970
and 2000.
② Using chain function, extract movies that has both drama
and family genre. Then, columns are only year, duration,
nrOfphotos.
• Please replace <Fill In> with your solution.
• After replacing your solution, check your solution is right on the next page.
56
59. Data Visualization
• Essential component of skill set as a data scientist
• With ever increasing volume of data, it is impossible to tell
stories without visualizations.
• Data visualization is an art of how to turn “Big Data” into
useful knowledge
59
60. Data Visualization
• Data Visualization
60
Statistics Design
Graphical
Data Analysis
Communication &
Perception
61. Data Visualization
• Exploratory Visualization
▫ Help you see what is in the data
• Explanatory Visualization
▫ Shows others what you’ve found in your data
• R supports both types of visualizations
61
62. Data Visualization
• Exploratory Visualization
▫ Help you see what is in the data
▫ Keep as much as detail as possible
▫ Practical Limit: how much can you see and interpret
• Explanatory Visualization
▫ Help us share our understanding with others
▫ Shows others what you’ve found in your data
▫ Requires editorial decisions:
▫ Highlight the key features you want to emphasize
▫ Eliminate extraneous details
62
65. ggplot2
• Author: Hadley Wickham
• Open Source implementation of the layered grammar of
graphics
• High-level R package for creating publication-quality statistical
graphics
▫ Carefully chosen defaults following basic graphical design rules
• Flexible set of components for creating any type of graphics
• Things you cannot do With ggplot2
▫ 3-dimensional graphics
▫ Graph-theory type graphs (nodes/edges layout)
65
70. Grammar of Graphics
70
• Plotting Framework
• Leland Wilkinson, Grammar of Graphics, 1999
• 2 principles
▫ Graphics = distinct layers of grammatical elements
▫ Meaningful plots through aesthetic mapping
71. Essential Grammatical Elements
71
Element Description
Data The dataset being plotted.
Aesthetics The scales onto which we map our data.
Geometries The visual elements used for our data.
72. All Grammatical Elements
72
Element Description
Data The dataset being plotted.
Aesthetics The scales onto which we map our data.
Geometries The visual elements used for our data.
Facets Plotting small multiples
Statistics Representations of our data to aid understanding.
Coordinates The space on which the data will be plotted.
Themes All non-data ink.
74. Grammar of Graphics
• Building blocks
• Solid, creative, meaningful visualizations
• Essential Layers: Data, Aesthetics, Geometries
• For Enhancement: Facets, Statistics, Coordinates,
Themes
74
75. Example – First trial of ggplot
• To get a first feel for ggplot2, let's try to run some basic ggplot2 commands.
Together, they build a plot of the mtcars dataset that contains information about
32 cars from a 1973 Motor Trend magazine. This dataset is small, intuitive, and
contains a variety of continuous and categorical variables.
75
76. Example – First trial of ggplot
• The plot from the previous exercise wasn't really satisfying.
Although cyl (the number of cylinders) is categorical, it is classified
as numeric in mtcars. You'll have to explicitly tell ggplot2 that cyl is
a categorical variable.
76
Hello everyone, my name is Hyunjae Jo. I’m instructor of 2nd lectures in big data course.
In the previous lecture, you could hear the introduction of big data and basic grammars of R.
In this lecture, we will learn about data preprocessing and ‘dplyr’ which is the most powerful package for data preprocessing.
This is what we are going to do in this lecture.
First, we will learn what data preprocessing is and why we have to do.
And you’ll learn 4 types of data preprocessing.
Next, we’ll learn about the dplyr package.
You will find out what functions are in the package and what their role is.
Lastly, we’ll use dplyr package through rap example.
Let’s start. What is data preprocessing? Data preprocessing is the process of making raw data suitable for data analysis.
When you download original data, there may be unnecessary or incorrect contents in the data.
The raw data itself is inconvenient for you to analyze and can be unnecessarily large in size.
You should refine these original data into the data that you could use conveniently. And we call this process data preprocessing.
There are four main types of data pre-processing, and each of which is used when a process is required.
Data cleaning is a process of correcting incorrect values.
And Data integration is a process of synthesizing the required data into the existing data.
Data reduction is a process of removing unnecessary data.
Lastly, data transformation is a process of converting raw data to the required data form.
Let's take a look at it with an example.
The picture on the left is a sample of data related to Myanmar's population.
Current data shows the total population and male population by region.
But as you can see, there are a few wrong numbers or empty items in the data.
You can see these minus values. We call it outlier when the figures are significantly different from the existing data.
And others row have no value in each column, we call it missing value.
Lastly, Male population almost has unnecessary values, and we call it noisy data.
So, Let’s preprocess this data step by step.
First, I think we need one more column, Female population by region.
So, I integrate one more column “Female population”.
We call this process Data integration.
Next, we have to correct wrong values like outliers, missing values, and noisy data.
As shown in the picture on the right, the wrong values have been changed to the right ones.
And some unnecessary values were removed. We call this process data cleaning.
Now, we want to select a top five areas of the population and analyze this data.In other words, without top five population regions, the other data is not needed.I pick out only the top five population areas, and the rest of the data is removed.This process is called data redirection.
Finally, with top 5 population regions data, we want to know ratio of the population by region.
So, I divide the population of each region by the total population.
This enabled me to get a percentage of the population from region to region.
We call this process data transformation.
Through four preprocesses of data, we could able to obtain the percentage of population in each region
from the raw data.
Yeah, this is data preprocessing. I believe you follow me well.
Now, we learn about package dplyr, which is most skillful way to preprocess data in R.
As you can see here, dplyr supports seven major functions.You will actually use the above functions through the back slides.
Yesterday, all computer was installed dplyr.
So, you don’t have to install it. Just load package ‘dplyr’
And you want some guide about dplyr, you can use help function.
Give it a try!
Let me introduce the imdb dataset that we will use to apply dplyr function.imdb dataset mainly has information related to movies, TV and so on.
It's a very large dataset, so we're going to do pre-processing with some scaled-down imdb datasets.
You can load a dataset through read.csv function and its url link.Now get the data on your computer.
If there's a problem, raise your hand and get some helps.
I think most of you have done it, and now let's look at the structure of the dataset.
You can use str function to look at the structure of the data.Through str function, you can see the number of observations in the data, total number of variables, names of the variables, type of data in the variables, and some observations in each variable.
Our data has 15190 observations and 44 variables.
Now let’s start lab example.
We will use dplyr function and apply it imdb dataset.
Familiarize yourself with the function and enter the code on it.
After the lab, you will know what data preprocessing is.
First, you can use filter function.
Through filter function, you can extract specific rows that fit the conditions.
So, first preprocessing is to find recent, high rating and many rating counts movies.
I’ll give you time to write this code, so now, just follow me.
Second function is mutate.
You can make new column through mutate function.
With this function, you can make a new column that show whether a movie contains both an action and an adventure genre.
I named the column ActionAdven.
If ActionAdven column is 1, it contains both genres.
And ActionAdven column value 0 means it doesn’t contain one or more genre.
Third function is select.
You can extract specific columns that fit the conditions with select function.
So, we make a dataframe that only has wordsInTitle, imdbRating, ratingCount, year, ActionAdven columns.
Write above these three codes on your r and Check the result is same like pictures.
So, I think most of you have written the select function code.
Let’s move on the next.
next function is arrange. With arrange function, we can sort data from small to large value based on specific column.
And also, you can sort data from large to small value with desc function.
Now, we would like to sort data by the year.
Finally, with most powerful function, you can write code with all functions at once.
This function name is chain function, and with this function, you can link all the codes you used before.
Please write down the code below and check the results.If it was written correctly, the results of the imdb_chain and imdb7 would be the same.
This is all the code of the lab example we just did.
Now let's try to solve the problem.
There are two problems.
Read it and fill in the blanks.
After you solve it, try matching the answer to the next one.
I think most of them are done, so I'll finish this selection with this.You learned what data preprocessing is and why we have to do.And we looked at the use of dplyr,
most powerful data processing package in R.
I hope what you've learned today will be helpful to you when you're doing an analysis through big data.I'll finish with this. Thank you.