2. Data-visualization tools and techniques offer executives and other
knowledge workers new ways to dramatically improve their ability
to grasp information hidden in their data.
Data visualization is a general term that describes any effort to help
people understand the significance of data by placing it in a visual
context. Patterns, trends and correlations that might go undetected in
text-based data can be exposed and recognized more easily with data
visualization software.
The huge range of statistical analyses that R affords is not the only
thing that attracts data people to the language. Over the years, R has
also developed a rich ecosystem of charts, plots and visualizations.
3. ggplot2 is a data visualization package for the statistical programming language R.
Created by Hadley Wickham in 2005, ggplot2 is an implementation of Leland
Wilkinson's Grammar of Graphics—a general scheme for data visualization which
breaks up graphs into semantic components such as scales and layers.
ggplot2 can serve as a replacement for the base graphics in R and contains a
number of defaults for web and print display of common scales.
Since 2005, ggplot2 has grown in use to become one of the most popular R
packages. It is licensed under GNU GPL v2.
4. Basic Visualization
Histogram
Bar / Line Chart
Box plot
Scatter plot
Advanced Visualization
Heat Map
Mosaic Map
Map Visualization
3D Graphs
Correlogram
5. 1. Histogram
A histogram is a plot that divides the data into bins (defined by
break points) and shows the frequency of values in each bin. You
can also change the breaks and see how that affects the
readability of the visualization.
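To make the effect of the breaks concrete, here is a minimal sketch using base R's hist(). The choice of the built-in AirPassengers dataset and the specific break points are assumptions made for illustration; they are not taken from the original slides.

```r
# AirPassengers ships with base R; as.numeric() drops the time-series
# attributes so hist() sees a plain numeric vector.
passengers <- as.numeric(AirPassengers)

# Default binning: hist() chooses break points itself (Sturges' rule).
h_default <- hist(passengers, main = "Default breaks", xlab = "Passengers")

# Coarse binning: explicit break points every 200 passengers.
h_coarse <- hist(passengers, breaks = seq(100, 700, by = 200),
                 main = "Breaks every 200", xlab = "Passengers")

# Fine binning: explicit break points every 25 passengers.
h_fine <- hist(passengers, breaks = seq(100, 625, by = 25),
               main = "Breaks every 25", xlab = "Passengers")
```

Too few bins hide the shape of the distribution, while too many make it noisy, so it is usually worth trying a few settings.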
7. 2. Bar / Line Chart
Line Chart
The line chart for this slide shows the increase in air
passengers over a given time period. Line charts are
commonly preferred when we want to analyse a trend
over a period of time. A line plot is also suitable
when we need to compare relative changes in
quantities across some variable (such as time).
Below is the code:
plot(AirPassengers, type = "l")  # simple line plot
9. Bar Chart
Bar plots are suitable for comparing cumulative totals
across several groups. Stacked bar plots additionally
break each bar down by a second category.
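The code from the original slide is not reproduced here, so the following is only a sketch of a simple and a stacked bar plot with base R's barplot(). The mtcars dataset and the grouping by gears and cylinders are assumptions chosen for illustration.

```r
# mtcars ships with base R. Count the cars for each number of gears.
gear_counts <- table(mtcars$gear)
barplot(gear_counts, main = "Cars by number of gears",
        xlab = "Gears", ylab = "Count")

# Two-way table: rows are cylinders, columns are gears.
cyl_by_gear <- table(mtcars$cyl, mtcars$gear)

# With a matrix-like input, barplot() stacks the rows within each
# column's bar by default (beside = FALSE).
barplot(cyl_by_gear, main = "Cars by gears, stacked by cylinders",
        xlab = "Gears", ylab = "Count",
        legend.text = rownames(cyl_by_gear))
```

Passing beside = TRUE instead would draw the categories side by side rather than stacked.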
11. 3. Box Plot (including group-by option)
A box plot shows five summary statistics: the minimum, the 25th percentile, the
median, the 75th percentile and the maximum. It is thus useful for visualizing the spread
of the data and drawing inferences accordingly.
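The original slide's code is not available, so this is a minimal sketch of a plain and a grouped box plot using base R. The mtcars dataset and the mpg and cyl columns are assumptions for illustration. Note that by default boxplot() draws the whiskers only out to the most extreme points within 1.5 times the interquartile range and plots anything beyond that as individual outlier points.

```r
# Five-number summary of fuel economy: min, lower hinge, median,
# upper hinge, max.
mpg_summary <- fivenum(mtcars$mpg)

# Box plot of a single numeric vector.
boxplot(mtcars$mpg, main = "Fuel economy", ylab = "Miles per gallon")

# Group-by option: the formula mpg ~ cyl draws one box per cylinder count.
# boxplot() invisibly returns the statistics it used for each box.
grouped <- boxplot(mpg ~ cyl, data = mtcars,
                   main = "Fuel economy by cylinders",
                   xlab = "Cylinders", ylab = "Miles per gallon")
```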
12. Ingest Data
When reading data into R, we generally use
the read.table() or read.csv() function. Each opens a file and
returns its contents as a data frame.
In this slide's example, the contents of the file are stored in the
variable bugData. Notice that R conventionally uses the <- operator
for assignment rather than the = used in most other languages.
There are certain parameters that we can pass in
to read.table().
Among the most often used of these parameters
are: sep, header, row.names, and col.names.
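As a rough sketch of these functions and parameters, the snippet below writes a small CSV file to a temporary location and reads it back. The file contents and column names are made up for illustration; only the variable name bugData comes from the slide.

```r
# Create a small example file (hypothetical data, for illustration only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,severity,status",
             "101,high,open",
             "102,low,closed",
             "103,medium,open"), csv_path)

# read.csv() assumes sep = "," and header = TRUE.
bugData <- read.csv(csv_path)

# read.table() with the parameters spelled out:
#   sep       - field separator
#   header    - whether the first line holds column names
#   col.names - names to use for the columns (overrides the header)
#   row.names - column (here the first) to use as row names
bugTable <- read.table(csv_path, sep = ",", header = TRUE,
                       col.names = c("bug_id", "severity", "status"),
                       row.names = 1)

str(bugData)
print(bugTable)
```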
13. R provides the plot() function, which can be used to create time
series charts. We can either pass in a complete data structure
(if its class has a plot method, as time series objects do), or
we can pass in two vectors to serve as the x and y coordinates of the chart.
To open the documentation, run:
?plot
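The two calling styles can be sketched as follows. The use of the built-in AirPassengers series is an assumption for illustration, as are the variable names.

```r
# Style 1: pass a complete object. AirPassengers has class "ts",
# so plot() dispatches to the time-series plotting method.
plot(AirPassengers, main = "Whole object", ylab = "Passengers")

# Style 2: pass separate x and y vectors of equal length.
years <- as.numeric(time(AirPassengers))   # decimal years, e.g. 1949.083
counts <- as.numeric(AirPassengers)        # passenger counts
plot(years, counts, type = "l",
     main = "Explicit x and y", xlab = "Year", ylab = "Passengers")
```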
17. R provides the hist() function to create histograms.
The hist() function accepts a vector of values.
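Because hist() expects a numeric vector, a column usually has to be extracted from a data frame first. A minimal sketch, assuming the built-in mtcars dataset:

```r
# Extract one numeric column as a vector with $ and pass it to hist().
mpg <- mtcars$mpg
hist(mpg, main = "Fuel economy", xlab = "Miles per gallon")

# plot = FALSE skips drawing and just returns the computed bins.
bins <- hist(mpg, plot = FALSE)
print(bins$breaks)  # bin boundaries
print(bins$counts)  # number of values in each bin
```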
18. Usage
ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())
Arguments
data
Default dataset to use for plot. If not already a data.frame, will be converted to one
by fortify. If not specified, must be supplied in each layer added to the plot.
mapping
Default list of aesthetic mappings to use for plot. If not specified, must be supplied in
each layer added to the plot.
environment
If a variable defined in the aesthetic mapping is not found in the data, ggplot will
look for it in this environment. It defaults to using the environment in which ggplot() is
called.
19. ggplot() is used to construct the initial plot object, and is
almost always followed by + to add components to the plot. There
are three common ways to invoke ggplot:
ggplot(df, aes(x, y, <other aesthetics>))
ggplot(df)
ggplot()
The first method is recommended if all layers use the same data
and the same set of aesthetics, although this method can also be
used to add a layer using data from another data frame. See the
first example below.
The second method specifies the default data frame to use for the
plot, but no aesthetics are defined up front. This is useful when one
data frame is used predominantly as layers are added, but the
aesthetics may vary from one layer to another.
The third method initializes a skeleton ggplot object which is
fleshed out as layers are added. This method is useful when
multiple data frames are used to produce different layers, as is
often the case in complex graphics.
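The three invocation styles can be sketched as below. This assumes the ggplot2 package is installed; the mtcars dataset, the chosen geoms, and the helper data frame cyl_means are illustrative assumptions rather than part of the original slides.

```r
library(ggplot2)

# Method 1: data and aesthetics set once and shared by every layer.
p1 <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Method 2: default data frame only; each layer supplies its own aesthetics.
p2 <- ggplot(mtcars) +
  geom_point(aes(wt, mpg)) +
  geom_rug(aes(x = wt))

# Method 3: empty skeleton; each layer brings its own data frame.
# cyl_means holds the mean weight and mpg per cylinder count.
cyl_means <- aggregate(cbind(wt, mpg) ~ cyl, data = mtcars, FUN = mean)
p3 <- ggplot() +
  geom_point(data = mtcars, aes(wt, mpg)) +
  geom_point(data = cyl_means, aes(wt, mpg), colour = "red", size = 4)

# Plots assigned to variables are only drawn when printed.
print(p1)
print(p2)
print(p3)
```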