4. Agenda
• Data analysis
• Capturing GitHub data
• Using R in Data Analysis
– R basics
– Data exploration & processing
– I/O operations
• Azure ML:
– Datasets
– Experiments
5. Data analysis process
• Raw Data
• Processed Data
• Data Analysis & Visualization
• Exploratory Data Analysis
• Data Capture
15. Why R?
• Ross Ihaka & Robert Gentleman
• Name:
– First letter of the creators' first names
– Play on the name of S
– S-PLUS – commercial alternative
• Open source
• A leading language for statistical computing
17. R Environment
• R project
– console environment
– http://www.r-project.org/
• IDE
– Any editor
– RStudio: http://www.rstudio.com/products/rstudio/download/
32. Task: Reading Pull Requests
1. Read the file line by line and extract only pull request events
2. Extract id and language information
3. Count and visualise language distribution
Data: 1h GitHub Archive Events from 01-01-2015, 3 PM
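
The three steps above can be sketched in R roughly as follows. This is a minimal sketch, not the workshop's own solution: it assumes the jsonlite package, assumes the language sits under payload$pull_request$base$repo as in the GitHub Events API, and uses a handful of made-up sample lines in place of the real archive file.

```r
library(jsonlite)

# Made-up sample lines standing in for the hourly archive file.
# With the real data one would read the downloaded file instead, e.g.
# readLines(gzfile("2015-01-01-15.json.gz")) (file name is an assumption).
lines <- c(
  '{"type":"PullRequestEvent","payload":{"pull_request":{"base":{"repo":{"id":1,"language":"R"}}}}}',
  '{"type":"PushEvent","repo":{"id":2},"payload":{}}',
  '{"type":"PullRequestEvent","payload":{"pull_request":{"base":{"repo":{"id":3,"language":"Python"}}}}}',
  '{"type":"PullRequestEvent","payload":{"pull_request":{"base":{"repo":{"id":4,"language":null}}}}}',
  '{"type":"PullRequestEvent","payload":{"pull_request":{"base":{"repo":{"id":5,"language":"R"}}}}}'
)

# Step 1: parse each line and keep only pull request events.
events <- lapply(lines, fromJSON)
pulls  <- Filter(function(e) identical(e$type, "PullRequestEvent"), events)

# Step 2: extract repository id and language (NA when no language is given).
get_id <- function(e) as.numeric(e$payload$pull_request$base$repo$id)
get_language <- function(e) {
  lang <- e$payload$pull_request$base$repo$language
  if (is.null(lang) || is.na(lang)) NA_character_ else as.character(lang)
}
pull_requests <- data.frame(
  id       = vapply(pulls, get_id, numeric(1)),
  language = vapply(pulls, get_language, character(1)),
  stringsAsFactors = FALSE
)

# Step 3: count languages (table() drops NA by default) and plot them.
language_counts <- sort(table(pull_requests$language), decreasing = TRUE)
barplot(language_counts, las = 2, main = "Pull requests by language")
```

Parsing every line with fromJSON is slow on a full hourly file but keeps the filtering exact; a faster variant might pre-filter lines with grepl before parsing.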
45. Language information
• Active repositories – Create, Push and PullRequest events
• Missing language information:
– Google BigQuery
– GitHub API
• Process various data sources
47. Different sources of data
• GitHub Archive:
– id,
– URL of the form: https://api.github.com/repos/:name
– language (in rare cases)
• Google BigQuery:
– no id,
– URL of the form: https://github.com/:name
– language
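
Because the two sources write the repository URL differently, one of them has to be rewritten before the URLs can serve as a common key. A minimal sketch in base R, where normalise_url and the repository name are hypothetical and used only for illustration:

```r
# Hypothetical repository name, used purely for illustration.
archive_url  <- "https://api.github.com/repos/some-user/some-repo"
bigquery_url <- "https://github.com/some-user/some-repo"

# Rewrite the GitHub Archive (API) form into the BigQuery (web) form.
# URLs that do not start with the API prefix are returned unchanged.
normalise_url <- function(url) {
  sub("^https://api\\.github\\.com/repos/", "https://github.com/", url)
}

normalise_url(archive_url) == bigquery_url
```

Since sub is vectorised, the same function can be applied to a whole column of URLs at once.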
48. Task: Reading Active Repositories
1. Read the file line by line and extract only create, push and pull request events
2. Extract id and url information
3. Read Google BigQuery data from saved file
4. Combine repositories data and Google data based on the same url and fill in missing language information
5. Count and visualise language distribution
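
Steps 4 and 5 amount to a left join on the URL followed by filling the gaps. A minimal sketch in base R, assuming both data frames already use the same URL form; the column names, repository URLs and languages below are made up for illustration:

```r
# Made-up repositories from the archive: language is mostly missing.
repos <- data.frame(
  id       = c(1, 2, 3),
  url      = c("https://github.com/a/x", "https://github.com/b/y", "https://github.com/c/z"),
  language = c("R", NA, NA),
  stringsAsFactors = FALSE
)

# Made-up BigQuery export: no id, but url and language.
bigquery <- data.frame(
  url      = c("https://github.com/b/y", "https://github.com/c/z"),
  language = c("Python", "Java"),
  stringsAsFactors = FALSE
)

# Step 4: left join on url, keeping every archive repository.
combined <- merge(repos, bigquery, by = "url", all.x = TRUE,
                  suffixes = c(".archive", ".bq"))

# Prefer the archive language and fall back to BigQuery where it is missing.
combined$language <- ifelse(is.na(combined$language.archive),
                            combined$language.bq,
                            combined$language.archive)
combined <- combined[, c("id", "url", "language")]

# Step 5: count and plot the language distribution.
language_counts <- table(combined$language)
barplot(language_counts, las = 2, main = "Active repositories by language")
```

Using all.x = TRUE keeps repositories that BigQuery does not know about; their language simply stays NA and is left out of the counts.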
64. Task: Gather & Save Week Data
1. Read files line by line and count push events for every day
2. Put the retrieved data into a data frame
3. Save the data to a CSV file
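
One possible shape for this task is sketched below. It assumes the jsonlite package and one newline-delimited JSON file per day; the helper count_push_events, the file names and the tiny sample files are hypothetical stand-ins for the real archive data.

```r
library(jsonlite)

# Step 1 helper: read one file line by line and count its push events.
count_push_events <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0
  while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    if (identical(fromJSON(line)$type, "PushEvent")) n <- n + 1
  }
  n
}

# Hypothetical stand-in files: two tiny "days" written to a temp directory.
data_dir <- tempfile("week")
dir.create(data_dir)
writeLines(c('{"type":"PushEvent"}', '{"type":"WatchEvent"}', '{"type":"PushEvent"}'),
           file.path(data_dir, "2015-01-01.json"))
writeLines('{"type":"PushEvent"}', file.path(data_dir, "2015-01-02.json"))

files <- list.files(data_dir, pattern = "\\.json$", full.names = TRUE)

# Step 2: one row per day in a data frame.
week <- data.frame(
  day    = sub("\\.json$", "", basename(files)),
  pushes = vapply(files, count_push_events, numeric(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# Step 3: save the result as CSV.
csv_path <- file.path(data_dir, "week.csv")
write.csv(week, csv_path, row.names = FALSE)
```

Reading one line at a time keeps memory use flat, which matters once the files are full-size archive dumps rather than toy samples.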
68. Exercise: Analyse the week
1. Read activity data from file.
2. Define a new column for the part of the day.
3. Calculate the mean number of pushes for every part of the day.
4. Compare and visualize data for Monday, Wednesday and Friday.
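
A minimal base R sketch of the exercise, under stated assumptions: the activity data has the columns day, hour and pushes, the day is split into four six-hour parts, and the push counts are placeholder values rather than real measurements.

```r
# Step 1: with real data this would be something like read.csv("week.csv")
# (hypothetical file name). Here a placeholder data set is generated instead.
activity <- expand.grid(hour = 0:23,
                        day  = c("Monday", "Tuesday", "Wednesday", "Friday"),
                        stringsAsFactors = FALSE)
activity$pushes <- 100 + 10 * activity$hour  # placeholder values, not real counts

# Step 2: new column with the part of the day (the boundaries are an assumption).
activity$part <- cut(activity$hour,
                     breaks = c(0, 6, 12, 18, 24),
                     labels = c("night", "morning", "afternoon", "evening"),
                     right  = FALSE)

# Keep only the days to compare, in weekday order.
days     <- c("Monday", "Wednesday", "Friday")
selected <- activity[activity$day %in% days, ]
selected$day <- factor(selected$day, levels = days)

# Step 3: mean number of pushes per day and part of the day (3 x 4 matrix).
means <- tapply(selected$pushes, list(selected$day, selected$part), mean)

# Step 4: grouped bar chart, one group per part of the day, one bar per day.
barplot(means, beside = TRUE, legend.text = rownames(means),
        ylab = "Mean pushes", main = "Pushes by part of the day")
```

With right = FALSE each interval includes its lower bound, so hour 6 counts as morning and hour 23 as evening.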