6. Benefits, Drawbacks & Facts
Benefits
● No Licence Cost
● Huge amount of knowledge in the community
● High speed of innovation
● Funny names
Drawbacks
● Overwhelming choices
● Varying maturity
● Skills challenge (for newer projects)
Facts of Life
● Professional Services / Support not free
8. Popular Data Products
Google Flights (not a booking engine!)
CIA World Fact Book (simple presentation)
Inside AirBnB (“activist”)
data.gov.uk
10. The Data Process
1. Obtain data
2. Explore & clean data
3. Analyse & model
4. Visualise
5. Productionise & automate Data Pipeline
a. How and where to distribute?
b. How to scale?
c. How to secure?
d. How to manage day-to-day?
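Steps 1 to 4 above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the `listings` data frame and its values are invented stand-ins for data that would normally be downloaded (for example with read.csv()), and step 5 is not covered because it depends on the target environment.

```r
# Step 1 (obtain): hypothetical in-memory stand-in for a downloaded file.
listings <- data.frame(
  neighbourhood    = c("Camden", "Camden", "Hackney", "Hackney", "Hackney", NA),
  price            = c(80, 120, 60, NA, 95, 70),
  availability_365 = c(200, 365, 30, 90, 0, 150)
)

# Step 2 (explore & clean): inspect the structure, then drop incomplete rows.
str(listings)
clean <- listings[complete.cases(listings), ]

# Step 3 (analyse): mean price per neighbourhood.
mean_price <- aggregate(price ~ neighbourhood, data = clean, FUN = mean)
print(mean_price)

# Step 4 (visualise): a quick bar chart of the result.
barplot(mean_price$price,
        names.arg = mean_price$neighbourhood,
        main = "Mean price per neighbourhood",
        ylab = "Price")
```

Dropping incomplete rows is the simplest cleaning choice; imputing missing values may be more appropriate for real data.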
12. Using ggplot2 for exploratory graphs
library(ggplot2)  # provides qplot()
# Assumes a data frame `host` with an availability_365 column
qplot(host$availability_365,
      geom = "histogram",
      binwidth = 5,
      main = "Histogram for Availability",
      xlab = "AirBnB in London",
      fill = I("blue"))
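qplot() has been deprecated since ggplot2 3.4.0, so the same histogram may be better written with ggplot(). The sketch below assumes the ggplot2 package is installed; the `host` data frame and its values are a made-up stand-in for the real listings data.

```r
library(ggplot2)

# Hypothetical stand-in for the `host` data frame used on the slide.
host <- data.frame(
  availability_365 = c(0, 12, 45, 90, 180, 200, 300, 365, 365)
)

# Same histogram as the qplot() call, expressed as layers.
p <- ggplot(host, aes(x = availability_365)) +
  geom_histogram(binwidth = 5, fill = "blue") +
  labs(title = "Histogram for Availability", x = "AirBnB in London")

print(p)
```

The layered form makes it easier to add further geoms or facets during exploration.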
13. Statistical Analysis
SIMPLE
● Sum, Count, Mean / Median
● Variance / Standard Deviation
E.g. Average Revenue per User per Neighbourhood (by Month of the Year)
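The example metric above could be computed roughly as follows in base R. The `bookings` data frame, its column names and its values are hypothetical; real data would need its own definition of "revenue" and "user".

```r
# Hypothetical booking records.
bookings <- data.frame(
  user          = c("u1", "u1", "u2", "u3", "u3", "u4"),
  neighbourhood = c("Camden", "Camden", "Camden", "Hackney", "Hackney", "Hackney"),
  month         = c(1, 1, 1, 1, 2, 2),
  revenue       = c(100, 50, 90, 60, 80, 40)
)

# Total revenue per user within each neighbourhood and month.
per_user <- aggregate(revenue ~ user + neighbourhood + month,
                      data = bookings, FUN = sum)

# Average revenue per user, per neighbourhood and month.
arpu <- aggregate(revenue ~ neighbourhood + month,
                  data = per_user, FUN = mean)
print(arpu)

# Other simple statistics on the raw revenue column.
print(median(bookings$revenue))
print(var(bookings$revenue))
print(sd(bookings$revenue))
```

Summing per user first and averaging second keeps users with many bookings from being counted more than once in the average.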
MORE COMPLEX
● Clustering
● Covariance matrix (dependencies between variables)
● Predictive Models
● Machine Learning
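A minimal sketch of three of the more complex techniques, using only functions that ship with base R (cov(), kmeans(), lm()). The data is invented for illustration, and a linear model is just one possible choice of predictive model.

```r
set.seed(42)  # kmeans() uses random starting centres

# Hypothetical listings: two visibly different groups.
listings <- data.frame(
  price            = c(50, 55, 60, 65, 200, 210, 220, 230),
  availability_365 = c(300, 310, 290, 320, 40, 50, 30, 60),
  reviews          = c(80, 75, 90, 85, 10, 12, 8, 15)
)

# Covariance matrix: how the variables move together.
cov_matrix <- cov(listings)
print(cov_matrix)

# Clustering: k-means on standardised columns, two clusters assumed.
clusters <- kmeans(scale(listings), centers = 2, nstart = 10)
print(clusters$cluster)

# Predictive model: linear regression of price on the other columns.
model <- lm(price ~ availability_365 + reviews, data = listings)
new_listing <- data.frame(availability_365 = 100, reviews = 30)
print(predict(model, newdata = new_listing))
```

Scaling before clustering stops the column with the largest numeric range from dominating the distance calculation.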
14. Big Data Architectures (simplified)
“Big” Database | Hadoop Cluster / File System
Query Engine (Data Access)
Execution Engine (Business Logic)
Search Engine (Accessibility)
Visualisation Layer
17. Interactive Notebooks
New breed of software to work interactively on data
Spark/Scala Notebook
Apache Zeppelin
Databricks: cloud (proprietary but built on Spark)