This document discusses using visualization techniques to summarize and gain insights from large datasets, or "big data". It provides three examples:
1. A "tableplot" was used to visualize census data from 17 million records sorted by age to check data quality.
2. A "heat map" showed age versus income from a social security register containing 20 million monthly records to obtain insights.
3. Sentiment analysis of social media data was visualized at different granularities from daily to monthly and found to correlate highly (0.88) with consumer confidence surveys.
The document concludes that big data is an interesting source for official statistics and visualization is an effective way to explore and validate such large datasets.
Big Data Visualization
1. Edwin de Jonge, December 3, 2013
“Turning Statistics into Knowledge”, Aguascalientes
With thanks to Piet Daas, Martijn Tennekes and Alex Priem
2. Overview
• Big Data
• Research ‘theme’ at Statistics Netherlands
• Data-driven approach
• Visualization as a tool
• Why?
• Examples in our office
• Census
• Social Security
• Social Media
• Not shown: Traffic loops, Mobile phone data
6. Anscombe’s quartet
Property                                  Value
Mean of x1, x2, x3, x4                    All equal: 9
Variance of x1, x2, x3, x4                All equal: 11
Mean of y1, y2, y3, y4                    All equal: 7.50
Variance of y1, y2, y3, y4                All equal: 4.1
Correlation for ds1, ds2, ds3, ds4        All equal: 0.816
Linear regression for ds1, ds2, ds3, ds4  All equal: y = 3.00 + 0.500x
Looks the same, right?
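The summary values in the table can be checked with a few lines of code. The sketch below is an illustration, not part of the original slides: the data points are the values published by Anscombe (1973), which the slides do not list, and the helper names (`mean`, `variance`, `summarize`) are made up for this example. Variance is computed as the sample variance (dividing by n - 1).

```python
from math import sqrt

# Anscombe's (1973) published data; not listed on the slides themselves.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "ds1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "ds2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "ds3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "ds4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Sample variance (divides by n - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def summarize(x, y):
    """Return the summary statistics shown in the table for one dataset."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx  # least-squares slope of y on x
    return {
        "mean_x": mx,
        "var_x": variance(x),
        "mean_y": my,
        "var_y": variance(y),
        "corr": sxy / sqrt(sxx * syy),  # Pearson correlation
        "intercept": my - slope * mx,
        "slope": slope,
    }


summaries = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four printed rows agree to about two decimal places, which is why the numbers alone cannot tell the datasets apart; only a plot shows how different they are.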
12. Example Virtual Census
‐ Every 10 years a Census needs to be conducted
‐ No longer with surveys in the Netherlands
• Last traditional census was in 1971
‐ Now by (re-)using existing information
• Linking administrative sources and available sample survey data at a large scale
• Check result
• How?
• With a visualisation method: the Tableplot
13. Making the Tableplot
1. Load file (17 million records)
2. Sort records according to key variable (17 million records)
• Age in this example
3. Combine records into 100 groups (170,000 records each)
• Numeric variables
• Calculate average (avg. age)
• Categorical variables
• Ratio between categories present (male vs. female)
4. Plot figure of a selected number of variables
• Colours used are important (up to 12)
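Steps 2 and 3 above (sort, then summarise in 100 groups) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the census file: the function name `tableplot_bins`, the record layout and the synthetic test data are all hypothetical, and the plotting step (step 4) is left out.

```python
import random
from collections import Counter


def tableplot_bins(records, sort_key, numeric_vars, categorical_vars, n_bins=100):
    """Sort records by sort_key and summarise them in n_bins row groups."""
    ordered = sorted(records, key=lambda r: r[sort_key])
    n = len(ordered)
    bins = []
    for b in range(n_bins):
        # Integer arithmetic gives groups whose sizes differ by at most one.
        group = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if not group:
            continue
        summary = {"size": len(group)}
        # Numeric variables: average per group.
        for var in numeric_vars:
            summary[var] = sum(r[var] for r in group) / len(group)
        # Categorical variables: share of each category present in the group.
        for var in categorical_vars:
            counts = Counter(r[var] for r in group)
            summary[var] = {cat: c / len(group) for cat, c in counts.items()}
        bins.append(summary)
    return bins


# Synthetic stand-in data (10,000 made-up records instead of 17 million).
random.seed(0)
records = [
    {"age": random.randint(0, 99), "sex": random.choice(["male", "female"])}
    for _ in range(10_000)
]
bins = tableplot_bins(records, "age", ["age"], ["sex"])
print(bins[0])
```

Each entry in `bins` corresponds to one row of the tableplot: a bar length for every numeric variable and a stacked bar of category shares for every categorical variable.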
14.
15. October 1st 2013, Statistics Netherlands tableplot of the census test file
16. Tableplot: Monitor data quality
– All data in the office passes through these stages:
‐ Raw data (collected)
‐ Preprocessed (technically correct)
‐ Edited (completed data)
‐ Final (removal of outliers etc.)
19. Social Security Register
– Contains all financial data on jobs, benefits and pensions in the Netherlands
‐ Collected by the Dutch Tax office
‐ A total of 20 million records each month
‐ How to obtain insight into so much data?
• With a visualisation method: a heat map
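A heat map of this kind boils down to counting records per cell of a two-dimensional grid, after which only the grid (not the 20 million records) has to be drawn. The sketch below shows that counting step under assumptions of my own: the function name `heatmap_counts`, the bin widths (5 years, 5,000 euro) and the synthetic data are illustrative and do not come from the register.

```python
import random


def heatmap_counts(pairs, x_min, x_width, n_x, y_min, y_width, n_y):
    """Count (x, y) pairs per grid cell; pairs outside the grid are skipped."""
    counts = [[0] * n_y for _ in range(n_x)]
    for x, y in pairs:
        i = int((x - x_min) // x_width)  # column index along the x axis
        j = int((y - y_min) // y_width)  # row index along the y axis
        if 0 <= i < n_x and 0 <= j < n_y:
            counts[i][j] += 1
    return counts


# Synthetic (age, income) pairs; the real register is not reproduced here.
random.seed(1)
pairs = [(random.randint(15, 94), random.random() * 100_000) for _ in range(50_000)]

# Assumed grid: ages 15-94 in 5-year bins, income 0-100,000 euro in 5,000-euro bins.
counts = heatmap_counts(pairs, x_min=15, x_width=5, n_x=16,
                        y_min=0, y_width=5_000, n_y=20)
print(sum(map(sum, counts)), "records binned into", 16 * 20, "cells")
```

The resulting `counts` grid can be handed to any plotting library as an image, with cell colour encoding the count.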
20. Heat map: Age vs. ‘Income’
Axes: Age, Income (euro)
23. Daily Sentiment in Dutch Social Media
Social media: daily sentiment in Dutch messages
24. Granularity: From day to week
Social media: daily & weekly sentiment in Dutch messages
25. Granularity: From day to month
Social media: daily, weekly & monthly sentiment in Dutch messages
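Moving from daily to weekly or monthly granularity is a group-and-average step. The sketch below assumes the coarser series is the plain mean of the daily values in each ISO week or calendar month; the slides do not say how the aggregation was actually done, and the function names and the synthetic daily series are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def aggregate(daily, period_key):
    """Average daily values per period; period_key maps a date to a period."""
    groups = defaultdict(list)
    for day, value in daily.items():
        groups[period_key(day)].append(value)
    return {period: sum(vals) / len(vals) for period, vals in sorted(groups.items())}


def iso_week(day):
    iso = day.isocalendar()
    return (iso[0], iso[1])  # (ISO year, ISO week number)


def month(day):
    return (day.year, day.month)


# Synthetic daily "sentiment" for the first 90 days of 2013 (made-up values).
daily = {date(2013, 1, 1) + timedelta(days=i): float(i % 7) for i in range(90)}

weekly = aggregate(daily, iso_week)
monthly = aggregate(daily, month)
print(len(daily), "days ->", len(weekly), "weeks ->", len(monthly), "months")
```

Coarser periods average out day-to-day noise, which is what makes the monthly series comparable with a monthly survey.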
26. Enter: Consumer confidence!
Social media: monthly sentiment in Dutch messages & Consumer confidence
Corr: 0.88
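A correlation like this requires both series on the same monthly grid before Pearson's r is computed. The sketch below shows that alignment step with made-up numbers; it does not reproduce the 0.88 reported on the slide, and the names `align_by_month` and `pearson` are hypothetical.

```python
from math import sqrt


def align_by_month(series_a, series_b):
    """Keep only the months present in both series, in chronological order."""
    shared = sorted(set(series_a) & set(series_b))
    return [series_a[m] for m in shared], [series_b[m] for m in shared]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up monthly values keyed by (year, month); illustration only.
sentiment = {(2013, 1): 1.0, (2013, 2): 2.0, (2013, 3): 3.0, (2013, 4): 5.0}
confidence = {(2013, 2): -30.0, (2013, 3): -28.0, (2013, 4): -24.0, (2013, 5): -20.0}

x, y = align_by_month(sentiment, confidence)
r = pearson(x, y)
print(len(x), "shared months, r =", round(r, 2))
```

Months that appear in only one of the two series are dropped, so the coefficient is based on the overlapping period alone.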
27. Conclusions
Big data is a very interesting data source for official statistics
Visualisation is a great way of getting/creating insight
Not only for data exploration, but also for finding errors