Opportunities and
methodological challenges of
Big Data for official statistics
Dr. Piet J.H. Daas
Methodologist, Big Data research coordinator
March 31, Rome
Overview
• Big Data
• Definition?
• DGINS: Scheveningen Memorandum
• Experiences at Statistics Netherlands
• From ‘New data sources’ to ‘Big Data’
• Data driven approach (learning by doing)
• Opportunities & challenges
• Methodological & technical challenges
• Skills, legal and other issues
• With examples!
– Data, data everywhere!
What is Big Data?
Defining Big Data is not easy:
An attempt: “Data that are difficult to collect, store or process within the conventional
systems of statistical organizations. Either, their volume, velocity, structure
or variety requires the adoption of new statistical software processing
techniques and/or IT infrastructure to enable cost-effective insights to be
made.” (Virtual sprint paper)
More technical: “Big Data usually includes data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage, and process the
data within a tolerable elapsed time.” (Wikipedia)
A user: “Data sources that are awkward to work with.”
TIP: Big Data sources are NOT surveys and NOT administrative data
DGINS: Scheveningen Memorandum
1. Big Data represent new opportunities and challenges for Official Statistics.
2. Develop an 'Official Statistics Big Data strategy' at national and EU-level.
3. Recognize the implications of Big Data for legislation, especially with
regard to data protection and personal rights.
4. Several NSIs are currently initiating or considering different uses of Big
Data. There is momentum to share experiences and to collaborate.
5. Recognize the necessary capabilities and skills to effectively explore Big
Data.
6. Acknowledge that the multidisciplinary character requires synergies and
partnerships.
7. The use of Big Data in the context of official statistics requires new
developments in methodology, quality assessment and IT related issues.
8. Agree on adopting an ESS action plan and roadmap by mid-2014
Experiences at Statistics Netherlands
– Started as ‘New data sources for statistics’ in 2009
– Several initiatives over the years:
‐ Internet as a data source
• Collecting price data with web robots
• Study the use of web job vacancies data
• ‘Marktplaats’ data (Dutch eBay clone)
‐ Alternative means of collecting primary data
• Use of smartphones
‐ Big Data (really large amounts of data)
• Traffic loop detection data (road sensors)
• Mobile phone data (location data)
• Social media data (content and sentiment)
Opportunities & challenges
What have we learned (so far)?
I’ll discuss the most important ones:
1) Types of ‘data’ in Big Data
2) How to access and analyse large amounts of data
3) How to deal with noisy and unstructured data
4) How to deal with selectivity (and our own bias)
5) How to go beyond correlation
6) The need for people with the right skills and mind‐set
7) Need to solve/deal with privacy and security issues
8) Data management & costs
We are slowly starting to get a grip on some of these topics
1) Types of data
[Figure: secondary data versus primary data]
1) Types of data & events
There are many different Big Data sources.
An attempt to classify them (Virtual sprint paper):
A) Human-sourced information (‘Social Networks’)
Social media messages, blogs, web searches
B) Process-mediated data (‘Traditional Business Systems and Websites’)
Credit card, bank or on-line transactions, CDR, product prices,
page-views
C) Machine-generated data (‘Automated Systems’)
Road or climate sensors, satellite images, GPS, AIS
Essentially, most of the data are event-based, and some of these events can be
directly related to a user (e.g. a member of the target population).
2) How to access and analyse large amounts of data
– If you want to analyse Big Data
– You need a lot of computer power!!
– Or you need a lot of time!
High Performance Computing expertise is essential!
– We have:
- Workstations with lots of memory (32-64 GB), fast disk drives (SSD, 512
GB) and a large hard drive (>= 1 TB)
- A secure environment in which to access the data with those computers
- A Big Data lab
- The knowledge to load all the data into R or Python and analyse it there
- Followed a High Performance Computing training course
- Realized that learning by doing is key! (?databases?)
AND a Big Data source with no privacy and security issues so we can test all
kinds of analysis, soft- and hardware (anyplace, anytime, anywhere)
• Traffic loop data (road sensors)
Our current equipment and more
An example:
– Processing of traffic loop data of 1 day
- A total of ~100 million records (25 GB)
I/O limitation can be solved by:
1) Input part by using a cluster (distributed computing)
2) Output part by implementing a C++ write routine in R (20% faster)

Processing in R                        Time needed   Speed-up
First R-script                         6 hours       -
Improved code                          30 min        12
Faster hardware (Java code)            10 min        36
Faster hardware + preprocessed data    2 min         180

Limited by I/O
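The speed-up column is the runtime of the first R script divided by each later runtime. A minimal sketch of that arithmetic, using the times reported in the table; the variable names are illustrative and not taken from the original work.

```python
# Runtimes in minutes, as reported in the processing table.
runtimes_min = {
    "First R-script": 6 * 60,
    "Improved code": 30,
    "Faster hardware (Java code)": 10,
    "Faster hardware + preprocessed data": 2,
}

baseline = runtimes_min["First R-script"]

# Speed-up factor = baseline runtime / runtime of the step.
speedups = {step: baseline / minutes for step, minutes in runtimes_min.items()}

for step, factor in speedups.items():
    print(f"{step}: {factor:.0f}x")
```

The last steps no longer gain much from faster code alone, which is consistent with the remark that processing becomes limited by I/O.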
All Dutch vehicles in September
3) How to deal with noisy and unstructured data
– Big Data is often
‐ noisy, dirty
‐ redundant
‐ unstructured
• e.g. texts, images
– How to extract information from Big Data?
‐ In the best/most efficient way
Example of noisy data: Road sensors
Traffic loop data
‐ Each minute (24/7) the number of passing vehicles is
counted in around 20,000 ‘loops’ in the Netherlands
• Total and in different length classes
‐ Nice data source for transport and traffic statistics
(and more)
• A lot of data, around 100 million records a day
[Figure: loop locations]
Total number of vehicles during the day
[Figure: number of vehicles versus time (hour)]
Correct for missing data: macro level
Sliding window of 5 min. Impute missing data.
Before: total = ~295 million vehicles
After: total = ~330 million vehicles (+12%)
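A sliding-window imputation of this kind could look roughly as follows. This is only a sketch: the slide does not say how the 5-minute window is positioned or which statistic fills the gap, so the centred window and the mean of the observed neighbours are assumptions, and the function name is hypothetical.

```python
from typing import List, Optional


def impute_sliding_window(
    counts: List[Optional[float]], window: int = 5
) -> List[Optional[float]]:
    """Fill missing per-minute counts with the mean of the observed values
    inside a centred window (assumed: 5 minutes, i.e. 2 minutes either side)."""
    half = window // 2
    filled: List[Optional[float]] = []
    for i, value in enumerate(counts):
        if value is not None:
            filled.append(value)
            continue
        lo, hi = max(0, i - half), min(len(counts), i + half + 1)
        # Only originally observed values are used, never earlier imputations.
        neighbours = [v for v in counts[lo:hi] if v is not None]
        filled.append(sum(neighbours) / len(neighbours) if neighbours else None)
    return filled


# Made-up minute counts for one loop; None marks a missing minute.
minute_counts = [12, 14, None, 13, 15, None, None, None, None, None, 11]
imputed = impute_sliding_window(minute_counts)
print(imputed)
```

Gaps longer than the window stay missing in the middle, which is one reason a model-based correction at micro level is useful as well.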
Correct for missing data: micro level
[Figure: number of vehicles detected versus time (min.)]
Recursive Bayesian estimator (<1 sec on GPGPU)
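The slide does not specify which recursive Bayesian estimator was used. As an illustration of the general idea only, the sketch below uses a scalar Kalman filter with a random-walk model, one of the simplest estimators of this family; the noise variances, the prior and the function name are all assumptions.

```python
from typing import List, Optional, Tuple


def recursive_estimate(
    observations: List[Optional[float]],
    process_var: float = 1.0,   # assumed variance of minute-to-minute change
    obs_var: float = 4.0,       # assumed measurement noise variance
    init_mean: float = 0.0,
    init_var: float = 1e6,      # vague prior: we know little before the first count
) -> List[Tuple[float, float]]:
    """Return (mean, variance) of the estimated vehicle count for every minute."""
    mean, var = init_mean, init_var
    estimates = []
    for y in observations:
        # Predict step: random-walk model, so uncertainty grows each minute.
        var += process_var
        if y is not None:
            # Update step: combine prediction and observation (Gaussian Bayes rule).
            gain = var / (var + obs_var)
            mean += gain * (y - mean)
            var *= 1.0 - gain
        # For a missing minute the prediction itself is the estimate.
        estimates.append((mean, var))
    return estimates


# Made-up counts for one loop; None marks a minute without a measurement.
estimates = recursive_estimate([10, None, 10, 12, None, None, 11])
for minute, (m, v) in enumerate(estimates):
    print(minute, round(m, 2), round(v, 2))
```

Each minute needs only the previous estimate, so the work per loop is small and independent of the other loops, which is what makes this kind of estimator a natural fit for parallel hardware.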
4) How to deal with selectivity
– Big Data sources may be selective when
- Only part of the population contributes to the data set
• For example: mobile phone owners
- The measurement mechanism is selective (e.g. non-random times or
places)
• For example: placing of road sensors on Dutch highways is not random
– Many Big Data sources contain events
- Population units may generate widely varying numbers of events
- Attempt to associate events with units
– Correcting for selectivity
- Background characteristics – or features – are needed (linking with
registers; profiling)
- Use predictive modelling / machine learning to produce population
estimates
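To make the role of background characteristics concrete, here is a small sketch of post-stratification, a much simpler correction than the predictive modelling mentioned above. It assumes every unit in the Big Data source has been linked to a stratum (e.g. an age class from a register); the data, names and strata are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def poststratified_mean(
    units: List[Tuple[str, float]], population_counts: Dict[str, int]
) -> float:
    """Weight each stratum mean from the source by its known population share."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for stratum, value in units:
        sums[stratum] += value
        counts[stratum] += 1

    total_pop = sum(population_counts.values())
    estimate = 0.0
    for stratum, pop in population_counts.items():
        if counts[stratum] == 0:
            raise ValueError(f"no observed units in stratum {stratum!r}")
        estimate += (pop / total_pop) * (sums[stratum] / counts[stratum])
    return estimate


# Hypothetical source in which young people are heavily over-represented.
observed = [("young", 10.0)] * 8 + [("old", 2.0)] * 2
register = {"young": 300, "old": 700}   # assumed register counts per stratum

naive = sum(value for _, value in observed) / len(observed)
adjusted = poststratified_mean(observed, register)
print(naive, adjusted)
```

The naive mean follows the composition of the source, while the adjusted mean follows the composition of the population. The correction only works if the strata capture what drives the selectivity and every stratum is observed.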
Profiling: social media
Selectivity illustrated
Selectivity of Big Data could potentially be less problematic than high
non-response rates of surveys.
- There is just more data for your model!
The black line shows the relationship between the target and auxiliary
variable in the target population. The red lines show the estimated
relationship according to each of the three sources (with 95% confidence
intervals).
Here we assume units with auxiliary variables are available!
5) How to go beyond correlation
– You will very likely use correlation to check Big Data findings
against those in other (survey) data
– When correlation is high:
1) Try falsifying it first (is it coincidental?)
correlation ≠ causation
2) If this fails, you may have found something interesting!
3) Perform additional analysis (look for causality):
cointegration, Granger causality, time‐series approach, etc.
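A first exploratory step beyond a single correlation coefficient is to check at which time shift two series correlate best. The sketch below computes lagged Pearson correlations in plain Python; it is not a Granger causality or cointegration test, only a quick look at which series seems to lead. The function names and the series are made up.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)   # undefined for a constant series


def lagged_correlations(
    x: Sequence[float], y: Sequence[float], max_lag: int
) -> Dict[int, float]:
    """Correlate x[t] with y[t + lag]; a peak at a positive lag suggests x leads y."""
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = x[: len(x) - lag], y[lag:]
        else:
            xs, ys = x[-lag:], y[: len(y) + lag]
        result[lag] = pearson(xs, ys)
    return result


# Made-up example: 'follower' repeats 'leader' one period later.
leader = [1, 3, 2, 5, 4, 6, 8, 7, 9, 10]
follower = [0] + leader[:-1]

correlations = lagged_correlations(leader, follower, max_lag=2)
best_lag = max(correlations, key=correlations.get)
print(best_lag, round(correlations[best_lag], 3))
```

A clear peak at a non-zero lag is a reason to continue with formal tests such as Granger causality; on its own it says nothing about causation.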
Example: Sentiment in social media (day/week/month)
Platform specific sentiment
Table 1. Social media message properties for various platforms and their correlation with consumer confidence

Social media platform    Number of social    Number of messages as      Correlation coefficient of monthly sentiment
                         media messages¹     percentage of total (%)    index and consumer confidence (r)²
All platforms combined   3,153,002,327       100                        0.75     0.78
Facebook                 334,854,088         10.6                       0.81*    0.85*
Twitter                  2,526,481,479       80.1                       0.68     0.70
Hyves                    45,182,025          1.4                        0.50     0.58
News sites               56,027,686          1.8                        0.37     0.26
Blogs                    48,600,987          1.5                        0.25     0.22
Google+                  644,039             0.02                       -0.04    -0.09
LinkedIn                 565,811             0.02                       -0.23    -0.25
YouTube                  5,661,274           0.2                        -0.37    -0.41
Forums                   134,984,938         4.3                        -0.45    -0.49

¹ Period covered: June 2010 until November 2013
² Confirmed by visually inspecting scatterplots and additional checks (see text)
* Cointegrated
Platform specific results
Granger causality reveals that Consumer Confidence precedes
Facebook sentiment! (p-value < 0.001)
A schematic view
Consumer Confidence
Publication date (~20th)
Social media sentiment
Previous month Current month
Day 1-7 Day 8-14 Day 15-21 Day 22-28
Platform specific results (2)
More detailed studies revealed a 1-week delay between the two!
Consumer confidence comes first, social media sentiment follows
6) People and skills needed
For Big data studies you need:
– People with an open mind‐set who do not see all
problems a priori in terms of sampling theory
– People with programming skills and IT‐affinity
– People with a data‐driven, pragmatic attitude (data
explorers, ‘practitioners’)
‐ You need data scientists!
Data science skills ‘landscape’
Sexy Skills of Data Geeks
1) Statistics - traditional analysis you're used to thinking about
2) Data ‘munging’ - parsing, scraping, and formatting data
3) Visualization - graphs, tools, etc.
4) High Performance Computing knowledge
People that think outside the ‘box’
7) Privacy and security issues
– Dutch privacy and security law allows the study of privacy-sensitive
data for scientific and statistical research
– Of course, appropriate measures always need to be taken
• Prior to new research studies, check privacy sensitivity of data
• In case of privacy-sensitive data:
• Try to anonymize micro data or use aggregates
• Use secure environment: workstations in Big Data lab
– Legal issues that enable the use of Big Data for official statistics
production are currently being looked at
- Some Big Data can be considered ‘Administrative data’, i.e. Big
Data that is managed by a (semi-)governmentally funded organisation
Example: Mobile phones
Mobile phone activity as a data source
– Nearly every person in the Netherlands has a mobile phone
- Usually on them and almost always switched on!
- Many people are very active during the day
– Can data of mobile phones be used for statistics?
- Travel behaviour (of active phones)
- ‘Day time population’ (of active phones)
- Tourism (new phones that register to network)
– Data of a single mobile company was used
- Hourly aggregates per area (only when > 15 events)
- Especially important for roaming data (foreign visitors)
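The rule of only delivering hourly aggregates with more than 15 events can be sketched as below. The event format, area labels and function name are assumptions for illustration; the slide does not describe how the mobile phone company implemented the threshold.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

MIN_EVENTS = 15  # a cell is delivered only when it holds more than 15 events

# An event is reduced to (area, hour); no phone identifier is kept.
Cell = Tuple[str, int]


def hourly_aggregates(
    events: Iterable[Cell], min_events: int = MIN_EVENTS
) -> Dict[Cell, int]:
    """Count events per (area, hour) and drop every cell at or below the threshold."""
    counts = Counter(events)
    return {cell: n for cell, n in counts.items() if n > min_events}


# Made-up events: area_B has exactly 15 events at 9:00 and is suppressed.
events = [("area_A", 9)] * 40 + [("area_B", 9)] * 15 + [("area_A", 10)] * 16
published = hourly_aggregates(events)
print(published)
```

Suppressing small cells protects privacy, but it also means that thinly populated areas and quiet hours are missing from the data, which matters for roaming data in particular.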
‘Day time population’
– Hourly changes of mobile phone activity
– 7 & 8 May 2013
– Per area distinguished
– Only data for areas with > 15 events per hour
Tourism: Roaming during European league final
[Map legend, roaming activity: hardly any, low, medium, high, very high]
8) Costs and data management
– Costs
‐ In the Netherlands we don’t pay for administrative data.
‐ How about Big Data?
• We currently pay for social media (access) and mobile phone
data (extra processing efforts)
– Data management
‐ Who owns the data? Stability of delivery/source
‐ Cope with the huge volume
• Run queries in database of data source holder
• Collect and process it as data stream
• Bulk processing
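Processing a source as a data stream means that records are aggregated while they are read, so the full volume never has to sit in memory. A minimal sketch, assuming a CSV-like feed with hypothetical columns `sensor_id`, `minute` and `vehicles`:

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable


def total_per_sensor(lines: Iterable[str]) -> Dict[str, int]:
    """Sum vehicle counts per sensor, reading one record at a time."""
    totals: Dict[str, int] = defaultdict(int)
    for row in csv.DictReader(lines):
        # Only the running totals are kept; each row is discarded after use.
        totals[row["sensor_id"]] += int(row["vehicles"])
    return dict(totals)


# A tiny in-memory stand-in for a large file or network stream.
sample = io.StringIO(
    "sensor_id,minute,vehicles\n"
    "L1,0,12\n"
    "L2,0,7\n"
    "L1,1,9\n"
)
totals = total_per_sensor(sample)
print(totals)
```

The same function accepts an open file or any other iterable of lines, so the choice between a stream and bulk processing does not change the aggregation logic itself.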
The Future
The future of statistics looks BIG
Thank you for your attention! @pietdaas

Contenu connexe

Tendances

Big data as a source for official statistics
Big data as a source for official statisticsBig data as a source for official statistics
Big data as a source for official statisticsEdwin de Jonge
 
Extracting information from ' messy' social media data
Extracting information from ' messy' social media dataExtracting information from ' messy' social media data
Extracting information from ' messy' social media dataPiet J.H. Daas
 
The impact of Big Data on next generation of smart cities
The impact of Big Data on next generation of smart citiesThe impact of Big Data on next generation of smart cities
The impact of Big Data on next generation of smart citiesPayamBarnaghi
 
A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...IJECEIAES
 
Tools and techniques adopted for big data analytics
Tools and techniques adopted for big data analyticsTools and techniques adopted for big data analytics
Tools and techniques adopted for big data analyticsJOSEPH FRANCIS
 
Beyond dashboards
Beyond dashboardsBeyond dashboards
Beyond dashboardssuresh sood
 
4차 산업혁명 시대의 싱크탱크의 변화(kdi)
4차 산업혁명 시대의 싱크탱크의 변화(kdi)4차 산업혁명 시대의 싱크탱크의 변화(kdi)
4차 산업혁명 시대의 싱크탱크의 변화(kdi)Sungho Lee
 
Arloesiadur: An analytics experiment in innovation policy
Arloesiadur: An analytics experiment in innovation policyArloesiadur: An analytics experiment in innovation policy
Arloesiadur: An analytics experiment in innovation policyJuan Mateos-Garcia
 
eROSA Stakeholder WS1: Big Data and Open Science in agricultural and environm...
eROSA Stakeholder WS1: Big Data and Open Science in agricultural and environm...eROSA Stakeholder WS1: Big Data and Open Science in agricultural and environm...
eROSA Stakeholder WS1: Big Data and Open Science in agricultural and environm...e-ROSA
 
Big DataParadigm, Challenges, Analysis, and Application
Big DataParadigm, Challenges, Analysis, and ApplicationBig DataParadigm, Challenges, Analysis, and Application
Big DataParadigm, Challenges, Analysis, and ApplicationUyoyo Edosio
 
An introduction to Data Mining
An introduction to Data MiningAn introduction to Data Mining
An introduction to Data MiningShobhita Dayal
 
Public Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || ChoenniePublic Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || ChoennieHuman Centered ICT
 
Location, Location, Location: Leveraging Interactive Maps, Administrative and...
Location, Location, Location: Leveraging Interactive Maps, Administrative and...Location, Location, Location: Leveraging Interactive Maps, Administrative and...
Location, Location, Location: Leveraging Interactive Maps, Administrative and...soder145
 
An introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt ThearlingAn introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt ThearlingPim Piepers
 
Real World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining ToolsReal World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining Toolsijsrd.com
 
[2015 e-Government Program] Action Plan : Warsaw(Poland)
[2015 e-Government Program] Action Plan : Warsaw(Poland)[2015 e-Government Program] Action Plan : Warsaw(Poland)
[2015 e-Government Program] Action Plan : Warsaw(Poland)shrdcinfo
 
Big data sources and methods for social and economic analyses
Big data sources and methods for social and economic analysesBig data sources and methods for social and economic analyses
Big data sources and methods for social and economic analysesAmerico Arizaca Avalos
 
DigiGov_cmu_rwanda
DigiGov_cmu_rwandaDigiGov_cmu_rwanda
DigiGov_cmu_rwandaRajiv Ranjan
 

Tendances (20)

Big data as a source for official statistics
Big data as a source for official statisticsBig data as a source for official statistics
Big data as a source for official statistics
 
Extracting information from ' messy' social media data
Extracting information from ' messy' social media dataExtracting information from ' messy' social media data
Extracting information from ' messy' social media data
 
The impact of Big Data on next generation of smart cities
The impact of Big Data on next generation of smart citiesThe impact of Big Data on next generation of smart cities
The impact of Big Data on next generation of smart cities
 
A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...
 
Tools and techniques adopted for big data analytics
Tools and techniques adopted for big data analyticsTools and techniques adopted for big data analytics
Tools and techniques adopted for big data analytics
 
Beyond dashboards
Beyond dashboardsBeyond dashboards
Beyond dashboards
 
4차 산업혁명 시대의 싱크탱크의 변화(kdi)
4차 산업혁명 시대의 싱크탱크의 변화(kdi)4차 산업혁명 시대의 싱크탱크의 변화(kdi)
4차 산업혁명 시대의 싱크탱크의 변화(kdi)
 
Data Analytics Career Paths
Data Analytics Career PathsData Analytics Career Paths
Data Analytics Career Paths
 
Arloesiadur: An analytics experiment in innovation policy
Arloesiadur: An analytics experiment in innovation policyArloesiadur: An analytics experiment in innovation policy
Arloesiadur: An analytics experiment in innovation policy
 
eROSA Stakeholder WS1: Big Data and Open Science in agricultural and environm...
eROSA Stakeholder WS1: Big Data and Open Science in agricultural and environm...eROSA Stakeholder WS1: Big Data and Open Science in agricultural and environm...
eROSA Stakeholder WS1: Big Data and Open Science in agricultural and environm...
 
Elementary Concepts of data minig
Elementary Concepts of data minigElementary Concepts of data minig
Elementary Concepts of data minig
 
Big DataParadigm, Challenges, Analysis, and Application
Big DataParadigm, Challenges, Analysis, and ApplicationBig DataParadigm, Challenges, Analysis, and Application
Big DataParadigm, Challenges, Analysis, and Application
 
An introduction to Data Mining
An introduction to Data MiningAn introduction to Data Mining
An introduction to Data Mining
 
Public Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || ChoenniePublic Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || Choennie
 
Location, Location, Location: Leveraging Interactive Maps, Administrative and...
Location, Location, Location: Leveraging Interactive Maps, Administrative and...Location, Location, Location: Leveraging Interactive Maps, Administrative and...
Location, Location, Location: Leveraging Interactive Maps, Administrative and...
 
An introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt ThearlingAn introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt Thearling
 
Real World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining ToolsReal World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining Tools
 
[2015 e-Government Program] Action Plan : Warsaw(Poland)
[2015 e-Government Program] Action Plan : Warsaw(Poland)[2015 e-Government Program] Action Plan : Warsaw(Poland)
[2015 e-Government Program] Action Plan : Warsaw(Poland)
 
Big data sources and methods for social and economic analyses
Big data sources and methods for social and economic analysesBig data sources and methods for social and economic analyses
Big data sources and methods for social and economic analyses
 
DigiGov_cmu_rwanda
DigiGov_cmu_rwandaDigiGov_cmu_rwanda
DigiGov_cmu_rwanda
 

Similaire à Opportunities and methodological challenges of Big Data for official statisticsg data piet_daas_roma2

Big data - a review (2013 4)
Big data - a review (2013 4)Big data - a review (2013 4)
Big data - a review (2013 4)Sonu Gupta
 
Introduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycleIntroduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycleDr. Radhey Shyam
 
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdfKIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdfDr. Radhey Shyam
 
Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.saranya270513
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfDr. Radhey Shyam
 
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...IT Network marcus evans
 
The profile of the management (data) scientist: Potential scenarios and skill...
The profile of the management (data) scientist: Potential scenarios and skill...The profile of the management (data) scientist: Potential scenarios and skill...
The profile of the management (data) scientist: Potential scenarios and skill...Juan Mateos-Garcia
 
Big data and Internet
Big data and InternetBig data and Internet
Big data and InternetSanoj Kumar
 
Workshop Rio de Janeiro Strategies for Web Based Data Dissemination
Workshop Rio de Janeiro Strategies for Web Based Data DisseminationWorkshop Rio de Janeiro Strategies for Web Based Data Dissemination
Workshop Rio de Janeiro Strategies for Web Based Data DisseminationZoltan Nagy
 
02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big dataRaul Chong
 

Similaire à Opportunities and methodological challenges of Big Data for official statisticsg data piet_daas_roma2 (20)

Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Identifying the new frontier of big data as an enabler for T&T industries: Re...
Identifying the new frontier of big data as an enabler for T&T industries: Re...Identifying the new frontier of big data as an enabler for T&T industries: Re...
Identifying the new frontier of big data as an enabler for T&T industries: Re...
 
Data Mining With Big Data
Data Mining With Big DataData Mining With Big Data
Data Mining With Big Data
 
Big Data World
Big Data WorldBig Data World
Big Data World
 
Big data - a review (2013 4)
Big data - a review (2013 4)Big data - a review (2013 4)
Big data - a review (2013 4)
 
Introduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycleIntroduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycle
 
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdfKIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
 
Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.
 
Big Data et eGovernment
Big Data et eGovernmentBig Data et eGovernment
Big Data et eGovernment
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdf
 
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
 
The profile of the management (data) scientist: Potential scenarios and skill...
The profile of the management (data) scientist: Potential scenarios and skill...The profile of the management (data) scientist: Potential scenarios and skill...
The profile of the management (data) scientist: Potential scenarios and skill...
 
Complete-SRS.doc
Complete-SRS.docComplete-SRS.doc
Complete-SRS.doc
 
Big data and Internet
Big data and InternetBig data and Internet
Big data and Internet
 
Workshop Rio de Janeiro Strategies for Web Based Data Dissemination
Workshop Rio de Janeiro Strategies for Web Based Data DisseminationWorkshop Rio de Janeiro Strategies for Web Based Data Dissemination
Workshop Rio de Janeiro Strategies for Web Based Data Dissemination
 
Big Data technology
Big Data technologyBig Data technology
Big Data technology
 
Applications of Big Data
Applications of Big DataApplications of Big Data
Applications of Big Data
 
02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big data
 
Big data
Big data Big data
Big data
 

Plus de Piet J.H. Daas

Big Data and official statistics with examples of their use
Big Data and official statistics with examples of their useBig Data and official statistics with examples of their use
Big Data and official statistics with examples of their usePiet J.H. Daas
 
IT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsIT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsPiet J.H. Daas
 
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)Piet J.H. Daas
 
EMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesEMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesPiet J.H. Daas
 
Use of social media for official statistics
Use of social media for official statisticsUse of social media for official statistics
Use of social media for official statisticsPiet J.H. Daas
 
Isi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and biasIsi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and biasPiet J.H. Daas
 
Responsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsResponsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsPiet J.H. Daas
 
CBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSCBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSPiet J.H. Daas
 
Ntts2017 presentation 45
Ntts2017 presentation 45Ntts2017 presentation 45
Ntts2017 presentation 45Piet J.H. Daas
 
Big Data presentation Mannheim
Big Data presentation MannheimBig Data presentation Mannheim
Big Data presentation MannheimPiet J.H. Daas
 
Big data cbs_piet_daas
Big data cbs_piet_daasBig data cbs_piet_daas
Big data cbs_piet_daasPiet J.H. Daas
 
Gebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiekGebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiekPiet J.H. Daas
 
Profiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivityProfiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivityPiet J.H. Daas
 
Big Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in EindhovenBig Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in EindhovenPiet J.H. Daas
 
Quality challenges in modernising business statistics
Quality challenges in modernising business statisticsQuality challenges in modernising business statistics
Quality challenges in modernising business statisticsPiet J.H. Daas
 
Social media sentiment and consumer confidence
Social media sentiment and consumer confidenceSocial media sentiment and consumer confidence
Social media sentiment and consumer confidencePiet J.H. Daas
 
Bi dutch meeting data science
Bi dutch meeting data scienceBi dutch meeting data science
Bi dutch meeting data sciencePiet J.H. Daas
 
Piet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningenPiet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningenPiet J.H. Daas
 

Plus de Piet J.H. Daas (20)

Big Data and official statistics with examples of their use
Big Data and official statistics with examples of their useBig Data and official statistics with examples of their use
Big Data and official statistics with examples of their use
 
IT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsIT infrastructure for Big Data and Data Science at Statistics Netherlands
Opportunities and methodological challenges of Big Data for official statistics

  • 1. Opportunities and methodological challenges of Big Data for official statistics. Dr. Piet J.H. Daas, Methodologist, Big Data research coordinator. March 31, Rome
  • 2. Overview • Big Data • Definition? • DGINS: Scheveningen Memorandum • Experiences at Statistics Netherlands • From ‘New data sources’ to ‘Big Data’ • Data driven approach (learning by doing) • Opportunities & challenges • Methodological & technical challenges • Skills, legal and other issues • With examples!
  • 3. Data, data everywhere!
  • 4. What is Big Data? Defining Big Data is not easy. An attempt: “Data that are difficult to collect, store or process within the conventional systems of statistical organizations. Either, their volume, velocity, structure or variety requires the adoption of new statistical software processing techniques and/or IT infrastructure to enable cost-effective insights to be made.” (Virtual sprint paper) More technical: “Big Data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.” (Wikipedia) A user: “Data sources that are awkward to work with.” TIP: Big Data sources are NOT surveys and NOT administrative data.
  • 5. DGINS: Scheveningen Memorandum 1. Big Data represent new opportunities and challenges for Official Statistics. 2. Develop an 'Official Statistics Big Data strategy' at national and EU-level. 3. Recognize the implications of Big Data for legislation, especially with regard to data protection and personal rights. 4. Several NSIs are currently initiating or considering different uses of Big Data. Momentum to share experiences and to collaborate. 5. Recognize the necessary capabilities and skills to effectively explore Big Data. 6. Acknowledge that the multidisciplinary character requires synergies and partnerships. 7. The use of Big Data in the context of official statistics requires new developments in methodology, quality assessment and IT related issues. 8. Agree on adopting an ESS action plan and roadmap by mid-2014.
  • 6. Experiences at Statistics Netherlands – Started as ‘New data sources for statistics’ in 2009 – Several initiatives over the years: ‐ Internet as a data source • Collecting price data with web robots • Studying the use of web job vacancy data • ‘Marktplaats’ data (Dutch eBay clone) ‐ Alternative means of collecting primary data • Use of smartphones ‐ Big Data (really large amounts of data) • Traffic loop detection data (road sensors) • Mobile phone data (location data) • Social media data (content and sentiment)
  • 8. What have we learned (so far)? I’ll discuss the most important lessons: 1) Types of ‘data’ in Big Data 2) How to access and analyse large amounts of data 3) How to deal with noisy and unstructured data 4) How to deal with selectivity (and our own bias) 5) How to go beyond correlation 6) The need for people with the right skills and mind‐set 7) The need to solve/deal with privacy and security issues 8) Data management & costs. We are slowly starting to get a grip on some of these topics.
  • 9. 1) Types of data: secondary data, primary data
  • 10. 1) Types of data & events. There are many different Big Data sources. An attempt to classify them (Virtual sprint paper): A) Human-sourced information (‘Social Networks’): social media messages, blogs, web searches. B) Process-mediated data (‘Traditional Business Systems and Websites’): credit card, bank or on-line transactions, CDR, product prices, page-views. C) Machine-generated data (‘Automated Systems’): road or climate sensors, satellite images, GPS, AIS. Essentially, most of the data are event-based, and some of these events can be directly related to a user (e.g. the target population).
  • 11. 2) How to access and analyse large amounts of data – If you want to analyse Big Data – You need a lot of computer power! – Or you need a lot of time! High Performance Computing expertise is essential!
  • 12. Our current equipment and more – We have: - Workstations with lots of memory (32-64 GB), fast disk drives (SSD, 512 GB) and a large hard drive (>= 1 TB) - A secure environment in which to access the data with those computers - A Big Data lab - The knowledge to load all the data into R or Python and analyse it there - Followed a High Performance Computing training course - Realized that learning by doing is key! (?databases?) AND a Big Data source with no privacy and security issues, so we can test all kinds of analysis, soft- and hardware (anyplace, anytime, anywhere) • Traffic loop data (road sensors)
  • 13. An example: processing of traffic loop data of 1 day, a total of ~100 million records (25 GB). Processing in R (time needed; speed-up): first R-script: 6 hours (-); improved code: 30 min (12); faster hardware: 10 min (36) (Java code); faster hardware + preprocessed data: 2 min (180), limited by I/O. The I/O limitation can be solved by: 1) for the input part, using a cluster (distributed computing); 2) for the output part, implementing a C++ write routine in R (20% faster).
  • 14. All Dutch vehicles in September
  • 15. 3) How to deal with noisy and unstructured data – Big Data is often ‐ noisy, dirty ‐ redundant ‐ unstructured • e.g. texts, images – How to extract information from Big Data? ‐ In the best/most efficient way
  • 16. Example of noisy data: road sensors. Traffic loop data ‐ Each minute (24/7) the number of passing vehicles is counted in around 20,000 ‘loops’ in the Netherlands • Total and in different length classes ‐ Nice data source for transport and traffic statistics (and more) • A lot of data, around 100 million records a day. [Figure: map of loop locations]
  • 17. Total number of vehicles during the day. [Figure: x-axis: time (hour)]
  • 18. Correct for missing data: macro level. Sliding window of 5 min. Impute missing data. Before: total = ~295 million vehicles. After: total = ~330 million vehicles (+12%).
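
The sliding-window idea on this slide could look roughly like the sketch below. The slide does not say how the window is applied, so everything here is an assumption made for illustration: the function name `impute_sliding_window`, the use of a centred window, and filling a gap with the mean of the observed neighbours.

```python
from statistics import mean


def impute_sliding_window(counts, window=5):
    """Fill missing minute counts (None) with the mean of the observed
    values inside a centred window of `window` minutes.

    Assumption for illustration only: gaps with no observed neighbour
    inside the window are left as None rather than guessed.
    """
    half = window // 2
    filled = list(counts)
    for i, value in enumerate(counts):
        if value is None:
            lo, hi = max(0, i - half), min(len(counts), i + half + 1)
            neighbours = [v for v in counts[lo:hi] if v is not None]
            if neighbours:
                filled[i] = mean(neighbours)
    return filled


# Hypothetical vehicle counts per minute for one loop; None marks a missing minute.
minute_counts = [12, 14, None, 15, 13, None, None, None, None, None, 9]
print(impute_sliding_window(minute_counts))
```

Imputing only from original observations (never from already imputed values) keeps long outages from being filled with extrapolated numbers; whether that matches the procedure behind the +12% figure is not stated on the slide.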
  • 19. Correct for missing data: micro level. Recursive Bayesian estimator (<1 sec on GPGPU). [Figure: x-axis: time (min.); y-axis: number of vehicles detected]
  • 20. 4) How to deal with selectivity – Big Data sources may be selective when - Only part of the population contributes to the data set • For example: mobile phone owners - The measurement mechanism is selective (e.g. non-random times or places) • For example: placing of road sensors on Dutch highways is not random – Many Big Data sources contain events - Population units may generate widely varying numbers of events - Attempt to associate events with units – Correcting for selectivity - Background characteristics – or features – are needed (linking with registers; profiling) - Use predictive modelling / machine learning to produce population estimates
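
One simple way to use background characteristics for a selectivity correction is post-stratification. This is a swapped-in textbook technique, not the predictive modelling the slide refers to; the function name `poststratified_mean`, the strata and all numbers below are made up for illustration.

```python
def poststratified_mean(sample, population_counts):
    """Estimate a population mean from a selective sample.

    sample: list of (stratum, value) pairs from the Big Data source.
    population_counts: {stratum: number of units in the target population},
        e.g. taken from a register (hypothetical here).
    Strata without any sample unit are ignored, which is itself a
    limitation of this sketch.
    """
    sums, sizes = {}, {}
    for stratum, value in sample:
        sums[stratum] = sums.get(stratum, 0.0) + value
        sizes[stratum] = sizes.get(stratum, 0) + 1
    total_pop = sum(population_counts[s] for s in sizes)
    weighted = sum(population_counts[s] * sums[s] / sizes[s] for s in sizes)
    return weighted / total_pop


# Young units are over-represented in the sample but not in the population.
sample = [("young", 10), ("young", 12), ("young", 14), ("old", 4)]
population = {"young": 500, "old": 500}

naive = sum(v for _, v in sample) / len(sample)
print(naive)                                    # 10.0
print(poststratified_mean(sample, population))  # 8.0
```

The naive mean follows the over-represented group, while the weighted estimate gives each stratum its population share. It only works when the background characteristic is known for the units in the source, which is why linking with registers or profiling is needed.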
  • 22. Selectivity illustrated. Selectivity of Big Data could potentially be less problematic than high non-response rates of surveys: there is just more data for your model! The black line shows the relationship between the target and auxiliary variable in the target population. The red lines show the estimated relationship according to each of the three sources (with 95% confidence intervals). Here we assume units with auxiliary variables are available!
  • 23. 5) How to go beyond correlation – You will very likely use correlation to check Big Data findings with those in other (survey) data – When correlation is high: 1) Try falsifying it first (is it coincidental?); correlation ≠ causation 2) If this fails, you may have found something interesting! 3) Perform additional analysis (look for causality): cointegration, Granger causality, time‐series approach, etc.
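
A first, informal step beyond a single correlation coefficient is to compare correlations at different time lags. The sketch below is not a Granger causality or cointegration test; it only checks which series tends to lead. The helper names `pearson` and `lagged_correlation` and the two toy series are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag], for lag >= 0."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])


# Toy series: 'sentiment' repeats 'confidence' one period later.
confidence = [1, 3, 2, 5, 4, 6, 5, 8]
sentiment = [0] + confidence[:-1]

for lag in (0, 1, 2):
    print(lag, round(lagged_correlation(confidence, sentiment, lag), 3))
```

In this constructed example the correlation peaks at lag 1, because the second series was built as a delayed copy of the first. With real data such a peak is only a hint, and a formal time-series analysis is still needed.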
  • 24. Example: Sentiment in social media (day/week/month)
  • 26. Platform specific results. Granger causality reveals that Consumer Confidence precedes Facebook sentiment! (p-value < 0.001) Table 1. Social media message properties for various platforms and their correlation with consumer confidence. Per platform: number of social media messages (1); number of messages as percentage of total (%); correlation coefficient of monthly sentiment index and consumer confidence, r (2), two values as given on the slide.
      All platforms combined: 3,153,002,327; 100; 0.75, 0.78
      Facebook: 334,854,088; 10.6; 0.81*, 0.85*
      Twitter: 2,526,481,479; 80.1; 0.68, 0.70
      Hyves: 45,182,025; 1.4; 0.50, 0.58
      News sites: 56,027,686; 1.8; 0.37, 0.26
      Blogs: 48,600,987; 1.5; 0.25, 0.22
      Google+: 644,039; 0.02; -0.04, -0.09
      Linkedin: 565,811; 0.02; -0.23, -0.25
      Youtube: 5,661,274; 0.2; -0.37, -0.41
      Forums: 134,98,938; 4.3; -0.45, -0.49
      (1) Period covered: June 2010 until November 2013. (2) Confirmed by visually inspecting scatterplots and additional checks (see text). * Cointegrated.
  • 27. A schematic view. [Figure: timeline of the previous month and the current month, each split into day 1-7, day 8-14, day 15-21 and day 22-28, showing the Consumer Confidence publication date (~20th) and social media sentiment]
  • 28. Platform specific results (2). More detailed studies revealed a 1 week delay between both! Consumer confidence comes first, social media sentiment follows. (Table 1 is repeated from slide 26.)
  • 29. 6) People and skills needed. For Big Data studies you need: – People with an open mind‐set that do not see all problems a priori in terms of sampling theory – People with programming skills and IT‐affinity – People with a data‐driven, pragmatic attitude (data explorers, ‘practitioners’) ‐ You need data scientists!
  • 30. Data science skills ‘landscape’. Sexy Skills of Data Geeks: 1) Statistics - traditional analysis you're used to thinking about 2) Data ‘munging’ - parsing, scraping, and formatting data 3) Visualization - graphs, tools, etc. 4) High Performance Computing knowledge
  • 31. People that think outside the ‘box’
  • 32. 7) Privacy and security issues – The Dutch privacy and security law allows the study of privacy sensitive data for scientific and statistical research – Of course, appropriate measures always need to be taken • Prior to new research studies, check privacy sensitivity of data • In case of privacy sensitive data: • Try to anonymize micro data or use aggregates • Use secure environment: workstations in Big Data lab – Legal issues that enable the use of Big Data for official statistics production are currently being looked at - There is Big Data that can be considered ‘Administrative data’, i.e. Big Data that is managed by a (semi-)governmentally funded organisation
  • 33. Example: Mobile phones. Mobile phone activity as a data source – Nearly every person in the Netherlands has a mobile phone - Usually on them and almost always switched on! - Many people are very active during the day – Can data of mobile phones be used for statistics? - Travel behaviour (of active phones) - ‘Day time population’ (of active phones) - Tourism (new phones that register to network) – Data of a single mobile company was used - Hourly aggregates per area (only when > 15 events) - Especially important for roaming data (foreign visitors)
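
The rule "hourly aggregates per area (only when > 15 events)" can be sketched as a count per (area, hour) cell followed by suppression of small cells. The event format, the function name `hourly_aggregates` and the example data are assumptions; the slide does not describe how the provider actually implemented this.

```python
from collections import Counter


def hourly_aggregates(events, min_events=15):
    """Count events per (area, hour) and keep only cells with more than
    `min_events` events, so that small and potentially identifying cells
    are never released.

    events: iterable of (area, hour) tuples (hypothetical format).
    """
    counts = Counter(events)
    return {cell: n for cell, n in counts.items() if n > min_events}


# Area A has 16 events at 09:00 and is kept; area B has exactly 15 and is suppressed.
events = [("A", 9)] * 16 + [("B", 9)] * 15
print(hourly_aggregates(events))  # {('A', 9): 16}
```

Applying the threshold before the data leave the provider means the statistical office only ever sees aggregates.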
  • 34. ‘Day time population’ – Hourly changes of mobile phone activity – 7 & 8 May 2013 – Per area distinguished – Only data for areas with > 15 events per hour
  • 35. Tourism: Roaming during European league final. [Figure: map; legend: hardly any, low, medium, high, very high]
  • 36. 8) Costs and data management – Costs ‐ In the Netherlands we don’t pay for administrative data. ‐ How about Big Data? • We currently pay for social media (access) and mobile phone data (extra processing efforts) – Data management ‐ Who owns the data? Stability of delivery/source ‐ Cope with the huge volume • Run queries in database of data source holder • Collect and process it as data stream • Bulk processing
  • 38. Thank you for your attention! @pietdaas