Opportunities and methodological challenges of Big Data for official statistics
Dr. Piet J.H. Daas
Methodologist, Big Data research coordinator
March 31, Rome
Overview
• Big Data
• Definition?
• DGINS: Scheveningen Memorandum
• Experiences at Statistics Netherlands
• From ‘New data sources’ to ‘Big Data’
• Data driven approach (learning by doing)
• Opportunities & challenges
• Methodological & technical challenges
• Skills, legal and other issues
• With examples!
– Data, data everywhere!
What is Big Data?
Defining Big Data is not easy:
An attempt: “Data that are difficult to collect, store or process within the conventional
systems of statistical organizations. Either, their volume, velocity, structure
or variety requires the adoption of new statistical software processing
techniques and/or IT infrastructure to enable cost-effective insights to be
made.” (Virtual sprint paper)
More technical: “Big Data usually includes data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage, and process the
data within a tolerable elapsed time.” (Wikipedia)
A user: “Data sources that are awkward to work with.”
TIP: Big Data sources are NOT surveys and NOT administrative data
DGINS: Scheveningen Memorandum
1. Big Data represent new opportunities and challenges for Official Statistics.
2. Develop an 'Official Statistics Big Data strategy' at national and EU-level.
3. Recognize the implications of Big Data for legislation especially with
regard to data protection and personal rights
4. Several NSIs are currently initiating or considering different uses of Big
Data. Momentum to share experiences and to collaborate.
5. Recognize the necessary capabilities and skills to effectively explore Big
Data
6. Acknowledge that the multidisciplinary character requires synergies and
partnerships.
7. The use of Big Data in the context of official statistics requires new
developments in methodology, quality assessment and IT related issues.
8. Agree on adopting an ESS action plan and roadmap by mid-2014
Experiences at Statistics Netherlands
– Started as ‘New data sources for statistics’ in 2009
– Several initiatives over the years:
‐ Internet as a data source
• Collecting price data with web robots
• Study the use of web job vacancies data
• ‘Marktplaats’ data (Dutch eBay clone)
‐ Alternative means of collecting primary data
• Use of smartphones
‐ Big Data (really large amounts of data)
• Traffic loop detection data (road sensors)
• Mobile phone data (location data)
• Social media data (content and sentiment)
Opportunities & challenges
What have we learned (so far)?
I’ll discuss the most important ones:
1) Types of ‘data’ in Big Data
2) How to access and analyse large amounts of data
3) How to deal with noisy and unstructured data
4) How to deal with selectivity (and our own bias)
5) How to go beyond correlation
6) The need for people with the right skills and mind‐set
7) Need to solve/deal with privacy and security issues
8) Data management & costs
We are slowly starting to get a grip on some of these topics
1) Types of data
(Figure: secondary data and primary data)
1) Types of data & events
There are many different Big Data sources.
An attempt to classify them (Virtual sprint paper):
A) Human-sourced information (‘Social Networks’)
Social media messages, blogs, web searches
B) Process-mediated data (‘Traditional Business Systems and Websites’)
Credit card, bank or on-line transactions, CDR, product prices, page-views
C) Machine-generated data (‘Automated Systems’)
Road or climate sensors, satellite images, GPS, AIS.
Essentially, most of the data are event-based, and some of these events can be directly related to a user (e.g. someone in the target population)
2) How to access and analyse large amounts of data
– If you want to analyse Big Data
– You need a lot of computer power!!
– Or you need a lot of time!
High Performance Computing expertise is essential!
– We have:
- Workstations with lots of memory (32-64 GB), fast disk drives (SSD, 512 GB) and a large hard drive (>= 1 TB)
- A secure environment in which to access the data with those computers
- A Big Data lab
- The knowledge to load all the data into R or Python and analyse it
- Followed a High Performance Computing training course
- Realized that learning by doing is key! (?databases?)
AND a Big data source with no privacy and security issues so we can test all
kinds of analysis, soft- and hardware (anyplace, anytime, anywhere)
• Traffic loop data (road sensors)
Our current equipment and more
An example:
– Processing of traffic loop data of 1 day
- A total of ~100 million records (25 GB)
The I/O limitation can be solved by:
1) Input part: using a cluster (distributed computing)
2) Output part: implementing a C++ write routine in R (20% faster)
Processing in R                        Time needed   Speed-up
First R-script                         6 hours       -
Improved code                          30 min        12
Faster hardware (Java code)            10 min        36
Faster hardware + preprocessed data    2 min         180
Limited by I/O
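A common way to keep memory use bounded when a full day of loop data is awkward to hold in RAM is to stream the file and aggregate record by record. The sketch below illustrates that idea only; it is not the R, Java or C++ code referred to on this slide, and the column names `loop_id`, `minute` and `count` are hypothetical.

```python
import csv
import io
from collections import Counter


def vehicles_per_loop(lines):
    """Sum vehicle counts per loop, reading one record at a time."""
    totals = Counter()
    for row in csv.DictReader(lines):
        # Only the running totals are kept in memory, never the whole file.
        totals[row["loop_id"]] += int(row["count"])
    return totals


# Tiny made-up stand-in for a day file; a real file would be passed as open(path).
sample = io.StringIO(
    "loop_id,minute,count\n"
    "A,0,12\n"
    "A,1,15\n"
    "B,0,7\n"
)
print(dict(vehicles_per_loop(sample)))
```

Streaming removes the memory limit but not the I/O limit: every record still has to be read once, which is why preprocessing and distributed input matter.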
All Dutch vehicles in September
3) How to deal with noisy and unstructured data
– Big Data is often
‐ noisy, dirty
‐ redundant
‐ unstructured
• e.g. texts, images
– How to extract information from Big Data?
‐ In the best/most efficient way
Example of noisy data: Road sensors
Traffic loop data
‐ Each minute (24/7) the number of passing vehicles is counted in around 20,000 ‘loops’ in the Netherlands
• Total and in different length classes
‐ Nice data source for transport and traffic statistics
(and more)
• A lot of data, around 100 million records a day
(Figure: loop locations)
Total number of vehicles during the day
(Figure: total number of vehicles against time of day, in hours)
Correct for missing data: macro level
Sliding window of 5 min. Impute missing data.
Before: total = ~295 million vehicles
After: total = ~330 million vehicles (+12%)
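The slide does not spell out the imputation rule, so the following is only a minimal sketch of one possible reading: a missing minute is replaced by the mean of the observed values in a centred 5-minute window. The function name and the example series are made up for illustration.

```python
def impute_sliding_window(counts, window=5):
    """Replace None by the mean of the observed values in a centred window."""
    half = window // 2
    filled = []
    for i, value in enumerate(counts):
        if value is not None:
            filled.append(value)
            continue
        neighbours = [
            c for c in counts[max(0, i - half): i + half + 1] if c is not None
        ]
        # Leave the gap open when every value in the window is missing.
        filled.append(sum(neighbours) / len(neighbours) if neighbours else None)
    return filled


# Made-up per-minute vehicle totals with gaps.
minute_counts = [10, 12, None, 14, 16, None, None, None, None, None, 20]
print(impute_sliding_window(minute_counts))
```

Gaps longer than the window stay open in this sketch, so long outages would need a different treatment.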
Correct for missing data: micro level
(Figure: number of vehicles detected against time, in minutes)
Recursive Bayesian estimator (<1 sec on GPGPU)
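The slide does not say which recursive Bayesian estimator was used. As an illustration of the general idea, the sketch below uses a scalar Kalman filter, one well-known member of that family, on a random-walk model of the per-minute count. All parameter values are assumptions chosen for the example, and this CPU version says nothing about the GPGPU implementation.

```python
def kalman_filter(observations, process_var=1.0, obs_var=4.0,
                  init_mean=0.0, init_var=100.0):
    """Scalar Kalman filter on a random-walk state; None marks a missing minute."""
    mean, var = init_mean, init_var
    estimates = []
    for z in observations:
        # Predict: the mean carries over and the uncertainty grows.
        var += process_var
        if z is not None:
            # Update: blend prediction and observation using the Kalman gain.
            gain = var / (var + obs_var)
            mean += gain * (z - mean)
            var *= 1.0 - gain
        estimates.append(mean)
    return estimates


# Made-up counts for one loop; the missing minute keeps the previous estimate.
print(kalman_filter([10.0, None, 12.0, 11.0]))
```

A missing observation simply skips the update step, which is what makes this kind of estimator convenient for minute-level gaps.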
4) How to deal with selectivity
– Big Data sources may be selective when
- Only part of the population contributes to the data set
• For example: mobile phone owners
- The measurement mechanism is selective (e.g. non-random times or
places)
• For example: placing of road sensors on Dutch highways is not random
– Many Big Data sources contain events
- Population units may generate widely varying numbers of events
- Attempt to associate events with units
– Correcting for selectivity
- Background characteristics – or features – are needed (linking with
registers; profiling)
- Use predictive modelling / machine learning to produce population estimates
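To make the role of background characteristics concrete, here is a minimal sketch of post-stratification, a much simpler technique than the predictive modelling named above. It assumes every unit in the source carries a stratum label (for example an age group obtained by linking or profiling) and that the population share of each stratum is known from a register. All names and numbers are made up.

```python
def poststratified_mean(sample, population_shares):
    """sample: list of (stratum, value); population_shares: stratum -> share."""
    sums, counts = {}, {}
    for stratum, value in sample:
        sums[stratum] = sums.get(stratum, 0.0) + value
        counts[stratum] = counts.get(stratum, 0) + 1
    # Weight each stratum mean by its known share in the population.
    return sum(population_shares[s] * sums[s] / counts[s] for s in counts)


# A selective source: 80% of the units are young, against 40% in the population.
sample = [("young", 1.0)] * 8 + [("old", 0.0)] * 2
naive = sum(value for _, value in sample) / len(sample)
corrected = poststratified_mean(sample, {"young": 0.4, "old": 0.6})
print(naive, corrected)
```

The correction only works for strata that actually appear in the source; a stratum with no units contributes nothing here, which is one reason model-based approaches are considered.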
Profiling: social media
Selectivity illustrated
Selectivity of Big Data could potentially be less problematic than the high non-response rates of surveys.
- There is just more data for your model!
The black line shows the relationship between the target and auxiliary variable in the target population. The red lines show the estimated relationship according to each of the three sources (with 95% confidence intervals).
Here we assume that units with auxiliary variables are available!
5) How to go beyond correlation
– You will very likely use correlation to compare Big Data findings with those in other (survey) data
– When correlation is high:
1) Try falsifying it first (is it coincidental?)
correlation ≠ causation
2) If this fails, you may have found something interesting!
3) Perform additional analysis (look for causality)
cointegration, Granger causality, time‐series approach, etc.
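A simple first look at which series leads is to correlate one series with shifted copies of the other. The sketch below is exploratory only and is not a Granger causality or cointegration test; the two series are made up so that the second follows the first by one step.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag] for lag >= 0."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])


confidence = [1, 3, 2, 5, 4, 6, 8, 7, 9, 10]
sentiment = [0] + confidence[:-1]  # made-up series that follows one step later
for lag in range(3):
    print(lag, round(lagged_correlation(confidence, sentiment, lag), 3))
```

A peak at a positive lag suggests that the first series moves first, but trends shared by both series can produce the same pattern, so a formal test is still needed.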
Example: Sentiment in social media (day/week/month)
Platform specific sentiment
Table 1. Social media message properties for various platforms and their correlation with consumer confidence

Social media platform    Number of social      Number of messages as      Correlation coefficient of monthly sentiment
                         media messages (1)    percentage of total (%)    index and consumer confidence (r) (2)
All platforms combined   3,153,002,327         100                         0.75     0.78
Facebook                   334,854,088         10.6                        0.81*    0.85*
Twitter                  2,526,481,479         80.1                        0.68     0.70
Hyves                       45,182,025         1.4                         0.50     0.58
News sites                  56,027,686         1.8                         0.37     0.26
Blogs                       48,600,987         1.5                         0.25     0.22
Google+                        644,039         0.02                       -0.04    -0.09
Linkedin                       565,811         0.02                       -0.23    -0.25
Youtube                      5,661,274         0.2                        -0.37    -0.41
Forums                     134,984,938         4.3                        -0.45    -0.49

(1) Period covered: June 2010 until November 2013
(2) Confirmed by visually inspecting scatterplots and additional checks (see text)
* Cointegrated
Platform specific results
Granger causality reveals that Consumer Confidence precedes Facebook sentiment! (p-value < 0.001)
A schematic view
(Figure: timeline of the previous month and the current month, each split into days 1-7, 8-14, 15-21 and 22-28, showing Consumer Confidence, its publication date (~20th) and social media sentiment)
Platform specific results (2)
More detailed studies revealed a one-week delay between the two!
Consumer confidence comes first, social media sentiment follows
6) People and skills needed
For Big data studies you need:
– People with an open mind‐set who do not see all problems a priori in terms of sampling theory
– People with programming skills and IT‐affinity
– People with a data‐driven, pragmatic attitude (data explorers, ‘practitioners’)
‐ You need data scientists!
Data science skills ‘landscape’
Sexy Skills of Data Geeks
1) Statistics - traditional analysis you're used to thinking about
2) Data ‘munging’ - parsing, scraping, and formatting data
3) Visualization - graphs, tools, etc.
4) High Performance Computing knowledge
People that think outside the ‘box’
7) Privacy and security issues
– The Dutch privacy and security law allows the study of privacy
sensitive data for scientific and statistical research
– Of course, appropriate measures always need to be taken
• Prior to new research studies, check privacy sensitivity of data
• In case of privacy sensitive data:
• Try to anonymize micro data or use aggregates
• Use secure environment: workstations in Big Data lab
– Legal issues that enable the use of Big Data for official statistics
production are currently being looked at
- There is Big Data that can be considered ‘Administrative data’: i.e. Big
Data that is managed by a (semi-)governmentally funded organisation
Example: Mobile phones
Mobile phone activity as a data source
– Nearly every person in the Netherlands has a mobile phone
- Usually on them and almost always switched on!
- Many people are very active during the day
– Can data of mobile phones be used for statistics?
- Travel behaviour (of active phones)
- ‘Day time population’ (of active phones)
- Tourism (new phones that register to network)
– Data of a single mobile company was used
- Hourly aggregates per area (only when > 15 events)
- Especially important for roaming data (foreign visitors)
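The threshold rule above (an hourly aggregate is only delivered when an area has more than 15 events) can be sketched as follows. The event layout, a list of (area, hour) pairs, and the function name are assumptions for illustration; the actual aggregation was done by the mobile phone company.

```python
from collections import Counter


def hourly_aggregates(events, threshold=15):
    """Count events per (area, hour) and keep only cells above the threshold."""
    counts = Counter(events)
    # Cells with 15 events or fewer are suppressed to protect privacy.
    return {cell: n for cell, n in counts.items() if n > threshold}


# Made-up events: area A has 20 events at 09:00, area B only 15.
events = [("A", 9)] * 20 + [("B", 9)] * 15
print(hourly_aggregates(events))
```

Suppression of small cells means that quiet areas and quiet hours drop out of the data, which matters when interpreting ‘day time population’ figures.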
‘Day time population’
– Hourly changes of mobile phone activity
– 7 & 8 May 2013
– Per area distinguished
– Only data for areas with > 15 events per hour
Tourism: Roaming during European league final
(Figure legend: hardly any, low, medium, high, very high)
8) Costs and data management
– Costs
‐ In the Netherlands we don’t pay for administrative data.
‐ How about Big Data?
• We currently pay for social media (access) and mobile phone
data (extra processing efforts)
– Data management
‐ Who owns the data? Stability of delivery/source
‐ Cope with the huge volume
• Run queries in database of data source holder
• Collect and process it as data stream
• Bulk processing
The Future
The future of statistics looks BIG
Thank you for your attention! @pietdaas

Contenu connexe

Tendances

The impact of Big Data on next generation of smart cities
The impact of Big Data on next generation of smart citiesThe impact of Big Data on next generation of smart cities
The impact of Big Data on next generation of smart cities
PayamBarnaghi
 
A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...
IJECEIAES
 
Tools and techniques adopted for big data analytics
Tools and techniques adopted for big data analyticsTools and techniques adopted for big data analytics
Tools and techniques adopted for big data analytics
JOSEPH FRANCIS
 
Public Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || ChoenniePublic Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || Choennie
Human Centered ICT
 
An introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt ThearlingAn introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt Thearling
Pim Piepers
 

Tendances (20)

Big data as a source for official statistics
Big data as a source for official statisticsBig data as a source for official statistics
Big data as a source for official statistics
 
Extracting information from ' messy' social media data
Extracting information from ' messy' social media dataExtracting information from ' messy' social media data
Extracting information from ' messy' social media data
 
The impact of Big Data on next generation of smart cities
The impact of Big Data on next generation of smart citiesThe impact of Big Data on next generation of smart cities
The impact of Big Data on next generation of smart cities
 
A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...
 
Tools and techniques adopted for big data analytics
Tools and techniques adopted for big data analyticsTools and techniques adopted for big data analytics
Tools and techniques adopted for big data analytics
 
Beyond dashboards
Beyond dashboardsBeyond dashboards
Beyond dashboards
 
4차 산업혁명 시대의 싱크탱크의 변화(kdi)
4차 산업혁명 시대의 싱크탱크의 변화(kdi)4차 산업혁명 시대의 싱크탱크의 변화(kdi)
4차 산업혁명 시대의 싱크탱크의 변화(kdi)
 
Data Analytics Career Paths
Data Analytics Career PathsData Analytics Career Paths
Data Analytics Career Paths
 
Arloesiadur: An analytics experiment in innovation policy
Arloesiadur: An analytics experiment in innovation policyArloesiadur: An analytics experiment in innovation policy
Arloesiadur: An analytics experiment in innovation policy
 
eROSA Stakeholder WS1: Big Data and Open Science in agricultural and environm...
eROSA Stakeholder WS1: Big Data and Open Science in agricultural and environm...eROSA Stakeholder WS1: Big Data and Open Science in agricultural and environm...
eROSA Stakeholder WS1: Big Data and Open Science in agricultural and environm...
 
Elementary Concepts of data minig
Elementary Concepts of data minigElementary Concepts of data minig
Elementary Concepts of data minig
 
Big DataParadigm, Challenges, Analysis, and Application
Big DataParadigm, Challenges, Analysis, and ApplicationBig DataParadigm, Challenges, Analysis, and Application
Big DataParadigm, Challenges, Analysis, and Application
 
An introduction to Data Mining
An introduction to Data MiningAn introduction to Data Mining
An introduction to Data Mining
 
Public Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || ChoenniePublic Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || Choennie
 
Location, Location, Location: Leveraging Interactive Maps, Administrative and...
Location, Location, Location: Leveraging Interactive Maps, Administrative and...Location, Location, Location: Leveraging Interactive Maps, Administrative and...
Location, Location, Location: Leveraging Interactive Maps, Administrative and...
 
An introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt ThearlingAn introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt Thearling
 
Real World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining ToolsReal World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining Tools
 
[2015 e-Government Program] Action Plan : Warsaw(Poland)
[2015 e-Government Program] Action Plan : Warsaw(Poland)[2015 e-Government Program] Action Plan : Warsaw(Poland)
[2015 e-Government Program] Action Plan : Warsaw(Poland)
 
Big data sources and methods for social and economic analyses
Big data sources and methods for social and economic analysesBig data sources and methods for social and economic analyses
Big data sources and methods for social and economic analyses
 
DigiGov_cmu_rwanda
DigiGov_cmu_rwandaDigiGov_cmu_rwanda
DigiGov_cmu_rwanda
 

Similaire à Opportunities and methodological challenges of Big Data for official statisticsg data piet_daas_roma2

Big data - a review (2013 4)
Big data - a review (2013 4)Big data - a review (2013 4)
Big data - a review (2013 4)
Sonu Gupta
 
Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.
saranya270513
 
Workshop Rio de Janeiro Strategies for Web Based Data Dissemination
Workshop Rio de Janeiro Strategies for Web Based Data DisseminationWorkshop Rio de Janeiro Strategies for Web Based Data Dissemination
Workshop Rio de Janeiro Strategies for Web Based Data Dissemination
Zoltan Nagy
 
02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big data
Raul Chong
 

Similaire à Opportunities and methodological challenges of Big Data for official statisticsg data piet_daas_roma2 (20)

Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Identifying the new frontier of big data as an enabler for T&T industries: Re...
Identifying the new frontier of big data as an enabler for T&T industries: Re...Identifying the new frontier of big data as an enabler for T&T industries: Re...
Identifying the new frontier of big data as an enabler for T&T industries: Re...
 
Data Mining With Big Data
Data Mining With Big DataData Mining With Big Data
Data Mining With Big Data
 
Big Data World
Big Data WorldBig Data World
Big Data World
 
Big data - a review (2013 4)
Big data - a review (2013 4)Big data - a review (2013 4)
Big data - a review (2013 4)
 
Introduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycleIntroduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycle
 
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdfKIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
 
Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.
 
Big Data et eGovernment
Big Data et eGovernmentBig Data et eGovernment
Big Data et eGovernment
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdf
 
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
 
The profile of the management (data) scientist: Potential scenarios and skill...
The profile of the management (data) scientist: Potential scenarios and skill...The profile of the management (data) scientist: Potential scenarios and skill...
The profile of the management (data) scientist: Potential scenarios and skill...
 
Complete-SRS.doc
Complete-SRS.docComplete-SRS.doc
Complete-SRS.doc
 
Big data and Internet
Big data and InternetBig data and Internet
Big data and Internet
 
Workshop Rio de Janeiro Strategies for Web Based Data Dissemination
Workshop Rio de Janeiro Strategies for Web Based Data DisseminationWorkshop Rio de Janeiro Strategies for Web Based Data Dissemination
Workshop Rio de Janeiro Strategies for Web Based Data Dissemination
 
Big Data technology
Big Data technologyBig Data technology
Big Data technology
 
Applications of Big Data
Applications of Big DataApplications of Big Data
Applications of Big Data
 
02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big data
 
Big data
Big data Big data
Big data
 

Plus de Piet J.H. Daas

Plus de Piet J.H. Daas (20)

Big Data and official statistics with examples of their use
Big Data and official statistics with examples of their useBig Data and official statistics with examples of their use
Big Data and official statistics with examples of their use
 
IT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsIT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics Netherlands
 
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
 
EMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesEMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniques
 
Use of social media for official statistics
Use of social media for official statisticsUse of social media for official statistics
Use of social media for official statistics
 
Isi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and biasIsi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and bias
 
Responsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsResponsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics Netherlands
 
CBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSCBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONS
 
Ntts2017 presentation 45
Ntts2017 presentation 45Ntts2017 presentation 45
Ntts2017 presentation 45
 
Big Data presentation Mannheim
Big Data presentation MannheimBig Data presentation Mannheim
Big Data presentation Mannheim
 
Big data cbs_piet_daas
Big data cbs_piet_daasBig data cbs_piet_daas
Big data cbs_piet_daas
 
Gebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiekGebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiek
 
Big Data @ CBS
Big Data @ CBSBig Data @ CBS
Big Data @ CBS
 
Profiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivityProfiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivity
 
Big Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in EindhovenBig Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in Eindhoven
 
Quality challenges in modernising business statistics
Quality challenges in modernising business statisticsQuality challenges in modernising business statistics
Quality challenges in modernising business statistics
 
Social media sentiment and consumer confidence
Social media sentiment and consumer confidenceSocial media sentiment and consumer confidence
Social media sentiment and consumer confidence
 
Big data @ CBS
Big data @ CBSBig data @ CBS
Big data @ CBS
 
Bi dutch meeting data science
Bi dutch meeting data scienceBi dutch meeting data science
Bi dutch meeting data science
 
Piet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningenPiet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningen
 

Dernier

Zirakpur Call Girls👧 Book Now📱8146719683 📞👉Mohali Call Girl Service No Advanc...
Zirakpur Call Girls👧 Book Now📱8146719683 📞👉Mohali Call Girl Service No Advanc...Zirakpur Call Girls👧 Book Now📱8146719683 📞👉Mohali Call Girl Service No Advanc...
Zirakpur Call Girls👧 Book Now📱8146719683 📞👉Mohali Call Girl Service No Advanc...
rajveermohali2022
 

Dernier (20)

Hire 💕 8617697112 North Sikkim Call Girls Service Call Girls Agency
Hire 💕 8617697112 North Sikkim Call Girls Service Call Girls AgencyHire 💕 8617697112 North Sikkim Call Girls Service Call Girls Agency
Hire 💕 8617697112 North Sikkim Call Girls Service Call Girls Agency
 
Top Rated Pune Call Girls Pimpri Chinchwad ⟟ 6297143586 ⟟ Call Me For Genuin...
Top Rated  Pune Call Girls Pimpri Chinchwad ⟟ 6297143586 ⟟ Call Me For Genuin...Top Rated  Pune Call Girls Pimpri Chinchwad ⟟ 6297143586 ⟟ Call Me For Genuin...
Top Rated Pune Call Girls Pimpri Chinchwad ⟟ 6297143586 ⟟ Call Me For Genuin...
 
Hotel And Home Service Available Kolkata Call Girls South End Park ✔ 62971435...
Hotel And Home Service Available Kolkata Call Girls South End Park ✔ 62971435...Hotel And Home Service Available Kolkata Call Girls South End Park ✔ 62971435...
Hotel And Home Service Available Kolkata Call Girls South End Park ✔ 62971435...
 
College Call Girls Pune 8617697112 Short 1500 Night 6000 Best call girls Service
College Call Girls Pune 8617697112 Short 1500 Night 6000 Best call girls ServiceCollege Call Girls Pune 8617697112 Short 1500 Night 6000 Best call girls Service
College Call Girls Pune 8617697112 Short 1500 Night 6000 Best call girls Service
 
VIP Model Call Girls Vijayawada ( Pune ) Call ON 8005736733 Starting From 5K ...
VIP Model Call Girls Vijayawada ( Pune ) Call ON 8005736733 Starting From 5K ...VIP Model Call Girls Vijayawada ( Pune ) Call ON 8005736733 Starting From 5K ...
VIP Model Call Girls Vijayawada ( Pune ) Call ON 8005736733 Starting From 5K ...
 
2k Shot Call girls Laxmi Nagar Delhi 9205541914
2k Shot Call girls Laxmi Nagar Delhi 92055419142k Shot Call girls Laxmi Nagar Delhi 9205541914
2k Shot Call girls Laxmi Nagar Delhi 9205541914
 
Hotel And Home Service Available Kolkata Call Girls Lake Town ✔ 6297143586 ✔C...
Hotel And Home Service Available Kolkata Call Girls Lake Town ✔ 6297143586 ✔C...Hotel And Home Service Available Kolkata Call Girls Lake Town ✔ 6297143586 ✔C...
Hotel And Home Service Available Kolkata Call Girls Lake Town ✔ 6297143586 ✔C...
 
Call Girls Bhandara Just Call 8617697112 Top Class Call Girl Service Available
Call Girls Bhandara Just Call 8617697112 Top Class Call Girl Service AvailableCall Girls Bhandara Just Call 8617697112 Top Class Call Girl Service Available
Call Girls Bhandara Just Call 8617697112 Top Class Call Girl Service Available
 
Jodhpur Park ( Call Girls ) Kolkata ✔ 6297143586 ✔ Hot Model With Sexy Bhabi ...
Jodhpur Park ( Call Girls ) Kolkata ✔ 6297143586 ✔ Hot Model With Sexy Bhabi ...Jodhpur Park ( Call Girls ) Kolkata ✔ 6297143586 ✔ Hot Model With Sexy Bhabi ...
Jodhpur Park ( Call Girls ) Kolkata ✔ 6297143586 ✔ Hot Model With Sexy Bhabi ...
 
Almora call girls 📞 8617697112 At Low Cost Cash Payment Booking
Almora call girls 📞 8617697112 At Low Cost Cash Payment BookingAlmora call girls 📞 8617697112 At Low Cost Cash Payment Booking
Almora call girls 📞 8617697112 At Low Cost Cash Payment Booking
 
Call Girls Manjri Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Manjri Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Manjri Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Manjri Call Me 7737669865 Budget Friendly No Advance Booking
 
Zirakpur Call Girls👧 Book Now📱8146719683 📞👉Mohali Call Girl Service No Advanc...
Zirakpur Call Girls👧 Book Now📱8146719683 📞👉Mohali Call Girl Service No Advanc...Zirakpur Call Girls👧 Book Now📱8146719683 📞👉Mohali Call Girl Service No Advanc...
Zirakpur Call Girls👧 Book Now📱8146719683 📞👉Mohali Call Girl Service No Advanc...
 
📞 Contact Number 8617697112 VIP Ganderbal Call Girls
📞 Contact Number 8617697112 VIP Ganderbal Call Girls📞 Contact Number 8617697112 VIP Ganderbal Call Girls
📞 Contact Number 8617697112 VIP Ganderbal Call Girls
 
VIP Model Call Girls Budhwar Peth ( Pune ) Call ON 8005736733 Starting From 5...
VIP Model Call Girls Budhwar Peth ( Pune ) Call ON 8005736733 Starting From 5...VIP Model Call Girls Budhwar Peth ( Pune ) Call ON 8005736733 Starting From 5...
VIP Model Call Girls Budhwar Peth ( Pune ) Call ON 8005736733 Starting From 5...
 
❤Personal Whatsapp Number Mukteshwar Call Girls 8617697112 💦✅.
❤Personal Whatsapp Number Mukteshwar Call Girls 8617697112 💦✅.❤Personal Whatsapp Number Mukteshwar Call Girls 8617697112 💦✅.
❤Personal Whatsapp Number Mukteshwar Call Girls 8617697112 💦✅.
 
VIP Model Call Girls Koregaon Park ( Pune ) Call ON 8005736733 Starting From ...
VIP Model Call Girls Koregaon Park ( Pune ) Call ON 8005736733 Starting From ...VIP Model Call Girls Koregaon Park ( Pune ) Call ON 8005736733 Starting From ...
VIP Model Call Girls Koregaon Park ( Pune ) Call ON 8005736733 Starting From ...
 
Behala ( Call Girls ) Kolkata ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Ready ...
Behala ( Call Girls ) Kolkata ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Ready ...Behala ( Call Girls ) Kolkata ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Ready ...
Behala ( Call Girls ) Kolkata ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Ready ...
 
Thane West \ Escort Service in Mumbai - 450+ Call Girl Cash Payment 983332523...
Thane West \ Escort Service in Mumbai - 450+ Call Girl Cash Payment 983332523...Thane West \ Escort Service in Mumbai - 450+ Call Girl Cash Payment 983332523...
Thane West \ Escort Service in Mumbai - 450+ Call Girl Cash Payment 983332523...
 
Top Rated Kolkata Call Girls Dum Dum ⟟ 6297143586 ⟟ Call Me For Genuine Sex S...
Top Rated Kolkata Call Girls Dum Dum ⟟ 6297143586 ⟟ Call Me For Genuine Sex S...Top Rated Kolkata Call Girls Dum Dum ⟟ 6297143586 ⟟ Call Me For Genuine Sex S...
Top Rated Kolkata Call Girls Dum Dum ⟟ 6297143586 ⟟ Call Me For Genuine Sex S...
 
Verified Trusted Call Girls Tambaram Chennai ✔✔7427069034 Independent Chenna...
Verified Trusted Call Girls Tambaram Chennai ✔✔7427069034  Independent Chenna...Verified Trusted Call Girls Tambaram Chennai ✔✔7427069034  Independent Chenna...
Verified Trusted Call Girls Tambaram Chennai ✔✔7427069034 Independent Chenna...
 

Opportunities and methodological challenges of Big Data for official statisticsg data piet_daas_roma2

  • 1. Opportunities and methodological challenges of Big Data for official statistics Dr. Piet J.H. Daas Methodologist, Big Data research coördinator March 31, Rome
  • 2. Overview 2 • Big Data • Definition? • DGINS: Scheveningen Memorandum • Experiences at Statistics Netherlands • From ‘New data sources’ to ‘Big Data’ • Data driven approach (learning by doing) • Opportunities & challenges • Methodological & technical challenges • Skills, legal and other issues •With examples !
  • 3. Data, data everywhere!
  • 4. What is Big Data? Defining Big Data is not easy:
      – An attempt: “Data that are difficult to collect, store or process within the conventional systems of statistical organizations. Either, their volume, velocity, structure or variety requires the adoption of new statistical software processing techniques and/or IT infrastructure to enable cost-effective insights to be made.” (Virtual sprint paper)
      – More technical: “Big Data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.” (Wikipedia)
      – A user: “Data sources that are awkward to work with.”
      TIP: Big Data sources are NOT surveys and NOT administrative data
  • 5. DGINS: Scheveningen Memorandum
      1. Big Data represent new opportunities and challenges for Official Statistics.
      2. Develop an 'Official Statistics Big Data strategy' at national and EU-level.
      3. Recognize the implications of Big Data for legislation, especially with regard to data protection and personal rights.
      4. Several NSIs are currently initiating or considering different uses of Big Data. Momentum to share experiences and to collaborate.
      5. Recognize the necessary capabilities and skills to effectively explore Big Data.
      6. Acknowledge that the multidisciplinary character requires synergies and partnerships.
      7. The use of Big Data in the context of official statistics requires new developments in methodology, quality assessment and IT related issues.
      8. Agree on adopting an ESS action plan and roadmap by mid-2014.
  • 6. Experiences at Statistics Netherlands
      – Started as ‘New data sources for statistics’ in 2009
      – Several initiatives over the years:
          ‐ Internet as a data source
              • Collecting price data with web robots
              • Study the use of web job vacancies data
              • ‘Marktplaats’ data (Dutch eBay clone)
          ‐ Alternative means of collecting primary data
              • Use of smartphones
          ‐ Big Data (really large amounts of data)
              • Traffic loop detection data (road sensors)
              • Mobile phone data (location data)
              • Social media data (content and sentiment)
  • 8. What have we learned (so far)? I’ll discuss the most important ones:
      1) Types of ‘data’ in Big Data
      2) How to access and analyse large amounts of data
      3) How to deal with noisy and unstructured data
      4) How to deal with selectivity (and our own bias)
      5) How to go beyond correlation
      6) The need for people with the right skills and mind‐set
      7) Need to solve/deal with privacy and security issues
      8) Data management & costs
      We are slowly starting to get a grip on some of these topics
  • 9. 1) Types of data (figure labels: Secondary data, Primary data)
  • 10. 1) Types of data & events
      There are many different Big Data sources. An attempt to classify them (Virtual sprint paper):
      A) Human-sourced information (‘Social Networks’): social media messages, blogs, web searches
      B) Process-mediated data (‘Traditional Business Systems and Websites’): credit card, bank or on-line transactions, CDR, product prices, page-views
      C) Machine-generated data (‘Automated Systems’): road or climate sensors, satellite images, GPS, AIS
      Essentially, most of the data are event-based, and some of these events can be directly related to a user (e.g. the target population)
  • 11. 2) How to access and analyse large amounts of data
      – If you want to analyse Big Data
      – You need a lot of computer power!!
      – Or you need a lot of time!
      High Performance Computing expertise is essential!
  • 12. Our current equipment and more
      – We have:
          - Workstations with lots of memory (32-64 GB), fast disk drives (SSD, 512 GB) and a large hard drive (>= 1 TB)
          - A secure environment in which to access the data with those computers
          - A Big Data lab
          - The knowledge to load all the data into R or Python and analyse it
          - Followed a High Performance Computing training course
          - Realized that learning by doing is key! (?databases?)
      AND a Big Data source with no privacy and security issues so we can test all kinds of analysis, soft- and hardware (anyplace, anytime, anywhere)
          • Traffic loop data (road sensors)
  • 13. An example:
      – Processing of traffic loop data of 1 day
          - A total of ~100 million records (25 GB)
      Processing in R | Time needed | Speed-up
      First R-script | 6 hours | -
      Improved code | 30 min | 12
      Faster hardware | 10 min | 36
      (Java code) Faster hardware + preprocessed data | 2 min | 180
      Limited by I/O
      I/O limitation can be solved by:
      1) Input part by using a cluster (distributed computing)
      2) Output part by implementing a C++ write routine in R (20% faster)
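The speed-up factors on slide 13 follow from dividing the first run time (6 hours) by each later run time. A minimal Python sketch of that arithmetic is given here; the dictionary keys and variable names are illustrative only, and the record count uses the approximate figure of 100 million records per day from the slide.

```python
# Run times taken from slide 13, expressed in minutes.
BASELINE_MINUTES = 6 * 60          # first R-script: 6 hours
RECORDS_PER_DAY = 100_000_000      # approximate figure from the slide

run_minutes = {
    "First R-script": 360,
    "Improved code": 30,
    "Faster hardware": 10,
    "Faster hardware + preprocessed data": 2,
}

# Speed-up = baseline time / new time.
speed_ups = {name: BASELINE_MINUTES / minutes for name, minutes in run_minutes.items()}

# Rough throughput in records per second for each run.
throughput = {name: RECORDS_PER_DAY / (minutes * 60) for name, minutes in run_minutes.items()}

for name in run_minutes:
    print(f"{name}: speed-up {speed_ups[name]:.0f}x, ~{throughput[name]:,.0f} records/s")
```

The computed factors (12, 36 and 180) match the values reported on the slide.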
  • 14. All Dutch vehicles in September
  • 15. 3) How to deal with noisy and unstructured data
      – Big Data is often
          ‐ noisy, dirty
          ‐ redundant
          ‐ unstructured
              • e.g. texts, images
      – How to extract information from Big Data?
          ‐ In the best/most efficient way
  • 16. Example of noisy data: Road sensors
      Traffic loop data
          ‐ Each minute (24/7) the number of passing vehicles is counted in around 20,000 ‘loops’ in the Netherlands
              • Total and in different length classes
          ‐ Nice data source for transport and traffic statistics (and more)
              • A lot of data, around 100 million records a day
      (Figure: Locations)
  • 17. Total number of vehicles during the day (figure; x-axis: Time (hour))
  • 18. Correct for missing data: macro level
      Sliding window of 5 min. Impute missing data.
      Before: Total = ~295 million vehicles
      After: Total = ~330 million vehicles (+12%)
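The slide only states that a 5-minute sliding window is used to impute missing counts. The Python sketch below shows one plausible reading of that idea, not the method actually used at Statistics Netherlands: a missing minute is replaced by the mean of the observed counts inside a centred 5-minute window. The function name `impute_sliding_window` and the sample counts are hypothetical.

```python
from typing import List, Optional


def impute_sliding_window(counts: List[Optional[float]], window: int = 5) -> List[Optional[float]]:
    """Fill missing minute counts (None) with the mean of the observed
    values in a centred window; leave the gap if the window is empty."""
    half = window // 2
    result = list(counts)
    for i, value in enumerate(counts):
        if value is not None:
            continue
        lo = max(0, i - half)
        hi = min(len(counts), i + half + 1)
        # Only originally observed values are used, never earlier imputations.
        neighbours = [c for c in counts[lo:hi] if c is not None]
        if neighbours:
            result[i] = sum(neighbours) / len(neighbours)
    return result


# Hypothetical per-minute vehicle counts for one loop; None marks a missing minute.
minute_counts = [12, 14, None, 16, 18, None, None, None, None, None, 20]
imputed = impute_sliding_window(minute_counts)
print(imputed)
```

Gaps longer than the window stay partly unfilled in this sketch, which is why a longer outage would need a different approach.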
  • 19. Correct for missing data: micro level
      Recursive Bayesian estimator (<1 sec on GPGPU)
      (Figure axes: Time (min.) vs. Number of vehicles detected)
  • 20. 4) How to deal with selectivity
      – Big Data sources may be selective when
          - Only part of the population contributes to the data set
              • For example: mobile phone owners
          - The measurement mechanism is selective (e.g. non-random times or places)
              • For example: placing of road sensors on Dutch highways is not random
      – Many Big Data sources contain events
          - Population units may generate widely varying numbers of events
          - Attempt to associate events with units
      – Correcting for selectivity
          - Background characteristics (or features) are needed (linking with registers; profiling)
          - Use predictive modelling / machine learning to produce population estimates
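As an illustration of how background characteristics can help correct for selectivity, the sketch below applies simple post-stratification in Python. This is one standard weighting technique chosen for illustration; the slides do not say which correction method is used. The strata, values and population counts are made up, and `poststratified_mean` is a hypothetical helper.

```python
def poststratified_mean(observations, population_counts):
    """Estimate a population mean from selective data.

    observations: iterable of (stratum, value) pairs from the Big Data source.
    population_counts: dict mapping stratum -> known population size
    (e.g. taken from a register).
    """
    totals = {}
    counts = {}
    for stratum, value in observations:
        totals[stratum] = totals.get(stratum, 0.0) + value
        counts[stratum] = counts.get(stratum, 0) + 1
    missing = set(population_counts) - set(counts)
    if missing:
        raise ValueError(f"no observations for strata: {sorted(missing)}")
    population_size = sum(population_counts.values())
    # Weight each stratum mean by its share of the population.
    return sum(
        population_counts[s] * totals[s] / counts[s] for s in population_counts
    ) / population_size


# Made-up example: young people are over-represented in the source.
observations = [("young", 10), ("young", 10), ("young", 10), ("young", 10), ("old", 2)]
population_counts = {"young": 50, "old": 50}

naive_mean = sum(v for _, v in observations) / len(observations)
corrected_mean = poststratified_mean(observations, population_counts)
print(naive_mean, corrected_mean)
```

The naive mean is pulled towards the over-represented group, while the weighted estimate reflects the known population shares.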
  • 22. Selectivity illustrated
      Selectivity of Big Data could potentially be less problematic than high non-response rates of surveys.
          - There is just more data for your model!
      The black line shows the relationship between the target and auxiliary variable in the target population. The red lines show the estimated relationship according to each of the three sources (with 95% confidence intervals).
      Here we assume units with auxiliary variables are available!
  • 23. 5) How to go beyond correlation
      – You will very likely use correlation to check Big Data findings with those in other (survey) data
      – When correlation is high:
          1) Try falsifying it first (is it coincidental?); correlation ≠ causation
          2) If this fails, you may have found something interesting!
          3) Perform additional analysis (look for causality): cointegration, Granger causality, time‐series approach, etc.
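A first, informal check on which series leads is to compare correlations at different lags. The Python sketch below does this with a plain Pearson coefficient; it is a much simpler check than a cointegration or Granger causality test and does not replace them. The two series are invented, and `pearson` and `lagged_correlation` are hypothetical helper names.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(leading, following, lag):
    """Correlate leading[t] with following[t + lag] for a lag >= 0."""
    if lag == 0:
        return pearson(leading, following)
    return pearson(leading[:-lag], following[lag:])


# Invented series: 'following' repeats 'leading' one period later.
leading = [1, 3, 2, 5, 4, 6, 8, 7]
following = [0, 1, 3, 2, 5, 4, 6, 8]

r_lag0 = lagged_correlation(leading, following, 0)
r_lag1 = lagged_correlation(leading, following, 1)
print(round(r_lag0, 2), round(r_lag1, 2))
```

A clearly higher correlation at a positive lag is only a hint that one series follows the other; it still needs to be confirmed with the formal tests listed on the slide.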
  • 24. Example: Sentiment in social media (day/week/month)
  • 26. Platform specific results
      Table 1. Social media messages properties for various platforms and their correlation with consumer confidence
      Social media platform | Number of social media messages (1) | Number of messages as percentage of total (%) | Correlation coefficient of monthly sentiment index and consumer confidence, r (2)
      All platforms combined | 3,153,002,327 | 100 | 0.75 / 0.78
      Facebook | 334,854,088 | 10.6 | 0.81* / 0.85*
      Twitter | 2,526,481,479 | 80.1 | 0.68 / 0.70
      Hyves | 45,182,025 | 1.4 | 0.50 / 0.58
      News sites | 56,027,686 | 1.8 | 0.37 / 0.26
      Blogs | 48,600,987 | 1.5 | 0.25 / 0.22
      Google+ | 644,039 | 0.02 | -0.04 / -0.09
      Linkedin | 565,811 | 0.02 | -0.23 / -0.25
      Youtube | 5,661,274 | 0.2 | -0.37 / -0.41
      Forums | 134,98,938 | 4.3 | -0.45 / -0.49
      (1) Period covered: June 2010 until November 2013
      (2) Confirmed by visually inspecting scatterplots and additional checks (see text)
      * cointegrated
      Granger causality reveals that Consumer Confidence precedes Facebook sentiment! (p-value < 0.001)
  • 27. A schematic view
      (Figure: timeline of the previous month and the current month, each divided into Day 1-7, Day 8-14, Day 15-21 and Day 22-28, showing Consumer Confidence, its publication date (~20th) and Social media sentiment)
  • 28. Platform specific results (2)
      More detailed studies revealed a 1 week delay between both! Consumer confidence comes first, social media sentiment follows.
      (Table 1 is repeated from slide 26.)
  • 29. 6) People and skills needed
      For Big Data studies you need:
      – People with an open mind‐set that do not see all problems a priori in terms of sampling theory
      – People with programming skills and IT‐affinity
      – People with a data‐driven, pragmatic attitude (data explorers, ‘practitioners’)
          ‐ You need Data scientists!
  • 30. Data science skills ‘landscape’: Sexy Skills of Data Geeks
      1) Statistics - traditional analysis you're used to thinking about
      2) Data ‘munging’ - parsing, scraping, and formatting data
      3) Visualization - graphs, tools, etc.
      4) High Performance Computing knowledge
  • 31. People that think outside the ‘box’
  • 32. 7) Privacy and security issues
      – The Dutch privacy and security law allows the study of privacy sensitive data for scientific and statistical research
      – Of course, appropriate measures always need to be taken
          • Prior to new research studies, check privacy sensitivity of data
          • In case of privacy sensitive data:
              • Try to anonymize micro data or use aggregates
              • Use secure environment: workstations in Big Data lab
      – Legal issues that enable the use of Big Data for official statistics production are currently being looked at
          - There is Big Data that can be considered ‘Administrative data’, i.e. Big Data that is managed by a (semi-)governmentally funded organisation
  • 33. Example: Mobile phones
      Mobile phone activity as a data source
      – Nearly every person in the Netherlands has a mobile phone
          - Usually on them and almost always switched on!
          - Many people are very active during the day
      – Can data of mobile phones be used for statistics?
          - Travel behaviour (of active phones)
          - ‘Day time population’ (of active phones)
          - Tourism (new phones that register to network)
      – Data of a single mobile company was used
          - Hourly aggregates per area (only when > 15 events)
          - Especially important for roaming data (foreign visitors)
  • 34. ‘Day time population’
      – Hourly changes of mobile phone activity
      – 7 & 8 May 2013
      – Per area distinguished
      – Only data for areas with > 15 events per hour
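The rule that only area-hour cells with more than 15 events are delivered can be sketched in a few lines of Python. The event format (area, hour) and the function name `hourly_aggregates` are assumptions made for illustration; the slides do not describe how the mobile company actually implements the aggregation.

```python
from collections import Counter

MIN_EVENTS = 15  # a cell is kept only when it holds more than 15 events


def hourly_aggregates(events, min_events=MIN_EVENTS):
    """Count events per (area, hour) and drop cells at or below the threshold."""
    counts = Counter((area, hour) for area, hour in events)
    return {cell: n for cell, n in counts.items() if n > min_events}


# Made-up events: 20 in area A at 9h, exactly 15 in area A at 10h, 3 in area B at 9h.
events = [("A", 9)] * 20 + [("A", 10)] * 15 + [("B", 9)] * 3
published = hourly_aggregates(events)
print(published)
```

Cells with exactly 15 events are suppressed as well, because the slide states the condition as strictly greater than 15.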
  • 35. Tourism: Roaming during European league final (map legend: Hardly any, Low, Medium, High, Very high)
  • 36. 8) Costs and data management
      – Costs
          ‐ In the Netherlands we don’t pay for administrative data.
          ‐ How about Big Data?
              • We currently pay for social media (access) and mobile phone data (extra processing efforts)
      – Data management
          ‐ Who owns the data? Stability of delivery/source
          ‐ Cope with the huge volume
              • Run queries in database of data source holder
              • Collect and process it as data stream
              • Bulk processing
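Processing the data as a stream means updating aggregates record by record instead of loading the full volume into memory. The Python sketch below illustrates this with a small in-memory CSV standing in for the stream; the record layout (loop id, minute, vehicle count) and the name `totals_from_stream` are assumptions, not the format of the actual traffic loop data.

```python
import csv
import io
from collections import defaultdict


def totals_from_stream(lines):
    """Accumulate vehicle totals per loop while reading one record at a time."""
    totals = defaultdict(int)
    for loop_id, _minute, vehicles in csv.reader(lines):
        totals[loop_id] += int(vehicles)
    return dict(totals)


# A tiny in-memory stand-in for a data stream: loop id, minute, vehicle count.
stream = io.StringIO("L1,0,12\nL2,0,7\nL1,1,15\n")
totals = totals_from_stream(stream)
print(totals)
```

Because only the running totals are kept, memory use depends on the number of loops and not on the number of records.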
  • 38. Thank you for your attention! @pietdaas