Big Data as a
data source for
official statistics

Piet Daas, Marco Puts, Bart Buelens and Paul van den Hurk
Statistics Netherlands


                            Big Data Target Conference, April 4, Groningen
Overview

• Data sources and statistics
     • More & more data becomes available
     • Effect on statistics production
• How we study Big Data: 2 examples
     • Traffic loop detection data
     • Social media messages




Introduction




  “Statistics Netherlands has produced
  about 5000 official publications and
  tables in 2012”
            For this we need DATA




Data sources for official statistics




         Primary data                              Secondary data



                                                  Data from ‘others’
       Our own surveys                             - Administrative sources
                                                   - ‘New’ data sources

Statistics Netherlands law

• “Statistics Netherlands aims to reduce the
  administrative burden for companies and the
  public as much as possible”
  • By (re-)using existing administrative registrations of both
    government and government-funded organizations.
  • And by studying potential new sources of information




• Data, data everywhere!





Statistics Netherlands and Data
•    Data is generated in increasing amounts and at increasing frequencies:
    •      From ‘data scarcity’ (sample surveys) to ‘data abundance’ (administrative
           data & Big Data)
           •   Ever increasing amounts of data need to be checked, processed and
               analyzed
           •   More sources of information become available
           •   Opportunities to produce statistics faster (‘real-time statistics’)
    •      Need for new methods and tools
           1. Methods to quickly uncover information from the massive amounts of data
              available, such as visualisation methods and data-, text- and stream-mining
              techniques (‘making Big Data small’), and high-performance computing
           2. Methods capable of integrating the information into the statistical process,
              e.g. linking at massive scale, macro/meso-integration, estimation methods
              suited for large datasets
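
A minimal sketch of ‘making Big Data small’ in a streaming fashion: per-minute records are reduced to hourly totals in a single pass, so the full dataset never has to be held in memory. The record layout (sensor id, minute of day, vehicle count) and the names `hourly_totals` and `example_stream` are assumptions for illustration, not the format or method used by Statistics Netherlands.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record layout: (sensor_id, minute_of_day, vehicle_count).
Record = Tuple[str, int, int]


def hourly_totals(records: Iterable[Record]) -> Dict[int, int]:
    """Reduce a stream of per-minute counts to totals per hour of the day."""
    totals: Dict[int, int] = defaultdict(int)
    for _sensor_id, minute_of_day, count in records:
        # Integer division maps minutes 0..59 to hour 0, 60..119 to hour 1, etc.
        totals[minute_of_day // 60] += count
    return dict(totals)


def example_stream() -> Iterator[Record]:
    """Tiny made-up stream; real input would be read record by record from disk."""
    yield ("loop-A", 0, 12)
    yield ("loop-B", 0, 7)
    yield ("loop-A", 59, 5)
    yield ("loop-A", 60, 9)


print(hourly_totals(example_stream()))
```

Because only one running total per hour is kept, memory use stays constant no matter how many records pass through.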


2 Big Data case studies

Research findings on the study of Big Data sources
from a statistics point of view

     1. Traffic loop detection data
               80 million records/day, studied 90 days so far,
               number of vehicles detected each minute

     2. Dutch social media messages
               1 to 2 million public messages/day, studied up to 2 billion
               records, content and sentiment


1. Traffic loop detection data

• Traffic ‘loops’
   • Every minute (24/7) the number of passing
     vehicles is counted by >10,000 road sensors
     & cameras in the Netherlands
      • Total vehicles and in different length classes

   • Interesting source to produce traffic and
     transport statistics (and more)
       • Huge amounts of data, about 100 million
         records a day
                                                         Locations


Number of detected vehicles on a single day




By all loops                                     Total = ~ 295 million

Traffic loop detection activity (only first 10 min.)




Correct for missing data
 • ‘Corrected’ data (for blocks of 5 min)

            Before                                 After




                Total = ~ 295 million            Total = ~ 330 million (+ 12%)
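
The slides do not describe how the correction works, so the following is only a sketch of one plausible block-wise approach: within each 5-minute block, missing minute counts are replaced by the mean of the minutes that were observed in that block. The function name `fill_block_gaps` and the use of `None` for a missing minute are assumptions for illustration.

```python
from typing import List, Optional


def fill_block_gaps(counts: List[Optional[float]], block: int = 5) -> List[Optional[float]]:
    """Fill missing per-minute counts with the mean of their 5-minute block.

    Assumption: this is an illustrative imputation rule, not the method
    actually used for the corrected totals on the slide.
    """
    filled = list(counts)
    for start in range(0, len(counts), block):
        chunk = counts[start:start + block]
        observed = [c for c in chunk if c is not None]
        if not observed:
            continue  # no observation in this block, so the gap is left as is
        block_mean = sum(observed) / len(observed)
        for i in range(start, start + len(chunk)):
            if filled[i] is None:
                filled[i] = block_mean
    return filled


minute_counts = [10, None, 14, 12, None, None, None, None, None, None]
print(fill_block_gaps(minute_counts))
```

Blocks without any observation are left untouched here; filling those would need information from neighbouring blocks or other days.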

Total vehicles during the day (snapshots)




For different vehicle lengths
      1 category               3 categories       5 categories

      Total                    Total              Total
                               <= 5.6 m           > 1.85 & <= 2.4 m
                               > 5.6 & <= 12.2 m  > 2.4 & <= 5.6 m
                               > 12.2 m           > 5.6 & <= 11.5 m
                                                  > 11.5 & <= 12.2 m
                                                  > 12.2 m


         Small vehicles <= 5.6 m
         Medium sized vehicles > 5.6 m & <= 12.2 m
         Large vehicles > 12.2 m
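
The three length classes above can be written as a small classifier. The thresholds (5.6 m and 12.2 m) come from the slide; the function name `length_class` and the class labels are chosen here for illustration.

```python
def length_class(length_m: float) -> str:
    """Map a vehicle length in metres to the three classes on the slide."""
    if length_m <= 5.6:
        return "small"
    if length_m <= 12.2:
        return "medium"
    return "large"


for length in (4.2, 5.6, 8.0, 12.2, 18.0):
    print(length, length_class(length))
```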



Small vehicles




                                                 ~75% of total

Small & medium vehicles




Small, medium & large vehicles




Volatile behaviour at the micro-level




2. Social media messages

• The Dutch are very active on social media platforms
     • Almost always carried along and nearly always switched on
          • More and more people have a smartphone!

• Potential information source for:
     • Which topics are current:
          • Number of messages and the sentiment about them


     • Usable as a measuring instrument for:
          • .
                                                     Map by Eric Fischer (via Fast Company)



2. Social media messages
  • The Dutch are very active on social media platforms
    • Potential information source for:
            • Topics discussed and the sentiment about these topics (quickly
              available!), and probably more?
            • Investigate the data to find out what its potential uses are


2a. Content:
    - Collected Dutch Twitter messages for study: ‘selection’ of 12 million

2b. Sentiment
    - Sentiment in Dutch social media messages: ‘all’ ~2 billion



Social media: Dutch Twitter topics

[Figure: share of topics in 12 million Dutch Twitter messages. Topic labels
were lost in extraction; the shares shown are 46%, 10%, 7%, 7%, 5%, 3%, 3% and 3%.]

Sentiment in Social media
• Access to the Coosto database
     • > 2 billion publicly available messages
          • Twitter, Facebook, Hyves, web forums, blogs etc.
     • Sentiment of each message
          • Positive, negative or neutral
     • Interesting finding
          • The so-called ‘mood of the nation’ can be determined and compared
            to the Consumer confidence figures of Statistics Netherlands



Consumer confidence, survey data

[Figure: sentiment towards the economic climate, plotted as
(pos – neg) as % of total; ~1000 respondents/month]
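
The balance ‘(pos – neg) as % of total’ can be computed as in the sketch below. It assumes that ‘total’ counts positive, negative and neutral answers together; the slide does not spell this out. The function name `sentiment_balance` and the example counts are made up for illustration.

```python
def sentiment_balance(positive: int, negative: int, neutral: int) -> float:
    """Return (positive - negative) as a percentage of all answers.

    Assumption: neutral answers are part of the total.
    """
    total = positive + negative + neutral
    if total == 0:
        raise ValueError("at least one answer is needed")
    return 100.0 * (positive - negative) / total


# Made-up month with 1000 respondents.
print(sentiment_balance(positive=300, negative=450, neutral=250))
```

The same function applies to message-level sentiment labels (positive, negative or neutral), which is what makes the survey series and the social media series comparable.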

Final remarks: Big Data and statistics
 •   Preparing Big data for statistics is time consuming
      • Exploration phase takes a lot of time
      • Try to reduce amount of data without losing information (‘making big data
        small’, noise reduction)
      • Risk: ‘garbage in’ → ‘garbage statistics out’
 •   Traditional approach does not suffice
      • Big data sources are definitely not ‘large’ sample surveys or admin data
      • Often a selective, but large, part of the ‘population’ is included
      • Events are registered, not units!
      • Be careful when using ‘traditional’ statistical analysis (everything is significant!)
 •   More need for:
      • Visualisation methods (to rapidly gain insight)
      • Methods & models specific for large datasets (fast and ‘robust’)
      • Learn from ‘computational statistics’ & (try to) use dedicated hardware
      • Beware of privacy issues!
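
A small sketch of why ‘everything is significant’ with very large datasets: for a fixed, practically negligible difference between two group means, the z statistic grows with the square root of the sample size. The numbers (a difference of 0.01 with a standard deviation of 1) and the function names are illustrative assumptions, and the test is a plain two-sample z-test with equal group sizes.

```python
import math


def two_sample_z(diff: float, sd: float, n: int) -> float:
    """z statistic for a difference in means, two groups of size n, known sd."""
    return diff / (sd * math.sqrt(2.0 / n))


def two_sided_p(z: float) -> float:
    """Two-sided p-value of a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


for n in (1_000, 100_000_000):
    z = two_sample_z(diff=0.01, sd=1.0, n=n)
    print(n, "significant at 5%:", two_sided_p(z) < 0.05)
```

With 1,000 units per group the tiny difference is not significant; with 100 million it is, although its practical relevance has not changed.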



The future of Statistics Netherlands?

