Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong Value-Adding Proposition
5 Aug 2014
by Patrick Hadley, Australian Bureau of Statistics at the Australian CIO Summit 2014
Australian CIO Summit 2014
28 – 30 July 2014
Patrick Hadley
Chief Information Officer
Australian Bureau of Statistics
Not Another ‘Big Data’ Presentation
(‘V’ is not the only letter in the alphabet!)
The promise
“Big data is at the foundation of all the megatrends that are happening today,
from social to mobile to the cloud to gaming.” Chris Lynch, ex-Vertica CEO
“Big Data is a tidal wave, which in the next decade will create consumer –
and producer – value in almost every major sector of the economy” Philip Evans
“… a tremendous wave of innovation, productivity and growth … all driven by
big data” McKinsey
“Big Data: A Revolution that Will Transform how We Live, Work, and Think”
Viktor Mayer-Schönberger and Kenneth Cukier. 2013.
Big data is like teenage sex: everyone talks about it, nobody really
knows how to do it, everyone thinks everyone else is doing it, so
everyone claims they are doing it...
Dan Ariely, 2013
In God we trust; all others must bring data.
W.E. Deming
Or, the reality …
Agenda
• What is Big Data (3/4/5/6 v’s)
• Sources of Data
• Data as an asset
• Open Data
• Opportunities, applications, benefits
• Data Management
• Data Analytics; technologies
• Security
• Privacy
• Skills and capabilities
• … and so on
Today …
• The use of Big Data in official statistics
• ABS initiatives, experiences and capabilities
• Learnings: towards a strong value-adding proposition
Big Data in Official Statistics
The vision …
A richer, more dynamic statistical picture of Australia
Opportunity: reduce costs; improve quality
Sources of Data
• digital descriptions of the physical environment
• sensors and other devices
• communications networks
• individual behaviour and information
• digitisation of commerce and supply chains
High potential data sources
• Telecom
• Utilities
• Retailers
• Financial sector
• Satellite
• Other
Example: Telecom data applications
• small area population estimates
• service populations
• travel patterns
• seasonal population movements
• event populations
• internet use …
How do we:
o identify characteristics of handset owners?
o turn handset counts into people?
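One naive way to frame the second question is as a scaling problem. The sketch below is illustrative only and is not a method attributed to the ABS: the function name `estimate_people` and its two adjustment factors (the carrier's market share and the average number of handsets per person) are assumptions, and the numbers are made up.

```python
def estimate_people(handset_count, carrier_market_share, handsets_per_person):
    """Scale one carrier's handset count up to an estimate of people present.

    Hypothetical helper: both adjustment factors are assumed inputs that
    would have to come from some other source (e.g. a survey).
    """
    if not 0 < carrier_market_share <= 1:
        raise ValueError("carrier_market_share must be in (0, 1]")
    if handsets_per_person <= 0:
        raise ValueError("handsets_per_person must be positive")
    # Step 1: inflate to handsets across all carriers.
    all_handsets = handset_count / carrier_market_share
    # Step 2: deflate for people who carry more than one handset.
    return all_handsets / handsets_per_person


# Made-up example: 12,000 handsets seen by a carrier with a 40% share,
# with an assumed average of 1.2 handsets per person.
estimate = estimate_people(12_000, 0.40, 1.2)
print(round(estimate))
```

A single pair of factors hides exactly the difficulties the slide raises: market share and handset ownership vary by area and by the characteristics of the owner, which is why the first question matters.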
Initiate exploratory R&D
Targeted streams of investigation
• Use of satellite imagery to determine land utilisation
• Use of integrated demographic data for small area modelling of unemployment
• Use of mobile device messaging records for real time estimation of service populations
Progress the methodological framework and trial new technology approaches
• Machine learning
• Multidimensional data visualisation
• Distributed computing
• Open linked data
Big Data challenges
• Data quality
• Data volatility and stability
• Data representativeness
• Data dimensionality
• Statistical modelling and inference
Data quality
Big Data sets/streams are generally noisy and often
unstructured – they need to undergo a non-trivial filtering and
cleaning process before they can be used
Balancing the complexity of the cleaning process with the
information value of the obtained results is a significant issue
What methods can be used for noise reduction?
How do we deal with missing data?
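As a minimal sketch of the two questions above, the following fills gaps by linear interpolation and then suppresses spikes with a running median. These are two common, simple choices picked for illustration, not techniques named in the presentation; the function names and the sample readings are hypothetical.

```python
from statistics import median


def fill_missing(values):
    """Linearly interpolate None gaps; gaps at the edges copy the nearest value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no observed values to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def median_filter(values, window=3):
    """Replace each point with the median of a centred window.

    The window is truncated at both ends of the series.
    """
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


# Made-up sensor readings with gaps (None) and one obvious spike (95.0).
readings = [10.0, None, 12.0, 95.0, 13.0, None, None, 16.0]
filled = fill_missing(readings)
smoothed = median_filter(filled)
print(filled)
print(smoothed)
```

Even this toy case shows the trade-off raised on the slide: a wider window removes more noise but also flattens genuine short-lived changes.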
Data volatility and stability
Streaming data may fluctuate over short time frames
Data sources themselves may change or disappear
What becomes of time series in a world where data streams
and sources are transient?
Data representativeness
How representative are the data from emerging Big Data
sources of the phenomena we are trying to measure?
How do we determine whether there are hidden biases?
What methods can be used to reduce the volume of data while
retaining the information value of the data and statistical
validity of the analysis?
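One standard answer to the volume question is reservoir sampling, which keeps a uniform random sample of fixed size from a stream whose length is not known in advance. The sketch below uses the classic "Algorithm R"; it is offered as an illustration, not as the approach taken in the presentation, and the simulated stream is made up.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of k items from an iterable (Algorithm R)."""
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the n-th item with probability k/n, evicting a random slot.
            j = rng.randrange(n)
            if j < k:
                sample[j] = item
    return sample


# Simulated stream of 100,000 values, reduced to a sample of 1,000.
source = random.Random(42)
stream = (source.gauss(50.0, 10.0) for _ in range(100_000))
sample = reservoir_sample(stream, k=1_000, rng=random.Random(7))
print(len(sample), sum(sample) / len(sample))
```

Sampling this way preserves the statistical properties of the stream itself, but it cannot correct hidden biases in how the source relates to the population of interest.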
Data dimensionality
Dimensionality is a significant and challenging aspect of “bigness”
Dimension has an impact on:
• Storage of data
• Processing and analysis of data
Existing storage and computational paradigms fail badly
Statistical modelling and inference
How can population characteristics be determined?
What is the population? In many cases this is not known (e.g.
Twitter)
Can we draw a sample and calculate descriptive statistics?
How do we avoid apophenia (seeing meaningful patterns and
connections where none exist)?
The number of spurious correlations grows with the number of
variables
“To understand is to perceive patterns.” – Isaiah Berlin
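The point about correlations growing with the number of variables can be checked with a small simulation. The sketch below correlates one random target with many independent random variables and counts how many look "significant" purely by chance. The sample size of 30 and the threshold of 0.36 (roughly the 5% two-sided cut-off at that sample size) are assumptions chosen for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def count_spurious(num_vars, num_obs=30, threshold=0.36, seed=0):
    """Count pure-noise variables whose |r| with a noise target exceeds threshold."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(num_obs)]
    hits = 0
    for _ in range(num_vars):
        noise = [rng.gauss(0, 1) for _ in range(num_obs)]
        if abs(pearson(target, noise)) > threshold:
            hits += 1
    return hits


for num_vars in (10, 100, 1000):
    print(num_vars, count_spurious(num_vars))
```

None of the variables is related to the target, so every hit is a false pattern; with a fixed threshold the count of such hits rises roughly in proportion to the number of variables examined.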
From ‘V’ (what) to ‘C’ (how)
‘What’ has changed about data?
Vs: Volume, Velocity, Variety, Veracity, Volatility
‘How’ will we change?
Cs: Creating, Computing, Comprehending, Competing, Collaborating
Big Data ‘C’s and the ABS - CREATING
The world is CREATING data like never before, and every
individual, household and business we interact with will change
how it creates data:
• The Internet of Things (M2M) becomes the ‘Internet of
Everything’
• Sometimes called the 4 internets: people, things, information and
places are all network addressable, and most have data
producing/collecting/transmitting capability
Big Data ‘C’s and the ABS - COMPUTING
COMPUTING data like never before. Some examples:
• solutions that emerged from Web-scale problems such as search engines,
e.g. distributed processing frameworks and key-value stores (Hadoop, NoSQL databases)
• advanced computation algorithms and approaches become
‘popularised’, e.g. machine learning approaches, automated
visualisation and explanation systems, data mining/discovery,
semantic (knowledge) representation and reasoning systems
requiring ‘search’
• statistical analysis-as-a-service, e.g. auto-coding, confidentiality,
time series analysis, etc.
• distributed/parallel computation for low-cost multi-core, multi-socket,
multi-computer and in-memory computation technologies
• embedded processors, sensors/RFIDs/GPS/SIM
• the ‘logical data warehouse’
Big Data ‘C’s and the ABS - COMPREHENDING
COMPREHENDING/CONSUMING data requires new tools in the ABS kit bag:
• tables – static and dynamically defined by the data consumer (ABS.stat, REEM Table
Builder) in standard XML formats like SDMX
• visualisation – for internal ABS insight, for our ‘retail’ dissemination, and ‘smart’ insight
where software suggests the best way to see data: ‘telling the story’
• narrative – table to text production (auto-produce media releases and part of main
features)
• voice – text to speech to read narrative and data for accessibility; speech to text for
NIRS analysis
• semantic data outputs in OWL/RDF
• hybrid of the above – to add value to information and enhance comprehension for
ABS data consumers
• data streams – data-as-a-service for M2M (the ABS public Web services library),
could be called ‘the embedded ABS’
and all this with adaptive/responsive design for multiple end-point device types!
Big Data ‘C’s and the ABS - COMPETING
COMPETING with data, to obtain it and use it for competitive
advantage
• In some subject-matter areas there is more competition. Who
can make a statistical index? Anyone with a spreadsheet
• Who else wants to be influential in and/or monetise statistics?
• Everyone else starts to understand INFONOMICS
• More ‘agent’ data sources for ABS, as we may not have the
capability to collect (full) unit record ‘big data’?
Big Data ‘C’s and the ABS - COLLABORATING
In ABS
In Government
In Academia
Across the international statistical community
ABS Capabilities, expertise
• collect and process large quantities of data
• data ‘cleansing’
• data standards and framework
• data integration
• methodological techniques
• strong analytical capability
• sophisticated web based dissemination system
• data quality framework
ABS Big Data Challenges
Business Benefit
Validity of Statistical Inference
Privacy and Public Trust
Data Integrity
Data Ownership and Access
Computational Efficacy
Technology Infrastructure
(Source: “Big data and the ABS – from ideas to action”, ABS MM paper, Oct 2013)
Summary - considerations
• Value:
o what’s the proposition?
o what’s the question?
• Strategy: plan, investments
• Data sources & acquisition
• Eyes open – data challenges
• Build capabilities: V’s to C’s