Data Science Innovations : Democratisation of Data and Data Science

Innovations in Data Science:
Systems of Insight
suresh.sood@uts.edu.au
linkedin.com/in/sureshsood
@soody

Areas for Conversation
Data Science
Data Science Innovation
Democratisation of big data
Gartner & Forrester Trends
Systems of Insight

Vignettes in the two-step arrival of the internet of things and its reshaping of marketing management’s service-dominant logic
Woodside & Sood
Journal of Marketing Management, Volume 33, 2017, Issue 1-2: The Internet of Things (IoT) and Marketing: The State of Play, Future Trends and the Implications for Marketing

Statistics, Data Mining or Data Science?
• Statistics
– precise deterministic causal analysis over precisely collected data
• Data Mining
– deterministic causal analysis over carefully sampled, re-purposed data
• Data Science
– trending/correlation analysis over existing data using the bulk of the population, i.e. big data
– extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and hypothesis testing
Adapted from: NIST Big Data taxonomy draft report
(see http://bigdatawg.nist.gov/show_InputDoc.php)

Useful References Big Data
• NIST Big Data Interoperability Framework (NBDIF) V1.0 Final Version (September 2015)
Big Data Definitions: http://dx.doi.org/10.6028/NIST.SP.1500-1
Big Data Taxonomies: http://dx.doi.org/10.6028/NIST.SP.1500-2
Big Data Use Cases and Requirements: http://dx.doi.org/10.6028/NIST.SP.1500-3
Big Data Security and Privacy: http://dx.doi.org/10.6028/NIST.SP.1500-4
Big Data Architecture White Paper Survey: http://dx.doi.org/10.6028/NIST.SP.1500-5
Big Data Reference Architecture: http://dx.doi.org/10.6028/NIST.SP.1500-6
Big Data Standards Roadmap: http://dx.doi.org/10.6028/NIST.SP.1500-7
• Apache Spark 2.1.0 Documentation
Machine Learning Library (MLlib) Guide http://spark.apache.org/docs/latest/ml-guide.html
GraphX Programming Guide http://spark.apache.org/docs/latest/graphx-programming-guide.html
SparkR (R on Spark) http://spark.apache.org/docs/latest/sparkr.html#sparkdataframe
Spark SQL, DataFrames and Datasets Guide http://spark.apache.org/docs/latest/sql-programming-guide.html

Data Science Innovation
Data science innovation is something an
organization has not done before or even
something nobody anywhere has done before. A
data science innovation focuses on discovering
and using new or untraditional data sources to
solve new problems.
Adapted from:
Franks, B. (2012) Taming the Big Data Tidal Wave, p. 255, John Wiley & Sons

Variety of Data Types & Big Data Challenge
1. Astronomical
2. Documents
3. Earthquake
4. Email
5. Environmental sensors
6. Fingerprints
7. Health (personal) Images
8. Graph data (social network)
9. Location
10. Marine
11. Particle accelerator
12. Satellite
13. Scanned survey data
14.Sound
15.Text
16.Transactions
17.Video
Big Data consists of extensive datasets primarily in the characteristics of
volume, variety, velocity, and/or variability that require a scalable
architecture for efficient storage, manipulation, and analysis.
Computational portability is the movement of the computation to the location of the data.

Internet of Things “trillion sensors”
Source: www.tsensorssummit.org

• The data collected in a single day would take nearly two million years to play back on an MP3 player
• Generates enough raw data to fill 15 million 64GB iPods every day
• The central computer has the processing power of about one hundred million PCs
• Uses enough optical fiber linking up all the radio telescopes to wrap twice around the Earth
• The dishes, when fully operational, will produce 10 times the global internet traffic as of 2013
• The supercomputer will perform 10^18 operations per second (equivalent to the number of stars in three million Milky Way galaxies) in order to process all the data produced
• Sensitivity to detect an airport radar on a planet 50 light years away
• Thousands of antennas with a combined collecting area of 1,000,000 square meters (1 sq km)
• Previous mapping of the Centaurus A galaxy took a team 12,000 hours of observations and several years; SKA ETA: 5 minutes!
To the scientists involved, however, the SKA is no testbed, it’s a transformative instrument which,
according to Luijten, will lead to “fundamental discoveries of how life and planets and matter all came
into existence. As a scientist, this is a once in a lifetime opportunity.”
Sources: http://bit.ly/amazin-facts & http://bit.ly/astro-ska
Galileo
Square Kilometer Array Construction
(SKA1 - 2018-23; SKA2 - 2023-30)
Centaurus A

New Sources of Information (Big data) : Social Media + Internet of Things  Innovations
Gridded Data Sources

The following BigQuery query (note that the wildcard on "TAX_WEAPONS_SUICIDE_" catches suicide vests, suicide bombers, suicide bombings, suicide jackets, and so on):
SELECT DATE, DocumentIdentifier, SourceCommonName, V2Themes, V2Locations, V2Tone, SharingImage, TranslationInfo
FROM [gdeltv2.gkg]
WHERE (V2Themes LIKE '%TAX_TERROR_GROUP_ISLAMIC_STATE%'
    OR V2Themes LIKE '%TAX_TERROR_GROUP_ISIL%'
    OR V2Themes LIKE '%TAX_TERROR_GROUP_ISIS%'
    OR V2Themes LIKE '%TAX_TERROR_GROUP_DAASH%')
  AND (V2Themes LIKE '%TERROR%TERROR%'
    OR V2Themes LIKE '%SUICIDE_ATTACK%'
    OR V2Themes LIKE '%TAX_WEAPONS_SUICIDE_%')
The GDELT Project pushes the boundaries of “big data,” weighing in at over a quarter-billion rows with 59 fields for each record,
spanning the geography of the entire planet, and covering a time horizon of more than 35 years. The GDELT Project is the largest
open-access database on human society in existence. Its archives contain nearly 400M latitude/longitude geographic coordinates
spanning over 12,900 days, making it one of the largest open-access spatio-temporal datasets as well.
GDELT + BigQuery = Query The Planet

Oil reserves shipment monitoring
Ras Tanura Najmah compound, Saudi Arabia
Source: http://www.skyboximaging.com/blog/monitoring-oil-reserves-from-space

https://nodexl.codeplex.com/

Key Network Measures
• Degree Centrality
• Betweenness Centrality
• Closeness Centrality
• Eigenvector Centrality
Network diagram: krackkite.##h (modified labels)
Diagram labels: Connector (hub); Diana’s Clique; Broker; Boundary spanners; Contractor ? Vendor
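
The four measures listed above can be sketched in plain Python. This is a minimal illustration, not the NodeXL implementation: the edge list is an assumption (the commonly published 10-node Krackhardt kite, with nodes numbered 0 to 9 rather than the slide's modified labels), and all function names are hypothetical. Betweenness uses Brandes' accumulation and eigenvector centrality uses simple power iteration; the graph is assumed to be connected and undirected.

```python
from collections import deque

# Assumed example graph: the 10-node Krackhardt kite, nodes numbered 0-9.
KITE = {
    0: [1, 2, 3, 5],
    1: [0, 3, 4, 6],
    2: [0, 3, 5],
    3: [0, 1, 2, 4, 5, 6],
    4: [1, 3, 6],
    5: [0, 2, 3, 6, 7],
    6: [1, 3, 4, 5, 7],
    7: [5, 6, 8],
    8: [7, 9],
    9: [8],
}


def degree_centrality(adj):
    """Number of direct ties held by each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def closeness_centrality(adj):
    """(n - 1) divided by the sum of shortest-path distances to every other node."""
    result = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        result[s] = (len(adj) - 1) / sum(dist.values())
    return result


def betweenness_centrality(adj):
    """Unnormalised count of shortest paths passing through each node (Brandes)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate dependencies, farthest first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each undirected pair counted twice


def eigenvector_centrality(adj, iterations=200):
    """Power iteration: a node scores highly when its neighbours score highly."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        nxt = {v: sum(x[w] for w in adj[v]) for v in adj}
        norm = sum(val * val for val in nxt.values()) ** 0.5
        x = {v: val / norm for v, val in nxt.items()}
    return x
```

On this assumed graph the measures disagree about who matters most: node 3 has the most direct ties, nodes 5 and 6 are closest to everyone else, and node 7 sits on every path to the tail (nodes 8 and 9), which is the distinction between a connector (hub) and a broker.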

Sherman and Young (2016), When Financial Reporting Still Falls Short, Harvard Business Review, July-August
Sood (2015), Truth, Lies and Brand Trust The Deceit Algorithm, http://datafication.com.au/
New Analytical Tools Can Help

The Newman Model of Deception (Pennebaker et al)
Key word categories for deception mapping:
(1) Self words e.g. “I” and “me” – decrease when someone distances themselves from content
(2) Exclusive words e.g. “but” and “or” – decrease with fabricated content owing to the complexity of maintaining the deception
(3) Negative emotion words e.g. “hate” – increase owing to feelings of shame or guilt
(4) Motion verbs e.g. “go” or “move” – increase as exclusive words go down, to keep the story on track
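
The four categories above lend themselves to a simple word-rate profile. The sketch below is illustrative only: the word lists are tiny hypothetical samples (not the dictionaries used by Pennebaker et al), the function name is an assumption, and the output is a set of per-category rates rather than a deception score.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists for illustration only.
CATEGORIES = {
    "self": {"i", "me", "my", "mine", "myself"},
    "exclusive": {"but", "or", "except", "without"},
    "negative_emotion": {"hate", "angry", "sad", "worthless"},
    "motion": {"go", "move", "walk", "run"},
}


def category_rates(text):
    """Return the share of words in `text` that fall into each category."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        for name, vocabulary in CATEGORIES.items():
            if word in vocabulary:
                counts[name] += 1
    total = len(words)
    return {name: (counts[name] / total if total else 0.0) for name in CATEGORIES}


rates = category_rates("I hate to go, but I must move.")
```

Under the model, a text would be compared against a baseline: fewer self and exclusive words together with more negative emotion words and motion verbs point toward fabricated content.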

Language on Twitter Tracks Rates of Coronary Heart Disease, Psychological Science, January 2015
The findings show that expressions of negative emotions such as anger, stress, and fatigue in the tweets
from people in a given county were associated with higher heart disease risk in that county.
On the other hand, expressions of positive emotions like excitement and optimism were associated with
lower risk.
The results suggest that using Twitter as a window into a community’s collective mental state may provide a
useful tool in epidemiology…So predictions from Twitter can actually be more accurate than using a set of
traditional variables.

Twitter and Marketing Predictions
• Tweets are “found data”, gathered without asking questions
• More meaning than typical search engine query
• Large numbers of passive participants in natural settings
• Twitter can predict the stock market (Lisa Grossman, Wired, Oct 19 2010)
• Predict movie success in first few weekends of release
• “…it also raises an interesting new question for advertisers and marketing executives. Can they change the demand for their film, product or service by directly influencing the rate at which people tweet about it? In other words, can they change the future that tweeters predict?”
Tech Review, http://www.technologyreview.com/blog/arxiv/25000/

http://www.analyzewords.com

By 2020-22:
• 100 million consumers shop in augmented reality
• 30% of web browsing sessions without a screen
• Algorithms positively alter behavior of over 1B
• Blockchain-based business worth $10B
• IoT will save consumers/businesses $1T a year
• 40% of employees cut healthcare costs via fitness tracker
Gartner (2016)
Strategic Predictions for 2017 and Beyond, research note, 14 October, http://www.gartner.com/document/3471568
2016 Hype Cycle for Business Intelligence and Analytics, 29 July, http://www.gartner.com/document/3388326

“With the addition of NLG [Natural Language
Generation], smart data discovery platforms
automatically present a written or spoken context-based
narrative of findings in the data that, alongside the
visualization, inform the user about what is most
important for them to act on in the data.”
Gartner, 29 June, 2015
Smart Data Discovery Will Enable New Class of Citizen Data Scientist

Insights-driven businesses will
generate $1.2 trillion in 2020
Forrester Research, 2016

© 2016 Forrester Research, Inc. Reproduction Prohibited
Insights-driven businesses are faster than large companies
[Chart: revenue (billions), $0 to $1,250, 2015 to 2020; series: Public, Startup; growth labels: 27% CAGR, 40% CAGR]
Global GDP will grow only 3.5% annually.
Source: Forrester, Morningstar, PitchBook, and The Economist Intelligence Unit
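
The growth rates on this slide compound annually, so the gap widens quickly. A minimal sketch of what each rate implies over the five years from 2015 to 2020; the helper name is hypothetical, and because the slide gives no base-year revenue only growth multiples are computed.

```python
def growth_multiple(annual_rate, years):
    """How many times larger a quantity becomes after compounding annually."""
    return (1 + annual_rate) ** years


# Rates taken from the slide; the five-year horizon is 2015 to 2020.
for label, rate in [("27% CAGR", 0.27), ("40% CAGR", 0.40), ("Global GDP, 3.5%", 0.035)]:
    print(f"{label}: x{growth_multiple(rate, 5):.2f} over five years")
```

At 27% a revenue line roughly triples in five years and at 40% it grows more than fivefold, while 3.5% annual growth adds less than a fifth.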

Traditional Analytics: Slow & Expensive
Led by Data Analyst or Scientist
Data Aggregation -> Reports & Analysis -> Visualisation & Interpretation -> Write Data/Business “Story” -> Insights
80% of time sifting through data

System of Insight (SoI)
SoI: Fast & Cost Effective
SME owner, Machine Learning and Natural Language Generation
Data Aggregation or Data Set -> Detect & Extract Patterns and Relationships -> Generate Insights & Story -> Operationalise (Process, Application, IoT)
80% of time in decision making with client
Fusion of data science, business knowledge & creativity for maximum ROI

Actionable Insights
1. What now?
2. So what?
3. Now what?

Companies are reimagining Business Processes with Algorithms and there is “evidence of significant, even exponential, business gains in customer’s customer engagement, cost & revenue performance”
Wilson, H., Alter A. and Shukla, P. (2016), Companies Are Reimagining Business Processes with Algorithms, Harvard Business Review, February, https://hbr.org/2016/02/companies-are-reimagining-business-processes-with-algorithms

Better customer experiences . . . and half the inventory-carrying costs of other online fashion retailers.
Forrester, 2016

Systems of Insight
• Automated pattern extraction
• Outlier detection
• Correlation
• Time series
• Analytics integration with process, app or IoT
https://ubereats.com/melbourne/

outlier-detection “allow detecting a significant fraction
of fraudulent cases…different in nature from historical
fraud…resulting in a novel fraud pattern”
Baesens, B., Vlasselaer, V., and Verbeke, W., 2015, Fraud Analytics Using Descriptive,
Predictive, and Social Network Techniques: A Guide to Data Science for Fraud
Detection, Wiley
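
One common way to flag outliers without labelled fraud history is the modified z-score, which uses the median and the median absolute deviation (MAD) so that the extreme values do not distort the baseline. The sketch below is a swapped-in illustration, not the method described by Baesens et al; the function name, the 3.5 threshold convention and the transaction amounts are all assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds the threshold."""
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        return []  # no spread around the median, nothing can be scored
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(0.6745 * (v - centre) / mad) > threshold
    ]


# Made-up transaction amounts for illustration; one is clearly unusual.
amounts = [42.0, 39.5, 41.2, 40.8, 38.9, 43.1, 40.1, 980.0, 41.7]
suspicious = mad_outliers(amounts)
```

Because the score depends only on how far a value sits from typical behaviour, a case can be flagged even when it resembles no previously seen fraud.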

The ANZ Heavy Traffic Index comprises
flows of vehicles weighing more than 3.5
tonnes (primarily trucks) on 11 selected
roads around NZ. It is contemporaneous
with GDP growth.
The ANZ Light Traffic Index is made up of
light or total traffic flows (primarily cars and
vans) on 10 selected roads around the
country. It gives a six month lead on GDP
growth in normal circumstances (but
cannot predict sudden adverse events such
as the Global Financial Crisis).
http://www.anz.co.nz/about-us/economic-markets-research/truckometer/
ANZ TRUCKOMETER
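
A lead such as the Light Traffic Index's six months can be checked by correlating one series against a shifted copy of the other and seeing which shift fits best. The sketch below is a generic lagged Pearson correlation, not ANZ's methodology; the function names are hypothetical and the data are synthetic quarterly values constructed so that the second series follows the first by two quarters.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def lead_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag]."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])


def best_lead(leader, follower, max_lag):
    """Return the lag (in periods) with the highest correlation, and that correlation."""
    scores = {lag: lead_correlation(leader, follower, lag) for lag in range(max_lag + 1)}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Synthetic quarterly series (not ANZ data): growth echoes traffic two quarters later.
light_traffic = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 3.5, 7.0, 5.5, 8.0, 6.0, 9.5]
gdp_growth = [1.2, 1.4] + [0.5 * t + 1.0 for t in light_traffic[:-2]]

lag, correlation = best_lead(light_traffic, gdp_growth, max_lag=4)
```

With real data the best lag would rarely be this clean, and, as the slide notes, a historical lead says nothing about sudden adverse events.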

Systems of Insight
• Helps move away from “crisis levels” in talent
• Traditional 5 step analytics process reduced to 2 step from data to action
• Reimagine business processes through “machine engineering”
• Minimise messy data issues and data preparation time

Next Step
Start using Systems of Insight and innovative data sources

The future is impossible to predict.
However, one thing is certain:
The company that can excite its customers’ dreams is out ahead in the race to business success
Selling Dreams, Gian Luigi Longinotti
