Myths and challenges in knowledge extraction and analysis from human-generated content

Myths and Challenges
in Knowledge Extraction and Analysis
from Human-generated Content
Marco Brambilla
marco.brambilla@polimi.it
@marcobrambi

Knowledge, Behaviour and Feature Extraction
with Big Data Science

Problem 1.
The Complexity
of Knowledge

There are more things
In heaven and earth, Horatio,
Than are dreamt of in your philosophy.
Shakespeare (Hamlet Act 1, scene 5)

The Answer to the Great Question...
Of Life, the Universe and Everything
Data
Information
Knowledge
Wisdom
Context independence
Understanding
Understanding relations
Understanding patterns
Understanding principles

Formalizing evolving knowledge is hard
Only high frequency emerges
The long tail challenge

The Evolving Knowledge
known
social
factoid
a
c
¬c
b
potentially emerging
potentially decaying
actual and solid
d

Information and Knowledge Extraction

Heaven and Earth
Are they so different?

The Digital “Heaven”
Vs.
The “physical” Earth

Heaven and Earth
How to peer into the world
through an effective window?
INGREDIENTS
Social media, IoT, … – the data
Domain experts – the context

[photo: http://hoglundassociates.com/Images/Cloud_Gate.jpg]
The digital reflection
of our life is
sharpening

Data source          Around for      Frequency        Delay
Census data          100s of years   years            months
Newspaper            100s of years   days             1 day
Weather sensors      10s of years    hours/minutes    hours/minutes
TV news              10s of years    hours            minutes
Traffic sensors      years           15 minutes       minutes
Call Data Records    years           15 minutes       hours
Social media         years           seconds          seconds
IoT                  recently        milliseconds     milliseconds
Source: Emanuele Della Valle
The data evolution

Data piles up without easing decision making
I have to decide:
A or B?
Why not C?
What if D?

But we would like to …
fuse all those
data sources
and make sense of the
fused information
Definitely E!

The MacroScope
Joël de Rosnay, The Macroscope, 1979

Problem 2.
Cognitive Bias
(of the observer)

the streetlamp effect
The bias of the observer

Model of social media and reality sensing

Data Quality Issue
Gartner Report
In 2017, 33% of the largest global companies will experience an information
crisis due to their inability to adequately value, govern and trust their
enterprise information.
If you torture the data long enough,
it will confess to anything
– Darrell Huff

The Vicious Cycle of Bad Data
Bad Data
Incorrect
Analysis
Invalid
Insights
Wrong
Decisions
Poor
Outcome

Conventional Definition of Data Quality
• Accuracy
• The data was recorded correctly.
• Completeness
• All relevant data was recorded.
• Uniqueness
• Entities are recorded once.
• Timeliness
• The data is kept up to date (and time consistency is granted).
• Consistency
• The data agrees with itself.
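The dimensions above can be turned into simple automated checks. The following is a minimal sketch, not a method taken from the slides: the record fields (`id`, `name`, `lat`, `lon`, `updated`), the sample values and the one-year freshness window are all illustrative assumptions, and accuracy is only approximated by a coordinate range check. Consistency is left out because it depends on domain-specific rules.

```python
from datetime import date

# Hypothetical venue records; field names and values are assumptions.
records = [
    {"id": 1, "name": "Bar Duomo", "lat": 45.464, "lon": 9.190, "updated": date(2017, 3, 1)},
    {"id": 2, "name": "", "lat": 45.470, "lon": 9.180, "updated": date(2017, 3, 2)},
    {"id": 2, "name": "Cafe Brera", "lat": 95.0, "lon": 9.185, "updated": date(2015, 1, 1)},
]


def quality_report(rows, today, max_age_days=365):
    """Count violations of four data quality dimensions."""
    seen_ids = set()
    report = {"accuracy": 0, "completeness": 0, "uniqueness": 0, "timeliness": 0}
    for row in rows:
        # Accuracy (proxy only): coordinates must be valid latitude/longitude.
        if not (-90 <= row["lat"] <= 90 and -180 <= row["lon"] <= 180):
            report["accuracy"] += 1
        # Completeness: the name must be present and non-empty.
        if not row.get("name"):
            report["completeness"] += 1
        # Uniqueness: each id should be recorded once.
        if row["id"] in seen_ids:
            report["uniqueness"] += 1
        seen_ids.add(row["id"])
        # Timeliness: the record must have been updated recently enough.
        if (today - row["updated"]).days > max_age_days:
            report["timeliness"] += 1
    return report


report = quality_report(records, today=date(2017, 4, 1))
print(report)
```

In this toy sample each dimension is violated exactly once: an out-of-range latitude, an empty name, a repeated id and a stale update date.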

Why is Data “Dirty” ?
• Dummy Values,
• Absence of Data,
• Multipurpose Fields,
• Cryptic Data,
• Contradicting Data,
• Shared Field Usage,
• Inappropriate Use of Fields,
• Violation of Business Rules,
• Reused Primary Keys,
• Non-Unique Identifiers,
• Data Integration Problems

Data Wrangling a.k.a.
• Data Preprocessing
• Data Preparation
• Data Cleansing
• Data Scrubbing
• Data Munging
• Data Transformation
• Data Fold, Spindle, Mutilate…
• (good old) ETL

Foursquare
• Check-ins explicitly performed in venues all around the world
• Data set: Geo-localized Foursquare venues, collected through a
query every 50m with radius >50m over:
• Milan area: 20 km x 17.5 km
• Some numbers
• Total n° of venues: 90K (dirty)
• Total n° of valid venues: 43K
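To give a feeling for the size of such a collection, here is a rough sketch of how a regular grid of query centres could be generated over the area. It is only an illustration under stated assumptions: the south-west corner coordinates are made up, the assignment of 20 km to width and 17.5 km to height is a guess, the metres-per-degree conversion is a flat approximation, and no real Foursquare API call is shown.

```python
import math


def grid_points(south, west, width_m, height_m, step_m):
    """Yield (lat, lon) query centres on a regular grid over a rectangle."""
    m_per_deg_lat = 111_320.0  # approximate metres per degree of latitude
    m_per_deg_lon = m_per_deg_lat * math.cos(math.radians(south))
    n_rows = int(height_m // step_m)
    n_cols = int(width_m // step_m)
    for i in range(n_rows):
        for j in range(n_cols):
            yield (south + i * step_m / m_per_deg_lat,
                   west + j * step_m / m_per_deg_lon)


# Assumed south-west corner near Milan; 50 m step as stated on the slide.
points = list(grid_points(south=45.38, west=9.04,
                          width_m=20_000, height_m=17_500, step_m=50))
print(len(points))  # 400 columns x 350 rows = 140000 query centres
```

With a 50 m step, circles of radius step/√2 (about 35.4 m) already cover a square grid, so a radius above 50 m makes neighbouring queries overlap. The same venue can then be returned by several queries, and results need deduplication (for example by venue identifier) before counting.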

[chart: College & University; vertical axis from 0 to 1400; weekends and "No access" periods marked]

[chart: Event; vertical axis from 0 to 70; weekends and events marked]

The (pseudo) Practitioner Approach

Problem 4.
Content Bias
(of the source)

Data vs. Question
• Are they aligned?
• The usual problem of representativeness of the sample…
• At a different scale
• With much less control
• Example: the different pictures of the city

Foursquare
Checkins
Copyright © Milano-Hub project @Politecnico di Milano

Flickr

Instagram

Cities into cities, by language
http://urbanscope.polimi.it

Bias of the Source
• Technology
• Audience / Users / Adopters
• Behaviour

Problem 5.
Granularity
(time, space, …)

Example. Space Granularity: the Grid
• Regular squared grid
• Irregular grid with official business-driven meaning
• Irregular grid with data-driven definition
12/4

Cities into cities
http://urbanscope.polimi.it

But other dimensions matter too
• Time
• Categories
• Economic value
• …

Problem 6.
Availability
& Access

Google Places
Only in
the UI
(scraping)
Via API

Bringing Things Together
Space-text similarity between Google and Foursquare
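One possible way to score whether a Google Places venue and a Foursquare venue refer to the same place is to combine geographic distance with name similarity. The sketch below is an assumption-laden illustration, not the procedure used in the project: the 100 m cut-off, the product of the two scores and the venue dictionaries are all hypothetical, and the text score uses Python's standard `difflib.SequenceMatcher`.

```python
import math
from difflib import SequenceMatcher


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def space_text_score(v1, v2, max_dist_m=100.0):
    """Combine spatial closeness and name similarity (assumed formula)."""
    dist = haversine_m(v1["lat"], v1["lon"], v2["lat"], v2["lon"])
    if dist > max_dist_m:
        return 0.0  # too far apart to be the same venue
    space = 1.0 - dist / max_dist_m
    return space * name_similarity(v1["name"], v2["name"])


# Hypothetical venues from the two sources.
google_venue = {"name": "Caffè Duomo", "lat": 45.4641, "lon": 9.1919}
foursquare_same = {"name": "caffè duomo ", "lat": 45.4641, "lon": 9.1919}
foursquare_far = {"name": "Caffè Duomo", "lat": 45.5000, "lon": 9.1919}

print(space_text_score(google_venue, foursquare_same))
print(space_text_score(google_venue, foursquare_far))
```

Pairs scoring above a chosen threshold can be treated as candidate matches and reviewed, since both the cut-off and the threshold need tuning on real data.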

Data is big!
1 GigaByte of Data
(10^9) or,
strictly,
2^30 bytes

1 ZettaByte of Data
one sextillion (10^21) or, strictly, 2^70 bytes

The Fashion Week in Milano #MFW

• Mobile Phone Calls & Msgs: 5 to 10 million per day in a city like Milan
• Trackable user events (incl. data traffic): 1,000 per user per day
Mobile Phone Data

IoT Sensors
• People counters: 1 event per second (or less)
• 86K+ events per day per sensor
• Industrial machine sensors: 100 measurements per second
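The daily volumes follow directly from the sampling rates. A small worked calculation, assuming each sensor emits at a constant rate for the whole day:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def events_per_day(rate_per_second):
    """Events produced in one day at a constant rate (an assumption)."""
    return rate_per_second * SECONDS_PER_DAY


people_counter = events_per_day(1)    # people counter at 1 event per second
machine_sensor = events_per_day(100)  # industrial sensor at 100 measurements per second
print(people_counter, machine_sensor)
```

At 1 event per second a people counter yields 86,400 events per day, which matches the "86K+" figure, while a single industrial machine sensor at 100 measurements per second yields 8,640,000.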

Human computation and crowdsourcing

… and now …
Examples and Cases

Use Case #1: Fashion
The Milano Fashion Week

Response of Social Media #MFW
• MILANO FASHION WEEK #MFW
• We have 2 signals:
• The first coming from social media (in this case we will talk only about
Instagram)
• The second derived from the official calendar events

Research Questions
“Are live events still relevant?
Can online visibility be described simply by how famous the brand is?
Do space and time still matter?
Can we predict how people behave in time/space within events?”

Discover more about the #MFW case
• https://marco-brambilla.com/2017/04/04/social-media-behaviour-during-live-events-the-milano-fashion-week-mfw-case-www2017/
(INCLUDING SLIDES)

Use Case #2: Design
The Milano Design Week
& FuoriSalone

• Fuorisalone Official database
• events/locations/itineraries
• Fuorisalone Official App
• GPS positions (1) of the App users
• Events inserted in the agenda on the App
• Private social posts (Facebook) of App users (2)
• Social Media Listener
• Keyword-based public social posts (Twitter/Instagram)
• Semantic analysis
(1) when the App was running
(2) to use some App features, the users had to perform a social login
Data sources of the analysis

• Data elements are georeferenced and aggregated by citypixel (100
x 100 m squares)
• Merging multiple data sources makes it possible to infer
information:
• Which events attract more visitors?
• Which areas have the largest presence of visitors?
• Do people talk on the social networks about the events they are
interested in?
• Do people use social networks while visiting the events?
• ...
Fusing the data
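Aggregating by citypixel can be sketched as mapping every georeferenced element to the index of the 100 x 100 m cell it falls into, and then counting elements per cell and per source. This is a minimal illustration under assumptions: the grid origin, the flat metres-per-degree approximation, the source names and the sample coordinates are all hypothetical.

```python
import math
from collections import Counter

CELL_M = 100.0             # citypixel side, in metres
M_PER_DEG_LAT = 111_320.0  # approximate metres per degree of latitude


def citypixel(lat, lon, origin_lat, origin_lon):
    """Return the (row, col) index of the cell containing the point."""
    m_per_deg_lon = M_PER_DEG_LAT * math.cos(math.radians(origin_lat))
    row = math.floor((lat - origin_lat) * M_PER_DEG_LAT / CELL_M)
    col = math.floor((lon - origin_lon) * m_per_deg_lon / CELL_M)
    return row, col


def fuse(sources, origin_lat, origin_lon):
    """Return {cell: Counter({source_name: count})} over all sources."""
    cells = {}
    for name, points in sources.items():
        for lat, lon in points:
            cell = citypixel(lat, lon, origin_lat, origin_lon)
            cells.setdefault(cell, Counter())[name] += 1
    return cells


# Hypothetical points from two sources (assumed grid origin: 45.40, 9.10).
sources = {
    "gps": [(45.4005, 9.1005), (45.4006, 9.1004)],
    "instagram": [(45.4004, 9.1006), (45.4015, 9.1020)],
}
cells = fuse(sources, origin_lat=45.40, origin_lon=9.10)
for cell, counts in sorted(cells.items()):
    print(cell, dict(counts))
```

Once all sources share the same cell index, questions such as "do people post where they physically are?" become comparisons between the per-source counts of each cell.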

Approach
City-scale: mobile telephone and (coarse-grain geo-located) social media data
Street/square: people counting & profiling IoT sensors
Point of Interest: people counting sensor, WiFi log analysis, beacons and (fine-grain geo-located) social media
Descriptive, predictive, privacy-preserving and, when needed, real-time analysis of a variety of (fused) data sources

Integration
Personalized information/offers,
city loyalty cards,
digital coupons, and polling
Proximity detection via
NFC or BLE/Beacons

Measuring
People counting and profiling via Mobile Data
[dashboard: 24,512 people present (more people than usual); breakdowns by tourists/citizens, female/male and private/business, with shares 41%, 59%, 71%, 29%, 63% and 37%; age axis from 10 to 70]

Measuring
People counting via 3D camera

Dashboards
Why people are there
CrowdInsights

7 Areas
1. Città murata
2. Lago sponda Viale Geno
3. Lago
4. Lago sponda di Villa Olmo
5. Zona industriale
6. Brunate
7. Business e università
Phone data

http://www.socialometers.com/balocchi/

Use Case #4:
Knowledge Updater

Knowledge Enrichment Setting
[diagram: instances (High Frequency Entities HF Entity1 to HF Entity5, Low Frequency Entities LF Entity1 to LF Entity4) linked by <<instanceof>> to types (Type1, Type11, Type111, Type2); unknown elements are marked "??"]
Legend
Expert inputs
• Seed Entity
• Seed Type
• Type of interest
Enrichment problems
• Relations HF - LF entities
• Relations LF - LF entities
• Typing of LF entities
• Extraction of new LF entities
• Finding attribute values (Property1, Property2)

Discover more
https://marco-brambilla.com/2017/04/06/extracting-emerging-knowledge-from-social-media-www2017/
(SLIDES INCLUDED)

Concluding..
Plenty of issues
And also plenty of application scenarios
where to benchmark ideas!

THANKS!
QUESTIONS?
Myths and Challenges
in Extraction of Emerging Knowledge
from Human-generated Content
Marco Brambilla @marcobrambi marco.brambilla@polimi.it
http://datascience.deib.polimi.it http://home.deib.polimi.it/marcobrambi
