For centuries, science (in German "Wissenschaft") has aimed to create ("schaften") new knowledge ("Wissen") from the observation of physical phenomena, their modelling, and empirical validation. Recently, a new source of knowledge has emerged: no longer (only) the physical world, but the virtual world, namely the Web with its ever-growing stream of data in the form of social network chatter, content produced on demand by crowds of people, and messages exchanged among interlinked devices in the Internet of Things. The knowledge we may find there can be dispersed, informal, contradictory, unsubstantiated and ephemeral today, yet commonly accepted tomorrow. The challenge is once again to capture and create knowledge that is new, has not yet been formalized in existing knowledge bases, and is buried inside a big, moving target (the live stream of online data). The myth is that existing tools (spanning fields like the semantic web, machine learning, statistics, NLP, and so on) suffice for this purpose. While this may still be far from true, some existing approaches do address the problem and provide preliminary insights into what successful attempts may lead to.
The talk explores the mixed realistic-utopian domain of knowledge extraction and reports on some tools and cases where the digital and physical worlds have been brought together to better understand our society.
5. There are more things
In heaven and earth, Horatio,
Than are dreamt of in your philosophy.
Shakespeare (Hamlet Act 1, scene 5)
6. The Answer to the Great Question...
Of Life, the Universe and Everything
Data
Information
Knowledge
Wisdom
Context independence
Understanding
Understanding relations
Understanding patterns
Understanding principles
14. The data evolution
Data source | Around for | Frequency | Delay
Census data | 100s of years | years | months
Newspaper | 100s of years | days | 1 day
Weather sensors | 10s of years | hours/minutes | hours/minutes
TV news | 10s of years | hours | minutes
Traffic sensors | years | 15 minutes | minutes
Call Data Records | years | 15 minutes | hours
Social media | years | seconds | seconds
IoT | recently | milliseconds | milliseconds
Source: Emanuele Della Valle
15. Data piles up without easing decision making
I have to decide:
A or B?
Why not C?
What if D?
Source: Emanuele Della Valle
16. But we would like to …
fuse all those data sources
make sense of the fused information
Definitely E!
Source: Emanuele Della Valle
26. Data Quality Issue
Gartner Report
In 2017, 33% of the largest global companies will experience an information
crisis due to their inability to adequately value, govern and trust their
enterprise information.
If you torture the data long enough,
it will confess to anything
– Darrell Huff
27. The Vicious Cycle of Bad Data
Bad Data -> Incorrect Analysis -> Invalid Insights -> Wrong Decisions -> Poor Outcome
28. Conventional Definition of Data Quality
• Accuracy: the data was recorded correctly.
• Completeness: all relevant data was recorded.
• Uniqueness: each entity is recorded once.
• Timeliness: the data is kept up to date (and time consistency is guaranteed).
• Consistency: the data agrees with itself.
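Some of these dimensions can be checked mechanically. The following is a minimal sketch, not a method from the talk: the record fields (`id`, `name`, `opened`, `closed`, `updated`) and the sample values are hypothetical. Accuracy is left out because it generally needs an external ground truth to compare against.

```python
from datetime import date

# Hypothetical records; field names and values are illustrative assumptions.
records = [
    {"id": 1, "name": "Cafe A", "opened": date(2010, 5, 1), "closed": None, "updated": date(2017, 3, 1)},
    {"id": 2, "name": None, "opened": date(2012, 1, 1), "closed": None, "updated": date(2017, 3, 2)},
    {"id": 2, "name": "Bar B", "opened": date(2012, 1, 1), "closed": None, "updated": date(2017, 3, 2)},
    {"id": 3, "name": "Shop C", "opened": date(2015, 6, 1), "closed": date(2014, 1, 1), "updated": date(2012, 1, 1)},
]

def completeness(rows, field):
    """Fraction of rows where `field` is present and not None."""
    return sum(r.get(field) is not None for r in rows) / len(rows)

def duplicate_ids(rows, key="id"):
    """Key values that appear more than once (uniqueness violations)."""
    seen, dupes = set(), set()
    for r in rows:
        if r[key] in seen:
            dupes.add(r[key])
        else:
            seen.add(r[key])
    return dupes

def stale(rows, today, max_age_days):
    """IDs of rows not updated within `max_age_days` (timeliness)."""
    return {r["id"] for r in rows if (today - r["updated"]).days > max_age_days}

def inconsistent(rows):
    """IDs of rows whose closing date precedes the opening date (consistency)."""
    return {r["id"] for r in rows
            if r["closed"] is not None and r["closed"] < r["opened"]}
```

Each function returns the offending rows or a ratio rather than raising, so the results can feed a cleaning step or a report.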
29. Why is Data “Dirty”?
• Dummy Values
• Absence of Data
• Multipurpose Fields
• Cryptic Data
• Contradicting Data
• Shared Field Usage
• Inappropriate Use of Fields
• Violation of Business Rules
• Reused Primary Keys
• Non-Unique Identifiers
• Data Integration Problems
30. Data Wrangling a.k.a.
• Data Preprocessing
• Data Preparation
• Data Cleansing
• Data Scrubbing
• Data Munging
• Data Transformation
• Data Fold, Spindle, Mutilate…
• (good old) ETL
31. Foursquare
• Check-ins explicitly performed in venues all around the world
• Data set: geo-localized Foursquare venues, collected through a query every 50 m with radius > 50 m over the Milan area (20 km x 17.5 km)
• Some numbers
• Total number of venues: 90K (dirty)
• Total number of valid venues: 43K
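To get a feeling for the size of such a collection, the sketch below enumerates query centres on a regular 50 m grid over a 20 km x 17.5 km rectangle. Placing the centres in the middle of each cell is an assumption, and so is the `dedupe` helper with its `id` field: the slides do not say how the 90K venues were reduced to 43K. Overlapping query circles returning the same venue more than once is only one plausible source of duplicates.

```python
import math

STEP_M = 50          # spacing between query centres (from the slide)
WIDTH_M = 20_000     # Milan area width (from the slide)
HEIGHT_M = 17_500    # Milan area height (from the slide)

def query_centres(width_m, height_m, step_m):
    """Yield (x, y) offsets in metres of query centres on a regular grid."""
    nx = width_m // step_m
    ny = height_m // step_m
    for i in range(nx):
        for j in range(ny):
            # Assumption: each centre sits in the middle of its grid cell.
            yield (i * step_m + step_m / 2, j * step_m + step_m / 2)

def min_covering_radius(step_m):
    """Smallest radius for which circles on a square grid leave no gaps."""
    return step_m * math.sqrt(2) / 2

def dedupe(venues):
    """Keep the first record seen for each venue id (hypothetical 'id' field)."""
    unique = {}
    for v in venues:
        unique.setdefault(v["id"], v)
    return list(unique.values())

n_queries = sum(1 for _ in query_centres(WIDTH_M, HEIGHT_M, STEP_M))
```

With these numbers the grid has 400 x 350 = 140,000 query centres, and any radius above roughly 35.4 m already covers the area without gaps, so a radius above 50 m necessarily makes neighbouring queries overlap.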
39. Data vs. Question
• Are they aligned?
• The usual problem of representativeness of the sample…
• At a different scale
• With much less control
• Example: the different pictures of the city
47. Example. Space Granularity: the Grid
• Regular square grid
• Irregular grid with official business-driven meaning
• Irregular grid with data-driven definition
58. Mobile Phone Data
• Mobile phone calls & messages: 5 to 10 million per day in a city like Milan
• Trackable user events (incl. data traffic): 1,000 per user per day
59. IoT Sensors
• People counters: 1 event per second (or less)
• 86K+ events per day per sensor
• Industrial machine sensors: 100 measurements per second
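The daily volumes follow directly from the per-second rates. A small worked calculation, assuming each sensor emits at a constant rate for a full 24 hours:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def events_per_day(rate_per_second):
    """Daily event count for a sensor emitting at a constant rate."""
    return rate_per_second * SECONDS_PER_DAY

people_counter = events_per_day(1)       # 86,400 events per day
industrial_sensor = events_per_day(100)  # 8,640,000 measurements per day
```

So a single industrial machine sensor produces about 8.6 million measurements per day, two orders of magnitude more than a people counter.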
63. Response of Social Media #MFW
• Milano Fashion Week (#MFW)
• We have 2 signals:
• The first comes from social media (here we discuss only Instagram)
• The second is derived from the official calendar of events
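One simple way to relate the two signals is to count the posts that fall in a time window after each calendar event. The sketch below is only an illustration and not the analysis used in the study: the event names, timestamps, and the two-hour window are invented.

```python
from datetime import datetime, timedelta

def posts_per_event(events, post_times, window):
    """Count posts whose timestamp falls within `window` after each event start."""
    counts = {}
    for name, start in events.items():
        counts[name] = sum(start <= t < start + window for t in post_times)
    return counts

# Toy data: two hypothetical calendar events and four post timestamps.
events = {
    "show_a": datetime(2016, 2, 24, 10, 0),
    "show_b": datetime(2016, 2, 24, 15, 0),
}
post_times = [
    datetime(2016, 2, 24, 10, 5),
    datetime(2016, 2, 24, 11, 30),
    datetime(2016, 2, 24, 15, 45),
    datetime(2016, 2, 24, 20, 0),
]
counts = posts_per_event(events, post_times, timedelta(hours=2))
```

Here `counts` maps each event to the number of posts in its window; the last post falls outside both windows and is not attributed to either event.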
64. Research Questions
• Are live events still relevant?
• Can online visibility be described simply by how famous the brand is?
• Do space and time still matter?
• Can we predict how people behave in time/space within events?
65. Discover more about the #MFW case
• https://marco-brambilla.com/2017/04/04/social-media-behaviour-during-live-events-the-milano-fashion-week-mfw-case-www2017/ (including slides)
66. Use Case #2: Design
The Milano Design Week
& FuoriSalone
67. Data sources of the analysis
• Fuorisalone official database
  • Events/locations/itineraries
• Fuorisalone official App
  • GPS positions of the App users (1)
  • Events inserted in the agenda on the App
  • Private social posts (Facebook) of App users (2)
• Social Media Listener
  • Keyword-based public social posts (Twitter/Instagram)
  • Semantic analysis
(1) When the App was running.
(2) To use some App features, users had to perform a social login.
68. Fusing the data
• Data elements are georeferenced and aggregated by citypixel (100 m x 100 m squares)
• Merging multiple data sources makes it possible to infer information:
  • Which events attract the most visitors?
  • Which areas have the largest presence of visitors?
  • Do people talk on social networks about the events they are interested in?
  • Do people use social networks while visiting the events?
  • ...
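Aggregating by citypixel can be sketched as counting, for every 100 m x 100 m square, how many elements each source contributes. This is a minimal illustration, not the pipeline used in the analysis: it assumes points are already expressed as metre offsets from a common origin, and the source names and coordinates are invented.

```python
from collections import defaultdict

CELL_M = 100  # citypixel side in metres (from the slide)

def citypixel(x_m, y_m, cell_m=CELL_M):
    """Return the (col, row) citypixel containing a point given in metres."""
    return (int(x_m // cell_m), int(y_m // cell_m))

def fuse(sources):
    """Map {source: [(x_m, y_m), ...]} to {pixel: {source: count}}."""
    fused = defaultdict(lambda: defaultdict(int))
    for name, points in sources.items():
        for x, y in points:
            fused[citypixel(x, y)][name] += 1
    return fused

# Toy data: hypothetical GPS fixes and geolocated social posts.
sources = {
    "gps": [(120, 40), (150, 60), (950, 950)],
    "social": [(130, 10)],
}
fused = fuse(sources)
```

Once the sources share the same pixels, questions such as "do people post where they actually go?" reduce to comparing the per-source counts within each pixel.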
87. THANKS!
QUESTIONS?
Myths and Challenges
in Extraction of Emerging Knowledge
from Human-generated Content
Marco Brambilla @marcobrambi marco.brambilla@polimi.it
http://datascience.deib.polimi.it http://home.deib.polimi.it/marcobrambi
Editor's notes
Paolo
Here we explain the dimensions we identified to describe the city, and then explain which part we focused on and the 3 extended analyses.
Piercesare
Piercesare > selinunte giambellino
The Twitter database thus contains 106,278 tweets, about 6.5% of which are geolocated, which in absolute terms is almost 7 thousand posts.
The Instagram database, instead, contains just over 556 thousand posts (about 5 times the size of the Twitter database), with about 28% of the media geolocated (roughly 155 thousand posts).
We can immediately note two interesting facts:
For this specific scenario (MFW), Instagram was the preferred communication medium.
Instagram users appear more inclined to show their "physical" position, and therefore their involvement in an event (or their visit to a place, in general), almost as proof of their participation in that event.
At this point we can start exploring and studying our data, which consists of several sub-analyses -> (we analyzed some measures specific to the content authors, addressed the problem of the response in time and space to the various calendar events, and, after adding another type of reaction, which we call popularity, compared it with the previous results)