1. Examples of Working with Streaming Data
Yi-Shin Chen
Institute of Information Systems and Applications
Department of Computer Science
National Tsing Hua University
yishin@gmail.com
2. Hello
陳宜欣 Yi-Shin Chen
Currently
Associate professor at NTHU CS
Director of IDEA Lab
Education
Ph.D. in Computer Science, USC, USA
M.B.A. in Information Management, NCU, TW
B.B.A. in Information Management, NCU, TW
Courses
Introduction to Database Systems
Advanced Database Systems
Data Mining: Concepts, Techniques, and Applications
5. Streaming Data
Continuous flow
E.g., stock volume, sensor data, social streams
Infinite length
Impractical to store and use all historical data
Concept drift
Not wise to use all historical data
6. Continuous Queries
[Figure: pipeline diagram with the components Stream DB, Acquisition Process (raw data and transformation of the raw stream), Continuous Query Process, Crowd Wisdom, and Rules/Patterns; the system continuously provides feedback]
Three major approaches for continuous queries
• Fast on-line classification/clustering
• Sliding window
• Range aggregation
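A minimal sketch of the sliding-window approach, combined with a simple aggregate (a running mean) over the window. The class name `SlidingWindowAverage`, the window size, and the sample stream are illustrative assumptions, not part of the original slides.

```python
from collections import deque


class SlidingWindowAverage:
    """Keep only the most recent `size` readings and report their mean."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        self.window = deque(maxlen=size)
        self.total = 0.0

    def update(self, value):
        # When the window is full, the oldest reading is about to be evicted,
        # so remove it from the running total first.
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)


# Hypothetical stream of readings (e.g., sensor values).
stream = [10, 20, 30, 40, 50]
agg = SlidingWindowAverage(size=3)
averages = [agg.update(v) for v in stream]
```

Only the last three readings are ever stored, which matches the point that keeping all historical data is impractical for an unbounded stream.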
8. Framework of Off-line Training Module
[Figure: off-line training framework with the components Acquisition Process, Crowd Wisdom, and Rules/Patterns]
9. Alignment
Industries: Finance, Textile, Car, ...
Each industry has a list of companies and a list of related words.
belong_n = [P_finance, P_textile, ..., P_car]
Example news excerpt: "The Luxgen Neora concept car, which first appeared at the Shanghai Auto Show in April 2011, is not only the first concept car launched by the domestic brand Luxgen since its founding..."
belong_n = [0, 0, ..., 3]
P_finance = 0, P_textile = 0, P_car = 3
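A rough sketch of how such a belong vector could be computed. It assumes that each P value is the number of an industry's terms (company names or related words) found in the article; the slides do not state the exact counting rule. The lexicon `INDUSTRY_TERMS` and the English sample article are made up for illustration.

```python
# Hypothetical industry lexicons: company names plus related words.
INDUSTRY_TERMS = {
    "finance": ["bank", "insurance", "loan"],
    "textile": ["fabric", "yarn", "garment"],
    "car": ["Luxgen", "concept car", "auto show"],
}


def belong_vector(article, industry_terms):
    """Return one score per industry: how many of its terms occur in the article.

    Assumption: each matching term counts once, regardless of repetitions.
    """
    text = article.lower()
    return [
        sum(1 for term in terms if term.lower() in text)
        for terms in industry_terms.values()
    ]


article = "The Luxgen Neora concept car debuted at the Shanghai Auto Show in April 2011."
belong = belong_vector(article, INDUSTRY_TERMS)
```

Here the article matches three car terms and no finance or textile terms, so it would be aligned with the car industry.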
10. Itemset Production
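Itemset pairs (labeled "Group" on the slide):
Japan+earthquake  Japan+disaster relief
Japan+earthquake  Japan+flooding
Japan+earthquake  Japan+impact
Japan+earthquake  Japan+estimate
Japan+earthquake  Japan+destruction
Japan+purchase    Japan+travel
...
The confidence of Japan+earthquake:
Number of transactions in which Japan+earthquake appears: u_s
Number of transactions in which Japan appears: n_p
confidence = u_s / n_p = 5 / 6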
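The confidence computation above can be sketched as follows, using the slide's Japan and earthquake (日本+地震) example. The representation of each transaction as a set of words, and the function name `confidence`, are assumptions made for illustration.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    n_antecedent = sum(1 for t in transactions if antecedent in t)
    n_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    if n_antecedent == 0:
        return 0.0
    return n_both / n_antecedent


# Six transactions mirroring the slide: "Japan" appears in all six,
# "Japan" together with "earthquake" appears in five.
transactions = [
    {"Japan", "earthquake", "disaster relief"},
    {"Japan", "earthquake", "flooding"},
    {"Japan", "earthquake", "impact"},
    {"Japan", "earthquake", "estimate"},
    {"Japan", "earthquake", "destruction"},
    {"Japan", "purchase", "travel"},
]
conf = confidence(transactions, "Japan", "earthquake")  # 5 / 6
```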
11. Representative Itemset Selection
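Select itemsets with high confidence as candidates for the representative itemset.
weight = x * tfidf_1 + y * tfidf_2 + z * confidence
Candidate itemsets: Japan+earthquake, Japan+estimate, nuclear+leak, crisis+occur
TF-IDF of each word:
Japan  earthquake  estimate  nuclear  leak  crisis  occur
0.22   0.25        0.03      0.18     0.2   0.10    0.001
Confidence of each itemset:
Japan+earthquake  Japan+estimate  nuclear+leak  crisis+occur
0.833             1               0.667         0.667
Concept: Japan, earthquake, nuclear, leak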
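A small worked sketch of the weight formula, using the TF-IDF and confidence values shown on the slide (words translated to English). The coefficients x, y, z and the choice of keeping the top two itemsets are hypothetical; the slides do not give the values actually used. With the illustrative coefficients below, the two highest-weighted itemsets happen to be Japan+earthquake and nuclear+leak, which is consistent with the concept shown on the slide.

```python
# TF-IDF per word and confidence per itemset, taken from the slide.
TFIDF = {
    "Japan": 0.22, "earthquake": 0.25, "estimate": 0.03,
    "nuclear": 0.18, "leak": 0.2, "crisis": 0.10, "occur": 0.001,
}
CONFIDENCE = {
    ("Japan", "earthquake"): 0.833,
    ("Japan", "estimate"): 1.0,
    ("nuclear", "leak"): 0.667,
    ("crisis", "occur"): 0.667,
}


def itemset_weight(itemset, x, y, z):
    """weight = x * tfidf_1 + y * tfidf_2 + z * confidence"""
    word1, word2 = itemset
    return x * TFIDF[word1] + y * TFIDF[word2] + z * CONFIDENCE[itemset]


def top_itemsets(k, x=1.0, y=1.0, z=0.2):
    """Rank candidate itemsets by weight; x, y, z are hypothetical coefficients."""
    ranked = sorted(
        CONFIDENCE,
        key=lambda s: itemset_weight(s, x, y, z),
        reverse=True,
    )
    return ranked[:k]


selected = top_itemsets(2)
concept = [word for itemset in selected for word in itemset]
```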
12. Concept Verification
By considering:
The daily frequency of concept C_j
The concept index CI_j of C_j
A regression model based on price within sliding windows
If the p-value rejects H_0, the concept C_j is considered an influential event
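One way to sketch the verification step for a single sliding window is shown below. The slides do not specify the regression model or the test statistic, so this example substitutes a simple least-squares slope with a permutation test for the p-value (H_0: the concept's daily frequency has no linear relation to the price change). The data, the 0.05 threshold, and all function names are assumptions.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not be constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the slope (swapped-in test, see lead-in)."""
    rng = random.Random(seed)
    observed = abs(ols_slope(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(ols_slope(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Made-up window: daily frequency of a concept and the matching price change.
frequency = [1, 2, 3, 4, 5, 6, 7, 8]
price_change = [0.1, 0.3, 0.2, 0.5, 0.6, 0.6, 0.9, 1.0]

p_value = permutation_p_value(frequency, price_change)
is_influential = p_value < 0.05  # reject H_0 at a hypothetical 5% level
```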
13. On-line Prediction Module
Regression prediction
Use most frequent event.
Adjust regression prediction
Include other events that are not the most frequent.
Pheromone prediction
Include the past influence.
(Framework stage: Continuous Query Process)
14. Experimental Data
Stock data
Industry index from TWSE.
2012-01-01 to 2012-05-11
News data
News crawled from websites.
Yahoo!, udn, Libertytimes, PCHome, etc.
13 websites in total.
2012-01-01 to 2012-05-11
More than 150,000 news articles.
All the news is in Traditional Chinese.
15. Experimental Setup
Four methods to predict the market:
Pheromone prediction model
Adjust regression prediction model
Regression prediction model
Blind test.
Prediction policy: fall, rise, or NSM (no significant move)
16. Performance
Accuracy of four methods:
Method             Average accuracy
Pheromone          0.5784574
Adjust regression  0.5323214
Regression         0.5134457
Blind test         0.3045479
17. Performance
Does it work on the whole market?
We were also interested in using events to predict the whole market by aggregating all industries into one.
Type               Accuracy
Pheromone          0.6315789
Adjust regression  0.6896511
Regression         0.5714285
19. Motivation
Diversify human-computer interaction technology with multimedia
Music education
Music experiment
Amateur and professional conductors
Composers
Personal amusement
20. Devices
Build an interactive conducting system using motion
Microsoft Kinect
3D Depth Sensors
22. Conducting Data (Data Streams)
Cartesian coordinates (x, y, z)
30 frames per second at 320x240 resolution
delay 33 ms (1/30 second)
Human eyes can process 10 to 12 frames per second [2]
delay ≈ 100 ms (1/10 second)
[Figure: sensor coordinate axes (+X, -X, +Y, -Y, Z) relative to the sensor direction]
23. Framework
[Flowchart; box labels recovered from the slide, connections partly lost in extraction]
Initial System: PlayStatus = False
Decision: Is PlayStatus true?
Start Gesture Recognition; decision: Is Start true? (Yes: PlayStatus = True)
Stop Gesture Recognition; decision: Is Stop true? (Yes: PlayStatus = False)
Conducting Data Received
Beat Pattern Recognition
Whole Measure
Volume: relative height of hand
Identify Instrument Emphasis: Tilt Z-Mapping
Volume Adjustment According to Instrument Emphasis
Tempo Adjustment According to Instrument Emphasis
Framework stages shown: Acquisition Process, Crowd Wisdom, Rules/Patterns, Offline Analysis, Continuous Query Process
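The PlayStatus logic in the flowchart can be read as a two-state machine. The sketch below is one interpretation of it, assuming that gesture recognition has already produced a label per step; the class name, the gesture labels, and the returned action strings are hypothetical.

```python
class ConductingSession:
    """Two-state sketch of the flowchart: waiting for a start gesture, or playing."""

    def __init__(self):
        # Mirrors "Initial System: PlayStatus = False".
        self.play_status = False

    def step(self, gesture):
        """Process one recognized gesture label and return the action taken."""
        if not self.play_status:
            # Not playing: only a start gesture changes the state.
            if gesture == "start":
                self.play_status = True
                return "started"
            return "idle"
        # Playing: a stop gesture ends playback; anything else is conducting data
        # that would feed beat, volume, and tempo processing.
        if gesture == "stop":
            self.play_status = False
            return "stopped"
        return "conducting"


session = ConductingSession()
actions = [session.step(g) for g in ["beat", "start", "beat", "beat", "stop", "beat"]]
```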
24. Experiments
Evaluation
Beat pattern and measure recognition
Volume control and instrument emphasis recognition
Response time
Experimental Setup
Participants
1 professional
8 had no experience
Practice
30 minutes
25. Beat Pattern and Measure Recognition Evaluation
[Figure: bar chart of recognition rate (recall and precision) for the professional and no-experience groups; axis from 0 to 1; bar labels in extraction order: 0.7826, 0.8648, 0.8438, 0.8821]
28. Goal
Identify the location of a particular Twitter
user at a given time
Using exclusively the content of his/her tweets
29. Major Challenges
Twitter Challenges
Tweets are noisy
Extensive use of non-standard vocabulary
Bots and spammers
Geo-locational Challenges
Users might have several associated locations
Toponyms
Scarce information
False profile information
31. Experimental Setup
Original Dataset 1.53 M Twitter users and 13 M tweets
3,314 Twitter users and 2.2 M tweets
104,054 geo-tagged tweets
Although we collected and processed the data carefully, it still needed to be validated
• Use of Local Experts
– People familiar with the geography of the country
Stage:   Original Tweets | Subject Identification | Location Discovery | Toponyms Removal | Timeline Sorting | Final Results
Tweets:  329,814 | 57,153 | 18,662 | 9,093 | 6,928 | 2,165
35. Introduction
Analyzing social streams can benefit:
Emergency control
Crowd opinion analysis
Unreported events detection
Motivation: event identification from social streams
37. Methodology – Keyword Selection
Well-noticed criterion
A word is well-noticed if, compared to the past, it is suddenly mentioned by many users
Time Frame – a unit of time
Sliding Window – a certain number of past time frames
[Figure: timeline divided into time frames tf0, tf1, tf2, tf3, tf4]
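A hedged sketch of the well-noticed criterion: a word is flagged when the number of users mentioning it in the current time frame is well above its level over the sliding window of past time frames. The threshold rule (mean plus k standard deviations), the minimum user count, and the sample counts are assumptions; the slides do not give the exact formula.

```python
from collections import deque
from statistics import mean, pstdev


def is_well_noticed(past_counts, current_count, k=2.0, min_users=5):
    """Flag a word whose current user count bursts above its recent history.

    past_counts: user counts for the word in the past time frames (the window).
    k and min_users are hypothetical tuning parameters.
    """
    history = list(past_counts)
    if current_count < min_users or not history:
        return False
    return current_count > mean(history) + k * pstdev(history)


# Users mentioning one word in time frames tf0..tf4 (made-up numbers).
users_per_frame = [3, 4, 2, 3, 40]
window = deque(maxlen=4)  # sliding window of past time frames
flags = []
for users in users_per_frame:
    flags.append(is_well_noticed(window, users))
    window.append(users)
```

Only the last time frame is flagged, because 40 users is far above the window's mean of 3.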
38. Methodology – Event Candidate Recognition
Idea: group one keyword with its most relevant
keywords into one event candidate
[Figure: keyword graph with the nodes boston, explosion, confirm, prayer, bombing, boston-marathon, threat, iraq, jfk, hospital, victim, afghanistan, bomb, america]
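A minimal sketch of grouping a keyword with its most relevant keywords. It assumes that relevance is measured by how often two keywords co-occur in the same post, which the slides do not confirm; the posts, the keyword set, and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(posts, keywords):
    """Count how often each pair of selected keywords appears in the same post."""
    counts = Counter()
    for post in posts:
        present = sorted(set(post) & keywords)
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts


def event_candidate(seed, counts, top_n=2):
    """Group `seed` with its `top_n` most frequently co-occurring keywords."""
    related = Counter()
    for (a, b), c in counts.items():
        if a == seed:
            related[b] += c
        elif b == seed:
            related[a] += c
    return [seed] + [word for word, _ in related.most_common(top_n)]


# Made-up tokenized posts and selected keywords.
posts = [
    ["boston", "explosion"],
    ["boston", "explosion", "victim"],
    ["boston", "explosion", "hospital"],
    ["boston", "victim"],
    ["iraq", "bomb"],
]
keywords = {"boston", "explosion", "victim", "hospital", "iraq", "bomb"}

counts = cooccurrence(posts, keywords)
candidate = event_candidate("boston", counts)
```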
39. Methodology – Evolving Social Graph Analysis
Information decay:
Vertex weight, edge weight
Decay mechanism
Concept-Based Evolving Graph Sequences (cEGS):
a sequence of directed graphs that demonstrate
information propagation
[Figure: graph snapshots at time frames tf1, tf2, tf3]
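The decay mechanism could look like the sketch below, which applies the same update to vertex weights or edge weights between consecutive time frames. Exponential decay, the decay factor of 0.5, and the pruning floor are assumptions; the slides only state that a decay mechanism exists.

```python
def decay_weights(weights, new_mentions, decay=0.5, floor=0.01):
    """Carry weights into the next time frame.

    Old weights are multiplied by `decay`, fresh activity is added, and entries
    that fall below `floor` are dropped. Keys can be vertices (keywords) or
    edges (keyword pairs). `decay` and `floor` are hypothetical parameters.
    """
    updated = {}
    for key in set(weights) | set(new_mentions):
        value = weights.get(key, 0.0) * decay + new_mentions.get(key, 0.0)
        if value >= floor:
            updated[key] = value
    return updated


# Made-up activity per time frame (tf1, tf2, tf3).
frames = [{"boston": 4.0, "explosion": 2.0}, {"boston": 1.0}, {}]
weights = {}
history = []
for mentions in frames:
    weights = decay_weights(weights, mentions)
    history.append(dict(weights))
```

A keyword that stops being mentioned fades out over successive time frames instead of disappearing at once.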
40. Experiment
Testing
Events identified in November 2013
Evaluated by 7 human experts
Average precision: 86.64%
[Figure: precision (y-axis, 0 to 1) by date (x-axis) for 26 dates between Nov_2 and Nov_30; Nov_9, Nov_20, and Nov_21 are absent]
42. Introduction
18.1% of people in the United States suffer from a mental disorder (*)
Using social networks to research mental disorders
(*) National Institute of Mental Health:
http://www.nimh.nih.gov/health/statistics/prevalence/index.shtml
43. Background
Bipolar Disorder:
*Unstable and impulsive emotions
Cycling between manic and depressive episodes
Borderline Personality Disorder:
*Unstable and impulsive emotions
Impaired social interactions
53. Basic Guidelines
Identify the commonalities and differences between the experimental and control groups
Features:
Word/pattern frequency
Emotion-related data (e.g., flipping rates, occurrence rates)
Social interaction (e.g., retweet, reply)
Lifestyle (e.g., online time, staying up late or not)
Age and gender
54. Apply Classifiers (Online)
By utilizing the extracted features
Various classifiers
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Random forest
(Framework stage: Continuous Query Process)
Similarly, to measure the effectiveness of our method, the results of the Hometown dataset were split into "Factual" and "Empty | Fictional".
- The first category refers to profiles in which the user has explicitly stated a valid location. The second category covers profiles whose location is listed as empty, fictional, or overbroad.
- WMAE: Workers MAE
- Tw MAE: Tweet MAE
- Workers would usually agree on the city, but not on the area, as a result of their perception.
In general, the error distance remained low. For reallocated tweets as well, Tw MAE remained low compared with the area of the United States (3.1 million square miles).