Railroad Modeling at Hadoop Scale
1. © 2014 Silicon Valley Data Science LLC
All Rights Reserved.
svds.com @SVDataScience
Railroad Modeling at Hadoop Scale
Hadoop Summit
3 June 2014, San Jose, CA
John Akred (@BigDataAnalysis),
Tatsiana Maskalevich (@notrockstar)
www.svds.com @SVDataScience
2.
Why is a data science &
engineering consulting company
building its own Caltrain app?
4.
• Commuter rail serving San Francisco, San Mateo, and Santa Clara counties (~30 stations)
• 118 passenger cars
– 60% are at least 30 years old
• 2014 weekday ridership is 52,019 people
• On-time performance is about 92%
• No reliable real-time status information
• API outage between April 5th and June 2nd, 2014
5.
HOW DO WE KNOW IF THE TRAIN IS LATE?
• Direct observation
– We can hear the train horn
– We can see the train when it goes by
• Purpose-built systems:
– We can use Caltrain APIs (when working)
• Other signals
– We can check Twitter for delay info or rider comments
6.
SVDS Approach
Take advantage of the available signals
Use historical data to make direct and latent observations more useful
Provide a service that gives users valuable planning and riding features
Don’t let the perfect be the enemy of the good
7.
DATA RESILIENCY
Stovepipe: One-to-one relationship from data source to product
Hard Failure: If the data source is broken, so is the app.
Multi-sourced: Redundancy of overlapping data sources makes your products more resilient
Graceful Degradation: If a data source breaks, there is a backup and your app continues to function
Production data services abstract the probabilistic integration of overlapping data sources. We call this model a Data Mesh.
[Diagram labels: Products, Data Sources, Broken Data Sources, Data Services]
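One way to picture multi-sourced, gracefully degrading integration is the minimal Python sketch below. It is not the SVDS implementation: the `Source` class, the weights, and the weighted-average rule are all assumptions made for illustration. Each source reports a delay estimate in minutes, a broken source is skipped instead of failing the whole service, and the remaining estimates are combined.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Source:
    name: str
    weight: float                        # hypothetical trust in this source
    read: Callable[[], Optional[float]]  # delay in minutes, or None if no data


def estimate_delay(sources: List[Source]) -> Tuple[Optional[float], List[str]]:
    """Weighted average of the delay estimates from sources that respond."""
    total, weight_sum, used = 0.0, 0.0, []
    for src in sources:
        try:
            value = src.read()
        except Exception:
            value = None  # a broken source is skipped, not fatal
        if value is None:
            continue
        total += src.weight * value
        weight_sum += src.weight
        used.append(src.name)
    if weight_sum == 0:
        return None, used  # every source is down: no estimate at all
    return total / weight_sum, used


def broken_api() -> float:
    raise ConnectionError("API outage")


# Made-up sources and readings, purely for illustration.
sources = [
    Source("caltrain_api", 0.9, broken_api),
    Source("audio", 0.6, lambda: 4.0),
    Source("twitter", 0.3, lambda: 7.0),
]
delay, used = estimate_delay(sources)
```

With the API source failing, the estimate still comes back, built from the audio and Twitter readings only. A stovepipe design would correspond to a list with a single source, where one failure leaves no estimate.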
8.
Source Signals: Audio, Image, Text, API
Variety, Volume, Velocity
9.
Audio Capture and Ingest
• Microphone connected to Raspberry Pi: mic -> preamp -> analog-to-digital converter -> USB
• PyAudio running on the Raspberry Pi serializes audio as an array of 2-byte integers.
• Sound data + metadata -> Flume on AWS via flumelogger
• We use FFT + decision trees to detect trains and classify them as express or local based on the whistle sound.
[Diagram labels: Raspberry Pi, Raw Audio Agent, Raw Audio Agent]
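As a rough illustration of "audio as an array of 2-byte integers" plus metadata, the sketch below packs samples as signed 16-bit values and pairs them with a small header. It uses only the Python standard library. It does not call PyAudio or flumelogger, and the header field names (`station`, `sample_rate`, and so on) are hypothetical, since the slides do not show the actual event format.

```python
import json
import struct
import time


def serialize_samples(samples):
    """Pack integer samples as little-endian signed 16-bit (2-byte) values."""
    return struct.pack("<%dh" % len(samples), *samples)


def deserialize_samples(payload):
    """Inverse of serialize_samples: bytes back to a list of integers."""
    count = len(payload) // 2
    return list(struct.unpack("<%dh" % count, payload))


def build_event(samples, sample_rate, station):
    """Return (header_bytes, body_bytes) for one chunk of audio.

    The header fields are assumptions made for this sketch.
    """
    header = {
        "station": station,
        "sample_rate": sample_rate,
        "num_samples": len(samples),
        "captured_at": time.time(),
    }
    return json.dumps(header).encode("utf-8"), serialize_samples(samples)


header, body = build_event([0, 1200, -1200, 32767, -32768], 44100, "demo-post")
```

Each sample occupies exactly two bytes, so the body length is twice the sample count and the receiving side can recover the samples with `deserialize_samples`.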
10.
Image Capture and Ingest
• wget pulls images from the camera’s built-in server 2-3 times a second and saves them on a local server/NAS
• Flume pushes the image data to our EC2 servers
• OpenCV (Python) is used to detect trains in images
[Diagram labels: Raw Image Agent, Raw Image Agent, Local Server]
11.
Text Capture and Ingest: Twitter
• Capture all tweets containing the keyword ‘Caltrain’ via the Twitter API
• A Flume agent sends tweets to an Apache Storm topology for processing
• Tweets are parsed and written to HDFS and HBase
• Event detection is based on the baseline number of tweets per hour and on keywords
[Diagram labels: Raw Image Agent, Twitter APIs]
12.
Caltrain API Data Capturing
• Real-time departure times are available via the 511.org developer APIs
• A Python script collects data once a minute from the 511.org APIs and stores it in HDFS as sequence files using the WebHDFS APIs.
• A Python script collects data from the Caltrain site, which includes the run number
• The API did not function from April 5th until June 2nd, 2014
[Diagram labels: scraper.py, 511.org APIs, Caltrain Webpage, data_collector_api.py]
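The once-a-minute collection into HDFS over WebHDFS can be pictured roughly as below. This is only a sketch under stated assumptions: the host name, port, user, directory layout, and `.json` suffix are invented for illustration, nothing is fetched from 511.org, and it does not produce the sequence-file format mentioned on the slide. It only builds the WebHDFS `op=CREATE` URL that a collector would send an HTTP PUT to.

```python
from datetime import datetime
from urllib.parse import quote, urlencode


def minute_path(base_dir, when):
    """One HDFS path per collection minute (hypothetical layout)."""
    return "%s/%s.json" % (base_dir.rstrip("/"), when.strftime("%Y/%m/%d/%H%M"))


def webhdfs_create_url(host, port, hdfs_path, user):
    """URL for a WebHDFS file-create request (sent with HTTP PUT)."""
    query = urlencode({"op": "CREATE", "overwrite": "false", "user.name": user})
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, quote(hdfs_path), query)


# Host, port, user and directory are placeholders, not values from the talk.
path = minute_path("/data/caltrain/departures", datetime(2014, 6, 3, 8, 15))
url = webhdfs_create_url("namenode.example.com", 50070, path, "collector")
```

In WebHDFS the NameNode answers this PUT with a redirect to a DataNode, and the file contents are sent in a second PUT to the redirected location. A real collector would need to handle that two-step exchange.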
13.
Combining the Signals
Audio Signal Detection, Image Recognition, Text Analysis -> STATE of complex system
14.
[Architecture diagram. Labels:]
External Data Sources: Twitter API, Microphone on Raspberry Pi, Web Camera, Caltrain API
Agents: Twitter Agent, Sound Agent, Image Agent, Caltrain Agent
Spouts: Twitter Spout, Sound Spout, Image Spout, Caltrain Spout
Processing: Tweet Parser, Tweets Counter, Sounds Classifier, Train Detector, Schedule Integrator, Event Detector
Outputs: HDFS Writer, HBase Writer, Alerts, Transmit to APP
Other labels: Data Platform, Analytics, Dev, MapReduce, Event Storage
15.
Data Science: Audio
Batch:
• Apply FFT to audio data to identify the train based on the train whistle’s fundamental frequencies.
• Decision tree trained to classify trains as local or express based on minimum and maximum fundamental frequencies (Doppler effect)
Real-Time:
• Execute local / express classifier
• Send data to the Event Detector for APP alerts
• Store results in HBase
• Apply FFT to audio signal
• Extract min and max fundamental frequencies
[Figure: Histogram of Whistle Frequencies Over a Period of Time. Axes: Frequency (Hz), Frequency Counts]
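The FFT and min/max frequency steps can be sketched in plain Python as below. This is an illustration, not the code behind the talk. It uses a textbook radix-2 FFT, treats the strongest spectral peak of each frame as a stand-in for the fundamental, and replaces the trained decision tree with a single hypothetical threshold on the frequency spread. The assumption is that a faster train produces a larger Doppler spread, and the 20 Hz cut-off is invented.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def peak_frequency(frame, sample_rate):
    """Frequency (Hz) of the strongest non-DC bin in one frame."""
    spectrum = fft(frame)
    half = len(frame) // 2
    magnitudes = [abs(c) for c in spectrum[1:half]]
    k = magnitudes.index(max(magnitudes)) + 1
    return k * sample_rate / len(frame)


def min_max_frequency(frames, sample_rate):
    """Lowest and highest peak frequency seen across a sequence of frames."""
    peaks = [peak_frequency(f, sample_rate) for f in frames]
    return min(peaks), max(peaks)


def classify(f_min, f_max, spread_threshold_hz=20.0):
    """Hypothetical one-rule stand-in for the trained decision tree."""
    return "express" if (f_max - f_min) > spread_threshold_hz else "local"


def tone(freq, sample_rate=8000, n=1024):
    """Synthetic test signal: a pure sine wave."""
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]


# Two synthetic frames imitating a whistle that drops in pitch as it passes.
low, high = min_max_frequency([tone(460), tone(420)], 8000)
label = classify(low, high)
```

The frequency resolution here is `sample_rate / n` (about 7.8 Hz for these values), so real whistle recordings would likely need longer frames or a proper pitch estimator, and the classifier would be learned from labeled passes instead of being hard-coded.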
16.
Data Science: Image
Real-Time:
• The ORB algorithm (OpenCV) is used to detect the train in the image
• Sends results to the Event Detector to identify the train and compare it to the schedule
• Event Detector updates the APP with the train’s status and alerts if late
[Figure: Number of Key Points That Are the Same in Two Consecutive Images. Axes: Time (sec), Number of Matching Points]
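The "compare to schedule, alert if late" step might look something like the sketch below. The slides do not describe how a detection is matched to a scheduled train, so the nearest-time rule and the five-minute tolerance are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta


def match_to_schedule(detected_at, scheduled_times):
    """Return the scheduled time closest to the detection, and the delay."""
    nearest = min(scheduled_times, key=lambda t: abs(detected_at - t))
    return nearest, detected_at - nearest


def is_late(delay, tolerance=timedelta(minutes=5)):
    """Hypothetical lateness rule: later than the tolerance counts as late."""
    return delay > tolerance


# Made-up schedule for one observation post.
schedule = [
    datetime(2014, 6, 3, 8, 0),
    datetime(2014, 6, 3, 8, 30),
    datetime(2014, 6, 3, 9, 0),
]
scheduled, delay = match_to_schedule(datetime(2014, 6, 3, 8, 37), schedule)
late = is_late(delay)
```

Nearest-time matching becomes ambiguous once a train is delayed by more than half the gap between scheduled trains, which is one reason additional signals such as the express/local classification are useful for identifying which train was seen.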
17.
Data Science: Text
Batch:
• Update baseline tweet frequencies for each hour as additional historical data is collected
• Store model parameters in HBase
Real-Time:
• Count tweets as they stream through the topology
• Alert based on frequency deviations from the baseline
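A minimal sketch of the batch/real-time split for tweet counts follows. The slides only say that alerts are based on deviations from an hourly baseline. The specific rule used here (mean plus `k` standard deviations, with a floor of 1.0 on the deviation) and the value of `k` are assumptions, and the history is made up.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(history):
    """Batch step: (hour_of_day, tweet_count) pairs -> {hour: (mean, stdev)}."""
    by_hour = defaultdict(list)
    for hour, count in history:
        by_hour[hour].append(count)
    return {h: (mean(c), pstdev(c)) for h, c in by_hour.items()}


def is_anomalous(baseline, hour, count, k=3.0):
    """Real-time step: flag counts well above the baseline for that hour."""
    if hour not in baseline:
        return False  # no history for this hour yet
    avg, sd = baseline[hour]
    return count > avg + k * max(sd, 1.0)


# Made-up hourly counts, for illustration only.
history = [(8, 10), (8, 12), (8, 14), (17, 20), (17, 22)]
baseline = build_baseline(history)
alert = is_anomalous(baseline, 8, 40)
```

In the architecture described above, the baseline dictionary plays the role of the model parameters stored in HBase, and `is_anomalous` would run inside the streaming topology as counts arrive.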
18.
Baseline Calculation
[Figure label: Baseline]
19.
Future Work
• Detect direction of train in image processing
• Use natural language processing on Twitter data for the event detector
• Continue evaluation of analytical frameworks for model computation
• Add observation posts
• Release Caltrain Rider Application
20.
COMING SOON: CALTRAIN RIDER APP
• Find out what train to catch using our ‘Ride Now’ view
• Select a train and see when it should reach each stop in a trip detail view.
• For more info: www.svds.com/trains
21.
Questions
Yes, We’re Hiring
www.svds.com/join-us
22.
THANK YOU
John @BigDataAnalysis
Tatsiana @notrockstar
Editor's notes:
When a train is detected, the information is sent to HBase and to the Event Detector.
The camera has a network connection, so we can drop images via wget to the local server.
Label wget. Add API setup.