3. 3
• Open your mobile phone’s browser & navigate to
http://snowflake.talend.live
• Enter the session code only and click Submit; do not continue
Setup
4. 4
• Open your mobile phone’s browser & navigate to
http://devicemotion.xyz
• Enter the session code only and click Submit; do not continue
To participate:
5. 5
• Enter your first name only (no spaces or special characters)
• Don’t click Submit until instructed
Setup
7. 7
• JavaScript reads devicemotion events
• Stream micro-batches to REST service
• REST service sends data to Kafka
• Spark Streaming reads from Kafka
• Apply Machine Learning to classify activity
• Load into Data Warehouse
• Visualization data obtained from REST service
How Are We Collecting?
8. 8
• It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
• It lets you store streams of records in a fault-tolerant way.
• It lets you process streams of records as they occur.
Distributed Streaming Platform
Kafka Background
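The three capabilities listed for Kafka can be illustrated with a toy, in-memory append-only log. This is only a sketch of the idea, not the Kafka client API: the class `ToyLog`, its methods, and the topic name `motion` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class ToyLog:
    """Toy in-memory stand-in for a topic log (hypothetical, not the Kafka API)."""

    def __init__(self):
        # topic name -> append-only list of records
        self._topics = defaultdict(list)

    def publish(self, topic, record):
        """Append a record and return its offset within the topic."""
        log = self._topics[topic]
        log.append(record)
        return len(log) - 1

    def read(self, topic, offset=0):
        """Return all records stored at or after the given offset."""
        return list(self._topics[topic][offset:])


log = ToyLog()
log.publish("motion", {"aX": 1.4, "aY": 0.9, "aZ": 3.1})
log.publish("motion", {"aX": 0.0, "aY": 0.01, "aZ": -0.01})

# Two independent subscribers read the same stored records from their own offsets.
subscriber_a = log.read("motion", offset=0)
subscriber_b = log.read("motion", offset=1)
```

Because records stay in the log after being read, each subscriber tracks its own offset and can re-read older records, which is the main difference from a queue that removes a message once it is consumed.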
9. 9
• Fast and general engine for large-scale data processing
• Developed in response to processing limitations with MapReduce
• Up to 10x faster than MapReduce on disk
• Up to 100x faster than MapReduce in memory
• Has a stack of libraries including Spark Streaming & MLlib (machine learning)
• Runs everywhere: on Hadoop or standalone
Spark Background
10. 10
• A university study on gait (walking) characteristics based on smartphone sensors proposed that each individual has a unique walking signature
• A heat trace for three individuals reveals their unique signatures
Biometric Gait Signature
1 http://www.mdpi.com/2073-8994/8/10/100
2 http://kyrandale.com/viz/d3-smartphone-walking.html
11. 11
A Single Sensor
InvenSense MPU-6500 (Galaxy S6)
• Single chip (3 mm x 3 mm x 0.9 mm) integrates a 3-axis accelerometer and a 3-axis gyroscope
• For comparison
18mm 3mm
12. 12
Linear Acceleration
• Shows forces measured by the accelerometer that are not caused by gravity
• The x, y and z axes show the direction of the force
• As you hold a phone looking at the screen…
• x is relative to the left and right sides
• y is relative to the up and down sides
• z is relative to the front and back sides
• If the phone is still, the linear acceleration values should all be close to 0
• If you move the phone, the values show in real time how much force is applied to it, expressed as acceleration
What Are We Collecting?
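As a rough illustration of the "close to 0 when still" rule, the overall strength of one reading can be computed as the length of its (x, y, z) vector. The function names and the 0.5 m/s² threshold below are assumptions made for this sketch, not values from the talk.

```python
import math


def magnitude(ax, ay, az):
    """Euclidean norm of one linear-acceleration reading, in m/s^2."""
    return math.sqrt(ax * ax + ay * ay + az * az)


def is_still(ax, ay, az, threshold=0.5):
    """True when the reading is near zero; the threshold is an assumed value."""
    return magnitude(ax, ay, az) < threshold


# Illustrative readings: one near zero, one with large components.
still_at_rest = is_still(0.0, 0.01, -0.01)
still_in_motion = is_still(-4.1, 8.07, -16.36)
```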
13. 13
• The devicemotion event is fired at a regular interval and indicates the amount of physical force of acceleration the device is receiving at that time
• The data is sent in JSON payloads every 250 events (~5 seconds, or roughly 50 events per second):
JavaScript devicemotion Events
"motionData":[
{
"client_ip":"127.0.0.1",
"timestamp":"1723452955",
"aX":"1.4",
"aY":"0.9",
"aZ":"3.1",
"user_name":"Name"
},
...
]
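The batching described on this slide can be sketched in Python. The browser side in the talk is JavaScript, so this only illustrates the payload shape and the 250-event grouping. The function `build_payloads` and the synthetic events are hypothetical; the field names follow the sample payload.

```python
import json


def build_payloads(events, batch_size=250):
    """Group events into JSON strings, one per full batch of `batch_size` events."""
    payloads = []
    for start in range(0, len(events) - batch_size + 1, batch_size):
        batch = events[start:start + batch_size]
        payloads.append(json.dumps({"motionData": batch}))
    return payloads


# 600 synthetic events (made-up values) yield two full batches; the remaining
# 100 events would wait until the next batch fills.
events = [
    {"client_ip": "127.0.0.1", "timestamp": str(1723452955 + i // 50),
     "aX": "1.4", "aY": "0.9", "aZ": "3.1", "user_name": "Name"}
    for i in range(600)
]
payloads = build_payloads(events)
```

Sending one request per 250 events instead of one per event keeps the number of REST calls at about one every five seconds per phone, at the cost of a few seconds of latency.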
14. 14
Deduplication & Matching using Machine Learning to Scale to Big Data
Data Quality with Machine Learning
• ALL DATA: single data set with duplicates
• Sampling: SAMPLE taken from all data
• Manual labeling: “is this a duplicate?” yes/no
• Training set: train model
• Run model (Random Forests)
• Prediction of potential duplicates
Continuous learning: the more data, the better the system learns
15. 15
• Linear acceleration on x, y, z axes (m/s²)
• Data classified into 3 categories
• Resting
• Walking
• Running
• Approximately 450 events
Training Data
aX,aY,aZ,label
-4.1,8.07,-16.36,running
-2.34,9.69,-0.33,running
0.0,0.01,-0.01,resting
-2.38,-0.54,0.65,walking
-0.7,12.93,-4.91,running
-3.3,-0.89,5.27,walking
1.85,-1.37,-0.73,walking
0.01,0.0,0.0,resting
…
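A minimal sketch of loading rows in this format with Python's `csv` module, using only the eight rows shown on the slide. The function name `load_training_rows` is hypothetical.

```python
import csv
import io
from collections import Counter

SAMPLE = """aX,aY,aZ,label
-4.1,8.07,-16.36,running
-2.34,9.69,-0.33,running
0.0,0.01,-0.01,resting
-2.38,-0.54,0.65,walking
-0.7,12.93,-4.91,running
-3.3,-0.89,5.27,walking
1.85,-1.37,-0.73,walking
0.01,0.0,0.0,resting
"""


def load_training_rows(text):
    """Parse CSV text into ((aX, aY, aZ), label) pairs with float features."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        features = (float(rec["aX"]), float(rec["aY"]), float(rec["aZ"]))
        rows.append((features, rec["label"]))
    return rows


rows = load_training_rows(SAMPLE)
# Count how many examples each category has, to check class balance.
label_counts = Counter(label for _, label in rows)
```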
16. 16
• Train (encode) the model using the previous hand-labeled dataset
• Choose an appropriate algorithm for classification:
• Logistic Regression, Naïve Bayes, Decision Tree, Random Forest
• Validate the algorithm using k-fold cross-validation
Encoding and Validating a Model
aX,aY,aZ,label
-4.1,8.07,-16.36,running
-2.34,9.69,-0.33,running
0.0,0.01,-0.01,resting
-2.38,-0.54,0.65,walking
-0.7,12.93,-4.91,running
-3.3,-0.89,5.27,walking
1.85,-1.37,-0.73,walking
0.01,0.0,0.0,resting
…
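K-fold cross-validation can be sketched with the standard library alone. To stay dependency-free, the sketch swaps in a nearest-centroid classifier, which is not one of the algorithms listed on the slide. All function names are hypothetical, and the eight rows shown on the slide are far too few for a meaningful score, so the resulting accuracy is only a demonstration of the mechanics.

```python
import math


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; yields (train, test) index lists."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def fit_centroids(rows):
    """Mean feature vector per label (nearest-centroid stand-in classifier)."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for d, value in enumerate(features):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


def cross_validate(rows, k):
    """Average accuracy over k folds, training on k-1 folds each time."""
    scores = []
    for train, test in k_fold_indices(len(rows), k):
        centroids = fit_centroids([rows[i] for i in train])
        hits = sum(predict(centroids, rows[i][0]) == rows[i][1] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)


rows = [
    ((-4.1, 8.07, -16.36), "running"), ((-2.34, 9.69, -0.33), "running"),
    ((0.0, 0.01, -0.01), "resting"), ((-2.38, -0.54, 0.65), "walking"),
    ((-0.7, 12.93, -4.91), "running"), ((-3.3, -0.89, 5.27), "walking"),
    ((1.85, -1.37, -0.73), "walking"), ((0.01, 0.0, 0.0), "resting"),
]
accuracy = cross_validate(rows, k=4)
```

Each row is used for testing exactly once and for training in the other folds, so the averaged score estimates how the model behaves on data it has not seen.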
17. 17
5 Ways to Exploit Your Big Data
• Spark Streaming
• Batch & Real-Time
• In Memory
• Machine Learning
• 1 click code migration
• Analyze before acting
• Turn data into decisions, prescriptions & actions
• Leverage the latest technology
• Remove latency
• Exploit data as it arrives
19. 19
A Modern Big Data and Cloud Integration Platform
Data Fabric
• APPLICATION INTEGRATION
• CLOUD INTEGRATION
• METADATA MANAGEMENT
• DATA PREPARATION
• BIG DATA INTEGRATION
• MASTER DATA MANAGEMENT
20. 20
Big Data Architecture
• Check Authorization
• Get Software Updates & Publish Artifacts
• Store Metadata
• Store Users, Rights, Roles, Projects, Activity, Monitoring
• Send & Request Artifacts/Jobs
• Job Server can be inside or outside the cluster
• Setup deployment
21. 21
UNIFIED PLATFORM
BATCH STREAMING HADOOP SPARK MAPREDUCE
INGEST PROFILE CLEANSE PARSE COMPLEX DATA
MAPPING
DATA QUALITY METADATA MANAGEMENT DATA LINEAGE
DESIGN DEPLOY MANAGE
ON-PREMISES PUBLIC CLOUD PRIVATE CLOUD
DATA GOVERNANCE
CONTINUOUS DELIVERY
DEPLOYMENT
BIG DATA
INTEGRATION
Big Data
22. 22
Talend Development Environment
• Talend Studio
o Eclipse-based Design Environment
o Drag and Drop UI
o Distributed Teamwork / Collaboration
o Rich palette of connectors: 800+
• N-Tier Architecture
o Client: Talend Studio
o Project Server: Talend Administration Center
o ETL Server: Talend Runtime
• Talend Administration Center
o Define Users and Projects (LDAP Enabled)
o Deploy
o Schedule
o Recover Job execution
o Monitor
23. 23
Create High Quality Information
• Data Quality and Profiling
• Explore, profile and monitor data
• Parse, cleanse, standardize and reconcile data
• Match, enrich and certify data, then share it widely and securely
• Map any data source to your business context (customers, products, organizations, locations…)
• Data Masking
• Key Benefits
• More accurate information
• Regulatory compliance
24. 24
Talend Data Preparation
The first unified integration platform for governed, self-service data preparation
• Self-service data access & cleansing
+ Enterprise scale through Talend Data Fabric
+ Collaboration and sharing across teams
+ IT governs data usage with role-based security
+ Turn ad-hoc data prep into fully managed DI processes
+ Ready for Big Data
LIVE DATA-SET
…and more
25. 25
The First Self-Service Data Quality Tool
Talend Data Stewardship App
Establish accountability and perfect data through teamwork
+ Engage everyone for data quality, not just data stewards
+ Point & click approach for curation and certification
+ Orchestrate data stewardship tasks as campaigns
+ Audit and track data error resolution actions