SlideShare une entreprise Scribd logo
1  sur  54
Télécharger pour lire hors ligne
Analytics in the age of
the Internet of Things
Ludwine Probst @nivdul
Thank you!
?
me
Data Engineer
@nivdul
nivdul.wordpress.com
Women in Tech
Duchess France
@duchessfr
duchess-france.org
Paris chapter Leader
Internet of Things (IoT)
more and more
aircraft
use case: sensor data from a cross-country flight
data points: several terabytes every hour per sensor
data analysis: batch mode or real time analysis
applications:
• flight performance (optimize plane fuel consumption,
• reduce maintenance costs…)
• detect anomalies
• prevent accidents
insurance
use case: data from a connected car key
applications:
• monitoring
• real time vehicle location
• drive safety
• driving score
Why should I care?
Because it can affect & change our business, our everyday life?
Collecting
Time series
112578291481000 -5.13
112578334541000 -5.05
112578339541000 -5.15
112578451484000 -5.48
112578491615000 -5.33
Some protocols…
• DDS – Device-to-Device communication – real-time
• MQTT – Device-to-Server – collect telemetry data
• XMPP – Device-to-Server – Instant Messaging scenarios
• AMQP – Server-to-Server – connecting devices to backend
…
Challenges
limited CPU
&
memory resources
low energy communication network
Storing
• flat file:
limited utility
• relational database:
limited design
rigidity
• NoSQL database:
scalability
faster & more flexible
Storing TS
IoT data pipeline
streaming
Storm
Now, the example!
http://www.cis.fordham.edu/wisdm/index.php
http://www.cis.fordham.edu/wisdm/includes/files/sensorKDD-2010.pdf
WISDM Lab’s study
The example
Goal: identify the physical activity that a user is performing
inspired by WISDM Lab’s study http://www.cis.fordham.edu/wisdm/index.php
The situation
The labeled data comes from an accelerometer (37 users)
Possible activities are:
walking, jogging, sitting, standing, downstairs and upstairs.
This is a classification problem here!
Some algorithms to use: Decision tree, Random Forest, Multinomial
logistic regression...
How can I predict the user’s
activity?
1. analyzing part:
collect & clean data from a csv file
store it in Cassandra
define & extract features using Spark
build the predictive model using MLlib
2. predicting part:
collect data in real-time (REST)
use the model to predict result
MLlib
https://github.com/nivdul/actitracker-cassandra-spark
Collect
&
store the data
The accelerometer
A sensor (in a smartphone)
compute acceleration over X,Y,Z
collect data every 50ms
Each acceleration contains:
• a timestamp (eg, 1428773040488)
• acceleration along the X axis (unit is m/s²)
• acceleration along the Y axis (unit is m/s²)
• acceleration along the Z axis (unit is m/s²)
Accelerometer Android app
REST Api collecting data coming from a phone application
Accelerometer Data Model
CREATE KEYSPACE actitracker WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
CREATE TABLE users (user_id int,
activity text,
timestamp bigint,
acc_x double,
acc_y double,
acc_z double,
PRIMARY KEY ((user_id,activity),timestamp));
COPY users FROM '/path_to_your_data/data.csv' WITH HEADER = true;
Accelerometer Data Model: logical view
8 walking 112578291481000 -5.13 8.15 1.31
8 walking 112578334541000 -5.05 8.16 1.31
8 walking 112578339541000 -5.15 8.16 1.36
8 walking 112578451484000 -5.48 8.17 1.31
8 walking 112578491615000 -5.33 8.16 1.18
activityuser_id
timestamp
acc_x acc_z
acc_y
graph from the Cityzen Data widget
Analyzing
https://github.com/nivdul/actitracker-cassandra-spark
is a large-scale in-memory data processing framework
• big data analytics in memory/disk
• complements Hadoop
• faster and more flexible
• Resilient Distributed Datasets (RDD)
interactive shell (scala & python)
Lambda
(Java 8)
Spark ecosystem
MLlib
• regression
• classification
• clustering
• optimization
• collaborative filtering
• feature extraction (TF-IDF, Word2Vec…)
is Apache Spark's scalable machine learning library
spark-cassandra-connector
Exposes Cassandra tables as Spark RDD
Identify features
repetitive static
VS
walking, jogging, up/down stairs standing, sitting
graph from the Cityzen Data widget
The activities : jogging
mean_x = 3.3
mean_y = -6.9
mean_z = 0.8
Y-axis: peaks spaced out
about 0.25 seconds
graph from the Cityzen Data widget
The activities : walking
mean_x = 1
mean_y = 10
mean_z = -0.3
Y-axis: peaks spaced
about 0.5 seconds
graph from the Cityzen Data widget
The activities : up/downstairs
Y-axis: peaks spaced about 0.75 seconds
graph from the Cityzen Data widget
up down
The activities : standing
graph from the Cityzen Data widget
standing
static activity: no peaks
sitting
The features
• Average acceleration (for each axis)
• Variance (for each axis)
• Average absolute difference (for each axis)
• Average resultant acceleration
• Average time between peaks (max) (for Y-axis)
Goal: compute these features for all the users (37) and activities (6) over few
seconds window
Clean
&
prepare the data
retrieve the data from Cassandra
// define Spark context
SparkConf sparkConf = new SparkConf()
.setAppName("User's physical activity recognition")
.set("spark.cassandra.connection.host", "127.0.0.1")
.setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
// retrieve data from Cassandra and create an CassandraRDD
CassandraJavaRDD<CassandraRow> cassandraRowsRDD =
javaFunctions(sc).cassandraTable("actitracker", "users");
Compute the features
MLlib
Feature: mean
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
import org.apache.spark.mllib.stat.Statistics;
private MultivariateStatisticalSummary summary;
public ExtractFeature(JavaRDD<Vector> data) {
this.summary = Statistics.colStats(data.rdd());
}
// return a Vector (mean_acc_x, mean_acc_y, mean_acc_z)
public Vector computeAvgAcc() {
return this.summary.mean();
}
Feature: avg time between peaks
// define the maximum using the max function from MLlib
double max = this.summary.max().toArray()[1];
// keep the timestamp of data point for which the value is greater than 0.9 * max
// and sort it !
// Here: data = RDD (ts, acc_y)
JavaRDD<Long> peaks = data.filter(record -> record[1] > 0.9 * max)
.map(record -> record[0])
.sortBy(time -> time, true, 1);
Feature: avg time between peaks
// retrieve the first and last element of the RDD (sorted)
Long firstElement = peaks.first();
Long lastElement = peaks.sortBy(time -> time, false, 1).first();
// compute the delta between each timestamp
JavaRDD<Long> firstRDD = peaks.filter(record -> record > firstElement);
JavaRDD<Long> secondRDD = peaks.filter(record -> record < lastElement);
JavaRDD<Vector> product = firstRDD.zip(secondRDD)
.map(pair -> pair._1() - pair._2())
// and keep it if the delta is != 0
.filter(value -> value > 0)
.map(line -> Vectors.dense(line));
// compute the mean of the delta
return Statistics.colStats(product.rdd()).mean().toArray()[0];
Choose algorithms
Random Forests
Decision Trees
Multiclass Logistic Regression
MLlib
Goal: identify the physical activity that a user is performing
Decision Trees
// Split data into 2 sets : training (60%) and test (40%)
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.6, 0.4});
JavaRDD<LabeledPoint> trainingData = splits[0].cache();
JavaRDD<LabeledPoint> testData = splits[1];
Decision Trees// Decision Tree parameters
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
int numClasses = 4;
String impurity = "gini";
int maxDepth = 9;
int maxBins = 32;
// create model
final DecisionTreeModel model = DecisionTree.trainClassifier(
trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins);
// Evaluate model on training instances and compute training error
JavaPairRDD<Double, Double> predictionAndLabel =
testData.mapToPair(p -> new Tuple2<>(model.predict(p.features()), p.label()));
Double testErrDT = 1.0 * predictionAndLabel.filter(pl -> !pl._1().equals(pl._2())).count() / testData.count();
// Save model
model.save(sc, "actitracker");
Results
Predictions
http://www.commitstrip.com/en/2014/04/08/the-demo-effect-dear-old-murphy/?setLocale=1
Accelerometer Android app
REST Api collecting data coming from a phone application
An example: https://github.com/MiraLak/accelerometer-rest-to-cassandra
Predictions!
// load the model saved before
DecisionTreeModel model = DecisionTreeModel.load(sc.sc(), "actitracker");
// connection between Spark and Cassandra using the spark-cassandra-connector
CassandraJavaRDD<CassandraRow> cassandraRowsRDD = javaFunctions(sc).cassandraTable("accelerations",
"acceleration");
// retrieve data from Cassandra and create an CassandraRDD
JavaRDD<CassandraRow> data = cassandraRowsRDD.select("timestamp", "acc_x", "acc_y", "acc_z")
.where("user_id=?", "TEST_USER")
.withDescOrder()
.limit(250);
Vector feature = computeFeature(sc);
double prediction = model.predict(feature);
How can I use my
computations?
possible applications:
• adapt the music over your speed
• detects lack of activity
• smarter pacemakers
• smarter oxygen therapy
Conclusion
• http://cassandra.apache.org/
• http://planetcassandra.org/getting-started-with-time-series-data-modeling/
• https://github.com/datastax/spark-cassandra-connector
• https://github.com/MiraLak/AccelerometerAndroidApp
• https://github.com/MiraLak/accelerometer-rest-to-cassandra
• https://spark.apache.org/docs/1.3.0/
• https://github.com/nivdul/actitracker-cassandra-spark
• http://www.duchess-france.org/analyze-accelerometer-data-with-apache-spark-and-mllib/
http://www.cis.fordham.edu/wisdm/index.php
Some references
Thank you!

Contenu connexe

Similaire à Analytics with Spark

Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandFrançois Garillot
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...InfluxData
 
Viktor Tsykunov: Azure Machine Learning Service
Viktor Tsykunov: Azure Machine Learning ServiceViktor Tsykunov: Azure Machine Learning Service
Viktor Tsykunov: Azure Machine Learning ServiceLviv Startup Club
 
Hadoop cluster performance profiler
Hadoop cluster performance profilerHadoop cluster performance profiler
Hadoop cluster performance profilerIhor Bobak
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...Chester Chen
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Kapacitor - Real Time Data Processing Engine
Kapacitor - Real Time Data Processing EngineKapacitor - Real Time Data Processing Engine
Kapacitor - Real Time Data Processing EnginePrashant Vats
 
Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Mark Smith
 
Dataservices: Processing (Big) Data the Microservice Way
Dataservices: Processing (Big) Data the Microservice WayDataservices: Processing (Big) Data the Microservice Way
Dataservices: Processing (Big) Data the Microservice WayQAware GmbH
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent MonitoringIntelie
 
112 portfpres.pdf
112 portfpres.pdf112 portfpres.pdf
112 portfpres.pdfsash236
 
Reactive programming every day
Reactive programming every dayReactive programming every day
Reactive programming every dayVadym Khondar
 
Fabric - Realtime stream processing framework
Fabric - Realtime stream processing frameworkFabric - Realtime stream processing framework
Fabric - Realtime stream processing frameworkShashank Gautam
 
Accelerating analytics on the Sensor and IoT Data.
Accelerating analytics on the Sensor and IoT Data. Accelerating analytics on the Sensor and IoT Data.
Accelerating analytics on the Sensor and IoT Data. Keshav Murthy
 
Example R usage for oracle DBA UKOUG 2013
Example R usage for oracle DBA UKOUG 2013Example R usage for oracle DBA UKOUG 2013
Example R usage for oracle DBA UKOUG 2013BertrandDrouvot
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionChetan Khatri
 

Similaire à Analytics with Spark (20)

Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
 
Viktor Tsykunov: Azure Machine Learning Service
Viktor Tsykunov: Azure Machine Learning ServiceViktor Tsykunov: Azure Machine Learning Service
Viktor Tsykunov: Azure Machine Learning Service
 
Hadoop cluster performance profiler
Hadoop cluster performance profilerHadoop cluster performance profiler
Hadoop cluster performance profiler
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
 
Kapacitor - Real Time Data Processing Engine
Kapacitor - Real Time Data Processing EngineKapacitor - Real Time Data Processing Engine
Kapacitor - Real Time Data Processing Engine
 
Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016
 
Dataservices: Processing (Big) Data the Microservice Way
Dataservices: Processing (Big) Data the Microservice WayDataservices: Processing (Big) Data the Microservice Way
Dataservices: Processing (Big) Data the Microservice Way
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent Monitoring
 
112 portfpres.pdf
112 portfpres.pdf112 portfpres.pdf
112 portfpres.pdf
 
Reactive programming every day
Reactive programming every dayReactive programming every day
Reactive programming every day
 
Fabric - Realtime stream processing framework
Fabric - Realtime stream processing frameworkFabric - Realtime stream processing framework
Fabric - Realtime stream processing framework
 
Accelerating analytics on the Sensor and IoT Data.
Accelerating analytics on the Sensor and IoT Data. Accelerating analytics on the Sensor and IoT Data.
Accelerating analytics on the Sensor and IoT Data.
 
Example R usage for oracle DBA UKOUG 2013
Example R usage for oracle DBA UKOUG 2013Example R usage for oracle DBA UKOUG 2013
Example R usage for oracle DBA UKOUG 2013
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 

Dernier

Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...gragchanchal546
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...HyderabadDolls
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...gajnagarg
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...HyderabadDolls
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制vexqp
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...SOFTTECHHUB
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...gajnagarg
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...kumargunjan9515
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxronsairoathenadugay
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 

Dernier (20)

Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 

Analytics with Spark

  • 1. Analytics in the age of the Internet of Things Ludwine Probst @nivdul
  • 4. Women in Tech Duchess France @duchessfr duchess-france.org Paris chapter Leader
  • 6.
  • 8. aircraft use case: sensor data from a cross-country flight data points: several terabytes every hour per sensor data analysis: batch mode or real time analysis applications: • flight performance (optimize plane fuel consumption, • reduce maintenance costs…) • detect anomalies • prevent accidents
  • 9. insurance use case: data from a connected car key applications: • monitoring • real time vehicle location • drive safety • driving score
  • 10. Why should I care? Because it can affect & change our business, our everyday life?
  • 12. Time series 112578291481000 -5.13 112578334541000 -5.05 112578339541000 -5.15 112578451484000 -5.48 112578491615000 -5.33
  • 13. Some protocols… • DDS – Device-to-Device communication – real-time • MQTT – Device-to-Server – collect telemetry data • XMPP – Device-to-Server – Instant Messaging scenarios • AMQP – Server-to-Server – connecting devices to backend …
  • 14. Challenges limited CPU & memory resources low energy communication network
  • 16. • flat file: limited utility • relational database: limited design rigidity • NoSQL database: scalability faster & more flexible Storing TS
  • 20. The example Goal: identify the physical activity that a user is performing inspired by WISDM Lab’s study http://www.cis.fordham.edu/wisdm/index.php
  • 21. The situation The labeled data comes from an accelerometer (37 users) Possible activities are: walking, jogging, sitting, standing, downstairs and upstairs. This is a classification problem here! Some algorithms to use: Decision tree, Random Forest, Multinomial logistic regression...
  • 22. How can I predict the user’s activity? 1. analyzing part: collect & clean data from a csv file store it in Cassandra define & extract features using Spark build the predictive model using MLlib 2. predicting part: collect data in real-time (REST) use the model to predict result MLlib https://github.com/nivdul/actitracker-cassandra-spark
  • 24. The accelerometer A sensor (in a smartphone) compute acceleration over X,Y,Z collect data every 50ms Each acceleration contains: • a timestamp (eg, 1428773040488) • acceleration along the X axis (unit is m/s²) • acceleration along the Y axis (unit is m/s²) • acceleration along the Z axis (unit is m/s²)
  • 25. Accelerometer Android app REST Api collecting data coming from a phone application
  • 26. Accelerometer Data Model CREATE KEYSPACE actitracker WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; CREATE TABLE users (user_id int, activity text, timestamp bigint, acc_x double, acc_y double, acc_z double, PRIMARY KEY ((user_id,activity),timestamp)); COPY users FROM '/path_to_your_data/data.csv' WITH HEADER = true;
  • 27. Accelerometer Data Model: logical view 8 walking 112578291481000 -5.13 8.15 1.31 8 walking 112578334541000 -5.05 8.16 1.31 8 walking 112578339541000 -5.15 8.16 1.36 8 walking 112578451484000 -5.48 8.17 1.31 8 walking 112578491615000 -5.33 8.16 1.18 activityuser_id timestamp acc_x acc_z acc_y graph from the Cityzen Data widget
  • 29. is a large-scale in-memory data processing framework • big data analytics in memory/disk • complements Hadoop • faster and more flexible • Resilient Distributed Datasets (RDD) interactive shell (scala & python) Lambda (Java 8) Spark ecosystem
  • 30. MLlib • regression • classification • clustering • optimization • collaborative filtering • feature extraction (TF-IDF, Word2Vec…) is Apache Spark's scalable machine learning library
  • 32. Identify features repetitive static VS walking, jogging, up/down stairs standing, sitting graph from the Cityzen Data widget
  • 33. The activities : jogging mean_x = 3.3 mean_y = -6.9 mean_z = 0.8 Y-axis: peaks spaced out about 0.25 seconds graph from the Cityzen Data widget
  • 34. The activities : walking mean_x = 1 mean_y = 10 mean_z = -0.3 Y-axis: peaks spaced about 0.5 seconds graph from the Cityzen Data widget
  • 35. The activities : up/downstairs Y-axis: peaks spaced about 0.75 seconds graph from the Cityzen Data widget up down
  • 36. The activities : standing graph from the Cityzen Data widget standing static activity: no peaks sitting
  • 37. The features • Average acceleration (for each axis) • Variance (for each axis) • Average absolute difference (for each axis) • Average resultant acceleration • Average time between peaks (max) (for Y-axis) Goal: compute these features for all the users (37) and activities (6) over few seconds window
  • 39.
  • 40. retrieve the data from Cassandra // define Spark context SparkConf sparkConf = new SparkConf() .setAppName("User's physical activity recognition") .set("spark.cassandra.connection.host", "127.0.0.1") .setMaster("local[*]"); JavaSparkContext sc = new JavaSparkContext(sparkConf); // retrieve data from Cassandra and create an CassandraRDD CassandraJavaRDD<CassandraRow> cassandraRowsRDD = javaFunctions(sc).cassandraTable("actitracker", "users");
  • 42. Feature: mean import org.apache.spark.mllib.stat.MultivariateStatisticalSummary; import org.apache.spark.mllib.stat.Statistics; private MultivariateStatisticalSummary summary; public ExtractFeature(JavaRDD<Vector> data) { this.summary = Statistics.colStats(data.rdd()); } // return a Vector (mean_acc_x, mean_acc_y, mean_acc_z) public Vector computeAvgAcc() { return this.summary.mean(); }
  • 43. Feature: avg time between peaks // define the maximum using the max function from MLlib double max = this.summary.max().toArray()[1]; // keep the timestamp of data point for which the value is greater than 0.9 * max // and sort it ! // Here: data = RDD (ts, acc_y) JavaRDD<Long> peaks = data.filter(record -> record[1] > 0.9 * max) .map(record -> record[0]) .sortBy(time -> time, true, 1);
  • 44. Feature: avg time between peaks // retrieve the first and last element of the RDD (sorted) Long firstElement = peaks.first(); Long lastElement = peaks.sortBy(time -> time, false, 1).first(); // compute the delta between each timestamp JavaRDD<Long> firstRDD = peaks.filter(record -> record > firstElement); JavaRDD<Long> secondRDD = peaks.filter(record -> record < lastElement); JavaRDD<Vector> product = firstRDD.zip(secondRDD) .map(pair -> pair._1() - pair._2()) // and keep it if the delta is != 0 .filter(value -> value > 0) .map(line -> Vectors.dense(line)); // compute the mean of the delta return Statistics.colStats(product.rdd()).mean().toArray()[0];
  • 45. Choose algorithms Random Forests Decision Trees Multiclass Logistic Regression MLlib Goal: identify the physical activity that a user is performing
  • 46. Decision Trees // Split data into 2 sets : training (60%) and test (40%) JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.6, 0.4}); JavaRDD<LabeledPoint> trainingData = splits[0].cache(); JavaRDD<LabeledPoint> testData = splits[1];
  • 47. Decision Trees// Decision Tree parameters Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>(); int numClasses = 4; String impurity = "gini"; int maxDepth = 9; int maxBins = 32; // create model final DecisionTreeModel model = DecisionTree.trainClassifier( trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins); // Evaluate model on training instances and compute training error JavaPairRDD<Double, Double> predictionAndLabel = testData.mapToPair(p -> new Tuple2<>(model.predict(p.features()), p.label())); Double testErrDT = 1.0 * predictionAndLabel.filter(pl -> !pl._1().equals(pl._2())).count() / testData.count(); // Save model model.save(sc, "actitracker");
  • 50. Accelerometer Android app REST Api collecting data coming from a phone application An example: https://github.com/MiraLak/accelerometer-rest-to-cassandra
  • 51. Predictions! // load the model saved before DecisionTreeModel model = DecisionTreeModel.load(sc.sc(), "actitracker"); // connection between Spark and Cassandra using the spark-cassandra-connector CassandraJavaRDD<CassandraRow> cassandraRowsRDD = javaFunctions(sc).cassandraTable("accelerations", "acceleration"); // retrieve data from Cassandra and create an CassandraRDD JavaRDD<CassandraRow> data = cassandraRowsRDD.select("timestamp", "acc_x", "acc_y", "acc_z") .where("user_id=?", "TEST_USER") .withDescOrder() .limit(250); Vector feature = computeFeature(sc); double prediction = model.predict(feature);
  • 52. How can I use my computations? possible applications: • adapt the music over your speed • detects lack of activity • smarter pacemakers • smarter oxygen therapy
  • 54. • http://cassandra.apache.org/ • http://planetcassandra.org/getting-started-with-time-series-data-modeling/ • https://github.com/datastax/spark-cassandra-connector • https://github.com/MiraLak/AccelerometerAndroidApp • https://github.com/MiraLak/accelerometer-rest-to-cassandra • https://spark.apache.org/docs/1.3.0/ • https://github.com/nivdul/actitracker-cassandra-spark • http://www.duchess-france.org/analyze-accelerometer-data-with-apache-spark-and-mllib/ http://www.cis.fordham.edu/wisdm/index.php Some references Thank you!