MongoDB World 2016: The Best IoT Analytics with MongoDB
1. The Best IoT Analytics with MongoDB
Jake Angerman
Sr. Solutions Architect
MongoDB
2. Track Overview
Sessions:
1. Building an IoT Application that Will Work Next Year ✔
2. Building IoT Applications the Right Way ✔
3. The Best IoT Analytics with MongoDB
6. #MDBW16
Tin Can Reveal
homemade antenna (6.9cm quarter-wave whip)
NooElec NESDR Mini 2 SDR $23.00
USB extension cable $10.00
RF cable RG316 female to MCX male $5.50
tin can $2.87
Total: $41.37
6.9cm antenna
USB SDR
dump1090
8. #MDBW16
Antenna Range approximately 250 miles (400km)
> db.tincan.aggregate([
    { $geoNear: {
        // GeoJSON coordinate order is [longitude, latitude]
        near: { type: "Point", coordinates: [ center_y, center_x ] },
        distanceField: "meters",
        minDistance: 394289,
        limit: 100,
        spherical: true }},
    { $sort: { "meters": -1 } },
    { $limit: 1 }
  ])
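With `spherical: true`, the `meters` value written by `$geoNear` is a great-circle distance. As a rough illustration of that calculation, here is a haversine sketch. The function name and the 6,371 km mean Earth radius are assumptions for this example; MongoDB's own constant may differ slightly, so results will not match the server to the meter.

```python
import math

# Assumed mean Earth radius in meters; not necessarily the constant MongoDB uses.
EARTH_RADIUS_M = 6_371_000


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (longitude, latitude) points."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# 250 miles expressed in meters, for comparison with minDistance in the query.
RANGE_M = 250 * 1609.344
```

Arguments are given in the same longitude-first order that GeoJSON points use.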
10. #MDBW16
ADS-B BaseStation data format
MSG,7,111,11111,A3DC34,111111,2016/03/28,21:42:25.875,2016/03/28,21:42:25.865,,36975,,,,,,,,,,0
MSG,7,111,11111,A3DC34,111111,2016/03/28,21:42:25.884,2016/03/28,21:42:25.865,,36975,,,,,,,,,,0
MSG,8,111,11111,A33AA7,111111,2016/03/28,21:42:25.898,2016/03/28,21:42:25.865,,,,,,,,,,,,0
MSG,5,111,11111,A33AA7,111111,2016/03/28,21:42:25.961,2016/03/28,21:42:25.931,,28225,,,,,,,0,,0,0
MSG,3,111,11111,A678EF,111111,2016/03/28,21:42:26.013,2016/03/28,21:42:25.996,,34000,,,30.58369,-98.75438,,,,,,0
MSG,4,111,11111,A678EF,111111,2016/03/28,21:42:26.013,2016/03/28,21:42:25.996,,,417,283,,,0,,,,,0
MSG,3,111,11111,0D081C,111111,2016/03/28,21:42:26.280,2016/03/28,21:42:26.258,,35975,,,29.86456,-98.24018,,,,,,0
MSG,4,111,11111,0D081C,111111,2016/03/28,21:42:26.280,2016/03/28,21:42:26.258,,,429,206,,,0,,,,,0
MSG,8,111,11111,0D0648,111111,2016/03/28,21:42:26.358,2016/03/28,21:42:26.324,,,,,,,,,,,,0
MSG,3,111,11111,A678EF,111111,2016/03/28,21:42:26.454,2016/03/28,21:42:26.390,,34000,,,30.58389,-98.75544,,,,,,0
MSG,8,111,11111,A33AA7,111111,2016/03/28,21:42:26.478,2016/03/28,21:42:26.455,,,,,,,,,,,,0
MSG,7,111,11111,A678EF,111111,2016/03/28,21:42:26.679,2016/03/28,21:42:26.651,,34000,,,,,,,,,,0
MSG,7,111,11111,0D081C,111111,2016/03/28,21:42:26.759,2016/03/28,21:42:26.717,,35975,,,,,,,,,,0
Callouts: message type, ICAO hex, date & time stamp, altitude, speed, lat/long
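The deck does not show how these lines are parsed. A minimal sketch follows, assuming the usual 22-field BaseStation layout (callsign in field 11, altitude 12, ground speed 13, track 14, latitude 15, longitude 16, vertical rate 17) and the short key names that appear in later slides (`t`, `a`, `s`, `b`, `v`, `p`). The function name `parse_basestation` and the choice of the "generated" timestamp are hypothetical.

```python
from datetime import datetime


def parse_basestation(line):
    """Turn one BaseStation MSG line into a dict, leaving out empty fields."""
    f = line.strip().split(",")
    if len(f) < 22 or f[0] != "MSG":
        return None
    doc = {
        "msg_type": int(f[1]),
        "icao": f[4],
        # f[6] and f[7] hold the date and time the message was generated
        "t": datetime.strptime(f[6] + " " + f[7], "%Y/%m/%d %H:%M:%S.%f"),
    }
    if f[10].strip():
        doc["callsign"] = f[10].strip()
    # altitude, ground speed, track (bearing), vertical rate
    for key, idx in (("a", 11), ("s", 12), ("b", 13), ("v", 16)):
        if f[idx]:
            doc[key] = int(f[idx])
    if f[14] and f[15]:
        # stored longitude first, as in the JSON documents in this deck
        doc["p"] = [float(f[15]), float(f[14])]
    return doc
```

Each message type carries only some of the fields, so a type 3 line yields altitude and position while a type 4 line yields speed, track, and vertical rate.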
11. #MDBW16
ADS-B in JSON
{
"timestamp" : ISODate("2016-01-31T20:54:35.000+0000"),
"icao" : "AC4144",
"callsign" : "N889WM",
"altitude" : 9350,
"bearing" : 150,
"position" : [-98.62762, 30.03657],
"ground_speed" : 152,
"vertical_rate" : 192
}
13. #MDBW16
dump1090 data flow
dump1090 (linked list in RAM) exposes two outputs:
HTTP :8080 (AJAX JSON)
[{"hex":"ac741c", "squawk":"6234", "flight":"AAL2417 ",
"lat": 30.619176, "lon":-97.755963, "validposition":1,
"altitude":35975, "vert_rate":0,"track":202, "validtrack":1,
"speed":438, "messages":557, "seen":0}]
BaseStation TCP :30003
MSG,7,111,11111,A3DC34,111111,2016/03/28,21:42:25.875,2016/03/28,21:42:25.865,,36975
BaseStation TCP :30003 -> ingest.py -> TCP -> MongoDB
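The deck does not include `ingest.py` itself. One detail any reader of the port 30003 stream has to handle is that TCP reads do not align with message boundaries. The sketch below, with hypothetical helper names, buffers raw chunks and yields only complete lines.

```python
def iter_lines(chunks, encoding="ascii"):
    """Yield complete text lines from an iterable of byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Emit every finished line; keep the unfinished tail in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.rstrip(b"\r")
            if line:
                yield line.decode(encoding, errors="replace")


def socket_chunks(sock, size=4096):
    """Read byte chunks from a connected socket until the peer closes it."""
    while True:
        data = sock.recv(size)
        if not data:
            return
        yield data
```

With `socket.create_connection(("localhost", 30003))` as the source, each yielded line could be parsed into a document and passed to `insert_one` on a pymongo collection. A partial line left in the buffer when the stream ends is dropped.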
14. #MDBW16
What Types of Analytics Can We Do?
• Real-time dashboards (<1 second latency) = Aggregation framework
• Ad-hoc queries = Aggregation framework
• Historical Reports = Aggregation framework or BI Connector
• Batch processing = Hadoop
• Machine Learning = Spark
17. #MDBW16
Analytics without Data Migration
[Diagram: Devices, Database, Dashboards, Historical Analysis]
• No bulk or incremental ETL required
• One language for both real-time and ad-hoc queries
21. #MDBW16
dump1090 dashboard
[Diagram: the dump1090 data flow from the earlier slide (HTTP :8080 AJAX JSON, BaseStation TCP :30003, ingest.py, MongoDB), now with the WT cache shown]
22. #MDBW16
Real-time Dashboards
• Current Radar, last 5 minutes' worth of aircraft data
• pipeline (Python):
import datetime
pipeline = [
    # $match comes first so it can use the index
    {"$match": {"t": {"$gte": datetime.datetime.utcnow() - datetime.timedelta(minutes=5)}}},
    {"$sort": {"icao": 1, "t": 1}},
    # $push pre-builds the per-aircraft array, avoiding clumsy looping in the application
    {"$group": {"_id": {"icao": "$icao"},
                "events": {"$push": {"flight": "$callsign", "altitude": "$a", "track": "$b",
                                     "speed": "$s", "lon": {"$arrayElemAt": ["$p", 0]},
                                     "lat": {"$arrayElemAt": ["$p", 1]}, "vert_rate": "$v"}},
                "sum": {"$sum": 1}}},
    {"$project": {"_id": 0, "icao": "$_id.icao", "events": "$events", "sum": "$sum"}},
]
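For comparison, the application-side looping that the `$group`/`$push` stage replaces might look roughly like the sketch below. The function name is hypothetical, and it only approximates the pipeline: fields missing from a document appear here as `None`, whereas the server would omit them from the pushed subdocument.

```python
from collections import defaultdict


def group_events(docs):
    """Client-side approximation of the $sort, $group/$push and $project stages."""
    by_icao = defaultdict(list)
    for d in sorted(docs, key=lambda d: (d["icao"], d["t"])):
        p = d.get("p") or [None, None]
        by_icao[d["icao"]].append({
            "flight": d.get("callsign"),
            "altitude": d.get("a"),
            "track": d.get("b"),
            "speed": d.get("s"),
            "lon": p[0],
            "lat": p[1],
            "vert_rate": d.get("v"),
        })
    return [{"icao": icao, "events": events, "sum": len(events)}
            for icao, events in by_icao.items()]
```

Doing this in the database means only the last five minutes of grouped data crosses the network, already in the shape the dashboard needs.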
23. #MDBW16
Ad hoc aggregations
Which aircraft has the most observations?
> db.tincan.aggregate([
    { $group: {
        _id: "$icao",
        "sum": { $sum: 1 },
        "callsigns": { "$addToSet": "$callsign" } }},
    { $sort: { "sum": -1 } },
    { $limit: 1 }
  ])
Sample document:
{
  "_id": ObjectId("5755..."),
  "icao": "ADE201",
  "callsign": "N994FE",
  "a": 8600,
  "b": 104,
  "p": [-98.99888, 30.93031],
  "s": 164,
  "t": ISODate("2016-02-09T02:33:01Z")
}
24. #MDBW16
Which aircraft has the most observations?
"result": [
{
"_id": "ADE201",
"sum": 14373,
"callsigns": [
"N994FE"
]
}
]
25. #MDBW16
ICAO aircraft collection
$ mongoimport -d adsb -c aircraft --type csv --headerline aircraft_db.csv
icao,regid,mdl,type,operator
000334,PU-PLS,ULAC,EDRA SUPER PETREL LS,PRIVATE OWNER
000D77,PU-VGA,WT9,WT-9 DYNAMIC,PRIVATE OWNER
000D82,PU-DCT,WT9,AEROSPOOL WT9 DYNAMIC,PRIVATE OWNER
001100,-,320,UNKNOWN / VARIOUS,CODE USED BY SEVERAL AIRCRAFT
001108,EJC-1108,AC90,GULFSTREAM 690D,EJERCITO DE COLOMBIA
001411,PU-BGC,RV9,AMATEUR VANS RV-9A,PRIVATE OWNER
002008,LV-S004,P208,TECNAM P-2008,PRIVATE OWNER
003106,PU-FUA,ULAC,AMATEUR GFLY,PRIVATE OWNER
004003,Z-WPB,B732,BOEING 737-2N0,AIR ZIMBABWE
...
26. #MDBW16
$lookup to find aircraft model
> db.tincan.aggregate([
    { $group: {
        _id: "$icao",
        "sum": { $sum: 1 },
        "callsigns": { "$addToSet": "$callsign" } }},
    { $sort: { "sum": -1 } },
    { $limit: 1 },
    { $lookup: {
        from: "aircraft",
        localField: "_id",
        foreignField: "icao",
        as: "description" }}
  ])
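`$lookup` behaves like a left outer join: every input document is kept, and the matching documents from the other collection are attached as an array. A small in-memory sketch of that behavior follows; the function and its parameter names are illustrative, not part of any MongoDB driver.

```python
def lookup(docs, foreign_docs, local_field, foreign_field, as_field):
    """Attach to each doc the list of foreign docs whose foreign_field matches."""
    index = {}
    for f in foreign_docs:
        index.setdefault(f.get(foreign_field), []).append(f)
    # Unmatched documents are kept, with an empty list under as_field.
    return [dict(d, **{as_field: index.get(d.get(local_field), [])})
            for d in docs]


top = [{"_id": "ADE201", "sum": 14373, "callsigns": ["N994FE"]}]
aircraft = [
    {"icao": "ADE201", "regid": "N994FE", "mdl": "C208",
     "type": "CESSNA 208B GRAND CARAVAN"},
    {"icao": "004003", "regid": "Z-WPB", "mdl": "B732",
     "type": "BOEING 737-2N0"},
]
joined = lookup(top, aircraft, "_id", "icao", "description")
```

Because the pipeline places `$limit: 1` before `$lookup`, the join runs for a single document rather than for every aircraft.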
27. #MDBW16
$lookup to find aircraft model
"result": [
  {
    "_id": "ADE201",
    "sum": 14373,
    "callsigns": [
      "N994FE"
    ],
    "description": [
      {
        "_id": ObjectId("575074300cf625050f2e730e"),
        "icao": "ADE201",
        "regid": "N994FE",
        "mdl": "C208",
        "type": "CESSNA 208B GRAND CARAVAN"
      }
    ]
  }
]
29. #MDBW16
Which aircraft is seen on the greatest number of days?
> db.tincan.aggregate([
    { $group: {
        _id: { icao: "$icao",
               dayOfYear: { $dateToString: { format: "%Y%m%d", date: "$t" } } } }},
    { $group: {
        _id: "$_id.icao",
        sum: { $sum: 1 } }},
    { $sort: { "sum": -1 } },
    { $limit: 1 },
    { $lookup: {
        from: "aircraft",
        localField: "_id",
        foreignField: "icao",
        as: "description" }}
  ])
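The query uses two `$group` stages: the first collapses all observations into one entry per (aircraft, day) pair, and the second counts those entries per aircraft. The same idea in plain Python, as an illustrative sketch with made-up sample documents:

```python
from collections import Counter
from datetime import datetime


def days_seen(docs):
    """Count the distinct calendar days on which each ICAO address appears."""
    # First $group: one entry per distinct (icao, day) pair.
    pairs = {(d["icao"], d["t"].strftime("%Y%m%d")) for d in docs}
    # Second $group: number of distinct days per icao.
    return Counter(icao for icao, _day in pairs)


# Hypothetical observations, not taken from the real data set.
sample = [
    {"icao": "A35969", "t": datetime(2016, 2, 8, 5, 0)},
    {"icao": "A35969", "t": datetime(2016, 2, 8, 9, 0)},
    {"icao": "A35969", "t": datetime(2016, 2, 9, 1, 0)},
    {"icao": "ADE201", "t": datetime(2016, 2, 8, 5, 0)},
]
counts = days_seen(sample)
```

Counting observations directly would favor aircraft that linger in range; de-duplicating by day first measures how regularly an aircraft returns.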
30. #MDBW16
Which aircraft is seen on the greatest number of days?
"result": [
  {
    "_id": "A35969",
    "sum": 63,
    "description": [
      {
        "_id": ObjectId("5762e9cf6ecfc147a0503894"),
        "icao": "A35969",
        "regid": "N315AE",
        "mdl": "B06",
        "type": "BELL 206L-1 LONGRANGER II",
        "operator": "AIR EVAC EMS"
      }
    ]
  }
]
33. #MDBW16
BI Connector
• New in MongoDB 3.2 Enterprise Advanced
• Mapping and transformation layer
• Projects smaller parts of large data sets for reporting
34. #MDBW16
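BI Connector Data flow
[Diagram: Application data stored as documents in MongoDB, e.g. {name: "Andrew", address: {street: …}}, passes through the BI Connector, which uses mapping metadata to present each Document as a Table for Analytics & visualization. MongoDB Query Language is used on the MongoDB side and SQL on the analytics side.]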
39. #MDBW16
Altitude vs Speed
• Two predictable clusters:
  • turbine aircraft at cruising altitude
  • piston aircraft at lower altitude
• Outliers are Cessnas reporting 51,000+ ft
41. #MDBW16
Spark Overview
• fast, general data processing engine
• interactive shell
• Scala, Java, Python
• machine learning libraries (MLlib)
• supports streaming
• HDFS not required
43. #MDBW16
Spark Connector Diagram
MongoDB Connector for Hadoop (with Spark Plug-in)
https://github.com/mongodb/mongo-hadoop
MongoDB Connector for Spark
https://github.com/mongodb/mongo-spark
44. #MDBW16
Spark Machine Learning
Supervised
Classification
• Naive Bayes
• Support Vector Machines
• Random Decision Forests
Regression
• Linear
• Logistic
Unsupervised
Clustering
• K-means
Dimensionality Reduction
• Principal Component Analysis
• Singular Value Decomposition
45. #MDBW16
K-Means Clustering
The K-Means algorithm aims to minimize the sum of squared distances between each point and the centroid of its cluster.
source: Lovro Iliassich, toptal.com
46. #MDBW16
K-Means Clustering
>>> mongo_rdd = sc.mongoRDD('mongodb://localhost:27017/adsb.tincan')
OR specify a filter:
>>> input_conf = {
...     "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
...     "mongo.input.uri": "mongodb://localhost:27017/adsb.tincan",
...     "mongo.input.query": '{"t":{"$lte":{"$date":1455494400000}}}' }
>>> mongo_rdd = sc.newAPIHadoopRDD(inputFormatClassName, keyClassName, valueClassName, None, None, input_conf)
48. #MDBW16
K-Means Clustering
>>> mongo_rdd = sc.mongoRDD('mongodb://localhost:27017/adsb.tincan')
>>> mongo_rdd.first()
{u'icao': u'A06690', u'a': 11975, u'b': 150, u'_id': ObjectId('5755bb862355da56d87895cf'), u't': datetime.datetime(2016, 2, 8, 5, 25, 4), u'p': [-98.41437, 30.29066], u's': 285, u'v': -1152}
>>> parsed_rdd = mongo_rdd.map(parseData)
>>> parsed_rdd.first()
[5, 25, 4, 1, 11975, 150, 285, -1152, -98.14857, 30.92651]
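The deck never shows `parseData`. The sketch below is a guess that reproduces the first eight values of the output above from the sample document: hour, minute, second, ISO weekday (Monday = 1), then altitude, bearing, speed, and vertical rate, followed by longitude and latitude. The slide's printed longitude and latitude differ from the sample document, so those two values are not reproduced. The name `parse_data` and the field order are assumptions.

```python
import datetime


def parse_data(doc):
    """Flatten one observation into a numeric feature vector (assumed layout)."""
    t = doc["t"]
    lon, lat = doc["p"]
    return [t.hour, t.minute, t.second, t.isoweekday(),
            doc["a"], doc["b"], doc["s"], doc["v"], lon, lat]


# The document printed by mongo_rdd.first() on the slide, minus _id.
sample = {
    "icao": "A06690", "a": 11975, "b": 150,
    "t": datetime.datetime(2016, 2, 8, 5, 25, 4),
    "p": [-98.41437, 30.29066], "s": 285, "v": -1152,
}
features = parse_data(sample)
```

K-means needs fixed-length numeric vectors, which is why the timestamp is split into components and the identifier fields are dropped.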
49. #MDBW16
Choosing K
WSSSE = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
where S_i is the set of points assigned to cluster i and \mu_i is its centroid.
[Chart: Within Set Sum of Squared Error (WSSSE) versus k; x-axis k from 0 to 200, y-axis WSSSE from 0 to 14,000,000]
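To make the quantity on the chart concrete, here is a small pure-Python sketch of the within-set sum of squared errors: each point contributes its squared distance to the nearest centroid. The function name is illustrative. In Spark MLlib of that period, the trained model's `computeCost` method is, as far as I know, what reported this value.

```python
def wssse(points, centroids):
    """Sum over all points of the squared distance to the closest centroid."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c))
                     for c in centroids)
    return total


# Tiny made-up example: two points near one centroid, one point on the other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centroids = [(0.0, 1.0), (10.0, 0.0)]
cost = wssse(points, centroids)  # 1 + 1 + 0 = 2.0
```

WSSSE always falls as k grows, so the usual approach is to pick the k where the curve flattens out rather than the k with the lowest value.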
50. #MDBW16
Standard Scaling
z = \frac{x - \mu}{\sigma}
>>> parsed_rdd.first()
[5, 25, 4, 1, 11975, 150, 285, -1152, -98.14857, 30.92651]
>>> scaled_features.first()
[-1.036, -1.1089, -0.2617, 0.6821, -0.8202, 0.4057, 0.8537, -1.6502, -0.6559, 0.6876]
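Standard scaling puts every feature on a comparable scale, so that altitude in feet does not dominate the distance calculation over, say, the hour of day. A pure-Python sketch of the column-wise z-score follows. It assumes the sample standard deviation (dividing by n − 1) and maps constant columns to 0.0; both choices, and the function name, are assumptions rather than a description of what the deck's code did.

```python
import math


def standard_scale(rows):
    """Return rows with each column shifted to mean 0 and scaled to unit std."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Sample standard deviation per column (n - 1 in the denominator).
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
            for c, m in zip(cols, means)]
    return [[(x - m) / s if s else 0.0
             for x, m, s in zip(row, means, stds)]
            for row in rows]


scaled = standard_scale([[1.0, 10.0], [3.0, 10.0]])
```

The function needs at least two rows, since the sample standard deviation is undefined for a single observation.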
51. #MDBW16
K-Means Clustering
>>> from pyspark.mllib.clustering import KMeans
>>> k = 10
>>> clusters = KMeans.train(parsed_rdd, k, maxIterations=10, runs=1, initializationMode="random")
>>> cluster_sizes = parsed_rdd.map(lambda e: clusters.predict(e)).countByValue()
>>> cluster_sizes
defaultdict(<type 'int'>, {0: 70122, 1: 350890, 2: 118596, 3: 104609, 4: 254759, 5: 175840, 6: 166789, 7: 68309, 8: 147826, 9: 495102})
52. #MDBW16
Save Results Back to MongoDB
def labelData(array):
    result = {}
    result['cluster'] = clusters.predict(array)
    result['daystamp'] = str(array[0])
    result['dayofweek'] = array[1]
    result['hour'] = array[2]
    result['minute'] = array[3]
    result['second'] = array[4]
    result['a'] = array[5]
    result['b'] = array[6]
    result['s'] = array[7]
    result['v'] = array[8]
    result['p'] = [ array[9], array[10] ]
    return result
>>> labeled_rdd = parsed_rdd.map(labelData)
>>> labeled_rdd.saveToMongoDB('mongodb://localhost:27017/adsb.labeled')
53. #MDBW16
K-Means Clustering
>>> cluster_sizes
defaultdict(<type 'int'>, {0: 70122, 1: 350890, 2:
118596, 3: 104609, 4: 254759, 5: 175840, 6: 166789,
7: 68309, 8: 147826, 9: 495102})
Hypothesis: largest cluster #9 is cruising altitude
54. #MDBW16
Hypothesis: largest cluster #9 is cruising altitude
adsb> db.labeled.aggregate([
    { $match: { cluster: 9 } },
    { $group: {
        _id: "summary",
        "avg_alt": { $avg: "$a" },
        "min_alt": { $min: "$a" },
        "max_alt": { $max: "$a" } }}
  ])