7. Key Architectural Tiers
• Origin: Devices and Data Sources
• Transport: Orchestrating Bi-Directional Data Flow Between Sources
• Analytics: Analysis of Unbounded (Streaming) and Bounded
(Batch) Data, and Acting in Response
9. Origin Tier
• Where data is born, but also a destination
• Sensors and Devices
• Constrained Hubs/Gateways
10. Origin Tier
Devices are getting smaller, cheaper, and increasingly
network-enabled.
Examples:
• Raspberry Pi ($35, Full OS)
• ESP8266 (<$5 WiFi-enabled microcontroller)
11. Origin Tier
Devices in the Origin Tier both transmit and receive data.
• Command and Control
• Actuators (interaction with the physical environment)
• End user alerts and notifications
13. IoT Protocol Considerations
• Device-Device / Device-Gateway Communication
• Radio Frequency Protocols
• IP-based Protocols
14. IoT Protocol Considerations
Radio Frequency Protocols
• Typically for very resource-constrained devices (Ex: Wireless
sensors in a home security system)
• Usually involve an intermediary hub/gateway as a protocol bridge
(Ex: Main panel in a home security system)
• Short range
• Low Power
15. Radio Frequency Protocols
ZigBee
• Intended for low power applications (~2 yr. battery life)
• Low data rates
• Simpler and less expensive than other WPANs like Bluetooth
16. Radio Frequency Protocols
ZigBee
• Range: 10–100 meters LOS (between nodes, but messages can
hop in a mesh network)
• Data Rate: 250 kbit/s
• Supports Star, Tree, and Mesh network topologies
• Requires a coordinator device for every network (usually the hub/
gateway)
18. Radio Frequency Protocols
Z-Wave
• Range: ~30 meters LOS (between nodes, but messages can hop)
• Data Rate: 100 kbit/s
• Forms source-routed mesh networks (can route around failures/obstacles)
• Devices must be paired
• Requires a primary controller (e.g. the hub/gateway)
• Max 232 devices per network (but networks can be bridged)
19. Radio Frequency Protocols
Bluetooth/Bluetooth LE
• Targets wireless computer and device accessories
• High data rates
• Does not form routed networks like ZigBee and Z-Wave
• Usually one host to many device pairing
• Range: 0.5m (Class 4) - 100m (Class 1)
• Data Rate: 1 Mbit/s - 24 Mbit/s
20. Radio Frequency Protocols
Thread
• New wireless protocol introduced by Nest (Google/Alphabet), Samsung, ARM, Qualcomm
• Built on top of the same (IEEE 802.15.4) specification as ZigBee
• IPv6-based
• Mesh network with hops supported
• ~250 devices per network
• Very low power (purported years of operation on a single AA with deep sleep modes)
• Very new/unsure future — WiFi, Bluetooth, etc. already ubiquitous
22. IP-Based Protocols
CoAP - Constrained Application Protocol
• Designed to be used on microcontrollers with as little as 10 KB of
memory.
• Simple request/response protocol
• Much like HTTP but based on UDP
• Based on the REST model (GET, PUT, POST, DELETE)
• Strong security via DTLS (Datagram Transport Layer Security)
23. IP-Based Protocols
CoAP - Constrained Application Protocol
• Simple 4-byte header
• Subset of MIME types and HTTP response codes
• Data model agnostic
• one-to-one
• Layering: Transport (UDP) <- Base Messaging (simple Confirmable/
Non-Confirmable message transfer) <- REST Semantics
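To make the 4-byte header concrete, here is a minimal sketch of packing and unpacking it in Python. The field layout (2-bit version, 2-bit type, 4-bit token length, 8-bit code, 16-bit message ID) follows RFC 7252; the function and constant names are illustrative assumptions, not part of any CoAP library.

```python
import struct

COAP_VERSION = 1          # the only version defined by RFC 7252
TYPE_CONFIRMABLE = 0      # CON: receiver must acknowledge
TYPE_NON_CONFIRMABLE = 1  # NON: fire and forget
CODE_GET = 0x01           # request code 0.01 (GET)


def pack_header(msg_type: int, token_length: int, code: int, message_id: int) -> bytes:
    """Pack the fixed 4-byte CoAP header (hypothetical helper)."""
    first_byte = (COAP_VERSION << 6) | (msg_type << 4) | token_length
    # "!BBH": network byte order, two unsigned bytes, one unsigned 16-bit int
    return struct.pack("!BBH", first_byte, code, message_id)


def unpack_header(data: bytes) -> dict:
    """Decode the first 4 bytes of a CoAP message into its header fields."""
    first_byte, code, message_id = struct.unpack("!BBH", data[:4])
    return {
        "version": first_byte >> 6,
        "type": (first_byte >> 4) & 0x03,
        "token_length": first_byte & 0x0F,
        "code": code,
        "message_id": message_id,
    }


header = pack_header(TYPE_CONFIRMABLE, 0, CODE_GET, 0x1234)
fields = unpack_header(header)
```

A Confirmable GET with no token and message ID 0x1234 fits in the four bytes 40 01 12 34. The token, options, and payload (not shown) follow the header.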
24. IP-Based Protocols
MQTT - Message Queue Telemetry Transport
• Pub/Sub messaging protocol
• Requires a broker (though brokers can be lightweight)
• many-to-many broadcast
25. IP-Based Protocols
MQTT - Message Queue Telemetry Transport
• Message == Topic + Payload
• Topics: users/ptgoetz/office/thermostat
• Topic wildcards:
• Single level (+): users/ptgoetz/+/thermostat
• Multi-level (#): users/ptgoetz/office/#
• Payload: Just a bunch of bytes (you define the schema)
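The wildcard rules above can be sketched as a small matching function. This is a simplified illustration, not a broker's implementation: it ignores special cases such as topics beginning with "$", and the name topic_matches is an assumption made for this example.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if `topic` matches the subscription `topic_filter`."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: only valid as the last level; it matches
            # the parent level and everything below it.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False  # filter is longer than the topic
        if level != "+" and level != topic_levels[i]:
            return False  # literal level differs ("+" matches any one level)
    # Without "#", both must have the same number of levels.
    return len(filter_levels) == len(topic_levels)


single = topic_matches("users/ptgoetz/+/thermostat", "users/ptgoetz/office/thermostat")
multi = topic_matches("users/ptgoetz/office/#", "users/ptgoetz/office/thermostat")
```

Both calls above return True. A single-level filter such as users/ptgoetz/+/thermostat would not match users/ptgoetz/office/desk/thermostat, because "+" stands for exactly one level.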
26. IP-Based Protocols
MQTT - Message Queue Telemetry Transport
• Delivery guarantees (QoS):
• 0: At-most-once
• 1: At-least-once
• 2: Exactly-once
• Last will and testament (when a device goes offline)
• Security via SSL/TLS
27. Apache Mynewt (incubating)
• Real-time, modular OS for IoT devices
• Designed for use in devices with power, memory and
storage constraints
• Support for many ARM Cortex-M based boards
(including Arduino)
• HAL for unified access to MCU features
• Connectivity with Bluetooth LE
• WiFi, CoAP, and Thread support (roadmap)
• Remote Firmware Upgrades
• Command-line tools for package management
29. Transport Tier
• Connecting Edge Devices:
• To and from the Analytics Tier (data center)
• To and from one another (inter-device communication)
• Bridging Protocols:
• e.g. WPAN to IP
• Collecting/Transforming/Enriching Data in Motion
31. Apache NiFi
• Data flow orchestration tool
• Guaranteed Delivery
• Data provenance (important in the Analytics
Tier)
• Backpressure with release
• Flow-specific QoS
• Web-based UI for editing data flows
• Data flows modifiable at runtime
• Supports bi-directional data flows
• Integrates with just about any system
32. Apache NiFi
Basic Concepts
• Flow File: Unit of user data with associated
key-value metadata
• Processor: Components for creating, sending,
receiving, transforming, routing, etc. Flow Files
• Connection: Acts as the link between
processors.
• Flow Controller: Brokers the exchange of data
between processors
• Process Group: Set of Processors and
Connections with Input/Output ports. New
components can be created by composition.
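The concepts above can be modeled in a few lines of Python. This is a toy model for intuition only and is not the NiFi API (NiFi processors are written in Java); the class and function names, the attribute key device.type, and the backpressure threshold are all assumptions made for this sketch.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Unit of user data plus key-value metadata (attributes)."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Queue linking two processors; refuses new items once the threshold is hit."""

    def __init__(self, backpressure_threshold: int):
        self.queue = deque()
        self.backpressure_threshold = backpressure_threshold

    def is_full(self) -> bool:
        return len(self.queue) >= self.backpressure_threshold

    def put(self, flow_file: FlowFile) -> bool:
        if self.is_full():
            return False  # backpressure: upstream must wait
        self.queue.append(flow_file)
        return True

    def take(self):
        return self.queue.popleft() if self.queue else None


def tag_device(flow_file: FlowFile) -> FlowFile:
    """A processor that enriches a flow file with an extra attribute."""
    return FlowFile(flow_file.content, {**flow_file.attributes, "device.type": "thermostat"})


def run_processor(processor, inbound: Connection, outbound: Connection) -> int:
    """Move flow files until inbound is empty or outbound applies backpressure."""
    moved = 0
    while inbound.queue and not outbound.is_full():
        outbound.put(processor(inbound.take()))
        moved += 1
    return moved


inbound = Connection(backpressure_threshold=10)
outbound = Connection(backpressure_threshold=2)
for reading in (b"21.5", b"21.7", b"21.9"):
    inbound.put(FlowFile(reading))
moved = run_processor(tag_device, inbound, outbound)
```

With three queued flow files and a downstream threshold of two, the processor moves two and leaves one waiting. It resumes once the downstream queue drains, which is the "backpressure with release" behavior in miniature.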
33. Apache NiFi MiNiFi
• Supplement to NiFi for constrained
devices/environments
• More suitable for edge devices
• Small footprint
• Designed to collect data near where
it originates and integrate with NiFi
34. Apache NiFi
For more information:
• https://nifi.apache.org
Some of the best technical
documentation I’ve ever seen:
• https://nifi.apache.org/docs.html
38. Analytics Tier
Key Platform Considerations:
• Unbounded (Stream) data processing frequently necessary
• Apache Storm, Apache Flink, etc.
• Bounded (Batch) data processing frequently necessary
• e.g. Training machine learning models, etc.
• Apache Hadoop M/R, Apache Flink, Apache Spark
• Time Series DB a common requirement
• Apache HBase, Apache Cassandra, etc.
39. Analytics Tier
Key Platform Considerations:
• Latency matters for many use cases
• Latency can add up quickly, depending on the number of “hops”
• Windowing semantics and flexibility
41. What is Event Time and why is it so important?
• Event Times: Origin Time vs. Processing Time
• Ex: Airplane Mode
• Other types of Event Time:
• Enrichment Time
• Ingest Time
• Processing Time 1, 2, n…
• Exit Time (e.g. “return” events, C2, bi-directional communication)
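One way to keep these times apart is to carry each as its own field on the event. The sketch below is an assumption about how such a record might look (the class and field names are hypothetical), using the airplane-mode case: a reading taken while the device was offline and delivered two hours later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    device_id: str
    value: float
    origin_time: float      # when the reading was taken on the device (epoch seconds)
    ingest_time: float      # when the Transport Tier received it
    processing_time: float  # when the Analytics Tier processed it

    def ingest_lag(self) -> float:
        """Seconds between the reading being taken and being received."""
        return self.ingest_time - self.origin_time

    def end_to_end_lag(self) -> float:
        """Seconds between the reading being taken and being processed."""
        return self.processing_time - self.origin_time


# Reading taken at t=1000 while offline, delivered two hours later.
buffered = Event("phone-1", 21.5, origin_time=1000.0, ingest_time=8200.0, processing_time=8201.5)
lag = buffered.ingest_lag()
```

Here the ingest lag is 7200 seconds. An analysis keyed on processing time would place this reading two hours after it actually happened.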
42. Choose a platform/API that gives
you the most flexibility with respect
to dealing with various event times.
43. Future-Proofing and Scaling
Small to Medium Scale:
• Not Big Data
• Investment in large-scale distributed system infrastructure
wouldn’t make sense.
• YAGNI (Yet…)
• Vertical scaling may suffice
44. Future-Proofing and Scaling
Medium to Large Scale:
• A single server is no longer cutting it
• “V”s are starting to pile up
• Need to move to a distributed architecture to scale with increasing
demand
• Your data is now Big
45. Apache Beam (incubating)
• Unified API for dealing with bounded/
unbounded data sources (i.e. batch/
streaming)
• One API. Multiple implementations
(execution engines). Called
“Runners” in Beamspeak.
46. Apache Beam (incubating)
• Major focus on Windowing and
properly dealing with Event Time(s)
• Sliding Windows, Tumbling Windows,
Session Windows, etc.
• Watermark capabilities for dealing
with late data
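To illustrate the idea without depending on any runner, here is a plain-Python sketch of tumbling (fixed) windows keyed on event time, with a simple watermark. It is not the Beam API. The window size, the allowed lateness, and the heuristic "watermark = newest event time seen minus allowed lateness" are assumptions chosen for this example.

```python
from collections import defaultdict

WINDOW_SIZE = 60       # seconds per tumbling window (assumed)
ALLOWED_LATENESS = 30  # how far the watermark trails the newest event (assumed)


def window_start(event_time: int, size: int = WINDOW_SIZE) -> int:
    """Start of the tumbling window that contains event_time."""
    return event_time - (event_time % size)


def assign_windows(events):
    """Group (event_time, value) pairs, given in arrival order, by event-time window.

    Returns (windows, late), where `late` holds events whose window had
    already closed (its end was at or before the watermark) on arrival.
    """
    windows = defaultdict(list)
    late = []
    watermark = float("-inf")
    for event_time, value in events:
        watermark = max(watermark, event_time - ALLOWED_LATENESS)
        start = window_start(event_time)
        if start + WINDOW_SIZE <= watermark:
            late.append((event_time, value))
        else:
            windows[start].append(value)
    return dict(windows), late


arrivals = [(5, "a"), (65, "b"), (50, "c"), (130, "d"), (10, "e")]
windows, late = assign_windows(arrivals)
```

Event "c" arrives out of order but still lands in the first window, because the watermark has not yet passed that window's end. Event "e" arrives after the watermark has moved on and is reported as late data.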
47. Apache Beam (incubating)
• Runner/Execution Engine Availability
• Local runner (single machine)
• Runners for Google Cloud
Dataflow, Flink and Spark
• Others underway: Apache Storm,
Apache Apex and others
48. Apache Beam (incubating)
• Choose the right runner for your
current scaling and organizational
needs (you can switch later as
necessary)
• Understand the limits of different
runner implementations
• Outside of Google Cloud Dataflow, the
Flink runner is currently the most
feature-complete (this will change)
49. Apache Beam (incubating)
For a technical deep dive into Apache
Beam:
Apache Beam: A Unified Model for
Batch and Streaming Data
Processing
- Davor Bonaci, Google Inc.
Thursday 4:10PM, Ballroom A
51. Problem: Data Formats
• Many IoT devices transmit data as a raw array of bytes
• The format of that data may be proprietary
• To be of any use, it must be parsed into a structured, machine-readable
format (i.e. one with a schema)
• Once parsed, you need to know the schema
52. Problem: Firmware Versions
• Deployed IoT devices may be running any number of versions
• Data formats may differ between firmware versions
• Multiple parsers may be necessary to accommodate different
device types and firmware versions
53. Solution: Parser Registry
• Allow manufacturers to supply proprietary parsers, load at runtime
• Parser API to include way to discover schema
• Tag data with device type + firmware version at the hub/gateway
• Look up associated parser when data arrives
• (This can be done in either the Transport or Analytics tier)
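A parser registry along these lines can be sketched as a lookup table keyed by device type and firmware version. Everything specific here is hypothetical: the thermostat device, its two firmware versions, the byte layouts, and the function names are made up for illustration.

```python
import struct
from typing import Callable, Dict, Tuple

Parser = Callable[[bytes], dict]
_PARSERS: Dict[Tuple[str, str], Parser] = {}


def register_parser(device_type: str, firmware_version: str):
    """Decorator that registers a parser for one device type + firmware version."""
    def decorator(fn: Parser) -> Parser:
        _PARSERS[(device_type, firmware_version)] = fn
        return fn
    return decorator


@register_parser("thermostat", "1.0")
def parse_thermostat_v1(payload: bytes) -> dict:
    # Hypothetical v1 format: big-endian int16, temperature in tenths of a degree C
    (raw,) = struct.unpack("!h", payload)
    return {"temperature_c": raw / 10.0}


@register_parser("thermostat", "2.0")
def parse_thermostat_v2(payload: bytes) -> dict:
    # Hypothetical v2 format: v1 plus one unsigned byte of relative humidity
    raw, humidity = struct.unpack("!hB", payload)
    return {"temperature_c": raw / 10.0, "humidity_pct": humidity}


def parse(device_type: str, firmware_version: str, payload: bytes) -> dict:
    """Look up the parser using the tags added at the hub/gateway, then apply it."""
    try:
        parser = _PARSERS[(device_type, firmware_version)]
    except KeyError:
        raise LookupError(
            f"no parser registered for {device_type} firmware {firmware_version}"
        ) from None
    return parser(payload)


v1 = parse("thermostat", "1.0", struct.pack("!h", 215))
v2 = parse("thermostat", "2.0", struct.pack("!hB", 215, 40))
```

The same device type yields differently shaped records depending on firmware version, which is why the version tag has to travel with the raw bytes.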
54. Solution: Schema Registry
• When parsers are registered, also register the associated schema
• Downstream components (Transport/Analytics Tier) discover
schema based on metadata
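Schema discovery from metadata can be sketched the same way. The metadata keys (device.type, firmware.version), the schema representation (field name to type name), and the function names below are assumptions for illustration, not a reference to any particular schema registry product.

```python
from typing import Dict, Optional, Tuple

Schema = Dict[str, str]  # field name -> type name (simplified representation)
_SCHEMAS: Dict[Tuple[str, str], Schema] = {}


def register_schema(device_type: str, firmware_version: str, schema: Schema) -> None:
    """Register the schema alongside the parser for the same device + firmware."""
    _SCHEMAS[(device_type, firmware_version)] = dict(schema)


def discover_schema(metadata: Dict[str, str]) -> Optional[Schema]:
    """Downstream lookup based only on the metadata attached to a record."""
    key = (metadata["device.type"], metadata["firmware.version"])
    return _SCHEMAS.get(key)


register_schema("thermostat", "1.0", {"temperature_c": "float"})
register_schema("thermostat", "2.0", {"temperature_c": "float", "humidity_pct": "int"})

schema = discover_schema({"device.type": "thermostat", "firmware.version": "2.0"})
```

A downstream component never needs to see the parser itself; the metadata on the record is enough to find out which fields to expect.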
56. Who owns your data?
• Beware of 3rd-party device manufacturers
• Data is valuable, and everyone wants it
• Frequently exclusive access
57. Who owns your data?
• Device manufacturers may hoard data.
• Retention policies limit how long you can store the data.
• Aggregate/Derivative data okay, but what’s the definition?