The massive increase in security-related data requires companies to respond with new approaches to ingestion. Learn how Lookout has changed its approach to ingesting telemetry to meet its goal of growing from 1.5 million devices to 100 million devices and beyond, using Kafka Connect and switching from AWS DynamoDB to Scylla.
Lookout on Scaling Security to 100 Million Devices
2. , Principal Engineer
Over 30 years of experience, predominantly with event pipelines and data
retrieval.
He currently works as a platform architect and principal developer at Lookout Inc.
on the Ingestion Pipeline and Query Services team, working on the next
scale of data ingestion.
3. ■ Provides security scanning for mobile devices in the Enterprise and
Consumer markets
■ Founded in 2004 when the original founders discovered a Bluetooth
vulnerability in Nokia phones
■ Demonstrated the need for mobile security at the 2005 Academy Awards
by downloading information from celebrity phones 1.5 miles away from
the venue
5. ■ Enterprise customers can apply corporate policies to devices registered
in their enterprise
■ To apply these policies, Lookout ingests data about device configuration
and the applications installed on devices
6. ■ Functions as a proxy for all mobile devices in the Lookout fleet
■ Device telemetry is sent at various intervals for these categories (a
hypothetical record shape for these messages is sketched after the list):
● Software
● Hardware
● Client
● Filesystem
● Configuration
● Binary Manifest
● Risky Configuration
● Personal Content Protection (safe browsing)
● Device Settings
● Device Permissions
● Activation Status
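To make the ingestion discussion that follows concrete, here is a minimal sketch of how one of these telemetry messages might be modeled before being published to Kafka. The type and field names are assumptions for illustration, not Lookout's actual schema.

    import java.time.Instant;
    import java.util.UUID;

    /** Hypothetical telemetry envelope; all names here are illustrative only. */
    public record DeviceTelemetry(
            UUID deviceId,          // device the gateway is proxying for
            String telemetryType,   // e.g. "SOFTWARE", "HARDWARE", "DEVICE_SETTINGS"
            Instant collectedAt,    // when the device reported the data
            byte[] payload          // category-specific body, serialized by the client
    ) {}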
8. ■ Easy to set up and maintain
■ Scaling is easy
■ Cost-effective
■ Simple to handle the unexpected
9. ■ Some of the components are “single region” (EMR)
■ As the system grows, the costs increase significantly (DynamoDB)
■ Limits on the Primary Key (PK) and Sort Key (SK) in DynamoDB - not
designed for time-series data
13. A highly scalable and fault-tolerant streaming framework that can process messages (for
example, device telemetry messages), persist them into a scalable, fault-tolerant store,
and support operational queries.
Key Requirements:
■ Infrastructure should scale to support 100M devices
■ Cost-effective ingestion, storage and querying at this scale
■ Low latency and high availability at scale (scaling up or down)
■ Failure handling (no loss of data)
■ Ease of deployment and management
14. ■ A NoSQL database that implements almost all the features of Apache Cassandra
■ Written in C++14 instead of Java to increase performance
■ Uses a shared-nothing approach and the Seastar framework to shard requests by
core - http://seastar.io/
■ Scylla's close-to-the-hardware design significantly reduces the number of instances
needed
■ Scales out horizontally and is fault-tolerant like Apache Cassandra, but delivers 10x the
throughput with consistent, low single-digit-millisecond latencies
■ Supports tunable job prioritization to sustain extremely high read and write
throughput (a problem Cassandra has not solved yet), and delivers very high
throughput on instances with NVMe volumes (compared to EBS or non-NVMe volumes)
17. ■ The amount of storage available for data depends on the compaction strategy
selected (a sketch of setting one is shown below).
● Leveled compaction - half of the data storage is needed for compaction - not
recommended
● Size-tiered compaction - half of the data storage is needed for compaction
● Time-window compaction - depends on the number of tables and record size -
normally around half is needed for compaction
● Incremental compaction - can push data storage up to 85%, so storage
needs to be planned carefully - Enterprise Edition only
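As a rough illustration of how the compaction choice is expressed, here is a minimal sketch (using the Java driver, with a hypothetical keyspace, table and schema; the talk does not show Lookout's actual DDL) of creating a time-series telemetry table with time-window compaction:

    import com.datastax.oss.driver.api.core.CqlSession;

    public class CreateTelemetryTable {
        public static void main(String[] args) {
            // Connects to localhost:9042 by default; point it at a real cluster in practice.
            try (CqlSession session = CqlSession.builder().build()) {
                session.execute(
                    "CREATE TABLE IF NOT EXISTS telemetry.device_messages ("
                  + "  device_id uuid,"
                  + "  telemetry_type text,"
                  + "  received_at timestamp,"
                  + "  payload blob,"
                  + "  PRIMARY KEY ((device_id, telemetry_type), received_at)"
                  + ") WITH CLUSTERING ORDER BY (received_at DESC)"
                  + "  AND compaction = {"
                  + "    'class': 'TimeWindowCompactionStrategy',"
                  + "    'compaction_window_unit': 'DAYS',"
                  + "    'compaction_window_size': 1}");
            }
        }
    }

The keyspace would need to exist first. The partition key groups one device's rows for one telemetry type, and the time-based clustering column is what makes time-window compaction a natural fit for this kind of data.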
18. ■ May not be a good choice if storage requirements are very large relative to
transaction volume, as you will have wasted compute tied to the increased storage needs.
■ Note that this assumes you do not plan to use low-cost EBS volumes with much lower
throughput.
■ No FedRAMP-certified version of Scylla Cloud is available today, requiring deployment of a
self-managed cluster.
■ No autoscaling support; nodes have to be provisioned and data rebalanced through
scripts/UI.
■ Not suitable for ad-hoc or table-scan-type queries, and does not support joins.
19. ■ Worker instances are stateless and coordinate
with each other via internal Kafka topics.
■ Kafka Connect automatically detects failures and
rebalances work over the remaining processes.
■ Suitable for streaming data to and from Kafka, but
not for complex operations like aggregations,
windowing, etc., which frameworks like Apache Spark
or Apache Flink support.
■ The maximum number of tasks is limited to the
number of partitions.
■ Exposes a REST API to create, modify and monitor
connectors and tasks (a sketch of creating a
connector through this API follows).
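As a sketch of that REST API in use, the following creates a sink connector by POSTing its configuration to a Connect worker. The worker host, connector name, topic and connector class here are assumptions; only connector.class, tasks.max and topics are standard Connect settings, and a real sink connector needs its own additional configuration.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CreateConnector {
        public static void main(String[] args) throws Exception {
            // tasks.max is effectively capped by the number of partitions on the topic.
            String body = """
                {
                  "name": "device-telemetry-sink",
                  "config": {
                    "connector.class": "<sink connector class, e.g. a Cassandra/Scylla sink>",
                    "tasks.max": "8",
                    "topics": "device-telemetry"
                  }
                }
                """;

            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://connect-worker:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }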
21. ■ The default partitioner that comes with Kafka (<murmur2 hash> mod <# partitions>) did
not shard efficiently as the number of partitions grew (approximately 50% of the
partitions were idle).
■ We replaced it with a murmur3 hash fed through a consistent hashing algorithm (jump
hash) to get an even distribution across all partitions, using Google's Guava library
(see the sketch below). - "A Fast, Minimal Memory, Consistent Hash Algorithm" -
https://arxiv.org/pdf/1406.2294.pdf
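A minimal sketch of such a producer partitioner follows. The class name and the null-key handling are assumptions; the talk only states that a murmur3 hash fed into Guava's jump consistent hash replaced the default scheme.

    import com.google.common.hash.Hashing;
    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;

    public class JumpHashPartitioner implements Partitioner {

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            if (keyBytes == null) {
                return 0;   // assumption: keyless records all land in partition 0
            }
            // murmur3 (128-bit) on the serialized key, then Guava's jump consistent
            // hash maps the 64-bit value evenly onto [0, numPartitions).
            long hash = Hashing.murmur3_128().hashBytes(keyBytes).asLong();
            return Hashing.consistentHash(hash, numPartitions);
        }

        @Override
        public void close() {}

        @Override
        public void configure(Map<String, ?> configs) {}
    }

Producers opt in by setting partitioner.class to this class; because jump hash is consistent, adding partitions moves only a minimal fraction of keys to new partitions.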
22. ■ We emulated approximately 38 million devices generating a total of 109,668
messages/second, or 394 million messages/hr (roughly 9.5 billion messages/day).
■ On average a device was generating 253 messages/day (38 million devices x 253
messages/day is in line with the fleet-wide rate above).
■ We did not expect querying to have much impact, so it was not added as part of
the load.
■ The load test ran for 96 hours.
Telemetry Type        Emulated/second (fleet total)   Emulated/day (per device)   Avg size (bytes/message)
Celldata              760                             1.72                        83
Client                760                             1.72                        166
Configuration         13,908                          31.62                       396
Device Change         2,280                           6.91                        218
Device Permissions    1,520                           3.45                        74
Device Settings       45,600                          103.68                      75
Hardware              760                             1.72                        254
Loaded Libraries      38,000                          86.40                       219
Risk Configuration    1,900                           5.18                        261
Software              760                             1.72                        375
Binary                1,520                           3.45                        219
File System           1,900                           5.18                        219
23. ■ Message latency was in milliseconds on average, unless the system was
overtaxed.
■ Repairs added significant load and were generally taxing on the system
(CPU at 100%), but the cluster continued to function.
■ Latency increased when Kafka Connect tasks failed (while repairs were
running on ScyllaDB).
■ The ScyllaDB cluster was running near capacity (CPU between 75% and 90%).
■ Overall, the results were very positive.
25. ■ Kafka Connect provided a quick and easy way to add new ingestion pipelines
■ DataMountaineer's Kafka Connect connector for Cassandra was easier to implement
than the Confluent connector
■ ScyllaDB CPU shot up and timeouts occurred while repairs were running - Scylla's
ability to reserve capacity for maintenance tasks ensured repairs completed, something
not available in Cassandra
■ As the complexity of the data ingestion increased, the solution leaned more towards
implementing a custom Kafka → Scylla worker cluster for debugging and maintenance
reasons (a minimal sketch of such a worker follows)
■ The cost benefits over the current architecture increased significantly as our volume
grew
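A minimal sketch of the core loop of such a custom Kafka → Scylla worker follows. The topic, table, key format and addresses are assumptions carried over from the earlier sketches, not Lookout's implementation.

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.PreparedStatement;
    import java.nio.ByteBuffer;
    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;
    import java.util.Properties;
    import java.util.UUID;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class TelemetryWorker {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka:9092");   // assumed broker address
            props.put("group.id", "telemetry-workers");
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("enable.auto.commit", "false");

            try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
                 CqlSession session = CqlSession.builder().build()) {

                consumer.subscribe(List.of("device-telemetry"));   // assumed topic
                PreparedStatement insert = session.prepare(
                    "INSERT INTO telemetry.device_messages "
                  + "(device_id, telemetry_type, received_at, payload) VALUES (?, ?, ?, ?)");

                while (true) {
                    ConsumerRecords<String, byte[]> records =
                        consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, byte[]> record : records) {
                        // Assumed key format: "<deviceId>:<telemetryType>".
                        String[] keyParts = record.key().split(":", 2);
                        session.execute(insert.bind(
                            UUID.fromString(keyParts[0]),
                            keyParts[1],
                            Instant.ofEpochMilli(record.timestamp()),
                            ByteBuffer.wrap(record.value())));
                    }
                    consumer.commitSync();   // commit only after the batch is persisted
                }
            }
        }
    }

Compared with Kafka Connect, a worker like this gives full control over batching, retries and error handling, which is what made it attractive for debugging and maintenance as the ingestion logic grew more complex.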
26. ■ This does not include:
● Query load and associated costs.
● DynamoDB Streams and its equivalent on Scylla, and the associated costs.

                 DynamoDB                         Scylla
                 # Devices      $ Cost/Mo         # Devices      $ Cost/Mo
On Demand        38,000,000     $304,400.00       38,000,000     $14,564.24
                 100,000,000    $801,052.63       100,000,000    $38,303.95
                                                  (+20% engineer cost for maintenance)
Provisioned      38,000,000     $55,610.00
                 100,000,000    $146,342.11