Apache Spark vs Apache Spark: An On-Prem Comparison of Databricks and Open-Source Spark
1.
2. In partnership with
SPARK VS SPARK
AN ON-PREM COMPARISON OF DATABRICKS AND
OPEN-SOURCE SPARK
Justin Hoffman - Senior Lead Data Scientist at Booz Allen Hamilton
In Collaboration with US Air Force
In Collaboration with Databricks
SPARK AI SUMMIT 2020
3.
4.
5. • Rapidly expanding attack surface
• Inundation of cyber tools
• Attacks are more sophisticated
• Cyber talent shortage
Networks are harder to secure
than ever before, as defenders are
increasingly overwhelmed by data
and challenges
The average
intrusion is
detected almost
200 days
after the fact
7. The Challenge: Go Fast…. On-Premise?
COLLECT
Get and track
data from the
source
PROCESS
Give Data the
Power of
Greater Context
AGGREGATE
From disparate
data sources,
one version of
the truth
EXPOSE
Abstract away
complexities
through a single
interface
DATA SCIENCE
ANALYTICS
BUSINESS INTELLIGENCE
REPORTING
DATA CONSUMERS
VISUALIZATIONS
APPLICATIONS
Security, Governance, Provenance, Lineage
INSIGHTS
DATA SOURCES
SOCIAL MEDIA
NEWSFEEDS
WEB CRAWLERS
PROPRIETARY
SOURCES
8. BOOZ ALLEN HAMILTON - This document is intended solely for the client to whom it is addressed on the title slide.
Various Sensor
Data Feeds
Historic Data Storage
Repository of historic data for
compliance / retrospective analysis
Data Storage
Hosts normalized,
enriched data for
cyber operations
Traditional Tooling
Custom Dashboards
Often prebuilt with
analytic capabilities
Security Operations Center
(SOC) drives cyber hunt
and defensive operations
mission by analyzing data
through existing tooling
(COTS platforms and
custom dashboards)
Data Broker
Security Orchestration, Automation, and Response:
Integrated capabilities support the SOC by
automating simple tasks when appropriate
Suite of Crowd Sourced Analytics
Curates risk scores for nuanced adversary techniques that
were previously undetectable using containerized AI systems
Normalization Engine
Enrichment Engine
Automated Threat Intelligence Enrichment
Accelerates investigations by automating the collection
of valuable context before the data hits the SIEM
Established Data Model with Automation
Fuses multiple data feeds and normalizes
raw data to a common data model
Solution: A Service-Oriented Architecture for Capability Deployment
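The normalization step described on this slide (fusing multiple feeds into a common data model) could look roughly like the sketch below. The feed names, raw field names, and common-model field names are all hypothetical, chosen only for illustration; the slides do not describe the actual data model.

```python
from typing import Any, Dict

# Hypothetical per-feed mappings: raw field name -> common data model field name.
# None of these names come from the talk; they only illustrate the idea.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "firewall": {"src": "source_ip", "dst": "dest_ip", "ts": "timestamp"},
    "proxy": {"client_ip": "source_ip", "server_ip": "dest_ip", "time": "timestamp"},
}


def normalize(feed: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename a raw record's fields to the common model and tag its origin feed."""
    mapping = FIELD_MAPS[feed]
    # Keep only the fields the mapping knows about, under their common names.
    out = {common: record[raw] for raw, common in mapping.items() if raw in record}
    out["feed"] = feed
    return out


if __name__ == "__main__":
    print(normalize("firewall", {"src": "10.0.0.1", "dst": "10.0.0.2", "ts": 1}))
    print(normalize("proxy", {"client_ip": "10.0.0.3", "server_ip": "10.0.0.4", "time": 2}))
```

Once every feed shares the same field names, downstream enrichment and analytics can be written once against the common model instead of once per sensor.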
9.
Various Sensor
Data Feeds
Data Storage
Hosts normalized,
enriched data for
cyber operations
Security Operations Center
(SOC) drives cyber hunt
and defensive operations
mission by analyzing data
through existing tooling
(COTS platforms and
custom dashboards)
Data Broker
Custom Cyber AI Models
Identify IPs that are interesting
Enrichment Engine
Project Architecture: Focused on High Performance Computing
HPC Specs
• Master node - 1
• Worker Nodes – 6
• Memory – 128 GB
• Cores – 16
• Gigabit Connectivity
• RM – Yarn
• Hive Metastore – MariaDB
• DBIO Caching (DBR) - enabled
10. Results: Spark Open Source vs Spark DBR
*https://databricks.com/glossary/what-is-databricks-runtime
In Cloud On Prem
Query                   DBR      OSS       Gains (DBR)
SQL - Read and count    34.4 s   158.5 s   4.6X
SQL - Filtered count    1.7 s    72.5 s    42.65X
~1 Billion + records!!
1TB+ in size
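The gains column can be checked by dividing the OSS time by the DBR time. The sketch below does that arithmetic with the timings from the slide and adds a small timing helper of the kind that might be used to collect such numbers. The `timed` helper and the commented Spark calls are assumptions for illustration, not the harness used in the talk.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def speedup(oss_seconds: float, dbr_seconds: float) -> float:
    """How many times faster DBR ran than OSS for the same query."""
    return oss_seconds / dbr_seconds


# Timings reported on the slide, as (DBR seconds, OSS seconds).
RESULTS = {
    "SQL - Read and count": (34.4, 158.5),
    "SQL - Filtered count": (1.7, 72.5),
}

for name, (dbr, oss) in RESULTS.items():
    print(f"{name}: {speedup(oss, dbr):.2f}X")

# Hypothetical usage against a real cluster (assumes an existing SparkSession
# named `spark` and a table path of your own; neither is given in the slides):
#   count, seconds = timed(lambda: spark.read.parquet("/path/to/table").count())
```

Rounded, the two ratios agree with the 4.6X and 42.65X figures in the table.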
11. Lessons Learned for Future On-Premise Installs
• Spark DBR with Delta Lake performs almost 50X faster on complex joins for IPs of interest
• When performing machine learning at scale, Spark DBR still provides performance gains in DGA (domain generation algorithm) classification
• We isolated some worker node failures to the open-source Hadoop distribution, which would cause Spark jobs not to complete. Switching distributions solved the issue.
• Simple applications leveraging RDDs (Resilient Distributed Datasets) will not see large performance gains
• Leveraging the Delta Lake format and Parquet with MariaDB provides performance optimizations
• No cloud was leveraged in the making of this research