At Avast, we believe everyone has the right to be safe. We are dedicated to creating a world that provides safety and privacy for all, no matter where you are, who you are, or how you connect. With over 1.5 billion attacks stopped and 30 million new executable files monthly, big data pipelines are crucial for the security of our customers. At Avast we leverage Apache Spark machine learning libraries and TensorFlowOnSpark for a variety of tasks, ranging from marketing and advertising through network security to malware detection. This talk will cover our main cybersecurity use cases for Spark. After describing our cluster environment, we will first demonstrate anomaly detection on time series of threats. With thousands of types of attacks and malware, AI helps human analysts select and focus on the most urgent or dire threats. We will walk through our setup, from distributed training of deep neural networks with TensorFlow to deploying and monitoring a streaming anomaly detection application with a trained model. Next, we will show how we use Spark for analysis and clustering of malicious files and for large-scale experimentation to automatically process and handle changes in malware. Finally, we will give a comparison with other tools we have used to solve these problems.
13. Malware classification
13#UnifiedDataAnalytics #SparkAISummit
Data
● >500 handcrafted features from binary files from our experts
Task
● Classification of files as clean/malware/PUP
Two step ML Pipeline:
● Cluster data with custom k-means
● Classification inside the cluster is done by Random Forest
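The two-step idea on this slide (cluster first, then classify inside each cluster) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the k-means is the textbook Lloyd's algorithm rather than the custom variant mentioned above, and a per-cluster majority vote stands in for the Random Forest.

```python
import math
from collections import Counter, defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iterations=10):
    """Textbook Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each non-empty cluster's centroid to the mean of its members.
        for i, members in groups.items():
            centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def fit_two_step(points, labels, k):
    """Step 1: cluster the data. Step 2: fit one model per cluster.

    Assumption: the per-cluster "model" is just the majority label,
    a stand-in for the Random Forest named on the slide.
    """
    centroids = kmeans(points, k)
    votes = defaultdict(Counter)
    for p, y in zip(points, labels):
        votes[nearest(p, centroids)][y] += 1
    models = {i: c.most_common(1)[0][0] for i, c in votes.items()}
    return centroids, models


def predict(point, centroids, models):
    """Route the sample to its cluster, then ask that cluster's model."""
    return models[nearest(point, centroids)]
```

A new sample is first assigned to its nearest centroid and then scored only by the model trained on that cluster, so each model sees a more homogeneous slice of the data.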
16. Custom application vs. Spark
Custom application
• optimised & performant
• takes months to develop
• not that easy to change
Spark
• slower
• easy to experiment with
• very fast development
28. First idea - custom streaming app
• Python because of ML models
• A big part of the code deals with already solved problems
• POC written by researchers
• Gets the job done, but not easy to maintain or experiment with
35. Advantages of Structured Streaming
• Unified processing engine
• End to end AI with multiple sinks
• Window aggregations and Watermarking out of the box
• Resilient streams
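To make the window aggregation and watermarking bullet concrete, here is a small plain-Python simulation of the idea. It is not Spark code: the function name and parameters are hypothetical, and the watermark is advanced after every event, whereas Structured Streaming updates it between micro-batches. It is a sketch of the semantics, assuming integer event times.

```python
from collections import defaultdict


def windowed_counts(events, window, delay):
    """Count events per tumbling window, dropping events behind the watermark.

    events: iterable of (event_time, key) in arrival order.
    window: window length, in the same unit as event_time.
    delay:  how long to keep waiting for late data.
    """
    counts = defaultdict(int)
    dropped = []
    max_seen = float("-inf")
    for event_time, key in events:
        # Watermark trails the largest event time seen so far by `delay`.
        watermark = max_seen - delay
        start = event_time - event_time % window
        # Too late: the event's whole window already ended behind the watermark.
        if start + window <= watermark:
            dropped.append((event_time, key))
        else:
            counts[(start, key)] += 1
        max_seen = max(max_seen, event_time)
    return dict(counts), dropped
```

An event that arrives out of order is still counted as long as its window has not fallen behind the watermark, which is what lets the engine bound the state it keeps per window.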
54. AI driven anomaly detection on time series
How to quickly identify campaigns of malware and potentially unwanted programs:
• Traditional approaches - find outliers
• Machine learning - predict and compare
– Neural networks - LSTMs vs CNNs
– Other - auto-regressive models etc.
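The "predict and compare" approach can be sketched as follows: forecast the next point from recent history, then flag the observation if it lands too far from the forecast. The names, the default parameters, and the moving-average forecaster are all assumptions made for illustration; in the setup described in the talk, the `predict` slot would be filled by a trained LSTM, CNN, or auto-regressive model.

```python
from statistics import mean, pstdev


def detect_anomalies(series, predict=mean, history=5, threshold=3.0):
    """Return indices whose value is far from a forecast built on recent history.

    predict:   callable mapping the last `history` values to a forecast
               (a moving average by default, standing in for a trained model).
    threshold: allowed error, in standard deviations of the history window.
    """
    anomalies = []
    for i in range(history, len(series)):
        past = series[i - history:i]
        forecast = predict(past)
        # Fall back to 1.0 so a perfectly flat history has a non-zero tolerance.
        spread = pstdev(past) or 1.0
        if abs(series[i] - forecast) > threshold * spread:
            anomalies.append(i)
    return anomalies
```

For example, `detect_anomalies([10, 11, 10, 12, 11, 10, 50, 11, 10])` flags only index 6, the spike to 50; the points after the spike are not flagged because the spike inflates the spread of their history window.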
58. Threat anomaly detection: stream serving
• pandas_udf for parallel predictions
• super easy to test on already stored data as a batch job
61. Takeaways
• Easier collaboration between Science and Engineering teams
• An excellent toolbox to do anomaly detection in near real time
• Easy ML/AI/DL integration
• Parallelism