Machine Learning for Malware Classification and Clustering

In this talk, we will give an overview of the machine learning model that is the foundation of Endgame’s automated malware classifier. We will discuss challenges and best approaches to finding a metric that adequately summarizes a model's performance in recognizing malware, and we will show how model results inform the more tactical analysis of malware researchers.

Phil Roth, Data Scientist

• PhD in particle astrophysics
• Switched to making images from radar data
• Switched to solving security problems with data

Outline
• Malware Detection
• Boosted Decision Trees
• Malware Features
• Evaluating Performance
• Bringing a Human into the Loop

The Problem: Antivirus
The security industry has declared antivirus dead, but
there is no widely accepted replacement.
Machine Learning can be that replacement.

The Problem: Antivirus
• Antivirus uses signatures, heuristics, and hand-crafted rules
that do not scale well
• Using polymorphism and obfuscation, malware authors can
circumvent rule-based detection techniques

The Solution: Machine Learning
Machine Learning uses statistical techniques to learn
patterns from large datasets
Two Steps:
• Feature Extraction
• Boundary Learning

Machine Learning Advantages
• Automation
• Deep Insights
• Scalability
• Generalization

Machine Learning Challenges
• Requires labels
• Requires large data sets
• The security field has very low tolerance for errors

Boosted Decision Trees
Basically, it’s a game of 20 questions
[Figure: a decision tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard). The figures under the leaves show the probability of survival and the percentage of observations in the leaf. Source: https://en.wikipedia.org/wiki/Decision_tree_learning]

Boosted Decision Trees
• The trees are built by choosing “questions” that
maximize the discrimination between two classes
• The model is called “boosted” because misclassified
samples are given higher weight in future tree building
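The reweighting idea described above can be sketched with an AdaBoost-style loop over one-question trees ("stumps"). This is a minimal illustration under stated assumptions, not the model from the talk: the function names, the +1/-1 label encoding, and the toy data are all hypothetical, and production gradient-boosting libraries reach a similar effect by a different route.

```python
import math

def train_stump(X, y, w):
    """Choose the (feature, threshold, polarity) question with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] <= t else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best

def stump_predict(stump, row):
    _, j, t, polarity = stump
    return polarity if row[j] <= t else -polarity

def adaboost(X, y, rounds=3):
    """Labels are +1 or -1 (an assumed encoding, e.g. malicious / benign)."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)      # vote strength of this stump
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight; correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single threshold question can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
print([predict(model, row) for row in X])
```

On this toy set the first stump alone gets the last two samples wrong; their weights double before the second round, which is what steers the later stumps toward them.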

Why Boosted Decision Trees?
Proven results in security and physics
References:
https://www.kaggle.com/c/malware-classification/
http://arxiv.org/pdf/1511.04317.pdf
http://jmlr.org/proceedings/papers/v42/chen14.pdf

Malware Features
The extracted features determine your
model’s performance, but there is a tradeoff:
Complicated vs. Explainable

Complicated Features
Byte frequency and byte
entropy features form a
binary fingerprint that informs
the model
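A rough sketch of what such features could look like: a normalized 256-bin byte histogram plus Shannon entropy, computed over the whole file and over fixed-size chunks. The function names and the 1024-byte window are assumptions made for illustration; the talk does not specify the exact feature layout.

```python
import math
from collections import Counter

def byte_histogram(data):
    """Fraction of the file occupied by each of the 256 possible byte values."""
    n = len(data)
    if n == 0:
        return [0.0] * 256
    counts = Counter(data)
    return [counts.get(b, 0) / n for b in range(256)]

def byte_entropy(data):
    """Shannon entropy in bits per byte: 0 for constant data, 8 for uniform data."""
    n = len(data)
    if n == 0:
        return 0.0
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def windowed_entropy(data, window=1024):
    """Entropy of each consecutive chunk; the window size is an arbitrary choice."""
    return [byte_entropy(data[i:i + window]) for i in range(0, len(data), window)]

sample = bytes(range(256)) * 4 + b"\x00" * 1024   # stand-in for a binary's contents
features = byte_histogram(sample) + [byte_entropy(sample)]
print(len(features), windowed_entropy(sample))
```

High-entropy regions often correspond to packed or encrypted content, which is one reason features of this kind help a model even though they are hard to explain to an analyst.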

Explainable Features
Lists of capabilities don’t greatly help the model classify a
sample, but they can provide more insight to an analyst.
This sample can:
• Record keystrokes
• Send/receive network traffic
• Modify registry
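One way a capability list like this could be derived is by matching a sample's imported function names against a lookup table. The sketch below is hypothetical: the table, the function name `capabilities`, and the idea of using imports as the signal are assumptions, and importing an API only suggests a capability rather than proving the behavior.

```python
# Illustrative mapping from capabilities to Windows API names that hint at them.
CAPABILITY_APIS = {
    "Record keystrokes": {"SetWindowsHookExA", "SetWindowsHookExW", "GetAsyncKeyState"},
    "Send/receive network traffic": {"connect", "send", "recv"},
    "Modify registry": {"RegSetValueExA", "RegSetValueExW", "RegCreateKeyExA"},
}

def capabilities(imported_functions):
    """Return the capabilities hinted at by a sample's imported function names."""
    imported = set(imported_functions)
    return [name for name, apis in CAPABILITY_APIS.items() if imported & apis]

for capability in capabilities(["GetAsyncKeyState", "recv", "CreateFileW"]):
    print("•", capability)
```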

Evaluating Performance
We must be careful not to learn from “future” information:
[Figure: timelines labeled "Train Data", "Test Data", and "Model Train Times", annotated "Patterns learned here ... should not inform classifications here".]
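A minimal sketch of a time-respecting split, assuming each sample carries a date on which it was first observed. The `first_seen` and `id` field names and the cutoff date are hypothetical; the point is only that everything in the test set is newer than everything the model trained on.

```python
from datetime import date

def temporal_split(samples, cutoff):
    """Train on samples seen before the cutoff, test on samples seen on or after it."""
    train = [s for s in samples if s["first_seen"] < cutoff]
    test = [s for s in samples if s["first_seen"] >= cutoff]
    return train, test

samples = [
    {"id": "a", "first_seen": date(2015, 9, 14), "label": 1},
    {"id": "b", "first_seen": date(2015, 10, 2), "label": 0},
    {"id": "c", "first_seen": date(2015, 11, 20), "label": 1},
    {"id": "d", "first_seen": date(2015, 12, 5), "label": 0},
]
train, test = temporal_split(samples, cutoff=date(2015, 11, 1))
print([s["id"] for s in train], [s["id"] for s in test])
```

A random shuffle-and-split would mix later samples into training and tend to overstate how well the model handles malware it has not yet seen.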

Bringing Humans in the Loop
Amazon built an entire tool (Mechanical Turk) to cheaply
generate labels from human intuition:
Are these products related?

Bringing Humans in the Loop
Our labels are more expensive to obtain, so choosing
which samples to label is even more important.
Is this binary malicious?
Active Learning can help!

Bringing Humans in the Loop
When new data arrives, Active Learning tells analysts
which labels would be most helpful.
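One common active learning strategy is uncertainty sampling: ask analysts to label the samples whose predicted probability sits closest to the decision boundary. The sketch below assumes that strategy and a 0.5 boundary; the talk does not say which selection rule is used, and the sample names and scores are made up.

```python
def most_uncertain(scores, k):
    """scores maps a sample id to the model's P(malicious); return the k ids nearest 0.5."""
    return sorted(scores, key=lambda sample_id: abs(scores[sample_id] - 0.5))[:k]

new_scores = {"a.exe": 0.02, "b.exe": 0.48, "c.exe": 0.97, "d.exe": 0.61, "e.exe": 0.35}
print(most_uncertain(new_scores, k=2))
```

Samples scored near 0 or 1 are ones the model is already confident about, so a human label for them changes little; the borderline ones are where an analyst's time is likely to move the boundary most.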

Integration
• Our malware classifier model has been integrated into
our stealthy sensor and Hunt Platform
• Ask the other friendly Endgamers here for a demo!

Thanks!
proth@endgame.com
@mrphilroth

Automating Google Workspace (GWS) & more with Apps Script

[2024]Digital Global Overview Report 2024 Meltwater.pdf

Scaling API-first – The story of a global engineering organization

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

Artificial Intelligence: Facts and Myths

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...

🐬 The future of MySQL is Postgres 🐘

Strategies for Landing an Oracle DBA Job as a Fresher

Machine Learning for Malware Classification and Clustering

1. Machine Learning for Malware Classification and Clustering Phil Roth, Data Scientist

2. Phil Roth, Data Scientist • PhD in particle astrophysics • Switched to making images from radar data • Switched to solving security problems with data

3. Outline • Malware Detection • Boosted Decision Trees • Malware Features • Evaluating Performance • Bringing a Human into the Loop

4. The Problem: Antivirus The security industry has declared antivirus dead, but there is no widely accepted replacement. Machine learning can be that replacement.

5. The Problem: Antivirus • Antivirus uses signatures, heuristics, and hand-crafted rules that do not scale well • Using polymorphism and obfuscation, malware authors can circumvent rule-based detection techniques

6. The Solution: Machine Learning Machine learning uses statistical techniques to learn patterns from large datasets. Two steps: • Feature Extraction • Boundary Learning

7. Machine Learning Advantages • Automation • Deep Insights • Scalability • Generalization

8. Machine Learning Challenges • Requires labels • Requires large data sets • Security field has very low tolerance for errors

9. Boosted Decision Trees Basically, it’s a game of 20 questions. [Figure: a tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard). The figures under the leaves show the probability of survival and the percentage of observations in the leaf. Source: https://en.wikipedia.org/wiki/Decision_tree_learning]

10. Boosted Decision Trees • The trees are built by choosing “questions” that maximize the discrimination between two classes • The model is called “boosted” because misclassified samples are given higher weight in future tree building
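
The reweighting idea on this slide can be sketched in a few lines. The example below is a minimal AdaBoost-style illustration using one-question trees (decision stumps) on invented toy data; the slides do not say which boosting variant or library the classifier uses, so the function names, the data, and the choice of AdaBoost are all assumptions made for illustration.

```python
import math

def fit_stump(X, y, w):
    """Choose the single "question" (feature, threshold, polarity) with the
    lowest weighted error. Returns (error, feature, threshold, polarity)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] <= t else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best

def stump_predict(stump, row):
    _, j, t, polarity = stump
    return polarity if row[j] <= t else -polarity

def adaboost(X, y, rounds):
    """Train up to `rounds` stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = max(stump[0], 1e-10)         # avoid division by zero
        if err >= 0.5:                     # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified samples get a larger weight for the next stump.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical 1-D toy data: no single threshold separates the classes.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]

single = adaboost(X, y, rounds=1)
boosted = adaboost(X, y, rounds=3)
single_preds = [predict(single, row) for row in X]
boosted_preds = [predict(boosted, row) for row in X]
```

On this toy set one stump necessarily misclassifies some samples, while three reweighted stumps combine to fit all six.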

11. Why Boosted Decision Trees? Proven results in security and physics. References: https://www.kaggle.com/c/malware-classification/ http://arxiv.org/pdf/1511.04317.pdf http://jmlr.org/proceedings/papers/v42/chen14.pdf

12. Malware Features The extracted features determine your model’s performance, but there is a tradeoff: Complicated vs. Explainable

13. Complicated Features Byte frequency and byte entropy features form a binary fingerprint that informs the model
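
As a rough illustration of what such features might look like, the sketch below computes a normalized byte-frequency histogram and the Shannon entropy of a byte string. The slide does not specify how the features are defined (for example, whether entropy is computed over the whole file or over sliding windows), so the whole-input definitions and the function names here are assumptions.

```python
import math
from collections import Counter

def byte_histogram(data: bytes) -> list:
    """Fraction of the input taken up by each of the 256 possible byte values."""
    if not data:
        return [0.0] * 256
    counts = Counter(data)
    return [counts.get(b, 0) / len(data) for b in range(256)]

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0 for constant data, 8 for uniform bytes."""
    return -sum(p * math.log2(p) for p in byte_histogram(data) if p > 0)

# Hypothetical inputs: padding-like data versus data that uses every byte value.
padding = b"\x00" * 64
uniform = bytes(range(256))

features = byte_histogram(uniform) + [byte_entropy(uniform)]  # 257-value vector
```

Low entropy is typical of padding or plain text, while values near 8 bits per byte are typical of compressed or encrypted content.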

14. Explainable Features Lists of capabilities don’t greatly help the model classify a sample, but they can provide more insight to an analyst. This sample can: • Record keystrokes • Send/receive network traffic • Modify registry

15. Evaluating Performance We must be careful not to learn from “future” information. [Figure: a timeline split at the model train time into train data and test data, annotated "Patterns learned here ... should not inform classifications here"]
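
One way to respect this constraint is to split samples by time instead of at random. The sketch below assumes each sample carries a `first_seen` date and that a single cutoff separates training from testing; the field name, the sample records, and the cutoff are hypothetical.

```python
from datetime import date

def temporal_split(samples, cutoff):
    """Train on samples first seen before `cutoff`, test on the rest."""
    train = [s for s in samples if s["first_seen"] < cutoff]
    test = [s for s in samples if s["first_seen"] >= cutoff]
    return train, test

# Hypothetical sample records.
samples = [
    {"sha256": "a1", "first_seen": date(2015, 11, 3), "malicious": True},
    {"sha256": "b2", "first_seen": date(2016, 1, 20), "malicious": False},
    {"sha256": "c3", "first_seen": date(2016, 3, 9), "malicious": True},
    {"sha256": "d4", "first_seen": date(2016, 4, 1), "malicious": False},
]

train, test = temporal_split(samples, cutoff=date(2016, 3, 1))
```

A random shuffle would mix later samples into the training set, which tends to make measured performance look better than what the model would achieve on genuinely new files.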

16. Bringing Humans in the Loop Amazon built an entire tool (Mechanical Turk) to cheaply generate labels from human intuition: Are these products related?

17. Bringing Humans in the Loop Our labels are more expensive to obtain, so choosing which samples to label is even more important. Is this binary malicious? Active Learning can help!

18. Bringing Humans in the Loop When new data arrives, Active Learning tells analysts which labels would be most helpful.
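
The slides do not say which active learning strategy is used. A common one is uncertainty sampling: ask analysts to label the samples the current model is least sure about. The sketch below assumes the model outputs a probability of maliciousness per sample; the sample identifiers, the scores, and the function name are hypothetical.

```python
def most_uncertain(scores, k):
    """Return the k sample ids whose predicted probability is closest to 0.5.

    `scores` maps a sample id to the model's predicted probability that
    the sample is malicious.
    """
    return sorted(scores, key=lambda sid: abs(scores[sid] - 0.5))[:k]

# Hypothetical model outputs for newly arrived, unlabeled samples.
scores = {"a": 0.98, "b": 0.52, "c": 0.05, "d": 0.40}

to_label = most_uncertain(scores, k=2)
```

Samples the model already scores near 0 or 1 are skipped, so analyst time goes to the files whose labels are likely to move the decision boundary most.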

19. Integration • Our malware classifier model has been integrated into our stealthy sensor and Hunt Platform • Ask the other friendly Endgamers here for a demo!

20. Thanks! proth@endgame.com @mrphilroth

Editor's notes

Dive right into train versus test data.
