3. Conclusions
Python tools for text classification can easily be adopted for malware classification.
When using instruction ngrams, your disassembler and analysis passes are very important.
references: http://bit.ly/scipy-malware
7. Hex Dump
00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80
00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90
00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19
00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00
00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00
00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00
004010A0 00 40 00 00 00 34 40 40 00 04 00 08 80 08 00 08
004010B0 10 00 40 00 68 02 40 04 E1 00 28 14 00 08 20 0A
004010C0 06 01 02 00 40 00 00 00 00 00 00 20 00 02 00 04
004010D0 80 18 90 00 00 10 A0 00 45 09 00 10 04 40 44 82
004010E0 90 00 26 10 00 00 04 00 82 00 00 00 20 40 00 00
004010F0 B4 00 00 40 00 02 20 25 08 00 00 00 00 00 00 00
00401100 08 00 00 50 00 08 40 50 00 02 06 22 08 85 30 00
00401110 00 80 00 80 60 00 09 00 04 20 00 00 00 00 00 00
00401120 00 82 40 02 00 11 46 01 4A 01 8C 01 E6 00 86 10
00401130 4C 01 22 00 64 00 AE 01 EA 01 2A 11 E8 10 26 11
00401140 4E 11 8E 11 C2 00 6C 00 0C 11 60 01 CA 00 62 10
00401150 6C 01 A0 11 CE 10 2C 11 4E 10 8C 00 CE 01 AE 01
00401160 6C 10 6C 11 A2 01 AE 00 46 11 EE 10 22 00 A8 00
00401170 EC 01 08 11 A2 01 AE 10 6C 00 6E 00 AC 11 8C 00
00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11
00401190 0E 00 EC 11 24 10 4A 10 04 01 C8 11 E6 01 C2 00
raw data in hex
8. Hex Dump
Each row of the dump gives an address (highlighted: 00401180) followed by the bytes stored there (highlighted: EC 01 2A 10 2A 01 AE).
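A minimal sketch of turning dump rows like these back into raw bytes. It assumes every row is an address followed by two-digit hex tokens, with "??" marking bytes that have no value; the function name `parse_hex_dump` is made up for illustration.

```python
def parse_hex_dump(lines):
    """Return the bytes encoded in a hex dump as a bytes object."""
    out = bytearray()
    for line in lines:
        tokens = line.split()
        # tokens[0] is the address column; the rest are two-digit hex bytes.
        for tok in tokens[1:]:
            if tok == "??":  # placeholder for a byte with no stored value
                continue
            out.append(int(tok, 16))
    return bytes(out)


sample = [
    "00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20",
    "00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01",
]
data = parse_hex_dump(sample)
```

Dropping the address column matters for classification: the addresses are the same for every sample, so they carry no signal.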
9. Disassembly
HEADER:00400000 ;
HEADER:00400000 ; +-------------------------------------------------------------------------+
HEADER:00400000 ; | This file has been generated by The Interactive Disassembler (IDA) |
HEADER:00400000 ; | Copyright (c) 2013 Hex-Rays, <support@hex-rays.com> |
HEADER:00400000 ; | License info: |
HEADER:00400000 ; | Microsoft |
HEADER:00400000 ; +-------------------------------------------------------------------------+
HEADER:00400000 ;
HEADER:00400000
HEADER:00400000
HEADER:00400000 .686p
HEADER:00400000 .mmx
HEADER:00400000 .model flat
HEADER:00400000
HEADER:00400000 ; ===========================================================================
HEADER:00400000
HEADER:00400000 ; [00001000 BYTES: COLLAPSED SEGMENT HEADER. PRESS KEYPAD CTRL-"+" TO EXPAND]
.text:00401000 ;
.text:00401000 ; Format : Portable executable for 80386 (PE)
.text:00401000 ; Imagebase : 400000
.text:00401000 ; Section 1. (virtual address 00001000)
.text:00401000 ; Virtual size : 00071050 ( 462928.)
.text:00401000 ; Section size in file : 00071200 ( 463360.)
.text:00401000 ; Offset to raw data for section: 00000400
.text:00401000 ; Flags 60000020: Text Executable Readable
.text:00401000 ; Alignment : default
.text:00401000 ; ===========================================================================
10. Disassembly
Highlighted: HEADER:00400000, the section name and address that prefix every line of the listing.
12. Disassembly
.text:00470050 ; =============== S U B R O U T I N E ====================================
.text:00470050
.text:00470050 ; Attributes: bp-based frame
.text:00470050
.text:00470050 sub_470050 proc near ; CODE XREF: start+D8D↓p
.text:00470050
.text:00470050 var_68 = dword ptr -68h
.text:00470050 var_64 = dword ptr -64h
.text:00470050 var_60 = dword ptr -60h
.text:00470050
.text:00470050 55 push ebp
.text:00470051 8B EC mov ebp, esp
.text:00470053 83 C4 98 add esp, 0FFFFFF98h
.text:00470056 33 C0 xor eax, eax
.text:00470058 8B 15 7C 10 4B 00 mov edx, dword_4B107C
.text:0047005E 89 55 EC mov [ebp+var_14], edx
.text:00470061 89 45 EC mov [ebp+var_14], eax
.text:00470064 53 push ebx
.text:00470065 8B 1D 7C 10 4B 00 mov ebx, dword_4B107C
.text:0047006B 83 FB 2D cmp ebx, 2Dh
.text:0047006E 75 03 jnz short loc_470073
.text:00470070 89 5D EC mov [ebp+var_14], ebx
.text:00470073
.text:00470073 loc_470073: ; CODE XREF: sub_470050+1E↑j
.text:00470073 56 push esi
.text:00470074 33 C0 xor eax, eax
.text:00470076 8B 5D EC mov ebx, [ebp+var_14]
13. Disassembly
Highlighted instruction: mov ebx, dword_4B107C
15. Disassembly
.idata:0046F4DC ;
.idata:0046F4DC ; Imports from KERNEL32.DLL
.idata:0046F4DC ;
.idata:0046F4DC ; ===========================================================================
.idata:0046F4DC
.idata:0046F4DC ; Segment type: Externs
.idata:0046F4DC ; _idata
.idata:0046F4DC ; DWORD __stdcall GetCurrentThreadId()
.idata:0046F4DC ?? ?? ?? ?? extrn __imp_GetCurrentThreadId:dword
.idata:0046F4DC ; DATA XREF: .text:0046F66C↓o
.idata:0046F4DC ; GetCurrentThreadId↓r
.idata:0046F4E0 ; BOOL __stdcall WriteFile(HANDLE hFile, LPCVOID lpBuffer, DWORD ...
.idata:0046F4E0 ?? ?? ?? ?? extrn WriteFile:dword ; DATA XREF: .text:00471E4C↓r
.idata:0046F4E4 ; BOOL __stdcall FindNextVolumeA(HANDLE hFindVolume, LPSTR lpszVolumeName, DW ...
.idata:0046F4E4 ?? ?? ?? ?? extrn FindNextVolumeA:dword
.idata:0046F4E4 ; DATA XREF: .text:00471E46↓r
.idata:0046F4E8 ; LPVOID __stdcall VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize, DWORD ...
.idata:0046F4E8 ?? ?? ?? ?? extrn __imp_VirtualAlloc:dword
.idata:0046F4E8 ; DATA XREF: VirtualAlloc↓r
.idata:0046F4EC ; BOOL __stdcall EnumResourceLanguagesA(HMODULE hModule, LPCSTR lpType, LPCSTR ...
.idata:0046F4EC ?? ?? ?? ?? extrn EnumResourceLanguagesA:dword
.idata:0046F4EC ; DATA XREF: .text:00471E70↓r
16. Disassembly
Highlighted: Imports from KERNEL32.DLL and __stdcall VirtualAlloc(
18. Byte ngrams
00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
Possibilities:
1gram: 256
2gram: 65536
3gram: 16777216
4gram: 4294967296
Solution: Hashing
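The hashing idea can be sketched with the standard library alone. This is an illustration, not the talk's code: it uses `zlib.crc32` as a stand-in hash (scikit-learn's HashingVectorizer uses a different hash function), and the helper names `ngrams` and `hashed_counts` are made up here.

```python
import zlib


def ngrams(tokens, n_min=1, n_max=3):
    """Yield every n-gram of the token list for n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def hashed_counts(tokens, n_features=2**16):
    """Count n-grams in a fixed number of columns chosen by a hash."""
    counts = {}
    for gram in ngrams(tokens):
        # The hash picks the column, so no vocabulary has to be stored.
        col = zlib.crc32(gram.encode("ascii")) % n_features
        counts[col] = counts.get(col, 0) + 1
    return counts


tokens = "00 00 80 40 40 28 00 1C".split()
vec = hashed_counts(tokens)

# Number of distinct byte n-grams for n = 1..4, as listed on the slide.
possible = [256 ** n for n in (1, 2, 3, 4)]
```

The trade-off is that different n-grams can collide in the same column; the vector width stays fixed no matter how many distinct n-grams exist.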
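19. Byte ngrams
Code for extracting the byte ngrams and reducing dimensionality:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline

# Hash 1- to 3-grams of space-separated byte tokens into 2**16 columns.
vectorizer = HashingVectorizer(
    input="content", lowercase=True, stop_words=None, ngram_range=(1, 3),
    analyzer="word", n_features=2**16, binary=False, norm=None,
    alternate_sign=False  # current spelling of the original non_negative=True
)
pipe = Pipeline([
    ("extraction", CustomExtractor(vectorizer=vectorizer)),
    ("sel", VarianceThreshold(threshold=0)),  # drop constant columns
    ("tfidf", TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True,
                               sublinear_tf=True)),
    ("kbest", SelectKBest(score_func=f_classif, k=500))  # keep the 500 best features
])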
20. Byte ngrams
The custom extraction step used in the pipeline:
import multiprocessing

import scipy.sparse
import toolz
from sklearn.feature_extraction.text import HashingVectorizer
from toolz.curried import filter, map  # curried versions, usable inside toolz.pipe

class CustomExtractor:
    def __init__(self, vectorizer=HashingVectorizer()):
        self.vectorizer = vectorizer

    def fit(self, X, y=None):
        return self  # stateless

    def transform(self, X, y=None):
        # Extract features from the files in parallel, 32 files per task.
        with multiprocessing.Pool() as pool:
            rows = pool.map(self.feature_extract, X, 32)
        return scipy.sparse.vstack(rows)

    fit_transform = transform

    def feature_extract(self, file_name):
        # Drop the leading address on each line and the "??" placeholder bytes.
        with open(file_name, "r") as lines:
            clean_bytes = " ".join(toolz.pipe(
                lines,
                map(lambda line: line.rstrip().split()[1:]),
                toolz.concat,
                filter(lambda b: b != "??" and b != "?")
            ))
        return self.vectorizer.transform([clean_bytes])
23. Instruction ngrams
Extracted instructions:
push lea push mov call mov mov pop retn
mov jmp
push mov mov call test jz push call add mov pop retn
mov mov mov mov retn
mov lea mov inc test jnz sub retn
mov mov mov push mov push push push push call add mov pop retn
mov mov mov push mov push push push push call add mov pop retn
xor retn
mov retn
mov retn
mov retn
mov mov mov retn
mov test jz mov mov push push call mov mov retn
push push push push call push call mov push push push mov call mov retn
mov mov mov retn
mov test jz mov mov push push call mov mov retn
push push push push call mov push push push mov call push call mov retn
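The talk does not show how the mnemonics were pulled out of the listing. One possible sketch follows, assuming each instruction line has the shape ".text:ADDRESS, hex bytes, mnemonic, operands" as in the disassembly slides; the regex and the helper names `mnemonics` and `bigrams` are assumptions for illustration.

```python
import re
from collections import Counter

# Assumed line shape: ".text:ADDRESS  HEX BYTES  mnemonic operands".
# Hex bytes are uppercase pairs; the mnemonic is the first lowercase word after them.
INSTRUCTION = re.compile(r"^\.text:[0-9A-F]{8}\s+(?:[0-9A-F]{2}\s+)+([a-z]\w*)")


def mnemonics(lines):
    """Return the instruction mnemonics found in an IDA-style listing."""
    found = []
    for line in lines:
        m = INSTRUCTION.match(line)
        if m:
            found.append(m.group(1))
    return found


def bigrams(tokens):
    """Count adjacent pairs of mnemonics (instruction 2-grams)."""
    return Counter(zip(tokens, tokens[1:]))


listing = [
    ".text:00470050 sub_470050 proc near",
    ".text:00470050 var_68 = dword ptr -68h",
    ".text:00470050 55 push ebp",
    ".text:00470051 8B EC mov ebp, esp",
    ".text:00470053 83 C4 98 add esp, 0FFFFFF98h",
    ".text:00470056 33 C0 xor eax, eax",
    ".text:00470058 8B 15 7C 10 4B 00 mov edx, dword_4B107C",
]
ops = mnemonics(listing)
pairs = bigrams(ops)
```

Lines without encoded bytes, such as the `proc near` header and the stack variable definitions, are skipped because they do not match the pattern.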
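24. Instruction ngrams
Code for extracting the instruction ngrams and reducing dimensionality:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline

# Hash 1- and 2-grams of instruction mnemonics into 2**25 columns.
vectorizer = HashingVectorizer(
    input="content", lowercase=True, stop_words=None, ngram_range=(1, 2),
    analyzer="word", n_features=2**25, binary=False, norm=None,
    alternate_sign=False  # current spelling of the original non_negative=True
)
pipe = Pipeline([
    ("extraction", CustomExtractor(vectorizer=vectorizer)),
    ("sel", VarianceThreshold(threshold=0)),  # drop constant columns
    ("tfidf", TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True,
                               sublinear_tf=True)),
    ("kbest", SelectKBest(score_func=f_classif, k=500))  # keep the 500 best features
])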
25. Named Features
Section names, imports, imported functions.
Extracted these features with regular expressions.
Features were (awkwardly) selected in the same step as instruction ngrams.
26. Named Features
import re

# Each feature has a regex, a function that extracts the name from a match,
# and a filter applied to the extracted name.
re_features = {
    "imports": {
        "re": re.compile(r"Imports from \w.+"),
        "extract": lambda m: m.group().split()[-1],
        "filter": lambda m: True
    },
    "imported_functions": {
        "re": re.compile(r"__stdcall \w.+\("),
        "extract": lambda m: m.group().split()[-1][:-1],
        "filter": lambda m: not m.startswith("sub_")
    },
    "section_names": {
        "re": re.compile(r"^\S+?:"),
        "extract": lambda m: m.group()[:-1],
        "filter": lambda m: True
    }
}
27. Named Features
from toolz import pipe, unique
from toolz.curried import map, filter

def process_re_feature(lines, re_dict):
    # Lazily yield the unique feature names found in lines.
    return pipe(
        lines,
        map(re_dict["re"].search),
        filter(lambda m: m is not None),
        map(re_dict["extract"]),
        filter(re_dict["filter"]),
        unique
    )
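To see what the three patterns pick up, here is a small standard-library sketch run on lines taken from the disassembly slides. It assumes the patterns use the `\w`, `\(` and `\S` escapes; the `extract` helper is a made-up stand-in for the toolz pipeline, not the talk's code.

```python
import re

# Patterns assumed to match the named-feature slide, with regex escapes in place.
IMPORTS = re.compile(r"Imports from \w.+")
STDCALL = re.compile(r"__stdcall \w.+\(")
SECTION = re.compile(r"^\S+?:")

lines = [
    ".idata:0046F4DC ; Imports from KERNEL32.DLL",
    ".idata:0046F4DC ; DWORD __stdcall GetCurrentThreadId()",
    ".idata:0046F4E8 ; LPVOID __stdcall VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize, DWORD ...",
    ".text:00470050 55 push ebp",
]


def extract(regex, transform, lines):
    """Return the unique transformed matches, in order of first appearance."""
    seen = []
    for line in lines:
        m = regex.search(line)
        if m:
            value = transform(m)
            if value not in seen:
                seen.append(value)
    return seen


libraries = extract(IMPORTS, lambda m: m.group().split()[-1], lines)
functions = extract(STDCALL, lambda m: m.group().split()[-1][:-1], lines)
sections = extract(SECTION, lambda m: m.group()[:-1], lines)
```

Each resulting name can then be treated as one token in a bag-of-words style feature vector, just like a byte or instruction ngram.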
30. Final Model
Gradient Boosting Classifier on 1026 features
Grid search optimized parameters
Also tried: LogisticRegression, MultinomialNB, KNeighborsClassifier, RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(
    loss='deviance', learning_rate=0.1, n_estimators=300, subsample=0.9,
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
    max_depth=3, init=None, random_state=None, max_features=200,
    max_leaf_nodes=None, warm_start=False, verbose=2
)
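32. Final Model tSNE Plot
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.pipeline import Pipeline

# Reduce to 50 dimensions with truncated SVD, then embed in 2D with t-SNE.
pipe = Pipeline([
    ("tsvd", TruncatedSVD(n_components=50)),
    ("tsne", TSNE(n_components=2, perplexity=40.0,
                  early_exaggeration=4.0, learning_rate=1000.0,
                  n_iter=1000, metric='euclidean', init='random'))
])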
34. Winning Strategies
xgboost
malware as an image
compression ratio as a feature
other expanded feature sets
probability calibration
semi supervised learning
(The slide groups these strategies under two labels: "usable in a product" and "specific to competitions".)
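The talk only names "compression ratio as a feature". One way such a feature could be computed is sketched below with `zlib`; the function name and the choice of compressor are assumptions for illustration, not the winning teams' method.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (0.0 for empty input)."""
    if not data:
        return 0.0
    return len(zlib.compress(data)) / len(data)


# Highly repetitive content compresses well; random-looking content does not.
repetitive = bytes(4096)        # 4096 zero bytes
random_like = os.urandom(4096)  # stand-in for packed or encrypted content

low = compression_ratio(repetitive)
high = compression_ratio(random_like)
```

A ratio near 1.0 suggests the content is already compressed or encrypted, which is why a single number like this can help separate packed samples from unpacked ones.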
35. Using Capstone
ida ******************************
CV Scores: [ 0.03800 0.02551 0.05283 0.03953 0.0350 ]
mean: 0.03817940685733493 std: 0.008799619405211161
capstone ******************************
CV Scores: [ 0.05065 0.0451 0.06953 0.05583 0.05089]
mean: 0.05441113231562615 std: 0.008283830117670508
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

# Join the hex bytes of every line, skipping the address column and "??" placeholders.
with open("data/sample/0A32eTdBKayjCWhZqDOQ.bytes", "r") as f:
    code = bytes(bytearray.fromhex("".join(map(
        lambda l: "".join(l.split()[1:]).replace("?", ""),
        f
    ))))

md = Cs(CS_ARCH_X86, CS_MODE_32)
# disasm_lite yields (address, size, mnemonic, op_str); keep the mnemonics except int3.
instructions = " ".join(
    [t[2] for t in md.disasm_lite(code, 0x1000) if t[2] != "int3"]
)
36. Disassemblers
IDA: not (easily) batch distributable
capstone: single pass produces suboptimal results
radare2: Python scriptable reversing framework
vivisect: pure Python, largely undocumented disassembler and analysis project
37. Other Projects
pefile: extracts header information from executables
binglide: visualizations of entropy and byte ngrams
cuckoo: automated dynamic analysis
barf: binary analysis framework with code analysis
38. Conclusions
Python tools for text classification can easily be adopted for malware classification.
When using instruction ngrams, your disassembler and analysis passes are very important.
references: http://bit.ly/scipy-malware