2. $whois tkisason
• Junior researcher @ www.foi.hr
• Head of Open Systems and Security lab
• Likes to build and break things
• tonimir.kisasondi@foi.hr
• skype:tkisason
3. What happens when you digitize
the whole world?
• Google, Facebook, Twitter
• Is it a bubble or a valid business model?
• The new buzzword is big data
• Storage per capita doubles every three
years
• Kryder's law says that storage density
doubles every 18 months
• Can you really store the whole world?
5. What happens when you digitize
the whole world?
• Storing 20 Tbps traffic
• Map/Reduce like infrastructure to mine and
combine data
• Why is this interesting to us now?
o Storage is cheap
o Big data is useful everywhere
o Use tricks that intel agencies use to enable cool stuff
o It’s not rocket science...
o Yes, the most interesting applications are in cross
disciplinary fields
6. First: OSINT
• OSINT: Open Source Intelligence
o Finding, selecting and acquiring information from
open, publicly available sources like newspapers,
books, the internet, social networks (Twitter)...
o Various registries (company registries, open postings, public listings)
o Metadata
o Mine those, and you might find a lot of interesting
stuff
o White zone – Legal and ethical
o Black zone – Illegal and Unethical
o Gray zone – Legal but unethical
7. First: OSINT
• Not everything is OSINT, but you can
actually glean interesting data from almost
anything
• It worked for the guys that wrote Splunk, so
they decided to write Splunk.
• It works for data mining folks.
8. Data analysis 101
• Data is just data, you have to correlate it or
put it in context for it to be useful
o Find outliers
o Spot differences
o Find common attributes
o Find connections, not answers
o First identify, then try to interpret
o Put data into perspective, seek help :)
o "Data driven design"
• A nice showcase of data driven design:
o A/B Testing
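A minimal sketch of the "find outliers" step, with no advanced statistics: count events per key and flag anything far above the median. The sample data, the `find_outliers` name and the factor of 5 are assumptions made up for illustration.

```python
from collections import Counter
from statistics import median

def find_outliers(counts, factor=5):
    """Return the keys whose count is more than `factor` times the median count."""
    typical = median(counts.values())
    return sorted(key for key, value in counts.items() if value > factor * typical)

# Hypothetical sample: one entry per request, keyed by client IP
# (addresses are from the documentation ranges, not real hosts).
requests = (["192.0.2.1"] * 3 + ["192.0.2.2"] * 4 +
            ["192.0.2.3"] * 2 + ["198.51.100.7"] * 60)

counts = Counter(requests)
outliers = find_outliers(counts)
print(outliers)
```

The median is used instead of the mean so that one very noisy client does not drag the baseline up with it.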
9. Do I need advanced statistics?
• Most of the time: No
• Are statistics awesome? Yup
• Well, don’t play with things where you can
get hurt. :)
• Seek professional help
• Grep, Google Refine/Mojo facets, and your
favorite scripting language are just fine...
10. How can we approach the problem
• There are many (finished) tools, if they help,
great
• Roll your own script
• Duct tape some finished libraries
• Most of the time it takes less time than finding a
tool.
• Cheating and stealing is encouraged. ;)
12. Bad design 101
• If you hack it together, watch out for some
gotchas
• Line per line analysis
o Minimal complexity O(n)
• You can easily kill the speed of your script/
parser/*
• Best separator is \t
• .split() is a godsend
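A small sketch of why a tab separator plus .split() goes a long way: fields may contain spaces (like the timestamp here) and still split cleanly. The line layout and field names are hypothetical, not taken from any real log format.

```python
# Hypothetical tab-separated record: timestamp, client IP, action, URL.
line = "2012-05-01 10:00:00\t192.0.2.10\tDENIED\thttp://example.com/\n"

def parse_line(line):
    """Split one tab-separated record into named fields."""
    timestamp, client, action, url = line.rstrip("\n").split("\t")
    return {"timestamp": timestamp, "client": client,
            "action": action, "url": url}

record = parse_line(line)
print(record["action"], record["client"])
```

Each line is handled once and independently, so the whole pass stays O(n) in the number of lines.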
13. ignorecase?
#!/usr/bin/python
import re
a = open("access.log")
b = open("test.log","w")
# Copy every line that contains "DENIED" in any letter case
for line in a:
    if re.search("DENIED",line,re.IGNORECASE):
        b.write(line)
b.close()
$ time ./re-search.py
real 0m4.516s
user 0m4.444s
sys 0m0.056s
14. simple RE
#!/usr/bin/python
import re
a = open("access.log")
b = open("test.log","w")
# Same search, but case sensitive
for line in a:
    if re.search("DENIED",line):
        b.write(line)
b.close()
$ time ./re-search.py
real 0m2.520s
user 0m2.456s
sys 0m0.056s
15. find
#!/usr/bin/python
a = open("access.log")
b = open("test.log","w")
# Plain substring search, no regular expressions
for line in a:
    c = line.find("DENIED")
    if c >= 0 :
        b.write(line)
b.close()
$ time ./testparse.py
real 0m0.781s
user 0m0.728s
sys 0m0.044s
16. grep
$ time grep DENIED access.log > test
real 0m0.074s
user 0m0.040s
sys 0m0.032s
17. To sum it up...
Python RE ignorecase : 4.516s
Python RE : 2.520s
Python find : 0.781s
grep : 0.074s
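To repeat the comparison without the original access.log, here is a rough harness that runs the same three Python approaches (plus the `in` operator, which the slides do not measure) over synthetic lines. The line format and line count are made up, so absolute numbers will differ from the ones above; only the relative ordering is worth looking at.

```python
import re
import timeit

# Synthetic stand-in for access.log: every tenth line contains "DENIED".
lines = [
    "1336000000.000 %s GET http://example.com/%d\n"
    % ("TCP_DENIED/403" if i % 10 == 0 else "TCP_MISS/200", i)
    for i in range(20000)
]

def with_re_ignorecase():
    return sum(1 for l in lines if re.search("DENIED", l, re.IGNORECASE))

def with_re():
    return sum(1 for l in lines if re.search("DENIED", l))

def with_find():
    return sum(1 for l in lines if l.find("DENIED") >= 0)

def with_in():
    # Not in the slides: the idiomatic substring test.
    return sum(1 for l in lines if "DENIED" in l)

results = {}
for func in (with_re_ignorecase, with_re, with_find, with_in):
    results[func.__name__] = func()          # number of matching lines
    seconds = timeit.timeit(func, number=5)  # total time for 5 passes
    print("%-20s %.3fs" % (func.__name__, seconds))
```

All four variants should report the same match count; if they do not, the comparison is not measuring the same work.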
18. Primer on useful and interesting
tools
• ipython
o http://ipython.org/
• python-nltk
o http://nltk.org/ (nltk.clean_html(messy_html); removed in NLTK 3, which points to BeautifulSoup's get_text() instead)
• python-requests
o www.python-requests.org
• python-graphviz
o http://code.google.com/p/pydot/
• python-google by Mario Vilas
o https://github.com/MarioVilas
21. So, how about a short showcase of
some things I did
• Yeah, they are lame, and simple
• Works for me
• Available on github
• Hope they can motivate you to do some fun
and simple “one afternoon” stuff
• Most of the “hard” stuff is easy once you try
to hack it together
22. mkwordlist -
https://github.com/tkisason/gcrack
• Idea: Create wordlists with google results for
a set of keywords
• For a keyword return top 5 links (or N)
• Scrape and clean with NLTK
• Optional lowercasing for future mutations
o You can use JtR/HashCat with a ruleset to mutate
the lists
• Result: Nice targeted wordlist generator
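A rough sketch of the "scrape and clean" half of the idea, not the actual mkwordlist code. It assumes the HTML has already been fetched, and it swaps NLTK for the standard library's html.parser so it runs without extra packages. The `TextExtractor` and `make_wordlist` names, the sample page and the `min_len` cutoff are hypothetical.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def make_wordlist(html, lowercase=True, min_len=4):
    """Turn one HTML page into a sorted list of unique words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    if lowercase:  # optional lowercasing, as on the slide
        text = text.lower()
    words = re.findall(r"\w+", text)
    return sorted({w for w in words if len(w) >= min_len})

# Made-up page standing in for one of the top search results.
page = ("<html><head><style>p {color: red}</style></head><body>"
        "<p>Acme Rockets builds rockets in Varazdin.</p>"
        "<script>var x = 1;</script></body></html>")
wordlist = make_wordlist(page)
print(wordlist)
```

The resulting list can then be fed to a mutation ruleset, as the slide suggests.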
23. mkwordlist -
https://github.com/tkisason/gcrack
• Some other cool things
o Keywords can be google dorks
§ site:.bg
§ filetype:txt
§ “”
• Interesting results for targeted attacks
• Broad keywords are also ok
o If you are pentesting a company or similar
26. gcrack -
https://github.com/tkisason/gcrack
• Idea: Most of the weak password hashes are
cracked and leaked on the public internet
• Google indexes the pages, and the content
of these pages contains the plaintext
• Use google searches for password cracking
• Create bag of words as a wordlist
• Result: Very effective and fast hash cracker
• Bonus: hash agnostic
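A sketch of the last step only, assuming the bag of words has already been scraped from the search results. This is not the gcrack code: the `crack` function, the word list and the set of algorithms are hypothetical. Trying several hashlib algorithms per candidate is just one way to avoid needing to know the hash type up front.

```python
import hashlib

def crack(target_hex, candidates,
          algorithms=("md5", "sha1", "sha256", "sha512")):
    """Return (algorithm, plaintext) for the first match, or None."""
    target_hex = target_hex.lower()
    for word in candidates:
        data = word.encode("utf-8")
        for name in algorithms:
            if hashlib.new(name, data).hexdigest() == target_hex:
                return name, word
    return None

# Made-up bag of words, as if collected from pages that mention the hash.
bag_of_words = ["index", "password", "letmein", "forum", "hash"]

# Build a target from a known plaintext so the example is self-checking.
target = hashlib.sha1(b"letmein").hexdigest()
found = crack(target, bag_of_words)
print(found)
```

Because the candidate list is tiny compared to a normal wordlist, the check finishes almost instantly even with several algorithms per word.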
27. logtool
https://github.com/tkisason/logtool
• Log files are interesting...ish
• Especially if you have a compromised
machine and the attackers were noobish
enough to leave the log files
• What can you learn:
o IP addresses (known proxies and Tor exit points)
o Usernames (are they generic or are they specific)
o IP-GeoIP data
o Toolmarks (user agents, wordlists for attacks)
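A minimal sketch of pulling those items out of a web server log, not the logtool implementation. It assumes the common "combined" access log layout; the regular expression, the `summarize` function and the sample lines are made up for illustration. GeoIP and Tor exit lookups are left out because they need external data.

```python
import re
from collections import Counter

# Assumed "combined" layout:
# ip ident user [time] "request" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[[^\]]+\] "[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Made-up sample lines (documentation-range IP addresses).
sample = [
    '203.0.113.5 - admin [01/May/2012:10:00:00 +0200] '
    '"POST /login HTTP/1.1" 401 512 "-" "sqlmap/1.0"',
    '203.0.113.5 - root [01/May/2012:10:00:01 +0200] '
    '"POST /login HTTP/1.1" 401 512 "-" "sqlmap/1.0"',
    '198.51.100.9 - - [01/May/2012:10:00:02 +0200] '
    '"GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

def summarize(lines):
    """Count IP addresses, usernames and user agents across log lines."""
    ips, users, agents = Counter(), Counter(), Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        ips[match.group("ip")] += 1
        if match.group("user") != "-":
            users[match.group("user")] += 1
        agents[match.group("agent")] += 1
    return ips, users, agents

ips, users, agents = summarize(sample)
print(ips.most_common(1), sorted(users), agents.most_common(1))
```

Generic usernames like the ones in the sample hint at a wordlist attack; specific ones hint that the attacker knew something about the target.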
28. linkcrawl and nltk
https://github.com/tkisason/linkcrawl
• Building a simple crawler is easy (or use
wget and cURL, man up and write some
shell scripts)
• NLTK is awesome!
o import nltk; nltk.clean_html(data)
• http://orange.biolab.si is also a nice platform
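To back up the "building a simple crawler is easy" point, here is a sketch of the core step: extracting absolute links from one page. It is not the linkcrawl code. Fetching is left out (python-requests or wget would do), and the `LinkExtractor` and `extract_links` names and the sample page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collect unique absolute http(s) links from <a href=...> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links and drop the #fragment part.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if (absolute.startswith(("http://", "https://"))
                and absolute not in self.links):
            self.links.append(absolute)

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Made-up page with a relative link, a fragment, a mailto and a duplicate.
page = ('<a href="/about">About</a> '
        '<a href="http://example.org/x#top">X</a> '
        '<a href="mailto:a@example.com">mail</a> '
        '<a href="/about">again</a>')
links = extract_links(page, "http://example.com/index.html")
print(links)
```

A crawler is then a queue of URLs to fetch, this function, and a set of already visited URLs.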