This document provides an overview of InfoArmor's threat intelligence and data ingestion capabilities. It begins with a brief history of InfoArmor and its vision for the future. It then discusses how threat data is collected from the dark web and other sources through techniques like forum scraping, human operatives, and threat actor profiling. The document also discusses lessons learned from processing over 1 billion rows of data in databases like Elasticsearch and MariaDB. It cautions against issues like poor schema design, not closing database connections, importing too much data at once, and allowing malicious scripts into databases. The key takeaways are that data should be ingested and processed incrementally and that remote DBAs can help manage infrastructure challenges.
2. What we will be covering today.
HOW DID WE GET HERE?
A brief history of InfoArmor, and the
greatness that got us to where we are
today.
WHERE ARE WE GOING?
A look at the vision and where we see
InfoArmor going in the future.
HOW DO WE GET THERE?
What will it take for us to achieve our
vision, and what is our process to get
there?
6. The unseen threats.
Dark web monitoring through InfoArmor Advanced Threat Intelligence.
Forum scraping
Programmatic forum scraping with bots, while human operatives gain access to closed forums.
Human operatives
Combat hackers who are using technology and innovating every day.
Structuring raw data
Compromised data files must be formatted, organized and canonicalized to be fully leveraged.
Threat actor profiling
Tracking threat actors' moves as we build out profiles, information and patterns to thwart risks.
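The "structuring raw data" step above can be sketched in Python (the deck's language of choice). This is a minimal illustration only: the `email:password` line format, the accepted separators, and the canonical field names are all assumptions, not InfoArmor's actual pipeline.

```python
import re

# Hypothetical pattern for one leaked "email:password" line; the
# separators (: or ;) and field names are assumptions for illustration.
CRED_LINE = re.compile(
    r"^\s*(?P<email>[^:;\s]+@[^:;\s]+)\s*[:;]\s*(?P<password>\S+)\s*$"
)

def canonicalize(raw_line):
    """Turn one raw dump line into a structured record, or None if unparseable."""
    m = CRED_LINE.match(raw_line)
    if not m:
        return None
    email = m.group("email").lower()  # case-fold so duplicates collapse
    return {
        "email": email,
        "domain": email.split("@", 1)[1],
        "password": m.group("password"),  # never case-fold the secret itself
    }
```

Unparseable lines return None rather than raising, so a bad row in a 300 GB botnet log doesn't kill the whole import.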
7. 60% of companies cannot detect compromised credentials, surveyed security pros say
Source: https://www.csoonline.com/article/3022066/security/60-of-companies-cannot-detect-compromised-credentials-say-security-pros-surveyed.html
8. This product will get you 100.000 United Kingdom "HOTMAIL" Emails Leads
Source: http[:]//6qlocfg6zq2kyacl.onion/viewProduct?offer=857044.38586
12. Lessons from 1 billion
rows
What I learned that allowed me to sleep
again
13. Bird’s eye view of data
- Relational dbs for web application and storage of known
structured data
- Elasticsearch for unstructured and fulltext searching
- Replication off-site
- MariaDB remote DBAs monitor all InfoArmor databases
Over 2 billion credentials
45 million forum posts
300 GB and growing of botnet logs
Pretty much all code is in Python.
14. Don’t Do That!
- Feature worked for some inputs, but not others
- Schema was suboptimal, leading to full table scans
- 4-way join, queries running hundreds of thousands of seconds
- Had to kill ‘em
- With MariaDB assistance, planned out new schema for
credentials
- More intuitive
- Meets business needs in API and GUI
- Listen to end users!
Non tech lesson: Cultivate relationships outside of tech!
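The full-table-scan problem above is easy to catch before end users do. A minimal sketch, using stdlib sqlite3 as a stand-in for MariaDB (whose `EXPLAIN` output is analogous); the `creds` table and `domain` column are hypothetical:

```python
import sqlite3

def plan_for(conn, sql):
    """Return the textual query plan; a 'SCAN' step means a full table scan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(r[-1] for r in rows)  # last column holds the detail text

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE creds (id INTEGER PRIMARY KEY, domain TEXT, email TEXT)")

# Without an index on domain, this query scans the whole table...
before = plan_for(conn, "SELECT email FROM creds WHERE domain = 'example.com'")

# ...after adding one, the planner switches to an index search.
conn.execute("CREATE INDEX idx_creds_domain ON creds (domain)")
after = plan_for(conn, "SELECT email FROM creds WHERE domain = 'example.com'")
```

Running EXPLAIN on the queries behind every portal search box is a cheap way to find the next "don't search for gmail.com" before it becomes water-cooler talk.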
15. Multithreading Mayhem
- Parallelized queries to multiple databases
- In Pyramid, achieved with separate DB Sessions
- Sessions weren’t closed, leaving connections open
- Fell outside of normal Zope/SQLAlchemy flow
- Monyog alerts about max’d connections, restarted application to
clear connections
- Found issue in code, added .close()
Lesson: Configuration changes can mask a problem without actually solving it
17. Don’t Bring All Groceries in at Once
- Sometimes a ton of rows need to be updated
- Even if something doesn’t get committed….
...Log entries and rollbacks get created
- Gums up replication
- Wastes time
- Watch max_allowed_packet
Lesson: Data should be updated in small bites
Programmatic!
18. Same for import parsing scripts
Where multithreading amplifies binlog size
- Don’t get greedy, nothing is worth screwing up replication or your
application
Non tech lesson: Add 20 to 200 percent to time estimates for imports.
Process and organization will set you free
19. IDS - Intrusion Detection System
Or rather “Inline Data Shredder”
- Scrape malicious looking javascript, php, python, perl scripts
- Will normally get bounced on the way in from the scraper
- Replication kept mysteriously stopping
- Engineering team getting “WTF?” alerts from all angles
Found the chunk of code in the database. Replication now over SSL.
Lesson: Coincidence...or degree of separation?
20. Final thoughts...
- Data is business, business is data.
- Let remote DBAs handle the nuts and bolts
- Focus on your application and goal of the data
- Make data available to sales people, but toolify it
- Keep evolving
1. How did we get here
About InfoArmor Founded in 2007
EPS Story
ATI Story
2. Where are we going
More established credit alerts
More security alerts, such as high-risk transactions or fraud-related activity
More underground economy
More actionable alerts near real time
3. How do we get there
Ingestion of large data sets
Correlations of large data sets
Near real time, high availability
Follow on from Christian’s points.
About 700 million rows when I took over: 700 million accumulated over 4 years or so, then tripled in less than 2 years
New breaches, repacks of breaches,
Ingest process was disrupting normal use
Querying process fell apart
High disk consumption due to duplicate data
Clobbered with behind-the-scenes processes, hidden mines from sales people
Forum, pastes, analyst dump files
Files include medical records, clinical trial PDFs, emails, xls, pdf. Some stuff too hot to put into production queryable storage
Botnet logs, organized and unorganized, different formats
Today:
Over 2 billion rows of credentials
Several indices on single rows and covering grouped indices on some columns
Raid 5 nvme ssds (#yolo)
40 million + forum posts with fulltext via ES
Application aware of where to read and write
Offsite replication
Monitored by remote dbas
Improved workflow of analyst communication
Long queries running from certain search boxes in the portal or API (LIKE combined with a 4-way join). "The previous guy told me not to search for bigger domains….."
Ben Stillman came out as part of the initial consulting engagement and evaluated the schema for the credentials database. Full table scans are the devil.
Duplicate data was stored across 4 tables; almost every business use of the data required doing costly joins.
Determine minimum useful unit of data for the business. What constitutes the most useful result set? How to quickly and reliably retrieve it? How to keep it updated with new data without new data making old data useless
Determine how closely related tables are, is there a 1to1 ratio of rows? Do they describe unique units of data?
Find the line between what collection of attributes constitute a useful record, and the cost of updating those records if denormalized too hard.
Is there anything you tell an end user not to do? Is there water-cooler talk about something that is slow? SHOW PROCESSLIST;
Solve the issue.
"Don't search for gmail.com", "don't query for yahoo"
These cause long queries due to joins using low-cardinality indices, or indices that are too huge, causing MySQL to just scan the entire tables for the results
All problems can be solved, treat it like a Zelda dungeon or Metroid. Ask for help, research, MariaDB remote dba...
Initial thought was to speed up loading of the dashboard by having it fire off multithreaded queries
Random alerts about application being down despite nearly all things quiet.
Monyog alerts about max connections, so had remote DBAs increase max_connections
Mitigated the issue, but it still happened
Sometimes bugs make it to production, stay calm
Symptoms were immediate 500 errors
Story: Scraper went haywire, not storing the last post properly, causing a flood of data. Could see the disk usage graph rise and fall. Amplified other export processes.
Updated the format of posts, so old ones had to be updated with new data; initially did it in one #yolo query. Huge transaction -> huge log -> huger redo log -> huge….
Let the remote DBA be your canary. We have a Slack channel and I'll get pinged if something is about to go off the rails.
Programmatically solve problems in your preferred language; don't use the mysql command line to update large chunks of data, or shell scripts that don't go into version control. RemoteDBA will ask WTF if you are doing "yolo" update-everything queries
Consider all aspects of network
Story:
Replication kept stopping to other datacenter
Remote dba flummoxed
Getting IDS alerts; engineering and security were lost as to what a PHP injection was doing on the database replication server
Correlated id of row that contained the code in the body of the pastebin paste
Resolved with ssl connection