Presenter: Harold Nguyen, Senior Data Scientist at Nexgate
In this talk, we focus on a use case by showing how Cassandra can detect spam and spammers on social media. We also show how we use Cassandra to train our 100+ social-media-security classifiers. The accuracy of any security product is directly tied to the breadth of the corpus of data upon which it is built. For Nexgate, this means that the success of our products is inextricably tied to our ability to save everything we've ever scanned, but in a way that is still readily accessible. In the days before NoSQL, this was hard. This talk is about how Datastax and Cassandra make it easy.
Cassandra Summit 2014: Social Media Security Company Nexgate Relies on Cassandra for Success in Fraud Detection
1. Social Media Brand Protection
and Compliance
Harold Nguyen
12 September 2014
Cassandra Summit 2014
September 10-11 | #CassandraSummit
2. What is Nexgate?
Ø Nexgate helps automate the discovery,
monitoring, and protection of your brand’s
Social Media accounts
(Let us show you: nexgate.com/demo)
Ø One thing we do is offer automated
classification of content for 100+ categories
– Including malware, spam, hate speech, etc…
– And flagging for violation of HIPAA, FFIEC, SEC
and FINRA compliance standards
3. About Us
Company – Security & Compliance for Social
Ø Launched April 2013 - Series A from Sierra & WindForce Ventures
– 18 employees, 7 in Engineering (2 Data Scientists)
Ø Security people from:
Ø Customers:
4. Scale of Data
Ø Over 350 million total pieces of
social media content spread across
Facebook, Twitter, YouTube,
Google+, LinkedIn
Ø Currently about 1.5 million new
pieces of content per day
– All classified in real time as it
comes in
Ø Over 65 million total social media
content authors
Ø About 250,000 new social media
content authors per day
5. Content Classification
Ø In order to have an
accurate classification
system, we need to have
A LOT of data
Ø In order to have a lot of
data, we need a strong
and capable infrastructure
to store all the data we
collect
“The completeness of any classification
system is predicated on the breadth of the
corpus of data upon which it is built”
– Rich Sutton, CTO
6. Relational Database
Ø In the beginning, we
threw everything into
MySQL – why not? It’s:
– Easy to use
– Many people already
know how to use it
– Secure
– Inexpensive (free)
– Manages memory well
– Fast (up to 50 million
rows)
– Supports several
development interfaces
Not to say MySQL is a dumpster – we heavily rely on MySQL!
7. Not Only Relational Database
Ø But after several months, we realized we needed a
NoSQL solution
8. Social Media Data
Ø Social media data size is
on average about 1k,
including content and
metadata
– Content includes the
actual text and links from
the social media message
– Metadata includes time,
social ID, parent, account,
etc…
– Metadata can vary
depending on the social
media platform (likes,
followers, subscribers,
etc…)
Social media data are pretty rough and jagged
– store some of it in a NoSQL solution
9. Storing Social Media Data
Ø Store social media data across both SQL and NoSQL
SQL: Fixed length, non-null,
heavily indexed, group access
NoSQL: Variable length, commonly null,
softly indexed, single access, text search
10. NoSQL Requirements
Ø Our requirements when searching for a NoSQL solution
Easy to use
Simple and proven horizontal scalability
Integrated tools for research (Solr): search and analysis
Operational simplicity: all nodes the same
Fantastic Enterprise support (Thanks!)
Simple to deploy and maintain
Integration support for other “big data” tools (Hadoop, Spark!)
11. Deployment
• Multi-region AWS EC2
• M1 Large instances
• Instance attached storage
• About to scale again
• Separate dev, test, prod clusters
Datastax:
• Start-up pricing, per-core pricing
• On-site experts, responsive support
13. Fighting Spam with
Cassandra
Ø Among the many security and compliance
classifications that Nexgate provides, we also
have powerful spam detection
Ø Spam can be a single link directing to a
fraudulent site (screenshots of a Facebook
comment):
14. Ø Or it can be less obvious, and more personal. This is extremely common.
Here, the same user has posted the same message across different social
media accounts (screenshot taken from Nexgate product):
15. Social media spam has grown
687% since the start of 2013.
Get the report at http://nx.gt/SocialSpamReport
16. Cassandra and
Social Media Spam
Ø Can create Spam signatures to catch this
type of content
Ø ...but it would be too slow to catch Spam in
real time.
Ø Cassandra
17. Define Your Data Model
Ø Even though Cassandra is a NoSQL schema-less
database, it is worth carefully defining
the data model
Ø Can’t just “throw data at it” – can make for
some really inefficient queries
Ø Define the data model based on how you will
query the data
Ø For us, we want to find spam content
that has been posted multiple times
– Spammers tend to post same-content messages
18. Spam Multiplicity Data Model
Ø Typical table in Cassandra
– Wide “unconstrained” rows are a nice feature compared with SQL
Ø Row key -> hash of content
Ø Column Key -> Unique ID (strictly increasing with time)
Ø Column Value -> Item_id and time of post
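To make the layout concrete, here is a minimal in-memory sketch in Python. It only mimics the shape of the wide row and is not a Cassandra client call. The names `WideRow`, `content_row_key`, and `table`, the sample IDs, and the choice of SHA-1 are assumptions for illustration; the slides only say "hash of content".

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class WideRow:
    # column key (unique ID, increasing with time) -> (item_id, post time)
    columns: Dict[int, Tuple[str, int]] = field(default_factory=dict)


def content_row_key(content: str) -> str:
    # Assumption: SHA-1 of the raw text stands in for "hash of content".
    return hashlib.sha1(content.encode("utf-8")).hexdigest()


# Assumption: a plain dict stands in for the Cassandra table.
table: Dict[str, WideRow] = {}

key = content_row_key("Click here for free followers")
row = table.setdefault(key, WideRow())

# Two posts with identical content land in the same row as separate columns.
row.columns[1001] = ("item-a", 1410500000)
row.columns[1002] = ("item-b", 1410500060)
```

Because the row key is derived from the content alone, every repeat of the same message ends up in one row, which is what makes the later "count the columns" check cheap.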
19. Why this Data Model?
Ø Spammers typically post the same content over and over
Ø Easy to determine how many times a same-content post is made:
check the number of columns
Ø Will never double count because the column key will simply be
updated instead of added
Ø Indexed by the content, so quick reads and writes
Ø By reading the column value, can extract the time series information
of duplicated posts
– Can also map back to the original value – we store actual content
indexed by the item_id in another Cassandra table
Ø Cassandra not a magic bullet
– still need a relational database to glue all the pieces of data together
– Batch processing may need other tools like Hadoop
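The properties listed on this slide can be sketched with a plain dictionary standing in for the table. This is an illustration under stated assumptions, not Nexgate's code: `rows`, `upsert`, `multiplicity`, and `timeline` are hypothetical names, and the hash "abc" and the timestamps are made up.

```python
from collections import defaultdict

# Assumption: content hash -> {column key -> (item_id, posted_at)}
rows = defaultdict(dict)


def upsert(content_hash, column_key, item_id, posted_at):
    # Writing an existing column key overwrites it, so replays never double count.
    rows[content_hash][column_key] = (item_id, posted_at)


def multiplicity(content_hash):
    # Number of columns in the row = number of same-content posts.
    return len(rows.get(content_hash, {}))


def timeline(content_hash):
    # Column keys increase with time, so sorting them yields the time series.
    cols = rows.get(content_hash, {})
    return [cols[k][1] for k in sorted(cols)]


upsert("abc", 1, "item-1", 100)
upsert("abc", 2, "item-2", 160)
upsert("abc", 2, "item-2", 160)  # replayed write: updated, not added

print(multiplicity("abc"))  # 2
print(timeline("abc"))      # [100, 160]
```

The `item_id` in each column value is what would be used to look up the original content in the separate table mentioned above.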
20.
21. Real-world spam multiplicity
Ø This has become invaluable to us for catching spam content in real
time – the following “rant” comment was posted 38 times…
– Brand can more easily moderate given automated tools
Ø In another example, a customer received 25,000 inappropriate
messages, and this tool helped us automate content removal
22. Importance of Keeping All Data
Ø Another way to tackle real-time spam is by
identifying spammy users
– Since Cassandra effortlessly keeps all the
content we observed, our algorithm takes into
account all the posts contributed by an author
to determine if they are a spammer
Ø Additionally, it is important to keep all data
to train our 100+ classifiers
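The slides do not describe the author-level algorithm, so the following is only a hypothetical heuristic showing why the full post history matters: it scores an author by the share of their posts that repeat the same text. The function name `duplicate_ratio` and the `min_repeats` threshold are assumptions.

```python
from collections import Counter


def duplicate_ratio(author_posts, min_repeats=3):
    """Share of an author's posts whose exact text they posted
    at least min_repeats times (hypothetical heuristic)."""
    if not author_posts:
        return 0.0
    counts = Counter(author_posts)
    repeated = sum(n for n in counts.values() if n >= min_repeats)
    return repeated / len(author_posts)


# Made-up history: four identical posts and one unique post.
history = ["buy now"] * 4 + ["hello there"]
ratio = duplicate_ratio(history)
```

A score like this only becomes meaningful when every post by the author is available, which is the point of keeping all observed content.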
23. Tuning Cassandra
Ø Cassandra actually has been humming along quite nicely!
– Barely any tweaking needed from default values
– No deletes (just the nature of our dataset) => few repairs
needed (repair resolves inconsistencies across replicas of the
data, and matters most when deletes are involved)
• Fine for us, because repair requires intensive disk I/O
Ø Only times we observed performance issues:
– When the rates of our reads and writes reached a certain threshold
– When the size of the data being inserted was too large
– Heap memory issue with Cassandra 1.1.x
Ø In all cases, Datastax provided a quick and simple solution,
mostly just toggling a few parameters in config files and
restarting the nodes
24. Cassandra Community
Ø Community is wonderful - it's really easy to jump on the
Cassandra IRC channel and talk to fellow users and
developers to get real-time feedback.
– With IRC and mailing list help, implemented composite columns
to detect malware sites on our second day of using Cassandra,
3 years ago
Ø In fact, when we tested a migration to the latest version of
Cassandra and one of our Ruby wrappers didn't play nice with
CQL3, I was able to speak directly with the Ruby wrapper
author on IRC and got an explanation of why it didn't work.
– That same day, I committed a fix and opened a pull request
against the Ruby wrapper on GitHub, and the author looked at it
the next morning
Ø Datastax support has been invaluable for providing fast
feedback and simple solutions
25. Datastax Additional Tools
Ø OpsCenter helpful in debugging
performance issues
Ø Solr – used to obtain training data for
classifiers by phrase matching
Ø Looking forward:
– Datastax Spark support, to look into training on
labeled data with MapReduce
26. Thank you!
Let us show you: nexgate.com/demo
Follow us:
@NXGate
facebook.com/NXGate