Austin Cassandra Users Meetup on July 15th 2013: http://www.meetup.com/Austin-Cassandra-Users/events/125837112/
Datafiniti will present some of the unique and interesting challenges they faced while building their data search engine, including a detailed use case covering their Cassandra data model and integrated technologies such as Solr.
2. Data on the Web
DATAFINITI • THE SEARCH ENGINE FOR DATA • WWW.DATAFINITI.NET
● 48 billion pages on the Internet
● 56 million GB of data
● Incredibly powerful connections
● 70% of useful data is unstructured
● User generated data + facts
3. Too Much Data…
4. Search
● Modern search engines
○ Unstructured data
○ Unconnected data
○ Unnormalized data
5. Goals
● Collect vast amounts of data through web crawling
● Normalize and deduplicate data
● Make it searchable and meaningful
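The "normalize and deduplicate" goal can be illustrated with a minimal sketch. The slides do not describe Datafiniti's actual method; the key function, the field names, and the merge rule (first non-empty value wins) are all assumptions made for illustration.

```python
import re

def normalize_key(name: str, address: str) -> str:
    """Build a comparison key: lowercase, drop punctuation, collapse whitespace."""
    text = f"{name} {address}".lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(records):
    """Merge records sharing a normalized key; the first non-empty value wins."""
    merged = {}
    for rec in records:
        key = normalize_key(rec["name"], rec["address"])
        target = merged.setdefault(key, {})
        for field, value in rec.items():
            if value and field not in target:
                target[field] = value
    return list(merged.values())

# Hypothetical crawl output: the first two rows describe the same place.
crawled = [
    {"name": "Joe's Pizza", "address": "12 Main St.", "phone": ""},
    {"name": "JOES PIZZA", "address": "12 Main St", "phone": "555-0100"},
    {"name": "Ann's Cafe", "address": "9 Oak Ave", "phone": "555-0199"},
]
unique = deduplicate(crawled)
```

A real pipeline would likely need fuzzier matching (abbreviations, typos), which this sketch does not attempt.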
6. Needs
● Speed
● Scale
● Adaptability
7. The Solution
● Very fast
○ Log-structured storage
● Easily scalable
○ Decentralized rings
● Completely adaptable
○ Schema-less key/value store
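To make "decentralized rings" concrete, here is a toy token ring in which each node owns the hash range ending at its token. This is only a sketch of the general idea: the class name, node names, and hash choice are assumptions, and Cassandra's real partitioners, token assignment, and replication are more involved.

```python
import bisect
import hashlib

class Ring:
    """Toy token ring: a key belongs to the first node token >= hash(key)."""

    def __init__(self, nodes):
        # One token per node, kept sorted so lookups can use binary search.
        self.tokens = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def owner(self, key: str) -> str:
        token = self._hash(key)
        idx = bisect.bisect_left(self.tokens, (token, ""))
        # Past the last token, wrap around to the first node on the ring.
        return self.tokens[idx % len(self.tokens)][1]

keys = [f"record-{i}" for i in range(100)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}

# Adding a node only moves the keys that the new node takes over.
bigger = Ring(["node-a", "node-b", "node-c", "node-d"])
moved = [k for k in keys if bigger.owner(k) != before[k]]
```

Because no node is special, any node can compute the owner of a key, which is what makes the ring easy to grow.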
8. …Almost
● Useful searching was missing
○ Secondary indexes not flexible
○ No free text searches
○ No (reasonable) range queries
9. What We Needed
● Pros: Full control over indexing
● Cons: Not scalable
10. Putting It All Together
● Reasons to go with DSE
○ Combines Cassandra and Solr
○ Constant refinements and integrations
○ Support
11. Our Stack
[Diagram: Web Crawling, Normalization, three Cassandra + Solr nodes, Load Balancing]
12. Cassandra / Solr Setup
● 3 column families / 3 cores
○ Locations
○ Products
○ People
● 73,114,909 records
13. Data: Locations
● 29,818,644 records
● Interesting data
○ Reviews
○ Revenue
○ Contact information
● Businesses vs. Locations
○ Unique key
○ Location-specific user data
14. Data: Products
● 18,470,005 records
● Interesting data
○ Categories
○ Price
○ Reviews
● Challenges
○ Too many unique keys
15. Data: People
● 24,826,260 records
● Interesting data
○ Work History
○ Education History
○ Location
● Challenges
○ Normalization
○ Identification
16. Challenges
● Memory
● Speed
● Space
● Representation
17. Challenges: Memory
● Multi-minute garbage collection
● Exponential increase in frequency
● Virtual memory confusion
● Solr + Cassandra
● Heap Size vs Buffer Cache
● Bash Scripts
18. Challenges: Speed
● Upgrade
○ Better memory management
○ Smaller index size
● Reduce index size
● Future: Solaris
19. Challenges: Speed
● Providing a real-time service
● Issues
○ Solr not inherently real time
○ Search speeds
○ I/O
20. Challenges: Speed
● Solr Solution: DSE integration leverages
○ Cassandra's speed
○ Cassandra's caches
○ Cassandra's distribution
○ Solr caches less useful
21. Challenges: Speed
● Search complexity solution
○ Text vs String indexing
○ Uniqueness vs Flexibility
○ Leveraging Cassandra
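The "Text vs String indexing" trade-off can be sketched with two toy inverted indexes: a string-style index stores the whole field value as a single term (exact match only, fewer terms), while a text-style index tokenizes the value (flexible search, more terms). The function names and sample documents are illustrative assumptions; this does not reproduce Solr's analyzers.

```python
from collections import defaultdict

def index_as_string(docs):
    """Exact-match index: the whole field value is one term."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        index[value].add(doc_id)
    return index

def index_as_text(docs):
    """Tokenized index: every lowercased word becomes its own term."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for word in value.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical business names keyed by document id.
docs = {1: "Austin Pizza Company", 2: "Pizza Palace", 3: "Austin Books"}
string_index = index_as_string(docs)
text_index = index_as_text(docs)

# The text index finds both pizza places by a single word; the string
# index only matches the complete value.
pizza_hits = text_index["pizza"]
exact_hits = string_index["Pizza Palace"]
```

The tokenized index holds more terms for the same data, which is one reason to reserve it for fields that really need free-text search.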
22. Challenges: Speed
● I/O Solution
○ Cassandra's built in mapping
○ Increase disk access speeds (SSDs)
■ Not cost effective
○ Future: Solaris
23. Challenges: Space
● Field corruption
○ Caused by improper encoding
○ Exponential growth
○ Fills up Solr index
● Locate, inspect & remove corrupt records
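One plausible mechanism for encoding-driven exponential growth (an assumption; the slides do not name the exact cause) is a field that is repeatedly written as UTF-8 and read back as Latin-1: every non-ASCII character doubles in length on each pass. The sketch below simulates that and applies a hypothetical heuristic to locate suspect records.

```python
def mis_encode(text: str) -> str:
    """Simulate one bad round trip: UTF-8 bytes misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")

def looks_corrupt(value: str, max_len: int = 64) -> bool:
    """Hypothetical heuristic: oversized field or a typical mojibake marker.

    The marker check can flag legitimate text containing U+00C3, so flagged
    records should be inspected before removal.
    """
    return len(value) > max_len or "\u00c3" in value

# Each pass doubles the length of every non-ASCII character.
value = "Caf\u00e9"
lengths = [len(value)]
for _ in range(3):
    value = mis_encode(value)
    lengths.append(len(value))

# Locate and drop suspect records from a hypothetical batch.
records = {"a": "Cafe Luna", "b": mis_encode("Caf\u00e9")}
clean = {k: v for k, v in records.items() if not looks_corrupt(v)}
```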
24. Challenges: Space
● Solr index issue
○ No compression (vs Cassandra)
○ Must adjust indexing
● Key things to keep in mind
○ Size of fields
○ Scale vs Flexibility
○ Index as little as possible
25. Challenges: Representation
● Cassandra is flat
● Actual data is not flat
○ Reviews
○ Price information
● Many different output formats
○ CSV, JSON, XML, etc.
26. Challenges: Representation
● Solution: Flatten when possible
○ E.g. Address object -> Separate fields
● Internal subgroup representation
○ Composite keys (Occasionally)
■ Known subgroups
■ Non multiple subgroups
○ Dynamic fields
■ Composite field + Dynamic tag
■ E.g. review.text_<tag>
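The flattening approach can be sketched as follows. A single nested object (such as an address) becomes one field per member, and a repeating subgroup (such as reviews) becomes composite field names with a dynamic tag, in the spirit of review.text_<tag>. Using the list position as the tag and "." as the separator are assumptions for illustration, not Datafiniti's confirmed scheme.

```python
def flatten(record: dict) -> dict:
    """Flatten one level of nesting into column-style field names."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Known, non-repeating subgroup: one field per member.
            for sub, v in value.items():
                flat[f"{key}.{sub}"] = v
        elif isinstance(value, list):
            # Repeating subgroup: composite field name + dynamic tag.
            for tag, item in enumerate(value):
                for sub, v in item.items():
                    flat[f"{key}.{sub}_{tag}"] = v
        else:
            flat[key] = value
    return flat

# Hypothetical location record with an address object and two reviews.
location = {
    "name": "Pizza Palace",
    "address": {"street": "12 Main St", "city": "Austin"},
    "review": [
        {"text": "Great", "rating": 5},
        {"text": "Slow", "rating": 2},
    ],
}
flat = flatten(location)
```

The resulting keys (for example address.city and review.text_1) map naturally onto flat columns and onto dynamic-field name patterns.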
27. Challenges: Representation
● Robust and adaptable conversion package
● JSON -> Internal
○ Solr returns JSON
● Internal -> CSV, JSON, XML
○ User defined views
○ Specify field groupings
○ Specify partitioning
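A conversion layer of this kind might look roughly like the sketch below: parse JSON into an internal list of dicts, apply a user-defined view (here just a field selection), then render CSV, JSON, or XML. All function names, the XML layout, and the sample payload are hypothetical; Solr's real response envelope and the partitioning option are not modeled.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def apply_view(rows, fields):
    """User-defined view: keep only the requested fields, in order."""
    return [{f: row.get(f, "") for f in fields} for row in rows]

def to_json(rows):
    return json.dumps(rows)

def to_csv(rows, fields):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_xml(rows):
    root = ET.Element("records")
    for row in rows:
        rec = ET.SubElement(root, "record")
        for name, value in row.items():
            # Field names go into an attribute so any name is safe.
            ET.SubElement(rec, "field", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")

# JSON -> internal representation (a plain list of dicts in this sketch).
payload = '[{"name": "Pizza Palace", "address.city": "Austin", "phone": "555-0100"}]'
internal = json.loads(payload)

# Internal -> output formats, restricted to a two-field view.
fields = ["name", "address.city"]
view = apply_view(internal, fields)
as_csv = to_csv(view, fields)
as_xml = to_xml(view)
as_json = to_json(view)
```

Keeping one internal representation means each new output format needs only one new renderer.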
29. Future Work
● Memory Usage
● Speed
● Space
● Containers
30. Future Work: Memory
● Java 7 G1 (Garbage First) Collector
○ Ideal for large heaps
○ Big Data Sets
○ Bursty Workloads
31. Future Work: Speed
● Solaris Kernel Scheduler > Linux Kernel Scheduler
○ (At large number of cores)
● Drastically increase IOPS
○ Cache reads (L2ARC) on PCIe SSD (~800 MB/s)
○ Cache writes (ZIL) on PCIe SSD (~800 MB/s)
○ Reduce needed size of SSD
■ More smaller SSDs in ZFS pool
○ Fewer moving parts
32. Future Work: Space
● Caching at PCIe, Storing on SATA III
○ Cheaper larger storage via ZFS pools
○ Easier to grow
● ZFS Compression (LZ4)
○ Replaces Cassandra's Snappy compression
○ Very fast lossless compression (400 MB/s per core)
○ Scales to multiple CPUs
○ Hits the RAM speed limit
33. Future Work: Containers
● OS Level virtualization
○ Resource control
○ Boundary separation
● More control over Cassandra resources
● Better snapshots (whole machine)
● Hardware abstracted out
○ Many disks represented as single space
○ Easily add or remove hardware