The Internet in a Database
A Cassandra Use Case
Data on the Web
DATAFINITI • THE SEARCH ENGINE FOR DATA • WWW.DATAFINITI.NET
● 48 billion pages on the Internet
● 56 million GB of data
● Incredibly powerful connections
● 70% of useful data is unstructured
● User-generated data + facts
Too Much Data…
Search
● Modern search engines
○ Unstructured data
○ Unconnected data
○ Unnormalized data
● Goals
○ Collect vast amounts of data through web crawling
○ Normalize and deduplicate data
○ Make it searchable and meaningful
Needs
● Speed
● Scale
● Adaptable
The Solution
● Very fast
○ Log-structured storage
● Easily scalable
○ Decentralized rings
● Completely adaptable
○ Schema-less key/value store
…Almost
● Useful searching was missing
○ Secondary indexes not flexible
○ No free text searches
○ No (reasonable) range queries
What We Needed
● Pros: Full control over indexing
● Cons: Not scalable
Putting It All Together
● Reasons to go with DSE
○ Combines Cassandra and Solr
○ Constant refinements and integrations
○ Support
Our Stack
[Diagram: Web Crawling, Normalization, three Cassandra + Solr nodes, Load Balancing]
Cassandra / Solr Setup
● 3 column families / 3 cores
○ Locations
○ Products
○ People
● 73,114,909 records
Data: Locations
● 29,818,644 records
● Interesting data
○ Reviews
○ Revenue
○ Contact information
● Businesses vs. Locations
○ Unique key
○ Location specific user data
Data: Products
● 18,470,005 records
● Interesting data
○ Categories
○ Price
○ Reviews
● Challenges
○ Too many unique keys
Data: People
● 24,826,260 records
● Interesting data
○ Work History
○ Education History
○ Location
● Challenges
○ Normalization
○ Identification
Challenges
● Memory
● Speed
● Space
● Representation
Challenges: Memory
● Multi-minute garbage collection
● Exponential increase in frequency
● Virtual memory confusion
● Solr + Cassandra
● Heap Size vs Buffer Cache
● Bash Scripts
Challenges: Speed
● Upgrade
○ Better memory management
○ Smaller index size
● Reduce index size
● Future: Solaris
Challenges: Speed
● Providing a real-time service
● Issues
○ Solr not inherently real time
○ Search speeds
○ I/O
Challenges: Speed
● Solr Solution: DSE integration leverages
○ Cassandra's speed
○ Cassandra's caches
○ Cassandra's distribution
○ Solr caches less useful
Challenges: Speed
● Search complexity solution
○ Text vs String indexing
○ Uniqueness vs Flexibility
○ Leveraging Cassandra
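The "Text vs String indexing" trade-off can be sketched without Solr. This is a minimal illustration, not Solr's actual analyzer chain: it assumes a string-typed field is indexed as one exact term and a text-typed field is split into lowercase word terms. The function names and sample records are hypothetical.

```python
import re
from collections import defaultdict

def index_as_string(docs, field):
    """Exact-match index: the whole field value is a single term."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        index[doc[field]].add(doc_id)
    return index

def index_as_text(docs, field):
    """Tokenized index: lowercase word terms, so free-text lookups work."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in re.findall(r"\w+", doc[field].lower()):
            index[term].add(doc_id)
    return index

# Hypothetical sample records, for illustration only.
docs = {
    "loc-1": {"name": "Joe's Pizza"},
    "loc-2": {"name": "Pizza Palace"},
}
string_index = index_as_string(docs, "name")
text_index = index_as_text(docs, "name")

exact_hits = string_index.get("Joe's Pizza", set())  # full value matches
partial_hits = string_index.get("pizza", set())      # a single word does not
text_hits = text_index.get("pizza", set())           # tokenized field matches both
```

The string index stays small and suits unique values; the text index holds more terms per record, which is the flexibility-for-size cost the slide points at.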
Challenges: Speed
● I/O Solution
○ Cassandra's built-in mapping
○ Increase disk access speeds (SSDs)
■ Not cost-effective
○ Future: Solaris
Challenges: Space
● Field corruption
○ Caused by improper encoding
○ Exponential growth
○ Fills up Solr index
● Locate, inspect & remove corrupt records
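The slides do not say which encoding bug caused the corruption. As one plausible mechanism, the sketch below assumes UTF-8 bytes were repeatedly misread as Latin-1, which makes the non-ASCII part of a field double in size on every pass. The helper names, the size cap, and the debris heuristic are all hypothetical.

```python
def misencode_once(text):
    """Simulate one round of the assumed bug: UTF-8 bytes misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")

def looks_corrupt(value, max_len=10_000):
    """Heuristic only: oversized field, or typical mis-decoding debris."""
    return len(value) > max_len or "Ã" in value or "\ufffd" in value

def split_corrupt(records, max_len=10_000):
    """Partition records into (clean, corrupt) so corrupt ones can be inspected."""
    clean, corrupt = {}, {}
    for key, fields in records.items():
        bad = any(looks_corrupt(v, max_len) for v in fields.values())
        (corrupt if bad else clean)[key] = fields
    return clean, corrupt

# Track the encoded size of one field across repeated mis-encoding rounds.
name = "Café"
sizes = []
for _ in range(5):
    sizes.append(len(name.encode("utf-8")))
    name = misencode_once(name)

# Hypothetical records: one clean, one that went through the bug once.
records = {
    "a": {"name": "Café"},
    "b": {"name": misencode_once("Café")},
}
clean, corrupt = split_corrupt(records)
```

The heuristic will also flag legitimate text containing "Ã", so flagged records are meant for inspection before removal, not automatic deletion.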
Challenges: Space
● Solr index issue
○ No compression (vs Cassandra)
○ Must adjust indexing
● Key things to keep in mind
○ Size of fields
○ Scale vs Flexibility
○ Index as little as possible
Challenges: Representation
● Cassandra is flat
● Actual data is not flat
○ Reviews
○ Price information
● Many different output formats
○ CSV, JSON, XML, etc.
Challenges: Representation
● Solution: Flatten when possible
○ E.g. Address object -> Separate fields
● Internal subgroup representation
○ Composite keys (Occasionally)
■ Known subgroups
■ Non multiple subgroups
○ Dynamic fields
■ Composite field + Dynamic tag
■ E.g. review.text_<tag>
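The flattening approach can be sketched as follows. This is an illustration under assumptions, not Datafiniti's implementation: the slides do not define `<tag>`, so the sketch uses the item's position in the list, and the record layout and `flatten_record` name are hypothetical.

```python
def flatten_record(record):
    """Flatten one nested record into a single-level dict of column -> value."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Known, non-repeating subgroup: address -> address.city, ...
            for sub_key, sub_value in value.items():
                flat[f"{key}.{sub_key}"] = sub_value
        elif isinstance(value, list):
            # Repeating subgroup (assumed to be a list of dicts):
            # one dynamic field per item, tagged here by position.
            for tag, item in enumerate(value):
                for sub_key, sub_value in item.items():
                    flat[f"{key}.{sub_key}_{tag}"] = sub_value
        else:
            flat[key] = value
    return flat

# Hypothetical location record, for illustration only.
location = {
    "name": "Joe's Pizza",
    "address": {"street": "1 Main St", "city": "Austin"},
    "review": [
        {"text": "Great slice", "rating": 5},
        {"text": "Slow service", "rating": 2},
    ],
}
flat = flatten_record(location)
```

Here `flat` holds columns such as `address.city` and `review.text_1`, each with a scalar value, which fits a flat column family.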
Challenges: Representation
● Robust and adaptable conversion package
● JSON -> Internal
○ Solr returns JSON
● Internal -> CSV, JSON, XML
○ User defined views
○ Specify field groupings
○ Specify partitioning
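A conversion layer with user-defined views might look roughly like this. It is a sketch using only the Python standard library; the internal format is assumed to be a list of flat dicts, a "view" is assumed to be an ordered list of field names, and every function name and the XML layout are hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def apply_view(record, view):
    """Keep only the fields a user-defined view asks for, in view order."""
    return {field: record.get(field, "") for field in view}

def to_json(records, view):
    return json.dumps([apply_view(r, view) for r in records])

def to_csv(records, view):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=view, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(apply_view(record, view))
    return buffer.getvalue()

def to_xml(records, view):
    root = ET.Element("records")
    for record in records:
        node = ET.SubElement(root, "record")
        for field, value in apply_view(record, view).items():
            ET.SubElement(node, "field", name=field).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical internal record and view, for illustration only.
records = [{"name": "Joe's Pizza", "city": "Austin", "revenue": 120000}]
view = ["name", "city"]
csv_out = to_csv(records, view)
json_out = to_json(records, view)
xml_out = to_xml(records, view)
```

Routing every output format through the same `apply_view` step keeps field selection in one place, so adding a format does not change how views behave.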
Future Work
● Memory Usage
● Speed
● Space
● Containers
Future Work: Memory
● Java 7 G1 (Garbage First) Collector
○ Ideal for large heaps
○ Big Data Sets
○ Bursty Workloads
Future Work: Speed
● Solaris Kernel Scheduler > Linux Kernel Scheduler
○ (At large number of cores)
● Drastically increase IOPS
○ Cache reads (L2ARC) on PCIe SSD (~800 MB/s)
○ Cache writes (ZIL) on PCIe SSD (~800 MB/s)
○ Reduce needed size of SSD
■ More, smaller SSDs in the ZFS pool
○ Fewer moving parts
Future Work: Space
● Caching at PCIe, Storing on SATA III
○ Cheaper larger storage via ZFS pools
○ Easier to grow
● ZFS Compression (LZ4)
○ Replaces Cassandra's Snappy compression
○ Very fast lossless compression (~400 MB/s per core)
○ Scales to multiple CPUs
○ Hits the RAM speed limit
Future Work: Containers
● OS Level virtualization
○ Resource control
○ Boundary separation
● More control over Cassandra resources
● Better snapshots (whole machine)
● Hardware abstracted out
○ Many disks represented as single space
○ Easily add or remove hardware
Questions?
https://www.datafiniti.net
http://blog.datafiniti.net
@datafiniti
Addendum 1
ZFS Comparison
Name              Ratio   Compression (MB/s)   Decompression (MB/s)
LZ4 (r97)         2.084   410                  1810
LZO 2.06          2.106   409                  600
QuickLZ 1.5.1b6   2.237   373                  420
Snappy 1.1.0      2.091   323                  1070
LZF               2.077   270                  570
zlib 1.2.8 -1     2.730   65                   280
LZ4 HC (r97)      2.720   25                   2040
zlib 1.2.8 -6     3.099   21                   300
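To make the comparison concrete, the figures above can be turned into rough size and time estimates. This is back-of-the-envelope arithmetic only: it assumes the speeds are single-core throughput measured on uncompressed data, and the 10,000 MB input and the `estimate` helper are hypothetical.

```python
# (ratio, compression MB/s, decompression MB/s) copied from the comparison table
CODECS = {
    "LZ4 (r97)": (2.084, 410, 1810),
    "Snappy 1.1.0": (2.091, 323, 1070),
    "zlib 1.2.8 -6": (3.099, 21, 300),
}

def estimate(raw_mb, codec):
    """Estimate stored size and single-core times for raw_mb of input."""
    ratio, comp_speed, decomp_speed = CODECS[codec]
    return {
        "stored_mb": raw_mb / ratio,
        "compress_s": raw_mb / comp_speed,
        "decompress_s": raw_mb / decomp_speed,
    }

for name in CODECS:
    result = estimate(10_000, name)
    print(f"{name:14s} {result['stored_mb']:8.0f} MB  "
          f"{result['compress_s']:6.1f} s  {result['decompress_s']:5.1f} s")
```

On these numbers LZ4 and Snappy store nearly the same amount, while LZ4 compresses and decompresses faster; zlib -6 stores less but takes far longer to compress.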

Datafiniti: The Internet in a Database - Cassandra Use Case

  • 1. The Internet in a Database A Cassandra Use Case
  • 2. Data on the Web DATAFINITI • THE SEARCH ENGINE FOR DATA • WWW.DATAFINITI.NET ● 48 billion pages on the Internet ● 56 million GB of data ● Incredibly powerful connections ● 70% of useful data is unstructured ● User generated data + facts
  • 3. DATAFINITI • THE SEARCH ENGINE FOR DATA • WWW.DATAFINITI.NET Too Much Data…
  • 4. DATAFINITI • THE SEARCH ENGINE FOR DATA • WWW.DATAFINITI.NET ● Modern search engines ○ Unstructured data ○ Unconnected data ○ Unnormalized data Search
  • 5. DATAFINITI • THE SEARCH ENGINE FOR DATA • WWW.DATAFINITI.NET ● Goals ○ Collect vast amounts of data through web crawling ○ Normalize and deduplicate data ○ Make it searchable and meaningful
  • 6. Needs ● Speed ● Scale ● Adaptable
  • 7. The Solution ● Very fast ○ Log-structured storage ● Easily scalable ○ Decentralized rings ● Completely adaptable ○ Schema-less key/value store
  • 8. …Almost ● Useful searching was missing ○ Secondary indexes not flexible ○ No free text searches ○ No (reasonable) range queries
  • 9. What We Needed ● Pros: Full control over indexing ● Cons: Not scalable
  • 10. Putting It All Together ● Reasons to go with DSE ○ Combines Cassandra and Solr ○ Constant refinements and integrations ○ Support
  • 11. Our Stack (diagram labels: Web Crawling, Normalization, three Cassandra + Solr nodes, Load Balancing)
  • 12. Cassandra / Solr Setup ● 3 column families / 3 cores ○ Locations ○ Products ○ People ● 73,114,909 records
  • 13. Data: Locations ● 29,818,644 records ● Interesting data ○ Reviews ○ Revenue ○ Contact information ● Businesses vs. Locations ○ Unique key ○ Location-specific user data
  • 14. Data: Products ● 18,470,005 records ● Interesting data ○ Categories ○ Price ○ Reviews ● Challenges ○ Too many unique keys
  • 15. Data: People ● 24,826,260 records ● Interesting data ○ Work History ○ Education History ○ Location ● Challenges ○ Normalization ○ Identification
  • 16. Challenges ● Memory ● Speed ● Space ● Representation
  • 17. Challenges: Memory ● Multi-minute garbage collection ● Exponential increase in frequency ● Virtual memory confusion ● Solr + Cassandra ● Heap Size vs Buffer Cache ● Bash scripts
  • 18. Challenges: Speed ● Upgrade ○ Better memory management ○ Smaller index size ● Reduce index size ● Future: Solaris
  • 19. Challenges: Speed ● Providing a real-time service ● Issues ○ Solr not inherently real-time ○ Search speeds ○ I/O
  • 20. Challenges: Speed ● Solr Solution: DSE integration leverages ○ Cassandra's speed ○ Cassandra's caches ○ Cassandra's distribution ○ Solr caches less useful
  • 21. Challenges: Speed ● Search complexity solution ○ Text vs String indexing ○ Uniqueness vs Flexibility ○ Leveraging Cassandra
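The "Text vs String indexing" trade-off can be illustrated without Solr. The sketch below is a plain-Python toy, not Solr's implementation: a string-style index stores the whole field value as a single term (exact matches only, one term per value), while a text-style index tokenizes the value so individual words are searchable at the cost of more terms. The function names and sample documents are made up for illustration.

```python
import re
from collections import defaultdict


def build_string_index(docs: dict, field: str) -> dict:
    """Exact-match index: the whole field value is one term."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        index[doc[field]].add(doc_id)
    return index


def build_text_index(docs: dict, field: str) -> dict:
    """Tokenized index: every lowercased word becomes its own term."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in re.findall(r"\w+", doc[field].lower()):
            index[token].add(doc_id)
    return index


docs = {
    1: {"name": "Joe's Pizza"},
    2: {"name": "Pizza Palace"},
}
string_index = build_string_index(docs, "name")
text_index = build_text_index(docs, "name")

# A word search only succeeds against the tokenized index.
word_hits_text = text_index.get("pizza", set())
word_hits_string = string_index.get("pizza", set())
```

Here the string index holds two terms and the text index four, which is the uniqueness versus flexibility tension the slide names: exact values are cheap to index and good as keys, tokenized values are searchable but enlarge the index.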
  • 22. Challenges: Speed ● I/O Solution ○ Cassandra's built-in mapping ○ Increase disk access speeds (SSDs) ■ Not cost effective ○ Future: Solaris
  • 23. Challenges: Space ● Field corruption ○ Caused by improper encoding ○ Exponential growth ○ Fills up Solr index ● Locate, inspect & remove corrupt records
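The slides do not say which encoding fault caused the corruption. One well-known way improper encoding produces exponential growth is repeated double-encoding, where UTF-8 bytes are misread as Latin-1 and then re-encoded on every pass. The sketch below simulates that mechanism (an assumption, not the confirmed cause) and shows a simple heuristic for locating suspicious records; `double_encode`, `looks_corrupt`, and the size threshold are hypothetical.

```python
def double_encode(text: str) -> str:
    """Simulate one bad pass: UTF-8 bytes misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def looks_corrupt(value: str, max_bytes: int = 1024) -> bool:
    """Heuristic: oversized field or typical mojibake markers.

    Note: a legitimate uppercase A-tilde would also be flagged, so
    flagged records should be inspected before removal.
    """
    size = len(value.encode("utf-8"))
    return size > max_bytes or "\u00c3" in value or "\ufffd" in value


# Each pass doubles the number of non-ASCII characters in the field.
value = "Caf\u00e9"
sizes = []
for _ in range(4):
    sizes.append(len(value.encode("utf-8")))
    value = double_encode(value)

# Locate records with at least one suspicious field.
records = {
    "a": {"name": "Caf\u00e9"},
    "b": {"name": double_encode("Caf\u00e9")},
}
corrupt_keys = [
    key
    for key, record in records.items()
    if any(looks_corrupt(field) for field in record.values())
]
```

The byte sizes go 5, 7, 11, 19: the three ASCII bytes stay fixed while the single accented character doubles on every pass, which is how one bad field can end up filling an index.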
  • 24. Challenges: Space ● Solr index issue ○ No compression (vs Cassandra) ○ Must adjust indexing ● Key things to keep in mind ○ Size of fields ○ Scale vs Flexibility ○ Index as little as possible
  • 25. Challenges: Representation ● Cassandra is flat ● Actual data is not flat ○ Reviews ○ Price information ● Many different output formats ○ CSV, JSON, XML, etc.
  • 26. Challenges: Representation ● Solution: Flatten when possible ○ E.g. Address object -> Separate fields ● Internal subgroup representation ○ Composite keys (Occasionally) ■ Known subgroups ■ Non-multiple subgroups ○ Dynamic fields ■ Composite field + Dynamic tag ■ E.g. review.text_<tag>
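A minimal sketch of the flattening idea, assuming nested records arrive as Python dictionaries. A single nested object (such as an address) becomes separate dotted fields, and a repeating subgroup (such as reviews) becomes dynamic fields named `<group>.<field>_<tag>`. Using the list position as the tag and a dot as the separator are assumptions made for illustration; the slides only give `review.text_<tag>` as the naming pattern.

```python
def flatten(record: dict) -> dict:
    """Flatten one level of nesting into a flat field -> value mapping."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Known, non-repeating subgroup: address -> address.city, ...
            for sub_key, sub_value in value.items():
                flat[f"{key}.{sub_key}"] = sub_value
        elif isinstance(value, list):
            # Repeating subgroup: one dynamic tag per item (here, its index).
            for tag, item in enumerate(value):
                for sub_key, sub_value in item.items():
                    flat[f"{key}.{sub_key}_{tag}"] = sub_value
        else:
            flat[key] = value
    return flat


# Made-up sample record.
record = {
    "name": "Joe's Pizza",
    "address": {"city": "Austin", "state": "TX"},
    "review": [
        {"text": "Great", "rating": 5},
        {"text": "Slow", "rating": 2},
    ],
}
flat = flatten(record)
```

The resulting keys (`address.city`, `review.text_0`, `review.rating_1`, and so on) fit a flat column layout while keeping enough structure in the names to regroup the fields later.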
  • 27. Challenges: Representation ● Robust and adaptable conversion package ● JSON -> Internal ○ Solr returns JSON ● Internal -> CSV, JSON, XML ○ User defined views ○ Specify field groupings ○ Specify partitioning
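The conversion package itself is not shown in the slides. As a rough sketch under stated assumptions, the "Internal -> CSV, JSON, XML" step with a user-defined view could be modeled as a list of field names applied to a flat record, using only the Python standard library. The function names, the `<record>`/`<field>` XML layout, and the sample data are all hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def to_json(record: dict, view: list) -> str:
    """Serialize only the fields named in the view."""
    return json.dumps({field: record.get(field) for field in view})


def to_csv(records: list, view: list) -> str:
    """Write a header plus one row per record, ignoring fields outside the view."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=view, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_xml(record: dict, view: list) -> str:
    """Emit <record><field name="...">value</field>...</record>."""
    root = ET.Element("record")
    for field in view:
        ET.SubElement(root, "field", name=field).text = str(record.get(field, ""))
    return ET.tostring(root, encoding="unicode")


# Made-up flat record and a view that selects two of its three fields.
record = {"name": "Joe's Pizza", "address.city": "Austin", "review.text_0": "Great"}
view = ["name", "address.city"]

as_json = to_json(record, view)
as_csv = to_csv([record], view)
as_xml = to_xml(record, view)
```

Field names are carried in a `name` attribute rather than used as XML tag names so that any flat key can be emitted without worrying about XML naming rules.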
  • 29. Future Work ● Memory Usage ● Speed ● Space ● Containers
  • 30. Future Work: Memory ● Java 7 G1 (Garbage First) Collector ○ Ideal for large heaps ○ Big Data Sets ○ Bursty Workloads
  • 31. Future Work: Speed ● Solaris Kernel Scheduler > Linux Kernel Scheduler ○ (At large number of cores) ● Drastically increase IOPS ○ Cache reads (L2ARC) on PCIe SSD (~800 MB/s) ○ Cache writes (ZIL) on PCIe SSD (~800 MB/s) ○ Reduce needed size of SSD ■ More smaller SSDs in ZFS pool ○ Fewer moving parts
  • 32. Future Work: Space ● Caching at PCIe, Storing on SATA III ○ Cheaper larger storage via ZFS pools ○ Easier to grow ● ZFS Compression (LZ4) ○ Replaces Cassandra's Snappy compression ○ Very fast lossless compression (400 MB/s per core) ○ Scales to multiple CPUs ○ Hits the RAM speed limit
  • 33. Future Work: Containers ● OS Level virtualization ○ Resource control ○ Boundary separation ● More control over Cassandra resources ● Better snapshots (whole machine) ● Hardware abstracted out ○ Many disks represented as single space ○ Easily add or remove hardware
  • 35. Addendum 1: ZFS Comparison
      Name              Ratio   Compression (MB/s)   Decompression (MB/s)
      LZ4 (r97)         2.084   410                  1810
      LZO 2.06          2.106   409                  600
      QuickLZ 1.5.1b6   2.237   373                  420
      Snappy 1.1.0      2.091   323                  1070
      LZF               2.077   270                  570
      zlib 1.2.8 -1     2.730   65                   280
      LZ4 HC (r97)      2.720   25                   2040
      zlib 1.2.8 -6     3.099   21                   300
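The Ratio column is the uncompressed size divided by the compressed size. The sketch below shows how that number is computed, using only zlib because it ships with Python (LZ4 and Snappy would need third-party packages). The sample payload is made up, so the ratios it produces are not comparable to the benchmark figures in the table.

```python
import zlib


def compression_ratio(data: bytes, level: int) -> float:
    """Ratio = original size / compressed size (higher means smaller output)."""
    return len(data) / len(zlib.compress(data, level))


# Highly repetitive made-up payload, so it compresses far better than real data.
sample = b"name=Joe's Pizza;city=Austin;review=Great pizza, slow service;" * 2000

ratio_fast = compression_ratio(sample, 1)     # corresponds to "zlib -1"
ratio_default = compression_ratio(sample, 6)  # corresponds to "zlib -6"
```

Levels 1 and 6 match the two zlib rows in the table, which show the usual trade-off: the higher level gives a better ratio at a much lower compression speed.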