SlideShare une entreprise Scribd logo
1  sur  28
Télécharger pour lire hors ligne
Real Time Analytics with Cassandra, Hive, and Solr
Real Time Analytics with Cassandra, Hive, and Solr
Aaron Stannard, Founder & CEO of MarkedUp
Powerful analytics tools for native apps
Understand your
audience.
Gain valuable data on
your users.
Monitor your
app’s health.
Log errors and crashes
remotely.
Drive
more sales.
Better data = more
revenue.
Do we really need real-time
analytics?
Real time analytics isn’t inherently
superior or necessary.
Building your own real-time
analytics service with Cassandra
and DataStax Enterprise
Cassandra Setup on EC2
Write Strategy
Read Strategy
Analytics Schema Strategy
•  All	
  row	
  keys	
  should	
  be	
  
predictable	
  (not	
  always	
  possible)	
  
•  U8lize	
  physical	
  sortability	
  of	
  
columns	
  
•  Use	
  predictably	
  sortable	
  data	
  
types	
  for	
  column	
  names	
  
(integers,	
  dates)	
  
	
  
•  Learn	
  to	
  love	
  composite	
  keys	
  
•  Batch	
  muta8ons	
  are	
  your	
  friend	
  
•  Use	
  distributed	
  counters	
  for	
  real-­‐
8me	
  metrics	
  
•  Use	
  TTL	
  for	
  automa8on	
  data	
  
expira8on	
  (if	
  necessary)	
  
	
  
Time Series Schema 0: All Knowns
Time Series Schema 1: Bounded Number of Unknowns
Time Series Schema 2: Unbounded Number of Unknowns
Schema Tips
Adding Hive and Hadoop to the Mix
Mo’ data, mo’ problems
When is Hadoop necessary?
•  Large volumes of data (100GB+)
•  Queries require retrospective / historical analysis
•  Need consistent results
•  Need to perform multi-stage analysis
•  Speed isn’t a concern (Hadoop is sloooooooooow)
Hadoop on easy mode: Hive
•  SQL abstraction on top of Hadoop (more familiar)
•  Easier to deploy and test
•  Simplifies data warehousing
•  Easy to automatically import data from Cassandra
•  DSE eliminates need for HDFS
C* to Hive
Hive Syntax
Query: count the number items where “key” is greater than
100
RDBMS> select key, count(1) from kv1
where key > 100 group by key;
Hive> select key, count(1) from kv1
where key > 100 group by key;
Hive Tips and Tricks
•  Don’t write data from Hive back to a hot Cassandra column family
•  If writing data from Hive to Cassandra, use dedicated column
families
•  You can write to multiple places on a single Hive read (table, CSV
file, etc…)
•  Use sampling to test Hive queries on scaled-down data sets
How do you count millions of
distinct items in real-time?
•  Solr:	
  Lucene-­‐based	
  indexing	
  engine	
  
•  Part	
  of	
  Apache	
  Founda8on	
  
•  Full-­‐text	
  search	
  
•  Faceted	
  search	
  
•  Distributed	
  
•  Integrates	
  well	
  with	
  Cassandra	
  
Solr Index Setup
Solr Search
Questions or Comments?
aaron@markedup.com	
  	
  
hMps://markedup.com/	
  	
  

Contenu connexe

Tendances

Tendances (20)

963
963963
963
 
Impala turbocharge your big data access
Impala   turbocharge your big data accessImpala   turbocharge your big data access
Impala turbocharge your big data access
 
Open source big data landscape and possible ITS applications
Open source big data landscape and possible ITS applicationsOpen source big data landscape and possible ITS applications
Open source big data landscape and possible ITS applications
 
Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
 
AWS re:Invent 2016: How Amazon S3 Storage Management Helps Optimize Storage a...
AWS re:Invent 2016: How Amazon S3 Storage Management Helps Optimize Storage a...AWS re:Invent 2016: How Amazon S3 Storage Management Helps Optimize Storage a...
AWS re:Invent 2016: How Amazon S3 Storage Management Helps Optimize Storage a...
 
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and Elasticsearch
 
AWS Webcast - Amazon Kinesis and Apache Storm
AWS Webcast - Amazon Kinesis and Apache StormAWS Webcast - Amazon Kinesis and Apache Storm
AWS Webcast - Amazon Kinesis and Apache Storm
 
(BDT313) Amazon DynamoDB For Big Data
(BDT313) Amazon DynamoDB For Big Data(BDT313) Amazon DynamoDB For Big Data
(BDT313) Amazon DynamoDB For Big Data
 
Introducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom ConnectorsIntroducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom Connectors
 
(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Cloud native data platform
Cloud native data platformCloud native data platform
Cloud native data platform
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
 
(ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014
(ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014(ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014
(ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014
 
Big Data A La Carte Menu
Big Data A La Carte MenuBig Data A La Carte Menu
Big Data A La Carte Menu
 
ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time Series
 

Similaire à C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard

First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
Tomas Cervenka
 
Building Big Data Solutions with Azure Data Lake.10.11.17.pptx
Building Big Data Solutions with Azure Data Lake.10.11.17.pptxBuilding Big Data Solutions with Azure Data Lake.10.11.17.pptx
Building Big Data Solutions with Azure Data Lake.10.11.17.pptx
thando80
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight
Tillmann Eitelberg
 

Similaire à C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard (20)

Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
AWS Webcast - Build high-scale applications with Amazon DynamoDB
AWS Webcast - Build high-scale applications with Amazon DynamoDBAWS Webcast - Build high-scale applications with Amazon DynamoDB
AWS Webcast - Build high-scale applications with Amazon DynamoDB
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 
Avoiding big data antipatterns
Avoiding big data antipatternsAvoiding big data antipatterns
Avoiding big data antipatterns
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airship
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Big data solutions in Azure
Big data solutions in AzureBig data solutions in Azure
Big data solutions in Azure
 
Building Big Data Solutions with Azure Data Lake.10.11.17.pptx
Building Big Data Solutions with Azure Data Lake.10.11.17.pptxBuilding Big Data Solutions with Azure Data Lake.10.11.17.pptx
Building Big Data Solutions with Azure Data Lake.10.11.17.pptx
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series Database
 
Apache HBase Workshop
Apache HBase WorkshopApache HBase Workshop
Apache HBase Workshop
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in Azure
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight
 

Plus de DataStax Academy

Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
DataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
DataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
DataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
DataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
DataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
DataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
DataStax Academy
 

Plus de DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 

Dernier

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Dernier (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard

  • 1. Real Time Analytics with Cassandra, Hive, and Solr
  • 2. Real Time Analytics with Cassandra, Hive, and Solr Aaron Stannard, Founder & CEO of MarkedUp
  • 3. Powerful analytics tools for native apps Understand your audience. Gain valuable data on your users. Monitor your app’s health. Log errors and crashes remotely. Drive more sales. Better data = more revenue.
  • 4.
  • 5. Do we really need real-time analytics?
  • 6.
  • 7. Real time analytics isn’t inherently superior or necessary.
  • 8.
  • 9. Building your own real-time analytics service with Cassandra and DataStax Enterprise
  • 13. Analytics Schema Strategy •  All  row  keys  should  be   predictable  (not  always  possible)   •  U8lize  physical  sortability  of   columns   •  Use  predictably  sortable  data   types  for  column  names   (integers,  dates)     •  Learn  to  love  composite  keys   •  Batch  muta8ons  are  your  friend   •  Use  distributed  counters  for  real-­‐ 8me  metrics   •  Use  TTL  for  automa8on  data   expira8on  (if  necessary)    
  • 14. Time Series Schema 0: All Knowns
  • 15. Time Series Schema 1: Bounded Number of Unknowns
  • 16. Time Series Schema 2: Unbounded Number of Unknowns
  • 18. Adding Hive and Hadoop to the Mix Mo’ data, mo’ problems
  • 19. When is Hadoop necessary? •  Large volumes of data (100GB+) •  Queries require retrospective / historical analysis •  Need consistent results •  Need to perform multi-stage analysis •  Speed isn’t a concern (Hadoop is sloooooooooow)
  • 20. Hadoop on easy mode: Hive •  SQL abstraction on top of Hadoop (more familiar) •  Easier to deploy and test •  Simplifies data warehousing •  Easy to automatically import data from Cassandra •  DSE eliminates need for HDFS
  • 22. Hive Syntax Query: count the number items where “key” is greater than 100 RDBMS> select key, count(1) from kv1 where key > 100 group by key; Hive> select key, count(1) from kv1 where key > 100 group by key;
  • 23. Hive Tips and Tricks •  Don’t write data from Hive back to a hot Cassandra column family •  If writing data from Hive to Cassandra, use dedicated column families •  You can write to multiple places on a single Hive read (table, CSV file, etc…) •  Use sampling to test Hive queries on scaled-down data sets
  • 24. How do you count millions of distinct items in real-time?
  • 25. •  Solr:  Lucene-­‐based  indexing  engine   •  Part  of  Apache  Founda8on   •  Full-­‐text  search   •  Faceted  search   •  Distributed   •  Integrates  well  with  Cassandra  
  • 28. Questions or Comments? aaron@markedup.com     hMps://markedup.com/