SCALE YOUR DATA COLLECTION
ON THE CLOUD LIKE A CHAMP
Moty Michaely, VP R&D Xplenty
SCALING DATA COLLECTION = A PAIN
Plenty of companies are limited by their data collection
methods when it comes to scalability.
Once they need more detailed data in larger quantities,
scaling the system can become a major pain.
THREE COMMON METHODS FOR COLLECTING BIG
DATA... IS YOUR COMPANY USING THE RIGHT ONE?
▪ Storing directly in the DB
▪ Keeping it in a local file
▪ S3/CloudFront logging
STORING DIRECTLY IN THE DB
This is what companies usually start with. As the name
suggests, data is inserted right into the DB.
There are two ways to do it:
▪ Row by row means the data is added as a row to the DB in
real time.
▪ Bulk insert adds multiple rows to the DB in one transaction.
(It’s faster than row by row, but if any insert fails the entire batch may fail, and a
big chunk of data then has to be re-inserted.)
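
The difference between the two insert styles can be sketched in a few lines. This is a minimal illustration, assuming Python's built-in sqlite3 module as a stand-in for whatever database is actually in use; the events table and its columns are hypothetical.

```python
import sqlite3

INSERT_SQL = "INSERT INTO events (user_id, action) VALUES (?, ?)"

def insert_row_by_row(conn, events):
    """Insert each event in its own transaction, so it is queryable right away."""
    for event in events:
        with conn:  # commits on success, rolls back on error
            conn.execute(INSERT_SQL, event)

def insert_bulk(conn, events):
    """Insert all events in one transaction. Returns False if the batch was rolled back."""
    try:
        with conn:
            conn.executemany(INSERT_SQL, events)
        return True
    except sqlite3.Error:
        return False

# Hypothetical schema, in-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER NOT NULL, action TEXT NOT NULL)")

insert_row_by_row(conn, [(1, "click"), (2, "view")])

# The second row violates NOT NULL, so the whole batch is rolled back,
# including the valid first row, and must be re-inserted later.
ok = insert_bulk(conn, [(3, "click"), (4, None)])
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

After the failed batch only the two row-by-row inserts remain in the table, which is the re-insert cost the slide refers to.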
PROS FOR STORING DIRECTLY IN THE DB
▪ Better performance than other methods for inserting data.
▪ Real-time data available when adding row by row.
CONS FOR STORING DIRECTLY IN THE DB
▪ Schema changes are required to add new types of data.
▪ Scaling is required in two layers - application and database.
Scaling the application is usually easier (using a network load
balancer for example) but scaling the database requires hiring
an expert DBA, partitioning the DB, and scaling up the server.
(Relational DBs that scale out to multiple nodes are expensive and require a lot of
maintenance.)
BOTTOM LINE
Storing directly in the DB gives you fast performance, but it
doesn’t scale.
KEEPING IT LOCAL
Data is dumped in big local files. These files are periodically
uploaded via a program to S3, or inserted in batches into a
NoSQL DB such as Amazon DynamoDB or into a data warehouse
like Amazon Redshift.
PROS FOR KEEPING IT IN A LOCAL FILE
▪ New types of data can be added easily since no schema
changes are required.
▪ Compatible with all applications because any file format can
be used.
▪ Quicker filtering via customized directory/file names, e.g.
with date/time indication.
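
One way to get date/time-based file names is sketched below, assuming events are written as JSON lines. The directory layout (event type, then year/month/day, then one file per hour) and the function names are illustrative choices, not something the slides prescribe.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def event_log_path(root, event_type, when):
    """Build a path like <root>/click/2024/05/17/13.jsonl from the event time."""
    return Path(root) / event_type / when.strftime("%Y/%m/%d") / (when.strftime("%H") + ".jsonl")

def append_event(root, event_type, payload, when):
    """Append one event as a JSON line to the file for its type and hour."""
    path = event_log_path(root, event_type, when)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(payload) + "\n")
    return path

# Hypothetical event, written under a temporary directory.
root = tempfile.mkdtemp()
when = datetime(2024, 5, 17, 13, 30, tzinfo=timezone.utc)
written = append_event(root, "click", {"user": "alice", "x": 10, "y": 20}, when)
```

With this layout, filtering by event type or by day becomes a matter of listing a directory instead of scanning every file.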
CONS FOR KEEPING IT IN A LOCAL FILE
▪ One needs to develop a tracking program to deal with the
files - rotating logs while more data is incoming, handling
failures, and transactionality. Even if you have the manpower,
time, and money, it’s hard to develop such a program.
▪ Scaling means adding more servers, more maintenance, and
more money.
▪ Data is not as queryable as data stored in a DB.
▪ Staging and production environments require extra servers.
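
To give a sense of what the tracking program has to do, here is a minimal sketch of just the rotation step. It assumes writers reopen the active file for every append (a writer holding an open handle would keep writing to the rotated file), and the file names are hypothetical. Failure handling and upload are left out, which is exactly the hard part.

```python
import os
import tempfile
import time
from pathlib import Path

def rotate(active, outbox):
    """Move the active log aside so it can be uploaded while new data keeps arriving.

    Returns the rotated path, or None if there was nothing to rotate.
    """
    active, outbox = Path(active), Path(outbox)
    if not active.exists() or active.stat().st_size == 0:
        return None
    outbox.mkdir(parents=True, exist_ok=True)
    rotated = outbox / f"{active.stem}-{time.time_ns()}{active.suffix}"
    os.replace(active, rotated)  # atomic when both paths are on the same filesystem
    return rotated

# Illustration with a temporary directory and made-up events.
workdir = Path(tempfile.mkdtemp())
active = workdir / "events.log"
active.write_text("event-1\n", encoding="utf-8")

rotated = rotate(active, workdir / "outbox")

# Incoming data after the rotation lands in a fresh active file.
with active.open("a", encoding="utf-8") as f:
    f.write("event-2\n")
```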
BOTTOM LINE
More flexible than direct DB storage, but requires more
development, and scaling is still an issue.
S3/CLOUDFRONT LOGGING
This old school solution goes back to the early days when
visitor counters and burning “hot!” animations ruled the web.
To track an event, an HTTP request is sent for a 1x1 pixel image
from a relevant S3 directory. Accessing the image automatically
generates a W3C log with all HTTP request parameters: IP
address, browser, date/time, etc. Extra session-level data like
username or mouse position is passed via the query string. To
differentiate between event types, images are placed in
accordingly named directories, e.g. /click/.
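
The request described above could be built as follows. This is a sketch only: the host name, the pixel.gif file name, and the parameter names are assumptions made for illustration.

```python
from urllib.parse import urlencode

def pixel_url(base_url, event_type, **params):
    """Return the URL of the 1x1 image for event_type, with params in the query string."""
    return f"{base_url.rstrip('/')}/{event_type}/pixel.gif?{urlencode(params)}"

# Hypothetical distribution host and session data.
url = pixel_url("https://tracking.example.com", "click", username="alice", x=120, y=45)
```

Requesting this URL (for example from an image tag or a background request in the page) is all the client has to do; the event type is carried by the directory and the session data by the query string.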
PROS FOR S3/CLOUDFRONT LOGGING
▪ No tracking server required - data reaches S3 automatically.
▪ No file management - Amazon handles all file monkey
business.
▪ No servers - Amazon provides them.
▪ Cost effective - only log storage and bandwidth are paid for.
The logs take little space since they are all GZipped and the
bandwidth for 1x1 pixel images is marginal.
PROS FOR S3/CLOUDFRONT LOGGING (CONTINUED)
▪ Easily scalable with practically infinite space and firepower.
▪ Quick and easy to implement.
▪ Simple setup for staging/production environments via
additional distributions and a prefix.
▪ Web application performance unharmed, especially using the
CloudFront CDN.
CONS FOR S3/CLOUDFRONT LOGGING
▪ Slower filtering than with a local setup. Amazon names log files and
directories automatically, and no customization is available.
▪ Not suitable for real time or impatience. Data is aggregated into a new
file in the bucket only once per hour, and that’s Amazon’s best effort so
it could take longer.
▪ Data is not as queryable as data stored in a DB.
▪ Vendor dependent. Having your servers outside of Amazon will
decrease performance.
▪ No control over the file format. W3C Extended Log File Format is
mandatory and some applications may not like that.
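
Reading the resulting logs takes some parsing. The sketch below assumes a gzipped, tab-separated log whose columns are named by a #Fields directive, as in the W3C Extended Log File Format. The sample log is invented and heavily abbreviated; real logs carry many more fields.

```python
import gzip
import io
from urllib.parse import parse_qs

def parse_w3c_log(stream):
    """Yield one dict per log entry, keyed by the names in the #Fields directive."""
    fields = []
    for line in stream:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#"):
            yield dict(zip(fields, line.split("\t")))

# Invented sample entry, for illustration only.
sample = (
    "#Version: 1.0\n"
    "#Fields: date time c-ip cs-uri-stem cs-uri-query\n"
    "2024-05-17\t13:30:00\t203.0.113.7\t/click/pixel.gif\tusername=alice&x=120&y=45\n"
)
compressed = gzip.compress(sample.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as log:
    entries = list(parse_w3c_log(log))

event = entries[0]
event_type = event["cs-uri-stem"].strip("/").split("/")[0]  # directory name = event type
params = parse_qs(event["cs-uri-query"])                   # session data from the query string
```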
BOTTOM LINE
Quick, cheap, and scalable, though it doesn’t provide the best
performance or customization.
WHAT’S RIGHT FOR YOU?
So much emphasis has been put on the technologies used
for processing, analyzing, and visualizing data. But the
importance of collecting that data often gets lost in the
shuffle. The two go hand in hand. To get good output from
your data, you must first have proper input.
Only once you have achieved the synergy between the two
will you fully be able to tap into your data’s potential.
XPLENTY
WWW.XPLENTY.COM
