Architecture, Products, and Total Cost of Ownership of the Leading Machine Learning Stacks

  1. Architecture, Products and Total Cost of Ownership of the Leading Machine Learning Stacks Presented by: William McKnight “#1 Global Influencer in Big Data” Thinkers360 President, McKnight Consulting Group A 2-time Inc. 5000 Company linkedin.com/in/wmcknight/ www.mcknightcg.com (214) 514-1444 Second Thursday of Every Month, at 2:00 ET With William McKnight
  2. TELECOMMUNICATIONS PHARMACEUTICAL EDUCATION CONSUMER PRODUCTS/RETAIL FINANCIAL INSURANCE/HEALTHCARE GOVERNMENT AND UTILITIES OTHER PUBLISHING McKnight Consulting Group Partial Client List
  3. Performance Features • Micro-partitions • Clustering Keys • Clustering Depth • Multi-Clusters • Transparent Materialized Views • Search Optimization Service • Query Acceleration Service
  4. Individual Query Performance Feature Comparison Improves Clustering Materialized Views Search Opt. Service Equality searches X X X Range searches X X X Sort operations X X Substring and Regex X VARIANT searches X Geospatial X Extra Costs Compute X X X Storage X X
  5. Usability Features • External Tables • Dynamic Data Masking • Time Travel and Fail Safe • Semi-Structured Data • Snowpipe • Snowsight Dashboards • Snowpark API
  6. Warehouses • 10 sizes • Available in Standard and Snowpark-optimized • New Snowpark-optimized warehouses with 16x the memory of Standard (open preview) • Sizes: XS, S, M, L, XL, 2XL, 3XL, 4XL, 5XL, 6XL
  7. Pricing • Watch For: – Concurrency and price-per-performance – Effective Warehouses (Multi-clusters) – Add-on compute: • Automatic Clustering • Materialized View Refreshes • Search Optimization • Query Acceleration – Time travel storage • Discounts
  8. (A) Snowflake ML Stack, by category • Dedicated Compute: Snowflake • Storage: Snowflake • Data Integration: AWS Glue • Streaming: Kafka Confluent Cloud • Spark Analytics: Amazon EMR + Kinesis Spark • Data Lake: Snowflake External Tables • Business Intelligence: Tableau • Machine Learning: Amazon SageMaker • Identity Management: Amazon IAM • Data Catalog: AWS Glue Data Catalog
  9. (A) Snowflake Machine Learning Stack Azure Kubernetes Services (AKS) Front-end E-Commerce Website Back-end Cart Profile Products Stock Deployed Recommender ML Model Training & Deployment Automatic Model deployment Databricks Databricks Transactional Database Cloud Firestore Data Loading Data Processing Cloud Data Fusion Snowflake Data Transformation Data Lake + Historical Data Data Marts Cloud Storage (data lake) MDM Database Talend Data Governance: • Partner Solutions • Marketplace solutions
  11. Performance Features • Redshift Advisor • Workload Management • Concurrency Scaling • Transparent Materialized Views • Short Query Acceleration
  12. Usability Features • Redshift Spectrum (External Tables) • Automated Materialized Views (AutoMV) • Dynamic Data Masking • Federated Queries • Semi-Structured and SUPER Type • Streaming Ingest with Kinesis • Python UDF • Redshift ML
  13. Provisioned Clusters vs. Serverless • Managed: Provisioned is self managed; Serverless is fully managed • Compute: Provisioned, choose node type and cluster size; Serverless, workgroup • Storage: Provisioned, provisioned disk capacity; Serverless, namespace • WLM: Provisioned, user configured; Serverless, not applicable • Concurrency scaling: Provisioned, user enabled; Serverless, not applicable • Scale out/up/down: Provisioned, user-initiated cluster resize; Serverless, not applicable • Pause/resume: Provisioned, manual; Serverless, automatic • Compute billing: Provisioned, per second when not paused at a $/hour rate; Serverless, per second when workloads run at an RPU-hour rate • Storage billing: Provisioned, $ per managed storage amount; Serverless, $ per GB-month used • More detailed comparison: https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-console-comparison.html
  14. Cluster Sizes (AWS type: CPU / RAM, node range, price per node) • dc2.large: 2 / 15 GB, 1 – 32 nodes, $0.25 • dc2.8xlarge: 32 / 244 GB, 2 – 128 nodes, $4.80 • ra3.xlplus: 4 / 32 GB, 1 – 32 nodes, $1.09 • ra3.4xlarge: 12 / 96 GB, 2 – 32 nodes, $3.26 • ra3.16xlarge: 48 / 384 GB, 2 – 128 nodes, $13.04 • Serverless (Base & Max RPUs): CPU / RAM ?, 32 – 512 RPUs*, $0.36 *Redshift Processing Units are available in units of 8 (32, 40, 48, and so on, up to 512)
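The serverless row can be sketched as a small check plus a cost estimate. This is a minimal illustration, assuming the $0.36 figure on the slide is a price per RPU-hour (the comparison slide mentions an "RPU-hour rate"); the function names are hypothetical and not part of any AWS API.

```python
def is_valid_rpu(rpus: int) -> bool:
    """RPUs are offered in units of 8, from 32 up to 512 (slide footnote)."""
    return 32 <= rpus <= 512 and rpus % 8 == 0


def serverless_cost(rpus: int, hours_running: float,
                    rate_per_rpu_hour: float = 0.36) -> float:
    """Estimated compute cost for time spent running workloads.

    rate_per_rpu_hour defaults to the slide's $0.36, which is ASSUMED
    here to be a per-RPU-hour price.
    """
    if not is_valid_rpu(rpus):
        raise ValueError(f"{rpus} is not a valid RPU setting")
    return rpus * hours_running * rate_per_rpu_hour


if __name__ == "__main__":
    # 64 RPUs busy for 10 hours in a month (hypothetical workload)
    print(f"${serverless_cost(64, 10):.2f}")
```

Because serverless bills only while workloads run, the hours passed in should be busy hours, not wall-clock hours.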
  15. Pricing • Price-per-performance • Watch For: – Concurrency Scaling – Serverless RPU Usage – SageMaker costs for Redshift ML • Discounts
  16. Redshift ML Stack, by category • Dedicated Compute: Amazon Redshift RA3 • Storage: Amazon Redshift Managed Storage • Data Integration: AWS Glue • Streaming: Amazon Kinesis Data Analytics • Spark Analytics: Amazon EMR + Kinesis Spark • Data Lake: Amazon Redshift Spectrum • Business Intelligence: Amazon QuickSight • Machine Learning: Amazon SageMaker • Identity Management: Amazon IAM • Data Catalog: AWS Glue Data Catalog
  17. AWS Machine Learning Stack • Amazon Elastic Kubernetes Service (Amazon EKS): Front-end (E-Commerce Website), Back-end (Cart, Profile, Products, Stock) • Deployed Recommender ML Model: SageMaker model endpoint • Training & Deployment, Automatic Model deployment: Amazon SageMaker • Transactional Database: Amazon DynamoDB • Data Loading: AWS Glue • Data Processing: Amazon Redshift • Data Lake + Historical Data: S3 (data lake) • MDM Database: Talend • Data Governance: AWS Partner Solutions, AWS Marketplace solutions
  19. Performance Features • Workload Management • Estimated query plan (coming soon) • Transparent materialized views • Adaptive caching (recently used data on NVMe) • Azure Advisor
  20. Usability Features • Dynamic Data Masking • External Data Sources • Synapse Link • SynapseML
  21. Data Warehouse Units (DWU) • Official: “a collection of analytic resources…defined as a combination of CPU, memory, and IO…[which] represents an abstract, normalized measure of compute resources and performance.” • Increasing DWUs linearly improves performance DWUs 100 200 300 400 500 1000 1500 2000 2500 3000 5000 6000 7500 10000 15000 30000
  22. Pricing • Dedicated compute, price per hour by DWUs: 100 = $1.20; 200 = $2.40; 300 = $3.60; 400 = $4.80; 500 = $6; 1000 = $12; 1500 = $18; 2000 = $24; 2500 = $30; 3000 = $36; 5000 = $60; 6000 = $72; 7500 = $90; 10000 = $120; 15000 = $180; 30000 = $360 • Serverless: $5/TB processed • Dedicated: $/hour (see the DWU prices) • 1-year Reserved: 37% discount • 3-year Reserved: 65% discount • Storage: $23/TB-month • Additional charges (per vCore-hour) for Synapse Link, Data Explorer, and Spark Pools • Pipelines priced by DIU-hour, runtime-hour, and per activity run
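The DWU price list on this slide is linear: every level works out to $1.20 per 100 DWUs per hour. A minimal sketch of that arithmetic follows, assuming the reserved discounts apply directly to the hourly list price; the function and constant names are illustrative only.

```python
# DWU levels and the $1.20 per 100 DWU-hour rate are taken from the slide.
DWU_LEVELS = (100, 200, 300, 400, 500, 1000, 1500, 2000, 2500,
              3000, 5000, 6000, 7500, 10000, 15000, 30000)
PRICE_PER_100_DWU_HOUR = 1.20

# Reserved-capacity discounts from the slide. Applying them straight to
# the hourly list price is an ASSUMPTION made for this sketch.
RESERVED_DISCOUNT = {"none": 0.0, "1-year": 0.37, "3-year": 0.65}


def dedicated_hourly_price(dwus: int, term: str = "none") -> float:
    """Hourly price of a dedicated pool at the given DWU level."""
    if dwus not in DWU_LEVELS:
        raise ValueError(f"{dwus} is not a listed DWU level")
    list_price = dwus / 100 * PRICE_PER_100_DWU_HOUR
    return round(list_price * (1 - RESERVED_DISCOUNT[term]), 2)


if __name__ == "__main__":
    for term in RESERVED_DISCOUNT:
        print(term, dedicated_hourly_price(1000, term))
```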
  23. Microsoft Synapse ML Stack, by category • Dedicated Compute: Azure Synapse Analytics Workspace • Storage: Azure Synapse Analytics SQL Pool • Data Integration: Azure Data Factory (ADF) • Streaming: Azure Stream Analytics (for Analytics) and Azure Event Hubs • Spark Analytics: Big Data Analytics with Apache Spark • Data Lake: Amazon Redshift Spectrum • Business Intelligence: Amazon Quicksight • Machine Learning: Amazon SageMaker • Identity Management: Amazon IAM • Data Catalog: Microsoft Purview
  24. Azure Machine Learning Stack • Azure Kubernetes Services (AKS) Front-end E-Commerce Website Back-end Cart Profile Products Stock Deployed Recommender ML Model Runtime Azure ML managed online endpoint Azure Machine Learning Transactional Database Azure Cosmos DB Core API Analytical Store (HTAP) Azure Cosmos DB Analytical Store (Parquet) Cognitive Services Sentiment analysis on product reviews to enhance the recommender model Synapse Link Enables automatic sync to analytical store (no ETL) Data Processing Azure Synapse Analytics Data Lake + Historical Data ADL Gen2 Data Lake: HTAP data, sentiment data, historical order data Automatic Model deployment (MLOps) Data Transformation & ML Model Training Azure Databricks Delta Live Tables SparkML Microsoft Purview Data Management & Governance Discover, classify, track lineage, and protect sensitive data (customer profiles, etc.) MDM Database Talend
  26. Performance Features • BQ Architecture and Slots • Clustering and Partitioning • Transparent Materialized Views • BI Engine
  27. Usability Features • BigQuery Omni – External Tables • Time Travel • Migration Service – SQL Translation • Looker Studio • Colab Notebooks • BigQuery ML
  28. Pricing • Compute (BigQuery / Omni): On-demand $5 per TB / $5 per TB; Flex $4.00/hr per 100 slots / $5.00/hr per 100 slots; Monthly Commit* $2.74/hr per 100 slots / $3.42/hr per 100 slots; Annual Commit* $2.33/hr per 100 slots / $2.91/hr per 100 slots; BI Engine $0.0416/hr per GB / N/A • Storage (1) (Logical (2) / Physical (3)): Active $0.02/GB-month / $0.04/GB-month; Long-term (4) $0.01/GB-month / $0.02/GB-month • Batch loading: FREE • Streaming inserts: $0.01 per 200MB • Storage API: $0.025 per 1GB • Notes: (1) You get to choose logical or physical billing; (2) Logical = uncompressed size (time travel free); (3) Physical = compressed size + time travel; (4) Table not modified in 90 days; * comes with some free BI Engine
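Since the slide says you choose between logical and physical storage billing, the choice can be sketched as a simple comparison of the two active-storage rates. This is an illustrative calculation only: the byte counts are hypothetical inputs, the physical figure is assumed to already include time travel data as the slide notes, and the function names are not BigQuery APIs.

```python
# Active-storage rates per GB-month, from the slide.
LOGICAL_ACTIVE_RATE = 0.02   # billed on uncompressed size
PHYSICAL_ACTIVE_RATE = 0.04  # billed on compressed size + time travel


def monthly_active_storage_cost(logical_gb: float, physical_gb: float) -> dict:
    """Monthly cost of the same dataset under each billing model."""
    return {
        "logical": logical_gb * LOGICAL_ACTIVE_RATE,
        "physical": physical_gb * PHYSICAL_ACTIVE_RATE,
    }


def cheaper_billing_model(logical_gb: float, physical_gb: float) -> str:
    """Name of the cheaper model; ties go to 'logical'."""
    costs = monthly_active_storage_cost(logical_gb, physical_gb)
    return min(costs, key=costs.get)


if __name__ == "__main__":
    # Hypothetical dataset: 1,000 GB uncompressed, 300 GB compressed
    # including time travel.
    print(monthly_active_storage_cost(1000, 300))
    print(cheaper_billing_model(1000, 300))
```

At these rates, physical billing only wins when the compressed size plus time travel is less than half the uncompressed size.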
  29. Google BigQuery ML Stack, by category • Dedicated Compute: Google BigQuery • Storage: Google BigQuery Storage • Data Integration: Google Dataflow (Batch) • Streaming: Google Dataflow (Streaming) • Spark Analytics: Google Dataproc • Data Lake: Google BigQuery On-Demand Infrastructure • Business Intelligence: Google BigQuery BI Engine • Machine Learning: Google BigQuery ML • Identity Management: Google Cloud IAM • Data Catalog: Google Data Catalog
  30. Google Machine Learning Stack • Azure Kubernetes Services (AKS) Front-end E-Commerce Website Back-end Cart Profile Products Stock Deployed Recommender ML Model Training & Deployment Automatic Model deployment Vertex AI Prediction Vertex AI Data Governance • Google Dataplex Transactional Database Cloud Firestore Data Loading Data Processing Cloud Data Fusion BigQuery Data Transformation Data Lake + Historical Data Cloud Dataprep Cloud Dataflow Cloud Storage (data lake) MDM Database Talend
  31. Technology Stack Costs
  32. Sample Stack Cost Breakout
  33. Line Item Pricing (AWS)
      Lookup | CostCenter | Category | Platform | Product | Size | UnitNode
      Amazon Redshift ra3.4xlarge-Infrastructure | Infrastructure | 01-Dedicated Compute | AWS | Amazon Redshift ra3.4xlarge | 1-Medium | ra3.4xlarge
      Amazon Redshift ra3.16xlarge-Infrastructure | Infrastructure | 01-Dedicated Compute | AWS | Amazon Redshift ra3.16xlarge | 2-Large | ra3.16xlarge
      Amazon Redshift Managed Storage-Storage | Storage | 02-Storage | AWS | Amazon Redshift Managed Storage | 1-Medium | GB-month
      Amazon Redshift Managed Storage-Storage | Storage | 02-Storage | AWS | Amazon Redshift Managed Storage | 2-Large | GB-month
      AWS Glue-Software | Software | 03-Data Integration | AWS | AWS Glue | 1-Medium | DPU-Hour
      AWS Glue-Software | Software | 03-Data Integration | AWS | AWS Glue | 2-Large | DPU-Hour
      Amazon Kinesis Data Analytics-Infrastructure | Infrastructure | 04-Streaming | AWS | Amazon Kinesis Data Analytics | 1-Medium | KPU-Hour
      Amazon Kinesis Data Analytics-Infrastructure | Infrastructure | 04-Streaming | AWS | Amazon Kinesis Data Analytics | 2-Large | KPU-Hour
      Amazon Kinesis Data Analytics-Storage | Storage | 04-Streaming | AWS | Amazon Kinesis Data Analytics | 1-Medium | GB-month
      Amazon Kinesis Data Analytics-Storage | Storage | 04-Streaming | AWS | Amazon Kinesis Data Analytics | 2-Large | GB-month
      Amazon EMR-Infrastructure | Infrastructure | 05-Spark Analytics | AWS | Amazon EMR | 1-Medium | r5.4xlarge
      Amazon EMR-Software | Software | 05-Spark Analytics | AWS | Amazon EMR | 1-Medium | EMR on r5.4xlarge
      Amazon EMR-Infrastructure | Infrastructure | 05-Spark Analytics | AWS | Amazon EMR | 2-Large | r5.4xlarge
      Amazon EMR-Software | Software | 05-Spark Analytics | AWS | Amazon EMR | 2-Large | EMR on r5.4xlarge
      Amazon Kinesis-Shards | Shards | 05-Spark Analytics | AWS | Amazon Kinesis | 1-Medium | Shard-hour
      Amazon Kinesis-Shards | Shards | 05-Spark Analytics | AWS | Amazon Kinesis | 2-Large | Shard-hour
      Amazon Redshift Spectrum-Software | Software | 06-Data Exploration | AWS | Amazon Redshift Spectrum | 1-Medium | TB-month
      Amazon Redshift Spectrum-Software | Software | 06-Data Exploration | AWS | Amazon Redshift Spectrum | 2-Large | TB-month
      Amazon Redshift ra3.4xlarge-Infrastructure | Infrastructure | 06-Data Exploration | AWS | Amazon Redshift ra3.4xlarge | 1-Medium | ra3.4xlarge
      Amazon Redshift ra3.4xlarge-Infrastructure | Infrastructure | 06-Data Exploration | AWS | Amazon Redshift ra3.4xlarge | 2-Large | ra3.4xlarge
      Amazon EMR-Infrastructure | Infrastructure | 07-Data Lake | AWS | Amazon EMR | 1-Medium | r5.4xlarge
      Amazon EMR-Software | Software | 07-Data Lake | AWS | Amazon EMR | 1-Medium | EMR on r5.4xlarge
      Amazon EMR-Infrastructure | Infrastructure | 07-Data Lake | AWS | Amazon EMR | 2-Large | r5.4xlarge
      Amazon EMR-Software | Software | 07-Data Lake | AWS | Amazon EMR | 2-Large | EMR on r5.4xlarge
      Amazon Quicksight Readers-Licenses | Licenses | 08-Business Intelligence | AWS | Amazon Quicksight Readers | 1-Medium | User-month
      Amazon Quicksight Readers-Licenses | Licenses | 08-Business Intelligence | AWS | Amazon Quicksight Readers | 2-Large | User-month
      Amazon Quicksight Authors-Licenses | Licenses | 08-Business Intelligence | AWS | Amazon Quicksight Authors | 1-Medium | User-month
      Amazon Quicksight Authors-Licenses | Licenses | 08-Business Intelligence | AWS | Amazon Quicksight Authors | 2-Large | User-month
      Amazon SageMaker-Infrastructure | Infrastructure | 09-Machine Learning | AWS | Amazon SageMaker | 1-Medium | ml.r5.2xlarge
      Amazon SageMaker-Software | Software | 09-Machine Learning | AWS | Amazon SageMaker | 1-Medium | ml.r5.2xlarge
      Amazon SageMaker-Infrastructure | Infrastructure | 09-Machine Learning | AWS | Amazon SageMaker | 2-Large | ml.r5.2xlarge
      Amazon SageMaker-Software | Software | 09-Machine Learning | AWS | Amazon SageMaker | 2-Large | ml.r5.2xlarge
      Amazon IAM-Licenses | Licenses | 10-Identity Management | AWS | Amazon IAM | 1-Medium | Included
      Amazon IAM-Licenses | Licenses | 10-Identity Management | AWS | Amazon IAM | 2-Large | Included
      AWS Glue Data Catalog-Software | Software | 11-Data Catalog | AWS | AWS Glue Data Catalog | 1-Medium | 100K objects
      AWS Glue Data Catalog-Software | Software | 11-Data Catalog | AWS | AWS Glue Data Catalog | 2-Large | 100K objects
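In the line items on this slide, every Lookup value is the Product name joined to the CostCenter with a hyphen, and each product appears once per project size. A minimal sketch of that structure follows; the class and helper names are hypothetical, and only three rows from the slide are reproduced. The slide carries no unit prices, so none are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    cost_center: str   # e.g. Infrastructure, Storage, Software, Licenses
    category: str      # e.g. "01-Dedicated Compute"
    platform: str
    product: str
    size: str          # "1-Medium" or "2-Large"
    unit: str          # billing unit or node type

    @property
    def lookup(self) -> str:
        """Key in the slide's Lookup column: '<Product>-<CostCenter>'."""
        return f"{self.product}-{self.cost_center}"


# Three rows copied from the slide.
ITEMS = [
    LineItem("Infrastructure", "01-Dedicated Compute", "AWS",
             "Amazon Redshift ra3.4xlarge", "1-Medium", "ra3.4xlarge"),
    LineItem("Storage", "02-Storage", "AWS",
             "Amazon Redshift Managed Storage", "1-Medium", "GB-month"),
    LineItem("Software", "03-Data Integration", "AWS",
             "AWS Glue", "2-Large", "DPU-Hour"),
]


def items_for_size(items: list, size: str) -> list:
    """Select the line items that belong to one project size."""
    return [item for item in items if item.size == size]


if __name__ == "__main__":
    for item in items_for_size(ITEMS, "1-Medium"):
        print(item.lookup, "->", item.unit)
```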
  34. Stack Cost by Use Case for Medium-Sized Enterprises • 1st Year of Project • 1st Large Scale ML Project • $1.3M – $3.2M
  35. Stack Cost by Use Case for Large-Sized Enterprises • 1st Year of Project • 1st Large Scale ML Project • $3.4M – $8.5M
  36. Project ROI & TCO • ROI = Benefit / TCO • TCO = Infrastructure + Software + FTE + Consulting
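Reading the formula on this slide as ROI = Benefit / TCO, with TCO as the sum of infrastructure, software, FTE, and consulting costs, the calculation can be sketched as below. The class name, function name, and all dollar figures are hypothetical and chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class TCO:
    """Total cost of ownership, split into the slide's four components."""
    infrastructure: float
    software: float
    fte: float
    consulting: float

    def total(self) -> float:
        return self.infrastructure + self.software + self.fte + self.consulting


def roi(benefit: float, tco: TCO) -> float:
    """ROI as the ratio of benefit to total cost of ownership."""
    total = tco.total()
    if total <= 0:
        raise ValueError("TCO must be positive")
    return benefit / total


if __name__ == "__main__":
    # Hypothetical first-year figures, in dollars.
    first_year = TCO(infrastructure=1_500_000, software=900_000,
                     fte=1_200_000, consulting=400_000)
    print(f"TCO: ${first_year.total():,.0f}")
    print(f"ROI: {roi(6_000_000, first_year):.2f}")
```

Under this reading, a ratio above 1.0 means the benefit exceeds the total cost.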
  37. Summary • For large enterprise projects, the stack cost typically ranges from $3.4M to $8.5M to deploy ML-based projects into production, in addition to labor expenses. • The total cost of ownership of cloud analytics platforms grows as your company's demand for analytics grows over time. • Snowflake uses usage-based (consumption-based) pricing: users are charged for the compute and storage they consume, so higher usage means higher cost. • Redshift offers both provisioned clusters and a serverless option to suit different business requirements. • Synapse dedicated compute is purchased in Data Warehouse Units (DWUs), a collection of analytic resources that can be scaled to meet the needs of the organization. • BigQuery slots operate as virtual CPUs used to process and analyze data. • Many technology stacks are available; the ones covered here are just a few examples. • Dedicated Compute, Storage, Data Integration, Streaming, Spark Analytics, Data Lake, Business Intelligence, Machine Learning, Identity Management, and Data Catalog are all essential components of a modern data management and analytics ecosystem. • Estimating the cost of building a technology stack is complex and requires careful consideration of many factors. • Seek reliable performance at a predictable price to support the successful implementation of data management and analytics projects. • The true measure of project efficacy is Return on Investment (ROI), and organizations should strive for positive ROI in their data management and analytics endeavors.