Lessons learned from
designing a QA automation
for analytics databases
(Big Data)
Omid Vahdaty, Big Data ninja.
Agenda
● Introduction to Big Data Testing
● Methodological approach for QA Automation road map
● True Story
● The Solution?
● Challenges of Big Data Testing.
● Performance & Maintenance.
Introduction to Big Data Testing
● 100M ?
● 1B ?
● 10B?
● 100B?
● 1Tr?
How Big is Big Data?
Scale OUT VS. Scale UP systems
Big Data ?
● Structured (SQL, tabular, strings, numbers)
● Unstructured (logs, pictures, binary, json,blob, video etc)
OLAP or OLTP?
● Simply put:
○ OLAP: one query running on a lot of data for reporting/analytics
○ OLTP: many short queries running in parallel for applications (e.g. a user logs in to a website and the
credentials are stored in a DB)
Type of Big Data products
● Databases
○ OLAP, OLTP
○ SQL, NoSQL
○ In Memory, Disk based.
○ Hadoop Ecosystem, Spark
● Ecosystem tools
○ ETL tools
○ Visualizations tools
● General Application
● Analytics Based products
○ Fraud
○ Cyber
○ Finance, etc.
Expectation Matching for today’s lecture...
● OLAP Database
● Structured data only, Synthetic data was used in testing.
● (not about QA of analytics based products/solutions).
● (not about Hadoop or event-streaming clusters).
● (not trying to sell anything)
● In a nutshell:
○ How to test an SQL based database?
○ What sort of challenges were met?
How Hard Can it be (aggregation)?
● Select Count(*) from t;
○ Assume
■ 1 Trillion records
■ ad-hoc query (no indexes)
■ Full Scan (no cache)
○ Challenges
■ Time it takes to compute?
■ IO bottleneck? What is IO pattern?
■ CPU bottleneck?
■ Scale up limitations?
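A rough way to reason about the "time it takes to compute" question above is a back-of-envelope lower bound. This is a minimal sketch, assuming the scan is limited only by sequential read throughput; the 8 bytes per row and 2 GB/s figures are hypothetical and not taken from the talk.

```python
def full_scan_seconds(rows: int, bytes_per_row: int, throughput_bytes_per_sec: float) -> float:
    """Lower bound on full-scan time when IO throughput is the only limit."""
    if throughput_bytes_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return rows * bytes_per_row / throughput_bytes_per_sec


# Hypothetical numbers: 1 trillion rows, 8 bytes read per row, 2 GB/s sequential read.
seconds = full_scan_seconds(10**12, 8, 2 * 10**9)
print(f"{seconds:.0f} s (~{seconds / 3600:.1f} h)")
```

Changing the throughput or row width shows quickly whether a single node can meet a target, which is where the scale-up question comes from.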
How Hard Can it be (join)?
● Select * from t1 join t2 on t1.id = t2.id;
○ Assume
■ 1 Trillion records both tables.
■ ad-hoc query (no indexes)
■ Full Scan (no cache)
○ Challenges
■ Time it takes to compute?
■ IO bottleneck? What is IO pattern?
■ CPU bottleneck?
■ Scale out limitations?
3 Pain Points in Big Data DBMS
Big Data System
Ingress data rate Egress data rate
Maintenance rate
Big Data ingress challenges
● Parsing of Data type
○ Strings
○ Dates
○ Floats
● Rate of data coming into the system
● ACID: Atomicity, Consistency, Isolation, Durability
● Compression on the fly (non unique values)
● amount of data
● On the fly analytics
● Time constraints (target: x rows per sec/hour)
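The "x rows per sec" target above can be turned into a measurable number. The following is a small sketch, assuming SQLite (from the Python standard library) as a stand-in for the database under test; the table name `ingest`, the batch size and the payload format are illustrative choices only.

```python
import sqlite3
import time


def measure_ingress(conn, rows, batch_size=1000):
    """Insert rows in batches and return the observed rate in rows per second."""
    conn.execute("CREATE TABLE IF NOT EXISTS ingest (id INTEGER, payload TEXT)")
    started = time.perf_counter()
    for offset in range(0, len(rows), batch_size):
        batch = rows[offset:offset + batch_size]
        conn.executemany("INSERT INTO ingest VALUES (?, ?)", batch)
        conn.commit()  # commit per batch so each batch is durable on its own
    elapsed = time.perf_counter() - started
    return len(rows) / elapsed if elapsed > 0 else float("inf")


conn = sqlite3.connect(":memory:")
rows = [(i, f"payload-{i}") for i in range(10_000)]
rate = measure_ingress(conn, rows)
print(f"{rate:,.0f} rows/sec")
```

Running the same measurement while queries are executing gives the "ingress while egress" baseline mentioned later in the method.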
Big Data egress challenges
● Sort
● Group by
● Reduce
● Join (sortMergeJoin)
● Data distribution
● Compression
● Theoretical Bottlenecks of hardware.
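For readers unfamiliar with the sort-merge join named in the list above, here is a textbook in-memory sketch. It is not the product's implementation; it only shows why the sort and the merge phases are the expensive parts. Rows are assumed to be tuples with the join key in the first position.

```python
def sort_merge_join(left, right, key=lambda row: row[0]):
    """Inner equi-join of two lists of tuples using sort, then a single merge pass."""
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = key(left[i]), key(right[j])
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == left_key:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and key(left[i]) == left_key:
                for right_row in right[j:j_end]:
                    joined.append(left[i] + right_row)
                i += 1
            j = j_end
    return joined


t1 = [(2, "b"), (1, "a"), (4, "d"), (2, "c")]
t2 = [(4, "w"), (2, "x"), (3, "z"), (2, "y")]
for row in sort_merge_join(t1, t2):
    print(row)
```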
Methodological approach
for QA Automation road map
Business constraints?
● Budget?
● Time to market?
● Resources available?
● Skill set available ?
Product requirements?
● What are the product’s supported use cases?
● Supported Scale?
● Supported rate of ingress?
● Supported rate on egress?
● Complexity of Cluster?
● High availability?
The Automation: Must have requirements?
● Scale out/up?
● Fault tolerance!
● Reproducibility from scratch!
● Reporting!
● “Views” to avoid duplication for testing?
● Orchestration of same environment per Developer!
● Debugging options
● versioning!!!
The method
● Phase 0: get ready
○ Product requirements & business constraints
○ Automation requirements
○ Creating a testing matrix based on insight and product features.
● Phase 1: Get insights (some baselines)
○ Test Ingress separately, find your baseline.
○ Test Egress separately, find your baseline.
○ Test Ingress while Egress is in progress, find your baseline.
○ Stability and Performance
● Phase 2: Agile: Design an automation system that satisfies the requirements
○ Prioritize automation features based on your insights from phase 1.
○ Implement an automation infrastructure as soon as possible.
○ Update your testing matrix as you go (insight will keep coming)
○ Analyze the test results daily.
● Phase 3: Cost reduction
○ How to reduce compute time/ IO pattern/ network pattern/storage footprint
○ Maintenance time (build/package/deploy/monitor)
○ Hardware costs
True Story
Testing GPU based DBMS
(Peta Scale)
So why build a DBMS from scratch?
● Most compilers are designed for CPUs → the execution tree targets a CPU runtime
engine → a new compiler was needed, with GPU resources in mind.
● Most algorithms were written for CPUs → same algorithm logic, redesigned for
the parallelism philosophy of the GPU → a new runtime with GPU resources in mind →
performance increase by an order of magnitude.
● VISION: fastest DBMS, cost effective, true scalability.
True story
● Product: Big Data gpu based DBMS system for analytics. [peta scale]
○ Structured data only, designed for OLAP use cases only
● Technical Constraints
○ How to test SQL syntax (infinite input)? SQL coverage from 0% to 20% to 90% to 100%?
○ What is the expected impact of big data?
○ How to debug performance issues?
○ Can you virtualize? Should you virtualize?
○ Running time: how long does it take to run all the tests?
● Company Challenges
○ Cost of Hardware? High end 2U Server
○ Expertise? Skill set?
○ Human resources: Size of QA Automation team?
○ How do you manage automation?
○ What is your MVP? (Chicken and Egg)
○ TIME TO MARKET!!! The product needs to be released.
The baseline concept
● Requirements
○ CSV , DDL , query
○ Competing DB
○ Your DB
○ Synthetic data used mostly.
○ (real data at peta scale is hard to come by)
● Steps
○ Insert the same data into the same table on both DBs
○ Run the same query on both DBs
○ Compare the results from both.
○ If equal → test pass.
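The steps above can be sketched in a few lines. This is a minimal illustration, assuming two independent in-memory SQLite databases stand in for the competing DB and the DB under test; the table `t`, its columns and the helper names are hypothetical.

```python
import sqlite3


def load(conn, ddl, rows):
    """Create the table and insert the shared data set."""
    conn.execute(ddl)
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    conn.commit()


def run(conn, query):
    return conn.execute(query).fetchall()


def baseline_test(reference, candidate, ddl, rows, query):
    """Same data, same query on both databases; pass if the results are equal."""
    for conn in (reference, candidate):
        load(conn, ddl, rows)
    # Sort both sides so that row order alone cannot cause a failure.
    return sorted(run(reference, query)) == sorted(run(candidate, query))


ddl = "CREATE TABLE t (id INTEGER, amount INTEGER)"
rows = [(1, 10), (2, 20), (3, 30)]
reference_db = sqlite3.connect(":memory:")  # stand-in for the competing DB
candidate_db = sqlite3.connect(":memory:")  # stand-in for your DB
passed = baseline_test(reference_db, candidate_db, ddl, rows,
                       "SELECT count(*), sum(amount) FROM t")
print("pass" if passed else "fail")
```

With two real engines, the DDL and type names usually need per-vendor variants, which is one of the challenges listed further on.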
The Solution: SQL testing framework.
● Lightweight python test suite per feature
○ DDL
○ Set of Queries
○ Json for data generation requirements
○ Data generation utilities (genUtil)
○ test results report - in CSV format.
○ Expected error mechanism.
○ Results compare mechanism
○ Command line arguments for advanced testing/config/tuning/views.
● Wrapper test suite
○ Groups test suites by logic, e.g. Daily
○ Aggregate reporting mechanism
● Scheduling, reporting, monitoring, deployment and alerting mechanisms
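Two of the pieces listed above, the expected error mechanism and the CSV test report, can be illustrated together. This is only a sketch of the idea, assuming SQLite as the engine; the function names and the report columns are invented for the example and are not the framework's real interface.

```python
import csv
import io
import sqlite3


def run_case(conn, name, query, expect_error=False):
    """Run one query; a case marked expect_error passes only if the DB rejects it."""
    try:
        conn.execute(query).fetchall()
        raised = False
    except sqlite3.Error:
        raised = True
    status = "pass" if raised == expect_error else "fail"
    return {"test": name, "query": query, "status": status}


def write_report(results, out):
    """Write one CSV line per test case."""
    writer = csv.DictWriter(out, fieldnames=["test", "query", "status"])
    writer.writeheader()
    writer.writerows(results)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
results = [
    run_case(conn, "count", "SELECT count(*) FROM t"),
    run_case(conn, "bad_syntax", "SELEC * FROM t", expect_error=True),
    run_case(conn, "missing_table", "SELECT * FROM no_such_table", expect_error=True),
]
report = io.StringIO()
write_report(results, report)
print(report.getvalue())
```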
Testing concept of SQL syntax
              Simple   Aggregation   Joins
Simple          x          x           x
Aggregation     x          x           x
Joins           x          x           x
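One possible reading of the matrix above is that every cell is a pair of query categories to be combined in a single test. Under that assumption (the slide itself does not spell it out), the cells can be enumerated mechanically:

```python
from itertools import product

CATEGORIES = ["simple", "aggregation", "joins"]


def coverage_cells(categories):
    """Every (row, column) pair of the matrix, including the diagonal."""
    return list(product(categories, repeat=2))


cells = coverage_cells(CATEGORIES)
for row_feature, column_feature in cells:
    print(f"{row_feature} x {column_feature}")
```

Each pair would then map to one or more concrete queries, so coverage can be tracked cell by cell.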
Challenges with SQL syntax testing
● (Binary data not supported in the product)
● Repetition of queries.
● Different data type names.
● Different data ranges per data types.
● Different accuracy per data types.
● The competition DB that supports big data is expensive...
● Accuracy of results. (different DB’s return different accuracy)
● SQL has some extreme cases (differ per vendor).
● Datetime format.
● Unsupported features.
● Duplication of testing.
● Very hard to predict which queries are useful (negative and positive testing).
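The accuracy items above mean that an exact equality check between two databases tends to produce false failures. A hedged sketch of a tolerant comparison follows; the tolerance value and the function names are assumptions, and a real framework would also need per-type rules for dates and strings.

```python
import math


def values_match(a, b, rel_tol=1e-9):
    """Compare two cells; floats are compared with a relative tolerance."""
    if a is None or b is None:
        return a is None and b is None
    numeric = (int, float)
    if (isinstance(a, numeric) and isinstance(b, numeric)
            and (isinstance(a, float) or isinstance(b, float))):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b  # integers, strings and everything else must match exactly


def rows_match(expected, actual, rel_tol=1e-9):
    """Compare two result sets that are already in the same order."""
    if len(expected) != len(actual):
        return False
    return all(
        len(e) == len(a) and all(values_match(x, y, rel_tol) for x, y in zip(e, a))
        for e, a in zip(expected, actual)
    )


print(rows_match([(1, 0.1 + 0.2)], [(1, 0.3)]))
```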
Challenges with Small Data testing
● Generating Random data
○ Reproducible every time → strict data ranges per test (length, numeric range, format, accuracy)
○ What is the amount of unique values? Different histogram, different bottlenecks:
■ Unique values challenges:
● Per chunk?
● Per Column
■ Non unique values challenges:
● String?
○ Lengths
○ Compressible?
● Numeric?
○ Overflow?
○ Floating point?
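The reproducibility and unique-value concerns above can be handled by a seeded generator driven by a JSON description. The talk names such a tool genUtil but does not show its format, so the JSON keys below (`seed`, `rows`, `columns`, `distinct` and so on) are purely hypothetical.

```python
import json
import random
import string

SPEC = json.loads("""
{
  "seed": 42,
  "rows": 5,
  "columns": [
    {"name": "id", "type": "int", "min": 0, "max": 1000},
    {"name": "price", "type": "float", "min": 0.0, "max": 100.0, "digits": 2},
    {"name": "label", "type": "string", "length": 8, "distinct": 3}
  ]
}
""")


def generate(spec):
    """Return rows for the spec; the same spec always yields the same rows."""
    rng = random.Random(spec["seed"])  # private RNG, independent of global state
    pools = {}
    for col in spec["columns"]:
        if col["type"] == "string":
            # A fixed pool controls how many distinct values the column holds.
            pools[col["name"]] = [
                "".join(rng.choices(string.ascii_lowercase, k=col["length"]))
                for _ in range(col["distinct"])
            ]
    rows = []
    for _ in range(spec["rows"]):
        row = []
        for col in spec["columns"]:
            if col["type"] == "int":
                row.append(rng.randint(col["min"], col["max"]))
            elif col["type"] == "float":
                row.append(round(rng.uniform(col["min"], col["max"]), col["digits"]))
            elif col["type"] == "string":
                row.append(rng.choice(pools[col["name"]]))
            else:
                raise ValueError(f"unsupported type: {col['type']}")
        rows.append(tuple(row))
    return rows


for generated_row in generate(SPEC):
    print(generated_row)
```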
Testing matrix: Data Integrity on Big data.
Columns: Data (no null), Compression, Null data, Lookup, Partitions, JDBC/ODBC
Rows (data size): 1 row, 1M rows, 10M rows, 100M rows, 1B rows, 10B rows, 100B rows, 1 trillion rows, 10 trillion rows
Challenges with Big Data testing.
● Data generation time → genUtil with JSON for user input
● Reproducibility → genUtil again.
● Disk space for testing data → peta scale product → peta scale testing data
● The result-comparison mechanism on big data itself requires a big data solution → SMJ (sort-merge join)
● Network bottlenecks on a scale out system...
● ORDER IS NOT GUARANTEED!
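Because order is not guaranteed, two result sets have to be compared as multisets. The slide mentions a sort-merge join for this; the sketch below shows a different and cheaper technique, an order-independent fingerprint, which is an assumption of this example rather than the method used in the project. It can report that two results differ, not where.

```python
import hashlib

MASK_64 = (1 << 64) - 1


def row_hash(row):
    """Stable 64-bit hash of one row (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def result_fingerprint(rows):
    """Order-independent (row count, checksum) over any iterable of rows."""
    count = 0
    checksum = 0
    for row in rows:
        count += 1
        # Addition is commutative, so row order does not matter, and unlike XOR
        # a duplicated row does not cancel itself out.
        checksum = (checksum + row_hash(row)) & MASK_64
    return count, checksum


a = [(1, "x"), (2, "y"), (3, "z")]
b = [(3, "z"), (1, "x"), (2, "y")]
print(result_fingerprint(a) == result_fingerprint(b))
```

Since the fingerprint is computed in one streaming pass, each side can be reduced to two numbers before anything crosses the network.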
Insights on the fly: User Defined Test Flow
● A series of queries in a specific order crashed the system
● Solution: automation infrastructure for a group of queries: User Defined Test Flow.
● Good for: pre- and post-conditions, custom testing, system-perspective
testing (partitions).
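A test flow of this kind can be pictured as an ordered list of statements, some of which carry a check. The following is a minimal sketch with SQLite as a stand-in engine; the `(sql, check)` tuple layout and the `run_flow` name are illustrative assumptions.

```python
import sqlite3


def run_flow(conn, steps):
    """Run (sql, check) pairs in order; return the index of the first failing step, or None."""
    for index, (sql, check) in enumerate(steps):
        rows = conn.execute(sql).fetchall()
        if check is not None and not check(rows):
            return index
    return None


flow = [
    ("CREATE TABLE t (id INTEGER)", None),
    ("INSERT INTO t VALUES (1), (2), (3)", None),
    ("SELECT count(*) FROM t", lambda rows: rows == [(3,)]),  # pre-condition
    ("DELETE FROM t WHERE id = 2", None),
    ("SELECT count(*) FROM t", lambda rows: rows == [(2,)]),  # post-condition
]
failed_step = run_flow(sqlite3.connect(":memory:"), flow)
print("flow passed" if failed_step is None else f"failed at step {failed_step}")
```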
The solution: some Metrics
1. Extra large Sanity Testing per commit (!)
2. Hourly testing on latest commit (24 X 7)
a. 800 queries
b. Zero data
3. Daily test:
a. 400,000 queries
b. Zero data for 90%, 10% testing up to 1B records.
4. Weekly testing:
a. Big data testing 10B and above.
5. Monthly testing:
a. Full regression per version.
The solution: Hardware perspective
● 50 x desktops: Optiplex 7040, i7, 4 cores, 32GB RAM, 1 x 4TB disk.
● 10 x high-end servers: Dell R730/R720, 20 cores, 128GB RAM, 16 x 1TB disks,
Nvidia K40
● DDN storage with 200TB+ GPFS
● Mellanox FDR switch for the storage
● 1GB switches for the desktop compute nodes.
Performance and Maintenance
Performance Challenges: app perspective
● Architecture bottlenecks [ count(*), group by, compression, sort Merge Join]
● What is IO pattern?
○ OLTP VS OLAP.
○ Columnar or Row based?
○ How big is your data?
○ READ % vs WRITE %.
○ Sequential? random?
○ Temporary VS permanent
○ Latency VS. throughput.
○ Multi threaded ? or single thread?
○ Power query? Or production?
○ Cold data? Hot data?
Performance Challenges: OPS perspective
● Metrics
○ What is Expected Total RAM required per Query?
○ OS RAM cache , CPU Cache hits ?
○ SWAP ? yes/no how much?
○ OS metrics - open files, stack size, realtime priority?
● Theoretical limits?
○ Disk type selection -limitation, expected throughput
○ Raid controller, RAID type? Configuration? caching?
○ File system used? DFS? PFS? Ext4? XFS?
○ Recommended file system block size? File system metadata?
○ Recommended File size on disk?
○ CPU selection - number of threads, cache level, hyper-threading, silicon
○ RAM selection - and placement on the chassis
○ Network - the difference between 40Gbit and 25Gbit. What is NIC bonding good for?
○ PCI Express 3, 16 lanes. PCI switch.
Maintenance challenges: OPS Perspective
How to Optimize your hardware selection ? ( Analytics on your testing data)
a. Compute intensive
i. GPU CORE intensive
ii. CPU cores intensive
1. Frequency
2. Number of threads for concurrency
b. IO intensive?
i. SSD?
ii. Nearline SATA?
iii. Partitions?
c. Hardware uniformity - use same hardware all over.
d. Get rid of weak links
Maintenance challenges: DevOps Perspective
● Lightweight - python code
○ Advantage → focus on generic skill set (python coding)
○ Disadvantage → reinventing the wheel
● Fault tolerance - Scale out topology
○ Cheap desktops → quick recovery from image when hardware crashes
■ Advantage
● Cheap, low risk, pay as you go.
● Gets 90 % of the job DONE!
■ Disadvantage
● footprint is hell.
● Management of desktops - requires creativity
● Rely heavily on automation feature: reproducibility from scratch
● Validation automation on infrastructure & Deployment mechanism.
Maintenance challenges: DevOps Perspective
● Infrastructure
○ Continuous integration: continuous build, continuous packaging, installer.
○ Continuous Deployment: pre flight check, remote machine (on site, private cloud, cloud)
○ Continuous testing: Sanity, Hourly, Daily, weekly
● Reporting and Monitoring:
○ Extensive report server and analytics on current testing.
○ Monitoring: Green/Red and system real time metrics.
Maintenance challenges: Innovation Perspective
1. Innovation on Data generation (using GPU to generate data)
2. Innovation on Competitor DB running time (hashing)
3. Innovation on Workload (one dispatcher, many compute nodes)
4. Innovation on saving testing data /compute time
a. Hashing
b. Caching
c. Smart data creation - no need to create a 100TB CSV to run a 100TB test.
d. Logs: Test results were managed on The DBMS itself :)
5. File system ZFS Cluster (utilizing unused disk space, redundancy, deduplication)
6. Nvidia TK1 /GRID GPU → Footprint!!!
7. “Virtualizing GPU” → Footprint!!!
a. Dockers
b. rCUDA
8. Amazon P2 GPU instances did not exist at that time (and they are still very expensive)
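Item 4c in the list above (no need to materialize a huge CSV) can be approximated by deriving each row from its index, so data is streamed on demand and any slice can be regenerated on any node. This is a sketch under that assumption; the talk does not describe how its generator actually worked, and the two-column row layout is invented.

```python
import hashlib


def synthetic_row(seed, index):
    """Row content depends only on (seed, index), never on earlier rows."""
    digest = hashlib.sha256(f"{seed}:{index}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:4], "big") % 1000
    return index, value


def synthetic_rows(seed, start, stop):
    """Lazily yield rows start..stop-1 without building them in memory or on disk."""
    for index in range(start, stop):
        yield synthetic_row(seed, index)


# Two workers can produce adjacent slices of the same logical data set.
first_half = list(synthetic_rows(7, 0, 5))
second_half = list(synthetic_rows(7, 5, 10))
print(first_half + second_half == list(synthetic_rows(7, 0, 10)))
```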
Building the team Challenge
● Engineering skills required
■ Hardware: Servers
■ IO (storage, disks, RAID, IO pattern)
■ FileSystems
■ Linux , CLI, CMD, installing, compiling, GIT
■ Basic network understanding
■ High Availability, Redundancy
■ HPC
● Coding: python.
● QA Big Data mindset
● DevOps mindset
● Automation Mindset
● Systematic Innovation methodologies
Lessons learned (insights)
● Very hard to guarantee 100% QA coverage
● Quality at a reasonable cost is a huge challenge
● Very easy to miss milestones with QA of big data
● Each mistake creates a huge ripple effect in timelines
● Main costs:
○ Employees : training them.
○ Running Time: Coding of innovative algorithms inside testing framework,
○ Maintenance: Automation, Monitoring, Analyzing test results, Environments setup
○ Hardware: innovative DevOps , innovative OPS, research
● Most of our innovation was spent on:
○ Doing more tests with less resources
○ agile improvement of backbone
○ Avoiding big data complexity problems in our testing!
● Industry tools ? great, but not always enough. Know their philosophy...
● Sometimes you need to get your hands dirty
● Keep a fine balance between business and technology
Take away message:
● Testing simplifications:
○ Ingress only
○ Egress only
○ Ingress while egress
○ Testing matrix - useful on complex big data systems
● The QA automation team
○ Coding/scripting skills are a MUST
○ Engineering skills are a MUST
○ DevOps understanding is a MUST
○ Allocating time for innovation is a MUST
○ Allocating time for cost reduction is a MUST
● The Automation infrastructure
○ KISS
○ MVP
○ IS A PRODUCT by itself. With its own Product manager and Architect
Special Thanks...
● Citi innovation Center
● Asaf - Birenzvieg Big Data & Data Science - Israel meetup
● PayPal Risk and Data solution Group
Stay in touch...
● Omid Vahdaty
● +972-54-2384178

Contenu connexe

Tendances

Everything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @TwitterEverything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @TwitterAttila Szegedi
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoNathaniel Braun
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Florian Lautenschlager
 
Efficient and Fast Time Series Storage - The missing link in dynamic software...
Efficient and Fast Time Series Storage - The missing link in dynamic software...Efficient and Fast Time Series Storage - The missing link in dynamic software...
Efficient and Fast Time Series Storage - The missing link in dynamic software...Florian Lautenschlager
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseFlorian Lautenschlager
 
Debugging Your Production JVM
Debugging Your Production JVMDebugging Your Production JVM
Debugging Your Production JVMkensipe
 
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...NoSQLmatters
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...NETWAYS
 
Garbage collection in JVM
Garbage collection in JVMGarbage collection in JVM
Garbage collection in JVMaragozin
 
Writing Applications for Scylla
Writing Applications for ScyllaWriting Applications for Scylla
Writing Applications for ScyllaScyllaDB
 
OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017HBaseCon
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLFlink Forward
 
Golang in TiDB (GopherChina 2017)
Golang in TiDB  (GopherChina 2017)Golang in TiDB  (GopherChina 2017)
Golang in TiDB (GopherChina 2017)PingCAP
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platformLars Albertsson
 
Indexing Complex PostgreSQL Data Types
Indexing Complex PostgreSQL Data TypesIndexing Complex PostgreSQL Data Types
Indexing Complex PostgreSQL Data TypesJonathan Katz
 

Tendances (15)

Everything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @TwitterEverything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @Twitter
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ Criteo
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017
 
Efficient and Fast Time Series Storage - The missing link in dynamic software...
Efficient and Fast Time Series Storage - The missing link in dynamic software...Efficient and Fast Time Series Storage - The missing link in dynamic software...
Efficient and Fast Time Series Storage - The missing link in dynamic software...
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series database
 
Debugging Your Production JVM
Debugging Your Production JVMDebugging Your Production JVM
Debugging Your Production JVM
 
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
Garbage collection in JVM
Garbage collection in JVMGarbage collection in JVM
Garbage collection in JVM
 
Writing Applications for Scylla
Writing Applications for ScyllaWriting Applications for Scylla
Writing Applications for Scylla
 
OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Golang in TiDB (GopherChina 2017)
Golang in TiDB  (GopherChina 2017)Golang in TiDB  (GopherChina 2017)
Golang in TiDB (GopherChina 2017)
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platform
 
Indexing Complex PostgreSQL Data Types
Indexing Complex PostgreSQL Data TypesIndexing Complex PostgreSQL Data Types
Indexing Complex PostgreSQL Data Types
 

En vedette

Designing of databases
Designing of databasesDesigning of databases
Designing of databasesAnsh Jhanji
 
Getting Started With QA Automation
Getting Started With QA AutomationGetting Started With QA Automation
Getting Started With QA AutomationGiovanni Scerra ☃
 
Introduction to DevOps
Introduction to DevOpsIntroduction to DevOps
Introduction to DevOpsOmid Vahdaty
 
Graph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsGraph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsStéphane Fréchette
 
Automation presentation
Automation presentationAutomation presentation
Automation presentationAKANSHA GURELE
 
Ppt on automation
Ppt on automation Ppt on automation
Ppt on automation harshaa
 

En vedette (9)

Designing of databases
Designing of databasesDesigning of databases
Designing of databases
 
Getting Started With QA Automation
Getting Started With QA AutomationGetting Started With QA Automation
Getting Started With QA Automation
 
Aws s3 security
Aws s3 securityAws s3 security
Aws s3 security
 
QA automation
QA automationQA automation
QA automation
 
Introduction to DevOps
Introduction to DevOpsIntroduction to DevOps
Introduction to DevOps
 
Graph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsGraph Databases for SQL Server Professionals
Graph Databases for SQL Server Professionals
 
Automation presentation
Automation presentationAutomation presentation
Automation presentation
 
BDD with Cucumber
BDD with CucumberBDD with Cucumber
BDD with Cucumber
 
Ppt on automation
Ppt on automation Ppt on automation
Ppt on automation
 

Similaire à Lessons learned from designing a QA Automation for analytics databases (big data)

Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsZhenxiao Luo
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeIdo Shilon
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned Omid Vahdaty
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Demi Ben-Ari
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseGruter
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseJihoon Son
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroDaniel Marcous
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!DataWorks Summit
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalJoachim Draeger
 
Benchmarks, performance, scalability, and capacity what s behind the numbers...
Benchmarks, performance, scalability, and capacity  what s behind the numbers...Benchmarks, performance, scalability, and capacity  what s behind the numbers...
Benchmarks, performance, scalability, and capacity what s behind the numbers...james tong
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysDemi Ben-Ari
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodbPGConf APAC
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
My benchmarks brings all the boys to the yard
My benchmarks brings all the boys to the yardMy benchmarks brings all the boys to the yard
My benchmarks brings all the boys to the yardIon Dormenco
 
Machine Learning for Capacity Management
 Machine Learning for Capacity Management Machine Learning for Capacity Management
Machine Learning for Capacity ManagementEDB
 

Similaire à Lessons learned from designing a QA Automation for analytics databases (big data) (20)

Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ Signal
 
A Kaggle Talk
A Kaggle TalkA Kaggle Talk
A Kaggle Talk
 
Benchmarks, performance, scalability, and capacity what s behind the numbers...
Benchmarks, performance, scalability, and capacity  what s behind the numbers...Benchmarks, performance, scalability, and capacity  what s behind the numbers...
Benchmarks, performance, scalability, and capacity what s behind the numbers...
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
My benchmarks brings all the boys to the yard
My benchmarks brings all the boys to the yardMy benchmarks brings all the boys to the yard
My benchmarks brings all the boys to the yard
 
Machine Learning for Capacity Management
 Machine Learning for Capacity Management Machine Learning for Capacity Management
Machine Learning for Capacity Management
 

Lessons learned from designing a QA Automation for analytics databases (big data)

  • 1. Lessons learned from designing a QA automation for analytics databases (Big Data) Omid Vahdaty, Big Data ninja.
  • 2. Agenda ● Introduction to Big Data Testing ● Methodological approach for QA Automation road map ● True Story ● The Solution? ● Challenges of Big Data Testing. ● Performance & Maintenance.
  • 3. Introduction to Big Data Testing
  • 4. ● 100M ? ● 1B ? ● 10B? ● 100B? ● 1Tr? How Big is Big Data?
  • 5. Scale OUT VS. Scale UP systems
  • 6. Big Data ? ● Structured (SQL, tabular, strings, numbers) ● Unstructured (logs, pictures, binary, json,blob, video etc)
  • 7. OLAP or OLTP? ● Simply put: ○ OLAP: one query running on a lot of data for reporting/analytics ○ OLTP: many queries running quickly in parallel applications (e.g. user login to a website, credentials are stored on a DB)
  • 8. Type of Big Data products ● Databases ○ OLAP, OLTP ○ SQL, NoSQL ○ In memory, disk based ○ Hadoop ecosystem, Spark ● Ecosystem tools ○ ETL tools ○ Visualization tools ● General applications ● Analytics-based products ○ Fraud ○ Cyber ○ Finance, etc.
  • 9. Expectation Matching for today’s lecture... ● OLAP Database ● Structured data only; synthetic data was used in testing. ● (not about QA of analytics-based products/solutions). ● (not about Hadoop or event streaming clusters). ● (not trying to sell anything) ● In a nutshell: ○ How to test an SQL-based database? ○ What sort of challenges were met?
  • 10. How Hard Can it be (aggregation)? ● Select Count(*) from t; ○ Assume ■ 1 Trillion records ■ ad-hoc query (no indexes) ■ Full Scan (no cache) ○ Challenges ■ Time it takes to compute? ■ IO bottleneck? What is IO pattern? ■ CPU bottleneck? ■ Scale up limitations?
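To get a feel for the "time it takes to compute" question, here is a rough back-of-envelope sketch. It is only an illustration: the 8 bytes read per row and the 2 GB/s sustained sequential read rate are assumptions chosen for the arithmetic, not figures from the talk, and `full_scan_seconds` is a hypothetical helper.

```python
def full_scan_seconds(rows: int, bytes_per_row: int, throughput_bytes_per_sec: float) -> float:
    """Lower bound on full-scan time when IO throughput is the only limit."""
    return rows * bytes_per_row / throughput_bytes_per_sec


ROWS = 10**12            # 1 trillion records, as on the slide
BYTES_PER_ROW = 8        # assumption: a single 8-byte column must be read
THROUGHPUT = 2 * 10**9   # assumption: 2 GB/s sustained sequential read

seconds = full_scan_seconds(ROWS, BYTES_PER_ROW, THROUGHPUT)
print(f"{seconds:.0f} s (~{seconds / 3600:.1f} h)")  # prints "4000 s (~1.1 h)"
```

Even under these generous assumptions a single uncached count takes about an hour on one IO path, which is why the IO pattern and the scale-up limits appear on the list of challenges.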
  • 11. How Hard Can it be (join)? ● Select * from t1 join t2 on t1.id = t2.id ○ Assume ■ 1 Trillion records in both tables. ■ ad-hoc query (no indexes) ■ Full Scan (no cache) ○ Challenges ■ Time it takes to compute? ■ IO bottleneck? What is IO pattern? ■ CPU bottleneck? ■ Scale out limitations?
  • 12. 3 Pain Points in Big Data DBMS Big Data System Ingress data rate Egress data rate Maintenance rate
  • 13. Big Data ingress challenges ● Parsing of Data type ○ Strings ○ Dates ○ Floats ● Rate of data coming into the system ● ACID: Atomicity, Consistency, Isolation, Durability ● Compression on the fly (non unique values) ● amount of data ● On the fly analytics ● Time constraints (target: x rows per sec/hour)
  • 14. Big Data egress challenges ● Sort ● Group by ● Reduce ● Join (sortMergeJoin) ● Data distribution ● Compression ● Theoretical Bottlenecks of hardware.
  • 15. Methodological approach for QA Automation road map
  • 16. Business constraints? ● Budget? ● Time to market? ● Resources available? ● Skill set available ?
  • 17. Product requirements? ● What are the product’s supported use cases? ● Supported Scale? ● Supported rate of ingress? ● Supported rate on egress? ● Complexity of Cluster? ● High availability?
  • 18. The Automation: Must have requirements? ● Scale out/up? ● Fault tolerance! ● Reproducibility from scratch! ● Reporting! ● “Views” to avoid duplication for testing? ● Orchestration of same environment per Developer! ● Debugging options ● versioning!!!
  • 19. The method ● Phase 0: get ready ○ Product requirements & business constraints ○ Automation requirements ○ Creating a testing matrix based on insight and product features. ● Phase 1: Get insights (some baselines) ○ Test Ingress separately, find your baseline. ○ Test Egress separately, find your baseline. ○ Test Ingress while Egress is in progress, find your baseline. ○ Stability and Performance ● Phase 2: Agile: Design an automation system that satisfies the requirements ○ Prioritize automation features based on your insights from phase 1. ○ Implement an automation infrastructure as soon as possible. ○ Update your testing matrix as you go (insight will keep coming) ○ Analyze the test results daily. ● Phase 3: Cost reduction ○ How to reduce compute time / IO pattern / network pattern / storage footprint ○ Maintenance time (build/package/deploy/monitor) ○ Hardware costs
  • 20. True Story Testing GPU based DBMS (Peta Scale)
  • 21. So why build a DBMS from scratch? ● Most compilers are designed for CPU → execution tree for CPU runtime engine → new compiler with GPU resources in mind. ● Most algorithms were written for CPU → same algorithm logic, redesigned for the parallelism philosophy of GPU → new runtime with GPU resources in mind → performance increase by an order of magnitude. ● VISION: fastest DBMS, cost effective, true scalability.
  • 22. True story ● Product: Big Data GPU-based DBMS for analytics [peta scale] ○ Structured data only, designed for OLAP use cases only ● Technical Constraints ○ How to test SQL syntax (infinite input)? SQL coverage from 0% to 20% to 90% to 100%? ○ What is the expected impact of big data? ○ How to debug performance issues? ○ Can you virtualize? Should you virtualize? ○ Running time: how long does it take to run all the tests? ● Company Challenges ○ Cost of hardware? High end 2U server ○ Expertise? Skill set? ○ Human resources: size of QA Automation team? ○ How do you manage automation? ○ What is your MVP? (Chicken and Egg) ○ TIME TO MARKET!!! The product needs to be released.
  • 23. The baseline concept ● Requirements ○ CSV, DDL, query ○ Competing DB ○ Your DB ○ Synthetic data used mostly. ○ (real data at peta scale is hard to come by) ● Steps ○ Insert the same data into the same table on both DBs ○ Run the same query on both DBs ○ Compare the results from both. ○ If equal → test pass.
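The baseline steps above can be sketched in a few lines. This is a minimal illustration, not the framework from the talk: `run_baseline_test` and `connect_sqlite` are hypothetical names, and two in-memory SQLite databases stand in for the competing DB and the DB under test.

```python
import sqlite3


def run_baseline_test(ddl, insert_sql, rows, query, connect_reference, connect_candidate):
    """Load the same data into both databases, run the same query, compare results."""
    results = []
    for connect in (connect_reference, connect_candidate):
        conn = connect()
        try:
            conn.execute(ddl)                     # same table on both DBs
            conn.executemany(insert_sql, rows)    # same data on both DBs
            results.append(conn.execute(query).fetchall())
        finally:
            conn.close()
    reference_rows, candidate_rows = results
    return reference_rows == candidate_rows       # equal -> test pass


def connect_sqlite():
    # Stand-in connection factory (assumption); a real setup would return
    # connections to the competing DB and to the DB under test.
    return sqlite3.connect(":memory:")


passed = run_baseline_test(
    ddl="CREATE TABLE t (id INTEGER, amount INTEGER)",
    insert_sql="INSERT INTO t VALUES (?, ?)",
    rows=[(1, 10), (2, 20), (3, 30)],
    query="SELECT COUNT(*), SUM(amount) FROM t",
    connect_reference=connect_sqlite,
    connect_candidate=connect_sqlite,
)
print("test pass" if passed else "test fail")
```

The query here returns a single aggregate row, so a plain equality check is enough; multi-row results need an order-insensitive comparison.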
  • 24. The Solution: SQL testing framework. ● Lightweight Python test suite per feature ○ DDL ○ Set of queries ○ JSON for data generation requirements ○ Data generation utilities (genUtil) ○ Test results report in CSV format. ○ Expected error mechanism. ○ Results compare mechanism ○ Command line arguments for advanced testing/config/tuning/views. ● Wrapper test suite ○ Group a set of test suites by logic, e.g. Daily ○ Aggregate reporting mechanism ● Scheduling, reporting, monitoring, deployment, alerting mechanism
  • 25. Testing concept of SQL syntax: a 3 x 3 matrix with rows Simple, Aggregation, Joins and columns Simple, Aggregation, Joins; every cell is marked x.
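The "expected error mechanism" is not described further on the slide. One plausible reading is a negative test that passes only when the query fails with the anticipated message. The sketch below assumes that reading, uses SQLite as a stand-in database, and `check_expected_error` is a hypothetical helper name.

```python
import sqlite3


def check_expected_error(conn, query, expected_fragment):
    """Negative test: pass only if the query fails with the expected message."""
    try:
        conn.execute(query)
    except sqlite3.Error as exc:
        return expected_fragment in str(exc)
    return False  # the query succeeded, so the expected error did not occur


conn = sqlite3.connect(":memory:")
print(check_expected_error(conn, "SELECT * FROM missing_table", "no such table"))  # True
print(check_expected_error(conn, "SELECT 1", "no such table"))                     # False
```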
  • 26. Challenges with SQL syntax testing ● (Binary data not supported in the product) ● Repetition of queries. ● Different data type names. ● Different data ranges per data types. ● Different accuracy per data types. ● The competition DB that supports big data is expensive... ● Accuracy of results. (different DB’s return different accuracy) ● SQL has some extreme cases (differ per vendor). ● Datetime format. ● Unsupported features. ● Duplication of testing. ● Very hard to predict which queries are useful (negative and positive testing).
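Because different databases return different accuracy for floating point results, an exact equality check between the two result sets tends to produce false failures. A minimal sketch of a tolerant comparison follows; the function names and the default relative tolerance of 1e-6 are assumptions for illustration, not values from the talk.

```python
import math

NUMERIC = (int, float)


def values_match(expected, actual, rel_tol=1e-6):
    """Compare two cell values; floats are compared with a relative tolerance."""
    if expected is None or actual is None:
        return expected is None and actual is None
    if isinstance(expected, float) or isinstance(actual, float):
        if isinstance(expected, NUMERIC) and isinstance(actual, NUMERIC):
            return math.isclose(expected, actual, rel_tol=rel_tol)
        return False  # a float on one side, a non-numeric value on the other
    return expected == actual  # integers, strings, etc. must match exactly


def rows_match(expected_row, actual_row, rel_tol=1e-6):
    """Compare two result rows cell by cell."""
    return len(expected_row) == len(actual_row) and all(
        values_match(e, a, rel_tol) for e, a in zip(expected_row, actual_row)
    )


print(rows_match((1, 0.1 + 0.2, "a"), (1, 0.3, "a")))  # True
print(rows_match((1, 0.3), (2, 0.3)))                  # False
```

Integers and strings are still compared exactly, so the tolerance only relaxes the check where vendor accuracy is known to differ.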
  • 27. Challenges with Small Data testing ● Generating Random data ○ Reproducible every time -->Strict data ranges per test (length, numeric range, format, accuracy) ○ What is the amount of unique values? Different histogram, different bottlenecks: ■ Unique values challenges: ● Per chunk? ● Per Column ■ Non unique values challenges: ● String? ○ Lengths ○ Compressible? ● Numeric? ○ Overflow? ○ Floating point?
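A minimal sketch of reproducible random data driven by a JSON spec. The spec layout (`seed`, `rows`, `columns`, `unique_values`) and the function `generate_rows` are made up for illustration; the actual genUtil format is not shown in the slides.

```python
import json
import random
import string

# Hypothetical spec: strict ranges per column, plus the number of unique
# string values, so the histogram of the data is controlled by the test.
SPEC_JSON = """
{
  "seed": 42,
  "rows": 5,
  "columns": [
    {"name": "id",    "type": "int",    "min": 0, "max": 1000},
    {"name": "label", "type": "string", "length": 6, "unique_values": 3}
  ]
}
"""


def generate_rows(spec):
    """Generate rows deterministically: the same spec always yields the same data."""
    rng = random.Random(spec["seed"])  # private, seeded generator
    pools = {}
    for col in spec["columns"]:
        if col["type"] == "string":
            # Pre-build a fixed pool of values to control cardinality.
            pools[col["name"]] = [
                "".join(rng.choices(string.ascii_lowercase, k=col["length"]))
                for _ in range(col["unique_values"])
            ]
    rows = []
    for _ in range(spec["rows"]):
        row = []
        for col in spec["columns"]:
            if col["type"] == "int":
                row.append(rng.randint(col["min"], col["max"]))
            elif col["type"] == "string":
                row.append(rng.choice(pools[col["name"]]))
            else:
                raise ValueError(f"unsupported column type: {col['type']}")
        rows.append(tuple(row))
    return rows


spec = json.loads(SPEC_JSON)
first = generate_rows(spec)
second = generate_rows(spec)
print(first == second)  # True: reproducible every time
```

Seeding a private `random.Random` instance, rather than the module-level generator, keeps the data independent of any other random calls made by the test suite.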
  • 28. Testing matrix: Data Integrity on Big data. Columns: Data (no null), Compression, Null Data, Lookup, Partitions, JDBC/ODBC. Rows: 1 row, 1M rows, 10M rows, 100M rows, 1B rows, 10B rows, 100B rows, 1 Trillion rows, 10 Trillion rows.
  • 29. Challenges with Big Data testing. ● Generation time of data → genUtil with JSON for user input ● Reproducibility → genUtil again. ● Disk space for testing data → peta scale product → peta scale testing data ● A result comparison mechanism on big data requires a big data solution → SMJ (sort-merge join) ● Network bottlenecks on a scale out system... ● ORDER IS NOT GUARANTEED!
  • 30. Insights on the fly: User Defined Test Flow ● A series of queries in a specific order crashed the system ● Solution: Automation infrastructure on a group of queries: User Defined Test Flow. ● Good for: pre and post conditions, custom testing, system perspective testing (partitions).
  • 31. The solution: some Metrics 1. Extra large Sanity Testing per commit (!) 2. Hourly testing on latest commit (24 x 7) a. 800 queries b. Zero data 3. Daily test: a. 400,000 queries b. Zero data for 90%, 10% testing up to 1B records. 4. Weekly testing: a. Big data testing 10B and above. 5. Monthly testing: a. Full regression per version.
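Since row order is not guaranteed, two result sets cannot be compared line by line. One option, shown here as an assumption rather than the method used in the talk, is an order-insensitive fingerprint: hash each row and combine the hashes with a commutative operation, so each database only has to produce a small value instead of a sorted result.

```python
import hashlib


def row_digest(row) -> int:
    """Stable 64-bit digest of one row (rows must have identical Python types)."""
    data = repr(tuple(row)).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def result_fingerprint(rows):
    """Order-insensitive fingerprint: row count plus sum of row digests mod 2**64."""
    total = 0
    count = 0
    for row in rows:
        # Addition is commutative, so order does not matter. Unlike XOR,
        # duplicate rows do not cancel each other out.
        total = (total + row_digest(row)) % 2**64
        count += 1
    return count, total


a = [(1, "x"), (2, "y"), (2, "y")]
b = [(2, "y"), (1, "x"), (2, "y")]   # same rows, different order
c = [(1, "x"), (2, "y")]             # one duplicate missing
print(result_fingerprint(a) == result_fingerprint(b))  # True
print(result_fingerprint(a) == result_fingerprint(c))  # False
```

A matching fingerprint is strong evidence rather than proof of equality, since hash collisions are possible in principle. The rows are consumed one at a time, so the comparison does not need to hold either result set in memory.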
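A user-defined test flow can be pictured as an ordered list of statements, some of which carry an expected result that acts as a pre or post condition. The sketch below is one hypothetical shape for such a flow, run against an in-memory SQLite database as a stand-in; `run_flow` is not a name from the talk.

```python
import sqlite3


def run_flow(conn, steps):
    """Run (sql, expected_rows) steps in order.

    expected_rows of None means "execute only, no check".
    Returns the index of the first failing step, or None if the flow passes.
    """
    for index, (sql, expected) in enumerate(steps):
        rows = conn.execute(sql).fetchall()
        if expected is not None and rows != expected:
            return index
    return None


flow = [
    ("CREATE TABLE t (id INTEGER)", None),
    ("SELECT COUNT(*) FROM t", [(0,)]),          # pre-condition: table is empty
    ("INSERT INTO t VALUES (1), (2)", None),
    ("SELECT COUNT(*) FROM t", [(2,)]),          # post-condition after insert
    ("DELETE FROM t", None),
    ("SELECT COUNT(*) FROM t", [(0,)]),          # post-condition after delete
]

conn = sqlite3.connect(":memory:")
failed_step = run_flow(conn, flow)
print("flow passed" if failed_step is None else f"flow failed at step {failed_step}")
```

Returning the index of the failing step makes it easier to see which position in the sequence triggered the problem, which matters when only a specific order of queries breaks the system.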
  • 32. The solution: Hardware perspective ● 50 x Desktops: Optiplex 7040, i7 4 cores, 32GB, 1 x 4TB. ● 10 x High End Servers: Dell R730/R720: 20 cores, 128GB, 16 x 1TB disks, Nvidia K40 ● DDN storage with 200TB+ GPFS ● Mellanox FDR switch for the storage ● 1GB switches for the desktop compute nodes.
  • 34. Performance Challenges: app perspective ● Architecture bottlenecks [count(*), group by, compression, sort merge join] ● What is IO pattern? ○ OLTP vs. OLAP. ○ Columnar or row based? ○ How big is your data? ○ READ % vs. WRITE %. ○ Sequential? Random? ○ Temporary vs. permanent ○ Latency vs. throughput. ○ Multi threaded or single thread? ○ Power query? Or production? ○ Cold data? Hot data?
  • 35. Performance Challenges: OPS perspective ● Metrics ○ What is the expected total RAM required per query? ○ OS RAM cache, CPU cache hits? ○ SWAP? Yes/no, how much? ○ OS metrics: open files, stack size, realtime priority? ● Theoretical limits? ○ Disk type selection: limitations, expected throughput ○ RAID controller, RAID type? Configuration? Caching? ○ File system used? DFS? PFS? Ext4? XFS? ○ Recommended file system block size? File system metadata? ○ Recommended file size on disk? ○ CPU selection: number of threads, cache level, hyper threading. Silicon ○ RAM selection and placement on chassis ○ Network: the difference between 40Gbit and 25Gbit. What is NIC bonding good for? ○ PCI Express 3, 16 lanes. PCI switch.
  • 36. Maintenance challenges: OPS Perspective How to optimize your hardware selection? (Analytics on your testing data) a. Compute intensive i. GPU core intensive ii. CPU core intensive 1. Frequency 2. Number of threads for concurrency b. IO intensive? i. SSD? ii. Nearline SATA? iii. Partitions? c. Hardware uniformity: use the same hardware all over. d. Get rid of weak links
  • 37. Maintenance challenges: DevOps Perspective ● Lightweight: Python code ○ Advantage → focus on generic skill set (Python coding) ○ Disadvantage → reinventing the wheel ● Fault tolerance: scale out topology ○ Cheap desktops → quick recovery from image when hardware crashes ■ Advantage → ● Cheap, low risk, pay as you go. ● Gets 90% of the job DONE! ■ Disadvantage ● Footprint is hell. ● Management of desktops requires creativity ● Relies heavily on an automation feature: reproducibility from scratch ● Validation automation on infrastructure & deployment mechanism.
  • 38. Maintenance challenges: DevOps Perspective ● Infrastructure ○ Continuous integration: continuous build, continuous packaging, installer. ○ Continuous Deployment: pre flight check, remote machine (on site, private cloud, cloud) ○ Continuous testing: Sanity, Hourly, Daily, weekly ● Reporting and Monitoring: ○ Extensive report server and analytics on current testing. ○ Monitoring: Green/Red and system real time metrics.
  • 39. Maintenance challenges: Innovation Perspective 1. Innovation on data generation (using GPU to generate data) 2. Innovation on competitor DB running time (hashing) 3. Innovation on workload (one dispatcher, many compute nodes) 4. Innovation on saving testing data / compute time a. Hashing b. Caching c. Smart data creation: no need to create a 100TB CSV to generate a 100TB test. d. Logs: test results were managed on the DBMS itself :) 5. File system ZFS cluster (utilizing unused disk space, redundancy, deduplication) 6. Nvidia TK1 / GRID GPU → Footprint!!! 7. “Virtualizing GPU” → Footprint!!! a. Dockers b. rCUDA 8. Amazon p2 GPU instances did not exist at that time (even so, still very expensive)
  • 40. Building the team Challenge ● Engineering skills required ■ Hardware: servers ■ IO (storage, disks, RAID, IO pattern) ■ File systems ■ Linux, CLI, CMD, installing, compiling, Git ■ Basic network understanding ■ High availability, redundancy ■ HPC ● Coding: Python. ● QA Big Data mindset ● DevOps mindset ● Automation mindset ● Systematic innovation methodologies
  • 41. Lessons learned (insights) ● Very hard to guarantee 100% QA coverage ● Quality at a reasonable cost is a huge challenge ● Very easy to miss milestones with QA of big data ● Each mistake creates a huge ripple effect in timelines ● Main costs: ○ Employees: training them. ○ Running time: coding of innovative algorithms inside the testing framework ○ Maintenance: automation, monitoring, analyzing test results, environment setup ○ Hardware: innovative DevOps, innovative OPS, research ● Most of our innovation was spent on: ○ Doing more tests with fewer resources ○ Agile improvement of the backbone ○ Avoiding big data complexity problems in our testing! ● Industry tools? Great, but not always enough. Know their philosophy... ● Sometimes you need to get your hands dirty ● Keep a fine balance between business and technology
  • 42. Take away message: ● Testing simplifications: ○ Ingress only ○ Egress only ○ Ingress while egress ○ Testing matrix: useful on complex big data systems ● The QA automation team ○ Coding/scripting skills are a MUST ○ Engineering skills are a MUST ○ DevOps understanding is a MUST ○ Allocating time for innovation is a MUST ○ Allocating time for cost reduction is a MUST ● The Automation infrastructure ○ KISS ○ MVP ○ IS A PRODUCT by itself, with its own product manager and architect
  • 43. Special Thanks... ● Citi innovation Center ● Asaf - Birenzvieg Big Data & Data Science - Israel meetup ● PayPal Risk and Data solution Group
  • 44. Stay in touch... ● Omid Vahdaty ● +972-54-2384178