SlideShare une entreprise Scribd logo
1  sur  20
Télécharger pour lire hors ligne
T. Rabl, M. Frank, H. Mousselly Sergieh, and H. Kosch
17.09.2010
Second TPC Technology Conference on Performance Evaluation &
Benchmarking


Data sizes grow beyond petabyte barrier
 Cloud computing to the rescue



Need for cloud-scale benchmarking
 Need for realistic, cloud-scale data sets



Problems:
 Generating petabytes
 Storing petabytes
 Transporting petabytes



Solution:
 Parallel, on-site generation
A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

2


3 tables with primary and foreign keys
Non-uniform distributions (lognormal)
Replicated data



How to generate consistent data in parallel?




A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

3


No references
 Only uncorrelated data
 Simple statistical references



Scanned references

 Read data to generate reference
 Generate tuple pairs



Computed references

 Data is generated deterministically
 Compute data to generate reference

Parallel Data Generation Framework
A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

4


Parallel pseudo random number generator
 Deterministic
 Fast skip ahead



Generation of values as a function
 Deterministic



Seeds + row id + generator allows
recalculation
 Every value can be computed independently
A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

5
seed

t_id

Table RNG

seed

c_id

Column RNG

seed

r_id

Row RNG

rn





Generator

user
row_id user_id degree_program …
1
2
3

Seeding strategy
All seeds can be cached
Generation of value in row n with n-th
random number
A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

6
seed

id

Table RNG

seed

id

Column RNG

seed

r_id

Lognormal

rn
seed
rn





Row RNG

r_id

seminar_user
row_id user_id degree_program …
1
2
3

Row RNG
Generator

Reference generation
Lognormal indexes row_id of user
Equal seeds and lognormal generator for user_id
A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

7



Java based
Plug-in concept
 Generators
 Distributions
 RNGs
 Output
 Scheduler



TPC-H plug-in
A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

8



XML file
Reflects SQL schema
 Tables
 Fields






Seed
Size
Scale factor
Output

<project name="simpleUserSeminar">
[..]
<table name=“seminar_user">
<size>201754</size>
<fields>
<field name="user_id">
<type>java.sql.Types.INTEGER</type>
<reference>
<referencedField>user_id</referencedField>
<referencedTable>user</referencedTable>
</reference>
<generator name="DefaultReferenceGenerator">
<distribution name="LogNormal">
<mu>7.60021</mu><sigma>1.40058</sigma>
</distribution>
</generator>
</field>
<field name="degree_program">
[..]

A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

9




Every value can be computed independently
<?xml version="1.0" encoding="UTF-8"?>
No communication
<nodeConfig>
<nodeNumber>1</nodeNumber>
Workload distribution
<nodeCount>2</nodeCount>
 Configuration for every node

<workers>2</workers>
</nodeConfig>

 Automatic distribution
seminar_user

Cluster/Cloud
Nodes
Processors

100000 rows

Row 1 – 50000

Row 1 –
25000

Row 25001 –
50000

Row 50001 – 100000

Row 50001 –
75000

A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

Row 75001 –
100000

10


16 node HPC cluster
 2 Intel Xeon QuadCore processors
 16GB RAM
 2 x 74GB HDD, RAID 0



SetQuery data set

 1 table “Bench”, 21 columns
 12 Random numbers, 8 strings



TPC-H data set

 8 tables, 61 columns
 Data types: integer, char, varchar, decimal, date
A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

11




SetQuery data set
100 GB data set (SF 460)
1 to 16 nodes
A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

12


TPC-H data set
 Data sizes: 1 GB, 10 GB, 100 GB



Single node (8 cores)
A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

13




Parallel data generation framework
Linear speed up
Fast generation
 Large data sets
 Realistic data




Independent generation of tables / colums /
values
Easy configuration and extension
A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

14


More implementation







generators, distributions
Benchmarks
Graphical user interface
Scheduler

Query generator

 Consistent inserts, updates, deletes
 Precomputed query results
 Time series
A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

15


SetQuery data set
 Data sizes: 220 MB, 2.2 GB, 10 GB, 22 GB, 100 GB



Single node

A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

17


TPC-H data set
 Data sizes: 1 GB, 10 GB, 100 GB, 1TB



1, 10, 16 nodes
A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

18
A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

19
A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

20

Contenu connexe

Dernier

React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - AvrilIvanti
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 

Dernier (20)

React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - Avril
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 

PDGF: A Data Generator for Cloud-Scale Benchmarking

  • 1. T. Rabl, M. Frank, H. Mousselly Sergieh, and H. Kosch 17.09.2010 Second TPC Technology Conference on Performance Evaluation & Benchmarking
  • 2.  Data sizes grow beyond petabyte barrier  Cloud computing to the rescue  Need for cloud-scale benchmarking  Need for realistic, cloud-scale data sets  Problems:  Generating petabytes  Storing petabytes  Transporting petabytes  Solution:  Parallel, on-site generation A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 2
  • 3.  3 tables with primary and foreign keys Non-uniform distributions (lognormal) Replicated data  How to generate consistent data in parallel?   A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 3
  • 4.  No references  Only uncorrelated data  Simple statistical references  Scanned references  Read data to generate reference  Generate tuple pairs  Computed references  Data is generated deterministically  Compute data to generate reference Parallel Data Generation Framework A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 4
  • 5.  Parallel pseudo random number generator  Deterministic  Fast skip ahead  Generation of values as a function  Deterministic  Seeds + row id + generator allows recalculation  Every value can be computed independently A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 5
  • 6. seed t_id Table RNG seed c_id Column RNG seed r_id Row RNG rn    Generator user row_id user_id degree_program … 1 2 3 Seeding strategy All seeds can be cached Generation of value in row n with n-th random number A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 6
  • 7. seed id Table RNG seed id Column RNG seed r_id Lognormal rn seed rn    Row RNG r_id seminar_user row_id user_id degree_program … 1 2 3 Row RNG Generator Reference generation Lognormal indexes row_id of user Equal seeds and lognormal generator for user_id A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 7
  • 8.   Java based Plug-in concept  Generators  Distributions  RNGs  Output  Scheduler  TPC-H plug-in A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 8
  • 9.   XML file Reflects SQL schema  Tables  Fields     Seed Size Scale factor Output <project name="simpleUserSeminar"> [..] <table name=“seminar_user"> <size>201754</size> <fields> <field name="user_id"> <type>java.sql.Types.INTEGER</type> <reference> <referencedField>user_id</referencedField> <referencedTable>user</referencedTable> </reference> <generator name="DefaultReferenceGenerator"> <distribution name="LogNormal"> <mu>7.60021</mu><sigma>1.40058</sigma> </distribution> </generator> </field> <field name="degree_program"> [..] A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 9
  • 10.    Every value can be computed independently <?xml version="1.0" encoding="UTF-8"?> No communication <nodeConfig> <nodeNumber>1</nodeNumber> Workload distribution <nodeCount>2</nodeCount>  Configuration for every node <workers>2</workers> </nodeConfig>  Automatic distribution seminar_user Cluster/Cloud Nodes Processors 100000 rows Row 1 – 50000 Row 1 – 25000 Row 25001 – 50000 Row 50001 – 100000 Row 50001 – 75000 A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl Row 75001 – 100000 10
  • 11.  16 node HPC cluster  2 Intel Xeon QuadCore processors  16GB RAM  2 x 74GB HDD, RAID 0  SetQuery data set  1 table “Bench”, 21 columns  12 Random numbers, 8 strings  TPC-H data set  8 tables, 61 columns  Data types: integer, char, varchar, decimal, date A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 11
  • 12.    SetQuery data set 100 GB data set (SF 460) 1 to 16 nodes A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 12
  • 13.  TPC-H data set  Data sizes: 1 GB, 10 GB, 100 GB  Single node (8 cores) A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 13
  • 14.    Parallel data generation framework Linear speed up Fast generation  Large data sets  Realistic data   Independent generation of tables / colums / values Easy configuration and extension A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 14
  • 15.  More implementation      generators, distributions Benchmarks Graphical user interface Scheduler Query generator  Consistent inserts, updates, deletes  Precomputed query results  Time series A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 15
  • 16.
  • 17.  SetQuery data set  Data sizes: 220 MB, 2.2 GB, 10 GB, 22 GB, 100 GB  Single node A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 17
  • 18.  TPC-H data set  Data sizes: 1 GB, 10 GB, 100 GB, 1TB  1, 10, 16 nodes A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 18
  • 19. A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 19
  • 20. A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 20