SlideShare une entreprise Scribd logo
1  sur  26
DWH-Ahsan AbdullahDWH-Ahsan Abdullah
11
Data WarehousingData Warehousing
Lecture-26Lecture-26
Need for Speed:Need for Speed:
Conventional Indexing TechniquesConventional Indexing Techniques
Virtual University of PakistanVirtual University of Pakistan
Ahsan Abdullah
Assoc. Prof. & Head
Center for Agro-Informatics Research
www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
Email: ahsan1010@yahoo.com
DWH-Ahsan Abdullah
2
Need For Indexing: SpeedNeed For Indexing: Speed
Consider searching your hard disk using the Windows
SEARCH command.
 Search goes into directory hierarchies.
 Takes about a minute, and there are only a few thousand files.
Assume a fast processor and (even more importantly) a fast
hard disk.
 Assume file size to be 5 KB.
 Assume hard disk scan rate of a million files per second.
 Resulting in scan rate of 5 GB per second.
Largest search engine indexes more than 8 billion pages
 At above scan rate 1,600 seconds required to scan ALL pages.
 This is just for one user!
 No one is going to wait for 26 minutes, not even 26 seconds.
Hence, a sequential scan is simply not feasible.
No text goes to graphics
DWH-Ahsan Abdullah
3
Need For Indexing: Query ComplexityNeed For Indexing: Query Complexity
 How many customers do I have in Karachi?
 How many customers in Karachi made calls during
April?
 How many customers in Karachi made calls to
Multan during April?
 How many customers in Karachi made calls to
Multan during April using a particular calling
package?
DWH-Ahsan Abdullah
4
Need For Indexing: I/O BottleneckNeed For Indexing: I/O Bottleneck
 Throwing hardware just speeds up the CPU
intensive tasks.
 The problem is of I/O, which does not scales up
easily.
 Putting the entire table in RAM is very very
expensive.
 Therefore, index!
No text goes to graphics
DWH-Ahsan Abdullah
5
Indexing ConceptIndexing Concept
 Purely physical concept, nothing to do with logical model.Purely physical concept, nothing to do with logical model.
 Invisible to the end user (programmer), optimizer choosesInvisible to the end user (programmer), optimizer chooses
it, effects only the speed, not the answer.it, effects only the speed, not the answer.
 With the library analogy, the time complexity to find aWith the library analogy, the time complexity to find a
book? The average time takenbook? The average time taken
 Using a card catalog organized in many different ways i.e.Using a card catalog organized in many different ways i.e.
author, topic, title etc and is sorted.author, topic, title etc and is sorted.
 A little bit of extra time to first check the catalog, but itA little bit of extra time to first check the catalog, but it
“gives” a pointer to the shelf and the row where book is“gives” a pointer to the shelf and the row where book is
located.located.
 The catalog has no data about the book, just an efficientThe catalog has no data about the book, just an efficient
way of searching.way of searching.
No text goes to graphics
DWH-Ahsan Abdullah
6
Indexing GoalIndexing Goal
Look at as few blocks as
possible to find the
matching record(s)
DWH-Ahsan Abdullah
7
Conventional indexing TechniquesConventional indexing Techniques
 DenseDense
 SparseSparse
 Multi-level (or B-Tree)Multi-level (or B-Tree)
 Primary Index vs. Secondary IndexesPrimary Index vs. Secondary Indexes
DWH-Ahsan Abdullah
8
Dense Index
10
20
30
40
50
60
70
80
90
100
110
120
Data File
20
10
40
30
60
50
80
70
100
90
Every key in the data
file is represented in
the index file
Dense Index: ConceptDense Index: Concept
DWH-Ahsan Abdullah
9
Dense Index: Adv & Dis AdvDense Index: Adv & Dis Adv
 Advantage:Advantage:
 A dense index, if fits in the memory, is veryA dense index, if fits in the memory, is very
efficient in locating a record given a keyefficient in locating a record given a key
 Disadvantage:Disadvantage:
 A dense index, if too big and doesn’t fit into theA dense index, if too big and doesn’t fit into the
memory, will be expensive when used to find amemory, will be expensive when used to find a
record given its keyrecord given its key
No text goes to graphics
DWH-Ahsan Abdullah
10
Sparse Index
10
30
50
70
90
110
130
150
170
190
210
230
Data File
20
10
40
30
60
50
80
70
100
90
Normally keepsNormally keeps
only one key peronly one key per
data blockdata block
Some keys in theSome keys in the
data file will notdata file will not
have an entry inhave an entry in
the index filethe index file
Sparse Index: ConceptSparse Index: Concept
DWH-Ahsan Abdullah
11
Sparse Index: Adv & Dis AdvSparse Index: Adv & Dis Adv
 Advantage:Advantage:
 A sparse index uses less space at the expense ofA sparse index uses less space at the expense of
somewhat more time to find a record given itssomewhat more time to find a record given its
keykey
 Support multi-level indexing structureSupport multi-level indexing structure
 Disadvantage:Disadvantage:
 Locating a record given a key has differentLocating a record given a key has different
performance for different key valuesperformance for different key values
No text goes to graphics
DWH-Ahsan Abdullah
12
Sparse 2nd level
10
90
170
250
330
410
490
570
Data File
20
10
40
30
60
50
80
70
100
90
10
30
50
70
90
110
130
150
170
190
210
230
Sparse Index: Multi levelSparse Index: Multi level
DWH-Ahsan Abdullah
13
B-tree Indexing: ConceptB-tree Indexing: Concept
 Can be seen as a general form of multi-levelCan be seen as a general form of multi-level
indexes.indexes.
 Generalize usual (binary) search trees (BST).Generalize usual (binary) search trees (BST).
 Allow efficient and fast exploration at the expense ofAllow efficient and fast exploration at the expense of
using slightly more space.using slightly more space.
 Popular variant: BPopular variant: B++
-tree-tree
 Support more efficiently queries like:Support more efficiently queries like:
SELECT * FROM R WHERE a = 11SELECT * FROM R WHERE a = 11
 SELECT * FROM R WHERE 0<= b and b<42SELECT * FROM R WHERE 0<= b and b<42
DWH-Ahsan Abdullah
14
200
220
250
280
130
B-tree Indexing: ExampleB-tree Indexing: Example
Each node stored in one disk block
RIDlist
9
20
100
140
145
200
210
215
220
230
250
256
279
280
300
Looking for Empno 250
DWH-Ahsan Abdullah
15
B-tree Indexing: LimitationsB-tree Indexing: Limitations
 If a table is large and there are fewer unique values.
 Capitalization is not programmatically enforced
(meaning case-sensitivity does matter and
“FLASHMAN" is different from “Flashman").
 Outcome varies with inter-character spaces.
 A noun spelled differently will result in different
results.
 Insertion can be very expensive.
Nothing will go to graphics
DWH-Ahsan Abdullah
16
B-tree Indexing: Limitations ExampleB-tree Indexing: Limitations Example
Given that MOHAMMED is the most common first name in Pakistan,
a 5-million row Customers table would produce many screens of
matching rows for MOHAMMED AHMAD, yet would skip potential
matching values such as the following:
VALUE MISSED REASON MISSED
Mohammed Ahmad Case sensitive
MOHAMMED AHMED AHMED versus AHMAD
MOHAMMED AHMAD Extra space between names
MOHAMMED AHMAD DR DR after AHMAD
MOHAMMAD AHMAD Alternative spelling of MOHAMMAD
DWH-Ahsan Abdullah
17
Hash Based IndexingHash Based Indexing
 You may recall that in internal memory, hashingYou may recall that in internal memory, hashing
can be used to quickly locate a specific key.can be used to quickly locate a specific key.
 The same technique can be used on externalThe same technique can be used on external
memory.memory.
 However, advantage over search trees is smaller inHowever, advantage over search trees is smaller in
external search than internal.external search than internal. WHY?WHY?
 Because part of search tree can be brought intoBecause part of search tree can be brought into
the main memory.the main memory.
DWH-Ahsan Abdullah
18
Hash Based Indexing: ConceptHash Based Indexing: Concept
In contrast to B-tree indexing, hash based indexes do notIn contrast to B-tree indexing, hash based indexes do not
(typically) keep index values in sorted order.(typically) keep index values in sorted order.
 Index entry is found by hashing on index value requiringIndex entry is found by hashing on index value requiring
exact match.exact match.
SELECT * FROM Customers WHERE AccttNo= 110240SELECT * FROM Customers WHERE AccttNo= 110240
 Index entries kept in hash organized tables rather than B-Index entries kept in hash organized tables rather than B-
tree structures.tree structures.
 Index entry contains ROWID values for each rowIndex entry contains ROWID values for each row
corresponding to the index value.corresponding to the index value.
 Remember few numbers in real-life to be useful for hashing.Remember few numbers in real-life to be useful for hashing.
DWH-Ahsan Abdullah
19
.
records
.
key → h(key) disk block
Note on terminology:
The word "indexing" is often used
synonymously with "B-tree indexing".
Hashing as Primary IndexHashing as Primary Index
DWH-Ahsan Abdullah
20
key → h(key)
Index
recordkey
Can always be transformed to a secondary index using
indirection, as above.
Indexing the Index
Hashing as Secondary IndexHashing as Secondary Index
DWH-Ahsan Abdullah
21
 Indexing (using B-trees) good for rangeIndexing (using B-trees) good for range
searches, e.g.:searches, e.g.:
SELECT * FROM R WHERE A > 5SELECT * FROM R WHERE A > 5
 Hashing good for match based searches,Hashing good for match based searches,
e.g.:e.g.:
SELECT * FROM R WHERE A = 5SELECT * FROM R WHERE A = 5
B-tree vs. Hash IndexesB-tree vs. Hash Indexes
DWH-Ahsan Abdullah
22
Primary Key vs. Primary IndexPrimary Key vs. Primary Index
Relation Students
Name ID dept
AHMAD 123 CS
Akram 567 EE
Numan 999 CS
 Primary Key & Primary Index:Primary Key & Primary Index:
 PK is ALWAYS unique.PK is ALWAYS unique.
 PI can be unique, but does not have to be.PI can be unique, but does not have to be.
 In DSS environment, very few queries are PK based.In DSS environment, very few queries are PK based.
DWH-Ahsan Abdullah
23
 Primary index selection criteria:Primary index selection criteria:
 Common join and retrieval key.Common join and retrieval key.
 Can be unique UPI or non-unique NUPI.Can be unique UPI or non-unique NUPI.
 Limits on NUPI.Limits on NUPI.
 Only one primary index per table (for hash-basedOnly one primary index per table (for hash-based
file system).file system).
Primary Indexing: CriterionPrimary Indexing: Criterion
DWH-Ahsan Abdullah
24
Primary Indexing Criteria: ExamplePrimary Indexing Criteria: Example
What should be the primary index of the call table for aWhat should be the primary index of the call table for a
large telecom company?large telecom company?
call_id decimal (15,0) NOT NULL
caller_no decimal (10,0) NOT NULL
call_duration decimal (15,2) NOT NULL
call_dt date NOT NULL
called_no decimal (15,0) NOT NULL
Call TableCall Table
No simple answer!!No simple answer!!
DWH-Ahsan Abdullah
25
 Almost all joins and retrievals will occur throughAlmost all joins and retrievals will occur through
the caller _no foreign key.the caller _no foreign key.
 Use caller_no as a NUPI.Use caller_no as a NUPI.
 In case of non uniform distribution on caller_noIn case of non uniform distribution on caller_no
oror
 if phone number have very large number ofif phone number have very large number of
outgoing calls (e.g., an institutional number couldoutgoing calls (e.g., an institutional number could
easily have several thousand calls).easily have several thousand calls).
 Use call_id as UPI for good data distribution.Use call_id as UPI for good data distribution.
Primary IndexingPrimary Indexing
DWH-Ahsan Abdullah
26
For a hash-based file system, primary index isFor a hash-based file system, primary index is
free!free!
 No storage cost.No storage cost.
 No index build required.No index build required.
OLTP databases use a page-based file systemOLTP databases use a page-based file system
and therefore do not deliver this performanceand therefore do not deliver this performance
advantage.advantage.
Primary IndexingPrimary Indexing

Contenu connexe

Tendances

12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS
koolkampus
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
Mr SMAK
 
Insertion sort
Insertion sortInsertion sort
Insertion sort
MYER301
 

Tendances (20)

Dynamic Itemset Counting
Dynamic Itemset CountingDynamic Itemset Counting
Dynamic Itemset Counting
 
Hashing
HashingHashing
Hashing
 
Hashing PPT
Hashing PPTHashing PPT
Hashing PPT
 
Searching techniques in Data Structure And Algorithm
Searching techniques in Data Structure And AlgorithmSearching techniques in Data Structure And Algorithm
Searching techniques in Data Structure And Algorithm
 
Graph Traversal Algorithm
Graph Traversal AlgorithmGraph Traversal Algorithm
Graph Traversal Algorithm
 
12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS
 
Data Structures & Recursion-Introduction.pdf
Data Structures & Recursion-Introduction.pdfData Structures & Recursion-Introduction.pdf
Data Structures & Recursion-Introduction.pdf
 
Dependencies
DependenciesDependencies
Dependencies
 
Join operation
Join operationJoin operation
Join operation
 
Searching Techniques and Analysis
Searching Techniques and AnalysisSearching Techniques and Analysis
Searching Techniques and Analysis
 
Hashing
HashingHashing
Hashing
 
Greedy Algorithm - Huffman coding
Greedy Algorithm - Huffman codingGreedy Algorithm - Huffman coding
Greedy Algorithm - Huffman coding
 
Data Structures : hashing (1)
Data Structures : hashing (1)Data Structures : hashing (1)
Data Structures : hashing (1)
 
4.4 hashing
4.4 hashing4.4 hashing
4.4 hashing
 
AVL Tree
AVL TreeAVL Tree
AVL Tree
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
 
Extensible hashing
Extensible hashingExtensible hashing
Extensible hashing
 
Design and Analysis of Algorithm ppt for unit one
Design and Analysis of Algorithm ppt for unit oneDesign and Analysis of Algorithm ppt for unit one
Design and Analysis of Algorithm ppt for unit one
 
Tree and graph
Tree and graphTree and graph
Tree and graph
 
Insertion sort
Insertion sortInsertion sort
Insertion sort
 

En vedette (20)

Lecture 18
Lecture 18Lecture 18
Lecture 18
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
 
Lecture 21
Lecture 21Lecture 21
Lecture 21
 
Lecture 31
Lecture 31Lecture 31
Lecture 31
 
Lecture 4
Lecture 4Lecture 4
Lecture 4
 
Lecture 17
Lecture 17Lecture 17
Lecture 17
 
Lecture 27
Lecture 27Lecture 27
Lecture 27
 
Lecture 5
Lecture 5Lecture 5
Lecture 5
 
Lecture 32
Lecture 32Lecture 32
Lecture 32
 
Lecture 20
Lecture 20Lecture 20
Lecture 20
 
Lecture 30
Lecture 30Lecture 30
Lecture 30
 
Lecture 35
Lecture 35Lecture 35
Lecture 35
 
Lecture 23
Lecture 23Lecture 23
Lecture 23
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
 
Lecture 19
Lecture 19Lecture 19
Lecture 19
 
Lecture 16
Lecture 16Lecture 16
Lecture 16
 
Lecture 38
Lecture 38Lecture 38
Lecture 38
 
Lecture 40
Lecture 40Lecture 40
Lecture 40
 
Lecture 34
Lecture 34Lecture 34
Lecture 34
 
Lecture 39
Lecture 39Lecture 39
Lecture 39
 

Similaire à Lecture 26

OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AGOLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
Lucidworks
 

Similaire à Lecture 26 (20)

Apache kudu
Apache kuduApache kudu
Apache kudu
 
Intro to Data warehousing lecture 11
Intro to Data warehousing   lecture 11Intro to Data warehousing   lecture 11
Intro to Data warehousing lecture 11
 
Intro to Data warehousing lecture 14
Intro to Data warehousing   lecture 14Intro to Data warehousing   lecture 14
Intro to Data warehousing lecture 14
 
Intro to Data warehousing lecture 19
Intro to Data warehousing   lecture 19Intro to Data warehousing   lecture 19
Intro to Data warehousing lecture 19
 
Cs437 lecture 14_15
Cs437 lecture 14_15Cs437 lecture 14_15
Cs437 lecture 14_15
 
Time series database by Harshil Ambagade
Time series database by Harshil AmbagadeTime series database by Harshil Ambagade
Time series database by Harshil Ambagade
 
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AGOLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
 
Apache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketingApache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketing
 
Hypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.comHypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.com
 
Hadoop Training in Hyderabad | Online Training
Hadoop Training in Hyderabad | Online TrainingHadoop Training in Hyderabad | Online Training
Hadoop Training in Hyderabad | Online Training
 
HBase introduction in azure
HBase introduction in azureHBase introduction in azure
HBase introduction in azure
 
The ABC of Big Data
The ABC of Big DataThe ABC of Big Data
The ABC of Big Data
 
Introduction to Sql on Hadoop
Introduction to Sql on HadoopIntroduction to Sql on Hadoop
Introduction to Sql on Hadoop
 
Indexing and hashing
Indexing and hashingIndexing and hashing
Indexing and hashing
 
Lec 1 indexing and hashing
Lec 1 indexing and hashing Lec 1 indexing and hashing
Lec 1 indexing and hashing
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Oslo baksia2014
Oslo baksia2014Oslo baksia2014
Oslo baksia2014
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in Azure
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
Ardbms
ArdbmsArdbms
Ardbms
 

Plus de Shani729 (20)

Python tutorialfeb152012
Python tutorialfeb152012Python tutorialfeb152012
Python tutorialfeb152012
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
 
Interaction design _beyond_human_computer_interaction
Interaction design _beyond_human_computer_interactionInteraction design _beyond_human_computer_interaction
Interaction design _beyond_human_computer_interaction
 
Fm lecturer 13(final)
Fm lecturer 13(final)Fm lecturer 13(final)
Fm lecturer 13(final)
 
Lecture slides week14-15
Lecture slides week14-15Lecture slides week14-15
Lecture slides week14-15
 
Frequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodFrequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth method
 
Dwh lecture slides-week15
Dwh lecture slides-week15Dwh lecture slides-week15
Dwh lecture slides-week15
 
Dwh lecture slides-week10
Dwh lecture slides-week10Dwh lecture slides-week10
Dwh lecture slides-week10
 
Dwh lecture slidesweek7&8
Dwh lecture slidesweek7&8Dwh lecture slidesweek7&8
Dwh lecture slidesweek7&8
 
Dwh lecture slides-week5&6
Dwh lecture slides-week5&6Dwh lecture slides-week5&6
Dwh lecture slides-week5&6
 
Dwh lecture slides-week3&4
Dwh lecture slides-week3&4Dwh lecture slides-week3&4
Dwh lecture slides-week3&4
 
Dwh lecture slides-week2
Dwh lecture slides-week2Dwh lecture slides-week2
Dwh lecture slides-week2
 
Dwh lecture slides-week1
Dwh lecture slides-week1Dwh lecture slides-week1
Dwh lecture slides-week1
 
Dwh lecture slides-week 13
Dwh lecture slides-week 13Dwh lecture slides-week 13
Dwh lecture slides-week 13
 
Dwh lecture slides-week 12&13
Dwh lecture slides-week 12&13Dwh lecture slides-week 12&13
Dwh lecture slides-week 12&13
 
Data warehousing and mining furc
Data warehousing and mining furcData warehousing and mining furc
Data warehousing and mining furc
 
Lecture 37
Lecture 37Lecture 37
Lecture 37
 
Lecture 36
Lecture 36Lecture 36
Lecture 36
 
Lecture 33
Lecture 33Lecture 33
Lecture 33
 
Lecture 29
Lecture 29Lecture 29
Lecture 29
 

Dernier

result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
Tonystark477637
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
rknatarajan
 

Dernier (20)

Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spain
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 

Lecture 26

  • 1. DWH-Ahsan AbdullahDWH-Ahsan Abdullah 11 Data WarehousingData Warehousing Lecture-26Lecture-26 Need for Speed:Need for Speed: Conventional Indexing TechniquesConventional Indexing Techniques Virtual University of PakistanVirtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research www.nu.edu.pk/cairindex.asp National University of Computers & Emerging Sciences, Islamabad Email: ahsan1010@yahoo.com
  • 2. DWH-Ahsan Abdullah 2 Need For Indexing: SpeedNeed For Indexing: Speed Consider searching your hard disk using the Windows SEARCH command.  Search goes into directory hierarchies.  Takes about a minute, and there are only a few thousand files. Assume a fast processor and (even more importantly) a fast hard disk.  Assume file size to be 5 KB.  Assume hard disk scan rate of a million files per second.  Resulting in scan rate of 5 GB per second. Largest search engine indexes more than 8 billion pages  At above scan rate 1,600 seconds required to scan ALL pages.  This is just for one user!  No one is going to wait for 26 minutes, not even 26 seconds. Hence, a sequential scan is simply not feasible. No text goes to graphics
  • 3. DWH-Ahsan Abdullah 3 Need For Indexing: Query ComplexityNeed For Indexing: Query Complexity  How many customers do I have in Karachi?  How many customers in Karachi made calls during April?  How many customers in Karachi made calls to Multan during April?  How many customers in Karachi made calls to Multan during April using a particular calling package?
  • 4. DWH-Ahsan Abdullah 4 Need For Indexing: I/O BottleneckNeed For Indexing: I/O Bottleneck  Throwing hardware just speeds up the CPU intensive tasks.  The problem is of I/O, which does not scales up easily.  Putting the entire table in RAM is very very expensive.  Therefore, index! No text goes to graphics
  • 5. DWH-Ahsan Abdullah 5 Indexing ConceptIndexing Concept  Purely physical concept, nothing to do with logical model.Purely physical concept, nothing to do with logical model.  Invisible to the end user (programmer), optimizer choosesInvisible to the end user (programmer), optimizer chooses it, effects only the speed, not the answer.it, effects only the speed, not the answer.  With the library analogy, the time complexity to find aWith the library analogy, the time complexity to find a book? The average time takenbook? The average time taken  Using a card catalog organized in many different ways i.e.Using a card catalog organized in many different ways i.e. author, topic, title etc and is sorted.author, topic, title etc and is sorted.  A little bit of extra time to first check the catalog, but itA little bit of extra time to first check the catalog, but it “gives” a pointer to the shelf and the row where book is“gives” a pointer to the shelf and the row where book is located.located.  The catalog has no data about the book, just an efficientThe catalog has no data about the book, just an efficient way of searching.way of searching. No text goes to graphics
  • 6. DWH-Ahsan Abdullah 6 Indexing GoalIndexing Goal Look at as few blocks as possible to find the matching record(s)
  • 7. DWH-Ahsan Abdullah 7 Conventional indexing TechniquesConventional indexing Techniques  DenseDense  SparseSparse  Multi-level (or B-Tree)Multi-level (or B-Tree)  Primary Index vs. Secondary IndexesPrimary Index vs. Secondary Indexes
  • 8. DWH-Ahsan Abdullah 8 Dense Index 10 20 30 40 50 60 70 80 90 100 110 120 Data File 20 10 40 30 60 50 80 70 100 90 Every key in the data file is represented in the index file Dense Index: ConceptDense Index: Concept
  • 9. DWH-Ahsan Abdullah 9 Dense Index: Adv & Dis AdvDense Index: Adv & Dis Adv  Advantage:Advantage:  A dense index, if fits in the memory, is veryA dense index, if fits in the memory, is very efficient in locating a record given a keyefficient in locating a record given a key  Disadvantage:Disadvantage:  A dense index, if too big and doesn’t fit into theA dense index, if too big and doesn’t fit into the memory, will be expensive when used to find amemory, will be expensive when used to find a record given its keyrecord given its key No text goes to graphics
  • 10. DWH-Ahsan Abdullah 10 Sparse Index 10 30 50 70 90 110 130 150 170 190 210 230 Data File 20 10 40 30 60 50 80 70 100 90 Normally keepsNormally keeps only one key peronly one key per data blockdata block Some keys in theSome keys in the data file will notdata file will not have an entry inhave an entry in the index filethe index file Sparse Index: ConceptSparse Index: Concept
  • 11. DWH-Ahsan Abdullah 11 Sparse Index: Adv & Dis AdvSparse Index: Adv & Dis Adv  Advantage:Advantage:  A sparse index uses less space at the expense ofA sparse index uses less space at the expense of somewhat more time to find a record given itssomewhat more time to find a record given its keykey  Support multi-level indexing structureSupport multi-level indexing structure  Disadvantage:Disadvantage:  Locating a record given a key has differentLocating a record given a key has different performance for different key valuesperformance for different key values No text goes to graphics
  • 12. DWH-Ahsan Abdullah 12 Sparse 2nd level 10 90 170 250 330 410 490 570 Data File 20 10 40 30 60 50 80 70 100 90 10 30 50 70 90 110 130 150 170 190 210 230 Sparse Index: Multi levelSparse Index: Multi level
  • 13. DWH-Ahsan Abdullah 13 B-tree Indexing: ConceptB-tree Indexing: Concept  Can be seen as a general form of multi-levelCan be seen as a general form of multi-level indexes.indexes.  Generalize usual (binary) search trees (BST).Generalize usual (binary) search trees (BST).  Allow efficient and fast exploration at the expense ofAllow efficient and fast exploration at the expense of using slightly more space.using slightly more space.  Popular variant: BPopular variant: B++ -tree-tree  Support more efficiently queries like:Support more efficiently queries like: SELECT * FROM R WHERE a = 11SELECT * FROM R WHERE a = 11  SELECT * FROM R WHERE 0<= b and b<42SELECT * FROM R WHERE 0<= b and b<42
  • 14. DWH-Ahsan Abdullah 14 200 220 250 280 130 B-tree Indexing: ExampleB-tree Indexing: Example Each node stored in one disk block RIDlist 9 20 100 140 145 200 210 215 220 230 250 256 279 280 300 Looking for Empno 250
  • 15. DWH-Ahsan Abdullah 15 B-tree Indexing: LimitationsB-tree Indexing: Limitations  If a table is large and there are fewer unique values.  Capitalization is not programmatically enforced (meaning case-sensitivity does matter and “FLASHMAN" is different from “Flashman").  Outcome varies with inter-character spaces.  A noun spelled differently will result in different results.  Insertion can be very expensive. Nothing will go to graphics
  • 16. DWH-Ahsan Abdullah 16 B-tree Indexing: Limitations ExampleB-tree Indexing: Limitations Example Given that MOHAMMED is the most common first name in Pakistan, a 5-million row Customers table would produce many screens of matching rows for MOHAMMED AHMAD, yet would skip potential matching values such as the following: VALUE MISSED REASON MISSED Mohammed Ahmad Case sensitive MOHAMMED AHMED AHMED versus AHMAD MOHAMMED AHMAD Extra space between names MOHAMMED AHMAD DR DR after AHMAD MOHAMMAD AHMAD Alternative spelling of MOHAMMAD
  • 17. DWH-Ahsan Abdullah 17 Hash Based IndexingHash Based Indexing  You may recall that in internal memory, hashingYou may recall that in internal memory, hashing can be used to quickly locate a specific key.can be used to quickly locate a specific key.  The same technique can be used on externalThe same technique can be used on external memory.memory.  However, advantage over search trees is smaller inHowever, advantage over search trees is smaller in external search than internal.external search than internal. WHY?WHY?  Because part of search tree can be brought intoBecause part of search tree can be brought into the main memory.the main memory.
  • 18. DWH-Ahsan Abdullah 18 Hash Based Indexing: ConceptHash Based Indexing: Concept In contrast to B-tree indexing, hash based indexes do notIn contrast to B-tree indexing, hash based indexes do not (typically) keep index values in sorted order.(typically) keep index values in sorted order.  Index entry is found by hashing on index value requiringIndex entry is found by hashing on index value requiring exact match.exact match. SELECT * FROM Customers WHERE AccttNo= 110240SELECT * FROM Customers WHERE AccttNo= 110240  Index entries kept in hash organized tables rather than B-Index entries kept in hash organized tables rather than B- tree structures.tree structures.  Index entry contains ROWID values for each rowIndex entry contains ROWID values for each row corresponding to the index value.corresponding to the index value.  Remember few numbers in real-life to be useful for hashing.Remember few numbers in real-life to be useful for hashing.
  • 19. DWH-Ahsan Abdullah 19 . records . key → h(key) disk block Note on terminology: The word "indexing" is often used synonymously with "B-tree indexing". Hashing as Primary IndexHashing as Primary Index
  • 20. DWH-Ahsan Abdullah 20 key → h(key) Index recordkey Can always be transformed to a secondary index using indirection, as above. Indexing the Index Hashing as Secondary IndexHashing as Secondary Index
  • 21. DWH-Ahsan Abdullah 21  Indexing (using B-trees) good for rangeIndexing (using B-trees) good for range searches, e.g.:searches, e.g.: SELECT * FROM R WHERE A > 5SELECT * FROM R WHERE A > 5  Hashing good for match based searches,Hashing good for match based searches, e.g.:e.g.: SELECT * FROM R WHERE A = 5SELECT * FROM R WHERE A = 5 B-tree vs. Hash IndexesB-tree vs. Hash Indexes
  • 22. DWH-Ahsan Abdullah 22 Primary Key vs. Primary IndexPrimary Key vs. Primary Index Relation Students Name ID dept AHMAD 123 CS Akram 567 EE Numan 999 CS  Primary Key & Primary Index:Primary Key & Primary Index:  PK is ALWAYS unique.PK is ALWAYS unique.  PI can be unique, but does not have to be.PI can be unique, but does not have to be.  In DSS environment, very few queries are PK based.In DSS environment, very few queries are PK based.
  • 23. DWH-Ahsan Abdullah 23  Primary index selection criteria:Primary index selection criteria:  Common join and retrieval key.Common join and retrieval key.  Can be unique UPI or non-unique NUPI.Can be unique UPI or non-unique NUPI.  Limits on NUPI.Limits on NUPI.  Only one primary index per table (for hash-basedOnly one primary index per table (for hash-based file system).file system). Primary Indexing: CriterionPrimary Indexing: Criterion
  • 24. DWH-Ahsan Abdullah 24 Primary Indexing Criteria: ExamplePrimary Indexing Criteria: Example What should be the primary index of the call table for aWhat should be the primary index of the call table for a large telecom company?large telecom company? call_id decimal (15,0) NOT NULL caller_no decimal (10,0) NOT NULL call_duration decimal (15,2) NOT NULL call_dt date NOT NULL called_no decimal (15,0) NOT NULL Call TableCall Table No simple answer!!No simple answer!!
  • 25. DWH-Ahsan Abdullah 25  Almost all joins and retrievals will occur throughAlmost all joins and retrievals will occur through the caller _no foreign key.the caller _no foreign key.  Use caller_no as a NUPI.Use caller_no as a NUPI.  In case of non uniform distribution on caller_noIn case of non uniform distribution on caller_no oror  if phone number have very large number ofif phone number have very large number of outgoing calls (e.g., an institutional number couldoutgoing calls (e.g., an institutional number could easily have several thousand calls).easily have several thousand calls).  Use call_id as UPI for good data distribution.Use call_id as UPI for good data distribution. Primary IndexingPrimary Indexing
  • 26. DWH-Ahsan Abdullah 26 For a hash-based file system, primary index isFor a hash-based file system, primary index is free!free!  No storage cost.No storage cost.  No index build required.No index build required. OLTP databases use a page-based file systemOLTP databases use a page-based file system and therefore do not deliver this performanceand therefore do not deliver this performance advantage.advantage. Primary IndexingPrimary Indexing

Notes de l'éditeur

  1. As the complexity of the queries increases, involve table joins, requires aggregates the processing time also increases correspondingly. So the problem is one of knowledge and time.
  2. The point of using an index is to increase the speed and efficiency of searches of the database. Without some sort of index, a user’s query must sequentially scan the database, finding the records matching the parameters in the WHERE clause.
  3. &amp;lt;number&amp;gt;
  4. &amp;lt;number&amp;gt;
  5. &amp;lt;number&amp;gt;
  6. &amp;lt;number&amp;gt;
  7. &amp;lt;number&amp;gt;
  8. &amp;lt;number&amp;gt;
  9. &amp;lt;number&amp;gt;
  10. &amp;lt;number&amp;gt;
  11. &amp;lt;number&amp;gt;
  12. &amp;lt;number&amp;gt;
  13. &amp;lt;number&amp;gt;
  14. &amp;lt;number&amp;gt;
  15. &amp;lt;number&amp;gt;
  16. &amp;lt;number&amp;gt;
  17. &amp;lt;number&amp;gt;
  18. &amp;lt;number&amp;gt;
  19. &amp;lt;number&amp;gt;
  20. &amp;lt;number&amp;gt;
  21. &amp;lt;number&amp;gt;