Local Secondary Indexes in Apache Phoenix
- 2. 2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Agenda
Local Indexes Introduction
Local indexes design and data model
Local index writes and reads
Performance Results
Helpful Tips or recommendations
Secondary indexes in Phoenix
Primary key columns in a Phoenix table form the HBase row key, which acts as a
primary index, so filtering by primary key columns becomes a point or range
scan on the table.
Filtering on a non-primary-key column turns the query into a full table scan,
which consumes a lot of time and resources.
With secondary indexes, we can create alternative access paths that convert
such queries into point lookups or range scans.
Phoenix supports two kinds of indexes: GLOBAL and LOCAL.
Phoenix supports functional indexes as well.
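A minimal sketch of the global and functional forms, for contrast with the local indexes covered in the rest of the deck. The EMP table and the index names are hypothetical and exist only for illustration.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE IF NOT EXISTS EMP (
    EMP_ID BIGINT NOT NULL PRIMARY KEY,
    FIRST_NAME VARCHAR,
    LAST_NAME VARCHAR,
    DEPT VARCHAR);

-- Global index: what CREATE INDEX produces when LOCAL is not specified.
CREATE INDEX EMP_DEPT_IDX ON EMP (DEPT);

-- Functional index: the index key is an expression rather than a plain column.
CREATE INDEX EMP_UPPER_NAME_IDX ON EMP (UPPER(FIRST_NAME || ' ' || LAST_NAME));

-- A filter on the same expression is a candidate for the functional index.
SELECT EMP_ID FROM EMP WHERE UPPER(FIRST_NAME || ' ' || LAST_NAME) = 'JOHN DOE';
```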
Local Secondary Indexes - Introduction
A local secondary index is LOCAL in the sense that each REGION of a table is
treated as a unit that creates and maintains an index of its own data.
The local index data is stored and maintained in shadow column
family(ies) in the same table.
So the index always resides on the same server that serves the actual data.
Faster index building.
Syntax:
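A hedged sketch of the general statement form. MY_INDEX, MY_TABLE and the column names are placeholders, and the INCLUDE clause is optional.

```sql
-- Placeholder names throughout; adapt them to an existing table.
-- INCLUDE adds covered columns that are stored with the index but not indexed.
CREATE LOCAL INDEX MY_INDEX ON MY_TABLE (INDEXED_COL) INCLUDE (COVERED_COL);
```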
Local Secondary Index - Introduction
CREATE TABLE IF NOT EXISTS ORDERS(
ORDER_ID BIGINT NOT NULL PRIMARY KEY,
CUSTOMER_ID BIGINT NOT NULL,
ITEM_ID INTEGER NOT NULL,
DATE DATE NOT NULL);
CREATE LOCAL INDEX IDX ON ORDERS(DATE);

BASE TABLE DATA (ORDER ID is the primary key)
Region             Order ID  Customer ID  Item ID  Date
Region [100,104)   100       11           1111     06/10/2017
Region [100,104)   101       23           1231     06/01/2017
Region [100,104)   102       11           1332     05/31/2017
Region [100,104)   103       34           3221     06/01/2017
Region [104,107)   104       55           1343     05/28/2017
Region [104,107)   105       11           2312     06/01/2017
Region [104,107)   106       29           1234     05/15/2017

INDEX ROW KEY
Index of           Region start key  IDX ID  Date        Order ID
Region [100,104)   100               1       05/31/2017  102
Region [100,104)   100               1       06/01/2017  101
Region [100,104)   100               1       06/01/2017  103
Region [100,104)   100               1       06/10/2017  100
Region [104,107)   104               1       05/15/2017  106
Region [104,107)   104               1       05/28/2017  104
Region [104,107)   104               1       06/01/2017  105
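To try the example, a few of the rows above can be loaded and then filtered on the indexed column. This is only a sketch: the explicit date format and the quoting of "DATE" are precautions added here, not something the slide specifies.

```sql
-- Three of the sample rows; the date format is given explicitly
-- because the slide writes dates as MM/dd/yyyy.
UPSERT INTO ORDERS VALUES (100, 11, 1111, TO_DATE('06/10/2017', 'MM/dd/yyyy'));
UPSERT INTO ORDERS VALUES (101, 23, 1231, TO_DATE('06/01/2017', 'MM/dd/yyyy'));
UPSERT INTO ORDERS VALUES (102, 11, 1332, TO_DATE('05/31/2017', 'MM/dd/yyyy'));

-- Filter on the indexed column. "DATE" is quoted only as a precaution,
-- since the column name is also a type name.
SELECT ORDER_ID FROM ORDERS
WHERE "DATE" = TO_DATE('06/01/2017', 'MM/dd/yyyy');
```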
Data Model
CREATE TABLE IF NOT EXISTS WEB_STAT (
HOST CHAR(2) NOT NULL,
DOMAIN VARCHAR NOT NULL,
FEATURE VARCHAR NOT NULL,
DATE DATE NOT NULL,
STATS.ACTIVE_VISITOR INTEGER
CONSTRAINT PK PRIMARY KEY (HOST, DOMAIN));
1) CREATE LOCAL INDEX IDX ON WEB_STAT(DATE)
2) CREATE LOCAL INDEX IDX2 ON WEB_STAT(STATS.ACTIVE_VISITOR) INCLUDE(DATE)
3) CREATE LOCAL INDEX IDX3 ON WEB_STAT(DATE) INCLUDE(STATS.ACTIVE_VISITOR)
[Diagram: Region1 and Region2 of the table each hold the data column families 0 and STATS, plus the shadow column families L#0 and L#STATS that store the index data.]
Data Model
Local index row key format:
REGION START KEY | SALT NUMBER (empty for non-salted table) | INDEX ID | TENANT_ID (empty for non-multi-tenant table) | INDEXED COLUMN VALUE[S] | PRIMARY KEY COLUMN VALUE[S]
REGION START KEY: Start key of the data region. For the first region it is an empty byte array of the
region end key's length. This helps to index data region-wise.
SALT NUMBER: A byte value representing the salt bucket number calculated for the index row key.
INDEX ID: A short number that identifies the local index. This helps to store each index's data
together.
TENANT_ID: Tenant column value of the row key. It is empty if the table is not multi-tenant.
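The SALT NUMBER and TENANT_ID parts of the key are only populated when the base table is salted or multi-tenant. A sketch of two such tables follows. The table, column and index names are hypothetical, and the bucket count of 4 is an arbitrary choice.

```sql
-- Hypothetical salted table: index row keys carry a salt byte.
CREATE TABLE IF NOT EXISTS EVENTS_SALTED (
    EVENT_ID BIGINT NOT NULL PRIMARY KEY,
    EVENT_TYPE VARCHAR)
    SALT_BUCKETS = 4;
CREATE LOCAL INDEX EVENTS_SALTED_IDX ON EVENTS_SALTED (EVENT_TYPE);

-- Hypothetical multi-tenant table: the leading primary key column
-- identifies the tenant, and its value becomes part of the index row key.
CREATE TABLE IF NOT EXISTS EVENTS_MT (
    TENANT_ID VARCHAR NOT NULL,
    EVENT_ID BIGINT NOT NULL,
    EVENT_TYPE VARCHAR
    CONSTRAINT PK PRIMARY KEY (TENANT_ID, EVENT_ID))
    MULTI_TENANT = true;
CREATE LOCAL INDEX EVENTS_MT_IDX ON EVENTS_MT (EVENT_TYPE);
```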
Write path
[Diagram: CLIENT, Region Server and Region; the region has a data cf and an index cf, each with its own MemStore, and a WAL.]
1. Write request from the client.
2. Batch call; index updates are prepared alongside the data updates.
4. Merge data and index updates.
5. Write to MemStores.
6. Write to WAL.
Local index updates are 100% ATOMIC and CONSISTENT with data updates.
Read Path
SELECT COUNT(*) FROM T WHERE INDEXED_COL='findme'
[Diagram: the client queries the regions ['',F), [F,L), [L,R) and [R,'') spread over two region servers; every region has the column families 0 and L#0, and the regions are labelled with the values 2, 1, 0 and 5.]
Read Path
SELECT INDEX_COL, NON_INDEX_COL FROM T WHERE INDEX_COL='findme'
Joining back missing columns from data table
[Diagram: CLIENT and a Region with an index cf and a data cf, each with its own MemStore.]
1. SCAN, L#0, FILTER (client to region)
2. Apply filter on index col
3. Get non-index cols on matching rows
4. Merge with index cols
5. Return combined results to client
6. Results
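To check which path a given query takes, EXPLAIN can be run on it. The sketch below uses the ORDERS table and IDX index from the earlier slide. The plan text differs between Phoenix versions, so no output is shown, and the quoting of "DATE" is a precaution added here.

```sql
-- All referenced columns are in the index (indexed column plus primary key),
-- so no columns need to be fetched from the data column family.
EXPLAIN SELECT COUNT(*) FROM ORDERS
WHERE "DATE" = TO_DATE('06/01/2017', 'MM/dd/yyyy');

-- CUSTOMER_ID is not part of IDX, so matching rows need the join-back
-- to the data column family described in the steps above.
EXPLAIN SELECT "DATE", CUSTOMER_ID FROM ORDERS
WHERE "DATE" = TO_DATE('06/01/2017', 'MM/dd/yyyy');
```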
Region Splits and Merges
Since the indexes are also stored in the same table, splits and merges are taken care
of by HBase automatically.
We have a special mechanism to separate an HFile into child regions after a split.
We scan through each key value, find the data row key from it, and write it to the
corresponding child region.
Performance Results
4-node cluster.
Tested with 5 local indexes on a base table of 25 columns with 10 regions.
Ingested 50M rows.
3x faster upsert time compared to global indexes.
5x less network RX/TX utilization during writes compared to global indexes.
Similar read performance compared to global indexes for queries with aggregations, group
by, limit, etc.
Performance results
[Chart: Write performance]
[Charts, three slides: Network Tx/Rx during write]
Helpful Tips
Mutable vs immutable rows table?
– Writes are much faster with local indexes on an immutable-rows table than on a mutable one.
So if rows are written once and never updated, it is better to create the table with the
IMMUTABLE_ROWS property.
Online vs offline index population?
– When a table has pre-existing data, index population time varies with
the data size.
– Usually index population happens on the server, by reading the data table and writing index
rows to the same table. This is normally very fast. But if the data size is very large, it is better
to use ASYNC population with IndexTool.
Covered index vs non-covered index?
– When a query accesses non-indexed columns, Phoenix joins the columns that are
missing from the index from the data table itself, using get calls. If the number of
matching rows is high, it is better to create a covered index to avoid the get calls.
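The three tips can be sketched in SQL as follows. ORDERS_LOG and the index names are hypothetical. The statements show alternatives for the same indexed column, not a recommended combination.

```sql
-- Tip 1: rows are written once and never updated.
CREATE TABLE IF NOT EXISTS ORDERS_LOG (
    ORDER_ID BIGINT NOT NULL PRIMARY KEY,
    CUSTOMER_ID BIGINT,
    ORDER_DATE DATE)
    IMMUTABLE_ROWS = true;

-- Tip 2: for a large existing table, ASYNC only registers the index;
-- it is populated afterwards by running IndexTool as a separate job.
CREATE LOCAL INDEX ORDERS_LOG_DATE_IDX ON ORDERS_LOG (ORDER_DATE) ASYNC;

-- Tip 3: a covered index stores CUSTOMER_ID with the index rows, so
-- queries on ORDER_DATE that also read CUSTOMER_ID avoid the get calls.
CREATE LOCAL INDEX ORDERS_LOG_COVERED_IDX ON ORDERS_LOG (ORDER_DATE)
    INCLUDE (CUSTOMER_ID);
```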
Thank You
Q & A?
rajeshbabu@apache.org
@rajeshhcu32