Storage Characteristics Of Call Data Records In Column Store Databases

STORAGE CHARACTERISTICS
OF CALL DATA RECORDS
IN COLUMN STORE DATABASES
DAVID M WALKER
DATA MANAGEMENT & WAREHOUSING

OVERVIEW

•  This presentation gives a brief overview of the
storage characteristics of Call Data Records in
Column Store Databases
•  It discusses
•  What are Call Data Records (CDRs)?
•  What is a Column Store Database?
•  How efficient is a column store database for storing CDR
and other (similar) machine generated data?
•  It does not:
•  Examine performance in any detail
•  Compare column store to traditional row-based databases

Jan 2012 © 2012 Data Management & Warehousing

WHAT ARE CALL DATA RECORDS (CDRs)?
•  Every time a telephone call is made, data about
that call is recorded. At its most basic this will
include:
•  The Calling Number (who made the call)
•  The Called Number (who was called)
•  The Start Time
•  The End Time (or the duration)
•  Various pieces of technical information (which network
switch was used, the mobile handset identifier, the call
direction, whether it is a free x800 type call, etc.)


CDRs AT MULTIPLE LEVELS

•  A CDR is created at the switch. Each switch
involved in a call creates its own CDRs; these are
often called Network CDRs

•  The Network CDRs are joined together into a single
end to end call record through a process
known as mediation. These are Unrated CDRs

•  Finally the cost of the call is calculated and added
to the Unrated CDRs to create Rated CDRs


MORE CDR COMPLEXITY

•  There are CDRs that are used for billing the subscriber,
often called Retail CDRs

•  There are also CDRs that are used to charge other
operators when their call travels over your network (e.g.
when you make a mobile call that finishes on a land line
from another operator). These are known as Interconnect
CDRs or Wholesale CDRs

•  There are also differences between Mobile and Fixed
(Land) Line CDRs

•  Finally each Switch Manufacturer (there are over 60)
and each Mediation and/or Billing system (again at least
50) uses its own format


FOR THIS EXERCISE …

•  We are using Mobile Rated Interconnect CDRs from a
European Telephone Company (Telco)

•  We have 12,902 files, containing 435,242,447 CDRs
over a 181 day period from 482,883 subscribers

•  Each CDR has 80 fields and 583 characters in a
fixed length record format file. We have also
added one mandatory field to hold the name of the
source file from which the record came


DATA DISTRIBUTION IN THE CDR RECORDS (1)
•  The structure of the data in the record has a
massive impact on its storage. There are a number
of factors to look at:
•  Data Types, Padding, Place Holders and Data Cardinality

•  The example data we are using has 2 Datetime
fields, 11 Char fields, 10 Numeric fields, 33 Integer
fields and 25 Varchar fields which is a fairly typical
mix for this type of machine generated data. In the
source file these are all held as ASCII text.


DATA DISTRIBUTION IN THE CDR RECORDS (2)
•  Fixed length records are padded. In our data set
the ‘Calling Number’ fixed length field is defined as
24 characters long, however the maximum field
length in the actual data is only 11 characters.
This means that there are always at least 13 space
characters of padding afterwards

•  24 of our 80 fields have no information in them at all.
43 of the fields are mandatory and are 100%
populated. The remaining 13 fields are filled in between
25% and 75% of the records.
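
The padding and empty place holders described above can be illustrated with a minimal Python sketch of reading one fixed length record. Only the 24 character width of the Calling Number comes from the slides; the other field names, widths and the sample values are hypothetical.

```python
# Hypothetical layout: only the 24-character Calling Number width is from the slides.
LAYOUT = [
    ("calling_number", 24),
    ("called_number", 24),
    ("start_time", 14),
    ("duration", 6),
]

def parse_fixed(record, layout=LAYOUT):
    """Slice a fixed length record into fields and strip the space padding."""
    fields = {}
    offset = 0
    for name, width in layout:
        value = record[offset:offset + width].strip()
        # A field that is all spaces is an empty place holder.
        fields[name] = value or None
        offset += width
    return fields

# Invented sample record: an 11-character number padded out to 24 characters,
# an empty Called Number, a start time and a right-aligned duration.
record = "44118999001".ljust(24) + "".ljust(24) + "20110701091500" + "125".rjust(6)
parsed = parse_fixed(record)
padding = 24 - len(parsed["calling_number"])
print(parsed, padding)
```

In the source file every record carries this padding, which is one reason the raw files are so much larger than the information they hold.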


DATA DISTRIBUTION IN THE CDR RECORDS (3)
•  Finally the number of distinct values (cardinality) a
field has affects storage. One flag field has possible
values of 0 or 1 and therefore a (low) cardinality of
2. Another field has a nearly unique value for every
record and therefore a very high cardinality. Of the
57 fields with data, 20 fields have high
cardinality, 5 fields have medium cardinality and the
remaining 32 fields have low cardinality
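
As a rough illustration of how fields might be bucketed by cardinality, the sketch below counts the distinct values in a column. The slides do not say where the low/medium/high boundaries were drawn, so the thresholds `low_max` and `high_ratio` are assumptions chosen purely for the example.

```python
def classify_cardinality(values, low_max=100, high_ratio=0.5):
    """Bucket a column by its number of distinct values.

    low_max and high_ratio are hypothetical thresholds, not taken from the slides.
    """
    non_empty = [v for v in values if v not in (None, "")]
    if not non_empty:
        return "empty"
    distinct = len(set(non_empty))
    if distinct <= low_max:
        return "low"
    # Nearly one distinct value per record counts as high cardinality.
    if distinct / len(non_empty) >= high_ratio:
        return "high"
    return "medium"

flag_field = [0, 1] * 500                    # two possible values
near_unique = list(range(1000))              # a different value on every record
in_between = [i % 200 for i in range(1000)]  # 200 distinct values over 1000 records

print(classify_cardinality(flag_field))
print(classify_cardinality(near_unique))
print(classify_cardinality(in_between))
```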


WHAT IS A COLUMN STORE DATABASE?
•  Traditionally databases are ‘row-based’, i.e. the
fields of each record are stored next to one another.

Forename  Surname  Gender
David     Walker   Male
Helen     Walker   Female
Sheila    Jones    Female

•  Column store databases store the values in columns
and then hold a mapping to form the record
•  This is transparent to the user, who queries a table
with SQL in exactly the same way as they would a
row-based database

COLUMN STORAGE

Note: To the user this appears as a conventional
row-based table that can be queried by standard
SQL; it is only the underlying storage that is different

First Name Value  F Token
David             PPP
Helen             QQQ
Sheila            RRR

Surname Value  S Token
Jones          XXX
Walker         YYY

Gender Value  G Token
Female        AAA
Male          BBB

Record mapping:
F Token  S Token  G Token
PPP      YYY      BBB
QQQ      YYY      AAA
RRR      XXX      AAA
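
The token scheme on this slide can be sketched in a few lines of Python. This is a simplified illustration of dictionary encoding in general, not a description of how Sybase IQ or any other product stores data internally; integer tokens stand in for the PPP/YYY/AAA tokens above and the function names are invented for the example.

```python
def dictionary_encode(rows, columns):
    """Split row-based records into one value dictionary per column plus a token mapping."""
    dictionaries = {col: {} for col in columns}  # value -> token, one dictionary per column
    mapping = []                                 # one tuple of tokens per record
    for row in rows:
        tokens = []
        for col, value in zip(columns, row):
            col_dict = dictionaries[col]
            # Each distinct value is stored once; repeats reuse the existing token.
            tokens.append(col_dict.setdefault(value, len(col_dict)))
        mapping.append(tuple(tokens))
    return dictionaries, mapping

def decode(dictionaries, mapping, columns):
    """Rebuild the original rows from the dictionaries and the mapping."""
    reverse = {col: {t: v for v, t in dictionaries[col].items()} for col in columns}
    return [tuple(reverse[col][t] for col, t in zip(columns, tokens)) for tokens in mapping]

columns = ["Forename", "Surname", "Gender"]
rows = [
    ("David", "Walker", "Male"),
    ("Helen", "Walker", "Female"),
    ("Sheila", "Jones", "Female"),
]
dictionaries, mapping = dictionary_encode(rows, columns)
print(dictionaries["Surname"])  # "Walker" is stored once although it appears twice
print(mapping)
```

The longer and more repetitive the strings, the more is saved by holding each value once and repeating only a short token.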


EFFICIENCIES OF COLUMN STORE DATABASES
•  Column store databases offer significant storage
optimisation opportunities, especially where there are low
or medium cardinality character strings (e.g. the
telephone numbers and reference data), because long
strings are not repeatedly stored
•  In addition it is possible to compress the column
data very efficiently
•  In some column store implementations the column
storage holds additional metadata that can be used to
speed up specific queries (e.g. the number of
records associated with each value in a column)
•  Reducing the volume of data stored reduces I/O
when querying the database, which in turn improves
query performance
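
To show why per-value metadata can help, the sketch below keeps a count of records for each value in a column, so a "how many records have this value" question is answered from the metadata rather than by scanning the records. It is a toy model only; how any particular column store holds or uses such metadata is not covered by the slides.

```python
from collections import Counter

# The Gender column from the earlier example.
gender_column = ["Male", "Female", "Female"]

# Metadata built once at load time: number of records per distinct value.
value_counts = Counter(gender_column)

def count_where(counts, value):
    """Answer a COUNT(*) ... WHERE column = value style question from the metadata."""
    return counts.get(value, 0)

print(count_where(value_counts, "Female"))
```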


INEFFICIENCIES OF COLUMN STORE DATABASES
•  In general manipulating individual rows for updates
is expensive, as the database has to go to each of the
columns and then update the mapping table
•  Some column store databases have specific technologies
to limit the impact of this by caching updates
•  Consequently Column Store Databases are not
efficient for OLTP type applications. However, they
are very efficient for DWH/BI/Archive type
applications because the data is bulk loaded rather
than inserted row by row, is not frequently
updated and is used in large set based queries


HOW EFFICIENT IS IT TO STORE THIS DATA?
•  What hardware was used and what would be needed for a
production environment?

•  How was the data loaded?

•  What were the storage characteristics?


THE TEST ENVIRONMENT

•  The test environment was designed to measure storage
and not system performance
•  This test was done using Sybase IQ 15.4
•  Sybase has had a column storage database called IQ since
1996 and it is one of the most established of the 25 or so currently
listed on Wikipedia
•  The server was running CentOS 5.7 x64, a Red Hat Linux
derivative
•  The hardware consisted of:
•  Intel Xeon Quad-Core X3363
•  16GB Memory
•  Adaptec 5405 RAID Controller with 2x 1TB 7200rpm hard disks (RAID1)
•  The database was built on file systems rather than raw devices
•  Total hardware cost was less than US$3000
•  Software licences were provided on evaluation


A PRODUCTION ENVIRONMENT?

•  What is needed to make this into a production environment
depends on the volume of data per month, the
number of months of data to be held and the type of CDR
•  The biggest performance driver would be to have more
disk spindles, by adding more (faster) drives or using solid
state disks. This would improve performance as well as
adding capacity
•  e.g. 16 1 TB drives in a RAID10 configuration would provide
around 7.75 TB of space and store 75 billion of these CDRs
•  Using raw devices instead of file systems would also improve
performance
•  Other performance enhancements would include
•  Moving from 1 to 2 or 4 Quad Core CPUs
•  Adding another 16 GB of memory
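
A back-of-envelope check of the capacity example can be made by scaling the fully indexed storage figure reported on the Results slide (41.47 Gb for 435,583,388 CDRs) up to 7.75 Tb. The sketch assumes 1024 Gb per Tb and ignores temporary space, growth and any change in index choice, all of which are assumptions of this example rather than statements from the slides. On those assumptions the estimate comes out a little above 80 billion, in the same region as the 75 billion quoted above.

```python
CDR_COUNT = 435_583_388   # CDRs loaded, from the Results slide
INDEXED_GB = 41.47        # fully indexed storage, from the Results slide
USABLE_TB = 7.75          # usable space in the RAID10 example

def cdrs_that_fit(usable_tb, sample_gb, sample_cdrs, gb_per_tb=1024):
    """Scale the observed storage per CDR up to a larger volume (gb_per_tb is an assumption)."""
    return int(usable_tb * gb_per_tb / sample_gb * sample_cdrs)

indexed_estimate = cdrs_that_fit(USABLE_TB, INDEXED_GB, CDR_COUNT)
bytes_per_cdr = INDEXED_GB * 2**30 / CDR_COUNT

print(f"about {bytes_per_cdr:.0f} bytes per fully indexed CDR")
print(f"about {indexed_estimate / 1e9:.0f} billion CDRs in {USABLE_TB} TB")
```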


LOADING THE DATA

•  The data was loaded using PELT, an ETL tool written
and used by Data Management & Warehousing

•  The loading was done to production level quality

•  Data is loaded into a load table (CDR_LOAD) which
has a view (CDR_CONVERT) over it that applies
data quality checks. The data is then selected from
the view and inserted into the main table (CDRs)

•  Each step is fully logged and audited
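
The load table, data quality view and main table pattern can be sketched as follows. SQLite (through Python's standard `sqlite3` module) is used here only as a stand-in so the example runs anywhere; it is not Sybase IQ syntax. The five columns, the sample rows and the specific checks in the view are hypothetical, and the view in this sketch simply filters out rows that fail a check, which may not be how the real CDR_CONVERT view behaves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Load table: every field arrives as padded text, as in the source file.
    CREATE TABLE CDR_LOAD (calling_number TEXT, called_number TEXT,
                           start_time TEXT, duration TEXT, source_file TEXT);

    -- Main table with typed, mandatory columns (hypothetical subset of fields).
    CREATE TABLE CDRS (calling_number TEXT NOT NULL, called_number TEXT NOT NULL,
                       start_time TEXT NOT NULL, duration INTEGER NOT NULL,
                       source_file TEXT NOT NULL);

    -- Conversion view: strips padding, casts types and applies simple checks.
    CREATE VIEW CDR_CONVERT AS
        SELECT TRIM(calling_number) AS calling_number,
               TRIM(called_number)  AS called_number,
               TRIM(start_time)     AS start_time,
               CAST(TRIM(duration) AS INTEGER) AS duration,
               source_file
        FROM CDR_LOAD
        WHERE TRIM(calling_number) <> ''
          AND TRIM(duration) <> ''
          AND TRIM(duration) NOT GLOB '*[^0-9]*';
""")

# Invented sample rows: the second has an empty Calling Number and fails the checks.
rows = [
    ("44118999001".ljust(24), "44118999002".ljust(24), "2011-07-01 09:15:00", "  125", "file_0001"),
    ("".ljust(24), "44118999003".ljust(24), "2011-07-01 09:16:00", "   40", "file_0001"),
]
conn.executemany("INSERT INTO CDR_LOAD VALUES (?, ?, ?, ?, ?)", rows)

# Select from the view and insert into the main table, then empty the load table.
conn.execute("INSERT INTO CDRS SELECT * FROM CDR_CONVERT")
conn.execute("DELETE FROM CDR_LOAD")  # SQLite has no TRUNCATE statement

loaded = conn.execute("SELECT calling_number, duration FROM CDRS").fetchall()
print(loaded)
```

Keeping the checks in a view means the load table can stay a plain copy of the file, and the rules can be changed without reloading the data.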


THE LOADING STEPS

•  Copy a compressed (Unix Compress .Z) flat file (as
provided) from the incoming directory to the workspace
•  Record the size of the .Z file in bytes
•  Uncompress the file
•  Record the size in bytes and the number of records in
the uncompressed file
•  Use iSQL ‘Load’ command to insert the data into a
CDR_LOAD table
•  Record the size of the CDR_LOAD table in kilobytes
•  Insert into the main CDR table from the DQ view
CDR_CONVERT over the CDR_LOAD table
•  Record the size of the CDR table in kilobytes
•  Truncate the CDR_LOAD table
•  Compress the source file with ‘gzip -9’ (maximum
compression, longest execution)
•  Record the size of the .gz file in bytes
•  Move the compressed .gz file to an archive directory
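
The file handling part of these steps (record the sizes, recompress at maximum level, move to the archive) might look like the sketch below. It is not the PELT implementation. The function name and directory names are invented, the database steps are left out, the Unix .Z decompression is assumed to have been done already, and records are assumed to be one per line so that they can be counted.

```python
import gzip
import os
import shutil
import tempfile

def archive_source_file(path, archive_dir):
    """Record size and record count, recompress with gzip level 9, move to the archive."""
    raw_bytes = os.path.getsize(path)
    with open(path, "rb") as f:
        records = sum(1 for _ in f)  # assumes one record per line

    gz_path = path + ".gz"
    # compresslevel=9 corresponds to 'gzip -9': best compression, slowest to run.
    with open(path, "rb") as src, gzip.open(gz_path, "wb", compresslevel=9) as dst:
        shutil.copyfileobj(src, dst)
    gz_bytes = os.path.getsize(gz_path)

    os.remove(path)
    os.makedirs(archive_dir, exist_ok=True)
    archived = shutil.move(gz_path, os.path.join(archive_dir, os.path.basename(gz_path)))
    return {"records": records, "raw_bytes": raw_bytes,
            "gz_bytes": gz_bytes, "archived": archived}

# Usage with an invented three-record file of 583-character records.
workspace = tempfile.mkdtemp()
sample = os.path.join(workspace, "cdr_sample.dat")
with open(sample, "w", newline="\n") as f:
    for _ in range(3):
        f.write("44118999001".ljust(583) + "\n")

stats = archive_source_file(sample, os.path.join(workspace, "archive"))
print(stats)
```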


RESULTS
•  12,902 files were loaded with zero data quality errors
•  435,583,388 CDRs
•  236.50 GB of raw files
•  Loading: 33 hours, 22 minutes, 12 seconds
•  Indexing: 2 hours, 13 minutes, 9 seconds

•  27.48 GB of un-indexed storage in the database (8.6:1 Compression Ratio)
•  41.47 GB of fully indexed storage in the database (5.7:1 Compression Ratio)
•  20.03 GB of storage in the original .Z files (11.8:1 Compression Ratio)
•  12.42 GB of storage in the archive .gz files (19.0:1 Compression Ratio)
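
The compression ratios above are simply the raw file size divided by each stored size, which the short sketch below reproduces from the reported figures. The last two lines are an additional cross-check that rests on assumptions not stated in the slides: that each record occupies exactly 583 bytes with no record delimiter, and that the reported sizes are binary gigabytes.

```python
RAW_GB = 236.50  # raw files

stored_gb = {
    "un-indexed database": 27.48,
    "fully indexed database": 41.47,
    "original .Z files": 20.03,
    "archive .gz files": 12.42,
}

# Compression ratio = raw size / stored size, rounded to one decimal place.
ratios = {name: round(RAW_GB / size, 1) for name, size in stored_gb.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}:1")

# Cross-check (assumes 583 bytes per record, no delimiter, 1 GB = 2**30 bytes).
estimated_raw_gb = 435_583_388 * 583 / 2**30
print(f"estimated raw size: {estimated_raw_gb:.2f} GB")
```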


ADDING INDEXES

•  By default the table has no indexes
•  This is the same in most databases
•  For this test every field was indexed
•  This added 63 indexes that took up an additional 14 GB (41.47 GB fully indexed against 27.48 GB un-indexed)
•  The total space used was still 5.7 times smaller than
the space used by the raw files
•  These indexes would significantly improve query
performance
•  However not all the indexes would be required in a
production system as not all fields would be actively
queried and this would reduce the space used


DISK SPACE USED


LOAD PERFORMANCE

•  The average file had 33,760 records
•  The ETL to load an average file took 11 seconds
•  2 seconds to copy to the working directory and
decompress
•  3 seconds to import into the CDR_LOAD table
•  3 seconds to copy from the CDR_CONVERT view to the CDRs table
•  2 seconds to gzip -9 and archive
•  1 second logging and truncating tables
•  None of the tables were indexed during the load
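
For reference, the overall load rate can be derived from the totals on the Results slide, as in the sketch below. Averaged over the whole run the time per file comes out somewhat under the 11 seconds given by the step breakdown; rounding each step to whole seconds is one plausible reason, though the slides do not say.

```python
FILES = 12_902
CDRS = 435_583_388
LOAD_H, LOAD_M, LOAD_S = 33, 22, 12   # total loading time from the Results slide

total_seconds = LOAD_H * 3600 + LOAD_M * 60 + LOAD_S
records_per_file = CDRS // FILES              # whole records in an average file
seconds_per_file = total_seconds / FILES      # average elapsed time per file
records_per_second = CDRS / total_seconds     # sustained load rate

print(f"{records_per_file:,} records per file")
print(f"{seconds_per_file:.2f} seconds per file")
print(f"{records_per_second:,.0f} records per second")
```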


OBSERVATIONS (1)

•  The results were approximately in the middle of our
expectations and previous experience of other
similar data sets where the raw data has been
compressed between 5 and 10 times
•  Even low-end hardware gives acceptable load
performance, suitable for archive functionality, but
production scale hardware is needed for BI/DWH


OBSERVATIONS (2)

•  Some database tuning techniques are needed for truly
massive data sets, but these can be designed in from the
outset at low cost (e.g. which indexes/index types)
•  It is worth considering putting each month (or some
other similar date based partition) in a separate table
for systems management purposes, as it makes it easy to
remove the data at the end of the archiving process
•  Smaller reference tables added to the schema would
have little/no compression but they are also very small
and would therefore not contribute greatly to the space used
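
One way the month per table idea could be organised is sketched below: each CDR is routed to a table named after its month, and whole tables that have passed the retention period are identified for removal. The `CDRS_YYYY_MM` naming scheme, the function names and the six month retention in the usage example are all assumptions made for illustration.

```python
from datetime import date

def partition_table(call_date, prefix="CDRS"):
    """Name of the monthly table a CDR belongs in (hypothetical naming scheme)."""
    return f"{prefix}_{call_date:%Y_%m}"

def month_index(d):
    """Months since year 0, so that months can be compared and subtracted."""
    return d.year * 12 + d.month - 1

def expired_tables(table_months, today, keep_months):
    """Tables that fall outside the last keep_months months, counting the current month."""
    cutoff = month_index(today) - keep_months
    return [partition_table(m) for m in table_months if month_index(m) <= cutoff]

# Usage: tables for June to December 2011, six months retained as of January 2012.
months = [date(2011, m, 1) for m in range(6, 13)]
print(partition_table(date(2011, 7, 15)))
print(expired_tables(months, date(2012, 1, 15), keep_months=6))
```

Removing a month then means dropping a table rather than deleting individual rows, which avoids the row level operations that column stores handle least well.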


ALTERNATIVE SCENARIOS

•  This presentation uses information gathered on
specific data used for a specific purpose by a client
•  Companies may wonder how their data would
work in both storage and performance terms
•  Vendors may also wonder how their technologies
compare in both storage and performance terms
•  If you are interested in finding out please contact us
with these or any other Data Warehousing/Business
Intelligence enquiries


CONTACT US

•  Data Management & Warehousing
•  Website: http://www.datamgmt.com
•  Telephone: +44 (0) 118 321 5930
•  David Walker
•  E-Mail: davidw@datamgmt.com
•  Telephone: +44 (0) 7990 594 372
•  Skype: datamgmt
•  White Papers: http://scribd.com/davidmwalker


ABOUT US

Data Management & Warehousing is a UK based consultancy
that has been delivering successful business intelligence and
data warehousing solutions since 1995.

Our consultants have worked with major corporations around the
world including the US, Europe, Africa and the Middle East.

We have worked in many industry sectors such as telcos,
manufacturing, retail, financial and transport. We provide
governance and project management as well as expertise in the
leading technologies.

