Capacity Management of an ETL System
1. Capacity Model of an ETL system
Ashok Bhatla
Email – ASHOK.BHATLA.WRITER@GMAIL.COM
2. What is Business Intelligence?
Business Intelligence (BI) is a combination of tools, processes and
software that helps a company transform data into actionable
knowledge, allowing it to make faster, better-informed decisions in
order to achieve its strategic goals.
It is all about providing the right information to management at the right
time and at the lowest possible cost.
As we are drowning in data but starving for knowledge, Business Intelligence
has become the No. 1 priority for IT managers today.
3. What is ETL?
ETL stands for Extract, Transform and Load. A transactional system is meant
to be a high-performance system so that users can get their work done faster.
Running reports directly against a transactional system slows it down, so the
data is copied into a separate reporting store instead. This is why the
concept of ETL gained popularity.
In computing, Extract, Transform, and Load (ETL) refers to a database
process that involves the following steps:
o Extract data from outside sources.
o Transform it to fit operational needs, which can include joining or
reformatting tables.
o Load it into the end target (a database or, more specifically, an
operational data store, data mart, or data warehouse).
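The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production design: the source rows, the `sales_report` table and all function names are hypothetical, and an in-memory SQLite database stands in for the reporting target.

```python
import sqlite3

# Hypothetical rows standing in for extracts from two OLTP systems.
SALES = [
    {"order_id": 1, "customer_id": 10, "amount": "120.50"},
    {"order_id": 2, "customer_id": 11, "amount": "80.00"},
]
CUSTOMERS = [
    {"customer_id": 10, "name": "acme corp"},
    {"customer_id": 11, "name": "globex"},
]


def extract():
    """Extract: copy the rows out of the (hypothetical) source systems."""
    return [dict(r) for r in SALES], [dict(r) for r in CUSTOMERS]


def transform(sales, customers):
    """Transform: join sales to customers and reformat the fields."""
    names = {c["customer_id"]: c["name"].title() for c in customers}
    return [
        (s["order_id"], names[s["customer_id"]], float(s["amount"]))
        for s in sales
    ]


def load(rows, conn):
    """Load: write the transformed rows into the reporting table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_report "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales_report VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run the three steps and return (row count, total amount) loaded."""
    sales, customers = extract()
    load(transform(sales, customers), conn)
    return conn.execute(
        "SELECT COUNT(*), SUM(amount) FROM sales_report"
    ).fetchone()
```

Calling `run_etl(sqlite3.connect(":memory:"))` runs the whole pipeline against a throwaway database, which keeps the reporting queries away from the source systems.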
4. Example of ETL
[Diagram: data from OLTP systems (Cost Accounting System, Payroll Data,
Sales Data, Purchasing Data) is staged, processed by ETL (joins,
transforms, deletes, etc.) and loaded into the EDW / Reporting Data store.]
5. What is Capacity Planning?
Capacity Planning is the process of identifying the current
computing needs of a business application and forecasting its
future computing needs based on the business plans.
In other words, it determines what computing resources are needed to
meet an application's service level objectives over a period of time.
In today's economic climate, business requirements can change
rapidly depending upon an organization's strategy and goals.
Properly managed capacity plans should therefore be able to take
unforeseen requirements into account.
Capacity Planning can be done casually, or with organized and
disciplined methodologies.
The more data-driven the capacity planning is, the more accurate the
results.
6. Capacity Planning of an IT System
Capacity planning needs to ensure that all hardware (disks, memory, CPU
and network), software resources (user licenses) and facilities are
optimally used.
o Software: Software Licenses, No. of Users
o Hardware: Servers, Storage, Networking, CPU
o Facilities: Data Center Space, Power, Cooling
7. Capacity Planning
We cannot manage something which we cannot measure.
If no corrective action is taken based on measured data, then Capacity
Planning is of no use.
Proactive Capacity Planning:
o Avoid downtime by reducing the number of incidents
o Achieve performance objectives established by the business
o Reduce TCO for the ETL system
o Achieve optimal utilization of computing resources
8. Capacity Planning Steps
o Identify Service Level Objectives – know the requirements in business
terms.
o Analyze Current Capacity – gather data about resource consumption, idle
times and peak usage.
o Know the future business needs and plan for future capacity – how the IT
systems will be able to handle the increased load.
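The "Analyze Current Capacity" step can be made concrete with a small summary over utilization samples. The sketch below is an assumption-laden illustration: the function name, the 10% idle threshold and the 80% target peak are hypothetical values chosen for the example, not figures from this deck.

```python
from statistics import mean


def analyze_utilization(samples, idle_below=10.0, target_peak=80.0):
    """Summarize utilization samples given as percentages (0-100).

    idle_below and target_peak are illustrative thresholds; a real
    service level objective would supply its own values.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    peak = max(samples)
    idle_count = sum(1 for s in samples if s < idle_below)
    return {
        "average": mean(samples),
        "peak": peak,
        # Share of samples in which the resource was effectively idle.
        "idle_fraction": idle_count / len(samples),
        # Negative headroom means the peak already exceeds the target.
        "headroom": target_peak - peak,
    }
```

For example, `analyze_utilization([5, 20, 90, 5])` reports an average of 30, a peak of 90, half the samples idle, and a headroom of -10, which is the kind of spikes-and-valleys profile that batch workloads tend to produce.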
9. Strike a Balance
Moore's Law is popularly summarized as IT getting cheaper and faster
roughly every 18 months. But organizations cannot wait for the next
generation of technology to become available, as they need to take care
of business today.
In the spirit of Parkinson's Law, if you give customers more resources,
they will find ways to use them. IT managers cannot keep giving users
unlimited resources.
[Diagram labels: Supply, Demand, Performance, Utilization, Cost, Resources]
10. Capacity Challenges for ETL Systems
ETL jobs are of different types (some Full Refresh, some Delta Refresh),
process varying amounts of data and are scheduled at different
frequencies. Therefore, there are always spikes and valleys of workload.
Typical transactional SQL queries are simple and do not require
parallelism. In an ETL system, on the other hand, very large datasets are
processed, and workloads are random in nature and not easy to predict.
This makes it difficult to predict the resource requirements.
An enterprise ETL system processes thousands of batch jobs on a daily
basis. These systems connect to a large number of data sources, which
reside on different platforms and may be on different networks across
the WAN.
Different types of users have different peak usage requirements. They
have different needs for transaction times, elapsed times and response
times.
11. Disk Capacity Issues – Engineers spend a lot of time cleaning up old,
stale data.
Over Capacity – Extra compute capacity is paid for but not utilized.
Network Slowness Problems – Batch jobs sometimes run slowly.
User Licenses – The number of user licenses is reaching its limit.
12. Analyse the Complete Picture
Business Terms
o User Needs (Transaction Time, Response Time, Elapsed Time, Throughput Time)
o Data Usage Patterns (Type of SQL Queries or ETL Transformations)
o Data Complexity (Financial, Marketing or Factory Data)
o Volume and Frequency of Data Loads (No. of Batch Jobs and GB of data processed)
o User Profile (Simple User or Advanced Data Miner)
Technical Terms
o Storage (SAN / NAS / Local Disks)
o Processing Power (CPU, No. of Cores)
o Network Bandwidth (Transfer Rate, Bytes Tx/Rx)
o Memory (Physical, Cache, Swap)
13. Capacity Planning Tools
Vectors of Measurement: Availability, Performance, Throughput,
Utilization, Quality, Efficiency
o Simulation – Accurate, but needs a lot of time for setup.
o Testing – Costly, as another environment similar to Production is
needed.
o Trending – Can be done using Excel. Simple, but does not take non-linear
behavior into account.
o Analytical Modeling – More advanced, faster and accurate.
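Trending, as described above, amounts to fitting a straight line through past measurements and extending it. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the input is one measurement per period (for example GB of storage used per work week), and the model is purely linear, so it shares the limitation noted above about non-linear behavior.

```python
import math


def fit_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements to fit a trend")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def periods_until_full(values, capacity):
    """Periods after the last measurement until the trend line reaches capacity.

    Returns None when usage is flat or shrinking, since the line never
    reaches the capacity limit in that case.
    """
    slope, intercept = fit_trend(values)
    if slope <= 0:
        return None
    x_full = (capacity - intercept) / slope
    return max(0, math.ceil(x_full - (len(values) - 1)))
```

With four weekly readings of 100, 110, 120 and 130 GB and a 160 GB limit, the fitted growth is 10 GB per week and the limit is reached three weeks after the last reading.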
14. Data Collection
Period (WW or Month)
No. of Subject Areas
No. of Projects
No. of ETL Batch Jobs
Storage Consumption
CPU
Network
Disk I/O
Tx/Rx Bytes
How do we collect Performance / Capacity Data?
OS monitoring tools – even free ones like Nagios, kSar, SQLMon and PerfMon
Data collected in SQL tables
Data collected by the software used by the storage frames – gives
utilization, capacity and performance data
15. Capacity Model for ETL System ??
Examples of some metrics which can be developed
o Average Run time for a Batch job
o Average CPU for a Batch job
o CPU Utilization / Subject Area / Week
o CPU Utilization / Project / Week
o No. of Batch Jobs / GB of Storage
o No. of Batch Jobs / X Amount of CPU
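A few of the metrics listed above can be derived from per-run job records. The sketch below assumes a hypothetical record layout (`job`, `subject_area`, `week`, `runtime_min`, `cpu_sec`) and made-up sample values. It reports CPU seconds consumed rather than a utilization percentage, because converting to utilization would require the total CPU capacity, which is not given here.

```python
from collections import defaultdict

# Hypothetical run records, e.g. as exported from a scheduler log.
RUNS = [
    {"job": "load_sales", "subject_area": "Sales", "week": "WW01",
     "runtime_min": 30, "cpu_sec": 600},
    {"job": "load_sales", "subject_area": "Sales", "week": "WW02",
     "runtime_min": 50, "cpu_sec": 900},
    {"job": "load_payroll", "subject_area": "Finance", "week": "WW01",
     "runtime_min": 20, "cpu_sec": 300},
]


def average_by(runs, key, field):
    """Average of `field` for each distinct value of `key`."""
    totals = defaultdict(lambda: [0.0, 0])
    for run in runs:
        entry = totals[run[key]]
        entry[0] += run[field]
        entry[1] += 1
    return {k: total / count for k, (total, count) in totals.items()}


def cpu_by_subject_area_week(runs):
    """Total CPU seconds consumed per (subject area, week)."""
    usage = defaultdict(float)
    for run in runs:
        usage[(run["subject_area"], run["week"])] += run["cpu_sec"]
    return dict(usage)
```

`average_by(RUNS, "job", "runtime_min")` gives the average run time per batch job, and `average_by(RUNS, "job", "cpu_sec")` gives the average CPU per batch job. The same helper works for any other grouping column in the records.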
16. Dashboard / Indicators
Phase I
Develop a trending model to begin with.
Dashboards can be developed using SharePoint BI if the capacity data is
captured in an Excel pivot table or in SQL databases.
Phase II
Can we develop a Predictive Model???