2. Data Flow I
•Extract
–Reading the data from the source, based on its data formats
–Connecting to the source and accessing its data
–Scheduling extraction from the source system and sending notifications about it
–Capturing changed data
–Dumping the extracted data to disk so that it stays available
•Clean
–Ensure column properties such as data types
–Enforce the structure of the data and its dependencies
–Enforce data rules - direct ones as well as business
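The Clean checks above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the column names, the `EXPECTED_TYPES` specification, and the non-negative-amount rule are all assumptions invented for the example.

```python
from datetime import date

# Hypothetical column specification: column name -> expected Python type.
EXPECTED_TYPES = {"order_id": int, "amount": float, "order_date": date}


def clean_errors(row):
    """Return a list of problems found in one extracted row (a dict)."""
    errors = []
    # Structure: every expected column must be present.
    for column in EXPECTED_TYPES:
        if column not in row:
            errors.append(f"missing column: {column}")
    # Column properties: present values must have the expected data type.
    for column, expected in EXPECTED_TYPES.items():
        if column in row and not isinstance(row[column], expected):
            errors.append(f"bad type for {column}")
    # Data rule (assumed for illustration): amounts are never negative.
    if isinstance(row.get("amount"), float) and row["amount"] < 0:
        errors.append("amount must be non-negative")
    return errors


good = {"order_id": 1, "amount": 19.99, "order_date": date(2024, 1, 5)}
bad = {"order_id": "1", "amount": -5.0}

print(clean_errors(good))  # no problems expected
print(clean_errors(bad))   # structure, type and rule problems
```

Returning a list of problems instead of raising on the first one lets a load report every defect in a row at once, which tends to be more useful when rejected rows are reviewed later.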
3. Data Flow II
•Conform
–Loading dimensions, sub-dimensions, facts
–Conforming the dimensions and facts
–Handling late-arriving data for dimensions and facts
–Loading and updating aggregations
–Dumping the delivered data to disk
•Deliver/operations
–Scheduling
–Job execution
–Recovery, failure handling and restart
–Quality checks
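One possible shape for job execution with failure handling and restart is sketched below. It assumes each job is a plain Python callable and that the set of finished job names is kept between runs; the function `run_jobs` and the job names are hypothetical, and a real scheduler would persist the completed set rather than hold it in memory.

```python
def run_jobs(jobs, completed, max_attempts=3):
    """Run jobs in order, skipping those already completed.

    jobs: list of (name, callable) pairs.
    completed: set of job names finished in an earlier run; it is
               updated in place so that a restart can resume.
    Returns True if every job finished, False if one kept failing.
    """
    for name, job in jobs:
        if name in completed:
            continue  # restart: do not redo finished work
        for attempt in range(1, max_attempts + 1):
            try:
                job()
                completed.add(name)
                break
            except Exception as exc:
                print(f"{name} failed on attempt {attempt}: {exc}")
        else:
            return False  # give up; a later run resumes from this job
    return True


calls = {"load": 0}


def extract():
    pass  # stands in for a real extraction step


def load():
    # Fails once, then succeeds, to exercise the retry path.
    calls["load"] += 1
    if calls["load"] < 2:
        raise RuntimeError("target unavailable")


done = set()
ok = run_jobs([("extract", extract), ("load", load)], done)
print(ok, sorted(done))
```

Because `completed` survives a failed run, calling `run_jobs` again with the same set resumes at the first unfinished job instead of starting over.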
4. Extraction
•It is crucial to know how to extract the data from the source system
•Each source has distinctive characteristics and needs to be managed accordingly
•Integration is required with different systems, such as
–Database management systems
–Operating systems
–Hardware
–Communication protocols
•It is important to logically map the data from
5. Physical to logical data mapping I
•It is crucial to have clean and cohesive data within the data warehouse.
Steps for mapping before the physical ETL process:
1. Plan the ETL process, including a defined logical data map, also called the data lineage report. The logical data map is the foundation of the metadata.
2. Identify the data sources to be used for
6. Physical to logical data mapping II
4. Go over the data lineage and business rules for extracting,
transforming and loading. This step must include the data warehouse
architects, business users, developers and QA personnel.
5. Design the complete ETL system, with details of all fact and
dimension tables as a whole.
6. Verify the computations and formulas against the business
requirements. This must involve all members of the data warehouse
building team, the architects, and members of the business teams.
7. Components of logical data map
1. Table name
2. Column name
3. Table type (fact, dimension, sub-dimension, supporting, etc.)
4. Slowly changing dimension (SCD) type (applicable to dimension tables)
5. Source database that supplies this data
6. Source table
7. Source column name
8. Transformation required (if any)
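The eight components above map naturally onto one record per target column. The sketch below is only one way to hold them; the class name `LogicalMapEntry` and every table, column, and database name in the sample entries are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogicalMapEntry:
    # Target side
    table_name: str
    column_name: str
    table_type: str            # "fact", "dimension", "sub-dimension", ...
    scd_type: Optional[int]    # applicable to dimension tables only
    # Source side
    source_database: str
    source_table: str
    source_column: str
    transformation: Optional[str] = None  # None when no transformation is needed


# Hypothetical entries; all names here are made up for the example.
entries = [
    LogicalMapEntry("dim_customer", "customer_name", "dimension", 2,
                    "crm", "customers", "full_name", "trim whitespace"),
    LogicalMapEntry("fact_sales", "amount", "fact", None,
                    "orders_db", "order_lines", "line_total"),
]

# Lineage lookup: target (table, column) -> source (database, table, column).
lineage = {
    (e.table_name, e.column_name):
        (e.source_database, e.source_table, e.source_column)
    for e in entries
}

print(lineage[("fact_sales", "amount")])
```

Keeping the map as structured records instead of free text makes it possible to answer lineage questions, such as where a given warehouse column comes from, with a simple lookup.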
9. About SCDs
•SCDs are an important factor to consider when loading the dimension tables
•The structure of a dimension table alone does not reveal which strategy is in use
•Columns have historical relevance, and the strategy for capturing this history should be known in advance
•Changing the SCD type after the design is complete should be managed through a change management process
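To illustrate why the strategy has to be decided up front, the sketch below applies the same change to the same starting row under two of the strategies this deck lists: overwriting (Type 1) and versioned rows (Type 2). The row layout, the `is_current` flag, and the function names are assumptions made for the example, not a fixed design.

```python
def apply_type1(rows, natural_key, new_city):
    """Type 1: overwrite in place, so the old value is lost."""
    for row in rows:
        if row["customer_id"] == natural_key:
            row["city"] = new_city


def apply_type2(rows, natural_key, new_city):
    """Type 2: retire the current row and append a new version."""
    current = next(r for r in rows
                   if r["customer_id"] == natural_key and r["is_current"])
    current["is_current"] = False
    rows.append({
        "surrogate_key": max(r["surrogate_key"] for r in rows) + 1,
        "customer_id": natural_key,   # natural key stays the same
        "city": new_city,
        "version": current["version"] + 1,
        "is_current": True,
    })


def initial_rows():
    # Hypothetical starting state of a customer dimension.
    return [{"surrogate_key": 1, "customer_id": "C42", "city": "Pune",
             "version": 1, "is_current": True}]


type1 = initial_rows()
apply_type1(type1, "C42", "Delhi")

type2 = initial_rows()
apply_type2(type2, "C42", "Delhi")

print(len(type1), [r["city"] for r in type1])
print(len(type2), [r["city"] for r in type2])
```

Both runs start from an identical table, yet only the Type 2 result can still answer where the customer used to live, which is why switching strategies after loading has begun is costly.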
10. SCD management
1. Type 0: passive. Values remain the same forever.
2. Type 1: allows new data to overwrite old data, so history is not tracked.
3. Type 2: tracks historical data by creating multiple records for a given natural key in the dimension tables, with separate surrogate keys and/or different version numbers.
4. Type 3: tracks changes using separate columns and preserves limited history.
5. Type 4: maintains older data in separate