2. Problems with early COBOLian data processing
systems.
Data redundancies
From flat file to Table, each entity ultimately becomes
a Table in the physical schema.
Simple O(n²) joins to work with tables
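A minimal Python sketch (names and data are illustrative, not from the slides) of where that cost comes from: a nested-loop join over two unindexed tables compares every row pair, i.e. O(n·m) work, roughly O(n²) when the tables are similar in size.

```python
# Sketch only: a nested-loop join over two unindexed tables compares
# every row pair, which is where the O(n * m) join cost comes from.
def nested_loop_join(left, right, key):
    """Join two lists of row-dicts on `key` by scanning all pairs."""
    out = []
    for l in left:                 # n iterations
        for r in right:            # m iterations for each of them
            if l[key] == r[key]:
                out.append({**l, **r})
    return out

customers = [{"cust_id": 1, "name": "Ali"}, {"cust_id": 2, "name": "Sara"}]
orders = [{"cust_id": 1, "total": 50}, {"cust_id": 1, "total": 20}]

joined = nested_loop_join(customers, orders, "cust_id")
# joined -> two rows, both for customer "Ali"
```

Real DBMSs use smarter join algorithms, but without indexes the pairwise-comparison picture above is the worst case.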
3. ◦ Coupled with normalization, drives all
the redundancy out of the database.
◦ Change (or add or delete) the data at just
one point.
◦ Can be used with indexing for very fast
access.
◦ Resulted in the success of OLTP systems.
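A hedged sketch of the indexing point (hypothetical data): an index, here just a Python dict acting as a hash index, replaces a full-table scan with a single lookup, which is what makes normalized OLTP access so fast.

```python
# Hypothetical meter-reading table: 1000 rows of made-up data.
rows = [{"id": i, "reading": i * 10} for i in range(1000)]

def scan(table, target_id):
    """Unindexed access: touches every row in the worst case."""
    for row in table:
        if row["id"] == target_id:
            return row
    return None

# Build the index once; afterwards each lookup is O(1) expected time.
index = {row["id"]: row for row in rows}

assert scan(rows, 999) is index[999]  # same row, very different cost
```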
4. Let's have a look at a typical ER data model first.
Some Observations:
◦ All tables look alike; as a consequence, it is difficult to
identify:
Which table is more important?
Which is the largest?
Which tables contain numerical measurements of the business?
Which tables contain nearly static descriptive attributes?
5. ◦ Many topologies for the same ER diagram,
all appearing different.
Very hard to visualize and remember.
A large number of possible connections between any
two (or more) tables
[Figure: the same twelve-node graph drawn twice in different layouts —
two isomorphic graphs that appear different.]
6. The Paradox: Trying to make information
accessible using tables resulted in an inability to
query them!
ER and normalization result in a large number of tables
which are:
◦ Hard to understand by the users (DB programmers)
◦ Hard to navigate optimally by DBMS software
Real value of ER is in using tables individually or in
pairs
Too complex for queries that span multiple tables with
a large number of records
7. ER vs. DM
ER: Constituted to optimize OLTP performance.
DM: Constituted to optimize DSS query performance.
ER: Models the micro relationships among data elements.
DM: Models the macro relationships among data elements with an overall deterministic strategy.
ER: A wild variability of the structure of ER models.
DM: All dimensions serve as equal entry points to the fact table.
ER: Very vulnerable to changes in the user's querying habits, because such schemas are asymmetrical.
DM: Changes in users' querying habits can be accommodated by automatic SQL generators.
9. A simpler logical model optimized for decision
support.
Inherently dimensional in nature, with a single
central fact table and a set of smaller
dimensional tables.
Multi-part key for the fact table.
Dimension tables with a single-part PK.
Keys are usually system-generated.
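These properties can be sketched in SQLite (a hedged illustration; the table and column names are invented, not from the slides): dimension tables carry single-part surrogate keys, and the fact table's primary key is a multi-part combination of the dimension foreign keys.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,   -- system-generated surrogate key
    day TEXT, month TEXT, quarter TEXT, year INTEGER
);
CREATE TABLE dim_item (
    item_key INTEGER PRIMARY KEY,   -- single-part PK
    item_name TEXT, category TEXT, dept TEXT
);
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date,
    item_key INTEGER REFERENCES dim_item,
    qty INTEGER, revenue REAL,
    PRIMARY KEY (date_key, item_key)  -- multi-part key of dimension FKs
);
""")
```

Surrogate (system-generated) keys insulate the warehouse from changes to business keys, as the notes below observe.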
11. Results in a star-like structure, called a star schema
or star join.
◦ All relationships mandatory M-1.
◦ Single path between any two levels.
Supports ROLAP operations.
12. Example hierarchy for the Items dimension:
Items → Books (Fiction; Text → Medical, Engg) and Cloths (Men, Women).
Analysts tend to look at the data through a dimension at a
particular "level" in the hierarchy.
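A small Python sketch of that idea (the leaf-level sales figures are made up): an analyst enters the hierarchy at a chosen level, and the leaf values roll up to it.

```python
# The slide's Items hierarchy; children of a node map to sub-hierarchies.
hierarchy = {
    "Books": {"Fiction": {}, "Text": {"Medical": {}, "Engg": {}}},
    "Cloths": {"Men": {}, "Women": {}},
}
# Hypothetical leaf-level sales counts.
sales_by_leaf = {"Fiction": 5, "Medical": 3, "Engg": 2, "Men": 7, "Women": 4}

def rollup(name, children):
    """Sum leaf-level sales under `name` in the hierarchy."""
    if not children:
        return sales_by_leaf.get(name, 0)
    return sum(rollup(c, sub) for c, sub in children.items())

books_total = rollup("Books", hierarchy["Books"])
# books_total -> 10  (Fiction 5 + Medical 3 + Engg 2)
```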
14.
[Figure: a normalized (snowflake-style) schema for retail sales.
Fact tables sale_detail and sale_header connect through M-1
relationships to chains of dimension tables: date → week → month →
quarter → year; store → zone → city → district → division → province;
item_x_cat → cat_x_dept; item_x_splir → SUPPLIER.]
16.
Beauty lies in close correspondence
with the business, evident even to
business users.
17. Dimensional hierarchies are collapsed into a single
table for each dimension. Loss of information?
A single fact table is created from the header and the
detail records, resulting in:
◦ A vastly simplified physical data model!
◦ Fewer tables (versus thousands of tables in some ERP systems).
◦ Fewer joins, resulting in high performance.
◦ Some additional space requirement.
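A hedged illustration of that trade-off (hypothetical date rows): collapsing a day → month → quarter hierarchy into one dimension table repeats the higher levels on every row, and the month-to-quarter dependency is no longer visible in the schema.

```python
# Collapsed date dimension: one row per day, higher levels repeated.
collapsed_dim_date = [
    {"date_key": 1, "day": "2020-01-01", "month": "Jan", "quarter": "Q1"},
    {"date_key": 2, "day": "2020-01-02", "month": "Jan", "quarter": "Q1"},
    {"date_key": 3, "day": "2020-01-03", "month": "Jan", "quarter": "Q1"},
]
# "Jan" is stored once per day instead of once overall -- that is the
# "additional space" cost of collapsing the hierarchy.
extra_month_copies = sum(r["month"] == "Jan" for r in collapsed_dim_date) - 1
# extra_month_copies -> 2
```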
Editor's Notes
There were utility companies that went house to house collecting information such as meter readings. The data was recorded in books, and the information was entered into a computer at a central place. The address remains the same, but the reading changes forever, so the information becomes redundant: if data changes, it needs to be reflected in many places. The solution to this problem was normalization, which is based on ER modeling. The remaining problem was slow joins: the ER diagram was turned into tables, which were joined with other tables to collect the information.
If things were fine, then why do we need DMs? Now look at a schema which is in third normal form; see the next slide.
Some observations about the ER diagram, with the questions mentioned above.
An example from real life: if you go somewhere and want to know which person is the most important, it will be the one with people around him, listening to what he says. But now, can you tell which table is more important? The one with the largest header and few rows of records, or vice versa?
Numerical measurements: e.g. sales data, number of items sold and revenue; the factual data.
Descriptive: data containing dimensional information.
So what is the benefit of the simplicity if it raises more questions at every step?
So all the previous points lead us to the demand for a new representation. This is explained using graph theory:
An ER model can have a different shape depending on the designer; every model looks different. The two graphs above are the same graph in different representations, and the left graph is more difficult to understand. This is the graph isomorphism problem: deciding whether two graphs are the same, which is computationally very hard. The same problem exists with ER diagrams: the models appear different for every problem.
So these complexities take us toward the need for DM.
Paradox: a conflict. An example: you go to a hospital and ask how the operation went; they say the operation was successful but the patient died. What is the benefit of such a successful operation if it could not save the patient's life? That is a paradox.
The problem is complex because of the many tables produced by normalization; in an ERP system these may number in the thousands. The real value of ER modeling shows when you query a single table or a few tables, where you get good performance; but in DSS we by default join many, many tables, so performance suddenly drops. This is the paradox.
So, a comparison of ER against DM.
ER modeling is for OLTP and DM for DSS. Suppose you have a bike and, when building a house, you decide to load the cement onto it: the result is that your bike will be destroyed. But if you do it with a truck, it will never have any ill effect. The problem is using the right thing for the wrong problem.
In DSS we are concerned with a higher level of aggregation, so we do not go into minor details.
ER diagrams differ for the same problem, and when systems are built they show a lot of variation; but in DSS the schema does not normally change. There are smart environments that generate SQL automatically, but they may run into difficulty while optimizing if the schema always changes. In a DM or star schema, it is very difficult to generate the SQL.
The ER schema changes when the business changes, so the SQL-generating tool faces difficulty. But in DSS the schema remains constant even with changes in the business.
The ER model can be simplified using de-normalization and DM.
So what is a DM, and how do we tell that a schema is optimized for the DSS environment?
The slide points.
So the key point is that it is simple, logical, and intuitive: if it is easy for programmers to understand, it assures a better solution. It has two kinds of tables, fact and dimension. Fact tables are large and dimension tables are small. Fact tables store the numerical data, i.e. how much was sold, the sales revenue. The dimension tables hold information about the dimensions, i.e. time, geography, etc.
Keys should be system-generated, not business keys, so that if the business changes, the keys do not need maintenance.
Map business analyst representation to relational model
Data cubes with dimensions and measures
Relational design with tables and 1-M relationships (FKs)
Dimensions to dimension tables
Measures to fact tables
Group fact and dimension tables
Grain: most detailed measure values stored
How do fact and dimension tables connect?
In the form of a star topology where the fact table is in the center.
DM is designed to support ROLAP operations, where we can run on-the-go queries.
Dimensions have hierarchies, e.g. books have fiction and text, but you cannot mix them. The benefit is that the decision maker can enter at a point in the hierarchy to see the details of other levels.
The above task can be done by two schemas.
Stars are simple: rotate, flip, or reposition one and it will not change; but if you do this to a snowflake, you lose the entire meaning.
A star schema represents a complete business process, e.g. sales, purchases, inventory, etc. For each business process we will have a different star.
The star schema of the previous slide; things become simplified.
We create the fact tables with real (physical) records; we do not run the joins at run time. This is the reason that in pivot4j we analyze a physical, real star by placing the dimensions of our requirements, and the MDX is generated automatically. Once a star is created, it does not matter how you analyze it.
Suppose there are a hundred records in each table and four tables are involved in a query that needs a join, and the join returns 40 rows for a specific join query. To retrieve those 40 rows we have computed 100×100×100×100 steps. If those 40 records are instead placed in a fact table which has 1000 total rows, then in the worst case we achieve the correct output in 1000 steps in the star instead of 100,000,000 steps. Ultimately we achieve enormous performance.
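The arithmetic in this note can be checked directly (a toy calculation using the note's own numbers):

```python
# Brute-force join of four 100-row tables examines every row combination.
join_steps = 100 ** 4          # 100,000,000 combinations
# Worst-case scan of a precomputed 1000-row fact table:
fact_scan_steps = 1000
speedup = join_steps // fact_scan_steps
# speedup -> 100000, i.e. the star is five orders of magnitude cheaper
```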
When we get the star schema, we collapse the hierarchies into a single table, i.e. time is now in a single table, meaning we avoid the sub-tables in the form of PK/FK relations. Now the name of a column, say city, will be used in the dimension table instead of an FK. This may result in loss of information: previously every city carried the province's FK, but now we cannot tell the dependency of cities just by looking at the diagram. The disadvantage is that you cannot tell which element is a subset of which element, or what an element's level in the hierarchy is: a loss of information. The benefit is a simple schema with few tables compared to the previous hundreds; another disadvantage is the additional space. A simple example follows on the next slide.