2. “It was the best of times ….” With apologies to Dickens, today there are many choices in
data management. It is truly a “best of times” moment for choice. That choice is a
double-edged sword: databases are not created equal, and neither are problems.
Database designs have inherent tradeoffs forced by the problem the DMS was intended to
solve. Selecting the wrong technology can doom a project at worst, or at best cost it
millions over the lifetime of the application.
3. This talk isn't going to identify a “best database” between these two technologies. As we
will see, “best” is determined by the fit to the particular problem being solved. What I hope
you will gain from our time today is a better understanding of the core components, design
tradeoffs, and intended use cases, so you can make better choices on your next data
management project.
4. Credit – LandScape according to 451 Group, 2012.
Introduction
Databases have been around for over 50 years. From the beginning of electronic
computation, data storage has been fraught with challenges: what to store, the
format in which to store it, how to retrieve it later, how to protect it, and how to share it.
The challenges of persistence are persistent even today, after 60-odd years of computing.
Being on the technical side of database sales for over 14 years, I've learned that “one size
doesn't fit all” when it comes to data management. Different problems often demand
different approaches. The last 5 or so years have given us an explosion of NewSQL and NoSQL
technologies, all aimed at better solving some part of the persistence problem.
5. Great summary - http://en.wikipedia.org/wiki/A_Tale_of_Two_Cities
http://www.sparknotes.com/lit/twocities/
I chose “A tale of two databases” as the title for today's talk, with apologies to Dickens, as a
motivator to look at two very different database products within the Actian portfolio.
Actian has a large offering of data management and integration products, and I encourage
you to check out our website for the larger picture, but for this discussion we're going to
focus on, and look under the covers of, only two products: Versant and ParAccel SMP (aka
Vectorwise). We will see how they tick, and what makes one an operational database and the
other a powerful analytic database.
Both are enterprise databases, each with thousands of deployments, but what I find interesting
as a systems engineer is where they share design concepts and the key areas where they
differ.
6. Architecture Overview
The flavor and color of a city is conveyed through its architecture and inhabitants. Without
straining the analogy, the style of a database is also understood through its architecture and
the components from which it's made. Like any pair of modern cities, our two database
protagonists have much in common, but there are important differences which should guide
a systems designer's choice of technology.
Cast of Characters – Our Cities
•Versant Object Database
•ParAccel SMP aka Vectorwise
A Tale of Two Cities is a story about, well, two cities: London and Paris around the time of
the French Revolution. The cities serve as the main characters, with their politics, geography,
and inhabitants providing the details and coloring for the story. A third major character, or
theme, in the Dickens novel is water.
7. Daily life for early cities centered around the water. They were built on water to provide
economic advantage and improve the quality of life. Water is life. Uncontrolled or
contaminated, water can also be the ruin of a city. Early city inhabitants weren’t always too
careful with what was put into that life-giving river or lake. Fortunately, today we know how
our water cycles work and are much more careful, even reclaiming the once mistreated
bodies of water.
In our modern-day story, our water is data. It flows, it changes, and it has a life cycle all its
own. Data is life for companies today. How it is managed, shaped, and used by a company
greatly affects its overall prosperity.
A company’s information is just as important to it as water was to those cities. As with water,
care must be taken both to store it and to let it flow, creating value from its huge potential.
** kite boarder pictured is the author enjoying water’s potential on Lake Michigan
8. If our protagonists are the databases, our story needs some form of antagonists which our
technical heroes can overcome.
Data management projects have different concerns and the tools used for the project must
match the concerns.
12. Vectorwise is typically deployed at the heart of the BI/Reporting system to provide high
speed reporting. Actian partners with the leading BI & Reporting vendors.
13. Please forgive the marketing here, but the cost-effective commodity hardware shows how
well Vectorwise’s redesigned query engine takes advantage of new CPU and multi-core
designs. More on this later.
15. A picture can explain the complexity better.
This is actually a map of a schema: the SID (Shared Information and Data) model.
It has deep inheritance, sometimes 15 levels or more, and collections all over, most of them
polymorphic.
16. With those typical use cases in mind, let's see how these technologies approach the data
management problem.
17. Databases share some common structures when viewed at a high level. The common
elements come from the fact that they are solving the same problems by different
means or with a different focus. But those common structures vary greatly in their
implementation and tradeoffs, and that is what makes one system excel at fast execution of
ad hoc queries and another at navigating a complex telecommunications network.
18. The data models employed by these two systems again have some similarities, albeit with
different naming conventions and a few wrinkles in how their respective schemas are defined.
Both systems support the basic data types (chars, ints, floats, strings) with minor variances
in width. In both systems, these basic types are used to compose more complex
structures, tables or classes, which on the surface look pretty similar. The Vectorwise data
model is based on the SQL standard and supports most of the SQL types. Both the data
definition language (DDL) and the data manipulation language (DML) are SQL: SQL is used to
create table definitions and to insert, update, or delete rows. We won't be going into the SQL
details here as most people are familiar with the model, but let's compare it to the Versant
model, because here we see some major differences.
19. Where the two data models diverge is in the object database's need to support
abstractions commonly found in object-oriented programming languages. These
concepts include pointers, type inheritance, and collections. This doesn't imply that these
concepts can't be expressed in an RDB like Vectorwise; in fact, ORM tools like JPA or
Hibernate help manage the persistence problem by hiding the relational data model and SQL
from the application developer. However, this hiding isn't without considerable cost in
operational friction, also known as impedance mismatch in the OODB literature.
20. We see here the central SQL focus for Vectorwise.
21. With Versant, we see the application client built with object management resources: cache,
transaction manager, and transport over the network.
Part of the friction comes from dealing with the OO concepts mentioned above. The Versant
backend supports these abstractions natively, which is best understood with an example.
25. Annotations within the Java code are coupled with an added compilation step that extracts
the schema and gives the Java application a direct line into the database.
With JPA, the persistent class's byte code is modified to support change tracking, data
marshaling, cascading persistence, and on-demand object loading logic. Annotations indicate
which classes are destined for the database and support the nuances of how attributes should
be stored. Interestingly, with Versant JPA (V/JPA), you need far fewer attribute annotations
because the database better understands OO concepts like inheritance and collections.
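To make the "annotations plus schema extraction" idea concrete, here is a minimal sketch. It does not use the real JPA annotations or Versant's tooling: the annotation `PersistentSketch`, the two sample classes, and `SchemaExtractorSketch` are all hypothetical names invented for illustration, and reflection at run time stands in for the extra compilation step.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical marker annotation, standing in for a real JPA @Entity.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PersistentSketch {}

@PersistentSketch
class PublicationSketch {
    String title;
}

@PersistentSketch
class BookSketch extends PublicationSketch {
    List<String> authors = new ArrayList<>();
    transient int cachedHash; // transient fields are not part of the schema
}

class SchemaExtractorSketch {
    // Walk the class hierarchy and collect "Class.field:Type" entries.
    static List<String> describe(Class<?> type) {
        List<String> schema = new ArrayList<>();
        if (!type.isAnnotationPresent(PersistentSketch.class)) {
            return schema; // not marked persistent, nothing to store
        }
        for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                int mods = f.getModifiers();
                if (Modifier.isTransient(mods) || Modifier.isStatic(mods) || f.isSynthetic()) {
                    continue;
                }
                schema.add(c.getSimpleName() + "." + f.getName()
                        + ":" + f.getType().getSimpleName());
            }
        }
        return schema;
    }
}
```

Note that the inherited `title` field and the `authors` collection need no per-attribute annotation in this sketch, which is the point the notes make about a store that understands inheritance and collections.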
28. Communications
Communications for both these systems are similar. A Java application for Vectorwise would
use JDBC to query and return data sets, which could then be used to construct objects
if required by the application's object model.
A JPA O/RM layer could be used here to hide the data-set-to-object translation if desired, but
that isn't really Vectorwise's nature; a more typical use would be a BI application accessing
the contents.
Versant JPA uses an internal protocol built with RPC against the object server to load or
update objects within the JPA programming interface. Objects are marshaled in a binary
form and instantiated in the JVM for use by the application. In some cases incomplete
objects, called hollow objects, are created inside the VM, but the lazy loading protocol ensures
they will be fully loaded prior to use by the application.
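The hollow-object idea can be sketched in a few lines. This is only an illustration of lazy loading in general, not Versant's protocol: `HollowRef`, `LazyLoadingDemo`, and the in-memory map playing the role of the object server are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// A hollow reference knows the object's identity but holds no state
// until the application first touches it.
class HollowRef<T> {
    private final long id;
    private final LongFunction<T> loader; // stands in for a server round trip
    private T target;                     // null while the object is hollow

    HollowRef(long id, LongFunction<T> loader) {
        this.id = id;
        this.loader = loader;
    }

    boolean isHollow() {
        return target == null;
    }

    // First use triggers the load; later uses reuse the loaded object.
    T get() {
        if (target == null) {
            target = loader.apply(id);
        }
        return target;
    }
}

class LazyLoadingDemo {
    static int serverCalls = 0; // counts simulated round trips
    static final Map<Long, String> SERVER = new HashMap<>();
    static {
        SERVER.put(42L, "Charles Dickens");
    }

    static String fetch(long id) {
        serverCalls++;
        return SERVER.get(id);
    }
}
```

Creating `new HollowRef<>(42L, LazyLoadingDemo::fetch)` costs no round trip; the first `get()` performs exactly one, and repeated calls perform none.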
29. Transactions are central to the operation of both systems. They are the means through
which all data flows in and out of the server. Data creation, updates, deletions, and even
schema manipulation itself are bounded by a transaction.
In 1983, Haerder and Reuter coined the term ACID [1] to describe transactions.
Both Versant and ParAccel are ACID databases; however, they go about it through different
mechanisms. This brings us to our next comparison, locking and versioning.
[1] Haerder, T.; Reuter, A. (1983). "Principles of transaction-oriented database
recovery". ACM Computing Surveys 15 (4): 287. doi:10.1145/289.291
30. Locking vs Multiversioning
Versant uses a 2-phase locking protocol which gathers locks on all the objects being used to
ensure no two transactions attempt to write to the same data (object) at the same time. This is
implemented with a lock table and a transaction wait graph. Shared (read) locks are collected
as the transaction works with data. They are then followed up with update (semi-exclusive)
or write (exclusive) locks when the transaction attempts to change the data. Deadlocks are
detected, and a timeout prevents a transaction from waiting forever.
With this approach, updates are done in place on the existing data. Very likely the same
physical pages in memory and on disk that the object was read from are the ones updated. The
locks ensure transaction serialization.
I should mention that Versant supports both pessimistic and optimistic locking schemes.
Even optimistic locking uses the read and write locks temporarily as objects are read or the
transaction commits.
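The lock-table mechanics described above can be sketched as follows. This is a deliberately small model, not Versant's implementation: the class `LockTableSketch` is hypothetical, it has only shared and exclusive modes (no update lock), and instead of blocking, detecting deadlocks, or timing out, `tryLock` simply reports whether the caller would have to wait.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class LockTableSketch {
    enum Mode { SHARED, EXCLUSIVE }

    private static final class Entry {
        final Set<Integer> holders = new HashSet<>();
        Mode mode = Mode.SHARED;
    }

    private final Map<Long, Entry> locks = new HashMap<>();

    // Returns true if the lock is granted; false means the caller would wait.
    boolean tryLock(int txn, long objectId, Mode requested) {
        Entry e = locks.computeIfAbsent(objectId, k -> new Entry());
        if (e.holders.isEmpty()) {
            e.holders.add(txn);
            e.mode = requested;
            return true;
        }
        boolean soleHolder = e.holders.size() == 1 && e.holders.contains(txn);
        if (soleHolder) {
            // Re-request or upgrade by the only holder.
            if (requested == Mode.EXCLUSIVE) {
                e.mode = Mode.EXCLUSIVE;
            }
            return true;
        }
        if (e.mode == Mode.SHARED && requested == Mode.SHARED) {
            e.holders.add(txn); // readers can share
            return true;
        }
        return false; // conflicting request
    }

    // Second phase of 2PL: all locks are released together at commit or abort.
    void releaseAll(int txn) {
        for (Entry e : locks.values()) {
            e.holders.remove(Integer.valueOf(txn));
        }
        locks.values().removeIf(e -> e.holders.isEmpty());
    }
}
```

Two readers can hold a shared lock on the same object, but neither can upgrade to exclusive until the other has released, which is exactly the situation where a real lock manager would queue the request and watch for deadlock.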
The counterpart in Vectorwise is a multiversion concurrency control (MVCC) system
whereby each transaction sees a consistent database at a given point in time: a snapshot
controlled by the transaction ID. A given transaction won't see a half-completed transaction
operating on the same data because other transactions don't overwrite the original data;
they create a new version with a later transaction ID so as not to contaminate earlier
transactions. No locks or wait graphs need be maintained. Deleted and updated entries
need to be purged if space is a concern.
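A toy version store shows the snapshot idea. This is a generic MVCC sketch under simplifying assumptions, not Vectorwise's design: `MvccStoreSketch` is a hypothetical class, every write commits immediately as its own transaction, and a snapshot is just the ID of the newest committed transaction at the time the reader started.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MvccStoreSketch {
    private static final class Version {
        final long commitTxn;
        final String value;

        Version(long commitTxn, String value) {
            this.commitTxn = commitTxn;
            this.value = value;
        }
    }

    // Each key maps to its versions, oldest first.
    private final Map<String, List<Version>> versions = new HashMap<>();
    private long lastCommitted = 0;

    // A reader's snapshot is the newest committed transaction ID.
    long beginSnapshot() {
        return lastCommitted;
    }

    // Writers append a new version instead of overwriting the old one.
    long commitWrite(String key, String value) {
        long txn = ++lastCommitted;
        versions.computeIfAbsent(key, k -> new ArrayList<>()).add(new Version(txn, value));
        return txn;
    }

    // Return the newest version visible to the given snapshot, or null.
    String read(String key, long snapshot) {
        String visible = null;
        for (Version v : versions.getOrDefault(key, Collections.<Version>emptyList())) {
            if (v.commitTxn <= snapshot) {
                visible = v.value;
            }
        }
        return visible;
    }

    // Drop versions that no snapshot at or after oldestActiveSnapshot can see.
    int purge(long oldestActiveSnapshot) {
        int removed = 0;
        for (List<Version> chain : versions.values()) {
            int keepFrom = 0;
            for (int i = 0; i < chain.size(); i++) {
                if (chain.get(i).commitTxn <= oldestActiveSnapshot) {
                    keepFrom = i;
                }
            }
            chain.subList(0, keepFrom).clear();
            removed += keepFrom;
        }
        return removed;
    }
}
```

A reader that took its snapshot before an update keeps seeing the old value without taking any lock, and `purge` is the housekeeping step the notes refer to when space is a concern.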
We have two different means of managing concurrency and serialization of transactions. The
Versant method is historically similar to that of RDBMSs which support row and table locking.
Vectorwise's MVCC increases throughput at the expense of data growth and needed
propagation events.
If you require strict serialization of transactions, or want to limit growth, the locking model
will suit your needs. If analytic speed and concurrent reads are your core concern,
MVCC will be faster, at the possible cost of stale data.
We are starting to see why Vectorwise is used for analytic, read-heavy reporting and Versant
finds itself used for operational processing.
32. One major difference between these technologies is in how they physically store
data, both on disk and in memory. Of particular interest to me is Vectorwise's
columnar approach, which is designed for pure analytical efficiency. This is in contrast to the
underlying storage model used by Versant, which is similar to what is found in many
database systems.
Versant uses an older design, the N-ary Storage Model, but there are some interesting tricks it
uses to optimize performance for networked object graphs.
Common to most database storage systems are the concepts of volumes and pages. A
volume is a collection of pages, and Versant can have as many volumes as needed for the
database. A volume is mapped to a file and can be located on anything from raw devices to
storage area network (SAN) drives.
[DeWitt] [Zukowski]
NSM = N-ary Storage Model: a row contains all columns
DSM = Decomposed Storage Model: N attributes stored as N vertical storage elements
PAX = Partition Attributes Across: multiple columns stored on a page, but attributes stored
vertically within the page
Vectorwise: block size must be set prior to table creation.
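The difference between the row-wise (NSM) and column-wise (DSM) layouts listed above can be shown with plain arrays. This is an in-memory illustration only; `StorageLayoutSketch` and its three-attribute record are assumptions made for the example and say nothing about either product's on-disk format.

```java
class StorageLayoutSketch {
    // NSM: one record holds every attribute of a row.
    static final class Row {
        final int id;
        final String name;
        final double amount;

        Row(int id, String name, double amount) {
            this.id = id;
            this.name = name;
            this.amount = amount;
        }
    }

    // DSM: one array per attribute; position i across the arrays is row i.
    static final class Columns {
        final int[] id;
        final String[] name;
        final double[] amount;

        Columns(int n) {
            id = new int[n];
            name = new String[n];
            amount = new double[n];
        }
    }

    static Columns decompose(Row[] rows) {
        Columns c = new Columns(rows.length);
        for (int i = 0; i < rows.length; i++) {
            c.id[i] = rows[i].id;
            c.name[i] = rows[i].name;
            c.amount[i] = rows[i].amount;
        }
        return c;
    }

    // Row layout: the scan visits whole records to read one attribute.
    static double sumAmountNsm(Row[] rows) {
        double total = 0;
        for (Row r : rows) {
            total += r.amount;
        }
        return total;
    }

    // Column layout: the scan reads one dense array and nothing else.
    static double sumAmountDsm(Columns c) {
        double total = 0;
        for (double a : c.amount) {
            total += a;
        }
        return total;
    }
}
```

An analytic query such as a sum over one attribute touches only the `amount` array in the column layout, while fetching one complete object is a single record access in the row layout.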
33. Versant does allow for variability on the page, multiple types of objects or variable length
structures.
The min/max stats help reduce the columnar blocks that need be evaluated for a query.
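How min/max statistics cut down the blocks a query has to read can be sketched like this. `MinMaxPruningSketch` is a hypothetical class, the blocks are tiny integer arrays assumed to be non-empty, and the real system's block format and statistics are not described in these notes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MinMaxPruningSketch {
    // A column block plus the smallest and largest value it contains.
    static final class Block {
        final int[] values;
        final int min;
        final int max;

        Block(int[] values) { // assumes a non-empty block
            this.values = values;
            this.min = Arrays.stream(values).min().getAsInt();
            this.max = Arrays.stream(values).max().getAsInt();
        }
    }

    // Indexes of blocks whose [min, max] range overlaps the query range [lo, hi].
    // Blocks outside the range are skipped without reading their values.
    static List<Integer> candidateBlocks(List<Block> blocks, int lo, int hi) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < blocks.size(); i++) {
            Block b = blocks.get(i);
            if (b.max >= lo && b.min <= hi) {
                candidates.add(i);
            }
        }
        return candidates;
    }
}
```

With three blocks covering 1..9, 10..19, and 20..30, a predicate such as "value between 12 and 15" only needs the middle block; the other two are ruled out from their statistics alone.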
34. Compression of data both on disk and in RAM reduces the I/O bottlenecks that large data
systems confront today. By decompressing into the CPU’s cache, VW keeps data compact on
its way from disk and RAM and makes the most of the bandwidth into the processor.
Column structure works really well for compression. Similar data is grouped together
allowing VW to pick an optimal compression strategy. Here optimal is not just storage
density, but also ease of decompression into the CPU cache for later processing.
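To see why a column of similar values compresses well and decompresses cheaply, here is a plain frame-of-reference encoder. It is chosen purely for illustration and is not a description of Vectorwise's actual codecs; `FrameOfReferenceSketch` is a hypothetical class, and it assumes a non-empty block whose values span a range of at most 255.

```java
class FrameOfReferenceSketch {
    // One shared base value plus a one-byte offset per entry.
    static final class Encoded {
        final int base;
        final byte[] offsets;

        Encoded(int base, byte[] offsets) {
            this.base = base;
            this.offsets = offsets;
        }
    }

    static Encoded encode(int[] values) { // assumes values is non-empty
        int min = values[0];
        int max = values[0];
        for (int v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        if ((long) max - min > 255) {
            throw new IllegalArgumentException("range too wide for one byte per value");
        }
        byte[] offsets = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            offsets[i] = (byte) (values[i] - min);
        }
        return new Encoded(min, offsets);
    }

    // Decompression is one add per value in a tight loop.
    static int[] decode(Encoded e) {
        int[] out = new int[e.offsets.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = e.base + (e.offsets[i] & 0xFF); // treat the byte as unsigned
        }
        return out;
    }
}
```

Four-byte integers shrink to one byte each here, and the decode loop is simple enough to run over a cache-sized block, which is the trade-off between density and ease of decompression that the notes describe.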
36. Although Versant uses a traditional layout, where objects are located on a given data page,
there are some tricks it uses to efficiently locate connected objects.
Pages are further broken down into slots used to store object instances. Multiple object
instances may be stored on a page and accessed through the object's slot location. Larger
objects will span contiguous pages. Page size in Versant is a modest 16K bytes; this is often
large enough to hold many objects and still small enough not to waste too much space on
deleted objects. Normally, objects of the same type get stored in the same page in the next
available slot, but as an optimization, it is possible to co-locate a parent and its children on
the same page. This extra effort results in extremely efficient object loading when the
parent is frequently used with its children.
38. One final point about the LOID in Versant. LOID references are designed to be
cross-database references. Here we have an application using 4 objects, but they come from two
different databases.
This gives the application designer great flexibility in deciding how to partition data. The
application simply connects to all the databases involved in the cross-db references.
Transactions that span databases use a 2-phase commit protocol.
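A cross-database reference can be modeled as an identifier that carries its database along with it. The layout below (a database ID plus an object number) and the names `LoidSketch`, `Loid`, and `Session` are assumptions made for illustration; they are not Versant's API, and in-memory maps stand in for the connected databases.

```java
import java.util.HashMap;
import java.util.Map;

class LoidSketch {
    // Hypothetical logical object ID: which database, and which object in it.
    static final class Loid {
        final int dbId;
        final long objectNo;

        Loid(int dbId, long objectNo) {
            this.dbId = dbId;
            this.objectNo = objectNo;
        }
    }

    // A session holds one connection per database the application has opened.
    static final class Session {
        private final Map<Integer, Map<Long, Object>> connected = new HashMap<>();

        void connect(int dbId, Map<Long, Object> database) {
            connected.put(dbId, database);
        }

        // The reference itself says where to look, so resolution is a direct lookup.
        Object resolve(Loid loid) {
            Map<Long, Object> db = connected.get(loid.dbId);
            if (db == null) {
                throw new IllegalStateException("not connected to database " + loid.dbId);
            }
            return db.get(loid.objectNo);
        }
    }
}
```

An object in one database can hold a `Loid` pointing into another, and the application follows it the same way as a local reference, provided it has connected to both databases.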
39. What good is a database without a means to find answers to our particular questions or
to efficiently service an application's demand for data? As with the other components we've
looked at, there are some similarities between these two technologies, but also some big
differences.
Indexes in Vectorwise are typically not needed. Often, VW is set up so the compressed DB
lives entirely in memory, and the automatic page indexes and the redesigned query engine are
enough that scanning the data without indexing performs well and no index tuning
is required.
Versant, on the other hand, allows nearly any attribute or even collection to be indexed.
Versant’s query engine will then use the index automatically or with hints supplied by the
user.
40. Circa 2003
SQL vs C benchmark on TPC-H
The difference between the database and the custom C program is huge. Why is the
overhead of using a database so high? What’s being left on the table?
This difference started the X100 project, which set out to reclaim the 100-times loss in
performance.
We’ve seen the storage model change for VW, but let's look further at the query
processing.
41. Each level of the data handling was studied for performance loss.
Compiler optimizations are easier to take advantage of in smaller units; they often don’t get
fully exploited in large programs.
Modern CPUs have better instruction sets and larger on-chip caches which can be used for
vector processing.
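The contrast between processing one tuple per call and processing a vector of values per call can be sketched as below. This is an illustration of the general technique, not X100 code: `VectorizedScanSketch`, the vector size of 1024, and the price-times-quantity query are assumptions chosen for the example.

```java
class VectorizedScanSketch {
    static final int VECTOR_SIZE = 1024; // illustrative; sized to stay cache-resident

    // Tuple-at-a-time: one primitive call per row.
    static double multiplyOne(double price, int qty) {
        return price * qty;
    }

    static double sumTupleAtATime(double[] price, int[] qty) {
        double total = 0;
        for (int i = 0; i < price.length; i++) {
            total += multiplyOne(price[i], qty[i]);
        }
        return total;
    }

    // Vector-at-a-time: one call handles a whole slice in a small, tight loop
    // that a compiler can unroll and map onto SIMD instructions.
    static void multiplyVector(double[] price, int[] qty, int from, int to, double[] out) {
        for (int i = from; i < to; i++) {
            out[i - from] = price[i] * qty[i];
        }
    }

    static double sumVectorized(double[] price, int[] qty) {
        double[] buffer = new double[VECTOR_SIZE]; // reused for every vector
        double total = 0;
        for (int from = 0; from < price.length; from += VECTOR_SIZE) {
            int to = Math.min(from + VECTOR_SIZE, price.length);
            multiplyVector(price, qty, from, to, buffer);
            for (int i = 0; i < to - from; i++) {
                total += buffer[i];
            }
        }
        return total;
    }
}
```

Both methods compute the same result; the difference is that the vectorized form pays the per-call overhead once per 1024 values instead of once per value and keeps its working buffer small enough to sit in the CPU cache.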
43. The work on Ingres is critical to Vectorwise (ParAccel SMP), as the main interface to
both Ingres and Vectorwise is the same. It is not until after the optimizer, which processes the
SQL query and generates the X100 algebra, that the two products diverge.
After VW generates the result set, it is the Ingres components that make it available
to the application.
Ailamaki, DeWitt, Hill, Skounakis: Weaving Relations for Cache Performance
44. Taking all the performance features into account for VW query processing:
this is great for reporting, where your data isn’t changing frequently.
45. With Versant, queries are typically used to locate the beginning of a graph or top-level
objects. Once the starting point is identified, the connected objects are frequently retrieved
by the application as required (lazy loading) or automatically through default fetching.
Group loading saves round trips to the server and is much more efficient on the network.
On the Versant side, queries are written in OQL or JPQL. This example is JPQL. The Book has a
simple collection, Authors, and we want to find an Author named “Smith”.
Notice the syntax is a little SQL-like, but we operate directly on the collection
Book.authors, using “auth” as a working variable.
On execution, the Book extent would be searched for all the books with a Smith author.
This would end up scanning all the books and evaluating the Authors collections, returning
the object ids for the matching books.
ResultList holds the objects and the rest of the Java program would process that list.
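The slide's query text is not part of these notes, so the JPQL string below is an assumed reconstruction of its shape, and the Java method is an in-memory stand-in for what the server does when it scans the Book extent. `BookQuerySketch` and the `name` attribute on Author are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BookQuerySketch {
    // Assumed form of the query: "auth" ranges over each book's authors collection.
    static final String JPQL =
        "SELECT DISTINCT b FROM Book b, IN (b.authors) auth WHERE auth.name = 'Smith'";

    static final class Author {
        final String name;

        Author(String name) {
            this.name = name;
        }
    }

    static final class Book {
        final String title;
        final List<Author> authors;

        Book(String title, List<Author> authors) {
            this.title = title;
            this.authors = authors;
        }
    }

    // In-memory equivalent: scan every book and evaluate its authors collection.
    static List<Book> booksByAuthor(List<Book> extent, String name) {
        List<Book> resultList = new ArrayList<>();
        for (Book b : extent) {
            for (Author auth : b.authors) {
                if (auth.name.equals(name)) {
                    resultList.add(b);
                    break; // one match is enough; do not add the book twice
                }
            }
        }
        return resultList;
    }
}
```

With a real JPA provider the string would be handed to an entity manager rather than evaluated by hand; the loop is only there to show that the collection is evaluated per book, with no separate join table visible to the application.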
46. The thing about relationships is they don’t change often. By baking them into the server’s
data structure and making them cheap to evaluate, Versant avoids join operations, which
can be quite costly. If you look at typical ORM code, you see a fair amount of join activity
whenever collection classes are involved.
Following a few links down a list can end up as a very expensive group of joins, whereas
managing the references with LOIDs allows for direct navigation to the object. The server
takes advantage of this in query expressions that involve paths or collections, like the
example.
47. Closing Comments
This brings us to the end of our tale, and I hope you enjoyed our time together as much as I
did. Each of the components we've examined should have given you insight into the design
choices and tradeoffs made by the different engineering teams. When taken as a whole, they
provide a consistent, powerful framework for solving hard real-world problems. Each of these
products has thousands of users who rely on it for business-critical
applications. The engineers who built those applications made strategic choices for the
data management system at the heart of their project.