CS828 P5 Individual Project v101

Data Migration in Schemaless NoSQL Databases
Version 1.0 Page 1
CS828-1501C-01
ThienSi (TS) Le
Colorado Technical University
Professor: Dr. Kathleen Hargiss
Phase 5: Individual Project
March 15, 2015

Abstract
This short research paper, written for Phase 5 (Individual Project) of the course
CS828-1501C-01 Advanced Topics in Database Systems, discusses the concepts of NoSQL
databases such as Cassandra, MongoDB, Neo4J, and Riak. These databases adopt an
aggregate data model that supports application-oriented aggregates, embrace schemaless
data, run on clusters in a distributed network, and often trade data consistency for
other useful properties. The paper describes the concepts associated with NoSQL’s
schemalessness and then focuses on data migration, especially on how to ensure that the
data stored in a database matches the implicit schema embedded in the applications after
that implicit schema has changed. The in-depth discussion, which also covers the general
principles of conducting data migration and the test strategy in NoSQL databases,
consists of four main sections:
A. The concept of NoSQL databases
This section discusses the noDefinition of NoSQL databases through their distinct
characteristics, gives a brief comparison between NoSQL and traditional relational
databases, and describes the recent emergence of NoSQL databases in Internet-centric
services.
B. Aggregate Data Models
This section covers the aggregate data model and discusses some of its pros and cons.
C. Schemalessness and Implicit Schema
This section, one of the primary discussions, describes the central concepts of the
schemaless database and the implicit schema in NoSQL databases.
D. Data Migration in NoSQL databases with implicit schema
This section gives an in-depth discussion of data migration with an implicit schema.
It covers the principles, strategy, and test options for migrating data when the
application code contains the implicit schema, with two demonstration examples.
The paper also provides a list of the references used in this individual project at
the end of the document.

In the modern era of data and information, several novel standards that have emerged
in computing, automation, and technologies have produced enormous amounts of electronic
data. Corporations, governments, and the academic community in both the public and
private sectors have turned to database management systems (DBMS) to help them operate
enterprises and conduct business locally and globally in a very competitive market.
According to Bloomberg Businessweek (2011), many Fortune 500 companies have used
traditional relational DBMSs from one vendor or another to conduct and control their
business. However, with the vast amount of nonuniform electronic data and custom data
fields generated by Web estates and services such as cloud computing, business
intelligence, and science and technology, the NoSQL database, a schema-free or
schemaless database with an aggregate data model, has emerged as a solution for handling
big data (Chen, Chiang and Storey, 2012). Data migration has become a primary issue for
many companies with multiple types of applications in web services, e-commerce,
business intelligence, e-government and politics, smart health, and security and public
safety.
A. NoSQL Databases
NoSQL is an acronym for Not Only Structured Query Language (Hargiss, 2015).
1. What is the definition of a NoSQL database?
According to Sadalage and Fowler (2012), NoSQL databases have a few distinct
characteristics:
- They do not use SQL (Structured Query Language).
- They are usually open-source projects.
- Most of the NoSQL databases are driven by the enterprises’ need to run on
clusters.
- They are based on the needs of early 21st-century Web estates.
- They are used in polyglot persistence. That means different data stores are used
in different circumstances.
- Perhaps the most unusual characteristic is that a NoSQL database operates
without a schema (i.e., it is schema-free or schemaless, with an implicit schema).
This crude set of characteristics does not amount to a definition, and there is no
standard for NoSQL databases. Therefore, Sadalage et al. (2012) describe the NoSQL
database as having a noDefinition.
2. NoSQL database versus the traditional relational DBMS
A NoSQL system is a non-relational data storage system that does not require a
relational schema or joins and that relaxes the ACID properties to some degree. NoSQL
database management systems have recently emerged as an alternative to the traditional
relational database management system (RDBMS) (Connolly and Begg, 2014) for several
typical reasons:
a. An RDBMS database cannot hold universal, be-all and end-all complex relations.
b. There are other database languages and other data storage tools for databases.
c. A NoSQL solution is more acceptable and suitable for a client’s advanced
Internet-centric applications and services.
d. A NoSQL database provides more freedom, horizontal scalability, and flexibility.
3. The emergence of the NoSQL database
Sadalage et al. (2012) argue that the strictly structured tables of relations in an
RDBMS no longer suit the in-memory data structures of modern applications with large
data needs, such as Facebook and Twitter. In addition, cloud-based services (e.g.,
Amazon S3), dynamically typed languages, and the open-source community have driven the
recent emergence of NoSQL DBMSs such as Cassandra, CouchDB, Neo4J, and HBase. The NoSQL
database appears as a solution for a client’s advanced Web-based applications and
services.
B. Aggregate Data Models
The NoSQL database offers developers and end-users a friendly implementation and
usage as an alternative to the traditional relational DBMS. It requires more programming
but less database design. On the positive side, it offers a flexible schema or no schema
at all, it allows quicker and cheaper setup, it has massive vertical or horizontal
scalability, and it relaxes data consistency for higher performance and availability.
On the negative side, it uses no declarative query language, so more programming is
required to obtain the needed information, and because it relaxes data consistency,
there are fewer guarantees that the information is meaningful. In addition, while
traditional relational databases cannot easily handle the issues of big data, expandable
horizontal scalability, complex data formats, and sophisticated manageability, NoSQL
databases employ map-reduce computation (Date, 2006). Map-reduce is a programming model
that uses a parallel, distributed algorithm to process and generate large data sets on
big clusters of servers and processors with Mappers and Reducers. Notice that the
outcomes of the Mappers and the Reducers are stored as materialized views in cached
memory (Sadalage et al., 2012).
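The map and reduce steps described above can be illustrated with a minimal, single-process
sketch in Java. It is only an illustration under stated assumptions: the class name
MapReduceSketch, its method names, and the order-item layout (a "product" and a "price"
field) are hypothetical, and a real NoSQL cluster would run the Mappers and Reducers in
parallel across many nodes rather than in one JVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of map-reduce; not a cluster implementation.
class MapReduceSketch {

    // Map step: one order (a list of order items) emits (product, price) pairs.
    static List<Map.Entry<String, Double>> map(List<Map<String, Object>> orderItems) {
        List<Map.Entry<String, Double>> emitted = new ArrayList<>();
        for (Map<String, Object> item : orderItems) {
            String product = (String) item.get("product");
            Double price = ((Number) item.get("price")).doubleValue();
            emitted.add(Map.entry(product, price));
        }
        return emitted;
    }

    // Reduce step: combine all values emitted for the same key into one total.
    static Map<String, Double> reduce(List<Map.Entry<String, Double>> emitted) {
        Map<String, Double> totals = new TreeMap<>();
        for (Map.Entry<String, Double> pair : emitted) {
            totals.merge(pair.getKey(), pair.getValue(), Double::sum);
        }
        return totals;
    }

    // Run the map step over every order, then reduce all emitted pairs.
    static Map<String, Double> revenuePerProduct(List<List<Map<String, Object>>> orders) {
        List<Map.Entry<String, Double>> emitted = new ArrayList<>();
        for (List<Map<String, Object>> order : orders) {
            emitted.addAll(map(order));
        }
        return reduce(emitted);
    }
}
```

Because each call to map touches only one order, the orders could in principle be
processed on different nodes, with only the emitted pairs sent on to the reducers.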
1. The NoSQL databases’ aggregate data model
In contrast with a traditional relational database, which uses the strict entity-relation
model, NoSQL databases use an aggregate data model that contains aggregate data.
Aggregate data is a complex structured record of nested data. An aggregate, a term from
Evans (2004), is a collection of related objects treated as a unit of data. The aggregate
data model is an aggregate-oriented data model for a particular NoSQL solution. It
consists of four model categories: key-value, document, column-family, and graph
(Sadalage et al., 2012). The NoSQL database usually uses two primary aggregate data
models: key-value, or the big hash table (e.g., Amazon S3, Voldemort, Scalaris), and
schemaless (e.g., Cassandra, CouchDB, Neo4J).
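The idea of an aggregate stored and retrieved as one unit can be sketched with an
in-memory map standing in for a key-value store. The class AggregateStoreSketch and its
methods are hypothetical names chosen for illustration; they are not the API of any of the
products listed above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a key-value store; not a real database client.
class AggregateStoreSketch {
    private final Map<String, Map<String, Object>> store = new HashMap<>();

    // The whole aggregate is written under one key as a single unit.
    void put(String key, Map<String, Object> aggregate) {
        store.put(key, aggregate);
    }

    // The whole aggregate is read back by key; the store does not look inside it.
    Map<String, Object> get(String key) {
        return store.get(key);
    }

    // Build a customer aggregate with its order and order items nested inside.
    static Map<String, Object> customerAggregate() {
        Map<String, Object> item = new HashMap<>();
        item.put("product", "Database Course");
        item.put("price", 2122.00);

        Map<String, Object> order = new HashMap<>();
        order.put("orderid", "18319888");
        order.put("orderItems", List.of(item));

        Map<String, Object> customer = new HashMap<>();
        customer.put("customerid", "CTU_online");
        customer.put("order", order);
        return customer;
    }
}
```

The customer, its order, and the order items travel together under one key, which is what
makes the aggregate a natural unit for distribution across a cluster.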
2. Some pros and cons of the aggregate data models
Each of these aggregate data models has pros and cons. In a key-value model, the pros
are that it is very fast, very scalable, simple, and able to distribute horizontally.
The con is that many data structures or objects cannot be easily modeled as key-value
pairs. In a schemaless model, on the other hand, the pros are that the data model is
richer than key-value pairs, that it offers eventual consistency, that many
implementations are distributed, and that it still provides excellent performance and
scalability. Its con is that there are no ACID transactions or joins.
C. Schemalessness and Implicit Schema
A central theme of NoSQL databases is that they are schemaless. Schemalessness
has a big impact on changes to a database’s structure. Users should keep control over
how data is stored so that they can access both old and new data.
1. Main concept of schemalessness in a NoSQL database
A NoSQL database is ignorant of the schema (that is, a defined structure such as
tables, columns, and data types for storing data and its attributes). It therefore
cannot use a schema to store and retrieve data more efficiently, and it does not apply
its own validation to ensure that different applications do not manipulate the data in
an inconsistent way. However, a schemaless NoSQL database provides freedom and
flexibility in data storage (Moniruzzaman and Hossain, 2013). Because it is schemaless,
a NoSQL database allows users to store data casually. In advanced Internet-centric
services for e-commerce in the digital market, the aggregate records legitimately
contain nonuniform data, where each record may have a different set of fields. For
example, a key-value store allows users to store any data they desire in the database.
Users can efficiently store data and comfortably change the data storage as they learn
more about their project. They can also add new things as they discover them
(Pankowski, 2002).

2. Implicit schema in a NoSQL database
Since a NoSQL database is schema-free, users who want to access aggregate records or
nonuniform data must write programs, such as scripts, that mostly rely on some form of
implicit schema. The implicit schema is a set of assumptions about the data’s structure
in the code that manipulates the data. A schemaless database shifts the strict, fixed
schema into the application code that accesses the data. That means users need to dig
into the application code to understand the data and its associated information
(Sadalage et al., 2012). If the application code is well structured, users are able to
deduce the implicit schema for the useful data and its related information. Otherwise,
they may be stuck on data access. In other words, an implicit schema demands more
programming skill but less design experience from users.
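A small sketch may make the notion of an implicit schema concrete. In the hypothetical
class below, the list of field names is the implicit schema: the database never sees it,
yet every document must satisfy it for the code to work. The class name, method names, and
the choice of IllegalStateException are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical example: the field names below ARE the implicit schema.
class ImplicitSchemaSketch {
    // Assumptions the application makes about every order item document.
    static final List<String> REQUIRED_FIELDS = List.of("product", "price");

    // Return the assumed fields that a document does not actually contain.
    static List<String> missingFields(Map<String, Object> document) {
        List<String> missing = new ArrayList<>();
        for (String field : REQUIRED_FIELDS) {
            if (!document.containsKey(field)) {
                missing.add(field);
            }
        }
        return missing;
    }

    // Read the price, failing loudly when the document violates the implicit schema.
    static double readPrice(Map<String, Object> document) {
        List<String> missing = missingFields(document);
        if (!missing.isEmpty()) {
            throw new IllegalStateException("Schema mismatch, missing fields: " + missing);
        }
        return ((Number) document.get("price")).doubleValue();
    }
}
```

Collecting the assumptions in one place, as REQUIRED_FIELDS does here, is one way to keep
the implicit schema easy to deduce from well-structured code.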
3. A primary problem of data access with the implicit schema
Since the application code that accesses a schemaless NoSQL database contains the
implicit schema, problems arise if multiple applications, developed by different
developers, access the same database. To reduce these problems, users can encapsulate
all database interaction within a single application and integrate it with other
applications using Web services. Another approach is to delineate different areas of an
aggregate for access by different applications.
D. Data Migration in NoSQL databases with implicit schema
In general, data migration is the process of transferring data between storage types,
formats, databases, or computer systems. It is a key consideration in any system
implementation, database integration, upgrade, or consolidation. It is usually
performed programmatically so that the migration is automated (datamigrationpro.com,
2009).
1. Data migration with implicit schema in a NoSQL database
In a NoSQL database, schemalessness provides freedom and flexibility for data
migration within an aggregate record. When developing with NoSQL databases, designers
do not think about a schema; instead they consider other aspects, such as how keys are
assigned and what the data structure inside a value object is in key-value stores, or
what types of relationships exist in graph databases. Even though there is no fixed
schema, the data is stored according to an implicit schema that is defined and contained
in the application code. If the application code cannot parse the data from its
database, a schema mismatch or data inconsistency will occur (cisco.com). Notice that
when a migration must access multiple aggregate records or change the aggregate
boundaries, data migration with implicit schema becomes as complex as it is in an RDBMS.
It is even more complex when users do not understand the set of assumptions about the
data’s structure in the application code that manipulates the data in the aggregate
records.
2. Principles of data migration in NoSQL databases with implicit schema
The data migration process in NoSQL databases is similar to other data migration
processes, except for some minor changes in requirements that come from the implicit
schema. An efficient data migration has some primary mapping phases that include data
extraction, data loading, and data verification, with minimal data loss and with
consistency preserved. Data cleansing is commonly performed to improve data quality. In
principle (Katzoff, datamigrationpro.com, 2014), data migration in NoSQL databases with
implicit schema may consist of five phases (design, extraction, cleansing, loading, and
verification) for applications of moderate to high complexity, in order to match the
requirements of the implicit schema. Three of the five phases are described below
because they are essential:
- Data extraction: the process of retrieving data from homogeneous or
heterogeneous, possibly unstructured, data sources for further data processing.
- Data loading: the part of the ETL (extract, transform, load) process that loads
data into the final target database.
- Data verification: the process of checking different types of data for accuracy
and inconsistencies after the data migration is done.
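The three essential phases listed above can be sketched in a few lines of Java. This is a
simplified illustration, not a migration tool: the class MigrationPhasesSketch is
hypothetical, both stores are plain in-memory maps, and each record is assumed to carry
its own key in an "_id" field that matches the key it is stored under.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the extraction, loading, and verification phases.
class MigrationPhasesSketch {

    // Extraction: copy every record out of the source store.
    static List<Map<String, Object>> extract(Map<String, Map<String, Object>> source) {
        List<Map<String, Object>> records = new ArrayList<>();
        for (Map<String, Object> record : source.values()) {
            records.add(new HashMap<>(record));
        }
        return records;
    }

    // Loading: write each extracted record into the target store under its "_id".
    static void load(List<Map<String, Object>> records,
                     Map<String, Map<String, Object>> target) {
        for (Map<String, Object> record : records) {
            target.put((String) record.get("_id"), record);
        }
    }

    // Verification: every source record must exist in the target with equal content.
    static boolean verify(Map<String, Map<String, Object>> source,
                          Map<String, Map<String, Object>> target) {
        if (source.size() != target.size()) {
            return false;
        }
        for (Map.Entry<String, Map<String, Object>> entry : source.entrySet()) {
            if (!entry.getValue().equals(target.get(entry.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```

A cleansing or transformation step, when needed, would sit between extract and load and
operate on the extracted list.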
According to Katzoff (2014), an efficient data migration strategy may have ten steps,
as shown below:
a. Planning – Identify the baseline and legal original.
b. Analysis and data discovery – Determine if metadata in the sources is sufficient
for target document process.
c. Tool selection
d. Master data management – Harmonize key-value pairs and workflow process.
e. Tool configuration
f. Data cleansing
g. Dry runs
h. Formal testing
i. Production execution
j. Post-production support
After data migration is performed on a NoSQL database, testing offers several options
for minimizing migration errors. Testing options for data migration in a NoSQL database
with implicit schema include the de facto approach to data and content migration
testing, which is based on sampling: a random subset of the data is selected and
inspected. Other options are pre-migration testing, formal design review,
post-integration testing, user acceptance testing, and production testing.
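The sampling approach mentioned above could look roughly like the following sketch, which
picks a random subset of keys and compares the source and migrated records for those keys
only. The class SamplingCheckSketch, its method names, and the use of a fixed seed to make
the sample repeatable are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of sampling-based migration testing.
class SamplingCheckSketch {

    // Pick up to sampleSize keys at random; the seed makes the sample repeatable.
    static List<String> sampleKeys(Map<String, ?> source, int sampleSize, long seed) {
        List<String> keys = new ArrayList<>(source.keySet());
        Collections.sort(keys); // fixed order so the same seed gives the same sample
        Collections.shuffle(keys, new Random(seed));
        return keys.subList(0, Math.min(sampleSize, keys.size()));
    }

    // Return the sampled keys whose migrated record differs from the source record.
    static List<String> mismatches(Map<String, Map<String, Object>> source,
                                   Map<String, Map<String, Object>> target,
                                   int sampleSize, long seed) {
        List<String> bad = new ArrayList<>();
        for (String key : sampleKeys(source, sampleSize, seed)) {
            if (!source.get(key).equals(target.get(key))) {
                bad.add(key);
            }
        }
        return bad;
    }
}
```

Sampling keeps the inspection cheap, at the cost that an error outside the sample goes
unnoticed, which is why it is usually combined with the other testing options.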
3. Example 1 - MongoDB’s data migration
Data migration in a NoSQL database such as MongoDB with implicit schema is an
example showing that implicit schema changes do matter when there are deployed
applications and existing production data in a document data store with a data model
of customer, order, and orderItems. A MongoDB document for this data model is shown
below:
{
  "_id": "31415926AB47E98374D",
  "customerid": "CTU_online",
  "name": "CS828-1501C-01 Inc",
  "since": "01/04/2015",
  "order": {
    "orderid": "18319888",
    "orderdate": "01/04/2015",
    "orderItems": [
      {"product": "Database Course", "price": 2122.00}
    ]
  }
}

Application code that embodies the implicit schema and writes this document structure
to MongoDB is:
// productName, price, and orderItems come from the surrounding application code.
BasicDBObject orderItem = new BasicDBObject();
orderItem.put("product", productName);
orderItem.put("price", price);
orderItems.add(orderItem);
Code to read the document back from the MongoDB database is:
// The field names "product" and "price" are the implicit schema.
BasicDBObject item = (BasicDBObject) orderItem;
String productName = item.getString("product");
Double price = item.getDouble("price");
Adding a preferredShippingType attribute changes the objects but does not require any
change in the database, because MongoDB does not care that different documents do not
follow the same schema. Only the application needs to be deployed. The code has to
ensure that documents that do not have the preferredShippingType attribute can still be
parsed.
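One way to keep older documents parseable is to treat the new attribute as optional when
reading. The sketch below uses a plain map in place of a MongoDB document so that it is
self-contained; the class name OptionalAttributeSketch and the fallback value "STANDARD"
are assumptions, since the paper does not specify a default shipping type.

```java
import java.util.Map;

// Hypothetical sketch: tolerate documents written before preferredShippingType existed.
class OptionalAttributeSketch {
    // Assumed fallback value; the paper does not specify one.
    static final String DEFAULT_SHIPPING_TYPE = "STANDARD";

    // Return the stored shipping type, or the fallback when the attribute is absent.
    static String preferredShippingType(Map<String, Object> customer) {
        Object value = customer.get("preferredShippingType");
        return value == null ? DEFAULT_SHIPPING_TYPE : value.toString();
    }
}
```

Documents saved before the change and documents saved after it can then be read by the
same code path.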
If discountedPrice is introduced and price is renamed to fullPrice, a developer
renames the price attribute to fullPrice and then adds the discountedPrice attribute,
as below:
{
  "_id": "261003OPOELALKJDK",
  "customerid": "CTU_offline",
  "name": "RES860-1501C-01 Inc",
  "since": "01/04/2015",
  "order": {
    "orderid": "18319888",
    "orderdate": "03/21/2015",
    "orderItems": [
      {"product": "Research Course",
       "fullPrice": 2214.00,
       "discountedPrice": 2122.00}
    ]
  }
}

Once the change is deployed, new customers and orders can be saved and read
back properly. However, the price of the product in existing orders cannot be read,
because the code now looks for fullPrice while the old documents have only the price
attribute.
4. Example 2 - Incremental migration
(Source: Chapter 12: Schema Migration from “NoSQL distilled: a brief guide to the
emerging world of polyglot persistence” by Sadalage & Fowler (2012))
Data migration with implicit schema carries a risk of data loss, schema mismatch, and
attribute removal in new aggregate records. When the application changes its code, the
implicit schema also changes. As a consequence, new data may not have all the
attributes that the old data has. To prepare for such a change, developers can use
incremental migration to ensure that the new code can still parse the old data. The
code below reads an order item from Example 1 that carries either the price or the
fullPrice attribute:
BasicDBObject item = (BasicDBObject) orderItem;
String productName = item.getString("product");
// Old documents store the value under "price"; get() returns null if the key is absent.
Double fullPrice = (Double) item.get("price");
if (fullPrice == null) {
    // New documents store the value under "fullPrice".
    fullPrice = (Double) item.get("fullPrice");
}
Double discountedPrice = (Double) item.get("discountedPrice");
When writing the document back, the old attribute price is not saved:
BasicDBObject orderItem = new BasicDBObject();
orderItem.put("product", productName);
orderItem.put("fullPrice", fullPrice);
orderItem.put("discountedPrice", discountedPrice);
orderItems.add(orderItem);

With incremental migration, the application may contain many versions of the object,
each able to translate an old schema to the new schema. When an object is saved back,
it is saved in the new form. This gradual migration of data helps the application
evolve faster.
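The translate-on-read, save-in-new-form cycle can be sketched without a database driver.
In the hypothetical class below, plain maps stand in for MongoDB documents and for the
collection; the class name IncrementalMigrationSketch and its methods are illustrative
names only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of incremental migration on plain maps instead of a MongoDB driver.
class IncrementalMigrationSketch {

    // Translate an order item in either the old or the new shape into the new shape.
    static Map<String, Object> upgrade(Map<String, Object> item) {
        Map<String, Object> upgraded = new HashMap<>(item);
        if (!upgraded.containsKey("fullPrice") && upgraded.containsKey("price")) {
            // Old shape: move the value from "price" to "fullPrice".
            upgraded.put("fullPrice", upgraded.remove("price"));
        }
        upgraded.remove("price"); // the old attribute is never written back
        return upgraded;
    }

    // Read an item, upgrade it, and save it back so the stored copy uses the new shape.
    static Map<String, Object> readAndMigrate(Map<String, Map<String, Object>> store,
                                              String key) {
        Map<String, Object> upgraded = upgrade(store.get(key));
        store.put(key, upgraded);
        return upgraded;
    }
}
```

Each record is migrated the first time it is touched, so no bulk conversion of the whole
data store is required before the new code is deployed.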
Conclusion
The short research paper discusses the concepts of NoSQL databases, which adopt an
aggregate data model that supports application-oriented aggregates, embrace schemaless
data, run on clusters in a distributed network, and often trade data consistency for
other useful properties. It focuses on the concepts associated with NoSQL’s
schemalessness and emphasizes data migration in NoSQL databases with implicit schema.
The in-depth discussion, which also covers the general principles of conducting data
migration and the test strategy in NoSQL databases, consists of four main sections:
(1) the concepts of NoSQL databases, (2) aggregate data models, (3) schemalessness and
implicit schema, and (4) data migration in NoSQL databases with implicit schema. A
final question is whether NoSQL databases are able to handle Big Data with implicit
schemas in the data-driven era of the early 21st century.

REFERENCES
1. Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business Intelligence and Analytics:
From Big Data to Big Impact. MIS Quarterly, 36(4), 1165-1188.
2. Connolly, T. M., & Begg, C. E. (2014). Database Systems: A Practical Approach to
Design, Implementation, and Management. New Jersey, NJ: Pearson.
3. Date, C. J. (2006). The relational database dictionary: A comprehensive glossary of
relational terms and concepts, with illustrative examples. O'Reilly Media, Inc. pp.
59–. ISBN 978-1-4493-9115-7.
4. Hargiss, K. (2015). Chat session 9 (Lecture) of NoSQL database. Information retrieved
from presentation slides.
5. McNurlin, B. C., Sprague, R. H., Jr., & Bui, T. (2009). Information Systems
Management in Practice (8th ed.). Upper Saddle River: Pearson Prentice Hall.
6. Moniruzzaman, A. B. M., & Hossain, S. A. (2013). Nosql database: New era of
databases for big data analytics-classification, characteristics and comparison. arXiv
preprint arXiv:1307.0191.
7. Pankowski, T. (2002). PathLog: a Query Language for Schemaless Databases of
Partially Labeled Objects. Fundamenta Informaticae, 49(4), 369.
8. Sadalage, P. J., & Fowler, M. (2012). NoSQL distilled: a brief guide to the emerging
world of polyglot persistence. Pearson Education.
9. http://www.datamigrationpro.com/data-migration-articles/2009/11/30/how-to-implement-an-effective-data-migration-testing-strateg.html.
10. http://en.wikipedia.org/wiki/Data_migration.
11. https://msdn.microsoft.com/en-us/library/ms174467.aspx.
12. http://www.cisco.com/c/en/us/td/docs/security/ise/1-3/migration_guide/b_ise_MigrationGuide/b_ise_MigrationGuide12_chapter_011.html.
13. http://www.computerweekly.com/feature/An-ABC-guide-to-data-migration.
14. http://www.laserfiche.com/support/webhelp/Laserfiche/9.0/en-US/AdminGuide/Content/Basic_Principles_of_the_Migration_Proc.
15. http://www.webopedia.com/TERM/D/data_migration.html.

APPENDIX
CS828 Phase 5 Individual Project: Grade: A Score: 200 pt 3/16/2015
Current Grade Average: A (955/955)
ThienSi...
Congratulations on a well written paper used to discuss the general principles of
conducting data migration in NoSQL databases. You clearly presented thoughts as
how to ensure the data stored in the databases matched with the “Implicit Schema”
embedded in the applications when the “Implicit Schema” has experienced a
change....excellent work!
Proficient: The submitted work exceeds the project criteria requirements. It
demonstrates a comprehensive understanding of course material and meets the
course objectives with proficiency.
Dr. Kathleen Hargiss.
