DataCyte was created to provide a data storage and retrieval structure that lets applications be developed organically, with application performance largely independent of the amount of data and of the relationships built between the data elements.
2. DataCyte Group of Companies
• Founded in 1998
• Previously known as World Wide Objects
• Privately owned and funded
• Development done in Pretoria, South Africa
• Expanding to create distribution and partner network
• Building relationships with ISVs
29/04/2011 2
3. DataCyte Timeline
1998 - Product conceptualized; first version developed by late 1999.
2000 - Lodge patent application
2001 - Rated by DARPA/CSC/Lockheed Martin as 5-10 years ahead of the IBM grid
computing initiative
- Awarded United States of America Department of Defense contract
2002 - Defense contract suspended due to war on terror
2003 - Return to South Africa due to declaration of war against “terror”
- Start delivering healthcare systems to South African market
2005 - Return to the US market with Healthcare and hi-tech value proposition
2006 - Benchmark data analysis capabilities with Zirmed, proving a 50% reduction in size
and 10x faster performance
- Entered into business relationship with Dr Patrick Soon-Shiong of Abraxis
Biosciences and American Pharmaceutical Partners Inc.
2008 - A conflict over product direction emerged with Dr Patrick Soon-Shiong, resulting
in termination of the relationship. All Intellectual Property rights reverted to
DataCyte.
4. DataCyte Timeline cont.
2008 - Cedars Sinai Cancer and Proteomic Research Unit (UCLA) benchmark
- DXS Health Care Systems Technology Partnership (www.dxs-systems.com)
- Trash Can Kids Technology Partnership (www.trashcankidz.com )
- Electronic Price Labeling Technology Partnership
- Interactive Television/Phumelela Technology Partnership
(www.phumelela.com)
2009 - Establish strategic partnership with Health One Global (www.healthoneglobal.com)
- IR Global Partnership to deliver international roaming at dramatically discounted
rates and to enable prepaid customers to roam as well.
- Barlow World Logistics Product Development
- Re-engage with the United States Department of Defense through US presence
- Granted US Patent #7571442
5. 29/04/2011 e-Merchandising (Pty) Ltd t/a Revelation Systems 5
6. DataCyte Timeline cont.
2010
April - Booz Allen Hamilton (www.boozallen.com) presents DataCyte as a future data
solution at the American Association for the Advancement of Science (AAAS, www.aaas.org).
Founded in 1848, AAAS publishes Science, which has the largest paid circulation of any
peer-reviewed general science journal in the world, and is considered one of the global
authorities on the direction of Science, Engineering and Innovation.
May - Launch Interactive Television, 400 units rolled out in TABS. Prime Media is
currently finalizing purchase of advertising slots for 12 month period.
June - Launch of the DXS (dxssynergy.com) web-based system in the USA market as part of
its global rollout. DXS is a global vendor of healthcare-related systems.
June - Launch of Trash Can Kids (www.trashcankidz.com)
June - Launch of Process Discovery Product with 2 customers going live this month. The
system has already been adopted by a large defense manufacturer.
June - E-Discovery product launched with EM (the largest non-life actuarial consulting
firm in the UK)
7. DataCyte Timeline cont.
2010
June - Negotiations started with Bytes Technology and its Med-e-mass
(www.medemass.com) subsidiary to underpin their current suite of management systems
with a comprehensive EHR Solution for the South African market.
July - Health One Global (www.healthoneglobal.com.au) launches Personal Electronic
Health Record and Medical Management record in Australia. This launch coincides with the
launch of the Australian Government unique personal health identifier, with the support of
the Australian Automobile Association and the Royal Academy of Physicians, as a first step
to provide Australians with a health record management service. The Australian government
has legislated that all citizens must have these records in place by 2013.
9. Computing Challenge
• The “Global Village” has “Global Data”
• Boundaries removed
• Information flow is more pervasive
• Physical Storage
• Users store more data than ever before
• Little new development in Data Retrieval Systems
• Processing
• More processing required to retrieve similar data
• Little development in Computing Processing Systems
• Present Business Tendencies
• Swing back to centralized systems
• Swing back to thin client
29/04/2011 DataCyte (Pty) Ltd 9
10. DataCyte Patented Solution
• Performance not dependent on number of records
• No single point of vulnerability
• No central registry
• Information redundantly distributed
• RAIS – Redundant Array of Inexpensive Servers
• Dynamic, Intelligent Information
• Contextual 'named' links between data entities
• Dynamic data structure
• Pervasive Associations
• Self-managing, Distributed Information Structures
• Any Entity must have 'independence of existence'
• Entities 'self-aware' of environment
• Web-enabled with open interface - Apache
11. Performance Features
• Access by association
• Fully distributed storage system
• DataCyte data storage is 10% of the size of traditional systems
• Sustainable data creation at 400 000 cytes per second on a standard desktop computer
• Random data access speed of over 250 000 cytes per second on a standard desktop computer
• Caches up to 25 000 000 cytes in 2 GB of memory
• Can access any cyte among 250 000 000 000 in under a millisecond
• Runs on Linux and Windows
12. DataCyte Technology
Cyte
• Parent Registry
• Child Registry
• BLOB content
• Any data form
• Code
• Lua
• Others possible
• Flags
• Security/Access control
• Content type, etc
• Native methods
• Provided by service
13. Access Models
• Multiple Logical Models: Data and application layer
[Diagram labels: Network, Structured, Containment]
15. Case Studies: DataCyte Pilot/Test Process
[Diagram comparing two pipelines; component order below follows the slide labels]
Current State Process
• Components: BAH DW (Oracle; steady source, not part of test), DataStage ETL, PME Data Mart (Oracle, 750 GB), MS SSIS, MS SSAS, MDX Query, MS SSRS Reports
• Process timing: 4.5 hrs daily; 6.5 hrs for closings (2 times a month)
DataCyte Test
• Components: BAH DW (Oracle), DataCyte Extraction & Translation, DataCyte Service (Fact Maps), Web and Crystal Reports
17. Other Case Studies
• Proteomic Research Unit
• Database size: 1,3 TB in Oracle vs 60 GB in DataCyte
• Retrieval speeds: 1½ minutes vs < 1 sec in DataCyte; 1 - 2 days vs 11-66 mins in DataCyte
• Hardware platform: Sun™ Grid Rack of 400 Sun Fire™ x64 servers vs Toshiba laptop (1,86 GHz processor, 7 200 rpm drive)
• UCS SAP database
• Database size: 860 GB in DB2 vs 100 GB in DataCyte
• Queries up to 1000 times faster
18. Applications Developed
• Knowledge Management Systems
o e-Learning Systems
o Interactive TV Management Systems
o Medical Information Systems
• Health Management – “Single Patient Record”
• Practice Management
• Clinic Management System
• Pathology Laboratory Management
• Clinical Trials System
• Hospital Management System
19. Applications Developed
• Data Warehousing
o ETL
o “Data Cube”
o Lawgistics
o Fraud Detection
• SME Payroll System
• Process Management Server
o Document Tracking Systems
o Business Process Modeling
o Supply Chain Management System
• Computational Performance Systems
o Biometrics
o Proteomic and Genomic Analysis
o Shortest Path Routing
20. DataCyte Benefits
• 90% reduction in hardware requirements
• 10 to 1000 times speed improvement
• Ability to populate archive/warehouse in real-time
• Ability to access archived data faster than existing online live system
• Extension of life of live systems
• Greater security due to ALL history online
24. Technology Overview
Database Management System
• Access
• Object
• SQL
• Cyte
Etymology: “Cyte”
• Ancient Greek word κύτος (kýtos)
• Container or Receptacle
• Human body → part of cell that keeps everything together
Developed in C++
• Runs on Windows and Linux
ODBC, XML and Web Service access
Apache module: mod_dsa
• HTTP(S), FTP, WSDL, SOAP, …
25. Technology Overview
• Store: any form of data → 'Cytes'
• Serialized and persisted on creation (more later)
• Accessed by association in a contextual / stateful manner
• Collectively form multiple intersecting hierarchies
• Each Cyte has the potential to form part of a distributed cloud
• Virtualize disparate data → single federated view
• Contain application business logic
• Lua (www.lua.org)
• Lua
• Powerful, fast, lightweight scripting language
• Embeddable
• Lua is widely used:
• Industrial Applications (Adobe: Photoshop Lightroom)
• Games (Blizzard: World of Warcraft)
• Embedded Systems (Ginga, Digital TV in Brazil)
• Lua Server Pages
• Tag-based Web applications that dynamically generate Web pages
27. Technology Overview
Basic Performance
(1.6 GHz Dual Core, 3 GB RAM, 7 200 rpm drive)
Sustained creation speed
• 400 000 cytes per second
Sequential access speed
• 400 000 cytes per second
Random access speed
• 250 000 cytes per second
Cache
• 25 000 000 cytes in 2 GB memory
Access
• Any element from 250 billion elements in under a millisecond
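The cache figure above implies an average in-memory size per cyte. The short calculation below is only a back-of-envelope check on the quoted numbers; whether "2 GB" means decimal or binary gigabytes is not stated on the slide, so both readings are shown, and the cache-fill time assumes the quoted sustained creation rate applies.

```python
# Back-of-envelope check of the cache figures quoted on the slide.
CACHED_CYTES = 25_000_000            # cytes held in cache (from the slide)
CACHE_BYTES_DECIMAL = 2 * 10**9      # "2 GB" read as decimal gigabytes (assumption)
CACHE_BYTES_BINARY = 2 * 2**30       # "2 GB" read as binary gibibytes (assumption)
CREATION_RATE = 400_000              # cytes per second (from the slide)

bytes_per_cyte_decimal = CACHE_BYTES_DECIMAL / CACHED_CYTES
bytes_per_cyte_binary = CACHE_BYTES_BINARY / CACHED_CYTES
seconds_to_fill_cache = CACHED_CYTES / CREATION_RATE

print(f"{bytes_per_cyte_decimal:.1f} bytes per cached cyte (decimal reading)")
print(f"{bytes_per_cyte_binary:.1f} bytes per cached cyte (binary reading)")
print(f"{seconds_to_fill_cache:.1f} s to create enough cytes to fill the cache")
```

Either reading puts the average cached cyte at well under 100 bytes.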
28. Data Structure
Cyte
• Parent Registry
• Child Registry
• BLOB content
• Any data form
• Code
• Lua
• Others possible
• Flags
• Security/Access control
• Content type, etc
• Native methods
• Provided by service
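The cyte layout listed above can be pictured as a small record type. The sketch below is a hypothetical in-memory model only: the class name, field names and the `link_child` helper are assumptions made for illustration and are not DataCyte's actual API or storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(eq=False)  # identity comparison; avoids recursive equality on linked cytes
class Cyte:
    """Hypothetical model of a cyte; all names here are illustrative assumptions."""
    cyte_id: int
    content: bytes = b""                                       # BLOB content: any data form
    code: Optional[str] = None                                 # embedded logic, e.g. Lua source
    flags: Dict[str, str] = field(default_factory=dict)        # access control, content type, etc.
    parents: Dict[str, "Cyte"] = field(default_factory=dict)   # parent registry (named links)
    children: Dict[str, "Cyte"] = field(default_factory=dict)  # child registry (named links)

    def link_child(self, name: str, child: "Cyte", back_name: str) -> None:
        """Register a named link in both directions."""
        self.children[name] = child
        child.parents[back_name] = self


patient = Cyte(1, content=b"Jane Doe", flags={"content_type": "text/plain"})
lab_result = Cyte(2, content=b"HbA1c 5.4%")
patient.link_child("latest_lab_result", lab_result, back_name="patient")
```

Because each cyte carries both registries, an association can be followed from either end without consulting a separate table.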
32. Data Structure
Complexity vs Simplicity
• Simpler → faster learning curve
• Translation layer
• RDBMS
• Programmed
• Maintained
• Adding features, fixing bugs, improvements
• Collectively comprise 80% of lifetime cost
• DataCyte
• No translation layer
• Saving: Development (Time and Cost)
• Integrated into database layer (à la EJB)
• e.g. Cytes with application logic
33. Data Structure
Impedance Mismatch (Translation Layer)
• Maintenance and Development (RDBMS)
• Different mapping → mismatch and integrity violation
• Subtle Issues
• Difficult to locate (time + money)
• Lower impedance mismatch in DataCyte
• No translation layer → natural modelling of data
Architecture: Simple
• Option: logically structure and constrain → RDBMS + ODBC
• Multiple logical views of the same data
• Facilitates conformance to multiple standards
34. Discovery
[Diagram: three-level model. External level: Logical Models 1-4. Conceptual level: Conceptual Model. Physical level: Physical Models 1-2.]
Cytes
→ Logical representation of physical storage
→ Navigational construct
each navigation → physical disk read
→ Brokered by DataCyte service
40. Query Approach
• Types of Queries
• Without indexes
• each record is checked in turn
• Indexed
• filtered records
• Query approach (same as RDBMS)
Know what you are looking for
AND
Where you want to look for it
• Query Steps:
STEP 1: IDENTIFY RESULT SET
STEP 2: POPULATE RESULT SET
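The two query steps, and the difference between unindexed and indexed identification, can be sketched generically. The records, field names and helper functions below are hypothetical and show the general technique, not DataCyte's or any particular RDBMS's implementation.

```python
from collections import defaultdict

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "city": "Pretoria", "type": "clinic"},
    {"id": 2, "city": "Durban", "type": "hospital"},
    {"id": 3, "city": "Pretoria", "type": "hospital"},
]
records_by_id = {r["id"]: r for r in records}


def identify_by_scan(rows, key, value):
    """Step 1 without an index: each record is checked in turn."""
    return [row["id"] for row in rows if row[key] == value]


def build_index(rows, key):
    """Build a value -> list-of-ids index once, ahead of query time."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row["id"])
    return index


def identify_by_index(index, value):
    """Step 1 with an index: only the matching ids are touched."""
    return list(index.get(value, []))


def populate(ids):
    """Step 2: fetch the full records for the identified ids."""
    return [records_by_id[i] for i in ids]


city_index = build_index(records, "city")
scan_ids = identify_by_scan(records, "city", "Pretoria")
index_ids = identify_by_index(city_index, "Pretoria")
result_set = populate(index_ids)
```

Both identification paths return the same ids; the index only changes how many records are examined to find them.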
41. Query Approach
• At time of query
RESULT SET IDENTIFICATION
• Improved indexes
RESULT SET POPULATION (Compound)
• Traverse logical layer (minimal reads)
• Context = Stateful Results
vs
• Additional external lookups
Additional External Lookups: O(n) → Navigate through Logical Structure: O(1)
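The contrast between populating a result through additional external lookups and navigating a stored link can be illustrated with plain Python structures. This is a generic sketch under assumed data (orders and customers are hypothetical); it shows why following a stored reference takes constant time per item, and makes no claim about how DataCyte implements its logical layer.

```python
# Hypothetical data: 1000 orders spread over 100 customers.
customers = [{"customer_id": c, "name": f"customer-{c}"} for c in range(100)]
orders = [{"order_id": i, "customer_id": i % 100} for i in range(1000)]


def populate_with_external_lookup(order):
    """Find the related record by searching a separate collection: O(n) per order."""
    for customer in customers:
        if customer["customer_id"] == order["customer_id"]:
            return customer
    return None


# Linked form: each order stores a direct reference to its customer record.
customers_by_id = {c["customer_id"]: c for c in customers}
linked_orders = [
    {"order_id": o["order_id"], "customer": customers_by_id[o["customer_id"]]}
    for o in orders
]


def populate_by_navigation(linked_order):
    """Follow the stored link: O(1) per order, with no search."""
    return linked_order["customer"]


looked_up = populate_with_external_lookup(orders[42])
navigated = populate_by_navigation(linked_orders[42])
```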
44. Query Approach
Addressing schema
• Defines context of access
• Cyte → Unique ID within local file system
• Offset within file
• Cytes simply exist within file system
• No global registry
Multiple Contexts = Multiple Addresses
Address = Context = Chain of IDs (named)
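The idea that one cyte can be reached through several contexts, each context being a chain of named links, can be sketched as follows. The `store` dictionary stands in for IDs that would map to offsets within a local file, and the link names are hypothetical; this illustrates the addressing idea, not DataCyte's on-disk format.

```python
# Stand-in for local storage: id -> named child links (all names are hypothetical).
store = {
    1: {"children": {"patients": 2, "labs": 3}},
    2: {"children": {"jane": 4}},
    3: {"children": {"sample-17": 5}},
    4: {"children": {"latest_result": 5}},
    5: {"children": {}},
}


def resolve(root_id, address):
    """Follow a chain of named links from the root and return the final ID."""
    current = root_id
    for name in address:
        current = store[current]["children"][name]
    return current


# Two contexts, two addresses, one and the same cyte.
via_patient = resolve(1, ["patients", "jane", "latest_result"])
via_lab = resolve(1, ["labs", "sample-17"])
```

Each step only consults the links held by the current node, which is consistent with having no global registry of addresses.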
45. Query Approach
• Query Language
• Show of hands: SQL users
• Similar to XPath
• Parent or child Cytes (multiple criteria)
• SQL Interface
• Cytes that conform to relational model
• Lower complexity of architecture
• More natural language
• Steeper learning curve (learn more faster)
46. Performance & Scalability
• Sub-linear Performance Degradation
• Logical Layer → Directed Searches
• Example: Geo-spatial modelling
• Instantiation
• Full control over level
• No class hierarchy
• Multiple Logical Structures
• Same data, different context
• Multi-dimensional searches → single dimension
47. Performance & Scalability
• OLTP
• Architecture marries Structured + Networking paradigms
• Container Topology
• Allows extensible heterogeneous data structuring
• 3-state versioning protocol
• Balance: performance and integrity
• Data Footprint
• Encoding and Compressing on storage
• No Intermediate link tables
[Diagram: Products, Intermediate Table, Ingredients]
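The Products and Ingredients diagram refers to the link table a relational schema needs for a many-to-many relationship. The sketch below contrasts that with nodes that keep their own child and parent sets; the sample data and function names are hypothetical and the linked form is a generic illustration rather than DataCyte's storage layout.

```python
# Relational form: a many-to-many relationship needs an intermediate link table.
products = {1: "Bread", 2: "Cake"}
ingredients = {10: "Flour", 11: "Sugar", 12: "Yeast"}
product_ingredients = [(1, 10), (1, 12), (2, 10), (2, 11)]  # the link table


def ingredients_relational(product_id):
    """Resolve ingredients by filtering the link table, then looking up names."""
    return sorted(ingredients[i] for p, i in product_ingredients if p == product_id)


# Linked form: each node holds its own associations, so no link table exists.
children = {"Bread": {"Flour", "Yeast"}, "Cake": {"Flour", "Sugar"}}
parents = {}
for product, items in children.items():
    for item in items:
        parents.setdefault(item, set()).add(product)


def ingredients_linked(product):
    """Read the associations straight from the product node."""
    return sorted(children[product])


def products_using(ingredient):
    """Navigate the same association from the ingredient side."""
    return sorted(parents[ingredient])
```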
48. Performance & Scalability
• Proof of Concept: 2008
• Cancer research hospital (Los Angeles)
• Considerable funding
• A proteomic analysis problem – blood analysis study
• Data mining to search for cancer markers
• 50 data samples
• 250 billion data elements
• 1.3 TB in Oracle
• Results are from the same data set and same queries
                          Cancer Center                              DataCyte
Single-criteria queries   1½ minutes                                 < 1 sec
Complex queries           ± 1 - 2 days                               11-66 mins
Hardware                  Sun™ Grid Rack, 400 Sun Fire™ x64 servers  Laptop
Data footprint            1.3 TB                                     60 GB
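The ratios implied by the table can be worked out directly. The calculation below uses only the figures reported above; because the complex-query timings are given as ranges, pairing the extremes to get a lower and an upper bound is an assumption, and "< 1 sec" is treated as exactly 1 second, which makes the single-criteria speedup a minimum.

```python
# Figures as reported in the table, converted to bytes, seconds and minutes.
oracle_bytes = 1.3e12                 # 1.3 TB
datacyte_bytes = 60e9                 # 60 GB
footprint_ratio = oracle_bytes / datacyte_bytes

single_baseline_s = 1.5 * 60          # 1.5 minutes
single_datacyte_s = 1.0               # "< 1 sec", taken as an upper limit
single_speedup_min = single_baseline_s / single_datacyte_s

MINUTES_PER_DAY = 24 * 60
# Lower bound: fastest baseline (1 day) against slowest DataCyte run (66 min).
complex_speedup_low = (1 * MINUTES_PER_DAY) / 66
# Upper bound: slowest baseline (2 days) against fastest DataCyte run (11 min).
complex_speedup_high = (2 * MINUTES_PER_DAY) / 11

print(f"footprint: {footprint_ratio:.1f}x smaller")
print(f"single-criteria queries: at least {single_speedup_min:.0f}x faster")
print(f"complex queries: {complex_speedup_low:.1f}x to {complex_speedup_high:.1f}x faster")
```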
52. Data Storage
o Serialized
o Encoded
o Compressed
o Pages
o Caching Stack
o Data Distance
[Diagram: Open ⇄ Encoded via Encode / Decode; Encoded ⇄ Encoded & Compressed via Compress / Decompress]
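The serialize, encode and compress stages can be illustrated as a round trip. DataCyte's actual serialization, encoding and compression formats are not described in the slides, so this sketch substitutes JSON, UTF-8 and zlib from the Python standard library; the function names and sample records are hypothetical.

```python
import json
import zlib


def store_page(records):
    """Serialize, encode, then compress a page of records (stand-in formats)."""
    serialized = json.dumps(records, separators=(",", ":"))  # serialize to text
    encoded = serialized.encode("utf-8")                      # encode to bytes
    return zlib.compress(encoded, 9)                          # compress for storage


def load_page(blob):
    """Reverse the pipeline: decompress, decode, then deserialize."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical, highly repetitive sample data, as measurement records often are.
page = [{"sample": i % 5, "marker": "CA-125", "value": 35.0} for i in range(1000)]
raw_size = len(json.dumps(page).encode("utf-8"))
blob = store_page(page)
restored = load_page(blob)

print(f"raw: {raw_size} bytes, stored: {len(blob)} bytes")
```

The size reduction depends entirely on how repetitive the data is, so the ratio printed here says nothing about the footprint figures quoted elsewhere in the deck.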
53. Data Storage
o Leaf nodes
o Stack management
o Partial Decoding
o Data Management
55. Data Storage
• Enterprise Cloud Storage
• Soft RAIS using commodity hardware
• RAIS provides soft parallelized, grid computing
• Soft RAIS enables redundant distribution of cytes
• Granular scalability and full sharability of resources
• Elastic auto provision of service and resources
• Unified access to data through multiple data models
• New programming approaches, unconfined by old designs and existing programming languages, to tackle the new data flood
• Green
• Footprint
• Power usage – running, cooling and start-up
56. Data Storage
• External Data Sources
• Lua add-on libraries: LuaCOM and ADOLua
• Access: Data Services
• MSSQL NCLI (Native Client Interface)
• DB2 OLEDB (Object Linking and Embedding, Database)
• Oracle OLEDB (Object Linking and Embedding, Database)
57. Security
• Security implemented by the service on the cyte level
• Domain-based, inclusion, exclusion
• Cyte-to-cyte communication is encrypted
• Redundant distribution of cytes provides additional security
• Contextual access provides further flexibility for security
• Child and parent presentation
• Hardware encryption of storage is preferable
• Cyte granularity enables
• Blind security information retaining associations
• Cleansed health data with relationships
• DataCyte can integrate with existing authentication / authorization systems (LDAP, Active Directory)
58. Disaster Recovery
• Transaction-based with roll-back
• All transactions are Atomic, Consistent, Isolated and Durable (ACID)
• 3-state versioning protocol
• My old
• My new
• Yours
• Provides fine-grained control
• Balance between performance and integrity mitigation
• Each service can partially recover from physical loss
• Redundancy could provide complete recovery
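The slides name the three states ("my old", "my new", "yours") without describing the protocol itself. The sketch below shows one possible reading, an optimistic scheme in which a transaction commits only if the shared version still matches the snapshot it started from, and rolls back otherwise. The class names, method names and the conflict rule are all assumptions made for illustration, not DataCyte's documented behaviour.

```python
class VersionedCell:
    """Holds the committed value that other readers see ("yours")."""

    def __init__(self, value):
        self.committed = value

    def begin(self):
        return Transaction(self)


class Transaction:
    """Tracks the snapshot read at start ("my old") and the pending change ("my new")."""

    def __init__(self, cell):
        self.cell = cell
        self.my_old = cell.committed   # snapshot taken when the transaction began
        self.my_new = cell.committed   # working copy, not yet visible to others

    def write(self, value):
        self.my_new = value

    def rollback(self):
        """Discard the pending change."""
        self.my_new = self.my_old

    def commit(self):
        """Publish the change only if nobody else committed in the meantime."""
        if self.cell.committed != self.my_old:
            self.rollback()
            return False
        self.cell.committed = self.my_new
        self.my_old = self.my_new
        return True


cell = VersionedCell("a")
t1, t2 = cell.begin(), cell.begin()
t1.write("b")
t2.write("c")
first_ok = t1.commit()    # succeeds: the shared value still matches t1's snapshot
second_ok = t2.commit()   # fails and rolls back: t1 changed the shared value
```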