Date: 30.08.2011

                                           Product Overview
                                              Owner: Jair


Introduction

The primary objective of this document is to consolidate context and interfacing for team
members engaged in product development. We begin with an overview of key product systems, their
components, system data flows and the roles of key components in each flow. A set of appendices then dive into
specific systems and interfaces in more detail – each appendix owned by a specific team member.

The terms defined in this document should be the terms used in all other related documents. Questions
and clarifications should be directed to specific section owners so that this document can continue to
improve and expand to achieve the document objectives.

Kinor Spheres and Apps

The Kinor mission is to provide powerful tools for ordinary ‘workers’ (knowledge workers) to
collaboratively harvest information (data and content) from any source (web and otherwise) in a
manner that best serves the needs of each worker – fully automated, private and personalized.
From a business perspective, the product conceptually comprises the following:

    1. Spheres - The information harvested from one or more sources is maintained in ‘spheres’, each
       sphere covering a specific domain of common interest to a group of sphere workers. Each
       worker group typically serves a specific business, organization or community. Typical harvested
       sources include web-based catalogues, professional publications, news feeds, social networks,
       databases, Excel worksheets and PDF documents - public and private.
    2. Pipes – Each source is conceptually connected to a sphere via a pipe that pumps harvested
       information from the source to a specific sphere on an ongoing basis. Spheres can be fed by any
       number of pipes, the pipes primed (configured) by a non-technical sphere administrator or
       collectively (crowd-sourced) by the workers themselves. The output of each pipe is a semantic
       dataset that is published to the sphere and maintained there for automated refresh via the pipe.
       The dataset is semantic in the sense that data within it is semantically tagged in a manner that
       enables subsequent data enrichment, integration and processing to be fully automated.
    3. Apps – A growing spectrum of sphere applications will empower each worker to automatically
       view and leverage published information within the sphere in a fully personalized way. The initial
       apps will be horizontal, i.e. readily applied to any sphere - each worker configuring the app to
       match personalized needs. Sample horizontal apps will:
           a. Enable workers (interactively or via API) to easily find the information that they need
                within a sphere and deliver it in the most useful form.
           b. Automatically mine sphere information for worker configured events, sentiments and
                inferences.
c. Automatically hyperlink and annotate sphere information with additional information
                 within a sphere as prescribed by sphere administrator or worker-defined rules.
        Horizontal apps will pave the way to an even greater number of pre-configured vertical
        apps drawing information from specific spheres, i.e. a specific ontology with appropriate pipes.
        Once the sphere core has been configured, vertical apps provide instant value out of the box
        available to all customers. Horizontal apps on the other hand enable workers to independently
        or collaboratively develop their own spheres with unique value available only to them.

Harvested information can either be cached (replicated) within the sphere or acquired on demand from
the source. When dealing with unstructured and semi-structured sources of information, e.g. the Web,
the harvested information will typically be replicated within the sphere unless the volume of
information is prohibitive. When dealing with fully structured sources, e.g. a database, the harvested
information can be acquired on demand if this will not disrupt operations for higher priority source
access. Needless to say, the response time for on demand acquisition will highly depend upon the
volume of information, the source availability and responsiveness and the semantic complexity of that
information, i.e. the computational resources needed to semantically tag and process that information.

Semantic Intelligence

Kinor empowerment of non-technical workers is achieved by cultivating and applying semantic
intelligence to automate every possible aspect of harvesting, processing and applying information. The
semantic intelligence is maintained in a knowledge base (KB) linked to a growing web of
ontologies mapped to a common super-ontology. Each sphere addresses a specific domain of interest,
e.g. online book stores. During the information acquisition stage, the KB associated with the sphere
ontology semantically tags identified entities and their properties within that information. The sphere
ontology must therefore contain a set of inter-related schemas (frames) that describe all entities in the
sphere, e.g. books, authors, publishers, suppliers, prices and reviews. Each entity schema must also
contain anticipated properties (frame slots), e.g. book properties might include title, author, publisher,
ISBN and year of publication.

Kinor internally refers to each entity property as an atom, each atom assigned a predefined semantic
tag and each atom value a predefined atom type. Thus, for example, a ‘year of publication’ must be a valid year
and a ‘book publisher’ must be a valid publisher name. Atom types are typically recognized by a set of text
patterns (e.g. a four-digit year) or a name thesaurus (e.g. formal publisher names and known synonyms
for these names). Atom filters auto-recognize specific atom types while considering both the atom value
and the atom context in which the value was found, e.g. a column name (e.g. ‘home phone’) or a
prefix (e.g. ‘(H)’). Armed with adequate semantic intelligence, blocks of information piped into a sphere
are automatically parsed by atom filters into records (e.g. book records) of semantically tagged
properties to be associated with a specific entity (e.g. a specific book instance).
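To make the atom filter idea concrete, here is a minimal sketch (Java is used for illustration; the class name, tag name and context hints are hypothetical, not product API): it recognizes a ‘year of publication’ atom by combining a four-digit-year pattern with a context hint such as a column header.

    import java.util.Set;
    import java.util.regex.Pattern;

    // Hypothetical sketch of an atom filter: the atom type is recognized from the
    // field value (a text pattern) together with the context in which it was found
    // (e.g. a column header), as described above.
    public class YearAtomFilter {

        private static final Pattern FOUR_DIGIT_YEAR = Pattern.compile("^(19|20)\\d{2}$");
        private static final Set<String> CONTEXT_HINTS = Set.of("year", "published", "publication");

        /** Returns the semantic tag if the value and its context look like a year of publication. */
        public String tag(String value, String context) {
            boolean valueMatches = FOUR_DIGIT_YEAR.matcher(value.trim()).matches();
            boolean contextMatches = context != null
                    && CONTEXT_HINTS.stream().anyMatch(h -> context.toLowerCase().contains(h));
            return (valueMatches && contextMatches) ? "YearOfPublication" : null;
        }

        public static void main(String[] args) {
            YearAtomFilter filter = new YearAtomFilter();
            System.out.println(filter.tag("1943", "Year of publication")); // YearOfPublication
            System.out.println(filter.tag("1943", "Pages"));               // null
        }
    }

A real atom filter would of course also consult the name thesauri and the shared bank of patterns in the super ontology.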

All sphere schemas are mapped to the super ontology with its shared bank of atom types and their
respective atom filters and contexts so that semantic intelligence can be cultivated collectively (e.g. new
filter patterns and thesaurus entries) by all spheres. Atom thesauri in the super ontology also retain
frequently adjoined entity properties (e.g. a given publisher city, state and country) to facilitate the
auto-acquisition of new entity names and synonyms. The super ontology can thus readily expand
automatically with relatively little supervision.

Entity properties from one pipe can subsequently be merged (joined) with entity properties from other
pipes when all properties have been attributed to the same entity. Matching entities across pipes can be
challenging since each pipe record must have a unique identifier key (UIK) based upon properties
available in each record. A set of properties can uniquely identify an entity with a degree of confidence
that can be computed empirically. The ISBN property alone (when available) can serve as an ideal UIK for
book entities, whereas a UIK based upon the book title and author properties is somewhat less reliable.
Entity records from multiple pipes can only be merged if they have UIKs with adequate confidence levels
to be determined by a sphere administrator or worker.
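As a sketch only (the confidence values and names are assumed for illustration, not taken from the product), UIK selection and the merge gate described above might look as follows:

    import java.util.Map;
    import java.util.Optional;

    // Hypothetical sketch: pick the best available unique identifier key (UIK) for a
    // book record and only allow merging across pipes when its confidence is adequate.
    public class UikSelector {

        public record Uik(String key, double confidence) {}

        /** Prefer ISBN when present; fall back to title+author with lower confidence. */
        public Optional<Uik> bestUik(Map<String, String> record) {
            if (record.containsKey("isbn")) {
                return Optional.of(new Uik(record.get("isbn"), 0.99));   // near-ideal UIK
            }
            if (record.containsKey("title") && record.containsKey("author")) {
                return Optional.of(new Uik(record.get("title") + "|" + record.get("author"), 0.85));
            }
            return Optional.empty();
        }

        /** Merging is allowed only above a threshold set by a sphere administrator or worker. */
        public boolean canMerge(Uik a, Uik b, double minConfidence) {
            return a.key().equals(b.key())
                    && a.confidence() >= minConfidence
                    && b.confidence() >= minConfidence;
        }
    }

In practice the confidence of a given property set would be computed empirically, as noted above.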

Semantic intelligence can only operate on sphere information mapped to the sphere ontology. Schemas
from public and private ontologies are acquired and retained in an ontology bank mapped to the super
ontology. Unmapped sphere data can then be schema-matched with schemas in the ontology bank to
semi-automatically add or expand sphere schemas to map them. When dealing with well-structured
catalogue sources, ontologies can be auto-generated from the catalogue structure and data itself.
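As a rough illustration only (the scoring is an assumption, not the actual matcher), schema matching against the ontology bank could start from simple property-name overlap, leaving low-scoring candidates for semi-automatic confirmation:

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical sketch: score how well an unmapped pipe schema matches a candidate
    // schema from the ontology bank by simple property-name overlap (Jaccard).
    public class SchemaMatcher {

        public static double matchScore(Set<String> pipeColumns, Set<String> bankSchemaSlots) {
            Set<String> a = lower(pipeColumns), b = lower(bankSchemaSlots);
            Set<String> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
        }

        private static Set<String> lower(Set<String> names) {
            Set<String> out = new HashSet<>();
            names.forEach(n -> out.add(n.toLowerCase()));
            return out;
        }

        public static void main(String[] args) {
            System.out.println(matchScore(
                    Set.of("Title", "Author", "ISBN", "Price"),
                    Set.of("title", "author", "isbn", "publisher"))); // 0.6
        }
    }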

Key Product Systems

Key product systems include the following:

    1. Pipes – The pipes system schedules all tasks related to the harvesting of information from the
       Web and additional sources and subsequent data processing. Each task is executed by one or more
       agents distributed in a cloud. The most common pipe tasks include spidering (collecting
       designated pages of interest from a specific web site), scraping (extracting designated blocks of
       information from those pages), cleansing (decomposing those blocks into semantically tagged
       atoms and normalizing the atom values where possible) and finally importing the semantic
       dataset into a sphere repository (this task sequence is sketched after this list). The pipes system
       also includes a Wizard for priming the pipes.
    2. Spheres – Each sphere retains fresh semantic datasets for each pipe in a query-ready repository
       (QR) capable of serving a growing number of horizontal and vertical apps. The QR must respond
       to app queries on demand while also enabling a growing spectrum of ongoing app tasks to
       process and augment the QR in the background. Each QR atom maintains the history of that
       atom starting with the pipe history produced by the pipe. The origins of each atom and value
       are thus readily traced back to the source and the data processing tasks.
    3. Ontology – The ontology system comprises a centralized ontology server (OntoServer)
       working in unison with any number of ontology plugs (OntoPlug) to apply semantic intelligence
       to every possible aspect of harvesting, processing and applying sphere information. The
       OntoServer cultivates and maintains the KB for all spheres while the OntoPlug caches a minimal
       subset of the KB to serve specific agent tasks.
    4. Applications - A web-based user interface provides integrated user access to all authorized
       applications (apps) including administrator apps for managing the above systems.
5. Framework – A common framework that enables all of the above systems to run
      securely and efficiently atop any private or public cloud.
   6. E-Meeting – An interactive conferencing facility fully integrated with the product that enables
      existing and potential customers to instantly connect with designated support and sales
      representatives for instant pilots, training, assistance and trouble-shooting.
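For orientation, the common pipe task sequence from item 1 above is sketched below (Java for illustration; the enum and interface are hypothetical – in practice each task is executed by one or more agents distributed in a cloud):

    import java.util.List;

    // Hypothetical sketch of the common pipe task sequence: spider, scrape, cleanse, import.
    public class PipeRun {

        enum PipeTask { SPIDER, SCRAPE, CLEANSE, IMPORT }

        interface Agent { void execute(PipeTask task, String pipeId); }

        /** Run the standard task sequence for one pipe refresh. */
        public static void refresh(String pipeId, Agent agent) {
            for (PipeTask task : List.of(PipeTask.SPIDER, PipeTask.SCRAPE,
                                         PipeTask.CLEANSE, PipeTask.IMPORT)) {
                agent.execute(task, pipeId);
            }
        }

        public static void main(String[] args) {
            refresh("books-pipe", (task, pipeId) -> System.out.println(pipeId + ": " + task));
        }
    }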




Key Pipe Components

Within the Pipes system, key components include the following:

   1. Wizard – The Pipe configuration wizard enables a non-technical user to prime a pipe within
      minutes, i.e. to direct the pipe in how it should navigate within a Web site to harvest all pages of
      interest and subsequently extract from those pages all required blocks of information. Very few
      user clicks are needed to determine:
          a. Which deep web forms (searches and queries) to post (submit) with which input values.
          b. Which hyperlinks and buttons to subsequently navigate (follow) to harvest result pages
               of interest. Note that some result pages may lead to additional pages with different
               layouts - hence each page must also be associated with a specific layout id.
c. Which blocks of information per page are to be extracted and semantically tagged for
            each layout id. A scraper filter is subsequently generated per layout id – hence blocks of
            interest need only be marked for one sample page per layout. Additional sample pages
            are randomly chosen to test the scraper filter for worker feedback using several pages.
        d. When it should revisit the site to refresh that pipe dataset.

    Throughout this process the wizard will provide feedback regarding the pipe dataset that will be
    produced using the current pipe configuration as well as the anticipated price tag for acquiring
    and refreshing the dataset. The pipe dataset produced for Wizard feedback will use a relatively
    small sample of pages for user feedback within seconds and the dataset will not be published to
    the sphere. The user can subsequently refine the pipe configuration to better suit user needs.
     Once primed via the wizard, the pipe can operate autonomously as depicted in the diagram
     above: ‘Map a website’ followed by ‘Run’, resulting in ‘Notification’ when the pipe
     completes its operation (‘End’).

2. Spider – Any number of spider agents can then interact in parallel with the source web site to
   harvest all pages of interest. The harvesting is accomplished in two stages:
        a. Site spidering – A multi-threaded collection of URLs and subsequent postings and
            navigations is produced with a unique id, an order tag (to collate scraper results in the
            proper order), a parent tag (the id of the page that pointed to it) and a page layout id
           tag. New pages are readily flagged by comparing the new collection with the previous
           one. The harvesting of pages can subsequently be parallelized in an appropriate order
           by allocating subsets of the collection to several agents.
       b. Page harvesting - Either all pages or only newly flagged pages are cached in the pipe
           repository by any number of spider agents with order tags so that the pages can
           subsequently be processed in an appropriate order. Each harvested page is recorded in
           a page index with sourcing tags that include the site name, an order tag, bread crumb
           tags (i.e. posted form inputs and navigations) that led to this page, a layout id tag that
           identifies the scraper filter for extracted blocks from that page, the harvesting date and
           time and the site response time for that page.
3. Scraper – The layout id tag is used to apply the appropriate scraper filter to extract the
   designated information blocks per page and transform them into dataset records. Any number
   of scraper agents can do this in parallel, each agent producing a dataset of scraped records per
   page. The page datasets are then merged into a pipe dataset in an appropriate page order. Key
   scraper filter components include the following:
        a. Sequence filter – Matching tag sequences are used to mark designated page blocks. The
           sequences are robust by being sparse (only key tags are included) and depth sensitive
           (reflecting how deep they are in the element tree).
       b. Block table filters – Blocks with conventional table structures use these structures to
           parse records with context tags that include column numbers and headers where
            available. Filters are robust in that they handle nested tables and a variety of table
            structures while tolerating missing tags. Tables can either be vertical or horizontal.
c. Record markers – New dataset records are identified by table structure or by distinct
   fields within the record. Thus, if the third field in the record is always a telephone
   number, the beginning or end of the record is readily found. Record markers take into
   consideration that some fields might be broken into multiple parts for multiple styles,
   hyperlinks, multiple lines and other special effects within a single field.
d. Context – Context is crucial to automated semantic tagging, hence all relevant column
   headers and value prefixes (e.g. QTY in ‘QTY:50’) are extracted and attached to relevant
   field values (e.g. ‘50’). Frequent contexts are retained in the sphere ontology so that
   probable context can be auto-identified by structure (e.g. table position), style (e.g.
   color/emphasis) and content (e.g. frequent context values).
e. uFilters (micro-filters) – Block filters may contain any number of uFilters to mark records
    and context as the block information is being parsed (a minimal uFilter sketch follows this
    list of pipe components). The set of uFilters is applied to every field in the block, each
    uFilter checking for a specific combination of field attributes that include the following:
          i. Field content – A specific text (Equals, StartsWith, EndsWith) or equivalence
             with the page title as identified by TITLE tags.
         ii. Field atom type – Contains a designated text pattern or name found in a
             designated thesaurus.
        iii. Field style – Has a designated set of style attributes, e.g. font color/size,
             emphasis, header level (e.g. H2).
        iv. Field ID – Was preceded by a designated “id=” tag.
         v. Field Column – Appears in a specific table column.
   All uFilters are applied to their designated blocks of information before the record
   parsing begins. uFilters can also be configured to mark nearby fields:
        vi. Field displacement – the marked field is a given number of fields before/after.
        vii. Field expand – fields before/after are also marked until specific tags are matched.
f. Active atom filters and patterns – uFilters are applied to all block fields, hence the need
   to minimize the computational requirements of each uFilter. Atom filters can be heavy,
   hence only those filters that appear in the uFilters are active. Moreover, when dealing
   with atom filters with several patterns, only patterns that actually captured atoms
   during the pipe priming will be active. An active atoms tag in each scraper filter must
   contain the active atoms and patterns so that only these will be armed in the OntoPlug
   serving the scraper.
g. Variable records – Some pages may contain more record fields than others, whereupon
   placing the right fields into the right columns can be challenging. Context tags are
   helpful only if they are available. Hence the need for additional field context as defined
   by the uFilters that captured each field. Field context tags include TITLE, atom type, field
   style, field ID and field column.
h. Layout uFilters – Inconsistencies in the page templates (the actual template may
   depend upon the number of results) may cause the sequence filter to break. Optional
   layout uFilters mark block begin/end as a backup.
i.   Block relationships – Each record block is ultimately parsed into a sequence of
             semantically tagged atoms that belong to a specific record. This specific record might be
             one of several records in a table block on a parent page. A table block contains several
             records (e.g. pricing options) that might all belong to a single record. Hence block
             relationships between pages and on the same page are crucial. The following
             assumptions are made:
                    i. All record blocks on the same page belong to the same record.
                   ii. All blocks on the same page belong to a specific record on the parent page that
                       pointed to that page.
                  iii. A table block belongs to the same record (as a table within a record) as the
                       record blocks on the same page.
                  iv. An attachment (history) block belongs to all records in the table block on the
                       same page.
          j. Record categories – Records extracted from a single site might all belong to one or more
             categories, e.g. in an online catalogue in which different product categories will have
             different sets of columns. A page dataset must therefore be associated with a specific
             category. The category might appear on the page as a category block or it might be auto
             identified by the bread crumbs that led to that page.
         k. Category taxonomies – When comparing records from multiple pipes it might be
             necessary to compare only those that belong to the same category. In such cases there
             is a need to develop a standard category taxonomy for the sphere and map the local
             pipe categories to that standard taxonomy. This is referred to as automated category
             harmonization and it is carried out by the Ontology system.
   Upon selecting a page block and indicating the nature of the block contents (table, record,
   category, or attachment), the generation of candidate filters is fully automated whereupon the
   selection of a specific candidate filter is finalized by the wizard ‘by example’. This means that
   the candidates are ordered by probability and a ‘guess’ button enables the Wizard to try each
   candidate filter until the user is satisfied with the results in the Collected Data pane.
4. Tables (kTable) – All datasets produced by the pipe are maintained in a kTable component that
   is first produced by a Scraper agent and then passed on to additional pipe agents.
         a. Columns, atoms, column relationships
         b. Tables within tables and block relationships.
5. Cleanser – Tables are further refined by cleanser agents that may operate on specific sets of
   records and columns. Any given kTable may undergo several cleanser iterations as the semantic
   intelligence expands to cover more table columns and specific atoms. Hence subsequent
   cleanser iterations will typically focus on specific columns. Specific columns might also be tagged
   by the Wizard to remain ‘as is’. Similarly, the OntoPlug will indicate to the cleanser which
   columns can’t be cleansed due to a missing semantic tag or a semantic tag with inadequate
   intelligence in the KB. Key cleanser operations include:
         a. Column decomposition – Identifying atoms within a field and creating new columns for
             each of them. Decomposition can be recursive when atoms can further be decomposed
             into more refined atoms.
b. Atom normalization – Atoms that can be normalized are normalized as directed by atom
                normalization tags associated with each column.
       All transformations are documented in the kTable history so that the origin of each atom and
       value is readily traced. The refinement state of each kTable (e.g. percentage of cleansed
       columns and decomposed and normalized cells) is updated so that the quality impact of each
       agent can also be instantly assessed.
    6. Enrich – The sphere administrator is tooled with a horizontal app that analyzes QR data to
       identify potential rules that govern entity properties. Thus, for example, a memory chip with a
       given capacity, error checking capability and packaging will always have a given number of pins.
       Similarly, a person under age will not be married or have a driver license. The administrator may
       add rules that reflect domain knowledge and the app is semi-supervised to ensure that
       incidental patterns are not adopted as rules. The enrich agent can then use these rules to fill in
       missing data and flag suspicious data. Rules may also have confidence levels that are reinforced
       or diminished by additional data and administrator confidence. The kTable history must
       document all rules that affected a transformation and a follow-up tag is added to all suspect
       atoms to ensure that they are subsequently given the right attention. All tags also record when
       they were created and by whom so that the quality of treatment can be evaluated. The enrich agent may also
       insert additional entity properties found in the super ontology for identified entities.
    7. Import – This agent imports a kTable into the QR when it is ready. The importer must also
       concatenate all record snippets from multiple blocks and pages into complete records. This is
       also where the UIK is determined via the OntoPlug.
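As referenced under uFilters above, here is a minimal uFilter sketch (Java for illustration; the field attributes and method names are assumptions, not the product API). It models a uFilter as a conjunction of optional checks over a field's content, atom type and column:

    import java.util.function.Predicate;

    // Hypothetical sketch of a uFilter: a combination of optional checks over a field's
    // attributes (content, atom type, column, etc.), all of which must pass for the
    // field to be marked, as described for uFilters above.
    public class UFilter {

        /** Minimal view of a scraped field and its attributes. */
        public record Field(String content, String atomType, String style, String id, int column) {}

        private Predicate<Field> check = f -> true;

        public UFilter contentStartsWith(String prefix) {
            check = check.and(f -> f.content().startsWith(prefix));
            return this;
        }

        public UFilter atomType(String atomType) {
            check = check.and(f -> atomType.equals(f.atomType()));
            return this;
        }

        public UFilter column(int column) {
            check = check.and(f -> f.column() == column);
            return this;
        }

        public boolean matches(Field field) {
            return check.test(field);
        }

        public static void main(String[] args) {
            UFilter priceInColumn3 = new UFilter().atomType("Price").column(3);
            Field field = new Field("10.00$", "Price", "bold", "price1", 3);
            System.out.println(priceInColumn3.matches(field)); // true
        }
    }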

As implied by the above, the pipe value proposition already includes the following:

    1. Semantic tagging that can also serve the auto-generation of RDFa content.
    2. Field decomposition
    3. Atom normalization
    4. Enriched data, e.g. missing fields using rules and additional entity properties from the super
       ontology.
    5. Flagged suspicious data
    6. The best available UIK for these records with a confidence level.

To emphasize this value proposition and manage customer expectations, the Wizard will reflect the pipe
output in a Collected Data pane. This means that a small sample of pages will be run through the pipe to
produce a kTable that will be presented by the Wizard in that pane for additional user feedback. User
feedback may be global (trying a different candidate filter) or local (selecting a specific semantic tag for a
column, changing the UIK or configuring a column to be left ‘as is’). By default, a column that is not
semantically tagged will be parsed syntactically by the cleanser without semantics.
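For illustration, the two cleanser operations described earlier – column decomposition and atom normalization – might look roughly as follows for a price column (the pattern, atom names and currency mapping are assumptions for the sketch, not the product API):

    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical sketch of two cleanser operations: decomposing a field into atoms
    // and normalizing an atom value.
    public class PriceCleanser {

        private static final Pattern PRICE = Pattern.compile("([0-9]+(?:\\.[0-9]+)?)\\s*([$€])");

        /** Decompose a raw price field such as "10.00$" into amount and currency atoms. */
        public static Map<String, String> decompose(String field) {
            Matcher m = PRICE.matcher(field.trim());
            if (!m.matches()) {
                return Map.of();                       // leave the column 'as is'
            }
            return Map.of("Amount", m.group(1), "Currency", normalizeCurrency(m.group(2)));
        }

        /** Normalize a currency symbol into a canonical code. */
        private static String normalizeCurrency(String symbol) {
            return switch (symbol) {
                case "$" -> "USD";
                case "€" -> "EUR";
                default -> symbol;
            };
        }

        public static void main(String[] args) {
            System.out.println(decompose("10.00$")); // {Amount=10.00, Currency=USD} (order may vary)
        }
    }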

Key Sphere Components

The harvested pipe datasets are published and maintained in a sphere for worker access via a growing
spectrum of horizontal and vertical apps. Horizontal apps empower each worker to automatically view
and leverage the published datasets in a fully personalized way. Most apps have both ongoing and
interactive processes:

        Ongoing app processes analyze and augment sphere data to make it more useful. Thus, for
        example, a financial app might use NLP and additional techniques to identify relevant news and
        sentiments on a variety of sites. An e-commerce app might seek reviews that compare
        competing products so that consumers seeking one product can be offered cheaper deals for
        similar products.
        Interactive app processes deliver personalized views to users and respond to consumer
        activities. Thus, for example, the financial app will deliver the news that each user subscribed to
        and the e-commerce site will offer deals that match products being sought via a partnering
        search engine. Each user might also want to search the sphere datasets for specific information.

The sphere administrator will also want to optimize the quality and availability of the sphere
information. To this end there will be a need for additional administrative processes:

        Quality optimization processes seek methods to continuously improve the quality of the
        information, e.g. by identifying rules that will flag erroneous data and by seeking corroborating
        sources to increase the confidence levels of app results.
        Performance optimization processes seek methods to improve app performance, e.g. by
        maintaining frequently queried aggregated datasets that take precedence over the datasets
        from which they were aggregated, to improve app response times.

To support the above, the sphere system will initially comprise the following key components:

    1. QR – Vertical database containing all data collected from all pipes.
    2. QBE – Enabling users to find the sphere information that they need without prior knowledge of
       the sphere categories, schemas and value formats.
    3. Indexer – Building an index of unstructured nuggets so that they can be semantically searched.
    4. Rules Builder – Tools for automatically deriving rules and their confidence levels from the QR
       and manually editing and building them for auto-enrich and the flagging of suspicious data.
    5. kGrid – Delivering ad hoc data integration on-the-fly from disparate online database sources.
       This will initially serve only cloud-hosted databases until the patent-pending bio-immune
       security is implemented whereupon it can also tap into remote private databases.

Conceptually, every entity in the sphere ontology has an independent QR table. When dealing with
catalogues, every product entity may have different properties, resulting in a very sparse database.
Upon querying for a specific product with given properties, the QBE will identify all properties that
sufficiently match the property names and constraints to include all relevant products. Properties
already covered by the ontology will match all known property names. Properties not covered will seek
similar names. Routine OntoServer processes will attempt to identify equivalent properties across sites
to add these properties to the ontology.
kGrid maps online databases to ontological entities so that their data can also be queried on demand
and combined with data already within the sphere.
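As a sketch only (the thesaurus content and method names are assumed for illustration), the semantic expansion of a queried property name to all equivalent names known to the sphere ontology might look like this:

    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch: expand a queried property name to all property names known
    // to be equivalent in the sphere ontology, so that QBE can match differently named
    // columns across sources.
    public class PropertyExpander {

        // Assumed thesaurus content for illustration only.
        private static final Map<String, Set<String>> SYNONYMS = Map.of(
                "author", Set.of("author", "writer", "written by"),
                "price",  Set.of("price", "cost", "list price"));

        /** Properties covered by the ontology match all known names; others match only themselves. */
        public static Set<String> expand(String queriedProperty) {
            return SYNONYMS.getOrDefault(queriedProperty.toLowerCase(), Set.of(queriedProperty));
        }

        public static void main(String[] args) {
            System.out.println(expand("Author"));  // [author, writer, written by] (order may vary)
            System.out.println(expand("Binding")); // [Binding] - not yet covered by the ontology
        }
    }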

Spheres will typically fall into one of the following classes:

    1. Simple – These are spheres that can readily be built and applied by non-technical users without
       assistance.
    2. Complex – In these spheres, power users use advanced features to get the job done with
       appropriate Kinor guidance.
    3. Custom – These spheres require Kinor customization and/or additional features to get the job
       done.

Progressive sphere improvements will ensure that more and more complex spheres will become simple
and fewer spheres will require any customization.

Key Ontology Components

The ontology system comprises the following key components:

    1. OntoServer – An integrated environment for managing, maintaining and importing/exporting
       ontologies and super ontologies. Must also provide an API for OntoPlug access to the KB.
    2. OntoPlug – An OntoServer proxy capable of rapidly loading select portions of the KB from the
       OntoServer and providing all ontology-based services to all designated clients in the Pipe,
       Spheres and Apps (an interface sketch of these services follows this list). Current (immediate)
       OntoPlug clients include the following:
           a. Wizard – Determining which atom filters and patterns should be armed for uFilters.
               Identifying the most probable semantic tags per column. Also auto-suggesting blocks of
               information that match the sphere ontology.
           b. Spider – Uses atom types and learned properties to choose the best form inputs.
           c. Scraper - Identifying atoms as needed by uFilters to scrape pages.
           d. Cleanser – Decomposing and normalizing cell values.
           e. Importer – Proposing the best UIKs per dataset.
           f. QR – Choosing the best UIKs across datasets and using thesauri to semantically expand
               queries to cover all synonyms and possibly even instances.
           g. Content enrich app – Finding known entities in an unstructured text (tokenizing) and
               enriching the text with the entity properties.
    3. Ontology Editor – An ability to view existing ontologies and manually model new ontologies as
       needed.
    4. Ontology Builder – Auto extension of a sphere ontology using the ontology bank or via the
       adoption of pipe schemas as an ontology (e.g. in an online catalogue) and auto-mapping other
       pipe schemas to that ontology (column harmonization).
    5. Ontology Trainer – Auto acquisition of new thesaurus entries and their synonyms from
       undefined atoms collected by the pipes. New patterns must also be acquired in a semi-
       supervised process.
    6. Category Harmonization – Analyzing pipe categories to produce a category taxonomy or
       enumerator (id) that all pipe categories can readily be mapped to.
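The OntoPlug services already named in this document and in Appendix A can be summarized in an interface sketch (Java for illustration; parameter and return types are approximations, not the final API):

    import java.util.List;
    import java.util.Set;

    // Sketch of the OntoPlug services named in Appendix A; parameter and return types
    // are approximations for illustration, not the final API.
    public interface OntoPlug {

        /** Load the KB subset for a given sphere ontology and minimum pedigree. */
        void setContext(String ontologyName, int minPedigree);

        /** Atom filters/patterns worth arming for the given sample field values. */
        List<String> relevantFilters(List<String> fieldValues);

        /** Most probable semantic tags for a column, given its values and context. */
        List<String> probableSemanticTags(List<String> columnValues, String columnContext);

        /** Whether a semantically tagged column can be further decomposed into atoms. */
        boolean canDecompose(String semanticTag);

        /** Candidate unique identifier key sets for a dataset, per category. */
        List<Set<String>> uniqueKeySets(List<String> columnSemanticTags);

        /** Category id derived from bread crumbs and/or a category block's content. */
        String categoryId(List<String> breadCrumbs, String categoryContent);
    }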

Key Framework Components

The framework system seamlessly deploys the other systems on any private or public cloud. The
framework is designed to optimize cloud utilization, measure performance, automate testing and readily
support a growing number of apps. Key framework components already include:

    1. Repository – Persistent storage for all configuration and operational data serving the pipes
        including cached pages, scraper filters, kTables and recorded data.
    2. Scheduler – Scheduling pipe agents to meet quality of service (QoS) and refresh requirements
        while also catering to hi-priority agent tasks initiated by the Wizard on demand.
    3. Planner – Planning a schedule for
    4. Back office apps -
    5. Recorder – Recording all activities for testing, performance optimization and exception handling.
    6. Auto-tester – Using archived repository content to do regression testing and ascertain that the
        quality and performance only improves with new system versions.
    7. Health monitor – Analyzing recorded data to measure KPIs (key performance indicators) that
        reflect system health and ascertain that it is acceptable and only improving.
    8. Agent emulator – To support agent debugging outside the framework.
    9. Portable GUI – An integrated web-based framework for all user interactions with these systems
        including all apps. It is portable in the sense that it will ultimately support several deployment
        modes (e.g. with and without code downloads) without modifying the apps.
    10. Bio-immune security – Elements of kGrid and its patent-pending bio-immune security will be
        integrated into the framework to support secure import and export of enterprise data for
        arbitrary cloud-based applications.

Given the generic nature of this framework, it might ultimately be open sourced to enable others to
develop new pipe agents and apps. Upon implementing the bio-immune security, the framework might
be sold as an independent product for secure cloud computing.

eMeeting System

An interactive web-based conferencing facility must be fully integrated with the product to enable
existing and potential customers to instantly connect with designated support and sales representatives
for instant pilots, training, assistance and trouble-shooting.

System Flows

Each of the above systems has key data flows to be surveyed in the following sections with the roles
played by key components. The key data flows will be reviewed in the following order:

    1. Pipe Data Flow
    2. Application Data Flows
3. Ontology Data Flows
    4. Sphere Data Flows
    5. Framework Data Flows

Data flows typically span multiple systems but each will be addressed in the context of one system.

Pipe Data Flow

Key pipe data flows include the following:

    1. Record assembly – Parallel caching and scraping can result in the loss of order and relationships
       (parent-son) between records spanning multiple pages.
           a. Consider, for example, a table of books (block 1) with links to book details (block 2) on
               separate pages that include a link to a collection of reviews (block 3) on a single page.
               Moreover, the table of books may contain additional category information that relates
               to all books in the table (block 4).
           b. Then each block consists of one or more block records, each consisting of one or more
               data fields that the scraper will extract with their field contexts. Such a dataset could
               subsequently be imported into the QR in two ways:
                     i. One table – Data from blocks 1, 2, 3 and 4 are assembled into a single table
                        whereupon multiple reviews for a book will result in multiple rows per book,
                        one row per review with extensive duplication.
                    ii. Multiple tables – Independent tables for category data (block 4), book data
                         (blocks 1 and 2) and review data (block 3) appropriately interlinked.
           c. Clearly the latter approach has advantages but a sphere administrator might prefer the
               former. By maintaining all scraped and cleansed data as block records, duplicate kTable
               storage and cleansing is avoided and the decision as to how to collate the data in the QR
               can be made in the final import stage. This also simplifies the distribution of pages to
               agents for caching and scraping – each page can be processed independently with the
               exception of cases (e.g. PDF files) in which records roll over from one page to the next.
           d. As later detailed, the initial spidering phase must therefore create an index of category
               {id, bread crumbs} and pages {id, category, order and relationship}. The scraper must
               subsequently create an index of blocks {id, type (table, record, category or attachment),
               page and relationship} and records {id, block} so that all block records belonging to the
               same entity/schema can be appropriately collated upon import.
            e. Note that cleansing and enriching are also best applied to block records to avoid
                unnecessary duplication. Note also that scraper collating of block records would require
                that each sub-tree of pages be processed by the same scraper in a specific order using
                multiple scraper filters to cater to the multiple page layouts.

        The kFramework maintains a cache with Page objects for each pipe, each page retaining
        navigations to previous and next pages. To support block linkage for subsequent record
        assembly, kFramework maintains a kRegistry that records the following:
a) Each page object must retain a parent record id, e.g. when a page with an index
   containing several books links to a page per specific book, each specific book page must
   link back to a specific record in the index page. The parent record id must consist of the
   page id, the parent block id containing the index and the link value itself so that we can
   subsequently identify the parent index record for each specific page. Block Ids are
   foreign to the spider, so the spider merely registers the LinkValueToPage so that the
   scraper can later identify the link content and register block and record ids of the parent
   record id.
            Class PageRecordId { ParentPageId, ParentBlockId, LinkValueToPage}
b) Each page object must also identify the scraper filter that must be used to scrape that
    page. A scraper filter is assigned to each page layout, so the page object need only
   identify the page layout that it belongs to, each page layout also having a list of layout
   blocks:
             Class RegisteredLayout { LayoutId, ScrapeFilter, List<LayoutBlock> }
             Class LayoutBlock { LayoutBlockId, BlockType }
             Enum BlockType { Record, Table, Attachment, Category }
             Class PageLayout { LayoutId, List<LayoutBlock> }
             Class RegisteredPage { PageId, PageRecordId, LayoutId,
             List<BlockId>, BreadCrumbs }
             Class RegisteredBlock { BlockId, PageId, LayoutBlockId }
c) To keep track of all pages and blocks, kLibrary must retain registries (dictionaries) of
    layouts, pages and blocks. Each page object retains its Sibling navigations and each page
    must also maintain a list of Blocks:
             Dictionary<LayoutId, RegisteredLayout> LayoutRegistry
             Dictionary<PageId, RegisteredPage> PageRegistry
             Dictionary<BlockId, RegisteredBlock> BlockRegistry
d) The spider must maintain the LayoutRegistry and PageRegistry whereupon the scraper
   maintains the BlockRegistry, independently scraping records per BlockId and storing
   them per BlockId in the kTable.
e) The assembly of records spanning multiple blocks can then be accomplished by the
    Importer as follows (a minimal sketch of these rules follows this list of flows):
            Category, Record and Attachment blocks only have a single record per block
            whereas a Table block may have several records. We assume any number of
            Table and Record blocks per page.
            If there are Record blocks, then all Record blocks are assembled as a single
            Record, all Category and Attachment blocks linked to it and any number of Table
            blocks are linked to it as well as Tables within that record.
            If there are no Record blocks, each Table block produces any number of records
            to which all Category and Attachment blocks are linked.
If the Page has a PageRecordId then all records produced are linked to that
                      record. The appropriate record is identified by finding a record in the designated
                      BlockID that has the appropriate LinkValueToMe.
                      Record fields are loaded into the QR as dictated by the sphere ontology. Thus
                      when loading a record containing a table, if the table contents map into that of
                      an ontological entity linked to other ontological entities in the record then each
                      entity will be loaded into a different table appropriately linked. If the table
                      contents map into entity properties with an appropriate cardinality then it will
                      be loaded into them.
2. Scraper filters – scraper filters are automatically generated and tested for sample pages by the
   Wizard, but they may need to be adjusted by the Wizard after they are applied to all of the
   pages. The auto-generation and adjustment is accomplished as follows:
       a. One or more blocks per page are marked by the worker as one of the following types:
           Table, Record, Category or Attachment.
       b. The default block type is determined by the block content assisted by the OntoPlug.
           Thus for example the presence of bread crumbs suggests a Category block. The
            presence of a table with multiple records suggests a Table block. The presence of
           certain types and context may suggest a Record or Attachment block.
       c. Key elements in a scraper filter include:
                   i. Tag sequences to identify the page blocks.
                  ii. Layout uFilters to identify page blocks including a potential Title uFilter to
                      identify the beginning of a nugget block.
                  iii. The above two mechanisms back up each other in case one fails (e.g.
                      inconsistent html templates) or a uFilter gets confused by dynamic content.
                 iv. A record/attachment block may have variable sets of fields per page – hence as
                      many fields as possible are provided uFilters to map them to appropriate block
                      dataset columns.
                  v. A table block requires mechanisms to detect its headers and records. If there is
                      an obvious html structure, the key table tags are used – else a set of block
                      uFilters to identify headers and the beginning or end of each record.
                 vi. In both of the above cases, the generation of uFilters begins with an attempt to
                      find a strong set of uFilter attributes per field until as many fields as possible are
                      readily captured by a uFilter. A strong uFilter is one that captures a reasonable
                      number of fields. A uFilter in a record block that captures half of the fields might
                      be useful for capturing labels whereupon it will also be included.
                vii. In a table block, the number of fields captured by each uFilter then serves as a
                      basis for identifying the number of records on the marked page.
               viii. In a table block, each uFilter is also assessed regarding its ability to serve as the
                      basis for breaking the table into records.
                 ix. All of the generated uFilters are treated as candidate uFilters of which the most
                      probable ones are armed for use in the scraper filter. Less probable ones are
                      retained so that the Wizard can show the dataset that would be produced if
they are armed, enabling the worker to choose the right combination by
                   example.
               x. The Wizard also uses the OntoPlug to assign semantic tags and cardinality per
                   uFilter to determine how captured fields will be mapped into block columns.
       d. Subsequent Wizard adjustments merely alter the set of armed uFilters and the semantic
           tags and cardinalities assigned to them.
3. Semantic tags – Whereas the scraper can readily extract fields and insert them into appropriate
   columns with context in page block tables, the decomposition of fields with multiple atoms can
   only be accomplished by the cleanser for columns that have been semantically tagged. The
   semantic tagging of columns is accomplished as follows:
       a. When dealing with site columns that have already been mapped to specific semantic
           tags, the semantic tagging can be accomplished automatically by the scraper upon
           concluding its work or by the cleanser prior to cleansing. The former is preferred since
           cleansing can be iterative and it’ll be easier to plan and schedule if we know in advance
           which columns can be cleansed.
       b. The OntoPlug attempts to determine a semantic tag ……………..
        c. When dealing with consistent columns yet to be mapped, the Wizard can query the
            OntoPlug for the most probable headers, enable the user to approve its automated
            selection and prompt the user to select a specific semantic tag from a short list when
            the confidence level associated with the most probable tag is insufficient.
        d. When the candidate tags are not sufficiently differentiated (confidence levels too close),
            allow the user to determine the right one.

        Site columns that have not yet been semantically tagged are analyzed by the OntoServer to
        attempt expansion of the sphere ontology to cover these columns and map them
        accordingly.

4. Pipe OntoPlug directives
       a. uFilter attributes
                  i. Atom Filters – Identifying the atom type and pattern that best captures the
                     atoms in a given field.
                 ii. Label/Prefix – Probability that a uFilter captures labels and/or prefixes based
                     upon the text in the fields captured by that uFilter. Labels will subsequently be
                     captured by a Text Equals attribute and a prefix by a Text StartsWith attribute.
       b. Semantic tags
5. Atom filter training – Saving fields with unknown atoms in decompose and elsewhere with their
   contexts and semantic tags…
6. Atom context training – Atoms will often appear in new contexts that have to be acquired by the
    ontology and newly harvested data will often lack the ontology needed to cleanse it. The scraper
    records the context of each field in kTable so that the OntoServer can subsequently learn
   everything possible for harvested data, including new patterns for existing atoms as well as new
   atoms. It is context like column, style, headers and ID that enable us to recognize fields that
   contain common atoms and provide hints as to what they might be.
7. Field decomposition–
    8. Table/record within table – the scraper uses context (style, id, etc) that is not available to the
       cleanser….
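As referenced in the record assembly flow above, here is a minimal sketch of the importer's per-page assembly rules (Java for illustration; the block and record types are simplified and the names are assumptions, not the product API):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the importer's per-page record assembly rules described
    // in the 'Record assembly' flow above.
    public class PageAssembler {

        enum BlockType { RECORD, TABLE, ATTACHMENT, CATEGORY }

        record Block(BlockType type, List<String> records) {}

        record AssembledRecord(List<String> fields, List<String> linkedBlocks) {}

        /** Assemble the records of one page from its scraped blocks. */
        public static List<AssembledRecord> assemble(List<Block> blocks) {
            List<Block> recordBlocks = blocks.stream().filter(b -> b.type() == BlockType.RECORD).toList();
            List<String> linked = blocks.stream()
                    .filter(b -> b.type() == BlockType.CATEGORY || b.type() == BlockType.ATTACHMENT)
                    .flatMap(b -> b.records().stream()).toList();

            List<AssembledRecord> out = new ArrayList<>();
            if (!recordBlocks.isEmpty()) {
                // All Record blocks on the page assemble into a single record; Category and
                // Attachment blocks (and Table blocks, omitted here) are linked to it.
                List<String> fields = recordBlocks.stream().flatMap(b -> b.records().stream()).toList();
                out.add(new AssembledRecord(fields, linked));
            } else {
                // No Record blocks: each Table block row becomes its own record with the
                // Category and Attachment blocks linked to it.
                blocks.stream().filter(b -> b.type() == BlockType.TABLE)
                      .flatMap(b -> b.records().stream())
                      .forEach(row -> out.add(new AssembledRecord(List.of(row), linked)));
            }
            return out;
        }
    }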


Application Data Flows

Key application data flows include the following:

    1.    Query auto-suggestion
    2.    Aggregated data view
    3.    Sourcing crowd-qualification
    4.    Best-practice sphere views
    5.    Best-practice content enrichment
    6.    Collective sphere ontology development
    7.

Ontology Data Flows

Key ontology data flows include the following:

    1.    Ontology Acquisition
    2.    Contexts
    3.    Entity relationships
    4.    Synonyms
    5.    Patterns
    6.    Atom Types
    7.    Rules
    8.    Atom Filter training
    9.    Unique Identifiers – production of UIK permutations
    10.   Schema expansion
    11.

Sphere Data Flows

Key sphere data flows include the following:

    1.    Publishing
    2.    Merging data across multiple pipes
    3.    Conflict resolution
    4.    Manual refinement
    5.    Rules
    6.    Crowd worker data refinement
    7.    Ontology refinement
    8.    Quality refinement
9. Performance refinement
   10.
   11.

Framework Data Flows

Key framework data flows include the following:

   1.   Data quality measurement
    2.   Key Performance Indicators (KPI)
   3.   Cloud utilization optimization
   4.   Auto-testing optimization
   5.   Pipe caching optimization
   6.
Appendix A: kTables Tags
                                            Owner: Moshe

Each pipe may produce several tables of data, each category of entities maintained in a separate table.
Each table comprises a matrix of fields, each field belonging to a specific column and row. Each field,
column, row, table and pipe may have several properties, maintained as key/value pairs by
kTables at an appropriate pipe, table, column, row and field level that reflects their scope. The kTables
keys are collectively referred to as kTable tags maintained in a kTableTags enumerator. The property
values are often objects defined in designated APIs.

Consider a site that sells books, CDs and videos with reviews. Then books, CDs, videos, prices and reviews
can be treated as independent entities maintained in separate tables with specific rows in one table
(e.g. prices and reviews) linked to specific rows in other tables (e.g. book, CD and video). Each field is
associated with a specific CategoryID to ensure that it is stored in the right table. Each field is also
associated with a specific PageID and BlockID so that we can trace back to its origin.

kTables is created by a scraper agent whereupon it is refined by Cleanser and additional agents before
finally being imported by an Importer agent into the QR. Prior to kTables creation, properties pertaining
to the data that are created by the Wizard App and Spider Agent are maintained in the kRegistry of
kFramework. This includes, for example pipe properties such as a SphereID and SphereOntology, as well
as the properties of the pages and page blocks that each table cell was extracted from. kTables
therefore need only maintain a PipeId, PageId and BlockId so that all properties associated with the
pipe, page and block can be obtained from the kRegistry. The scraper receives as input a list of pages
that it needs to process including the PipeId and all PageIds that it needs to access them. Given a PageId,
the scraper can get all BlockIds for that page. Similarly, when dealing with a cell containing a value with
a given AtomType, kTables only needs to maintain the AtomTypeId to obtain additional properties of the
AtomType from the OntoPlug.
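As a sketch only (the scope and tag names are abbreviated for illustration, not the actual kTableTags enumerator), the scoped key/value storage behind kTable tags might look as follows:

    import java.util.EnumMap;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of kTable tag storage: key/value properties kept per scope
    // (pipe, table, column, row, field), keyed by a kTableTags-style enumerator.
    public class KTableTags {

        enum Scope { PIPE, TABLE, COLUMN, ROW, FIELD }

        enum Tag { PIPE_ID, CATEGORY_ID, COLUMN_CONTEXT, SEMANTIC_TAG, FIELD_ATOM_TYPE }

        // scope -> (scoped element id -> (tag -> value))
        private final Map<Scope, Map<String, EnumMap<Tag, Object>>> tags = new EnumMap<>(Scope.class);

        public void set(Scope scope, String elementId, Tag tag, Object value) {
            tags.computeIfAbsent(scope, s -> new HashMap<>())
                .computeIfAbsent(elementId, id -> new EnumMap<>(Tag.class))
                .put(tag, value);
        }

        public Object get(Scope scope, String elementId, Tag tag) {
            return tags.getOrDefault(scope, Map.of())
                       .getOrDefault(elementId, new EnumMap<>(Tag.class))
                       .get(tag);
        }

        public static void main(String[] args) {
            KTableTags kTags = new KTableTags();
            kTags.set(Scope.COLUMN, "col-3", Tag.SEMANTIC_TAG, "Price");
            System.out.println(kTags.get(Scope.COLUMN, "col-3", Tag.SEMANTIC_TAG)); // Price
        }
    }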

The following table contains a list of properties per scope (pipe, table, column etc) that are available via
kTable. In many cases, the property is maintained internally – in other cases the property is prefixed
with a ‘*’ or ‘#’ to indicate that it can be obtained via the kRegistry or OntoPlug respectively, using
a kTable tag designated in the second column. A ‘W’ value indicates which components write the tag
value and an ‘R’ indicates which components read it for purposes described in the final column. Post
pipe processing typically includes OntoServer analysis or QA evaluation of the quality of pipe processing.
Highlighted tags are for Beta.
Property/Tag       Level       App/   Spid   Scra   Clea   Impo    Post   Description/Purpose
                   /Scope       Wiz    er    per    nser    rter   Pipe
PipeID             Pipe                       W              R            To access all related properties in kRegistry
*CustomerID        *PipeID       W                           R            Serving QR access control
*SphereID          *PipeID       W                           R            Each sphere has an independent QR
*OntologyName      *PipeID       W            R      R       R      R     OntologyName for OntoPlug.SetContext
*MinPedigree       *PipeID       W            R      R       R      R     Min Pedigree for OntoPlug.SetContext
*SiteID            *PipeID       W                           R            For auto-navigation to the original pages
BlockID            Field                     W                            Entity records spanning multiple blocks & pages must be merged
*PageID            *BlockID                  W              R             To find other blocks on the same page
*PageLayout        *PageID             W     R                            To apply correct scraper filter to each page by layout
AtomFilters        Pipe          W           R       R                    OntoPlug.RelevantFilters(List<fields>) per block per layout for O
*Prev/NextPage     *PageID             W     R                            To scrape records in the appropriate order
*ParentBlockID     *PageID             W     R              R             To find linked block on parent page
*ParentLinkValue   *PageID             W     R              R             To find linked block on parent page
*BreadCrumbs       *PageID       R     W                            R     To auto-identify a category block & measure spider coverage
*CategoryContent   *PageID                   W              R       R     To validate right choice of category block and categoryID
CategoryID         Table                     W              R       R     OntoPlug.CategoryID(BreadCrumbs, CategoryContent)
*AgentsVersions    *PipeID       W     W     W       W      W       R     List of agents that processed this kTables and their onto version
UniqueKeySets      Table                                    W       R     OntoPlug.UniqueKeySets(List<column.SemanticTags>) per Cate
*PageTitle         *PageID                   WR                     R     HTML page title to validate matching nugget title
IsNugget           Column                    W              R       R     To apply QR indexing to these columns
ColumnContext      Column                    W                      R     Scraped TableColumnOrder,Header,Prefix/Label,Style,AtomTyp
ColumnID           Column                    W              R             Autogenerated by Scraper based upon ColumnContext
UserColumnName     Column       (W)          W              R             Manually entered by user & registered in Scraper Filter (takes p
SemanticTag        Column        W           W       R      R             OntoPlug.ProbableSemanticTags(List<columnValues>,ColumnC
FirstSonColumn     Column                            W                    To find first descendent column produced by field decompositi
NextSonColumn      Column                            W                    Whereupon this leads to remaining descendent columns
ParentColumn       Column                            W                    To recursively find all ancestor columns
SkipColumn         Column        W            R             R
#CanDecompose      *SemanticTag                      W                    OntoPlug.CanDecompose(SemanticTag) Wizard can also config
FieldProperties    Field                     W       W      R             Link,ImageLink,ImageAlt, Empty,NewWord
FieldAtomType      Field                     W       W      R       R     To get OntoPlug atom type attributes, e.g. AtomValueMin/Max
FieldValue         Field                     W       W      R       R     Object containing Amount, UnitNameetc
Quality Metrics

The following table contains a list of quality metrics per pipe designed to monitor:
    a) Data quality improvements as the data flows through the pipe
    b) Pipe data quality and KPI (key performance indicator) improvements over time.
Over time here could mean from one test cycle to the next due to a new code version or an improved
ontology. Several of the metrics are maintained per Type*Layout, i.e. for each Block Type (Record,
Table, Category, Attachment) and each Page Layout in the pipe.
KPIs are readily derived from these metrics, e.g. Suspicious/Validated characters, percentage of Empty
Pages, percentage of Normalized/Atomic cells, etc.
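For illustration, the KPIs mentioned above can be derived from the raw metrics roughly as follows (the method names are assumptions; the metric names mirror the table below):

    // Hypothetical sketch of deriving the KPIs mentioned above from the raw per-pipe metrics.
    public class PipeKpis {

        public static double suspiciousToValidatedChars(long suspiciousChars, long validatedChars) {
            return validatedChars == 0 ? 0.0 : (double) suspiciousChars / validatedChars;
        }

        public static double emptyPagePercentage(long emptyPages, long layoutPages) {
            return layoutPages == 0 ? 0.0 : 100.0 * emptyPages / layoutPages;
        }

        public static double normalizedCellPercentage(long normalizedCells, long atomicCells) {
            return atomicCells == 0 ? 0.0 : 100.0 * normalizedCells / atomicCells;
        }

        public static void main(String[] args) {
            System.out.println(emptyPagePercentage(12, 600));        // 2.0
            System.out.println(normalizedCellPercentage(900, 1000)); // 90.0
        }
    }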

Property/Tag        Level/Scope    Written by          Description/Purpose
Bread crumb paths   Pipe           Spider              Number of queries – should correlate with Category IDs
Spider Pages        Pipe           Spider              Total navigations
Page Layouts        Pipe           Spider
Linked pages        Pipe           Spider              Navigations via links
Broken links        Pipe           Spider
Layout Pages        Layout         Scraper             Scraped pages per layout ID
Empty pages         Layout         Scraper
Category IDs        Layout         Scraper
Blocks              Type*Layout    Scraper             Category/Table/Record/Attachment blocks per layout ID
Texts               Type*Layout    Scraper             Texts per block type per layout
Fields              Type*Layout    Scraper             Extracted fields per block type/layout
Columns             Type*Layout    Scraper, Cleanser   Number of columns
Semantic Tags       Type*Layout    Scraper, Cleanser   How many of them tagged
Atomic Columns      Type*Layout    Scraper, Cleanser
Atomic Cells        Type*Layout    Cleanser
Normalized Cells    Type*Layout    Cleanser
Total Chars         Type*Layout    Cleanser
Unknown Atoms       Type*Layout    Cleanser            Candidate new atom names/patterns
Residue Chars       Type*Layout    Cleanser
Leave As Is Chars   Type*Layout    Cleanser
Validated Chars     Type*Layout    Post Pipe           Output characters validated via their history as matching spe…
Suspicious Chars    Type*Layout    Post Pipe           Non-validated output characters

Max Response        Pipe           Spider              Source response times
Total Response      Pipe           Spider              Total response times for all pages in spider process
Spider Time         Pipe           Spider              Total spider process time
Scrape Time         Pipe           Scraper
Cleanse Time        Pipe           Cleanser
Scrape Onto Time    Pipe           Scraper             Onto processing time only
Cleanse Onto Time   Pipe           Cleanser            Onto processing time only
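
To make the KPI derivation concrete, below is a minimal sketch in Java. It assumes a hypothetical
PipeMetrics holder for the counters above (not the actual kFramework metrics API) and computes a few
of the KPIs mentioned: the empty-page rate, the normalized/atomic cell rates and the
suspicious-to-validated character ratio.

    // Minimal KPI sketch over the quality metrics above.
    // PipeMetrics and its fields are hypothetical names, not the actual kFramework API.
    public class PipeMetrics {
        long spiderPages, emptyPages;                  // per-pipe / per-layout page counters
        long totalCells, atomicCells, normalizedCells; // totalCells is an assumed denominator
        long validatedChars, suspiciousChars;          // written by the post-pipe stage

        double emptyPageRate() {
            return spiderPages == 0 ? 0.0 : (double) emptyPages / spiderPages;
        }

        double atomicCellRate() {
            return totalCells == 0 ? 0.0 : (double) atomicCells / totalCells;
        }

        double normalizedCellRate() {
            return totalCells == 0 ? 0.0 : (double) normalizedCells / totalCells;
        }

        // Lower is better; infinity when nothing has been validated yet.
        double suspiciousToValidatedRatio() {
            return validatedChars == 0 ? Double.POSITIVE_INFINITY
                                       : (double) suspiciousChars / validatedChars;
        }
    }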
Appendix B: Ontology API
    Owner: Naama
Appendix C: Pipe Flow with Data Example
                                                               Owner: Oksana

            1. Website Mapping

               Result: Sample data

            Pipe 1                                                             Pipe 2

             Column1 Column 2               Price      Column 3                 Column1 Column 2           Price    Column 3
             ayn rand      fountainhead     10.00$     5                        ayn rand    fountainhead   10.00$   5




            2. Collect data (scraper)
                   Process
                   System columns added – what columns?
                   Data manipulation?

                     Result: Scraped data

Pipe 1                                                                          Pipe 2
Collected from 10,000 pages, 15,000 records.                                    Collected from 5,000 pages, 10,000 records.

 Column1 Category            Column 2         Price        Column 3              Column1 Column 2          Column 3       Price    Column 4
 ayn rand     Philosophy     fountainhead     10.00$        5 - excellent        ayn rand    Inspiration   fountainhead   10.00€   10
 Author1      Cat1           Book1            10.00$       3 - medium            Author1     Cat7          Book1          5.00€    8
 Author3      Cat3           Book3            7.00$        2 – poor              Author2     Cat3          Book2          5.00€    6
 Author3      Cat1           Book4            7.00$         4 – good



            3. Cleansing
               3.1. Decomposition
      Break down to atoms? Based on what rules? Should prices and measure units be decomposed?
                     Result

                     Pipe 1
                     Decomposed Column 3 to Column3_1, Column3_2

                        Column1 Category            Column 2           Price      Column 3_1        Column 3_2
                        ayn rand     Philosophy     fountainhead       10.00$      5                excellent
                        Author1      Cat1           Book1              10.00$     3                 medium
                        Author3      Cat3           Book3              7.00$      2                 poor
                        Author3      Cat1           Book4              7.00$       4                good
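
                  As a rough illustration of the decomposition above, the sketch below splits the combined
                  rating cell (e.g. '5 - excellent') into a numeric rating column and a textual legend column.
                  The class name and the simple 'number - text' pattern are assumptions for illustration, not
                  the actual cleanser decomposition rules driven by the OntoPlug.

                      import java.util.regex.Matcher;
                      import java.util.regex.Pattern;

                      // Illustrative only: splits "5 - excellent" into {"5", "excellent"} for
                      // Column 3_1 / Column 3_2. The real cleanser derives such rules from the ontology.
                      public class RatingDecomposer {
                          private static final Pattern RATING = Pattern.compile("\\s*(\\d+)\\s*[-–]\\s*(.*)");

                          public static String[] decompose(String cell) {
                              Matcher m = RATING.matcher(cell);
                              if (m.matches()) {
                                  return new String[] { m.group(1), m.group(2).trim() };
                              }
                              return new String[] { cell.trim(), "" };  // no legend part, e.g. a bare number
                          }

                          public static void main(String[] args) {
                              String[] parts = decompose("5 - excellent");
                              System.out.println(parts[0] + " | " + parts[1]);  // prints: 5 | excellent
                          }
                      }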
3.2. Typing

   What data types do we distinguish today? String, Number, Price, Phone, Address? Is the breakdown into
   units and scale part of typing or of decomposition?

   Result

   Pipe 1                                                 Pipe 2

                     Type         Units      Scale                        Type   Units           Scale
     Column1         String                                Column1        String
     Category        String                                Column 2       String
     Column 2        String                                Column 3       String
     Price           Price        $                        Price          Price  €
     Column 3_1      Number                  1-5           Column 4       Number                 1-10
     Column 3_2      String
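
   A naive sketch of what column typing could look like, assuming a type is inferred from the cell values
   alone (the actual typing relies on the OntoPlug atom filters and column context); the class and method
   names here are illustrative.

       import java.util.Arrays;
       import java.util.List;

       // Naive column-typing sketch: classify a column as Price, Number or String
       // from its cell values. Not the actual OntoPlug atom-filter mechanism.
       public class ColumnTyper {
           public static String inferType(List<String> cells) {
               boolean allPrices = true, allNumbers = true;
               for (String cell : cells) {
                   String v = cell.trim();
                   if (!v.matches("\\d+(\\.\\d+)?\\s*[$€]")) allPrices = false;
                   if (!v.matches("\\d+(\\.\\d+)?"))         allNumbers = false;
               }
               if (allPrices)  return "Price";   // the trailing symbol also gives the Units ($ or €)
               if (allNumbers) return "Number";  // the Scale (e.g. 1-5) would come from the value range
               return "String";
           }

           public static void main(String[] args) {
               System.out.println(inferType(Arrays.asList("10.00$", "7.00$")));     // Price
               System.out.println(inferType(Arrays.asList("5", "3", "2")));         // Number
               System.out.println(inferType(Arrays.asList("ayn rand", "Author1"))); // String
           }
       }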



3.3. Column mapping & unique identifiers (entities & relationships)
   Description TBD

   Result

   Pipe 1

      Column        Mapped to             Identifier
      Column1       Author name           Yes
      Category      Book category         Yes
      Column 2      Book name             No
      Price         Book price            No
      Column 3_1    Book rating           No
      Column 3_2    Book rating legend    No

   Pipe 2

      Column        Mapped to             Identifier
      Column1       Author name           Yes
      Column 2      Book category         Yes
      Column 3      Book name             No
      Price         Book price            No
      Column 4      Book rating           No
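
   To illustrate how the identifier flags above could be used, here is a minimal sketch (a hypothetical
   helper, not the actual Importer/OntoPlug UIK logic) that concatenates the values of the columns marked
   as identifiers into a unique identifier key (UIK) for matching records across pipes.

       import java.util.Arrays;
       import java.util.HashMap;
       import java.util.List;
       import java.util.Map;

       // Sketch: build a unique identifier key (UIK) from the columns flagged as
       // identifiers in the mapping above. Names are illustrative, not the real Importer API.
       public class UikBuilder {
           public static String buildUik(Map<String, String> record, List<String> identifierColumns) {
               StringBuilder key = new StringBuilder();
               for (String column : identifierColumns) {
                   String value = record.get(column);
                   key.append(value == null ? "" : value.trim().toLowerCase()).append('|');
               }
               return key.toString();
           }

           public static void main(String[] args) {
               // e.g. for Pipe 1 the identifier columns mapped above
               List<String> identifiers = Arrays.asList("Author name", "Book category");
               Map<String, String> record = new HashMap<>();
               record.put("Author name", "Ayn Rand");
               record.put("Book category", "Philosophy");
               System.out.println(buildUik(record, identifiers));  // ayn rand|philosophy|
           }
       }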

3.4. Normalization
   What normalization do we perform in the cleanser? As far as I understand, normalization can also be done
   at the sphere level. For example, does price normalization only come into question when we join data (one
   source in $ and the other in €), or do we already bring all data to one currency in the cleanse stage? The
   same question applies to rating normalization.


3.5. Complete data
   What completion do we perform in the cleanser? Is this not a sphere-related action as well, e.g.
   completing all zip codes, phone numbers, etc.? I think this is also something we cannot do fully
   automatically; user input will be required, along with guidelines for what data to complete and how.
4. Merge data
   4.1. Unify – simple merge
         Description TBD

         Result – Unified data
Source (S)    Author name      Book category       Book name       Book price      Book rating     Book rating legend
Pipe 1        Ayn rand         Philosophy          fountainhead    10.00$          5               excellent
Pipe2         ayn rand         Inspiration         fountainhead    15.00$          5

Pipe 1        Author1           Cat1               Book1           10.00$          3               medium

Pipe2         Author1           Cat7               Book1           7.00$           4
Pipe 1        Author3           Cat3               Book3           7.00$           2               poor

Pipe 1        Author3           Cat1               Book4           7.00$           4               good

Pipe2         Author2           Cat3               Book2           7.00$           3


    4.2. Merge data by unique identifier

         4.2.1. Keep duplicates (default)

         Enables the domain expert to decide how to resolve duplicates/conflicts – manually reconciling data
         from various sources.

    Result – merge data with duplicates
Author name     Book category           Book name          Book price           Book rating     Book rating legend
Ayn rand        Philosophy (Pipe 1)     fountainhead       10.00$ (Pipe 1)      5               excellent
                Inspiration (Pipe 2)                       15.00$ (Pipe2)



         At this point the domain expert can perform a decision round and decide how to treat the duplicates.
         Based on the example above, I decide to always take categories by pipe rating (Pipe 1 in my case),
         which in effect defines the category normalization here, and to keep prices from both sources.

        Result example
Author name   Book category   Book name      Book price – Pipe 1   Book price – Pipe 2   Book rating   Book rating legend
Ayn rand      Philosophy      fountainhead   10.00$                15.00$                5             excellent


         4.2.2. Resolve all duplicates by pipe rating
         Based on sphere preferences, in this case the merge should result in all duplicates being
         automatically resolved by pipe rating.

         Example – My sphere preferences = always resolve duplicates based on Pipe1
         Result – merge data duplicates resolved
Author name   Book category   Book name      Book price   Book rating   Book rating legend
Ayn rand      Philosophy      fountainhead   10.00$       5             excellent
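
A minimal sketch of the merge behaviour described in 4.2, under the assumption that records are grouped by
their UIK and that, when duplicates are resolved by pipe rating, the higher-rated pipe's value wins per
field (class and method names are illustrative; the real merge runs inside the Importer/QR).

    import java.util.Comparator;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of merge-by-UIK with duplicates resolved by pipe rating.
    // PipeRecord and pipeRating are illustrative, not the real sphere/Importer API.
    public class Merger {

        static class PipeRecord {
            String pipe;                                        // e.g. "Pipe 1"
            String uik;                                         // unique identifier key
            Map<String, String> values = new LinkedHashMap<>(); // column -> value
        }

        // Sphere preference: lower number = higher rating, so Pipe 1 wins over Pipe 2.
        static int pipeRating(String pipe) {
            return "Pipe 1".equals(pipe) ? 1 : 2;
        }

        static Map<String, Map<String, String>> mergeByRating(List<PipeRecord> records) {
            // Process higher-rated pipes first so their values take precedence;
            // lower-rated pipes only fill in fields that are still missing.
            records.sort(Comparator.comparingInt(r -> pipeRating(r.pipe)));
            Map<String, Map<String, String>> merged = new LinkedHashMap<>();
            for (PipeRecord r : records) {
                Map<String, String> row = merged.computeIfAbsent(r.uik, k -> new LinkedHashMap<>());
                for (Map.Entry<String, String> e : r.values.entrySet()) {
                    row.putIfAbsent(e.getKey(), e.getValue());  // keep the higher-rated value
                }
            }
            return merged;
        }
    }

The 'keep duplicates' default from 4.2.1 would instead retain both conflicting values side by side, tagged
with their source pipe, for a later manual decision round.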
Appendix D: Domain Expert Worker Flow
                                                Owner: Oksana


    1. High level domain expert flow




More can be found here https://sites.google.com/a/kinor.com/product/design/High-Level
Appendix E: Sphere Architecture and APIs
              Owner: Irina
Appendix F: Framework Architecture and APIs
               Owner: Aryeh
Appendix G: Spider Architecture and APIs
             Owner: Yossi
Appendix H: Scraper Architecture and APIs
                                          Owner: Hagay




Subsequent iterations:

   a) Multiple record blocks: Linked records between blocks
   b) Optimize performance: replace TextList with linked texts, no need for spliceRecords
Appendix I: Cleanser Architecture and APIs
              Owner: Ronen
Appendix J: kGrid Integration
                                            Owner: Jair


kGrid was designed for cross-enterprise data integration - hence kGrid agents are fundamentally
different from pipe agents in kFramework:

   1. kGrid agents run continuously, processing requests sent to their message queues whereas
      kFramework agents are dispatched to do a specific task.
   2. kGrid agents typically reside at specific Agency or Gateway locations until system dynamics
      warrant relocation, whereas kFramework agents are dispatched per task anywhere in the cloud.
   3. kGrid agents cache the ontology that they need internally via an Ontology service whereas
      kFramework agents rely upon an OntoPlug to do their ontology-related work.

Kinor envisions future cross-enterprise contexts, so the fundamental kGrid architecture must be
retained. This architecture enables kGrid Agencies and Gateways anywhere to dynamically discover each
other and work together without disrupting operations. A robust dynamic discovery mechanism has yet
to be implemented. Hence kGrid should initially be deployed internally using a static configuration
dictated by an appropriate kGridConfig object in the kFramework kRegistry. The following principles
should enable immediate kGrid deployment for online database integration purposes within weeks:

   1) The designated machines will run Java 1.4.
   2) At least one of the machines will deploy MySQL for initial kGrid persistence.
   3) The kGridConfig object should initially comprise the following (a configuration sketch follows this list):
      a) A list of Agency and Gateway machines for deployment of the kGrid Agent Manager
      b) A kGrid compatible XML file per Agent Manager to configure its agents and services
      c) A minimalistic agent and services configuration to get started
   4) The kGrid discovery service should be adjusted to access kRegistry for connecting with other
      kGrid discovery services rather than attempt dynamic discovery.
   5) kGrid uses Protégé as an ontology editor with a plugin that knows how to engage the kGrid
      ontology service. The OntoServer must incorporate this plugin and translate its OWL ontology
      format into the OKBC format currently serving the Ontology Server.
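
A rough sketch of what the static kGridConfig object from point 3 might contain, assuming plain Java
fields for illustration (the actual kRegistry object layout, class names and fields are not defined here
and may differ).

    import java.util.List;
    import java.util.Map;

    // Illustrative shape of a static kGridConfig entry in the kFramework kRegistry.
    // All class and field names here are assumptions, not the actual kGrid/kFramework types.
    public class KGridConfig {
        // a) Agency and Gateway machines that will run the kGrid Agent Manager
        List<String> agencyHosts;                 // e.g. "agency1.internal:8090"
        List<String> gatewayHosts;

        // b) kGrid-compatible XML configuration file per Agent Manager
        Map<String, String> agentManagerXmlPath;  // host -> path of its agents/services XML

        // c) minimal agent and service set to get started
        List<String> initialAgents;
        List<String> initialServices;

        // MySQL instance used for initial kGrid persistence (point 2)
        String persistenceJdbcUrl;                // e.g. "jdbc:mysql://dbhost/kgrid"
    }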

The following kGrid improvements can then be implemented over subsequent months:

   1)   Replace kGrid Query Builder with Sphere Query Builder
   2)   Upgrade code to be compatible with Java 7.
   3)   Use an OntoPlug to semi-automate Wrapper schema mapping.
   4)   Self-organizing kGrid deployment to match operational demand
   5)   It might be worth replacing the ODBC connectivity currently serving the kGrid wrappers with
        standard DBMS interfacing that is more readily configured programmatically by kFramework
        when new database sources are added to a sphere.

Contenu connexe

En vedette

вікі. крок 3
вікі. крок 3вікі. крок 3
вікі. крок 3Nikolay Zalub
 
Good Relations - Internet Briefing
Good Relations - Internet BriefingGood Relations - Internet Briefing
Good Relations - Internet Briefingbasis06 AG
 
Exploring Puerto Rico Open Data with Power BI
Exploring Puerto Rico Open Data with Power BIExploring Puerto Rico Open Data with Power BI
Exploring Puerto Rico Open Data with Power BIGuillermo Caicedo
 
Shanatova
ShanatovaShanatova
Shanatovairis
 
4 hadoop for-the-disillusioned
4 hadoop for-the-disillusioned4 hadoop for-the-disillusioned
4 hadoop for-the-disillusionedBigDataCamp
 
Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...
Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...
Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...Progetto Pervinca
 
阿里集团MySQL并行复制特性
阿里集团MySQL并行复制特性阿里集团MySQL并行复制特性
阿里集团MySQL并行复制特性Hui Liu
 
Powerpoint twitter
Powerpoint twitterPowerpoint twitter
Powerpoint twittergingerkid123
 
Participation in the digital age
Participation in the digital age Participation in the digital age
Participation in the digital age Joanna Saad-Sulonen
 
г.і. гайдук
г.і. гайдукг.і. гайдук
г.і. гайдукtumoshenko
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
Israel Orienteering 2008-9
Israel Orienteering 2008-9Israel Orienteering 2008-9
Israel Orienteering 2008-9Amri Wandel
 

En vedette (20)

вікі. крок 3
вікі. крок 3вікі. крок 3
вікі. крок 3
 
Good Relations - Internet Briefing
Good Relations - Internet BriefingGood Relations - Internet Briefing
Good Relations - Internet Briefing
 
Exploring Puerto Rico Open Data with Power BI
Exploring Puerto Rico Open Data with Power BIExploring Puerto Rico Open Data with Power BI
Exploring Puerto Rico Open Data with Power BI
 
Cotton moisture meter
Cotton moisture meterCotton moisture meter
Cotton moisture meter
 
Direct Navigation
Direct NavigationDirect Navigation
Direct Navigation
 
Dibujos MaríA
Dibujos MaríADibujos MaríA
Dibujos MaríA
 
Shanatova
ShanatovaShanatova
Shanatova
 
4 hadoop for-the-disillusioned
4 hadoop for-the-disillusioned4 hadoop for-the-disillusioned
4 hadoop for-the-disillusioned
 
Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...
Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...
Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...
 
阿里集团MySQL并行复制特性
阿里集团MySQL并行复制特性阿里集团MySQL并行复制特性
阿里集团MySQL并行复制特性
 
Statistics.hpp
Statistics.hppStatistics.hpp
Statistics.hpp
 
Powerpoint twitter
Powerpoint twitterPowerpoint twitter
Powerpoint twitter
 
Entrega feb
Entrega febEntrega feb
Entrega feb
 
Participation in the digital age
Participation in the digital age Participation in the digital age
Participation in the digital age
 
Guerrilla
GuerrillaGuerrilla
Guerrilla
 
г.і. гайдук
г.і. гайдукг.і. гайдук
г.і. гайдук
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Oisie
OisieOisie
Oisie
 
Calzado
CalzadoCalzado
Calzado
 
Israel Orienteering 2008-9
Israel Orienteering 2008-9Israel Orienteering 2008-9
Israel Orienteering 2008-9
 

Similaire à Product data processing 30.08.2011 gg

A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
A General Purpose Extensible Scanning Query Architecture for Ad Hoc AnalyticsA General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
A General Purpose Extensible Scanning Query Architecture for Ad Hoc AnalyticsFlurry, Inc.
 
Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Amit Sheth
 
Kos presentation1
 Kos presentation1 Kos presentation1
Kos presentation1annemaeannex
 
A federated search engine providing access to validated and organized "web" i...
A federated search engine providing access to validated and organized "web" i...A federated search engine providing access to validated and organized "web" i...
A federated search engine providing access to validated and organized "web" i...IAALD Community
 
COLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLES
COLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLESCOLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLES
COLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLESijcsit
 
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...Anita de Waard
 
Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...DESTIN-Informatique.com
 
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...IJwest
 
OSFair2017 Workshop | EGI applications database
OSFair2017 Workshop | EGI applications databaseOSFair2017 Workshop | EGI applications database
OSFair2017 Workshop | EGI applications databaseOpen Science Fair
 
Library automation and use of open source software odade
Library automation and use of open source software odadeLibrary automation and use of open source software odade
Library automation and use of open source software odadeChris Okiki
 
LibraryLibrary Automation and Use of Open Source Software automation and use ...
LibraryLibrary Automation and Use of Open Source Software automation and use ...LibraryLibrary Automation and Use of Open Source Software automation and use ...
LibraryLibrary Automation and Use of Open Source Software automation and use ...Chris Okiki
 
Fortuna 2012 metadata_management_web_of_things
Fortuna 2012 metadata_management_web_of_thingsFortuna 2012 metadata_management_web_of_things
Fortuna 2012 metadata_management_web_of_thingscarolninap
 
Novel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information ExtractionNovel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information Extractionijsrd.com
 
Think about the most important features of a modern infrastructure. .docx
Think about the most important features of a modern infrastructure. .docxThink about the most important features of a modern infrastructure. .docx
Think about the most important features of a modern infrastructure. .docxirened6
 
Online Library Management
Online Library ManagementOnline Library Management
Online Library ManagementVarsha Sarkar
 
JS_proposal.doc
JS_proposal.docJS_proposal.doc
JS_proposal.docbutest
 
Library Automation A - Z Guide: A Hands on Module
Library Automation A - Z Guide: A Hands on ModuleLibrary Automation A - Z Guide: A Hands on Module
Library Automation A - Z Guide: A Hands on ModuleAshok Kumar Satapathy
 

Similaire à Product data processing 30.08.2011 gg (20)

A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
A General Purpose Extensible Scanning Query Architecture for Ad Hoc AnalyticsA General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
 
Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...
 
Kos presentation1
 Kos presentation1 Kos presentation1
Kos presentation1
 
Splunk Components
Splunk ComponentsSplunk Components
Splunk Components
 
A federated search engine providing access to validated and organized "web" i...
A federated search engine providing access to validated and organized "web" i...A federated search engine providing access to validated and organized "web" i...
A federated search engine providing access to validated and organized "web" i...
 
COLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLES
COLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLESCOLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLES
COLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLES
 
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
 
Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...
 
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
 
OSFair2017 Workshop | EGI applications database
OSFair2017 Workshop | EGI applications databaseOSFair2017 Workshop | EGI applications database
OSFair2017 Workshop | EGI applications database
 
Library automation and use of open source software odade
Library automation and use of open source software odadeLibrary automation and use of open source software odade
Library automation and use of open source software odade
 
LibraryLibrary Automation and Use of Open Source Software automation and use ...
LibraryLibrary Automation and Use of Open Source Software automation and use ...LibraryLibrary Automation and Use of Open Source Software automation and use ...
LibraryLibrary Automation and Use of Open Source Software automation and use ...
 
Fortuna 2012 metadata_management_web_of_things
Fortuna 2012 metadata_management_web_of_thingsFortuna 2012 metadata_management_web_of_things
Fortuna 2012 metadata_management_web_of_things
 
[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
 
Novel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information ExtractionNovel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information Extraction
 
Think about the most important features of a modern infrastructure. .docx
Think about the most important features of a modern infrastructure. .docxThink about the most important features of a modern infrastructure. .docx
Think about the most important features of a modern infrastructure. .docx
 
Online Library Management
Online Library ManagementOnline Library Management
Online Library Management
 
JS_proposal.doc
JS_proposal.docJS_proposal.doc
JS_proposal.doc
 
Library Automation A - Z Guide: A Hands on Module
Library Automation A - Z Guide: A Hands on ModuleLibrary Automation A - Z Guide: A Hands on Module
Library Automation A - Z Guide: A Hands on Module
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 

Dernier

Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinojohnmickonozaleda
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxCulture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxPoojaSen20
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 

Dernier (20)

Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipino
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxCulture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 

Product data processing 30.08.2011 gg

  • 1. Date: 30.08.2011 Product Overview Owner: Jair Introduction The primaryobjective of this document is to consolidate context and interfacing for team membersengaged in product development. We begin with an overview of key product systems, their components,systemdata flows and key components roles in each flow. A set of appendices then dive into specific systems and interfaces in more detail – each appendix owned by a specific team member. The terms defined in this document should be the terms used in all other related documents. Questions and clarifications should be directed to specific section owners so that this document can continue to improve and expand to achieve the document objectives. Kinor Spheres and Apps The Kinor mission is to provide powerful tools for ordinary ‘workers’(knowledge workers) to collaboratively harvest information (data and content) from any source (web and otherwise) in a manner that can best serve the needs of each worker in a fully automated, private and personalized manner. From a business perspective, the product conceptually comprises of the following: 1. Spheres - The information harvested from one or more sources is maintained in ‘spheres’, each sphere covering a specific domain of common interest to a group of sphere workers. Each worker group typically serves a specific business, organization or community. Typicalharvested sources include web-based catalogues, professional publications, news feeds, social networks, databases,Excel worksheets and PDF documents - public and private. 2. Pipes – Each source is conceptually connected to a sphere via a pipe that pumps harvested information from the source to a specific sphere on an ongoing basis. Spheres can be fed by any number of pipes, the pipes primed (configured) by a non-technical sphere administrator or collectively (crowd-sourced) by the workers themselves. The output of each pipe is a semantic datasetthat is published to the sphere and maintained there for automated refresh via the pipe. The dataset is semantic in the sense that data within itis semantically tagged in a manner that enables subsequent data enrichment, integration and processing to be fully automated. 3. Apps – A growing spectrum of sphere applications will empower each worker to automatically view and leveragepublished informationwithin the sphere in a fully personalized way. The initial apps will be horizontal, i.e. readily applied to any sphere - each worker configuring the app to match personalized needs. Sample horizontal apps will: a. Enable workers (interactively or via API) to easily find the information that they need within a sphere and deliver it in the most useful form. b. Automatically mine sphere information for worker configured events, sentiments and inferences.
  • 2. c. Automatically hyperlink and annotate sphere information with additional information within a sphere as prescribed by sphere administrator or worker-defined rules. Horizontal apps will pave the way to an even greater number of pre-configured vertical appsdrawing information from specific spheres, i.e. a specific ontology with appropriate pipes. Once the sphere core has been configured, vertical apps provide instant value out of the box available to all customers. Horizontal apps on the other hand enable workers to independently or collaboratively develop their own spheres with unique value available only to them. Harvested information can either be cached (replicated) within the sphere or acquired on demand from the source. When dealing with unstructured and semi-structured sources of information, e.g. the Web, the harvested information will typically be replicated within the sphere unless the volume of information is prohibitive. When dealing with fully structured sources, e.g. a database, the harvested information can be acquired on demand if this will not disrupt operations for higher priority source access. Needless to say, the response time for on demand acquisition will highly depend upon the volume of information, the source availability and responsiveness and the semantic complexity of that information, i.e. the computational resources needed to semantically tag and process that information. SemanticIntelligence Kinor empowerment of non-technicalworkers is achieved by cultivating and applying semantic intelligence to automate every possible aspect of harvesting, processing and applying information. The semantic intelligence is maintained in aknowledge base (KB) linked to a growing web of ontologiesmapped to a common super-ontology. Each sphere addresses a specific domain of interest, e.g. online book stores. During the information acquisition stage, the KB associated with the sphere ontology semantically tagsidentified entities and their properties within that information. The sphere ontology must therefore contain a set of inter-related schemas (frames) that describe allentities in the sphere, e.g. books, authors, publishers, suppliers, prices and reviews. Each entity schema must also contain anticipated properties (frame slots), e.g. book properties might include title, author, publisher, ISBN and year of publication. Kinor internally refers to each entity property as an atom,each atomassigned a predefinedsemantic tag,the atom value a predefined atom type. Thus for examplea ‘year of publication’ must be a valid year and ‘book publisher’ must beavalid publisher name. Atom types are typically recognized by a set of text patterns (e.g. a four digit year) ora namethesaurus (e.g. formal publisher names and known synonyms for these names). Atom filtersauto-recognize specific atom types whileconsidering both the atom value as well as theatom context in which the value was found, e.g. a column name (e.g. ‘home phone’) or a prefix (e.g. ‘(H)’). Armed with adequate semantic intelligence, blocks of information piped into a sphere are automatically parsed by atom filters into records (e.g. book records) of semantically tagged properties to be associated with a specific entity (e.g. a specific book instance). All sphere schemas are mapped to the super ontology with its shared bank of atom types and their respective atom filters and contexts so that semantic intelligence can be cultivated collectively (e.g. new filter patterns and thesaurus entries) by all spheres. 
Atom thesauri in the super ontology also retain
  • 3. frequently adjoined entity properties (e.g. a given publisher city, state and country) to facilitate the auto-acquisition of new entity names and synonyms. The super ontology can thus readily expand automatically with relatively little supervision. Entity properties from one pipe can be subsequentlymerged(joined) with entity properties from other pipes when all properties have been attributed to the same entity. Matching entities across pipes can be challenging since each pipe record must have a unique identifier key (UIK)based upon properties available in each record. A set of propertiescan uniquely identify an entity with a degree of confidence that can be computed empirically. The ISBN property alone (when available) can serve as an idealUIK for book entitieswhereas a UIKbased upon the book title and author properties is somewhat less reliable. Entity records from multiple pipes can only be merged if they have UIKs with adequate confidence levels to be determined by a sphere administrator or worker. Semantic intelligence can only operate on sphere information mapped to the sphere ontology. Schemas from public and private ontologies are acquired and retained in an ontology bank mapped to the super ontology. Unmapped sphere data can then be schema matched with schema in the ontology bank to semi-automatically add or expand sphere schemas to map them. When dealing with well-structured catalogue sources, ontologies can be auto-generated from the catalogue structure and data itself. Key Product Systems Key product systems include the following: 1. Pipes – The pipes system schedules all tasks related to the harvesting of information from Web and additional sources and subsequent data processing. Each task is executed by one or more agents distributed in a cloud. The most common pipe tasks include spidering (collecting designated pages of interest from a specific web site), scraping (extracting designated blocks of information from those pages), cleansing (decomposing those blocks into semantically tagged atoms and normalizing the atom values where possible) and finally importing the semantic dataset into a sphere repository. The pipes system also includes a Wizard for priming the pipes. 2. Spheres – Each sphere retains fresh semantic datasets for each pipe in a query-ready repository (QR) capable of serving a growing number of horizontal and vertical apps. The QR must respond to app queries on demand while also enabling a growing spectrum of ongoing app tasks to process and augment the QR in the background. Each QR atom maintains the history of that atom starting with the pipe history produced by the pipe. The origins of each atom and value are thus readily traced back to the source and the data processing tasks. 3. Ontology – The ontology system comprises of a centralized ontology server (OntoServer) working in unison with any number of ontology plugs (OntoPlug) to apply semantic intelligence to every possible aspect of harvesting, processing and applying sphere information. The OntoServercultivates and maintains the KB for all spheres while the OntoPlug caches a minimal subset of the KB to serve specific agent tasks. 4. Applications - A web-based user interface provides integrated user access to all authorized applications (apps) including administrator apps for managing the above systems.
  • 4. 5. Framework –A common framework for the above enables all of the above systems to run securely and efficiently atop any private or public cloud. 6. E-Meeting –An interactive conferencing facility fully integrated with the product that enables existing and potential customers to instantly connect with designated support and sales representatives for instant pilots, training, assistance and trouble-shooting. Key Pipe Components Within the Pipes system, key components include the following: 1. Wizard – The Pipe configuration wizard enables a non-technical user to prime a pipe within minutes, i.e. to direct the pipe in how it should navigate within a Web site to harvest all pages of interest and subsequently extract from those pages all required blocks of information. Very few user clicks are needed to determine: a. Which deep web forms (searches and queries) to post (submit) with which input values. b. Which hyperlinks and buttons to subsequently navigate (follow) to harvest result pages of interest. Note that some result pages may lead to additional pages with different layouts - hence each page must also be associated with a specific layout id.
  • 5. c. Which blocks of information per page are to be extracted and semantically tagged for each layout id.A scraper filter is subsequently generated per layout id – hence blocks of interest must only be marked for one sample page per layout. Additional sample pages are randomly chosen to test the scraper filter for worker feedback using several pages. d. When it should revisit the site to refresh that pipe dataset. Throughout this process the wizard will provide feedback regarding the pipe dataset that will be produced using the current pipe configuration as well as the anticipated price tag for acquiring and refreshing the dataset. The pipe dataset produced for Wizard feedback will use a relatively small sample of pages for user feedback within seconds and the dataset will not be published to the sphere. The user can subsequently refine the pipe configuration to better suit user needs. Once primed via the wizard, the pipe can operate autonomously as depicted in the above diagram by the ‘Map a website’ followed by ‘Run’ that results in ‘Notification’ when the pipe completes its operation (‘End’). 2. Spider – Any number of spider agents can then interact in parallel with the source web site to harvest all pages of interest. The harvesting is accomplished in two stages: a. Site spidering – A multi-threaded collection of URLs and subsequent postings and navigations are produced with a unique id, an order tag (to collate scraper results in the proper order) a parent tag (the id of the page that pointed to it) and a page layout id tag. New pages are readily flagged by comparing the new collection with the previous one. The harvesting of pages can subsequently be parallelized in an appropriate order by allocating subsets of the collection to several agents. b. Page harvesting - Either all pages or only newly flagged pages are cached in the pipe repository by any number of spider agents with order tags so that the pages can subsequently be processed in an appropriate order. Each harvested page is recorded in a page index with sourcing tags that include the site name, an order tag, bread crumb tags (i.e. posted form inputs and navigations) that led to this page, a layout id tag that identifies the scraper filter for extracted blocks from that page, the harvesting date and time and the site response time for that page. 3. Scraper – The layout id tag is used to apply the appropriate scraper filter to extract the designated information blocks per page and transform them into dataset records. Any number of scraper agents can do this in parallel, each agent producing a dataset of scraped records per page. The page datasets are then merged into a pipe dataset in an appropriate page order. Key scraper filter components include the following: a. Sequence filter –Matching tag sequences are used to mark designated page blocks. The sequences are robust by being sparse (only key tags are included) and depth sensitive (reflecting how deep they are in the element tree). b. Block table filters – Blocks with conventional table structures use these structures to parse records with context tags that include column numbers and headers where available. Filters are robust in that they treat nested tables for a variety of tables while handling missing tags. Tables can either be vertical or horizontal.
  • 6. c. Record markers – New dataset records are identified by table structure or by distinct fields within the record. Thus, if the third field in the record is always a telephone number, the beginning or end of the record is readily found. Record markers take into consideration that some fields might be broken into multiple parts for multiple styles, hyperlinks, multiple lines and other special effects within a single field. d. Context–Context is crucial to automated semantic tagging, hence all relevant column headers and value prefixes (e.g. QTY in ‘QTY:50’) are extracted and attached to relevant field values (e.g. ‘50’). Frequent contexts are retained in the sphere ontology so that probable context can be auto-identified by structure (e.g. table position), style (e.g. color/emphasis) and content (e.g. frequent context values). e. uFilters(micro-filters) – Block filters may contain any number of uFilters to mark records and context as the block information is being parsed. The set of uFilters are applied to every field in the block, each uFilter checking for a specific combination of field attributes that include the following: i. Field content – A specific text (Equals, StartsWith, EndsWith) or equivalence with the page title as identified by TITLE tags. ii. Field atom type – Contains a designated text pattern or name found in a designated thesaurus. iii. Field style – Has a designated set of style attributes, e.g. font color/size, emphasis, header level (e.g. H2). iv. Field ID – Was preceded by a designated “id=” tag. v. Field Column – Appears in a specific table column. All uFilters are applied to their designated blocks of information before the record parsing begins. uFilters can also be configured to mark nearby fields: vi. Field displacement – the marked field is a given number of fields before/after. vii. Field expand –fields before/after also marked until specific tags are matched. f. Active atom filters and patterns – uFilters are applied to all block fields, hence the need to minimize the computational requirements of each uFilter. Atom filters can be heavy, hence only those filters that appear in the uFilters are active. Moreover, when dealing with atom filters with several patterns, only patterns that actually captured atoms during the pipe priming will be active. An active atoms tag in each scraper filter must contain the active atoms and patterns so that only these will be armed in the OntoPlug serving the scraper. g. Variable records – Some pages may contain more record fields than others, whereupon placing the right fields into the right columns can be challenging. Context tags are helpful only if they are available. Hence the need for additional field context as defined by the uFilters that captured each field. Field context tags include TITLE, atom type, field style, field ID and field column. h. Layout uFilters – Inconsistencies in the page templates (the actual template may depend upon the number of results) may cause the sequence filter to break. Optional layout uFilters mark block begin/end as a backup.
  • 7. i. Block relationships – Each record block is ultimately parsed into a sequence of semantically tagged atoms that belong to a specific record. This specific record might be one of several records in a table block on a parent page. A table block contains several records (e.g. pricing options) that might all belong to a single record. Hence block relationships between pages and on the same page are crucial. The following assumptions are made: i. All record blocks on the same page belong to the same record. ii. All blocks on the same page belong to a specific record on the parent page that pointed to that page. iii. A table block belongs to the same record (as a table within a record) as the record blocks on the same page. iv. An attachment (history) block belongs to all records in the table block on the same page. j. Record categories–Records extracted from a single site might all belong to one or more categories, e.g. in an online catalogue in which different product categories will have different sets of columns. A page dataset must therefore be associated with a specific category. The category might appear on the page as a category block or it might be auto identified by the bread crumbs that led to that page. k. Category taxonomies – When comparing records from multiple pipes it might be necessary to compare only those that belong to the same category. In such cases there is a need to develop a standard category taxonomy for the sphere and map the local pipe categories to that standard taxonomy. This is referred to as automated category harmonization and it is carried out by the Ontology system. Upon selecting a page block and indicating the nature of the block contents (table, record, category, or attachment), the generation of candidate filters is fully automated whereupon the selection of a specific candidate filter is finalized by the wizard ‘by example’. This means that the candidates are ordered by probability and a ‘guess’ button enables the Wizard to try each candidate filter until the user is satisfied with the results in the Collected Data pane. 4. Tables (kTable) – All datasets produced by the pipe are maintained in a kTable component that is first produced by a Scraper agent and then passed on to additional pipe agents. a. Columns, atoms, column relationships b. Tables within tables and block relationships. 5. Cleanser – Tables are further refined by cleanser agents that may operate on specific sets of records and columns. Any given kTable may undergo several cleanser iterations as the semantic intelligence expands to cover more table columns and specific atoms. Hence subsequent cleanser iterations will typically focus on specific columns. Specific columns might also be tagged by the Wizard to remain ‘as is’. Similarly, the OntoPlug will indicate to the cleanser which columns can’t be cleansed due to a missing semantic tag or a semantic tag with inadequate intelligence in the KB. Key cleanser operations include: a. Column decomposition – Identifying atoms within a field and creating new columns for each of them. Decomposition can be recursive when atoms can further be decomposed into more refined atoms.
  • 8. b. Atom normalization – Atoms that can be normalized are normalized as directed by atom normalization tags associated with each column. All transformations are documented in the kTable history so that the origin of each atom and value is readily traced. The refinement state of each kTable (e.g. percentage of cleansed columns and decomposed and normalized cells) is updated so that the quality impact of each agent can also be instantly assessed. 6. Enrich – The sphere administrator is tooled with a horizontal app that analyzes QR data to identify potential rules that govern entity properties. Thus, for example, a memory chip with a given capacity, error checking capability and packaging will always have a given number of pins. Similarly, a person under age will not be married or have a driver license. The administrator may add rules that reflect domain knowledge and the app is semi-supervised to ensure that incidental patterns are not adopted as rules. The enrich agent can then use these rules to fill in missing date and flag suspicious data. Rules may also have confidence levels that are reinforced or diminished by additional data and administrator confidence. The kTable history must document all rules that affected a transformation and a follow-up tag is added to all suspect atoms to ensure that they are subsequently given the right attention. All tags record also when they were created and by whom to evaluate the quality of treatment. The enrich agent may also insert additional entity properties found in the super ontology for identified entities. 7. Import – This agent imports a kTable into the QR when it is ready. The importer must also concatenate all record snippets from multiple blocks and pages into complete records. This is also where the UIK is determined via the OntoPlug. As implied by the above, the pipe value proposition already includes the following: 1. Semantic tagging that can also serve the auto-generation of RDFA content. 2. Field decomposition 3. Atom normalization 4. Enriched data, e.g. missing fields using rules and additional entity properties from the super ontology. 5. Flagged suspicious data 6. The best available UIK for these records with a confidence level. To emphasize this value proposition and manage customer expectations, the Wizard will reflect the pipe output in a Collected Data pane. This means that a small sample of pages will be run through the pipe to produce a kTable that will be presented by the Wizard in that pane for additional user feedback. User feedback may be global (trying a different candidate filter) or local (selecting a specific semantic tag for a column, changing the UIK or configuring a column to be left ‘as is’). By default, a column that is not semantically tagged with be parsed syntactically by the cleanser without semantics. Key Sphere Components The harvested pipe datasets are published and maintained in a sphere for worker access via a growing spectrum of horizontal and vertical apps. Horizontal apps empower each worker to automatically view
and leverage the published datasets in a fully personalized way. Most apps have both ongoing and interactive processes:
Ongoing app processes analyze and augment sphere data to make it more useful. Thus, for example, a financial app might use NLP and additional techniques to identify relevant news and sentiments on a variety of sites. An e-commerce app might seek reviews that compare competing products so that consumers seeking one product can be offered cheaper deals for similar products.
Interactive app processes deliver personalized views to users and respond to consumer activities. Thus, for example, the financial app will deliver the news that each user subscribed to and the e-commerce site will offer deals that match products being sought via a partnering search engine. Each user might also want to search the sphere datasets for specific information.
The sphere administrator will also want to optimize the quality and availability of the sphere information. To this end there will be a need for additional administrative processes:
Quality optimization processes seek methods to continuously improve the quality of the information, e.g. by identifying rules that will flag erroneous data and seeking corroborating sources to increase the confidence levels of app results.
Performance optimization processes seek methods to improve app performance, e.g. by maintaining frequently queried aggregated datasets that take precedence over the underlying datasets from which they were aggregated, improving app response times.
To support the above, the sphere system will initially comprise the following key components:
    1. QR – Vertical database containing all data collected from all pipes.
    2. QBE – Enabling users to find the sphere information that they need without prior knowledge of the sphere categories, schemas and value formats.
    3. Indexer – Building an index of unstructured nuggets so that they can be semantically searched.
    4. Rules Builder – Tools for automatically deriving rules and their confidence levels from the QR and manually editing and building them for auto-enrich and the flagging of suspicious data.
    5. kGrid – Delivering ad hoc data integration on-the-fly from disparate online database sources. This will initially serve only cloud-hosted databases until the patent-pending bio-immune security is implemented, whereupon it can also tap into remote private databases.
Conceptually, every entity in the sphere ontology has an independent QR table. When dealing with catalogues, every product entity may have different properties, resulting in a very sparse database. Upon querying for a specific product with given properties, the QBE will identify all properties that sufficiently match the property names and constraints to include all relevant products. Properties already covered by the ontology will match all known property names. Properties not covered will seek similar names. Routine OntoServer processes will attempt to identify equivalent properties across sites to add these properties to the ontology.
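To make the property-matching step concrete, the following is a minimal sketch of how a QBE query property might be matched against the sparse QR columns. The class name QbePropertyMatcher, the synonym dictionary and the crude name-similarity fallback are assumptions for illustration only, not the actual QBE implementation.

    // Hypothetical sketch only: matching a queried property name against QR property columns.
    using System;
    using System.Collections.Generic;

    public class QbePropertyMatcher
    {
        private readonly Dictionary<string, HashSet<string>> _synonyms; // ontology property -> known names

        public QbePropertyMatcher(Dictionary<string, HashSet<string>> synonyms) { _synonyms = synonyms; }

        public IEnumerable<string> MatchProperties(string queriedName, IEnumerable<string> qrColumns)
        {
            foreach (var column in qrColumns)
            {
                // Properties already covered by the ontology match via all known property names.
                if (_synonyms.TryGetValue(column, out var names) && names.Contains(queriedName))
                    yield return column;
                // Properties not yet covered fall back to seeking similar names.
                else if (Similarity(column, queriedName) >= 0.8)
                    yield return column;
            }
        }

        // Crude placeholder similarity: shared-prefix ratio of the lower-cased names.
        private static double Similarity(string a, string b)
        {
            a = a.ToLowerInvariant(); b = b.ToLowerInvariant();
            int i = 0;
            while (i < Math.Min(a.Length, b.Length) && a[i] == b[i]) i++;
            return Math.Max(a.Length, b.Length) == 0 ? 1.0 : (double)i / Math.Max(a.Length, b.Length);
        }
    }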
kGrid maps online databases to ontological entities so that their data can also be queried on demand and combined with data already within the sphere.
Spheres will typically fall into one of the following classes:
    1. Simple – These are spheres that can readily be built and applied by non-technical users without assistance.
    2. Complex – In these spheres, power users use advanced features to get the job done with appropriate Kinor guidance.
    3. Custom – These spheres require Kinor customization and/or additional features to get the job done.
Progressive sphere improvements will ensure that more and more complex spheres will become simple and fewer spheres will require any customization.

Key Ontology Components

The ontology system comprises the following key components:
    1. OntoServer – An integrated environment for managing, maintaining and importing/exporting ontologies and super ontologies. Must also provide an API for OntoPlug access to the KB.
    2. OntoPlug – An OntoServer proxy capable of rapidly loading select portions of the KB from the OntoServer and providing all ontology-based services to all designated clients in the Pipe, Spheres and Apps. Current (immediate) OntoPlug clients include the following:
        a. Wizard – Determining which atom filters and patterns should be armed for uFilters. Identifying the most probable semantic tags per column. Also auto-suggesting blocks of information that match the sphere ontology.
        b. Spider – Using atom types and learned properties to choose the best form inputs.
        c. Scraper – Identifying atoms as needed by uFilters to scrape pages.
        d. Cleanser – Decomposing and normalizing cell values.
        e. Importer – Proposing the best UIKs per dataset.
        f. QR – Choosing the best UIKs across datasets and using thesauri to semantically expand queries to cover all synonyms and possibly even instances.
        g. Content enrich app – Finding known entities in an unstructured text (tokenizing) and enriching the text with the entity properties.
    3. Ontology Editor – An ability to view existing ontologies and manually model new ontologies as needed.
    4. Ontology Builder – Auto extension of a sphere ontology using the ontology bank or via the adoption of pipe schemas as an ontology (e.g. in an online catalogue) and auto-mapping other pipe schemas to that ontology (column harmonization).
    5. Ontology Trainer – Auto acquisition of new thesaurus entries and their synonyms from undefined atoms collected by the pipes. New patterns must also be acquired in a semi-supervised process.
    6. Category Harmonization – Analyzing pipe categories to produce a category taxonomy or enumerator (id) that all pipe categories can readily be mapped to.

Key Framework Components

The framework system seamlessly deploys the other systems on any private or public cloud. The framework is designed to optimize cloud utilization, measure performance, automate testing and readily support a growing number of apps. Key framework components already include:
    1. Repository – Persistent storage for all configuration and operational data serving the pipes including cached pages, scraper filters, kTables and recorded data.
    2. Scheduler – Scheduling pipe agents to meet quality of service (QoS) and refresh requirements while also catering to high-priority agent tasks initiated by the Wizard on demand.
    3. Planner – Planning a schedule for
    4. Back office apps -
    5. Recorder – Recording all activities for testing, performance optimization and exception handling.
    6. Auto-tester – Using archived repository content to do regression testing and ascertain that the quality and performance only improve with new system versions.
    7. Health monitor – Analyzing recorded data to measure KPIs (key performance indicators) that reflect system health and ascertain that it is acceptable and only improving.
    8. Agent emulator – To support agent debugging outside the framework.
    9. Portable GUI – An integrated web-based framework for all user interactions with these systems including all apps. It is portable in the sense that it will ultimately support several deployment modes (e.g. with and without code downloads) without modifying the apps.
    10. Bio-immune security – Elements of kGrid and its patent-pending bio-immune security will be integrated into the framework to support secure import and export of enterprise data for arbitrary cloud-based applications.
Given the generic nature of this framework, it might ultimately be open sourced to enable others to develop new pipe agents and apps. Upon implementing the bio-immune security, the framework might be sold as an independent product for secure cloud computing.

eMeeting System

An interactive web-based conferencing facility must be fully integrated with the product to enable existing and potential customers to instantly connect with designated support and sales representatives for instant pilots, training, assistance and trouble-shooting.

System Flows

Each of the above systems has key data flows to be surveyed in the following sections with the roles played by key components. The key data flows will be reviewed in the following order:
    1. Pipe Data Flow
    2. Application Data Flows
    3. Ontology Data Flows
    4. Sphere Data Flows
    5. Framework Data Flows
Data flows typically span multiple systems but each will be addressed in the context of one system.

Pipe Data Flow

Key pipe data flows include the following:
1. Record assembly – Parallel caching and scraping can result in the loss of order and relationships (parent-child) between records spanning multiple pages.
    a. Consider, for example, a table of books (block 1) with links to book details (block 2) on separate pages that include a link to a collection of reviews (block 3) on a single page. Moreover, the table of books may contain additional category information that relates to all books in the table (block 4).
    b. Then each block consists of one or more block records, each consisting of one or more data fields that the scraper will extract with their field contexts. Such a dataset could subsequently be imported into the QR in two ways:
        i. One table – Data from blocks 1, 2, 3 and 4 are assembled into a single table, whereupon multiple reviews for a book will result in multiple rows per book, one row per review with extensive duplication.
        ii. Multiple tables – Independent tables for category data (block 4), book data (blocks 1 and 2) and review data (block 3) appropriately interlinked.
    c. Clearly the latter approach has advantages but a sphere administrator might prefer the former. By maintaining all scraped and cleansed data as block records, duplicate kTable storage and cleansing is avoided and the decision as to how to collate the data in the QR can be made in the final import stage. This also simplifies the distribution of pages to agents for caching and scraping – each page can be processed independently with the exception of cases (e.g. PDF files) in which records roll over from one page to the next.
    d. As later detailed, the initial spidering phase must therefore create an index of categories {id, bread crumbs} and pages {id, category, order and relationship}. The scraper must subsequently create an index of blocks {id, type (table, record, category or attachment), page and relationship} and records {id, block} so that all block records belonging to the same entity/schema can be appropriately collated upon import.
    e. Note that cleansing and enriching are also best applied to block records to avoid unnecessary duplication. Note also that scraper collating of block records would require that each sub-tree of pages be processed by the same scraper in a specific order using multiple scraper filters to cater to the multiple page layouts.
The kFramework maintains a cache with Page objects for each pipe, each page retaining navigations to previous and next pages. To support block linkage for subsequent record assembly, kFramework maintains a kRegistry to maintain the following:
    a) Each page object must retain a parent record id, e.g. when a page with an index containing several books links to a page per specific book, each specific book page must link back to a specific record in the index page. The parent record id must consist of the page id, the parent block id containing the index and the link value itself so that we can subsequently identify the parent index record for each specific page. Block ids are foreign to the spider, so the spider merely registers the LinkValueToPage so that the scraper can later identify the link content and register the block and record ids of the parent record id.
        Class PageRecordId { ParentPageId, ParentBlockId, LinkValueToPage }
    b) Each page object must also identify the scraper filter that must be used to scrape that page. A scraper filter is assigned to each page layout, so the page object need only identify the page layout that it belongs to, each page layout also having a list of layout blocks:
        Class RegisteredLayout { LayoutId, ScrapeFilter, List<LayoutBlock> }
        Class LayoutBlock { LayoutBlockId, BlockType }
        Enum BlockType { Record, Table, Attachment, Category }
        Class PageLayout { LayoutId, List<LayoutBlock> }
        Class RegisteredPage { PageId, PageRecordId, LayoutId, List<BlockId>, BreadCrumbs }
        Class RegisteredBlock { BlockId, PageId, LayoutBlockId }
    c) To keep track of all pages and blocks, kLibrary must retain a registry dictionary for layouts, pages and blocks; each registered page retains its parent/sibling navigations and a list of its blocks:
        Dictionary<LayoutId, RegisteredLayout> LayoutRegistry
        Dictionary<PageId, RegisteredPage> PageRegistry
        Dictionary<BlockId, RegisteredBlock> BlockRegistry
    d) The spider must maintain the LayoutRegistry and PageRegistry, whereupon the scraper maintains the BlockRegistry, independently scraping records per BlockId and storing them per BlockId in the kTable.
    e) The assembly of records spanning multiple blocks can then be accomplished by the Importer as follows: Category, Record and Attachment blocks only have a single record per block whereas a Table block may have several records. We assume any number of Table and Record blocks per page. If there are Record blocks, then all Record blocks are assembled as a single Record, all Category and Attachment blocks are linked to it and any number of Table blocks are linked to it as well as Tables within that record. If there are no Record blocks, each Table block produces any number of records to which all Category and Attachment blocks are linked.
    If the Page has a PageRecordId then all records produced are linked to that record. The appropriate record is identified by finding a record in the designated BlockID that has the appropriate LinkValueToMe.
    Record fields are loaded into the QR as dictated by the sphere ontology. Thus, when loading a record containing a table, if the table contents map into those of an ontological entity linked to other ontological entities in the record, then each entity will be loaded into a different table, appropriately linked. If the table contents map into entity properties with an appropriate cardinality then it will be loaded into them.
2. Scraper filters – Scraper filters are automatically generated and tested for sample pages by the Wizard, but they may need to be adjusted by the Wizard after they are applied to all of the pages. The auto-generation and adjustment is accomplished as follows:
    a. One or more blocks per page are marked by the worker as one of the following types: Table, Record, Category or Attachment.
    b. The default block type is determined by the block content assisted by the OntoPlug. Thus, for example, the presence of bread crumbs suggests a Category block. The presence of a table with multiple records suggests a Table block. The presence of certain types and context may suggest a Record or Attachment block.
    c. Key elements in a scraper filter include:
        i. Tag sequences to identify the page blocks.
        ii. Layout uFilters to identify page blocks including a potential Title uFilter to identify the beginning of a nugget block.
        iii. The above two mechanisms back up each other in case one fails (e.g. inconsistent html templates) or a uFilter gets confused by dynamic content.
        iv. A record/attachment block may have variable sets of fields per page – hence as many fields as possible are provided uFilters to map them to appropriate block dataset columns.
        v. A table block requires mechanisms to detect its headers and records. If there is an obvious html structure, the key table tags are used – else a set of block uFilters to identify headers and the beginning or end of each record.
        vi. In both of the above cases, the generation of uFilters begins with an attempt to find a strong set of uFilter attributes per field until as many fields as possible are readily captured by a uFilter. A strong uFilter is one that captures a reasonable number of fields. A uFilter in a record block that captures half of the fields might be useful for capturing labels, whereupon it will also be included.
        vii. In a table block, the number of fields captured by each uFilter then serves as a basis for identifying the number of records on the marked page.
        viii. In a table block, each uFilter is also assessed regarding its ability to serve as the basis for breaking the table into records.
        ix. All of the generated uFilters are treated as candidate uFilters of which the most probable ones are armed for use in the scraper filter. Less probable ones are retained so that the Wizard can show the dataset that would be produced if
        they are armed, enabling the worker to choose the right combination by example.
        x. The Wizard also uses the OntoPlug to assign semantic tags and cardinality per uFilter to determine how captured fields will be mapped into block columns.
    d. Subsequent Wizard adjustments merely alter the set of armed uFilters and the semantic tags and cardinalities assigned to them.
3. Semantic tags – Whereas the scraper can readily extract fields and insert them into appropriate columns with context in page block tables, the decomposition of fields with multiple atoms can only be accomplished by the cleanser for columns that have been semantically tagged. The semantic tagging of columns is accomplished as follows:
    a. When dealing with site columns that have already been mapped to specific semantic tags, the semantic tagging can be accomplished automatically by the scraper upon concluding its work or by the cleanser prior to cleansing. The former is preferred since cleansing can be iterative and it’ll be easier to plan and schedule if we know in advance which columns can be cleansed.
    b. The OntoPlug attempts to determine a semantic tag ……………..
    c. When dealing with consistent columns yet to be mapped, the Wizard can query the OntoPlug for the most probable headers, enable the user to approve its automated selection and prompt the user to select a specific semantic tag from a short list when the confidence level associated with the most probable tag is insufficient.
    d. When the most probable tags are not sufficiently differentiated (too close), allow the user to determine the right one. Site columns that have not yet been semantically tagged are analyzed by the OntoServer to attempt expansion of the sphere ontology to cover these columns and map them accordingly.
4. Pipe OntoPlug directives
    a. uFilter attributes
        i. Atom Filters – Identifying the atom type and pattern that best captures the atoms in a given field.
        ii. Label/Prefix – Probability that a uFilter captures labels and/or prefixes based upon the text in the fields captured by that uFilter. Labels will subsequently be captured by a Text Equals attribute and a prefix by a Text StartsWith attribute.
    b. Semantic tags
5. Atom filter training – Saving fields with unknown atoms in decompose and elsewhere with their contexts and semantic tags…
6. Atom context training – Atoms will often appear in new contexts that have to be acquired by the ontology and newly harvested data will often lack the ontology needed to cleanse it. The scraper records the context of each field in kTable so that the OntoServer can subsequently learn everything possible for harvested data, including new patterns for existing atoms as well as new atoms. It is context like column, style, headers and ID that enables us to recognize fields that contain common atoms and provides hints as to what they might be.
7. Field decomposition –
8. Table/record within table – The scraper uses context (style, id, etc.) that is not available to the cleanser….

Application Data Flows

Key application data flows include the following:
    1. Query auto-suggestion
    2. Aggregated data view
    3. Sourcing crowd-qualification
    4. Best-practice sphere views
    5. Best-practice content enrichment
    6. Collective sphere ontology development

Ontology Data Flows

Key ontology data flows include the following:
    1. Ontology Acquisition
    2. Contexts
    3. Entity relationships
    4. Synonyms
    5. Patterns
    6. Atom Types
    7. Rules
    8. Atom Filter training
    9. Unique Identifiers – production of UIK permutations
    10. Schema expansion

Sphere Data Flows

Key sphere data flows include the following:
    1. Publishing
    2. Merging data across multiple pipes
    3. Conflict resolution
    4. Manual refinement
    5. Rules
    6. Crowd worker data refinement
    7. Ontology refinement
    8. Quality refinement
    9. Performance refinement

Framework Data Flows

Key framework data flows include the following:
    1. Data quality measurement
    2. Key Performance Indicators (KPI)
    3. Cloud utilization optimization
    4. Auto-testing optimization
    5. Pipe caching optimization
Appendix A: kTables Tags
Owner: Moshe

Each pipe may produce several tables of data, each category of entities maintained in a separate table. Each table comprises a matrix of fields, each field belonging to a specific column and row. Each field, column, row, table and pipe may have several properties maintained as key/value pairs, maintained by kTables at an appropriate pipe, table, column, row and field level that reflects their scope. The kTables keys are collectively referred to as kTable tags, maintained in a kTableTags enumerator. The property values are often objects defined in designated APIs.
Consider a site that sells books, CDs and videos with reviews. Then books, CDs, videos, prices and reviews can be treated as independent entities maintained in separate tables, with specific rows in one table (e.g. prices and reviews) linked to specific rows in other tables (e.g. book, CD and video). Each field is associated with a specific CategoryID to ensure that it is stored in the right table. Each field is also associated with a specific PageID and BlockID so that we can trace back to its origin.
kTables is created by a scraper agent whereupon it is refined by Cleanser and additional agents before finally being imported by an Importer agent into the QR. Prior to kTables creation, properties pertaining to the data that are created by the Wizard App and Spider Agent are maintained in the kRegistry of kFramework. This includes, for example, pipe properties such as a SphereID and SphereOntology, as well as the properties of the pages and page blocks that each table cell was extracted from. kTables therefore need only maintain a PipeId, PageId and BlockId so that all properties associated with the pipe, page and block can be obtained from the kRegistry. The scraper receives as input a list of pages that it needs to process including the PipeId and all PageIds that it needs to access them. Given a PageId, the scraper can get all BlockIds for that page. Similarly, when dealing with a cell containing a value with a given AtomType, kTables only needs to maintain the AtomTypeId to obtain additional properties of the AtomType from the OntoPlug.
The following table contains a list of properties per scope (pipe, table, column etc.) that are available via kTable. In many cases, the property is maintained internally – in other cases the property is prefixed with a ‘*’ or ‘#’ to indicate that it can be obtained via the kRegistry or OntoPlug respectively using a kTable tag designated in the second column. A ‘W’ value indicates which components write the tag value and an ‘R’ indicates which components read it for purposes described in the final column. Post pipe processing typically includes OntoServer analysis or QA evaluation of the quality of pipe processing. Highlighted tags are for Beta.
Property/Tag | Level/Scope | W/R (across App/Wiz, Spider, Scraper, Cleanser, Importer, Post Pipe) | Description/Purpose
PipeID | Pipe | W R | To access all related properties in kRegistry
*CustomerID | *PipeID | W R | Serving QR access control
*SphereID | *PipeID | W R | Each sphere has an independent QR
*OntologyName | *PipeID | W R R R R | OntologyName for OntoPlug.SetContext
*MinPedigree | *PipeID | W R R R R | Min Pedigree for OntoPlug.SetContext
*SiteID | *PipeID | W R | For auto-navigation to the original pages
BlockID | Field | W | Entity records spanning multiple blocks & pages must be merged
*PageID | *BlockID | W R | To find other blocks on the same page
*PageLayout | *PageID | W R | To apply correct scraper filter to each page by layout
AtomFilters | Pipe | W R R | OntoPlug.RelevantFilters(List<fields>) per block per layout for O…
*Prev/NextPage | *PageID | W R | To scrape records in the appropriate order
*ParentBlockID | *PageID | W R R | To find linked block on parent page
*ParentLinkValue | *PageID | W R R | To find linked block on parent page
*BreadCrumbs | *PageID | R W R | To auto-identify a category block & measure spider coverage
*CategoryContent | *PageID | W R R | To validate right choice of category block and categoryID
CategoryID | Table | W R R | OntoPlug.CategoryID(BreadCrumbs, CategoryContent)
*AgentsVersions | *PipeID | W W W W W R | List of agents that processed this kTables and their onto versions
UniqueKeySets | Table | W R | OntoPlug.UniqueKeySets(List<column.SemanticTags>) per Category
*PageTitle | *PageID | WR R | HTML page title to validate matching nugget title
IsNugget | Column | W R R | To apply QR indexing to these columns
ColumnContext | Column | W R | Scraped TableColumnOrder, Header, Prefix/Label, Style, AtomType
ColumnID | Column | W R | Autogenerated by Scraper based upon ColumnContext
UserColumnName | Column | (W) W R | Manually entered by user & registered in Scraper Filter (takes precedence)
SemanticTag | Column | W W R R | OntoPlug.ProbableSemanticTags(List<columnValues>, ColumnContext)
FirstSonColumn | Column | W | To find first descendant column produced by field decomposition
NextSonColumn | Column | W | Whereupon this leads to remaining descendant columns
ParentColumn | Column | W | To recursively find all ancestor columns
SkipColumn | Column | W R R |
#CanDecompose | *SemanticTag | W | OntoPlug.CanDecompose(SemanticTag); Wizard can also configure…
FieldProperties | Field | W W R | Link, ImageLink, ImageAlt, Empty, NewWord
FieldAtomType | Field | W W R R | To get OntoPlug atom type attributes, e.g. AtomValueMin/Max
FieldValue | Field | W W R R | Object containing Amount, UnitName etc.
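As an illustration of the tag scheme above, the following is a minimal sketch of how an agent might resolve a kTable tag, storing internally maintained tags in the kTable itself and delegating ‘*’ tags to the kRegistry (‘#’ tags would similarly go to the OntoPlug). The KTableTag subset, the IKRegistry interface and the method names are assumptions, not the actual kTables API.

    // Hypothetical sketch only: resolving kTable tags per scope (pipe, page, block, column, field).
    using System.Collections.Generic;

    public enum KTableTag { PipeID, PageID, BlockID, CategoryID, ColumnID, SemanticTag, FieldAtomType, FieldValue }

    public interface IKRegistry
    {
        // '*' tags are resolved here, e.g. page and block properties recorded by the Wizard and Spider.
        object Lookup(string scopeId, KTableTag tag);
    }

    public class KTableTagStore
    {
        private readonly IKRegistry _registry;
        private readonly Dictionary<(string ScopeId, KTableTag Tag), object> _local =
            new Dictionary<(string, KTableTag), object>();

        public KTableTagStore(IKRegistry registry) { _registry = registry; }

        public void Write(string scopeId, KTableTag tag, object value) => _local[(scopeId, tag)] = value;

        // Internally maintained tags win; anything else is fetched from the kRegistry.
        public object Read(string scopeId, KTableTag tag) =>
            _local.TryGetValue((scopeId, tag), out var value) ? value : _registry.Lookup(scopeId, tag);
    }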
Quality Metrics

The following table contains a list of quality metrics per pipe designed to monitor:
    a) Data quality improvements as the data flows through the pipe
    b) Pipe data quality and KPI (key performance indicator) improvements over time. Over time here could mean from one test cycle to the next due to a new code version or an improved ontology.
Several of the metrics are maintained per Type*Layout, i.e. for each Block Type (Record, Table, Category, Attachment) and each Page Layout in the pipe. KPIs are readily derived from these metrics, e.g. Suspicious/Validated characters, percentage of Empty Pages, percentage of Normalized/Atomic cells, etc.

Property/Tag | Level/Scope | W (across App/Wiz, Spider, Scraper, Cleanser, Importer, Post Pipe) | Description/Purpose
Bread crumb paths | Pipe | W | Number of queries – should correlate with Category IDs
Spider Pages | Pipe | W | Total navigations
Page Layouts | Pipe | W |
Linked pages | Pipe | W | Navigations via links
Broken links | Pipe | W |
Layout Pages | Layout | W | Scraped pages per layout ID
Empty pages | Layout | W |
Category IDs | Layout | W |
Blocks | Type*Layout | W | Category/Table/Record/Attachment blocks per layout ID
Texts | Type*Layout | W | Texts per block type per layout
Fields | Type*Layout | W | Extracted fields per block type/layout
Columns | Type*Layout | W W | Number of columns
Semantic Tags | Type*Layout | W W | How many of them tagged
Atomic Columns | Type*Layout | W W |
Atomic Cells | Type*Layout | W |
Normalized Cells | Type*Layout | W |
Total Chars | Type*Layout | W |
Unknown Atoms | Type*Layout | W | Candidate new atom names/patterns
Residue Chars | Type*Layout | W |
Leave As Is Chars | Type*Layout | W |
Validated Chars | Type*Layout | W | Output characters validated via their history as matching spe…
Suspicious Chars | Type*Layout | W | Non-validated output characters
Max Response | Pipe | W | Source response times
Total Response | Pipe | W | Total response times for all pages in spider process
Spider Time | Pipe | W | Total spider process
Scrape Time | Pipe | W |
Cleanse Time | Pipe | W |
Scrape Onto Time | Pipe | W | Onto processing time only
Cleanse Onto Time | Pipe | W | Onto processing time only
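By way of illustration, a minimal sketch of deriving the example KPIs named above from the recorded metrics. The class and field names are assumptions; the actual health monitor may compute these differently.

    // Hypothetical sketch only: deriving pipe KPIs from recorded quality metrics.
    public class PipeQualityMetrics
    {
        public long AtomicCells;
        public long NormalizedCells;
        public long ValidatedChars;
        public long SuspiciousChars;
        public long LayoutPages;     // scraped pages for a layout
        public long EmptyPages;      // empty pages for the same layout
    }

    public static class PipeKpis
    {
        // Percentage of atomic cells that were normalized.
        public static double NormalizedCellRate(PipeQualityMetrics m) =>
            m.AtomicCells == 0 ? 0 : 100.0 * m.NormalizedCells / m.AtomicCells;

        // Ratio of suspicious to validated output characters (lower is better).
        public static double SuspiciousToValidated(PipeQualityMetrics m) =>
            m.ValidatedChars == 0 ? double.PositiveInfinity : (double)m.SuspiciousChars / m.ValidatedChars;

        // Percentage of scraped pages in a layout that produced no content.
        public static double EmptyPageRate(PipeQualityMetrics m) =>
            m.LayoutPages == 0 ? 0 : 100.0 * m.EmptyPages / m.LayoutPages;
    }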
Appendix B: Ontology API
Owner: Naama
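Details TBD. As a starting point, the following is a minimal interface sketch that merely assembles the OntoPlug calls referenced elsewhere in this document (Appendix A and the Pipe Data Flow section). All signatures and helper types are assumptions, to be replaced by the actual API.

    // Hypothetical sketch only: OntoPlug calls referenced in this document, gathered in one interface.
    using System.Collections.Generic;

    public interface IOntoPlug
    {
        // Load the relevant KB portion for a pipe (see the *OntologyName and *MinPedigree tags).
        void SetContext(string ontologyName, int minPedigree);

        // Atom filters worth arming for the uFilters of a given block/layout.
        IList<AtomFilterInfo> RelevantFilters(IList<string> sampleFields);

        // Resolve bread crumbs / category block content into a CategoryID.
        string CategoryID(string breadCrumbs, string categoryContent);

        // Candidate unique identifier key sets for a category, given the columns' semantic tags.
        IList<IList<string>> UniqueKeySets(IList<string> columnSemanticTags);

        // Most probable semantic tags for a column, from sample values and scraped column context.
        IList<SemanticTagCandidate> ProbableSemanticTags(IList<string> columnValues, ColumnContextInfo context);

        // Whether a semantically tagged column can be decomposed by the cleanser.
        bool CanDecompose(string semanticTag);
    }

    public class AtomFilterInfo { public string AtomType; public string Pattern; }
    public class SemanticTagCandidate { public string Tag; public double Confidence; }
    public class ColumnContextInfo { public string Header; public string Style; public string Prefix; }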
Appendix C: Pipe Flow with Data Example
Owner: Oksana

1. Website Mapping
Result: Sample data

Pipe 1:
Column1 | Column 2 | Price | Column 3
ayn rand | fountainhead | 10.00$ | 5

Pipe 2:
Column1 | Column 2 | Price | Column 3
ayn rand | fountainhead | 10.00$ | 5

2. Collect data (scraper)
Process: System columns added – what columns? Data manipulation?
Result: Scraped data

Pipe 1 (collected from 10,000 pages, 15,000 records):
Column1 | Category | Column 2 | Price | Column 3
ayn rand | Philosophy | fountainhead | 10.00$ | 5 - excellent
Author1 | Cat1 | Book1 | 10.00$ | 3 - medium
Author3 | Cat3 | Book3 | 7.00$ | 2 – poor
Author3 | Cat1 | Book4 | 7.00$ | 4 – good

Pipe 2 (collected from 5,000 pages, 10,000 records):
Column1 | Column 2 | Column 3 | Price | Column 4
ayn rand | Inspiration | fountainhead | 10.00€ | 10
Author1 | Cat7 | Book1 | 5.00€ | 8
Author2 | Cat3 | Book2 | 5.00€ | 6

3. Cleansing
3.1. Decomposition
Breakdown to atoms? Based on what rules? Should prices and measure units be decomposed?
Result: Pipe 1, Column 3 decomposed to Column 3_1 and Column 3_2

Column1 | Category | Column 2 | Price | Column 3_1 | Column 3_2
ayn rand | Philosophy | fountainhead | 10.00$ | 5 | excellent
Author1 | Cat1 | Book1 | 10.00$ | 3 | medium
Author3 | Cat3 | Book3 | 7.00$ | 2 | poor
Author3 | Cat1 | Book4 | 7.00$ | 4 | good
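For illustration, a minimal sketch of the Column 3 decomposition shown above, assuming a simple "<rating> - <legend>" pattern. The class name and regex are hypothetical; the real cleanser is driven by atom filters and decomposition tags supplied by the OntoPlug.

    // Hypothetical sketch only: splitting "5 - excellent" into Column 3_1 = "5", Column 3_2 = "excellent".
    using System.Text.RegularExpressions;

    public static class RatingDecomposer
    {
        private static readonly Regex Pattern = new Regex(@"^\s*(?<rating>\d+)\s*[-–]\s*(?<legend>.+)$");

        // Returns (rating, legend), or (original value, null) when the field is already atomic.
        public static (string Column3_1, string Column3_2) Decompose(string column3)
        {
            var match = Pattern.Match(column3 ?? string.Empty);
            if (!match.Success) return (column3, null);
            return (match.Groups["rating"].Value, match.Groups["legend"].Value.Trim());
        }
    }

    // Example: RatingDecomposer.Decompose("5 - excellent") yields ("5", "excellent").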
3.2. Typing
Which data types do we distinguish today? String, Number, Price, Phone, Address? Is the breakdown into units and scale part of typing or of decomposition?
Result:

Pipe 1:
Column | Type | Units | Scale
Column1 | String | |
Category | String | |
Column 2 | String | |
Price | Price | $ |
Column 3_1 | Number | | 1-5
Column 3_2 | String | |

Pipe 2:
Column | Type | Units | Scale
Column1 | String | |
Column 2 | String | |
Column 3 | String | |
Price | Price | € |
Column 4 | Number | | 1-10

3.3. Column mapping & unique identifiers (entities & relationships)
Description TBD
Result:

Pipe 1:
Column | Mapped to | Identifier
Column1 | Author name | Yes
Category | Book category | Yes
Column 2 | Book name | No
Price | Book price | No
Column 3_1 | Book rating | No
Column 3_2 | Book rating legend | No

Pipe 2:
Column | Mapped to | Identifier
Column1 | Author name | Yes
Column 2 | Book category | Yes
Column 3 | Book name | No
Price | Book price | No
Column 4 | Book rating | No

3.4. Normalization
What normalization do we perform in the cleanser? As far as I understand, normalization can also be done at the sphere level? For example, does price normalization come into question only when we join data (one source in $ and the other in €), or do we already bring all data to one currency in the cleanse stage? The same for rating normalization.

3.5. Complete data
What completion do we perform in the cleanser? Is it not a sphere-related action as well? E.g. complete all zips, phones, etc. I think this is also something we cannot do automatically; user input will be required, with guidelines as to what data to complete and how.
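If normalization is eventually done by bringing all prices to one currency (whether in the cleanser or at the sphere level, per the open question in 3.4), a minimal sketch could look as follows. The class name, reference currency and rates are assumptions for illustration only.

    // Hypothetical sketch only: normalizing price atoms to a single reference currency.
    using System.Collections.Generic;

    public static class PriceNormalizer
    {
        // Configured conversion rates into the sphere's reference currency (here assumed to be $).
        private static readonly Dictionary<string, decimal> RatesToReference = new Dictionary<string, decimal>
        {
            { "$", 1.0m },
            { "€", 1.5m } // placeholder rate
        };

        public static decimal Normalize(decimal amount, string currencySymbol) =>
            amount * RatesToReference[currencySymbol];
    }

    // Example: PriceNormalizer.Normalize(10.00m, "€") yields 15.00m, making Pipe 2 prices comparable to Pipe 1.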
4. Merge data
4.1. Unify – simple merge
Description TBD
Result – Unified data:

Source (S) | Author name | Book category | Book name | Book price | Book rating | Book rating legend
Pipe 1 | Ayn rand | Philosophy | fountainhead | 10.00$ | 5 | excellent
Pipe 2 | ayn rand | Inspiration | fountainhead | 15.00$ | 5 |
Pipe 1 | Author1 | Cat1 | Book1 | 10.00$ | 3 | medium
Pipe 2 | Author1 | Cat7 | Book1 | 7.00$ | 4 |
Pipe 1 | Author3 | Cat3 | Book3 | 7.00$ | 2 | poor
Pipe 1 | Author3 | Cat1 | Book4 | 7.00$ | 4 | good
Pipe 2 | Author2 | Cat3 | Book2 | 7.00$ | 3 |

4.2. Merge data by unique identifier
4.2.1. Keep duplicates (default)
Enables the domain expert to decide how to resolve duplicates/conflicts – manually reconciling data from various sources.
Result – merged data with duplicates:

Author name | Book category | Book name | Book price | Book rating | Book rating legend
Ayn rand | Philosophy (Pipe 1) / Inspiration (Pipe 2) | fountainhead | 10.00$ (Pipe 1) / 15.00$ (Pipe 2) | 5 | excellent

At this point the domain expert can perform a decision round and decide how to treat the duplicates. Based on the example above, I decide to always take categories by pipe rating (Pipe 1 in my case), which actually defines the categories normalization here, and to keep prices from both sources.
Result example:

Author name | Book category | Book name | Book price – Pipe 1 | Book price – Pipe 2 | Book rating | Book rating legend
Ayn rand | Philosophy | fountainhead | 10.00$ | 15.00$ | 5 | excellent

4.2.2. Resolve all duplicates by pipe rating
Based on sphere preferences, in this case the merge should result in all duplicates automatically resolved by pipe rating.
Example – My sphere preferences = always resolve duplicates based on Pipe 1.
Result – merged data, duplicates resolved:
Author name | Book category | Book name | Book price | Book rating | Book rating legend
Ayn rand | Philosophy | fountainhead | 10.00$ | 5 | excellent
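A minimal sketch of the 4.2.2 behaviour, resolving each group of duplicate records by pipe rating. The record shape, the use of (Author name, Book name) as the unique identifier key and the class names are assumptions for this example only.

    // Hypothetical sketch only: keeping the record from the highest-rated pipe per unique identifier key.
    using System.Collections.Generic;
    using System.Linq;

    public class BookRecord
    {
        public string PipeId;        // source pipe
        public string AuthorName;    // assumed UIK part
        public string BookName;      // assumed UIK part
        public string BookCategory;
        public string BookPrice;
    }

    public static class DuplicateResolver
    {
        public static IEnumerable<BookRecord> ResolveByPipeRating(
            IEnumerable<BookRecord> records, IDictionary<string, int> pipeRating)
        {
            return records
                .GroupBy(r => (r.AuthorName, r.BookName))
                .Select(g => g.OrderByDescending(r => pipeRating[r.PipeId]).First());
        }
    }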
Appendix D: Domain Expert Worker Flow
Owner: Oksana

1. High level domain expert flow
More can be found here: https://sites.google.com/a/kinor.com/product/design/High-Level
Appendix E: Sphere Architecture and APIs
Owner: Irina
Appendix F: Framework Architecture and APIs
Owner: Aryeh
Appendix G: Spider Architecture and APIs
Owner: Yossi
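Details TBD. As a placeholder, a minimal sketch of how the spider might populate the PageRegistry defined in the Pipe Data Flow section when it follows a link to a new page. The class members and method names are assumptions.

    // Hypothetical sketch only: the spider registering a newly cached page in the PageRegistry.
    using System.Collections.Generic;

    public class SpiderPageRecordId { public string ParentPageId; public string ParentBlockId; public string LinkValueToPage; }

    public class SpiderRegisteredPage
    {
        public string PageId;
        public string LayoutId;
        public string BreadCrumbs;
        public SpiderPageRecordId ParentRecord;
        public List<string> BlockIds = new List<string>(); // filled in later by the scraper
    }

    public class SpiderRegistrar
    {
        public Dictionary<string, SpiderRegisteredPage> PageRegistry = new Dictionary<string, SpiderRegisteredPage>();

        // Called when the spider follows a link on parentPageId and caches the target page.
        public void RegisterPage(string pageId, string layoutId, string breadCrumbs,
                                 string parentPageId, string linkValue)
        {
            PageRegistry[pageId] = new SpiderRegisteredPage
            {
                PageId = pageId,
                LayoutId = layoutId,
                BreadCrumbs = breadCrumbs,
                // Block ids are foreign to the spider; only the link value is recorded so the scraper
                // can later resolve the parent block and record ids.
                ParentRecord = new SpiderPageRecordId { ParentPageId = parentPageId, LinkValueToPage = linkValue }
            };
        }
    }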
Appendix H: Scraper Architecture and APIs
Owner: Hagay

Subsequent iterations:
    a) Multiple record blocks: Linked records between blocks
    b) Optimize performance: replace TextList with linked texts, no need for spliceRecords
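For orientation, a minimal sketch of a uFilter as described in the Pipe Data Flow section: a field is captured by an armed atom pattern, a Text Equals label attribute or a Text StartsWith prefix attribute. All names and the precedence between attributes are assumptions.

    // Hypothetical sketch only: a uFilter deciding whether it captures a given field.
    using System.Text.RegularExpressions;

    public class UFilter
    {
        public string TargetColumn;    // block dataset column that captured fields are mapped to
        public Regex AtomPattern;      // atom filter pattern supplied by the OntoPlug, if any
        public string TextEquals;      // label capture
        public string TextStartsWith;  // prefix capture
        public bool Armed;             // only armed candidates are used by the scraper filter

        public bool Captures(string fieldText)
        {
            if (!Armed || fieldText == null) return false;
            if (TextEquals != null) return fieldText.Trim() == TextEquals;
            if (TextStartsWith != null) return fieldText.StartsWith(TextStartsWith);
            return AtomPattern != null && AtomPattern.IsMatch(fieldText);
        }
    }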
Appendix I: Cleanser Architecture and APIs
Owner: Ronen
Appendix J: kGrid Integration
Owner: Jair

kGrid was designed for cross-enterprise data integration - hence kGrid agents are fundamentally different than pipe agents in kFramework:
    1. kGrid agents run continuously, processing requests sent to their message queues whereas kFramework agents are dispatched to do a specific task.
    2. kGrid agents typically reside at specific Agency or Gateway locations until system dynamics warrant relocation whereas kFramework agents are dispatched per task anywhere in the cloud.
    3. kGrid agents cache the ontology that they need internally via an Ontology service whereas kFramework agents rely upon an OntoPlug to do their ontology-related work.
Kinor envisions future cross-enterprise contexts, so the fundamental kGrid architecture must be retained. This architecture enables kGrid Agencies and Gateways anywhere to dynamically discover each other and work together without disrupting operations. A robust dynamic discovery mechanism has yet to be implemented. Hence kGrid should initially be deployed internally using a static configuration dictated by an appropriate kGridConfig object in the kFramework kRegistry.
The following principles should enable immediate kGrid deployment for online database integration purposes within weeks:
    1) The designated machines will run Java 1.4.
    2) At least one of the machines will deploy MySQL for initial kGrid persistence.
    3) The kGridConfig object should initially comprise the following:
        a) A list of Agency and Gateway machines for deployment of the kGrid Agent Manager
        b) A kGrid compatible XML file per Agent Manager to configure its agents and services
        c) A minimalistic agent and services configuration to get started
    4) The kGrid discovery service should be adjusted to access kRegistry for connecting with other kGrid discovery services rather than attempt dynamic discovery.
    5) kGrid uses Protégé as an ontology editor with a plugin that knows how to engage the kGrid ontology service. The OntoServer must incorporate this plugin and translate its OWL ontology format into the OKBC format currently serving the Ontology Server.
The following kGrid improvements can then be implemented over subsequent months:
    1) Replace kGrid Query Builder with Sphere Query Builder
    2) Upgrade code to be compatible with Java 7.
    3) Use an OntoPlug to semi-automate Wrapper schema mapping.
    4) Self-organizing kGrid deployment to match operational demand
    5) It might be worth replacing the ODBC connectivity currently serving the kGrid wrappers with a standard DBMS interface that is more readily configured programmatically by kFramework when new database sources are added to a sphere.
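A minimal sketch of what the kGridConfig object held in the kFramework kRegistry might look like for such a static deployment. The class layout, property names and example service names are assumptions.

    // Hypothetical sketch only: static kGrid deployment configuration per the principles above.
    using System.Collections.Generic;

    public class KGridConfig
    {
        // a) Agency and Gateway machines on which the kGrid Agent Manager is deployed.
        public List<KGridNode> Nodes = new List<KGridNode>();

        // b) Path to a kGrid-compatible XML file per Agent Manager, configuring its agents and services.
        public Dictionary<string, string> AgentManagerConfigXml = new Dictionary<string, string>();

        // c) Minimalistic agent and services configuration to get started (example names only).
        public List<string> EnabledServices = new List<string> { "Ontology", "Wrapper", "Query" };
    }

    public class KGridNode
    {
        public string Host;
        public KGridRole Role;
    }

    public enum KGridRole { Agency, Gateway }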