Date: 30.08.2011

                                           Product Overview
                                              Owner: Jair


Introduction

The primary objective of this document is to consolidate context and interfacing for team
members engaged in product development. We begin with an overview of key product systems, their
components, system data flows and the roles of key components in each flow. A set of appendices then dive into
specific systems and interfaces in more detail – each appendix owned by a specific team member.

The terms defined in this document should be the terms used in all other related documents. Questions
and clarifications should be directed to specific section owners so that this document can continue to
improve and expand to achieve the document objectives.

Kinor Spheres and Apps

The Kinor mission is to provide powerful tools for ordinary ‘workers’ (knowledge workers) to
collaboratively harvest information (data and content) from any source (web and otherwise) in a
manner that best serves the needs of each worker – fully automated, private and personalized.
From a business perspective, the product conceptually comprises the following:

    1. Spheres - The information harvested from one or more sources is maintained in ‘spheres’, each
       sphere covering a specific domain of common interest to a group of sphere workers. Each
       worker group typically serves a specific business, organization or community. Typical harvested
       sources include web-based catalogues, professional publications, news feeds, social networks,
       databases, Excel worksheets and PDF documents - public and private.
    2. Pipes – Each source is conceptually connected to a sphere via a pipe that pumps harvested
       information from the source to a specific sphere on an ongoing basis. Spheres can be fed by any
       number of pipes, the pipes primed (configured) by a non-technical sphere administrator or
       collectively (crowd-sourced) by the workers themselves. The output of each pipe is a semantic
       dataset that is published to the sphere and maintained there for automated refresh via the pipe.
       The dataset is semantic in the sense that data within it is semantically tagged in a manner that
       enables subsequent data enrichment, integration and processing to be fully automated.
    3. Apps – A growing spectrum of sphere applications will empower each worker to automatically
       view and leverage published information within the sphere in a fully personalized way. The initial
       apps will be horizontal, i.e. readily applied to any sphere - each worker configuring the app to
       match personalized needs. Sample horizontal apps will:
           a. Enable workers (interactively or via API) to easily find the information that they need
                within a sphere and deliver it in the most useful form.
           b. Automatically mine sphere information for worker configured events, sentiments and
                inferences.
c. Automatically hyperlink and annotate sphere information with additional information
                 within a sphere as prescribed by sphere administrator or worker-defined rules.
        Horizontal apps will pave the way to an even greater number of pre-configured vertical
        apps drawing information from specific spheres, i.e. a specific ontology with appropriate pipes.
        Once the sphere core has been configured, vertical apps provide instant value out of the box
        available to all customers. Horizontal apps on the other hand enable workers to independently
        or collaboratively develop their own spheres with unique value available only to them.

Harvested information can either be cached (replicated) within the sphere or acquired on demand from
the source. When dealing with unstructured and semi-structured sources of information, e.g. the Web,
the harvested information will typically be replicated within the sphere unless the volume of
information is prohibitive. When dealing with fully structured sources, e.g. a database, the harvested
information can be acquired on demand if this will not disrupt operations for higher priority source
access. Needless to say, the response time for on demand acquisition will highly depend upon the
volume of information, the source availability and responsiveness and the semantic complexity of that
information, i.e. the computational resources needed to semantically tag and process that information.

Semantic Intelligence

Kinor empowerment of non-technical workers is achieved by cultivating and applying semantic
intelligence to automate every possible aspect of harvesting, processing and applying information. The
semantic intelligence is maintained in a knowledge base (KB) linked to a growing web of
ontologies mapped to a common super-ontology. Each sphere addresses a specific domain of interest,
e.g. online book stores. During the information acquisition stage, the KB associated with the sphere
ontology semantically tags identified entities and their properties within that information. The sphere
ontology must therefore contain a set of inter-related schemas (frames) that describe all entities in the
sphere, e.g. books, authors, publishers, suppliers, prices and reviews. Each entity schema must also
contain anticipated properties (frame slots), e.g. book properties might include title, author, publisher,
ISBN and year of publication.

Kinor internally refers to each entity property as an atom, each atom assigned a predefined semantic
tag and each atom value a predefined atom type. Thus, for example, a ‘year of publication’ must be a valid year
and a ‘book publisher’ must be a valid publisher name. Atom types are typically recognized by a set of text
patterns (e.g. a four-digit year) or a name thesaurus (e.g. formal publisher names and known synonyms
for these names). Atom filters auto-recognize specific atom types while considering both the atom value
and the atom context in which the value was found, e.g. a column name (e.g. ‘home phone’) or a
prefix (e.g. ‘(H)’). Armed with adequate semantic intelligence, blocks of information piped into a sphere
are automatically parsed by atom filters into records (e.g. book records) of semantically tagged
properties to be associated with a specific entity (e.g. a specific book instance).
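To make the atom filter idea concrete, here is a minimal sketch (Java is used for illustration; the class name, tag name and context hints are hypothetical, not product API): it recognizes a ‘year of publication’ atom by combining a four-digit-year pattern with a context hint such as a column header.

    import java.util.Set;
    import java.util.regex.Pattern;

    // Hypothetical sketch of an atom filter: the atom type is recognized from the
    // field value (a text pattern) together with the context in which it was found
    // (e.g. a column header), as described above.
    public class YearAtomFilter {

        private static final Pattern FOUR_DIGIT_YEAR = Pattern.compile("^(19|20)\\d{2}$");
        private static final Set<String> CONTEXT_HINTS = Set.of("year", "published", "publication");

        /** Returns the semantic tag if the value and its context look like a year of publication. */
        public String tag(String value, String context) {
            boolean valueMatches = FOUR_DIGIT_YEAR.matcher(value.trim()).matches();
            boolean contextMatches = context != null
                    && CONTEXT_HINTS.stream().anyMatch(h -> context.toLowerCase().contains(h));
            return (valueMatches && contextMatches) ? "YearOfPublication" : null;
        }

        public static void main(String[] args) {
            YearAtomFilter filter = new YearAtomFilter();
            System.out.println(filter.tag("1943", "Year of publication")); // YearOfPublication
            System.out.println(filter.tag("1943", "Pages"));               // null
        }
    }

A real atom filter would of course also consult the name thesauri and the shared bank of patterns in the super ontology.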

All sphere schemas are mapped to the super ontology with its shared bank of atom types and their
respective atom filters and contexts so that semantic intelligence can be cultivated collectively (e.g. new
filter patterns and thesaurus entries) by all spheres. Atom thesauri in the super ontology also retain
frequently adjoined entity properties (e.g. a given publisher city, state and country) to facilitate the
auto-acquisition of new entity names and synonyms. The super ontology can thus readily expand
automatically with relatively little supervision.

Entity properties from one pipe can subsequently be merged (joined) with entity properties from other
pipes when all properties have been attributed to the same entity. Matching entities across pipes can be
challenging since each pipe record must have a unique identifier key (UIK) based upon properties
available in each record. A set of properties can uniquely identify an entity with a degree of confidence
that can be computed empirically. The ISBN property alone (when available) can serve as an ideal UIK for
book entities, whereas a UIK based upon the book title and author properties is somewhat less reliable.
Entity records from multiple pipes can only be merged if they have UIKs with adequate confidence levels
to be determined by a sphere administrator or worker.
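As a sketch only (the confidence values and names are assumed for illustration, not taken from the product), UIK selection and the merge gate described above might look as follows:

    import java.util.Map;
    import java.util.Optional;

    // Hypothetical sketch: pick the best available unique identifier key (UIK) for a
    // book record and only allow merging across pipes when its confidence is adequate.
    public class UikSelector {

        public record Uik(String key, double confidence) {}

        /** Prefer ISBN when present; fall back to title+author with lower confidence. */
        public Optional<Uik> bestUik(Map<String, String> record) {
            if (record.containsKey("isbn")) {
                return Optional.of(new Uik(record.get("isbn"), 0.99));   // near-ideal UIK
            }
            if (record.containsKey("title") && record.containsKey("author")) {
                return Optional.of(new Uik(record.get("title") + "|" + record.get("author"), 0.85));
            }
            return Optional.empty();
        }

        /** Merging is allowed only above a threshold set by a sphere administrator or worker. */
        public boolean canMerge(Uik a, Uik b, double minConfidence) {
            return a.key().equals(b.key())
                    && a.confidence() >= minConfidence
                    && b.confidence() >= minConfidence;
        }
    }

In practice the confidence of a given property set would be computed empirically, as noted above.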

Semantic intelligence can only operate on sphere information mapped to the sphere ontology. Schemas
from public and private ontologies are acquired and retained in an ontology bank mapped to the super
ontology. Unmapped sphere data can then be schema-matched with schemas in the ontology bank to
semi-automatically add or expand sphere schemas to map them. When dealing with well-structured
catalogue sources, ontologies can be auto-generated from the catalogue structure and data itself.
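As a rough illustration only (the scoring is an assumption, not the actual matcher), schema matching against the ontology bank could start from simple property-name overlap, leaving low-scoring candidates for semi-automatic confirmation:

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical sketch: score how well an unmapped pipe schema matches a candidate
    // schema from the ontology bank by simple property-name overlap (Jaccard).
    public class SchemaMatcher {

        public static double matchScore(Set<String> pipeColumns, Set<String> bankSchemaSlots) {
            Set<String> a = lower(pipeColumns), b = lower(bankSchemaSlots);
            Set<String> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
        }

        private static Set<String> lower(Set<String> names) {
            Set<String> out = new HashSet<>();
            names.forEach(n -> out.add(n.toLowerCase()));
            return out;
        }

        public static void main(String[] args) {
            System.out.println(matchScore(
                    Set.of("Title", "Author", "ISBN", "Price"),
                    Set.of("title", "author", "isbn", "publisher"))); // 0.6
        }
    }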

Key Product Systems

Key product systems include the following:

    1. Pipes – The pipes system schedules all tasks related to the harvesting of information from the
       Web and additional sources and subsequent data processing. Each task is executed by one or more
       agents distributed in a cloud. The most common pipe tasks include spidering (collecting
       designated pages of interest from a specific web site), scraping (extracting designated blocks of
       information from those pages), cleansing (decomposing those blocks into semantically tagged
       atoms and normalizing the atom values where possible) and finally importing the semantic
       dataset into a sphere repository (this task sequence is sketched after this list). The pipes system
       also includes a Wizard for priming the pipes.
    2. Spheres – Each sphere retains fresh semantic datasets for each pipe in a query-ready repository
       (QR) capable of serving a growing number of horizontal and vertical apps. The QR must respond
       to app queries on demand while also enabling a growing spectrum of ongoing app tasks to
       process and augment the QR in the background. Each QR atom maintains the history of that
       atom starting with the pipe history produced by the pipe. The origins of each atom and value
       are thus readily traced back to the source and the data processing tasks.
    3. Ontology – The ontology system comprises a centralized ontology server (OntoServer)
       working in unison with any number of ontology plugs (OntoPlug) to apply semantic intelligence
       to every possible aspect of harvesting, processing and applying sphere information. The
       OntoServer cultivates and maintains the KB for all spheres while the OntoPlug caches a minimal
       subset of the KB to serve specific agent tasks.
    4. Applications - A web-based user interface provides integrated user access to all authorized
       applications (apps) including administrator apps for managing the above systems.
5. Framework – A common framework that enables all of the above systems to run
      securely and efficiently atop any private or public cloud.
   6. E-Meeting – An interactive conferencing facility fully integrated with the product that enables
      existing and potential customers to instantly connect with designated support and sales
      representatives for instant pilots, training, assistance and trouble-shooting.
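For orientation, the common pipe task sequence from item 1 above is sketched below (Java for illustration; the enum and interface are hypothetical – in practice each task is executed by one or more agents distributed in a cloud):

    import java.util.List;

    // Hypothetical sketch of the common pipe task sequence: spider, scrape, cleanse, import.
    public class PipeRun {

        enum PipeTask { SPIDER, SCRAPE, CLEANSE, IMPORT }

        interface Agent { void execute(PipeTask task, String pipeId); }

        /** Run the standard task sequence for one pipe refresh. */
        public static void refresh(String pipeId, Agent agent) {
            for (PipeTask task : List.of(PipeTask.SPIDER, PipeTask.SCRAPE,
                                         PipeTask.CLEANSE, PipeTask.IMPORT)) {
                agent.execute(task, pipeId);
            }
        }

        public static void main(String[] args) {
            refresh("books-pipe", (task, pipeId) -> System.out.println(pipeId + ": " + task));
        }
    }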




Key Pipe Components

Within the Pipes system, key components include the following:

   1. Wizard – The Pipe configuration wizard enables a non-technical user to prime a pipe within
      minutes, i.e. to direct the pipe in how it should navigate within a Web site to harvest all pages of
      interest and subsequently extract from those pages all required blocks of information. Very few
      user clicks are needed to determine:
          a. Which deep web forms (searches and queries) to post (submit) with which input values.
          b. Which hyperlinks and buttons to subsequently navigate (follow) to harvest result pages
               of interest. Note that some result pages may lead to additional pages with different
               layouts - hence each page must also be associated with a specific layout id.
c. Which blocks of information per page are to be extracted and semantically tagged for
            each layout id. A scraper filter is subsequently generated per layout id – hence blocks of
            interest need only be marked for one sample page per layout. Additional sample pages
            are randomly chosen to test the scraper filter for worker feedback using several pages.
        d. When it should revisit the site to refresh that pipe dataset.

    Throughout this process the wizard will provide feedback regarding the pipe dataset that will be
    produced using the current pipe configuration as well as the anticipated price tag for acquiring
    and refreshing the dataset. The pipe dataset produced for Wizard feedback will use a relatively
    small sample of pages for user feedback within seconds and the dataset will not be published to
    the sphere. The user can subsequently refine the pipe configuration to better suit user needs.
     Once primed via the wizard, the pipe can operate autonomously as depicted in the diagram
     above: ‘Map a website’ followed by ‘Run’, resulting in ‘Notification’ when the pipe
     completes its operation (‘End’).

2. Spider – Any number of spider agents can then interact in parallel with the source web site to
   harvest all pages of interest. The harvesting is accomplished in two stages:
        a. Site spidering – A multi-threaded collection of URLs and subsequent postings and
            navigations is produced with a unique id, an order tag (to collate scraper results in the
            proper order), a parent tag (the id of the page that pointed to it) and a page layout id
           tag. New pages are readily flagged by comparing the new collection with the previous
           one. The harvesting of pages can subsequently be parallelized in an appropriate order
           by allocating subsets of the collection to several agents.
       b. Page harvesting - Either all pages or only newly flagged pages are cached in the pipe
           repository by any number of spider agents with order tags so that the pages can
           subsequently be processed in an appropriate order. Each harvested page is recorded in
           a page index with sourcing tags that include the site name, an order tag, bread crumb
           tags (i.e. posted form inputs and navigations) that led to this page, a layout id tag that
           identifies the scraper filter for extracted blocks from that page, the harvesting date and
           time and the site response time for that page.
3. Scraper – The layout id tag is used to apply the appropriate scraper filter to extract the
   designated information blocks per page and transform them into dataset records. Any number
   of scraper agents can do this in parallel, each agent producing a dataset of scraped records per
   page. The page datasets are then merged into a pipe dataset in an appropriate page order. Key
   scraper filter components include the following:
        a. Sequence filter – Matching tag sequences are used to mark designated page blocks. The
           sequences are robust by being sparse (only key tags are included) and depth sensitive
           (reflecting how deep they are in the element tree).
       b. Block table filters – Blocks with conventional table structures use these structures to
           parse records with context tags that include column numbers and headers where
            available. Filters are robust in that they handle nested tables and a variety of table
            structures while tolerating missing tags. Tables can either be vertical or horizontal.
c. Record markers – New dataset records are identified by table structure or by distinct
   fields within the record. Thus, if the third field in the record is always a telephone
   number, the beginning or end of the record is readily found. Record markers take into
   consideration that some fields might be broken into multiple parts for multiple styles,
   hyperlinks, multiple lines and other special effects within a single field.
d. Context – Context is crucial to automated semantic tagging, hence all relevant column
   headers and value prefixes (e.g. QTY in ‘QTY:50’) are extracted and attached to relevant
   field values (e.g. ‘50’). Frequent contexts are retained in the sphere ontology so that
   probable context can be auto-identified by structure (e.g. table position), style (e.g.
   color/emphasis) and content (e.g. frequent context values).
e. uFilters (micro-filters) – Block filters may contain any number of uFilters to mark records
    and context as the block information is being parsed (a minimal uFilter sketch follows this
    list of pipe components). The set of uFilters is applied to every field in the block, each
    uFilter checking for a specific combination of field attributes that include the following:
          i. Field content – A specific text (Equals, StartsWith, EndsWith) or equivalence
             with the page title as identified by TITLE tags.
         ii. Field atom type – Contains a designated text pattern or name found in a
             designated thesaurus.
        iii. Field style – Has a designated set of style attributes, e.g. font color/size,
             emphasis, header level (e.g. H2).
        iv. Field ID – Was preceded by a designated “id=” tag.
         v. Field Column – Appears in a specific table column.
   All uFilters are applied to their designated blocks of information before the record
   parsing begins. uFilters can also be configured to mark nearby fields:
        vi. Field displacement – the marked field is a given number of fields before/after.
        vii. Field expand – fields before/after are also marked until specific tags are matched.
f. Active atom filters and patterns – uFilters are applied to all block fields, hence the need
   to minimize the computational requirements of each uFilter. Atom filters can be heavy,
   hence only those filters that appear in the uFilters are active. Moreover, when dealing
   with atom filters with several patterns, only patterns that actually captured atoms
   during the pipe priming will be active. An active atoms tag in each scraper filter must
   contain the active atoms and patterns so that only these will be armed in the OntoPlug
   serving the scraper.
g. Variable records – Some pages may contain more record fields than others, whereupon
   placing the right fields into the right columns can be challenging. Context tags are
   helpful only if they are available. Hence the need for additional field context as defined
   by the uFilters that captured each field. Field context tags include TITLE, atom type, field
   style, field ID and field column.
h. Layout uFilters – Inconsistencies in the page templates (the actual template may
   depend upon the number of results) may cause the sequence filter to break. Optional
   layout uFilters mark block begin/end as a backup.
i.   Block relationships – Each record block is ultimately parsed into a sequence of
             semantically tagged atoms that belong to a specific record. This specific record might be
             one of several records in a table block on a parent page. A table block contains several
             records (e.g. pricing options) that might all belong to a single record. Hence block
             relationships between pages and on the same page are crucial. The following
             assumptions are made:
                    i. All record blocks on the same page belong to the same record.
                   ii. All blocks on the same page belong to a specific record on the parent page that
                       pointed to that page.
                  iii. A table block belongs to the same record (as a table within a record) as the
                       record blocks on the same page.
                  iv. An attachment (history) block belongs to all records in the table block on the
                       same page.
          j. Record categories – Records extracted from a single site might all belong to one or more
             categories, e.g. in an online catalogue in which different product categories will have
             different sets of columns. A page dataset must therefore be associated with a specific
             category. The category might appear on the page as a category block or it might be auto
             identified by the bread crumbs that led to that page.
         k. Category taxonomies – When comparing records from multiple pipes it might be
             necessary to compare only those that belong to the same category. In such cases there
             is a need to develop a standard category taxonomy for the sphere and map the local
             pipe categories to that standard taxonomy. This is referred to as automated category
             harmonization and it is carried out by the Ontology system.
   Upon selecting a page block and indicating the nature of the block contents (table, record,
   category, or attachment), the generation of candidate filters is fully automated whereupon the
   selection of a specific candidate filter is finalized by the wizard ‘by example’. This means that
   the candidates are ordered by probability and a ‘guess’ button enables the Wizard to try each
   candidate filter until the user is satisfied with the results in the Collected Data pane.
4. Tables (kTable) – All datasets produced by the pipe are maintained in a kTable component that
   is first produced by a Scraper agent and then passed on to additional pipe agents.
         a. Columns, atoms, column relationships
         b. Tables within tables and block relationships.
5. Cleanser – Tables are further refined by cleanser agents that may operate on specific sets of
   records and columns. Any given kTable may undergo several cleanser iterations as the semantic
   intelligence expands to cover more table columns and specific atoms. Hence subsequent
   cleanser iterations will typically focus on specific columns. Specific columns might also be tagged
   by the Wizard to remain ‘as is’. Similarly, the OntoPlug will indicate to the cleanser which
   columns can’t be cleansed due to a missing semantic tag or a semantic tag with inadequate
   intelligence in the KB. Key cleanser operations include:
         a. Column decomposition – Identifying atoms within a field and creating new columns for
             each of them. Decomposition can be recursive when atoms can further be decomposed
             into more refined atoms.
b. Atom normalization – Atoms that can be normalized are normalized as directed by atom
                normalization tags associated with each column.
       All transformations are documented in the kTable history so that the origin of each atom and
       value is readily traced. The refinement state of each kTable (e.g. percentage of cleansed
       columns and decomposed and normalized cells) is updated so that the quality impact of each
       agent can also be instantly assessed.
    6. Enrich – The sphere administrator is tooled with a horizontal app that analyzes QR data to
       identify potential rules that govern entity properties. Thus, for example, a memory chip with a
       given capacity, error checking capability and packaging will always have a given number of pins.
       Similarly, a person under age will not be married or have a driver license. The administrator may
       add rules that reflect domain knowledge and the app is semi-supervised to ensure that
       incidental patterns are not adopted as rules. The enrich agent can then use these rules to fill in
       missing data and flag suspicious data. Rules may also have confidence levels that are reinforced
       or diminished by additional data and administrator confidence. The kTable history must
       document all rules that affected a transformation and a follow-up tag is added to all suspect
       atoms to ensure that they are subsequently given the right attention. All tags also record when
       they were created and by whom so that the quality of treatment can be evaluated. The enrich agent may also
       insert additional entity properties found in the super ontology for identified entities.
    7. Import – This agent imports a kTable into the QR when it is ready. The importer must also
       concatenate all record snippets from multiple blocks and pages into complete records. This is
       also where the UIK is determined via the OntoPlug.
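As referenced under uFilters above, here is a minimal uFilter sketch (Java for illustration; the field attributes and method names are assumptions, not the product API). It models a uFilter as a conjunction of optional checks over a field's content, atom type and column:

    import java.util.function.Predicate;

    // Hypothetical sketch of a uFilter: a combination of optional checks over a field's
    // attributes (content, atom type, column, etc.), all of which must pass for the
    // field to be marked, as described for uFilters above.
    public class UFilter {

        /** Minimal view of a scraped field and its attributes. */
        public record Field(String content, String atomType, String style, String id, int column) {}

        private Predicate<Field> check = f -> true;

        public UFilter contentStartsWith(String prefix) {
            check = check.and(f -> f.content().startsWith(prefix));
            return this;
        }

        public UFilter atomType(String atomType) {
            check = check.and(f -> atomType.equals(f.atomType()));
            return this;
        }

        public UFilter column(int column) {
            check = check.and(f -> f.column() == column);
            return this;
        }

        public boolean matches(Field field) {
            return check.test(field);
        }

        public static void main(String[] args) {
            UFilter priceInColumn3 = new UFilter().atomType("Price").column(3);
            Field field = new Field("10.00$", "Price", "bold", "price1", 3);
            System.out.println(priceInColumn3.matches(field)); // true
        }
    }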

As implied by the above, the pipe value proposition already includes the following:

    1. Semantic tagging that can also serve the auto-generation of RDFa content.
    2. Field decomposition
    3. Atom normalization
    4. Enriched data, e.g. missing fields using rules and additional entity properties from the super
       ontology.
    5. Flagged suspicious data
    6. The best available UIK for these records with a confidence level.

To emphasize this value proposition and manage customer expectations, the Wizard will reflect the pipe
output in a Collected Data pane. This means that a small sample of pages will be run through the pipe to
produce a kTable that will be presented by the Wizard in that pane for additional user feedback. User
feedback may be global (trying a different candidate filter) or local (selecting a specific semantic tag for a
column, changing the UIK or configuring a column to be left ‘as is’). By default, a column that is not
semantically tagged will be parsed syntactically by the cleanser without semantics.
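For illustration, the two cleanser operations described earlier – column decomposition and atom normalization – might look roughly as follows for a price column (the pattern, atom names and currency mapping are assumptions for the sketch, not the product API):

    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical sketch of two cleanser operations: decomposing a field into atoms
    // and normalizing an atom value.
    public class PriceCleanser {

        private static final Pattern PRICE = Pattern.compile("([0-9]+(?:\\.[0-9]+)?)\\s*([$€])");

        /** Decompose a raw price field such as "10.00$" into amount and currency atoms. */
        public static Map<String, String> decompose(String field) {
            Matcher m = PRICE.matcher(field.trim());
            if (!m.matches()) {
                return Map.of();                       // leave the column 'as is'
            }
            return Map.of("Amount", m.group(1), "Currency", normalizeCurrency(m.group(2)));
        }

        /** Normalize a currency symbol into a canonical code. */
        private static String normalizeCurrency(String symbol) {
            return switch (symbol) {
                case "$" -> "USD";
                case "€" -> "EUR";
                default -> symbol;
            };
        }

        public static void main(String[] args) {
            System.out.println(decompose("10.00$")); // {Amount=10.00, Currency=USD} (order may vary)
        }
    }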

Key Sphere Components

The harvested pipe datasets are published and maintained in a sphere for worker access via a growing
spectrum of horizontal and vertical apps. Horizontal apps empower each worker to automatically view
and leverage the published datasets in a fully personalized way. Most apps have both ongoing and
interactive processes:

        Ongoing app processes analyze and augment sphere data to make it more useful. Thus, for
        example, a financial app might use NLP and additional techniques to identify relevant news and
        sentiments on a variety of sites. An e-commerce app might seek reviews that compare
        competing products so that consumers seeking one product can be offered cheaper deals for
        similar products.
        Interactive app processes deliver personalized views to users and respond to consumer
        activities. Thus, for example, the financial app will deliver the news that each user subscribed to
        and the e-commerce site will offer deals that match products being sought via a partnering
        search engine. Each user might also want to search the sphere datasets for specific information.

The sphere administrator will also want to optimize the quality and availability of the sphere
information. To this end there will be a need for additional administrative processes:

        Quality optimization processes seek methods to continuously improve the quality of the
        information, e.g. by identifying rules that will flag erroneous data and by seeking corroborating
        sources to increase the confidence levels of app results.
        Performance optimization processes seek methods to improve app performance, e.g. by
        maintaining frequently queried aggregated datasets that take precedence over the datasets
        from which they were aggregated, to improve app response times.

To support the above, the sphere system will initially comprise the following key components:

    1. QR – Vertical database containing all data collected from all pipes.
    2. QBE – Enabling users to find the sphere information that they need without prior knowledge of
       the sphere categories, schemas and value formats.
    3. Indexer – Building an index of unstructured nuggets so that they can be semantically searched.
    4. Rules Builder – Tools for automatically deriving rules and their confidence levels from the QR
       and manually editing and building them for auto-enrich and the flagging of suspicious data.
    5. kGrid – Delivering ad hoc data integration on-the-fly from disparate online database sources.
       This will initially serve only cloud-hosted databases until the patent-pending bio-immune
       security is implemented whereupon it can also tap into remote private databases.

Conceptually, every entity in the sphere ontology has an independent QR table. When dealing with
catalogues, every product entity may have different properties, resulting in a very sparse database.
Upon querying for a specific product with given properties, the QBE will identify all properties that
sufficiently match the property names and constraints to include all relevant products. Properties
already covered by the ontology will match all known property names. Properties not covered will seek
similar names. Routine OntoServer processes will attempt to identify equivalent properties across sites
to add these properties to the ontology.
kGrid maps online databases to ontological entities so that their data can also be queried on demand
and combined with data already within the sphere.
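As a sketch only (the thesaurus content and method names are assumed for illustration), the semantic expansion of a queried property name to all equivalent names known to the sphere ontology might look like this:

    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch: expand a queried property name to all property names known
    // to be equivalent in the sphere ontology, so that QBE can match differently named
    // columns across sources.
    public class PropertyExpander {

        // Assumed thesaurus content for illustration only.
        private static final Map<String, Set<String>> SYNONYMS = Map.of(
                "author", Set.of("author", "writer", "written by"),
                "price",  Set.of("price", "cost", "list price"));

        /** Properties covered by the ontology match all known names; others match only themselves. */
        public static Set<String> expand(String queriedProperty) {
            return SYNONYMS.getOrDefault(queriedProperty.toLowerCase(), Set.of(queriedProperty));
        }

        public static void main(String[] args) {
            System.out.println(expand("Author"));  // [author, writer, written by] (order may vary)
            System.out.println(expand("Binding")); // [Binding] - not yet covered by the ontology
        }
    }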

Spheres will typically fall into one of the following classes:

    1. Simple – These are spheres that can readily be built and applied by non-technical users without
       assistance.
    2. Complex – In these spheres, power users use advanced features to get the job done with
       appropriate Kinor guidance.
    3. Custom – These spheres require Kinor customization and/or additional features to get the job
       done.

Progressive sphere improvements will ensure that more and more complex spheres will become simple
and fewer spheres will require any customization.

Key Ontology Components

The ontology system comprises the following key components:

    1. OntoServer – An integrated environment for managing, maintaining and importing/exporting
       ontologies and super ontologies. Must also provide an API for OntoPlug access to the KB.
    2. OntoPlug – An OntoServer proxy capable of rapidly loading select portions of the KB from the
       OntoServer and providing all ontology-based services to all designated clients in the Pipe,
       Spheres and Apps (an interface sketch of these services follows this list). Current (immediate)
       OntoPlug clients include the following:
           a. Wizard – Determining which atom filters and patterns should be armed for uFilters.
               Identifying the most probable semantic tags per column. Also auto-suggesting blocks of
               information that match the sphere ontology.
           b. Spider – Uses atom types and learned properties to choose the best form inputs.
           c. Scraper - Identifying atoms as needed by uFilters to scrape pages.
           d. Cleanser – Decomposing and normalizing cell values.
           e. Importer – Proposing the best UIKs per dataset.
           f. QR – Choosing the best UIKs across datasets and using thesauri to semantically expand
               queries to cover all synonyms and possibly even instances.
           g. Content enrich app – Finding known entities in an unstructured text (tokenizing) and
               enriching the text with the entity properties.
    3. Ontology Editor – An ability to view existing ontologies and manually model new ontologies as
       needed.
    4. Ontology Builder – Auto extension of a sphere ontology using the ontology bank or via the
       adoption of pipe schemas as an ontology (e.g. in an online catalogue) and auto-mapping other
       pipe schemas to that ontology (column harmonization).
    5. Ontology Trainer – Auto acquisition of new thesaurus entries and their synonyms from
       undefined atoms collected by the pipes. New patterns must also be acquired in a semi-
       supervised process.
    6. Category Harmonization – Analyzing pipe categories to produce a category taxonomy or
       enumerator (id) that all pipe categories can readily be mapped to.
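The OntoPlug services already named in this document and in Appendix A can be summarized in an interface sketch (Java for illustration; parameter and return types are approximations, not the final API):

    import java.util.List;
    import java.util.Set;

    // Sketch of the OntoPlug services named in Appendix A; parameter and return types
    // are approximations for illustration, not the final API.
    public interface OntoPlug {

        /** Load the KB subset for a given sphere ontology and minimum pedigree. */
        void setContext(String ontologyName, int minPedigree);

        /** Atom filters/patterns worth arming for the given sample field values. */
        List<String> relevantFilters(List<String> fieldValues);

        /** Most probable semantic tags for a column, given its values and context. */
        List<String> probableSemanticTags(List<String> columnValues, String columnContext);

        /** Whether a semantically tagged column can be further decomposed into atoms. */
        boolean canDecompose(String semanticTag);

        /** Candidate unique identifier key sets for a dataset, per category. */
        List<Set<String>> uniqueKeySets(List<String> columnSemanticTags);

        /** Category id derived from bread crumbs and/or a category block's content. */
        String categoryId(List<String> breadCrumbs, String categoryContent);
    }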

Key Framework Components

The framework system seamlessly deploys the other systems on any private or public cloud. The
framework is designed to optimize cloud utilization, measure performance, automate testing and readily
support a growing number of apps. Key framework components already include:

    1. Repository – Persistent storage for all configuration and operational data serving the pipes
        including cached pages, scraper filters, kTables and recorded data.
    2. Scheduler – Scheduling pipe agents to meet quality of service (QoS) and refresh requirements
        while also catering to hi-priority agent tasks initiated by the Wizard on demand.
    3. Planner – Planning a schedule for
    4. Back office apps -
    5. Recorder – Recording all activities for testing, performance optimization and exception handling.
    6. Auto-tester – Using archived repository content to do regression testing and ascertain that the
        quality and performance only improves with new system versions.
    7. Health monitor – Analyzing recorded data to measure KPIs (key performance indicators) that
        reflect system health and ascertain that it is acceptable and only improving.
    8. Agent emulator – To support agent debugging outside the framework.
    9. Portable GUI – An integrated web-based framework for all user interactions with these systems
        including all apps. It is portable in the sense that it will ultimately support several deployment
        modes (e.g. with and without code downloads) without modifying the apps.
    10. Bio-immune security – Elements of kGrid and its patent-pending bio-immune security will be
        integrated into the framework to support secure import and export of enterprise data for
        arbitrary cloud-based applications.

Given the generic nature of this framework, it might ultimately be open sourced to enable others to
develop new pipe agents and apps. Upon implementing the bio-immune security, the framework might
be sold as an independent product for secure cloud computing.

eMeeting System

An interactive web-based conferencing facility must be fully integrated with the product to enable
existing and potential customers to instantly connect with designated support and sales representatives
for instant pilots, training, assistance and trouble-shooting.

System Flows

Each of the above systems has key data flows to be surveyed in the following sections with the roles
played by key components. The key data flows will be reviewed in the following order:

    1. Pipe Data Flow
    2. Application Data Flows
3. Ontology Data Flows
    4. Sphere Data Flows
    5. Framework Data Flows

Data flows typically span multiple systems but each will be addressed in the context of one system.

Pipe Data Flow

Key pipe data flows include the following:

    1. Record assembly – Parallel caching and scraping can result in the loss of order and relationships
       (parent-son) between records spanning multiple pages.
           a. Consider, for example, a table of books (block 1) with links to book details (block 2) on
               separate pages that include a link to a collection of reviews (block 3) on a single page.
               Moreover, the table of books may contain additional category information that relates
               to all books in the table (block 4).
           b. Then each block consists of one or more block records, each consisting of one or more
               data fields that the scraper will extract with their field contexts. Such a dataset could
               subsequently be imported into the QR in two ways:
                     i. One table – Data from blocks 1, 2, 3 and 4 are assembled into a single table
                        whereupon multiple reviews for a book will result in multiple rows per book,
                        one row per review with extensive duplication.
                    ii. Multiple tables – Independent tables for category data (block 4), book data
                         (blocks 1 and 2) and review data (block 3) appropriately interlinked.
           c. Clearly the latter approach has advantages but a sphere administrator might prefer the
               former. By maintaining all scraped and cleansed data as block records, duplicate kTable
               storage and cleansing is avoided and the decision as to how to collate the data in the QR
               can be made in the final import stage. This also simplifies the distribution of pages to
               agents for caching and scraping – each page can be processed independently with the
               exception of cases (e.g. PDF files) in which records roll over from one page to the next.
           d. As later detailed, the initial spidering phase must therefore create an index of category
               {id, bread crumbs} and pages {id, category, order and relationship}. The scraper must
               subsequently create an index of blocks {id, type (table, record, category or attachment),
               page and relationship} and records {id, block} so that all block records belonging to the
               same entity/schema can be appropriately collated upon import.
            e. Note that cleansing and enriching are also best applied to block records to avoid
                unnecessary duplication. Note also that scraper collating of block records would require
                that each sub-tree of pages be processed by the same scraper in a specific order using
                multiple scraper filters to cater to the multiple page layouts.

        The kFramework maintains a cache with Page objects for each pipe, each page retaining
        navigations to previous and next pages. To support block linkage for subsequent record
        assembly, kFramework maintains a kRegistry that records the following:
a) Each page object must retain a parent record id, e.g. when a page with an index
   containing several books links to a page per specific book, each specific book page must
   link back to a specific record in the index page. The parent record id must consist of the
   page id, the parent block id containing the index and the link value itself so that we can
   subsequently identify the parent index record for each specific page. Block Ids are
   foreign to the spider, so the spider merely registers the LinkValueToPage so that the
   scraper can later identify the link content and register block and record ids of the parent
   record id.
            Class PageRecordId { ParentPageId, ParentBlockId, LinkValueToPage}
b) Each page object must also identify the scraper filter that must be used to scrape that
    page. A scraper filter is assigned to each page layout, so the page object need only
   identify the page layout that it belongs to, each page layout also having a list of layout
   blocks:
             Class RegisteredLayout { LayoutId, ScrapeFilter, List<LayoutBlock> }
             Class LayoutBlock { LayoutBlockId, BlockType }
             Enum BlockType { Record, Table, Attachment, Category }
             Class PageLayout { LayoutId, List<LayoutBlock> }
             Class RegisteredPage { PageId, PageRecordId, LayoutId,
             List<BlockId>, BreadCrumbs }
             Class RegisteredBlock { BlockId, PageId, LayoutBlockId }
c) To keep track of all pages and blocks, kLibrary must retain registries (dictionaries) of
    layouts, pages and blocks. Each page object retains its Sibling navigations and each page
    must also maintain a list of Blocks:
             Dictionary<LayoutId, RegisteredLayout> LayoutRegistry
             Dictionary<PageId, RegisteredPage> PageRegistry
             Dictionary<BlockId, RegisteredBlock> BlockRegistry
d) The spider must maintain the LayoutRegistry and PageRegistry whereupon the scraper
   maintains the BlockRegistry, independently scraping records per BlockId and storing
   them per BlockId in the kTable.
e) The assembly of records spanning multiple blocks can then be accomplished by the
    Importer as follows (a minimal sketch of these rules follows this list of flows):
            Category, Record and Attachment blocks only have a single record per block
            whereas a Table block may have several records. We assume any number of
            Table and Record blocks per page.
            If there are Record blocks, then all Record blocks are assembled as a single
            Record, all Category and Attachment blocks linked to it and any number of Table
            blocks are linked to it as well as Tables within that record.
            If there are no Record blocks, each Table block produces any number of records
            to which all Category and Attachment blocks are linked.
If the Page has a PageRecordId then all records produced are linked to that
                      record. The appropriate record is identified by finding a record in the designated
                      BlockID that has the appropriate LinkValueToMe.
                      Record fields are loaded into the QR as dictated by the sphere ontology. Thus
                      when loading a record containing a table, if the table contents map into that of
                      an ontological entity linked to other ontological entities in the record then each
                      entity will be loaded into a different table appropriately linked. If the table
                      contents map into entity properties with an appropriate cardinality then it will
                      be loaded into them.
2. Scraper filters – scraper filters are automatically generated and tested for sample pages by the
   Wizard, but they may need to be adjusted by the Wizard after they are applied to all of the
   pages. The auto-generation and adjustment is accomplished as follows:
       a. One or more blocks per page are marked by the worker as one of the following types:
           Table, Record, Category or Attachment.
       b. The default block type is determined by the block content assisted by the OntoPlug.
           Thus for example the presence of bread crumbs suggests a Category block. The
            presence of a table with multiple records suggests a Table block. The presence of
           certain types and context may suggest a Record or Attachment block.
       c. Key elements in a scraper filter include:
                   i. Tag sequences to identify the page blocks.
                  ii. Layout uFilters to identify page blocks including a potential Title uFilter to
                      identify the beginning of a nugget block.
                  iii. The above two mechanisms back up each other in case one fails (e.g.
                      inconsistent html templates) or a uFilter gets confused by dynamic content.
                 iv. A record/attachment block may have variable sets of fields per page – hence as
                      many fields as possible are provided uFilters to map them to appropriate block
                      dataset columns.
                  v. A table block requires mechanisms to detect its headers and records. If there is
                      an obvious html structure, the key table tags are used – else a set of block
                      uFilters to identify headers and the beginning or end of each record.
                 vi. In both of the above cases, the generation of uFilters begins with an attempt to
                      find a strong set of uFilter attributes per field until as many fields as possible are
                      readily captured by a uFilter. A strong uFilter is one that captures a reasonable
                      number of fields. A uFilter in a record block that captures half of the fields might
                      be useful for capturing labels whereupon it will also be included.
                vii. In a table block, the number of fields captured by each uFilter then serves as a
                      basis for identifying the number of records on the marked page.
               viii. In a table block, each uFilter is also assessed regarding its ability to serve as the
                      basis for breaking the table into records.
                 ix. All of the generated uFilters are treated as candidate uFilters of which the most
                      probable ones are armed for use in the scraper filter. Less probable ones are
                      retained so that the Wizard can show the dataset that would be produced if
they are armed, enabling the worker to choose the right combination by
                   example.
               x. The Wizard also uses the OntoPlug to assign semantic tags and cardinality per
                   uFilter to determine how captured fields will be mapped into block columns.
       d. Subsequent Wizard adjustments merely alter the set of armed uFilters and the semantic
           tags and cardinalities assigned to them.
3. Semantic tags – Whereas the scraper can readily extract fields and insert them into appropriate
   columns with context in page block tables, the decomposition of fields with multiple atoms can
   only be accomplished by the cleanser for columns that have been semantically tagged. The
   semantic tagging of columns is accomplished as follows:
       a. When dealing with site columns that have already been mapped to specific semantic
           tags, the semantic tagging can be accomplished automatically by the scraper upon
           concluding its work or by the cleanser prior to cleansing. The former is preferred since
           cleansing can be iterative and it’ll be easier to plan and schedule if we know in advance
           which columns can be cleansed.
       b. The OntoPlug attempts to determine a semantic tag ……………..
        c. When dealing with consistent columns yet to be mapped, the Wizard can query the
            OntoPlug for the most probable headers, enable the user to approve its automated
            selection and prompt the user to select a specific semantic tag from a short list when
            the confidence level associated with the most probable tag is insufficient.
        d. When the candidate tags are not sufficiently differentiated (confidence levels too close),
            allow the user to determine the right one.

        Site columns that have not yet been semantically tagged are analyzed by the OntoServer to
        attempt expansion of the sphere ontology to cover these columns and map them
        accordingly.

4. Pipe OntoPlug directives
       a. uFilter attributes
                  i. Atom Filters – Identifying the atom type and pattern that best captures the
                     atoms in a given field.
                 ii. Label/Prefix – Probability that a uFilter captures labels and/or prefixes based
                     upon the text in the fields captured by that uFilter. Labels will subsequently be
                     captured by a Text Equals attribute and a prefix by a Text StartsWith attribute.
       b. Semantic tags
5. Atom filter training – Saving fields with unknown atoms in decompose and elsewhere with their
   contexts and semantic tags…
6. Atom context training – Atoms will often appear in new contexts that have to be acquired by the
    ontology and newly harvested data will often lack the ontology needed to cleanse it. The scraper
    records the context of each field in kTable so that the OntoServer can subsequently learn
   everything possible for harvested data, including new patterns for existing atoms as well as new
   atoms. It is context like column, style, headers and ID that enable us to recognize fields that
   contain common atoms and provide hints as to what they might be.
7. Field decomposition–
    8. Table/record within table – the scraper uses context (style, id, etc) that is not available to the
       cleanser….
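As referenced in the record assembly flow above, here is a minimal sketch of the importer's per-page assembly rules (Java for illustration; the block and record types are simplified and the names are assumptions, not the product API):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the importer's per-page record assembly rules described
    // in the 'Record assembly' flow above.
    public class PageAssembler {

        enum BlockType { RECORD, TABLE, ATTACHMENT, CATEGORY }

        record Block(BlockType type, List<String> records) {}

        record AssembledRecord(List<String> fields, List<String> linkedBlocks) {}

        /** Assemble the records of one page from its scraped blocks. */
        public static List<AssembledRecord> assemble(List<Block> blocks) {
            List<Block> recordBlocks = blocks.stream().filter(b -> b.type() == BlockType.RECORD).toList();
            List<String> linked = blocks.stream()
                    .filter(b -> b.type() == BlockType.CATEGORY || b.type() == BlockType.ATTACHMENT)
                    .flatMap(b -> b.records().stream()).toList();

            List<AssembledRecord> out = new ArrayList<>();
            if (!recordBlocks.isEmpty()) {
                // All Record blocks on the page assemble into a single record; Category and
                // Attachment blocks (and Table blocks, omitted here) are linked to it.
                List<String> fields = recordBlocks.stream().flatMap(b -> b.records().stream()).toList();
                out.add(new AssembledRecord(fields, linked));
            } else {
                // No Record blocks: each Table block row becomes its own record with the
                // Category and Attachment blocks linked to it.
                blocks.stream().filter(b -> b.type() == BlockType.TABLE)
                      .flatMap(b -> b.records().stream())
                      .forEach(row -> out.add(new AssembledRecord(List.of(row), linked)));
            }
            return out;
        }
    }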


Application Data Flows

Key application data flows include the following:

    1.    Query auto-suggestion
    2.    Aggregated data view
    3.    Sourcing crowd-qualification
    4.    Best-practice sphere views
    5.    Best-practice content enrichment
    6.    Collective sphere ontology development
    7.

Ontology Data Flows

Key ontology data flows include the following:

    1.    Ontology Acquisition
    2.    Contexts
    3.    Entity relationships
    4.    Synonyms
    5.    Patterns
    6.    Atom Types
    7.    Rules
    8.    Atom Filter training
    9.    Unique Identifiers – production of UIK permutations
    10.   Schema expansion
    11.

Sphere Data Flows

Key sphere data flows include the following:

    1.    Publishing
    2.    Merging data across multiple pipes
    3.    Conflict resolution
    4.    Manual refinement
    5.    Rules
    6.    Crowd worker data refinement
    7.    Ontology refinement
    8.    Quality refinement
9. Performance refinement
   10.
   11.

Framework Data Flows

Key framework data flows include the following:

   1.   Data quality measurement
    2.   Key Performance Indicators (KPI)
   3.   Cloud utilization optimization
   4.   Auto-testing optimization
   5.   Pipe caching optimization
   6.
Appendix A: kTables Tags
                                            Owner: Moshe

Each pipe may produce several tables of data, each category of entities maintained in a separate table.
Each table comprises a matrix of fields, each field belonging to a specific column and row. Each field,
column, row, table and pipe may have several properties, maintained as key/value pairs by
kTables at an appropriate pipe, table, column, row and field level that reflects their scope. The kTables
keys are collectively referred to as kTable tags maintained in a kTableTags enumerator. The property
values are often objects defined in designated APIs.

Consider a site that sells books, CDs and videos with reviews. Then books, CDs, videos, prices and reviews
can be treated as independent entities maintained in separate tables with specific rows in one table
(e.g. prices and reviews) linked to specific rows in other tables (e.g. book, CD and video). Each field is
associated with a specific CategoryID to ensure that it is stored in the right table. Each field is also
associated with a specific PageID and BlockID so that we can trace back to its origin.

kTables is created by a scraper agent whereupon it is refined by Cleanser and additional agents before
finally being imported by an Importer agent into the QR. Prior to kTables creation, properties pertaining
to the data that are created by the Wizard App and Spider Agent are maintained in the kRegistry of
kFramework. This includes, for example pipe properties such as a SphereID and SphereOntology, as well
as the properties of the pages and page blocks that each table cell was extracted from. kTables
therefore need only maintain a PipeId, PageId and BlockId so that all properties associated with the
pipe, page and block can be obtained from the kRegistry. The scraper receives as input a list of pages
that it needs to process including the PipeId and all PageIds that it needs to access them. Given a PageId,
the scraper can get all BlockIds for that page. Similarly, when dealing with a cell containing a value with
a given AtomType, kTables only needs to maintain the AtomTypeId to obtain additional properties of the
AtomType from the OntoPlug.
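As a sketch only (the scope and tag names are abbreviated for illustration, not the actual kTableTags enumerator), the scoped key/value storage behind kTable tags might look as follows:

    import java.util.EnumMap;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of kTable tag storage: key/value properties kept per scope
    // (pipe, table, column, row, field), keyed by a kTableTags-style enumerator.
    public class KTableTags {

        enum Scope { PIPE, TABLE, COLUMN, ROW, FIELD }

        enum Tag { PIPE_ID, CATEGORY_ID, COLUMN_CONTEXT, SEMANTIC_TAG, FIELD_ATOM_TYPE }

        // scope -> (scoped element id -> (tag -> value))
        private final Map<Scope, Map<String, EnumMap<Tag, Object>>> tags = new EnumMap<>(Scope.class);

        public void set(Scope scope, String elementId, Tag tag, Object value) {
            tags.computeIfAbsent(scope, s -> new HashMap<>())
                .computeIfAbsent(elementId, id -> new EnumMap<>(Tag.class))
                .put(tag, value);
        }

        public Object get(Scope scope, String elementId, Tag tag) {
            return tags.getOrDefault(scope, Map.of())
                       .getOrDefault(elementId, new EnumMap<>(Tag.class))
                       .get(tag);
        }

        public static void main(String[] args) {
            KTableTags kTags = new KTableTags();
            kTags.set(Scope.COLUMN, "col-3", Tag.SEMANTIC_TAG, "Price");
            System.out.println(kTags.get(Scope.COLUMN, "col-3", Tag.SEMANTIC_TAG)); // Price
        }
    }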

The following table contains a list of properties per scope (pipe, table, column etc) that are available via
kTable. In many cases, the property is maintained internally – in other cases the property is prefixed
with a ‘*’ or ‘#’ to indicate that it can be obtained via the kRegistry or OntoPlug respectively, using
a kTable tag designated in the second column. A ‘W’ value indicates which components write the tag
value and an ‘R’ indicates which components read it for purposes described in the final column. Post
pipe processing typically includes OntoServer analysis or QA evaluation of the quality of pipe processing.
Highlighted tags are for Beta.
Property/Tag       Level       App/   Spid   Scra   Clea   Impo    Post   Description/Purpose
                   /Scope       Wiz    er    per    nser    rter   Pipe
PipeID             Pipe                       W              R            To access all related properties in kRegistry
*CustomerID        *PipeID       W                           R            Serving QR access control
*SphereID          *PipeID       W                           R            Each sphere has an independent QR
*OntologyName      *PipeID       W            R      R       R      R     OntologyName for OntoPlug.SetContext
*MinPedigree       *PipeID       W            R      R       R      R     Min Pedigree for OntoPlug.SetContext
*SiteID            *PipeID       W                           R            For auto-navigation to the original pages
BlockID            Field                     W                            Entity records spanning multiple blocks & pages must be merged
*PageID            *BlockID                  W              R             To find other blocks on the same page
*PageLayout        *PageID             W     R                            To apply correct scraper filter to each page by layout
AtomFilters        Pipe          W           R       R                    OntoPlug.RelevantFilters(List<fields>) per block per layout for O
*Prev/NextPage     *PageID             W     R                            To scrape records in the appropriate order
*ParentBlockID     *PageID             W     R              R             To find linked block on parent page
*ParentLinkValue   *PageID             W     R              R             To find linked block on parent page
*BreadCrumbs       *PageID       R     W                            R     To auto-identify a category block & measure spider coverage
*CategoryContent   *PageID                   W              R       R     To validate right choice of category block and categoryID
CategoryID         Table                     W              R       R     OntoPlug.CategoryID(BreadCrumbs, CategoryContent)
*AgentsVersions    *PipeID       W     W     W       W      W       R     List of agents that processed this kTables and their onto version
UniqueKeySets      Table                                    W       R     OntoPlug.UniqueKeySets(List<column.SemanticTags>) per Cate
*PageTitle         *PageID                   WR                     R     HTML page title to validate matching nugget title
IsNugget           Column                    W              R       R     To apply QR indexing to these columns
ColumnContext      Column                    W                      R     Scraped TableColumnOrder,Header,Prefix/Label,Style,AtomTyp
ColumnID           Column                    W              R             Autogenerated by Scraper based upon ColumnContext
UserColumnName     Column       (W)          W              R             Manually entered by user & registered in Scraper Filter (takes p
SemanticTag        Column        W           W       R      R             OntoPlug.ProbableSemanticTags(List<columnValues>,ColumnC
FirstSonColumn     Column                            W                    To find first descendent column produced by field decompositi
NextSonColumn      Column                            W                    Whereupon this leads to remaining descendent columns
ParentColumn       Column                            W                    To recursively find all ancestor columns
SkipColumn         Column        W            R             R
#CanDecompose      *SemanticTag                      W                    OntoPlug.CanDecompose(SemanticTag) Wizard can also config
FieldProperties    Field                     W       W      R             Link,ImageLink,ImageAlt, Empty,NewWord
FieldAtomType      Field                     W       W      R       R     To get OntoPlug atom type attributes, e.g. AtomValueMin/Max
FieldValue         Field                     W       W      R       R     Object containing Amount, UnitNameetc
Quality Metrics

The following table contains a list of quality metrics per pipe designed to monitor:
    a) Data quality improvements as the data flows through the pipe
    b) Pipe data quality and KPI (key performance indicator) improvements over time.
Over time here could mean from one test cycle to the next due to a new code version or an improved
ontology. Several of the metrics are maintained per Type*Layout, i.e. for each Block Type (Record,
Table, Category, Attachment) and each Page Layout in the pipe.
KPIs are readily derived from these metrics, e.g. Suspicious/Validated characters, percentage of Empty
Pages, percentage of Normalized/Atomic cells, etc.
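For illustration, the KPIs mentioned above can be derived from the raw metrics roughly as follows (the method names are assumptions; the metric names mirror the table below):

    // Hypothetical sketch of deriving the KPIs mentioned above from the raw per-pipe metrics.
    public class PipeKpis {

        public static double suspiciousToValidatedChars(long suspiciousChars, long validatedChars) {
            return validatedChars == 0 ? 0.0 : (double) suspiciousChars / validatedChars;
        }

        public static double emptyPagePercentage(long emptyPages, long layoutPages) {
            return layoutPages == 0 ? 0.0 : 100.0 * emptyPages / layoutPages;
        }

        public static double normalizedCellPercentage(long normalizedCells, long atomicCells) {
            return atomicCells == 0 ? 0.0 : 100.0 * normalizedCells / atomicCells;
        }

        public static void main(String[] args) {
            System.out.println(emptyPagePercentage(12, 600));        // 2.0
            System.out.println(normalizedCellPercentage(900, 1000)); // 90.0
        }
    }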

Property/Tag        Level/Scope    Written by          Description/Purpose
Bread crumb paths   Pipe           Spider              Number of queries – should correlate with Category IDs
Spider Pages        Pipe           Spider              Total navigations
Page Layouts        Pipe           Spider
Linked pages        Pipe           Spider              Navigations via links
Broken links        Pipe           Spider
Layout Pages        Layout         Scraper             Scraped pages per layout ID
Empty pages         Layout         Scraper
Category IDs        Layout         Scraper
Blocks              Type*Layout    Scraper             Category/Table/Record/Attachment blocks per layout ID
Texts               Type*Layout    Scraper             Texts per block type per layout
Fields              Type*Layout    Scraper             Extracted fields per block type/layout
Columns             Type*Layout    Scraper, Cleanser   Number of columns
Semantic Tags       Type*Layout    Scraper, Cleanser   How many of them tagged
Atomic Columns      Type*Layout    Scraper, Cleanser
Atomic Cells        Type*Layout    Cleanser
Normalized Cells    Type*Layout    Cleanser
Total Chars         Type*Layout    Cleanser
Unknown Atoms       Type*Layout    Cleanser            Candidate new atom names/patterns
Residue Chars       Type*Layout    Cleanser
Leave As Is Chars   Type*Layout    Cleanser
Validated Chars     Type*Layout    Post Pipe           Output characters validated via their history as matching spe…
Suspicious Chars    Type*Layout    Post Pipe           Non-validated output characters

Max Response        Pipe           Spider              Source response times
Total Response      Pipe           Spider              Total response times for all pages in spider process
Spider Time         Pipe           Spider              Total spider process time
Scrape Time         Pipe           Scraper
Cleanse Time        Pipe           Cleanser
Scrape Onto Time    Pipe           Scraper             Onto processing time only
Cleanse Onto Time   Pipe           Cleanser            Onto processing time only
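
To make the KPI derivation concrete, below is a minimal sketch in Java. It assumes a hypothetical
PipeMetrics holder for the counters above (not the actual kFramework metrics API) and computes a few
of the KPIs mentioned: the empty-page rate, the normalized/atomic cell rates and the
suspicious-to-validated character ratio.

    // Minimal KPI sketch over the quality metrics above.
    // PipeMetrics and its fields are hypothetical names, not the actual kFramework API.
    public class PipeMetrics {
        long spiderPages, emptyPages;                  // per-pipe / per-layout page counters
        long totalCells, atomicCells, normalizedCells; // totalCells is an assumed denominator
        long validatedChars, suspiciousChars;          // written by the post-pipe stage

        double emptyPageRate() {
            return spiderPages == 0 ? 0.0 : (double) emptyPages / spiderPages;
        }

        double atomicCellRate() {
            return totalCells == 0 ? 0.0 : (double) atomicCells / totalCells;
        }

        double normalizedCellRate() {
            return totalCells == 0 ? 0.0 : (double) normalizedCells / totalCells;
        }

        // Lower is better; infinity when nothing has been validated yet.
        double suspiciousToValidatedRatio() {
            return validatedChars == 0 ? Double.POSITIVE_INFINITY
                                       : (double) suspiciousChars / validatedChars;
        }
    }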
Appendix B: Ontology API
    Owner: Naama
Appendix C: Pipe Flow with Data Example
                                                               Owner: Oksana

            1. Website Mapping

               Result: Sample data

            Pipe 1                                                             Pipe 2

             Column1 Column 2               Price      Column 3                 Column1 Column 2           Price    Column 3
             ayn rand      fountainhead     10.00$     5                        ayn rand    fountainhead   10.00$   5




            2. Collect data (scraper)
                   Process
                   System columns added – what columns?
                   Data manipulation?

                     Result: Scraped data

Pipe 1                                                                          Pipe 2
Collected from 10,000 pages, 15,000 records.                                    Collected from 5,000 pages, 10,000 records.

 Column1 Category            Column 2         Price        Column 3              Column1 Column 2          Column 3       Price    Column 4
 ayn rand     Philosophy     fountainhead     10.00$        5 - excellent        ayn rand    Inspiration   fountainhead   10.00€   10
 Author1      Cat1           Book1            10.00$       3 - medium            Author1     Cat7          Book1          5.00€    8
 Author3      Cat3           Book3            7.00$        2 – poor              Author2     Cat3          Book2          5.00€    6
 Author3      Cat1           Book4            7.00$         4 – good



            3. Cleansing
               3.1. Decomposition
      Break down to atoms? Based on what rules? Should prices and measure units be decomposed?
                     Result

                     Pipe 1
                     Decomposed Column 3 to Column3_1, Column3_2

                        Column1 Category            Column 2           Price      Column 3_1        Column 3_2
                        ayn rand     Philosophy     fountainhead       10.00$      5                excellent
                        Author1      Cat1           Book1              10.00$     3                 medium
                        Author3      Cat3           Book3              7.00$      2                 poor
                        Author3      Cat1           Book4              7.00$       4                good
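
                  As a rough illustration of the decomposition above, the sketch below splits the combined
                  rating cell (e.g. '5 - excellent') into a numeric rating column and a textual legend column.
                  The class name and the simple 'number - text' pattern are assumptions for illustration, not
                  the actual cleanser decomposition rules driven by the OntoPlug.

                      import java.util.regex.Matcher;
                      import java.util.regex.Pattern;

                      // Illustrative only: splits "5 - excellent" into {"5", "excellent"} for
                      // Column 3_1 / Column 3_2. The real cleanser derives such rules from the ontology.
                      public class RatingDecomposer {
                          private static final Pattern RATING = Pattern.compile("\\s*(\\d+)\\s*[-–]\\s*(.*)");

                          public static String[] decompose(String cell) {
                              Matcher m = RATING.matcher(cell);
                              if (m.matches()) {
                                  return new String[] { m.group(1), m.group(2).trim() };
                              }
                              return new String[] { cell.trim(), "" };  // no legend part, e.g. a bare number
                          }

                          public static void main(String[] args) {
                              String[] parts = decompose("5 - excellent");
                              System.out.println(parts[0] + " | " + parts[1]);  // prints: 5 | excellent
                          }
                      }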
3.2. Typing

   What data types do we distinguish today? String, Number, Price, Phone, Address? Is the breakdown into
   units and scale part of typing or of decomposition?

   Result

   Pipe 1                                                 Pipe 2

                     Type         Units      Scale                        Type   Units           Scale
     Column1         String                                Column1        String
     Category        String                                Column 2       String
     Column 2        String                                Column 3       String
     Price           Price        $                        Price          Price  €
     Column 3_1      Number                  1-5           Column 4       Number                 1-10
     Column 3_2      String
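
   A naive sketch of what column typing could look like, assuming a type is inferred from the cell values
   alone (the actual typing relies on the OntoPlug atom filters and column context); the class and method
   names here are illustrative.

       import java.util.Arrays;
       import java.util.List;

       // Naive column-typing sketch: classify a column as Price, Number or String
       // from its cell values. Not the actual OntoPlug atom-filter mechanism.
       public class ColumnTyper {
           public static String inferType(List<String> cells) {
               boolean allPrices = true, allNumbers = true;
               for (String cell : cells) {
                   String v = cell.trim();
                   if (!v.matches("\\d+(\\.\\d+)?\\s*[$€]")) allPrices = false;
                   if (!v.matches("\\d+(\\.\\d+)?"))         allNumbers = false;
               }
               if (allPrices)  return "Price";   // the trailing symbol also gives the Units ($ or €)
               if (allNumbers) return "Number";  // the Scale (e.g. 1-5) would come from the value range
               return "String";
           }

           public static void main(String[] args) {
               System.out.println(inferType(Arrays.asList("10.00$", "7.00$")));     // Price
               System.out.println(inferType(Arrays.asList("5", "3", "2")));         // Number
               System.out.println(inferType(Arrays.asList("ayn rand", "Author1"))); // String
           }
       }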



3.3. Column mapping & unique identifiers (entities & relationships)
   Description TBD

   Result

   Pipe 1

      Column        Mapped to             Identifier
      Column1       Author name           Yes
      Category      Book category         Yes
      Column 2      Book name             No
      Price         Book price            No
      Column 3_1    Book rating           No
      Column 3_2    Book rating legend    No

   Pipe 2

      Column        Mapped to             Identifier
      Column1       Author name           Yes
      Column 2      Book category         Yes
      Column 3      Book name             No
      Price         Book price            No
      Column 4      Book rating           No
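
   To illustrate how the identifier flags above could be used, here is a minimal sketch (a hypothetical
   helper, not the actual Importer/OntoPlug UIK logic) that concatenates the values of the columns marked
   as identifiers into a unique identifier key (UIK) for matching records across pipes.

       import java.util.Arrays;
       import java.util.HashMap;
       import java.util.List;
       import java.util.Map;

       // Sketch: build a unique identifier key (UIK) from the columns flagged as
       // identifiers in the mapping above. Names are illustrative, not the real Importer API.
       public class UikBuilder {
           public static String buildUik(Map<String, String> record, List<String> identifierColumns) {
               StringBuilder key = new StringBuilder();
               for (String column : identifierColumns) {
                   String value = record.get(column);
                   key.append(value == null ? "" : value.trim().toLowerCase()).append('|');
               }
               return key.toString();
           }

           public static void main(String[] args) {
               // e.g. for Pipe 1 the identifier columns mapped above
               List<String> identifiers = Arrays.asList("Author name", "Book category");
               Map<String, String> record = new HashMap<>();
               record.put("Author name", "Ayn Rand");
               record.put("Book category", "Philosophy");
               System.out.println(buildUik(record, identifiers));  // ayn rand|philosophy|
           }
       }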

3.4. Normalization
   What normalization do we perform in the cleanser? As far as I understand, normalization can also be done
   at the sphere level. For example, does price normalization only come into question when we join data (one
   source in $ and the other in €), or do we already bring all data to one currency in the cleanse stage? The
   same question applies to rating normalization.


3.5. Complete data
   What completion do we perform in the cleanser? Is this not a sphere-related action as well, e.g.
   completing all zip codes, phone numbers, etc.? I think this is also something we cannot do fully
   automatically; user input will be required, along with guidelines for what data to complete and how.
4. Merge data
   4.1. Unify – simple merge
         Description TBD

         Result – Unified data
Source (S)    Author name      Book category       Book name       Book price      Book rating     Book rating legend
Pipe 1        Ayn rand         Philosophy          fountainhead    10.00$          5               excellent
Pipe2         ayn rand         Inspiration         fountainhead    15.00$          5

Pipe 1        Author1           Cat1               Book1           10.00$          3               medium

Pipe2         Author1           Cat7               Book1           7.00$           4
Pipe 1        Author3           Cat3               Book3           7.00$           2               poor

Pipe 1        Author3           Cat1               Book4           7.00$           4               good

Pipe2         Author2           Cat3               Book2           7.00$           3


    4.2. Merge data by unique identifier

         4.2.1. Keep duplicates (default)

         Enables the domain expert to decide how to resolve duplicates/conflicts – manually reconciling data
         from various sources.

    Result – merge data with duplicates
Author name     Book category           Book name          Book price           Book rating     Book rating legend
Ayn rand        Philosophy (Pipe 1)     fountainhead       10.00$ (Pipe 1)      5               excellent
                Inspiration (Pipe 2)                       15.00$ (Pipe2)



         At this point the domain expert can perform a decision round and decide how to treat the duplicates.
         Based on the example above, I decide to always take categories by pipe rating (Pipe 1 in my case),
         which in effect defines the category normalization here, and to keep prices from both sources.

        Result example
Author name   Book category   Book name      Book price – Pipe 1   Book price – Pipe 2   Book rating   Book rating legend
Ayn rand      Philosophy      fountainhead   10.00$                15.00$                5             excellent


         4.2.2. Resolve all duplicates by pipe rating
         Based on sphere preferences, in this case the merge should result in all duplicates being
         automatically resolved by pipe rating.

         Example – My sphere preferences = always resolve duplicates based on Pipe1
         Result – merge data duplicates resolved
Author name   Book category   Book name      Book price   Book rating   Book rating legend
Ayn rand      Philosophy      fountainhead   10.00$       5             excellent
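
A minimal sketch of the merge behaviour described in 4.2, under the assumption that records are grouped by
their UIK and that, when duplicates are resolved by pipe rating, the higher-rated pipe's value wins per
field (class and method names are illustrative; the real merge runs inside the Importer/QR).

    import java.util.Comparator;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of merge-by-UIK with duplicates resolved by pipe rating.
    // PipeRecord and pipeRating are illustrative, not the real sphere/Importer API.
    public class Merger {

        static class PipeRecord {
            String pipe;                                        // e.g. "Pipe 1"
            String uik;                                         // unique identifier key
            Map<String, String> values = new LinkedHashMap<>(); // column -> value
        }

        // Sphere preference: lower number = higher rating, so Pipe 1 wins over Pipe 2.
        static int pipeRating(String pipe) {
            return "Pipe 1".equals(pipe) ? 1 : 2;
        }

        static Map<String, Map<String, String>> mergeByRating(List<PipeRecord> records) {
            // Process higher-rated pipes first so their values take precedence;
            // lower-rated pipes only fill in fields that are still missing.
            records.sort(Comparator.comparingInt(r -> pipeRating(r.pipe)));
            Map<String, Map<String, String>> merged = new LinkedHashMap<>();
            for (PipeRecord r : records) {
                Map<String, String> row = merged.computeIfAbsent(r.uik, k -> new LinkedHashMap<>());
                for (Map.Entry<String, String> e : r.values.entrySet()) {
                    row.putIfAbsent(e.getKey(), e.getValue());  // keep the higher-rated value
                }
            }
            return merged;
        }
    }

The 'keep duplicates' default from 4.2.1 would instead retain both conflicting values side by side, tagged
with their source pipe, for a later manual decision round.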
Appendix D: Domain Expert Worker Flow
                                                Owner: Oksana


    1. High level domain expert flow




More can be found here https://sites.google.com/a/kinor.com/product/design/High-Level
Appendix E: Sphere Architecture and APIs
              Owner: Irina
Appendix F: Framework Architecture and APIs
               Owner: Aryeh
Appendix G: Spider Architecture and APIs
             Owner: Yossi
Appendix H: Scraper Architecture and APIs
                                          Owner: Hagay




Subsequent iterations:

   a) Multiple record blocks: Linked records between blocks
   b) Optimize performance: replace TextList with linked texts, no need for spliceRecords
Appendix I: Cleanser Architecture and APIs
              Owner: Ronen
Appendix J: kGrid Integration
                                            Owner: Jair


kGrid was designed for cross-enterprise data integration - hence kGrid agents are fundamentally
different from pipe agents in kFramework:

   1. kGrid agents run continuously, processing requests sent to their message queues whereas
      kFramework agents are dispatched to do a specific task.
   2. kGrid agents typically reside at specific Agency or Gateway locations until system dynamics
      warrant relocation, whereas kFramework agents are dispatched per task anywhere in the cloud.
   3. kGrid agents cache the ontology that they need internally via an Ontology service whereas
      kFramework agents rely upon an OntoPlug to do their ontology-related work.

Kinor envisions future cross-enterprise contexts, so the fundamental kGrid architecture must be
retained. This architecture enables kGrid Agencies and Gateways anywhere to dynamically discover each
other and work together without disrupting operations. A robust dynamic discovery mechanism has yet
to be implemented. Hence kGrid should initially be deployed internally using a static configuration
dictated by an appropriate kGridConfig object in the kFramework kRegistry. The following principles
should enable immediate kGrid deployment for online database integration purposes within weeks:

   1) The designated machines will run Java 1.4.
   2) At least one of the machines will deploy MySQL for initial kGrid persistence.
   3) The kGridConfig object should initially comprise the following (a configuration sketch follows this list):
      a) A list of Agency and Gateway machines for deployment of the kGrid Agent Manager
      b) A kGrid compatible XML file per Agent Manager to configure its agents and services
      c) A minimalistic agent and services configuration to get started
   4) The kGrid discovery service should be adjusted to access kRegistry for connecting with other
      kGrid discovery services rather than attempt dynamic discovery.
   5) kGrid uses Protégé as an ontology editor with a plugin that knows how to engage the kGrid
      ontology service. The OntoServer must incorporate this plugin and translate its OWL ontology
      format into the OKBC format currently serving the Ontology Server.
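
A rough sketch of what the static kGridConfig object from point 3 might contain, assuming plain Java
fields for illustration (the actual kRegistry object layout, class names and fields are not defined here
and may differ).

    import java.util.List;
    import java.util.Map;

    // Illustrative shape of a static kGridConfig entry in the kFramework kRegistry.
    // All class and field names here are assumptions, not the actual kGrid/kFramework types.
    public class KGridConfig {
        // a) Agency and Gateway machines that will run the kGrid Agent Manager
        List<String> agencyHosts;                 // e.g. "agency1.internal:8090"
        List<String> gatewayHosts;

        // b) kGrid-compatible XML configuration file per Agent Manager
        Map<String, String> agentManagerXmlPath;  // host -> path of its agents/services XML

        // c) minimal agent and service set to get started
        List<String> initialAgents;
        List<String> initialServices;

        // MySQL instance used for initial kGrid persistence (point 2)
        String persistenceJdbcUrl;                // e.g. "jdbc:mysql://dbhost/kgrid"
    }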

The following kGrid improvements can then be implemented over subsequent months:

   1)   Replace kGrid Query Builder with Sphere Query Builder
   2)   Upgrade code to be compatible with Java 7.
   3)   Use an OntoPlug to semi-automate Wrapper schema mapping.
   4)   Self-organizing kGrid deployment to match operational demand
   5)   It might be worth replacing the ODBC connectivity currently serving the kGrid wrappers with
        standard DBMS interfacing that is more readily configured programmatically by kFramework
        when new database sources are added to a sphere.

Contenu connexe

En vedette

вікі. крок 3
вікі. крок 3вікі. крок 3
вікі. крок 3Nikolay Zalub
 
Good Relations - Internet Briefing
Good Relations - Internet BriefingGood Relations - Internet Briefing
Good Relations - Internet Briefingbasis06 AG
 
Exploring Puerto Rico Open Data with Power BI
Exploring Puerto Rico Open Data with Power BIExploring Puerto Rico Open Data with Power BI
Exploring Puerto Rico Open Data with Power BIGuillermo Caicedo
 
Shanatova
ShanatovaShanatova
Shanatovairis
 
4 hadoop for-the-disillusioned
4 hadoop for-the-disillusioned4 hadoop for-the-disillusioned
4 hadoop for-the-disillusionedBigDataCamp
 
Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...
Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...
Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...Progetto Pervinca
 
阿里集团MySQL并行复制特性
阿里集团MySQL并行复制特性阿里集团MySQL并行复制特性
阿里集团MySQL并行复制特性Hui Liu
 
Powerpoint twitter
Powerpoint twitterPowerpoint twitter
Powerpoint twittergingerkid123
 
Participation in the digital age
Participation in the digital age Participation in the digital age
Participation in the digital age Joanna Saad-Sulonen
 
г.і. гайдук
г.і. гайдукг.і. гайдук
г.і. гайдукtumoshenko
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
Israel Orienteering 2008-9
Israel Orienteering 2008-9Israel Orienteering 2008-9
Israel Orienteering 2008-9Amri Wandel
 

En vedette (20)

вікі. крок 3
вікі. крок 3вікі. крок 3
вікі. крок 3
 
Good Relations - Internet Briefing
Good Relations - Internet BriefingGood Relations - Internet Briefing
Good Relations - Internet Briefing
 
Exploring Puerto Rico Open Data with Power BI
Exploring Puerto Rico Open Data with Power BIExploring Puerto Rico Open Data with Power BI
Exploring Puerto Rico Open Data with Power BI
 
Cotton moisture meter
Cotton moisture meterCotton moisture meter
Cotton moisture meter
 
Direct Navigation
Direct NavigationDirect Navigation
Direct Navigation
 
Dibujos MaríA
Dibujos MaríADibujos MaríA
Dibujos MaríA
 
Shanatova
ShanatovaShanatova
Shanatova
 
4 hadoop for-the-disillusioned
4 hadoop for-the-disillusioned4 hadoop for-the-disillusioned
4 hadoop for-the-disillusioned
 
Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...
Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...
Il clima è ciò che ti aspetti, il tempo è quello che trovi: dadi, carte e ...
 
阿里集团MySQL并行复制特性
阿里集团MySQL并行复制特性阿里集团MySQL并行复制特性
阿里集团MySQL并行复制特性
 
Statistics.hpp
Statistics.hppStatistics.hpp
Statistics.hpp
 
Powerpoint twitter
Powerpoint twitterPowerpoint twitter
Powerpoint twitter
 
Entrega feb
Entrega febEntrega feb
Entrega feb
 
Participation in the digital age
Participation in the digital age Participation in the digital age
Participation in the digital age
 
Guerrilla
GuerrillaGuerrilla
Guerrilla
 
г.і. гайдук
г.і. гайдукг.і. гайдук
г.і. гайдук
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Oisie
OisieOisie
Oisie
 
Calzado
CalzadoCalzado
Calzado
 
Israel Orienteering 2008-9
Israel Orienteering 2008-9Israel Orienteering 2008-9
Israel Orienteering 2008-9
 

Similaire à Product data processing 30.08.2011 gg

A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
A General Purpose Extensible Scanning Query Architecture for Ad Hoc AnalyticsA General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
A General Purpose Extensible Scanning Query Architecture for Ad Hoc AnalyticsFlurry, Inc.
 
Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Amit Sheth
 
Kos presentation1
 Kos presentation1 Kos presentation1
Kos presentation1annemaeannex
 
A federated search engine providing access to validated and organized "web" i...
A federated search engine providing access to validated and organized "web" i...A federated search engine providing access to validated and organized "web" i...
A federated search engine providing access to validated and organized "web" i...IAALD Community
 
COLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLES
COLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLESCOLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLES
COLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLESijcsit
 
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...Anita de Waard
 
Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...DESTIN-Informatique.com
 
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...IJwest
 
OSFair2017 Workshop | EGI applications database
OSFair2017 Workshop | EGI applications databaseOSFair2017 Workshop | EGI applications database
OSFair2017 Workshop | EGI applications databaseOpen Science Fair
 
Library automation and use of open source software odade
Library automation and use of open source software odadeLibrary automation and use of open source software odade
Library automation and use of open source software odadeChris Okiki
 
LibraryLibrary Automation and Use of Open Source Software automation and use ...
LibraryLibrary Automation and Use of Open Source Software automation and use ...LibraryLibrary Automation and Use of Open Source Software automation and use ...
LibraryLibrary Automation and Use of Open Source Software automation and use ...Chris Okiki
 
Fortuna 2012 metadata_management_web_of_things
Fortuna 2012 metadata_management_web_of_thingsFortuna 2012 metadata_management_web_of_things
Fortuna 2012 metadata_management_web_of_thingscarolninap
 
Novel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information ExtractionNovel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information Extractionijsrd.com
 
Think about the most important features of a modern infrastructure. .docx
Think about the most important features of a modern infrastructure. .docxThink about the most important features of a modern infrastructure. .docx
Think about the most important features of a modern infrastructure. .docxirened6
 
Online Library Management
Online Library ManagementOnline Library Management
Online Library ManagementVarsha Sarkar
 
JS_proposal.doc
JS_proposal.docJS_proposal.doc
JS_proposal.docbutest
 
Library Automation A - Z Guide: A Hands on Module
Library Automation A - Z Guide: A Hands on ModuleLibrary Automation A - Z Guide: A Hands on Module
Library Automation A - Z Guide: A Hands on ModuleAshok Kumar Satapathy
 

Similaire à Product data processing 30.08.2011 gg (20)

A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
A General Purpose Extensible Scanning Query Architecture for Ad Hoc AnalyticsA General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
 
Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...
 
Kos presentation1
 Kos presentation1 Kos presentation1
Kos presentation1
 
Splunk Components
Splunk ComponentsSplunk Components
Splunk Components
 
A federated search engine providing access to validated and organized "web" i...
A federated search engine providing access to validated and organized "web" i...A federated search engine providing access to validated and organized "web" i...
A federated search engine providing access to validated and organized "web" i...
 
COLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLES
COLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLESCOLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLES
COLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLES
 
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
 
Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...
 
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
 
OSFair2017 Workshop | EGI applications database
OSFair2017 Workshop | EGI applications databaseOSFair2017 Workshop | EGI applications database
OSFair2017 Workshop | EGI applications database
 
Library automation and use of open source software odade
Library automation and use of open source software odadeLibrary automation and use of open source software odade
Library automation and use of open source software odade
 
LibraryLibrary Automation and Use of Open Source Software automation and use ...
LibraryLibrary Automation and Use of Open Source Software automation and use ...LibraryLibrary Automation and Use of Open Source Software automation and use ...
LibraryLibrary Automation and Use of Open Source Software automation and use ...
 
Fortuna 2012 metadata_management_web_of_things
Fortuna 2012 metadata_management_web_of_thingsFortuna 2012 metadata_management_web_of_things
Fortuna 2012 metadata_management_web_of_things
 
[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
 
Novel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information ExtractionNovel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information Extraction
 
Think about the most important features of a modern infrastructure. .docx
Think about the most important features of a modern infrastructure. .docxThink about the most important features of a modern infrastructure. .docx
Think about the most important features of a modern infrastructure. .docx
 
Online Library Management
Online Library ManagementOnline Library Management
Online Library Management
 
JS_proposal.doc
JS_proposal.docJS_proposal.doc
JS_proposal.doc
 
Library Automation A - Z Guide: A Hands on Module
Library Automation A - Z Guide: A Hands on ModuleLibrary Automation A - Z Guide: A Hands on Module
Library Automation A - Z Guide: A Hands on Module
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 

Dernier

Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinojohnmickonozaleda
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxCulture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxPoojaSen20
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 

Dernier (20)

Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipino
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxCulture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 

Product data processing 30.08.2011 gg

  • 1. Date: 30.08.2011 Product Overview Owner: Jair Introduction The primaryobjective of this document is to consolidate context and interfacing for team membersengaged in product development. We begin with an overview of key product systems, their components,systemdata flows and key components roles in each flow. A set of appendices then dive into specific systems and interfaces in more detail – each appendix owned by a specific team member. The terms defined in this document should be the terms used in all other related documents. Questions and clarifications should be directed to specific section owners so that this document can continue to improve and expand to achieve the document objectives. Kinor Spheres and Apps The Kinor mission is to provide powerful tools for ordinary ‘workers’(knowledge workers) to collaboratively harvest information (data and content) from any source (web and otherwise) in a manner that can best serve the needs of each worker in a fully automated, private and personalized manner. From a business perspective, the product conceptually comprises of the following: 1. Spheres - The information harvested from one or more sources is maintained in ‘spheres’, each sphere covering a specific domain of common interest to a group of sphere workers. Each worker group typically serves a specific business, organization or community. Typicalharvested sources include web-based catalogues, professional publications, news feeds, social networks, databases,Excel worksheets and PDF documents - public and private. 2. Pipes – Each source is conceptually connected to a sphere via a pipe that pumps harvested information from the source to a specific sphere on an ongoing basis. Spheres can be fed by any number of pipes, the pipes primed (configured) by a non-technical sphere administrator or collectively (crowd-sourced) by the workers themselves. The output of each pipe is a semantic datasetthat is published to the sphere and maintained there for automated refresh via the pipe. The dataset is semantic in the sense that data within itis semantically tagged in a manner that enables subsequent data enrichment, integration and processing to be fully automated. 3. Apps – A growing spectrum of sphere applications will empower each worker to automatically view and leveragepublished informationwithin the sphere in a fully personalized way. The initial apps will be horizontal, i.e. readily applied to any sphere - each worker configuring the app to match personalized needs. Sample horizontal apps will: a. Enable workers (interactively or via API) to easily find the information that they need within a sphere and deliver it in the most useful form. b. Automatically mine sphere information for worker configured events, sentiments and inferences.
  • 2. c. Automatically hyperlink and annotate sphere information with additional information within a sphere as prescribed by sphere administrator or worker-defined rules. Horizontal apps will pave the way to an even greater number of pre-configured vertical appsdrawing information from specific spheres, i.e. a specific ontology with appropriate pipes. Once the sphere core has been configured, vertical apps provide instant value out of the box available to all customers. Horizontal apps on the other hand enable workers to independently or collaboratively develop their own spheres with unique value available only to them. Harvested information can either be cached (replicated) within the sphere or acquired on demand from the source. When dealing with unstructured and semi-structured sources of information, e.g. the Web, the harvested information will typically be replicated within the sphere unless the volume of information is prohibitive. When dealing with fully structured sources, e.g. a database, the harvested information can be acquired on demand if this will not disrupt operations for higher priority source access. Needless to say, the response time for on demand acquisition will highly depend upon the volume of information, the source availability and responsiveness and the semantic complexity of that information, i.e. the computational resources needed to semantically tag and process that information. SemanticIntelligence Kinor empowerment of non-technicalworkers is achieved by cultivating and applying semantic intelligence to automate every possible aspect of harvesting, processing and applying information. The semantic intelligence is maintained in aknowledge base (KB) linked to a growing web of ontologiesmapped to a common super-ontology. Each sphere addresses a specific domain of interest, e.g. online book stores. During the information acquisition stage, the KB associated with the sphere ontology semantically tagsidentified entities and their properties within that information. The sphere ontology must therefore contain a set of inter-related schemas (frames) that describe allentities in the sphere, e.g. books, authors, publishers, suppliers, prices and reviews. Each entity schema must also contain anticipated properties (frame slots), e.g. book properties might include title, author, publisher, ISBN and year of publication. Kinor internally refers to each entity property as an atom,each atomassigned a predefinedsemantic tag,the atom value a predefined atom type. Thus for examplea ‘year of publication’ must be a valid year and ‘book publisher’ must beavalid publisher name. Atom types are typically recognized by a set of text patterns (e.g. a four digit year) ora namethesaurus (e.g. formal publisher names and known synonyms for these names). Atom filtersauto-recognize specific atom types whileconsidering both the atom value as well as theatom context in which the value was found, e.g. a column name (e.g. ‘home phone’) or a prefix (e.g. ‘(H)’). Armed with adequate semantic intelligence, blocks of information piped into a sphere are automatically parsed by atom filters into records (e.g. book records) of semantically tagged properties to be associated with a specific entity (e.g. a specific book instance). All sphere schemas are mapped to the super ontology with its shared bank of atom types and their respective atom filters and contexts so that semantic intelligence can be cultivated collectively (e.g. new filter patterns and thesaurus entries) by all spheres. 
Atom thesauri in the super ontology also retain
  • 3. frequently adjoined entity properties (e.g. a given publisher city, state and country) to facilitate the auto-acquisition of new entity names and synonyms. The super ontology can thus readily expand automatically with relatively little supervision. Entity properties from one pipe can be subsequentlymerged(joined) with entity properties from other pipes when all properties have been attributed to the same entity. Matching entities across pipes can be challenging since each pipe record must have a unique identifier key (UIK)based upon properties available in each record. A set of propertiescan uniquely identify an entity with a degree of confidence that can be computed empirically. The ISBN property alone (when available) can serve as an idealUIK for book entitieswhereas a UIKbased upon the book title and author properties is somewhat less reliable. Entity records from multiple pipes can only be merged if they have UIKs with adequate confidence levels to be determined by a sphere administrator or worker. Semantic intelligence can only operate on sphere information mapped to the sphere ontology. Schemas from public and private ontologies are acquired and retained in an ontology bank mapped to the super ontology. Unmapped sphere data can then be schema matched with schema in the ontology bank to semi-automatically add or expand sphere schemas to map them. When dealing with well-structured catalogue sources, ontologies can be auto-generated from the catalogue structure and data itself. Key Product Systems Key product systems include the following: 1. Pipes – The pipes system schedules all tasks related to the harvesting of information from Web and additional sources and subsequent data processing. Each task is executed by one or more agents distributed in a cloud. The most common pipe tasks include spidering (collecting designated pages of interest from a specific web site), scraping (extracting designated blocks of information from those pages), cleansing (decomposing those blocks into semantically tagged atoms and normalizing the atom values where possible) and finally importing the semantic dataset into a sphere repository. The pipes system also includes a Wizard for priming the pipes. 2. Spheres – Each sphere retains fresh semantic datasets for each pipe in a query-ready repository (QR) capable of serving a growing number of horizontal and vertical apps. The QR must respond to app queries on demand while also enabling a growing spectrum of ongoing app tasks to process and augment the QR in the background. Each QR atom maintains the history of that atom starting with the pipe history produced by the pipe. The origins of each atom and value are thus readily traced back to the source and the data processing tasks. 3. Ontology – The ontology system comprises of a centralized ontology server (OntoServer) working in unison with any number of ontology plugs (OntoPlug) to apply semantic intelligence to every possible aspect of harvesting, processing and applying sphere information. The OntoServercultivates and maintains the KB for all spheres while the OntoPlug caches a minimal subset of the KB to serve specific agent tasks. 4. Applications - A web-based user interface provides integrated user access to all authorized applications (apps) including administrator apps for managing the above systems.
  • 4. 5. Framework –A common framework for the above enables all of the above systems to run securely and efficiently atop any private or public cloud. 6. E-Meeting –An interactive conferencing facility fully integrated with the product that enables existing and potential customers to instantly connect with designated support and sales representatives for instant pilots, training, assistance and trouble-shooting. Key Pipe Components Within the Pipes system, key components include the following: 1. Wizard – The Pipe configuration wizard enables a non-technical user to prime a pipe within minutes, i.e. to direct the pipe in how it should navigate within a Web site to harvest all pages of interest and subsequently extract from those pages all required blocks of information. Very few user clicks are needed to determine: a. Which deep web forms (searches and queries) to post (submit) with which input values. b. Which hyperlinks and buttons to subsequently navigate (follow) to harvest result pages of interest. Note that some result pages may lead to additional pages with different layouts - hence each page must also be associated with a specific layout id.
  • 5. c. Which blocks of information per page are to be extracted and semantically tagged for each layout id.A scraper filter is subsequently generated per layout id – hence blocks of interest must only be marked for one sample page per layout. Additional sample pages are randomly chosen to test the scraper filter for worker feedback using several pages. d. When it should revisit the site to refresh that pipe dataset. Throughout this process the wizard will provide feedback regarding the pipe dataset that will be produced using the current pipe configuration as well as the anticipated price tag for acquiring and refreshing the dataset. The pipe dataset produced for Wizard feedback will use a relatively small sample of pages for user feedback within seconds and the dataset will not be published to the sphere. The user can subsequently refine the pipe configuration to better suit user needs. Once primed via the wizard, the pipe can operate autonomously as depicted in the above diagram by the ‘Map a website’ followed by ‘Run’ that results in ‘Notification’ when the pipe completes its operation (‘End’). 2. Spider – Any number of spider agents can then interact in parallel with the source web site to harvest all pages of interest. The harvesting is accomplished in two stages: a. Site spidering – A multi-threaded collection of URLs and subsequent postings and navigations are produced with a unique id, an order tag (to collate scraper results in the proper order) a parent tag (the id of the page that pointed to it) and a page layout id tag. New pages are readily flagged by comparing the new collection with the previous one. The harvesting of pages can subsequently be parallelized in an appropriate order by allocating subsets of the collection to several agents. b. Page harvesting - Either all pages or only newly flagged pages are cached in the pipe repository by any number of spider agents with order tags so that the pages can subsequently be processed in an appropriate order. Each harvested page is recorded in a page index with sourcing tags that include the site name, an order tag, bread crumb tags (i.e. posted form inputs and navigations) that led to this page, a layout id tag that identifies the scraper filter for extracted blocks from that page, the harvesting date and time and the site response time for that page. 3. Scraper – The layout id tag is used to apply the appropriate scraper filter to extract the designated information blocks per page and transform them into dataset records. Any number of scraper agents can do this in parallel, each agent producing a dataset of scraped records per page. The page datasets are then merged into a pipe dataset in an appropriate page order. Key scraper filter components include the following: a. Sequence filter –Matching tag sequences are used to mark designated page blocks. The sequences are robust by being sparse (only key tags are included) and depth sensitive (reflecting how deep they are in the element tree). b. Block table filters – Blocks with conventional table structures use these structures to parse records with context tags that include column numbers and headers where available. Filters are robust in that they treat nested tables for a variety of tables while handling missing tags. Tables can either be vertical or horizontal.
  • 6. c. Record markers – New dataset records are identified by table structure or by distinct fields within the record. Thus, if the third field in the record is always a telephone number, the beginning or end of the record is readily found. Record markers take into consideration that some fields might be broken into multiple parts for multiple styles, hyperlinks, multiple lines and other special effects within a single field. d. Context–Context is crucial to automated semantic tagging, hence all relevant column headers and value prefixes (e.g. QTY in ‘QTY:50’) are extracted and attached to relevant field values (e.g. ‘50’). Frequent contexts are retained in the sphere ontology so that probable context can be auto-identified by structure (e.g. table position), style (e.g. color/emphasis) and content (e.g. frequent context values). e. uFilters(micro-filters) – Block filters may contain any number of uFilters to mark records and context as the block information is being parsed. The set of uFilters are applied to every field in the block, each uFilter checking for a specific combination of field attributes that include the following: i. Field content – A specific text (Equals, StartsWith, EndsWith) or equivalence with the page title as identified by TITLE tags. ii. Field atom type – Contains a designated text pattern or name found in a designated thesaurus. iii. Field style – Has a designated set of style attributes, e.g. font color/size, emphasis, header level (e.g. H2). iv. Field ID – Was preceded by a designated “id=” tag. v. Field Column – Appears in a specific table column. All uFilters are applied to their designated blocks of information before the record parsing begins. uFilters can also be configured to mark nearby fields: vi. Field displacement – the marked field is a given number of fields before/after. vii. Field expand –fields before/after also marked until specific tags are matched. f. Active atom filters and patterns – uFilters are applied to all block fields, hence the need to minimize the computational requirements of each uFilter. Atom filters can be heavy, hence only those filters that appear in the uFilters are active. Moreover, when dealing with atom filters with several patterns, only patterns that actually captured atoms during the pipe priming will be active. An active atoms tag in each scraper filter must contain the active atoms and patterns so that only these will be armed in the OntoPlug serving the scraper. g. Variable records – Some pages may contain more record fields than others, whereupon placing the right fields into the right columns can be challenging. Context tags are helpful only if they are available. Hence the need for additional field context as defined by the uFilters that captured each field. Field context tags include TITLE, atom type, field style, field ID and field column. h. Layout uFilters – Inconsistencies in the page templates (the actual template may depend upon the number of results) may cause the sequence filter to break. Optional layout uFilters mark block begin/end as a backup.
  • 7. i. Block relationships – Each record block is ultimately parsed into a sequence of semantically tagged atoms that belong to a specific record. This specific record might be one of several records in a table block on a parent page. A table block contains several records (e.g. pricing options) that might all belong to a single record. Hence block relationships between pages and on the same page are crucial. The following assumptions are made: i. All record blocks on the same page belong to the same record. ii. All blocks on the same page belong to a specific record on the parent page that pointed to that page. iii. A table block belongs to the same record (as a table within a record) as the record blocks on the same page. iv. An attachment (history) block belongs to all records in the table block on the same page. j. Record categories–Records extracted from a single site might all belong to one or more categories, e.g. in an online catalogue in which different product categories will have different sets of columns. A page dataset must therefore be associated with a specific category. The category might appear on the page as a category block or it might be auto identified by the bread crumbs that led to that page. k. Category taxonomies – When comparing records from multiple pipes it might be necessary to compare only those that belong to the same category. In such cases there is a need to develop a standard category taxonomy for the sphere and map the local pipe categories to that standard taxonomy. This is referred to as automated category harmonization and it is carried out by the Ontology system. Upon selecting a page block and indicating the nature of the block contents (table, record, category, or attachment), the generation of candidate filters is fully automated whereupon the selection of a specific candidate filter is finalized by the wizard ‘by example’. This means that the candidates are ordered by probability and a ‘guess’ button enables the Wizard to try each candidate filter until the user is satisfied with the results in the Collected Data pane. 4. Tables (kTable) – All datasets produced by the pipe are maintained in a kTable component that is first produced by a Scraper agent and then passed on to additional pipe agents. a. Columns, atoms, column relationships b. Tables within tables and block relationships. 5. Cleanser – Tables are further refined by cleanser agents that may operate on specific sets of records and columns. Any given kTable may undergo several cleanser iterations as the semantic intelligence expands to cover more table columns and specific atoms. Hence subsequent cleanser iterations will typically focus on specific columns. Specific columns might also be tagged by the Wizard to remain ‘as is’. Similarly, the OntoPlug will indicate to the cleanser which columns can’t be cleansed due to a missing semantic tag or a semantic tag with inadequate intelligence in the KB. Key cleanser operations include: a. Column decomposition – Identifying atoms within a field and creating new columns for each of them. Decomposition can be recursive when atoms can further be decomposed into more refined atoms.
  • 8. b. Atom normalization – Atoms that can be normalized are normalized as directed by atom normalization tags associated with each column. All transformations are documented in the kTable history so that the origin of each atom and value is readily traced. The refinement state of each kTable (e.g. percentage of cleansed columns and decomposed and normalized cells) is updated so that the quality impact of each agent can also be instantly assessed. 6. Enrich – The sphere administrator is tooled with a horizontal app that analyzes QR data to identify potential rules that govern entity properties. Thus, for example, a memory chip with a given capacity, error checking capability and packaging will always have a given number of pins. Similarly, a person under age will not be married or have a driver license. The administrator may add rules that reflect domain knowledge and the app is semi-supervised to ensure that incidental patterns are not adopted as rules. The enrich agent can then use these rules to fill in missing date and flag suspicious data. Rules may also have confidence levels that are reinforced or diminished by additional data and administrator confidence. The kTable history must document all rules that affected a transformation and a follow-up tag is added to all suspect atoms to ensure that they are subsequently given the right attention. All tags record also when they were created and by whom to evaluate the quality of treatment. The enrich agent may also insert additional entity properties found in the super ontology for identified entities. 7. Import – This agent imports a kTable into the QR when it is ready. The importer must also concatenate all record snippets from multiple blocks and pages into complete records. This is also where the UIK is determined via the OntoPlug. As implied by the above, the pipe value proposition already includes the following: 1. Semantic tagging that can also serve the auto-generation of RDFA content. 2. Field decomposition 3. Atom normalization 4. Enriched data, e.g. missing fields using rules and additional entity properties from the super ontology. 5. Flagged suspicious data 6. The best available UIK for these records with a confidence level. To emphasize this value proposition and manage customer expectations, the Wizard will reflect the pipe output in a Collected Data pane. This means that a small sample of pages will be run through the pipe to produce a kTable that will be presented by the Wizard in that pane for additional user feedback. User feedback may be global (trying a different candidate filter) or local (selecting a specific semantic tag for a column, changing the UIK or configuring a column to be left ‘as is’). By default, a column that is not semantically tagged with be parsed syntactically by the cleanser without semantics. Key Sphere Components The harvested pipe datasets are published and maintained in a sphere for worker access via a growing spectrum of horizontal and vertical apps. Horizontal apps empower each worker to automatically view
and leverage the published datasets in a fully personalized way. Most apps have both ongoing and interactive processes:
Ongoing app processes analyze and augment sphere data to make it more useful. Thus, for example, a financial app might use NLP and additional techniques to identify relevant news and sentiments on a variety of sites. An e-commerce app might seek reviews that compare competing products so that consumers seeking one product can be offered cheaper deals for similar products.
Interactive app processes deliver personalized views to users and respond to consumer activities. Thus, for example, the financial app will deliver the news that each user subscribed to and the e-commerce site will offer deals that match products being sought via a partnering search engine. Each user might also want to search the sphere datasets for specific information.
The sphere administrator will also want to optimize the quality and availability of the sphere information. To this end there will be a need for additional administrative processes:
Quality optimization processes seek methods to continuously improve the quality of the information, e.g. by identifying rules that will flag erroneous data and seeking corroborating sources to increase the confidence levels of app results.
Performance optimization processes seek methods to improve app performance, e.g. by maintaining frequently queried aggregated datasets that take precedence over the underlying datasets from which they were aggregated, improving app response times.
To support the above, the sphere system will initially comprise the following key components:
    1. QR – Vertical database containing all data collected from all pipes.
    2. QBE – Enabling users to find the sphere information that they need without prior knowledge of the sphere categories, schemas and value formats.
    3. Indexer – Building an index of unstructured nuggets so that they can be semantically searched.
    4. Rules Builder – Tools for automatically deriving rules and their confidence levels from the QR and manually editing and building them for auto-enrich and the flagging of suspicious data.
    5. kGrid – Delivering ad hoc data integration on-the-fly from disparate online database sources. This will initially serve only cloud-hosted databases until the patent-pending bio-immune security is implemented, whereupon it can also tap into remote private databases.
Conceptually, every entity in the sphere ontology has an independent QR table. When dealing with catalogues, every product entity may have different properties, resulting in a very sparse database. Upon querying for a specific product with given properties, the QBE will identify all properties that sufficiently match the property names and constraints to include all relevant products. Properties already covered by the ontology will match all known property names. Properties not covered will seek similar names. Routine OntoServer processes will attempt to identify equivalent properties across sites to add these properties to the ontology.
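To make the property-matching step concrete, the following is a minimal sketch of how a QBE query property might be matched against the sparse QR columns. The class name QbePropertyMatcher, the synonym dictionary and the crude name-similarity fallback are assumptions for illustration only, not the actual QBE implementation.

    // Hypothetical sketch only: matching a queried property name against QR property columns.
    using System;
    using System.Collections.Generic;

    public class QbePropertyMatcher
    {
        private readonly Dictionary<string, HashSet<string>> _synonyms; // ontology property -> known names

        public QbePropertyMatcher(Dictionary<string, HashSet<string>> synonyms) { _synonyms = synonyms; }

        public IEnumerable<string> MatchProperties(string queriedName, IEnumerable<string> qrColumns)
        {
            foreach (var column in qrColumns)
            {
                // Properties already covered by the ontology match via all known property names.
                if (_synonyms.TryGetValue(column, out var names) && names.Contains(queriedName))
                    yield return column;
                // Properties not yet covered fall back to seeking similar names.
                else if (Similarity(column, queriedName) >= 0.8)
                    yield return column;
            }
        }

        // Crude placeholder similarity: shared-prefix ratio of the lower-cased names.
        private static double Similarity(string a, string b)
        {
            a = a.ToLowerInvariant(); b = b.ToLowerInvariant();
            int i = 0;
            while (i < Math.Min(a.Length, b.Length) && a[i] == b[i]) i++;
            return Math.Max(a.Length, b.Length) == 0 ? 1.0 : (double)i / Math.Max(a.Length, b.Length);
        }
    }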
kGrid maps online databases to ontological entities so that their data can also be queried on demand and combined with data already within the sphere.
Spheres will typically fall into one of the following classes:
    1. Simple – These are spheres that can readily be built and applied by non-technical users without assistance.
    2. Complex – In these spheres, power users use advanced features to get the job done with appropriate Kinor guidance.
    3. Custom – These spheres require Kinor customization and/or additional features to get the job done.
Progressive sphere improvements will ensure that more and more complex spheres will become simple and fewer spheres will require any customization.

Key Ontology Components

The ontology system comprises the following key components:
    1. OntoServer – An integrated environment for managing, maintaining and importing/exporting ontologies and super ontologies. Must also provide an API for OntoPlug access to the KB.
    2. OntoPlug – An OntoServer proxy capable of rapidly loading select portions of the KB from the OntoServer and providing all ontology-based services to all designated clients in the Pipe, Spheres and Apps. Current (immediate) OntoPlug clients include the following:
        a. Wizard – Determining which atom filters and patterns should be armed for uFilters. Identifying the most probable semantic tags per column. Also auto-suggesting blocks of information that match the sphere ontology.
        b. Spider – Using atom types and learned properties to choose the best form inputs.
        c. Scraper – Identifying atoms as needed by uFilters to scrape pages.
        d. Cleanser – Decomposing and normalizing cell values.
        e. Importer – Proposing the best UIKs per dataset.
        f. QR – Choosing the best UIKs across datasets and using thesauri to semantically expand queries to cover all synonyms and possibly even instances.
        g. Content enrich app – Finding known entities in an unstructured text (tokenizing) and enriching the text with the entity properties.
    3. Ontology Editor – An ability to view existing ontologies and manually model new ontologies as needed.
    4. Ontology Builder – Auto extension of a sphere ontology using the ontology bank or via the adoption of pipe schemas as an ontology (e.g. in an online catalogue) and auto-mapping other pipe schemas to that ontology (column harmonization).
    5. Ontology Trainer – Auto acquisition of new thesaurus entries and their synonyms from undefined atoms collected by the pipes. New patterns must also be acquired in a semi-supervised process.
    6. Category Harmonization – Analyzing pipe categories to produce a category taxonomy or enumerator (id) that all pipe categories can readily be mapped to.

Key Framework Components

The framework system seamlessly deploys the other systems on any private or public cloud. The framework is designed to optimize cloud utilization, measure performance, automate testing and readily support a growing number of apps. Key framework components already include:
    1. Repository – Persistent storage for all configuration and operational data serving the pipes including cached pages, scraper filters, kTables and recorded data.
    2. Scheduler – Scheduling pipe agents to meet quality of service (QoS) and refresh requirements while also catering to high-priority agent tasks initiated by the Wizard on demand.
    3. Planner – Planning a schedule for
    4. Back office apps -
    5. Recorder – Recording all activities for testing, performance optimization and exception handling.
    6. Auto-tester – Using archived repository content to do regression testing and ascertain that the quality and performance only improve with new system versions.
    7. Health monitor – Analyzing recorded data to measure KPIs (key performance indicators) that reflect system health and ascertain that it is acceptable and only improving.
    8. Agent emulator – To support agent debugging outside the framework.
    9. Portable GUI – An integrated web-based framework for all user interactions with these systems including all apps. It is portable in the sense that it will ultimately support several deployment modes (e.g. with and without code downloads) without modifying the apps.
    10. Bio-immune security – Elements of kGrid and its patent-pending bio-immune security will be integrated into the framework to support secure import and export of enterprise data for arbitrary cloud-based applications.
Given the generic nature of this framework, it might ultimately be open sourced to enable others to develop new pipe agents and apps. Upon implementing the bio-immune security, the framework might be sold as an independent product for secure cloud computing.

eMeeting System

An interactive web-based conferencing facility must be fully integrated with the product to enable existing and potential customers to instantly connect with designated support and sales representatives for instant pilots, training, assistance and trouble-shooting.

System Flows

Each of the above systems has key data flows to be surveyed in the following sections with the roles played by key components. The key data flows will be reviewed in the following order:
    1. Pipe Data Flow
    2. Application Data Flows
    3. Ontology Data Flows
    4. Sphere Data Flows
    5. Framework Data Flows
Data flows typically span multiple systems but each will be addressed in the context of one system.

Pipe Data Flow

Key pipe data flows include the following:
1. Record assembly – Parallel caching and scraping can result in the loss of order and relationships (parent-child) between records spanning multiple pages.
    a. Consider, for example, a table of books (block 1) with links to book details (block 2) on separate pages that include a link to a collection of reviews (block 3) on a single page. Moreover, the table of books may contain additional category information that relates to all books in the table (block 4).
    b. Then each block consists of one or more block records, each consisting of one or more data fields that the scraper will extract with their field contexts. Such a dataset could subsequently be imported into the QR in two ways:
        i. One table – Data from blocks 1, 2, 3 and 4 are assembled into a single table, whereupon multiple reviews for a book will result in multiple rows per book, one row per review with extensive duplication.
        ii. Multiple tables – Independent tables for category data (block 4), book data (blocks 1 and 2) and review data (block 3) appropriately interlinked.
    c. Clearly the latter approach has advantages but a sphere administrator might prefer the former. By maintaining all scraped and cleansed data as block records, duplicate kTable storage and cleansing is avoided and the decision as to how to collate the data in the QR can be made in the final import stage. This also simplifies the distribution of pages to agents for caching and scraping – each page can be processed independently with the exception of cases (e.g. PDF files) in which records roll over from one page to the next.
    d. As later detailed, the initial spidering phase must therefore create an index of categories {id, bread crumbs} and pages {id, category, order and relationship}. The scraper must subsequently create an index of blocks {id, type (table, record, category or attachment), page and relationship} and records {id, block} so that all block records belonging to the same entity/schema can be appropriately collated upon import.
    e. Note that cleansing and enriching are also best applied to block records to avoid unnecessary duplication. Note also that scraper collating of block records would require that each sub-tree of pages be processed by the same scraper in a specific order using multiple scraper filters to cater to the multiple page layouts.
The kFramework maintains a cache with Page objects for each pipe, each page retaining navigations to previous and next pages. To support block linkage for subsequent record assembly, kFramework maintains a kRegistry to maintain the following:
    a) Each page object must retain a parent record id, e.g. when a page with an index containing several books links to a page per specific book, each specific book page must link back to a specific record in the index page. The parent record id must consist of the page id, the parent block id containing the index and the link value itself so that we can subsequently identify the parent index record for each specific page. Block ids are foreign to the spider, so the spider merely registers the LinkValueToPage so that the scraper can later identify the link content and register the block and record ids of the parent record id.
        Class PageRecordId { ParentPageId, ParentBlockId, LinkValueToPage }
    b) Each page object must also identify the scraper filter that must be used to scrape that page. A scraper filter is assigned to each page layout, so the page object need only identify the page layout that it belongs to, each page layout also having a list of layout blocks:
        Class RegisteredLayout { LayoutId, ScrapeFilter, List<LayoutBlock> }
        Class LayoutBlock { LayoutBlockId, BlockType }
        Enum BlockType { Record, Table, Attachment, Category }
        Class PageLayout { LayoutId, List<LayoutBlock> }
        Class RegisteredPage { PageId, PageRecordId, LayoutId, List<BlockId>, BreadCrumbs }
        Class RegisteredBlock { BlockId, PageId, LayoutBlockId }
    c) To keep track of all pages and blocks, kLibrary must retain a registry dictionary for layouts, pages and blocks; each registered page retains its parent/sibling navigations and a list of its blocks:
        Dictionary<LayoutId, RegisteredLayout> LayoutRegistry
        Dictionary<PageId, RegisteredPage> PageRegistry
        Dictionary<BlockId, RegisteredBlock> BlockRegistry
    d) The spider must maintain the LayoutRegistry and PageRegistry, whereupon the scraper maintains the BlockRegistry, independently scraping records per BlockId and storing them per BlockId in the kTable.
    e) The assembly of records spanning multiple blocks can then be accomplished by the Importer as follows: Category, Record and Attachment blocks only have a single record per block whereas a Table block may have several records. We assume any number of Table and Record blocks per page. If there are Record blocks, then all Record blocks are assembled as a single Record, all Category and Attachment blocks are linked to it and any number of Table blocks are linked to it as well as Tables within that record. If there are no Record blocks, each Table block produces any number of records to which all Category and Attachment blocks are linked.
    If the Page has a PageRecordId then all records produced are linked to that record. The appropriate record is identified by finding a record in the designated BlockID that has the appropriate LinkValueToMe.
    Record fields are loaded into the QR as dictated by the sphere ontology. Thus, when loading a record containing a table, if the table contents map into those of an ontological entity linked to other ontological entities in the record, then each entity will be loaded into a different table, appropriately linked. If the table contents map into entity properties with an appropriate cardinality then it will be loaded into them.
2. Scraper filters – Scraper filters are automatically generated and tested for sample pages by the Wizard, but they may need to be adjusted by the Wizard after they are applied to all of the pages. The auto-generation and adjustment is accomplished as follows:
    a. One or more blocks per page are marked by the worker as one of the following types: Table, Record, Category or Attachment.
    b. The default block type is determined by the block content assisted by the OntoPlug. Thus, for example, the presence of bread crumbs suggests a Category block. The presence of a table with multiple records suggests a Table block. The presence of certain types and context may suggest a Record or Attachment block.
    c. Key elements in a scraper filter include:
        i. Tag sequences to identify the page blocks.
        ii. Layout uFilters to identify page blocks including a potential Title uFilter to identify the beginning of a nugget block.
        iii. The above two mechanisms back up each other in case one fails (e.g. inconsistent html templates) or a uFilter gets confused by dynamic content.
        iv. A record/attachment block may have variable sets of fields per page – hence as many fields as possible are provided uFilters to map them to appropriate block dataset columns.
        v. A table block requires mechanisms to detect its headers and records. If there is an obvious html structure, the key table tags are used – else a set of block uFilters to identify headers and the beginning or end of each record.
        vi. In both of the above cases, the generation of uFilters begins with an attempt to find a strong set of uFilter attributes per field until as many fields as possible are readily captured by a uFilter. A strong uFilter is one that captures a reasonable number of fields. A uFilter in a record block that captures half of the fields might be useful for capturing labels, whereupon it will also be included.
        vii. In a table block, the number of fields captured by each uFilter then serves as a basis for identifying the number of records on the marked page.
        viii. In a table block, each uFilter is also assessed regarding its ability to serve as the basis for breaking the table into records.
        ix. All of the generated uFilters are treated as candidate uFilters of which the most probable ones are armed for use in the scraper filter. Less probable ones are retained so that the Wizard can show the dataset that would be produced if
        they are armed, enabling the worker to choose the right combination by example.
        x. The Wizard also uses the OntoPlug to assign semantic tags and cardinality per uFilter to determine how captured fields will be mapped into block columns.
    d. Subsequent Wizard adjustments merely alter the set of armed uFilters and the semantic tags and cardinalities assigned to them.
3. Semantic tags – Whereas the scraper can readily extract fields and insert them into appropriate columns with context in page block tables, the decomposition of fields with multiple atoms can only be accomplished by the cleanser for columns that have been semantically tagged. The semantic tagging of columns is accomplished as follows:
    a. When dealing with site columns that have already been mapped to specific semantic tags, the semantic tagging can be accomplished automatically by the scraper upon concluding its work or by the cleanser prior to cleansing. The former is preferred since cleansing can be iterative and it’ll be easier to plan and schedule if we know in advance which columns can be cleansed.
    b. The OntoPlug attempts to determine a semantic tag ……………..
    c. When dealing with consistent columns yet to be mapped, the Wizard can query the OntoPlug for the most probable headers, enable the user to approve its automated selection and prompt the user to select a specific semantic tag from a short list when the confidence level associated with the most probable tag is insufficient.
    d. When the most probable tags are not sufficiently differentiated (too close), allow the user to determine the right one. Site columns that have not yet been semantically tagged are analyzed by the OntoServer to attempt expansion of the sphere ontology to cover these columns and map them accordingly.
4. Pipe OntoPlug directives
    a. uFilter attributes
        i. Atom Filters – Identifying the atom type and pattern that best captures the atoms in a given field.
        ii. Label/Prefix – Probability that a uFilter captures labels and/or prefixes based upon the text in the fields captured by that uFilter. Labels will subsequently be captured by a Text Equals attribute and a prefix by a Text StartsWith attribute.
    b. Semantic tags
5. Atom filter training – Saving fields with unknown atoms in decompose and elsewhere with their contexts and semantic tags…
6. Atom context training – Atoms will often appear in new contexts that have to be acquired by the ontology and newly harvested data will often lack the ontology needed to cleanse it. The scraper records the context of each field in kTable so that the OntoServer can subsequently learn everything possible for harvested data, including new patterns for existing atoms as well as new atoms. It is context like column, style, headers and ID that enables us to recognize fields that contain common atoms and provides hints as to what they might be.
7. Field decomposition –
8. Table/record within table – The scraper uses context (style, id, etc.) that is not available to the cleanser….

Application Data Flows

Key application data flows include the following:
    1. Query auto-suggestion
    2. Aggregated data view
    3. Sourcing crowd-qualification
    4. Best-practice sphere views
    5. Best-practice content enrichment
    6. Collective sphere ontology development

Ontology Data Flows

Key ontology data flows include the following:
    1. Ontology Acquisition
    2. Contexts
    3. Entity relationships
    4. Synonyms
    5. Patterns
    6. Atom Types
    7. Rules
    8. Atom Filter training
    9. Unique Identifiers – production of UIK permutations
    10. Schema expansion

Sphere Data Flows

Key sphere data flows include the following:
    1. Publishing
    2. Merging data across multiple pipes
    3. Conflict resolution
    4. Manual refinement
    5. Rules
    6. Crowd worker data refinement
    7. Ontology refinement
    8. Quality refinement
    9. Performance refinement

Framework Data Flows

Key framework data flows include the following:
    1. Data quality measurement
    2. Key Performance Indicators (KPI)
    3. Cloud utilization optimization
    4. Auto-testing optimization
    5. Pipe caching optimization
Appendix A: kTables Tags
Owner: Moshe

Each pipe may produce several tables of data, each category of entities maintained in a separate table. Each table comprises a matrix of fields, each field belonging to a specific column and row. Each field, column, row, table and pipe may have several properties maintained as key/value pairs, maintained by kTables at an appropriate pipe, table, column, row and field level that reflects their scope. The kTables keys are collectively referred to as kTable tags, maintained in a kTableTags enumerator. The property values are often objects defined in designated APIs.
Consider a site that sells books, CDs and videos with reviews. Then books, CDs, videos, prices and reviews can be treated as independent entities maintained in separate tables, with specific rows in one table (e.g. prices and reviews) linked to specific rows in other tables (e.g. book, CD and video). Each field is associated with a specific CategoryID to ensure that it is stored in the right table. Each field is also associated with a specific PageID and BlockID so that we can trace back to its origin.
kTables is created by a scraper agent whereupon it is refined by Cleanser and additional agents before finally being imported by an Importer agent into the QR. Prior to kTables creation, properties pertaining to the data that are created by the Wizard App and Spider Agent are maintained in the kRegistry of kFramework. This includes, for example, pipe properties such as a SphereID and SphereOntology, as well as the properties of the pages and page blocks that each table cell was extracted from. kTables therefore need only maintain a PipeId, PageId and BlockId so that all properties associated with the pipe, page and block can be obtained from the kRegistry. The scraper receives as input a list of pages that it needs to process including the PipeId and all PageIds that it needs to access them. Given a PageId, the scraper can get all BlockIds for that page. Similarly, when dealing with a cell containing a value with a given AtomType, kTables only needs to maintain the AtomTypeId to obtain additional properties of the AtomType from the OntoPlug.
The following table contains a list of properties per scope (pipe, table, column etc.) that are available via kTable. In many cases, the property is maintained internally – in other cases the property is prefixed with a ‘*’ or ‘#’ to indicate that it can be obtained via the kRegistry or OntoPlug respectively using a kTable tag designated in the second column. A ‘W’ value indicates which components write the tag value and an ‘R’ indicates which components read it for purposes described in the final column. Post pipe processing typically includes OntoServer analysis or QA evaluation of the quality of pipe processing. Highlighted tags are for Beta.
Property/Tag | Level/Scope | W/R (across App/Wiz, Spider, Scraper, Cleanser, Importer, Post Pipe) | Description/Purpose
PipeID | Pipe | W R | To access all related properties in kRegistry
*CustomerID | *PipeID | W R | Serving QR access control
*SphereID | *PipeID | W R | Each sphere has an independent QR
*OntologyName | *PipeID | W R R R R | OntologyName for OntoPlug.SetContext
*MinPedigree | *PipeID | W R R R R | Min Pedigree for OntoPlug.SetContext
*SiteID | *PipeID | W R | For auto-navigation to the original pages
BlockID | Field | W | Entity records spanning multiple blocks & pages must be merged
*PageID | *BlockID | W R | To find other blocks on the same page
*PageLayout | *PageID | W R | To apply correct scraper filter to each page by layout
AtomFilters | Pipe | W R R | OntoPlug.RelevantFilters(List<fields>) per block per layout for O…
*Prev/NextPage | *PageID | W R | To scrape records in the appropriate order
*ParentBlockID | *PageID | W R R | To find linked block on parent page
*ParentLinkValue | *PageID | W R R | To find linked block on parent page
*BreadCrumbs | *PageID | R W R | To auto-identify a category block & measure spider coverage
*CategoryContent | *PageID | W R R | To validate right choice of category block and categoryID
CategoryID | Table | W R R | OntoPlug.CategoryID(BreadCrumbs, CategoryContent)
*AgentsVersions | *PipeID | W W W W W R | List of agents that processed this kTables and their onto versions
UniqueKeySets | Table | W R | OntoPlug.UniqueKeySets(List<column.SemanticTags>) per Category
*PageTitle | *PageID | WR R | HTML page title to validate matching nugget title
IsNugget | Column | W R R | To apply QR indexing to these columns
ColumnContext | Column | W R | Scraped TableColumnOrder, Header, Prefix/Label, Style, AtomType
ColumnID | Column | W R | Autogenerated by Scraper based upon ColumnContext
UserColumnName | Column | (W) W R | Manually entered by user & registered in Scraper Filter (takes precedence)
SemanticTag | Column | W W R R | OntoPlug.ProbableSemanticTags(List<columnValues>, ColumnContext)
FirstSonColumn | Column | W | To find first descendant column produced by field decomposition
NextSonColumn | Column | W | Whereupon this leads to remaining descendant columns
ParentColumn | Column | W | To recursively find all ancestor columns
SkipColumn | Column | W R R |
#CanDecompose | *SemanticTag | W | OntoPlug.CanDecompose(SemanticTag); Wizard can also configure…
FieldProperties | Field | W W R | Link, ImageLink, ImageAlt, Empty, NewWord
FieldAtomType | Field | W W R R | To get OntoPlug atom type attributes, e.g. AtomValueMin/Max
FieldValue | Field | W W R R | Object containing Amount, UnitName etc.
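As an illustration of the tag scheme above, the following is a minimal sketch of how an agent might resolve a kTable tag, storing internally maintained tags in the kTable itself and delegating ‘*’ tags to the kRegistry (‘#’ tags would similarly go to the OntoPlug). The KTableTag subset, the IKRegistry interface and the method names are assumptions, not the actual kTables API.

    // Hypothetical sketch only: resolving kTable tags per scope (pipe, page, block, column, field).
    using System.Collections.Generic;

    public enum KTableTag { PipeID, PageID, BlockID, CategoryID, ColumnID, SemanticTag, FieldAtomType, FieldValue }

    public interface IKRegistry
    {
        // '*' tags are resolved here, e.g. page and block properties recorded by the Wizard and Spider.
        object Lookup(string scopeId, KTableTag tag);
    }

    public class KTableTagStore
    {
        private readonly IKRegistry _registry;
        private readonly Dictionary<(string ScopeId, KTableTag Tag), object> _local =
            new Dictionary<(string, KTableTag), object>();

        public KTableTagStore(IKRegistry registry) { _registry = registry; }

        public void Write(string scopeId, KTableTag tag, object value) => _local[(scopeId, tag)] = value;

        // Internally maintained tags win; anything else is fetched from the kRegistry.
        public object Read(string scopeId, KTableTag tag) =>
            _local.TryGetValue((scopeId, tag), out var value) ? value : _registry.Lookup(scopeId, tag);
    }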
Quality Metrics

The following table contains a list of quality metrics per pipe designed to monitor:
    a) Data quality improvements as the data flows through the pipe
    b) Pipe data quality and KPI (key performance indicator) improvements over time. Over time here could mean from one test cycle to the next due to a new code version or an improved ontology.
Several of the metrics are maintained per Type*Layout, i.e. for each Block Type (Record, Table, Category, Attachment) and each Page Layout in the pipe. KPIs are readily derived from these metrics, e.g. Suspicious/Validated characters, percentage of Empty Pages, percentage of Normalized/Atomic cells, etc.

Property/Tag | Level/Scope | W (across App/Wiz, Spider, Scraper, Cleanser, Importer, Post Pipe) | Description/Purpose
Bread crumb paths | Pipe | W | Number of queries – should correlate with Category IDs
Spider Pages | Pipe | W | Total navigations
Page Layouts | Pipe | W |
Linked pages | Pipe | W | Navigations via links
Broken links | Pipe | W |
Layout Pages | Layout | W | Scraped pages per layout ID
Empty pages | Layout | W |
Category IDs | Layout | W |
Blocks | Type*Layout | W | Category/Table/Record/Attachment blocks per layout ID
Texts | Type*Layout | W | Texts per block type per layout
Fields | Type*Layout | W | Extracted fields per block type/layout
Columns | Type*Layout | W W | Number of columns
Semantic Tags | Type*Layout | W W | How many of them tagged
Atomic Columns | Type*Layout | W W |
Atomic Cells | Type*Layout | W |
Normalized Cells | Type*Layout | W |
Total Chars | Type*Layout | W |
Unknown Atoms | Type*Layout | W | Candidate new atom names/patterns
Residue Chars | Type*Layout | W |
Leave As Is Chars | Type*Layout | W |
Validated Chars | Type*Layout | W | Output characters validated via their history as matching spe…
Suspicious Chars | Type*Layout | W | Non-validated output characters
Max Response | Pipe | W | Source response times
Total Response | Pipe | W | Total response times for all pages in spider process
Spider Time | Pipe | W | Total spider process
Scrape Time | Pipe | W |
Cleanse Time | Pipe | W |
Scrape Onto Time | Pipe | W | Onto processing time only
Cleanse Onto Time | Pipe | W | Onto processing time only
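By way of illustration, a minimal sketch of deriving the example KPIs named above from the recorded metrics. The class and field names are assumptions; the actual health monitor may compute these differently.

    // Hypothetical sketch only: deriving pipe KPIs from recorded quality metrics.
    public class PipeQualityMetrics
    {
        public long AtomicCells;
        public long NormalizedCells;
        public long ValidatedChars;
        public long SuspiciousChars;
        public long LayoutPages;     // scraped pages for a layout
        public long EmptyPages;      // empty pages for the same layout
    }

    public static class PipeKpis
    {
        // Percentage of atomic cells that were normalized.
        public static double NormalizedCellRate(PipeQualityMetrics m) =>
            m.AtomicCells == 0 ? 0 : 100.0 * m.NormalizedCells / m.AtomicCells;

        // Ratio of suspicious to validated output characters (lower is better).
        public static double SuspiciousToValidated(PipeQualityMetrics m) =>
            m.ValidatedChars == 0 ? double.PositiveInfinity : (double)m.SuspiciousChars / m.ValidatedChars;

        // Percentage of scraped pages in a layout that produced no content.
        public static double EmptyPageRate(PipeQualityMetrics m) =>
            m.LayoutPages == 0 ? 0 : 100.0 * m.EmptyPages / m.LayoutPages;
    }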
Appendix B: Ontology API
Owner: Naama
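Details TBD. As a starting point, the following is a minimal interface sketch that merely assembles the OntoPlug calls referenced elsewhere in this document (Appendix A and the Pipe Data Flow section). All signatures and helper types are assumptions, to be replaced by the actual API.

    // Hypothetical sketch only: OntoPlug calls referenced in this document, gathered in one interface.
    using System.Collections.Generic;

    public interface IOntoPlug
    {
        // Load the relevant KB portion for a pipe (see the *OntologyName and *MinPedigree tags).
        void SetContext(string ontologyName, int minPedigree);

        // Atom filters worth arming for the uFilters of a given block/layout.
        IList<AtomFilterInfo> RelevantFilters(IList<string> sampleFields);

        // Resolve bread crumbs / category block content into a CategoryID.
        string CategoryID(string breadCrumbs, string categoryContent);

        // Candidate unique identifier key sets for a category, given the columns' semantic tags.
        IList<IList<string>> UniqueKeySets(IList<string> columnSemanticTags);

        // Most probable semantic tags for a column, from sample values and scraped column context.
        IList<SemanticTagCandidate> ProbableSemanticTags(IList<string> columnValues, ColumnContextInfo context);

        // Whether a semantically tagged column can be decomposed by the cleanser.
        bool CanDecompose(string semanticTag);
    }

    public class AtomFilterInfo { public string AtomType; public string Pattern; }
    public class SemanticTagCandidate { public string Tag; public double Confidence; }
    public class ColumnContextInfo { public string Header; public string Style; public string Prefix; }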
Appendix C: Pipe Flow with Data Example
Owner: Oksana

1. Website Mapping
Result: Sample data

Pipe 1:
Column1 | Column 2 | Price | Column 3
ayn rand | fountainhead | 10.00$ | 5

Pipe 2:
Column1 | Column 2 | Price | Column 3
ayn rand | fountainhead | 10.00$ | 5

2. Collect data (scraper)
Process: System columns added – what columns? Data manipulation?
Result: Scraped data

Pipe 1 (collected from 10,000 pages, 15,000 records):
Column1 | Category | Column 2 | Price | Column 3
ayn rand | Philosophy | fountainhead | 10.00$ | 5 - excellent
Author1 | Cat1 | Book1 | 10.00$ | 3 - medium
Author3 | Cat3 | Book3 | 7.00$ | 2 – poor
Author3 | Cat1 | Book4 | 7.00$ | 4 – good

Pipe 2 (collected from 5,000 pages, 10,000 records):
Column1 | Column 2 | Column 3 | Price | Column 4
ayn rand | Inspiration | fountainhead | 10.00€ | 10
Author1 | Cat7 | Book1 | 5.00€ | 8
Author2 | Cat3 | Book2 | 5.00€ | 6

3. Cleansing
3.1. Decomposition
Breakdown to atoms? Based on what rules? Should prices and measure units be decomposed?
Result: Pipe 1, Column 3 decomposed to Column 3_1 and Column 3_2

Column1 | Category | Column 2 | Price | Column 3_1 | Column 3_2
ayn rand | Philosophy | fountainhead | 10.00$ | 5 | excellent
Author1 | Cat1 | Book1 | 10.00$ | 3 | medium
Author3 | Cat3 | Book3 | 7.00$ | 2 | poor
Author3 | Cat1 | Book4 | 7.00$ | 4 | good
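For illustration, a minimal sketch of the Column 3 decomposition shown above, assuming a simple "<rating> - <legend>" pattern. The class name and regex are hypothetical; the real cleanser is driven by atom filters and decomposition tags supplied by the OntoPlug.

    // Hypothetical sketch only: splitting "5 - excellent" into Column 3_1 = "5", Column 3_2 = "excellent".
    using System.Text.RegularExpressions;

    public static class RatingDecomposer
    {
        private static readonly Regex Pattern = new Regex(@"^\s*(?<rating>\d+)\s*[-–]\s*(?<legend>.+)$");

        // Returns (rating, legend), or (original value, null) when the field is already atomic.
        public static (string Column3_1, string Column3_2) Decompose(string column3)
        {
            var match = Pattern.Match(column3 ?? string.Empty);
            if (!match.Success) return (column3, null);
            return (match.Groups["rating"].Value, match.Groups["legend"].Value.Trim());
        }
    }

    // Example: RatingDecomposer.Decompose("5 - excellent") yields ("5", "excellent").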
3.2. Typing
Which data types do we distinguish today? String, Number, Price, Phone, Address? Is the breakdown into units and scale part of typing or of decomposition?
Result:

Pipe 1:
Column | Type | Units | Scale
Column1 | String | |
Category | String | |
Column 2 | String | |
Price | Price | $ |
Column 3_1 | Number | | 1-5
Column 3_2 | String | |

Pipe 2:
Column | Type | Units | Scale
Column1 | String | |
Column 2 | String | |
Column 3 | String | |
Price | Price | € |
Column 4 | Number | | 1-10

3.3. Column mapping & unique identifiers (entities & relationships)
Description TBD
Result:

Pipe 1:
Column | Mapped to | Identifier
Column1 | Author name | Yes
Category | Book category | Yes
Column 2 | Book name | No
Price | Book price | No
Column 3_1 | Book rating | No
Column 3_2 | Book rating legend | No

Pipe 2:
Column | Mapped to | Identifier
Column1 | Author name | Yes
Column 2 | Book category | Yes
Column 3 | Book name | No
Price | Book price | No
Column 4 | Book rating | No

3.4. Normalization
What normalization do we perform in the cleanser? As far as I understand, normalization can also be done at the sphere level? For example, does price normalization come into question only when we join data (one source in $ and the other in €), or do we already bring all data to one currency in the cleanse stage? The same for rating normalization.

3.5. Complete data
What completion do we perform in the cleanser? Is it not a sphere-related action as well? E.g. complete all zips, phones, etc. I think this is also something we cannot do automatically; user input will be required, with guidelines as to what data to complete and how.
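If normalization is eventually done by bringing all prices to one currency (whether in the cleanser or at the sphere level, per the open question in 3.4), a minimal sketch could look as follows. The class name, reference currency and rates are assumptions for illustration only.

    // Hypothetical sketch only: normalizing price atoms to a single reference currency.
    using System.Collections.Generic;

    public static class PriceNormalizer
    {
        // Configured conversion rates into the sphere's reference currency (here assumed to be $).
        private static readonly Dictionary<string, decimal> RatesToReference = new Dictionary<string, decimal>
        {
            { "$", 1.0m },
            { "€", 1.5m } // placeholder rate
        };

        public static decimal Normalize(decimal amount, string currencySymbol) =>
            amount * RatesToReference[currencySymbol];
    }

    // Example: PriceNormalizer.Normalize(10.00m, "€") yields 15.00m, making Pipe 2 prices comparable to Pipe 1.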
4. Merge data
4.1. Unify – simple merge
Description TBD
Result – Unified data:

Source (S) | Author name | Book category | Book name | Book price | Book rating | Book rating legend
Pipe 1 | Ayn rand | Philosophy | fountainhead | 10.00$ | 5 | excellent
Pipe 2 | ayn rand | Inspiration | fountainhead | 15.00$ | 5 |
Pipe 1 | Author1 | Cat1 | Book1 | 10.00$ | 3 | medium
Pipe 2 | Author1 | Cat7 | Book1 | 7.00$ | 4 |
Pipe 1 | Author3 | Cat3 | Book3 | 7.00$ | 2 | poor
Pipe 1 | Author3 | Cat1 | Book4 | 7.00$ | 4 | good
Pipe 2 | Author2 | Cat3 | Book2 | 7.00$ | 3 |

4.2. Merge data by unique identifier
4.2.1. Keep duplicates (default)
Enables the domain expert to decide how to resolve duplicates/conflicts – manually reconciling data from various sources.
Result – merged data with duplicates:

Author name | Book category | Book name | Book price | Book rating | Book rating legend
Ayn rand | Philosophy (Pipe 1) / Inspiration (Pipe 2) | fountainhead | 10.00$ (Pipe 1) / 15.00$ (Pipe 2) | 5 | excellent

At this point the domain expert can perform a decision round and decide how to treat the duplicates. Based on the example above, I decide to always take categories by pipe rating (Pipe 1 in my case), which actually defines the categories normalization here, and to keep prices from both sources.
Result example:

Author name | Book category | Book name | Book price – Pipe 1 | Book price – Pipe 2 | Book rating | Book rating legend
Ayn rand | Philosophy | fountainhead | 10.00$ | 15.00$ | 5 | excellent

4.2.2. Resolve all duplicates by pipe rating
Based on sphere preferences, in this case the merge should result in all duplicates automatically resolved by pipe rating.
Example – My sphere preferences = always resolve duplicates based on Pipe 1.
Result – merged data, duplicates resolved:
Author name | Book category | Book name | Book price | Book rating | Book rating legend
Ayn rand | Philosophy | fountainhead | 10.00$ | 5 | excellent
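A minimal sketch of the 4.2.2 behaviour, resolving each group of duplicate records by pipe rating. The record shape, the use of (Author name, Book name) as the unique identifier key and the class names are assumptions for this example only.

    // Hypothetical sketch only: keeping the record from the highest-rated pipe per unique identifier key.
    using System.Collections.Generic;
    using System.Linq;

    public class BookRecord
    {
        public string PipeId;        // source pipe
        public string AuthorName;    // assumed UIK part
        public string BookName;      // assumed UIK part
        public string BookCategory;
        public string BookPrice;
    }

    public static class DuplicateResolver
    {
        public static IEnumerable<BookRecord> ResolveByPipeRating(
            IEnumerable<BookRecord> records, IDictionary<string, int> pipeRating)
        {
            return records
                .GroupBy(r => (r.AuthorName, r.BookName))
                .Select(g => g.OrderByDescending(r => pipeRating[r.PipeId]).First());
        }
    }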
Appendix D: Domain Expert Worker Flow
Owner: Oksana

1. High level domain expert flow
More can be found here: https://sites.google.com/a/kinor.com/product/design/High-Level
Appendix E: Sphere Architecture and APIs
Owner: Irina
Appendix F: Framework Architecture and APIs
Owner: Aryeh
Appendix G: Spider Architecture and APIs
Owner: Yossi
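Details TBD. As a placeholder, a minimal sketch of how the spider might populate the PageRegistry defined in the Pipe Data Flow section when it follows a link to a new page. The class members and method names are assumptions.

    // Hypothetical sketch only: the spider registering a newly cached page in the PageRegistry.
    using System.Collections.Generic;

    public class SpiderPageRecordId { public string ParentPageId; public string ParentBlockId; public string LinkValueToPage; }

    public class SpiderRegisteredPage
    {
        public string PageId;
        public string LayoutId;
        public string BreadCrumbs;
        public SpiderPageRecordId ParentRecord;
        public List<string> BlockIds = new List<string>(); // filled in later by the scraper
    }

    public class SpiderRegistrar
    {
        public Dictionary<string, SpiderRegisteredPage> PageRegistry = new Dictionary<string, SpiderRegisteredPage>();

        // Called when the spider follows a link on parentPageId and caches the target page.
        public void RegisterPage(string pageId, string layoutId, string breadCrumbs,
                                 string parentPageId, string linkValue)
        {
            PageRegistry[pageId] = new SpiderRegisteredPage
            {
                PageId = pageId,
                LayoutId = layoutId,
                BreadCrumbs = breadCrumbs,
                // Block ids are foreign to the spider; only the link value is recorded so the scraper
                // can later resolve the parent block and record ids.
                ParentRecord = new SpiderPageRecordId { ParentPageId = parentPageId, LinkValueToPage = linkValue }
            };
        }
    }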
Appendix H: Scraper Architecture and APIs
Owner: Hagay

Subsequent iterations:
    a) Multiple record blocks: Linked records between blocks
    b) Optimize performance: replace TextList with linked texts, no need for spliceRecords
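For orientation, a minimal sketch of a uFilter as described in the Pipe Data Flow section: a field is captured by an armed atom pattern, a Text Equals label attribute or a Text StartsWith prefix attribute. All names and the precedence between attributes are assumptions.

    // Hypothetical sketch only: a uFilter deciding whether it captures a given field.
    using System.Text.RegularExpressions;

    public class UFilter
    {
        public string TargetColumn;    // block dataset column that captured fields are mapped to
        public Regex AtomPattern;      // atom filter pattern supplied by the OntoPlug, if any
        public string TextEquals;      // label capture
        public string TextStartsWith;  // prefix capture
        public bool Armed;             // only armed candidates are used by the scraper filter

        public bool Captures(string fieldText)
        {
            if (!Armed || fieldText == null) return false;
            if (TextEquals != null) return fieldText.Trim() == TextEquals;
            if (TextStartsWith != null) return fieldText.StartsWith(TextStartsWith);
            return AtomPattern != null && AtomPattern.IsMatch(fieldText);
        }
    }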
Appendix I: Cleanser Architecture and APIs
Owner: Ronen
Appendix J: kGrid Integration
Owner: Jair

kGrid was designed for cross-enterprise data integration - hence kGrid agents are fundamentally different than pipe agents in kFramework:
    1. kGrid agents run continuously, processing requests sent to their message queues whereas kFramework agents are dispatched to do a specific task.
    2. kGrid agents typically reside at specific Agency or Gateway locations until system dynamics warrant relocation whereas kFramework agents are dispatched per task anywhere in the cloud.
    3. kGrid agents cache the ontology that they need internally via an Ontology service whereas kFramework agents rely upon an OntoPlug to do their ontology-related work.
Kinor envisions future cross-enterprise contexts, so the fundamental kGrid architecture must be retained. This architecture enables kGrid Agencies and Gateways anywhere to dynamically discover each other and work together without disrupting operations. A robust dynamic discovery mechanism has yet to be implemented. Hence kGrid should initially be deployed internally using a static configuration dictated by an appropriate kGridConfig object in the kFramework kRegistry.
The following principles should enable immediate kGrid deployment for online database integration purposes within weeks:
    1) The designated machines will run Java 1.4.
    2) At least one of the machines will deploy MySQL for initial kGrid persistence.
    3) The kGridConfig object should initially comprise the following:
        a) A list of Agency and Gateway machines for deployment of the kGrid Agent Manager
        b) A kGrid compatible XML file per Agent Manager to configure its agents and services
        c) A minimalistic agent and services configuration to get started
    4) The kGrid discovery service should be adjusted to access kRegistry for connecting with other kGrid discovery services rather than attempt dynamic discovery.
    5) kGrid uses Protégé as an ontology editor with a plugin that knows how to engage the kGrid ontology service. The OntoServer must incorporate this plugin and translate its OWL ontology format into the OKBC format currently serving the Ontology Server.
The following kGrid improvements can then be implemented over subsequent months:
    1) Replace kGrid Query Builder with Sphere Query Builder
    2) Upgrade code to be compatible with Java 7.
    3) Use an OntoPlug to semi-automate Wrapper schema mapping.
    4) Self-organizing kGrid deployment to match operational demand
    5) It might be worth replacing the ODBC connectivity currently serving the kGrid wrappers with a standard DBMS interface that is more readily configured programmatically by kFramework when new database sources are added to a sphere.
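A minimal sketch of what the kGridConfig object held in the kFramework kRegistry might look like for such a static deployment. The class layout, property names and example service names are assumptions.

    // Hypothetical sketch only: static kGrid deployment configuration per the principles above.
    using System.Collections.Generic;

    public class KGridConfig
    {
        // a) Agency and Gateway machines on which the kGrid Agent Manager is deployed.
        public List<KGridNode> Nodes = new List<KGridNode>();

        // b) Path to a kGrid-compatible XML file per Agent Manager, configuring its agents and services.
        public Dictionary<string, string> AgentManagerConfigXml = new Dictionary<string, string>();

        // c) Minimalistic agent and services configuration to get started (example names only).
        public List<string> EnabledServices = new List<string> { "Ontology", "Wrapper", "Query" };
    }

    public class KGridNode
    {
        public string Host;
        public KGridRole Role;
    }

    public enum KGridRole { Agency, Gateway }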