About onlineextrems concept

Overview

Onlineextrems.com

Platform overview
 A single unified platform for all content types (consolidate
to reduce development and maintenance costs)
 Flexible system which can support any new content type
 High automation (cut configuration costs)
 Real-time coverage, or as close to it as possible, for each
content type
 Improved data quality using validation rules
 Implemented this year

January 1, 2013 Onlineextrems.com

Supporting all the content types
 Message boards
 Blogs and microblogs (Myspace, Blogger, LiveJournal...)
 Blog comments
 Social networks – Facebook, LinkedIn, XING
 Author profiles
 Product reviews
 Usenet – mailing lists, groups
 Traditional media – CNN, Reuters


Consolidating the content systems
Data mining systems
 Message boards
 Blogs
 Social Networking sites
 Author profiles system
 Usenet + Newsgroups system


Some of our challenges
 Dynamic nature of the web
 Supporting many different types of content
 Automatically “understanding” millions of sites with different structures
   Over 8000 message boards
   Over 95 million blogs
 Supporting data in different languages
 Data quality


Data mining process
What are the important aspects of data mining?
Managing the order in which we crawl pages
 Efficiency (e.g. not entering posts where the number of comments hasn’t
changed)
 Next page (we need to follow it to get more comments)
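
The two points above can be sketched as follows. This is only an illustration under assumptions, not the platform's code: `CrawlPlanner`, `Page`, the stored comment counts, and the `fetcher` function are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: skip unchanged posts and follow "next page" links.
class CrawlPlanner {
    // Comment count seen for each post URL on the previous crawl.
    private final Map<String, Integer> lastSeenCounts = new HashMap<>();

    /** True if the post is new or its comment count changed since the last crawl. */
    boolean shouldFetch(String postUrl, int commentCount) {
        Integer previous = lastSeenCounts.put(postUrl, commentCount);
        return previous == null || previous != commentCount;
    }

    /** One fetched page: its comments plus the URL of the next page, if any. */
    static final class Page {
        final List<String> comments;
        final String nextPageUrl; // null when there is no further page

        Page(List<String> comments, String nextPageUrl) {
            this.comments = comments;
            this.nextPageUrl = nextPageUrl;
        }
    }

    /** Follows next-page links from firstUrl and gathers all comments in order. */
    static List<String> collectComments(String firstUrl, Function<String, Page> fetcher) {
        List<String> all = new ArrayList<>();
        Set<String> visited = new HashSet<>(); // guards against link cycles
        String url = firstUrl;
        while (url != null && visited.add(url)) {
            Page page = fetcher.apply(url);
            if (page == null) {
                break; // page could not be fetched
            }
            all.addAll(page.comments);
            url = page.nextPageUrl;
        }
        return all;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a real HTTP fetcher.
        Map<String, Page> site = new HashMap<>();
        site.put("p1", new Page(Arrays.asList("first", "second"), "p2"));
        site.put("p2", new Page(Arrays.asList("third"), null));

        CrawlPlanner planner = new CrawlPlanner();
        if (planner.shouldFetch("p1", 3)) {
            System.out.println(collectComments("p1", site::get));
        }
        System.out.println(planner.shouldFetch("p1", 3)); // count unchanged, so no re-fetch
    }
}
```

Comparing the comment count shown on the index page with the stored count avoids opening a post at all when nothing has changed.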

Extracting relevant data out of everything on the page.
Separating the data into posts (or comments)
Transforming specific data into the desired format
 Handling dates in differing formats
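
One common way to handle dates in differing formats is to try a list of known patterns in turn and normalize to a single output format. The sketch below assumes ISO dates as the target; `DateNormalizer` and the four patterns are illustrative choices, not the formats the platform actually supports.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical sketch: normalize dates written in several formats to yyyy-MM-dd.
class DateNormalizer {
    // Illustrative patterns only; real sites would need many more, per language.
    private static final List<DateTimeFormatter> FORMATS = Arrays.asList(
            DateTimeFormatter.ISO_LOCAL_DATE,                            // 2013-01-01
            DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH), // January 1, 2013
            DateTimeFormatter.ofPattern("d MMM yyyy", Locale.ENGLISH),   // 1 Jan 2013
            DateTimeFormatter.ofPattern("dd/MM/yyyy")                    // 31/12/2012 (day first, an assumption)
    );

    /** Returns the date in ISO form, or empty if no known pattern matches. */
    static Optional<String> normalize(String raw) {
        String text = raw.trim();
        for (DateTimeFormatter format : FORMATS) {
            try {
                LocalDate date = LocalDate.parse(text, format);
                return Optional.of(date.format(DateTimeFormatter.ISO_LOCAL_DATE));
            } catch (DateTimeParseException ignored) {
                // Not this pattern; try the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(normalize("January 1, 2013")); // Optional[2013-01-01]
        System.out.println(normalize("not a date"));      // Optional.empty
    }
}
```

Returning an empty result for unrecognized input lets a validation rule flag the item instead of storing a wrong date.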


Data mining technologies
 Jelly – Simple XML workflow engine
 HttpClient – Fetcher
 Rome – Feed parser
 Velocity – Output template engine
 JMX + JConsole – Managing the system


Flows
 Built from steps, which are the building blocks
 Allows adding support for new content types without
writing code
 The implementation is based on Apache Jelly which allows
executing XML files
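
The idea of a flow assembled from steps in an XML file can be sketched with the standard library alone. This is not Jelly and not the platform's flow format: `MiniFlow`, the step names (`trim`, `lowercase`, `stripTags`), and the `<flow>` document shape are all invented for illustration.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: an XML file lists steps; each element name maps to reusable code.
class MiniFlow {
    // Registry of steps keyed by XML element name (all names are made up).
    private static final Map<String, UnaryOperator<String>> STEPS = new HashMap<>();
    static {
        STEPS.put("trim", String::trim);
        STEPS.put("lowercase", s -> s.toLowerCase());
        STEPS.put("stripTags", s -> s.replaceAll("<[^>]*>", ""));
    }

    /** Runs every step listed under the root element, in document order. */
    static String run(String flowXml, String input) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(flowXml)));
        } catch (Exception e) {
            throw new IllegalStateException("Invalid flow definition", e);
        }
        NodeList children = doc.getDocumentElement().getChildNodes();
        String value = input;
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace and comments between steps
            }
            UnaryOperator<String> step = STEPS.get(node.getNodeName());
            if (step == null) {
                throw new IllegalArgumentException("Unknown step: " + node.getNodeName());
            }
            value = step.apply(value);
        }
        return value;
    }

    public static void main(String[] args) {
        String flow = "<flow><stripTags/><trim/><lowercase/></flow>";
        System.out.println(run(flow, "  <b>Hello</b> World ")); // hello world
    }
}
```

Supporting a new content type then means writing a new XML file that reorders or recombines existing steps, with new Java code needed only when a step is missing.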


XML parser
 Parses the data from simple XML files into the
common in-memory “items” structure
 For now only supports elements and not attributes
 Used for Twitter
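
A minimal sketch of such a parser, assuming the “items” structure is a list of field-name to value maps (the deck does not define it). `XmlItemsParser` and the sample document in `main` are invented; like the parser described above, it reads child elements and ignores attributes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: each child of the root becomes one item,
// and each of its child elements becomes a field. Attributes are ignored.
class XmlItemsParser {
    static List<Map<String, String>> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<Map<String, String>> items = new ArrayList<>();
            NodeList records = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < records.getLength(); i++) {
                Node record = records.item(i);
                if (record.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace between records
                }
                Map<String, String> item = new LinkedHashMap<>();
                NodeList fields = record.getChildNodes();
                for (int j = 0; j < fields.getLength(); j++) {
                    Node field = fields.item(j);
                    if (field.getNodeType() == Node.ELEMENT_NODE) {
                        item.put(field.getNodeName(), field.getTextContent().trim());
                    }
                }
                items.add(item);
            }
            return items;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<statuses>"
                + "<status id=\"1\"><text>hello</text><user>alice</user></status>"
                + "<status><text>bye</text><user>bob</user></status>"
                + "</statuses>";
        System.out.println(parse(xml)); // the id attribute does not appear in the items
    }
}
```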


HTML parser
 Applies XSLT transformations to HTML pages
 Extracts the data into the common in-memory “items”
structure
 Uses the TagSoup library to read HTML as if it were XML
 Faster and more robust than the current XML conversion
method
 Used for Author Profiles
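
The XSLT extraction step can be sketched with the JDK's built-in transformer. To stay standard-library only, the input here is already well-formed markup; in the system described above, TagSoup would do that conversion for real-world HTML. The stylesheet, the `profile` and `location` class names, and the `<items>` output shape are assumptions made for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: apply an XSLT stylesheet to a page and emit extracted items.
class ProfileExtractor {
    // Illustrative stylesheet; element and class names are invented.
    private static final String XSLT =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
            + "<xsl:template match='/'>"
            + "<items><xsl:for-each select=\"//div[@class='profile']\">"
            + "<item>"
            + "<author><xsl:value-of select='normalize-space(h1)'/></author>"
            + "<location><xsl:value-of select=\"normalize-space(span[@class='location'])\"/></location>"
            + "</item>"
            + "</xsl:for-each></items>"
            + "</xsl:template></xsl:stylesheet>";

    /** Transforms a well-formed page into an items document, returned as a string. */
    static String extract(String wellFormedHtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(wellFormedHtml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div class='profile'>"
                + "<h1> alice </h1><span class='location'>Paris</span>"
                + "</div></body></html>";
        System.out.println(extract(page));
    }
}
```

Keeping the extraction rules in a stylesheet means a new site layout needs a new XSLT file rather than new parsing code.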


XML Output
 Output in XML files
 Configurable output format using template file
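
Template-driven output can be illustrated with plain placeholder substitution. This is a stand-in written with the standard library, not Velocity itself: `TemplateOutput`, the `${field}` placeholder handling, and the `<post>` template are assumptions for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: fill an XML template with the fields of one item.
class TemplateOutput {
    // Matches placeholders of the form ${fieldName}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    /** Escapes characters that are not allowed verbatim in XML text. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Replaces each placeholder with the escaped field value (empty if the field is absent). */
    static String render(String template, Map<String, String> item) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = item.getOrDefault(matcher.group(1), "");
            matcher.appendReplacement(out, Matcher.quoteReplacement(escapeXml(value)));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> item = new LinkedHashMap<>();
        item.put("author", "alice");
        item.put("body", "a < b & c");
        String template = "<post><author>${author}</author><body>${body}</body></post>";
        System.out.println(render(template, item));
    }
}
```

Changing the output format then only requires editing the template file, not the code that produces the items.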


Sample Work


Thank You
Connect and share with us…
www.onlineextrems.com

