Overview




Onlineextrems.com
Platform overview
    A single unified platform for all content types (consolidated to reduce development and maintenance costs)
    A flexible system that can support any new content type
    High automation (cuts configuration costs)
    Real-time coverage, or as close to it as possible, for each content type
    Improved data quality through validation rules
    Implemented this year



January 1, 2013        Onlineextrems.com
Supporting all the content types
        Message boards
        Blogs and microblogs (Myspace, Blogger, LiveJournal...)
        Blog comments
        Social networks – Facebook, LinkedIn, Xing
        Author profiles
        Product reviews
        Usenet – mailing lists, groups
        Traditional media – CNN, Reuters




Consolidating the content systems
 Data mining systems
     Message boards
     Blogs
     Social networking sites
     Author profiles system
     Usenet + Newsgroups system




Some of our challenges
        Dynamic nature of the web
        Supporting many different types of content
        Automatically “understanding” millions of sites with different structures
           Over 8,000 message boards
           Over 95 million blogs

        Supporting data in different languages
        Data quality




Data mining process
       What are the important aspects of data mining?
       Managing the order in which we crawl pages
           Efficiency (e.g. not re-entering posts whose comment count hasn’t changed)
           Next page (we need to follow the link to get more comments)

       Extracting the relevant data from everything on the page
       Separating the data into posts (or comments)
       Transforming specific data into the desired format
           Handling dates in differing formats
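
Two of the points above, skipping posts whose comment count is unchanged and handling dates in differing formats, can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the platform's own code; the format list, function names, and data shapes are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical list of date layouts seen across sites; a real crawler
# would need many more, plus locale and time-zone handling.
KNOWN_FORMATS = [
    "%Y-%m-%d %H:%M",
    "%d/%m/%Y %H:%M",
    "%B %d, %Y",
    "%d %b %Y",
]

def normalize_date(raw):
    """Return the date as an ISO 8601 string, or None if no layout matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).isoformat()
        except ValueError:
            continue
    return None

def posts_to_revisit(listing, seen_counts):
    """Pick the posts whose comment count differs from the last crawl.

    listing:     {post_url: comment_count} as read from an index page
    seen_counts: {post_url: comment_count} remembered from the previous crawl
    """
    return [url for url, count in listing.items()
            if seen_counts.get(url) != count]
```

Trying the formats in a fixed order keeps the behaviour predictable, but ambiguous layouts (day-first versus month-first) would still need a per-site hint.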




Data mining technologies
  Jelly – simple XML workflow engine
  HttpClient – fetcher
  Rome – feed parser
  Velocity – output template engine
  JMX + JConsole – managing the system




Flows
  Built from steps, which are the building blocks
  Allows adding support for new content types without writing code
  The implementation is based on Apache Jelly, which executes XML files as scripts
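
The idea of a flow assembled from steps and described in XML can be sketched in a few lines. The example below is written in Python and does not use Apache Jelly; the tag names, the step functions, and the `pages` stand-in for a fetcher are hypothetical and exist only to show how a new flow could be added by editing XML rather than code.

```python
import xml.etree.ElementTree as ET

# Hypothetical step registry: each step receives the shared context dict
# and the XML attributes of its own element.
def fetch(ctx, attrs):
    # Stand-in for an HTTP fetch; reads from a preloaded dict instead.
    ctx["page"] = ctx["pages"][attrs["url"]]

def split_lines(ctx, attrs):
    ctx["items"] = [line for line in ctx["page"].splitlines() if line.strip()]

def upper(ctx, attrs):
    ctx["items"] = [item.upper() for item in ctx["items"]]

STEPS = {"fetch": fetch, "split-lines": split_lines, "upper": upper}

FLOW_XML = """
<flow>
  <fetch url="board/1"/>
  <split-lines/>
  <upper/>
</flow>
"""

def run_flow(flow_xml, ctx):
    """Run each child element of <flow> as a step, in document order."""
    for element in ET.fromstring(flow_xml):
        STEPS[element.tag](ctx, element.attrib)
    return ctx
```

Reordering or removing the child elements of `<flow>` changes the pipeline without touching the step functions.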




XML parser
  Parses the data from simple XML files into the common in-memory “items” structure
  Currently supports only elements, not attributes
  Used for Twitter
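
A minimal sketch of an elements-only parser, assuming the “items” structure is a list of dictionaries (the slides do not say what it actually is). The sample document is invented for the example and is not Twitter's real format.

```python
import xml.etree.ElementTree as ET

def parse_items(xml_text, item_tag):
    """Turn each <item_tag> element into a dict of child tag -> text.

    Attributes are ignored on purpose, matching the elements-only
    limitation described above.
    """
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter(item_tag):
        items.append({child.tag: (child.text or "").strip() for child in node})
    return items

# Hypothetical input; the id attributes are present but not read.
SAMPLE = """
<statuses>
  <status id="1"><user>alice</user><text>hello</text></status>
  <status id="2"><user>bob</user><text>hi there</text></status>
</statuses>
"""
```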




HTML parser
  Applies XSLT transformations to HTML pages
  Extracts the data into the common in-memory “items” structure
  Uses the TagSoup library to read HTML as if it were XML
  Faster and more robust than the current XML conversion method
  Used for Author Profiles
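
The sketch below shows the general idea of reading malformed HTML and pulling named fields into an item. It deliberately swaps in Python's standard `html.parser` for the TagSoup plus XSLT combination described above, so it illustrates the goal rather than the platform's method; the class name and the use of `class` attributes to mark fields are assumptions.

```python
from html.parser import HTMLParser

class ProfileExtractor(HTMLParser):
    """Collect the text of elements whose class names a wanted field."""

    def __init__(self, fields):
        super().__init__()
        self.fields = set(fields)
        self.current = None   # field currently being read, if any
        self.item = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self.current = css_class if css_class in self.fields else None

    def handle_data(self, data):
        # Store the first non-blank text that follows a matching start tag.
        if self.current and data.strip():
            self.item[self.current] = data.strip()
            self.current = None

def extract_profile(html, fields):
    parser = ProfileExtractor(fields)
    parser.feed(html)
    parser.close()
    return parser.item
```

Because the parser reacts to start tags and text only, unclosed tags in the input do not stop the extraction.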




XML Output
  Writes output to XML files
  Output format is configurable through a template file
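
Template-driven output can be sketched as below. The example uses Python's `string.Template` in place of Velocity, and the element names and item fields are hypothetical; the point is only that the output layout lives in the template, not in the code.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical per-item template; changing this string changes the output
# format without touching the code below.
ITEM_TEMPLATE = Template(
    "  <post>\n"
    "    <author>$author</author>\n"
    "    <body>$body</body>\n"
    "  </post>"
)

def render(items):
    """Render a list of item dicts as one XML document string."""
    rows = [
        ITEM_TEMPLATE.substitute(
            {key: escape(str(value)) for key, value in item.items()}
        )
        for item in items
    ]
    return "<posts>\n" + "\n".join(rows) + "\n</posts>"
```

Values are escaped before substitution so that characters such as `<` and `&` in a post body cannot break the generated XML.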




Sample Work




Thank You
 Connect and share with us…
 www.onlineextrems.com





About onlineextrems concept
