Practical Issues for Automated Categorization of Web Sites
John M. Pierre
Metacode Technologies, Inc., 139 Townsend Street, San Francisco, CA 94107
(Collaborators: B. Wohler, R. Daniel, M. Butler, R. Avedon)
Outline
Project Overview
Web Content: Automated Categorization
Web Content: Feature Selection
Use metadata if possible; use body text as a last resort!
Web Content: Metadata
Experimental Setup: Targeted Spidering
[Flowchart. Node labels: Domain name; HTTP Get; live? (Yes/No); Try www.; Frames? (Yes/No); <a href=? (prod, service, about, info, press, news); 'Query' Pages; Metatags? (Yes/No); Use <body>; Send Query]
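The flowchart's branches for choosing links and query text can be sketched in a few lines. This is a minimal illustration, not the authors' spider: it assumes the page has already been fetched, uses only Python's standard `html.parser`, and treats the `description` and `keywords` meta tags as "the metatags" (the slide does not say which tags were used). The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Link keywords come from the flowchart; everything else is an assumption.
LINK_KEYWORDS = ("prod", "service", "about", "info", "press", "news")


class PageFeatures(HTMLParser):
    """Collects meta-tag text, body text, and candidate links from one page."""

    def __init__(self):
        super().__init__()
        self.meta_text = []
        self.body_text = []
        self.links = []
        self._in_body = False
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Assumed choice of tags: description and keywords.
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content")
            if name in ("description", "keywords") and content:
                self.meta_text.append(content.strip())
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = attrs.get("href") or ""
            # Keep only links whose URL mentions one of the keywords.
            if any(k in href.lower() for k in LINK_KEYWORDS):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_body and not self._skip and data.strip():
            self.body_text.append(data.strip())


def build_query(html):
    """Return (query_text, links): metadata if present, body text otherwise."""
    parser = PageFeatures()
    parser.feed(html)
    text = " ".join(parser.meta_text) or " ".join(parser.body_text)
    return text, parser.links
```

Calling `build_query` on a fetched page gives the text to send as a query plus the links worth spidering next. Fetching, the "Try www." retry, and frame handling are left out of the sketch.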
Experimental Setup: Data
Classification scheme: NAICS
11 Agriculture, Forestry, Fishing and Hunting
21 Mining
23 Construction
31-33 Manufacturing
42 Wholesale Trade
44-45 Retail Trade
48-49 Transportation and Warehousing
51 Information
52 Finance and Insurance
53 Real Estate and Rental and Leasing
54 Professional, Scientific and Technical Services
55 Management of Companies and Enterprises
56 Admin. Support, Waste Mgmt and Remediation Srvcs
61 Educational Services
62 Health Care and Social Assistance
71 Arts, Entertainment & Recreation
72 Accommodation and Food Services
81 Other services (except 92)
92 Public Administration
99 Unclassified Establishments
Experimental Setup: System Architecture
[Diagram. Component labels: The Web; Domain Names; Spider; Web pages; Text Query; IR Engine; SEC-NAICS; Matching documents; Decision; example output: Foo.com → 11, 21, 23]
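The diagram shows a "Decision" step that turns the IR engine's matching documents into a short list of NAICS codes, but the slides do not say how that decision is made. The sketch below is one plausible rule, stated as an assumption: each matching document carries a NAICS code and a relevance score, scores are summed per code, and the highest-scoring codes are returned. The function name `decide` and the `top_n` parameter are hypothetical.

```python
from collections import defaultdict


def decide(matches, top_n=3):
    """Pick NAICS codes from IR-engine matches.

    matches: iterable of (naics_code, score) pairs, one per matching document.
    The score-summing rule is an assumption, not taken from the slides.
    """
    totals = defaultdict(float)
    for code, score in matches:
        totals[code] += score
    # Highest total first; ties broken by code for a stable result.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [code for code, _ in ranked[:top_n]]
```

With `top_n=3` this returns up to three codes per site, which matches the shape of the "Foo.com → 11, 21, 23" example in the diagram.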
Results
P = Precision = # correctly assigned / # assigned
R = Recall = # correctly assigned / # total correct
F1 = 2PR / (P + R)
micro-averaged = computed over all categories
macro-averaged = computed per category, then averaged
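The definitions above translate directly into code. The following is a small sketch, assuming predictions and ground truth are given as dicts mapping each site to a set of category codes (a data layout chosen here for illustration). Macro-averaged F1 is taken as the mean of the per-category F1 values, which is one common convention; the slides do not specify which was used.

```python
from collections import Counter


def prf(correct, assigned, total):
    """Return (precision, recall, F1) from raw counts; 0.0 where undefined."""
    p = correct / assigned if assigned else 0.0
    r = correct / total if total else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def evaluate(predicted, actual):
    """Return (micro, macro), each a (P, R, F1) tuple.

    predicted, actual: dicts mapping a site to its set of category codes.
    Sites missing from `actual` are ignored.
    """
    correct, assigned, total = Counter(), Counter(), Counter()
    for site, truth in actual.items():
        guess = predicted.get(site, set())
        assigned.update(guess)          # number assigned, per category
        total.update(truth)             # number truly correct, per category
        correct.update(guess & truth)   # number correctly assigned
    categories = set(assigned) | set(total)

    # Micro: pool the counts over all categories, then compute once.
    micro = prf(sum(correct.values()), sum(assigned.values()), sum(total.values()))

    # Macro: compute per category, then average.
    per_cat = [prf(correct[c], assigned[c], total[c]) for c in categories]
    if per_cat:
        macro = tuple(sum(v[i] for v in per_cat) / len(per_cat) for i in range(3))
    else:
        macro = (0.0, 0.0, 0.0)
    return micro, macro
```

Micro-averaging weights every assignment equally, so large categories dominate; macro-averaging weights every category equally, so it is more sensitive to performance on rare categories.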
Conclusions
