Web Mining in the Cloud Ken Krugler, Bixo Labs, Inc. ACM Silicon Valley Data Mining Camp 01 November 2009 Hadoop/Cascading/Bixo in EC2
About me
Typical Data Mining
Data Mining Victory!
Meanwhile, Over at McAfee…
Web Mining 101
4 Steps in Web Mining
Web Mining versus Data Mining
How to Mine Large Scale Web Data?
One Solution - the HECB Stack
EC2 - Amazon Elastic Compute Cloud
Why Hadoop?
Why Cascading?
Why Bixo?
SEO Keyword Data Mining
Workflow
Custom Code for Example
End Result in Data Mining Tool
What Next?
Another Example - HUGMEE
Helpful Hadoopers
Scoring Algorithm
High Level Steps
High Level Steps
Workflow
Building the Flow
mod_mbox Page
Custom Operation
Validate
This Hug’s for Ted!
Produce
Public Terabyte Dataset
Summary
Any Questions?
 


Elastic Web Mining

Editor's notes

  1. For the prior four years I ran a startup called Krugle, which provided code search for open source projects and inside large companies. We did a large, 100M-page crawl of the “programmer’s web” to gather information about open source projects. Based on what I learned from that experience, I started the Bixo open source project. It’s a toolkit for building web mining workflows, and I’ll be talking more about that later. Several companies paid me to integrate Bixo into an existing data processing environment. That in turn led to Bixo Labs, which is a platform for quickly creating web mining apps. “Elastic” means the size of the system can easily be changed to match the web mining task.
  2. This is the world that many of you live in: analyzing data to find important patterns. Here’s an example of output from the QlikView business intelligence tool. It was used to help analyze the relative prevalence of keywords on two competing web sites. Here you see two-word terms that often occur on McAfee’s site but not on Symantec’s, which is very useful data for anybody who worries about search engine optimization.
  3. You all know about analyzing data to find important patterns that get managers all worked up…
  4. But how do you get to this point? How do you use the web as the source for the data that you’re analyzing? That’s what I’m going to be talking about here.
  5. Quick intro to web mining, so we’re on the same page. Most people think about the big search companies when they think about web mining. Search is clearly the biggest web mining category, and generates the most revenue. But other types of web mining have value that is high and growing.
  6. It’s common to confuse web crawling with fetching. Crawling is the process of automatically finding new pages by extracting links from fetched pages. But for many web mining applications, you have a “white list” of pre-defined URLs. In either case, though, you need to reliably, efficiently and politely fetch pages. Content comes in a variety of formats - typically HTML, but also PDF, Word, zip archives, etc. You need to parse these formats to extract key data - typically text, but it could be image data. Often the analyze step will include aspects of machine learning - classification, clustering. “Useful data” covers a lot of ground, because there are a lot of ways to use the output of web mining. Generating an index is one of the most common, because people think about search as the goal. But for data mining, the end result at this point is often highly reduced data that is input to traditional data mining tools.
  7. What are the key differences between web mining and traditional data mining? I’m saying “traditional” because the face of data mining is clearly changing, but if you look at most vendor tools, the focus is on what I’d call “traditional data mining”. Scale - 10M is big for data mining, but not for web mining. Access - with DM, once you defeated Mongor, keeper of database access keys, you were golden. Web pages are typically public, but the web is a shared resource, so implicit rules apply, like “don’t bring my web site to its knees”. Data mining breaks the traditional implicit contract, so extra caution applies. The implicit contract is that I let you crawl me, and you drive traffic to me when your search index goes live. But with DM, there often isn’t an index as the end result. With mining DBs, there’s explicit structure, which is mostly lacking from web pages.
  8. If it doesn’t scale, then it won’t handle the quantity of data you’ll ultimately want to process from the web. If you can’t create real workflows, it will never be reliable or efficient. If you don’t use specialized web crawling code, you’ll get blacklisted. Because you’re trying to distill down large data, there’s often some custom processing. If you don’t run it in a cloud environment, you’ll be wasting money - and I’ll explain why in a few slides.
  9. I’m focusing on one particular solution to the challenges of web mining that I just described. It’s the “HECB” stack. I’m going to talk about these from the bottom up, which is EC2 first, then Hadoop…but the acronym didn’t work as well.
  10. At Krugle we ran two clusters, one of 11 servers and a smaller 4-server cluster. In the end, our actual utilization ratio was probably < 20%. Even with close to 100% utilization, the break-even point for EC2 vs. colo is somewhere between 50 and 200 servers, depending on who you talk to. If utilization was 20%, then break-even would be 250 to 1000 servers. Mining for search doesn’t work so well in this model - the cluster should be always crawling (ABC), so the load isn’t as bursty. And transferring raw content, parse, and index data will generate lots of transfer charges. But for web mining that’s focused on data mining, the data is distilled, so this isn’t an issue.
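The break-even figures in this note scale inversely with utilization, which can be checked with a few lines of arithmetic. This is a minimal sketch, assuming colo cost is fixed per server while EC2 is paid only for the hours used; the function name `break_even_servers` is illustrative and not from the talk.

```python
def break_even_servers(servers_at_full_utilization: float,
                       utilization_percent: float) -> float:
    """Scale a break-even cluster size by actual utilization.

    Assumption (not stated in the talk): colo cost is fixed per server,
    EC2 cost is proportional to hours used, so the break-even cluster
    size grows as 1 / utilization.
    """
    if not 0 < utilization_percent <= 100:
        raise ValueError("utilization_percent must be in (0, 100]")
    return servers_at_full_utilization * 100 / utilization_percent


# The note's range of 50 to 200 servers at ~100% utilization,
# rescaled to the ~20% utilization observed at Krugle.
low = break_even_servers(50, 20)
high = break_even_servers(200, 20)
print(low, high)
```

Under that assumption the 50 to 200 server range becomes 250 to 1000 servers, matching the numbers in the note.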
  11. Map-reduce - how do you parallelize the processing of lots of data so that you can do the work on many servers? The answer is map-reduce. HDFS - how do you store lots of data in a fault-tolerant, cost-effective manner? How do you make sure the data (the big stuff) moves as little as possible during processing? The answer is the Hadoop Distributed File System. It’s open source, so there’s lots of support, consultants, rapid bug fixes, etc. Large companies are using it, especially Yahoo. Elastic MapReduce is a special service built on top of EC2 that makes it easier to run Hadoop jobs, because you have access to pre-configured Hadoop clusters, special tools, etc.
  12. If you ever had to write a complex workflow using Hadoop, you know the answer. It frees you from the lower-level details of thinking in map-reduce. You can think about the workflow as operations on records with fields. And in data mining, the workflow often winds up being very complex. Because you can build workflows out of a mix of pre-defined & custom pipes, it’s a real toolkit. Chris explains it as MR is assembly, and Cascading is C. Sometimes it feels more like C++ :) A key aspect of reliable workflows is Cascading’s ability to check your workflow (the DAG it builds). It finds cases where fields aren’t available for operations. That solves a key problem we ran into when customizing Nutch at Krugle.
  13. Does the world really need yet another web crawler? No, but it does need a web mining toolkit. Two companies agreed to sponsor work on Bixo as an open source project. Polite yet efficient - there’s a tension between those two goals that’s hard to resolve. If you do a crawl of any reasonable size, you’ll run into lots of errors. Even if a web server says “I swear to you, I’m sending you a 20K HTML file in English”, it’s a 50K text file in Russian using the Cyrillic character set. And because it’s open source, you get the benefit of a community of users. They contribute re-usable toolkit components.
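To make the map-reduce idea concrete, here is a single-process sketch of the three phases (map, shuffle, reduce) using word counting as the task. It is plain Python and not Hadoop's API; in a real cluster the map and reduce calls would run on many servers and the shuffle would move data between them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["web mining", "data mining"]))
```

Each map call only sees one document and each reduce call only sees one key, which is what lets the work be spread across servers.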
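The field check described in this note can be illustrated with a toy planner. This is a sketch of the idea only, not Cascading's API: the `Operation` class and `check_flow` function are hypothetical, and the flow is a simple linear chain, not a full DAG.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Operation:
    """Hypothetical pipeline step: the fields it reads and the fields it adds."""
    name: str
    requires: FrozenSet[str]
    adds: FrozenSet[str]


def check_flow(source_fields: Iterable[str],
               operations: Iterable[Operation]) -> List[str]:
    """Return one error per operation whose input fields are unavailable.

    An empty list means every operation can find the fields it needs,
    so the problem is caught before any job runs on the cluster.
    """
    available = set(source_fields)
    errors = []
    for op in operations:
        missing = op.requires - available
        if missing:
            errors.append(f"{op.name}: missing fields {sorted(missing)}")
        available |= op.adds
    return errors


flow = [
    Operation("parse", frozenset({"content"}), frozenset({"text"})),
    Operation("score", frozenset({"text", "language"}), frozenset({"score"})),
]
print(check_flow(["url", "content"], flow))
```

Here the check reports that `score` needs a `language` field that no earlier step produces, which is the kind of mistake that would otherwise only surface partway through a long-running job.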
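The tension between polite and efficient fetching can be sketched as a scheduling problem: requests to the same host are spaced out, while different hosts proceed in parallel. This is an illustration, not Bixo's implementation; `schedule_fetches` and the 30-second default delay are assumptions.

```python
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlsplit


def schedule_fetches(urls: Iterable[str],
                     delay_seconds: float = 30.0) -> List[Tuple[float, str]]:
    """Assign each URL an earliest fetch time (seconds from start).

    Requests to the same host are kept at least delay_seconds apart;
    the delay value is an assumed politeness setting, not from the talk.
    """
    next_free: Dict[str, float] = {}
    schedule = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        start = next_free.get(host, 0.0)
        schedule.append((start, url))
        next_free[host] = start + delay_seconds
    return sorted(schedule)


for when, url in schedule_fetches([
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
]):
    print(when, url)
```

With many URLs on few hosts, total run time is dominated by the waiting, so adding servers helps only when the URLs are spread across many hosts.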
  14. Whenever I show a workflow diagram like this, I make a joke about it being intuitively obvious. Which, obviously, it’s not. And in fact the full workflow is a bit bigger, as I left out the second stage that describes more of the keyword analysis. But the key point is that the blue color items are provided by Cascading. And the green color items are provided by Bixo. So what’s left are two yellow items, which represent the two points of customization.
  15. There were two main pieces of custom code that needed to be written. One was some URL filtering to focus on the right content inside the web sites, such as avoiding non-English pages by specific URL patterns. Same kind of thing for forums and such, since these pages weren’t part of what could easily be optimized. And if enough people need this type of support, since Bixo is open source it will likely become part of the toolkit.
  16. Finally we can actually use a traditional data mining tool to help make sense of the digested data. There are many things we could do in addition: clustering of results, to improve keyword analysis; larger sites have “areas of interest”; identifying broken links and typos; identifying personal data - email addresses, phone numbers.
  17. I try to limit presentations to 20 slides - so I’ve hit that limit. In the spirit of the unconference - let me know what you’d like to do next.
  18. Let’s use a real example now of using Bixo to do web mining. Imagine that the Apache Foundation decided to honor people who make significant contributions to the Hadoop community. In a typical company, determining the winner would depend on political maneuvering, bribes, and sucking up. But the Apache Foundation could decide to go for a quantitative approach for the HUGMEE award.
  19. How do you figure out the most helpful Hadoopers? As we discussed previously, it’s a classic web mining problem. Luckily the Hadoop mailing lists are all nicely archived as monthly mbox files. How do we score based on key phrases (next slide)?
  20. Parsing the mod_mbox page is simple with Tika’s HtmlParser. I cheated a bit when parsing emails - some users like Owen have many aliases, so I hand-generated an alias resolution table.
  21. Need to ignore “thanks” in the “thanks in advance for doing my job for me” signoff. Generate two tuples for each email: one with messageId/name/address, and one with reply-to messageId/score. The group/sum aspect is a classic reduce operation.
  22. I think this slide is pretty self-explanatory - two Bixo fetch cycles, 6 custom Cascading operations, 6 MR jobs. OK, actually not so clear, but… The key point is that only the purple is stuff that I actually had to create. Some lines are purple as well, since that workflow (DAG) is also something I defined - see next page. But only two custom operations were actually needed - parsing the mod_mbox page and calculating the score. Running took about 30 minutes - mostly politely waiting until it was OK to do another fetch. Downloaded 150MB of mbox files. Found 409 unique email addresses with at least one positive reply.
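URL filtering of this kind usually comes down to a short list of patterns. A minimal sketch follows; the specific patterns (language path segments, a `/forum/` prefix) are hypothetical examples, since the talk does not list the ones actually used.

```python
import re

# Hypothetical exclusion patterns; the real ones depend on the target sites.
EXCLUDE_PATTERNS = [
    re.compile(r"/(de|fr|ja|es)/"),  # assumed non-English path segments
    re.compile(r"/forums?/"),        # assumed forum sections
]


def keep_url(url: str) -> bool:
    """Return True if the URL matches none of the exclusion patterns."""
    return not any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)


urls = [
    "http://www.example.com/products/antivirus.html",
    "http://www.example.com/fr/produits/antivirus.html",
    "http://www.example.com/forum/thread/123",
]
print([u for u in urls if keep_url(u)])
```

Filtering before the fetch step keeps the crawl smaller, which matters because every skipped URL is one less polite wait.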
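The two-tuple scheme in this note can be sketched in plain Python. The scoring rule here (count “thanks”, minus any “thanks in advance”) is a stand-in assumption, since the actual scoring algorithm from the slide is not in these notes; the field names and `score_authors` are also hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

THANKS = re.compile(r"\bthanks\b", re.IGNORECASE)
THANKS_IN_ADVANCE = re.compile(r"\bthanks\s+in\s+advance\b", re.IGNORECASE)


def score_text(body: str) -> int:
    """Assumed scoring rule: count 'thanks', ignoring 'thanks in advance'."""
    return len(THANKS.findall(body)) - len(THANKS_IN_ADVANCE.findall(body))


def score_authors(emails: Iterable[dict]) -> Dict[str, int]:
    """Credit each reply's score to the author of the message it replies to."""
    authors = {}               # tuple 1: messageId -> (name, address)
    scores = defaultdict(int)  # tuple 2: reply-to messageId -> summed score
    for email in emails:
        authors[email["message_id"]] = (email["name"], email["address"])
        if email.get("in_reply_to"):
            scores[email["in_reply_to"]] += score_text(email["body"])
    # Join the two tuple streams on messageId, then group by address and sum.
    totals: Dict[str, int] = defaultdict(int)
    for message_id, score in scores.items():
        if message_id in authors:
            _, address = authors[message_id]
            totals[address] += score
    return dict(totals)


emails = [
    {"message_id": "m1", "name": "Ted", "address": "ted@example.org",
     "in_reply_to": None, "body": "Try raising the heap size."},
    {"message_id": "m2", "name": "Ann", "address": "ann@example.org",
     "in_reply_to": "m1", "body": "Thanks, that worked!"},
    {"message_id": "m3", "name": "Bob", "address": "bob@example.org",
     "in_reply_to": None, "body": "How do I do X? Thanks in advance."},
]
print(score_authors(emails))
```

The final loop is the group/sum step, which in a Hadoop workflow would run as the reduce side of a job keyed on email address.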
  23. Most of the code needed to create the workflow for this data mining app. Lots of oatmeal code - which is good. Don’t want to be writing tricky code here. Could optimize, but that would be a mistake…most web mining is programmer-constrained. So just use more servers in EC2 - cheaper & faster.
  24. Example of the top-level pages that were fetched in first phase. Then needed to be parsed to extract links to mbox files.
  25. Example of one of the two custom operations: parsing the mod_mbox page. It uses Tika to extract IDs, then emits a tuple with the URL for each mbox ID.
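The shape of this custom operation can be sketched with Python's standard `html.parser` standing in for Tika. The link format (an href beginning with a six-digit `YYYYMM` followed by `.mbox/`) and the example archive URL are assumptions about the page layout, not taken from the slides.

```python
import re
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

# Assumed link format on the index page, e.g. href="200910.mbox/thread".
MBOX_ID = re.compile(r"^(\d{6})\.mbox/")


class MboxLinkParser(HTMLParser):
    """Collect unique mbox IDs from anchor tags, in page order."""

    def __init__(self) -> None:
        super().__init__()
        self.mbox_ids: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = MBOX_ID.match(href)
        if match and match.group(1) not in self.mbox_ids:
            self.mbox_ids.append(match.group(1))


def mbox_urls(page_url: str, html: str) -> List[str]:
    """Emit one download URL per mbox ID found on the index page."""
    parser = MboxLinkParser()
    parser.feed(html)
    return [urljoin(page_url, f"{mbox_id}.mbox") for mbox_id in parser.mbox_ids]


page = """
<a href="200910.mbox/thread">Thread</a> <a href="200910.mbox/date">Date</a>
<a href="200909.mbox/thread">Thread</a> <a href="/about.html">About</a>
"""
print(mbox_urls("http://mail-archives.example.org/mod_mbox/hadoop-user/", page))
```

Each emitted URL would then feed the second fetch cycle, which downloads the monthly mbox files themselves.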
  26. Curve looks right - exponential decay. 409 unique email addresses that got some love from somebody.
  27. And the winner is…Ted Dunning. I know - I should have colored the elephant yellow.
  28. A list of the usual suspects. Coincidentally, Ted helped me derive the scoring algorithm I used…hmm.