SlideShare une entreprise Scribd logo
1  sur  29
1
– Someday Soon (Flickr)
Mining the web with Hadoop
Steve Watt Emerging Technologies @ HP
2
– timsnell (Flickr)
3
Gathering Data
Data Marketplaces
4
5
6
Gathering Data
Apache Nutch
(Web Crawler)
7
Pascal Terjan (Flickr)
8
9
10
Using Apache
Identify Optimal Seed URLs for a Seed List & Crawl to a depth of 2
For example:
http://www.crunchbase.com/companies?c=a&q=private_held
http://www.crunchbase.com/companies?c=b&q=private_held
http://www.crunchbase.com/companies?c=c&q=private_held
http://www.crunchbase.com/companies?c=d&q=private_held
. . .
Crawl data is stored in sequence files in the segments dir on the HDFS
11
ALSO
12
Company POJO then /t Out
Prelim Filtering on URL
Making the data STRUCTURED
Retrieving HTML
13
Company City State Country Sector Round Day Month Year Amount Investors
InfoChimps Austin TX USA Enterprise Angel 14 9 2010 350000 Stage One Capital
InfoChimps Austin TX USA Enterprise A 7 11 2010 1200000 DFJ Mercury
MassRelevance Austin TX USA Enterprise A 20 12 2010 2200000 Floodgate, AV,etc
Masher Calabasas CA USA Games_Video Seed 0 2 2009 175000
Masher Calabasas CA USA Games_Video Angel 11 8 2009 300000 Tech Coast Angels
The Result? Tab Delimited Structured Data…
Note: I dropped the ZipCode because it didn’t occur consistently
14
Time to Analyze/Visualize the data…
Step1: Select the right visual encoding for your
questions
Lets start by asking questions & seeing what we can
learn from some simple Bar Charts…
*Total Tech Investments By Year
*Total Tech Investments By Year
*Total Tech Investments By Year
*Investment Funding By Sector
18
Total Investments By Zip Code for all Sectors
$7.3 Billion in San Francisco
$2.9 Billion in Mountain View
$1.2 Billion in Boston
$1.7 Billion in Austin
19
Total Investments By Zip Code for all Sectors
$7.3 Billion in San Francisco
$2.9 Billion in Mountain View
$1.2 Billion in Boston
$1.7 Billion in Austin
20
Total Investments By Zip Code for Consumer Web
$1.2 Billion in Chicago
$600 Million in Seattle
$1.7 Billion in San Francisco
21
Total Investments By Zip Code for BioTech
$1.3 Billion in Cambridge
$528 Million in Dallas
$1.1 Billion in San Diego
22
HP Confidential
Geospatial Encoding of Data
Steve’s Not so Excellent Adventure
23
• Let’s try a Choropleth Encoding of the distribution of investment income by
County
• Wait, what is GeoJSON?
• OK, the GeoJSON County is mapped to some code
• Each County code has a value that corresponds to a palette color
• So what are these codes? FIPS Codes? But Google returns 3 & 5 digit
codes?!?
• I found a 5 digit code list, it has A LOT of codes in it. I’m going to assume its
correct because there is no way I can manually verify all of them
Generating Investment Income By County
24
FIPS = LOAD ‘data/fips.txt’ using PigStorage(‘t’) as (City, State, FIPSCode);
Amt = LOAD ‘data/equity.txt’ using PigStorage(‘t’) as (City, State, Amount);
AmtGroup = Group Amt BY (City, State);
SumGroup = FOREACH AmtGroup Generate group, SUM(Amt.Amount);
JoinGroup = JOIN SumGroup by (City,State), FIPS By (City,State);
Final = FOREACH JoinGroup generate FIPSCode, Amount;
RESULT: 51234 5000000
16234 1234000 (...)
ALWAYS, ALWAYS check your output…
But wait, why are there duplicate records?
25
Apparently some cities can actually belong to two counties… I guess I’ll pick
one.
Yay, no duplicates. Lets visualize this!
26
• Wait, what happened to California ?
• Aaargh, I stored the FIPS codes in PIG as INTS instead of charrays which
trimmed off the leading Zero. OK, I add them back. Voila! We have California.
On Error Checking…
27
• Crowd Sourced data has LOADS of errors in it. Actually influencing your
results. You need a good system that helps identify those errors.
• Santa Clara, Ca
• Santa, Clara
• Santa, Clara CA
• Track(Count) input and output records. Examine the results. Something fishy?
28
HP Confidential
29
Questions?
Steve Watt swatt@hp.com
@wattsteve
emergingafrican.com

Contenu connexe

Similaire à Mining the Web for Information using Hadoop

How to Reveal Hidden Relationships in Data and Risk Analytics
How to Reveal Hidden Relationships in Data and Risk AnalyticsHow to Reveal Hidden Relationships in Data and Risk Analytics
How to Reveal Hidden Relationships in Data and Risk AnalyticsOntotext
 
Nerd Out with Hadoop: A Not-So-Basic Introduction to the Platform
Nerd Out with Hadoop: A Not-So-Basic Introduction to the PlatformNerd Out with Hadoop: A Not-So-Basic Introduction to the Platform
Nerd Out with Hadoop: A Not-So-Basic Introduction to the PlatformSteve Hoffman
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaSteve Watt
 
Blockchains for AI [With New Applications]
Blockchains for AI [With New Applications]Blockchains for AI [With New Applications]
Blockchains for AI [With New Applications]Trent McConaghy
 
An indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
An indepth look at Google BigQuery Architecture by Felipe Hoffa of GoogleAn indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
An indepth look at Google BigQuery Architecture by Felipe Hoffa of GoogleData Con LA
 
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...Amazon Web Services
 
IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...
IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...
IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...J T "Tom" Johnson
 
Tapping the Data Deluge with R
Tapping the Data Deluge with RTapping the Data Deluge with R
Tapping the Data Deluge with RJeffrey Breen
 
The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataOntotext
 
2018 NYC Localogy: Using Data to Build Exceptional Local Pages
2018 NYC Localogy: Using Data to Build Exceptional Local Pages2018 NYC Localogy: Using Data to Build Exceptional Local Pages
2018 NYC Localogy: Using Data to Build Exceptional Local PagesLocalogy
 
Social media analytics using Azure Technologies
Social media analytics using Azure TechnologiesSocial media analytics using Azure Technologies
Social media analytics using Azure TechnologiesKoray Kocabas
 
20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser University20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser UniversityAllen Day, PhD
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)Jeremy Cabral
 
Search Different Understanding Apple's New Search Engine State of Search 2016
Search Different   Understanding Apple's New Search Engine State of Search 2016Search Different   Understanding Apple's New Search Engine State of Search 2016
Search Different Understanding Apple's New Search Engine State of Search 2016Andrew Shotland
 
R, HTTP, and APIs, with a preview of TopicWatchr
R, HTTP, and APIs, with a preview of TopicWatchrR, HTTP, and APIs, with a preview of TopicWatchr
R, HTTP, and APIs, with a preview of TopicWatchrPortland R User Group
 
"R, HTTP, and APIs, with a preview of TopicWatchr" (15 November 2011)
"R, HTTP, and APIs, with a preview of TopicWatchr" (15 November 2011)"R, HTTP, and APIs, with a preview of TopicWatchr" (15 November 2011)
"R, HTTP, and APIs, with a preview of TopicWatchr" (15 November 2011)Portland R User Group
 
8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...LDBC council
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 

Similaire à Mining the Web for Information using Hadoop (20)

How to Reveal Hidden Relationships in Data and Risk Analytics
How to Reveal Hidden Relationships in Data and Risk AnalyticsHow to Reveal Hidden Relationships in Data and Risk Analytics
How to Reveal Hidden Relationships in Data and Risk Analytics
 
Nerd Out with Hadoop: A Not-So-Basic Introduction to the Platform
Nerd Out with Hadoop: A Not-So-Basic Introduction to the PlatformNerd Out with Hadoop: A Not-So-Basic Introduction to the Platform
Nerd Out with Hadoop: A Not-So-Basic Introduction to the Platform
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
 
Blockchains for AI [With New Applications]
Blockchains for AI [With New Applications]Blockchains for AI [With New Applications]
Blockchains for AI [With New Applications]
 
An indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
An indepth look at Google BigQuery Architecture by Felipe Hoffa of GoogleAn indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
An indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
 
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AW...
 
IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...
IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...
IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...
 
Tapping the Data Deluge with R
Tapping the Data Deluge with RTapping the Data Deluge with R
Tapping the Data Deluge with R
 
MongoDB Sharding
MongoDB ShardingMongoDB Sharding
MongoDB Sharding
 
The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open Data
 
2018 NYC Localogy: Using Data to Build Exceptional Local Pages
2018 NYC Localogy: Using Data to Build Exceptional Local Pages2018 NYC Localogy: Using Data to Build Exceptional Local Pages
2018 NYC Localogy: Using Data to Build Exceptional Local Pages
 
Social media analytics using Azure Technologies
Social media analytics using Azure TechnologiesSocial media analytics using Azure Technologies
Social media analytics using Azure Technologies
 
20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser University20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser University
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)
 
Data Science At Zillow
Data Science At ZillowData Science At Zillow
Data Science At Zillow
 
Search Different Understanding Apple's New Search Engine State of Search 2016
Search Different   Understanding Apple's New Search Engine State of Search 2016Search Different   Understanding Apple's New Search Engine State of Search 2016
Search Different Understanding Apple's New Search Engine State of Search 2016
 
R, HTTP, and APIs, with a preview of TopicWatchr
R, HTTP, and APIs, with a preview of TopicWatchrR, HTTP, and APIs, with a preview of TopicWatchr
R, HTTP, and APIs, with a preview of TopicWatchr
 
"R, HTTP, and APIs, with a preview of TopicWatchr" (15 November 2011)
"R, HTTP, and APIs, with a preview of TopicWatchr" (15 November 2011)"R, HTTP, and APIs, with a preview of TopicWatchr" (15 November 2011)
"R, HTTP, and APIs, with a preview of TopicWatchr" (15 November 2011)
 
8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
 

Plus de Steve Watt

Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerSteve Watt
 
Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerSteve Watt
 
Hadoop for the disillusioned
Hadoop for the disillusionedHadoop for the disillusioned
Hadoop for the disillusionedSteve Watt
 
Hadoop file systems
Hadoop file systemsHadoop file systems
Hadoop file systemsSteve Watt
 
Apache con 2013-hadoop
Apache con 2013-hadoopApache con 2013-hadoop
Apache con 2013-hadoopSteve Watt
 
Apache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructureApache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructureSteve Watt
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataSteve Watt
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchSteve Watt
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopSteve Watt
 

Plus de Steve Watt (11)

Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and Docker
 
Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and Docker
 
Hadoop for the disillusioned
Hadoop for the disillusionedHadoop for the disillusioned
Hadoop for the disillusioned
 
Hadoop file systems
Hadoop file systemsHadoop file systems
Hadoop file systems
 
Apache con 2013-hadoop
Apache con 2013-hadoopApache con 2013-hadoop
Apache con 2013-hadoop
 
Apache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructureApache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructure
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big Data
 
Final deck
Final deckFinal deck
Final deck
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Extractiv
ExtractivExtractiv
Extractiv
 

Dernier

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 

Dernier (20)

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Mining the Web for Information using Hadoop

  • 1. 1 – Someday Soon (Flickr) Mining the web with Hadoop Steve Watt Emerging Technologies @ HP
  • 4. 4
  • 5. 5
  • 8. 8
  • 9. 9
  • 10. 10 Using Apache Identify Optimal Seed URLs for a Seed List & Crawl to a depth of 2 For example: http://www.crunchbase.com/companies?c=a&q=private_held http://www.crunchbase.com/companies?c=b&q=private_held http://www.crunchbase.com/companies?c=c&q=private_held http://www.crunchbase.com/companies?c=d&q=private_held . . . Crawl data is stored in sequence files in the segments dir on the HDFS
  • 12. 12 Company POJO then /t Out Prelim Filtering on URL Making the data STRUCTURED Retrieving HTML
  • 13. 13 Company City State Country Sector Round Day Month Year Amount Investors InfoChimps Austin TX USA Enterprise Angel 14 9 2010 350000 Stage One Capital InfoChimps Austin TX USA Enterprise A 7 11 2010 1200000 DFJ Mercury MassRelevance Austin TX USA Enterprise A 20 12 2010 2200000 Floodgate, AV,etc Masher Calabasas CA USA Games_Video Seed 0 2 2009 175000 Masher Calabasas CA USA Games_Video Angel 11 8 2009 300000 Tech Coast Angels The Result? Tab Delimited Structured Data… Note: I dropped the ZipCode because it didn’t occur consistently
  • 14. 14 Time to Analyze/Visualize the data… Step1: Select the right visual encoding for your questions Lets start by asking questions & seeing what we can learn from some simple Bar Charts…
  • 16. *Total Tech Investments By Year *Total Tech Investments By Year
  • 18. 18 Total Investments By Zip Code for all Sectors $7.3 Billion in San Francisco $2.9 Billion in Mountain View $1.2 Billion in Boston $1.7 Billion in Austin
  • 19. 19 Total Investments By Zip Code for all Sectors $7.3 Billion in San Francisco $2.9 Billion in Mountain View $1.2 Billion in Boston $1.7 Billion in Austin
  • 20. 20 Total Investments By Zip Code for Consumer Web $1.2 Billion in Chicago $600 Million in Seattle $1.7 Billion in San Francisco
  • 21. 21 Total Investments By Zip Code for BioTech $1.3 Billion in Cambridge $528 Million in Dallas $1.1 Billion in San Diego
  • 23. Steve’s Not so Excellent Adventure 23 • Let’s try a Choropleth Encoding of the distribution of investment income by County • Wait, what is GeoJSON? • OK, the GeoJSON County is mapped to some code • Each County code has a value that corresponds to a palette color • So what are these codes? FIPS Codes? But Google returns 3 & 5 digit codes?!? • I found a 5 digit code list, it has A LOT of codes in it. I’m going to assume its correct because there is no way I can manually verify all of them
  • 24. Generating Investment Income By County 24 FIPS = LOAD ‘data/fips.txt’ using PigStorage(‘t’) as (City, State, FIPSCode); Amt = LOAD ‘data/equity.txt’ using PigStorage(‘t’) as (City, State, Amount); AmtGroup = Group Amt BY (City, State); SumGroup = FOREACH AmtGroup Generate group, SUM(Amt.Amount); JoinGroup = JOIN SumGroup by (City,State), FIPS By (City,State); Final = FOREACH JoinGroup generate FIPSCode, Amount; RESULT: 51234 5000000 16234 1234000 (...) ALWAYS, ALWAYS check your output…
  • 25. But wait, why are there duplicate records? 25 Apparently some cities can actually belong to two counties… I guess I’ll pick one.
  • 26. Yay, no duplicates. Lets visualize this! 26 • Wait, what happened to California ? • Aaargh, I stored the FIPS codes in PIG as INTS instead of charrays which trimmed off the leading Zero. OK, I add them back. Voila! We have California.
  • 27. On Error Checking… 27 • Crowd Sourced data has LOADS of errors in it. Actually influencing your results. You need a good system that helps identify those errors. • Santa Clara, Ca • Santa, Clara • Santa, Clara CA • Track(Count) input and output records. Examine the results. Something fishy?

Notes de l'éditeur

  1. Give a Nutch example