10. 10
Using Apache
Identify Optimal Seed URLs for a Seed List & Crawl to a depth of 2
For example:
http://www.crunchbase.com/companies?c=a&q=private_held
http://www.crunchbase.com/companies?c=b&q=private_held
http://www.crunchbase.com/companies?c=c&q=private_held
http://www.crunchbase.com/companies?c=d&q=private_held
. . .
Crawl data is stored in sequence files in the segments dir on the HDFS
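The seed URLs above differ only in the `c` query parameter, so the list can be generated rather than typed. A minimal Python sketch, assuming the pattern simply continues through the rest of the alphabet (the slide elides the remaining URLs, so the full range is a guess); the name `seed_urls` is illustrative and not from the talk:

```python
import string

# URL pattern copied from the slide; only the letter after "c=" varies.
BASE = "http://www.crunchbase.com/companies?c={letter}&q=private_held"


def seed_urls(letters=string.ascii_lowercase):
    """Return one seed URL per letter, following the pattern on the slide."""
    return [BASE.format(letter=letter) for letter in letters]


if __name__ == "__main__":
    # One URL per line, which is the usual shape of a plain-text seed list.
    print("\n".join(seed_urls()))
```

Passing a shorter string such as `seed_urls("abcd")` reproduces exactly the four URLs shown above.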
12. 12
Making the data STRUCTURED
Retrieving HTML
Prelim Filtering on URL
Company POJO, then \t Out
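The "Company POJO, then tab out" step can be sketched as a small record type that knows how to serialize itself. This is a Python stand-in for the Java POJO, not the code from the talk; the field names are taken from the column headers of the resulting data, and `to_tsv` is a hypothetical helper name:

```python
from dataclasses import astuple, dataclass


@dataclass
class Company:
    # One funding round for one company; fields mirror the output columns.
    company: str
    city: str
    state: str
    country: str
    sector: str
    funding_round: str
    day: int
    month: int
    year: int
    amount: int
    investors: str = ""  # some rounds list no investors

    def to_tsv(self) -> str:
        """Join all fields with tabs, in declaration order."""
        return "\t".join(str(value) for value in astuple(self))


if __name__ == "__main__":
    record = Company("InfoChimps", "Austin", "TX", "USA", "Enterprise",
                     "Angel", 14, 9, 2010, 350000, "Stage One Capital")
    print(record.to_tsv())
```

Keeping the field order in one place means the tab-delimited output and any later schema (for example in Pig) only have to agree with this one definition.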
13. 13
The Result? Tab Delimited Structured Data…
Company        City       State  Country  Sector       Round  Day  Month  Year  Amount   Investors
InfoChimps     Austin     TX     USA      Enterprise   Angel  14   9      2010  350000   Stage One Capital
InfoChimps     Austin     TX     USA      Enterprise   A      7    11     2010  1200000  DFJ Mercury
MassRelevance  Austin     TX     USA      Enterprise   A      20   12     2010  2200000  Floodgate, AV, etc.
Masher         Calabasas  CA     USA      Games_Video  Seed   0    2      2009  175000
Masher         Calabasas  CA     USA      Games_Video  Angel  11   8      2009  300000   Tech Coast Angels
Note: I dropped the ZipCode because it didn’t occur consistently.
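Once the rows are tab-delimited, totaling by city is a short step. A minimal Python sketch, assuming the eleven columns shown above in that order and an optional trailing Investors field; `parse_row` and `total_by_city` are illustrative names, and the sample rows are the ones from this slide:

```python
from collections import defaultdict

COLUMNS = ["Company", "City", "State", "Country", "Sector", "Round",
           "Day", "Month", "Year", "Amount", "Investors"]


def parse_row(line):
    """Split one tab-delimited line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    # Rows without investors are one field short; pad so every key exists.
    fields += [""] * (len(COLUMNS) - len(fields))
    return dict(zip(COLUMNS, fields))


def total_by_city(lines):
    """Sum Amount per (City, State) pair."""
    totals = defaultdict(int)
    for line in lines:
        row = parse_row(line)
        totals[(row["City"], row["State"])] += int(row["Amount"])
    return dict(totals)


SAMPLE = [
    "InfoChimps\tAustin\tTX\tUSA\tEnterprise\tAngel\t14\t9\t2010\t350000\tStage One Capital",
    "InfoChimps\tAustin\tTX\tUSA\tEnterprise\tA\t7\t11\t2010\t1200000\tDFJ Mercury",
    "MassRelevance\tAustin\tTX\tUSA\tEnterprise\tA\t20\t12\t2010\t2200000\tFloodgate, AV, etc.",
    "Masher\tCalabasas\tCA\tUSA\tGames_Video\tSeed\t0\t2\t2009\t175000",
    "Masher\tCalabasas\tCA\tUSA\tGames_Video\tAngel\t11\t8\t2009\t300000\tTech Coast Angels",
]

if __name__ == "__main__":
    for (city, state), amount in total_by_city(SAMPLE).items():
        print(city, state, amount)
```

On these five rows the totals are 3750000 for Austin, TX and 475000 for Calabasas, CA.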
14. 14
Time to Analyze/Visualize the data…
Step 1: Select the right visual encoding for your questions
Let’s start by asking questions & seeing what we can learn from some simple Bar Charts…
18. 18
Total Investments By Zip Code for all Sectors
$7.3 Billion in San Francisco
$2.9 Billion in Mountain View
$1.2 Billion in Boston
$1.7 Billion in Austin
20. 20
Total Investments By Zip Code for Consumer Web
$1.2 Billion in Chicago
$600 Million in Seattle
$1.7 Billion in San Francisco
21. 21
Total Investments By Zip Code for BioTech
$1.3 Billion in Cambridge
$528 Million in Dallas
$1.1 Billion in San Diego
23. Steve’s Not so Excellent Adventure
• Let’s try a Choropleth Encoding of the distribution of investment income by County
• Wait, what is GeoJSON?
• OK, the GeoJSON County is mapped to some code
• Each County code has a value that corresponds to a palette color
• So what are these codes? FIPS Codes? But Google returns 3 & 5 digit codes?!?
• I found a 5 digit code list, it has A LOT of codes in it. I’m going to assume it’s correct because there is no way I can manually verify all of them
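The 3 versus 5 digit confusion probably comes from how county FIPS codes are built: a five-digit county code is a two-digit state code followed by a three-digit county code, so both forms turn up in searches. A minimal Python sketch of that split; `split_fips` is an illustrative name, and the example code 06085 (Santa Clara County, California) is not from the slides:

```python
def split_fips(code: str):
    """Split a 5-digit county FIPS code into (state, county) parts."""
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"expected a 5-digit string, got {code!r}")
    # First two digits identify the state, last three the county within it.
    return code[:2], code[2:]


if __name__ == "__main__":
    state, county = split_fips("06085")
    print(state, county)
```

Treating the code as a string from the start matters, because state codes below 10 begin with a zero.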
24. Generating Investment Income By County
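-- Load the city-to-FIPS lookup and the investment amounts (both tab-delimited)
FIPS = LOAD 'data/fips.txt' USING PigStorage('\t') AS (City, State, FIPSCode);
Amt = LOAD 'data/equity.txt' USING PigStorage('\t') AS (City, State, Amount:long);
-- Total the investment amount per (City, State)
AmtGroup = GROUP Amt BY (City, State);
SumGroup = FOREACH AmtGroup GENERATE FLATTEN(group) AS (City, State), SUM(Amt.Amount) AS Amount;
-- Attach the FIPS code to each total
JoinGroup = JOIN SumGroup BY (City, State), FIPS BY (City, State);
Final = FOREACH JoinGroup GENERATE FIPSCode, Amount;
RESULT: 51234 5000000
        16234 1234000 (...)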
ALWAYS, ALWAYS check your output…
25. But wait, why are there duplicate records?
Apparently some cities can actually belong to two counties… I guess I’ll pick
one.
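Duplicates like these appear because the join key (City, State) matches more than one row in the FIPS lookup. A minimal Python sketch, not the Pig code from the talk, that lists the ambiguous keys before joining and then keeps the first code seen for each; the function names and the sample rows are made up for illustration:

```python
from collections import defaultdict


def ambiguous_cities(fips_rows):
    """Return {(city, state): sorted codes} for keys mapped to more than one code."""
    codes = defaultdict(set)
    for city, state, fips in fips_rows:
        codes[(city, state)].add(fips)
    return {key: sorted(vals) for key, vals in codes.items() if len(vals) > 1}


def pick_one(fips_rows):
    """Keep the first FIPS code seen per (city, state); later ones are ignored."""
    chosen = {}
    for city, state, fips in fips_rows:
        chosen.setdefault((city, state), fips)
    return chosen


if __name__ == "__main__":
    # Placeholder rows, not real FIPS assignments.
    rows = [("CityA", "XX", "00001"), ("CityA", "XX", "00002"), ("CityB", "XX", "00003")]
    print(ambiguous_cities(rows))
    print(pick_one(rows))
```

Printing the ambiguous keys first makes the "pick one" decision visible instead of silent, which helps when a chosen county later looks wrong on the map.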
26. Yay, no duplicates. Let’s visualize this!
• Wait, what happened to California?
• Aaargh, I stored the FIPS codes in Pig as ints instead of chararrays, which trimmed off the leading zero. OK, I added them back. Voila! We have California.
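California disappears because its county codes start with the state code 06, and an integer cannot keep a leading zero. A minimal Python sketch of adding the zeros back, assuming every code is meant to be five digits; `restore_fips` is an illustrative name:

```python
def restore_fips(code) -> str:
    """Return the code as a 5-character string, left-padded with zeros."""
    text = str(code).strip()
    if not text.isdigit() or len(text) > 5:
        raise ValueError(f"not a county FIPS code: {code!r}")
    # zfill pads on the left only, so codes that are already 5 digits are unchanged.
    return text.zfill(5)


if __name__ == "__main__":
    print(restore_fips(6085))
    print(restore_fips("48453"))
```

Padding after the fact works, but loading the column as text in the first place avoids the problem entirely.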
27. On Error Checking…
• Crowd-sourced data has LOADS of errors in it, and they actually influence your results. You need a good system that helps identify those errors. For example, the same city shows up as:
• Santa Clara, Ca
• Santa, Clara
• Santa, Clara CA
• Track (count) input and output records. Examine the results. Something fishy?
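Both ideas in this list can be sketched in a few lines of Python. The normalizer below is only a heuristic under a stated assumption (a trailing two-letter token is treated as a state code, which will misfire on some real city names), and `normalize_location` and `run_step` are illustrative names rather than anything from the talk:

```python
import re
from collections import Counter


def normalize_location(raw: str):
    """Return (CITY, STATE) from a messy 'city, state' string."""
    # Treat commas and runs of whitespace as plain separators.
    tokens = re.sub(r"[,\s]+", " ", raw).strip().upper().split(" ")
    # Assumption: a trailing two-letter token is a state code.
    state = tokens.pop() if len(tokens) > 1 and len(tokens[-1]) == 2 else ""
    return " ".join(tokens), state


def run_step(records, transform):
    """Apply transform to each record, counting inputs, outputs and drops."""
    counts = Counter()
    out = []
    for record in records:
        counts["input"] += 1
        result = transform(record)
        if result is None:
            counts["dropped"] += 1
        else:
            counts["output"] += 1
            out.append(result)
    return out, counts


if __name__ == "__main__":
    variants = ["Santa Clara, Ca", "Santa, Clara", "Santa, Clara CA"]
    cleaned, counts = run_step(variants, normalize_location)
    print(cleaned)
    print(dict(counts))
```

Comparing the input and output counts after every step is the cheap check: if records vanish or multiply between steps, something upstream needs a closer look.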