1. dachisgroup.com
Dachis Group
Las Vegas 2012
Intermediate Pig Know How
Timothy Potter (Twitter: thelabdude)
Pigout Hackday, Austin TX
May 11, 2012
® 2011 Dachis Group.
2. dachisgroup.com
Agenda
UFO Sightings Data Set
1. Which US city has the most UFO sightings overall?
2. What is the most common UFO shape within a 100 mile radius of
your answer for #1?
Pig Mahout Example: Training 20 Newsgroups Classifier
• Loading messages using a custom loader
• Hashed Feature Vectors
• Train Logistic Regression Model
• Evaluate Model on held-out Data
3. dachisgroup.com
UFO Sightings
1. What US city has the most
UFO sightings overall?
2. What is the most common
UFO shape within a 100
mile radius of your answer
for #1?
Using Two Data Sets:
• UFO sightings data set available
from Infochimps
• US city / states with geo-codes
available from US Census
4. dachisgroup.com
Load Sightings Data
19930809 19990816 Westminster, CO triangle 1 minute A white puffy cottonball appeared and then a triangle ...
20010111 20010113 Pueblo, CO fireball 30 sec Blue fireball lights up the skies of colorado and nebraska ...
20001026 20030920 Aurora, CO triangle 10 Minutes Triangular craft (two footbal fields in size)As reported to Art Bell ...
ufo_sightings = LOAD 'ufo/ufo_awesome.tsv' AS (
  sighted_at: chararray, reported_at: chararray,
  location: chararray, shape: chararray,
  duration: chararray, description: chararray
);
Pig provides functions for doing basic text munging tasks, or use a UDF ...
ufo_sightings_split_loc = FOREACH (
FILTER ufo_sightings BY sighted_at IS NOT NULL AND location IS NOT NULL
){
split_city = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 1);
split_state = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 3);
city_lc = (split_city IS NOT NULL ? LOWER(split_city) : null);
state_lc = (split_state IS NOT NULL ? LOWER(split_state) : null);
GENERATE city_lc AS city, state_lc AS state, ...
5. dachisgroup.com
Load US Cities Data
with geo-codes
CO 0862000 02411501 Pueblo city 138930097 2034229 53.641 0.785 38.273147 -104.612378
CO 0883835 02412237 Westminster city 81715203 5954681 31.550 2.299 39.882190 -105.064426
CO 0804000 02409757 Aurora city 400759192 1806832 154.734 0.698 39.688002 -104.689740
us_cities = LOAD 'dev/data/usa_cities_and_towns.tsv' AS (
  state: chararray, geoid: chararray,
  ansicode: chararray, name: chararray,
  ....
  latitude: double, longitude: double
);
Use projection to select only the fields you want to work with: city, state, latitude, longitude
us_cities_w_geo = FOREACH us_cities {
city_name = SUBSTRING(LOWER(name), 0, LAST_INDEX_OF(name,' '));
GENERATE TRIM(city_name) as city, TRIM(LOWER(state)) AS state, latitude, longitude;
};
6. dachisgroup.com
What US city has the most
UFO sightings overall?
Things to consider ...
1. Need to select only sightings from US cities
Join sightings data with US city data
2. Need to count sightings for each city
Group results from step 1 by state/city and count
3. Need to do a TOP to get the city with the most sightings
Descending sort on count and choose the top.
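The three steps above can be sketched outside of Pig to make the logic concrete. This is a minimal plain-Java illustration, not the Pig script from the slides: the class name TopCitySketch, the method topCity, and the tiny in-memory data set are all hypothetical, and the "join" is reduced to a membership check against a set of known US "state,city" keys.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

public class TopCitySketch {

    /**
     * Returns the "state,city" key with the most sightings, counting only
     * sightings whose key appears in usCities. Each sighting is {state, city}.
     */
    static Optional<Map.Entry<String, Long>> topCity(List<String[]> sightings, Set<String> usCities) {
        return sightings.stream()
                .map(s -> s[0] + "," + s[1])          // build the (state,city) join key
                .filter(usCities::contains)           // step 1: keep only US cities (inner join)
                .collect(Collectors.groupingBy(k -> k, Collectors.counting())) // step 2: group and count
                .entrySet().stream()
                .max(Map.Entry.comparingByValue());   // step 3: pick the largest count
    }

    public static void main(String[] args) {
        // Hypothetical sample data, for illustration only.
        List<String[]> sightings = Arrays.asList(
                new String[] {"wa", "seattle"},
                new String[] {"wa", "seattle"},
                new String[] {"co", "pueblo"},
                new String[] {"on", "toronto"});
        Set<String> usCities = new HashSet<>(Arrays.asList("wa,seattle", "co,pueblo"));
        System.out.println(topCity(sightings, usCities));
    }
}
```

In Pig the same shape appears as JOIN, then GROUP with COUNT, then ORDER with LIMIT 1; the sketch only mirrors that data flow on a single machine.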
7. dachisgroup.com
What US city has the most
UFO sightings overall?
-- Inner JOIN by (state,city) to attach geo-codes to sightings
ufo_sightings_with_geo = FOREACH (
    JOIN ufo_sightings_by_city BY (state,city), us_cities_w_geo BY (state,city) USING 'replicated'
  ) GENERATE
    ufo_sightings_by_city::state AS state,
    ufo_sightings_by_city::city AS city,
    ufo_sightings_by_city::sighted_at AS sighted_at,
    ufo_sightings_by_city::sighted_year AS sighted_year,
    ufo_sightings_by_city::shape AS shape,
    us_cities_w_geo::latitude AS latitude,
    us_cities_w_geo::longitude AS longitude;
-- Group by (state,city) to get number of sightings for each
grp_by_state_city = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
    GENERATE FLATTEN($0) AS (state,city,latitude,longitude),
             COUNT($1) AS the_count;
-- Poor man's TOP City
most_freq = ORDER grp_by_state_city BY the_count DESC;
top_city_state = LIMIT most_freq 1;
DUMP top_city_state;
8. dachisgroup.com
What US city has the most
UFO sightings overall?
(seattle,wa,446,light,47.620499,-122.350876)
Seattle only averages 58 sunny days a year.
Coincidence?
Maybe all the UFOs are coming to look at the
Space Needle?
9. dachisgroup.com
Pig Explain: Pull back the
covers ...
pig -x local -e 'explain -script ufo.pig'
-- Job 1 - Mapper
ufo_sightings_with_geo = FOREACH (
    JOIN ufo_sightings_by_city BY (state,city), us_cities_w_geo BY (state,city) USING 'replicated'
  ) GENERATE
    ufo_sightings_by_city::state AS state,
    ufo_sightings_by_city::city AS city,
    ufo_sightings_by_city::sighted_at AS sighted_at,
    ufo_sightings_by_city::sighted_year AS sighted_year,
    ufo_sightings_by_city::shape AS shape,
    us_cities_w_geo::latitude AS latitude,
    us_cities_w_geo::longitude AS longitude;
-- Job 1 - Reducer
grp_by_state_city = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
    GENERATE FLATTEN($0) AS (state,city,latitude,longitude),
             COUNT($1) AS the_count;
-- Job 2 - Full Map/Reduce
most_freq = ORDER grp_by_state_city BY the_count DESC;
top_city_state = LIMIT most_freq 1;
DUMP top_city_state;
10. dachisgroup.com
What is the most common
UFO shape within a 100 mile
radius of your answer for #1?
Things we need to solve this ...
1) Some way to calculate geographical
distance from a geographical location
(lat / lng)
2) Iterate over all cities that have
sightings to get the distance from our
centroid
3) Filter by distance and count shapes
11. dachisgroup.com
UDF: User Defined Function
REGISTER some_path/my-ufo-app-1.0-SNAPSHOT.jar;
DEFINE CalcGeoDistance com.dachisgroup.ufo.GeoDistance();
...
with_distance = FOREACH calc_dist {
GENERATE city, state,
CalcGeoDistance(from_lat, from_lng, to_lat, to_lng) AS dist_in_miles;
};
Let’s build a UDF that uses the Haversine formula to calculate the
distance between two points
See: http://en.wikipedia.org/wiki/Haversine_formula
12. dachisgroup.com
UDF: User Defined Function
package com.dachisgroup.ufo;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class GeoDistance extends EvalFunc<Double> {
  @Override
  public Double exec(Tuple input) throws IOException {
    // Expect four non-null doubles: from_lat, from_lng, to_lat, to_lng
    if (input == null || input.size() < 4 || input.isNull(0) ||
        input.isNull(1) || input.isNull(2) || input.isNull(3)) {
      return null;
    }
    Double dist = null;
    try {
      Double fromLat = (Double)input.get(0);
      Double fromLng = (Double)input.get(1);
      Double toLat = (Double)input.get(2);
      Double toLng = (Double)input.get(3);
      // Note the argument order: both latitudes first, then both longitudes
      dist = haversineDistanceInMiles(fromLat, toLat, fromLng, toLng);
    } catch (Exception exc) {
      // better to return null than to throw exception
    }
    return dist;
  }
  protected double haversineDistanceInMiles(double lat1, double lat2, double lon1, double lon2) {
    // details excluded for brevity – see http://www.movable-type.co.uk/scripts/latlong.html
    return dist;
  }
}
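Since the slide leaves the distance calculation out, here is a minimal standalone sketch of the Haversine formula. It is not the author's implementation: the class name HaversineSketch and the Earth radius constant (a mean radius of about 3958.8 miles) are assumptions, and only the parameter order (lat1, lat2, lon1, lon2) is taken from the slide.

```java
public class HaversineSketch {

    // Assumed mean Earth radius in miles; the slide does not state which value was used.
    static final double EARTH_RADIUS_MILES = 3958.8;

    /** Great-circle distance in miles between (lat1, lon1) and (lat2, lon2), all in degrees. */
    static double haversineDistanceInMiles(double lat1, double lat2, double lon1, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        // a = sin^2(dLat/2) + cos(lat1) * cos(lat2) * sin^2(dLon/2)
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // c is the central angle between the two points, in radians
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        return EARTH_RADIUS_MILES * c;
    }

    public static void main(String[] args) {
        // Seattle (from the result slide) to Westminster, CO (from the sample census rows)
        System.out.println(haversineDistanceInMiles(47.620499, 39.882190, -122.350876, -105.064426));
    }
}
```

The sketch treats the Earth as a sphere, which is usually accurate enough for a 100 mile radius filter.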
13. dachisgroup.com
What is the most common
UFO shape ...
top_city = FOREACH top_city_state GENERATE city, state, latitude as from_lat, longitude as from_lng;
-- Including lat / lng in group by key to help reduce number of records I'm crossing
sighting_cities = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
    GENERATE FLATTEN($0) AS (state,city,latitude,longitude);
-- Pig only supports equi-joins so we need to use CROSS to get the lat / lng
-- of the two points to calculate distance using our UDF
calc_dist = FOREACH (CROSS sighting_cities, top_city)
    GENERATE
      sighting_cities::city AS city,
      sighting_cities::state AS state,
      sighting_cities::latitude AS to_lat,
      sighting_cities::longitude AS to_lng,
      CalcGeoDistance(top_city::from_lat, top_city::from_lng,
                      sighting_cities::latitude, sighting_cities::longitude) AS dist_in_miles;
near = FILTER calc_dist BY dist_in_miles < 100;
-- When joining, list largest relation first and smallest last and optimize
-- if possible such as using 'replicated'
shapes = FOREACH (JOIN ufo_sightings_with_geo BY (state,city), near BY (state,city) USING 'replicated')
    generate ufo_sightings_with_geo::shape as shape;
count_shapes = FOREACH (GROUP shapes BY shape)
    GENERATE $0 AS shape, COUNT($1) AS the_count;
sorted_counts = ORDER count_shapes BY the_count DESC;
15. dachisgroup.com
Set Logic in Pig
Use Pig’s IsEmpty function to isolate records that only occur in one of the
relations ... such as sightings in cities not in the US census list:
city_sightings = COGROUP ufo_sightings_by_city BY (state,city) OUTER,
us_cities_w_geo BY (state,city);
outside_us_sightings =
FOREACH (FILTER city_sightings BY IsEmpty(us_cities_w_geo)) GENERATE
FLATTEN(ufo_sightings_by_city);
16. dachisgroup.com
Mahout and Pig
Example Integration: Pig-Vector
GitHub project by Ted Dunning, Mahout
Committer
https://github.com/tdunning/pig-vector
Use Case:
Train Logistic Regression Model from
Pig
Hello World of ML – 20 Newsgroups
17. dachisgroup.com
Mahout and Pig
Step 1: Load the Training Data
Load 20-newsgroups messages using custom Pig LoadFunc:
docs = LOAD '20news-bydate-train/*/*' USING
org.apache.mahout.pig.MessageLoader()
AS (newsgroup, id:int, subject, body);
In Java:
public class MessageLoader extends LoadFunc {
public void setLocation(String location, Job job) throws IOException {
// setup where we're reading data from
}
public InputFormat getInputFormat() throws IOException {
return new TextInputFormat() {
// ...
};
}
public Tuple getNext() throws IOException {
// parse message and build Tuple that matches the schema
}
}
18. dachisgroup.com
Mahout and Pig
Step 2: Vectorize using Pig-Vector UDF
-- Import UDF, define vectorizing strategy and fixed size of feature vector
DEFINE encodeVector
  org.apache.mahout.pig.encoders.EncodeVector('100000', 'subject+body',
    'group:word, article:numeric, subject:text, body:text');
vectors = FOREACH docs GENERATE newsgroup, encodeVector(*) as v;
The result is a hashed feature vector: features
are mapped to indexes in a fixed-size sparse vector
(from Mahout)
Fixed-size vectors are needed to train
Mahout’s SGD-based logistic regression model
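To make the idea of a hashed feature vector concrete, here is a minimal plain-Java sketch of feature hashing. It is not how Pig-Vector or Mahout's encoders are implemented: the class name HashedVectorSketch, the use of String.hashCode, and the simple word tokenizer are all assumptions chosen for brevity.

```java
import java.util.Map;
import java.util.TreeMap;

public class HashedVectorSketch {

    /**
     * Maps each word in text to an index in [0, numFeatures) by hashing it,
     * and accumulates a count at that index. The map stands in for a sparse
     * vector: only non-zero positions are stored.
     */
    static Map<Integer, Double> encode(String text, int numFeatures) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            // floorMod keeps the index non-negative even for negative hash codes
            int index = Math.floorMod(token.hashCode(), numFeatures);
            vector.merge(index, 1.0, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        // The vector size matches the '100000' used in the slide's DEFINE.
        System.out.println(encode("UFO sighting: a triangle, then another triangle", 100000));
    }
}
```

Because the vector size is fixed up front, no dictionary of words has to be built before training; the cost is that two different words can occasionally collide on the same index.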
19. dachisgroup.com
Mahout and Pig
Step 3: Train the Model
DEFINE train
  org.apache.mahout.pig.LogisticRegression('iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');
/* put the training data in a single bag. We could train multiple models this way */
grouped = group vectors all;
/* train the actual model. The key is bogus to satisfy the sequence vector format. */
model = foreach grouped generate 1 as key, train(vectors) as model;
store model into 'pv-tmp/news_model' using PigModelStorage();
20. dachisgroup.com
Mahout and Pig
Step 4: Evaluate the Model
DEFINE evaluate
  org.apache.mahout.pig.LogisticRegressionEval('sequence=pv-tmp/news_model/part-r-00000, key=1');
test = load '20news-bydate-test/*/*' using
org.apache.mahout.pig.MessageLoader()
as (newsgroup, id:int, subject, body);
testvecs = foreach test generate newsgroup, encodeVector(*) as v;
evalvecs = foreach testvecs generate evaluate(v);
21. dachisgroup.com
Questions?
For Slides and Pig script email me at: tim.potter@dachisgroup.com
Twitter: thelabdude
Editor's notes
UFO Sightings from Infochimps: http://www.infochimps.com/datasets/60000-documented-ufo-sightings-with-text-descriptions-and-metada
US Cities / States with Geo-codes from census.gov: http://www.census.gov/geo/www/gazetteer/files/Gaz_places_national.txt
Started out as a new hire programming challenge
ufo_sightings = LOAD '/Users/timpotter/dev/data/ufo_awesome.tsv' AS (
  sighted_at: chararray, reported_at: chararray,
  location: chararray, shape: chararray,
  duration: chararray, description: chararray
);
ufo_sightings_split_loc = FOREACH (
  FILTER ufo_sightings BY sighted_at IS NOT NULL AND location IS NOT NULL
) {
  split_city = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 1);
  split_state = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 3);
  city_lc = (split_city IS NOT NULL ? LOWER(split_city) : null);
  state_lc = (split_state IS NOT NULL ? LOWER(split_state) : null);
  GENERATE city_lc AS city, state_lc AS state, sighted_at, SUBSTRING(sighted_at,0,4) AS sighted_year,
           reported_at, TRIM(shape) AS shape, duration, description;
};
ufo_sightings_by_city = FILTER ufo_sightings_split_loc BY city IS NOT NULL AND state IS NOT NULL;
Need to join our sightings data with US cities data to A) filter out non-US cities and B) attach the lat / lng to the sighting. Realize that after the JOIN, your data won’t be sorted if you use 'replicated'; it will be sorted if you don’t. To quote the Pig book: Pig’s optimizer is between your chair and your keyboard.
Seattle Image from http://mylocalhealthguide.com/north-seattle-group-targeting-underage-drinking-meets-dec-15/
When using CROSS, minimize the size of the relations you’re crossing; thus, I’m grouping by state + city + lat + lng and just flattening the group-by key. When joining, list the largest relation on the left and smallest on the right.