dachisgroup.com




Dachis Group
Las Vegas 2012




  Intermediate Pig Know How


     Timothy Potter (Twitter: thelabdude)
     Pigout Hackday, Austin TX
     May 11, 2012
® 2011 Dachis Group.




Agenda

  UFO Sightings Data Set
                       1.     Which US city has the most UFO sightings overall?
                       2.     What is the most common UFO shape within a 100 mile radius of
                              your answer for #1?

  Pig Mahout Example: Training 20 Newsgroups Classifier
                       •    Loading messages using a custom loader
                       •    Hashed Feature Vectors
                       •    Train Logistic Regression Model
                       •    Evaluate Model on held-out Data








UFO Sightings

1. What US city has the most
   UFO sightings overall?

2. What is the most common
   UFO shape within a 100
   mile radius of your answer
   for #1?

Using Two Data Sets:
• UFO sightings data set available
  from Infochimps
• US city / states with geo-codes
  available from US Census




  Load Sightings Data
19930809       19990816    Westminster, CO         triangle     1 minute    A white puffy cottonball appeared and then a triangle ...
20010111       20010113    Pueblo, CO   fireball      30 sec Blue fireball lights up the skies of colorado and nebraska ...
20001026       20030920    Aurora, CO   triangle       10 Minutes     Triangular craft (two footbal fields in size)As reported to Art Bell ...


Pig provides functions for doing basic text munging tasks, or use a UDF ...

ufo_sightings = LOAD 'ufo/ufo_awesome.tsv' AS (
               sighted_at: chararray,        reported_at: chararray,
               location:   chararray,        shape:       chararray,
               duration:   chararray,        description: chararray
             );

ufo_sightings_split_loc = FOREACH (
  FILTER ufo_sightings BY sighted_at IS NOT NULL AND location IS NOT NULL
 ){
  split_city = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 1);
  split_state = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 3);
  city_lc = (split_city IS NOT NULL ? LOWER(split_city) : null);
  state_lc = (split_state IS NOT NULL ? LOWER(split_state) : null);
  GENERATE city_lc AS city, state_lc AS state, ...
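To see what the two REGEX_EXTRACT calls pull out of the location field, here is a minimal Java sketch of the same pattern. The class name LocationParser and its parse method are hypothetical and not part of the talk, and the sketch assumes the first match in the string wins.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the slides) mirroring the two REGEX_EXTRACT calls.
class LocationParser {
    // Same pattern as the Pig script: group 1 = city, group 3 = two-letter state code.
    private static final Pattern CITY_STATE =
        Pattern.compile("([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})");

    /** Returns {city, state} in lower case, or null when the location does not match. */
    static String[] parse(String location) {
        if (location == null) {
            return null;
        }
        Matcher m = CITY_STATE.matcher(location.trim());
        if (!m.find()) { // assumption: take the first match anywhere in the string
            return null;
        }
        return new String[] { m.group(1).toLowerCase(), m.group(3).toLowerCase() };
    }

    public static void main(String[] args) {
        String[] parsed = parse("Westminster, CO");
        System.out.println(parsed[0] + " / " + parsed[1]);
    }
}
```

Locations that do not look like "City, ST" come back as null, which lines up with the null checks in the Pig script.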





 Load US Cities Data
 with geo-codes
CO      0862000 02411501            Pueblo city    138930097   2034229 53.641 0.785 38.273147     -104.612378
CO      0883835 02412237             Westminster city    81715203      5954681 31.550 2.299 39.882190   -105.064426
CO      0804000 02409757             Aurora city   400759192   1806832 154.734 0.698 39.688002    -104.689740


Use projection to select only the fields you want to work with:
city, state, latitude, longitude

us_cities = LOAD 'dev/data/usa_cities_and_towns.tsv' AS (
                     state:      chararray,   geoid:     chararray,
                     ansicode:   chararray,   name:      chararray,
                     ....
                     latitude:   double,      longitude: double
                 );


us_cities_w_geo = FOREACH us_cities {
                  city_name = SUBSTRING(LOWER(name), 0, LAST_INDEX_OF(name,' '));
                  GENERATE TRIM(city_name) as city, TRIM(LOWER(state)) AS state, latitude, longitude;
                };







What US city has the most
UFO sightings overall?
   Things to consider ...

   1. Need to select only sightings from US cities

         Join sightings data with US city data
   2. Need to count sightings for each city

         Group results from step 1 by state/city and count
   3. Need to do a TOP to get the city with the most sightings

         Descending sort on count and choose the top.






What US city has the most
UFO sightings overall?
  Inner JOIN by (state,city) to attach geo-codes to sightings:

  ufo_sightings_with_geo = FOREACH (
                        JOIN ufo_sightings_by_city BY (state,city), us_cities_w_geo BY (state,city) USING 'replicated'
                       ) GENERATE
                        ufo_sightings_by_city::state          AS state,
                        ufo_sightings_by_city::city           AS city,
                        ufo_sightings_by_city::sighted_at     AS sighted_at,
                        ufo_sightings_by_city::sighted_year   AS sighted_year,
                        ufo_sightings_by_city::shape          AS shape,
                        us_cities_w_geo::latitude             AS latitude,
                        us_cities_w_geo::longitude            AS longitude;

  Group by (state,city) to get number of sightings for each city:

  grp_by_state_city = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
              GENERATE FLATTEN($0) AS (state,city,latitude,longitude),
               COUNT($1) AS the_count;

  Poor man’s TOP:

  most_freq = ORDER grp_by_state_city BY the_count DESC;
  top_city_state = LIMIT most_freq 1;

  DUMP top_city_state;





What US city has the most
UFO sightings overall?
  (seattle,wa,446,light,47.620499,-122.350876)


  Seattle only averages 58 sunny days a year.
  Coincidence?


  Maybe all the UFOs are coming to look at the
  Space Needle?








Pig Explain: Pull back the
covers ...
  pig -x local -e 'explain -script ufo.pig'

  -- Job 1 - Mapper
  ufo_sightings_with_geo = FOREACH (
                       JOIN ufo_sightings_by_city BY (state,city), us_cities_w_geo BY (state,city) USING 'replicated'
                  ) GENERATE
                       ufo_sightings_by_city::state          AS state,
                       ufo_sightings_by_city::city           AS city,
                       ufo_sightings_by_city::sighted_at     AS sighted_at,
                       ufo_sightings_by_city::sighted_year   AS sighted_year,
                       ufo_sightings_by_city::shape          AS shape,
                       us_cities_w_geo::latitude             AS latitude,
                       us_cities_w_geo::longitude            AS longitude;

  grp_by_state_city = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
  -- Job 1 - Reducer
                GENERATE FLATTEN($0) AS (state,city,latitude,longitude),
                 COUNT($1) AS the_count;

  -- Job 2 - Full Map/Reduce
  most_freq = ORDER grp_by_state_city BY the_count DESC;
  top_city_state = LIMIT most_freq 1;
  DUMP top_city_state;




What is the most common
UFO shape within a 100 mile
radius of your answer for #1?

  Things we need to solve this ...

  1) Some way to calculate geographical
    distance from a geographical location
    (lat / lng)

  2) Iterate over all cities that have
    sightings to get the distance from our
    centroid

  3) Filter by distance and count shapes




UDF: User Defined Function

  REGISTER some_path/my-ufo-app-1.0-SNAPSHOT.jar;
  DEFINE CalcGeoDistance com.dachisgroup.ufo.GeoDistance();
  ...
  with_distance = FOREACH calc_dist {
                   GENERATE city, state,
                       CalcGeoDistance(from_lat, from_lng, to_lat, to_lng) AS dist_in_miles;
                 };



  Let’s build a UDF that uses the Haversine formula to calculate the
  distance between two points


  See: http://en.wikipedia.org/wiki/Haversine_formula








UDF: User Defined Function
  import java.io.IOException;
  import org.apache.pig.EvalFunc;
  import org.apache.pig.data.Tuple;

  public class GeoDistance extends EvalFunc<Double> {
        // Expects four doubles: from_lat, from_lng, to_lat, to_lng
        public Double exec(Tuple input) throws IOException {
              if (input == null || input.size() < 4 || input.isNull(0) ||
                       input.isNull(1) || input.isNull(2) || input.isNull(3)) {
                       return null;
              }
              Double dist = null;
              try {
                       Double fromLat = (Double)input.get(0);
                       Double fromLng = (Double)input.get(1);
                       Double toLat = (Double)input.get(2);
                       Double toLng = (Double)input.get(3);
                       dist = haversineDistanceInMiles(fromLat, toLat, fromLng, toLng);
              } catch (Exception exc) {
                       // better to return null than to throw exception
              }
              return dist;
        }
        protected double haversineDistanceInMiles(double lat1, double lat2, double lon1, double lon2) {
              // details excluded for brevity - see http://www.movable-type.co.uk/scripts/latlong.html
              return dist;
        }
  }
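The slide leaves out the body of haversineDistanceInMiles. A minimal stand-alone sketch of the Haversine calculation could look like the following; it is not the author's code. The class name Haversine and the Earth radius of 3958.8 miles are assumptions, and the argument order (lat1, lat2, lon1, lon2) simply follows the signature on the slide.

```java
// Hypothetical stand-alone version of the elided helper; not the author's implementation.
class Haversine {
    // Assumed mean Earth radius in miles.
    static final double EARTH_RADIUS_MILES = 3958.8;

    // Argument order (lat1, lat2, lon1, lon2) follows the UDF signature on the slide.
    static double distanceInMiles(double lat1, double lat2, double lon1, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        // Haversine of the central angle between the two points.
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // Central angle in radians, then scale by the radius.
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        return EARTH_RADIUS_MILES * c;
    }

    public static void main(String[] args) {
        // Seattle coordinates from the result slide; second point is one degree further north.
        System.out.println(distanceInMiles(47.620499, 48.620499, -122.350876, -122.350876));
    }
}
```

Inside the UDF, the same arithmetic would sit in the protected helper, with exec keeping its null handling.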





What is the most common
UFO shape ...
  top_city = FOREACH top_city_state GENERATE city, state, latitude as from_lat, longitude as from_lng;

  Including lat / lng in the group-by key to help reduce the number of records I’m crossing:

  sighting_cities = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
    GENERATE FLATTEN($0) AS (state,city,latitude,longitude);

  Pig only supports equi-joins, so we need to use CROSS to get the lat / lng of the two
  points to calculate distance using our UDF:

  calc_dist = FOREACH (CROSS sighting_cities, top_city)
            GENERATE
            sighting_cities::city AS city,
            sighting_cities::state AS state,
            sighting_cities::latitude AS to_lat,
            sighting_cities::longitude AS to_lng,
            CalcGeoDistance(top_city::from_lat, top_city::from_lng,
                sighting_cities::latitude, sighting_cities::longitude) AS dist_in_miles;

  near = FILTER calc_dist BY dist_in_miles < 100;

  When joining, list the largest relation first and the smallest last, and optimize
  if possible, such as using 'replicated':

  shapes = FOREACH (JOIN ufo_sightings_with_geo BY (state,city), near BY (state,city) USING 'replicated')
        generate ufo_sightings_with_geo::shape as shape;


  count_shapes = FOREACH (GROUP shapes BY shape)
        GENERATE $0 AS shape, COUNT($1) AS the_count;
  sorted_counts = ORDER count_shapes BY the_count DESC;




Visualize Results

                       In Pig:
                       fs -getmerge sorted_counts sorted_counts.txt


                       In R:
                       shapes <- read.table("sorted_counts.txt",
                               header=F, sep="\t", col.names=c("shape","occurs"),
                               stringsAsFactors=F)
                       barplot(c(shapes$occurs),
                               main="UFO Sightings (Shapes)",
                               ylab="Number of Sightings",
                               ylim=c(0,500),
                               cex.names=0.8,
                               las=2,
                               names.arg=c(shapes$shape))








Set Logic in Pig

  Use Pig’s IsEmpty function to isolate records that only occur in one of the
  relations ... such as sightings in cities not in the US census list:


  city_sightings = COGROUP ufo_sightings_by_city BY (state,city) OUTER,
                                   us_cities_w_geo BY (state,city);


  outside_us_sightings =
    FOREACH (FILTER city_sightings BY IsEmpty(us_cities_w_geo)) GENERATE
                FLATTEN(ufo_sightings_by_city);
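The COGROUP plus IsEmpty pattern above amounts to an anti-join: keep the records from one relation whose key has no match in the other. A rough in-memory Java sketch of that idea follows. The class AntiJoinSketch and the "state,city" string keys are illustrative assumptions, not part of the Pig script.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of what COGROUP + IsEmpty computes; not part of the talk.
class AntiJoinSketch {
    /** Returns the sighting keys ("state,city") that have no match among the census keys. */
    static List<String> onlyInSightings(List<String> sightingKeys, List<String> censusKeys) {
        Set<String> census = new HashSet<>(censusKeys);
        List<String> unmatched = new ArrayList<>();
        for (String key : sightingKeys) {
            // Plays the role of FILTER ... BY IsEmpty(us_cities_w_geo).
            if (!census.contains(key)) {
                unmatched.add(key);
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        List<String> sightings = Arrays.asList("wa,seattle", "on,toronto", "co,pueblo");
        List<String> census = Arrays.asList("wa,seattle", "co,pueblo");
        System.out.println(onlyInSightings(sightings, census));
    }
}
```

Duplicate sighting keys are kept, much as FLATTEN emits every sighting record from the bag.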








Mahout and Pig

  Example Integration: Pig-Vector


  GitHub project by Ted Dunning, Mahout
  Committer
  https://github.com/tdunning/pig-vector


  Use Case:
  Train Logistic Regression Model from
  Pig


  Hello World of ML – 20 Newsgroups








Mahout and Pig
Step 1: Load the Training Data

  Load 20-newsgroups messages using custom Pig LoadFunc:


  docs = LOAD '20news-bydate-train/*/*' USING
                org.apache.mahout.pig.MessageLoader()
                     AS (newsgroup, id:int, subject, body);
  In Java:
  public class MessageLoader extends LoadFunc {
      public void setLocation(String location, Job job) throws IOException {
          // setup where we're reading data from
      }
      public InputFormat getInputFormat() throws IOException {
           return new TextInputFormat() {
            // ...
           };
      }
      public Tuple getNext() throws IOException {
          // parse message and build Tuple that matches the schema
      }
  }




Mahout and Pig
Step 2: Vectorize using Pig-Vector UDF

  -- Import UDF, define vectorizing strategy and fixed size of feature vector
  DEFINE encodeVector org.apache.mahout.pig.encoders.EncodeVector('100000', 'subject+body',
      'group:word, article:numeric, subject:text, body:text');


  vectors = FOREACH docs GENERATE newsgroup, encodeVector(*) as v;

  Result is a hashed feature vector where features
  are mapped to indexes in a fixed size sparse vector
  (from Mahout)


  Fixed-size vectors are needed to train
  Mahout’s SGD-based logistic regression model
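The idea of a hashed feature vector can be sketched in a few lines of Java: each token is hashed to an index in a vector of fixed size, so the vector length stays the same no matter how large the vocabulary grows. This is not Mahout's or Pig-Vector's encoder. The class HashedVectorSketch is hypothetical, it uses a dense array instead of a sparse vector, and it relies on String.hashCode purely for illustration.

```java
// Hypothetical sketch of the hashing trick; not the Pig-Vector or Mahout implementation.
class HashedVectorSketch {
    /** Hashes each whitespace-separated token to an index in a fixed-size vector and counts it. */
    static double[] encode(String text, int size) {
        double[] vector = new double[size]; // dense for brevity; a real encoder would be sparse
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // skip the empty token produced by leading whitespace
            }
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(token.hashCode(), size);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        double[] v = encode("pig pig mahout", 16);
        System.out.println(java.util.Arrays.toString(v));
    }
}
```

Different tokens can collide on the same index; that is the price of the fixed size, and a larger vector (such as the 100000 used above) makes collisions rarer.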








Mahout and Pig
Step 3: Train the Model

  DEFINE train org.apache.mahout.pig.LogisticRegression('iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');

  /* put the training data in a single bag. We could train multiple models this way */
  grouped = group vectors all;

  /* train the actual model. The key is bogus to satisfy the sequence vector format. */
  model = foreach grouped generate 1 as key, train(vectors) as model;


  store model into 'pv-tmp/news_model' using PigModelStorage();
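For readers who have not seen SGD-based logistic regression before, the sketch below shows the core update on a toy binary problem. It is not Mahout's implementation, which handles many categories and adds further refinements. The class SgdLogisticSketch, the learning rate, and the toy data are all assumptions made for illustration.

```java
// Hypothetical binary logistic regression trained with plain SGD; not Mahout's code.
class SgdLogisticSketch {
    final double[] weights;
    double bias = 0.0;
    final double learningRate;

    SgdLogisticSketch(int features, double learningRate) {
        this.weights = new double[features];
        this.learningRate = learningRate;
    }

    /** Probability that x belongs to class 1. */
    double predict(double[] x) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** One SGD step on a single example with label 0 or 1. */
    void train(double[] x, int label) {
        double error = predict(x) - label; // gradient of the log loss with respect to z
        for (int i = 0; i < weights.length; i++) {
            weights[i] -= learningRate * error * x[i];
        }
        bias -= learningRate * error;
    }

    public static void main(String[] args) {
        double[][] xs = { { -2 }, { -1 }, { 1 }, { 2 } };
        int[] ys = { 0, 0, 1, 1 };
        SgdLogisticSketch model = new SgdLogisticSketch(1, 0.5);
        for (int pass = 0; pass < 5; pass++) { // five passes, like iterations=5 above
            for (int i = 0; i < xs.length; i++) {
                model.train(xs[i], ys[i]);
            }
        }
        System.out.println(model.predict(new double[] { 1 }));
    }
}
```

Because each step touches one example at a time and the feature vector has a fixed length, the same loop works whether the examples come from an in-memory bag or a stream.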




Mahout and Pig
Step 4: Evaluate the Model

  DEFINE evaluate org.apache.mahout.pig.LogisticRegressionEval('sequence=pv-tmp/news_model/part-r-00000, key=1');


  test = load '20news-bydate-test/*/*' using
  org.apache.mahout.pig.MessageLoader()
        as (newsgroup, id:int, subject, body);
  testvecs = foreach test generate newsgroup, encodeVector(*) as v;
  evalvecs = foreach testvecs generate evaluate(v);








Questions?

  For Slides and Pig script email me at: tim.potter@dachisgroup.com

  Twitter: thelabdude




Dachis Group Pig Hackday: Pig 202

  • 1. dachisgroup.com Dachis Group Las Vegas 2012 Intermediate Pig Know How Timothy Potter (Twitter: thelabdude) Pigout Hackday, Austin TX May 11, 2012 ® 2011 Dachis Group.
  • 2. dachisgroup.com Agenda UFO Sightings Data Set 1. Which US city has the most UFO sightings overall? 2. What is the most common UFO shape within a 100 mile radius of your answer for #1? Pig Mahout Example: Training 20 Newsgroups Classifier • Loading messages using a custom loader • Hashed Feature Vectors • Train Logistic Regression Model • Evaluate Model on held-out Data ® 2011 Dachis Group.
  • 3. dachisgroup.com UFO Sightings 1. What US city has the most UFO sightings overall? 2. What is the most common UFO shape within a 100 mile radius of your answer for #1? Using Two Data Sets: • UFO sightings data set available from Infochimps • US city / states with geo-codes available from US Census ® 2011 Dachis Group.
  • 4. dachisgroup.com Load Sightings Data 19930809 19990816 Westminster, CO triangle 1 minute A white puffy cottonball appeared and then a triangle ... 20010111 20010113 Pueblo, CO fireball 30 sec Blue fireball lights up the skies of colorado and nebraska ... 20001026 20030920 Aurora, CO triangle 10 Minutes Triangular craft (two footbal fields in size)As reported to Art Bell ... ufo_sightings = LOAD ’ufo/ufo_awesome.tsv' AS ( sighted_at: chararray, reported_at: chararray, location: chararray, shape: chararray, Pig provides functions duration: chararray, description: chararray for doing basic text ); munging tasks or use a UDF ... ufo_sightings_split_loc = FOREACH ( FILTER ufo_sightings BY sighted_at IS NOT NULL AND location IS NOT NULL ){ split_city = REGEX_EXTRACT(TRIM(location), '([A-Z][ws-.]*)(, )([A-Z]{2})', 1); split_state = REGEX_EXTRACT(TRIM(location), '([A-Z][ws-.]*)(, )([A-Z]{2})', 3); city_lc = (split_city IS NOT NULL ? LOWER(split_city) : null); state_lc = (split_state IS NOT NULL ? LOWER(split_state) : null); GENERATE city_lc AS city, state_lc AS state, ... ® 2011 Dachis Group.
  • 5. dachisgroup.com Load US Cities Data with geo-codes CO 0862000 02411501 Pueblo city 138930097 2034229 53.641 0.785 38.273147 -104.612378 CO 0883835 02412237 Westminster city 81715203 5954681 31.550 2.299 39.882190 -105.064426 CO 0804000 02409757 Aurora city 400759192 1806832 154.734 0.698 39.688002 -104.689740 us_cities = LOAD ’dev/data/usa_cities_and_towns.tsv' AS ( state: chararray, geoid: chararray, Use projection to ansicode: chararray, name: chararray, select only the fields .... you want to work with: latitude: double, longitude: double city, state, latitude, longitude ); us_cities_w_geo = FOREACH us_cities { city_name = SUBSTRING(LOWER(name), 0, LAST_INDEX_OF(name,' ')); GENERATE TRIM(city_name) as city, TRIM(LOWER(state)) AS state, latitude, longitude; }; ® 2011 Dachis Group.
  • 6. What US city has the most UFO sightings overall?
    Things to consider ...
    1. Need to select only sightings from US cities: join sightings data with US city data.
    2. Need to count sightings for each city: group results from step 1 by state/city and count.
    3. Need to do a TOP to get the city with the most sightings: descending sort on count and choose the top.
  • 7. What US city has the most UFO sightings overall?
      -- Inner JOIN by (state,city) to attach geo-codes to sightings
      ufo_sightings_with_geo = FOREACH (
          JOIN ufo_sightings_by_city BY (state,city),
               us_cities_w_geo BY (state,city) USING 'replicated'
        ) GENERATE
          ufo_sightings_by_city::state AS state,
          ufo_sightings_by_city::city AS city,
          ufo_sightings_by_city::sighted_at AS sighted_at,
          ufo_sightings_by_city::sighted_year AS sighted_year,
          ufo_sightings_by_city::shape AS shape,
          us_cities_w_geo::latitude AS latitude,
          us_cities_w_geo::longitude AS longitude;

      -- Group by (state,city) to get number of sightings for each city
      grp_by_state_city = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
        GENERATE FLATTEN($0) AS (state,city,latitude,longitude),
                 COUNT($1) AS the_count;

      -- Poor man's TOP
      most_freq = ORDER grp_by_state_city BY the_count DESC;
      top_city_state = LIMIT most_freq 1;
      DUMP top_city_state;
  • 8. What US city has the most UFO sightings overall?
      (seattle,wa,446,light,47.620499,-122.350876)
    Seattle only averages 58 sunny days a year. Coincidence?
    Maybe all the UFOs are coming to look at the Space Needle?
  • 9. Pig Explain: Pull back the covers ...
      pig -x local -e 'explain -script ufo.pig'

      -- Job 1 - Mapper
      ufo_sightings_with_geo = FOREACH (
          JOIN ufo_sightings_by_city BY (state,city),
               us_cities_w_geo BY (state,city) USING 'replicated'
        ) GENERATE
          ufo_sightings_by_city::state AS state,
          ufo_sightings_by_city::city AS city,
          ufo_sightings_by_city::sighted_at AS sighted_at,
          ufo_sightings_by_city::sighted_year AS sighted_year,
          ufo_sightings_by_city::shape AS shape,
          us_cities_w_geo::latitude AS latitude,
          us_cities_w_geo::longitude AS longitude;

      -- Job 1 - Reducer
      grp_by_state_city = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
        GENERATE FLATTEN($0) AS (state,city,latitude,longitude),
                 COUNT($1) AS the_count;

      -- Job 2 - Full Map/Reduce
      most_freq = ORDER grp_by_state_city BY the_count DESC;
      top_city_state = LIMIT most_freq 1;
      DUMP top_city_state;
  • 10. What is the most common UFO shape within a 100-mile radius of your answer for #1?
    Things we need to solve this ...
    1) Some way to calculate geographical distance from a geographical location (lat / lng)
    2) Iterate over all cities that have sightings to get the distance from our centroid
    3) Filter by distance and count shapes
  • 11. UDF: User Defined Function
      REGISTER some_path/my-ufo-app-1.0-SNAPSHOT.jar;
      DEFINE CalcGeoDistance com.dachisgroup.ufo.GeoDistance();
      ...
      with_distance = FOREACH calc_dist {
        GENERATE city, state,
          CalcGeoDistance(from_lat, from_lng, to_lat, to_lng) AS dist_in_miles;
      };
    Let's build a UDF that uses the Haversine formula to calculate the distance between two points.
    See: http://en.wikipedia.org/wiki/Haversine_formula
  • 12. UDF: User Defined Function
      import java.io.IOException;

      import org.apache.pig.EvalFunc;
      import org.apache.pig.data.Tuple;

      public class GeoDistance extends EvalFunc<Double> {

        public Double exec(Tuple input) throws IOException {
          // Expect four non-null doubles: from_lat, from_lng, to_lat, to_lng
          if (input == null || input.size() < 4 ||
              input.isNull(0) || input.isNull(1) ||
              input.isNull(2) || input.isNull(3)) {
            return null;
          }

          Double dist = null;
          try {
            Double fromLat = (Double)input.get(0);
            Double fromLng = (Double)input.get(1);
            Double toLat = (Double)input.get(2);
            Double toLng = (Double)input.get(3);
            dist = haversineDistanceInMiles(fromLat, toLat, fromLng, toLng);
          } catch (Exception exc) {
            // better to return null than to throw exception
          }
          return dist;
        }

        protected double haversineDistanceInMiles(double lat1, double lat2, double lon1, double lon2) {
          // details excluded for brevity; see http://www.movable-type.co.uk/scripts/latlong.html
          return dist;
        }
      }
  • 13. What is the most common UFO shape ...
      top_city = FOREACH top_city_state GENERATE city, state, latitude as from_lat, longitude as from_lng;

      -- Including lat / lng in the group-by key to help reduce the number of records I'm crossing
      sighting_cities = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
        GENERATE FLATTEN($0) AS (state,city,latitude,longitude);

      -- Pig only supports equi-joins, so we need to use CROSS to get the lat / lng
      -- of the two points to calculate distance using our UDF
      calc_dist = FOREACH (CROSS sighting_cities, top_city) GENERATE
        sighting_cities::city AS city,
        sighting_cities::state AS state,
        sighting_cities::latitude AS to_lat,
        sighting_cities::longitude AS to_lng,
        CalcGeoDistance(top_city::from_lat, top_city::from_lng,
                        sighting_cities::latitude, sighting_cities::longitude) AS dist_in_miles;

      near = FILTER calc_dist BY dist_in_miles < 100;

      -- When joining, list the largest relation first and the smallest last,
      -- and optimize if possible, such as by using 'replicated'
      shapes = FOREACH (JOIN ufo_sightings_with_geo BY (state,city),
                             near BY (state,city) USING 'replicated')
        generate ufo_sightings_with_geo::shape as shape;

      count_shapes = FOREACH (GROUP shapes BY shape) GENERATE $0 AS shape, COUNT($1) AS the_count;

      sorted_counts = ORDER count_shapes BY the_count DESC;
  • 14. Visualize Results
    In Pig:
      fs -getmerge sorted_counts sorted_counts.txt
    In R:
      shapes <- read.table("sorted_counts.txt", header=F, sep="\t",
                           col.names=c("shape","occurs"), stringsAsFactors=F)
      barplot(c(shapes$occurs), main="UFO Sightings (Shapes)", ylab="Number of Sightings",
              ylim=c(0,500), cex.names=0.8, las=2, names.arg=c(shapes$shape))
  • 15. Set Logic in Pig
    Use Pig's IsEmpty function to isolate records that occur in only one of the relations, such as sightings in cities not in the US census list:
      city_sightings = COGROUP ufo_sightings_by_city BY (state,city) OUTER,
                               us_cities_w_geo BY (state,city);

      outside_us_sightings = FOREACH (FILTER city_sightings BY IsEmpty(us_cities_w_geo))
        GENERATE FLATTEN(ufo_sightings_by_city);
  • 16. Mahout and Pig
    Example integration: Pig-Vector
    GitHub project by Ted Dunning, Mahout committer
    https://github.com/tdunning/pig-vector
    Use case: Train a logistic regression model from Pig
    Hello World of ML: 20 Newsgroups
  • 17. Mahout and Pig
    Step 1: Load the Training Data
    Load 20-newsgroups messages using a custom Pig LoadFunc:
      docs = LOAD '20news-bydate-train/*/*'
        USING org.apache.mahout.pig.MessageLoader()
        AS (newsgroup, id:int, subject, body);
    In Java:
      public class MessageLoader extends LoadFunc {
        public void setLocation(String location, Job job) throws IOException {
          // setup where we're reading data from
        }
        public InputFormat getInputFormat() throws IOException {
          return new TextInputFormat() {
            // ...
          };
        }
        public Tuple getNext() throws IOException {
          // parse message and build Tuple that matches the schema
        }
      }
  • 18. Mahout and Pig
    Step 2: Vectorize using Pig-Vector UDF
      -- Import UDF, define vectorizing strategy and fixed size of feature vector
      DEFINE encodeVector org.apache.mahout.pig.encoders.EncodeVector('100000', 'subject+body', 'group: word, article:numeric, subject:text, body:text');

      vectors = FOREACH docs GENERATE newsgroup, encodeVector(*) as v;
    The result is a hashed feature vector, where features are mapped to indexes in a fixed-size sparse vector (from Mahout).
    Fixed-size vectors are needed to train Mahout's SGD-based logistic regression model.
  • 19. Mahout and Pig
    Step 3: Train the Model
      DEFINE train org.apache.mahout.pig.LogisticRegression('iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');

      /* put the training data in a single bag. We could train multiple models this way */
      grouped = group vectors all;

      /* train the actual model. The key is bogus to satisfy the sequence vector format. */
      model = foreach grouped generate 1 as key, train(vectors) as model;

      store model into 'pv-tmp/news_model' using PigModelStorage();
  • 20. Mahout and Pig
    Step 4: Evaluate the Model
      DEFINE evaluate org.apache.mahout.pig.LogisticRegressionEval('sequence=pv-tmp/news_model/part-r-00000, key=1');

      test = load '20news-bydate-test/*/*' using org.apache.mahout.pig.MessageLoader()
        as (newsgroup, id:int, subject, body);

      testvecs = foreach test generate newsgroup, encodeVector(*) as v;

      evalvecs = foreach testvecs generate evaluate(v);
  • 21. Questions?
    For slides and the Pig script, email me at: tim.potter@dachisgroup.com
    Twitter: thelabdude
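
Slide 12 leaves the body of haversineDistanceInMiles out "for brevity". The following is a minimal sketch of what such a helper could look like, not the author's code. The class name HaversineSketch and the Earth radius constant (3958.8 miles) are assumptions for illustration; the argument order (lat1, lat2, lon1, lon2) mirrors the signature shown on the slide.

```java
// Hypothetical stand-in for the helper elided on slide 12; not the author's code.
final class HaversineSketch {

    // Assumed mean Earth radius in miles; the slides do not state a value.
    static final double EARTH_RADIUS_MILES = 3958.8;

    // Great-circle distance between two points given in decimal degrees.
    // Argument order (lat1, lat2, lon1, lon2) follows the slide's signature.
    static double haversineDistanceInMiles(double lat1, double lat2,
                                           double lon1, double lon2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = Math.toRadians(lat2 - lat1);
        double dLambda = Math.toRadians(lon2 - lon1);

        double sinHalfPhi = Math.sin(dPhi / 2);
        double sinHalfLambda = Math.sin(dLambda / 2);

        // a = sin^2(dPhi/2) + cos(phi1) * cos(phi2) * sin^2(dLambda/2)
        double a = sinHalfPhi * sinHalfPhi
                 + Math.cos(phi1) * Math.cos(phi2) * sinHalfLambda * sinHalfLambda;

        // c is the central angle in radians
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        return EARTH_RADIUS_MILES * c;
    }

    public static void main(String[] args) {
        // Seattle and Westminster, CO coordinates as they appear in the slides.
        double d = haversineDistanceInMiles(47.620499, 39.882190,
                                            -122.350876, -105.064426);
        System.out.printf("Seattle to Westminster, CO: %.1f miles%n", d);
    }
}
```

Inside the UDF, the same body would sit in the protected helper, and exec would keep returning null on bad input as the slide shows. The haversine form is a spherical approximation, which should be adequate for a 100-mile radius filter.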

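Slide 18 describes a hashed feature vector: features are mapped to indexes in a fixed-size sparse vector. The sketch below illustrates that idea in its simplest form. It is not Mahout's or pig-vector's encoder (which the slides do not show); the class name HashedVectorSketch, the whitespace tokenizer, and the use of String.hashCode are all assumptions chosen for brevity.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Simplified illustration of the hashing trick; not Mahout's or pig-vector's encoder.
final class HashedVectorSketch {

    // Maps each whitespace-separated token to an index in [0, numFeatures)
    // and accumulates a count at that index. Tokens that collide share an
    // index, which is the accepted trade-off for a fixed vector size.
    static Map<Integer, Double> encode(String text, int numFeatures) {
        Map<Integer, Double> sparse = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // splitting an empty string yields one empty token
            }
            // floorMod keeps the index non-negative even for negative hash codes
            int index = Math.floorMod(token.hashCode(), numFeatures);
            sparse.merge(index, 1.0, Double::sum);
        }
        return sparse;
    }

    public static void main(String[] args) {
        // 100000 matches the vector size passed to EncodeVector on slide 18.
        System.out.println(encode("UFO seen over Seattle", 100000));
    }
}
```

Because the vector size is fixed up front, no vocabulary pass over the corpus is needed, which is why this kind of encoding suits the streaming SGD training the deck mentions.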
Editor's notes

  1. UFO Sightings from Infochimps: http://www.infochimps.com/datasets/60000-documented-ufo-sightings-with-text-descriptions-and-metada
     US Cities / States with Geo-codes from census.gov: http://www.census.gov/geo/www/gazetteer/files/Gaz_places_national.txt
     Started out as a new hire programming challenge.
  2. ufo_sightings = LOAD '/Users/timpotter/dev/data/ufo_awesome.tsv' AS (
       sighted_at: chararray, reported_at: chararray,
       location: chararray, shape: chararray,
       duration: chararray, description: chararray
     );

     ufo_sightings_split_loc = FOREACH (FILTER ufo_sightings BY sighted_at IS NOT NULL AND location IS NOT NULL) {
       split_city = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 1);
       split_state = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 3);
       city_lc = (split_city IS NOT NULL ? LOWER(split_city) : null);
       state_lc = (split_state IS NOT NULL ? LOWER(split_state) : null);
       GENERATE city_lc AS city, state_lc AS state,
         sighted_at, SUBSTRING(sighted_at,0,4) AS sighted_year,
         reported_at, TRIM(shape) AS shape, duration, description;
     };

     ufo_sightings_by_city = FILTER ufo_sightings_split_loc BY city IS NOT NULL AND state IS NOT NULL;
  3. us_cities = LOAD '/Users/timpotter/dev/data/usa_cities_and_towns.tsv' AS (
       state: chararray, geoid: chararray,
       ansicode: chararray, name: chararray,
       aland: long, awater: long,
       aland_sqmi: double, awater_sqmi: double,
       latitude: double, longitude: double
     );

     us_cities_w_geo = FOREACH us_cities {
       city_name = SUBSTRING(LOWER(name), 0, LAST_INDEX_OF(name,' '));
       GENERATE TRIM(city_name) as city, TRIM(LOWER(state)) AS state, latitude, longitude;
     };
  4. Need to join our sightings data with US cities data to A) filter out non-US cities and B) attach the lat / lng to the sighting. Realize that after the JOIN, your data won't be sorted if you use 'replicated'; it will be sorted if you don't. To quote the Pig book: Pig's optimizer is between your chair and your keyboard.
  5. Seattle image from http://mylocalhealthguide.com/north-seattle-group-targeting-underage-drinking-meets-dec-15/
  6. When using CROSS, minimize the size of the relations you're crossing; thus, I'm grouping by state + city + lat + lng and just flattening the group-by key. When joining, list the largest relation on the left and the smallest on the right.
  7. Image from: http://7fny.com/sub/m/m_pig_rider_xdRkS0SY.jpg
  8. Image from: http://upload.wikimedia.org/wikipedia/commons/thumb/6/68/Gradient_ascent_%28surface%29.png/450px-Gradient_ascent_%28surface%29.png