MarkLogic and Hadoop – Genetic Algorithm
Jim Fuller
email: jim.fuller@marklogic.com twitter: @xquery
Senior Engineer, Europe
19/09/12
Senior engineer




http://jim.fuller.name
http://exslt.org
http://www.xmlprague.cz
@xquery
@perl6
XSLT UK 2001
Overview
• Genetic Algorithm Refresher
• Marklogic/Hadoop architecture for
  implementing GA
• Installing Hadoop
• Installing MarkLogic Connector
• Problem Statement
• Review of GA process runs
• Summary
What's the Problem?
• Big data breathes life into older algorithmic
  approaches
• I thought it would be interesting to turn the ‘bigdata’
  problem on its head (code versus data)
• Demonstrate Hadoop with MarkLogic, each playing
  to the other's strengths
Get out of your comfort zone
• This talk is slightly different than the
  description … 150 slides! Part I.
• It has Hadoop/MarkLogic and the genetic
  algorithm, but I have focused on the process
  and early results
• Doing data science means pushing yourself
  out of your comfort zone
• Start simple, then iterate
Genetic Algorithm Refresher
• The Genetic Algorithm ( GA ) is a model of the
  evolution of a population of artificial individuals
  emulating Darwinian Selection.

• Each individual is a chromosome which contains
  discrete units of information (genes).

• The driving force behind the search for new and
  better solutions is the retention and combination of
  good partial solutions to a problem
Abridged Genetic Algorithm
• The Fundamental Theorem of Genetic Algorithms




M(H, t): number of individuals in population 't' with the schema 'H'.
f(H): average fitness of the individuals with the schema 'H'.
F: average fitness of the entire population.
p1: probability of the schema being destroyed by crossover.
p2: probability of the schema being destroyed by mutation.
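With the symbols defined above, the schema theorem is usually stated as the following lower bound. This is a sketch of the common textbook form; that the slide used exactly this notation is an assumption.

```latex
M(H, t+1) \;\ge\; M(H, t)\,\frac{f(H)}{F}\,\bigl[\,1 - p_1 - p_2\,\bigr]
```

Read this way, a schema whose average fitness f(H) is above the population average F, and which is unlikely to be destroyed by crossover or mutation, is expected to be carried by more individuals in the next generation.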
GA operations
• Reproduction: An individual is perfectly replicated
  to a new population
• Crossover ( Recombination ): Parental material
  is recombined to create offspring to join new
  population
• Mutation: random changes (is key for pushing past
  local optima)
• Permutation: reordering
• Editing: evaluation to a terminal
• Encapsulation: single indivisible function
• Decimation: removal of individuals
Typical GA Process
Step 0. Create a random initial population of individuals

Step 1. Evaluate the fitness of each individual

Step 2. Select individuals according to their fitness, which will
  participate in generating offspring (moms+dads)

Step 3. Apply primary and secondary genetic operations to
  generate new offspring population

Step 4. Repeat steps 1, 2 and 3 to generate X
  generations

Step 5. Choose the fittest individual of the last generation based on
  the stop criteria
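The steps above can be sketched as a short loop. This is a minimal illustration on a toy bit-string problem, not the XSLT setup from the talk: the function names, the toy fitness (count of 1 bits, higher is better) and the parameter values are all assumptions made for the example.

```python
import random

def evolve(fitness, random_individual, crossover, mutate,
           pop_size=50, generations=30, p_mutation=0.05, rng=None):
    """Generic GA loop; here a HIGHER fitness value is better."""
    rng = rng or random.Random()
    # Step 0: create a random initial population
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):          # Step 4: repeat for X generations
        # Step 1: evaluate the fitness of each individual
        scores = [fitness(ind) for ind in population]
        # Step 2: fitness-proportionate selection (tiny offset avoids all-zero weights)
        weights = [s + 1e-9 for s in scores]
        next_generation = []
        while len(next_generation) < pop_size:
            mom, dad = rng.choices(population, weights=weights, k=2)
            # Step 3: primary operation (crossover), then secondary (mutation)
            child = crossover(mom, dad, rng)
            if rng.random() < p_mutation:
                child = mutate(child, rng)
            next_generation.append(child)
        population = next_generation
    # Step 5: choose the fittest individual of the last generation
    return max(population, key=fitness)

# Toy problem (hypothetical): maximise the number of 1 bits in a 16-bit string.
def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(16)]

def one_point(mom, dad, rng):
    cut = rng.randrange(1, len(mom))
    return mom[:cut] + dad[cut:]

def flip_one(ind, rng):
    i = rng.randrange(len(ind))
    return ind[:i] + [1 - ind[i]] + ind[i + 1:]

best = evolve(sum, random_bits, one_point, flip_one, rng=random.Random(0))
```

The deck itself uses standardized fitness, where lower is better; the loop would then select on an inverted score rather than the raw value.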
Endemic GA Problems
• Finding the optimal solution to complex, high-
  dimensional, multimodal problems often
  requires a very expensive fitness function
• Hard to pose the problem statement, e.g. the stop
  criteria are not clear in every problem
• Premature convergence on local optima
Bit strings vs Lisp Parse Trees
(+ (* 2 3) 4) evaluates to 10, and the symbolic
  expression looks like:

          +
         / \
        *   4
       / \
      2   3

Hierarchical computer programs are more
  expressive than manipulating linear strings
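A parse tree like this one can be evaluated recursively. The sketch below represents the tree as nested tuples and assumes the inner operator is multiplication, which is what makes the expression evaluate to 10; the tuple encoding is an illustration, not something from the talk.

```python
import operator

# Function set for this tiny example
OPS = {"+": operator.add, "*": operator.mul}

def evaluate(tree):
    """Evaluate a parse tree given as nested tuples: (op, left, right)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left), evaluate(right))
    return tree  # a terminal (leaf) evaluates to itself

expr = ("+", ("*", 2, 3), 4)  # the tree for (+ (* 2 3) 4)
result = evaluate(expr)
```

Because every subtree is itself a valid expression, swapping subtrees between two such trees always yields another valid program, which is what makes tree crossover convenient.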
XSLT – markup is useful!
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   version="2.0">
 <xsl:template match="a">
   <d/>
   <c/>
 </xsl:template>
</xsl:stylesheet>

The same stylesheet as a tree:

        <xsl:stylesheet/>
               |
        <xsl:template/>
           /       \
        <d/>       <c/>

Obvious difficulties to address: different node types and XPath
Objective              Generate an XSLT program that transforms the source XML
                       into result XML equivalent to the target XML
Terminal set           <a/> <b/> <c/> <d/>
Function set           Subset of XSLT instructions
Fitness cases          One fitness case
Raw fitness            TreeDiffMerge result (node count) + standard diff
Standardized fitness   Same as raw fitness; approaching 0 is better
Parameters             M=500, G=51
Source XML
<a>
  <b>
   <c>
      <d></d>
   </c>
  </b>
</a>
Target XML – clear stop criteria

       <a>
         <b>
          <c>
             <d></d>
          </c>
         </b>
       </a>
Generation zero
• XML Instance Generator, which is part of the Sun
  Multi-Schema Validator
• The following can also generate XML instances
  – OxygenXML
  – Visual Studio
  – Eclipse
• Ended up using IBM XML Generate: very old, but
  supply it a schema and it generates
  example XML
Step 1a: Evaluate against Input
[Diagram: each stylesheet in the XSLT generation is applied to Source.xml (XSLT transformation), producing result.xml]
MarkLogic evaluates each XSLT and stores the result in a property of the XSLT document itself
Step 1b: Evaluate Fitness
[Diagram: the XSLT generation is applied to Source.xml (XSLT transformation) to produce result.xml; Hadoop then evaluates the fitness]
Fitness is computed with TreeDiffMerge plus a standard diff
XML Diff issues
• Many diff algorithms are based on a paper
  published in 1976 by J. W. Hunt and M. D.
  McIlroy, An Algorithm for Differential File
  Comparison
• XML has structure; text-based diff programs
  do not take this into account
• Simple example: <footie/> versus
  <footie></footie>; logically these are equal
• XML canonicalization helps!
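The <footie/> example can be checked in a few lines. This sketch uses the C14N 2.0 support in Python's standard library (`xml.etree.ElementTree.canonicalize`, Python 3.8+) as a stand-in; the talk does not say which canonicalizer was used.

```python
from xml.etree.ElementTree import canonicalize  # Python 3.8+

short_form = "<footie/>"
long_form = "<footie></footie>"

# A plain text comparison sees two different strings ...
same_as_text = short_form == long_form

# ... but after canonicalization both serialize identically.
same_canonical = canonicalize(short_form) == canonicalize(long_form)
```

Canonicalizing both documents before diffing removes this kind of purely lexical difference from the fitness score.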
XML Canonicalize + TreeDiffMerge

Example 1, result:

<?xml version="1.0" encoding="UTF-8"?>
<root/>

Example 1, TreeDiffMerge difference:

<?xml version="1.0" encoding="UTF-8"?>
<diff xmlns:diff='http://diff.org'>
  <diff:insert dst="1">
    <a>
      <b>
        <c>
          <d />
        </c>
      </b>
    </a>
  </diff:insert>
</diff>

Example 2, result (cut off on the slide):

<?xml version="1.0" encoding="utf-8"?><root>
<a/><a><a><c/><c><a><d/></a><c/></c></a><b>
<b/><a/><c/><b>
  <c>
   <d/>
  </c>
</b></b><a/></a><d><a><c/><a/><a/></a><c/></d><c/>

Example 2, TreeDiffMerge difference:

<?xml version="1.0" encoding="UTF-8"?>
<diff xmlns:diff='http://diff.org'>
  <diff:copy src="2" dst="1">
    <diff:copy src="16" dst="2" />
  </diff:copy>
</diff>
Simple if we match: we are done!
Result:

<?xml version="1.0" encoding="utf-8"?><root><a>
 <b>
  <c>
   <d/>
  </c>
 </b>
</a></root>

TreeDiffMerge difference (empty):

<?xml version="1.0" encoding="UTF-8"?>
<diff />
MarkLogic/Hadoop Architecture Interlude
[Diagram: MarkLogic and Hadoop communicate through the Connector API via XDBC]
From Hadoop's point of view
Hadoop Installation Recipe
• Install Hadoop (setting up a single-node cluster)
   – brew install hadoop
   – make sure ssh is set up properly
   – generate id_rsa and id_rsa.pub
   – append the public key to the authorized keys
      • cat id_rsa.pub >> authorized_keys
   – enable remote login on Mac OS X
• Configure Hadoop
   – edit core-site.xml
   – edit mapred-site.xml
• ssh localhost
   – format HDFS
      • hadoop namenode -format
• bin/start-all.sh
   – if it asks for a password, you have a problem with your ssh setup
• To check that all is well
   – run jps
   – ps ax | grep hadoop | wc -l
   – check
      • http://localhost:50030/jobtracker.jsp
      • http://localhost:50060/tasktracker.jsp
      • http://localhost:50070/dfshealth.jsp
Installing ML Hadoop Connector
• Copy the latest XCC and connector jars to the Hadoop
  lib directory
• Copy the ml-examples jar as well
• Copy the MarkLogic Hadoop conf to the Hadoop conf directory
Starting it all Up
• Start marklogic
• Create database
• Create xdbc connection (how hadoop/ml
  communicate)
• Edit marklogic-hello-world.xml

• Make sure hadoop is started
Starting it all Up
• Load test Data via query console

xquery version "1.0-ml";

let $hello := <data><child>hello mom</child></data>
let $world := <data><child>world event</child></data>

return(
  xdmp:document-insert("hello.xml", $hello),
  xdmp:document-insert("world.xml", $world)
)
Run hello world example
• bin/start-all.sh

• hadoop jar lib/marklogic-xcc-examples-6.0.20120914.jar
  com.marklogic.mapreduce.examples.HelloWorld

• Review https://gist.github.com/2484318
Fitness (hadoop) step
• Applies XML canonicalization
• Performs TreeDiffMerge and writes the output to an
  XML property of the original XSLT document
• Performs a text diff and writes it to an XML property
  of the original XSLT document
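The text-diff half of this step can be approximated as follows. This is a stand-in sketch only: it uses Python's `difflib` and the standard-library canonicalizer instead of TreeDiffMerge and Hadoop, and the function name and the one-tag-per-line splitting are assumptions made for the example.

```python
import difflib
from xml.etree.ElementTree import canonicalize  # Python 3.8+

def text_diff_fitness(result_xml, target_xml):
    """Count changed lines between canonical forms; 0 means a match."""
    def tag_lines(xml):
        # Canonicalize, drop whitespace-only text, then put one tag per line
        canonical = canonicalize(xml, strip_text=True)
        return canonical.replace("><", ">\n<").splitlines()

    diff = difflib.ndiff(tag_lines(result_xml), tag_lines(target_xml))
    # ndiff marks removed lines with "- " and added lines with "+ "
    return sum(1 for line in diff if line.startswith(("- ", "+ ")))

target = "<a><b><c><d></d></c></b></a>"
perfect = text_diff_fitness("<a> <b><c><d/></c></b> </a>", target)
poor = text_diff_fitness("<root/>", target)
```

A lower score is better here, matching the standardized fitness used in the deck, and a score of 0 corresponds to the stop criterion.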
Step 2. Select individuals
 • Probabilistic selection chooses which
   individuals participate in genetic operations
[Diagram: the selected XSLT population is drawn from the current generation]
Select individuals for genetic operations, based on their fitness
About fitness
• Raw fitness: the natural representation in
  terms of the specific problem (here, a primitive
  count of the nodes in the TreeDiffMerge patch)
• Standardized fitness: the lower the better
• Adjusted fitness: lies between 0 and 1
• Normalized fitness: lies between 0 and 1, with
  the fitness values summing to 1
• In our case, the lower the number of ‘different’
  nodes the better, so we use standardized fitness
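The four fitness measures relate to each other as sketched below. The slide gives only the ranges; the formulas (adjusted = 1 / (1 + standardized), normalized = adjusted / sum of adjusted) follow Koza's definitions and are an assumption about what the talk used, as are the function names.

```python
import random

def adjusted(standardized):
    """Adjusted fitness in (0, 1]; 1.0 means a perfect match."""
    return 1.0 / (1.0 + standardized)

def normalized(standardized_scores):
    """Normalized fitness: adjusted values scaled so they sum to 1."""
    adj = [adjusted(s) for s in standardized_scores]
    total = sum(adj)
    return [a / total for a in adj]

def select(population, standardized_scores, k, rng=random):
    """Fitness-proportionate selection: fitter individuals are drawn more often."""
    return rng.choices(population, weights=normalized(standardized_scores), k=k)

# Three hypothetical individuals whose diffs have 0, 1 and 3 differing nodes
scores = [0, 1, 3]
norm = normalized(scores)
parents = select(["xslt-0", "xslt-1", "xslt-2"], scores, k=4, rng=random.Random(7))
```

Normalized fitness is convenient for selection because the values can be used directly as selection probabilities.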
Step 3. Apply Primary Genetic Operations
Reproduction
[Diagram: an individual from the selected XSLT population is copied unchanged into the new generation]
Individual reproduced into new generation
Step 3. Primary Genetic Operations
Crossover (Recombination)
[Diagram: ‘Mom’ and ‘Dad’ are chosen from the selected XSLT population; crossover creates 2 offspring in the new generation]
Select parents, then crossover creates 2 offspring
Step 3. Primary Genetic Operations
Crossover (Recombination)
[Diagram: nodes are swapped between ‘Mom XSLT’ and ‘Dad XSLT’, giving two ‘offspring xslt’ that join the new generation]
Swap nodes between the selected parent XSLT
Crossover with xquery
xquery version "1.0-ml";
import module namespace mem = "http://xqdev.com/in-mem-update"
  at "/MarkLogic/appservices/utils/in-mem-update.xqy";

(: the two parent stylesheets :)
let $mom :=
  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/" as="item()*">
      <bar>help</bar>
    </xsl:template>
    <xsl:template match="text()" as="item()*"/>
  </xsl:stylesheet>

let $dad :=
  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/" as="item()*">
      <a><b><c>test</c></b></a>
    </xsl:template>
  </xsl:stylesheet>

(: count every node in each parent, root included :)
let $momCount := fn:count($mom//.)
let $dadCount := fn:count($dad//.)

(: never want the root node: positions start at 2 :)
let $momRdm := xdmp:random($momCount - 2) + 2
let $dadRdm := xdmp:random($dadCount - 2) + 2

(: node selection :)
let $momNode := ($mom//.)[$momRdm]
let $dadNode := ($dad//.)[$dadRdm]

(: crossover: each parent receives the node picked from the other :)
let $newMom := mem:node-replace($momNode, $dadNode)
let $newDad := mem:node-replace($dadNode, $momNode)
return
  <result>
    <newMom>{$newMom}</newMom>
    <newDad>{$newDad}</newDad>
  </result>
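The same subtree swap can be tried outside MarkLogic. The sketch below is an approximation in Python using `xml.etree.ElementTree`; it swaps element nodes only, drops the XSLT namespace for brevity, and the helper names are hypothetical.

```python
import copy
import random
import xml.etree.ElementTree as ET

def child_slots(root):
    """Return (parent, index, child) for every element below the root."""
    return [(parent, i, child)
            for parent in root.iter()
            for i, child in enumerate(list(parent))]

def crossover(mom, dad, rng=random):
    """Swap one randomly chosen subtree between deep copies of mom and dad."""
    new_mom, new_dad = copy.deepcopy(mom), copy.deepcopy(dad)
    mom_parent, mom_i, mom_node = rng.choice(child_slots(new_mom))
    dad_parent, dad_i, dad_node = rng.choice(child_slots(new_dad))
    mom_parent[mom_i] = dad_node   # mom's copy receives dad's subtree
    dad_parent[dad_i] = mom_node   # dad's copy receives mom's subtree
    return new_mom, new_dad

def size(root):
    """Number of elements in the tree, root included."""
    return sum(1 for _ in root.iter())

mom = ET.fromstring("<s><t><bar>help</bar></t><t/></s>")
dad = ET.fromstring("<s><t><a><b><c>test</c></b></a></t></s>")
new_mom, new_dad = crossover(mom, dad, random.Random(1))
```

Working on deep copies keeps the parents intact, so the same individuals can be selected again for other operations in the same generation.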
Step 3. Secondary Genetic Operations


• Mutation: a form of random crossover
• Permutation: reorganizes nodes
• Editing: evaluates a set of nodes
• Encapsulation: takes a branch and replaces it
  with one indivisible node
• Decimation: removes individuals based on
  domain-specific criteria
Step 3. Secondary Genetic Operations
Mutation
[Diagram: a node in the ‘selected XSLT’ is replaced by a completely new set of instructions, giving the ‘offspring xslt’]
Pick a node and randomly mutate it
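Mutation on a tree can be sketched as replacing one randomly chosen subtree with a freshly generated one. The example below is an illustration in Python with `xml.etree.ElementTree`; the random-subtree generator, its depth limit and the use of the terminal set <a/> <b/> <c/> <d/> for new nodes are assumptions, not the talk's implementation.

```python
import copy
import random
import xml.etree.ElementTree as ET

TERMINALS = ["a", "b", "c", "d"]  # terminal set from the problem statement

def random_subtree(rng, depth=2):
    """Build a small random tree of terminal elements (hypothetical generator)."""
    node = ET.Element(rng.choice(TERMINALS))
    if depth > 0:
        for _ in range(rng.randint(0, 2)):
            node.append(random_subtree(rng, depth - 1))
    return node

def mutate(individual, rng):
    """Return a copy with one non-root subtree replaced by a random subtree."""
    mutant = copy.deepcopy(individual)
    slots = [(parent, i) for parent in mutant.iter()
             for i, _ in enumerate(list(parent))]
    if not slots:              # nothing below the root to mutate
        return mutant
    parent, index = rng.choice(slots)
    parent[index] = random_subtree(rng)
    return mutant

original = ET.fromstring("<t><a><b/></a><c/></t>")
before = ET.tostring(original)
mutant = mutate(original, random.Random(3))
```

Unlike crossover, mutation can introduce material that no current individual carries, which is why it helps a population escape a local optimum.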
Step 3. Secondary Genetic Operations
Permutation
[Diagram: the ‘selected XSLT’ becomes an ‘offspring xslt’ with the same nodes in a different order]
Permuted node order
Step 3. Secondary Genetic Operations
Editing
[Diagram: a subtree of the ‘selected XSLT’ is evaluated and replaced in the ‘offspring xslt’]
Replace node with evaluated expression
Step 3. Secondary Genetic Operations
Encapsulation
[Diagram: a subtree of the ‘selected XSLT’ is used to ‘define new function’, which the resulting ‘XSLT’ references as a single node]
Identify useful subtrees and encapsulate them by defining a new function
Step 3. Secondary Genetic Operations
Decimation
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">
</xsl:stylesheet>
[Diagram: the tree for this individual is the single node <xsl:stylesheet/>]
Identify individuals with very poor fitness and remove them from the population
Initial tests
• Initial population = 500, generations = 51
• Set initial genetic operation probabilities:
   – 90% crossover on selected individuals
   – 10% reproduction on selected individuals
   – 0% secondary operations on selected individuals
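Choosing which operation to apply to each selected individual can be sketched with a weighted random draw. The dictionary keys and function name below are hypothetical; only the 90/10/0 split comes from the slide.

```python
import random
from collections import Counter

# Probabilities from the initial test configuration
OPERATION_WEIGHTS = {"crossover": 0.90, "reproduction": 0.10, "secondary": 0.0}

def pick_operation(rng):
    """Draw one genetic operation according to the configured probabilities."""
    names = list(OPERATION_WEIGHTS)
    weights = [OPERATION_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)
counts = Counter(pick_operation(rng) for _ in range(1000))
```

Keeping the probabilities in one table makes it easy to raise the share of secondary operations, such as mutation, in later runs.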
Results
• Runs faster with more servers … extreme scale-out,
  which is unusual for GA
• Arrived quickly at a ‘correct’ solution
• In some runs, though, the local optimum was a ‘wrong
  solution’, e.g. an embedded literal
• Need to constrain XPath (baby steps)
• Need to constrain the terminal set
• Need to enhance the fitness definition
Source XML
<a>
  <b>
   <c>
      <d></d>
   </c>
  </b>
</a>
Target XML

<a>
  <b/>
  <c/>
  <d/>
</a>
Results
• Needed more generations and more individuals
• The mutation operation was needed to kick the
  population out of local optima
Summary
• This approach can be applied to any language
  parse tree (XQuery with xqueryparser.xq)
• Embedded little languages cause
  difficulties
• Today, commercially applicable to generating
  mapping solutions; more research required
• Illustrates applying the strengths of MarkLogic and
  Hadoop together
• Will place code and results on GitHub soon …
References
• John R. Koza, Genetic Programming, MIT Press, 1992
• J. W. Hunt and M. D. McIlroy, An Algorithm for Differential
  File Comparison, 1976

Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

  • 1. MarkLogic and Hadoop – Genetic Algorithm Jim Fuller email: jim.fuller@marklogic.com twitter: @xquery Senior Engineer, Europe 19/09/12
  • 2. Senior engineer http://jim.fuller.name http://exslt.org @xquery XSLT UK 2001 http://www.xmlprague.cz @perl6
  • 3. Overview • Genetic Algorithm Refresher • Marklogic/Hadoop architecture for implementing GA • Installing Hadoop • Installing MarkLogic Connector • Problem Statement • Review of GA process runs • Summary
  • 4. Whats the Problem ? • Bigdata breathes life into older algorithmic approaches • I thought it would interesting to turn ‘bigdata’ problem on its head (code versus data) • Demonstrate hadoop with MarkLogic, working to each other strengths
  • 5. Get out of your comfort zone • This talk is slightly different then the description … 150 slides! Part I. • Its got hadoop/marklogic and the genetic algorithm but have focused on the process and early results • Doing data science means pushing yourself out of your comfort zone • Start simple, then iterate
  • 6. Genetic Algorithm Refresher • The Genetic Algorithm ( GA ) is a model of the evolution of a population of artificial individuals emulating Darwinian Selection. • Each individual is a chromosome which contains discrete units of information (genes). • The driving force behind the search for new and better solutions is the retention and combination of good partial solutions to a problem
  • 7. Abridged Genetic Algorithm • The Fundamental Theorem of Genetic Algorithms M(H, t):# of individuals in population 't' with the schema 'H'. f(H): average fitness of the individuals with the schema 'H'. F: average fitness of the entire population. p1:probability of the schema being destroyed by crossover. p2:probability of the schema being destroyed by mutation.
  • 8. GA operations • Reproduction: An individual is perfectly replicated to a new population • Crossover ( Recombination ): Parental material is recombined to create offspring to join new population • Mutation: random changes (is key for pushing past local optima) • Permutation: reordering • Editing: evaluation to a terminal • Encapsulation: single indivisible function • Decimation: removal of individuals
  • 9. Typical GA Process Step 0. Create a random initial population of individuals Step 1. Evaluate the fitness of each individual Step 2. Select individuals according to their fitness, which will participate in generating offspring (moms+dads) Step 3. Apply primary and secondary genetic operations to generate new offspring population Step 4. Repeat the steps 1,2,3, to generate X number of generations Step 5. choose fittest individual of last generation based on stop criteria
  • 10. Endemic GA Problems • Finding the optimal solution to complex high dimensional, multimodal problems often requires very expensive fitness function • Hard to pose problem statement e.g. Stop criteria is not clear in every problem • Premature convergence on local optima
  • 11. Bit strings vs Lisp Parse Trees (+( 2 3) 4) evaluates to 10 and symbolic expression looks like; + 4 2 3 Hierarchical computer programs are more expressive then manipulating linear strings
  • 12. XSLT – markup is useful! <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version=“2.0"> <xsl:template match="a"> <d/> <c/> </xsl:template> <xsl:stylesheet/> </xsl:stylesheet> <xsl:template/> <d/> <c/> Obvious Difficulties to address; different node types and xpath
  • 13. Objective Generate an xslt program that transforms source xml into result xml which is equivalent to target xml Terminal Set <a/> <b/> <c/> <d/> Function Set Subset of xslt instructions Fitness Cases One fitness case Raw fitness Treediffmerge result, node count + standard diff Standardized Same as raw fitness, fitness approaching 0 is better fitness Parameters M=500, G=51
  • 14. Source XML <a> <b> <c> <d></d> </c> </b> </a>
  • 15. Target XML – clear stop criteria <a> <b> <c> <d></d> </c> </b> </a>
  • 16. Generation zero • XML Instance Generator which is part of the Sun Multi-Schema Validator • Sun Multi-Schema Validator • The following can do it – OxygenXML – Visual Studio – Eclipse • Ended up using IBM XML Generate – very old, supply it a schema and it would generate example xml
  • 17. Step 1a: Evaluate against Input xslt Source.xml transformation result.xml XSLT generation MarkLogic evals and places the result into the property for the xslt itself
  • 18. Step 1b: Evaluate Fitness xslt Source.xml transformation result.xml HADOOP XSLT generation evaluate fitness fitness performed with treediffmerge + standard diff
  • 19. XML Diff issues • Many diff algorithms are based on a paper published in 1976 by J. W. Hunt and M. D. McIlroy, An Algorithm for Differential File Comparison • XML has a structure, text based diff programs do not take this into accordance • simple example: <footie/> versus <footie></footie>logically these are equal • XML Canonization helps !
  • 20. XML Canonize + TreeDiffMerge TREEDIFFMERGE DIFFERENCE RESULTS <?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-8"?> <root/> <diff xmlns:diff='http://diff.org'> <diff:insert dst="1"> <a> <b> <c> <d /> </c> </b> </a> </diff:insert> </diff> <?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="utf-8"?><root> <diff xmlns:diff='http://diff.org'> <a/><a><a><c/><c><a><d/></a><c/></c></a><b> <diff:copy src="2" dst="1"> <b/><a/><c/><b> <c> <diff:copy src="16" <d/> dst="2" /> </c> </diff:copy> </diff> </b></b><a/></a><d><a><c/><a/><a/></a><c/></ d><c/>
  • 21. Simple if we match: we are done! <?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="utf-8"?><root><a> <diff /> <b> <c> <d/> </c> </b> </a></root>
  • 22. MarkLogic/Hadoop Architecture Interlude MarkLogic Connector API via XDBC MarkLogic Connector API via XDBC
  • 24. Hadoop Installation Recipe • installing Hadoop (setting up a single node cluster) – brew install hadoop – make sure ssh is setup properly – generate id_rsa and id_rsa.pub – append pub to auth keys • cat id_rsa.pub >> authorized_keys – enable remote on mac osx • configure hadoop – edit core-site.xml – edit mapred-site.xml • ssh localhost – format hdfs • hadoop namenode –format • bin/start-all.sh – if asks for password, you got problem with your ssh setup • to check that all is well – run jps – ps ax | grep hadoop | wc –l – Check • http://localhost:50030/jobtracker.jsp • http://localhost:50060/tasktracker.jsp • http://localhost:50070/dfshealth.jsp
  • 25. Installing ML Hadoop Connector • Copy the latest XCC and connector jars to the Hadoop lib directory • Copy the ml-examples jar as well • Copy the MarkLogic Hadoop conf to the Hadoop conf directory
  • 26. Starting it all Up • Start MarkLogic • Create a database • Create an XDBC connection (how Hadoop and MarkLogic communicate) • Edit marklogic-hello-world.xml • Make sure Hadoop is started
  • 27. Starting it all Up • Load test data via Query Console
      xquery version "1.0-ml";
      let $hello := <data><child>hello mom</child></data>
      let $world := <data><child>world event</child></data>
      return (
        xdmp:document-insert("hello.xml", $hello),
        xdmp:document-insert("world.xml", $world)
      )
  • 28. Run hello world example • bin/start-all.sh • hadoop jar lib/marklogic-xcc-examples-6.0.20120914.jar com.marklogic.mapreduce.examples.HelloWorld • Review https://gist.github.com/2484318
  • 29. Fitness (hadoop) step • Applies XML canonicalization • Runs treediffmerge and writes its output to an XML property of the original XSLT document • Runs a text diff and writes its output to an XML property of the original XSLT document
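The fitness step can be sketched in Python under some loud assumptions: `raw_fitness` is a hypothetical helper, the standard library's `difflib` stands in for the text diff, and treediffmerge is not reproduced here. The sketch canonicalizes both documents, puts one tag per line, and counts the lines that differ, so 0 means the generated result matches the target.

```python
import difflib
import xml.etree.ElementTree as ET


def raw_fitness(result_xml: str, target_xml: str) -> int:
    """Count differing lines between canonicalized result and target (0 = match)."""

    def canon_lines(xml_text: str) -> list:
        # strip_text=True drops whitespace-only formatting between tags.
        canon = ET.canonicalize(xml_text, strip_text=True)
        # One tag per line so the line-based diff can localize changes.
        return canon.replace("><", ">\n<").splitlines()

    diff = difflib.ndiff(canon_lines(result_xml), canon_lines(target_xml))
    # Only count added or removed lines, not ndiff's hint or context lines.
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


matching = raw_fitness(
    "<root><a><b><c><d/></c></b></a></root>",
    "<root><a> <b> <c> <d></d> </c> </b> </a></root>",
)
empty = raw_fitness("<root/>", "<root><a><b><c><d/></c></b></a></root>")
print(matching, empty)
```

A count like this is a raw fitness where lower is better, which lines up with the deck's choice of standardized fitness.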
  • 30. Step 2. Select individuals • Probabilistic selection chooses which individuals participate in genetic operations [Diagram: XSLT population → selected individuals] Select individuals for genetic operations, based on their fitness
  • 31. About fitness • Raw fitness: the natural representation in terms of the specific problem (here, a primitive count of the nodes in the treediffmerge patch) • Standardized fitness: the lower the better • Adjusted fitness: lies between 0 and 1 • Normalized fitness: lies between 0 and 1, with the fitness values summing to 1 • In our case, the lower the number of ‘different’ nodes the better, so we use standardized fitness
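The fitness variants and the selection step can be sketched as follows. The formulas are the usual ones from Koza's Genetic Programming (adjusted fitness a = 1 / (1 + s) for standardized fitness s, normalized fitness n = a / sum of all a); the function names and the use of fitness-proportionate selection via `random.choices` are assumptions for illustration, not the talk's actual code.

```python
import random


def adjusted(standardized: float) -> float:
    """Map standardized fitness (0 is best) into (0, 1], where 1 is best."""
    return 1.0 / (1.0 + standardized)


def normalized(standardized_scores: list) -> list:
    """Adjusted fitness rescaled so the whole population sums to 1."""
    adj = [adjusted(s) for s in standardized_scores]
    total = sum(adj)
    return [a / total for a in adj]


def select(population: list, standardized_scores: list, k: int, rng: random.Random) -> list:
    """Fitness-proportionate selection: fitter individuals are drawn more often."""
    return rng.choices(population, weights=normalized(standardized_scores), k=k)


scores = [0, 1, 3]  # e.g. number of 'different' nodes per individual
weights = normalized(scores)
picked = select(["xslt-a", "xslt-b", "xslt-c"], scores, k=5, rng=random.Random(0))
print(weights, picked)
```

Because selection is probabilistic, weak individuals still get picked occasionally, which keeps some diversity in the population.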
  • 32. Step 3. Apply Primary Genetic Operations: Reproduction [Diagram: selected XSLT population → new generation] Individual reproduced into the new generation
  • 33. Step 3. Primary Genetic Operations: Crossover ( Recombination ) [Diagram: selected XSLT population → ‘Mom’ + ‘Dad’ → new generation] Select parents, then crossover creates 2 offspring
  • 34. Step 3. Primary Genetic Operations: Crossover ( Recombination ) [Diagram: ‘Mom XSLT’ + ‘Dad XSLT’ → two ‘offspring xslt’ → new generation] Swap nodes between the selected parent XSLTs
  • 35. Crossover with xquery
      xquery version "1.0-ml";
      import module namespace mem = "http://xqdev.com/in-mem-update"
        at "/MarkLogic/appservices/utils/in-mem-update.xqy";

      let $mom :=
        <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:template match="/" as="item()*">
            <bar>help</bar>
          </xsl:template>
          <xsl:template match="text()" as="item()*"/>
        </xsl:stylesheet>
      let $dad :=
        <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:template match="/" as="item()*">
            <a><b><c>test</c></b></a>
          </xsl:template>
        </xsl:stylesheet>
      let $momCount := fn:count($mom//.)
      let $dadCount := fn:count($dad//.)
      (: never want root node :)
      let $momRdm := xdmp:random($momCount - 2) + 2
      let $dadRdm := xdmp:random($dadCount - 2) + 2
      (: node selection :)
      let $momNode := ($mom//.)[$momRdm]
      let $dadNode := ($dad//.)[$dadRdm]
      (: crossover :)
      let $newMom := mem:node-replace($momNode, $dadNode)
      let $newDad := mem:node-replace($dadNode, $momNode)
      return
        <result>
          <newMom>{$newMom}</newMom>
          <newDad>{$newDad}</newDad>
        </result>
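For readers without a MarkLogic instance, the same subtree-swap idea can be sketched in Python with `xml.etree.ElementTree`. This is an illustration, not a port: the `crossover` function is hypothetical, it swaps element nodes only (the XQuery above can also pick text nodes), and it assumes each parent has at least one non-root element.

```python
import copy
import random
import xml.etree.ElementTree as ET


def crossover(mom: ET.Element, dad: ET.Element, rng: random.Random):
    """Swap one random non-root subtree between copies of two parent trees."""
    mom, dad = copy.deepcopy(mom), copy.deepcopy(dad)

    def pick_slot(tree: ET.Element):
        # Every (parent, child index) pair; the root is never a candidate.
        slots = [(parent, i) for parent in tree.iter() for i in range(len(parent))]
        return rng.choice(slots)

    mom_parent, mom_index = pick_slot(mom)
    dad_parent, dad_index = pick_slot(dad)
    # Exchange the two chosen subtrees in place on the copies.
    mom_parent[mom_index], dad_parent[dad_index] = (
        dad_parent[dad_index],
        mom_parent[mom_index],
    )
    return mom, dad


mom = ET.fromstring("<s><t><bar>help</bar></t><u/></s>")
dad = ET.fromstring("<s><t><a><b><c>test</c></b></a></t></s>")
kid1, kid2 = crossover(mom, dad, random.Random(1))
print(ET.tostring(kid1, encoding="unicode"))
print(ET.tostring(kid2, encoding="unicode"))
```

Working on deep copies keeps the parents intact, so the same individual can be selected again for another operation in the same generation.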
  • 36. Step 3. Secondary Genetic Operations • Mutation: a form of random crossover • Permutation: reorganizes nodes • Editing: evaluates a set of nodes • Encapsulation: replaces a branch with one indivisible node • Decimation: removes individuals based on domain-specific criteria
  • 37. Step 3. Secondary Genetic Operations: mutation [Diagram: ‘selected XSLT’ → ‘offspring xslt’ with a completely new set of instructions] Pick a node and randomly mutate
  • 38. Step 3. Secondary Genetic Operations: permutation [Diagram: ‘selected XSLT’ → ‘offspring xslt’ with permuted node order]
  • 39. Step 3. Secondary Genetic Operations: editing [Diagram: ‘selected XSLT’ → ‘offspring xslt’] Replace node with evaluated expression
  • 40. Step 3. Secondary Genetic Operations: encapsulation [Diagram: ‘selected XSLT’ → ‘define new function’ → ‘XSLT’] Identify useful subtrees and encapsulate them by defining a new function
  • 41. Step 3. Secondary Genetic Operations: decimation [Examples shown: <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> </xsl:stylesheet> and <xsl:stylesheet/>] Identify individuals with very poor fitness and remove them from the population
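Two of the secondary operations are simple enough to sketch in Python. Both functions are hypothetical illustrations: `permute` reorders the children of one randomly chosen node, and `decimate` uses a plain fitness cutoff as a stand-in for the "domain specific criteria", which the slides do not spell out.

```python
import copy
import random
import xml.etree.ElementTree as ET


def permute(individual: ET.Element, rng: random.Random) -> ET.Element:
    """Return a copy with the children of one randomly chosen node reordered."""
    child = copy.deepcopy(individual)
    # Only nodes with at least two children can be meaningfully reordered.
    candidates = [node for node in child.iter() if len(node) > 1]
    if candidates:
        parent = rng.choice(candidates)
        kids = list(parent)
        rng.shuffle(kids)
        parent[:] = kids  # replace the children with the shuffled order
    return child


def decimate(population: list, scores: list, worst_allowed: float) -> list:
    """Drop individuals whose standardized fitness (lower is better) exceeds the cutoff."""
    return [ind for ind, score in zip(population, scores) if score <= worst_allowed]


tree = ET.fromstring("<a><b/><c/><d/></a>")
shuffled = permute(tree, random.Random(3))
survivors = decimate(["x", "y", "z"], [0, 12, 3], worst_allowed=5)
print([node.tag for node in shuffled], survivors)
```

Decimating early can save evaluation time on individuals like the empty stylesheets shown on the slide, at the cost of some diversity.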
  • 42. Initial tests • Initial population = 500, generations = 51 • Set initial genetic operation probabilities for selected individuals: 90% crossover, 10% reproduction, 0% secondary operations
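The operation probabilities above can be applied with a weighted draw per selected individual. A minimal Python sketch, where `OPERATION_WEIGHTS` and `choose_operation` are hypothetical names and only the 90/10/0 split comes from the slide:

```python
import random
from collections import Counter

# Probabilities from the initial tests; the dictionary itself is illustrative.
OPERATION_WEIGHTS = {"crossover": 0.90, "reproduction": 0.10, "secondary": 0.0}


def choose_operation(rng: random.Random) -> str:
    """Draw the genetic operation to apply, according to OPERATION_WEIGHTS."""
    names = list(OPERATION_WEIGHTS)
    weights = [OPERATION_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)
draws = Counter(choose_operation(rng) for _ in range(500))
print(draws)
```

With the secondary weight at 0, mutation never fires, which is consistent with the later finding that a mutation operation was needed to escape local optima.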
  • 43. Results • Runs faster with more servers … extreme scale out, which is unusual for GA • Arrived quickly at a ‘correct’ solution • In some runs, though, the local optimum was a ‘wrong’ solution, e.g. an embedded literal • Need to constrain XPath (baby steps) • Need to constrain the terminal set • Need to enhance the fitness definition
  • 44. Source XML <a> <b> <c> <d></d> </c> </b> </a>
  • 45. Target XML <a> <b/> <c/> <d/> </a>
  • 46. Results • Needed larger generations / more individuals • A mutation operation was needed to kick the search out of local optima
  • 47. Summary • This approach can be applied to any language parse tree (xquery with xqueryparser.xq) • Embedded little languages cause difficulties • Today it is commercially applicable to generating mapping solutions; more research is required • Illustrates combining the strengths of MarkLogic and Hadoop • Will place code and results on github soon …
  • 48. References • John R. Koza, Genetic Programming, MIT Press, 1992 • J. W. Hunt and M. D. McIlroy, An Algorithm for Differential File Comparison, 1976