3. Overview
• Genetic Algorithm Refresher
• MarkLogic/Hadoop architecture for
implementing GA
• Installing Hadoop
• Installing MarkLogic Connector
• Problem Statement
• Review of GA process runs
• Summary
4. What's the Problem?
• Bigdata breathes life into older algorithmic
approaches
• I thought it would be interesting to turn the ‘bigdata’
problem on its head (code versus data)
• Demonstrate Hadoop with MarkLogic, working
to each other's strengths
5. Get out of your comfort zone
• This talk is slightly different than the
description … 150 slides! Part I.
• It's got Hadoop/MarkLogic and the genetic
algorithm, but I have focused on the process
and early results
• Doing data science means pushing yourself
out of your comfort zone
• Start simple, then iterate
6. Genetic Algorithm Refresher
• The Genetic Algorithm ( GA ) is a model of the
evolution of a population of artificial individuals
emulating Darwinian Selection.
• Each individual is a chromosome which contains
discrete units of information (genes).
• The driving force behind the search for new and
better solutions is the retention and combination of
good partial solutions to a problem
7. Abridged Genetic Algorithm
• The Fundamental Theorem of Genetic Algorithms
M(H, t): number of individuals in population 't' with the schema 'H'.
f(H): average fitness of the individuals with the schema 'H'.
F: average fitness of the entire population.
p1: probability of the schema being destroyed by crossover.
p2: probability of the schema being destroyed by mutation.
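With the symbols defined above, the theorem is commonly stated as a lower bound on how many copies of a schema survive into the next generation. The form below is the usual textbook statement (Holland's schema theorem), written with this slide's symbols; it is an assumption that the slide showed this exact form.

```latex
M(H, t+1) \;\ge\; M(H, t) \cdot \frac{f(H)}{F} \cdot \left(1 - p_1 - p_2\right)
```

Read this as: schemata whose average fitness f(H) is above the population average F, and which are unlikely to be broken by crossover or mutation, gain representation from one generation to the next.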
8. GA operations
• Reproduction: An individual is perfectly replicated
to a new population
• Crossover ( Recombination ): Parental material
is recombined to create offspring to join new
population
• Mutation: random changes (is key for pushing past
local optima)
• Permutation: reordering
• Editing: evaluation to a terminal
• Encapsulation: single indivisible function
• Decimation: removal of individuals
9. Typical GA Process
Step 0. Create a random initial population of individuals
Step 1. Evaluate the fitness of each individual
Step 2. Select individuals according to their fitness, which will
participate in generating offspring (moms+dads)
Step 3. Apply primary and secondary genetic operations to
generate new offspring population
Step 4. Repeat steps 1, 2 and 3 to generate X
generations
Step 5. Choose the fittest individual of the last generation based on
stop criteria
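The steps above can be sketched in a few lines of Python. This is a minimal illustration on bit strings, not the XSLT setup used in the talk: the function name `run_ga`, the parameter defaults, and the toy fitness function (count of 1 bits) are all assumptions chosen to keep the example self-contained.

```python
import random

def run_ga(fitness, genome_len=20, pop_size=50, generations=30,
           p_cross=0.9, p_mut=0.01, seed=0):
    """Toy generational GA on bit strings (illustrative parameters only)."""
    rng = random.Random(seed)
    # Step 0: random initial population
    pop = [[rng.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):          # Step 4: repeat for X generations
        # Step 1: evaluate fitness of each individual
        scores = [fitness(ind) for ind in pop]
        # Step 2: fitness-proportionate selection (+1 avoids all-zero weights)
        weights = [s + 1 for s in scores]
        nxt = []
        while len(nxt) < pop_size:
            mom, dad = rng.choices(pop, weights=weights, k=2)
            # Step 3: crossover or plain reproduction, then mutation
            if rng.random() < p_cross:
                cut = rng.randint(1, genome_len - 1)
                kids = [mom[:cut] + dad[cut:], dad[:cut] + mom[cut:]]
            else:
                kids = [mom[:], dad[:]]
            for kid in kids:
                nxt.append([1 - g if rng.random() < p_mut else g for g in kid])
        pop = nxt[:pop_size]
    # Step 5: fittest individual of the last generation
    return max(pop, key=fitness)

best = run_ga(sum)  # toy fitness: number of 1 bits
```

The stop criterion here is simply a fixed number of generations, which is the easiest choice when no better criterion is known.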
10. Endemic GA Problems
• Finding the optimal solution to complex, high
dimensional, multimodal problems often
requires a very expensive fitness function
• Hard to pose the problem statement, e.g. stop
criteria are not clear in every problem
• Premature convergence on local optima
11. Bit strings vs Lisp Parse Trees
(+ (* 2 3) 4) evaluates to 10, and the symbolic
expression looks like:
      +
     / \
    *   4
   / \
  2   3
Hierarchical computer programs are more
expressive than linear strings
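To make the parse-tree idea concrete, here is a small sketch that represents the expression (+ (* 2 3) 4) as nested Python tuples and evaluates it recursively. The tuple encoding and the `evaluate` helper are illustrative assumptions, not part of the talk's code.

```python
import operator

# Function set for this toy example
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(node):
    """Evaluate a parse tree given as (operator, arg1, arg2, ...) tuples."""
    if isinstance(node, tuple):
        op, *args = node
        values = [evaluate(arg) for arg in args]
        result = values[0]
        for value in values[1:]:
            result = OPS[op](result, value)
        return result
    return node  # a terminal (number)

tree = ("+", ("*", 2, 3), 4)
print(evaluate(tree))  # 10
```

Because every subtree is itself a valid expression, genetic operations can swap or replace subtrees and still get a program that evaluates.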
12. XSLT – markup is useful!
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">
  <xsl:template match="a">
    <d/>
    <c/>
  </xsl:template>
</xsl:stylesheet>
[Tree diagram: <xsl:stylesheet/> → <xsl:template/> → <d/>, <c/>]
Obvious difficulties to address: different node types and XPath
13. Objective: Generate an XSLT program that transforms source XML into result XML which is equivalent to target XML
Terminal set: <a/> <b/> <c/> <d/>
Function set: Subset of XSLT instructions
Fitness cases: One fitness case
Raw fitness: Treediffmerge result, node count + standard diff
Standardized fitness: Same as raw fitness; approaching 0 is better fitness
Parameters: M=500, G=51
16. Generation zero
• XML Instance Generator which is part of the Sun
Multi-Schema Validator
• Sun Multi-Schema Validator
• The following can do it
– OxygenXML
– Visual Studio
– Eclipse
• Ended up using IBM XML Generate – very old,
supply it a schema and it would generate
example xml
17. Step 1a: Evaluate against Input
[Diagram: XSLT generation; xslt + Source.xml → transformation → result.xml]
MarkLogic evaluates each XSLT and places the result into a property of the XSLT document itself
18. Step 1b: Evaluate Fitness
[Diagram: XSLT generation; xslt + Source.xml → transformation → result.xml → Hadoop: evaluate fitness]
Fitness evaluation is performed with treediffmerge + standard diff
19. XML Diff issues
• Many diff algorithms are based on a paper
published in 1976 by J. W. Hunt and M. D.
McIlroy, An Algorithm for Differential File
Comparison
• XML has structure; text-based diff programs
do not take this into account
• Simple example: <footie/> versus
<footie></footie>; logically these are equal
• XML canonicalization helps!
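The <footie/> example can be checked with Python's standard library, which ships a canonicalization function (`xml.etree.ElementTree.canonicalize`, Python 3.8+). This is only an illustration of the idea; the talk does not say which canonicalization implementation was used.

```python
from xml.etree.ElementTree import canonicalize

# A plain text diff sees these two documents as different strings...
empty_tag = "<footie/>"
open_close = "<footie></footie>"
print(empty_tag == open_close)  # False

# ...but their canonical forms are identical.
print(canonicalize(empty_tag) == canonicalize(open_close))  # True

# Attribute order is also normalized by canonicalization.
same_attrs = canonicalize('<a y="2" x="1"/>') == canonicalize('<a x="1" y="2"/>')
print(same_attrs)  # True
```

Canonicalizing both documents before diffing removes differences that are purely lexical, so the diff only reports differences that matter.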
24. Hadoop Installation Recipe
• installing Hadoop (setting up a single node cluster)
– brew install hadoop
– make sure ssh is set up properly
– generate id_rsa and id_rsa.pub
– append the public key to authorized_keys
• cat id_rsa.pub >> authorized_keys
– enable Remote Login on Mac OS X
• configure hadoop
– edit core-site.xml
– edit mapred-site.xml
• ssh localhost
– format hdfs
• hadoop namenode -format
• bin/start-all.sh
– if it asks for a password, you have a problem with your ssh setup
• to check that all is well
– run jps
– ps ax | grep hadoop | wc -l
– Check
• http://localhost:50030/jobtracker.jsp
• http://localhost:50060/tasktracker.jsp
• http://localhost:50070/dfshealth.jsp
25. Installing ML Hadoop Connector
• Copy the latest XCC and connector jars to the Hadoop
lib directory
• Copy the ml-examples jar as well
• Copy the MarkLogic Hadoop conf to the Hadoop conf directory
26. Starting it all Up
• Start MarkLogic
• Create database
• Create XDBC connection (how Hadoop and MarkLogic
communicate)
• Edit marklogic-hello-world.xml
• Make sure Hadoop is started
27. Starting it all Up
• Load test Data via query console
xquery version "1.0-ml";
let $hello := <data><child>hello mom</child></data>
let $world := <data><child>world event</child></data>
return(
xdmp:document-insert("hello.xml", $hello),
xdmp:document-insert("world.xml", $world)
)
28. Run hello world example
• bin/start-all.sh
• hadoop jar lib/marklogic-xcc-examples-6.0.20120914.jar com.marklogic.mapreduce.examples.HelloWorld
• Review https://gist.github.com/2484318
29. Fitness (hadoop) step
• Applies XML canonicalization
• Performs treediffmerge and writes the output to the
original XSLT document's XML property
• Performs a text diff and writes the output to the original XSLT
document's XML property
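As a rough sketch of what such a fitness step computes, the snippet below canonicalizes the result and target XML and counts the lines that differ. It uses Python's `difflib` as a stand-in for both treediffmerge and the standard diff, and the function name `raw_fitness` is hypothetical; the real step runs inside Hadoop and writes its output back to MarkLogic.

```python
import difflib
from xml.etree.ElementTree import canonicalize

def raw_fitness(result_xml, target_xml):
    """Count differing lines between canonical forms (0 means identical)."""
    # strip_text drops whitespace-only differences around text content
    result = canonicalize(result_xml, strip_text=True)
    target = canonicalize(target_xml, strip_text=True)
    # Put each tag on its own line so the text diff is per node, not per file
    result_lines = result.replace("><", ">\n<").splitlines()
    target_lines = target.replace("><", ">\n<").splitlines()
    diff = difflib.ndiff(result_lines, target_lines)
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))

same = raw_fitness("<a><b/></a>", "<a><b></b></a>")
different = raw_fitness("<a><c/></a>", "<a><b/></a>")
```

Here `same` is 0 because the two inputs are logically equal, while `different` is greater than 0; lower values mean the result is closer to the target.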
30. Step 2. Select individuals
• Probabilistic selection to choose which
individuals participate in genetic operations
[Diagram: Selected XSLT population]
Select individuals for genetic operations, based on their fitness
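One common way to do probabilistic selection is a roulette wheel, where each individual is picked with probability proportional to a weight. The sketch below assumes the weights are already "higher is better" (for example an adjusted fitness); the function name `roulette_select` is illustrative, and the talk does not state which selection scheme was used.

```python
import bisect
import itertools
import random

def roulette_select(population, weights, k, rng=random):
    """Pick k individuals with probability proportional to their weight."""
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    picks = []
    for _ in range(k):
        spin = rng.random() * total
        # First slot whose cumulative weight exceeds the spin
        index = bisect.bisect_right(cumulative, spin)
        picks.append(population[min(index, len(population) - 1)])
    return picks

parents = roulette_select(["xslt-1", "xslt-2", "xslt-3"], [0.0, 1.0, 0.0],
                          k=5, rng=random.Random(1))
```

With the weights above only "xslt-2" can ever be chosen; with realistic weights, fitter individuals are chosen more often but weaker ones still get an occasional chance, which helps keep diversity in the population.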
31. About fitness
• Raw fitness: the natural measure in
terms of the specific problem (here, a primitive
count of nodes in the treediffmerge patch)
• Standardized fitness: the lower the better
• Adjusted fitness: lies between 0 and 1
• Normalized fitness: lies between 0 and 1, with
sum of fitness values = 1
• In our case the lower the number of ‘different’
nodes the better, so use standardized fitness
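The relationship between these measures can be shown with a short calculation. The formulas follow the definitions in Koza's Genetic Programming (adjusted = 1 / (1 + standardized), normalized = adjusted / sum of adjusted); the function name and sample values are made up for illustration.

```python
def fitness_measures(standardized):
    """Return (adjusted, normalized) lists from standardized fitness values."""
    # Adjusted fitness: in (0, 1], and 1.0 for a perfect individual
    adjusted = [1.0 / (1.0 + s) for s in standardized]
    # Normalized fitness: adjusted values rescaled to sum to 1
    total = sum(adjusted)
    normalized = [a / total for a in adjusted]
    return adjusted, normalized

# Three individuals with 0, 1 and 3 'different' nodes
adjusted, normalized = fitness_measures([0, 1, 3])
print(adjusted)  # [1.0, 0.5, 0.25]
```

Normalized values can be used directly as selection probabilities, since they sum to 1 and are largest for the individuals with the fewest differing nodes.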
32. Step 3. Apply Primary Genetic Operations
Reproduction
[Diagram: Selected XSLT population → New generation]
Individual reproduced into new generation
33. Step 3. Primary Genetic Operations
Crossover (Recombination)
[Diagram: ‘Mom’ and ‘Dad’ from the selected XSLT population create 2 offspring in the new generation]
Select parents, then crossover creates 2 offspring
35. Crossover with xquery
xquery version "1.0-ml";
import module namespace mem = "http://xqdev.com/in-mem-update"
  at "/MarkLogic/appservices/utils/in-mem-update.xqy";

let $mom :=
  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/" as="item()*">
      <bar>help</bar>
    </xsl:template>
    <xsl:template match="text()" as="item()*"/>
  </xsl:stylesheet>
let $dad :=
  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/" as="item()*">
      <a><b><c>test</c></b></a>
    </xsl:template>
  </xsl:stylesheet>
let $momCount := fn:count($mom//.)
let $dadCount := fn:count($dad//.)
(: never want root node :)
let $momRdm := xdmp:random($momCount - 2) + 2
let $dadRdm := xdmp:random($dadCount - 2) + 2
(: node selection :)
let $momNode := ($mom//.)[$momRdm]
let $dadNode := ($dad//.)[$dadRdm]
(: crossover :)
let $newMom := mem:node-replace($momNode, $dadNode)
let $newDad := mem:node-replace($dadNode, $momNode)
return
  <result>
    <newMom>{$newMom}</newMom>
    <newDad>{$newDad}</newDad>
  </result>
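For readers without a MarkLogic instance, the same subtree crossover idea can be sketched with Python's standard library. This is an analogue, not a port: it swaps element nodes only (the XQuery version can also pick text nodes), the `crossover` function name is hypothetical, and it assumes each parent has at least one non-root element.

```python
import copy
import random
import xml.etree.ElementTree as ET

def crossover(mom, dad, rng=random):
    """Swap one random non-root element between copies of two trees."""
    mom, dad = copy.deepcopy(mom), copy.deepcopy(dad)

    def pick(tree):
        # Every (parent, child index) pair; the root itself is never picked
        slots = [(parent, i) for parent in tree.iter()
                 for i, _ in enumerate(parent)]
        return rng.choice(slots)

    (mom_parent, mi), (dad_parent, di) = pick(mom), pick(dad)
    # Exchange the two chosen subtrees
    mom_parent[mi], dad_parent[di] = dad_parent[di], mom_parent[mi]
    return mom, dad

mom = ET.fromstring("<m><x/></m>")
dad = ET.fromstring("<d><y/></d>")
new_mom, new_dad = crossover(mom, dad)
```

In this tiny case each parent has exactly one candidate node, so `new_mom` ends up containing `y` and `new_dad` containing `x`, while the original parents are left unchanged.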
36. Step 3. Secondary Genetic Operations
• Mutation: is a form of random crossover
• Permutation: Reorganize nodes
• Editing: evaluate a set of nodes
• Encapsulation: takes a branch and replaces
with 1 indivisible node
• Decimation: removes individual based on
domain specific criteria
37. Step 3. Secondary Genetic Operations
Mutation
[Diagram: ‘selected XSLT’ → ‘offspring xslt’, with a completely new set of instructions]
Pick a node and randomly mutate
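Mutation can be sketched as picking one node and replacing it with a freshly generated random subtree. The example below draws from the terminal set <a/> <b/> <c/> <d/> named earlier in the talk; the helper names, the depth limit, and the use of plain elements rather than XSLT instructions are simplifying assumptions.

```python
import copy
import random
import xml.etree.ElementTree as ET

TERMINALS = ["a", "b", "c", "d"]

def random_subtree(rng, depth=2):
    """Build a small random tree of terminal elements."""
    node = ET.Element(rng.choice(TERMINALS))
    if depth > 1:
        for _ in range(rng.randint(0, 2)):
            node.append(random_subtree(rng, depth - 1))
    return node

def mutate(individual, rng=random):
    """Return a copy with one random non-root element replaced."""
    mutant = copy.deepcopy(individual)
    slots = [(parent, i) for parent in mutant.iter()
             for i, _ in enumerate(parent)]
    parent, index = rng.choice(slots)
    parent[index] = random_subtree(rng)
    return mutant

original = ET.fromstring("<root><x/></root>")
mutant = mutate(original, random.Random(0))
```

Unlike crossover, the new material does not come from the existing population, which is why mutation can introduce structures that no current individual contains.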
40. Step 3. Secondary Genetic Operations
Encapsulation
[Diagram: ‘selected XSLT’ → ‘define new function’ → ‘XSLT’]
Identify useful subtrees and encapsulate by defining new function
41. Step 3. Secondary Genetic Operations
Decimation
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
</xsl:stylesheet>
[Tree diagram: <xsl:stylesheet/>]
Identify very poor fitness individuals and remove from population
42. Initial tests
• Initial population = 500, generations = 51
• Set initial genetic operation probabilities:
90% crossover on selected individuals
10% reproduction on selected individuals
0% secondary operations on selected individuals
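These probabilities can be expressed as a weighted random choice made once per selected individual. The dictionary and function names below are illustrative assumptions; only the 90/10/0 split comes from the slide.

```python
import random

# Probabilities from the initial tests
OPERATION_PROBS = {"crossover": 0.90, "reproduction": 0.10, "secondary": 0.00}

def choose_operation(rng=random):
    """Pick the genetic operation to apply to a selected individual."""
    names = list(OPERATION_PROBS)
    weights = [OPERATION_PROBS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)
picks = [choose_operation(rng) for _ in range(1000)]
```

With a weight of 0, secondary operations such as mutation are never chosen, so the only source of variation in these runs is recombination of material already in the population.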
43. Results
• Runs faster with more servers … extreme scale out –
unusual for GA
• Arrived quickly at a ‘correct’ solution
• Though in some runs the local optimum was a ‘wrong
solution’, e.g. an embedded literal
• Need to constrain XPath (baby steps)
• Need to constrain terminal set
• Enhance fitness definition
46. Results
• Needed more generations / more individuals
• Mutation operation needed to kick out of local
optima
47. Summary
• This approach can be applied to any language
parse tree (XQuery with xqueryparser.xq)
• Difficulties with little languages being
embedded
• Today, commercially applicable to generating
mapping solutions; more research required
• Illustrates applying the strengths of MarkLogic and Hadoop
together
• Will place code and results on GitHub soon …
48. References
• John R. Koza, Genetic Programming, MIT Press, 1992
• J. W. Hunt and M. D. McIlroy, An Algorithm for Differential
File Comparison, 1976