Hadoop and MarkLogic: Using the Genetic Algorithm to Generate Source Code

MarkLogic and Hadoop – Genetic Algorithm
Jim Fuller
email: jim.fuller@marklogic.com twitter: @xquery
Senior Engineer, Europe
19/09/12

Senior engineer

http://jim.fuller.name
http://exslt.org @xquery XSLT UK 2001
http://www.xmlprague.cz

@perl6

Overview
• Genetic Algorithm Refresher
• MarkLogic/Hadoop architecture for
implementing GA
• Installing Hadoop
• Installing MarkLogic Connector
• Problem Statement
• Review of GA process runs
• Summary

What's the Problem?
• Big data breathes life into older algorithmic
approaches
• I thought it would be interesting to turn the ‘bigdata’
problem on its head (code versus data)
• Demonstrate Hadoop with MarkLogic, each playing
to the other's strengths

Get out of your comfort zone
• This talk is slightly different than the
description … 150 slides! Part I.
• It has Hadoop/MarkLogic and the genetic
algorithm, but I have focused on the process
and early results
• Doing data science means pushing yourself
out of your comfort zone
• Start simple, then iterate

Genetic Algorithm Refresher
• The Genetic Algorithm (GA) is a model of the
evolution of a population of artificial individuals,
emulating Darwinian selection.

• Each individual is a chromosome which contains
discrete units of information (genes).

• The driving force behind the search for new and
better solutions is the retention and combination of
good partial solutions to a problem.

Abridged Genetic Algorithm
• The Fundamental Theorem of Genetic Algorithms

M(H, t): number of individuals in the population at generation t that match the schema H.
f(H): average fitness of the individuals with the schema H.
F: average fitness of the entire population.
p1: probability of the schema being destroyed by crossover.
p2: probability of the schema being destroyed by mutation.
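
With the symbols defined above, the theorem (Holland's schema theorem) is usually stated as a lower bound on how many individuals carry schema H in the next generation. The form below is the standard textbook statement written in this slide's notation, offered as a sketch rather than a reproduction of the original formula:

```latex
M(H, t+1) \;\ge\; M(H, t) \cdot \frac{f(H)}{F} \cdot \left(1 - p_1 - p_2\right)
```

Read this way, a schema whose average fitness f(H) exceeds the population average F, and which is unlikely to be disrupted by crossover or mutation, is carried by a growing number of individuals from one generation to the next.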

GA operations
• Reproduction: An individual is perfectly replicated
to a new population
• Crossover ( Recombination ): Parental material
is recombined to create offspring to join new
population
• Mutation: random changes (is key for pushing past
local optima)
• Permutation: reordering
• Editing: evaluation to a terminal
• Encapsulation: single indivisible function
• Decimation: removal of individuals

Typical GA Process
Step 0. Create a random initial population of individuals

Step 1. Evaluate the fitness of each individual

Step 2. Select individuals, according to their fitness, to
participate in generating offspring (moms + dads)

Step 3. Apply primary and secondary genetic operations to
generate new offspring population

Step 4. Repeat steps 1, 2 and 3 to generate X number of
generations

Step 5. Choose the fittest individual of the last generation, based on
the stop criteria
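
The steps above can be sketched as a small loop. This is a minimal illustration in Python on a toy bit-string problem, not the XSLT setup used in the talk. The names `evolve`, `zeros`, `random_bits` and `one_point` are hypothetical, and so are the population size, the generation count and the 90% crossover rate.

```python
import random

def evolve(fitness, random_individual, crossover,
           pop_size=50, generations=20, p_crossover=0.9, seed=0):
    """Generic GA loop; fitness is standardized (lower is better, 0 is perfect)."""
    rng = random.Random(seed)
    # Step 0: create a random initial population
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):              # Step 4: repeat steps 1-3
        # Step 1: evaluate the fitness of each individual
        scores = [fitness(ind) for ind in population]
        if min(scores) == 0:                  # stop criterion: perfect match found
            break
        # Step 2: fitness-based selection weights (better individuals weigh more)
        weights = [1.0 / (1.0 + s) for s in scores]
        # Step 3: build the offspring population by crossover or reproduction
        offspring = []
        while len(offspring) < pop_size:
            if rng.random() < p_crossover:
                mom, dad = rng.choices(population, weights=weights, k=2)
                offspring.extend(crossover(mom, dad, rng))
            else:
                offspring.append(rng.choices(population, weights=weights, k=1)[0])
        population = offspring[:pop_size]
    # Step 5: choose the fittest individual of the last generation
    return min(population, key=fitness)

# Toy problem (assumption): evolve a 16-bit list towards all ones.
LENGTH = 16

def zeros(bits):
    return bits.count(0)

def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(LENGTH)]

def one_point(mom, dad, rng):
    cut = rng.randint(1, LENGTH - 1)
    return mom[:cut] + dad[cut:], dad[:cut] + mom[cut:]

best = evolve(zeros, random_bits, one_point)
print(best, zeros(best))
```

Swapping in a different representation only requires new `fitness`, `random_individual` and `crossover` functions; the loop itself stays the same.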

Endemic GA Problems
• Finding the optimal solution to complex, high-
dimensional, multimodal problems often
requires a very expensive fitness function
• Hard to pose the problem statement, e.g. the stop
criteria are not clear in every problem
• Premature convergence on local optima

Bit strings vs Lisp Parse Trees
(+ (* 2 3) 4) evaluates to 10, and its symbolic
expression tree looks like:
    +
   / \
  *   4
 / \
2   3

Manipulating hierarchical computer programs is more
expressive than manipulating linear strings
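
To make the parse-tree idea concrete, here is a minimal Python sketch that stores the expression as nested tuples and evaluates it recursively. It assumes the inner operator is multiplication, which is what makes the expression evaluate to 10. The names `OPS` and `evaluate` are illustrative only.

```python
import operator

# Function set for this sketch (assumption): three binary arithmetic operators.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree):
    """Evaluate a parse tree given as (operator, child, child, ...) tuples."""
    if isinstance(tree, tuple):
        op, *children = tree
        values = [evaluate(child) for child in children]
        result = values[0]
        for value in values[1:]:
            result = OPS[op](result, value)
        return result
    return tree  # a terminal: a plain number

expr = ("+", ("*", 2, 3), 4)
print(evaluate(expr))  # 10
```

Because every subtree is itself a valid expression, a genetic operation can swap or replace any subtree and still obtain a program that evaluates.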

XSLT – markup is useful!
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:template match="a">
<d/>
<c/>
</xsl:template>
</xsl:stylesheet>

As a tree:
<xsl:stylesheet/>
  <xsl:template/>
    <d/>
    <c/>

Obvious difficulties to address: different node types and XPath

Objective: Generate an XSLT program that transforms the source XML into result XML which is equivalent to the target XML
Terminal set: <a/> <c/> <d/>
Function set: Subset of XSLT instructions
Fitness cases: One fitness case
Raw fitness: TreeDiffMerge result (node count) + standard diff
Standardized fitness: Same as raw fitness; approaching 0 is better fitness
Parameters: M=500, G=51

Source XML
<a>

<c>
<d></d>
</c>

</a>

Target XML – clear stop criteria

<a>

<c>
<d></d>
</c>

</a>

Generation zero
• XML Instance Generator which is part of the Sun
Multi-Schema Validator
• Sun Multi-Schema Validator
• The following can do it
– OxygenXML
– Visual Studio
– Eclipse
• Ended up using IBM XML Generate – very old;
supply it with a schema and it generates
example XML

Step 1a: Evaluate against Input

[Diagram: each XSLT in the generation is applied as a transformation to Source.xml, producing result.xml]

MarkLogic evaluates each XSLT and places the result into the property of the XSLT document itself

Step 1b: Evaluate Fitness

[Diagram: each XSLT in the generation is applied as a transformation to Source.xml, producing result.xml; Hadoop then evaluates fitness]

Fitness is computed with TreeDiffMerge + standard diff

XML Diff issues
• Many diff algorithms are based on a paper
published in 1976 by J. W. Hunt and M. D.
McIlroy, An Algorithm for Differential File
Comparison
• XML has structure; text-based diff programs
do not take this into account
• Simple example: <footie/> versus
<footie></footie>; logically these are equal
• XML canonicalization helps!
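
The <footie/> example can be checked with a short Python sketch. It uses the standard library's C14N 2.0 support (`xml.etree.ElementTree.canonicalize`, Python 3.8+) as a stand-in for whatever canonicalizer the Hadoop job used, so the tool choice and the helper name `logically_equal` are assumptions.

```python
from xml.etree.ElementTree import canonicalize

def logically_equal(xml_a, xml_b):
    """Compare two XML strings after canonicalization.

    strip_text=True also ignores whitespace-only differences such as
    indentation, which a plain text diff would report as changes.
    """
    return canonicalize(xml_a, strip_text=True) == canonicalize(xml_b, strip_text=True)

print(logically_equal("<footie/>", "<footie></footie>"))      # True
print(logically_equal("<a><c/></a>", "<a>\n  <c></c>\n</a>"))  # True
print(logically_equal("<a/>", "<b/>"))                        # False
```

Canonicalizing both documents first means the later diff step only sees differences that matter structurally.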

XML Canonicalization + TreeDiffMerge

Example 1

TreeDiffMerge difference:
<?xml version="1.0" encoding="UTF-8"?>
<diff xmlns:diff='http://diff.org'>
<diff:insert dst="1">
<a>
<c>
<d />
</c>
</a>
</diff:insert>
</diff>

Result:
<?xml version="1.0" encoding="UTF-8"?>
<root/>

Example 2

TreeDiffMerge difference:
<?xml version="1.0" encoding="UTF-8"?>
<diff xmlns:diff='http://diff.org'>
<diff:copy src="2" dst="1">
<c>
<diff:copy src="16" dst="2" />
</c>
</diff:copy>
</diff>

Result:
<?xml version="1.0" encoding="utf-8"?><root>
<a/><a><a><c/><c><a><d/></a><c/></c></a>
<a/><c/>
<d/>
<a/></a><d><a><c/><a/><a/></a><c/></d><c/>

Simple if we match: we are done!
TreeDiffMerge difference:
<?xml version="1.0" encoding="UTF-8"?>
<diff />

Result:
<?xml version="1.0" encoding="utf-8"?><root><a>
<c>
<d/>
</c>
</a></root>

MarkLogic/Hadoop Architecture
Interlude
[Diagram: MarkLogic and Hadoop connected through the Connector API via XDBC]

Hadoop Installation Recipe
• installing Hadoop (setting up a single-node cluster)
– brew install hadoop
– make sure ssh is set up properly
– generate id_rsa and id_rsa.pub
– append the public key to authorized_keys
• cat id_rsa.pub >> authorized_keys
– enable Remote Login on Mac OS X
• configure Hadoop
– edit core-site.xml
– edit mapred-site.xml
• ssh localhost
– format HDFS
• hadoop namenode -format
• bin/start-all.sh
– if it asks for a password, there is a problem with your ssh setup
• to check that all is well
– run jps
– ps ax | grep hadoop | wc -l
– Check
• http://localhost:50030/jobtracker.jsp
• http://localhost:50060/tasktracker.jsp
• http://localhost:50070/dfshealth.jsp

Installing ML Hadoop Connector
• Copy the latest XCC and connector jars to the Hadoop
lib directory
• Copy the ml-examples jar as well
• Copy the MarkLogic Hadoop conf to the Hadoop conf directory

Starting it all Up
• Start MarkLogic
• Create database
• Create XDBC connection (how Hadoop and MarkLogic
communicate)
• Edit marklogic-hello-world.xml

• Make sure Hadoop is started

Starting it all Up
• Load test Data via query console

xquery version "1.0-ml";

let $hello := <data><child>hello mom</child></data>
let $world := <data><child>world event</child></data>

return(
xdmp:document-insert("hello.xml", $hello),
xdmp:document-insert("world.xml", $world)
)

Run hello world example
• bin/start-all.sh

• hadoop jar lib/marklogic-xcc-examples-6.0.20120914.jar com.marklogic.mapreduce.examples.HelloWorld

• Review https://gist.github.com/2484318

Fitness (hadoop) step
• Applies XML canonicalization
• Performs TreeDiffMerge and writes the output to the
original XSLT document's XML property
• Performs a text diff and writes the result to the original XSLT
document's XML property

Step 2. Select individuals
• Probabilistic selection to choose which
individuals participate in genetic operation

Selected XSLT population

Select individuals for genetic operations, based on their fitness
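
Probabilistic selection can be sketched in Python as fitness-proportionate ("roulette wheel") sampling. This is one common scheme and an assumption here, since the slides do not name the method used. The function name `select` and the weighting 1 / (1 + s), where s is a lower-is-better fitness score, are illustrative.

```python
import random

def select(population, standardized_fitness, k, rng):
    """Draw k individuals with replacement; lower fitness values are favoured.

    Each individual's chance is proportional to 1 / (1 + s), so a perfect
    individual (s == 0) gets the largest share of the wheel.
    """
    weights = [1.0 / (1.0 + s) for s in standardized_fitness]
    return rng.choices(population, weights=weights, k=k)

rng = random.Random(1)
picks = select(["best", "poor", "worse"], [0, 10**6, 10**6], k=100, rng=rng)
print(picks.count("best"), "of", len(picks), "picks went to the fittest individual")
```

Weak individuals keep a small, non-zero chance of being picked, which preserves some diversity in the population.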

About fitness
• Raw fitness: the natural representation in
terms of the specific problem (here, a primitive
count of the nodes in the TreeDiffMerge patch)
• Standardized fitness: the lower the better
• Adjusted fitness: lies between 0 and 1
• Normalized fitness: lies between 0 and 1, with the
sum of fitness values = 1
• In our case, the lower the number of ‘different’
nodes the better, so use standardized fitness
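
The four fitness measures can be related with a short Python sketch. The adjusted and normalized formulas follow the definitions in Koza's Genetic Programming. Counting every element below the root of the patch as raw fitness is an assumption about what "primitive counting nodes" means. All function names are illustrative.

```python
import xml.etree.ElementTree as ET

def raw_fitness(patch_xml):
    """Count the elements below the root of a diff patch (assumed definition)."""
    root = ET.fromstring(patch_xml)
    return sum(1 for _ in root.iter()) - 1  # exclude the <diff> root itself

def adjusted_fitness(standardized):
    """Map standardized fitness (0 is best) into (0, 1], where 1 is best."""
    return 1.0 / (1.0 + standardized)

def normalized_fitness(standardized_values):
    """Scale adjusted fitness so the values over the population sum to 1."""
    adjusted = [adjusted_fitness(s) for s in standardized_values]
    total = sum(adjusted)
    return [a / total for a in adjusted]

empty_patch = "<diff/>"
insert_patch = ('<diff xmlns:diff="http://diff.org"><diff:insert dst="1">'
                "<a><c><d/></c></a></diff:insert></diff>")

scores = [raw_fitness(empty_patch), raw_fitness(insert_patch)]
print(scores)                      # [0, 4]
print(normalized_fitness(scores))  # the empty patch gets the larger share
```

Here raw and standardized fitness coincide because fewer differing nodes is already better, which matches the choice stated above.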

Step 3. Apply Primary Genetic Operations
Reproduction


New generation
Individual reproduced into new generation

Step 3. Primary Genetic Operations
Crossover ( Recombination )


[Diagram: ‘Mom’ and ‘Dad’ are selected and create 2 offspring in the new generation]

Select parents, then crossover creates 2 offspring

Step 3. Primary Genetic Operations
Crossover ( Recombination )

[Diagram: ‘Mom XSLT’ and ‘Dad XSLT’ produce two ‘offspring XSLT’ in the new generation]

Swap nodes between the selected parent XSLTs

Crossover with xquery
xquery version "1.0-ml";
import module namespace mem = "http://xqdev.com/in-mem-update"
at "/MarkLogic/appservices/utils/in-mem-update.xqy" ;

let $mom := <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/" as="item()*">
<bar>help</bar>
</xsl:template>
<xsl:template match="text()" as="item()*"/>
</xsl:stylesheet>

let $dad := <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/" as="item()*">
<a><c>test</c></a>
</xsl:template>
</xsl:stylesheet>

let $momCount := fn:count($mom//.)
let $dadCount := fn:count($dad//.)

(: never want the root node: xdmp:random($n) returns 0..$n,
   so adding 2 gives a position between 2 and the node count :)
let $momRdm := xdmp:random($momCount - 2) + 2
let $dadRdm := xdmp:random($dadCount - 2) + 2

(: node selection :)
let $momNode := ($mom//.)[$momRdm]
let $dadNode := ($dad//.)[$dadRdm]

(: crossover :)
let $newMom := mem:node-replace( $momNode, $dadNode )
let $newDad := mem:node-replace( $dadNode, $momNode )
return
<result>
<newMom>{$newMom}</newMom>
<newDad>{$newDad}</newDad>
</result>
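
For readers without a MarkLogic instance, the same subtree swap can be sketched in Python with the standard library's ElementTree. This is an illustrative analogue, not the code used in the talk. Unlike the XQuery above it only swaps element nodes (text nodes are never selected), and it assumes each parent has at least one child element. The names `child_slots`, `crossover` and `size` are hypothetical.

```python
import copy
import random
import xml.etree.ElementTree as ET

XSL = "http://www.w3.org/1999/XSL/Transform"

def child_slots(root):
    """List (parent, index) pairs addressing every non-root element."""
    return [(parent, i) for parent in root.iter() for i in range(len(parent))]

def size(root):
    """Number of elements in a tree, root included."""
    return sum(1 for _ in root.iter())

def crossover(mom, dad, rng):
    """Return two offspring with one random subtree swapped; parents are not modified."""
    mom, dad = copy.deepcopy(mom), copy.deepcopy(dad)
    m_parent, m_i = rng.choice(child_slots(mom))   # never the root itself
    d_parent, d_i = rng.choice(child_slots(dad))
    m_parent[m_i], d_parent[d_i] = d_parent[d_i], m_parent[m_i]
    return mom, dad

mom = ET.fromstring(
    f'<xsl:stylesheet version="2.0" xmlns:xsl="{XSL}">'
    '<xsl:template match="/"><bar>help</bar></xsl:template>'
    '<xsl:template match="text()"/></xsl:stylesheet>')
dad = ET.fromstring(
    f'<xsl:stylesheet version="2.0" xmlns:xsl="{XSL}">'
    '<xsl:template match="/"><a><c>test</c></a></xsl:template>'
    '</xsl:stylesheet>')

new_mom, new_dad = crossover(mom, dad, random.Random(42))
print(ET.tostring(new_mom, encoding="unicode"))
print(ET.tostring(new_dad, encoding="unicode"))
```

Addressing nodes by (parent, index) rather than by the node itself is what makes the replacement possible, since ElementTree elements do not know their parent.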

Step 3. Secondary Genetic Operations

• Mutation: is a form of random crossover
• Permutation: Reorganize nodes
• Editing: evaluate a set of nodes
• Encapsulation: takes a branch and replaces
with 1 indivisible node
• Decimation: removes individual based on
domain specific criteria

mutation

[Diagram: a node of the ‘selected XSLT’ is replaced by a completely new set of instructions, giving the ‘offspring XSLT’]

Pick a node and randomly mutate
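
Mutation can be sketched in Python as replacing one randomly chosen subtree with a freshly generated one. The terminal set below is the one from the problem statement. The depth limit, the branching factor and the names `random_subtree` and `mutate` are assumptions made for illustration.

```python
import copy
import random
import xml.etree.ElementTree as ET

TERMINALS = ("a", "c", "d")  # terminal set from the problem statement

def random_subtree(rng, depth=2):
    """Grow a small random tree of terminal elements (at most `depth` levels of children)."""
    node = ET.Element(rng.choice(TERMINALS))
    if depth > 0:
        for _ in range(rng.randint(0, 2)):
            node.append(random_subtree(rng, depth - 1))
    return node

def mutate(individual, rng):
    """Return a copy of `individual` with one non-root subtree replaced at random."""
    mutant = copy.deepcopy(individual)
    slots = [(parent, i) for parent in mutant.iter() for i in range(len(parent))]
    if slots:  # a lone root has nothing to mutate
        parent, index = rng.choice(slots)
        parent[index] = random_subtree(rng)
    return mutant

individual = ET.fromstring("<root><a><c/></a><d/></root>")
before = ET.tostring(individual, encoding="unicode")
mutant = mutate(individual, random.Random(7))
print(ET.tostring(mutant, encoding="unicode"))
```

Because the replacement subtree is generated from scratch, mutation can introduce structures that no amount of crossover between existing individuals would produce.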

permutation


Permutated node order

editing


Replace node with evaluated expression

encapsulation

‘selected XSLT’ ‘define new function’
‘XSLT’

Identify useful subtrees and encapsulate by defining new function

decimation

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
</xsl:stylesheet>

<xsl:stylesheet/>

Identify very poor fitness individuals and remove from population

Initial tests
• Initial Population= 500, generations = 51
• Set initial genetic operation probabilities:
90% crossover on selected individuals
10% reproduction on selected individuals
0% secondary operations on selected individuals

Results
• Runs faster with more servers … extreme scale-out –
unusual for GA
• Arrived quickly at a ‘correct’ solution
• Though in some runs the local optimum was a ‘wrong
solution’, e.g. an embedded literal
• Need to constrain XPath (baby steps)
• Need to constrain terminal set
• Enhance fitness definition

Target XML

<a>

<c/>
<d/>
</a>

Results
• Needed larger generations/ more individuals
• Mutation operation needed to kick out of local
optima

Summary
• This approach can be applied to any language
parse tree (XQuery with xqueryparser.xq)
• Difficulties with little languages being
embedded
• Today, commercially applicable to generating
mapping solutions; more research required
• Illustrates applying the strengths of MarkLogic and Hadoop
together
• Will place code and results on GitHub soon …

References
• John R. Koza, Genetic Programming, MIT Press, 1992
• J. W. Hunt and M. D. McIlroy, An Algorithm for Differential
File Comparison, 1976
