Machine learning in PHP

MACHINE LEARNING
IN PHP
The roots of education are bitter, but the fruit is sweet
Verona, Italia, 2016

AGENDA
How to teach tricks to your PHP
Application: searching for code in comments
Complex learning

SPEAKER
Damien Seguy
Exakat CTO
Static analysis of PHP code

MACHINE LEARNING
Teaching the machine
Supervised learning: learn first, then apply
The application builds its own model: training phase
It then applies that model to real cases: application phase

APPLICATIONS
Play go, chess, tic-tac-toe and beat everyone else
Fraud detection and risk analysis
Automated translation or automated transcription
OCR and face recognition
Medical diagnostics
Walk, welcome guests at hotels, play football
Finding good PHP code

PHP APPLICATIONS
Recommendation systems
Predicting user behavior
Spam filtering
Converting users into customers
ETA
Detect code in comments

REAL USE CASE
Identify code in comments
Classic problem
Good problem for machine learning
Complex, no simple solution
A lot of data and expertise are available

SUPERVISED TRAINING
[Diagram: History data → Training → Model; Real data → Model → Results]

THE FANN EXTENSION
ext/fann (https://pecl.php.net/package/fann)
Fast Artificial Neural Network
http://leenissen.dk/fann/wp/
Neural networks in PHP
Works on PHP 7, thanks to the hard work of Jakub Zelenka
https://github.com/bukka/php-fann

NEURAL NETWORKS
Imitation of nature
Input layer
Output layer
Intermediate layers
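
To make the layer vocabulary concrete, here is a minimal hand-written forward pass. It is only an illustration and not how ext/fann works internally: the function name forwardLayer(), the weights and the biases are all hypothetical, and tanh stands in for the activation function. Each neuron sums its weighted inputs, adds a bias, and squashes the result between -1 and 1.

```php
<?php
// Illustrative sketch only: not FANN's implementation.
// Computes the outputs of one layer from the outputs of the previous layer.
function forwardLayer(array $inputs, array $weights, array $biases): array
{
    $outputs = [];
    foreach ($weights as $n => $neuronWeights) {
        // Weighted sum of all inputs, plus the bias of this neuron
        $sum = $biases[$n];
        foreach ($neuronWeights as $i => $weight) {
            $sum += $weight * $inputs[$i];
        }
        // Activation: keeps the output strictly between -1 and 1
        $outputs[] = tanh($sum);
    }
    return $outputs;
}

// Hypothetical network: 2 inputs, 2 intermediate neurons, 1 output
$hidden = forwardLayer([1.0, 0.5], [[0.4, -0.2], [0.3, 0.8]], [0.0, -0.1]);
$output = forwardLayer($hidden, [[1.0, -1.0]], [0.0]);
print $output[0] . PHP_EOL;
```

Training is the search for weights and biases that make this output match the expected answers; that search is the part delegated to FANN.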

INITIALIZATION
<?php
// fann_create_standard() expects the total number of layers,
// input and output layers included, followed by the size of each layer
$num_layers = 3;
$num_input = 5;
$num_neurons_hidden = 3;
$num_output = 1;
$ann = fann_create_standard($num_layers, $num_input,
    $num_neurons_hidden, $num_output);
// Activation function
fann_set_activation_function_hidden($ann, FANN_SIGMOID_SYMMETRIC);
fann_set_activation_function_output($ann, FANN_SIGMOID_SYMMETRIC);

PREPARING DATA
Raw data → Extract → Filter → Human review → FANN ready

EXPERT AT WORK
// Test if the if is in a compressed format
// none need yet
// icon
// There is a parser specified in `Parser::$KEYWORD_PARSERS`
// $result should exist, regardless of $_message
// $a && $b and multidimensional
// numGlyphs + 1
// TODO : fix this; var_dump($var);
// if(ob_get_clean()){
//$annots .= ' /StructParent ';
// $cfg['Servers'][$i]['controlpass'] = 'pmapass';

INPUT VECTOR
'length' : size of the comment
'countDollar' : number of $
'countEqual' : number of =
'countObjectOperator' : number of -> operators ($o->p)
'countSemicolon' : number of semicolons ;
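
The five characteristics above can be computed with plain PHP string functions. The deck calls a makeVector() function later on without showing its body, so the version below is a guess at a minimal implementation, not the author's code. It assumes the values are passed to FANN in the order listed above.

```php
<?php
// Hypothetical implementation of makeVector(): the talk never shows its body.
// Returns the five characteristics, in the order of the INPUT VECTOR slide.
function makeVector(string $comment): array
{
    return [
        strlen($comment),              // 'length' (in bytes)
        substr_count($comment, '$'),   // 'countDollar'
        substr_count($comment, '='),   // 'countEqual'
        substr_count($comment, '->'),  // 'countObjectOperator'
        substr_count($comment, ';'),   // 'countSemicolon'
    ];
}

$vector = makeVector('//$gvars = $this->getGraphicVars();');
print implode(' ', $vector) . PHP_EOL;
```

Note that these are raw counts on very different scales (a length can be in the hundreds while the other counts stay in single digits), and that '=' is also counted inside '==' or '=>'.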

INPUT DATA
46 5 1
825 0 0 0 1
0
37 2 0 0 0
0
55 2 2 0 1
1
61 2 1 3 1
1
...
* This file is part of Exakat.
*
* Exakat is free software: you can redist
* it under the terms of the GNU Affero Ge
* the Free Software Foundation, either ve
* (at your option) any later version.
*
* Exakat is distributed in the hope that
* but WITHOUT ANY WARRANTY; without even
* MERCHANTABILITY or FITNESS FOR A PARTIC
* GNU Affero General Public License for m
*
* You should have received a copy of the
* along with Exakat. If not, see <http:/
*
* The latest code can be found at <http:/
*
*/
// $x[3] or $x[] and multidimensional
//if ($round == 3) { die('Round '.$round);
//$this->errors[] = $this->language->get('
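First line of the file: number of training cases, number of incoming data per case, number of outgoing data per case
Then, for each case: one line of incoming data, followed by one line of outgoing data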
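
Once the comments have been reviewed by a human, the training file can be generated rather than typed. The sketch below assumes each reviewed case is an array with 'input' and 'output' keys; that structure and the function name toFannTrainingData() are hypothetical. The three sample cases reuse values shown on this slide.

```php
<?php
// Hypothetical helper: serializes reviewed cases to the FANN training layout:
// a header line (cases, inputs, outputs), then input and output lines in turn.
function toFannTrainingData(array $cases): string
{
    if ($cases === []) {
        throw new InvalidArgumentException('At least one case is required');
    }
    $numInputs  = count($cases[0]['input']);
    $numOutputs = count($cases[0]['output']);

    $lines = [count($cases) . ' ' . $numInputs . ' ' . $numOutputs];
    foreach ($cases as $case) {
        $lines[] = implode(' ', $case['input']);
        $lines[] = implode(' ', $case['output']);
    }
    return implode("\n", $lines) . "\n";
}

$cases = [
    ['input' => [825, 0, 0, 0, 1], 'output' => [0]],  // license header: not code
    ['input' => [37, 2, 0, 0, 0],  'output' => [0]],
    ['input' => [55, 2, 2, 0, 1],  'output' => [1]],  // commented-out code
];
$data = toFannTrainingData($cases);
print $data;
```

Writing the result with file_put_contents('incoming.data', $data) would produce a file in the shape expected by the training step.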

TRAINING
$max_epochs = 500000;
// Assumed value: the original slide uses this variable without setting it
$epochs_between_reports = 1000;
$desired_error = 0.001;
// the actual training
if (fann_train_on_file($ann, 'incoming.data', $max_epochs,
        $epochs_between_reports, $desired_error)) {
    fann_save($ann, 'model.out');
}
fann_destroy($ann);
?>

TRAINING
47 cases
5 characteristics
3 hidden neurons
+ 5 input + 1 output
Duration: 5.711 s

APPLICATION
[Diagram: History data → Training → Model; Real data → Model → Results]

APPLICATION
<?php
$ann = fann_create_from_file('model.out');
$comment = '//$gvars = $this->getGraphicVars();';
// makeVector() computes the five input characteristics of the comment
$input = makeVector($comment);
$results = fann_run($ann, $input);
if ($results[0] > 0.8) {
    print "\"$comment\" -> $results[0]\n";
}
?>

RESULTS > 0.8
Answer expected between 0 and 1
Actual values range from -14 to 0.999
The closer to 1, the safer it is to call it code. The closer to 0, the safer it is to call it not code.
Is this a percentage? Is this a carrot count?
It's a mix of counts…

[Chart: one axis from -16 to 0, the other from 60 to 100]

REAL CASES
Tested on 14093 comments
Duration: 367.01 ms
Found 1960 issues (14%)

0.99999893
// $cfg['Servers'][$i]['controlhost'] = '';
0.99999928
//$_SESSION['Import_message'] = $message->getDisplay();
0.99999928
/*
if (defined('SESSIONUPLOAD')) {
    // write sessionupload back into the loaded PMA session
    $sessionupload = unserialize(SESSIONUPLOAD);
    foreach ($sessionupload as $key => $value) {
        $_SESSION[$key] = $value;
    }
    // remove session upload data that are not set anymore
    foreach ($_SESSION as $key => $value) {
        if (mb_substr($key, 0, mb_strlen(UPLOAD_PREFIX))
            == UPLOAD_PREFIX
            && ! isset($sessionupload[$key])
        ) {
            unset($_SESSION[$key]);
        }
    }

0.98780382
//LEAD_OFFSET = (0xD800 - (0x10000 >> 10)) = 55232
0.99361396
// We have server(s) => apply default configuration

0.98383027
// Duration = as configured
0.99999928
// original -> translation mapping
0.97590065
// = (   59 x 84   ) mm  = (  2.32 x 3.31  ) in

[Diagram: comments found by FANN versus the target comments: true positives, false positives, true negatives, false negatives]
// $cfg['Servers'][$i]['table_coords'] = 'pma__tabl
//(isset($attribs['height'])?$attribs['height']: 1)
// if ($key != null) did not work for index "0"
// the PASSWORD() function
0.99999923
0.73295981
0.99999851
0.2104115
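
The four quadrants can be counted automatically once a human has labeled a sample of the results. A minimal sketch, assuming the 0.8 threshold used in the application code; the function name quadrant() and the sample scores and labels are hypothetical and not taken from the talk.

```php
<?php
// Hypothetical helper: names the quadrant for one comment, given the score
// returned by the network and the label assigned by a human reviewer.
function quadrant(float $score, bool $isCode, float $threshold = 0.8): string
{
    $found = $score > $threshold;
    if ($found) {
        return $isCode ? 'true positive' : 'false positive';
    }
    return $isCode ? 'false negative' : 'true negative';
}

// Illustrative data only: [score, human says it is code]
$reviewed = [
    [0.99, true],
    [0.95, false],
    [0.40, true],
    [0.10, false],
];

$counts = [
    'true positive'  => 0,
    'false positive' => 0,
    'true negative'  => 0,
    'false negative' => 0,
];
foreach ($reviewed as [$score, $isCode]) {
    $counts[quadrant($score, $isCode)]++;
}
print_r($counts);
```

Moving the threshold trades one kind of error for the other: a higher value reports fewer false positives but misses more real code.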

RESULTS
1960 issues
More than 50% false positives
After an easy cleanup, 822 issues reported
14k comments, analyzed in 367 ms
Total coding time: 27 min
// = ( 59 x 84 ) mm = ( 2.32 x 3.31 ) in
/* vim: set expandtab sw=4 ts=4 sts=4: */

LEARN BETTER, NOT HARDER
Better training data
Improve characteristics
Configure the neural network
Change algorithm
Automate learning
Update constantly
[Diagram: History data → Training → Model; Real data → Model → Results; feedback from Results into Training]

BETTER TRAINING DATA
More data, more data, more data
Varied situations, real-case situations
Include specific cases
Experience is essential
https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

IMPROVE CHARACTERISTICS
Add new characteristics
Remove the ones that are less useful
Find the right set of characteristics

NETWORK CONFIGURATION
Input vector
Intermediate neurons
Activation function
Output vector
[Chart: time of training (ms), from 0 to 20000; x-axis from 1 to 10; series for 1, 2, 3 and 4 layers]

CHANGE ALGORITHM
Add more data before changing the algorithm
Try the cascade2 algorithm from FANN
0.6 => 0 found
0.5 => 2 found
Not found by the first algorithm

FINDING THE BEST
Test with 2-4 layers
10 neurons
Measure results
[Chart: y-axis from 0 to 9000; x-axis from 1 to 13; series for 1, 2, 3 and 4 layers]

DEEP LEARNING
Chaining the neural networks
Auto-encoders
Unsupervised Learning
Genetic algorithms, ant colony algorithms

OTHER TOOLS
PHP ext/fann
R language
https://github.com/kachkaev/php-r
Scikit-learn
https://github.com/scikit-learn/scikit-learn
Mahout
https://mahout.apache.org/

@exakat
https://joind.in/talk/42120
GRAZIE

OTHER CONFIGURATIONS
Activation function
FANN_SIGMOID_SYMMETRIC
FANN_LINEAR
FANN_THRESHOLD
FANN_SIN_SYMMETRIC

[Charts: Linear, Threshold, Tangent, Gaussian, Quadratic, Sigmoid]

WHICH APPLICATIONS?
Non-deterministic problems
Rule out everything that can be found systematically
Access to expertise and to vectors of characteristics
Final layer after the results
Classification, prioritization, fast approximation

REINFORCEMENT LEARNING
[Diagram: Software, Real world, Reward, Action, Reaction]

GENETIC ALGORITHMS
[Diagram: Population, Selection, Reproduction, Variations, Population]
