Machine learning is teaching the computer how to learn by itself. It is far easier than it sounds, especially when you have a small data set and a good level of expertise in your field. Classifying objects, predicting who will buy, or spotting code in comments can be achieved with nature-inspired algorithms such as neural networks, genetic algorithms, or ant colony algorithms. PHP is in a good position to make use of these techniques and to take advantage of related technologies like FANN. By the end of the session, you'll know where you want to try it.
4. Machine Learning
• Teaching the machine
• Supervised learning: learning, then applying
• The application builds its own model: training phase
• It applies its model to real cases: applying phase
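The two phases above can be sketched in a few lines. This is a minimal illustration only: it uses a nearest-centroid classifier rather than the neural network discussed later in the deck, and the toy data and the names `train` and `apply_model` are hypothetical.

```python
from math import dist


def train(samples):
    """Training phase: build a model (one mean vector per label) from labeled examples."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def apply_model(model, features):
    """Applying phase: return the label whose centroid is closest to the new case."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical toy training set: (features, label) pairs.
training_set = [
    ([1.0, 1.0], "a"),
    ([1.2, 0.8], "a"),
    ([5.0, 5.0], "b"),
    ([4.8, 5.2], "b"),
]
model = train(training_set)
```

Once `model` is built, `apply_model(model, [0.9, 1.1])` classifies a case the model has never seen, which is the whole point of separating the two phases.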
5. Applications
• Play Go, chess, tic-tac-toe and beat everyone else
• Fraud detection and risk analysis
• Automated translation or automated transcription
• OCR and face recognition
• Medical diagnostics
• Walking, welcoming guests at hotels, playing football
• Finding good PHP code
6. PHP Applications
• Recommendation systems
• Predicting user behavior
• Spam detection
• Converting users into customers
• Estimated time of arrival (ETA)
• Detecting code in comments
7. Real use case
• Identify code in comments
• Classic problem
• Good problem for machine learning
• Complex, no simple solution
• A lot of data and expertise are available
9. The Fann Extension
• ext/fann (https://pecl.php.net/package/fann)
• Fast Artificial Neural Network
• http://leenissen.dk/fann/wp/
• Neural networks in PHP
• Works on PHP 7, thanks to the hard work of Jakub Zelenka
• https://github.com/bukka/php-fann
14. Expert at work
// Test if the if is in a compressed format
// none need yet
// There is a parser specified in `Parser::$KEYWORD_PARSERS`
// $result should exist, regardless of $_message
// $a && $b and multidimensional
// numGlyphs + 1
// TODO : fix this; var_dump($var);
// if(ob_get_clean()){
//$annots .= ' /StructParent ';
// $cfg['Servers'][$i]['controlpass'] = 'pmapass';
15. Input vector
• 'length' : size of the comment
• 'countDollar' : number of $
• 'countEqual' : number of =
• 'countObjectOperator' : number of -> operators ($o->p)
• 'countSemicolon' : number of semi-colon ;
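As a rough sketch of how such an input vector could be computed, the function below counts the five characteristics listed above, in the same order. The function name `comment_features` is hypothetical, and counting every `=` character individually (so `==` counts twice) is an assumption about how the original counts were made.

```python
def comment_features(comment):
    """Turn one comment into the 5-value input vector described above."""
    return [
        len(comment),         # 'length': size of the comment
        comment.count("$"),   # 'countDollar'
        comment.count("="),   # 'countEqual' (each '=' character counts once)
        comment.count("->"),  # 'countObjectOperator'
        comment.count(";"),   # 'countSemicolon'
    ]


vector = comment_features("// $x[3] or $x[] and multidimensional")
```

For this comment the vector is `[37, 2, 0, 0, 0]`.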
16. Input data
46 5 1
825 0 0 0 1
0
37 2 0 0 0
0
55 2 2 0 1
1
61 2 1 3 1
1
...
* This file is part of Exakat.
*
* Exakat is free software: you can redist
* it under the terms of the GNU Affero Ge
* the Free Software Foundation, either ve
* (at your option) any later version.
*
* Exakat is distributed in the hope that
* but WITHOUT ANY WARRANTY; without even
* MERCHANTABILITY or FITNESS FOR A PARTIC
* GNU Affero General Public License for m
*
* You should have received a copy of the
* along with Exakat. If not, see <http:/
*
* The latest code can be found at <http:/
*
*/
// $x[3] or $x[] and multidimensional
//if ($round == 3) { die('Round '.$round);
//$this->errors[] = $this->language->get('
Header line (46 5 1):
• 46 : number of input samples
• 5 : number of incoming data per sample
• 1 : number of outgoing data per sample
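The layout shown on this slide (a header line, then one line of inputs and one line of outputs per sample) can be produced with a small helper. This is a sketch only: the function name `to_fann_training_data` is hypothetical, and the sample values are taken from the slide.

```python
def to_fann_training_data(samples):
    """Serialize (inputs, outputs) pairs into the text layout shown above.

    First line: number of samples, inputs per sample, outputs per sample.
    Then, for each sample, one line of inputs followed by one line of outputs.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    num_inputs = len(samples[0][0])
    num_outputs = len(samples[0][1])
    lines = [f"{len(samples)} {num_inputs} {num_outputs}"]
    for inputs, outputs in samples:
        if len(inputs) != num_inputs or len(outputs) != num_outputs:
            raise ValueError("all samples must have the same shape")
        lines.append(" ".join(str(v) for v in inputs))
        lines.append(" ".join(str(v) for v in outputs))
    return "\n".join(lines) + "\n"


# Two of the samples from the slide: a plain comment (0) and commented-out code (1).
text = to_fann_training_data([([37, 2, 0, 0, 0], [0]), ([55, 2, 2, 0, 1], [1])])
```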
17. 1 5 1
37 2 0 0 0
0
// $x[3] or $x[] and multidimensional
ext/Fann
It's a comment
25. Results > 0.8
• Answer between 0 and 1
• Values range from -14 to 0.999
• The closer to 1 or to 0, the safer the answer
• Is this a percentage? Is this a carrot count?
• It's a mix of counts…
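Turning the raw network output into a decision is a single comparison against the 0.8 cut-off in the slide title. The sketch below assumes that an output near 1 means "this comment contains code", which matches the training labels shown earlier; the names `CODE_THRESHOLD` and `looks_like_code` are hypothetical.

```python
CODE_THRESHOLD = 0.8  # cut-off taken from the slide title "Results > 0.8"


def looks_like_code(score, threshold=CODE_THRESHOLD):
    """Flag a comment as commented-out code when the network output exceeds the threshold."""
    return score > threshold


# Raw outputs are not guaranteed to stay within [0, 1], so only the comparison matters.
flags = [looks_like_code(s) for s in (0.999, 0.5, -14)]
```

Lowering the threshold reports more comments and more false positives; raising it does the opposite.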
32. RESULTS
• 1960 issues
• More than 50% false positives
• After an easy cleanup, 822 issues reported
• 14k comments, analyzed in 68 ms (367 ms in PHP 5)
• Total coding time: 27 minutes
// = ( 59 x 84 ) mm = ( 2.32 x 3.31 ) in
/* vim: set expandtab sw=4 ts=4 sts=4: */
33. Learn better, not harder
• Better training data
• Improve characteristics
• Configure the neural network
• Change algorithm
• Automate learning
• Update constantly
[Diagram: real data, history data, training, model, results, feedback loop]
34. Better training data
• More data, more data, more data
• Varied situations, real case situations
• Include specific cases
• Experience is capital
• https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
35. Improve characteristics
• Add new characteristics
• Remove the ones that are less useful
• Find the right set of characteristics
37. Change algorithm
• Add more data before changing the algorithm
• Try the cascade2 algorithm from FANN
• 0.6 => 0 found
• 0.5 => 2 found
• Not found by the first algorithm
38. Finding the BEST
• Test with 2-4 layers, 10 neurons
• Measure results
[Chart: one series each for 1, 2, 3 and 4 layers; x-axis from 1 to 13, y-axis from 0 to 9000]
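The "test, then measure" loop can be sketched as a small comparison harness. Everything here is an assumption for illustration: `measure` stands for whatever routine trains a network with a given configuration and returns a score where higher is better, and the candidate list simply follows the bullet above (2 to 4 layers, 10 neurons).

```python
def compare_configurations(candidates, measure):
    """Measure every candidate configuration and return the best one with all results."""
    results = {config: measure(config) for config in candidates}
    best = max(results, key=results.get)
    return best, results


# Candidate (layers, neurons) pairs: 2 to 4 layers, 10 neurons each.
candidates = [(layers, 10) for layers in (2, 3, 4)]
```

In practice `measure` would train on one part of the data and score on another, so that the winning configuration is not simply the one that memorized the training set.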
39. DEEP LEARNING
• Chaining the neural networks
• Auto-encoders
• Unsupervised Learning
• Genetic algorithm, ant, random forest, naive Bayes
40. Other tools
• PHP ext/fann
• R language
• https://github.com/kachkaev/php-r
• Scikit-learn
• https://github.com/scikit-learn/scikit-learn
• Mahout
• https://mahout.apache.org/
41. Conclusion
• Machine learning is about data, not code
• There are tools to use it with PHP
• Fast to try: easy results or fast failure
• Use it for complex problems that can tolerate errors