Using Operational Redundancy for Effective Web Data Mining
Jonathan LeBlanc
Head of Developer Evangelism N.A. (PayPal)
Github: http://github.com/jcleblanc
Slides: http://slideshare.net/jcleblanc
Twitter: @jcleblanc
Premise
The interactions of a user can be used to personalize their experience
Elements of Mining Redundancy
Website Data Mining
User Emotional State Mining
User Interaction Mining
Our Subject Material
HTML content is poorly structured
There are some pretty bad web practices on the interwebz
You can’t trust that anything semantically valid will be present
How We’ll Capture This Data
Start with base linguistics
Extend with available extras
The Basic Pieces
Page Data: Scrapey Scrapey
Keywords: Without all the fluff
Weighting: Word diets FTW
Capture Raw Page Data
Semantic data on the web is sucktastic
Assume 5 year olds built the sites
Language is the key
Extract Keywords
We now have a big jumble of words. Let’s extract
Why is “and” a top word?
Stop words = sad panda
Weight Keywords
All content is not created equal
Meta and headers and semantics oh my!
This is where we leech off the work of others
Questions to Keep in Mind
Should I use regex to parse web content?
How do users interact with page content?
What key identifiers can be monitored to detect interest?
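Fetching the Data: cURL
//assumption: $header was undefined on the slide; false keeps response headers out of the body
$header = false;
//open a cURL handle for the target URL and configure the request
$req = curl_init($url);
$options = array(
    CURLOPT_URL => $url,
    CURLOPT_HEADER => $header,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_AUTOREFERER => true,
    CURLOPT_TIMEOUT => 15,
    CURLOPT_MAXREDIRS => 10
);
curl_setopt_array($req, $options);
//fetch the page; curl_exec returns false on failure
$page_content = curl_exec($req);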
//list of findable / replaceable string characters
$find = array('/\r/', '/\n/', '/\s\s+/');
$replace = array(' ', ' ', ' ');
//perform page content modification: drop scripts, styles, and tags
$mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
$mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);
$mod_content = strip_tags($mod_content);
$mod_content = strtolower($mod_content);
//collapse line breaks and runs of whitespace into single spaces
$mod_content = preg_replace($find, $replace, $mod_content);
$mod_content = trim($mod_content);
//split into words and sort them
$mod_content = explode(' ', $mod_content);
natcasesort($mod_content);
//set up list of stop words and the final found stopped list
$common_words = array('a', ..., 'zero');
$searched_words = array();
//extract list of keywords with number of occurrences
foreach($mod_content as $word) {
    $word = trim($word);
    if(strlen($word) > 2 && !in_array($word, $common_words)){
        //start the count at zero the first time a word is seen
        $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
    }
}
//most frequent keywords first
arsort($searched_words, SORT_NUMERIC);
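The stripping and counting steps above can be pulled together into one function. This is a minimal sketch, not the deck's own code: the function name `extract_keywords`, the sample HTML, and the short stop word list are all assumptions made for illustration, and it splits on non-alphanumeric characters rather than on single spaces.

```php
<?php
// Hypothetical helper: count non-stop-word keywords in an HTML string.
function extract_keywords(string $html, array $stopWords): array
{
    // Drop script and style blocks, then all remaining tags.
    $text = preg_replace('#<script\b[^>]*>.*?</script>#is', ' ', $html);
    $text = preg_replace('#<style\b[^>]*>.*?</style>#is', ' ', $text);
    $text = strtolower(strip_tags($text));

    // Split on anything that is not a letter, digit, or apostrophe.
    $words = preg_split("/[^a-z0-9']+/", $text, -1, PREG_SPLIT_NO_EMPTY);

    $counts = array();
    foreach ($words as $word) {
        if (strlen($word) > 2 && !in_array($word, $stopWords, true)) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    // Most frequent keywords first.
    arsort($counts, SORT_NUMERIC);
    return $counts;
}

// Sample input (made up for this sketch).
$html = '<html><head><style>p{color:red}</style></head>'
      . '<body><p>Data mining and more data mining.</p>'
      . '<script>var data = 1;</script></body></html>';
$keywords = extract_keywords($html, array('and', 'the', 'more'));
print_r($keywords);
```

With the sample input, only "data" and "mining" survive, each counted twice; the word "data" inside the script block is never seen because the block is removed before counting.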
Scraping Site Meta Data
//load scraped page data as a valid DOM document
$dom = new DOMDocument();
@$dom->loadHTML($page_content);
//scrape title, falling back to an empty string when the page has none
$title = $dom->getElementsByTagName("title");
$title = $title->length > 0 ? $title->item(0)->nodeValue : "";
//loop through all found meta tags
$dataReturn = array();
$metas = $dom->getElementsByTagName("meta");
for ($i = 0; $i < $metas->length; $i++){
    $meta = $metas->item($i);
    if($meta->getAttribute("property")){
        //Open Graph tags use the property attribute
        if ($meta->getAttribute("property") == "og:description"){
            $dataReturn["description"] = $meta->getAttribute("content");
        }
    } else {
        //standard meta tags use the name attribute
        if($meta->getAttribute("name") == "description"){
            $dataReturn["description"] = $meta->getAttribute("content");
        } else if($meta->getAttribute("name") == "keywords"){
            $dataReturn["keywords"] = $meta->getAttribute("content");
        }
    }
}
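As a self-contained variant of the title and meta scraping above, the sketch below wraps both steps in one function. The name `scrape_meta`, the sample page, and the choice to keep the first description found and to split keywords on commas are assumptions for illustration; it also uses `libxml_use_internal_errors` instead of the `@` operator to quiet parser warnings on malformed markup.

```php
<?php
// Hypothetical helper: collect title, description, and keywords from HTML.
function scrape_meta(string $html): array
{
    $data = array('title' => null, 'description' => null, 'keywords' => null);

    $dom = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $dom->getElementsByTagName('title');
    if ($titles->length > 0) {
        $data['title'] = trim($titles->item(0)->nodeValue);
    }

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $property = strtolower($meta->getAttribute('property'));
        $name = strtolower($meta->getAttribute('name'));
        $content = $meta->getAttribute('content');

        if ($property === 'og:description' || $name === 'description') {
            // Keep the first description found (an assumption of this sketch).
            $data['description'] = $data['description'] ?? $content;
        } elseif ($name === 'keywords') {
            $data['keywords'] = array_map('trim', explode(',', $content));
        }
    }
    return $data;
}

// Sample page (made up for this sketch).
$page = '<html><head><title> Demo </title>'
      . '<meta property="og:description" content="An example page">'
      . '<meta name="keywords" content="php, scraping,mining">'
      . '</head><body></body></html>';
$meta = scrape_meta($page);
print_r($meta);
```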
Weighting Important Data
Tags you should care about: meta (include OG), title, description, h1+, header
Bonus points for adding in content location modifiers
Weighting Important Tags
//our keyword weights, keyed by where the word was found
$weights = array("keywords" => 3.0,
                 "meta" => 2.0,
                 "header1" => 1.5,
                 "header2" => 1.2);
//add modifier here
if(strlen($word) > 2 && !in_array($word, $common_words)){
    $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
}
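One way the "add modifier here" step might look is to add the weight of the source location instead of a flat 1 for each occurrence. This is only a sketch under assumptions: the function name `weight_keywords`, the `body` weight of 1.0, the shape of the `$sections` input, and the sample text are not from the slides.

```php
<?php
// Weights from the slide; the "body" weight of 1.0 is an assumption.
$weights = array('keywords' => 3.0, 'meta' => 2.0,
                 'header1' => 1.5, 'header2' => 1.2, 'body' => 1.0);

// Hypothetical helper: score words by the weight of the place they appear.
function weight_keywords(array $sections, array $weights, array $stopWords): array
{
    $scores = array();
    foreach ($sections as $source => $text) {
        // Unknown sources fall back to a neutral weight of 1.0.
        $modifier = $weights[$source] ?? 1.0;
        $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (strlen($word) > 2 && !in_array($word, $stopWords, true)) {
                $scores[$word] = ($scores[$word] ?? 0.0) + $modifier;
            }
        }
    }
    // Highest scoring keywords first.
    arsort($scores, SORT_NUMERIC);
    return $scores;
}

// Sample sections (made up for this sketch).
$sections = array(
    'keywords' => 'mining',
    'header1'  => 'Web Data Mining',
    'body'     => 'Mining the web for data and more data',
);
$scores = weight_keywords($sections, $weights, array('and', 'the', 'for', 'more'));
print_r($scores);
```

Here "mining" scores 5.5 (3.0 from keywords, 1.5 from the header, 1.0 from the body), ahead of "data" at 3.5 and "web" at 2.5, even though "data" appears just as often on the page.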
Expanding to Phrases
2-3 adjacent words, making up a direct relevant callout
Seems easy right? Just like single words
Language gets wonky without stop words
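A possible way to handle phrases is to build 2 and 3 word windows before removing stop words, then discard only the windows that begin or end with one, so that "terms of service" survives intact. The sketch below assumes that approach; `extract_phrases`, its parameters, and the sample sentence are hypothetical, and it does not respect sentence boundaries.

```php
<?php
// Hypothetical helper: count 2- and 3-word phrases in plain text.
function extract_phrases(string $text, array $stopWords, int $min = 2, int $max = 3): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $phrases = array();

    for ($size = $min; $size <= $max; $size++) {
        for ($i = 0; $i + $size <= $total; $i++) {
            $slice = array_slice($words, $i, $size);
            // Keep interior stop words, but skip phrases that start or end with one.
            if (in_array($slice[0], $stopWords, true)
                || in_array($slice[$size - 1], $stopWords, true)) {
                continue;
            }
            $phrase = implode(' ', $slice);
            $phrases[$phrase] = ($phrases[$phrase] ?? 0) + 1;
        }
    }
    // Most frequent phrases first.
    arsort($phrases, SORT_NUMERIC);
    return $phrases;
}

// Sample text (made up for this sketch).
$phrases = extract_phrases(
    'Terms of service apply. Read the terms of service.',
    array('of', 'the')
);
print_r($phrases);
```

In this sample, "terms of service" is counted twice and ranks first, while fragments such as "of service" are dropped.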
Adding in Time Interactions
Interaction with a site does not necessarily mean interest in it
Time needs to also include an interaction component
Gift buying seasons see interest variations
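One rough way to give time an interaction component is to count only the time between interaction events (scrolls, clicks, key presses) and cap long idle gaps. The sketch below is an assumption about how that could be done, not something shown in the deck; `engaged_seconds`, the 30 second idle limit, and the sample timestamps are all hypothetical.

```php
<?php
// Hypothetical helper: seconds of engaged time from interaction timestamps.
function engaged_seconds(array $eventTimes, int $idleLimit = 30): int
{
    sort($eventTimes, SORT_NUMERIC);
    $engaged = 0;
    for ($i = 1; $i < count($eventTimes); $i++) {
        $gap = $eventTimes[$i] - $eventTimes[$i - 1];
        // Gaps longer than the idle limit count only up to the limit.
        $engaged += min($gap, $idleLimit);
    }
    return $engaged;
}

// Events at 0s, 10s, 25s, then a long pause until 300s.
echo engaged_seconds(array(0, 10, 25, 300)), "\n";
```

The page was open for 300 seconds, but only 55 of them count as engaged, because the long pause is capped at the idle limit.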
Grouping Using Commonality
User A Interests
Common Interests
User B Interests
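If each user's interests are stored as a keyword to score map, the common interests are the keys both maps share. This is a minimal sketch under that assumption; `common_interests`, the choice to score a shared interest by the lower of the two weights, and the sample users are hypothetical.

```php
<?php
// Hypothetical helper: interests shared by two users, scored by the lower weight.
function common_interests(array $userA, array $userB): array
{
    $common = array();
    foreach (array_intersect_key($userA, $userB) as $keyword => $scoreA) {
        $common[$keyword] = min($scoreA, $userB[$keyword]);
    }
    // Strongest shared interests first.
    arsort($common, SORT_NUMERIC);
    return $common;
}

// Sample interest profiles (made up for this sketch).
$userA = array('php' => 5.5, 'cycling' => 2.0, 'coffee' => 1.0);
$userB = array('coffee' => 4.0, 'php' => 3.0, 'sailing' => 2.5);
$shared = common_interests($userA, $userB);
print_r($shared);
```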
Using Color Theory
Products with a feel-good message: happiness, energy, encouragement
Health care (but not food!): relatable, calm, friendly, peace, security
Startups / innovative products: creativity, imagination
Auction sites (but not sales sites!): passion, stimulation, excitement, power
What We’re Talking About
The CSS Service Engine
lesscss.org
sass-lang.com
learnboost.github.com/stylus
Design Engine Foundation: LESSPHP
http://leafo.net/lessphp/
The Basics of a Design Engine
//create new LESS object
$less = new lessc();
//compile LESS code to CSS (only recompiles when the LESS file has changed)
$less->checkedCompile('/path/styles.less', 'path/styles.css');
//output a link tag pointing at the compiled CSS file
echo "<link rel='stylesheet' href='http://path/styles.css' type='text/css' />";
Passing Variables into LESSPHP
//create a new LESS object
$less = new lessc();
//set the variables
$less->setVariables(array(
    'color' => 'red',
    'base' => '960px'
));
//compile LESS into CSS and unset variables
echo $less->compile(".magic { color: @color; width: @base - 200; }");
$less->unsetVariable('color');
Implementing Color Functions
Lighten / Darken
Saturate / Desaturate
Adjust Hue
Mix Colors
Managing Irrelevant Content
Remove / hide content based on user profile and state
Managing Irrelevant Content
//variables passed into LESS compilation
$less->setVariables(array(
    "percent" => "80%",
));
//LESS template (colors are unquoted so fade() receives color values, not strings)
.highlight{
    @bg-color: #464646;
    @font-color: #eee;
    background-color: fade(@bg-color, @percent);
    color: fade(@font-color, @percent);
}
Traits of the Bored
Distraction
Repetition
Tiredness
Reasons for Boredom
Lack of interest
Readiness
Acting on Disinterest / Boredom
Highlighting on Agitated Behavior
Highlight relevant content to reduce agitated behavior
Acting Upon User Cues
$less->setVariables(array(
    "percent" => "100%",
    "size-mod" => "2"
));
Variables passed into LESS script
Acting Upon User Cues
.highlight{
    @bg-calm: blue;
    @bg-action: red;
    @base-font: 14px;
    background-color: mix(@bg-calm, @bg-action, @percent);
    font-size: @size-mod + @base-font;
}
LESS script logic for color / size variations
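The `percent` and `size-mod` values have to come from somewhere on the PHP side. The sketch below shows one hypothetical mapping from a behavior score to those two variables; the function name `cue_variables`, the 0.0 to 1.0 agitation scale, and the maximum size modifier of 4 are assumptions, not part of the deck.

```php
<?php
// Hypothetical helper: turn an agitation score (0.0 calm to 1.0 agitated)
// into the "percent" and "size-mod" variables used by the LESS template.
function cue_variables(float $agitation, int $maxSizeMod = 4): array
{
    // Clamp the score into the expected range.
    $agitation = max(0.0, min(1.0, $agitation));
    return array(
        // mix() weights its first color by this percentage, so calm users
        // get mostly @bg-calm and agitated users mostly @bg-action.
        'percent'  => (int) round((1.0 - $agitation) * 100) . '%',
        // Agitated users get a larger font size bump.
        'size-mod' => (string) (int) round($agitation * $maxSizeMod),
    );
}

$vars = cue_variables(0.25);
print_r($vars);
```

The returned array has the same shape as the one passed to `setVariables` on the slide, so it could be handed to the LESS compiler directly.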
Interaction and Emotion Plugin
jQuery Behavior Miner by Cedric Dugas
https://github.com/posabsolute/jquery-behavior-miner
In the End…
What a person is interested in
What a person is doing
What their emotional state is
http://slideshare.com/jcleblanc
Thank You! Questions?
Jonathan LeBlanc
Head of Developer Evangelism N.A. (PayPal)
Github: http://github.com/jcleblanc
Slides: http://slideshare.net/jcleblanc
Twitter: @jcleblanc


Editor's Notes

  1. The semantic data movement was an abysmal failure. Strip down the site to its basic components – the language and words used on the page
  2. Open graph protocol
  3. This is why I prefer using cURL: customization of requests, timeouts, allows redirects, etc.
  4. Stripping irrelevant data
  5. Scraping site keywords
  6. You can also play with the fade in / fade out to modify the lightness and highlighting