As society grows more reliant on mobile technology, money is going mobile too. In this new realm of commerce, online identity matters more than ever.
When a payment is processed, it is important to understand not only who a person is but also what their broader interests and preferences are, so that personalized experiences, such as suggestions for new content and merchandise, can be delivered to each individual.
9. Our Subject Material
HTML content is poorly structured
You can’t trust that anything semantically valid will be present
There are some pretty bad web practices on the interwebz
10. How We’ll Capture This Data
Start with base linguistics
Extend with available extras
12. The Basic Pieces
Page Data: Scrapey Scrapey
Keywords: Without all the fluff
Weighting: Word diets FTW
13. Capture Raw Page Data
Semantic data on the web is sucktastic
Assume 5 year olds built the sites
Language is the key
14. Extract Keywords
We now have a big jumble of words. Let’s extract
Why is “and” a top word?
Stop words = sad panda
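The step from raw page data to a "big jumble of words" can be sketched roughly as follows. This is only an illustration, not the code from the talk: `tokenize_html` is a hypothetical helper name, and the regex is used solely as a shortcut to drop script and style blocks before the tags are stripped. The result is a plain list of lowercase words of the kind the `$mod_content` array in the later keyword loop is assumed to hold.

```php
<?php
// Hypothetical helper (not from the original slides): turn raw HTML into a
// flat list of lowercase words.
function tokenize_html(string $html): array
{
    // Drop script and style blocks first so their source is not counted as content.
    $text = (string) preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Put a space before each tag so words in adjacent elements do not fuse together.
    $text = str_replace('<', ' <', $text);
    // Remove the remaining tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');
    // Split on anything that is not a letter and discard empty pieces.
    return preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

$mod_content = tokenize_html(
    '<h1>Hello World</h1><script>var x = 1;</script><p>Hello again &amp; again</p>'
);
print_r($mod_content);
```

Stop words are deliberately left in at this stage; filtering them out is a separate step.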
15. Weight Keywords
All content is not created equal
Meta and headers and semantics oh my!
This is where we leech off the work of others
17. Questions to Keep in Mind
Should I use regex to parse web content?
How do users interact with page content?
What key identifiers can be monitored to detect interest?
18. Fetching the Data: The Request
The Simple Way
$html = file_get_contents('URL');
The Controlled Way
$c = curl_init('URL');
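What "controlled" might look like in practice is sketched below, assuming the cURL extension is available. The function name `fetch_page`, the timeout values, the redirect limit, and the user agent string are all placeholders chosen for illustration; the slides only show the `curl_init` call.

```php
<?php
// Hypothetical helper (not from the original slides): fetch a page with
// explicit limits instead of relying on defaults.
function fetch_page(string $url, int $timeout = 10): ?string
{
    $c = curl_init($url);
    curl_setopt_array($c, [
        CURLOPT_RETURNTRANSFER => true,      // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,      // follow redirects
        CURLOPT_MAXREDIRS      => 5,         // but only a handful of them
        CURLOPT_CONNECTTIMEOUT => 5,         // seconds to wait for a connection
        CURLOPT_TIMEOUT        => $timeout,  // seconds allowed for the whole transfer
        CURLOPT_USERAGENT      => 'ExampleScraper/0.1', // placeholder value
    ]);
    $body   = curl_exec($c);
    $status = curl_getinfo($c, CURLINFO_HTTP_CODE);

    // Treat transport errors and non-2xx responses as "no content".
    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}
```

Returning `null` on any failure keeps the calling code simple: a page that cannot be fetched cleanly is skipped rather than half-parsed.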
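21.
//set up list of stop words and the final found stopped list
//(the full stop word list is elided on the slide)
$common_words = array('a', /* ... */ 'zero');
$searched_words = array();
//extract list of keywords with number of occurrences
//$mod_content is assumed to be an array of words taken from the page text
foreach($mod_content as $word) {
    //lowercase so that "And" matches the lowercase stop word list
    $word = strtolower(trim($word));
    //discard tokens containing anything other than letters
    if (preg_match('/[^a-z]/', $word) == 1){ $word = ''; }
    if(strlen($word) > 2 && !in_array($word, $common_words)){
        //start the counter at zero on first sight to avoid an undefined index warning
        $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
    }
}
//sort by occurrence count, highest first
arsort($searched_words, SORT_NUMERIC);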
22. Scraping Site Meta Data
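//load scraped page data as a valid DOM document
//(@ suppresses the parser warnings that malformed HTML triggers)
$dom = new DOMDocument();
@$dom->loadHTML($page_content);
//scrape title, guarding against pages that have no <title> tag
$title = $dom->getElementsByTagName("title");
$title = $title->length > 0 ? trim($title->item(0)->nodeValue) : '';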
23.
//loop through all found meta tags
$dataReturn = array();
$metas = $dom->getElementsByTagName("meta");
for ($i = 0; $i < $metas->length; $i++){
    $meta = $metas->item($i);
    if($meta->getAttribute("property")){
        //Open Graph tags use the "property" attribute
        if ($meta->getAttribute("property") == "og:description"){
            $dataReturn["description"] = $meta->getAttribute("content");
        }
    } else {
        //standard meta tags use the "name" attribute
        if($meta->getAttribute("name") == "description"){
            $dataReturn["description"] = $meta->getAttribute("content");
        } else if($meta->getAttribute("name") == "keywords"){
            $dataReturn["keywords"] = $meta->getAttribute("content");
        }
    }
}
25. Weighting Important Data
Tags you should care about: meta (include OG), title, description, h1+, header
Bonus points for adding in content location modifiers
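One way to apply this kind of weighting is to score each word by where it appears. The sketch below is an assumption-heavy illustration: the `weight_keywords` helper and the numeric weights in `FIELD_WEIGHTS` are invented here, since the slides name the tags that matter but give no values.

```php
<?php
// Illustrative weights only; the slides do not give numeric values.
const FIELD_WEIGHTS = [
    'title'       => 5,
    'description' => 4,
    'keywords'    => 4,
    'h1'          => 3,
    'body'        => 1,
];

// Hypothetical helper: $fields maps a field name to the list of words found there.
function weight_keywords(array $fields, array $weights = FIELD_WEIGHTS): array
{
    $scores = [];
    foreach ($fields as $field => $words) {
        $weight = $weights[$field] ?? 1; // fields without a weight count once
        foreach ($words as $word) {
            $scores[$word] = ($scores[$word] ?? 0) + $weight;
        }
    }
    // Highest score first.
    arsort($scores, SORT_NUMERIC);
    return $scores;
}

$scores = weight_keywords([
    'title' => ['php', 'scraping'],
    'h1'    => ['scraping'],
    'body'  => ['php', 'php', 'curl'],
]);
print_r($scores);
```

Here "scraping" outranks "php" even though it appears fewer times, because both of its occurrences sit in heavily weighted fields.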
27. Expanding to Phrases
2-3 adjacent words, making up a direct relevant callout
Seems easy right? Just like single words
Language gets wonky without stop words
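A rough sketch of phrase extraction follows, assuming the page has already been reduced to an ordered list of words. The `extract_phrases` helper is hypothetical, and so is its rule of keeping stop words inside a phrase while rejecting phrases that start or end with one; that rule is just one possible answer to the wonkiness mentioned above.

```php
<?php
// Hypothetical helper: count phrases made of $min to $max adjacent words.
// $words must be a zero-indexed list in page order.
function extract_phrases(array $words, array $stop_words, int $min = 2, int $max = 3): array
{
    $phrases = [];
    $count = count($words);
    for ($n = $min; $n <= $max; $n++) {
        for ($i = 0; $i + $n <= $count; $i++) {
            $gram = array_slice($words, $i, $n);
            // Keep stop words inside a phrase, but skip phrases that begin or end with one.
            if (in_array($gram[0], $stop_words, true) || in_array($gram[$n - 1], $stop_words, true)) {
                continue;
            }
            $phrase = implode(' ', $gram);
            $phrases[$phrase] = ($phrases[$phrase] ?? 0) + 1;
        }
    }
    // Most frequent phrases first.
    arsort($phrases, SORT_NUMERIC);
    return $phrases;
}

$words = ['bank', 'of', 'america', 'opens', 'bank', 'of', 'america', 'branch'];
print_r(extract_phrases($words, ['of', 'the', 'a', 'and']));
```

Stripping "of" before building phrases would have produced "bank america", which is why the stop word filter is applied to phrase boundaries here rather than to the word list itself.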
28. Working with Unknown Users
The majority of users won’t be immediately targetable
Use HTML5 LocalStorage & Cookie backup
29. Adding in Time Interactions
Interaction with a site does not necessarily mean interest in it
Time needs to also include an interaction component
Gift buying seasons see interest variations