Auto-loading of Drupal CCK Nodes
1. Automatic Scheduled Loading of CCK Nodes
ETL with drupal_execute, OO, drush, & cron
David Naughton | December 3, 2008
2. Who am I?
David Naughton
• Web Applications Developer
• University of Minnesota Libraries
• naughton@umn.edu
• 11+ years development experience
• New to Drupal & PHP
3. What's EthicShare?
ethicshare.org
• Who: UMN Center for Bioethics, UMN Libraries, & UMN Csci & EE
• What: A sustainable aggregation of bioethics research and a forum for scholarship
• When: Pilot Phase January 2008 – June 2009
• How: Funded by Andrew W. Mellon Foundation
4. Sustainable Aggregation of Bioethics Research
• My part of the project
• Extract citations from multiple sources
• Transform into Drupal-compatible format
• Load into Drupal
• On a regular, ongoing basis
5. ETL...
• Extract, Transform, and Load = ETL
• Very common IT problem
• ETL is the most common term for it
• Librarians like to say...
• “Harvesting” instead of Extracting
• “Crosswalking” instead of Transforming
• ...but they're peculiar
6. ...ETL
• Complex problem
• Lots of packaged solutions
• Mostly Java, for data warehouses
• Not a good fit for EthicShare
• Using Drupal 5 and CCK
• No Batch API
• When we move to Drupal 6...
• Batch API http://bit.ly/BatchAPI?
• content.crud.inc http://bit.ly/content-crud-inc?
7. Without Automation
• First PubMed load alone was > 100,000 citations
• Without automation, I could have been doing lots of this:
11. Architecture
[Diagram: drush invokes CiteETL. One Extractor per source (PubMed, WorldCat, New York Times, BBC) produces XML; the matching Transformer converts the XML to a PHP array; the Loader writes the array to the EthicShare MySQL database.]
12. drush
A portmanteau of “Drupal shell”.
“…a command line shell and Unix scripting interface for
Drupal, a veritable Swiss Army knife designed to make
life easier for those of us who spend most of our
working hours hacking away at the command prompt.”
-- http://drupal.org/project/drush
13. Why drush?
• Very flexible scheduling via cron
• Uses php-cli, so no web timeouts
• Experimental support for running drush without a running Drupal web instance
• Run tests from the cli with the drush simpletest runner
14. Why not hook_cron?
• If you're comfortable with cron, flexible scheduling via hook_cron requires unnecessary extra work
• Subject to web timeouts
• Runs within a Drupal web instance, so large loads may affect user experience
15. drush help
$ cd $drush_dir
$ ./drush.php help
Usage: drush.php [options] <command> <command> ...
Options:
-r <path>, --root=<path> Drupal root directory to use
(default: current directory)
-l <uri> , --uri=<uri> URI of the drupal site to use (only
needed in multisite environments)
...
Commands:
cite load Load data to create new citations.
help View help. Run "drush help [command]" to view
command-specific help.
pm install Install one or more modules
16. drush command help
$ ./drush.php help cite load
Usage: drush.php cite load [options]
Options:
--E=<extractor class> Base name of an extractor class, excluding
the CiteETL/E/ parent path & '.php'. Required.
--T=<transformer class> Base name of a transformer class, excluding
the CiteETL/T/ parent path & '.php'. Required.
--L=<loader class> Base name of a loader class, excluding the
CiteETL/L/ parent path & '.php'. Optional:
default is 'Loader'.
--dbuser=<db username> Optional: 'cite load' will authenticate the
user only if both dbuser & dbpass are present.
--dbpass=<db password> Optional: 'cite load' will authenticate the
user only if both dbuser & dbpass are present.
--memory_limit=<memory limit> Optional: default is 512M.
17. drush cite load
Example specifying the New York Times – Health
extractor & transformer classes on the cli:
$ ./drush.php cite load --E=NYTHealth \
    --T=NYTHealth --dbuser=$dbuser \
    --dbpass=$dbpass
Allows for flexible, per-data-source scheduling via cron,
a requirement for EthicShare.
18. php-cli Problems
• PHP versions < 5.3 do not free objects caught in circular references until the script ends. This is a problem when parsing loads of XML. See “Memory Leaks With Objects in PHP 5”: http://bit.ly/php5-memory-leak
• May still have to allocate huge amounts of memory to PHP to avoid “out of memory” errors.
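One common workaround for the circular-reference problem is to break the cycle by hand before dropping the last outside reference. The sketch below illustrates the idea with a hypothetical CiteNode class and destroy() method; neither is taken from the EthicShare code.

```php
<?php
// Hypothetical tree node: parent and child point at each other, which
// forms a cycle that plain reference counting cannot free.
class CiteNode {
  public $parent = null;
  public $children = array();

  function addChild(CiteNode $child) {
    $child->parent = $this;
    $this->children[] = $child;
  }

  // Break the cycle by hand, depth first.
  function destroy() {
    foreach ($this->children as $child) {
      $child->destroy();
    }
    $this->children = array();
    $this->parent = null;
  }
}

$root = new CiteNode();
$root->addChild(new CiteNode());
$child = $root->children[0];

$root->destroy(); // the child no longer points back at the root
unset($root);     // the root can now be freed by reference counting

// On PHP 5.3+ the cycle collector can also be run explicitly.
if (function_exists('gc_collect_cycles')) {
  gc_collect_cycles();
}
```

On PHP 5.3 and later the manual destroy() step is usually unnecessary, because the engine collects cycles itself.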
19. drush API
Undocumented, but simple & http://drupal.org/project/drush
links to some modules that use it. To create a drush
command…
• Implement hook_drush_command, mapping cli text to a callback function name
• Implement the callback function
…and optionally…
• Implement a hook_help case for your command
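A minimal sketch of the first two steps for the 'cite load' command. It assumes the early drush convention of an array keyed by the cli text with 'callback' and 'description' entries; those key names are an assumption, so check them against the drush version in use. The stand-in callback and the dispatch lines at the end exist only so the sketch runs outside Drupal.

```php
<?php
// Assumed shape of hook_drush_command for the cite module.
function cite_drush_command() {
  $items = array();
  // The key is the cli text; 'callback' names the function to run for it.
  $items['cite load'] = array(
    'callback' => 'cite_load_cmd',
    'description' => 'Load data to create new citations.',
  );
  return $items;
}

// Stand-in callback so the sketch is self-contained.
function cite_load_cmd() {
  return 'cite load invoked';
}

// Roughly what a dispatcher does with the mapping.
$commands = cite_drush_command();
$callback = $commands['cite load']['callback'];
$result = call_user_func($callback);
```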
20. drush getopt emulation…
Supports:
• --opt=value
• -opt or --opt (boolean based on presence or absence)
Contrary to README.txt, does not support:
• -opt value
• -opt=value
21. …drush getopt emulation
• Puts options in an associative array, where keys are the option names: $GLOBALS['args']['options']
• Puts commands (“words” not starting with a dash) in an array: $GLOBALS['args']['commands']
Quirks:
• In cases of repetition (e.g. -opt --opt=value), last one wins
• Commands & options can be interspersed, as long as order of commands is maintained
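The behaviour described on these two slides can be approximated in a few lines. This is an illustrative re-implementation, not drush's own code; parse_cli_args() is a hypothetical name, and unsupported forms such as -opt=value are simply ignored here.

```php
<?php
// Hypothetical parser mimicking the described getopt emulation.
function parse_cli_args(array $argv) {
  $args = array('options' => array(), 'commands' => array());
  foreach ($argv as $word) {
    if (preg_match('/^--([^=]+)=(.*)$/', $word, $m)) {
      // --opt=value
      $args['options'][$m[1]] = $m[2];
    }
    elseif (preg_match('/^--?([^=]+)$/', $word, $m)) {
      // -opt or --opt: boolean, TRUE when present
      $args['options'][$m[1]] = TRUE;
    }
    elseif (strlen($word) > 0 && $word[0] !== '-') {
      // Words not starting with a dash are commands, kept in order
      $args['commands'][] = $word;
    }
  }
  return $args;
}

// Commands and options interspersed; -v is repeated, so the last one wins.
$parsed = parse_cli_args(array('cite', '--E=NYTHealth', 'load', '-v', '--v=2'));
```

Here $parsed['commands'] holds 'cite' then 'load', and $parsed['options']['v'] ends up as the string '2' because the later --v=2 overwrites the earlier boolean.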
23. …cite.module example…
function cite_load_cmd($url) {
  global $args;
  $options = $args['options'];
  // Batch loading will often require more
  // than the default memory.
  $memory_limit = (
    array_key_exists('memory_limit', $options)
    ? $options['memory_limit']
    : '512M'
  );
  ini_set('memory_limit', $memory_limit);
  // continued on next slide…
24. …cite.module example
  // …continued from previous slide
  if (array_key_exists('dbuser', $options)
    && array_key_exists('dbpass', $options)) {
    user_authenticate($options['dbuser'], $options['dbpass']);
  }
  set_include_path(
    './' . drupal_get_path('module', 'cite') . PATH_SEPARATOR
    . './' . drupal_get_path('module', 'cite') . '/contrib'
    . PATH_SEPARATOR . get_include_path()
  );
  require_once 'CiteETL.php';
  $etl = new CiteETL( $options );
  $etl->run();
} // end function cite_load_cmd
25. CiteETL.php…
class CiteETL {
  private $option_property_map = array(
    'E' => 'extractor',
    'T' => 'transformer',
    'L' => 'loader'
  );
  // Not shown: identically-named accessors for these properties
  private $extractor;
  private $transformer;
  private $loader;
26. …CiteETL.php…
  function __construct($params) {
    // The loading process is almost always the same...
    if (!array_key_exists('L', $params)) {
      $params['L'] = 'Loader';
    }
    foreach ($params as $option => $class) {
      if (!preg_match('/^(E|T|L)$/', $option)) {
        continue;
      }
      // Naming-convention-based, factory-ish, dynamic
      // loading of classes, e.g. CiteETL/E/NYTHealth.php:
      require_once 'CiteETL/' . $option . '/' . $class . '.php';
      $instantiable_class = 'CiteETL_' . $option . '_' . $class;
      $property = $this->option_property_map[$option];
      $this->$property = new $instantiable_class;
    }
  }
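The run() method called from cite.module is not shown in this deck. The sketch below is one way such a loop could tie the three objects together, assuming the extractor is an Iterator (as in the extractor slides), the transformer exposes transform() and throws on rejected items (as in the transformer slides), and the loader exposes a load() method. The load() name, run_etl(), and the Demo_ classes are hypothetical.

```php
<?php
// Hypothetical stand-ins for a transformer and a loader.
class Demo_Transformer {
  function transform($item) {
    if ($item === '') {
      throw new Exception('empty item rejected');
    }
    return array('title' => strtoupper($item));
  }
}

class Demo_Loader {
  public $loaded = array();
  function load(array $citation) {
    $this->loaded[] = $citation;
  }
}

// Pull each item from the extractor, transform it, and load it.
// A rejected item is recorded and skipped rather than aborting the run.
function run_etl(Iterator $extractor, $transformer, $loader) {
  $rejected = array();
  foreach ($extractor as $item) {
    try {
      $loader->load($transformer->transform($item));
    }
    catch (Exception $e) {
      $rejected[] = $e->getMessage();
    }
  }
  return $rejected;
}

$loader = new Demo_Loader();
$rejected = run_etl(
  new ArrayIterator(array('first', '', 'second')),
  new Demo_Transformer(),
  $loader
);
```

Catching the exception per item keeps one irrelevant article from stopping a scheduled load of the rest of the feed.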
28. Example E. Base Class…
require_once 'simplepie.inc';
class CiteETL_E_SimplePie implements Iterator {
  private $items = array();
  private $valid = FALSE;
  function __construct($params) {
    $feed = new SimplePie();
    $feed->set_feed_url( $params['feed_url'] );
    $feed->init();
    if ($feed->error()) {
      throw new Exception( $feed->error() );
    }
    $feed->strip_htmltags( $params['strip_html_tags'] );
    $this->items = $feed->get_items();
  }
  // continued on next slide…
29. …Example E. Base Class
  // …continued from previous slide
  function rewind() {
    $this->valid = (FALSE !== reset($this->items));
  }
  function current() {
    return current($this->items);
  }
  function key() {
    return key($this->items);
  }
  function next() {
    $this->valid = (FALSE !== next($this->items));
  }
  function valid() {
    return $this->valid;
  }
} # end class CiteETL_E_SimplePie
30. Example Extractor
require_once 'CiteETL/E/SimplePie.php';
class CiteETL_E_NYTHealth extends CiteETL_E_SimplePie {
  function __construct() {
    parent::__construct(array(
      'feed_url' =>
        'http://www.nytimes.com/services/xml/rss/nyt/Health.xml',
      'strip_html_tags' => array('br','span','a','img')
    ));
  }
} // end class CiteETL_E_NYTHealth
31. Example Transformer…
class CiteETL_T_NYTHealth {
  private $filter_pattern;
  function __construct() {
    $simple_keywords = array(
      'abortion',
      'advance directives',
      // whole bunch of keywords omitted…
      'world health',
    );
    $this->filter_pattern =
      '/(' . join('|', $simple_keywords) . ')/i';
  }
  // continued on next slide…
32. …Example Transformer…
  // …continued from previous slide
  function transform( $simplepie_item ) {
    // create an array matching the cite CCK content type structure:
    $citation = array();
    $citation['title'] = $simplepie_item->get_title();
    $citation['field_abstract'][0]['value'] =
      $simplepie_item->get_content();
    // lots of transformation ops omitted…
    $categories = $simplepie_item->get_categories();
    $category_labels = array();
    foreach ($categories as $category) {
      array_push($category_labels, $category->get_label());
    }
    $citation['field_subject'][0]['value'] =
      join('; ', $category_labels);
    // Filter once, after title, abstract, and subject are all set;
    // filter() reads all three.
    $this->filter( $citation );
    return $citation;
  }
33. …Example Transformer
  // …continued from previous slide
  function filter( $citation ) {
    $combined_content =
      $citation['title'] .
      $citation['field_abstract'][0]['value'] .
      $citation['field_subject'][0]['value'];
    if (!preg_match($this->filter_pattern, $combined_content))
    {
      throw new Exception(
        "The article '" . $citation['title'] . "', id: "
        . $citation['source_id']
        . " was rejected by the relevancy filter"
      );
    }
  }
} // end class CiteETL_T_NYTHealth
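The relevancy filter comes down to one case-insensitive alternation pattern. A standalone sketch using three of the keywords from the transformer slide follows; the sample headlines are made up, and the preg_quote() call is an addition that the slides do not use.

```php
<?php
// Keywords taken from the transformer slide.
$keywords = array('abortion', 'advance directives', 'world health');

// preg_quote() keeps a keyword that contains regex metacharacters
// from altering the pattern.
$quoted = array();
foreach ($keywords as $keyword) {
  $quoted[] = preg_quote($keyword, '/');
}
$pattern = '/(' . implode('|', $quoted) . ')/i';

// Made-up sample titles: the first mentions a keyword, the second does not.
$relevant = preg_match($pattern, 'New Rules on Advance Directives Take Effect');
$irrelevant = preg_match($pattern, 'Marathon Training Tips for Beginners');
```

preg_match() returns 1 for the first title and 0 for the second, which is the condition filter() uses to decide whether to throw.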
34. Why not FeedAPI?
• Supports only simple one-feed-field to one-CCK-field mappings
• Avoid the Rube Goldberg Effect by using the same ETL system for feeds that we use for everything else
36. CCK Auto-loading Resources
• Quick-and-dirty CCK imports: http://bit.ly/quick-dirty-cck-imports
• Programmatically Create, Insert, and Update CCK Nodes: http://bit.ly/cck-import-update
• What is the Content Construction Kit? A View from the Database: http://bit.ly/what-is-cck
37. CCK Auto-loading Problems
• Column names may change from one database
instance to another if other CCK content types with
identical field names already exist.
• drupal_execute bug in Drupal 5 Form API:
• cannot call drupal_validate_form on the same form
more than once: http://bit.ly/drupal5-formapi-bug
• Fixed in Drupal versions > 5