Automatic Scheduled Loading of CCK Nodes
ETL with drupal_execute, OO, drush, & cron

David Naughton | December 3, 2008
Who am I?
David Naughton

• Web Applications Developer
   • University of Minnesota Libraries
   • naughton@umn.edu
• 11+ years development experience
• New to Drupal & PHP
What's EthicShare?
ethicshare.org

• Who: UMN Center for Bioethics, UMN Libraries, & UMN Csci & EE

• What: A sustainable aggregation of bioethics research and a forum for scholarship

• When: Pilot Phase January 2008 – June 2009

• How: Funded by Andrew W. Mellon Foundation
Sustainable Aggregation
of Bioethics Research
• My part of the project

• Extract citations from multiple sources

• Transform into Drupal-compatible format

• Load into Drupal

• On a regular, ongoing basis
ETL...
• Extract, Transform, and Load = ETL
• Very common IT problem
• ETL is the most common term for it
• Librarians like to say...
   • “Harvesting” instead of Extracting
   • “Crosswalking” instead of Transforming
• ...but they're peculiar
...ETL
• Complex problem
• Lots of packaged solutions
   • Mostly Java, for data warehouses
• Not a good fit for EthicShare
   • Using Drupal 5 and CCK
   • No Batch API
• When we move to Drupal 6...
   • Batch API http://bit.ly/BatchAPI?
   • content.crud.inc http://bit.ly/content-crud-inc?
Without Automation
• First PubMed load alone was > 100,000 citations
• Without automation, I could have been doing lots of this:
One Solution
If money were no object, we could have hired lots of these:
Really want...
...but don't want:
Architecture
drush → CiteETL

Extractor        →  XML  →  Transformer      →  PHP Array  →  Loader  →  EthicShare MySQL
PubMed                      PubMed
WorldCat                    WorldCat
New York Times              New York Times
BBC                         BBC
drush
A portmanteau of “Drupal shell”.

“…a command line shell and Unix scripting interface for
Drupal, a veritable Swiss Army knife designed to make
life easier for those of us who spend most of our
working hours hacking away at the command prompt.”

 -- http://drupal.org/project/drush
Why drush?
• Very flexible scheduling via cron
• Uses php-cli, so no web timeouts
• Experimental support for running drush without a running Drupal web instance
• Run tests from the cli with Drush simpletest runner
Why not hook_cron?
• If you're comfortable with cron, flexible scheduling via hook_cron requires unnecessary extra work
• Subject to web timeouts
• Runs within a Drupal web instance, so large loads may affect user experience
drush help
$ cd $drush_dir
$ ./drush.php help
Usage: drush.php [options] <command> <command> ...

Options:
  -r <path>, --root=<path>        Drupal root directory to use
                                  (default: current directory)

 -l <uri> , --uri=<uri>           URI of the drupal site to use (only
                                  needed in multisite environments)
...

Commands:
  cite load               Load data to create new citations.

 help                     View help. Run "drush help [command]" to view
                          command-specific help.

 pm install               Install one or more modules
drush command help
$ ./drush.php help cite load
Usage: drush.php cite load [options]

Options:
  --E=<extractor class>       Base name of an extractor class, excluding
                              the CiteETL/E/ parent path & '.php'. Required.

 --T=<transformer class>      Base name of a transformer class, excluding
                              the CiteETL/T/ parent path & '.php'. Required.

 --L=<loader class>           Base name of a loader class, excluding the
                              CiteETL/L/ parent path & '.php'. Optional:
                              default is 'Loader'.

 --dbuser=<db username>       Optional: 'cite load' will authenticate the
                              user only if both dbuser & dbpass are present.

 --dbpass=<db password>       Optional: 'cite load' will authenticate the
                              user only if both dbuser & dbpass are present.

 --memory_limit=<memory limit>         Optional: default is 512M.
drush cite load
Example specifying the New York Times – Health extractor & transformer classes on the cli:

$ ./drush.php cite load --E=NYTHealth \
  --T=NYTHealth --dbuser=$dbuser \
  --dbpass=$dbpass

Allows for flexible, per-data-source scheduling via cron, a requirement for EthicShare.
php-cli Problems
• PHP versions < 5.3 do not free circular references. This is a problem when parsing loads of XML: Memory Leaks With Objects in PHP 5 http://bit.ly/php5-memory-leak
• Still may have to allocate huge amounts of memory to PHP to avoid “out of memory” errors.
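A common workaround on PHP < 5.3 is to break reference cycles by hand before dropping the last reference to an object graph. The following is a minimal sketch of that idea, not code from CiteETL: the TreeNode class and its destroy() method are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical tree node, for illustration only; not part of CiteETL.
class TreeNode {
    public $parent = null;
    public $children = array();

    public function addChild(TreeNode $child) {
        $child->parent = $this;      // child points at parent...
        $this->children[] = $child;  // ...and parent at child: a cycle
    }

    // Break the cycle explicitly so that plain reference counting
    // can reclaim the nodes without a cycle collector.
    public function destroy() {
        foreach ($this->children as $child) {
            $child->destroy();
        }
        $this->children = array();
        $this->parent = null;
    }
}

$root = new TreeNode();
$leaf = new TreeNode();
$root->addChild($leaf);

// Call this before discarding the tree.
$root->destroy();
```

Once destroy() has run, unset($root, $leaf) leaves no cycle behind for reference counting to miss.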
drush API
Undocumented, but simple, and http://drupal.org/project/drush links to some modules that use it. To create a drush command…
• Implement hook_drush_command, mapping cli text to a callback function name
• Implement the callback function
…and optionally…
• Implement a hook_help case for your command
drush getopt emulation…
Supports:
• --opt=value
• -opt or --opt (boolean based on presence or absence)
Contrary to README.txt, does not support:
• -opt value
• -opt=value
…drush getopt emulation
• Puts options in an associative array, where keys are the option names: $GLOBALS['args']['options']
• Puts commands (“words” not starting with a dash) in an array: $GLOBALS['args']['commands']
Quirks:
• In cases of repetition (e.g. -opt --opt=value), last one wins
• Commands & options can be interspersed, as long as order of commands is maintained
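The parsing rules above can be sketched in a few lines. This is a hypothetical re-implementation for illustration, not drush's own code: sketch_parse_args() is an invented name, and silently ignoring the unsupported -opt=value form is an assumption.

```php
<?php
// Hypothetical re-implementation of the rules listed above;
// this is NOT drush's own parser.
function sketch_parse_args(array $words) {
    $parsed = array('options' => array(), 'commands' => array());
    foreach ($words as $word) {
        if (preg_match('/^--([^=]+)=(.*)$/', $word, $m)) {
            // --opt=value
            $parsed['options'][$m[1]] = $m[2];
        } elseif (preg_match('/^--?([^=]+)$/', $word, $m)) {
            // -opt or --opt: boolean, TRUE when present
            $parsed['options'][$m[1]] = TRUE;
        } elseif (substr($word, 0, 1) !== '-') {
            // Anything not starting with a dash is a command word.
            $parsed['commands'][] = $word;
        }
        // Assumption: the unsupported -opt=value form is ignored.
    }
    return $parsed;
}

// Commands and options interspersed; 'v' is repeated, so last one wins.
$args = sketch_parse_args(array(
    'cite', '--E=NYTHealth', 'load', '--T=NYTHealth', '-v', '--v=2',
));
```

Here $args['commands'] keeps 'cite' and 'load' in order, and $args['options']['v'] ends up holding the later value '2'.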
cite.module example…
function cite_drush_command() {
    $items['cite load'] = array(
     'callback'    => 'cite_load_cmd',
     'description' => t('Load data to create new citations.')
    );
    return $items;
}
…cite.module example…
function cite_load_cmd($url) {

   global $args;
   $options = $args['options'];

   // Batch loading will often require more
   // than the default memory.
   $memory_limit = (
       array_key_exists('memory_limit', $options)
       ? $options['memory_limit']
       : '512M'
   );
   ini_set('memory_limit', $memory_limit);

   // continued on next slide…
…cite.module example
 // …continued from previous slide

   if (array_key_exists('dbuser', $options)
       && array_key_exists('dbpass', $options)) {
       user_authenticate($options['dbuser'], $options['dbpass']);
   }

   set_include_path(
      './' . drupal_get_path('module', 'cite') . PATH_SEPARATOR
      . './' . drupal_get_path('module', 'cite') . '/contrib'
      . PATH_SEPARATOR . get_include_path()
   );

   require_once 'CiteETL.php';
   $etl = new CiteETL( $options );
   $etl->run();

} // end function cite_load_cmd
CiteETL.php…
class CiteETL {

private   $option_property_map = array(
 'E' =>   'extractor',
 'T' =>   'transformer',
 'L' =>   'loader'
);

// Not shown: identically-named accessors for these properties
private $extractor;
private $transformer;
private $loader;
…CiteETL.php…
function __construct($params) {
    // The loading process is almost always the same...
    if (!array_key_exists('L', $params)) {
        $params['L'] = 'Loader';
    }

    foreach ($params as $option => $class) {
        if (!preg_match('/^(E|T|L)$/', $option)) {
            continue;
        }
        // Naming-convention-based, factory-ish, dynamic
        // loading of classes, e.g. CiteETL/E/NYTHealth.php:
        require_once 'CiteETL/' . $option . '/' . $class . '.php';
        $instantiable_class = 'CiteETL_' . $option . '_' . $class;
        $property = $this->option_property_map[$option];
        $this->$property = new $instantiable_class;
    }
}
…CiteETL.php
function run() {
    // Extractors must all implement the Iterator interface.
    $extractor = $this->extractor();
    $extractor->rewind();
    while ($extractor->valid()) {
        $original_citation = $extractor->current();
        try {
            $transformed_citation = $this->transformer->transform(
                $original_citation
            );
        } catch (Exception $e) {
            // Transformation failed (e.g. rejected by a relevancy
            // filter): log it and skip straight to the next item.
            fwrite(STDERR, $e->getMessage() . "\n");
            $extractor->next();
            continue;
        }
        try {
            $this->loader->load( $transformed_citation );
        } catch (Exception $e) {
            fwrite(STDERR, $e->getMessage() . "\n");
        }
        $extractor->next();
    }
}

} // end class CiteETL
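The extract, transform, load loop can be tried without Drupal by swapping in stand-in objects. The sketch below assumes hypothetical SketchTransformer and SketchLoader classes and a sketch_run() function, none of which are part of CiteETL. It uses PHP's built-in ArrayIterator as the extractor and collects error messages instead of writing them to STDERR.

```php
<?php
// Hypothetical stand-ins for a transformer and a loader; the names
// are illustrative and not part of CiteETL.
class SketchTransformer {
    public function transform($item) {
        if ($item['title'] === '') {
            throw new Exception('rejected: empty title');
        }
        return array('title' => strtoupper($item['title']));
    }
}

class SketchLoader {
    public $loaded = array();
    public function load(array $citation) {
        $this->loaded[] = $citation;
    }
}

// Same control flow as run(): a failure in either stage is recorded
// and the loop moves on to the next extracted item.
function sketch_run(Iterator $extractor, $transformer, $loader) {
    $errors = array();
    foreach ($extractor as $item) {
        try {
            $loader->load($transformer->transform($item));
        } catch (Exception $e) {
            $errors[] = $e->getMessage();
        }
    }
    return $errors;
}

$extractor = new ArrayIterator(array(
    array('title' => 'first'),
    array('title' => ''),
    array('title' => 'third'),
));
$loader = new SketchLoader();
$errors = sketch_run($extractor, new SketchTransformer(), $loader);
```

Because extractors implement Iterator, foreach can stand in for the explicit rewind(), valid(), current(), next() calls. The rejected middle item is reported once and the other two are loaded.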
Example E. Base Class…
require_once 'simplepie.inc';

class CiteETL_E_SimplePie implements Iterator {

private $items = array();
private $valid = FALSE;

function __construct($params) {
    $feed = new SimplePie();
    $feed->set_feed_url( $params['feed_url'] );
    $feed->init();
    if ($feed->error()) {
        throw new Exception( $feed->error() );
    }
    $feed->strip_htmltags( $params['strip_html_tags'] );
    $this->items = $feed->get_items();
}

// continued on next slide…
…Example E. Base Class
// …continued from previous slide
function rewind() {
    $this->valid = (FALSE !== reset($this->items));
}

function current() {
    return current($this->items);
}

function key() {
    return key($this->items);
}

function next() {
    $this->valid = (FALSE !== next($this->items));
}

function valid() {
    return $this->valid;
}

} # end class CiteETL_E_SimplePie
Example Extractor
require_once 'CiteETL/E/SimplePie.php';

class CiteETL_E_NYTHealth extends CiteETL_E_SimplePie {

function __construct() {
    parent::__construct(array(
     'feed_url' =>
         'http://www.nytimes.com/services/xml/rss/nyt/Health.xml',
     'strip_html_tags' => array('br','span','a','img')
    ));
}

} // end class CiteETL_E_NYTHealth
Example Transformer…
class CiteETL_T_NYTHealth {

private $filter_pattern;

function __construct() {

    $simple_keywords = array(
        'abortion',
        'advance directives',
        // whole bunch of keywords omitted…
       'world health',
    );
    $this->filter_pattern =
        '/(' . join('|', $simple_keywords) . ')/i';
}

// continued on next slide…
…Example Transformer…
// …continued from previous slide

function transform( $simplepie_item ) {
    // create an array matching the cite CCK content type structure:
    $citation = array();

    $citation['title'] = $simplepie_item->get_title();
    $citation['field_abstract'][0]['value'] =
        $simplepie_item->get_content();

    // lots of transformation ops omitted…

    $categories = $simplepie_item->get_categories();
    $category_labels = array();
    foreach ($categories as $category) {
        array_push($category_labels, $category->get_label());
    }
    $citation['field_subject'][0]['value'] =
        join('; ', $category_labels);

    $this->filter( $citation );
    return $citation;
}
…Example Transformer
// …continued from previous slide

function filter( $citation ) {

    $combined_content =
        $citation['title'] .
        $citation['field_abstract'][0]['value'] .
        $citation['field_subject'][0]['value'];

    if (!preg_match($this->filter_pattern, $combined_content))
    {
        throw new Exception(
            "The article '" . $citation['title'] . "', id: "
            . $citation['source_id']
            . " was rejected by the relevancy filter"
        );
    }
}

} // end class CiteETL_T_NYTHealth
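The keyword filter can be exercised on its own. A minimal sketch, assuming hypothetical helper names (sketch_build_filter_pattern, sketch_is_relevant) and a shortened keyword list; the preg_quote() call is an addition not shown in the slides, so that a keyword containing regex metacharacters or a slash cannot break the pattern.

```php
<?php
// Hypothetical helpers; the keyword list is a shortened example.
function sketch_build_filter_pattern(array $keywords) {
    $quoted = array();
    foreach ($keywords as $keyword) {
        // Escape regex metacharacters and the '/' delimiter.
        $quoted[] = preg_quote($keyword, '/');
    }
    // Case-insensitive alternation, as in the transformer above.
    return '/(' . join('|', $quoted) . ')/i';
}

function sketch_is_relevant($pattern, $text) {
    return preg_match($pattern, $text) === 1;
}

$pattern = sketch_build_filter_pattern(array(
    'abortion',
    'advance directives',
    'world health',
));
$relevant   = sketch_is_relevant($pattern, 'New World Health report released');
$irrelevant = sketch_is_relevant($pattern, 'Local sports scores');
```

Returning a boolean rather than throwing keeps the sketch easy to test; the transformer above throws instead so that the ETL loop can log the rejection.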
Why not FeedAPI?
• Supports only simple one-feed-field to one-CCK-field mappings
• Avoid the Rube Goldberg Effect by using the same ETL system for feeds that we use for everything else
Loader
class CiteETL_L_Loader {

function load( $citation ) {
    // de-duplication code omitted…

    $node = array('type' => 'cite');
    $citation['status'] = 1;
    $node_path = drupal_execute(
     'cite_node_form', $citation, $node
    );
    $errors = form_get_errors();
    if (count($errors)) {
        $message = join('; ', $errors);
        throw new Exception( $message );
    }
    // de-duplication code omitted…
}

} // end class CiteETL_L_Loader
CCK Auto-loading Resources
• Quick-and-dirty CCK imports http://bit.ly/quick-dirty-cck-imports
• Programmatically Create, Insert, and Update CCK Nodes http://bit.ly/cck-import-update
• What is the Content Construction Kit? A View from the Database. http://bit.ly/what-is-cck
CCK Auto-loading Problems
• Column names may change from one database instance to another if other CCK content types with identical field names already exist.
• drupal_execute bug in Drupal 5 Form API:
   • Cannot call drupal_validate_form on the same form more than once: http://bit.ly/drupal5-formapi-bug
   • Fixed in Drupal versions > 5
Questions?

Contenu connexe

Tendances

The Beauty And The Beast Php N W09
The Beauty And The Beast Php N W09The Beauty And The Beast Php N W09
The Beauty And The Beast Php N W09Bastian Feder
 
PHP 5.3 Overview
PHP 5.3 OverviewPHP 5.3 Overview
PHP 5.3 Overviewjsmith92
 
SPL: The Missing Link in Development
SPL: The Missing Link in DevelopmentSPL: The Missing Link in Development
SPL: The Missing Link in Developmentjsmith92
 
eZ Publish Cluster Unleashed
eZ Publish Cluster UnleashedeZ Publish Cluster Unleashed
eZ Publish Cluster UnleashedBertrand Dunogier
 
PuppetCamp SEA 1 - Version Control with Puppet
PuppetCamp SEA 1 - Version Control with PuppetPuppetCamp SEA 1 - Version Control with Puppet
PuppetCamp SEA 1 - Version Control with PuppetWalter Heck
 
Speed up your developments with Symfony2
Speed up your developments with Symfony2Speed up your developments with Symfony2
Speed up your developments with Symfony2Hugo Hamon
 
4069180 Caching Performance Lessons From Facebook
4069180 Caching Performance Lessons From Facebook4069180 Caching Performance Lessons From Facebook
4069180 Caching Performance Lessons From Facebookguoqing75
 
mapserver_install_linux
mapserver_install_linuxmapserver_install_linux
mapserver_install_linuxtutorialsruby
 
Dependency Injection with PHP 5.3
Dependency Injection with PHP 5.3Dependency Injection with PHP 5.3
Dependency Injection with PHP 5.3Fabien Potencier
 
Datagrids with Symfony 2, Backbone and Backgrid
Datagrids with Symfony 2, Backbone and BackgridDatagrids with Symfony 2, Backbone and Backgrid
Datagrids with Symfony 2, Backbone and Backgrideugenio pombi
 
PHP Data Objects
PHP Data ObjectsPHP Data Objects
PHP Data ObjectsWez Furlong
 

Tendances (20)

jQuery secrets
jQuery secretsjQuery secrets
jQuery secrets
 
Puppet @ Seat
Puppet @ SeatPuppet @ Seat
Puppet @ Seat
 
Php go vrooom!
Php go vrooom!Php go vrooom!
Php go vrooom!
 
The Beauty And The Beast Php N W09
The Beauty And The Beast Php N W09The Beauty And The Beast Php N W09
The Beauty And The Beast Php N W09
 
Augeas @RMLL 2012
Augeas @RMLL 2012Augeas @RMLL 2012
Augeas @RMLL 2012
 
PHP 5.3 Overview
PHP 5.3 OverviewPHP 5.3 Overview
PHP 5.3 Overview
 
SPL: The Missing Link in Development
SPL: The Missing Link in DevelopmentSPL: The Missing Link in Development
SPL: The Missing Link in Development
 
eZ Publish Cluster Unleashed
eZ Publish Cluster UnleashedeZ Publish Cluster Unleashed
eZ Publish Cluster Unleashed
 
Apache Hacks
Apache HacksApache Hacks
Apache Hacks
 
PuppetCamp SEA 1 - Version Control with Puppet
PuppetCamp SEA 1 - Version Control with PuppetPuppetCamp SEA 1 - Version Control with Puppet
PuppetCamp SEA 1 - Version Control with Puppet
 
Speed up your developments with Symfony2
Speed up your developments with Symfony2Speed up your developments with Symfony2
Speed up your developments with Symfony2
 
4069180 Caching Performance Lessons From Facebook
4069180 Caching Performance Lessons From Facebook4069180 Caching Performance Lessons From Facebook
4069180 Caching Performance Lessons From Facebook
 
mapserver_install_linux
mapserver_install_linuxmapserver_install_linux
mapserver_install_linux
 
ReUse Your (Puppet) Modules!
ReUse Your (Puppet) Modules!ReUse Your (Puppet) Modules!
ReUse Your (Puppet) Modules!
 
Dependency Injection with PHP 5.3
Dependency Injection with PHP 5.3Dependency Injection with PHP 5.3
Dependency Injection with PHP 5.3
 
PHP MVC
PHP MVCPHP MVC
PHP MVC
 
Alfredo-PUMEX
Alfredo-PUMEXAlfredo-PUMEX
Alfredo-PUMEX
 
Datagrids with Symfony 2, Backbone and Backgrid
Datagrids with Symfony 2, Backbone and BackgridDatagrids with Symfony 2, Backbone and Backgrid
Datagrids with Symfony 2, Backbone and Backgrid
 
PHP Data Objects
PHP Data ObjectsPHP Data Objects
PHP Data Objects
 
ReactPHP
ReactPHPReactPHP
ReactPHP
 

En vedette

Creating Your WordPress Web Site
Creating Your WordPress Web SiteCreating Your WordPress Web Site
Creating Your WordPress Web Sitemythicgroup
 
Mobilizing Communities in a Connected Age
Mobilizing Communities in a Connected AgeMobilizing Communities in a Connected Age
Mobilizing Communities in a Connected AgeMargaret Stangl
 
The Greatest Generation
The Greatest GenerationThe Greatest Generation
The Greatest Generationgpsinc
 
Educational Tear Sheets
Educational Tear SheetsEducational Tear Sheets
Educational Tear Sheetssararshea
 
A Search For Compassion
A Search For CompassionA Search For Compassion
A Search For CompassionMichelle
 
Most Contagious 2008
Most Contagious 2008Most Contagious 2008
Most Contagious 2008Daniel Simon
 
Has Anyone Asked a Customer?
Has Anyone Asked a Customer?Has Anyone Asked a Customer?
Has Anyone Asked a Customer?Dan Armstrong
 
Centrifugal+Pump+Main[1].+Seminar+(Tt$)
Centrifugal+Pump+Main[1].+Seminar+(Tt$)Centrifugal+Pump+Main[1].+Seminar+(Tt$)
Centrifugal+Pump+Main[1].+Seminar+(Tt$)Mahmoud Osman
 
Twitter for Journalists Utrecht june 11 2009
Twitter for Journalists Utrecht june 11 2009Twitter for Journalists Utrecht june 11 2009
Twitter for Journalists Utrecht june 11 2009Bart Brouwers
 
TFS REST API e Universal Apps
TFS REST API e Universal AppsTFS REST API e Universal Apps
TFS REST API e Universal AppsGiovanni Bassi
 
Fontys business model generation & dichtbij
Fontys business model generation & dichtbijFontys business model generation & dichtbij
Fontys business model generation & dichtbijBart Brouwers
 
Innovatie bij Traditionele Media
Innovatie bij Traditionele MediaInnovatie bij Traditionele Media
Innovatie bij Traditionele MediaBart Brouwers
 
Lean back to Lean forward: steps to a new attitude
Lean back to Lean forward: steps to a new attitudeLean back to Lean forward: steps to a new attitude
Lean back to Lean forward: steps to a new attitudemiguelvinagre
 
Univ Aizu week10 about computer
Univ Aizu week10 about  computerUniv Aizu week10 about  computer
Univ Aizu week10 about computerI M
 

En vedette (20)

Justize
JustizeJustize
Justize
 
Obligatoriedad de antecedentes policiales
Obligatoriedad de antecedentes policialesObligatoriedad de antecedentes policiales
Obligatoriedad de antecedentes policiales
 
Creating Your WordPress Web Site
Creating Your WordPress Web SiteCreating Your WordPress Web Site
Creating Your WordPress Web Site
 
Mobilizing Communities in a Connected Age
Mobilizing Communities in a Connected AgeMobilizing Communities in a Connected Age
Mobilizing Communities in a Connected Age
 
The Greatest Generation
The Greatest GenerationThe Greatest Generation
The Greatest Generation
 
Educational Tear Sheets
Educational Tear SheetsEducational Tear Sheets
Educational Tear Sheets
 
A Search For Compassion
A Search For CompassionA Search For Compassion
A Search For Compassion
 
TTB- I Spy
TTB- I SpyTTB- I Spy
TTB- I Spy
 
Most Contagious 2008
Most Contagious 2008Most Contagious 2008
Most Contagious 2008
 
Has Anyone Asked a Customer?
Has Anyone Asked a Customer?Has Anyone Asked a Customer?
Has Anyone Asked a Customer?
 
Centrifugal+Pump+Main[1].+Seminar+(Tt$)
Centrifugal+Pump+Main[1].+Seminar+(Tt$)Centrifugal+Pump+Main[1].+Seminar+(Tt$)
Centrifugal+Pump+Main[1].+Seminar+(Tt$)
 
Twitter for Journalists Utrecht june 11 2009
Twitter for Journalists Utrecht june 11 2009Twitter for Journalists Utrecht june 11 2009
Twitter for Journalists Utrecht june 11 2009
 
Native tmg
Native tmgNative tmg
Native tmg
 
It Idea
It IdeaIt Idea
It Idea
 
Amphibians
AmphibiansAmphibians
Amphibians
 
TFS REST API e Universal Apps
TFS REST API e Universal AppsTFS REST API e Universal Apps
TFS REST API e Universal Apps
 
Fontys business model generation & dichtbij
Fontys business model generation & dichtbijFontys business model generation & dichtbij
Fontys business model generation & dichtbij
 
Innovatie bij Traditionele Media
Innovatie bij Traditionele MediaInnovatie bij Traditionele Media
Innovatie bij Traditionele Media
 
Lean back to Lean forward: steps to a new attitude
Lean back to Lean forward: steps to a new attitudeLean back to Lean forward: steps to a new attitude
Lean back to Lean forward: steps to a new attitude
 
Univ Aizu week10 about computer
Univ Aizu week10 about  computerUniv Aizu week10 about  computer
Univ Aizu week10 about computer
 

Similaire à Auto-loading of Drupal CCK Nodes

JUDCon London 2011 - Bin packing with drools planner by example
JUDCon London 2011 - Bin packing with drools planner by exampleJUDCon London 2011 - Bin packing with drools planner by example
JUDCon London 2011 - Bin packing with drools planner by exampleGeoffrey De Smet
 
course slides -- powerpoint
course slides -- powerpointcourse slides -- powerpoint
course slides -- powerpointwebhostingguy
 
Advanced PHPUnit Testing
Advanced PHPUnit TestingAdvanced PHPUnit Testing
Advanced PHPUnit TestingMike Lively
 
Introducing PHP Latest Updates
Introducing PHP Latest UpdatesIntroducing PHP Latest Updates
Introducing PHP Latest UpdatesIftekhar Eather
 
10 Things Every Plugin Developer Should Know (WordCamp Atlanta 2013)
10 Things Every Plugin Developer Should Know (WordCamp Atlanta 2013)10 Things Every Plugin Developer Should Know (WordCamp Atlanta 2013)
10 Things Every Plugin Developer Should Know (WordCamp Atlanta 2013)arcware
 
Tools and Tips for Moodle Developers - #mootus16
 Tools and Tips for Moodle Developers - #mootus16 Tools and Tips for Moodle Developers - #mootus16
Tools and Tips for Moodle Developers - #mootus16Dan Poltawski
 
Create a web-app with Cgi Appplication
Create a web-app with Cgi AppplicationCreate a web-app with Cgi Appplication
Create a web-app with Cgi Appplicationolegmmiller
 
CodeIgniter PHP MVC Framework
CodeIgniter PHP MVC FrameworkCodeIgniter PHP MVC Framework
CodeIgniter PHP MVC FrameworkBo-Yi Wu
 
Drupalcon 2023 - How Drupal builds your pages.pdf
Drupalcon 2023 - How Drupal builds your pages.pdfDrupalcon 2023 - How Drupal builds your pages.pdf
Drupalcon 2023 - How Drupal builds your pages.pdfLuca Lusso
 
2023 - Drupalcon - How Drupal builds your pages
2023 - Drupalcon - How Drupal builds your pages2023 - Drupalcon - How Drupal builds your pages
2023 - Drupalcon - How Drupal builds your pagessparkfabrik
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalystdwm042
 
Debugging in drupal 8
Debugging in drupal 8Debugging in drupal 8
Debugging in drupal 8Allie Jones
 
Perl web frameworks
Perl web frameworksPerl web frameworks
Perl web frameworksdiego_k
 
Curscatalyst
CurscatalystCurscatalyst
CurscatalystKar Juan
 

Similaire à Auto-loading of Drupal CCK Nodes (20)

JUDCon London 2011 - Bin packing with drools planner by example
JUDCon London 2011 - Bin packing with drools planner by exampleJUDCon London 2011 - Bin packing with drools planner by example
JUDCon London 2011 - Bin packing with drools planner by example
 
Pecl Picks
Pecl PicksPecl Picks
Pecl Picks
 
course slides -- powerpoint
course slides -- powerpointcourse slides -- powerpoint
course slides -- powerpoint
 
Advanced PHPUnit Testing
Advanced PHPUnit TestingAdvanced PHPUnit Testing
Advanced PHPUnit Testing
 
Introducing PHP Latest Updates
Introducing PHP Latest UpdatesIntroducing PHP Latest Updates
Introducing PHP Latest Updates
 
Fatc
FatcFatc
Fatc
 
10 Things Every Plugin Developer Should Know (WordCamp Atlanta 2013)
10 Things Every Plugin Developer Should Know (WordCamp Atlanta 2013)10 Things Every Plugin Developer Should Know (WordCamp Atlanta 2013)
10 Things Every Plugin Developer Should Know (WordCamp Atlanta 2013)
 
Tools and Tips for Moodle Developers - #mootus16
 Tools and Tips for Moodle Developers - #mootus16 Tools and Tips for Moodle Developers - #mootus16
Tools and Tips for Moodle Developers - #mootus16
 
Create a web-app with Cgi Appplication
Create a web-app with Cgi AppplicationCreate a web-app with Cgi Appplication
Create a web-app with Cgi Appplication
 
Having Fun with Play
Having Fun with PlayHaving Fun with Play
Having Fun with Play
 
CodeIgniter PHP MVC Framework
CodeIgniter PHP MVC FrameworkCodeIgniter PHP MVC Framework
CodeIgniter PHP MVC Framework
 
Sprockets
SprocketsSprockets
Sprockets
 
Catalyst MVC
Catalyst MVCCatalyst MVC
Catalyst MVC
 
Drupalcon 2023 - How Drupal builds your pages.pdf
Drupalcon 2023 - How Drupal builds your pages.pdfDrupalcon 2023 - How Drupal builds your pages.pdf
Drupalcon 2023 - How Drupal builds your pages.pdf
 
2023 - Drupalcon - How Drupal builds your pages
2023 - Drupalcon - How Drupal builds your pages2023 - Drupalcon - How Drupal builds your pages
2023 - Drupalcon - How Drupal builds your pages
 
Unittests für Dummies
Unittests für DummiesUnittests für Dummies
Unittests für Dummies
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
 
Debugging in drupal 8
Debugging in drupal 8Debugging in drupal 8
Debugging in drupal 8
 
Perl web frameworks
Perl web frameworksPerl web frameworks
Perl web frameworks
 
Curscatalyst
CurscatalystCurscatalyst
Curscatalyst
 

Dernier

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 

Auto-loading of Drupal CCK Nodes

  • 1. Automatic Scheduled Loading of CCK Nodes ETL with drupal_execute, OO, drush, & cron David Naughton | December 3, 2008
  • 2. Who am I? David Naughton ● Web Applications Developer ● University of Minnesota Libraries ● naughton@umn.edu ● 11+ years development experience ● New to Drupal & PHP
  • 3. What's EthicShare? ethicshare.org • Who: UMN Center for Bioethics, UMN Libraries, & UMN Csci & EE • What: A sustainable aggregation of bioethics research and a forum for scholarship • When: Pilot Phase January 2008 – June 2009 • How: Funded by Andrew W. Mellon Foundation
  • 4. Sustainable Aggregation of Bioethics Research • My part of the project • Extract citations from multiple sources • Transform into Drupal-compatible format • Load into Drupal • On a regular, ongoing basis
  • 5. ETL... • Extract, Transform, and Load = ETL • Very common IT problem • ETL is the most common term for it • Librarians like to say... • “Harvesting” instead of Extracting • “Crosswalking” instead of Transforming • ...but they're peculiar
  • 6. ...ETL • Complex problem • Lots of packaged solutions • Mostly Java, for data warehouses • Not a good fit for EthicShare • Using Drupal 5 and CCK • No Batch API • When we move to Drupal 6... • Batch API http://bit.ly/BatchAPI? • content.crud.inc http://bit.ly/content-crud-inc?
  • 7. Without Automation • First PubMed load alone was > 100,000 citations • Without automation, I could have been doing lots of this:
  • 8. One Solution If money were no object, we could have hired lots of these:
  • 11. Architecture (diagram, all running under drush): Extractors (PubMed, WorldCat, New York Times, BBC) → XML → Transformers (PubMed, WorldCat, New York Times, BBC) → PHP Array → Loader → EthicShare MySQL. CiteETL sits between the Transformers and the Loader.
  • 12. drush A portmanteau of “Drupal shell”. “…a command line shell and Unix scripting interface for Drupal, a veritable Swiss Army knife designed to make life easier for those of us who spend most of our working hours hacking away at the command prompt.” -- http://drupal.org/project/drush
  • 13. Why drush? • Very flexible scheduling via cron ● Uses php-cli, so no web timeouts ● Experimental support for running drush without a running Drupal web instance ● Run tests from the cli with Drush simpletest runner
  • 14. Why not hook_cron? • If you're comfortable with cron, flexible scheduling via hook_cron requires unnecessary extra work ● Subject to web timeouts ● Runs within a Drupal web instance, so large loads may affect user experience
  • 15. drush help
      $ cd $drush_dir
      $ ./drush.php help
      Usage: drush.php [options] <command> <command> ...
      Options:
        -r <path>, --root=<path>   Drupal root directory to use (default: current directory)
        -l <uri> , --uri=<uri>     URI of the drupal site to use (only needed in multisite environments)
        ...
      Commands:
        cite load    Load data to create new citations.
        help         View help. Run "drush help [command]" to view command-specific help.
        pm install   Install one or more modules
  • 16. drush command help
      $ ./drush.php help cite load
      Usage: drush.php cite load [options]
      Options:
        --E=<extractor class>          Base name of an extractor class, excluding the CiteETL/E/ parent path & '.php'. Required.
        --T=<transformer class>        Base name of a transformer class, excluding the CiteETL/T/ parent path & '.php'. Required.
        --L=<loader class>             Base name of a loader class, excluding the CiteETL/L/ parent path & '.php'. Optional: default is 'Loader'.
        --dbuser=<db username>         Optional: 'cite load' will authenticate the user only if both dbuser & dbpass are present.
        --dbpass=<db password>         Optional: 'cite load' will authenticate the user only if both dbuser & dbpass are present.
        --memory_limit=<memory limit>  Optional: default is 512M.
  • 17. drush cite load Example specifying the New York Times – Health extractor & transformer classes on the cli: $ ./drush.php cite load --E=NYTHealth --T=NYTHealth --dbuser=$dbuser --dbpass=$dbpass Allows for flexible, per-data-source scheduling via cron, a requirement for EthicShare.
  • 18. php-cli Problems • PHP versions < 5.3 do not free circular references. This is a problem when parsing loads of XML: Memory Leaks With Objects in PHP 5 http://bit.ly/php5-memory-leak • Still may have to allocate huge amounts of memory to PHP to avoid “out of memory” errors.
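A common workaround for the circular-reference problem on PHP versions before 5.3 is to break the cycles by hand before dropping the last reference to an object graph. The following is a minimal sketch of that idea, not code from CiteETL: the `CitationNode` class and its `destroy()` method are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical tree node with a parent <-> child cycle, similar in shape
// to the object graphs an XML parser builds. Not part of CiteETL.
class CitationNode {
    public $parent = null;
    public $children = array();

    public function addChild(CitationNode $child) {
        $child->parent = $this;      // the child points back at its parent...
        $this->children[] = $child;  // ...and the parent points at the child.
    }

    // Break the cycles explicitly so that plain reference counting can
    // free the whole tree, even without a cycle collector (PHP < 5.3).
    public function destroy() {
        foreach ($this->children as $child) {
            $child->destroy();
        }
        $this->children = array();
        $this->parent = null;
    }
}

$root = new CitationNode();
$root->addChild(new CitationNode());
$root->addChild(new CitationNode());

// Call destroy() before dropping the last reference to the tree.
$root->destroy();
$children_left = count($root->children);
unset($root);
```

On PHP 5.3 and later the built-in cycle collector (see `gc_collect_cycles()`) makes this manual step unnecessary in most cases, though large loads may still need a generous memory limit.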
  • 19. drush API Undocumented, but simple & http://drupal.org/project/drush links to some modules that use it. To create a drush command… ● Implement hook_drush_command, mapping cli text to a callback function name ● Implement the callback function …and optionally… ● Implement a hook_help case for your command
  • 20. drush getopt emulation… Supports: ● --opt=value ● -opt or --opt (boolean based on presence or absence) Contrary to README.txt, does not support: ● -opt value ● -opt=value
  • 21. …drush getopt emulation • Puts options in an associative array, where keys are the option names: $GLOBALS['args']['options'] ● Puts commands (“words” not starting with a dash) in an array: $GLOBALS['args']['commands'] Quirks: ● in cases of repetition (e.g. -opt --opt=value ), last one wins ● commands & options can be interspersed, as long as order of commands is maintained
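The parsing rules listed on these two slides can be illustrated with a small stand-alone parser. This is a sketch of the described behaviour only, not drush's actual code: the function name `parse_cli_words_sketch` is hypothetical, and silently ignoring the unsupported `-opt=value` form is an assumption made for this example.

```php
<?php
// Hypothetical re-implementation of the behaviour described on the slides.
// It is NOT drush's real parser.
function parse_cli_words_sketch(array $words) {
    $parsed = array('options' => array(), 'commands' => array());
    foreach ($words as $word) {
        if (preg_match('/^--([^=]+)=(.*)$/', $word, $m)) {
            // --opt=value: store the value; a later repeat overwrites it.
            $parsed['options'][$m[1]] = $m[2];
        } elseif (preg_match('/^--?([^=-][^=]*)$/', $word, $m)) {
            // -opt or --opt: boolean, set by presence.
            $parsed['options'][$m[1]] = true;
        } elseif ($word !== '' && $word[0] !== '-') {
            // A "word" not starting with a dash is a command; order is kept.
            $parsed['commands'][] = $word;
        }
        // Assumption: anything else (e.g. -opt=value) is ignored here.
    }
    return $parsed;
}

// Commands and options interspersed; the repeated E option shows "last one wins".
$parsed = parse_cli_words_sketch(
    array('cite', '-E', '--E=NYTHealth', 'load', '--T=NYTHealth')
);
```

With this input, `$parsed['commands']` holds `cite` and `load` in order, and the `E` option ends up with the value from `--E=NYTHealth` because it came last.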
  • 22. cite.module example…
      function cite_drush_command() {
        $items['cite load'] = array(
          'callback' => 'cite_load_cmd',
          'description' => t('Load data to create new citations.')
        );
        return $items;
      }
  • 23. …cite.module example…
      function cite_load_cmd($url) {
        global $args;
        $options = $args['options'];
        // Batch loading will often require more
        // than the default memory.
        $memory_limit = (
          array_key_exists('memory_limit', $options)
          ? $options['memory_limit']
          : '512M'
        );
        ini_set('memory_limit', $memory_limit);
        // continued on next slide…
  • 24. …cite.module example
        // …continued from previous slide
        if (array_key_exists('dbuser', $options) && array_key_exists('dbpass', $options)) {
          user_authenticate($options['dbuser'], $options['dbpass']);
        }
        set_include_path(
          './' . drupal_get_path('module', 'cite') . PATH_SEPARATOR
          . './' . drupal_get_path('module', 'cite') . '/contrib' . PATH_SEPARATOR
          . get_include_path()
        );
        require_once 'CiteETL.php';
        $etl = new CiteETL( $options );
        $etl->run();
      } // end function cite_load_cmd
  • 25. CiteETL.php…
      class CiteETL {
        private $option_property_map = array(
          'E' => 'extractor',
          'T' => 'transformer',
          'L' => 'loader'
        );
        // Not shown: identically-named accessors for these properties
        private $extractor;
        private $transformer;
        private $loader;
  • 26. …CiteETL.php…
        function __construct($params) {
          // The loading process is almost always the same...
          if (!array_key_exists('L', $params)) {
            $params['L'] = 'Loader';
          }
          foreach ($params as $option => $class) {
            if (!preg_match('/^(E|T|L)$/', $option)) {
              continue;
            }
            // Naming-convention-based, factory-ish, dynamic
            // loading of classes, e.g. CiteETL/E/NYTHealth.php:
            require_once 'CiteETL/' . $option . '/' . $class . '.php';
            $instantiable_class = 'CiteETL_' . $option . '_' . $class;
            $property = $this->option_property_map[$option];
            $this->$property = new $instantiable_class;
          }
        }
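The naming convention used by the constructor can be shown in isolation. The sketch below maps an option letter and a class base name to the file path and class name the convention implies. The helper name `cite_etl_resolve_sketch` is hypothetical, and the check that restricts base names to word characters is an added assumption, not something the slides show.

```php
<?php
// Hypothetical helper illustrating the convention
// CiteETL/<option>/<class>.php  <->  CiteETL_<option>_<class>.
function cite_etl_resolve_sketch($option, $class) {
    if (!preg_match('/^(E|T|L)$/', $option)) {
        throw new InvalidArgumentException("Unknown option: $option");
    }
    // Assumption beyond the slides: allow only word characters, so a value
    // typed on the command line cannot point outside the CiteETL/ directory.
    if (!preg_match('/^\w+$/', $class)) {
        throw new InvalidArgumentException("Invalid class base name: $class");
    }
    return array(
        'file'  => 'CiteETL/' . $option . '/' . $class . '.php',
        'class' => 'CiteETL_' . $option . '_' . $class,
    );
}

$resolved = cite_etl_resolve_sketch('E', 'NYTHealth');
```

Here `$resolved['file']` is `CiteETL/E/NYTHealth.php` and `$resolved['class']` is `CiteETL_E_NYTHealth`, matching the example in the constructor's comment.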
  • 27. …CiteETL.php
        function run() {
          // Extractors must all implement the Iterator interface.
          $extractor = $this->extractor();
          $extractor->rewind();
          while ($extractor->valid()) {
            $original_citation = $extractor->current();
            try {
              $transformed_citation = $this->transformer->transform( $original_citation );
            } catch (Exception $e) {
              fwrite(STDERR, $e->getMessage() . "\n");
              $extractor->next();
              // Skip loading: there is no transformed citation for this item.
              continue;
            }
            try {
              $this->loader->load( $transformed_citation );
            } catch (Exception $e) {
              fwrite(STDERR, $e->getMessage() . "\n");
            }
            $extractor->next();
          }
        }
      } // end class CiteETL
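Because extractors implement `Iterator`, the extract-transform-load loop can also be written with `foreach`. The sketch below is a simplified, stand-alone illustration with toy stand-ins: `SketchTransformer`, `SketchLoader`, and `run_etl_sketch` are hypothetical names, and errors are collected in an array here rather than written to STDERR.

```php
<?php
// Toy stand-in for a transformer: rejects empty items by throwing.
class SketchTransformer {
    public function transform($item) {
        if ($item === '') {
            throw new Exception('empty citation rejected');
        }
        return array('title' => strtoupper($item));
    }
}

// Toy stand-in for a loader: remembers what it was asked to load.
class SketchLoader {
    public $loaded = array();
    public function load(array $citation) {
        $this->loaded[] = $citation['title'];
    }
}

function run_etl_sketch(Iterator $extractor, $transformer, $loader) {
    $errors = array();
    foreach ($extractor as $original) {
        try {
            $citation = $transformer->transform($original);
            $loader->load($citation);
        } catch (Exception $e) {
            // Record the failure and move on to the next item.
            $errors[] = $e->getMessage();
        }
    }
    return $errors;
}

$loader = new SketchLoader();
$errors = run_etl_sketch(
    new ArrayIterator(array('first', '', 'third')),
    new SketchTransformer(),
    $loader
);
```

In this run the empty item is rejected by the transformer and never reaches the loader, while the items before and after it are still loaded.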
  • 28. Example E. Base Class…
      require_once 'simplepie.inc';
      class CiteETL_E_SimplePie implements Iterator {
        private $items = array();
        private $valid = FALSE;
        function __construct($params) {
          $feed = new SimplePie();
          $feed->set_feed_url( $params['feed_url'] );
          $feed->init();
          if ($feed->error()) {
            throw new Exception( $feed->error() );
          }
          $feed->strip_htmltags( $params['strip_html_tags'] );
          $this->items = $feed->get_items();
        }
        // continued on next slide…
  • 29. …Example E. Base Class
        // …continued from previous slide
        function rewind() {
          $this->valid = (FALSE !== reset($this->items));
        }
        function current() {
          return current($this->items);
        }
        function key() {
          return key($this->items);
        }
        function next() {
          $this->valid = (FALSE !== next($this->items));
        }
        function valid() {
          return $this->valid;
        }
      } # end class CiteETL_E_SimplePie
  • 30. Example Extractor
      require_once 'CiteETL/E/SimplePie.php';
      class CiteETL_E_NYTHealth extends CiteETL_E_SimplePie {
        function __construct() {
          parent::__construct(array(
            'feed_url' => 'http://www.nytimes.com/services/xml/rss/nyt/Health.xml',
            'strip_html_tags' => array('br','span','a','img')
          ));
        }
      } // end class CiteETL_E_NYTHealth
  • 31. Example Transformer…
      class CiteETL_T_NYTHealth {
        private $filter_pattern;
        function __construct() {
          $simple_keywords = array(
            'abortion',
            'advance directives',
            // whole bunch of keywords omitted…
            'world health',
          );
          $this->filter_pattern = '/(' . join('|', $simple_keywords) . ')/i';
        }
        // continued on next slide…
  • 32. …Example Transformer…
        // …continued from previous slide
        function transform( $simplepie_item ) {
          // create an array matching the cite CCK content type structure:
          $citation = array();
          $citation['title'] = $simplepie_item->get_title();
          $citation['field_abstract'][0]['value'] = $simplepie_item->get_content();
          // lots of transformation ops omitted…
          $categories = $simplepie_item->get_categories();
          $category_labels = array();
          foreach ($categories as $category) {
            array_push($category_labels, $category->get_label());
          }
          $citation['field_subject'][0]['value'] = join('; ', $category_labels);
          // Filter once, after every field that filter() reads has been set.
          $this->filter( $citation );
          return $citation;
        }
  • 33. …Example Transformer
        // …continued from previous slide
        function filter( $citation ) {
          $combined_content =
            $citation['title']
            . $citation['field_abstract'][0]['value']
            . $citation['field_subject'][0]['value'];
          if (!preg_match($this->filter_pattern, $combined_content)) {
            throw new Exception(
              "The article '" . $citation['title'] . "', id: "
              . $citation['source_id'] . " was rejected by the relevancy filter"
            );
          }
        }
      } // end class CiteETL_T_NYTHealth
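The keyword-based relevancy filter can be sketched on its own. In this illustration the helper names `build_filter_pattern_sketch` and `is_relevant_sketch` are hypothetical, the `abstract` key is a simplification of the CCK field structure, and the use of `preg_quote()` is an added precaution (an assumption, not shown on the slides) for keywords that contain regex metacharacters.

```php
<?php
// Build one case-insensitive alternation pattern from a keyword list.
function build_filter_pattern_sketch(array $keywords) {
    $quoted = array();
    foreach ($keywords as $keyword) {
        // Escape regex metacharacters and the '/' delimiter.
        $quoted[] = preg_quote($keyword, '/');
    }
    return '/(' . join('|', $quoted) . ')/i';
}

// Return TRUE when any keyword appears in the title or abstract.
function is_relevant_sketch($pattern, array $citation) {
    $combined = $citation['title'] . ' ' . $citation['abstract'];
    return preg_match($pattern, $combined) === 1;
}

$pattern = build_filter_pattern_sketch(
    array('abortion', 'advance directives', 'world health')
);

$relevant = is_relevant_sketch($pattern, array(
    'title'    => 'World Health Organization issues new guidance',
    'abstract' => 'A summary of the report.',
));
$irrelevant = is_relevant_sketch($pattern, array(
    'title'    => 'Stock markets rally',
    'abstract' => 'Shares rose on Tuesday.',
));
```

Returning a boolean instead of throwing is a design choice for this sketch only; the transformer on the slides signals rejection with an exception so that the ETL loop can log and skip the item.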
  • 34. Why not FeedAPI? • Supports only simple one-feed-field to one-CCK-field mappings • Avoid the Rube Goldberg Effect by using the same ETL system for feeds that we use for everything else
  • 35. Loader
      class CiteETL_L_Loader {
        function load( $citation ) {
          // de-duplication code omitted…
          $node = array('type' => 'cite');
          $citation['status'] = 1;
          $node_path = drupal_execute( 'cite_node_form', $citation, $node );
          $errors = form_get_errors();
          if (count($errors)) {
            $message = join('; ', $errors);
            throw new Exception( $message );
          }
          // de-duplication code omitted…
        }
      } // end class CiteETL_L_Loader
  • 36. CCK Auto-loading Resources • Quick-and-dirty CCK imports http://bit.ly/quick-dirty-cck-imports • Programmatically Create, Insert, and Update CCK Nodes http://bit.ly/cck-import-update • What is the Content Construction Kit? A View from the Database. http://bit.ly/what-is-cck
  • 37. CCK Auto-loading Problems • Column names may change from one database instance to another if other CCK content types with identical field names already exist. • drupal_execute bug in Drupal 5 Form API: • cannot call drupal_validate_form on the same form more than once: http://bit.ly/drupal5-formapi-bug • Fixed in Drupal versions > 5