I Can Convert
- 2. I Can Convert!
• Sven Aas: @svenaas / saas@mtholyoke.edu
• Jason Proctor: @jmpmhc / jproctor@mtholyoke.edu
• #TPR2
©2012 Sven Aas and Jason Proctor, Mount Holyoke College
- 3. We’re going to talk about
• Stories
• Patterns
• Tools
- 5. Use Your Tools
• Spreadsheet
• Programmer’s Editor
• Programming Language
- 6. Spreadsheet
- 7. Spreadsheet
- 11. Programming Language
- 12. Use Your Tools!
You’ve GOT this stuff.
- 14. Portal News
- 15. Unusual Data Representation
Table columns:
+--------------+
| Data         |
+--------------+
| node         |
| name         |
| type         |
| mode         |
| owner        |
| group        |
| url          |
| desc         |
| parent       |
| linkto       |
| ctime        |
| mtime        |
| mod_by       |
| visible      |
| userdata     |
| datasize     |
| datafilename |
+--------------+
Sample record:
| 4692909 | G1158673129-8322 |   16 | rwlrwlr-l |
21139 | 71 1000009 1000010 1000011 1000012 1000013
1000014 1000015 1000016 1000017 1000018 1000019
1000020 |      |      | 2100709 |   NULL | 1158673129
| 1170344089 | 21139  |       1 |
01|Second Saturday: MHC Students Hit the Road|As part
of new student orientation, members of the class of
2010 worked on community service projects across the
Pioneer Valley on September 16. View the photo
gallery.||http://www.mtholyoke.edu/offices/comm/news/sec_sat_06/page1.html|1158638400|1170305999|||||
11.41|:^:^:^:^:^JPG:^75:^75:^2813:^Second
Saturday:^:^:^:^0:^
|     2813 | V1158673129-9689 |
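The userdata value in the sample record packs a second, pipe-delimited record inside a single column, and one of its fields is in turn delimited by `:^`. The following is a minimal sketch of unpacking such a value with plain String methods. The field positions and their meanings (title, URL, timestamps, image details) are guesses made for illustration, as is the assumption that the wrapped lines of the slide join as shown.

```ruby
# Sample userdata value transcribed from the slide; how the wrapped
# lines join is an assumption.
userdata = "01|Second Saturday: MHC Students Hit the Road|As part of new " \
           "student orientation, members of the class of 2010 worked on " \
           "community service projects across the Pioneer Valley on " \
           "September 16. View the photo gallery.||" \
           "http://www.mtholyoke.edu/offices/comm/news/sec_sat_06/page1.html" \
           "|1158638400|1170305999|||||11.41|" \
           ":^:^:^:^:^JPG:^75:^75:^2813:^Second Saturday:^:^:^:^0:^"

# A negative limit keeps trailing empty fields instead of dropping them.
fields = userdata.split("|", -1)

title = fields[1]  # guessed meaning: headline
url   = fields[4]  # guessed meaning: link to the full story

# Guess: the two ten-digit numbers are Unix timestamps.
starts_at = Time.at(fields[5].to_i).utc

# The last field carries its own delimiter, so it is split a second time.
image = fields.last.split(":^", -1)
image_type = image[5]  # "JPG" in the sample
```

Splitting in two stages keeps each delimiter's scope separate, which matters when an inner field may legitimately be empty.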
- 16. Ruby to the Rescue
[Diagram: an Importer connects the Portal System (LegacyUser, LegacyItem) to the News System (User, Item, Story, Link, Channel).]
- 17. ActiveRecord
• A Ruby library which implements the ActiveRecord software
architecture pattern.
• The original Model and ORM component of Ruby on Rails.
• We used it to provide a convenient object layer on top of two
underlying relational databases.
- 19. Object Extraction
Context: Ingesting source data.
Problem: Source data objects contain multiple target objects.
Solution: Process or parse target data just enough to extract
objects.
Tools: String methods, RegEx, DOM/XML selection.
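As a rough illustration of this pattern, the sketch below pulls two target objects out of one source document with a regular expression, parsing only as much as is needed to find the object boundaries. The markup, class name, and hash keys are invented for the example.

```ruby
# One hypothetical source object that holds several target objects.
source = <<~HTML
  <div class="item"><h2>First story</h2><p>Body one.</p></div>
  <div class="item"><h2>Second story</h2><p>Body two.</p></div>
HTML

# Match just enough structure to find each item and its two parts.
item_pattern = %r{<div class="item"><h2>(.*?)</h2><p>(.*?)</p></div>}m

# String#scan returns one [title, body] pair per match.
stories = source.scan(item_pattern).map do |title, body|
  { title: title, body: body }
end
```

A regular expression is only adequate when the source markup is as regular as it is here; for less predictable input, DOM/XML selection is the safer tool from the list above.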
- 20. Encoding Change
Context: Mapping source data to target.
Problem: Source text encoding differs from target.
Solution: Perform intermediate translation.
Tools: String methods, RegEx, programming libraries.
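A minimal sketch of the intermediate translation, assuming the source text is Windows-1252 and the target expects UTF-8 (the actual encodings involved are not stated in the slides). It uses Ruby's built-in String#encode.

```ruby
# Raw bytes as they might arrive from a legacy system: an accented
# letter and curly quotes in Windows-1252 (an assumed source encoding).
raw = "Caf\xE9 \x93quoted\x94".b

# Label the bytes with their real encoding, then transcode to UTF-8.
legacy = raw.force_encoding("Windows-1252")
utf8   = legacy.encode("UTF-8", invalid: :replace, undef: :replace)
```

The `invalid:` and `undef:` options substitute a replacement character for bytes that cannot be converted, rather than raising an error partway through a long import.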
- 21. URL/Path Translation
Context: Preparing target environment and data.
Problem: Assets in target system will be available at different
paths or URLs from their locations in source system.
Solution: Map source locations to target locations. Replace
references in data before saving to target.
Tools: String methods, RegEx, DOM/XML selection.
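One way to sketch the mapping step is a table of source prefixes and their target equivalents, applied to every `href` and `src` attribute before the data is saved. The target paths in `path_map` are hypothetical, and the regular expression assumes double-quoted attributes.

```ruby
# Hypothetical mapping from source path prefixes to target prefixes.
path_map = {
  "/offices/comm/news/" => "/news/archive/",
  "/images/old/"        => "/sites/default/files/"
}

# Rewrite href/src values whose path starts with a mapped prefix.
translate = lambda do |html|
  html.gsub(/(href|src)="([^"]+)"/) do
    attr = Regexp.last_match(1)
    url  = Regexp.last_match(2)
    prefix = path_map.keys.find { |candidate| url.start_with?(candidate) }
    url = url.sub(prefix, path_map[prefix]) if prefix
    %(#{attr}="#{url}")
  end
end

translated = translate.call('<a href="/offices/comm/news/sec_sat_06/page1.html">photo gallery</a>')
```

References that match no prefix pass through unchanged, so the same function can be run over every item without special cases.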
- 22. Getting the News Out
- 23. Easy Come, Easy Go
1. Export Athletics news items to hosted service.
2. Export all news items to digital archives.
- 24. Exporting Athletics Items
• 10 years of Athletics news in 14 channels.
• Export each item in a minimal, predictable HTML wrapper.
• Include metadata for each item in <meta> tags in the <head>.
• Group items by sport and by academic year.
• Generally accommodate the target system.
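The wrapper described above could look something like the following sketch, which builds the HTML with plain strings rather than the HAML templates the project used. The item fields and metadata names are invented for illustration.

```ruby
require "cgi"

# Wrap one news item in a minimal, predictable HTML document, with its
# metadata exposed as <meta> tags in the <head>.
wrap_item = lambda do |item|
  meta_tags = item[:meta].map do |name, value|
    %(<meta name="#{CGI.escapeHTML(name.to_s)}" content="#{CGI.escapeHTML(value.to_s)}">)
  end

  lines = ["<!DOCTYPE html>", "<html>", "<head>",
           "<title>#{CGI.escapeHTML(item[:title])}</title>"]
  lines += meta_tags
  lines += ["</head>", "<body>", item[:body], "</body>", "</html>"]
  lines.join("\n")
end

# Hypothetical item; the sport and academic year support later grouping.
html = wrap_item.call({
  title: "Team Wins & Advances",
  body:  "<p>Story text.</p>",
  meta:  { sport: "Field Hockey", academic_year: "2011-12" }
})
```

Keeping the sport and academic year in `<meta>` tags lets the receiving system group items without parsing the story body.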
- 25. HAML
• A lightweight markup language used to generate HTML.
• A meta-markup language.
• We used it to succinctly express the HTML we wanted from
within our Ruby code.
- 26. Archiving Web News
• 14 years of news: 6,000 items, 5,000 images, 34 channels.
• Export each news item in an archival form preserving the
original markup and character entities (but not the design):
PDF generated from HTML generated from HAML.
• Export Dublin Core metadata for each news item:
XML generated via Builder.
- 27. Builder
• A Ruby library for generating XML.
• We used it to dynamically generate simple XML from within a
Ruby application.
- 28. wkhtmltopdf
• A shell utility for generating PDF files by rendering HTML
documents using the WebKit rendering engine.
• A Ruby library providing programmatic access to the
wkhtmltopdf shell utility.
• We used it so that we could use familiar web development
techniques to generate PDFs without having to implement our
own rendering and layout routines.
- 29. Familiar Patterns
• Object Extraction
• Encoding Change
• URL/Path Translation
- 30. Direct Translation
Context: Simple conversion.
Problem: Data conversion.
Solution: Read source objects and write targets in single pass.
Tools: Varies.
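A single-pass conversion can be as small as the sketch below, which reads pipe-delimited source lines and writes one JSON object per line to the target. The record layout is hypothetical, and StringIO stands in for real source and target files.

```ruby
require "json"
require "stringio"

# Stand-ins for the source and target files.
source = StringIO.new("1|First story|2006-09-16\n2|Second story|2006-09-17\n")
target = StringIO.new

# Read each source object and write its target form immediately.
source.each_line do |line|
  id, title, date = line.chomp.split("|", -1)
  target.puts JSON.generate({ "id" => id.to_i, "title" => title, "date" => date })
end
```

Nothing is held in memory beyond the current record, which is what makes this approach suitable only when each target object depends on exactly one source object.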
- 31. Markup Change
Context: Mapping source data to target.
Problem: Source text markup differs from target.
Solution: Perform intermediate translation.
Tools: String methods, RegEx, DOM/XML selection,
programming libraries.
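For simple cases the translation can be an ordered list of substitution rules, as in this sketch. The specific rules (presentational tags replaced with semantic ones, `<font>` tags dropped) are examples chosen for illustration, not the rules the project used.

```ruby
# Ordered [pattern, replacement] rules; \1 carries the optional slash
# so that opening and closing tags are handled by the same rule.
rules = [
  [%r{<(/?)b>}i, '<\1strong>'],
  [%r{<(/?)i>}i, '<\1em>'],
  [%r{</?font[^>]*>}i, ""]
]

# Apply every rule in turn to the text.
convert = lambda do |html|
  rules.reduce(html) { |text, (pattern, replacement)| text.gsub(pattern, replacement) }
end

converted = convert.call('<font color="red"><B>Hit</B> the <i>road</i></font>')
```

Rule lists like this stay readable up to a point; once rules begin to interact, a parser-based approach is usually easier to reason about.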
- 32. Data Cleanup
Context: Ingesting source data.
Problem: Source data is ... imperfect.
Solution: Fix what you can confidently fix.
Tools: Varies.
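"Fix what you can confidently fix" suggests a cleanup step that repairs the unambiguous problems and flags the rest for a person. A sketch under that reading follows; the record fields, the date format, and the `needs_review` flag are all assumptions.

```ruby
require "date"

# Normalize what is unambiguous; flag what is not.
clean = lambda do |record|
  cleaned = record.dup
  # Collapsing runs of whitespace is always safe.
  cleaned[:title] = record[:title].gsub(/\s+/, " ").strip
  begin
    # Only rewrite dates that parse in the expected format.
    cleaned[:date] = Date.strptime(record[:date], "%B %d, %Y").iso8601
  rescue ArgumentError
    # Leave the original value alone and mark it for a human.
    cleaned[:needs_review] = true
  end
  cleaned
end

good = clean.call({ title: "  Hit   the\n road ", date: "September 12, 2001" })
bad  = clean.call({ title: "Untitled", date: "sometime in fall" })
```

The unparseable date is preserved as-is, so no information is lost before someone reviews it.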
- 33. Convert All the Things!
- 34. Finally Done with News?
• HTML files scraped via Nokogiri scripts.
• Quite a bit of cleanup: garbage in, garbage out.
• Unscrapable news items.
• “September 12, 2001”.
- 35. Nokogiri
• A Ruby library for parsing XML and HTML.
• Supports DOM or SAX parsing.
• Implements both XPath and CSS3 selectors.
• We used it to parse and extract content from the set of HTML
files containing existing news stories.
- 36. Familiar Patterns
• Direct Translation
• Encoding Change
• Markup Change
• URL/Path Translation
• Data Cleanup
- 37. The Big One
- 38. CMS Conversion
• Old CMS pages were all published with several different
presentational styles, but all with the same DOM. That means
we can scrape ’em!
• We agreed not to change anything else during the import. That
means we can treat it as a clean switchover.
- 40. Three-Pronged Conversion
• Build the necessary structures and themes to accommodate
and represent our old content.
• Build a library of code for scraping the pages generated by the
old site, cataloging data and metadata, and storing them in an
intermediate representation.
• Build a library of code for importing this intermediate
representation into the new CMS structures.
- 41. Migrate
• A Drupal module providing a framework for data import into
the Drupal content management system.
• Supports a variety of sources and targets out of the box.
• Extensible to support additional migration sources and targets.
• We used it to import the XML representation of our site into
our Drupal system.
- 42. Familiar Patterns
• Object Extraction
• Encoding Change
• Markup Change
• URL/Path Translation
• Data Cleanup
- 43. Intermediate Representation
Context: Complex conversion.
Problem: Data conversion.
Solution: Convert source data to intermediate representation in
one pass. Then convert intermediate representation to target.
Tools: Representation: Database, XML, CSV. Conversion: Varies.
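The two passes can be sketched as two independent functions joined only by the intermediate data. JSON is used here purely for convenience (the slide lists database, XML, and CSV as representations), and the source layout and target markup are invented.

```ruby
require "json"

# Pass 1: convert source lines into the intermediate representation.
to_intermediate = lambda do |source_lines|
  items = source_lines.map do |line|
    id, title = line.chomp.split("|", 2)
    { "id" => id, "title" => title }
  end
  JSON.generate(items)
end

# Pass 2: convert the intermediate representation into target markup.
to_target = lambda do |intermediate|
  JSON.parse(intermediate).map do |item|
    %(<h2 id="item-#{item['id']}">#{item['title']}</h2>)
  end
end

intermediate = to_intermediate.call(["1|First\n", "2|Second\n"])
headings     = to_target.call(intermediate)
```

Because the second pass reads only the intermediate data, it can be rerun and debugged without touching the source system again.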
- 44. Object Identity
Context: Ingesting source data.
Problem: Data objects are repeated in source data.
Solution: Uniquely identify source objects.
Tools: String methods, RegEx, DOM/XML selection.
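One possible way to identify source objects uniquely is to derive a key from fields that should be stable across copies, as sketched below. Treating a normalized URL as the identity is an assumption made for this example, and the URLs are invented.

```ruby
require "digest"

# Derive a stable key from the normalized URL (assumed to be the
# identifying field): ignore letter case and a trailing slash.
identity = lambda do |item|
  Digest::SHA1.hexdigest(item[:url].downcase.chomp("/"))
end

# Hypothetical source data in which one story appears twice.
items = [
  { url: "http://www.mtholyoke.edu/news/story/", title: "A" },
  { url: "HTTP://www.mtholyoke.edu/news/story",  title: "A (copy)" },
  { url: "http://www.mtholyoke.edu/news/other",  title: "B" }
]

# Keep the first object seen for each identity.
unique = items.uniq { |item| identity.call(item) }
```

Hashing the key is optional; it simply yields a fixed-length identifier that is convenient to store alongside the imported object.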
- 45. Object Aggregation
Context: Ingesting source data.
Problem: Target data objects contain multiple source objects.
Solution: Aggregate objects at intermediate or output stage.
Tools: Varies.
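As an illustration of aggregating at the intermediate stage, the sketch below groups several source pages by the story they belong to and joins them, in page order, into one target object. The field names and sample data are hypothetical.

```ruby
# Hypothetical source objects: one row per page of a story.
pages = [
  { story: "sec_sat_06", page: 2, body: "Second page." },
  { story: "sec_sat_06", page: 1, body: "First page." },
  { story: "other",      page: 1, body: "Only page." }
]

# Group the pages by story, then merge each group into one target object.
stories = pages.group_by { |part| part[:story] }.map do |story, parts|
  ordered = parts.sort_by { |part| part[:page] }
  { story: story, body: ordered.map { |part| part[:body] }.join("\n\n") }
end
```

Sorting inside each group means the source pages can arrive in any order.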
- 46. Lessons
• You already have a good toolbox. Keep your tools sharp.
• Understand your source and target models.
• Watch for familiar patterns.
• Conversion is an opportunity for cleanup and improvement.
• Human labor can sometimes be cheaper than automation.
- 47. YOU Can Convert
- 48. Questions?
- 49. Thank you, & keep in touch!
• Sven Aas: @svenaas / saas@mtholyoke.edu
• Jason Proctor: @jmpmhc / jproctor@mtholyoke.edu
• #TPR2
- 50. Colophon
• This presentation is set in Exo Extra Bold from Natanael
Gama’s ndiscovered, with headings in ChunkFive from The
League of Movable Type.
• Background images were adapted from
FreeSeamlessTextures.com’s Red Watercolor and The Grid, by
Willem Pirquin.
- 51. Colophon (continued)
• Card-size survival tool photo via acreativeedge.info
• Leatherman photo via SonnyandSandy
• Studley Tool Chest photo via FineWoodworking.com
- 52. Colophon (continued)
• Audio from Wikipedia:Sound/List:
• Edvard Grieg - Piano Concerto in A Minor, Op. 16 - iii. Allegro moderato molto, recorded by
the Skidmore College Orchestra.
• W.A. Mozart - 5th Piano Concerto, i. Allegro aperto, recorded by Ben Goldstein and Bendik
Eide.
• Anton Reicha - Variations for Bassoon, recorded by Arthur Grossman
• J.S. Bach - Cello Suite 1 in G - Minuets, recorded by John Michel
• Mississippi John Hurt - “Nobody’s Dirty Business”
- 53. Colophon (continued)
• Other Audio
• Jack Beaver - “Workaday World”
• Danny Elfman - “Breakfast Machine”