Metadata Extraction and Content Transformations (slide 2)
Nick Burch, Software Engineer, Alfresco (twitter: @gagravarr)
Introduction – 3 Content Related Services (slide 3)
 Covering: Uses, Interfaces, Calling the Services, Java & JavaScript APIs, Demos, Extensions
 Apache Tika, Metadata Extractor, Content Transformer, Renditions
The Metadata Extractor Service (slide 4)
What, How, Why?
 Document Metadata is converted into the content model
 Typically used with uploaded binary files
 Upload a PDF, extract out the Title and Description, save these as the properties on the Alfresco Node
 Powered internally by a number of different extractors
 Service picks the appropriate extractor for you
 Since Alfresco 3.4, makes heavy use of Apache Tika
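The idea that the service picks the appropriate extractor for you can be sketched with a small registry keyed on mime type. This is a minimal illustration using only the JDK; `SketchExtractor`, `SketchExtractorRegistry` and `SketchPdfExtractor` are hypothetical names, not Alfresco classes, and the "extraction" just returns fixed values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for an extractor; not the Alfresco interface.
interface SketchExtractor {
    Set<String> supportedMimetypes();
    Map<String, String> extract(byte[] content);
}

// Hypothetical registry: holds extractors and selects one by mime type.
class SketchExtractorRegistry {
    private final List<SketchExtractor> extractors = new ArrayList<>();

    void register(SketchExtractor extractor) {
        extractors.add(extractor);
    }

    // Return the first registered extractor that claims the mime type, if any.
    Optional<SketchExtractor> getExtractor(String mimetype) {
        return extractors.stream()
                .filter(e -> e.supportedMimetypes().contains(mimetype))
                .findFirst();
    }
}

// Toy extractor: claims PDF and returns fixed values instead of parsing.
class SketchPdfExtractor implements SketchExtractor {
    public Set<String> supportedMimetypes() {
        return Collections.singleton("application/pdf");
    }

    public Map<String, String> extract(byte[] content) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("title", "Example title");
        props.put("description", "Example description");
        return props;
    }
}
```

A caller would ask the registry for `"application/pdf"`, run the returned extractor, and copy the resulting values onto the node's properties; an unsupported type simply yields an empty `Optional`.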
 Driven by mime types, source and destination
 Used to generate plain text versions for indexing
 Used to generate SWF versions for preview
 Used to generate PDF versions for web download
 Powered by a large number of different transformers
 Transformers can be linked together, eg .doc -> .pdf via Open Office, then .pdf -> .swf via pdf2swf
 Since Alfresco 3.4, makes heavy use of Apache Tika
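Linking transformers together, as in the .doc -> .pdf -> .swf example, amounts to finding a path between mime types. The sketch below shows one way to do that with a breadth-first search over the registered source/target pairs. It is an illustration only: `SketchTransformerChain` is a hypothetical class, and how Alfresco itself selects or configures chained transformers is not described in these slides.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: finds a chain of single-step transformations.
class SketchTransformerChain {
    // Source mime type -> target mime types reachable in one step.
    private final Map<String, List<String>> direct = new HashMap<>();

    void addTransformer(String source, String target) {
        direct.computeIfAbsent(source, k -> new ArrayList<>()).add(target);
    }

    // Breadth-first search: shortest sequence of mime types from source
    // to target (inclusive), or an empty list when no chain exists.
    List<String> findChain(String source, String target) {
        Map<String, String> previous = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        previous.put(source, null);
        queue.add(source);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) {
                // Walk back through the predecessors to rebuild the chain.
                List<String> chain = new ArrayList<>();
                for (String step = target; step != null; step = previous.get(step)) {
                    chain.add(step);
                }
                Collections.reverse(chain);
                return chain;
            }
            for (String next : direct.getOrDefault(current, Collections.<String>emptyList())) {
                if (!previous.containsKey(next)) {
                    previous.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }
}
```

Registering `application/msword` -> `application/pdf` and `application/pdf` -> `application/x-shockwave-flash` lets `findChain` return the three-step route for a Word-to-SWF request, while a request in the reverse direction returns an empty list.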
 Or can just alter some content as-is
 Used to manipulate images, eg crop and resize
 Used to generate HTML .docx previews in Web Quick Start
 Often uses the Content Transformation Service
 Replaced the Thumbnail Service
 Renditions are actions
 Grew out of the Lucene community, now widely used
 Provides detection of files – eg this binary blob is really a word file
 Plain text, HTML and XHTML versions of a wide range of different file formats
 Consistent Metadata from different files
Tika hides the complexity of the different formats, and presents a simple, powerful API
 Easy to use and extend
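To make "this binary blob is really a word file" concrete, here is a minimal sketch of signature-based detection using only the JDK. It checks three well-known file signatures (PDF, ZIP, and the OLE2 container used by legacy Office files). `SketchDetector` and its return labels are hypothetical; Tika's own detection is far more thorough and can also look inside containers to tell, say, a .docx from a plain zip.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical detector: classifies content by its leading bytes only.
class SketchDetector {
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns "pdf", "zip", "ole2" or "unknown" for the given leading bytes.
    static String detect(byte[] head) {
        if (startsWith(head, PDF)) {
            return "pdf";
        }
        if (startsWith(head, ZIP)) {
            return "zip";   // also the outer container of .docx, .odt, .epub
        }
        if (startsWith(head, OLE2)) {
            return "ole2";  // outer container of legacy .doc, .xls, .ppt, .msg
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The file name plays no part here, which is the point: a Word document uploaded as `report.bin` would still be recognised as an OLE2 container from its first eight bytes.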
Alfresco 3.3 - Supported Formats (slide 9)
File Formats supported out of the box
 Word, PowerPoint, Excel
 HTML
 Open Document Formats (OpenOffice)
 RFC822 Email
 Outlook .msg Email
 DWG (CAD)
Epub
 RSS and ATOM Feeds
 True Type Fonts
 HTML


Metadata Extraction and Content Transformation

  • 1. Metadata Extraction and Content Transformations (slide 2). Nick Burch, Software Engineer, Alfresco (twitter: @gagravarr)
  • 2. Introduction – 3 Content Related Services (slide 3). Covering: Uses, Interfaces, Calling the Services, Java & JavaScript APIs, Demos, Extensions. Apache Tika, Metadata Extractor, Content Transformer, Renditions
  • 3.
  • 4. Document Metadata is converted into the content model
  • 5. Typically used with uploaded binary files
  • 6. Upload a PDF, extract out the Title and Description, save these as the properties on the Alfresco Node
  • 7. Powered internally by a number of different extractors
  • 8. Service picks the appropriate extractor for you
  • 9.
  • 10. Driven by mime types, source and destination
  • 11. Used to generate plain text versions for indexing
  • 12. Used to generate SWF versions for preview
  • 13. Used to generate PDF versions for web download
  • 14. Powered by a large number of different transformers
  • 15. Transformers can be linked together, eg .doc -> .pdf via Open Office, then .pdf -> .swf via pdf2swf
  • 16.
  • 17. Or can just alter some content as-is
  • 18. Used to manipulate images, eg crop and resize
  • 19. Used to generate HTML .docx previews in Web Quick Start
  • 20. Often uses the Content Transformation Service
  • 21. Replaced the Thumbnail Service
  • 22.
  • 23. Grew out of the Lucene community, now widely used
  • 24. Provides detection of files – eg this binary blob is really a word file
  • 25. Plain text, HTML and XHTML versions of a wide range of different file formats
  • 26. Consistent Metadata from different files
  • 27. Tika hides the complexity of the different formats, and presents a simple, powerful API
  • 28.
  • 29.
  • 32. Open Document Formats (OpenOffice)
  • 34.
  • 36. Epub
  • 37. RSS and ATOM Feeds
  • 38. True Type Fonts
  • 40. Images – JPEG, GIF, PNG, TIFF, Bitmap (including EXIF where found)
  • 42.
  • 43. Microsoft Office (Binary) – Word, PowerPoint, Excel, Visio, Publisher, Works
  • 44. Microsoft Office (OOXML) – Word, PowerPoint, Excel
  • 45. MP3 (id3 v1 and v2)
  • 47. Open Document Format (Open Office)
  • 48. Old-style Open Office (.sxw etc)
  • 49.
  • 54. Java class files. And I probably forgot one...!
  • 55. Calling Apache Tika (slide 13)
        // Get a content detector, and an auto-selecting Parser
        TikaConfig config = TikaConfig.getDefaultConfig();
        ContainerAwareDetector detector =
            new ContainerAwareDetector(config.getMimeRepository());
        Parser parser = new AutoDetectParser(detector);
        // We'll only want the plain text contents
        ContentHandler handler = new BodyContentHandler();
        // Tell the parser what we have
        Metadata metadata = new Metadata();
        metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
        // Have it processed
        parser.parse(input, handler, metadata, new ParseContext());
  • 56. Metadata Extractor – Java Use (slide 14)
        // Look up the registry bean, then the extractor for the mime type
        MetadataExtracterRegistry registry =
            (MetadataExtracterRegistry) context.getBean("metadataExtracterRegistry");
        MetadataExtracter extractor = registry.getExtracter("application/vnd.ms-excel");
        // Extract into a map of properties rather than onto the node
        Map<QName, Serializable> properties = new HashMap<QName, Serializable>();
        ContentReader reader = contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
        extractor.extract(reader, properties);
        System.err.println(properties);
  • 57. Metadata Extractor – JavaScript Use (slide 15)
      JavaScript:
        var action = actions.create("extract-metadata");
        action.execute(document);
      Full access is not directly available
      You can't get at the raw properties
      You can, however, trigger extraction and saving to the node easily
      Available via an action
  • 58. Metadata Extractor – Geo Content Model (slide 16)
        <aspect name="cm:geographic">
          <title>Geographic</title>
          <properties>
            <property name="cm:latitude">
              <title>Latitude</title>
              <type>d:double</type>
            </property>
            <property name="cm:longitude">
              <title>Longitude</title>
              <type>d:double</type>
            </property>
          </properties>
        </aspect>
  • 59. Metadata Extractor – Geo Mapping (slide 17)
        # Namespaces
        namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
        # Geo Mappings
        geolat=cm:latitude
        geolong=cm:longitude
        # Normal Mappings
        author=cm:author
        title=cm:title
        description=cm:description
        created=cm:created
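How a mapping file like the one above turns raw extracted keys into content-model properties can be sketched with `java.util.Properties`. This is an assumption-laden illustration, not Alfresco's implementation: `SketchMapping` is a hypothetical class, the `{uri}localName` output form is used only as a readable way to show a qualified name, and unmapped raw keys are simply dropped.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper: applies a prefix-based mapping file to raw metadata.
class SketchMapping {
    private static final String NS_PREFIX_KEY = "namespace.prefix.";

    // Maps raw keys (e.g. "geolat") to "{namespace-uri}localName" keys.
    static Map<String, String> apply(String mappingText, Map<String, String> raw) {
        Properties mapping = new Properties();
        try {
            mapping.load(new StringReader(mappingText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String target = mapping.getProperty(entry.getKey());
            if (target == null) {
                continue; // no mapping for this raw key: drop it
            }
            int colon = target.indexOf(':');
            if (colon < 0) {
                continue; // not in prefix:localName form: skip it
            }
            String uri = mapping.getProperty(NS_PREFIX_KEY + target.substring(0, colon));
            if (uri == null) {
                continue; // prefix was never declared: skip it
            }
            result.put("{" + uri + "}" + target.substring(colon + 1), entry.getValue());
        }
        return result;
    }
}
```

With the geo mapping shown on the slide, a raw `geolat` value ends up under the `latitude` property in the content model namespace, while a raw key with no mapping line never reaches the node.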
  • 60. Demo: Geo Tagged Image in Share (slide 18)
  • 62.
  • 63. PDF to Image
  • 64. PDF to SWF (for preview)
  • 65. Office File Formats to PDF (via Open Office, using JODConverter in Enterprise)
  • 66. Plain Text and XML to PDF
  • 67. Zip listing to Text
  • 68.
  • 69. Content Transformer – Java Use (slide 22)
        // Look up the registry bean, then a transformer for source -> target
        ContentTransformerRegistry registry =
            (ContentTransformerRegistry) context.getBean("contentTransformerRegistry");
        ContentTransformer transformer = registry.getTransformer(
            "application/vnd.ms-excel", "text/csv", new TransformationOptions());
        ContentReader reader = contentService.getReader(sourceNodeRef, ContentModel.PROP_CONTENT);
        // The destination needs a writer (not a reader), with its mime type set
        ContentWriter writer = contentService.getWriter(destNodeRef, ContentModel.PROP_CONTENT, true);
        writer.setMimetype("text/csv");
        transformer.transform(reader, writer);
  • 70. Content Transformer – JavaScript Use (slide 23)
      JavaScript:
        var action = actions.create("transform");
        // Transform into the same folder
        action.parameters["destination-folder"] = document.parent;
        action.parameters["assoc-type"] = "{http://www.alfresco.org/model/content/1.0}contains";
        action.parameters["assoc-name"] = document.name + "transformed";
        action.parameters["mime-type"] = "text/html";
        // Execute
        action.execute(document);
      Full access is not directly available
      You can't control which property is transformed, it's always Content
      You can control where the transformed version goes
      Triggering the transformation is easier than in Java
      Available via an action
  • 71. Custom Tika Parsers - Interface (slide 24)
      Interface:
        public interface Parser {
          Set<MediaType> getSupportedTypes(ParseContext context);
          void parse(InputStream stream, ContentHandler handler,
                     Metadata metadata, ParseContext context)
              throws IOException, SAXException, TikaException;
        }
      The Tika Parser interface is quite simple
      Need to provide a list of supported mime types, so that auto-detection can work
      Accept an input stream, populate the Metadata object, and fire SAX events to the supplied handler
      That's it!
  • 72. Custom Tika Parser – Hello World Parser (slide 25)
        public class HelloWorldParser implements Parser {
          // Claim the made-up "hello/world" type so auto-detection can route to us
          public Set<MediaType> getSupportedTypes(ParseContext context) {
            Set<MediaType> types = new HashSet<MediaType>();
            types.add(MediaType.parse("hello/world"));
            return types;
          }
          // Ignore the input; emit a fixed XHTML body and two metadata values
          public void parse(InputStream stream, ContentHandler handler,
                            Metadata metadata, ParseContext context) throws SAXException {
            XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
            xhtml.startDocument();
            xhtml.startElement("h1");
            xhtml.characters("Hello, World!");
            xhtml.endElement("h1");
            xhtml.endDocument();
            metadata.set("hello", "world");
            metadata.set("title", "Hello World!");
          }
        }
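The other half of "fire SAX events to the supplied handler" is the handler that receives them. The sketch below uses only the SAX classes bundled with the JDK: one method emits the same start/characters/end sequence as the Hello World parser, and a small `DefaultHandler` subclass keeps just the character data, which is roughly the job a plain-text handler does. `SketchSaxEmitter` and `SketchTextCollector` are hypothetical names, not Tika classes.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: ignores markup events and keeps only the text.
class SketchTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }
}

// Hypothetical emitter: fires the SAX events for a single <h1> heading.
class SketchSaxEmitter {
    static void emitHeading(String message, ContentHandler handler) throws SAXException {
        handler.startDocument();
        handler.startElement("", "h1", "h1", new AttributesImpl());
        char[] chars = message.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "h1", "h1");
        handler.endDocument();
    }

    // Convenience wrapper: emit the events and return the collected text.
    static String textOf(String message) {
        SketchTextCollector collector = new SketchTextCollector();
        try {
            emitHeading(message, collector);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.getText();
    }
}
```

Because the parser only talks to the `ContentHandler` interface, the same event stream could be sent to a handler that writes XHTML, collects text, or feeds an indexer, without the parser changing.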
  • 73. Custom Command Line Transformer (slide 26)
        <bean id="transformer.worker.helloWorldCMD"
              class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
          <property name="mimetypeService"><ref bean="mimetypeService"/></property>
          <property name="transformCommand">
            <bean class="org.alfresco.util.exec.RuntimeExec">
              <property name="commandsAndArguments"><map>
                <entry key=".*"><list>
                  <value>/bin/bash</value>
                  <value>-c</value>
                  <value>/bin/echo 'Hello World - ${source}' &gt; ${target}</value>
                </list></entry>
              </map></property>
              <property name="errorCodes"><value>1,127</value></property>
            </bean>
          </property>
          <property name="explicitTransformations">
            <list><bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
              <property name="sourceMimetype"><value>text/plain</value></property>
              <property name="targetMimetype"><value>hello/world</value></property>
            </bean></list>
          </property>
        </bean>
        <bean id="transformer.helloWorldCMD"
              class="org.alfresco.repo.content.transform.ProxyContentTransformer"
              parent="baseContentTransformer">
          <property name="worker"><ref bean="transformer.worker.helloWorldCMD"/></property>
        </bean>
  • 75. Use our Tika transformer to turn this back into plain text
  • 79. image – crop, resize, etc
  • 80. freemarker – runs a Freemarker Template against the content of the node
  • 81. html – turns .docx files into clean HTML + images
  • 82. xslt – runs an XSLT Transformation against the content of the node, XML content nodes only!
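As a rough picture of what "running an XSLT Transformation against the content" involves, the sketch below applies a stylesheet to an XML string using the JDK's `javax.xml.transform` API. It is not the Alfresco xslt rendering engine: `XsltSketch`, `SAMPLE_XML` and `SAMPLE_XSLT` are hypothetical, and the XML string is only a stand-in for a node's content.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical illustration of an XSLT run; not the Alfresco rendering engine.
class XsltSketch {

    // Stand-in for the XML content of a node.
    static final String SAMPLE_XML = "<doc><title>Hello</title></doc>";

    // Stylesheet that outputs the document title as plain text.
    static final String SAMPLE_XSLT =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"text\"/>"
          + "<xsl:template match=\"/\"><xsl:value-of select=\"doc/title\"/></xsl:template>"
          + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML and returns the result as a string.
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(SAMPLE_XML, SAMPLE_XSLT));
    }
}
```

This also shows why the engine is limited to XML content nodes: the source has to parse as XML before any template can match.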
  • 84. Then set all the parameters against it
  • 85. Finally execute it against a node
  • 86. For very complicated / common renditions, you can save the definition to the data dictionary
  • 87. It can then be retrieved and run
  • 89. QName renditionName = QName.createQName( NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");
  • 91. // Make some changes.
  • 94. // Persist the changes.
  • 96. // Run the Rendition
  • 104. They won’t show up in Share when defining Rules, or in Explorer for running a Custom Action
  • 105. Solution – create a JS Script, or some custom Java
  • 106. Use this from your Rule / to run as an Action
  • 107. No dedicated REST API, but Renditions are available through CMIS
  • 109. This delivers lots of flexibility, and means anyone who can write Custom Actions already knows enough to write Custom Rendition Engines!
  • 113. Demo 3: Word .docx -> HTML & Images (Using Web Quick Start) 38
  • 115. Learn More 40 wiki.alfresco.com forums.alfresco.com blogs.alfresco.com/wp/nickb/ twitter: @AlfrescoECM @Gagravarr

Editor's notes

  1. Use devconf-files/geotagged.jpg, AKA quickGEO.jpg from projects/repository/source/test-resources/quick/. Upload it into Share. Go to the details page, and show the EXIF and GEO bits.
  2. Create a script with the code shown, also available as devconf-js/contentTransformHellowWorld.js. Create a simple .txt entry in Explorer. Run the script against it. Show that we get a .bin with content type of hello/world. Maybe view the contents of this. Run the script again. Show that we get a new plain text file, with the simple contents.
  3. Use quick.xls or similar from projects/repository/source/test-resources/quick/. Upload it into Explorer. Run a custom action which is a script: var nameBase = document.name.substring(0, document.name.lastIndexOf(".")); var action = actions.create("transform"); action.parameters["destination-folder"] = document.parent; action.parameters["assoc-type"] = "{http://www.alfresco.org/model/content/1.0}contains"; action.parameters["assoc-name"] = nameBase + ".txt"; action.parameters["mime-type"] = "text/plain"; action.execute(document); action.parameters["assoc-name"] = nameBase + ".csv"; action.parameters["mime-type"] = "text/csv"; action.execute(document); action.parameters["assoc-name"] = nameBase + ".html"; action.parameters["mime-type"] = "text/xml"; action.execute(document);
  4. In Share, create a rule on a folder to execute the script below. Ensure the folder that the script writes to isn’t a child! Upload an image, at least 600x400 big, into the folder. In Share, use the repo browser to get to the special folder, and show the smaller cropped version. var renditionDef = renditionService.createRenditionDefinition("Test", "htmlRenderingEngine"); renditionDef.parameters["destination-path-template"] = "/Company Home/Test/${name}.html"; renditionDef.execute(document); renditionDef.parameters["destination-path-template"] = "/Company Home/Test/BodyOnly/${name}.html"; renditionDef.parameters["bodyContentsOnly"] = true;
  5. In Share, create a Video folder, and a Video Drop subfolder. On the subfolder, create a rule. Rule is copy + transform. Target mimetype is flash video. Target directory is the parent (Videos). Don’t run in the background. Upload devconf-files/Video.mp4. Wait. Hopefully show the thumbnail of the .mp4 (refresh if needed). Go to the videos directory. Show the large preview image of the new .flv image.
  6. Upload a .docx (including images) to a Web Quick Start folder with .docx extraction enabled via Share. View in Share the HTML, to show what it looks like. Flip to the WCM site, and show it with images. BACKUP – Create a script based on devconf-files/renderTest.js. In Share, create a folder with a rule of execute script. Pick the script. Upload devconf-files/word3imgs.docx. Go to Company Home/Test. Show the html. Show the images. (You can’t show the two together without WCM, sorry....)