EXTRA Open Source Rules Classification for News
- 2. An Update on EXTRA
Stuart Myles * Associated Press * 16th May 2017
© 2017 IPTC (www.iptc.org) All rights reserved
https://flic.kr/p/tiRXEB
- 3. Rules-Based Classification
• Rules better for breaking news than statistical methods
– You don’t need 50 examples before you can start tagging
– A rule for a new topic doesn’t require other rules to change
• More consistent and scalable than hand tagging
• Easier to explain why rules classify content
– Machine learning methods are still “black boxes”
– Easier to precisely explain - and correct - mistakes
• You can use your own taxonomy, rules and formats
- Example rules help us drive development of the EXTRA system
- You can use the example rules to see how to develop your own
- Rules could apply IPTC Media Topics or any other taxonomy
- 4. EXTRA
EXTraction Rules Apparatus
Rules-based classification of text
Open source software
EXTRA is being developed by the IPTC
€50,000 Grant from the Digital News Initiative
https://www.digitalnewsinitiative.com/fund/
https://iptc.github.io/extra/
- 5. Development Process
The EXTRA software is being developed by Infalia
- All software is open source
Two linguists creating rules in English and German
- Sample rules to apply IPTC Media Topics
Example news corpora licensed for EXTRA
- English from Thomson Reuters
- German from APA
- 7. Classification using Percolator
• Elasticsearch
– A sophisticated, open source full-text search engine
– Lets you query documents stored in an index
• Elasticsearch Percolator
– Store queries in an index and match documents to queries
– Classification uses the percolator to match documents to rules
• EXTRA Rule Language
– Rule-writer-friendly language (easier than ES DSL)
– Access to all ES features, plus custom operators
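The percolator idea above (store the queries, then ask which of them match an incoming document) can be sketched as the JSON bodies one might send to Elasticsearch. This is only an illustration: the field names `query`, `topic` and `body` are assumptions, not EXTRA's actual index layout, and the mapping uses the typeless form of recent Elasticsearch releases (2017-era versions nested `properties` under a type name). No server is contacted; the code just builds and prints the request bodies.

```python
import json

# Assumed mapping for illustration only; EXTRA's own index layout may differ.
# "query" holds the stored rule, "body" mirrors the document field it searches.
MAPPING = {
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            "topic": {"type": "keyword"},
            "body": {"type": "text"},
        }
    }
}


def rule_document(topic, phrase):
    """Build a stored rule: a topic label plus the query that triggers it."""
    return {"topic": topic, "query": {"match_phrase": {"body": phrase}}}


def percolate_request(document):
    """Build a search body asking which stored queries match this document."""
    return {"query": {"percolate": {"field": "query", "document": document}}}


if __name__ == "__main__":
    print(json.dumps(MAPPING, indent=2))
    print(json.dumps(rule_document("politics", "angela merkel"), indent=2))
    print(json.dumps(percolate_request({"body": "Angela Merkel spoke today."}), indent=2))
```

Each hit returned for the percolate request would be a stored rule, so its `topic` value is the tag to apply to the document.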
- 8. Schema and Rules
• EXTRA Schema
– Documents must be in (or converted to) a JSON format
– But it can be any JSON format you choose
– Allows validating that your rules reference valid fields
• Granular, field-by-field control of analyzers
– Such as whether and how to stem, e.g. by language
– Different ways to tokenize fields, e.g. for slug
– Allows a field to be queried as a whole or tokenized by sentence or paragraph
– Allows validating that operators are valid by field type
• E.g. to flag that your rule references paragraphs in a field that has
none
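The kind of check described above, flagging a rule that references an unknown field or asks for paragraphs in a field that has none, might look like the following minimal sketch. The `SCHEMA` structure, the unit names and `validate_reference` are hypothetical and are not taken from the EXTRA code base.

```python
# Hypothetical schema: field name -> units the field can be queried by.
SCHEMA = {
    "headline": {"whole"},
    "body": {"whole", "paragraph"},
}


def validate_reference(schema, field, unit="whole"):
    """Return a list of problems with one field reference in a rule."""
    if field not in schema:
        return [f"unknown field: {field}"]
    if unit not in schema[field]:
        return [f"field {field} cannot be queried by {unit}"]
    return []


if __name__ == "__main__":
    # A paragraph-level reference is valid for body but not for headline.
    print(validate_reference(SCHEMA, "body", "paragraph"))
    print(validate_reference(SCHEMA, "headline", "paragraph"))
    print(validate_reference(SCHEMA, "slug"))
```

Running such a check when a rule is saved, rather than when it is applied, would surface mistakes to the rule writer before any document is classified.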
- 9. Schema and Rules Example
• Two fields, headline and body, with body allowed to be queried by paragraph
headline
body
body_paragraph
• A rule to require that “angela merkel” and “us elections”
appear in the same paragraph
(prox/unit=paragraph/distance=1
(body adj "angela merkel")
(body adj "us elections")
)
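To make the intent of the rule above concrete, here is a plain-Python sketch of the condition it expresses: both phrases must occur within a single paragraph of the body. It is not how EXTRA evaluates rules (that goes through Elasticsearch); the blank-line paragraph split and the simple case-insensitive substring match are assumptions made for illustration.

```python
import re


def paragraphs(text):
    """Split text into paragraphs on blank lines (an assumed convention)."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def same_paragraph(body, phrases):
    """True if at least one paragraph contains every phrase.

    Matching is a case-insensitive substring test, not tokenized search.
    """
    wanted = [p.lower() for p in phrases]
    return any(all(w in para.lower() for w in wanted) for para in paragraphs(body))


if __name__ == "__main__":
    phrases = ["angela merkel", "us elections"]
    together = "Angela Merkel commented on the US elections."
    apart = "Angela Merkel spoke in Berlin.\n\nThe US elections dominated headlines."
    print(same_paragraph(together, phrases))
    print(same_paragraph(apart, phrases))
```

The second document mentions both phrases but in different paragraphs, so a paragraph-scoped rule would not tag it, whereas a rule over the whole body would.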
- 10. EXTRA Source Code
• The core classification engine
– CQL parsers, CQL-to-ES mapper, rule schema dict classes, DAO classes, etc.
https://github.com/iptc/extra-core
• EXTRA “extra” code
– API, UI, docker files for deployment
https://github.com/iptc/extra-ext
• Open source
– MIT license for EXTRA-specific code
– Apache license for Elasticsearch
- 11. EXTRA Timetable
• The first phase of the EXTRA project is due to complete in Summer 2017
• You can access the source code now
– Feedback welcome
• Will there be a second phase? TBD…
• Join the (low frequency) email list to stay up-to-date
https://groups.yahoo.com/neo/groups/iptc-extra/info
- 12. News Metadata Summit
• Proposal: dedicate part of our next face-to-face meeting
to descriptive news metadata
• Gather academics, vendors, linguists, product owners
• Discuss use cases, techniques, technologies
– “Face off” between machine learning, deep learning, rules…
• Demo final version of EXTRA
• Let me know if you’re interested in participating
- 13. Date and Place of Next Meeting
Barcelona 6th – 8th November 2017
https://flic.kr/p/kAXGfC
Thanks and goodbye!!