Presentation for the Symposium: Building the Biodiversity Knowledge Graph for Insects – Components, Progress, and Challenges; 2016 XXV International Congress of Entomology, Orlando, FL – September 26, 2016 (#ICE2016). See https://esa.confex.com/esa/ice2016/meetingapp.cgi/Session/24482
1. Addressing the name:meaning drift
challenge in open biodiversity
information environments
Please
@taxonbytes
Nico M. Franz1, Salvatore A. Anzaldo1, Edward E. Gilbert1,
M. Andrew Jansen1, M. Andrew Johnston1 & Bertram Ludäscher2
1 School of Life Sciences, Arizona State University
2 iSchool, University of Illinois at Urbana-Champaign
Symposium: Building the Biodiversity Knowledge Graph for Insects – Components, Progress, and Challenges
2016 XXV International Congress of Entomology, Orlando, FL – September 26, 2016 (#ICE2016)
Presentation available @ SlideShare: http://tinyurl.com/franz-et-al-ice-2016
4. Our biodiversity informatics research program, summarized
• We are no longer just putting articles and monographs on library shelves.
• This is more than 'just technology'; we must develop new systematic theory
to deal with inherently dynamic, open data systems.
• The concept taxonomy approach has practical implications for strengthening
the roles that individual experts play in big biodiversity data environments.
5. Products – concept taxonomy in theory and in practice
ZooKeys. doi:10.3897/zookeys.528.6001
Semantic Web. doi:10.3233/SW-160220
Biological Theory (in review). doi:10.1101/022145
PloS ONE. doi:10.1371/journal.pone.0118247
Systematics Biodiv. doi:10.1080/14772000.2013.806371
Systematic Biology. doi:10.1093/sysbio/syw023
Biodiversity Data Journal (in review). #6093
Research Ideas and Outcomes (in review). #6302
6. Premise: We're lucky that insect revisions are not so frequent
"In biology, there are many taxa that are so under-studied that
they are only known from their original description and
none or very few subsequent references […].
The name alone, so long as it is a unique name,
is sufficient to locate all related material."
– David Remsen 2016: 213
Source: Remsen. 2016. The use and limits of scientific names […]. ZooKeys 550: 207–223. doi:10.3897/zookeys.550.9546
11. Snapshot of a more frequently revised organismal lineage
Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)
• 9 schemata for the NA Cleistes/Cleistesiopsis complex (orchids)
• Vertical sections identify taxonomic concept regions
• Colors identify lineages of taxonomic names (epithets) in use
• There is no consensus! Five incongruent schemata are used concurrently
12. Premise:
If incongruent taxonomies are endorsed
– locally, provisionally, and democratically –
then what is the impact for
aggregated biodiversity data?
15. Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)
The 'consensus' The 'bible'
"Controlling the taxonomic variable"
• Query: "Where do these orchid
species occur?"
• Same set of 250 orchid specimens,
according to 4 taxonomies.
Example: the Cleistes use case
19. Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)
The 'consensus' The 'bible'
The (formerly)
federal 'standard'
The 'best', latest
regional flora
"Controlling the taxonomic variable"
Expert views
are in conflict
"Just bad"
20. Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)
The 'consensus' The 'bible'
The (formerly)
federal 'standard'
The 'best', latest
regional flora
Impact:
Name-based aggregation has created
a novel synthesis that nobody believes in
"Controlling the taxonomic variable"
"Just bad"
21. Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)
The 'consensus' The 'bible'
The (formerly)
federal 'standard'
The 'best', latest
regional flora
"Controlling the taxonomic variable"
"Just bad"
Expert views
are in conflict
Solution:
Instead of aggregating
an artificial 'consensus',
…
22. Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)
The 'consensus' The 'bible'
The (formerly)
federal 'standard'
The 'best', latest
regional flora
"Controlling the taxonomic variable"
"Just bad"
Expert views
are reconciled
Solution:
Instead of aggregating
an artificial 'consensus',
build translation services
23. Challenges:
How can we redesign aggregation to yield
high-quality biodiversity data packages?
24. Challenges:
How can we redesign aggregation to yield
high-quality biodiversity data packages?
What does this mean for Darwin Core1
and how we use this aggregation standard?
1 Wieczorek et al. 2012. Darwin Core: an evolving […]. PLoS ONE 7(1): e29715. doi:10.1371/journal.pone.0029715
25. Preview of solution with 8 steps
• DwC is insufficient, and part of the problem
Step 7:
26. # 1: Represent only taxonomic concept labels (TCLs) 1
• Syntax (TCL): taxonomic name [author, year, page] sec. source
1 Multi-taxonomy input/alignment visualizations generated with Euler/X toolkit: https://github.com/EulerProject/EulerX
Cleistes divaricata
sec. Gregg & Catling 1993
Pogonia
sec. Brown & Wunderlin 1997
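The TCL syntax above can be sketched in a few lines of Python. This is only an illustration of the "name sec. source" pattern; the `TCL` class and `parse_tcl` helper are hypothetical names and are not part of Euler/X or Darwin Core.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TCL:
    """A taxonomic concept label: a name plus the source that fixes its meaning."""
    name: str    # taxonomic name, optionally with author, year, page
    source: str  # the 'sec.' (according to) reference

    def __str__(self) -> str:
        return f"{self.name} sec. {self.source}"


def parse_tcl(label: str) -> TCL:
    """Split 'name sec. source' at the first ' sec. ' separator."""
    name, sep, source = label.partition(" sec. ")
    if not sep or not name.strip() or not source.strip():
        raise ValueError(f"not a taxonomic concept label: {label!r}")
    return TCL(name.strip(), source.strip())


a = parse_tcl("Cleistes divaricata sec. Gregg & Catling 1993")
b = parse_tcl("Pogonia sec. Brown & Wunderlin 1997")
print(a.name, "|", a.source)
print(b)
```

A bare name such as "Cleistes divaricata" is rejected here on purpose: without the "sec." part it is a name, not a concept label.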
27. # 1: DwC score keeping – TCLs are optional; < 1% realized?
• TCL ~ DwC: nameAccordingTo
• SCAN: 19,722 of nearly 9 million records have TCLs (0.2%)
• Because TCL use is not enforced, the standard is less ready for big data
DwC record with nameAccordingTo (TCL)
(BDJ)
"Who authors GBIF's Backbone?"
https://storify.com/taxonbytes/who-authors-gbif-s-backbone
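As a rough sketch of how such a score could be kept, the snippet below checks whether Darwin Core-style records carry both `scientificName` and `nameAccordingTo`. Those two keys are Darwin Core terms; the helper names and the four demo records are made up for illustration and do not come from SCAN.

```python
def has_tcl(record: dict) -> bool:
    """True if the record states both a name and the source it follows."""
    name = record.get("scientificName", "").strip()
    source = record.get("nameAccordingTo", "").strip()
    return bool(name) and bool(source)


def tcl_coverage(records: list[dict]) -> float:
    """Fraction of records that carry a full taxonomic concept label."""
    if not records:
        return 0.0
    return sum(has_tcl(r) for r in records) / len(records)


# Hypothetical records; only the first one states its source.
records = [
    {"catalogNumber": "demo-1", "scientificName": "Cleistes divaricata",
     "nameAccordingTo": "Smith et al. 2004"},
    {"catalogNumber": "demo-2", "scientificName": "Cleistes divaricata"},
    {"catalogNumber": "demo-3", "scientificName": "Cleistes bifaria",
     "nameAccordingTo": ""},
    {"catalogNumber": "demo-4", "scientificName": "Pogonia"},
]
print(f"{tcl_coverage(records):.0%} of demo records carry a TCL")
```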
28. # 2: Represent each source coherently (Parent-Child relationships)
• Syntax (PC): TCL1 is a child/parent of TCL2 [where TCL1/2 = same source]
Cleistesiopsis bifaria sec. Pans. & de Barr. 2008
is a child of
Cleistesiopsis sec. Pans. & de Barr. 2008
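A minimal sketch of the PC rule, assuming a small hypothetical `Taxonomy` class: parent and child must carry the same "sec." source, otherwise the relation is refused. There is no cycle check; this is an illustration, not the Euler/X data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TCL:
    name: str
    source: str


class Taxonomy:
    """Parent-child (PC) relations among TCLs that all share one source."""

    def __init__(self, source: str):
        self.source = source
        self.parent_of: dict[TCL, TCL] = {}

    def add_child(self, child: TCL, parent: TCL) -> None:
        # PC relations are only asserted within a single source.
        if child.source != self.source or parent.source != self.source:
            raise ValueError("child and parent must both be sec. " + self.source)
        self.parent_of[child] = parent

    def lineage(self, tcl: TCL) -> list[TCL]:
        """Walk from a TCL up to the root of this source's hierarchy."""
        path = [tcl]
        while path[-1] in self.parent_of:
            path.append(self.parent_of[path[-1]])
        return path


src = "Pans. & de Barr. 2008"
genus = TCL("Cleistesiopsis", src)
species = TCL("Cleistesiopsis bifaria", src)
taxonomy = Taxonomy(src)
taxonomy.add_child(species, genus)
print([t.name for t in taxonomy.lineage(species)])
```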
29. # 2: DwC score keeping – Not (adequately) represented
• PC ~ DwC: genus, family, order (etc.; higherClassification)
• However, higher-level names in DwC are not modeled as TCLs
• Taxonomic coherence of sources cannot be preserved with DwC alone
DwC record with higherClassification
(BDJ)
30. # 3: Do not force a single hierarchy onto all tip-level TCLs
• Syntax (PC): Tip-level TCL1 , TCL2 , etc. [where TCL1/2 = different sources]
31. # 3: DwC score keeping – Optional; not (ever?) practiced
• No PC ~ DwC: infra-/specificEpithet only
• Typically, a single, 'unitary' higher-level classification is represented
• Combinations of algorithmic and social practices achieve the single hierarchy
"Who authors GBIF's Backbone?"
https://storify.com/taxonbytes/who-authors-gbif-s-backbone
32. # 4: Link TCLs via expert-provided RCC–5 articulations
• Syntax (RCC–5): TCL1 {==, >, <, ><, !} TCL2 [where TCL1/2 = diff. sources]
• RCC–5 = Region Connection Calculus
• 14 articulations provided by: http://tinyurl.com/Weakley-Flora-2015
Cleistes bifaria "Coastal Populations" sec. Smith et al. 2004
== (is congruent with)
Cleistesiopsis oricamporum sec. Brown & Pans. 2009
==
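The articulation syntax can be sketched as a small parser. The `Articulation` tuple and `parse_articulation` function are hypothetical helpers for this illustration; the only things taken from the slide are the five symbols and the rule that the two TCLs come from different sources.

```python
import re
from typing import NamedTuple

# Reading an articulation right-to-left swaps '<' and '>'; the rest are symmetric.
INVERSE = {"==": "==", ">": "<", "<": ">", "><": "><", "!": "!"}


class Articulation(NamedTuple):
    left: str      # TCL string, "name sec. source"
    relation: str  # one of ==, >, <, ><, !
    right: str

    def inverse(self) -> "Articulation":
        return Articulation(self.right, INVERSE[self.relation], self.left)


# '><' is listed before '<' and '>' so that overlap is not split in two.
_PATTERN = re.compile(r"(.+?) (==|><|<|>|!) (.+)")


def parse_articulation(text: str) -> Articulation:
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"no RCC-5 symbol found in: {text!r}")
    left, relation, right = match.groups()
    if " sec. " not in left or " sec. " not in right:
        raise ValueError("both sides must be TCLs (name sec. source)")
    if left.split(" sec. ", 1)[1] == right.split(" sec. ", 1)[1]:
        raise ValueError("articulations link TCLs from different sources")
    return Articulation(left, relation, right)


art = parse_articulation(
    'Cleistes bifaria "Coastal Populations" sec. Smith et al. 2004'
    " == Cleistesiopsis oricamporum sec. Brown & Pans. 2009"
)
print(art.relation, "|", art.inverse().left)
```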
34. Source: Thau, D.M. 2010. Reasoning about taxonomies. Thesis, UC Davis. http://gradworks.proquest.com/3422778.pdf
Region Connection Calculus (semantics: set constraints)
== < > >< !
• Two regions N, M are either:
• congruent (N == M)
• properly inclusive (N < M)
• inversely properly inclusive (N > M)
• overlapping (N >< M)
• exclusive of each other (N ! M)
• RCC–5 articulations answer the query: "can we join regions N and M?"
• Taxonomies have multiple RCC–5 alignable components: nodes (parents,
children), node-associated traits, even node-anchoring specimens
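Read as set constraints, the five relations can be computed directly with Python sets. This sketch assumes that regions are non-empty sets and that N < M means "N is properly contained in M"; the function name `rcc5` is chosen here for illustration.

```python
def rcc5(n: set, m: set) -> str:
    """Return the RCC-5 relation that holds between two non-empty regions."""
    if not n or not m:
        raise ValueError("RCC-5 regions must be non-empty")
    if n == m:
        return "=="   # congruent
    if n < m:
        return "<"    # n is a proper subset of m
    if n > m:
        return ">"    # n is a proper superset of m
    if n & m:
        return "><"   # partial overlap
    return "!"        # disjoint


for n, m in [({1, 2}, {1, 2}), ({1}, {1, 2}), ({1, 2}, {2}),
             ({1, 2}, {2, 3}), ({1}, {2})]:
    print(sorted(n), rcc5(n, m), sorted(m))
```

Exactly one of the five relations holds for any pair of non-empty regions, which is why a single symbol is enough to answer whether, and how, two regions can be joined.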
36. Oscillating meanings of the epithet hyalites – 1911 to 2003
Phenotypic diversity
Type-anchored name identity relations
Source: Vane-Wright. 2003. Indifferent philosophy versus […]. Syst. Biodiv. 1: 3–11. doi:10.1017/S1477200003001063
37. # 5: Identify occurrence records only to TCLs
Records: EKY39235, MTSU003611, NCSC00040204, …
Records: BOON8098, CLEMS0061133, WILLI39399, …
Records: GMUF-0039355, IBE006808, USCH58399, …
Records: CONV0006268, MDKY00006482, NCU00038930, …
Records: BRYV0023582, BRYV0023584, KHD00032030, MISS0016604, MMNS000227, NCSC00040206, USMS_000002923, USMS_000002924, VSC0053223, VSC0065528, …
Records: ARIZ393087, DBG39049, USCH51217, …
Records: NCU00040710, USCH96248, VSC0053218, …
Records: CLEMS0012881, FUGR0003293, GA023130, …
Records: BOON8100, NCSC00040210, SJNM45487, …
Records: GA023144, LSU00012494, MISS0016608, …
Records: IBE006810, IND-0012374, MMNS000227
Records: NY8654
• Syntax (ID): Occurrence / organism is identified to TCL
"CLEMS0012881"
is identified to
Cleistes divaricata sec. Smith et al. 2004
[additional ID metadata]
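One way to picture the ID syntax is a record type that refuses identifications without a "sec." source, plus a grouping step that mirrors the record boxes above. The `Identification` class and `group_by_tcl` function are hypothetical; only the CLEMS0012881 identification is taken from the slide, the other rows are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Identification:
    occurrence_id: str   # e.g. a specimen catalog number
    name: str            # taxonomic name as applied to the specimen
    according_to: str    # the 'sec.' source that fixes the name's meaning

    def __post_init__(self):
        if not self.according_to.strip():
            raise ValueError(f"{self.occurrence_id}: missing 'sec.' source")

    @property
    def tcl(self) -> str:
        return f"{self.name} sec. {self.according_to}"


def group_by_tcl(identifications):
    """Map each TCL to the occurrence ids identified to it."""
    groups = defaultdict(list)
    for ident in identifications:
        groups[ident.tcl].append(ident.occurrence_id)
    return dict(groups)


identifications = [
    # From the slide: this specimen is identified to a TCL sec. Smith et al. 2004.
    Identification("CLEMS0012881", "Cleistes divaricata", "Smith et al. 2004"),
    # Hypothetical rows, for illustration only.
    Identification("demo-1", "Cleistes divaricata", "Smith et al. 2004"),
    Identification("demo-2", "Cleistes divaricata", "Another Source 1900"),
]
for tcl, ids in group_by_tcl(identifications).items():
    print(tcl, "->", ids)
```

The same name ends up in two groups because the two sources may circumscribe it differently; a name-only index would have merged them.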
38. # 5: DwC score keeping – ID metadata optional; > 50% realized
• ID ~ DwC: Identification, (date)identified(By), identificationReferences
• SCAN: 4,715,277 of nearly 9 million records have ID metadata (52.5%)
• Enforcement… would still also require the use of TCLs
DwC record with Identification metadata
(BDJ)
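As a quick cross-check of the two SCAN figures quoted in this talk, the record total implied by "4,715,277 records = 52.5%" can be used to recompute the TCL share reported under # 1 (19,722 records). The variable names are ad hoc; the three input numbers are the ones on the slides.

```python
id_records = 4_715_277   # records with identification metadata (slide, # 5)
id_share = 0.525         # 52.5% (slide, # 5)
tcl_records = 19_722     # records with TCLs (slide, # 1)

implied_total = id_records / id_share
tcl_share = tcl_records / implied_total

print(f"implied total records: {implied_total:,.0f}")
print(f"TCL share: {tcl_share:.2%}")
```

The implied total is just under 9 million, consistent with "nearly 9 million" on both slides, and the TCL share rounds to the 0.2% quoted under # 1.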
39. # 6: Generate comprehensive, consistent RCC–5 alignments
• Euler/X is a toolkit that infers logically consistent RCC–5 alignments
40. # 6: Generate comprehensive, consistent RCC–5 alignments
• Value added: MIR – the set of Maximally Informative Relations, containing
the RCC–5 articulation for every possible TCL pair; scalability
Reasoner inference
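To show what "an articulation for every possible TCL pair" looks like, the toy below computes one RCC-5 relation per cross-taxonomy pair from invented specimen sets. This is an extensional stand-in only: Euler/X infers the relations by logic reasoning over the input articulations and taxonomies, not by comparing specimen sets, and the concept labels and memberships here are hypothetical.

```python
from itertools import product


def rcc5(n: frozenset, m: frozenset) -> str:
    """RCC-5 relation between two non-empty sets."""
    if n == m:
        return "=="
    if n < m:
        return "<"
    if n > m:
        return ">"
    return "><" if n & m else "!"


def mir(tax_a: dict, tax_b: dict) -> dict:
    """One relation for every (concept in A, concept in B) pair."""
    return {(a, b): rcc5(tax_a[a], tax_b[b]) for a, b in product(tax_a, tax_b)}


# Hypothetical concepts, each described by the specimens it contains.
t1 = {
    "X sec. T1": frozenset({1, 2, 3, 4}),
    "X.a sec. T1": frozenset({1, 2}),
    "X.b sec. T1": frozenset({3, 4}),
}
t2 = {
    "Y sec. T2": frozenset({1, 2, 3, 4}),
    "Y.a sec. T2": frozenset({1}),
    "Y.b sec. T2": frozenset({2, 3}),
    "Y.c sec. T2": frozenset({4}),
}

relations = mir(t1, t2)
for (a, b), rel in relations.items():
    print(f"{a} {rel} {b}")
```

With 3 and 4 concepts the table has 12 entries; the pair count grows with the product of the two taxonomy sizes, which is where scalability becomes a concern.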
42. Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)
The 'consensus' The 'bible'
The (formerly)
federal 'standard'
The 'best', latest
regional flora
"Controlling the taxonomic variable"
Impact:
"Please select your preference (A – D);
we can perform all translations"
45. # 8: "Do you trust us now?" Aggregation as a translational service
• We can now respond to queries such as:
• "Show all specimens identified to the taxonomic name Cleistes divaricata"
• Returns many records; resolves an incongruent lineage of name usages
• "Now show specimens with the TCL Cleistesiopsis divaricata sec. Weakley 2015"
• Returns record subset resolving only one narrowly circumscribed concept
• "Now show specimens identified to the TCL Cleistes divaricata sec. RAB 1968,
yet translated into the more granular TCLs sec. Weakley 2015"
• Returns (again) many records, yet represents and contrasts two treatments,
as opposed to providing the ambiguous lineage view (above)
• "Show all specimens with ambiguous 2010/2015 TCL identifications…" (etc.)
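The three kinds of query can be contrasted on a toy data set. Everything here is hypothetical: placeholder names ("Aus bus"), placeholder sources ("Old 1968", "New 2015"), and invented articulations stand in for the real Cleistes data, and the function names are illustrative only.

```python
# (occurrence id, name, source) triples; all invented for this sketch.
IDENTIFICATIONS = [
    ("occ-1", "Aus bus", "Old 1968"),
    ("occ-2", "Aus bus", "Old 1968"),
    ("occ-3", "Aus bus", "New 2015"),
    ("occ-4", "Aus cus", "New 2015"),
]

# Expert articulations from the older into the newer treatment (invented).
ARTICULATIONS = {
    "Aus bus sec. Old 1968": [(">", "Aus bus sec. New 2015"),
                              (">", "Aus cus sec. New 2015")],
}


def by_name(name):
    """Name-only query: mixes all sources that use the name."""
    return [occ for occ, n, _ in IDENTIFICATIONS if n == name]


def by_tcl(tcl):
    """Concept query: only records identified to this exact TCL."""
    return [occ for occ, n, s in IDENTIFICATIONS if f"{n} sec. {s}" == tcl]


def translate(tcl):
    """Sort target concepts into safe relabelings and ambiguous candidates."""
    safe, ambiguous = [], []
    for relation, target in ARTICULATIONS.get(tcl, []):
        if relation in ("==", "<"):
            safe.append(target)        # source concept fits inside the target
        elif relation in (">", "><"):
            ambiguous.append(target)   # a record may or may not belong here
        # '!' rules the target out entirely
    return {"records": by_tcl(tcl), "safe": safe, "ambiguous": ambiguous}


print(by_name("Aus bus"))
print(by_tcl("Aus bus sec. New 2015"))
print(translate("Aus bus sec. Old 1968"))
```

In this toy, the older concept properly includes two newer ones, so its records cannot be relabeled automatically; they are reported as ambiguous and would need expert re-identification.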
48. Conclusions – designing trusted biodiversity data services
• The Darwin Core standard for aggregating biodiversity data:
(1) Has under-utilized options for better representing taxonomic expertise
(2) Is part of a design paradigm that undermines the plurality of expertise
• We are developing new solutions – including TCLs, PC relations, RCC–5, and
scalable logic applications – that realize data aggregation via translational
services, without disrupting the formation of expert-licensed, high-quality
biodiversity data packages
• All of us – not just aggregators – "own" the responsibility of designing
systems where the plurality of taxonomic expertise is fairly accommodated
49. Acknowledgments & links to products
• Cleistes use case: Alan Weakley (UNC)
• Euler/X toolkit: Shizhuo Yu (UC Davis)
• Data trajectories: Beckett Sterner (ASU)
• OBKMS design: Viktor Senderov (Pensoft)
• NSF DEB–1155984, DBI–1342595 (PI Franz)
• NSF IIS–118088, DBI–1147273 (PI Ludäscher)
• Euler/X code @ https://github.com/EulerProject/EulerX
• Franz et al. 2016. Two influential primate classifications logically aligned.
Systematic Biology 65(4): 561–582. doi:10.1093/sysbio/syw023
The more one looks, the more complicated it gets. Notice also the node labeling, or lack thereof.
The simple semantics of RCC-5 makes this a rather generic vocabulary for representing advancement in phylogenetic knowledge. At the same time, the onus is on the phylogeneticists to apply the articulations in such ways that the desired query services are actually obtained.