1. EAC‐CPF and Social Networks
Society of American Archivists
Chicago
August 2011
Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia
2. SNAC Overview
• Funding and Timeline
• Project Team
• Project Objectives and Rationale
• Data Contributing Institutions
• Archival Standards Employed
• Methods, Processing, and Products
• Year One Extraction Results
• Basic Observations on Extraction
4. Project Team
• Daniel Pitti (PI) and Worthy Martin (Institute for
Advanced Technology in the Humanities,
University of Virginia)
• Adrian Turner and Brian Tingle (California Digital
Library, University of California)
• Ray Larson (School of Information, University of
California, Berkeley)
5. Project Objectives
• Archival finding aids currently intermix description of records
with description of the creators of records and persons evident
in the records
• Further the ongoing process of transforming archival description
using advanced technologies
• By facilitating the separation of the description of people from
the description of records
• Using EAC-CPF, an international archival authority control
standard
• Goal: enhance the economy and effectiveness of archival
description to enhance access and understanding of users of
archives, libraries, and museums
6. Rationale for Separation
• Authority control of forms of names
• Flexible description
• Cooperative authority control
• Integrated access to cultural heritage
• Biographical/historical resource
• Social/historical context (social‐professional
networks)
7. The Data
• EAD‐encoded finding aids
– Library of Congress (1,159)
– Online Archive of California (~15,400)
– Northwest Digital Archive (5,160)
– Virginia Heritage (8,390)
• Authority records
– Library of Congress: NACO/LCNAF (3.8M personal names; 900K
corporate names)
– Getty Vocabulary Program: Union List of Artist Names (293K
personal and corporate names)
– Virtual International Authority File (5M+ personal names)
8. Methods and Processing
• Extract EAC-CPF records from existing EAD-encoded archival
descriptions
– Extracting both creators and referenced CPF names
• Match EAC-CPF records against one another and against existing
authority records (ULAN, VIAF, LCNAF); merge records for the
same entity
– Enhance EAC-CPF by normalizing entries, adding alternative entries,
titles (VIAF), and historical data (ULAN)
– Key challenge: two or more people with the same name; two or more
names for the same person
• Create a prototype historical resource and access system
– Historical data and social‐professional networks
– Links to archive, library, and museum resources (by and about)
9. EAD Source Data
• Encoded Archival Description
– Intermixes description of creators of records and, at the discretion of the archivists,
names associated with the content of the records
– Detailed description of creators of records
• Widely varying quality
– In the number of names identified and encoded
– In the formation of the names (direct or inverted, capitalization, punctuation, and so
on)
– In the categorization of names (personal, corporate, or family)
• Many names given but not identified as such
• Most important of these in biographies/histories and in correspondence
description
• Extraction has focused on the “low-hanging fruit,” that is, the names tagged as
names
• Attention is shifting to names not identified as such
10. Archival Records
• Records are the by‐products of people living and working as
individuals, in organized groups, in families
• Records document people living and working
• People exist in social-professional contexts, in relation to others
• Records document these relations
• All records created by the same entity are described together (a
fonds or collection)
– Creators documented in detail
– Many of the people documented in the record referenced in
description
• Archival descriptions document interrelations among people
and records (documents)
11. Source: J. Robert Oppenheimer Papers (LoC)
<origination>
<persname source="lcnaf">Oppenheimer, J. Robert, 1904-1967</persname>
</origination>
<controlaccess>
<persname source="lcnaf" encodinganalog="100" role="creator">Oppenheimer, J.
Robert, 1904-1967</persname>
<persname source="lcnaf" encodinganalog="600" role="subject">Bethe, Hans
Albrecht, 1906- --Correspondence</persname> <!-- […] -->
<persname source="lcnaf" encodinganalog="600" role="subject">Born, Max,
1882-1970 --Correspondence</persname>
<persname source="lcnaf" encodinganalog="600" role="subject">Boyd, Julian P.
(Julian Parks), 1903- --Correspondence</persname>
<persname source="lcnaf" encodinganalog="600" role="subject">Bush, Vannevar,
1890-1974 --Correspondence</persname>
<persname source="lcnaf" encodinganalog="600" role="subject">Casals, Pablo,
1876-1973 --Correspondence</persname> <!-- […] -->
<corpname source="lcnaf" encodinganalog="610" role="subject">Institute for
Advanced Study (Princeton, N.J.)</corpname>
<corpname source="lcnaf" encodinganalog="610" role="subject">Los Alamos
Scientific Laboratory</corpname> <!-- […] -->
</controlaccess>
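As a rough illustration of the "low-hanging fruit" extraction (names already tagged as names), the sketch below pulls creator and referenced names out of a shortened copy of the sample above. It is not the project's extraction code: the function name `extract_names`, the output dictionary keys, and the rule of cutting subdivisions at " --" are assumptions made for the example, and the snippet omits the EAD namespace that real files may declare.

```python
import xml.etree.ElementTree as ET

# Shortened, namespace-free version of the sample finding aid (assumption:
# wrapped in a single <archdesc> root so it parses as one document).
EAD_SNIPPET = """
<archdesc>
  <origination>
    <persname source="lcnaf">Oppenheimer, J. Robert, 1904-1967</persname>
  </origination>
  <controlaccess>
    <persname source="lcnaf" encodinganalog="600" role="subject">Bush, Vannevar,
      1890-1974 --Correspondence</persname>
    <corpname source="lcnaf" encodinganalog="610" role="subject">Los Alamos
      Scientific Laboratory</corpname>
  </controlaccess>
</archdesc>
"""

# EAD name elements mapped to EAC-CPF entity types.
ENTITY_TYPES = {"persname": "person", "corpname": "corporateBody", "famname": "family"}


def extract_names(ead_xml):
    """Return tagged names found directly under <origination> and <controlaccess>."""
    root = ET.fromstring(ead_xml)
    found = []
    for section, relation in (("origination", "creator"), ("controlaccess", "referenced")):
        for container in root.iter(section):
            for child in container:
                if child.tag not in ENTITY_TYPES:
                    continue
                # Collapse line breaks and runs of spaces inside the element text.
                text = " ".join((child.text or "").split())
                # Drop subject subdivisions such as " --Correspondence".
                name = text.split(" --")[0].strip()
                found.append({
                    "name": name,
                    "entityType": ENTITY_TYPES[child.tag],
                    "relation": relation,
                })
    return found


for entry in extract_names(EAD_SNIPPET):
    print(entry)
```

Names that appear only in prose (biographies, correspondence descriptions) are not reached by this kind of tag-based pass.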
15. EAC‐CPF
• Encoded Archival Context‐Corporate bodies, Persons,
Families
• An international communication standard for archival
authority control
• Based on International Council for Archives, International
Standard Archival Authority Records-Corporate bodies,
persons, families (ISAAR(CPF))
• SAA Standards Committee, Technical Subcommittee on
Encoded Archival Context
• Co-chairs
– Katherine Wisser, Simmons College
– Anila Angjeli, Bibliothèque nationale de France
16. Library and Archive Authority Control
• Library (or bibliographic) authority control is almost
exclusively about the control of names
• Archival authority control involves biographical‐historical
description of the CPF entity
– Descriptions based on controlled vocabularies or values, for
example, occupations, place of birth and death
– But also biographical-historical description
• Prose
• Chronological list
• Archival authority control provides context for
understanding records, the context of their creation, the
provenance
17. <identity>
<entityType>person</entityType>
<nameEntry scriptCode="Latn" xml:lang="eng">
<part>Oppenheimer, J. Robert, 1904-1967.</part>
<authorizedForm>AACR2</authorizedForm>
</nameEntry>
<nameEntry localType="VIAF:MainHeading">
<part>Oppenheimer, J. Robert (Julius Robert), 1904-1967</part>
<alternativeForm>VIAF</alternativeForm>
</nameEntry>
<nameEntry localType="VIAF:MainHeading">
<part>Oppenheimer, Julius Robert, 1904-1967</part>
<alternativeForm>VIAF</alternativeForm>
</nameEntry>
<nameEntry localType="VIAF:x400">
<part>Oppenheimer, Robert</part>
<alternativeForm>VIAF</alternativeForm>
</nameEntry>
<nameEntry localType="VIAF:x400">
<part>Ou-pẽn-hai-mo, 1904-1967</part>
<alternativeForm>VIAF</alternativeForm>
</nameEntry>
</identity>
18. <existDates>
<dateRange>
<fromDate standardDate="1904-04-22">1904, Apr. 22</fromDate>
<toDate standardDate="1967-02-18">1967, Feb. 18</toDate>
</dateRange>
</existDates>
<!-- ... -->
<localDescription localType="subject">
<term>Science--Societies, etc.</term>
</localDescription>
<localDescription localType="VIAF:nationality">
<placeEntry countryCode="US"/>
</localDescription>
<localDescription localType="VIAF:gender">
<term>Male</term>
</localDescription>
<languageUsed>
<language languageCode="eng"/>
</languageUsed>
<occupation>
<term>Physicists.</term>
</occupation>
<!-- ... -->
19. <chronList>
<chronItem>
<date>1904, Apr. 22</date>
<placeEntry>New York, N.Y.</placeEntry>
<event>Born, New York, N.Y.</event>
</chronItem> <!-- ... -->
<chronItem>
<date>1943-1945</date>
<placeEntry>Los Alamos, N. Mex.</placeEntry>
<event>Director, Los Alamos Scientific Laboratory, Los Alamos, N. Mex.</event>
</chronItem> <!-- ... -->
<chronItem>
<date>1954</date>
<event>(1) Denied security clearance […] (2) Published Science and the
Common Understanding […]
</event>
</chronItem> <!-- ... -->
<chronItem>
<date>1967, Feb. 18</date>
<placeEntry>Princeton, N.J.</placeEntry>
<event>Died, Princeton, N.J.</event>
</chronItem>
</chronList>
20. <cpfRelation xmlns:xlink="http://www.w3.org/1999/xlink"
xlink:type="simple"
xlink:role="http://RDVocab.info/uri/schema/FRBRentitiesRDA/Person"
xlink:arcrole="correspondedWith">
<relationEntry>Bush, Vannevar, 1890-1974.</relationEntry>
<descriptiveNote>
<p>recordId: DLC.ms998007.r007</p>
</descriptiveNote>
</cpfRelation>
21. <resourceRelation xmlns:xlink="http://www.w3.org/1999/xlink" xlink:arcrole="creatorOf"
xlink:role="archivalRecords" xlink:type="simple"
xlink:href="http://hdl.loc.gov/loc.mss/eadmss.ms998007">
<relationEntry>J. Robert Oppenheimer Papers, 1799-1980 (bulk 1947-1967)</relationEntry>
<objectXMLWrap>
<did xmlns="urn:isbn:1-931666-22-9">
<unittitle>Papers <unitdate normal="1799/1980" era="ce" calendar="gregorian">1799-1980
</unitdate><unitdate label="Bulk Dates" type="bulk" normal="1947/1967"
era="ce" calendar="gregorian">(bulk 1947-1967)</unitdate></unittitle>
<unitid countrycode="US" repositorycode="US-DLC">MSS35188</unitid>
<origination label="Creator">
<persname>Oppenheimer, J. Robert, 1904-1967</persname>
</origination> <!-- ... -->
<repository><corpname>Manuscript Division. Library of Congress</corpname>
</repository>
<abstract>Physicist and director
of the Institute for Advanced Study, Princeton, New Jersey. [...] Topics include theoretical
physics, development of the atomic bomb, the relationship between government and
science, nuclear energy, security, and national loyalty. </abstract>
</did>
</objectXMLWrap>
</resourceRelation>
23. Early Observations - Extraction
• Depth of analysis and quality of description of
CPF entities vary widely in EAD-encoded finding
aids
– LoC: many names under authority control
– OAC and NWDA have fewer names, and control varies
• To be fair, the finding aids were created without
SNAC processing in mind!
24. Next on Extraction
• Refine extraction processing, incorporating some
NLP-like processing, for example
– Verifying type of name: C or P or F
– Massaging poorly formed names into better formed
names
– Identifying names in strings that are names-plus (but
name not identified as such)
– Provide context information to enhance matching, for
example, date or dates of correspondence, or
occupation of creator of records
25. Beyond the Project
• Building a National Archival Authorities
Infrastructure
– IMLS funded two-year project, October 2011-September
2013
– EAC-CPF SAA workshops: 140 scholarships
– National Archival Authorities Cooperative planning
• SNAC II: a proposal to expand SNAC
– A lot more data
– NARA, SI, MARC WorldCat records, a lot more finding aids
27. Social Networks and Archival Context Project: Matching and Merging EAC-CPF Records
Ray R. Larson
Krishna Janakiraman
University of California, Berkeley
School of Information
Thanks to Daniel V. Pitti of the Institute for Advanced Technology in the Humanities, University of
Virginia, for many of the slides here
SAA 2011 - Chicago
2011-08-27 - SLIDE
28. SNAC Project
• The outlines of the project have been
discussed by Daniel Pitti previously
• The primary focus of the Berkeley group for
the project is on combining data resources
from multiple archives and other information
sources
• In this talk I will focus on our current
methods used in the prototype (to be
described by Brian Tingle later)
29. Data Contributing Institutions
• EAD-encoded finding aids
– Library of Congress (1159)
– Online Archive of California (15,400+)
– Northwest Digital Archive (5,563+)
– Virginia Heritage (8,390+)
• Authority records
– Library of Congress: NACO/LCNAF (3.8M personal
names; 900K corporate names)
– Getty Vocabulary Program: Union List of Artist Names
(293K personal and corporate names)
– Virtual International Authority File (intersection with
NACO/LCNAF, 5M personal names)
• Other biographical sources (e.g., DBPedia, IMDB)
30. Methods and Processing
• Extract EAC-CPF records from existing EAD-
encoded archival descriptions
– Extracting both creators and referenced CPF names
• Match EAC-CPF records against one another
and against existing authority records (ULAN,
VIAF, LCNAF)
– Enhance EAC-CPF by normalizing entries, adding
alternative entries, titles (VIAF), and historical data
(ULAN)
• Create a prototype historical resource and access
system
– Historical data and social-professional networks
– Links to archive, library, and museum resources (by and
about)
31. Merging EAC-CPF Records
[Diagram: LCNAF Repository and ULAN Repository; steps: connect records exactly using name matching, Cheshire search of authority records, merge record information]
32. Authority Control
• Identifying creator entities and referenced
entities (correspondents, etc.)
• Recording name or names used by and for
them
• Rule-based heading or entry formation and
control
33. Controlled Vocabularies
• Vocabulary control is the attempt to provide
a standardized and consistent set of terms
(such as subject headings, names,
classifications, etc.) with the intent of aiding
the searcher in finding information
• That is, it is an attempt to provide a
consistent set of descriptions for use in (or
as) metadata
34. The Problem
• Proliferation of the forms of names
–Different names for the same person
–Different people with the same names
• Examples
–from Books in Print (semi-controlled but not
consistent)
–ERIC author index (not controlled)
35. Goethe
…etc…
39. Name Authority Files
ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-21-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
053 PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Cooper, Henry St. John,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick,$d1908-1973
400 10 Hope, Brian,$d1908-1973
400 10 Hughes, Colin,$d1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry,$d1908-1973
400 10 Wilde, Jimmy
500 10 $wnnnc$aAshe, Gordon,$d1908-1973
(Callout on the 400 fields: different names for the same person)
40. Merging EAC-CPF Records
[Diagram: steps: connect records exactly using name matching, Cheshire search of authority records, merge record information]
41. Connect Exact Matches
• The EAC-CPF records provide the names
without having to parse texts, etc.
• Allows us to use some simple methods like
exact matching
–Assume identical name entries mean the same
person/corporate body/family
–Enter the full names and record IDs into a
database and flag IDs with same names for
merging
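The exact-match step can be pictured with a few lines of Python. This is only a sketch under assumptions: an in-memory dictionary stands in for the database, and the record IDs (`loc-1`, `oac-7`, and so on) and the function name `flag_exact_matches` are made up for the example.

```python
from collections import defaultdict

# (record ID, full name entry) pairs; the IDs are hypothetical.
RECORDS = [
    ("loc-1", "Bush, Vannevar, 1890-1974."),
    ("oac-7", "Bush, Vannevar, 1890-1974."),
    ("oac-8", "Tabb, John B. (John Banister), 1845-1909."),
    ("nwda-3", "Tabb, John Banister, 1845-1909."),
]


def flag_exact_matches(records):
    """Group record IDs by identical name string and keep groups of two or more."""
    ids_by_name = defaultdict(list)
    for record_id, name in records:
        ids_by_name[name].append(record_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


print(flag_exact_matches(RECORDS))
```

The two Bush records are flagged for merging, while the two Tabb entries stay apart because their strings differ, which is the kind of miss that exact matching cannot avoid.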
42. Merging EAC-CPF Records
[Diagram: steps: connect records exactly using name matching, Cheshire search of authority records, merge record information]
43. Search Authority Files
• For each name, formulate a search of the
VIAF database using the Cheshire system
(SGML/XML retrieval system with
probabilistic and Boolean matching)
–Search both the “authoritative” and “non-
authoritative” forms
–Consider any name matching a non-authoritative
form to be a candidate match for the authoritative
form
–Flag EAC records that match the same authority
record as potential matches
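The logic of this step can be sketched without Cheshire or VIAF. In the example below a small dictionary plays the role of the authority file and exact string lookup replaces Cheshire's probabilistic and Boolean search, so it shows only the flagging rule, not the retrieval. The authority entry reuses a few forms from the Creasey record shown earlier; the EAC record IDs and function names are hypothetical.

```python
from collections import defaultdict

# Stand-in authority file: one authoritative form plus non-authoritative forms.
AUTHORITY = {
    "NAFL8057230": {
        "authorized": "Creasey, John",
        "variants": ["Cooke, M. E.", "Marsden, James", "Ranger, Ken"],
    },
}


def build_index(authority):
    """Map every authoritative and non-authoritative form to its authority ID(s)."""
    index = defaultdict(set)
    for auth_id, record in authority.items():
        for form in [record["authorized"], *record["variants"]]:
            index[form].add(auth_id)
    return index


def flag_authority_matches(eac_records, index):
    """Flag EAC record IDs that hit the same authority record as potential matches."""
    eac_by_authority = defaultdict(list)
    for record_id, name in eac_records:
        for auth_id in index.get(name, ()):
            eac_by_authority[auth_id].append(record_id)
    return {a: ids for a, ids in eac_by_authority.items() if len(ids) > 1}


eac_records = [("r1", "Creasey, John"), ("r2", "Marsden, James"), ("r3", "Wilde, Oscar")]
print(flag_authority_matches(eac_records, build_index(AUTHORITY)))
```

Records `r1` and `r2` carry different names but resolve to the same authority record, so they are flagged together; `r3` matches nothing and is left alone.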
44. Merging EAC-CPF Records
[Diagram: steps: connect records exactly using name matching, Cheshire search of authority records, merge record information]
45. Merge Flagged Records
• For all of the exact matches and authority
matches
–Use the Authoritative form of the name
–Combine data from each match into a single EAC-
CPF record
–Retain all source record IDs and information
• Finally, output the merged EAC-CPF records
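A minimal sketch of the merge rule follows, assuming each flagged record is a plain dictionary with `id`, `name`, and an optional `relations` list. Those field names, the record IDs, and the function `merge_records` are illustrative assumptions, not the project's data model; the real output is an EAC-CPF XML record.

```python
def merge_records(records, authorized_form=None):
    """Combine flagged records into one, keeping every source ID."""
    # Prefer the authoritative form when an authority match supplied one.
    name = authorized_form or records[0]["name"]
    merged = {"name": name, "alternativeNames": [], "sourceIds": [], "relations": []}
    for record in records:
        # Any other form of the name is kept as an alternative entry.
        if record["name"] != name and record["name"] not in merged["alternativeNames"]:
            merged["alternativeNames"].append(record["name"])
        merged["sourceIds"].append(record["id"])
        for relation in record.get("relations", []):
            if relation not in merged["relations"]:
                merged["relations"].append(relation)
    return merged


flagged = [
    {"id": "r1", "name": "Creasey, John", "relations": ["Ashe, Gordon, 1908-1973"]},
    {"id": "r2", "name": "Marsden, James", "relations": ["Ashe, Gordon, 1908-1973"]},
]
print(merge_records(flagged, authorized_form="Creasey, John"))
```

Keeping `sourceIds` intact means a merged record can always be traced back to, or split into, the finding aids it came from.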
46. Inputs to SNAC merging
• LoC: 43,702 EAC-CPF records derived from 1159
finding aids
• OAC: 91,811 EAC-CPF records derived from
~15,400 finding aids
• NWDA: 22,609 EAC-CPF records derived from
5,568 finding aids
• Result: 123,920 “unique” names
47. Another view of the numbers…
• 93033 Person names merged from 114639
Person records
• 30161 Institutions merged from 41177
Institution records
• 1669 Families merged from 2263 Family
records
48. But…
• Exact merging assumes that archives are
following LC cataloging practice in their EAD
records
–There are some problems with this assumption
49. Some failures for merging…
• Different abbreviations:
– A. & G. Carisch & C.
– A. & G. Carisch & Co.
• And spacing issues:
– A. C. Peters & Bro.
– A. C. Peters & Brother.
– A. C. Peters. (??)
– A. C.Peters & Bro.
• Completeness and alternate rules
– Tabb, John B. (John Banister), 1845-1909.
– Tabb, John Banister, 1845-1909.
50. More…
• Variant romanizations (and spacing):
–M. P. Belaieff.
–M. P. Belaïeff.
–M. P. Bieliaev.
–M.P. Belaïeff.
–M.P.Belaïeff.
• Initials vs. names:
–Zabolotskii, N.A.
–Zabolotskii, Nikolai Alekseevich, 1903-1958.
–Zabolotskii.
51. More…
• Inverted order vs. uninverted
–Taylor, Zachary, 1784-1850.
–Zachary Taylor.
• Various combinations:
–Tchaikovsky, Peter I.
–Tchaikovsky, Pëtr Il.
–Tchaikovsky, Piotr Ilyich.
–Tchaikovsky, Pyotr Il.
–Tchaikovsky, Pyotr Ilyich.
52. Another kind of failure
• Entry for “Zaphiropoulos” - no dates, no first name:
– The entry from VIAF was for “Zaphiropoulos, Lela,
1941-”
– But the name in EAD came as an attribution for photos:
– Box 113
– Lot PP13 Zaphiropoulos. [Bas-relief at Troy], 1872.
– Physical Description: 2 photographs
– Scope and Content Note
– Photographs taken for Schliemann.
• Not sure that the Zaphiropoulos indicated is a
person, and definitely not one born in 1941.
53. Addressing the failures
• First we need to know where things are not working,
and why
– We are planning to do a random sample and detailed
evaluation of the database to help identify the problems
• Many of the problems we have seen already appear
to be solvable using:
– Additional contextual clues from the EAD records
– More sophisticated matching for phonetic variants
• Such as n-grams or phonetic schemes like phonex
– Additional normalization of names before merging
• For name order, etc.
– Use of advanced matching methods
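Two of these remedies, extra normalization and n-gram matching, can be tried on the failure examples from the earlier slides. The sketch below is one possible reading, not the project's method: the normalization rules (strip diacritics, put a space after a period that runs into a letter, drop the trailing period, lowercase) and the use of a Dice coefficient over character trigrams are assumptions chosen for the example.

```python
import re
import unicodedata


def normalize_name(name):
    """Reduce spacing, diacritic, and trailing-punctuation differences."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks so that "ï" compares equal to "i".
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # "M.P.Belaieff" -> "M. P. Belaieff": space after a period that touches a letter.
    spaced = re.sub(r"\.(?=\w)", ". ", plain)
    collapsed = re.sub(r"\s+", " ", spaced).strip()
    return collapsed.rstrip(".").lower()


def trigrams(text):
    """Set of character trigrams, padded so word starts and ends count."""
    padded = f"  {text} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def dice_similarity(a, b):
    """Dice coefficient over trigrams of the normalized names (0.0 to 1.0)."""
    ta, tb = trigrams(normalize_name(a)), trigrams(normalize_name(b))
    return 2 * len(ta & tb) / (len(ta) + len(tb))


print(normalize_name("M.P.Belaïeff."))
print(dice_similarity("M. P. Belaieff.", "M. P. Bieliaev."))
```

Normalization alone collapses the spacing and diacritic variants into one key, while the variant romanization only scores as a partial match, so a threshold and corroborating context such as dates would still be needed.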
54. Testing new merging methods
• Work done in conjunction with SNAC for an I
School master's project called Biograph
–Krishna Janakiraman and Sean Marimpietri
• Using SNAC and merging with FreeBase
and IMDB
55. Einstein, Albert, 1879-1955.
Einstein, Albert.
Ainshutain, A. 1879-1955
Aiyinsitan 1879-1955
Einstein, A.
Albert Einstein
Albert Einstein
Krishna Janakiraman and Sean Marimpietri - Biograph
56. Our approach
• Learn binary classifiers over varying names and existence dates
• Perturb existing information to generate additional samples within specific error levels
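The slide does not say which perturbations were used, so the following is only a guess at what "perturb within specific error levels" could look like: random letter substitutions at a chosen rate and small shifts of birth and death years. All function names, the default error rate, and the one-year shift are assumptions made for illustration.

```python
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def perturb_name(name, error_rate, rng):
    """Replace each letter with a random letter with probability error_rate."""
    out = []
    for ch in name:
        if ch.isalpha() and rng.random() < error_rate:
            out.append(rng.choice(LETTERS))
        else:
            out.append(ch)
    return "".join(out)


def perturb_year(year, max_shift, rng):
    """Shift a year by at most max_shift in either direction."""
    return year + rng.randint(-max_shift, max_shift)


def make_samples(name, birth, death, n, error_rate=0.1, max_shift=1, seed=0):
    """Generate n perturbed (name, birth, death) variants of one known entity."""
    rng = random.Random(seed)  # seeded so the sample set is reproducible
    return [
        (perturb_name(name, error_rate, rng),
         perturb_year(birth, max_shift, rng),
         perturb_year(death, max_shift, rng))
        for _ in range(n)
    ]


for sample in make_samples("Einstein, Albert", 1879, 1955, n=3):
    print(sample)
```

Variants produced this way could serve as positive training examples for a classifier that decides whether two records describe the same entity.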
57. [Diagram: training and prediction pipeline. Train: features from names (string distance, shingle language model metrics) and from birth and death dates; learn decision tree classifiers. Predict: link records.]
58. Shingle Language Model for names
Name: Einstein Albert
Shingle sequence: ein, ins, nst, ste, tei, ein … , ert
The probability that the sequence (ins, nst, ste) follows ein is very high for the name
einstein
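To make the shingle idea concrete, the sketch below cuts each word of a name into overlapping three-character shingles and counts which shingle follows which. It is a simplified first-order (bigram) model over shingles, assumed for illustration; the Biograph model may differ in order, smoothing, and tokenization, and the function names are invented here.

```python
import re
from collections import Counter, defaultdict


def token_shingles(token, k=3):
    """Overlapping k-character shingles of one word."""
    return [token[i:i + k] for i in range(len(token) - k + 1)]


def shingles(name, k=3):
    """Shingle sequence for a whole name, word by word."""
    sequence = []
    for token in re.findall(r"[a-z]+", name.lower()):
        sequence.extend(token_shingles(token, k))
    return sequence


def train_model(names, k=3):
    """Count shingle-to-next-shingle transitions within each word."""
    counts = defaultdict(Counter)
    for name in names:
        for token in re.findall(r"[a-z]+", name.lower()):
            seq = token_shingles(token, k)
            for prev, nxt in zip(seq, seq[1:]):
                counts[prev][nxt] += 1
    return counts


def transition_prob(counts, prev, nxt):
    """Estimated probability that shingle nxt follows shingle prev."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0


print(shingles("Einstein Albert"))
model = train_model(["Einstein Albert", "Albert Einstein"])
print(transition_prob(model, "ein", "ins"))
```

A variant spelling such as "Ainshtain" shares few transitions with this model, which is what lets the model's score act as a feature for the classifier.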
59. Shingle Language Model for names
[Diagram: shingle graphs for Name 1: Einstein Albert; Name 2: Ainshtain Albert; Name 3: Albert Einstein]
60. Example Decision Tree for Von Neumann
[Diagram: decision tree splitting on date and string distance]
61.
Name             TP    FP   FN   TN    TPR     FPR
Albert Einstein  78    11   25   145   75.7%   7%
George W Bush    39    9    6    60    86.6%   13%
Von Neumann      182   14   27   301   75.7%   7%
Corpus average                         72.7%   17%
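For reference, the rates follow from the counts as TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The short sketch below recomputes them from the printed counts; the function name `rates` is made up for the example. Recomputed this way, the Einstein and Bush figures agree with the slide to within rounding, while the Von Neumann counts give roughly 87.1% and 4.4% rather than the printed 75.7% and 7%, so one of the two sets of numbers for that name may have been transcribed incorrectly.

```python
def rates(tp, fp, fn, tn):
    """Return (true positive rate, false positive rate) from confusion counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr


COUNTS = {
    "Albert Einstein": (78, 11, 25, 145),
    "George W Bush": (39, 9, 6, 60),
    "Von Neumann": (182, 14, 27, 301),
}

for name, (tp, fp, fn, tn) in COUNTS.items():
    tpr, fpr = rates(tp, fp, fn, tn)
    print(f"{name}: TPR {tpr:.1%}, FPR {fpr:.1%}")
```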
62. How many did we link?
• 15,300 records, thresh = 0.85
• 1,100 records, thresh = 0.9
63. Conclusions
• There will not be a single merging method,
but a staged set of approaches that will allow
us to go from the simplest exact matches, to
(we hope) reliably identifying various variant
forms of a name, etc. when corroborated by
contextual (date, etc.) information
• Once records are merged, they are passed
along to Brian for search and display…
64. Discovering Historic
Social Networks
Prototype Historical Resource Demo
Brian Tingle, California Digital Library
Society of American Archivists 2011 Annual Meeting
August 27, 2011
Chicago
70. Meet the target users
Personas are fictional characters created to represent the different user types within a targeted demographic, attitude and/or behavior set that might use a site, brand
or product in a similar way. http://en.wikipedia.org/wiki/Persona_(marketing)
• Randy: Graduate student working on a PhD that involves biographies and the study of diplomatic
families and networks. Sometimes he comes to the site looking for information on specific people; other
times he is looking for information on a specific subject or event. He also TAs an undergraduate history
class and sometimes has to help students find topics for papers.
• Connie: Works at an institution that contributed records to the project. Will be asking
how this site would be useful to the institution's users. Wants to understand how the institution's records were
used and what the added value is.
• Quincy: Library School Student working to QA record matching.
• Adele: Person doing authority work during collection processing.
• Lenny: Likes linked data, and wants to be able to mine the links that have been established
programmatically.
94. Graph Schema
vertex
_id: auto-assigned by neo4j
_type: vertex
identity: the name of the entity (string) [indexed]
urls: newline-separated list of source EAD files
entityType: 'corporateBody', 'family', or 'person'
edge
_id: auto-assigned by neo4j
_type: edge
_label: 'correspondedWith' or 'associatedWith'
_inV: incoming vertex _id (from)
_outV: outgoing vertex _id (to)
from_name: from identity (string) denormalized
to_name: to identity (string) denormalized
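The schema can be mirrored with plain Python dictionaries to see how the denormalized name fields save lookups. This is an in-memory sketch only: it does not talk to neo4j, the IDs are supplied by hand rather than auto-assigned, the edge label key is written `_label` here, and the helper names are invented. The `_inV` (from) and `_outV` (to) meanings follow the descriptions in the schema above.

```python
def make_vertex(vertex_id, identity, entity_type, urls):
    """Build a vertex dictionary with the fields listed in the schema."""
    return {
        "_id": vertex_id,
        "_type": "vertex",
        "identity": identity,
        "urls": "\n".join(urls),  # one source EAD file per line
        "entityType": entity_type,
    }


def make_edge(edge_id, label, from_vertex, to_vertex):
    """Build an edge, copying both names onto it (denormalized)."""
    return {
        "_id": edge_id,
        "_type": "edge",
        "_label": label,
        "_inV": from_vertex["_id"],
        "_outV": to_vertex["_id"],
        "from_name": from_vertex["identity"],
        "to_name": to_vertex["identity"],
    }


def both_edges(edges, vertex_id):
    """Edges touching a vertex in either direction, like the bothE view."""
    return [e for e in edges if vertex_id in (e["_inV"], e["_outV"])]


oppenheimer = make_vertex(1, "Oppenheimer, J. Robert, 1904-1967", "person",
                          ["http://hdl.loc.gov/loc.mss/eadmss.ms998007"])
bush = make_vertex(2, "Bush, Vannevar, 1890-1974", "person", [])
edges = [make_edge(10, "correspondedWith", oppenheimer, bush)]

for edge in both_edges(edges, 2):
    print(edge["from_name"], edge["_label"], edge["to_name"])
```

Because each edge already carries `from_name` and `to_name`, a neighbor list can be rendered without fetching the vertices at the other end.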
95. internal id
indices/name-idx is an index on
“identity”; used to look up neo4j record
id
96. “bothE” shows in and out edges
vertices/103994/bothE
redundant data to save repeated
lookups
102. Front End Stack
• golden grid
http://code.google.com/p/the-golden-grid/
• form style http://formalize.me/
• jquery and jquery ui
• hoverIntent for advanced search
• google analytics with event tracking
103. XTF XSLT Framework
• pre filter - do special tokenization to create custom
EAC facets
• https://docs.google.com/document/d/1wP9x6sdOZTagJNQXoyJfPh0Y6UzQgqLwLI86WSlIPbk/edit?hl=en_US
• query parser - CGI params to XTF query XML
• result formatter - XTF results to HTML
• doc formatter - EAC-CPF to HTML
• http://code.google.com/p/xtf-cpf/source/browse/?name=xtf-cpf
104. social graph visualization
• EAC to graphML
https://code.google.com/p/eac-graph-load/
• graphML file with open license should be
viewable in other tools
• Old demo uses Dracula Graph Library
• New demo uses JavaScript InfoVis Toolkit
• Ed Summers's “snac hacks” post
Flexible description: series description; dispersed collections
Cooperative authority control: dispersed collections; but also creator of one collection is referenced in a collection created by someone else (co-referencing); economic and descriptive benefits
Integrated access to cultural heritage: context for archival records, essential, but the descriptions can also provide context for all types of resources
Archival authority records, like museum authority records, provide historical and biographical data that can enhance identification and understanding (biographical dictionary; administrative histories)
Remember that we will solicit public evaluation and suggestions on drafts of the public interface, starting in the fall.