EAC-CPF and Social Networks
Society of American Archivists
Chicago, August 2011

Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia

SNAC Overview
• Funding and Timeline
• Project Team
• Project Objectives and Rationale
• Data Contributing Institutions
• Archival Standards Employed
• Methods, Processing, and Products
• Year One Extraction Results
• Basic Observations on Extraction


Funding and Timeline
• National Endowment for the Humanities
• A Preservation and Access, Research and Development grant
• Two-year project
• May 2010-April 2012





Project Team
• Daniel Pitti (PI) and Worthy Martin (Institute for Advanced Technology in the Humanities, University of Virginia)
• Adrian Turner and Brian Tingle (California Digital Library, University of California)
• Ray Larson (School of Information, University of California, Berkeley)



Project Objectives
• Archival finding aids currently intermix description of records with description of the creators of records and persons evident in the records
• Further the ongoing process of transforming archival description using advanced technologies
• By facilitating the separation of the description of people from the description of records
• Using EAC-CPF, an international archival authority control standard
• Goal: enhance the economy and effectiveness of archival description to enhance access and understanding for users of archives, libraries, and museums



Rationale for Separation
• Authority control of forms of names
• Flexible description
• Cooperative authority control
• Integrated access to cultural heritage
• Biographical/historical resource
• Social/historical context (social-professional networks)


The Data
• EAD-encoded finding aids
  – Library of Congress (1,159)
  – Online Archive of California (~15,400)
  – Northwest Digital Archive (5,160)
  – Virginia Heritage (8,390)
• Authority records
  – Library of Congress: NACO/LCNAF (3.8M personal names; 900K corporate names)
  – Getty Vocabulary Program: Union List of Artist Names (293K personal and corporate names)
  – Virtual International Authority File (5M+ personal names)


Methods and Processing
• Extract EAC-CPF records from existing EAD-encoded archival descriptions
  – Extracting both creators and referenced CPF names
• Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF); merge records for the same entity
  – Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
  – Key challenge: two or more people with the same name; two or more names for the same person
• Create a prototype historical resource and access system
  – Historical data and social-professional networks
  – Links to archive, library, and museum resources (by and about)



EAD Source Data
• Encoded Archival Description
   – Intermixes description of creators of records and, at the discretion of the archivists, names associated with the content of the records
   – Detailed description of creators of records
• Widely varying quality
   – In the number of names identified and encoded
   – In the formation of the names (direct or inverted, capitalization, punctuation, and so on)
   – In the categorization of names (personal, corporate, or family)
• Many names given but not identified as such
• Most important of these in biographies/histories and in correspondence description
• Extraction has focused on the “low hanging fruit,” that is, the names tagged as names
• Attention shifting to names not identified as such


Archival Records
• Records are the by-products of people living and working as individuals, in organized groups, in families
• Records document people living and working
• People exist in social-professional contexts, in relation to others
• Records document these relations

• All records created by the same entity are described together (a fonds or collection)
  – Creators documented in detail
  – Many of the people documented in the record referenced in description
• Archival descriptions document interrelations among people and records (documents)


Source: J. Robert Oppenheimer Papers (LoC)

<origination>
    <persname source="lcnaf">Oppenheimer, J. Robert, 1904-1967</persname>
</origination>
<controlaccess>
    <persname source="lcnaf" encodinganalog="100" role="creator">Oppenheimer, J. Robert, 1904-1967</persname>
    <persname source="lcnaf" encodinganalog="600" role="subject">Bethe, Hans Albrecht, 1906- --Correspondence</persname>
    <!-- […] -->
    <persname source="lcnaf" encodinganalog="600" role="subject">Born, Max, 1882-1970 --Correspondence</persname>
    <persname source="lcnaf" encodinganalog="600" role="subject">Boyd, Julian P. (Julian Parks), 1903- --Correspondence</persname>
    <persname source="lcnaf" encodinganalog="600" role="subject">Bush, Vannevar, 1890-1974 --Correspondence</persname>
    <persname source="lcnaf" encodinganalog="600" role="subject">Casals, Pablo, 1876-1973 --Correspondence</persname>
    <!-- […] -->
    <corpname source="lcnaf" encodinganalog="610" role="subject">Institute for Advanced Study (Princeton, N.J.)</corpname>
    <corpname source="lcnaf" encodinganalog="610" role="subject">Los Alamos Scientific Laboratory</corpname>
    <!-- […] -->
</controlaccess>
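
The tagged names in a fragment like the Oppenheimer sample can be pulled out mechanically. The following is a minimal sketch, not the project's extraction code: it assumes an un-namespaced EAD fragment wrapped in a hypothetical `<ead>` root, and the function name `extract_names` and the entity-type labels are illustrative choices.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the slide's sample, wrapped in a hypothetical <ead> root.
EAD_SAMPLE = """
<ead>
  <origination>
    <persname source="lcnaf">Oppenheimer, J. Robert, 1904-1967</persname>
  </origination>
  <controlaccess>
    <persname source="lcnaf" encodinganalog="100" role="creator">Oppenheimer, J. Robert, 1904-1967</persname>
    <persname source="lcnaf" encodinganalog="600" role="subject">Bush, Vannevar, 1890-1974 --Correspondence</persname>
    <corpname source="lcnaf" encodinganalog="610" role="subject">Los Alamos Scientific Laboratory</corpname>
  </controlaccess>
</ead>
"""

# EAD name elements mapped to illustrative C/P/F labels (labels are assumptions).
NAME_TAGS = {"persname": "person", "corpname": "corporateBody", "famname": "family"}


def extract_names(ead_xml):
    """Return (entity_type, heading, role) for each tagged name."""
    root = ET.fromstring(ead_xml)
    found = []
    for parent_tag in ("origination", "controlaccess"):
        for parent in root.iter(parent_tag):
            for child in parent:
                if child.tag not in NAME_TAGS:
                    continue
                text = " ".join("".join(child.itertext()).split())
                # Drop subject subdivisions such as "--Correspondence".
                heading = text.split("--")[0].strip()
                default_role = "creator" if parent_tag == "origination" else "referenced"
                found.append((NAME_TAGS[child.tag], heading, child.get("role") or default_role))
    return found


names = extract_names(EAD_SAMPLE)
for entity_type, heading, role in names:
    print(entity_type, "|", heading, "|", role)
```

The creator appears twice (once under `<origination>`, once under `<controlaccess>`), which is one reason extracted records need a later matching and merging step.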


Source: Leonard Bernstein Collection (LoC)

<c02>
    <did>
        <container type="box">1</container>
        <unittitle>Aaltonen, Erkki <unitdate era="ce" calendar="gregorian">1981</unitdate>
        </unittitle>
        <physdesc>
            <extent>1</extent>
        </physdesc>
    </did>
</c02>
<c02>
    <did>
        <unittitle>Abbado, Claudio <unitdate era="ce" calendar="gregorian">1963-90</unitdate>
        </unittitle>
        <physdesc>
            <extent>5</extent>
        </physdesc>
    </did>
</c02>
[…]
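
In this sample the correspondents' names sit in `<unittitle>` as plain text rather than inside `<persname>`. A rough sketch of how such untagged names might be picked up follows; the "contains a comma" heuristic, the hypothetical `<dsc>` wrapper, and the function name `candidate_names` are assumptions for illustration, not the project's method.

```python
import xml.etree.ElementTree as ET

# The two components from the slide, wrapped in a hypothetical <dsc> root.
COMPONENTS = """
<dsc>
  <c02><did>
    <container type="box">1</container>
    <unittitle>Aaltonen, Erkki <unitdate era="ce" calendar="gregorian">1981</unitdate></unittitle>
    <physdesc><extent>1</extent></physdesc>
  </did></c02>
  <c02><did>
    <unittitle>Abbado, Claudio <unitdate era="ce" calendar="gregorian">1963-90</unitdate></unittitle>
    <physdesc><extent>5</extent></physdesc>
  </did></c02>
</dsc>
"""


def candidate_names(xml_text):
    """Collect unittitle text that looks like an inverted name, plus its date."""
    root = ET.fromstring(xml_text)
    candidates = []
    for title in root.iter("unittitle"):
        label = (title.text or "").strip()   # text before the nested <unitdate>
        date = title.findtext("unitdate")    # dates of the correspondence
        if "," in label:                     # crude "Surname, Forename" test
            candidates.append({"name": label, "dates_of_correspondence": date})
    return candidates


candidates = candidate_names(COMPONENTS)
print(candidates)
```

Keeping the correspondence date alongside the name preserves context that can later help decide between people who share a name.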




<bioghist>
    <head>Biographical Sketch</head>
    <p>José Marcos Mugarrieta, prior to his term as Mexican consul in San Francisco 1857-1863, served in the Mexican army from 1837. He saw action in numerous battles and campaigns – Jamaica, under General Canalizo in 1841; Campeche, 1842-1843; Merida, 1843; Veracruz, 1845; Mexico City, 1846; Angostura and Cerro-gordo, 1847; Guanajuato, 1848, and Sierra-Gorda under Bustamante, 1848-1849; and Matamoros, 1849-1850. […]</p>
    <p>In April 1857 Mugarrieta received an appointment from the Comonfort government for the consulship in San Francisco. He did not actually begin his new duties until September 1, 1859, due to illness and to the political situation in Mexico. […]</p>
</bioghist>





<bioghist>
    <head>Chronology</head>
    <chronlist>
        <chronitem>
            <date>1900</date>
            <event>Born on Jan. 20 in Hastings, Minnesota.</event>
        </chronitem>
        <chronitem>
            <date>1922</date>
            <event>Received baccalaureate from Princeton University, major in philosophy.</event>
        </chronitem>
        […]
        <chronitem>
            <date>1965</date>
            <event>Died on April 4.</event>
        </chronitem>
    </chronlist>
</bioghist>





EAC-CPF
• Encoded Archival Context-Corporate bodies, Persons, Families
• An international communication standard for archival authority control
• Based on International Council on Archives, International Standard Archival Authority Records-Corporate bodies, persons, families (ISAAR(CPF))
• SAA Standards Committee, Technical Subcommittee on Encoded Archival Context
• Co-chairs
  – Katherine Wisser, Simmons College
  – Anila Angjeli, Bibliothèque nationale de France

Library and Archive Authority Control
• Library (or bibliographic) authority control is almost exclusively about the control of names
• Archival authority control involves biographical-historical description of the CPF entity
  – Descriptions based on controlled vocabularies or values, for example, occupations, place of birth and death
  – But also biographical-historical description
    • Prose
    • Chronological list
• Archival authority control provides context for understanding records, the context of their creation, the provenance


<identity>
    <entityType>person</entityType>
    <nameEntry scriptCode="Latn" xml:lang="eng">
        <part>Oppenheimer, J. Robert, 1904-1967.</part>
        <authorizedForm>AACR2</authorizedForm>
    </nameEntry>
    <nameEntry localType="VIAF:MainHeading">
        <part>Oppenheimer, J. Robert (Julius Robert), 1904-1967</part>
        <alternativeForm>VIAF</alternativeForm>
    </nameEntry>
    <nameEntry localType="VIAF:MainHeading">
        <part>Oppenheimer, Julius Robert, 1904-1967</part>
        <alternativeForm>VIAF</alternativeForm>
    </nameEntry>
    <nameEntry localType="VIAF:x400">
        <part>Oppenheimer, Robert</part>
        <alternativeForm>VIAF</alternativeForm>
    </nameEntry>
    <nameEntry localType="VIAF:x400">
        <part>Ou-pẽn-hai-mo, 1904-1967</part>
        <alternativeForm>VIAF</alternativeForm>
    </nameEntry>
</identity>
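
An `<identity>` element of this shape can be read back into authorized and alternative name forms. The sketch below assumes the fragment carries no XML namespace (as on the slide; namespaced EAC-CPF files would need qualified tag names), and the function name `name_forms` is illustrative.

```python
import xml.etree.ElementTree as ET

IDENTITY = """
<identity>
    <entityType>person</entityType>
    <nameEntry scriptCode="Latn" xml:lang="eng">
        <part>Oppenheimer, J. Robert, 1904-1967.</part>
        <authorizedForm>AACR2</authorizedForm>
    </nameEntry>
    <nameEntry localType="VIAF:MainHeading">
        <part>Oppenheimer, Julius Robert, 1904-1967</part>
        <alternativeForm>VIAF</alternativeForm>
    </nameEntry>
    <nameEntry localType="VIAF:x400">
        <part>Oppenheimer, Robert</part>
        <alternativeForm>VIAF</alternativeForm>
    </nameEntry>
</identity>
"""


def name_forms(identity_xml):
    """Split nameEntry values into authorized and alternative forms."""
    root = ET.fromstring(identity_xml)
    authorized, alternatives = [], []
    for entry in root.iter("nameEntry"):
        part = " ".join((entry.findtext("part") or "").split())
        if entry.find("authorizedForm") is not None:
            authorized.append(part)
        else:
            alternatives.append(part)
    return root.findtext("entityType"), authorized, alternatives


entity_type, authorized, alternatives = name_forms(IDENTITY)
print(entity_type, authorized, alternatives)
```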


<existDates>
    <dateRange>
        <fromDate standardDate="1904-04-22">1904, Apr. 22</fromDate>
        <toDate standardDate="1967-02-18">1967, Feb. 18</toDate>
    </dateRange>
</existDates>
<!-- ... -->
<localDescription localType="subject">
    <term>Science--Societies, etc.</term>
</localDescription>
<localDescription localType="VIAF:nationality">
    <placeEntry countryCode="US"/>
</localDescription>
<localDescription localType="VIAF:gender">
    <term>Male</term>
</localDescription>
<languageUsed>
    <language languageCode="eng"/>
</languageUsed>
<occupation>
    <term>Physicists.</term>
</occupation>
<!-- ... -->


<chronList>
    <chronItem>
        <date>1904, Apr. 22</date>
        <placeEntry>New York, N.Y.</placeEntry>
        <event>Born, New York, N.Y.</event>
    </chronItem>
    <!-- ... -->
    <chronItem>
        <date>1943-1945</date>
        <placeEntry>Los Alamos, N. Mex.</placeEntry>
        <event>Director, Los Alamos Scientific Laboratory, Los Alamos, N. Mex.</event>
    </chronItem>
    <!-- ... -->
    <chronItem>
        <date>1954</date>
        <event>(1) Denied security clearance […] (2) Published Science and the Common Understanding […]
        </event>
    </chronItem>
    <!-- ... -->
    <chronItem>
        <date>1967, Feb. 18</date>
        <placeEntry>Princeton, N.J.</placeEntry>
        <event>Died, Princeton, N.J.</event>
    </chronItem>
</chronList>


<cpfRelation xmlns:xlink="http://www.w3.org/1999/xlink"
    xlink:type="simple"
    xlink:role="http://RDVocab.info/uri/schema/FRBRentitiesRDA/Person"
    xlink:arcrole="correspondedWith">
    <relationEntry>Bush, Vannevar, 1890-1974.</relationEntry>
    <descriptiveNote>
        <p>recordId: DLC.ms998007.r007</p>
    </descriptiveNote>
</cpfRelation>
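
Relations of this kind are what turn a set of authority records into a social-professional network. As a small sketch under stated assumptions (an un-namespaced record wrapped in a hypothetical `<cpfDescription>` element, a second relation added purely for illustration, and the helper names `relation_edges` and `build_network` invented here), each `<cpfRelation>` can be read as one edge:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical, simplified record; the second relation is illustrative only.
RECORD = """
<cpfDescription xmlns:xlink="http://www.w3.org/1999/xlink">
  <identity>
    <nameEntry><part>Oppenheimer, J. Robert, 1904-1967.</part></nameEntry>
  </identity>
  <relations>
    <cpfRelation xlink:type="simple" xlink:arcrole="correspondedWith">
      <relationEntry>Bush, Vannevar, 1890-1974.</relationEntry>
    </cpfRelation>
    <cpfRelation xlink:type="simple" xlink:arcrole="correspondedWith">
      <relationEntry>Bethe, Hans Albrecht, 1906-</relationEntry>
    </cpfRelation>
  </relations>
</cpfDescription>
"""


def relation_edges(record_xml):
    """Return (source name, arcrole, target name) for every cpfRelation."""
    root = ET.fromstring(record_xml)
    source = root.findtext("identity/nameEntry/part")
    edges = []
    for rel in root.iter("cpfRelation"):
        arcrole = rel.get("{%s}arcrole" % XLINK)
        edges.append((source, arcrole, rel.findtext("relationEntry")))
    return edges


def build_network(edges):
    """Undirected adjacency sets: who is connected to whom."""
    network = defaultdict(set)
    for source, _arcrole, target in edges:
        network[source].add(target)
        network[target].add(source)
    return network


edges = relation_edges(RECORD)
network = build_network(edges)
print(sorted(network["Oppenheimer, J. Robert, 1904-1967."]))
```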





<resourceRelation xmlns:xlink="http://www.w3.org/1999/xlink" xlink:arcrole="creatorOf"
    xlink:role="archivalRecords" xlink:type="simple"
    xlink:href="http://hdl.loc.gov/loc.mss/eadmss.ms998007">
    <relationEntry>J. Robert Oppenheimer Papers, 1799-1980 (bulk 1947-1967)</relationEntry>
    <objectXMLWrap>
    <did xmlns="urn:isbn:1-931666-22-9">
        <unittitle>Papers <unitdate normal="1799/1980" era="ce" calendar="gregorian">1799-1980
        </unitdate><unitdate label="Bulk Dates" type="bulk" normal="1947/1967"
        era="ce" calendar="gregorian">(bulk 1947-1967)</unitdate></unittitle>
        <unitid countrycode="US" repositorycode="US-DLC">MSS35188</unitid>
        <origination label="Creator">
            <persname>Oppenheimer, J. Robert, 1904-1967</persname>
        </origination>
        <!-- ... -->
        <repository><corpname>Manuscript Division. Library of Congress</corpname>
        </repository>
        <abstract>Physicist and director of the Institute for Advanced Study, Princeton, New Jersey. [...] Topics include theoretical physics, development of the atomic bomb, the relationship between government and science, nuclear energy, security, and national loyalty. </abstract>
    </did>
    </objectXMLWrap>
</resourceRelation>




Year One Results-Extraction
• EAC-CPF records extracted
  – LoC: 43,702 from 1,159 finding aids
  – OAC: 91,811 from ~15,400
  – NWDA: 22,609 from 5,160
  – VH: 15,175 from 8,390
  – Total: 173,297
  – Note: a more recent extraction yielded 196,218, but we have not had time to analyze the results



Early Observations-Extraction
• Depth of analysis and quality of description of CPF entities vary widely in EAD-encoded finding aids
  – LoC: a lot of names under authority control
  – OAC and NWDA have fewer names, and control varies
• To be fair, the finding aids were created without SNAC processing in mind!



Next on Extraction
• Refine extraction processing, incorporating some NLP-like processing, for example
  – Verifying type of name: C or P or F
  – Massaging poorly formed names into better formed names
  – Identifying names in strings that are names-plus (but name not identified as such)
  – Providing context information to enhance matching, for example, date or dates of correspondence, or occupation of creator of records


Beyond the Project
• Building a National Archival Authorities Infrastructure
  – IMLS funded two-year project, October 2011-September 2013
  – EAC-CPF SAA workshops: 140 scholarships
  – National Archival Authorities Cooperative planning
• SNAC II: a proposal to expand SNAC
  – A lot more data
  – NARA, SI, MARC WorldCat records, a lot more finding aids



For More Information
• http://socialarchive.iath.virginia.edu/ (Project website)
• http://socialarchive.iath.virginia.edu/xtf/search (public prototype)





Social Networks and Archival Context Project: Matching and Merging EAC-CPF Records
Ray R. Larson
Krishna Janakiraman
University of California, Berkeley
School of Information

Thanks to Daniel V. Pitti of the Institute for Advanced Technology in the Humanities, University of Virginia, for many of the slides here

SAA 2011 - Chicago, 2011-08-27
SNAC Project
• The outlines of the project have been
  discussed by Daniel Pitti previously
• The primary focus of the Berkeley group for
  the project is on combining data resources
  from multiple archives and other information
  sources
• In this talk I will focus on our current
  methods used in the prototype (to be
  described by Brian Tingle later)


Data Contributing Institutions
• EAD-encoded finding aids
    – Library of Congress (1159)
    – Online Archive of California (15,400+)
    – Northwest Digital Archive (5,563+)
    – Virginia Heritage (8,390+)
• Authority records
    – Library of Congress: NACO/LCNAF (3.8M personal
      names; 900K corporate names)
    – Getty Vocabulary Program: Union List of Artist Names
      (293K personal and corporate names)
    – Virtual International Authority File (intersection with
      NACO/LCNAF, 5M personal names)
• Other biographical sources (e.g., DBPedia, IMDB)
Methods and Processing
• Extract EAC-CPF records from existing EAD-
  encoded archival descriptions
    – Extracting both creators and referenced CPF names
• Match EAC-CPF records against one another
  and against existing authority records (ULAN,
  VIAF, LCNAF)
    – Enhance EAC-CPF by normalizing entries, adding
      alternative entries, titles (VIAF), and historical data
      (ULAN)
• Create a prototype historical resource and access
  system
    – Historical data and social-professional networks
    – Links to archive, library, and museum resources (by and
      about)

Merging EAC-CPF Records
[Diagram: the LCNAF Repository and the ULAN Repository feed a Cheshire Search. Processing steps: (1) Connect exactly matching records; (2) Connect records using name authority information; (3) Merge]

Authority Control
• Identifying creator entities and referenced
  entities (correspondents, etc.)
• Recording name or names used by and for
  them
• Rule-based heading or entry formation and
  control




Controlled Vocabularies
• Vocabulary control is the attempt to provide
  a standardized and consistent set of terms
  (such as subject headings, names,
  classifications, etc.) with the intent of aiding
  the searcher in finding information
• That is, it is an attempt to provide a
  consistent set of descriptions for use in (or
  as) metadata



The Problem
• Proliferation of the forms of names
    –Different names for the same person
    –Different people with the same names

• Examples
    –from Books in Print (semi-controlled but not
     consistent)
    –ERIC author index (not controlled)




Goethe




                     …etc…



John Muir




Pauline Cochrane nee Atherton








Name Authority Files
             ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
             KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
             RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
             VST:d 08-21-91              Other Versions: earlier
             040 DLC$cDLC$dDLC$dOCoLC
             053 PR6005.R517
             100 10 Creasey, John
             400 10 Cooke, M. E.
             400 10 Cooke, Margaret,$d1908-1973
             400 10 Cooper, Henry St. John,$d1908-1973
             400 00 Credo,$d1908-1973
             400 10 Fecamps, Elise
             400 10 Gill, Patrick,$d1908-1973
             400 10 Hope, Brian,$d1908-1973
             400 10 Hughes, Colin,$d1908-1973
             400 10 Marsden, James
             400 10 Matheson, Rodney
             400 10 Ranger, Ken
             400 20 St. John, Henry,$d1908-1973
             400 10 Wilde, Jimmy
             500 10 $wnnnc$aAshe, Gordon,$d1908-1973

             (The 400 fields: different names for the same person)
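
To make the role of the 100 and 400 fields concrete, the sketch below builds a lookup from each variant (400) to the authorized heading (100). It assumes the plain-text display layout shown on the slide (tag, indicators, then data with `$x` subfield codes and an implicit leading `$a`) rather than real MARC binary records; the function names are illustrative.

```python
import re

# A few lines copied from the record on the slide.
RECORD_LINES = """100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 00 Credo,$d1908-1973
500 10 $wnnnc$aAshe, Gordon,$d1908-1973""".splitlines()


def parse_field(line):
    """Split a display line into its tag and a dict of subfields."""
    tag, _indicators, data = line.split(" ", 2)
    if not data.startswith("$"):
        data = "$a" + data  # the display omits the leading $a
    return tag, dict(re.findall(r"\$(\w)([^$]*)", data))


def heading(subfields):
    """Join the name ($a) and dates ($d) into one heading string."""
    parts = [subfields.get("a", "").strip(), subfields.get("d", "").strip()]
    return " ".join(p for p in parts if p)


def variant_index(lines):
    """Map every 400 (see-from) form to the 100 (authorized) heading."""
    authorized, index = None, {}
    for line in lines:
        tag, subfields = parse_field(line.strip())
        if tag == "100":
            authorized = heading(subfields)
        elif tag == "400":
            index[heading(subfields)] = authorized
    return authorized, index


authorized, index = variant_index(RECORD_LINES)
print(authorized, index)
```

The 500 field (a "see also" reference to a related heading) is deliberately left out of the variant index in this sketch.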




Connect Exact Matches
• The EAC-CPF records provide the names
  without having to parse texts, etc.
• Allows us to use some simple methods like
  exact matching
    –Assume identical name entries mean the same
     person/corporate body/family
    –Enter the full names and record IDs into a
     database and flag IDs with same names for
     merging
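
The exact-matching step can be sketched with an in-memory SQLite table standing in for "a database". The record IDs, table name, and column names below are made up for illustration; only the idea (group on the full name string and flag groups with more than one record) comes from the slide.

```python
import sqlite3

# (record ID, full name entry); IDs are hypothetical.
records = [
    ("loc-001", "Oppenheimer, J. Robert, 1904-1967."),
    ("oac-417", "Oppenheimer, J. Robert, 1904-1967."),
    ("nwda-09", "Tabb, John Banister, 1845-1909."),
    ("loc-552", "Tabb, John B. (John Banister), 1845-1909."),
]


def flag_exact_matches(records):
    """Return {name: [record IDs]} for names shared by more than one record."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE names (record_id TEXT, full_name TEXT)")
    db.executemany("INSERT INTO names VALUES (?, ?)", records)
    rows = db.execute(
        "SELECT full_name, GROUP_CONCAT(record_id) FROM names "
        "GROUP BY full_name HAVING COUNT(*) > 1 ORDER BY full_name"
    ).fetchall()
    db.close()
    # GROUP_CONCAT joins with commas; the sample IDs contain none.
    return {name: sorted(ids.split(",")) for name, ids in rows}


flagged = flag_exact_matches(records)
print(flagged)
```

The two Tabb entries are not flagged, because exact matching treats any difference in the string as a different entity.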




Search Authority Files
• For each name, formulate a search of the
  VIAF database using the Cheshire system
  (SGML/XML retrieval system with
  probabilistic and Boolean matching)
    –Search both the “authoritative” and “non-
     authoritative” forms
    –Consider any name matching a non-authoritative
     form to be a candidate match for the authoritative
     form
    –Flag EAC records that match the same authority
     record as potential matches
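
The flagging logic can be illustrated without the Cheshire system. The sketch below replaces the VIAF search with a plain dictionary lookup over a toy authority table, so it shows only the bookkeeping (variant form to authority record to flagged EAC records), not probabilistic retrieval. The authority IDs, record IDs, and table contents are hypothetical.

```python
from collections import defaultdict

# Toy authority table: one authoritative form plus non-authoritative variants.
AUTHORITY = {
    "auth-1": {"authoritative": "Creasey, John",
               "variants": ["Cooke, M. E.", "Marsden, James", "Ranger, Ken"]},
    "auth-2": {"authoritative": "Tabb, John Banister, 1845-1909.",
               "variants": []},
}


def build_index(authority):
    """Map every name form (case-folded) to the authority IDs that carry it."""
    index = {}
    for auth_id, rec in authority.items():
        for form in [rec["authoritative"], *rec["variants"]]:
            index.setdefault(form.casefold(), set()).add(auth_id)
    return index


def flag_by_authority(eac_names, authority):
    """Group EAC record IDs that resolve to the same authority record."""
    index = build_index(authority)
    by_authority = defaultdict(list)
    for record_id, name in eac_names:
        for auth_id in index.get(name.casefold(), ()):
            by_authority[auth_id].append(record_id)
    return {a: ids for a, ids in by_authority.items() if len(ids) > 1}


eac_names = [
    ("loc-1", "Creasey, John"),
    ("oac-7", "Marsden, James"),
    ("nwda-3", "Ranger, Ken"),
    ("loc-9", "Tabb, John Banister, 1845-1909."),
]
potential = flag_by_authority(eac_names, AUTHORITY)
print(potential)
```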


Merge Flagged Records
• For all of the exact matches and authority
  matches
    –Use the Authoritative form of the name
    –Combine data from each match into a single EAC-
     CPF record
    –Retain all source record IDs and information

• Finally, output the merged EAC-CPF records
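
A minimal sketch of the merge step follows, using plain dictionaries in place of EAC-CPF XML. The field names (`id`, `name`, `relations`) and the sample records are assumptions made for the example.

```python
def merge_records(records, authoritative_name):
    """Combine flagged records into one, keeping every source ID."""
    merged = {
        "name": authoritative_name,   # use the authoritative form
        "alternative_names": [],
        "source_ids": [],
        "relations": [],
    }
    for rec in records:
        merged["source_ids"].append(rec["id"])
        if rec["name"] != authoritative_name and rec["name"] not in merged["alternative_names"]:
            merged["alternative_names"].append(rec["name"])
        for relation in rec.get("relations", []):
            if relation not in merged["relations"]:   # de-duplicate shared data
                merged["relations"].append(relation)
    return merged


flagged = [
    {"id": "loc-1", "name": "Taylor, Zachary, 1784-1850.",
     "relations": ["creatorOf: Zachary Taylor Papers"]},
    {"id": "oac-7", "name": "Zachary Taylor.",
     "relations": ["creatorOf: Zachary Taylor Papers", "referencedIn: Example Family Papers"]},
]
merged = merge_records(flagged, "Taylor, Zachary, 1784-1850.")
print(merged)
```

Retaining `source_ids` is what makes a bad merge reversible later, since each contribution can be traced back to its finding aid.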



Inputs to SNAC merging
• LoC: 43,702 EAC-CPF records derived from 1159
  finding aids

• OAC: 91,811 EAC-CPF records derived from
  ~15,400 finding aids

• NWDA: 22,609 EAC-CPF records derived from
  5,568 finding aids

• Result: 123,920 “unique” names



Another view of the numbers…
• 93033 Person names merged from 114639
  Person records
• 30161 Institutions merged from 41177
  Institution records
• 1669 Families merged from 2263 Family
  records




But…
• Exact merging assumes that archives are
  following LC cataloging practice in their EAD
  records
    –There are some problems with this assumption




Some failures for merging…
• Different abbreviations:
    – A. & G. Carisch & C.
    – A. & G. Carisch & Co.
• And spacing issues:
    – A. C. Peters & Bro.
    – A. C. Peters & Brother.
    – A. C. Peters. (??)
    – A. C.Peters & Bro.
• Completeness and alternate rules
    – Tabb, John B. (John Banister), 1845-1909.
    – Tabb, John Banister, 1845-1909.


More…
• Variant romanizations (and spacing):
    –M. P. Belaieff.
    –M. P. Belaïeff.
    –M. P. Bieliaev.
    –M.P. Belaïeff.
    –M.P.Belaïeff.
• Initials vs. names:
    –Zabolotskii, N.A.
    –Zabolotskii, Nikolai Alekseevich, 1903-1958.
    –Zabolotskii.


More…
• Inverted order vs. uninverted
    –Taylor, Zachary, 1784-1850.
    –Zachary Taylor.
• Various combinations:
    –Tchaikovsky, Peter I.
    –Tchaikovsky, Pëtr Il.
    –Tchaikovsky, Piotr Ilyich.
    –Tchaikovsky, Pyotr Il.
    –Tchaikovsky, Pyotr Ilyich.


Another kind of failure
• Entry for “Zaphiropoulos” - no dates, no first name:
    – The entry from VIAF was for “Zaphiropoulos, Lela,
      1941-”
    – But the name in EAD came as an attribution for photos:
            – Box 113
            – Lot PP13 Zaphiropoulos. [Bas-relief at Troy], 1872.
            – Physical Description: 2 photographs
            – Scope and Content Note
            – Photographs taken for Schliemann.

• Not sure that the Zaphiropoulos indicated is a
  person, and definitely not one born in 1941.




Addressing the failures
• First we need to know where things are not working,
  and why
    – We are planning to do a random sample and detailed
      evaluation of the database to help identify the problems
• Many of the problems we have seen already appear
  to be solvable using:
    – Additional contextual clues from the EAD records
    – More sophisticated matching for phonetic variants
        • Such as n-grams or phonetic schemes like phonex
    – Additional normalization of names before merging
        • For name order, etc.
    – Use of advanced matching methods
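
Two of the remedies listed here, normalization before merging and n-gram comparison, can be sketched with the standard library alone. The specific normalization rules (strip diacritics, drop life dates and parenthetical fuller forms, un-invert "Surname, Forename") and the use of Jaccard overlap on character trigrams are assumptions chosen for illustration, not the project's settings.

```python
import re
import unicodedata


def normalize(name):
    """Reduce a heading to lowercase 'forename surname' without dates."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"\b\d{4}\s*-\s*(\d{4})?", " ", text)  # life dates
    text = re.sub(r"\([^)]*\)", " ", text)                # (fuller forms)
    if "," in text:                                       # un-invert name order
        surname, _, rest = text.partition(",")
        text = rest + " " + surname
    text = re.sub(r"[^\w\s]", " ", text)                  # punctuation
    return " ".join(text.lower().split())


def ngrams(text, n=3):
    """Set of character n-grams of the text, padded with one space each side."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b):
    """Jaccard overlap of character trigrams of the normalized names."""
    grams_a, grams_b = ngrams(normalize(a)), ngrams(normalize(b))
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


print(normalize("Taylor, Zachary, 1784-1850."), "|", normalize("Zachary Taylor."))
print(similarity("Tchaikovsky, Pyotr Ilyich.", "Tchaikovsky, Piotr Ilyich."))
```

Normalization alone reconciles the spacing and name-order examples from the earlier slides; variant romanizations still differ after normalization, which is where a graded similarity score (and a threshold that would need tuning) comes in.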


Testing new merging methods
• Work done in conjunction with SNAC for an I
  School Master's project called Biograph
    –Krishna Janakiraman and Sean Marimpietri
• Using SNAC and merging with Freebase
  and IMDB




Einstein, Albert, 1879-1955.
                     Einstein, Albert.
                     Ainshutain, A. 1879-1955
                     Aiyinsitan 1879-1955
                     Einstein, A.




                     Albert Einstein




                     Albert Einstein




                       Krishna Janakiraman and Sean Marimpietri - Biograph

Our approach
• Learn binary classifiers over varying names and existence dates
• Perturb existing information to generate additional samples within specific error levels
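
One way to read "perturb existing information to generate additional samples" is to apply a bounded number of random character edits to a known name and label the result as a match. The sketch below makes that reading concrete; treating the "error level" as a maximum number of edits, the choice of edit operations, and all function names are assumptions, not the Biograph implementation.

```python
import random
import string


def perturb(name, max_edits, rng):
    """Apply between 1 and max_edits random single-character edits."""
    chars = list(name)
    for _ in range(rng.randint(1, max_edits)):
        op = rng.choice(("substitute", "delete", "insert", "swap"))
        pos = rng.randrange(len(chars))
        if op == "substitute":
            chars[pos] = rng.choice(string.ascii_lowercase)
        elif op == "delete" and len(chars) > 1:
            del chars[pos]
        elif op == "insert":
            chars.insert(pos, rng.choice(string.ascii_lowercase))
        elif op == "swap" and pos + 1 < len(chars):
            chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    return "".join(chars)


def training_pairs(name, other_names, n_positive, max_edits, seed=0):
    """Positive pairs (label 1) from perturbations, negatives (label 0) from other names."""
    rng = random.Random(seed)
    positives = [(name, perturb(name, max_edits, rng), 1) for _ in range(n_positive)]
    negatives = [(name, other, 0) for other in other_names]
    return positives + negatives


pairs = training_pairs("Einstein, Albert", ["Bush, George W.", "Von Neumann, John"],
                       n_positive=5, max_edits=2)
for left, right, label in pairs:
    print(label, left, "<->", right)
```

Pairs like these could then be turned into feature vectors (string distances, date differences) for a classifier.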

[Diagram: training and prediction pipeline]
TRAIN
• Features: Names (Shingle Language Model)
• Features: Birth and Death dates
• Features: Names (String distance metrics)
• Learn decision tree classifiers
PREDICT
• Link Records
Shingle Language Model for names
Name: Einstein Albert
Shingle sequence: ein, ins, nst, ste, tei, ein, …, ert
The probability that the sequence (ins, nst, ste) follows ein is very high for the name einstein.
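
The shingle sequence can be reproduced in a few lines. In this sketch the shingles are overlapping three-character windows over the lowercased name with spaces removed (which matches the sequence shown for "Einstein Albert"), and the "language model" is reduced to simple next-shingle counts; both simplifications are assumptions, since the slides do not give the model's exact form.

```python
from collections import Counter, defaultdict


def shingles(name, n=3):
    """Overlapping n-character windows over the name, ignoring spaces and case."""
    text = "".join(name.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def transition_model(names):
    """Count how often each shingle is followed by each other shingle."""
    counts = defaultdict(Counter)
    for name in names:
        sequence = shingles(name)
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return counts


def next_probability(model, current, following):
    """Estimated probability that `following` comes right after `current`."""
    total = sum(model[current].values())
    return model[current][following] / total if total else 0.0


sequence = shingles("Einstein Albert")
model = transition_model(["Einstein"])
print(sequence)
print(next_probability(model, "ein", "ins"))
```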

Shingle Language Model for names
Name 1: Einstein Albert    Name 2: Ainshtain Albert    Name 3: Albert Einstein
[Figure: one graph of three-letter shingles per name, e.g. ein, ins, nst, ste, tei for Einstein and Ain, ins, nsh, sht, hta, tai, ain for Ainshtain, with lbe, ert, rte among the shingles around Albert]

  SAA 2011 - Chicago
Example Decision Tree For Von Neumann

[Figure: decision tree with nodes labeled "Date" and "String Distance".]
Name               TP    FP    FN    TN     TPR      FPR
Albert Einstein    78    11    25    145    75.7%    7%
George W Bush      39     9     6     60    86.6%    13%
Von Neumann       182    14    27    301    75.7%    7%

Corpus Average:    TPR: 72.7%    FPR: 17%
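The rates on this slide follow from the confusion counts by the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). A short sketch that recomputes them from the counts as printed (the function name and dictionary layout are illustrative choices):

```python
def rates(tp: int, fp: int, fn: int, tn: int):
    """Return (true positive rate, false positive rate) from confusion counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr


# Counts copied from the slide, in the order TP, FP, FN, TN.
counts = {
    "Albert Einstein": (78, 11, 25, 145),
    "George W Bush": (39, 9, 6, 60),
    "Von Neumann": (182, 14, 27, 301),
}

for name, c in counts.items():
    tpr, fpr = rates(*c)
    print(f"{name}: TPR {tpr:.1%}, FPR {fpr:.1%}")
```

With these counts the Einstein and Bush figures agree with the slide to within rounding (about 75.7% / 7.1% and 86.7% / 13.0%). The Von Neumann counts give about 87.1% and 4.4%, whereas the slide shows 75.7% and 7% for that name, the same values as the Einstein column, so either the counts or the rates for Von Neumann may have been mis-transcribed on the slide.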
How many did we link?

15,300 records, thresh = 0.85
1,100 records, thresh = 0.9
Conclusions
• There will not be a single merging method but a staged set of approaches, going from the simplest exact matches to (we hope) reliable identification of variant forms of a name when corroborated by contextual information (dates, etc.)
• Once records are merged, they are passed along to Brian for search and display…
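The staged approach described in the conclusions can be sketched roughly as below. This is a hypothetical illustration, not the project's merging code: stage 1 accepts exact matches on a normalized name, and stage 2 accepts close variant forms only when the birth and death years agree. The record layout, the function names, the default threshold, and the use of `difflib` as the similarity measure are all assumptions.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so that direct and
    inverted forms ('Albert Einstein' / 'Einstein, Albert') compare equal."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def same_entity(a: dict, b: dict, fuzzy_threshold: float = 0.85):
    """a, b: dicts with 'name' and optional 'birth' / 'death' years.
    Returns the name of the stage that matched, or None."""
    na, nb = normalize(a["name"]), normalize(b["name"])

    # Stage 1: exact match on the normalized name.
    if na == nb:
        return "exact"

    # Stage 2: variant forms, accepted only when the dates corroborate.
    dates_agree = (
        a.get("birth") is not None
        and a.get("birth") == b.get("birth")
        and a.get("death") == b.get("death")
    )
    if dates_agree and SequenceMatcher(None, na, nb).ratio() >= fuzzy_threshold:
        return "fuzzy+dates"
    return None


print(same_entity({"name": "Einstein, Albert"}, {"name": "Albert Einstein"}))
print(same_entity(
    {"name": "Oppenheimer, J. Robert", "birth": 1904, "death": 1967},
    {"name": "Oppenheimer, Robert", "birth": 1904, "death": 1967},
))
```

Keeping the stages separate makes it possible to report which kind of evidence produced each merge, which is useful when a person later reviews the links.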
Discovering Historic
  Social Networks
       Prototype Historical Resource Demo
       Brian Tingle, California Digital Library
Society of American Archivists 2011 Annual Meeting
                  August 27, 2011
                      Chicago
Meet the target users
Personas are fictional characters created to represent the different user types within a targeted demographic, attitude and/or behavior set that might use a site, brand
or product in a similar way. http://en.wikipedia.org/wiki/Persona_(marketing)
• Randy: Graduate student working on a PhD that involves biographies and the study of diplomatic families and networks. Sometimes he comes to the site looking for information on specific people; other times he is looking for information on a specific subject or event. He also TAs an undergraduate history class and sometimes has to help students find topics for papers.

• Connie: Works at an institution that contributed records to the project. Will be asking how this site would be useful to their users. Wants to understand how their records were used and what the added value is.

• Quincy: Library school student working on QA of record matching.

• Adele: Person doing authority work during collection processing.

• Lenny: Likes linked data and wants to be able to mine the established links programmatically.
Home Page
Facet tabs
Advanced Search
Advanced limits match EAC sections
XTF result
XTF query in the crossQueryResult
doing a search
spellcheck
search results
EAC record view: Identity
EAC record view: alternative forms of name
EAC record view: Biographical History; HTML 5 microdata in chron list
EAC record view: Related Entries
EAC record view: Related Entries; RDFa owl:sameAs
EAC record view: View EAC XML
EAC record view: Graph Demo
Tinkerpop Graph Stack
http://www.tinkerpop.com/
Property Graph Model
graphML
RDF Sail support

[Figure: property graph with a vertex and an edge labeled.]
https://github.com/tinkerpop/gremlin/wiki/Defining-a-Property-Graph
Graph Schema

  vertex
  _id: auto-assigned by neo4j
  _type: vertex
  identity: the name of the entity (string) [indexed]
  urls: newline (\n) separated list of source EAD files
  entityType: 'corporateBody', 'family', or 'person'

  edge
  _id: auto-assigned by neo4j
  _type: edge
  _label: 'correspondedWith' or 'associatedWith'
  _inV: incoming vertex _id (from)
  _outV: outgoing vertex _id (to)
  from_name: from identity (string) denormalized
  to_name: to identity (string) denormalized
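To make the schema concrete, here is a small in-memory sketch of the same vertex and edge properties. It is not the neo4j or Tinkerpop API; the class names, the `from_id` / `to_id` fields (neutral stand-ins for the endpoint ids), and the dictionary-based name index are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: int
    identity: str                               # name of the entity (indexed)
    entity_type: str                            # 'corporateBody', 'family', or 'person'
    urls: list = field(default_factory=list)    # source EAD files


@dataclass
class Edge:
    id: int
    label: str          # 'correspondedWith' or 'associatedWith'
    from_id: int
    to_id: int
    from_name: str      # denormalized copy of the source identity
    to_name: str        # denormalized copy of the target identity


class Graph:
    def __init__(self):
        self.vertices = {}
        self.edges = []
        self.name_index = {}    # plays the role of the index on "identity"

    def add_vertex(self, identity, entity_type, urls=()):
        v = Vertex(len(self.vertices) + 1, identity, entity_type, list(urls))
        self.vertices[v.id] = v
        self.name_index[identity] = v.id
        return v

    def add_edge(self, label, src, dst):
        # Names are copied onto the edge so listing edges needs no vertex lookups.
        e = Edge(len(self.edges) + 1, label, src.id, dst.id, src.identity, dst.identity)
        self.edges.append(e)
        return e

    def both_edges(self, vertex_id):
        """All edges touching a vertex, in either direction."""
        return [e for e in self.edges if vertex_id in (e.from_id, e.to_id)]


g = Graph()
a = g.add_vertex("Oppenheimer, J. Robert, 1904-1967", "person")
b = g.add_vertex("Bethe, Hans Albrecht, 1906-", "person")
g.add_edge("correspondedWith", a, b)
```

The denormalized `from_name` and `to_name` fields illustrate the trade-off in the schema: some redundancy on each edge in exchange for rendering a neighbor list without a second lookup per vertex.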
internal id
indices/name-idx is an index on “identity”; used to look up neo4j record id
“bothE” shows in and out edges: vertices/103994/bothE
redundant data to save repeated lookups
RDF of the social graph
Thanks Ed Summers!
Silvia Mazzini, regesta.exe srl
http://templates.xdams.net/IBC/ontology/eac-cpf.rdf
Front End Stack
• golden grid
  http://code.google.com/p/the-golden-grid/
• form style http://formalize.me/
• jquery and jquery ui
• hoverIntent for advanced search
• google analytics with event tracking
XTF XSLT Framework
• pre filter - do special tokenization to create custom EAC facets
  • https://docs.google.com/document/d/1wP9x6sdOZTagJNQXoyJfPh0Y6UzQgqLwLI86WSlIPbk/edit?hl=en_US
• query parser - CGI params to XTF query XML
• result formatter - XTF results to HTML
• doc formatter - EAC-CPF to HTML
• http://code.google.com/p/xtf-cpf/source/browse/?name=xtf-cpf
Social Graph Visualization
• EAC to graphML
  https://code.google.com/p/eac-graph-load/
• graphML file with open license should be viewable in other tools
• Old demo uses Dracula Graph Library
• New demo uses JavaScript InfoVis Toolkit
• Ed Summers’s “snac hacks” post
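As a rough illustration of the "EAC to graphML" step, the sketch below serializes a handful of vertices and edges as GraphML using only the Python standard library. It is not the eac-graph-load code; the function name, the input shapes, and the choice of `identity` and `label` as data keys are assumptions.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(vertices, edges):
    """vertices: {id: name}; edges: [(source_id, target_id, label)].
    Returns a GraphML document as a string."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare the data keys used on nodes and edges.
    ET.SubElement(root, "key", {"id": "identity", "for": "node",
                                "attr.name": "identity", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "label", "for": "edge",
                                "attr.name": "label", "attr.type": "string"})
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    for vid, name in vertices.items():
        node = ET.SubElement(graph, "node", id=str(vid))
        ET.SubElement(node, "data", key="identity").text = name
    for i, (src, dst, label) in enumerate(edges):
        edge = ET.SubElement(graph, "edge", id=f"e{i}",
                             source=str(src), target=str(dst))
        ET.SubElement(edge, "data", key="label").text = label
    return ET.tostring(root, encoding="unicode")


print(to_graphml({1: "Oppenheimer, J. Robert", 2: "Bethe, Hans Albrecht"},
                 [(1, 2, "correspondedWith")]))
```

Because GraphML is a plain XML interchange format, a file written this way should open in general-purpose graph tools, which is the point the slide makes about viewing the data elsewhere.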
EAD to EAC XSLT


• forthcoming from Virginia
Record Merging


• forthcoming from Berkeley
Demo


• http://socialarchive.iath.virginia.edu/xtf/search

Snac saa-aug-2011-try 3 keynote

  • 1. EAC-CPF and Social Networks
      Society of American Archivists
      Chicago
      August 2011
      Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia

  • 2. SNAC Overview
      • Funding and Timeline
      • Project Team
      • Project Objectives and Rationale
      • Data Contributing Institutions
      • Archival Standards Employed
      • Methods, Processing, and Products
      • Year One Extraction Results
      • Basic Observations on Extraction

  • 3. Funding and Timeline
      • National Endowment for the Humanities
      • A Preservation and Access, Research and Development grant
      • Two-year project
      • May 2010-April 2012

  • 4. Project Team
      • Daniel Pitti (PI) and Worthy Martin (Institute for Advanced Technology in the Humanities, University of Virginia)
      • Adrian Turner and Brian Tingle (California Digital Library, University of California)
      • Ray Larson (School of Information, University of California, Berkeley)

  • 5. Project Objectives
      • Archival finding aids currently intermix description of records with description of the creators of records and persons evident in the records
      • Further the ongoing process of transforming archival description using advanced technologies
      • By facilitating the separation of the description of people from the description of records
      • Using EAC-CPF, an international archival authority control standard
      • Goal: enhance the economy and effectiveness of archival description to enhance access and understanding of users of archives, libraries, and museums

  • 6. Rationale for Separation
      • Authority control of forms of names
      • Flexible description
      • Cooperative authority control
      • Integrated access to cultural heritage
      • Biographical/historical resource
      • Social/historical context (social-professional networks)

  • 7. The Data
      • EAD-encoded finding aids
        – Library of Congress (1,159)
        – Online Archive of California (~15,400)
        – Northwest Digital Archive (5,160)
        – Virginia Heritage (8,390)
      • Authority records
        – Library of Congress: NACO/LCNAF (3.8M personal names; 900K corporate names)
        – Getty Vocabulary Program: Union List of Artist Names (293K personal and corporate names)
        – Virtual International Authority File (5M+ personal names)

  • 8. Methods and Processing
      • Extract EAC-CPF records from existing EAD-encoded archival descriptions
        – Extracting both creators and referenced CPF names
      • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF); merge records for the same entity
        – Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
        – Key challenge: two or more people with the same name; two or more names for the same person
      • Create a prototype historical resource and access system
        – Historical data and social-professional networks
        – Links to archive, library, and museum resources (by and about)

  • 9. EAD Source Data
      • Encoded Archival Description
        – Intermixes description of creators of records and, at the discretion of the archivists, names associated with the content of the records
        – Detailed description of creators of records
      • Widely varying quality
        – In the number of names identified and encoded
        – In the formation of the names (direct or inverted, capitalization, punctuation, and so on)
        – In the categorization of names (personal, corporate, or family)
      • Many names given but not identified as such
      • Most important of these in biographies/histories and in correspondence description
      • Extraction has focused on the “low hanging fruit,” that is, the names tagged as names
      • Attention shifting to names not identified as such

  • 10. Archival Records
      • Records are the by-products of people living and working as individuals, in organized groups, in families
      • Records document people living and working
      • People exist in social-professional contexts, in relation to others
      • Records document these relations
      • All records created by the same entity are described together (a fonds or collection)
        – Creators documented in detail
        – Many of the people documented in the record referenced in description
      • Archival descriptions document interrelations among people and records (documents)

  • 11. Source: J. Robert Oppenheimer Papers (LoC)
      <origination>
        <persname source="lcnaf">Oppenheimer, J. Robert, 1904-1967</persname>
      </origination>
      <controlaccess>
        <persname source="lcnaf" encodinganalog="100" role="creator">Oppenheimer, J. Robert, 1904-1967</persname>
        <persname source="lcnaf" encodinganalog="600" role="subject">Bethe, Hans Albrecht, 1906- --Correspondence</persname>
        <!-- […] -->
        <persname source="lcnaf" encodinganalog="600" role="subject">Born, Max, 1882-1970 --Correspondence</persname>
        <persname source="lcnaf" encodinganalog="600" role="subject">Boyd, Julian P. (Julian Parks), 1903- --Correspondence</persname>
        <persname source="lcnaf" encodinganalog="600" role="subject">Bush, Vannevar, 1890-1974 --Correspondence</persname>
        <persname source="lcnaf" encodinganalog="600" role="subject">Casals, Pablo, 1876-1973 --Correspondence</persname>
        <!-- […] -->
        <corpname source="lcnaf" encodinganalog="610" role="subject">Institute for Advanced Study (Princeton, N.J.)</corpname>
        <corpname source="lcnaf" encodinganalog="610" role="subject">Los Alamos Scientific Laboratory</corpname>
        <!-- […] -->
      </controlaccess>
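The tagged names in an EAD fragment like the Oppenheimer example are the "low hanging fruit" for extraction. A minimal sketch of pulling them out with the Python standard library follows; it is not the project's EAD to EAC-CPF transformation. The wrapping `<ead>` root in the sample, the function name, the output dictionary shape, and the rule of cutting the heading at `--` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample modeled on the slide; the <ead> wrapper is added
# here only so the fragment has a single root element.
SAMPLE = """<ead>
<origination>
  <persname source="lcnaf">Oppenheimer, J. Robert, 1904-1967</persname>
</origination>
<controlaccess>
  <persname source="lcnaf" encodinganalog="100" role="creator">Oppenheimer, J. Robert, 1904-1967</persname>
  <persname source="lcnaf" encodinganalog="600" role="subject">Born, Max, 1882-1970 --Correspondence</persname>
  <corpname source="lcnaf" encodinganalog="610" role="subject">Los Alamos Scientific Laboratory</corpname>
</controlaccess>
</ead>"""

# EAD name elements mapped to EAC-CPF entity types.
TAG_TO_TYPE = {"persname": "person", "corpname": "corporateBody", "famname": "family"}


def extract_names(ead_xml: str):
    """Return one dict per tagged name, in document order."""
    root = ET.fromstring(ead_xml)
    found = []
    for elem in root.iter():
        if elem.tag in TAG_TO_TYPE:
            text = " ".join((elem.text or "").split())
            # Drop a trailing subject subdivision such as "--Correspondence".
            name = text.split("--")[0].strip()
            found.append({
                "name": name,
                "type": TAG_TO_TYPE[elem.tag],
                "role": elem.get("role"),
                "corresponded": "--Correspondence" in text,
            })
    return found


for entry in extract_names(SAMPLE):
    print(entry)
```

The `corresponded` flag hints at how a "--Correspondence" subdivision could separate correspondence relations from more general associations, though that mapping is a guess here rather than something the slides state.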

  • 12. Source: Leonard Bernstein Collection (LoC)
      <c02>
        <did>
          <container type="box">1</container>
          <unittitle>Aaltonen, Erkki <unitdate era="ce" calendar="gregorian">1981</unitdate></unittitle>
          <physdesc>
            <extent>1</extent>
          </physdesc>
        </did>
      </c02>
      <c02>
        <did>
          <unittitle>Abbado, Claudio <unitdate era="ce" calendar="gregorian">1963-90</unitdate></unittitle>
          <physdesc>
            <extent>5</extent>
          </physdesc>
        </did>
      </c02>
      […]

  • 13.
      <bioghist>
        <head>Biographical Sketch</head>
        <p>José Marcos Mugarrieta, prior to his term as Mexican consul in San Francisco 1857-1863, served in the Mexican army from 1837. He saw action in numerous battles and campaigns – Jamaica, under General Canalizo in 1841; Campeche, 1842-1843; Merida, 1843; Veracruz, 1845; Mexico City, 1846; Angostura and Cerro-gordo, 1847; Guanajuato, 1848, and Sierra-Gorda under Bustamante, 1848-1849; and Matamoros, 1849-1850. […]</p>
        <p>In April 1857 Mugarrieta received an appointment from the Comonfort government for the consulship in San Francisco. He did not actually begin his new duties until September 1, 1859, due to illness and to the political situation in Mexico. […]</p>
      </bioghist>

  • 14.
      <bioghist>
        <head>Chronology</head>
        <chronlist>
          <chronitem>
            <date>1900</date>
            <event>Born on Jan. 20 in Hastings, Minnesota.</event>
          </chronitem>
          <chronitem>
            <date>1922</date>
            <event>Received baccalaureate from Princeton University, major in philosophy.</event>
          </chronitem>
          […]
          <chronitem>
            <date>1965</date>
            <event>Died on April 4.</event>
          </chronitem>
        </chronlist>
      </bioghist>

  • 15. EAC-CPF
      • Encoded Archival Context-Corporate bodies, Persons, Families
      • An international communication standard for archival authority control
      • Based on International Council on Archives, International Standard Archival Authority Records-Corporate bodies, persons, families (ISAAR(CPF))
      • SAA Standards Committee, Technical Subcommittee on Encoded Archival Context
      • Co-chairs
        – Katherine Wisser, Simmons College
        – Anila Angjeli, Bibliothèque nationale de France

  • 16. Library and Archive Authority Control
      • Library (or bibliographic) authority control is almost exclusively about the control of names
      • Archival authority control involves biographical-historical description of the CPF entity
        – Descriptions based on controlled vocabularies or values, for example, occupations, place of birth and death
        – But also biographical-historical description
          • Prose
          • Chronological list
      • Archival authority control provides context for understanding records, the context of their creation, the provenance

  • 17. <identity>
  <entityType>person</entityType>
  <nameEntry scriptCode="Latn" xml:lang="eng">
    <part>Oppenheimer, J. Robert, 1904-1967.</part>
    <authorizedForm>AACR2</authorizedForm>
  </nameEntry>
  <nameEntry localType="VIAF:MainHeading">
    <part>Oppenheimer, J. Robert (Julius Robert), 1904-1967</part>
    <alternativeForm>VIAF</alternativeForm>
  </nameEntry>
  <nameEntry localType="VIAF:MainHeading">
    <part>Oppenheimer, Julius Robert, 1904-1967</part>
    <alternativeForm>VIAF</alternativeForm>
  </nameEntry>
  <nameEntry localType="VIAF:x400">
    <part>Oppenheimer, Robert</part>
    <alternativeForm>VIAF</alternativeForm>
  </nameEntry>
  <nameEntry localType="VIAF:x400">
    <part>Ou-pẽn-hai-mo, 1904-1967</part>
    <alternativeForm>VIAF</alternativeForm>
  </nameEntry>
</identity>
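The name entries on this slide can be read programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; it assumes the fragment is un-namespaced, as printed on the slide (a full EAC-CPF record would normally declare a namespace), and the function name `split_name_forms` is purely illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <identity> fragment from the slide (two of the five entries).
IDENTITY_XML = """
<identity>
  <entityType>person</entityType>
  <nameEntry scriptCode="Latn" xml:lang="eng">
    <part>Oppenheimer, J. Robert, 1904-1967.</part>
    <authorizedForm>AACR2</authorizedForm>
  </nameEntry>
  <nameEntry localType="VIAF:x400">
    <part>Oppenheimer, Robert</part>
    <alternativeForm>VIAF</alternativeForm>
  </nameEntry>
</identity>
"""

def split_name_forms(identity_xml):
    """Return (authorized, alternative) name lists from an <identity> fragment."""
    root = ET.fromstring(identity_xml)
    authorized, alternative = [], []
    for entry in root.findall("nameEntry"):
        name = entry.findtext("part", default="").strip()
        # An entry carrying <authorizedForm> is treated as the authorized heading.
        if entry.find("authorizedForm") is not None:
            authorized.append(name)
        else:
            alternative.append(name)
    return authorized, alternative

authorized, alternative = split_name_forms(IDENTITY_XML)
print(authorized, alternative)
```

Keeping the authorized and alternative forms apart matters later in the deck, where matching against authority files searches both kinds of heading.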

  • 18. <existDates>
  <dateRange>
    <fromDate standardDate="1904-04-22">1904, Apr. 22</fromDate>
    <toDate standardDate="1967-02-18">1967, Feb. 18</toDate>
  </dateRange>
</existDates>
<!-- ... -->
<localDescription localType="subject">
  <term>Science--Societies, etc.</term>
</localDescription>
<localDescription localType="VIAF:nationality">
  <placeEntry countryCode="US"/>
</localDescription>
<localDescription localType="VIAF:gender">
  <term>Male</term>
</localDescription>
<languageUsed>
  <language languageCode="eng"/>
</languageUsed>
<occupation>
  <term>Physicists.</term>
</occupation>
<!-- ... -->

  • 19. <chronList>
  <chronItem>
    <date>1904, Apr. 22</date>
    <placeEntry>New York, N.Y.</placeEntry>
    <event>Born, New York, N.Y.</event>
  </chronItem>
  <!-- ... -->
  <chronItem>
    <date>1943-1945</date>
    <placeEntry>Los Alamos, N. Mex.</placeEntry>
    <event>Director, Los Alamos Scientific Laboratory, Los Alamos, N. Mex.</event>
  </chronItem>
  <!-- ... -->
  <chronItem>
    <date>1954</date>
    <event>(1) Denied security clearance […] (2) Published Science and the Common Understanding […]</event>
  </chronItem>
  <!-- ... -->
  <chronItem>
    <date>1967, Feb. 18</date>
    <placeEntry>Princeton, N.J.</placeEntry>
    <event>Died, Princeton, N.J.</event>
  </chronItem>
</chronList>

  • 20. <cpfRelation xmlns:xlink="http://www.w3.org/1999/xlink"
    xlink:type="simple"
    xlink:role="http://RDVocab.info/uri/schema/FRBRentitiesRDA/Person"
    xlink:arcrole="correspondedWith">
  <relationEntry>Bush, Vannevar, 1890-1974.</relationEntry>
  <descriptiveNote>
    <p>recordId: DLC.ms998007.r007</p>
  </descriptiveNote>
</cpfRelation>
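A relation like this one is what ultimately becomes an edge in a social network. As a rough sketch (not the project's actual loader), the XLink `arcrole` and the related name can be pulled out with the standard library; the helper name `relation_to_edge` and the tuple layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# The <cpfRelation> fragment from the slide.
CPF_RELATION_XML = """
<cpfRelation xmlns:xlink="http://www.w3.org/1999/xlink"
    xlink:type="simple"
    xlink:role="http://RDVocab.info/uri/schema/FRBRentitiesRDA/Person"
    xlink:arcrole="correspondedWith">
  <relationEntry>Bush, Vannevar, 1890-1974.</relationEntry>
  <descriptiveNote>
    <p>recordId: DLC.ms998007.r007</p>
  </descriptiveNote>
</cpfRelation>
"""

def relation_to_edge(source_name, relation_xml):
    """Turn one <cpfRelation> into a (source, relation type, target) triple."""
    rel = ET.fromstring(relation_xml)
    # Namespaced attributes are addressed as "{namespace-uri}localname".
    arcrole = rel.get(f"{{{XLINK}}}arcrole")
    target = rel.findtext("relationEntry", default="").strip()
    return (source_name, arcrole, target)

edge = relation_to_edge("Oppenheimer, J. Robert, 1904-1967.", CPF_RELATION_XML)
print(edge)
```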

  • 21. <resourceRelation xmlns:xlink="http://www.w3.org/1999/xlink" xlink:arcrole="creatorOf"
    xlink:role="archivalRecords" xlink:type="simple"
    xlink:href="http://hdl.loc.gov/loc.mss/eadmss.ms998007">
  <relationEntry>J. Robert Oppenheimer Papers, 1799-1980 (bulk 1947-1967)</relationEntry>
  <objectXMLWrap>
    <did xmlns="urn:isbn:1-931666-22-9">
      <unittitle>Papers <unitdate normal="1799/1980" era="ce" calendar="gregorian">1799-1980</unitdate><unitdate label="Bulk Dates" type="bulk" normal="1947/1967" era="ce" calendar="gregorian">(bulk 1947-1967)</unitdate></unittitle>
      <unitid countrycode="US" repositorycode="US-DLC">MSS35188</unitid>
      <origination label="Creator">
        <persname>Oppenheimer, J. Robert, 1904-1967</persname>
      </origination>
      <!-- ... -->
      <repository><corpname>Manuscript Division. Library of Congress</corpname>
      </repository>
      <abstract>Physicist and director of the Institute for Advanced Study, Princeton, New Jersey. [...] Topics include theoretical physics, development of the atomic bomb, the relationship between government and science, nuclear energy, security, and national loyalty.</abstract>
    </did>
  </objectXMLWrap>
</resourceRelation>

  • 22. Year One Results-Extraction • EAC-CPF records extracted – LoC: 43,702 from 1,159 finding aids – OAC: 91,811 from ~15,400 – NWDA: 22,609 from 5,160 – VH: 15,175 from 8,390 – Total 173,297 – Note: a more recent extraction yielded 196,218, but we have not had time to analyze the results

  • 23. Early Observations-Extraction • Depth of analysis and quality of description of CPF entities vary widely in EAD-encoded finding aids – LoC: a lot of names under authority control – OAC and NWDA have fewer names and control varies • To be fair, the finding aids were created without SNAC processing in mind!

  • 24. Next on Extraction • Refine extraction processing, incorporating some NLP-like processing, for example – Verifying type of name: C or P or F – Massaging poorly formed names into better formed names – Identifying names in strings that are names-plus (but name not identified as such) – Providing context information to enhance matching, for example, date or dates of correspondence, or occupation of creator of records

  • 25. Beyond the Project • Building a National Archival Authorities Infrastructure – IMLS funded two-year project, October 2011-September 2013 – EAC-CPF SAA workshops: 140 scholarships – National Archival Authorities Cooperative planning • SNAC II: a proposal to expand SNAC – A lot more data – NARA, SI, MARC WorldCat records, a lot more finding aids

  • 26. For More Information • http://socialarchive.iath.virginia.edu/ (Project website) • http://socialarchive.iath.virginia.edu/xtf/search (public prototype)

  • 27. Social Networks and Archival Context Project: Matching and Merging EAC-CPF Records Ray R. Larson Krishna Janakiraman University of California, Berkeley School of Information Thanks to Daniel V. Pitti of the Institute for Advanced Technology in the Humanities, University of Virginia, for many of the slides here SAA 2011 - Chicago 2011-08-27 - SLIDE
  • 28. SNAC Project • The outlines of the project have been discussed by Daniel Pitti previously • The primary focus of the Berkeley group for the project is on combining data resources from multiple archives and other information sources • In this talk I will focus on our current methods used in the prototype (to be described by Brian Tingle later)
  • 29. Data Contributing Institutions • EAD-encoded finding aids – Library of Congress (1159) – Online Archive of California (15,400+) – Northwest Digital Archive (5,563+) – Virginia Heritage (8,390+) • Authority records – Library of Congress: NACO/LCNAF (3.8M personal names; 900K corporate names) – Getty Vocabulary Program: Union List of Artist Names (293K personal and corporate names) – Virtual International Authority File (intersection with NACO/LCNAF, 5M personal names) • Other biographical sources (e.g., DBPedia, IMDB)
  • 30. Methods and Processing • Extract EAC-CPF records from existing EAD-encoded archival descriptions – Extracting both creators and referenced CPF names • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF) – Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN) • Create a prototype historical resource and access system – Historical data and social-professional networks – Links to archive, library, and museum resources (by and about)
  • 31. Merging EAC-CPF Records [Diagram: LCNAF Repository and ULAN Repository; Cheshire Search; steps: connect exactly matching records, connect records using name authority, merge information]
  • 32. Authority Control • Identifying creator entities and referenced entities (correspondents, etc.) • Recording name or names used by and for them • Rule-based heading or entry formation and control
  • 33. Controlled Vocabularies • Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information • That is, it is an attempt to provide a consistent set of descriptions for use in (or as) metadata
  • 34. The Problem • Proliferation of the forms of names –Different names for the same person –Different people with the same names • Examples –from Books in Print (semi-controlled but not consistent) –ERIC author index (not controlled)
  • 35. Goethe …etc…
  • 36. John Muir
  • 37. Pauline Cochrane nee Atherton
  • 38. Pauline Cochrane nee Atherton
  • 39. Name Authority Files [Callout: Different names for the same person]
  ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242 KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80 RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 08-21-91 Other Versions: earlier
  040 DLC$cDLC$dDLC$dOCoLC
  053 PR6005.R517
  100 10 Creasey, John
  400 10 Cooke, M. E.
  400 10 Cooke, Margaret,$d1908-1973
  400 10 Cooper, Henry St. John,$d1908-1973
  400 00 Credo,$d1908-1973
  400 10 Fecamps, Elise
  400 10 Gill, Patrick,$d1908-1973
  400 10 Hope, Brian,$d1908-1973
  400 10 Hughes, Colin,$d1908-1973
  400 10 Marsden, James
  400 10 Matheson, Rodney
  400 10 Ranger, Ken
  400 20 St. John, Henry,$d1908-1973
  400 10 Wilde, Jimmy
  500 10 $wnnnc$aAshe, Gordon,$d1908-1973
  • 40. Merging EAC-CPF Records [Diagram: Cheshire Search; steps: connect exactly matching records, connect records using name authority, merge information]
  • 41. Connect Exact Matches • The EAC-CPF records provide the names without having to parse texts, etc. • Allows us to use some simple methods like exact matching –Assume identical name entries mean the same person/corporate body/family –Enter the full names and record IDs into a database and flag IDs with same names for merging
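The exact-matching step described here amounts to grouping record IDs by their full name string. A minimal sketch follows; an in-memory dictionary stands in for the database mentioned on the slide, and the record IDs are hypothetical.

```python
from collections import defaultdict

def flag_exact_matches(records):
    """records: iterable of (record_id, name) pairs.

    Returns {name: [record ids]} for every name shared by two or more records.
    """
    ids_by_name = defaultdict(list)
    for record_id, name in records:
        ids_by_name[name].append(record_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}

# Hypothetical record ids; the names are taken from the failure examples in this talk.
records = [
    ("loc-001", "Tabb, John Banister, 1845-1909."),
    ("oac-042", "Tabb, John Banister, 1845-1909."),
    ("nwda-007", "Tabb, John B. (John Banister), 1845-1909."),
]
flagged = flag_exact_matches(records)
print(flagged)
```

Only the two byte-identical headings are flagged; the third form of the same name is missed, which is the weakness the later slides discuss.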
  • 42. Merging EAC-CPF Records [Diagram: Cheshire Search; steps: connect exactly matching records, connect records using name authority, merge information]
  • 43. Search Authority Files • For each name, formulate a search of the VIAF database using the Cheshire system (SGML/XML retrieval system with probabilistic and Boolean matching) –Search both the “authoritative” and “non-authoritative” forms –Consider any name matching a non-authoritative form to be a candidate match for the authoritative form –Flag EAC records that match the same authority record as potential matches
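The idea of flagging records that resolve to the same authority record can be sketched as follows. This is a deliberate simplification: a plain dictionary lookup replaces the Cheshire probabilistic and Boolean search against VIAF, the authority entry reuses a few headings from the Creasey example earlier in the talk, and the EAC record IDs are hypothetical.

```python
from collections import defaultdict

# One authority record with its authorized heading and a few variant headings.
AUTHORITY = [
    {
        "id": "NAFL8057230",
        "authorized": "Creasey, John",
        "variants": ["Cooke, M. E.", "Marsden, James", "Wilde, Jimmy"],
    },
]

def build_index(authority_records):
    """Map every authorized and variant heading to its authority record id."""
    index = {}
    for rec in authority_records:
        for form in [rec["authorized"], *rec["variants"]]:
            index[form] = rec["id"]
    return index

def flag_by_authority(eac_records, index):
    """Group EAC record ids by the authority record their name resolves to."""
    hits = defaultdict(list)
    for eac_id, name in eac_records:
        auth_id = index.get(name)
        if auth_id is not None:
            hits[auth_id].append(eac_id)
    return dict(hits)

eac_records = [
    ("eac-1", "Creasey, John"),
    ("eac-2", "Marsden, James"),
    ("eac-3", "Muir, John"),
]
print(flag_by_authority(eac_records, build_index(AUTHORITY)))
```

Two EAC records with different name strings end up under the same authority ID, so they become candidates for merging even though exact matching alone would not connect them.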
  • 44. Merging EAC-CPF Records [Diagram: Cheshire Search; steps: connect exactly matching records, connect records using name authority, merge information]
  • 45. Merge Flagged Records • For all of the exact matches and authority matches –Use the Authoritative form of the name –Combine data from each match into a single EAC-CPF record –Retain all source record IDs and information • Finally, output the merged EAC-CPF records
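The merge step can be illustrated with plain dictionaries. This is a sketch under assumed field names (`id`, `name`, `relations`), not the project's record format; the second relation in the sample data is illustrative only.

```python
def merge_records(authorized_name, records):
    """Combine flagged records into one, keeping every source id."""
    merged = {
        "name": authorized_name,      # authoritative form of the name
        "alternative_names": [],      # other forms seen in the sources
        "source_ids": [],             # all contributing record ids are retained
        "relations": [],
    }
    for rec in records:
        merged["source_ids"].append(rec["id"])
        if rec["name"] != authorized_name and rec["name"] not in merged["alternative_names"]:
            merged["alternative_names"].append(rec["name"])
        for rel in rec.get("relations", []):
            if rel not in merged["relations"]:
                merged["relations"].append(rel)
    return merged

flagged = [
    {"id": "rec-a", "name": "Oppenheimer, J. Robert, 1904-1967.",
     "relations": [("correspondedWith", "Bush, Vannevar, 1890-1974.")]},
    {"id": "rec-b", "name": "Oppenheimer, Julius Robert, 1904-1967",
     "relations": [("correspondedWith", "Bush, Vannevar, 1890-1974."),
                   ("associatedWith", "Los Alamos Scientific Laboratory")]},
]
merged = merge_records("Oppenheimer, J. Robert, 1904-1967.", flagged)
print(merged)
```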
  • 46. Inputs to SNAC merging • LoC: 43,702 EAC-CPF records derived from 1159 finding aids • OAC: 91,811 EAC-CPF records derived from ~15,400 finding aids • NWDA: 22,609 EAC-CPF records derived from 5,568 finding aids • Result: 123,920 “unique” names
  • 47. Another view of the numbers… • 93033 Person names merged from 114639 Person records • 30161 Institutions merged from 41177 Institution records • 1669 Families merged from 2263 Family records
  • 48. But… • Exact merging assumes that archives are following LC cataloging practice in their EAD records –There are some problems with this assumption
  • 49. Some failures for merging… • Different abbreviations: – A. & G. Carisch & C. – A. & G. Carisch & Co. • And spacing issues: – A. C. Peters & Bro. – A. C. Peters & Brother. – A. C. Peters. (??) – A. C.Peters & Bro. • Completeness and alternate rules – Tabb, John B. (John Banister), 1845-1909. – Tabb, John Banister, 1845-1909.
  • 50. More… • Variant romanizations (and spacing): –M. P. Belaieff. –M. P. Belaïeff. –M. P. Bieliaev. –M.P. Belaïeff. –M.P.Belaïeff. • Initials vs. names: –Zabolotskii, N.A. –Zabolotskii, Nikolai Alekseevich, 1903-1958. –Zabolotskii.
  • 51. More… • Inverted order vs. uninverted –Taylor, Zachary, 1784-1850. –Zachary Taylor. • Various combinations: –Tchaikovsky, Peter I. –Tchaikovsky, Pëtr Il. –Tchaikovsky, Piotr Ilyich. –Tchaikovsky, Pyotr Il. –Tchaikovsky, Pyotr Ilyich.
  • 52. Another kind of failure • Entry for “Zaphiropoulos” - no dates, no first name: – The entry from VIAF was for “Zaphiropoulos, Lela, 1941-” – But the name in EAD came as an attribution for photos: – Box 113 – Lot PP13 Zaphiropoulos. [Bas-relief at Troy], 1872. – Physical Description: 2 photographs – Scope and Content Note – Photographs taken for Schliemann. • Not sure that the Zaphiropoulos indicated is a person, and definitely not one born in 1941.
  • 53. Addressing the failures • First we need to know where things are not working, and why – We are planning to do a random sample and detailed evaluation of the database to help identify the problems • Many of the problems we have seen already appear to be solvable using: – Additional contextual clues from the EAD records – More sophisticated matching for phonetic variants • Such as n-grams or phonetic schemes like phonex – Additional normalization of names before merging • For name order, etc. – Use of advanced matching methods
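As an illustration of "additional normalization of names before merging", the sketch below folds diacritics, repairs missing spaces after periods, drops the trailing period, and lowercases. The specific rules are assumptions chosen to cover the spacing and diacritic examples from the failure slides; they are not the project's normalization.

```python
import re
import unicodedata

def normalize_name(name):
    """Return a comparison key that ignores diacritics, spacing after periods,
    a trailing period, and letter case."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks so that "ï" compares equal to "i".
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Insert a space after any period that is directly followed by a character.
    spaced = re.sub(r"\.(?=\S)", ". ", folded)
    collapsed = re.sub(r"\s+", " ", spaced).strip()
    return collapsed.rstrip(".").lower()

variants = ["M. P. Belaieff.", "M. P. Belaïeff.", "M.P. Belaïeff.", "M.P.Belaïeff."]
print({normalize_name(v) for v in variants})
print(normalize_name("M. P. Bieliaev."))
```

The four spacing and diacritic variants collapse to a single key, while the variant romanization "Bieliaev" still does not, which is where n-gram or phonetic matching would have to take over.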
  • 54. Testing new merging methods • Work done in conjunction with SNAC for an I School Master's project called Biograph –Krishna Janakiraman and Sean Marimpietri • Using SNAC and merging with FreeBase and IMDB
  • 55. Einstein, Albert, 1879-1955. • Einstein, Albert. • Ainshutain, A. 1879-1955 • Aiyinsitan 1879-1955 • Einstein, A. • Albert Einstein • Albert Einstein (Krishna Janakiraman and Sean Marimpietri - Biograph)
  • 56. Our approach • Learn binary classifiers over varying names and existence dates • Perturb existing information to generate additional samples within specific error levels
  • 57. [Diagram: TRAIN: features from names (string distance, shingle language model metrics) and from birth and death dates; learn decision tree classifiers. PREDICT: link records]
  • 58. Shingle Language Model for names • Name: Einstein Albert • Shingle sequence: ein, ins, nst, ste, tei, ein … , ert • Probability that the sequence (ins, nst, ste) follows ein is very high for the name einstein
  • 59. Shingle Language Model for names • Name 1: Einstein Albert • Name 2: Ainshtain Albert • Name 3: Albert Einstein [Figure: three-character shingle sequences for the three names]
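The shingle sequences on these slides are overlapping three-character substrings of a name. The sketch below generates them and compares names by Jaccard overlap of their shingle sets. Note the assumptions: shingles are taken within each word (no shingle spans a word boundary), input is free of punctuation, and Jaccard overlap is a simple stand-in, not the shingle language model used in Biograph.

```python
def shingles(name, n=3):
    """Overlapping n-character shingles, computed word by word."""
    out = []
    for word in name.lower().split():
        out.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return out

def jaccard(a, b):
    """Share of distinct shingles the two names have in common."""
    sa, sb = set(shingles(a)), set(shingles(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

print(shingles("Einstein Albert"))
print(jaccard("Einstein Albert", "Albert Einstein"))
print(jaccard("Einstein Albert", "Ainshtain Albert"))
```

Because the comparison works on sets, inverted and uninverted name order score as identical, while the variant romanization still shares the shingles of "Albert" plus "ins".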
  • 60. Example Decision Tree For Von Neumann [Figure: decision tree with Date and String Distance nodes]
  • 61. Albert Einstein: TP:78 FP:11 FN:25 TN:145, TPR: 75.7%, FPR: 7% • George W Bush: TP:39 FP:9 FN:6 TN:60, TPR: 86.6%, FPR: 13% • Von Neumann: TP:182 FP:14 FN:27 TN:301, TPR: 75.7%, FPR: 7% • Corpus Average: TPR: 72.7%, FPR: 17%
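The rates on this slide follow from the confusion counts: TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The sketch below recomputes them, assuming the counts belong to the three names in the order they are listed. Under that reading the Einstein and Bush figures are reproduced, while the Von Neumann counts work out to roughly 87% TPR and 4% FPR instead of the printed 75.7% and 7%, so the printed values for that column may have been carried over from the Einstein column.

```python
def rates(tp, fp, fn, tn):
    """Return (true positive rate, false positive rate) from confusion counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr

# Counts as printed on the slide, assigned to names in listing order (an assumption).
counts = {
    "Albert Einstein": dict(tp=78, fp=11, fn=25, tn=145),
    "George W Bush": dict(tp=39, fp=9, fn=6, tn=60),
    "Von Neumann": dict(tp=182, fp=14, fn=27, tn=301),
}

for name, c in counts.items():
    tpr, fpr = rates(**c)
    print(f"{name}: TPR {tpr:.1%}, FPR {fpr:.1%}")
```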
  • 62. How many did we link? • 15,300 records, thresh = 0.85 • 1100 records, thresh = 0.9
  • 63. Conclusions • There will not be a single merging method, but a staged set of approaches that will allow us to go from the simplest exact matches, to (we hope) reliably identifying various variant forms of a name, etc. when corroborated by contextual (date, etc.) information • Once records are merged, they are passed along to Brian for search and display…
  • 64. Discovering Historic Social Networks Prototype Historical Resource Demo Brian Tingle, California Digital Library Society of American Archivists 2011 Annual Meeting August 27, 2011 Chicago
  • 70. Meet the target users • Personas are fictional characters created to represent the different user types within a targeted demographic, attitude and/or behavior set that might use a site, brand or product in a similar way. http://en.wikipedia.org/wiki/Persona_(marketing) • Randy: Graduate student working on a PhD that involves biographies and the study of diplomatic families and networks. Sometimes he comes to the site looking for information on specific people; other times he is looking for information on a specific subject or event. He also TAs an undergraduate history class and sometimes has to help students find topics for papers. • Connie: Works at an institution that contributed records to the project. Will be asking how this site would be useful to the institution's users. Wants to understand how the institution's records were used and what the added value is. • Quincy: Library school student working to QA record matching. • Adele: Person doing authority work during collection processing. • Lenny: Lenny likes linked data, and wants to be able to mine the links that have been established programmatically.
  • 75. Advanced limits match EAC sections
  • 77. XTF query in the crossQueryResult
  • 82. EAC record view Identity
  • 83. EAC record view alternative forms of name
  • 85. HTML 5 microdata in chron list
  • 86. EAC record view Related Entries
  • 87. EAC record view Related Entries
  • 89. EAC record view View EAC XML
  • 90. EAC record view Graph Demo
  • 91.
  • 92. Tinkerpop Graph Stack http://www.tinkerpop.com/ • Property Graph Model • graphML • RDF Sail support
  • 93. vertex edge https://github.com/tinkerpop/gremlin/wiki/Defining-a-Property-Graph
  • 94. Graph Schema
  vertex
    _id: auto-assigned by neo4j
    _type: vertex
    identity: the name of the entity (string) [indexed]
    urls: \n separated list of source EAD files
    entityType: 'corporateBody', 'family', or 'person'
  edge
    _id: auto-assigned by neo4j
    _type: edge
    _label: 'correspondedWith' or 'associatedWith'
    _inV: incoming vertex _id (from)
    _outV: outgoing vertex _id (to)
    from_name: from identity (string) denormalized
    to_name: to identity (string) denormalized
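To make the schema concrete, here is a small sketch that builds one vertex pair and one edge as plain dictionaries with the property names listed on the slide (the edge label key is written `_label`). It is an illustration only: a local counter stands in for the ids that neo4j would assign, and `make_vertex` and `make_edge` are hypothetical helpers, not part of any graph library.

```python
from itertools import count

_ids = count(1)  # stand-in for ids that neo4j would auto-assign

def make_vertex(identity, entity_type, urls):
    """Build a vertex dict; source EAD urls are joined with newlines."""
    return {
        "_id": next(_ids),
        "_type": "vertex",
        "identity": identity,
        "entityType": entity_type,
        "urls": "\n".join(urls),
    }

def make_edge(label, from_vertex, to_vertex):
    """Build an edge dict; names are copied in (denormalized) to save lookups."""
    return {
        "_id": next(_ids),
        "_type": "edge",
        "_label": label,
        "_inV": from_vertex["_id"],    # "from" vertex, as described on the slide
        "_outV": to_vertex["_id"],     # "to" vertex, as described on the slide
        "from_name": from_vertex["identity"],
        "to_name": to_vertex["identity"],
    }

oppenheimer = make_vertex("Oppenheimer, J. Robert, 1904-1967.", "person",
                          ["http://hdl.loc.gov/loc.mss/eadmss.ms998007"])
bush = make_vertex("Bush, Vannevar, 1890-1974.", "person", [])
edge = make_edge("correspondedWith", oppenheimer, bush)
print(edge)
```

Copying `from_name` and `to_name` onto the edge trades some redundancy for the ability to display a relation without fetching both vertices.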
  • 95. internal id indices/name-idx is an index on “identity”; used to look up neo4j record id
  • 96. “bothE” shows in and out edges vertices/103994/bothE redundant data to save repeated lookups
  • 97.
  • 98.
  • 99. RDF of the social graph Thanks Ed Summers!
  • 100.
  • 101. Silvia Mazzini regesta.exe srl http://templates.xdams.net/IBC/ontology/eac-cpf.rdf
  • 102. Front End Stack • golden grid http://code.google.com/p/the-golden-grid/ • form style http://formalize.me/ • jquery and jquery ui • hoverIntent for advanced search • google analytics with event tracking
  • 103. XTF XSLT Framework • pre filter - do special tokenization to create custom EAC facets • https://docs.google.com/document/d/1wP9x6sdOZTagJNQXoyJfPh0Y6UzQgqLwLI86WSlIPbk/edit?hl=en_US • query parser - CGI params to XTF query XML • result formatter - XTF results to HTML • doc formatter - EAC-CPF to HTML • http://code.google.com/p/xtf-cpf/source/browse/?name=xtf-cpf
  • 104. social graph visualization • EAC to graphML https://code.google.com/p/eac-graph-load/ • graphML file with open license should be viewable in other tools • old demo uses Dracula Graph Library • New demo uses JavaScript InfoVis Toolkit • Ed Summers’s “snac hacks” post
  • 105. EAD to EAC XSLT • forthcoming from Virginia

Editor's notes

  6. Flexible description: series description; dispersed collections
     Cooperative authority control: dispersed collections; but also creator of one collection is referenced in a collection created by someone else (co-referencing); economic and descriptive benefits
     Integrated access to cultural heritage: context for archival records, essential, but the descriptions can also provide context for all types of resources
     Archival authority records, like museum authority records, provide historical and biographical data that can enhance identification and understanding; (biographical dictionary; administrative histories)
  23. Remember that we will solicit public evaluation and suggestions on drafts of the public interface, starting in the fall.