Cultural Heritage Institutions
and Big Data Collections
Leslie Johnston
Chief of Repository Development
Library of Congress
Cultural Heritage organizations
have, until recently, spoken of
“collections” and “content” and
“records” and even “files.”
Now it’s also data.
Data is not just generated by satellites,
identified during experiments, or collected
during surveys.
Datasets are not just scientific and business
tables and spreadsheets.
We have Big Data in our Libraries, Archives
and Museums.
Like other cultural heritage
organizations, the Library of
Congress has as one of its
mandates that it make its
collections freely available,
whether that is in person or on
the web.
What are some Library of
Congress examples of
collecting and preserving large
scale collections in many
formats, and making them
usable as collections and as
data?
National Digital
Newspaper Program
chroniclingamerica.loc.gov/
This collection was transformative for the Library of Congress:
it was the first to be made available as a bulk download
and exposed as a text and image dataset.
Some researchers want to search for stories in historic
newspapers. Some researchers want to mine newspaper OCR
for trends across time periods and geographic areas.
Requests have come in to analyze the full collection.
The program has:
• Multiple producers (36 now, ultimately 54)
• Free and open public access
• APIs for machine access and automated processes,
including access to RDF linked data.
Over 6.7 million newspaper pages ingested to date
Over 250 TB of data
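As an illustration of what mining newspaper OCR for trends across time periods can look like, here is a minimal sketch. The record layout (`date`, `ocr_text`), the sample pages, and the function name are assumptions made up for this example; they are not the program's actual schema or API.

```python
import re
from collections import Counter

# Hypothetical page records; field names and text are invented for illustration.
SAMPLE_PAGES = [
    {"date": "1898-02-16", "ocr_text": "The new railroad reaches the county seat."},
    {"date": "1898-04-22", "ocr_text": "Railroad fares rise; the railroad board meets."},
    {"date": "1912-04-16", "ocr_text": "Harbor traffic grows as steamers arrive."},
]


def term_counts_by_year(pages, term):
    """Count whole-word, case-insensitive occurrences of `term` per year."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for page in pages:
        year = int(page["date"][:4])  # assumes ISO-style YYYY-MM-DD dates
        counts[year] += len(pattern.findall(page["ocr_text"]))
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    print(term_counts_by_year(SAMPLE_PAGES, "railroad"))
```

Real OCR text is noisy, so a production analysis would likely need fuzzy matching or cleanup before counting; this sketch assumes clean text.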
Web Archives
http://www.loc.gov/webarchiving/
lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
The Library has been archiving the web since 2000. Subject
area specialists curate the collections, and Library catalogers
create collection-level metadata records.
The collections include:
• U.S. elections
• Web sites created by members of the House and Senate
• Thematic collections around events, such as elections in
the Philippines, the Iraq war, and the appointment of
Supreme Court Justices.
• Collections around an area of study, such as Legal “Blawgs”
We frequently receive requests for access to full collections for
full-text data mining.
Every format possible on the web
Almost 8 billion files
Over 425 TB
congress.gov
Congress.gov is still in its beta phase,
transforming congressional
information discovery.
It includes legislation from 1993 to the present,
the Congressional Record from
1995 to the present, Committee
Reports from 1995 to the present,
and Member profiles from 1973 to
the present (with some from 1947 to
1972).
The Twitter Archive
Every public tweet since Twitter’s launch in March
2006.
Research requests have included users looking for
their own Twitter history, the study of the
geographic spread of news, the study of the
spread of epidemics, and the study of the
transmission of new uses of language.
The collection comprises only a few TB, but hundreds of
billions of tweets.
A White Paper is available online at:
http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-arc
[Slide graphic labels: status, privacy, commercial, personal, events, social media, visualization, social science]
Research Datasets
Research datasets are created by
faculty, curators, researchers, and
federal and state agencies.
It is not enough to be collecting
publications; we must collect the
datasets that support the published
work, to allow for replicability and re-use
in research.
The Library is now planning to expand its
collections to preserve research
data, in addition to recognizing that
the collections we already have are
Big Data to be mined.
And the full breadth of the
Library’s Collections
The American Memory collection, one of the oldest
and most used digital collections on the web.
The oral histories of the Veterans History Project.
The audio and video collections of the American
Folklife Center.
More than 1.2 million images from Prints and
Photographs.
Digitized maps and GIS data from Geography and
Maps.
More than 300,000 digitized audio and video files
comprising over 5 PB at the Packard Campus.
And many, many, many more.
id.loc.gov
The Library of Congress is, in part, a
standards agency for the rules used to
create metadata records and for the
controlled vocabularies (authorities)
used to describe items.
The Library is gradually making its
vocabularies available as serialized
RDF datasets (SKOS and JSON).
In the library community, the LC
authorities are one of the most
common tools for building linked
data relationships.
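To make the idea of vocabularies as linked data concrete, here is a small sketch of how SKOS relationships can be followed. The property names (`prefLabel`, `altLabel`, `broader`) come from the SKOS vocabulary; the `example.org` URIs, the labels, and the helper functions are hypothetical and do not reproduce any actual LC authority record.

```python
# SKOS namespace is real; every concept URI and label below is made up.
SKOS = "http://www.w3.org/2004/02/skos/core#"
EX = "http://example.org/vocab/"

# A tiny in-memory set of (subject, predicate, object) triples.
TRIPLES = [
    (EX + "maps", SKOS + "prefLabel", "Maps"),
    (EX + "nautical-charts", SKOS + "prefLabel", "Nautical charts"),
    (EX + "nautical-charts", SKOS + "altLabel", "Sea charts"),
    (EX + "nautical-charts", SKOS + "broader", EX + "maps"),
]


def objects(subject, predicate):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def broader_labels(concept):
    """Follow skos:broader links upward and collect preferred labels."""
    labels = []
    seen = set()
    frontier = objects(concept, SKOS + "broader")
    while frontier:
        current = frontier.pop(0)
        if current in seen:
            continue  # guard against cycles in the vocabulary
        seen.add(current)
        labels.extend(objects(current, SKOS + "prefLabel"))
        frontier.extend(objects(current, SKOS + "broader"))
    return labels


if __name__ == "__main__":
    print(broader_labels(EX + "nautical-charts"))
```

A description that stores only the concept URI can pick up labels and broader terms this way, which is what makes shared authorities useful for linking records across institutions.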
What are some of the
technological challenges of
managing and preserving
large digital collections in
many formats, and making
them available for use?
Sheer amount.
Huge variation in file formats.
Unclear and undocumented rights.
Security.
Missing metadata.
Data citation and identifier issues.
Discovery expectations: discovery across collections and
institutions together.
Cost.
I will mention infrastructure only in passing.
There are scale issues related to:
• Storage
• Archiving
• Bandwidth
• Software development
• Staffing for processing
This Requires a Preservation Infrastructure
The Library developed the BagIt transfer specification for the
movement of files between and within organizations.
• http://www.digitalpreservation.gov/documents/bagitspec.pdf
The Library inventories incoming files, and is gradually inventorying all
digital content.
The Library maintains multiple copies of files on servers and on tape,
in geographically distributed locations.
The Library has documented sustainability factors for file formats.
• http://www.digitalpreservation.gov/formats/
For cases where we do have control over content we receive, we have
a “Best Edition” Preferred Formats statement, which is currently being
updated.
• http://www.copyright.gov/circs/circ07b.pdf
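The transfer and inventory steps above can be sketched in a few lines. A BagIt bag places payload files under a `data/` directory, declares itself in `bagit.txt`, and lists a checksum for every payload file in a manifest. The code below is a simplified illustration using only the Python standard library, not the Library's own tooling; the function names, the choice of SHA-256, and the version string are assumptions, and optional tag files such as `bag-info.txt` are omitted.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir, files):
    """Write payload files under data/ plus bagit.txt and a manifest.

    `files` maps relative file names to bytes (hypothetical input shape).
    """
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    lines = []
    for name, content in sorted(files.items()):
        target = data / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        lines.append(f"{sha256_of(target)}  data/{name}")
    # Bag declaration; the version value here is an assumption.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir):
    """Return payload paths that are missing or fail their checksum."""
    bag = Path(bag_dir)
    failures = []
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_bag(tmp, {"page1.txt": b"hello", "page2.txt": b"world"})
        print(verify_bag(tmp))
```

The same checksum comparison can be rerun on each stored copy, which is one way to confirm that geographically distributed copies still match what was originally received.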
There are many new
activities to be planned for
with new researcher uses
and expectations.
We still have collections. But what we also have is Big Data,
which requires us to rethink the infrastructure that is needed
to support Big Data services. Our community used to
expect researchers to come to us, ask us questions about
our collections, and use our digital collections in our
environment.
Now our collections are, more often than not, self-serve.
Researchers are taking collections as data away to work
with in their own computational environments. This is a shift
away from recent service models where libraries built out
and housed lab spaces for specialized activities such as text
mining and geospatial modeling and provided staff to assist
in acquiring and manipulating data.
More and more researchers want to use one
or more collections as a whole, mining and
organizing the information in novel ways.
With what used to be unimaginable computing
power on a desktop, researchers mine this rich
information and use tools to create pictures
that translate it into knowledge.
Should collections be pre-processed before ingest to
create a variety of derivatives that might be used in
various forms of analysis? Or do we limit access to
the native format? Or put on-the-fly format
transformation services for downloads in place?
We are beginning to put into place the infrastructure
needed to create full-text indexes for millions/billions
of items to support full discovery for researchers.
We are only just starting the process of generating
linked data representations of billions of items.
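For readers unfamiliar with what a full-text index involves, here is a toy inverted index: a map from each word to the set of items that contain it. It only illustrates the data structure, with made-up item identifiers and text; indexing billions of items calls for dedicated, distributed search infrastructure rather than an in-memory dictionary.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


if __name__ == "__main__":
    # Hypothetical items, invented for this example.
    docs = {"item-1": "Harbor news", "item-2": "Harbor map"}
    print(search(build_index(docs), "harbor map"))
```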
Cultural heritage institutions are increasingly looking
towards self-service – researchers need not ask to
download or tell us that they have. We may never
know.
BUT … we do have collections that are limited to
on-site-only access due to licenses or gift agreements. In
that case, libraries may have to consider providing
high-powered workstations with analytical tools for
researchers to work with these collections and take
analysis outputs away with them.
Both have policy implications and implications for
public service staffing.
But the benefits outweigh
the challenges.
Cultural heritage institutions are managing
and preserving the datasets and big data
necessary for re-use and replicability.
We are working to make the deposit and
management of such data easier to
accomplish.
This is an important new role for our
organizations in enabling new research.
Discussion…
Leslie Johnston
lesliej@loc.gov

Cultural Heritage Insitutions and Big Data Collections

  • 1. Cultural Heritage Institutions and Big Data Collections Leslie Johnston Chief of Repository Development Library of Congress
  • 2. Cultural Heritage organizations have, until recently, spoken of “collections” and “content” and “records” and even “files.” Now it’s also data.
  • 3. Data is not just generated by satellites, identified during experiments, or collected during surveys. Datasets are not just scientific and business tables and spreadsheets. We have Big Data in our Libraries, Archives and Museums.
  • 4. Like other cultural heritage organizations, the Library of Congress has as one of its mandates that it make its collections freely available, whether that is in person or on the web.
  • 5. What are some Library of Congress examples of collecting and preserving large scale collections in many formats, and making them usable as collections and as data?
  • 6. National Digital Newspaper Program chroniclingamerica.loc.gov/ This collection was transformative for the Library of Congress: it was the first to be made to be available as a bulk download and exposed as a text and image dataset. Some researchers want to search for stories in historic newspapers. Some researchers want to mine newspaper OCR for trends across time periods and geographic areas. Requests have come in to analyze the full collection.. The program has:  Multiple producers (36 now, ultimately 54)  Free and open public access  APIs for machine access and automated processes, including access to RDF linked data. Over 6.7 million newspaper pages ingested to date Over 250 Tb of data
  • 7. Web Archives http://www.loc.gov/webarchiving/ lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html The Library has been archiving the web since 2000. Subject area specialists curate the collections, and Library catalogers create collection-level metadata records. The collections include: • U.S. elections • Web sites created by members of the House and Senate • Thematic collections around events, such as elections in the Philippines, the Iraq war, and the appointment of Supreme Court Justices. • Collections around an area of study, such as Legal “Blawgs” We frequently receive requests for access to full collections for full-text data mining. Every format possible on the web Almost 8 billion files Over 425 TB
  • 8. congress.gov Congress.gov is still in its beta phase, transforming congressional information discovery. Legislation from 1993 to the present, The Congressional Record from 1995 to the present, Committee Reports from 1995 to the present, and Member profiles from 1973 to the present (with some from 1947 to 1972).
  • 9. The Twitter Archive Every public tweet since Twitter’s launch in March 2006. Research requests have included users looking for their own Twitter history, the study of the geographic spread of news, the study of the spread of epidemics, and the study of the transmission of new uses of language. The collection comprises only a few TB, but 100s of billions of tweets. A White Paper is available online at: http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-arc status privacy commercial personal events social media visualization social science
  • 10. Research Datasets Research datasets are created by faculty, curators, researchers, and federal and state agencies. It is not enough to be collecting publications; we must collect the datasets that support the published work, to allow for replicability and re-use in research. The Library is now planning to expand its collections to preserve research data, in addition to recognizing that the collections we already have are Big Data to be mined.
  • 11. And the full breadth of the Library’s Collections The American Memory collection, one of the oldest and most used digital collections on the web. The oral histories of the Veterans History Project. The audio and video collections of the American Folklife Center. More than 1.2 million images from Prints and Photographs. Digitized maps and GIS data from Geography and Maps. More than 300,000 digitized audio and video files comprising over 5 PB at the Packard Campus. And many, many, many more.
  • 12. id.loc.gov The Library of Congress is, in part, a standards agency for the rules used to create metadata records and for the controlled vocabularies (authorities) used to describe items. The Library is gradually making its vocabularies available as serialized RDF datasets (SKOS and JSON). In the library community, the LC authorities are one of the most common tools for building linked data relationships.
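To illustrate what consuming a vocabulary serialized as SKOS in JSON might look like, here is a minimal sketch. It assumes the record is in JSON-LD expanded form; the sample record, its `example.org` identifiers, and the `pref_labels` helper are all made up for illustration and are not taken from id.loc.gov. Only the SKOS property URIs are real.

```python
import json

# Real SKOS property URI; everything in SAMPLE below is hypothetical data.
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"

SAMPLE = """
[
  {
    "@id": "http://example.org/authorities/x0001",
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {"@value": "Newspapers", "@language": "en"},
      {"@value": "Journaux", "@language": "fr"}
    ],
    "http://www.w3.org/2004/02/skos/core#broader": [
      {"@id": "http://example.org/authorities/x0002"}
    ]
  },
  {
    "@id": "http://example.org/authorities/x0002",
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {"@value": "Serial publications", "@language": "en"}
    ]
  }
]
"""


def pref_labels(jsonld_text, language="en"):
    """Map each concept identifier to its preferred label in one language."""
    labels = {}
    for node in json.loads(jsonld_text):
        for value in node.get(SKOS_PREF_LABEL, []):
            if value.get("@language") == language:
                labels[node["@id"]] = value["@value"]
    return labels


if __name__ == "__main__":
    for concept_id, label in pref_labels(SAMPLE).items():
        print(concept_id, "->", label)
```

Because each concept carries a stable identifier, a local catalog record can point at the identifier rather than copying the label text, which is what makes the linked data relationships mentioned above possible.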
  • 13. What are some of the technological challenges of managing and preserving large digital collections in many formats, and making them available for use?
  • 14. Sheer amount. Huge variation in file formats. Unclear and undocumented rights. Security. Missing metadata. Data citation and identifier issues. Discovery expectations: discovery across collections and institutions together. Cost.
  • 15. I will mention infrastructure only in passing. There are scale issues related to: • Storage • Archiving • Bandwidth • Software development • Staffing for processing
  • 16. This Requires a Preservation Infrastructure The Library developed the BagIt transfer specification for the movement of files between and within organizations. • http://www.digitalpreservation.gov/documents/bagitspec.pdf The Library inventories incoming files, and is gradually inventorying all digital content. The Library maintains multiple copies of files on servers and on tape, in geographically distributed locations. The Library has documented sustainability factors for file formats. • http://www.digitalpreservation.gov/formats/ For cases where we do have control over content we receive, we have a “Best Edition” Preferred Formats statement, which is currently being updated. • http://www.copyright.gov/circs/circ07b.pdf
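The idea behind a BagIt transfer can be sketched in a few lines: payload files go under `data/`, a manifest lists a checksum for each one, and the receiver recomputes the checksums on arrival. The following is a simplified illustration under stated assumptions, not the Library's tooling. The `make_bag` and `verify_bag` helpers are hypothetical names, version 0.97 of the specification and SHA-256 manifests are assumed, and optional tag files such as `bag-info.txt` are omitted.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict) -> None:
    """Write payload files under data/ plus a manifest and the bag declaration."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Each manifest line is "<checksum> <path relative to the bag root>".
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    # Bag declaration; the version number here is an assumption.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )


def verify_bag(bag_dir: Path) -> list:
    """Return payload paths that are missing or whose checksum has changed."""
    failures = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file():
            failures.append(rel_path)
        elif hashlib.sha256(target.read_bytes()).hexdigest() != expected:
            failures.append(rel_path)
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "example-bag"
        make_bag(bag, {"page-0001.txt": b"OCR text for one newspaper page"})
        print("failed files:", verify_bag(bag))
```

The same check can be rerun at any later time against any stored copy, which is why a manifest of checksums is useful for ongoing inventory and not only for the initial transfer.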
  • 17. There are many new activities to be planned for with new researcher uses and expectations.
  • 18. We still have collections. But what we also have is Big Data, which requires us to rethink the infrastructure that is needed to support Big Data services. Our community used to expect researchers to come to us, ask us questions about our collections, and use our digital collections in our environment. Now our collections are, more often than not, self-serve. Researchers are taking collections as data away to work with in their own computational environments. This is a shift away from recent service models where libraries built out and housed lab spaces for specialized activities such as text mining and geospatial modeling and provided staff to assist in acquiring and manipulating data.
  • 19. More and more researchers want to use one or more collections as a whole, mining and organizing the information in novel ways. Researchers now have what used to be unimaginable computing power on the desktop to mine this rich information, along with tools that create pictures translating that information into knowledge.
  • 20. Should collections be pre-processed to create a variety of derivatives that might be used in various forms of analysis before ingesting them? Or do we limit access to the native format? Or put on-the-fly format transformation services for downloads in place? We are beginning to put into place the infrastructure needed to create full-text indexes for millions/billions of items to support full discovery for researchers. We are only just starting the process of generating linked data representations of billions of items.
  • 21. Cultural heritage institutions are increasingly looking towards self-service: researchers need not ask to download or tell us that they have. We may never know. BUT … we do have collections that are limited to on-site-only access due to licenses or gift agreements. In that case, libraries may have to consider providing high-powered workstations with analytical tools for researchers to work with these collections and take analysis outputs away with them. Both have policy implications and implications for public service staffing.
  • 22. But the benefits outweigh the challenges.
  • 23. Cultural heritage institutions are managing and preserving the datasets and big data necessary for re-use and replicability. We are working to make the deposit and management of such data easier to accomplish. This is an important new role for our organizations in enabling new research.