Big Data's Long Tail



John Kunze

Technology


1. Big Data’s Long Tail. Carly Strasser, John Kunze, Trisha Cruse. University of California Curation Center, California Digital Library. 10 December 2012. Image from Flickr by rahenz.

8. Intercept researchers where they already work

13. Requirements. Features: Best practices check; Generate metadata; Generate citation; Post data to repository

17. Attribute Metadata

18. Create Data Citation

21. The long tail of the long tail: a data repository for anyone, anywhere

22. Build community (image from animationresources.org). Add repositories. Add metadata schema. ONEShare, NSF DataNet, Merritt, UCSB, INTEROP, etc. Dryad, etc.

Editor's notes

[ prepared using PPT on a Mac, with animation ]
One way of looking at Big Data is this graph showing dataset size on the vertical axis against number of datasets on the horizontal axis. While there are some very large, celebrated datasets produced by satellites, ocean sensors, etc., there’s a very long tail off to the right of smaller, more obscure datasets that cumulatively account for a large portion of Big Data.
There are many more researchers out in the field collecting heterogeneous data, such as species counts obtained by visual sightings.
And there are many more grants supporting this kind of research...
And those grants are usually much smaller in terms of dollar amounts.
As a result, the large, celebrated datasets tend to come with staff positions for data management, along with well-supported, standardized software tools. So for a huge number of grants and datasets, especially in Earth, environmental, and ecological sciences, ...
... there’s an ugly truth. Many of these researchers are not taught data management; don’t know what metadata are; can’t name data centers or repositories; don’t share data publicly or store it in an archive; and aren’t convinced they should share data. This amounts to a whole lot of inertia that hasn’t yielded much after years of effort to teach, exhort, cajole, or even scold them about the tools and database systems they should be using ...
... so instead, how about meeting them where they already are, and trying to move forward from there? As it happens, many of them collect data in spreadsheets using Microsoft Excel software. Although it wasn’t designed for research data management, Excel is nonetheless ubiquitous and familiar. Trouble arises because some uses can make data re-use harder.
This led to a vision of extending Excel to make it more friendly for research data. So CDL partnered with Microsoft Research (MS Research), the Gordon and Betty Moore Foundation (GBMF), and the DataONE community to gather requirements and build this extension.
The project was to become a tool called DataUp, designed to facilitate Archiving, Sharing, and Publishing. This was to be achieved by enabling and supporting data management and organization. The result would enable and support data re-use and reproducibility.
That was DataUp – the vision. This is DataUp – the implementation, and of course, the Logo. It consists of an open source tool available on Bitbucket, plus a community of Earth, environmental, and ecological researchers whose needs were the focus of the first release, plus the DataUp client, which consists of an add-in and a web application. What are these last two things? Good question.
What does it look like? Here’s the add-in. Note the wide ribbon across the top showing extra buttons supported by DataUp. Click on Best Practices check and it will produce an error report like this on the right.
Blowing up that report a bit, it shows you some of the errors it found: embedded comments, tables, pictures, color coding, non-contiguous data, missing data, etc. Some of these it can offer to correct by removing them. Others it can only point out.
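To make the idea of a best practices check concrete, here is a minimal sketch in Python. It is not DataUp's code: the `Cell` class, the `check_sheet` function, and the in-memory sheet model are all hypothetical, and only four of the problems named above are covered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    """Hypothetical stand-in for a spreadsheet cell (not a DataUp or Excel type)."""
    value: Optional[str] = None    # None means the cell is empty
    comment: Optional[str] = None  # embedded comment, if any
    color: Optional[str] = None    # background color, if any


def check_sheet(rows):
    """Return a list of (issue, row, column) tuples, with 1-indexed positions."""
    issues = []
    # Indices of rows that hold at least one value.
    filled = {i for i, row in enumerate(rows)
              if any(cell.value is not None for cell in row)}
    first = min(filled) if filled else 0
    last = max(filled) if filled else -1
    for i, row in enumerate(rows):
        if i not in filled:
            # A blank row between two data rows splits the data into blocks.
            if first < i < last:
                issues.append(("non-contiguous data", i + 1, None))
            continue
        for j, cell in enumerate(row):
            if cell.comment is not None:
                issues.append(("embedded comment", i + 1, j + 1))
            if cell.color is not None:
                issues.append(("color coding", i + 1, j + 1))
            if cell.value is None:
                issues.append(("missing data", i + 1, j + 1))
    return issues


if __name__ == "__main__":
    sheet = [
        [Cell("site"), Cell("count")],
        [Cell("A"), Cell("12", comment="recount?")],
        [Cell(), Cell()],
        [Cell("B"), Cell(color="yellow")],
    ]
    for issue in check_sheet(sheet):
        print(issue)
```

A report like this only points problems out. Offering to remove an embedded comment or color, as the notes describe, would be a separate step on top of it.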
Then there’s the Generate Metadata action, which adds a new tab on the bottom – the red arrow is pointing to its label. This feature supports general description of the entire dataset. The inset on the right shows the web app equivalent. In it you can see red stars indicating the few fields that are required. This first implementation of DataUp supports only EML, the Ecological Metadata Language.
For column, or attribute, metadata, in the add-in you can select a set of columns that you want to describe. Again, right now only EML is supported.
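The two levels of description, dataset and column, can be sketched as a small XML record. This is only an illustration in Python: the element names are loosely borrowed from EML, the output is not schema-valid EML, and the `build_metadata` function and the choice of which fields are required are assumptions, not DataUp's behavior.

```python
import xml.etree.ElementTree as ET


def build_metadata(title, creator, columns):
    """Build a simplified, EML-like record (not schema-valid EML).

    `columns` is a list of (name, definition) pairs, one per described column.
    Treating title and creator as the required fields is an assumption.
    """
    if not title or not creator:
        raise ValueError("title and creator are required")
    root = ET.Element("dataset")
    # Dataset-level description.
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "creator").text = creator
    # Column-level (attribute) description.
    attribute_list = ET.SubElement(root, "attributeList")
    for name, definition in columns:
        attribute = ET.SubElement(attribute_list, "attribute")
        ET.SubElement(attribute, "attributeName").text = name
        ET.SubElement(attribute, "attributeDefinition").text = definition
    return root


if __name__ == "__main__":
    record = build_metadata(
        "Tide pool species counts",
        "J. Doe",
        [("site", "Sampling site code"), ("count", "Individuals sighted")],
    )
    print(ET.tostring(record, encoding="unicode"))
```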
An important step is creation of the Data Citation, which is the primary mechanism to enable discovery and to give credit for producing the data. An important part of that is when you click on “retrieve an identifier”...
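As a rough illustration of what a data citation carries, the sketch below assembles one from a handful of fields. The field order, the `format_citation` function, and the sample values (including the placeholder ARK) are assumptions for illustration; the notes do not specify the citation format DataUp produces.

```python
def format_citation(creators, year, title, repository, identifier):
    """Join citation fields into one line.

    The layout "Creators (Year). Title. Repository. Identifier" is an
    assumed format, not one taken from DataUp.
    """
    names = "; ".join(creators)
    return f"{names} ({year}). {title}. {repository}. {identifier}"


if __name__ == "__main__":
    # All values here are made-up examples; the ARK is a placeholder.
    print(format_citation(
        ["Doe, J.", "Roe, R."],
        2012,
        "Tide pool species counts",
        "ONEShare",
        "ark:/99999/fk4example",
    ))
```

The identifier is what makes the citation resolvable, which is why retrieving one is part of this step.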
You then get asked to select a repository to which you’ll be depositing the data. This first version of DataUp knows about only one repository, CDL’s Merritt-based ONEShare archive. After selecting it you’ll see the identifier, in this case an ARK. How this works is that DataUp asks the repository (CDL’s Merritt in this case) for an identifier, and the repository in turn asks CDL’s EZID service for the identifier.
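The chain of requests described here (client asks repository, repository asks identifier service) can be modeled in a few lines of Python. Everything below is a hypothetical in-memory stand-in: the class names, the method names, and the counter-based minting are assumptions, and no real Merritt or EZID API is called.

```python
import itertools


class IdentifierService:
    """Hypothetical stand-in for an identifier service such as EZID."""

    def __init__(self, prefix):
        self._prefix = prefix
        self._counter = itertools.count(1)

    def mint(self):
        # A real service would register the identifier; this only counts up.
        return f"{self._prefix}{next(self._counter):04d}"


class Repository:
    """Hypothetical stand-in for a repository such as Merritt."""

    def __init__(self, name, id_service):
        self.name = name
        self._id_service = id_service
        self.reserved = []

    def request_identifier(self):
        # The repository does not mint identifiers itself; it delegates.
        identifier = self._id_service.mint()
        self.reserved.append(identifier)
        return identifier


def retrieve_identifier(repositories, choice):
    """Client side: pick a repository by key and ask it for an identifier."""
    return repositories[choice].request_identifier()


if __name__ == "__main__":
    service = IdentifierService("ark:/99999/fk4")  # placeholder prefix
    repositories = {"practice": Repository("ONEShare (practice)", service)}
    print(retrieve_identifier(repositories, "practice"))
```

Keeping the client unaware of the identifier service means a new repository could be added to the dictionary without changing the client code.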
Finally you click on Post, and you’ll get an email when your dataset has been registered in the repository. There is actually a practice version of ONEShare that you can select – we strongly recommend that you use that when testing DataUp, as it helps to keep the quality of the main repository high.
Among all the many researchers receiving those smaller grants I talked about earlier – the long tail researchers – there are those who have or can afford a relationship with an existing data center repository. But within that long tail there’s a long tail of researchers who are unaffiliated with a data center or can’t afford one, and we wanted them to be able to use DataUp and to deposit in an archive immediately. The long tail of the long tail includes, e.g., citizen scientists.
What does the future hold? We plan to build a strong community of users and developers, add more repository types, and add support for additional metadata schemas. There’s no reason that DataUp couldn’t be extended to any domain that uses spreadsheet data. A high priority is to support more DataONE repositories, which will give DataUp visibility and leverage from the entire DataONE federation, and from there take advantage of connections to the entire NSF DataNet program and global initiatives such as INTEROP.