Grab a bucket!



                                                  It’s raining data!
      Photo: http://www.flickr.com/photos/peasap/655111542/
                                                                                      Dorothea Salo
                                                                             University of Wisconsin
                                                                                        Access 2009




Hi there. Thanks very much to Mark Leggott for inviting me here, and to all of you for lending me
your ears for a time.

Youʼll have noticed that the title of this talk in the program notes is very formal and buttoned-down.
ʻRepresenting and managing the data deluge.ʼ Well, okay. I am not a formal and buttoned-down
person, but when Mark approached me to speak here, I was actually scared to death to accept, and
so I wrote this really terribly boring title -- so like Peter, I just up and changed it. The REAL title is
ʻGrab a bucket -- itʼs raining data!ʼ

To hear some folks tell it, itʼs a golden age to be a digital librarian. Here we have an entire new form
of scholarly publication -- digital research data -- and itʼs ours for the asking! In times when weʼre all
worried about the future of libraries (and, letʼs face it, librarians), this feels heaven-sent. Grab a
bucket, itʼs raining data, hallelujah!
the...




      Painting: “Cassandra,” Evelyn de Morgan
      Photo: http://commons.wikimedia.org/wiki/File:Cassandra1.jpeg
                                                                      of Open Access
In some quarters, I am now styled the “Cassandra of Open Access.” Cassandra, for those not up on
their Greek myth, was a Trojan prophetess who was cursed such that nobody believed what she
said until it was too late. Being from Troy, which was of course completely doomed, most of her
prophecies were fairly dire, too. “Hey, the Greeks are about to wheel a big wooden horse into your
city so they can burn it down and kill everybody!” Not happy-making stuff weʼre talking about.
I’ve got nothing against




                                             but the reality was...
      Photo: http://www.flickr.com/photos/y2bk/528300692/




Some people have mistaken my Cassandra-nature for an animus against open access generally and
institutional repositories in particular. Iʼve never had it in for open access! Who doesnʼt like open
access? Itʼs similar to what Cory said yesterday: itʼs hard to be against an unambiguous good like
open access without sounding like a total jerk... which hasnʼt stopped some publishers, of course.
(*CLICK*) But Iʼve been running institutional repositories for close to five years now and the on-the-
ground reality has been quite a bit...
... blurrier.
          goals?

                         means?

                                    something for nothing?

                                                  fit between content and container?

                                                                 fit between user needs and system?

                            and so now, I may be becoming
     Photo: http://www.flickr.com/photos/jennsstuff/2965783700/




... blurrier. Conflicting, contradictory, and in some cases flatly impossible goals. Minimal means,
because of people who seem to have been reading the mythical “Frommerʼs Institutional
Repositories on $5 A Day.” Asking for time, effort, and data from faculty without giving them any real
service or any return on their time investment that made sense to them. We crammed things into
IRs that just didnʼt fit with the very limited IR view of the digital universe, just because we hadnʼt
anywhere else to put them: our content didnʼt fit in the container we had. And we completely ignored
faculty needs and desires.

Iʼm seeing some of the same thought and design processes happening now with regard to e-
science, e-research, cyberinfrastructure, data curation, whatever you want to call it. And this
troubles me. So I canʼt help but wonder if Iʼm becoming...
the...




                                                     of Data Curation?
But, optimistically, itʼs early days yet. Thereʼs no reason we have to make the same mistakes with
data that we made with IRs. So, I donʼt want anyone to think that Iʼm raising the problems Iʼm going
to raise in this talk because Iʼm somehow AGAINST research data curation, or I think libraries
shouldnʼt get involved with it.

I am all for research-data curation, and I believe very strongly that libraries need to get involved. I
just think we should know what weʼre getting ourselves into, and if that means Iʼm a little
Cassandraic, okay, so be it.
goals?

                 means?

                        something for nothing?

                                fit between content and container?

                                       fit between user needs and system?




I could spend hours talking about all these things, but I guarantee that nobody here wants to listen
to me for hours. So Iʼm going to focus this talk on the fit between content and container, though I
may touch on other things.

Iʼm going to examine some of the qualities of typical research data, then talk about digital libraries
and IRs, looking hard at some of the impedance mismatches weʼre liable to run into, and maybe
strategize a little bit about how to make ourselves and our systems better now, before we run
headlong into another mess.

And the lens Iʼm going to be looking through is a human lens, not so much a technological lens.
THIS IS NOT JUST A TECHNOLOGY PROBLEM, I canʼt say that loudly enough.
What do we know about data?




     Photo: http://www.flickr.com/photos/kentbye/2053916246/




So what do we know about research data, speaking very broadly and generally?
There’s a lot of data.




      Photo: http://www.flickr.com/photos/noelzialee/2126153623/




Richard talked about this yesterday, but Iʼll just reiterate: Even if we admit that the Large Hadron
Collider types are probably going to take care of themselves -- and this isnʼt something I necessarily
admit; I know huge, well-funded projects that are making huge messes with their data -- even if we
admit that, weʼre still looking at an incredible flood of stuff.

Have we got big enough buckets? I dunno. At this juncture I feel it incumbent upon me to say the
word “cloud.” Cloud. There. I have said it. I now feel no need at all to say it again.

Look, I understand that storage and networking are problems that have to be solved before we can
do anything else. I get that. Just -- to me, itʼs necessary but not sufficient, even though it seems to
be getting all the attention right now. So Iʼm going to move on to characteristics of research data that
Iʼm more interested in.
Photo: http://www.flickr.com/photos/jonevans/1032687817/



                Data are there to be interacted with.




One thing I think we need to keep in mind about data is that they are not an end in themselves. We
donʼt keep data just to keep data; we do it because researchers can pick up shovels and dig around
in the sands and build knowledge like sand castles!

Data are there to do things with. To be examined, cleaned up, verified, refuted, corrected, number-
crunched, mashed up with other data, graphed, charted, visualized... and if we treat them as though
they were unchangeable museum objects -- look but donʼt touch, like books chained to a medieval
lectern -- we are actually getting in the way of making new knowledge. If nobody can do things with
data, there is no point in keeping them! Thatʼs what CC0 is about, as Richard mentioned in his Q&A
session yesterday: removing legal barriers to messing about with data. We, we librarians, need to
remove TECHNICAL barriers to messing about with data.

Whatʼs more, different kinds of data have different affordances. You donʼt use a plastic sand-shovel
to dig a rock quarry, just the way you donʼt use a backhoe to build a sand castle. The way a
sociologist interacts with census data is just wildly different from the way a medical researcher
interacts with MRI data. The data buckets we build will have to internalize and respect those
affordances, or at the VERY least allow RESEARCHERS to build tools on top that respect those
affordances.
Data are wildly diverse in nature...




               ... as are their technical environments.
      Photo: http://www.flickr.com/photos/28481088@N00/670258156/




In other words, data are diverse, so the buckets we put them in will need to be different shapes and
colors in order to respect that diversity.

Now, differences in data can sometimes be skin-deep. The difference between a digital image of a
sculpture and a digital image of a physics field station in Antarctica is in some ways not much for our
purposes, however different our researchers may think they are. But sometimes the differences
really do matter. You canʼt treat a book in TEI markup the same as a book of page-scanned images;
you will be doing violence to readers of one or the other. A microscopy researcher on my campus
does cell sections digitally; you can train a microscope to focus from the top of the cell all the way
through and down, and then you can create a 3D cell image to play with. Itʼs really cool! But a
system that treats each section image as a wholly separate and unrelated thing
(*cough*DSpace*cough*), is making it impossible to get any knowledge out of those data.

Think for a moment about a single bucket that works for the TEI book, the book of page scans, the
images of the Antarctic field station, and the microscopy data, and youʼre starting to realize the
scope of the data-diversity problem.

Again, we donʼt control the technical environments our researchers are using to generate data.
Some of those environments are proprietary, and Mike Rylander talked yesterday about why thatʼs a
dangerous, dangerous problem. But even leaving that aside, if weʼre really, really lucky, we might
have a chance to make recommendations to researchers about their data. For the most part,
though, WE are the ones who will have to adapt to whatever theyʼre doing.
Data are already out there.




       Photo: NASA (via http://nasaimages.org/), “Multiwavelength M81”




Why is that? Weʼre not creating all the digital research data out there; the researchers are. And
theyʼve created it in huge volumes already. So Iʼm really interested when Dan Chudnov says that
the Library of Congress is working to capture data at world-scale and web-scale, because I want
them to teach ME how to do that.

So, researchers. Theyʼre not thinking long-term about the data theyʼve created. Theyʼre not thinking
past the expiration of their next grant! That means we have to. Weʼre the only people with a long-
term time horizon. Furthermore, theyʼre not gonna come to us; for the most part they canʼt even
imagine that we can help. The inescapable corollary here is that we canʼt just sit back and wait for
data to come to us -- a lot of it weʼre going to have to go out there and rescue!

And I may be airing some library dirty laundry here, in which case, forgive me, but itʼs not just them
-- itʼs us. WE have plenty of unsustainable digital projects sitting around our libraries. Just think for a
second: how many different digital-library, repository, and storage platforms are running inside your
library? I wonʼt even answer for mine; itʼs a scary large number. The stuff in those platforms is in
danger. We made this mess, we librarians; we have to clean it up. As Richard said yesterday, we
have to set an example with our own data! How are we going to establish ourselves as authorities in
describing and organizing data if our own datastores are not in order?
A lot of data are analog...




                           ... but really want to be digital.
      Photo: http://www.flickr.com/photos/mrbill/3452943573/




So, right, back to research data. (read slide titles)

For example, scientists still use paper lab notebooks. I wish they didnʼt too! The university archivist
on my campus really wishes they didnʼt, because they keep trying to give him hundreds of boxes of
lab notebooks that he canʼt possibly find space to store! And thatʼs just one example. Linguistic field
notes, on paper. For one of the linguists Iʼve talked to, her notes are some of the only attestations of
the language we have! Slides are a bugaboo in visual arts communities. Faculty have a tremendous
volume of analog materials that would be of much, much greater use if they were digital. Can we scale
up to that? Again, I donʼt know, and Iʼm not going to talk about this problem again. Itʼs there, we
probably need to solve it, end of story.
Data are project-based.




     http://www.exploringthehyper.net/




Aha. Now we get interesting.

This is a dissertation; you can look at it at ʻexploring the hyper dot netʼ. It includes the underlying
data (read “Explore” text). As you may be able to see at bottom right there, itʼs built on the blogging
tool WordPress and the Center for History and New Mediaʼs exhibit-builder tool called Omeka.
These are great tools! I love them both. But what are we librarians going to do as our dissertators
pile random webtools on top of each other to build their dissertations? Thatʼs what project-based
thinking gets you: total technological randomness. But our researchers think in terms of projects.
The latest grant. The latest collaboration. And when it comes to technology, theyʼre not above doing
something different and sui generis for every single one.
Data are sloppy.


      Photo: http://www.flickr.com/photos/midorisyu/2622024163/




By the same token, faculty are not librarians. They are messy, messy people, a lot of them. Many
more of them leave petty chores like, I donʼt know, organizing research materials and results -- they
leave that to their grad students. This means that our data buckets are not going to fill up with nice
neat orderly well-described, data-dictionaried columns of numbers. Honestly, what weʼre doing is
catching sloppy leaks, and we can expect to be for a long, long time.

And when our systems, library systems, only accept data thatʼs clean and pretty, we have a
problem.

In a word...
Data aren’t standardized.




     Photo: http://www.flickr.com/photos/mikewade/3463334719/




... data are not standardized, most of them. Thatʼs not even seen as a desideratum by data creators
yet. I know most of us know this, but in the print world, the journal-article-and-book world, we have
publishers to impose some kind of uniformity. Data donʼt live in that kind of world. We may yet get
there, but honestly? I donʼt expect it in the length of my career.

Thatʼs our trawl through some basic characteristics of research data. What do we have in libraries to
throw at this problem?
Our Big Bucket:




                                  the digital library
People think that primary-source data, even big data, is a new thing to libraries. Itʼs not. We were
doing big digital data before the researchers were, in the form of the digital library! What did I hear
yesterday, ten terabytes of TIFFs from a single digitization project? So itʼs possible to think hey,
weʼve got this solved! We just apply our existing digital-library infrastructure, human and
technological, to this new problem.
Our other Big Bucket:




                       the institutional repository
If thatʼs not enough, at the same time, weʼve been building another kind of digital bucket; weʼve
called it the institutional repository. And again, some people think that IRs just solve the data
problem. Magic IR pixie dust! Or something...
Photo: http://www.flickr.com/photos/peasap/655111542/



                                          Impedance mismatches




Well, it wonʼt surprise anyone that I donʼt think thatʼs true. There is no magic pixie dust for research
data curation, not in digital libraries and not in IRs. What weʼve done with digital libraries and IRs
gives us a lot of the skill and knowledge we need to work with research data; I firmly believe that,
though itʼs hard to find researchers who do. But weʼre going to have to do a lot of rethinking and
reworking the way we do things. Otherwise, weʼll just trip all over ourselves and the impedance
mismatches between the characteristics of research data and the characteristics of digital libraries
and IRs. So letʼs take this a piece at a time.
What do we know about these?
      Photo: http://www.flickr.com/photos/schex/193912573/




Digital libraries. You know, I love this picture, because theyʼre so proud of being digital. National
DIGITAL Library. Where I am, weʼre trying to rebrand our digital collections, because we donʼt think
“digital” should be what linguists call a ʻmarkedʼ state any more. Digital is ordinary, or it should be.
Digital is normal. So given that, how DO you brand digital collections? --If you have an idea, see me
after, okay?

Anyway, what are digital libraries like? And how is that going to work with research data?
Carefully built and tended




                                                                     http://www.collectionscanada.gc.ca/naskapi/index-e.html




Just like our print libraries, weʼve built our digital libraries carefully, out of the best materials. Weʼre
not making digital libraries out of any old thing; we SELECT what weʼre prepared to lavish effort on.

And we do lavish effort! Look at this site! Itʼs a lexicon of a First Nations language, itʼs been
translated INTO that language, including the fonts to represent that language -- I love this site! Itʼs
beautiful!
Data are already out there.




      Photo: NASA (via http://nasaimages.org/), “Multiwavelength M81”




How are our thoughtful, careful collection-development policies going to cope with whatʼs already
out there? How will we decide what we pick up and what we leave behind? I already see troubling
signs that in the absence of better policy, cyberinfrastructure shops are deciding to help whoever
has money. I donʼt like that trend and hope we in libraries challenge it.

When I go to data curation workshops, most people think of data curation as “the new special
collections” or “the new archives.” Understanding that we canʼt keep everything, theyʼve come up
with elaborate decision mechanisms for figuring out what to keep and what to toss.

Well, I think thereʼs a problem with that. Itʼs a human problem. Itʼs the faculty member who, when
told youʼre not going to curate his lousy badly-designed badly-described dataset, turns around and
never darkens your door again -- even when heʼs got a dataset that will revolutionize his field.

How do we harmonize the need to provide good service with the need not to swamp ourselves with
garbage? I donʼt know, and I suspect answers will differ, but I do know we need to figure that one
out.
Data are sloppy.


      Photo: http://www.flickr.com/photos/midorisyu/2622024163/




How are we going to rescue data when, by our standards, a lot of it is sloppy? Are we prepared for
the work involved in rescuing other peopleʼs sloppy data? Are we prepared to let other peopleʼs
sloppy data in alongside our nice clean pretty data?
Data are project-based.




     http://www.exploringthehyper.net/




Are we going to pick and choose among projects? Based on their software platforms? Can we?
What about dissertations, which are institutional records no matter HOW theyʼre created?
Carefully built and tended




So weʼre going to have to rethink how much and what kind of care we can and should give our data
libraries. Like it or not, they canʼt all look as beautiful as this; volume and condition forbid.
Production is a Taylorist’s dream.
      Photo: http://www.flickr.com/photos/villeneuve53/1808995620/




Where Iʼm from, and perhaps where youʼre from too, we like our production of digital objects, mostly
but not entirely through digitization, to run like a well-oiled machine. Just like the factory floor Peter
talked about. Itʼs generally more cost-effective to do things in large volumes and in systematic ways.
In the States, we call this a Taylorist way of going about things -- for those who donʼt read
management literature, Frederick Taylor was the guy who taught Henry Ford how to run auto
production. Taylor measured how long it took people to do things, and made it so people had to
make the fewest and smallest motions possible to get the work done.

What Taylorist production methods mean in a digitization context, of course, is that you tend to limit
the type of work that you do to what you can easily automate and train for, which in practice means
only a few kinds of data per library. We do our image collections or our newspapers or our finding
aids or our text collections -- we in essence specialize ourselves by data type, again for efficiencyʼs
sake.
Data are wildly diverse in nature...




               ... as are their technical environments.
      Photo: http://www.flickr.com/photos/28481088@N00/670258156/




How well is that going to serve us when weʼre not in control of the data-creation process? When the
data donʼt fit into the buckets weʼve designed for our own particular digital-data specialties? If weʼre
going to come to grips with data on an institutional basis, we wonʼt have the luxury of specializing
any more. How are we going to cope?
Data aren’t standardized.




     Photo: http://www.flickr.com/photos/mikewade/3463334719/




How can we be Taylorist about gathering and describing data when the data just arenʼt
standardized? And if we canʼt be Taylorist about it, how do we keep up with the flood?
Data are project-based.




     http://www.exploringthehyper.net/




How are we going to manage when thereʼs a technical-infrastructure mismatch between their project
silos and our Taylorist, tailored environments? We have some choices, but none of them are
particularly good. Do we pull the data out and start over, ignoring the effort put in on the original
interface? If itʼs on the web, do we take a static snapshot of the original? That feels a bit to me like
pinning a gorgeous butterfly through the head, killing it, to display it in a glass case, though I have to
admit that I do it because I donʼt necessarily have a better option. Do we recreate the original
interface, and take on the work of maintaining and improving it? Those donʼt sound like Taylorist
processes to me!

Iʼm frightened -- honestly scared to death -- at how many librarians do not realize that this is a
problem. They really seem to think that you wave a magic wand over somebodyʼs random dataset
and it miraculously shows up in a repository! It does not work that way! For every new input,
somebody has to figure out whatʼs in there, how best to represent whatʼs in there on the repository
technology platform (whatever that is), and how to move the old representation into the new one.
That... looks suspiciously like work. No, look, I do it -- trust me, itʼs work.
Production is a Taylorist’s dream.
      Photo: http://www.flickr.com/photos/peasap/655111542/




Where I work weʼre starting to think and talk very seriously about this, because our digital-library
processes are very Taylorist, and weʼre realizing that thatʼs not serving us well as smaller and more
specialized projects come our way. Everything right down to how we BUDGET PROJECTS is going
to have to change. Honestly, weʼre finding this a struggle -- but a necessary one, and one that I am
proud to say that weʼre confronting head-on.
when it isn’t a Taylorist’s nightmare.
       Photo: http://www.flickr.com/photos/elsie/97542274/




Some of you are looking at me right now with utter bemusement. Your digital-library production isnʼt
Taylorist at all! You only WISH it were. What it is, is completely ad-hoc. Something interesting comes
in, you build a way to deal with it, you slap it up on the Web somehow or other, problem solved.
Many digital libraries are project silos.




      http://www.brown.edu/Departments/Italian_Studies/dweb/dweb.shtml




And thus are born project silos, both inside and outside libraries!

One of the problems with project silos is that they arenʼt replicable across libraries and institutions...
and the last thing any of us need is to reinvent the wheel! If youʼve never looked at Decameron
Web, I love it, check it out -- thereʼs some nice TEI-based UI in there. (Google for it; the URL is
really long.) But I canʼt build DanteWeb or CervantesWeb based on DecameronWeb; the innards of
DecameronWeb are opaque to me. It should be easier.

And another problem: project silos arenʼt part of the web. (Hi, Dan Chudnov.) Itʼs what I saw called a
“cabinet of curiosities” in an article I was reading. Nice to look at; impossible to really work with.
Now, this isnʼt entirely the fault of library technology. Itʼs partly the fault of librarians who natter on
about “context” as though it were the be-all and end-all. My belief is that context is fluid, not fixed;
itʼs constantly being built and rebuilt, rather than something trapped like a fly in amber. We have to
expose our digital objects so that they can appear in entirely NEW contexts. Thatʼs not
decontextualization! Itʼs RE-contextualization, and cabinets of curiosities donʼt allow it.
Many are content-specialized.
                Presentation is content-specific.




For each project silo, its own user interface. Books browse differently from maps, which browse
differently from finding aids. Right? I wonder. How can we maintain all this UI code?

Now, Iʼm the last person to tell you to build The One User Interface To Rule Them All. Not possible!
As I said earlier, data have affordances, ways they want to be interacted with, and we absolutely
need to respect that. However, itʼs possible to go too far in the other direction, building interfaces so
content-specific that the content winds up in a cage of jargon and non-interoperability. Thatʼs where I
think we are in digital libraries, and itʼs a problem.
Data are project-based.




     http://www.exploringthehyper.net/




These practices have a lot in common with what our researchers do! Everything is its own project
with its own technology stack and its own silo.

Well, this isnʼt workable. Itʼs wasteful duplication of technical effort, for one thing; why build -- oh, a
tagging infrastructure -- more than once? It also creates huge headaches for discovery processes
and especially for digital preservation. The more interaction you have to preserve, and the more
different ways itʼs coded, the more lines of code weʼre all maintaining, and who needs more lines of
code to maintain?

But itʼs happening anyway, and if weʼre serious about data weʼre going to have to deal with the
result.
when it isn’t a Taylorist’s nightmare.
       Photo: http://www.flickr.com/photos/elsie/97542274/




Now Iʼm going to go a little Cassandra on you -- we have already lost a lot of digital projects to the
project-silo problem, particularly in the digital humanities. Some of those projects were ours:
developed in libraries, but not sustainably. I predict with absolute confidence we will lose more such
projects. There is a crying need for academic librarianship to develop a coordinated, collaborative
rescue effort for early digital projects, if only to stem the bleeding.

On a happier note, if we DO take the trouble to rescue our own projects, we will learn a LOT about
rescuing other peopleʼs. I think that that learning process all by itself should be incentive for forward-
thinking academic libraries and librarians to start undertaking rescue efforts.

So thatʼs where we are with digital libraries, and where I think our practices are going to come up
short in the new data world.
What do we know about these?
What about institutional repositories?
We’re caged up




                                         inside our institutions.
      Photo: http://www.flickr.com/photos/annia316/115439737/




Well, first you need to understand what Stevan Harnadʼs cat was hunting.

No, okay, seriously. (*CLICK*) The word “institutional” is becoming a serious problem. I would argue
it always was. In my worklife, if I run into digital objects needing archival, I cannot go anywhere near
them until I prove a link to one or more faculty members in my home institution, and the weaker
that link is, the more red tape and bureaucracy I have to go through to get permission to help with
the project -- no matter how important I think that project may be.
Data are already out there.




       Photo: NASA (via http://nasaimages.org/), “Multiwavelength M81”




The problem is most acute for already-existing data. For example, think about what happens when a
researcher leaves your institution for a different one. Their institutional web presence tends to
remain behind. There may be valuable data there. But can the IR get involved, if the researcher
doesnʼt have a connection to the institution any more?

And, of course, it means that data at institutions without IRs just fall between the cracks. Definitely
not ideal, not what we want.
Data are sloppy.



This is another aspect of data sloppiness. A lot of them donʼt clearly belong to one institution, or
indeed to ANY one institution!

Consider something like a disciplinary data or e-print repository. One of those just came up for
rescue, the anthropology repository known as Manaʼo. Would I, as an IR manager, like to rescue it?
Sure! Do I have the technical capacity to do it? Mmmmostly; I could at least take a stab at it. Can I
do the rescue? Oh, heck no. Not in my remit. Iʼm not allowed. Itʼs not institutional data, so I canʼt
touch it.

So it follows, at least to me, that if weʼre going to grapple with data in our institutions, we will have to
give up on the purely inward-looking focus that IRs have had. Maybe different institutions will
choose disciplinary specialties to focus on. Maybe weʼll just drop the idea that data have to originate
within our institution before the institution is interested in them. I donʼt know. But if IRs are going to
play in the data space, something in the policy environment has to give.
We’re caged up




                            inside our institutions.
This restriction, this institutional cage, is an artifact of the scholarly publishers; itʼs not something
libraries invented. Some publishers allow self-archiving only in “institutional” web presences. If an IR
opens itself to a lot of stuff that doesnʼt have strong and obvious ties to the institution, it is opening
its institution to a very real legal risk, a risk that some publishers will sue the institution, making the
argument that itʼs not an “institutional” repository any more because it contains non-institutional
content.

But the reality is that research does not stop at institutional borders. And the more that IRs cling to
that institutional cage, the less we can actually DO to salvage and protect research data.
Photo: http://commons.wikimedia.org/wiki/File:Black_Ford_Model_T_in_HK.JPG




                                           Any color you want...




Unlike digital libraries, at least in theory, IRs were supposed to accept any kind of digital content or
data at all! But the snag there is that theyʼre not really designed for it; theyʼre optimized for research
papers. So in practice, you get the famous Henry Ford statement about Model T cars: you can have
any color you want, as long as itʼs black!

Iʼll show you what I mean.
Bring it on; we’ll take anything!




                       ... as long as it’s static and final.
      Photo: http://www.flickr.com/photos/orblivio/146691405/




The “weʼll take anything” promise is broken and has always been broken. Weʼd take anything
IMMUTABLE. I use this photo advisedly, because for a lot of faculty, once something they produce is
static and final and immutable, itʼs junk! Itʼs out of their sight and they donʼt care about it any more.
So it never gets deposited in the IR to begin with, which means nobodyʼs taking care of it. The
researcher sure isnʼt; itʼs old news.
Photo: http://www.flickr.com/photos/jonevans/1032687817/




                          It’s there to be interacted with.




The ʻstatic and finalʼ model is absolute garbage for interactive data. Itʼs especially garbage if
interacting with the data is one of the ways that the data are made more reliable! Maybe the first
reduction of the data is wrong. If we then canʼt change it because our repository only handles whatʼs
final and static... we are not serving the need here.
Data are already out there.




       Photo: NASA (via http://nasaimages.org/), “Multiwavelength M81”




Itʼs also not ideal for whatʼs already out there. We *know* a lot of that stuff is in bad shape. But if we
wait to ingest it until we can clean it up into an acceptable final form -- we may lose it altogether.
Data are wildly diverse in nature...




               ... as are their technical environments.
      Photo: http://www.flickr.com/photos/28481088@N00/670258156/




IRs are LOUSY at dealing with the diversity of data. Iʼll have a few more words about this later, but
for now Iʼll just state the obvious: putting research data into a user-interface optimized for research
PAPERS is a total loser. Papers have built up a lot of uniformity over the centuries weʼve had
journals. Data are a whole different story.
Bring it on; we’ll take anything!




                      ... as long as it’s static and final.
      Photo: http://www.flickr.com/photos/peasap/655111542/




Again, all this is a profoundly HUMAN problem, and another place where the technology we created
has an impedance mismatch with the way researchers actually work and think.

Richardʼs Q&A session yesterday brought up a key problem with the static-and-final idea:
sometimes you THINK something is static and final when itʼs really, really not. And some things are
just not even MEANT to be static and final!

DSpace, for example, assumes the static-and-final, so much so that it makes correction of an item
already ingested into DSpace difficult and perhaps impossible unless youʼre the systems
administrator. How much time I have wasted swapping out files for people, you really donʼt want to
know. Fedora users, donʼt get smug here, because Fedora has similar problems.

We canʼt DO that with data. Humans are imperfect. The artifacts that we produce are imperfect and
incomplete. Our systems need to accept and work with that imperfection, allowing us to work
TOWARD perfection, conscious that weʼll never quite get there.
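
To make "accept and work with imperfection" concrete, here is a minimal sketch of an item that treats every correction as a new version rather than an overwrite. It is not how DSpace or Fedora work; `VersionedItem`, `Version`, and the sample identifier are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib


@dataclass(frozen=True)
class Version:
    """One immutable snapshot of a deposited file."""
    number: int
    sha256: str
    note: str
    deposited_at: datetime


@dataclass
class VersionedItem:
    """Hypothetical repository item that keeps every version it is given."""
    identifier: str
    versions: list[Version] = field(default_factory=list)

    def deposit(self, content: bytes, note: str = "") -> Version:
        # A correction never overwrites anything; it appends a new version.
        version = Version(
            number=len(self.versions) + 1,
            sha256=hashlib.sha256(content).hexdigest(),
            note=note,
            deposited_at=datetime.now(timezone.utc),
        )
        self.versions.append(version)
        return version

    @property
    def current(self) -> Version:
        if not self.versions:
            raise LookupError(f"{self.identifier} has no deposits yet")
        return self.versions[-1]


item = VersionedItem("squirrel-measurements")
item.deposit(b"femur_mm,41.2\n", note="first reduction")
item.deposit(b"femur_mm,42.1\n", note="corrected transcription error")
print(item.current.number, item.current.note)
```

The depositor can fix a mistake without a systems administrator swapping files, and the earlier version stays on record.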

Librarians tend to HATE this point of view. Weʼre all about the static and final and authoritative. I am
here to say we have to get over our bad selves. HAVE to, if weʼre going to do justice to research
data.
Right, anything you’ve got!




                                                ... one file at a time.
     Photo: http://www.flickr.com/photos/jetalone/39990302/




So, IRs promise to take anything youʼve got, anything at all -- but you have to put it in one file at a
time, like coins into a glass piggy bank.
There’s a lot of data.




      Photo: http://www.flickr.com/photos/noelzialee/2126153623/




Putting data into repositories one file at a time, MANUALLY, is like emptying the ocean into a bucket
with an eyedropper. Not gonna fly.
Data are already out there.




       Photo: NASA (via http://nasaimages.org/), “Multiwavelength M81”




And since data are already out there, we have to make it easy to dump in large quantity into our
buckets. That means more APIs and protocols. SWORD is good, I like SWORD, I love what I heard
about BagIt yesterday -- but honestly, itʼs got to be even easier than that. I want researchers to be
able to push the “Archive It!” button and have it just silently, seamlessly, WORK.
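
As a rough idea of what could sit behind such a button, the sketch below packages a whole directory in one call, in the general style of a BagIt bag: a `data/` payload directory, a `bagit.txt` declaration, and a checksum manifest. It is a simplified illustration, assuming SHA-256 and version 0.97 of the BagIt draft; `make_bag` is a hypothetical helper, not part of any BagIt library, and it skips optional tag files and validation.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Copy source_dir into bag_dir/data and write BagIt-style tag files."""
    payload = bag_dir / "data"
    # copytree creates bag_dir and data/; it fails if data/ already exists.
    shutil.copytree(source_dir, payload)

    # Bag declaration: spec version and encoding of the tag files.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <relative path>" line per file.
    lines = []
    for path in sorted(p for p in payload.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}")
    manifest = bag_dir / "manifest-sha256.txt"
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest
```

Calling `make_bag(Path("fieldwork"), Path("fieldwork-bag"))` (both paths hypothetical) would bag every file under `fieldwork` at once, with no per-file form to fill in.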
Any look and feel...




And IRs promise that you can customize the look and feel, but in practice, itʼs too hard. How many
people in here can tell a DSpace from an EPrints install just by looking at the front page of the site?
I sure can.

And anyway, what you get even when you customize is this very sterile, boring, libraryish look and
behavior; itʼs not appealing to the researchers whose hearts and minds we need to capture. Look, I
did this redesign for MINDS@UW, I am hoisting myself on my own petard -- but gosh, we need to
do better than this!
Data are project-based.




     http://www.exploringthehyper.net/




Look at this gorgeous little site! Isnʼt it appealing? If I promise the researcher here that Iʼll take care
of her data forever and ever at the cost of it losing all its visual appeal and its individualized usability,
is she going to take me up on that? I wouldnʼt take me up on that!

So this becomes a content-recruitment problem; researchers see IRsʼ ugly, pathetic little one-horse
interfaces and interaction patterns and they run screaming in the opposite direction.
Data are wildly diverse in nature...




               ... as are their technical environments.
      Photo: http://www.flickr.com/photos/28481088@N00/670258156/




I know I keep coming back to this data-diversity issue like a broken record, but so much of our
infrastructure just fails when confronted with it. One interface does NOT fit all.
Any look and feel...




There is some experimentation happening in IR space. Manakin for DSpace making collection-
based theming possible was definitely a step forward, though perhaps not enough of one; too much
of the page-construction logic still lives in Java. The KULTUR project in the UK is adapting EPrints to
be appealing to visual and performing artists. All of this is good and we need more of it, but I think
we have to confront a wider issue: building our platforms with enough flexibility to be easy to
customize for as much variation as we can manage.

We also need to make it easy for people to construct their own look and feel on top of our stuff, or
just with our stuff in it where that makes sense. Our silos really get in the way of that now, and itʼs a
problem.
Any metadata you want!




                      ... as long as it’s key-value pairs.
      Photo: http://www.flickr.com/photos/rattodisabina/2460905893/




I hate this. It drives me insane. All the marvelous work being done with linked data, XML, semantic
webby sorts of things, and all I can have in my IR is key-value pairs? What is up with that?
Data are wildly diverse in nature...




               ... as are their technical environments.
      Photo: http://www.flickr.com/photos/28481088@N00/670258156/




The diversity of data environments includes diversity in metadata; Iʼm sure thatʼs a surprise to no
one. It also means a diversity of metadata content models, well beyond key-value pairs.
Data are already out there.




       Photo: NASA (via http://nasaimages.org/), “Multiwavelength M81”




Imagine the ideal data project. Itʼs already well-described in an elaborate schema and well-
organized. Are we seriously going to tell the provider that they have to dumb it down to key-value
pairs before we can take it? Seriously? Gosh, I hope not.
Any metadata you want!




              ... as long as it’s key-value pairs.
Reality check: anybody developing a metadata standard these days expresses it in XML or RDF or
both. Key-value pairs donʼt cut it, and arguably never did.
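
A small sketch of why flat pairs fall short: the record below is made up (the survey title, variables, and instruments are all hypothetical), but it shows how squashing a nested description into repeatable key-value pairs throws away the link between each variable and its unit.

```python
# Hypothetical structured record: each measurement carries its own unit.
record = {
    "title": "Lake survey 2009",
    "measurements": [
        {"variable": "temperature", "unit": "K", "instrument": "probe A"},
        {"variable": "depth", "unit": "m", "instrument": "sonar B"},
    ],
}


def flatten(record: dict) -> list[tuple[str, str]]:
    """Squash the record into repeatable key-value pairs, as a flat schema would."""
    pairs = [("title", record["title"])]
    for measurement in record["measurements"]:
        for key, value in measurement.items():
            pairs.append((key, value))
    return pairs


pairs = flatten(record)

# A flat store only knows "all the units" and "all the variables".
# If it returns each list sorted, "K" lines up with "depth".
variables = sorted(value for key, value in pairs if key == "variable")
units = sorted(value for key, value in pairs if key == "unit")
print(list(zip(variables, units)))
```

Nothing in the flat pairs says which unit belongs to which variable; only the nested form (or XML, or RDF) carries that relationship.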
Do anything you want...




                             ... as long as it’s “download.”
      Photo: http://www.flickr.com/photos/procsilas/306417902/




IRs can take in digital files and they can give them back. Honestly, thatʼs pretty much all they can
do.
Photo: http://www.flickr.com/photos/jonevans/1032687817/



                Data are there to be interacted with.




This just KILLS us with interaction. It kills us! A lot of these data need APIs. If weʼre not providing
them, honestly, we might as well not bother.

Interact with data? In an institutional repository? Mash it up with something else? Heavens forfend --
that would imply that digital objects are somehow related to each other, and thatʼs just crazy talk.

So we have a lot of interface and API work to do. A lot of it!
Content models




                                                              Enough said.
Hereʼs a real-world example of the difficulty.

A project Iʼm helping with for the UW-Madison Zoology Museum involves a teaching collection of
animal skeletons that students measure and do comparisons on. Weʼre photographing those and
whomping up an interface that lets students do that measurement work digitally. Saves wear and
tear on fragile realia, allows distance students to participate fully, and, we hope, creates an archive
thatʼs useful outside our campus borders.

Weʼre using Fedora for this, and the content modeling gets complicated. We have a specimen --
say, a squirrel -- which has any number of actual bones, and each bone may have several photos in
various views, and this matters as far as “where do we hang which metadata” and “what do you
want people to find in a search?” and “how do you display on a specimen page all of its component
bones and views?”
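
The shape of that problem can be sketched in a few lines. This is plain Python, not Fedora's content-model machinery, and every class, field, and file name here is an assumption made for illustration. The point is only that metadata can hang at three different levels, and that a specimen page has to walk all of them.

```python
from dataclasses import dataclass, field


@dataclass
class View:
    label: str        # e.g. "lateral"
    image_file: str   # hypothetical file name
    metadata: dict[str, str] = field(default_factory=dict)


@dataclass
class Bone:
    name: str
    views: list[View] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)


@dataclass
class Specimen:
    common_name: str
    bones: list[Bone] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)

    def page_listing(self) -> list[str]:
        """Everything a specimen page would need to show, one line per view."""
        return [
            f"{self.common_name} / {bone.name} / {view.label}"
            for bone in self.bones
            for view in bone.views
        ]


squirrel = Specimen(
    "squirrel",
    bones=[
        Bone("femur", views=[View("anterior", "femur-a.jpg"),
                             View("lateral", "femur-l.jpg")]),
        Bone("skull", views=[View("dorsal", "skull-d.jpg", {"scale_bar": "10 mm"})]),
    ],
    metadata={"collection": "teaching"},
)
print("\n".join(squirrel.page_listing()))
```

Whether a search should return the specimen, the bone, or the individual view is still a human decision; the model just makes room for any of the three.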

So, those have been a lot of entertaining (and sometimes macabre) conversations. Imagine for a
moment that this had been a DSpace project. *CLICK* Here is the content model for DSpace. The
ONLY content model. Community is not even relevant here; collection kinda-sorta fits but not really,
so weʼre left with items, bundles (whatever the heck they are), and bitstreams. And only items can
have metadata! (*CLICK*) I donʼt need to say any more. DSpace, which is running the lionʼs share
of institutional repositories in the United States, is completely functionally inadequate as a serious
data bucket! So much for the IR.
So where does all that leave us?




        Photo: http://www.flickr.com/photos/library_of_congress/2162653769/




Hopefully not with a trainwreck!
Photo: http://www.flickr.com/photos/jonevans/1032687817/



                       We need bigger, better buckets.




I love the idea of just grabbing a bucket and going after data. I admit itʼs probably an 80/20 thing;
thereʼs 20% of the problem-space weʼre looking at that we cannot realistically solve. But I know -- I
KNOW -- we havenʼt served 80% of our users or 80% of our potential content. We can do better,
and we need to.
Silos are both necessary




                                               and unacceptable.
      Photo: http://www.flickr.com/photos/jojakeman/2818910104/




At some level data are all bits -- and at that level, silos tend to be counterproductive and stupid. We
shouldnʼt have to build a checksum engine for sixteen different silos! Where I am, weʼre working
toward combining our digital library and our institutional repository on a single technical
infrastructure, because it just makes sense to do that.

But because data come in thirty-six flavors and then some, once you get above the pure-bits level
itʼs unrealistic to think that we can design one silo that will work equally well for everything. Our
infrastructure has to be flexible, it has to have APIs that other people can build on as well as
ourselves, and it should make the most of the commonalities we *do* find in wildly diverse and
heterogeneous data. Homogeneity whenever possible, flexibility where necessary: that needs to be
our motto as we build these systems.
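
One way to picture "homogeneity whenever possible": a single fixity check written once against a tiny interface, then pointed at every silo. This is a sketch under assumptions. `Store`, `DictStore`, and `failed_fixity` are hypothetical names, and the in-memory store stands in for whatever a real silo would be.

```python
import hashlib
from typing import Iterable, Protocol


class Store(Protocol):
    """Anything that can list object ids and hand back their bytes."""
    def ids(self) -> Iterable[str]: ...
    def read(self, object_id: str) -> bytes: ...


class DictStore:
    """Stand-in for a real silo; keeps objects in memory."""
    def __init__(self, objects: dict[str, bytes]) -> None:
        self._objects = objects

    def ids(self) -> Iterable[str]:
        return self._objects.keys()

    def read(self, object_id: str) -> bytes:
        return self._objects[object_id]


def failed_fixity(store: Store, expected: dict[str, str]) -> list[str]:
    """Return ids whose current SHA-256 no longer matches the recorded one."""
    return [
        object_id
        for object_id in store.ids()
        if hashlib.sha256(store.read(object_id)).hexdigest() != expected.get(object_id)
    ]


papers = DictStore({"thesis.pdf": b"%PDF-1.4 ..."})
images = DictStore({"femur.jpg": b"\xff\xd8\xff ..."})
recorded = {
    "thesis.pdf": hashlib.sha256(b"%PDF-1.4 ...").hexdigest(),
    "femur.jpg": hashlib.sha256(b"bytes before bit rot").hexdigest(),
}
for silo in (papers, images):
    print(failed_fixity(silo, recorded))
```

The checksum code never learns what a paper or a photograph is; the flexibility lives above it, in whatever each silo does with its own content.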
We have a lot of modeling to do.




                                             And meta-modeling.
     Photo: http://www.flickr.com/photos/crobj/727348790/




Again, because of data diversity, the content-modeling exercise I talked about with the zoology
skeletons will have to be replicated, over and over and over again, as new kinds of data come our
way. I donʼt know if this scares you -- it sure scares me. Add standardization processes on top of
this, because we can expect some kinds of research data to develop standards, and it gets even
scarier.

Fundamentally, we need more efficient ways to do this work -- a sort of meta-model for content
modeling, if you will. I donʼt know how that can work; I just know it has to.
We have a lot of code to write.
      Photo: http://www.flickr.com/photos/fienna/170559081/




Should be uncontroversial. I know a lot of you are already writing this code! Thank you. Now share it
with the rest of us, please, because...
We can’t code or model in isolation.
     Photo: http://www.flickr.com/photos/naus3a01/240614578/




... here is another Cassandraic dire warning. We cannot possibly *hope* to keep up with the data
flood if weʼre all making our own little content models and coding up discovery and dissemination
frameworks in isolation. Why should anyone out there have to decide how to represent skeletons
and bones? At Wisconsin weʼve done that for you!

“We love open source; no, you canʼt have our code” is not gonna fly any longer, folks. We have no
choice but to figure out how to share code better. Whatʼs more, we have to figure out how to share
code with people no more technically inclined than I am, and perhaps less. Now, just a little bit about
me: I hate Java. I am violently allergic to Tomcat. I donʼt even like man pages! Can you build a
system for me? Now think about the vast DSpace install base out there. Think about how many of
those installs happened because DSpace was supposed to be an out-of-the-box solution. Now think
about how weʼre going to migrate these people to something more flexible. Scared yet? I am!

Brian Owen talked yesterday about how HARD it is to solve these collaboration problems. I agree
with him! Itʼs hard. Itʼs a human problem, and human problems are hard. The problem is, all the
ALTERNATIVES to solving this problem are even harder. Weʼve GOT to fix library-technology
collaboration.
Fedora is the new world.




                                 But Fedora must change.
      Photo: http://www.flickr.com/photos/mythwhisper/3361907495/




Of the digital-library and institutional-repository platforms out there today, I think Fedora is the horse
to bet on. Itʼs the only one that comes close to the storage and presentation flexibility needed for a
big data bucket, and I think the data buckets such as RepoMMan that have already been built atop it
are all by themselves a pretty good indicator that it is the future.

But Fedora needs to make some changes -- some technical, some social.

Content models, service definitions, and their associated code need to be pluggable, to avoid the
wheel-reinvention Iʼve said we canʼt afford. I donʼt entirely know how this needs to work, though the
plugin and mod structures for projects like Drupal and WordPress may be models. I do know that it
does need to work, or weʼre all going to drown in our buckets. And then we have to build the social
scaffolding to actually share these pieces of code, which may turn out to be harder than the actual
technology!
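
For a sense of what "pluggable" might mean at its smallest, here is a sketch of a registry where a content model and its rendering code are dropped in by name. It does not reflect how Fedora, Drupal, or WordPress actually load plugins; `content_model`, `RENDERERS`, and the model name are hypothetical.

```python
from typing import Callable

# Maps a content-model name to the function that renders objects of that kind.
RENDERERS: dict[str, Callable[[dict], str]] = {}


def content_model(name: str):
    """Decorator that plugs a renderer into the registry under `name`."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        RENDERERS[name] = func
        return func
    return register


@content_model("skeleton-specimen")
def render_specimen(obj: dict) -> str:
    return f"{obj['common_name']}: {len(obj['bones'])} bones"


def render(obj: dict) -> str:
    """Dispatch on the object's declared model; fall back to plain download."""
    renderer = RENDERERS.get(obj.get("model", ""))
    return renderer(obj) if renderer else f"download {obj.get('id', 'object')}"


print(render({"model": "skeleton-specimen",
              "common_name": "squirrel",
              "bones": ["femur", "skull"]}))
print(render({"id": "item-42"}))
```

Someone else's skeleton model could then be installed by importing one module, instead of being re-invented at every institution.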

Fedora also made the same mistake DSpace did with regard to the editability and replaceability of
objects. Getting stuff into Fedora you can do with what Fedora hands you. Removing stuff, you can
do. Editing stuff? No, unless you want to edit XML as text in an incredibly clunky and ugly Java app.
Replacing an object with a better object? No. This is not gonna fly, Fedora. It needs to be fixed as
soon as possible.

We also have to put easier tools on top of Fedora, both on the data-producing and data-consuming
ends. Thatʼs being worked on: Islandora, Omeka-over-Fedora, lots of things, and that is all to the
good. Fundamentally, we have to figure out ingest straight from whatever unholy mess a researcher
has, and we have to be able to translate the affordances of a particular dataset easily into our
systems. I donʼt think either of those is solved yet, though RepoMMan comes close; even the SWORD
Photo: http://www.flickr.com/photos/werwin15/3554539197/




                                             Focus on the start...




                  ... not so much the finished product.
... you canʼt curate what you donʼt HAVE. Fundamental truth that the IR experience should have
taught us. If our systems donʼt invite deposits -- even sloppy ones, even unfinished ones, even bad
ones by any measurement -- and if they donʼt do it as early as possible in the research process, so
that researchers donʼt get fixated on some other software system, thereʼs no point to having
research-data repositories at all. I know this goes against the grain, hundreds of years of library
perfectionism, but Iʼm afraid thatʼs just too bad! If weʼre playing in this space, we have to be ready to
make some mud pies.

Iʼve always, always loved the RepoMMan project for this reason -- if youʼre googling for it, it has two
“m”s -- and I also really like what the California Digital Library is building. Theyʼre starting with the
good old filesystem, which we all know and more or less love, and theyʼre enhancing it into a
curation system. Itʼs an approach I think will bear fruit, and thatʼs because theyʼre starting from the
right place: where people actually do their work.
Solr brings it all together
     Photo: http://www.flickr.com/photos/chantrybee/2911840052/




Now, to end on a positive note, I love the Solr app, and I think itʼs a marvelous example of the kind
of lightweight tool that does really heavyweight things. The beauty of Solr is that once Iʼve solved
the intellectual problem of “what metadata do I want to expose for search and browse?” Solr makes
expressing that in a crosswalk just stunningly, beautifully trivial -- and then I never have to worry
about it again for that flavor of metadata. There is complexity under the hood, but our experience at
Wisconsin has so far been that you donʼt encounter that complexity until you actually need it, which
is just perfect.
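
The crosswalk idea can be sketched without touching Solr at all: decide once which metadata fields are exposed, then turn every record of that flavor into a flat document for the index. The source keys below are DSpace-style qualified Dublin Core names; the index field names, the `to_index_document` helper, and the sample record are assumptions for illustration, and nothing here calls a real Solr API.

```python
# Hypothetical crosswalk: source metadata key -> index field name.
CROSSWALK = {
    "dc.title": "title",
    "dc.contributor.author": "creator",
    "dc.date.issued": "date",
    "dc.subject": "subject",
}


def to_index_document(object_id: str, metadata: list[tuple[str, str]]) -> dict:
    """Build a flat document for a search index from repeatable metadata pairs."""
    doc: dict = {"id": object_id}
    for key, value in metadata:
        index_field = CROSSWALK.get(key)
        if index_field is None:
            continue  # not exposed for search or browse
        doc.setdefault(index_field, []).append(value)
    return doc


doc = to_index_document(
    "minds-1",
    [
        ("dc.title", "Squirrel skeleton measurements"),
        ("dc.subject", "zoology"),
        ("dc.subject", "osteology"),
        ("dc.description.provenance", "internal note"),
    ],
)
print(doc)
```

Once the mapping table exists, adding another ten thousand records of the same flavor costs nothing extra; only a new flavor of metadata needs a new table.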
... the




      Vermeer: the Muse Clio, from “The Allegory of Painting”
                                                                of Data Curation.
So thatʼs what I have to tell you. If Iʼve helped you see some of these problems in a new way, if Iʼve
expressed them usefully, such that they get solved, perhaps Iʼll get to stop being Cassandra -- and
instead become the Clio of data curation. Hereʼs hoping.
Thank you!



This presentation is available under a Creative
Commons Attribution 3.0 United States license.

Contenu connexe

Tendances

Miscellaneous Connections
Miscellaneous ConnectionsMiscellaneous Connections
Miscellaneous ConnectionsMal Booth
 
Closing Plenary: Museums and the Web Asia
Closing Plenary: Museums and the Web AsiaClosing Plenary: Museums and the Web Asia
Closing Plenary: Museums and the Web AsiaGeorge Oates
 
Gavin Bell Toc09 Long Tail Needs Community Sm
Gavin Bell Toc09 Long Tail Needs Community SmGavin Bell Toc09 Long Tail Needs Community Sm
Gavin Bell Toc09 Long Tail Needs Community SmGavin Bell
 
The surprising adventures of the mechanical curator
The surprising adventures of the mechanical curatorThe surprising adventures of the mechanical curator
The surprising adventures of the mechanical curatorbenosteen
 
Where are Repository's Going?
Where are Repository's Going?Where are Repository's Going?
Where are Repository's Going?benosteen
 
MW2011: Cope, A., Authority Records, Future Computers and Other Unfinished Hi...
MW2011: Cope, A., Authority Records, Future Computers and Other Unfinished Hi...MW2011: Cope, A., Authority Records, Future Computers and Other Unfinished Hi...
MW2011: Cope, A., Authority Records, Future Computers and Other Unfinished Hi...museums and the web
 
UKSG 2015 Mechanical curator and British Library labs
UKSG 2015  Mechanical curator and British Library labsUKSG 2015  Mechanical curator and British Library labs
UKSG 2015 Mechanical curator and British Library labsbenosteen
 
Academic Library Journal Panic
Academic Library Journal PanicAcademic Library Journal Panic
Academic Library Journal PanicTeam 144L
 
Introduction to Semantic Web
Introduction to Semantic WebIntroduction to Semantic Web
Introduction to Semantic WebIvan Herman
 
“New spaces, activities and challenges: village kids in the library”
“New spaces, activities and challenges: village kids in the library”“New spaces, activities and challenges: village kids in the library”
“New spaces, activities and challenges: village kids in the library”bridgingworlds2008
 
Social Media for the Scared February 2014
Social Media for the Scared February 2014Social Media for the Scared February 2014
Social Media for the Scared February 2014Bex Lewis
 
Closing Plenary: National Digital Forum
Closing Plenary: National Digital ForumClosing Plenary: National Digital Forum
Closing Plenary: National Digital ForumGeorge Oates
 
IA Isn't New, or: What would Samuel Pepys' website look like?
IA Isn't New, or: What would Samuel Pepys' website look like?IA Isn't New, or: What would Samuel Pepys' website look like?
IA Isn't New, or: What would Samuel Pepys' website look like?James Aylett
 
Scanned and Delivered: How the DHLab made remote research work
Scanned and Delivered: How the DHLab made remote research workScanned and Delivered: How the DHLab made remote research work
Scanned and Delivered: How the DHLab made remote research workYHRUploads
 
Web 2.0: How to Stop Thinking and Start Doing: Addressing Organisational Barr...
Web 2.0: How to Stop Thinking and Start Doing: Addressing Organisational Barr...Web 2.0: How to Stop Thinking and Start Doing: Addressing Organisational Barr...
Web 2.0: How to Stop Thinking and Start Doing: Addressing Organisational Barr...lisbk
 
Audio in a social Web of linked data
Audio in a social Web of linked dataAudio in a social Web of linked data
Audio in a social Web of linked dataEduserv Foundation
 

Tendances (20)

Miscellaneous Connections
Miscellaneous ConnectionsMiscellaneous Connections
Miscellaneous Connections
 
Closing Plenary: Museums and the Web Asia
Closing Plenary: Museums and the Web AsiaClosing Plenary: Museums and the Web Asia
Closing Plenary: Museums and the Web Asia
 
Gavin Bell Toc09 Long Tail Needs Community Sm
Gavin Bell Toc09 Long Tail Needs Community SmGavin Bell Toc09 Long Tail Needs Community Sm
Gavin Bell Toc09 Long Tail Needs Community Sm
 
The surprising adventures of the mechanical curator
The surprising adventures of the mechanical curatorThe surprising adventures of the mechanical curator
The surprising adventures of the mechanical curator
 
Where are Repository's Going?
Where are Repository's Going?Where are Repository's Going?
Where are Repository's Going?
 
MW2011: Cope, A., Authority Records, Future Computers and Other Unfinished Hi...
MW2011: Cope, A., Authority Records, Future Computers and Other Unfinished Hi...MW2011: Cope, A., Authority Records, Future Computers and Other Unfinished Hi...
MW2011: Cope, A., Authority Records, Future Computers and Other Unfinished Hi...
 
UKSG 2015 Mechanical curator and British Library labs
UKSG 2015  Mechanical curator and British Library labsUKSG 2015  Mechanical curator and British Library labs
UKSG 2015 Mechanical curator and British Library labs
 
Academic Library Journal Panic
Academic Library Journal PanicAcademic Library Journal Panic
Academic Library Journal Panic
 
Introduction to Semantic Web
Introduction to Semantic WebIntroduction to Semantic Web
Introduction to Semantic Web
 
“New spaces, activities and challenges: village kids in the library”
“New spaces, activities and challenges: village kids in the library”“New spaces, activities and challenges: village kids in the library”
“New spaces, activities and challenges: village kids in the library”
 
Social Media for the Scared February 2014
Social Media for the Scared February 2014Social Media for the Scared February 2014
Social Media for the Scared February 2014
 
Closing Plenary: National Digital Forum
Closing Plenary: National Digital ForumClosing Plenary: National Digital Forum
Closing Plenary: National Digital Forum
 
Metanomics Transcript Nov 11
Metanomics Transcript Nov 11Metanomics Transcript Nov 11
Metanomics Transcript Nov 11
 
Metanomics Transcript Nov 18 2009
Metanomics Transcript Nov 18 2009Metanomics Transcript Nov 18 2009
Metanomics Transcript Nov 18 2009
 
IA Isn't New, or: What would Samuel Pepys' website look like?
IA Isn't New, or: What would Samuel Pepys' website look like?IA Isn't New, or: What would Samuel Pepys' website look like?
IA Isn't New, or: What would Samuel Pepys' website look like?
 
Scanned and Delivered: How the DHLab made remote research work
Scanned and Delivered: How the DHLab made remote research workScanned and Delivered: How the DHLab made remote research work
Scanned and Delivered: How the DHLab made remote research work
 
Web 2.0: How to Stop Thinking and Start Doing: Addressing Organisational Barr...
Web 2.0: How to Stop Thinking and Start Doing: Addressing Organisational Barr...Web 2.0: How to Stop Thinking and Start Doing: Addressing Organisational Barr...
Web 2.0: How to Stop Thinking and Start Doing: Addressing Organisational Barr...
 
Audio in a social Web of linked data
Audio in a social Web of linked dataAudio in a social Web of linked data
Audio in a social Web of linked data
 
SWONtech News for July, 2012
SWONtech News for July, 2012SWONtech News for July, 2012
SWONtech News for July, 2012
 
SWONtech News Podcast for April, 2012
SWONtech News Podcast for April, 2012SWONtech News Podcast for April, 2012
SWONtech News Podcast for April, 2012
 

En vedette

Grab a bucket! It's raining data!
Grab a bucket! It's raining data!Grab a bucket! It's raining data!
Grab a bucket! It's raining data!Dorothea Salo
 
Manufacturing Serendipity
Manufacturing SerendipityManufacturing Serendipity
Manufacturing SerendipityDorothea Salo
 
So are we winning yet?
So are we winning yet?So are we winning yet?
So are we winning yet?Dorothea Salo
 
Preservation and institutional repositories for the digital arts and humanities
Preservation and institutional repositories for the digital arts and humanitiesPreservation and institutional repositories for the digital arts and humanities
Preservation and institutional repositories for the digital arts and humanitiesDorothea Salo
 
Open Sesame (and other open movements)
Open Sesame (and other open movements)Open Sesame (and other open movements)
Open Sesame (and other open movements)Dorothea Salo
 
RDF, RDA, and other TLAs
RDF, RDA, and other TLAsRDF, RDA, and other TLAs
RDF, RDA, and other TLAsDorothea Salo
 
21회+고급 1교시(어휘-쓰기b)-최종
21회+고급 1교시(어휘-쓰기b)-최종21회+고급 1교시(어휘-쓰기b)-최종
21회+고급 1교시(어휘-쓰기b)-최종Vantharith Oum
 
22회+고급 1교시(어휘-쓰기)b형
22회+고급 1교시(어휘-쓰기)b형22회+고급 1교시(어휘-쓰기)b형
22회+고급 1교시(어휘-쓰기)b형Vantharith Oum
 
15회한국어중급1교시(b형 어휘문법,쓰기)
15회한국어중급1교시(b형 어휘문법,쓰기)15회한국어중급1교시(b형 어휘문법,쓰기)
15회한국어중급1교시(b형 어휘문법,쓰기)Vantharith Oum
 
16회한국어중급2교시(b형 듣기,읽기)
16회한국어중급2교시(b형 듣기,읽기)16회한국어중급2교시(b형 듣기,읽기)
16회한국어중급2교시(b형 듣기,읽기)Vantharith Oum
 
@KhmerWikipedia's #WikiMeetup PP2 - PPT Deck 201100430
@KhmerWikipedia's #WikiMeetup PP2 - PPT Deck 201100430@KhmerWikipedia's #WikiMeetup PP2 - PPT Deck 201100430
@KhmerWikipedia's #WikiMeetup PP2 - PPT Deck 201100430Vantharith Oum
 
17회중급 1교시(어휘-쓰기b)
17회중급 1교시(어휘-쓰기b)17회중급 1교시(어휘-쓰기b)
17회중급 1교시(어휘-쓰기b)Vantharith Oum
 
18회 중급b형 1교시_정답표
18회 중급b형 1교시_정답표18회 중급b형 1교시_정답표
18회 중급b형 1교시_정답표Vantharith Oum
 
22회 중급 2교시(듣기-읽기b)
22회 중급 2교시(듣기-읽기b)22회 중급 2교시(듣기-읽기b)
22회 중급 2교시(듣기-읽기b)Vantharith Oum
 
캄보디아... 어서오세요! 20110927
캄보디아... 어서오세요! 20110927캄보디아... 어서오세요! 20110927
캄보디아... 어서오세요! 20110927Vantharith Oum
 
Save the Cows! Cyberinfrastructure for the Rest of Us
Save the Cows! Cyberinfrastructure for the Rest of UsSave the Cows! Cyberinfrastructure for the Rest of Us
Save the Cows! Cyberinfrastructure for the Rest of UsDorothea Salo
 
Open Data Day 2014, Phnom Penh - PPT Deck
Open Data Day 2014, Phnom Penh - PPT DeckOpen Data Day 2014, Phnom Penh - PPT Deck
Open Data Day 2014, Phnom Penh - PPT DeckVantharith Oum
 

En vedette (20)

Grab a bucket! It's raining data!
Grab a bucket! It's raining data!Grab a bucket! It's raining data!
Grab a bucket! It's raining data!
 
Manufacturing Serendipity
Manufacturing SerendipityManufacturing Serendipity
Manufacturing Serendipity
 
Metadata
MetadataMetadata
Metadata
 
So are we winning yet?
So are we winning yet?So are we winning yet?
So are we winning yet?
 
Preservation and institutional repositories for the digital arts and humanities
Preservation and institutional repositories for the digital arts and humanitiesPreservation and institutional repositories for the digital arts and humanities
Preservation and institutional repositories for the digital arts and humanities
 
Open Sesame (and other open movements)
Open Sesame (and other open movements)Open Sesame (and other open movements)
Open Sesame (and other open movements)
 
Occupy Copyright!
Occupy Copyright!Occupy Copyright!
Occupy Copyright!
 
RDF, RDA, and other TLAs
RDF, RDA, and other TLAsRDF, RDA, and other TLAs
RDF, RDA, and other TLAs
 
21회+고급 1교시(어휘-쓰기b)-최종
21회+고급 1교시(어휘-쓰기b)-최종21회+고급 1교시(어휘-쓰기b)-최종
21회+고급 1교시(어휘-쓰기b)-최종
 
22회+고급 1교시(어휘-쓰기)b형
22회+고급 1교시(어휘-쓰기)b형22회+고급 1교시(어휘-쓰기)b형
22회+고급 1교시(어휘-쓰기)b형
 
15회한국어중급1교시(b형 어휘문법,쓰기)
15회한국어중급1교시(b형 어휘문법,쓰기)15회한국어중급1교시(b형 어휘문법,쓰기)
15회한국어중급1교시(b형 어휘문법,쓰기)
 
16회한국어중급2교시(b형 듣기,읽기)
16회한국어중급2교시(b형 듣기,읽기)16회한국어중급2교시(b형 듣기,읽기)
16회한국어중급2교시(b형 듣기,읽기)
 
@KhmerWikipedia's #WikiMeetup PP2 - PPT Deck 201100430
@KhmerWikipedia's #WikiMeetup PP2 - PPT Deck 201100430@KhmerWikipedia's #WikiMeetup PP2 - PPT Deck 201100430
@KhmerWikipedia's #WikiMeetup PP2 - PPT Deck 201100430
 
17회중급 1교시(어휘-쓰기b)
17회중급 1교시(어휘-쓰기b)17회중급 1교시(어휘-쓰기b)
17회중급 1교시(어휘-쓰기b)
 
18회 중급b형 1교시_정답표
18회 중급b형 1교시_정답표18회 중급b형 1교시_정답표
18회 중급b형 1교시_정답표
 
캄보디아 발표
캄보디아 발표캄보디아 발표
캄보디아 발표
 
22회 중급 2교시(듣기-읽기b)
22회 중급 2교시(듣기-읽기b)22회 중급 2교시(듣기-읽기b)
22회 중급 2교시(듣기-읽기b)
 
캄보디아... 어서오세요! 20110927
캄보디아... 어서오세요! 20110927캄보디아... 어서오세요! 20110927
캄보디아... 어서오세요! 20110927
 
Save the Cows! Cyberinfrastructure for the Rest of Us
Save the Cows! Cyberinfrastructure for the Rest of UsSave the Cows! Cyberinfrastructure for the Rest of Us
Save the Cows! Cyberinfrastructure for the Rest of Us
 
Open Data Day 2014, Phnom Penh - PPT Deck
Open Data Day 2014, Phnom Penh - PPT DeckOpen Data Day 2014, Phnom Penh - PPT Deck
Open Data Day 2014, Phnom Penh - PPT Deck
 

Grab a bucket! It's raining data!

  • 1. Grab a bucket! It’s raining data! Photo: http://www.flickr.com/photos/peasap/655111542/ Dorothea Salo University of Wisconsin Access 2009 Hi there. Thanks very much to Mark Leggott for inviting me here, and to all of you for lending me your ears for a time. Youʼll have noticed that the title of this talk in the program notes is very formal and buttoned-down. ʻRepresenting and managing the data deluge.ʼ Well, okay. I am not a formal and buttoned-down person, but when Mark approached me to speak here, I was actually scared to death to accept, and so I wrote this really terribly boring title -- so like Peter, I just up and changed it. The REAL title is ʻGrab a bucket -- itʼs raining data!ʼ To hear some folks tell it, itʼs a golden age to be a digital librarian. Here we have an entire new form of scholarly publication -- digital research data -- and itʼs ours for the asking! In times when weʼre all worried about the future of libraries (and, letʼs face it, librarians), this feels heaven-sent. Grab a bucket, itʼs raining data, hallelujah!
  • 2. the... Painting: “Cassandra,” Evelyn de Morgan Photo: http://commons.wikimedia.org/wiki/File:Cassandra1.jpeg of Open Access In some quarters, I am now styled the “Cassandra of Open Access.” Cassandra, for those not up on their Greek myth, was a Trojan prophetess who was cursed such that nobody believed what she said until it was too late. Being from Troy, which was of course completely doomed, most of her prophecies were fairly dire, too. “Hey, the Greeks are about to wheel a big wooden horse into your city so they can burn it down and kill everybody!” Not happy-making stuff weʼre talking about.
  • 3. I’ve got nothing against but the reality was... Photo: http://www.flickr.com/photos/y2bk/528300692/ Some people have mistaken my Cassandra-nature for an animus against open access generally and institutional repositories in particular. Iʼve never had it in for open access! Who doesnʼt like open access? Itʼs similar to what Cory said yesterday: itʼs hard to be against an unambiguous good like open access without sounding like a total jerk... which hasnʼt stopped some publishers, of course. (*CLICK*) But Iʼve been running institutional repositories for close to five years now and the on-the-ground reality has been quite a bit...
  • 4. ... blurrier. goals? means? something for nothing? fit between content and container? fit between user needs and system? and so now, I may be becoming Photo: http://www.flickr.com/photos/jennsstuff/2965783700/ ... blurrier. Conflicting, contradictory, and in some cases flatly impossible goals. Minimal means, because of people who seem to have been reading the mythical “Frommerʼs Institutional Repositories on $5 A Day.” Asking for time, effort, and data from faculty without giving them any real service or any return on their time investment that made sense to them. We crammed things into IRs that just didnʼt fit with the very limited IR view of the digital universe, just because we hadnʼt anywhere else to put them: our content didnʼt fit in the container we had. And we completely ignored faculty needs and desires. Iʼm seeing some of the same thought and design processes happening now with regard to e-science, e-research, cyberinfrastructure, data curation, whatever you want to call it. And this troubles me. So I canʼt help but wonder if Iʼm becoming...
  • 5. the... of Data Curation? But, optimistically, itʼs early days yet. Thereʼs no reason we have to make the same mistakes with data that we made with IRs. So, I donʼt want anyone to think that Iʼm raising the problems Iʼm going to raise in this talk because Iʼm somehow AGAINST research data curation, or I think libraries shouldnʼt get involved with it. I am all for research-data curation, and I believe very strongly that libraries need to get involved. I just think we should know what weʼre getting ourselves into, and if that means Iʼm a little Cassandraic, okay, so be it.
  • 6. goals? means? something for nothing? fit between content and container? fit between user needs and system? I could spend hours talking about all these things, but I guarantee that nobody here wants to listen to me for hours. So Iʼm going to focus this talk on the fit between content and container, though I may touch on other things. Iʼm going to examine some of the qualities of typical research data, then talk about digital libraries and IRs, looking hard at some of the impedance mismatches weʼre liable to run into, and maybe strategize a little bit about how to make ourselves and our systems better now, before we run headlong into another mess. And the lens Iʼm going to be looking through is a human lens, not so much a technological lens. THIS IS NOT JUST A TECHNOLOGY PROBLEM, I canʼt say that loudly enough.
  • 7. What do we know about data? Photo: http://www.flickr.com/photos/kentbye/2053916246/ So what do we know about research data, speaking very broadly and generally?
  • 8. There’s a lot of data. Photo: http://www.flickr.com/photos/noelzialee/2126153623/ Richard talked about this yesterday, but Iʼll just reiterate: Even if we admit that the Large Hadron Collider types are probably going to take care of themselves -- and this isnʼt something I necessarily admit; I know huge, well-funded projects that are making huge messes with their data -- even if we admit that, weʼre still looking at an incredible flood of stuff. Have we got big enough buckets? I dunno. At this juncture I feel it incumbent upon me to say the word “cloud.” Cloud. There. I have said it. I now feel no need at all to say it again. Look, I understand that storage and networking are problems that have to be solved before we can do anything else. I get that. Just -- to me, itʼs necessary but not sufficient, even though it seems to be getting all the attention right now. So Iʼm going to move on to characteristics of research data that Iʼm more interested in.
  • 9. Photo: http://www.flickr.com/photos/jonevans/1032687817/ Data are there to be interacted with. One thing I think we need to keep in mind about data is that they are not an end in themselves. We donʼt keep data just to keep data; we do it because researchers can pick up shovels and dig around in the sands and build knowledge like sand castles! Data are there to do things with. To be examined, cleaned up, verified, refuted, corrected, number-crunched, mashed up with other data, graphed, charted, visualized... and if we treat them as though they were unchangeable museum objects -- look but donʼt touch, like books chained to a medieval lectern -- we are actually getting in the way of making new knowledge. If nobody can do things with data, there is no point in keeping them! Thatʼs what CC0 is about, as Richard mentioned in his Q&A session yesterday: removing legal barriers to messing about with data. We, we librarians, need to remove TECHNICAL barriers to messing about with data. Whatʼs more, different kinds of data have different affordances. You donʼt use a plastic sand-shovel to dig a rock quarry, just the way you donʼt use a backhoe to build a sand castle. The way a sociologist interacts with census data is just wildly different from the way a medical researcher interacts with MRI data. The data buckets we build will have to internalize and respect those affordances, or at the VERY least allow RESEARCHERS to build tools on top that respect those affordances.
  • 10. Data are wildly diverse in nature... ... as are their technical environments. Photo: http://www.flickr.com/photos/28481088@N00/670258156/ In other words, data are diverse, so the buckets we put them in will need to be different shapes and colors in order to respect that diversity. Now, differences in data can sometimes be skin-deep. The difference between a digital image of a sculpture and a digital image of a physics field station in Antarctica is in some ways not much for our purposes, however different our researchers may think they are. But sometimes the differences really do matter. You canʼt treat a book in TEI markup the same as a book of page-scanned images; you will be doing violence to readers of one or the other. A microscopy researcher on my campus does cell sections digitally; you can train a microscope to focus from the top of the cell all the way through and down, and then you can create a 3D cell image to play with. Itʼs really cool! But a system that treats each section image as a wholly separate and unrelated thing (*cough*DSpace*cough*), is making it impossible to get any knowledge out of those data. Think for a moment about a single bucket that works for the TEI book, the book of page scans, the images of the Antarctic field station, and the microscopy data, and youʼre starting to realize the scope of the data-diversity problem. Again, we donʼt control the technical environments our researchers are using to generate data. Some of those environments are proprietary, and Mike Rylander talked yesterday about why thatʼs a dangerous, dangerous problem. But even leaving that aside, if weʼre really, really lucky, we might have a chance to make recommendations to researchers about their data. For the most part, though, WE are the ones who will have to adapt to whatever theyʼre doing.
  • 11. Data are already out there. Photo: NASA (via http://nasaimages.org/), “Multiwavelength M81” Why is that? Weʼre not creating all the digital research data out there; the researchers are. And theyʼve created it in huge volumes already. So Iʼm really interested when Dan Chudnov says that the Library of Congress is working to capture data at world-scale and web-scale, because I want them to teach ME how to do that. So, researchers. Theyʼre not thinking long-term about the data theyʼve created. Theyʼre not thinking past the expiration of their next grant! That means we have to. Weʼre the only people with a long-term time horizon. Furthermore, theyʼre not gonna come to us; for the most part they canʼt even imagine that we can help. The inescapable corollary here is that we canʼt just sit back and wait for data to come to us -- a lot of it weʼre going to have to go out there and rescue! And I may be airing some library dirty laundry here, in which case, forgive me, but itʼs not just them -- itʼs us. WE have plenty of unsustainable digital projects sitting around our libraries. Just think for a second: how many different digital-library, repository, and storage platforms are running inside your library? I wonʼt even answer for mine; itʼs a scary large number. The stuff in those platforms is in danger. We made this mess, we librarians; we have to clean it up. As Richard said yesterday, we have to set an example with our own data! How are we going to establish ourselves as authorities in describing and organizing data if our own datastores are not in order?
  • 12. A lot of data are analog... ... but really want to be digital. Photo: http://www.flickr.com/photos/mrbill/3452943573/ So, right, back to research data. (read slide titles) For example, scientists still use paper lab notebooks. I wish they didnʼt too! The university archivist on my campus really wishes they didnʼt, because they keep trying to give him hundreds of boxes of lab notebooks that he canʼt possibly find space to store! And thatʼs just one example. Linguistic field notes, on paper. For one of the linguists Iʼve talked to, her notes are some of the only attestations of the language we have! Slides are a bugaboo in visual arts communities. Faculty have a tremendous volume of analog materials that would be of much, much greater use if they were digital. Can we scale up to that? Again, I donʼt know, and Iʼm not going to talk about this problem again. Itʼs there, we probably need to solve it, end of story.
  • 13. Data are project-based. http://www.exploringthehyper.net/ Aha. Now we get interesting. This is a dissertation; you can look at it at ʻexploring the hyper dot netʼ. It includes the underlying data (read “Explore” text). As you may be able to see at bottom right there, itʼs built on the blogging tool WordPress and the Center for History and New Mediaʼs exhibit-builder tool called Omeka. These are great tools! I love them both. But what are we librarians going to do as our dissertators pile random webtools on top of each other to build their dissertations? Thatʼs what project-based thinking gets you: total technological randomness. But our researchers think in terms of projects. The latest grant. The latest collaboration. And when it comes to technology, theyʼre not above doing something different and sui generis for every single one.
  • 14. Data are sloppy. Photo: http://www.flickr.com/photos/midorisyu/2622024163/ By the same token, faculty are not librarians. They are messy, messy people, a lot of them. Many more of them leave petty chores like, I donʼt know, organizing research materials and results -- they leave that to their grad students. This means that our data buckets are not going to fill up with nice neat orderly well-described, data-dictionaried columns of numbers. Honestly, what weʼre doing is catching sloppy leaks, and we can expect to be for a long, long time. And when our systems, library systems, only accept data thatʼs clean and pretty, we have a problem. In a word...
  • 15. Data aren’t standardized. Photo: http://www.flickr.com/photos/mikewade/3463334719/ ... data are not standardized, most of them. Thatʼs not even seen as a desideratum by data creators yet. I know most of us know this, but in the print world, the journal-article-and-book world, we have publishers to impose some kind of uniformity. Data donʼt live in that kind of world. We may yet get there, but honestly? I donʼt expect it in the length of my career. Thatʼs our trawl through some basic characteristics of research data. What do we have in libraries to throw at this problem?
  • 16. Our Big Bucket: the digital library People think that primary-source data, even big data, is a new thing to libraries. Itʼs not. We were doing big digital data before the researchers were, in the form of the digital library! What did I hear yesterday, ten terabytes of TIFFs from a single digitization project? So itʼs possible to think hey, weʼve got this solved! We just apply our existing digital-library infrastructure, human and technological, to this new problem.
  • 17. Our other Big Bucket: the institutional repository If thatʼs not enough, at the same time, weʼve been building another kind of digital bucket; weʼve called it the institutional repository. And again, some people think that IRs just solve the data problem. Magic IR pixie dust! Or something...
  • 18. Photo: http://www.flickr.com/photos/peasap/655111542/ Impedance mismatches Well, it wonʼt surprise anyone that I donʼt think thatʼs true. There is no magic pixie dust for research data curation, not in digital libraries and not in IRs. What weʼve done with digital libraries and IRs gives us a lot of the skill and knowledge we need to work with research data; I firmly believe that, though itʼs hard to find researchers who do. But weʼre going to have to do a lot of rethinking and reworking the way we do things. Otherwise, weʼll just trip all over ourselves and the impedance mismatches between the characteristics of research data and the characteristics of digital libraries and IRs. So letʼs take this a piece at a time.
  • 19. What do we know about these? Photo: http://www.flickr.com/photos/schex/193912573/ Digital libraries. You know, I love this picture, because theyʼre so proud of being digital. National DIGITAL Library. Where I am, weʼre trying to rebrand our digital collections, because we donʼt think “digital” should be what linguists call a ʻmarkedʼ state any more. Digital is ordinary, or it should be. Digital is normal. So given that, how DO you brand digital collections? --If you have an idea, see me after, okay? Anyway, what are digital libraries like? And how is that going to work with research data?
  • 20. Carefully built and tended http://www.collectionscanada.gc.ca/naskapi/index-e.html Just like our print libraries, weʼve built our digital libraries carefully, out of the best materials. Weʼre not making digital libraries out of any old thing; we SELECT what weʼre prepared to lavish effort on. And we do lavish effort! Look at this site! Itʼs a lexicon of a First Nations language, itʼs been translated INTO that language, including the fonts to represent that language -- I love this site! Itʼs beautiful!
  • 21. Data are already out there. Photo: NASA (via http://nasaimages.org/), “Multiwavelength M81” How are our thoughtful, careful collection-development policies going to cope with whatʼs already out there? How will we decide what we pick up and what we leave behind? I already see troubling signs that in the absence of better policy, cyberinfrastructure shops are deciding to help whoever has money. I donʼt like that trend and hope we in libraries challenge it. When I go to data curation workshops, most people think of data curation as “the new special collections” or “the new archives.” Understanding that we canʼt keep everything, theyʼve come up with elaborate decision mechanisms for figuring out what to keep and what to toss. Well, I think thereʼs a problem with that. Itʼs a human problem. Itʼs the faculty member who, when told youʼre not going to curate his lousy badly-designed badly-described dataset, turns around and never darkens your door again -- even when heʼs got a dataset that will revolutionize his field. How do we harmonize the need to provide good service with the need not to swamp ourselves with garbage? I donʼt know, and I suspect answers will differ, but I do know we need to figure that one out.
  • 22. Data are sloppy. Photo: http://www.flickr.com/photos/midorisyu/2622024163/ How are we going to rescue data when, by our standards, a lot of it is sloppy? Are we prepared for the work involved in rescuing other peopleʼs sloppy data? Are we prepared to let other peopleʼs sloppy data in alongside our nice clean pretty data?
  • 23. Data are project-based. http://www.exploringthehyper.net/ Are we going to pick and choose among projects? Based on their software platforms? Can we? What about dissertations, which are institutional records no matter HOW theyʼre created?
  • 24. Carefully built and tended So weʼre going to have to rethink how much and what kind of care we can and should give our data libraries. Like it or not, they canʼt all look as beautiful as this; volume and condition forbid.
  • 25. Production is a Taylorist’s dream. Photo: http://www.flickr.com/photos/villeneuve53/1808995620/ Where Iʼm from, and perhaps where youʼre from too, we like our production of digital objects, mostly but not entirely through digitization, to run like a well-oiled machine. Just like the factory floor Peter talked about. Itʼs generally more cost-effective to do things in large volumes and in systematic ways. In the States, we call this a Taylorist way of going about things -- for those who donʼt read management literature, Frederick Taylor was the guy who taught Henry Ford how to run auto production. Taylor measured how long it took people to do things, and made it so people had to make the fewest and smallest motions possible to get the work done. What Taylorist production methods mean in a digitization context, of course, is that you tend to limit the type of work that you do to what you can easily automate and train for, which in practice means only a few kinds of data per library. We do our image collections or our newspapers or our finding aids or our text collections -- we in essence specialize ourselves by data type, again for efficiencyʼs sake.
  • 26. Data are wildly diverse in nature... ... as are their technical environments. Photo: http://www.flickr.com/photos/28481088@N00/670258156/ How well is that going to serve us when weʼre not in control of the data-creation process? When the data donʼt fit into the buckets weʼve designed for our own particular digital-data specialties? If weʼre going to come to grips with data on an institutional basis, we wonʼt have the luxury of specializing any more. How are we going to cope?
  • 27. Data aren’t standardized. Photo: http://www.flickr.com/photos/mikewade/3463334719/ How can we be Taylorist about gathering and describing data when the data just arenʼt standardized? And if we canʼt be Taylorist about it, how do we keep up with the flood?
  • 28. Data are project-based. http://www.exploringthehyper.net/ How are we going to manage when thereʼs a technical-infrastructure mismatch between their project silos and our Taylorist, tailored environments? We have some choices, but none of them are particularly good. Do we pull the data out and start over, ignoring the effort put in on the original interface? If itʼs on the web, do we take a static snapshot of the original? That feels a bit to me like pinning a gorgeous butterfly through the head, killing it, to display it in a glass case, though I have to admit that I do it because I donʼt necessarily have a better option. Do we recreate the original interface, and take on the work of maintaining and improving it? Those donʼt sound like Taylorist processes to me! Iʼm frightened -- honestly scared to death -- at how many librarians do not realize that this is a problem. They really seem to think that you wave a magic wand over somebodyʼs random dataset and it miraculously shows up in a repository! It does not work that way! For every new input, somebody has to figure out whatʼs in there, how best to represent whatʼs in there on the repository technology platform (whatever that is), and how to move the old representation into the new one. That... looks suspiciously like work. No, look, I do it -- trust me, itʼs work.
  • 29. Production is a Taylorist’s dream. Photo: http://www.flickr.com/photos/peasap/655111542/ Where I work weʼre starting to think and talk very seriously about this, because our digital-library processes are very Taylorist, and weʼre realizing that thatʼs not serving us well as smaller and more specialized projects come our way. Everything right down to how we BUDGET PROJECTS is going to have to change. Honestly, weʼre finding this a struggle -- but a necessary one, and one that I am proud to say that weʼre confronting head-on.
  • 30. when it isn’t a Taylorist’s nightmare. Photo: http://www.flickr.com/photos/elsie/97542274/ Some of you are looking at me right now with utter bemusement. Your digital-library production isnʼt Taylorist at all! You only WISH it were. What it is, is completely ad-hoc. Something interesting comes in, you build a way to deal with it, you slap it up on the Web somehow or other, problem solved.
  • 31. Many digital libraries are project silos. http://www.brown.edu/Departments/Italian_Studies/dweb/dweb.shtml And thus are born project silos, both inside and outside libraries! One of the problems with project silos is that they arenʼt replicable across libraries and institutions... and the last thing any of us need is to reinvent the wheel! If youʼve never looked at Decameron Web, I love it, check it out -- thereʼs some nice TEI-based UI in there. (Google for it; the URL is really long.) But I canʼt build DanteWeb or CervantesWeb based on DecameronWeb; the innards of DecameronWeb are opaque to me. It should be easier. And another problem: project silos arenʼt part of the web. (Hi, Dan Chudnov.) Itʼs what I saw called a “cabinet of curiosities” in an article I was reading. Nice to look at; impossible to really work with. Now, this isnʼt entirely the fault of library technology. Itʼs partly the fault of librarians who natter on about “context” as though it were the be-all and end-all. My belief is that context is fluid, not fixed; itʼs constantly being built and rebuilt, rather than something trapped like a fly in amber. We have to expose our digital objects so that they can appear in entirely NEW contexts. Thatʼs not decontextualization! Itʼs RE-contextualization, and cabinets of curiosities donʼt allow it.
  • 32. Many are content-specialized. Presentation is content-specific. For each project silo, its own user interface. Books browse differently from maps, which browse differently from finding aids. Right? I wonder. How can we maintain all this UI code? Now, Iʼm the last person to tell you to build The One User Interface To Rule Them All. Not possible! As I said earlier, data have affordances, ways they want to be interacted with, and we absolutely need to respect that. However, itʼs possible to go too far in the other direction, building interfaces so content-specific that the content winds up in a cage of jargon and non-interoperability. Thatʼs where I think we are in digital libraries, and itʼs a problem.
  • 33. Data are project-based. http://www.exploringthehyper.net/ These practices have a lot in common with what our researchers do! Everything is its own project with its own technology stack and its own silo. Well, this isnʼt workable. Itʼs wasteful duplication of technical effort, for one thing; why build -- oh, a tagging infrastructure -- more than once? It also creates huge headaches for discovery processes and especially for digital preservation. The more interaction you have to preserve, and the more different ways itʼs coded, the more lines of code weʼre all maintaining, and who needs more lines of code to maintain? But itʼs happening anyway, and if weʼre serious about data weʼre going to have to deal with the result.
  • 34. when it isn’t a Taylorist’s nightmare. Photo: http://www.flickr.com/photos/elsie/97542274/ Now Iʼm going to go a little Cassandra on you -- we have already lost a lot of digital projects to the project-silo problem, particularly in the digital humanities. Some of those projects were ours: developed in libraries, but not sustainably. I predict with absolute confidence we will lose more such projects. There is a crying need for academic librarianship to develop a coordinated, collaborative rescue effort for early digital projects, if only to stem the bleeding. On a happier note, if we DO take the trouble to rescue our own projects, we will learn a LOT about rescuing other peopleʼs. I think that that learning process all by itself should be incentive for forward-thinking academic libraries and librarians to start undertaking rescue efforts. So thatʼs where we are with digital libraries, and where I think our practices are going to come up short in the new data world.
  • 35. What do we know about these? What about institutional repositories?
  • 36. We’re caged up inside our institutions. Photo: http://www.flickr.com/photos/annia316/115439737/ Well, first you need to understand what Stevan Harnadʼs cat was hunting. No, okay, seriously. (*CLICK*) The word “institutional” is becoming a serious problem. I would argue it always was. In my worklife, if I run into digital objects needing archival, I cannot go anywhere near them until I prove a link to one or more faculty members in my home institution, and the weaker that link is, the more red tape and bureaucracy I have to go through to get permission to help with the project -- no matter how important I think that project may be.
  • 37. Data are already out there. Photo: NASA (via http://nasaimages.org/), “Multiwavelength M81” The problem is most acute for already-existing data. For example, think about what happens when a researcher leaves your institution for a different one. Their institutional web presence tends to remain behind. There may be valuable data there. But can the IR get involved, if the researcher doesnʼt have a connection to the institution any more? And, of course, it means that data at institutions without IRs just fall between the cracks. Definitely not ideal, not what we want.
  • 38. Data are sloppy. This is another aspect of data sloppiness. A lot of them donʼt clearly belong to one institution, or indeed to ANY one institution! Consider something like a disciplinary data or e-print repository. One of those just came up for rescue, the anthropology repository known as Manaʼo. Would I, as an IR manager, like to rescue it? Sure! Do I have the technical capacity to do it? Mmmmostly; I could at least take a stab at it. Can I do the rescue? Oh, heck no. Not in my remit. Iʼm not allowed. Itʼs not institutional data, so I canʼt touch it. So it follows, at least to me, that if weʼre going to grapple with data in our institutions, we will have to give up on the purely inward-looking focus that IRs have had. Maybe different institutions will choose disciplinary specialties to focus on. Maybe weʼll just drop the idea that data have to originate within our institution before the institution is interested in them. I donʼt know. But if IRs are going to play in the data space, something in the policy environment has to give.
  • 39. We’re caged up inside our institutions. This restriction, this institutional cage, is an artifact of the scholarly publishers; itʼs not something libraries invented. Some publishers allow self-archiving only in “institutional” web presences. If an IR opens itself to a lot of stuff that doesnʼt have strong and obvious ties to the institution, it is opening its institution to a very real legal risk, a risk that some publishers will sue the institution, making the argument that itʼs not an “institutional” repository any more because it contains non-institutional content. But the reality is that research does not stop at institutional borders. And the more that IRs cling to that institutional cage, the less we can actually DO to salvage and protect research data.
  • 40. Photo: http://commons.wikimedia.org/wiki/File:Black_Ford_Model_T_in_HK.JPG Any color you want... Unlike digital libraries, at least in theory, IRs were supposed to accept any kind of digital content or data at all! But the snag there is that theyʼre not really designed for it; theyʼre optimized for research papers. So in practice, you get the famous Henry Ford statement about Model T cars: you can have any color you want, as long as itʼs black! Iʼll show you what I mean.
  • 41. Bring it on; we’ll take anything! ... as long as it’s static and final. Photo: http://www.flickr.com/photos/orblivio/146691405/ The “weʼll take anything” promise is broken and has always been broken. Weʼd take anything IMMUTABLE. I use this photo advisedly, because for a lot of faculty, once something they produce is static and final and immutable, itʼs junk! Itʼs out of their sight and they donʼt care about it any more. So it never gets deposited in the IR to begin with, which means nobodyʼs taking care of it. The researcher sure isnʼt; itʼs old news.
  • 42. Photo: http://www.flickr.com/photos/jonevans/1032687817/ It’s there to be interacted with. The ʻstatic and finalʼ model is absolute garbage for interactive data. Itʼs especially garbage if interacting with the data is one of the ways that the data are made more reliable! Maybe the first reduction of the data is wrong. If we then canʼt change it because our repository only handles whatʼs final and static... we are not serving the need here.
  • 43. Data are already out there. Photo: NASA (via http://nasaimages.org/), “Multiwavelength M81” Itʼs also not ideal for whatʼs already out there. We *know* a lot of that stuff is in bad shape. But if we wait to ingest it until we can clean it up into an acceptable final form -- we may lose it altogether.
  • 44. Data are wildly diverse in nature... ... as are their technical environments. Photo: http://www.flickr.com/photos/28481088@N00/670258156/ IRs are LOUSY at dealing with the diversity of data. Iʼll have a few more words about this later, but for now Iʼll just state the obvious: putting research data into a user-interface optimized for research PAPERS is a total loser. Papers have built up a lot of uniformity over the centuries weʼve had journals. Data are a whole different story.
  • 45. Bring it on; we’ll take anything! ... as long as it’s static and final. Photo: http://www.flickr.com/photos/peasap/655111542/ Again, all this is a profoundly HUMAN problem, and another place where the technology we created has an impedance mismatch with the way researchers actually work and think. Richardʼs Q&A session yesterday brought up a key problem with the static-and-final idea: sometimes you THINK something is static and final when itʼs really, really not. And some things are just not even MEANT to be static and final! DSpace, for example, assumes the static-and-final, so much so that it makes correction of an item already ingested into DSpace difficult and perhaps impossible unless youʼre the systems administrator. How much time I have wasted swapping out files for people, you really donʼt want to know. Fedora users, donʼt get smug here, because Fedora has similar problems. We canʼt DO that with data. Humans are imperfect. The artifacts that we produce are imperfect and incomplete. Our systems need to accept and work with that imperfection, allowing us to work TOWARD perfection, conscious that weʼll never quite get there. Librarians tend to HATE this point of view. Weʼre all about the static and final and authoritative. I am here to say we have to get over our bad selves. HAVE to, if weʼre going to do justice to research data.
  • 46. Right, anything you’ve got! ... one file at a time. Photo: http://www.flickr.com/photos/jetalone/39990302/ So, IRs promise to take anything youʼve got, anything at all -- but you have to put it in one file at a time, like coins into a glass piggy bank.
  • 47. There’s a lot of data. Photo: http://www.flickr.com/photos/noelzialee/2126153623/ Putting data into repositories one file at a time, MANUALLY, is like emptying the ocean into a bucket with an eyedropper. Not gonna fly.
  • 48. Data are already out there. Photo: NASA (via http://nasaimages.org/), “Multiwavelength M81” And since data are already out there, we have to make it easy to dump in large quantity into our buckets. That means more APIs and protocols. SWORD is good, I like SWORD, I love what I heard about BagIt yesterday -- but honestly, itʼs got to be even easier than that. I want researchers to be able to push the “Archive It!” button and have it just silently, seamlessly, WORK.
  • 49. Any look and feel... And IRs promise that you can customize the look and feel, but in practice, itʼs too hard. How many people in here can tell a DSpace from an EPrints install just by looking at the front page of the site? I sure can. And anyway, what you get even when you customize is this very sterile, boring, libraryish look and behavior; itʼs not appealing to the researchers whose hearts and minds we need to capture. Look, I did this redesign for MINDS@UW, I am hoist with my own petard -- but gosh, we need to do better than this!
  • 50. Data are project-based. http://www.exploringthehyper.net/ Look at this gorgeous little site! Isnʼt it appealing? If I promise the researcher here that Iʼll take care of her data forever and ever at the cost of it losing all its visual appeal and its individualized usability, is she going to take me up on that? I wouldnʼt take me up on that! So this becomes a content-recruitment problem; researchers see IRsʼ ugly, pathetic little one-horse interfaces and interaction patterns and they run screaming in the opposite direction.
  • 51. Data are wildly diverse in nature... ... as are their technical environments. Photo: http://www.flickr.com/photos/28481088@N00/670258156/ I know I keep coming back to this data-diversity issue like a broken record, but so much of our infrastructure just fails when confronted with it. One interface does NOT fit all.
  • 52. Any look and feel... There is some experimentation happening in IR space. Manakin for DSpace making collection-based theming possible was definitely a step forward, though perhaps not enough of one; too much of the page-construction logic still lives in Java. The KULTUR project in the UK is adapting EPrints to be appealing to visual and performing artists. All of this is good and we need more of it, but I think we have to confront a wider issue: building our platforms with enough flexibility to be easy to customize for as much variation as we can manage. We also need to make it easy for people to construct their own look and feel on top of our stuff, or just with our stuff in it where that makes sense. Our silos really get in the way of that now, and itʼs a problem.
  • 53. Any metadata you want! ... as long as it’s key-value pairs. Photo: http://www.flickr.com/photos/rattodisabina/2460905893/ I hate this. It drives me insane. All the marvelous work being done with linked data, XML, semantic webby sorts of things, and all I can have in my IR is key-value pairs? What is up with that?
  • 54. Data are wildly diverse in nature... ... as are their technical environments. Photo: http://www.flickr.com/photos/28481088@N00/670258156/ The diversity of data environments includes diversity in metadata; Iʼm sure thatʼs a surprise to no one. It also means a diversity of metadata content models, well beyond key-value pairs.
  • 55. Data are already out there. Photo: NASA (via http://nasaimages.org/), “Multiwavelength M81” Imagine the ideal data project. Itʼs already well-described in an elaborate schema and well-organized. Are we seriously going to tell the provider that they have to dumb it down to key-value pairs before we can take it? Seriously? Gosh, I hope not.
  • 56. Any metadata you want! ... as long as it’s key-value pairs. Reality check: anybody developing a metadata standard these days expresses it in XML or RDF or both. Key-value pairs donʼt cut it, and arguably never did.
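To make the key-value complaint concrete, here is a minimal sketch of what gets lost when a structured description is flattened. The record layout, the field names, and the `flatten_to_key_values` helper are all hypothetical, invented for illustration; they are not taken from any repository platform.

```python
# Hypothetical structured record: each creator carries their own affiliations.
record = {
    "title": "Field survey 2008",
    "creators": [
        {"name": "A. Author", "affiliations": ["Dept. of Zoology", "Field Station"]},
        {"name": "B. Author", "affiliations": ["Dept. of Botany"]},
    ],
}


def flatten_to_key_values(rec):
    """Reduce the nested record to flat key -> list-of-values pairs."""
    flat = {"title": [rec["title"]], "creator": [], "affiliation": []}
    for creator in rec["creators"]:
        flat["creator"].append(creator["name"])
        flat["affiliation"].extend(creator["affiliations"])
    return flat


flat = flatten_to_key_values(record)
# Every value survives, but nothing in `flat` says which affiliation
# belongs to which creator; that relationship lived in the nesting.
print(flat)
```

The flattened form still holds every string, which is why key-value storage can look adequate at first glance. What is gone is the structure that an XML or RDF description would have kept.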
  • 57. Do anything you want... ... as long as it’s “download.” Photo: http://www.flickr.com/photos/procsilas/306417902/ IRs can take in digital files and they can give them back. Honestly, thatʼs pretty much all they can do.
  • 58. Photo: http://www.flickr.com/photos/jonevans/1032687817/ Data are there to be interacted with. This just KILLS us with interaction. It kills us! A lot of these data need APIs. If weʼre not providing them, honestly, we might as well not bother. Interact with data? In an institutional repository? Mash it up with something else? Heavens forfend -- that would imply that digital objects are somehow related to each other, and thatʼs just crazy talk. So we have a lot of interface and API work to do. A lot of it!
  • 59. Content models Enough said. Hereʼs a real-world example of the difficulty. A project Iʼm helping with for the UW-Madison Zoology Museum involves a teaching collection of animal skeletons that students measure and do comparisons on. Weʼre photographing those and whomping up an interface that lets students do that measurement work digitally. Saves wear and tear on fragile realia, allows distance students to participate fully, and, we hope, creates an archive thatʼs useful outside our campus borders. Weʼre using Fedora for this, and the content modeling gets complicated. We have a specimen -- say, a squirrel -- which has any number of actual bones, and each bone may have several photos in various views, and this matters as far as “where do we hang which metadata” and “what do you want people to find in a search?” and “how do you display on a specimen page all of its component bones and views?” So, those have been a lot of entertaining (and sometimes macabre) conversations. Imagine for a moment that this had been a DSpace project. *CLICK* Here is the content model for DSpace. The ONLY content model. Community is not even relevant here; collection kinda-sorta fits but not really, so weʼre left with items, bundles (whatever the heck they are), and bitstreams. And only items can have metadata! (*CLICK*) I donʼt need to say any more. DSpace, which is running the lionʼs share of institutional repositories in the United States, is completely functionally inadequate as a serious data bucket! So much for the IR.
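The specimen, bone, and view hierarchy described above can be sketched as plain data classes. This is only an illustration of the modeling question ("where do we hang which metadata?"), not the Fedora content model used in the project; every class, field, and file name here is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class View:
    """One photograph of a bone from a particular angle (hypothetical)."""
    label: str
    image_file: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Bone:
    """One bone of a specimen; may have several photographed views."""
    name: str
    views: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


@dataclass
class Specimen:
    """A whole animal skeleton, made up of any number of bones."""
    common_name: str
    bones: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def specimen_page(specimen):
    """List every (bone, view) pair to display on a specimen page."""
    return [(bone.name, view.label) for bone in specimen.bones for view in bone.views]


squirrel = Specimen(
    common_name="squirrel",
    metadata={"collection": "teaching"},
    bones=[
        Bone("femur", metadata={"side": "left"},
             views=[View("anterior", "femur_ant.tif"), View("lateral", "femur_lat.tif")]),
        Bone("skull", views=[View("dorsal", "skull_dor.tif")]),
    ],
)
print(specimen_page(squirrel))
```

Metadata can attach at all three levels in this sketch. In a model where only the top-level item can carry metadata, the "left femur" and "anterior view" descriptions would have nowhere natural to live.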
  • 60. So where does all that leave us? Photo: http://www.flickr.com/photos/library_of_congress/2162653769/ Hopefully not with a trainwreck!
  • 61. Photo: http://www.flickr.com/photos/jonevans/1032687817/ We need bigger, better buckets. I love the idea of just grabbing a bucket and going after data. I admit itʼs probably an 80/20 thing; thereʼs 20% of the problem-space weʼre looking at that we cannot realistically solve. But I know -- I KNOW -- we havenʼt served 80% of our users or 80% of our potential content. We can do better, and we need to.
  • 62. Silos are both necessary and unacceptable. Photo: http://www.flickr.com/photos/jojakeman/2818910104/ At some level data are all bits -- and at that level, silos tend to be counterproductive and stupid. We shouldnʼt have to build a checksum engine for sixteen different silos! Where I am, weʼre working toward combining our digital library and our institutional repository on a single technical infrastructure, because it just makes sense to do that. But because data come in thirty-six flavors and then some, once you get above the pure-bits level itʼs unrealistic to think that we can design one silo that will work equally well for everything. Our infrastructure has to be flexible, it has to have APIs that other people can build on as well as ourselves, and it should make the most of the commonalities we *do* find in wildly diverse and heterogeneous data. Homogeneity whenever possible, flexibility where necessary: that needs to be our motto as we build these systems.
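The "data are all bits" point can be illustrated with a fixity check that knows nothing about what kind of object it is reading. This is a minimal sketch using Python's standard `hashlib`; the function name, the choice of SHA-256, and the sample byte strings are assumptions for the example, not a description of any particular repository's checksum engine.

```python
import hashlib
import io


def checksum(stream, algorithm="sha256", chunk_size=65536):
    """Hash any binary stream in chunks, whatever silo it came from."""
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# The same function serves a digitized page image and a research dataset;
# these in-memory byte strings are stand-ins for real files.
page_image = io.BytesIO(b"stand-in bytes for a TIFF")
dataset = io.BytesIO(b"site,count\nA,12\nB,7\n")
print(checksum(page_image))
print(checksum(dataset))
```

One shared routine like this can sit underneath both the digital library and the institutional repository. The differences between content types only start to matter above this level.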
  • 63. We have a lot of modeling to do. And meta-modeling. Photo: http://www.flickr.com/photos/crobj/727348790/ Again, because of data diversity, the content-modeling exercise I talked about with the zoology skeletons will have to be replicated, over and over and over again, as new kinds of data come our way. I donʼt know if this scares you -- it sure scares me. Add standardization processes on top of this, because we can expect some kinds of research data to develop standards, and it gets even scarier. Fundamentally, we need more efficient ways to do this work -- a sort of meta-model for content modeling, if you will. I donʼt know how that can work; I just know it has to.
  • 64. We have a lot of code to write. Photo: http://www.flickr.com/photos/fienna/170559081/ Should be uncontroversial. I know a lot of you are already writing this code! Thank you. Now share it with the rest of us, please, because...
  • 65. We can’t code or model in isolation. Photo: http://www.flickr.com/photos/naus3a01/240614578/ ... here is another Cassandraic dire warning. We cannot possibly *hope* to keep up with the data flood if weʼre all making our own little content models and coding up discovery and dissemination frameworks in isolation. Why should anyone out there have to decide how to represent skeletons and bones? At Wisconsin weʼve done that for you! “We love open source; no, you canʼt have our code” is not gonna fly any longer, folks. We have no choice but to figure out how to share code better. Whatʼs more, we have to figure out how to share code with people no more technically inclined than I am, and perhaps less. Now, just a little bit about me: I hate Java. I am violently allergic to Tomcat. I donʼt even like man pages! Can you build a system for me? Now think about the vast DSpace install base out there. Think about how many of those installs happened because DSpace was supposed to be an out-of-the-box solution. Now think about how weʼre going to migrate these people to something more flexible. Scared yet? I am! Brian Owen talked yesterday about how HARD it is to solve these collaboration problems. I agree with him! Itʼs hard. Itʼs a human problem, and human problems are hard. The problem is, all the ALTERNATIVES to solving this problem are even harder. Weʼve GOT to fix library-technology collaboration.
  • 66. Fedora is the new world. But Fedora must change. Photo: http://www.flickr.com/photos/mythwhisper/3361907495/ Of the digital-library and institutional-repository platforms out there today, I think Fedora is the horse to bet on. Itʼs the only one that comes close to the storage and presentation flexibility needed for a big data bucket, and I think the data buckets such as RepoMMan that have already been built atop it are all by themselves a pretty good indicator that it is the future. But Fedora needs to make some changes -- some technical, some social. Content models, service definitions, and their associated code need to be pluggable, to avoid the wheel-reinvention Iʼve said we canʼt afford. I donʼt entirely know how this needs to work, though the plugin and mod structures for projects like Drupal and WordPress may be models. I do know that it does need to work, or weʼre all going to drown in our buckets. And then we have to build the social scaffolding to actually share these pieces of code, which may turn out to be harder than the actual technology! Fedora also made the same mistake DSpace did with regard to the editability and replaceability of objects. Getting stuff into Fedora you can do with what Fedora hands you. Removing stuff, you can do. Editing stuff? No, unless you want to edit XML as text in an incredibly clunky and ugly Java app. Replacing an object with a better object? No. This is not gonna fly, Fedora. It needs to be fixed as soon as possible. We also have to put easier tools on top of Fedora, both on the data-producing and data-consuming ends. Thatʼs being worked on: Islandora, Omeka-over-Fedora, lots of things, and that is all to the good. Fundamentally, we have to figure out ingest straight from whatever unholy mess a researcher has, and we have to be able to translate the affordances of a particular dataset easily into our systems. I donʼt think either of those is solved yet, though RepoMMan comes close; even the SWORD
  • 67. Photo: http://www.flickr.com/photos/werwin15/3554539197/ Focus on the start... ... not so much the finished product. ... you canʼt curate what you donʼt HAVE. Fundamental truth that the IR experience should have taught us. If our systems donʼt invite deposits -- even sloppy ones, even unfinished ones, even bad ones by any measurement -- and if they donʼt do it as early as possible in the research process, so that researchers donʼt get fixated on some other software system, thereʼs no point to having research-data repositories at all. I know this goes against the grain, hundreds of years of library perfectionism, but Iʼm afraid thatʼs just too bad! If weʼre playing in this space, we have to be ready to make some mud pies. Iʼve always, always loved the RepoMMan project for this reason -- if youʼre googling for it, it has two “m”s -- and I also really like what the California Digital Library is building. Theyʼre starting with the good old filesystem, which we all know and more or less love, and theyʼre enhancing it into a curation system. Itʼs an approach I think will bear fruit, and thatʼs because theyʼre starting from the right place: where people actually do their work.
  • 68. Solr brings it all together Photo: http://www.flickr.com/photos/chantrybee/2911840052/ Now, to end on a positive note, I love the Solr app, and I think itʼs a marvelous example of the kind of lightweight tool that does really heavyweight things. The beauty of Solr is that once Iʼve solved the intellectual problem of “what metadata do I want to expose for search and browse?” Solr makes expressing that in a crosswalk just stunningly, beautifully trivial -- and then I never have to worry about it again for that flavor of metadata. There is complexity under the hood, but our experience at Wisconsin has so far been that you donʼt encounter that complexity until you actually need it, which is just perfect.
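The crosswalk idea above, deciding once which metadata to expose for search and browse and then reusing that decision, can be sketched as a simple field mapping. This is not Solr configuration and uses no Solr API; the `CROSSWALK` table, the field names on both sides, and the `to_index_doc` helper are hypothetical, chosen only to show the shape of the decision.

```python
# Hypothetical crosswalk: source metadata field -> search-index field.
CROSSWALK = {
    "dc.title": "title",
    "dc.creator": "author",
    "dc.subject": "subject",
}


def to_index_doc(record, crosswalk=CROSSWALK):
    """Keep only the fields chosen for search and browse, renamed for the index."""
    doc = {}
    for source_field, values in record.items():
        target = crosswalk.get(source_field)
        if target is None:
            continue  # not exposed for search; leave it out of the index
        doc.setdefault(target, []).extend(values)
    return doc


record = {
    "dc.title": ["Squirrel skeleton, teaching collection"],
    "dc.creator": ["Zoology Museum", "Digital Collections"],
    "dc.rights": ["Rights statement goes here"],
}
print(to_index_doc(record))
```

Once the mapping exists for one flavor of metadata, every further record of that flavor goes through unchanged. That is the "never have to worry about it again" property the talk describes.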
  • 69. ... the ... of Data Curation. Painting: Vermeer, the Muse Clio, from “The Allegory of Painting” So thatʼs what I have to tell you. If Iʼve helped you see some of these problems in a new way, if Iʼve expressed them usefully, such that they get solved, perhaps Iʼll get to stop being Cassandra -- and instead become the Clio of data curation. Hereʼs hoping.
  • 70. Thank you! This presentation is available under a Creative Commons Attribution 3.0 United States license.