This document summarizes available data visualization tools and datasets for digital humanities research. It discusses examples of tools for searching, discovery, visualization, analysis, and publishing, including Perseus, JSTOR Data for Research, WordSeer, the Google Ngram Viewer, concordancing tools, Google's Public Data Explorer, NodeXL for network and text analysis, and Google Refine for data cleaning. It also outlines roles for librarians: providing tool comparisons, offering research support, and shifting reference services to support new forms of data-driven research.
Data Visualization and Digital Tools for the Humanities
1. Data visualization and digital humanities research: a survey of available data sets and tools LITA National Forum 2011 St. Louis, MO Friday, September 30, 2011 Erik Mitchell, University of Maryland Susan Sharpless Smith, Wake Forest University
2. Motivation “Digital humanities needs gateway drugs. Kudos to the pushers on the Google Books team.” - Dan Cohen http://www.dancohen.org/2010/12/19/ “Linked open data could have the same leveraging effect that the World Wide Web had on computing, said Micki McGee, an assistant professor of sociology at Fordham University” -Steve Kolowich, The Promise of Digital Humanities, Inside HigherEd
3. Birth of a word "Imagine if you could record your life, everything you said, everything you did, available in a perfect memory store at your fingertips." - Deb Roy, The Birth of a Word http://www.ted.com/
4. Overview Discuss examples of data-focused research tools Explore tools Consider roles for librarians Wrap-up/Q & A
10. Tool exploration Discover / Search What kinds of discovery tools exist and how common are the discovery features across different datasets / systems? Visualization What visualization features exist, are there products that are easy to use, are the skills transferable? Analysis / Annotation What analytical tools are included, what analysis techniques are common?
14. Google's Ngram Viewer books.google.com/ngrams culturomics.org But here's the rub. Google Books, as others point out, wasn't really built for research. . . That means Google Books didn't come with the interfaces scholars need for vast data manipulation . . . http://chronicle.com/article/The-Humanities-Go-Google/65713/
15. TED talk on the Google Ngram Viewer http://www.ted.com/talks/what_we_learned_from_5_million_books.html
16. Concordancing Eric Lease Morgan - http://dh.crc.nd.edu/sandbox/cyl/catalog/
18. Data analysis - NodeXL http://nodexl.codeplex.com/ Analyzing Social Media Networks with NodeXL: Insights from a Connected World
19. Data cleaning – Google Refine http://code.google.com/p/google-refine
20. Data visualization – Google Fusion Tables http://www.google.com/fusiontables/DataSource?dsrcid=332788 http://google.com/fusiontables
21. Research/teaching need Researcher needs vary from advanced linguistic analysis and IT support to need for basic digital content/infrastructure Corpus-based research
22. Librarian contributions Domain specific, tool-type specific comparisons IT and research support – data analysis, data curation, tool/data sources identification Shift from “reference” to “research” in sync with move from resource discovery to thematic analysis
23. Next steps Build new skills, develop new systems Create tutorials and guides Explore connections between data curation, publishing, and these tools Explore the role of library discovery systems and consider new feature implementation.
Today presenting on a "summer exploration" project completed at WFU. Wide scope, exploratory in nature. Here today to share what we found.
From the article "The Promise of Digital Humanities" (Inside Higher Ed), September 28, 2011: They are building tools that could facilitate insights into history, language, art and culture that human researchers might never have been able to glean on their own. And some say that could help restore public interest in the humanities. Digital humanities is a hot topic this year: the NEH held a symposium on Tuesday for 60 recipients of its 2011 Digital Humanities Start-Up Grants, most of whom were given between $25,000 and $50,000. Digital humanities is a branch of scholarship that takes the computational rigor that has long undergirded the sciences and applies it to the study of history, language, art and culture.
Got interested because WFU faculty were talking about DH research. We saw lots of enthusiasm but little knowledge about what really existed. Story: different definitions, WFU DH Institute, computational humanities, linguistics. Point: it is clear that the field has energy and that DH is focusing on the same structures and information tools as libraries.
Discuss how data and computational power is sexy. We pause to mention this video specifically (Deb Roy, 20 minutes). Focuses on the impact of large-scale data collection and cross analysis: "Imagine if you could record your life, everything you said, everything you did, available in a perfect memory store at your fingertips." Picture shows the connection between a televised moment (Obama's State of the Union speech) at the bottom of the screen and all of the social media conversations happening in real time at the top of the screen. Network graph: wider view of experience, understand ideas from more than one perspective. Point: consider the impact if librarians could help students and researchers begin this type of data analysis.
Going to present some examples of "data"-focused research tools. Definition: databases that allow asking research questions focused on data. We are going to explore tools that fit three functions: searching/discovery, visualization, and analysis/publishing. Consider how these tools could impact teaching and research. Consider the roles that librarians can play in this field.
Goal in this chart is to introduce the types of tools and show how they complement each other.
Discovery: text searching; citation chaining (tracing citations both forward and backward, something core to academic research; WOK citation mapping gives a visual of this idea, and this data can be exported); concept exploration, facets and contextual metadata.
Visualization (for both presentation and behind the scenes): mapping, graphing, charting, data cleanup and normalization.
Analysis/publishing: dataset publishing, statistical analysis, annotation (tagging text), drilling in, inverse.
Be aware that there is overlap among the groups.
Discuss types of discovery.
Corpus (collection of written texts) exploration: full text, linguistic components, concepts (copa, coca, ngram, . . .); examples at http://corpus.byu.edu
Bibliometrics: citation trees (Web of Knowledge, DFR). Bibliometrics is a set of methods used to study or measure texts and information; citation analysis and content analysis are commonly used bibliometric methods. Used to study the impact of researchers, papers, journals, and academic output (Eigenfactor recommends); new project coming out.
Metadata: structured data on any topic (Google Public Data, GIS).
Hybrid: JSTOR DFR (Data for Research) is a good example; it includes full-text searching, metadata limiting, and bibliometrics.
Purpose: the main goal of data visualization is to communicate information clearly and effectively through graphical means. Many free tools are available for visualization (link on slide). The purpose of these tools is to provide visualization and data exploration platforms; NodeXL is an Excel plug-in for Windows.
Types of visualization: data cleaning, data analysis, graphical representations of data (table, map, heatmap, line chart, bar graph, pie chart, scatter plot, timeline, storyline or motion/animation over time; Google Fusion Tables does all of these).
One example using GIS: http://inside.uidaho.edu/ Google Fusion Tables: http://www.google.com/fusiontables/Home
These tools allow statistical analysis of data or provide a platform for visualization or publishing. A great, understandable example is the Google Public Data Explorer. We will look at this in a few minutes.
Second thing we did was explore. We tried to compare linguistic tools. Article: Literary & Linguistic Computing, corpus design criteria, Volume 7, 1992. How we explored: interviews, datasets, tools; focused on linguistics. Goal of this slide is to talk through one comparison exercise: corpus.byu.edu, the Corpus of Contemporary American English, Google Books, the British National Corpus. Findings: lack of consistency, new search features. Need here is for published comparative documents. All familiar but in a different context: word frequency, concordancing, lemmatization (roots), semantic and syntactic relationships, KWIC, sense disambiguation, links, population scope (open/closed), random. Point: librarians already know what these tools can do, to an extent.
Word frequency.
Concordancing: an index of words in a text, often shown in the context of sentence structure.
Lemmatization: searching words using roots.
Semantic relationships: derived relationships (e.g. is done by, is described as).
Syntactic relationships: part-of-speech labeling, sentence decomposition (Stanford parser).
Collocation: KWIC, a sequence of words that are taken together.
Sense disambiguation (e.g. run, running, ran).
Link to lexical database: dictionary of words - http://wordnet.princeton.edu/
How is the population defined? Is the corpus open or closed? Was it a random sample, a limited text source? What impact does that have on generalization?
Synchronic/diachronic: does the corpus focus on a "point in time" or on change over time?
Monolingual/bilingual/plurilingual: what languages are represented?
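Several of the features compared above start from simple counting. A minimal Python sketch of word-frequency counting, using an invented sample sentence rather than any of the corpora named above:

```python
from collections import Counter
import re

def word_frequencies(text):
    """Count word occurrences, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Invented example text; a real corpus tool would run this over millions of words.
text = "The cat ran. The cats run, and the cat runs."
freq = word_frequencies(text)
```

Note that without lemmatization, "cat", "cats", "ran", "run", and "runs" all count separately, which is exactly the gap the root-based searching above addresses.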
Now we are going to explore some tools. We grouped them into three areas: discovery, visualization, and analysis. We have included some questions that we asked as we explored.
A tool that I did not know about until recently is Perseus, mentioned in Project Bamboo. Digital humanities research tool at Tufts. Listened to David Mimno from Princeton talk about computational humanities, spanning between distant and close reading. David was the head programmer for Perseus for many years. Features include direct access to text searching; the ability to explore connections between documents (lexicons, concordances, i.e. alphabetical lists of words); and seeing the position of a text in the larger collection.
At the talk I was at, David also talked about his work on computational topic modeling using JSTOR data. It is an interesting talk; you can find it at mith.umd.edu under Digital Dialogues. To recap his idea: if you analyze all of the text in a specific set of journals (Classics journals), you can see changes in topics and language over time (he found that in the 1980s the two fields of philology and archaeology converged in some journals); generate topics that show granular 'aboutness' (some interesting discussions about the value of human vs. computing models); and explore aboutness not from a qualitative 'hunch' but from statistical comparison.
Demo: I want to see what topics academics have explored with Jane Austen.
1. dfr.jstor.org, log in
2. You can search, view chart data, or view citations
3. You can export, although by default you are limited to 1000 records
4. I searched for Jane Austen, limited to research articles, limited to subject language and literature
5. I then downloaded the data >> Data requests > Submit new request
6. Download key terms, CSV -> janeaustenkeyterms
7. Check email, wait, download
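Once the key-terms CSV arrives, it can be filtered with a few lines of Python. A hypothetical sketch: the column names (`keyterm`, `weight`) and sample rows below are invented stand-ins, since the real DFR export format may differ:

```python
import csv
import io

# Invented sample mimicking a DFR key-terms export; real column
# names and values may differ.
sample = """keyterm,weight
austen,1.0
novel,0.97
irony,0.85
"""

def top_terms(csv_text, cutoff=0.9):
    """Return (term, weight) pairs at or above the cutoff, highest first."""
    rows = csv.DictReader(io.StringIO(csv_text))
    kept = [(r["keyterm"], float(r["weight"])) for r in rows
            if float(r["weight"]) >= cutoff]
    return sorted(kept, key=lambda kv: -kv[1])

terms = top_terms(sample)
```

The same filtering happens interactively in Google Refine's numeric facets, shown later in the talk.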
From another presentation at UMD MITH: Aditi Muralidharan. This is a highly focused corpus database that includes semantic relationship analysis, visualization tools, and data annotation. Neat hybrid system. WordSeer focuses on slave narratives: discovery, annotation, visualization, semantic relationships.
Demo: link to it. Examples > god, point to chart. Add bless. Click on heat maps, or read/annotate.
Google Ngram. I expect we are familiar with the Ngram Viewer: http://books.google.com/ngrams Work by Jean-Baptiste Michel and lots of others. 2009 snapshot, 5.2 million books; English, French, German, Hebrew, Russian, Spanish, Chinese. Best data is between 1800 and 2000. Searching: date, phrase, language, smoothing (average of occurrence over years), ngrams (how far a word is from other words: within 2, 3, 4). Discover trends: for instance, while the concept of "good cats" has remained steady (but limited), there has been diminishing focus on "good dogs" in the 20th century. Does this point to a disturbing trend in dog goodness? But be careful: culturomics.org asks what this data really says. Paper by Jean-Baptiste Michel and lots of other folks, "Quantitative Analysis of Culture Using Millions of Digitized Books."
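The "smoothing" setting mentioned above is essentially a moving average over neighboring years. A small sketch of the idea; how the viewer handles the edge years is an assumption here (this version just averages over whatever neighbors exist):

```python
def smooth(series, window=3):
    """Centered moving average over a yearly count series,
    analogous to the Ngram Viewer's smoothing slider."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

# Invented yearly counts for illustration.
counts = [2, 4, 6, 8, 10]
smoothed = smooth(counts)
```

Larger windows flatten short-term spikes (often OCR or sampling noise) at the cost of blurring real year-to-year change.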
In fact there was a recent TED talk on the Ngram Viewer. In 15 minutes it gives a good overview of the background and uses of the system.
We found innumerable tools for processing! Eric Lease Morgan at Notre Dame has done some interesting work in this area and has released his Lingua Perl modules for processing. There are other methods; the Stanford parser, for example, offers these tools. He developed concordancing software, available on CPAN. Great iPad demo here. His data is from the Internet Archive, an interesting source of data for harvesting and analysis. You can see he focuses on some other specific search methods. Point of this one: WordSeer and the Catholic portal are both special-collection focused, with different research tools available. Problem: this proves to be very confusing for people trying to practice a research method across multiple data sets.
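Concordancing itself is simple to prototype. A minimal keyword-in-context (KWIC) sketch in Python, using a made-up sentence rather than any of Morgan's actual data:

```python
def kwic(text, keyword, width=3):
    """Key Word In Context: return each hit with up to `width`
    words of context on either side."""
    words = text.split()
    hits = []
    for i, w in enumerate(words):
        if w.lower().strip(".,;") == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, w, right))
    return hits

# Invented example sentence.
text = "I was born a slave and I never knew my father"
hits = kwic(text, "slave", width=2)
```

Production tools add the linguistic layers listed earlier (lemmatization, part-of-speech tags), but the core display is this simple.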
Google Public Data Explorer: a visualization tool that animates so you can see change over time. You also can embed charts into your website (link icon in the upper right corner). Over 40 datasets are currently uploaded and ready to use. Allows simple visualization tools to be applied to any dataset.
Quick demo of unemployment rate: do the search, show how you limit results. Views: line chart, bar chart, map, bubble chart.
NodeXL is a tool to display and analyze data through a network graph. It is open source, Windows only, and is an Excel template. Specifically, NodeXL was designed to facilitate learning the concepts and methods of social network analysis, with visualization as a key component. What can you do? *Easily* customize the graph's appearance; zoom, scale and pan the graph; dynamically filter vertices and edges; alter the graph's layout; find clusters of related vertices; and calculate graph metrics. What I like is that I could use it quickly by importing data. Built-in connections for getting networks from Twitter, Flickr, YouTube, and your local email are provided; additional importers for Exchange email, Facebook, and hyperlink networks are available. There is a 47-page tutorial, which was a good indication that it is not totally intuitive to learn; however, it has good flexibility.
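To make "graph metrics" concrete: the simplest one, vertex degree, can be computed from an edge list in a few lines. The edge list below is invented for illustration and does not reflect NodeXL's import format:

```python
from collections import Counter

# Invented edge list standing in for, say, a small Twitter reply network.
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]

def degree(edges):
    """Count how many edges touch each vertex -- the most basic
    of the graph metrics NodeXL reports."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

deg = degree(edges)
```

High-degree vertices ("cat" here) are the hubs a network visualization makes visually obvious; metrics like this put numbers behind the picture.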
We also found a number of data cleaning tools. There is a great site, digitalresearchtools.pbworks.com, that lists a lot of these tools. Google Refine runs in Chrome; it supports up to 200K rows, which is actually not that much when we get to humanities data.
1. Go to erikmitchell.dyndns.org:3333 - explain what you are doing
2. I downloaded key terms from JSTOR doing a search for Jane Austen
3. I imported the file using defaults
4. It imported weight and key terms
5. Weight is the relevance or centrality to the document (e.g. every document has a term with rank 1)
6. Let's say I just want to see the central words
7. weight > facet > numeric facet
8. Limit to .98-1
9. You can see this drops the matching rows
10. Now let's say I want to see how many times each of these key terms is used
11. keyterms -> facet -> text facet
12. Sort by count
13. You can include and exclude, perform other data analysis, etc.
If this is interesting there are some good quick video tutorials on the site. Modify for XML or wiki publishing formats.
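The numeric-facet and text-facet steps above can be sketched in plain Python. The rows below are invented stand-ins for the exported key terms; real DFR exports may use different columns:

```python
# Invented rows mimicking the JSTOR key-terms export used in the demo.
rows = [
    {"keyterm": "austen", "weight": 1.0},
    {"keyterm": "novel", "weight": 0.99},
    {"keyterm": "austen", "weight": 0.98},
    {"keyterm": "irony", "weight": 0.72},
]

# Steps 7-8: a numeric facet on weight, limited to .98-1.
central = [r for r in rows if 0.98 <= r["weight"] <= 1.0]

# Steps 10-12: a text facet on keyterm, sorted by count.
counts = {}
for r in central:
    counts[r["keyterm"]] = counts.get(r["keyterm"], 0) + 1
by_count = sorted(counts.items(), key=lambda kv: -kv[1])
```

Refine's advantage over a script like this is that facets are interactive and reversible; the underlying operations are the same filter-and-count.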
Google Refine is also designed to work with their visualization tools. We showed Public Data Explorer; there is also Google Fusion Tables. Fusion Tables makes it very easy to connect and explore data. Here is one link.
So what did we find? We found lots of tools, lots of uses, lots of data. We ultimately decided that there is a strong research and teaching need. This slide is to talk about the data-focused research activities that we found researchers engaged in. A second part of our project was to explore research needs. Widely varied: some statistical, some linguistic, some just wanted to digitize stuff. Jerid is actually doing research on movie subtitles and translation (not sure what this is). Focus on http://francojc.wordpress.com/ List of publications from corpus-based research: http://corpus.byu.edu/publicationSearch.asp
We also found that there are areas for us to contribute. Conversational. One BYU comparison: http://googlebooks.byu.edu/compare-googleBooks.asp compares "possible" and "not possible" for the following functionality: exact words and phrases; related words and cultural insights; searching for concepts; changes in meaning; collocates (nearby words) and cultural shifts; function of words; grammatical changes; language change and genre. A tool to locate research data is being developed by Purdue Libraries (Michael Witt) and Penn State: Databib. The goal is "to create a community-driven, annotated bibliography of research data repositories" http://databib.lib.purdue.edu/
Next steps.
First: librarians already understand metadata interoperability and harvesting; we should expand our understanding of these fields to include full-text data and develop toolkits to facilitate harvesting and meshing of research data from different sources. This includes tools like the Stanford NLP parser (nlp.stanford.edu/software/lex-parser.shtml), a tool that facilitates the coding and parsing of text data.
Second: librarians understand searching across multiple systems; we need to build on this skill by honing our abilities to perform content analysis and generalize results.
Third: we need to better understand the landscape of research data. This means understanding types of datasets and sources of data. It also means having the ability to crosswalk data between databases, and getting past resource discovery and into resource analysis.
Fourth: we need qualitative and quantitative research skills; we need to be able to help researchers know when they have a representative sample and how to harvest, code, and analyze that data.
Fifth: we bring a multi-disciplinary understanding of domains of knowledge; we need to leverage that familiarity with active research agendas.
Story here is about the HathiTrust search in Summon and in OCLC. These search platforms are trying to leverage book full text in a new way, but what else could they do?
Can we add the list of tools? https://digitalresearchtools.pbworks.com/w/page/17801672/FrontPage