Data Mining for Libraries

Data analysis to do service assessment in academic libraries. Web presentation for the State University of New York Librarians' Association.

Data Mining for Libraries:
What are the Possibilities?
Elaine M. Lasda Bergman, MLS
Twitter: @ElaineLibrarian
elasdabergman@albany.edu
Subject Librarian for Social Welfare
University at Albany, SUNY
SUNYLA Midwinter Conference
January 30, 2015

What is Data Mining?
http://pixabay.com/en/helmet-mine-mining-headgear-155632/

Knowledge Discovery in Databases (KDD)
Input data → Data Preprocessing → Data Mining → Postprocessing → Information
Adapted from Tan, et al. (2006), p.3

A note about data collection
• It’s the kicker: GIGO
• Cleaning
• Preprocessing

What is Weka?
http://www.cs.waikato.ac.nz/ml/weka/

Weka for Prediction
Mackenzie, Ian: https://www.flickr.com/photos/madmack/165933656/

Decision Tree From Weka
Likelihood of graduate students using library resources, based on survey questions

Did student use Email/IM reference?
  Yes: Did student receive instruction?
    0 sessions
    1-2 sessions: Time between grad/undergrad
      None: 45% yes
      1-5 years: 100% yes
      5+ years: 100% yes
    3+ sessions
  No: Student’s residency status
    On campus full time
    Off campus full time
    Part time

Weka for Classification
http://www.geograph.org.uk/photo/971476

Animal Clusters

Weka for Association Analysis
http://analytics-arena.blogspot.com/2012/12/the-famous-beer-diaper-planogram.html

Association Rules

(Anomaly Detection)
https://www.flickr.com/photos/fonalite/2780198933/

How Can Libraries Use Data Mining?
http://dlg.galileo.usg.edu/dahlonega/dahlonega_logo.jpg

Circling Back:
It All Starts With Data Collection
http://www.navigatingthetension.com/2012/02/circle-wagons.html

Questions?
Me:
Elaine Lasda Bergman, Subject Librarian for Social Welfare, University at Albany
email: elasdabergman@albany.edu
Twitter: @ElaineLibrarian
Resources used:
Tan, P. et al. (2006). Introduction to Data Mining. Boston: Pearson Education, Inc.
Newton, et al. (2012). Your Statistical Consultant: Answers to Your Data Analysis Questions. Thousand Oaks: SAGE Publications.
Two good Weka Tutorials:
http://www.cs.ccsu.edu/~markov/weka-tutorial.pdf
http://www.uh.edu/~smiertsc/4397cis/WEKA_Data_Mining_Tool.pdf
Data Mining for the Masses:
https://rapidminer.com/wp-content/uploads/2013/10/DataMiningForTheMasses.pdf

Editor's Notes

Data mining is a way to make sense of large datasets. It borrows theoretical underpinnings from statistics and computer science, allowing us to generate new insights and knowledge. It is useful in many settings, and it is increasingly easy to build models that predict, classify, and describe new information. The results can inform administrative decision making, help us understand user behavior, and identify appropriate resources and services to meet the needs of customers or library patrons. We don’t have time today to get into the nitty-gritty of how all of these algorithms and models are implemented, but I want to show you some of the possibilities that data mining techniques afford libraries and librarians.
Data mining is part of a larger process known as “Knowledge Discovery in Databases,” or KDD. First, we have a dataset, or input data. Then we preprocess it, which means making it ready for analysis: converting the data into the proper format, throwing out incomplete records, and so on. Then comes the data mining itself, which we will focus on in a minute. After the data has been mined and analyzed and our conclusions identified, we create visualizations and present the results in a way that is understandable. This is known as postprocessing. Finally, after all of these treatments, we have moved from raw data to usable, actionable information and new insights.
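The KDD stages described above can be sketched as a chain of small functions. This is a minimal illustration in Python, not the Weka workflow used for the talk; the sample rows, field names, and function names are all hypothetical.

```python
# Hypothetical raw survey rows; None marks an unanswered question.
raw_rows = [
    {"status": "Full Time ", "uses_library": "yes"},
    {"status": "part time", "uses_library": None},
    {"status": "full time", "uses_library": "no"},
    {"status": "Part Time", "uses_library": "yes"},
]

def preprocess(rows):
    """Drop incomplete rows and normalize text so equal answers compare equal."""
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # throw out incomplete data
        cleaned.append({key: value.strip().lower() for key, value in row.items()})
    return cleaned

def mine(rows):
    """Stand-in for the mining step: count library users per enrollment status."""
    counts = {}
    for row in rows:
        users, total = counts.get(row["status"], (0, 0))
        counts[row["status"]] = (users + (row["uses_library"] == "yes"), total + 1)
    return counts

def postprocess(counts):
    """Turn raw counts into readable statements."""
    return [
        f"{status}: {users} of {total} respondents use the library"
        for status, (users, total) in sorted(counts.items())
    ]

report = postprocess(mine(preprocess(raw_rows)))
print("\n".join(report))
```

Keeping each stage as its own function makes it easy to go back and redo the preprocessing without touching the later steps.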
Before I get into the meat and potatoes of data mining, I would like to spend a moment on the importance of data quality. It is of paramount importance to collect data that is as complete and accurate as possible, and then to scrutinize the dataset for errors, omissions, typographical mistakes, and formatting problems. An example is the value zero in a dataset. It could mean the mathematical concept of zero (numerical), no value entered (null), or even “No” in answer to a yes/no question (non-mathematical, or nominal). We’ve heard the acronym GIGO for years: garbage in, garbage out. It is vital that we take the time to get rid of the garbage in the preprocessing and formatting stages of our knowledge discovery process. The data I am using for the library-related examples in this presentation come from a user survey of graduate students at the University at Albany’s Downtown Campus, collected last fall.
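The three readings of zero can be made explicit during preprocessing. The sketch below assumes a hypothetical codebook that labels each variable as "numeric", "yes_no", or "optional"; those labels and the function name are illustrative and do not come from the survey itself.

```python
def interpret_zero(raw_value, variable_kind):
    """Translate a raw survey cell into an explicit value.

    variable_kind is one of "numeric", "yes_no", or "optional"
    (hypothetical codebook labels).
    """
    if raw_value in ("", None):
        return None  # nothing was entered
    if variable_kind == "numeric":
        return int(raw_value)  # a real count, so 0 means zero
    if variable_kind == "yes_no":
        return {"0": "no", "1": "yes"}[str(raw_value)]  # nominal code, not a quantity
    if variable_kind == "optional":
        # here 0 was used as a placeholder for "no answer"
        return None if str(raw_value) == "0" else str(raw_value)
    raise ValueError(f"unknown variable kind: {variable_kind}")

print(interpret_zero("0", "numeric"))
print(interpret_zero("0", "yes_no"))
print(interpret_zero("0", "optional"))
```

The same cell content yields a number, a label, or a missing value depending on what the codebook says the variable means, which is why the codebook has to be consulted before any analysis.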
So, moving on to tools for data mining. For my current project with the grad student library user survey, I am using an open-source analytics tool known as Weka. It is free to download at the link above. I will focus on this tool because it is the one with which I am most familiar, but there are other similar tools, including RapidMiner and scikit-learn; the statistical application R also has many data mining capabilities. If a researcher has a reasonably solid background in statistics, she will find the basic functionality in Weka easy to grasp. I recommend the book Your Statistical Consultant as a reference, as well as Data Mining for the Masses. At the end of this presentation, there will be links to help you locate these books. (Data Mining for the Masses is a free PDF; the other costs money.) Next we will talk about the major types of data mining models and how they can be used. The main types of data mining tasks are prediction, classification, and association. My aim is to show you some possibilities and pique your interest to learn more!
Prediction is exactly what it sounds like: we hope to reliably determine the value or outcome of one variable (known as the target variable) based on the values of other variables in the dataset (known as the explanatory variables). There are several ways to do predictive analysis, but the one I am going to show you today is known as a decision tree algorithm. A decision tree asks a series of questions in a hierarchical format, not unlike a flow chart. Decision trees are easy to interpret. They are resistant to data “noise,” which is a term for outliers, less relevant variables, and so forth. The tricky part with decision trees remains the structure and preprocessing of the data, and there is a risk of “overfitting” your model. Overfitting occurs when the model the algorithm builds scores a high accuracy rate on your training set but does not work as well on test data or new data. On the next slide, I will show you a decision tree I built from a training set to predict how likely a survey respondent was to answer that they frequently use library resources, based on their answers to certain demographic questions.
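Overfitting shows up as a gap between accuracy on the training set and accuracy on held-out data. The sketch below uses a deliberately extreme stand-in for an over-grown tree: a lookup table that memorizes every training row and falls back to the most common answer otherwise. It is not a decision tree algorithm, and the rows are invented for illustration.

```python
from collections import Counter

# Hypothetical rows: (used_email_ref, sessions, uses_library)
train = [
    ("yes", "1-2", "yes"),
    ("yes", "3+", "yes"),
    ("no", "0", "no"),
    ("no", "1-2", "yes"),
]
test = [
    ("yes", "0", "yes"),
    ("no", "0", "no"),
    ("no", "3+", "no"),
]

def fit_memorizer(rows):
    """Memorize each training row exactly, like a tree grown until every leaf holds one case."""
    table = {row[:-1]: row[-1] for row in rows}
    default = Counter(row[-1] for row in rows).most_common(1)[0][0]
    return table, default

def predict(model, features):
    table, default = model
    return table.get(features, default)  # unseen combinations get the majority answer

def accuracy(model, rows):
    hits = sum(predict(model, row[:-1]) == row[-1] for row in rows)
    return hits / len(rows)

model = fit_memorizer(train)
train_accuracy = accuracy(model, train)
test_accuracy = accuracy(model, test)
print(f"training accuracy: {train_accuracy:.2f}, test accuracy: {test_accuracy:.2f}")
```

A perfect score on the training rows says little on its own; the number to watch is how the model does on rows it has never seen.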
OK, so this is what my decision tree looks like. The next slide has a simplified version of a few of the branches so you can see how this works.
Here is a piece of the decision tree made prettier (sort of) with Microsoft Shapes. Hopefully it will make more sense to you. What we are looking at are the questions the decision tree asks in order to predict how likely a student at my library is to use library resources. The first question it asks is: did the student use email or instant message reference? There is a branch for “YES” and a branch for “NO.” Let’s follow the right side for a minute. If they did NOT use email or IM reference, the next question is about the student’s residency and full-time/part-time status, and there are three options for this variable. Be aware that each of those options has more branches below it in the real tree I showed you on the last slide, so the probabilities are not shown here. Going back up to the top, let’s follow those who answered “YES” to using electronic reference. The next question the tree asks is: did the student attend any library workshops? We have the values none, 1-2 sessions, and 3 or more sessions. If the student took one or two sessions, the next criterion that matters is how much time the student took between undergraduate and graduate studies. I should add that the other numbers of sessions also have lower branches; we are simplifying by following the shortest trail of branches. The interesting feature of the time between grad and undergrad is that those who took any sort of break are practically guaranteed to use library resources, provided they received instruction and used electronic reference. Those who did not take a break were less likely to, despite these library interventions. Hmmmm….
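Read as code, the simplified branch is a short chain of nested questions. The sketch below encodes only the leaves shown on the slide (45% and 100%) and returns None wherever the slide omits the lower branches; the function name, parameter names, and value labels are hypothetical.

```python
def predicted_share_using_library(used_email_im_ref, sessions, years_between_degrees=None):
    """Follow the simplified branches of the tree.

    Returns the share of respondents answering "yes", or None where the
    slide does not show a leaf.
    """
    if not used_email_im_ref:
        return None  # next split is residency status; its leaves are not shown
    if sessions != "1-2":
        return None  # "0" and "3+" sessions have lower branches that are not shown
    if years_between_degrees == "none":
        return 0.45
    if years_between_degrees in ("1-5", "5+"):
        return 1.00
    return None  # unrecognized answer

print(predicted_share_using_library(True, "1-2", "none"))
print(predicted_share_using_library(True, "1-2", "5+"))
```

Each `if` corresponds to one question in the tree, which is why decision trees are so easy to explain to a non-technical audience.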
Classification is a way of identifying similarities or patterns in a dataset based on comparable variable attributes in each case. There are a number of ways to do this, but I would like to show you clustering. Clustering is a very visual way of finding patterns in your data. Cases with lots of similar values in their variables are grouped closer together; those with different values are grouped farther apart. This means you can inspect the clusters and determine which values the cases in each one share. The shared values give you the pattern of each cluster, which in turn is a way of classifying your data. Unfortunately, my own data did not respond well to clustering, which I will discuss in a minute. For now I will show you a classic clustering example from zoology: predicting animal class based on physical characteristics of the creature.
I know this is hard to see, but there is a purple cluster, a blue cluster, a brown cluster, a yellow cluster, and a green cluster. Each of these represents a grouping the clustering algorithm determined based on characteristics of the animals. For example, worms and snakes have no legs and lay eggs, whereas seals, porpoises, and dolphins are aquatic mammals. The tricky part of cluster analysis is that, unlike the decision tree, it IS very sensitive to “noise” in your data. Notice that the platypus, which is a mammal, is grouped with the turtle type of creatures. As mentioned, clustering the graduate student survey data was not particularly successful in my case. First, my dataset is probably too small; second, I may have asked the wrong questions or combinations of questions to generate clusters. One thing I intend to do is go back to the preprocessing stage and see if there are ways to group responses to variables that reduce data “noise” and give us some sort of pattern.
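Clustering rests on a measure of how far apart two cases are. The sketch below shows only that distance step, not a full clustering algorithm and not the one Weka ran for the slide. The four yes/no features and their values are a simplified, hypothetical encoding (the turtle is assumed to be aquatic).

```python
# Hypothetical yes/no features: (lays_eggs, gives_milk, aquatic, has_legs)
animals = {
    "dolphin":  (0, 1, 1, 0),
    "porpoise": (0, 1, 1, 0),
    "snake":    (1, 0, 0, 0),
    "worm":     (1, 0, 0, 0),
    "turtle":   (1, 0, 1, 1),
    "platypus": (1, 1, 1, 1),
}

def distance(a, b):
    """Count the features on which two animals differ (Hamming distance)."""
    return sum(x != y for x, y in zip(a, b))

def nearest(name):
    """Return the other animal with the fewest differing features."""
    others = (other for other in animals if other != name)
    return min(others, key=lambda other: distance(animals[name], animals[other]))

for name in ("dolphin", "snake", "platypus"):
    print(name, "is closest to", nearest(name))
```

With these features the platypus differs from the turtle on one trait but from the dolphin on two, which is one way an egg-laying mammal can end up grouped with the reptiles.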
Association can be used for classification purposes as well. Association, however, is based on “rules” rather than “clusters.” Association rules are if…then rules that show patterns of association between variables. This allows for complex comparisons and generates some interesting associations. The famous example of association rules is the urban myth that, due to what is known as “market basket analysis,” Walmart (or whatever big box store) puts its beer and diapers in the same aisle. The rule would go: if customers buy diapers, then they are likely to also purchase beer. The myth goes that this is because young husbands get sent out to buy diapers and pick up some beer for themselves while they are out. Association rules are also part of how Netflix and Amazon determine what to recommend to you. Association rules are easy to interpret and describe, and they handle skewed data very well (for example, my survey respondents were 70% women and 30% men).
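An if…then rule such as "diapers, then beer" is usually judged by two numbers: support (how often the items appear together) and confidence (how often the "then" part holds when the "if" part does). The sketch below computes both on five invented shopping baskets; the data and function names are hypothetical.

```python
# Hypothetical shopping baskets, one set of items per customer visit.
baskets = [
    {"diapers", "beer", "wipes"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset):
    """Share of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
print(f"diapers -> beer: support {rule_support:.2f}, confidence {rule_confidence:.2f}")
```

Here two of the five baskets contain both items, and two of the three diaper baskets also contain beer. Rule-mining tools search over many such candidate rules and keep those above chosen support and confidence thresholds.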
Here is what the association rules look like in Weka. I understand that the variable names and values are not particularly descriptive on this screenshot; you need my survey “codebook” to explain what the variables mean and what each value signifies. This run of my association rules algorithm shows that if students are “somewhat” confident in finding the information they need (confidence=4), they are likely off campus, full time students (residency=2). This is interesting because the survey gave 4 options: extremely confident, very confident, somewhat confident, and not confident at all. Zero respondents indicated that they were not confident at all. So our least confident students are “somewhat confident,” and they are most frequently full time commuters, as opposed to part timers or full time on campus students. Hmmmm…..
There is actually a fourth data mining task, known as anomaly detection. It is the opposite of something like classification or prediction: it identifies the outliers that DON’T fit your model. Practical uses for anomaly detection include detecting credit card fraud (your credit card was just used in Bali, and you are in Rochester) and filtering email spam. I don’t have a good example of this one, because the goal of my survey was to look for patterns and trends, and also because it works best with so-called “Big Data,” but you can see how this is a useful application in a business context.
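As a toy version of the credit card case, the sketch below flags purchase amounts that sit far from the typical value, using the median and the median absolute deviation. The amounts, the threshold of 5, and the function name are assumptions chosen for illustration; production fraud detection is far more elaborate.

```python
from statistics import median

# Hypothetical purchase amounts in dollars for one card.
amounts = [52, 48, 50, 47, 53, 49, 51, 150]

def flag_outliers(values, threshold=5.0):
    """Return values whose distance from the median exceeds threshold times the
    median absolute deviation (MAD)."""
    center = median(values)
    mad = median(abs(value - center) for value in values)
    if mad == 0:
        return []  # no spread to measure against, so flag nothing
    return [value for value in values if abs(value - center) / mad > threshold]

outliers = flag_outliers(amounts)
print(outliers)
```

The median is used instead of the mean because a single extreme purchase would drag the mean toward itself and partly hide the very value being looked for.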
So what are some things we can do with data mining techniques to provide better user services, work processes, and administrative decisions?
• Like me, you could take a user survey and use the data to predict and associate certain characteristics with library resource use (and try to classify!).
• You could try to determine the likelihood that a book will go missing, using various circulation statistics as the explanatory variables (times circulated, publication year, call number range, etc.).
• Cluster the patterns of library use across subject groupings (call number ranges) by explanatory variables such as counts of interlibrary loans, purchases on demand, book circulation, and journal article downloads.
• Determine which academic majors or faculty departments are most associated with the use of various services (reference, borrowing a Kindle, checking out more than 5 books a semester).
I hope this talk has given you some ideas about what we can learn from mining library data for interesting insights, patterns, and information. Some things to consider: first, all of this interesting analysis is predicated on GOOD DATA, and getting “good” data may be more challenging than the analysis itself. Privacy concerns may keep us from mining data about our library users; for example, we don’t make circulation data available. But such data in the aggregate can be used to great effect, provided extreme care is taken to protect our users’ identities. Second, we may not currently be collecting the data we need to appropriately tell our stories; we may have to change what information we collect and how we collect it to get the “good stuff.” Do a data inventory of your library! What is missing to help you achieve your strategic goals? And even if you have data that you’re ready to mine, you may not be ready to do the mining yourself. But now that you know what possibilities exist, why not ask around on campus for help? Computer science and statistics students may welcome the opportunity.
