A lot of data are created in an LMS instance, and much of this can be analyzed for insight. In 2016, Instructure, the makers of Canvas, made their LMS data available to their customers through a data portal (updated monthly). This portal enables access to a number of flat files related to that particular instance. This presentation showcases how this big data was analyzed on a regular laptop with basic office software, to summarize Kansas State University’s use of the LMS. Methods for analysis include the following: basic descriptive statistics, survival analysis, computational linguistic analysis, and others.
The results are reported out with both numbers and data visualizations, including classic pie charts, line graphs, bar charts, mixed-charts, word clouds, and others. The findings provide some insights about how to approach the data, how to use a data dictionary, and other methods for extracting the data for awareness and practical decision-making. This work also is suggestive of next steps for more advanced analysis (using the flat files in a SQL database).
More information about this may be accessed at http://scalar.usc.edu/works/c2c-digital-magazine-spring--summer-2017/wrangling-big-data-in-a-small-tech-ecosystem.
Leveraging Flat Files from the Canvas LMS Data Portal at K-State
1. Leveraging “Flat Files” from the Canvas LMS Data Portal (at K-State)
SIDLIT 2017 | LMS Preconference
Colleague 2 Colleague
August 2, 2017
2. Presentation
A lot of data are created in an LMS instance, and much of this can be
analyzed for insight. In 2016, Instructure, the makers of Canvas, made their
LMS data available to their customers through a data portal (updated
monthly). This portal enables access to a number of flat files related to that
particular instance. This presentation showcases how this big data was
analyzed on a regular laptop with basic office software, to summarize
Kansas State University’s use of the LMS. Methods for analysis include the
following: basic descriptive statistics, survival analysis, computational
linguistic analysis, and others.
3. Presentation (cont.)
The results are reported out with both numbers and data visualizations,
including classic pie charts, line graphs, bar charts, mixed-charts, word
clouds, and others. The findings provide some insights about how to
approach the data, how to use a data dictionary, and other methods for
extracting the data for awareness and practical decision-making. This
work also is suggestive of next steps for more advanced analysis (using the
flat files in a SQL database).
More information about this experience may be accessed on SlideShare
through an article download titled “Wrangling Big Data in a Small Tech
Ecosystem” at http://www.slideshare.net/ShalinHaiJew/wrangling-big-data-
in-a-small-tech-ecosystem (orig. from Oct. 2016). The original article
“Wrangling Big Data in a Small Tech Ecosystem” is from C2C Digital
Magazine.
4. Presentation Order
Canvas LMS at Kansas State University (K-State)
Canvas LMS Data Portal and Flat Files
The Summary Data
Some Practical Applications
Moving Forward with the Data
5. General Approach
Frameworks / Approaches
An instructional design approach: What can enhance teaching and learning?
A researcher approach: What can enhance accurate data collection, usage, researcher awareness, and decision-making?
Using all data (every part!)
Using all basic software tools available on a regular machine
Data Clients on a Campus
Faculty
Staff
System Administrators
Leaders
Students
Analysts
7. LMS History at K-State
Homegrown Learning Management System (LMS) (Axio Learning)
Informed by faculty, admin, and staff needs (IT Help Desk tickets, focus groups
with faculty and staff)
Software updates rolled out annually with some patches in-between
Built mostly by K-State graduates and professional developers (often hired from
student ranks)
Instructure’s Canvas LMS at K-State (2013 – present)
Availability of the data portal in 2016
Monthly updates of select data from the particular instance
Accessed at K-State in October 2016
8. An Early Brainstorm
Brainstorm beneficial questions (data queries) before exploring the data, so
you’re not limited by the found data, and keep these in mind even after
the initial data exploration. It is important to conceptualize what may be
practically helpful through the informed imagination first.
It would be helpful to continue with the brainstorming as the data are
explored.
9. Initial Brainstormed Questions
What can be reported out at various levels: university, college,
department, course, and individual?
Is it possible to make observations about course design? Learner
engagement (Discussions? Conversations?)? Advising? Technology usage
(such as external tools)? Uses of the LMS site for non-course applications?
What sorts of manually created courses exist, and how are these used?
What percentage of the courses are these manual types of courses?
10. Initial Brainstormed Questions (cont.)
How closely is it possible to map the data of a learner’s trajectory? A
group’s trajectory?
What are some attributes to use to identify various groups? Which attributes
would be helpful? What sorts of group-specific questions may be asked?
For example, is it possible to identify high-performing groups vs. low-performing
groups in order to run analytics to see what differences there may be between
the two?
What may be understood about the learning going on in a particular
course? A learning sequence?
Are there ways to understand effective support for learners and support for
learning from this data?
11. Required Preliminary Understandings
Need to understand the front-end view of the LMS and its general uses on campus; otherwise, viewing the back-end data will be like looking through a mirror, darkly
Need to understand what terms are applied to the various types of data
(because you want to be on the same page with the creators and users of the
LMS)
Need to have experiences with the various analytical technologies applied to
the particular data because various queries require different data processing
and data structures
Will be applying the following: descriptive statistics, inferential statistics, direct
data queries, linguistic analysis, survival analysis, sentiment analysis, topic
modeling, and others
Will ultimately be applying more complex machine learning as well
12. Required Preliminary Understandings
(cont.)
Need understandings of “states” of being for various objects in an LMS
Need ability to identify anomalies and the skills to interpret what these
might mean
Need to know what data mean and where to dig deeper for more relevant
information
Need to know where noise might enter a particular dataset or an analytical
process…and to head off the introduction of or inclusion of noise
14. Canvas Data Portal
Data updated once a month at the time (now updated daily)
Live dynamic data may be accessed via a higher level of service
Flat files (in compressed .gz format for download with 7Zip) downloaded
from SQL servers
Also known as table data (albeit without defined structural relationships between
records and therefore “flat”)
May contain labeled data like numbers
May contain unstructured or semi-structured data like texts, names, messages, and
others
Contain content data (messaging), trace data (interaction data), and some
metadata (data about data, often riding on imagery and multimedia)
Data described in a formal data dictionary
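The flat-file intake described above can be sketched with Python's standard library. This is a minimal sketch under assumptions: the file name, column names, and tab delimiter are invented for illustration, and the actual column layout must come from the data dictionary.

```python
import csv
import gzip

def read_flat_file(path, column_names, delimiter="\t"):
    """Read a gzipped flat file into a list of row dictionaries.

    The column names come from the data dictionary, since the flat
    files are assumed to carry no header row of their own.
    """
    rows = []
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for record in csv.reader(handle, delimiter=delimiter):
            rows.append(dict(zip(column_names, record)))
    return rows

# Demo: write a tiny two-row sample file, then read it back.
with gzip.open("quiz_dim_sample.gz", mode="wt", encoding="utf-8") as out:
    out.write("1\tQuiz One\tpublished\n2\tQuiz Two\tdeleted\n")

rows = read_flat_file("quiz_dim_sample.gz", ["id", "name", "workflow_state"])
print(rows[0]["name"])  # Quiz One
```

From here, the row dictionaries can feed descriptive statistics, text analysis, or a database load.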
15. “Flat Files” Strengths and Weaknesses
Strengths
Manageable on a small-scale laptop
Can ask questions across several flat files
Weaknesses
Lack relational data between the various flat files
Cannot query data effectively across the various data tables (because the relationships are not defined)
Lack access to the identifier column
Lack access to the foreign key
16. Data Dictionary
A reference resource that describes particular data
Documentation of data captured in the Canvas Data warehouse
Helpful for understanding naming protocols of the various data types
The following is a verbatim example:
Name | Type | Description
assignment_id | bigint (big integer) | Foreign key to the assignment the override is associated with. May be empty.
23. Order: First Data Visualizations and
Then Light Text Commentary
The data visualizations come first…so that the audience may analyze the
data to see what it says
The summary analyses come directly after the visualization, so there is a
kind of debriefing
25. Purposeful Blur and Block
Need to know how to protect against data leakage
Never share the underlying dataset
Never share unique identifiers
Always double-check screen grabs against accidental inclusion of personally identifiable information (PII); use effective redaction if PII is viewable
When redacting, make sure that the redaction cannot be reversed (backwards iterated
or some other strategy) and a person re-identified
Check that no metadata is riding with multimedia being released
Any personally identifiable information (PII) is obfuscated here
No granular level of data was captured in the article
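One standard safeguard for the PII concerns above is to pseudonymize unique identifiers at extraction time with a salted one-way hash. This is a sketch under assumptions: the salt string and record layout are invented, and in practice the salt must be kept secret and protected.

```python
import hashlib

def pseudonymize(value, salt):
    """Replace an identifier with a salted one-way hash token.

    The same input always yields the same token (so joins across
    files still work), but the token cannot be reversed back to the
    original identifier without the secret salt.
    """
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:16]  # a shortened token is still stable for joining

SALT = "replace-with-a-secret-salt"  # hypothetical; keep the real one secret
record = {"user_id": 90210, "role": "StudentEnrollment"}  # invented record
safe = {"user_id": pseudonymize(record["user_id"], SALT), "role": record["role"]}

print(safe["role"])  # StudentEnrollment
```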
27. Workflow
1. Conceptualizing questions and applications of the data
2. Review of the dataset information
3. Data download
4. Data extraction
5. Data processing (cleaning) and analytics
6. Validating / invalidating the findings
7. Additional data analytics
8. Write-up for presentation
9. Data and informational materials archival
37. Date Restriction Accesses for Course
Sections
Non-defined (default) as the majority
Restricted section access (by learner name) to defined dates
Non-restricted (all participants in the course welcome) section access to
defined dates
44. Time Features for Assignments
Half of assignments with no time allotment
Other half with time features
Due_at, no unlock_at, no lock_at
Due_at, lock_at, unlock_at (all three)
48. Some Linguistic Features of the
Assignment Titles and Descriptions
Analytic: 91.69
“Formal, logical, and hierarchical thinking” vs. “more informal, personal, here-and-
now, and narrative thinking”
Clout: 73.25
“perspective of high expertise” and confidence vs. “more tentative, humble, even
anxious style”
Authentic: 11.83
“more honest, personal, and disclosing text” vs. “a more guarded, distanced form of
discourse”
Tone: 64.98
“a more positive, upbeat style” vs. “greater anxiety, sadness, or hostility” (emotional
tone) (“Linguistic Inquiry and Word Count: LIWC2015 Operator’s Manual,” 2015, p. 22)
50. Delving into Topics of Interest
Identifying words (names, formulas, dates, symbols, etc.)-of-interest
Using NVivo 11 Plus to create word trees with the target term as the seeding
topic
Ability to double-click on the respective branches to link back to the
original source data files
56. Survival Function of Assignments to
Update
How long does it take before an assignment is updated?
At what point does an assignment seem to be “safe” against update?
What are some ways to understand assignments that are updated some
1,000 days after the date of creation?
Is it possible that some assignments were transferred over from a prior LMS
through an LTI-enabled process that might have captured the very first moment
of creation for that assignment? (“LTI” refers to the Learning Tools Interoperability
standard created by the IMS Global Learning Consortium.)
62. A Survey of Quiz Types
Assignment
Practice quiz
Graded survey
Survey
Affordances of the various quiz types change over time, so it is important to
update on the various functions and capabilities even as one is looking at
the data.
66. Quiz Question Workflow States
unpublished (default)
published
deleted
So a majority of quiz questions are created / drafted but held in reserve
and not published.
What are some possible inferences that can be made from the instance-
scale statistics and numbers?
68. An Inclusive Scatterplot of Quiz Point
Values
min-max range: 0 – 23,700 points per quiz
average quiz value: 33 points (without zeroes averaged in) and 28 points (with zeroes averaged in)
The 23,700 value occurred twice, which suggests that it might be purposeful. That huge number, though, pulls the curve; in a normal research context, such an outlier would likely be omitted to erase its pull on the curve (which would otherwise result in skew). A zoom-in would require going to the particular instructor and course. That might require a different approach to the data than described in this work…such as re-animating all the flat files in a SQL database and using unique identifiers to connect related data.
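The gap between the two averages above, and the pull of an extreme value like the 23,700-point quiz, can be reproduced with the standard library. The point values below are invented stand-ins for the real quiz column.

```python
from statistics import mean, median

# Invented quiz point values for illustration; the real column comes
# from the quiz flat file, where the observed maximum was 23,700.
points = [0, 0, 10, 20, 25, 30, 50, 100, 23700]

with_zeros = mean(points)
without_zeros = mean([p for p in points if p > 0])

# A single extreme value drags the mean but barely moves the median,
# which is one quick way to see that an outlier is pulling the curve.
print(round(with_zeros, 1))     # 2659.4
print(round(without_zeros, 1))  # 3419.3
print(median(points))           # 25
```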
70. Histogram of Quiz Point Values in LMS
Instance (with a normal curve)
Frequency of point values for quizzes
Tendencies
Most at the lower number values
72. Survival Curve of Deleted Quizzes in
LMS Instance
Based on timestamp data, how long does it take for a deleted quiz to
achieve “event” or be deleted (from its moment of creation)?
In this dataset, 22% of quizzes were deleted (14,769/66,366).
The min-max day range for the quiz deletions ranged from 0 - 813 days.
A survival analysis showed that the estimated survival time of quizzes that were deleted was 23.6 days, with a lower bound of 22.7 and an upper bound of 24.4 at the 95% confidence interval; the standard error was .419.
The median survival time--of the deleted quizzes--was a low 2 days, which means that if a quiz is to be deleted, it usually happens fairly early.
The drop-off in the curve below is steep but tapers off after several months.
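For quizzes whose deletion was actually observed (so there is no censoring), the survival curve above reduces to a simple empirical fraction. This is a minimal sketch with invented day counts:

```python
def empirical_survival(durations, t):
    """Fraction of items still 'alive' (not yet deleted) after day t.

    With every event observed (no censoring), the Kaplan-Meier
    estimate reduces to this plain empirical survival fraction.
    """
    return sum(1 for d in durations if d > t) / len(durations)

# Invented days-from-creation-to-deletion for a handful of quizzes.
days_to_deletion = [0, 1, 2, 2, 3, 5, 10, 30, 120, 813]

print(empirical_survival(days_to_deletion, 2))    # 0.6
print(empirical_survival(days_to_deletion, 100))  # 0.2
```

With censoring (quizzes that were never deleted), a proper Kaplan-Meier estimator would be needed instead of this shortcut.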
74. One Minus Survival Function Curve for
Deleted Quizzes in the LMS Instance
Shows how long a quiz survives before it is deleted from a set of quizzes that
were ultimately deleted
76. Hazard Function for Deleted Quizzes in
the LMS Instance
All quizzes in the set were ultimately deleted.
This line graph shows time-to-event of when quizzes were deleted from their respective creation dates in the LMS instance.
The hazard function curve sometimes shows particular time-patterns of when a quiz is most at risk of deletion…but this curve only generally shows a steep rise initially and then a gradual achievement of time-to-event.
90. Submission Comment Participation
Type
Admin
Submitter
Author
So administrators all comment on learner submissions, but not all authors or
submitters comment. In other words, the creator of contents may submit
the file without comment.
93. Uploads and Revisions of Files to the
LMS Instance by Year
A sense of the university’s transition to the LMS, over multiple years (so
caution)
101. Wikis and Wiki Pages
A “wiki” in Canvas is a page with its history captured and able to be
reinstituted (enabled by wiki software)
Pages may be interconnected
A page may be set as the home page
A page may be embedded in a modular sequence
A page may contain a Mediasite video
A page may contain any number of contents: imagery, iframes, videos,
and other contents
103. Parent Types for Wiki Pages in the LMS
Instance
Course
Group
In other words, the administrators (instructors) of courses are the ones who
create a majority of the pages. The learners in groups create fewer of the
wiki pages.
Note that the sense of a “wiki” page is different here.
105. Wiki Page Workflow
Null (default)
Active
Unpublished
Deleted
This needs more insight, but the data dictionary does not explain the
different states and what they mean. For example, is a “null” wiki page
published? Is an “active” wiki page something that is included in a
sequence? Is a “deleted” wiki page recoverable or not?
109. About Enrollment Role Types
Role Name | Basic Role Type
Librarian | TAEnrollment
StudentEnrollment | StudentEnrollment
TeacherEnrollment | TeacherEnrollment
TAEnrollment | TAEnrollment
DesignerEnrollment | DesignerEnrollment
ObserverEnrollment | ObserverEnrollment
Grader | TAEnrollment
GradeObserver | TAEnrollment
116. Request Types in the LMS Instance
GET (Read)
POST (Create)
PUT (Update / Replace)
HEAD (Retrieve headers only)
DELETE (Remove)
PATCH (Update, Modify)
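Once the requests flat file is loaded, tallying these verbs is a one-line Counter; the sample rows below are invented, not actual K-State request data.

```python
from collections import Counter

# Invented request-log rows; the real rows come from the requests flat file.
requests = [
    {"method": "GET", "url": "/courses/1"},
    {"method": "GET", "url": "/courses/1/quizzes"},
    {"method": "POST", "url": "/courses/1/quizzes"},
    {"method": "PUT", "url": "/courses/1/quizzes/10"},
    {"method": "GET", "url": "/courses/2"},
]

by_method = Counter(r["method"] for r in requests)
print(by_method["GET"])   # 3
print(by_method["POST"])  # 1
```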
126. User “Workflow” States in the LMS
Instance
registered
pre_registered
deleted
creation_pending
The “creation_pending” may well refer to a process of approval for people
to have access—for a level of security.
128. Years of Origination of User Accounts
Initial exploration in 2013
Big push in 2014
New accounts in 2015 and 2016 indicating not only students but also
employment churn and stragglers slow to change to a new LMS
130. Retired Accounts = Registered False
2013 – early May 2017
Word frequency count from unigrams (so no full names represented as
such)
First names more common and so better represented
One number removed in the “stopwords” list
132. Pseudonyms
Pseudonyms = “logins associated with users”
Seems to be the connection between the LMS and various university
information systems
Seems like partial data (extracted in May 2017)
140. Conversations with Media Objects
Included
False
True
So when people use the email system inside Canvas, they do not generally
attach media objects (like digital imagery, slideshows, audio, video, or
other digital files).
142. Conversations w/ or without
Attachments
A majority of conversations are without attachments
A minority of conversations are with attachments
146. Conversation Messages Word
Frequency Count
482,339 conversation messages
Texts with 60,509,894 words
2/3 analyzed for textual contents (because of data size)
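The unigram frequency counting described above is a short standard-library exercise; the stopword list and sample text here are placeholders, not the ones used in the study.

```python
import re
from collections import Counter

# Placeholder stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "in", "for", "i"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Count unigrams, lowercased, with stopwords removed."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

sample = "Please help with the assignment. The assignment is due; help appreciated."
counts = word_frequencies(sample)
print(counts.most_common(2))  # [('help', 2), ('assignment', 2)]
```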
148. Mass Conversation Message Contents
Analytic: 82.33
“Formal, logical, and hierarchical thinking” vs. “more informal, personal, here-and-
now, and narrative thinking”
Clout: 80.21
“perspective of high expertise” and confidence vs. “more tentative, humble, even
anxious style”
Authentic: 26.41
“more honest, personal, and disclosing text” vs. “a more guarded, distanced form of
discourse”
Tone: 66.24
“a more positive, upbeat style” vs. “greater anxiety, sadness, or hostility” (emotional
tone) (“Linguistic Inquiry and Word Count: LIWC2015 Operator’s Manual,” 2015, p. 22)
150. Messaging about “Human Drives” in
the Mass Conversation Messages
Affiliation (2.35)
Power (2.19)
Achievement (1.46)
Reward (1.3)
Risk (0.37)
“The focus on affiliation and social identity seems reasonable, given the
typical college age of learners. The "power" language may come from
faculty speaking from positions of authority. The low level of focus on risk is
intriguing here (maybe young learners are not thought to have developed
the efficacy and confidence to take on uncontrolled risks?). Clearly, there
is a role for theorizing and interpretation, even with computation-based
analytics.”
152. Sentiment Analysis of Sample of
Conversation Messaging
A smaller sample of the conversation messages was analyzed for sentiment. This set consisted of 72,377 messages.
The automated observations of sentiment showed that there were two
tendencies...either very positive or moderately negative (in terms of text
categories).
In this software tool, it is possible to explore which texts were categorized to
which categories of sentiment (very negative, moderately negative,
moderately positive, or very positive) in the comparisons between the
target text and the built-in sentiment dictionary.
In other words, the actual exploration of the content is possible through both
machine reading and human close reading.
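A dictionary-based sentiment pass like the one described above can be sketched as follows; the two word lists are toy stand-ins for a real sentiment lexicon, not the tool's actual dictionary.

```python
import re

# Toy stand-ins for a real sentiment lexicon.
POSITIVE = {"great", "good", "thanks", "appreciate", "helpful"}
NEGATIVE = {"confused", "late", "missing", "problem", "unfair"}

def sentiment_score(text):
    """Return (positive_hits, negative_hits, label) for one message."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    if pos > neg:
        label = "positive"
    elif neg > pos:
        label = "negative"
    else:
        label = "neutral"
    return pos, neg, label

print(sentiment_score("Thanks, the feedback was helpful!"))       # (2, 0, 'positive')
print(sentiment_score("I am confused about the missing grade"))   # (0, 2, 'negative')
```

A real tool also handles negation, intensifiers, and graded categories (very negative through very positive), which this sketch omits.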
154. Auto-Extracted Theme Based Hierarchy
Chart of Conversation Messaging Sample
(as a Treemap)
Class
Assignment
Time
Paper
Questions
Exam
Online
Group, etc.
156. Auto-extracted Themes from
Conversation Messaging Sample
These are in alphabetical order
The themes are listed in a human-readable way going clockwise around
the pie (in a pie chart)
158. Auto-Coded Theme-Based Hierarchy Chart of
Topics and Subtopics from Conversation
Messaging Sample (as a Sunburst Diagram)
This sunburst diagram—in the software—is somewhat interactive
This enables digging down into a Topic by double-clicking on it and seeing
the subtopic contents there
If a sliver is too thin, hovering the mouse over it will display the actual subtopic, with its statistics and quantitative data available for viewing
160. Contexts of “Help” in a Word Tree
It is possible to analyze the various contexts in which “help” was used in the
conversation messaging in the prior word tree
In the software (NVivo 11 Plus), the word tree is interactive and is linked to
the original sources where the word appears, so it is possible to achieve
close reading of every use of “help” from the underlying dataset
The challenge is engaging a full dataset of millions of words
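The word-tree contexts can be approximated in plain Python with a keyword-in-context (KWIC) listing; NVivo's interactivity is not reproduced here, and the sample message is invented.

```python
import re

def keyword_in_context(text, keyword, window=3):
    """List (left-context, keyword, right-context) tuples with
    `window` words of context on each side of every hit."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

message = "Can you help me with the quiz I need help before the exam"
hits = keyword_in_context(message, "help")
for left, kw, right in hits:
    print(f"{left} [{kw}] {right}")
# Can you [help] me with the
# quiz I need [help] before the exam
```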
169. External Tool Activations in 2014
There is an increase in both variety and number of external tool activations
No deeper analysis was applied, but it could be…as to the external tool types
and the changing senses of needs
171. External Tool Activations in 2015
There is an increase in both variety and number of external tool activations
No deeper analysis was applied, but it could be…as to the external tool types
and the changing senses of needs
173. External Tool Activations in 2016
There is an increase in both variety and number of external tool activations
No deeper analysis was applied, but it could be…as to the external tool types
and the changing senses of needs
176. Course User Interface Navigation Item
State
Visible
Hidden
This refers to users’ ability to keep the pre-set functions in the left navigation of a course shell active or to place them in a “hidden” state.
There are “hidden” navigation element presets as well, which users may choose to activate.
179. Delimiting the Analytics from the LMS
Data Portal Data
The concept behind delimiting is to make conclusions more accurate by
representing how confident one may be about the results.
As noted, there may be challenges and noise in the data from any step in
the workflow…but there are inherent limits also to the various data analytics
types—as shown in the visualization in the prior slide.
181. Some Practical Applications
Self-awareness (holding up a mirror to the campus for its use of its LMS)
Analytics
To improve usage of the LMS
To know what functions and features are desirable
To support learner usage
To support teaching and learning
To support non-teaching and learning approaches to the data
Decision-making
Instructional design
Administrative awareness, decision-making, funding, and others
183. What are Ways to Go Beyond?
Other Analytical Methods
Reconnecting the flat files as relational files in SQL server
Design of specific cross-file queries for data analytics
Applying more and varied computational text analysis
Engaging machine learning for patterns (such as decision trees for predictivity of classifications based on available information)
Bringing in More Data
Comparing macro-level data with other instances of the Canvas LMS (such as with comparable institutions of higher education)
Using additional data to enable close-in reads (but without compromising people’s privacy)
Keep confidential information confidential
185. Assessing the Initial Haul of Biggish Data
Formulating askable questions
Analyzing the columnar data (and variables)
Understanding where the data comes from and how it is processed by Instructure
Analyzing the date data
Analyzing the textual data
Understanding ways to mix data in various datasets for enriched querying
Conceptualizing mixes of questions and potential findings based on the
available data
186. Assessing the Initial Haul of Biggish Data
(cont.)
Understanding the types of software that may be used to engage the data
Software enables cross-sectional base rate counts from flat files
Software enables cross-tabulation analysis and assessments of statistical
significance (rarity of patterns)
Software enables finding patterns through machine learning (like applying
decision trees to see what variables help determine classifications)
Software enables the identification of text-based patterns
187. Some Early Lessons Learned
Data visualizations are only summary data, and it’s important to get to the
actual underlying data to understand some dynamics.
It helps to theorize or hypothesize broadly to understand what may be
going on with the observed empirical data.
It is always wise to “sanity check” data extractions and data processing to
see what is going on.
It is important to understand the LMS data portal’s default settings and the
rationales behind those defaults to make sure that they make sense for the
particular context.
188. Some Early Lessons Learned (cont.)
Avoid double-counting for complex data with similar lead-in terms.
Watch out for typing errors.
Do not ignore error messages; figure out why they’re happening and deal
with the issues.
Slow down the process, so you’re certain of what is happening at every
step. Be careful not to lose data.
Be careful about going to Excel, which has a limit of about 1.05 million rows of data. Be careful also of OS clipboards, which have 65,000-record limits. Do not let such limits stall the work and result in lost data. Go to MS Access first or SQL server.
189. Some Early Lessons Learned (cont.)
Use the LMS data portal “data dictionary” for the LMS data, but realize that
it may be dated or incomplete or inaccurate. A particular instance of an
LMS will be particular, so a general dictionary offers a general view, not a
specific one. Use the data dictionary in an attentive way.
Realize that there are nuances in the data that may not be apparent initially.
With computational text analysis, oftentimes, foreign languages will get
short shrift. There may be effective ways to address this.
With any sort of automation, there will be trade-offs. It is important to check
findings against the data and conduct data queries on multiple software
tools.
190. Some Early Lessons Learned (cont.)
Data is messy. It is totally possible (even probable) for a process to seem to go smoothly when something has glitched with a data download.
Sometimes, no matter what, it is not possible to import the data for processing into either Microsoft Access or SQL. In that case, there may need to be a data “substitution” by extracting the “same-ish” set from the LMS data portal (days later from when the first set was extracted).
The assumption is that new data is appended to the end of the existing data, so if the file is the proper one, a “later” version should still be accurate. Depending on the data handling, though, that assumption may not be true. It will be important to check.
191. Some Early Lessons Learned (cont.)
Don’t just go with how software is designed. For example, with a word
frequency count, don’t just go with the high counts, but analyze the “long
tail” of the low counts.
The “power law” does often apply to word counts in language. The long tail
shows something of outlier data in terms of single mentions (but you have to slog
through misspellings, strange alphanumeric strings, and other noise first).
There are certain data visualizations that work better for certain types of
data.
All data visualizations should be sufficiently labeled.
It helps to calculate not only raw numbers but percentages, where possible.
192. Some Early Lessons Learned (cont.)
Data portals contain personally identifiable information (PII), so extra care
has to be taken to ensure that people’s private information is not misused
nor leaked.
What is knowable depends on what other datasets one has access to and
how one sets up the analyses…
It helps to know what is possible to know from the data (full universe)
It helps to know what is politically viable to ask and capture (subset) (people
may ask for the moon)
It helps to use resources wisely to pursue asks that create constructive awareness
and good decision-making (sub-subset)
Recording steps is important (in notes and in macros)…so everything can
be repeated as needed.
193. To a Relational Database
So…flat files are downloaded as compressed .gz files and extracted with 7Zip as .csv files.
Microsoft offers SQL Server Express as a free tool but limits it to one CPU (up to 4 cores), 1 GB of RAM, and a 10 GB database size (“Limitations of SQL Server Express”).
Set this up on a dedicated machine, so the setup does not disrupt other work.
In shifting to SQL Server Express, the flat files have to be properly processed
for the data to move without lossiness or other problems.
It may help to process the data first in MS Access (as long as the flat file data is
not too large to handle in Access). Treat text columns as “Long Text,” not “Short
Text.” Label Date fields not as text but “Date with Time.” The idea is to have the
proper settings for appropriate receipt in SQL.
194. To a Relational Database (cont.)
Then, export the object from Access to Excel 2016 with the formatting and proper
data structure.
If a table has more than 65,000 records, then MS Access is unable to export the data table.
195. To a Relational Database (cont.)
One option is to split the dataset in Access (highlight the table -> go to Database Tools tab -> click Access Database -> Split database). The problem with this is that a dataset will have to be split quite a few times to get below the 65,000-record threshold, and then, after ingestion into SQL, any repeat data will have to be deleted. This path is too onerous to be helpful, especially with LMS data portal data, which can easily go into the millions and millions of rows.
A more direct option follows on the next slide.
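The splitting idea above can also be done outside Access entirely, by chunking the .csv below the export ceiling before any import. This is a sketch: the 65,000-row figure is taken from the slide, and the file names are arbitrary.

```python
import csv

def split_csv(path, max_rows=65000, prefix="chunk"):
    """Split a CSV (with a header row) into files of at most
    max_rows data rows each; the header is repeated in every chunk.
    Returns the list of chunk file names."""
    with open(path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk_names = []
        writer = out = None
        count = index = 0
        for row in reader:
            if writer is None or count >= max_rows:
                if out:
                    out.close()
                index += 1
                name = f"{prefix}_{index}.csv"
                chunk_names.append(name)
                out = open(name, "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)
                count = 0
            writer.writerow(row)
            count += 1
        if out:
            out.close()
    return chunk_names

# Demo: 5 data rows split into chunks of at most 2 rows each.
with open("big.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["id", "name"]] + [[str(i), f"row{i}"] for i in range(5)])

chunks = split_csv("big.csv", max_rows=2)
print(chunks)  # ['chunk_1.csv', 'chunk_2.csv', 'chunk_3.csv']
```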
196. To a Relational Database (cont.)
When files are too large (anything over the 65,000 records that will fit in a
clipboard), then it makes better sense to just clean data on export in SQL. The
sequence goes like this: .gz -> .csv (using 7Zip) -> open SQL Management Studio
-> import data (change “DT_String” columns to “DT_Text” (for a “text stream”), so
there is not a 50 character constraint on the columns), and the data import
generally goes well. (This solution takes up more computer memory and is
inelegant, but it solves the many issues that would crop up otherwise with a
straight import without the data label adjustments.)
There is no import of column names in the first row.
In SQL Server Management Studio 17, go to Databases -> System
Databases -> “master” database (right-click) -> Tasks -> Import Data … and
specify that the original source is from Microsoft Excel. The flat files are now
database objects (dbos) in the master database. Do keep the original file
names, for ease-of-reference.
197. To a Relational Database (cont.)
Re-indexing needed?
If so, the foreign keys may have to be reconnected to the correct primary
keys for the relating in a relational database to make sense and for SQL
queries across the files to make sense.
Foreign keys point to primary keys in another table; they are unique identifiers
that connect related data between tables.
Primary keys are unique identifiers (and “reserved” against reuse in that sense),
and they indicate unique records in data tables (and databases).
If not, it may be possible to run SQL queries by loading the tables with
primary keys first and those with referring foreign keys second…but I am not
there yet. Working on it.
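The primary-key/foreign-key reconnection described above can be prototyped without SQL Server at all, using Python's built-in sqlite3. The two tiny tables and their values below are invented, just to show a cross-file query that disconnected flat files cannot answer directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical, tiny stand-ins for two flat files: a course table
# with a primary key, and a quiz table carrying course_id as a
# foreign key back to it.
cur.execute("CREATE TABLE course_dim (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE quiz_dim (id INTEGER PRIMARY KEY, course_id INTEGER "
    "REFERENCES course_dim(id), points_possible REAL)"
)
cur.executemany("INSERT INTO course_dim VALUES (?, ?)",
                [(1, "Biology 101"), (2, "History 200")])
cur.executemany("INSERT INTO quiz_dim VALUES (?, ?, ?)",
                [(10, 1, 25.0), (11, 1, 50.0), (12, 2, 30.0)])

# A cross-file query: total quiz points per course, joining the
# foreign key in quiz_dim to the primary key in course_dim.
cur.execute(
    "SELECT c.name, SUM(q.points_possible) FROM quiz_dim q "
    "JOIN course_dim c ON q.course_id = c.id GROUP BY c.name ORDER BY c.name"
)
totals = cur.fetchall()
print(totals)  # [('Biology 101', 75.0), ('History 200', 30.0)]
```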
198. To a Relational Database (cont.)
Proceed with a good basic text on SQL server. Give it a good read-through
before actually going too far into a project. (Experimentation is always
good, but time wastage—not so much.)
If local support with a database administrator (DBA) is available, that would
be optimal.
199. References
Pennebaker, J.W., Booth, R.J., Boyd, R.L., & Francis, M.E. (2015). Linguistic Inquiry and Word Count: LIWC2015 Operator’s Manual. Retrieved from https://s3-us-west-2.amazonaws.com/downloads.liwc.net/LIWC2015_OperatorManual.pdf
200. Contact and Conclusion
Dr. Shalin Hai-Jew
iTAC
Kansas State University
212 Hale / Farrell Library
shalin@k-state.edu
785-532-5262