This presentation reports the data science curriculum development and implementation at Syracuse iSchool, which has shaped by the fast changing data-intensive environment not only for science but also for business and research at large.
Educating a New Breed of Data Scientists for Scientific Data Management
1. Educating a New Breed of
Data Scientists for Scientific
Data Management
Jian Qin
School of Information Studies
Syracuse University
Microsoft eScience Workshop, Chicago, October 9, 2012
2. DS
Talk points
› Data science (DS) and data scientists in the context of
scientific data
› An iSchool version of the DS curriculum
› Findings and lessons from implementing the DS curriculum
› A new breed of data scientists: the iSchool approach
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 2
3. What is data science?
DS
“An emerging area of work
concerned with the collection,
presentation, analysis, visualization,
management, and preservation of
large collections of information.”
Stanton, J. (2012). Introduction to Data Science.
http://ischool.syr.edu/media/documents/2012/3/
DataScienceBook1_1.pdf
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 3
4. DS
Data science and scientific research
Plan, design, consult Ingest, store,
for, implement, and organize, merge,
evaluate data filter, and transform
management projects data and create
and services analysis-ready data
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 4
5. What data scientists are expected to do:
the job market
DS
Laboratory Data Data Modeling/
Management Specialist Management Specialist
Scientific Data Management • Administer operational database • Working closely with the high
Specialist • Assure the quality of data performance computing and
• Design, develop, implement, and database content the IT manager
manage high-throughput automatic • Interact closely with researchers, • Develop a data model for
data processing infrastructure for lab managers, and platform complex multi-scale rocks
large databases in a mature system coordinators • Design and organize a
• Develop and improve the • Track deliverables against budget database and complex
infrastructure supporting this system and prepare data reports queries
• Interface with multiple data providers • Collaborate closely with IT and • Integrate and mange multi-
to design, build, and maintain their bioinformatics colleagues scale rocks subjected to
customized databases • Assist IT in gathering workflow large-scale scientific
• Clarify requirements, feature requirements computing applications
requests and bug reports for software • Test changes and updates in IT
systems http://www.ingrainrocks.com/
developers and assist in testing data-management-specialist/
code. • Create and maintain app
documentation
http://www.bioinformatics.org/
forums/forum.php?forum_id=9670
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 5
6. DS
“We’re increasingly finding data in
the wild, and data scientists are
involved with gathering data,
massaging it into a tractable form,
making it tell its story, and presenting
that story to others.”
Loukides, M. (2011). What is data science? Sebastopol, CA: O’Reilly.
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 6
7. What data scientists are expected to do:
DS
the difference from tradition
› Data scientists are more likely to be involved across the
data lifecycle:
– Acquiring new data sets: 33%
– Parsing data sets: 29%
– Filtering and organizing data: 40%
– Mining data for patterns: 30%
– Advanced algorithms to solve analytical problems: 29%
– Representing data visually: 38%
– Telling a story with data: 34%
– Interacting with data dynamically: 37%
– Making business decisions based on data: 40%
http://mashable.com/2012/01/13/career-of- 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 7
the-future-data-scientist-infographic/
8. How should educational
programs address the
challenge?
A case of the CAS in Data Science
program at Syracuse iSchool
DS
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 8
9. Story 1: cognitive-demanding workflows and
DS
data management
› Domain: Thermochronology and tectonics
› What’s involved: rock samples from drilling and field observation, sliced and
grained rock samples
› Data types: Excel data files (lots of them), spectrum and microscopic images,
annotations
› Analysis: modeling and sensemaking by combining data from multiple data
files with specialized software
› Bottleneck problem: manually matching/merging/filtering data is extremely
cumbersome and the problem is compounded by the difficulty finding the right
data files
10. DS
Story 2: highly automated workflows
› Domain: Astrophysics: gravitational wave detection
› What’s involved: data ingestion from laser interferometers, raw data
calibration and segmentation, workflow management, provenance
› Data types: streaming data from
the laser interferometers, images
› Analysis: detection of “events”
› Bottleneck problem: tracking of
data and processes and the
relationships between them
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 10
11. Ability to use a Knowledge
Data
DS
wide variety of a subject
modeling,
tools for domain
documentation, database and
analysis, and query design
report of data
Data OS,
Collaboration,
communication,
scientists Programming
languages
and co-
ordination
Content and Encoding
What are repository languages
systems
expected of data
scientists? 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 11
12. DS
Analytical skills: domain modeling
Requirement analysis
Interview skills, analysis and
generalization skills
Workflow analysis
Ability to capture components and
sequences in workflows
Data modeling
Ability to translate domain analysis
Data transformation into data models
needs analysis
Ability to envision the data model
Data provenance within the larger system architecture
needs analysis
13. Analytical skills: from data sources to patterns,
DS
relationships, and trends
Analytical tools
“Hacking”
Knowledge
Data
products
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 13
14. Data management skills: data lifecycle and
DS
infrastructural services
Metadata Encoding Semantic Identify Infrastructural
standards language control management services
Processed, transformed, derived, calculated, … data • Data source
discovery
• Data curation
Common data format
Image formats
• Data preservation
Matrix formats • Data integration and
Microarray file formats mashup
Communication protocols • Data citation,
publication, and
distribution
• Data linking and
interoperability
• …
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 14
15. Technology skills with excellent communication
DS
skills
TECHNOLOGY SKILLS COMMUNICATION SKILLS
› Operation systems › Interviews
› Repository systems › “Ice breaking”
› Database systems › Community building
› Programming languages › Institutionalization
› Encoding languages › Stakeholder buy-in
› Specialized programming
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 15
16. No superman model for beginning data
DS
scientists
Data
analytics Data storage
and
management
Data scientists
Core:
Applied data science
Databases
General system
management Data
visualization
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 16
17. DS
The CAS in Data Science program at SU
› Required:
– Data Administration Concepts and Database Management
– Applied Data Science
› Elective:
Data Analytics Data Storage and Data Visualization
• Data Mining Management • Information Architecture for
• Basics of Information • Technologies for Web Internet Services
Retrieval Systems Content Management • Information Visualization
• Natural Language • Foundations of Digital Data
Processing • Creating, Managing, and General Systems Management
• Advanced Information Preserving Digital Assets • Enterprise Technologies
Analytics • Data Warehousing • Managing Information Systems
• Research Methods • Advanced Database Projects
• Statistical Methods Management • Information Systems Analysis
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 17
18. What we learned from the program
DS
development
› Data science is a moving target with multiple focal points
– Versions from statistics, computer science, and library and
information science
› Skills vs. theories
– Students are anxious to learn skills but not so interested in
theories
– Theories help build visions
› Sufficient hands-on time for technologies and tools
› Authentic learning through real-world data management
projects
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 18
19. DS
Reconciliation of the two views of data science
“An emerging area of work “We’re increasingly finding
concerned with the data in the wild, and data
collection, presentation, scientists are involved with
analysis, visualization, gathering data, massaging it
management, and into a tractable form, making it
preservation of large tell its story, and presenting that
collections of information.” story to others.”
Stanton, J. (2012). Introduction to Data Science. Loukides, M. (2011). What is data
http://ischool.syr.edu/media/documents/2012/3/ science? Sebastopol, CA: O’Reilly.
DataScienceBook1_1.pdf
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 19
20. The iSchool’s version of data science
DS
education
Ability to use a Knowledge
wide variety of a subject Data
tools for domain modeling,
documentation, database and
analysis, and query design
Eventually the report of data
iSchool data science
program will build Data OS,
Collaboration,
the foundation for communication,
scientists Programming
languages
and co-
super data ordination
scientists…
Content and Encoding
repository languages
systems
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 20
21. DS
eScience Librarianship Curriculum Project:
http://eslib.ischool.syr.edu/
Science Data Literacy Project:
http://sdl.syr.edu/
CAS in Data Science:
http://ischool.syr.edu/future/cas/
datascience.aspx
10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 21