SlideShare une entreprise Scribd logo
1  sur  21
Télécharger pour lire hors ligne
Educating a New Breed of
Data Scientists for Scientific
Data Management
Jian Qin
School of Information Studies
Syracuse University
Microsoft eScience Workshop, Chicago, October 9, 2012
DS	

   Talk points
        ›  Data science (DS) and data scientists in the context of
           scientific data
        ›  An iSchool version of the DS curriculum
        ›  Findings and lessons from implementing the DS curriculum
        ›  A new breed of data scientists: the iSchool approach




                               10/9/12     MICROSOFT ESCIENCE WORKSHOP 2012   2
What is data science?
DS	




                     “An emerging area of work
                   concerned with the collection,
                presentation, analysis, visualization,
                 management, and preservation of
                  large collections of information.”


                    Stanton, J. (2012). Introduction to Data Science.
                    http://ischool.syr.edu/media/documents/2012/3/
                                DataScienceBook1_1.pdf
                              10/9/12           MICROSOFT ESCIENCE WORKSHOP 2012   3
DS	

   Data science and scientific research


    Plan, design, consult                                     Ingest, store,
     for, implement, and                                   organize, merge,
        evaluate data                                    filter, and transform
    management projects                                     data and create
         and services                                    analysis-ready data




                            10/9/12   MICROSOFT ESCIENCE WORKSHOP 2012     4
What data scientists are expected to do:
               the job market
       DS	

                                 Laboratory Data                         Data Modeling/
                                             Management Specialist                   Management Specialist
Scientific Data Management                   •  Administer operational database      •  Working closely with the high
Specialist                                   •  Assure the quality of data              performance computing and
•     Design, develop, implement, and           database content                        the IT manager
      manage high-throughput automatic       •  Interact closely with researchers,   •  Develop a data model for
      data processing infrastructure for        lab managers, and platform              complex multi-scale rocks
      large databases in a mature system        coordinators                         •  Design and organize a
•     Develop and improve the                •  Track deliverables against budget       database and complex
      infrastructure supporting this system     and prepare data reports                queries
•     Interface with multiple data providers •  Collaborate closely with IT and      •  Integrate and mange multi-
      to design, build, and maintain their      bioinformatics colleagues               scale rocks subjected to
      customized databases                   •  Assist IT in gathering workflow         large-scale scientific
•     Clarify requirements, feature             requirements                            computing applications
      requests and bug reports for software •  Test changes and updates in IT
                                                systems                               http://www.ingrainrocks.com/
      developers and assist in testing                                                data-management-specialist/
      code.                                  •  Create and maintain app
                                                documentation
     http://www.bioinformatics.org/
     forums/forum.php?forum_id=9670
                                                    10/9/12           MICROSOFT ESCIENCE WORKSHOP 2012          5
DS	


         “We’re increasingly finding data in
        the wild, and data scientists are
        involved with gathering data,
        massaging it into a tractable form,
        making it tell its story, and presenting
        that story to others.”
          Loukides, M. (2011). What is data science? Sebastopol, CA: O’Reilly.



                                   10/9/12           MICROSOFT ESCIENCE WORKSHOP 2012   6
What data scientists are expected to do:
DS	

   the difference from tradition
        ›  Data scientists are more likely to be involved across the
           data lifecycle:
           –  Acquiring new data sets: 33%
           –  Parsing data sets: 29%
           –  Filtering and organizing data: 40%
           –  Mining data for patterns: 30%
           –  Advanced algorithms to solve analytical problems: 29%
           –  Representing data visually: 38%
           –  Telling a story with data: 34%
           –  Interacting with data dynamically: 37%
           –  Making business decisions based on data: 40%
    http://mashable.com/2012/01/13/career-of-   10/9/12   MICROSOFT ESCIENCE WORKSHOP 2012   7
    the-future-data-scientist-infographic/
How should educational
        programs address the
        challenge?
        A case of the CAS in Data Science
        program at Syracuse iSchool


DS	

                     10/9/12   MICROSOFT ESCIENCE WORKSHOP 2012   8
Story 1: cognitive-demanding workflows and
DS	

   data management
›  Domain: Thermochronology and tectonics
›  What’s involved: rock samples from drilling and field observation, sliced and
   grained rock samples
›  Data types: Excel data files (lots of them), spectrum and microscopic images,
   annotations
›  Analysis: modeling and sensemaking by combining data from multiple data
   files with specialized software
›  Bottleneck problem: manually matching/merging/filtering data is extremely
   cumbersome and the problem is compounded by the difficulty finding the right
   data files
DS	

   Story 2: highly automated workflows
    ›  Domain: Astrophysics: gravitational wave detection
    ›  What’s involved: data ingestion from laser interferometers, raw data
       calibration and segmentation, workflow management, provenance

    ›  Data types: streaming data from
       the laser interferometers, images
    ›  Analysis: detection of “events”
    ›  Bottleneck problem: tracking of
       data and processes and the
       relationships between them

                                   10/9/12     MICROSOFT ESCIENCE WORKSHOP 2012   10
Ability to use a        Knowledge
                                                               Data
DS	

      wide variety          of a subject
                                                             modeling,
             tools for             domain
         documentation,                                    database and
          analysis, and                                    query design
          report of data



                                           Data                        OS,
             Collaboration,
            communication,
                                         scientists                Programming
                                                                    languages
                and co-
              ordination


                              Content and                Encoding
    What are                   repository               languages
                                systems
    expected of data
    scientists?                10/9/12          MICROSOFT ESCIENCE WORKSHOP 2012   11
DS	

Analytical    skills: domain modeling
   Requirement analysis
                                Interview skills, analysis and
                                generalization skills
    Workflow analysis
                                Ability to capture components and
                                sequences in workflows
      Data modeling

                                Ability to translate domain analysis
   Data transformation          into data models
     needs analysis
                                Ability to envision the data model
     Data provenance            within the larger system architecture
      needs analysis
Analytical skills: from data sources to patterns,
DS	

   relationships, and trends
                                       Analytical tools


                     “Hacking”


                                                                        Knowledge



                                    Data
                                    products


                          10/9/12         MICROSOFT ESCIENCE WORKSHOP 2012    13
Data management skills: data lifecycle and
DS	

 infrastructural services

      Metadata    Encoding       Semantic         Identify               Infrastructural
      standards   language        control       management               services

     Processed, transformed, derived, calculated, … data                 •  Data source
                                                                            discovery
                                                                         •  Data curation
                      Common data format
                          Image formats
                                                                         •  Data preservation
                          Matrix formats                                 •  Data integration and
                      Microarray file formats                               mashup
                     Communication protocols                             •  Data citation,
                                                                            publication, and
                                                                            distribution
                                                                         •  Data linking and
                                                                            interoperability
                                                                         •  …
                                     10/9/12          MICROSOFT ESCIENCE WORKSHOP 2012      14
Technology skills with excellent communication
DS	

   skills

        TECHNOLOGY SKILLS                    COMMUNICATION SKILLS
        ›  Operation systems                 ›  Interviews
        ›  Repository systems                ›  “Ice breaking”
        ›  Database systems                  ›  Community building
        ›  Programming languages             ›  Institutionalization
        ›  Encoding languages                ›  Stakeholder buy-in
        ›  Specialized programming



                                   10/9/12      MICROSOFT ESCIENCE WORKSHOP 2012   15
No superman model for beginning data
DS	

   scientists
              Data
            analytics                                        Data storage
                                                                 and
                                                             management

                            Data scientists
                                Core:
                          Applied data science
                               Databases
         General system
          management                                             Data
                                                             visualization


                           10/9/12     MICROSOFT ESCIENCE WORKSHOP 2012      16
DS	

     The CAS in Data Science program at SU
          ›  Required:
             –  Data Administration Concepts and Database Management
             –  Applied Data Science
          ›  Elective:
        Data Analytics             Data Storage and                 Data Visualization
        •  Data Mining             Management                       •  Information Architecture for
        •  Basics of Information   •  Technologies for Web             Internet Services
           Retrieval Systems          Content Management            •  Information Visualization
        •  Natural Language        •  Foundations of Digital Data
           Processing              •  Creating, Managing, and       General Systems Management
        •  Advanced Information       Preserving Digital Assets     •  Enterprise Technologies
           Analytics               •  Data Warehousing              •  Managing Information Systems
        •  Research Methods        •  Advanced Database                Projects
        •  Statistical Methods        Management                    •  Information Systems Analysis


                                            10/9/12           MICROSOFT ESCIENCE WORKSHOP 2012        17
What we learned from the program
DS	

   development
        ›  Data science is a moving target with multiple focal points
          –  Versions from statistics, computer science, and library and
             information science
        ›  Skills vs. theories
          –  Students are anxious to learn skills but not so interested in
             theories
          –  Theories help build visions
        ›  Sufficient hands-on time for technologies and tools
        ›  Authentic learning through real-world data management
           projects
                                   10/9/12       MICROSOFT ESCIENCE WORKSHOP 2012   18
DS	

   Reconciliation of the two views of data science

         “An emerging area of work                              “We’re increasingly finding
             concerned with the                                  data in the wild, and data
          collection, presentation,                            scientists are involved with
           analysis, visualization,                           gathering data, massaging it
             management, and                                 into a tractable form, making it
            preservation of large                           tell its story, and presenting that
         collections of information.”                                  story to others.”

        Stanton, J. (2012). Introduction to Data Science.       Loukides, M. (2011). What is data
        http://ischool.syr.edu/media/documents/2012/3/          science? Sebastopol, CA: O’Reilly.
                    DataScienceBook1_1.pdf


                                                 10/9/12         MICROSOFT ESCIENCE WORKSHOP 2012    19
The iSchool’s version of data science
DS	

    education
                               Ability to use a       Knowledge
                                 wide variety         of a subject                Data
                                   tools for            domain                  modeling,
                               documentation,                                 database and
                                analysis, and                                 query design
        Eventually the          report of data

        iSchool data science
        program will build                                    Data                      OS,
                                   Collaboration,
        the foundation for        communication,
                                                            scientists              Programming
                                                                                     languages
                                      and co-
        super data                  ordination

        scientists…
                                                    Content and             Encoding
                                                     repository            languages
                                                      systems


                                     10/9/12          MICROSOFT ESCIENCE WORKSHOP 2012        20
DS	


        eScience Librarianship Curriculum Project:
        http://eslib.ischool.syr.edu/

        Science Data Literacy Project:
        http://sdl.syr.edu/

        CAS in Data Science:
        http://ischool.syr.edu/future/cas/
        datascience.aspx



                            10/9/12      MICROSOFT ESCIENCE WORKSHOP 2012   21

Contenu connexe

Tendances

Building a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability ScienceBuilding a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability ScienceRobert H. McDonald
 
THE Jisc Supplement 25 Nov 2009
THE Jisc Supplement 25 Nov 2009THE Jisc Supplement 25 Nov 2009
THE Jisc Supplement 25 Nov 2009Fiona Salvage
 
SEAD Datanet and Sustainability Science
SEAD Datanet and Sustainability Science SEAD Datanet and Sustainability Science
SEAD Datanet and Sustainability Science Robert H. McDonald
 
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...Jason Riedy
 
Big Data-Survey
Big Data-SurveyBig Data-Survey
Big Data-Surveyijeei-iaes
 
SEAD Virtual Archive: Building a Federation of Institutional Repositories fo...
 SEAD Virtual Archive: Building a Federation of Institutional Repositories fo... SEAD Virtual Archive: Building a Federation of Institutional Repositories fo...
SEAD Virtual Archive: Building a Federation of Institutional Repositories fo...skonkiel
 
NOSQL Database Engines for Big Data Management
NOSQL Database Engines for Big Data ManagementNOSQL Database Engines for Big Data Management
NOSQL Database Engines for Big Data Managementijtsrd
 
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...i_scienceEU
 
Ibm big data hadoop summit 2012 james kobielus final 6-13-12(1)
Ibm big data    hadoop summit 2012 james kobielus final 6-13-12(1)Ibm big data    hadoop summit 2012 james kobielus final 6-13-12(1)
Ibm big data hadoop summit 2012 james kobielus final 6-13-12(1)Ajay Ohri
 
CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...
CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...
CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...ijcsit
 

Tendances (10)

Building a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability ScienceBuilding a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability Science
 
THE Jisc Supplement 25 Nov 2009
THE Jisc Supplement 25 Nov 2009THE Jisc Supplement 25 Nov 2009
THE Jisc Supplement 25 Nov 2009
 
SEAD Datanet and Sustainability Science
SEAD Datanet and Sustainability Science SEAD Datanet and Sustainability Science
SEAD Datanet and Sustainability Science
 
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
 
Big Data-Survey
Big Data-SurveyBig Data-Survey
Big Data-Survey
 
SEAD Virtual Archive: Building a Federation of Institutional Repositories fo...
 SEAD Virtual Archive: Building a Federation of Institutional Repositories fo... SEAD Virtual Archive: Building a Federation of Institutional Repositories fo...
SEAD Virtual Archive: Building a Federation of Institutional Repositories fo...
 
NOSQL Database Engines for Big Data Management
NOSQL Database Engines for Big Data ManagementNOSQL Database Engines for Big Data Management
NOSQL Database Engines for Big Data Management
 
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
 
Ibm big data hadoop summit 2012 james kobielus final 6-13-12(1)
Ibm big data    hadoop summit 2012 james kobielus final 6-13-12(1)Ibm big data    hadoop summit 2012 james kobielus final 6-13-12(1)
Ibm big data hadoop summit 2012 james kobielus final 6-13-12(1)
 
CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...
CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...
CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...
 

Similaire à Educating a New Breed of Data Scientists for Scientific Data Management

Data Science: An Emerging Field for Future Jobs
Data Science: An Emerging Field for Future JobsData Science: An Emerging Field for Future Jobs
Data Science: An Emerging Field for Future JobsJian Qin
 
Simon Hodson
Simon HodsonSimon Hodson
Simon HodsonEduserv
 
Jeff's what isdatascience
Jeff's what isdatascienceJeff's what isdatascience
Jeff's what isdatasciencelizliddy
 
The Science of Data Science
The Science of Data Science The Science of Data Science
The Science of Data Science James Hendler
 
Ch1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptxCh1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptxAbderrahmanABID2
 
To architect or engineer? Lessons from DataPool on building RDM repositories
To architect or engineer? Lessons from DataPool on building RDM repositoriesTo architect or engineer? Lessons from DataPool on building RDM repositories
To architect or engineer? Lessons from DataPool on building RDM repositoriesjiscdatapool
 
Unlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationUnlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationDenodo
 
Building Data Ecosystems for Accelerated Discovery
Building Data Ecosystems for Accelerated DiscoveryBuilding Data Ecosystems for Accelerated Discovery
Building Data Ecosystems for Accelerated Discoveryadamkraut
 
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...Denodo
 
Linked Open data: CNR
Linked Open data: CNRLinked Open data: CNR
Linked Open data: CNRDatiGovIT
 
Analytic Platforms in the Real World with 451Research and Calpont_July 2012
Analytic Platforms in the Real World with 451Research and Calpont_July 2012Analytic Platforms in the Real World with 451Research and Calpont_July 2012
Analytic Platforms in the Real World with 451Research and Calpont_July 2012Calpont Corporation
 
Metadata in general and Dublin Core in specific; some experiences
Metadata in general and Dublin Core in specific; some experiencesMetadata in general and Dublin Core in specific; some experiences
Metadata in general and Dublin Core in specific; some experiencesKerstin Forsberg
 
How Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackHow Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackDenodo
 
Breed data scientists_ A Presentation.pptx
Breed data scientists_ A Presentation.pptxBreed data scientists_ A Presentation.pptx
Breed data scientists_ A Presentation.pptxGautamPopli1
 
Introduction to Data Science.pdf
Introduction to Data Science.pdfIntroduction to Data Science.pdf
Introduction to Data Science.pdfUniversity of Sindh
 
9th International Conference on Database and Data Mining (DBDM 2021)
9th International Conference on Database and Data Mining (DBDM 2021)9th International Conference on Database and Data Mining (DBDM 2021)
9th International Conference on Database and Data Mining (DBDM 2021)albert ca
 
Data centric business and knowledge graph trends
Data centric business and knowledge graph trendsData centric business and knowledge graph trends
Data centric business and knowledge graph trendsAlan Morrison
 
2010/10 - Database Architechs - Data Services Summary
2010/10 - Database Architechs - Data Services Summary2010/10 - Database Architechs - Data Services Summary
2010/10 - Database Architechs - Data Services SummaryDatabase Architechs
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 

Similaire à Educating a New Breed of Data Scientists for Scientific Data Management (20)

Data Science: An Emerging Field for Future Jobs
Data Science: An Emerging Field for Future JobsData Science: An Emerging Field for Future Jobs
Data Science: An Emerging Field for Future Jobs
 
Simon Hodson
Simon HodsonSimon Hodson
Simon Hodson
 
Paper presentation
Paper presentationPaper presentation
Paper presentation
 
Jeff's what isdatascience
Jeff's what isdatascienceJeff's what isdatascience
Jeff's what isdatascience
 
The Science of Data Science
The Science of Data Science The Science of Data Science
The Science of Data Science
 
Ch1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptxCh1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptx
 
To architect or engineer? Lessons from DataPool on building RDM repositories
To architect or engineer? Lessons from DataPool on building RDM repositoriesTo architect or engineer? Lessons from DataPool on building RDM repositories
To architect or engineer? Lessons from DataPool on building RDM repositories
 
Unlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationUnlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data Virtualization
 
Building Data Ecosystems for Accelerated Discovery
Building Data Ecosystems for Accelerated DiscoveryBuilding Data Ecosystems for Accelerated Discovery
Building Data Ecosystems for Accelerated Discovery
 
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
 
Linked Open data: CNR
Linked Open data: CNRLinked Open data: CNR
Linked Open data: CNR
 
Analytic Platforms in the Real World with 451Research and Calpont_July 2012
Analytic Platforms in the Real World with 451Research and Calpont_July 2012Analytic Platforms in the Real World with 451Research and Calpont_July 2012
Analytic Platforms in the Real World with 451Research and Calpont_July 2012
 
Metadata in general and Dublin Core in specific; some experiences
Metadata in general and Dublin Core in specific; some experiencesMetadata in general and Dublin Core in specific; some experiences
Metadata in general and Dublin Core in specific; some experiences
 
How Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackHow Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science Stack
 
Breed data scientists_ A Presentation.pptx
Breed data scientists_ A Presentation.pptxBreed data scientists_ A Presentation.pptx
Breed data scientists_ A Presentation.pptx
 
Introduction to Data Science.pdf
Introduction to Data Science.pdfIntroduction to Data Science.pdf
Introduction to Data Science.pdf
 
9th International Conference on Database and Data Mining (DBDM 2021)
9th International Conference on Database and Data Mining (DBDM 2021)9th International Conference on Database and Data Mining (DBDM 2021)
9th International Conference on Database and Data Mining (DBDM 2021)
 
Data centric business and knowledge graph trends
Data centric business and knowledge graph trendsData centric business and knowledge graph trends
Data centric business and knowledge graph trends
 
2010/10 - Database Architechs - Data Services Summary
2010/10 - Database Architechs - Data Services Summary2010/10 - Database Architechs - Data Services Summary
2010/10 - Database Architechs - Data Services Summary
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 

Dernier

microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxnegromaestrong
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterMateoGardella
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docxPoojaSen20
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfSanaAli374401
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 

Dernier (20)

microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 

Educating a New Breed of Data Scientists for Scientific Data Management

  • 1. Educating a New Breed of Data Scientists for Scientific Data Management Jian Qin School of Information Studies Syracuse University Microsoft eScience Workshop, Chicago, October 9, 2012
  • 2. DS Talk points ›  Data science (DS) and data scientists in the context of scientific data ›  An iSchool version of the DS curriculum ›  Findings and lessons from implementing the DS curriculum ›  A new breed of data scientists: the iSchool approach 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 2
  • 3. What is data science? DS “An emerging area of work concerned with the collection, presentation, analysis, visualization, management, and preservation of large collections of information.” Stanton, J. (2012). Introduction to Data Science. http://ischool.syr.edu/media/documents/2012/3/ DataScienceBook1_1.pdf 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 3
  • 4. DS Data science and scientific research Plan, design, consult Ingest, store, for, implement, and organize, merge, evaluate data filter, and transform management projects data and create and services analysis-ready data 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 4
  • 5. What data scientists are expected to do: the job market DS Laboratory Data Data Modeling/ Management Specialist Management Specialist Scientific Data Management •  Administer operational database •  Working closely with the high Specialist •  Assure the quality of data performance computing and •  Design, develop, implement, and database content the IT manager manage high-throughput automatic •  Interact closely with researchers, •  Develop a data model for data processing infrastructure for lab managers, and platform complex multi-scale rocks large databases in a mature system coordinators •  Design and organize a •  Develop and improve the •  Track deliverables against budget database and complex infrastructure supporting this system and prepare data reports queries •  Interface with multiple data providers •  Collaborate closely with IT and •  Integrate and mange multi- to design, build, and maintain their bioinformatics colleagues scale rocks subjected to customized databases •  Assist IT in gathering workflow large-scale scientific •  Clarify requirements, feature requirements computing applications requests and bug reports for software •  Test changes and updates in IT systems http://www.ingrainrocks.com/ developers and assist in testing data-management-specialist/ code. •  Create and maintain app documentation http://www.bioinformatics.org/ forums/forum.php?forum_id=9670 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 5
  • 6. DS “We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.” Loukides, M. (2011). What is data science? Sebastopol, CA: O’Reilly. 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 6
  • 7. What data scientists are expected to do: DS the difference from tradition ›  Data scientists are more likely to be involved across the data lifecycle: –  Acquiring new data sets: 33% –  Parsing data sets: 29% –  Filtering and organizing data: 40% –  Mining data for patterns: 30% –  Advanced algorithms to solve analytical problems: 29% –  Representing data visually: 38% –  Telling a story with data: 34% –  Interacting with data dynamically: 37% –  Making business decisions based on data: 40% http://mashable.com/2012/01/13/career-of- 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 7 the-future-data-scientist-infographic/
  • 8. How should educational programs address the challenge? A case of the CAS in Data Science program at Syracuse iSchool DS 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 8
  • 9. Story 1: cognitive-demanding workflows and DS data management ›  Domain: Thermochronology and tectonics ›  What’s involved: rock samples from drilling and field observation, sliced and grained rock samples ›  Data types: Excel data files (lots of them), spectrum and microscopic images, annotations ›  Analysis: modeling and sensemaking by combining data from multiple data files with specialized software ›  Bottleneck problem: manually matching/merging/filtering data is extremely cumbersome and the problem is compounded by the difficulty finding the right data files
  • 10. DS Story 2: highly automated workflows ›  Domain: Astrophysics: gravitational wave detection ›  What’s involved: data ingestion from laser interferometers, raw data calibration and segmentation, workflow management, provenance ›  Data types: streaming data from the laser interferometers, images ›  Analysis: detection of “events” ›  Bottleneck problem: tracking of data and processes and the relationships between them 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 10
  • 11. Ability to use a Knowledge Data DS wide variety of a subject modeling, tools for domain documentation, database and analysis, and query design report of data Data OS, Collaboration, communication, scientists Programming languages and co- ordination Content and Encoding What are repository languages systems expected of data scientists? 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 11
  • 12. DS Analytical skills: domain modeling Requirement analysis Interview skills, analysis and generalization skills Workflow analysis Ability to capture components and sequences in workflows Data modeling Ability to translate domain analysis Data transformation into data models needs analysis Ability to envision the data model Data provenance within the larger system architecture needs analysis
  • 13. Analytical skills: from data sources to patterns, DS relationships, and trends Analytical tools “Hacking” Knowledge Data products 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 13
  • 14. Data management skills: data lifecycle and DS infrastructural services Metadata Encoding Semantic Identify Infrastructural standards language control management services Processed, transformed, derived, calculated, … data •  Data source discovery •  Data curation Common data format Image formats •  Data preservation Matrix formats •  Data integration and Microarray file formats mashup Communication protocols •  Data citation, publication, and distribution •  Data linking and interoperability •  … 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 14
  • 15. Technology skills with excellent communication DS skills TECHNOLOGY SKILLS COMMUNICATION SKILLS ›  Operation systems ›  Interviews ›  Repository systems ›  “Ice breaking” ›  Database systems ›  Community building ›  Programming languages ›  Institutionalization ›  Encoding languages ›  Stakeholder buy-in ›  Specialized programming 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 15
  • 16. No superman model for beginning data DS scientists Data analytics Data storage and management Data scientists Core: Applied data science Databases General system management Data visualization 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 16
  • 17. DS The CAS in Data Science program at SU ›  Required: –  Data Administration Concepts and Database Management –  Applied Data Science ›  Elective: Data Analytics Data Storage and Data Visualization •  Data Mining Management •  Information Architecture for •  Basics of Information •  Technologies for Web Internet Services Retrieval Systems Content Management •  Information Visualization •  Natural Language •  Foundations of Digital Data Processing •  Creating, Managing, and General Systems Management •  Advanced Information Preserving Digital Assets •  Enterprise Technologies Analytics •  Data Warehousing •  Managing Information Systems •  Research Methods •  Advanced Database Projects •  Statistical Methods Management •  Information Systems Analysis 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 17
  • 18. What we learned from the program DS development ›  Data science is a moving target with multiple focal points –  Versions from statistics, computer science, and library and information science ›  Skills vs. theories –  Students are anxious to learn skills but not so interested in theories –  Theories help build visions ›  Sufficient hands-on time for technologies and tools ›  Authentic learning through real-world data management projects 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 18
  • 19. DS Reconciliation of the two views of data science “An emerging area of work “We’re increasingly finding concerned with the data in the wild, and data collection, presentation, scientists are involved with analysis, visualization, gathering data, massaging it management, and into a tractable form, making it preservation of large tell its story, and presenting that collections of information.” story to others.” Stanton, J. (2012). Introduction to Data Science. Loukides, M. (2011). What is data http://ischool.syr.edu/media/documents/2012/3/ science? Sebastopol, CA: O’Reilly. DataScienceBook1_1.pdf 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 19
  • 20. The iSchool’s version of data science DS education Ability to use a Knowledge wide variety of a subject Data tools for domain modeling, documentation, database and analysis, and query design Eventually the report of data iSchool data science program will build Data OS, Collaboration, the foundation for communication, scientists Programming languages and co- super data ordination scientists… Content and Encoding repository languages systems 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 20
  • 21. DS eScience Librarianship Curriculum Project: http://eslib.ischool.syr.edu/ Science Data Literacy Project: http://sdl.syr.edu/ CAS in Data Science: http://ischool.syr.edu/future/cas/ datascience.aspx 10/9/12 MICROSOFT ESCIENCE WORKSHOP 2012 21