My slides from the MASTODONS (big data) workshop of the CNRS.
http://www.cnrs.fr/mi/spip.php?article631&lang=fr
On defining and managing the data science ecosystem at Université Paris-Saclay. Challenges and tools.
5. Center for Data Science
Paris-Saclay
• My eight-year of experience interfacing
between high-energy physics and data science
• Our one-year experience of running PSCDS
• Extensive discussions with management
scientist and the MI/Mastodons/MaDICS for
the last year
5
WHERE DOES IT COME FROM?
6. Center for Data Science
Paris-Saclay6
UNIVERSITÉ PARIS-SACLAY
19 founding partners
7. Center for Data Science
Paris-Saclay
UNIVERSITÉ PARIS-SACLAY
7
+ horizontal multi-disciplinary and multi-partner
initiatives (“lidex”) to create cohesion
8. Center for Data Science
Paris-Saclay8
Biology & bioinformatics
IBISC/UEvry
LRI/UPSud
Hepatinov
CESP/UPSud-UVSQ-Inserm
IGM-I2BC/UPSud
MIA/Agro
MIAj-MIG/INRA
LMAS/Centrale
Chemistry
EA4041/UPSud
Earth sciences
LATMOS/UVSQ
GEOPS/UPSud
IPSL/UVSQ
LSCE/UVSQ
LMD/Polytechnique
Economy
LM/ENSAE
RITM/UPSud
LFA/ENSAE
Neuroscience
UNICOG/Inserm
U1000/Inserm
NeuroSpin/CEA
Particle physics
astrophysics &
cosmology
LPP/Polytechnique
DMPH/ONERA
CosmoStat/CEA
IAS/UPSud
AIM/CEA
LAL/UPSud
250researchers in 35laboratories
Machine learning
LRI/UPSud
LTCI/Telecom
CMLA/Cachan
LS/ENSAE
LIX/Polytechnique
MIA/Agro
CMA/Polytechnique
LSS/Supélec
CVN/Centrale
LMAS/Centrale
DTIM/ONERA
IBISC/UEvry
Visualization
INRIA
LIMSI
Signal processing
LTCI/Telecom
CMA/Polytechnique
CVN/Centrale
LSS/Supélec
CMLA/Cachan
LIMSI
DTIM/ONERA
Statistics
LMO/UPSud
LS/ENSAE
LSS/Supélec
CMA/Polytechnique
LMAS/Centrale
MIA/AgroParisTech
machine learning
information retrieval
signal processing
data visualization
databases
Domain science
human society
life
brain
earth
universe
Tool building
software engineering
clouds/grids
high-performance
computing
optimization
Domain scientistSoftware engineer
datascience-paris-saclay.fr
Center for Data Science
Paris-Saclay
A multi-disciplinary initiative to define, structure, and manage
the data science ecosystem at the Université Paris-Saclay
http://www.datascience-paris-saclay.fr/
9. Center for Data Science
Paris-Saclay
DATA SCIENCE
9
Design of automated methods
to analyze massive and complex data
to extract useful information
10. Center for Data Science
Paris-Saclay10
CENTER FOR DATA SCIENCE
=
DATA CENTER
We are focusing on inference:
data knowledge
Interfacing with HPC, cloud, storage, production
11. Center for Data Science
Paris-Saclay11
PARAMETERS
• 2 years:April 2014 - June 2016, 1.2M€
• +1 year, conditional on evaluation
• Light management
• executive committee of 17 members
• work groups
• management, strategy (around objectives)
• thematic (around scientific themes),
• open to everyone to propose and to participate
12. Center for Data Science
Paris-Saclay
Domain science
energy and physical sciences
health and life sciences
Earth and environment
economy and society
brain
12
THE DATA SCIENCE LANDSCAPE
Data scientist
Data trainer
Applied scientist
Domain scientistSoftware engineer
Data engineer
Data science
statistics
machine learning
information retrieval
signal processing
data visualization
databases
Tool building
software engineering
clouds/grids
high-performance
computing
optimization
13. Center for Data Science
Paris-Saclay
• Manpower
• especially at the interfaces
• industrial brain-drain
• Incentives
• data scientists are not incentivized to work on domain science
• scientists are not incentivized to work on tools
• Access
• no well-developed channels to identify the right experts for a given problem
• Tools
• few tools that can help domain scientists and data scientists to collaborate efficiently
13
CHALLENGES
14. Center for Data Science
Paris-Saclay
TOOLS
14
We are designing and learning to manage tools to
accompany data science projects with different needs
15. Center for Data Science
Paris-Saclay
TOOLS: LANDSCAPE TO ECOSYSTEM
15
Data scientist
Data trainer
Applied scientist
Domain expertSoftware engineer
Data engineer
Tool building Data domains
Data science
statistics
machine learning
information retrieval
signal processing
data visualization
databases
• interdisciplinary projects
• matchmaking tool
• design and innovation strategy workshops
• data challenges
• coding sprints
• Open Software Initiative
• code consolidator and engineering projects
software engineering
clouds/grids
high-performance
computing
optimization
energy and physical sciences
health and life sciences
Earth and environment
economy and society
brain
• data science bootcamps
• IT platform for linked data
• annotation tool
• SaaS data science platform
16. Center for Data Science
Paris-Saclay16
POSTDOCS, THESES, SABBATICALS
• Common selection criteria
• scientific quality
• expected results both in domain science and data science
• relevance and feasibility
• (real) scientific data, available at the start of the project
• interdisciplinarity (PIs both from domain and data sciences, different LABEXs)
• community building
• organizing and participating in thematic days, bootcamps, workshops
17. Center for Data Science
Paris-Saclay17
ENGINEERING AND CODE CONSOLIDATING
PROJECTS
• No research
• implementation, maintenance, integration
• 1 year engineering projects
• 3-6 month code consolidating projects
• a postdoc or PhD student drops research during the project,
implements his/her research code in a professional software, or
integrates it into a toolbox
18. Center for Data Science
Paris-Saclay
• A window to open data at Paris-Saclay
• We are not storing or handling existing large data sets
• Rather indexing, linking, and mapping, embedding in the
worldwide linked data (RDF) ecosystem
• Storing small data sets of small teams is possible
• Subsets of large sets for prototyping
• Or simply store metadata plus pointer
18
IT PLATFORM FOR LINKED DATA
http://io.datascience-paris-saclay.fr/
19. Center for Data Science
Paris-Saclay
IT PLATFORM FOR LINKED DATA
19
20. Center for Data Science
Paris-Saclay20
BOOTCAMPS
• Single-day coding sessions
• 20-30 participants
• Goals
• training PhD students, postdocs, engineers, senior researchers for
hands-on data science (problem types, tools)
• solving (prototyping) real data science problems
• networking, knowing each other
25. Center for Data Science
Paris-Saclay
• A data challenge is a recently developed unconventional
dissemination and communication tool
• a scientific or industrial data producer arrives with a well-defined problem
and a corresponding annotated data set
• defines a quantitative goal
• makes the problem and part of the data set (the training set) public on a
dedicated site
• data science experts then take the public training data and submit solutions
for a test set with hidden annotations
• submissions are evaluated numerically using the quantitative measure
• contestants are listed on a leaderboard
• after a predefined time, typically a couple of months, the final results are
revealed and the winners are awarded
25
DATA CHALLENGES
26. Center for Data Science
Paris-Saclay
• The HiggsML challenge on Kaggle
• https://www.kaggle.com/c/higgs-boson
• 1785 teams, huge publicity
• significant improvement on baseline
• 18 month preparation, yet partially missing the target
26
DATA CHALLENGES
27. Center for Data Science
Paris-Saclay
• Challenges are useful for
• generating visibility in the data science community about
novel application domains
• benchmarking in a fair way state-of-the-art techniques on
well-defined problems
• finding talented data scientists
• Limitations
• not necessary adapted to solving complex and open-ended
data science problems in realistic environments
• emphasizes competition
27
DATA CHALLENGES
29. Center for Data Science
Paris-Saclay
• Putting domain scientists, data scientists, and management
scientist in the same room
• Getting them understand each other
• Keeping them collectively creative
• The goal: identifying and defining projects
• low-hanging fruits
• breakthrough projects
• long-term vision
29
DESIGN AND INNOVATION STRATEGY WORKSHOPS
30. Center for Data Science
Paris-Saclay
innovative design
=
interaction and joint
expansion of concepts
and knowledge
30
DESIGN AND INNOVATION STRATEGY WORKSHOPS
C/K design theory
31. Center for Data Science
Paris-Saclay31
DESIGN AND INNOVATION STRATEGY WORKSHOPS
[P]$Project$
building!
Ini3alisa3on$
[K]$Knowledge$
sharing$
Workshops$
[C]$IFM?Design$
Workshops$
[RUN]!
DKCP process: linearizing C-K dynamics
32. Center for Data Science
Paris-Saclay32
TAKE HOME MESSAGE NO1
If you are interested in adapting any of these tools in
your project/site, feel free to contact us, we would be
happy to share our experience
33. Center for Data Science
Paris-Saclay
• We need data engineers and trainers to support research
• ideally: 75% research scientists, 25% research engineers
• We are lucky in France that the position exists in public research
• Tasks
• integrating research code into general-purpose professional software (e.g.,
scikit-learn,Torch)
• providing an interface between computational infrastructure (e.g., clouds)
and scientists
• training scientists to use the tools
33
WHAT CNRS CAN DO
34. Center for Data Science
Paris-Saclay34
WHAT CNRS CAN DO
Incentives
35. Center for Data Science
Paris-Saclay35
WHAT CNRS CAN DO
Most data scientists, as other scientists, are trained and incentivized to do research on
highly specialized domains. They search scientific visibility in their international community,
which is equally highly specialized, because their carrier advancement is almost entirely
based on peer-reviewed publications. Even when they would have the expertise, they have
little incentive to venture into the tool builder (data engineer) role since software
authorship has little value in their evaluation, and it can only serve them implicitly through
the visibility they gain in the community of tool users. By the same token, they have little
incentive to venture into domain sciences and to tackle economic or societal challenges. It
is possible that a domain science or an industrial project requires new techniques which
then can be published in data science venues, but this is not guaranteed at all. It usually
takes heavy investment of time and effort to be able to understand domain problems, so
excursions into domain sciences are highly risky. Even when such collaborations are
established, the data scientist has a strong prior to use his/her favorite methodology which
is not necessarily the best solution for a given problem. Finally, the data scientist has little
incentive to bring the project to full fruition, and he/she often “runs away” with an abstract
data science problem (and solution) extracted from the project. Symmetrically, domain
scientist and industrial domain experts have no incentive to advance data science and to
develop and publish new techniques, as long as their data science problems get solved.
When they venture into tool development, they have little incentive in developing general
purpose tools.
36. Center for Data Science
Paris-Saclay
Affirmative action
36
TAKE HOME MESSAGE NO2
37. • Overweight (e.g, double count) out-of-domain papers in
every evaluation
• Make software tool ownership count
Affirmative action
37
TAKE HOME MESSAGE NO2
38. Center for Data Science
Paris-Saclay38
TAKE HOME MESSAGE NO3
Co-locality is important.
It would be desirable that data science for scientific
data in France be grouped into 5-6 sites of similar size.
Resources and experience can and should be shared.