This talk presents areas of investigation underway at the Rensselaer Institute for Data Exploration and Applications. First presented at Flipkart, Bangalore, India, 3/2015.
1. The Science of Data Science
(Data plus Semantics yields Knowledge)
Prof. James Hendler
Tetherless World Constellation Chair of Computer, Web and
Cognitive Sciences
Director, The Rensselaer IDEA
2. The Rensselaer Institute for Data
Exploration and Applications
Performance Plan to Budget
Presentation
February 2015
The Rensselaer Institute for Data Exploration and Applications (IDEA) is a
breakthrough initiative that brings together key research areas and advanced
technologies to revolutionize the way we use data in science, engineering,
and virtually every other research and educational discipline. By bridging the
gaps between analytics, modeling, and simulation, we continue the
Rensselaer tradition as a leader in applying critical technologies to improving
everyday life and meeting the challenges of the future.
3. The Rensselaer Institute for Data Exploration and Applications
• Business Systems
• Built and Natural Environments
• Cyber-Resiliency
• Policy, Ethics and Stewardship
• Materials Informatics
• Data-driven Physical/Life Sciences
• Healthcare Analytics and Mobile Health
• Social Network Analytics
• Agents and Augmented Reality
4. IDEA project examples
• Healthcare in Context:
Data mining/analytics to
improve public health from
a systems perspective at
the individual to national
scales.
• Data-Centric Engineering
Design: Data-driven
Design & Control under
uncertainty via data fusion
across multiple scales and
sources
• Supply Chain Resilience
through Information
Visibility: Demonstrate
uses of supply chain
information visibility for
anticipating, mitigating and
recovering from disruptive
events
• Accelerated design of
functional materials/Material
Ontology: Address basic
materials processing data-based
informatics for complex,
multifunctional (often nano)
materials.
• Biome-informatics: Develop data
aggregation and computational
tools to integrate disparate
datasets into large ecosystem
models using data collected on
the microbial communities that
inhabit the base of most
ecosystems
• Deducing Structure to Function
in Biomedicine: Develop
systematic data-resourced
methods for discovering and
exploiting structure-to-function
relationships.
5. KDD Pipeline – as usually presented
Data Storage
(Big Data
Warehouse)
6. KDD Pipeline – in the real world
Data is increasingly being
brought in from external
sources, with mixed
provenance, and
increasingly outside the
analyzers’ control.
At increasing rates and scales
Figure labels: Sensors and apps; Social Media; Customer Behaviors; Web; Partners; Open data; Data Storage
Formatting, standards use, data cleansing, data bias analysis, …
7. Tough data integration challenges
Enterprise
analytics
Open Data
Integration
Hard
problems!
8. Closing the loop on (big) data
IDEA is focusing on key data science
areas
which are revolutionizing engineering, science
and business with significant social impact
Predictive Analytics, Discovery Informatics, Data Exploration
10. Example: Healthcare Data Analytics
The Digital Universe of Data to Better
Diagnose and Treat Patients
Courtesy of Eric Schadt, Mount Sinai
11. Identifying predictive features in data
Each factor must be separately
analyzed for its “Predictivity”
• Mutual information measure
The “black art” of predictive
analytics is finding the right
ones
• Use too few, the model is
weak
• Use too many, the model
becomes slow and dominated
by noise
Algorithms required to do this
because the overwhelming
number of “weak” factors defies
human abilities to combine
• Machine learning identifies
key features
• some require “roll ups”
• some require “pull outs”
• Mathematical techniques then
reduce the dimensionality
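As an illustration of the mutual information measure mentioned above, here is a minimal sketch that scores discrete features by their mutual information with an outcome and ranks them. The function names (`mutual_information`, `rank_features`) and the toy data are hypothetical; the slide does not say how the measure was actually computed, and continuous factors would first need binning.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X;Y), in bits, between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    pxy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def rank_features(columns, labels):
    """Sort feature names from most to least informative about the labels."""
    scores = {name: mutual_information(vals, labels)
              for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: "copy" mirrors the outcome, "noise" is independent of it.
outcome = [0, 0, 1, 1]
features = {"copy": [0, 0, 1, 1], "noise": [0, 1, 0, 1]}
for name, score in rank_features(features, outcome):
    print(f"{name}: {score:.3f} bits")
```

A score near zero suggests a factor carries little information on its own, although weak factors may still matter in combination, which is the point the slide makes about algorithmic selection.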
12. Predictive analytics in sensors
Extend-o-hand
(Josh Shinavier, PhD)
Classification of the sensor data (via machine-learning) allows predictive recognition
of different gestures (i.e. before the gesture is finished).
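To make the idea of recognizing a gesture before it finishes concrete, here is a minimal sketch that compares a growing prefix of a sensor stream against stored gesture templates and commits as soon as one template is clearly the closest. This is an assumption-laden toy, not the Extend-o-hand method: the template data, the `margin` threshold, and the function names are all hypothetical, and a real system would use a trained classifier on multi-channel sensor data.

```python
def prefix_distance(partial, template):
    """Mean squared distance between a partial stream and the same-length template prefix."""
    t = min(len(partial), len(template))
    return sum((partial[i] - template[i]) ** 2 for i in range(t)) / t


def early_classify(stream, templates, margin=2.0, min_samples=3):
    """Return (label, samples_used) once one template clearly wins, else (None, len(stream)).

    Assumes at least two templates. `margin` is a hypothetical confidence rule:
    the runner-up must be at least `margin` times farther away than the best match.
    """
    for t in range(min_samples, len(stream) + 1):
        partial = stream[:t]
        ranked = sorted((prefix_distance(partial, tpl), label)
                        for label, tpl in templates.items())
        best, runner_up = ranked[0], ranked[1]
        if runner_up[0] > 0 and runner_up[0] >= margin * best[0]:
            return best[1], t
    return None, len(stream)


# Hypothetical one-dimensional templates for two gestures.
templates = {
    "swipe": [0, 1, 2, 3, 4, 5],
    "tap": [0, 1, 0, -1, 0, 1],
}
label, used = early_classify([0, 1, 2, 3, 4, 5], templates)
print(label, "recognized after", used, "of 6 samples")
```

The two templates share their first two samples, so no decision is possible until they diverge; the sketch commits at the first sample where they differ rather than waiting for the full gesture.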
13. Predictive analytics in large-scale behaviors
“List clusters at risk for Asian Clams <1 mile Cook’s Bay.”
Machine learning predicts future distributions of invasive species in Lake
George based on current distributions and bathymetry similarity.
14. Predictive Social Network Analytics (with RPI NeST center)
Social Networks in Action
Analyzing cascading failures
Modeling (supply chain)
networks…
and predicting (cascading)
network risks.
Modeling network stressors (including
human cognitive element)
Understanding network dynamics
15. Data Science Research Center: tools for data analytics
Theory & Algorithms
• Randomized
• Optimization
• Approximation
• Multilinear Algebra
Applications
Statistics
• Multivariate analysis
• Optimal Experimental
Design
Dimension reduction by
randomized algorithms for
numerical linear algebra, to
identify significant components
and visualize petabyte-scale
data matrices (P. Drineas, CSCI)
Parallel Factor Analysis for tensor systems creates a scalable
solution, on AMOS, for a critical data-processing component of
data analytics for large graphs. (B. Yener, CSCI)
Computational concerns
• Scaling
• Cyber Security for Data
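The dimension reduction by randomized algorithms noted on this slide can be illustrated with a Gaussian random projection, one well-known randomized technique. This is a generic sketch and not necessarily the method used in the cited work; the function names, dimensions, and seed are hypothetical, and a petabyte-scale setting would need a numerical library and a distributed implementation.

```python
import math
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build an out_dim x in_dim matrix of scaled Gaussian entries.

    Scaling by 1/sqrt(out_dim) keeps squared lengths unchanged in expectation.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, vector):
    """Map a high-dimensional vector to the lower-dimensional space."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


# Hypothetical sizes: reduce 1000-dimensional rows to 50 dimensions.
P = make_projection(in_dim=1000, out_dim=50, seed=42)
data_rng = random.Random(7)
a = [data_rng.random() for _ in range(1000)]
b = [data_rng.random() for _ in range(1000)]

original = math.dist(a, b)
reduced = math.dist(project(P, a), project(P, b))
print(f"distance before: {original:.3f}, after projection: {reduced:.3f}")
```

The projection is linear and data-independent, so it can be applied row by row without ever holding the full matrix in memory, which is part of the appeal of randomized methods at scale.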
17. Scientific data: Microbiome informatics
Human Biome
Environmental Biome
Built Environment
Data Analytics
Semantic Data Integration
While microbes are among the smallest
organisms on the planet, they are also
the largest influence on mass and
nutrient transport in the biosphere. They
are the base of most natural ecosystems,
as well as the purveyors of air and water
quality. It is also microbes that primarily
govern disease transfer and human
health in our built environments.
18. Materials Processing Ontology (cMDIS/IDEA)
The materials field has made much progress on systematically understanding materials
structure-to-property relationships, but lacks an organized model of processing-to-
property relations.
A critical need for systematic development of new materials technologies!
Goal: Create a (machine-readable) ontology
for materials processing.
By combining our expertise in data science,
materials and manufacturing, we are creating a
key missing link in the Materials Genome
Initiative.
19. Some questions need a qualitative answer
Platform for Experimental Collaborative Ethnography
20. Discovery Informatics Requires Unstructured data
Integration of text analytics,
natural language processing,
network-based multimedia
analysis and
structured/unstructured data
integration
21. Requires Unstructured data (real-time feeds/images/video)
DOE SEAB report on HPC:
How might a neuromorphic “accelerator” type processor be
used to improve the application performance, power
consumption and overall system reliability of future
exascale systems?
Power Consumption (w/IBM)
Network Learning (sensors)
Sparse Distributed Representations
Hybrid Neural/Symbolic Systems
Neuromorphic Computing: software systems that implement models
inspired by neural systems to analyze data tied to perception, motor control,
or multisensory integration.
22. Neuromorphic Computing (CCI/IDEA)
Joint CCI/IDEA project to use a supercomputer to model state-of-the-art neuromorphic processors
Use for improving AMOS energy use (like autonomic control)
Use for exploring inputs from data-sensing systems (extrinsic control)
Neuromorphic Computing requires critical Rensselaer technologies
Integrating data analytics (on the fly) with simulation and modeling
CCI (AMOS) allows us to explore new variants on neuromorphic
approaches
IDEA provides learning models and analytics capabilities for evaluation
Together allow us to attack audio/visual streaming data
autonomic
extrinsic
23. Theme 3: Data Exploration
From “why” to “what is”
24. From visualization to exploration
… Unfortunately, visualization too often becomes an end product of scientific analysis,
rather than an exploration tool that scientists can use throughout the research life cycle.
However, new database technologies, coupled with emerging Web-based technologies,
may hold the key to lowering the cost of visualization generation and allow it to become
a more integral part of the scientific process.
26. From what is, to what if, to why (and back)
These capabilities are critical in “closing the loop” between data,
simulation and modeling in scientific discovery, engineering
design, and business innovation.
27. A “Data Science” Research Agenda
Multiscale
Sparsity
Abductive
Agent-oriented
28. Supporting the Scientific agenda
• Gathering and
representing
information from
multiple sources
• topic of CODS talk
• Systematic (and
scalable) methods for
predictive analytics
• example: Parallel
search for best kernel
functions
• New Data Exploration
platforms
• example: Patent
pending on new multi-
user collaborative
device
• Cognitive and
immersive platforms
• Data sharing standards
• Research Data Alliance
• W3C
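The "parallel search for best kernel functions" example on this slide could look roughly like the sketch below, which scores several candidate kernels concurrently and ranks them. Everything here is an assumption made for illustration: the slide does not name a scoring criterion, so kernel-target alignment is swapped in as one plausible choice, and the candidate kernels, data, and function names are hypothetical. Labels are assumed to be -1 or +1.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def linear(a, b):
    """Linear kernel: plain dot product."""
    return sum(x * y for x, y in zip(a, b))


def make_rbf(gamma):
    """Return an RBF (Gaussian) kernel with the given width parameter."""
    def rbf(a, b):
        return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))
    return rbf


def alignment(kernel, points, labels):
    """Kernel-target alignment: cosine similarity between K and y y^T."""
    num = 0.0
    k_norm = 0.0
    for a, ya in zip(points, labels):
        for b, yb in zip(points, labels):
            k = kernel(a, b)
            num += k * ya * yb
            k_norm += k * k
    n = len(points)  # Frobenius norm of y y^T equals n for labels in {-1, +1}
    return num / (math.sqrt(k_norm) * n)


def search_kernels(candidates, points, labels, workers=4):
    """Score every candidate kernel concurrently; return (name, score) best first."""
    names = list(candidates)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(
            lambda name: alignment(candidates[name], points, labels), names))
    return sorted(zip(names, scores), key=lambda item: item[1], reverse=True)


# Hypothetical one-dimensional data with two well-separated classes.
points = [(0.0,), (0.1,), (1.0,), (1.1,)]
labels = [-1, -1, 1, 1]
candidates = {"linear": linear, "rbf_1": make_rbf(1.0), "rbf_10": make_rbf(10.0)}
for name, score in search_kernels(candidates, points, labels):
    print(f"{name}: alignment {score:.3f}")
```

In CPython, threads mainly illustrate the structure of the search, since pure-Python arithmetic does not run in parallel across threads; a process pool or a cluster scheduler would be the natural substitute for real workloads.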
29. The Rensselaer IDEA
Summary
• Data is not just the “oil” of the new
generation
• information is the new power source generated from that “oil”
• Using data for prediction is becoming less of an art,
but still needs systematicity
• Scaling tools beyond MapReduce
• Better methods for rapid customization
• Turning data into causal or design knowledge is in its
early stages
• Closing the loop from data to design requires new informatics,
new mathematics, and new ways of thinking beyond data mining
Editor's notes
Ones with numbers are secondary diagnosis indicator variables. * indicate categorical variables. In practice during modeling they is one
“predictivity” index
Working with faculty from SoS, SoE, HASS and SoA
(in the UCTE power grid network, employing capacity-limited current flows in resistor networks).
Put it on a slide: show an example of someone using it
Current version of PECE is running over 4 different projects (Disaster STS Network, The Asthma Files, World PECE, World Academia)
Largest site (DSTS Network) has 35 users from universities all over the country and 14 different user groups
“Feature” lists of moveable functionality and modules that can be ported to any Drupal site.
This is the data science agenda. Basically, these are the hard problems in closing the loop: how to go from the correlation on one side to the causal on the other. I don’t love the term agent-oriented, but we mean a combination of unstructured, AI, etc. Abductive is usually where I talk about these being hard inverse problems where we don’t know a specific function, but rather are looking for an explanation.