The Rensselaer Institute for Data Exploration and Applications is addressing new modes of data exploration and integration to enhance the work of campus researchers (and beyond). This talk outlines the "data exploration" technologies being explored.
1. Data Exploration
Jim Hendler
Director, Rensselaer Institute for
Data Exploration and Applications
THE RENSSELAER IDEA
Rensselaer Polytechnic Institute, USA
http://www.cs.rpi.edu/~hendler
2. Data-driven research areas at RPI
• Data-driven Medical and Healthcare Applications
• Predictive Models for Business and Economics
• “Biome” studies for Built and Natural Environments
• Question Answering from texts and data
• Resiliency Models for Population-Scale Problems and cybersecurity domains
• Semantically-enabled Data Services for Science and Engineering Research
• Materials genome and nano-manufacturing informatics
• Platforms for testing Policy and Open Data issues
• …
IDEA
3. The Rensselaer IDEA: empowering our researchers
Application-specific data tools
Data discovery, integration, and interaction technologies
4. The trunk: Shared Data Technologies
High Performance Modeling and Simulation
• Center for Computational Innovation
Cognitive Computing
• Watson at Rensselaer IBM Partnership
Perceptualization
• Experimental Multimedia Performing Arts Center
Data Science
• Data Science Research Center
5. Roots: Data Exploration
Geekopedia: Data exploration helps a data consumer focus an information search on the pertinent
aspect of relevant data before true analysis can be achieved. In large data sets, data is not
gathered or controlled in a focused manner. Even in smaller data sets, it is also true that data
gathered are not in a very rigid and specific technique can result in a disorganized manner and a
myriad of subsets each…
Discover
Integrate
Validate
Explain
DATA
9. Discovery needs more than keywords
World Bank: Africa
Africover: Agriculture
Kenya: Agricultural
US Data.gov: Crop
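A minimal sketch of why keyword search falls short for the four catalog labels on this slide: an exact keyword match finds only the entry that happens to use the query word, while a lookup through a small concept vocabulary also finds entries labeled with related terms. The source names and labels come from the slide; the `RELATED_TERMS` vocabulary and both search functions are hypothetical, written only for illustration.

```python
# Catalog labels as they appear on the slide (source -> label).
CATALOG = {
    "World Bank": "Africa",
    "Africover": "Agriculture",
    "Kenya": "Agricultural",
    "US Data.gov": "Crop",
}

# Hypothetical hand-built vocabulary: a concept and the terms that
# should count as a match for it. A real system would draw this from
# a shared ontology or thesaurus rather than a hard-coded dict.
RELATED_TERMS = {
    "agriculture": {"agriculture", "agricultural", "crop"},
}


def keyword_search(query):
    """Return sources whose label equals the query (case-insensitive)."""
    q = query.lower()
    return sorted(src for src, label in CATALOG.items() if label.lower() == q)


def concept_search(query):
    """Return sources whose label matches any term related to the query."""
    q = query.lower()
    terms = RELATED_TERMS.get(q, {q})
    return sorted(src for src, label in CATALOG.items() if label.lower() in terms)


print(keyword_search("agriculture"))  # only the exact label matches
print(concept_search("agriculture"))  # related labels match as well
```

The World Bank entry, labeled by region, is still missed in this sketch; finding it would need further background knowledge (for example, geographic relationships) that the sketch does not model.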
10. Integration needs Semantics
Campus Personnel
RIN: 660125137
Address #: 1118
Address St: Pinehurst
Address zip: 12203
Person
RPI ID: 660125137
Name: Hendler
Campus Classes
Course topic: CSCI
Course #: 4961
CRN: 1118
Name: Intro to Physics
660125137 (RIN) and 660125137 (RPI ID): YES
1118 (Address #) and 1118 (CRN): NO!!!!
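The slide's example can be sketched in code: two fields that hold the same value should be linked only when they denote the same kind of thing. This is a minimal illustration, not a description of any RPI system. The values are taken from the slide; the `TypedValue` class, the `kind` annotations, and both match functions are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypedValue:
    kind: str   # what the value denotes (hypothetical annotation)
    value: str  # the raw value as stored in the table


# Values from the slide; the "kind" labels are assumed for illustration.
rin = TypedValue("person_id", "660125137")
rpi_id = TypedValue("person_id", "660125137")
address_number = TypedValue("street_number", "1118")
crn = TypedValue("course_registration_number", "1118")


def naive_match(a, b):
    """Link two fields whenever their raw values are equal."""
    return a.value == b.value


def semantic_match(a, b):
    """Link two fields only if they denote the same kind of thing."""
    return a.kind == b.kind and a.value == b.value


print(naive_match(address_number, crn))     # True: a spurious link
print(semantic_match(address_number, crn))  # False: different kinds
print(semantic_match(rin, rpi_id))          # True: same person identifier
```

The naive comparison links a street number to a course registration number because both are "1118"; the typed comparison rejects that link while still joining the two person identifiers.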
11. Semantic Web and Linked Data (UK)
Royal Mail
County Council
IOGDC Open Data Tutorial
Ordnance Survey
14. Hard for machines…
Head to head comparison shows that burglaries in Avon
and Somerset (UK) far exceed those in Los Angeles,
California
15. Data + everything else you know
Same or
different?
Do the terms mean the same? Are they collected in the same way? Are
they processed differently? …
17. Explanation also needs Semantics
Inference Web: McGuinness – various DoD/IC projects
18. Closing the loop: where do the semantics come from?
How do we go from the predictive analytics of Big Data to models/explanations that allow new understanding?
Data
Prediction
Design
Model
19. 1. Better tools for Analytics, Agents and HPC
Make the tools and algorithms being developed by RPI
researchers more “reusable” and multitask (including
HPC data-analytic tools)
20. 2. Next-Gen Visualization (at scale)
How can multi-modal, multi-user, large scale sensory (visualization,
sonification, haptics) interaction change the way we understand data?
21. 3. Include “agents” in the modeling
Develop technologies that enable researchers to work with “human-based” data at larger scales and in new ways
• Population-scale computing models for agent-based simulations
22. Approach
Platform: Research in using supercomputers for discrete modeling
• Carothers’ ROSS model
KR Model:
• Weaver’s restricted rules on graphs
Challenge problem:
• Classification algorithms at petaflop scale
• “Logical” (nonlinear, discontinuous) agents
23. 4. Exploit Cognitive Computing
IDEA will be the hub of Rensselaer’s cognitive-computing research
• e.g., answer questions such as “Why” and “How”, integrated with large-scale simulations
25. Cognitive Computing at Scale
DeepQA-type approach best on large clusters
(Physical) Simulation runs on supercomputers
26. Approach: link these computational models
Surmise (unproven): Cognitive Computing on a fast (large) cluster
can query computations run against data generated by simulations
(physical or agent-based) on the supercomputer
27. 5. Data services will provide synergy across disciplines
• Semantics is a key technology for common data services
People: Agency Policy Makers, System Scientists, Politicians
Decision-level semantic mediation: high-level vocabularies that facilitate policy-level decision-making
Integrated Applications (semantic interoperability): Inter-disciplinary Data Visualization Apps, Integration Frameworks & Methodologies, Eco & other system Assessment Apps
Application-level semantic mediation: mid-level vocabularies that facilitate the interoperability of system models and data products
Software, Tools & Apps (semantic interoperability): Discipline-specific model(s), Data-product Generator, Information/Science Apps
Semantic query, hypothesis and inference
Query, access and use of data
Data-level semantic mediation: lower-level vocabularies applied to each data source for a specific science domain of interest
Data Repositories (metadata, schema, data): Federal Repository, Commercial Database, Researcher Private Database, Other Data Sources, ... ... ...
Discovery, Integration, Validation, Curation, Citation, Archiving …
28. Conclusions
• The “warehouse” is only a small part of the data
ecosystem
• Database technologies are only part of the story
• Discovery, Integration, … , validation, explanation are key to
solving problems with data
• Closing the loop means “exploring” our data
• Humans are still a key player in this
• The Rensselaer IDEA will explore
• Data-driven applications and tools, but also…
• … multimodal visualization, multiscale and agent modeling,
cognitive computing, and semantic data platforms