Presentation given at SemTech 2010 describing the Calit2 Research Intelligence system for faculty expertise profiles and our experience with semantic technologies in this space.
1. The Research Intelligence Project
California Institute for Telecommunications and Information Technology (Calit2)
Jerry Sheehan, Chief of Staff
June 25th, 2010
SemTech 2010
2. The Research Intelligence Project
Outline
Our Problem
The Research Intel Tools
Semantic Data Evolution
Future Directions
Concluding Thoughts
3. My Bias
Prefer
Found
Elsewhere
Image Courtesy of Matt Jones, Creative Commons License, Flickr (blackbeltjones)
10. How Research Universities Look At Their Business Data
Image Courtesy of HA! Designers, Creative Commons License, Flickr (artbyheather)
11. How We Could Look At Our Data
Logo Design by Kyle Bowen, http://www.educause.edu/Community/MemDir/Profiles/KyleBowen/58744
12. Research Intelligence Platform Development
Years: 2005, 2006, 2007, 2008, 2009, 2010
Stages: Idea; Proof of Concept; Alpha/Beta for Calit2; Beta for Others; Production for Campus; New Domains
# of Users: 250; 300; 480; 900 Faculty; 71 Companies; 460
14. 2005: Topic Modeling of Researchers
Initial Site Developed by David Newman with Direction from Padhraic Smyth, University of California, Irvine
15. 2005: The Topic Modeling Proof of Concept
http://datalab-1.ics.uci.edu/calit2/
18. Manual Tagging Experiment
• A three-person team examined one university-affiliated web page for each affiliated faculty member and associated a minimum of three keywords with each person.
• No controlled vocabulary was used; instead, a narrative question focused the manual tagging: "What type of research does this person primarily do?"
• Created a SQL database of all UCSD-affiliated academic researchers.
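A minimal sketch of how such a tagging database might be laid out, using Python's built-in sqlite3 module. The slides do not show the actual schema, so the table names, column names, and the sample researcher are all hypothetical; only the "minimum of three keywords" rule comes from the slide.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the schema used in the original project.
SCHEMA = """
CREATE TABLE researcher (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    page_url TEXT NOT NULL
);
CREATE TABLE keyword (
    researcher_id INTEGER NOT NULL REFERENCES researcher(id),
    term TEXT NOT NULL,
    UNIQUE (researcher_id, term)
);
"""

MIN_KEYWORDS = 3  # the slide's "minimum of three keywords" rule


def add_researcher(conn, name, page_url, keywords):
    """Store one researcher and the free-text keywords a tagger assigned."""
    # No controlled vocabulary: only trim, lowercase, and de-duplicate.
    terms = sorted({k.strip().lower() for k in keywords if k.strip()})
    if len(terms) < MIN_KEYWORDS:
        raise ValueError(f"need at least {MIN_KEYWORDS} distinct keywords")
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO researcher (name, page_url) VALUES (?, ?)",
            (name, page_url),
        )
        conn.executemany(
            "INSERT INTO keyword (researcher_id, term) VALUES (?, ?)",
            [(cur.lastrowid, t) for t in terms],
        )
    return cur.lastrowid


def researchers_for(conn, term):
    """Return the names of researchers tagged with the given keyword."""
    rows = conn.execute(
        "SELECT r.name FROM researcher r "
        "JOIN keyword k ON k.researcher_id = r.id "
        "WHERE k.term = ? ORDER BY r.name",
        (term.strip().lower(),),
    )
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
add_researcher(
    conn,
    "Example Researcher",        # hypothetical person
    "http://example.edu/~er",    # hypothetical page
    ["Photonics", "optical networking", "sensors"],
)
print(researchers_for(conn, "photonics"))
```

Keeping keywords in their own table makes "who works on X?" a single join, which is the kind of lookup an expertise profile needs.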
19. Unfiltered Tags: Automated Extraction
1. ucsd (157)
2. email (117)
3. university of california san diego (112)
4. sdsc (55)
5. contact (50)
6. california san diego (47)
7. professor (44)
8. university of california (44)
9. computer science (36)
10. mail (36)
11. edu (34)
12. wireless (31)
13. telecommunications (31)
14. california institute (28)
15. photonics (27)
16. physics (26)
17. signal processing (23)
18. visualization (22)
19. computer engineering (22)
20. bioinformatics (21)
21. capsule bio (21)
22. nanotechnology (19)
23. uc san diego (19)
24. sensors (18)
25. scripps institution of oceanography (18)
26. information technology (17)
27. ucsd faculty (17)
28. structural engineering (16)
29. associate professor (16)
30. electrical engineering (16)
31. department of computer science (16)
32. cse (16)
33. responsphere (16)
34. computational biology (15)
35. adjunct professor (15)
36. algorithms (15)
37. nsf (14)
38. networking (14)
39. digital signal processing (14)
40. geophysics (14)
41. (14)
42. california institutes (14)
43. information technology staff (14)
44. cwc (13)
45. san diego supercomputer center (13)
46. biology (13)
47. cognitive science (13)
48. information theory (13)
49. optical networking (13)
50. mit (13)
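The unfiltered list above is dominated by institutional and contact boilerplate ("ucsd", "email", "professor") rather than research topics. A minimal sketch of one way to filter such counts follows. The counts are copied from the slide, but the stop list and the function name `filter_tags` are assumptions for illustration; the slides do not say how filtering was actually done.

```python
from collections import Counter

# A few (tag, count) pairs copied from the slide's unfiltered list.
UNFILTERED = Counter({
    "ucsd": 157,
    "email": 117,
    "university of california san diego": 112,
    "contact": 50,
    "professor": 44,
    "computer science": 36,
    "wireless": 31,
    "photonics": 27,
    "signal processing": 23,
    "bioinformatics": 21,
    "": 14,  # the slide's empty tag (item 41)
})

# Hypothetical stop list of institutional and contact terms (an assumption).
STOP_TERMS = {
    "ucsd",
    "email",
    "contact",
    "professor",
    "university of california san diego",
}


def filter_tags(counts, stop_terms):
    """Drop empty tags and stop-listed tags, keeping counts for the rest."""
    stop = {s.lower() for s in stop_terms}
    return Counter({
        tag: n
        for tag, n in counts.items()
        if tag.strip() and tag.lower() not in stop
    })


filtered = filter_tags(UNFILTERED, STOP_TERMS)
for tag, n in filtered.most_common(3):
    print(f"{tag} ({n})")
```

A hand-maintained stop list is the simplest option; a frequency-based cutoff across all pages would be an alternative, at the cost of possibly dropping legitimately common research terms.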
33. Topic III
Semantic Data Evolution
34. Research Intelligence View of Semantic Data Evolution
[Chart: complexity (vertical axis) versus time (horizontal axis)]
2005: Closed NLP Text Mining
2008: Few Open APIs for NLP
2009: Initial Open APIs, Semantic Services
2010: Initial Open Linked Data Repositories
35. Research Intelligence: The Data, Grant Abstract
NSF Solicitation: Software Infrastructure for Sustained Innovation
Computation is accepted as the third pillar supporting innovation and discovery in science and engineering and is central to NSF's
future vision of Cyberinfrastructure Framework for 21st Century Science and Engineering (CF21)[1]. Software is an integral part
of the computation paradigm and a primary modality for realizing the CF21 vision. Scientific discovery and innovation are advancing along
fundamentally new pathways opened by development of increasingly sophisticated software. Software is also directly responsible for
increased scientific productivity and significant enhancement of researchers' capabilities. In order to nurture, accelerate and sustain this
critical mode of scientific progress, NSF is establishing a new program, Software Infrastructure for Sustained Innovation (SI2), with the
overarching goal of transforming innovations in research and education into sustained software resources that are an integral part of
the cyberinfrastructure. SI2 is a long-term investment focused on catalyzing new thinking, paradigms, and practices in using software to
understand natural, human, and engineered systems. SI2's intent is to foster a pervasive cyberinfrastructure to help researchers address
problems of unprecedented scale, complexity, resolution, and accuracy by integrating computation, data, networking and experiments in
novel ways. It is NSF's expectation that SI2 investment will result in robust, reliable, usable and sustainable software
infrastructure that is critical to the CF21 vision and will transform science and engineering. It is expected that SI2 will generate and
nurture the multidisciplinary processes required to support the entire software lifecycle and will result in the development of
sustainable software communities. SI2 envisions vibrant partnerships among academia, government laboratories and industry for the
development and stewardship of a sustainable software infrastructure that can enhance productivity and accelerate innovation in
science and engineering. The goal of the SI2 program is to create a software ecosystem that includes all levels of the software stack
and scales from individual or small groups of software innovators to large hubs of software excellence. The program addresses all
aspects of CI, from embedded sensor systems and instruments, to desktops and high-end data and computing systems, to major
instruments and facilities.

The SI2 program envisions three classes of awards:

1. Scientific Software Elements (SSE): SSE awards target small groups that will create and deploy robust software elements for which there is a demonstrated need, encapsulating innovation in science and engineering. The effort targeted by a SSE award is up to a level roughly comparable to: summer support for two investigators with complementary expertise; two graduate students; and their collective research needs (e.g., materials, supplies, travel) for three years.

2. Scientific Software Integration (SSI): SSI awards target larger groups of PIs organized around common research problems as well as common software infrastructure, and will result in a sustainable community software framework. The effort targeted by a SSI award is up to a level roughly comparable to: summer support for three to four investigators with complementary expertise; three to four graduate students; one or two senior personnel (including post-doctoral researchers, software developers, and staff); and their collective research needs (e.g., materials, supplies, travel) for three to five years. The integrative contributions of the SSI team should clearly be greater than the sum of the contributions of each individual member of the team.

3. Scientific Software Innovation Institutes (S2I2): S2I2 awards will focus on the establishment of long-term community-wide hubs of software excellence. These hubs will provide expertise, processes, resources and implementation mechanisms to transform computational science and engineering innovations and community software into robust and sustained tools for enabling science and engineering. S2I2 proposals will bring together multidisciplinary teams of domain scientists and engineers, computer scientists and software engineers, technologists and educators.

The FY 2010 SI2 competition will be limited to SSE and SSI awards. The solicitation in FY 2011, and in subsequent years, will outline funding opportunities for all three classes of awards (SSE, SSI and S2I2), subject to availability of funds.

[1] http://www.nsf.gov/pubs/2010/nsf10015/nsf10015.jsp
36. Keyword Extraction Across Sources
Term Human Yahoo KEA Calais Alchemy OAmplify
Common Software Infrastructure
Community Software
Cyberinfrastructure
Embedded Sensor
Engineering
Hubs of Scientific Innovation
Innovation
NSF
Scientific
Scientific Discovery
Scientific Software
Scientific Software Integration
Scientific Software Innovation Institutes
SI2
Software
Software Developers
Software Ecosystem
Software Elements
Software Engineers
Software Infrastructure
Software Innovators
Software Lifecycle
Software Stack
SSI
Sustainable Software
Sustained Tool
Vision
Total: Human 12, Yahoo 3, KEA 9, Calais 15, Alchemy 20, OAmplify 10
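The comparison above sets each extractor's keywords against a human-chosen set. A minimal sketch of scoring one extractor against the human reference with set overlap follows. The per-term marks did not survive in this transcript, so the keyword assignments below are hypothetical, as are the extractor names and the function names `normalize` and `score`; only the terms themselves come from the slide.

```python
def normalize(terms):
    """Lowercase and collapse whitespace so near-identical terms compare equal."""
    return {" ".join(t.lower().split()) for t in terms}


def score(reference, candidate):
    """Compare an extractor's keywords against a human reference set."""
    ref, cand = normalize(reference), normalize(candidate)
    overlap = ref & cand
    precision = len(overlap) / len(cand) if cand else 0.0
    recall = len(overlap) / len(ref) if ref else 0.0
    return {"overlap": sorted(overlap), "precision": precision, "recall": recall}


# Hypothetical assignments: which source returned which term is NOT
# recoverable from the transcript, so these sets are illustrative only.
human = [
    "Cyberinfrastructure",
    "Scientific Software",
    "Software Infrastructure",
    "Sustainable Software",
]
extractors = {
    "extractor_a": ["cyberinfrastructure", "Software  Infrastructure", "NSF"],
    "extractor_b": ["Software Stack", "Vision"],
}

for name, terms in extractors.items():
    result = score(human, terms)
    print(name, result["overlap"],
          round(result["precision"], 2), round(result["recall"], 2))
```

Exact matching after normalization is deliberately strict: it treats "Software Elements" and "Scientific Software Elements" as different terms, which is worth keeping in mind when reading totals that differ by source.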
37. Semantic Structure Returned by Open Calais
Industry Terms
• Community Software
• Software Lifecycle
• Sustainable Software Communities
• Usable and Sustainable Software Infrastructure
• Software Infrastructure
• Software Stack
• Software Developers
• Sustainable Community Software Framework
• Sustained Software Resources
• Software Ecosystem
• Software Excellence
• Embedded Sensor Systems
• Software Elements
• Sustainable Software Infrastructure
Social Tags
• Cyberinfrastructure
• E-Science
• Computing
• Computer Software
• Innovation
• Software Engineer
• Technology
• Science
• Technology_Internet
Organization
• National Science Foundation
URL
• http://www.nsf.gov/pubs/2010/nsf10015/nsf10015.jsp
http://www.opencalais.com/
38. Semantic Structure Returned by Alchemy API
Tags
• Scientific productivity
• pillar supporting innovation
• overarching goal
• primary modality
• graduate students
• program envisions
• scientific discovery
• researchers address problems
• robust software elements
• 21st century science
• collective research
• scientific progress
• scientific software elements
• common research problems
• common software infrastructure
• scientific software innovation
• scientific software integration
• community software
• complementary expertise
• si2's intent
• computation paradigm
• small groups
• cyberinfrastructure framework
• software elements
• entire software lifecycle
• software excellence
• envisions vibrant partnerships
• software infrastructure
• innovation computation
• software innovators
• innovations
• software resources
• long-term community-wide hubs
• sophisticated software
• nsf's expectation
• sse award
• nsf's future vision
• ssi awards
• pervasive cyberinfrastructure
• ssi team
• summer support
• sustainable community software
• sustainable software communities
• sustainable software infrastructure
Company
• Scientific Software
Field Terminology
• Software
• Software Stack
• Software Developers
• Software Ecosystems
• Software Engineers
Organization
• NSF
• SSI
Category
• Science and Technology
http://www.openamplify.com/
41. Open Calais Faculty Linked Data Results
Tag | Type | Linked Data | Relevancy
National Science Foundation | Organization | http://d.opencalais.com/genericHasher-1/f7d1451f-915f-31bc-8194-b9794401ea2d.html | 52%
Software Excellence | Industry Term | http://d.opencalais.com/genericHasher-1/3da6f84d-cff9-3eec-8fce-99ea792e370c.html | 34%
Sustained Software Resources | Industry Term | http://d.opencalais.com/genericHasher-1/61a1eb6d-196d-3493-ad6c-8ea0b85ce421.html | 32%
Usable and Sustainable Software Infrastructure | Industry Term | http://d.opencalais.com/genericHasher-1/9e6fe116-e562-3753-9b93-8f938095a715.html | 31%
Software Lifecycle | Industry Term | http://d.opencalais.com/genericHasher-1/9c7876e1-a85f-307c-8b38-163c129f19f7.html | 30%
Sustainable Software Communities | Industry Term | http://d.opencalais.com/genericHasher-1/5228ac30-2bf5-397e-bc1a-04275a3f5045.html | 29%
Sustainable Software Infrastructure | Industry Term | http://d.opencalais.com/genericHasher-1/4be05ead-30cd-3c3a-bd88-5dbb8427acc9.html | 27%
Software Stack | Industry Term | http://d.opencalais.com/genericHasher-1/c22ad2e5-bd08-3083-9dc5-14945fb77010.html | 24%
Software Innovators | Industry Term | http://d.opencalais.com/genericHasher-1/eba4d676-5aa8-3b1e-83dc-c4bd91b4d0f4.html | 21%
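Each returned tag above carries a relevancy score, which suggests keeping only tags above some cutoff. A minimal sketch of such a filter follows, using the tag, type, and relevancy values from the table with the linked-data URIs omitted for brevity. The 30% threshold and the function names are assumptions for illustration; the slides do not state whether or where a cutoff was applied.

```python
# (tag, type, relevancy) rows copied from the slide; URIs omitted for brevity.
ROWS = [
    ("National Science Foundation", "Organization", "52%"),
    ("Software Excellence", "Industry Term", "34%"),
    ("Sustained Software Resources", "Industry Term", "32%"),
    ("Usable and Sustainable Software Infrastructure", "Industry Term", "31%"),
    ("Software Lifecycle", "Industry Term", "30%"),
    ("Sustainable Software Communities", "Industry Term", "29%"),
    ("Sustainable Software Infrastructure", "Industry Term", "27%"),
    ("Software Stack", "Industry Term", "24%"),
    ("Software Innovators", "Industry Term", "21%"),
]


def parse_percent(text):
    """Turn a string such as '52%' into the integer 52."""
    return int(text.strip().rstrip("%"))


def keep_relevant(rows, threshold_percent):
    """Keep rows at or above the threshold, highest relevancy first."""
    parsed = [(tag, kind, parse_percent(rel)) for tag, kind, rel in rows]
    kept = [row for row in parsed if row[2] >= threshold_percent]
    return sorted(kept, key=lambda row: row[2], reverse=True)


# Hypothetical cutoff of 30%, chosen only to illustrate the filter.
selected = keep_relevant(ROWS, threshold_percent=30)
for tag, kind, rel in selected:
    print(f"{rel:>3}%  {kind:<14} {tag}")
```

Working in whole percentages avoids floating-point comparisons at the boundary, so a tag scored exactly at the cutoff is reliably kept.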
42. Open Calais Linked Data Examples
National Science Foundation
Software Excellence
48. Open Calais Faculty Linked Data Example
Tag | Type | Linked Data | Relevancy
Lo Research Group | Company | http://d.opencalais.com/comphash-1/2cf74602-005c-3d32-a184-4bc49ef2d5f2.html | 50%
California Institute | Facility | http://d.opencalais.com/genericHasher-1/37ab20cd-0681-3775-bf97-7583b4ec1434.html | 46%
X@ece.ucsd.edu | EmailAddress | http://d.opencalais.com/genericHasher-1/babf08c8-1f57-3b99-b020-7e0dd8eaf1fc.html | 31%
California Institute for Telecommunications | Organization | http://d.opencalais.com/genericHasher-1/6a1fba6f-cf57-300b-94fc-f36d027c8ff0.html | 31%
858-xxx-xxxx | PhoneNumber | http://d.opencalais.com/genericHasher-1/e8e3ad15-ace3-3616-be5a-ae9038bc0678.html | 31%
858-xxx-xxxx | PhoneNumber | http://d.opencalais.com/genericHasher-1/5228ac30-2bf5-397e-bc1a-04275a3f5045.html | 31%
Information Technology | Technology | http://d.opencalais.com/genericHasher-1/a0f02cf0-dc13-3b0f-a139-5509b026bd96.html | 31%
optoelectronic devices | Industry Term | http://d.opencalais.com/genericHasher-1/7f81f0c9-b94f-3959-b35b-67be2f703ab4.html | 29%
International Business Machines | Company | http://d.opencalais.com/er/company/ralg-tr1r/9e3f6c34-aa6b-3a3b-b221-a07aa7933633.html | 6%
51. Zemanta Linked Data Results
Tag | Linked Data | Confidence
Integrated Circuits | wikipedia: Integrated circuit | 0.65
UC Berkeley | geolocation: University of California, Berkeley; homepage: University of California, Berkeley; wikipedia: University of California, Berkeley | 0.64
Information Technology | wikipedia: Information technology | 0.63
Calit2 | geolocation: California Institute for Telecommunications and Information Technology; wikipedia: California Institute for Telecommunications and Information Technology | 0.60
Almaden Research Center | geolocation: IBM Almaden Research Center; wikipedia: IBM Almaden Research Center | 0.59
Age related Macular Degeneration | wikipedia: Macular degeneration | 0.59
Minimally Invasive Surgery | wikipedia: Invasiveness of surgical procedures | 0.58
Cancer | http://en.wikipedia.org/wiki/Cancer | 0.57
Cornell | geolocation: Cornell University; homepage: Cornell University; wikipedia: Cornell University; youtube: Cornell University | 0.57
Fluorescence Activated Cell Sorter | wikipedia: Flow cytometry | 0.57
54. Linked Data and a Wikipedia Base
Wikipedia: How Accurate?
Source: Jeremy Hsu, “Wikipedia: How Accurate is it?”
November 2009, Live Science,
http://www.livescience.com/technology/091106-ttr-wikipedia.html#comments
55. Is It A Problem?
John S., Is a Possible Assassin of, John K
SOURCE: USA Today, November 29, 2005
56. Maybe Not?
How Important is Validity to Researchers?
SOURCE: PHARMANEWS.EU, January 23, 2009