Apache Web Services in the Real World, an E-Science Perspective
1. Apache Web Services in the
Real World, an E-Science
Perspective
Srinath Perera
Architect, WSO2 Inc.
Member, Apache Software Foundation
Lanka Software Foundation
2. Outline
● Linked Environment for Atmospheric
Discovery Project (LEAD), the Use Case.
● LEAD Architecture, using SOA to build a
Large Scale E-Science Project.
● History: LEAD and Apache Web Service
Projects.
● Apache as a Sustainability Model for
Academic Projects.
3. E-Science
● Continuation of High Performance Computing,
Parallel Computing, and Grid.
● Cyber-infrastructures to support Scientific
Research.
● Built around “Computation” as the third Pillar of
Science (along with Analysis and
Experimentation).
● Characterized by a wide range of computing (CPU
minutes to CPU years) and data (a few KB to PBs
of data) requirements.
● Based on real-life use cases.
4. Reality is Harder than Fiction
● E-Science joins Theory with Real life data
● Real Life Applications often go beyond our
experiences.
● Most weather models are run at much lower than
ideal resolution; otherwise a 24-hour forecast
would take more than 24 hours!
● Physics use cases (e.g. Large Hadron Collider),
telescopes, and genome analysis generate terabytes
of data in days if not hours, and moving 1 TB
takes hours even on the 10 Gbps networks of TeraGrid.
● Scale, geographical distribution of resources, and
heterogeneity make these use cases complex.
5. Linked Environments for
Atmospheric Discovery (LEAD)
● U.S. NSF funded, 10+ universities, $11M, 5
years.
● Used for U.S. National Weather forecasts by
NOAA.
● Presented to the U.S. Congress as an example to
justify scientific research spending by the U.S.
NSF.
● Has brought state-of-the-art forecasting
capabilities to a wider audience, ranging from
hardcore scientists to high school students.
7. Why is it Hard?
● Geographically Distributed Sensors, Computing Power,
Storage, and Expertise.
● Handling Failures and Recovery
● Long Running Jobs (> 1 Hour).
● Large Scale Jobs (10-1000+ processors).
● Large Data Sizes (KBs to GBs of data).
● Need to serve many Parallel Users.
● Usage Spikes.
8. LEAD as an Example
● Assume a hurricane develops, and 1000
scientists across the U.S. come to the LEAD portal to
run forecasts.
● Let's assume:
● Each user runs 3 workflows.
● Each workflow has 6 services, generates about 300
notifications, moves 50 100MB files, generates 50
100MB files, and runs for one hour.
● Each service needs 5 CPU hours.
9. Which Means
● 3000 Parallel workflows
● Need 90,000 CPU hours per hour (about 90,000 CPUs)
● 250 TPS for messaging System
● Move 8GB/Sec through the network
● Generate 15TB data per Hour
LEAD cannot handle these numbers
yet, but they give us an idea of the
challenge.
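The figures above can be re-derived from the stated assumptions. The following is a minimal sketch, not part of the original slides. The constant and function names are hypothetical, units are decimal (1 TB = 1,000,000 MB), and it assumes that the network figure counts both the moved and the generated files, since that is the reading that reproduces roughly 8 GB/sec.

```python
# Inputs taken from the "LEAD as an Example" slide.
USERS = 1000
WORKFLOWS_PER_USER = 3
SERVICES_PER_WORKFLOW = 6
CPU_HOURS_PER_SERVICE = 5
NOTIFICATIONS_PER_WORKFLOW = 300
FILES_MOVED_PER_WORKFLOW = 50
FILES_GENERATED_PER_WORKFLOW = 50
FILE_SIZE_MB = 100
WORKFLOW_RUNTIME_SECONDS = 3600  # each workflow runs for one hour


def estimate_load():
    """Return the aggregate load implied by the slide's assumptions."""
    workflows = USERS * WORKFLOWS_PER_USER
    cpu_hours = workflows * SERVICES_PER_WORKFLOW * CPU_HOURS_PER_SERVICE
    notifications_per_second = (
        workflows * NOTIFICATIONS_PER_WORKFLOW / WORKFLOW_RUNTIME_SECONDS
    )
    generated_tb_per_hour = (
        workflows * FILES_GENERATED_PER_WORKFLOW * FILE_SIZE_MB / 1_000_000
    )
    # Assumption: network traffic includes moved AND generated files.
    network_mb = workflows * FILE_SIZE_MB * (
        FILES_MOVED_PER_WORKFLOW + FILES_GENERATED_PER_WORKFLOW
    )
    network_gb_per_second = network_mb / 1000 / WORKFLOW_RUNTIME_SECONDS
    return {
        "workflows": workflows,
        "cpu_hours": cpu_hours,
        "notifications_per_second": notifications_per_second,
        "generated_tb_per_hour": generated_tb_per_hour,
        "network_gb_per_second": network_gb_per_second,
    }


if __name__ == "__main__":
    for name, value in estimate_load().items():
        print(f"{name}: {value}")
```

Since every workflow runs for one hour, the 90,000 CPU hours have to be delivered within that hour, which is why the requirement reads as roughly 90,000 CPUs in use at once.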
10. SOA, E-Science and LEAD
● E-Science infrastructures are Distributed, Complex,
and Heterogeneous.
● SOA is designed to handle exactly this kind of system.
● LEAD is based on many SOA Specs
– WSDL, SOAP, WS-Addressing for Communication
– WS-BPEL for Workflows
– WS-Eventing for Messaging
– WSDM for service Management
● LEAD people have worked closely with, and
contributed to, Web Services, pushing their limits to
apply them to LEAD.
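As an illustration of how WS-Addressing supports communication for long-running jobs, the sketch below builds a SOAP 1.1 envelope whose ReplyTo header tells the service where to send the response later, instead of holding a connection open. This is a generic example and not LEAD's actual message format: the function name, endpoint URLs, and action URI are hypothetical, and only the two namespace URIs come from the published specifications.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1
WSA_NS = "http://www.w3.org/2005/08/addressing"        # WS-Addressing 1.0


def build_async_request(to, action, reply_to, message_id):
    """Build a SOAP envelope with WS-Addressing headers (hypothetical helper)."""
    ET.register_namespace("soapenv", SOAP_NS)
    ET.register_namespace("wsa", WSA_NS)

    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    header = ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")

    # Where the message goes and which operation it invokes.
    ET.SubElement(header, f"{{{WSA_NS}}}To").text = to
    ET.SubElement(header, f"{{{WSA_NS}}}Action").text = action
    # The response can be correlated with this request via its MessageID.
    ET.SubElement(header, f"{{{WSA_NS}}}MessageID").text = message_id
    # The service sends its response to this endpoint when the job is done.
    reply = ET.SubElement(header, f"{{{WSA_NS}}}ReplyTo")
    ET.SubElement(reply, f"{{{WSA_NS}}}Address").text = reply_to

    # Empty body; a real request would carry the job description here.
    ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    print(build_async_request(
        to="http://example.org/services/forecast",      # hypothetical
        action="urn:example:runForecast",               # hypothetical
        reply_to="http://example.org/client/callback",  # hypothetical
        message_id="urn:uuid:00000000-0000-0000-0000-000000000001",
    ))
```

Separating the reply address from the request connection is what makes jobs that run for an hour or more practical over Web Services.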
15. LEAD & Apache WS History
● A few people from LEAD have been major contributors to
Apache Axis, and then Axis2.
● LEAD is not based on Axis2.
● LEAD is older than Axis2, and it forked off in the Axis era,
mainly because of the need for async messaging support.
● Five years ago LEAD implemented many tools (e.g.
Registries, Async Messaging, Workflow Engine) that are
hot topics now.
● Towards the end, LEAD started looking at Axis2 and other
Apache Projects from a Sustainability Perspective.
● Most parts are already converted; others are being
converted.
16. LEAD with Apache Projects
● LEAD Switched to Apache ODE for workflow
execution more than a Year ago.
● LEAD data subsystems switched to Axis2 about a
Year ago.
● Job Submission was switched to an Axis2-based solution
a few months back.
● Service Factory is being converted to Axis2 right now.
● Conversion of the Messaging System is in progress
(through an Indiana University and LSF collaboration).
17. Apache as a Sustainability model
for Research projects
● Industry values “People”, we (opensource) value “Code”, and
Academia values “Ideas”.
● Most NSF grants now ask for a Sustainability Model as part
of proposals.
● One option is a commercial spin-off.
● Doing it in an open-source way, building a community and users
around a project, is also a potential solution.
● Many challenges: ownership, the need to renounce control, and
active engagement of the community are key.
● “Source Open” is not good enough!!
● “Dump and Run” does not work either.
18. Pros & Cons
Advantages
● Reach a wider audience. With a healthy user community,
the world debugs your project for you.
● Potential long lifetime, and a self-sustaining community if
successful.
● Take advantage of the Apache process throughout the project
life cycle (Releases, SVN, Jira, Wiki, Culture).
● Better chances of attracting external developers and more
input. Better chance of avoiding “source open”.
● Take advantage of Apache infrastructure.
Disadvantages
● You have to let go of the ownership, at least to some
extent.
● The need for community consent might slow you down.
● You have to learn to listen and explain. Some arguments are
harder to make on a mailing list.
● You have to time publications.
19. Conclusion
● Wanted to share a Real Life, Large-Scale SOA
Use Case.
● Wanted to show LEAD-Apache interactions as
a real Life Case Study of interactions between
Apache and an Academic Project.
● Wanted to Showcase Apache as a
Sustainability Mechanism, if it is done right.