This is my PhD defense presentation on improving scientific job execution in grids and clouds. It discusses how user patterns can be mined to learn user behavior and improve meta-scheduler decisions. The proposed and implemented resource abstraction layer helps scientists interact with a wide variety of compute resources.
User Inspired Management of Scientific Jobs in Grids and Clouds
1. User Inspired Management of Scientific Jobs in Grids and Clouds Eran Chinthaka Withana School of Informatics and Computing Indiana University, Bloomington, Indiana, USA Doctoral Committee Professor Beth Plale, PhD Dr. Dennis Gannon, PhD Professor Geoffrey Fox, PhD Professor David Leake, PhD
2. Outline
Mid-Range Science Challenges and Opportunities
Current Landscape
Research Questions
Contributions
  Mining Historical Information to Find Patterns and Experiences
  Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
  Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
  Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
Applications
Related Work
Conclusion and Future Work
Thesis Defense - Eran Chinthaka Withana
4. Mid-Range Science Challenges
Resource requirements going beyond the lab and university, but not suited for large-scale resources
Difficulty finding sufficient compute resources; e.g., short-term forecasts in LEAD for energy and agriculture
Lack of resources to have a strong CS support person on the team
Need for less expensive and more available resources
Opportunities
Wide variety of computational resources
Science gateways
5. Current Landscape
Grid Computing: batch orientation, long queues even under moderate loads, no access transparency; drawbacks in the quota system; levels of computer science expertise required
Cloud Computing: high availability, pay-as-you-go model, on-demand "limitless"¹ resource allocation; payment policy and research cost models
Use of Workflow Systems: hybrid workflows enable utilization of heterogeneous compute resources, e.g., the Vortex2 experiment
Need for resource abstraction layers and optimal selection of resources
Need to improve scientific job executions: better scheduler decisions, selection of compute resources; reliability issues in compute resources
Importance of learning user patterns and experiences
¹M. Armbrust et al., "Above the Clouds: A Berkeley View of Cloud Computing," Tech. Rep. UCB/EECS-2009-28, EECS Department, University of California, Berkeley, 2009.
7. Research Questions
"Can user patterns and experiences be used to improve scientific job executions in large-scale systems?"
"Can a simple, reliable and highly scalable uniform resource abstraction be achieved to interact with a variety of compute resource providers?"
"Can these be put to use to advance science?"
8. Contributions
Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service to hide the complexities of interacting with multiple resource managers in grids and clouds.
Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and integration with four different application domains to prove its usability.
10. Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds
Objective: reducing the impact of startup overheads for time-critical applications
Problem space: workflows can have multiple paths; workflow descriptions are not available; need for predictions to identify the job execution sequence; learning from user behavioral patterns to predict future jobs
Research outline: an algorithm to predict future jobs by extracting user patterns from historical information; use of knowledge-based techniques; zero knowledge or pre-populated job information consisting of connections between jobs; similar retrieved cases are used to predict future jobs, reducing high startup overheads
Algorithm assessment: two different workloads representing individual scientific jobs executed at LANL and a set of workflows executed by three users
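The core idea above, learning per-user job sequences from history and predicting the next job so resources can be provisioned ahead of time, can be sketched in a few lines. This is an illustrative simplification (a per-user successor table), not the dissertation's actual knowledge-based retrieval algorithm; all names here are hypothetical.

```python
from collections import defaultdict, Counter

class NextJobPredictor:
    """Sketch: predict a user's next job from observed job sequences.

    Keeps, per user, a table of which jobs historically followed the
    current one, and predicts the most frequent successor. With no
    matching history (the zero-knowledge case) it makes no prediction.
    """

    def __init__(self):
        # (user, current job) -> Counter of observed next jobs
        self.transitions = defaultdict(Counter)
        self.last_job = {}  # user -> most recently observed job

    def observe(self, user, job):
        """Record a finished job and link it to the user's previous job."""
        prev = self.last_job.get(user)
        if prev is not None:
            self.transitions[(user, prev)][job] += 1
        self.last_job[user] = job

    def predict(self, user):
        """Return the most likely next job for this user, or None."""
        prev = self.last_job.get(user)
        counts = self.transitions.get((user, prev))
        if not counts:
            return None
        return counts.most_common(1)[0][0]
```

For example, a user who repeatedly runs WRF followed by a crop-prediction model would, after a few observations, have the crop model predicted (and provisioned) as soon as WRF is submitted.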
11. Demonstration of User Patterns with Workflows
Suites of workflows can differ from domain to domain, e.g., WRF (Weather Research and Forecasting) as an upstream node
User patterns reveal the sequence of jobs, taking different users/domains into consideration
Useful for a science gateway serving a wide range of mid-scale scientists
(Figure: WRF as upstream node feeding weather predictions, crop predictions, wind farm location evaluations, and wild fire propagation simulations)
12. Role of Successful Predictions to Reduce Startup Overheads
The largest gain is achieved when prediction accuracy is high and the setup time (s) is large with respect to the execution time (t)
r = probability of a successful prediction (prediction accuracy)
Percentage time reduction = 100 × r × s / (t + s)
For simplicity, assuming equal job execution and startup times (t = s): percentage time reduction = 100 × r / 2
13. Relationship of Predictions to Execution Time
Observations
Percentage time reduction increases with the accuracy of predictions
Time reduction decreases exponentially with increased work-to-overhead (t/s) ratio
Need to find the critical point for a given situation: fixing the required percentage time reduction for a given t/s ratio and finding the required prediction accuracy
Cost of wrong predictions depends on the compute resource; demonstrated that higher prediction accuracies (~90%) reduce the impact of wrong predictions, compromising cost to improve time
Accuracy of predictions = total successful future job predictions / total predictions
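The relationship on the two slides above can be evaluated directly, assuming the reconstructed formula: the expected fraction of total time saved is r·s/(t+s), where a correct prediction (probability r) lets the setup time s overlap earlier work.

```python
def percentage_time_reduction(r, t, s):
    """Expected percentage of total time saved by provisioning ahead.

    r -- prediction accuracy (probability of a successful prediction)
    t -- job execution time
    s -- startup/setup overhead
    Assumes the reduction formula r * s / (t + s) from the slides.
    """
    return 100.0 * r * s / (t + s)
```

With equal execution and startup times (t = s), 90% accurate predictions save 45% of total time, while the same accuracy at a work-to-overhead ratio of t/s = 9 saves only 9%, which is the "critical point" trade-off discussed above.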
14. Prediction Engine: System Architecture
(Figure: system architecture, including the prediction retriever)
15. Use of Reasoning
Store and retrieve cases
Steps: retrieval of similar cases (similarity measurement, use of thresholds), reuse of old cases, case adaptation, and storage
16. Case Similarity Calculation
Each case is represented by a set of attributes, selected by finding their effect on the goal variable (the next job)
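One common way to realize the similarity measurement described above is a weighted attribute match, where each attribute's weight reflects its observed effect on the goal variable. This is a hypothetical formulation for illustration; the dissertation's exact similarity function may differ.

```python
def case_similarity(case_a, case_b, weights):
    """Weighted attribute-match similarity between two cases.

    Each case is a dict of attribute -> value; weights map each
    attribute to its importance for predicting the next job.
    Returns a score in [0, 1].
    """
    total = sum(weights.values())
    matched = sum(w for attr, w in weights.items()
                  if case_a.get(attr) == case_b.get(attr))
    return matched / total if total else 0.0
```

Retrieval then keeps only cases whose similarity to the new case exceeds a threshold, as in the "use of thresholds" step on the previous slide.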
17. Evaluation
Use cases
Individual job workload¹: 40k jobs over two years from the 1024-node CM-5 at Los Alamos National Lab
Workflow use case: the system doesn't see or assume a workflow specification
Experimental setup: 2.0GHz dual-core processor, 4GB memory, 64-bit Windows operating system
¹Parallel Workload Archive: http://www.cs.huji.ac.il/labs/parallel/workload/
18. Evaluation: Average Accuracy of Predictions
Individual jobs workload: ~75% accurate predictions with user patterns; ~32% with service names
Workflow workload: ~95% accurate predictions with user patterns; ~53% with service names
19. Evaluation: Time Saved
Amount of time that can be saved, if resources are provisioned, when a job is ready to run
Startup time assumed to be 3 minutes (average for commercial providers)
(Figures: time saved for the individual jobs workload and the workflow workload)
20. Evaluation: Prediction Accuracies for Use Cases
User-pattern-based predictions perform 2x better than service-name-based predictions
22. User Perceived Reliability
Failures are tolerated through fault tolerance, high availability, recoverability, etc. [Birman05]
What matters from a user's point of view is whether these failures are visible to users, e.g., reliability of commodity hardware (in clouds) vs. user-perceived reliability
Reliability is not of the resources themselves: it is not derived from halting failures, fail-stop failures, network partitioning failures [Birman05] or machine downtimes
It is a more broadly encompassing system reliability that can only be seen at the user or workflow level, and can depend on the user's configuration and job types as well
We refer to this form of reliability as user-perceived reliability
Importance of user-perceived reliability: selecting a resource to schedule an experiment when the user has access to multiple compute resources, e.g., reliability of LEAD supercomputing resources vs. Windows Azure resources
23. Why User Perceived Reliability is Useful
User-perceived failure probabilities: cluster A, p(A) = 0.2, and cluster B, p(B) = 0.3
p(A fails, B succeeds) = p(A) × (1 − p(B)) = 0.2 × (1 − 0.3) = 0.14
p(B fails, A succeeds) = p(B) × (1 − p(A)) = 0.3 × (1 − 0.2) = 0.24
Since 0.14 < 0.24, try cluster A first and then cluster B
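The slide's ordering argument is easy to make concrete: given user-perceived failure probabilities per cluster, try the most reliable first, and compare the probability that the first choice fails and the fallback succeeds. A minimal sketch (function names are illustrative):

```python
def submission_order(failure_prob):
    """Order clusters so the lowest user-perceived failure
    probability is tried first."""
    return sorted(failure_prob, key=failure_prob.get)

def fail_then_succeed(p_first, p_second):
    """Probability that the first-choice cluster fails and the
    retry on the second cluster succeeds (the slide's comparison)."""
    return p_first * (1 - p_second)
```

With p(A) = 0.2 and p(B) = 0.3 this reproduces the slide's numbers: trying A first gives a fail-then-recover probability of 0.14 versus 0.24 for B first.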
24. Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions
Objective: reduce the impact of low reliability of compute resources by deducing user-perceived reliabilities, learning from user experiences and perceptions
Research outline: an algorithm to predict user-perceived reliabilities by mining historical information; use of machine learning techniques; trained classifiers represent compute resources and their reliabilities; prediction of job failures
Algorithm assessment: workloads from the Parallel Workload Archive representing jobs executed in two different supercomputing clusters
25. System Architecture
A machine learning classifier is trained to learn the user-perceived reliability of each cluster
Classifier types
Static classifier: trained initially from historical information
Dynamic (updateable) classifier: starts from zero knowledge and builds while the system is in operation
26. System Architecture
The classifier manager uses the Weka [Hall09] framework
Classification methods: Naïve Bayes and KStar; static and dynamic classifiers; dynamic pruning of features [Fadishei09] for increased efficiency
The classifier manager creates and maintains classifiers for each compute resource; a new job is evaluated against these classifiers to deduce the predicted reliability of its execution
Policy implementers consider resource reliability predictions together with other quality of service information (time, cost) to select a resource
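To illustrate the updateable per-cluster classifier described above, here is a tiny categorical naive Bayes over job attributes predicting success or failure, trained incrementally as jobs finish. The real system uses Weka's Naïve Bayes and KStar implementations; this pure-Python version, with hypothetical attribute names, only sketches the idea.

```python
import math
from collections import defaultdict

class ReliabilityClassifier:
    """Sketch of one cluster's updateable reliability classifier:
    categorical naive Bayes with Laplace smoothing over job
    attributes, predicting 'success' or 'failure'."""

    def __init__(self):
        self.class_counts = defaultdict(int)  # outcome -> count
        # (attribute, value) -> {outcome -> count}
        self.attr_counts = defaultdict(lambda: defaultdict(int))

    def update(self, job, outcome):
        """Incrementally train on a finished job (dynamic classifier)."""
        self.class_counts[outcome] += 1
        for attr, value in job.items():
            self.attr_counts[(attr, value)][outcome] += 1

    def predict(self, job):
        """Return the more likely outcome for a new job."""
        total = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # log prior
            for attr, value in job.items():
                counts = self.attr_counts[(attr, value)]
                # Laplace smoothing over the two outcome classes
                score += math.log((counts[label] + 1) / (n + 2))
            if score > best_score:
                best, best_score = label, score
        return best
```

A policy implementer would then consult one such classifier per compute resource and weigh the predicted outcome against time and cost.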
27. Evaluation
Workloads from the Parallel Workload Archive [Feitelson]
LANL: two years' worth of jobs from 1994 to 1996 on the 1024-node CM-5 at Los Alamos National Lab
LPC: ten months' (Aug 2004 to May 2005) worth of job records on a 70-node Xeon cluster at the Laboratoire de Physique Corpusculaire of Université Blaise Pascal, France
Minor cleanups to remove intermediate job states
10,000 jobs were selected from each workload; LANL had 20% failed jobs, LPC had 30% failed jobs
28. Evaluation
Workload classification and maintenance
Classifiers: Naïve Bayes [John95] and KStar [Cleary95] implementations in Weka [Hall09]
Classifier construction: the static classifier is trained on the first 1000 jobs; the dynamic classifier uses all 10,000 jobs for construction and evaluation
Evaluation metrics
Average reliability prediction accuracy: accuracy of predicting the success/failure of a job
Time saved: cumulative time saved by aggregating the execution time of each job that fails and whose failure our system predicted; baseline measure: the ideal cumulative time that can be saved over time
Time consumed for classification and updating the classifier
Effect of pruning attributes: static subset of attributes (as proposed by Fadishei et al. [Fadishei09]) vs. dynamic subset of attributes (checking the effect on the goal variable)
29. Evaluation
Effect of job reliability predictions on selecting compute resources
An extended version of GridSim [Buyya02] models four compute resources; NWS [Wolski99] for bandwidth estimation and QBETS [Nurmi07] for queue wait time estimation
Total execution time = data movement time + queue wait time + job execution time (found in the workload)
Schedulers: total execution time priority scheduler; reliability-prediction-based time priority scheduler
Metrics: average accuracy of selecting reliable resources to execute jobs; time wasted due to incorrect selection of compute resources
All evaluations were run on a 3.0GHz dual-core processor with 4GB memory on the Windows 7 Professional operating system
31. Results: Average Reliability Prediction Accuracy
LANL accuracy saturation ~82%; LPC accuracy saturation ~97%
KStar performed slightly better than Naïve Bayes (static and dynamic/updateable classifiers, LANL and LPC workloads)
32. Results: Time Savings
With the static classifier, KStar saved 90-100%
Updateable classifier: for LANL, both KStar and Naïve Bayes saved ~50%; for LPC, ~90%
33. Results: Time Consumed for Classification and Updating the Classifier
Both static and updateable Naïve Bayes classifiers take very little time (not included in graphs)
34. Results: Effect of Pruning Attributes
The static subset of attributes [Fadishei09] performs poorly on this data set and classifier
Dynamic pruning improved prediction accuracy compared to the non-pruned case, but the improvement is marginal
Conclusion: our classifiers handle noisy features well without compromising classification accuracy; since identifying attributes to prune is a dynamic and expensive task, the system can be used in practical cases even without pruning attributes
35. Results: Effect of Job Reliability Predictions on Selecting Compute Resources
Poor performance of the execution time priority scheduler
After 1000 jobs (training), the time wasted with our approach stays fairly constant
36. Evaluation Conclusion
Even though the average prediction accuracy of the static KStar classifier decreased, it learned and predicted failures better than any other method
Even though the amount of time saved increased slightly with the Naïve Bayes updateable classifier, the amount of time saved using the static KStar classifier is higher than with either updateable method
Even though its total prediction accuracy does not outperform the other methods, the static KStar classifier is ideal for correctly predicting failure cases, with very low overhead
Taking user-perceived reliability of compute resources into consideration can save a significant amount of time in scientific job executions
38. Scientific Computing Resource Abstraction Layer
Variety of scientific computing platforms and opportunities
Requirements
Support existing job description languages and be extensible to support other languages
Provide a uniform and interoperable interface for external entities
Support heterogeneous compute resource manager interfaces and operating platforms from grids, IaaS and PaaS clouds, and departmental clusters
Extensibility to support new and future resource managers with minimal changes
Provide monitoring and fault recovery, especially when working with utility computing resources
Provide a light-weight, robust and scalable infrastructure
Integration with a variety of workflow environments
39. Scientific Computing Resource Abstraction Layer
Our contribution: a resource abstraction layer, implemented as a Web service, providing a uniform abstraction over heterogeneous compute resources including grids, clouds and local departmental clusters
Supports standard job specification languages including, but not limited to, the Job Submission Description Language (JSDL) [Anjomshoaa04] and the Globus Resource Specification Language (RSL); interacts directly with resource managers, so it requires no grid or meta-scheduling middleware
Integrates with current resource managers, including LoadLeveler, PBS, LSF and Windows HPC, and with the Amazon EC2 and Microsoft Azure platforms
Features
Does not need a high level of computer science knowledge to install and maintain (use of Globus was a challenge for most non-computer scientists); involvement of system administrators in installing and maintaining Sigiri is minimal
Memory footprint is minimal; other tools require installation of most of the heavy Globus stack, but Sigiri does not require a complete stack installation to run (installing Globus on small clusters is something scientists never wanted to do)
Better fault tolerance and failure recovery
40. Architecture
Asynchronous messaging model of message publishers and consumers
Daemons shadowing compute resources
Distributed component deployment: daemon, front-end Web service and job queue
41. Client Interaction Service
Deployed as an Apache Axis2 Web service to enable interoperability
Accepts job requests and enables management and monitoring functions
The job submission schema does not enforce a schema for the job description, enabling multiple job description languages
42. Client Interaction Service
(Figure: job submission request and response messages)
43. Daemons
Each managed compute resource has a light-weight daemon that periodically checks the job request queue, translates the job specification into a resource-manager-specific language, submits pending jobs, and persists the correlation between the resource manager's job id and the internal id
An extensible daemon API enables integration of a wide range of resource managers while keeping their complexities transparent to end users
The queuing-based approach enables daemons to run on any compute platform, without any software or operating system requirements
Current support: LSF, PBS, SLURM, LoadLeveler, Amazon EC2, Windows HPC, Windows Azure
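The daemon's cycle described above (drain the queue, translate, submit, record the id correlation) can be sketched as follows. This is a hypothetical simplification: Sigiri's daemons are not written like this, the job fields are invented, and the PBS directives shown are only a minimal example of a translator target.

```python
def to_pbs_script(job):
    """Translate a minimal, generic job description into a PBS batch
    script. Field names are illustrative; the real daemons parse
    JSDL/RSL and target several resource managers."""
    return "\n".join([
        "#!/bin/sh",
        f"#PBS -N {job['name']}",
        f"#PBS -l nodes={job['nodes']}",
        f"#PBS -l walltime={job['walltime']}",
        job["command"],
    ])

class Daemon:
    """One light-weight daemon per compute resource: drains the shared
    job queue, submits each job through a resource-manager-specific
    translator, and persists the job-id correlation."""

    def __init__(self, translate, submit):
        self.translate = translate  # job dict -> native script
        self.submit = submit        # native script -> manager job id
        self.id_map = {}            # internal job id -> manager job id

    def poll(self, queue):
        """One polling cycle: submit all pending jobs in the queue."""
        while queue:
            job = queue.pop(0)
            self.id_map[job["id"]] = self.submit(self.translate(job))
```

Because the daemon only needs queue access and a translator, the same loop works unchanged whether the back end is PBS, LSF, or a cloud provisioning API.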
44. Integration of Cloud Computing Resources
A unique set of dynamically loaded and configured extensions handles security, schedules jobs and performs the required data movements, enabling scientists to interact with multiple cloud providers within the same system
Features
Extensions can be written as modules independent of other extensions, typically to carry out a single task
Enforced failure handling to prevent orphan VMs and resources
45. Security
Client security: between the client and the Web service layer; support for both transport-level security (using SSL) and application-layer security (using WS-Security); client negotiation of security credentials with WS-Security policy support within Apache Axis2
Compute resource security: the system can store different types of security credentials, e.g., username/password combinations and X.509 credentials
46. Performance Evaluation
Test scenarios
Case 1: jobs arrive at our system as a burst of concurrent submissions from a controlled number of clients; each client waits for all jobs to finish before submitting the next set. For example, in the test with 100 clients, each client sends 1 job, so 100 jobs arrive at the server in parallel
Case 2: each client submits 10 jobs with varying execution times in sequence, with no delay between submissions; the client does not block upon submission of a job
Failure rate and server performance, from the clients' point of view, are measured, and the number of simultaneous clients is systematically increased
50. Performance Evaluation
Experimental setup
Daemon hosted on the gatekeeper node (quad-core IBM PowerPC (1.6GHz) with 8GB of physical memory) of the Big Red cluster
The Web service and database were co-hosted on a box with four 2.6GHz dual-core processors and 32GB of RAM
Neither of these nodes was dedicated to our experiment while the tests were running
Client environment: set up within the 128-node Odin cluster (each node a dual AMD 2.0GHz Opteron processor with 4GB physical memory); all client nodes were used in dedicated mode, and each client ran in a separate Java virtual machine to eliminate external overhead
Data collection: each test was run (number of clients × 10) times and the results averaged; each parameter was tested for 100 to 1000 concurrent clients; a total of 110,000 tests were run
GRAM4 results from the GRAM4 evaluation paper [Marru08] were used for system performance comparison
51. Results
Baseline measurements (Cases 1 and 2)
All overheads scale proportionally to the number of clients
No failures
52. Results
Metrics for test cases 1 and 2
Both response time and total overhead scale proportionally to the number of clients
No failures
53. Results
Scalability metrics (Cases 1 and 2)
No failures with Sigiri; failures starting from 300 clients for GRAM
55. Applications: LEAD
Motivations: the grid middleware reliability and scalability study [Marru08] and workflow failure rates; components of the LEAD infrastructure were considered for adaptation to other scientific environments; Sigiri was initially prototyped to support LoadLeveler, PBS and LSF
Implications: improved workflow success rates; mitigated the need for Globus middleware; ability to work with non-standard job managers
56. Applications: LEAD II
Emergence of community-driven, production-quality workflow infrastructures, e.g., the Trident Scientific Workflow Workbench built on Workflow Foundation
Possibility of using alternate supercomputing resources, e.g., the recent port of the WRF (Weather Research and Forecasting) model to the Windows platform and Azure
Support for Windows-based scientific computing environments
57. Background: LEAD II and the Vortex2 Experiment
May 1, 2010 to June 15, 2010: ~6 weeks, 7 days per week
The workflow started on the hour, every hour, each morning
It had to find and bind to the latest model data (i.e., RUC 13km and ADAS data) to set initial and boundary conditions; if model data was not available at NCEP and the University of Oklahoma, the workflow could not begin
Execution of the complete WRF stack within 1 hour
58. Trident Vortex2 Workflow
The bulk of the time (50 min) is spent in the LEAD Workflow Proxy activity, the Sigiri integration point
59. Applications: Enabling Geo-Science Applications on Windows Azure
Geo-science applications have high resource requirements: compute intensive, dedicated HPC hardware, e.g., the Weather Research and Forecasting (WRF) model
Emergence of ensemble applications: large numbers of small jobs, e.g., examining each air layer over a long period of time; a single experiment is about 14,000 jobs, each taking a few minutes to complete
60. Geo-Science Applications: Opportunities
Cloud computing resources: on-demand access to "unlimited" resources; flexibility (worker roles and VM roles)
Recent porting of geo-science applications: WRF and the WRF Preprocessing System (WPS) ported to Windows; increased use of ensemble applications (large numbers of small runs)
Production-quality, open-source scientific workflow systems, e.g., Microsoft Trident
61. Research Vision
Enabling geo-science experiments
Types of applications: compute intensive, ensembles
Types of scientists: meteorologists, atmospheric scientists, emergency management personnel, geologists
Utilizing both cloud computing and grid computing resources
Utilizing open-source, production-quality scientific workflow environments
Improved data and metadata management
(Figure: geo-science applications, scientific workflows, compute resources)
62. Proposed Framework
(Figure: a Trident activity submits jobs through the Sigiri Web service and job queue; Sigiri job management daemons drive the Azure management API, Azure blob store and Azure fabric; custom VM images run IIS, WRF, the Sigiri worker service and MS-MPI on Windows 2008 R2)
63. Applications: PRAGMA Testbed Support
Pacific Rim Applications and Grid Middleware (PRAGMA) [Zheng06]: an open international organization founded in 2002 to focus on practical issues of building international scientific collaborations
In 2010, Indiana University (IU) joined PRAGMA and added a dedicated cluster to the testbed
Sigiri was used within the IU PRAGMA testbed: the testbed required a light-weight system that could be installed and maintained with minimal effort, and the IU PRAGMA team wanted to evaluate adding cloud resources to the testbed with little or no change to interfaces
In 2011, the PRAGMA - Opal - Sigiri integration was demonstrated successfully
65. Related Work
Scientific job management systems
Grid Resource Allocation and Management (GRAM) [Foster05], Condor-G [Frey02], Nimrod/G [Buyya00], GridWay [Huedo05], SAGA [Goodale06] and Falkon [Raicu07] provide uniform job management APIs, but are tightly integrated with complex middleware to address a broad range of problems
The Carmen [Watson81] project provided a cloud environment that enabled collaboration between neuroscientists, but requires all programs to be packaged as WS-I [Ballinger04] compliant Web services
Condor [Frey02] pools can also be utilized to unify certain compute resource interactions, but Condor uses the Globus toolkit [Foster05] (and GRAM underneath), has poor failure recovery, and overlooks the failure modes of a cloud platform
66. Related Work
Scientific research and cloud computing
IaaS, PaaS and SaaS environment evaluations: scientists have mainly evaluated the use of IaaS services for scientific job executions [Abadi09][Hoffa08][Keahey08][Yu05], for the ease of setting up custom environments and for control; growing interest in using PaaS services [Humphrey10][Lu10][Qiu09]
Optimization to balance cost and time of executions [Deelman08][Yu05]; startup overheads [Chase03][Figueiredo03][Foster06][Sotomayor06][Keahey07]
Job prediction algorithms: prediction of execution times [Smith], job start times [Li04], queue wait times [Nurmi07] and resource requirements [Julian04]; AI-based and statistical-modeling-based approaches; AppLeS [Berman03] argues that a good scheduler must involve some prediction of application and system performance
Reliability of compute resources: Birman [Birman05] on aspects of resources causing system reliability issues; statistical modeling to predict failures [Kandaswamy08]
68. Conclusion
User-inspired management of scientific jobs concentrates on the identification of user patterns and perceptions, harnesses historical information, and applies the knowledge gained to improve scientific job executions
It argues that patterns, if identified for individual users, can reveal important information for making sophisticated estimations of resource requirements
Evaluations demonstrate the usability of these predictions for a meta-scheduler, especially one integrated into a community gateway, to improve its scheduling decisions
The resource abstraction service helps mid-scale scientists obtain access to resources that are cheap and available, and strives to do so with a tool that is easy to set up and administer
The prototype implementations introduced and discussed are integrated and used in different domains and scientific applications; these applications demonstrate how our research contributed to advancing science in the respective domains
69. Contributions
Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service to hide the complexities of interacting with multiple resource managers in grids and clouds.
Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and integration with four different application domains to prove its usability.
70. Future Work
Short-term research directions: integration of future job predictions and user-perceived reliability predictions; evolving the resource abstraction service to support more compute resources; management of ensemble runs; fault tolerance with proactive replication
Long-term research directions
71. Thank You!