User Inspired Management of Scientific Jobs in Grids and Clouds
Eran Chinthaka Withana
School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA
Doctoral Committee: Professor Beth Plale, PhD; Dr. Dennis Gannon, PhD; Professor Geoffrey Fox, PhD; Professor David Leake, PhD
Outline
- Mid-Range Science
  - Challenges and Opportunities
  - Current Landscape
- Research
  - Research Questions
  - Contributions
- Mining Historical Information to Find Patterns and Experiences
  - Usage Patterns to Provision for Time-Critical Scientific Experimentation in Clouds [Contribution 1]
  - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
- Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
- Applications
- Related Work
- Conclusion and Future Work
Thesis Defense - Eran Chinthaka Withana
Mid-Range Science
Challenges
- Resource requirements go beyond the lab and university, but are not suited to large-scale resources
- Difficulty finding sufficient compute resources, e.g., short-term forecasts in LEAD for energy and agriculture
- Lack of resources to have a strong CS support person on the team
- Need for less expensive and more available resources
Opportunities
- Wide variety of computational resources
- Science gateways
Current Landscape
Grid computing
- Batch orientation, long queues even under moderate loads, no access transparency
- Drawbacks in the quota system
- Level of computer science expertise required
Cloud computing
- High availability, pay-as-you-go model, on-demand "limitless" resource allocation [1]
- Payment policy and research cost models
Use of workflow systems
- Hybrid workflows enable utilization of heterogeneous compute resources, e.g., the Vortex2 experiment
- Need for resource abstraction layers and optimal selection of resources
Need for improvement of scientific job executions
- Better scheduler decisions and selection of compute resources
- Reliability issues in compute resources
- Importance of learning user patterns and experiences
[1] M. Armbrust et al., "Above the Clouds: A Berkeley View of Cloud Computing," Tech. Rep. UCB/EECS-2009-28, EECS Department, University of California, Berkeley, 2009.
Research Questions
- "Can user patterns and experiences be used to improve scientific job executions in large-scale systems?"
- "Can a simple, reliable, and highly scalable uniform resource abstraction be achieved to interact with a variety of compute resource providers?"
- "Can these be put to use to advance science?"
Contributions
1. Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources and reduce the impact of startup overheads in cloud computing environments.
2. Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
3. Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service that hides the complexities of interacting with multiple resource managers in grids and clouds.
4. A prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and its integration with four different application domains to demonstrate its usability.
Usage Patterns to Provision for Time-Critical Scientific Experimentation in Clouds
Objective
- Reduce the impact of startup overheads for time-critical applications
Problem space
- Workflows can have multiple paths
- Workflow descriptions are not available
- Predictions are needed to identify the job execution sequence
- Learn from user behavioral patterns to predict future jobs
Research outline
- Algorithm to predict future jobs by extracting user patterns from historical information
- Use of knowledge-based techniques, starting from zero knowledge or from pre-populated job information consisting of connections between jobs
- Similar cases retrieved are used to predict future jobs, reducing high startup overheads
Algorithm assessment
- Two different workloads: individual scientific jobs executed at LANL, and a set of workflows executed by three users
Demonstration of User Patterns with Workflows
- A suite of workflows can differ from domain to domain, e.g., with WRF (Weather Research and Forecasting) as the upstream node
- User patterns reveal the sequence of jobs, taking different users and domains into consideration
- Useful for a science gateway serving a wide range of mid-scale scientists
[Figure: WRF as the upstream node for Weather Predictions, Crop Predictions, Wind Farm Location Evaluations, and Wild Fire Propagation Simulation]
Role of Successful Predictions to Reduce Startup Overheads
- The largest gain is achieved when prediction accuracy is high and setup time (s) is large with respect to execution time (t)
- r = probability of a successful prediction (prediction accuracy)
- Percentage time reduction: [equation in terms of r, s, and t]
- For simplicity, assuming equal job execution and startup times: percentage time reduction: [equation]
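One simple way to model this relationship is sketched below. It is an assumption, not the equation from the slide: without a prediction a job pays the full setup time s before running for t; with a correct prediction (probability r) the resource is already provisioned; and a wrong prediction is assumed to cost nothing beyond the usual setup.

```latex
\begin{aligned}
T_{\text{no-pred}} &= s + t \\
\mathbb{E}[T_{\text{pred}}] &= r\,t + (1 - r)(s + t) = t + (1 - r)\,s \\
\text{Reduction (\%)} &= 100 \cdot \frac{T_{\text{no-pred}} - \mathbb{E}[T_{\text{pred}}]}{T_{\text{no-pred}}}
  = 100 \cdot \frac{r\,s}{s + t} = \frac{100\,r}{1 + t/s} \\
s = t \;\Rightarrow\; \text{Reduction (\%)} &= 50\,r
\end{aligned}
```

Under this simplified model the reduction grows linearly with accuracy r and falls off as 1/(1 + t/s) as the work-to-overhead ratio grows.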
Relationship of Predictions to Execution Time
Observations
- Percentage time reduction increases with the accuracy of predictions
- Time reduction drops off sharply as the work-to-overhead ratio (t/s) increases
- Need to find the critical point for a given situation: fix the required percentage time reduction for a given t/s ratio and find the required prediction accuracy
Cost of wrong predictions
- Depends on the compute resource
- The demonstrated high prediction accuracies (~90%) reduce the impact of wrong predictions
- Trades cost for improved time
Accuracy of predictions = total successful future job predictions / total predictions
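The critical-point calculation described above (fix a target reduction for a given t/s, then solve for the accuracy needed) can be sketched in Python. This is a minimal sketch under an assumed model, percentage reduction = 100·r/(1 + t/s), in which a wrong prediction costs nothing extra; the function names are hypothetical and the thesis's own equation may differ.

```python
def prediction_accuracy(successful: int, total: int) -> float:
    """Accuracy of predictions = successful future-job predictions / total predictions."""
    if total <= 0:
        raise ValueError("total predictions must be positive")
    return successful / total


def percent_time_reduction(r: float, t_over_s: float) -> float:
    """Expected % time reduction under the assumed model 100 * r / (1 + t/s)."""
    return 100.0 * r / (1.0 + t_over_s)


def required_accuracy(target_percent: float, t_over_s: float) -> float:
    """Invert the assumed model: accuracy r needed to reach a target % reduction."""
    r = target_percent * (1.0 + t_over_s) / 100.0
    if r > 1.0:
        # Unreachable even with perfect predictions at this work-to-overhead ratio.
        raise ValueError("target reduction not achievable for this t/s ratio")
    return r


if __name__ == "__main__":
    print(prediction_accuracy(90, 100))      # 0.9
    print(percent_time_reduction(0.9, 1.0))  # 45.0
    print(required_accuracy(30.0, 2.0))      # 0.9
```

The `ValueError` in `required_accuracy` marks the point where the work-to-overhead ratio is too high for any prediction accuracy to deliver the requested saving.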
Prediction Engine: System Architecture
[Figure: prediction engine architecture, including the Prediction Retriever]
Use of Reasoning
- Store and retrieve cases
Steps
- Retrieval of similar cases: similarity measurement, use of thresholds
- Reuse of old cases: case adaptation
- Storage
Case Similarity Calculation
- Each case is represented by a set of attributes
- Attributes are selected by finding their effect on the goal variable (the next job)
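The retrieve, reuse, and store steps above can be sketched as a small case base. This is a minimal sketch under assumptions: cases are flat attribute dictionaries, similarity is a weighted exact-match score, and the next job is predicted by a similarity-weighted vote among cases above a threshold. The attribute names, weights, and threshold are hypothetical, and the thesis's actual similarity measure and case adaptation may differ.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Case:
    attributes: dict  # e.g. {"user": ..., "job": ...}; attribute names are hypothetical
    next_job: str     # goal variable: the job the user ran next


@dataclass
class CaseBase:
    weights: dict                 # attribute -> weight (e.g. from its effect on the goal)
    threshold: float = 0.6        # hypothetical minimum similarity for a "similar" case
    cases: list = field(default_factory=list)

    def similarity(self, a: dict, b: dict) -> float:
        """Weighted fraction of attributes whose values match exactly."""
        total = sum(self.weights.values())
        matched = sum(w for attr, w in self.weights.items() if a.get(attr) == b.get(attr))
        return matched / total if total else 0.0

    def retrieve(self, query: dict) -> list:
        """Retrieval: keep only cases whose similarity passes the threshold."""
        return [c for c in self.cases if self.similarity(query, c.attributes) >= self.threshold]

    def predict_next_job(self, query: dict):
        """Reuse: similarity-weighted vote among retrieved cases; None means no knowledge yet."""
        votes = Counter()
        for c in self.retrieve(query):
            votes[c.next_job] += self.similarity(query, c.attributes)
        return votes.most_common(1)[0][0] if votes else None

    def store(self, attributes: dict, actual_next_job: str) -> None:
        """Storage: retain the observed outcome as a new case."""
        self.cases.append(Case(dict(attributes), actual_next_job))


if __name__ == "__main__":
    cb = CaseBase(weights={"user": 2.0, "job": 3.0, "queue": 1.0})
    cb.store({"user": "alice", "job": "wrf", "queue": "normal"}, "crop_model")
    cb.store({"user": "alice", "job": "wrf", "queue": "debug"}, "crop_model")
    cb.store({"user": "bob", "job": "wrf", "queue": "normal"}, "wildfire_sim")
    print(cb.predict_next_job({"user": "alice", "job": "wrf", "queue": "normal"}))  # crop_model
```

Starting from an empty case base corresponds to the zero-knowledge mode; pre-populating `cases` corresponds to the pre-populated mode.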
Evaluation
Use cases
- Individual job workload [1]: 40k jobs over two years from the 1024-node CM-5 at Los Alamos National Lab
- Workflow use case: the system does not see or assume the workflow specification
Experimental setup
- 2.0 GHz dual-core processor, 4 GB memory, 64-bit Windows operating system
[1] Parallel Workloads Archive, http://www.cs.huji.ac.il/labs/parallel/workload/
Evaluation: Average Accuracy of Predictions
Individual jobs workload
- ~75% accurate predictions with user patterns
- ~32% accurate predictions with service names
Workflow workload
- ~95% accurate predictions with user patterns
- ~53% accurate predictions with service names
Evaluation: Time Saved
- Amount of time that can be saved if resources are already provisioned when the job is ready to run
- Startup time assumed to be 3 min (average for commercial providers)
[Figures: Individual Jobs Workload; Workflow Workload]
Evaluation: Prediction Accuracies for Use Cases
- Predictions based on user patterns perform about 2x better than those based on service names
User-Perceived Reliability
- Failures are tolerated through fault tolerance, high availability, recoverability, etc. [Birman05]
- What matters from a user's point of view is whether these failures are visible to users, e.g., reliability of commodity hardware (in clouds) vs. user-perceived reliability
- This is not the reliability of the resources themselves
  - It is not derived from halting failures, fail-stop failures, network partitioning failures [Birman05], or machine downtimes
  - It is a more broadly encompassing system reliability that can only be seen at the user or workflow level
  - It can also depend on the user's configuration and job types
- We refer to this form of reliability as user-perceived reliability
Importance of user-perceived reliability
- Selecting a resource to schedule an experiment when a user has access to multiple compute resources, e.g., in LEAD, the reliability of supercomputing resources vs. Windows Azure resources
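Why User-Perceived Reliability Is Useful
- User-perceived failure probabilities: cluster A, p(A) = 0.2; cluster B, p(B) = 0.3
- Assuming failures on the two clusters are independent, the probability that a job fails on one cluster but would succeed on the other is:
  $p(A \cap \bar{B}) = p(A)\,p(\bar{B}) = 0.2 \times (1 - 0.3) = 0.14$
  $p(B \cap \bar{A}) = p(B)\,p(\bar{A}) = 0.3 \times (1 - 0.2) = 0.24$
- Since $p(A \cap \bar{B}) < p(B \cap \bar{A})$, try cluster A first and then cluster B.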
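The comparison above can be written as a small function. It assumes, as the calculation does, that failures on the two clusters are independent; the function name is hypothetical.

```python
def try_order(p_fail_a: float, p_fail_b: float):
    """Order two clusters for submission from user-perceived failure probabilities.

    Assumes independent failures. p(A and not B) is the chance a job fails on A
    but would have succeeded on B, and vice versa for p(B and not A).
    """
    a_then_b = p_fail_a * (1 - p_fail_b)  # p(A ∩ B̄)
    b_then_a = p_fail_b * (1 - p_fail_a)  # p(B ∩ Ā)
    order = ("A", "B") if a_then_b <= b_then_a else ("B", "A")
    return order, a_then_b, b_then_a


if __name__ == "__main__":
    order, a, b = try_order(0.2, 0.3)
    print(order, round(a, 2), round(b, 2))  # ('A', 'B') 0.14 0.24
```

Expanding the inequality, p(A)(1 − p(B)) < p(B)(1 − p(A)) simplifies to p(A) < p(B), so for two independent clusters the rule reduces to trying the cluster with the lower user-perceived failure probability first.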
Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions
Objective
- Reduce the impact of low reliability of compute resources
- Deduce user-perceived reliabilities by learning from user experiences and perceptions
Research outline
- Algorithm to predict user-perceived reliabilities, learning from user experiences by mining historical information
- Use of machine learning techniques: trained classifiers represent compute resources and their reliabilities
- Prediction of job failures
Algorithm assessment
- Workloads from the Parallel Workloads Archive representing jobs executed on two different supercomputing clusters
System Architecture
[Figure: system architecture]
- A machine learning classifier is trained to learn the user-perceived reliability of each cluster
- Classifier types
  - Static classifier: trained initially from historical information
  - Dynamic (updateable) classifier: starts from zero knowledge and is built while the system is in operation
System Architecture
- The classifier manager uses the Weka [Hall09] framework
  - Classification methods: Naïve Bayes and KStar
  - Static and dynamic classifiers
  - Dynamic pruning of features [Fadishei09] for increased efficiency
- Classifier manager: creates and maintains a classifier for each compute resource; a new job is evaluated against these classifiers to deduce the predicted reliability of its execution
- Policy implementers: consider resource reliability predictions together with other quality-of-service information (time, cost) to select a resource
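A per-resource classifier manager along these lines can be sketched in Python. This is a stand-in under stated assumptions: the thesis uses Weka's Naïve Bayes and KStar implementations, whereas the sketch hand-rolls a categorical, updateable Naïve Bayes with Laplace smoothing, and the class names and job attributes are hypothetical. The same `update` method serves a dynamic classifier (fed job by job during operation) or a static one (fed a historical batch once, then left unchanged).

```python
import math
from collections import defaultdict


class UpdateableNaiveBayes:
    """Categorical Naive Bayes with Laplace smoothing that learns one job at a time.

    Hand-rolled illustration, not Weka's implementation. Labels are
    "success"/"failure"; features are categorical job attributes.
    """

    LABELS = ("success", "failure")

    def __init__(self):
        self.label_counts = defaultdict(int)
        self.value_counts = defaultdict(lambda: defaultdict(int))  # (label, feature) -> value -> count
        self.seen_values = defaultdict(set)                        # feature -> distinct values seen

    def update(self, features: dict, label: str) -> None:
        """One incremental training step."""
        self.label_counts[label] += 1
        for f, v in features.items():
            self.value_counts[(label, f)][v] += 1
            self.seen_values[f].add(v)

    def prob_failure(self, features: dict) -> float:
        total = sum(self.label_counts.values())
        log_scores = {}
        for label in self.LABELS:
            n = self.label_counts.get(label, 0)
            score = math.log((n + 1) / (total + len(self.LABELS)))  # smoothed prior
            for f, v in features.items():
                k = len(self.seen_values.get(f, ())) + 1  # +1 slot for unseen values
                count = self.value_counts.get((label, f), {}).get(v, 0)
                score += math.log((count + 1) / (n + k))  # smoothed likelihood
            log_scores[label] = score
        top = max(log_scores.values())  # normalise in log space to avoid underflow
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        return weights["failure"] / sum(weights.values())


class ClassifierManager:
    """Creates and maintains one classifier per compute resource."""

    def __init__(self):
        self.classifiers = defaultdict(UpdateableNaiveBayes)

    def record(self, resource: str, features: dict, succeeded: bool) -> None:
        self.classifiers[resource].update(features, "success" if succeeded else "failure")

    def predicted_reliability(self, resource: str, features: dict) -> float:
        """Predicted probability that the job succeeds on `resource`."""
        return 1.0 - self.classifiers[resource].prob_failure(features)
```

A policy implementer could then combine `predicted_reliability` for each resource with time and cost estimates; a resource with no history yet scores 0.5.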
Evaluation
Workloads from the Parallel Workloads Archive [Feitelson]
- LANL: two years' worth of jobs (1994 to 1996) on the 1024-node CM-5 at Los Alamos National Lab
- LPC: ten months' worth (Aug 2004 to May 2005) of job records on a 70-node Xeon cluster at the "Laboratoire de Physique Corpusculaire" of Université Blaise-Pascal, France
- Minor cleanups to remove intermediate job states
- 10,000 jobs were selected from each workload: LANL had 20% failed jobs; LPC had 30% failed jobs
Evaluation
Workload classification and maintenance
- Classifiers: Naïve Bayes [John95] and KStar [Cleary95] implementations in Weka [Hall09]
Classifier construction
- Static classifier: the first 1000 jobs train the classifier
- Dynamic classifier: all 10,000 jobs are used for classifier construction and evaluation
Evaluation metrics
- Average reliability prediction accuracy: accuracy of predicting the success or failure of a job
- Time saved: cumulative time saved by aggregating the execution time of each job that fails and whose failure our system predicted correctly; baseline measure: the ideal cumulative time that can be saved over time
- Time consumed for classification and for updating the classifier
- Effect of pruning attributes: static subset of attributes (as proposed in Fadishei et al. [Fadishei09]) vs. dynamic subset of attributes (checking the effect on the goal variable)
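The accuracy and time-saved metrics above can be computed from per-job records as below. The record layout and function names are assumptions; execution times come from the workload, as in the evaluation, and the "ideal" baseline simply assumes every failure is predicted.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class JobRecord:
    failed: bool          # actual outcome from the workload
    predicted_fail: bool  # classifier's prediction for this job
    runtime: float        # execution time recorded in the workload


def average_accuracy(jobs) -> float:
    """Fraction of jobs whose success/failure was predicted correctly."""
    return sum(j.failed == j.predicted_fail for j in jobs) / len(jobs)


def time_saved(jobs) -> float:
    """Sum of execution times of failed jobs whose failure was correctly predicted."""
    return sum(j.runtime for j in jobs if j.failed and j.predicted_fail)


def ideal_time_saved(jobs) -> float:
    """Baseline: the time that would be saved if every failure were predicted."""
    return sum(j.runtime for j in jobs if j.failed)


def cumulative_time_saved(jobs) -> list:
    """Running total of time saved, job by job (for plotting savings over time)."""
    return list(accumulate(j.runtime if (j.failed and j.predicted_fail) else 0.0 for j in jobs))
```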
Evaluation
Evaluation metrics: effect of job reliability predictions on selecting compute resources
- An extended version of GridSim [Buyya02] models four compute resources
- NWS [Wolski99] for bandwidth estimation and QBETS [Nurmi07] for queue wait time estimation
- Total execution time = data movement time + queue wait time + job execution time (found in the workload)
Schedulers
- Total Execution Time Priority Scheduler
- Reliability Prediction Based Time Priority Scheduler
Metrics
- Average accuracy of selecting reliable resources to execute jobs
- Time wasted due to incorrect selection of compute resources to execute jobs
All evaluations were run on a 3.0 GHz dual-core processor with 4 GB memory on Windows 7 Professional.
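The two schedulers can be contrasted with a short sketch. The total-time formula follows the slide; the reliability rule (skip resources whose predicted failure probability is at or above a threshold, then pick the fastest, falling back to all resources if none qualify) and the 0.5 threshold are assumptions for illustration, not the thesis's exact policy.

```python
from dataclasses import dataclass


@dataclass
class ResourceEstimate:
    name: str
    data_movement: float  # e.g. derived from a bandwidth estimate such as NWS
    queue_wait: float     # e.g. from a queue-wait predictor such as QBETS
    execution: float      # job execution time
    p_fail: float         # predicted user-perceived failure probability

    @property
    def total_time(self) -> float:
        # Total execution time = data movement + queue wait + job execution.
        return self.data_movement + self.queue_wait + self.execution


def time_priority(resources):
    """Total Execution Time Priority: pick the smallest estimated total time."""
    return min(resources, key=lambda r: r.total_time)


def reliability_time_priority(resources, max_p_fail=0.5):
    """Hypothetical reliability-aware variant: fastest among resources not predicted to fail."""
    reliable = [r for r in resources if r.p_fail < max_p_fail]
    return time_priority(reliable or resources)


if __name__ == "__main__":
    fast = ResourceEstimate("fast-but-flaky", 10, 5, 100, 0.8)
    slow = ResourceEstimate("slower-reliable", 20, 30, 100, 0.1)
    print(time_priority([fast, slow]).name)              # fast-but-flaky
    print(reliability_time_priority([fast, slow]).name)  # slower-reliable
```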
Evaluation Metrics Summary
Results: Average Reliability Prediction Accuracy
[Figures: static and dynamic/updateable classifiers, for LANL and LPC]
- LANL: accuracy saturates at ~82%
- LPC: accuracy saturates at ~97%
- KStar performed slightly better than Naïve Bayes
Results: Time Savings
[Figures: static and dynamic/updateable classifiers, for LANL and LPC]
- With the static classifier, KStar saved 90-100%
- Updateable classifier: for LANL, both KStar and Naïve Bayes saved ~50%; for LPC, ~90%
Results: Time Consumed for Classification and Updating the Classifier
[Figures: Static Classifier; Updateable Classifier]
- Both static and updateable Naïve Bayes classifiers take very little time (not included in the graphs)
Results: Effect of Pruning Attributes
- The static subset of attributes [Fadishei09] performs poorly on this data set and classifier
- Dynamic pruning improved prediction accuracy compared to the non-pruned case, but the improvement is marginal
- Conclusion: our classifiers handle noisy features well without compromising classification accuracy
- Identifying attributes to prune is a dynamic and expensive task, so the system can be used in practice even without pruning attributes
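Dynamic pruning ranks attributes by their effect on the goal variable. The sketch below swaps in information gain as one common measure of that effect; the slides do not specify the thesis's exact criterion, and the `min_gain` threshold and attribute names are hypothetical. Recomputing these scores as the workload evolves is part of what makes dynamic pruning costly.

```python
import math
from collections import Counter, defaultdict


def entropy(labels) -> float:
    """Shannon entropy (bits) of a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(rows, attr, goal) -> float:
    """Reduction in entropy of the goal variable after splitting rows on `attr`."""
    base = entropy([r[goal] for r in rows])
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r[goal])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def prune_attributes(rows, attrs, goal, min_gain=0.01):
    """Keep only attributes whose information gain on the goal reaches `min_gain`."""
    return [a for a in attrs if information_gain(rows, a, goal) >= min_gain]


if __name__ == "__main__":
    rows = [
        {"queue": "long", "user": "u1", "status": "fail"},
        {"queue": "long", "user": "u2", "status": "fail"},
        {"queue": "short", "user": "u1", "status": "ok"},
        {"queue": "short", "user": "u2", "status": "ok"},
    ]
    print(prune_attributes(rows, ["queue", "user"], "status"))  # ['queue']
```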
Results: Effect of Job Reliability Predictions on Selecting Compute Resources
- The execution time priority scheduler performs poorly
- After 1000 jobs (training), the time wasted with our approach stays fairly constant
Evaluation Conclusion
- Even though the average prediction accuracy of the KStar classifier decreased when used as a static classifier, it learned to predict failures better than any other method.
- Even though the amount of time saved increased slightly with the Naïve Bayes updateable classifier, the time saved using the static KStar classifier is still higher than that of both methods.
- Even though its overall prediction accuracy is lower than that of the other methods, the static KStar classifier is ideal for correctly predicting failure cases, with very low overhead.
- Taking the user-perceived reliability of compute resources into consideration can save a significant amount of time in scientific job executions.
Scientific Computing Resource Abstraction Layer
- Variety of scientific computing platforms and opportunities
Requirements
- Support existing job description languages and be extensible to support other languages
- Provide a uniform and interoperable interface for external entities to interact with it
- Support heterogeneous compute resource manager interfaces and operating platforms from grids, IaaS and PaaS clouds, and departmental clusters
- Extensibility to support new and future resource managers with minimal changes
- Provide monitoring and fault recovery, especially when working with utility computing resources
- Provide a light-weight, robust, and scalable infrastructure
- Integration with a variety of workflow environments
Scientific Computing Resource Abstraction Layer
Our contribution: a resource abstraction layer (Sigiri)
- Implemented as a web service
- Provides a uniform abstraction layer over heterogeneous compute resources, including grids, clouds, and local departmental clusters
- Supports standard job specification languages including, but not limited to, the Job Submission Description Language (JSDL) [Anjomshoaa04] and the Globus Resource Specification Language (RSL)
- Interacts directly with resource managers, so it requires no grid or meta-scheduling middleware
- Integrates with current resource managers, including LoadLeveler, PBS, LSF, and Windows HPC, and with the Amazon EC2 and Microsoft Azure platforms
Features
- Does not need a high level of computer science knowledge to install and maintain; use of Globus was a challenge for most non-computer scientists
- Involvement of system administrators to install and maintain Sigiri is minimal
- Minimal memory footprint: other tools require installing most of the heavy Globus stack, but Sigiri does not require a complete stack installation to run (installing Globus on small clusters is something scientists never wanted to do)
- Better fault tolerance and failure recovery
Architecture
- Asynchronous messaging model of message publishers and consumers
- Daemons shadowing compute resources
- Distributed component deployment: daemon, front-end web service, and job queue
Client Interaction Service
- Deployed as an Apache Axis2 web service to enable interoperability
- Accepts job requests and enables management and monitoring functions
- The job submission schema does not enforce a schema for the job description, which enables multiple job description languages
Client Interaction Service
[Figures: Job Submission Request; Job Submission Response]
Daemons
- Each managed compute resource has a light-weight daemon that
  - periodically checks the job request queue
  - translates the job specification to a resource-manager-specific language
  - submits pending jobs and persists the correlation between the resource manager's job ID and the internal ID
- An extensible daemon API enables integration of a wide range of resource managers while keeping their complexities transparent to end users
- The queuing-based approach enables daemons to run on any compute platform, without any software or operating system requirements
- Current support: LSF, PBS, SLURM, LoadLeveler, Amazon EC2, Windows HPC, Windows Azure
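The daemon cycle described above (poll the queue, translate, submit, record the ID correlation) can be sketched as follows. The adapter interface, class names, in-memory queue, and dictionary ID map are hypothetical stand-ins for Sigiri's daemon API, job queue, and persistence layer; a real daemon would call `poll_once` periodically with a sleep between polls and store the ID map in a database.

```python
import itertools
import queue
from abc import ABC, abstractmethod


class ResourceManagerAdapter(ABC):
    """Hypothetical extension point: one subclass per resource manager (PBS, LSF, EC2, ...)."""

    @abstractmethod
    def translate(self, job_description: str, language: str) -> str:
        """Convert a JSDL/RSL/... description into the resource manager's native format."""

    @abstractmethod
    def submit(self, native_job: str) -> str:
        """Submit the native job and return the resource manager's job ID."""


class Daemon:
    """Shadows one compute resource: drains the job queue and submits pending jobs."""

    def __init__(self, adapter: ResourceManagerAdapter, job_queue: queue.Queue):
        self.adapter = adapter
        self.job_queue = job_queue
        self.id_map = {}  # internal job ID -> resource manager job ID (a database in practice)

    def poll_once(self) -> int:
        """One polling cycle; returns the number of jobs submitted."""
        submitted = 0
        while True:
            try:
                internal_id, description, language = self.job_queue.get_nowait()
            except queue.Empty:
                return submitted
            native_job = self.adapter.translate(description, language)
            self.id_map[internal_id] = self.adapter.submit(native_job)
            submitted += 1


class FakePBSAdapter(ResourceManagerAdapter):
    """Toy adapter for illustration only; it does not contact a real PBS server."""

    def __init__(self):
        self._ids = itertools.count(1000)

    def translate(self, job_description, language):
        return f"#PBS -N demo\n# translated from {language}\n{job_description}"

    def submit(self, native_job):
        return f"{next(self._ids)}.pbs-server"


if __name__ == "__main__":
    q = queue.Queue()
    q.put(("job-1", "echo hello", "JSDL"))
    daemon = Daemon(FakePBSAdapter(), q)
    print(daemon.poll_once(), daemon.id_map)  # 1 {'job-1': '1000.pbs-server'}
```

Because the daemon only pulls from a queue, adding a resource manager means writing one adapter subclass; the web service and queue are unchanged.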
Integration of Cloud Computing Resources
- A unique set of dynamically loaded and configured extensions handles security, schedules jobs, and performs the required data movements
- Enables scientists to interact with multiple cloud providers within the same system
Features
- Extensions can be written as modules independent of other extensions, each typically carrying out a single task
- Enforced failure handling prevents orphaned VMs and resources
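Enforced failure handling over a chain of single-task extensions can be sketched as: run extensions in order and, if any one fails, call `cleanup` on every extension that was started, in reverse order, before re-raising. The `Extension` base class and the two example extensions are hypothetical; they only simulate provisioning and data movement.

```python
class Extension:
    """Hypothetical cloud extension: performs one task and knows how to undo it."""

    def run(self, ctx: dict) -> None:
        raise NotImplementedError

    def cleanup(self, ctx: dict) -> None:
        """Undo whatever run() did; must tolerate partially completed work."""


def run_extensions(extensions, ctx):
    """Run extensions in order; on any failure, clean up in reverse order and re-raise."""
    started = []
    try:
        for ext in extensions:
            started.append(ext)  # include the failing one: it may have half-created a VM
            ext.run(ctx)
    except Exception:
        for ext in reversed(started):
            try:
                ext.cleanup(ctx)
            except Exception:
                pass  # keep releasing the remaining resources; a real system would log this
        raise


class StartVM(Extension):
    def run(self, ctx):
        ctx["vm"] = "vm-123"  # stands in for provisioning a VM

    def cleanup(self, ctx):
        ctx.pop("vm", None)  # stands in for terminating it so it is not orphaned


class StageData(Extension):
    def run(self, ctx):
        raise OSError("transfer failed")  # simulated data-movement failure
```

With `run_extensions([StartVM(), StageData()], ctx)`, the simulated VM is released before the `OSError` reaches the caller, so no orphaned resource is left behind.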
Security
Client security (between the client and the web service layer)
- Support for both transport-level security (using SSL) and application-layer security (using WS-Security)
- Client negotiation of security credentials with WS-Security Policy support within Apache Axis2
Compute resource security
- The system can store different types of security credentials: username/password combinations and X.509 credentials
Performance Evaluation
Test scenarios
- Case 1: Jobs arrive at the system as a burst of concurrent submissions from a controlled number of clients. Each client waits for all its jobs to finish before submitting the next set. For example, in the test with 100 clients, each client sends 1 job to the server, so 100 jobs arrive at the server in parallel.
- Case 2: Each client submits 10 jobs with varying execution times in sequence, with no delay between submissions; the client does not block after submitting a job.
- Failure rate and server performance are measured from the client's point of view, and the number of simultaneous clients is systematically increased.
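The two test scenarios can be sketched as a small load-test harness. Here `submit` is any callable that sends one job request and returns once the server acknowledges it (for instance, a hypothetical wrapper around the web-service client), and threads in one process stand in for the separate client JVMs used in the actual tests. Case 1 is shown as a single round.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_submit(submit, job_id):
    """Call `submit` once; return (response time in seconds, exception or None)."""
    start = time.perf_counter()
    try:
        submit(job_id)
        return time.perf_counter() - start, None
    except Exception as exc:  # any client-visible error counts as a failure
        return time.perf_counter() - start, exc


def summarize(results):
    times = [t for t, _ in results]
    failures = sum(1 for _, err in results if err is not None)
    return {
        "requests": len(results),
        "failure_rate": failures / len(results),
        "mean_response_s": sum(times) / len(times),
    }


def case1_burst(submit, num_clients):
    """Case 1 (one round): all clients submit one job concurrently; wait for every response."""
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        results = list(pool.map(lambda i: timed_submit(submit, f"client-{i}-job-0"),
                                range(num_clients)))
    return summarize(results)


def case2_sequential(submit, num_clients, jobs_per_client=10):
    """Case 2: each client submits its jobs back to back; `submit` returns on
    acknowledgement, so the client never waits for a job to finish."""
    def one_client(i):
        return [timed_submit(submit, f"client-{i}-job-{j}") for j in range(jobs_per_client)]

    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        per_client = list(pool.map(one_client, range(num_clients)))
    return summarize([r for client in per_client for r in client])
```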
Performance Evaluation: Baseline Measurements
Performance Evaluation: Metrics
Performance Evaluation: Scalability Metrics
Performance Evaluation
Experimental setup
- Daemon hosted on the gatekeeper node of the Big Red cluster (quad-core IBM PowerPC at 1.6 GHz with 8 GB of physical memory)
- System web service and database co-hosted on a machine with four 2.6 GHz dual-core processors and 32 GB of RAM
- Neither of these nodes was dedicated to our experiment while the tests were running
Client environment
- Set up within the 128-node Odin cluster (each node has a dual AMD 2.0 GHz Opteron processor with 4 GB of physical memory)
- All client nodes were used in dedicated mode, and each client ran in a separate Java virtual machine to eliminate any external overhead
Data collection
- Each test was run (number of clients × 10) times and the results were averaged
- Each parameter was tested for 100 to 1000 concurrent clients
- A total of 110,000 tests were run
- GRAM4 results from the GRAM4 evaluation paper [Marru08] were used for performance comparison
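One reading of the data-collection procedure reproduces the stated total. It assumes, since the slide does not say, that client counts were stepped by 100 and that both test cases were run.

```python
# Assumption: client counts 100, 200, ..., 1000; each configuration is run
# (number of clients x 10) times; both test cases (Case 1 and Case 2) are run.
client_counts = range(100, 1001, 100)
runs_per_case = sum(n * 10 for n in client_counts)  # 10 * (100 + 200 + ... + 1000)
total_runs = runs_per_case * 2                      # Case 1 and Case 2
print(runs_per_case, total_runs)                    # 55000 110000
```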
Results: Baseline Measurements
[Figures: Case 1; Case 2]
- All overheads scale proportionally to the number of clients
- No failures
Results: Metrics for Test Cases 1 and 2
- Both response time and total overhead scale proportionally to the number of clients
- No failures
Results: Scalability Metrics
[Figures: Case 1; Case 2]
- Failures: none with Sigiri; for GRAM, failures start at 300 clients
Applications: LEAD
Motivations
- Grid middleware reliability and scalability study [Marru08] and workflow failure rates
- Components of the LEAD infrastructure were considered for adaptation to other scientific environments
- Sigiri was initially prototyped to support LoadLeveler, PBS, and LSF
Implications
- Improved workflow success rates
- Mitigated the need for Globus middleware
- Ability to work with non-standard job managers
Applications: LEAD II
- Emergence of community-driven, production-quality workflow infrastructures, e.g., the Trident Scientific Workflow Workbench with Windows Workflow Foundation
- Possibility of using alternate supercomputing resources, e.g., the recent port of the WRF (Weather Research and Forecasting) model to the Windows platform and Azure
- Support for Windows-based scientific computing environments
Background: LEAD II and the Vortex2 Experiment
- May 1, 2010 to June 15, 2010: ~6 weeks, 7 days per week
- The workflow started on the hour, every hour, each morning
- It had to find and bind to the latest model data (i.e., RUC 13 km and ADAS data) to set initial and boundary conditions
- If model data was not available at NCEP and the University of Oklahoma, the workflow could not begin
- Execution of the complete WRF stack within 1 hour
Trident Vortex2 Workflow
- The bulk of the time (50 min) is spent in the LEAD Workflow Proxy Activity
[Figure: Trident Vortex2 workflow, showing the Sigiri integration]
Applications: Enabling Geo-Science Applications on Windows Azure
Geo-science applications
- High resource requirements: compute-intensive, dedicated HPC hardware, e.g., the Weather Research and Forecasting (WRF) model
- Emergence of ensemble applications: a large number of small jobs, e.g., examining each air layer over a long period of time
- A single experiment = about 14,000 jobs, each taking a few minutes to complete
Geo-Science Applications: Opportunities
- Cloud computing resources: on-demand access to "unlimited" resources; flexibility through worker roles and VM roles
- Recent porting of geo-science applications: WRF and the WRF Preprocessing System (WPS) ported to Windows
- Increased use of ensemble applications (large numbers of small runs)
- Production-quality, open-source scientific workflow systems, e.g., Microsoft Trident
Research Vision
Enabling geo-science experiments
- Types of applications: compute-intensive, ensembles
- Types of scientists: meteorologists, atmospheric scientists, emergency management personnel, geologists
- Utilizing both cloud computing and grid computing resources
- Utilizing open-source, production-quality scientific workflow environments
- Improved data and metadata management
[Figure: Geo-Science Applications, Scientific Workflows, Compute Resources]
Proposed Framework
[Figure: framework components: Trident Activity, Web Service, Job Queue, Sigiri Job Mgmt. Daemons, Azure Management API, Azure Blob Store, Azure Fabric, Azure Custom VM Images, and a VM Instance (Windows 2008 R2) running IIS, WRF, the Sigiri Worker Service, and MSMPI]
Applications: PRAGMA Testbed Support
- Pacific Rim Application and Grid Middleware Assembly (PRAGMA) [Zheng06]: an open international organization founded in 2002 to focus on practical issues of building international scientific collaborations
- In 2010, Indiana University (IU) joined PRAGMA and added a dedicated cluster to the testbed
- Sigiri was used within the IU PRAGMA testbed
  - The IU PRAGMA testbed required a light-weight system that could be installed and maintained with minimal effort
  - The IU PRAGMA team wanted to evaluate adding cloud resources to the testbed with little or no change to interfaces
- In 2011, the PRAGMA-Opal-Sigiri integration was demonstrated successfully
Related Work
Scientific job management systems
- Grid Resource Allocation and Management (GRAM) [Foster05], Condor-G [Frey02], Nimrod/G [Buyya00], GridWay [Huedo05], SAGA [Goodale06], and Falkon [Raicu07] provide uniform job management APIs, but are tightly integrated with complex middleware to address a broad range of problems
- The Carmen [Watson81] project provided a cloud environment that enabled collaboration between neuroscientists, but requires all programs to be packaged as WS-I [Ballinger04] compliant web services
- Condor [Frey02] pools can also be used to unify certain compute resource interactions
  - Uses the Globus Toolkit [Foster05] (with GRAM underneath)
  - Poor failure recovery
  - Overlooks the failure modes of a cloud platform
Related Work
Scientific research and cloud computing
- IaaS, PaaS, and SaaS environment evaluations
- Scientists have mainly evaluated the use of IaaS services for scientific job executions [Abadi09][Hoffa08][Keahey08][Yu05], owing to the ease of setting up custom environments and the control they offer
- Growing interest in using PaaS services [Humphrey10][Lu10][Qiu09]
- Optimization to balance the cost and time of executions [Deelman08][Yu05]
- Startup overheads [Chase03][Figueiredo03][Foster06][Sotomayor06][Keahey07]
Job prediction algorithms
- Prediction of execution times [Smith], job start times [Li04], queue wait times [Nurmi07], and resource requirements [Julian04]
- AI-based and statistical-modeling-based approaches
- AppLeS [Berman03] argues that a good scheduler must involve some prediction of application and system performance
Reliability of compute resources
- Birman [Birman05] and aspects of resources causing system reliability issues
- Statistical modeling to predict failures [Kandaswamy08]
Conclusion
User-inspired management of scientific jobs
- Concentrates on identifying user patterns and perceptions
- Harnesses historical information and applies the knowledge gained to improve scientific job executions
- Argues that patterns, if identified based on individual users, can reveal important information for making sophisticated estimates of resource requirements
- Evaluations demonstrate the usability of predictions for a meta-scheduler, especially one integrated into a community gateway, to improve its scheduling decisions
Resource abstraction service
- Helps mid-scale scientists obtain access to resources that are cheap and available
- Strives to do so with a tool that is easy to set up and administer
- The prototype implementation introduced and discussed is integrated and used in different domains and scientific applications
- The applications demonstrate how our research contributed to advancing science in the respective domains
Contributions
Propose and empirically demonstrate the use of user patterns, deduced by knowledge-based approaches, to provision compute resources and reduce the impact of startup overheads in cloud computing environments.
Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
Propose and demonstrate the effectiveness and applicability of a lightweight and reliable resource abstraction service that hides the complexities of interacting with multiple resource managers in grids and clouds.
Build a prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and integrate it with four different application domains to prove its usability.
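One hedged way to make user-perceived reliability concrete as a selection criterion: score each candidate resource by the fraction of that user's past jobs on it that succeeded, with add-one smoothing so a resource the user has never tried scores 0.5, and rank the candidates by that score. The `Execution` record, the `rank_by_perceived_reliability` function and the smoothing rule are illustrative assumptions; the reliability model used in this work may differ.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Execution:
    """One historical job outcome as seen by a user (hypothetical schema)."""
    user: str
    resource: str
    succeeded: bool


def rank_by_perceived_reliability(history: Iterable[Execution], user: str,
                                  candidates: List[str]) -> List[Tuple[str, float]]:
    """Rank candidate resources by the user's smoothed success rate.

    Uses (successes + 1) / (attempts + 2), so a resource the user has never
    tried scores 0.5 instead of 0 or 1. Other users' outcomes are ignored.
    """
    attempts: Counter = Counter()
    successes: Counter = Counter()
    for ex in history:
        if ex.user != user:
            continue  # only this user's experience counts
        attempts[ex.resource] += 1
        if ex.succeeded:
            successes[ex.resource] += 1

    scores = [(r, (successes[r] + 1) / (attempts[r] + 2)) for r in candidates]
    # Highest perceived reliability first; ties keep candidate order (sort is stable).
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

For instance, if alice succeeded on 4 of 5 jobs on "gridA" and 2 of 2 on "cloudB" and never used "clusterC", the ranking is cloudB (0.75), gridA (about 0.71), clusterC (0.5), even if other users saw failures on cloudB, because only alice's own history is counted.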
Future Work
Short-term research directions
Integration of future job predictions and user-perceived reliability predictions
Evolving the resource abstraction service to support more compute resources
Management of ensemble runs
Fault tolerance with proactive replication
Long-term research directions
Thank You!
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 

Dernier (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

  • 6. Outline: Mid-Range Science (Challenges and Opportunities; Current Landscape); Research (Research Questions; Contributions); Mining Historical Information to Find Patterns and Experiences (Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]; Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]); Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]; Applications; Related Work; Conclusion and Future Work.
  • 7. Research Questions: “Can user patterns and experiences be used to improve scientific job executions in large-scale systems?” “Can a simple, reliable, and highly scalable uniform resource abstraction be achieved to interact with a variety of compute resource providers?” “Can these be put to use to advance science?”
  • 8. Contributions: (1) Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments. (2) Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection. (3) Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service that hides the complexities of interacting with multiple resource managers in grids and clouds. (4) A prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and its integration with four different application domains to demonstrate its usability.
  • 9. Outline: Mid-Range Science (Challenges and Opportunities; Current Landscape); Research (Research Questions; Contributions); Mining Historical Information to Find Patterns and Experiences (Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]; Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]); Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]; Applications; Related Work; Conclusion and Future Work.
  • 10. Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds. Objective: reduce the impact of startup overheads for time-critical applications. Problem space: workflows can have multiple paths; workflow descriptions are not available; predictions are needed to identify the job execution sequence, by learning from user behavioral patterns to predict future jobs. Research outline: an algorithm to predict future jobs by extracting user patterns from historical information, using knowledge-based techniques; it starts from zero knowledge or from pre-populated job information describing the connections between jobs, and the similar cases it retrieves are used to predict future jobs, reducing high startup overheads. Algorithm assessment: two different workloads, one of individual scientific jobs executed at LANL and one of a set of workflows executed by three users.
  • 11. Demonstration of User Patterns with Workflows: the suite of workflows can differ from domain to domain, e.g. with WRF (Weather Research and Forecasting) as the upstream node. User patterns reveal the sequence of jobs while taking different users and domains into consideration, which is useful for a science gateway serving a wide range of mid-scale scientists. [Figure: WRF as the upstream node for Weather Predictions, Crop Predictions, Wind Farm Location Evaluations, and Wild Fire Propagation Simulation]
  • 12. Role of Successful Predictions to Reduce Startup Overheads: the largest gain is achieved when prediction accuracy is high and the setup time (s) is large relative to the execution time (t). Let r be the probability of a successful prediction (the prediction accuracy). Percentage time reduction: [equation]. For simplicity, assuming equal job execution and startup times, percentage time reduction: [equation].
  • 13. Relationship of Predictions to Execution Time. Observations: the percentage time reduction increases with prediction accuracy, and the time reduction decreases exponentially as the work-to-overhead ratio increases, so the critical point must be found for a given situation: fix the required percentage time reduction for a given t/s ratio and find the required prediction accuracy. Cost of wrong predictions: it depends on the compute resource; the high prediction accuracies demonstrated (~90%) reduce the impact of wrong predictions, compromising cost to improve time. Percentage time reduction: [equation]. Accuracy of predictions = total successful future job predictions / total predictions.
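The equation on these two slides did not survive extraction. As a sketch only, one simple expected-time model reproduces the qualitative claims (it is an assumption, not necessarily the thesis formula): without pre-provisioning a job costs s + t; a correct prediction, with probability r, pre-provisions the resource so only t remains; a wrong prediction still costs s + t. The expected saving is then r·s, a reduction of 100·r·s/(s + t) percent, or 50·r percent when s = t. Inverting it gives the accuracy needed for a target reduction at a given t/s ratio. The function names are hypothetical.

```python
def percentage_time_reduction(r: float, s: float, t: float) -> float:
    """Expected % reduction in end-to-end time under the assumed model.

    r: prediction accuracy (probability the next job was predicted correctly)
    s: startup/setup time of a compute resource
    t: job execution time
    """
    baseline = s + t                             # no pre-provisioning at all
    with_prediction = r * t + (1 - r) * (s + t)  # expected time with predictions
    return 100.0 * (baseline - with_prediction) / baseline


def required_accuracy(target_pct: float, t_over_s: float) -> float:
    """Accuracy r needed for a target % reduction at a given t/s ratio."""
    # 100 * r * s / (s + t) = target  =>  r = (target / 100) * (1 + t/s)
    r = target_pct / 100.0 * (1.0 + t_over_s)
    if r > 1.0:
        raise ValueError("target not reachable even with perfect predictions")
    return r


if __name__ == "__main__":
    # With s == t the reduction is r / 2, so 90% accuracy gives about 45%.
    print(round(percentage_time_reduction(0.9, s=3.0, t=3.0), 6))
    # Accuracy needed to cut total time by 30% when t == s.
    print(round(required_accuracy(30.0, t_over_s=1.0), 6))
```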
  • 14. Prediction Engine: System Architecture [Figure: architecture diagram including the Prediction Retriever]
  • 15. Use of Reasoning: cases are stored and retrieved. Steps: (1) retrieval of similar cases, using similarity measurement and thresholds; (2) reuse of old cases, through case adaptation; (3) storage.
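A minimal sketch of that retrieve/reuse/store cycle, assuming (hypothetically) that a case is a dictionary of job attributes paired with the job that followed, that similarity is the fraction of matching attribute values, that the threshold is 0.5, and that adaptation is a simple majority vote. `Case` and `CaseBase` are illustrative names, not the system's classes.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Case:
    attributes: dict  # job attributes, e.g. {"user": "alice", "job": "WRF"}
    next_job: str     # the job that actually followed (what we want to predict)


@dataclass
class CaseBase:
    threshold: float = 0.5                     # hypothetical similarity cut-off
    cases: list = field(default_factory=list)  # starts from zero knowledge

    @staticmethod
    def similarity(a: dict, b: dict) -> float:
        # Similarity measurement: fraction of attributes with equal values.
        keys = set(a) | set(b)
        if not keys:
            return 0.0
        return sum(a.get(k) == b.get(k) for k in keys) / len(keys)

    def retrieve(self, query: dict) -> list:
        # Retrieval: keep stored cases whose similarity meets the threshold.
        return [c for c in self.cases
                if self.similarity(query, c.attributes) >= self.threshold]

    def predict_next_job(self, query: dict) -> Optional[str]:
        # Reuse/adaptation: majority vote over the retrieved cases' next jobs.
        similar = self.retrieve(query)
        if not similar:
            return None  # nothing similar enough: make no prediction
        return Counter(c.next_job for c in similar).most_common(1)[0][0]

    def retain(self, attributes: dict, actual_next_job: str) -> None:
        # Storage: record the observed outcome for future retrievals.
        self.cases.append(Case(attributes, actual_next_job))


if __name__ == "__main__":
    cb = CaseBase()
    cb.retain({"user": "alice", "job": "WRF"}, "CropPrediction")
    cb.retain({"user": "alice", "job": "WRF"}, "CropPrediction")
    cb.retain({"user": "bob", "job": "WRF"}, "WindFarmEvaluation")
    print(cb.predict_next_job({"user": "alice", "job": "WRF"}))  # CropPrediction
```

A returned `None` would mean no resource is provisioned ahead of time, which is the safe default when the case base has no relevant history.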
  • 16. Case Similarity Calculation: each case is represented by a set of attributes, selected by their effect on the goal variable (the next job).
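The slide does not say how an attribute's effect on the goal variable is measured. One common choice, used here purely as an assumed stand-in, is information gain with respect to the next job; the gains then serve as weights in a weighted similarity. The attribute names, the `min_gain` cut-off, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels: list) -> float:
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(rows: list, attr: str, goal: str = "next_job") -> float:
    """Reduction in uncertainty about the goal variable once `attr` is known."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.get(attr)].append(row[goal])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[goal] for row in rows]) - remainder


def select_attributes(rows: list, candidates: list, min_gain: float = 0.1) -> dict:
    """Keep attributes whose gain reaches min_gain; the gain becomes the weight."""
    gains = {a: information_gain(rows, a) for a in candidates}
    return {a: g for a, g in gains.items() if g >= min_gain}


def weighted_similarity(a: dict, b: dict, weights: dict) -> float:
    """Weighted fraction of selected attributes on which two cases agree."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(w for attr, w in weights.items() if a.get(attr) == b.get(attr)) / total


if __name__ == "__main__":
    history = [
        {"user": "alice", "queue": "long", "next_job": "CropPrediction"},
        {"user": "alice", "queue": "short", "next_job": "CropPrediction"},
        {"user": "bob", "queue": "long", "next_job": "WindFarmEvaluation"},
        {"user": "bob", "queue": "short", "next_job": "WindFarmEvaluation"},
    ]
    # Only "user" carries information about the next job in this toy history.
    print(select_attributes(history, ["user", "queue"]))
```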
  • 17. Evaluation. Use cases: an individual job workload[1] of 40k jobs over two years from the 1024-node CM-5 at Los Alamos National Lab, and a workflow use case in which the system neither sees nor assumes the workflow specification. Experimental setup: 2.0GHz dual-core processor, 4GB memory, 64-bit Windows operating system. [1] Parallel Workload Archive: http://www.cs.huji.ac.il/labs/parallel/workload/
  • 18. Evaluation: Average Accuracy of Predictions. Individual jobs workload: ~75% accurate predictions with user patterns vs. ~32% with service names. Workflow workload: ~95% accurate predictions with user patterns vs. ~53% with service names.
  • 19. Evaluation: Time Saved: the amount of time that can be saved if resources are already provisioned when a job is ready to run. Startup time is assumed to be 3 minutes (the average for commercial providers). [Figures: Individual Jobs Workload; Workflow Workload]
  • 20. Evaluation: Prediction Accuracies for Use Cases: predictions based on user patterns perform 2x better than predictions based on service names.
  • 21. Outline: Mid-Range Science (Challenges and Opportunities; Current Landscape); Research (Research Questions; Contributions); Mining Historical Information to Find Patterns and Experiences (Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]; Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]); Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]; Applications; Related Work; Conclusion and Future Work.
  • 22. User Perceived Reliability: failures are tolerated through fault tolerance, high availability, recoverability, etc. [Birman05]. What matters from a user's point of view is whether these failures are visible to the user, e.g. the reliability of commodity hardware (in clouds) vs. user-perceived reliability. This reliability is not that of the resources themselves: it is not derived from halting failures, fail-stop failures, network partitioning failures [Birman05], or machine downtimes. It is a more broadly encompassing system reliability that can only be seen at the user or workflow level, and it can also depend on the user's configuration and job types. We refer to this form of reliability as user-perceived reliability. It matters when selecting a resource to schedule an experiment on and the user has access to multiple compute resources, e.g. in LEAD, the reliability of supercomputing resources vs. Windows Azure resources.
  • 23. Why User Perceived Reliability is Useful: let the user-perceived failure probabilities be $p(A) = 0.2$ for cluster A and $p(B) = 0.3$ for cluster B. Assuming failures on the two clusters are independent, $p(A \cap \bar{B}) = p(A)\,p(\bar{B}) = 0.2 \times (1 - 0.3) = 0.14$ and $p(B \cap \bar{A}) = p(B)\,p(\bar{A}) = 0.3 \times (1 - 0.2) = 0.24$. Since $p(A \cap \bar{B}) < p(B \cap \bar{A})$, try cluster A first and then cluster B.
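The two-cluster argument extends to any number of resources, sketched below under the same independence assumption: trying resources in ascending order of user-perceived failure probability also minimizes the expected number of submissions (1.2 for A then B versus 1.3 for B then A with the slide's numbers). The function names are hypothetical.

```python
def try_order(failure_prob: dict) -> list:
    """Resources sorted by ascending user-perceived failure probability."""
    return sorted(failure_prob, key=failure_prob.get)


def p_first_fails_second_succeeds(first: str, second: str, failure_prob: dict) -> float:
    # Assumes failures on different resources are independent.
    return failure_prob[first] * (1.0 - failure_prob[second])


def expected_attempts(order: list, failure_prob: dict) -> float:
    """Expected number of submissions until one succeeds or the list runs out."""
    expected, p_reach = 0.0, 1.0
    for resource in order:
        expected += p_reach  # this attempt happens only if all earlier ones failed
        p_reach *= failure_prob[resource]
    return expected


if __name__ == "__main__":
    p = {"A": 0.2, "B": 0.3}
    print(try_order(p))                                # ['A', 'B']
    print(p_first_fails_second_succeeds("A", "B", p))  # about 0.14
    print(p_first_fails_second_succeeds("B", "A", p))  # about 0.24
    print(expected_attempts(["A", "B"], p), expected_attempts(["B", "A"], p))
```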
  • 24. Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions. Objective: reduce the impact of low compute resource reliability by deducing user-perceived reliabilities, learning from user experiences and perceptions. Research outline: an algorithm to predict user-perceived reliabilities by learning from user experiences, mining historical information; use of machine learning techniques, with trained classifiers representing compute resources and their reliabilities; prediction of job failures. Algorithm assessment: workloads from the Parallel Workload Archive representing jobs executed on two different supercomputing clusters.
  • 25. System Architecture: a machine learning classifier is trained to learn the user-perceived reliability of each cluster. Classifier types: a static classifier, trained initially from historical information, and a dynamic (updateable) classifier, which starts from zero knowledge and is built while the system is in operation.
  • 26. System Architecture: the classifier manager uses the Weka [Hall09] framework. Classification methods: Naïve Bayes and KStar, each as static and dynamic classifiers, with dynamic pruning of features [Fadishei09] for increased efficiency. Classifier manager: creates and maintains classifiers for each compute resource; a new job is evaluated with these classifiers to deduce the predicted reliability of its execution. Policy implementers: consider resource reliability predictions together with other quality-of-service information (time, cost) to select a resource.
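The thesis uses Weka's Naïve Bayes and KStar implementations; the sketch below is instead a plain-Python stand-in (not Weka's API) illustrating two ideas from these slides: a classifier manager holding one classifier per compute resource, and an updateable categorical Naïve Bayes with Laplace smoothing that can be trained once on history (static) or after every finished job (dynamic). The outcome labels "success"/"failure" and all class names are assumptions.

```python
import math
from collections import defaultdict


class UpdateableNaiveBayes:
    """Categorical Naive Bayes that can learn one finished job at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)    # outcome -> number of jobs
        self.feature_counts = defaultdict(int)  # (outcome, attribute, value) -> count
        self.values_seen = defaultdict(set)     # attribute -> distinct values seen

    def update(self, job: dict, outcome: str) -> None:
        self.class_counts[outcome] += 1
        for attr, value in job.items():
            self.feature_counts[(outcome, attr, value)] += 1
            self.values_seen[attr].add(value)

    def predict_proba(self, job: dict) -> dict:
        total = sum(self.class_counts.values())
        if total == 0:
            return {}  # zero knowledge: no prediction yet
        log_scores = {}
        for outcome, n in self.class_counts.items():
            score = math.log(n / total)
            for attr, value in job.items():
                k = len(self.values_seen.get(attr, ())) + 1  # +1 slot for unseen values
                count = self.feature_counts.get((outcome, attr, value), 0)
                score += math.log((count + 1) / (n + k))     # Laplace smoothing
            log_scores[outcome] = score
        top = max(log_scores.values())
        weights = {o: math.exp(s - top) for o, s in log_scores.items()}
        z = sum(weights.values())
        return {o: w / z for o, w in weights.items()}


class ClassifierManager:
    """One classifier per compute resource."""

    def __init__(self, resources):
        self.classifiers = {r: UpdateableNaiveBayes() for r in resources}

    def record(self, resource: str, job: dict, outcome: str) -> None:
        # Static mode: call only over the training history.
        # Dynamic mode: also call after every finished job.
        self.classifiers[resource].update(job, outcome)

    def success_probability(self, resource: str, job: dict) -> float:
        proba = self.classifiers[resource].predict_proba(job)
        if not proba:
            return 0.5  # no history yet: uninformative guess
        return proba.get("success", 0.0)
```

A policy implementer would then compare `success_probability` across resources alongside time and cost estimates.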
  • 27. Evaluation: workloads from the Parallel Workload Archive [Feitelson]. LANL: two years' worth of jobs, from 1994 to 1996, on the 1024-node CM-5 at Los Alamos National Lab. LPC: ten months' (Aug 2004 to May 2005) worth of job records on a 70 Xeon node cluster at the “Laboratoire de Physique Corpusculaire” of Université Blaise-Pascal, France. Minor cleanups removed intermediate job states. 10,000 jobs were selected from each workload: LANL had 20% failed jobs and LPC had 30% failed jobs.
  • 28. Evaluation. Workload classification and maintenance: Naïve Bayes [John95] and KStar [Cleary95] classifier implementations in Weka [Hall09]. Classifier construction: the static classifier is trained on the first 1,000 jobs; the dynamic classifier uses all 10,000 jobs for both construction and evaluation. Evaluation metrics: average reliability prediction accuracy (the accuracy of predicting whether a job succeeds or fails); time saved (the cumulative time saved by aggregating the execution time of each job that fails and whose failure our system predicted correctly), with the ideal cumulative time that can be saved over time as the baseline; time consumed for classification and for updating the classifier; and the effect of pruning attributes, comparing a static subset of attributes (as proposed by Fadishei et al. [Fadishei09]) with a dynamic subset (checking each attribute's effect on the goal variable).
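The accuracy and time-saved metrics can be computed in one pass over the job trace, sketched below assuming each record is a hypothetical (predicted_failure, actually_failed, execution_time) tuple. As described on the slide, the time-saved figure counts only correctly predicted failures; it does not charge anything for false alarms.

```python
from typing import Iterable, Tuple


def evaluate(records: Iterable[Tuple[bool, bool, float]]) -> Tuple[float, float, float]:
    """Return (accuracy, time_saved, ideal_time_saved).

    Each record is (predicted_failure, actually_failed, execution_time).
    """
    n = correct = 0
    time_saved = ideal = 0.0
    for predicted_failure, failed, exec_time in records:
        n += 1
        correct += predicted_failure == failed
        if failed:
            ideal += exec_time           # baseline: every failure predicted
            if predicted_failure:
                time_saved += exec_time  # failure foreseen, run avoided
    accuracy = correct / n if n else 0.0
    return accuracy, time_saved, ideal


if __name__ == "__main__":
    jobs = [(True, True, 100.0), (False, True, 50.0), (False, False, 30.0), (True, False, 20.0)]
    print(evaluate(jobs))  # (0.5, 100.0, 150.0)
```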
  • 29. Evaluation Metrics: Effect of Job Reliability Predictions on Selecting Compute Resources. An extended version of GridSim [Buyya02] models four compute resources, with NWS [Wolski99] for bandwidth estimation and QBETS [Nurmi07] for queue wait time estimation. Total execution time = data movement time + queue wait time + job execution time (found in the workload). Schedulers: Total Execution Time Priority Scheduler and Reliability Prediction Based Time Priority Scheduler. Metrics: average accuracy of selecting reliable resources to execute jobs, and time wasted due to incorrect selection of compute resources to execute jobs. All evaluations were run on a 3.0GHz dual-core processor with 4GB memory on Windows 7 Professional.
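A sketch of the two scheduling policies under stated assumptions: the per-resource time components would come from estimators such as NWS and QBETS, but here they are plain numbers, and the reliability-based policy is assumed (hypothetically) to discard resources whose predicted success probability falls below a threshold before choosing the fastest one. The slide does not specify the exact combination rule, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ResourceEstimate:
    name: str
    data_movement: float  # estimated transfer time (NWS-style bandwidth forecast)
    queue_wait: float     # estimated queue wait (QBETS-style prediction)
    execution: float      # job execution time
    p_success: float      # predicted user-perceived reliability for this job

    @property
    def total_time(self) -> float:
        return self.data_movement + self.queue_wait + self.execution


def time_priority(resources: list) -> ResourceEstimate:
    """Total execution time priority: pick the smallest estimated total time."""
    return min(resources, key=lambda r: r.total_time)


def reliability_time_priority(resources: list, min_success: float = 0.5) -> ResourceEstimate:
    """Assumed rule: drop resources predicted to fail, then pick the fastest."""
    reliable = [r for r in resources if r.p_success >= min_success]
    return time_priority(reliable or resources)  # fall back if none qualify


if __name__ == "__main__":
    flaky = ResourceEstimate("clusterA", 10.0, 20.0, 100.0, p_success=0.2)
    steady = ResourceEstimate("clusterB", 10.0, 60.0, 100.0, p_success=0.9)
    print(time_priority([flaky, steady]).name)              # clusterA
    print(reliability_time_priority([flaky, steady]).name)  # clusterB
```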
  • 30. Evaluation Metrics Summary
  • 31. Results: Average Reliability Prediction Accuracy. [Figures: Static and Dynamic/Updateable classifiers, for LANL and LPC] LANL accuracy saturates at ~82% and LPC accuracy at ~97%. KStar performed slightly better than Naïve Bayes.
  • 32. Results: Time Savings. [Figures: Static and Dynamic/Updateable classifiers, for LANL and LPC] With the static classifier, KStar saved 90-100%. With the updateable classifier, both KStar and Naïve Bayes saved ~50% for LANL and ~90% for LPC.
  • 33. Results: Time Consumed for Classification and Updating the Classifier. [Figures: Static Classifier; Updateable Classifier] Both the static and updateable Naïve Bayes classifiers take very little time (not included in the graphs).
  • 34. Results: Effect of Pruning Attributes. The static subset of attributes [Fadishei09] performs poorly on this data set and classifier. Dynamic pruning improved prediction accuracy compared to the non-pruned case, but the improvement is marginal. Conclusion: our classifiers handle noisy features well without compromising classification accuracy. Since identifying the attributes to prune is a dynamic and expensive task, the system can be used in practice even without pruning attributes.
  • 35. Results: Effect of Job Reliability Predictions on Selecting Compute Resources. The execution time priority scheduler performs poorly. After the first 1,000 jobs (training), the time wasted with our approach stays fairly constant.
  • 36. Evaluation Conclusion: even though the average prediction accuracy of the static KStar classifier decreased, it learned and predicted failures better than any other method. Although the Naïve Bayes updateable classifier increased the amount of time saved slightly, the time saved with the static KStar classifier is higher than with both methods. Even though its total prediction accuracy lags behind the other methods, the static KStar classifier is ideal for correctly predicting failure cases, with very low overhead. Taking the user-perceived reliability of compute resources into consideration can save a significant amount of time in scientific job executions.
  • 37. Outline: Mid-Range Science (Challenges and Opportunities; Current Landscape); Research (Research Questions; Contributions); Mining Historical Information to Find Patterns and Experiences (Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]; Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]); Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]; Applications; Related Work; Conclusion and Future Work.
  • 38. Scientific Computing Resource Abstraction Layer: there is a wide variety of scientific computing platforms and opportunities. Requirements: support existing job description languages and be extensible to support other languages; provide a uniform and interoperable interface for external entities to interact with; support heterogeneous compute resource manager interfaces and operating platforms from grids, IaaS and PaaS clouds, and departmental clusters; extend to new and future resource managers with minimal changes; provide monitoring and fault recovery, especially when working with utility computing resources; provide a light-weight, robust, and scalable infrastructure; integrate with a variety of workflow environments.
  • 39. Scientific Computing Resource Abstraction Layer. Our contribution: a resource abstraction layer, implemented as a web service, that provides a uniform abstraction over heterogeneous compute resources including grids, clouds, and local departmental clusters. It supports standard job specification languages including, but not limited to, the Job Submission Description Language (JSDL) [Anjomshoaa04] and the Globus Resource Specification Language (RSL); it interacts directly with resource managers, so it requires no grid or meta-scheduling middleware; and it integrates with current resource managers, including LoadLeveler, PBS, LSF, and Windows HPC, as well as the Amazon EC2 and Microsoft Azure platforms. Features: installing and maintaining the system does not require a high level of computer science knowledge (using Globus was a challenge for most non-computer scientists); the involvement of system administrators in installing and maintaining Sigiri is minimal; Sigiri's memory footprint is minimal; other tools require installing most of the heavy Globus stack, but Sigiri does not need a complete stack installation to run (installing Globus on small clusters is something scientists never wanted to do); and it offers better fault tolerance and failure recovery.
  • 40. Architecture: an asynchronous messaging model of message publishers and consumers; daemons shadowing compute resources; distributed deployment of components (daemon, front-end web service, and job queue).
  • 41. Client Interaction Service: deployed as an Apache Axis2 web service to enable interoperability. It accepts job requests and enables management and monitoring functions. The job submission schema does not enforce a schema for the job description, which enables multiple job description languages.
  • 42. Client Interaction Service [Figures: Job Submission Request; Job Submission Response]
  • 43. Daemons: each managed compute resource has a light-weight daemon that periodically checks the job request queue, translates the job specification into a resource-manager-specific language, submits pending jobs, and persists the correlation between the resource manager's job ID and the internal ID. An extensible daemon API enables integration of a wide range of resource managers while keeping their complexities transparent to end users. The queuing-based approach lets daemons run on any compute platform, without any software or operating system requirements. Current support: LSF, PBS, SLURM, LoadLeveler, Amazon EC2, Windows HPC, and Windows Azure.
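The daemon cycle can be sketched as follows. The `ResourceManagerAdapter` interface, the `PbsAdapter` class, and the job-spec keys are hypothetical illustrations, not Sigiri's actual API; the adapter renders standard `#PBS -l` directives but only simulates the `qsub` call, and a dictionary stands in for the persistent store of job-ID correlations.

```python
import queue
from typing import Protocol


class ResourceManagerAdapter(Protocol):
    """Hypothetical per-resource extension point (e.g. one for PBS, one for LSF)."""

    def translate(self, job_spec: dict) -> str: ...

    def submit(self, native_script: str) -> str: ...


class PbsAdapter:
    """Toy adapter: renders a PBS script and simulates the returned job ID."""

    def __init__(self) -> None:
        self._next_id = 1000

    def translate(self, job_spec: dict) -> str:
        return "\n".join([
            "#!/bin/sh",
            f"#PBS -l nodes={job_spec.get('nodes', 1)}",
            f"#PBS -l walltime={job_spec.get('walltime', '01:00:00')}",
            job_spec["executable"],
        ])

    def submit(self, native_script: str) -> str:
        # A real adapter would hand the script to qsub; this one fakes an ID.
        self._next_id += 1
        return f"{self._next_id}.pbs"


def run_daemon_once(job_queue: "queue.Queue", adapter: ResourceManagerAdapter,
                    id_map: dict) -> None:
    """One polling cycle: drain pending requests, submit, record ID correlation."""
    while True:
        try:
            internal_id, job_spec = job_queue.get_nowait()
        except queue.Empty:
            return
        native_id = adapter.submit(adapter.translate(job_spec))
        id_map[internal_id] = native_id  # stands in for the persistent store
```

In a long-running daemon, `run_daemon_once` would be called on a timer, which matches the "periodically checks the job request queue" behaviour on the slide.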
  • 44. Integration of Cloud Computing Resources: a unique set of dynamically loaded and configured extensions handles security, schedules jobs, and performs the required data movements, enabling scientists to interact with multiple cloud providers within the same system. Features: extensions can be written as modules independent of other extensions, each typically carrying out a single task; enforced failure handling prevents orphaned VMs and resources.
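Enforced failure handling can be sketched with `try`/`finally`, so that a VM started for a task is always terminated even when the task raises. `FakeCloud` and `run_on_cloud` are hypothetical stand-ins for a provider client and an extension, not a real SDK.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud provider client."""

    def __init__(self) -> None:
        self.running = set()
        self._counter = 0

    def start_vm(self) -> str:
        self._counter += 1
        vm_id = f"vm-{self._counter}"
        self.running.add(vm_id)
        return vm_id

    def terminate_vm(self, vm_id: str) -> None:
        self.running.discard(vm_id)


def run_on_cloud(cloud: FakeCloud, task: Callable[[str], T]) -> T:
    """Run `task` on a fresh VM and always release the VM afterwards."""
    vm_id = cloud.start_vm()
    try:
        return task(vm_id)
    finally:
        # Runs on success and on failure, so no VM is left orphaned.
        cloud.terminate_vm(vm_id)


if __name__ == "__main__":
    cloud = FakeCloud()
    print(run_on_cloud(cloud, lambda vm: f"ran on {vm}"))  # ran on vm-1
    try:
        run_on_cloud(cloud, lambda vm: 1 / 0)
    except ZeroDivisionError:
        pass
    print(cloud.running)  # set()
```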
  • 45. Security. Client security, between the client and the web service layer: support for both transport-level security (using SSL) and application-layer security (using WS-Security), with client negotiation of security credentials through WS-Security Policy support in Apache Axis2. Compute resource security: the system can store different types of security credentials, such as username/password combinations and X.509 credentials.
  • 46. Performance Evaluation: Test Scenarios. Case 1: jobs arrive at our system as a burst of concurrent submissions from a controlled number of clients, and each client waits for all jobs to finish before submitting the next set. For example, during the test with 100 clients, each client sends 1 job to the server, so 100 jobs arrive at the server in parallel. Case 2: each client submits 10 jobs with varying execution times in sequence, with no delay between submissions, and the client does not block upon submitting a job. Failure rate and server performance are measured from the client's point of view, and the number of simultaneous clients is systematically increased.
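A hedged sketch of a load generator for the two scenarios, using `concurrent.futures.ThreadPoolExecutor`. `submit_job` is a hypothetical stub that only sleeps, standing in for a call to the web service; a real harness would replace it with the actual submission request and record failures as well as response times.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def submit_job(job_id: int) -> float:
    """Hypothetical stub for one submission; returns its response time in seconds."""
    start = time.perf_counter()
    time.sleep(0.001)  # placeholder for the web service round trip
    return time.perf_counter() - start


def case1_burst(num_clients: int) -> list:
    """Case 1: every client submits one job at the same moment."""
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        return list(pool.map(submit_job, range(num_clients)))


def case2_sequential(num_clients: int, jobs_per_client: int = 10) -> list:
    """Case 2: each client sends its jobs back to back, not waiting for them to run."""

    def client(client_id: int) -> list:
        base = client_id * jobs_per_client
        return [submit_job(base + j) for j in range(jobs_per_client)]

    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        per_client = pool.map(client, range(num_clients))
        return [t for times in per_client for t in times]


if __name__ == "__main__":
    for clients in (10, 20):
        times = case1_burst(clients)
        print(clients, "clients, mean response", sum(times) / len(times))
```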
  • 47. Performance Evaluation: Baseline Measurements
  • 48. Performance Evaluation: Metrics
  • 49. Performance Evaluation: Scalability Metrics
  • 50. Performance Evaluation: Experimental Setup. The daemon was hosted on the gatekeeper node of the Big Red cluster (a quad-core IBM PowerPC (1.6GHz) with 8GB of physical memory). The system web service and database were co-hosted on a box with four 2.6GHz dual-core processors and 32GB of RAM. Neither node was dedicated to our experiment while the tests were running. Client environment: set up within the 128-node Odin cluster (each node a dual AMD 2.0GHz Opteron processor with 4GB of physical memory); all client nodes were used in dedicated mode, and each client ran in a separate Java virtual machine to eliminate any external overhead. Data collection: each test was run (number of clients × 10) times and the results were averaged; each parameter was tested for 100 to 1000 concurrent clients; a total of 110,000 tests were run. GRAM4 results from the GRAM4 evaluation paper [Marru08] were used for the system performance comparison.
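The 110,000 total is consistent with a simple schedule, sketched below under two assumptions not stated on the slide: client counts step by 100 from 100 to 1000, and both test cases follow the same schedule.

```python
# Assumptions (not stated on the slide): client counts go from 100 to 1000
# in steps of 100, and both Case 1 and Case 2 use the same schedule.
client_counts = range(100, 1001, 100)
runs_per_case = sum(n * 10 for n in client_counts)  # "number of clients * 10" runs each
total_runs = 2 * runs_per_case                      # Case 1 + Case 2
print(runs_per_case, total_runs)  # 55000 110000
```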
  • 51. Results: Baseline Measurements. All overheads scale proportionally to the number of clients. No failures. (Figure panels: Case 1, Case 2.)
  • 52. Results: Metrics for Test Cases 1 and 2. Both response time and total overhead scale proportionally to the number of clients. No failures.
  • 53. Results: Scalability Metrics. Failures: none with Sigiri; GRAM failures start at 300 clients. (Figure panels: Case 1, Case 2.)
  • 55. Applications: LEAD. Motivations: the grid middleware reliability and scalability study [Marru08] and observed workflow failure rates; components of the LEAD infrastructure were being considered for adaptation to other scientific environments. Sigiri was initially prototyped to support LoadLeveler, PBS and LSF. Implications: improved workflow success rates; a mitigated need for Globus middleware; the ability to work with non-standard job managers.
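One way to picture a uniform interface over LoadLeveler, PBS and LSF is an adapter per job manager that translates a generic "submit this script" request into that manager's command. The sketch below only builds the command strings (`llsubmit`, `qsub`, `bsub` are the managers' real submission commands); the class names and dispatch table are hypothetical and do not describe Sigiri's actual daemon design.

```python
from __future__ import annotations

import shlex
from abc import ABC, abstractmethod


class JobManagerAdapter(ABC):
    """Hypothetical adapter: one subclass per local resource manager."""

    @abstractmethod
    def submit_command(self, script_path: str) -> str: ...


class PBSAdapter(JobManagerAdapter):
    def submit_command(self, script_path: str) -> str:
        return f"qsub {shlex.quote(script_path)}"


class LSFAdapter(JobManagerAdapter):
    def submit_command(self, script_path: str) -> str:
        # bsub parses #BSUB directives when the script is fed on standard input
        return f"bsub < {shlex.quote(script_path)}"


class LoadLevelerAdapter(JobManagerAdapter):
    def submit_command(self, script_path: str) -> str:
        return f"llsubmit {shlex.quote(script_path)}"


ADAPTERS: dict[str, JobManagerAdapter] = {
    "pbs": PBSAdapter(),
    "lsf": LSFAdapter(),
    "loadleveler": LoadLevelerAdapter(),
}


def submit_command_for(manager: str, script_path: str) -> str:
    """Translate a generic submission into the command for the named job manager."""
    try:
        return ADAPTERS[manager.lower()].submit_command(script_path)
    except KeyError:
        raise ValueError(f"unsupported job manager: {manager}") from None
```

Adding support for another job manager then means adding one adapter, without changing the client-facing interface.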
  • 56. Applications: LEAD II. Emergence of community-driven, production-quality workflow infrastructures, e.g., the Trident Scientific Workflow Workbench built on Windows Workflow Foundation. Possibility of using alternative supercomputing resources, e.g., the recent port of the WRF (Weather Research and Forecasting) model to the Windows platform and Azure. Support for Windows-based scientific computing environments.
  • 57. Background: LEAD II and the Vortex2 Experiment. May 1, 2010 to June 15, 2010: about 6 weeks, 7 days per week. The workflow started on the hour, every hour, each morning. It had to find and bind to the latest model data (i.e., RUC 13 km and ADAS data) to set initial and boundary conditions; if the model data was not available at NCEP and the University of Oklahoma, the workflow could not begin. The complete WRF stack had to execute within 1 hour.
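The per-cycle decision described on the slide (bind the newest RUC and ADAS data, otherwise do not start; finish within the hour) can be sketched as follows. The function names and the use of an expected-runtime estimate are hypothetical; the actual LEAD II/Trident logic is not described here.

```python
from __future__ import annotations

from datetime import datetime, timedelta

DEADLINE = timedelta(hours=1)  # the complete WRF stack had to finish within 1 hour


def latest_available(cycle_start: datetime, available: list[datetime]) -> datetime | None:
    """Newest dataset valid at or before the cycle start, or None if nothing is available."""
    return max((t for t in available if t <= cycle_start), default=None)


def plan_cycle(cycle_start: datetime, ruc_times: list[datetime],
               adas_times: list[datetime], expected_runtime: timedelta):
    """Return the (RUC, ADAS) data times to bind, or None if this cycle cannot start."""
    ruc = latest_available(cycle_start, ruc_times)
    adas = latest_available(cycle_start, adas_times)
    if ruc is None or adas is None:
        return None  # no initial/boundary-condition data: the workflow cannot begin
    if expected_runtime > DEADLINE:
        return None  # the forecast would not finish within the hourly window
    return ruc, adas
```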
  • 58. Trident Vortex2 Workflow. The bulk of the time (50 min) is spent in the Lead Workflow Proxy Activity. (Figure annotation: Sigiri Integration.)
  • 59. Applications: Enabling Geo-Science Applications on Windows Azure. Geo-science applications have high resource requirements: they are compute-intensive and need dedicated HPC hardware, e.g., the Weather Research and Forecasting (WRF) model. Ensemble applications are emerging: a large number of small jobs, e.g., examining each air layer over a long period of time. A single experiment comprises about 14,000 jobs, each taking a few minutes to complete.
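To see why such ensembles suit elastic cloud resources, consider an idealized makespan estimate for 14,000 independent jobs. The 5-minute job duration is an assumption standing in for "a few minutes", and the model ignores queueing, VM startup and data staging.

```python
import math


def ensemble_makespan_minutes(num_jobs: int, minutes_per_job: float, workers: int) -> float:
    """Idealized makespan for identical, independent jobs.

    Ignores queueing delays, VM startup overhead and data staging, so real
    runs will take longer.
    """
    waves = math.ceil(num_jobs / workers)  # rounds of `workers` concurrent jobs
    return waves * minutes_per_job


for workers in (1, 100, 1000):
    print(workers, ensemble_makespan_minutes(14_000, 5, workers))
```

Under these assumptions the experiment takes about 48.6 days on one worker, about 11.7 hours on 100 workers, and 70 minutes on 1000 workers.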
  • 60. Geo-Science Applications: Opportunities. Cloud computing resources: on-demand access to “unlimited” resources; flexibility through worker roles and VM roles. Recent porting of geo-science applications: WRF and the WRF Preprocessing System (WPS) ported to Windows. Increased use of ensemble applications (large numbers of small runs). Production-quality, open-source scientific workflow systems, e.g., Microsoft Trident.
  • 61. Research Vision: enabling geo-science experiments. Types of applications: compute-intensive and ensemble. Types of scientists: meteorologists, atmospheric scientists, emergency management personnel, geologists. Utilizing both cloud computing and grid computing resources. Utilizing open-source, production-quality scientific workflow environments. Improved data and metadata management. (Figure: Geo-Science Applications, Scientific Workflows, Compute Resources.)
  • 62. Proposed Framework. (Architecture diagram components: Trident Activity, Web Service, Sigiri Job Mgmt. Daemons, Job Queue, Azure Management API, Azure Fabric, Azure Blob Store, Azure Custom VM Images, and a VM Instance with IIS, WRF, Sigiri Worker Service, MSMPI and Windows 2008 R2.)
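The framework diagram places a job queue between the management daemons and worker services running on VM instances. A minimal sketch of that pull-based pattern, assuming an in-process `queue.Queue` stands in for the Azure queue and `run_job` is a hypothetical placeholder for launching WRF/MPI on the VM (Azure-specific behavior such as message visibility timeouts is not modeled):

```python
from __future__ import annotations

import queue
import threading


def run_job(payload: int) -> int:
    # Hypothetical stand-in for running WRF on the VM; here it just squares a number.
    return payload * payload


def worker(jobs: queue.Queue, results: dict, lock: threading.Lock, name: str) -> None:
    """Hypothetical worker service: pull jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        try:
            if job is None:
                return  # sentinel: no more work for this worker
            job_id, payload = job
            output = run_job(payload)
            with lock:
                results[job_id] = (name, output)
        finally:
            jobs.task_done()


def run_pool(payloads: list[int], num_workers: int) -> dict:
    """Enqueue all jobs, let the workers drain the queue, and collect the results."""
    jobs: queue.Queue = queue.Queue()
    results: dict = {}
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(jobs, results, lock, f"vm-{i}"))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for job_id, payload in enumerate(payloads):
        jobs.put((job_id, payload))
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Because workers pull work rather than having it pushed to them, adding VM instances increases throughput without changes on the submitting side.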
  • 63. Applications: PRAGMA Testbed Support. Pacific Rim Applications and Grid Middleware (PRAGMA) [Zheng06] is an open international organization founded in 2002 to focus on practical issues of building international scientific collaborations. In 2010, Indiana University (IU) joined PRAGMA and added a dedicated cluster to the testbed. Sigiri was used within the IU PRAGMA testbed: the testbed required a lightweight system that could be installed and maintained with minimal effort, and the IU PRAGMA team wanted to evaluate adding cloud resources to the testbed with little or no change to its interfaces. In 2011, the PRAGMA-Opal-Sigiri integration was demonstrated successfully.
  • 65. Related Work: Scientific Job Management Systems. Grid Resource Allocation and Management (GRAM) [Foster05], Condor-G [Frey02], Nimrod/G [Buyya00], GridWay [Huedo05], SAGA [Goodale06] and Falkon [Raicu07] provide uniform job management APIs, but are tightly integrated with complex middleware to address a broad range of problems. The Carmen [Watson81] project provided a cloud environment that enabled collaboration between neuroscientists, but it requires all programs to be packaged as WS-I [Ballinger04] compliant Web services. Condor [Frey02] pools can also be used to unify certain compute resource interactions; however, they use the Globus Toolkit [Foster05] (with GRAM underneath), have poor failure recovery, and overlook the failure modes of a cloud platform.
  • 66. Related Work. Scientific research and cloud computing: evaluations of IaaS, PaaS and SaaS environments. Scientists have mainly evaluated IaaS services for scientific job execution [Abadi09][Hoffa08][Keahey08][Yu05], owing to the ease of setting up custom environments and the control they offer, and there is growing interest in PaaS services [Humphrey10][Lu10][Qiu09]. Other work addresses optimization to balance execution cost and time [Deelman08][Yu05] and startup overheads [Chase03][Figueiredo03][Foster06][Sotomayor06][Keahey07]. Job prediction algorithms: prediction of execution times [Smith], job start times [Li04], queue-wait times [Nurmi07] and resource requirements [Julian04], using AI-based and statistical-modeling approaches. AppLeS [Berman03] argues that a good scheduler must involve some prediction of application and system performance. Reliability of compute resources: Birman [Birman05] on aspects of resources that cause system reliability issues; statistical modeling to predict failures [Kandaswamy08].
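History-based execution-time prediction can be illustrated with a simple template scheme: group past jobs by shared attributes and predict the mean runtime of the most specific matching group. The attribute set (user, application, core count) and the fallback order below are assumptions in the spirit of template-based predictors; this is not the exact method of [Smith] or of the thesis.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import mean


class RuntimePredictor:
    """Hypothetical history-based predictor: mean runtime of past jobs sharing a template."""

    def __init__(self) -> None:
        self._history: dict[tuple, list[float]] = defaultdict(list)  # template -> runtimes (s)

    @staticmethod
    def _templates(user: str, app: str, cores: int) -> list[tuple]:
        # Most specific template first, then progressively coarser ones.
        return [(user, app, cores), (user, app), (app,)]

    def record(self, user: str, app: str, cores: int, runtime: float) -> None:
        for t in self._templates(user, app, cores):
            self._history[t].append(runtime)

    def predict(self, user: str, app: str, cores: int) -> float | None:
        """Mean runtime of the most specific template with history, or None if none exists."""
        for t in self._templates(user, app, cores):
            runtimes = self._history.get(t)
            if runtimes:
                return mean(runtimes)
        return None
```

Placing the user first in the most specific templates reflects the thesis argument that per-user patterns carry information that system-wide averages lose.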
  • 68. Conclusion. User-inspired management of scientific jobs: concentrates on identifying user patterns and perceptions, harnesses historical information, and applies the knowledge gained to improve scientific job executions. It argues that patterns, if identified per individual user, can reveal important information for making sophisticated estimates of resource requirements. Evaluations demonstrate that the predictions can help a meta-scheduler, especially one integrated into a community gateway, improve its scheduling decisions. Resource abstraction service: helps mid-scale scientists obtain access to resources that are cheap and available, and strives to do so with a tool that is easy to set up and administer. The prototype implementations introduced and discussed are integrated and used in different domains and scientific applications; these applications demonstrate how our research contributed to advancing science in the respective domains.
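As an illustration of turning per-user patterns into provisioning decisions, the sketch below boots a VM ahead of hours in which a user habitually submits jobs, so the VM startup overhead is hidden. This is a deliberately simple frequency heuristic swapped in for illustration; the thesis uses knowledge-based approaches, and the function names, threshold and overhead values are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from datetime import datetime


def hourly_submission_rate(submissions: list[datetime]) -> dict[int, float]:
    """Fraction of observed days on which the user submitted a job during each hour."""
    days = {ts.date() for ts in submissions}
    days_per_hour: dict[int, set] = defaultdict(set)
    for ts in submissions:
        days_per_hour[ts.hour].add(ts.date())
    return {hour: len(d) / len(days) for hour, d in days_per_hour.items()}


def preprovision_plan(submissions: list[datetime], startup_overhead_min: int,
                      threshold: float = 0.5) -> list[tuple[int, int]]:
    """Return (hour, boot minute of day) pairs for hours the user usually submits in.

    Each VM is booted `startup_overhead_min` minutes before the hour starts,
    wrapping around midnight when necessary.
    """
    plan = []
    for hour, rate in sorted(hourly_submission_rate(submissions).items()):
        if rate >= threshold:
            boot = (hour * 60 - startup_overhead_min) % (24 * 60)
            plan.append((hour, boot))
    return plan
```

For a user who submitted at 9:05 on four consecutive days and once at 14:00, a 10-minute startup overhead yields a single entry: boot at minute 530 (8:50) for the 9:00 hour.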
  • 69. Contributions. (1) Propose, and empirically demonstrate, the use of user patterns deduced by knowledge-based approaches to provision compute resources, reducing the impact of startup overheads in cloud computing environments. (2) Propose, and empirically demonstrate, user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection. (3) Propose, and demonstrate the effectiveness and applicability of, a lightweight and reliable resource abstraction service that hides the complexities of interacting with multiple resource managers in grids and clouds. (4) A prototype implementation to evaluate the feasibility and performance of the resource abstraction service, integrated with four different application domains to demonstrate its usability.
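User-perceived reliability as a selection criterion can be sketched by computing, for each (user, resource) pair, a smoothed success rate over that user's past jobs and ranking candidate resources by it. The Laplace smoothing and the neutral score for unseen resources are assumptions chosen for illustration; the thesis's mining approach may differ.

```python
from __future__ import annotations

from collections import defaultdict


def perceived_reliability(history: list[tuple[str, str, bool]],
                          prior_success: int = 1, prior_failure: int = 1) -> dict[tuple[str, str], float]:
    """Smoothed success rate per (user, resource) from (user, resource, succeeded) records."""
    counts: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])  # [successes, failures]
    for user, resource, succeeded in history:
        counts[(user, resource)][0 if succeeded else 1] += 1
    return {key: (s + prior_success) / (s + f + prior_success + prior_failure)
            for key, (s, f) in counts.items()}


def rank_resources(user: str, candidates: list[str], history: list[tuple[str, str, bool]],
                   prior_success: int = 1, prior_failure: int = 1) -> list[str]:
    """Order candidate resources from most to least reliable as this user has experienced them."""
    rel = perceived_reliability(history, prior_success, prior_failure)
    unseen = prior_success / (prior_success + prior_failure)  # prior mean for untried resources
    return sorted(candidates, key=lambda r: rel.get((user, r), unseen), reverse=True)
```

Scoring per user rather than per resource captures the idea that the same resource can be reliable for one user's workload and unreliable for another's.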
  • 70. Future Work. Short-term research directions: integration of future job predictions with user-perceived reliability predictions; evolving the resource abstraction service to support more compute resources; management of ensemble runs; fault tolerance with proactive replication. Long-term research directions.
  • 71. Thank You!