
Big Data, Beyond the Data Center


Increasingly, the next scientific discoveries and the next industrial breakthroughs will depend on the capacity to extract knowledge and sense from gigantic amounts of information. Examples range from processing data produced by scientific instruments such as CERN's LHC; collecting data from large-scale sensor networks; grabbing, indexing and nearly instantaneously mining and searching the Web; building and traversing billion-edge social network graphs; to anticipating market and customer trends through multiple channels of information. Collecting information from various sources, recognizing patterns and distilling insights constitutes what is called the Big Data challenge. However, as the volume of data grows exponentially, managing these data becomes proportionally more complex. A key challenge is to handle the complexity of data management on hybrid distributed infrastructures, i.e., assemblages of Clouds, Grids and Desktop Grids. In this talk, I will give an overview of our work in this research area, starting with BitDew, a middleware for large-scale data management on Clouds and Desktop Grids. Then I will present our approach to enabling MapReduce on Desktop Grids. Finally, I will present our latest results around Active Data, a programming model for managing the data life cycle on heterogeneous systems and infrastructures.



  1. Big Data, Beyond the Data Center. Gilles Fedak (Gilles.Fedak@inria.fr), INRIA, University of Lyon, France. Cluj Economics and Business Seminar Series (CEBSS), University Babes-Bolyai, Faculty of Economics and Business Administration, Cluj-Napoca, Romania, 6/11/2014.
  2. AVALON Team
     • Located in Lyon, France
     • Joint research group:
       • INRIA: French National Institute for Research in Informatics
       • ENS-Lyon: École Normale Supérieure de Lyon
       • University of Lyon
  3. AVALON Members (as of April 1st, 2014)
     Faculty members (8) (4 INRIA, 1 CNRS, 2 UCBL, 1 ENSL):
     • Eddy Caron, MCF ENS Lyon, HDR (80%)
     • Frédéric Desprez, DR INRIA, HDR (30%)
     • Gilles Fedak, CR INRIA
     • Jean-Patrick Gelas, MCF UCBL
     • Olivier Glück, MCF UCBL
     • Laurent Lefèvre, CR INRIA, HDR
     • Christian Perez, DR INRIA, HDR, project leader
     • Frédéric Suter, CR CNRS
     PhD students (6):
     • Maurice-Djibril Faye, ENS-Lyon / Université Gaston Berger (Sénégal)
     • Sylvain Gault, MapReduce, INRIA
     • Anthony Simonet, MapReduce, INRIA
     • Vincent Lanore, ENSL
     • Arnaud Lefray, SEED4C, ENSIB
     • Daniel Balouek, CIFRE New Generation SR
     Engineers (3+4+1):
     • Simon Delamare, IR CNRS (80%)
     • Jean-Christophe Mignot, IR CNRS (20%)
     • Matthieu Imbert, INRIA SED (40%)
     • Sylvain Bernard, CloudPower
     • François Rossigneux, XCLOUD
     • Guillaume Verger, SEED4C
     • Yulin Zhang Huaxi, SEED4C
     • Laurent Pouilloux (AE Héméra)
     Postdoc:
     • Jonathan Rouzaud-Cornabas, CNRS
     Temporary teacher-researcher:
     • Ghislain Landry Tsafack, UCBL
     Assistant:
     • Evelyne Blesle, INRIA
  4. AVALON Topics
     Avalon: research activities, targeting super-computers (Exascale), Desktop Grids, Clouds (IaaS, PaaS) and Grids (EGI):
     • Energy application profiling and modelization:
       • large-scale energy consumption analysis for physical and virtual resources
       • energy efficiency of next-generation large-scale platforms
     • Data-intensive application profiling, modeling, and management:
       • performance prediction of parallel regular applications
       • modeling large-scale storage infrastructure
       • data management for hybrid computing infrastructures
     • Resource-agnostic application description model:
       • moldable application description model
       • dynamic adaptation of the application structure
     • Application mapping and scheduling:
       • application mapping and software deployment
       • non-deterministic workflow scheduling
       • security management in Cloud infrastructures
  5. Big Data ...
     • Huge and growing volume of information originating from multiple sources
     • Impacts many scientific disciplines and industry branches:
       • large scientific instruments (LSST, LHC, OOI), but not only (sequencing machines)
       • Internet and social networks (Google, Facebook, Twitter, etc.)
       • Open Data (Open Library, governmental data, genomics)
     → impacts the whole process of scientific discovery (the 4th paradigm of science)
  6. ... or Big Bottlenecks?
     Big Data creates several challenges:
     • How to scale the infrastructure? End-to-end performance improvement, inter-system optimization.
     • How to improve the productivity of data-intensive scientists? Workflows, programming languages, quality of data provenance.
     • How to enable collaborative data science? Incentives for data publication, data-set sharing, collaborative workflows.
     New models and software are needed to represent and manipulate large and distributed scientific data-sets.
  7. BitDew: Large Scale Data Management
     Haiwu He (CAS/CNIC), Franck Cappello (ANL, UIUC)
     • G. Fedak, H. He, and F. Cappello. BitDew: A Programmable Environment for Large-Scale Data Management and Distribution. In Proceedings of the ACM/IEEE SuperComputing Conference (SC'08), pages 1–12, Austin, USA, November 2008.
     • G. Fedak, H. He, and F. Cappello. BitDew: A Data Management and Distribution Service with Multi-Protocol and Reliable File Transfer. Journal of Network and Computer Applications, 32(5):961–975, 2009.
     • H. He, G. Fedak, B. Tran, and F. Cappello. BLAST Application with Data-aware Desktop Grid Middleware. In Proceedings of the 9th IEEE International Symposium on Cluster Computing and the Grid (CCGRID'09), pages 284–291, Shanghai, China, May 2009.
  8. Towards Data Desktop Grids
     Desktop Grid or Volunteer Computing systems:
     • high-throughput computing over large sets of idle desktop computers
     • mature technology
     • EU support: European Desktop Grid Infrastructure
     But ...
     • high number of resources
     • volatility
     • lack of trust
     • resources owned by volunteers
  9. Towards Data Desktop Grids (continued)
     • Scalable, but mainly for embarrassingly parallel applications with few I/O requirements
     Our goals:
     • enabling data-intensive applications
     • bridging Desktop Grids with Cloud and Grid infrastructures
  10. Large Scale Data Management
     BitDew: a programmable environment for large-scale data management.
     • Key idea 1: provide an API and a runtime environment that integrates several P2P technologies in a consistent way.
     • Key idea 2: rely on metadata (data attributes) to transparently drive data management operations: replication, fault tolerance, distribution, placement, life cycle.
  11. BitDew: the Big Cloudy Picture
     • Aggregates storage into a single Data Space:
       • clients put and get data from the data space
       • clients define data attributes (e.g., REPLICA = 3)
     [Figure: a client puts data into and gets data from the Data Space.]
  12. BitDew: the Big Cloudy Picture
     • Distinguishes service nodes (stable) from client and worker nodes (volatile)
     • Services: ensure fault tolerance, indexing and scheduling of data onto worker nodes
     • Workers: store data on desktop PCs
     • push/pull protocol between clients, services and workers
     [Figure: the client puts/gets data in the Data Space; reservoir nodes pull data from the service nodes.]
  13. Data Attributes
     • replica: indicates how many occurrences of a data item should be available at the same time in the system
     • resilience: controls the resilience of data in the presence of machine crashes
     • lifetime: a duration, absolute or relative to the existence of other data, which indicates when a datum becomes obsolete
     • affinity: drives the movement of data according to dependency rules
     • transfer protocol: gives the runtime environment hints about the file transfer protocol appropriate to distribute the data
     • distribution: indicates the maximum number of pieces of data with the same attribute that should be sent to a particular node
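The attribute mechanism lends itself to a compact illustration. The sketch below is not the BitDew API: the Attribute and Data records and the placement loop are illustrative assumptions, showing how a scheduler can act on declarative attributes (here replica and protocol) without any application logic.

```java
import java.util.*;

// Minimal sketch (not real BitDew code): placement driven purely by
// declarative attributes, as described on the slide above.
public class AttributeSketch {
    record Attribute(int replica, boolean resilient, String protocol) {}
    record Data(String id, Attribute attr) {}

    public static void main(String[] args) {
        List<String> nodes = List.of("nodeA", "nodeB", "nodeC", "nodeD");
        Data d = new Data("dataset.bin", new Attribute(3, true, "bittorrent"));

        // The scheduler reads the attribute, not the application code:
        // it picks `replica` distinct nodes to host copies of the data.
        List<String> placement = new ArrayList<>();
        for (int i = 0; i < d.attr().replica(); i++)
            placement.add(nodes.get(i % nodes.size()));

        System.out.println(d.id() + " -> " + placement
            + " via " + d.attr().protocol());
    }
}
```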
  14. Architecture Overview
     • Programming APIs to create data and attributes, manage file transfers and program applications
     • Services (DC, DR, DT, DS) to store, index, distribute, schedule, transfer and provide resiliency to the data
     • Several information storage back-ends (SQL server, DHT) and file transfer protocols (HTTP, FTP, BitTorrent)
     [Figure: the BitDew runtime environment: applications (master/worker, command-line tool) on top of the BitDew, Active Data and Transfer Manager APIs, backed by the Data Catalog, Data Repository, Data Scheduler and Data Transfer services and their back-ends.]
  15. Examples of BitDew Applications
     • Data-intensive applications:
       • DI-BOT: data-driven master/worker Arabic character recognition (M. Labidi, University of Sfax)
       • MapReduce vs. Hadoop (X. Shi, L. Lu, HUST, Wuhan, China)
     • Data management utilities:
       • file sharing for social networks (N. Kourtellis, Univ. Florida)
       • distributed checkpoint storage (F. Bouabache, Univ. Paris XI)
       • Grid data staging (IEP, Chinese Academy of Sciences)
  16. MapReduce for Hybrid Distributed Computing Infrastructures
     Haiwu He (CAS/CSNET), Bing Tang (WUST), Xuanhua Shi, Lu Lu (HUST), Mircea Moca, Gheorghe Silaghi (Univ. Babes-Bolyai)
     • Towards MapReduce for Desktop Grid Computing. B. Tang, M. Moca, S. Chevalier, H. He, and G. Fedak. In Fifth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC'10), Fukuoka, Japan, November 2010.
     • Distributed Results Checking for MapReduce on Volunteer Computing. Mircea Moca, Gheorghe Cosmin Silaghi and Gilles Fedak. In 4th Workshop on Desktop Grids and Volunteer Computing Systems (PCGrid 2010), IPDPS'2011, Anchorage, Alaska.
     • Assessing MapReduce for Internet Computing: a Comparison of Hadoop and BitDew-MapReduce. Lu Lu, Hai Jin, Xuanhua Shi and Gilles Fedak. In the 13th ACM/IEEE International Conference on Grid Computing (Grid 2012), Beijing, China, 2012.
     • Data-Intensive Computing on Desktop Grids. H. Lin, W.-C. Feng and G. Fedak. Book chapter in Desktop Grid Computing, CRC Press, 2012.
     • Parallel Data Processing in Dynamic Hybrid Computing Environment Using MapReduce. Bing Tang, Haiwu He, Gilles Fedak. In 14th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP'14), LNCS/Springer Verlag, August 24-27, Dalian, China, 2014.
  17. What is MapReduce?
     • Programming model for data-intensive applications
     • Proposed by Google in 2004
     • Simple, inspired by functional programming: the programmer simply defines Map and Reduce tasks
     • Building block for other parallel programming tools
     • Strong open-source implementation: Hadoop
     • Highly scalable: accommodates large-scale clusters with faulty and unreliable resources
     Reference: MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. In OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
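To ground the model, here is word count written as map and reduce functions in plain Java (no Hadoop); the shuffle step that groups intermediate pairs by key is what a framework such as Hadoop, or BitDew-MapReduce below, provides for you.

```java
import java.util.*;

// Word count in the MapReduce model, plain Java for illustration only:
// map emits (key, value) pairs, the framework groups them by key,
// and reduce folds each group into a final value.
public class WordCount {
    // Map phase: one input split -> list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : split.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        return out;
    }

    // Reduce phase: (word, [1, 1, ...]) -> count.
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> splits = List.of("big data beyond the data center",
                                      "data management beyond the cloud");
        // Shuffle: group all intermediate pairs by key.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String s : splits)
            for (var kv : map(s))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        groups.forEach((k, vs) -> System.out.println(k + " = " + reduce(k, vs)));
    }
}
```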
  18. Challenge of MapReduce over the Internet
     [Figure: input files are split and fed to Mappers; intermediate results are combined by Reducers into the output files Output1, Output2, Output3.]
     • No shared file system nor direct communication between hosts
     • Faults and host churn
     • Result certification of intermediate data
     • Collective operations (scatter + gather/reduction)
  19. Implementing MapReduce over BitDew
     Latency hiding:
     • Multi-threaded workers overlap communication and computation.
     • The maximum number of concurrent Map and Reduce tasks can be configured, as well as the minimum number of tasks in the queue before computation can start.
     Barrier-free computation:
     • Reducers detect duplication of intermediate results (which happens because of faults and/or lags).
     • Early reduction: intermediate results (IRs) are processed as they arrive → this allowed us to remove the barrier between the Map and Reduce phases.
     • But ... IRs are not sorted.
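A minimal sketch of the two mechanisms above, under assumed simplifications (hard-coded task ids, integer counts standing in for real intermediate files): a bounded queue lets transfer threads overlap the reducer's work, and duplicated intermediate results are detected by task id and dropped, so reduction proceeds without a Map/Reduce barrier.

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative sketch, not BitDew-MapReduce code: latency hiding via a
// bounded queue, plus barrier-free early reduction with duplicate filtering.
public class EarlyReduction {
    record IR(String mapTaskId, int partialCount) {}

    public static void main(String[] args) throws Exception {
        BlockingQueue<IR> queue = new ArrayBlockingQueue<>(4);
        ExecutorService net = Executors.newFixedThreadPool(2);
        // "Transfers": communication threads push IRs, including a duplicate
        // produced by a restarted or replicated map task.
        List<String> arrivals = List.of("map-1", "map-2", "map-1", "map-3");
        for (String id : arrivals)
            net.submit(() -> { queue.put(new IR(id, 10)); return null; });

        Set<String> seen = new HashSet<>();
        int total = 0;
        for (int received = 0; received < arrivals.size(); received++) {
            IR ir = queue.take();                     // overlaps with transfers
            if (!seen.add(ir.mapTaskId())) continue;  // duplicated IR: drop it
            total += ir.partialCount();               // reduce eagerly, no barrier
        }
        net.shutdown();
        System.out.println("reduced total = " + total); // 30 (duplicate ignored)
    }
}
```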
  20. Scheduling and Fault-tolerance
     Two-level scheduling:
     1. Data placement is ensured by the BitDew scheduler, which is mainly guided by the data attributes.
     2. Workers periodically report the state of their ongoing computation to the MR-scheduler running on the master node.
     3. The MR-scheduler detects when more nodes are available than tasks to execute, which helps avoid the lagger (straggler) effect.
     Fault tolerance:
     • In Desktop Grids, computing resources have high failure rates:
       → during computation (execution of Map or Reduce tasks)
       → during communication, that is, file upload and download.
     • The MapInput data and the ReduceToken token have the resilient attribute enabled.
  21. MapReduce Evaluation
     [Figure 6: scalability evaluation on the WordCount application (2.7 GB data set); the y axis presents the throughput in MB/s and the x axis the number of nodes, varying from 1 to 512.]
     [Table II: evaluation of the performance according to the number of mappers (4, 8, 16, 32) and reducers (1, 4, 8, 16).]
  22. Data Security and Privacy
     Distributed result checking:
     • Traditional DG or VC projects implement result checking on the server.
     • Intermediate results are too large to be sent back to the server → distributed result checking:
       • replicate the MapInput data and the Reducers
       • select correct results by majority voting: the reducer receives rm versions of the intermediate files for a map input, and accepts a result as input for a Reduce task once rm/2 + 1 identical versions have been received
     [Figure 2: dataflows in the MapReduce implementation: (a) no replication; (b) replication of the Map tasks; (c) replication of both Map and Reduce tasks.]
     Ensuring data privacy:
     • Use a hybrid infrastructure composed of private and public resources
     • Use an IDA (Information Dispersal Algorithm) approach to distribute and store the data securely
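The majority-voting rule condensed above (accept a result once rm/2 + 1 identical versions have arrived) fits in a few lines. This is an illustrative sketch in which strings stand in for digests of intermediate files.

```java
import java.util.*;

// Sketch of distributed result checking by majority voting (illustrative):
// each map task is replicated r times; an intermediate result is accepted
// once floor(r/2) + 1 identical versions of it have been seen.
public class MajorityVoting {
    static Optional<String> certify(List<String> digests, int r) {
        Map<String, Integer> votes = new HashMap<>();
        for (String digest : digests) {
            if (votes.merge(digest, 1, Integer::sum) >= r / 2 + 1)
                return Optional.of(digest); // majority reached: accept
        }
        return Optional.empty(); // no majority: replicate/reschedule the task
    }

    public static void main(String[] args) {
        // Digests of one map output computed by r = 3 replicas; one saboteur.
        System.out.println(certify(List.of("ab12", "ff00", "ab12"), 3)); // ab12
        System.out.println(certify(List.of("ab12", "ff00", "cd34"), 3)); // empty
    }
}
```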
  23. Active Data: Data Life-Cycle Management
     Anthony Simonet (INRIA), Matei Ripeanu (UBC), Samer Al-Kiswany (UBC)
     • Active Data: A Data-Centric Approach to Data Life-Cycle Management. Anthony Simonet, Gilles Fedak, Matei Ripeanu and Samer Al-Kiswany. 8th Parallel Data Storage Workshop (PDSW'13), Proceedings of SC13 workshops, Denver, November 2013 (position paper, 5 pages).
     • Active Data: A Programming Model for Data Life-Cycle Management on Heterogeneous Systems and Infrastructures. Anthony Simonet, Gilles Fedak and Matei Ripeanu. Technical report, under evaluation.
  24. Focus on Data Life-Cycle
     Data life cycle: the course of operational stages through which data pass, from the time they enter a system to the time they leave it.
     [Figure: stages of a data life cycle.]
  25. Use Case: The Advanced Photon Source
     • 3 to 5 TB of data per week on this detector
     • Raw data are pre-processed and registered in the Globus Catalog:
       • data are curated by several applications
       • data are shared amongst scientific users
     [Figure: APS pipeline: transfer from the instrument (beamline) to local storage; metadata extraction and registration in the metadata catalog; transfer to a remote data center; transfer to an academic cluster for analysis and further analysis; upload of results and registration of result metadata.]
  26. Objectives
     We are aiming at:
     • a model that captures the essential life cycle stages and properties: creation, deletion, faults, replication, error checking ...
     • allowing legacy systems to expose their intrinsic data life cycle
     • allowing reasoning about data sets handled by heterogeneous software and infrastructures
     • simplifying the programming of applications that implement data life cycle management
  27. Active Data Principles
     System programmers expose their system's internal data life cycle with a model based on Petri nets. A life cycle model is made of places and transitions.
     [Petri net with places Created, Written, Read, Terminated and transitions t1-t4.]
     Each token has a unique identifier, corresponding to the actual data item's.
  28. Active Data Principles
     [Petri net with places Created, Written, Read, Terminated and transitions t1-t4.]
     A transition is fired whenever a data state changes.
  29. Active Data Principles
     [Petri net with places Created, Written, Read, Terminated and transitions t1-t4.]
     public void handler() { compressFile(); }
     Code may be plugged by clients into transitions; it is executed whenever the transition is fired.
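Putting these three slides together: the sketch below is a toy, self-contained rendering of the principle (the class and method names are ours, not the Active Data prototype's API). Tokens carrying data identifiers move between places, and a client handler plugged on transition t2 runs whenever t2 fires.

```java
import java.util.*;
import java.util.function.Consumer;

// Toy sketch of the Active Data principle (names are illustrative): a life
// cycle is a Petri net whose tokens carry the identifier of a real data item;
// client code plugged on a transition runs whenever that transition fires.
public class LifeCycleSketch {
    final Map<String, Set<String>> places = new HashMap<>();   // place -> token ids
    final Map<String, Consumer<String>> handlers = new HashMap<>();

    void fire(String transition, String from, String to, String tokenId) {
        places.computeIfAbsent(from, p -> new HashSet<>()).remove(tokenId);
        places.computeIfAbsent(to, p -> new HashSet<>()).add(tokenId);
        handlers.getOrDefault(transition, id -> {}).accept(tokenId);
    }

    public static void main(String[] args) {
        LifeCycleSketch lc = new LifeCycleSketch();
        lc.places.put("Created", new HashSet<>(List.of("data-42")));
        // Client-side handler, as on the slide: compress once the data is written.
        lc.handlers.put("t2", id -> System.out.println("compressFile(" + id + ")"));
        lc.fire("t1", "Created", "Written", "data-42"); // no handler plugged on t1
        lc.fire("t2", "Written", "Read", "data-42");    // runs the handler
    }
}
```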
  30. Active Data Features
     The Active Data programming model and runtime environment:
     • allows reacting to life cycle progression
     • transparently exposes distributed data sets
     • can be integrated with existing systems
     • has scalable performance and minimal overhead over existing systems
  31. Integration with Data Management Systems
     • BitDew (INRIA): programmable environment for data management
     • inotify: Linux kernel subsystem; notification system for file creation, modification, write, movement and deletion
     • iRODS (DICE, Univ. North Carolina): rule-oriented data management system
     • Globus Online (ANL): offers a fast, simple and reliable service to transfer large volumes of data
     [Figure 3: data life cycle models for these data management systems: (a) BitDew scheduler; (b) BitDew file transfer; (c) inotify; (d) iRODS; (e) Globus Online.]
     The models are constructed from each system's documentation and source code. For instance, in BitDew, data items are managed by instances of the Data class, whose status variable holds the data item state; the enumeration of the possible values of status maps directly to the places of the Petri net.
  32. Implementation
     • Prototype implemented in Java (~2,800 LOC)
     • Client/service communication is publish/subscribe
     • 2 types of subscription:
       • every transition for a given data item
       • every data item for a given transition
     [Figure: clients subscribing to the Active Data service.]
  33. Implementation
     • Several ways to publish transitions:
       • instrument the code
       • read the logs
       • rely on an existing notification system
     • The service orders transitions by time of arrival
     [Figure: systems publish transitions to the Active Data service; clients subscribe.]
  34. Implementation
     • Clients run transition handler code locally
     • Transition handlers are executed:
       • serially
       • in a blocking way
       • in the order transitions were published
     [Figure: published transitions are dispatched as notifications to subscribed clients.]
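The three implementation slides condense into one toy service (again with illustrative names, not the prototype's API): published transitions are logged in arrival order, and every subscriber's handlers are invoked serially and blockingly, in that order.

```java
import java.util.*;
import java.util.function.Consumer;

// Sketch of the publish/subscribe scheme described above (illustrative):
// the service orders transitions by arrival and dispatches them serially.
public class PubSubSketch {
    record Transition(long seq, String name, String dataId) {}

    final List<Transition> log = new ArrayList<>();              // arrival order
    final Map<String, List<Consumer<Transition>>> subs = new HashMap<>();
    long clock = 0;

    void subscribe(String transitionName, Consumer<Transition> handler) {
        subs.computeIfAbsent(transitionName, k -> new ArrayList<>()).add(handler);
    }

    void publish(String name, String dataId) {
        Transition t = new Transition(clock++, name, dataId);
        log.add(t);
        // Serial, blocking dispatch: the next transition is not delivered
        // until every handler for this one has returned.
        for (Consumer<Transition> h : subs.getOrDefault(name, List.of()))
            h.accept(t);
    }

    public static void main(String[] args) {
        PubSubSketch service = new PubSubSketch();
        service.subscribe("end_transfer", t ->
            System.out.println("#" + t.seq() + " transfer done for " + t.dataId()));
        service.publish("start_transfer", "data-42"); // no subscriber: only logged
        service.publish("end_transfer", "data-42");   // runs the handler
    }
}
```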
  35. Data Surveillance Framework for APS
     Anthony Simonet (INRIA), Kyle Chard (ANL), Ian Foster (ANL/UC)
     • A. Simonet, K. Chard, G. Fedak, I. Foster. Active Data to Provide Smart Data Surveillance to E-Science Users. In Proceedings of Euromicro PDP'15, Turku, Finland, March 4-6, 2015.
  36. Problems with APS
     [Figure: the APS workflow: 1. local transfer from the detector to local storage; 2. metadata extraction and registration in the Globus Catalog; 3. Globus transfer to the compute cluster; 4. Swift parallel analysis.]
     What is inefficient in this workflow?
     • Many error-prone tasks are performed manually
     • Users can't monitor the whole process at once
     • Small failures are difficult to detect
     • A system alone can't recover from failures caused outside its scope
  37. Data Surveillance Framework
     4 goals (that would otherwise require a lot of scripting and hacking):
     • monitoring data set progress
     • better automation
     • sharing and notification
     • error discovery and recovery
  38. Active Data Advanced Features
     A. Progress monitoring
     • Associate tags with data
     • Install taggers on transitions
     • Guarded transitions: handler code executes only on tokens that carry specific tags
     Scientists need to monitor their workflows from a high level without auditing every dataset: receive a single relevant notification when several related events occur in different systems; quickly notice the one operation that failed within the mass of operations that completed normally; identify steps that take longer than usual and backtrack the chain of causality; and accelerate data sharing by pushing notifications to collaborators.
     B. Automation
     • Notification handlers: Push.co, Twitter, gdoc, ifttt, etc.
     The APS workflow, like many scientific workflows, requires explicit human intervention to progress between stages and to recover from unexpected events: running scripts on generated datasets, registering datasets in the Globus Catalog, launching Swift analysis scripts on the compute cluster. Such interventions cannot easily be integrated in a traditional workflow system, because they sit a level of abstraction above it; in fact, they are the operations that start the workflow systems.
     C. Sharing and notification
     • Notify other scientists of new dataset availability in the catalog, with powerful filters to extract only the notifications they need, so that processes can start as soon as files are available
     [Fig. 2: data surveillance framework design: a life cycle view with tagged tokens, guards, code execution and notification over files, datasets, file transfers and metadata.]
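Guarded transitions admit a one-screen illustration; the Token record and tag names below are hypothetical. The guard is evaluated against the token's tags before the handler body runs, which is how a user receives only the notifications they asked for.

```java
import java.util.*;
import java.util.function.Predicate;

// Sketch of a guarded transition (illustrative names): the handler body runs
// only for tokens whose tags satisfy the guard, e.g. datasets tagged by a
// tagger installed on an earlier transition.
public class GuardedTransition {
    record Token(String dataId, Set<String> tags) {}

    static void onTransition(Token token, Predicate<Token> guard, Runnable handler) {
        if (guard.test(token)) handler.run(); // guard filters the mass of events
    }

    public static void main(String[] args) {
        Token ok  = new Token("scan-007", Set.of("beamline-A", "calibrated"));
        Token raw = new Token("scan-008", Set.of("beamline-A"));
        Predicate<Token> guard = t -> t.tags().contains("calibrated");
        onTransition(ok,  guard, () -> System.out.println("notify collaborators"));
        onTransition(raw, guard, () -> System.out.println("notify collaborators"));
        // Only scan-007 triggers the notification.
    }
}
```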
  39. APS Data Life Cycle Model
     Use case: the APS data life cycle model, composed of 6 systems.
     [Figure: composed Petri net spanning the detector, Globus transfers (with Failed/Succeeded places), shared storage, Globus Catalog updates, metadata extraction and Swift analysis; each system exposes places such as Created, Succeeded, Failed and Terminated, connected by transitions.]
  40. Example: Data Provenance
     Definition: the complete history of data derivations and operations.
     • Assess dataset quality
     • Record the context of data acquisition and transformation
     • PASS: Provenance-Aware Storage Systems
