Azure Brain: 4th paradigm, scientific discovery & (really) big data
 


"A cloud to compare our genes with images of our brains." The late database pioneer Jim Gray announced in 2007 the emergence of a fourth scientific paradigm: digital scientific research driven entirely by the exploration of massive data. That vision is now everyday reality in scientific research laboratories, and it goes well beyond what is commonly called "BIG DATA". In 2010, Microsoft Research and Inria started a project called Azure-Brain (or A-Brain), whose originality lies in both building, on top of Windows Azure, a new platform for accessing massive data from scientific applications, and confronting it with the reality of scientific research. In this session we first set out the research challenges around managing massive data in the cloud, then present TomusBlobs, a cloud storage platform optimized for Azure. Finally, we present the A-Brain project and the results we have obtained. Neuroimaging contributes to the diagnosis of certain diseases of the nervous system, but our brains all turn out to be a little different from one another, and this variability complicates medical interpretation. Hence the idea of correlating MRI images of the brain with each patient's genetic make-up, in order to better delimit the brain regions of symptomatic interest. The high-resolution MRI images for this project are produced by the Neurospin platform at CEA (Saclay). The researchers' problem is the sheer mass of information to process: the genetic profile of one individual comprises around one million data points, to which are added equally colossal volumes of 3D pixels describing the images. A data deluge: petabytes of data and potentially years of computation. This is where the cloud comes in, together with a platform optimized on Azure to run massively parallel applications on massive data. As Gabriel Antoniu, its lead, explains, this Rennes-based research team has developed "efficient storage mechanisms to improve access to these massive data and optimize their processing. Our developments meet the application needs of our colleagues in Saclay."

Azure Brain: 4th paradigm, scientific discovery & (really) big data Presentation Transcript

  • 1. Azure Brain: 4th paradigm, scientific discovery & (really) big data (REC201). Gabriel Antoniu, Senior Research Scientist, Inria, Head of the KerData Project-Team, Inria Rennes – Bretagne Atlantique. Radu Tudoran, PhD student, ENS Cachan – Brittany, KerData Project-Team, Inria Rennes – Bretagne Atlantique. Feb. 12, 2013
  • 2. Inria's strategy in Cloud Computing. Inria is among the leaders in Europe in the area of distributed computing and HPC: a long history of research on distributed systems, HPC and Grids; now several activities on virtualized environments and cloud infrastructures; a culture of multidisciplinary research; a culture of exploration tools (owner of massively parallel machines since 1987, and of large-scale testbeds such as Grid'5000); strong involvement in national, European and international collaborative projects; a strong collaboration history with industry (Joint Microsoft Research – Inria Centre, IBM, EDF, Bull, etc.).
  • 3. Clouds: where within Inria? Inria's research domains: Applied Mathematics, Computation and Simulation; Algorithmics, Programming, Software and Architecture; Networks, Systems and Services, Distributed Computing; Perception, Cognition, Interaction; Computational Sciences for Biology, Medicine and the Environment.
  • 4. Some project-teams involved in Cloud Computing, across the Inria centres (Nancy Grand Est, Grenoble Rhône-Alpes, Sophia Antipolis Méditerranée, Rennes Bretagne Atlantique, Bordeaux Sud-Ouest, Lille Nord Europe, Saclay Île-de-France, Paris Rocquencourt): KERDATA: data storage and processing; MYRIADS: autonomous distributed systems; ASCOLA: languages and virtualization; CEPAGE: task management; AVALON: middleware & programming; MESCAL: models & tools; REGAL: large-scale distributed systems; ALGORILLE: algorithms & models; OASIS: programming; ZENITH: scientific data management.
  • 5. Initiatives to support Cloud Computing and HPC within Inria. Why dedicated initiatives to support HPC/Clouds? Project-teams are geographically dispersed; project-teams belong to different domains; researchers from scientific computing need access to the latest research results related to tools, libraries, runtime systems, etc.; researchers from computer science need access to applications to test their ideas as well as to find new ones. The concept of "Inria Large Scale Initiatives": enable ambitious projects linked with the strategic plan; promote an interdisciplinary approach; mobilize the expertise of Inria researchers around key challenges.
  • 6. Cloud Computing @ Inria Rennes – Bretagne Atlantique
  • 7. Some Research Focus Areas. Software architecture and infrastructure for cloud computing: autonomic service management, resource management, SLAs, sky computing (Myriads); Big Data storage and management, MapReduce (KerData); hybrid cloud and P2P systems, privacy (ASAP). Advanced usage for specific application communities: bioinformatics (GENSCALE); cloud for medical imaging, EasyMed project at IRT B-Com (Visages).
  • 8. Some Research Focus Areas (repeat of slide 7).
  • 9. Contrail EU project. Goal: develop an integrated approach to virtualization, offering services for federating IaaS clouds and elastic PaaS services on top of federated clouds. Overview: provide tools for managing a federation of multiple heterogeneous IaaS clouds, offering a secure yet usable platform to end users through federated identity management, and supporting SLAs and quality of service (QoS) to satisfy stringent business requirements for using the cloud. [Architecture diagram: a federation API and federation core spanning resource providers, storage providers, a network provider and a public cloud, with applications on top.] Contrail is an open-source cloud computing software stack compliant with cloud standards. http://contrail-project.eu
  • 10. Contrail EU project: http://contrail-project.eu and http://contrail.projects.ow2.org/xwiki/bin/view/Main/WebHome
  • 11. Other research activities on Cloud Computing. Snooze: an autonomic, energy-efficient IaaS management system. Scalability: distributed VM management; self-organizing and self-healing hierarchy. Energy conservation: idle nodes in power-saving mode; a holistic approach to favour idle nodes. VM management algorithms: energy-efficient VM placement; under-load/overload mitigation; automatic node power-cycling and wake-up. Open-source software under the GNU GPLv2 license: http://snooze.inria.fr. Resilin: elastic MapReduce on multiple clouds (sky computing). Goals: creation of MapReduce execution platforms on top of multiple clouds; elasticity of the platforms; support for all kinds of Hadoop jobs; support for different Hadoop versions. Interfaces: Amazon EMR for users; Libcloud towards the underlying IaaS providers. Open-source software under the GNU Affero GPL license: http://resilin.inria.fr
  • 12. KerData: dealing with the Data Deluge. Deliver the capability to mine, search and analyze this data in near real time. Science itself is evolving. Credits: Microsoft
  • 13. Data Science: The 4th Paradigm for Scientific Discovery. Thousands of years ago: description of natural phenomena. Last few hundred years: Newton's laws, Maxwell's equations, ... Last few decades: simulation of complex phenomena. Today and the future: unify theory, experiment and simulation with large multidisciplinary data, using data exploration and data mining (from instruments, sensors, humans, ...), across distributed communities. Credits: Dennis Gannon
  • 14. Data Science: The 4th Paradigm for Scientific Discovery (repeat of slide 13).
  • 15. Research focus: how to efficiently store, share and process data for new-generation, data-intensive applications? Scientific challenges: massive data (1 object = 1 TB); geographically distributed data; fine-grain access (MB) for reading and writing; high concurrency (10³ concurrent clients); no locking. Major goal: high throughput under heavy concurrency. Our contribution: design and implementation of distributed algorithms; validation with real apps on real platforms with real users. Applications: massive data analysis on clouds (e.g. MapReduce); post-petascale HPC simulations on supercomputers.
  • 16. BlobSeer: a software platform for scalable, distributed BLOB management. Started in 2008; 6 PhD theses (Gilles Kahn/SPECIF PhD Thesis Award in 2011). Main goal: optimized for concurrent accesses under heavy concurrency. Three key ideas: decentralized metadata management; lock-free concurrent writes enabled by versioning, where a write creates a new version of the data (see the sketch below); data and metadata "patching" rather than updating. A back-end for higher-level data management systems: in the short term, highly scalable distributed file systems; in the middle term, storage for cloud services. Our approach: design and implementation of distributed algorithms; experiments on the Grid'5000 grid/cloud testbed; validation with real apps on real platforms: Nimbus, Azure, OpenNebula clouds, ... http://blobseer.gforge.inria.fr/
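The versioning idea behind BlobSeer's lock-free writes can be illustrated with a minimal, hypothetical sketch (the class and method names below are mine, not BlobSeer's API): every write publishes a new immutable version, so readers can keep using the version they pinned while writers proceed.

```python
# Minimal sketch (not BlobSeer itself): a blob whose writes always create a new
# immutable version, so concurrent readers never block or see partial updates.
import threading


class VersionedBlob:
    def __init__(self):
        self._versions = []              # immutable snapshots, oldest first
        self._lock = threading.Lock()    # only protects the tiny publish step

    def write(self, data: bytes) -> int:
        """A write never updates in place: it appends and returns a new version id."""
        with self._lock:
            self._versions.append(data)
            return len(self._versions) - 1

    def read(self, version=None) -> bytes:
        """Readers pin a version; later writes cannot invalidate it."""
        if version is None:
            version = len(self._versions) - 1
        return self._versions[version]


blob = VersionedBlob()
v0 = blob.write(b"first result")
blob.write(b"second result")
assert blob.read(v0) == b"first result"    # old version still readable
assert blob.read() == b"second result"     # latest version by default
```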
  • 17. Impact of BlobSeer on MapReduce. BlobSeer improves Hadoop: a 35% gain in execution time. ANR MapReduce Project (2010-2014). Lead: G. Antoniu (KerData). Partners: Inria (AVALON), Argonne National Lab, U. Illinois Urbana-Champaign, IBM, JLPC, IBCP, MEDIT. Strong collaboration with the Nimbus team from Argonne National Lab: BlobSeer integrated with the Nimbus cloud toolkit; BlobSeer used for efficient VM deployment and snapshotting. Validation: Grid'5000 with Nimbus, FutureGrid (USA), Open Cirrus (USA). http://mapreduce.inria.fr
  • 18. The A-Brain Project: data-intensive processing on Microsoft Azure clouds. Application: large-scale joint genetic and neuroimaging data analysis. Goal: assess and understand the variability between individuals. Approach: optimized data processing on Microsoft's Azure clouds. Inria teams involved: KerData (Rennes) and Parietal (Saclay). Framework: the Joint MSR-Inria Research Center; Microsoft involvement: Azure teams, EMIC.
  • 19. The Imaging Genetics challenge: comparing heterogeneous information. Genetic information: SNPs (allele patterns such as GG, GT, TT); MRI brain images; clinical/behavioural data. Here we focus on the link between the genetic data and the brain images.
  • 20. Neuroimaging-genetics: the problem. Several brain diseases have a genetic origin, or their occurrence/severity is related to genetic factors. Genetics is important to understand and predict the response to treatment. Genetic variability is captured in DNA micro-array data. The gene → image link is modelled as p(image | genetic).
  • 21. Imaging genetics: methodological issues. Brain image Y: ~10^5-10^6 variables (anatomical MRI, functional MRI, diffusion MRI). Genetic data X: ~10^5-10^6 variables (DNA array (SNP/CNV), gene expression data, others...). About 2000 subjects.
  • 22. A BIG DATA challenge... Azure can help... Data: voxels × SNPs with a double permutation, of which only 5%-10% is useful. Computation: estimated timespan on a single machine. [Charts: estimation for A-Brain on Azure (350 cores); storage capacity estimations (350 cores).]
  • 23. Imaging genetics: methodological issues. Multivariate methods: predict a brain characteristic from many genetic variables. Elastic net regularization: a combination of ℓ1 and ℓ2 penalties gives sparse loadings; the parameters are set by internal cross-validation/bootstrap. Performance is evaluated using permutations (see the sketch below).
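As an illustration of this analysis pipeline (a sketch only, with toy sizes and scikit-learn's elastic net; the variable names and the choice of statistic are my assumptions, not the A-Brain code):

```python
# Sketch: elastic net linking genetic variables X to one imaging phenotype y,
# parameters set by internal cross-validation, significance assessed by permutations.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_subjects, n_snps = 200, 300             # toy sizes; the real data is ~2000 x ~10^6
X = rng.standard_normal((n_subjects, n_snps))
y = X[:, :5] @ rng.standard_normal(5) + rng.standard_normal(n_subjects)

# ElasticNetCV mixes the l1/l2 penalties (sparse loadings) and picks alpha internally.
enet = ElasticNetCV(l1_ratio=0.5, cv=3)

def cv_score(target):
    """Out-of-sample R^2 of the elastic net for a given target vector."""
    return cross_val_score(enet, X, target, cv=5).mean()

observed = cv_score(y)

# Permutation test: shuffling y breaks the gene-image link and yields a null distribution.
n_perm = 10                                # thousands of permutations in the real study
null = np.array([cv_score(rng.permutation(y)) for _ in range(n_perm)])
p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
print(f"observed R^2 = {observed:.3f}, permutation p = {p_value:.2f}")
```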
  • 24. A-Brain as Map-Reduce processing.
  • 25. A-Brain as Map-Reduce data processing (an illustrative toy version of this decomposition is sketched below).
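As a rough illustration of how such an association scan maps onto map/reduce (a toy, in-process version under my own assumptions about block sizes and the statistic, not the actual A-Brain/TomusMapReduce implementation): mappers score one (voxel block, SNP block) pair each, and reducers aggregate the partial results per voxel.

```python
# Toy sketch of casting the voxel-x-SNP association scan as map/reduce tasks.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_voxels, n_snps = 100, 40, 60           # toy sizes
Y = rng.standard_normal((n_subjects, n_voxels))      # imaging phenotypes
X = rng.standard_normal((n_subjects, n_snps))        # genetic variables
BLOCK = 20

def blocks(n):
    return [slice(i, i + BLOCK) for i in range(0, n, BLOCK)]

def map_task(voxel_slice, snp_slice):
    """Mapper: |correlation| scores for one (voxel block, SNP block) pair."""
    Yb = (Y[:, voxel_slice] - Y[:, voxel_slice].mean(0)) / Y[:, voxel_slice].std(0)
    Xb = (X[:, snp_slice] - X[:, snp_slice].mean(0)) / X[:, snp_slice].std(0)
    return np.abs(Yb.T @ Xb) / n_subjects

def reduce_task(partials):
    """Reducer: keep, for each voxel, the maximum score across all SNP blocks."""
    return np.maximum.reduce(partials)

partials = {}
for vs, ss in itertools.product(blocks(n_voxels), blocks(n_snps)):
    partials.setdefault((vs.start, vs.stop), []).append(map_task(vs, ss).max(axis=1))

max_score_per_voxel = {v: reduce_task(p) for v, p in partials.items()}
```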
  • 26. Efficient procedures for statistics. Example: voxelwise Genome-Wide Association Studies (vGWAS). 740 subjects, ~50,000 voxels, ~500,000 SNPs, 10,000 permutations → ~12,000 hours of computation → ~1.8 PB of statistical scores (the back-of-the-envelope check below matches that order of magnitude).
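As a quick sanity check on the storage figure, assuming one 8-byte score per (voxel, SNP, permutation) triple, which is my assumption rather than something stated on the slide:

```python
# Back-of-the-envelope check of the ~1.8 PB estimate (8 bytes/score is assumed).
voxels, snps, permutations = 50_000, 500_000, 10_000
bytes_per_score = 8
total_bytes = voxels * snps * permutations * bytes_per_score   # 2.0e15 bytes
print(total_bytes / 2**50)   # ~1.78 PiB, i.e. the ~1.8 PB order of magnitude
```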
  • 27. Efficient procedures for statistics. Example: ridge regression with cross-validation loops. Some costly computations (an SVD takes ~60 s) are reused 1-2 million times and cannot all be kept in memory: recomputing them would cost ~60-120 × 10^6 s (1.9-3.8 years). An efficient distributed cache can therefore achieve a huge speedup (see the sketch below).
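Why the SVD is the thing to cache: once X = U·diag(s)·Vᵀ is known, the ridge solution for every penalty value in the cross-validation loop is a cheap reweighting of the same factors. A small numpy illustration (my sketch, not the Parietal code):

```python
# Sketch: one cached SVD of the design matrix serves every ridge penalty in a CV loop.
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 1000
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) * 0.1 + rng.standard_normal(n)

# Costly step (the ~60 s computation at real data sizes): do it once and cache it.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uty = U.T @ y

def ridge_coef(lam):
    """Ridge solution beta = V diag(s / (s^2 + lam)) U^T y, reusing the cached SVD."""
    return Vt.T @ ((s / (s**2 + lam)) * Uty)

# Every candidate penalty reuses the same factors instead of refitting from scratch.
betas = {lam: ridge_coef(lam) for lam in (0.1, 1.0, 10.0, 100.0)}
```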
  • 28. The TomusBlobs approach.
  • 29. Requirements for cloud storage / data management: high throughput under heavy concurrency; fine-grain access; scalability/elasticity; data availability; transparency. Design principles: data locality (use the local storage of the compute nodes); no modification of the cloud middleware; loose coupling between storage and applications; a storage hierarchy.
  • 30. TomusBlobs: architecture. [Diagram: the storage built over the computation nodes.]
  • 31. Architecture, continued. System components: the Initiator (cloud-specific, with a generic stub; properties: scaling and self-configuration); the Distributed Storage (aggregates the virtual disks; not tied to a specific solution); the Client API (a cloud-specific API that exposes the operations transparently). [Diagram: a VM snapshot bundling the Initiator, local disk, application, Client API and TomusBlobs entity into a customizable environment.] A hypothetical client-side sketch follows below.
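To illustrate the kind of interface such a client layer exposes (a hypothetical sketch; the class, method names and hash-based placement below are my assumptions, not the actual TomusBlobs API), the application only sees put/get operations while the layer decides which node's local disk holds the data:

```python
# Hypothetical sketch of a TomusBlobs-like client API over aggregated local disks.
import hashlib


class LocalDiskNode:
    """Stands in for the local disk of one computation node."""
    def __init__(self):
        self.store = {}


class BlobClient:
    def __init__(self, nodes):
        self.nodes = nodes                       # the aggregated virtual disks

    def _node_for(self, blob_id: str) -> LocalDiskNode:
        # Simple hash-based placement; the real system's policy may differ.
        h = int(hashlib.sha1(blob_id.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def put(self, blob_id: str, data: bytes) -> None:
        self._node_for(blob_id).store[blob_id] = data

    def get(self, blob_id: str) -> bytes:
        return self._node_for(blob_id).store[blob_id]


client = BlobClient([LocalDiskNode() for _ in range(4)])
client.put("intermediate/partition-07", b"map output bytes")
assert client.get("intermediate/partition-07") == b"map output bytes"
```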
  • 32. TomusBlobs evaluation. Scenario: a single reader/writer; data transferred from memory to storage. Metric: client I/O throughput.
  • 33. TomusBlobs evaluation: cumulative read throughput and cumulative write throughput. Scenario: multiple readers/writers; throughput limited by bandwidth. Observed gains: read 4x, write 5x.
  • 34. TomusBlobs as a storage backend for sharing application data in MapReduce. [Diagram: several application instances, each with its API, sharing data through TomusBlobs.]
  • 35. TomusMapReduce evaluation. Scenario: increase the problem size; the computation is optimized by managing intermediate data better.
  • 36. Iterative MapReduce (Daytona): a merge step; in-memory caching of static data; cache-aware hybrid scheduling using queues as well as a bulletin board (a special table). [Flow diagram: job start → map + combine (using the data cache) → reduce → merge → add an iteration? if yes, hybrid scheduling of the new iteration; if no, job finish.] Credits: Dennis Gannon
  • 37. Beyond MapReduce: a unique result produced by a parallel reduce phase; no central control entity; no synchronization barrier. [Diagram: mappers feeding a tree of reducers.]
  • 38. Zoom on the reduction ratio: compute the minimum of a set of large matrices (7.5 GB) using 30 mappers (a toy version of this reduction tree is sketched below).
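To make the reduction ratio concrete, here is a toy, single-process version of the reduce tree (my illustration; the real experiment runs distributed on Azure): each reducer takes `ratio` partial results, and levels are repeated until one result remains.

```python
# Toy sketch of a parallel reduce tree with a configurable reduction ratio.
# The associative operation here is the element-wise minimum of matrices.
import numpy as np

def reduce_tree(partials, ratio):
    """Group `ratio` partial results per reducer, repeating until one remains."""
    while len(partials) > 1:
        partials = [
            np.minimum.reduce(partials[i:i + ratio])
            for i in range(0, len(partials), ratio)
        ]
    return partials[0]

rng = np.random.default_rng(0)
mapper_outputs = [rng.standard_normal((100, 100)) for _ in range(30)]  # 30 mappers
global_min = reduce_tree(mapper_outputs, ratio=5)                      # reduction ratio 5
assert np.allclose(global_min, np.minimum.reduce(mapper_outputs))      # same as a flat reduce
```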
  • 39. Azure integration.
  • 40. The Most Frequent Words benchmark: the input data size varies from 3.2 GB to 32 GB; reduction ratio = 5.
  • 41. Execution times for A-Brain: an increasing number of map jobs corresponds to an increasing size of data (5 GB to 50 GB).
  • 42. Beyond single-site processing. Data movements across geo-distributed deployments are costly, so minimize the size and the number of transfers; the overall aggregate must collaborate towards reaching the goal; the deployments work as independent services; the architecture can be used for scenarios in which data is produced in different locations.
  • 43. Towards a geo-distributed TomusBlobs approach: TomusBlobs for intra-deployment data management; public storage (Azure Blobs/Queues) for inter-deployment communication; the iterative-reduce technique to minimize the number of transfers (and the data size); balance the network bottleneck away from a single data center (a two-level reduce sketch follows below).
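A minimal sketch of the two-level idea (my illustration, not the actual multi-site engine): each deployment reduces its own mappers' outputs locally, and only one partial result per site crosses the expensive inter-datacenter link.

```python
# Sketch: hierarchical (two-level) reduce across geo-distributed deployments,
# so that only one partial result per site is shipped between data centers.
import numpy as np

def local_reduce(site_partials):
    """Intra-site reduce: runs inside one deployment, over the cheap local network."""
    return np.minimum.reduce(site_partials)

def global_reduce(per_site_results):
    """Inter-site reduce: operates on one object per site (the costly WAN transfers)."""
    return np.minimum.reduce(per_site_results)

rng = np.random.default_rng(0)
sites = {"NE": 10, "WE": 10, "NUS": 10}                  # mappers per deployment
site_outputs = {name: [rng.standard_normal((50, 50)) for _ in range(n)]
                for name, n in sites.items()}

per_site = [local_reduce(outputs) for outputs in site_outputs.values()]  # 3 WAN transfers
result = global_reduce(per_site)
```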
  • 44. Multi-site MapReduce: 3 deployments (NE, WE, NUS), 1000 CPUs; A-Brain executed across multiple sites.
  • 45. Beyond MapReduce: workflow processing.
  • 46. Data access patterns for workflows [1]: pipeline (caching; data-informed workflow); broadcast (replication; data-size awareness); reduce/gather (co-placement of all the data; data-informed workflow); scatter (file-size awareness; data-informed workflow). [1] Vairavanathan et al., A Workflow-Aware Storage System: An Opportunity Study, http://ece.ubc.ca/~matei/papers/ccgrid2012.pdf
  • 47. eScience Central (Newcastle University).
  • 48. Generic Worker walkthrough (Microsoft ATLE). [Architecture diagram: a researcher's client code submits job details to a Job Management Service; the Generic Worker driver, with a pluggable runtime environment, picks jobs up from the job details and job index tables, fetches input files from BLOB/shared storage, runs the application code, writes the output files back, and raises events through a notification service (accounting and status-change listeners), with a scaling service alongside. Interoperable standard protocols and data schemas are used: OGF BES, OGF JSDL, SOAP WS-*.] Credits: Microsoft
  • 49. Defining the scope. Assumptions about the workflows: workflows are composed of batch jobs with well-defined data-passing schemas; the input and the output of the batch jobs are files; the batch jobs and their inputs and outputs can be uniquely identified in the system. Most workflows fit in this subclass. Idea: manage the files inside the deployment.
  • 50. The concept. Components: a File Metadata Registry (e.g. F1 → VM1; F2 → VM1, VM2; F3 → VM2) and a Transfer Module on each VM. Flow: (1) Register(F1, VM1); (2) GetLocation(F1); (3) DownloadFile(F1) from a node that holds it (a sketch of this flow follows below).
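A compact sketch of this register/lookup/download flow (hypothetical classes and names; the real system plugs in concrete registry and transfer backends, as the next slides describe):

```python
# Hypothetical sketch of the concept: a shared metadata registry maps file IDs to
# the VMs holding them; each VM's transfer module moves the bytes directly.
class MetadataRegistry:
    """Shared key-value registry: file ID -> set of VM names holding the file."""
    def __init__(self):
        self.locations = {}

    def register(self, file_id, vm_name):
        self.locations.setdefault(file_id, set()).add(vm_name)

    def get_location(self, file_id):
        return self.locations.get(file_id, set())


class VirtualMachine:
    def __init__(self, name, registry):
        self.name, self.registry = name, registry
        self.local_disk = {}                     # stands in for the VM's local storage

    def publish(self, file_id, data):
        self.local_disk[file_id] = data
        self.registry.register(file_id, self.name)       # (1) Register(F1, VM1)

    def fetch(self, file_id, vms):
        holders = self.registry.get_location(file_id)    # (2) GetLocation(F1)
        source = vms[next(iter(holders))]
        data = source.local_disk[file_id]                # (3) DownloadFile(F1)
        self.local_disk[file_id] = data
        self.registry.register(file_id, self.name)       # the file is now replicated here
        return data


registry = MetadataRegistry()
vms = {name: VirtualMachine(name, registry) for name in ("VM1", "VM2")}
vms["VM1"].publish("F1", b"workflow output")
assert vms["VM2"].fetch("F1", vms) == b"workflow output"
assert registry.get_location("F1") == {"VM1", "VM2"}
```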
  • 51. Characteristics of the components. Metadata Registry: holds the location of files within the deployment; stores key-value pairs (file identification → retrieval information); accessible by all nodes; candidate solutions include Azure Caching Preview, Azure Tables, or an in-memory DB. Transfer Module: transfers files from one node to another; handles files; each VM has such a module, applications access the local module, and the modules interact across nodes; candidate solutions include FTP, torrent, in-memory transfers, HTTP, etc. Idea: adopt multiple transfer solutions and adapt to the context, selecting the one that fits best.
  • 52. Transfer methods. InMemory: caching data; in-memory data offers fast access; GBs of memory capacity per deployment; suited to small files. BitTorrent: replicas for file dissemination; collaborative reads; a new way of staging in data. FTP: TCP transfer; medium and large files; potential for interoperability. (A sketch of the adaptive selection is given below.)
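The "adapt to the context" idea can be sketched as a simple policy that picks a transfer method from the file size and the access pattern; the thresholds and return values below are illustrative assumptions, not the system's actual policy:

```python
# Illustrative policy sketch for the adaptive transfer module (thresholds are guesses).
MB = 1024 * 1024

def choose_transfer_method(file_size_bytes, n_readers, replicas_enabled):
    """Pick a transfer backend from the context of the access."""
    if file_size_bytes < 16 * MB:
        return "inmemory"          # small files: fast in-memory caching
    if n_readers > 1 and replicas_enabled:
        return "torrent"           # broadcast with replicas: collaborative reads
    return "ftp"                   # medium/large point-to-point transfers

assert choose_transfer_method(2 * MB, n_readers=1, replicas_enabled=False) == "inmemory"
assert choose_transfer_method(200 * MB, n_readers=20, replicas_enabled=True) == "torrent"
assert choose_transfer_method(200 * MB, n_readers=1, replicas_enabled=False) == "ftp"
```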
  • 53. [Diagram: inside a VM snapshot, the adaptive storage combines the metadata registry held in VM memory, the transfer module services (FTP with tracker and peer, torrent, in-memory), a replication queue driving replication, and the local disk.]
  • 54. [Diagram: write/read flow through the adaptive storage with Azure Caching as the metadata registry. One application creates F1 and calls Upload(F1): its adaptive storage writes F1 to local storage/memory and writes the metadata. Another application calls Read(F1): its adaptive storage gets the metadata and downloads F1 through the API.]
  • 55. Scenario 2: large files, replication enabled. [Chart: time (s) versus the size of a single file (50-250 MB) for DirectLink, Torrent, Adaptive and AzureBlobs.] Torrents are superior for broadcast when replicas are used; DirectLink is faster for the pipeline (reduction tree); the adaptive storage can choose the best strategy each time.
  • 56. NCBI BLAST for Azure. Seamless experience: evaluate data and invoke computational models from Excel; computationally heavy analysis is done close to the large database of curated data; scalable for large, surge computationally heavy analyses. [Architecture: a web role where the user selects the databases and the input sequence, an input-splitter worker role, n BLAST execution worker roles reading the genome databases and the BLAST DB configuration from Azure Blob storage, and a combiner worker role.] Credits: Dennis Gannon
  • 57. BLAST analysis: the data management component. [Chart: time (s) versus the number of BLAST jobs (5-60), comparing download and upload with the adaptive storage and with AzureBlobs.] Database files: 1.6 GB; input size: 800 MB; 50 nodes.
  • 58. Scalable storage on clouds: open issues. Understanding price-performance trade-offs: consistency, availability, performance, cost, security, quality of service, energy consumption; autonomy, adaptive consistency; dynamic elasticity; trade-offs exposed to the user. High performance variability: understand it, model it, cope with it. Deployment/application launch time is high. The latency of data accesses is still an issue. Data movements are expensive. Coping with tightly coupled applications. Coping with various cloud programming models. Virtualization overhead. Benchmarking. Performance modeling. Self-optimization for cost reduction: elastic scale-down. Security and privacy.
  • 59. Cloud Computing @ Inria: strategic research agenda. Extreme scale does matter, BUT not only. Other focus areas: affordability and usability of intermediate-size systems; pervasiveness of usage across the entire industry, including Small and Medium Enterprises (SMEs) and ISVs; new HPC deployments (e.g. Big Data, HPC in clouds); HPC and cloud usage expansion, fostering the development of consultancy, expertise and service businesses / end-user support; facilitating the creation of start-ups and the development of the SME sector (hardware/software supply side); education and training (incl. engineering skills for industry).
  • 60. Azure Brain: 4th paradigm, scientific discovery & (really) big data (REC201). Gabriel Antoniu, Senior Research Scientist, Inria, Head of the KerData Project-Team, Inria Rennes – Bretagne Atlantique. Radu Tudoran, PhD student, ENS Cachan – Brittany, KerData Project-Team, Inria Rennes – Bretagne Atlantique. Contacts: Gabriel.Antoniu@inria.fr, Radu.Tudoran@inria.fr