Clouds, Grids and Data

The next-generation sequencing data deluge requires storage and compute services to be provisioned at an ever-increasing rate. Can Cloud (and last decade's buzzword, Grid) help us? Talk given at the NHGRI Cloud Computing Workshop, 2010.

Clouds, Grids and Data. Guy Coates, Wellcome Trust Sanger Institute.

Based at the Hinxton Genome Campus, Cambridge, UK.

We have active cancer, malaria, pathogen and genomic variation / human health studies.

Past Collaborations [diagram: four sequencing centres, with data flowing to a single sequencing centre + DCC]

Future Collaborations: collaborations are short term, 18 months to 3 years. [diagram: Sequencing Centres 1, 2A, 2B and 3 linked by federated access]

Genomics Data: data size per genome, largest to smallest
Intensities / raw data (2 TB)
Sequence + quality data (500 GB)
Alignments (200 GB)
Variation data (1 GB)
Individual features (3 MB)
Storage forms shown on the slide: unstructured data (flat files) and structured data (databases): DAS, BioMart etc.?
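To put the per-genome figures above in proportion, here is a minimal sketch that adds them up. It assumes decimal units (1 TB = 10^12 bytes) and that every tier is kept for every genome; neither assumption is stated on the slide, and the function names are illustrative.

```python
# Per-genome data sizes as quoted on the slide, in bytes.
# Assumption: decimal units (1 TB = 10**12 bytes).
TB, GB, MB = 10**12, 10**9, 10**6

SIZES = {
    "intensities / raw data": 2 * TB,
    "sequence + quality data": 500 * GB,
    "alignments": 200 * GB,
    "variation data": 1 * GB,
    "individual features": 3 * MB,
}


def total_bytes(sizes, n_genomes=1):
    """Total storage for n_genomes if every tier is retained."""
    return n_genomes * sum(sizes.values())


def share(sizes):
    """Fraction of the per-genome footprint taken by each tier."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


if __name__ == "__main__":
    print(f"per genome: {total_bytes(SIZES) / TB:.3f} TB")
    for name, frac in share(SIZES).items():
        print(f"{name:>25}: {frac:.4%}")
```

Under these assumptions a genome comes to about 2.7 TB, of which roughly three quarters is raw intensity data.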

Sharing unstructured data

Single institute will have data distributed for DR / worldwide access.

Some will have patient identifiable data.

iRODS [architecture diagram]
ICAT: catalogue database.
Rule Engine: implements policies.
User interface: WebDAV, icommands, FUSE.
iRODS servers: one with data on disk, one with data in a database, one with data in S3.
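The split between a catalogue and a rule engine can be illustrated with a toy model. This is not the iRODS API: the `Catalogue` class, the `missing_replicas` rule, and the paths and resource names are all hypothetical, chosen only to show a logical path mapped to several physical copies and a policy checked against that mapping.

```python
from dataclasses import dataclass, field


@dataclass
class Catalogue:
    """Toy stand-in for a catalogue database: logical path -> replicas."""
    replicas: dict = field(default_factory=dict)

    def register(self, logical_path, resource, physical_location):
        """Record that a copy of logical_path lives on the given resource."""
        self.replicas.setdefault(logical_path, []).append(
            (resource, physical_location)
        )

    def locate(self, logical_path):
        """Return all known (resource, physical_location) pairs."""
        return list(self.replicas.get(logical_path, []))


def missing_replicas(catalogue, logical_path, required_resources):
    """Toy policy rule: which required resources still lack a copy?"""
    present = {resource for resource, _ in catalogue.locate(logical_path)}
    return [r for r in required_resources if r not in present]


if __name__ == "__main__":
    cat = Catalogue()
    path = "/zone/run1/lane1.bam"  # hypothetical logical path
    cat.register(path, "disk", "/data/a/lane1.bam")
    cat.register(path, "s3", "s3://example-bucket/lane1.bam")
    print(missing_replicas(cat, path, ["disk", "s3", "archive"]))
```

The user only ever names the logical path; where the bytes live, and whether enough copies exist, is decided from the catalogue and the policy.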

Fast parallel data transfers across local and wide area network links.

Allows user at institute A to seamlessly access data at institute B in a controlled manner.

What are we doing with it?

Move files between different storage pools.

Encrypt files and place on private FTP dropboxes.

Cumbersome to manage and insecure.

Lots of solutions:

Delegated authentication?

Is data in an inaccessible archive really useful?

3. ~700 employees.

6. Shared data archives

13. Some will have patient identifiable data.

14. Plan for it now.

22. Controlled data is hard:

23. Encrypt files and place on private FTP dropboxes.

25. Software knows about S3 storage layers.

27. Culture shock.

29. Single sign on?

31. Cloud Archives

33. Is data in an inaccessible archive really useful?

36. 3 month lead time.

37. ~$1.5M capex.

38. Elephant in the room

41. NCBI -> Sanger: 15 Mbyte/s (120 Mbit/s)
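As a quick check of what that rate means in practice, the sketch below converts megabytes per second to megabits per second and estimates how long one genome's raw data (2 TB, from the data-size slide) would take at a sustained 15 Mbyte/s. It assumes decimal units and no protocol overhead; the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def mbyte_to_mbit(rate_mbyte_s):
    """Convert a rate in Mbyte/s to Mbit/s (8 bits per byte)."""
    return rate_mbyte_s * 8


def transfer_days(size_bytes, rate_bytes_s):
    """Days needed to move size_bytes at a sustained rate_bytes_s."""
    return size_bytes / rate_bytes_s / SECONDS_PER_DAY


if __name__ == "__main__":
    rate_mbyte_s = 15
    print(f"{rate_mbyte_s} Mbyte/s = {mbyte_to_mbit(rate_mbyte_s)} Mbit/s")
    raw_genome = 2 * 10**12  # 2 TB, decimal units assumed
    days = transfer_days(raw_genome, rate_mbyte_s * 10**6)
    print(f"2 TB at {rate_mbyte_s} Mbyte/s: {days:.2f} days")
```

At this rate a single 2 TB raw data set takes about a day and a half to move.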

43. 20 days to pull down 100TB from Oxford.
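The sustained throughput implied by that figure can be worked out directly. This is a minimal sketch assuming decimal terabytes and a transfer that ran continuously for the full 20 days; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def implied_rate_mbyte_s(size_bytes, days):
    """Average rate in Mbyte/s needed to move size_bytes in the given days."""
    return size_bytes / (days * SECONDS_PER_DAY) / 10**6


if __name__ == "__main__":
    rate = implied_rate_mbyte_s(100 * 10**12, 20)  # 100 TB in 20 days
    print(f"{rate:.1f} Mbyte/s ({rate * 8:.1f} Mbit/s)")
```

Under those assumptions the average works out to just under 58 Mbyte/s, or about 463 Mbit/s.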

47. Put VMs on compute that is “attached” to the data. [diagram: two data stores, each with four attached CPUs; the VM is placed on CPUs next to the data]
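One way to read this is as a placement rule: start the VM on a host that already holds the data, and only fall back to another host if none is free. The sketch below is a hypothetical illustration of that rule, not a description of any real cloud scheduler; host names and the `place_vm` function are assumptions.

```python
def place_vm(dataset, data_location, free_cpus):
    """Pick a host for a VM that needs `dataset`.

    data_location maps dataset name -> list of hosts holding a copy.
    free_cpus maps host name -> number of idle CPUs.
    Returns (host, is_data_local); host is None if nothing is free.
    """
    # Prefer a host that already has the data and a spare CPU.
    for host in data_location.get(dataset, []):
        if free_cpus.get(host, 0) > 0:
            return host, True
    # Otherwise fall back to the idlest host; the data must then be moved.
    candidates = [h for h, n in free_cpus.items() if n > 0]
    if not candidates:
        return None, False
    return max(candidates, key=lambda h: free_cpus[h]), False


if __name__ == "__main__":
    location = {"traces": ["node-a"], "alignments": ["node-b"]}
    cpus = {"node-a": 2, "node-b": 0, "node-c": 4}
    print(place_vm("traces", location, cpus))
    print(place_vm("alignments", location, cpus))
```

In the second call the only host holding the data is busy, so the VM lands elsewhere and pays the cost of moving the data over the network.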

48. Proto-example: SSAHA trace search. Trace database: ~30 TB; hash table: 320 GB. Steps: (1) hash the database; (2) distribute the hash across machines; (3) run the query in parallel.
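The three steps can be sketched on a toy scale. This is not the SSAHA implementation: it is a simple k-mer index split across in-process "shards", with threads standing in for separate machines. The k-mer length, the CRC-based shard assignment and all function names are assumptions made for illustration.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

K = 4  # toy k-mer length (assumption; real indexes use longer words)


def kmers(seq, k=K):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_shards(database, n_shards, k=K):
    """Steps 1 and 2: hash every k-mer and split the table across shards."""
    shards = [defaultdict(set) for _ in range(n_shards)]
    for name, seq in database.items():
        for kmer in kmers(seq, k):
            # crc32 gives a shard number that is stable across machines.
            shard_id = zlib.crc32(kmer.encode()) % n_shards
            shards[shard_id][kmer].add(name)
    return shards


def query_shard(shard, query_kmers):
    """Count, per database sequence, the query k-mers found in one shard."""
    hits = defaultdict(int)
    for kmer in query_kmers:
        for name in shard.get(kmer, ()):
            hits[name] += 1
    return hits


def search(shards, query, k=K):
    """Step 3: query every shard in parallel and merge the partial counts."""
    query_kmers = kmers(query, k)
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(
            pool.map(query_shard, shards, [query_kmers] * len(shards))
        )
    totals = defaultdict(int)
    for partial in partials:
        for name, count in partial.items():
            totals[name] += count
    return dict(totals)


if __name__ == "__main__":
    db = {"t1": "ACGTACGTGG", "t2": "TTTTGGGGCC"}
    print(search(build_shards(db, n_shards=3), "ACGTAC"))
```

Because each k-mer lives in exactly one shard, the partial counts can simply be summed; no shard needs to see the whole table.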

49. Practical Hurdles

55. We have effectively tied ourselves to a single provider.

56. Compute architecture [diagram comparing two designs: CPUs on a fat network sharing a POSIX global filesystem, run by a batch scheduler, versus CPUs on a thin network, each with local storage, using Hadoop/S3 data stores]

59. ...then Beowulf took over the world.

62. The challenge is computing across the data at scale.

63. Network infrastructure and cloud architectures still problematic.

66. Gen-Tao Chiang

67. Pete Clapham

70. John Teague

72. Backup

73. Other cloud projects

76. Common workload.

78. Ensembl / Annotation
TCCTCTCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG
GAATTGTCAGACATATACCAAATCCCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAA
TTGGAAAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA
TTTAGAGAAGAGAAAGCAAACATATTATAAGTTTAATTCTTATATTTAAAAATAGGAGCC
AAGTATGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC
TTGAGACCAGGAGTTTGATACCAGCCTGGGCAACATAGCAAGATGTTATCTCTACACAAA
ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG
AAGCAGGAGGGTTACTTGAGCCCAGGAGTTTGAGGTTGCAGTGAGCTATGATTGTGCCAC
TGCACTCCAGCTTGGGTGACACAGCAAAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG
AACATCTCATTTTCACACTGAAATGTTGACTGAAATCATTAAACAATAAAATCATAAAAG
AAAAATAATCAGTTTCCTAAGAAATGATTTTTTTTCCTGAAAAATACACATTTGGTTTCA
GAGAATTTGTCTTATTAGAGACCATGAGATGGATTTTGTGAAAACTAAAGTAACACCATT
ATGAAGTAAATCGTGTATATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC
