This is a report of the ELIXIR pilot project performed by the EMBL-EBI (PRIDE and System teams), BILS and EUDAT. The title of the pilot project was: "Integration of BILS-ProteomeXchange using EUDAT resources".
ELIXIR Pilot Actions launched in 2014: Integration of BILS-ProteomeXchange using EUDAT resources
1. European Life Sciences Infrastructure for Biological Information
www.elixir-europe.org
“BILS-ProteomeXchange integration using EUDAT resources”
ELIXIR-Pilot Project
Dr. Juan A. Vizcaíno, EMBL-EBI, juan@ebi.ac.uk
Dr. Fredrik Levander, BILS, fredrik.levander@bils.se
2. Juan A. Vizcaíno
juan@ebi.ac.uk
ELIXIR Webinar
20 May 2015
Main people involved directly in this pilot
• Andy Jenkinson (Systems group)
• Rui Wang (PRIDE)
• Juan A. Vizcaíno (PRIDE)
• Fredrik Levander
• Samuel Lampa
• Janos Nagy
• Mikael Borg
• Jani Heikkinen
3. Overview
• Short intro to PRIDE & ProteomeXchange, BILS and EUDAT
• Objectives of the pilot
• Report on the results
• Perspectives for the future and conclusions
5. PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• PRIDE stores mass spectrometry (MS)-based proteomics data:
  • Peptide and protein expression data (identification and quantification)
  • Post-translational modifications
  • Mass spectra (raw data and peak lists)
  • Technical and biological metadata
  • Any other related information
• Full support for tandem MS approaches
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2013
6. ProteomeXchange Consortium
http://www.proteomexchange.org
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and MassIVE (UCSD, San Diego).
• Tranche and Peptidome initially included but discontinued.
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers
Vizcaíno et al., Nat Biotechnol, 2014
7. ProteomeXchange data workflow: PRIDE
[Figure: ProteomeXchange data workflow. Labels shown: ProteomeCentral; receiving repositories PRIDE (MS/MS data), MassIVE (MS/MS data) and PASSEL (SRM data); submitted items metadata/manuscript, raw data* and results; researcher's results, reprocessed results, raw data* and metadata; other resources Journals, UniProt/neXtProt, PeptideAtlas, GPMDB and other DBs.]
Vizcaíno et al., Nat Biotechnol, 2014
8. PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
   a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
   b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
4. Other files: Optional files:
   a. QUANT: Quantification-related results
   b. PEAK: Peak list files
   c. GEL: Gel images
   d. OTHER: Any other file type
   e. FASTA
   f. SP_LIBRARY
[Figure labels: Published; Raw files; Other files]
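The file categories listed on this slide can be illustrated with a small sketch. The category names (RAW, RESULT, PEAK, GEL, FASTA, OTHER) follow the slide; the extension table and the function names `classify` and `group_by_type` are hypothetical and are not part of the PX Submission Tool.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-category table. The category names come from the
# slide; the file extensions are illustrative assumptions only.
EXTENSION_TO_TYPE = {
    ".raw": "RAW",
    ".mzid": "RESULT",
    ".mgf": "PEAK",
    ".mzml": "PEAK",
    ".mzxml": "PEAK",
    ".fasta": "FASTA",
    ".tif": "GEL",
}


def classify(filename: str) -> str:
    """Return the category for a file name, defaulting to OTHER."""
    suffix = PurePosixPath(filename).suffix.lower()
    return EXTENSION_TO_TYPE.get(suffix, "OTHER")


def group_by_type(filenames):
    """Group file names by category, preserving input order within each group."""
    groups = {}
    for name in filenames:
        groups.setdefault(classify(name), []).append(name)
    return groups
```

A real submission would rely on the submitter's own choice of category per file; guessing from the extension is only a convenient default for this sketch.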
10. PRIDE Components: Submission Process
[Figure labels: PRIDE Converter 2; PRIDE Inspector; PX Submission Tool; mzIdentML; PRIDE XML]
11. ProteomeXchange: 1,963 datasets up until 1st April, 2015
Origin:
396 USA
224 Germany
191 United Kingdom
106 Netherlands
105 China
104 France
94 Switzerland
75 Canada
55 Japan
55 Spain
54 Denmark
52 Sweden
50 Belgium
48 Australia
34 Austria
25 Norway
23 Taiwan
22 India
21 Finland
20 Ireland
20 Italy
16 Brazil
15 Russia
14 Republic of Korea
10 Israel
10 Singapore …
Type:
613 PRIDE complete
1177 PRIDE partial
79 PeptideAtlas/PASSEL complete
69 MassIVE
25 reprocessed
Publicly Accessible:
959 datasets, 49% of all
88% PRIDE
9% PASSEL
3% MassIVE
Data volume:
Total: ~102 TB
Number of all files: ~250,000
PXD000320-324: ~ 5 TB
PXD000065: ~ 1.4TB
Datasets/year:
2012: 102
2013: 527
2014: 963
2015: 371
Top Species studied by at least 20 datasets:
839 Homo sapiens
232 Mus musculus
79 Arabidopsis thaliana
77 Saccharomyces cerevisiae
44 Rattus norvegicus
35 Escherichia coli
21 Bos taurus
21 Glycine max
~ 460 species in total
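As a quick consistency check on the figures above, the per-type and per-year counts can be summed and compared with the stated total, and the public share recomputed. All numbers are taken from the slide; only the variable names are introduced here.

```python
# Figures copied from the slide (datasets up until 1st April 2015).
TOTAL_DATASETS = 1963
BY_TYPE = {
    "PRIDE complete": 613,
    "PRIDE partial": 1177,
    "PeptideAtlas/PASSEL complete": 79,
    "MassIVE": 69,
    "reprocessed": 25,
}
BY_YEAR = {2012: 102, 2013: 527, 2014: 963, 2015: 371}
PUBLIC_DATASETS = 959

type_total = sum(BY_TYPE.values())
year_total = sum(BY_YEAR.values())
# Share of publicly accessible datasets, rounded to a whole percent.
public_percent = round(100 * PUBLIC_DATASETS / TOTAL_DATASETS)

print(type_total, year_total, public_percent)
```

Both sums reproduce the total of 1,963 datasets, and the public share rounds to the 49% quoted on the slide.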
12. BILS – Bioinformatics Infrastructure for Life Sciences
• Distributed national research infrastructure supported by the Swedish Research Council
• BILS provides:
  • Bioinformatics support (consultancy)
  • Bioinformatics infrastructure (data and tools). Computing and storage are provided in collaboration with SNIC
  • Bioinformatics network
  • Nodes at each of the 6 large university cities
  • Annual workshop
  • Training
• Coordination with other bioinformatics activities
• Swedish node in ELIXIR
13. Main BILS proteomics support aims
• Data storage:
  • Secure
  • Long-term
  • Metadata
  • Automated
  • Publishing
  • Standardised formats
• Data processing:
  • Accessible data processing workflows
14. Proteios: Software environment for proteomics
A multi-user platform for analysis and management of proteomics data
[Figure labels: web browser access and analysis of own data only; BILS scripts; public access to released raw data]
Häkkinen et al. (2009) J Proteome Res
15. EUDAT
http://www.eudat.eu
• EUDAT aims to contribute to building and operating a Collaborative Data Infrastructure for European science.
• This involves a suite of co-ordinated and interoperable services for preserving scientific data, and for making them accessible to researchers.
• EUDAT collaborates with research communities across a range of disciplines, from social sciences to environmental science and including molecular biology (as represented by ELIXIR).
• These communities have diverse structures, cultures and scales but also share some common requirements regarding the management of data.
18. EUDAT: B2SAFE and iRODS
• B2SAFE aims to provide a software ecosystem for persistently available data, including persistent identification, abstracted data storage, and reliable automated replication via auditable rules.
• It is built on top of the iRODS data management software (http://irods.org) and integrates a PID system such as the Handle API of the European Persistent Identifier Consortium (EPIC, http://www.pidconsortium.eu).
19. Overview
• PRIDE, ProteomeXchange, BILS and EUDAT
• Objectives of the pilot
• Report on the results
• Perspectives for the future and conclusions
20. Objective
• To integrate the data repositories for MS proteomics data run by BILS (Sweden) and ProteomeXchange (via the PRIDE database, EMBL-EBI, UK), using EUDAT’s B2SAFE software.
21. Plans at European level
[Figure: national proteomics centers (metadata, results, raw data), a central repository (metadata, results, raw data) and data storage centers (metadata, raw data). Numbered labels: 1. ELIXIR replication; 2. EUDAT replication.]
22. Objective
• To integrate the data repositories for MS proteomics data run by BILS (Sweden) and ProteomeXchange (via the PRIDE database, EMBL-EBI, UK), using EUDAT’s B2SAFE software.
• This project will also show the potential of collaboration among research infrastructures and e-infrastructures to better manage the data deluge. It will help to evaluate the requirements of such federated systems.
23. Overview
• Short intro to PRIDE & ProteomeXchange, BILS and EUDAT
• Objectives of the pilot
• Report on the results
• Perspectives for the future and conclusions
24. Timeline
• The pilot started when Jani Heikkinen (EUDAT) installed B2SAFE at EMBL-EBI (July 2014).
• The data workflow was defined in September/October 2014.
• Implementation work happened in parallel, with regular weekly calls from January 2015.
• The pilot is now finishing (May 2015).
25. Envisioned data workflow (September/October 2014)
• Default B2SAFE rules -> Trigger replication of data from BILS to EBI
• PIDs assigned per file
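The envisioned workflow (rule-triggered replication from BILS to EBI, with one PID per file) can be modelled with a toy in-memory sketch. This is not B2SAFE or iRODS code: the `Zone` class, the `replicate` function and the Handle prefix `12345` are all hypothetical stand-ins chosen for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Zone:
    """Toy stand-in for a data zone: a name plus a mapping of file path -> PID."""
    name: str
    files: dict = field(default_factory=dict)


def mint_pid(prefix: str) -> str:
    """Create a Handle-style identifier 'prefix/suffix' with a random UUID suffix."""
    return f"{prefix}/{uuid.uuid4()}"


def replicate(source: Zone, target: Zone, prefix: str) -> dict:
    """Copy every file missing at the target and assign each replica its own PID.

    Returns the mapping path -> PID for the replicas created by this call.
    """
    created = {}
    for path in source.files:
        if path not in target.files:
            pid = mint_pid(prefix)
            target.files[path] = pid
            created[path] = pid
    return created
```

Running `replicate` a second time creates nothing new, which mirrors the idea that a replication rule should only act on files that have not yet been copied.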
26. Implementation process (1)
• B2SAFE 3.0.0 (including iRODS 3.3.1) was initially installed at EMBL-EBI.
• However, BILS had already moved to iRODS v4.
• Incompatibility problems were found.
• It was decided to install iRODS 4.0 at the EBI to solve the incompatibility issue.
• At the time B2SAFE was not officially supported with iRODS version 4.0.3, so changes were necessary to the original install procedure to accommodate 4.0.3.
27. Implementation process (2)
• EBI and BILS obtained Handle prefixes and made them available within EPIC. The integration with iRODS was successfully tested.
• The next step was to configure B2SAFE and achieve a test replication of a file from BILS to EBI using the B2SAFE PID creation and file transfer rules.
• Unexpected delays:
  • EBI experienced some network issues that affected communications between the EBI and BILS iRODS servers.
  • Two successive bugs were discovered. Both centered on the rule execution engine and prevented B2SAFE from functioning.
  • These bugs were solved by EUDAT & iRODS developers.
28. Implementation process (3)
• With workarounds now in place, it was possible to manually trigger a successful replication of a file from BILS to EBI.
• However, it became apparent that the authorisation mechanism employed by iRODS in a federation would make the proposed submission workflow difficult to manage in a production environment.
• This means every BILS researcher able to submit data must have a user created for them on the EBI server first. Alternative customised solutions could solve this issue by decoupling the actions of researchers from the replication itself. However, this would inevitably add complexity.
29. Implementation process (4)
• At this point (March 2015) the pilot had overrun (it was expected to last 6 months), with more work required to integrate the B2SAFE replication process with the PRIDE submission pipeline.
• It was decided to halt the process and find an alternative way to achieve the same goals using existing resources.
• A detailed report has been written and has been sent to all the parties involved.
30. Implemented alternative solution
• Proteios is able to generate the metadata file needed for the submission to ProteomeXchange via PRIDE.
• The PX submission tool was extended to support loading of files not available locally at the moment of submission (URLs are specified).
• As a proof of concept, dataset PXD002037 was submitted to PRIDE. It is now publicly available.
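The idea of listing files by URL instead of by local path can be sketched as a simple pre-submission check. This does not reproduce the PX submission tool or its file format: the function names and the set of accepted URL schemes are assumptions made for this example.

```python
import os
from urllib.parse import urlparse

# Assumed set of acceptable schemes; the slides do not specify which are supported.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def is_remote(location: str) -> bool:
    """True if the location looks like a URL with one of the assumed schemes."""
    return urlparse(location).scheme in ALLOWED_SCHEMES


def check_entries(locations):
    """Split locations into (remote, local, missing) lists.

    Remote entries are accepted as-is; local entries must exist on disk.
    """
    remote, local, missing = [], [], []
    for location in locations:
        if is_remote(location):
            remote.append(location)
        elif os.path.exists(location):
            local.append(location)
        else:
            missing.append(location)
    return remote, local, missing
```

In this sketch a remote file is never downloaded by the submitter; only its URL travels with the submission, which is the point of the extension described above.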
33. Dataset tags in PRIDE Archive
http://www.ebi.ac.uk/pride/archive/simpleSearch?q=&projectTagFilters=Bioinformatics%20Infrastructure%20for%20Life%20Sciences%20(BILS)%20network%20(Sweden)
- Datasets can be tagged with different attributes.
- Functionality available in the submission process.
- Stable URLs can be generated.
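The stable URL shown on this slide is the search endpoint plus a percent-encoded tag name. A minimal sketch of building such a URL follows; the base URL and parameter names are copied from the slide, while the helper name `tag_search_url` is introduced here, and whether the endpoint still accepts this form today is not something the slides establish.

```python
from urllib.parse import quote

# Base URL and query parameters as they appear on the slide (2015).
BASE = "http://www.ebi.ac.uk/pride/archive/simpleSearch"


def tag_search_url(tag: str) -> str:
    """Build the tag-filter URL, encoding spaces as %20 and keeping parentheses."""
    return f"{BASE}?q=&projectTagFilters={quote(tag, safe='()')}"


print(tag_search_url("Bioinformatics Infrastructure for Life Sciences (BILS) network (Sweden)"))
```

Parentheses are passed through `safe` so that the result matches the URL on the slide character for character.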
34. Overview
• Short intro to PRIDE & ProteomeXchange, BILS and EUDAT
• Objectives of the pilot
• Report on the results
• Perspectives for the future and conclusions
35. At present and in the near future…
• EMBL-EBI is involved in the EUDAT 2020 project (PI is Steven Newhouse).
• EMBL-EBI will therefore continue to collaborate with EUDAT, to gain experience in the use of this software.
• PRIDE will evaluate the situation in the future to decide whether the originally envisioned submission pipeline (based on B2SAFE and iRODS) should be implemented.
36. Conclusions
• The pilot establishes that the original use case is not the best application of B2SAFE at the present time. However, the situation will be kept under review by PRIDE.
• This conclusion is not a reflection on B2SAFE per se; indeed, B2SAFE and iRODS have been found to be very flexible and are likely to be interesting candidates for other use cases outside of PRIDE, elsewhere in EMBL-EBI or ELIXIR.
• In particular, this applies to use cases focused on data management within or between data centres (i.e. bipartite collaborations) or to environments where mature data submission, curation and archiving solutions do not already exist.
• In addition, we recommend that ELIXIR continues to explore EUDAT services and their relevance in ELIXIR use cases.
37. Conclusions: Technical recommendations
• Incorporate a fully functional RESTful interface for iRODS into B2SAFE that can be used by a client, to avoid installing iCommands on the client machine.
• The security model should be adapted to allow anonymous read/write (RW) access to a specified URL.
• If widespread deployment of EUDAT software is expected, effort must be committed by EUDAT 2020 to make the software more easily and quickly deployable by ‘ordinary’ system administrators.
38. Acknowledgements
• Henning Hermjakob
• Steven Newhouse
• Rafael Jimenez
• Bengt Persson
• EUDAT management & developers