Big Data Europe is an EU-funded Horizon 2020 project that will undertake the foundational work to enable European companies to build innovative multilingual products and services based on semantically interoperable, large-scale, multilingual data assets and knowledge, available under a variety of licenses and business models.
The Open PHACTS Discovery Platform is bringing together pharmacological data resources in an integrated, interoperable infrastructure, and has been developed to reduce barriers to drug discovery in industry, academia and for small businesses.
The first round of pilots for the Big Data Europe project is about to enter the evaluation phase, including the pilot for Societal Challenge 1: Health. For this challenge, the Open PHACTS Foundation, the University of Manchester and VU Amsterdam are working on the Open PHACTS Docker installation and its integration with the Big Data Europe infrastructure.
This presentation will give you:
- a general overview of the infrastructure and the status of the generic components that are being developed
- an outline of the Societal Challenge and the rationale for the pilot
- a look into the future pilot options
The intended audience is people acquainted with basic development tools such as Docker and GitHub who have an interest in Big Data and drug discovery.
SC1 - Hangout 2: The Open PHACTS pilot
1. BIG DATA EUROPE
H2020 CSA (2015-17)
SOCIETAL CHALLENGE “HEALTH”
Integrating Big Data, Software & Communities for Addressing
Europe’s Societal Challenges
06.07.2016
2. BigDataEurope
6-Jul-16
Today:
• Short overview of Big Data Europe (Ronald Siebes)
• What is Open PHACTS (Stian Soiland-Reyes, Bryn Williams-Jones)
• The Big Data Europe infrastructure (Erika Pauwels, Aad Versteden)
• Pilot 1: The Open PHACTS Docker (Stian Soiland-Reyes)
• Q&A
Stian Soiland-Reyes
BioExcel and
University of Manchester
Ronald Siebes
VU Amsterdam
Erika Pauwels
Tenforce
Aad Versteden
Tenforce
Bryn Williams-Jones
Open PHACTS Foundation
7. Open PHACTS
Architecture and
Docker install
Stian Soiland-Reyes, University of Manchester
http://orcid.org/0000-0001-9842-9718
@soilandreyes
This work is licensed under a
Creative Commons Attribution 4.0 International License.
Big Data Europe Webinar, 2016-07-06
This work has been done as part of the BioExcel CoE (www.bioexcel.eu),
a project funded by the EC H2020 program, contract number
EINFRA-5-2015 675728
https://slides.com/soilandreyes/2016-07-06-openphacts
8. http://www.openphacts.org/
Bringing together pharmacological data resources
in an integrated, interoperable infrastructure
Data sources integrated and linked together
so that you can easily see the relationships
between compounds, targets, pathways,
diseases and tissues.
ChEBI, ChEMBL, ChemSpider, ConceptWiki, DisGeNET, DrugBank, FAERS, Gene Ontology, neXtProt, SureChEMBL, UniProt, WikiPathways
11. {
"format": "linked-data-api",
"version": "1.5",
"result": {
"_about": "https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconce
"definition": "https://beta.openphacts.org/api-config",
"extendedMetadataVersion": "https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.concep
"linkPredicate": "http://www.w3.org/2004/02/skos/core#exactMatch",
"activeLens": "Default",
"primaryTopic": {
"_about": "http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5",
"inDataset": "http://www.conceptwiki.org",
"exactMatch": [
{
"_about": "http://bio2rdf.org/drugbank:DB00398",
"description_en": "Sorafenib (rINN), marketed as Nexavar by Bayer, is a drug approved for th
"description": "Sorafenib (rINN), marketed as Nexavar by Bayer, is a drug approved for the t
"drugType_en": [
"investigational",
"approved"
],
"drugType": [
"investigational",
"approved"
],
"genericName_en": "Sorafenib",
"genericName": "Sorafenib",
"metabolism_en": "Sorafenib is metabolized primarily in the liver, undergoing oxidative meta
"metabolism": "Sorafenib is metabolized primarily in the liver, undergoing oxidative metabol
"proteinBinding_en": "99.5% bound to plasma proteins.",
"proteinBinding": "99.5% bound to plasma proteins.",
"toxicity_en": "The highest dose of sorafenib studied clinically is 800 mg twice daily. The
"toxicity": "The highest dose of sorafenib studied clinically is 800 mg twice daily. The adv
"inDataset": "http://www.openphacts.org/bio2rdf/drugbank", 2 . 4
19. Platform goals
◎ Low total cost of ownership
◎ Simple to get started with Big Data
◎ Cater for widely varying use cases
◎ Embrace emerging Big Data technologies
◎ Simple integration with custom components
21. Big Data is
◎ Volume
o Quantity of data
◎ Velocity
o Speed at which data is provided
◎ Variety
o Different formats/models in which data is provided
◎ Veracity
o Accuracy/truthfulness of the data
Why did we need all this?
25. Semantic Big Data
ongoing research!
◎ Semantic Data Lake
o from data swamp to data lake
o query contents in the data lake
◎ SANSA stack
o Big Data analytics on semantic graph
26. Support layer
◎ Swarm UI
o Launch, install and manage pipelines
◎ Pipeline daemon & monitor
o Determine order in which steps are executed
o e.g. upload files before running computations
◎ Integrator UI
o Present dashboards in a unified interface
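Deciding the order in which steps run is, at its core, a dependency-ordering problem. The sketch below is not the pipeline daemon's implementation; it only illustrates the idea with Python's standard graphlib module (Python 3.9+). The step names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it must wait for.
# The step names are illustrative and not taken from the platform.
pipeline = {
    "upload_files": set(),
    "run_computation": {"upload_files"},
    "publish_results": {"run_computation"},
}


def execution_order(steps):
    """Return the steps in an order where every dependency comes first."""
    return list(TopologicalSorter(steps).static_order())


order = execution_order(pipeline)
print(order)
```

A circular dependency between steps makes static_order raise graphlib.CycleError, so a broken pipeline definition fails early instead of hanging.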
30. Platform installation
◎ Manual installation guide
◎ Using Docker Machine
o On local machine (VirtualBox)
o In the cloud (AWS, DigitalOcean, Azure)
o Bare metal
32. Platform development
◎ High level picture
o docker-compose.yml describes pipeline topology
◎ Common components
o extend template image with your code
◎ New components
o build a Docker image for your component
o this is your own little virtual machine for your component
◎ Sharing
o publish topology as a Git repository
o publish new components on Docker Hub
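To make "docker-compose.yml describes pipeline topology" concrete, here is a minimal sketch that builds a two-service topology as a Python dictionary in the Compose version 2 file layout and renders it as JSON, which YAML parsers also accept. The service and image names are placeholders invented for this example, not published Big Data Europe images.

```python
import json

# Hypothetical two-service topology in the Compose version 2 file layout.
# Service names and image names are placeholders, not published images.
topology = {
    "version": "2",
    "services": {
        "store": {"image": "example/triple-store"},
        "worker": {
            "image": "example/my-component",
            # The worker is started after the store container.
            "depends_on": ["store"],
        },
    },
}


def render(topology):
    """Serialise the topology; JSON output is also valid YAML."""
    return json.dumps(topology, indent=2)


compose_text = render(topology)
print(compose_text)
```

Keeping the topology in one declarative file is what makes it possible to publish a whole pipeline as a Git repository.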
42. More monitoring
This topic is ongoing, with many interesting options:
◎ Visualise logs with Kibana?
◎ Combine logs for large overview?
◎ Monitor node load?
◎ Provide autoscheduling?
44. You can talk to us!
◎ Aad Versteden
aad.versteden@tenforce.com
◎ Erika Pauwels
erika.pauwels@tenforce.com
45. Linux Container technology
..light-weight "virtual" virtual machine
A container is started from an image
Images downloaded from Docker Hub
Dockerfile: Layer-based recipe
Philosophy: One service, one image → microservices
Cloud's best friend: scalable, reproducible, customizable
https://www.docker.com/
51. Docker and data?
Docker Hub maximum image size: 10 GB
Open PHACTS data (compressed): ~30 GB
Open PHACTS data (installed): ~200 GB
Solution: Added staging Docker containers
Download from https://data.openphacts.org/
Verify consistency
Import into Virtuoso and MySQL
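The slides do not say how the staging containers verify consistency. One common approach, assumed here purely for illustration, is to compare a SHA-256 digest of each downloaded file against a published value. The function names and the demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-gigabyte dumps never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_consistent(path, expected_hex):
    """Compare the file's digest with the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demonstration on a small temporary file standing in for a downloaded dump.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example data dump")
    demo_path = tmp.name

expected = hashlib.sha256(b"example data dump").hexdigest()
ok = is_consistent(demo_path, expected)
tampered = is_consistent(demo_path, "0" * 64)
os.remove(demo_path)
print(ok, tampered)
```

Checking before the import step avoids spending hours loading a truncated download into the databases.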
55. What do I need?
https://github.com/openphacts/ops-docker
Hardware requirements:
150 GB of disk space (ideal: 250 GB)
16 GB of RAM (ideal: 128 GB)
4 CPU cores (ideal: 8 cores)
Prerequisites:
Recent x64 Linux (Ubuntu 14.04 LTS, CentOS 7)
Fast Internet connection
Docker
Docker Compose
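Before starting a long download, it can help to check the host against the minimum figures above. The sketch below is an illustration, not part of ops-docker. It checks free disk space and CPU count using only the standard library. RAM is left out because the standard library has no portable way to read it, and the path to check is an assumption (it should be wherever Docker stores its data).

```python
import os
import shutil

# Minimum figures taken from the hardware requirements listed above.
MIN_DISK_GB = 150
MIN_CPU_CORES = 4


def check_host(path=".", free_bytes=None, cores=None):
    """Return a list of human-readable problems; empty means the host passes.

    free_bytes and cores can be passed explicitly (useful for testing);
    otherwise they are read from the running system.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    if cores is None:
        cores = os.cpu_count() or 0
    problems = []
    free_gb = free_bytes / 1000**3
    if free_gb < MIN_DISK_GB:
        problems.append(f"only {free_gb:.0f} GB free, need {MIN_DISK_GB} GB")
    if cores < MIN_CPU_CORES:
        problems.append(f"only {cores} CPU cores, need {MIN_CPU_CORES}")
    return problems


for problem in check_host():
    print("WARNING:", problem)
```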
56. Don't jump ahead..
https://github.com/openphacts/ops-docker
Follow the GitHub tutorial exactly, customize later
Install latest Docker and Docker Compose
Just testing on Windows or OS X?
.. modify Docker's Linux VM to have enough disk and memory
Firewall? The settings you need depend on your firewall setup.
Don't worry - Docker is containerized!
..you won't break your machine
66. Custom data staging
Different Open PHACTS 2.1 licensing options:
Non-Commercial users: Everything
Commercial users: No DrugBank, partial SureChEMBL
Open PHACTS members: Full SureChEMBL
67. Microservices per dataset
Most queries have separate fragments per dataset
..which could be executed on separate microservices
Better cloud scalability
Easier to test upgrades of individual datasets
But still need an "API" layer to do identity mapping
and select the datasets to query
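The split between per-dataset services and a thin API layer can be sketched as a fan-out. Everything below is hypothetical: the two stub functions stand in for separate microservices, and the identity map, URI and identifiers are placeholders rather than real Open PHACTS data. The API layer maps the incoming URI to a dataset-local identifier, selects the datasets it can serve, and runs the fragments in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical per-dataset services. In a real deployment each would be a
# network call to its own microservice; here they are in-process stubs.
def query_chembl(local_id):
    return {"dataset": "chembl", "id": local_id}


def query_drugbank(local_id):
    return {"dataset": "drugbank", "id": local_id}


SERVICES = {"chembl": query_chembl, "drugbank": query_drugbank}

# Hypothetical identity mapping: one input URI to one identifier per dataset.
IDENTITY_MAP = {
    "http://example.org/compound/1": {
        "chembl": "CHEMBL-placeholder",
        "drugbank": "DB-placeholder",
    }
}


def api_layer(uri, datasets):
    """Map the URI, select the datasets to query, and fan out in parallel."""
    mapped = IDENTITY_MAP.get(uri, {})
    selected = [n for n in datasets if n in mapped and n in SERVICES]
    with ThreadPoolExecutor() as pool:
        futures = {n: pool.submit(SERVICES[n], mapped[n]) for n in selected}
        return {n: future.result() for n, future in futures.items()}


result = api_layer("http://example.org/compound/1", ["chembl", "drugbank", "uniprot"])
print(result)
```

Because each fragment only touches one dataset, a single dataset's service can be upgraded or scaled without redeploying the others.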
68. BioExcel Workflow blocks
BioExcel approach: Spin up virtual machine when an
Open PHACTS workflow is started
Workflow bound dynamically to VM instance(s)
Scalability (exclusive access)
Reproducibility (independent/fixed OPS install)
Tool descriptions - exposed in bio.tools
69. Customization
Make it easier to add third-party data:
datasets, linksets, queries, API calls
..so pharma industry can mix in their in-house data
.. so academics can upgrade and expand datasets
More tooling,
more documentation,
or more training?