2. @mirocupak
What and why?
• Beacon Network
• https://beacon-network.org/
• largest search and discovery engine for human genetic mutations
• from the Global Alliance for Genomics & Health (GA4GH)
• case study
  • problem
  • standard
  • architecture
  • technologies
  • fun with stats
9. @mirocupak
• no single institution will have sufficient resources
• still, institutions don’t have enough data
  • common diseases
  • rare diseases
• challenge
  • discovering data
• solution
  • the traditional approach of aggregating data at a single centralized site is not working
  • a federated system capable of executing cross-dataset and cross-institution queries is needed
Problem
10. @mirocupak
• nonprofit standards organization
• a coalition of over 500 leading institutions working in health care, research, disease advocacy, life science, and information technology
• goal: enable responsible sharing of genomic and clinical data
• established in 2013
GA4GH & Beacon Project
• experiment to test the willingness of international sites to share genetic data in the simplest of all technical contexts
• initiative requiring collaboration of many different GA4GH groups
• started in 2014 and quickly gained traction
http://ga4gh.org/
https://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf
https://beacon-project.io/
12. @mirocupak
• simple web service allowing users to query an institution’s databases to determine whether they contain a genetic variant of interest
• receives questions of the form “Do you have information about this mutation?”
• responds with “yes” or “no”, optionally with additional information about the mutation
• design principles
• A beacon has to be technically simple.
• A beacon has to minimize risks associated with genomic data sharing.
• It has to be possible to make a beacon publicly available.
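The yes/no contract described above can be sketched as a single function. This is a minimal illustration only: the in-memory variant set and the parameter names (`chromosome`, `position`, `allele`) are assumptions made for the example, not the API of any real beacon.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chromosome: str  # e.g. "1" or "X"
    position: int    # position on the chromosome
    allele: str      # observed base(s), e.g. "A"


# Hypothetical data held by one institution; a real beacon would query its database.
KNOWN_VARIANTS = {
    Variant("1", 10_000, "A"),
    Variant("X", 2_500_000, "T"),
}


def beacon_query(chromosome: str, position: int, allele: str) -> bool:
    """Answer "Do you have information about this mutation?" with yes or no."""
    return Variant(chromosome, position, allele.upper()) in KNOWN_VARIANTS
```

Because the answer is a single boolean, no record-level data leaves the institution, which is in line with the design principles listed above.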
Beacon
13. @mirocupak
• no formal specification
• receives questions of the form “Do you have information about this mutation?”
• responds with “yes” or “no”
• 4 public beacons, each with a different API
Standard: Before Beacon Network
• request method
• supported parameters
• parameter names
• chromosome identifiers
• positional base
• assembly notation
• supported alleles
• dataset support
• response format
• data included in the response
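To make a few of these differences concrete, here is a rough sketch of how chromosome identifiers, assembly notation, and positional base might be brought to one convention before a query is sent on. The function names and the chosen canonical forms (bare chromosome names, GRC assembly names, 0-based positions) are assumptions for illustration, not what the Beacon Network actually does.

```python
# Common aliases between UCSC-style and GRC-style assembly names.
ASSEMBLY_ALIASES = {
    "hg18": "NCBI36", "ncbi36": "NCBI36",
    "hg19": "GRCh37", "grch37": "GRCh37",
    "hg38": "GRCh38", "grch38": "GRCh38",
}


def normalize_chromosome(raw: str) -> str:
    """Map "chr1", "Chr1" and "1" to "1"; map "chrM" and "M" to "MT"."""
    name = raw.strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    name = name.upper()
    return "MT" if name == "M" else name


def normalize_assembly(raw: str) -> str:
    """Map an assembly label to its GRC-style name, or fail loudly."""
    key = raw.strip().lower()
    if key not in ASSEMBLY_ALIASES:
        raise ValueError(f"unknown assembly: {raw!r}")
    return ASSEMBLY_ALIASES[key]


def to_zero_based(position: int, one_based: bool) -> int:
    """Convert a position to 0-based coordinates if the source is 1-based."""
    return position - 1 if one_based else position
```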
15. @mirocupak
• 2014
• really simple (2 records)
• true/false response
• format: Avro
• not enough traction
• too vague
• issues partially addressed by the Beacon Network
Standard: 0.1
16. @mirocupak
• 2015
• true/false/overlap/null response
• datasets
• simple data use conditions
• self description
• format: Avro
• complex (9 records)
• not well adopted
• not polished enough
Standard: 0.2
17. @mirocupak
• 2016
• simplified 0.2
• based on real needs, successful
• true/false/null response
• data model improvements, extended metadata and response, improved support for datasets and cross-dataset queries, data versioning
• modular and extensible
• tooling
• format: Avro → Proto3
Standard: 0.3
18. @mirocupak
• 2018
• stable and more flexible
• support for more complex mutations
• improved error handling
• improved data use conditions
• various minor improvements
• developer experience
• format: Proto3 → OpenAPI
Standard: 0.4
22. @mirocupak
Service
• communication with other subsystems
• query normalization
• aggregators
• participant resolution
• query distribution
• audit trail
• L1 parallelization
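Query distribution with parallel execution can be sketched roughly as follows. This is an illustration under stated assumptions: the names `distribute` and `combine_responses` are hypothetical, each beacon is modelled as a plain callable returning true/false/null, and a thread pool stands in for whatever mechanism the real service uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional

# Assumption: a beacon is modelled as a callable returning True, False or None.
BeaconFn = Callable[[dict], Optional[bool]]


def distribute(query: dict, beacons: Dict[str, BeaconFn],
               max_workers: int = 8) -> Dict[str, Optional[bool]]:
    """Send one query to every participating beacon in parallel."""

    def ask(fn: BeaconFn) -> Optional[bool]:
        try:
            return fn(query)
        except Exception:
            # A failing beacon yields "no answer" instead of failing the whole query.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(ask, fn) for name, fn in beacons.items()}
        return {name: future.result() for name, future in futures.items()}


def combine_responses(responses: Dict[str, Optional[bool]]) -> Optional[bool]:
    """True if any beacon said yes, False if at least one said no, else None."""
    values = list(responses.values())
    if any(value is True for value in values):
        return True
    if any(value is False for value in values):
        return False
    return None
```

Treating a failed beacon as null rather than false keeps "we could not ask" distinct from "the variant is not there"; that rule is a design choice of this sketch.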
23. @mirocupak
Processor
• executing a query against a beacon and processing its response
• management of a flexible, dynamic and easily extensible query execution pipeline
• pipeline stages resolution (CDI and EJB)
• L2 parallelization
• cross-assembly query handling
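A query execution pipeline of this kind can be sketched with pluggable stages. The slides resolve stages through CDI and EJB; the sketch below uses plain Python callables instead, and the stage names, the `beacon.example.org` URL, and the demo stages are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Pipeline:
    """One beacon's pipeline: convert, build request, fetch, parse."""
    convert: Callable[[dict], dict]          # query -> beacon-specific parameters
    build_request: Callable[[dict], str]     # parameters -> request URL
    fetch: Callable[[str], str]              # request URL -> raw response
    parse: Callable[[str], Optional[bool]]   # raw response -> True/False/None

    def run(self, query: dict) -> Optional[bool]:
        params = self.convert(query)
        request = self.build_request(params)
        raw = self.fetch(request)
        return self.parse(raw)


# Hypothetical stages for a beacon that answers "yes" only for chromosome 1.
demo = Pipeline(
    convert=lambda q: {"chrom": q["chromosome"], "pos": q["position"]},
    build_request=lambda p: (
        f"https://beacon.example.org/query?chrom={p['chrom']}&pos={p['pos']}"
    ),
    fetch=lambda url: "yes" if "chrom=1&" in url else "no",  # stands in for HTTP
    parse=lambda raw: {"yes": True, "no": False}.get(raw),
)
```

Because each stage is just a value on the pipeline, a beacon with an unusual API only needs its own converter or parser; the remaining stages can be shared.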
25. @mirocupak
Requester
• second stage in the query execution pipeline
• constructing beacon requests based on their URIs and parameters produced by the converters
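As a rough sketch of this stage, a request URL can be assembled from a beacon's base URI and the converted parameters using the standard library. The function name `build_request_url` and the parameter names in the usage note are assumptions; real beacons each expect their own names.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_request_url(base_uri: str, params: dict) -> str:
    """Append URL-encoded parameters to a beacon's base URI."""
    parts = urlsplit(base_uri)
    # Sort the parameters so the same query always yields the same URL.
    query = urlencode(sorted(params.items()))
    if parts.query:
        # Keep any query string already present in the base URI.
        query = parts.query + "&" + query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))
```

For example, `build_request_url("https://beacon.example.org/query", {"pos": 100, "chrom": "1", "allele": "A"})` yields `https://beacon.example.org/query?allele=A&chrom=1&pos=100`.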
26. @mirocupak
Fetcher
• third stage in the query execution pipeline
• the unit that actually talks to the APIs of beacons
• submitting requests over the network and obtaining the raw response
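A fetcher along these lines might look like the sketch below. It assumes plain HTTP GET requests via the standard library; the function name `fetch_raw`, the timeout value, and the injectable `opener` parameter (which allows testing without a network) are illustrative choices.

```python
import urllib.request
from typing import Callable, Optional


def fetch_raw(url: str, timeout: float = 5.0,
              opener: Callable = urllib.request.urlopen) -> Optional[str]:
    """Submit the request and return the raw response body, or None on failure."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # urllib's URLError and socket timeouts are both OSError subclasses.
        return None
```

Returning None on network errors leaves it to the next stage to decide how an unreachable beacon is reported.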
27. @mirocupak
Parser
• last stage in the pipeline
• extracting information of interest from the raw response obtained by a fetcher
• dealing with various formats
• handling metadata, multiple responses, errors
• response normalization
• parallelized
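Response normalization across formats can be sketched as a function that maps whatever a beacon returns onto true/false/null. The accepted formats here (plain-text yes/no and a JSON object with a boolean `exists` or `response` field) are assumptions for illustration; the formats real beacons return may differ.

```python
import json
from typing import Optional


def parse_response(raw: Optional[str]) -> Optional[bool]:
    """Normalize a raw beacon response to True, False, or None (no usable answer)."""
    if raw is None:
        return None
    text = raw.strip()
    lowered = text.lower()
    # Plain-text beacons: "yes"/"no" or "true"/"false".
    if lowered in ("yes", "true"):
        return True
    if lowered in ("no", "false"):
        return False
    # JSON beacons: look for a boolean under an assumed field name.
    try:
        body = json.loads(text)
    except ValueError:
        return None
    if isinstance(body, dict):
        for key in ("exists", "response"):
            value = body.get(key)
            if isinstance(value, bool):
                return value
    return None
```

Anything the parser cannot interpret, including error payloads and HTML error pages, becomes null rather than a guessed yes or no.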