Life sciences such as genetics and molecular biology have traditionally benefited from Perl, with excellent projects such as BioPerl leading the way. Unfortunately, in medical research and epidemiology the picture is different: researchers are struggling with the ever-increasing size and complexity of datasets. This presentation will briefly describe the situation I faced when I first joined a research team working on coronary heart disease, what I did to make things better, and how I achieved one small victory for Perl.
I come from both a scientific and a commercial background, where large datasets were used by multiple stakeholders and sharing was a good thing.
Bioinformatics is essentially computer science plus molecular biology, dealing with DNA, RNA and so on. People there are used to extremely large datasets: the raw human genome was 30TB. There has been substantial innovation in both hardware and software, and there are established standards for storing, searching and visualizing information. The bioinformatics community is international and collaborative, and data is shared among peers. The BioPerl project is an excellent example of this: a big collection of Perl modules for doing many operations on bioinformatics data, an international collaboration with many people working on it, cross-platform, with a plethora of tools built on top. It has very good documentation and there is even an O'Reilly book on it.
Clinical epidemiology, on the other hand, is all about collecting and analyzing clinical data on patients. Traditionally it is very expensive to follow people up with medical exams, questionnaires and so on, and a typical study would have fewer than 5,000 individuals. Paper is king and everything is based on it, with a slow transition underway to electronic data gathering. Times have changed, however: there is a bunch of NHS IT projects going on to bring medical data together, electronic health records have come into play, and there is more and more data available from multiple sources such as GP surgeries, hospitals, the Office for National Statistics and government data sources.
So what are people doing? When I first joined my new job I saw that people were not happy. The size of the data is ever increasing: I am dealing with a database of 6 million patients and over 5 billion rows. Of course it was delivered as text files. Of course I had to sign 40-page forms to obtain the data. Data is a well-kept secret, and there is very little sharing going on. Researchers are struggling to actually manage the data rather than analyze it: data cleaning, formatting, and the lack of specifications. Statistical packages are used to manage the data, which to my mind is not entirely appropriate. Only very recently did funding organizations start requiring research teams to actually hire somebody dedicated to managing and curating these datasets. Some common patterns emerged, which I examined in an academic fashion.
Fear leads to anger, anger leads to hate, hate leads to suffering. And only one person is happy with all of this.
So what did I try to do? I took one small step for man and created the Medical:: namespace after emailing the dev list, and I started thinking of ways to create something like BioPerl but for medical-specific modules. There are already several modules on CPAN which are of interest: DICOM is an image format widely used in medicine, and UMLS is a structured ontology used in the biomedical sciences. The main issue is to expose these, and others, to non-Perl people, a.k.a. normal people.
The NHS deals with 1m patients per 36 hours. The NHS number is essentially a ten-digit UID that everybody gets assigned, with a check digit based on the mod 11 algorithm. Of course, this being the NHS, there are 21 different formats of old-school NHS numbers floating around. I looked on CPAN and could not find anything, but that was no problem: I just created Medical::NHSNumber, which was the first module in the Medical:: namespace.
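The mod 11 check mentioned above is simple to sketch: the first nine digits are weighted 10 down to 2, summed, and the sum mod 11 determines the expected tenth digit. This is an illustrative sketch of the algorithm, not the actual Medical::NHSNumber code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the mod 11 check digit validation used by NHS numbers.
# Illustrative only; not the real Medical::NHSNumber implementation.
sub is_valid_nhs_number {
    my ($number) = @_;
    $number =~ s/[\s-]//g;               # tolerate "401 023 2137" style input
    return 0 unless $number =~ /^\d{10}$/;

    my @digits = split //, $number;
    my $sum = 0;
    # The first nine digits are weighted 10, 9, ..., 2.
    $sum += $digits[$_] * (10 - $_) for 0 .. 8;

    my $check = 11 - ($sum % 11);
    $check = 0 if $check == 11;
    return 0 if $check == 10;            # a check digit of 10 is never valid
    return $check == $digits[9] ? 1 : 0;
}

print is_valid_nhs_number('401 023 2137') ? "valid\n" : "invalid\n";
```

A number failing this check (or a check digit that works out to 10) is rejected outright, which is exactly the sort of tedious validation you do not want people doing by eye in a spreadsheet.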
The ICD-10 coding system is basically one huge ontology for coding diseases, signs, symptoms, test results and so on. Every time you visit the hospital you get a series of codes according to what the problem was; it is very, very widely used. What do most people do? They try to open it in Excel... And how do you get all the parents of a term if you want them? "Weeeeelll, we use this search function and paste the results into another spreadsheet and then we use Stata to check it..." OK, OK, stop. Medical::ICD10 is a very simple module doing very simple things and saving people time, coupled with a very basic web interface.
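To give a flavour of the "walk up to the parents" operation: ICD-10 codes are hierarchical by prefix (e.g. I21.9 sits under I21), so a naive sketch can derive ancestors just by truncating the code. The real Medical::ICD10 reads the hierarchy from the ontology itself; `parent_codes` here is a hypothetical helper for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Naive sketch: derive ancestor codes of an ICD-10 term by truncating
# the code one character at a time. The real module walks the ontology;
# this only approximates the category/subcategory levels.
sub parent_codes {
    my ($code) = @_;
    $code =~ s/\.//;                  # I21.9 -> I219
    my @parents;
    while (length($code) > 3) {       # three characters is a category, the top of this sketch
        chop $code;
        push @parents, length($code) > 3
            ? substr($code, 0, 3) . '.' . substr($code, 3)
            : $code;
    }
    return @parents;
}

print join(', ', parent_codes('M05.32')), "\n";   # M05.3, M05
```

Even this toy version replaces the "search, paste into another spreadsheet, check in Stata" workflow with a one-liner.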
Another thing I looked at was the standards used to describe the data, or perhaps more appropriately, the lack of standards to describe the data. Documentation is delivered as an Excel file, or an email, or a Word document with cryptic variable names and all that fun. So I said: there is an established data documentation standard called the DDI, why don't we use it and make our lives easier? I created two modules and a bunch of scripts and turned a flat Excel file of little usability into something better, much better. Similarly, for study registration in the interest of transparency, we use clinicaltrials.gov all the time, so I created a module for people to use.
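The flat-file-to-DDI idea can be sketched in a few lines: take variable names and labels (as you might export from an Excel data dictionary) and emit a minimal DDI Codebook fragment. The element names follow the DDI 2.5 Codebook schema; the variable names and labels are made-up example data, and this is not the actual pair of modules from the talk:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: a flat data dictionary rendered as a minimal DDI 2.5
# Codebook fragment. Example variables are invented for illustration.
my %dictionary = (
    pid    => 'Patient identifier',
    dob    => 'Date of birth',
    chd_dx => 'Date of first CHD diagnosis',
);

print qq{<codeBook xmlns="ddi:codebook:2_5">\n  <dataDscr>\n};
for my $name (sort keys %dictionary) {
    printf qq{    <var name="%s">\n      <labl>%s</labl>\n    </var>\n},
        $name, $dictionary{$name};
}
print qq{  </dataDscr>\n</codeBook>\n};
```

Once the documentation is in a structured format like this, it can be validated, searched and rendered, instead of living as cryptic column headers in a spreadsheet.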
It turns out people do want to make their lives easier. We got ActiveState, and we have the excellent resource that is Learning Perl. Finally, I introduced Perl to the people in my team. Most of them were scared away, but one of them was happy with Perl. She had even considered Python. Still, in my book it's a win.
So, life after Perl: what did I do? I introduced a new namespace. I created several modules, internal and external. I created better data documentation using Perl and promoted standards. And I introduced Perl to normal people. Was all this technically complicated? Probably not; it was very straightforward in the majority of cases. Was it worth it? This is how my work was after Perl.
Please help out! Introduce Perl to your academic group. Contribute to the Medical:: namespace. Help design and implement medperl. Use more standards at work if you are not already using them. And finally, a shameless plug: please join the UCL Perl users group if you are from UCL. Thanks!