Volume 3, Issue 3
pythonpapers.org
Journal Information



The Python Papers
                                                     ISSN: 1834-3147

Editors
Co-Editors-in-Chief:     Maurice Ling
                         Tennessee Leeuwenburg
Associate Editors:       Guilherme Polo
                         Guy Kloss
                         Richard Jones
                         Sarah Mount
                         Stephanie Chong

Referencing Information
Articles from this edition of this journal may be referenced as follows:

                Author, "Title" (2008) The Python Papers, Volume N, Issue M, Article Number


Copyright Information
            © Copyright 2007 The Python Papers and the individual authors
      This work is copyright under the Creative Commons 2.5 license subject to
                         Attribution, Non-commercial and
             Share-Alike conditions. The full legal code may be found at
                   http://creativecommons.org/licenses/byncsa/2.1/au/
       The Python Papers was first published in 2006 in Melbourne, Australia.

Referees
             An academic peer-review was performed on all academic articles
             in accordance with The Python Papers Anthology Editorial Policy.
          The reviewers will be acknowledged individually but their identities
                 will not be released in order to ensure anonymity.

Focus and Scope
      *   Python User Groups and Special Interest Group introductions
      *   Technical aspects of the Python language
      *   Code reviews and book reviews
      *   Descriptions of new Python modules and libraries
      *   Solutions to specific problems in Python
      *   Consolidated summaries of current discussion in Python mailing lists
          or other fora
      *   Companies and organisations using Python
      *   Applications developed in Python (such as held in the Python Cheese
          Shop)

In short, we are soliciting submissions where Python is an integral part of
the answer.




The Python Papers Anthology Editorial Policy

0. Preamble
The Python Papers Anthology is the umbrella entity referring to The Python Papers (ISSN
1834-3147), The Python Papers Monograph (ISSN under application) and The Python Papers
Source Codes (ISSN under application), under a common editorial committee (hereafter
known as 'editorial board').

It aims to be a platform for disseminating industrial / trade and academic knowledge about
Python technologies and their applications.

The Python Papers is intended to be both an industrial journal as well as an academic journal,
in the sense that the editorial board welcomes submissions relating to all aspects of the
Python programming language, its tools and libraries, and community, both of academic and
industrial inclinations. The Python Papers aims to be a publication for the Python community
at large. In order to cater for this, The Python Papers seeks to publish submissions under two
main streams: the industrial stream (technically reviewed) and the academic stream (peer-
reviewed).

The Python Papers Monograph provides a refereed format for publication of monograph-
length reports including dissertations, conference proceedings, case studies, advanced-level
lectures, and similar material of theoretical or empirical importance. All volumes published
under The Python Papers Monograph will be peer-reviewed and external reviewers may be
named in the publication.

The Python Papers Source Codes provides a refereed format for publication of software and
source codes which are usually associated with papers published in The Python Papers and
The Python Papers Monograph. All publications made under The Python Papers Source Codes
will be peer-reviewed.

This policy statement seeks to clarify the processes of technical review and peer-review in
The Python Papers Anthology.

1. Composition and roles of the editorial board
The editorial board is headed by the Editor-in-Chief or Co-Editors-in-Chief (hereafter known as
"EIC"), assisted by Associate Editors (hereafter known as "AE") and Editorial Reviewers
(hereafter known as "ER").

The EIC is the chair of the editorial board and, together with the AEs, manages the strategic and
routine operations of the periodicals. ERs are a tier of editors deemed to have in-depth expert
knowledge in specialized areas. As members of the editorial board, ERs are accorded editorial
status but are generally not involved in the strategic and routine operations of the periodicals,
although their expert opinions may be sought at the discretion of the EIC.

2. Right of submission author(s) to choose streams
The submission author(s), that is, the author(s) of the article, code or any other form of
submission deemed suitable by the editorial board, reserve the right to choose whether they
want their submission to be in the industrial stream, where it will be technically reviewed, or
in the academic stream, where it will be peer-reviewed. It is also the onus of the submission
author(s) to nominate the stream. The editorial board defaults all submissions to the industrial
stream (technical review) in the event of non-nomination by the submission author(s), but the
editorial board reserves the right to place such submissions into the academic stream if it
deems fit.

The editorial board also reserves the right to place submissions nominated for the academic
stream in the technical stream if it deems fit.



3. Right of submission author(s) to nominate potential reviewers
The submission author(s) can exercise the right to nominate up to 4 potential reviewers
(hereafter known as ";external reviewer";) for his/her submission if the submission author(s)
choose to be peer-reviewed. When this right is exercised, the submission author(s) must
declare any prior relationships or conflict of interests with the nominated potential reviewers.
The final decision to accept the nominated reviewer(s) rests with the Chief Reviewer (see
section 5 for further information on the role of the Chief Reviewer).

4. Right of submission author(s) to exclude potential reviewers
The submission author(s) can exercise the right to recommend excluding any reasonable
number of potential reviewers for his/her submission. When this right is exercised, the
submission author(s) must indicate the grounds on which such exclusion should be
recommended. Decisions for the editorial board to accept or reject such exclusions will be
solely based on the grounds as indicated by the submission author(s).

5. Peer-review process
Upon receiving a submission for peer-review, the Editor-in-Chief (hereafter known as "EIC")
may choose to reject the submission or the EIC will nominate a Chief Reviewer (hereafter
known as "CR") from the editorial board to chair the peer-review process of that submission.
The EIC can nominate himself/herself as CR for the submission.

The CR will send out the submission to TWO or more external reviewers to be reviewed. The
CR reserves the right not to call upon the nominated potential reviewers and/or to call upon
any of the reviewers nominated for exclusion by the submission author(s). The CR may also
concurrently send the submission to one or more Associate Editor(s) (hereafter known as
";AE";) for review. Hence, a submission in the academic stream will be reviewed by at least
three persons, the CR and two external reviewers. Typically, a submission may be reviewed
by three to four persons: the EIC as CR, an AE, and two external reviewers. There is no upper
limit to the number of reviews in a submission.

Upon receiving the review from external reviewer(s) and/or AE(s), the CR decides on one of
the following options: accept without revision, accept with revision or reject; and notifies the
submission author(s) of the decision on behalf of the EIC. If the decision is "accept with
revision", the CR will provide a deadline to the submission author(s) for revisions to be done
and will automatically accept the revised submission if the CR deems that all revision(s) were
done; however, the CR reserves the right to move to reject the original submission if the
revision(s) were not carried out by the stipulated deadline by the CR. If the decision is
"reject", the submission author(s) may choose to revise for future re-submission. Decision(s)
by CR or EIC are final.

6. Technical review process
Upon receiving a submission for technical review, the Editor-in-Chief (hereafter known as
"EIC") may choose to reject the submission or the EIC will nominate a Chief Reviewer
(hereafter known as "CR") from the editorial board to chair the review process of that
submission. The EIC can nominate himself/herself as CR for the submission.

The CR may decide to accept or reject the submission after reviewing or may seek another
AE's opinions before reaching a decision. The CR will notify the submission author(s) of the
decision on behalf of the EIC. Decision(s) by CR or EIC are final.

7. Main difference between peer-review and technical review
The processes of peer-review and technical review are similar, with the main difference being
that in the peer review process, the submission is reviewed both internally by the editorial
board and externally by external reviewers (nominated by submission author(s) and/or
nominated by EIC/CR). In a technical review process, the submission is reviewed by the
editorial board. The editorial board retains the right to additionally undertake an external
review if it is deemed necessary.


8. Umbrella philosophy
The Python Papers Anthology editorial board firmly believes that all good (technically and/or
scholarly/academic) submissions should be published when appropriate and that the editorial
board is integral to refining all submissions. The board believes in giving good advice to all
submission author(s) regardless of the final decision to accept or reject and hopes that advice
to rejected submissions will assist in their revisions.



The Python Papers Editorial Statement on Open Access

The Python Papers Anthology has received a number of inquiries relating to the republishing
of articles from the journal, especially in the context of open-access repositories. Each issue
of The Python Papers Anthology is released under a Creative Commons 2.5 license, subject to
Attribution, Non-commercial and Share-Alike clauses. This, in short, provides carte blanche
for republishing articles, so long as the source of the article is fully attributed, the article is
not used for commercial purposes and the article is republished under this same license.
Creative Commons permits both republishing in full and also the incorporation of portions of
The Python Papers in other works. A portion may be an article, quotation or image. This
means (a) that content may be freely re-used and (b) that other works using The Python
Papers Anthology content must be available under the same Creative Commons license.

The remainder of this article will address some of the details that might be of interest to
anyone who wishes to include issues or articles in a database, website, hard copy collection
or any other alternative access mechanism.

The full legal code of the license may be found at
 http://creativecommons.org/licenses/byncsa/2.1/au/

The full open access policy can be found at
 http://ojs.pythonpapers.org/index.php/tpp/about/editorialPolicies




Editorial
Maurice Ling

Hi Everyone,

Welcome to the latest issue of The Python Papers. First and foremost, we would like to show our
appreciation for all the contributions we received during the year, which have made us what we
are today. Of course, we will not forget all our supporters and readers as well, for all your valuable
comments. In 2008 (Volume 3), we published a total of 7 industrial and 7 academic articles, as
well as 2 columns from our regular columnist, Ian Ozsvald, in his ShowMeDo Updates.

Thank you for all your support, and we look forward to your continued encouragement.

Starting in 2009, all the serials under The Python Papers Anthology will take on a new
publishing scheme. We will be releasing each article to the public as it is accepted, but each
issue will still be delimited by our usual "issue release" date. The "issue release" date is then our
cutoff deadline for preparing the single-PDF-per-issue file. This means that we will be serving
new articles to everyone much faster than we do now, and there will no longer be a meaningful
publication schedule.

We have also changed our policy from "Review Policy" to "Editorial Policy" to reflect the
changes in the editorial team. We are currently in the process of appointing Editorial
Reviewers (ER for short). Editorial Reviewers are members of the editorial committee who
are deemed to have in-depth expert knowledge in specialized areas.

Let us look forward to a great year ahead, with more Python development and a recovering
economy.

Happy reading.
Editorial: Python at the Crossroads

My favourite T-shirt glimpsed at Pycon UK 2008 was ...

  Python
    programming
    as Guido
    indented it

Apart from the two keynote speeches, it was a happy and fascinating event. It was my first Python-
only conference and what a pleasure to be able to choose from four streams - web, GUI, testing and
the language itself. My previous conferences, all in Australia, were open source events with wider
scope and had just a single Python stream among Perl, PHP, Ruby and so on.

The quality of speakers was uniformly excellent and the organisation was first rate. We can be sure
that EuroPython 2009, hosted by the same team in Birmingham next year, will definitely be
worth attending.

The two keynotes by Mark Shuttleworth, CEO of Canonical, and Ted Leung, Python Evangelist at
Sun, both highlighted Python at the crossroads. Fascinating, but not particularly light-hearted.
Mixing and matching what they said, their combined story is ...

1. Python has critical mass and will continue to grow. The speed of growth is another question.

2. Django has reached the important version 1.0 milestone and should therefore compete with
Ruby on Rails in attracting newcomers to the Python language itself.

3. Intel and Sun are currently selling multi-core CPUs - 16 and 128 cores respectively. Expect
massively multi-core machines in future.

4. Future growth in language popularity will be tied to multi-threading on multi-core CPUs.
Haskell is one language expecting a multi-core growth kick.

5. Python's Global Interpreter Lock effectively prevents the language from exploiting current state-
of-the-art multi-core computers.
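
As a rough illustration of point 5 (a hypothetical sketch, not taken from either keynote), CPU-bound pure-Python threads are serialised by the GIL, whereas separate processes can use separate cores:

import time
from threading import Thread
from multiprocessing import Pool

def count_down(n):
    # Pure-Python busy loop; a thread holds the GIL while running it.
    while n > 0:
        n -= 1

def timed(label, fn):
    start = time.time()
    fn()
    print(label, round(time.time() - start, 2), "seconds")

def with_threads():
    threads = [Thread(target=count_down, args=(10000000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def with_processes():
    pool = Pool(4)                        # one worker process per core
    pool.map(count_down, [10000000] * 4)
    pool.close()
    pool.join()

if __name__ == "__main__":
    timed("4 threads:  ", with_threads)     # effectively serialised by the GIL
    timed("4 processes:", with_processes)   # can run on 4 cores concurrently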

Where does this leave beautiful Python?

The point was made that a language is chosen for being appropriate to the purpose of a project.
Where this happens for multi-core performance reasons and Python is rejected, that is growth
for another language and a permanent loss to Python.

The Python Papers Editorial Team hopes that many of the Pycon UK session papers will be
published in these pages. Please get in touch if you would like to submit an article for academic or
technical review. Visit http://ojs.pythonpapers.org to submit an article or paper.

In view of the crossroads highlighted for Python at Pycon in September, articles with a focus on
multi-threading for multi-core computers would seem to be valuable for the language itself.

The Python Papers is keen to see the language succeed and has very talented reviewers ready to
help authors get their articles published.
Got something to contribute? Please get in touch ...

Mike Dewhirst
ShowMeDo Update - November
Ian Ozsvald


In the last issue of the Python Papers I wrote a long article about how ShowMeDo helps you to
learn more about Python. Since then we've added another 40 Python videos taking us to almost 380
in total. Including all the open-source topics we cover we have over 800 tutorial videos for you.

Much of the content is free, contributed by our great authors and ourselves. Some of the content is
in the Club which is for paying members - currently the Club focuses purely on Python tutorials for
new and intermediate Python programmers. An update on the Club videos follows later.

We were interviewed in October by Ron Stephens of Python411; you'll find the interview and all of
Ron's other great Python podcasts on his site:
http://www.awaretek.com/python/


Contributing to ShowMeDo:
Would you like to share your knowledge with thousands of Python viewers every month?
Contributing to ShowMeDo is easy; you'll find guides and links to screencasting software here:
http://showmedo.com/addVideoInstructions

To get an idea of what is popular with our viewers, see how the videos rank here:
http://showmedo.com/mostPopular

Remember that everything is previewed by us before publishing. You may have to wait a few days
before your video is published but you'll be safe in the knowledge that your content sits alongside
other vetted content.

We are very keen to help you share your knowledge with our Pythonistas, especially if you want to
spread awareness of the tools you like to use. Do get in contact in our forum, our authors are a
friendly and very helpful crowd:
http://groups.google.com/group/showmedo


Free Screencasts:

Django:
We've had a lot of new Django content recently, mostly from Eric Holscher and ericflo. Eric and
Eric have produced an amazing 21 new screencasts to help you learn Django.

Django From the Ground Up (13 videos), ericflo
http://showmedo.com/videos/series?name=PPN7NA155

Setting Up a Django Development Environment (3 videos), ericflo
http://showmedo.com/videos/series?name=LY7fNbpc1

Debugging Django (4 videos), Eric Holscher
http://showmedo.com/videos/series?name=RjHhY85GD
Django Command Extensions, Eric Holscher
http://showmedo.com/videos/series?name=3eB8j5P3b

To commemorate the launch of Django v1 I produced a 1-minute quick intro with backing music by
the great Django Reinhardt to help raise awareness of the team's great effort:

Django In Under A Minute, Ian Ozsvald
http://showmedo.com/videos/video?name=3240000&fromSeriesID=324


Python Coding:
Florian, a longer-term ShowMeDo author, has created two series which introduce Decorators and
teach you how to do unit-testing.

Advanced Python (3 videos), Florian Mayer
http://showmedo.com/videos/series?name=D42HbAhqD

Unit-testing with Python (2 videos), Florian Mayer
http://showmedo.com/videos/series?name=TUeY7z7GD


Python Tools:
We also have videos on the Python Bug Tracker, the Round-up Issue Tracker, using VIM with
Python, and another in a set explaining how to use Python inside Resolver Systems' 'Excel-beating'
Resolver One spreadsheet.

Searching the Python Bug Tracker, A.M. Kuchling
http://showmedo.com/videos/video?name=3110000&fromSeriesID=311

An Introduction to Round-up Issue Tracker, Tonu Mikk
http://showmedo.com/videos/video?name=3610000&fromSeriesID=361

An Introduction to Vim Macros (7 videos), Justin Lilly
http://showmedo.com/videos/series?name=0oSagogCe

Putting Python objects in the spreadsheet grid in Resolver One, Resolver Systems
http://showmedo.com/videos/video?name=3520000&fromSeriesID=352


Club ShowMeDo:
In the Club we continue to create more specialist tutorials for new and intermediate Python
programmers. Membership to the Club can either be bought for a year's access or gained free for
life if you author a video for us.

You'll find details of the 115 Python videos for Club members here:
http://showmedo.com/club
Lucas Holland has joined us as a Club author, having authored many free videos inside ShowMeDo.
In this 9-part series he introduces the Python Standard Library:
Batteries included - The Python standard library (9 videos), Lucas Holland
http://showmedo.com/videos/series?name=o9MBQ758M

I have created two new series which walk you through loops, iteration and functions:
Python Beginners - Loops and Iteration (7 videos), Ian Ozsvald
http://showmedo.com/videos/series?name=tIZs1K8h4

Python Beginners - Functions (6 videos), Ian Ozsvald
http://showmedo.com/videos/series?name=4oReffvYq
Filtering Microarray Correlations by Statistical
Literature Analysis Yields Potential Hypotheses for
Lactation Research

Maurice HT Ling1,2 (mauriceling@acm.org)
Christophe Lefevre1,3,4 (Chris.Lefevre@med.monash.edu.au)
Kevin R Nicholas1,4 (kevin.nicholas@deakin.edu.au)

1 CRC for Innovative Dairy Products, Department of Zoology, The University of
  Melbourne, Australia
2 School of Chemical and Life Sciences, Singapore Polytechnic, Singapore
3 Victorian Bioinformatics Consortium, Monash University, Australia
4 Institute of Technology Research and Innovation, Deakin University, Australia




Abstract
Background
Recent studies have demonstrated that the cyclical nature of mouse lactation1 can be
mirrored at the transcriptome2 level of the mammary glands, but making sense of
microarray3 results requires analysis of large amounts of biological information, which
is increasingly difficult to access as the amount of literature increases. Extraction of
protein-protein interactions from text by statistical and natural language processing has
been shown to be useful in managing the literature. Correlation between gene expression
across a series of samples is a simple method to analyze microarray data, as it was
found that genes that are related in function exhibit similar expression profiles4.
Microarrays have been used to examine the transcriptome of mouse lactation, and it was
found that the cyclic nature of the lactation cycle as observed histologically is reflected at
the transcription level. However, there has been no study to date using text mining to
sieve microarray analysis to generate new hypotheses for further research in the field
of lactational biology.

Results
Our results demonstrated that a previously reported protein name co-occurrence
method (5-mention PubGene), which was not based on a hypothesis-testing
framework, is generally more stringent than the 99th percentile of the Poisson distribution-
based method of calculating co-occurrence. It agrees with previous methods using
natural language processing to extract protein-protein interactions from text, as more
than 96% of the interactions found by natural language processing methods
coincide with the results from the 5-mention PubGene method. However, less than 2% of

1 Lactation is the process of milk production.

2 Transcriptome is the set of genes that are active in a given cell at any one time.

3 Microarray is a multiplex technology used in molecular biology to measure the activity of a set of
  genes at any one time.
4 A gene expression profile is the trend of activity for all the genes across different time points or
  conditions.


the gene co-expressions analyzed by microarray were found from direct co-
occurrence or interaction information extraction from the literature. At the same time,
combining microarray and literature analyses, we derive a novel set of 7 potential
functional protein-protein interactions that had not been previously described in the
literature.

Conclusions
We conclude that the 5-mention PubGene method is more stringent than the 99th
percentile of Poisson distribution method for extracting protein-protein interactions by
co-occurrence of entity names, and that literature analysis may be a potential filter for
microarray analysis to isolate potentially novel hypotheses for further research.


1.     Background
Microarray technology is a transcriptome analysis tool which has been used in the
study of the mouse lactation cycle (Clarkson and Watson, 2003; Rudolph et al., 2007).
A number of advances in microarray analysis have been made recently. For example,
inferring the underlying genetic network from microarray results (Rawool and
Venkatesh, 2007; Maraziotis et al., 2007) by statistical correlation of gene expression
across a series of samples (Reverter et al., 2005), then deriving functional network
clusters by mapping onto Gene Ontology (Beissbarth, 2006). It has been shown that
functionally related genes demonstrate similar expression profiles (Reverter et al.,
2005). These methods have been used to study functional gene sets for basal cell
carcinoma (O'Driscoll et al., 2006). The amount of information in published form is
increasing exponentially, making it difficult for researchers to keep abreast of the
relevant literature (Hunter and Cohen, 2006). At the same time, there has been no
study to demonstrate that the current status of knowledge in protein-protein
interactions in the literature is useful to increase the understanding of microarray data.

The two major streams for biomedical protein-protein information extraction are
natural language processing (NLP) and co-occurrence statistics (Cohen and Hersh,
2005; Jensen et al., 2006). The main reason for concurrent existence of these two
methods is their complementary effect in terms of information extraction (Jensen et
al., 2006). NLP has a lower recall or sensitivity than co-occurrence but tends to be
more precise compared with co-occurrence statistical methods (Wren and Garner,
2004; Jensen et al., 2006). Mathematically, precision is the number of true positives
divided by the total number of items labeled by the system as positive (number of true
positives divided by the sum of true and false positives), whereas recall is the number
of true positives identified by the system divided the number of actual positives
(number of true positives divided by the sum of true positives and false negatives). A
number of tools have approached protein-protein interaction extraction from the NLP
perspective, these include GENIES (Friedman et al., 2001), MedScan (Novichkova et
al., 2003), PreBIND (Donaldson et al., 2003), BioRAT (David et al., 2004), GIS
(Chiang et al., 2004), CONAN (Malik et al., 2006), and Muscorian (Ling et al., 2007).
Muscorian (Ling et al., 2007) achieved at least 82% precision and 30% in recall
(sensitivity). NLP methods make use of the grammatical forms of words and the structure
of a valid sentence to identify the grammatical role of each word in a sentence, parse
the sentence into phrases and extract information such as subject-verb-object
structures from these phrases. Co-occurrence, a statistical method, is based on the
thesis that multiple occurrences of the same pair of entities suggest that the pair of


entities are related in some way and the likelihood of such relatedness increases with
higher co-occurrence. In other words, co-occurrence methods tend to view the text
as a bag of un-sequenced words. Hence, depending on the threshold allowed, which
will translate to the precision of the entire system, recall could be total, as implied in
PubGene (Jenssen et al., 2001).

PubGene (Jenssen et al., 2001) defined interactions by co-occurrence to the simplest
and widest possible form by assigning an interaction between 2 proteins if these 2
proteins appear in the same article just once in the entire library of 10 million articles
and found that this criterion has 60% precision (1-Mention PubGene method).
Although it was not stated in the article (Jenssen et al., 2001), it is obvious that such a
criterion would yield 100% recall or sensitivity, giving an F-score of 0.75. F-score is
defined as the harmonic mean of precision and recall, attributing equal weight to both
precision and recall. However, 60% precision is usually unsatisfactory for most
applications. PubGene (Jenssen et al., 2001) had also defined a “5-Mention” method
which requires 5 or more articles with 2 protein names to assign an interaction with
72% precision. It is generally accepted that precision and recall are inversely related;
hence, it can be expected that the “5-Mention” method will not be 100% sensitive.
However, PubGene was benchmarked against the Database of Interacting Proteins and
OMIM, making it more difficult to appreciate the statistical basis of “1-Mention” and
“5-Mention” methods as compared to using a hypothesis testing framework in Chen et
al. (2008). In addition, PubGene is unable to extract the nature of interactions, for
example, binding or inhibiting interactions. On the other hand, NLP is designed to
extract the nature of interactions (Malik et al., 2006; Ling et al., 2007); hence, it can
be expected that NLP results may be used to annotate co-occurrence results.
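
To make these definitions concrete, here is a small sketch (not part of the PubGene work) that computes precision, recall and the F-score from raw counts; plugging in the 1-Mention figures (60% precision with the assumed 100% recall) reproduces the F-score of 0.75 quoted above:

def precision_recall_fscore(tp, fp, fn):
    # Precision: true positives over all predicted positives.
    precision = tp / (tp + fp)
    # Recall (sensitivity): true positives over all actual positives.
    recall = tp / (tp + fn)
    # F-score: harmonic mean of precision and recall.
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

# Illustrative counts only: 60 of 100 predicted interactions are correct, none missed.
print(precision_recall_fscore(tp=60, fp=40, fn=0))   # (0.6, 1.0, 0.75)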

CoPub Mapper used a more sophisticated information measure which took into
account the distribution of entity names in the text database (Alako et al., 2005).
Although Alako et al. (2005) demonstrated that CoPub Mapper's information measure
correlates well with microarray co-expression, the information measure was not used as a
decision criterion for deciding which pairs of co-occurrences were positive results
(personal communication, Guido Jenster, 2006). This is unlike the 1-Mention PubGene
method, where all co-occurrences were taken as positive results, and the 5-Mention PubGene
method, which requires a co-occurrence count of at least 5 before attributing the co-occurrence
as a positive result. Chen et al. (2008) used chi-square to test co-occurrence
statistically to mine disease-drug interactions from clinical notes and published
literature. Another possible way to calculate co-occurrence is a direct use of the Poisson
distribution, on the assumption that co-occurrence of 2 protein names is a rare event
with respect to the entire library. The Poisson distribution is a discrete distribution similar
to Binomial distribution but is used for rare events, for example, to estimate the
probability of accidents in a given stretch of road in a day. Poisson distribution is
easier to use than Binomial distribution as it only requires the mean and does not
require a standard deviation. Based on PubGene, the statistical assumption of Poisson
distribution-based statistics requiring rare events (in this case, the co-occurrences of 2
protein names in a collection of text is statistically rare) can generally be held
(Jenssen et al., 2001).

Although combinations of either NLP or co-occurrence with microarray analysis have
been used (Li et al., 2007; Gajendran et al., 2007; Hsu et al., 2007), neither method
has been used in microarray analysis for advancing lactational biology. This study


attempts to examine the relation between the PubGene and Poisson distribution
methods of calculating co-occurrence and explore the use of NLP-based protein-
protein interaction extraction results to annotate co-occurrence results. This study also
examines the use of co-occurrence analysis on 4 publicly available microarray data
sets on mouse lactation cycle (Master et al., 2002; Clarkson and Watson, 2003; Stein
et al., 2004; Rudolph et al., 2007) as a novel hypothesis discovery tool. Master et al.
(2002) used 13 microarrays to discover the presence of brown adipose tissue in mouse
mammary fat pad and its role in thermoregulation, Clarkson and Watson (2003) used
24 microarrays and characterized inflammation response genes during involution,
Stein et al. (2004) used 51 microarrays and discovered a set of 145 genes that are up-
regulated in early involution where 49 encoded for immunoglobulins, and Rudolph et
al. (2007) used 29 microarrays to study lipid synthesis in the mouse mammary gland
following diets of various fat content and found that genes encoding for nutrient
transporter into the cell are up-regulated following increased food intake. More
importantly, each of the 4 studies independently demonstrated that the cyclical nature
of mammary gland development, as observed histologically and biochemically, are
reflected at the transcriptome level suggesting that microarray is a suitable tool to
study the regulation of mouse lactation. It should be noted that even though each of
these microarray experiments was designed for a different purpose, the principle that
co-expressed genes are more functionally correlated than functionally unrelated genes
remains, as demonstrated by Reverter et al. (2005).

Our results demonstrate that 5-mention PubGene method is generally statistically
more significant than 99th percentile of Poisson distribution method of calculating co-
occurrence. Our results showed that 96% of the interactions extracted by NLP
methods (Ling et al., 2007) overlapped with the results from 5-mention PubGene
method. However, less than 2% of the microarray correlations were found in the co-
occurrence graph extracted by 1-mention PubGene method. Using co-occurrence
results to filter microarray co-expression correlations, we have discovered a
potentially novel set of 7 protein-protein interactions that had not been previously
described in the literature.


2.     Methods
2.1.   Microarray Datasets
The 4 microarray datasets are from Master et al. (2002) using Affymetrix Mouse Chip
Mu6500 and FVB mice, Clarkson and Watson (2003) using Affymetrix U74Av2 chip
and C57/BL6 mice, Rudolph et al. (2007) using Affymetrix U74Av2 chip and FVB
mice, and Stein et al. (2004) using Affymetrix U74Av2 chip and Balb/C mice.


2.2.   Co-Occurrence Calculations
Using a pre-defined list of 3563 protein names derived by Ling et al. (2007) from the
Affymetrix Mouse Chip Mu6500 microarray probeset, PubGene established 2 measures
of binary co-occurrence (Jenssen et al., 2001): the 1-mention method and the 5-mention
method. In the 1-mention method, the appearance of 2 entity names in the same abstract
is deemed a positive outcome, whereas the 5-mention method requires the appearance of
2 entity names in at least 5 abstracts before the pair is considered positive.
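
As an illustrative sketch of how such binary co-occurrence counts might be tallied (the abstract texts and the name-matching step are greatly simplified here compared with the actual PubGene and Muscorian pipelines):

from itertools import combinations
from collections import Counter

def cooccurrence_counts(abstracts, protein_names):
    # For every unordered pair of names, count the abstracts mentioning both
    # (simple substring matching stands in for proper entity recognition).
    counts = Counter()
    for text in abstracts:
        present = sorted(name for name in protein_names if name in text)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

def positive_pairs(counts, threshold):
    # threshold=1 gives the 1-mention method; threshold=5 gives the 5-mention method.
    return set(pair for pair, n in counts.items() if n >= threshold)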


For co-occurrence modelled on the Poisson distribution (Poisson co-occurrence), the
number of abstracts in which both entity names appear is assumed to be rare, as it
only requires the appearance of 2 entity names within 5 articles in a collection of 10
million articles to give a precision of 0.72 (Jenssen et al., 2001). The relative
occurrence frequency of each of the 2 entities was calculated separately as the
quotient of the number of abstracts in which an entity name appeared and the total
number of abstracts in the corpus. The product of the relative occurrence frequencies of
the 2 entities can be taken as the expected probability of the 2 entities
appearing in the same abstract if they are not related, which, when multiplied by the
total number of abstracts, can be taken as the mean number of occurrences (lambda) of
the Poisson distribution. For example, if proteinA and proteinB are found in 1000
abstracts each and there are 1 million abstracts, the relative occurrence frequency will
be 0.001 each and the mean number of occurrences will be 1 (0.001^2 x 1000000). This
means that we expect 1 abstract in a collection of 1 million to contain both proteinA and
proteinB if they are not related (n = 1, p = 0.5).

A positive result is one where the number of abstracts in which both entities in
question appear is on or above the 95th (one-tail P < 0.05) or 99th (one-tail P < 0.01)
percentile of the Poisson distribution. In both co-occurrence calculations, entity
(protein) names in the text are recognized by pattern matching, as used in Ling et al.
(2007).
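
The Poisson test described above might be sketched as follows (scipy is an assumption here; any Poisson implementation would do): lambda is the expected number of co-occurring abstracts under independence, and a pair is positive when the observed count reaches the chosen percentile.

from scipy.stats import poisson

def poisson_cooccurrence_positive(n_a, n_b, n_ab, total_abstracts, percentile=0.99):
    # n_a, n_b: abstracts mentioning each entity; n_ab: abstracts mentioning both.
    # Expected number of co-occurring abstracts if the two entities are unrelated.
    lam = (n_a / total_abstracts) * (n_b / total_abstracts) * total_abstracts
    threshold = poisson.ppf(percentile, lam)   # e.g. the 99th percentile count
    return n_ab >= threshold

# Worked example from the text: 1000 abstracts each out of 1 million gives lambda = 1.
print(poisson_cooccurrence_positive(1000, 1000, 5, 1000000))   # True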


2.3.   Comparing Co-Occurrence and Text Processing
Two sets of comparisons were performed: within the different forms of co-occurrence,
and between co-occurrence and text processing methods. The first set of comparisons
aims to evaluate the differences between the 3 co-occurrence methods described
above. PubGene's 1-mention and 5-mention methods were correlated singly and in
combination with the Poisson co-occurrence methods.

Given that the nodes (N) of a co-occurrence network represent the entities and the
links or edges (E) between nodes represent co-occurrences under the method
used, the entire co-occurrence graph is G = {N, E}, that is, a set of nodes and a set of
edges. In addition, given that the same set of entities was used (same set of nodes),
the difference between the 2 graphs resulting from 2 co-occurrence methods can then
be simply denoted as the number of differences between the 2 sets of edges
(subtraction of one set of edges from another). In practice, a total space
model is used. A graph of total possible co-occurrence is one where each node is "linked"
or co-occurs with every node, including loops (edge to itself). Thus, a graph of total
possible co-occurrence has 3563 nodes and 12694969 (3563²) edges. We define a
graph, G*, as the undirected graph of total possible co-occurrence without parallel
edges and excluding loops. G* has 3563 nodes and 6345703 [3563 x (3563 - 1) / 2]
edges. The output graph of each co-occurrence method is reduced to the number of
edges it contains; as it can be assumed that the graph from the 1-mention PubGene method
represents the most liberal co-occurrence graph (GPG1), the resulting graph from any
other, more sophisticated method (Gi, where i denotes the co-occurrence method) will
be a proper subset of GPG1 and certainly of G*.
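
Because the node set is fixed, each method's output graph can be reduced to a set of undirected edges and compared with plain set operations, roughly as in this sketch (the entity names are placeholders):

def as_edge_set(pairs):
    # Represent an undirected co-occurrence graph as a set of frozenset edges.
    return set(frozenset(pair) for pair in pairs)

def compare_graphs(edges_a, edges_b):
    # Edges unique to each method and edges shared by both.
    return {"only_in_a": edges_a - edges_b,
            "only_in_b": edges_b - edges_a,
            "shared": edges_a & edges_b}

# Toy example: the 5-mention graph is expected to be a subset of the 1-mention graph.
g_pg1 = as_edge_set([("proteinA", "proteinB"), ("proteinA", "proteinC"), ("proteinB", "proteinD")])
g_pg5 = as_edge_set([("proteinA", "proteinB")])
print(len(compare_graphs(g_pg1, g_pg5)["only_in_a"]))   # 2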




The second set of comparisons aims at correlating co-occurrence techniques and
natural language processing techniques for extracting interactions between two
entities, such as two proteins. In this comparison, the protein-protein binding and
activation interactions, extracted using Muscorian on 860000 published abstracts
retrieved with "mouse" as the keyword as previously described (Ling et al., 2007),
were used for comparison against the co-occurrence networks of 1-Mention PubGene and
5-Mention PubGene by graph edge overlap as described above. Briefly,
Muscorian (Ling et al., 2007) normalized protein names within abstracts by
converting the names into abbreviations before processing the abbreviated abstracts
into a table of subject-verb-objects. Protein-protein interaction extraction was
carried out by matching each of the 12694969 (3563²) pairs of protein names and a
verb, namely activate or bind, in the extracted table of subject-verb-objects.
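
A sketch of how the extracted subject-verb-object rows might be matched against protein-name pairs for the verbs of interest (the row format is illustrative; Muscorian's actual output schema may differ):

def interactions_from_svo(svo_rows, protein_names, verbs=("activate", "bind")):
    # Keep rows whose subject and object are both known protein names and whose
    # verb is one of the interaction verbs of interest.
    names = set(protein_names)
    found = set()
    for subject, verb, obj in svo_rows:
        if verb in verbs and subject in names and obj in names and subject != obj:
            found.add((frozenset((subject, obj)), verb))
    return found

rows = [("proteinA", "activate", "proteinB"), ("proteinA", "cleave", "proteinC")]
print(interactions_from_svo(rows, ["proteinA", "proteinB", "proteinC"]))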


2.4.   Mapping Co-Expression Networks onto Text-Mined Networks
A co-expression network was generated from each of the 4 in vivo data sets by pair-
wise calculation of Pearson's coefficient on the intensity values across the dataset,
where a coefficient of more than 0.75 or less than -0.75 signifies the presence of a co-
expression between the pair of signals on the microarray (Reverter et al., 2005). The
co-expression network generated from Master et al. (2002) and an intersected co-
expression network generated by intersecting all 4 networks were used to map onto 1-
PubGene and NLP-mined networks. For the co-expression network generated from
Master et al. (2002), a 0.01 coefficient unit incremental stepwise mapping to 1-
PubGene co-occurrence network as performed from 0.75 to 1 to analyze for an
optimal correlation coefficient to derive a set of correlations between genes that is
likely to have not been studied before (not found in 1-PubGene co-occurrence
network).
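
The pair-wise correlation step might be sketched as follows, assuming numpy (the original analysis code is not shown in the paper): rows of the matrix are probes/genes, columns are arrays, and an edge is kept when the Pearson coefficient exceeds 0.75 in magnitude.

import numpy as np
from itertools import combinations

def coexpression_network(intensities, gene_ids, cutoff=0.75):
    # intensities: genes x samples matrix of expression values.
    corr = np.corrcoef(intensities)          # gene-by-gene Pearson correlation matrix
    edges = set()
    for i, j in combinations(range(len(gene_ids)), 2):
        if abs(corr[i, j]) > cutoff:
            edges.add(frozenset((gene_ids[i], gene_ids[j])))
    return edges

# Toy data: 3 genes measured on 4 arrays; only the first two are strongly correlated.
data = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.1, 3.9, 6.2, 8.1],
                 [5.0, 1.0, 4.0, 2.0]])
print(coexpression_network(data, ["geneA", "geneB", "geneC"]))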


3.     Results
3.1.   Comparing Co-Occurrence Calculation Methods
Using 3563 transcript names, there is a total of 6345703 possible pairs of interactions
- 927648 (14.6%) were found using 1-Mention PubGene method and 431173 (6.80%)
were found using the 5-Mention PubGene method. The Poisson co-occurrence method
using either the 95th or the 99th percentile threshold found 927648 co-occurrences, which is
the same set as found using the 1-Mention PubGene method.


The mean number of co-occurrence, which is used as the mean of the Poisson
distribution, is calculated as the product of the probability of occurrence of each of the
entity names in the database. Using a database of 100 thousand abstracts as an
example, if 500 abstracts contained the term “insulin” (500 abstracts in 100 thousand,
or 0.5%) and 200 abstracts contained the term “MAP kinase” (200 abstracts in 100
thousand, or 0.2%), then the mean number of co-occurrence (lambda in Poisson
distribution) is 0.001%. The range of the mean number of co-occurrences for the 6345703
pairs of entities was from zero to 0.59, with a mean of 0.000031. For example, if the
mean is 3.1 x 10^-5, then the probability of an abstract mentioning 2 proteins not related
in any functional way is 4.8 x 10^-10, or virtually zero in 6.3 million possible
interactions. These results are summarized in Table 1.


                                            Number of Clone-Pairs    % of Full Combination
Full Combination (G*)¹                            6345703                   100.00
1-Mention PubGene                                  927648                    14.62
5-Mention PubGene                                  431173                     6.80
Poisson Co-occurrence at 95th percentile           927648²                   14.62
Poisson Co-occurrence at 99th percentile           927648²                   14.62

Table 1 - Summary results of co-occurrence using PubGene or Poisson
distribution
¹ The undirected graph of total possible co-occurrence (3563²) without parallel edges and
  excluding self edges, which has 3563 nodes and 6345703 [3563 x (3563 - 1) / 2] edges.
² Same set as 1-Mention PubGene.


3.2.   Comparison of Natural Language Processing and Co-Occurrence
Natural language processing (NLP) techniques were used to extract protein-protein
binding interactions and protein-protein activation interactions from almost 860000
abstracts as described in Ling et al. (2007). A total of 9803 unique binding
interactions and 11365 unique activation interactions were identified, of which 2958
were both binding and activation interactions. Of the 9803 binding interactions, 9661
interactions concurred with 1-Mention PubGene method (98.55%) and 9465
interactions with 5-Mention PubGene method (96.54%). Of the 11365 activation
interactions, 11280 interactions and 11111 interactions concurred with 1-Mention
PubGene method (99.25%) and 5-Mention PubGene method (97.77%) respectively.
Hence, of the 927648 interactions found using 1-Mention PubGene method, 1.04% (n
= 9661) were binding interactions and 1.22% (n = 11280) were activation interactions.
Furthermore, of the 431173 interactions found using 5-Mention PubGene method,
2.20% (n = 9465) of the interactions were binding interactions and 2.58% (n = 11111)
were activation interactions. Combining binding and activation interactions (n =
18120), 1.96% of 1-Mention PubGene co-occurrence graph and 3.85% of 5-Mention
PubGene co-occurrence graph were annotated respectively.


3.3.   Mapping Co-Expression Networks onto Text-Mined Networks
Using Pearson's correlation coefficient to signify the presence of a co-expression
between pairs of spots (genes) on the Master et al. (2002) data set, there are 210283
correlations between -1.00 and -0.75 or between 0.75 and 1.00, of which 2014 (0.96% of
correlations) are found in 1-PubGene co-occurrence network, 342 (0.16% of
correlations) are found in activation network extracted by natural language processing
means and 407 (0.19% of correlations) are found in binding network extracted by
natural language processing means.




From incremental correlation mapping with 1-PubGene network (tabulated in Table 2
and graphed in Figure 1), there is a decline of the number of correlations from 208269
(correlation coefficient of 0.75) to 7 (correlation coefficient of 1.00). The percentage
of overlap between co-occurrence and co-expression rose linearly from correlation
coefficient of 0.75 to 0.85 (r = 0.959) while that of correlation coefficient of 0.86 to
0.92 is less correlated (r = 0.223). The 7 pairs of correlations in Master et al. (2002)
data set with correlation coefficient of 1.00 are: lactotransferrin (Mm.282359) and
solute carrier family 3 (activators of dibasic and neutral amino acid transport),
member 2 (Mm.4114); B-cell translocation gene 3 (Mm.2823) and UDP-
Gal:betaGlcNAc beta 1,4-galactosyltransferase, polypeptide 1 (Mm.15622); gamma-
glutamyltransferase 1 (Mm.4559) and programmed cell death 4 (Mm.1605); FK506
binding protein 11 (Mm.30729) and signal recognition particle 9 (Mm.303071);
FK506 binding protein 11 (Mm.30729) and Ras-related protein Rab-18 (Mm.132802);
casein gamma (Mm.4908) and casein alpha (Mm.295878); G protein-coupled receptor
83 (Mm.4672) and recombination activating gene 1 activating protein 1 (Mm.17958).
The amount of overlap between microarray correlations and 1-mention PubGene co-
occurrence increased steadily from 0.96% at the correlation coefficient of 0.75 to
1.057% at the correlation coefficient of 0.87.

Mapping an intersect of co-expression networks of all 4 in vivo data sets (Master et
al., 2002; Clarkson and Watson, 2003; Stein et al., 2004; Rudolph et al., 2007), there
are 1140 correlations, of which 14 (1.23%) are found in 1-PubGene co-occurrence
network, none of which corresponds to the interactions found in activation or binding
networks extracted by natural language processing means (Ling et al., 2007).


[Figure: "Intersect of Correlation and 1-Mention PubGene" - line chart of the percent of
correlations found in 1-Mention PubGene (y-axis, 0.70 to 1.10) against the minimum
correlation coefficient (x-axis, 0.76 to 0.98).]


Figure 1 – Percentage of the correlation network analyzed from Master et al. (2002)
found in 1-Mention PubGene co-occurrence




Minimum        Number of Correlations     Number of Correlations     Percentage of
Correlation    in Master et al. (2002)    found in 1-PubGene         Correlations Found
   0.75               210283                      2014                     0.958
   0.76               207593                      1983                     0.964
   0.77               181383                      1735                     0.966
   0.78               157622                      1495                     0.958
   0.79               136152                      1316                     0.976
   0.80               116775                      1141                     0.987
   0.81                99276                       970                     0.987
   0.82                83802                       823                     0.988
   0.83                70019                       692                     0.998
   0.84                57872                       575                     1.004
   0.85                47453                       472                     1.005
   0.86                38228                       373                     0.985
   0.87                30347                       314                     1.046
   0.88                23740                       234                     0.995
   0.89                18137                       178                     0.991
   0.90                13435                       138                     1.038
   0.91                 9797                        96                     0.990
   0.92                 6849                        70                     1.034
   0.93                 4580                        40                     0.881
   0.94                 2919                        28                     0.969
   0.95                 1742                        14                     0.984
   0.96                  970                         7                     0.727
   0.97                  472                         4                     0.855
   0.98                  197                         2                     1.026
   0.99                   60                         0                     0.000
   1.00                    7                         0                     0.000

Table 2 - Summary of incremental stepwise mapping of correlation coefficients
from Master et al. (2002) to 1-PubGene co-occurrence network


4.      Discussion
Comparing the difference between PubGene (Jenssen et al., 2001) and Poisson
modelling method for co-occurrence calculations, three observations could be made.
Firstly, one of the common criticisms of a simple co-occurrence method as used in
this study (co-occurrence of terms without considering the number of words between


                                         -9-
these terms) is that given a large number of articles or documents, every term will co-
occur with every term at least once, leading to total possible co-occurrence (100% or
12694969 in this case). Our results showed that 7.31% of the total possible co-
occurrence were actually found using about 860000 abstracts and only 3.40% using a
more stringent method. PubGene (Jenssen et al., 2001) has also suggested that total
possible co-occurrence was not evident with a much larger set of articles (10 million)
and yet achieved 60% precision using only one instance of co-occurrence in 10
million articles (1-Mention PubGene) and 72% precision with 5-Mention PubGene. It
can be expected that with more instances of co-occurrence, precision may be higher. This
might be due to the sparse distribution of entity names in the set of text as observed
from the low mean number of co-occurrence used for Poisson distribution modeling.
At the same time, PubGene (Jenssen et al., 2001) also illustrated that entity name
recognition by simple pattern matching is able to yield quality results.

Using only results from PubGene (Jenssen et al., 2001), it can be concluded that total
possible co-occurrence is unlikely for a corpus size of up to 10 million (more than
half of current PubMed). Using the Poisson distribution, the mean number of co-
occurrence can be expected to decrease with a larger corpus than used in this study as
it is a product of the relative frequencies of each of the 2 entities. This suggests that as
the size of corpus increases, it is likely that each co-occurrence of terms is more
significant, suggesting that a statistical measure might be more useful in a very large
corpus of more than 10 million as it takes into account both frequencies and corpus
size.

Secondly, Poisson co-occurrence methods at both 95th and 99th percentile yield the
same set of results as 1-Mention PubGene method, which is expected as the maximum
mean number of co-occurrence is 0.59. This implied that every co-occurrence found
are essentially statistically significant in a corpus of about 860000 abstracts; thus,
providing statistical basis for “1-Mention PubGene” method. This might be due to the
nature of abstracts, which were known to be concise. Proteins that have no relation to
each other are generally unlikely to be mentioned in the same abstract and abstracts
tends to mention only crucial findings. However, the same might not apply if full text
articles are used – un-related proteins could be used solely for illustrative purposes.

Thirdly, the number of co-occurrences found using the 5-Mention PubGene method is
substantially lower (less than half) than that found by the 1-Mention PubGene method,
which was also shown in Jenssen et al. (2001). This suggests that 5-Mention PubGene is
appreciably more stringent than using Poisson co-occurrence at the 99th percentile, thus
providing a statistical basis for the "5-Mention PubGene" method.

Our results comparing the numbers of co-occurrence demonstrated a 50.79% decrease
in co-occurrence from 1-Mention PubGene network to 5-Mention PubGene network.
However, the 5-Mention PubGene network retained most of the “activation” (98.5%)
and “binding” (98.0%) interactions found in 1-Mention PubGene network. This might
be the consequence of 30% recall of the NLP methods (Ling et al., 2007) as it would
usually require 3 or more mentions to have a reasonable chance to be identified by
NLP methods. This might also be due to the observation that the 5-Mention PubGene
method is more precise, in terms of accuracy, than the 1-PubGene method as shown
in Jenssen et al. (2001).



The probability of a true interaction (Ling et al., 2007) existing in each of the 9661
NLP-extracted binding interactions that are also found in 1-Mention PubGene co-
occurrence would be raised. The probability of a true interaction existing in each of
the 9465 NLP-extracted binding interactions that are also found in 5-Mention
PubGene co-occurrence would be higher. Hence, combining NLP and statistical co-
occurrence techniques can improve the overall confidence of finding true interactions.
However, it should be noted that statistical co-occurrence used in this work cannot
raise the confidence of NLP-extracted interactions.

Nevertheless, these results also suggest that graphs of statistical co-occurrence could
be annotated with information from NLP methods to indicate the nature of such
interactions. In this study, 2 NLP-extracted interactions from Ling et al. (2007),
“binding” and “activation”, were combined. The combined “binding” and “activation”
network covered 1.96% and 3.85% of 1-Mention and 5-Mention PubGene co-
occurrence graph respectively. Our results demonstrate that the combined network has
a higher coverage than individual “binding” or “activation” networks. Thus, it can be
reasonable to expect that with more forms of interactions, such as degradation and
phosphorylation, extracted with the same NLP techniques, the co-occurrence graph
annotation would be more complete.

By overlapping the co-expression network analyzed from Master et al. (2002) data set
to 1-Mention PubGene co-occurrence network, our results demonstrated that about
99% of the co-expression was not found in the co-occurrence network. This might
suggest that the choice of Pearson's correlation coefficient threshold of more than 0.75
and less than -0.75 as suggested by Reverter et al. (2005) is likely to be sensitive in
isolating functionally related genes from microarray data at the cost of reduced
specificity.

Our results from the incremental stepwise analysis showed that the percentage of overlap
between co-expression and co-occurrence rose linearly for correlation coefficients
from 0.75 to 0.85. This suggests that a correlation coefficient of 0.85 may be optimal
for this data set, as it is likely that using a correlation coefficient of 0.85 will result in
fewer false positives than a correlation coefficient of 0.75. At the same time,
increasing the correlation coefficient from 0.75 to 0.85 resulted in 77.4% fewer (47453
correlations from 210283) interaction correlations. Using this method to further
describe protein-protein interactions and to generate new hypotheses, it can be argued
that a correlation coefficient of 0.85 will result in fewer false positives. While this
deduction is likely, as a more stringent criterion tends to reduce the rate of false
positives, it is difficult to prove experimentally without exhaustive examination of
each result. Nevertheless, the results suggest the possibility of using the inverse
linearity of correlation coefficient and the number of gene co-expressions as a
preliminary visual assessment to gauge an optimal correlation coefficient to use for a
particular data set. However, on the extreme end, a correlation coefficient of 0.99 and
1.00 yielded 60 and 7 correlations respectively in Master et al. (2002) data set but
none was found in the 1-Mention PubGene co-occurrence network. This suggests that
high-throughput genomic techniques, such as microarrays, present a vast amount of
unmined biological information that has not been examined experimentally.

By exploring the literature for the biological significance of each of the 7 pairs of
perfectly co-expressed genes using Swanson's method (Swanson, 1990), it was found
that all 7 pairs were biologically significant. Lactotransferrin (Ishii et al., 2007) and
solute carrier family 3 (activators of dibasic and neutral amino acid transport),
member 2 (Feral et al., 2005) were involved in cell adhesion. B-cell translocation
gene 3 (Guehenneux et al., 1997) and UDP-Gal:betaGlcNAc beta 1,4-
galactosyltransferase, polypeptide 1 (Mori et al., 2004) were involved in cell cycle
control. Casein gamma and casein alpha are well-established components of milk.
Gamma-glutamyltransferase 1 (Huseby et al., 2003) and programmed cell death 4
(Frankel et al., 2008) were known to be regulating apoptotic pathways. Rab18
(Vazquez-Martinez et al., 2007), signal recognition particle 9 (Egea et al., 2004) and
FK506 binding protein 11 (Dybkaer et al., 2007) were known to be involved in the
secretory pathway. G protein-coupled receptor 83 (Lu et al., 2007) and recombination
activating gene 1 activating protein 1 (Igarashi et al., 2001) were known to be
involved in T-cell function. Taken together, these findings suggest that the set of 7
correlations has likely not been described previously and may provide valuable new
hypotheses in the study of mouse mammary physiology. It is also plausible that this
argument can be extended to the set of 53 highly co-expressed genes (0.99 <
correlation coefficient < 1.00).

Intersecting the 4 in vivo data sets into a single co-expression network increases the
power of the analysis, as it retains only gene-expression correlations that are greater
than 0.75 or less than -0.75 in all 4 data sets. There were 1140 examples of co-expression
in this intersection and only 14 co-expressions (1.23%) were found in the 1-Mention
PubGene co-occurrence network, but none in either the binding or activation networks
extracted by natural language processing. This suggests that these 14 co-expressions
are neither binding nor activating interactions. Textpresso (Muller et al., 2004) has
defined a total of 36 molecular associations between 2 proteins, which include
binding and activation. Future work will expand NLP mining to the 34 other interactions
to improve the annotation of co-occurrence networks.

Reverter et al. (2005) had previously analysed 5 microarray data sets by expression
correlation and demonstrated that genes of related functions exhibit similar expression
profiles across different experimental conditions. Our results suggest that 1126 gene
co-expressions found across all 4 microarray data sets are not present in the co-occurrence
network. These may represent a valuable new set of information in the study of mouse
mammary physiology, as these pairs of genes have not been previously mentioned in
the same publication, and experimental examination of these potential interactions is
needed to understand the biological significance of these co-expressions.


5.     Conclusions
We conclude that the 5-Mention PubGene method is more stringent than the 99th-
percentile Poisson distribution method. In this study, we demonstrate the use of a
liberal co-occurrence-based literature analysis (the 1-Mention PubGene method) to
represent the state of research knowledge on functional protein-protein interactions, as
a sieve to isolate potentially novel hypotheses from microarray co-expression analyses
for further research.




Authors' contributions
ML, CL and KRN contributed equally to the design of experiments and analysis of
results. ML carried out the experiments.


References
   1. Alako BT, Veldhoven A, van Baal S, Jelier R, Verhoeven S, Rullmann T,
       Polman J, Jenster G: CoPub Mapper: mining MEDLINE based on search
       term co-publication. BMC Bioinformatics 2005, 6(1):51.
   2. Beissbarth T: Interpreting experimental results using gene ontologies.
       Methods in Enzymology 2006, 411:340-352.
   3. Chen ES, Hripcsak G, Xu H, Markatou M, Friedman C: Automated
       Acquisition of Disease Drug Knowledge from Biomedical and Clinical
       Documents: An Initial Study. Journal of the American Medical Informatics
       Association 2008, 15(1):87-98.
   4. Chiang J-H, Yu H-C, Hsu H-J: GIS: a biomedical text-mining system for
       gene information discovery. Bioinformatics 2004, 20(1):120.
   5. Clarkson RWE, Watson CJ: Microarray analysis of the involution switch.
       Journal of Mammary Gland Biology and Neoplasia 2003, 8(3):309-319.
   6. Cohen AM, Hersh WR: A survey of current work in biomedical text
       mining. Briefings in Bioinformatics 2005, 6(1):57-71.
   7. David PAC, Bernard FB, William BL, David TJ: BioRAT: extracting
       biological information from full-length papers. Bioinformatics 2004,
       20(17):3206.
   8. Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, Tuekam B, Zhang S,
       Baskin B, Bader GD, Michalickova K et al: PreBIND and Textomy--mining
       the biomedical literature for protein-protein interactions using a support
       vector machine. BMC Bioinformatics 2003, 4:11.
   9. Dybkaer K, Iqbal J, Zhou G, Geng H, Xiao L, Schmitz A, d'Amore F, Chan
       WC: Genome wide transcriptional analysis of resting and IL2 activated
       human natural killer cells: gene expression signatures indicative of novel
       molecular signaling pathways. BMC Genomics 2007, 8:230.
   10. Egea PF, Shan SO, Napetschnig J, Savage DF, Walter P, Stroud RM:
       Substrate twinning activates the signal recognition particle and its
       receptor. Nature 2004, 427(6971):215-221.
   11. Feral CC, Nishiya N, Fenczik CA, Stuhlmann H, Slepak M, Ginsberg MH:
       CD98hc (SLC3A2) mediates integrin signaling. Proceedings of the National
       Academy of Science U S A 2005, 102(2):355-360.
   12. Frankel LB, Christoffersen NR, Jacobsen A, Lindow M, Krogh A, Lund AH:
       Programmed cell death 4 (PDCD4) is an important functional target of
       the microRNA miR-21 in breast cancer cells. Journal of Biological
       Chemistry 2008, 283(2):1026-1033.
   13. Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A: GENIES: a natural-
       language processing system for the extraction of molecular pathways
       from journal articles. Bioinformatics 2001, 17(Suppl. 1):S74-S82.
   14. Gajendran VK, Lin JR, Fyhrie DP: An application of bioinformatics and
       text mining to the discovery of novel genes related to bone biology. Bone
       2007, 40(5):1378-1388.



15. Guehenneux F, Duret L, Callanan MB, Bouhas R, Hayette S, Berthet C,
    Samarut C, Rimokh R, Birot AM, Wang Q et al: Cloning of the mouse BTG3
    gene and definition of a new gene family (the BTG family) involved in the
    negative control of the cell cycle. Leukemia 1997, 11(3):370-375.
16. Hsu CN, Lai JM, Liu CH, Tseng HH, Lin CY, Lin KT, Yeh HH, Sung TY, Hsu
    WL, Su LJ et al: Detection of the inferred interaction network in
    hepatocellular carcinoma from EHCO (Encyclopedia of Hepatocellular
    Carcinoma genes Online). BMC Bioinformatics 2007, 8:66.
17. Hunter L, Cohen KB: Biomedical language processing: what's beyond
    PubMed? Molecular Cell 2006, 21(5):589-594.
18. Huseby NE, Asare N, Wetting S, Mikkelsen IM, Mortensen B, Sveinbjornsson
    B, Wellman M: Nitric oxide exposure of CC531 rat colon carcinoma cells
    induces gamma-glutamyltransferase which may counteract glutathione
    depletion and cell death. Free Radical Research 2003, 37(1):99-107.
19. Igarashi H, Kuwata N, Kiyota K, Sumita K, Suda T, Ono S, Bauer SR,
    Sakaguchi N: Localization of recombination activating gene 1/green
    fluorescent protein (RAG1/GFP) expression in secondary lymphoid
    organs after immunization with T-dependent antigens in rag1/gfp knockin
    mice. Blood 2001, 97(9):2680-2687.
20. Ishii T, Ishimori H, Mori K, Uto T, Fukuda K, Urashima T, Nishimura M:
    Bovine lactoferrin stimulates anchorage-independent cell growth via
    membrane-associated chondroitin sulfate and heparan sulfate
    proteoglycans in PC12 cells. Journal of Pharmacological Science 2007,
    104(4):366-373.
21. Jensen LJ, Saric J, Bork P: Literature mining for the biologist: from
    information retrieval to biological discovery. Nature Review Genetics 2006,
    7(2):119-129.
22. Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of
    human genes for high-throughput analysis of gene expression. Nature
    Genetics 2001, 28(1):21-28.
23. Li X, Chen H, Huang Z, Su H, Martinez JD: Global mapping of gene/protein
    interactions in PubMed abstracts: A framework and an experiment with
    P53 interactions. Journal of Biomedical Informatics 2007.
24. Ling MH, Lefevre C, Nicholas KR, Lin F: Re-construction of Protein-
    Protein Interaction Pathways by Mining Subject-Verb-Objects
    Intermediates. In: Second IAPR Workshop on Pattern Recognition in
    Bioinformatics (PRIB 2007). Singapore: Springer-Verlag; 2007.
25. Lu LF, Gavin MA, Rasmussen JP, Rudensky AY: G protein-coupled receptor
    83 is dispensable for the development and function of regulatory T cells.
    Molecular Cell Biology 2007, 27(23):8065-8072.
26. Malik R, Franke L, Siebes A: Combination of text-mining algorithms
    increases the performance. Bioinformatics 2006, 22(17):2151-2157.
27. Master SR, Hartman JL, D'Cruz CM, Moody SE, Keiper EA, Ha SI, Cox JD,
    Belka GK, Chodosh LA: Functional microarray analysis of mammary
    organogenesis reveals a developmental role in adaptive thermogenesis.
    Molecular Endocrinology 2002, 16(6):1185-1203.
28. Maraziotis IA, Dragomir A, Bezerianos A: Gene networks reconstruction
    and time-series prediction from microarray data using recurrent neural
    fuzzy networks. IET Systems Biology 2007, 1(1):41-50.
29. Mori R, Kondo T, Nishie T, Ohshima T, Asano M: Impairment of skin
    wound healing in beta-1,4-galactosyltransferase-deficient mice with


reduced leukocyte recruitment. American Journal of Pathology 2004,
    164(4):1303-1314.
30. Muller HM, Kenny EE, Sternberg PW: Textpresso: an ontology-based
    information retrieval and extraction system for biological literature. PLoS
    Biology 2004, 2(11):e309.
31. Novichkova S, Egorov S, Daraselia N: MedScan, a natural language
    processing engine for MEDLINE abstracts. Bioinformatics 2003, 19:1699-
    1706.
32. O'Driscoll L, McMorrow J, Doolan P, McKiernan E, Mehta JP, Ryan E,
    Gammell P, Joyce H, O'Donovan N, Walsh N et al: Investigation of the
    molecular profile of basal cell carcinoma using whole genome
    microarrays. Molecular Cancer 2006, 5:74.
33. Rawool SB, Venkatesh KV: Steady state approach to model gene
    regulatory networks-Simulation of microarray experiments. Biosystems
    2007.
34. Reverter A, Barris W, Moreno-Sanchez N, McWilliam S, Wang YH, Harper
    GS, Lehnert SA, Dalrymple BP: Construction of gene interaction and
    regulatory networks in bovine skeletal muscle from expression data.
    Australian Journal of Experimental Agriculture 2005, 45:821-829.
35. Rudolph MC, McManaman JL, Phang T, Russell T, Kominsky DJ, Serkova
    NJ, Stein T, Anderson SM, Neville MC: Metabolic regulation in the
    lactating mammary gland: a lipid synthesizing machine. Physiological
    Genomics 2007, 28:323-336.
36. Stein T, Morris J, Davies C, Weber-Hall S, Duffy M-A, Heath V, Bell A,
    Ferrier R, Sandilands G, Gusterson B: Involution of the mouse mammary
    gland is associated with an immune cascade and an acute-phase response,
    involving LBP, CD14 and STAT3. Breast Cancer Research 2004, 6(2):R75 –
    R91.
37. Swanson DR: Medical literature as a potential source of new knowledge.
    Bulletin of the Medical Library Association 1990, 78(1):29-37.
38. Vazquez-Martinez R, Cruz-Garcia D, Duran-Prado M, Peinado JR, Castano JP,
    Malagon MM: Rab18 inhibits secretory activity in neuroendocrine cells by
    interacting with secretory granules. Traffic 2007, 8(7):867-882.
39. Wren JD, Garner HR: Shared relationship analysis: ranking set cohesion
    and commonalities within a literature-derived relationship network.
    Bioinformatics 2004, 20(2):191-198.




Appendix A – Use of Python in this work
Python was used throughout this study and the resulting code was incorporated into
Muscorian (Ling et al., 2007). The following code snippets demonstrate the calculation
of the Poisson distribution and the intersection of the Master et al. (2002) and
1-Mention PubGene results shown in Figure 1 and Table 2.

Given that muscopedia.dbcursor is the database cursor and the pmc_abstract table
contains the abstracts, the Poisson distribution model for each pair of entity (gene or
protein) names is constructed by the function commandJobCloneOccurrencePoisson:

import math

class Poisson:
    def __init__(self, lamb = 0.0):
        self.mean = lamb

    def factorial(self, m):
        value = 1
        while m > 1:
            value = value * m
            m = m - 1
        return value

    def PDF(self, x):
        # Poisson probability mass: P(X = x) = exp(-lambda) * lambda^x / x!
        return math.exp(-self.mean) * \
               pow(self.mean, x) / self.factorial(x)

    def inverseCDF(self, prob):
        # Accumulate the PDF until the cumulative probability reaches prob.
        cprob = 0.0
        x = 0
        while cprob < prob:
            cprob = cprob + self.PDF(x)
            x = x + 1
        return (x, cprob)

def commandJobCloneOccurrencePoisson(self):
    poisson = Poisson()
    muscopedia.dbcursor.execute(
        'select count(pmid) from pmc_abstract')
    abstractcount = float(muscopedia.dbcursor.fetchall()[0][0])
    muscopedia.dbcursor.execute(
        'select jclone, occurrence from jclone_occurrence')
    dataset = [[clone[0].strip(), clone[1]]
               for clone in muscopedia.dbcursor.fetchall()]
    muscopedia.dbcursor.execute('delete from jclone_occur_stat')
    count = 0
    for subj in dataset:
        for obj in dataset:
            # Chance of both entities appearing in the same abstract,
            # assuming the two entities occur independently.
            mean = (float(subj[1]) / abstractcount) * \
                   (float(obj[1]) / abstractcount)
            poisson.mean = mean
            (poi95, prob) = poisson.inverseCDF(0.95)
            (poi99, prob) = poisson.inverseCDF(0.99)
            count = count + 1
            sqlstmt = "insert into jclone_occur_stat (clone1, " \
                      "clone2, randomoccur, poisson95, poisson99) " \
                      "values ('%s','%s','%.6f','%s','%s')" % \
                      (str(subj[0]), str(obj[0]), mean,
                       str(poi95), str(poi99))
            try: muscopedia.dbcursor.execute(sqlstmt)
            except IOError: pass
            if (count % 1000) == 0:
                muscopedia.dbconnect.commit()

Each pair of entities was searched for in each abstract using SQL statements, such as
“select count(pmid) from pmc_abstract where text containing 'insulin' and text
containing 'MAPK'”, and the number of abstracts found was matched against the
jclone_occur_stat table for statistical significance based on the calculated Poisson
distribution.
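
As a minimal sketch of this matching step (reusing muscopedia.dbcursor and the table
layout above; the helper names countCoMentions and isSignificant are illustrative
assumptions, not functions taken from Muscorian):

def countCoMentions(entity1, entity2):
    # Number of abstracts mentioning both entity names.
    muscopedia.dbcursor.execute(
        "select count(pmid) from pmc_abstract "
        "where text containing '%s' and text containing '%s'"
        % (entity1, entity2))
    return int(muscopedia.dbcursor.fetchall()[0][0])

def isSignificant(entity1, entity2):
    # Compare the observed co-mention count against the stored 99th
    # percentile of the Poisson distribution for this pair.
    observed = countCoMentions(entity1, entity2)
    muscopedia.dbcursor.execute(
        "select poisson99 from jclone_occur_stat "
        "where clone1 = '%s' and clone2 = '%s'" % (entity1, entity2))
    threshold = int(muscopedia.dbcursor.fetchall()[0][0])
    return observed > threshold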

The results were exported from muscopedia (Muscorian's database) as a tab-delimited
file and analyzed using the following code to generate Table 2:
import sets

# Read the tab-delimited co-expression (correlation) results.
lc = open('lc_cor.csv', 'r').readlines()
lc = [x[:-1] for x in lc]
lc = [x.split('\t') for x in lc]
d = {}
for x in lc:
    # Keep each gene pair only once, regardless of order.
    try: t = d[(x[1], x[0])]
    except KeyError: d[(x[0], x[1])] = float(x[2])

lc = [(x[0], x[1], d[x]) for x in d]
l = [(x[0], x[1]) for x in d]
l = sets.Set(l)

def process_sif(file):
    # Read a PubGene network in SIF format (gene, tab, "ppt", tab, gene).
    a = open(file, 'r').readlines()
    a = [x[:-1] for x in a]
    a = [x.split('\tppt\t') for x in a]
    return [(x[0], x[1]) for x in a]

a = sets.Set(process_sif('pubgene1.sif'))

print "# intersect of pubgene1.sif and LC data: " + \
      str(len(l.intersection(a)))
print "# LC data not in pubgene1.sif: " + \
      str(len(l.difference(a)))
print "# pubgene1.sif not in LC data: " + \
      str(len(a.difference(l)))
print ""

# Incremental stepwise analysis: raise the correlation threshold
# from 0.75 to 1.00 in steps of 0.01.
cor = 0.74
while (cor < 1.0):
    t = [(x[0], x[1]) for x in lc if x[2] > cor]
    l = sets.Set(t)
    cor = cor + 0.01
    print "LC correlation: " + str(cor)
    print "# intersect of pubgene1.sif and LC data: " + \
          str(len(l.intersection(a)))
    print "# LC data not in pubgene1.sif: " + \
          str(len(l.difference(a)))
    print "# pubgene1.sif not in LC data: " + \
          str(len(a.difference(l)))
    print ""




Appendix B – PubGene algorithm and its main results
The PubGene algorithm (Jenssen et al., 2001) is a count-based method that simply
counts the number of abstracts containing both entity names. Using “insulin” and “MAPK”
as the pair of entities, the PubGene algorithm can be implemented using the following
SQL: “select count(pmid), 'insulin', 'MAPK' from pmc_abstract where text containing
'insulin' and text containing 'MAPK'”. The 1-Mention PubGene and 5-Mention PubGene
networks can be isolated by filtering for count(pmid) greater than zero and greater than
four respectively. Jenssen et al. (2001) demonstrated that the precision of the 1-Mention
method is 60%, while that of the 5-Mention method is 72%.
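
As a minimal sketch (reusing the hypothetical countCoMentions helper sketched in
Appendix A; gene_names is an assumed list of entity names, not a variable from the
actual system):

def pubgene_edges(entities, min_mentions=1):
    # Return entity pairs co-mentioned in at least `min_mentions` abstracts.
    edges = []
    for i in range(len(entities)):
        e1 = entities[i]
        for e2 in entities[i + 1:]:
            if countCoMentions(e1, e2) >= min_mentions:
                edges.append((e1, e2))
    return edges

one_mention = pubgene_edges(gene_names, min_mentions=1)
five_mention = pubgene_edges(gene_names, min_mentions=5)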




The Python Papers, Vol. 3, No. 3 (2008)
Available online at http://ojs.pythonpapers.org/index.php/tpp/issue/view/10




                    Automatic C Library Wrapping
                      Ctypes from the Trenches
                                     Guy K. Kloss

                                    Computer Science
                   Institute of Information & Mathematical Sciences
                  Massey University at Albany, Auckland, New Zealand
                           Email: G.Kloss@massey.ac.nz

         At some point many Python developers, at least in computational
      science, will face the situation that they want to interface some natively
      compiled library from Python. A growing variety of tools and technologies
      is now available for binding native code to Python. This paper focuses on
      wrapping shared C libraries, using Python's default Ctypes. Particular
      attention is paid to tools that ease the process (by using code generation)
      and to some best practices. The paper tells a step-by-step story of the
      wrapping and development process that should be transferable to similar
      problems.


      Keywords:        Python, Ctypes, wrapping, automation, code generation.




1    Introduction
One of the grand fundamentals in software engineering is to use the tools that are
best suited for a job, and not to prematurely decide on an implementation. That is
often easier said than done, in the light of competing requirements (e.g. rapid/easy
implementation vs. the needed speed of execution, or vs. low-level access to hardware).
The traditional way [1] of binding native code to Python through extending or
embedding is quite tedious and requires lots of manual coding in C. This paper presents
an approach using the Ctypes package [2], which has been part of the Python standard
library since version 2.5.
    As an example, the creation of a wrapper for the Little CMS colour management
library [3] is outlined. The library offers excellent features and ships with official
Python bindings (using SWIG [4]), but unfortunately these have several shortcomings
(incompleteness, un-Pythonic API, complexity of use, etc.). So, out of need and
frustration, the initial steps towards alternative Python bindings were undertaken.
    An alternative would be to fix or improve the bindings using SWIG, or to use
one of a variety of binding tools. The field has been limited to tools that are widely
in use today within the community, that promise to be future proof, and that are not
overly complicated to use. These are the contestants, with (very brief) notes on use
cases that suit their particular strengths:


   •   Use Ctypes [2], if you want to wrap pure C code very easily.


   •   Use Boost.Python [5, 6], if you want to create a more complete API for C++
       that also reflects the object oriented nature of your native code, including
       inheritance into Python, etc.


   •   Use cython [7], if you want to easily speed up and migrate code from Python
       to speedier native code (mixing is possible!).


   •   Use SWIG [4], if you want to wrap your code against several dynamic
       languages.


   Of course, wrapper code can be written manually, in this case directly using
Ctypes. This paper does not provide a tutorial on how Ctypes is used. The reader
should be familiar with this package when attempting to undertake serious library
wrapping. The Ctypes tutorial and Ctypes reference on the project web site [2] are
an excellent starting point for this. For extensive libraries and robustness towards
an evolving API, code generation proved to be a better approach than manual editing.
Code generators exist for Boost.Python as well as for Ctypes to ease the process of
wrapping: Py++ [8] (for Boost.Python) and CtypesLib's h2xml.py and xml2py.py [2].
   Three main reasons have influenced the decision to approach this project using
Ctypes:
   •   Ubiquity of the binding approach, as Ctypes is part of the default distribution.


   •   No compilation of native code to libraries is necessary. Additionally, this
       relieves one from installing a number of development tools, and the library
       wrapper can be approached in a platform independent way.


   •   The availability of a code generator to automate large portions of the wrapper
       implementation process, for ease and robustness against changes.


   The next section of this paper will first introduce a simple C example. This
example is later migrated to Python code through the various incarnations of the
Python wrapper throughout the paper. Sect. 3 introduces how to make the C library
accessible from Python, in this case through code generation. Sect. 4 explains
how to refine the generated code to meet the desired functionality of the wrapper.
The library is anything but Pythonic, so Sect. 5 explains an object oriented Façade
API for the library that features the qualities we love.
   This paper only outlines some interesting fundamentals of the wrapper building
process. Please refer to the source code for more precise details [9].



2      The Example
The sample code (listing in Fig. 1) aims to convert image data from device dependent
colour information to a standardised colour space. The input profile results from
a device specific characterisation of a Hewlett Packard ScanJet (in the ICC profile
HPSJTW.ICM). The output is in the standard conformant sRGB output colour space
as it is used for the majority of computer displays. For this a built-in profile
from LittleCMS is used.
    Input and output are characterised through so-called ICC profiles. For the
input profile the characterisation is read from a file (line 8), and a built-in output
profile is used (line 9). The transformation object is set up using the profiles (lines
11-13), specifying the colour encoding of the in- and output as well as some further
parameters not worth discussing here. In the for loop (lines 15-21) the image data is
transformed line by line, operating on the number of pixels used per line (necessary
as array rows are often padded).
    The goal is to provide a suitable and easy to use API to perform the same task
in Python.




3      Code Generation
Wrapping C data types, functions, constants, etc. with Ctypes is not particularly
difficult. The tutorial, project web site and documentation on the wiki introduce
this concept quite well. But in the presence of an existing larger library, manual
wrapping can be tedious and error prone, as well as hard to keep consistent with
the library in case of changes. This is especially true when the library is maintained
by someone else. Therefore, it is advisable to generate the wrapper code.
    Thomas Heller, the author of Ctypes, has implemented a corresponding project,
CtypesLib, that includes tools for code generation. The tool chain consists of two
parts: the parser (for header files) and the code generator.



3.1     Parsing the Header File

The C header files are parsed by the tool h2xml. In the background it uses GCCXML,
a GCC-based compiler extension that parses the code and generates an XML tree
representation. Therefore, usually the same compiler that builds the binary of the
library can be used to analyse the sources for the code generation. Alternative parsers
often have problems determining a 100 % proper interpretation of the code. This is
particularly true in the case of C code containing pre-processor macros, which can
expand to massively complex constructs.




1    #include "lcms.h"

3    int correctColour(void) {
4        cmsHPROFILE inProfile, outProfile;
5        cmsHTRANSFORM myTransform;
6        int i;

8        inProfile = cmsOpenProfileFromFile("HPSJTW.ICM", "r");
9        outProfile = cmsCreate_sRGBProfile();

11       myTransform = cmsCreateTransform(inProfile, TYPE_RGB_8,
12                                        outProfile, TYPE_RGB_8,
13                                        INTENT_PERCEPTUAL, 0);

15       for (i = 0; i < scanLines; i++) {
16           /* Skipped pointer handling of buffers. */
17           cmsDoTransform(myTransform,
18                          pointerToYourInBuffer,
19                          pointerToYourOutBuffer,
20                          numberOfPixelsPerScanLine);
21       }

23       cmsDeleteTransform(myTransform);
24       cmsCloseProfile(inProfile);
25       cmsCloseProfile(outProfile);

27       return 0;
28   }


                 Figure 1: Example in C using the           LittleCMS   library directly.



     3.2     Generating the Wrapper

In the next stage the parse tree in XML format is taken to generate the binding
code in Python using Ctypes. This task is performed by the xml2py tool. The generator
can be configured in its actions by means of switches passed to it. Of particular
interest here are the -k and the -r switches. The former defines the kinds of types
to include in the output. In this case the #defines, functions, structure and union
definitions are of interest, yielding -kdfs. Note: dependencies are resolved auto-
matically. The -r switch takes a regular expression the generator uses to identify
symbols to generate code for. The full argument list is shown in the listing in Fig. 2
(lines 11-15). The generated code is written to a Python module, in this case _lcms.
It is made private by convention (leading underscore) to indicate that it is not to
be used or modified directly.



     3.3       Automating the Generator

     Both h2xml and xml2py are Python scripts. Therefore, the generation process can be
     automated in a simple generator script. This makes all steps reproducible, docu-
     ments the settings used, and makes the process robust towards evolutionary (smaller)
     changes in the C API. A largely simplified version is shown in the listing of Fig. 2.


1    # Skipped declaration of paths.
2    HEADER_FILE = 'lcms.h'
3    header_basename = os.path.splitext(HEADER_FILE)[0]

5    h2xml.main(['h2xml.py', header_path,
6                '-c',
7                '-o',
8                '%s.xml' % header_basename])

10   SYMBOLS = ['cms.*', 'TYPE_.*', 'PT_.*', 'ic.*', 'LPcms.*', ...]
11   xml2py.main(['xml2py.py', '-kdfs',
12                '-l%s' % library_path,
13                '-o', module_path,
14                '-r%s' % '|'.join(SYMBOLS),
15                '%s.xml' % header_basename])


                        Figure 2: Essential parts of the code generator script.


         Generated code should never be edited manually. As some modification will
     be necessary to achieve the desired functionality (see Sect. 4), automation becomes
     essential to yield reproducible results. Due to some shortcomings of the generated
     code (see Sect. 4), however, some editing was necessary. This modification has also
     been integrated into the generator script to fully remove the need for manual editing.
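
     A minimal sketch of what such an integrated post-processing step could look like
     (the function name and the snippet file are assumptions for illustration, not the
     actual generator script):

def patch_generated_module(module_path, snippet_path):
    # Drop the first three lines of the generated module (its default
    # library loading code) and prepend the snippet shown in Fig. 3.
    generated = open(module_path, 'r').readlines()
    snippet = open(snippet_path, 'r').read()
    open(module_path, 'w').write(snippet + ''.join(generated[3:]))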




     4      Refining the C API
     In the current version of Ctypes in Python 2.5 it is not possible to add e.g. __repr__()
     or __str__() methods to data types. Also, code for loading the shared library in a
     platform independent way needs to be patched into the generated code. A function
     in the code generator reads the whole generated module _lcms and writes it back to
     the file system, in the process replacing three lines at the beginning of the file
     with the code snippet from the listing in Fig. 3.
         _setup (listing in Fig. 4) monkey patches[1] the class ctypes.Structure to include
     a __repr__() method (lines 4-10) for ease of use when representing wrapped objects
     for output.

     [1] A monkey patch is a way to extend or modify the runtime code of dynamic languages without
     altering the original source code: http://en.wikipedia.org/wiki/Monkey_patch




1    from _setup import *
2    import _setup

4    _libraries = {}
5    _libraries['/usr/lib/liblcms.so.1'] = _setup._init()


              Figure 3: Lines to be patched into the generated module _lcms.



     Furthermore, the loading of the shared library (DLL in Windows lingo) is abstracted
     to work in a platform independent way using the system's default search mechanism
     (lines 12-13).



 1   import ctypes
 2   from ctypes.util import find_library

 4   class Structure(ctypes.Structure):
 5       def __repr__(self):
 6           """Print fields of the object."""
 7           res = []
 8           for field in self._fields_:
 9               res.append('%s=%s' % (field[0], repr(getattr(self, field[0]))))
10           return '%s(%s)' % (self.__class__.__name__, ', '.join(res))

12   def _init():
13       return ctypes.cdll.LoadLibrary(find_library('lcms'))


                         Figure 4: Extract from module _setup.py.




     4.1    Creating the Basic Wrapper

     Further modifications are less invasive. For this, the C API is refined into a module
     c_lcms. This module imports everything from the generated _lcms and overrides or
     adds certain functionality individually (again through monkey patching). These
     additions are intended to make the C API a little easier to use through some helper
     functions, but mainly to make the new bindings more compatible with and similar
     to the official SWIG bindings (packaged together with LittleCMS). The wrapped
     C API can be used from Python (see Sect. 4.2). However, it still requires explicit
     closing, freeing or deleting after use, and c_lcms objects/structures do not
     feature methods for operations. This shortcoming will be solved later.
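
     A minimal sketch of this layering (the helper shown is an illustrative assumption;
     the actual c_lcms module contains different helpers):

# c_lcms.py -- thin convenience layer over the generated bindings.
from _lcms import *      # re-export the raw generated API
import _lcms

def openProfile(fileName, mode='r'):
    # Hypothetical helper supplying a default access mode for
    # cmsOpenProfileFromFile (see Fig. 5 for the raw call).
    return _lcms.cmsOpenProfileFromFile(fileName, mode)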



     4.2    c_lcms Example

     The wrapped raw C API in Python behaves in exactly the same way; it is just
     expressed in Python syntax (listing in Fig. 5).




1    from c_lcms import *

3    def correctColour():
4        inProfile = cmsOpenProfileFromFile('HPSJTW.ICM', 'r')
5        outProfile = cmsCreate_sRGBProfile()

7        myTransform = cmsCreateTransform(inProfile, TYPE_RGB_8,
8                                         outProfile, TYPE_RGB_8,
9                                         INTENT_PERCEPTUAL, 0)

11       for line in scanLines:
12           # Skipped handling of buffers.
13           cmsDoTransform(myTransform,
14                          yourInBuffer,
15                          yourOutBuffer,
16                          numberOfPixelsPerScanLine)

18       cmsDeleteTransform(myTransform)
19       cmsCloseProfile(inProfile)
20       cmsCloseProfile(outProfile)


                  Figure 5: Example using the basic API of the c_lcms module.



     5       A Pythonic API
     To create the usual pleasant "batteries included" feeling when working with code
     in Python, another module, littlecms, was created manually, implementing the
     Façade design pattern. From here on we are moving away from the original C-like
     API. This high level, object oriented Façade takes care of the internal handling of
     tedious and error-prone operations. It also performs sanity checking and automatic
     detection of certain crucial parameters passed to the C API. This has drastically
     reduced problems with the low level nature of the underlying C library.
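
     A minimal sketch of the shape of such a Façade (the class bodies are illustrative
     assumptions based on the behaviour described in this section, not the actual
     littlecms implementation):

import c_lcms

class Profile(object):
    # Wraps a cmsHPROFILE handle and disposes of it automatically.
    def __init__(self, fileName=None, colourSpace=None):
        if fileName is not None:
            self._handle = c_lcms.cmsOpenProfileFromFile(fileName, 'r')
        else:
            # Assume the built-in sRGB profile for RGB colour spaces.
            self._handle = c_lcms.cmsCreate_sRGBProfile()

    def __del__(self):
        c_lcms.cmsCloseProfile(self._handle)

class Transform(object):
    # Wraps a cmsHTRANSFORM handle; buffer formats are fixed to TYPE_RGB_8
    # here, whereas littlecms derives them from the Profile objects.
    def __init__(self, inProfile, outProfile):
        self._handle = c_lcms.cmsCreateTransform(
            inProfile._handle, c_lcms.TYPE_RGB_8,
            outProfile._handle, c_lcms.TYPE_RGB_8,
            c_lcms.INTENT_PERCEPTUAL, 0)

    def doTransform(self, inBuffer, outBuffer):
        # The number of pixels is taken from the input buffer length.
        c_lcms.cmsDoTransform(self._handle, inBuffer, outBuffer,
                              len(inBuffer))

    def __del__(self):
        c_lcms.cmsDeleteTransform(self._handle)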



     5.1    littlecms Example

     Using littlecms the API is now object oriented (listing in Fig. 6), with a
     doTransform() method on the myTransform object. But there are a few more
     interesting benefits of this API:


         •   Automatic disposal of C API instances hidden inside the Profile and
             Transform classes.

         •   Largely reduced code size with an easily comprehensible structure.


         •   Redundant information (e.g. the in- and output colour spaces) no longer needs
             to be passed explicitly; it is determined within the Transform constructor from
             information available in the Profile objects.


         •   Uses NumPy [10] arrays for convenience in the buffers, rather than introducing
             further custom types. On these, data array types and shapes can be automat-
             ically matched up.


         •   The number of pixels for each scan line placed in yourInBuffer can usually be
             detected automatically.


         •   Compatible with the often used PIL [11] library.


         •   Several sanity checks prevent clashes of erroneously passed buffer sizes, shapes,
             types, etc. that would otherwise result in a crashed or hanging process.




1    from littlecms import Profile, PT_RGB, Transform

3    def correctColour():
4        inProfile = Profile('HPSJTW.ICM')
5        outProfile = Profile(colourSpace=PT_RGB)
6        myTransform = Transform(inProfile, outProfile)

 8       for line in scanLines:
 9           # Skipped handling of buffers.
10           myTransform.doTransform(yourNumpyInBuffer, yourNumpyOutBuffer)

         Figure 6: Example using the object oriented API of the littlecms module.




     6       Conclusion
     Binding pure C libraries to Python is not very difficult, and the skills can be mastered
     in a rather short time frame. If done right, these bindings can be quite robust
     even towards certain changes in the evolving C API, without the need for very time
     consuming manual tracking of all changes. As with many projects of this kind, it is
     vital to be able to automate the mechanical processes: beyond the code generation
     outlined in this paper, an important role falls to automated code integrity testing
     (here: using PyUnit [12]) as well as API documentation (here: using Epydoc [13]).
         Unfortunately, as CtypesLib is still work in progress, the whole process did not go
     as smoothly as described here. It was particularly important to match up working
     versions properly between GCCXML (which in itself is still in development) and
     CtypesLib. In this case a current GCCXML in version 0.9.0 (as available in Ubuntu
     Intrepid Ibex, 8.10) required a branch of CtypesLib that needed to be checked out
     through the developer's Subversion repository. Furthermore, it was necessary to
     develop a fix for the code generator, as it failed to generate code for #defined floating
     point constants. The patch has been reported to the author and is now in the source
     code repository. Also, patching into the generated source code for overriding some
     features and for manipulating the library loading code can be considered less
     than elegant.
    Library wrapping as described in this paper was performed on version 1.16 of the
LittleCMS library. While writing this paper the author has moved to the now stable
version 1.17. Adapting the Python wrapper to this code base was a matter of about
15 minutes of work. The main task was fixing some unit tests due to rounding
differences resulting from an improved numerical model within the library. The
author of LittleCMS recently made a first preview of the upcoming version 2.0 (an
almost complete rewrite) available. Adapting to that version took only about a
good day of modifications, even though some substantial changes were made to the
API. But even in this case only a very small amount of new code had to be written.
    Overall, it is foreseeable that this type of library wrapping in the Python world
will become more and more ubiquitous as the tools for it mature. But already at the
present time one does not have to fear the process. The time spent initially setting
up the environment will be easily saved over all project phases and iterations. It
will be interesting to see Ctypes evolve to be able to interface to C++ libraries as
well. Currently the developers of Ctypes and Py++ (Thomas Heller and Roman
Yakovenko) are evaluating potential extensions.




References
 [1]    Official Python Documentation: Extending and Embedding the Python Interpreter,
        Python Software Foundation.


 [2] T. Heller,  Python Ctypes Project, http://starship.python.net/crew/theller/
        ctypes/, last accessed December 2008.


 [3] M. Maria,  LittleCMS project, http://littlecms.com/, last accessed December
        2008.


 [4] D. M. Beazley and W. S. Fulton,  SWIG Project, http://www.swig.org/, last
        accessed December 2008.


 [5] D. Abrahams and R. W. Grosse-Kunstleve,  Building Hybrid Systems with
        Boost.Python, http://www.boostpro.com/writing/bpl.html, March 2003, last
        accessed December 2008.


 [6] D. Abrahams,  Boost.Python Project, http://www.boost.org/libs/python/,
        last accessed December 2008.


 [7] S. Behnel, R. Bradshaw, and G. Ewing,  Cython Project, http://cython.org/,
        last accessed December 2008.


 [8] R. Yakovenko,  Py++ Project, http://www.language-binding.net/
     pyplusplus/pyplusplus.html, last accessed December 2008.



 [9] G. K. Kloss,  Source Code: Automatic C Library Wrapping - Ctypes from the
     Trenches, The Python Papers Source Codes [in review], vol. n/a, p. n/a, 2009,
     [Online available] http://ojs.pythonpapers.org/index.php/tppsc/issue/.


[10] T. Oliphant,  NumPy Project, http://numpy.scipy.org/, last accessed Decem-
    ber 2008.


[11] F. Lundh,  Python Imaging Library (PIL) Project, http://www.pythonware.
    com/products/pil/, last accessed December 2008.


[12] S. Purcell,  PyUnit Project, http://pyunit.sourceforge.net/, last accessed De-
    cember 2008.


[13] E. Loper,  Epydoc Project, http://epydoc.sourceforge.net/, last accessed De-
    cember 2008.

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 

TPP3_3

  • 1. Volume 3, Issue 3 pythonpapers.org
  • 3. The Python Papers Volume 3, Issue 3 3 The Python Papers Anthology Editorial Policy 0. Preamble The Python Papers Anthology is the umbrella entity referring to The Python Papers (ISSN 1834-3147), The Python Papers Monograph (ISSN under application) and The Python Papers Source Codes (ISSN under application), under a common editorial committee (hereafter known as 'editorial board'). It aims to be a platform for disseminating industrial / trade and academic knowledge about Python technologies and its applications. The Python Papers is intended to be both an industrial journal as well as an academic journal, in the sense that the editorial board welcomes submissions relating to all aspects of the Python programming language, its tools and libraries, and community, both of academic and industrial inclinations. The Python Papers aims to be a publication for the Python community at large. In order to cater for this, The Python Papers seeks to publish submissions under two main streams: the industrial stream (technically reviewed) and the academic stream (peer- reviewed). The Python Papers Monograph provides a refereed format for publication of monograph- length reports including dissertations, conference proceedings, case studies, advanced-level lectures, and similar material of theoretical or empirical importance. All volumes published under The Python Papers Monograph will be peer-reviewed and external reviewers may be named in the publication. The Python Papers Source Codes provides a refereed format for publication of software and source codes which are usually associated with papers published in The Python Papers and The Python Papers Monograph. All publications made under The Python Papers Source Codes will be peer-reviewed. This policy statement seeks to clarify the processes of technical review and peer-review in The Python Papers Anthology. 1. Composition and roles of the editorial board The editorial board is headed by the Editor-in-Chief or Co-Editors-in-Chief (hereafter known as "EIC"), assisted by Associate Editors (hereafter known as "AE") and Editorial Reviewers (hereafter known as "ER"). EIC is the chair of the editorial board and together with AEs, manages the strategic and routine operations of the periodicals. ER is a tier of editors deemed to have in-depth expertise knowledge in specialized areas. As members of the editorial board, ERs are accorded editorial status but are generally not involved in the strategic and routine operations of the periodicals although their expert opinions may be sought at the discretion of EIC. 2. Right of submission author(s) to choose streams The submission author(s); that is, the author(s) of the article or code or any submissions in any other forms deemed by the editorial board as being suitable; reserves the right to choose if he/she wants his/her submission to be in the industrial stream, where it will be technically reviewed, or in the academic stream, where it will be peer-reviewed. It is also the onus of the submission author(s) to nominate the stream. The editorial board defaults all submissions to be industrial (technical review) in event of non-nomination by the submission author(s) but the editorial board reserves the right to place such submissions into the academic stream if it deems fit. The editorial board also reserves the right to place submissions nominated for the academic stream in the technical stream if it deems fit.
  • 4. 3. Right of submission author(s) to nominate potential reviewers
The submission author(s) can exercise the right to nominate up to 4 potential reviewers (hereafter known as "external reviewers") for his/her submission if the submission author(s) choose to be peer-reviewed. When this right is exercised, the submission author(s) must declare any prior relationships or conflicts of interest with the nominated potential reviewers. The final decision to accept the nominated reviewer(s) rests with the Chief Reviewer (see section 5 for further information on the role of the Chief Reviewer).

4. Right of submission author(s) to exclude potential reviewers
The submission author(s) can exercise the right to recommend excluding any reasonable number of potential reviewers for his/her submission. When this right is exercised, the submission author(s) must indicate the grounds on which such exclusion should be recommended. Decisions for the editorial board to accept or reject such exclusions will be based solely on the grounds indicated by the submission author(s).

5. Peer-review process
Upon receiving a submission for peer-review, the Editor-in-Chief (hereafter known as "EIC") may choose to reject the submission, or the EIC will nominate a Chief Reviewer (hereafter known as "CR") from the editorial board to chair the peer-review process of that submission. The EIC can nominate himself/herself as CR for the submission. The CR will send the submission out to TWO or more external reviewers to be reviewed. The CR reserves the right not to call upon the nominated potential reviewers and/or to call upon any of the reviewers nominated for exclusion by the submission author(s). The CR may also concurrently send the submission to one or more Associate Editor(s) (hereafter known as "AE") for review. Hence, a submission in the academic stream will be reviewed by at least three persons, the CR and two external reviewers. Typically, a submission may be reviewed by three to four persons: the EIC as CR, an AE, and two external reviewers. There is no upper limit to the number of reviews of a submission. Upon receiving the reviews from the external reviewer(s) and/or AE(s), the CR decides on one of the following options: accept without revision, accept with revision, or reject; and notifies the submission author(s) of the decision on behalf of the EIC. If the decision is "accept with revision", the CR will provide a deadline to the submission author(s) for revisions to be done and will automatically accept the revised submission if the CR deems that all revisions were done; however, the CR reserves the right to reject the original submission if the revisions were not carried out by the deadline stipulated by the CR. If the decision is "reject", the submission author(s) may choose to revise for future re-submission. Decision(s) by the CR or EIC are final.

6. Technical review process
Upon receiving a submission for technical review, the Editor-in-Chief (hereafter known as "EIC") may choose to reject the submission, or the EIC will nominate a Chief Reviewer (hereafter known as "CR") from the editorial board to chair the review process of that submission. The EIC can nominate himself/herself as CR for the submission. The CR may decide to accept or reject the submission after reviewing, or may seek another AE's opinion before reaching a decision. The CR will notify the submission author(s) of the decision on behalf of the EIC. Decision(s) by the CR or EIC are final.

7. Main difference between peer-review and technical review
The processes of peer-review and technical review are similar, with the main difference being that in the peer-review process the submission is reviewed both internally by the editorial board and externally by external reviewers (nominated by the submission author(s) and/or by the EIC/CR). In a technical review process, the submission is reviewed by the editorial board. The editorial board retains the right to additionally undertake an external review if it is deemed necessary.
  • 5. 8. Umbrella philosophy
The Python Papers Anthology editorial board firmly believes that all good (technically and/or scholarly/academic) submissions should be published when appropriate and that the editorial board is integral to refining all submissions. The board believes in giving good advice to all submission author(s) regardless of the final decision to accept or reject, and hopes that advice to rejected submissions will assist in their revisions.

The Python Papers Editorial Statement on Open Access
The Python Papers Anthology has received a number of inquiries relating to the republishing of articles from the journal, especially in the context of open-access repositories. Each issue of The Python Papers Anthology is released under a Creative Commons 2.5 license, subject to Attribution, Non-commercial and Share-Alike clauses. This, in short, provides a carte blanche on republishing articles, so long as the source of the article is fully attributed, the article is not used for commercial purposes, and the article is republished under this same license. Creative Commons permits both republishing in full and also the incorporation of portions of The Python Papers in other works. A portion may be an article, quotation or image. This means (a) that content may be freely re-used and (b) that other works using The Python Papers Anthology content must be available under the same Creative Commons license. The remainder of this article will address some of the details that might be of interest to anyone who wishes to include issues or articles in a database, website, hard copy collection or any other alternative access mechanism.
The full legal code of the license may be found at http://creativecommons.org/licenses/byncsa/2.1/au/
The full open access policy can be found at http://ojs.pythonpapers.org/index.php/tpp/about/editorialPolicies
  • 6. Editorial
Maurice Ling

Hi Everyone,

Welcome to the latest issue of The Python Papers. First and foremost, we would like to show our appreciation for all the contributions we received during the year, which have made us what we are today. Of course, we will not forget all our supporters and readers as well, for all your valuable comments. In 2008 (Volume 3), we published a total of 7 industrial and 7 academic articles, as well as 2 columns from our regular columnist, Ian Ozsvald, in his ShowMeDo Updates. Thank you for all your support, and we look forward to your continued encouragement.

Starting in 2009, all the serials under The Python Papers Anthology will take on a new publishing scheme. We will be releasing each article to the public as it is accepted, but each issue will still be delimited by our usual "issue release" date. The "issue release" date is then our cut-off deadline to prepare the one-PDF-per-issue file. This means that we will be serving new articles to everyone much faster than before, and fixed publication schedules will no longer be meaningful.

We have also changed our policy from "Review Policy" to "Editorial Policy" to reflect the changes in the editorial team. We are currently in the process of appointing Editorial Reviewers (ER for short). Editorial Reviewers are members of the editorial committee who are deemed to have in-depth expert knowledge in specialized areas.

Let us look forward to a great year ahead, with more Python development and a recovering economy. Happy reading.
  • 7. Editorial: Python at the Crossroads

My favourite T-shirt glimpsed at PyCon UK 2008 was ...
Python programming as Guido indented it

Apart from the two keynote speeches, it was a happy and fascinating event. It was my first Python-only conference and what a pleasure to be able to choose from four streams - web, GUI, testing and the language itself. My previous conferences, all in Australia, were open source events with wider scope and had just a single Python stream among Perl, PHP, Ruby and so on. The quality of speakers was uniformly excellent and the organisation was first rate. We can be sure that EuroPython 2009, being hosted by the same team next year in Birmingham, will definitely be worth attending.

The two keynotes by Mark Shuttleworth, CEO of Canonical, and Ted Leung, Python Evangelist at Sun, both highlighted Python at the crossroads. Fascinating but not particularly light-hearted. Mixing and matching what they said, their combined story is ...

1. Python has critical mass and will continue to grow. The speed of growth is another question.
2. Django has reached the important milestone version 1.0 and should therefore compete with Ruby-on-Rails for newcomers to the Python language itself.
3. Intel and Sun are currently selling multi-core CPUs - 16 and 128 cores respectively. Expect massively multi-core machines in future.
4. Future growth in language popularity will be tied to multi-threading on multi-core CPUs. Haskell is one language expecting a multi-core growth kick.
5. Python's Global Interpreter Lock effectively prevents the language from exploiting current state-of-the-art multi-core computers.

Where does this leave beautiful Python? The point was made that a language is chosen for being appropriate for the purpose of a project. Where this happens for multi-core performance reasons and Python is rejected, that is growth for another language and a permanent loss to Python.

The Python Papers Editorial Team hopes that many of the PyCon UK session papers will be published in these pages. Please get in touch if you would like to submit an article for academic or technical review. Visit http://ojs.pythonpapers.org to submit an article or paper. In view of the crossroads highlighted for Python at PyCon in September, articles with a focus on multi-threading for multi-core computers would seem to be valuable for the language itself. The Python Papers is keen to see the language succeed and has very talented reviewers ready to help authors get their articles published.
  • 8. Got something to contribute? Please get in touch ... Mike Dewhirst
  • 9. ShowMeDo Update - November
Ian Ozsvald

In the last issue of The Python Papers I wrote a long article about how ShowMeDo helps you to learn more about Python. Since then we've added another 40 Python videos, taking us to almost 380 in total. Including all the open-source topics we cover, we have over 800 tutorial videos for you. Much of the content is free, contributed by our great authors and ourselves. Some of the content is in the Club, which is for paying members - currently the Club focuses purely on Python tutorials for new and intermediate Python programmers. An update on the Club videos follows later.

We were interviewed in October by Ron Stephens of Python411; you'll find the interview and all of Ron's other great Python podcasts on his site: http://www.awaretek.com/python/

Contributing to ShowMeDo:
Would you like to share your knowledge with thousands of Python viewers every month? Contributing to ShowMeDo is easy; you'll find guides and links to screencasting software here: http://showmedo.com/addVideoInstructions
To get an idea of what is popular with our viewers, see how the videos rank here: http://showmedo.com/mostPopular
Remember that everything is previewed by us before publishing. You may have to wait a few days before your video is published, but you'll be safe in the knowledge that your content sits alongside other vetted content. We are very keen to help you share your knowledge with our Pythonistas, especially if you want to spread awareness of the tools you like to use. Do get in contact in our forum; our authors are a friendly and very helpful crowd: http://groups.google.com/group/showmedo

Free Screencasts:

Django: We've had a lot of new Django content recently, mostly from Eric Holscher and ericflo. Eric and Eric have produced an amazing 21 new screencasts to help you learn Django.
Django From the Ground Up (13 videos), ericflo
http://showmedo.com/videos/series?name=PPN7NA155
Setting Up a Django Development Environment (3 videos), ericflo
http://showmedo.com/videos/series?name=LY7fNbpc1
Debugging Django (4 videos), Eric Holscher
http://showmedo.com/videos/series?name=RjHhY85GD
  • 10. Django Command Extensions, Eric Holscher
http://showmedo.com/videos/series?name=3eB8j5P3b

To commemorate the launch of Django v1 I produced a 1-minute quick intro, with backing music by the great Django Reinhardt, to help raise awareness of the team's great effort:
Django In Under A Minute, Ian Ozsvald
http://showmedo.com/videos/video?name=3240000&fromSeriesID=324

Python Coding:
Florian, a longer-term ShowMeDo author, has created two series which introduce Decorators and teach you how to do unit-testing.
Advanced Python (3 videos), Florian Mayer
http://showmedo.com/videos/series?name=D42HbAhqD
Unit-testing with Python (2 videos), Florian Mayer
http://showmedo.com/videos/series?name=TUeY7z7GD

Python Tools:
We also have videos on the Python Bug Tracker, the Round-up Issue Tracker, using VIM with Python, and another in a set explaining how to use Python inside Resolver Systems' 'Excel-beating' Resolver One spreadsheet.
Searching the Python Bug Tracker, A.M. Kuchling
http://showmedo.com/videos/video?name=3110000&fromSeriesID=311
An Introduction to Round-up Issue Tracker, Tonu Mikk
http://showmedo.com/videos/video?name=3610000&fromSeriesID=361
An Introduction to Vim Macros (7 videos), Justin Lilly
http://showmedo.com/videos/series?name=0oSagogCe
Putting Python objects in the spreadsheet grid in Resolver One, Resolver Systems
http://showmedo.com/videos/video?name=3520000&fromSeriesID=352

Club ShowMeDo:
In the Club we continue to create more specialist tutorials for new and intermediate Python programmers. Membership to the Club can either be bought for a year's access or gained free for life if you author a video for us. You'll find details of the 115 Python videos for Club members here: http://showmedo.com/club
  • 11. Lucas Holland has joined us as a Club author, having authored many free videos inside ShowMeDo. In this 9-part series he introduced the Python Standard Library:
Batteries included - The Python standard library (9 videos), Lucas Holland
http://showmedo.com/videos/series?name=o9MBQ758M

I have created two new series which walk you through loops, iteration and functions:
Python Beginners - Loops and Iteration (7 videos), Ian Ozsvald
http://showmedo.com/videos/series?name=tIZs1K8h4
Python Beginners - Functions (6 videos), Ian Ozsvald
http://showmedo.com/videos/series?name=4oReffvYq
  • 12. Filtering Microarray Correlations by Statistical Literature Analysis Yields Potential Hypotheses for Lactation Research

Maurice HT Ling 1,2 (mauriceling@acm.org)
Christophe Lefevre 1,3,4 (Chris.Lefevre@med.monash.edu.au)
Kevin R Nicholas 1,4 (kevin.nicholas@deakin.edu.au)
1 CRC for Innovative Dairy Products, Department of Zoology, The University of Melbourne, Australia
2 School of Chemical and Life Sciences, Singapore Polytechnic, Singapore
3 Victorian Bioinformatics Consortium, Monash University, Australia
4 Institute of Technology Research and Innovation, Deakin University, Australia

Abstract

Background
Recent studies have demonstrated that the cyclical nature of mouse lactation [1] can be mirrored at the transcriptome [2] level of the mammary glands, but making sense of microarray [3] results requires analysis of large amounts of biological information, which is increasingly difficult to access as the amount of literature increases. Extraction of protein-protein interactions from text by statistical and natural language processing has been shown to be useful in managing the literature. Correlation between gene expression across a series of samples is a simple method to analyze microarray data, as it was found that genes that are related in function exhibit similar expression profiles [4]. Microarrays have been used to examine the transcriptome of mouse lactation, and these studies found that the cyclic nature of the lactation cycle as observed histologically is reflected at the transcription level. However, there has been no study to date using text mining to sieve microarray analysis to generate new hypotheses for further research in the field of lactational biology.

Results
Our results demonstrated that a previously reported protein name co-occurrence method (5-mention PubGene), which was not based on a hypothesis testing framework, is generally more stringent than the 99th percentile of the Poisson distribution-based method of calculating co-occurrence. It agrees with previous methods using natural language processing to extract protein-protein interactions from text, as more than 96% of the interactions found by natural language processing methods coincide with the results from the 5-mention PubGene method. However, less than 2% of

Footnotes: [1] Lactation is the process of milk production. [2] Transcriptome is the set of genes that are active in a given cell at any one time. [3] Microarray is a multiplex technology used in molecular biology to measure the activity of a set of genes at any one time. [4] A gene expression profile is the trend of activity for all the genes across different time points or conditions.
  • 13. the gene co-expressions analyzed by microarray were found from direct co-occurrence or interaction information extraction from the literature. At the same time, combining microarray and literature analyses, we derive a novel set of 7 potential functional protein-protein interactions that had not been previously described in the literature.

Conclusions
We conclude that the 5-mention PubGene method is more stringent than the 99th percentile of the Poisson distribution method for extracting protein-protein interactions by co-occurrence of entity names, and that literature analysis may be a potential filter for microarray analysis to isolate potentially novel hypotheses for further research.

1. Background
Microarray technology is a transcriptome analysis tool which had been used in the study of the mouse lactation cycle (Clarkson and Watson, 2003; Rudolph et al., 2007). A number of advances in microarray analysis have been made recently. For example, the underlying genetic network can be inferred from microarray results (Rawool and Venkatesh, 2007; Maraziotis et al., 2007) by statistical correlation of gene expression across a series of samples (Reverter et al., 2005), and functional network clusters can then be derived by mapping onto the Gene Ontology (Beissbarth, 2006). It has been shown that functionally related genes demonstrate similar expression profiles (Reverter et al., 2005). These methods have been used to study functional gene sets for basal cell carcinoma (O'Driscoll et al., 2006).

The amount of information in published form is increasing exponentially, making it difficult for researchers to keep abreast of the relevant literature (Hunter and Cohen, 2006). At the same time, there has been no study to demonstrate that the current status of knowledge in protein-protein interactions in the literature is useful for increasing the understanding of microarray data. The two major streams for biomedical protein-protein information extraction are natural language processing (NLP) and co-occurrence statistics (Cohen and Hersh, 2005; Jensen et al., 2006). The main reason for the concurrent existence of these two methods is their complementary effect in terms of information extraction (Jensen et al., 2006). NLP has a lower recall or sensitivity than co-occurrence but tends to be more precise than co-occurrence statistical methods (Wren and Garner, 2004; Jensen et al., 2006). Mathematically, precision is the number of true positives divided by the total number of items labeled by the system as positive (number of true positives divided by the sum of true and false positives), whereas recall is the number of true positives identified by the system divided by the number of actual positives (number of true positives divided by the sum of true positives and false negatives).

A number of tools have approached protein-protein interaction extraction from the NLP perspective; these include GENIES (Friedman et al., 2001), MedScan (Novichkova et al., 2003), PreBIND (Donaldson et al., 2003), BioRAT (David et al., 2004), GIS (Chiang et al., 2004), CONAN (Malik et al., 2006), and Muscorian (Ling et al., 2007). Muscorian (Ling et al., 2007) achieved at least 82% precision and 30% recall (sensitivity). NLP methods make use of the grammatical forms of words and the structure of a valid sentence to identify the grammatical role of each word in a sentence, parse the sentence into phrases, and extract information such as subject-verb-object structures from these phrases.
Co-occurrence, a statistical method, is based on the thesis that multiple occurrences of the same pair of
  • 14. entities are related in some way and that the likelihood of such relatedness increases with higher co-occurrence. In other words, co-occurrence methods tend to view the text as a bag of un-sequenced words. Hence, depending on the threshold allowed, which translates to the precision of the entire system, recall could be total, as implied in PubGene (Jenssen et al., 2001).

PubGene (Jenssen et al., 2001) defined interactions by co-occurrence in the simplest and widest possible form by assigning an interaction between 2 proteins if these 2 proteins appear in the same article just once in the entire library of 10 million articles, and found that this criterion has 60% precision (1-Mention PubGene method). Although it was not stated in the article (Jenssen et al., 2001), it is obvious that such a criterion would yield 100% recall or sensitivity, giving an F-score of 0.75. F-score is defined as the harmonic mean of precision and recall, attributing equal weight to both precision and recall. However, 60% precision is usually unsatisfactory for most applications. PubGene (Jenssen et al., 2001) had also defined a "5-Mention" method which requires 5 or more articles with the 2 protein names to assign an interaction, with 72% precision. It is generally accepted that precision and recall are inversely related; hence, it can be expected that the "5-Mention" method will not be 100% sensitive. However, PubGene was benchmarked against the Database of Interacting Proteins and OMIM, making it more difficult to appreciate the statistical basis of the "1-Mention" and "5-Mention" methods as compared to using a hypothesis testing framework as in Chen et al. (2008). In addition, PubGene is unable to extract the nature of interactions, for example, binding or inhibiting interactions. On the other hand, NLP is designed to extract the nature of interactions (Malik et al., 2006; Ling et al., 2007); hence, it can be expected that NLP results may be used to annotate co-occurrence results.

CoPub Mapper used a more sophisticated information measure which took into account the distribution of entity names in the text database (Alako et al., 2005). Although Alako et al. (2005) demonstrated that CoPub Mapper's information measure correlates well with microarray co-expression, the information measure was not used as a decision criterion for deciding which pairs of co-occurrences were positive results (personal communication, Guido Jenster, 2006). This is unlike the 1-Mention PubGene method, where all co-occurrences were taken as positive results, and the 5-Mention PubGene method, which requires at least 5 counts of co-occurrence before attributing the co-occurrence as a positive result. Chen et al. (2008) used chi-square to test co-occurrence statistically to mine disease-drug interactions from clinical notes and published literature.

Another possible way to calculate co-occurrence is a direct use of the Poisson distribution, on the assumption that co-occurrence of 2 protein names is a rare event with respect to the entire library. The Poisson distribution is a discrete distribution similar to the Binomial distribution but is used for rare events, for example, to estimate the probability of accidents on a given stretch of road in a day. The Poisson distribution is easier to use than the Binomial distribution as it only requires the mean and does not require a standard deviation.
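As a concrete illustration of the precision, recall and F-score definitions used above, the short Python sketch below reproduces the 1-Mention PubGene figures quoted in the text (60% precision at an assumed total recall). It is only a worked example of the formulas, not code from the study.

    def f_score(precision, recall):
        """Harmonic mean of precision and recall, weighting both equally."""
        return 2 * precision * recall / (precision + recall)

    # 1-Mention PubGene as described above: 60% precision, assumed 100% recall.
    print(f_score(0.60, 1.00))  # 0.75, the F-score quoted in the text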
Based on PubGene, the statistical assumption of Poisson distribution-based statistics requiring rare events (in this case, that the co-occurrence of 2 protein names in a collection of text is statistically rare) can generally be held (Jenssen et al., 2001). Although a combination of either NLP or co-occurrence with microarray analysis has been used (Li et al., 2007; Gajendran et al., 2007; Hsu et al., 2007), neither method had been used in microarray analysis for advancing lactational biology. This study
  • 15. attempts to examine the relation between the PubGene and Poisson distribution methods of calculating co-occurrence and to explore the use of NLP-based protein-protein interaction extraction results to annotate co-occurrence results. This study also examines the use of co-occurrence analysis on 4 publicly available microarray data sets on the mouse lactation cycle (Master et al., 2002; Clarkson and Watson, 2003; Stein et al., 2004; Rudolph et al., 2007) as a novel hypothesis discovery tool. Master et al. (2002) used 13 microarrays to discover the presence of brown adipose tissue in the mouse mammary fat pad and its role in thermoregulation, Clarkson and Watson (2003) used 24 microarrays and characterized inflammation response genes during involution, Stein et al. (2004) used 51 microarrays and discovered a set of 145 genes that are up-regulated in early involution, of which 49 encoded for immunoglobulins, and Rudolph et al. (2007) used 29 microarrays to study lipid synthesis in the mouse mammary gland following diets of various fat content and found that genes encoding for nutrient transporters into the cell are up-regulated following increased food intake. More importantly, each of the 4 studies independently demonstrated that the cyclical nature of mammary gland development, as observed histologically and biochemically, is reflected at the transcriptome level, suggesting that microarray is a suitable tool to study the regulation of mouse lactation. It should be noted that even though each of these microarray experiments was designed for a different purpose, the principle that co-expressed genes are more functionally correlated than functionally unrelated genes remains, as demonstrated by Reverter et al. (2005).

Our results demonstrate that the 5-mention PubGene method is generally statistically more significant than the 99th percentile of the Poisson distribution method of calculating co-occurrence. Our results showed that 96% of the interactions extracted by NLP methods (Ling et al., 2007) overlapped with the results from the 5-mention PubGene method. However, less than 2% of the microarray correlations were found in the co-occurrence graph extracted by the 1-mention PubGene method. Using co-occurrence results to filter microarray co-expression correlations, we have discovered a potentially novel set of 7 protein-protein interactions that had not been previously described in the literature.

2. Methods

2.1. Microarray Datasets
The 4 microarray datasets are from Master et al. (2002) using the Affymetrix Mouse Chip Mu6500 and FVB mice, Clarkson and Watson (2003) using the Affymetrix U74Av2 chip and C57/BL6 mice, Rudolph et al. (2007) using the Affymetrix U74Av2 chip and FVB mice, and Stein et al. (2004) using the Affymetrix U74Av2 chip and Balb/C mice.

2.2. Co-Occurrence Calculations
Using a pre-defined list of 3653 protein names, which was derived by Ling et al. (2007) from the Affymetrix Mouse Chip Mu6500 microarray probeset, PubGene established 2 measures of binary co-occurrence (Jenssen et al., 2001): the 1-mention method and the 5-mention method. In the 1-mention method, the appearance of 2 entity names in the same abstract is deemed a positive outcome, whereas the 5-mention method requires the appearance of the 2 entity names in at least 5 abstracts before being considered positive.
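The counting step behind the 1-mention and 5-mention criteria in Section 2.2 can be sketched in Python as below. This is a minimal illustration, assuming abstracts are plain strings and that protein names are matched by simple case-insensitive pattern matching as the text describes; the names (abstracts, protein_names) are illustrative only and not taken from the study's code.

    from collections import Counter
    from itertools import combinations

    def cooccurrence_counts(abstracts, protein_names):
        """For every unordered pair of protein names, count the number of
        abstracts in which both names appear (simple pattern matching)."""
        counts = Counter()
        for text in abstracts:
            lowered = text.lower()
            found = sorted({name for name in protein_names if name.lower() in lowered})
            for pair in combinations(found, 2):
                counts[pair] += 1
        return counts

    def pubgene_edges(counts, minimum_mentions=1):
        """1-Mention PubGene keeps pairs seen at least once; 5-Mention PubGene
        keeps pairs seen in at least 5 abstracts."""
        return {frozenset(pair) for pair, n in counts.items() if n >= minimum_mentions}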
  • 16. For co-occurrence modelled on the Poisson distribution (Poisson co-occurrence), the number of abstracts in which both entity names appear is assumed to be rare, as it only requires the appearance of 2 entity names within 5 articles in a collection of 10 million articles to give a precision of 0.72 (Jenssen et al., 2001). The relative occurrence frequency of each of the 2 entities was calculated separately as the quotient of the number of abstracts in which the entity name appeared and the total number of abstracts in the corpus. The product of the relative occurrence frequencies of the 2 entities can be taken as the mean expected probability of the 2 entities appearing in the same abstract if they are not related, which, when multiplied by the total number of abstracts, can be taken as the mean number of occurrences (lambda) of the Poisson distribution. For example, if proteinA and proteinB are found in 1000 abstracts each and there are 1 million abstracts, the relative occurrence frequency will be 0.001 each and the mean number of occurrences will be 1 (0.001^2 x 1000000). This means that we expect 1 abstract in a collection of 1 million to contain both proteinA and proteinB if they are not related (n = 1, p = 0.5). A positive result is where the number of abstracts in which both entities appear is on or above the 95th (one-tail P < 0.05) or 99th (one-tail P < 0.01) percentile of the Poisson distribution. In both co-occurrence calculations, entity (protein) names in text are recognized by pattern matching, as used in Ling et al. (2007).

2.3. Comparing Co-Occurrence and Text Processing
Two sets of comparisons were performed: within the different forms of co-occurrence, and between co-occurrence and text processing methods. The first set of comparisons aims to evaluate the differences between the 3 co-occurrence methods described above. PubGene's 1-mention and 5-mention methods were compared singly and in combination with the Poisson co-occurrence methods. Given that the nodes (N) of a co-occurrence network represent the entities and the links or edges (E) between nodes represent a co-occurrence under the method used, the entire co-occurrence graph is G = {N, E}, that is, a set of nodes and a set of edges. In addition, given that the same set of entities was used (the same set of nodes), the difference between the 2 graphs resulting from 2 co-occurrence methods can be simply denoted as the number of differences between the 2 sets of edges (subtraction of one set of edges from another set of edges). In practice, a total space model is used. A graph of total possible co-occurrence is one where each node is "linked" to, or co-occurs with, every node, including loops (an edge to itself). Thus, a graph of total possible co-occurrence has 3563 nodes and 12694969 (3563^2) edges. We define a graph, G*, as the undirected graph of total possible co-occurrence without parallel edges and excluding loops. G* has 3563 nodes and 6345703 [3563 x (3563 - 1) / 2] edges. The output graph of each co-occurrence method is reduced to the number of edges it contains; as it can be assumed that the graph from the 1-mention PubGene method represents the most liberal co-occurrence graph (GPG1), the resulting graph from any other more sophisticated method (Gi, where i denotes the co-occurrence method) will be a proper subset of GPG1 and certainly of G*.
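A minimal sketch of the Poisson co-occurrence criterion described above, assuming the per-name abstract counts and the corpus size are already known. It uses only the standard library, and the helper names are illustrative rather than taken from the study's code.

    import math

    def poisson_lambda(count_a, count_b, total_abstracts):
        """Expected number of co-occurring abstracts if the two names are unrelated:
        the product of their relative frequencies, scaled by the corpus size."""
        return (count_a / total_abstracts) * (count_b / total_abstracts) * total_abstracts

    def poisson_percentile(lam, percentile=0.99):
        """Smallest k with P(X <= k) >= percentile for X ~ Poisson(lam)."""
        k, cumulative = 0, math.exp(-lam)
        while cumulative < percentile:
            k += 1
            cumulative += math.exp(-lam) * lam ** k / math.factorial(k)
        return k

    # Worked example from the text: two names in 1000 abstracts each, corpus of 1 million.
    lam = poisson_lambda(1000, 1000, 1_000_000)              # 1.0 expected co-occurrence
    observed = 5                                             # illustrative observed count
    is_positive = observed >= poisson_percentile(lam, 0.99)  # on/above the 99th percentile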
  • 17. The second set of comparisons aims at correlating co-occurrence techniques and natural language processing techniques for extracting interactions between two entities, such as two proteins. In this comparison, the protein-protein binding and activation interactions extracted using Muscorian on 860000 published abstracts retrieved with "mouse" as the keyword, as previously described (Ling et al., 2007), were compared against the co-occurrence networks of 1-Mention PubGene and 5-Mention PubGene by graph edge overlapping as described above. Briefly, Muscorian (Ling et al., 2007) normalized protein names within abstracts by converting the names into abbreviations before processing the abbreviated abstracts into a table of subject-verb-objects. Protein-protein interaction extraction was carried out by matching each of the 12694969 (3563^2) pairs of protein names and a verb, namely activate or bind, in the extracted table of subject-verb-objects.

2.4. Mapping Co-Expression Networks onto Text-Mined Networks
A co-expression network was generated from each of the 4 in vivo data sets by pair-wise calculation of Pearson's coefficient on the intensity values across the dataset, where a coefficient of more than 0.75 or less than -0.75 signifies the presence of a co-expression between the pair of signals on the microarray (Reverter et al., 2005). The co-expression network generated from Master et al. (2002) and an intersected co-expression network generated by intersecting all 4 networks were used to map onto the 1-PubGene and NLP-mined networks. For the co-expression network generated from Master et al. (2002), a 0.01 coefficient-unit incremental stepwise mapping to the 1-PubGene co-occurrence network was performed from 0.75 to 1.00 to analyze for an optimal correlation coefficient, in order to derive a set of correlations between genes that are likely not to have been studied before (not found in the 1-PubGene co-occurrence network).

3. Results

3.1. Comparing Co-Occurrence Calculation Methods
Using 3563 transcript names, there is a total of 6345703 possible pairs of interactions - 927648 (14.6%) were found using the 1-Mention PubGene method and 431173 (6.80%) were found using the 5-Mention PubGene method. The Poisson co-occurrence method using either the 95th or the 99th percentile threshold found 927648 co-occurrences, which is the same set as found using the 1-Mention PubGene method. The mean number of co-occurrences, which is used as the mean of the Poisson distribution, is calculated from the product of the probabilities of occurrence of each of the entity names in the database. Using a database of 100 thousand abstracts as an example, if 500 abstracts contained the term "insulin" (500 abstracts in 100 thousand, or 0.5%) and 200 abstracts contained the term "MAP kinase" (200 abstracts in 100 thousand, or 0.2%), then the expected probability of both terms appearing in the same abstract by chance is 0.001% (0.005 x 0.002), which, multiplied by the 100 thousand abstracts, gives a mean number of co-occurrences (lambda in the Poisson distribution) of 1. The range of the mean number of co-occurrences for the 6345703 pairs of entities was from zero to 0.59, with a mean of 0.000031. For example, if the mean is 3.1 x 10^-5, then the probability of an abstract mentioning 2 proteins not related in any functional way is 4.8 x 10^-10, or virtually zero in 6.3 million possible interactions. These results are summarized in Table 1.
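The pairwise correlation step described in Section 2.4 above can be sketched as follows. This assumes an expression matrix with one row per gene and one column per sample, held as a NumPy array; the function and variable names are illustrative, not the study's actual code.

    from itertools import combinations
    import numpy as np

    def coexpression_edges(expression, gene_ids, threshold=0.75):
        """Pairwise Pearson correlation across samples; a pair of genes is kept as a
        co-expression edge when the coefficient is above threshold or below -threshold."""
        corr = np.corrcoef(expression)  # rows are genes, columns are samples
        edges = set()
        for i, j in combinations(range(len(gene_ids)), 2):
            if abs(corr[i, j]) > threshold:
                edges.add(frozenset((gene_ids[i], gene_ids[j])))
        return edges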
  • 18.
Method | Number of Clone-Pairs | % of Full Combination
Full Combination (G*) [1] | 6345703 | 100.00
1-Mention PubGene | 927648 | 14.62
5-Mention PubGene | 431173 | 6.80
Poisson Co-occurrence at 95th percentile [2] | 927648 | 14.62
Poisson Co-occurrence at 99th percentile [2] | 927648 | 14.62

Table 1 - Summary results of co-occurrence using PubGene or the Poisson distribution
[1] The undirected graph of total possible co-occurrence (3563^2) without parallel edges and excluding self edges, which has 3563 nodes and 6345703 [3563 x (3563 - 1) / 2] edges.
[2] Same set as 1-Mention PubGene.

3.2. Comparison of Natural Language Processing and Co-Occurrence
Natural language processing (NLP) techniques were used to extract protein-protein binding interactions and protein-protein activation interactions from almost 860000 abstracts as described in Ling et al. (2007). A total of 9803 unique binding interactions and 11365 unique activation interactions were identified, of which 2958 were both binding and activation interactions. Of the 9803 binding interactions, 9661 interactions concurred with the 1-Mention PubGene method (98.55%) and 9465 interactions with the 5-Mention PubGene method (96.54%). Of the 11365 activation interactions, 11280 interactions and 11111 interactions concurred with the 1-Mention PubGene method (99.25%) and the 5-Mention PubGene method (97.77%) respectively. Hence, of the 927648 interactions found using the 1-Mention PubGene method, 1.04% (n = 9661) were binding interactions and 1.22% (n = 11280) were activation interactions. Furthermore, of the 431173 interactions found using the 5-Mention PubGene method, 2.20% (n = 9465) of the interactions were binding interactions and 2.58% (n = 11111) were activation interactions. Combining binding and activation interactions (n = 18120), 1.96% of the 1-Mention PubGene co-occurrence graph and 3.85% of the 5-Mention PubGene co-occurrence graph were annotated respectively.

3.3. Mapping Co-Expression Networks onto Text-Mined Networks
Using Pearson's correlation coefficient to signify the presence of a co-expression between a pair of spots (genes) on the Master et al. (2002) data set, there are 210283 correlations between -1.00 and -0.75 or between 0.75 and 1.00, of which 2014 (0.96% of correlations) are found in the 1-PubGene co-occurrence network, 342 (0.16% of correlations) are found in the activation network extracted by natural language processing means, and 407 (0.19% of correlations) are found in the binding network extracted by natural language processing means.
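The mapping reported here amounts to intersecting two edge sets: the microarray co-expression network and the literature co-occurrence (or NLP-extracted) network. A minimal sketch, assuming edges are represented as frozensets of gene identifiers as in the earlier sketches:

    def overlap_fraction(coexpression_edges, cooccurrence_edges):
        """Fraction of co-expression edges that also appear in the text-mined network."""
        if not coexpression_edges:
            return 0.0
        return len(coexpression_edges & cooccurrence_edges) / len(coexpression_edges)

    # Illustrative check against the figures quoted above for the Master et al. (2002)
    # network mapped onto 1-Mention PubGene: 2014 of 210283 correlations, about 0.96%.
    print(round(2014 / 210283 * 100, 2))  # 0.96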
  • 19. From incremental correlation mapping with the 1-PubGene network (tabulated in Table 2 and graphed in Figure 1), there is a decline in the number of correlations from 208269 (correlation coefficient of 0.75) to 7 (correlation coefficient of 1.00). The percentage of overlap between co-occurrence and co-expression rose linearly from a correlation coefficient of 0.75 to 0.85 (r = 0.959), while that from a correlation coefficient of 0.86 to 0.92 is less correlated (r = 0.223). The 7 pairs of correlations in the Master et al. (2002) data set with a correlation coefficient of 1.00 are: lactotransferrin (Mm.282359) and solute carrier family 3 (activators of dibasic and neutral amino acid transport), member 2 (Mm.4114); B-cell translocation gene 3 (Mm.2823) and UDP-Gal:betaGlcNAc beta 1,4-galactosyltransferase, polypeptide 1 (Mm.15622); gamma-glutamyltransferase 1 (Mm.4559) and programmed cell death 4 (Mm.1605); FK506 binding protein 11 (Mm.30729) and signal recognition particle 9 (Mm.303071); FK506 binding protein 11 (Mm.30729) and Ras-related protein Rab-18 (Mm.132802); casein gamma (Mm.4908) and casein alpha (Mm.295878); G protein-coupled receptor 83 (Mm.4672) and recombination activating gene 1 activating protein 1 (Mm.17958). The amount of overlap between microarray correlations and 1-mention PubGene co-occurrence increased steadily from 0.96% at a correlation coefficient of 0.75 to 1.057% at a correlation coefficient of 0.87.

Mapping an intersect of the co-expression networks of all 4 in vivo data sets (Master et al., 2002; Clarkson and Watson, 2003; Stein et al., 2004; Rudolph et al., 2007), there are 1140 correlations, of which 14 (1.23%) are found in the 1-PubGene co-occurrence network, none of which corresponds to the interactions found in the activation or binding networks extracted by natural language processing means (Ling et al., 2007).

Figure 1 - Percentage of the correlation network analyzed from Master et al. (2002) found in 1-Mention PubGene co-occurrence (y-axis: percent of correlations found in 1-Mention PubGene; x-axis: minimum correlation coefficient, 0.76 to 0.98).
  • 20.
Minimum Correlation | Number of Correlations in Master et al. (2002) | Number of Correlations found in 1-PubGene | Percentage of Correlations Found
0.75 | 210283 | 2014 | 0.958
0.76 | 207593 | 1983 | 0.964
0.77 | 181383 | 1735 | 0.966
0.78 | 157622 | 1495 | 0.958
0.79 | 136152 | 1316 | 0.976
0.80 | 116775 | 1141 | 0.987
0.81 | 99276 | 970 | 0.987
0.82 | 83802 | 823 | 0.988
0.83 | 70019 | 692 | 0.998
0.84 | 57872 | 575 | 1.004
0.85 | 47453 | 472 | 1.005
0.86 | 38228 | 373 | 0.985
0.87 | 30347 | 314 | 1.046
0.88 | 23740 | 234 | 0.995
0.89 | 18137 | 178 | 0.991
0.90 | 13435 | 138 | 1.038
0.91 | 9797 | 96 | 0.990
0.92 | 6849 | 70 | 1.034
0.93 | 4580 | 40 | 0.881
0.94 | 2919 | 28 | 0.969
0.95 | 1742 | 14 | 0.984
0.96 | 970 | 7 | 0.727
0.97 | 472 | 4 | 0.855
0.98 | 197 | 2 | 1.026
0.99 | 60 | 0 | 0.000
1.00 | 7 | 0 | 0.000

Table 2 - Summary of incremental stepwise mapping of correlation coefficients from Master et al. (2002) to the 1-PubGene co-occurrence network

4. Discussion
Comparing the difference between the PubGene (Jenssen et al., 2001) and Poisson modelling methods for co-occurrence calculations, three observations could be made. Firstly, one of the common criticisms of a simple co-occurrence method as used in this study (co-occurrence of terms without considering the number of words between
  • 21. these terms) is that given a large number of articles or documents, every term will co-occur with every term at least once, leading to total possible co-occurrence (100%, or 12694969 in this case). Our results showed that 7.31% of the total possible co-occurrences were actually found using about 860000 abstracts, and only 3.40% using a more stringent method. PubGene (Jenssen et al., 2001) has also suggested that total possible co-occurrence was not evident with a much larger set of articles (10 million), and yet achieved 60% precision using only one instance of co-occurrence in 10 million articles (1-Mention PubGene) and 72% precision with 5-Mention PubGene. It can be expected that with more instances of co-occurrence, precision may be higher. This might be due to the sparse distribution of entity names in the set of text, as observed from the low mean number of co-occurrences used for the Poisson distribution modelling. At the same time, PubGene (Jenssen et al., 2001) also illustrated that entity name recognition by simple pattern matching is able to yield quality results. Using only results from PubGene (Jenssen et al., 2001), it can be concluded that total possible co-occurrence is unlikely for a corpus size of up to 10 million (more than half of current PubMed). Using the Poisson distribution, the mean number of co-occurrences can be expected to decrease with a larger corpus than used in this study, as it is a product of the relative frequencies of each of the 2 entities. This suggests that as the size of the corpus increases, each co-occurrence of terms is likely to be more significant, suggesting that a statistical measure might be more useful in a very large corpus of more than 10 million articles, as it takes into account both frequencies and corpus size.

Secondly, the Poisson co-occurrence methods at both the 95th and 99th percentiles yield the same set of results as the 1-Mention PubGene method, which is expected as the maximum mean number of co-occurrences is 0.59. This implies that every co-occurrence found is essentially statistically significant in a corpus of about 860000 abstracts, thus providing a statistical basis for the "1-Mention PubGene" method. This might be due to the nature of abstracts, which are known to be concise. Proteins that have no relation to each other are generally unlikely to be mentioned in the same abstract, and abstracts tend to mention only crucial findings. However, the same might not apply if full-text articles are used - unrelated proteins could be mentioned solely for illustrative purposes.

Thirdly, the number of co-occurrences found using the 5-Mention PubGene method is substantially lower (less than half) than that found by the 1-Mention PubGene method, which was also shown in Jenssen et al. (2001). This suggests that 5-Mention PubGene is appreciably more stringent than using Poisson co-occurrence at the 99th percentile, thus providing a statistical basis for the "5-Mention PubGene" method. Our results comparing the numbers of co-occurrences demonstrated a 50.79% decrease in co-occurrences from the 1-Mention PubGene network to the 5-Mention PubGene network. However, the 5-Mention PubGene network retained most of the "activation" (98.5%) and "binding" (98.0%) interactions found in the 1-Mention PubGene network. This might be a consequence of the 30% recall of the NLP methods (Ling et al., 2007), as it would usually require 3 or more mentions for an interaction to have a reasonable chance of being identified by NLP methods.
This might also be due to the observation that the 5-Mention PubGene method is more precise, in terms of accuracy, than the 1-PubGene method, as shown in Jenssen et al. (2001).
  • 22. The probability of a true interaction (Ling et al., 2007) existing in each of the 9661 NLP-extracted binding interactions that are also found in 1-Mention PubGene co-occurrence would be raised, and the probability of a true interaction existing in each of the 9465 NLP-extracted binding interactions that are also found in 5-Mention PubGene co-occurrence would be higher still. Hence, combining NLP and statistical co-occurrence techniques can improve the overall confidence of finding true interactions. However, it should be noted that the statistical co-occurrence used in this work cannot raise the confidence of NLP-extracted interactions. Nevertheless, these results also suggest that graphs of statistical co-occurrence could be annotated with information from NLP methods to indicate the nature of such interactions. In this study, 2 types of NLP-extracted interactions from Ling et al. (2007), "binding" and "activation", were combined. The combined "binding" and "activation" network covered 1.96% and 3.85% of the 1-Mention and 5-Mention PubGene co-occurrence graphs respectively. Our results demonstrate that the combined network has a higher coverage than the individual "binding" or "activation" networks. Thus, it is reasonable to expect that with more forms of interactions, such as degradation and phosphorylation, extracted with the same NLP techniques, the co-occurrence graph annotation would be more complete.

By overlapping the co-expression network analyzed from the Master et al. (2002) data set with the 1-Mention PubGene co-occurrence network, our results demonstrated that about 99% of the co-expressions were not found in the co-occurrence network. This might suggest that the Pearson's correlation coefficient thresholds of more than 0.75 and less than -0.75, as suggested by Reverter et al. (2005), are likely to be sensitive in isolating functionally related genes from microarray data at the cost of reduced specificity. Our results from the incremental stepwise analysis showed that the percentage of overlap between co-expression and co-occurrence rose linearly for correlation coefficients from 0.75 to 0.85. This suggests that a correlation coefficient of 0.85 may be optimal for this data set, as using a correlation coefficient of 0.85 is likely to result in fewer false positives than a correlation coefficient of 0.75. At the same time, increasing the correlation coefficient from 0.75 to 0.85 resulted in 77.4% fewer interaction correlations (47453 correlations from 210283). Using this method to further describe protein-protein interactions and to generate new hypotheses, it can be argued that a correlation coefficient of 0.85 will result in fewer false positives. While this deduction is likely, as a more stringent criterion tends to reduce the rate of false positives, it is difficult to prove experimentally without exhaustive examination of each result. Nevertheless, the results suggest the possibility of using the inverse linearity between the correlation coefficient and the number of gene co-expressions as a preliminary visual assessment to gauge an optimal correlation coefficient to use for a particular data set. However, at the extreme end, correlation coefficients of 0.99 and 1.00 yielded 60 and 7 correlations respectively in the Master et al. (2002) data set, but none was found in the 1-Mention PubGene co-occurrence network. This suggests that high-throughput genomic techniques such as microarrays present a vast amount of un-mined biological information that has not been examined experimentally.
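Annotating co-occurrence edges with NLP-derived interaction types, as discussed above, reduces to attaching labels to the edges of the co-occurrence graph. A minimal sketch, using the same frozenset edge convention as the earlier examples; the input format assumed for the NLP results is illustrative, not the study's actual data structure.

    def annotate_cooccurrence(cooccurrence_edges, nlp_interactions):
        """Attach NLP-extracted interaction types (e.g. 'binding', 'activation') to
        co-occurrence edges; edges without NLP evidence keep an empty label set."""
        annotation = {edge: set() for edge in cooccurrence_edges}
        for edge, interaction_type in nlp_interactions:
            # nlp_interactions is assumed to be pairs like
            # (frozenset({'proteinA', 'proteinB'}), 'binding')
            if edge in annotation:
                annotation[edge].add(interaction_type)
        return annotation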
By exploring the literature for the biological significance of each of the 7 pairs of perfectly co-expressed genes using Swanson's method (Swanson, 1990), it was found
that all 7 pairs were biologically significant. Lactotransferrin (Ishii et al., 2007) and solute carrier family 3 (activators of dibasic and neutral amino acid transport), member 2 (Feral et al., 2005) were involved in cell adhesion. B-cell translocation gene 3 (Guehenneux et al., 1997) and UDP-Gal:betaGlcNAc beta 1,4-galactosyltransferase, polypeptide 1 (Mori et al., 2004) were involved in cell cycle control. Casein gamma and casein alpha are well-established components of milk. Gamma-glutamyltransferase 1 (Huseby et al., 2003) and programmed cell death 4 (Frankel et al., 2008) were known to regulate apoptotic pathways. Rab18 (Vazquez-Martinez et al., 2007), signal recognition particle 9 (Egea et al., 2004) and FK506 binding protein 11 (Dybkaer et al., 2007) were known to be involved in the secretory pathway. G protein-coupled receptor 83 (Lu et al., 2007) and recombination activating gene 1 activating protein 1 (Igarashi et al., 2001) were known to be involved in T-cell function. Taken together, these findings suggest that the set of 7 correlations has not previously been described and may prove to be a source of valuable new hypotheses in the study of mouse mammary physiology. It is also plausible that this argument can be extended to the set of 53 highly co-expressed gene pairs (0.99 < correlation coefficient < 1.00).

Intersecting the 4 in vivo data sets into a co-expression network increases the power of the analysis, as it retains only correlations among gene expressions that are greater than 0.75 or less than -0.75 in all 4 data sets. There were 1140 examples of co-expression in this intersect and only 14 co-expressions (1.23%) were found in the 1-Mention PubGene co-occurrence network, but none in either the binding or activation networks extracted by natural language processing. This suggests that these 14 co-expressions are neither binding nor activating interactions. Textpresso (Muller et al., 2004) had defined a total of 36 types of molecular association between two proteins, which include binding and activation. Future work will expand NLP mining to the 34 other interaction types to improve the annotation of co-occurrence networks.

Reverter et al. (2005) had previously analysed 5 microarray data sets by expression correlation and demonstrated that genes of related functions exhibit similar expression profiles across different experimental conditions. Our results suggest that 1126 co-expressed gene pairs across the 4 microarray data sets are not found in the co-occurrence network. This may be a valuable new set of information in the study of mouse mammary physiology, as these pairs of genes have not been previously mentioned in the same publication; experimental examination of these potential interactions is needed to understand the biological significance of these co-expressions.

5. Conclusions

We conclude that the 5-Mention PubGene method is more stringent than the 99th percentile of Poisson distribution method. In this study, we demonstrate the use of a liberal co-occurrence-based literature analysis (1-Mention PubGene method) to represent the state of research knowledge in functional protein-protein interactions, as a sieve to isolate potentially novel hypotheses from microarray co-expression analyses for further research.
Authors' contributions

ML, CL and KRN contributed equally to the design of experiments and the analysis of results. ML carried out the experiments.

References

1. Alako BT, Veldhoven A, van Baal S, Jelier R, Verhoeven S, Rullmann T, Polman J, Jenster G: CoPub Mapper: mining MEDLINE based on search term co-publication. BMC Bioinformatics 2005, 6(1):51.
2. Beissbarth T: Interpreting experimental results using gene ontologies. Methods in Enzymology 2006, 411:340-352.
3. Chen ES, Hripcsak G, Xu H, Markatou M, Friedman C: Automated Acquisition of Disease Drug Knowledge from Biomedical and Clinical Documents: An Initial Study. Journal of the American Medical Informatics Association 2008, 15(1):87-98.
4. Chiang J-H, Yu H-C, Hsu H-J: GIS: a biomedical text-mining system for gene information discovery. Bioinformatics 2004, 20(1):120.
5. Clarkson RWE, Watson CJ: Microarray analysis of the involution switch. Journal of Mammary Gland Biology and Neoplasia 2003, 8(3):309-319.
6. Cohen AM, Hersh WR: A survey of current work in biomedical text mining. Briefings in Bioinformatics 2005, 6(1):57-71.
7. David PAC, Bernard FB, William BL, David TJ: BioRAT: extracting biological information from full-length papers. Bioinformatics 2004, 20(17):3206.
8. Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, Tuekam B, Zhang S, Baskin B, Bader GD, Michalickova K et al: PreBIND and Textomy--mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 2003, 4:11.
9. Dybkaer K, Iqbal J, Zhou G, Geng H, Xiao L, Schmitz A, d'Amore F, Chan WC: Genome wide transcriptional analysis of resting and IL2 activated human natural killer cells: gene expression signatures indicative of novel molecular signaling pathways. BMC Genomics 2007, 8:230.
10. Egea PF, Shan SO, Napetschnig J, Savage DF, Walter P, Stroud RM: Substrate twinning activates the signal recognition particle and its receptor. Nature 2004, 427(6971):215-221.
11. Feral CC, Nishiya N, Fenczik CA, Stuhlmann H, Slepak M, Ginsberg MH: CD98hc (SLC3A2) mediates integrin signaling. Proceedings of the National Academy of Sciences USA 2005, 102(2):355-360.
12. Frankel LB, Christoffersen NR, Jacobsen A, Lindow M, Krogh A, Lund AH: Programmed cell death 4 (PDCD4) is an important functional target of the microRNA miR-21 in breast cancer cells. Journal of Biological Chemistry 2008, 283(2):1026-1033.
13. Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A: GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 2001, 17(Suppl. 1):S74-S82.
14. Gajendran VK, Lin JR, Fyhrie DP: An application of bioinformatics and text mining to the discovery of novel genes related to bone biology. Bone 2007, 40(5):1378-1388.
15. Guehenneux F, Duret L, Callanan MB, Bouhas R, Hayette S, Berthet C, Samarut C, Rimokh R, Birot AM, Wang Q et al: Cloning of the mouse BTG3 gene and definition of a new gene family (the BTG family) involved in the negative control of the cell cycle. Leukemia 1997, 11(3):370-375.
16. Hsu CN, Lai JM, Liu CH, Tseng HH, Lin CY, Lin KT, Yeh HH, Sung TY, Hsu WL, Su LJ et al: Detection of the inferred interaction network in hepatocellular carcinoma from EHCO (Encyclopedia of Hepatocellular Carcinoma genes Online). BMC Bioinformatics 2007, 8:66.
17. Hunter L, Cohen KB: Biomedical language processing: what's beyond PubMed? Molecular Cell 2006, 21(5):589-594.
18. Huseby NE, Asare N, Wetting S, Mikkelsen IM, Mortensen B, Sveinbjornsson B, Wellman M: Nitric oxide exposure of CC531 rat colon carcinoma cells induces gamma-glutamyltransferase which may counteract glutathione depletion and cell death. Free Radical Research 2003, 37(1):99-107.
19. Igarashi H, Kuwata N, Kiyota K, Sumita K, Suda T, Ono S, Bauer SR, Sakaguchi N: Localization of recombination activating gene 1/green fluorescent protein (RAG1/GFP) expression in secondary lymphoid organs after immunization with T-dependent antigens in rag1/gfp knockin mice. Blood 2001, 97(9):2680-2687.
20. Ishii T, Ishimori H, Mori K, Uto T, Fukuda K, Urashima T, Nishimura M: Bovine lactoferrin stimulates anchorage-independent cell growth via membrane-associated chondroitin sulfate and heparan sulfate proteoglycans in PC12 cells. Journal of Pharmacological Sciences 2007, 104(4):366-373.
21. Jensen LJ, Saric J, Bork P: Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews Genetics 2006, 7(2):119-129.
22. Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics 2001, 28(1):21-28.
23. Li X, Chen H, Huang Z, Su H, Martinez JD: Global mapping of gene/protein interactions in PubMed abstracts: A framework and an experiment with P53 interactions. Journal of Biomedical Informatics 2007.
24. Ling MH, Lefevre C, Nicholas KR, Lin F: Reconstruction of Protein-Protein Interaction Pathways by Mining Subject-Verb-Objects Intermediates. In: Second IAPR Workshop on Pattern Recognition in Bioinformatics (PRIB 2007). Singapore: Springer-Verlag; 2007.
25. Lu LF, Gavin MA, Rasmussen JP, Rudensky AY: G protein-coupled receptor 83 is dispensable for the development and function of regulatory T cells. Molecular and Cellular Biology 2007, 27(23):8065-8072.
26. Malik R, Franke L, Siebes A: Combination of text-mining algorithms increases the performance. Bioinformatics 2006, 22(17):2151-2157.
27. Master SR, Hartman JL, D'Cruz CM, Moody SE, Keiper EA, Ha SI, Cox JD, Belka GK, Chodosh LA: Functional microarray analysis of mammary organogenesis reveals a developmental role in adaptive thermogenesis. Molecular Endocrinology 2002, 16(6):1185-1203.
28. Maraziotis IA, Dragomir A, Bezerianos A: Gene networks reconstruction and time-series prediction from microarray data using recurrent neural fuzzy networks. IET Systems Biology 2007, 1(1):41-50.
29. Mori R, Kondo T, Nishie T, Ohshima T, Asano M: Impairment of skin wound healing in beta-1,4-galactosyltransferase-deficient mice with reduced leukocyte recruitment. American Journal of Pathology 2004, 164(4):1303-1314.
30. Muller HM, Kenny EE, Sternberg PW: Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biology 2004, 2(11):e309.
31. Novichkova S, Egorov S, Daraselia N: MedScan, a natural language processing engine for MEDLINE abstracts. Bioinformatics 2003, 19:1699-1706.
32. O'Driscoll L, McMorrow J, Doolan P, McKiernan E, Mehta JP, Ryan E, Gammell P, Joyce H, O'Donovan N, Walsh N et al: Investigation of the molecular profile of basal cell carcinoma using whole genome microarrays. Molecular Cancer 2006, 5:74.
33. Rawool SB, Venkatesh KV: Steady state approach to model gene regulatory networks - Simulation of microarray experiments. Biosystems 2007.
34. Reverter A, Barris W, Moreno-Sanchez N, McWilliam S, Wang YH, Harper GS, Lehnert SA, Dalrymple BP: Construction of gene interaction and regulatory networks in bovine skeletal muscle from expression data. Australian Journal of Experimental Agriculture 2005, 45:821-829.
35. Rudolph MC, McManaman JL, Phang T, Russell T, Kominsky DJ, Serkova NJ, Stein T, Anderson SM, Neville MC: Metabolic regulation in the lactating mammary gland: a lipid synthesizing machine. Physiological Genomics 2007, 28:323-336.
36. Stein T, Morris J, Davies C, Weber-Hall S, Duffy M-A, Heath V, Bell A, Ferrier R, Sandilands G, Gusterson B: Involution of the mouse mammary gland is associated with an immune cascade and an acute-phase response, involving LBP, CD14 and STAT3. Breast Cancer Research 2004, 6(2):R75-R91.
37. Swanson DR: Medical literature as a potential source of new knowledge. Bulletin of the Medical Library Association 1990, 78(1):29-37.
38. Vazquez-Martinez R, Cruz-Garcia D, Duran-Prado M, Peinado JR, Castano JP, Malagon MM: Rab18 inhibits secretory activity in neuroendocrine cells by interacting with secretory granules. Traffic 2007, 8(7):867-882.
39. Wren JD, Garner HR: Shared relationship analysis: ranking set cohesion and commonalities within a literature-derived relationship network. Bioinformatics 2004, 20(2):191-198.
Appendix A – Use of Python in this work

Python was used throughout this study; the code was incorporated into Muscorian (Ling et al., 2007). The following code snippets demonstrate the calculation of the Poisson distribution and the intersection of the Master et al. (2002) and 1-Mention PubGene results, as shown in Figure 1 and Table 2. Given that muscopedia.dbcursor is the database cursor and the pmc_abstract table contains the abstracts, the Poisson distribution model for each pair of entity (gene or protein) names is constructed by the method commandJobCloneOccurrencePoisson:

    import math

    class Poisson:
        mean = 0.0

        def __init__(self, lamb=0.0):
            self.mean = lamb

        def factorial(self, m):
            value = 1
            if m != 0:
                while m != 1:
                    value = value * m
                    m = m - 1
            return value

        def PDF(self, x):
            # Poisson probability mass function: exp(-mean) * mean^x / x!
            return math.exp(-self.mean) * pow(self.mean, x) / self.factorial(x)

        def inverseCDF(self, prob):
            # Smallest x whose cumulative probability reaches the given level.
            cprob = 0.0
            x = 0
            while cprob < prob:
                cprob = cprob + self.PDF(x)
                x = x + 1
            return (x, cprob)

    def commandJobCloneOccurrencePoisson(self):
        poisson = Poisson()
        self.muscopedia.dbcursor.execute('select count(pmid) from pmc_abstract')
        abstractcount = float(self.muscopedia.dbcursor.fetchall()[0][0])
        self.muscopedia.dbcursor.execute('select jclone, occurrence from jclone_occurrence')
        dataset = [[clone[0].strip(), clone[1]]
                   for clone in self.muscopedia.dbcursor.fetchall()]
        self.muscopedia.dbcursor.execute("delete from jclone_occur_stat")
        count = 0
        for subj in dataset:
            for obj in dataset:
                # Expected co-occurrence of the two names under independence.
                mean = (float(subj[1]) / abstractcount) * (float(obj[1]) / abstractcount)
                poisson.mean = mean
                (poi95, prob) = poisson.inverseCDF(0.95)
                (poi99, prob) = poisson.inverseCDF(0.99)
                count = count + 1
                sqlstmt = "insert into jclone_occur_stat (clone1, clone2, " \
                          "randomoccur, poisson95, poisson99) values " \
                          "('%s','%s','%.6f','%s','%s')" % \
                          (str(subj[0]), str(obj[0]), mean, str(poi95), str(poi99))
                try:
                    self.muscopedia.dbcursor.execute(sqlstmt)
                except IOError:
                    pass
                if (count % 1000) == 0:
                    self.muscopedia.dbconnect.commit()

Each pair of entities was searched in each abstract using SQL statements, such as "select count(pmid) from pmc_abstract where text containing 'insulin' and 'MAPK'", and the number of abstracts found was matched against the jclone_occur_stat table for statistical significance based on the calculated Poisson distribution. The results were exported from muscopedia (Muscorian's database) as a tab-delimited file and analyzed using the following code to generate Table 2:

    import sets

    # Read the tab-delimited correlation export: gene1 <tab> gene2 <tab> correlation.
    lc = open('lc_cor.csv', 'r').readlines()
    lc = [x[:-1] for x in lc]
    lc = [x.split('\t') for x in lc]

    # Keep each gene pair only once, regardless of the order of the two names.
    d = {}
    for x in lc:
        try:
            t = d[(x[1], x[0])]
        except KeyError:
            d[(x[0], x[1])] = float(x[2])
    lc = [(x[0], x[1], d[x]) for x in d]
    l = [(x[0], x[1]) for x in d]
    l = sets.Set(l)

    def process_sif(file):
        # Parse a SIF file of 'gene1 <tab> pp <tab> gene2' lines into gene pairs.
        a = open(file, 'r').readlines()
        a = [x[:-1] for x in a]
        a = [x.split('\tpp\t') for x in a]
        return [(x[0], x[1]) for x in a]

    a = sets.Set(process_sif('pubgene1.sif'))
    print "# intersect of pubgene1.sif and LC data: " + str(len(l.intersection(a)))
    print "# LC data not in pubgene1.sif: " + str(len(l.difference(a)))
    print "# pubgene1.sif not in LC data: " + str(len(a.difference(l)))
    print ""

    # Repeat the comparison for stepwise correlation coefficient thresholds.
    cor = 0.74
    while (cor < 1.0):
        t = [(x[0], x[1]) for x in lc if x[2] > cor]
        l = sets.Set(t)
        cor = cor + 0.01
        print "LC correlation: " + str(cor)
        print "# intersect of pubgene1.sif and LC data: " + str(len(l.intersection(a)))
        print "# LC data not in pubgene1.sif: " + str(len(l.difference(a)))
        print "# pubgene1.sif not in LC data: " + str(len(a.difference(l)))
        print ""
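For illustration, the Poisson class above can be exercised on its own. The counts in the following short example are hypothetical; they merely mirror the way commandJobCloneOccurrencePoisson derives the mean from the individual occurrence counts and the total number of abstracts.

    # Hypothetical counts: two entity names occurring in 1200 and 800
    # of 261000 abstracts; the mean follows the same formula as above.
    abstractcount = 261000.0
    mean = (1200 / abstractcount) * (800 / abstractcount)

    poisson = Poisson(mean)
    (poi95, prob95) = poisson.inverseCDF(0.95)
    (poi99, prob99) = poisson.inverseCDF(0.99)
    print "expected random co-occurrence:", mean
    print "95th percentile threshold:", poi95
    print "99th percentile threshold:", poi99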
Appendix B – PubGene algorithm and its main results

The PubGene algorithm (Jenssen et al., 2001) is a count-based algorithm which simply counts the number of abstracts mentioning both entity names. Using "insulin" and "MAPK" as the pair of entities, the PubGene algorithm can be implemented using the following SQL: "select count(pmid), 'insulin', 'MAPK' from pmc_abstract where text containing 'insulin' and text containing 'MAPK'". 1-Mention PubGene and 5-Mention PubGene results can be isolated by filtering for count(pmid) greater than zero and greater than four respectively. Jenssen et al. (2001) demonstrated that the precision of 1-Mention co-occurrence is 60% while that of 5-Mention co-occurrence is 72%.
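As an illustration only, the sketch below wraps this counting step in two small functions. The pmc_abstract table, the pmid and text columns and the 'containing' SQL dialect are taken from the statement above; the function names and the cursor argument are hypothetical.

    def pubgene_count(dbcursor, name1, name2):
        # Count abstracts that mention both entity names.
        dbcursor.execute("select count(pmid) from pmc_abstract where "
                         "text containing '%s' and text containing '%s'"
                         % (name1, name2))
        return int(dbcursor.fetchall()[0][0])

    def pubgene_class(dbcursor, name1, name2):
        # Classify a pair as 1-Mention and/or 5-Mention PubGene co-occurrence.
        n = pubgene_count(dbcursor, name1, name2)
        one_mention = n > 0     # at least one co-mentioning abstract
        five_mention = n > 4    # at least five co-mentioning abstracts
        return (n, one_mention, five_mention)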
The Python Papers, Vol. 3, No. 3 (2008)
Available online at http://ojs.pythonpapers.org/index.php/tpp/issue/view/10

Automatic C Library Wrapping - Ctypes from the Trenches

Guy K. Kloss
Computer Science, Institute of Information and Mathematical Sciences
Massey University at Albany, Auckland, New Zealand
Email: G.Kloss@massey.ac.nz

At some point in time many Python developers, at least in computational science, will face the situation that they want to interface some natively compiled library from Python. By now, a large variety of tools and technologies for binding native code to Python is available. This paper focuses on wrapping shared C libraries, using Python's default Ctypes. Particular attention is given to tools that ease the process (by using code generation) and to some best practices. The paper tells a step-by-step story of the wrapping and development process that should be transferable to similar problems.

Keywords: Python, Ctypes, wrapping, automation, code generation.

1 Introduction

One of the grand fundamentals of software engineering is to use the tools that are best suited for a job, and not to prematurely decide on an implementation. That is often easier said than done, in the light of competing requirements (e.g. rapid/easy implementation vs. needed speed of execution or vs. low level access to hardware). The traditional way [1] of binding native code to Python through extending or embedding is quite tedious and requires a lot of manual coding in C. This paper presents an approach using the Ctypes package [2], which is part of Python by default since version 2.5. As an example the creation of a wrapper for the Little CMS colour management library [3] is outlined. The library offers excellent features and ships with official Python bindings (using SWIG [4]), but unfortunately with several shortcomings (incompleteness, un-Pythonic API, complex to use, etc.). So out of need and frustration the initial steps towards alternative Python bindings were undertaken. An alternative would be to fix or improve the bindings using SWIG, or to use one of a variety of binding tools. The field has been limited to tools that are widely in use today within the community, and that are promising to be future proof as
well as not overly complicated to use. These are the contestants, with (very brief) notes on the use cases that suit their particular strengths:

• Use Ctypes [2], if you want to wrap pure C code very easily.
• Use Boost.Python [5, 6], if you want to create a more complete API for C++ that also reflects the object oriented nature of your native code, including inheritance into Python, etc.
• Use Cython [7], if you want to easily speed up and migrate code from Python to speedier native code (mixing is possible!).
• Use SWIG [4], if you want to wrap your code against several dynamic languages.

Of course, wrapper code can also be written manually, in this case directly using Ctypes (a minimal illustration is given at the end of this section). This paper does not provide a tutorial on how Ctypes is used. The reader should be familiar with this package when attempting to undertake serious library wrapping; the Ctypes tutorial and Ctypes reference on the project web site [2] are an excellent starting point. For extensive libraries and robustness towards an evolving API, code generation proved to be a better approach than manual editing. Code generators exist for Boost.Python as well as for Ctypes to ease the process of wrapping: Py++ [8] (for Boost.Python) and CtypesLib's h2xml.py and xml2py.py [2]. Three main reasons have influenced the decision to approach this project using Ctypes:

• Ubiquity of the binding approach, as Ctypes is part of the default distribution.
• No compilation of native code to libraries is necessary. Additionally, this relieves one from installing a number of development tools, and the library wrapper can be approached in a platform independent way.
• The availability of a code generator to automate large portions of the wrapper implementation process, for ease and for robustness against changes.

The next section of this paper first introduces a simple C example. This example is later migrated to Python code through the various incarnations of the Python wrapper throughout the paper. Sect. 3 introduces how to make the C library accessible from Python, in this case through code generation. Sect. 4 explains how to refine the generated code to meet the desired functionality of the wrapper. The library is anything but Pythonic, so Sect. 5 explains an object oriented Façade API for the library that features the qualities we love. This paper only outlines some interesting fundamentals of the wrapper building process. Please refer to the source code for more precise details [9].
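As a point of reference for the manual approach mentioned above, here is a minimal sketch that calls a function from the standard C maths library directly through Ctypes. It is unrelated to LittleCMS and only illustrates the per-function bookkeeping that the code generation described in Sect. 3 automates; it assumes a platform on which find_library can locate the C maths library.

    import ctypes
    from ctypes.util import find_library

    # Load the standard C maths library via the system's default search mechanism.
    libm = ctypes.cdll.LoadLibrary(find_library('m'))

    # Declare the signature of cos() by hand.
    libm.cos.argtypes = [ctypes.c_double]
    libm.cos.restype = ctypes.c_double

    print libm.cos(0.0)    # prints 1.0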
2 The Example

The sample code (listing in Fig. 1) aims to convert image data from device dependent colour information to a standardised colour space. The input profile results from a device specific characterisation of a Hewlett Packard ScanJet (in the ICC profile HPSJTW.ICM). The output is in the standards conformant sRGB colour space as it is used for the majority of computer displays. For this a built-in profile from LittleCMS is used.

Input and output are characterised through so called ICC profiles. For the input profile the characterisation is read from a file (line 8), and a built-in output profile is used (line 9). The transformation object is set up using the profiles (lines 11-13), specifying the colour encoding of the in- and output as well as some further parameters not worth discussing here. In the for loop (lines 15-21) the image data is transformed line by line, operating on the number of pixels used per line (necessary as array rows are often padded). The goal is to provide a suitable and easy to use API to perform the same task in Python.

3 Code Generation

Wrapping C data types, functions, constants, etc. with Ctypes is not particularly difficult. The tutorial, project web site and documentation on the wiki introduce this concept quite well. But in the presence of an existing larger library, manual wrapping can be tedious and error prone, as well as hard to keep consistent with the library in case of changes. This is especially true when the library is maintained by someone else. Therefore, it is advisable to generate the wrapper code. Thomas Heller, the author of Ctypes, has implemented a corresponding project, CtypesLib, that includes tools for code generation. The tool chain consists of two parts: the parser (for header files) and the code generator.

3.1 Parsing the Header File

The C header files are parsed by the tool h2xml. In the background it uses GCCXML, a GCC-based compiler that parses the code and generates an XML tree representation. Therefore, usually the same compiler that builds the binary of the library can be used to analyse the sources for the code generation. Alternative parsers often have problems determining a 100 % proper interpretation of the code. This is particularly true in the case of C code containing pre-processor macros, which can produce massively complex constructs.
 1 #include "lcms.h"

 3 int correctColour(void) {
 4     cmsHPROFILE inProfile, outProfile;
 5     cmsHTRANSFORM myTransform;
 6     int i;

 8     inProfile = cmsOpenProfileFromFile("HPSJTW.ICM", "r");
 9     outProfile = cmsCreate_sRGBProfile();

11     myTransform = cmsCreateTransform(inProfile, TYPE_RGB_8,
12                                      outProfile, TYPE_RGB_8,
13                                      INTENT_PERCEPTUAL, 0);

15     for (i = 0; i < scanLines; i++) {
16         /* Skipped pointer handling of buffers. */
17         cmsDoTransform(myTransform,
18                        pointerToYourInBuffer,
19                        pointerToYourOutBuffer,
20                        numberOfPixelsPerScanLine);
21     }

23     cmsDeleteTransform(myTransform);
24     cmsCloseProfile(inProfile);
25     cmsCloseProfile(outProfile);

27     return 0;
28 }

Figure 1: Example in C using the LittleCMS library directly.

3.2 Generating the Wrapper

In the next stage the parse tree in XML format is taken to generate the binding code in Python using Ctypes. This task is performed by the xml2py tool. The generator can be configured in its actions by means of switches passed to it. Of particular interest here are the -k and the -r switches. The former defines the kinds of types to include in the output. In this case the #defines, functions, structure and union definitions are of interest, yielding -kdfs (dependencies are resolved automatically). The -r switch takes a regular expression the generator uses to identify symbols to generate code for. The full argument list is shown in the listing in Fig. 2 (lines 11-15). The generated code is written to a Python module, in this case _lcms. It is made private by convention (leading underscore) to indicate that it is not to be used or modified directly.
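To give a feel for the result, the fragment below imitates the style of declarations that xml2py emits for a single function and a single constant. It is illustrative only: the structure name is hypothetical, the library path matches Fig. 3, and the snippet assumes liblcms is actually installed at that location.

    # Illustrative only: the style of code xml2py generates into _lcms.
    from ctypes import CDLL, POINTER, Structure, c_char_p

    _libraries = {}
    _libraries['/usr/lib/liblcms.so.1'] = CDLL('/usr/lib/liblcms.so.1')

    STRING = c_char_p

    class icProfile(Structure):        # hypothetical opaque structure
        pass
    cmsHPROFILE = POINTER(icProfile)

    cmsOpenProfileFromFile = _libraries['/usr/lib/liblcms.so.1'].cmsOpenProfileFromFile
    cmsOpenProfileFromFile.restype = cmsHPROFILE
    cmsOpenProfileFromFile.argtypes = [STRING, STRING]

    INTENT_PERCEPTUAL = 0              # a wrapped #define constant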
3.3 Automating the Generator

Both h2xml and xml2py are Python scripts. Therefore, the generation process can be automated in a simple generator script. This makes all steps reproducible, documents the used settings, and makes the process robust towards evolutionary (smaller) changes in the C API. A largely simplified version is in the listing of Fig. 2.

 1 # Skipped declaration of paths.
 2 HEADER_FILE = 'lcms.h'
 3 header_basename = os.path.splitext(HEADER_FILE)[0]

 5 h2xml.main(['h2xml.py', header_path,
 6             '-c',
 7             '-o',
 8             '%s.xml' % header_basename])

10 SYMBOLS = ['cms.*', 'TYPE_.*', 'PT_.*', 'ic.*', 'LPcms.*', ...]
11 xml2py.main(['xml2py.py', '-kdfs',
12              '-l%s' % library_path,
13              '-o', module_path,
14              '-r%s' % '|'.join(SYMBOLS),
15              '%s.xml' % header_basename])

Figure 2: Essential parts of the code generator script.

Generated code should never be edited manually. As some modification will be necessary to achieve the desired functionality (see Sect. 4), automation becomes essential to yield reproducible results. Due to some shortcomings of the generated code (see Sect. 4), however, some editing was necessary. This modification has also been integrated into the generator script to fully remove the need for manual editing.

4 Refining the C API

In the current version of Ctypes in Python 2.5 it is not possible to add e.g. __repr__() or __str__() methods to data types. Also, code for loading the shared library in a platform independent way needs to be patched into the generated code. A function in the code generator reads the whole generated module _lcms and writes it back to the file system, in the course replacing three lines at the beginning of the file with the code snippet from the listing in Fig. 3. _setup (listing in Fig. 4) monkey patches the class ctypes.Structure to include a __repr__() method (lines 4-10) for ease of use when representing wrapped objects for output. (A monkey patch is a way to extend or modify the runtime code of dynamic languages without altering the original source code: http://en.wikipedia.org/wiki/Monkey_patch.) Furthermore, the loading of the shared library (DLL in Windows lingo) is abstracted to work in a platform independent way using the system's default search mechanism (lines 12-13).
 1 from _setup import *
 2 import _setup

 4 _libraries = {}
 5 _libraries['/usr/lib/liblcms.so.1'] = _setup._init()

Figure 3: Lines to be patched into the generated module _lcms.

 1 import ctypes
 2 from ctypes.util import find_library

 4 class Structure(ctypes.Structure):
 5     def __repr__(self):
 6         """Print fields of the object."""
 7         res = []
 8         for field in self._fields_:
 9             res.append('%s=%s' % (field[0], repr(getattr(self, field[0]))))
10         return '%s(%s)' % (self.__class__.__name__, ', '.join(res))

12 def _init():
13     return ctypes.cdll.LoadLibrary(find_library('lcms'))

Figure 4: Extract from module _setup.py.

4.1 Creating the Basic Wrapper

Further modifications are less invasive. For this, the C API is refined into a module c_lcms. This module imports everything from the generated _lcms and overrides or adds certain functionality individually, again through monkey patching (one such override is sketched below). These additions are intended to make the C API a little easier to use through some helper functions, but mainly to make the new bindings more compatible with and similar to the official SWIG bindings (packaged together with LittleCMS). The wrapped C API can be used from Python (see Sect. 4.2). It still requires, however, explicit closing, freeing or deleting of resources after use, and c_lcms objects/structures do not feature methods for operations. This shortcoming will be solved later.
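As an illustration of the kind of override referred to above, the following snippet shows how c_lcms could re-export the generated API and patch in one small helper. The helper name and its error handling policy are assumptions made for this sketch, not the actual contents of the c_lcms module.

    # Sketch of a c_lcms style override (hypothetical helper).
    from _lcms import *      # re-export the generated bindings
    import _lcms

    def openProfile(fileName, mode='r'):
        # Open an ICC profile and fail loudly instead of returning a NULL handle.
        profile = _lcms.cmsOpenProfileFromFile(fileName, mode)
        if not profile:
            raise IOError('Could not open ICC profile: %r' % fileName)
        return profile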
4.2 c_lcms Example

The wrapped raw C API in Python behaves in exactly the same way as in C; it is just implemented in Python syntax (listing in Fig. 5).

 1 from c_lcms import *

 3 def correctColour():
 4     inProfile = cmsOpenProfileFromFile('HPSJTW.ICM', 'r')
 5     outProfile = cmsCreate_sRGBProfile()

 7     myTransform = cmsCreateTransform(inProfile, TYPE_RGB_8,
 8                                      outProfile, TYPE_RGB_8,
 9                                      INTENT_PERCEPTUAL, 0)

11     for line in scanLines:
12         # Skipped handling of buffers.
13         cmsDoTransform(myTransform,
14                        yourInBuffer,
15                        yourOutBuffer,
16                        numberOfPixelsPerScanLine)

18     cmsDeleteTransform(myTransform)
19     cmsCloseProfile(inProfile)
20     cmsCloseProfile(outProfile)

Figure 5: Example using the basic API of the c_lcms module.

5 A Pythonic API

To create the usual pleasant "batteries included" feeling when working with code in Python, another module littlecms was created manually, implementing the Façade design pattern. From here on we are moving away from the original C-like API. This high level object oriented Façade takes care of the internal handling of tedious and error prone operations. It also performs sanity checking and automatic detection of certain crucial parameters passed to the C API. This has drastically reduced problems with the low level nature of the underlying C library.

5.1 littlecms Example

Using littlecms the API is now object oriented (listing in Fig. 6), with a doTransform() method on the myTransform object. But there are a few more interesting benefits of this API:

• Automatic disposing of C API instances hidden inside the Profile and Transform classes (a minimal sketch of this idea follows the list).
• Largely reduced code size with an easily comprehensible structure.
• Redundant passing of information (e.g. the in- and output colour spaces) is avoided, as it is determined within the Transform constructor from information available in the Profile objects.
• NumPy [10] arrays are used for the buffers for convenience, rather than introducing further custom types; array data types and shapes can be matched up automatically.
• The number of pixels for each scan line placed in yourInBuffer can usually be detected automatically.
• Compatibility with the often used PIL [11] library.
• Several sanity checks prevent clashes of erroneously passed buffer sizes, shapes, types, etc. that would otherwise result in a crashed or hanging process.
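The following is a minimal sketch of how such automatic disposal could be arranged inside the Façade. The attribute names, the constructor signature and the use of __del__ are illustrative assumptions, not the actual littlecms implementation.

    import c_lcms

    class Profile(object):
        # Hypothetical sketch: the Facade owns a C profile handle and disposes of it.
        def __init__(self, fileName=None):
            if fileName:
                self._handle = c_lcms.cmsOpenProfileFromFile(fileName, 'r')
            else:
                self._handle = c_lcms.cmsCreate_sRGBProfile()

        def __del__(self):
            # Automatic disposal of the underlying C API instance.
            if self._handle:
                c_lcms.cmsCloseProfile(self._handle)
                self._handle = None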
 1 from littlecms import Profile, PT_RGB, Transform

 3 def correctColour():
 4     inProfile = Profile('HPSJTW.ICM')
 5     outProfile = Profile(colourSpace=PT_RGB)
 6     myTransform = Transform(inProfile, outProfile)

 8     for line in scanLines:
 9         # Skipped handling of buffers.
10         myTransform.doTransform(yourNumpyInBuffer, yourNumpyOutBuffer)

Figure 6: Example using the object oriented API of the littlecms module.

6 Conclusion

Binding pure C libraries to Python is not very difficult, and the skills can be mastered in a rather short time frame. If done right, these bindings can be quite robust even towards certain changes in the evolving C API, without the need for very time consuming manual tracking of all changes. As with many projects, it is vital to be able to automate the mechanical processes: beyond the code generation outlined in this paper, an important role falls to automated code integrity testing (here: using PyUnit [12]) as well as to API documentation (here: using Epydoc [13]).

Unfortunately, as CtypesLib is still work in progress, the whole process did not go as smoothly as described here. It was particularly important to match up working versions properly between GCCXML (which is itself still in development) and CtypesLib. In this case a current GCCXML in version 0.9.0 (as available in Ubuntu Intrepid Ibex, 8.10) required a branch of CtypesLib that needed to be checked out from the developer's Subversion repository. Furthermore, it was necessary to develop a fix for the code generator, as it failed to generate code for #defined floating point constants. The patch has been reported to the author and is now in the source code repository. Also, patching into the generated source code for overriding some
features and manipulating the library loading code can be considered less than elegant.

Library wrapping as described in this paper was performed on version 1.16 of the LittleCMS library. While writing this paper the author moved to the now stable version 1.17. Adapting the Python wrapper to this code base was a matter of about 15 minutes of work. The main task was fixing some unit tests due to rounding differences resulting from an improved numerical model within the library. The author of LittleCMS recently made a first preview of the upcoming version 2.0 (an almost complete rewrite) available. Adapting to that version took only about a good day of modifications, even though some substantial changes were made to the API. But even in this case only a very small amount of new code had to be written.

Overall, it is foreseeable that this type of library wrapping in the Python world will become more and more ubiquitous as the tools for it mature. But already at the present time one does not have to fear the process. The time spent initially on setting up the environment will easily be saved over all project phases and iterations. It will be interesting to see Ctypes evolve to be able to interface to C++ libraries as well. Currently the developers of Ctypes and Py++ (Thomas Heller and Roman Yakovenko) are evaluating potential extensions.

References

[1] Official Python Documentation: Extending and Embedding the Python Interpreter, Python Software Foundation.
[2] T. Heller, Python Ctypes Project, http://starship.python.net/crew/theller/ctypes/, last accessed December 2008.
[3] M. Maria, LittleCMS project, http://littlecms.com/, last accessed December 2008.
[4] D. M. Beazley and W. S. Fulton, SWIG Project, http://www.swig.org/, last accessed December 2008.
[5] D. Abrahams and R. W. Grosse-Kunstleve, Building Hybrid Systems with Boost.Python, http://www.boostpro.com/writing/bpl.html, March 2003, last accessed December 2008.
[6] D. Abrahams, Boost.Python Project, http://www.boost.org/libs/python/, last accessed December 2008.
[7] S. Behnel, R. Bradshaw, and G. Ewing, Cython Project, http://cython.org/, last accessed December 2008.
[8] R. Yakovenko, Py++ Project, http://www.language-binding.net/pyplusplus/pyplusplus.html, last accessed December 2008.
[9] G. K. Kloss, Source Code: Automatic C Library Wrapping - Ctypes from the Trenches, The Python Papers Source Codes [in review], vol. n/a, p. n/a, 2009, [Online available] http://ojs.pythonpapers.org/index.php/tppsc/issue/.
[10] T. Oliphant, NumPy Project, http://numpy.scipy.org/, last accessed December 2008.
[11] F. Lundh, Python Imaging Library (PIL) Project, http://www.pythonware.com/products/pil/, last accessed December 2008.
[12] S. Purcell, PyUnit Project, http://pyunit.sourceforge.net/, last accessed December 2008.
[13] E. Loper, Epydoc Project, http://epydoc.sourceforge.net/, last accessed December 2008.