Human Computation Must Be Reproducible

Praveen Paritosh
Google
345 Spear St, San Francisco, CA 94105
pkp@google.com


Copyright © 2012 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. CrowdSearch 2012 workshop at WWW 2012, Lyon, France.

ABSTRACT
Human computation is the technique of performing a computational process by outsourcing some of the difficult-to-automate steps to humans. In the social and behavioral sciences, when using humans as measuring instruments, reproducibility guides the design and evaluation of experiments. We argue that human computation has similar properties, and that the results of human computation must be reproducible, in the least, in order to be informative. We might additionally require the results of human computation to have high validity or high utility, but the results must be reproducible in order to measure the validity or utility to a degree better than chance. Additionally, a focus on reproducibility has implications for the design of tasks and instructions, as well as for the communication of the results. It is humbling how often the initial understanding of the task and guidelines turns out to lack reproducibility. We suggest ensuring, measuring and communicating the reproducibility of human computation tasks.

1. INTRODUCTION
Some examples of tasks using human computation are: labeling images [Nowak and Ruger, 2010], conducting user studies [Kittur, Chi, and Suh, 2008], annotating natural language corpora [Snow, O'Connor, Jurafsky and Ng, 2008], annotating images for computer vision research [Sorokin and Forsyth, 2008], search engine evaluation [Alonso, Rose and Stewart, 2008; Alonso, Kazai and Mizzaro, 2011], content moderation [Ipeirotis, Provost and Wang, 2010], entity reconciliation [Kochhar, Mazzocchi and Paritosh, 2010], and conducting behavioral studies [Suri and Mason, 2010; Horton, Rand and Zeckhauser, 2010].

These tasks involve presenting a question, e.g., "Is this image offensive?", to one or more humans, whose answers are aggregated to produce a resolution, a suggested answer for the original question. The humans might be paid contributors. Examples of paid workforces include Amazon Mechanical Turk [www.mturk.com] and oDesk [www.odesk.com]. An example of a community of volunteer contributors is Foldit [www.fold.it], where human computation augments machine computation for predicting protein structure [Cooper et al., 2010]. Other examples involving volunteer contributors are games with a purpose [von Ahn, 2006] and the upcoming Duolingo [www.duolingo.com], where the contributors translate previously untranslated web corpora while learning a new language.

The results of human computation can be characterized by accuracy [Oleson et al., 2011], information theoretic measures of quality [Ipeirotis, Provost and Wang, 2010], and utility [Dai, Mausam and Weld, 2010], among others. In order for us to have confidence in any such criteria, the results must be reproducible, i.e., not a result of chance agreement or irreproducible human idiosyncrasies, but a reflection of the underlying properties of the questions and task instructions, on which others could agree as well. Reproducibility is the degree to which a process can be replicated by different human contributors working under varying conditions, at different locations, or using different but functionally equivalent measuring instruments. A total lack of reproducibility implies that the given results could have been obtained merely by chance agreement. If the results are not differentiable from chance, there is little information content in them. Using human computation in such a scenario is wasteful of an expensive resource, as chance is cheap to simulate.

A much stronger claim than reproducibility is validity. For a measurement instrument, e.g., a vernier caliper, a standardized test, or a human coder, the reproducibility is the extent to which a measurement gives consistent results, and the validity is the extent to which the tool measures what it claims to measure. In contrast to reproducibility, validity concerns truths. Validity requires comparing the results of the study to evidence obtained independently of that effort. Reproducibility provides assurances that particular research results can be duplicated, and that no (or only a negligible amount of) extraneous noise has entered the process and polluted the data or perturbed the research results; validity provides assurances that claims emerging from the research are borne out in fact.

We might want the results of human computation to have high validity, high utility, low cost, among other desirable characteristics. However, the results must be reproducible in order for us to measure the validity or utility to a degree better than chance.

More than a statistic, a focus on reproducibility offers valuable insights regarding the design of the task and the guidelines for the human contributors, as well as the communication of the results. The output of human computation is thus akin to the result of a scientific experiment, and it can only be considered meaningful if it is reproducible, that is, the same results could be replicated in an independent exercise. This requires clearly communicating the task instructions and the criteria for selecting the human contributors, ensuring that they work independently, and reporting an appropriate measure of reproducibility. Much of this is well established in the methodology of content analysis in the social and behavioral sciences [Armstrong, Gosling, Weinman and Marteau, 1997; Hayes and Krippendorff, 2007], being required of any publishable result involving human contributors.
In Sections 2 and 3, we argue that human computation resembles the coding tasks of the behavioral sciences. However, in the human computation and crowdsourcing research community, reproducibility is not commonly reported.

We have collected millions of human judgments regarding entities and facts in Freebase [Kochhar, Mazzocchi and Paritosh, 2010; Paritosh and Taylor, 2012]. We have found reproducibility to be a useful guide for task and guideline design. It is humbling how often the initial understanding of the task and guidelines turns out to lack reproducibility. In Section 4, we describe some of the widely used measures of reproducibility. We suggest ensuring, measuring and communicating the reproducibility of human computation tasks.

In the next section, we describe the sources of variability in human computation, which highlight the role of reproducibility.

2. SOURCES OF VARIABILITY IN HUMAN COMPUTATION
There are many sources of variability in human computation that are not present in machine computation. Given that human computation is used to solve problems that are beyond the reach of machine computation, by definition, these problems are incompletely specified. Variability arises due to incomplete specification of the task. This is compounded by the fact that the guidelines are subject to differing interpretations by different human contributors. Some characteristics of human computation tasks are:

   • Task guidelines are incomplete: A task can span a wide set of domains, not all of which are anticipated at the beginning of the task. This leads to incompleteness in guidelines, as well as varying levels of performance depending upon the contributor's expertise in that domain. Consider the task of establishing the relevance of an arbitrary search query [Alonso, Kazai and Mizzaro, 2011].

   • Task guidelines are not precise: Consider, for example, the task of declaring if an image is unsuitable for a social network website. Not only is it hard to write down all the factors that go into making an image offensive, it is hard to communicate those factors to human contributors with vastly different predispositions. The guidelines usually rely upon shared common sense knowledge and cultural knowledge.

   • Validity data is expensive or unavailable: An oracle that provides the true answer for any given question is usually unavailable. Sometimes, for a small subset of gold questions, we have answers from another independent source. This can be useful in making estimates of validity, subject to the degree that the gold questions are representative of the set of questions. These gold questions can be very useful for training and feedback to the human contributors; however, we have to ensure their representativeness in order to make warranted claims regarding validity.

Each of the above might be true to a different degree for different human computation tasks. These factors are similar to the concerns of behavioral and social scientists in using humans as measuring instruments.

3. CONTENT ANALYSIS AND CODING IN THE BEHAVIORAL SCIENCES
In the social sciences, content analysis is a methodology for studying the content of communication [Berelson, 1952; Krippendorff, 2004]. Coding of subjective information is a significant source of empirical data in the social and behavioral sciences, as it allows techniques of quantitative research to be applied to complex phenomena. These data are typically generated by trained human observers who record or transcribe textual, pictorial or audible matter in terms suitable for analysis. This task is called coding, which involves assigning categorical, ordinal or quantitative responses to units of communication.

An early example of a study based on content analysis is "Do newspapers now give the news?" [Speed, 1893], which tried to show that the coverage of religious, scientific and literary matters was dropped in favor of gossip, sports and scandals between 1881 and 1893 by New York newspapers. Conclusions from such data can only be trusted if the reading of the textual data, as well as of the research results, is replicable elsewhere, i.e., the coders demonstrably agree on what they are talking about. Hence, the coders need to demonstrate the trustworthiness of their data by measuring its reproducibility. To perform reproducibility tests, additional data are needed: by duplicating the research under various conditions. Reproducibility is established by independent agreement between different but functionally equal measuring devices, for example, by using several coders with diverse personalities. The reproducibility of coding has been used for comparing the consistency of medical diagnosis [e.g., Koran, 1975], for drawing conclusions from meta-analysis of research findings [e.g., Morley et al., 1999], for testing industrial reliability [Meeker and Escobar, 1998], and for establishing the usefulness of a clinical scale [Hughes et al., 1982].

3.1 Relationship between Reproducibility and Validity

   • Lack of reproducibility limits the chance of validity: If the coding results are a product of chance, they may well include a valid account of what was observed, but researchers would not be able to identify that account to a degree better than chance. Thus, the more unreliable a procedure, the less likely it is to result in data that lead to valid conclusions.

   • Reproducibility does not guarantee validity: Two observers of the same event who hold the same conceptual system, prejudice, or interest may well agree on what they see but still be objectively wrong, based on some external criterion. Thus a reliable process may or may not lead to valid outcomes.

In some cases, validity data might be so hard to obtain that one has to settle for reproducibility. In tasks such as interpretation and transcription of complex textual matter, suitable accuracy standards are not easy to find. Because interpretations can only be compared to interpretations, attempts to measure validity presuppose the privileging of some interpretations over others, and this puts any claims regarding validity on epistemologically shaky grounds. In some tasks, like psychiatric diagnosis, even reproducibility is hard to attain for some questions. Aboraya et al. [2006] review the reproducibility of psychiatric diagnosis. Lack of reproducibility has been reported for judgments of schizophrenia and affective disorder [Goodman et al., 1984], calling such diagnoses into question.
3.2 Relationship with Chance Agreement
In this section, we look at some properties of chance agreement and its relationship to reproducibility. Given two coding schemes for the same phenomenon, the one with fewer categories will have higher chance agreement. For example, in reCAPTCHA [von Ahn et al., 2008], two independent humans are shown an image containing text and asked to transcribe it. Assuming that there is no collusion, chance agreement, i.e., two different humans typing in the same word or phrase by chance, is very small. However, in a task in which there are only two possible answers, e.g., true and false, the probability of chance agreement between two answers is 0.5.

If a disproportionate amount of data falls under one category, then the expected chance agreement is very high, so in order to demonstrate high reproducibility, even higher observed agreement is required [Feinstein and Cicchetti 1990; Di Eugenio and Glass 2004].

Consider a task of rating a proposition as true or false. Let p be the probability of the proposition being true. An implementation of chance agreement is the following: toss a biased coin with the same odds, i.e., p is the probability that it turns up heads, and declare the proposition to be true when the coin lands heads. Now, we can simulate n judgments by tossing the coin n times. Let us look at the properties of unanimous agreement between two judgments. The likelihood of a chance agreement on true is p². Independent of this agreement, the probability of the proposition being true is p; therefore, the accuracy of this chance agreement on true is p³. By design, these judgments do not contain any information other than the a priori distribution across answers. Such data have close to zero reproducibility; however, this can sometimes show up in surprising ways when looked at through the lens of accuracy.

For example, consider the task of an airport agent declaring a bag as safe or unsafe for boarding on the plane. A bag can be unsafe if it contains toxic or explosive materials that could threaten the safety of the flight. Most bags are safe. Let us say that one in a thousand bags is potentially unsafe. Random coding would allow two agents to jointly assign "safe" 99.8% of the time, and since 99.9% of the bags are safe, this agreement would be accurate 99.7% of the time! This leads to the surprising result that when data are highly skewed, the coders may agree on a high proportion of items while producing annotations that are accurate, but of low reproducibility. When one category is very common, high accuracy and high agreement can also result from indiscriminate coding. The test for reproducibility in such cases is the ability to agree on the rare categories. In the airport bag classification problem, while chance accuracy on safe bags is high, chance accuracy on unsafe bags is extremely low, 10⁻⁷ %. In practice, the costs of errors vary: mistakenly classifying a safe bag as unsafe causes far less damage than classifying an unsafe bag as safe.

In this case, it is dangerous to consider an averaged accuracy score, as different errors do not count equally: a chance process that does not add any information has an average accuracy which is higher than 99.7%, most of which reflects the original bias in the distribution of safe and unsafe bags. A misguided interpretation of accuracy or a poor estimate of accuracy can be less informative than reproducibility.
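The arithmetic in this example can be checked with a few lines of Python. This is a worked illustration written for this note (the base rate of 0.001 and the two random "agents" are the assumptions stated above), not code from the paper; the chance-corrected score in the last line anticipates the coefficients introduced in Section 4.

```python
# Probabilities from the airport-bag example above: 1 bag in 1000 is unsafe,
# and two agents label bags at random with that base rate, ignoring the bag.
p_unsafe = 0.001
p_safe = 1.0 - p_unsafe

raw_agreement = p_safe ** 2 + p_unsafe ** 2       # both give the same label    ~0.998
accurate_agreement = p_safe ** 3 + p_unsafe ** 3  # ... and the label is right  ~0.997
rare_and_correct = p_unsafe ** 3                  # both say "unsafe" and it is: 1e-9, i.e. 10^-7 %

# Chance-corrected agreement: for purely random coding the observed agreement
# equals the agreement expected by chance, so the corrected score is 0.
expected_agreement = p_safe ** 2 + p_unsafe ** 2
corrected = (raw_agreement - expected_agreement) / (1.0 - expected_agreement)

print(f"raw agreement                {raw_agreement:.4f}")
print(f"accurate agreement           {accurate_agreement:.4f}")
print(f"correct on the rare category {rare_and_correct:.1e}")
print(f"chance-corrected agreement   {corrected:.4f}")
```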
3.3 Reproducibility and Experiment Design
The focus on reproducibility has implications for the design of task and instruction materials. Krippendorff [2004] argues that any study using observed agreement as a measure of reproducibility must satisfy the following requirements:

   • It must employ exhaustively formulated, clear, and usable guidelines;

   • It must use clearly specified criteria concerning the choice of contributors, so that others may use such criteria to reproduce the data;

   • It must ensure that the contributors who generate the data used to measure reproducibility work independently of each other. Only if such independence is assured can covert consensus be ruled out and the observed agreement be explained in terms of the given guidelines and the task.

The last point cannot be stressed enough. There are potential benefits from multiple contributors collaborating, but data generated in this manner neither ensure reproducibility nor reveal its extent. In groups like these, humans are known to negotiate and to yield to each other in quid pro quo exchanges, with prestigious group members dominating the outcome [see, for example, Esser, 1998]. This makes the results of collaborative human computation a reflection of the social structure of the group, which is nearly impossible to communicate to other researchers and replicate. The data generated by collaborative work are akin to data generated by a single observer, while reproducibility requires at least two independent observers. To substantiate the contention that collaborative coding is superior to coding by separate individuals, a researcher would have to compare the data generated by at least two such groups and two individuals, each working independently.

A model in which the coders work independently, but consult each other when unanticipated problems arise, is also problematic. A key source of these unanticipated problems is the fact that the writers of the coding instructions did not anticipate all the possible ways of expressing the relevant matter. Ideally, these instructions should include every applicable rule on which agreement is being measured. However, discussing emerging problems can create re-interpretations of the existing instructions in ways that are a function of the group and not communicable to others. In addition, as the instructions become reinterpreted, the process loses its stability: data generated early in the process use instructions that differ from those used later.

In addition to the above, Craggs and McGee Wood [2005] discourage researchers from testing their coding instructions on data from only one domain. Given that the reproducibility of the coding instructions depends to a great extent on how complications are dealt with, and that every domain displays different complications, the sample should contain sufficient examples from all domains which have to be annotated according to the instructions.
Even the best coding instructions might not specify all possible complications. Besides the set of desired answers, the coders should also be allowed to skip a question. If the coders cannot establish that any of the other answers is correct, they skip that question. For any other answer, the instructions define an a priori model of agreement on that answer, while skip represents the unanticipated properties of questions and coders. For instance, some questions might be too difficult for certain coders. Providing the human contributors with an option to skip is a nod to the openness of the task, and can be used to explore the poorly defined parts of the task that were not anticipated at the outset. Additionally, the skip votes can be removed from the analysis for computing reproducibility, as we do not have an expectation of agreement on them [Krippendorff, 2012, personal communication].

4. MEASURING REPRODUCIBILITY
In measurement theory, reliability is the more general guarantee that the data obtained are independent of the measuring event, instrument or person. There are three different kinds of reliability:

Stability: measures the degree to which a process is unchanging over time. It is measured by agreement between multiple trials of the same measuring or coding process. This is also called the test-retest condition, in which one observer does a task and, after some time, repeats the task again. This measures intra-observer reliability. A similar notion is internal consistency [Cronbach, 1951], which is the degree to which the answers on the same task are consistent. Surveys are designed so that the subsets of similar questions are known a priori, and internal consistency metrics are based on the correlation between these answers.

Reproducibility: measures the degree to which a process can be replicated by different analysts working under varying conditions, at different locations, or using different but functionally equivalent measuring instruments. Reproducible data, by definition, are data that remain constant throughout variations in the measuring process [Kaplan and Goldsen, 1965].

Accuracy: measures the degree to which the process produces valid results. To measure accuracy, we have to compare the performance of contributors with the performance of a procedure that is known to be correct. In order to generate estimates of accuracy, we need accuracy data, i.e., valid answers to a representative sample of the questions. Estimating accuracy gets harder in cases where the heterogeneity of the task is poorly understood.

The next section focuses on reproducibility.

4.1 Reproducibility
There are two different aspects of reproducibility: inter-rater reliability and inter-method reliability. Inter-rater reliability focuses on reproducibility as agreement between independent raters, and inter-method reliability focuses on the reliability of different measuring devices. For example, in survey and test design, parallel forms reliability is used to create multiple equivalent tests, of which more than one are administered to the same human. We focus on inter-rater reliability as the measure of reproducibility typically applicable to human computation tasks, where we generate judgments from multiple humans per question. The simplest form of inter-rater reliability is percent agreement; however, it is not suitable as a measure of reproducibility, as it does not correct for chance agreement.

For an extensive survey of measures of reproducibility, refer to Popping [1988] and Artstein and Poesio [2007]. The different coefficients of reproducibility differ in the assumptions they make about the properties of coders, judgments and units. Scott's π [1955] is applicable to two raters and assumes that the raters have the same distribution of responses, whereas Cohen's κ [1960; 1968] allows for a separate distribution of chance behavior per coder. Fleiss' κ [1971] is a generalization of Scott's π for an arbitrary number of raters. All of these coefficients of reproducibility correct for chance agreement similarly. First, they find how much agreement is expected by chance: let us call this value Ae. The data from the coding give a measure of the observed agreement, Ao. The various inter-rater reliability coefficients measure the proportion of the possible agreement beyond chance that was actually observed:

                  S, π, κ = (Ao − Ae) / (1 − Ae)
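As a concrete illustration of Ao, Ae and the chance correction, the sketch below computes a Scott's-π-style score for two raters on a nominal task. It is a minimal example written for this note (the function and the made-up labels are not from the paper); the skewed labels echo the airport example, where raw agreement is high but agreement beyond chance is absent.

```python
from collections import Counter

def chance_corrected_agreement(rater_a, rater_b):
    """Scott's pi for two raters on a nominal scale: (Ao - Ae) / (1 - Ae).
    Ae is computed from the pooled distribution of both raters' answers."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)

    # Observed agreement: fraction of items the two raters label identically.
    ao = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement: both raters drawing at random from the pooled label distribution.
    pooled = Counter(rater_a) + Counter(rater_b)
    total = sum(pooled.values())
    ae = sum((count / total) ** 2 for count in pooled.values())

    return ao, ae, (ao - ae) / (1 - ae)

# Hypothetical labels for ten items; mostly "safe", as in the skewed example above.
a = ["safe"] * 9 + ["unsafe"]
b = ["safe"] * 10
ao, ae, pi = chance_corrected_agreement(a, b)
print(f"Ao = {ao:.2f}, Ae = {ae:.2f}, pi = {pi:.2f}")  # high Ao, but the corrected score is near zero
```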
Krippendorff's α [1970; 2004] is a generalization of many of these coefficients. It generalizes the above metrics to an arbitrary number of raters, not all of whom have to answer every question. Krippendorff's α has the following desirable characteristics:

   • It is applicable to an arbitrary number of contributors and invariant to the permutation and selective participation of contributors. It corrects itself for varying amounts of reproducibility data.

   • It constitutes a numerical scale between at least two points with sensible reproducibility interpretations, 0 representing absence of agreement and 1 indicating perfect agreement.

   • It is applicable to several scales of measurement: nominal, ordinal, interval, ratio, and more.

Alpha's general form is:

                  α = 1 − Do / De

where Do is the observed disagreement:

                  Do = (1/n) Σ_c Σ_k o_ck · δ²_ck

and De is the disagreement one would expect when the answers are attributable to chance rather than to the properties of the questions:

                  De = (1/(n(n−1))) Σ_c Σ_k n_c · n_k · δ²_ck

The δ²_ck term is the distance metric for the scale of the answer space. For a nominal scale,

                  δ²_ck = 0 if c = k, and 1 if c ≠ k.
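A compact sketch of the nominal-scale case is below, written for this note (the function name and the toy data are illustrative assumptions, and tested library implementations should be preferred in practice). It builds the coincidence matrix o_ck over pairable values, so contributors may answer different subsets of questions:

```python
from collections import defaultdict
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` maps each question to the list of answers it received from
    different contributors; questions with fewer than two answers are
    ignored, since they contribute no pairable values."""
    # Coincidence matrix o[(c, k)] over ordered pairs of values within each unit.
    o = defaultdict(float)
    for answers in units.values():
        m = len(answers)
        if m < 2:
            continue
        for c, k in permutations(answers, 2):   # ordered pairs of distinct positions
            o[(c, k)] += 1.0 / (m - 1)

    # Marginals n_c and the total number of pairable values n.
    n_by_value = defaultdict(float)
    for (c, _k), count in o.items():
        n_by_value[c] += count
    n = sum(n_by_value.values())

    # Observed and expected disagreement under the nominal distance
    # (delta = 0 when the values match, 1 otherwise).
    d_o = sum(count for (c, k), count in o.items() if c != k) / n
    d_e = sum(n_by_value[c] * n_by_value[k]
              for c in n_by_value for k in n_by_value if c != k) / (n * (n - 1))
    return 1.0 - d_o / d_e

# Hypothetical reproducibility data: three contributors, not all answering everything.
data = {
    "q1": ["true", "true", "true"],
    "q2": ["true", "false"],
    "q3": ["false", "false", "true"],
    "q4": ["true"],                 # only one answer: not pairable, ignored
}
print(f"alpha = {krippendorff_alpha_nominal(data):.3f}")
```

For this toy data the score comes out close to zero: the contributors agree little beyond what the overall label distribution alone would produce.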
4.2 Statistical Significance
The goal of measuring reproducibility is to ensure that the data do not deviate too much from perfect agreement, not that the data are different from chance agreement. In the definition of α, chance agreement is one of the two anchors for the agreement scale, the other, more important, reference point being that of perfect agreement. As the distribution of α is unknown, confidence intervals on α are obtained from an empirical distribution generated by bootstrapping, that is, by drawing a large number of subsamples from the reproducibility data and computing α for each. This gives us a probability distribution of hypothetical α values that could occur within the constraints of the observed data. This can be used to calculate the probability q of failing to reach the smallest acceptable reproducibility αmin, i.e., q = P(α < αmin), or a two-tailed confidence interval for a chosen level of significance.
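A minimal bootstrap along these lines is sketched below. It reuses the hypothetical krippendorff_alpha_nominal function and data from the previous sketch, and it resamples whole questions, which is a simplification of the resampling procedure Krippendorff describes; the number of replicates and the 95% interval are our choices, not prescriptions from the paper.

```python
import random

def bootstrap_alpha(units, n_boot=1000, alpha_min=0.667, seed=0):
    """Resample questions with replacement, recompute alpha for each subsample,
    and report q = P(alpha < alpha_min) plus a two-tailed 95% interval."""
    # Assumes krippendorff_alpha_nominal (from the sketch above) is in scope.
    rng = random.Random(seed)
    keys = list(units.keys())
    samples = []
    for _ in range(n_boot):
        resampled = {i: units[rng.choice(keys)] for i in range(len(keys))}
        try:
            samples.append(krippendorff_alpha_nominal(resampled))
        except ZeroDivisionError:
            continue   # a degenerate resample with no expected disagreement
    samples.sort()
    q = sum(a < alpha_min for a in samples) / len(samples)
    lo = samples[int(0.025 * len(samples))]
    hi = samples[int(0.975 * len(samples))]
    return q, (lo, hi)

q, (lo, hi) = bootstrap_alpha(data)   # `data` from the sketch above
print(f"q = P(alpha < 0.667) = {q:.2f}, 95% interval = [{lo:.2f}, {hi:.2f}]")
```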
4.3 Sampling Considerations
To generate an estimate of the reproducibility of a population, we need to generate a representative sample of the population. The sampling needs to ensure that we have enough units from the rare categories of questions in the data. Assuming that α is normally distributed, Bloch and Kraemer [1989] provide a suggestion for the minimum number of questions from each category to be included in the sample, Nc:

                  Nc = z² · (1 + αmin)(3 − αmin) / (4 · pc · (1 − pc) · (1 − αmin)) − αmin

where

   • pc is the smallest estimated proportion of values in category c in the population,

   • αmin is the smallest acceptable reproducibility below which data will have to be rejected as unreliable, and

   • z is the desired level of statistical significance, represented by the corresponding z value for one-tailed tests.

This is a simplification, as it assumes that α is normally distributed and the data are binary, and it does not account for the number of raters. A general description of sampling requirements is an open problem.
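Evaluated exactly as printed above, the suggestion is easy to compute. The helper below was written for this note; the one-tailed z value of 1.645 (a 5% significance level) and the example proportion of 0.05 are illustrative assumptions.

```python
def min_questions_per_category(p_c, alpha_min=0.667, z=1.645):
    """Bloch and Kraemer's suggested minimum sample size per category,
    evaluated exactly as the formula is printed above."""
    return z ** 2 * (1 + alpha_min) * (3 - alpha_min) / (
        4 * p_c * (1 - p_c) * (1 - alpha_min)) - alpha_min

# e.g., a rare category that makes up 5% of the population
print(round(min_questions_per_category(p_c=0.05)))
```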
                                                                 reproducibility. We suggest ensuring, measuring and com-
4.4 Acceptable Levels of Reproducibility                         municating reproducibility of human computation tasks.
  Fleiss [1981] and Krippendorff [2004] present guidelines
for what should acceptable values of reproducibility based
on surveying the empirical research using these measures.
                                                                 6.    ACKNOWLEDGMENTS
Krippendorff suggests,                                              The author would like to thank Jamie Taylor, Reilly Hayes,
                                                                 Stefano Mazzocchi, Mike Shwe, Ben Hutchinson, Al Marks,
   • Rely on variables with reproducibility above α = 0.800.     Tamsyn Waterhouse, Peng Dai, Ozymandias Haynes, Ed
     Additionally don’t accept data if the confidence inter-      Chi and Panos Ipeirotis for insightful discussions on this
     val reaches below the smallest acceptable reproducibil-     topic.
     ity, αmin = 0.667, or, ensure that the probability, q, of
     the failure to have less than smallest acceptable repro-
     ducibility alphamin is reasonably small, e.g., q < 0.05.    7.    REFERENCES
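As a toy reading of the cost-ratio form above (this is not the estimator of Ipeirotis et al., only an illustration of its behavior under an assumed binary task with unit misclassification cost): a perfect contributor scores 1, and a contributor whose answers cost as much as the spammer's scores 0.

```python
def quality_score(exp_cost_contributor, exp_cost_spammer):
    """Cost-ratio form of the score above: 1 for a costless contributor,
    0 for one whose answers cost as much as a spammer's."""
    return 1.0 - exp_cost_contributor / exp_cost_spammer

# Assumed binary task with unit misclassification cost and a 10% minority class:
# a spammer who always answers the majority class is wrong 10% of the time.
spammer_cost = 0.10
print(round(quality_score(exp_cost_contributor=0.02, exp_cost_spammer=spammer_cost), 2))  # 0.8
print(round(quality_score(exp_cost_contributor=0.10, exp_cost_spammer=spammer_cost), 2))  # 0.0
```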
TurKontrol [Dai, Mausam and Weld, 2010] uses models of both utility and quality with decision-theoretic control to make trade-offs between quality and utility for workflow control.

Le, Edmonds, Hester and Biewald [2010] develop a gold standard based quality assurance framework that provides direct feedback to the workers and targets specific worker errors. This approach requires an extensive manually generated collection of gold data. Oleson et al. [2011] further develop this approach to include pyrite: programmatically generated gold questions on which contributors are likely to make an error, for example created by mutating data so that it is no longer valid. These are very useful mechanisms for training, feedback and protection against spammers, but they do not reveal the accuracy of the results. The gold questions, by design, are not representative of the original set of questions. This leads to wide error bars on the accuracy estimates, and it might be valuable to measure the reproducibility of results.

5. CONCLUSIONS
We describe reproducibility as a necessary but not sufficient requirement for the results of human computation. We might additionally require the results to have high validity or high utility, but our ability to measure validity or utility with confidence is limited if the data are not reproducible. Additionally, a focus on reproducibility has implications for the design of tasks and instructions, as well as for the communication of the results. It is humbling how often the initial understanding of the task and guidelines turns out to lack reproducibility. We suggest ensuring, measuring and communicating the reproducibility of human computation tasks.

6. ACKNOWLEDGMENTS
The author would like to thank Jamie Taylor, Reilly Hayes, Stefano Mazzocchi, Mike Shwe, Ben Hutchinson, Al Marks, Tamsyn Waterhouse, Peng Dai, Ozymandias Haynes, Ed Chi and Panos Ipeirotis for insightful discussions on this topic.

7. REFERENCES
[1] Aboraya, A., Rankin, E., France, C., El-Missiry, A., and John, C. The Reliability of Psychiatric Diagnosis Revisited. Psychiatry (Edgmont), 3(1):41-50, January 2006.
[2] Alonso, O., Rose, D. E., and Stewart, B. Crowdsourcing for relevance evaluation. SIGIR Forum, 42(2):9-15, 2008.
[3] Alonso, O., Kazai, G., and Mizzaro, S. Crowdsourcing for search engine evaluation. Springer, 2011.
[4] Armstrong, D., Gosling, A., Weinman, J., and Marteau, T. The place of inter-rater reliability in qualitative research: An empirical study. Sociology, 31(3):597-606, 1997.
[5] Artstein, R. and Poesio, M. 2007. Inter-coder agreement for computational linguistics. Computational Linguistics.
[6] Bennett, E. M., Alpert, R., and Goldstein, A. C. 1954. Communications through limited questioning. Public Opinion Quarterly, 18(3):303-308.
[7] Berelson, B. 1952. Content analysis in communication research. Free Press, New York.
[8] Bloch, D. A. and Kraemer, H. C. 1989. 2 x 2 kappa coefficients: Measures of agreement or association. Biometrics, 45(1):269-287.
[9] Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. 2008. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In Proceedings of the 28th ACM SIGMOD International Conference on Management of Data, Vancouver.
[10] Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37-46.
[11] Cohen, J. 1968. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213-220.
[12] Cooper, S., Khatib, F., Treuille, A., Barbero, J., Lee, J., Beenen, M., Leaver-Fay, A., Baker, D., and Popovic, Z. Predicting protein structures with a multiplayer online game. Nature, 466(7307):756-760.
[13] Cronbach, L. J. 1951. Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
[14] Craggs, R. and McGee Wood, M. 2004. A two-dimensional annotation scheme for emotion in dialogue. In Proc. of the AAAI Spring Symposium on Exploring Attitude and Affect in Text, Stanford.
[15] Dai, P., Mausam, and Weld, D. S. Decision-Theoretic Control of Crowd-Sourced Workflows. AAAI 2010.
[16] Di Eugenio, B. and Glass, M. 2004. The kappa statistic: A second look. Computational Linguistics, 30(1):95-101.
[17] Esser, J. K. 1998. Alive and Well after 25 Years: A Review of Groupthink Research. Organizational Behavior and Human Decision Processes, 73(2-3):116-141.
[18] Feinstein, A. R. and Cicchetti, D. V. 1990. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6):543-549.
[19] Fleiss, J. L. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378-382.
[20] Goodman, A. B., Rahav, M., Popper, M., Ginath, Y., and Pearl, E. The reliability of psychiatric diagnosis in Israel's Psychiatric Case Register. Acta Psychiatrica Scandinavica, 69(5):391-7, May 1984.
[21] Hayes, A. F. and Krippendorff, K. 2007. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1):77-89.
[22] Hughes, C. D., Berg, L., Danziger, L., Coben, L. A., and Martin, R. L. A new clinical scale for the staging of dementia. The British Journal of Psychiatry, 1982;140:56.
[23] Ipeirotis, P., Provost, F., and Wang, J. 2010. Quality management on Amazon Mechanical Turk. In KDD-HCOMP '10.
[24] Krippendorff, K. 1970. Bivariate agreement coefficients for reliability of data. Sociological Methodology, 2:139-150.
[25] Krippendorff, K. 2004. Content Analysis: An Introduction to Its Methodology, Second edition. Sage, Thousand Oaks.
[26] Kochhar, S., Mazzocchi, S., and Paritosh, P. 2010. The Anatomy of a Large-Scale Human Computation Engine. In Proceedings of the Human Computation Workshop at the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2010, Washington, D.C.
[27] Koran, L. M. 1975. The reliability of clinical methods, data and judgments (parts 1 and 2). New England Journal of Medicine, 293(13/14).
[28] Kittur, A., Chi, E. H., and Suh, B. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the twenty-sixth annual SIGCHI Conference on Human Factors in Computing Systems, Florence.
[29] Mason, W. and Suri, S. 2010. Conducting Behavioral Research on Amazon's Mechanical Turk. Working Paper, Social Science Research Network.
[30] Meeker, W. and Escobar, L. 1998. Statistical Methods for Reliability Data. Wiley.
[31] Morley, S., Eccleston, C., and Williams, A. 1999. Systematic review and meta-analysis of randomized controlled trials of cognitive behavior therapy and behavior therapy for chronic pain in adults, excluding headache. Pain, 80, 1-13.
[32] Nowak, S. and Ruger, S. 2010. How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation. In Multimedia Information Retrieval, pages 557-566.
[33] Oleson, D., Hester, V., Sorokin, A., Laughlin, G., Le, J., and Biewald, L. Programmatic Gold: Targeted and Scalable Quality Assurance in Crowdsourcing. In HCOMP '11: Proceedings of the Third AAAI Human Computation Workshop.
[34] Paritosh, P. and Taylor, J. 2012. Freebase: Curating Knowledge at Scale. To appear in the 24th Conference on Innovative Applications of Artificial Intelligence, IAAI 2012.
[35] Popping, R. 1988. On agreement indices for nominal data. Sociometric Research: Data Collection and Scaling, 90-105.
[36] Scott, W. A. 1955. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19(3):321-325.
[37] Snow, R., O'Connor, B., Jurafsky, D., and Ng, A. Y. 2008. Cheap and fast, but is it good?: Evaluating non-expert annotations for natural language tasks. In EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing.
[38] Sorokin, A. and Forsyth, D. 2008. Utility data annotation with Amazon Mechanical Turk. In First International Workshop on Internet Vision, CVPR 08.
[39] Speed, J. G. 1893. Do newspapers now give the news? Forum, August 1893.
[40] von Ahn, L. Games With A Purpose. IEEE Computer Magazine, June 2006, pages 96-98.
[41] von Ahn, L. and Dabbish, L. Labeling Images with a Computer Game. ACM Conference on Human Factors in Computing Systems, CHI 2004, pages 319-326.
[42] von Ahn, L., Maurer, B., McMillen, C., Abraham, D., and Blum, M. reCAPTCHA: Human-Based Character Recognition via Web Security Measures. Science, September 12, 2008, pp. 1465-1468.

Approximation Algorithms for the Directed k-Tour and k-Stroll ProblemsApproximation Algorithms for the Directed k-Tour and k-Stroll Problems
Approximation Algorithms for the Directed k-Tour and k-Stroll Problems
 

Dernier

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 

Dernier (20)

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 

Human Computation Must Ensure Reproducible Results

Human computation has been used for labeling images [Nowak and Ruger, 2010], conducting user studies [Kittur, Chi, and Suh, 2008], annotating natural language corpora [Snow, O'Connor, Jurafsky and Ng, 2008], annotating images for computer vision research [Sorokin and Forsyth, 2008], search engine evaluation [Alonso, Rose and Stewart, 2008; Alonso, Kazai and Mizzaro, 2011], content moderation [Ipeirotis, Provost and Wang, 2010], entity reconciliation [Kochhar, Mazzocchi and Paritosh, 2010], and conducting behavioral studies [Suri and Mason, 2010; Horton, Rand and Zeckhauser, 2010].

These tasks involve presenting a question, e.g., "Is this image offensive?", to one or more humans, whose answers are aggregated to produce a resolution, a suggested answer for the original question. The humans might be paid contributors. Examples of paid workforces include Amazon Mechanical Turk [www.mturk.com] and oDesk [www.odesk.com]. An example of a community of volunteer contributors is Foldit [www.fold.it], where human computation augments machine computation for predicting protein structure [Cooper et al., 2010]. Other examples involving volunteer contributors are games with a purpose [von Ahn, 2006] and the upcoming Duolingo [www.duolingo.com], where contributors translate previously untranslated web corpora while learning a new language.
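As a concrete illustration of the aggregation step just described, the sketch below resolves a question by simple majority vote. No particular aggregation rule is prescribed here, so the majority-vote policy, the function name, and the tie-breaking behavior are illustrative assumptions only.

```python
from collections import Counter

def resolve(judgments):
    """Aggregate one question's human judgments into a resolution.

    judgments: a list of answers, e.g. ["offensive", "ok", "offensive"].
    Returns (resolution, support), where support is the fraction of
    judgments that agree with the winning answer. Ties are broken by
    whichever answer Counter.most_common happens to list first.
    """
    counts = Counter(judgments)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(judgments)

# Three contributors answer "Is this image offensive?"
print(resolve(["yes", "yes", "no"]))  # ('yes', 0.666...)
```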
The results of human computation can be characterized by accuracy [Oleson et al., 2011], information-theoretic measures of quality [Ipeirotis, Provost and Wang, 2010], and utility [Dai, Mausam and Weld, 2010], among others. Validity requires comparing the results of the study to evidence obtained independently of that effort. Reproducibility provides assurances that particular research results can be duplicated, that no (or only a negligible amount of) extraneous noise has entered the process and polluted the data or perturbed the research results; validity provides assurances that claims emerging from the research are borne out in fact.

We might want the results of human computation to have high validity, high utility, low cost, among other desirable characteristics. However, the results must be reproducible in order for us to measure the validity or utility to a degree better than chance.

More than a statistic, a focus on reproducibility offers valuable insights regarding the design of the task and the guidelines for the human contributors, as well as the communication of the results. The output of human computation is thus akin to the result of a scientific experiment, and it can only be considered meaningful if it is reproducible, that is, if the same results could be replicated in an independent exercise. This requires clearly communicating the task instructions and the criteria for selecting the human contributors, ensuring that they work independently, and reporting an appropriate measure of reproducibility. Much of this is well established in the methodology of content analysis in the social and behavioral sciences [Armstrong, Gosling, Weinman and Marteau, 1997; Hayes and Krippendorff, 2007], being required of any publishable result involving human contributors. In Sections 2 and 3, we argue that human computation resembles the coding tasks of the behavioral sciences. However, in the human computation and crowdsourcing research community, reproducibility is not commonly reported.

We have collected millions of human judgments regarding entities and facts in Freebase [Kochhar, Mazzocchi and Paritosh, 2010; Paritosh and Taylor, 2012]. We have found reproducibility to be a useful guide for task and guideline design. It is humbling how often the initial understanding of the task and guidelines turns out to lack reproducibility. In Section 4, we describe some of the widely used measures of reproducibility. We suggest ensuring, measuring and communicating reproducibility of human computation tasks.

In the next section, we describe the sources of variability in human computation, which highlight the role of reproducibility.
2. SOURCES OF VARIABILITY IN HUMAN COMPUTATION

There are many sources of variability in human computation that are not present in machine computation. Given that human computation is used to solve problems that are beyond the reach of machine computation, by definition, these problems are incompletely specified. Variability arises due to incomplete specification of the task. This is convolved with the fact that the guidelines are subject to differing interpretations by different human contributors. Some characteristics of human computation tasks are:

• Task guidelines are incomplete: A task can span a wide set of domains, not all of which are anticipated at the beginning of the task. This leads to incompleteness in guidelines, as well as varying levels of performance depending upon the contributor's expertise in that domain. Consider the task of establishing relevance of an arbitrary search query [Alonso, Kazai and Mizzaro, 2011].

• Task guidelines are not precise: Consider, for example, the task of declaring if an image is unsuitable for a social network website. Not only is it hard to write down all the factors that go into making an image offensive, it is hard to communicate those factors to human contributors with vastly different predispositions. The guidelines usually rely upon shared common sense knowledge and cultural knowledge.

• Validity data is expensive or unavailable: An oracle that provides the true answer for any given question is usually unavailable. Sometimes, for a small subset of gold questions, we have answers from another independent source. This can be useful in making estimates of validity, subject to the degree that the gold questions are representative of the set of questions. These gold questions can be very useful for training and feedback to the human contributors; however, we have to ensure their representativeness in order to make warranted claims regarding validity.

Each of the above might be true to a different degree for different human computation tasks. These factors are similar to the concerns of behavioral and social scientists in using humans as measuring instruments.

3. CONTENT ANALYSIS AND CODING IN THE BEHAVIORAL SCIENCES

In the social sciences, content analysis is a methodology for studying the content of communication [Berelson, 1952; Krippendorff, 2004]. Coding of subjective information is a significant source of empirical data in the social and behavioral sciences, as it allows techniques of quantitative research to be applied to complex phenomena. These data are typically generated by trained human observers who record or transcribe textual, pictorial or audible matter in terms suitable for analysis. This task is called coding, which involves assigning categorical, ordinal or quantitative responses to units of communication.

An early example of a content analysis based study is "Do newspapers now give the news?" [Speed, 1893], which tried to show that the coverage of religious, scientific and literary matters was dropped in favor of gossip, sports and scandals between 1881 and 1893 by New York newspapers. Conclusions from such data can only be trusted if the reading of the textual data, as well as of the research results, is replicable elsewhere, that is, if the coders demonstrably agree on what they are talking about. Hence, the coders need to demonstrate the trustworthiness of their data by measuring its reproducibility. To perform reproducibility tests, additional data are needed, obtained by duplicating the research under various conditions. Reproducibility is established by independent agreement between different but functionally equal measuring devices, for example, by using several coders with diverse personalities. The reproducibility of coding has been used for comparing consistency of medical diagnosis [e.g., Koran, 1975], for drawing conclusions from meta-analysis of research findings [e.g., Morley et al., 1999], for testing industrial reliability [Meeker and Escobar, 1998], and for establishing the usefulness of a clinical scale [Hughes et al., 1982].

3.1 Relationship between Reproducibility and Validity

• Lack of reproducibility limits the chance of validity: If the coding results are a product of chance, they may well include a valid account of what was observed, but researchers would not be able to identify that account to a degree better than chance. Thus, the more unreliable a procedure, the less likely it is to result in data that lead to valid conclusions.

• Reproducibility does not guarantee validity: Two observers of the same event who hold the same conceptual system, prejudice, or interest may well agree on what they see but still be objectively wrong, based on some external criterion. Thus a reliable process may or may not lead to valid outcomes.

In some cases, validity data might be so hard to obtain that one has to contend with reproducibility. In tasks such as interpretation and transcription of complex textual matter, suitable accuracy standards are not easy to find. Because interpretations can only be compared to interpretations, attempts to measure validity presuppose the privileging of some interpretations over others, and this puts any claims regarding validity on epistemologically shaky grounds. In some tasks, like psychiatric diagnosis, even reproducibility is hard to attain for some questions. Aboraya et al. [2006] review the reproducibility of psychiatric diagnosis. Lack of reproducibility has been reported for judgments of schizophrenia and affective disorder [Goodman et al., 1984], calling such diagnosis into question.

3.2 Relationship with Chance Agreement

In this section, we look at some properties of chance agreement and its relationship to reproducibility. Given two coding schemes for the same phenomenon, the one with fewer categories will have higher chance agreement. For example, in reCAPTCHA [von Ahn et al., 2008], two independent humans are shown an image containing text and asked to transcribe it. Assuming that there is no collusion, chance agreement, i.e., two different humans typing in the same word or phrase by chance, is very small. However, in a task in which there are only two possible answers, e.g., true and false, the probability of chance agreement between two answers is 0.5.

If a disproportionate amount of data falls under one category, then the expected chance agreement is very high, so in order to demonstrate high reproducibility, even higher observed agreement is required [Feinstein and Cicchetti 1990; Di Eugenio and Glass 2004].

Consider a task of rating a proposition as true or false. Let p be the probability of the proposition being true. An implementation of chance agreement is the following: toss a biased coin with the same odds, i.e., p is the probability that it turns up heads, and declare the proposition to be true when the coin lands heads. Now, we can simulate n judgments by tossing the coin n times. Let us look at the properties of unanimous agreement between two judgments. The likelihood of a chance agreement on true is p². Independent of this agreement, the probability of the proposition being true is p; therefore, the accuracy of this chance agreement on true is p³. By design, these judgments do not contain any information other than the a priori distribution across answers. Such data has close to zero reproducibility; however, sometimes it can show up in surprising ways when looked at through the lens of accuracy.

For example, consider the task of an airport agent declaring a bag as safe or unsafe for boarding on the plane. A bag can be unsafe if it contains toxic or explosive materials that could threaten the safety of the flight. Most bags are safe. Let us say that one in a thousand bags is potentially unsafe. Random coding would allow two agents to jointly assign "safe" 99.8% of the time, and since 99.9% of the bags are safe, this agreement would be accurate 99.7% of the time! This leads to the surprising result that when data are highly skewed, the coders may agree on a high proportion of items while producing annotations that are accurate, but of low reproducibility. When one category is very common, high accuracy and high agreement can also result from indiscriminate coding. The test for reproducibility in such cases is the ability to agree on the rare categories. In the airport bag classification problem, while chance accuracy on safe bags is high, chance accuracy on unsafe bags is extremely low, 10⁻⁷%. In practice, the costs of errors vary: mistakenly classifying a safe bag as unsafe causes far less damage than classifying an unsafe bag as safe.

In this case, it is dangerous to consider an averaged accuracy score, as different errors do not count equally: a chance process that does not add any information has an average accuracy which is higher than 99.7%, most of which reflects the original bias in the distribution of safe and unsafe bags. A misguided interpretation of accuracy or a poor estimate of accuracy can be less informative than reproducibility.
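The chance-agreement argument lends itself to a small simulation. The sketch below is a minimal illustration under the stated assumptions: two coders answer independently of the question with probability p of saying true. It estimates the overall agreement, the agreement on true, and the fraction of items where a unanimous true verdict is also correct; analytically these are p² + (1 − p)², p², and p³, which with p = 0.999 reproduce the 99.8% and 99.7% figures of the airport example.

```python
import random

def simulate_chance_agreement(p, n=1_000_000, seed=0):
    """Two 'coders' answer true with probability p, ignoring the question.

    Returns estimates of (overall agreement, agreement on true, fraction of
    items where both say true and the item really is true), whose expected
    values are p**2 + (1 - p)**2, p**2 and p**3 respectively.
    """
    rng = random.Random(seed)
    agree = agree_true = true_and_correct = 0
    for _ in range(n):
        truth = rng.random() < p   # the proposition is true with probability p
        a = rng.random() < p       # coder A answers by tossing the biased coin
        b = rng.random() < p       # coder B does the same, independently
        if a == b:
            agree += 1
            if a:
                agree_true += 1
                true_and_correct += truth
    return agree / n, agree_true / n, true_and_correct / n

# Airport example: one bag in a thousand is unsafe, i.e. P(safe) = 0.999.
print(simulate_chance_agreement(0.999))
# roughly (0.998, 0.998, 0.997): high agreement and high accuracy,
# yet the judgments carry no information beyond the prior.
```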
3.3 Reproducibility and Experiment Design

The focus on reproducibility has implications for the design of task and instruction materials. Krippendorff [2004] argues that any study using observed agreement as a measure of reproducibility must satisfy the following requirements:

• It must employ exhaustively formulated, clear, and usable guidelines;

• It must use clearly specified criteria concerning the choice of contributors, so that others may use such criteria to reproduce the data;

• It must ensure that the contributors that generate the data used to measure reproducibility work independently of each other. Only if such independence is assured can covert consensus be ruled out and the observed agreement be explained in terms of the given guidelines and the task.

The last point cannot be stressed enough. There are potential benefits from multiple contributors collaborating, but data generated in this manner neither ensure reproducibility nor reveal its extent. In groups like these, humans are known to negotiate and to yield to each other in quid pro quo exchanges, with prestigious group members dominating the outcome [see, for example, Esser, 1998]. This makes the results of collaborative human computation a reflection of the social structure of the group, which is nearly impossible to communicate to other researchers and replicate. The data generated by collaborative work are akin to data generated by a single observer, while reproducibility requires at least two independent observers. To substantiate the contention that collaborative coding is superior to coding by separate individuals, a researcher would have to compare the data generated by at least two such groups and two individuals, each working independently.

A model in which the coders work independently, but consult each other when unanticipated problems arise, is also problematic. A key source of these unanticipated problems is the fact that the writers of the coding instructions did not anticipate all the possible ways of expressing the relevant matter. Ideally, these instructions should include every applicable rule on which agreement is being measured. However, discussing emerging problems could create re-interpretations of the existing instructions in ways that are a function of the group and not communicable to others. In addition, as the instructions become reinterpreted, the process loses its stability: data generated early in the process use instructions that differ from those used later.

In addition to the above, Craggs and McGee Wood [2005] discourage researchers from testing their coding instructions on data from more than one domain. Given that the reproducibility of the coding instructions depends to a great extent on how complications are dealt with, and that every domain displays different complications, the sample should contain sufficient examples from all domains which have to be annotated according to the instructions.

Even the best coding instructions might not specify all possible complications. Besides the set of desired answers, the coders should also be allowed to skip a question. If the coders cannot prove any of the other answers is correct, they skip that question. For any other answer, the instructions define an a priori model of agreement on that answer, while skip represents the unanticipated properties of questions and coders. For instance, some questions might be too difficult for certain coders. Providing the human contributors with an option to skip is a nod to the openness of the task, and can be used to explore the poorly defined parts of the task that were not anticipated at the outset. Additionally, the skip votes can be removed from the analysis for computing reproducibility, as we do not have an expectation of agreement on them [Krippendorff, 2012, personal communication].
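A minimal sketch of that last recommendation, removing skip votes before reproducibility is computed, is given below. The per-question dictionary layout and the use of None to mark a skip are assumptions made for illustration.

```python
def drop_skips(judgments):
    """Drop skip votes before computing reproducibility.

    judgments: dict mapping question id -> dict of contributor -> answer,
    with None recording a skip. Questions left with fewer than two answers
    are excluded, since they contribute no pairable judgments.
    """
    cleaned = {}
    for question, votes in judgments.items():
        kept = {who: answer for who, answer in votes.items() if answer is not None}
        if len(kept) >= 2:
            cleaned[question] = kept
    return cleaned

data = {"q1": {"a": "true", "b": None, "c": "true"},
        "q2": {"a": None,   "b": None, "c": "false"}}
print(drop_skips(data))  # {'q1': {'a': 'true', 'c': 'true'}}
```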
4. MEASURING REPRODUCIBILITY

In measurement theory, reliability is the more general guarantee that the data obtained are independent of the measuring event, instrument or person. There are three different kinds of reliability:

Stability: measures the degree to which a process is unchanging over time. It is measured by agreement between multiple trials of the same measuring or coding process. This is also called the test-retest condition, in which one observer does a task and, after some time, repeats the task again. This measures intra-observer reliability. A similar notion is internal consistency [Cronbach, 1951], which is the degree to which the answers on the same task are consistent. Surveys are designed so that the subsets of similar questions are known a priori, and internal consistency metrics are based on correlation between these answers.

Reproducibility: measures the degree to which a process can be replicated by different analysts working under varying conditions, at different locations, or using different but functionally equivalent measuring instruments. Reproducible data, by definition, are data that remain constant throughout variations in the measuring process [Kaplan and Goldsen, 1965].

Accuracy: measures the degree to which the process produces valid results. To measure accuracy, we have to compare the performance of contributors with the performance of a procedure that is known to be correct. In order to generate estimates of accuracy, we need accuracy data, i.e., valid answers to a representative sample of the questions. Estimating accuracy gets harder in cases where the heterogeneity of the task is poorly understood.

The next section focuses on reproducibility.

4.1 Reproducibility

There are two different aspects of reproducibility: inter-rater reliability and inter-method reliability. Inter-rater reliability focuses on reproducibility as agreement between independent raters, and inter-method reliability focuses on the reliability of different measuring devices. For example, in survey and test design, parallel forms reliability is used to create multiple equivalent tests, of which more than one are administered to the same human. We focus on inter-rater reliability as the measure of reproducibility typically applicable to human computation tasks, where we generate judgments from multiple humans per question. The simplest form of inter-rater reliability is percent agreement; however, it is not suitable as a measure of reproducibility as it does not correct for chance agreement.

For an extensive survey of measures of reproducibility, refer to Popping [1988] and Artstein and Poesio [2007]. The different coefficients of reproducibility differ in the assumptions they make about the properties of coders, judgments and units. Scott's π [1955] is applicable to two raters and assumes that the raters have the same distribution of responses, whereas Cohen's κ [1960; 1968] allows for a separate distribution of chance behavior per coder. Fleiss' κ [1971] is a generalization of Scott's π for an arbitrary number of raters. All of these coefficients of reproducibility correct for chance agreement similarly. First, they find how much agreement is expected by chance: let us call this value Ae. The data from the coding give a measure of the observed agreement, Ao. The various inter-rater reliabilities measure the proportion of the possible agreement beyond chance that was actually observed:

S, π, κ = (Ao − Ae) / (1 − Ae)

Krippendorff's α [1970; 2004] is a generalization of many of these coefficients. It generalizes the above metrics to an arbitrary number of raters, not all of whom have to answer every question. Krippendorff's α has the following desirable characteristics:

• It is applicable to an arbitrary number of contributors and invariant to the permutation and selective participation of contributors. It corrects itself for varying amounts of reproducibility data.

• It constitutes a numerical scale between at least two points with sensible reproducibility interpretations, 0 representing absence of agreement, and 1 indicating perfect agreement.

• It is applicable to several scales of measurement: ordinal, nominal, interval, ratio, and more.

Alpha's general form is:

α = 1 − Do / De

where Do is the observed disagreement:

Do = (1/n) Σ_c Σ_k o_ck δ²_ck

and De is the disagreement one would expect when the answers are attributable to chance rather than to the properties of the questions:

De = (1/(n(n − 1))) Σ_c Σ_k n_c n_k δ²_ck

Here o_ck counts the pairs of answers c and k given to the same question (the coincidence matrix), n_c = Σ_k o_ck, and n = Σ_c n_c is the total number of pairable judgments. The δ²_ck term is the distance metric for the scale of the answer space. For a nominal scale, δ²_ck = 0 if c = k, and 1 if c ≠ k.
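The nominal-scale case of these formulas is compact enough to spell out in code. The sketch below is an illustrative implementation, assuming nominal answers grouped per question; for real analyses a vetted library implementation should be preferred.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data: alpha = 1 - Do/De.

    units: an iterable of lists, one list of answers per question.
    Questions with fewer than two answers contribute no pairable
    judgments and are ignored, which is how alpha handles missing data.
    """
    o = Counter()                          # coincidence matrix o_ck
    for answers in units:
        m = len(answers)
        if m < 2:
            continue
        for c, k in permutations(answers, 2):
            o[(c, k)] += 1.0 / (m - 1)     # each pairable value weighted by 1/(m-1)
    n_c = Counter()                        # marginals n_c = sum over k of o_ck
    for (c, _), weight in o.items():
        n_c[c] += weight
    n = sum(n_c.values())                  # total number of pairable judgments
    if n <= 1:
        raise ValueError("need at least two pairable judgments")
    d_o = sum(w for (c, k), w in o.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e   # all-identical data: define alpha = 1

# Four questions; the second question has a missing judgment.
units = [["yes", "yes", "yes"], ["yes", "no"],
         ["no", "no", "no"], ["yes", "yes", "no"]]
print(round(krippendorff_alpha_nominal(units), 3))  # 0.333
```

In the example, the two unanimously answered questions pull α up while the split questions pull it down, even though raw percent agreement on the same data would look considerably more flattering.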
4.2 Statistical Significance

The goal of measuring reproducibility is to ensure that the data do not deviate too much from perfect agreement, not that the data are different from chance agreement. In the definition of α, chance agreement is one of the two anchors for the agreement scale, the other, more important, reference point being that of perfect agreement. As the distribution of α is unknown, confidence intervals on α are obtained from an empirical distribution generated by bootstrapping, that is, by drawing a large number of subsamples from the reproducibility data and computing α for each. This gives us a probability distribution of hypothetical α values that could occur within the constraints of the observed data. This can be used to calculate the probability of failing to reach the smallest acceptable reproducibility αmin, q(α < αmin), or a two-tailed confidence interval for a chosen level of significance.
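The bootstrap can be sketched as follows. The resampling unit (whole questions), the number of replicates, and the reuse of the krippendorff_alpha_nominal function from the earlier sketch are illustrative choices only, and the bootstrap procedure described by Krippendorff may differ in its details.

```python
import random

def bootstrap_alpha(units, alpha_fn, alpha_min=0.667, trials=2000, seed=0):
    """Empirical distribution of alpha via bootstrap over questions.

    Resamples whole questions with replacement, recomputes alpha for each
    subsample, and returns (estimated q = P(alpha < alpha_min),
    2.5th and 97.5th percentiles of the bootstrap distribution).
    """
    rng = random.Random(seed)
    units = list(units)
    alphas = []
    for _ in range(trials):
        sample = [rng.choice(units) for _ in units]
        try:
            alphas.append(alpha_fn(sample))
        except ValueError:          # a degenerate resample with no pairable data
            continue
    alphas.sort()
    q = sum(a < alpha_min for a in alphas) / len(alphas)
    lo = alphas[int(0.025 * len(alphas))]
    hi = alphas[int(0.975 * len(alphas)) - 1]
    return q, (lo, hi)

# Reusing krippendorff_alpha_nominal and `units` from the previous sketch:
# print(bootstrap_alpha(units, krippendorff_alpha_nominal))
```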
4.3 Sampling Considerations

To generate an estimate of the reproducibility of a population, we need to generate a representative sample of the population. The sampling needs to ensure that we have enough units from the rare categories of questions in the data. Assuming that α is normally distributed, Bloch and Kraemer [1989] provide a suggestion for the minimum number of questions from each category to be included in the sample, Nc:

Nc = z² (1 + αmin)(3 − αmin) / (4 pc (1 − pc)(1 − αmin)) − αmin

where
• pc is the smallest estimated proportion of the category c in the population,
• αmin is the smallest acceptable reproducibility below which data will have to be rejected as unreliable, and
• z is the desired level of statistical significance, represented by the corresponding z value for one-tailed tests.

This is a simplification, as it assumes that α is normally distributed and that the data are binary, and it does not account for the number of raters. A general description of sampling requirements is an open problem.
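Read literally, the sample-size suggestion can be turned into a small helper. The grouping of terms below follows the formula as printed here and should be checked against Bloch and Kraemer [1989] before being relied on; the default z of 1.645 corresponds to a one-tailed 95% level and is an illustrative choice.

```python
import math

def min_questions_per_category(p_c, alpha_min=0.667, z=1.645):
    """Minimum number of questions from a category with proportion p_c,
    following the Bloch and Kraemer [1989] suggestion as printed above:
    Nc = z^2 (1 + a)(3 - a) / (4 p_c (1 - p_c)(1 - a)) - a, with a = alpha_min.
    """
    a = alpha_min
    n_c = z ** 2 * (1 + a) * (3 - a) / (4 * p_c * (1 - p_c) * (1 - a)) - a
    return math.ceil(n_c)

# A category covering 10% of the population, with alpha_min = 0.667:
print(min_questions_per_category(0.10))  # about 88 questions
```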
4.4 Acceptable Levels of Reproducibility

Fleiss [1981] and Krippendorff [2004] present guidelines for what should be acceptable values of reproducibility, based on surveying the empirical research using these measures. Krippendorff suggests:

• Rely on variables with reproducibility above α = 0.800. Additionally, do not accept data if the confidence interval reaches below the smallest acceptable reproducibility, αmin = 0.667, or ensure that the probability, q, of failing to reach the smallest acceptable reproducibility αmin is reasonably small, e.g., q < 0.05.

• Consider variables with reproducibility between α = 0.667 and α = 0.800 only for drawing tentative conclusions.

These are suggestions, and the choice of thresholds of acceptability depends upon the validity requirements imposed on the research results. It is perilous to "game" α by violating the requirements of reproducibility: for example, by removing a subset of data post facto to increase α. Partitioning data by agreement measured after the experiment will not lead to valid conclusions.

4.5 Other Measures of Quality

Ipeirotis, Provost and Wang [2010] present an information-theoretic quality score, which measures the quality of a contributor by comparing their score to that of a spammer who is trying to take advantage of chance accuracy. In that regard, it is a similar metric to Krippendorff's alpha, and it additionally models a notion of cost:

QualityScore = 1 − ExpCost(Contributor) / ExpCost(Spammer)

Turkontrol [Dai, Mausam and Weld, 2010] uses both a model of utility and a model of quality, applying decision-theoretic control to make trade-offs between quality and utility for workflow control.

Le, Edmonds, Hester and Biewald [2010] develop a gold standard based quality assurance framework that provides direct feedback to the workers and targets specific worker errors. This approach requires an extensive manually generated collection of gold data. Oleson et al. [2011] further develop this approach to include pyrite, which are programmatically generated gold questions on which contributors are likely to make an error, for example by mutating data so that it is no longer valid. These are very useful for training, feedback and protection against spammers, but they do not reveal the accuracy of the results. The gold questions, by design, are not representative of the original set of questions. This leads to wide error bars on the accuracy estimates, and it might be valuable to measure the reproducibility of results.

5. CONCLUSIONS

We describe reproducibility as a necessary but not sufficient requirement for the results of human computation. We might additionally require the results to have high validity or high utility, but our ability to measure validity or utility with confidence is limited if the data are not reproducible. Additionally, a focus on reproducibility has implications for the design of task and instructions, as well as for the communication of the results. It is humbling how often the initial understanding of the task and guidelines turns out to lack reproducibility. We suggest ensuring, measuring and communicating reproducibility of human computation tasks.

6. ACKNOWLEDGMENTS

The author would like to thank Jamie Taylor, Reilly Hayes, Stefano Mazzocchi, Mike Shwe, Ben Hutchinson, Al Marks, Tamsyn Waterhouse, Peng Dai, Ozymandias Haynes, Ed Chi and Panos Ipeirotis for insightful discussions on this topic.

7. REFERENCES

[1] Aboraya, A., Rankin, E., France, C., El-Missiry, A., and John, C. The Reliability of Psychiatric Diagnosis Revisited. Psychiatry (Edgmont), 3(1):41-50, January 2006.
[2] Alonso, O., Rose, D. E., and Stewart, B. Crowdsourcing for relevance evaluation. SIGIR Forum, 42(2):9-15, 2008.
[3] Alonso, O., Kazai, G., and Mizzaro, S. Crowdsourcing for search engine evaluation. Springer, 2011.
[4] Armstrong, D., Gosling, A., Weinman, J., and Marteau, T. The place of inter-rater reliability in qualitative research: An empirical study. Sociology, 31(3):597-606, 1997.
[5] Artstein, R. and Poesio, M. Inter-coder agreement for computational linguistics. Computational Linguistics, 2007.
[6] Bennett, E. M., Alpert, R., and Goldstein, A. C. Communications through limited questioning. Public Opinion Quarterly, 18(3):303-308, 1954.
[7] Berelson, B. Content analysis in communication research. Free Press, New York, 1952.
[8] Bloch, D. A. and Kraemer, H. C. 2 x 2 kappa coefficients: Measures of agreement or association. Biometrics, 45(1):269-287, 1989.
[9] Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In Proceedings of the 28th ACM SIGMOD International Conference on Management of Data, Vancouver, 2008.
[10] Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37-46, 1960.
[11] Cohen, J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213-220, 1968.
[12] Cooper, S., Khatib, F., Treuille, A., Barbero, J., Lee, J., Beenen, M., Leaver-Fay, A., Baker, D., and Popovic, Z. Predicting protein structures with a multiplayer online game. Nature, 466(7307):756-760, 2010.
[13] Cronbach, L. J. Coefficient alpha and the internal structure of tests. Psychometrika, 16:297-334, 1951.
[14] Craggs, R. and McGee Wood, M. A two-dimensional annotation scheme for emotion in dialogue. In Proceedings of the AAAI Spring Symposium on Exploring Attitude and Affect in Text, Stanford, 2004.
[15] Dai, P., Mausam, and Weld, D. S. Decision-Theoretic Control of Crowd-Sourced Workflows. AAAI 2010.
[16] Di Eugenio, B. and Glass, M. The kappa statistic: A second look. Computational Linguistics, 30(1):95-101, 2004.
[17] Esser, J. K. Alive and Well after 25 Years: A Review of Groupthink Research. Organizational Behavior and Human Decision Processes, 73(2-3):116-141, 1998.
[18] Feinstein, A. R. and Cicchetti, D. V. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6):543-549, 1990.
[19] Fleiss, J. L. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378-382, 1971.
[20] Goodman, A. B., Rahav, M., Popper, M., Ginath, Y., and Pearl, E. The reliability of psychiatric diagnosis in Israel's Psychiatric Case Register. Acta Psychiatrica Scandinavica, 69(5):391-397, May 1984.
[21] Hayes, A. F. and Krippendorff, K. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1):77-89, 2007.
[22] Hughes, C. D., Berg, L., Danziger, L., Coben, L. A., and Martin, R. L. A new clinical scale for the staging of dementia. The British Journal of Psychiatry, 140:56, 1982.
[23] Ipeirotis, P., Provost, F., and Wang, J. Quality management on Amazon Mechanical Turk. In KDD-HCOMP '10, 2010.
[24] Krippendorff, K. Bivariate agreement coefficients for reliability of data. Sociological Methodology, 2:139-150, 1970.
[25] Krippendorff, K. Content Analysis: An Introduction to Its Methodology, Second edition. Sage, Thousand Oaks, 2004.
[26] Kochhar, S., Mazzocchi, S., and Paritosh, P. The Anatomy of a Large-Scale Human Computation Engine. In Proceedings of the Human Computation Workshop at the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2010, Washington D.C., 2010.
[27] Koran, L. M. The reliability of clinical methods, data and judgments (parts 1 and 2). New England Journal of Medicine, 293(13/14), 1975.
[28] Kittur, A., Chi, E. H., and Suh, B. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, Florence, 2008.
[29] Mason, W. and Suri, S. Conducting Behavioral Research on Amazon's Mechanical Turk. Working Paper, Social Science Research Network, 2010.
[30] Meeker, W. and Escobar, L. Statistical Methods for Reliability Data. Wiley, 1998.
[31] Morley, S., Eccleston, C., and Williams, A. Systematic review and meta-analysis of randomized controlled trials of cognitive behavior therapy and behavior therapy for chronic pain in adults, excluding headache. Pain, 80:1-13, 1999.
[32] Nowak, S. and Ruger, S. How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation. In Multimedia Information Retrieval, pages 557-566, 2010.
[33] Oleson, D., Hester, V., Sorokin, A., Laughlin, G., Le, J., and Biewald, L. Programmatic Gold: Targeted and Scalable Quality Assurance in Crowdsourcing. In HCOMP '11: Proceedings of the Third AAAI Human Computation Workshop, 2011.
[34] Paritosh, P. and Taylor, J. Freebase: Curating Knowledge at Scale. To appear in the 24th Conference on Innovative Applications of Artificial Intelligence, IAAI 2012.
[35] Popping, R. On agreement indices for nominal data. Sociometric Research: Data Collection and Scaling, 90-105, 1988.
[36] Scott, W. A. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19(3):321-325, 1955.
[37] Snow, R., O'Connor, B., Jurafsky, D., and Ng, A. Y. Cheap and fast, but is it good?: evaluating non-expert annotations for natural language tasks. In EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2008.
[38] Sorokin, A. and Forsyth, D. Utility data annotation with Amazon Mechanical Turk. In First International Workshop on Internet Vision, CVPR 08, 2008.
[39] Speed, J. G. Do newspapers now give the news? Forum, August 1893.
[40] von Ahn, L. Games With A Purpose. IEEE Computer Magazine, June 2006, pages 96-98.
[41] von Ahn, L. and Dabbish, L. Labeling Images with a Computer Game. ACM Conference on Human Factors in Computing Systems, CHI 2004, pages 319-326.
[42] von Ahn, L., Maurer, B., McMillen, C., Abraham, D., and Blum, M. reCAPTCHA: Human-Based Character Recognition via Web Security Measures. Science, September 12, 2008, pages 1465-1468.

Copyright © 2012 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. CrowdSearch 2012 workshop at WWW 2012, Lyon, France.