
New trends in NLP applications

Tutorial given at RANLP 2015 in Hissar, Bulgaria

Recent years have seen lots of changes in the field of computational linguistics, most of them due to the widespread use of the Internet and the benefits and problems it brings. The first part of this tutorial will discuss these changes and will focus on crowdsourcing and how it influenced the creation of annotated data.

Annotation of data employed to train and test NLP methods used to be the task of language experts who had a good understanding of the linguistic phenomena to be tackled. Given that a large number of people now have access to the Internet, crowdsourcing has become an alternative way of obtaining annotated data. The core idea of crowdsourcing is that it is possible to design tasks that can be completed by non-experts and that the outputs of these tasks can be combined to obtain high-quality linguistic annotation, which would normally be produced by experts. Examples of how crowdsourcing was employed in computational linguistics will be given.

Big data is another trend in computational linguistics, as researchers rely on more and more data to improve the results of their methods. The second part of the tutorial will introduce the MapReduce programming model and show how it has been used in language processing. Alongside processing larger quantities of data, the field of computational linguistics has successfully applied deep learning to various tasks, improving their accuracy. An introduction to deep learning will be provided, followed by examples of how it has been applied to tasks such as learning semantic representations, sentiment analysis and machine translation evaluation.


  1. 1. New trends in NLP applications Constantin Orasan University of Wolverhampton, UK http://www.wlv.ac.uk/~in6093/ 6th September 2015 RANLP 2015, HISSAR, BULGARIA 1/100
  2. 2. A better title: Constantin’s subjective view of some of the interesting trends in NLP that can be presented in 3 hours 6th September 2015 RANLP 2015, HISSAR, BULGARIA 2/100
  3. 3. The latest trend in NLP is … natural language understanding 6th September 2015 RANLP 2015, HISSAR, BULGARIA 3/100
  4. 4. Not understanding like in … “Open the pod bay doors, please Hal...” Jurafsky, D., & Martin, J. H. (2009) Speech and language processing (2nd ed.). Pearson Prentice Hall. More information from http://www.cs.colorado.edu/~martin/slp.html 6th September 2015 RANLP 2015, HISSAR, BULGARIA 4/100
  5. 5. NLU for specific applications • Translate texts between two languages • Simplify texts • Find out the opinion/sentiment of texts • Find out the entities mentioned in texts and the relations between them • Answer questions from large collections of documents • Help customers navigate knowledge databases • Filter spam in social media • Profile people • Summarise texts • …. Are these new? 6th September 2015 RANLP 2015, HISSAR, BULGARIA 5/100
  6. 6. Then: 1993 Source: https://en.wikipedia.org/wiki/On_the_Internet,_nobody_knows_you%27re_a_dog 6th September 2015 RANLP 2015, HISSAR, BULGARIA 6/100
  7. 7. Now Source: http://www.kdnuggets.com/2015/08/cartoon-big-data-internet-dog-question.html 6th September 2015 RANLP 2015, HISSAR, BULGARIA 7/100
  8. 8. The technology advanced The Internet evolved Web 2.0 Openness More access Better hardware 6th September 2015 RANLP 2015, HISSAR, BULGARIA 8/100
  9. 9. NLP is approaching maturity More interest from companies in developing and deploying working applications Interest from users in employing NLP technologies in their companies “NLP for masses” More datasets available, more tools available 6th September 2015 RANLP 2015, HISSAR, BULGARIA 9/100
  10. 10. Structure of the tutorial 1. Text analytics: example of an established field with impact on industry 2. Crowdsourcing 3. Processing large quantities of data 4. Deep learning 6th September 2015 RANLP 2015, HISSAR, BULGARIA 10/100
  11. 11. RANLP 2015, HISSAR, BULGARIA 11/100 Text analytics 6th September 2015 RANLP 2015, HISSAR, BULGARIA 11/100
  12. 12. Text analytics – from users’ perspective From Grimes, S. (2014). Text / Content Analytics 2014 : User Perspectives on Solutions and Providers. Retrieved from www.altaplana.com Text analytics = software and transformational processes that uncover business value in “unstructured” text. Text analytics applies statistical, linguistic, machine learning, and data analysis and visualization techniques to identify and extract salient information and insights. The goal is to inform decision-making and support business optimization. Survey of 220 users of text analytics tools 6th September 2015 RANLP 2015, HISSAR, BULGARIA 12/100
  13. 13. Text analytics Can benefit from crowdsourcing Requires processing of large quantities of data Needs better ML algorithms It is widely and successfully used by companies Other similarly successful applications are machine translation and virtual personal assistants 6th September 2015 RANLP 2015, HISSAR, BULGARIA 13/100
  14. 14. From Grimes, S. (2014). Text / Content Analytics 2014 : User Perspectives on Solutions and Providers. Retrieved from www.altaplana.com 6th September 2015 RANLP 2015, HISSAR, BULGARIA 14/100
  15. 15. From Grimes, S. (2014). Text / Content Analytics 2014 : User Perspectives on Solutions and Providers. Retrieved from www.altaplana.com 6th September 2015 RANLP 2015, HISSAR, BULGARIA 15/100
  16. 16. From Grimes, S. (2014). Text / Content Analytics 2014 : User Perspectives on Solutions and Providers. Retrieved from www.altaplana.com 6th September 2015 RANLP 2015, HISSAR, BULGARIA 16/100
  17. 17. From Grimes, S. (2014). Text / Content Analytics 2014 : User Perspectives on Solutions and Providers. Retrieved from www.altaplana.com 6th September 2015 RANLP 2015, HISSAR, BULGARIA 17/100
  18. 18. Comments on the overall experience • It is a messy business, but invaluable if there is no other information available. • It gives us an overview of the data that we could not achieve without it. • I have been doing text analytics since 1984, and I have yet to find an environment that meets my requirements for knowledge extraction. • When applied properly and when its limits are understood, it works quite well. • With access to proper info, I can generate a PhD level analysis in one day. • We annotate incoming text against our taxonomy and then use the annotations as the basis of text analytics as well as search. • As with any “adolescent” technology, there is no single end-to-end product that finds, analyzes, and visualizes all available data sources. • Accuracy needs improvement. Tools need to be customized to specific business cases. • Still need a human to interpret context, inference, etc. • It is (relatively) easy to apply algorithms. It is difficult to assess the accuracy of the results or to translate them into strategic insight. • Text content analytics is in its early infancy, and there is a long road ahead. From Grimes, S. (2014). Text / Content Analytics 2014 : User Perspectives on Solutions and Providers. Retrieved from www.altaplana.com 6th September 2015 RANLP 2015, HISSAR, BULGARIA 18/100
  19. 19. Technology-related growth drivers1 Open source: lowers the barriers to technology adoption and enables focusing on building higher-level, more specific applications The API economy: enables easier adoption of technologies Data availability: more data than ever is available to be analysed and to train our systems Synthesis: as different technologies become mature they lead to more complex systems and more automation 1 Adapted from Grimes, S. (2014). Text / Content Analytics 2014 : User Perspectives on Solutions and Providers. Retrieved from www.altaplana.com where they are presented from the perspective of text analytics 6th September 2015 RANLP 2015, HISSAR, BULGARIA 19/100
  20. 20. NLP meets the cloud • Software as a Service (SaaS) is a very popular way of giving access to software • The software is run in the cloud and users pay some kind of subscription to access it • Great way to develop (commercial) NLP applications that mash up information from several services • Can lead to scalable applications • There are already several established providers of APIs that allow language processing (usually branded text analytics) • Difficult to assess how accurate these tools are • “don’t try to compete with what’s there, but build something new using it.”1 1 Dale, R. (2015). NLP meets the cloud. Natural Language Engineering, 21(04), 653–659. http://doi.org/10.1017/S1351324915000200 6th September 2015 RANLP 2015, HISSAR, BULGARIA 20/100
  21. 21. “text analytics has come of age”1 Is data science the next big thing (or is it already the big thing)? 1Text Analytics: The Next Generation of Big Data, http://insidebigdata.com/2015/06/05/text-analytics-the-next-generation-of-big-data/ 6th September 2015 RANLP 2015, HISSAR, BULGARIA 21/100
  22. 22. RANLP 2015, HISSAR, BULGARIA 22/100 Crowdsourcing 6th September 2015 RANLP 2015, HISSAR, BULGARIA 22/100
  23. 23. Crowdsourcing Crowdsourcing = the act of delegating a task to a large diffuse group, usually without substantial monetary compensation1 Has developed largely as a result of Web 2.0 and increasing access to the Internet by the masses “distributed labor networks are using the Internet to exploit the spare processing power of millions of human brains”1 Wikipedia is considered one of the most successful projects using this approach Embraced by the research community and industry It is not outsourcing, but crowdsourcing 1Jeff Howe (June 2006). The Rise of Crowdsourcing. Wired. Available at http://www.wired.com/wired/archive/14.06/crowds.html 6th September 2015 RANLP 2015, HISSAR, BULGARIA 23/100
  24. 24. Crowdsourcing in NLP Used to: • Create gold standards • Collect human judgements • Involve the community in projects (e.g. competitions) Used increasingly in NLP1 1 Fort, K., Adda, G., & Cohen, K. B. (2011). Amazon Mechanical Turk: Gold Mine or Coal Mine? Computational Linguistics, 37(2), 413–420. http://doi.org/10.1162/COLI_a_00057 6th September 2015 RANLP 2015, HISSAR, BULGARIA 25/100
  25. 25. Standard annotation flow Linguistic analysis of the problem tackled • Annotation guidelines produced → Annotation process • Annotated dataset produced → Inter-annotator agreement calculated • Disagreements discussed → Revision of annotation guidelines (and the cycle repeats) Language experts are involved in all stages 6th September 2015 RANLP 2015, HISSAR, BULGARIA 26/100
  26. 26. The crowdsourcing approach Relies much less on experts Requires decomposing the (annotation) task into simple tasks that do not require linguistic knowledge (e.g. for paraphrasing the expression desert rat ask participants to fill in the gap rat that … desert(s)1) These tasks can be combined to obtain high-quality annotation Requires screening of participants, filtering of noise, validation of data In some cases the tasks are presented as games 1 Nakov, P. (2008). Noun compound interpretation using paraphrasing verbs: Feasibility study. In Proceedings of the 13th international conference on Artificial Intelligence: Methodology, Systems, and Applications, AIMSA '08, pages 103 - 117, Berlin, Heidelberg. Springer-Verlag. 6th September 2015 RANLP 2015, HISSAR, BULGARIA 27/100
  27. 27. Crowdsourcing used for Annotation of data: • label data according to predefined categories • quality can be assessed using inter-annotator agreement Creation of new content: • text created for certain purposes e.g. translation of a sentence, description of an image • validation of the work is more difficult • validation can be decomposed into a series of crowdsourced tasks Obtain subjective information: • in some cases there is more than one correct answer and the opinion of the majority is sought e.g. important features of mobile phones for an IQA system1 1 Konstantinova, N., Orasan, C., & Balage, P. P. (2012). A Corpus-Based Method for Product Feature Ranking for Interactive Question Answering Systems. International Journal of Computational Linguistics and Applications, 3(1), 57 – 70. 6th September 2015 RANLP 2015, HISSAR, BULGARIA 28/100
  28. 28. Source: http://blog.lionbridge.com/enterprise-crowdsourcing/2013/07/22/managed-crowds-help-deliver-on-promise-of-business-crowdsourcing/ The crowdsourcing approach 6th September 2015 RANLP 2015, HISSAR, BULGARIA 29/100
  29. 29. Open Mind Common Sense project One of the first examples of crowdsourcing Project initiated at the MIT Media Lab with the goal of building and utilizing a large common sense knowledge base Since 1999 it has collected more than 1 million English facts from over 15,000 contributors “an attempt to ... harness some of the distributed human computing power of the Internet”1 http://commons.media.mit.edu/en/ (not working August 2015) summarised at https://en.wikipedia.org/wiki/Open_Mind_Common_Sense 6th September 2015 RANLP 2015, HISSAR, BULGARIA 30/100
  30. 30. Teaching computers common sense The slow progress in AI is due to the fact that computers lack common sense1 Common Sense: The mental skills that most people share. Common sense thinking is actually more complex than many of the intellectual accomplishments that attract more attention and respect, because the mental skills we call “expertise” often engage large amounts of knowledge but usually employ only a few types of representations. In contrast, common sense involves many kinds of representations and thus requires a larger range of different skills. It is estimated that humans have hundreds of millions of pieces of common sense knowledge 1Singh, P. (2002). The Open Mind Common Sense Project. KurzweilAI.net. Retrieved from http://web.media.mit.edu/~push/Kurzweil.html 6th September 2015 RANLP 2015, HISSAR, BULGARIA 31/100
  31. 31. Cyc vs OMCS Cyc is another attempt to acquire common sense knowledge, backed by the Cycorp company (http://www.cyc.com/) Employs knowledge engineers to populate the database People from the Cyc team have worked for nearly two decades to build a database of 1.5 million pieces of common knowledge at the cost of many tens of millions of dollars1 1Information from Singh, P. (2002). The Open Mind Common Sense Project. KurzweilAI.net. Representing the situation at the turn of the century 6th September 2015 RANLP 2015, HISSAR, BULGARIA 32/100
  32. 32. Open Mind Common Sense Asks volunteers to provide common knowledge by: • Asking them to fill in templates: A hammer is for ________ or The effect of eating a sandwich is ________ • Giving them a story and asking them to enter knowledge in response: User is prompted with a story: Bob had a cold. Bob went to the doctor. User enters many kinds of knowledge in response: Bob was feeling sick. Bob wanted to feel better. The doctor wore a stethoscope around his neck. • Collecting information longer than one sentence (photo captions, short stories, annotated movies of simple iconic spatial events) • After information is entered the user can be shown an inference the system made, which can be accepted or rejected The participants provided the information using English sentences, which were processed afterwards Peer reviewing was used to ensure the quality of the input 6th September 2015 RANLP 2015, HISSAR, BULGARIA 33/100
  33. 33. Phrase detectives A specially designed interface developed at the University of Essex, UK, used to create a resource for anaphora resolution It is presented as a game with a purpose in which participants collect points The participation is not paid, but at times rewards are given to the most active participants One of the main challenges is how to present the task to non-experts Further reading: Chamberlain, J., Fort, K., Kruschwitz, U., Lafourcade, M., & Poesio, M. (2013). Using Games to Create Language Resources: Successes and Limitations of the Approach. In I. Gurevych & J. Kim (Eds.), The People’s Web Meets NLP (pp. 3–44). Springer Berlin Heidelberg. http://doi.org/10.1007/978-3-642-35085-6_1 6th September 2015 RANLP 2015, HISSAR, BULGARIA 34/100
  34. 34. Source: https://anawiki.essex.ac.uk/ 6th September 2015 RANLP 2015, HISSAR, BULGARIA 35/100
  35. 35. Source: https://anawiki.essex.ac.uk/ 6th September 2015 RANLP 2015, HISSAR, BULGARIA 36/100
  36. 36. Source: https://anawiki.essex.ac.uk/ 6th September 2015 RANLP 2015, HISSAR, BULGARIA 37/100
  37. 37. Source: https://anawiki.essex.ac.uk/ 6th September 2015 RANLP 2015, HISSAR, BULGARIA 38/100
  38. 38. Phrase detectives The interface operates in two modes: • Annotation mode: name the culprit • Validation mode: detectives conference New participants are trained on a gold standard before they progress to real documents Each markable is annotated by 8 players to collect multiple judgements (4 more judgements can be added in case of disagreement) Users are profiled to identify spammers, rate the quality of their work, etc. The quality of the resource produced is considered excellent: in 84% of all annotations the interpretation specified by the majority vote of non-experts was identical to the one assigned by an expert (agreement between experts 94%) Agreement for property 0% and for non-referential 100% 6th September 2015 RANLP 2015, HISSAR, BULGARIA 39/100
  39. 39. Amazon Mechanical Turk Amazon Mechanical Turk (MTurk) is a crowdsourcing Internet marketplace that enables individuals and businesses (known as Requesters) to coordinate the use of human intelligence to perform tasks that computers are currently unable to do.1 Requesters who need tasks completed load HITs (Human Intelligence Tasks) on MTurk, indicating various parameters (how much they are willing to pay, conditions for participants, max time spent, etc.) One of the most used crowdsourcing platforms 1 https://en.wikipedia.org/wiki/Amazon_Mechanical_Turk To find out more about the original mechanical turk http://www.bbc.co.uk/news/magazine-21882456 6th September 2015 RANLP 2015, HISSAR, BULGARIA 40/100
  40. 40. 6th September 2015 RANLP 2015, HISSAR, BULGARIA 41/100
  41. 41. 6th September 2015 RANLP 2015, HISSAR, BULGARIA 42/100
  42. 42. Why MTurk (or similar services)? Little work required to set up the interface (pre-existing templates or fairly simple programming) Use existing infrastructure (hardware, payment) Access to workers (at times tasks are completed extremely fast): in Jan 2011 over 500,000 workers from 190 countries1 But keep in mind that you will have to pay these services in addition to the workers 1 Information from https://en.wikipedia.org/wiki/Amazon_Mechanical_Turk 6th September 2015 RANLP 2015, HISSAR, BULGARIA 43/100
  43. 43. Snow et al. (2008)1 Use crowdsourcing for five tasks: affect recognition, word similarity, recognition of textual entailment, event temporal ordering and word sense disambiguation. The main purpose of the research was to explore the quality of resources created using crowdsourcing Propose a model to assess the reliability of individual workers and correct their biases 1Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263). Association for Computational Linguistics. Retrieved from http://portal.acm.org/citation.cfm?id=1613751 6th September 2015 RANLP 2015, HISSAR, BULGARIA 44/100
  44. 44. Affect recognition Based on the task proposed in Strapparava and Mihalcea (2007) Annotators were given short headlines and gave numeric judgements • between 0 and 100 for 6 emotions: anger, disgust, fear, joy, sadness and surprise • between -100 and 100 to denote the overall positive or negative valence E.g. Outcry at N Korea ‘nuclear test’ (Anger, 30), (Disgust, 30), (Fear, 30), (Joy, 0), (Sadness, 20), (Surprise, 40), (Valence, -50). 100 headlines were selected and each was annotated by 10 annotators From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263). Association for Computational Linguistics. Retrieved from http://portal.acm.org/citation.cfm?id=1613751 6th September 2015 RANLP 2015, HISSAR, BULGARIA 45/100
  45. 45. Affect recognition Pearson correlation was calculated between the labels Individual experts are better than individual non-experts, but adding non-expert annotations to the gold standard improves the quality of the gold standard On average it takes 4 non-expert annotations to achieve the equivalent of the ITA of an expert annotator The numbers are different for each class: 2 for anger, disgust and sadness; 5 for valence; 7 for joy and 9 for surprise. For fear more than 10. From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263). Association for Computational Linguistics. Retrieved from http://portal.acm.org/citation.cfm?id=1613751 6th September 2015 RANLP 2015, HISSAR, BULGARIA 46/100
  46. 46. Affect recognition: system • A bag-of-words unigram system was trained on the crowdsourced data to predict the affect and valence • Explanation for these unexpectedly good results: “individual labelers (including experts) tend to have a strong bias, and since multiple non-expert labellers may contribute to a single set of non-expert annotations, the annotator diversity within the single set of labels may have the effect of reducing annotator bias and thus increasing system performance.” From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263). 6th September 2015 RANLP 2015, HISSAR, BULGARIA 47/100
  47. 47. Word similarity Provide numeric judgements on word similarity for 30 word pairs on a scale [0,10] E.g. {boy, lad} and {noon, string} Crowdsourcing used to collect 10 annotations for the 30 pairs It took less than 11 minutes to complete all the annotations Previous studies reported inter-annotator agreement between 0.958 and 0.97 Annotation obtained using crowdsourcing achieves 0.952 From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263). 6th September 2015 RANLP 2015, HISSAR, BULGARIA 48/100
  48. 48. Recognising textual entailment For a pair of sentences, workers were asked to say whether the second sentence can be inferred from the first Collected 10 annotations for 100 RTE sentence pairs Expert inter-annotator agreement between 91% and 96% Using MTurk 89.7% ITA is observed From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263). 6th September 2015 RANLP 2015, HISSAR, BULGARIA 49/100
  49. 49. Event annotation Annotate verb events from TimeBank corpus with relations strictly before and strictly after 462 verb event pairs were annotated by 10 workers ITA 0.94 using simple voting over 10 annotators No expert ITA available for this task From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263). 6th September 2015 RANLP 2015, HISSAR, BULGARIA 50/100
  50. 50. Bias correction for non-expert annotations • A small number of workers do a large portion of the task • Some of the workers produce low quality annotation, whilst others are biased • Model the reliability and biases of individual workers and correct for them • Train the model on a small gold standard From Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254–263). 6th September 2015 RANLP 2015, HISSAR, BULGARIA 51/100
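A simplified sketch of this kind of correction in Python (an illustration only, with assumed binary labels, data layout and Laplace smoothing; Snow et al.'s actual model recalibrates votes using posteriors estimated from the small gold standard):

    from collections import defaultdict
    from math import log

    def weighted_vote(annotations, gold):
        """annotations: item -> list of (worker, label) pairs, labels in {0, 1}
           gold: item -> true label, for a small calibration subset of items"""
        # Estimate each worker's accuracy on the gold items (Laplace smoothing
        # keeps the weights finite for workers who are always right or wrong)
        correct, seen = defaultdict(int), defaultdict(int)
        for item, true_label in gold.items():
            for worker, label in annotations.get(item, []):
                seen[worker] += 1
                correct[worker] += int(label == true_label)
        accuracy = {w: (correct[w] + 1.0) / (seen[w] + 2) for w in seen}

        # Combine votes with log-odds weights: accurate workers count more,
        # workers at chance level (accuracy 0.5, also the default for workers
        # never seen on gold) count nothing, and systematically wrong workers
        # have their votes flipped
        results = {}
        for item, votes in annotations.items():
            score = 0.0
            for worker, label in votes:
                acc = accuracy.get(worker, 0.5)
                weight = log(acc / (1 - acc))
                score += weight if label == 1 else -weight
            results[item] = int(score > 0)
        return results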
  51. 51. Word sense disambiguation Obtain 10 annotations for each of the 177 examples of the noun “president” from the SemEval corpus 3 senses available 100% interannotator agreement These results are so high because of the simplicity of the task. For more complicated tasks a small set of expert annotators perform much better than a large number of untrained turkers1 1 Bhardwaj, V., & Passonneau, R. (2010). Anveshan: a framework for analysis of multiple annotators’ labeling behavior. In Proceedings of the Fourth Linguistic Annotation Workshop (pp. 47–55). Uppsala, Sweden. Retrieved from http://dl.acm.org/citation.cfm?id=1868726 6th September 2015 RANLP 2015, HISSAR, BULGARIA 52/100
  52. 52. Callison-Burch (2009)1 • Presents several experiments which attempt to create resources for MT evaluation • He shows that by combining the judgements of several non-experts it is possible to produce a resource like those created by experts • Ranking of sentences works quite well, but producing gold-standard reference translations did not, because many workers used MT engines instead of translating themselves • A second task was created to identify poor reference translations 1 Callison-Burch, C. (2009). Fast, cheap, and creative: evaluating translation quality using Amazon’s Mechanical Turk. In EMNLP ’09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (Vol. 1, pp. 286–295). http://doi.org/10.3115/1699510.1699548 6th September 2015 RANLP 2015, HISSAR, BULGARIA 53/100
  53. 53. Gillick and Liu (2010)1 • Try to use non-experts to evaluate automatic summarisation systems • Workers are given two reference summaries and the topic of the summaries • They are asked to rank a summary produced by a system on a scale from 1 to 10 • The annotated data was noisy and unlikely to produce a ranking that matches that of experts • The reason is that non-experts are not able to separate the evaluation of content from the evaluation of readability • For the evaluation of automatic summarisation, crowdsourcing could be used for extrinsic evaluation 1Gillick, D. and Liu, Y. (2010). Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, CSLDAMT '10, pages 148 - 151, Stroudsburg, PA, USA. 6th September 2015 RANLP 2015, HISSAR, BULGARIA 54/100
  54. 54. Costs Many people employ crowdsourcing because it can reduce the costs Workers are paid between $0.01 and $1 per task The approximate costs1 for marking anaphoric relations in 1m tokens: • Partially validated data: 0.83 markables/$1 • Entirely validated data: 0.33 markables/$1 • Mturk: 20-84 markables/$1 + costs of researchers • Phrase detectives: 1 markable/$1 If you pay too little you may draw the wrong conclusions (e.g. translation, summarisation, …) 1Chamberlain, J., Fort, K., Kruschwitz, U., Lafourcade, M., & Poesio, M. (2013). Using Games to Create Language Resources: Successes and Limitations of the Approach. In I. Gurevych & J. Kim (Eds.), The People’s Web Meets NLP (pp. 3–44). Springer Berlin Heidelberg. 6th September 2015 RANLP 2015, HISSAR, BULGARIA 55/100
  55. 55. Criticism of MTurk (and similar services) MTurk has become “the digital equivalent of an unregulated sweatshop”1,2 Limitations of crowdsourcing approaches: • Lack of expertise • Decomposition of complex tasks into simpler tasks introduces bias • Need to validate the results afterwards (e.g. use PhD students) • Impossible to control some aspects about workers (e.g. language level) 1 http://vonahn.blogspot.co.uk/2010/07/work-and-internet.html 2 http://www.utne.com/science-and-technology/amazon-mechanical-turk-zm0z13jfzlin.aspx Further readings: Fort, K., Adda, G., & Cohen, K. B. (2011). Amazon Mechanical Turk: Gold Mine or Coal Mine? Computational Linguistics, 37(2), 413–420. http://doi.org/10.1162/COLI_a_00057 Fort, K., Adda, G., Sagot, B., Mariani, J., & Couillault, A. (2014). Crowdsourcing for language resource development: Criticisms about Amazon Mechanical Turk overpowering use. Lecture Notes in Computer Science (including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8387 LNAI, 303–314. http://doi.org/10.1007/978-3-319-08958-4_25 6th September 2015 RANLP 2015, HISSAR, BULGARIA 56/100
  56. 56. Conclusions Crowdsourcing has been hailed as one of the solutions to information overload Used properly, it can create large resources that otherwise could not be obtained Don’t forget the ethical implications of using crowdsourcing 6th September 2015 RANLP 2015, HISSAR, BULGARIA 57/100
  57. 57. RANLP 2015, HISSAR, BULGARIA 58/100 Processing large datasets 6th September 2015 RANLP 2015, HISSAR, BULGARIA 58/100
  58. 58. More data means better results Banko & Brill (2001)1 carry out experiments where they show that it is possible to improve the performance of ML methods by increasing the size of the training data They show that for the confusion set disambiguation {to, two, too} the performance increases almost linearly when the size of the dataset increases For their task it is possible to obtain annotated data for free 1Banko, M., & Brill, E. (2001). Scaling to Very Very Large Corpora for Natural Language Disambiguation. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics (pp. 26 – 33). Toulouse, France. Retrieved from http://dx.doi.org/10.3115/1073012.1073017 6th September 2015 RANLP 2015, HISSAR, BULGARIA 59/100
  59. 59. Big data in NLP We usually have huge text collections It is not possible to load the text collections in memory We need to obtain statistics, for example: • Number of times each distinct word appears in the file • Search for occurrences of a word/of words • Produce language models 6th September 2015 RANLP 2015, HISSAR, BULGARIA 60/100
  60. 60. The MapReduce paradigm It is one of the most common approaches used to process large collections of documents It was inspired by functional programming (e.g. Lisp) Assumes that the task can be decomposed into (key, value) pairs and these pairs can be: • processed independently of each other (map) • the results of processing combined to obtain the final result (reduce) Normally processing is distributed between computers and involves large datasets 6th September 2015 RANLP 2015, HISSAR, BULGARIA 61/100
  61. 61. MapReduce in Python: sum of the squares

    from functools import reduce  # built in for Python 2; in functools for Python 3

    def pow2(a):
        return a * a

    def add(a, b):
        return a + b

    def iterative(my_list):
        s = 0
        for x in my_list:
            s += pow2(x)
        return s

    my_list = [1, 2, 3]
    print(iterative(my_list))               # 14
    print(reduce(add, map(pow2, my_list)))  # 14 - the same computation as map + reduce

  6th September 2015 RANLP 2015, HISSAR, BULGARIA 62/100
  62. 62. Word counting using Unix commands words(collection) | sort | uniq -c • words prints each word from the collection on a separate line (not a Unix command! it could be implemented with, e.g., tr -s '[:space:]' '\n' < collection) • sort: sorts all the words alphabetically, so that identical words end up on adjacent lines • uniq -c: collapses each run of identical lines into a single line, prefixed by its number of occurrences Here words plays the role of MAP, uniq -c the role of REDUCE, and sort acts as the shuffle between them 6th September 2015 RANLP 2015, HISSAR, BULGARIA 63/100
  63. 63. MapReduce Input: a set of key-value pairs derived from the dataset to be processed Map(k, v) → <k’, v’>* • Needs to be written by the programmer • Takes a key-value pair and outputs a set (including the empty set) of key-value pairs • There is one call of the Map function for each input pair Reduce(k’, <v’>*) → <k’, v’’>* • All values v’ with the same key k’ are reduced together • There is one call of the Reduce function for each unique key k’ 6th September 2015 RANLP 2015, HISSAR, BULGARIA 64/100
  64. 64. Word count using MapReduce

    map(key, value):
        // key: document name; value: text of the document
        for each word w in value:
            emit(w, 1)

    reduce(key, values):
        // key: a word; values: an iterator over counts
        result = 0
        for each count v in values:
            result += v
        emit(key, result)

  Source: Mining Massive Datasets Course, Lecture 1.2 https://class.coursera.org/mmds-002/ 6th September 2015 RANLP 2015, HISSAR, BULGARIA 65/100
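The same job can be simulated on a single machine in Python; the shuffle step, which a MapReduce framework performs between map and reduce, is made explicit here (a toy sketch with made-up documents, not production code):

    from collections import defaultdict

    def map_phase(doc_name, text):
        for word in text.split():      # one (word, 1) pair per token, as in emit(w, 1)
            yield (word, 1)

    def shuffle(pairs):
        groups = defaultdict(list)     # group all values emitted for the same key
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(word, counts):
        yield (word, sum(counts))

    documents = {"doc1": "the cat sat on the mat", "doc2": "the dog sat"}
    pairs = [p for name, text in documents.items() for p in map_phase(name, text)]
    counts = [r for key, values in shuffle(pairs) for r in reduce_phase(key, values)]
    print(sorted(counts))
    # [('cat', 1), ('dog', 1), ('mat', 1), ('on', 1), ('sat', 2), ('the', 3)]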
  65. 65. Word count using MapReduce Source: Mining Massive Datasets Course, Lecture 1.2 https://class.coursera.org/mmds-002/ 6th September 2015 RANLP 2015, HISSAR, BULGARIA 66/100
  66. 66. Word count in Apache Spark

    public static void wordCountJava8(String filename) {
        // Define a configuration to use to interact with Spark
        SparkConf conf = new SparkConf().setMaster("local").setAppName("Word Count App");
        // Create a Java version of the Spark Context from the configuration
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Load the input data, which is a text file read from the command line
        JavaRDD<String> input = sc.textFile(filename);
        // Java 8 with lambdas: split the input string into words
        JavaRDD<String> words = input.flatMap(s -> Arrays.asList(s.split(" ")));
        // Java 8 with lambdas: transform the collection of words into pairs
        // (word and 1) and then count them
        JavaPairRDD<String, Integer> counts = words.mapToPair(t -> new Tuple2(t, 1))
                                                   .reduceByKey((x, y) -> (int) x + (int) y);
        // Save the word count back out to a text file, causing evaluation.
        counts.saveAsTextFile("output");
    }

  Tutorial from http://www.javaworld.com/article/2972863/big-data/open-source-java-projects-apache-spark.html 6th September 2015 RANLP 2015, HISSAR, BULGARIA 67/100
  67. 67. Language model in MapReduce Count the number of times each 5-gram occurs in a large corpus of documents Map • Extract (5-gram, count) pairs from documents Reduce • Combine the counts 6th September 2015 RANLP 2015, HISSAR, BULGARIA 68/100
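A sketch of the corresponding mapper in Python (whitespace tokenisation is an assumption; the reducer is the same summation as in the word-count example above, with 5-grams as keys):

    def map_phase(doc_name, text):
        tokens = text.split()
        for i in range(len(tokens) - 4):
            yield (tuple(tokens[i:i + 5]), 1)   # emit (5-gram, 1) for every position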
  68. 68. Other requirements To use MapReduce for real life scenarios you need much more than this: • A distributed cluster of computers • A distributed file system e.g. Google GFS, Hadoop HDFS • A framework that implements MapReduce e.g. Hadoop, Apache Spark Setting up is not difficult, but fine tuning requires quite a bit of knowledge 6th September 2015 RANLP 2015, HISSAR, BULGARIA 69/100
  69. 69. MapReduce in NLP • Build co-occurrence matrices from very large corpora1 ◦ Uses a cluster of 20 computers running Hadoop ◦ the co-occurrence matrix for the Gigaword corpus (7.15 million documents and about 2.97 billion words) ◦ takes about 37 minutes for a window of 2 words, and 1 hour and 23 minutes for a window of 6 words. • Build language models2 ◦ Use MapReduce to build language models from corpora that have between 13 million and 2 trillion tokens ◦ The quality of MT engines using these language models improves when the corpus is increased • Increase the processing speed: Watson running on a single processor took 2 hours to answer a single question; a distributed implementation with over 2,500 cores can answer in 3-5 seconds 1 Lin, J. (2008). Scalable language processing algorithms for the masses: a case study in computing word co-occurrence matrices with MapReduce. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 419 – 428, http://www.aclweb.org/anthology/D08-1044 2 Brants, T., Popat, A. C., Xu, P., Och, F. J., & Jeffrey Dean. (2007). Large Language Models in Machine Translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (pp. 858–867). Prague, Czech Republic. http://www.aclweb.org/anthology/D07-1090.pdf 6th September 2015 RANLP 2015, HISSAR, BULGARIA 70/100
  70. 70. Further reading Jimmy Lin and Chris Dyer (2010) Data-Intensive Text Processing with MapReduce. Morgan & Claypool Publishers. Available at https://lintool.github.io/MapReduceAlgorithms/ Mining Massive Datasets Course to start on 12th Sept 2015 https://www.coursera.org/course/mmds 6th September 2015 RANLP 2015, HISSAR, BULGARIA 71/100
  71. 71. RANLP 2015, HISSAR, BULGARIA 72/100 Deep learning 6th September 2015 RANLP 2015, HISSAR, BULGARIA 72/100
  72. 72. The standard ML approach Input (annotated) data set Low level features Machine learning algorithm Evaluation Try to improve 6th September 2015 RANLP 2015, HISSAR, BULGARIA 73/100
  73. 73. But what if you can’t always define the features? Can deep learning help find the perfect date? http://www.kdnuggets.com/2015/07/can-deep-learning-help-find-perfect-girl.html 6th September 2015 RANLP 2015, HISSAR, BULGARIA 74/100
  74. 74. Quick introduction to neural networks (NNs) Perceptron Multi-layer network 6th September 2015 RANLP 2015, HISSAR, BULGARIA 75/100
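To make the perceptron concrete, here is a minimal NumPy version trained on the logical AND function (a toy sketch; the data and hyperparameters are illustrative):

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # inputs
    y = np.array([0, 0, 0, 1])                       # AND targets

    w, b, lr = np.zeros(2), 0.0, 0.1                 # weights, bias, learning rate
    for epoch in range(20):
        for xi, target in zip(X, y):
            prediction = int(np.dot(w, xi) + b > 0)  # step activation
            error = target - prediction
            w += lr * error * xi                     # perceptron update rule
            b += lr * error

    print([int(np.dot(w, xi) + b > 0) for xi in X])  # [0, 0, 0, 1]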
  75. 75. Degree of complexity From: http://www.slideshare.net/roelofp/deep-learning-for-information-retrieval 6th September 2015 RANLP 2015, HISSAR, BULGARIA 76/100
  76. 76. What is deep learning? It is a new big trend in Machine Learning: Neural Networks that are composed of many layers Deep learning algorithms attempt to automatically learn multiple levels of representation of increasing complexity/abstraction Biologically justified: Audio/Visual cortex has multiple stages == Hierarchical 6th September 2015 RANLP 2015, HISSAR, BULGARIA 77/100
  77. 77. Why deep learning • Neural networks can work as lookup tables to represent functions (i.e. some neurons can activate only for a specific range of values) • For some functions we would need too many units in the hidden layer – not efficient • More hidden units in the hidden layer require more training data • Instead we can try to learn a complex function as a composition of simple functions (see the sketch below) 6th September 2015 RANLP 2015, HISSAR, BULGARIA 78/100
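For example, XOR cannot be computed by a single perceptron, but it can be written as a composition of two simple threshold layers (the weights below are hand-picked and purely illustrative):

    import numpy as np

    def step(z):
        return (z > 0).astype(float)

    def xor(x):
        # hidden layer computes OR and AND of the two inputs
        h = step(np.dot(np.array([[1, 1], [1, 1]]), x) + np.array([-0.5, -1.5]))
        # output layer computes OR AND NOT AND, i.e. XOR
        return step(np.dot(np.array([1, -1]), h) - 0.5)

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, xor(np.array(x)))               # 0, 1, 1, 0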
  78. 78. Different levels of abstraction 6th September 2015 RANLP 2015, HISSAR, BULGARIA 79/100
  79. 79. Google trends for 5 search terms: machine learning, deep learning, neural networks, support vector machines, naïve Bayes 6th September 2015 RANLP 2015, HISSAR, BULGARIA 80/100
  80. 80. Why now? NNs have been around for many years Breakthrough around 2006 • More data • Faster processing: GPUs and multi-core CPUs • Better ideas how to train deep architectures 6th September 2015 RANLP 2015, HISSAR, BULGARIA 81/100
  81. 81. Representation models • In the standard representation model a word is represented as a vector with one 1 and the rest 0 E.g. cat = [0 0 0 0 0 0 1 0 0 0 0 0 0] • Problem with the vector space model: cat [0 0 0 0 0 0 1 0 0 0 0 0 0] AND dog [0 0 1 0 0 0 0 0 0 0 0 0 0] = 0 • “You shall know a word by the company it keeps” (Firth 1957) • In distributional-similarity-based representations words are represented by the words that appear in their context (co-occurrence vector) • Examples: ◦ Latent Semantic Analysis (LSA/LSI) ◦ Latent Dirichlet Allocation (LDA) ◦ Word embeddings 6th September 2015 RANLP 2015, HISSAR, BULGARIA 82/100
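A toy illustration of the contrast in Python (the dense three-dimensional vectors are invented for the example; real embeddings are learned and have hundreds of dimensions):

    import numpy as np

    vocab = ["the", "dog", "cat", "sat"]
    def one_hot(word):
        v = np.zeros(len(vocab))
        v[vocab.index(word)] = 1.0
        return v

    # One-hot vectors of two different words are always orthogonal:
    print(np.dot(one_hot("cat"), one_hot("dog")))    # 0.0 - no notion of similarity

    # Dense vectors can express graded similarity:
    cat = np.array([0.8, 0.1, 0.6])
    dog = np.array([0.7, 0.2, 0.5])
    the = np.array([0.0, 0.9, 0.1])

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(cat, dog))                          # close to 1: related words
    print(cosine(cat, the))                          # close to 0: unrelated words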
  82. 82. Properties of continuous space representations The vector space representation has some very interesting features: • It allows a level of generalisation not possible for classical n-gram model • In continuous space model similar words are likely to have similar vectors • When the model parameters are adjusted in response to a particular word or word-sequence, the improvements will carry over to occurrences of similar words and sequences Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT (pp. 746–751). https://www.aclweb.org/anthology/N/N13/N13-1090.pdf 6th September 2015 RANLP 2015, HISSAR, BULGARIA 83/100
  83. 83. Word embedding A word embedding is a parameterized function that maps words in a language to high-dimensional vectors It learns simultaneously: • A distributed representation for each word • A probability function for word sequences One of the most exciting developments in deep learning for NLP Proposed quite a while ago: Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3 (March 2003), 1137-1155. http://dl.acm.org/citation.cfm?id=944966 6th September 2015 RANLP 2015, HISSAR, BULGARIA 84/100
  84. 84. Architectures 6th September 2015 RANLP 2015, HISSAR, BULGARIA 85/100
  85. 85. Collobert et al. (2011)1 • Train a NN to obtain word embedding • Experiments with small datasets did not lead to good results • Wikipedia and Reuters RCV1 corpora are used • The “map” of words obtained makes lots of sense 1 Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 1(12), 2493–2537. Retrieved from http://dl.acm.org/citation.cfm?id=2078186 6th September 2015 RANLP 2015, HISSAR, BULGARIA 86/100
  86. 86. Collobert et al. (2011)1 1 Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 1(12), 2493–2537. Retrieved from http://dl.acm.org/citation.cfm?id=2078186 2 Bottou, L. (2011). From Machine Learning to Machine Reasoning. Arxiv Preprint arXiv11021808, 15. Retrieved from http://arxiv.org/abs/1102.1808 3 See Richard Socher’s tutorial on Deep learning for NLP (without magic) http://lxmls.it.pt/2014/socher-lxmls.pdf for detailed information on how to train this network R(W(‘‘cat"), W(‘‘sat"), W(‘‘on"), W(‘‘the"), W(‘‘mat")) = 1 R(W(‘‘cat"), W(‘‘sat"), W(‘‘song"), W(‘‘the"), W(‘‘mat")) = 0 • The trained network predicts whether a 5-gram is valid3 • Not particularly useful information as such • However, the word embeddings are very useful • Training these networks on large datasets can take weeks • They use these embeddings to train more complicated NNs to perform POS tagging, chunking, NER, SRL. Determine if a 5-gram is valid. Figure from Bottou (2011) 6th September 2015 RANLP 2015, HISSAR, BULGARIA 87/100
  87. 87. Mikolov et al. (2013)1 • Use a Recurrent Neural Network Language Model • The model has no knowledge of syntax, morphology or semantics • Used to measure linguistic regularities using the pattern “a is to b as c is to ___” ◦ Syntactic test: year:years law:laws ◦ Semantic test: clothing:shirt dish:bowl • The relationships can be expressed in terms of offsets 1 Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT (pp. 746–751). Atlanta, Georgia, USA. Retrieved from https://www.aclweb.org/anthology/N/N13/N13-1090.pdf 6th September 2015 RANLP 2015, HISSAR, BULGARIA 88/100
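The offset arithmetic can be reproduced with any pre-trained word vectors, for example with gensim (a sketch: the vector file below is an assumption, and Mikolov et al. obtained their results with an RNN language model rather than with these particular vectors):

    from gensim.models import KeyedVectors

    # Load pre-trained vectors (substitute whatever model file you have)
    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # "man is to king as woman is to ___": the word closest to king - man + woman
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # typically [('queen', ...)]

    # syntactic test: "year is to years as law is to ___"
    print(vectors.most_similar(positive=["years", "law"], negative=["year"], topn=1))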
  88. 88. Adjectival scales • Mikolov et al. (2013) show that continuous space representations capture syntactic and semantic regularities: ◦ apple − apples ≈ car − cars ≈ family − families ◦ king − man + woman ≈ queen • Kim & Marneffe (2013)1 derive adjectival scales 1Kim, J., & Marneffe, M.-C. de. (2013). Deriving adjectival scales from continuous space word representations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1625 – 1630). Retrieved from http://www.aclweb.org/anthology/D13-1169 6th September 2015 RANLP 2015, HISSAR, BULGARIA 89/100
  89. 89. Bilingual word embeddings1 • Two word embeddings are trained in the traditional manner (for English and Mandarin Chinese) • An additional constraint is introduced that words from the two languages that have similar meanings should be close together • Words that were not known as translations of each other end up close together • The word embeddings are used: ◦ In the Chinese similarity task, where they lead to results better than the state of the art ◦ For phrase-based MT, where they give a 0.49 increase in the BLEU score 1Zou, W. Y., Socher, R., Cer, D., & Manning, C. D. (2013). Bilingual Word Embeddings for Phrase-Based Machine Translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013). 6th September 2015 RANLP 2015, HISSAR, BULGARIA 90/100
  90. 90. Tree Structured Long Short Term Memory (Tree-LSTM) • Recurrent NNs with Long Short Term Memory (LSTM) proved very good at representing sentences and useful in capturing long distance dependencies1 • Recurrent NNs (RNNs) can process sequences of arbitrary length • Plain RNNs have problems learning long distance correlations in a sequence • LSTMs have a memory cell that preserves states over long periods of time • Tree-LSTMs are very useful in capturing semantic relatedness and sentiment analysis of movie reviews: ◦ Outperform the state of the art for fine-grained sentiment classification and are comparable for binary classification ◦ Outperform the best performing systems at the SemEval 2014 semantic relatedness task 1Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing, China. 6th September 2015 RANLP 2015, HISSAR, BULGARIA 91/100
  91. 91. ReVal: MT evaluation metric1 • Evaluation metric based on Tree Structured Long Short Term Memory (Tree-LSTM) networks • Trained on WMT-13 ranking data (the ranks had to be converted to similarity scores) and on 4500 pairs of the SICK data • Performs better at system level than some methods that rely on many features • Average performance for segment level evaluation • Good example of how you can develop new methods based on existing approaches in deep learning • Code available at https://github.com/rohitguptacs/ReVal 1 Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015. ReVal: A Simple and Effective Machine Translation Evaluation Metric based on Recurrent Neural Networks. In Proceedings of EMNLP-2015, Lisbon, Portugal. Rohit Gupta, Constantin Orasan and Josef van Genabith. 2015. Machine Translation Evaluation using Recurrent Neural Networks. In Proceedings of WMT-2015. 6th September 2015 RANLP 2015, HISSAR, BULGARIA 92/100
  92. 92. Text understanding from scratch1 • Discusses how it is possible to achieve text understanding starting from characters (i.e. without providing any information about words, paragraphs, etc.) • Uses convolutional networks (ConvNets) to determine the polarity of texts and the main topic of articles • The method works for both English and Chinese • The conclusion of the paper is that ConvNets do not need any syntactic or semantic structure of the language to work 1Zhang, X., & LeCun, Y. (2015). Text Understanding from Scratch. Retrieved from http://arxiv.org/pdf/1502.01710v3.pdf 6th September 2015 RANLP 2015, HISSAR, BULGARIA 93/100
  93. 93. Word embeddings for verbal comprehension questions • Attempt to answer verbal reasoning questions from IQ tests: ◦ Isotherm is to temperature as isobar is to? (i) atmosphere, (ii) wind, (iii) pressure, (iv) latitude, (v) current. ◦ Which is the odd one out? (i) calm, (ii) quiet, (iii) relaxed, (iv) serene, (v) unruffled. ◦ Which word is most opposite to MUSICAL? (i) discordant, (ii) loud, (iii) lyrical, (iv) verbal, (v) euphonious? • These questions belong to predefined categories that can be identified easily by computers • Each category has a different solver • A novel way of producing word embeddings was necessary • 200 people were asked to answer the questions via Amazon Mechanical Turk • The average performance of the human participants is a little lower than that of the proposed method • “Our model can reach the intelligence level between the people with the bachelor degrees and those with the master degrees” • “The results indicate that with appropriate uses of the deep learning technologies we might be a further step closer to the human intelligence.” Huazheng Wang, Bin Gao, Jiang Bian, Fei Tian, Tie-Yan Liu (2015) Solving Verbal Comprehension Questions in IQ Test by Knowledge-Powered Word Embedding. Retrieved from http://arxiv.org/abs/1505.07909 6th September 2015 RANLP 2015, HISSAR, BULGARIA 94/100
  94. 94. 6th September 2015 RANLP 2015, HISSAR, BULGARIA 95/100
  95. 95. Deep learning • It leads to better results than other methods • Can be applied to a large number of tasks • … but how many of these tasks tackle realistic data? • … will it really lead to proper text understanding? • … or is it yet another trend? • Proper understanding of deep learning requires a very good background in maths • … but there are many packages available that implement the methods 6th September 2015 RANLP 2015, HISSAR, BULGARIA 96/100
  96. 96. Many resources available Slides and presentations from tutorials: • Using Neural Networks for Modelling and Representing Natural Languages http://www.coling-2014.org/COLING%202014%20Tutorial-fix%20-%20Tomas%20Mikolov.pdf • Richard Socher’s tutorial on Deep learning for NLP (without magic) http://lxmls.it.pt/2014/socher-lxmls.pdf • General Sequence Learning using Recurrent Neural Networks https://youtu.be/VINCQghQRuM Books: http://neuralnetworksanddeeplearning.com/ Comprehensive hub of information: http://deeplearning.net/ The topic appears constantly on social media: • Less than one day ago: “What are the limits of deep learning” on Reddit https://redd.it/3jo968 6th September 2015 RANLP 2015, HISSAR, BULGARIA 97/100
  97. 97. Are we closer to “text understanding” or are we only getting better at optimising for some (very specific and sometimes unnatural) tasks? “Open the pod bay doors, please Hal...” https://youtu.be/dSIKBliboIo 6th September 2015 RANLP 2015, HISSAR, BULGARIA 98/100
  98. 98. The latest version of the slides available at: http://www.slideshare.net/dinel/new-trends-in-nlp-applications You can contact me by email at C.Orasan@wlv.ac.uk 6th September 2015 RANLP 2015, HISSAR, BULGARIA 99/100
  99. 99. Thank you 6th September 2015 RANLP 2015, HISSAR, BULGARIA 100/100
