2. We have used material from several popular books, papers, course notes and presentations made by experts in this area. We have provided all references to the best of our knowledge. This list, however, serves only as a pointer to work in this area and is by no means a comprehensive resource.
4. An Overview of Empirical Natural Language Processing, Eric Brill, Raymond J. Mooney.
Pipeline: Word Sequence → Syntactic Parser → Parse Tree → Semantic Analyzer → Literal Meaning → Discourse Analyzer → Meaning
100. TO COME: USAGE EXAMPLES OF WHAT WE COVERED THUS FAR
102. This MEK dependency was observed in BRAF mutant cells regardless of tissue lineage, and correlated with both downregulation of cyclin D1 protein expression and the induction of G1 arrest.
*MEK dependency ISA Dependency_on_an_Organic_chemical
*BRAF mutant cells ISA Cell_type
*downregulation of cyclin D1 protein expression ISA Biological_process
*tissue lineage ISA Biological_concept
*induction of G1 arrest ISA Biological_process
Information Extraction = segmentation + classification + association + mining
Text mining = entity identification + named relationship extraction + discovering association chains...
Segmentation and classification yield the entities above; named relationship extraction then links them:
MEK dependency [observed in] BRAF mutant cells
MEK dependency [correlated with] downregulation of cyclin D1 protein expression
MEK dependency [correlated with] induction of G1 arrest
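The segmentation + classification step above can be sketched as a small pattern matcher. The pattern list and class names below simply restate the ISA annotations from the slide; a real system would learn these rather than hard-code them.

```python
import re

# Illustrative gazetteer mapping surface patterns to semantic classes
# (taken from the ISA annotations above, not from a real extraction system).
PATTERNS = [
    ("MEK dependency", "Dependency_on_an_Organic_chemical"),
    ("BRAF mutant cells", "Cell_type"),
    ("downregulation of cyclin D1 protein expression", "Biological_process"),
    ("tissue lineage", "Biological_concept"),
    ("induction of G1 arrest", "Biological_process"),
]

def extract_entities(sentence):
    """Segmentation + classification: find each pattern span and tag it."""
    entities = []
    for pattern, label in PATTERNS:
        for m in re.finditer(pattern, sentence):
            entities.append((m.group(), label))
    return entities

sentence = ("This MEK dependency was observed in BRAF mutant cells regardless "
            "of tissue lineage, and correlated with both downregulation of "
            "cyclin D1 protein expression and the induction of G1 arrest.")
for text, label in extract_entities(sentence):
    print(f"{text} ISA {label}")
```

Association and mining would then operate over these tagged spans rather than raw text.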
One of the primary goals of AI has been natural language understanding. Understanding language requires not only lexical and grammatical information but also semantic, pragmatic and general world knowledge. It is a complex task that involves many levels of processing and a variety of subtasks. Typical components of a language understanding system: understanding language deals with syntax and the structure of language (what is said) plus understanding of semantics, pragmatics and discourse (what the thing being said says/asks/informs about the world).
In the 1970s, AI systems were developed that demonstrated interesting aspects of language understanding, built as natural language understanding systems that used hand-coded symbolic grammars and knowledge bases. Although there was some corpus-based language learning in the 1950s, following Shannon's information theory, Chomsky's argument that the learnability of language is more an innate property than a learned one was instrumental in redefining the goals of linguistics in the 1950s: the emphasis shifted to symbolic grammars and to representing innate linguistic knowledge (universal grammar).
Developing such systems, however, was very human-intensive, requiring extensive knowledge engineering; the systems typically ran only on toy examples and were rather brittle. Partially in response to these issues there was a paradigm shift in natural language understanding. Approaches moved from rationalist methods, based on hand-coded rules derived through introspection, to empirical or corpus-based methods. Development is more data-driven and at least partially automated through the use of statistical or machine learning methods.
When we analyze text for semantic computation, we are doing one of two things. The first is finding out more about what we already know; often termed the "finding a needle in a haystack" paradigm, this search/browse method contrasts with the other goal of text analysis: finding what we do not know, i.e. discovering undiscovered knowledge.
Encyclopedic text is assertional, simple, and easier to parse and understand. Biomedical literature, however, contains text that describes complex scientific investigations which do not always contain explicit factual assertions. Instead, there is often a series of arguments, opinions and experiments supported by evidence that collectively corroborate or refute a hypothesis that may not be explicitly stated in a simple sentence. Sentences tend to be rather long and convoluted. Furthermore, domain-specific terms, abbreviations, number ranges and symbols often make sentences hard for the human reader to parse, further complicating automated information extraction. These factors make the task of mining biomedical text substantially more complex than mining Wikipedia-like text. A third class of text is casual: its goal is largely interactive as opposed to informative, and grammatical errors, misspellings and entity variations are not uncommon.
Using Text to Form Hypotheses about Disease. For more than a decade, Don Swanson has eloquently argued why it is plausible to expect new information to be derivable from text collections: experts can only read a small subset of what is published in their fields and are often unaware of developments in related fields. Thus it should be possible to find useful linkages between information in related literatures, if the authors of those literatures rarely refer to one another's work. Swanson has shown how chains of causal implication within the medical literature can lead to hypotheses for causes of rare diseases, some of which have received supporting experimental evidence [Swanson 1987, Swanson 1991, Swanson and Smalheiser 1994, Swanson and Smalheiser 1997]. For example, when investigating causes of migraine headaches, he extracted various pieces of evidence from titles of articles in the biomedical literature. Some of these clues can be paraphrased as follows:
- stress is associated with migraines
- stress can lead to loss of magnesium
- calcium channel blockers prevent some migraines
- magnesium is a natural calcium channel blocker
- spreading cortical depression (SCD) is implicated in some migraines
- high levels of magnesium inhibit SCD
- migraine patients have high platelet aggregability
- magnesium can suppress platelet aggregability
These clues suggest that magnesium deficiency may play a role in some kinds of migraine headache, a hypothesis which did not exist in the literature at the time Swanson found these links. The hypothesis has to be tested via non-textual means, but the important point is that a new, potentially plausible medical hypothesis was derived from a combination of text fragments and the explorer's medical expertise. (According to [Swanson 1991], a subsequent study found support for the magnesium-migraine hypothesis [Ramadan et al. 1989].) This approach has been only partially automated.
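The chaining idea above can be sketched as a join over two literatures: one linking intermediate B-terms to a disease, the other linking a candidate substance to those same B-terms. The dictionaries below paraphrase the migraine clues; the relation tables are illustrative, not mined from real text.

```python
# Clues from the migraine literature: B-term -> disease it is implicated in.
b_to_disease = {
    "stress": "migraine",
    "calcium channel blockade": "migraine",
    "spreading cortical depression": "migraine",
    "platelet aggregability": "migraine",
}
# Clues from a separate literature: candidate substance -> B-terms it affects.
substance_to_b = {
    "magnesium": {"calcium channel blockade",
                  "spreading cortical depression",
                  "platelet aggregability"},
}

def hypothesize(substance_to_b, b_to_disease):
    """Propose substance-disease links supported by shared intermediate B-terms."""
    hypotheses = {}
    for a, bs in substance_to_b.items():
        for b in bs:
            if b in b_to_disease:
                c = b_to_disease[b]
                hypotheses.setdefault((a, c), []).append(b)
    return hypotheses

for (a, c), bs in hypothesize(substance_to_b, b_to_disease).items():
    print(f"hypothesis: {a} -> {c}, supported by {len(bs)} intermediate term(s)")
```

As the text notes, pruning such chains with semantic constraints is the hard part; this sketch only performs the join.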
There is, of course, a potential for combinatorial explosion of potentially valid links. Beeferman (1998) has developed a flexible interface and analysis tool for exploring certain kinds of chains of links among lexical relations within WordNet. However, sophisticated new algorithms are needed to help in the pruning process, since a good pruning algorithm will want to take into account various kinds of semantic constraints. This may be an interesting area of investigation for computational linguists.
Beyond search: analytical operations over text to answer complex questions, requiring aggregation of information across a corpus; context-specific, domain-specific, application-specific. Text mining, also known as intelligent text analysis, text data mining or knowledge discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics. As most information (over 80%) is stored as text, text mining is believed to have high commercial potential value.
The first query is “Flu Epidemic.” In Table 1, we see that the first storyline contains information about flu (identified by terms like ‘vaccines’, ‘strains’), the second contains seasonal news (identified by terms like ‘deaths’, ‘reported’), the third is about bird flu (identified by terms like ‘avian’, ‘bird’), and the fourth is about the Spanish flu epidemic of 1918 (identified by terms like ‘spanish’, ‘1918’).
Select well-distributed terms from the collection:
- Eliminate stopwords
- Retain only those terms with a distribution higher than a threshold (default: top 10%)
Build a “backbone”:
- Create paths from unambiguous terms only, to bias the structure towards appropriate senses of words
- Get the hypernym path if the term has only one sense, or matches a pre-selected WordNet domain
- Adding a new term increases the count at each node on its path by the number of documents containing the term
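The counting step of the backbone construction can be sketched as follows. A toy hypernym table stands in for WordNet, and the terms and paths are made up for illustration.

```python
from collections import Counter

# Toy stand-in for WordNet: term -> hypernym path from root to the term's
# single sense (illustrative paths, not real WordNet data).
HYPERNYM_PATH = {
    "migraine": ["condition", "disorder", "headache", "migraine"],
    "vaccine":  ["artifact", "substance", "drug", "vaccine"],
    "aspirin":  ["artifact", "substance", "drug", "aspirin"],
}

def build_backbone(term_doc_counts):
    """Add each unambiguous term's hypernym path to the backbone; every node
    on the path gains a count equal to the number of docs with the term."""
    node_counts = Counter()
    for term, n_docs in term_doc_counts.items():
        for node in HYPERNYM_PATH.get(term, []):
            node_counts[node] += n_docs
    return node_counts

counts = build_backbone({"migraine": 4, "vaccine": 2, "aspirin": 3})
print(counts["drug"])  # vaccine docs + aspirin docs -> 5
```

Shared ancestors like "drug" accumulate counts from all terms beneath them, which is what lets later steps prune weakly supported branches.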
We will cover some fundamentals that are a core part of most TM systems.
Syntactic analysis involves determining the grammatical structure of a sentence. One subtask is POS (part-of-speech) tagging.
Local word characteristics: a final -s does not always mark a plural noun ("he works": verb; "his works": noun).
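The ambiguity above can be illustrated with a toy disambiguator that uses only the preceding word as a local feature. The word lists are illustrative, not from a real tagger, which would use many such features together.

```python
def tag_works(prev_word):
    """Hypothetical one-feature disambiguator for the word 'works'."""
    # Possessives/determiners before 'works' signal a noun reading;
    # subject pronouns signal a verb reading.
    if prev_word.lower() in {"his", "her", "the", "these"}:
        return "NOUN"
    if prev_word.lower() in {"he", "she", "it"}:
        return "VERB"
    return "UNKNOWN"

print(tag_works("he"))   # VERB  ('he works')
print(tag_works("his"))  # NOUN  ('his works')
```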
Noun phrases are headed by nouns and provide information about the noun in the sentence. Prepositional phrases are headed by prepositions, contain noun phrases, and express spatial, temporal and other attributes. The verb phrase organizes all elements of the sentence that syntactically depend on the verb.
A grammar describes which of the possible sequences of symbols (strings) in a language constitute valid words or statements in that language, but it does not describe their semantics (i.e. what they mean).
Typical sources of knowledge:
- Meanings of words
- Meanings of grammatical constructs
- Knowledge about the structure of discourse
- Common sense knowledge about the topic
- Knowledge about the state of affairs in which the discourse is occurring
Forming new words from old words (derivational morphology); suffixes marking grammatical features (inflections).
Synonym test for words/lemmas (propositional meaning): one word can be substituted for another without changing the meaning of the sentence. "Car" and "automobile" are substitutable but not identical in meaning.
pathlen(c1, c2) = number of edges in the shortest path in the thesaurus graph between the sense nodes c1 and c2
simpath(c1, c2) = -log pathlen(c1, c2)
wordsim(w1, w2) = max over c1 ∈ senses(w1), c2 ∈ senses(w2) of sim(c1, c2)
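These measures can be computed with a breadth-first search over the sense graph. The tiny taxonomy below is hand-made for illustration; a real system would use WordNet's hypernym graph.

```python
import math
from collections import deque

# Toy undirected thesaurus graph over sense nodes (illustrative).
EDGES = {
    "vehicle": {"car", "truck"},
    "car": {"vehicle", "coupe"},
    "truck": {"vehicle"},
    "coupe": {"car"},
}

def pathlen(c1, c2):
    """Number of edges on the shortest path between sense nodes c1 and c2 (BFS)."""
    frontier, dist = deque([c1]), {c1: 0}
    while frontier:
        node = frontier.popleft()
        if node == c2:
            return dist[node]
        for nbr in EDGES.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                frontier.append(nbr)
    return None  # not connected

def sim_path(c1, c2):
    """simpath(c1, c2) = -log pathlen(c1, c2), for distinct sense nodes."""
    return -math.log(pathlen(c1, c2))

def wordsim(w1_senses, w2_senses):
    """wordsim: maximise sim_path over all sense pairs of the two words."""
    return max(sim_path(c1, c2) for c1 in w1_senses for c2 in w2_senses)

print(wordsim({"car", "coupe"}, {"truck"}))  # -log 2, via car-vehicle-truck
```

Note that closer senses give higher (less negative) similarity, and the max over sense pairs implicitly picks the best-matching senses of the two words.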
If "Adam Kluver Ltd" had already been recognised as an organisation by the sure-fire rule, in this second step any occurrences of "Kluver Ltd", "Adam Ltd" and "Adam Kluver" are also tagged as possible organisations. This assignment, however, is not definite, since some of these words (such as "Adam") could refer to a different entity. This information goes to a pre-trained maximum entropy model (see Mikheev (1998) for more details on this approach).
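The variant-generation step above can be sketched by enumerating token subsequences of the sure-fire match. This is only one plausible way to produce the partial matches described; the actual system may generate variants differently.

```python
from itertools import combinations

def partial_variants(full_name):
    """All proper token subsequences of a recognised name, longest first,
    to be tagged elsewhere as *possible* mentions of the same entity."""
    tokens = full_name.split()
    variants = []
    for size in range(len(tokens) - 1, 0, -1):
        for combo in combinations(tokens, size):
            variants.append(" ".join(combo))
    return variants

print(partial_variants("Adam Kluver Ltd"))
# includes 'Adam Kluver', 'Adam Ltd', 'Kluver Ltd', plus single tokens
```

Single-token variants like "Adam" are exactly the ambiguous cases the maximum entropy model is left to resolve.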
SRV - Two trends are evident in the recent evolution of the field of information extraction: a preference for simple, often corpus-driven techniques over linguistically sophisticated ones; and a broadening of the central problem definition to include many non-traditional text domains. This development calls for information extraction systems which are as retargetable and general as possible. Here, we describe SRV, a learning architecture for information extraction which is designed for maximum generality and flexibility. SRV can exploit domain-specific information, including linguistic syntax and lexical information, in the form of features provided to the system explicitly as input for training. This process is illustrated using a domain created from Reuters corporate acquisitions articles. Features are derived from two general-purpose NLP systems, Sleator and Temperley's link grammar parser and WordNet. Experiments compare the learner's performance with and without such linguistic information. Surprisingly, in many cases, the system performs as well without this information as with it.
The label bias problem can be illustrated with a simple finite-state model designed to distinguish between the two words rib and rob. In the first time step, r matches both transitions from the start state, so the probability mass gets distributed roughly equally among those two transitions. Next we observe i. Both states 1 and 4 have only one outgoing transition. State 1 has seen this observation often in training; state 4 has almost never seen this observation; but like state 1, state 4 has no choice but to pass all its mass to its single outgoing transition, since it is not generating the observation, only conditioning on it. Thus, states with a single outgoing transition effectively ignore their observations. The top path and the bottom path will be about equally likely, independently of the observation sequence. If one of the two words is slightly more common in the training set, the transitions out of the start state will slightly prefer its corresponding transition, and that word’s state sequence will always win.
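The effect can be shown numerically. In a locally normalised model, each state's outgoing transition probabilities must sum to 1, so a state with a single outgoing transition assigns it probability 1 whatever the observation is. The start-state preference below (0.55 vs 0.45) is an assumed training bias towards "rob".

```python
def path_prob(start_pref, middle_probs):
    """Product of locally normalised transition probabilities along a path."""
    p = start_pref
    for prob in middle_probs:
        p *= prob
    return p

# Assume 'rob' was slightly more common in training, so the start state
# slightly prefers its branch. States along each branch have exactly one
# outgoing transition, hence probability 1.0 regardless of the observation.
p_rib_path = path_prob(0.45, [1.0, 1.0])
p_rob_path = path_prob(0.55, [1.0, 1.0])

print(p_rib_path, p_rob_path)  # 0.45 0.55, whatever word was observed
```

Even when the observed word is "rib", the "rob" path wins: the mid-word observations never get a chance to change the outcome, which is exactly the label bias problem that globally normalised models (CRFs) avoid.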
i. Starting with a few seed entities, it is possible to induce high-precision context patterns by exploiting entity context redundancy. ii. New entity instances of the same category can be extracted from unlabeled data with the induced patterns to create high-precision extensions of the seed lists. iii. Features derived from token membership in the extended lists improve the accuracy of learned named-entity taggers.
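Steps (i) and (ii) above can be sketched with a toy corpus. Here a context pattern is simply the pair (word before, word after) a seed mention; the corpus and seed are made up for illustration, and a real system would also score patterns for precision.

```python
corpus = [
    "cities such as Boston attract tourists",
    "cities such as Seattle attract investment",
    "mayors of Chicago and Boston met today",
]
seeds = {"Boston"}

def induce_patterns(corpus, seeds):
    """Step (i): each seed occurrence yields a (left word, right word) pattern."""
    patterns = set()
    for sent in corpus:
        tokens = sent.split()
        for i, tok in enumerate(tokens):
            if tok in seeds and 0 < i < len(tokens) - 1:
                patterns.add((tokens[i - 1], tokens[i + 1]))
    return patterns

def extract_with_patterns(corpus, patterns):
    """Step (ii): any token appearing in an induced context is a candidate entity."""
    candidates = set()
    for sent in corpus:
        tokens = sent.split()
        for i in range(1, len(tokens) - 1):
            if (tokens[i - 1], tokens[i + 1]) in patterns:
                candidates.add(tokens[i])
    return candidates

patterns = induce_patterns(corpus, seeds)
print(extract_with_patterns(corpus, patterns))  # finds Seattle via 'as _ attract'
```

Step (iii) would then feed membership in the extended list (here {Boston, Seattle}) as a feature to a learned named-entity tagger.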