8. Gathering Evidence: intermediate terms link "migraine" to "magnesium" (stress, CCB, PA, and SCD each co-occur with both migraine and magnesium). Slide reused with permission of Marti Hearst @ UCB
9. A Higher-Order Co-Occurrence Relation! Migraine is linked to magnesium through the intermediate terms stress, CCB, PA, and SCD. No single author knew or wrote about this connection; this distinguishes Text Mining from Information Retrieval. Slide reused with permission of Marti Hearst @ UCB
33. Higher-order Co-occurrence: 3rd-order co-occurrence as a chain of co-occurrences (Kontostathis & Pottenger, 2006). Context: document, instance, record, …; entity: term, AVP, item, … Example
Editor's notes
IID assumption (Taskar et al., 2002): test instances are related to each other, so their labels are not independent (Lu & Getoor, 2003; Jensen, 1999). Traditional statistical inference assumes that instances are independent, which can lead to inappropriate conclusions (Lu & Getoor, 2003): "traditional data mining tasks such as association rule mining, market basket analysis, and cluster analysis commonly attempt to find patterns in a dataset characterized by a collection of independent instances of a single relation. This is consistent with the classical statistical inference problem of trying to identify a model given a random sample from a common underlying distribution." Latent semantics are important for Information Retrieval (IR) and Text Mining applications. For example, the latent aspects of term similarity that LSI reveals depend on the higher-order paths between terms.
Explicit links (e.g., hyperlinks between web pages or citation links between scientific papers): in the collective classification phase, the model is applied to a separate network, i.e., a set of unlabeled test instances with links. Collective inference means making inferences about multiple data instances simultaneously; it can significantly reduce classification error (Jensen et al., 2004). The basic idea in these iterative algorithms is to start with a labeling of reasonable quality (from a content-only classifier) and refine it using a coupled distribution of the content and the labels of neighbors.
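The iterative refinement described above can be sketched as follows. This is a minimal illustration, not the cited papers' implementation: the content scores, link graph, and the simple weighted combination of content and neighbor labels are all invented for the example.

```python
# Sketch of iterative (collective) classification: bootstrap labels from a
# content-only classifier, then repeatedly relabel each node by combining
# its content score with the current labels of its neighbors.

def collective_classify(content_score, neighbors, iterations=10, weight=0.3):
    """content_score: node -> score in [0, 1] from a content-only classifier.
    neighbors: node -> list of linked nodes. Returns node -> label in {0, 1}."""
    labels = {n: int(s >= 0.5) for n, s in content_score.items()}  # bootstrap
    for _ in range(iterations):
        new_labels = {}
        for n, s in content_score.items():
            nbrs = neighbors.get(n, [])
            nbr_avg = sum(labels[m] for m in nbrs) / len(nbrs) if nbrs else s
            # Couple content evidence with the neighbors' current labels.
            combined = (1 - weight) * s + weight * nbr_avg
            new_labels[n] = int(combined >= 0.5)
        if new_labels == labels:  # converged
            break
        labels = new_labels
    return labels

# A borderline node ("b") gets pulled toward its confidently labeled neighbors.
scores = {"a": 0.9, "b": 0.45, "c": 0.8}
links = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(collective_classify(scores, links))  # {'a': 1, 'b': 1, 'c': 1}
```

Note how "b" starts below the 0.5 content threshold but is relabeled once its neighbors' labels are taken into account, which is exactly the coupling the notes describe.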
Several simple link attributes are constructed from statistics computed over the categories of the different sets of linked objects: mode-link, a single attribute computed from the in-links, out-links, and co-citation links; count-link, which is basically the frequency of the classes of linked instances; and binary-link, a simple binary feature vector in which, for each class label, the corresponding feature is 1 if a link to an instance of that class occurs at least once. To determine a label (dependent variable) c in {-1,+1} given an input vector (explanatory variables) x, i.e., P(c=1|w,x), find the optimal w for the discriminative function. CORA: 4,187 machine learning papers, 7 classes; dictionary of 1,400 words after stemming and stopword removal. WebKB: web pages from four computer science departments; 4 topics plus an "other" class (5 total); without "other", 700 pages.
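The binary-link and count-link attributes described above can be sketched directly. This is an illustration only; the class names and link labels are invented, and the real systems build these vectors separately for in-links, out-links, and co-citation links.

```python
# Sketch of two of the link attributes described in the notes, computed from
# the class labels of an instance's linked neighbors.

CLASSES = ["ml", "db", "systems"]  # hypothetical label set

def binary_link_features(linked_labels):
    """binary-link: 1 per class if at least one linked instance has it."""
    present = set(linked_labels)
    return [1 if c in present else 0 for c in CLASSES]

def count_link_features(linked_labels):
    """count-link: frequency of each class among the linked instances."""
    return [sum(1 for lab in linked_labels if lab == c) for c in CLASSES]

links = ["ml", "ml", "systems"]        # labels of cited/citing papers
print(binary_link_features(links))     # [1, 0, 1]
print(count_link_features(links))      # [2, 0, 1]
```

These vectors would then be concatenated with the content features before fitting the discriminative model P(c=1|w,x).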
(Edmond, 1997) The application described selects the most appropriate term when a context (such as a sentence) is provided. (Schütze, 1998) Uses second-order co-occurrence of the terms in the training set to create context vectors that represent a specific sense of the word to be discriminated. (Xu & Croft, 1998) A strong correlation between terms A and B, and also between terms B and C, results in the placement of terms A, B, and C in the same equivalence class; the result is a transitive semantic relationship between A and C. Orders of co-occurrence higher than two are also possible in this application. Latent Semantic Indexing (LSI), a well-known approach to information retrieval, implicitly depends on higher-order co-occurrences; previous work demonstrated empirically that higher-order co-occurrences play a key role in the effectiveness of LSI-based systems. (Zhang, 2000) A related effort used second-order co-occurrences to improve the runtime performance of LSI. Literature-Based Discovery (LBD) employs second-order co-occurrence to discover connections between concepts (entities). A well-known example is the discovery of a novel migraine-magnesium connection in the medical domain: the researchers found that in the Medline database some terms co-occur frequently with "migraine" in article titles, e.g. "stress" and "calcium channel blockers." They also discovered that "stress" co-occurs frequently with "magnesium" in other titles. As a result, they hypothesized a link between "migraine" and "magnesium," and some clinical evidence has since been obtained that supports this hypothesis.
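The LBD discovery pattern above (the so-called ABC model) is easy to sketch: find intermediate terms B that co-occur with the source term A in some titles and with the target term C in others, even when A and C never co-occur directly. The mini collection of "titles" below is invented for illustration and only mimics the migraine-magnesium example.

```python
# Sketch of second-order co-occurrence discovery (ABC model): bridge terms
# connect "migraine" (A) to "magnesium" (C) via distinct title contexts.

titles = [
    {"migraine", "stress"},
    {"migraine", "calcium channel blockers"},
    {"stress", "magnesium"},
    {"calcium channel blockers", "magnesium"},
    {"magnesium", "diet"},
]

def b_terms(a, c, contexts):
    """Return (bridge terms co-occurring with both a and c, whether a and c
    ever co-occur directly in the same context)."""
    cooc_a = set().union(*(t for t in contexts if a in t), set()) - {a}
    cooc_c = set().union(*(t for t in contexts if c in t), set()) - {c}
    direct = any(a in t and c in t for t in contexts)
    return cooc_a & cooc_c, direct

bridges, direct = b_terms("migraine", "magnesium", titles)
print(sorted(bridges))  # ['calcium channel blockers', 'stress']
print(direct)           # False: the A-C link is only second-order
```

The hypothesis-worthy connections are exactly the pairs with non-empty bridge sets and no direct co-occurrence, which is what made the migraine-magnesium link novel.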
preliminary conclusion: frequency distributions of higher-order itemsets capture distinguishing characteristics of the classes in supervised machine learning datasets
Cybersecurity: abnormal BGP events often affect the global routing infrastructure. For example, in January 2003, the Slammer worm caused a surge of BGP updates. Since BGP anomaly events often cause major disruptions in the Internet, the ability to detect and categorize BGP events is extremely useful. Our aim is to distinguish whether Border Gateway Protocol (BGP) traffic is caused by an anomalous event such as a power failure, a worm attack, or a node/link failure. This differs from the mushroom dataset because the attributes are integer-valued.
For the Slammer worm and Blackout events, the t-test probability starts increasing as the sliding window approaches the 25th window. When the number of abnormal event instances inside the current window exceeds a certain threshold (around the 21st–23rd window), we observe a sharp increase. After the 25th window, the probability stays above 5%, revealing that we are in the event period and have detected and distinguished both the Slammer and Blackout events using their respective event models. We detect and distinguish these events in 360 seconds or less. Results are similar for the Witty worm event in Figure 6, although detection takes slightly longer.
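The sliding-window detection idea can be sketched as below. This is a simplified illustration, not the paper's method: the synthetic update counts and the threshold are invented, and it thresholds a Welch t statistic directly rather than the t-test probability the notes describe.

```python
# Sketch of sliding-window anomaly detection on a series of BGP-update
# counts: compare each window against a baseline window with a two-sample
# (Welch) t statistic and flag windows where it exceeds a threshold.

from statistics import mean, variance

def t_stat(sample, baseline):
    """Welch's t statistic between a sliding window and the baseline."""
    n1, n2 = len(sample), len(baseline)
    se = (variance(sample) / n1 + variance(baseline) / n2) ** 0.5
    return abs(mean(sample) - mean(baseline)) / se if se else 0.0

def detect(series, window=5, threshold=4.0):
    baseline = series[:window]            # assume the series starts normal
    flagged = []
    for i in range(window, len(series) - window + 1):
        if t_stat(series[i:i + window], baseline) > threshold:
            flagged.append(i)             # window start where the event shows
    return flagged

# Quiet traffic, then a surge of updates (e.g., a worm event).
counts = [10, 12, 11, 9, 10, 11, 10, 12, 95, 102, 99, 97, 101, 100]
print(detect(counts))
```

As in the events described above, the statistic rises gradually as anomalous instances enter the window and jumps sharply once they dominate it.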
This figure depicts three documents, D1, D2, and D3, each containing two terms, or entities, represented by the letters A, B, C, and D. Together the three documents form a higher-order path that links entity A with entity D through B and C. This is a third-order path since three links, or "hops," connect A and D. D1, D2, and D3 are not always documents: they might be records in a database or instances in a labeled training dataset. Likewise, the entities A, B, C, etc. need not be terms; they may be values in a database record, or items (attribute-value pairs) in an instance. In fact, we can extract co-occurrence relations as long as there is a meaningful context of entities.
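The figure's construction can be sketched directly: entities co-occurring in a context are first-order neighbors, and chaining those links yields higher-order paths. The three toy "documents" below mirror the figure's A-B-C-D example; contexts could just as well be database records or training instances.

```python
# Sketch of extracting higher-order co-occurrence paths: build the
# first-order co-occurrence graph from contexts, then enumerate simple
# paths of a fixed number of hops (A-B-C-D is third order: three hops).

from itertools import combinations

docs = [{"A", "B"}, {"B", "C"}, {"C", "D"}]  # D1, D2, D3

graph = {}
for doc in docs:
    for x, y in combinations(sorted(doc), 2):
        graph.setdefault(x, set()).add(y)
        graph.setdefault(y, set()).add(x)

def paths(start, end, hops):
    """All simple paths from start to end with exactly `hops` links."""
    stack = [[start]]
    found = []
    while stack:
        path = stack.pop()
        if len(path) == hops + 1:
            if path[-1] == end:
                found.append(path)
            continue
        for nxt in graph.get(path[-1], ()):
            if nxt not in path:
                stack.append(path + [nxt])
    return found

print(paths("A", "D", 3))  # [['A', 'B', 'C', 'D']]: a third-order path
```

No single context contains both A and D, yet the chained links recover the relation, which is the higher-order co-occurrence the figure illustrates.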