Dark Data In the Long Tail of Science: Examples in Biology
1. Dark Data In the Long Tail of Science: Examples in Biology. September 2, 2009, National Institute of Standards and Technology. P. Bryan Heidorn, NSF, University of Illinois, University of Arizona
9. Naive View of Science Data: GenBank, PDB. Power Law of Science Data: $f(x) = ax^k + o(x^k)$, and $f(x) = ax^k + o(x^k) \mid x < .20$. [Chart axes: Data Volume; Science Projects and Initiatives]
10. Does NSF’s Data Follow the Power Law? I do not know, but if $1 = X bytes…
11. 20-80 Rule: The small are big!
Share of Grants | Number Grants | Total Dollars | Range
20% | 1869 | $1,199,088,125 | $6,892,810-$350,000
80% | 7478 | $938,548,595 | $350,000-$831
Total Grants | 9347 | $2,137,636,716 |
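The split on this slide can be checked with a few lines of arithmetic. The sketch below uses only the figures shown on the slide; the variable names are illustrative, and it assumes the 20% group is the set of largest grants, which the dollar ranges suggest. Note that the two dollar subtotals sum to $4 more than the total printed on the slide, presumably a rounding effect in the source figures.

```python
# Figures copied from the slide; all names here are illustrative.
top_grants, top_dollars = 1869, 1_199_088_125    # the 20% group (largest grants)
rest_grants, rest_dollars = 7478, 938_548_595    # the 80% group (smaller grants)

total_grants = top_grants + rest_grants
total_dollars = top_dollars + rest_dollars

# Fraction of grants and of dollars that fall in each group.
top_grant_share = top_grants / total_grants
top_dollar_share = top_dollars / total_dollars
rest_dollar_share = rest_dollars / total_dollars

print(f"Largest {top_grant_share:.1%} of grants: {top_dollar_share:.1%} of dollars")
print(f"Remaining grants: {rest_dollar_share:.1%} of dollars")
```

Under these figures the smaller grants, taken together, account for a little under half of the total dollars, which is the sense in which "the small are big".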
22. Automatic Metadata Extraction (Darwin Core) From Museum Specimen Labels. 2008 Dublin Core Conference. P. Bryan Heidorn, Qin Wei, University of Illinois at Urbana-Champaign.
… <co> Curtis, </co><hdlc> North American Pl </hdlc><cnl> No.</cnl><cn> 503*</cn> <gn> Polygala</gn><sp> ambigua,</sp><sa> Nutt.,</sa><val> var.</val> <hb> Coral soil,</hb><lc> Cudjoe Key, South Florida. </lc><col> Legit</col><co> A. H. Curtiss.</co><dt>February</dt>…
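As a minimal sketch of how the marked-up label above could be read back into fields, the following uses a regular expression over the tag pairs. The tag names are taken verbatim from the slide; the function names `parse_label` and `group_by_tag` are hypothetical, and nothing here is claimed to be the authors' tooling.

```python
import re

# Marked-up label text copied from the slide (leading and trailing "…" omitted).
LABEL = (
    "<co> Curtis, </co><hdlc> North American Pl </hdlc><cnl> No.</cnl><cn> 503*</cn> "
    "<gn> Polygala</gn><sp> ambigua,</sp><sa> Nutt.,</sa><val> var.</val> "
    "<hb> Coral soil,</hb><lc> Cudjoe Key, South Florida. </lc><col> Legit</col>"
    "<co> A. H. Curtiss.</co><dt>February</dt>"
)

# One element: an opening tag, its text, and the matching closing tag.
TAGGED = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def parse_label(markup):
    """Return (tag, text) pairs in document order, with outer whitespace stripped."""
    return [(tag, text.strip()) for tag, text in TAGGED.findall(markup)]

def group_by_tag(pairs):
    """Collect the texts for each tag, since a tag such as <co> can repeat."""
    fields = {}
    for tag, text in pairs:
        fields.setdefault(tag, []).append(text)
    return fields

pairs = parse_label(LABEL)
fields = group_by_tag(pairs)
for tag, texts in fields.items():
    print(tag, texts)
```

Grouping into lists rather than single values keeps both `<co>` elements, which matters because the same tag appears twice on this label.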
41. Learning w/ pre-categorization. [Diagram. Training: Gold Labels → Categorization → Class 1 Labels, Class 2 Labels, …, Class n Labels → one Machine Learner per class → Model 1, Model 2, …, Model n. Classification: Unclassified Labels → Categorization → Class 1 Labels, Class 2 Labels, …, Class n Labels → Machine Classification with the matching model → Classified Labels.]
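The flow on this slide (categorize first, then train and apply one model per category) can be sketched as follows. This is a toy under loud assumptions: the categorizer rule and the "learner" (a most-frequent-tag lookup per token) are stand-ins chosen for brevity, not the method used in the talk, and the two gold examples reuse tokens and tags from the label shown on slide 22.

```python
from collections import Counter, defaultdict

def categorize(label_text):
    """Hypothetical categorizer: route a label by whether a collector name appears."""
    return "Curtiss" if "Curtiss" in label_text else "General"

def train(tagged_labels):
    """Stand-in learner: remember the most frequent tag seen for each token."""
    counts = defaultdict(Counter)
    for tagged in tagged_labels:
        for token, tag in tagged:
            counts[token][tag] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}

def train_specialists(gold):
    """gold: list of (label_text, [(token, tag), ...]); returns one model per category."""
    by_class = defaultdict(list)
    for text, tagged in gold:
        by_class[categorize(text)].append(tagged)
    return {cls: train(items) for cls, items in by_class.items()}

def classify(label_text, models, unknown="?"):
    """Categorize an unclassified label, then tag it with that category's model."""
    model = models.get(categorize(label_text), {})
    return [(tok, model.get(tok, unknown)) for tok in label_text.split()]

# Toy gold data built from tokens and tags on the slide 22 label.
GOLD = [
    ("Legit A. H. Curtiss.",
     [("Legit", "col"), ("A.", "co"), ("H.", "co"), ("Curtiss.", "co")]),
    ("Coral soil,", [("Coral", "hb"), ("soil,", "hb")]),
]

models = train_specialists(GOLD)
print(classify("Legit A. H. Curtiss.", models))
print(classify("Coral reef", models))
```

The point of the design is only the routing: each model sees training data from its own category, so a specialist can pick up conventions that a single general model would average away.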
42. FIG. 5. Improved Performance of Specialist Model: Specialist100 Curtiss vs. 100 General
43. Machine Learning in BioGeomancer’s Locality Specification. SPNHC & NSCA 2006. P. Bryan Heidorn (1), Hong Zhang (1), Eugene Chung (2) and BGWG. (1) Graduate School of Library and Information Science, (2) Linguistics, University of Illinois
46. Example Locality Types
Record # | Specification of Location | Locality Type
43 | dario 7 mi wnw of; RIO VIEJO | FOH; F
86 | near Aleutian Islands; S of Amukta Pass | NF; FH
100 | INDIAN CREEK, 11 MI. W HWY 160 | P; POH
109 | TIESMA RD, 1.5 MI NW EDGEWATER; OFF LAKE MICHIGAN R | P; FOH; NP
160 | WALTMAN, 9 MI N, 2.5 MI W OF | FOO
181 | 0.4 mi N Collinston on LA 138 | FPOH
204 | Seward Peninsula; vic. Bluff, S coast | F; NF; FS
50. Information Extraction From FNA
Templates for useful information
Extraction Rules
Structured information: Leaf_Shape obovate; Leaf_Shape orbiculate; Blade_Dimension 3—9 x 3—8 cm; …
Original documents: … Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, base obtuse to broadly cuneate, margins flat, coarsely and often irregularly doubly serrate to nearly dentate, …
Knowledge bases: … PartBlade: Leaf blade, Blades, blade, …
Pattern:: * <PartBlade> ' ' <leafShape> * ( <leafShape> ) ',' * Output:: leaf {leafShape $1}
Pattern:: * <PartBlade> * ', ' ( <Range> ' ' * <LengUnit> ) * <PartBase> Output:: leaf {bladeDimension $1}
User log analysis: Leaf_Shape, Leaf_Margin, Leaf_Apex, Leaf_Base, Blade_Dimension, …
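To make the rule idea concrete, here is a loose regular-expression approximation of the two patterns on this slide, applied to the slide's own example sentence. It is not the pattern language shown above: the `PartBlade` terms come from the slide's knowledge base, while the leaf-shape vocabulary, the unit list, and the function name `extract` are assumptions made for illustration.

```python
import re

# Example sentence from the slide (truncated after "margins flat").
SENTENCE = ("Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, "
            "base obtuse to broadly cuneate, margins flat")

PART_BLADE = ["Leaf blade", "Blades", "blade"]   # from the slide's knowledge base
LEAF_SHAPES = ["obovate", "orbiculate"]          # assumed vocabulary
LENGTH_UNITS = ["mm", "cm", "m"]                 # assumed unit list

def alternation(terms):
    """Build a regex alternation, longest term first so 'Leaf blade' beats 'blade'."""
    return "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))

# Blade term, a space, then everything up to the first comma (the shape clause).
BLADE_CLAUSE = re.compile(rf"(?:{alternation(PART_BLADE)}) ([^,]*),")
SHAPE = re.compile(rf"\b(?:{alternation(LEAF_SHAPES)})\b")
# A "length range × width range" followed by a unit, e.g. "3--9 × 3--8 cm".
DIMENSION = re.compile(rf"\d+--\d+ × \d+--\d+ (?:{alternation(LENGTH_UNITS)})\b")

def extract(sentence):
    """Return (attribute, value) pairs for leaf shape and blade dimension."""
    facts = []
    clause = BLADE_CLAUSE.search(sentence)
    if clause:
        facts += [("Leaf_Shape", s) for s in SHAPE.findall(clause.group(1))]
        dim = DIMENSION.search(sentence, clause.end())
        if dim:
            facts.append(("Blade_Dimension", dim.group(0)))
    return facts

facts = extract(SENTENCE)
for attribute, value in facts:
    print(attribute, value)
```

Keeping the vocabularies in plain lists mirrors the separation on the slide between knowledge bases and extraction rules: the patterns stay fixed while the term lists grow.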
51. Results – System Performance
NT: number of tasks accomplished in total
NTH: number of tasks accomplished per hour
TSR: task success rate
SSR: search success rate
NSST: number of searches to accomplish a task
TST: time spent to accomplish a task
NDVST: number of documents viewed to accomplish a task
Measure | SEARFA | SEARF | Sig.(ANOVA)
NT | 6.75 | 4.50 | 0.005
NTH | 8.078 | 3.598 | 0.005
TSR | 0.860 | 0.568 | 0.000
SSR | 0.210 | 0.053 | 0.011
NSST | 4.779 | 9.584 | 0.000
TST | 338.8 | 435.2 | 0.72
NDVST | 11.16 | 14.75 | 0.162
Editor's notes
Change to new front image
Add jobs from the interagency working group report.