Ordering the chaos: 
creating websites using 
imperfect data 
Andrew Stretton 
Oxford University Web SIG November 2014
Who am I, what is ChemBio Hub? 
• Andrew Stretton – Data Architect and Developer 
github.com/strets123 
@strets123 
linkedin (google me) 
• ChemBio Hub 
http://chembiohub.ox.ac.uk (feel free to link to us!) 
@oxchembiohub 
github.com/thesgc
ChemBio Hub exists to 
support research at the 
interface of chemistry and 
biology 
by enabling sharing of reagents, expertise 
and data across 20+ departments
Who are we trying to connect and how? 
User 1: 
Scientist at Oxford 
User 2: 
Potential collaborator 
Could be in industry 
or anywhere in academia 
Stored and curated by ChemBio Hub 
Unpublished 
results 
Negative Data 
Methods 
Equipment 
Reagents 
? Not sure yet 
Areas of 
expertise 
Questions 
and answers 
Contacts 
Publications 
Held on other sites or social networks 
Organised/linked to by ChemBio Hub
All of these parts require tagging entities in text. 
How can we do it cheaply and sustainably?
What sorts of messy data are we working with? 
• Full text from procedures, biographies, web sites 
• Raw CSV/ Excel formats from multiple machines 
or departmental processes 
• “Standard” XML and JSON formats from various 
sources that do not map perfectly to our 
application 
• Multiple external databases to submit data to
How do most of our users like their web-based tools? 
Simple Search 
Flexible data 
management 
Comprehensive, 
overlapping tagging 
Clear progress, seamless experience
What do we sometimes give them? 
• Incomplete or many-to-one tagging 
• Hyperlinks instead of the right information 
from the other site 
• Dumb search 
• Inflexible schemas 
• Lack of linking between datasets
What strategies do we have to deal with messy data? 
Create more helpful data management apps 
Fill in gaps in tagging by using search engines 
Consider creating databases of flat files 
Create map reduce / 
Database views 
for schema 
Normalisation and 
data analysis 
Web crawling - not as 
hard or messy as it 
used to be
Let’s look at filling in gaps in tagging with search engines first, 
happy to discuss the other areas later…
How do we fill in gaps on un-tagged 
data? 
Let’s do an experiment… 
github.com/strets123/web-sig-2014/
Elasticsearch - information extraction on-the-fly 
• Take a dataset of 18801 companies 
– ~50% tagged 
– > 80% have some text data 
[Bar chart: share of companies (0% to 100%) that have Tags, Description, Overview, Overview or description] 
Source data: http://jsonstudio.com/resources/ 
github.com/strets123/web-sig-2014/
Use the “significant terms” feature… 
• Which description/overview words are most strongly 
linked to each tag? 
travel: [travel, travelers, trip, trips, hotels, flights, traveler, travellers, airline, hotel] 
education: [students, teachers, learning, education, student, educational] 
music: [music, artists, musicians, songs, labels, playlists, bands, song, artist, fans] 
realestate: [estate, real, agents, property, listings] 
search engine optimization: [seo, optimization, engine, ppc, marketing, search, click, pay] 
jobs: [job, jobs, employers, career] 
onlinemarketing: [marketing, seo, agency, optimization] 
projectmanagement: [project, projects, task, collaboration, teams, management]
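As a rough illustration of the question on this slide, the request body for Elasticsearch's `significant_terms` aggregation can be built as a plain Python dictionary. This is a minimal sketch, not the code from the talk's repository: the field names `tag_list` and `description` and the aggregation name `tag_words` are assumptions made for the example, and depending on the Elasticsearch version and mapping the text field may need different settings (or the related `significant_text` aggregation).

```python
import json


def significant_terms_request(tag, tag_field="tag_list",
                              text_field="description", size=10):
    """Build a search body asking which words in `text_field` are unusually
    common among documents that carry `tag`.

    `tag_field` and `text_field` are hypothetical field names.
    """
    return {
        "size": 0,  # only the aggregation is wanted, not the matching documents
        "query": {"term": {tag_field: tag}},
        "aggs": {
            "tag_words": {  # arbitrary name chosen for this sketch
                "significant_terms": {"field": text_field, "size": size}
            }
        },
    }


body = significant_terms_request("music")
print(json.dumps(body, indent=2))
```

The body would be sent to the index's `_search` endpoint; the documents matching the query form the foreground set, and the whole index is the background they are compared against.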
Now let’s test these queries 
• Which companies have no tag but are most 
likely to need tagging with “music”… 
uPlaya 
– Description: uPlaya provides independent or unsigned musicians with immediate 
feedback on their music…. 
– Category: games_video 
– Tags: - 
Webceleb 
– Description: Webceleb is music marketplace and community where musicians 
and fans engage and profit from discovering, purchasing and 
downloading the latest independent music.…. 
– Category: games_video 
– Tags: -
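The same test can be imitated without a search engine. The sketch below ranks untagged records by how many of the "music" significant terms appear in their description. It is an assumption-laden stand-in for the real query, not the talk's code: the first description is quoted (truncated) from the slide, while the other two records and the simple hit-count scoring are invented for illustration.

```python
import re

# Words listed in the talk as most strongly linked to the "music" tag.
MUSIC_TERMS = {"music", "artists", "musicians", "songs", "labels",
               "playlists", "bands", "song", "artist", "fans"}

# Toy records: uPlaya's text is quoted from the slide, the rest are made up.
companies = [
    {"name": "uPlaya", "tags": [],
     "description": "uPlaya provides independent or unsigned musicians "
                    "with immediate feedback on their music"},
    {"name": "ExampleTravelCo", "tags": [],
     "description": "Book flights and hotels for your next trip"},
    {"name": "ExampleLabel", "tags": ["music"],
     "description": "An indie label for bands and artists"},
]


def term_hits(text, terms):
    """Count how many distinct words from `terms` occur in `text`."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(words & terms)


def suggest_for_tag(records, terms):
    """Names of untagged records that mention any term, best match first."""
    scored = [(term_hits(r["description"], terms), r["name"])
              for r in records if not r["tags"]]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


print(suggest_for_tag(companies, MUSIC_TERMS))
```

A search engine would also weight each term by how rare it is; counting distinct hits is the simplest possible substitute.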
But what if we have 
NO TAGS?
A process to extract tags from text… 
• Index data 
• Assign resources (e.g. Amazon spot instance for large dataset) 
• List word counts with the least frequent first 
• Exclude lowest counts 
• Aggregate the significant terms for each word 
• Filter words that have a lot of high scoring significant terms
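The counting, aggregating and filtering steps of this process can be sketched in plain Python on a toy corpus. This is not the talk's implementation: the documents are invented, the thresholds are arbitrary, and the score (absolute change times relative change in frequency) is only modelled on the kind of heuristic a search engine uses for significant terms. Indexing and resource allocation are left out.

```python
import re
from collections import Counter

# Invented toy corpus standing in for company descriptions.
DOCS = [
    "indie labels sign artists and release music",
    "music streaming for indie artists and labels",
    "artists share music with fans",
    "book flights and hotels for travel",
    "compare flights and travel fares",
    "travel tickets and hotels",
    "cloud computing infrastructure",
    "deploy private cloud infrastructure",
]


def tokenize(text):
    """Lower-case a document and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_tags(docs, min_count=2, min_score=1.0, min_terms=3):
    """Return {word: related terms} for words with enough significant terms."""
    doc_words = [tokenize(d) for d in docs]
    total = len(doc_words)
    doc_freq = Counter(w for words in doc_words for w in words)

    # List words least frequent first, excluding the lowest counts.
    vocab = sorted((w for w, n in doc_freq.items() if n >= min_count),
                   key=lambda w: (doc_freq[w], w))
    keep = set(vocab)

    tags = {}
    for word in vocab:
        # Foreground set: documents that contain the candidate word.
        foreground = [words for words in doc_words if word in words]
        fg_counts = Counter(w for words in foreground for w in words if w in keep)
        scored = []
        for term, n in fg_counts.items():
            fg = n / len(foreground)        # share of foreground docs with term
            bg = doc_freq[term] / total     # share of all docs with term
            if fg > bg:
                score = (fg - bg) * (fg / bg)
                if score >= min_score:
                    scored.append((-score, term))
        # Keep only words that have several high-scoring significant terms.
        if len(scored) >= min_terms:
            tags[word] = [term for _, term in sorted(scored)]
    return tags


result = extract_tags(DOCS)
for word, terms in result.items():
    print(f"{word}: {terms}")
```

On this corpus, common filler such as "and" is dropped because none of its co-occurring words score highly, while "indie" and "travel" survive with short lists of related words.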
What does this give us? 
athletes: [athletes, coaches, athlete, coach, sports, fans] 
avatars: [avatars, avatar, multiplayer, virtual, casual, 3d, games, chat, create, game] 
clouds: [clouds, cloud, hybrid, computing, private, deploy, public, infrastructure] 
dashboards: [dashboards, bi, reports, analytics, reporting, self, analysis, intelligence, features] 
dial: [dial, calling, calls, voip, number, call, voice, phone] 
exercise: [exercise, sleep, nutrition, fitness, weight, healthy, health] 
indie: [indie, labels, artists, music] 
logos: [logos, branding, flash, design] 
pci: [pci, dss, hipaa, compliance, sensitive, compliant] 
portland: [portland, oregon, inc, founded] 
ringtones: [ringtones, ringtone, personalization, games] 
traders: [traders, forex, trader, trading, quotes, stock, trade] 
yellow: [yellow, pages, directory, local] 
abc: [abc, cnn, nbc, television] 
argentina: [argentina, buenos, aires, chile, uruguay, colombia, brazil, mexico, latin] 
aviation: [aviation, aircraft, aerospace, defense, transportation] 
airline: [airline, fares, airlines, flights, flight, travel, tickets, hotel, air]
What else can we do with this? 
• Filter words that have a lot of high scoring significant terms 
• De-duplicate where large overlaps exist 
• Assign levels of tags in order of frequency 
• Use to categorise new data on the fly using percolate 
• Curate manually 
• Generate a sidebar menu 
• Use the Elasticsearch phrase suggester to create phrase tags 
github.com/strets123/web-sig-2014/
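One possible way to de-duplicate where large overlaps exist is to compare the term lists of candidate tags with Jaccard similarity and keep only the first tag in each heavily overlapping group. This is a sketch under assumptions: the talk does not say which overlap measure was used, the 0.6 threshold is arbitrary, and the "labels" entry is an invented near-duplicate (the "indie" and "airline" lists are copied from the earlier slide).

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two term lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def deduplicate(tag_terms, threshold=0.6):
    """Keep a tag only if its terms do not overlap heavily with a kept tag."""
    kept = {}
    for tag, terms in tag_terms.items():
        if all(jaccard(terms, other) < threshold for other in kept.values()):
            kept[tag] = terms
    return kept


candidates = {
    "indie": ["indie", "labels", "artists", "music"],
    "labels": ["labels", "indie", "artists", "music", "bands"],  # invented
    "airline": ["airline", "fares", "airlines", "flights", "flight",
                "travel", "tickets", "hotel", "air"],
}

print(list(deduplicate(candidates)))
```

Because the first tag seen wins, feeding the candidates in order of frequency would also decide which of two overlapping tags survives.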
Advantages over direct curation / supervised learning: 
• Simplicity and pragmatism 
• Applicable to novel domains 
– e.g. Chemical Biology 
• Auto-generated tags choose more appropriate 
word combinations than manual curators 
• No need for complex data formats like RDF 
• Data from many sources can be mixed 
– e.g. categories from other universities’ sites…
Where might this technology lead? 
• How about a tag-based file system? 
• How about an implicit social network? 
• Elasticsearch is really easy to scale… 
• Which websites, filesystems and datasets do 
you need to categorise? 
– Do you really need RDF ontologies, curators etc. or 
can you just do something simple?
Summary 
• We now have many options to categorise and 
tidy up messy data 
• Managing variations on schemas takes a lot of 
resources – leave it to the data owners if you 
can! 
• When it comes to tagging… 
– Perfection is in the eye of the beholder 
– Sustainability is really important
Thanks 
• Thanks to the Research 
informatics team at the NDM 
Structural Genomics 
Consortium 
– Paul Barrett 
– Karen Porter 
– Michael O’Hagan 
– Brian Marsden 
– David Damerell 
– Sefa Garsot 
– Anthony Bradley 
• Thanks to the InfoDev team 
at IT services for answering 
my endless questions about 
webauth 
• Funders: 
– John Fell Fund 
– NDM Strategic 
– Wellcome Trust 
– Higher Education Funding 
Council 
• To everyone here for listening
Any Questions? 
• Andrew Stretton 
github.com/strets123 
@strets123 
linkedin (google me) 
• ChemBio Hub 
http://chembiohub.ox.ac.uk 
@oxchembiohub 
github.com/thesgc 
Simple example categorisation 
code (in Python) available here: 
github.com/strets123/web-sig-2014/
Appendix of other messy 
data techniques
How do we make it easy to 
add spreadsheet data to a 
system?
Working with flat files 
• Sometimes a flat file is the right schema for a 
dataset 
– User defined formats 
– Different types of research 
– Only some of the fields are relevant when 
comparing experiments 
– Data is not in memory unless needed 
• Pandas and HDF allow SQL-like queries on flat 
files
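To give a feel for SQL-like queries on a flat file, here is a minimal pandas sketch. The CSV content and column names are invented for illustration, and it assumes pandas is installed. Only the in-memory CSV case is shown; storing the same frame in HDF5 needs the optional PyTables package and is not covered here.

```python
import io

import pandas as pd

# Invented plate-reader style export; the column names are hypothetical.
CSV_TEXT = """well,compound,concentration_um,signal
A1,cmpd-1,10,0.91
A2,cmpd-1,1,0.45
B1,cmpd-2,10,0.12
B2,cmpd-2,1,0.08
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Roughly: SELECT well, signal FROM df
#          WHERE concentration_um = 10 AND signal > 0.5
hits = df.query("concentration_um == 10 and signal > 0.5")[["well", "signal"]]
print(hits)

# Roughly: SELECT compound, AVG(signal) FROM df GROUP BY compound
mean_signal = df.groupby("compound")["signal"].mean()
print(mean_signal)
```

Only the columns named in the query need to be understood by the code, which fits the point that not every field is relevant when comparing experiments.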
Helpful data management 
• Data Wrangler 
– https://player.vimeo.com/video/19185801 
• Raw 
– http://raw.densitydesign.org 
• Take these as inspiration for our tool for re-shaping 
biochemistry data
Simplifying web crawling 
• Modern web crawling patterns use class 
selectors instead of XPath 
– Less likelihood of change 
• Content can be crawled using a backend web 
browser 
– Dynamic javascript elements are included 
• Using a website’s data for classification is 
more acceptable than wholesale reproduction
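The class-selector idea can be sketched with nothing but Python's standard library. A real crawler would more likely use a dedicated scraping library, so treat this as an illustration only: the HTML snippet and the `bio` class name are invented. The extractor finds elements by class wherever they sit in the page, whereas a positional XPath such as `/html/body/div[1]/div/p` stops matching as soon as a wrapper element is added or removed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0       # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1              # nested element inside a match
        elif self.css_class in classes:
            self.depth = 1               # entering a new match
            self.results.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


# Invented page fragment; "bio" is a hypothetical class name.
PAGE = """
<div id="main"><div><p class="bio intro">Works on <b>kinase</b> inhibitors.</p></div>
<ul><li class="bio">Structural biology</li><li>Other</li></ul></div>
"""

parser = ClassTextExtractor("bio")
parser.feed(PAGE)
print(parser.results)
```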
Managing multiple JSON schemas with views 
PostgreSQL – also supported by Rails/ActiveRecord 
Couchbase
Why views over JSON can be useful 
• Expose only required fields from e.g. RDF 
• Input format may change but we don’t want 
crawler to break 
• Required fields may change 
• Versions are easy to support if format 
normalisation is in the database layer 
• Storage is cheap 
• View code is executed only once

Ordering the chaos: Creating websites with imperfect data

  • 1. Ordering the chaos: creating websites using imperfect data Andrew Stretton Oxford University Web SIG November 2014
  • 2. Who am I, what is ChemBio Hub? • Andrew Stretton – Data Architect and Developer github.com/strets123 @strets123 linkedin (google me) • Chembio Hub http://chembiohub.ox.ac.uk (feel free to link to us!) @oxchembiohub github.com/thesgc
  • 3. Chembio Hub exists to support research at the interface of chemistry and biology by enabling sharing of reagents, expertise and data across 20+ departments
  • 4. Who are we trying to connect and how? User 1: Scientist at Oxford User 2: Potential collaborator Could be in industry or anywhere in academia Stored and curated by ChemBio Hub Unpublished results Negative Data Methods Equipment Reagents ? Not sure yet Areas of expertise Questions and answers Contacts Publications Held on other sites or social networks Organised/linked to by ChemBio Hub
  • 5. All of these parts require tagging entities in text, how can we do it Who are we trying to connect and how? cheaply and sustainably? User 1: Scientist at Oxford User 2: Potential collaborator Could be in industry or anywhere in academia Stored and curated by ChemBio Hub Unpublished results Negative Data Methods Equipment Reagents ? Not sure yet Areas of expertise Questions and answers Contacts Publications Held on other sites or social networks Organised/linked to by ChemBio Hub
  • 6. What sorts of messy data are we working with? • Full text from procedures, biographies, web sites • Raw CSV/ Excel formats from multiple machines or departmental processes • “Standard” XML and JSON formats from various sources that do not map perfectly to our application • Multiple external databases to submit data to
  • 7. How do most of our users like their web-based tools? Simple Search Flexible data management Comprehensive, overlapping tagging Clear progress, seamless experience
  • 8. What do we sometimes give them? • Incomplete or many-to-one tagging • Hyperlinks instead of the right information from the other site • Dumb search • Inflexible schemas • Lack of linking between datasets
  • 9. What strategies do we have to deal with messy data? Create more helpful data management apps Fill in gaps in tagging by using search engines Consider creating databases of flat files Create map reduce / Database views for schema Normalisation and data analysis Web crawling - not as hard or messy as it used to be
  • 10. Let’s look at this one first, happy to discuss other areas later… What strategies do we have to deal with messy data? Create more helpful data management apps Fill in gaps in tagging by using search engines Consider creating databases of flat files Create map reduce / Database views for schema Normalisation and data analysis Web crawling - not as hard or messy as it used to be
  • 11. How do we fill in gaps on un-tagged data? Let’s do an experiment… github.com/strets123/web-sig-2014/
  • 12. Elasicsearch - information extraction on-the-fly • Take a dataset of 18801 companies ~ 50% tagged > 80% have some text data 0% 50% 100% Tags Description Overview Overview or description Source data : http://jsonstudio.com/resources/ github.com/strets123/web-sig-2014/
  • 13. Use the “significant terms” feature… • What description/overview words most strongly linked to each tag? travel education music realestate Search engine optimization jobs onlinemarketing projectmanagement travel students music estate seo job marketing project travelers teachers artists real optimization jobs seo projects trip learning musicians agents engine employers agency task trips education songs property ppc career optimization collaboration hotels student labels listings marketing teams flights educational playlists search management traveler bands click travellers song pay airline artist hotel fans
  • 14. Now let’s test these queries • Which companies have no tag but are most likely to need tagging with “music”… uPlaya Description uPlaya provides independent or unsigned musicians with immediate feedback on their music…. Category games_video Tags - Webceleb Description Webceleb is music marketplace and community where musicians and fans engage and profit from discovering, purchasing and downloading the latest independent music.…. Category games_video Tags -
  • 15. But what if we have NO TAGS?
  • 16. A process to extract tags from text… • Index data • Assign resources (e.g. Amazon spot instance for large dataset) • List word counts with the least frequent first • Exclude lowest counts • Aggregate the significant terms for each word • Filter words that have a lot of high scoring significant terms
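The steps above can be sketched in plain Python on a toy corpus. This is an illustration under stated assumptions, not the code from the repository: the six texts are made up, the thresholds are arbitrary, and the score (absolute change times relative change between the subset and the whole corpus) is a simplified stand-in for the heuristic a search engine would apply.

```python
import re
from collections import Counter

# Hypothetical toy corpus standing in for company descriptions.
TEXTS = [
    "cheap flights and hotels for travel",
    "travel deals on flights and hotels",
    "music from indie artists and bands",
    "indie bands share music with fans",
    "project tools for teams",
    "jobs and career advice",
]


def tokenize(text):
    """Index step: reduce a text to its set of lower-case words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def significant_terms(word, docs, doc_freq, min_score):
    """Terms over-represented in the documents that contain `word`."""
    foreground = [doc for doc in docs if word in doc]
    fg_counts = Counter(term for doc in foreground for term in doc)
    scores = {}
    for term, fg in fg_counts.items():
        fg_pct = fg / len(foreground)
        bg_pct = doc_freq[term] / len(docs)
        score = (fg_pct - bg_pct) * (fg_pct / bg_pct)
        if score >= min_score:
            scores[term] = score
    return sorted(scores, key=lambda t: (-scores[t], t))


def extract_tags(texts, min_count=2, min_score=1.5, min_terms=3):
    docs = [tokenize(text) for text in texts]
    doc_freq = Counter(term for doc in docs for term in doc)
    tags = {}
    # List word counts with the least frequent first.
    for word, count in sorted(doc_freq.items(), key=lambda p: (p[1], p[0])):
        if count < min_count:
            continue  # exclude the lowest counts
        terms = significant_terms(word, docs, doc_freq, min_score)
        if len(terms) >= min_terms:  # keep words with many strong terms
            tags[word] = terms
    return tags


for tag, terms in extract_tags(TEXTS).items():
    print(f"{tag}: {terms}")
```

Common words such as "and" end up with no high-scoring terms and drop out without a stop-word list. Closely related words each produce the same term list, which is why a de-duplication step is useful afterwards.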
  • 17. What does this give us? athletes: [athletes, coaches, athlete, coach, sports, fans] avatars: [avatars, avatar, multiplayer, virtual, casual, 3d, games, chat, create, game] clouds: [clouds, cloud, hybrid, computing, private, deploy, public, infrastructure] dashboards: [dashboards, bi, reports, analytics, reporting, self, analysis, intelligence, features] dial: [dial, calling, calls, voip, number, call, voice, phone] exercise: [exercise, sleep, nutrition, fitness, weight, healthy, health] indie: [indie, labels, artists, music] logos: [logos, branding, flash, design] pci: [pci, dss, hipaa, compliance, sensitive, compliant] portland: [portland, oregon, inc, founded] ringtones: [ringtones, ringtone, personalization, games] traders: [traders, forex, trader, trading, quotes, stock, trade] yellow: [yellow, pages, directory, local] abc: [abc, cnn, nbc, television] argentina: [argentina, buenos, aires, chile, uruguay, colombia, brazil, mexico, latin] aviation: [aviation, aircraft, aerospace, defense, transportation] airline: [airline, fares, airlines, flights, flight, travel, tickets, hotel, air]
  • 18. What else can we do with this? • Filter words that have a lot of high scoring significant terms • De-duplicate where large overlaps exist • Assign levels of tags in order of frequency • Use to categorise new data on the fly using percolate • Curate manually • Generate a sidebar menu • Use Elasticsearch phrase suggester to create phrase tags • github.com/strets123/web-sig-2014/
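As one possible reading of "de-duplicate where large overlaps exist", the sketch below drops a candidate tag when its term list overlaps heavily with one already kept, measured by Jaccard similarity. The 0.5 threshold, the function names and the three travel entries are assumptions for illustration; only the `indie` entry comes from the previous slide.

```python
# Candidate tags mapped to their significant terms.
CANDIDATES = {
    "indie": ["indie", "labels", "artists", "music"],
    # Hypothetical near-duplicates: three words that share one term list.
    "flights": ["flights", "hotels", "travel"],
    "hotels": ["flights", "hotels", "travel"],
    "travel": ["flights", "hotels", "travel"],
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def deduplicate(tags, threshold=0.5):
    """Keep a tag only if it does not overlap heavily with a kept tag.

    Tags with longer term lists are considered first; ties go alphabetically.
    """
    kept = {}
    for name in sorted(tags, key=lambda n: (-len(tags[n]), n)):
        if all(jaccard(tags[name], terms) < threshold
               for terms in kept.values()):
            kept[name] = tags[name]
    return kept


print(sorted(deduplicate(CANDIDATES)))
```

Which member of a duplicate group survives is a design choice. Here it is simply the first in sort order, and manual curation could pick a better label.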
  • 19. Advantages over direct curation / supervised learning: • Simplicity and pragmatism • Applicable to novel domains – e.g. Chemical Biology • Auto-generated tags choose more appropriate word combinations than manual curators • No need for complex data formats like RDF • Data from many sources can be mixed – e.g. categories from other universities’ sites…
  • 20. Where might this technology lead? • How about a tag-based file system? • How about an implicit social network? • Elasticsearch is really easy to scale… • Which websites, filesystems and datasets do you need to categorise? – Do you really need RDF ontologies, curators etc. or can you just do something simple?
  • 21. Summary • We now have many options to categorise and tidy up messy data • Managing variations on schemas takes a lot of resources – leave it to the data owners if you can! • When it comes to tagging… – Perfection is in the eye of the beholder – Sustainability is really important
  • 22. Thanks • Thanks to the Research informatics team at the NDM Structural Genomics Consortium – Paul Barrett – Karen Porter – Michael O’Hagan – Brian Marsden – David Damerell – Sefa Garsot – Anthony Bradley • Thanks to the InfoDev team at IT services for answering my endless questions about webauth • Funders: – John Fell Fund – NDM Strategic – Wellcome Trust – Higher Education Funding Council • To everyone here for listening
  • 23. Any Questions? • Andrew Stretton github.com/strets123 @strets123 linkedin (google me) • Chembio Hub http://chembiohub.ox.ac.uk @oxchembiohub github.com/thesgc • Simple example categorisation code in Python is available here: github.com/strets123/web-sig-2014/
  • 24. Appendix of other messy data techniques
  • 25. How do we make it easy to add spreadsheet data to a system?
  • 26. Working with flat files • Sometimes a flat file is the right schema for a dataset – User defined formats – Different types of research – Only some of the fields are relevant when comparing experiments – Data is not in memory unless needed • Pandas and HDF allow SQL-like queries on flat files
  • 27. Helpful data management • Data Wrangler – https://player.vimeo.com/video/19185801 • Raw – http://raw.densitydesign.org • Take these as inspiration for our tool for re-shaping biochemistry data
  • 28. Simplifying web crawling • Modern web crawling patterns use class selectors instead of XPath – Less likelihood of change • Content can be crawled using a backend web browser – Dynamic JavaScript elements are included • Using a website’s data for classification is more acceptable than wholesale reproduction
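A small standard-library sketch of selecting content by class name rather than by position in the document tree. The sample markup, the class names `company-name` and `summary`, and the helper names are hypothetical. In practice a crawling library with CSS selector support would do this job; the point is only that the selection keeps working if the surrounding layout changes.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# Hypothetical page fragment.
PAGE = """
<div id="c17">
  <h2 class="company-name">uPlaya</h2>
  <p class="summary">Feedback for <b>unsigned</b> musicians</p>
</div>
"""


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0        # nesting depth inside a matching element
        self.results = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Normalise whitespace and store the finished element's text.
            self.results.append(" ".join("".join(self._buffer).split()))
            self._buffer = []

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


def extract_by_class(html, css_class):
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.results


print(extract_by_class(PAGE, "summary"))
```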
  • 29. Managing multiple JSON schemas with views • PostgreSQL – also supported by Rails/ActiveRecord • Couchbase
  • 30. Why views over JSON can be useful • Expose only required fields from e.g. RDF • Input format may change but we don’t want crawler to break • Required fields may change • Versions are easy to support if format normalisation is in the database layer • Storage is cheap • View code is executed only once
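The idea can be sketched in plain Python as a stand-in for a real database view: raw documents are stored unchanged, and a small mapping per schema version exposes only the required fields. The `schema_version` key, the field paths and the sample documents are all hypothetical. A PostgreSQL or Couchbase view would express the same mapping inside the database rather than in application code.

```python
import json

# Per-version paths to the required fields (hypothetical schemas).
FIELD_PATHS = {
    1: {"name": ("name",), "summary": ("description",)},
    2: {"name": ("company", "name"), "summary": ("company", "overview")},
}

# Raw documents as they might arrive from a crawler, stored as-is.
RAW_DOCS = [
    '{"schema_version": 1, "name": "uPlaya", "description": "Music feedback"}',
    '{"schema_version": 2, "company": {"name": "Webceleb", '
    '"overview": "Music marketplace", "founded": 2008}}',
]


def dig(doc, path):
    """Follow a tuple of keys into nested dicts; None if any key is missing."""
    for key in path:
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


def view_row(raw_json):
    """Normalise one stored document to the required fields only."""
    doc = json.loads(raw_json)
    paths = FIELD_PATHS[doc.get("schema_version", 1)]
    return {field: dig(doc, path) for field, path in paths.items()}


for raw in RAW_DOCS:
    print(view_row(raw))
```

Supporting a new input format then means adding one entry to the mapping, while the stored data and the code that reads the normalised rows stay unchanged.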

Editor's notes

  1. Real-world data is not: perfectly tagged; in one place; in one format; in one technology stack. Spreadsheet processes don’t just disappear when you build a tool.