Building a lightweight discovery interface for Chinese patents

Strata 2014 Santa Clara
Eric Pugh | epugh@o19s.com | @dep4b

Who am I?
• Principal of OpenSource Connections - Solr/Lucene Search Consultancy
http://bit.ly/OSCCommercialSummary
• Member of Apache Software Foundation
• SOLR-284 UpdateRichDocuments (July 07)
• Fascinated by the art of software development

Co-Author
Next Edition Mar!
• First USPTO application in “the cloud”
• Simple, and discoverable
• Expresses our philosophy of “Cloud meets Ocean”

Risks
• Cloud new at USPTO
• Discovery is tenuous concept
• Conflicting User Goals
• Fixed Budget: trade scope for budget/quality

Telling some stories
➡How to inject “Discovery” into your app

• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don’t be Afraid to Share!

Flow of understanding

Data → Information → Understanding

Building “Discovery”

[Diagram: UX, Data, and Engine, with tension between them]

UX

User Interviews
Card Sorting
Scenarios/Personas

Data

Grok data at gut level
Look for outliers

brainstorm

Surveys

Mockups
Proof of concept

Where to spend time?
            UX     Engine   Data
            40%    20%      40%
We spent    40%    40%      20%

Walk through results
http://gpsn.uspto.gov

• How to inject “Discovery” into your app
➡The Cloud to the Rescue (sorta!)

Boy meets Girl Story
[Diagram: Content Files, Ingest Pipeline, Metadata, Discovery UX]

Don’t Move Files
• Copying 5 TB of data up to S3 was very painful.
• We used S3Funnel, which is “rsync like”
• We bought more network bandwidth for our office

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.
–Andrew Tanenbaum, 1981
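To put the 5 TB upload in perspective, here is a rough back-of-the-envelope sketch in Java. The link speeds are illustrative assumptions, not the actual bandwidth of the office, and the estimate ignores protocol overhead and retries.

```java
import java.util.Locale;

// Back-of-the-envelope transfer-time estimate.
// The link speeds used in main() are hypothetical, not figures from the talk.
final class TransferEstimate {

    /** Hours needed to move the given number of (decimal) terabytes over a link. */
    static double hours(double terabytes, double megabitsPerSecond) {
        double bits = terabytes * 1e12 * 8;                  // terabytes to bits
        double seconds = bits / (megabitsPerSecond * 1e6);   // bits / (bits per second)
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        double[] linkSpeedsMbps = {10, 100, 1000};           // hypothetical uplinks
        for (double mbps : linkSpeedsMbps) {
            System.out.printf(Locale.ROOT, "5 TB at %.0f Mbit/s: %.1f hours%n",
                    mbps, hours(5, mbps));
        }
    }
}
```

Even at a sustained 100 Mbit/s the estimate comes to roughly 111 hours, which is why shipping disks can beat the network.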

Think about Data Volume
• Started with older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce nice, need more visibility into progress.
• Should have sharded our Search Index from the beginning just to make indexing a faster and cheaper process (500 GB index!)
• 8 shards dropped time from 12 hours to 2 hours. Merging took 5!
• We had too many steps in our pipeline
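One way to picture the sharding lesson is to split the documents into 8 partitions by a stable hash of their ID, so that each partition can be indexed in parallel. The sketch below is an assumption for illustration only: it is not how GPSN or Solr actually routes documents, and the class name and document IDs are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical hash-based partitioner; not Solr's own document routing.
final class ShardRouter {
    private final int numShards;

    ShardRouter(int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        this.numShards = numShards;
    }

    /** Maps a document id to a shard index in [0, numShards). */
    int shardFor(String docId) {
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numShards);   // getValue() is never negative
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(8);
        // Made-up ids, only to show the routing.
        for (String id : new String[] {"CN-0000001", "CN-0000002", "CN-0000003"}) {
            System.out.println(id + " -> shard " + router.shardFor(id));
        }
    }
}
```

Because the hash is stable, the same document always lands in the same partition, so a failed indexing job can be rerun for one shard only.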

More prosaically…
[Diagram: Client, Server, and Database nodes, with $ markers]

➡Parsers and Parsers and Parsers

Morphlines

Why so many pipelines?

Lots of File Types
• Sometimes in ZIP archives, sometimes not!
• Multiple XML formats as well as CSV and EDI
• Purplebook, Yellowbook, Redbook, Greenbook, Questel, SIPO…
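The "sometimes in ZIP archives, sometimes not" problem can be handled by sniffing the first bytes of each file before choosing a parser. The following is a minimal stdlib-only sketch, not the pipeline's actual code; the class and method names are hypothetical. It checks for the ZIP local-file-header signature `PK\3\4` and rewinds the stream afterwards. An empty archive starts with a different signature and is not covered here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper: decides whether a file needs unzipping before parsing.
final class ZipSniffer {

    /** True if the stream starts with "PK\3\4"; the stream is reset afterwards. */
    static boolean looksLikeZip(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(4);
        try {
            byte[] head = new byte[4];
            int read = 0;
            while (read < 4) {                       // read() may return fewer bytes
                int n = in.read(head, read, 4 - read);
                if (n < 0) {
                    break;                           // end of stream
                }
                read += n;
            }
            return read == 4 && head[0] == 'P' && head[1] == 'K'
                    && head[2] == 3 && head[3] == 4;
        } finally {
            in.reset();                              // leave the stream untouched
        }
    }

    /** Convenience overload for in-memory data. */
    static boolean looksLikeZip(byte[] data) {
        try {
            return looksLikeZip(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeZip(new byte[] {'P', 'K', 3, 4, 20, 0}));
        System.out.println(looksLikeZip("<?xml version=\"1.0\"?>".getBytes()));
    }
}
```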

Tika as a pipeline!
• Auto detects content type
• Metadata structure has all the key/value needed for Solr
• Allows us to scale up with Behemoth project (and others!).

Detector to pick File
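import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Pattern;

import org.apache.commons.io.IOUtils;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.LookaheadInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class GreenbookDetector implements Detector {
    // A file is treated as Greenbook if "PATN" appears in its first 1024 bytes.
    private static final Pattern PATTERN = Pattern.compile("PATN");

    @Override
    public MediaType detect(InputStream stream, Metadata metadata) throws IOException {
        // Closing the lookahead stream resets the underlying stream for the parser.
        try (InputStream lookahead = new LookaheadInputStream(stream, 1024)) {
            String extract = IOUtils.toString(lookahead, "UTF-8");
            return PATTERN.matcher(extract).find()
                    ? GreenbookParser.MEDIA_TYPE
                    : MediaType.OCTET_STREAM;
        }
    }
}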

➡Don’t be Afraid to Share!

Your solution isn’t perfect
• Allow users to export data
• Most business users want to work in Excel! Accept it!
• Allow other applications to build on top of it.

GPSN has
• Lots of easy “Print to PDF” options.
• Data stored in S3 as:
  • individual patent files
  • chunky downloads.
• Filtering to expand or select specific data sets.
• Permalinks: simple, very sharable URLs.
• Underlying Solr service is exposed to public via firewall. You can query Solr yourself.
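Querying Solr yourself comes down to an HTTP GET against a select handler. The sketch below only builds such a URL. The base URL, core name, and `title` field are placeholders and not GPSN's real endpoint or schema; `q`, `rows`, and `wt` are standard Solr query parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Builds a Solr select URL; host, core, and field names are hypothetical.
final class SolrQueryUrl {

    /** Returns baseUrl with q, rows, and wt=json appended as query parameters. */
    static String build(String baseUrl, String query, int rows) {
        return baseUrl + "?q=" + encode(query) + "&rows=" + rows + "&wt=json";
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder endpoint; substitute the real Solr select URL.
        String base = "http://localhost:8983/solr/collection1/select";
        System.out.println(build(base, "title:battery", 10));
    }
}
```

A permalink can simply carry the same parameters, which is what keeps the URLs simple and sharable.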

Measuring the impact of our algorithm changes is just getting harder with Big Data.

Quepid: Give your Queries some Love
We need beta users!
www.quepid.io

Office Hours Thurs 10:50 AM

What’s Up with the Lucene Community?

Questions?
Nervous about speaking up? Ask me later!
• epugh@o19s.com
• @dep4b
• www.opensourceconnections.com
• slideshare.com/o19s
