The United States Patent and Trademark Office wanted a simple, lightweight, yet modern and rich discovery interface for Chinese patent data. This is the story of the Global Patent Search Network, the next-generation multilingual search platform for the USPTO. GPSN, http://gpsn.uspto.gov, was the USPTO's first public application deployed in the cloud, and it allowed a very small development team to build a discovery interface across millions of patents.
This case study will cover:
• How we leveraged the Amazon Web Services platform for data ingestion, auto scaling, and deployment at a very low price compared to traditional data centers.
• Some of the innovative methods for converting XML-formatted data into usable information.
• Parsing through 5 TB of raw TIFF image data and converting it to a modern, web-friendly format.
• Challenges in building a modern Single Page Application that provides a dynamic, rich user experience.
• How we built “data sharing” features into the application to allow third party systems to build additional functionality on top of GPSN.
Building a lightweight discovery interface for Chinese patents
1. Building a Lightweight Discovery Interface for Chinese Patents
Strata 2014 Santa Clara
Eric Pugh | epugh@o19s.com | @dep4b
2. Who am I?
• Principal of OpenSource Connections - Solr/Lucene Search Consultancy http://bit.ly/OSCCommercialSummary
• Member of Apache Software Foundation
• SOLR-284 UpdateRichDocuments (July 07)
• Fascinated by the art of software development
7.
• First USPTO application in “the cloud”
• Simple, and discoverable
• Expresses our philosophy of “Cloud meets Ocean”
8.
9. Risks
• Cloud new at USPTO
• Discovery is a tenuous concept
• Conflicting User Goals
• Fixed Budget: trade scope for budget/quality
10. Telling some stories
➡How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
16. Telling some stories
• How to inject “Discovery” into your app
➡The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
21. Don’t Move Files
• Copying 5 TB of data up to S3 was very painful.
• We used S3Funnel, which is “rsync like”.
• We bought more network bandwidth for our office.
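The “rsync like” idea can be illustrated with a minimal sketch: compare a local file listing against what is already in the bucket and upload only what is missing or different. This is not S3Funnel's actual algorithm. The class name, the size-only comparison, and the sample keys are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: decides which local files still need uploading. */
final class SyncPlanner {

    /**
     * Returns the keys that are missing remotely or whose size differs.
     * Both maps go from object key (relative path) to size in bytes.
     */
    static List<String> filesToUpload(Map<String, Long> local, Map<String, Long> remote) {
        List<String> pending = new ArrayList<>();
        for (Map.Entry<String, Long> entry : local.entrySet()) {
            Long remoteSize = remote.get(entry.getKey());
            if (remoteSize == null || !remoteSize.equals(entry.getValue())) {
                pending.add(entry.getKey());
            }
        }
        return pending;
    }

    public static void main(String[] args) {
        Map<String, Long> local = new LinkedHashMap<>();
        local.put("2013/CN1000001.tif", 52_000L);
        local.put("2013/CN1000002.tif", 48_000L);

        Map<String, Long> remote = new LinkedHashMap<>();
        remote.put("2013/CN1000001.tif", 52_000L); // already uploaded

        System.out.println(filesToUpload(local, remote));
    }
}
```

Comparing sizes alone is cheap but can miss same-size changes; a checksum comparison would be stricter at the cost of reading every file.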
24. Think about Data Volume
• Started with an older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce is nice, but we needed more visibility into progress.
• Should have sharded our search index from the beginning, just to make indexing a faster and cheaper process (500 GB index!).
• 8 shards dropped indexing time from 12 hours to 2 hours. Merging took 5 hours!
• We had too many steps in our pipeline.
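As a rough illustration of sharding from the start, the sketch below assigns each document to one of 8 shards by hashing its ID, so every shard can be indexed in parallel. It does not show how GPSN or Solr actually routed documents. The CRC32 hash, the class name, and the "CN" + number ID format are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

/** Hypothetical router: maps a document ID to a shard number. */
final class ShardRouter {

    /** Returns a stable shard index in the range [0, shardCount). */
    static int shardFor(String documentId, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(documentId.getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() is a non-negative long, so the remainder is non-negative too.
        return (int) (crc.getValue() % shardCount);
    }

    public static void main(String[] args) {
        int[] counts = new int[8];
        for (int i = 0; i < 1000; i++) {
            counts[shardFor("CN" + i, 8)]++;
        }
        // Prints how many of the 1000 sample IDs landed on each shard.
        System.out.println(Arrays.toString(counts));
    }
}
```

Because the mapping depends only on the ID, re-indexing a document always sends it to the same shard.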
28. Telling some stories
• How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
➡Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
31. Lots of File Types
• Sometimes in ZIP archives, sometimes not!
• Multiple XML formats as well as CSV and EDI
• Purplebook, Yellowbook, Redbook, Greenbook, Questel, SIPO…
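One way to cope with “sometimes in ZIP archives, sometimes not” is to peek at the first four bytes for the ZIP local-file-header signature and unwrap only when it is present. The sketch below uses only the JDK. The class and method names are hypothetical, and it is not taken from the GPSN code base.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: yields file payloads whether or not the input is zipped. */
final class ArchiveUnwrapper {

    /** True if the stream starts with the ZIP local-file-header signature "PK\003\004". */
    static boolean looksLikeZip(BufferedInputStream in) throws IOException {
        in.mark(4);
        try {
            return in.read() == 'P' && in.read() == 'K' && in.read() == 3 && in.read() == 4;
        } finally {
            in.reset(); // rewind so the caller still sees the whole stream
        }
    }

    /** Returns each file entry of a ZIP archive, or the raw bytes if the input is not zipped. */
    static List<byte[]> unwrap(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        List<byte[]> payloads = new ArrayList<>();
        if (looksLikeZip(in)) {
            ZipInputStream zip = new ZipInputStream(in);
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!e.isDirectory()) {
                    payloads.add(readAll(zip));
                }
            }
        } else {
            payloads.add(readAll(in));
        }
        return payloads;
    }

    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        for (int n = in.read(buffer); n != -1; n = in.read(buffer)) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream zipped = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(zipped)) {
            zip.putNextEntry(new ZipEntry("patent-1.xml"));
            zip.write("<patent/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        System.out.println(unwrap(new ByteArrayInputStream(zipped.toByteArray())).size());
        System.out.println(unwrap(new ByteArrayInputStream("PATN".getBytes(StandardCharsets.UTF_8))).size());
    }
}
```

Each returned payload can then be handed to the format detection step regardless of how the file arrived.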
32. Tika as a pipeline!
• Auto detects content type
• Metadata structure has all the key/value pairs needed for Solr
• Allows us to scale up with the Behemoth project (and others!).
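To show why a flat key/value structure maps so directly onto Solr, here is a minimal sketch that turns a map of extracted fields into a Solr XML update message. A plain Map stands in for Tika's Metadata object, and the class name and the field names (id, title, applicant) are hypothetical. A real pipeline would more likely use a Solr client library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: renders extracted key/value pairs as a Solr XML add message. */
final class SolrDocumentWriter {

    static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
               .append(escape(field.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    /** Escapes the characters that are special in XML text and attribute values. */
    private static String escape(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "CN1000001");
        fields.put("title", "Solar cell & module");
        fields.put("applicant", "Example Co.");
        System.out.println(toAddXml(fields));
    }
}
```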
33. Detector to pick File
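import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Pattern;

import org.apache.commons.io.IOUtils;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.LookaheadInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class GreenbookDetector implements Detector {
    // Greenbook files carry a "PATN" marker near the top of the file.
    private static final Pattern PATTERN = Pattern.compile("PATN");

    @Override
    public MediaType detect(InputStream stream, Metadata metadata) throws IOException {
        if (stream == null) {
            return MediaType.OCTET_STREAM;
        }
        // Peek at the first 1024 bytes; closing the lookahead resets the original stream.
        try (InputStream lookahead = new LookaheadInputStream(stream, 1024)) {
            String extract = IOUtils.toString(lookahead, "UTF-8");
            return PATTERN.matcher(extract).find() ? GreenbookParser.MEDIA_TYPE : MediaType.OCTET_STREAM;
        }
    }
}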
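With several formats in play, one detector per format can be generalised into an ordered list of markers checked against the same lookahead window. The sketch below uses only the JDK rather than Tika. The "PATN" marker comes from the slide above, while the class name, the "xml" entry, and the returned labels are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sniffer: returns the first format whose marker appears in the file prefix. */
final class FormatSniffer {
    private static final int LOOKAHEAD = 1024;

    // Checked in insertion order, so more specific markers should come first.
    private static final Map<String, Pattern> MARKERS = new LinkedHashMap<>();
    static {
        MARKERS.put("greenbook", Pattern.compile("PATN"));
        MARKERS.put("xml", Pattern.compile("<\\?xml"));
    }

    static String sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(LOOKAHEAD);
        byte[] buffer = new byte[LOOKAHEAD];
        int total = 0;
        try {
            while (total < LOOKAHEAD) {
                int n = in.read(buffer, total, LOOKAHEAD - total);
                if (n == -1) {
                    break;
                }
                total += n;
            }
        } finally {
            in.reset(); // leave the stream untouched for the parser that runs next
        }
        String prefix = new String(buffer, 0, total, StandardCharsets.UTF_8);
        for (Map.Entry<String, Pattern> marker : MARKERS.entrySet()) {
            if (marker.getValue().matcher(prefix).find()) {
                return marker.getKey();
            }
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "PATN\nWKU  ...".getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(new ByteArrayInputStream(sample)));
    }
}
```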
34. Telling some stories
• How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
➡Don’t be Afraid to Share!
35. Your solution isn’t perfect
• Allow users to export data
• Most business users want to work in Excel! Accept it!
• Allow other applications to build on top of it.
36. GPSN has
• Lots of easy “Print to PDF” options.
• Data stored in S3 as:
  • individual patent files
  • chunky downloads.
• Filtering to expand or select specific data sets.
• Permalinks: simple, very sharable URLs.
• Underlying Solr service is exposed to public via firewall. You can query Solr yourself.
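Querying Solr yourself comes down to an HTTP GET against a select handler. The sketch below only builds such a URL. The q, rows, and wt parameters are standard Solr query parameters, but the host, core name, and field name are placeholders, since the slides do not give the actual endpoint.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper: builds a Solr select URL for a third-party client. */
final class SolrQueryUrl {

    static String build(String baseUrl, String query, int rows) {
        return baseUrl + "/select?q=" + encode(query) + "&rows=" + rows + "&wt=json";
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder endpoint and field name, not the real GPSN service.
        System.out.println(build("http://solr.example.org/solr/patents", "title:solar cell", 10));
    }
}
```

A URL like this also works as a permalink, because the whole search state lives in the query string.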
SOLR-284 back in July 07 was a first cut at a content extraction library before Tika came along.
And I love Agile development processes. I think of agile as spanning business -> requirements -> development -> testing -> systems administration.
USPTO and SIPO, China's State Intellectual Property Office, are committed to sharing patent data:
simplify patent protections,
reduce conflicting patent claims,
facilitate business by making it easier for Chinese and American companies to collaborate.
Part of the MOU was to put China’s patent data online, but it was somewhat of a checkbox feature.
We got excited!
I won’t be sugarcoating them. Most speakers focus on how wonderfully they did, and while I am inordinately proud of GPSN, it wasn’t a perfect project.
The Goal
Building Discovery capabilities is the tension between UX and Data needs
Engine provides the flow between!
There are things that can be done in parallel, but after brainstorming, everything needs to go hand in hand.
An issue we’ve seen is that it’s easy for the UX folks to document ideas, but often the Data folks get bogged down in the minutiae of the source data. We need to surface that to a level people can work with.
Ideally you are focused on the UX and Data, your “tooling” shouldn’t get in the way.
Because this was the first cloud deployment, we spent a lot of time on knowledge transfer and verification of the deploy, which meant some data issues came back to bite us.
Issues that came up: User sophistication - both public and expert Patent Examiner (PE) users. We tilted towards the public, with a layer of PE features.
The data, in English, was bad, so we surfaced more of the original Chinese, and especially image data.
Core users, such as patent attorneys, want the original image of patents; it’s a trust thing.
Google-like simple search, but with more powerful queries.
One of the most common tropes in storytelling is about a boy meeting a girl. Think Wall-E and Eve. They meet, shenanigans happen, and then happiness.
Well, GPSN followed one of the most common tropes of discovery: merging clean metadata with content, and building a UI.
This pretty much describes every project.
$ is the cost of computation, doing work.
Scott pointed out Tika.
Walking out of a Federal building with 4 hard drives in my backpack!