Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchange DC
1. BUILDING A LIGHTWEIGHT DISCOVERY
INTERFACE FOR CHINESE PATENTS
ERIC PUGH | epugh@o19s.com | @dep4b
2. Who am I?
• Principal of OpenSource Connections
- Solr/Lucene Search Consultancy
http://bit.ly/OSCCommercialSummary
• Member of Apache Software
Foundation
• SOLR-284 UpdateRichDocuments
(July 07)
• Fascinated by the art of software
development
10. Risks
• Cloud new at USPTO
• Discovery is tenuous concept
• Conflicting User Goals
• Fixed Budget: trade scope for
budget/quality
11. • First USPTO application in
“the cloud”
• Simple, and discoverable
• Expresses our philosophy of
“Cloud meets Ocean”
• Check it out at http://gpsn.uspto.gov
12. Telling some stories
➡How to inject “Discovery” into your
app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
16. Grok data at gut level
Look for outliers
User Interviews
Surveys
Card Sorting
Scenarios/Personas
UX
Data
Brainstorm
Mockups
Proof of concept
18. Where to spend time?
[Charts: share of time across UX, Engine, and Data. Two splits are shown, 40% / 20% / 40% and 40% / 40% / 20%; one of them is labeled “We spent”.]
19. Telling some stories
• How to inject “Discovery” into your app
➡The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
23. Boy meets Girl Story
Metadata
Ingest Pipeline
Discovery UX
Content Files
24. How we built it
EmberJS Single Page Search App
HTML
XML
JSON
Server Dashboard
GPSN UI (Bootstrap CSS)
Browsers
Mobile/Tablet
Third Party Application
Servers
S3 Bucket
Solr
26. Don’t Move Files
• Copying 5 TB data up to S3 was very
painful.
• We used S3Funnel which is “rsync like”
• We bought more network bandwidth for
our office
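A rough way to see why the copy hurt: a back-of-the-envelope estimate of upload time for 5 TB at a few link speeds. This is only a sketch; the link speeds below are hypothetical (the slides do not say what bandwidth the office had), and the estimate ignores retries and protocol overhead.

```java
import java.time.Duration;

/** Back-of-the-envelope upload time for a bulk copy to cloud storage. */
class TransferEstimate {

    /**
     * @param bytes         total payload size in bytes
     * @param bitsPerSecond sustained uplink throughput in bits per second
     * @return estimated duration, rounded up to whole seconds
     */
    static Duration estimate(long bytes, long bitsPerSecond) {
        long bits = bytes * 8L;
        // Integer ceiling division so a partial second counts as a full one.
        long seconds = (bits + bitsPerSecond - 1) / bitsPerSecond;
        return Duration.ofSeconds(seconds);
    }

    public static void main(String[] args) {
        long fiveTerabytes = 5_000_000_000_000L; // 5 TB in decimal units
        long[] uplinksMbps = {10, 100, 1000};    // hypothetical link speeds
        for (long mbps : uplinksMbps) {
            Duration d = estimate(fiveTerabytes, mbps * 1_000_000L);
            System.out.printf("%4d Mbit/s -> %d days %d hours%n",
                    mbps, d.toDays(), d.toHoursPart());
        }
    }
}
```

Even at a sustained 100 Mbit/s the copy takes days, which is consistent with the advice in the slide title.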
30. Think about Data Volume
• Started with an older dataset, and tasks like TIFF -> PNG
conversion became progressively harder. Map/Reduce is nice, but
we needed more visibility into progress.
• Should have sharded our search index from the beginning
just to make indexing a faster and cheaper process (500 GB
index!)
• 8 shards dropped indexing time from 12 hours to 2 hours.
Merging took 5!
• We had too many steps in our pipeline
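The sharding idea above can be sketched as hash-based routing: each document ID maps to one of N shards, so N indexers can work in parallel. This is a generic illustration, not how GPSN or Solr actually assigned documents; the class name and the sample IDs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Assigns document IDs to a fixed number of index shards by hash. */
class ShardRouter {
    private final int numShards;

    ShardRouter(int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        this.numShards = numShards;
    }

    /** Returns a stable shard number in the range [0, numShards). */
    int shardFor(String docId) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(docId.hashCode(), numShards);
    }

    /** Splits a batch of IDs into one list per shard. */
    List<List<String>> partition(List<String> docIds) {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            shards.add(new ArrayList<>());
        }
        for (String id : docIds) {
            shards.get(shardFor(id)).add(id);
        }
        return shards;
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(8);
        // Hypothetical document IDs, for illustration only.
        List<String> ids = List.of("doc-0001", "doc-0002", "doc-0003", "doc-0004");
        List<List<String>> shards = router.partition(ids);
        for (int i = 0; i < shards.size(); i++) {
            System.out.println("shard " + i + ": " + shards.get(i));
        }
    }
}
```

Because the mapping depends only on the ID, re-running the pipeline sends each document to the same shard. The slide's caveat still applies: if the shards must be merged afterwards, that step has its own cost.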
31. Building a Patents Index
[Chart: machine count vs. time to build the index]
1 machine: 5 days
5 machines: 3 days
300 machines: 30 minutes
38. Telling some stories
• How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
➡Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
41. Lots of File Types
• Sometimes in ZIP archives, sometimes not!
• Multiple XML formats as well as CSV and
EDI
• Purplebook, Yellowbook,
Redbook, Greenbook, Questel, SIPO…
42. Tika as a pipeline!
• Auto detects content type
• Metadata structure has all the
key/value pairs needed for Solr
• Allows us to scale up with
Behemoth project (and
others!).
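To make "auto detects content type" concrete, here is a minimal content-sniffing sketch that uses only the JDK. It is not Tika's implementation; it only shows the general idea of inspecting the first bytes of a file. The class and enum names are hypothetical. The ZIP signature is the standard local file header magic, and the "PATN" marker is the one the Greenbook detector in this deck looks for.

```java
import java.nio.charset.StandardCharsets;

/** Guesses a file format from the first bytes of its content. */
class FormatSniffer {

    enum Format { ZIP, XML, GREENBOOK, UNKNOWN }

    /** @param head the first bytes of the file (a kilobyte is plenty) */
    static Format sniff(byte[] head) {
        // ZIP local file header signature: 'P' 'K' 0x03 0x04.
        if (head.length >= 4
                && head[0] == 'P' && head[1] == 'K'
                && head[2] == 3 && head[3] == 4) {
            return Format.ZIP;
        }
        // ISO-8859-1 maps every byte to a char, so decoding never fails.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        if (text.trim().startsWith("<?xml")) {
            return Format.XML;
        }
        if (text.contains("PATN")) {
            return Format.GREENBOOK;
        }
        return Format.UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version=\"1.0\"?><doc/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(xml));
        System.out.println(sniff(new byte[] {'P', 'K', 3, 4}));
    }
}
```

A real pipeline would hand ZIP entries back through the same check, since the archives may contain any of the other formats.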
44. Detector to pick File
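import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Pattern;

import org.apache.commons.io.IOUtils;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.LookaheadInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class GreenbookDetector implements Detector {

    // Greenbook files are recognized by a "PATN" marker near the top.
    private static final Pattern pattern = Pattern.compile("PATN");

    @Override
    public MediaType detect(InputStream stream, Metadata metadata) throws IOException {
        MediaType type = MediaType.OCTET_STREAM;
        // Peek at the first 1024 bytes without consuming the stream.
        InputStream lookahead = new LookaheadInputStream(stream, 1024);
        String extract = IOUtils.toString(lookahead, "UTF-8");

        if (pattern.matcher(extract).find()) {
            type = GreenbookParser.MEDIA_TYPE;
        }

        // Closing the lookahead resets the underlying stream for the parser.
        lookahead.close();
        return type;
    }
}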
45. Telling some stories
• How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
➡Don’t be Afraid to Share!
46. Your BigData solution
isn’t perfect
• Allow users to export data
• Most business users want to work in Excel!
Accept it!
• Allow other applications to build on top of
it.
47. GPSN has
• Lots of easy “Print to
PDF” options.
• Data stored in S3 as:
• individual patent files
• chunky downloads.
• Filtering to expand or
select specific data sets.
• Permalinks: simple, very
sharable URLs.
• Underlying Solr service
is exposed to public via
proxy. You can query
Solr yourself.
• Need advanced querying?
Use Lucene syntax in
search bar.
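"You can query Solr yourself" can be sketched as building a request URL for Solr's standard select handler with a Lucene-syntax query. The base URL and the field names (title, abstract) below are hypothetical; the slides do not give the proxy address or the GPSN schema. The q, rows, and wt parameters are standard Solr request parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds a Solr select URL from a Lucene-syntax query string. */
class SolrQueryUrl {

    static String build(String baseUrl, String luceneQuery, int rows) {
        // Encode the query so spaces, colons, and quotes are URL-safe.
        String q = URLEncoder.encode(luceneQuery, StandardCharsets.UTF_8);
        return baseUrl + "/select?q=" + q + "&rows=" + rows + "&wt=json";
    }

    public static void main(String[] args) {
        // Hypothetical endpoint and field names, for illustration only.
        String url = build("http://example.org/solr/patents",
                "title:battery AND abstract:\"lithium ion\"", 10);
        System.out.println(url);
    }
}
```

The same Lucene syntax (field:value, AND/OR, quoted phrases) is what the slide suggests typing into the search bar.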