War stories from building the Global Patent Search Network, and why Data folks need to think more about UX and Discovery, and UX folks need to think more about Data.
Searching Chinese Patents Presentation at Enterprise Data World
1. Searching Chinese Patents:
Challenges and Solutions When Building
an Innovative Discovery Interface
ERIC PUGH | epugh@o19s.com | @dep4b
2. Who am I?
• Principal at OpenSource Connections
- Solr/Lucene Search Consultancy
http://bit.ly/OSCCommercialSummary
• Member of Apache Software
Foundation
• SOLR-284 UpdateRichDocuments
(July 07)
8. Risks
• Cloud new at USPTO
• Discovery is tenuous concept
• Conflicting User Goals
• Fixed Budget: trade scope for
budget/quality
14. Telling some stories
➡How to inject “Discovery” into your
app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
17. Grok data at gut level
Look for outliers
User Interviews
Surveys
Card Sorting
Scenarios/Personas
UX/Data brainstorm
Mockups
Proof of concept
18. Where to spend time?
• UX 40%, Engine 20%, Data 40%
• We spent: UX 40%, Engine 40%, Data 20%
19. Telling some stories
• How to inject “Discovery” into your app
➡The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
21. Boy meets Girl Story
Metadata
Ingest Pipeline
Discovery UX
Content Files
22. How we built it
EmberJS Single Page Search App
HTML
XML
JSON
Server Dashboard
GPSN UI (Bootstrap CSS)
Browsers
Mobile/Tablet
Third Party
Application
Servers
S3 Bucket
Solr
23. Solr as a NoSQL
Datastore
• Used “atomic updates” to merge three
source datasets into single final dataset.
• All text displayed in application stored in
Solr.
• Dynamic schema supports many languages;
en and cn right now.
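The merge-by-atomic-update idea above can be sketched roughly as follows. This is an illustration, not the GPSN code: the class name, document id, and field name (abstract_en) are hypothetical, and the sketch only builds the JSON body that Solr's update handler accepts for a "set" atomic update; it does not send it anywhere.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AtomicUpdateSketch {

    // Escape the characters that must be escaped inside a JSON string.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"') {
                sb.append("\\\"");
            } else if (c == '\\') {
                sb.append("\\\\");
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Build a one-document atomic update. Each field uses the "set" modifier,
    // so only the listed fields are replaced on the document with this id.
    static String atomicSet(String id, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("[{\"id\":").append(quote(id));
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append(',').append(quote(e.getKey()))
              .append(":{\"set\":").append(quote(e.getValue())).append('}');
        }
        return sb.append("}]").toString();
    }

    public static void main(String[] args) {
        // Hypothetical id and field, standing in for one of the source datasets.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("abstract_en", "A folding widget");
        System.out.println(atomicSet("CN-0000001", fields));
    }
}
```

Posting one such body per source dataset, each touching its own fields, is one way three feeds could converge on a single document per patent. Atomic updates rely on the other fields being stored, which fits the slide's note that all displayed text lives in Solr.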
25. Don’t Move Files
• Copying 5 TB of data up to S3 was very
painful.
• We used S3Funnel, which is “rsync like”
• We bought more network bandwidth for
our office
28. Think about Data Volume
• Started with an older dataset, and tasks like TIFF -> PNG
conversion became progressively harder. Map/Reduce is nice,
but we needed more visibility into progress.
• We should have sharded our search index from the beginning,
just to make indexing a faster and cheaper process (500 GB
index!)
• 8 shards dropped indexing time from 12 hours to 2 hours.
Merging took 5 hours!
• We had too many steps in our pipeline
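One simple way to split documents across a fixed number of shards, so each shard can be indexed in parallel, is to hash the document id. The sketch below assumes eight shards, as in the slide above. The class name, method names, and ids are hypothetical, and this is plain hashing, not necessarily how GPSN or Solr routed documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ShardRouterSketch {

    // Map a document id to a shard number in [0, shardCount).
    // floorMod keeps the result non-negative even when hashCode() is negative.
    static int shardFor(String docId, int shardCount) {
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    // Partition ids into per-shard batches that can be indexed independently.
    static List<List<String>> partition(List<String> docIds, int shardCount) {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
        for (String id : docIds) {
            shards.get(shardFor(id, shardCount)).add(id);
        }
        return shards;
    }

    public static void main(String[] args) {
        // Hypothetical patent ids.
        List<String> ids = Arrays.asList("CN-1", "CN-2", "CN-3", "CN-4");
        List<List<String>> shards = partition(ids, 8);
        for (int i = 0; i < shards.size(); i++) {
            System.out.println("shard " + i + ": " + shards.get(i));
        }
    }
}
```

Because the assignment depends only on the id, a re-run sends each document to the same shard, which keeps partial re-indexing cheap.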
29. Building a Patents Index
Machine count vs. time to build:
• 1 machine: 5 days
• 5 machines: 3 days
• 300 machines: 30 minutes
32. Telling some stories
• How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
➡Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
35. Lots of File Types
• Sometimes in ZIP archives, sometimes not!
• Multiple XML formats as well as CSV and
EDI
• Purplebook, Yellowbook,
Redbook, Greenbook, Questel, SIPO…
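Because the same feed may arrive zipped or unzipped, a loader has to look at the bytes rather than trust the file name. A rough sketch of that check, assuming only two facts: non-empty ZIP archives start with the bytes "PK" 0x03 0x04, and XML files usually start with "<" after an optional UTF-8 byte order mark and whitespace. The class and enum names are hypothetical, and everything else (CSV, EDI, the various "book" formats) simply falls through to OTHER.

```java
import java.nio.charset.StandardCharsets;

class FileSniffSketch {

    enum Kind { ZIP, XML, OTHER }

    // Classify a file from its first few bytes.
    static Kind sniff(byte[] head) {
        // Non-empty ZIP archives begin with the local file header signature.
        if (head.length >= 4 && head[0] == 'P' && head[1] == 'K'
                && head[2] == 3 && head[3] == 4) {
            return Kind.ZIP;
        }
        int i = 0;
        // Skip a UTF-8 byte order mark if present.
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            i = 3;
        }
        // Skip leading ASCII whitespace, then look for the first markup character.
        while (i < head.length && (head[i] == ' ' || head[i] == '\t'
                || head[i] == '\n' || head[i] == '\r')) {
            i++;
        }
        if (i < head.length && head[i] == '<') {
            return Kind.XML;
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version=\"1.0\"?><doc/>".getBytes(StandardCharsets.UTF_8);
        byte[] zip = {'P', 'K', 3, 4, 0, 0};
        System.out.println(sniff(xml));
        System.out.println(sniff(zip));
    }
}
```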
36. Tika as a pipeline!
• Auto detects content type
• Metadata structure has all the
key/value pairs needed for Solr
• Allows us to scale up with
Behemoth project (and
others!).
38. Detector to pick File
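import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Pattern;

import org.apache.commons.io.IOUtils;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.LookaheadInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class GreenbookDetector implements Detector {

    // Greenbook files are recognized by a "PATN" marker in the first 1024 bytes.
    private static final Pattern PATTERN = Pattern.compile("PATN");

    @Override
    public MediaType detect(InputStream stream, Metadata metadata) throws IOException {
        if (stream == null) {
            return MediaType.OCTET_STREAM;
        }
        // Peek at the start of the stream; closing the lookahead resets
        // the underlying stream so the parser can read it from the beginning.
        InputStream lookahead = new LookaheadInputStream(stream, 1024);
        try {
            String extract = IOUtils.toString(lookahead, "UTF-8");
            return PATTERN.matcher(extract).find()
                    ? GreenbookParser.MEDIA_TYPE
                    : MediaType.OCTET_STREAM;
        } finally {
            lookahead.close();
        }
    }
}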
39. Telling some stories
• How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
➡Don’t be Afraid to Share!
40. Your BigData solution
isn’t perfect
• Allow users to export data
• Most business users want to work in Excel.
Accept it!
• Allow other applications to build on top
of your application.
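Since Excel is where many users end up, a plain CSV export is often the cheapest sharing feature to add. A minimal sketch of CSV quoting in the RFC 4180 style follows. The class name, column names, and sample row are made up for illustration and are not the GPSN export format.

```java
import java.util.Arrays;
import java.util.List;

class CsvExportSketch {

    // Quote a field if it contains a comma, a quote, or a line break;
    // embedded quotes are doubled.
    static String field(String value) {
        if (value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    // Join already-ordered values into one CSV line (without the line ending).
    static String row(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(field(values.get(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical columns and values.
        System.out.println(row(Arrays.asList("id", "title", "applicant")));
        System.out.println(row(Arrays.asList("CN-1", "Widget, folding", "Example Co.")));
    }
}
```

When the export carries Chinese text, Excel typically needs a UTF-8 byte order mark at the start of the file to pick the right encoding, so that is worth testing with real users' spreadsheets.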
41. GPSN has
• Lots of easy “Print to
PDF” options.
• Data stored in S3 as:
• individual patent files
• chunky downloads.
• Filtering to expand or
select specific data sets.
• Permalinks: simple, very
sharable URLs.
• Underlying Solr service
is exposed to the public via
proxy. You can query
Solr yourself.
• Need advanced querying?
Use Lucene syntax in
search bar.
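Querying an exposed Solr service directly comes down to an HTTP GET against a select handler. Here is a sketch of building such a URL. The proxy host (example.org), core name (patents), and field names are hypothetical; q, rows, and wt are standard Solr query parameters, and the q value uses Lucene syntax.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SolrQueryUrlSketch {

    // Build a select URL; the Lucene query is form-encoded so that
    // colons, quotes, and spaces survive the trip through HTTP.
    static String selectUrl(String coreUrl, String luceneQuery, int rows) {
        try {
            return coreUrl + "/select?q=" + URLEncoder.encode(luceneQuery, "UTF-8")
                    + "&rows=" + rows + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical host, core, and field names.
        String url = selectUrl("http://example.org/solr/patents",
                "title:solar AND abstract:\"battery pack\"", 10);
        System.out.println(url);
    }
}
```

The same Lucene expression passed as q here is what a user would type into the search bar for advanced querying.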