DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION
1. Large Scale Search, Discovery
and Analysis in Action
Ivan Provalov
Research Engineer
Office of the Chief Scientist
September 25, 2012
Confidential © Copyright 2012
2. User Interactions With Big Data
Data -> DFS -> Command Line -> System Administrator
Data -> Key Value Store -> Query Language -> Engineer
Data -> Index -> Keyword Search -> End User
Confidential and Proprietary
© 2012 LucidWorks
3. Is Search Enough?
• Keyword search is a commodity
• Holistic view of the data and the user interactions with that data
• Search, Discovery and Analytics are the key to unlocking this view of users and data
[Figure: search box with the example query "endeavour shuttle bay area"; labels: Search; Search, Discovery and Analytics]
4. Why Search, Discovery and Analytics?
• User Needs
- real-time, ad hoc access to content
- aggressive prioritization based on importance
- serendipity
- feedback/learning from past
• Business Needs
- deeper insight into users
- leverage existing internal knowledge
- cost effective
[Figure labels: Search, Discovery, Analytics]
5. Topics
• Background and needs
• Architecture
• Search, Discovery and Analytics in action
• Road map
• Wrap up
6. Search
• Performance
• Real time
• Relevance and importance
• Presenting results
• Experiment management
7. Discovery
• Content clustering
• Discovering near duplicate documents
• Finding ‘dark data’
• Making recommendations
• Uncovering trends
• Recognizing topics
• More like this
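One common way to approach the "near duplicate documents" item is to compare documents by their overlapping word shingles. The sketch below is illustrative only and is not taken from the talk: the function names, the shingle size `k`, the threshold and the sample documents are all assumptions, and the pairwise comparison would not scale to large collections without further work.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """Return (id1, id2, similarity) for every document pair at or above threshold."""
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    ids = sorted(sets)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            score = jaccard(sets[first], sets[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs


# Hypothetical sample documents, for illustration only.
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "search discovery and analytics in action",
}
print(near_duplicates(docs))  # documents "a" and "b" share 6 of 8 distinct shingles
```
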
8. Analytics
• Term frequency
• Facets
• Click analysis
• Relevancy metrics
• Zero results queries
• Hot spots
• Statistically interesting phrases
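Two of the analytics listed above, term frequency and zero-results queries, can be sketched over a query log in a few lines. This is a minimal illustration under stated assumptions: the log format (query string plus hit count) and the sample entries are hypothetical, and a production system would compute these in batch over far larger logs.

```python
from collections import Counter

# Hypothetical query log: (query string, number of results returned).
query_log = [
    ("solr faceting", 120),
    ("hadoop etl", 45),
    ("solr faceting", 118),
    ("mahout kmeans", 0),
    ("mahout kmeans", 0),
    ("uima annotators", 0),
]


def term_frequency(log):
    """Count how often each term appears across all logged queries."""
    counts = Counter()
    for query, _ in log:
        counts.update(query.lower().split())
    return counts


def zero_result_queries(log):
    """Count queries that returned no results, most frequent first."""
    return Counter(q for q, hits in log if hits == 0).most_common()


tf = term_frequency(query_log)
print(tf.most_common(3))
print(zero_result_queries(query_log))
```

Frequent zero-results queries are a natural starting point for finding content gaps or missing synonyms.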
9. Some Use Cases
• Video streaming
- classification
- recommendations
• Financial, transportation,
telecommunications
- fraud detection
• Social media
- trend monitoring
• Information technology
- logs monitoring
• Healthcare
- identifying patients for clinical studies
10. In Focus: Personalized Medicine
[Diagram labels: Patient DNA; Alignment and other analysis; Genetic Variations; Search and Faceting; Standard Therapies; Alternative Therapies]
11. In Focus: Log Processing in Telecommunications
• Each year, large sums of money are lost due to
fraudulent calls and poor service
• Logs are usually semi-structured and contain vital
information about errors and fraud
• Deeper batch analytics can provide insight into patterns
across vast amounts of data
• Search of call and network information (via logs) is
critical to providing deeper analysis and understanding
of these errors and fraudulent activities
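As a rough illustration of working with semi-structured logs, the sketch below parses key=value fields out of call records and flags callers with many very short calls. Everything specific here is an assumption made for the example: the log line format, the field names, and the short-call heuristic are hypothetical and are not the method described in the talk.

```python
import re
from collections import Counter

# Hypothetical semi-structured call log lines (format assumed for illustration).
LOG_LINES = [
    "2012-09-25T10:00:01 caller=5550101 callee=5550199 duration=2 status=OK",
    "2012-09-25T10:00:03 caller=5550101 callee=5550142 duration=1 status=OK",
    "2012-09-25T10:00:04 caller=5550101 callee=5550177 duration=2 status=DROPPED",
    "2012-09-25T10:00:09 caller=5550123 callee=5550199 duration=340 status=OK",
    "malformed line without fields",
]

FIELD = re.compile(r"(\w+)=(\S+)")


def parse(line):
    """Extract key=value fields; return None when no fields are found."""
    fields = dict(FIELD.findall(line))
    if not fields:
        return None
    fields["timestamp"] = line.split(" ", 1)[0]
    return fields


def suspicious_callers(lines, max_duration=5, min_calls=3):
    """Flag callers with at least min_calls calls lasting max_duration seconds or less."""
    short_calls = Counter()
    for line in lines:
        record = parse(line)
        if record is None or "caller" not in record or "duration" not in record:
            continue  # skip lines that do not look like call records
        if int(record["duration"]) <= max_duration:
            short_calls[record["caller"]] += 1
    return sorted(c for c, n in short_calls.items() if n >= min_calls)


print(suspicious_callers(LOG_LINES))
```
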
12. What Does a Search, Discovery and Analytics
Platform Need?
• Fast, efficient, scalable search
- bulk and near real time indexing
- handle billions of records with sub-second search and faceting
• Large scale, cost effective storage and processing capabilities
- need whole data consumption and analysis
- experimentation/sampling tools
• NLP and machine learning tools that scale to enhance discovery
and analysis
13. Building a Search, Discovery and Analytics Platform
API
Inputs: Bulk & Real Time
Search, Discovery, Analytics
Management
Processing & Storage
Provisioning, Monitoring & Configuration
14. LucidWorks Big Data
API
Inputs
Search, Discovery, Analytics
Management
Processing & Storage
Provisioning, Monitoring & Configuration
16. LucidWorks Big Data
API
Inputs
Search, Discovery, Analytics: Analytics Service, Document Service
Management
Processing & Storage
Provisioning, Monitoring & Configuration
17. LucidWorks Big Data
API
Inputs
Search, Discovery, Analytics: Analytics Service, Document Service
Management: Admin Service, Data Mgmt
Processing & Storage
Provisioning, Monitoring & Configuration
19. LucidWorks Big Data
API: Big Data, LucidWorks, WebHDFS
Inputs
Search, Discovery, Analytics: Analytics Service, Document Service
Management: Admin Service, Data Mgmt
Processing & Storage
Provisioning, Monitoring & Configuration
20. Components – LucidWorks Search
Component: LucidWorks Search (2.1.1)
• connector framework
• security
• user click framework
• business process integration
• administration
Benefit: Lucene/Solr 4.0-dev, sharded with SolrCloud, near-real time indexing, transaction logs for recovery.
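Since LucidWorks Search builds on Lucene/Solr, a faceted query can be pictured as a request to Solr's standard select handler. The sketch below only builds the URL and sends nothing. The host, collection name, query and field names are hypothetical, and whether a given LucidWorks deployment exposes this endpoint directly is an assumption; the parameters `q`, `rows`, `wt`, `facet` and `facet.field` are standard Solr query parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def facet_query_url(base_url, query, facet_fields, rows=10):
    """Build a Solr select URL that asks for facet counts on the given fields."""
    params = [("q", query), ("rows", rows), ("wt", "json"), ("facet", "true")]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical host, collection and field names.
url = facet_query_url(
    "http://localhost:8983/solr/collection1",
    "status:DROPPED",
    ["caller", "region"],
)
print(url)
```
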
21. Components - Hadoop
Component: Apache Hadoop (1.0.3)
Benefit: Distributed computing and processing for ETL and analytics jobs.
Component: Apache HBase (0.92)
Benefit: Key-value store allowing fast access to the data.
Component: Apache Oozie (modified 3.2)
Benefit: Workflow orchestration.
22. Components - Analysis/ML/NLP
Component: Apache Mahout (trunk)
• k-means clustering
• statistically interesting phrases
• similar documents
• classification
Benefit: Distributed machine learning processing framework.
Component: Apache UIMA (2.4.0)
Benefit: Text processing and annotations.
Component: Apache OpenNLP (1.5.2)
• named entity extraction
Benefit: Machine learning toolkit for natural language processing.
Component: Behemoth (modified trunk)
Benefit: Makes M/R data extraction easier, abstracts annotation frameworks.
Component: Apache Pig (0.9.2)
• ETL
• log analysis
Benefit: Helps with writing analytics M/R programs.
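To make the k-means clustering item concrete, here is a minimal single-machine sketch of Lloyd's algorithm. It is not Mahout code and does not reflect Mahout's distributed implementation; the function name, the deterministic "first k points" initialization and the sample points are assumptions chosen to keep the example small and reproducible.

```python
def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; uses the first k points as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
print(centroids)
```
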
23. Components - Middleware
Component: Apache ZooKeeper (3.4.3)
• Netflix Curator
Benefit: Service discovery.
Component: Apache Kafka (0.7)
Benefit: Logs consumption and event-based real-time document processing framework.
24. Components - SDA Engine
• RESTful services (Restlet 2.1)
• ZooKeeper + Netflix Curator
• Authentication and authorization
• Proxies for LucidWorks and
WebHDFS API
• Workflow engine
25. Road Map
• Analytics themes
- relevance
- data quality
- discovery
- integration with other packages (R)
• Machine learning
- NLP
- recommendations
• Experiment management
26. Conclusions
• Search, Discovery and Analytics, when combined into a single, integrated system, provide powerful insight into both your content and your users
• LucidWorks has combined many of these capabilities into LucidWorks Big Data
27. LucidWorks Big Data
• Unified development platform for Big Data applications
• Integrated open source stack: Lucene/Solr, Hadoop,
Mahout, NLP
• Single, uniform REST API
• Pre-tuned by open source industry experts
• Out of the box provisioning - hosted or on premise
28. Search | Discover | Analyze
www.lucidworks.com/bigdata
ivan.provalov@lucidworks.com
@iprovalov
Editor's notes
How do you gain insight?
The search box is the UI for data these days
Feedback improvements into system for users
Extract key metrics for business understanding
Challenges
Many of these are intense calculations or iterative
Many are subjective and require a lot of experimentation
Single nucleotide polymorphisms (SNPs) are used as markers in linkage and association studies to detect which regions in the human genome may be involved in disease.
Single nucleotide polymorphism (SNP) studies and random mutagenesis projects identify amino acid substitutions in protein-coding regions. Each substitution has the potential to affect protein function. SIFT (Sorting Intolerant From Tolerant) is a program that predicts whether an amino acid substitution affects protein function so that users can prioritize substitutions for further study.
Make into images?
Search
Storage and processing
Experiment management
Tools
NLP
Statistical analysis
Scalable
Low cost
Production monitoring
Provisioning
Bulk and near real-time
Handle volume in sub-second processing
Solr takes care of leader election, etc., so no more master/slave
1 second (default) soft commits for NRT updates
1 minute (default) hard commits (no searcher reopen)
Transaction logs for recovery