DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION

Large Scale Search, Discovery
and Analysis in Action

Ivan Provalov
Research Engineer
Office of the Chief Scientist
September 25, 2012

Confidential © Copyright 2012

User Interactions With Big Data

Data | DFS | Command Line | System Administrator
Data | Key Value Store | Query Language | Engineer
Data | Index | Keyword Search | End User

Confidential and Proprietary
© 2012 LucidWorks

Is Search Enough?

• Keyword search is a commodity
• Holistic view of the data and the user interactions with that data
• Search, Discovery and Analytics are the key to unlocking this view of users and data

Search, Discovery and Analytics


Why Search, Discovery and Analytics?

• User Needs
- real-time, ad hoc access to content
- aggressive prioritization based on importance
- serendipity
- feedback/learning from past

• Business Needs
- deeper insight into users
- leverage existing internal knowledge
- cost effective

[Diagram: Search, Analytics, Discovery]


Topics

• Background and needs
• Architecture
• Search, Discovery and Analytics in action
• Road map
• Wrap up


Search

• Performance
• Real time
• Relevance and importance
• Presenting results
• Experiment management


Discovery

• Content clustering
• Discovering near duplicate documents
• Finding ‘dark data’
• Making recommendations
• Uncovering trends
• Recognizing topics
• More like this
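
One way to picture "discovering near duplicate documents" is word shingling combined with Jaccard similarity. The slides do not say which technique the platform uses, so this is only a minimal sketch under that assumption. The function names, the sample documents, and the 0.5 threshold are all hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle; empty text has none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """Return (id1, id2, similarity) for every document pair at or above threshold."""
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    ids = sorted(sets)
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            score = jaccard(sets[left], sets[right])
            if score >= threshold:
                pairs.append((left, right, score))
    return pairs


# Hypothetical sample documents, for illustration only.
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "search discovery and analytics in action",
}
print(near_duplicates(docs))
```

Comparing every pair is quadratic in the number of documents, so at the scale the talk describes a real system would likely need an approximation such as MinHash or a distributed job instead.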


Analytics

• Term frequency
• Facets
• Click analysis
• Relevancy metrics
• Zero results queries
• Hot spots
• Statistically interesting phrases
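
Two of the analytics above, term frequency and zero results queries, can be illustrated with a few lines over a query log. This is a sketch only: the log format (query string plus hit count) and the sample entries are assumptions, not something taken from the slides.

```python
from collections import Counter

# Hypothetical query log: (query string, number of results returned).
query_log = [
    ("solr faceting", 42),
    ("hadoop tutorial", 17),
    ("solr sharding", 8),
    ("lucidworks pricing", 0),
    ("hadop tutorial", 0),
    ("hadop tutorial", 0),
]


def term_frequency(log):
    """Count how often each term appears across all logged queries."""
    counts = Counter()
    for query, _ in log:
        counts.update(query.lower().split())
    return counts


def zero_result_queries(log):
    """Return (query, count) pairs for queries with no hits, most frequent first."""
    counts = Counter(query for query, hits in log if hits == 0)
    return counts.most_common()


print(term_frequency(query_log).most_common(3))
print(zero_result_queries(query_log))
```

Frequent zero-result queries often point to misspellings (as with the hypothetical "hadop" above) or to content gaps, which is one reason they appear alongside relevancy metrics.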


Some Use Cases

• Video streaming
- classification
- recommendations
• Financial, transportation,
telecommunications
- fraud detection
• Social media
- trend monitoring
• Information technology
- log monitoring
• Healthcare
- identifying patients for clinical studies


In Focus: Personalized Medicine

[Diagram: Patient DNA; Alignment and other analysis; Genetic Variations; Search and Faceting; Standard Therapies; Alternative Therapies]


In Focus: Log Processing in Telecommunications

• Each year, large sums of money are lost due to
fraudulent calls and poor service

• Logs are usually semi-structured and contain vital
information about errors and fraud

• Deeper batch analytics can provide insight into patterns
across vast amounts of data

• Search of call and network information (via logs) is
critical to providing deeper analysis and understanding
of these errors and fraudulent activities
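
To make "semi-structured" concrete, the sketch below parses log lines made of a timestamp, an event type, and free-form key=value pairs, then counts calls per caller in a batch. The slides give neither a log format nor a detection rule, so the line layout, the field names, and the `max_calls` threshold are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical log lines; the real format is not given in the slides.
LOG_LINES = [
    "2012-09-25T10:00:01 CALL caller=5550100 callee=5550199 duration=35 status=OK",
    "2012-09-25T10:00:02 CALL caller=5550101 callee=5550142 duration=0 status=DROPPED",
    "2012-09-25T10:00:03 CALL caller=5550101 callee=5550143 duration=0 status=DROPPED",
    "2012-09-25T10:00:04 CALL caller=5550101 callee=5550144 duration=1 status=OK",
    "malformed line without fields",
]

LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<event>[A-Z]+)\s+(?P<rest>.*)$")
PAIR_RE = re.compile(r"(\w+)=(\S+)")


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not fit."""
    match = LINE_RE.match(line)
    if not match:
        return None
    fields = dict(PAIR_RE.findall(match.group("rest")))
    if not fields:
        return None
    fields["ts"] = match.group("ts")
    fields["event"] = match.group("event")
    return fields


def calls_per_caller(lines):
    """Count CALL events per caller, skipping lines that fail to parse."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["event"] == "CALL" and "caller" in record:
            counts[record["caller"]] += 1
    return counts


def suspicious_callers(lines, max_calls=2):
    """Return callers with more than max_calls calls in this batch, sorted."""
    counts = calls_per_caller(lines)
    return sorted(caller for caller, n in counts.items() if n > max_calls)


print(suspicious_callers(LOG_LINES))
```

A simple call-count threshold stands in here for whatever batch analytics a real deployment would run. The point is only that the parsed fields can feed both search and aggregation.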

What Does a Search, Discovery and Analytics Platform Need?

• Fast, efficient, scalable search
- bulk and near real time indexing
- handle billions of records with sub-second search and faceting

• Large scale, cost effective storage and processing capabilities
- need whole data consumption and analysis
- experimentation/sampling tools

• NLP and machine learning tools that scale to enhance discovery
and analysis
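
The "search and faceting" requirement can be pictured with a toy in-memory inverted index: a query selects matching records, and a facet counts the values of one field across those matches. This is an illustrative sketch, not how Lucene/Solr implements either feature. The records and field names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical records; field names are illustrative only.
RECORDS = [
    {"id": 1, "text": "dropped call on network a", "region": "west", "type": "error"},
    {"id": 2, "text": "call completed", "region": "west", "type": "info"},
    {"id": 3, "text": "dropped call suspected fraud", "region": "east", "type": "fraud"},
]


def build_index(records):
    """Map each lowercase term to the set of record ids containing it."""
    index = defaultdict(set)
    for record in records:
        for term in record["text"].lower().split():
            index[term].add(record["id"])
    return index


def search(index, query):
    """Return ids of records containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def facet(records, ids, field):
    """Count the values of one field across the matching records."""
    return Counter(r[field] for r in records if r["id"] in ids)


index = build_index(RECORDS)
hits = search(index, "dropped call")
print(sorted(hits))
print(facet(RECORDS, hits, "region"))
```

At billions of records the index would be sharded across machines and the facet counts merged per shard, which is the role the slides later assign to SolrCloud.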


Building a Search, Discovery and Analytics Platform

[Diagram layers]
API
Inputs: Bulk & Real Time
Search, Discovery, Analytics
Processing & Storage
Management: Provisioning, Monitoring & Configuration


LucidWorks Big Data

[Diagram layers]
API: Big Data, LucidWorks, WebHDFS
Inputs
Search, Discovery, Analytics: Analytics Service, Document Service
Management: Admin, Service Mgmt, Data Mgmt
Processing & Storage

Components - LucidWorks Search

Component: LucidWorks Search (2.1.1)
• connector framework
• security
• user click framework
• business process integration
• administration
Benefit: Lucene/Solr 4.0-dev, sharded with SolrCloud, near-real time indexing, transaction logs for recovery.


Components - Hadoop

Component: Apache Hadoop (1.0.3)
Benefit: Distributed computing and processing for ETL and analytics jobs.

Component: Apache HBase (0.92)
Benefit: Key-value store allowing fast access to the data.

Component: Apache Oozie (modified 3.2)
Benefit: Workflow orchestration.


Components - Analysis/ML/NLP

Component: Apache Mahout (trunk)
• k-means clustering
• statistically interesting phrases
• similar documents
• classification
Benefit: Distributed machine learning processing framework.

Component: Apache UIMA (2.4.0)
Benefit: Text processing and annotations.

Component: Apache OpenNLP (1.5.2)
• named entity extraction
Benefit: Machine learning toolkit for natural language processing.

Component: Behemoth (modified trunk)
Benefit: Simplifies MapReduce (M/R) data extraction and abstracts annotation frameworks.

Component: Apache Pig (0.9.2)
• ETL
• log analysis
Benefit: Helps with writing analytics M/R programs.


Components - Middleware

Component: Apache ZooKeeper (3.4.3)
• Netflix Curator
Benefit: Service discovery.

Component: Apache Kafka (0.7)
Benefit: Logs consumption and event-based real-time document processing framework.


Components - SDA Engine

• RESTful services (Restlet 2.1)
• ZooKeeper + Netflix Curator
• Authentication and authorization
• Proxies for LucidWorks and
WebHDFS API
• Workflow engine


Road Map

• Analytics themes
- relevance
- data quality
- discovery
- integration with other packages (R)
• Machine learning
- NLP
- recommendations
• Experiment management


Conclusions

• Search, Discovery and Analytics, when combined into a single, integrated system, provide powerful insight into both your content and your users
• LucidWorks has combined many of these capabilities into LucidWorks Big Data


LucidWorks Big Data

• Unified development platform for Big Data applications
• Integrated open source stack: Lucene/Solr, Hadoop,
Mahout, NLP
• Single, uniform REST API
• Pre-tuned by open source industry experts
• Out of the box provisioning - hosted or on premise


Search | Discover | Analyze

www.lucidworks.com/bigdata
ivan.provalov@lucidworks.com
@iprovalov
