© Cloudera, Inc. All rights reserved.
Introduction to Cloudera Search Training
Tom Wheeler, Sr. Curriculum Developer
Course Objectives
After successfully completing this course, you will be able to:
• Understand the architecture of Cloudera Search
• Describe several use cases for Cloudera Search
• Develop schemas and queries for your data
• Choose the most appropriate indexing method for a particular situation
• Perform batch indexing of data stored in HDFS and HBase
• Perform indexing of streaming data in near-real-time with Flume
• Index content in multiple languages and file formats
• Process and transform incoming data with Morphlines
• Understand the factors that affect the performance of Cloudera Search
• Create a user interface for your index using Hue
• Integrate Cloudera Search with external applications
• Improve the Search experience using features such as faceting, highlighting, and spelling correction
Tools Used in Hands-On Exercises
Target Audience, Course Prerequisites, and Required Skills
This is a three-day technical course
• Intended for software developers, data engineers, and similar roles
There are no specific prerequisite courses
Students should have the following qualifications
• A basic understanding of Hadoop
• Experience with a general-purpose programming language
• Ability to perform basic end-user tasks using the Linux command line
No prior experience with Cloudera Search or Apache Solr is necessary
• Nor is experience with tools such as Apache Flume or Apache HBase
Learning Path: Developers & Data Engineers
Developer Training
• Learn to code and write MapReduce programs for production
• Master advanced API topics required for real-world data analysis
Spark Training
• Combine batch and stream processing with interactive analytics
• Optimize applications for speed, ease of use, and sophistication
Intro to Data Science
• Implement recommenders and data experiments
• Draw actionable insights from analysis of disparate data
Big Data Applications
• Build converged applications using multiple processing engines
• Develop enterprise solutions using components across the EDH
HBase Training
• Design schemas to minimize latency on massive data sets
• Scale hundreds of thousands of operations per second
Search Training
• Bring scalable, flexible indexing to Hadoop with Apache Solr
• Integrate powerful, real-time queries with external applications
Aaron T. Myers
Software Engineer
Course Outline (1)
Overview of Cloudera Search
Performing Basic Queries
• Hands-On Exercise: Writing and Executing Basic Search Queries
• Bonus Exercise: Issuing Queries Directly to Solr
Writing More Powerful Queries
• Hands-On Exercise: Using Functions in Queries
• Bonus Exercise: Using Filter Queries
• Bonus Exercise: Field Faceting
Preparing to Index Documents
• Hands-On Exercise: Performing Pre-Indexing Tasks
• Bonus Exercise: Extracting Multiple Values from a Field
Course Outline (2)
Batch Indexing HDFS Data with MapReduce
• Hands-On Exercise: Using MapReduce to Index Data in HDFS
• Bonus Exercise: Troubleshooting Data Problems
Near-Real-Time Indexing with Flume
• Hands-On Exercise: Using Flume to Index Changes to a Collection
• Bonus Exercise: Indexing Streaming Data in Near-Real-Time
Indexing HBase Data with Lily
• Hands-On Exercise: Indexing Data in HBase Tables
Understanding Language and File Type Support
• Hands-On Exercise: Testing the Analyzer Chain with the Admin UI
• Bonus Exercise: Extracting Information from Binary Files
Course Outline (3)
Improving Search Quality and Performance
• Hands-On Exercise: Improving Search Quality
• Bonus Exercise: Using Spellchecking in Queries
Building User Interfaces for Search
• Hands-On Exercise: Building a User Interface with Hue
Considerations for Deployment
Presentation: Excerpt from Course
I will now show you some of what's in the course.
Primarily based on the "Overview of Cloudera Search" chapter
• What is Cloudera Search?
• Helpful Features
• Use Cases
Overview of Cloudera Search
• What is Cloudera Search?
• Helpful Features
• Case Studies
• Essential Points
The Need for Cloudera Search
There is significant growth in unstructured and semi-structured data
• Log files
• Product reviews
• Customer surveys
• News releases and articles
• Email and social media messages
• Research reports and other documents
We need scalability, speed, and flexibility to keep up with this growth
• Relational databases can’t handle this volume or variety of data
Decreasing storage costs make it possible to store everything
• But finding relevant data is increasingly a problem
Cloudera Search Is an Important Part of an Enterprise Data Hub
Interactive full-text search capability for data in your Hadoop cluster
Makes the data accessible to non-technical audiences
• A few people can write code for Spark or MapReduce
• Many more people can write SQL queries
• Nearly everyone can use a search engine
Cloudera Search Integrates Apache Solr with CDH
Apache Solr provides a high-performance search service
• Solr is a mature platform with widespread deployment
• Standard Solr APIs and Web UI are available in Cloudera Search
Integration with CDH increases scalability and reliability
• The indexing and query processes can be distributed across nodes
Cloudera Search is 100% open source
• Released under the Apache License, Version 2.0
Relationship Between Cloudera Search and Apache Solr
Apache Solr is the foundation of Cloudera Search
• Proven technology that powers much of the internet
• Active open source community
Cloudera Search adds many additional capabilities
• Integration with HDFS, MapReduce, HBase, and Flume
• Support for file formats widely used with Hadoop
• Dynamic Web-based dashboard and search interface with Hue
• Fine-grained access control through integration with Apache Sentry
How Does Cloudera Search Compare to a Relational Database?
As with a database, Cloudera Search is primarily a backend tool
• End users usually interact with it through user interfaces you create
• APIs are available for application development in multiple languages
Databases are often used to analyze data
• Search is typically used to discover data
Databases are designed to join tables based on a key
• Search is intended for queries on denormalized (flat) data sets
Databases are optimized to find and sort by specific values
• Search can match based on specific values, term variants, or ranges
• Search results are usually sorted by relevance
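To make the last two points concrete, here is a minimal Python sketch that builds a request URL for Solr's standard select handler. The host, the collection name employees, and the field names title, name, and bonus are assumptions chosen for illustration. The query combines an exact value, a wildcard term, and a range, and asks for results sorted by relevance score.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the host and collection name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/employees/select"


def build_query_url(query, rows=10):
    """Return a Solr select URL for the given query string."""
    params = {
        "q": query,            # standard Solr query syntax
        "sort": "score desc",  # most relevant documents first
        "rows": rows,          # number of results to return
        "wt": "json",          # response format
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Exact value, wildcard term, and numeric range in a single query.
url = build_query_url("title:Manager AND name:C* AND bonus:[5000 TO 7500]")
print(url)
```

The sketch only constructs the URL; sending it requires a running Solr instance with a matching collection.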
Overview of Cloudera Search
• What is Cloudera Search?
• Helpful Features
• Case Studies
• Essential Points
Scoring Manipulation
One way you can improve precision is by manipulating document scores
• Users don’t always know how to write good queries
This is also used to balance the needs of the business and the user
• In the end, it is important that the user is satisfied
• Data scientists can be helpful in developing scoring algorithms
• Function queries are often used to manipulate scores
Many factors might be used to influence the scores
• Such as geography, popularity, timeliness, or profit margin
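The idea of score manipulation can be sketched in a few lines of Python. This is an illustration of the concept only, not how Solr function queries are implemented: the Hit class, the popularity signal, and the weight are all hypothetical. Each hit's text relevance is multiplied by a boost derived from a business signal, which can change the final ordering.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    title: str
    relevance: float   # text-match score from the search engine
    popularity: float  # hypothetical business signal, scaled to 0..1


def boosted_score(hit, popularity_weight=0.5):
    """Multiply the relevance score by a popularity-based boost."""
    return hit.relevance * (1.0 + popularity_weight * hit.popularity)


hits = [
    Hit("Obscure but exact match", relevance=2.0, popularity=0.0),
    Hit("Popular near match", relevance=1.8, popularity=0.9),
]

# Sort by the boosted score instead of the raw relevance.
ranked = sorted(hits, key=boosted_score, reverse=True)
for hit in ranked:
    print(f"{boosted_score(hit):.2f}  {hit.title}")
```

A multiplicative boost keeps a document with zero relevance at zero; an additive boost would be another reasonable choice with different trade-offs.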
Broad File Format Support
Cloudera Search is ideal for semi-structured and free-form text data
• This includes a variety of document types such as log files, email messages, reports, spreadsheets, presentations, and multimedia
Support for indexing data from many common formats, including
• Microsoft Office (Word, Excel, and PowerPoint)
• Portable Document Format (PDF)
• HTML and XML
• UNIX mailbox format (mbox)
• Plain text and Rich Text Format (RTF)
• Hadoop file formats like SequenceFiles and Avro
Can also extract and index metadata from many image and audio formats
Multilingual Support
You can index and query content in more than 30 languages
“More Like This”
Aids in focusing results when searching on words with multiple meanings
The Apple Macintosh Book
by Cary Lu (1984)
A wealth of information about the Macintosh family of computers... more like this
Wild Apple and Fruit Trees of Central Asia
by Jules Janick and Calvin Ross Sperling (2003)
The definitive source of information about Malus species found in... more like this
The Year the Big Apple Went Bust
by Fred Ferretti (1976)
Chronicles the 1975 fiscal crisis that nearly forced New York City... more like this
Apple of My Eye
by Patrick Redmond (2003)
When Susan and Ronnie first meet, the attraction is instant... more like this
They Were Strangers: A Family History
by Slovie Solomon Apple (1995)
Determined to survive at any cost, Clara endures untold hardships... more like this
Showing results 1-5 out of 7,523 for term: apple
Term Highlighting
Highlighting helps you quickly identify matches in surrounding text
How to Traverse the Space-Time Continuum
by Doc Brown (1955)
...after hitting my head on the bathroom sink while attempting to hang a clock,
I conceived of a flux capacitor, which contains three Geissler-style gas discharge
tubes sealed with mercury vapor or reactive alkali metal such as sodium...
Customizing Your DeLorean DMC-12
by Doc Brown and Marty McFly (1985)
...the stainless steel body of the DeLorean DMC-12 provides a direct and influential
effect on the "flux dispersal" of the overall system, and by installing a flux capacitor
providing 1.21 gigawatts (roughly equivalent to the power produced by 15 jet...
Relativity: the Special and General Theory
by A. Einstein (1916)
…under these conditions, the u-curves and v-curves are straight lines in the
sense of Euclidean geometry, and they are perpendicular to each other when
the flux capacitor exceeds ~ 1200 gigawatts of electrical power...
Showing results 1 - 3 out of 18 for phrase: “flux capacitor”
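As a rough illustration of what highlighting does to a snippet, the following Python sketch wraps each occurrence of a phrase in marker tags. The function name and the choice of <em> tags are assumptions for this example; a search engine performs this step using its own analysis of the indexed text.

```python
import re


def highlight(text, phrase, pre="<em>", post="</em>"):
    """Wrap each case-insensitive occurrence of phrase in marker tags."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return pattern.sub(lambda m: pre + m.group(0) + post, text)


snippet = "I conceived of a flux capacitor, which contains three tubes"
print(highlight(snippet, "flux capacitor"))
```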
Spellchecking Suggestions
Users often enter search terms incorrectly
• Unless they notice, they may conclude that no relevant data exists
• The spellchecking feature in Cloudera Search can suggest an alternative
No results found for phrase: “comptuer porgramming”
Did you mean to search for “computer programming” instead?
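The suggestion step can be approximated with Python's standard difflib module. This is a simplified stand-in, not the spellchecking algorithm used by Cloudera Search, and the vocabulary list is hypothetical; a real system would draw candidate terms from the index itself.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; a real index would supply its own terms.
VOCABULARY = ["computer", "commuter", "program", "programming", "search"]


def suggest(phrase, vocabulary=VOCABULARY):
    """Replace each unknown word with its closest vocabulary term."""
    corrected = []
    for word in phrase.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        # Keep the original word when nothing is similar enough.
        matches = get_close_matches(word, vocabulary, n=1, cutoff=0.6)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


print(suggest("comptuer porgramming"))
```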
Geospatial Search
Cloudera Search can use location data to filter and sort results
• Proximity is calculated based on longitude and latitude of each point
1. Forest Park Station (0.1 kilometers)
2. Skinker Station (0.2 kilometers)
3. Central West End Station (0.3 kilometers)
4. Delmar Station (0.3 kilometers)
5. Big Bend Station (0.9 kilometers)
Showing all 5 results for Metrolink stations within 1 kilometer of Forest Park
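One common way to compute proximity from latitude and longitude is the haversine formula. The Python sketch below filters and sorts places by distance from an origin; it is an illustration under assumptions, with made-up coordinates and hypothetical function names, and it does not show how Cloudera Search evaluates spatial queries internally.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_radius(origin, places, radius_km):
    """Return (name, distance) pairs inside the radius, nearest first."""
    results = []
    for name, (lat, lon) in places.items():
        distance = haversine_km(origin[0], origin[1], lat, lon)
        if distance <= radius_km:
            results.append((name, distance))
    return sorted(results, key=lambda pair: pair[1])


# Made-up coordinates for illustration only.
origin = (0.0, 0.0)
places = {"Near": (0.0, 0.005), "Nearer": (0.002, 0.0), "Far": (0.0, 0.5)}
for name, distance in within_radius(origin, places, radius_km=1.0):
    print(f"{name}: {distance:.2f} km")
```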
Faceted Search
Facets categorize results by field values or ranges
• Makes it easy to “drill down” into a subset of results
This feature is found on many popular Web sites
• Travel sites might facet on location and price
• Music sites might facet by genre, format, and year
Faceting makes it easy for users to narrow searches
• They can see how many items match a given facet
• Then, they can filter by that facet
This is key for analytics in Cloudera Search
Example facet panels:
• Genre: Jazz (remove)
• Release Year: 2010 - Now (397), 2000 - 2009 (974), 1990 - 1999 (721)
• Format: Vinyl (remove)
• Neighborhood: Downtown (97), Midtown (62), + Show more...
• Price Range: Economy (872), Moderate (519), Luxury (361)
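The two steps described on this slide, counting matches per facet value and then filtering by a selected value, can be sketched in Python as follows. The album records and function names are hypothetical, and a search engine computes these counts from its index rather than by scanning records.

```python
from collections import Counter

# Hypothetical music catalog records.
albums = [
    {"title": "A", "genre": "Jazz", "format": "Vinyl", "year": 1995},
    {"title": "B", "genre": "Jazz", "format": "CD", "year": 2004},
    {"title": "C", "genre": "Rock", "format": "Vinyl", "year": 2012},
    {"title": "D", "genre": "Jazz", "format": "Vinyl", "year": 2008},
]


def facet_counts(records, field):
    """Count how many records carry each value of the given field."""
    return Counter(record[field] for record in records)


def filter_by(records, field, value):
    """Keep only records whose field equals the selected facet value."""
    return [record for record in records if record[field] == value]


print(facet_counts(albums, "genre"))       # counts before drilling down
jazz = filter_by(albums, "genre", "Jazz")  # user selects the Jazz facet
print(facet_counts(jazz, "format"))        # counts within the subset
```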
Hue: Search Dashboards
Hue has drag-and-drop support for building dashboards based on Search
[Screenshot: Hue search dashboard for an "Employees" collection, with panels for Department, Year Hired, Location, Education Level, and Salary]
Overview of Cloudera Search
• What is Cloudera Search?
• Helpful Features
• Case Studies
• Essential Points
Use Case #1: Online Document Archive
Information silos impede cross-team collaboration and knowledge sharing
HDFS can act as a central repository for archiving all types of data
• Search allows employees to find this information quickly and easily
[Screenshot: search results for the query “fuel pump” AND fail (249 matches found)]
File Type: PDF (132), Microsoft Word (68), Microsoft Excel (27), E-Mail Message (19), Audio File (3)
Department: Legal Compliance (117), Engineering (86), Manufacturing (46)
Recall Notice: CX1-2112 Fuel Pump May Cause Fire
By Arnold Anderson, Chief Engineer (April 29, 2014)
The CX1-2112 fuel pump uses a neoprene gasket that has been shown to fail during normal use, causing dangerous…
Pending Class Action Regarding Faulty Fuel Pumps
From Winston Prescott, Esquire (November 11, 2014)
My firm represents 318 victims, injured during fires caused by the failure of the CX1-2112 fuel pump manufactured by…
Use Case #2: Threat Detection in Near-Real-Time
Looking at yesterday’s log files allows us to react to history
• Yet emerging threats require us to react to what’s happening right now
Search can help you identify important patterns in incoming data
[Screenshot: dashboard searching the "Firewall Logs" data set for 172.16.36.* in the IP Address field, displaying the last hour]
4,323,951 records matched (time range: 11:37:21 – 12:37:21)
Packet Rejected: Yes (4,292,172), No (61,779)
Service Port: HTTP (594,370), HTTPS (605,352), SSH (475,634), SMTP (2,645,595)
Top Five Origins by Source IP Address: New York, Ukraine, Texas, Illinois, California
Use Case #3: Market Segmentation/Identification
Survey and feedback information is valuable
• But extracting insight can be a slow and expensive process
Search makes it easy to interactively explore new opportunities
[Screenshot: dashboard searching the "2014 Survey" data set for the term "phone OR tablet" in the Next Purchase field]
17,138 matches with filters (Annual Income: >$500,000, Region: Southwest, Education: College Graduate)
Age Range: Under 35 (1,798), 35-50 (6,389), Over 50 (8,991)
Gender: Female (10,085), Male (7,093)
Marital Status: Married (12,347)
Primary Residence: 1. Beverly Hills, CA; 2. Malibu, CA; 3. Los Altos Hills, CA; 4. Scottsdale, AZ; 5. Park City, UT
Recent Leisure Activities (chart): Yachting, Shopping, Polo, Opera, Croquet
Monthly Expenses, by Category (chart)
Overview of Cloudera Search
• What is Cloudera Search?
• Helpful Features
• Case Studies
• Essential Points
Documents, Fields, Queries, and Terms
It is helpful to understand the meaning of some commonly-used words in Solr
A query typically specifies terms of interest, such as “equity” or “David”
It may match one or more documents
• Each document contains one or more fields, such as “title” or “name”
The notion of “document” is flexible
• Think of a document as being similar to a record in a database table
• A single file may contain multiple documents
Example document:
Title:   Equity Market Analysis
Date:    March 14, 2015
Author:  J.P. Moneybags
Summary: This report explains how to…
Body:    Given the recent increase in…
Example table:
name   address      city
Alice  12 Ames St.  Austin
Bruce  27 Bend Rd.  Baltimore
Carol  35 Clay Ct.  Cleveland
David  41 Deer Dr.  Dallas
Ellie  59 Elan Ln.  El Paso
Indexing Data Is a Prerequisite to Searching It
You must index data prior to querying that data with Cloudera Search
Creating and populating an index requires specialized skills
• Somewhat similar to designing database tables
• Frequently involves data extraction and transformation
Running basic queries on that data requires relatively little skill
• “Power users” who master the syntax can create very powerful queries
[Diagram: Acquire Data → Transform Data → Index Data → Query Data → Display Results]
What Is an Index?
Indexes are data structures optimized for quick lookups
• Much like a book’s index helps you quickly locate information
The indexing process uses a schema to define the documents’ fields
• This includes each field’s name and data type
Cloudera Search includes the Morphlines library
• Can extract, transform, and load data into Solr
[Diagram: a schema is applied to the input data to produce an index]
Data:
a,Alice,Manager
b,Bruce,Engineer,$5000
c,Carol,Manager,$7500
d,David,Analyst,$5000
Schema:
id: string
name: string
title: string
bonus: int
Index:
name: Alice: (a), Bruce: (b), Carol: (c), David: (d)
title: Analyst: (d), Engineer: (b), Manager: (a, c)
bonus: 5000: (b, d), 7500: (c)
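The index shown on this slide can be reproduced with a short Python sketch. It is a simplified illustration of an inverted index, not the data structure Solr actually uses: the function name build_index and the nested-dictionary layout are assumptions, while the data rows and schema come from the slide.

```python
import csv
from collections import defaultdict
from io import StringIO

DATA = """a,Alice,Manager
b,Bruce,Engineer,$5000
c,Carol,Manager,$7500
d,David,Analyst,$5000
"""

# Field names and types, in column order, as in the schema on the slide.
SCHEMA = [("id", str), ("name", str), ("title", str), ("bonus", int)]


def build_index(text):
    """Map field name -> value -> list of ids of matching documents."""
    index = defaultdict(lambda: defaultdict(list))
    for row in csv.reader(StringIO(text)):
        doc_id = row[0]
        # zip stops at the shorter sequence, so a missing bonus is skipped.
        for (field, field_type), raw in zip(SCHEMA[1:], row[1:]):
            value = field_type(raw.lstrip("$"))
            index[field][value].append(doc_id)
    return index


index = build_index(DATA)
print(index["title"]["Manager"])
print(index["bonus"][5000])
```

Looking up a value is then a dictionary access rather than a scan of every record, which is the property that makes an index fast.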
Three Indexing Methods in Cloudera Search
Near-Real-Time indexing with Flume
• Data is indexed immediately as it enters the cluster
Batch mode indexing with MapReduce
• Used to index static data that already resides in HDFS
HBase indexing with Lily
• Allows you to index records stored in HBase tables
Batch Indexing of Data in HDFS with MapReduce
Use batch indexing to index static data already stored in HDFS
Cloudera Search provides a reusable job (MapReduceIndexerTool)
• Reads input data previously stored in HDFS
• Processes this data using Morphlines
• Creates the index and stores it in HDFS
[Diagram: input data is added to HDFS; the MapReduce indexing job reads the input data, processes it with Morphlines, then creates the index and stores it in HDFS]
Near-Real-Time Indexing with Flume
Use near-real-time indexing for streaming or continuously-generated data
• Flume reads incoming data from a specific source
• This data is processed using Morphlines
• The index is created in HDFS and updated as new records arrive
The processed data can optionally be written as files in HDFS
[Diagram: Flume reads events from a source; the Morphline Solr Sink processes them with Morphlines and creates or updates the index in HDFS; data files can optionally be created in HDFS]
Indexing Data in HBase with Lily
Use Lily to index data stored in HBase tables
• HBase is a non-relational (NoSQL) distributed database built on HDFS
• HBase can scale to handle billions of records with millions of columns
Both batch and near-real-time modes of operation are supported
[Diagram: the Lily NRT Indexer Tool is triggered by updates to HBase cells, reads input from the cells, processes it with Morphlines, and updates the index; the HBase Batch Indexer Tool is invoked on demand or through a scheduler, reads input from the cells, processes it with Morphlines, and creates the index in HDFS]
Morphlines Overview
Morphlines is a framework for processing streams of data
• It is part of the Kite Software Development Kit (SDK)
• Offers many helpful features for indexing data with Search
• It is a plain Java library that can be used even outside of Hadoop
Especially useful for Extract, Transform, and Load (ETL) processing
• Processing commands are defined in a configuration file
• These commands are executed in sequence, much like a UNIX pipeline
• Morphlines ships with dozens of reusable commands
[Diagram: Morphlines processing pipeline: Incoming Record → Read CSV → Generate UUID → Convert Timestamp → Outgoing Record]
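The pipeline idea can be sketched in Python with chained generators, where each stage consumes records from the previous one. This is an analogy only: it is not Morphlines configuration syntax, and the function names, column names, and input timestamp format are assumptions chosen to mirror the three stages named on this slide.

```python
import csv
import uuid
from datetime import datetime, timezone
from io import StringIO


def read_csv(lines, columns):
    """Turn CSV text into one record (dict) per row."""
    for row in csv.reader(StringIO(lines)):
        yield dict(zip(columns, row))


def generate_uuid(records):
    """Add a unique id field to each record."""
    for record in records:
        record["id"] = str(uuid.uuid4())
        yield record


def convert_timestamp(records, field="timestamp"):
    """Rewrite 'YYYY-MM-DD HH:MM:SS' (assumed UTC) as ISO 8601."""
    for record in records:
        parsed = datetime.strptime(record[field], "%Y-%m-%d %H:%M:%S")
        record[field] = parsed.replace(tzinfo=timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        )
        yield record


raw = "2015-03-14 09:26:53,login,alice\n"
# Stages run in sequence, much like commands in a UNIX pipeline.
pipeline = convert_timestamp(
    generate_uuid(read_csv(raw, ["timestamp", "event", "user"]))
)
records = list(pipeline)
print(records)
```

Because each stage is a generator, records flow through one at a time, so the same structure works for a stream as well as for a file.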
Essential Points
Cloudera Search provides full-text interactive search for data in Hadoop
• Apache Solr is a mature, high-performance search platform
• CDH components provide reliability and scalability
Search offers an additional option for accessing data
• Ideal for free-form or semi-structured data in many formats
• Does not require users to have experience with Java or SQL
Data must be indexed before it can be searched
• Cloudera Search offers several methods for indexing data at scale
• You can extract, transform, and load data using Morphlines
Thank you!
twheeler@cloudera.com
Thank You for Attending!
• Submit questions in the Q&A panel
• Follow Cloudera University on Twitter @ClouderaU
• Learn more about Cloudera Search Training:
http://university.cloudera.com/search-training
• Follow the Developer Learning Path:
http://university.cloudera.com/developers
• Get Developer Certification: http://university.cloudera.com/certification
• Join the Cloudera Community: http://community.cloudera.com

Contenu connexe

Tendances

SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and SolrLucidworks (Archived)
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSpark Summit
 
Nl HUG 2016 Feb Hadoop security from the trenches
Nl HUG 2016 Feb Hadoop security from the trenchesNl HUG 2016 Feb Hadoop security from the trenches
Nl HUG 2016 Feb Hadoop security from the trenchesBolke de Bruin
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoopmarkgrover
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoopmarkgrover
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopCloudera, Inc.
 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorialmarkgrover
 
Impala Performance Update
Impala Performance UpdateImpala Performance Update
Impala Performance UpdateCloudera, Inc.
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchMark Miller
 

Tendances (20)

SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Deep Learning using Spark and DL4J for fun and profit
Deep Learning using Spark and DL4J for fun and profitDeep Learning using Spark and DL4J for fun and profit
Deep Learning using Spark and DL4J for fun and profit
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
 
Nl HUG 2016 Feb Hadoop security from the trenches
Nl HUG 2016 Feb Hadoop security from the trenchesNl HUG 2016 Feb Hadoop security from the trenches
Nl HUG 2016 Feb Hadoop security from the trenches
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoop
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorial
 
Impala Performance Update
Impala Performance UpdateImpala Performance Update
Impala Performance Update
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data Search
 

Similaire à Introduction to Cloudera Search Training

Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopCloudera, Inc.
 
Hack for Good and Profit (Cloud Foundry Summit 2014)
Hack for Good and Profit (Cloud Foundry Summit 2014)Hack for Good and Profit (Cloud Foundry Summit 2014)
Hack for Good and Profit (Cloud Foundry Summit 2014)VMware Tanzu
 
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...Cloudera, Inc.
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerCloudera, Inc.
 
Cloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera clusterCloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera clusterCloudera, Inc.
 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaNeo4j
 
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Cloudera, Inc.
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Cloudera, Inc.
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
Introduction to Cloudera Search Training

  • 1. Introduction to Cloudera Search Training
    Tom Wheeler, Sr. Curriculum Developer
    © Cloudera, Inc. All rights reserved.
  • 2. Course Objectives
    After successfully completing this course, you will be able to:
    • Understand the architecture of Cloudera Search
    • Describe several use cases for Cloudera Search
    • Develop schemas and queries for your data
    • Choose the most appropriate indexing method for a particular situation
    • Perform batch indexing of data stored in HDFS and HBase
    • Perform indexing of streaming data in near-real-time with Flume
    • Index content in multiple languages and file formats
    • Process and transform incoming data with Morphlines
    • Understand the factors that affect the performance of Cloudera Search
    • Create a user interface for your index using Hue
    • Integrate Cloudera Search with external applications
    • Improve the Search experience using features such as faceting, highlighting, and spelling correction
  • 3. Tools Used in Hands-On Exercises
  • 4. Target Audience, Course Prerequisites, and Required Skills
    This is a three-day technical course
    • Intended for software developers, data engineers, and similar roles
    There are no specific prerequisite courses
    Students should have the following qualifications
    • A basic understanding of Hadoop
    • Experience with a general-purpose programming language
    • Ability to perform basic end-user tasks using the Linux command line
    No prior experience with Cloudera Search or Apache Solr is necessary
    • Nor is experience with tools such as Apache Flume or Apache HBase
  • 5. Learning Path: Developers & Data Engineers
    Intro to Data Science
    Spark Training
    Learn to code and write MapReduce programs for production
    Master advanced API topics required for real-world data analysis
    Combine batch and stream processing with interactive analytics
    Optimize applications for speed, ease of use, and sophistication
    Implement recommenders and data experiments
    Draw actionable insights from analysis of disparate data
    Big Data Applications
    Build converged applications using multiple processing engines
    Develop enterprise solutions using components across the EDH
    Developer Training
    Design schemas to minimize latency on massive data sets
    Scale hundreds of thousands of operations per second
    HBase Training
    Search Training
    Bring scalable, flexible indexing to Hadoop with Apache Solr
    Integrate powerful, real-time queries with external applications
    Aaron T. Myers, Software Engineer
  • 6. Course Outline (1)
    Overview of Cloudera Search
    Performing Basic Queries
    • Hands-On Exercise: Writing and Executing Basic Search Queries
    • Bonus Exercise: Issuing Queries Directly to Solr
    Writing More Powerful Queries
    • Hands-On Exercise: Using Functions in Queries
    • Bonus Exercise: Using Filter Queries
    • Bonus Exercise: Field Faceting
    Preparing to Index Documents
    • Hands-On Exercise: Performing Pre-Indexing Tasks
    • Bonus Exercise: Extracting Multiple Values from a Field
  • 7. Course Outline (2)
    Batch Indexing HDFS Data with MapReduce
    • Hands-On Exercise: Using MapReduce to Index Data in HDFS
    • Bonus Exercise: Troubleshooting Data Problems
    Near-Real-Time Indexing with Flume
    • Hands-On Exercise: Using Flume to Index Changes to a Collection
    • Bonus Exercise: Indexing Streaming Data in Near-Real-Time
    Indexing HBase Data with Lily
    • Hands-On Exercise: Indexing Data in HBase Tables
    Understanding Language and File Type Support
    • Hands-On Exercise: Testing the Analyzer Chain with the Admin UI
    • Bonus Exercise: Extracting Information from Binary Files
  • 8. Course Outline (3)
    Improving Search Quality and Performance
    • Hands-On Exercise: Improving Search Quality
    • Bonus Exercise: Using Spellchecking in Queries
    Building User Interfaces for Search
    • Hands-On Exercise: Building a User Interface with Hue
    Considerations for Deployment
  • 9. Presentation: Excerpt from Course
    I will now show you some of what's in the course.
    Primarily based on the "Overview of Cloudera Search" chapter
    • What is Cloudera Search?
    • Helpful Features
    • Use Cases
  • 10. Overview of Cloudera Search
    • What is Cloudera Search?
    • Helpful Features
    • Case Studies
    • Essential Points
  • 11. The Need for Cloudera Search
    There is significant growth in unstructured and semi-structured data
    • Log files
    • Product reviews
    • Customer surveys
    • News releases and articles
    • Email and social media messages
    • Research reports and other documents
    We need scalability, speed, and flexibility to keep up with this growth
    • Relational databases can’t handle this volume or variety of data
    Decreasing storage costs make it possible to store everything
    • But finding relevant data is increasingly a problem
  • 12. Cloudera Search Is an Important Part of an Enterprise Data Hub
    Interactive full-text search capability for data in your Hadoop cluster
    Makes the data accessible to non-technical audiences
    • A few people can write code for Spark or MapReduce
    • Many more people can write SQL queries
    • Nearly everyone can use a search engine
  • 13. Cloudera Search Integrates Apache Solr with CDH
    Apache Solr provides a high-performance search service
    • Solr is a mature platform with widespread deployment
    • Standard Solr APIs and Web UI are available in Cloudera Search
    Integration with CDH increases scalability and reliability
    • The indexing and query processes can be distributed across nodes
    Cloudera Search is 100% open source
    • Released under the Apache Software License
  • 14. Relationship Between Cloudera Search and Apache Solr
    Apache Solr is the foundation of Cloudera Search
    • Proven technology that powers much of the internet
    • Active open source community
    Cloudera Search adds many additional capabilities
    • Integration with HDFS, MapReduce, HBase, and Flume
    • Support for file formats widely used with Hadoop
    • Dynamic Web-based dashboard and search interface with Hue
    • Fine-grained access control through integration with Apache Sentry
  • 15. How Does Cloudera Search Compare to a Relational Database?
    As with a database, Cloudera Search is primarily a backend tool
    • End users usually interact with it through user interfaces you create
    • APIs are available for application development in multiple languages
    Databases are often used to analyze data
    • Search is typically used to discover data
    Databases are designed to join tables based on a key
    • Search is intended for queries on denormalized (flat) data sets
    Databases are optimized to find and sort by specific values
    • Search can match based on specific values, term variants, or ranges
    • Search results are usually sorted by relevance
  • 16. Overview of Cloudera Search
    • What is Cloudera Search?
    • Helpful Features
    • Case Studies
    • Essential Points
  • 17. Scoring Manipulation
    One way you can improve precision is by manipulating document scores
    • Users don’t always know how to write good queries
    This is also used to balance the needs of the business and the user
    • In the end, it is important that the user is satisfied
    • Data scientists can be helpful in developing scoring algorithms
    • Function queries are often used to manipulate scores
    Many factors might be used to influence the scores
    • Such as geography, popularity, timeliness, or profit margin
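As a rough illustration of how function queries can influence scores, the sketch below builds request parameters for Solr's eDisMax query parser. It assumes a collection with hypothetical fields `title`, `text`, `published` (a date), and `popularity` (a number); none of these names come from the course material. The `boost` parameter multiplies the relevance score by a recency function, and `bf` adds a popularity-based term. The code only assembles the query string and does not contact a server.

```python
from urllib.parse import urlencode


def boosted_query_params(user_query, date_field="published",
                         popularity_field="popularity"):
    """Build eDisMax parameters that favor recent and popular documents.

    The field names are hypothetical placeholders for this sketch.
    """
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": "title text",
        # Multiplicative boost: near 1 for new documents, decaying with age.
        "boost": f"recip(ms(NOW,{date_field}),3.16e-11,1,1)",
        # Additive boost: grows slowly with the popularity value.
        "bf": f"log(sum({popularity_field},1))",
    }


params = boosted_query_params("hadoop search")
query_string = urlencode(params)
print(query_string)
```

The same pattern could carry other business signals mentioned on the slide, such as profit margin, by swapping in a different numeric field.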
  • 18. Broad File Format Support
    Cloudera Search is ideal for semi-structured and free-form text data
    • This includes a variety of document types such as log files, email messages, reports, spreadsheets, presentations, and multimedia
    Support for indexing data from many common formats, including
    • Microsoft Office (Word, Excel, and PowerPoint)
    • Portable Document Format (PDF)
    • HTML and XML
    • UNIX mailbox format (mbox)
    • Plain text and Rich Text Format (RTF)
    • Hadoop file formats like SequenceFiles and Avro
    Can also extract and index metadata from many image and audio formats
  • 19. Multilingual Support
    You can index and query content in more than 30 languages
  • 20. “More Like This”
    Aids in focusing results when searching on words with multiple meanings
    Showing results 1-5 out of 7,523 for term: apple
    • The Apple Macintosh Book by Cary Lu (1984): A wealth of information about the Macintosh family of computers... (more like this)
    • Wild Apple and Fruit Trees of Central Asia by Jules Janick and Calvin Ross Sperling (2003): The definitive source of information about Malus species found in... (more like this)
    • The Year the Big Apple Went Bust by Fred Ferretti (1976): Chronicles the 1975 fiscal crisis that nearly forced New York City... (more like this)
    • Apple of My Eye by Patrick Redmond (2003): When Susan and Ronnie first meet, the attraction is instant... (more like this)
    • They Were Strangers: A Family History by Slovie Solomon Apple (1995): Determined to survive at any cost, Clara endures untold hardships... (more like this)
  • 21. Term Highlighting Highlighting helps you quickly identify matches in surrounding text How to Traverse the Space-Time Continuum by Doc Brown (1955) ...after hitting my head on the bathroom sink while attempting to hang a clock, I conceived of a flux capacitor, which contains three Geissler-style gas discharge tubes sealed with mercury vapor or reactive alkali metal such as sodium... Customizing Your DeLorean DMC-12 by Doc Brown and Marty McFly (1985) ...the stainless steel body of the DeLorean DMC-12 provides a direct and influential effect on the "flux dispersal" of the overall system, and by installing a flux capacitor providing 1.21 gigawatts (roughly equivalent to the power produced by 15 jet... Relativity: the Special and General Theory by A. Einstein (1916) …under these conditions, the u-curves and v-curves are straight lines in the sense of Euclidean geometry, and they are perpendicular to each other when the flux capacitor exceeds ~ 1200 gigawatts of electrical power... Showing results 1 - 3 out of 18 for phrase: “flux capacitor”
  • 22. Spellchecking Suggestions Users often enter search terms incorrectly • Unless they notice, they may conclude that no relevant data exists • The spellchecking feature in Cloudera Search can suggest an alternative No results found for phrase: “comptuer porgramming” Did you mean to search for “computer programming” instead?
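A "did you mean" suggestion can be approximated by comparing each query word against a list of known terms. The sketch below uses Python's standard `difflib` module rather than Solr's spellcheck component, and the `VOCABULARY` list and `suggest` helper are hypothetical; a real deployment would draw candidate terms from the index itself.

```python
import difflib

# Hypothetical vocabulary; a real system would use terms from the index
VOCABULARY = ["computer", "programming", "search", "index"]

def suggest(query, vocabulary=VOCABULARY):
    """Return the query with each unknown word replaced by its closest known term."""
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)  # already a known term, keep as-is
            continue
        # Pick the single best candidate above a similarity cutoff, if any
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.6)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

suggestion = suggest("comptuer porgramming")
```

Applied to the misspelled phrase from the slide, `suggestion` holds "computer programming".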
  • 23. Geospatial Search Cloudera Search can use location data to filter and sort results • Proximity is calculated based on longitude and latitude of each point 1. Forest Park Station 0.1 kilometers 2. Skinker Station 0.2 kilometers 3. Central West End Station 0.3 kilometers 4. Delmar Station 0.3 kilometers 5. Big Bend Station 0.9 kilometers Showing all 5 results for Metrolink stations within 1 kilometer of Forest Park [Figure: map with the five results marked 1 through 5]
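To illustrate how proximity can be derived from latitude and longitude, here is a minimal sketch using the haversine great-circle formula. It is one common way to compute such distances, not necessarily the method Cloudera Search uses internally; the helper names and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_radius(origin, points, radius_km):
    """Return (name, distance_km) pairs within radius_km of origin, nearest first."""
    hits = []
    for name, (lat, lon) in points.items():
        distance = haversine_km(origin[0], origin[1], lat, lon)
        if distance <= radius_km:
            hits.append((name, distance))
    return sorted(hits, key=lambda pair: pair[1])

# Hypothetical points, not real station coordinates
points = {"far": (1.0, 0.0), "mid": (0.005, 0.0), "near": (0.001, 0.0)}
nearby = within_radius((0.0, 0.0), points, radius_km=1.0)
```

The function filters by radius first and then sorts by distance, mirroring the "filter and sort" behavior described on the slide.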
  • 24. Faceted Search Facets categorize results by field values or ranges • Makes it easy to “drill down” into a subset of results This feature is found on many popular Web sites • Travel sites might facet on location and price • Music sites might facet by genre, format, and year Faceting makes it easy for users to narrow searches • They can see how many items match a given facet • Then, they can filter by that facet This is key for analytics in Cloudera Search [Figure: example facet panels. Genre: Jazz (selected); Release Year: 2010 - Now (397), 2000 - 2009 (974), 1990 - 1999 (721); Format: Vinyl (selected); Neighborhood: Downtown (97), Midtown (62); Price Range: Economy (872), Moderate (519), Luxury (361)]
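The two steps described above (count matches per facet value, then filter by a chosen value) can be sketched with plain Python collections. The `albums` records and helper names below are hypothetical and stand in for search results; Solr computes facet counts inside the search engine rather than in client code.

```python
from collections import Counter

# Hypothetical search results for a music site
albums = [
    {"title": "A", "genre": "Jazz", "format": "Vinyl"},
    {"title": "B", "genre": "Jazz", "format": "CD"},
    {"title": "C", "genre": "Rock", "format": "Vinyl"},
    {"title": "D", "genre": "Jazz", "format": "Vinyl"},
]

def facet_counts(docs, field):
    """Count how many documents carry each distinct value of the field."""
    return Counter(doc[field] for doc in docs)

def filter_by(docs, field, value):
    """Drill down: keep only documents whose field equals the chosen value."""
    return [doc for doc in docs if doc[field] == value]

genre_facet = facet_counts(albums, "genre")       # counts shown next to each genre
jazz_only = filter_by(albums, "genre", "Jazz")    # user clicks the "Jazz" facet
format_facet = facet_counts(jazz_only, "format")  # counts recomputed on the subset
```

Recomputing the counts after each filter is what lets users keep narrowing a result set one facet at a time.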
  • 25. Hue: Search Dashboards Hue has drag-and-drop support for building dashboards based on Search [Figure: Hue dashboard for an employee search, with facets for Department (Operations (590), Sales (540), Facilities Management (272), Customer Support (227), IT (222), Engineering (218)) and Year Hired (2014 (914), 2013 (892), 2012 (703), 2011 (489), 2010 (401), Before 2010 (376)), plus panels for Location, Education Level, and Salary]
  • 26. Overview of Cloudera Search • What is Cloudera Search? • Helpful Features • Case Studies • Essential Points
  • 27. Use Case #1: Online Document Archive Information silos impede cross-team collaboration and knowledge sharing HDFS can act as a central repository for archiving all types of data • Search allows employees to find this information quickly and easily [Figure: search interface for the query “fuel pump” AND fail, showing 249 matches with facets for File Type (PDF (132), Microsoft Word (68), Microsoft Excel (27), E-Mail Message (19), Audio File (3)) and Department (Legal Compliance (117), Engineering (86), Manufacturing (46)). Sample results: “Recall Notice: CX1-2112 Fuel Pump May Cause Fire” by Arnold Anderson, Chief Engineer (April 29, 2014): The CX1-2112 fuel pump uses a neoprene gasket that has been shown to fail during normal use, causing dangerous… and “Pending Class Action Regarding Faulty Fuel Pumps” from Winston Prescott, Esquire (November 11, 2014): My firm represents 318 victims, injured during fires caused by the failure of the CX1-2112 fuel pump manufactured by…]
  • 28. Use Case #2: Threat Detection in Near-Real-Time Looking at yesterday’s log files allows us to react to history • Yet emerging threats require us to react to what’s happening right now Search can help you identify important patterns in incoming data [Figure: dashboard searching the Firewall Logs data set for 172.16.36.* in the IP Address field over the last hour; 4,323,951 records matched (time range: 11:37:21 – 12:37:21). Facets: Packet Rejected: Yes (4,292,172), No (61,779); Service Port: HTTP (594,370), HTTPS (605,352), SSH (475,634), SMTP (2,645,595). Chart: Top Five Origins by Source IP Address (New York, Ukraine, Texas, Illinois, California)]
  • 29. Use Case #3: Market Segmentation/Identification Survey and feedback information is valuable • But extracting insight can be a slow and expensive process Search makes it easy to interactively explore new opportunities [Figure: dashboard searching the 2014 Survey data set for phone OR tablet in the Next Purchase field; 17,138 matches with filters (Annual Income: >$500,000, Region: Southwest, Education: College Graduate). Facets: Age Range: Under 35 (1,798), 35-50 (6,389), Over 50 (8,991); Gender: Female (10,085), Male (7,093); Marital Status: Married (12,347). Panels: Primary Residence (1. Beverly Hills, CA 2. Malibu, CA 3. Los Altos Hills, CA 4. Scottsdale, AZ 5. Park City, UT); Recent Leisure Activities (Yachting, Shopping, Polo, Opera, Croquet); Monthly Expenses, by Category]
  • 30. Overview of Cloudera Search • What is Cloudera Search? • Helpful Features • Case Studies • Essential Points
  • 31. Documents, Fields, Queries, and Terms It is helpful to understand the meaning of some commonly-used words in Solr A query typically specifies terms of interest, such as “equity” or “David” It may match one or more documents • Each document contains one or more fields, such as “title” or “name” The notion of “document” is flexible • Think of a document as being similar to a record in a database table • A single file may contain multiple documents [Figure: two examples. A report as one document with fields Title: Equity Market Analysis; Date: March 14, 2015; Author: J.P. Moneybags; Summary: This report explains how to…; Body: Given the recent increase in… A table with columns name, address, city, where each row is a document: Alice, 12 Ames St., Austin; Bruce, 27 Bend Rd., Baltimore; Carol, 35 Clay Ct., Cleveland; David, 41 Deer Dr., Dallas; Ellie, 59 Elan Ln., El Paso]
  • 32. Indexing Data Is a Prerequisite to Searching It You must index data prior to querying that data with Cloudera Search Creating and populating an index requires specialized skills • Somewhat similar to designing database tables • Frequently involves data extraction and transformation Running basic queries on that data requires relatively little skill • “Power users” who master the syntax can create very powerful queries [Figure: overview of Cloudera Search as a sequence of stages: Acquire Data, Transform Data, Index Data, Query Data, Display Results]
  • 33. What Is an Index? Indexes are data structures optimized for quick lookups • Much like a book’s index helps you quickly locate information The indexing process uses a schema to define the documents’ fields • This includes each field’s name and data type Cloudera Search includes the Morphlines library • Can extract, transform, and load data into Solr [Figure: example data, schema, and index. Data: a,Alice,Manager | b,Bruce,Engineer,$5000 | c,Carol,Manager,$7500 | d,David,Analyst,$5000. Schema: id: string, name: string, title: string, bonus: int. Index on name: Alice: (a), Bruce: (b), Carol: (c), David: (d); on title: Analyst: (d), Engineer: (b), Manager: (a, c); on bonus: 5000: (b, d), 7500: (c)]
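The example index on this slide maps each field value to the IDs of the records containing it. A minimal sketch of building such a mapping from the slide's four records follows; the `build_index` helper is hypothetical and greatly simplified compared with what Solr does (no text analysis, no on-disk structures).

```python
from collections import defaultdict

# The four example records from the slide; record "a" has no bonus
rows = [
    {"id": "a", "name": "Alice", "title": "Manager"},
    {"id": "b", "name": "Bruce", "title": "Engineer", "bonus": 5000},
    {"id": "c", "name": "Carol", "title": "Manager", "bonus": 7500},
    {"id": "d", "name": "David", "title": "Analyst", "bonus": 5000},
]

def build_index(docs, fields):
    """Map each field value to the list of document IDs that contain it."""
    index = {field: defaultdict(list) for field in fields}
    for doc in docs:
        for field in fields:
            if field in doc:  # skip fields a document does not have
                index[field][doc[field]].append(doc["id"])
    return index

index = build_index(rows, ["name", "title", "bonus"])
managers = index["title"]["Manager"]  # lookup without scanning every record
```

A lookup such as `index["title"]["Manager"]` goes straight to the matching IDs, which is the "quick lookup" property the slide attributes to indexes.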
  • 34. Three Indexing Methods in Cloudera Search Near-Real-Time indexing with Flume • Data is indexed immediately as it enters the cluster Batch mode indexing with MapReduce • Used to index static data that already resides in HDFS HBase indexing with Lily • Allows you to index records stored in HBase tables
  • 35. Batch Indexing of Data in HDFS with MapReduce Use batch indexing to index static data already stored in HDFS Cloudera Search provides a reusable job (MapReduceIndexerTool) • Reads input data previously stored in HDFS • Processes this data using Morphlines • Creates the index and stores it in HDFS [Figure: input data is added to HDFS; the MapReduce indexing job reads it, processes it with Morphlines, then creates the index and stores it in HDFS]
  • 36. Near-Real-Time Indexing with Flume Use near-real-time indexing for streaming or continuously-generated data • Flume reads incoming data from a specific source • This data is processed using Morphlines • The index is created in HDFS and updated as new records arrive The processed data can optionally be written as files in HDFS [Figure: Flume reads events from a source; the Morphline Solr Sink processes them with Morphlines and creates or updates the index in HDFS; data files can optionally be created in HDFS]
  • 37. Indexing Data in HBase with Lily Use Lily to index data stored in HBase tables • HBase is a non-relational (NoSQL) distributed database built on HDFS • HBase can scale to handle billions of records with millions of columns Both batch and near-real-time modes of operation are supported [Figure: two paths from HBase to the index. The Lily NRT Indexer Tool is triggered by updates to HBase cells, reads input from the cells, processes it with Morphlines, and updates the index. The HBase Batch Indexer Tool is invoked on demand or through a scheduler, reads input from the cells, processes it with Morphlines, and creates the index]
  • 38. Morphlines Overview Morphlines is a framework for processing streams of data • It is part of the Kite Software Development Kit (SDK) • Offers many helpful features for indexing data with Search • It is a plain Java library that can be used even outside of Hadoop Especially useful for Extract, Transform, and Load (ETL) processing • Processing commands are defined in a configuration file • These commands are executed in sequence, much like a UNIX pipeline • Morphlines ships with dozens of reusable commands [Figure: Morphlines processing pipeline. An incoming record passes through Read CSV, Generate UUID, and Convert Timestamp to produce an outgoing record]
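The three-stage pipeline pictured on this slide can be imitated in plain Python to show the "commands executed in sequence" idea. This is only a conceptual sketch: real Morphlines commands are declared in a configuration file and run by the Java library. The function names, the column list, the input timestamp format, and the assumption that timestamps are in UTC are all hypothetical.

```python
import csv
import io
import uuid
from datetime import datetime

def read_csv(line, columns):
    """Parse one CSV line into a record (dict) keyed by the given column names."""
    values = next(csv.reader(io.StringIO(line)))
    return dict(zip(columns, values))

def generate_uuid(record):
    """Attach a random unique identifier to the record."""
    record["id"] = str(uuid.uuid4())
    return record

def convert_timestamp(record, field="timestamp", fmt="%m/%d/%Y %H:%M"):
    """Rewrite the timestamp in ISO 8601 form, assuming the input is already UTC."""
    parsed = datetime.strptime(record[field], fmt)
    record[field] = parsed.strftime("%Y-%m-%dT%H:%M:%SZ")
    return record

def run_pipeline(line):
    """Pass a record through each command in order, like a UNIX pipeline."""
    record = read_csv(line, ["name", "timestamp"])
    for command in (generate_uuid, convert_timestamp):
        record = command(record)
    return record

outgoing = run_pipeline("Alice,03/14/2015 09:30")
```

Each command takes a record and returns a record, so stages can be added, removed, or reordered without changing the others.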
  • 39. Essential Points Cloudera Search provides full-text interactive search for data in Hadoop • Apache Solr is a mature, high-performance search platform • CDH components provide reliability and scalability Search offers an additional option for accessing data • Ideal for free-form or semi-structured data in many formats • Does not require users to have experience with Java or SQL Data must be indexed before it can be searched • Cloudera Search offers several methods for indexing data at scale • You can extract, transform, and load data using Morphlines
  • 40. Thank you! twheeler@cloudera.com
  • 41. Thank You for Attending! • Submit questions in the Q&A panel • Follow Cloudera University on Twitter @ClouderaU • Learn more about Cloudera Search Training: http://university.cloudera.com/search-training • Follow the Developer Learning Path: http://university.cloudera.com/developers • Get Developer Certification: http://university.cloudera.com/certification • Join the Cloudera Community: http://community.cloudera.com