Basics of Solr and Solr Integration with AEM6

  1. 1. Introduction to Solr and Solr Integration with AEM (Deepak Khetawat)
  2. 2. About me ❖ AEM6 Certified Expert Lead Consultant ❖ LinkedIn: https://www.linkedin.com/pub/deepak-khetawat/96/a44/99a ❖ Twitter: @dk452 ❖ For any query, mail deepakkhe.90@gmail.com
  3. 3. Agenda ❖ Search ❖ What is Solr ? ❖ Solr Architecture ❖ Working of Solr ❖ Solr Core ❖ Solr Queries ❖ Solr with AEM ❖ External Solr Integration with AEM ❖ Exercises
  4. 4. ❖Search Stats : ➢ Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide. ➢ 61% of global Internet users research products online. ➢ 44% of online shoppers begin by using a search engine. ➢ A study by Outbrain shows that search is the #1 driver of traffic to content sites, beating social media.
  5. 5. ❖Why Searching ? ➢ Want to look up similar terms for autocomplete and get facet/category results in a single request? ➢ Rather than having an application (e.g. an e-commerce portal) implement search over all the structured data in its own database, it is recommended to find the right product that is already built and can be adapted. ➢ This is where search engines/servers come into the picture.
  6. 6. ❖Why Searching ?
  7. 7. ❖What is Solr ? ➢ Price, high to low: Endeca, FredHopper, Mercado, Google Mini, Microsoft Search Server, Autonomy, Microsoft Search Server Express, Lucene (Solr/Elasticsearch) ➢ Speed, fast to slow: Google Mini/Endeca, FredHopper, Autonomy, Lucene/MSS/MSSE ➢ Features, high to low: Endeca, FredHopper, Mercado, Solr, Autonomy, Lucene, MSS/MSSE, Google Mini ➢ Extensibility, high to low: Lucene, Endeca, FredHopper, Mercado, Autonomy, MSS/MSSE, Google Mini
  8. 8. ❖What is Solr ? ➢ Apache Solr is a popular open source enterprise search server / web application. ➢ Solr uses the Lucene search library and extends it. Solr exposes Lucene's Java APIs as RESTful services. ➢ Apache Solr is easy to use from virtually any programming language. It can speed up site search because queries run against an index of the web content. ➢ We put documents into it (called indexing) in XML, JSON, CSV or binary format, query it with a GET request, and receive search results in XML, JSON, CSV, Python, Ruby, PHP, binary and other formats.
  9. 9. ❖Key features of Solr : ➢ Advanced Full-Text Search Capabilities. ➢ Optimized for High Volume Web Traffic. ➢ Standards Based Open Interfaces - XML, JSON and HTTP. ➢ Comprehensive HTML Administration Interfaces. ➢ Server Statistics exposed over JMX for Monitoring. ➢ Near Real Time Indexing. ➢ Extensible Plugin Architecture.
  10. 10. ❖Other Key features of Solr : ➢ Faceting ➢ Geospatial ➢ Scaling ➢ Query auto-complete ➢ Rich Document (e.g. PDF) Parsing
  11. 11. ❖Solr Powered Sites (see the PublicServers page on the Solr Wiki) : ➢ Netflix ➢ AT&T Interactive (http://YP.com) ➢ Comcast ➢ Instagram ➢ AOL ➢ CitySearch ➢ LA Times
  12. 12. ❖Solr Architecture
  13. 13. ❖Solr Search System
  14. 14. ❖Indexing : ➢ Indexing is the processing of original data into a highly efficient cross-reference lookup in order to facilitate rapid searching. ➢ This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages). ➢ Without an index, the search engine would have to scan every document, which requires considerable time and computing power. ➢ For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours.
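The page->words to word->pages inversion described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene stores its index; the function names `build_inverted_index` and `search` and the sample pages are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Turn {page_id: text} into {word: set of page_ids}."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        # Naive analysis: lowercase and split on whitespace.
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, word):
    """Look up one word; no document is scanned at query time."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    pages = {
        "p1": "Solr indexes documents",
        "p2": "AEM stores documents",
    }
    index = build_inverted_index(pages)
    print(search(index, "documents"))
    print(search(index, "solr"))
```

A lookup touches one dictionary entry regardless of how many pages were indexed, which is why a query against an index is so much cheaper than a sequential scan.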
  15. 15. ❖Analysis: ➢ Analyze: search engines do not index text directly. The text is broken into a series of individual atomic elements called tokens. ➢ An Analyzer builds TokenStreams, which analyze text, and represents a policy for extracting index terms from text. ➢ Analyzers are used both when a document is indexed and at query time. ➢ Tokenizers break field data into lexical units, or tokens. ➢ Filters examine a stream of tokens and keep them, transform or discard them, or create new ones. See: https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters https://cwiki.apache.org/confluence/display/solr/Tokenizers https://cwiki.apache.org/confluence/display/solr/About+Filters
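As a rough illustration of the tokenizer-plus-filters chain, here is a small Python sketch. It mimics the shape of an analyzer (one tokenizer followed by filters applied in order) but does not use any Solr or Lucene class; the function names and the stop-word list are assumptions chosen for the example.

```python
import re

# Hypothetical stop-word list, just for this example.
STOP_WORDS = {"a", "an", "the", "of", "and"}


def tokenize(text):
    """Tokenizer: break field data into word tokens."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Filter that transforms every token."""
    return [token.lower() for token in tokens]


def stop_filter(tokens):
    """Filter that discards tokens found in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text):
    """Analyzer: one tokenizer, then each filter in order."""
    tokens = tokenize(text)
    for token_filter in (lowercase_filter, stop_filter):
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    print(analyze("The Basics of Solr and AEM"))
```

Running the same `analyze` function on both the indexed text and the query text is what lets a query for "solr" match a document that contains "Solr".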
  16. 16. ❖Analysis :
  17. 17. ❖Searching : ➢ Searching is the process of consulting the search index and retrieving the documents matching the query, sorted in the requested sort order. ➢ The search process works as follows:
  18. 18. ❖What is Core ? ➢ A Solr server can handle multiple cores. A core is an index managed by the Solr server. Running a core requires a few configuration files, discussed later. ➢ Cores mostly run as isolated databases and are not meant to interact with each other. ➢ A Solr index can accept data from many different sources, including XML, CSV, data extracted from database tables, and common document formats such as Word or PDF.
  19. 19. ❖Ways of loading data into Solr Index ➢ Upload files such as XML or JSON by sending an HTTP request to Solr. ➢ Use index handlers to import from a database. ➢ Write a custom Java application that ingests data through Solr’s Java client.
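The first option, sending documents over HTTP, might look like the following Python sketch. It assumes a core named `test` on `localhost:8983` (the names used elsewhere in these slides) and posts a JSON array of documents to the core's `/update` handler with `commit=true`. The helper name `build_update_request` and the sample document are made up; the request is only built here and is sent only if the script is run directly against a live Solr.

```python
import json
from urllib.request import Request, urlopen


def build_update_request(docs, core="test", base_url="http://localhost:8983/solr"):
    """Build (but do not send) a POST that adds documents to a core."""
    url = f"{base_url}/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical document; field names must exist in your schema.
    request = build_update_request([{"id": "1", "title": "Design Patterns"}])
    with urlopen(request) as response:  # requires a running Solr
        print(response.status)
```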
  21. 21. ❖Configuring Solr (Linux)
      Installing and starting Solr:
      # apt-get install openjdk-7-jre-headless -y
      # java -version
      # cd /opt/ && wget http://mirror.nexcess.net/apache/lucene/solr/5.4.1/solr-5.4.1.tgz
      # tar -xzf solr-5.4.1.tgz
      # cd solr-5.4.1
      # bin/solr start
      Access it at http://<ip>:8983/solr/#/
      Solr stores the index in a directory called index inside the core's data directory (/opt/solr-5.4.1/server/solr/corename/data/index/).
      Create a core (you need one to be able to index and search):
      # bin/solr create -c corename -d basic_configs
  22. 22. ❖Configuring Solr (Windows)
      Download Solr from http://mirror.nexcess.net/apache/lucene/solr/5.4.1/
      In a command prompt, go to the solr-5.4.1\bin folder and start Solr:
      solr start
      Access it at http://<ip>:8983/solr/#/
      Solr stores the index in a directory called index inside the core's data directory (/solr-5.4.1/server/solr/corename/data/index/).
      Create a core (you need one to be able to index and search). From the solr-5.4.1\bin folder in the command window:
      solr create -c corename -d basic_configs
  23. 23. ❖Solr Configuration Files ● solrconfig.xml : ➢ Manages request handling and data manipulation for users. ➢ Here we configure update handlers, update event listeners and the different search components. ➢ Can enable Solr's schemaless feature by adding <schemaFactory class="ManagedIndexSchemaFactory"> in solrconfig.xml. ● core.properties : Defines properties at the core level. ● solr.xml : Applies to the entire Solr instance and covers the cores, e.g. how many cores we have and their details. ● schema.xml : ➢ Works at the core's data-structure level, holding information such as the field types and fields. ➢ Analyzers are specified as a child of the <fieldType> element in the schema.xml configuration file.
  24. 24. ❖ Field Types ➢ In Solr, every field has a type. Examples of basic field types available in Solr include: float, long, double, date, text ➢ New field types can be created/defined by combining filters and tokenizers. ➢ Field Definition: <field name="id" type="text" indexed="true" stored="true" multiValued="true"/> name: Name of the field. type: Field type. indexed: Should this field be added to the inverted index? stored: Should the original value of this field be stored? multiValued: Can this field have multiple values?
  25. 25. ❖ Exercises ➢ Create 2 cores, test and solrpoc. ➢ Configure your schema.xml. ➢ https://lucidworks.com/blog/2014/03/31/introducing-solrs-restmanager-and-managed-stop-words-and-synonyms/ (For POC)
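To make the field attributes concrete, the sketch below reads a few `<field>` definitions and reports what each attribute says. The schema fragment and its field names (`id`, `title`, `tags`) are hypothetical, and the `read_fields` helper is not part of Solr; it simply parses the XML with Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of schema.xml.
SCHEMA_FRAGMENT = """
<fields>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="tags" type="string" indexed="true" stored="false" multiValued="true"/>
</fields>
"""


def read_fields(xml_text):
    """Return {field name: attribute summary} for every <field> element."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root.iter("field"):
        fields[element.get("name")] = {
            "type": element.get("type"),
            "indexed": element.get("indexed") == "true",
            "stored": element.get("stored") == "true",
            # Treated as false here when the attribute is absent.
            "multiValued": element.get("multiValued", "false") == "true",
        }
    return fields


if __name__ == "__main__":
    for name, attributes in read_fields(SCHEMA_FRAGMENT).items():
        print(name, attributes)
```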
  26. 26. ❖Solr Queries ● Keyword Searching : ➢ Word Searching : title:aem doc ➢ Phrase Searching : title:"aem document" ➢ AND, OR, - operators : ((title:"foo bar" AND body:"quick fox") OR title:fox) ● Wildcard Searching : ➢ title:ae* ➢ title:a*m
  27. 27. ❖Solr Queries (note: the query examples use the test core) ● Range Searching : ➢ modified_date:[20020101 TO 20030101] Query Examples : ➢ http://localhost:8983/solr/test/select?q=title:Design&wt=json&indent=true ➢ http://localhost:8983/solr/test/select?q=*:*&fl=author,title&facet=true&facet.field=author&wt=json&indent=true
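Query URLs like the ones above can also be assembled in code, which takes care of URL encoding for characters such as `:`, `[` and spaces. The following Python sketch assumes the `test` core on `localhost:8983`; `select_url` is a made-up helper name, not a Solr client API.

```python
from urllib.parse import urlencode


def select_url(params, core="test", base_url="http://localhost:8983/solr"):
    """Build a /select URL with properly encoded query parameters."""
    return f"{base_url}/{core}/select?{urlencode(params, doseq=True)}"


if __name__ == "__main__":
    # Field query on title.
    print(select_url({"q": "title:Design", "wt": "json", "indent": "true"}))
    # Range query; the brackets and spaces get encoded.
    print(select_url({"q": "modified_date:[20020101 TO 20030101]", "wt": "json"}))
    # Match everything, return two fields, facet on author.
    print(select_url({
        "q": "*:*",
        "fl": "author,title",
        "facet": "true",
        "facet.field": "author",
        "wt": "json",
    }))
```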
  28. 28. ❖Solr Queries ● Fuzzy Search: ➢ Lucene supports fuzzy searches based on the Levenshtein distance (edit distance) algorithm. To do a fuzzy search, put the tilde symbol, "~", at the end of a single-word term. ➢ For example, to search for a term similar in spelling to "roam", use the fuzzy search roam~ which will find terms like foam and roams. Starting with Lucene 1.9, an additional (optional) parameter can specify the required similarity. The value is between 0 and 1; the closer it is to 1, the higher the similarity a term needs in order to match. For example: roam~0.8. The default used when the parameter is not given is 0.5. Note that from Lucene 4.0 onward (which includes Solr 5.x) the parameter is interpreted as a maximum edit distance of 0, 1 or 2, with a default of 2, and fractional values are deprecated. Query Example : ➢ http://localhost:8983/solr/test/select?q=DesignPatterns~.5&wt=json&indent=true&defType=edismax&qf=title
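To see why roam~ matches foam and roams, it helps to compute the edit distance by hand. The sketch below is a plain textbook Levenshtein implementation in Python, written for illustration; it is not Lucene's fuzzy-matching code, which works differently internally.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for candidate in ("foam", "roams", "room", "design"):
        print(candidate, levenshtein("roam", candidate))
```

Both foam (one substitution) and roams (one insertion) are a single edit away from roam, so they fall within even a strict fuzzy match.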
  29. 29. ❖Common Query Parameters ● sort : title desc (http://localhost:8983/solr/test/select?q=title:Desi*&sort=title+desc&wt=json&indent=true) ● fq : This parameter specifies a query that restricts the superset of documents that can be returned, without influencing the score. It can be very useful for speeding up complex queries, since the queries specified with fq are cached independently from the main query. Caching means that when a later query uses the same filter, the cached result is reused (i.e. there is a cache hit). Example: fq=popularity:[10 TO *]&fq=section:0 ● http://www.exadium.com/tools/online-url-encode-and-url-decode-tool/ (to encode and decode your Solr queries)
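A request may carry several fq parameters alongside q and sort. The Python sketch below builds such a URL and decodes it again, which covers the same ground as the online encode/decode tool. It assumes the `test` core on `localhost:8983`; `filtered_query_url` and `decode_query` are hypothetical helper names.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def filtered_query_url(q, filters, sort=None, core="test",
                       base_url="http://localhost:8983/solr"):
    """Build a /select URL with one fq parameter per filter."""
    params = [("q", q)]
    params += [("fq", filter_query) for filter_query in filters]
    if sort:
        params.append(("sort", sort))
    params.append(("wt", "json"))
    return f"{base_url}/{core}/select?{urlencode(params)}"


def decode_query(url):
    """Decode the query string back into {name: [values]}."""
    return parse_qs(urlsplit(url).query)


if __name__ == "__main__":
    url = filtered_query_url(
        "title:Desi*",
        filters=["popularity:[10 TO *]", "section:0"],
        sort="title desc",
    )
    print(url)
    print(decode_query(url))
```

Keeping each restriction in its own fq lets Solr cache the filters separately, so a later query that reuses only one of them can still get a cache hit.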
  30. 30. ❖More Search Queries ● Highlighting : http://localhost:8983/solr/test/select?q=title:Desi*&hl=true&hl.snippets=1&hl.fl=*&hl.fragsize=0&wt=json&indent=true ● hl – Should be set to true, as it enables highlighted snippets to be generated in the query response. ● hl.fl – Specifies a list of fields to highlight. The wildcard * highlights all fields. ● hl.fragsize – The size, in characters, of the snippets (aka fragments) created by the highlighter. In the original Highlighter, “0” indicates that the whole field value should be used with no fragmenting. The default fragment size is 100 characters. ● hl.snippets – The maximum number of highlighted snippets to generate per field. Note: any number of snippets from zero to this value may be generated. The default value is "1".
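The highlighting parameters and the shape of the result can be sketched as follows. In a JSON response, snippets are returned in a top-level `highlighting` section keyed by document id and then by field name. The helper names and the sample response below are made up for illustration (the sample is not real output from any index).

```python
def highlight_params(query, fields="*", snippets=1, fragsize=0):
    """Query parameters that switch highlighting on."""
    return {
        "q": query,
        "hl": "true",
        "hl.fl": fields,
        "hl.snippets": snippets,
        "hl.fragsize": fragsize,  # 0 = whole field value, no fragmenting
        "wt": "json",
    }


def snippets_by_doc(response, field):
    """Map each document id to its highlighted snippets for one field."""
    highlighting = response.get("highlighting", {})
    return {doc_id: fields.get(field, [])
            for doc_id, fields in highlighting.items()}


# Hypothetical response body, shaped like Solr's JSON output.
SAMPLE_RESPONSE = {
    "response": {"numFound": 1, "docs": [{"id": "1", "title": "Design Patterns"}]},
    "highlighting": {"1": {"title": ["<em>Design</em> Patterns"]}},
}

if __name__ == "__main__":
    print(highlight_params("title:Design"))
    print(snippets_by_doc(SAMPLE_RESPONSE, "title"))
```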
  31. 31. ❖Exercises 1. Under the test core, achieve all of the following with one query: a) Return only the author and title fields in the query results. b) Fuzzy match with a similarity of 0.5 on the title field. c) The title should start with design. d) Change the title field to type text_general instead of string. e) Facet on the title field. f) Highlight the title field. g) The author name should fall in the range Arvind to Jasvinder.
  32. 32. ❖Exercises 1. Under the test core, with the schema.xml you created, achieve all of the following: full-text search, fuzzy search, faceting, spatial search, highlighting, wildcard search, range search.
  33. 33. ❖Solr with AEM 6 ● In AEM 6.0 the integration happens at the repository level: Solr is one of the possible indexes that can be used in Oak, the new repository implementation shipped with AEM 6.0. It can be configured to work as an embedded server within the AEM instance, or as a remote server. ● Configuring AEM with an embedded Solr server: the embedded Solr server is recommended for developing and testing the Solr index for an Oak repository. Make sure you use a remote Solr server for production instances.
  34. 34. ❖Configure the embedded Solr server by: ● Search for "Oak Solr server provider". ● Press the edit button and, in the following window, set the server type to Embedded Solr in the drop-down list. ● Next, edit "Oak Solr embedded server configuration" and create a configuration. ● Add a node called solrIndex of type oak:QueryIndexDefinition under oak:index with the following properties: type: solr (of type String), async: async (of type String), reindex: true (of type Boolean). ● Testing the Solr index (recommended configuration): https://docs.adobe.com/docs/en/aem/6-0/deploy/upgrade/queries-and-indexing.html
  35. 35. ❖External Solr with AEM 1. Creation of a custom replication agent: ➢ Open http://localhost:7502/miscadmin ➢ Select “Agents on Author” ➢ New > Page ➢ Select Replication Agent, then add a Title and Name (e.g. Solr-Replication-Agent) ➢ Open the agent (double click), go to its properties and choose Edit ➢ Settings tab: Serialization Type = POC Solr XML Content Serializer, Retry Delay = 60000, Log Level = Debug ➢ Transport tab: set the URL to http://localhost:8983/solr/solrpoc?update=true
  36. 36. ❖2. Get the Page to be indexed(Push to Solr) Suppose you want to index: http://localhost:7502/editor.html/content/geometrixx-outdoors/en/activities.html ➢ Open that page in CRXDE (http://localhost:7502/crx/de/index.jsp). ➢ Check the jcr:content node of the page and note the value of “sling:resourceType” (value = geometrixx-outdoors/components/page_sidebar).
  37. 37. ❖2. Get the Page to be indexed(Push to Solr) ➢ In the Felix console, open the “Content extractor service for Solr indexing” configuration (codebase at https://github.com/deepakkhetawat/SolrPOCWithAEM). ➢ Edit the “Component Indexed values” for the page you want to index in Apache Solr. ➢ Add the resource type found above to the “Component Indexed values”: geometrixx-outdoors/components/page_sidebar@articletext=jcr:title ➢ Note: @articletext -> the Solr field that was added to schema.xml earlier. jcr:title -> the property name under the jcr:content node of the page above.
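The “Component Indexed values” entry packs three things into one string: the component's resource type, the Solr field, and the JCR property to read. The Python sketch below only illustrates how that `resourceType@solrField=jcrProperty` format splits apart; it is not the parsing code from the linked repository, and `IndexMapping` and `parse_mapping` are hypothetical names.

```python
from typing import NamedTuple


class IndexMapping(NamedTuple):
    resource_type: str  # sling:resourceType of the component
    solr_field: str     # field declared in schema.xml
    jcr_property: str   # property read from jcr:content


def parse_mapping(entry):
    """Split 'resourceType@solrField=jcrProperty' into its three parts."""
    resource_type, _, rest = entry.partition("@")
    solr_field, _, jcr_property = rest.partition("=")
    if not (resource_type and solr_field and jcr_property):
        raise ValueError(f"expected resourceType@solrField=jcrProperty, got {entry!r}")
    return IndexMapping(resource_type, solr_field, jcr_property)


if __name__ == "__main__":
    mapping = parse_mapping(
        "geometrixx-outdoors/components/page_sidebar@articletext=jcr:title"
    )
    print(mapping)
```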
  38. 38. ❖2. Get the Page to be indexed(Push to Solr) ➢ You need to decide which page property maps to which Solr field, and that field must be present in schema.xml. ➢ Once you publish the content, Apache Solr indexes the data, because we configured the separate Solr replication agent under Agents on Author.
  39. 39. ❖Exploratory Exercises ➢ Create your own field type in Solr, similar to text_general. ➢ Explore the Solr indexing that AEM 6.1 provides by default. ➢ Implement highlighting for the title field. ➢ Try geospatial search with AEM.
  40. 40. ❖References ➢ Code of “Tag Based Faceting using External Apache Solr with AEM” available at https://github.com/deepakkhetawat/SolrPOCWithAEM.git ➢ Other references are mentioned on the relevant slides. ➢ Google