Real-Time Entity Resolution with Elasticsearch - Haystack 2018

David Moore of Elastic on real-time entity resolution, presented at Haystack 2018.

  1. Real-Time Entity Resolution With Elasticsearch. Dave Moore, david.moore@elastic.co
  2. Disambiguation: "Named Entity Recognition" (an entity as single attributes in unstructured text) vs. "Entity Resolution" (an entity as multiple attributes in structured data). Example Person record, as field: value pairs:
     • Name: Alice Jones
     • DOB: 1984-01-01
     • Street: 123 Main St
     • Credit Card: 4040 0000 2020 8080
     • Phone: 202-555-1234
  3. What is entity resolution?
  4. Examples
     • Health Care (Patient ID): We need to identify [...] and their medical [...] many hand-written [...] Mixing up records puts [...] at risk of injury or [...]
     • Sales & Marketing (Customer Intel): We have reps managing many sources of info on leads and customers. Our view of the buyer is fragmented and that makes us less effective. We're losing pipeline.
     • Security & Compliance (Fraud): We need to track a person or device that is hiding its tracks. Connecting the dots is a laborious process and we can't keep up with our incident backlog.
     • Military, IC, Law (Surveillance): We need to track a person or device that is hiding its identity. Our timely success is critical to public safety and national security.
     • Privacy Compliance (GDPR): We must find and manage all PII to respond to inquiries. Failure to comply risks fines of €20 million or 4% annual turnover.
     • IT (MDM): MDM is a slow and bureaucratic process. We can solve our own data quality problems faster and better. And we still need query time entity resolution.
  5. Why is identity hard to track?
  6. Challenge 1. Identity is Vague. Two records for the same person (name | street | company | card | phone):
     • Allie Jones | 123 Main St | ABC Widgets, Inc. | 4040 0000 2020 8080 | 202-555-1234
     • Ali Jones | 123 W Main Street | ABC Wigdets | 4040 0000 2020 8008 | +1 (202) 555 1234
     Icons by icons8
  7. Challenge 2. Identity Changes. The two records from the previous slide, plus:
     • Allison Smith | 555 Broad St | XYZ Technology Corp. | 3030 5050 9999 0000 | 202-555-9876
     • Alison Jones-Smith | 555 Brooad Street | XYZ Tech | 3030 5500 9999 0000 | 2025559867
  8. Challenge 3. Identity is Messy. The same four records as the previous slide.
  9. Challenge 4. Identity is Diverse. The same four records, plus further records shown only as "???".
  10. Entity Resolution connects the dots despite these challenges
  11. Comparison to Search. Both columns start from the same query: name:"Allie Jones" AND street:"123 Main St". Records are shown as name | street | company | card | phone.
     Search: Search engine ranks results once. True hits mixed with noise.
     • Allie Jones | 123 Main St | ABC Widgets, Inc. | 4040 0000 2020 8080 | 202-555-1234
     • Allie Jones | 123 Main Street | ABC Widgets | 4040 0000 2020 8080 | 202.555.1234
     • Ali Jones | 123 W Main Street | ABC Wigdets | 4040 0000 2020 8008 | +1 (202) 555 1234
     • Ali Jones | 132 Mane Street | ABC Widgets | 4024 0071 4970 1227 | 888-555-5555
     • Aly Jonas | 113 Main Street | Acme Corp. | 4716 1035 4536 4671 | 610-555-5555
     • Allie Jones | 132 W Main Street | ABC Widgets | 4040 0000 2020 8080 | 202-555-9876
     • Al Jones | 132 E Main St | Mom & Pop, LLC | 3772 733741 52501 | 1-610-555-0000
     • Aly Jones | 113 Main St, #102 | Acme Corp. | 4716 1035 4536 4671 | 610-555-5555
     • Ali Jones | 132 Mane Street | ABC Widgets | 4024 0071 4970 1227 | 888-555-1234
     • Aly Jonas | 113 Main Street | Acme Corp. | 4781 9105 0533 4481 | 610-555-2345
     • Allie Johns | 132 W Main Street | ABC Widgets | 4088 0110 2044 8180 | 202-555-3456
     • Elle Jeon | 132 E Main St | Mom & Pop, LLC | 3502 730741 52203 | 1-610-555-4567
     • Elle Jones | 113 Main St, #102 | Acme Corp. | 4716 1035 4536 4671 | 610-555-5678
     • Eli Jones | 132 Mane Street | ABC Widgets | 4224 0065 4800 1337 | 888-555-6789
     • Eli Joans | 113 Main Street | Acme Corp. | 4206 1035 4536 4081 | 610-555-7890
     • Allie Jeans | 132 N Mean Street | ABC Widgets | 4240 0101 02020 8888 | 202-555-8901
     Resolution: Search engine filters results recursively. True hits isolated and transitively linked.
     • Allie Jones | 123 Main St | ABC Widgets, Inc. | 4040 0000 2020 8080 | 202-555-1234
     • Allie Jones | 123 Main Street | ABC Widgets | 4040 0000 2020 8080 | 202.555.1234
     • Ali Jones | 123 W Main Street | ABC Wigdets | 4040 0000 2020 8008 | +1 (202) 555 1234
     • Allie Jones | 132 W Main Street | ABC Widgets | 4040 0000 2020 8080 | 202 555 1234
     • Allie Smith | 123 Main St | ABC Widgets, Inc. | 4040 0000 2020 8080 | 202-555-1234
     • Allie Smith | 123 Main Street | ABC Widgets | 4040 0000 2020 8080 | 202.555.1234
     • Ali Smith | 123 W Main Street | ABC Wigdets | 4040 0000 2020 8008 | +1 (202) 555 1234
     • Allie Smith | 555 Broad St | ABC Widgets, Inc | 4040 0000 2020 8080 | 202-555-1234
     • Allie Smith | 555 Broad Street | XYZ Tech Corp | 3030 5050 9999 0000 | 202.555.1234
     • Allie Smith | 555 Broad Street | XYZ Technology Corp | 3030 5050 9999 0000 | 202-555-9876
  12. Real-Time
  13. Batch vs. Real-Time
     • Batch. How is it used? Resolve all entities in advance (partitioning, pairwise scoring, connected components). How long does it take? Docs + (Docs/Partitions)^2 + Components^2 (hours for billions of documents). When is it necessary? Population or network analysis.
     • Real-Time. How is it used? Resolve one entity on query (recursive Boolean query). How long does it take? Indices * Attributes * Hops (milliseconds for a handful of each). When is it necessary? Individual analysis.
     Most solutions have a real-time phase, sometimes applied after batch resolution.
  14. Real-Time: Why Elasticsearch
     • Robust matching: token normalization, phonetic matching, fuzzy transpositions, Boolean logic filtering, fine-tune search parameters
     • Suited for operations: horizontal scaling, real-time response rates, flexible index mappings
  15. Approach: Goals
     • Fast: Get results in real time, from milliseconds to low seconds.
     • Generic: Resolve any type of entity. People, companies, locations, sessions, etc.
     • Transitive: Resolve over multiple hops of matches. Capture changing identities.
     • Multi-source: Resolve over multiple indices with disparate mappings.
     • Accommodating: Operate on data as it exists. Avoid transforming and reindexing data.
     • Logical: Logic is easier to read, troubleshoot, and optimize than statistics.
     • 100% Elasticsearch: Operate within existing search infrastructure.
  16. Approach: Steps
     1. Entity modeling: What is the entity? What are its attributes?
     2. Analyzers: How are you indexing each attribute?
     3. Matchers: What is the query logic for each attribute?
     4. Resolvers: What combinations of matching attributes imply a resolution?
     5. Metadata maps: Which matchers apply to which indexed fields?
     6. Recursive queries: How to repeat the queries until completion?
  17. zentity (zentity.io): an open source Elasticsearch plugin for real-time entity resolution
  18. zentity: example resolution request. POST _zentity/resolution/person { "attributes": { "name": "Alice Jones", "dob": "1984-01-01", "phone": [ "555-123-4567", "555-987-6543" ] } }
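The request on this slide could be issued from Python roughly as follows. This is a minimal sketch: the endpoint path and body come from the slide, while the host `http://localhost:9200` and the helper name `build_resolution_request` are assumptions for illustration. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_resolution_request(entity_type, attributes, host="http://localhost:9200"):
    """Build (but do not send) a POST request for the resolution endpoint.

    The host is an assumed local Elasticsearch node; the path follows the slide.
    """
    url = f"{host}/_zentity/resolution/{entity_type}"
    body = json.dumps({"attributes": attributes}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_resolution_request(
    "person",
    {
        "name": "Alice Jones",
        "dob": "1984-01-01",
        "phone": ["555-123-4567", "555-987-6543"],
    },
)
# To send it against a running cluster with the plugin installed:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Building the body with `json.dumps` rather than by string concatenation keeps multi-valued attributes such as `phone` correctly encoded.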
  19. Demos
  20. Demos
     • Customer intelligence: Gather everything we know about a customer.
     • Web traffic sessionization: Track a bot that cycles through IP addresses, cookies, and user agent signatures.
     • Fraud detection: Determine if a health care provider was blacklisted under a different name.
  21. Contact. Dave Moore, email: david.moore@elastic.co, zentity: zentity.io
  22. Extra Content. @elastic, www.elastic.co
  23. Approach
  24. Step 1. Entity Modeling. Name the entity type (Person). Define its attributes. Study them in your data sets. Ratings are listed as uniqueness / consistency / presence:
     • Name – First Name: Moderate / Moderate / Extreme
     • Name – Last Name: Moderate / Moderate / High
     • Address – Street: High / Low / Moderate
     • Address – City: Low / Moderate / Moderate
     • Address – Province: Low / High / High
     • Address – Postal Code: Low / High / High
     • Address – Country: Low / High / High
     • Date of Birth: Moderate / High / Moderate
     • Phone Number: Moderate / Moderate / Moderate
     • Email Address: High / Extreme / Moderate
     • IP Address: High / Extreme / Low
     • Credit Card Number: Extreme / Extreme / Low
     • Social Security Number: Extreme / High / None
  25. Step 1. Entity Modeling (continued). This model is independent from your indices. You can reuse and extend this model as you add or amend indices.
  26. Step 2. Analyzers. Take the attributes of the Person entity. Define their analyzers. Put them in your index mappings.
     • Phonetic: "Alice Jones" => ["ALAC","JAN"]
     • Standard: "Alice Jones" => ["ALICE","JONES"]
     • Index settings: { "settings": { "index": { "analysis": { "filter": { "phonetic": { "type": "phonetic", "encoder": "nysiis" } }, "analyzer": { "phonetic": { "filter": [ "icu_normalizer", "icu_folding", "phonetic" ], "tokenizer": "standard" } } } } } }
     • Index mappings: { "mappings": { "_doc": { "properties": { "first_name": { "type": "text", "fields": { "phonetic": { "type": "text", "analyzer": "phonetic" } } } } } } }
  27. Step 2. Analyzers (continued). Analyzers are powerful. But they must be defined prior to indexing. Give careful thought to your analyzers to avoid having to reindex data.
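To get a feel for what the "Standard" line in Step 2 produces, here is a rough Python stand-in for tokenizing with accent and case folding. It is a sketch only: it does not reproduce Elasticsearch's actual analysis chain, the function name `standard_tokens` is hypothetical, and the phonetic (NYSIIS) encoding is left out.

```python
import re
import unicodedata


def standard_tokens(text):
    """Roughly imitate tokenizing plus accent/case folding (not real Lucene analysis)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Split on runs of non-alphanumeric characters and uppercase, as on the slide.
    return [token.upper() for token in re.split(r"[^0-9A-Za-z]+", folded) if token]


print(standard_tokens("Alice Jones"))         # ['ALICE', 'JONES']
print(standard_tokens("Álice  Jonès-Smith"))  # ['ALICE', 'JONES', 'SMITH']
```

Because this normalization happens at index time in Elasticsearch, changing it later means reindexing, which is the point the slide makes.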
  28. Step 3. Matchers. Take the attributes of the Person entity. Define their Boolean query logic. Use templates for variables.
     • Phonetic: { "match": { "{{ field }}": { "query": "{{ value }}", "fuzziness": 0 } } }
     • Standard: { "match": { "{{ field }}": { "query": "{{ value }}", "fuzziness": 2 } } }
     • {{ field }}: The field of an index.
     • {{ value }}: The value of an attribute.
     We will replace these at query time.
  29. Step 3. Matchers (continued). Understand that each matcher will be combined into one large Boolean query.
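The placeholder substitution described in Step 3 can be sketched in a few lines of Python. The two templates are copied from the slide; the helper name `render_matcher` and the example field `first_name` are assumptions, and this is not necessarily how the plugin performs the substitution.

```python
import json

# Matcher templates as shown on the slide; "{{ field }}" and "{{ value }}" are placeholders.
MATCHERS = {
    "phonetic": {"match": {"{{ field }}": {"query": "{{ value }}", "fuzziness": 0}}},
    "standard": {"match": {"{{ field }}": {"query": "{{ value }}", "fuzziness": 2}}},
}


def render_matcher(name, field, value):
    """Return a copy of the matcher clause with its placeholders replaced."""
    def fill(node):
        if isinstance(node, dict):
            return {fill(key): fill(val) for key, val in node.items()}
        if isinstance(node, str):
            return node.replace("{{ field }}", field).replace("{{ value }}", value)
        return node  # numbers such as fuzziness are left as-is

    return fill(MATCHERS[name])


clause = render_matcher("standard", "first_name", "Alice")
print(json.dumps(clause))
# {"match": {"first_name": {"query": "Alice", "fuzziness": 2}}}
```

Walking the parsed structure, rather than substituting into a JSON string, avoids escaping problems when a value contains quotes.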
  30. Step 4. Resolvers. Take the attributes of the Person entity. Determine which combinations of matching attributes imply a resolution.
     • [ Name – First, Name – Last, Address – Street, Address – City, Address – State ]
     • [ Name – First, Name – Last, Address – Street, Address – Postal Code ]
     • [ Name – First, Name – Last, Date of Birth, Address – City, Address – State ]
     • [ Name – First, Name – Last, Date of Birth, Address – Postal Code ]
     • [ Name – First, Name – Last, Phone Number ]
     • [ Name – First, Name – Last, Email Address ]
     • [ Name – First, Name – Last, IP Address ]
     • [ Name – First, Name – Last, Credit Card Number ]
     • [ Name – First, Name – Last, Social Security Number ]
     • [ Email Address, Phone Number ]
     • [ Email Address, IP Address ]
     • [ Email Address, Credit Card Number ]
     • [ IP Address, Credit Card Number ]
  31. Step 4. Resolvers (continued). Avoid resolving on a single attribute such as Social Security Number. Corroboration among multiple attributes helps prevent snowballs.
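The resolver rule from Step 4 amounts to a subset test: a document resolves if its matching attributes cover every attribute of at least one resolver. A minimal sketch, using four of the slide's resolvers and assumed snake_case attribute names:

```python
# A few resolvers from the slide, written as sets of attribute names.
# The snake_case attribute names are assumptions for illustration.
RESOLVERS = [
    {"first_name", "last_name", "phone"},
    {"first_name", "last_name", "email"},
    {"email", "phone"},
    {"email", "ip"},
]


def resolves(matched_attributes):
    """True if the matched attributes cover every attribute of at least one resolver."""
    matched = set(matched_attributes)
    return any(resolver <= matched for resolver in RESOLVERS)


print(resolves({"first_name", "last_name", "phone"}))  # True
print(resolves({"email", "phone", "city"}))            # True
print(resolves({"ssn"}))                               # False: a single attribute is not enough
print(resolves({"first_name", "last_name"}))           # False
```

Since no resolver here has fewer than two attributes, a lone match can never resolve, which is the corroboration the slide recommends.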
  32. Step 5. Metadata Maps. Take the attributes of the Person entity. Map them to the fields of the relevant indices.
     • users index: users.first_name, users.last_name, users.phone, users.email
     • customers index: customers:fname, customers:lname, customers:tel, customers:email, customers:cc, customers:zip
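One way to picture the metadata map from Step 5 is a per-index dictionary from attribute to field, used to translate a resolver into a Boolean query for that index. In this sketch the field names come from the slide, but the pairing of each field with an attribute (for example `tel` with phone), the attribute names, and the helper `build_query` are assumptions.

```python
# Attribute -> field, per index. Field names come from the slide; pairing each
# field with an attribute (e.g. "tel" with phone) is an assumption.
FIELD_MAP = {
    "users": {
        "first_name": "first_name",
        "last_name": "last_name",
        "phone": "phone",
        "email": "email",
    },
    "customers": {
        "first_name": "fname",
        "last_name": "lname",
        "phone": "tel",
        "email": "email",
        "credit_card": "cc",
        "postal_code": "zip",
    },
}


def build_query(index, resolver, values):
    """Build one bool query for an index: every attribute of the resolver must match.

    Returns None when the index lacks a field, or the input lacks a value,
    for any attribute in the resolver.
    """
    fields = FIELD_MAP[index]
    clauses = []
    for attribute in sorted(resolver):
        if attribute not in fields or attribute not in values:
            return None
        clauses.append({"match": {fields[attribute]: values[attribute]}})
    return {"query": {"bool": {"filter": clauses}}}


values = {"email": "alice@example.com", "phone": "202-555-1234"}
print(build_query("customers", {"email", "phone"}, values))
print(build_query("users", {"email", "credit_card"}, values))  # None
```

The same entity model and resolvers then apply to both indices even though their field names differ.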
  33. Step 6. Recursive Queries. With each query, new inputs might be found in different attributes. Use the metadata map and your resolvers to determine if you can create new queries for the new inputs.
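The recursion in Step 6 can be illustrated without a cluster. The sketch below replaces Elasticsearch queries with exact matching over an in-memory list, so it shows only the control flow: match, collect new attribute values, repeat until a hop finds nothing new. The records, resolvers, and the function name `resolve` are all illustrative assumptions.

```python
# Toy records standing in for indexed documents; all values are illustrative.
DOCS = [
    {"id": 1, "name": "Allie Jones", "phone": "202-555-1234", "email": "allie@example.com"},
    {"id": 2, "name": "Allie Smith", "phone": "202-555-1234", "email": "allie@example.com"},
    {"id": 3, "name": "Allie Smith", "phone": "202-555-9876", "email": "allie@example.com"},
    {"id": 4, "name": "Allie Smith", "phone": "202-555-9876", "email": "asmith@example.com"},
    {"id": 5, "name": "Eli Jones", "phone": "888-555-6789", "email": "eli@example.com"},
]

# Each resolver is a set of attributes that must all match (exact equality here).
RESOLVERS = [{"name", "phone"}, {"phone", "email"}, {"name", "email"}]


def resolve(seed):
    """Repeat matching until a hop finds no new documents; return the matched ids."""
    known = {attr: {value} for attr, value in seed.items()}  # values seen so far
    found = {}
    while True:
        new_docs = [
            doc for doc in DOCS
            if doc["id"] not in found
            and any(
                all(doc[attr] in known.get(attr, set()) for attr in resolver)
                for resolver in RESOLVERS
            )
        ]
        if not new_docs:
            return sorted(found)
        for doc in new_docs:
            found[doc["id"]] = doc
            # Every value on a matched document becomes an input for the next hop.
            for attr in ("name", "phone", "email"):
                known.setdefault(attr, set()).add(doc[attr])


print(resolve({"name": "Allie Jones", "phone": "202-555-1234"}))  # [1, 2, 3, 4]
```

Document 4 shares no attribute value with the seed; it is reached only through the values picked up on earlier hops, which is the transitive linking the talk describes. Pooling values this way is also what makes corroborating resolvers important.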
