Graph Gurus Episode 6: Community Detection

Community Detection using a Native Parallel Database

Published in: Software

  1. Graph Gurus Episode 6: Birds of a Feather - Community Detection with a Native Parallel Graph Database
  2. Welcome
     ● Attendees are muted, but you can talk to us via Chat in Zoom.
     ● Send questions at any time using the Q&A tab in the Zoom menu.
     ● We will have 10 minutes for Q&A at the end.
     ● The webinar will be recorded and sent via email.
     ● A link to the presentation and reproducible steps will be emailed.
     NOTE: Update to the latest version of Zoom to avoid bugs.
     © 2018 TigerGraph. All Rights Reserved.
  3. Developer Edition Available
     We now offer Docker and VirtualBox versions of the TigerGraph Developer Edition, so you can run it on:
     ● macOS
     ● Windows 10
     ● Linux
     Developer Edition download: https://www.tigergraph.com/developer/
  4. Today's Gurus
     Victor Lee, Director of Product Management
     ● BS in Electrical Engineering and Computer Science from UC Berkeley, MS in Electrical Engineering from Stanford University
     ● PhD in Computer Science from Kent State University, focused on graph data mining
     ● 15+ years in the tech industry
     Emma Liu, Product Manager
     ● BS in Engineering from Harvey Mudd College, MS in Engineering Systems from MIT
     ● Prior work experience at Oracle and MarkLogic
     ● Focus: Cloud, Containers, Enterprise Infra, Monitoring, Management, Connectors
     Huiting Su, Software Engineer
     ● Master's in Industrial Engineering from Purdue
     ● Focus: Graph Algorithms and Analytics, Machine Learning
     ● Resident GSQL Expert
  5. Graph Algorithms, Part 2
     Part 1 discussed PageRank (Graph Gurus Episode 5).
  6. Communities can be Natural Phenomena
     ● Natural: protein interaction network for schizophrenia, https://en.wikipedia.org/wiki/Interactome
     ● Organic (human-made, but without central control): DBpedia pages, with links between pages, https://www.hackdiary.com/2012/04/05/extracting-a-social-graph-from-wikipedia-people-pages/
  7. … or Engineered Communities
     Congressional Committees and Subcommittees: http://www.pnas.org/content/102/20/7057
  8. Understanding Connected Communities
     Questions:
     1. How do I find the most influential provider in each region (e.g. a local healthcare market) delivering care for a related group of codes for a condition (Diabetes, Cardiac Care, etc.)?
     2. Who is influenced by these leaders (e.g. other doctors, chiropractors, physical therapists, facilities)?
     3. What is the community size and impact (patients and providers) around these hubs?
     Use case:
     ● Understand care and referral dynamics better
     ● Target education at the influencers
     ● Identify which influencers are also best-practice practitioners
     Based on work by a large US pharma company.
  9. How do I find the most influential provider in each region for a particular medical condition?
     This is a whole-graph compute problem:
     1. Analyze claims data to identify referral relationships among providers (time series analysis).
     2. Create subsets of claims around each condition with a group of healthcare codes (e.g. CPT codes) for each region (e.g. a local healthcare market).
     3. Utilize PageRank to score hubs within each market.
     Example: Condition: Diabetes; Healthcare Market: S. San Jose, CA; Hub Identified: Dr. Thomas
  10. Who is influenced by these leaders (e.g. other doctors, chiropractors, physical therapists, facilities)?
      Utilize community detection:
      1. Identify communities of providers around each hub for each region and for a specific condition.
      2. Track changes over time to detect significant shifts in communities.
      Example: Condition: Diabetes; Healthcare Market: S. San Jose, CA; Hub Identified: Dr. Thomas; Community Detected: Diabetes – S. San Jose – Dr. Thomas
  11. What is the community size and impact (patients and providers) around these hubs?
      1. Compute the cost of care for initial diagnosis and follow-on treatment for each community.
      2. Compare with other communities that have a similar patient population.
      3. Track changes over time to detect significant changes in cost of care.
      Example: Condition: Diabetes; Healthcare Market: S. San Jose, CA; Hub Identified: Dr. Thomas; Community Detected: Diabetes – S. San Jose – Dr. Thomas; Cost of care: initial diagnosis, follow-on care (medicine, tests, treatment)
  12. Other Use Cases
      ● Business:
        ○ Who is trading with whom?
        ○ What products or services are often purchased together?
      ● Government:
        ○ Determine natural groupings of persons and needs, for more efficient delivery of services
      ● Criminal investigation:
        ○ Detect collusion/conspiracy
        ○ Detect persons at risk of criminal influence
  13. What is a Community? Who are its members?
      ● There are several different definitions of community.
      ● They are usually based on direct connections. The set of vertices C is a community if:
        1. Every member in C has a direct connection to every other member, or
        2. Every member in C has a path to every other member, or
        3. The majority of each member's neighbors also belong to C, or
        4. The density of connections within C is greater than expected if connections were random.
  14. 1: Everyone is connected to everyone.
      ● This type of subgraph is called a complete graph.
      ● The collection of vertices is called a clique.
      ● Too strict for most real-world uses.
      http://mathworld.wolfram.com/CompleteGraph.html
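
The "everyone is connected to everyone" test can be made concrete with a small Python sketch. This is an illustration only, not code from the webinar: the function name `is_clique` and the toy edge list are assumptions made for the example, and edges are treated as undirected.

```python
from itertools import combinations

def is_clique(members, edges):
    """Return True if every pair of distinct members shares an undirected edge."""
    # Store each edge as a frozenset so (a, b) and (b, a) compare equal.
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset(pair) in edge_set for pair in combinations(members, 2))

# Hypothetical toy graph: a triangle a-b-c with one extra vertex d hanging off c.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
triangle_is_clique = is_clique(["a", "b", "c"], edges)     # all three pairs are linked
with_d_is_clique = is_clique(["a", "b", "c", "d"], edges)  # a-d and b-d are missing
```

A single missing edge disqualifies the whole group, which is why this definition is rarely usable on real data.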
  15. 2: A path to every member.
      ● Instead of requiring a direct connection, we allow indirect connections.
      ● A connected component is a maximal subgraph in which every vertex has a path to every other vertex.
        ○ Weakly Connected Component (WCC): edges are treated as undirected
        ○ Strongly Connected Component (SCC): paths must follow the direction of directed edges
      ● Important, but still strict.
  16. 3: Same community with most of your neighbors.
      ● Allows individuals to link to multiple communities.
      ● What if there's an equal number of in-group and out-group links?
      ● What is the right number of communities?
      LabelRank: https://ieeexplore.ieee.org/document/6609210
  17. Parsimony: The simplest answer is best.
      ● cf. Occam's razor.
      ● If you have a choice of 2 communities or 3 communities, and both "explain" the data equally well → pick the smaller number of communities (2).
      ● If 3 communities give a "cleaner explanation" of the data than 2 communities → probably go with 3.
  18. 4: More in-group connections than out-group.
      ● Modularity is the fraction of the edges that fall within the given groups minus the expected fraction if edges were distributed at random. (Newman and Girvan)
      ● The value of the modularity lies in the range [-1/2, 1] (often quoted more loosely as [-1, 1]).
      ● Choose the partitioning (grouping) that has the highest modularity score.
      http://www.ludowaltman.nl/slm/
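
To make the modularity definition concrete, here is a minimal Python sketch for an unweighted, undirected graph. It uses the standard per-community form Q = Σ_c [ L_c/m − (d_c/2m)² ], where m is the number of edges, L_c the number of edges inside community c, and d_c the sum of the degrees of its members. The function name `modularity` and the toy graph are assumptions for illustration; this is not the Louvain implementation from the GSQL library.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity of a partition; `community` maps each vertex to a community id."""
    m = len(edges)
    internal = defaultdict(int)  # L_c: edges with both endpoints in community c
    degree = defaultdict(int)    # d_c: sum of vertex degrees in community c
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Hypothetical toy graph: two triangles joined by the single bridge edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
lumped = {v: "A" for v in range(6)}

q_split = modularity(edges, split)    # one community per triangle
q_lumped = modularity(edges, lumped)  # everything in a single community
```

On this toy graph the two-triangle partition scores higher than the single lumped community, which matches the rule of choosing the grouping with the highest modularity.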
  19. Community Detection Algorithms
      1. Complete Graph Discovery: Every member in C has a direct connection to every other member.
      2. Connected Components: Every member in C has a path to every other member.
      3. Label Propagation: The majority of each member's neighbors also belong to C.
      4. Modularity Optimization (Louvain method): The density of connections within C is greater than expected if connections were random.
      ● Each has a different level of computational complexity (how long it takes to compute when the graph is very big).
  20. GSQL Graph Algorithm Library
      https://github.com/tigergraph/ecosys/tree/master/graph_algorithms
      Each graph algorithm is a GSQL query.
      ● May have zero or more input parameters.
      ● Typically 3 variations:
        ○ Standard JSON output
        ○ Write to a CSV file
        ○ Save to vertex attributes (requires that the attributes exist)
  21. Connected Component Algorithm
      1. Label each vertex with a unique community ID. (Each vertex is a community of size 1.)
      2. Repeat:
         a. For each edge, set the commID of the target vertex to be the smaller of the two commIDs.
         b. If there are no commID changes, then exit.
         c. Otherwise, repeat.
  22. CREATE QUERY conn_comp () FOR GRAPH generic {
        MinAccum<int> @cc_id = 0;   # each vertex's tentative component id
        SumAccum<int> @old_id = 0;
        OrAccum<bool> @active;

        # Initialize: Label each vertex with its own internal ID
        Start = {Node.*};
        S = SELECT x
            FROM Start:x
            POST-ACCUM x.@cc_id = getvid(x),
                       x.@old_id = getvid(x);

        # Propagate smaller internal IDs until no more ID changes can be done
        WHILE (Start.size()>0) DO
          Start = SELECT t
                  FROM Start:s -(Link:e)-> :t
                  ACCUM t.@cc_id += s.@cc_id   // If s has a smaller id than t, copy the id to t
                  POST-ACCUM
                    CASE WHEN t.@old_id != t.@cc_id THEN   // If t's id has changed
                      t.@old_id = t.@cc_id,
                      t.@active = true
                    ELSE
                      t.@active = false
                    END
                  HAVING t.@active == true;
        END;
      }
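
The connected component steps can also be traced in plain Python. The sketch below is an illustration under stated assumptions, not a port of the GSQL query: the function name `connected_components` and the toy graph are hypothetical, edges are treated as undirected (so both endpoints take the smaller ID), and it loops over every edge on each pass rather than tracking only the active vertices.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest starting ID found in its component."""
    # Step 1: each vertex starts as its own community.
    comm_id = {v: i for i, v in enumerate(vertices)}
    changed = True
    # Step 2: keep pushing the smaller ID across each edge until nothing changes.
    while changed:
        changed = False
        for u, v in edges:
            smallest = min(comm_id[u], comm_id[v])
            if comm_id[u] != smallest or comm_id[v] != smallest:
                comm_id[u] = comm_id[v] = smallest
                changed = True
    return comm_id

# Hypothetical toy graph: a chain a-b-c plus a separate pair d-e.
labels = connected_components(["a", "b", "c", "d", "e"],
                              [("a", "b"), ("b", "c"), ("d", "e")])
num_components = len(set(labels.values()))
```

Vertices that end with the same label belong to the same component, so counting the distinct labels gives the number of components.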
  23. Connected Component Results
      ● Dataset: Zachary's Karate Club
        ○ Well-known social network study from 1977.
        ○ Friendship network of 34 karate club members, who fractured into 2 clubs.
      ● It's one connected component.
      ● CC is more important for very large graphs, to find isolated subgroups.
  24. Label Propagation Algorithm
      1. Label each vertex with a unique community ID. (Each vertex is a community of size 1.)
      2. Repeat:
         a. For each vertex, count the commIDs of its neighbors.
         b. For each vertex, update its commID to be the most commonly seen commID among its neighbors.
         c. If there are no commID changes or you have reached the maximum number of iterations, then exit.
         d. Otherwise, repeat.
  25. CREATE QUERY label_prop (INT maxIter) FOR GRAPH generic {
        OrAccum @@changed = true;
        MapAccum<int, int> @map;          # local <communityId, numNeighbors>
        MapAccum<int, int> @@commSizes;   # global <communityId, numMembers>
        SumAccum<int> @label, @num;

        Start = {Node.*};

        # Assign unique labels to each vertex
        Start = SELECT s
                FROM Start:s
                ACCUM s.@label = getvid(s);

        # Continued on next slide
  26. # Propagate labels to neighbors until labels converge or the max iterations is reached
        WHILE @@changed == true LIMIT maxIter DO
          @@changed = false;
          Start = SELECT s
                  FROM Start:s -(Link:e)-> :t
                  ACCUM t.@map += (s.@label -> 1)   # count the occurrences of neighbor's labels
                  POST-ACCUM
                    INT maxV = 0,
                    INT label = 0,
                    # Iterate over the map to get the neighbor label that occurs most often
                    FOREACH (k,v) IN t.@map DO
                      CASE WHEN v > maxV THEN
                        maxV = v,
                        label = k
                      END
                    END,
                    # When the neighbor search finds a label AND it is a new label
                    # AND the label's count has increased, update the label.
                    CASE WHEN label != 0 AND t.@label != label AND maxV > t.@num THEN
                      @@changed += true,
                      t.@label = label,
                      t.@num = maxV
                    END,
                    t.@map.clear();
        END;
      }
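
For readers who want to step through label propagation outside of GSQL, here is a small Python sketch. It is an illustration, not a translation of the query: the name `label_propagation`, the `best_count` dictionary (playing the role of `@num`), the smallest-label tie-break, and the toy graph are all assumptions made for this example. Like the query, it updates all vertices from the previous round's labels and only accepts a new label when its neighbor count beats the count that justified the current one.

```python
from collections import Counter

def label_propagation(adjacency, max_iter=10):
    """Synchronous label propagation; `adjacency` maps each vertex to its neighbors."""
    label = {v: i for i, v in enumerate(adjacency)}  # unique starting labels
    best_count = {v: 0 for v in adjacency}           # count behind each current label
    for _ in range(max_iter):
        changed = False
        new_label = dict(label)
        for v, neighbors in adjacency.items():
            if not neighbors:
                continue
            counts = Counter(label[n] for n in neighbors)
            top = max(counts.values())
            # Assumed tie-break: the smallest label among the most frequent ones.
            candidate = min(k for k, c in counts.items() if c == top)
            if candidate != label[v] and top > best_count[v]:
                new_label[v] = candidate
                best_count[v] = top
                changed = True
        label = new_label
        if not changed:
            break
    return label

# Hypothetical toy graph: two triangles joined by the bridge edge 2-3.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
             3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
labels = label_propagation(adjacency)
```

On this toy graph each triangle settles on its own label. Results on real data depend on the tie-breaking rule and update order, so different implementations can legitimately return different groupings.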
  27. Label Propagation Results
      ● Zachary's Karate Club again
      ● 2 large groups
      ● 2 or 3 small groups
  28. Real World Use Case
      Finding communities among Health Care Providers
      Please send your questions via Q&A at any time.
  29. Q&A
      Please send your questions via the Q&A menu in Zoom.
  30. Episode 7: Wednesday, December 5 at 11:00 a.m. PT / 2:00 p.m. ET
      Connecting the Dots in Real-Time: Deep Link Analysis with a Native Parallel Graph Database to Uncover Hidden Relationships
      https://info.tigergraph.com/graph-gurus-7
      Register for more webinars at https://www.tigergraph.com/webinars-and-events/
  31. Additional Resources
      ● New Developer Portal: https://www.tigergraph.com/developers/
      ● Download the Developer Edition or Enterprise Free Trial: https://www.tigergraph.com/download/
      ● Guru Scripts: https://github.com/tigergraph/ecosys/tree/master/guru_scripts
      ● Join our Developer Forum: https://groups.google.com/a/opengsql.org/forum/#!forum/gsql-users
      ● @TigerGraphDB, youtube.com/tigergraph, facebook.com/TigerGraphDB, linkedin.com/company/TigerGraph