Thu bernstein key_warp_speed

450 vues

Publié le

  • Soyez le premier à commenter

  • Soyez le premier à aimer ceci

Thu bernstein key_warp_speed

  1. 1. Processing Linked Data at Warp Speed Abraham Bernstein CCBY NASA http://www.flickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0
  2. 2. CCBY NASA http://www.flickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0
  3. 3. "Earth's Location in the Universe (JPEG)" by Andrew Z. Colvin - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons http://commons.wikimedia.org/wiki/File:Earth%27s_Location_in_the_Universe_(JPEG).jpg#mediaviewer/File:Earth%27s_Location_in_the_Universe_(JPEG).jpg
  4. 4. "IBM Electronic Data Processing Machine - GPN-2000-001881" NASA, Public Domain @ Wikimedia Commons - 
 http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
  5. 5. Processing Graphs "IBM Electronic Data Processing Machine - GPN-2000-001881" NASA, Public Domain @ Wikimedia Commons - 
 http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
  6. 6. Semantic Web Reasoning KB:Asserted Triples Entailed KB: Asserted & Infered Triples DLReasoning InductiveR. AnalogicalR. YourR.
  7. 7. Signal/Collect P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web. International Semantic Web Conference–ISWC 2010. Springer Berlin Heidelberg, 2010. 764-780.
  8. 8. • Vertices as stateful processing units • Vertices interact through signals along edges • Which are collected by a processing function that updates the vertex state Processing Graphs Naturally
  9. 9. • Define the graph structure • Vertices represent RDFS classes • Edges from superclasses to subclasses • Vertex state initialized with the class that the vertex represents Signal/Collect: An Intuition for RDFS subclass inference id: animal state: {animal} id: bird state: {bird} id: owl state: {owl} id: penguin state: {penguin} id: animal id: bird id: owlid: penguin Stutz et al, 2010
  10. 10. Signal/Collect: An Intuition for RDFS subclass inference Stutz et al, 2010 id: animal state: {animal} id: bird state: {bird} id: owl state: {owl} id: penguin state: {penguin} {bird, animal} {animal} {bird, animal} id: animal state: {animal} id: bird state: {bird} id: owl state: {owl} id: penguin state: {penguin} id: penguin state: {penguin, bird, animal} id: bird state: {bird, animal} id: owl state: {owl, bird, animal} def collect = state [ [ s2signals s def signal = state
  11. 11. Scoring/Asychronicity: Single-Source Shortest Path state: ∞ state: ∞ state: ∞ state: ∞ state: ∞ state: ∞ state: 0 ∞ 1 ∞ ∞ ∞1 1 ∞ state: 1 state: ∞ state: ∞ state: ∞state: 1 state: 1 2 1 2 ∞ 21 1 ∞ state: 1 state: 2 state: 2 state: 2state: 1 state: 1 2 1 2 3 21 1 3 state: 1 state: 2 state: 2 state: 2state: 1 state: 1 def signal = state + weight def collect = min (state, min (signals) )
  12. 12. state: 1 Scoring/Asynchronicity: Single-Source Shortest Path state: ∞ state: ∞ state: ∞ state: 0 1 1 1 state: 1 state: ∞ state: ∞ state: ∞ state: 1 2 2 2 state: 2 state: 2 state: 2 state: 2 state: 2 var oldState = infinity def scoreSignal = if (state ! = oldState) 1 else 0 3 3
  13. 13. PageRank in Code class Document(id: Any) extends Vertex(id, 0.15) { def collect = 0.15 + 0.85 * signals[Double].foldLeft(0.0)(_ + _) } Algorithm class Citation(citer: Any, cited: Any) extends Edge(citer, cited) { def signal = source.state.asInstanceOf[Double] * weight / source.sumOfOutWeights } ExecutionInitialization object Algorithm { def executeCitationRank(db: SparqlAccessor) { val computeGraph = new AsynchronousComputeGraph() val citations = new SparqlTuples(db, "select ?source ?target where {" + "?source <http://lsdis.cs.uga.edu/projects/semdis/opus#cites> ?target}") citations foreach { case (citer, cited) => computeGraph.addVertex[Document](citer) computeGraph.addVertex[Document](cited) computeGraph.addEdge[Citation](citer, cited) } computeGraph.execute() } }
  14. 14. Signal/Collect as a Platform P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web. International Semantic Web Conference–ISWC 2010. Springer Berlin Heidelberg, 2010. 764-780. T R I P L E R U S H F R A U D D E T E C T I O N D C O P S Images::BoInsogna,https://flic.kr/p/9famxT,CC-NC-NDhttps://creativecommons.org/licenses/by-nc-nd/2.0/ Antana,https://flic.kr/p/gGQPhA,CC-BY-SA,https://creativecommons.org/licenses/by-sa/2.0/, JeffKubina,https://flic.kr/p/2CG5PU,CC-BY-SA
  15. 15. Tr i p l e S t o re E LV I S D Y L A N J O B S ? X I N S P I R E D ? Y ? Y I N S P I R E D ? Z I N S P I R E D DATA QUERY IN SP IR E D
  16. 16. DylanElvis inspired *Elvis inspiredDylan * inspired DylanElvis * *Elvis *** inspired Dylan * * ** *
  17. 17. DylanElvis inspired *Elvis inspiredDylan * inspired DylanElvis * *Elvis *** inspired Dylan * * ** * *Dylan inspired Jobs * inspiredJobsDylan * JobsDylan inspired *Dylan *Jobs * *
  18. 18. DylanElvis inspired Dylan * inspired ** inspired *Dylan inspired Jobs * inspired JobsDylan inspired Query Vertex ?X inspired ?Y! ?Y inspired ?Z ?X inspired ?Y! ?Y inspired ?Z Elvis inspired Dylan! Dylan inspired ?Z Dylan inspired Jobs! Jobs inspired ?Z No vertex with ID! [ Jobs inspired * ] Query Vertex { ?X = Elvis, ?Y = Dylan, ?Z = Jobs } ✗ Elvis inspired Dylan! Dylan inspired Jobs ✓ ?X inspired ?Y! ?Y inspired ?Z
  19. 19. DylanElvis inspired Dylan * inspired ** inspired *Dylan inspired Jobs * inspired JobsDylan inspired Query Vertex ?X inspired ?Y! ?Y inspired ?Z ?X inspired ?Y! ?Y inspired ?Z Elvis inspired Dylan! Dylan inspired ?Z Dylan inspired Jobs! Jobs inspired ?Z No vertex with ID! [ Jobs inspired * ] Query Vertex { ?X = Elvis, ?Y = Dylan, ?Z = Jobs } ✗ Elvis inspired Dylan! Dylan inspired Jobs ✓ ?X inspired ?Y! ?Y inspired ?Z
  20. 20. P e r f o r m a n c e R e s u l t s Distributed (8 nodes), LUBM 10240 (~1.36 billion triples) Single-node, LUBM 160 (~21 million triples) Fastest L1 L2 L3 L4 L5 L6 L7 Geo. of 10 runs mean TripleRush 3,111.2 1,457.9 0.7 3.5 9.5 29.1 1,165.8 62.1 Trinity.RDF 12,648.0 6,018.0 8,735.0 5.0 4.0 9.0 31,214.0 450.0 TriAD 7,631.0 1,663.0 4,290.0 2.1 0.5 69.0 14,895.0 249.0 TriAD-SG 2,146.0 2,025.0 1,647.0 1.3 0.7 1.4 16,863.0 106.0 Fastest L1 L2 L3 L4 L5 L6 L7 Geo of 10 runs mean TripleRush 22.6 27.8 0.4 1.0 0.4 0.9 21.2 2.94 Trinity.RDF 281.0 132.0 110.0 5.0 4.0 9.0 630.0 46.0 TriAD 427.0 117.0 210.0 2.0 0.5 19.0 693.0 39.0 TriAD-SG 97.0 140.0 31.0 1.0 0.2 1.8 711.0 14.0
  21. 21. O n g o i n g Wo r k : 
 G r a p h P a r t i t i o n i n g • https://github.com/uzh/triplerush
  22. 22. T R I P L E R U S H F R A U D D E T E C T I O N D C O P S Signal/Collect as a Platform Images::BoInsogna,https://flic.kr/p/9famxT,CC-NC-NDhttps://creativecommons.org/licenses/by-nc-nd/2.0/ Antana,https://flic.kr/p/gGQPhA,CC-BY-SA,https://creativecommons.org/licenses/by-sa/2.0/, JeffKubina,https://flic.kr/p/2CG5PU,CC-BY-SA
  23. 23. D e c o m p o s i n g F r a u d P a t t e r n s • Participants can be labeled as: • Splitters • Aggregators • Forwarders $10k$11k 2 x $5k $10k 10k CHF 2k CHF 4k CHF 6k CHF 8k CHF
  24. 24. E l i c i t a t i o n P ro c e s s FilterConnectMatch
  25. 25. B i t c o i n Tr a n s a c t i o n s : R u n t i m e v s . M a t c h i n g C o m p l e x i t y 0" 1000" 2000" 3000" 4000" 5000" 6000" 7000" 4" 6" 8" 10" 12" Time%(sec)% Matching%Complexity% Time"in"GC" Total"Processing"Time" Dataset Size: 50M Transactions, Matching Duration: 1 week Runtime: 27 min Throughput: 1.8M / min Runtime: 35 min Throughput: 1.4M / min Runtime: 98 min Throughput: 0.5M / min
  26. 26. T R I P L E R U S H F R A U D D E T E C T I O N D C O P S Signal/Collect as a Platform Images::BoInsogna,https://flic.kr/p/9famxT,CC-NC-NDhttps://creativecommons.org/licenses/by-nc-nd/2.0/ Antana,https://flic.kr/p/gGQPhA,CC-BY-SA,https://creativecommons.org/licenses/by-sa/2.0/, JeffKubina,https://flic.kr/p/2CG5PU,CC-BY-SA
  27. 27. ≠ ≠≠ ≠ y Distributed Constraint Optimization z q x x ≠ y y ≠ z z ≠ x q ≠ z x ∈ {0,1,2} y ∈ {0,1} z ∈ {1,2} q ∈ {0,1,2}
  28. 28. Vertex Coloring in action Optimized Version of DSA Running on a MacBook Pro with 8 workers (slow, due to lots of IO for logging, bookkeeping, etc.) • Scaled to: 10 Million vertices / variables Verman and Bernstein, 2014
  29. 29. Industry Usage ! “We  are  using  Signal/Collect  to  analyze  millions  of   claims  every  day  to  iden9fy  opportuni9es  for  our   clients  to  save  money  through  be=er  healthcare  or   avoiding  fraud,  waste,  and  abuse.”   US  Healthcare  Analy/cs  Company   

  30. 30. "IBM Electronic Data Processing Machine - GPN-2000-001881" NASA, Public Domain @ Wikimedia Commons - 
 http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
  31. 31. Processing Graph Streams "IBM Electronic Data Processing Machine - GPN-2000-001881" NASA, Public Domain @ Wikimedia Commons - 
 http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
  32. 32. 147.1b € 97.5b €
  33. 33. 147.1b € 97.5b € 5’100
  34. 34. 3mViewers 300 Events/s
  35. 35. 250 TV Channels 25 frames/s
  36. 36. EPG for 7 days LOD enhanced
  37. 37. Traditional TripleStore http://www.mpi.de/ http://www.ifi.uzh.ch/ http://www.uni-sb.de/courses/ http://www.universities.de/Saarbruecken http://www.ifi.uzh.ch/ http://www.ifi.uzh.ch/i_am_a_URI http://www.ifi.uzh.ch/i_am_a_URI http://www.ifi.uzh.ch/i_am_a_URI http://www.ifi.uzh.ch/i_am_a_URI http://www.ifi.uzh.ch/i_am_a_URI http://www.ifi.uzh.ch/i_am_a_URI http://www.lubm.org/teaches http://www.lubm.org/
  38. 38. Traditional TripleStore http://www.mpi.de/ http://www.ifi.uzh.ch/ http://www.uni-sb.de/courses/ http://www.universities.de/Saarbruecken http://www.ifi.uzh.ch/ http://www.ifi.uzh.ch/i_am_a_URI http://www.ifi.uzh.ch/i_am_a_URI http://www.ifi.uzh.ch/i_am_a_URI http://www.ifi.uzh.ch/i_am_a_URI http://www.ifi.uzh.ch/i_am_a_URI http://www.ifi.uzh.ch/i_am_a_URI http://www.lubm.org/teaches http://www.lubm.org/
  39. 39. Semantic Flow Processing
  40. 40. Semantic Flow Processing t
  41. 41. ViSTA-TV Viewership Data 0 7500 15000 22500 30000 0h 8h 16h 24h 8h 16h 24h 8h 16h 24h Num of Valid Data Entries day 1 day 2 day 3 Cache ViSTA-­‐TV  project:  UserLog  and  EPG  streams
  42. 42. Semantic Flow Processing is: • Time-stamped tripes t = <s,p,o> [time] • Semantic flow F = [t1, t2, ... tn] • Perform query matching on cached subset of F • Subject to Stress—incoming data-rate overwhelms the system’s processing capability
  43. 43. Context: Time Window last 1 min ? within 1 sec
  44. 44. Load Shedding last 1 min
  45. 45. Eviction last 1 min ?
  46. 46. Eviction is: • Remove cached data • Note: Eviction may lower recall • Evict potential to produce future results • Types of eviction strategies (considered): • Random • Time-based (i.e, FIFO) • Least Recently Used (LRU)
  47. 47. Why not LRU? LRU C B Input: UserLog ,<watch>,ChannelUser 1 B Join EPG Cache Two-way Join: UserLog EPG
  48. 48. Why not LRU? LRU C B User 1 Input: UserLog ,<watch>,Channel C Join EPG Cache
  49. 49. Why not LRU? LRU B C Input Input: EPG Channel , <Play>, Show 4DD EPG Cache D User 3 Input: UserLog ,<watch>,Channel ?
  50. 50. Why not LRU? LRU C Input: EPG Channel , <Play>, Show 5E B Input: EPG Channel , <Play>, Show 4D InputInput: EPG Channel , <Play>, Show 6F Input: EPG Channel , <Play>, Show 7G Input: EPG Channel , <Play>, Show 8H D EPG Cache
  51. 51. CLOCK is: • Consider both recency (LRU) and past results • Giving each data entry a score • The score could be incremented and depreciated • Named by the buffer management algorithm CLOCK
  52. 52. CLOCK CLOCK Score 6 1 3C B Input: UserLog ,<watch>,ChannelUser 2 C Join EPG Cache
  53. 53. CLOCK CLOCK Score 6 1 4C B Input: UserLog ,<watch>,ChannelUser 2 C Join EPG Cache
  54. 54. CLOCK CLOCK Score 6 1 4C B Input: EPG Channel , <Play>, Show 4D Input dep() EPG Cache
  55. 55. CLOCK CLOCK Score 5 1 4C B Input: EPG Channel , <Play>, Show 4D Input dep() EPG Cache
  56. 56. CLOCK CLOCK Score 5 0 4C B Input: EPG Channel , <Play>, Show 4D Input dep() D EPG Cache
  57. 57. CLOCK CLOCK Score 5 init = 1 4C Input: EPG Channel , <Play>, Show 4D Input dep() D EPG Cache
  58. 58. CLOCK CLOCK Score 5 1 4C Input dep() D Input: EPG Channel , <Play>, Show 5E EPG Cache
  59. 59. CLOCK CLOCK Score 5 1 3C Input dep() D Input: EPG Channel , <Play>, Show 5E EPG Cache
  60. 60. CLOCK CLOCK Score 4 1 3C Input dep() D Input: EPG Channel , <Play>, Show 5E EPG Cache
  61. 61. CLOCK CLOCK Score 4 0 3C Input dep() D Input: EPG Channel , <Play>, Show 5E EPG CacheE
  62. 62. CLOCK CLOCK Score 4 1 3C Input dep() E EPG Cache Input: EPG Channel , <Play>, Show 6F Input: EPG Channel , <Play>, Show 7G Input: EPG Channel , <Play>, Show 8H
  63. 63. CLOCK Eviction is: ! • The weight is adjusted by dep(): • Linear: dep(w) = w - 1 • Exponential: dep(w) = w * ρ (0< ρ < 1) Recency History
  64. 64. Experimental Results TV Viewership Data Recall 0% 25% 50% 75% 100% Cache Size 100% 50% 25% 15% 1% Random FIFO LRU CLOCK Two-way join query with 1,919,216 input triples
  65. 65. Experimental Results: Depreciation Function Exponential factor ρ = ρ in {0.25, 0.5, 0.75, 0.95}
  66. 66. Limitations • Static depreciation function: dep() • Experiment on other datasets • Real implementation to investigate other performance metrics • Only local eviction strategies considered
  67. 67. CCBY NASA http://www.flickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0 "IBM Electronic Data Processing Machine - GPN-2000-001881" NASA, Public Domain @ Wikimedia Commons - 
 http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg Semantic Web Reasoning KB:Asserted Triples Entailed KB: ! Asserted & Infered Triples DLReasoning InductiveR. AnalogicalR. YourR. Signal/Collect P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web. International Semantic Web Conference–ISWC 2010. Springer Berlin Heidelberg, 2010. 764-780. CLOCK Eviction is: ! • The weight is adjusted by dep():" • Linear: dep(w) = w - 1" • Exponential: dep(w) = w * ρ (0< ρ < 1) Recency History
  68. 68. CCBY NASA http://www.flickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0 Philip Stutz, Mihaela Verman, Shen Gao, Daniel Strebel, Bibek Paudel, Lorenz Fischer, Thomas Keller, Robin Hafen, Genc Mazlami

×