Graph Analysis at Linkfluence            Why Cascalog              Introduction to Cascalog   Conclusion




               Using Cascalog and Hadoop for rapid graph
                       processing and exploration

                                Nils Grunwald and Hugo Zanghi

                                                  Linkfluence


                     2012-02-05 - FOSDEM 2012 - Graph Devroom




Nils Grunwald and Hugo Zanghi                                                                Linkfluence
Using Cascalog and Hadoop for rapid graph processing and exploration




Outline


      Graph Analysis at Linkfluence


      Why Cascalog


      Introduction to Cascalog


      Conclusion




What we do at Linkfluence


          Web data mining (blogs, media, etc.)
          Social network data mining (Twitter, Facebook)
          Use this data to build various search engines
          Visualize the data with various UIs (Gephi, maps, etc.)



What we get




             Lots of nodes (users, pages, websites, words)
             Lots of edges (hyperlinks, comments, RT, co-occurrences)
             These datasets are interconnected (Twitter users link pages,
             words occur everywhere)




The problem




             Collecting and processing this data as a graph is not the
             primary goal of our system
             But it is a very rich dataset that we want to explore for
             R&D purposes




The constraints



             The graph processing should not compromise the rest of the
             system
             Low-maintenance
             Used for queries and rapid prototyping
             Flexible: it is hard to tell beforehand which fields or
             metadata will be used




What is Cascalog




             Built on top of Hadoop and Cascading (workflow management)
             Inspired by the Datalog query syntax
             Embedded in the Clojure language, which runs on the JVM




Hadoop for reliability and scalability



             Reliable and scalable
             Everything is dumped to text files, so we reuse our existing
             rsyslog infrastructure
             We can reuse the existing Hadoop instances of our system
             No need to know beforehand about indexed fields or to have
             data in a perfectly uniform format




Datalog for rapid prototyping




             Subset of Prolog
             Declarative, expressive and very concise way of writing queries
             Prolog has long been used for making queries over graphs




Clojure for flexibility




             Only one language and one file for queries and business logic
             Tasks unrelated to data processing are possible inside the
             queries (for example, resolving shortened links)
             Allows complex algorithms to be concisely expressed




The downsides




             Slow compared to in-memory computation or a non-distributed
             graph database
             Cannot do real-time processing




Use-cases



             Post-processing on a large number of edges
             Filtering or transforming a dataset before exporting to Gephi
             or Neo4j
             Back-processing old data with inconsistent fields and merging
             datasets from different sources




Using Cascalog



             Declarative syntax
             Order of statements is arbitrary
             Syntax is LISP-like
             Operations are based on tuples
             Custom operators (filter, mapcat, etc.) let you control
             the flow




Anatomy of a Cascalog query (Aggregation)


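      Example (in-degree from cascalog.graph.core)
      ;; namespace setup is an assumption; the slide only shows the function
      (ns cascalog.graph.core
        (:use cascalog.api)
        (:require [cascalog.ops :as c]))

      (defn in-degree ;; just a normal function
        "Computes the in-degree of each node." ;; docstring
        [edges]
        (<- ;; returns a Cascalog query
          [?dst ?in_d] ;; returned tuple
          (edges ?dst _) ;; destructuring on a generator
          (:distinct false)
          (c/count :> ?in_d))) ;; infers aggregation on ?dst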
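
To make the aggregation concrete, here is a minimal plain-Clojure sketch of what the query computes on a small in-memory edge list. It does not use Cascalog or Hadoop. The names `sample-edges` and `in-degree-local` are hypothetical, and the [dst src] layout of each edge is an assumption taken from the binding (edges ?dst _) above.

```clojure
;; Plain Clojure, no Hadoop or Cascalog required.
;; Assumption: each edge is a [dst src] pair, matching (edges ?dst _).
(def sample-edges
  [["b" "a"]
   ["b" "c"]
   ["c" "a"]])

(defn in-degree-local
  "Counts how many edges point at each destination node."
  [edges]
  (frequencies (map first edges)))

(in-degree-local sample-edges)
;; => {"b" 2, "c" 1}
```

Duplicate edges are counted once each, which mirrors the role of (:distinct false) in the query.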




Anatomy of a Cascalog query (Filtering)

      Example (filtering on in-degree)
      (defn filtered-nodes
        [edges threshold]
        ;; compute in-degree as a subquery
        (let [in-degrees (in-degree edges)]
          (<-
            [?node-id ?in-deg]
            ;; filters on computed in-degree
            (> ?in-deg threshold)
            ;; uses previous subquery as a generator
            (in-degrees ?node-id ?in-deg))))
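
The same subquery-then-filter shape can be sketched in plain Clojure on in-memory data. This is only an illustration of the semantics, not how Cascalog executes the query. The names `sample-edges` and `filtered-nodes-local` are hypothetical, and edges are assumed to be [dst src] pairs.

```clojure
;; Plain Clojure illustration; no Cascalog involved.
;; Assumption: each edge is a [dst src] pair.
(def sample-edges
  [["b" "a"]
   ["b" "c"]
   ["c" "a"]])

(defn filtered-nodes-local
  "Keeps only the nodes whose in-degree is strictly above threshold."
  [edges threshold]
  (->> (frequencies (map first edges))            ; the in-degree "subquery"
       (filter (fn [[_ deg]] (> deg threshold)))  ; the filter predicate
       (into {})))

(filtered-nodes-local sample-edges 1)
;; => {"b" 2}
```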




Under the hood, this happens...

      Example (using custom filter ops)
      (deffilterop over-threshold
        [deg threshold]
        (> deg threshold))

      (defn filtered-nodes
        [edges threshold]
        (let [in-degrees (in-degree edges)]
          (<-
            [?node-id ?in-deg]
            (in-degrees ?node-id ?in-deg)
            ;; use custom operator
            (over-threshold ?in-deg threshold))))


Anatomy of a Cascalog query (Join)
      Example (joining on heterogeneous datasets)
      (import 'java.net.URL)

      (defn get-website
        "Returns the host part of a URL string."
        [url]
        (-> (URL. url)
            (.getHost)))

      (defn join-edges
        [backlinks rt]
        ;; implicit join: both generators share ?url
        (<-
          [?resolved]
          (backlinks ?src ?url)
          (rt _ ?url)
          (get-website ?url :> ?resolved)))
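
As a rough illustration of what this join produces, the sketch below applies the same logic to two small in-memory datasets in plain Clojure. The sample data and the name `join-edges-local` are hypothetical, and the [source url] layout of both datasets is an assumption based on the generator bindings in the query. Duplicates are dropped here only to keep the output short.

```clojure
(import 'java.net.URL)

(defn get-website
  "Returns the host part of a URL string."
  [url]
  (.getHost (URL. url)))

;; Hypothetical sample data: [source url] pairs.
(def backlinks [["blog-a" "http://example.com/post/1"]
                ["blog-b" "http://example.org/news"]])
(def rt        [["user-1" "http://example.com/post/1"]
                ["user-2" "http://example.net/other"]])

(defn join-edges-local
  "Hosts of the URLs that appear in both datasets."
  [backlinks rt]
  (let [retweeted (set (map second rt))]
    (->> backlinks
         (map second)        ; keep the url field
         (filter retweeted)  ; join condition: url also seen in rt
         (map get-website)
         distinct)))

(join-edges-local backlinks rt)
;; => ("example.com")
```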

Further reading




             Cascalog home
      https://github.com/nathanmarz/cascalog
             More advanced uses: PageRank and component detection
      https://github.com/docteurZ/cascalog-contrib/tree/pagerank




Thanks!




      If you like this kind of problem, we’re hiring!
      Contact us at contact@linkfluence.net




 
Strata + Hadoop World: Jump Into the Data Lake with Hadoop-Scale Data Integra...
Strata + Hadoop World: Jump Into the Data Lake with Hadoop-Scale Data Integra...Strata + Hadoop World: Jump Into the Data Lake with Hadoop-Scale Data Integra...
Strata + Hadoop World: Jump Into the Data Lake with Hadoop-Scale Data Integra...SnapLogic
 
GraphQL IndyJS April 2016
GraphQL IndyJS April 2016GraphQL IndyJS April 2016
GraphQL IndyJS April 2016Brad Pillow
 
apidays LIVE Paris - The Rise of GraphQL for database APIs by Karthic Rao
apidays LIVE Paris - The Rise of GraphQL for database APIs by Karthic Raoapidays LIVE Paris - The Rise of GraphQL for database APIs by Karthic Rao
apidays LIVE Paris - The Rise of GraphQL for database APIs by Karthic Raoapidays
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNHortonworks
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014spinningmatt
 
wepik-enhancing-web-development-with-nodejs-graphql-a-practical-guide-2024041...
wepik-enhancing-web-development-with-nodejs-graphql-a-practical-guide-2024041...wepik-enhancing-web-development-with-nodejs-graphql-a-practical-guide-2024041...
wepik-enhancing-web-development-with-nodejs-graphql-a-practical-guide-2024041...AllstuffRj
 
Cascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User GroupCascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User Groupnathanmarz
 
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Hadoop User Group
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfishFei Dong
 
Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Anirudh Gangwar
 
GraphQL & DGraph with Go
GraphQL & DGraph with GoGraphQL & DGraph with Go
GraphQL & DGraph with GoJames Tan
 
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Data Con LA
 

Similaire à Using Cascalog and Hadoop for rapid graph processing and exploration (20)

Spark vs Hadoop
Spark vs HadoopSpark vs Hadoop
Spark vs Hadoop
 
Fabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on Flink
 
From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other tools
 
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
 
Introduction to Property Graph Features (AskTOM Office Hours part 1)
Introduction to Property Graph Features (AskTOM Office Hours part 1) Introduction to Property Graph Features (AskTOM Office Hours part 1)
Introduction to Property Graph Features (AskTOM Office Hours part 1)
 
Strata + Hadoop World: Jump Into the Data Lake with Hadoop-Scale Data Integra...
Strata + Hadoop World: Jump Into the Data Lake with Hadoop-Scale Data Integra...Strata + Hadoop World: Jump Into the Data Lake with Hadoop-Scale Data Integra...
Strata + Hadoop World: Jump Into the Data Lake with Hadoop-Scale Data Integra...
 
GraphQL IndyJS April 2016
GraphQL IndyJS April 2016GraphQL IndyJS April 2016
GraphQL IndyJS April 2016
 
apidays LIVE Paris - The Rise of GraphQL for database APIs by Karthic Rao
apidays LIVE Paris - The Rise of GraphQL for database APIs by Karthic Raoapidays LIVE Paris - The Rise of GraphQL for database APIs by Karthic Rao
apidays LIVE Paris - The Rise of GraphQL for database APIs by Karthic Rao
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
 
Module01
 Module01 Module01
Module01
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014
 
wepik-enhancing-web-development-with-nodejs-graphql-a-practical-guide-2024041...
wepik-enhancing-web-development-with-nodejs-graphql-a-practical-guide-2024041...wepik-enhancing-web-development-with-nodejs-graphql-a-practical-guide-2024041...
wepik-enhancing-web-development-with-nodejs-graphql-a-practical-guide-2024041...
 
Cascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User GroupCascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User Group
 
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfish
 
Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.
 
Stratio big data spain
Stratio   big data spainStratio   big data spain
Stratio big data spain
 
GraphQL & DGraph with Go
GraphQL & DGraph with GoGraphQL & DGraph with Go
GraphQL & DGraph with Go
 
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
 

Dernier

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 

Dernier (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 

Using Cascalog and Hadoop for rapid graph processing and exploration

  • 1. Graph Analysis at Linkfluence Why Cascalog Introduction to Cascalog Conclusion Using Cascalog and Hadoop for rapid graph processing and exploration Nils Grunwald and Hugo Zanghi Linkfluence 2012-02-05 - FOSDEM 2012 - Graph Devroom Nils Grunwald and Hugo Zanghi Linkfluence Using Cascalog and Hadoop for rapid graph processing and exploration
  • 2. Outline
      - Graph Analysis at Linkfluence
      - Why Cascalog
      - Introduction to Cascalog
      - Conclusion
  • 3. What we do at Linkfluence
      - Web data mining (blogs, media, etc.)
      - Social network data mining (Twitter, Facebook)
      - Use this data to build various search engines
      - Visualize the data with various UIs (Gephi, maps, etc.)
  • 4-6. What we get
      - Lots of nodes (users, pages, websites, words)
      - Lots of edges (hyperlinks, comments, RT, co-occurrences)
      - These datasets are interconnected (Twitter users link pages, words occur everywhere)
  • 7-8. The problem
      - Collecting and processing this data as a graph is not the primary goal of our system
      - But it is a very rich dataset that we want to explore for R&D purposes
  • 9-12. The constraints
      - The graph processing should not compromise the rest of the system
      - Low-maintenance
      - Used for queries and rapid prototyping
      - Flexible: it is hard to tell beforehand which fields or metadata will be used
  • 13-15. What is Cascalog
      - Built on top of Hadoop and Cascading (workflow management)
      - Inspired by the Datalog query syntax
      - Hosted on the JVM by the Clojure language
  • 16-19. Hadoop for reliability and scalability
      - Reliable and scalable
      - Everything is dumped in text files; we reuse our existing rsyslog infrastructure
      - We can reuse the existing Hadoop instances of our system
      - No need to know beforehand about indexed fields or to have data in a perfectly uniform format
  • 20-22. Datalog for rapid prototyping
      - Subset of Prolog
      - Declarative, expressive and very concise way of writing queries
      - Prolog has long been used for making queries over graphs
  • 23-25. Clojure for flexibility
      - Only one language and one file for queries and business logic
      - Tasks unrelated to data processing are possible inside the queries (resolving shortened links, for example)
      - Allows complex algorithms to be concisely expressed
  • 26-27. The downsides
      - Slow compared to in-memory computation or a non-distributed graph DB
      - Cannot do real-time processing
  • 28-30. Use-cases
      - Post-processing on a large number of edges
      - Filtering or transforming a dataset before exporting to Gephi or Neo4j
      - Back-processing old data with inconsistent fields and merging datasets from different sources
  • 31-35. Using Cascalog
      - Declarative syntax
      - Order of statements is arbitrary
      - Syntax is LISP-like
      - Operations are based on tuples
      - Possibility to control the flow with custom operators (filter, mapcat, etc.)
  • 36. Anatomy of a Cascalog query (Aggregation)
      Example (in-degree from cascalog.graph.core)

        ;; assumes (use 'cascalog.api) and (require '[cascalog.ops :as c])
        (defn in-degree               ;; just a normal function
          "computes the in degrees"   ;; docstring
          [edges]
          (<-                         ;; returns a cascalog query
            [?dst ?in_d]              ;; returned tuple
            (edges ?dst _)            ;; destructuring on a generator
            (:distinct false)
            (c/count :> ?in_d)))      ;; infers aggregation on ?dst
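
To make the aggregation concrete, here is a minimal plain-Clojure sketch of what the query computes, run on a small in-memory edge list instead of Hadoop. The sample data and the name `in-degree-local` are hypothetical and not part of the talk; the sketch assumes each edge tuple lists the destination first, as the `(edges ?dst _)` pattern on the slide suggests.

```clojure
;; Hypothetical sample data: each edge is a [dst src] pair, matching the
;; (edges ?dst _) pattern used on the slide.
(def sample-edges
  [["b" "a"] ["b" "c"] ["c" "a"] ["b" "a"]])

(defn in-degree-local
  "In-memory analogue of the in-degree query: groups edges by destination
  and counts every edge tuple in each group, repeats included."
  [edges]
  (frequencies (map first edges)))

;; Returns a map from destination node to its in-degree.
(in-degree-local sample-edges)
```

Grouping happens implicitly here through `frequencies`, much as the Cascalog query groups on `?dst` because it is the only non-aggregated output variable.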
  • 37. Anatomy of a Cascalog query (Filtering)
      Example (filtering on in-degree)

        (defn filtered-nodes [edges threshold]
          ;; compute in-degree as a subquery
          (let [in-degrees (in-degree edges)]
            (<- [?node-id ?in-deg]
                ;; filters on computed in-degree
                (> ?in-deg threshold)
                ;; uses previous subquery as a generator
                (in-degrees ?node-id ?in-deg))))
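
The filtering step can be sketched the same way in plain Clojure. This is an in-memory illustration under assumed data, not the Cascalog implementation; `sample-in-degrees` and `filtered-nodes-local` are hypothetical names.

```clojure
;; Hypothetical in-degree table: node id -> in-degree.
(def sample-in-degrees {"a" 1, "b" 3, "c" 5})

(defn filtered-nodes-local
  "Keeps only the nodes whose in-degree is strictly greater than threshold,
  mirroring the (> ?in-deg threshold) predicate of the query."
  [in-degrees threshold]
  (into {} (filter (fn [[_ deg]] (> deg threshold)) in-degrees)))

;; Keeps "b" and "c", drops "a".
(filtered-nodes-local sample-in-degrees 2)
```

As in the query, the comparison is strict, so a node whose in-degree equals the threshold is dropped.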
  • 38. Under the hood, this happens...
      Example (using custom filter ops)

        (deffilterop over-threshold [deg threshold]
          (> deg threshold))

        (defn filtered-nodes [edges threshold]
          (let [in-degrees (in-degree edges)]
            (<- [?node-id ?in-deg]
                (in-degrees ?node-id ?in-deg)
                ;; use custom operator
                (over-threshold ?in-deg threshold))))
  • 39. Anatomy of a Cascalog query (Join)
      Example (joining on heterogeneous datasets)

        ;; assumes java.net.URL is imported
        (defn get-website [url]
          (-> (URL. url) (.getHost)))

        (defn join-edges [backlinks rt]
          ;; join backlinks and retweets on the shared ?url variable
          (<- [?resolved]
              (backlinks ?src ?url)
              (rt _ ?url)
              (get-website ?url :> ?resolved)))
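
The join works because both generators mention the same variable `?url`. A rough in-memory equivalent can be sketched with `clojure.set/join`, which performs a natural join on the keys two relations share. The sample relations, the key names and `join-edges-local` are assumptions made for illustration only.

```clojure
(require '[clojure.set :as set])
(import 'java.net.URL)

(defn get-website
  "Extracts the host name from a URL string."
  [url]
  (.getHost (URL. url)))

;; Hypothetical sample relations; :url is the only key they share.
(def backlinks
  #{{:src "blog-a" :url "http://example.org/post/1"}
    {:src "blog-b" :url "http://example.com/page"}})

(def retweets
  #{{:user "alice" :url "http://example.org/post/1"}})

(defn join-edges-local
  "Natural join of the two relations on :url, then resolves each joined
  URL to its website, returning the set of distinct hosts."
  [backlinks retweets]
  (->> (set/join backlinks retweets)
       (map (comp get-website :url))
       set))

;; Only the URL present in both relations survives the join.
(join-edges-local backlinks retweets)
```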
  • 40. Further reading
      - Cascalog home: https://github.com/nathanmarz/cascalog
      - More advanced uses (PageRank and components detection): https://github.com/docteurZ/cascalog-contrib/tree/pagerank
  • 41. Thanks!
      If you like this kind of problem, we’re hiring! Contact us at contact@linkfluence.net