Graphs are becoming increasingly popular as a way of modeling a wide variety of systems. As such, the label "graph processing" also covers a range of objectives and architectural constraints. At [Linkfluence](http://us.linkfluence.net/), we use graph processing on datasets produced by very different systems (Web crawler, Twitter and Facebook APIs, feed aggregator, etc.). We spend a lot of time doing exploratory programming, trying to use our eclectic datasets in interesting ways, and processing our data in asynchronous workflows.
We have come to see [Hadoop](http://hadoop.apache.org/) and the processing framework [Cascalog](https://github.com/nathanmarz/cascalog) as essential tools in our toolbox when dealing with graphs, since they give us architectural flexibility, scalability and the possibility of rapid prototyping.
Cascalog is an open source framework built on top of Hadoop and [Cascading](http://www.cascading.org/). Its syntactic and semantic roots come from Datalog and Prolog, which have long been applied successfully to the manipulation of graphs. Its ability to directly embed the expressive [Clojure](http://clojure.org/) language also makes it very easy to define custom operations and ad-hoc processing.
In this talk, we will present the framework, consider its advantages and drawbacks compared to other approaches, and show concrete examples of its use for graph processing and of how we use it to complement graph databases.
Using Cascalog and Hadoop for rapid graph processing and exploration
Nils Grunwald and Hugo Zanghi
Linkfluence
2012-02-05 - FOSDEM 2012 - Graph Devroom
Outline
Graph Analysis at Linkfluence
Why Cascalog
Introduction to Cascalog
Conclusion
What we do at Linkfluence
Web data mining (blogs, media, etc.)
Social Network data mining (Twitter, Facebook)
Use this data to build various search engines
Visualize the data with various UI (Gephi, maps, etc.)
What we get
Lots of nodes (users, pages, websites, words)
Lots of edges (hyperlinks, comments, RT, co-occurrences)
These datasets are interconnected (Twitter users link to pages, words occur everywhere)
The problem
Collecting and processing this data as a graph is not the primary goal of our system
But it is a very rich dataset that we want to explore for R&D purposes
The constraints
The graph processing should not compromise the rest of the system
Low-maintenance
Used for queries and rapid prototyping
Flexible: it is hard to tell beforehand which fields or metadata will be used
What is Cascalog
Built on top of Hadoop and Cascading (workflow management)
Inspired by the Datalog query syntax
Embedded in the Clojure language, which runs on the JVM
Hadoop for reliability and scalability
Reliable and scalable
Everything is dumped in text files; we reuse our existing rsyslog infrastructure
We can reuse the existing Hadoop instances of our system
No need to decide beforehand which fields to index, or to have data in a perfectly uniform format
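Since the input is plain text, lines that do not fit the expected shape can simply be dropped at query time. The following is a minimal sketch, assuming the Cascalog 1.x `cascalog.api` namespace and a hypothetical tab-separated `source<TAB>destination` line format; `raw-lines`, `parse-edge` and `edges-from-lines` are illustrative names, not part of our system.

```clojure
(use 'cascalog.api)

;; Hypothetical sample standing in for a text file of log lines.
;; On a cluster this would typically be a tap such as (hfs-textline "some/path").
(def raw-lines
  [["blog-a\tblog-b"]
   ["blog-b\tblog-c"]
   ["garbage line without a tab"]])

(defmapcatop parse-edge
  [line]
  (let [fields (vec (.split line "\t"))]
    (if (= 2 (count fields))
      [fields]    ; well-formed line: emit one [source destination] tuple
      [])))       ; malformed line: emit nothing

(defn edges-from-lines
  "Query turning raw text lines into [source destination] edge tuples."
  [lines]
  (<- [?src ?dst]
      (lines ?line)
      (parse-edge ?line :> ?src ?dst)))

;; ??- runs the query and returns one result sequence per query
(def parsed-edges (first (??- (edges-from-lines raw-lines))))
```

Using a mapcat operation here means a malformed line produces zero tuples instead of an error, so old and inconsistent data can go through the same query.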
Datalog for rapid prototyping
Subset of Prolog
Declarative, expressive and very concise way of writing queries
Prolog has long been used for making queries over graphs
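To give an idea of what a declarative graph query looks like, here is a minimal sketch of a "paths of length two" query. It assumes the Cascalog 1.x `cascalog.api` namespace; the `edges` data and the `two-hop` name are hypothetical and only serve the illustration.

```clojure
(use 'cascalog.api)

;; Hypothetical in-memory edge list, one [source destination] tuple per edge
(def edges
  [["a" "b"]
   ["b" "c"]
   ["b" "d"]
   ["c" "d"]])

(defn two-hop
  "Pairs of nodes connected by a path of exactly two edges."
  [edges]
  (<- [?src ?dst]
      (edges ?src ?mid)     ; first hop
      (edges ?mid ?dst)))   ; second hop; sharing ?mid joins the two hops

;; run locally and collect the resulting tuples
(def two-hop-pairs (first (??- (two-hop edges))))
```

The query only states the relation between `?src`, `?mid` and `?dst`; the join on `?mid` is never written out explicitly.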
Clojure for flexibility
Only one language and one file for queries and business logic
Tasks unrelated to data processing are possible inside the queries (resolving shortened links, for example)
Allows complex algorithms to be concisely expressed
The downsides
Slow compared to in-memory computation or a non-distributed graph DB
Cannot do real-time processing
Use-cases
Post-processing on a large number of edges
Filtering or transforming a dataset before exporting to Gephi or Neo4j
Back-processing old data with inconsistent fields and merging datasets from different sources
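As an illustration of the export step, the sketch below renders edge tuples (for instance the result of a query) as CSV text with a `Source,Target` header row, a layout commonly used for edge lists imported into Gephi. The `edges->csv` name is hypothetical, and the code assumes node identifiers contain no commas or quotes, since it performs no escaping.

```clojure
(require '[clojure.string :as str])

(defn edges->csv
  "Renders [source target] pairs as CSV text with a Source,Target header row.
   No quoting or escaping is done (assumption: identifiers are plain tokens)."
  [edges]
  (str/join "\n"
            (cons "Source,Target"
                  (for [[src dst] edges]
                    (str src "," dst)))))

;; example call on two hypothetical edges
(def csv-text
  (edges->csv [["blog-a" "blog-b"]
               ["blog-b" "blog-c"]]))
```

The resulting string can then be written to a file with `spit` before loading it into the visualization tool.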
Using Cascalog
Declarative syntax
Order of statements is arbitrary
Syntax is LISP-like
Operations are based on tuples
Flow can be controlled with custom operators (filter, mapcat, etc.)
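The points above can be sketched with a custom mapcat operator that turns each text into several word tuples, followed by an aggregation. This assumes the Cascalog 1.x namespaces `cascalog.api` and `cascalog.ops`; the `posts` data and the `split-words` and `word-counts` names are hypothetical.

```clojure
(use 'cascalog.api)
(require '[cascalog.ops :as c])

;; Hypothetical input: one 1-tuple per post
(def posts
  [["graphs are fun"]
   ["graphs are everywhere"]])

(defmapcatop split-words
  [text]
  ;; emits one tuple per whitespace-separated word
  (seq (.split text "\\s+")))

(defn word-counts
  "Query counting how many times each word occurs across all posts."
  [posts]
  (<- [?word ?n]
      (c/count :> ?n)                  ; statements may come in any order
      (split-words ?text :> ?word)
      (posts ?text)))

;; run locally and collect the resulting tuples
(def counts (first (??- (word-counts posts))))
```

The statements are deliberately listed "backwards": the query is still planned from the generator to the aggregation, because Cascalog orders operations by the variables they need.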
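Anatomy of a Cascalog query (Aggregation)
Example (in-degree from cascalog.graph.core)
;; assumption: Cascalog 1.x namespaces, with cascalog.ops aliased as c
(use 'cascalog.api)
(require '[cascalog.ops :as c])

(defn in-degree              ;; just a normal function
  "Computes the in-degrees."   ;; docstring
  [edges]
  (<-                        ;; returns a Cascalog query
    [?dst ?in_d]             ;; returned tuple
    (edges ?dst _)           ;; destructuring on a generator: first field is bound to ?dst
    (:distinct false)        ;; keep duplicate tuples
    (c/count :> ?in_d)))     ;; infers aggregation on ?dst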
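A possible way to try this query locally is sketched below, assuming Cascalog 1.x. The `sample-edges` data is hypothetical. Note that, as written, the query groups on the first field of each tuple, so the sample tuples are laid out as `[destination source]`; this layout is an assumption about the input, not something the slide states.

```clojure
(use 'cascalog.api)
(require '[cascalog.ops :as c])

;; same query as on the slide
(defn in-degree
  "Computes the in-degrees."
  [edges]
  (<- [?dst ?in_d]
      (edges ?dst _)
      (:distinct false)
      (c/count :> ?in_d)))

;; Hypothetical edges, laid out as [destination source] (assumption):
;; "b" is pointed to by "a" and "c", and "c" is pointed to by "a"
(def sample-edges
  [["b" "a"]
   ["b" "c"]
   ["c" "a"]])

;; ??- runs the query and returns one result sequence per query
(def degrees (first (??- (in-degree sample-edges))))
```

Because `in-degree` returns a query rather than results, it can be executed as here or passed on as a generator to another query.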
Anatomy of a Cascalog query (Filtering)
Example (filtering on in-degree)
(defn filtered-nodes
  [edges threshold]
  ;; compute in-degree as a subquery
  (let [in-degrees (in-degree edges)]
    (<-
      [?node-id ?in-deg]
      ;; filters on computed in-degree
      (> ?in-deg threshold)
      ;; uses previous subquery as a generator
      (in-degrees ?node-id ?in-deg))))
Under the hood, this happens...
Example (using custom filter ops)
(deffilterop over-threshold
  [deg threshold]
  (> deg threshold))

(defn filtered-nodes
  [edges threshold]
  (let [in-degrees (in-degree edges)]
    (<-
      [?node-id ?in-deg]
      (in-degrees ?node-id ?in-deg)
      ;; use custom operator
      (over-threshold ?in-deg threshold))))
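Anatomy of a Cascalog query (Join)
Example (joining on heterogeneous datasets)
(import 'java.net.URL)   ;; needed for the URL. constructor

(defn get-website
  "Extracts the host name from a URL string."
  [url]
  (-> (URL. url)
      (.getHost)))

(defn join-edges
  [backlinks rt]
  ;; ?url appears in both generators, so the two datasets are joined on it
  (<-
    [?resolved]
    (backlinks ?src ?url)
    (rt _ ?url)
    (get-website ?url :> ?resolved)))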
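The join can be tried on small in-memory datasets, as in the sketch below. It assumes Cascalog 1.x; `sample-backlinks` and `sample-rt` are hypothetical stand-ins for the crawler and Twitter datasets, each holding `[source url]` tuples.

```clojure
(use 'cascalog.api)
(import 'java.net.URL)

(defn get-website
  "Extracts the host name from a URL string."
  [url]
  (.getHost (URL. url)))

;; same query as on the slide
(defn join-edges
  [backlinks rt]
  (<- [?resolved]
      (backlinks ?src ?url)
      (rt _ ?url)
      (get-website ?url :> ?resolved)))

;; Hypothetical data: pages linked from blogs, and pages linked from retweets
(def sample-backlinks
  [["blog-a" "http://example.com/post"]
   ["blog-b" "http://example.org/page"]])

(def sample-rt
  [["@alice" "http://example.com/post"]])

;; only URLs present in both datasets survive the join
(def shared-sites (first (??- (join-edges sample-backlinks sample-rt))))
```

Only the URL that appears in both datasets survives the join, and `get-website`, a plain Clojure function, is applied to it inside the query.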
Further reading
Cascalog home
https://github.com/nathanmarz/cascalog
More advanced uses: PageRank and component detection
https://github.com/docteurZ/cascalog-contrib/tree/pagerank
Thanks!
If you like this kind of problem, we're hiring!
Contact us at contact@linkfluence.net