Using Cascalog and Hadoop for rapid graph processing and exploration


Nils Grunwald and Hugo Zanghi

Linkfluence

2012-02-05 - FOSDEM 2012 - Graph Devroom



Outline

Graph Analysis at Linkfluence

Why Cascalog

Introduction to Cascalog

Conclusion



What we do at Linkfluence

Web data mining (blogs, media, etc.)
Social network data mining (Twitter, Facebook)
Use this data to build various search engines
Visualize the data with various UIs (Gephi, maps, etc.)



What we get

Lots of nodes (users, pages, websites, words)
Lots of edges (hyperlinks, comments, RT, co-occurrences)
These datasets are interconnected (Twitter users link pages, words occur everywhere)



The problem

Collecting and processing this data as a graph is not the primary goal of our system
But it is a very rich dataset that we want to explore for R&D purposes



The constraints

The graph processing should not compromise the rest of the system
Low-maintenance
Used for queries and rapid prototyping
Flexible: it is hard to tell beforehand which fields or metadata will be used



What is Cascalog

Built on top of Hadoop and Cascading (workflow management)
Inspired by the Datalog query syntax
Hosted on the JVM by the Clojure language



Hadoop for reliability and scalability

Reliable and scalable
Everything is dumped in text files; we reuse our existing rsyslog infrastructure
We can reuse the existing Hadoop instances of our system
No need to know beforehand which fields are indexed, or to have data in a perfectly uniform format



Datalog for rapid prototyping

Subset of Prolog
Declarative, expressive and very concise way of writing queries
Prolog has long been used for making queries over graphs



Clojure for flexibility

Only one language and one file for queries and business logic
Tasks unrelated to data processing are possible inside the queries (resolving shortened links, for example)
Allows complex algorithms to be concisely expressed



The downsides

Slow compared to in-memory computation or a non-distributed graph DB
Cannot do real-time processing



Use-cases

Post-processing on a large number of edges
Filtering or transforming a dataset before exporting to Gephi or Neo4j
Back-processing old data with inconsistent fields and merging datasets from different sources



Using Cascalog

Declarative syntax
Order of statements is arbitrary
Syntax is Lisp-like
Operations are based on tuples
Possibility to control the flow with custom operators (filter, mapcat, etc.)



Anatomy of a Cascalog query (Aggregation)

Example (in-degree from cascalog.graph.core)
;; assumes cascalog.api is referred and cascalog.ops is aliased as c
(defn in-degree                ;; just a normal function
  "computes the in degrees"    ;; docstring
  [edges]
  (<-                          ;; returns a cascalog query
    [?dst ?in_d]               ;; returned tuple
    (edges ?dst _)             ;; destructuring on a generator
    (:distinct false)
    (c/count :> ?in_d)))       ;; infers aggregation on ?dst
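
To make the aggregation concrete, here is a minimal plain-Clojure sketch of what the query computes on a small in-memory edge list. It does not use Cascalog or Hadoop. The names `in-degree-local` and `edges` and the sample data are hypothetical, introduced only for illustration. As in the `(edges ?dst _)` pattern on the slide, the first field of each tuple is taken as the grouping key.

```clojure
;; Hypothetical sample data: the first field of each tuple plays the
;; role of ?dst in the (edges ?dst _) pattern from the slide.
(def edges
  [["b" "a"]
   ["b" "c"]
   ["c" "a"]])

(defn in-degree-local
  "Counts how many tuples share each first field, without deduplication
  (an in-memory analogue of (:distinct false) combined with c/count)."
  [edges]
  (frequencies (map first edges)))

(in-degree-local edges)
;; => {"b" 2, "c" 1}
```

The grouping is implicit in the Cascalog version: because `?dst` is the only non-aggregated output variable, the count is computed per `?dst`, which `frequencies` does explicitly here.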



Anatomy of a Cascalog query (Filtering)

Example (filtering on in-degree)
(defn filtered-nodes
  [edges threshold]
  ;; compute in-degree as a subquery
  (let [in-degrees (in-degree edges)]
    (<-
      [?node-id ?in-deg]
      ;; filters on computed in-degree
      (> ?in-deg threshold)
      ;; uses previous subquery as a generator
      (in-degrees ?node-id ?in-deg))))
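
The same two-step shape (compute degrees, then filter on them) can be sketched in plain Clojure on in-memory data. This is only an illustration of the logic, not the Cascalog execution model. `filtered-nodes-local` and the sample tuples are hypothetical names and data, and the first field of each tuple is again assumed to be the counted node.

```clojure
(defn filtered-nodes-local
  "Keeps the nodes whose degree is strictly greater than threshold.
  Returns a map of node id to degree."
  [edges threshold]
  ;; step 1: the 'subquery', computed eagerly as a map of node -> count
  (let [in-degrees (frequencies (map first edges))]
    ;; step 2: the filter, applied to each [node degree] entry
    (into {}
          (filter (fn [[_ deg]] (> deg threshold)))
          in-degrees)))

(filtered-nodes-local [["b" "a"] ["b" "c"] ["c" "a"]] 1)
;; => {"b" 2}
```

In the Cascalog version the order of the two clauses does not matter, whereas this sketch has to compute the degrees before filtering.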



Under the hood, this happens...

Example (using custom filter ops)
(deffilterop over-threshold
  [deg threshold]
  (> deg threshold))

(defn filtered-nodes
  [edges threshold]
  (let [in-degrees (in-degree edges)]
    (<-
      [?node-id ?in-deg]
      (in-degrees ?node-id ?in-deg)
      ;; use custom operator
      (over-threshold ?in-deg threshold))))



Anatomy of a Cascalog query (Join)

Example (joining on heterogeneous datasets)
(import 'java.net.URL)

(defn get-website
  [url]
  (-> (URL. url)
      (.getHost)))

(defn join-edges
  [backlinks rt]
  ;; implicit join: ?url appears in both generators
  (<-
    [?resolved]
    (backlinks ?src ?url)
    (rt _ ?url)
    (get-website ?url :> ?resolved)))
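
The join is implicit: using the same variable `?url` in both generators keeps only the URLs present in both datasets. A minimal plain-Clojure sketch of that logic on in-memory tuples follows. `join-edges-local` and the sample data are hypothetical, the tuples are assumed to be `[source url]` pairs, and the result is deduplicated here as an assumption about the desired output rather than a statement about Cascalog's defaults.

```clojure
(import 'java.net.URL)

(defn get-website
  "Extracts the host name from a URL string."
  [url]
  (.getHost (URL. url)))

(defn join-edges-local
  "Returns the distinct hosts of URLs that appear in both datasets."
  [backlinks rt]
  (let [rt-urls (set (map second rt))]   ; URLs seen in retweets
    (->> backlinks
         (map second)                    ; URLs seen in backlinks
         (filter rt-urls)                ; inner join on the URL
         (map get-website)
         distinct)))

;; Hypothetical sample data.
(def backlinks [["blog-a" "http://example.com/post/1"]
                ["blog-b" "http://example.org/about"]])
(def rt        [["user1" "http://example.com/post/1"]
                ["user2" "http://example.net/x"]])

(join-edges-local backlinks rt)
;; => ("example.com")
```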



Further reading

Cascalog home
https://github.com/nathanmarz/cascalog
More advanced uses: Pagerank and components detection
https://github.com/docteurZ/cascalog-contrib/tree/pagerank



Thanks!

If you like these kinds of problems, we're hiring!
Contact us at contact@linkfluence.net

