Natural Language Processing with Neo4j

Natural Language
Processing with Neo4j
Kenny Bastani
@kennybastani

This is a hobby of mine
I’m passionate about it
It’s always a work in progress
I do it for fun

Machine Learning Focuses
•

Text mining

•

Natural Language Processing

•

Automatic summarization

•

Graph databases

•

Commitment to unsupervised
learning.

I wanted a better way to learn with less
effort
I wanted something a little more
zippy.
I’m mostly self-taught, so I wanted
something that made self-learning
easier for others.

The Idea
Articles

Contain

un

d

in

Found in

Fo

Phrases

Sentences

Importance of NLP
•

I’m inspired by the idea of
machines learning from
experience.

•

NLP is important for finding
valuable information in noisy
unstructured text.

•

I’m a Developer Evangelist for
Neo4j, so I’m kind of a fan of
graph databases.

Algorithms can learn
As long as it can store information and retrieve it in enough
time for it to be of any use.

Learning requires storage
To learn, storage is required.
For NLP, storage is sometimes a
second class citizen.
Much focus is on the algorithm first,
then storage second.
But really, it’s storage and retrieval
of big data that is the problem.

Machine learning
Machine learning isn’t magic or hard to understand. It’s real stuff.
We know how to do it.
It’s easily articulated.
ML algorithms solve big computational problems today.
It’s based on the idea of machines learning from prior experiences
as data.

Formulate a Hypothesis
When you analyze data, the
outcome is usually a hypothesis.
An hypothesis is a conclusion based
on limited data.
There are always more pieces
needed to solve the puzzle.

Build on Past Experience
By experience, I mean DATA.
Machine Learning techniques are
entirely based on collection and
analysis of recorded data.
So storage is really important if you
want to do machine learning
successfully.
You cannot play baseball without
your brain. Don’t try it.

The Problem with AI
The problem with AI is that it seems like
magic.
Some people say strong AI is possible.
There are some people that deny that it is
possible.
It is a central theme in many fictional
fantasy films and book genres.
It’s in Greek mythology.

Is AI Misunderstood?
Researchers admit to not fully
understanding how intelligence
works in the human brain.
We generally understand how it
works, but no consensus on how to
recreate it in machines.
AI is really just the act of perceiving
an environment and maximizing
chances of success.

You get the point.
•

Now why is a Graph Database useful for unsupervised
machine learning?

•

Let’s consider the problem I stated earlier.

•

I wanted to build a better way to summarize and
learn from Wikipedia’s combined knowledge.

Unsupervised Learning on
Wikipedia
Articles

Contain

un

d

in

Found in

Fo

Phrases

Sentences

How do you learn about
learning?
I started by observing myself learning from reading
Wikipedia articles.
I searched for an interesting term on Google.
I read through the article’s text word by word.

The Learning Algorithm
As I read the article’s text, I would sometimes come
across a phrase or term I had not seen before.
Before continuing reading I would open up a new tab
and search for the unrecognized phrase.
It was a well defined recursive algorithm.
I would drill down n-times on unrecognized article
terms until returning to the original article text.

A Self-Learning Algorithm
In the computer’s world, this process
would result in an ontology of labeled
data.
Which looks a lot like a graph.
But how would I store the results?
If only there were a database for that..

Neo4j is a graph database
…and graphs are everywhere!

Contains
Article

Phrase

un

d

in

Found in

Fo

Sentence

Simple Clustering Model

What about the NLP stuff?
This is how I did it.

The seed article

You start with a seed article which is the first article text
to start the learning algorithm with.

Fetch text from Wikipedia

Get the unstructured text and meta data from
Wikipedia.

Sliding text window
I formulated dynamic RegEx templates and treated
them as a hypothesis.
The RegEx template would slide word by word through
the text, searching for unrecognized phrases
(n known word matches + 1 wildcard word match)

Looking for redundant phrases
As each unrecognized phrase is encountered, the
dynamic RegEx is then matched against the entire
article’s text.
The algorithm looks for more than 2 identical phrases
within the article’s text.
It appends a 3rd wildcard word match to the template
and then rescans the text for redundant phrases until
none are found.

Identify Redundancy of Text
This recursive matching process within the local article’s
text resulted in finding the duplicate phrases of a
variable length.
“The King of Sweden” has 2 appearances in an article,
so that must be important to the topic of Sweden.
Better go search for an article stub on “The King of
Sweden”

Graph Storage and Retrieval
Every time a phrase that doesn’t exist as a node in
Neo4j is encountered, it becomes a target of
investigation, kind of like a hypothesis.
Each sentence that contains the extracted phrase is also
added to Neo4j as a content node.
Relationships are added between nodes, showing
semantic relationship.

Phrase inheritance

Phrases can be found within other phrases, denoting a
grammatical inheritance hierarchy mapped to a variety
of content nodes and articles.

Phrase Inheritance Graph Data
Model

Article

Contains

Sentence
“X MEN.”

Found in

Fo

un

d

in

Phrase
“X Y”

Found in

Found in

Phrase
“X”

Sentence
“X Y Z.”

Found in

Fo

u

nd

Phrase
“X Y Z”

in

Contains

Article

Graphs are everywhere.

Questions?

Thanks for coming to my talk!
Please look me up on Twitter and LinkedIn!
Twitter: http://www.twitter.com/kennybastani
LinkedIn: http://www.linkedin.com/in/kennybastani

Natural Language Processing with Neo4j

Recommended

Recommended

More Related Content

What's hot

What's hot (14)

Viewers also liked

Viewers also liked (20)

Similar to Natural Language Processing with Neo4j

Similar to Natural Language Processing with Neo4j (20)

More from Kenny Bastani

More from Kenny Bastani (9)

Recently uploaded

Recently uploaded (20)

Natural Language Processing with Neo4j

Editor's Notes