This document contains the slides for a presentation on natural language processing (NLP) with .NET. The presentation introduces common NLP tasks like analysis, transformation, and generation. It discusses NLP concepts like bag-of-words, TF-IDF, n-grams, and word embeddings. Tools for NLP with .NET are presented, including ML.NET, Catalyst, and Microsoft Recognizers libraries. Demonstrations of text summarization and document tagging using these tools are described. The presentation concludes that NLP for basic tasks is possible with .NET libraries, though features are still limited compared to other languages.
.NET LEVEL UP
.NET CONFERENCE #1 IN UKRAINE KYIV 2019
2. About me
Sergiy Korzh
25+ years in software development
20 years running my own business
.NET developer since 2004
iForum.ua (technology section)
Projects:
EasyQuery (https://korzh.com/easyquery)
Easy.Report (http://easy.report)
Aistant (https://aistant.com/)
Twitter: @korzhs
LinkedIn: https://www.linkedin.com/in/korzh/
3. Agenda
1 Introduction to NLP (main tasks and basic concepts)
2 NLP Tools for .NET (and beyond)
3 Demos
4 Useful materials and conclusions
5. Why NLP on .NET?
Because we love .NET, right?
Quick and easy (for simple NLP tasks)
No “glue” code
6. Remarks
“Light” NLP tasks only!
No Deep Learning
Beginner level topics
7. NLP Tasks
1 Linguistic
2 Analysis
3 Transformation
4 Generation
8. NLP Tasks
1 Linguistic
• Segmentation
• Part of speech tagging
• Named-entity recognition
• Relation extraction
• Syntactic parsing
• Coreference resolution
• Semantic parsing
9. NLP Task Examples
2 Analysis
• Spam-filter
• Sentiment analysis
• Text similarity
• Information extraction
10. NLP Task Examples
3 Transformation
• Machine translation
• Speech to Text / Text to speech
• Grammar correction
• Text summarization
11. NLP Task Examples
4 Generation
• Question Answering
• Chat bots
• Story generation
12. NLP Pipeline
TEXT → Text Featurizing (Numeric representation) → ML Algorithm → RESULT
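The three stages of this pipeline can be sketched in a few lines of Python. This is a toy illustration only: the vocabulary, the hand-picked weights, and the function names are all assumptions made for the example, and the "ML algorithm" is a fixed linear score rather than a trained model.

```python
from collections import Counter

# Hypothetical vocabulary and weights, chosen by hand for illustration only.
VOCABULARY = ["free", "prize", "meeting", "report"]
WEIGHTS = [1.0, 1.5, -1.0, -1.0]


def featurize(text):
    """TEXT -> numeric representation: word counts over VOCABULARY."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in VOCABULARY]


def classify(features):
    """Stand-in for the ML algorithm: a linear score compared against zero."""
    score = sum(w * x for w, x in zip(WEIGHTS, features))
    return "spam" if score > 0 else "not spam"


def run_pipeline(text):
    """TEXT -> featurizing -> algorithm -> RESULT."""
    return classify(featurize(text))


print(run_pipeline("Free prize inside"))        # spam
print(run_pipeline("Meeting report attached"))  # not spam
```

In a real system the weights would be learned from labeled data; only the shape of the pipeline carries over.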
13. NLP Pipeline: Classic
(Diagram from the AYLIEN blog)
14. NLP Pipeline: Deep Learning
(Diagram from the AYLIEN blog)
15. NLP concepts: Bag of words
A way to represent text for ML algorithms
Encoding approaches:
• Word frequency
• One-hot encoding
• TF-IDF
• Other metrics
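A minimal sketch of two of these encodings, assuming whitespace tokenization and a vocabulary built from the documents themselves. The function names are hypothetical, and "one-hot" is interpreted here as a binary presence vector per document, which is one common reading of the term in a bag-of-words context.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect every distinct lower-cased word, in sorted order."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def word_frequency_vector(doc, vocabulary):
    """Word frequency encoding: how many times each vocabulary word occurs."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]


def binary_vector(doc, vocabulary):
    """Binary (one-hot style) encoding: 1 if the word occurs at all, else 0."""
    present = set(doc.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


docs = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(docs)

print(vocabulary)                                  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(word_frequency_vector(docs[0], vocabulary))  # [1, 0, 1, 1, 1, 2]
print(binary_vector(docs[1], vocabulary))          # [0, 1, 0, 0, 1, 1]
```

Word order is discarded in both encodings, which is what "bag" refers to.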
16. NLP concepts: TF-IDF
For a word-document pair, TF-IDF shows the importance of the word in the document relative to the whole collection of documents.
Used in all kinds of information retrieval tasks:
• Search
• Text mining
• Stop-words filtering
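One common variant of the measure is tf-idf(t, d) = tf(t, d) * log(N / df(t)), where tf is the relative frequency of term t in document d, N is the number of documents, and df(t) is the number of documents containing t. The sketch below assumes that variant and whitespace tokenization; other weighting schemes exist, and the function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


docs = ["the cat sat", "the dog sat", "the cat ran"]
scores = tf_idf(docs)

# "the" occurs in every document, so its score is 0.0 everywhere.
# "dog" occurs in only one document, so it scores highest there.
for term, score in sorted(scores[1].items(), key=lambda item: -item[1]):
    print(term, round(score, 3))
```

A word that appears in every document gets a score of zero, which is why the measure is useful for filtering stop words.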
17. NLP concepts: N-grams
An n-gram is a contiguous sequence of n items from a given sample of text.
Word N-grams
“I live in Kyiv” word bi-grams:
1. # I
2. I live
3. live in
4. in Kyiv
5. Kyiv #
Character N-grams
“I live in Kyiv” character bi-grams
1. #_
2. _I
3. I_
4. _l
5. li
6. iv
7. ve
8. . . .
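Both kinds of n-grams can be produced with simple slicing. In this sketch (function names are hypothetical) word n-grams are padded with "#" at the sentence boundaries as in the listing above, while character n-grams are taken without boundary padding and with spaces shown as "_", so the character output starts at "I_" rather than at a boundary marker.

```python
def word_ngrams(text, n=2, pad="#"):
    """Word n-grams with boundary padding on both sides."""
    tokens = [pad] * (n - 1) + text.split() + [pad] * (n - 1)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=2):
    """Character n-grams; spaces are shown as '_' for readability."""
    chars = text.replace(" ", "_")
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


print(word_ngrams("I live in Kyiv"))
# ['# I', 'I live', 'live in', 'in Kyiv', 'Kyiv #']

print(char_ngrams("I live in Kyiv")[:5])
# ['I_', '_l', 'li', 'iv', 've']
```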
18. NLP concepts: Word Embeddings
A set of techniques for mapping words (or phrases) to numeric vectors.
Words with similar meanings have “close” vectors.
word Vector
man [0.23, 0.56, …]
king [0.34, 0.16, …]
woman [0.41, 0.73, …]
queen [0.09, 0.62, …]
[king] – [man] + [woman] ≈ [queen]
Popular embedding algorithms:
Word2Vec
fastText
GloVe
. . .
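The idea of "close" vectors and of the king/queen arithmetic can be illustrated with cosine similarity. The three-dimensional vectors below are invented for the example (they are not the values from the table above, and not the output of Word2Vec, fastText, or GloVe); real embeddings have hundreds of dimensions and satisfy such analogies only approximately.

```python
import math

# Toy vectors made up for illustration: (male, female, royal).
EMBEDDINGS = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "prince": [1.0, 0.0, 0.8],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [word for word in EMBEDDINGS if word not in (a, b, c)]
    return max(candidates,
               key=lambda word: cosine_similarity(target, EMBEDDINGS[word]))


print(analogy("king", "man", "woman"))  # queen
print(cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["prince"]) >
      cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["woman"]))  # True
```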
19. NLP concepts: Language Model
A language model computes the probability of a word in a sequence.
Please, give me a … [ pen: 0.002, example: 0.0001, hand: 0.08, … ]
Where is it used? (spoiler: almost everywhere!)
• Machine translation
• Error correction
• Speech recognition
• Text generation
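The simplest language model is a bigram model, which estimates the probability of a word from the single word before it by counting. The sketch below assumes that approach with a three-sentence toy corpus invented for the example; the probabilities it prints are therefore unrelated to the figures quoted above, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for previous, word in zip(tokens, tokens[1:]):
            counts[previous][word] += 1
    return counts


def next_word_probabilities(counts, previous):
    """Maximum-likelihood estimate of P(word | previous word)."""
    following = counts.get(previous, Counter())
    total = sum(following.values())
    return {word: count / total for word, count in following.items()}


# Toy corpus, made up for illustration.
corpus = [
    "give me a hand",
    "give me a pen",
    "give me a hand please",
]
model = train_bigram_model(corpus)
probabilities = next_word_probabilities(model, "a")

for word, p in sorted(probabilities.items(), key=lambda item: -item[1]):
    print(word, round(p, 3))
# hand 0.667
# pen 0.333
```

A word never seen after "a" gets no probability at all here; practical models add smoothing or use longer contexts.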
20. NLP Tools
1 Online services
Azure Cognitive Services, IBM Watson, Amazon AI Services
2 Python libraries
NLTK, spaCy, scikit-learn, gensim, Pattern
3 .NET Libraries
ML.NET, Microsoft.Speech, Microsoft.Recognizers, Catalyst
21. .NET libs: ML.NET
https://dotnet.microsoft.com/apps/machinelearning-ai/ml-dotnet
Pros:
• Native for .NET (Core)
• Backed by Microsoft
• Super performant (at least MS says so)
• Extended with TensorFlow & more
NLP features:
• Text normalization
• Tokenizing
• N-grams
• Word embeddings
• Stop words removal
Cons:
• Limited NLP features
• English-only (mostly)
• Not convenient to use separately from an ML pipeline
22. .NET libs: Catalyst
NLP features:
• Text normalization
• Tokenizing
• POS-tagging
• Word embeddings
• Stop words removal
https://github.com/curiosity-ai/catalyst
Pros:
• Native for .NET (Core)
• Inspired by spaCy library
• Fast tokenizer
• Has pretrained models
• Lets you train your own models (based on the Universal Dependencies project)
Cons:
• Early beta (or even alpha). Version 0.0.2795
• English-only (mostly)
23. .NET libs: Microsoft.Recognizers
• Rule-based
• Recognizes numbers, units, date/time, etc
• Supports about 10 different languages
• Not only .NET (JavaScript, Python, Java)
• No support for Russian or Ukrainian
https://github.com/Microsoft/Recognizers-Text/
24. DEMO 1
Text summarization (extraction-based) using home-brewed NLP
TEXT → Detect language → Break into sentences → Tokenize and get stems → Bag of words → Similarity matrix → PageRank algorithm → Summary (top-rated sentences)
Bag of words:
        sentence1  sentence2  sentence3
stem1   1          3          5
stem2   0          2          4
stem3   3          4          0
stem4   2          0          2
Similarity matrix:
      S1     S2     S3
S1    0      1.21   0.2
S2    1.21   0      3.56
S3    0.2    3.56   0
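The flow of this demo can be approximated in a short Python sketch. It is not the code from the talk: language detection and stemming are skipped, sentences are split with a simple regular expression, and cosine similarity is assumed as the sentence similarity measure (the matrix values shown in the demo suggest a different, unnormalized measure). All function names are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(sentence):
    """Word counts for one sentence (no stemming in this sketch)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    if dot == 0:
        return 0.0
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


def page_rank(matrix, damping=0.85, iterations=50):
    """Power-iteration PageRank over a weighted similarity matrix."""
    n = len(matrix)
    ranks = [1.0 / n] * n
    row_sums = [sum(row) for row in matrix]
    for _ in range(iterations):
        ranks = [
            (1 - damping) / n + damping * sum(
                ranks[j] * matrix[j][i] / row_sums[j]
                for j in range(n) if row_sums[j] > 0
            )
            for i in range(n)
        ]
    return ranks


def summarize(text, top_n=1):
    """Return the top_n highest-ranked sentences in their original order."""
    sentences = split_sentences(text)
    bags = [bag_of_words(s) for s in sentences]
    n = len(sentences)
    matrix = [[0.0 if i == j else similarity(bags[i], bags[j])
               for j in range(n)] for i in range(n)]
    ranks = page_rank(matrix)
    top = sorted(range(n), key=lambda i: ranks[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]


text = ("Cats like milk. Dogs like bones. "
        "Cats and dogs like people. The weather is sunny.")
print(summarize(text))  # ['Cats and dogs like people.']
```

The third sentence wins because it shares words with both of the first two, so it collects the most rank from its neighbors; the unrelated fourth sentence receives none.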
35. Useful resources
Universal Dependencies
https://universaldependencies.org/
Lang-uk
http://lang.org.ua/uk/
All source code of this talk
https://github.com/korzh/Korzh.NLP
Math.NET – numerical computation algorithms for .NET
https://www.mathdotnet.com/
List of .NET libraries with some NLP features
http://tiny.cc/dotnet-nlp-libs
36. Conclusions
We can do NLP on .NET (for the basic tasks at least)
ML.NET library: good and reliable but limited NLP features
Catalyst library: looks promising but still has a way to go
Contribute!
37. Thank you!
Sergiy Korzh
Twitter: @korzhs
LinkedIn: https://www.linkedin.com/in/korzh/
Facebook: https://www.facebook.com/sergiy.korzh
Email: sergiy@korzh.com
Editor's notes
What kind of normalization?
How to get tokens?
What kind of n-grams are supported (word, character)?
What kind of word embeddings? Only English?
How to add my own stop-word removal?