SlideShare une entreprise Scribd logo
1  sur  36
Télécharger pour lire hors ligne
Large Scale Processing of Text
Suneel Marthi
DataWorks Summit 2017,
San Jose, California
@suneelmarthi
$WhoAmI
● Principal Software Engineer in the Office of Technology, Red Hat
● Member of Apache Software Foundation
● Committer and PMC member on Apache Mahout, Apache OpenNLP, Apache
Streams
What is a Natural Language?
What is a Natural Language?
Is any language that has evolved naturally in humans through
use and repetition without conscious planning or
premeditation
(From Wikipedia)
What is NOT a Natural Language?
Characteristics of Natural Language
Unstructured
Ambiguous
Complex
Hidden semantic
Ironic
Informal
Unpredictable
Rich
Most updated
Noise
Hard to search
and it holds most of human knowledge
and it holds most of human knowledge
and but it holds most of human knowledge
As information overload grows
ever worse, computers may
become our only hope for
handling a growing deluge of
documents.
MIT Press - May 12, 2017
What is Natural Language Processing?
NLP is a field of computer science, artificial intelligence and
computational linguistics concerned with the interactions
between computers and human (natural) languages, and, in
particular, concerned with programming computers to fruitfully
process large natural language corpora.(From Wikipedia)
???
How?
By solving small problems each time
A pipeline where an ambiguity type is solved, incrementally.
Sentence Detector
Mr. Robert talk is today at room num. 7. Let's go?
| | | | ❌
| | ✅
Tokenizer
Mr. Robert talk is today at room num. 7. Let's go?
|| | | | | | | || || | ||| | | ❌
| | | | | | | | || | | | | | ✅
By solving small problems each time
Each step of a pipeline solves one ambiguity problem.
Name Finder
<Person>Washington</Person> was the first president of the USA.
<Place>Washington</Place> is a state in the Pacific Northwest region
of the USA.
POS Tagger
Laura Keene brushed by him with the glass of water .
| | | | | | | | | | |
NNP NNP VBD IN PRP IN DT NN IN NN .
By solving small problems each time
A pipeline can be long and resolve many ambiguities
Lemmatizer
He is better than many others
| | | | | |
He be good than many other
Apache OpenNLP
Apache OpenNLP
Mature project (> 10 years)
Actively developed
Machine learning
Java
Easy to train
Highly customizable
Fast
Language Detector (soon)
Sentence detector
Tokenizer
Part of Speech Tagger
Lemmatizer
Chunker
Parser
....
Training Models for English
Corpus - OntoNotes (https://catalog.ldc.upenn.edu/ldc2013t19)
bin/opennlp TokenNameFinderTrainer.ontonotes -lang eng -ontoNotesDir
~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-ontonotes.bin
bin/opennlp POSTaggerTrainer.ontonotes -lang eng -ontoNotesDir
~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-maxent.bin
Training Models for Portuguese
Corpus - Amazonia (http://www.linguateca.pt/floresta/corpus.html)
bin/opennlp TokenizerTrainer.ad -lang por -data amazonia.ad -model por-tokenizer.bin -detokenizer
lang/pt/tokenizer/pt-detokenizer.xml -encoding ISO-8859-1
bin/opennlp POSTaggerTrainer.ad -lang por -data amazonia.ad -model por-pos.bin -encoding
ISO-8859-1 -includeFeatures false
bin/opennlp ChunkerTrainerME.ad -lang por -data amazonia.ad -model por-chunk.bin -encoding
ISO-8859-1
bin/opennlp TokenNameFinderTrainer.ad -lang por -data amazonia.ad -model por-ner.bin -encoding
ISO-8859-1
Name Finder API - Detect Names
NameFinderME nameFinder = new NameFinderME(new
TokenNameFinderModel(
OpenNLPMain.class.getResource("/opennlp-models/por-ner.bin”)));
for (String document[][] : documents) {
for (String[] sentence : document) {
Span nameSpans[] = nameFinder.find(sentence);
// do something with the names
}
nameFinder.clearAdaptiveData()
}
Name Finder API - Train a model
ObjectStream<String> lineStream =
new PlainTextByLineStream(new
FileInputStream("en-ner-person.train"), StandardCharsets.UTF8);
TokenNameFinderModel model;
try (ObjectStream<NameSample> sampleStream = new
NameSampleDataStream(lineStream)) {
model = NameFinderME.train("en", "person", sampleStream,
TrainingParameters.defaultParams(),
TokenNameFinderFactory nameFinderFactory);
}
model.serialize(modelFile);
Name Finder API - Evaluate a model
TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(new
NameFinderME(model));
evaluator.evaluate(sampleStream);
FMeasure result = evaluator.getFMeasure();
System.out.println(result.toString());
Name Finder API - Cross Evaluate a model
FileInputStream sampleDataIn = new FileInputStream("en-ner-person.train");
ObjectStream<NameSample> sampleStream = new
PlainTextByLineStream(sampleDataIn.getChannel(),
StandardCharsets.UTF_8);
TokenNameFinderCrossValidator evaluator = new
TokenNameFinderCrossValidator("en", 100, 5);
evaluator.evaluate(sampleStream, 10);
FMeasure result = evaluator.getFMeasure();
System.out.println(result.toString());
Language
Detector
Sentence
Detector
Tokenizer
POS
Tagger
Lemmatizer
Name
Finder
Chunker
Language 1
Language 2
Language N
Index
.
.
.
Apache Flink
Apache Flink
Mature project - 320+ contributors, > 11K commits
Very Active project on Github
Java/Scala
Streaming first
Fault-Tolerant
Scalable - to 1000s of nodes and more
High Throughput, Low Latency
Apache Flink - Pos Tagger and NER
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> portugeseText =
env.readTextFile(OpenNLPMain.class.getResource(
"/input/por_newscrawl.txt").getFile());
DataStream<String> engText = env.readTextFile(
OpenNLPMain.class.getResource("/input/eng_news.txt").getFile());
DataStream<String> mergedStream = inputStream.union(portugeseText);
SplitStream<Tuple2<String, String>> langStream = mergedStream.split(new
LanguageSelector());
Apache Flink - Pos Tagger and NER
DataStream<Tuple2<String, String>> porNewsArticles = langStream.select("por");
DataStream<Tuple2<String, String[]>> porNewsTokenized = porNewsArticles.map(new
PorTokenizerMapFunction());
DataStream<POSSample> porNewsPOS = porNewsTokenized.map(new
PorPOSTaggerMapFunction());
DataStream<NameSample> porNewsNamedEntities = porNewsTokenized.map(new
PorNameFinderMapFunction());
Apache Flink - Pos Tagger and NER
private static class LanguageSelector implements OutputSelector<Tuple2<String, String>> {
public Iterable<String> select(Tuple2<String, String> s) {
List<String> list = new ArrayList<>();
list.add(languageDetectorME.predictLanguage(s.f1).getLang());
return list;
}
}
private static class PorTokenizerMapFunction implements MapFunction<Tuple2<String, String>,
Tuple2<String, String[]>> {
public Tuple2<String, String[]> map(Tuple2<String, String> s) {
return new Tuple2<>(s.f0, porTokenizer.tokenize(s.f0));
}
}
Apache Flink - Pos Tagger and NER
private static class PorPOSTaggerMapFunction implements MapFunction<Tuple2<String, String[]>,
POSSample> {
public POSSample map(Tuple2<String, String[]> s) {
String[] tags = porPosTagger.tag(s.f1);
return new POSSample(s.f0, s.f1, tags);
}
}
private static class PorNameFinderMapFunction implements MapFunction<Tuple2<String, String[]>,
NameSample> {
public NameSample map(Tuple2<String, String[]> s) {
Span[] names = engNameFinder.find(s.f1);
return new NameSample(s.f0, s.f1, names, null, true);
}
}
What’s Coming ??
What’s Coming ??
● DL4J: Mature Project: 114 contributors, ~8k commits
● Modular: Tensor library, reinforcement learning, ETL,..
● Focused on integrating with JVM ecosystem while
supporting state of the art like gpus on large clusters
● Implements most neural nets you’d need for language
● Named Entity Recognition using DL4J with LSTMs
● Language Detection using DL4J with LSTMs
● Possible: Translation using Bidirectional LSTMs with embeddings
● Computation graph architecture for more advanced use cases
Credits
Joern Kottmann — PMC Chair, Apache OpenNLP
Tommaso Teofili --- PMC - Apache Lucene, Apache OpenNLP
William Colen --- Head of Technology, Stilingue - Inteligência Artificial,
Sao Paulo, Brazil
PMC - Apache OpenNLP
Till Rohrmann --- Engineering Lead, Data Artisans, Berlin, Germany
Committer and PMC, Apache Flink
Fabian Hueske --- Data Artisans, Committer and PMC on Apache Flink
Questions ???

Contenu connexe

Tendances

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...jaxLondonConference
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...Spark Summit
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka PyCon HK 2018 - Heterogeneous job processing with Apache Kafka
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka Hua Chu
 
PyConline AU 2021 - Things might go wrong in a data-intensive application
PyConline AU 2021 - Things might go wrong in a data-intensive applicationPyConline AU 2021 - Things might go wrong in a data-intensive application
PyConline AU 2021 - Things might go wrong in a data-intensive applicationHua Chu
 
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Databricks
 
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskAUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskVíctor Zabalza
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsVíctor Zabalza
 
LCA13: Hadoop DFS Performance
LCA13: Hadoop DFS PerformanceLCA13: Hadoop DFS Performance
LCA13: Hadoop DFS PerformanceLinaro
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Wes McKinney
 
Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Spark Summit
 
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...Spark Summit
 
20140120 presto meetup_en
20140120 presto meetup_en20140120 presto meetup_en
20140120 presto meetup_enOgibayashi
 

Tendances (18)

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
HDP2 and YARN operations point
HDP2 and YARN operations pointHDP2 and YARN operations point
HDP2 and YARN operations point
 
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka PyCon HK 2018 - Heterogeneous job processing with Apache Kafka
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
PyConline AU 2021 - Things might go wrong in a data-intensive application
PyConline AU 2021 - Things might go wrong in a data-intensive applicationPyConline AU 2021 - Things might go wrong in a data-intensive application
PyConline AU 2021 - Things might go wrong in a data-intensive application
 
Treasure Data and Fluentd
Treasure Data and FluentdTreasure Data and Fluentd
Treasure Data and Fluentd
 
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
 
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskAUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgets
 
LCA13: Hadoop DFS Performance
LCA13: Hadoop DFS PerformanceLCA13: Hadoop DFS Performance
LCA13: Hadoop DFS Performance
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...
 
London level39
London level39London level39
London level39
 
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
 
20140120 presto meetup_en
20140120 presto meetup_en20140120 presto meetup_en
20140120 presto meetup_en
 

En vedette

“Semantic PDF Processing & Document Representation”
“Semantic PDF Processing & Document Representation”“Semantic PDF Processing & Document Representation”
“Semantic PDF Processing & Document Representation”diannepatricia
 
Ontologies for Mental Health and Disease
Ontologies for Mental Health and DiseaseOntologies for Mental Health and Disease
Ontologies for Mental Health and DiseaseJanna Hastings
 
Pipeline for automated structure-based classification in the ChEBI ontology
Pipeline for automated structure-based classification in the ChEBI ontologyPipeline for automated structure-based classification in the ChEBI ontology
Pipeline for automated structure-based classification in the ChEBI ontologyJanna Hastings
 
Ontology-based Data Integration
Ontology-based Data IntegrationOntology-based Data Integration
Ontology-based Data IntegrationJanna Hastings
 
Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...
Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...
Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...semanticsconference
 
Using AI to Make Sense of Customer Feedback
Using AI to Make Sense of Customer FeedbackUsing AI to Make Sense of Customer Feedback
Using AI to Make Sense of Customer FeedbackAlyona Medelyan
 
Natural Language Processing and Graph Databases in Lumify
Natural Language Processing and Graph Databases in LumifyNatural Language Processing and Graph Databases in Lumify
Natural Language Processing and Graph Databases in LumifyCharlie Greenbacker
 
Entity-Relationship Extraction from Wikipedia Unstructured Text - Overview
Entity-Relationship Extraction from Wikipedia Unstructured Text - OverviewEntity-Relationship Extraction from Wikipedia Unstructured Text - Overview
Entity-Relationship Extraction from Wikipedia Unstructured Text - OverviewRadityo Eko Prasojo
 

En vedette (11)

“Semantic PDF Processing & Document Representation”
“Semantic PDF Processing & Document Representation”“Semantic PDF Processing & Document Representation”
“Semantic PDF Processing & Document Representation”
 
Ontologies for Mental Health and Disease
Ontologies for Mental Health and DiseaseOntologies for Mental Health and Disease
Ontologies for Mental Health and Disease
 
Ontology
OntologyOntology
Ontology
 
Pipeline for automated structure-based classification in the ChEBI ontology
Pipeline for automated structure-based classification in the ChEBI ontologyPipeline for automated structure-based classification in the ChEBI ontology
Pipeline for automated structure-based classification in the ChEBI ontology
 
Knowledge representation
Knowledge representationKnowledge representation
Knowledge representation
 
Ontology-based Data Integration
Ontology-based Data IntegrationOntology-based Data Integration
Ontology-based Data Integration
 
Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...
Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...
Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...
 
Using AI to Make Sense of Customer Feedback
Using AI to Make Sense of Customer FeedbackUsing AI to Make Sense of Customer Feedback
Using AI to Make Sense of Customer Feedback
 
Natural Language Processing and Graph Databases in Lumify
Natural Language Processing and Graph Databases in LumifyNatural Language Processing and Graph Databases in Lumify
Natural Language Processing and Graph Databases in Lumify
 
Entity-Relationship Extraction from Wikipedia Unstructured Text - Overview
Entity-Relationship Extraction from Wikipedia Unstructured Text - OverviewEntity-Relationship Extraction from Wikipedia Unstructured Text - Overview
Entity-Relationship Extraction from Wikipedia Unstructured Text - Overview
 
AI and the Future of Growth
AI and the Future of GrowthAI and the Future of Growth
AI and the Future of Growth
 

Similaire à Large Scale Text Processing with Apache OpenNLP and Apache Flink

Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...Apache OpenNLP
 
DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
DataFest 2017. Introduction to Natural Language Processing by Rudolf EremyanDataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyanrudolf eremyan
 
Ai meetup Neural machine translation updated
Ai meetup Neural machine translation updatedAi meetup Neural machine translation updated
Ai meetup Neural machine translation updated2040.io
 
The Mystery of Natural Language Processing
The Mystery of Natural Language ProcessingThe Mystery of Natural Language Processing
The Mystery of Natural Language ProcessingMahmood Aijazi, MD
 
Natural Language Processing - Research and Application Trends
Natural Language Processing - Research and Application TrendsNatural Language Processing - Research and Application Trends
Natural Language Processing - Research and Application TrendsShreyas Suresh Rao
 
Recent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingRecent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingSeth Grimes
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...alessio_ferrari
 
Open nlp presentationss
Open nlp presentationssOpen nlp presentationss
Open nlp presentationssChandan Deb
 
Python Intro For Managers
Python Intro For ManagersPython Intro For Managers
Python Intro For ManagersAtul Shridhar
 
AIMeetup #4: Neural-machine-translation
AIMeetup #4: Neural-machine-translationAIMeetup #4: Neural-machine-translation
AIMeetup #4: Neural-machine-translation2040.io
 
NLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in PythonNLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in Pythonshanbady
 
GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003butest
 
Using Stanza NLP and TensorFlow to create a summary of a book
Using Stanza NLP and TensorFlow to create a summary of a bookUsing Stanza NLP and TensorFlow to create a summary of a book
Using Stanza NLP and TensorFlow to create a summary of a bookOlusola Amusan
 
Distributed tracing with erlang/elixir
Distributed tracing with erlang/elixirDistributed tracing with erlang/elixir
Distributed tracing with erlang/elixirIvan Glushkov
 
Natural language processing: feature extraction
Natural language processing: feature extractionNatural language processing: feature extraction
Natural language processing: feature extractionGabriel Hamilton
 
Generative programming (mostly parser generation)
Generative programming (mostly parser generation)Generative programming (mostly parser generation)
Generative programming (mostly parser generation)Ralf Laemmel
 
Breaking down the AI magic of ChatGPT: A technologist's lens to its powerful ...
Breaking down the AI magic of ChatGPT: A technologist's lens to its powerful ...Breaking down the AI magic of ChatGPT: A technologist's lens to its powerful ...
Breaking down the AI magic of ChatGPT: A technologist's lens to its powerful ...rahul_net
 

Similaire à Large Scale Text Processing with Apache OpenNLP and Apache Flink (20)

Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
 
DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
DataFest 2017. Introduction to Natural Language Processing by Rudolf EremyanDataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
 
Ai meetup Neural machine translation updated
Ai meetup Neural machine translation updatedAi meetup Neural machine translation updated
Ai meetup Neural machine translation updated
 
The Mystery of Natural Language Processing
The Mystery of Natural Language ProcessingThe Mystery of Natural Language Processing
The Mystery of Natural Language Processing
 
Natural Language Processing - Research and Application Trends
Natural Language Processing - Research and Application TrendsNatural Language Processing - Research and Application Trends
Natural Language Processing - Research and Application Trends
 
Recent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingRecent Advances in Natural Language Processing
Recent Advances in Natural Language Processing
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...
 
Open nlp presentationss
Open nlp presentationssOpen nlp presentationss
Open nlp presentationss
 
Python Intro For Managers
Python Intro For ManagersPython Intro For Managers
Python Intro For Managers
 
AIMeetup #4: Neural-machine-translation
AIMeetup #4: Neural-machine-translationAIMeetup #4: Neural-machine-translation
AIMeetup #4: Neural-machine-translation
 
NLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in PythonNLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in Python
 
programming language.pdf
programming language.pdfprogramming language.pdf
programming language.pdf
 
GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003
 
Using Stanza NLP and TensorFlow to create a summary of a book
Using Stanza NLP and TensorFlow to create a summary of a bookUsing Stanza NLP and TensorFlow to create a summary of a book
Using Stanza NLP and TensorFlow to create a summary of a book
 
Distributed tracing with erlang/elixir
Distributed tracing with erlang/elixirDistributed tracing with erlang/elixir
Distributed tracing with erlang/elixir
 
Natural language processing: feature extraction
Natural language processing: feature extractionNatural language processing: feature extraction
Natural language processing: feature extraction
 
ppt
pptppt
ppt
 
Practical NLP with Lisp
Practical NLP with LispPractical NLP with Lisp
Practical NLP with Lisp
 
Generative programming (mostly parser generation)
Generative programming (mostly parser generation)Generative programming (mostly parser generation)
Generative programming (mostly parser generation)
 
Breaking down the AI magic of ChatGPT: A technologist's lens to its powerful ...
Breaking down the AI magic of ChatGPT: A technologist's lens to its powerful ...Breaking down the AI magic of ChatGPT: A technologist's lens to its powerful ...
Breaking down the AI magic of ChatGPT: A technologist's lens to its powerful ...
 

Plus de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

Dernier (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Large Scale Text Processing with Apache OpenNLP and Apache Flink

  • 1. Large Scale Processing of Text Suneel Marthi DataWorks Summit 2017, San Jose, California @suneelmarthi
  • 2. $WhoAmI ● Principal Software Engineer in the Office of Technology, Red Hat ● Member of Apache Software Foundation ● Committer and PMC member on Apache Mahout, Apache OpenNLP, Apache Streams
  • 3. What is a Natural Language?
  • 4. What is a Natural Language? Is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation (From Wikipedia)
  • 5. What is NOT a Natural Language?
  • 6. Characteristics of Natural Language Unstructured Ambiguous Complex Hidden semantic Ironic Informal Unpredictable Rich Most updated Noise Hard to search
  • 7. and it holds most of human knowledge
  • 8. and it holds most of human knowledge
  • 9. and but it holds most of human knowledge
  • 10. As information overload grows ever worse, computers may become our only hope for handling a growing deluge of documents. MIT Press - May 12, 2017
  • 11. What is Natural Language Processing? NLP is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.(From Wikipedia)
  • 12. ???
  • 13.
  • 14. How?
  • 15. By solving small problems each time A pipeline where an ambiguity type is solved, incrementally. Sentence Detector Mr. Robert talk is today at room num. 7. Let's go? | | | | ❌ | | ✅ Tokenizer Mr. Robert talk is today at room num. 7. Let's go? || | | | | | | || || | ||| | | ❌ | | | | | | | | || | | | | | ✅
  • 16. By solving small problems each time Each step of a pipeline solves one ambiguity problem. Name Finder <Person>Washington</Person> was the first president of the USA. <Place>Washington</Place> is a state in the Pacific Northwest region of the USA. POS Tagger Laura Keene brushed by him with the glass of water . | | | | | | | | | | | NNP NNP VBD IN PRP IN DT NN IN NN .
  • 17. By solving small problems each time A pipeline can be long and resolve many ambiguities Lemmatizer He is better than many others | | | | | | He be good than many other
  • 19. Apache OpenNLP Mature project (> 10 years) Actively developed Machine learning Java Easy to train Highly customizable Fast Language Detector (soon) Sentence detector Tokenizer Part of Speech Tagger Lemmatizer Chunker Parser ....
  • 20. Training Models for English Corpus - OntoNotes (https://catalog.ldc.upenn.edu/ldc2013t19) bin/opennlp TokenNameFinderTrainer.ontonotes -lang eng -ontoNotesDir ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-ontonotes.bin bin/opennlp POSTaggerTrainer.ontonotes -lang eng -ontoNotesDir ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-maxent.bin
  • 21. Training Models for Portuguese Corpus - Amazonia (http://www.linguateca.pt/floresta/corpus.html) bin/opennlp TokenizerTrainer.ad -lang por -data amazonia.ad -model por-tokenizer.bin -detokenizer lang/pt/tokenizer/pt-detokenizer.xml -encoding ISO-8859-1 bin/opennlp POSTaggerTrainer.ad -lang por -data amazonia.ad -model por-pos.bin -encoding ISO-8859-1 -includeFeatures false bin/opennlp ChunkerTrainerME.ad -lang por -data amazonia.ad -model por-chunk.bin -encoding ISO-8859-1 bin/opennlp TokenNameFinderTrainer.ad -lang por -data amazonia.ad -model por-ner.bin -encoding ISO-8859-1
  • 22. Name Finder API - Detect Names NameFinderME nameFinder = new NameFinderME(new TokenNameFinderModel( OpenNLPMain.class.getResource("/opennlp-models/por-ner.bin”))); for (String document[][] : documents) { for (String[] sentence : document) { Span nameSpans[] = nameFinder.find(sentence); // do something with the names } nameFinder.clearAdaptiveData() }
  • 23. Name Finder API - Train a model ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream("en-ner-person.train"), StandardCharsets.UTF8); TokenNameFinderModel model; try (ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream)) { model = NameFinderME.train("en", "person", sampleStream, TrainingParameters.defaultParams(), TokenNameFinderFactory nameFinderFactory); } model.serialize(modelFile);
  • 24. Name Finder API - Evaluate a model TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(new NameFinderME(model)); evaluator.evaluate(sampleStream); FMeasure result = evaluator.getFMeasure(); System.out.println(result.toString());
  • 25. Name Finder API - Cross Evaluate a model FileInputStream sampleDataIn = new FileInputStream("en-ner-person.train"); ObjectStream<NameSample> sampleStream = new PlainTextByLineStream(sampleDataIn.getChannel(), StandardCharsets.UTF_8); TokenNameFinderCrossValidator evaluator = new TokenNameFinderCrossValidator("en", 100, 5); evaluator.evaluate(sampleStream, 10); FMeasure result = evaluator.getFMeasure(); System.out.println(result.toString());
  • 28. Apache Flink Mature project - 320+ contributors, > 11K commits Very Active project on Github Java/Scala Streaming first Fault-Tolerant Scalable - to 1000s of nodes and more High Throughput, Low Latency
  • 29. Apache Flink - Pos Tagger and NER final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); DataStream<String> portugeseText = env.readTextFile(OpenNLPMain.class.getResource( "/input/por_newscrawl.txt").getFile()); DataStream<String> engText = env.readTextFile( OpenNLPMain.class.getResource("/input/eng_news.txt").getFile()); DataStream<String> mergedStream = inputStream.union(portugeseText); SplitStream<Tuple2<String, String>> langStream = mergedStream.split(new LanguageSelector());
  • 30. Apache Flink - Pos Tagger and NER DataStream<Tuple2<String, String>> porNewsArticles = langStream.select("por"); DataStream<Tuple2<String, String[]>> porNewsTokenized = porNewsArticles.map(new PorTokenizerMapFunction()); DataStream<POSSample> porNewsPOS = porNewsTokenized.map(new PorPOSTaggerMapFunction()); DataStream<NameSample> porNewsNamedEntities = porNewsTokenized.map(new PorNameFinderMapFunction());
  • 31. Apache Flink - Pos Tagger and NER private static class LanguageSelector implements OutputSelector<Tuple2<String, String>> { public Iterable<String> select(Tuple2<String, String> s) { List<String> list = new ArrayList<>(); list.add(languageDetectorME.predictLanguage(s.f1).getLang()); return list; } } private static class PorTokenizerMapFunction implements MapFunction<Tuple2<String, String>, Tuple2<String, String[]>> { public Tuple2<String, String[]> map(Tuple2<String, String> s) { return new Tuple2<>(s.f0, porTokenizer.tokenize(s.f0)); } }
  • 32. Apache Flink - Pos Tagger and NER private static class PorPOSTaggerMapFunction implements MapFunction<Tuple2<String, String[]>, POSSample> { public POSSample map(Tuple2<String, String[]> s) { String[] tags = porPosTagger.tag(s.f1); return new POSSample(s.f0, s.f1, tags); } } private static class PorNameFinderMapFunction implements MapFunction<Tuple2<String, String[]>, NameSample> { public NameSample map(Tuple2<String, String[]> s) { Span[] names = engNameFinder.find(s.f1); return new NameSample(s.f0, s.f1, names, null, true); } }
  • 34. What’s Coming ?? ● DL4J: Mature Project: 114 contributors, ~8k commits ● Modular: Tensor library, reinforcement learning, ETL,.. ● Focused on integrating with JVM ecosystem while supporting state of the art like gpus on large clusters ● Implements most neural nets you’d need for language ● Named Entity Recognition using DL4J with LSTMs ● Language Detection using DL4J with LSTMs ● Possible: Translation using Bidirectional LSTMs with embeddings ● Computation graph architecture for more advanced use cases
  • 35. Credits Joern Kottmann — PMC Chair, Apache OpenNLP Tommaso Teofili --- PMC - Apache Lucene, Apache OpenNLP William Colen --- Head of Technology, Stilingue - Inteligência Artificial, Sao Paulo, Brazil PMC - Apache OpenNLP Till Rohrmann --- Engineering Lead, Data Artisans, Berlin, Germany Committer and PMC, Apache Flink Fabian Hueske --- Data Artisans, Committer and PMC on Apache Flink