Text Mining with Node.js
Philipp Burckhardt
Carnegie Mellon University
Who am I?
Why re-invent the wheel?
Reasons for using Node.js
• JavaScript is the language of the Web
• Platform-agnostic (all operating systems, browsers, CLIs and desktop applications)
• V8 engine is fast enough to handle text mining tasks (faster than Python or R)
• Core streams can handle real-time data and large amounts of text
Drawback: Besides a few popular packages like natural, there is no ecosystem of good text mining modules yet.
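To make the point about core streams concrete, here is a minimal sketch that counts word frequencies from a readable stream chunk by chunk, so the whole text never has to sit in memory at once. The function name countWords and the sample chunks are illustrative assumptions, not part of the talk; for a real file one would typically pass fs.createReadStream( path, { encoding: 'utf8' } ) instead of the in-memory stream used here.

```javascript
const { Readable } = require( 'stream' );

// Count word frequencies from any readable stream, one chunk at a time.
// Assumes the chunks are strings (or buffers that do not split a multi-byte character).
async function countWords( stream ) {
    const counts = new Map();
    let rest = ''; // trailing partial word carried over to the next chunk

    const add = ( token ) => {
        const word = token.toLowerCase().replace( /[^a-z0-9']/g, '' );
        if ( word ) {
            counts.set( word, ( counts.get( word ) || 0 ) + 1 );
        }
    };

    for await ( const chunk of stream ) {
        const tokens = ( rest + chunk.toString() ).split( /\s+/ );
        rest = tokens.pop(); // the last piece may be an incomplete word
        tokens.forEach( add );
    }
    add( rest );
    return counts;
}

// Usage with an in-memory stream standing in for a large file or a live feed:
const demo = Readable.from( [ 'The quick ', 'bro', 'wn fox. The ', 'end' ] );
countWords( demo ).then( ( counts ) => console.log( counts ) );
```

Only the running counts and one partial word are kept between chunks, which is what keeps memory usage flat as the input grows.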
Use Case: deidentify
• Software for de-identification of protected health information in free-text medical record data
• Developed as part of a research project at CMU
The Challenge
Unstructured data might account for more than 80 percent of data collected.
Text Mining Overview
Typical Text Mining Tasks
• Sentiment analysis
• Cluster analysis and topic modeling: find hidden patterns or groupings in data
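As a rough illustration of what "finding groupings" can build on, the sketch below turns documents into bag-of-words vectors and compares them with cosine similarity, a common building block for clustering. This is an assumption about one possible approach, not the method used in the talk; the function names and sample sentences are made up.

```javascript
// Build a term-frequency map (bag of words) for one document.
function termFrequencies( text ) {
    const tf = new Map();
    for ( const token of text.toLowerCase().split( /[^a-z]+/ ) ) {
        if ( token ) {
            tf.set( token, ( tf.get( token ) || 0 ) + 1 );
        }
    }
    return tf;
}

// Cosine similarity between two term-frequency maps (0 = no shared terms, 1 = same direction).
function cosineSimilarity( a, b ) {
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for ( const [ term, count ] of a ) {
        dot += count * ( b.get( term ) || 0 );
        normA += count * count;
    }
    for ( const count of b.values() ) {
        normB += count * count;
    }
    if ( normA === 0 || normB === 0 ) {
        return 0;
    }
    return dot / Math.sqrt( normA * normB );
}

const docs = [
    'tax budget congress',
    'budget tax cuts',
    'peace war nations'
].map( termFrequencies );

console.log( cosineSimilarity( docs[ 0 ], docs[ 1 ] ) ); // shares "tax" and "budget"
console.log( cosineSimilarity( docs[ 0 ], docs[ 2 ] ) ); // shares no terms
```

Documents whose pairwise similarity is high would end up in the same group under most clustering schemes.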
Getting practical
Sentiment Analysis of "State of the Union" addresses by President Obama
const getSpeeches = require( '@stdlib/datasets/sotu-addresses' );
const words = require( '@stdlib/datasets/afinn-111' );
const tm = require( 'text-miner' );

// Convert the list of [ word, score ] pairs to a dictionary...
const len = words.length;
const dict = {};
for ( let i = 0; i < len; i++ ) {
    dict[ words[i][0] ] = words[i][1];
}

const obamaSpeeches = getSpeeches({
    'president': [ 'Barack Obama' ]
});

// Pre-processing: trim whitespace, convert to lowercase, remove punctuation...
const obamaCorpus = new tm.Corpus(
    obamaSpeeches.map( x => x.text )
)
    .trim()
    .toLower()
    .removeInterpunctuation();

// Calculate sentiments: sum the scores of all dictionary words in each speech...
const docs = obamaCorpus.getTexts();
const sentiments = [];
for ( let i = 0; i < docs.length; i++ ) {
    const tokens = docs[ i ].split( ' ' );
    let score = 0;
    for ( let j = 0; j < tokens.length; j++ ) {
        const val = dict[ tokens[ j ] ];
        if ( val ) { score += val; }
    }
    sentiments.push( score );
}
sentiments = [ 69, 47, 266, 75, 234, 234, 163, 157 ]
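The scoring idea can be tried without installing any packages. The following minimal sketch uses a tiny, made-up lexicon in place of AFINN-111 (the words and scores below are illustrative assumptions, not actual AFINN values) and a hypothetical helper sentimentScore that sums the scores of all known words in a text.

```javascript
// Hypothetical mini-lexicon; the scores are invented for illustration only.
const lexicon = new Map([
    [ 'good', 3 ],
    [ 'hope', 2 ],
    [ 'strong', 2 ],
    [ 'crisis', -3 ],
    [ 'war', -2 ]
]);

// Sum the lexicon scores of all words in a text; unknown words contribute 0.
function sentimentScore( text, lexicon ) {
    const tokens = text
        .toLowerCase()
        .replace( /[^a-z\s]/g, '' ) // strip punctuation and digits
        .split( /\s+/ );
    let score = 0;
    for ( const token of tokens ) {
        score += lexicon.get( token ) || 0;
    }
    return score;
}

console.log( sentimentScore( 'Good hope, strong!', lexicon ) );
console.log( sentimentScore( 'The crisis and the war.', lexicon ) );
```

Because the score is a plain sum, longer documents tend to produce larger absolute values, which is worth keeping in mind when comparing speeches of different lengths.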
Topic Modeling
Goal: find documents which share the same themes (e.g. politics, business, sports)
Latent Dirichlet Allocation
• Probabilistic model for text documents by Blei et al.
• Documents are assumed to have a distribution over topics
• Very popular because of its extensibility
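For readers who want the model spelled out, the standard (smoothed) LDA generative process can be written as follows. The symbols are the conventional ones from the literature and are not taken from the slides: alpha and eta are Dirichlet hyperparameters, d indexes documents, k topics, and n word positions.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic proportions of document } d \\
\beta_k &\sim \mathrm{Dirichlet}(\eta) && \text{word distribution of topic } k \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the $n$-th word in document } d \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,n}}) && \text{the observed word}
\end{align*}
```

Fitting the model means inferring the hidden quantities (the topic proportions, the per-topic word distributions, and the word-level topic assignments) from the observed words alone.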
const getSpeeches = require( '@stdlib/datasets/sotu-addresses' );
const lda = require( '@stdlib/nlp/latent-dirichlet-allocation' );
const tm = require( 'text-miner' );

const speeches = getSpeeches({ range: [ 1930, 2010 ] })
    .map( ( e ) => e.text );

// Pre-processing: lowercase, remove punctuation and English stop words...
let corpus = new tm.Corpus( speeches );
corpus = corpus
    .toLower()
    .removeInterpunctuation()
    .removeWords( tm.STOPWORDS.EN );

const docs = corpus.getTexts();

// lda( <Documents Array>, <Number of Topics> )
const model = lda( docs, 3 );

// model.fit( <Iterations>, <Burnin>, <Thinning> )
model.fit( 1000, 100, 10 );
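The three arguments to model.fit follow the usual vocabulary of sampling-based estimation: the first burn-in iterations are discarded, and afterwards only every thinning-th iteration is kept. The sketch below illustrates that convention with a hypothetical helper, retainedIterations; it assumes a Gibbs-style sampler, and the exact iterations a given library retains may differ by one or two at the boundaries.

```javascript
// List the iteration indices a sampler would keep under the common
// "discard burn-in, then keep every thinning-th draw" convention.
// Assumption: zero-based iteration indices; real libraries may differ slightly.
function retainedIterations( iterations, burnin, thinning ) {
    const kept = [];
    for ( let i = burnin; i < iterations; i += thinning ) {
        kept.push( i );
    }
    return kept;
}

const kept = retainedIterations( 1000, 100, 10 );
console.log( kept.length );        // number of retained samples
console.log( kept.slice( 0, 3 ) ); // first few retained iterations
```

Under this convention, fit( 1000, 100, 10 ) bases its estimates on 90 of the 1000 iterations; more thinning reduces correlation between retained samples at the cost of fewer samples.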
Results for SOTU addresses from 1930 to 2010
Topic  Words
1      world, peace, war, nations, free, people, great, nation, united, freedom, power, military, american, men, defense, time, forces, strength
2      america, people, american, years, americans, year, work, make, children, congress, tonight, time, tax, country, government, health, budget, care
3      government, year, federal, program, congress, economic, states, national, administration, million, policy, public, dollars, legislation, programs, billion, system, years, united, fiscal
Text Analysis using the Command Line
Rationale
• Data pipelines using UNIX shell commands
• Commands in a shell pipeline are processed in parallel
• Memory usage
  • V8 engine has a default limit of 1.76 GB on 64-bit machines (changeable via --max_old_space_size=<size>, with the size given in megabytes)
  • Use stream processing instead of batch processing to avoid high memory usage
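A minimal sketch of what such a pipeline stage might look like in Node.js: a Transform stream that lowercases text and strips punctuation, which can sit between other shell commands. The names createCleaner and heapLimitMB are hypothetical, and the sketch assumes ASCII-oriented input where chunk boundaries do not split a multi-byte character.

```javascript
const { Transform, Readable } = require( 'stream' );
const v8 = require( 'v8' );

// Create a stream that lowercases text and removes punctuation, chunk by chunk.
function createCleaner() {
    return new Transform({
        transform( chunk, encoding, callback ) {
            const cleaned = chunk
                .toString()
                .toLowerCase()
                .replace( /[^a-z0-9\s]/g, '' );
            callback( null, cleaned );
        }
    });
}

// Report the heap size limit of the current process in megabytes.
function heapLimitMB() {
    return v8.getHeapStatistics().heap_size_limit / ( 1024 * 1024 );
}

// As a command-line filter, the wiring would be:
//   process.stdin.pipe( createCleaner() ).pipe( process.stdout );

// Demo with an in-memory stream standing in for stdin:
Readable.from( [ 'Hello, World!\n', 'Node.js ROCKS\n' ] )
    .pipe( createCleaner() )
    .pipe( process.stdout );

console.error( 'Heap limit (MB): ' + Math.round( heapLimitMB() ) );
```

Saved under a hypothetical name such as clean.js with the stdin wiring enabled, the filter could be combined with standard tools, for example: cat speeches.txt | node clean.js | tr -s ' ' '\n' | sort | uniq -c. If a batch job does need more memory, starting it with node --max_old_space_size=4096 script.js raises the old-space limit to roughly 4 GB.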
LIVE DEMONSTRATION
