Today, more data is accumulated than ever before. It has been estimated that over 80% of data collected by businesses is unstructured, mostly in the form of free text. The statistical community has developed many tools for analysing textual data, both in the areas of exploratory data analysis (e.g. clustering methods) and predictive analytics. In this talk, Philipp Burckhardt will discuss tools and libraries that you can use today to perform text mining with Node.js. Creative strategies to overcome the limitations of the V8 engine in the areas of high-performance and memory-intensive computing will be discussed. You will be introduced to how you can use Node.js streams to analyse text in real-time, how to leverage native add-ons for performance-intensive code and how to build command-line interfaces to process text directly from the terminal.
4. Reasons for using Node.js
• JavaScript – the language of the Web
• Platform-agnostic (all operating systems, browsers, CLIs and desktop applications)
• V8 engine is fast enough to handle text mining tasks (faster than Python or R)
• Core streams can handle real-time data & large amounts of text
Drawback: besides a few popular packages like natural, there is no ecosystem of good text mining modules yet.
6. Use Case: deidentify
• Software for de-identification of protected health information in free-text medical record data
• Developed as part of a research project at CMU
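The deidentify tool's internals are not shown here; as a heavily simplified, hedged sketch of the general idea only, protected health information such as dates and phone numbers can be masked with regular expressions (the function name and patterns are illustrative — real de-identification systems rely on NLP models and dictionaries, not just regexes):

```javascript
// Illustrative sketch only, not the deidentify API.
function maskPHI( text ) {
	return text
		// Mask US-style phone numbers, e.g. 412-555-0123:
		.replace( /\b\d{3}[-.]\d{3}[-.]\d{4}\b/g, '[PHONE]' )
		// Mask dates such as 2017-01-02 or 01/02/2017:
		.replace( /\b\d{4}-\d{2}-\d{2}\b/g, '[DATE]' )
		.replace( /\b\d{1,2}\/\d{1,2}\/\d{4}\b/g, '[DATE]' );
}

console.log( maskPHI( 'Seen on 01/02/2017, call 412-555-0123.' ) );
// → Seen on [DATE], call [PHONE].
```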
16. Latent Dirichlet Allocation
• Probabilistic model for text documents by Blei et al.
• Each document is assumed to have a distribution over topics
• Very popular because of its extensibility
17. const getSpeeches = require( '@stdlib/datasets/sotu-addresses' );
const lda = require( '@stdlib/nlp/latent-dirichlet-allocation' );
const tm = require( 'text-miner' );

// Load State of the Union addresses from 1930 to 2010:
let speeches = getSpeeches({ range: [ 1930, 2010 ] })
	.map( ( e ) => e.text );

// Preprocess the corpus:
let corpus = new tm.Corpus( speeches );
corpus = corpus
	.toLower()
	.removeInterpunctuation()
	.removeWords( tm.STOPWORDS.EN );
let docs = corpus.getTexts();

// Signatures: lda( <documents array>, <number of topics> )
//             model.fit( <iterations>, <burn-in>, <thinning> )
let model = lda( docs, 3 );
model.fit( 1000, 100, 10 );
18. Results for SOTU addresses from 1930 to 2010
Topic 1: world, peace, war, nations, free, people, great, nation, united, freedom, power, military, american, men, defense, time, forces, strength
Topic 2: america, people, american, years, americans, year, work, make, children, congress, tonight, time, tax, country, government, health, budget, care
Topic 3: government, year, federal, program, congress, economic, states, national, administration, million, policy, public, dollars, legislation, programs, billion, system, years, united, fiscal
20. Rationale
• Data pipelines using UNIX shell commands
• Shell commands in a pipeline are processed in parallel
• Memory usage:
• V8 engine has a default heap limit of 1.76 GB on 64-bit machines (changeable via --max_old_space_size=<size>)
• Use stream processing instead of batch processing to avoid high memory usage