Latent Dirichlet allocation (LDA) topic modeling in JavaScript for Node.js.

Last update: Nov 4, 2022

Overview

LDA

LDA is a machine learning algorithm that extracts topics and their related keywords from a collection of documents.

In LDA, a document may contain several different topics, each with its own related terms. The algorithm uses a probabilistic model to detect the specified number of topics and extract their related keywords. For example, a document may contain topics that could be classified as beach-related and weather-related. The beach topic may contain related words, such as sand, ocean, and water. Similarly, the weather topic may contain related words, such as sun, temperature, and clouds.

See http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

$ npm install lda

Usage

var lda = require('lda');

// Example document.
var text = 'Cats are small. Dogs are big. Cats like to chase mice. Dogs like to eat bones.';

// Extract sentences.
var documents = text.match( /[^\.!\?]+[\.!\?]+/g );

// Run LDA to get terms for 2 topics (5 terms each).
var result = lda(documents, 2, 5);

The above example produces the following result with two topics (topic 1 is "cat-related", topic 2 is "dog-related"):

Topic 1
cats (0.21%)
dogs (0.19%)
small (0.1%)
mice (0.1%)
chase (0.1%)

Topic 2
dogs (0.21%)
cats (0.19%)
big (0.11%)
eat (0.1%)
bones (0.1%)
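
The sentence-splitting step in the usage example can be tried on its own. The sketch below wraps the same regular expression in a helper named splitSentences (a hypothetical name, not part of the lda package). It also trims the leading space the pattern leaves on each match, and it returns an empty array when nothing matches, since String.prototype.match returns null in that case.

```javascript
// Hypothetical helper for illustration; not part of the lda package.
function splitSentences(text) {
  // Same pattern as the usage example: a run of non-terminators followed by
  // one or more of '.', '!' or '?'.
  const matches = text.match(/[^.!?]+[.!?]+/g);
  // match() returns null when there is no match; fall back to an empty array.
  return (matches || []).map(function (sentence) {
    return sentence.trim();
  });
}

const sampleText = 'Cats are small. Dogs are big. Cats like to chase mice. Dogs like to eat bones.';
console.log(splitSentences(sampleText));
```

Note that this pattern drops any trailing text that has no terminating punctuation, so such a fragment would need separate handling.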

Output

LDA returns an array of topics, each containing an array of terms. The result has the following format:

[ [ { term: 'dogs', probability: 0.2 },
    { term: 'cats', probability: 0.2 },
    { term: 'small', probability: 0.1 },
    { term: 'mice', probability: 0.1 },
    { term: 'chase', probability: 0.1 } ],
  [ { term: 'dogs', probability: 0.2 },
    { term: 'cats', probability: 0.2 },
    { term: 'bones', probability: 0.11 },
    { term: 'eat', probability: 0.1 },
    { term: 'big', probability: 0.099 } ] ]

The result can be traversed as follows:

var result = lda(documents, 2, 5);

// For each topic.
result.forEach(function (row, i) {
	console.log('Topic ' + (i + 1));

	// For each term.
	row.forEach(function (term) {
		console.log(term.term + ' (' + term.probability + '%)');
	});

	console.log('');
});
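
The probability values in the sample result appear to be fractions between 0 and 1 (for example 0.2), so printing them next to a bare % sign may understate them by a factor of 100. As a minimal sketch, assuming that reading is right, the helper below (formatTopics, a hypothetical name) converts each probability to a whole-number percentage. It runs against a hard-coded sample in the documented result shape rather than calling lda, so it says nothing about what the library returns for other inputs.

```javascript
// Hard-coded sample in the documented result shape: an array of topics,
// each an array of { term, probability } objects.
const sample = [
  [{ term: 'dogs', probability: 0.2 }, { term: 'small', probability: 0.1 }],
  [{ term: 'cats', probability: 0.2 }, { term: 'bones', probability: 0.11 }]
];

// Hypothetical helper: returns one line per topic, with probabilities shown
// as whole-number percentages (assumes the values are fractions of 1).
function formatTopics(topics) {
  return topics.map(function (terms, i) {
    const parts = terms.map(function (t) {
      return t.term + ' (' + Math.round(t.probability * 100) + '%)';
    });
    return 'Topic ' + (i + 1) + ': ' + parts.join(', ');
  });
}

formatTopics(sample).forEach(function (line) {
  console.log(line);
});
```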

Additional Languages

LDA uses stop-words to ignore common terms in the text (for example: this, that, it, we). By default, the stop-words list uses English. To use additional languages, you can specify an array of language ids, as follows:

// Use English (this is the default).
result = lda(documents, 2, 5, ['en']);

// Use German.
result = lda(documents, 2, 5, ['de']);

// Use English + German.
result = lda(documents, 2, 5, ['en', 'de']);

To add a new language-specific stop-words list, create a file /lda/lib/stopwords_XX.js, where XX is the id for the language. For example, a French stop-words list could be named "stopwords_fr.js". The contents of the file should follow the format of an existing stop-words list, as follows:

exports.stop_words = [
    'cette',
    'que',
    'une',
    'il'
];
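
To illustrate what a stop-words list is for, here is a minimal sketch of filtering a sentence against a list in the format above. It is not the library's implementation: the removeStopWords name, the whitespace tokenizer, and the lower-casing step are all assumptions made for the example.

```javascript
// A list in the same shape that a stopwords_XX.js file would export.
const stopWords = ['cette', 'que', 'une', 'il'];

// Hypothetical helper for illustration; not taken from the lda source.
function removeStopWords(sentence, stopList) {
  const stopSet = new Set(stopList);
  return sentence
    .toLowerCase()
    .split(/\s+/) // naive whitespace tokenizer (an assumption)
    .filter(function (word) {
      return word.length > 0 && !stopSet.has(word);
    });
}

console.log(removeStopWords('Il dit que cette maison est une ruine', stopWords));
// [ 'dit', 'maison', 'est', 'ruine' ]
```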

Setting a Random Seed

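A specific random seed can be used to compute the same terms and probabilities during subsequent runs. The seed is passed as the seventh argument; the example passes null for the arguments it does not set. You can specify the random seed as follows: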

// Use the random seed 123.
result = lda(documents, 2, 5, null, null, null, 123);
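
Any seeded pseudo-random generator shows why a seed makes runs repeatable. The sketch below uses mulberry32, a small public-domain generator, purely as an illustration. It is an assumption for the example and is not claimed to be the generator the lda package uses internally; the sequence helper is likewise hypothetical.

```javascript
// mulberry32: a small seeded generator returning floats in [0, 1).
// Used here only to illustrate seeding; not taken from the lda source.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: the first n values produced for a given seed.
function sequence(seed, n) {
  const next = mulberry32(seed);
  const values = [];
  for (let i = 0; i < n; i++) {
    values.push(next());
  }
  return values;
}

// Two generators built from the same seed yield the same values in the same
// order, which is what makes a seeded run reproducible.
console.log(sequence(123, 3));
console.log(sequence(123, 3));
```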

Author

Kory Becker http://www.primaryobjects.com

Based on original javascript implementation https://github.com/awaisathar/lda.js

Comments

Hi, this crashes for me

Hello, and thank you for creating this package. I am trying to run lda on a batch of text documents and I get crashes of this sort:

docs = [array of text documents]
ldaResult = lda(docs, 10, 50)

this.nw[this.documents[m][n]][topic]++;
TypeError: Cannot read property '2' of undefined
    at [object Object].initialState (C:\web\node_modules\lda\lib\lda.js:143:46)
    at [object Object].gibbs (C:\web\node_modules\lda\lib\lda.js:161:14)

Do you have an idea why this happens?
bug

opened by ilanle 14
Request - mode for inferring single article topics

Hello again. Let's say that we've gone through the corpus and created topic-word distributions. I now want to use this output to tag single articles. I know that the process is similar and iterative, except that it must not affect phi. I think such a function would be useful for people like me. Thanks, Ilan
enhancement

opened by ilanle 4
Process returns void results

When working with a large number of documents (~400), each of them around 500b-1K, I often get a [ [], [], [] ] result from the process method. I read the files (one document per line), parse them into an array, and pass that array as the sentences to lda. I have found that it works for a small number of documents (between 1 and 50); beyond that, the output becomes void, like [ [], [], [] ].
question

opened by loretoparisi 3
Support for multiple languages

I'd love to have support for multiple languages for the stop words. I can provide some lists, but I just wanted to ask how this should be implemented. Maybe an initialiser where you can define your language? We probably don't want to run against a whole list of stop words containing words in every language.
enhancement

opened by neugartf 3
Return the entire lda object

The results can still be accessed as follows: var result = lda(documents, 2, 5).result;

This change will allow me to use the getTheta() function of lda.
enhancement

opened by chisingh 2
Same word appearing in different topics?

Hello, thank you for this great tool.

Is it normal that the same word appears in different topics? For example,

Topic 1: topic text data
Topic 2: modeling topic introduction
question

opened by deemeetree 1

Error in function with empty string in middle of Array

There is a bug in how the lda function is reading the Array values. If there is an empty string in the middle of the Array then an undefined error is thrown.

Example:

const lda = require('lda');

const s1 = ['Lots of people have dogs as pets',
  'Another sentence about cats and dogs together, they get along',
  '',
  'One more sentence about dogs'
];

const results = lda(s1, 3, 2);

Result:

/test/node_modules/lda/lib/lda.js:149
            var N = this.documents[m].length;
                                     ^

TypeError: Cannot read property 'length' of undefined
    at initialState (/test/node_modules/lda/lib/lda.js:149:38)

If you remove the empty string from the s1 array, or move it to be the last element in the Array then things work as expected.
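
Following the observation above, one possible workaround is to drop empty entries before calling lda. The helper below (removeEmptyDocuments, a hypothetical name) is a minimal sketch of that pre-filtering step. It does not change the library itself, and treating whitespace-only strings as empty is an extra assumption.

```javascript
// Hypothetical pre-filter; not part of the lda package.
function removeEmptyDocuments(docs) {
  return docs.filter(function (doc) {
    // Keep only strings that contain something other than whitespace.
    return typeof doc === 'string' && doc.trim().length > 0;
  });
}

const rawDocs = [
  'Lots of people have dogs as pets',
  'Another sentence about cats and dogs together, they get along',
  '',
  'One more sentence about dogs'
];

const cleanedDocs = removeEmptyDocuments(rawDocs);
console.log(cleanedDocs.length); // 3
// cleanedDocs can then be passed to lda in place of the original array.
```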

bug

opened by mikelax 1

Support for setting random seed.

It would be super helpful if we could set a seed for the randomizer. This feature would give us predictable tests and better flexibility.

While there are other libraries that could be used, there are many simple function implementations that would serve the purpose.

For example: http://stackoverflow.com/a/19303725/1267536

I can make the PR if you're willing to accept the feature.
enhancement

opened by vangorra 1
Included a stopword list for spanish (es-ES)

I've built a Spanish (es-ES) stop-word list and I have been using this for a while now. I would like to contribute the list to the project.

Perhaps you might consider including a contributors section on your package.json. Also, please note that I didn't bump the minor on package.json version, I will leave that to you :)

opened by rybnik 1
Update lda.js

Fixes a bug that occurs when a word is not falsy but is stemmed to the falsy value "". In this situation the word is never added to the vocab array on line 36, because of the conditional on line 34, so -1 is pushed into the documents 2D array. That later causes an error in initialState on line 141, when the -1th index of nw is accessed.

lda([":'\'s'"],1,1) (It is really unlikely that this string would be encountered outside of my own dataset, but it could possibly affect other words.)

Currently: throws an error on line 141 due to accessing a non-existent element of the array.
This code will produce the correct result: [ [] ]

opened by ojj11 0
Feature - API could allow the corpus to be created accretively

I'd like to be able to add sentences to the corpus over time, and not all at once. Something like:

var index = lda.addSentence('string') // returns an array index or unique id

which could later be used with:

var topicModel = lda.process(index, numTopics, termsPer)

@primaryobjects do you have any thoughts on this?
enhancement

opened by 0o-de-lally 2