Latent Dirichlet allocation (LDA) topic modeling in javascript for node.js.

Overview

LDA

LDA is a machine learning algorithm that extracts topics and their related keywords from a collection of documents.

In LDA, a document may contain several different topics, each with its own related terms. The algorithm uses a probabilistic model to detect the number of topics you specify and to extract the keywords related to each. For example, a document may contain topics that could be classified as beach-related and weather-related. The beach topic may contain related words such as sand, ocean, and water. Similarly, the weather topic may contain related words such as sun, temperature, and clouds.

See http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Installation

$ npm install lda

Usage

var lda = require('lda');

// Example document.
var text = 'Cats are small. Dogs are big. Cats like to chase mice. Dogs like to eat bones.';

// Extract sentences.
var documents = text.match( /[^\.!\?]+[\.!\?]+/g );

// Run LDA to get terms for 2 topics (5 terms each).
var result = lda(documents, 2, 5);

The above example produces the following result with two topics (topic 1 is "cat-related", topic 2 is "dog-related"):

Topic 1
cats (0.21%)
dogs (0.19%)
small (0.1%)
mice (0.1%)
chase (0.1%)

Topic 2
dogs (0.21%)
cats (0.19%)
big (0.11%)
eat (0.1%)
bones (0.1%)
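When documents come from another source (for example, one per line of a file), blank entries can slip into the array. An issue reported for this package describes a crash when an empty string sits in the middle of the input. The following is a minimal sketch of a pre-filter that could be applied before calling lda; `dropEmptyDocuments` is a hypothetical helper and not part of the package.

```javascript
// Hypothetical helper (not part of the lda package): keep only non-blank strings.
function dropEmptyDocuments(documents) {
    // Accept null/undefined (e.g. a String.match() call that found nothing).
    return (documents || []).filter(function (doc) {
        return typeof doc === 'string' && doc.trim().length > 0;
    });
}

var raw = ['Lots of people have dogs as pets', '', '   ', 'One more sentence about dogs'];
var cleaned = dropEmptyDocuments(raw);

console.log(cleaned.length); // 2
```

The cleaned array can then be passed as the first argument, as in lda(cleaned, 2, 5).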

Output

LDA returns an array of topics, each containing an array of terms. The result has the following format:

[ [ { term: 'dogs', probability: 0.2 },
    { term: 'cats', probability: 0.2 },
    { term: 'small', probability: 0.1 },
    { term: 'mice', probability: 0.1 },
    { term: 'chase', probability: 0.1 } ],
  [ { term: 'dogs', probability: 0.2 },
    { term: 'cats', probability: 0.2 },
    { term: 'bones', probability: 0.11 },
    { term: 'eat', probability: 0.1 },
    { term: 'big', probability: 0.099 } ] ]

The result can be traversed as follows:

var result = lda(documents, 2, 5);

// For each topic.
result.forEach(function (row, i) {
	console.log('Topic ' + (i + 1));

	// For each term.
	row.forEach(function (term) {
		console.log(term.term + ' (' + term.probability + '%)');
	});

	console.log('');
});
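The nested result can also be condensed into one summary line per topic. The sketch below runs on hard-coded sample data in the documented format rather than on a live lda() call, and `summarizeTopics` is a hypothetical helper name, not a function of the package.

```javascript
// Sample data in the documented result format (hard-coded, not a live lda() call).
var sample = [
    [ { term: 'cats', probability: 0.21 }, { term: 'small', probability: 0.1 } ],
    [ { term: 'dogs', probability: 0.21 }, { term: 'big', probability: 0.11 } ]
];

// Hypothetical helper: build one summary string per topic.
function summarizeTopics(result) {
    return result.map(function (topic, index) {
        var terms = topic.map(function (entry) { return entry.term; });
        return 'Topic ' + (index + 1) + ': ' + terms.join(', ');
    });
}

console.log(summarizeTopics(sample).join('\n'));
// Topic 1: cats, small
// Topic 2: dogs, big
```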

Additional Languages

LDA uses stop-words to ignore common terms in the text (for example: this, that, it, we). By default, the English stop-words list is used. To use other languages, specify an array of language ids as follows:

// Use English (this is the default).
result = lda(documents, 2, 5, ['en']);

// Use German.
result = lda(documents, 2, 5, ['de']);

// Use English + German.
result = lda(documents, 2, 5, ['en', 'de']);

To add a new language-specific stop-words list, create a file /lda/lib/stopwords_XX.js, where XX is the id for the language. For example, a French stop-words list could be named "stopwords_fr.js". The contents of the file should follow the format of an existing stop-words list, as follows:

exports.stop_words = [
    'cette',
    'que',
    'une',
    'il'
];
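To illustrate what a stop-words list is for, the sketch below removes listed words from a tokenized sentence. It is an illustration of the general idea only: `removeStopWords` is a hypothetical helper, and the package's own filtering code may work differently.

```javascript
// Illustrative list in the same shape as the array exported by a stopwords_XX.js file.
var stopWords = ['cette', 'que', 'une', 'il'];

// Hypothetical helper: lower-case each token and drop those found in the list.
function removeStopWords(tokens, words) {
    return tokens
        .map(function (token) { return token.toLowerCase(); })
        .filter(function (token) { return words.indexOf(token) === -1; });
}

var kept = removeStopWords(['Il', 'mange', 'une', 'pomme'], stopWords);
console.log(kept); // [ 'mange', 'pomme' ]
```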

Setting a Random Seed

A specific random seed can be used to compute the same terms and probabilities on subsequent runs. You can specify the random seed as follows:

// Use the random seed 123. The null arguments are placeholders for the optional parameters that come before the seed.
result = lda(documents, 2, 5, null, null, null, 123);
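A fixed seed gives repeatable output because a seeded pseudo-random generator produces the same sequence of numbers every time it starts from the same seed. The sketch below shows this with mulberry32, a small general-purpose generator chosen here purely for illustration; it is an assumption and not necessarily the generator this package uses, and `createSeededRandom` is a hypothetical name.

```javascript
// Illustrative seeded generator (mulberry32); not taken from the lda package.
function createSeededRandom(seed) {
    var state = seed | 0;
    return function () {
        state = (state + 0x6D2B79F5) | 0;
        var t = Math.imul(state ^ (state >>> 15), 1 | state);
        t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
        // Scale the unsigned 32-bit value into the range [0, 1).
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    };
}

// Two generators started from the same seed yield identical sequences.
var first = createSeededRandom(123);
var second = createSeededRandom(123);
var a = [first(), first(), first()];
var b = [second(), second(), second()];

console.log(a.join(',') === b.join(',')); // true
```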

Author

Kory Becker http://www.primaryobjects.com

Based on original javascript implementation https://github.com/awaisathar/lda.js

Comments
  • Hi, this crashes for me

    Hello, and thank you for creating this package. I am trying to run lda on a batch of text documents and I get crashes of this sort:

    docs = [array of text documents]
    ldaResult = lda(docs, 10, 50)

    this.nw[this.documents[m][n]][topic]++;
    TypeError: Cannot read property '2' of undefined
        at [object Object].initialState (C:\web\node_modules\lda\lib\lda.js:143:46)
        at [object Object].gibbs (C:\web\node_modules\lda\lib\lda.js:161:14)

    Do you have an idea why this happens?

    bug 
    opened by ilanle 14
  • Request - mode for inferring single article topics

    Hello again. Let's say that we've gone through the corpus and created topic-word distributions. I want to use this output to tag single articles now. I know that the process is similar and iterative, except that it must not affect phi. I think such a function would be useful for people like myself. Thanks, Ilan

    enhancement 
    opened by ilanle 4
  • Process return void results

    When working with a large number of documents (~400), each about 500 B to 1 KB, I often get a [ [], [], [] ] result from the process method. I read the files (one document per line), parse them into an array, and pass that array as the sentences to lda. What I have found is that it works for a small number of documents (between 1 and 50); beyond that, the output becomes empty, like [ [], [], [] ].

    question 
    opened by loretoparisi 3
  • Support for multiple languages

    I'd love to have support for multiple languages for the stop words. I can provide some lists, but I just wanted to ask how this should be implemented. Maybe an initialiser where you can define your language? We probably don't want to run against a whole list of stop words containing words in every language.

    enhancement 
    opened by neugartf 3
  • Return the entire lda object

    The results can still be accessed as follows: var result = lda(documents, 2, 5).result;

    This change will allow me to use the getTheta() function of lda.

    enhancement 
    opened by chisingh 2
  • Same word appearing in different topics?

    Hello, thank you for this great tool.

    Is it normal that the same word appears in different topics? For example,

    Topic 1: topic text data
    Topic 2: modeling topic introduction

    question 
    opened by deemeetree 1
  • Error in function with empty string in middle of Array

    There is a bug in how the lda function is reading the Array values. If there is an empty string in the middle of the Array then an undefined error is thrown.

    Example:

    const lda = require('lda');
    
    const s1 = ['Lots of people have dogs as pets',
      'Another sentence about cats and dogs together, they get along',
      '',
      'One more sentence about dogs',
    ];
    
    const results = lda(s1, 3, 2);
    

    Result:

    /test/node_modules/lda/lib/lda.js:149
                var N = this.documents[m].length;
                                         ^
    
    TypeError: Cannot read property 'length' of undefined
        at initialState (/test/node_modules/lda/lib/lda.js:149:38)
    

    If you remove the empty string from the s1 array, or move it to be the last element in the Array then things work as expected.

    bug 
    opened by mikelax 1
  • Support for setting random seed.

    It would be super helpful if we could set a seed for the randomizer. This feature would give us predictable tests and better flexibility.

    While there are other libraries that could be used, there are many simple function implementations that would serve the purpose.

    For example: http://stackoverflow.com/a/19303725/1267536

    I can make the PR if you're willing to accept the feature.

    enhancement 
    opened by vangorra 1
  • Included a stopword list for spanish (es-ES)

    I've built a Spanish (es-ES) stop-word list and I have been using this for a while now. I would like to contribute the list to the project.

    Perhaps you might consider including a contributors section on your package.json. Also, please note that I didn't bump the minor on package.json version, I will leave that to you :)

    opened by rybnik 1
  • Update lda.js

    Fixes a bug where a word is not falsy but is stemmed to the falsy value "". In this situation the word is never added to the vocab array on line 36, because of the conditional on line 34. This pushes -1 into the documents 2D array, which later causes an error in initialState on line 141 when accessing index -1 in nw.

    Example: lda([":'\'s'"],1,1) (I am really unlikely to encounter this string outside of my own dataset, but it could affect other words.)

    Currently: throws an error on line 141 due to accessing a nonexistent element of the array.
    With this change: produces the correct result, [ [] ].

    opened by ojj11 0
  • Feature - API could allow the corpus to be created accretively

    I'd like to be able to add sentences to the corpus over time, and not all at once. Something like:

    var index = lda.addSentence('string') // returns an array index or unique id

    which could later be used as:

    var topicModel = lda.process(index, numTopics, termsPer)

    @primaryobjects do you have any thoughts on this?

    enhancement 
    opened by 0o-de-lally 2