Tiny and powerful JavaScript full-text search engine for browser and Node

Last update: Jan 3, 2023

Overview

MiniSearch

MiniSearch is a tiny but powerful in-memory fulltext search engine written in JavaScript. It is respectful of resources, and it can comfortably run both in Node and in the browser.

Try out the demo application.

Find the complete documentation and API reference here, and more background about MiniSearch, including a comparison with other similar libraries, in this blog post.

Use case

MiniSearch addresses use cases where full-text search features are needed (e.g. prefix search, fuzzy search, ranking, boosting of fields…), but the data to be indexed can fit locally in the process memory. While you won't index the whole Internet with it, there are surprisingly many use cases that are served well by MiniSearch. By storing the index in local memory, MiniSearch can work offline, and can process queries quickly, without network latency.

A prominent use-case is real time search "as you type" in web and mobile applications, where keeping the index on the client enables fast and reactive UIs, removing the need to make requests to a search server.

Features

Memory-efficient index, designed to support memory-constrained use cases like mobile browsers.
Exact match, prefix search, fuzzy match, field boosting
Auto-suggestion engine, for auto-completion of search queries
Documents can be added and removed from the index at any time
Zero external dependencies

MiniSearch strives to expose a simple API that provides the building blocks to build custom solutions, while keeping a small and well tested codebase.

Installation

With npm:

npm install --save minisearch

With yarn:

yarn add minisearch

Then require or import it in your project:

// If you are using import:
import MiniSearch from 'minisearch'

// If you are using require:
const MiniSearch = require('minisearch')

Alternatively, if you prefer to use a <script> tag, you can require MiniSearch from a CDN:

<script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/umd/index.min.js"></script>

In this case, MiniSearch will appear as a global variable in your project.

Finally, if you want to manually build the library, clone the repository and run yarn build (or yarn build-minified for a minified version + source maps). The compiled source will be created in the dist folder (UMD, ES6 and ES2015 module versions are provided).

Usage

Basic usage

// A collection of documents for our examples
const documents = [
  {
    id: 1,
    title: 'Moby Dick',
    text: 'Call me Ishmael. Some years ago...',
    category: 'fiction'
  },
  {
    id: 2,
    title: 'Zen and the Art of Motorcycle Maintenance',
    text: 'I can see by my watch...',
    category: 'fiction'
  },
  {
    id: 3,
    title: 'Neuromancer',
    text: 'The sky above the port was...',
    category: 'fiction'
  },
  {
    id: 4,
    title: 'Zen and the Art of Archery',
    text: 'At first sight it must seem...',
    category: 'non-fiction'
  },
  // ...and more
]

let miniSearch = new MiniSearch({
  fields: ['title', 'text'], // fields to index for full-text search
  storeFields: ['title', 'category'] // fields to return with search results
})

// Index all documents
miniSearch.addAll(documents)

// Search with default options
let results = miniSearch.search('zen art motorcycle')
// => [
//   { id: 2, title: 'Zen and the Art of Motorcycle Maintenance', category: 'fiction', score: 2.77258, match: { ... } },
//   { id: 4, title: 'Zen and the Art of Archery', category: 'non-fiction', score: 1.38629, match: { ... } }
// ]

Search options

MiniSearch supports several options for more advanced search behavior:

// Search only specific fields
miniSearch.search('zen', { fields: ['title'] })

// Boost some fields (here "title")
miniSearch.search('zen', { boost: { title: 2 } })

// Prefix search (so that 'moto' will match 'motorcycle')
miniSearch.search('moto', { prefix: true })

// Search within a specific category
miniSearch.search('zen', {
  filter: (result) => result.category === 'fiction'
})

// Fuzzy search, in this example, with a max edit distance of 0.2 * term length,
// rounded to nearest integer. The misspelled 'ismael' will match 'ishmael'.
miniSearch.search('ismael', { fuzzy: 0.2 })

// You can set the default search options upon initialization
miniSearch = new MiniSearch({
  fields: ['title', 'text'],
  searchOptions: {
    boost: { title: 2 },
    fuzzy: 0.2
  }
})
miniSearch.addAll(documents)

// It will now by default perform fuzzy search and boost "title":
miniSearch.search('zen and motorcycles')
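
To make the fractional fuzzy value concrete, here is a small standalone sketch of the rule stated in the comment above (0.2 * term length, rounded to the nearest integer). The helper names maxEditDistance and levenshtein are made up for this illustration and are not part of the MiniSearch API; the distance function is a textbook Levenshtein implementation, not the library's internal one.

```javascript
// Maximum edit distance for a fractional `fuzzy` value, following the rule
// described above: fuzzy * term length, rounded to the nearest integer.
function maxEditDistance (term, fuzzy) {
  return Math.round(fuzzy * term.length)
}

// Textbook Levenshtein distance (illustrative helper, not MiniSearch internals)
function levenshtein (a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
    }
    prev = curr
  }
  return prev[b.length]
}

console.log(maxEditDistance('ismael', 0.2)) // 1 (0.2 * 6 = 1.2, rounded)
console.log(levenshtein('ismael', 'ishmael')) // 1 (one inserted 'h')
```

Since the distance between 'ismael' and 'ishmael' does not exceed the allowed maximum of 1, the term qualifies as a fuzzy match under this rule.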

Auto suggestions

MiniSearch can suggest search queries given an incomplete query:

miniSearch.autoSuggest('zen ar')
// => [ { suggestion: 'zen archery art', terms: [ 'zen', 'archery', 'art' ], score: 1.73332 },
//      { suggestion: 'zen art', terms: [ 'zen', 'art' ], score: 1.21313 } ]

The autoSuggest method takes the same options as the search method, so you can get suggestions for misspelled words using fuzzy search:

miniSearch.autoSuggest('neromancer', { fuzzy: 0.2 })
// => [ { suggestion: 'neuromancer', terms: [ 'neuromancer' ], score: 1.03998 } ]

Suggestions are ranked by the relevance of the documents that would be returned by that search.

Sometimes, you might need to filter auto suggestions to, say, only a specific category. You can do so by providing a filter option:

miniSearch.autoSuggest('zen ar', {
  filter: (result) => result.category === 'fiction'
})
// => [ { suggestion: 'zen art', terms: [ 'zen', 'art' ], score: 1.21313 } ]

Field extraction

By default, documents are assumed to be plain key-value objects with field names as keys and field values as simple values. In order to support custom field extraction logic (for example for nested fields, or non-string field values that need processing before tokenization), a custom field extractor function can be passed as the extractField option:

// Assuming that our documents look like:
const documents = [
  { id: 1, title: 'Moby Dick', author: { name: 'Herman Melville' }, pubDate: new Date(1851, 9, 18) },
  { id: 2, title: 'Zen and the Art of Motorcycle Maintenance', author: { name: 'Robert Pirsig' }, pubDate: new Date(1974, 3, 1) },
  { id: 3, title: 'Neuromancer', author: { name: 'William Gibson' }, pubDate: new Date(1984, 6, 1) },
  { id: 4, title: 'Zen in the Art of Archery', author: { name: 'Eugen Herrigel' }, pubDate: new Date(1948, 0, 1) },
  // ...and more
]

// We can support nested fields (author.name) and date fields (pubDate) with a
// custom `extractField` function:

let miniSearch = new MiniSearch({
  fields: ['title', 'author.name', 'pubYear'],
  extractField: (document, fieldName) => {
    // If field name is 'pubYear', extract just the year from 'pubDate'
    if (fieldName === 'pubYear') {
      const pubDate = document['pubDate']
      return pubDate && pubDate.getFullYear().toString()
    }

    // Access nested fields
    return fieldName.split('.').reduce((doc, key) => doc && doc[key], document)
  }
})

The default field extractor can be obtained by calling MiniSearch.getDefault('extractField').
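
To see what the custom extractor returns for each configured field, it can be exercised on its own, outside of MiniSearch. The sketch below copies the function from the example above into a standalone constant; the sample document is the first one from the list.

```javascript
// Same extractor as in the example above, pulled out as a standalone function
const extractField = (document, fieldName) => {
  // 'pubYear' is a virtual field derived from 'pubDate'
  if (fieldName === 'pubYear') {
    const pubDate = document['pubDate']
    return pubDate && pubDate.getFullYear().toString()
  }
  // Walk the dot-separated path, stopping early on missing values
  return fieldName.split('.').reduce((doc, key) => doc && doc[key], document)
}

const doc = {
  id: 1,
  title: 'Moby Dick',
  author: { name: 'Herman Melville' },
  pubDate: new Date(1851, 9, 18)
}

console.log(extractField(doc, 'title')) // 'Moby Dick'
console.log(extractField(doc, 'author.name')) // 'Herman Melville'
console.log(extractField(doc, 'pubYear')) // '1851'
console.log(extractField({ id: 2 }, 'author.name')) // undefined
```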

Tokenization

By default, documents are tokenized by splitting on Unicode space or punctuation characters. The tokenization logic can be easily changed by passing a custom tokenizer function as the tokenize option:

// Tokenize splitting by hyphen
let miniSearch = new MiniSearch({
  fields: ['title', 'text'],
  tokenize: (string, _fieldName) => string.split('-')
})

Upon search, the same tokenization is used by default, but it is possible to pass a tokenize search option in case a different search-time tokenization is necessary:

// Tokenize splitting by hyphen
let miniSearch = new MiniSearch({
  fields: ['title', 'text'],
  tokenize: (string) => string.split('-'), // indexing tokenizer
  searchOptions: {
    tokenize: (string) => string.split(/[\s-]+/) // search query tokenizer
  }
})

The default tokenizer can be obtained by calling MiniSearch.getDefault('tokenize').
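
The difference between the two tokenizers above is easiest to see by running them on sample strings. This sketch only calls the plain functions; the names indexTokenize and searchTokenize are chosen here for illustration.

```javascript
// The two tokenizers from the example above, as standalone functions
const indexTokenize = (string) => string.split('-')
const searchTokenize = (string) => string.split(/[\s-]+/)

console.log(indexTokenize('state-of-the-art')) // [ 'state', 'of', 'the', 'art' ]
console.log(indexTokenize('zen art')) // [ 'zen art' ] (spaces do not split)
console.log(searchTokenize('zen art-of')) // [ 'zen', 'art', 'of' ]
```

With this configuration a document value such as 'zen art' is indexed as a single term, while a query is split on both whitespace and hyphens.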

Term processing

Terms are downcased by default. No stemming is performed, and no stop-word list is applied. To customize how the terms are processed upon indexing, for example to normalize them, filter them, or to apply stemming, the processTerm option can be used. The processTerm function should return the processed term as a string, or a falsy value if the term should be discarded:

let stopWords = new Set(['and', 'or', 'to', 'in', 'a', 'the', /* ...and more */ ])

// Perform custom term processing (here discarding stop words and downcasing)
let miniSearch = new MiniSearch({
  fields: ['title', 'text'],
  processTerm: (term, _fieldName) =>
    stopWords.has(term) ? null : term.toLowerCase()
})

By default, the same processing is applied to search queries. In order to apply a different processing to search queries, supply a processTerm search option:

let miniSearch = new MiniSearch({
  fields: ['title', 'text'],
  processTerm: (term) =>
    stopWords.has(term) ? null : term.toLowerCase(), // index term processing
  searchOptions: {
    processTerm: (term) => term.toLowerCase() // search query processing
  }
})

The default term processor can be obtained by calling MiniSearch.getDefault('processTerm').
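
One detail of the processTerm example above is worth checking in isolation: the stop-word lookup runs before downcasing, so a capitalized 'The' is kept. The sketch below exercises the function directly and shows a variant that downcases first; processTermCaseInsensitive is a name invented here, not something from the MiniSearch docs.

```javascript
const stopWords = new Set(['and', 'or', 'to', 'in', 'a', 'the'])

// As in the example above: the stop-word check runs before downcasing
const processTerm = (term) =>
  stopWords.has(term) ? null : term.toLowerCase()

// Variant (an assumption for illustration): downcase first, then check
const processTermCaseInsensitive = (term) => {
  const lower = term.toLowerCase()
  return stopWords.has(lower) ? null : lower
}

console.log(processTerm('the')) // null
console.log(processTerm('The')) // 'the' (capitalized stop word slips through)
console.log(processTermCaseInsensitive('The')) // null

const terms = ['Zen', 'and', 'the', 'Art']
console.log(terms.map(processTermCaseInsensitive).filter(Boolean)) // [ 'zen', 'art' ]
```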

API Documentation

Refer to the API documentation for details about configuration options and methods.

Browser compatibility

MiniSearch natively supports all modern browsers implementing JavaScript standards, but requires a polyfill when used in Internet Explorer, as it makes use of functions like Object.entries, Array.includes, and Array.from, which are standard but not available on older browsers. The package core-js is one such polyfill that can be used to provide those functions.

Contributing

Contributions to MiniSearch are welcome! Please read the contribution guidelines. Reading the design document is also useful to understand the project goals and the technical implementation.

Comments

Removing items by id
I've got documents that look like this:

const tasks = [ { id: 1, title: "clean the house" }, { id: 2, title: "eat food" } ]

If the title of the task with id = 1 changes, I'd like to update that change in the index. However, in my current application, I don't have access to the entire old version of the document. I just know the id and the new values for the title field.

In order to remove an item in Minisearch, it looks like I need to pass the whole document that I originally added. Is there a way I can remove an item by id? If so, I can just remove by id and then add the new document.
opened by priyadarshy 17
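
One possible workaround for the situation described above, sketched under the assumption that remove needs the originally indexed document: keep a side table from id to the indexed version, so that the old document can be handed back when only the id is known. IndexWithIdLookup and fakeIndex are hypothetical names for this illustration; the wrapper only relies on the wrapped object having add(doc) and remove(doc) methods, and the stand-in index exists solely to make the sketch runnable without the library.

```javascript
// Hypothetical helper: remembers every indexed document by id, so the old
// version can be passed back to `remove` when only the id is known.
class IndexWithIdLookup {
  constructor (index) {
    this.index = index // any object with add(doc) and remove(doc)
    this.byId = new Map()
  }

  add (doc) {
    this.index.add(doc)
    this.byId.set(doc.id, { ...doc }) // shallow copy of the indexed version
  }

  removeById (id) {
    const old = this.byId.get(id)
    if (old === undefined) return false
    this.index.remove(old)
    this.byId.delete(id)
    return true
  }

  // Replace the document with the same id: remove the old version, add the new
  update (doc) {
    this.removeById(doc.id)
    this.add(doc)
  }
}

// Stand-in index, used only so that this sketch runs without MiniSearch
const fakeIndex = {
  docs: [],
  add (doc) { this.docs.push(doc) },
  remove (doc) { this.docs = this.docs.filter((d) => d.id !== doc.id) }
}

const tasks = new IndexWithIdLookup(fakeIndex)
tasks.add({ id: 1, title: 'clean the house' })
tasks.add({ id: 2, title: 'eat food' })
tasks.update({ id: 1, title: 'clean the garage' })
console.log(fakeIndex.docs.map((d) => d.title)) // [ 'eat food', 'clean the garage' ]
```

The trade-off is extra memory for the side table, which holds a copy of every indexed document.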

Issues with scoring

Hi! First of all, v4 seems to give slightly better search ranking than v3.

However, there is a crucial issue currently with the scoring of documents in our application for some search terms. I have tried to recreate this with a synthetic example. For that purpose I've collected 5 movies about sheep.

const ms = new MiniSearch({
  fields: ['title', 'description'],
  storeFields: ['title']
})

ms.add({
  id: 1,
  title: 'Rams',
  description: 'A feud between two sheep farmers.'
})

ms.add({
  id: 2,
  title: 'Shaun the Sheep',
  description: 'Shaun is a cheeky and mischievous sheep at Mossy Bottom farm who\'s the leader of the flock and always plays slapstick jokes, pranks and causes trouble especially on Farmer X and his grumpy guide dog, Bitzer.'
})

ms.add({
  id: 3,
  title: 'Silence of the Lambs',
  description: 'F.B.I. trainee Clarice Starling (Jodie Foster) works hard to advance her career, while trying to hide or put behind her West Virginia roots, of which if some knew, would automatically classify her as being backward or white trash. After graduation, she aspires to work in the agency\'s Behavioral Science Unit under the leadership of Jack Crawford (Scott Glenn). While she is still a trainee, Crawford asks her to question Dr. Hannibal Lecter (Sir Anthony Hopkins), a psychiatrist imprisoned, thus far, for eight years in maximum security isolation for being a serial killer who cannibalized his victims. Clarice is able to figure out the assignment is to pick Lecter\'s brains to help them solve another serial murder case, that of someone coined by the media as "Buffalo Bill" (Ted Levine), who has so far killed five victims, all located in the eastern U.S., all young women, who are slightly overweight (especially around the hips), all who were drowned in natural bodies of water, and all who were stripped of large swaths of skin. She also figures that Crawford chose her, as a woman, to be able to trigger some emotional response from Lecter. After speaking to Lecter for the first time, she realizes that everything with him will be a psychological game, with her often having to read between the very cryptic lines he provides. She has to decide how much she will play along, as his request in return for talking to him is to expose herself emotionally to him. The case takes a more dire turn when a sixth victim is discovered, this one from who they are able to retrieve a key piece of evidence, if Lecter is being forthright as to its meaning. A potential seventh victim is high profile Catherine Martin (Brooke Smith), the daughter of Senator Ruth Martin (Diane Baker), which places greater scrutiny on the case as they search for a hopefully still alive Catherine. Who may factor into what happens is Dr. Frederick Chilton (Anthony Heald), the warden at the prison, an opportunist who sees the higher profile with Catherine, meaning a higher profile for himself if he can insert himself successfully into the proceedings.'
})

ms.add({
  id: 4,
  title: 'Lamb',
  description: 'Haunted by the indelible mark of loss and silent grief, sad-eyed María and her taciturn husband, Ingvar, seek solace in back-breaking work and the demanding schedule at their sheep farm in the remote, harsh, wind-swept landscapes of mountainous Iceland. Then, with their relationship hanging on by a thread, something unexplainable happens, and just like that, happiness blesses the couple\'s grim household once more. Now, as a painful ending gives birth to a new beginning, Ingvar\'s troubled brother, Pétur, arrives at the farmhouse, threatening María and Ingvar\'s delicate, newfound bliss. But, nature\'s gifts demand sacrifice. How far are ecstatic María and Ingvar willing to go in the name of love?'
})

ms.add({
  id: 5,
  title: 'Ringing Bell',
  description: 'A baby lamb named Chirin is living an idyllic life on a farm with many other sheep. Chirin is very adventurous and tends to get lost, so he wears a bell around his neck so that his mother can always find him. His mother warns Chirin that he must never venture beyond the fence surrounding the farm, because a huge black wolf lives in the mountains and loves to eat sheep. Chirin is too young and naive to take the advice to heart, until one night the wolf enters the barn and is prepared to kill Chirin, but at the last moment the lamb\'s mother throws herself in the way and is killed instead. The wolf leaves, and Chirin is horrified to see his mother\'s body. Unable to understand why his mother was killed, he becomes very angry and swears that he will go into the mountains and kill the wolf.'
})

ms.search('sheep', { boost: { title: 2 } })

The following are the results:

[
  {
    id: 1,
    terms: [ 'sheep' ],
    score: 4.360862545683414,
    match: { sheep: [Array] },
    title: 'Rams'
  },
  {
    id: 2,
    terms: [ 'sheep' ],
    score: 3.163825722967836,
    match: { sheep: [Array] },
    title: 'Shaun the Sheep'
  },
  {
    id: 5,
    terms: [ 'sheep' ],
    score: 0.3964420496075831,
    match: { sheep: [Array] },
    title: 'Ringing Bell'
  },
  {
    id: 4,
    terms: [ 'sheep' ],
    score: 0.26090630615199917,
    match: { sheep: [Array] },
    title: 'Lamb'
  }
]

The issue is the following. I expect, without any doubt, that 'Shaun the Sheep' should be the top result. Why?

Because it is the only movie with 'sheep' in the title field and in the description field.
The subjective score of 'sheep' within a 3 word title is higher than 'sheep' in a 6 word description.
The subjective score of 'sheep' in 1 title out of 5 movies is much better than 4 descriptions out of 5 movies.
I have even boosted the title by a factor of 2. In our actual application, I don't really want to boost one field too much, because it can lead to other scoring problems.

So what goes wrong?

Fields with a high variance in length obscure fields with a low variance in length

The issue is that many other movies have very long descriptions, but 'Rams' only has a 6-word description. The relative scoring for field length is fieldLength / averageFieldLength. This heavily disadvantages the description of 'Shaun the Sheep', which is only of "average" length. This essentially means that if there is a high variance in a field's length, the documents with a short field get a very large boost. Regardless of matches in other fields!

A match in two distinct fields in the same document has no bonus

I would expect that 'Shaun the Sheep' is a great match for the query 'sheep' because it is the only document that has a match in both fields. I think it would be good to give a boost in those cases, similarly to how a document that matches two words in an OR query receives a boost.

So what are the options?

I think we could take a cue from Lucene, which uses 1 / sqrt(numFieldTerms) as the length normalisation factor.

https://www.compose.com/articles/how-scoring-works-in-elasticsearch/
https://theaidigest.in/how-does-elasticsearch-scoring-work/

Just as a quick test, if I take 1 / sqrt(fieldLength), I get the following results:

[
  {
    id: 2,
    terms: [ 'sheep' ],
    score: 1.8946174879859907,
    match: { sheep: [Array] },
    title: 'Shaun the Sheep'
  },
  {
    id: 1,
    terms: [ 'sheep' ],
    score: 0.08434033477788275,
    match: { sheep: [Array] },
    title: 'Rams'
  },
  {
    id: 5,
    terms: [ 'sheep' ],
    score: 0.03596283958463321,
    match: { sheep: [Array] },
    title: 'Ringing Bell'
  },
  {
    id: 4,
    terms: [ 'sheep' ],
    score: 0.020629628616731104,
    match: { sheep: [Array] },
    title: 'Lamb'
  }
]

I get the same results even if I drop the title boosting factor. That's actually exactly what I personally expect: the shorter fields should count more if they match unless I disadvantage them explicitly.
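
The effect of the two normalisation schemes can be sketched numerically. The snippet below assumes, as described above, that a match is divided by fieldLength / averageFieldLength in the current scheme and multiplied by 1 / sqrt(fieldLength) in the Lucene-style one; it is not MiniSearch's actual scoring code. Title word counts are taken from the five example movies, the 6-word and 36-word description lengths are counted from the example, and the remaining description lengths are rough estimates used only for illustration.

```javascript
// Word counts per document. Titles are exact; description lengths other than
// 6 ('Rams') and 36 ('Shaun the Sheep') are rough estimates for illustration.
const titleLengths = [1, 3, 4, 1, 2]
const descriptionLengths = [6, 36, 350, 115, 150]

const average = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length

// Factor applied to a match when dividing by fieldLength / averageFieldLength
const relativeFactor = (length, lengths) => 1 / (length / average(lengths))

// Factor applied to a match with Lucene-style 1 / sqrt(fieldLength)
const sqrtFactor = (length) => 1 / Math.sqrt(length)

const rows = [
  ["'Rams' description (6 words)", relativeFactor(6, descriptionLengths), sqrtFactor(6)],
  ["'Shaun the Sheep' title (3 words)", relativeFactor(3, titleLengths), sqrtFactor(3)],
  ["'Shaun the Sheep' description (36 words)", relativeFactor(36, descriptionLengths), sqrtFactor(36)]
]

for (const [label, relative, sqrt] of rows) {
  console.log(label, '| relative:', relative.toFixed(2), '| sqrt:', sqrt.toFixed(2))
}
```

Because each field is compared against its own average, the 6-word description in a field full of long descriptions receives a far larger factor than a 3-word title in a field of short titles. With 1 / sqrt(fieldLength) the shorter title match gets the larger factor, which fits the ranking reported above.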

Problem solved?! Well, not really. What if I search for a highly specific sheep?

ms.search('chirin the sheep')

[
  {
    id: 2,
    terms: [ 'the', 'sheep' ],
    score: 4.537584326120562,
    match: { the: [Array], sheep: [Array] },
    title: 'Shaun the Sheep'
  },
  {
    id: 5,
    terms: [ 'chirin', 'the', 'sheep' ],
    score: 2.2902873329363285,
    match: { chirin: [Array], the: [Array], sheep: [Array] },
    title: 'Ringing Bell'
  },
  {
    id: 3,
    terms: [ 'the' ],
    score: 1.09077315757252,
    match: { the: [Array] },
    title: 'Silence of the Lambs'
  },
  {
    id: 4,
    terms: [ 'the', 'sheep' ],
    score: 0.2166111004756766,
    match: { the: [Array], sheep: [Array] },
    title: 'Lamb'
  },
  {
    id: 1,
    terms: [ 'sheep' ],
    score: 0.08434033477788275,
    match: { sheep: [Array] },
    title: 'Rams'
  }
]

I definitely wasn't looking for Shaun! 'Ringing Bell' should be the top result here, because it is the only match for 'chirin'. So what can we do? Lucene scores the terms in a query with a coordination mechanism: the more query terms a document matches, the better its score should be. It uses matching terms / total terms as a weight factor for each document. This can also replace the 1.5 boost for OR queries. Hacking that into MiniSearch I get this:

[
  {
    id: 2,
    terms: [ 'the', 'sheep' ],
    score: 1.0445507364815925,
    match: { the: [Array], sheep: [Array] },
    title: 'Shaun the Sheep'
  },
  {
    id: 5,
    terms: [ 'chirin', 'the', 'sheep' ],
    score: 1.0298930944999127,
    match: { chirin: [Array], the: [Array], sheep: [Array] },
    title: 'Ringing Bell'
  },
  {
    id: 3,
    terms: [ 'the' ],
    score: 0.21087593054514742,
    match: { the: [Array] },
    title: 'Silence of the Lambs'
  },
  {
    id: 4,
    terms: [ 'the', 'sheep' ],
    score: 0.09627160021141183,
    match: { the: [Array], sheep: [Array] },
    title: 'Lamb'
  },
  {
    id: 1,
    terms: [ 'sheep' ],
    score: 0.028113444925960917,
    match: { sheep: [Array] },
    title: 'Rams'
  }
]

Almost there (1.04 vs 1.03), but not quite yet...
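
For reference, the coordination factor described above (matching terms / total terms) can be written as a one-line function. This is a sketch of the idea only; it does not reproduce the scores listed above, which also depend on other parts of the scoring.

```javascript
// Coordination factor: the fraction of query terms that a document matches
const coordination = (matchedTerms, queryTerms) =>
  matchedTerms.length / queryTerms.length

const queryTerms = ['chirin', 'the', 'sheep']

// Matched terms per document, as reported in the results above
console.log(coordination(['the', 'sheep'], queryTerms)) // 2/3 for 'Shaun the Sheep'
console.log(coordination(['chirin', 'the', 'sheep'], queryTerms)) // 1 for 'Ringing Bell'
console.log(coordination(['sheep'], queryTerms)) // 1/3 for 'Rams'
```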

Lucene also uses the inverse document frequency of each term in the query as a factor for determining how unique a term is. I have not tested this (it touches more code in MiniSearch), but my guess is this would raise the score of 'Ringing Bell' to the top position because of the uniqueness of the term 'chirin'.
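
To illustrate why inverse document frequency would favour 'chirin', here is a sketch using the textbook definition ln(N / df). Lucene's exact formula differs in its details, so treat the numbers as indicative only. The document frequencies are read off the search results above.

```javascript
// Ids of the documents that contain each term, read off the results above
const matches = {
  chirin: [5],
  the: [2, 3, 4, 5],
  sheep: [1, 2, 4, 5]
}
const totalDocuments = 5

// Textbook inverse document frequency; Lucene's exact formula differs
const idf = (term) => Math.log(totalDocuments / matches[term].length)

for (const term of Object.keys(matches)) {
  console.log(term, idf(term).toFixed(3))
}
// chirin 1.609
// the 0.223
// sheep 0.223
```

With this definition a match on 'chirin' weighs roughly seven times as much as a match on 'the' or 'sheep'.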

So, my question to you is this: would you be open to revising the scoring mechanism to be closer to what Lucene uses? I believe it could solve some practical issues with the current document scoring.

If you do, maybe we should collect some test sets which are realistic enough, but also small enough to be able to judge the scoring from the outside.

Looking forward to any thoughts you may have on this!

opened by rolftimmermans 15

Added support for combined AND and OR queries.
I needed AND and OR support for an application I'm using, so I added basic support for AND and OR queries.

Unfortunately, this means that the query tokenization no longer works, since it's now being parsed with an EBNF grammar. As far as I can tell, this is the only limitation imposed by this implementation. In order to maintain backwards compatibility, I added an option for "enableAdvancedQueries" to opt-in to the new query language.

The combineWith property still works as expected -- it will treat spaces as being either an AND or an OR based on the value passed in. By default, it uses implicit OR.

The processTerm property also works as expected. All terms within the query will get processed.

This should resolve #100.

Query language supports the following:

dog AND cat
dog OR cat
dog AND cat OR horse // AND takes precedence over OR, making this: (dog AND cat) OR horse
dog AND (cat OR horse)
"AND" OR "OR" // Searches for the words "AND" or "OR"

Nesting is unlimited. "AND" takes precedence over "OR". Operators are case sensitive.
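
The precedence rules above can be illustrated with a small recursive-descent parser. This is a sketch written for this page, not the grammar or code from the PR; the names tokenizeQuery, parseQuery and show are invented here, and implicit operators between space-separated words (the combineWith behaviour) are deliberately left out.

```javascript
// Split a query into tokens: parentheses, quoted phrases, and bare words
function tokenizeQuery (query) {
  return query.match(/\(|\)|"[^"]*"|[^\s()"]+/g) || []
}

// Recursive descent over this grammar (AND binds tighter than OR):
//   or      := and ('OR' and)*
//   and     := primary ('AND' primary)*
//   primary := '(' or ')' | quoted phrase | word
function parseQuery (query) {
  const tokens = tokenizeQuery(query)
  let pos = 0

  const parsePrimary = () => {
    const token = tokens[pos++]
    if (token === undefined) throw new SyntaxError('Unexpected end of query')
    if (token === '(') {
      const inner = parseOr()
      if (tokens[pos++] !== ')') throw new SyntaxError('Expected ")"')
      return inner
    }
    if (token === ')' || token === 'AND' || token === 'OR') {
      throw new SyntaxError(`Unexpected token ${token}`)
    }
    // Quoted tokens are plain terms, so "AND" searches for the word itself
    return { term: token.startsWith('"') ? token.slice(1, -1) : token }
  }

  const parseAnd = () => {
    let node = parsePrimary()
    while (tokens[pos] === 'AND') {
      pos++
      node = { op: 'AND', left: node, right: parsePrimary() }
    }
    return node
  }

  const parseOr = () => {
    let node = parseAnd()
    while (tokens[pos] === 'OR') {
      pos++
      node = { op: 'OR', left: node, right: parseAnd() }
    }
    return node
  }

  const tree = parseOr()
  if (pos < tokens.length) throw new SyntaxError(`Unexpected token ${tokens[pos]}`)
  return tree
}

// Render the tree with explicit parentheses to make the grouping visible
const show = (node) =>
  node.term !== undefined ? node.term : `(${show(node.left)} ${node.op} ${show(node.right)})`

console.log(show(parseQuery('dog AND cat OR horse'))) // ((dog AND cat) OR horse)
console.log(show(parseQuery('dog AND (cat OR horse)'))) // (dog AND (cat OR horse))
console.log(show(parseQuery('"AND" OR "OR"'))) // (AND OR OR)
```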

Example usage:

ms.search("cat AND (dog OR horse)", { enableAdvancedQueries: true })
opened by FindAPattern 15

Error with loadJSON method

OS: macOS 11.6
Node: 15.11.0
MiniSearch: 3.1.0

====================================

Building index with the following code:

const miniSearch = require('minisearch')
const fs = require('fs');
const path = require("path");

const getProductFiles = function(dirPath, arrayOfFiles) {
  const files = fs.readdirSync(dirPath);

  arrayOfFiles = arrayOfFiles || [];

  files.forEach(function(file) {
    let fn = path.join(dirPath, file);
    (fs.statSync(fn).isDirectory()) ?
      arrayOfFiles = getProductFiles(fn, arrayOfFiles) :
      arrayOfFiles.push(path.join(dirPath, "/", file));
  });

  return arrayOfFiles;
}

let arrayOfFiles;
const inputFiles = 
  getProductFiles(path.join('src', '_data'), arrayOfFiles)
      .filter(file => path.extname(file) === '.json');

let idCounter = 0

let ms = new miniSearch({
  fields: [ 'sku', 'category', 'type', 'subtype', 'name', 'description', 'cost',
            'mass', 'size', 'techLevel', 'qrebs', 'tags' ],
  storeFields: ['sku', 'name', 'description', 'cost']
});

inputFiles.forEach(file => {  

  // get the products from the file
  let products = JSON.parse(fs.readFileSync(`${file}`));

  // build search index object and add to search index
  products.forEach(product => {
    product.id = idCounter++;
    ms.add(product);
  })
})


fs.writeFileSync('src/_data/searchindex.idx', JSON.stringify(ms))

let jsonIdx = fs.readFileSync('src/_data/searchindex.idx', 'utf8');

let ms2 = new miniSearch.loadJSON(jsonIdx, {
  fields: [ 'sku', 'category', 'type', 'subtype', 'name', 'description', 'cost',
            'mass', 'size', 'techLevel', 'qrebs', 'tags' ],
  storeFields: ['sku', 'name', 'description', 'cost']
});


// console.log(`ms is ${(Array.isArray(ms)) ? "" : "not"})`)
let searchTerm = 'portal'
let options = (searchTerm.includes(' and ')) ? { combineWith: 'AND'} : {}
let res = ms2.search(searchTerm, options);
res.forEach(result => console.log(result));

The code above appears to work correctly and returns search results (use attached file searchindex.idx)

In the code below, I may be doing something wrong with the fetch, but I'm not sure what it is.

  fetch(searchIndexLocation)
    .then((res) => res.json())
    .then((data) => {
      console.log(data);
      const jsonDocs = data;

// line 272 is the next line
      let miniSearch = new MiniSearch.loadJSON(jsonDocs, {
        fields: [ 'sku', 'category', 'type', 'subtype', 'name', 'description', 'cost',
                  'mass', 'size', 'techLevel', 'qrebs', 'tags' ],
        storeFields: ['sku', 'name', 'description', 'cost']
      });
      
    })
    .catch((err) => console.log(err));

because I am consistently getting the following error:

SyntaxError: Unexpected token o in JSON at position 1
    at JSON.parse (<anonymous>)
    at new t.loadJSON (index.js:1126)
    at scripts.js:272

I'm still a bit new to the fetch API but it looks like something is occurring with the index file before it is getting to the loadJSON call.

Any clues appreciated. Attached: searchindex.idx.zip

opened by cmcknight 14
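
A likely reading of the error above: res.json() already parses the response body into an object, while the stack trace shows loadJSON calling JSON.parse itself, so it expects the serialized string. When JSON.parse receives an object, the object is coerced to the string "[object Object]", and parsing fails at the 'o' in position 1. The sketch below reproduces this with loadIndexFromJson, a hypothetical stand-in that only mimics the "parse a JSON string" step and is not the library function.

```javascript
// Hypothetical stand-in for a loader that expects a JSON string and parses it
// itself, which is what the stack trace above suggests loadJSON does.
function loadIndexFromJson (json) {
  return JSON.parse(json)
}

const serialized = JSON.stringify({ documentCount: 2 }) // what the file contains
const alreadyParsed = JSON.parse(serialized) // what res.json() resolves to

let message = null
try {
  loadIndexFromJson(alreadyParsed) // the object is coerced to "[object Object]"
} catch (err) {
  message = `${err.name}: ${err.message}`
}
console.log(message) // a SyntaxError; the exact wording depends on the JS engine

// Passing the raw text (what res.text() resolves to) parses cleanly
console.log(loadIndexFromJson(serialized)) // { documentCount: 2 }
```

If that is indeed the cause, reading the response with res.text() instead of res.json(), or re-serializing the object with JSON.stringify before passing it on, should avoid the error.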

making autosuggest results useful

I'm struggling to get search suggestions that are useful. It makes sense to me for the search to use AND, so I have set that for the search and tried both OR and AND for the suggestions. It seems to work okay in the demo, but my data is not simple title and artist like the demo. It is long articles of text.

I've tried various combinations of prefix and fuzzy (mostly with AND), but the suggestions are not helpful to the user, because they have the first word followed by a bunch of possible matches for the second word. I can see how these terms are all found in one document, but the user is not helped by that suggestion. Even your example in the docs is confusing, where you call autosuggest for "zen ar" and get "zen art archery" as a suggestion. It makes sense once you know the parameters, but as a suggestion, it's not something you would click on. I think the user would be helped by showing "zen art" separate from "zen archery".

Do I need to make an elaborate filter to get suggestions that make sense?

opened by joyously 11

undefined in "searchResults" but present in "rawResults"

Hi Luca,

me again^^

I have the following, quite strange behaviour.

You can add entries to the list of elements, that can be searched with mini-search. While you add the new entry, the query stays active.

The result should be, if the new entry matches the query, it should be displayed.

So, if the input data changes I perform search via a useEffect hook and display the new data.

But what happens is that despite having a match in the raw results (correct field, correct match, everything ok), the entry in searchResult is undefined causing a crash.

   useEffect(() => {
        if (data && !isFirstRender.current) {
            removeAll();
            addAll(data);
        }

        isFirstRender.current = false;

    }, [data]);

    useEffect(() => {
        if (data) {
            search(filter.query, {
                filter: filterOptions.categoryField && filter.categories.length > 0 ? matchCategory : undefined,
            });
        }
    }, [data, filter, filterOptions, matchCategory]);

The order of execution is correct, double checked on that.

Result looks like:

rawResults

[
    {
        "id": "6a901953144c411580520ed07B4567",
        "terms": [
            "einholung"
        ],
        "score": 8.600445452294665,
        "match": {
            "einholung": [
                "custom.bezeichnung"
            ]
        },
        "custom.kategorien": [
            "Gefahr in Verzug",
            "Sicherheit",
            "Qualität"
        ]
    },
    {
        "id": "h1662s",
        "terms": [
            "einholungsbums"
        ],
        "score": 4.152082359120152,
        "match": {
            "einholungsbums": [
                "custom.bezeichnung"
            ]
        },
        "custom.kategorien": [
            "Gefahr in Verzug"
        ]
    }
]

searchResults

[
    {
        "id": "6a901953144c411580520ed07B4567",
        "datum": "2020-09-17T20:34:56.170914Z",
        "custom": {
            "id": "6a901953144c411580520ed07a21ef39",
            "erstelldatum": "2020-09-17T20:34:56.170914Z",
            "baumassnahmenId": "45042621",
            "bezeichnung": "Einholung weiterer Informationen",
            "kategorien": [
                "Gefahr in Verzug",
                "Sicherheit",
                "Qualität"
            ],
    .....
    },
    undefined -> Where the second element should be.
]

Help would be highly appreciated. I'm kind of confused...

opened by florianmatz 11

Fix the weights option
The weights option allows users to override the relative scoring of fuzzy and prefix matches. However, due to a small bug, the weights do not actually do anything. This PR addresses that.

The reason this was overlooked until now is probably because both fuzzy and prefix matches also include exact matches. This means an exact match is scored higher because it occurs in all matches. But it also means additional, needless work combining the scoring for matches that are found in the set of exact, fuzzy and prefix matches.

What I have done:

Use the weight adjustments for scoring fuzzy and prefix matches and added tests to ensure this works.

Remove exact matches from the (intermediate) fuzzy and prefix matches. This additional check easily pays for itself because of the reduction in the amount of work combining the results later. See the benchmarks at the end.

Adjust the default weights down to somewhat correct for the removal of exact matches from the intermediate fuzzy and prefix results. I halved them to { fuzzy: 0.45, prefix: 0.375 }. I am not 100% sure these weights are adequate, so I'd like to get some input. It will be hard to guarantee identical search results, because the relative weight of exact matches currently is different depending on whether a user is using either fuzzy or prefix matching, versus fuzzy and prefix matching.

Add a test to ensure the scoring of exact matches is not influenced by fuzzy or prefix matching.

~The PR is based on #122, and I can rebase when it is merged.~ Done

Before

Combined search:
================
  * MiniSearch#search("virtute e conoscienza") x 95.29 ops/sec ±3.14% (71 runs sampled)

After

Combined search:
================
  * MiniSearch#search("virtute e conoscienza") x 163 ops/sec ±2.93% (78 runs sampled)
opened by rolftimmermans 9

Performance increase

Hi there!

First of all thanks for creating this library. We are using it in production on a website with published guidelines on hazardous substances, created in collaboration between ministries and other government bodies in The Netherlands.

Although we've been quite happy with the performance (great job!), our profiling shows that there is some improvement to be made with regards to the creation of temporary objects. This is mostly due to using plain objects instead of ES6 Maps.

Objects are great for storing a fixed number of keys, but less great for a larger number of keys, for addition/deletion, or for iterating over them.

In particular the following pattern showed up in a few places:

Object.entries(object).forEach(([key, value]) => {
  //
})

This has some issues:

It causes at least one allocation for the array returned by Object.entries().
It probably causes an allocation for the closure passed into forEach().
Iterating with for ... of ... is almost always faster (in part due to the allocations).
Unnecessary temporary objects cause pressure on the garbage collector.
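As a small illustration of the points above, the following sketch contrasts the two iteration styles on the same data. The variable names and sample values are made up for the example, and it does not measure performance; it only shows the shape of the change.

```javascript
// Plain object: Object.entries() builds an array of all [key, value] pairs
// up front, and forEach() takes a closure.
const frequenciesObject = { virtute: 3, conoscienza: 1 }
let totalFromObject = 0
Object.entries(frequenciesObject).forEach(([term, count]) => {
  totalFromObject += count
})

// Map: for...of walks the Map's own iterator, without first building an
// array of all entries and without a callback.
const frequenciesMap = new Map([['virtute', 3], ['conoscienza', 1]])
let totalFromMap = 0
for (const [term, count] of frequenciesMap) {
  totalFromMap += count
}
```

Both loops compute the same total; the second one simply avoids the temporary array and the closure.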

I made the following changes:

Most occurrences of objects are replaced with Maps.
Functions that take closures are mostly replaced with loops.
Some duplicate read/write operations were prevented.
The internal IDs of fields are now numbers. Because field IDs are consecutive and do not change, we can store the field data in arrays with the ID as the index.
Document short IDs are now numbers. There is no need to transform them into strings, which saves a small amount of space and time during indexing.
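The numeric field ID idea can be sketched roughly as follows. This is a simplified illustration, not the code in the PR: `fieldIds`, `totalFieldLength`, `documentCount`, and `recordFieldLength` are hypothetical names chosen for the example.

```javascript
// Hypothetical sketch; names are illustrative, not MiniSearch internals.
const fields = ['title', 'text', 'author']

// Assign consecutive numeric IDs: title is 0, text is 1, author is 2.
const fieldIds = new Map(fields.map((name, id) => [name, id]))

// Because the IDs are consecutive and never change, per-field data can live
// in plain arrays that use the field ID as the index.
const totalFieldLength = new Array(fields.length).fill(0)
const documentCount = new Array(fields.length).fill(0)

function recordFieldLength (fieldName, length) {
  const id = fieldIds.get(fieldName)
  totalFieldLength[id] += length
  documentCount[id] += 1
}

recordFieldLength('title', 1)
recordFieldLength('text', 14)
recordFieldLength('title', 3)

const titleId = fieldIds.get('title')
const averageTitleLength = totalFieldLength[titleId] / documentCount[titleId]
```

Looking up per-field data is then a plain array access instead of a keyed lookup on a string.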

Overall, these changes gave a performance boost. Using the benchmarks in the project on my laptop with Node v17.2.0 (which should be representative of a modern version of Chrome):

Indexing is at least 4x faster.
Combined search is about 2x faster.

However, nothing comes for free... Because the internal structure is no longer 1-1 serialisable to and from JSON, that operation becomes more expensive. Loading an index from JSON is about 30% slower, and is also much more complicated. The index format is also slightly different, so the serialised JSON is not compatible with the current version (and you might want to consider a major or minor version bump if you accept this PR).
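To show why serialisation gets more involved, here is a minimal sketch of a round trip for a nested Map. The index shape (term to field ID to frequency) and the `serialize` and `deserialize` helpers are assumptions for illustration, not the format used by the PR.

```javascript
// Hypothetical index shape: term -> (numeric field ID -> frequency).
const index = new Map([
  ['virtute', new Map([[0, 2], [1, 5]])],
  ['conoscienza', new Map([[1, 1]])]
])

// JSON.stringify() turns a Map into "{}", so the entries have to be
// converted explicitly before serialising.
function serialize (index) {
  const terms = []
  for (const [term, fieldData] of index) {
    terms.push([term, Object.fromEntries(fieldData)])
  }
  return JSON.stringify({ terms })
}

// Loading has to rebuild every Map and restore the numeric keys.
function deserialize (json) {
  const { terms } = JSON.parse(json)
  const index = new Map()
  for (const [term, fieldData] of terms) {
    const perField = new Map()
    for (const [fieldId, frequency] of Object.entries(fieldData)) {
      perField.set(Number(fieldId), frequency) // object keys come back as strings
    }
    index.set(term, perField)
  }
  return index
}

const restored = deserialize(serialize(index))
```

With plain objects, `JSON.parse` alone would have been enough; with Maps, every level needs this extra conversion step, which is consistent with loading being slower.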

ES6 Maps might also not be supported in very old browsers. However, their support is slightly better than that of Unicode Regexes, which are also used in the project.

Finally, I sacrificed some of your beautiful functional code style on the performance altar.

I'd love to hear your thoughts on this PR. Let me know if there is anything I should clarify!

Before

Index size: 13497 terms, 14097 documents, ~17.49MB in memory, 2.37MB serialized.

Fuzzy search:
=============
  * SearchableMap#fuzzyGet("virtute", 1) x 24,029 ops/sec ±0.23% (100 runs sampled)
  * SearchableMap#fuzzyGet("virtu", 2) x 1,918 ops/sec ±0.23% (99 runs sampled)
  * SearchableMap#fuzzyGet("virtu", 3) x 385 ops/sec ±0.42% (95 runs sampled)

Prefix search:
==============
  * Array.from(SearchableMap#atPrefix("vir")) x 381,810 ops/sec ±0.13% (100 runs sampled)
  * Array.from(SearchableMap#atPrefix("virtut")) x 592,633 ops/sec ±0.16% (98 runs sampled)

Exact search:
=============
  * SearchableMap#get("virtute") x 858,585 ops/sec ±0.08% (99 runs sampled)

Indexing:
=========
  * MiniSearch#addAll(documents) x 3.76 ops/sec ±3.47% (14 runs sampled)

Combined search:
================
  * MiniSearch#search("virtute e conoscienza") x 46.38 ops/sec ±4.34% (63 runs sampled)

Search filtering:
=================
  * MiniSearch#search("virtu", { filter: ... }) x 9,135 ops/sec ±4.19% (88 runs sampled)

Auto suggestion:
================
  * MiniSearch#autoSuggest("virtute cano") x 33,132 ops/sec ±3.56% (91 runs sampled)
  * MiniSearch#autoSuggest("virtue conoscienza", { fuzzy: 0.2 }) x 181,194 ops/sec ±1.20% (101 runs sampled)

Load index:
===========
  * MiniSearch.loadJSON(json, options) x 33.50 ops/sec ±2.46% (60 runs sampled)

After

Index size: 13497 terms, 14097 documents, ~16.01MB in memory, 2.49MB serialized.

Fuzzy search:
=============
  * SearchableMap#fuzzyGet("virtute", 1) x 27,119 ops/sec ±0.19% (99 runs sampled)
  * SearchableMap#fuzzyGet("virtu", 2) x 2,253 ops/sec ±0.21% (99 runs sampled)
  * SearchableMap#fuzzyGet("virtu", 3) x 434 ops/sec ±0.38% (92 runs sampled)

Prefix search:
==============
  * Array.from(SearchableMap#atPrefix("vir")) x 446,460 ops/sec ±0.22% (101 runs sampled)
  * Array.from(SearchableMap#atPrefix("virtut")) x 988,842 ops/sec ±0.27% (100 runs sampled)

Exact search:
=============
  * SearchableMap#get("virtute") x 1,823,572 ops/sec ±0.33% (94 runs sampled)

Indexing:
=========
  * MiniSearch#addAll(documents) x 12.09 ops/sec ±2.44% (35 runs sampled)

Combined search:
================
  * MiniSearch#search("virtute e conoscienza") x 91.87 ops/sec ±3.24% (69 runs sampled)

Search filtering:
=================
  * MiniSearch#search("virtu", { filter: ... }) x 18,009 ops/sec ±2.99% (83 runs sampled)

Auto suggestion:
================
  * MiniSearch#autoSuggest("virtute cano") x 58,476 ops/sec ±2.87% (87 runs sampled)
  * MiniSearch#autoSuggest("virtue conoscienza", { fuzzy: 0.2 }) x 327,262 ops/sec ±1.24% (98 runs sampled)

Load index:
===========
  * MiniSearch.loadJSON(json, options) x 23.24 ops/sec ±3.82% (43 runs sampled)

opened by rolftimmermans 9

Ship ES6 version of the library
Today all major browsers support ES6 and ESM, so it makes sense to ship an ES6 version of your library together with the UMD version.

Usually there are three builds that are good to have in your package:

ES6 + ESM (ES modules)

UMD

ES5 + ESM (es5m or esm5)

To make this job easier, I'd recommend using rollup instead of webpack. The resulting bundle will be smaller and will not include the internal module system that webpack adds to the bundle.

You can check https://github.com/stalniy/rollup-plugin-content/blob/master/rollup.config.js to see how I did this in one of my libraries.

Let me know if you need help with this, I can submit a PR.
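
As a rough idea of what this could look like, here is a minimal rollup configuration that emits an ES module build and a UMD build from one entry point. The file paths and the `MiniSearch` global name are assumptions, and the ES5 + ESM build is left out because it would additionally need a transpiler plugin. The config is written in CommonJS form, which rollup loads from a file named `rollup.config.cjs`.

```javascript
// rollup.config.cjs
// Sketch only: the input and output paths are assumptions, not the
// project's actual layout.
const config = {
  input: 'src/index.js',
  output: [
    // ES modules build, for bundlers and browsers that support ESM
    { file: 'dist/es/index.js', format: 'es', sourcemap: true },
    // UMD build; `name` is the global exposed when loaded via a <script> tag
    { file: 'dist/umd/index.js', format: 'umd', name: 'MiniSearch', sourcemap: true }
  ]
}

module.exports = config
```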
opened by stalniy 9
[Feature request] Add a way to act on warning "X has changed before removal"

Context: I'm using MiniSearch to build a search plugin for the note-taking app Obsidian. To make it as fast as possible, I'm reloading MiniSearch from cached data when the application boots up.

Since bugs happen, the cached index and cached files can become desynced, and the message "MiniSearch: document with ID xyz has changed before removal" appears. Unfortunately, we can't act on this: the errors pile up, and eventually the search index becomes unusable or corrupted.

I think a new callback field on the Options object would be a simple and efficient solution. It could be called when the console.warn is shown, so library users could manage it effectively.
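
To make the proposal more concrete, here is a minimal sketch of how such a callback could behave. `ToyIndex` is a stand-in written for this example, not MiniSearch, and the `onWarning` option name and its arguments are hypothetical.

```javascript
// Toy stand-in for an index, NOT MiniSearch. `onWarning` is a hypothetical
// option name; by default it falls back to console.warn.
class ToyIndex {
  constructor ({ onWarning = (message) => console.warn(message) } = {}) {
    this.onWarning = onWarning
    this.documents = new Map() // id -> text the index was built from
  }

  add (doc) {
    this.documents.set(doc.id, doc.text)
  }

  remove (doc) {
    const indexedText = this.documents.get(doc.id)
    if (indexedText !== doc.text) {
      // Instead of only logging, hand the problem to the library user.
      this.onWarning(`document with ID ${doc.id} has changed before removal`, doc.id)
    }
    this.documents.delete(doc.id)
  }
}

const staleIds = []
const index = new ToyIndex({ onWarning: (message, id) => staleIds.push(id) })
index.add({ id: 'xyz', text: 'cached content' })
index.remove({ id: 'xyz', text: 'content edited after the cache was written' })
```

With something along these lines, a plugin could collect the affected IDs and re-index just those documents, or discard the cache and rebuild.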

opened by scambier 8

Support nested list of objects

In our scenario, we would like to provide full-text search functionality for datasets that look like this:

{
  "id":"12345",
  "name":"dataset_name",
  "type":"table",
  "columns":[
    {
      "name":"date",
      "type":"int",
      "description":"xxx"
    },
    {
      "name":"container",
      "type":"string",
      "description":"xxx"
    },
    {
      "name":"container_position",
      "type":"int",
      "description":"xxx"
    }
  ]
}

When searching for container, we would like to find all columns that have container in their name field. Here is an example of the result we have in mind:

[
  {
    "id":"12345",
    "...": "...",
    "match":{
      "container":[
        "columns[1].name",
        "columns[2].name"
      ],
      "note": "probably there will be better ways"
    }
  }
]

Looks like currently there is no good way to support this feature. Any thoughts?
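
For illustration, the desired match paths could be computed outside the search engine with a small helper like the one below. `findColumnMatches` is a hypothetical function, not part of MiniSearch, and it assumes a simple case-insensitive substring match on the column name.

```javascript
// Hypothetical helper, not a MiniSearch API: returns the paths of all
// columns whose name contains the search term.
function findColumnMatches (dataset, term) {
  const needle = term.toLowerCase()
  const paths = []
  for (const [i, column] of dataset.columns.entries()) {
    if (column.name.toLowerCase().includes(needle)) {
      paths.push(`columns[${i}].name`)
    }
  }
  return paths
}

const dataset = {
  id: '12345',
  name: 'dataset_name',
  type: 'table',
  columns: [
    { name: 'date', type: 'int', description: 'xxx' },
    { name: 'container', type: 'string', description: 'xxx' },
    { name: 'container_position', type: 'int', description: 'xxx' }
  ]
}

const result = {
  id: dataset.id,
  match: { container: findColumnMatches(dataset, 'container') }
}
```

This only covers reporting where a term matched; ranking and fuzzy or prefix matching would still have to come from the search engine.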

opened by springuper 8

Add addFields and removeFields methods

Resolves: #170

These methods add fields to, or remove fields from, an existing document.

This is useful for patching some fields of an existing document without having to replace it.

Example:

`addFields`

const miniSearch = new MiniSearch({ fields: ['title', 'text', 'author'] })
   
miniSearch.add({ id: 1, title: 'Neuromancer' })
   
miniSearch.addFields(1, {
  text: 'The sky above the port was the color of television, tuned to a dead channel.',
  author: 'William Gibson'
})
   
// The above is equivalent to:
miniSearch.add({
  id: 1,
  title: 'Neuromancer',
  text: 'The sky above the port was the color of television, tuned to a dead channel.',
  author: 'William Gibson'
})

`removeFields`

const miniSearch = new MiniSearch({ fields: ['title', 'text', 'author'] })

miniSearch.add({
  id: 1,
  title: 'Neuromancer',
  text: 'The sky above the port was the color of television, tuned to a dead channel.',
  author: 'William Gibson'
})
   
miniSearch.removeFields(1, {
  text: 'The sky above the port was the color of television, tuned to a dead channel.',
  author: 'William Gibson'
})
   
// The above is equivalent to:
miniSearch.add({
  id: 1,
  title: 'Neuromancer'
})

opened by lucaong 0

Adding single field (extending fieldlist) to document

Hi Luca,

I mentioned this single use case in another ticket, where we discussed several things: https://github.com/lucaong/minisearch/issues/106

I wanted to create a dedicated issue for this specific case. I am powering a filter/search of a table with MiniSearch, and users are able to add and remove columns from that table. Recreating the index is quite a performance hit on that page, and I am trying to make reindexing more performant when the data changes. One of these scenarios is adding a column. Currently I am creating a completely new index just to add or remove a single field.

It would be great if there were an API to add or remove a single field from the index. Even doing it for every individual document with the current value of the field would be fine.

opened by KeKs0r 4