A personal semantic search engine capable of surfacing relevant bookmarks, journal entries, notes, blogs, contacts, and more, built on an efficient document embedding algorithm and Monocle's personal search index.

Overview

Revery ๐Ÿฆ…

Revery is a semantic search engine that operates on my Monocle search index. While Revery lets me search through the same database of tens of thousands of notes, bookmarks, journal entries, Tweets, contacts, and blog posts as Monocle, Revery's focus is not on keyword-based search that Monocle performs, but instead on semantic search -- finding results that are topically similar to some given web page or query, even if they don't share the same words. It's available as a browser extension that can surface relevant results to the current page, as well as a more standard web app resembling Monocle's search page.

Revery's browser extension and web interface running on an iPad and a laptop

Unlike most of my side projects, because of the size of data and amount of computational work Revery requires, its backend is written in Go. Both clients -- the web app and the browser extension -- are built with Torus.

Although it works well enough for me to use it every day, Revery is more of a proof-of-concept prototype than a finished product. I wanted to demonstrate that a tool like this could be built for personal use on top of personal productivity tools like notes and bookmarks, and experience what it would feel like to browse the web and write with such a tool.

Features

Revery, at its core, is just a single API. The API takes in some text, and crawls through my collection of personal documents and notes to find the top ones that seem most topically related to the given text. To make this interesting to use, I've wrapped it up in two different interfaces: a browser extension, and a more standard web-based search interface.

Browser extension

The Revery browser extension lives inside ./extension in this repository, and does exactly one thing: when I hit Ctrl-Shift-L on any webpage I'm viewing, it'll scrape the main body of text from the page (or some selected part of it, if I've highlighted something) and talk to the Revery API to find the documents that are most related to what I'm reading.

Revery's browser extension showing a list of related results

Where Monocle, with its keyword-based search algorithm, is good for recollection, I've found the Revery extension great for explorations on a specific topic. If I'm reading about natural language processing, for example, I can hit a few keystrokes to bring up other articles I've read, or notes I've taken in the past, that I can mentally reference as I read and learn about new ideas in NLP.

We learn new ideas best when we can find existing reference points in our memory onto which we can attach new information. Revery's extension partly automates and speeds up that task. For example, while reading an article about South Korea's unique cultural and economic position in the world, Revery surfaced a few related newsletters and articles from completely different authors and sources on Korean pop culture and its population decline, which helped me frame what I was reading in a much more broad, well-informed context.

Web interface

The web search interface, to me, is a bit secondary to the extension. It exists primarily as a demonstration of Revery's underlying technology, and also incidentally as a way for me to use Revery when the extension isn't available (like on a mobile browser).

Revery's web interface showing a list of results

The search bar in the web interface can take either a URL or some key phrase. Given a URL (as in the screenshot above), Revery will download and read the web page itself to find related documents in the search index. Given a key phrase, Revery will try to suggest documents that contain similar words and speak on similar topics.

This kind of a search interface (as opposed to the extension) is useful to me for starting out thinking about something new, where I can type in a list of related words into the search box and immediately get a list of ideas and documents I'm familiar with that are related, without having to fashion the specific and well-crafted search queries that keyword-based search engines like Monocle require.

How it works

As mentioned above, Revery's core is a single API endpoint that takes in some document and returns a list of most related documents from my search index. What makes Revery special is that this API performs a semantic search, not simply a scan for matching keywords. This means that the top results may not even contain the same words as the query, as long as its contents are topically relevant.

This kind of semantic search is enabled by a search algorithm that uses cosine similarity to cluster document embeddings of the indexed documents. If that sounds like a bunch of random words to you (as it did to me when I started this project), let me break it down:

First, we'll need to understand word embeddings. A word embedding is a way of mapping a vocabulary of natural language words to some points in space (usually a high-dimensional mathematical space), such that words that are similar in meaning are close together in this space. For example, the word "science" in a word embedding would be very close to the word "scientist", reasonably close to "research", and likely very far from "circus". When we talk about "distance" in the context of word embeddings, we usually use cosine similarity rather than Euclidean distance, for both empirical and theoretical reasons I won't cover here.

Although the concept of word embeddings is not very new, there is still active research producing new methods for generating more and more accurate and useful word embeddings from the same corpus of data. My personal deployment of Revery uses the Creative Commons-licensed word embedding dataset produced by Facebook's FastText tool, specifically a 50,000-word dataset with 300 dimensions trained on the Common Crawl corpus.

Word embeddings let us draw inferences about which words are related, but for Revery, we want to draw the same kind of inference about documents, which are a list of words. Thankfully, there's ample literature to suggest that simply taking a weighted average of word vectors for every word in a document can get us a good approximation of a "document vector" that represents the document as a whole. Though there are more advanced methods we can use, like paragraph vectors or models that take word order into account like BERT, averaging word vectors works well enough for Revery's use cases, and is simple to implement and test, so Revery sticks with this approach.

Once we can generate document vectors out of documents using our word embedding, the rest of the algorithm falls into place. On startup, Revery's API server indexes and generates document vectors for all of the documents it can find in my dataset (which isn't too large -- around 25,000 at time of writing), and on every request, the algorithm computes a document vector for the requested document, and sorts every document in the search index by its cosine distance to the query document, to return some top n results.

Within Revery, every part of this algorithm is hand-written in Go. This is for a few reasons:

  1. I wanted to encourage myself to understand these basic algorithms of the trade fully, by writing the code myself
  2. Most open-source libraries to do this kind of computation are made available in Python packages, and I don't have great personal infrastructure for deploying and maintaining a Python application.
  3. Go is fast enough, anecdotally, for this task.

Both of the clients of Revery -- the extension and the web app -- talk to this single API endpoint. The clients themselves are quite ordinary, so I won't go into detail describing how they work here.

Development and deploy

Here, the same disclaimer that I shared with Monocle also apply:

โš ๏ธ Note: If you're reading this section to try to set up and run your own Revery instance, I applaud your audacity, but it might not be super easy or fruitful -- Revery's setup (especially on the data and indexing side) is pretty specific not only to my data sources, but also the way I structure those files. I won't stop you from trying to build your own search index, but be warned: it might not work, and I'm probably not going to do tech support. For this reason, this section is also written in first-person, mostly for my future reference.

Revery depends on the search index produced by Monocle's indexer, so I usually make sure Revery has a recent copy of Monocle's search index available before running.

Revery has two independent codebases in the same repository. The first is the Chrome extension, which lives entirely inside the ./extension folder. Here's how I set it up:

  1. The extension needs an API authentication token to talk to the Revery API. I usually just choose an arbitrarily long random string. Then, I place a file in ./extension called token.js with the content:

    const REVERY_TOKEN = '<some API key here>';
  2. I go to chrome://extensions and click "Load unpacked" to load the ./extension folder as an "unpacked extension" into my browser, which will make the extension available in every tab.

That's it for the extension setup. Next, I set up the server:

  1. Take the same authentication token from above, and place just the token string itself inside tokens.txt in the root of the project folder. The Revery server will grab the whitespace-trimmed content of this file and use it as the API key.
  2. Simply running make will build the revery binary executable into the project folder.
  3. Revery needs two extra sets of data to work: the word embedding model, and Monocle's document dataset.
    • Download a word embedding file (for example, from FastText) and trim it to some reasonable size (top 50-100k words seems to work well). Trim the first line, which usually indicates the total word count and number of dimensions. Revery's code assumes 300 dimensions, so if this is not the case, revise the code.
    • Copy Monocle's docs.json document dataset generated by the indexer to ./corpus/docs.json.
  4. Running the revery executable now should correctly pre-process the model and search index, and start the web application server.

Prior art and future work

Although Revery is useful enough for me to use day to day, There's a lot of active research in the general natural language search space, and Revery itself has a lot of room for improvements.

On the data side:

  • Experimenting with other word embeddings which may provide better performance. I've tried FastText and LexVec, but there are many other open models available.
  • Generating a custom word embedding optimized for my dataset and for use in forming document vectors

On the code side:

  • Optimizing the algorithms that touch data to scale better, using some amount of caching and good old fashioned hand optimization of the code
  • Better ways to surface documents contextually in the browser. Right now, searching Revery within a browser requires an explicit user action. Perhaps we can surface them completely automatically, or even detect when a user has scrolled to the end of a page or highlighted an interesting section of the document to automatically suggest related documents.
  • Better ways to balance the benefits of keyword-based and semantic search. Right now, Monocle and Revery are two completely separate applications, but having both kinds of search collaborating with each other or even simply displaying side by side on screen may be more useful.

There is also plenty of great prior art in this space. Though I can't list them all here, there are a few that stand out as inspirations for Revery.

  • Monocle, the direct predecessor to Revery that uses the same dataset for keyword search
  • same.energy, which enables searching for tweets or photos of the same "style" using a transformer model
  • Semantica, which uses word embeddings to provide a lower-level tool to explore relationships between individual words and concepts
  • Tyler Angert's Information forest, an imaginative note about web browsers of the future
  • Document embedding techniques, which served as a useful overview of the field when I began this project
You might also like...

In this project, I built a simple HTML list of To-Do tasks. This simple web page was built using Webpack, creating everything from a JavaScript index file that imported all the modules and assets

In this project, I built a simple HTML list of To-Do tasks. This simple web page was built using Webpack, creating everything from a JavaScript index file that imported all the modules and assets

To Do List In this project, I built a simple HTML list of To-Do tasks. This simple web page was built using Webpack, creating everything from a JavaSc

Mar 31, 2022

"Jira Search Helper" is a project to search more detail view and support highlight than original jira search

Jira Search Helper What is Jira Search Helper? "Jira Search Helper" is a project to search more detail view and support highlight than original jira s

Dec 23, 2022

GitHub and Markdown-Based CMS for Blogs. EXPERIMENTAL and in the "Idea" stage. I have no clue if this is feasible.

Turborepo starter This is an official pnpm starter turborepo. What's inside? This turborepo uses pnpm as a package manager. It includes the following

Oct 13, 2022

Blog-webapp - A simple webapp prototype that serves tech news, blogs, and anything else a developer might want to learn or get help with

Blog-webapp - A simple webapp prototype that serves tech news, blogs, and anything else a developer might want to learn or get help with

Blog Web app A simple webapp prototype that serves tech news, blogs, and anythin

Nov 3, 2022

โšก A blazing fast, lightweight, and open source comment system for your static website, blogs powered by Supabase

โšก A blazing fast, lightweight, and open source comment system for your static website, blogs powered by Supabase

SupaComments โšก A blazing fast, lightweight, and open source comment system for your static website, blogs ๐Ÿš€ Demo You can visit the Below demo blog po

Dec 27, 2022

Remarkable is a simple extension that automatically keeps your bookmarks clean & up-to-date.

Remarkable is a simple extension that automatically keeps your bookmarks clean & up-to-date.

Remarkable Remarkable is a simple extension that automatically keeps your bookmarks clean & up-to-date. Installation (Other browsers coming soon - sor

Dec 21, 2022

The Omnibookmarks browser extension is the fastest way to open bookmarks

The Omnibookmarks browser extension is the fastest way to open bookmarks

โ˜… Omnibookmarks The Omnibookmarks browser extension is the fastest way to open bookmarks. Just type a keyword into the address bar to quickly open or

Aug 20, 2022

A lambda function mirroring Pixiv bookmarks to Raindrop.io

pixiv-to-raindrop A lambda function that executes automated mirroring of bookmarks from Pixiv to Raindrop.io Demo Source: https://www.pixiv.net/users/

Oct 13, 2022

MERN stack application which serves as an online map journal where users can mark and rate the places they've been to.

MERN stack application which serves as an online map journal where users can mark and rate the places they've been to.

PlaceRate PlaceRate is a MERN stack application which serves as an online map journal where users can mark and rate the places they've been to. You ca

May 17, 2022
Owner
Linus Lee
Languages, interpreters, compilers. Building better tools and workflows, usually on the Web, sometimes for the Web.
Linus Lee
Easily open daily notes and periodic notes in new pane; customize periodic notes background; quick append new line to daily notes.

Obsidian daily notes opener This plugin adds a command for opening daily notes in a new pane (so that a keyboard shortcut could be used!) and gives ex

Xiao Meng 16 Dec 26, 2022
Google-Drive-Directory-Index | Combining the power of Cloudflare Workers and Google Drive API will allow you to index your Google Drive files on the browser.

?? Google-Drive-Directory-Index Combining the power of Cloudflare Workers and Google Drive will allow you to index your Google Drive files on the brow

Aicirou 127 Jan 2, 2023
An efficient (and the fastest!) way to search the web privately using Brave Search Engine

Brave Search An efficient (and the fastest) way to search the web privately using Brave Search Engine. Not affiliated with Brave Search. Tested on Chr

Jishan Shaikh 7 Jun 2, 2022
Search for coding resources by relevant keywords

Search for coding resources by relevant keywords. This API serves educational content for a wide variety of computer science topics, languages and technologies relevant to web development.

null 22 Nov 4, 2022
Node starter kit for semantic-search. Uses Mighty Inference Server with Qdrant vector search.

Mighty Starter This project provides a complete and working semantic search application, using Mighty Inference Server, Qdrant Vector Search, and an e

MAX.IO LLC 8 Oct 18, 2022
Link your documentation to the relevant code files

Link your documentation to the relevant code files

Mintlify 13 Jul 19, 2022
jQuery UI widget for structured queries like "Contacts where Firstname starts with A and Birthday before 1/1/2000 and State in (CA, NY, FL)"...

Structured-Filter ยท Structured-Filter is a generic Web UI for building structured search or filter queries. With it you can build structured search co

Olivier Giulieri 238 Jan 6, 2023
A JupyterLab extension to create custom launcher entries.

jupyter_app_launcher A JupyterLab extension to create custom launcher entries jupyter_app_launcher helps users customize the JupyterLab launcher with

Duc Trung Le 17 Dec 28, 2022
A plugin for Obsidian that shows raw BibTeX bibliography entries in a prettier way

Pretty BibTeX Obsidian Plugin This plugin renders bibliography entries in the BibTeX format in a more readable way. Demonstration Creating a standard

Sandro Figo 8 Dec 30, 2022