A WebAssembly build of the Tesseract OCR engine for use in the browser and Node

Last update: Dec 28, 2022

tesseract-wasm

tesseract-wasm can detect and recognize text in document images. It supports multiple languages via different trained models.

👉 Try the demo (Currently supports English)

Features

This Tesseract build has been optimized for use in the browser by:

Stripping functionality which is not needed in a browser environment (e.g. code to parse various image formats) to reduce download size and improve startup performance. The library and English training data require a ~2.1MB download (with Brotli compression).
Using WebAssembly SIMD when available (Chrome >= 91, Firefox >= 90, Safari ??) to improve text recognition performance.
Providing a high-level API that can be used to run OCR in web pages without blocking interaction, and a low-level API that provides more control over execution.
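
As a rough illustration of the SIMD point above, the sketch below feature-detects WebAssembly SIMD by validating a tiny module that uses `v128` instructions, then picks a binary to load. The helper names `detectWasmSimd` and `chooseCoreBinary` are hypothetical. The mapping of file names to builds (SIMD build in `tesseract-core.wasm`, non-SIMD build in `tesseract-core-fallback.wasm`) is an assumption based on the file names listed under Setup. The library's own `supportsFastBuild` helper, mentioned in the v0.3.0 release notes, may work differently.

```javascript
// Minimal module containing one function that returns a v128 value.
// Engines without SIMD support reject it during validation.
const SIMD_PROBE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, // "\0asm" magic + version 1
  1, 5, 1, 96, 0, 1, 123,      // type section: () -> v128
  3, 2, 1, 0,                  // function section: one function of type 0
  10, 10, 1, 8, 0,             // code section: one body, no locals
  65, 0, 253, 15, 253, 98, 11, // i32.const 0; i8x16.splat; i8x16.popcnt; end
]);

// Returns true if the current JS engine accepts WASM SIMD instructions.
function detectWasmSimd() {
  return typeof WebAssembly === "object" && WebAssembly.validate(SIMD_PROBE);
}

// Assumed mapping: the default binary is the SIMD build, and the
// "-fallback" binary is the build for engines without SIMD.
function chooseCoreBinary(simdSupported) {
  return simdSupported ? "tesseract-core.wasm" : "tesseract-core-fallback.wasm";
}

console.log(chooseCoreBinary(detectWasmSimd()));
```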

Setup

Add the tesseract-wasm library to your project:
```
npm install tesseract-wasm
```
Serve the tesseract-core.wasm, tesseract-core-fallback.wasm and tesseract-worker.js files from node_modules/tesseract-wasm/dist alongside your JavaScript bundle.
Get the training data file(s) for the languages you want to support from the tessdata_fast repo and serve them from a URL that your JavaScript can load. For example, the eng.traineddata file supports English, and also works with many documents in other languages that use the same script.

Usage

tesseract-wasm provides two APIs: a high-level asynchronous API (OCRClient) and a lower-level synchronous API (OCREngine). The high-level API is the most convenient way to run OCR on an image in a web page. It handles running the OCR engine inside a Web Worker to avoid blocking page interaction. The low-level API is useful if more control is needed over where/how the code runs and has lower latency per API call.

Using OCRClient in a web page

```js
import { OCRClient } from 'tesseract-wasm';

async function runOCR() {
  // Fetch document image and decode it into an ImageBitmap.
  const imageResponse = await fetch('./test-image.jpg');
  const imageBlob = await imageResponse.blob();
  const image = await createImageBitmap(imageBlob);

  // Initialize the OCR engine. This will start a Web Worker to do the
  // work in the background.
  const ocr = new OCRClient();

  try {
    // Load the appropriate OCR training data for the image(s) we want to
    // process.
    await ocr.loadModel('eng.traineddata');

    // Pass the decoded image to the OCR engine.
    await ocr.loadImage(image);

    // Perform text recognition and return text in reading order.
    const text = await ocr.getText();

    console.log('OCR text: ', text);
  } finally {
    // Once all OCR-ing has been done, shut down the Web Worker and free up
    // resources.
    ocr.destroy();
  }
}

runOCR();
```
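
When several images share the same language, the model only needs to be loaded once. The following is a minimal sketch of that pattern, using only the methods shown in the example above (`loadModel`, `loadImage`, `getText`, `destroy`). The helper name `recognizeAll` is hypothetical, and the client is passed in as a parameter so the function does not depend on how the client was constructed.

```javascript
// Recognize text in several images with a single client.
// `client` is expected to expose loadModel, loadImage, getText and destroy,
// as OCRClient does in the example above.
async function recognizeAll(client, modelUrl, images) {
  const texts = [];
  try {
    // Load the training data once, then reuse it for every image.
    await client.loadModel(modelUrl);
    for (const image of images) {
      await client.loadImage(image);
      texts.push(await client.getText());
    }
  } finally {
    // Always shut the client down, even if recognition failed part-way.
    client.destroy();
  }
  return texts;
}
```

In a page this might be called as `await recognizeAll(new OCRClient(), 'eng.traineddata', [imageA, imageB])`, where `imageA` and `imageB` are placeholder ImageBitmap values.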

Examples and documentation

See the examples/ directory for projects that show usage of the library in the browser and Node.

See the API documentation for detailed usage information.

See the Tesseract User Manual for information on how Tesseract works, as well as advice on improving recognition.

Development

Prerequisites

To build this library locally, you will need:

A C++ build toolchain (e.g. via the build-essential package on Ubuntu or Xcode on macOS)
CMake
Ninja

The Emscripten toolchain used to compile C++ to WebAssembly is downloaded as part of the build process.

To install CMake and Ninja:

On macOS:

```
brew install cmake ninja
```

On Ubuntu:

```
sudo apt-get install cmake ninja-build
```

Building the library

```
git clone https://github.com/robertknight/tesseract-wasm
cd tesseract-wasm

# Build WebAssembly binaries and JS library in dist/ folder
make lib

# Run tests
make test
```

To test your local build of the library with the example projects, or your own projects, you can use yalc.

```
# In this project
yalc publish

# In the project where you want to use your local build of tesseract-wasm
yalc link tesseract-wasm
```

Comments

peg emsdk version

It seems that the following line will always install the latest tagged emsdk release: https://github.com/robertknight/tesseract-wasm/blob/534e5f7df602a432e5e61ee1dcc832c72a80fe4c/Makefile#L67

It looks like this might make builds non-deterministic. Should it be changed to a specific version?

opened by wydengyre 4
Web example fails to run with an error

I followed the instructions but I still cannot run it:

https://github.com/robertknight/tesseract-wasm/tree/main/examples/web

```
import { cpSync } from "node:fs";
         ^^^^^^
SyntaxError: The requested module 'node:fs' does not provide an export named 'cpSync'
    at ModuleJob._instantiate (internal/modules/esm/module_job.js:121:21)
    at async ModuleJob.run (internal/modules/esm/module_job.js:166:5)
    at async Loader.import (internal/modules/esm/loader.js:178:24)
    at async Object.loadESM (internal/process/esm_loader.js:68:5)
```
question

opened by a0fzide 2
use make for typecheck

Previously, there appeared to be two ways to invoke typechecking: the Makefile and npm/package.json. This commit reduces ambiguity by moving CI typechecking to the Makefile, where all the other build steps live.

opened by wydengyre 1
Add simple orientation detection
Add simple orientation detection using Leptonica's pixOrientDetect function. This was used instead of Tesseract's own orientation detection because Tesseract's implementation requires the legacy (non-LSTM) engine, which is not compiled in. Leptonica's algorithm relies mostly on "the preponderence of ascenders over descenders in languages with roman characters", per this paper. Tesseract's approach, which is not being used, is described here.

TODO:

[x] Investigate issues with same rotated image producing different results when loaded in different browsers (see notes in second commit)

[x] Perhaps add a way for getOrientation API to indicate uncertainty in the result or errors in the process. Currently it returns 0 in the event of any error, and has no way to represent confidence in the result.
opened by robertknight 1
Enable layout analysis before loading a model

It seems that layout analysis is possible before training data is loaded, using Tesseract's TessBaseAPI::InitForAnalysePage method. In a browser context this would reduce the amount of data that needs to be downloaded if we just want to estimate the quality of the OCR layer.

opened by robertknight 1
Ensure third_party/ dirs are updated after version change

Re-run git fetch and checkout of third party packages if the version changes. I might want to look into submodules for this in future.

The Emscripten, Leptonica and Tesseract versions are now configured in third_party_versions.mk.

opened by robertknight 0
Copy static assets to build dir in web example

This avoids the need to override the default web worker URL when constructing the OCRClient, and makes the example follow the recommendations for how end users should serve the WebAssembly module and web worker.

opened by robertknight 0
Add `clearImage` method to OCREngine/OCRClient

This is of limited value at present since memory allocated to WebAssembly cannot be subsequently released without unloading the whole module. It might be useful to ensure that the state of the OCRClient is in-sync with other parts of the application though.

There is some discussion about WASM memory shrinking in https://github.com/WebAssembly/design/issues/1300.
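
To illustrate the state-synchronization use case, here is a small sketch that loads an image, runs some work, and always clears the image afterwards so the client never holds a stale image. `withLoadedImage` is a hypothetical helper, and it assumes `clearImage` takes no arguments and can be awaited on the client.

```javascript
// Run `task` with `image` loaded into `client`, then clear the image
// regardless of whether `task` succeeded.
async function withLoadedImage(client, image, task) {
  await client.loadImage(image);
  try {
    return await task(client);
  } finally {
    // Assumed signature: clearImage() with no arguments.
    await client.clearImage();
  }
}
```

A caller might use it as `await withLoadedImage(ocr, image, (c) => c.getText())`.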

opened by robertknight 0
Ensure all types used in public APIs are exported, add command to build API docs

Add a make api-docs command to build API docs using typedoc, and fix the warnings it reported about various types used in public APIs not being exported.

Also ensure that private methods are explicitly marked as such.

opened by robertknight 0
Convert source to native TypeScript syntax

This provides the ability to explicitly specify which types are exported in the public API, as well as providing the ability to run the code through a wider range of documentation generators.

opened by robertknight 0
Convert ImageBitmap => ImageData on the main thread in all browsers

Chrome has a bug where image orientation metadata in JPEG images is lost when an ImageBitmap is cloned via a structured clone [1]. Therefore we have to do ImageBitmap => ImageData conversion on the main thread to ensure that the OCR engine receives decoded image data which respects the image orientation.

Prior to this fix the rendered image orientation and the OCR output did not match up in Chrome if the input image was rotated.

Since neither Firefox nor Safari support OffscreenCanvas, this means that all browsers are now doing ImageBitmap => ImageData conversion on the main thread.
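
The conversion described here can be sketched with a regular canvas. This is not the library's actual code, only an illustration of the technique: draw the ImageBitmap onto a 2D canvas on the main thread and read the pixels back as ImageData. The function name and the injectable `createCanvas` parameter are hypothetical.

```javascript
// Convert an ImageBitmap to ImageData by drawing it onto a 2D canvas.
// `createCanvas` is injectable so the function can be exercised outside a
// browser; by default it creates a regular <canvas> element.
function imageBitmapToImageData(
  bitmap,
  createCanvas = () => document.createElement("canvas"),
) {
  const canvas = createCanvas();
  canvas.width = bitmap.width;
  canvas.height = bitmap.height;

  const context = canvas.getContext("2d");
  // Drawing happens on the main thread, before any transfer to a worker.
  context.drawImage(bitmap, 0, 0);
  return context.getImageData(0, 0, bitmap.width, bitmap.height);
}
```

The resulting ImageData holds plain RGBA bytes, so it can be sent to a worker without relying on how the browser clones an ImageBitmap.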

Fixes #35

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=1332947

opened by robertknight 0

Failed to recognize Simplified Chinese text

The trained data used is fetched from: https://github.com/tesseract-ocr/tessdata/blob/main/chi_sim.traineddata

The image:

The error:

## <Seems like compiled emscripten wasm glue code>
... eturn (Module["dynCall_jii"]=Module["asm"]["ma"]).apply(null,arguments)};var calledRun;dependenciesFulfilled=function runCaller(){if(!calledRun)run();if(!calledRun)dependenciesFulfilled=runCaller;};function run(args){if(runDependencies>0){return}preRun();if(runDependencies>0){return}function doRun(){if(calledRun)return;calledRun=true;Module["calledRun"]=true;if(ABORT)return;initRuntime();readyPromiseResolve(Module);if(Module["onRuntimeInitialized"])Module["onRuntimeInitialized"]();postRun();}if(Module["setStatus"]){Module["setStatus"]("Running...");setTimeout(function(){setTimeout(function(){Module["setStatus"]("");},1);doRun();},1);}else {doRun();}}Module["run"]=run;if(Module["preInit"]){if(typeof Modu
    at abort (file:///X:/Temp/auto-bb/node_modules/tesseract-wasm/dist/lib.js:539:7465)
    at _abort (file:///X:/Temp/auto-bb/node_modules/tesseract-wasm/dist/lib.js:539:54327)
    at wasm://wasm/0067125a:wasm-function[50]:0x1b74 
    at wasm://wasm/0067125a:wasm-function[694]:0x5d8d2
    at wasm://wasm/0067125a:wasm-function[872]:0x8434f
    at wasm://wasm/0067125a:wasm-function[1733]:0x14bb88
    at wasm://wasm/0067125a:wasm-function[1953]:0x16e2fb
    at OCREngine$loadModel [as loadModel] (eval at new_ (file:///X:/Temp/auto-bb/node_modules/tesseract-wasm/dist/lib.js:539:36290), <anonymous>:9:10)
    at OCREngine.loadModel (file:///X:/Temp/auto-bb/node_modules/tesseract-wasm/dist/lib.js:629:37)
    at main (file:///X:/Temp/auto-bb/index.mjs:24:12)

My code is mainly based on: https://github.com/robertknight/tesseract-wasm/pull/22

opened by shrinktofit 2

peg all dependency versions for deterministic building
There may be others but currently:

[ ] Github Actions runs-on image version (currently ubuntu-latest, which can change out from under us)

[ ] clang-format (unspecified)

[ ] ninja-build (unspecified)

[ ] cmake (unspecified)

It might be worth looking into a tool like Earthly to produce deterministic builds that are identical when developing locally or using CI, but that could be treated as a separate issue.
opened by wydengyre 0
How to pass --psm to detect text as a single column of text?

Hey @robertknight ,

Great work you have been doing here. It performs excellently in Vue 3 with Vite. I would like to pass the --psm 4 parameter to the Tesseract engine so that it treats the text as a single column. Sometimes the engine treats the text as 2 or 3 columns and the recognized text does not make sense. More info: https://stackoverflow.com/questions/44619077/pytesseract-ocr-multiple-config-options

I looked through the source code but could not find how to pass that option.

Thanks.
enhancement

opened by dheimoz 10
Investigate targeting WASM Relaxed SIMD

Tesseract is currently compiled to target SSE4, which Emscripten can map to WASM SIMD instructions. It has code which supports improved vector instructions beyond SSE4 (AVX, FMA). If it is possible for Emscripten to target these when configured to target WASM relaxed-simd (see https://github.com/WebAssembly/relaxed-simd/issues/52), this could result in additional speed-up during text line recognition.

WASM Relaxed SIMD is currently supported in Firefox 103 and is being worked on in Chrome.

opened by robertknight 0
Large images can cause WASM out-of-memory errors
Large images can cause a memory allocation failure when loaded into Tesseract. The image size threshold for triggering this is lower if a large image is already loaded into Tesseract.

Images taken on my iPhone X+ are 3024x4032 at their native resolution. With the current 128MB memory cap they will load into the WebAssembly memory when first dropped into the demo app, but loading another image of a similar size afterwards will trigger an error.

Some things that can be done:

Raise the WASM memory cap from 128MB to a higher value

Convert the image to 8-bit greyscale before loading into Tesseract

Resize images to a certain max size before loading into Tesseract

Specify a maximum image size in the library

Improve error handling for out-of-memory situations so that a useful error is at least reported
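
To make the numbers concrete: if an image is held as uncompressed RGBA (an assumption about the in-memory format), a 3024x4032 photo needs 3024 × 4032 × 4 = 48,771,072 bytes, so two such images plus working memory can plausibly exceed a 128MB cap. The sketch below estimates that footprint and computes downscaled dimensions that fit a pixel budget, as one way to approach the "resize before loading" option. Both function names and the 3-megapixel budget are hypothetical.

```javascript
// Bytes needed to hold an uncompressed RGBA image (4 bytes per pixel).
function rgbaByteLength(width, height) {
  return width * height * 4;
}

// Return dimensions with the same aspect ratio whose pixel count does not
// exceed `maxPixels`. Images already within budget are returned unchanged.
function fitToPixelBudget(width, height, maxPixels) {
  const pixels = width * height;
  if (pixels <= maxPixels) {
    return { width, height };
  }
  // Scale both sides by the same factor so the area shrinks to the budget.
  const scale = Math.sqrt(maxPixels / pixels);
  return {
    width: Math.max(1, Math.floor(width * scale)),
    height: Math.max(1, Math.floor(height * scale)),
  };
}

const fullSizeBytes = rgbaByteLength(3024, 4032);
const resized = fitToPixelBudget(3024, 4032, 3_000_000);
console.log(fullSizeBytes, resized, rgbaByteLength(resized.width, resized.height));
```

The resized dimensions could then be passed to `createImageBitmap` via its `resizeWidth` and `resizeHeight` options before the image is handed to the OCR engine.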
opened by robertknight 2

Releases (v0.7.0)

v0.7.0 (Jul 8, 2022)
What's Changed

Add support for setting Tesseract configuration variables in OCREngine by @wydengyre in https://github.com/robertknight/tesseract-wasm/pull/52. Note that this is a feature for power-users and not all configuration variables will work in the WebAssembly/browser/Node environment.
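
For instance, the page segmentation question raised in the comments above (`--psm 4`) corresponds to Tesseract's `tessedit_pageseg_mode` variable. The sketch below assumes the engine exposes a `setVariable(name, value)` method taking string values. That method name and signature are assumptions not confirmed on this page, so check the API documentation before relying on them. `setPageSegMode` and the `PSM` table are hypothetical helpers.

```javascript
// A few page segmentation modes, numbered as on the Tesseract command line.
const PSM = {
  AUTO: 3,          // fully automatic page segmentation (Tesseract default)
  SINGLE_COLUMN: 4, // assume a single column of text of variable sizes
  SINGLE_BLOCK: 6,  // assume a single uniform block of text
  SINGLE_LINE: 7,   // treat the image as a single text line
};

// Set the page segmentation mode on an engine-like object.
// ASSUMPTION: `engine.setVariable(name, value)` exists and takes strings.
function setPageSegMode(engine, mode) {
  if (!Number.isInteger(mode) || mode < 0 || mode > 13) {
    throw new RangeError("Page segmentation mode must be an integer from 0 to 13");
  }
  engine.setVariable("tessedit_pageseg_mode", String(mode));
}
```

With a real engine instance this might be called as `setPageSegMode(engine, PSM.SINGLE_COLUMN)` before running recognition.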

Full Changelog: https://github.com/robertknight/tesseract-wasm/compare/v0.6.0...v0.7.0

v0.6.0 (Jul 7, 2022)
What's Changed

Upgrade Tesseract version to 5.2.0 by @wydengyre in https://github.com/robertknight/tesseract-wasm/pull/48

Add clearImage method to OCREngine/OCRClient by @robertknight in https://github.com/robertknight/tesseract-wasm/pull/40

Add processing time display in web demo app by @robertknight in https://github.com/robertknight/tesseract-wasm/pull/42

New Contributors

@wydengyre made their first contribution in https://github.com/robertknight/tesseract-wasm/pull/48

Full Changelog: https://github.com/robertknight/tesseract-wasm/compare/v0.5.0...v0.6.0

v0.5.0 (Jun 5, 2022)

This release contains various improvements to the TypeScript types. See full comparison for details.

https://github.com/robertknight/tesseract-wasm/compare/v0.4.0...v0.5.0

v0.4.0 (Jun 5, 2022)

Add simple orientation detection (https://github.com/robertknight/tesseract-wasm/pull/34). The initial implementation is fast but simplistic and works best for Latin text which is not all uppercase.

Add workaround for Chrome bug with handling of rotated images (https://github.com/robertknight/tesseract-wasm/pull/36)

Further reduce peak memory usage when loading images, reducing risk of hitting the current memory cap (https://github.com/robertknight/tesseract-wasm/pull/32)

v0.3.0 (May 30, 2022)

Support larger input images by reducing memory usage due to making multiple copies of input de4076f

Simplify installation of web demo app c1b39e2

Add supportsFastBuild helper for determining WASM build supported by current JS environment e25eea8

https://github.com/robertknight/tesseract-wasm/compare/v0.2.0...v0.3.0

v0.2.0 (May 29, 2022)

Add ./node module export 97f2977

Added Node and web examples (https://github.com/robertknight/tesseract-wasm/pull/22, https://github.com/robertknight/tesseract-wasm/pull/20)


Owner

Robert Knight

Lead developer at @hypothesis. Preact (preactjs.com) contributor. Previous projects - @Mendeley's desktop app, Konsole terminal for KDE

GitHub https://robertknight.github.io/tesseract-wasm/
