Science! Testing various image matching algorithms' performance on the Pinecone vector DB

Overview

Image Matching Algorithms For Use With KNN Search

As part of the development of alt-text.org, it's necessary to perform large-scale fuzzy matching of images. To accomplish this, we leverage the Pinecone.io vector database which offers k-nearest-neighbor searches over feature vectors. While there's been much publicly available research on image similarity algorithms which use hamming distance comparison algorithms, we found little on leveraging KNN searches with non-ML based vector production.

We tested 4 algorithms: 1024 pixel pHash, 1024 frequency discrete cosine transform, 1024 pixel intensity, and the algorithm proposed in this paper by Wong, Bern, and Goldberg which results in feature vectors of 544 values.

Pinecone offers 3 distance metrics for measuring vector adjacency: Cosine, Dot Product, and Euclidean. While we had good reason to believe that Cosine was the only appropriate metric for the aforementioned algorithms, we decided to test all three for each algorithm in case we were mistaken. When returning results Pinecone scores each match, with the range depending on the metric used, and allows setting the max number to return.

We performed our tests using the first 10,000 JPEG images in the train_50k_0.zip file from a past Meta image matching competition which can be found here. The next 2,000 images were used for measuring performance when the requested image is not in the database.

For the alt-text.org use case, we're primarily interested in matching almost identical images. Several cases were of particular interest: larger, smaller, cropped, and in a different image format.

The choice of JavaScript was forced by the existing codebase, including all the matching algorithm options, already being in the language. The single-threaded nature of JavaScript was a major disadvantage and contributed to very long runtimes, but opting for a better suited language would mean a full rewrite of the alt-text.org backend.

We ran an initial round of tests collecting information only on whether the top match was correct and the count of matches with a score over some value. Examining our findings it quickly became clear that we had misunderstood the score field of returned matches, believing that it would be in [0, 1] for all metrics. These initial results did however make it apparent that the Dot Product metric was not appropriate for any algorithm except possibly Goldberg, so it was omitted for other algorithms in the second round of tests. The Euclidean distance metric performed decently, but had significant issues matching identical images. For lack of time, we chose to drop it as well for all algorithms. While the results from the first round did not directly influence our findings, the raw data is available in a Google sheet linked below.

Process

  1. process-images.js: For each image, compute and save to disk versions 2x size, 1/2x size, with a 5% border crop, and in PNG format. We use the Node.js canvas library for this, which wraps the Cairo open source image processing library.
  2. vectorize.js: For each matching algorithm, compute the feature vector for each image and all its alterations, as well as the SHA256 hash of the original image, then store the result in a large JSON blob on disk.
  3. vectorize-missing.js: Compute and save to disk vectors for a smaller set of images. These will be searched for but not inserted.
  4. upsert.js: Upsert a record to Pinecone for only the vector for the original image, with the image hash as the stored value.
  5. query-matching.js: For each image and all its alterations, perform a KNN query, recording the metrics discussed below, and then print a summary of the findings in CSV format.
  6. query-missing.js: For each missing image, perform a KNN query, recording the tops score and the total set of scores.

Goals

Three things were of interest for each algorithm:

  • Vector computation time
  • Whether the correct record was the top result for each image and all its alterations
  • How reliably non-matching results could be excluded

Result Format

The following results for a given (algorithm, distance metric) are recorded:

  1. Vector computation time: average, min, max, and percentiles
  2. What percent of the time was the correct image the top result
  3. What percent of the time was the correct image present in results
  4. The score of the matching image if found: average, min, max, and percentiles
  5. The score of the highest non-matching image: average, min, max, and percentiles

Findings

Goldberg - Cosine

  • Goldberg with a Dot Product metric consistently fared the worst at almost all measures, and could be excluded from further consideration
  • Algorithm to compute is complex and the implementation available is more so
  • The slowest to compute by several orders of magnitude, but the tested implementation may be significantly optimizable
    • Note: After this document was drafted, we looked into optimizing the algorithm implementation, and were able to improve it approximately 3x. It appears there may be additional avenues for improvement.
  • The least reliable for all but cropped images, which are the least important, but the difference was mostly within tolerance
  • The largest and most reliable difference between matching and non-matching, with a score of 0.6 appearing to be a reasonably reliable cutoff which accords with the source paper

Discrete Cosine Transform - Cosine

  • Compute second slowest, but still very fast
  • Perfect correctness for all but cropped images, where it's still very reliable
  • No clear choice of a score that would reliably include non-exact matches while excluding others

Intensity - Cosine

  • Computation quickest and simplest to understand
  • Excellent at matching all but cropped images
  • No clear choice of a score that would reliably include non-exact matches while excluding others

pHash - Cosine

  • Computation quick and simple to understand
  • Excellent at matching all but cropped images
  • No clear choice of a score that would reliably include non-exact matches while excluding others

Conclusion

Based on the findings above, we will endeavor to optimize Goldberg and determine if performance is acceptable. If it is not, we may still opt to use it. Other algorithms fall down hard on the key question of separating altered but matching images from searches returning no similar results.

Data Availability

Raw results are available in this Google sheet

The vast bulk of the runtime for this process is spent computing feature vectors, if you would like to test other KNN query engines, the precomputed vectors are available upon request.

Raw results from the first round of tests are available in this Google sheet.

Future Work

  • Testing additional algorithms
    • Would Goldberg perform better or worse with a wider lightness scale?
  • Testing images with a watermark or text added
  • Reformatted images went from JPEG to PNG as those were the formats available. The performance matching those was identical to matching identical images, hinting that we really needed to be going the other direction given that JPEG is lossy.
You might also like...

InReach is the world’s first tech platform matching LGBTQ+ people with safe, verified resources.

Explore the screenshots » Report a Bug · Request a Feature . Ask a Question Table of Contents About Built With Getting Started Prerequisites Installat

Jan 3, 2023

High performance personalization & a/b testing example using Next.js, Edge Middleware, and Builder.io

High performance personalization & a/b testing example using Next.js, Edge Middleware, and Builder.io

Next.js + Builder.io Personalization & A/B Testing with Edge Middleware This is a fork of Next.js Commerce with Builder.io integrated and using Edge M

Sep 6, 2022

Papers from the computer science community to read and discuss.

Papers We Love (PWL) is a community built around reading, discussing and learning more about academic computer science papers. This repository serves

Dec 31, 2022

NewsStation is a news app which can be used to grab daily news bites. If you are interested in news whether politics, business, entertainment, general, health, science, sports and technology news NewsStation is for you!

NewsStation is a news app which can be used to grab daily news bites. If you are interested in news whether politics, business, entertainment, general, health, science, sports and technology news NewsStation is for you!

This is a NewsStation WebApp Project Using News API NewsStation is a news app which can be used to grab daily news bites. If you are interested in new

Feb 7, 2022

An interpreter for College Board's specified pseudocode for the AP Computer Science Principles Exam.

College Board Pseudocode Interpreter A playground for this interpreter is up on my website. This project is a mostly-functioning interpreter for Colle

Nov 16, 2022

PLSC 21510/31510: Introduction to Text as Data for Social Science (Spring 2022)

PLSC 21510/31510: Introduction to Text as Data for Social Science Spring 2022 About Social scientists increasingly use large quantities of text-based

Oct 17, 2022

Course material for a ~10 hours introductionary course for Julia. Topics: Introduction, Parallel Programming, Data Science

Development We use Franklin.jl to generate the lecture material. To do so, simply activate the environment, use Franklin and run the local server: act

Dec 15, 2022

Homework Assignments for Visualization for Data Science, Fall 2022, University of Utah

Homeworks for Utah's Vis for Data Science Course In subfolders in this directory you will find the homeworks for CS 6630 / CS 5630 / DS 4630 – Visuali

Nov 14, 2022

This blog is still under development! I present a project scope for science articles, it can now be used in production! But there are some details that need to be put up front.

Science-Blog 🛑 Attention! This blog is still under development! I present a project scope for science articles, it can now be used in production! But

Sep 19, 2022
Owner
null
A testing focused Remix Stack, that integrates E2E & Unit testing with Playwright, Vitest, MSW and Testing Library. Driven by Prisma ORM. Deploys to Fly.io

Live Demo · Twitter A testing focused Remix Stack, that integrates E2E & Unit testing with Playwright, Vitest, MSW and Testing Library. Driven by Pris

Remix Stacks 18 Oct 31, 2022
Implementing various sorting algorithms in Typescript's type system

Sorta Cool I was on a 10 hour flight with no WiFi, and, bored out of my mind, I thought it would be fun to implement some sorting algorithms in the Ty

Neil Ramaswamy 7 Nov 10, 2022
This project will be using various AI and Rule Engine algorithm to detect various attack against a company!

?? Introduction This project will be using various AI and Rule Engine algorithm to detect various attack against a website! ?? Mission After starting

Harish S.G 4 Apr 29, 2022
optimize image & upload file to cloud as image bed with tiny image automic.

Rush! 图片压缩 & 直传图床工具 这是一个兴趣使然的项目, 希望 Rush! 能让这个世界的网络资源浪费减少一点点 下载 Downloads 获取最新发行版 功能 Features 拖拽批量压缩图片, 支持格式 jpg/png/gif Drop to optimize, jpg/png/gif

{ Chao } 3 Nov 12, 2022
Cardmatchinggamebyercan - A card-matching game made with Flutter.

card_matching_game_by_ercan A card-matching game. Working Demo: https://confident-austin-19dbd2.netlify.app Getting Started This project is a starting

Ercan Tomac 17 Dec 14, 2022
Automaticly parses known pocket ips patch resources, scans folders or zip files for matching roms and applies the patches.

Pocket Automaton Automaticly parses known pocket ips patch resources, scans folders or zip files for matching roms and applies the patches. Usage pock

null 3 Nov 27, 2022
A web-based application for student-tutor matching service

CodeX A web-based application for student-tutor matching service This project was generated using Nx. ?? Smart, Fast and Extensible Build System Addin

null 3 Jan 25, 2022
Import flow for Excel (.xlsx) and CSV file with automated column matching and validation.

RSI react-spreadsheet-import ⚡️ A component used for importing XLS / XLSX / CSV documents built with Chakra UI. Import flow combines: ?? Uploader ⚙️ P

Ugnis 123 Dec 24, 2022
An algorithm for fast 2D pattern-matching with wildcards.

pattern-match-2d.js An algorithm for fast 2D pattern-matching with wildcards, with a demo app inspired by MarkovJunior (by Maxim Gumin). The algorithm

null 16 Nov 5, 2022
Obsidian plugin: Implicitly add an alias matching the first heading in a document.

Alias from heading Aliases in Obsidian make it convenient to provide display names to document links. However, there are a few pain points: Aliases ar

Chris Basham 6 Dec 17, 2022