Wikipedia using only static assets & no backend

Overview

http://static.wiki

Build a read-only Wikipedia using CSS, JS, WASM, and SQLite

A proof-of-concept inspired and enabled by Hosting SQLite Databases on Github Pages and the ensuing Hacker News post. The compiled single-page app supports autocomplete for titles, automatic redirecting & other MediaWiki datasets like WikiQuote or Chinese Wikipedia. It makes no external API calls except to get Wikipedia's images.

Known Limitations

The compiled app is not production ready and is bandwidth intensive. It will eat your phone's data plan. Here are some other shortcomings:

Autocomplete is slow

500ms after your last keypress, the app checks if an article matches your input exactly. Then, it searches through an index of 7.3 million article titles for 3 similar matches. Improving the performance of the search requires modifying SQLite or switching to a different WASM/index altogether. See this thread for details

Autocomplete is stuck / the page won't load any more content

If your page is stuck, please try refreshing it. I could not find a way to terminate an ongoing request in the SQLite HttpVFS library.

The /search/ page always shows "no results"

This feature is currently disabled for en.db and a few others. Article search is even slower than the autocomplete and will lead to the page crashing.

Some wikipages look funny or are incomplete

Any mediawiki syntax/wikitext that can't be easily converted to markdown is stripped out and ignored. Check out wtf_wikipedia for details around the difficulties of parsing wikitext.

Building the app

Just want the db files? Download them from Kaggle

Watch out ⚠️ The codebase is filled with spaghetti. It's all glued together and will likely not be maintained. If you want to compile the app anyway, here are the steps:

1. Get a wikipedia dump

Go to the wikipedia dump website and choose your desired language and the latest date. On the index page, select the file that resembles <languageofchoice>wiki-<dateofbackup>-pages-articles-multistream.xml.bz2 and save the largest archive. Once it's downloaded, extract the xml file from the archive. If you chose the English dump, you'll have a ~90GB uncompressed xml to play with.

2. Process the xml into a SQLite db

Run npm install in the repo.

Then, stream the xml file into the converter. cat "/path/to/enwiki.xml" | node ./scripts/xml_to_sqlite.js /path/to/output/folder/en.db

A new SQLite file will be created at the directory and path you specified. For comparison, conversion for English wikipedia produces a 43 GB db file. It takes ~10 hours on 15" 2014 Macbook Pro and will require 100GB+ of space (excluding the xml dump file)

You can run ./scripts/sqlite3 /path/to/output/folder/en.db to see app-ready data.

3. Build the frontend assets

Run npm run build and it will compile "src/" into "dist/". See the Vite docs for more info on how SPA builds.

If you want to test the SQLite db with your local assets, you can run npm run build and npm run serve in conjunction with a local nginx service to serve the SQLite db separately. Check out the sample nginx.conf which, if you're on Mac, you can use to replace /usr/local/etc/nginx/nginx.conf.

Be sure to adjust the dev urls in db.js to point to your localhost-served "en.db".

But what about hot-reloading? Why can't I just drop the db file in the dist folder and execute npm run dev?

The Vite dev-server does not seem to emit the "Accept-Range" header which is required by the SQLite HttpVFS library. npm run dev also crashes after reloading the SQLite library a few times, for some reason.

4. Deploy "dist/" and "db/" to static file hosts

Upload the dist files to your S3 bucket, github pages, surge, vercel, or any static file host. The "db/" directory must be uploaded to a CORS friendly host that supports/emits an "Accept-Range" Header. Setting up cross-domain CORS on a static host is a real pain so I've included sample CORS config for S3 as a point-of-reference.

Finally, adjust the production urls in db.js. With that, your app should be ready to be deployed.

5. Repeat for other wikimedia dumps

If you want to try other dbs, just follow steps 1-4 and add them to db.js.

Questions around the code

Why did you build this?

I wanted to play around with the SQLite HttpVFS library. I got carried away with the wikipedia part.

Why not just build 6 million html files without SQlite?

Static files work well for rendering content. Discoverability demands a separate lookup index and search index. You might be able to build those indexes with JSON / your own code; I found SQLite-HttpVFS to be pretty easy to reason about.

Are the articles converted to real markdown?

Nope. I extended markdown to support thumbnails and other presentational content that most Markdown formats do not support natively.

Why is converting from xml to SQLite so slow?

The conversion code is inefficient. For one, it needs to be cleaned up. For two, it could be parallelized.

Why aren't the SQLite dbs in this repo?

They're too big. As a workaround, you either generate the databases yourself or get a copy from Kaggle.

Credits

  1. Phiresky for his sqlite-httpvfs library and tutorial.
  2. The good christians and folks behind SQLite.
  3. spencermountain for his wtf_wikipedia library.
  4. Wikipedia for their daily dataset dumps.
  5. Everyone in package.json for their libraries.
  6. Me for gluing things together.

You might also like...

The ManageYourCompany 📈 project is a project that creates, deletes, updates companies, units and assets.

The ManageYourCompany 📈 project is a project that creates, deletes, updates companies, units and assets. The rule is that every company has several units and the units have several assets, these assets are machines with several fields: Name, status, person in charge, image, among others... This is a project in order to exercise my Backend skills with NodeJs and front with react.

Feb 9, 2022

A tool to simplify importing custom assets in Minecraft

BAMO - Block And Move On A tool to simplify importing custom assets in Minecraft Currently only allows you to quickly prototype models in-game, but fu

Jul 15, 2022

Get discord application's assets in case you don't have them on your PC

Get discord application's assets in case you don't have them on your PC

get-discord-app-assets Get discord application's assets in case you don't have them on your PC (this is also the reason why I made this script) I came

Aug 25, 2022

Programmatically upload NFT assets to OpenSea.io for free

Programmatically upload NFT assets to OpenSea.io for free

OpenSea Uploader This is the example repository for my blog post How to Mint 100,000 NFTs For Free. Please note that this is merely a proof of concept

Sep 15, 2022

Source code and 3D assets for the Rings (for Loot) NFT project

Source code and 3D assets for the Rings (for Loot) NFT project

Rings (for Loot) Rings (for Loot) is the first and largest 3D interpretation of an entire category in Loot. Adventurers, builders, and artists are enc

Sep 13, 2022

A file manager plugin for logseq(Search unused assets file)

A file manager plugin for logseq(Search unused assets file)

logseq-plugin-file-manager Search files from assets and draws but not used in journals or pages. Please backup files before operation, and before dele

Nov 10, 2022

Yet another library for generating NFT artwork, uploading NFT assets and metadata to IPFS, deploying NFT smart contracts, and minting NFT collections

eznft Yet another library for generating NFT artwork, uploading NFT assets and metadata to IPFS, deploying NFT smart contracts, and minting NFT collec

Sep 21, 2022

Public repository of assets used during editions of AWS LATAM Container Roadshow event.

LATAM Containers Roadshow This is the official repository of assets related to LATAM Containers Roadshow, an all-day customer-facing event to highligh

Nov 23, 2022

More than a Password Protection and Management tool, it secures all your valuable digital assets in your own vault

ZeroPass Client ZeroPass is more than a Password Protection and Management tool, it secures all your valuable digital assets in your own vault, includ

Aug 22, 2022
Comments
  • SD Card version?

    SD Card version?

    It'd be awesome to provide a simple "SD Card" version for people to have ready for times when their internet goes out. Ideally this would be some format openable on an Android or iPhone phones. I also gather there was somebody who managed to further compress it to 10 GiB. Just throwing this out there in case somebody has time/vigor to implement this. It'd be cool if we had something that we could put to any phone and have as a backup for times when there's no internet and/or electricity.

    opened by youurayy 4
  • IPFS mirror

    IPFS mirror

    Since this Wikipedia version doesn't require any backend it should be possible to host a completely decentralized static Wikipedia mirror on IPFS, making it very resilient against censorship.

    https://docs.ipfs.io/how-to/websites-on-ipfs/single-page-website/

    opened by madnight 0
Owner
null
Evolve is an online investment portfolio management system where users can keep track of all the assets that they have invested in and how well their assets are performing.

Evolve is an online investment portfolio management system where users can keep track of all the assets that they have invested in and how well their assets are performing.

Indrajit 6 Oct 16, 2022
Gofiber with NextJS Static HTML is a small Go program to showcase for bundling a static HTML export of a Next.js app

Gofiber and NextJS Static HTML Gofiber with NextJS Static HTML is a small Go program to showcase for bundling a static HTML export of a Next.js app. R

Mai 1 Jan 22, 2022
🌈 GitHub following, followers, only-following, only-follower tracker 🌈

github-following-tracker GitHub following, followers, only-following, only-follower tracker ?? Just enter your GitHub name and track your followings!

Youngkwon Kim 10 Jun 15, 2022
In this project, I built a simple HTML list of To-Do tasks. This simple web page was built using Webpack, creating everything from a JavaScript index file that imported all the modules and assets

To Do List In this project, I built a simple HTML list of To-Do tasks. This simple web page was built using Webpack, creating everything from a JavaSc

Andrés Felipe Arroyave Naranjo 10 Mar 31, 2022
The only Backend you'll ever need. Written in NodeJS, works with any stack

The only Backend you'll ever need. Written in NodeJS, works with any stack Conduit Platform Conduit is a NodeJS-based Self-Hosted backend, that aims t

Conduit 199 Nov 24, 2022
Minecraft 1.8.9 mod which steals the access token and more things from the targetted user and sends to a backend server for processing. Disclaimer: For educational purposes only.

R.A.T Retrieve Access Token Check DxxxxY/TokenAuth to login into an MC account with a name, token and uuid combo. Features Grabs the username, uuid, t

null 42 Nov 18, 2022
Gatsby-Formik-contact-form-with-backend-panel - Full working contact form with backend GUI panel.

Gatsby minimal starter ?? Quick start Create a Gatsby site. Use the Gatsby CLI to create a new site, specifying the minimal starter. # create a new Ga

Bart 1 Jan 2, 2022
Venni backend - The backend of the Venni client apps implementing the credit card payments, matching algorithms, bank transfers, trip rating system, and more.

Cloud Functions Description This repository contains the cloud functions used in the Firebase backend of the Venni apps. Local Development Setup For t

Abrantes 1 Jan 3, 2022
Swagger UI is a collection of HTML, JavaScript, and CSS assets that dynamically generate beautiful documentation from a Swagger-compliant API.

Introduction Swagger UI allows anyone — be it your development team or your end consumers — to visualize and interact with the API’s resources without

Swagger 23k Nov 22, 2022
🕋Assets Compression module for Nuxt 3

@nuxt-modules/compression Assets Compression module for Nuxt 3 ✨ Release Notes ?? Read the documentation Features Nuxt 3 ready Assets Compression usin

Nuxt Modules 26 Nov 21, 2022