Wikipedia using only static assets & no backend

Last update: Jan 1, 2023

Related tags

Overview

http://static.wiki

Build a read-only Wikipedia using CSS, JS, WASM, and SQLite

A proof-of-concept inspired and enabled by Hosting SQLite Databases on Github Pages and the ensuing Hacker News post. The compiled single-page app supports autocomplete for titles, automatic redirecting & other MediaWiki datasets like WikiQuote or Chinese Wikipedia. It makes no external API calls except to get Wikipedia's images.

Known Limitations

The compiled app is not production ready and is bandwidth intensive. It will eat your phone's data plan. Here are some other shortcomings:

Autocomplete is slow

500ms after your last keypress, the app checks if an article matches your input exactly. Then, it searches through an index of 7.3 million article titles for 3 similar matches. Improving the performance of the search requires modifying SQLite or switching to a different WASM/index altogether. See this thread for details

Autocomplete is stuck / the page won't load any more content

If your page is stuck, please try refreshing it. I could not find a way to terminate an ongoing request in the SQLite HttpVFS library.

The `/search/` page always shows "no results"

This feature is currently disabled for en.db and a few others. Article search is even slower than the autocomplete and will lead to the page crashing.

Some wikipages look funny or are incomplete

Any mediawiki syntax/wikitext that can't be easily converted to markdown is stripped out and ignored. Check out wtf_wikipedia for details around the difficulties of parsing wikitext.

Building the app

Just want the db files? Download them from Kaggle

Watch out ⚠️ The codebase is filled with spaghetti. It's all glued together and will likely not be maintained. If you want to compile the app anyway, here are the steps:

1. Get a wikipedia dump

Go to the wikipedia dump website and choose your desired language and the latest date. On the index page, select the file that resembles <languageofchoice>wiki-<dateofbackup>-pages-articles-multistream.xml.bz2 and save the largest archive. Once it's downloaded, extract the xml file from the archive. If you chose the English dump, you'll have a ~90GB uncompressed xml to play with.

2. Process the xml into a SQLite db

Run npm install in the repo.

Then, stream the xml file into the converter. cat "/path/to/enwiki.xml" | node ./scripts/xml_to_sqlite.js /path/to/output/folder/en.db

A new SQLite file will be created at the directory and path you specified. For comparison, conversion for English wikipedia produces a 43 GB db file. It takes ~10 hours on 15" 2014 Macbook Pro and will require 100GB+ of space (excluding the xml dump file)

You can run ./scripts/sqlite3 /path/to/output/folder/en.db to see app-ready data.

3. Build the frontend assets

Run npm run build and it will compile "src/" into "dist/". See the Vite docs for more info on how SPA builds.

If you want to test the SQLite db with your local assets, you can run npm run build and npm run serve in conjunction with a local nginx service to serve the SQLite db separately. Check out the sample nginx.conf which, if you're on Mac, you can use to replace /usr/local/etc/nginx/nginx.conf.

Be sure to adjust the dev urls in db.js to point to your localhost-served "en.db".

But what about hot-reloading? Why can't I just drop the db file in the dist folder and execute `npm run dev`?

The Vite dev-server does not seem to emit the "Accept-Range" header which is required by the SQLite HttpVFS library. npm run dev also crashes after reloading the SQLite library a few times, for some reason.

4. Deploy "dist/" and "db/" to static file hosts

Upload the dist files to your S3 bucket, github pages, surge, vercel, or any static file host. The "db/" directory must be uploaded to a CORS friendly host that supports/emits an "Accept-Range" Header. Setting up cross-domain CORS on a static host is a real pain so I've included sample CORS config for S3 as a point-of-reference.

Finally, adjust the production urls in db.js. With that, your app should be ready to be deployed.

5. Repeat for other wikimedia dumps

If you want to try other dbs, just follow steps 1-4 and add them to db.js.

Questions around the code

Why did you build this?

I wanted to play around with the SQLite HttpVFS library. I got carried away with the wikipedia part.

Why not just build 6 million html files without SQlite?

Static files work well for rendering content. Discoverability demands a separate lookup index and search index. You might be able to build those indexes with JSON / your own code; I found SQLite-HttpVFS to be pretty easy to reason about.

Are the articles converted to real markdown?

Nope. I extended markdown to support thumbnails and other presentational content that most Markdown formats do not support natively.

Why is converting from xml to SQLite so slow?

The conversion code is inefficient. For one, it needs to be cleaned up. For two, it could be parallelized.

Why aren't the SQLite dbs in this repo?

They're too big. As a workaround, you either generate the databases yourself or get a copy from Kaggle.

Credits

Phiresky for his sqlite-httpvfs library and tutorial.
The good christians and folks behind SQLite.
spencermountain for his wtf_wikipedia library.
Wikipedia for their daily dataset dumps.
Everyone in package.json for their libraries.
Me for gluing things together.

The ManageYourCompany 📈 project is a project that creates, deletes, updates companies, units and assets.

The ManageYourCompany 📈 project is a project that creates, deletes, updates companies, units and assets. The rule is that every company has several units and the units have several assets, these assets are machines with several fields: Name, status, person in charge, image, among others... This is a project in order to exercise my Backend skills with NodeJs and front with react.

Feb 9, 2022

A tool to simplify importing custom assets in Minecraft

BAMO - Block And Move On A tool to simplify importing custom assets in Minecraft Currently only allows you to quickly prototype models in-game, but fu

Jul 15, 2022

Get discord application's assets in case you don't have them on your PC

get-discord-app-assets Get discord application's assets in case you don't have them on your PC (this is also the reason why I made this script) I came

Dec 25, 2022

Programmatically upload NFT assets to OpenSea.io for free

OpenSea Uploader This is the example repository for my blog post How to Mint 100,000 NFTs For Free. Please note that this is merely a proof of concept

Sep 15, 2022

Source code and 3D assets for the Rings (for Loot) NFT project

Rings (for Loot) Rings (for Loot) is the first and largest 3D interpretation of an entire category in Loot. Adventurers, builders, and artists are enc

Dec 24, 2022

A file manager plugin for logseq(Search unused assets file)

logseq-plugin-file-manager Search files from assets and draws but not used in journals or pages. Please backup files before operation, and before dele

Dec 23, 2022

Yet another library for generating NFT artwork, uploading NFT assets and metadata to IPFS, deploying NFT smart contracts, and minting NFT collections

eznft Yet another library for generating NFT artwork, uploading NFT assets and metadata to IPFS, deploying NFT smart contracts, and minting NFT collec

Sep 21, 2022

Public repository of assets used during editions of AWS LATAM Container Roadshow event.

LATAM Containers Roadshow This is the official repository of assets related to LATAM Containers Roadshow, an all-day customer-facing event to highligh

Dec 6, 2022

More than a Password Protection and Management tool, it secures all your valuable digital assets in your own vault

ZeroPass Client ZeroPass is more than a Password Protection and Management tool, it secures all your valuable digital assets in your own vault, includ

Aug 22, 2022

Comments

SD Card version?

It'd be awesome to provide a simple "SD Card" version for people to have ready for times when their internet goes out. Ideally this would be some format openable on an Android or iPhone phones. I also gather there was somebody who managed to further compress it to 10 GiB. Just throwing this out there in case somebody has time/vigor to implement this. It'd be cool if we had something that we could put to any phone and have as a backup for times when there's no internet and/or electricity.

opened by youurayy 4
IPFS mirror

Since this Wikipedia version doesn't require any backend it should be possible to host a completely decentralized static Wikipedia mirror on IPFS, making it very resilient against censorship.

https://docs.ipfs.io/how-to/websites-on-ipfs/single-page-website/

opened by madnight 0

Wikipedia using only static assets & no backend

Related tags

Overview

Known Limitations

Autocomplete is slow

Autocomplete is stuck / the page won't load any more content

The /search/ page always shows "no results"

Some wikipages look funny or are incomplete

Building the app

1. Get a wikipedia dump

2. Process the xml into a SQLite db

3. Build the frontend assets

But what about hot-reloading? Why can't I just drop the db file in the dist folder and execute npm run dev?

4. Deploy "dist/" and "db/" to static file hosts

5. Repeat for other wikimedia dumps

Questions around the code

Why did you build this?

Why not just build 6 million html files without SQlite?

Are the articles converted to real markdown?

Why is converting from xml to SQLite so slow?

Why aren't the SQLite dbs in this repo?

Credits

You might also like...

Comments

Owner

Evolve is an online investment portfolio management system where users can keep track of all the assets that they have invested in and how well their assets are performing.

Gofiber with NextJS Static HTML is a small Go program to showcase for bundling a static HTML export of a Next.js app

Gatsby-Formik-contact-form-with-backend-panel - Full working contact form with backend GUI panel.

Venni backend - The backend of the Venni client apps implementing the credit card payments, matching algorithms, bank transfers, trip rating system, and more.

The only Backend you'll ever need. Written in NodeJS, works with any stack

Minecraft 1.8.9 mod which steals the access token and more things from the targetted user and sends to a backend server for processing. Disclaimer: For educational purposes only.

🌈 GitHub following, followers, only-following, only-follower tracker 🌈

In this project, I built a simple HTML list of To-Do tasks. This simple web page was built using Webpack, creating everything from a JavaScript index file that imported all the modules and assets

Swagger UI is a collection of HTML, JavaScript, and CSS assets that dynamically generate beautiful documentation from a Swagger-compliant API.

🕋Assets Compression module for Nuxt 3

The `/search/` page always shows "no results"

But what about hot-reloading? Why can't I just drop the db file in the dist folder and execute `npm run dev`?