ltp is a parseless copyfree binary encoding format

Overview

ltp

ltp is a parseless copyfree binary encoding format. This means that you can read a field out of ltp encoded data without

a) inspecting every byte. b) going through a complicated state machine. c) creating another in memory data structure.

it's inspired by capt'n'proto, but much simpler.

Motivation

High performance and simplicity usually go together. We often think of high performance as "more power". To make a car go faster, you can fit a larger engine and burn more fuel, faster. But with software, that's wrong. To make software faster you can only take away unnecessary work. Often, something designed for simplicity is quite fast, because simplicity must also avoid unnecessary work. We also like simplicity because it makes it faster to understand. (again, because you don't have to understand the unnecessary parts)

A format such as JSON may appear simple, because it is so familiar. But parsing json text into a data structure involves quite a lot of extra work. One aspect of the work is examining each byte in the input, switching between states that represent escape characters, nesting level, etc. Another significant aspect is transforming the flat bytes into a data structure, (which means the garbage collector must get involved) Often, we parse a json object, and then only access one or two fields, but to do that we recreated another representation of the object as a data structure in memory, then threw it away.

That's not environmentally friendly!

If there are a lot of data moving through the system this parsing and data structure building can be very significant. People say that the JSON parsing libraries built into your system are well optimized, and are fast. They may well be fast compared to other JSON libraries, or to other encoding formats based around the same idea, but they still include a lot of uncessary work.

ltp can be pronounced like "litup" or LTP. The joke was that it's "lieutenant proto", lieutenant being a lower rank than captain.

Generating Code

ltp can currently generate code in C and Zig. Both of these languages are good for making wasm.

If you would like to make a PR to add support for another language, see ./generators/

to generate code, create a js file with a data structure describing the schema.

var schema = {
  basic: [
    {name: 'foo', position: 0, direct: {type: 'u8'}},
    {name: 'bar', position: 1, direct: {type: 'u32'}},
    {name: 'name', position: 5, direct: {type: 'u8'}, pointed: {type: 'string_u8'}}
  ]
}

console.log(require('ltp/generate')(schema, prefix='', includeHeader=true, language='c'))

running that file with node will output C code to encode and decode your schema. schema is the data structure describing for objects, prefix is a string that will be prepended to all function names. if includeHeaders is false, the utility functions needed by the codec will not be included. Use this if you are outputting many schemas into one file. language is the language to output. currently c and zig are supported.

Prior Work

Sadly, there are not many in-place formats available. This is why I need to both explain and design them.

bipf

bipf is another format I created earlier. bipf targets json style schemaless data, so it includes unnecessary work compared to ltp.

capt'n'proto

Captnproto always sounded good, but reading the docs trying to understand how it works always just made me more confused. I would start reading and be bombarded with too many clever ideas like segments and XOR defaults. I wanted something I could easily understand.

Fixed vs Variable size fields

Your basic primitives: numbers, boolean, are always the same size. These are always encoded at the start of an object, and in the same position. So if the first field is a 32 bit integer, then that's the first 4 bytes.

Strings, buffers, and embedded objects (if they contain a variable sized object) are variable sized.

An object with only fixed size fields are simple.

with the schema

{
  count i32
  b u8
  foo boolean
}

and values ({count: 123, b: 64, foo: true})

7b 00 00 00 //4 bytes, little endian
40          // a single byte 0x40=64
01          // a single bit for the boolean

this object always takes 6 bytes no matter what it's field values are.

On the other hand, variable sized fields are encoded with a relative pointer in the fixed size section - like primitive values, this is always in the same position.

using the schema

{
  count i32 //32 bit integer
  name string_u32 //string up to max u32 length
}
00 00 00 00 // 32 byte integer
04 00 00 00 // --, pointer to subsequent string
---         //   | (end of fixed section)
05 00 00 00 // <-` length of a string
68 65 6c 6c 6f

The value of the relative pointer is the number of bytes from the pointer's position to where the data value is. This is always after the fixed size values. The advantage of a relative pointer like this is that it always has the same meaning in different positions. An encoded object with relative pointers can be embedded inside another and the pointers remain valid, this is very handy when transmitting objects over the network, or embedding objects inside of other objects.

Schema Data Structure

To understand what sort of things ipb can represent, it's helpful to understand it's internal data structure.

ObjectCodec([fields...]) => codec

schema is an array with an optional LengthField then any number of pointed or direct fields [LengthField?, (DirectField|PointedField)*]

returns a codec which is based on abstract-encoding but has some extentions, such as a bytes property if it's fixed size, and encodedLength function.

codec.encode (value, buffer, start)

encode value into buffer at position start. sets codec.encode.bytes to the number of bytes used.

codec.decode (buffer, start, end) => value

decode value from buffer at position start. if anything goes outside of end while reading, then an exception will be thrown. By default end = buffer.length. When an ObjectCodec has a length field, or can otherwise determine the length, it may contract the end value for any embedded fields.

This is a security feature. Without the end field, it would be possible to have a record that on the outside would be small, but in the inside claimed to be big, which could cause it to read out of bounds memory which could contain sensitive data, such as private keys.

codec.encodingLength(value) =>

return the bytes that will be needed to encode value

codec.bytes //fixed size objects only

If a codec is fixed size, it will have an integer bytes property defined. Fixed size fields are very useful because they enable much faster reads.

codec.encodedLength(buffer, start, end)

return the length used by the record at start, but do not decode it. Some times it's useful to know how long a record is without decoding it. This allows you to for example, process records in a stream.

codec.dereference(buffer, start, field_index) => pointer

returns a pointer to where the field at index is encoded. For direct fields, this returns the fixed offset of that field. For variable size, pointed fields, this reads the relative pointer, and returns the actual position of the value.

codec.reflect(field_index) => codec

returns the codec used for a specific field. Use in combination with dereference

codec.reflect(index).decode(buffer, codec.dereference(buffer, start, index))

Fields

DirectField (name, position, value_codec)

create a field that stores a fixed size value in a fixed position. name is a utf8 string for the field (used when decoding to a js/json object) position is an integer, a fixed byte offset into an object. value_codec is the encoding used for the value. (must be abstract-encoding interface and must be a fixed size)

PointedField (name, position, direct_codec, pointed_codec, isNullable=false)

a field that stores a fixed size pointer to a variable sized field name and position are the same as DirectField but direct_codec is the an integer type used to encode the relative pointer to the value. pointed_codec is the actual value, usually this would be a length delimited variable size encoding.

if isNullable is true, null pointers are allowed, represented by a relpointer=0.

LengthField (name, position=0, length_codec)

A special case to embed the length of an object. position must be 0, (that is, at the start of the object). When encoding an object, the length used will be calculated and written to this position, including the length used to write any pointed values. If the input has a length value, it is ignored. When decoding an object, the length read is returned in the name field.

FixedPositionVariableField (name, position, value_codec)

If you care about using space as efficiently as possible, then you can save the space needed for the pointer as long as there is only one varible size field and it's the last field. FixedPositionVariableSizeField cannot be nullable (but can encode an empty string)

DirectArrayCodec (length_codec, value_codec)

create an array codec that stores fixed size values directly. length_codec must be an unsigned integer type, that will express the maximum byte length of the array.

value_codec is the codec used to encode the values, which must be fixed size. the maximum number of items in the array will be the maximum value expressible by the length codec, divided by the size of the value_codec.

because the value_codec is stored directly, DirectArrayCodec can not express null pointers, only zero values.

PointedArrayCodec (length_codec, direct_codec, pointed_codec, isNullable)

create an array codec that stores variable sized objects, indexed by fixed size relative pointers. length_codec must be an unsigned integer type, that will express the maximum byte length of the array. direct_codec must be a integer type. pointed_codec encodes the array value and should be a variable size codec. The maximum number of elements is the maximum value expressible of length_codec divided by the size of pointed_codec

if isNullable is true, null pointers are allowed, represented by a relpointer=0.

Any(type_codec, length_codec, codec_lookup) = require('ltp/any')

A codec for handling dynamic types. This is encoded with the type, then the length then the codec returned by codec_lookup(type). Returns a {type, value} object, where value is the embedded dynamic object. type_codec and length_codec must both be fixed size.

Any.AnyPointer(type_codec, length_codec, codec_lookup)

The same as above, but on decode, returns {type, length, value} but value is just a pointer to the start of where the value is encoded. The end will be at value + length.

If you care about performance it maybe important to use in-place access. Therefore it's recommended to use AnyPointer and then read specific fields of the returned object.

You might also like...

Dynamically resize, format and optimize images based on url modifiers.

Local Image Sharp Dynamically resize, format and optimize images based on url modifiers. Table of Contents ✨ Features 🖐 Installation 🚚 Usage Contrib

Nov 29, 2022

Chrome extension to change from 12 hour to 24 hour format.

WhatsappTime Chrome extension to change from 12 hour to 24 hour format. There might be some JavaScript errors thrown, but it seems to be working good

Jun 22, 2022

A command-line tool to convert Project Zomboid map data into Deep Zoom format

A command-line tool to convert Project Zomboid map data into Deep Zoom format

pzmap2dzi pzmap2dzi is a command-line tool running on Windows to convert Project Zomboid map data into Deep Zoom format. Features Supports both python

Dec 31, 2022

🔨 A more engineered, highly customizable, standard output format commitizen adapter.

🔨 A more engineered, highly customizable, standard output format commitizen adapter.

cz-git Github | Installation | Website | 简体中文文档 Introduction A more engineered, highly customizable, standard output format commitizen adapter. What i

Dec 31, 2022

The quickest way to format code!

Only Formats The quickest way to format code! ✨ The live Instance Tech Stack Next.js with TypeScript Chakra UI Prettier to format code Monaco editor f

Nov 29, 2022

A generative engine that takes various png layers on a sprite sheet format, combines them and then converts them into a .gif file

A generative engine that takes various png layers on a sprite sheet format, combines them and then converts them into a .gif file

Welcome to the Generative GIF Engine v2.0.4 🐤 [8 minute read] This python and node app generates layered-based gifs to create NFT gif art! It is fast

Jan 2, 2023

A generative engine that takes various png layers on a sprite sheet format, combines them and then converts them into a .gif file

A generative engine that takes various png layers on a sprite sheet format, combines them and then converts them into a .gif file

Welcome to the Generative Animated Engine v3.0.1 🐤 [8 minute read] This repo used to be called jalagar/Generative_Gif_Engine but because it now suppo

May 24, 2022

The original SSB feed format, for ssb-db2

ssb-classic This module is a plugin for ssb-db2 which implements the classic SSB feed format. You can use this module as an ssb-db2 plugin, or you can

Jun 18, 2022
Comments
  • Suggestion, single file implementation

    Suggestion, single file implementation

    The .c / .h implementation was nice to read for it's single file nature.

    Should we combine the JS code into a single file ? This would match how we've authored Tonic as that's also a single file implementation for being able to read and understand it all in a single pass.

    opened by Raynos 0
  • Variables never used

    Variables never used

    I noticed that _end is never used. I also noticed that bytes is not used, but it is incremented here which could be a bug? Similarly here with fields.

    opened by heapwolf 0
Owner
Socket Supply Co.
A Web owned and operated by Everyone
Socket Supply Co.
An npm package for demonstration purposes using TypeScript to build for both the ECMAScript Module format (i.e. ESM or ES Module) and CommonJS Module format. It can be used in Node.js and browser applications.

An npm package for demonstration purposes using TypeScript to build for both the ECMAScript Module format (i.e. ESM or ES Module) and CommonJS Module format. It can be used in Node.js and browser applications.

Snyk Labs 57 Dec 28, 2022
Binary-encoded serialization of JavaScript objects with generator-based parser and serializer

YaBSON Schemaless binary-encoded serialization of JavaScript objects with generator-based parser and serializer This library is designed to transfer l

Gildas 11 Aug 9, 2022
Glorious Binary Serialization and Deserialization for TypeScript.

Glorious SerDes for TypeScript The library you can rely on, For binary serialization and deserialization, In Node, Deno, and the Web environment, Whic

Wei 25 Dec 11, 2022
shell script replacement; write shell scripts in js instead of bash, then run them with a single static binary

yavascript YavaScript is a bash-like script runner which is distributed as a single statically-linked binary. Scripts are written in JavaScript. There

Lily Scott 59 Dec 29, 2022
Library to download binary files from GitHub releases detecting the correct platform.

Dbin TypeScript library to download binary files from GitHub releases detecting the correct platform. Example: import dbin from "https://deno.land/x/d

Óscar Otero 7 Oct 4, 2022
Small (fragile) script for migrating comments from dev.to posts to Wordpress format (WXR/XML)

dev-to-wxr Small (fragile) script for migrating comments from dev.to posts to Wordpress format (WXR/XML). Useful for importing in tools like disqus. U

Fahad Hossain 2 Jan 29, 2022
Babel-plugin-amd-checker - Module format checking plugin for Babel usable in both Node.js the web browser environments.

babel-plugin-amd-checker A Babel plugin to check the format of your modules when compiling your code using Babel. This plugin allows you to abort the

Ferdinand Prantl 1 Jan 6, 2022
Simple format that serves it's one and only purpose and that's creating simple task list everywhere where you can write plain text

SWTF (Simple Worklog Task Format) Simple format that serves it's one and only purpose and that's creating simple task list everywhere where you can wr

null 4 Apr 4, 2022
TypeScript playground plugin to save on format

TypeScript Playground Plugin TypeScript Playground plugin to save on format Running this plugin Click this link to install or Open up the TypeScript P

Anurag Hazra 8 Apr 11, 2022
A fixed-width file format toolset with streaming support and flexible options.

fixed-width A fixed-width file format toolset with streaming support and flexible options. Features Flexible: lot of options Zero dependencies: small

Evologi Srl 6 Jul 14, 2022