Fast Differentiable Tensor Library in JavaScript and TypeScript with Bun + Flashlight

Overview

A fast differentiable tensor library for research in TypeScript and JavaScript. Built with bun + flashlight. ⚠️ This is experimental software! ⚠️

Quickstart

Install Bun and ArrayFire, then run:

bun install @shumai/shumai

Only macOS and Linux are supported. Linux installs default to GPU computation with CUDA, and macOS to CPU. Detailed install instructions below.

Install is a work in progress: please file an issue if you run into problems.

Why build this?

With Shumai, we hope to make the following easier:

  • Creating datasets
    • JavaScript, with native typed arrays and a JIT compiler, is perfect for twiddling with data before shaping it into big, flat, GPU-compatible arrays (see the sketch after this list).
  • Training small models
    • FFI bindings in Bun are crazy fast (~3ns overhead), so JavaScript gets out of the way when training small models.
  • Advanced/fine-grained training/inference logic
    • Bun uses the JSC JIT compiler, so you can confidently write complex training logic without needing a native C++ implementation.
  • Building applications
    • JavaScript has a huge ecosystem, which facilitates better application development.
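
As a small illustration of the dataset point above, here's one way to pack plain JavaScript objects into a flat, GPU-compatible array (the field names are hypothetical; only sm.tensor and reshape from this README are assumed):

import * as sm from "@shumai/shumai"

// hypothetical rows, e.g. parsed from JSON or a database
const rows = [
  { x: 0.1, y: 0.7 },
  { x: 0.4, y: 0.2 },
]

// flatten into a single typed array...
const flat = new Float32Array(rows.length * 2)
rows.forEach((r, i) => {
  flat[i * 2] = r.x
  flat[i * 2 + 1] = r.y
})

// ...and hand it to the tensor library
const batch = sm.tensor(flat).reshape([rows.length, 2])
console.log(batch.shape) // [2, 2]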

Usage

shumai will always attempt to use an attached GPU or accelerator. CPU computation falls back to the ArrayFire CPU backend, which is not well optimized.

We hope to support the ArrayFire OpenCL backend and other non-ArrayFire tensor backends soon.

If shumai seems unusually slow, please file an issue!

Standard array utilities:

import * as sm from "@shumai/shumai"

// create a 1024 by 1024 tensor, randomly filled with normal distribution
let X = sm.randn([1024, 1024])
let W = sm.identity(1024)
let Y = X.matmul(W)
console.log(Y.shape)
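
Since W is the identity matrix, Y should equal X exactly; a quick sanity check using the comparison and reduction ops from the table of supported operations below (a sketch, assuming they behave as listed):

// prints 1 if every element of Y equals the corresponding element of X
console.log(Y.eq(X).all().toFloat32())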

Conversion to and from JavaScript native arrays:

const data : Float32Array = new Float32Array(128)
for (let i = 0; i < 128; ++i) {
  data[i] = Math.random()
}

const X : Tensor = sm.tensor(data)
const pi = sm.scalar(3.14)
const Y = X.mul(pi)

// tensors can be converted back to native JavaScript
const Y_data = Y.toFloat32Array()

// scalar tensors can be converted to JavaScript numbers
const total : number = X.sum().toFloat32()

Gradients:

const W = sm.randn([128, 128])
W.requires_grad = true

const X = sm.randn([128, 128])
const diff = X.sub(W)
const mse = diff.mul(diff).sum()
mse.backward()

W.grad // this gradient is now populated

// copy W without allowing gradient updates
const Y = W.detach()
Y.sum().backward() // nothing changes
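
The populated gradient can be used for a manual parameter update. A minimal sketch using only the API shown above (shumai may also provide optimizer utilities, which are not assumed here):

const lr = sm.scalar(0.01)

// gradient-descent step: W <- W - lr * dL/dW
// detach() so the update itself is not tracked by autograd
const W_updated = W.detach().sub(W.grad.mul(lr))
W_updated.requires_grad = true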

Some more examples can be found here.

Supported operators can be found here.

Install

The install procedure is a work in progress! If you have any problems building or installing, we would greatly appreciate filed issues. Please tell us about your platform/OS when you do.

Prerequisites:

  • Ensure you have bun installed (https://bun.sh).
  • Install ArrayFire. macOS users should install ArrayFire's CPU backend; Linux users should install the CUDA backend^.
    • macOS --- ArrayFire can easily be installed with Homebrew:
      brew install arrayfire
    • Linux --- instructions can be found here. On Ubuntu, ArrayFire can be installed via package managers (e.g. apt).

Once bun and ArrayFire are installed, install the package and backing libs with bun:

bun install @shumai/shumai

^Linux users can use the CPU backend by swapping the required package.json dependency from @shumai/linux_x64_shumai_flashlight to @shumai/linux_x64_shumai_flashlight_cpu, i.e. running:

sed -i "s/linux_x64_shumai_flashlight/linux_x64_shumai_flashlight_cpu/g" package.json
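
To sanity-check an install, a minimal smoke test helps (smoke.ts is a hypothetical file name; run it with bun smoke.ts):

import * as sm from "@shumai/shumai"

// if the native bindings loaded, this allocates a tensor on the default backend
const t = sm.randn([8, 8])
console.log(t.shape) // [8, 8]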

Building Native Libraries from Source

Note: this is not required when developing TypeScript/JavaScript library components locally.

From-source build instructions for macOS and Linux are below.

This process builds the dependent FFI libraries (libflashlight and libflashlight_binding) and packs them with npm pack to generate a @shumai/shumai_*.tgz package. You can then run npm install $PATH_TO_SOURCE/@shumai/shumai-*.tgz to install the package where you'd like.

Building on macOS from Source

First, install ArrayFire CPU with brew install arrayfire.

Build and install Flashlight:

mkdir -p $HOME/usr/ # installing flashlight here
git clone --recursive --depth 1 https://github.com/flashlight/flashlight.git
cd flashlight
mkdir -p build
cd build
# RelWithDebInfo, or another build type as needed
cmake .. \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DFL_ARRAYFIRE_USE_CPU=ON \
  -DFL_ARRAYFIRE_USE_CUDA=OFF \
  -DFL_BUILD_DISTRIBUTED=OFF \
  -DFL_USE_ONEDNN=OFF \
  -DFL_BUILD_TESTS=OFF \
  -DFL_BUILD_EXAMPLES=OFF \
  -DFL_BUILD_SCRIPTS=OFF \
  -DCMAKE_INSTALL_PREFIX=$HOME/usr/
make -j$(sysctl -n hw.ncpu)  # macOS has no nproc
make install

Build Flashlight bindings for Shumai:

cd shumai
mkdir -p build
cd build
cmake .. -Dflashlight_DIR=$HOME/usr/share/flashlight/cmake/
make -j$(sysctl -n hw.ncpu)

Profiling

On macOS, you can record perf with xcrun xctrace record --template "Time Profiler" --launch $(which bun) train.js.
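
For coarse wall-clock timing without Instruments, a simple loop also works (a sketch; the backend may evaluate lazily, so copying the result back to JS is what forces the work to complete, and the numbers are illustrative only):

import * as sm from "@shumai/shumai"

const A = sm.randn([1024, 1024])
const B = sm.randn([1024, 1024])

const t0 = performance.now()
for (let i = 0; i < 10; ++i) {
  // toFloat32Array() forces materialization of the product
  A.matmul(B).toFloat32Array()
}
const t1 = performance.now()
console.log(`mean matmul + copy: ${((t1 - t0) / 10).toFixed(2)}ms`)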

Building on Linux from Source

First install ArrayFire. The prebuilt Linux package for shumai uses the CUDA backend, but when building from source you can target the CPU backend as well (OpenCL support is coming soon).

Build and install Flashlight:

mkdir -p $HOME/usr/ # installing flashlight here
git clone --recursive --depth 1 https://github.com/flashlight/flashlight.git
cd flashlight
mkdir -p build
cd build
# RelWithDebInfo, or another build type as needed;
# swap USE_CPU/USE_CUDA below to build for CPU instead
cmake .. \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DFL_ARRAYFIRE_USE_CPU=OFF \
  -DFL_ARRAYFIRE_USE_CUDA=ON \
  -DFL_BUILD_DISTRIBUTED=OFF \
  -DFL_USE_ONEDNN=OFF \
  -DFL_BUILD_TESTS=OFF \
  -DFL_BUILD_EXAMPLES=OFF \
  -DFL_BUILD_SCRIPTS=OFF \
  -DCMAKE_INSTALL_PREFIX=$HOME/usr/
make -j$(nproc)
make install

Build bindings for shumai:

mkdir -p build && cd build
# RelWithDebInfo, or another build type as needed
cmake .. \
    -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -Dflashlight_DIR=${FLASHLIGHT_INSTALL_PREFIX}/share/flashlight/cmake \
    -DArrayFire_DIR=${ARRAYFIRE_INSTALL_PREFIX}/share/ArrayFire/cmake  # only needed if ArrayFire was built from source
make -j$(nproc)

Contributing

If you'd like to make changes to the core bindings or ffi, first build from source.

All files ending in *.inl or *_gen.ts are generated. These can be modified by editing scripts/gen_binding.py and running ./scripts/gen_all_binding.sh.

See the CONTRIBUTING file for style guidance and more info on how to help out. 😁

Supported Operations

Some operations are supported as both static functions and methods on existing tensors.
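
For example, add appears in both forms, which are interchangeable:

import * as sm from "@shumai/shumai"

const a = sm.randn([4])
const b = sm.randn([4])

const c1 = sm.add(a, b) // static function
const c2 = a.add(b)     // tensor method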

| Operation | Function | Tensor Method (`t : Tensor`) |
| --- | --- | --- |
| rand | `rand(shape: number[]) : Tensor` | |
| randn | `randn(shape: number[]) : Tensor` | |
| full | `full(shape: number[], val: number) : Tensor` | |
| identity | `identity(dim: number) : Tensor` | |
| arange | `arange(start: number, end: number, step: number = 1) : Tensor` | |
| iota | `iota(dims: number[], tileDims: number[] = [1]) : Tensor` | |
| reshape | `reshape(tensor: Tensor, shape: number[]) : Tensor` | `t.reshape(shape: number[]) : Tensor` |
| transpose | `transpose(tensor: Tensor, axes: number[]) : Tensor` | `t.transpose(axes: number[]) : Tensor` |
| tile | `tile(tensor: Tensor, shape: number[]) : Tensor` | `t.tile(shape: number[]) : Tensor` |
| nonzero | `nonzero(tensor: Tensor) : Tensor` | `t.nonzero() : Tensor` |
| negative | `negative(tensor: Tensor) : Tensor` | `t.negative() : Tensor` |
| logicalNot | `logicalNot(tensor: Tensor) : Tensor` | `t.logicalNot() : Tensor` |
| exp | `exp(tensor: Tensor) : Tensor` | `t.exp() : Tensor` |
| log | `log(tensor: Tensor) : Tensor` | `t.log() : Tensor` |
| log1p | `log1p(tensor: Tensor) : Tensor` | `t.log1p() : Tensor` |
| sin | `sin(tensor: Tensor) : Tensor` | `t.sin() : Tensor` |
| cos | `cos(tensor: Tensor) : Tensor` | `t.cos() : Tensor` |
| sqrt | `sqrt(tensor: Tensor) : Tensor` | `t.sqrt() : Tensor` |
| tanh | `tanh(tensor: Tensor) : Tensor` | `t.tanh() : Tensor` |
| floor | `floor(tensor: Tensor) : Tensor` | `t.floor() : Tensor` |
| ceil | `ceil(tensor: Tensor) : Tensor` | `t.ceil() : Tensor` |
| rint | `rint(tensor: Tensor) : Tensor` | `t.rint() : Tensor` |
| absolute | `absolute(tensor: Tensor) : Tensor` | `t.absolute() : Tensor` |
| abs | `abs(tensor: Tensor) : Tensor` | `t.abs() : Tensor` |
| sigmoid | `sigmoid(tensor: Tensor) : Tensor` | `t.sigmoid() : Tensor` |
| erf | `erf(tensor: Tensor) : Tensor` | `t.erf() : Tensor` |
| flip | `flip(tensor: Tensor, dim: number) : Tensor` | `t.flip(dim: number) : Tensor` |
| clip | `clip(tensor: Tensor, low: Tensor, high: Tensor) : Tensor` | `t.clip(low: Tensor, high: Tensor) : Tensor` |
| roll | `roll(tensor: Tensor, shift: number, axis: number) : Tensor` | `t.roll(shift: number, axis: number) : Tensor` |
| isnan | `isnan(tensor: Tensor) : Tensor` | `t.isnan() : Tensor` |
| isinf | `isinf(tensor: Tensor) : Tensor` | `t.isinf() : Tensor` |
| sign | `sign(tensor: Tensor) : Tensor` | `t.sign() : Tensor` |
| tril | `tril(tensor: Tensor) : Tensor` | `t.tril() : Tensor` |
| triu | `triu(tensor: Tensor) : Tensor` | `t.triu() : Tensor` |
| where | `where(cond: Tensor, x: Tensor, y: Tensor) : Tensor` | `t.where(x: Tensor, y: Tensor) : Tensor` |
| sort | `sort(tensor: Tensor, dim: number) : Tensor` | `t.sort(dim: number) : Tensor` |
| add | `add(tensor: Tensor, other: Tensor) : Tensor` | `t.add(other: Tensor) : Tensor` |
| sub | `sub(tensor: Tensor, other: Tensor) : Tensor` | `t.sub(other: Tensor) : Tensor` |
| mul | `mul(tensor: Tensor, other: Tensor) : Tensor` | `t.mul(other: Tensor) : Tensor` |
| div | `div(tensor: Tensor, other: Tensor) : Tensor` | `t.div(other: Tensor) : Tensor` |
| eq | `eq(tensor: Tensor, other: Tensor) : Tensor` | `t.eq(other: Tensor) : Tensor` |
| neq | `neq(tensor: Tensor, other: Tensor) : Tensor` | `t.neq(other: Tensor) : Tensor` |
| lessThan | `lessThan(tensor: Tensor, other: Tensor) : Tensor` | `t.lessThan(other: Tensor) : Tensor` |
| lessThanEqual | `lessThanEqual(tensor: Tensor, other: Tensor) : Tensor` | `t.lessThanEqual(other: Tensor) : Tensor` |
| greaterThan | `greaterThan(tensor: Tensor, other: Tensor) : Tensor` | `t.greaterThan(other: Tensor) : Tensor` |
| greaterThanEqual | `greaterThanEqual(tensor: Tensor, other: Tensor) : Tensor` | `t.greaterThanEqual(other: Tensor) : Tensor` |
| logicalOr | `logicalOr(tensor: Tensor, other: Tensor) : Tensor` | `t.logicalOr(other: Tensor) : Tensor` |
| logicalAnd | `logicalAnd(tensor: Tensor, other: Tensor) : Tensor` | `t.logicalAnd(other: Tensor) : Tensor` |
| mod | `mod(tensor: Tensor, other: Tensor) : Tensor` | `t.mod(other: Tensor) : Tensor` |
| bitwiseAnd | `bitwiseAnd(tensor: Tensor, other: Tensor) : Tensor` | `t.bitwiseAnd(other: Tensor) : Tensor` |
| bitwiseOr | `bitwiseOr(tensor: Tensor, other: Tensor) : Tensor` | `t.bitwiseOr(other: Tensor) : Tensor` |
| bitwiseXor | `bitwiseXor(tensor: Tensor, other: Tensor) : Tensor` | `t.bitwiseXor(other: Tensor) : Tensor` |
| lShift | `lShift(tensor: Tensor, other: Tensor) : Tensor` | `t.lShift(other: Tensor) : Tensor` |
| rShift | `rShift(tensor: Tensor, other: Tensor) : Tensor` | `t.rShift(other: Tensor) : Tensor` |
| minimum | `minimum(tensor: Tensor, other: Tensor) : Tensor` | `t.minimum(other: Tensor) : Tensor` |
| maximum | `maximum(tensor: Tensor, other: Tensor) : Tensor` | `t.maximum(other: Tensor) : Tensor` |
| power | `power(tensor: Tensor, other: Tensor) : Tensor` | `t.power(other: Tensor) : Tensor` |
| matmul | `matmul(tensor: Tensor, other: Tensor) : Tensor` | `t.matmul(other: Tensor) : Tensor` |
| amin | `amin(tensor: Tensor, axes: number[] = [], keep_dims: boolean = false) : Tensor` | `t.amin(axes: number[] = [], keep_dims: boolean = false) : Tensor` |
| amax | `amax(tensor: Tensor, axes: number[] = [], keep_dims: boolean = false) : Tensor` | `t.amax(axes: number[] = [], keep_dims: boolean = false) : Tensor` |
| argmin | `argmin(tensor: Tensor, axis: number, keep_dims: boolean = false) : Tensor` | `t.argmin(axis: number, keep_dims: boolean = false) : Tensor` |
| argmax | `argmax(tensor: Tensor, axis: number, keep_dims: boolean = false) : Tensor` | `t.argmax(axis: number, keep_dims: boolean = false) : Tensor` |
| sum | `sum(tensor: Tensor, axes: number[] = [], keep_dims: boolean = false) : Tensor` | `t.sum(axes: number[] = [], keep_dims: boolean = false) : Tensor` |
| cumsum | `cumsum(tensor: Tensor, axis: number) : Tensor` | `t.cumsum(axis: number) : Tensor` |
| mean | `mean(tensor: Tensor, axes: number[] = [], keep_dims: boolean = false) : Tensor` | `t.mean(axes: number[] = [], keep_dims: boolean = false) : Tensor` |
| median | `median(tensor: Tensor, axes: number[] = [], keep_dims: boolean = false) : Tensor` | `t.median(axes: number[] = [], keep_dims: boolean = false) : Tensor` |
| var | `var(tensor: Tensor, axes: number[] = [], bias: boolean = false, keep_dims: boolean = false) : Tensor` | `t.var(axes: number[] = [], bias: boolean = false, keep_dims: boolean = false) : Tensor` |
| std | `std(tensor: Tensor, axes: number[] = [], keep_dims: boolean = false) : Tensor` | `t.std(axes: number[] = [], keep_dims: boolean = false) : Tensor` |
| norm | `norm(tensor: Tensor, axes: number[] = [], p: number = 2, keep_dims: boolean = false) : Tensor` | `t.norm(axes: number[] = [], p: number = 2, keep_dims: boolean = false) : Tensor` |
| countNonzero | `countNonzero(tensor: Tensor, axes: number[] = [], keep_dims: boolean = false) : Tensor` | `t.countNonzero(axes: number[] = [], keep_dims: boolean = false) : Tensor` |
| any | `any(tensor: Tensor, axes: number[] = [], keep_dims: boolean = false) : Tensor` | `t.any(axes: number[] = [], keep_dims: boolean = false) : Tensor` |
| all | `all(tensor: Tensor, axes: number[] = [], keep_dims: boolean = false) : Tensor` | `t.all(axes: number[] = [], keep_dims: boolean = false) : Tensor` |

License

shumai is MIT licensed, as found in the LICENSE file.

Comments
  • Extensible statistics

    Tried to keep scope down, but the changes are pretty considerable and come with tradeoffs. The big downside of leveraging a consistent stats interface across all layers is that distributed training now requires a bit more effort to process the HTTP results from the remote models. I believe the tradeoffs are well worth it, though, and the updated docs attempt to explain these new patterns, which should enable some pretty incredible and robust stats in the future (features I also plan to contribute).

    Statistics

    graph TD
      OpA(Op A) --> statsA{{"stats A"}};
      OpB(Op B) --> statsA;
      statsA --> LoggerA{{"LoggerConsole A"}};
      LoggerA --> Stdout(("Stdout"));
      OpC(Op C) --> statsA;
      OpD(Op D) --> statsA;
      statsA --> LoggerB("LoggerCustom B");
      LoggerB --> Disk(("Disk"));
    

    Basic usage of gathering statistics is as simple as enabling collection, which reports through the default StatsLoggerConsole:

    import { stats, StatsLoggerConsole, rand, matmul } from '@shumai/shumai'
    
    stats.enabled = true // all ops following will capture stats
    
    // perform ops...
    
    stats.enabled = false // all ops following will no longer capture stats
    

    While the above examples may suffice for simple use cases, if you're looking to capture stats across multiple threads, processes, and/or hosts, StatsLoggerHttp has you covered.

    graph TD
      subgraph Host C
        Processor("LoggerHttp Processor")
        style Processor stroke:#222,stroke-width:4px,stroke-dasharray:5 5
      end
      subgraph Host A
        OpA(Op A) --> statsA{{"stats A"}};
        OpB(Op B) --> statsA;
        statsA --> LoggerA{{"LoggerHttp A"}};
        LoggerA --> Processor;
      end
      subgraph Host B
        OpC(Op C) --> statsB{{"stats B"}};
        OpD(Op D) --> statsB;
        statsB --> LoggerB{{"LoggerHttp B"}};
        LoggerB --> Processor;
      end
    
    import { stats, StatsLoggerHttp } from '@shumai/shumai'
    
    stats.logger = new StatsLoggerHttp({ url: 'http://localhost:4242' })
    

    For more custom needs you can supply your own logger:

    import { StatsLogger, StatsLoggerData } from '@shumai/shumai'
    
    class CustomLogger implements StatsLogger {
      async process(data: StatsLoggerData): Promise<void> {
        const summary = data.collector.getSummary()
        console.log('Collector stats:', summary)
      }
    }
    
    stats.logger = new CustomLogger()
    

    By default stack tracing is disabled, as it adds 50%+ overhead, but it can be enabled via stats.collectStacks = true.

    Scoped Statistics

    If you wish to isolate stats profiling you can do this as well:

    import { collectStats } from '@shumai/shumai'
    
    const scopedStats = collectStats(() => {
      // perform ops...
    }/*, StatsCollectorOptions | StatsLogger */)
    console.log(scopedStats.getSummary())
    
    CLA Signed 
    opened by asilvas 9
  • Making softmax numerically stable

    Modified the softmax function to be numerically stable with large exponents. Method taken from here.
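
    For reference, the usual max-subtraction trick looks roughly like this in shumai ops (a sketch, not this PR's exact code; amax, sub, exp, sum, and div all appear in the supported-operations table, sm.Tensor is assumed to be the exported tensor type, and broadcasting over the reduced axis is assumed):

    import * as sm from "@shumai/shumai"

    const stableSoftmax = (x: sm.Tensor, axis: number) => {
      // subtracting the per-axis max makes every exponent <= 0, so exp cannot overflow
      const e = x.sub(x.amax([axis], true)).exp()
      return e.div(e.sum([axis], true))
    }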

    I am fairly new to autodiff gradient functions, so my implementation of amax may be way off the mark (it certainly looks wrong).

    I originally wrote the code below based on the min/max gradient functions that already exist, but it would not converge on my test model (whereas the current implementation does).

    const mask = ctx.forward_inputs[0]
      .eq(ctx.forward_output)
      .astype(ctx.backward_input.dtype);
    return ctx.backward_input.mul(mask)
    
    CLA Signed 
    opened by joelshepherd 8
  • Transformer encoder

    TransformerPositionalEncoding

    $$ \mathrm{PE}_{i, 2z} = \sin \left( \frac{i}{10000^{2z/d}} \right) $$

    $$ \mathrm{PE}_{i, 2z + 1} = \cos \left( \frac{i}{10000^{2z/d}} \right) $$

    where $i$ is the sequence position, $2z$ and $2z+1$ are the dimensions of the input embedding, and $d$ is the dimensionality of the input embedding.

    The multiplicative factors $\frac{1}{10000^{2z/d}}$ are precomputed during object creation as they are constant for all $i$.

    The full PE is initially precomputed for all $i$ up to 256 (configurable). This is then extended and stored if the module is called with a sequence length larger than the initial value.

    Returns a 2D tensor matching the last two dimensions of the input tensor to TransformerEncoder.
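
    A sketch of that precomputation (positionalEncoding is a hypothetical helper; only sm.tensor and reshape from the README are assumed):

    import * as sm from "@shumai/shumai"

    // PE[i][2z] = sin(i / 10000^(2z/d)), PE[i][2z+1] = cos(i / 10000^(2z/d))
    function positionalEncoding(seqLen: number, d: number) {
      const out = new Float32Array(seqLen * d)
      for (let i = 0; i < seqLen; i++) {
        for (let z = 0; 2 * z < d; z++) {
          const factor = 1 / Math.pow(10000, (2 * z) / d) // constant across positions i
          out[i * d + 2 * z] = Math.sin(i * factor)
          if (2 * z + 1 < d) out[i * d + 2 * z + 1] = Math.cos(i * factor)
        }
      }
      return sm.tensor(out).reshape([seqLen, d])
    }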

    FeedForward

    Simple two-layer fully connected neural network with ReLU activation. This is kept as a private class for now; if we want it to be exported, it should probably live in a separate file.

    TransformerEncoderLayer

    As described in Vaswani et al.

    TransformerEncoder

    The full encoder half of the Transformer, using a Sequential containing an arbitrary number of TransformerEncoderLayers.

    This includes the positional encoding, but does not include any initial embedding of an input sequence into vectors (which would be done separately, e.g. by word2vec).

    CLA Signed 
    opened by yushiyangk 8
  • attempt basic error handling from native code

    Attempts to add a basic implementation of native error handling that works hand in hand with TypeScript error handling. It has room for improvement in terms of additional functionality, but I think this is a good first step at hashing out a native-code error-handling API re: #26.

    CLA Signed 
    opened by cryptodeal 5
  • Add Support for `Float16Array`

    While working on implementing tensor data types, I pretty quickly realized that JS TypedArray doesn't include Float16Array. Some research into solutions revealed an existing library, @petamoriken/float16, which exports a Float16Array, has been actively developed since ~2014, and has recently added support for Bun (@petamoriken/float16 GitHub repo).

    It seems to support Node runtimes (Bun included) as well as browser environments (it seems the lib was created because the authors needed a Float16Array when working with WebGL).

    enhancement 
    opened by cryptodeal 5
  • SegmentationFault running `examples/bench.ts` and other examples

    Running in a vanilla docker container FROM flml/flashlight:cuda-latest.

    bun bench.ts
    10 elements...
    JS create 0 tensor               mean: 25.576us    (min: 19.379us, max: 895.272us)
    native create 0 tensor           mean: 4.402us    (min: 2.48us, max: 348.661us)
    JS create random tensor          mean: 23.681us    (min: 19.279us, max: 190.88us)
    native create random tensor      mean: 12.875us    (min: 7.879us, max: 264.586us)
    1000 elements...
    JS create 0 tensor               mean: 26.09us    (min: 20.109us, max: 519.052us)
    native create 0 tensor           mean: 3.65us    (min: 2.25us, max: 354.776us)
    JS create random tensor          mean: 28.209us    (min: 22.628us, max: 388.019us)
    native create random tensor      mean: 12.913us    (min: 7.789us, max: 447.59us)
    100000 elements...
    
    SegmentationFault at 0x0000000000000000
    
    
    ----- bun meta -----
    Bun v0.1.13 (55bdf268) Linux x64 #1 SMP Wed Aug 24 22:24:20 UTC 2022
    AutoCommand:
    Elapsed: 1630ms | User: 957ms | Sys: 262ms
    RSS: 67.11MB | Peak: 1.74GB | Commit: 67.11MB | Faults: 60
    ----- bun meta -----
    

    I'm able to run the benchmark from flashlight in the same container. The host is Windows 11 with WSL2.

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.65.01    Driver Version: 516.94       CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  On   | 00000000:21:00.0  On |                  N/A |
    |  0%   45C    P8    26W / 400W |   2120MiB / 12288MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA GeForce ...  On   | 00000000:4B:00.0 Off |                  N/A |
    |  0%   35C    P8    19W / 350W |      0MiB / 12288MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    bug 
    opened by asilvas 5
  • Submit latest version of EvalTuner and clean up commits

    Must have been a bit tired last night because I messed up when merging commits to resolve conflicts; resubmitted with the updated code and cleaned up commit history.

    CLA Signed 
    opened by cryptodeal 5
  • Forced GC during backward optimization slows training up to 60x

    https://github.com/facebookresearch/shumai/blob/main/shumai/tensor/tensor.ts#L416

    tensor.update() invocation counts are very high during the backward pass, and these GC calls are killing performance. There are a number of directions that would greatly improve the situation, but I wasn't sure if you had a plan/pattern in mind.

    opened by asilvas 4
  • Implement `dispose`

    Bun appears to run GC twice: if you uncomment the logged output in destroyTensor (only called at garbage collection), the output is logged multiple times per pointer.

    This causes the test to segfault: on the first GC pass, Bun finds the pointer in alreadyDestroyed and removes it from the set; on the second pass, destroyTensor fails to locate the pointer in alreadyDestroyed (as it was just cleared during the previous run).

    It seems like there might be a bug in Bun; I'm working on a repro to file an issue with Bun, as this is likely blocking.

    Once we resolve the above, this will partially implement #50 (still want to implement some equivalent to TFJS tidy in a separate PR).

    CLA Signed 
    opened by cryptodeal 4
  • Add More Tests

    Supported Operations Tests

    • [ ] rand
    • [ ] randn
    • [ ] full
    • [ ] identity
    • [ ] arange
    • [ ] iota
    • [x] reshape
    • [x] transpose
    • [x] tile
    • [ ] nonzero
    • [x] negative
    • [ ] logicalNot
    • [x] exp
    • [x] log
    • [ ] log1p
    • [x] sin
    • [x] cos
    • [ ] sqrt
    • [ ] tanh
    • [x] floor
    • [x] ceil
    • [ ] rint
    • [ ] absolute
    • [x] abs
    • [x] sigmoid
    • [x] erf
    • [x] flip (1D tensor; additional tests after 100% coverage of basic ops)
    • [ ] clip
    • [ ] roll
    • [x] isnan
    • [x] isinf
    • [x] sign
    • [ ] tril
    • [ ] triu
    • [ ] where
    • [ ] sort
    • [x] add
    • [x] sub
    • [x] mul
    • [x] div
    • [ ] eq
    • [ ] neq
    • [ ] lessThan
    • [ ] lessThanEqual
    • [ ] greaterThan
    • [ ] greaterThanEqual
    • [ ] logicalOr
    • [ ] logicalAnd
    • [ ] mod
    • [ ] bitwiseAnd
    • [ ] bitwiseOr
    • [ ] bitwiseXor
    • [ ] lShift
    • [ ] rShift
    • [x] minimum
    • [x] maximum
    • [ ] power
    • [ ] matmul
    • [ ] amin
    • [ ] amax
    • [ ] argmin
    • [ ] argmax
    • [x] sum
    • [ ] cumsum
    • [x] mean
    • [ ] median
    • [ ] var
    • [ ] std
    • [x] norm
    • [ ] countNonzero
    • [ ] any
    • [ ] all

    Tensor Class Methods & Properties Tests

    • [ ] backward
    • [ ] ndim
    • [x] shape (used in tests)
    • [ ] toString
    • [ ] valueOf (used in tests)
    • [ ] asContiguousTensor
    • [x] copy
    • [ ] detach
    • [x] elements (used in tests)
    • [x] toFloat32Array (tested implicitly in valueOf)
    • [x] toFloat32 (tested implicitly in valueOf)
    CLA Signed 
    opened by cryptodeal 4
  • Add gradient fns for log and abs

    This adds gradient functions for log and abs.

    I have used and tested these locally for cross-entropy and mean absolute error loss functions. If you would like these in your library too, I am happy to upstream them from my experiment repo.

    CLA Signed 
    opened by joelshepherd 3
  • Implement `StandardScaler`; add associated tests

    Implemented BaseScaler abstract class + StandardScaler, which extends the base class. Also added simple unit tests for StandardScaler.

    Fixed a few type errors in shumai/tensor/tensor.ts while working on this.

    CLA Signed 
    opened by cryptodeal 5
  • Examples about training and inference

    I'd like to build a little feed-forward, fully connected thing with just one hidden layer. I looked at the examples, but perhaps the most relevant one, train.ts, doesn't seem to work anymore, as things like sm.module.sequential and sm.optim.Adam no longer seem to exist.

    It would be great to get that example fixed.

    In general, it would also be great to have a simpler, more exhaustive "getting started" example, like a tiny model that learns XOR: one that showcases how to build the network (perhaps with one hidden layer for the sake of demonstration), how to feed it training data, and how to validate it with more data afterwards.

    At the moment I'm a bit stuck: I have the dataset, and I had the network sort of working on top of Brain.js (too slow), but I don't know what Shumai code I should write to recreate the same network and training/testing "pipeline".
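
    For what it's worth, a minimal XOR pipeline can be sketched with just the gradient API from the README above (no layer/optimizer modules are assumed; every op used appears in the supported-operations table, and the hyperparameters are illustrative):

    import * as sm from "@shumai/shumai"

    // XOR dataset; a constant 1 column stands in for a bias term
    const X = sm.tensor(new Float32Array([0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1])).reshape([4, 3])
    const T = sm.tensor(new Float32Array([0, 1, 1, 0])).reshape([4, 1])

    // one hidden layer of width 4
    let W1 = sm.randn([3, 4]); W1.requires_grad = true
    let W2 = sm.randn([4, 1]); W2.requires_grad = true
    const lr = sm.scalar(0.1)

    for (let step = 0; step < 5000; step++) {
      const H = X.matmul(W1).sigmoid()
      const Y = H.matmul(W2).sigmoid()
      const diff = Y.sub(T)
      const loss = diff.mul(diff).sum() // summed squared error
      loss.backward()
      // manual gradient-descent updates, detached from the autograd graph
      W1 = W1.detach().sub(W1.grad.mul(lr)); W1.requires_grad = true
      W2 = W2.detach().sub(W2.grad.mul(lr)); W2.requires_grad = true
    }

    // validation: predictions should approach [0, 1, 1, 0]
    console.log(X.matmul(W1).sigmoid().matmul(W2).sigmoid().toFloat32Array())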

    opened by fabiospampinato 2
  • [tracking] Browser Support

    Great work on shumai! I'm very new to bun specifically and javascript in general, but I love the idea.

    I am trying to import shumai into an html page I'm building and am curious how all the pieces work together.

    I have @shumai/shumai installed via bun and can import it using ES6 syntax into a .js file no problem.

    I run bun bun which generates a node_modules.bun which can be copied into node_modules.js by running ./node_modules.bun > node_modules.js

    I can then import a script module <script type="module" src="node_modules.js">...</script> which seems to work.

    However, as is intended by bun, hashes are exported instead of modules holding the same structure. So importing and using sm doesn't expose the same API, and randn or tensor, for example, aren't available.

    In your time working with bun, have you figured out a supported way to do this? How would you suggest using shumai in a web page?

    I appreciate your help

    enhancement 
    opened by andrewnc 2
  • WebGPU Backend

    This will enable browser support. We'll need to shim some files:

    • [ ] io
    • [ ] network
    • [ ] tensor/ffi

    It doesn't make sense to support anything besides WebGPU at this point. WASM + SIMD is around 15-20x slower on my machine[1]. Although WebGL is more widely supported today, it doesn't have the compute features needed for efficient modern ML (transformers etc) and will likely be a deprecated backend for other frameworks when WebGPU comes online.

    [1]: In chrome canary, with Unsafe webGPU enabled try models here: https://tensorflow.github.io/tfjs/e2e/benchmarks/local-benchmark/index.html

    enhancement help wanted 
    opened by bwasti 1
  • Add floating point operation and byte movement counter to every operation

    Ideally, each operation would have its theoretical peak performance measured. This would help us easily catch "slow" operations or bottlenecks in models during training

    This information could be added to tensor.stats

    enhancement 
    opened by bwasti 0
  • CUDA backend for asynchronous distributed multi-trainer segfaults (race condition)

    When running the distributed test with a CUDA backend, there's a segfault. It can be fixed easily with CUDA_LAUNCH_BLOCKING, but that's not ideal. Below are the commands to repro:

    Server:

    $ bash examples/distributed/serve.sh
    

    Client:

    $ bash examples/distributed/client.sh
    
    bug 
    opened by bwasti 0