High performance distributed data processing engine

skale

Last update: Nov 16, 2022

Overview

High performance distributed data processing and machine learning.

Skale provides a high-level API in JavaScript and an optimized parallel execution engine on top of Node.js.

Features

Pure JavaScript implementation of a Spark-like engine
Multiple data sources: filesystems, databases, cloud (S3, Azure)
Multiple data formats: CSV, JSON, columnar (Parquet)...
50 high-level operators to build parallel apps
Machine learning: scalable classification, regression, clustering
Run interactively in a Node.js REPL shell
Docker ready, simple local mode or full distributed mode
Very fast, see benchmark

Quickstart

npm install skale

Word count example:

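const sc = require('skale').context();

sc.textFile('/my/path/*.txt')
  .flatMap(line => line.split(' '))   // split each line into words
  .map(word => [word, 1])             // emit [word, 1] pairs
  .reduceByKey((a, b) => a + b, 0)    // sum the counts per word
  .count(function (err, result) {
    if (err) console.error(err);      // report failures instead of ignoring them
    else console.log(result);
    sc.end();
  });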
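
To make the data flow concrete, here is a minimal plain-JavaScript sketch of what the pipeline above computes, run on an in-memory array without skale. The `wordCount` helper and the sample lines are hypothetical; the sketch only mirrors the flatMap, map and reduceByKey steps and says nothing about how skale distributes the work.

```javascript
// Plain-JavaScript sketch (no skale). `wordCount` is a hypothetical helper.
function wordCount(lines) {
  const pairs = lines
    .flatMap(line => line.split(' '))  // one entry per word
    .filter(word => word.length > 0)   // extra step: drop empty strings from repeated spaces
    .map(word => [word, 1]);           // [key, value] pairs

  // Same idea as reduceByKey((a, b) => a + b, 0): fold the values sharing a key.
  const counts = new Map();
  for (const [word, n] of pairs) {
    counts.set(word, (counts.get(word) || 0) + n);
  }
  return counts;
}

const counts = wordCount(['a b a', 'b c']);
console.log([...counts]);  // [word, count] pairs
console.log(counts.size);  // number of [word, count] pairs
```

The final `counts.size` is the in-memory analogue of calling count() on the reduced dataset: it reports how many [word, count] pairs remain, not the total number of words.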

Local mode

In local mode, worker processes are automatically forked and communicate with the app through a child process IPC channel. This is the simplest way to operate, and it makes use of all available cores on the machine.

To run in local mode, just execute your app script:

node my_app.js

or with debug traces:

SKALE_DEBUG=2 node my_app.js

Distributed mode

In distributed mode, a cluster server process and worker processes must be started before the app. Processes communicate with each other via raw TCP or via WebSockets.

To run in distributed cluster mode, first start a cluster server on server_host:

./bin/server.js

On each worker host, start a worker controller process which connects to the server:

./bin/worker.js -H server_host

Then run your app, setting the cluster server host in the environment:

SKALE_HOST=server_host node my_app.js

The same with debug traces:

SKALE_HOST=server_host SKALE_DEBUG=2 node my_app.js

Resources

Contributing guide
Documentation
Gitter for support and discussion
Mailing list for discussion about use and development

Authors

The original authors of skale are Cedric Artigue and Marc Vertes.

List of all contributors

License

Apache-2.0

Credits

Logo Icon made by Smashicons from www.flaticon.com is licensed by CC 3.0 BY

Comments

modernize javascript syntax
Apply coding rules as described in contributing:

Use arrow functions in mappers, filters, reducers and combiners

Use let and const in place of var

Use array or object destructuring to set variables from array or object: let [a, b] = [1, 2, 3]

easy good first issue
opened by mvertes 10
for loop over items used later as index
https://github.com/skale-me/skale-engine/blob/41f4a09c5d254f5d3b1b8598efaeba5aecb69464/lib/client.js#L339-L340

I just found this for-loop, which looks like a bug to me. You are using the variable i as an index into dev, but the i here contains the item itself (or am I missing something?)

Therefore, instead of:

for (var i in dev) self.hostId[dev[i].uuid] = dev[i].id;

I would write:

for (const d of dev) { self.hostId[d.uuid] = d.id; }
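
For reference, the two loop forms differ in what the loop variable holds: `for...in` iterates over property keys (for an array, the indices as strings), while `for...of` iterates over the values themselves. A minimal sketch with a made-up `dev` array; its real shape in lib/client.js is an assumption here:

```javascript
// Hypothetical device list; the real shape of `dev` is an assumption.
const dev = [
  { uuid: 'u-1', id: 10 },
  { uuid: 'u-2', id: 20 },
];

// for...in yields keys ('0', '1'), so dev[i] is needed to reach the item.
const viaIn = {};
for (const i in dev) viaIn[dev[i].uuid] = dev[i].id;

// for...of yields the items themselves, so no indexing is needed.
const viaOf = {};
for (const d of dev) viaOf[d.uuid] = d.id;

console.log(viaIn);
console.log(viaOf); // same mapping as viaIn
```

With an array, both loops build the same mapping; using `for...in` and then reading `d.uuid` directly on the loop variable would not, because the loop variable is a string key.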
opened by vsimko 3
If 0 workers are requested on the command line don't start any

It seems that there's no way to tell the server not to start any local workers. Using --local=0 was what I expected would turn off any local workers, but that didn't seem to work. Maybe I misunderstood how the command should be used, but if not this will fix that.
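
One common way this kind of behavior arises in JavaScript option handling is a default applied with `||`, which treats 0 as "not set". Whether bin/server does this is not confirmed here; the sketch below only illustrates the pitfall and an explicit check. The function names, the `fallback` parameter and its value are all hypothetical.

```javascript
// Buggy variant: 0 is falsy, so `||` silently replaces it with the default.
function resolveWorkerCountBuggy(local, fallback) {
  return local || fallback;
}

// Explicit variant: fall back only when the option is absent or not a valid count.
function resolveWorkerCount(local, fallback) {
  if (local === undefined || local === null || local === '') return fallback;
  const n = Number(local);
  if (!Number.isInteger(n) || n < 0) return fallback;
  return n;
}

const fallback = 4; // assumed default number of local workers
console.log(resolveWorkerCountBuggy(0, fallback)); // 4: the requested 0 is lost
console.log(resolveWorkerCount(0, fallback));      // 0
console.log(resolveWorkerCount('2', fallback));    // 2
```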

opened by mark-bradshaw 3
skale-engine version 0.5.3 regression?

$ skale create dd2bis
create application dd2bis

[email protected] preinstall /home/felix/skale/cli/dd2bis/node_modules/.staging/skale-engine-1126d865 mkdir -p node_modules && echo "module.exports = require('..');" > node_modules/skale-engine.js

Project dd2bis is now ready.
Pleas change directory to dd2bis: "cd dd2bis"
To run your app: "skale run"
To modify your app: edit dd2bis.js

$ cd dd2bis
$ skale run
/home/felix/skale/cli/dd2bis/node_modules/skale-engine/lib/dataset.js:171
    result = combiner(result, res.data);
    ^
TypeError: Cannot read property 'data' of null
    at taskDone (/home/felix/skale/cli/dd2bis/node_modules/skale-engine/lib/dataset.js:171:33)
    at Object.<anonymous> (/home/felix/skale/cli/dd2bis/node_modules/skale-engine/lib/context.js:129:4)
    at Consumer._transform (/home/felix/skale/cli/dd2bis/node_modules/skale-engine/lib/client.js:120:32)
    at Consumer.Transform._read (_stream_transform.js:167:10)
    at Consumer.Transform._write (_stream_transform.js:155:12)
    at doWrite (_stream_writable.js:301:12)
    at writeOrBuffer (_stream_writable.js:287:5)
    at Consumer.Writable.write (_stream_writable.js:215:11)
    at FromGrid.ondata (_stream_readable.js:536:20)
    at emitOne (events.js:77:13)

opened by philippe56 3
Implement dependency injection into workers, resolves #203

A new skale context method, sc.require, is added. It specifies a set of modules on the user side (master) which have to be deployed in workers for use by callbacks such as mappers and reducers.

Under the hood, browserify is used on the master side to build a bundle which is serialized and sent to workers (as part of the task). It is then evaluated in the worker global context, and dependencies remain persistent as long as the workers live.

This makes it possible to use in workers any JavaScript module which can be browserified, which covers a large number of packages (almost any pure JS package).

The current commit is for the local version, not the distributed one (the code will be exactly the same). It is experimental for the moment.

An additional statement, sc.bundle, could be added as well to inject pre-compiled modules, avoiding the cost of running browserify at each run.
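
As a rough illustration of the mechanism described (ship module source as a string, evaluate it once in a long-lived context, then reuse it from callbacks), here is a sketch using only Node's built-in `vm` module. It does not use browserify or skale, and it is not the code from this pull request; `bundleSource`, `workerGlobal`, `mapperSource` and `add3` are hypothetical names.

```javascript
const vm = require('vm');

// "Master" side: the bundle is plain source text, so it can be serialized and
// sent with a task. This string is a hand-written stand-in for a real bundle.
const bundleSource = 'var add3 = function (a) { return a + 3; };';

// "Worker" side: one long-lived context plays the role of the worker global scope.
const workerGlobal = vm.createContext({});
vm.runInContext(bundleSource, workerGlobal); // evaluate the bundle once

// A later task sends its callback as source text too. It can see `add3`
// because the context persists between evaluations.
const mapperSource = '(a => add3(a))';
const mapper = vm.runInContext(mapperSource, workerGlobal);

console.log([0, 1, 2, 3].map(x => mapper(x))); // [ 3, 4, 5, 6 ]
```

The point of the sketch is the persistence: the bundle is evaluated a single time, and every callback evaluated afterwards in the same context can refer to what it defined.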

opened by mvertes 2
Rethinkdb connector

Hi,

For one of my customers, I'm interested in running a trial with Skale.

However, I need to be able to extract the data from RethinkDB and then process it further. I read #144, but it was not clear whether the solution with sc.objectStream would scale.

So, in other words, would you mind writing some how-to instructions on implementing the connector on the workers, so I could potentially get this sorted?

Thank you.

opened by thomasmodeneis 2
bin/server doesn't use nworker parameter

It looks like the nworker parameter is ignored, and instead the internal variable of the same name is set to the value of the local parameter. It seems like nworker should either be removed, or it should be used to set the value of the internal variable instead of local.

I'd be happy to provide a PR if that's desired.

opened by mark-bradshaw 2
Consider rewriting the core engine in Rust (with Node binding)

From what I understood when meeting with @CedricArtigue, JS was chosen, among other things, because of the community, because you could get predictable performance (by reusing objects, which avoids triggering GC), and because of a more reasonable workflow than what's possible with Scala. Another idea for predictable performance would be to use Rust. It has C-like performance (and predictable memory characteristics unlike JS, see #52), static typing, an awesome community, easy parallelism, actual threads (unlike Node, which only has processes), and memory safety.

It's possible to expose a Node.js API via a module whose code is written in Rust. See:
https://blog.risingstack.com/how-to-use-rust-with-node-when-performance-matters/
http://calculist.org/blog/2015/12/23/neon-node-rust/
wontfix discussion

opened by DavidBruant 2
sizeOf is incomplete and inaccurate
Taking a step back, it's impossible to accurately assess the size of a JavaScript value, as it can vary across implementations and according to context (especially for objects). Also, I'm not sure how much the accuracy of the function matters, so I don't know whether this feedback is of any value to you. That said, here is some feedback on https://github.com/skale-me/skale-engine/blob/master/lib/sizeof.js

Boolean

Boolean being 4 bytes seems idiotic. In all reasonable cases, boolean object properties will certainly be 1 byte if not 1 bit (packed in a byte if there are several). At least, it's a simple enough optimization that implementations have likely done it.

Number

For integers, it's very likely JS engines actually store them in 4 bytes instead of 8 as the spec suggests for numbers.

String

I don't know about strings, but I know that some time ago the Mozilla JS team was considering storing ECMAScript strings as UTF-8 (or latin1) and only converting to the spec encoding if that becomes necessary because of subtle string manipulations, so UCS-2 should not be assumed of implementations (I don't know where V8 stands).

Object

obj instanceof Array should certainly be Array.isArray(obj). The array length probably takes 4 bytes as well.

Additionally, you probably want to distinguish Node's Buffer, because they'll likely be used as data and their size is deterministic (that's probably the only object type that is so).

Your object size does not take the "hidden class" size into account. Not sure how much that matters.

No clue how much a Date weighs, but your function certainly guesses wrong, as it does not look at "internal" properties.

Symbols

In Node.js v6:

> typeof Symbol()
'symbol'

This case is not taken into account by your function. 0 will be returned now. Maybe it's fine, but at least the code should be explicit about it.
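
Several of the points above can be folded into a small estimator. The following is only a sketch, not the code in lib/sizeof.js: `roughSizeOf` is a hypothetical name, and every per-type constant is an assumption chosen for illustration, since real engine layouts are not observable from JavaScript.

```javascript
// Rough size estimate in bytes. All constants below are illustrative assumptions.
function roughSizeOf(value, seen = new Set()) {
  const type = typeof value;
  if (type === 'boolean') return 1;
  if (type === 'number') return 8;
  if (type === 'string') return 2 * value.length; // assumes 2 bytes per UTF-16 code unit
  if (type === 'symbol') return 0;                // explicit: symbols are not counted
  if (type !== 'object' || value === null) return 0; // undefined, function, bigint, null

  if (Buffer.isBuffer(value)) return value.length; // payload size is deterministic
  if (seen.has(value)) return 0;                   // avoid cycles and double counting
  seen.add(value);

  const isArray = Array.isArray(value);            // instead of `instanceof Array`
  let bytes = isArray ? 4 : 0;                     // assumed 4 bytes for the array length
  for (const key of Object.keys(value)) {
    if (!isArray) bytes += 2 * key.length;         // count property names for plain objects
    bytes += roughSizeOf(value[key], seen);
  }
  return bytes;
}

console.log(roughSizeOf(Buffer.alloc(16)));
console.log(roughSizeOf([1, 2]));
console.log(roughSizeOf({ a: true, b: 'xy' }));
```

Hidden classes and internal slots (for example those of a Date) are still ignored, so the result remains an estimate.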
opened by DavidBruant 2

examples/core/parallelize.js fails with 2 workers

To reproduce, in terminal 1:

$ cd skale-engine
$ ./bin/server -l 2

In terminal 2:

$ cd skale-engine
$ ./examples/core/parallelize.js
[ 1, 2, 3, 4 ]

assert.js:89
  throw new assert.AssertionError({
  ^
AssertionError: false == true
    at Console.assert (console.js:94:23)
    at /Users/marc/github/skale-engine/examples/core/parallelize.js:9:10
    at /Users/marc/github/skale-engine/node_modules/stream-to-array/index.js:54:9
    at _combinedTickCallback (node.js:376:9)
    at process._tickCallback (node.js:407:11)

The output should be [ 1, 2, 3, 4, 5 ]
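
A missing last element with 2 workers is the kind of symptom that can appear when an array is split into equal-sized chunks and the remainder is dropped. That is a guess, not a diagnosis of this bug. As a sketch of a split that keeps every element, assuming contiguous partitions (the `partition` helper is hypothetical and not taken from skale):

```javascript
// Split `items` into `n` contiguous partitions whose sizes differ by at most
// one, so nothing is dropped when items.length is not a multiple of n.
function partition(items, n) {
  const base = Math.floor(items.length / n);
  const extra = items.length % n; // the first `extra` partitions get one more item
  const parts = [];
  let start = 0;
  for (let p = 0; p < n; p++) {
    const size = base + (p < extra ? 1 : 0);
    parts.push(items.slice(start, start + size));
    start += size;
  }
  return parts;
}

const parts = partition([1, 2, 3, 4, 5], 2);
console.log(parts);        // [ [ 1, 2, 3 ], [ 4, 5 ] ]
console.log(parts.flat()); // [ 1, 2, 3, 4, 5 ]
```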

bug

opened by mvertes 2

Bump ws from 6.1.2 to 7.4.6
Bumps ws from 6.1.2 to 7.4.6.

Release notes

Sourced from ws's releases.

7.4.6

Bug fixes

Fixed a ReDoS vulnerability (00c425ec).

A specially crafted value of the Sec-Websocket-Protocol header could be used to significantly slow down a ws server.

for (const length of [1000, 2000, 4000, 8000, 16000, 32000]) {
  const value = 'b' + ' '.repeat(length) + 'x';
  const start = process.hrtime.bigint();
  value.trim().split(/ *, */);
  const end = process.hrtime.bigint();
  console.log('length = %d, time = %f ns', length, end - start);
}

The vulnerability was responsibly disclosed along with a fix in private by Robert McLaughlin from University of California, Santa Barbara.

In vulnerable versions of ws, the issue can be mitigated by reducing the maximum allowed length of the request headers using the --max-http-header-size=size and/or the maxHeaderSize options.

7.4.5

Bug fixes

UTF-8 validation is now done even if utf-8-validate is not installed (23ba6b29).

Fixed an edge case where websocket.close() and websocket.terminate() did not close the connection (67e25ff5).

7.4.4

Bug fixes

Fixed a bug that could cause the process to crash when using the permessage-deflate extension (92774377).

7.4.3

Bug fixes

The deflate/inflate stream is now reset instead of reinitialized when context takeover is disabled (#1840).

7.4.2

Bug fixes

... (truncated)

Commits

f5297f7 [dist] 7.4.6

00c425e [security] Fix ReDoS vulnerability

990306d [lint] Fix prettier error

32e3a84 [security] Remove reference to Node Security Project

8c914d1 [minor] Fix nits

fc7e27d [ci] Test on node 16

587c201 [ci] Do not test on node 15

f672710 [dist] 7.4.5

67e25ff [fix] Fix case where abortHandshake() does not close the connection

23ba6b2 [fix] Make UTF-8 validation work even if utf-8-validate is not installed

dependencies
opened by dependabot[bot] 1

Releases(1.2.0)

1.2.0(Nov 15, 2017)
This is a major feature release. Install it with npm.

New

Skale-engine is renamed to skale. Version is now 1.2.0, identical to 0.8.0.

Add a machine learning library with classification, regression, clustering

Allows dependencies to be deployed in workers with new routine sc.require(). This will ease considerably the integration of various connectors to data sources, databases, etc.

Major improvements to documentation website

Improvements

The test suite has been fully reworked, and now uses individual files that can be executed separately

Tests are considerably faster and easier to develop and debug

Both standalone and distributed engine are now systematically tested

save(): now supports output to CSV format

save(), textFile(): automatic forwarding of AWS environment and credentials to workers

Workers: garbage collection can be controlled by a command-line option

Modernize javascript syntax

Continuous integration: add MacOSX target in addition to Linux and Windows

Fixes

Fix a problem in sample()

Fix support of undefined keys in aggregateByKey()

Fix debug traces

0.7.1(May 17, 2017)
This is a stability and bug fix release.

Documentation has been improved.

A new skale hacker's guide has been added.

A worker crash when using sample() with replacement has been fixed.

0.7.0(Apr 4, 2017)
This is a major release. It brings new features:

Support for Azure storage, for reading (textFile) and writing (save)

Support for the Apache Parquet file format, for reading and writing

Performance of wide transformations involving shuffling, such as aggregateByKey, reduceByKey, coGroup, join, etc., has increased considerably compared to the 0.6 branch.

Many bug fixes and stability improvements

Despite the new major version, this release remains backward compatible with the previous 0.6.x branch.

Also available as always through npm
0.6.8(Dec 14, 2016)

This is a stability and bug fix release. Documentation is improved, distributed mode is better: handling of tmp files and environment has been fixed.
0.6.7(Nov 22, 2016)

Performance and scalability improvement release.

In distributed mode, a direct peer-to-peer shuffle data transfer between workers has been implemented. It improves scalability on large clusters when running with hundreds of simultaneous workers.

Standalone and distributed modes are now described. Debug traces are improved.
0.6.6(Nov 4, 2016)

This is a stability and performance improvement release.

Memory efficiency has been improved in presence of large datasets (thousands of partitions) and job complexity (hundreds of stages/steps).

S3 support has been fixed, both for input and output.

Multi-machine communications and debugging traces have been improved.

Owner

skale

GitHub https://skale-me.github.io/skale