High performance distributed data processing engine

Overview

High performance distributed data processing and machine learning.

Skale provides a high-level API in JavaScript and an optimized parallel execution engine on top of Node.js.

Features

  • Pure JavaScript implementation of a Spark-like engine
  • Multiple data sources: filesystems, databases, cloud (S3, Azure)
  • Multiple data formats: CSV, JSON, columnar (Parquet)...
  • 50 high-level operators to build parallel apps
  • Machine learning: scalable classification, regression, clustering
  • Run interactively in a Node.js REPL
  • Docker ready, simple local mode or full distributed mode
  • Very fast, see benchmark

Quickstart

npm install skale

Word count example:

var sc = require('skale').context();

sc.textFile('/my/path/*.txt')
  .flatMap(line => line.split(' '))
  .map(word => [word, 1])
  .reduceByKey((a, b) => a + b, 0)
  .count(function (err, result) {
    console.log(result);
    sc.end();
  });
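
To show what each operator in the pipeline contributes, here is a single-process sketch of the same logic in plain JavaScript. It does not use skale at all; `wordCount` and the sample lines are hypothetical names chosen for illustration. Because the example above ends with count(), the value it logs should be the number of distinct words, not the per-word totals.

```javascript
// Single-process sketch of the word-count pipeline; no skale required.
// `wordCount` is a hypothetical helper, not part of the skale API.
function wordCount(lines) {
  const counts = new Map();
  lines
    .flatMap(line => line.split(' '))   // one entry per word
    .map(word => [word, 1])             // [key, value] pairs
    .forEach(([word, n]) => {           // plays the role of reduceByKey((a, b) => a + b, 0)
      counts.set(word, (counts.get(word) || 0) + n);
    });
  return counts;
}

const counts = wordCount(['a b a', 'b c']);
console.log([...counts]);   // per-word totals
console.log(counts.size);   // number of distinct words, the analogue of count()
```

In skale the same steps run in parallel over partitions of the input files, which is why the reducer takes an initial value.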

Local mode

In local mode, worker processes are automatically forked and communicate with the app through a child process IPC channel. This is the simplest way to operate, and it makes use of all available cores on the machine.

To run in local mode, just execute your app script:

node my_app.js

or with debug traces:

SKALE_DEBUG=2 node my_app.js

Distributed mode

In distributed mode, a cluster server process and worker processes must be started before the app. Processes communicate with each other via raw TCP or WebSockets.

To run in distributed cluster mode, first start a cluster server on server_host:

./bin/server.js

On each worker host, start a worker controller process which connects to the server:

./bin/worker.js -H server_host

Then run your app, setting the cluster server host in the environment:

SKALE_HOST=server_host node my_app.js

The same with debug traces:

SKALE_HOST=server_host SKALE_DEBUG=2 node my_app.js

Resources

Authors

The original authors of skale are Cedric Artigue and Marc Vertes.

List of all contributors

License

Apache-2.0

Credits

Logo Icon made by Smashicons from www.flaticon.com is licensed by CC 3.0 BY
Comments
  • modernize javascript syntax

    Apply coding rules as described in contributing:

    • Use arrow functions in mappers, filters, reducers and combiners
    • Use let and const in place of var
    • Use array or object destructuring to set variables from array or object: let [a, b] = [1, 2, 3]
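
    The three rules can be sketched together as follows. The data and variable names are hypothetical and not taken from the skale code base:

```javascript
// Hypothetical data, only to illustrate the three coding rules.
const pairs = [['a', 1], ['b', 2], ['a', 3]];

// Arrow functions for the mapper, filter and reducer.
const total = pairs
  .map(([, value]) => value)          // array destructuring in the parameter list
  .filter(value => value > 1)
  .reduce((sum, value) => sum + value, 0);

// let/const instead of var; extra array elements are simply ignored.
let [a, b] = [1, 2, 3];
const { host, port } = { host: 'localhost', port: 1234 };   // made-up values

console.log(total, a, b, host, port);
```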
    easy good first issue 
    opened by mvertes 10
  • for loop over items used later as index

    https://github.com/skale-me/skale-engine/blob/41f4a09c5d254f5d3b1b8598efaeba5aecb69464/lib/client.js#L339-L340

    I just found this for-loop, which looks like a bug to me. You are using the variable i as an index into dev, but the i here contains the item itself (or am I missing something?)

    Therefore, instead of:

    for (var i in dev)
          self.hostId[dev[i].uuid] = dev[i].id;
    

    I would write:

    for (const d of dev) {
          self.hostId[d.uuid] = d.id;
    }
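
    For reference, JavaScript's for...in iterates over keys (for an array, the indices as strings), while for...of iterates over the values. The sketch below uses a hypothetical `dev` array standing in for the one in lib/client.js, and assumes it is a plain array of objects with uuid and id fields:

```javascript
// `dev` is a hypothetical stand-in for the device list in lib/client.js.
const dev = [{ uuid: 'u1', id: 10 }, { uuid: 'u2', id: 20 }];

const viaIn = {};
for (const i in dev) {        // i is the key: '0', then '1'
  viaIn[dev[i].uuid] = dev[i].id;
}

const viaOf = {};
for (const d of dev) {        // d is the element itself
  viaOf[d.uuid] = d.id;
}

console.log(viaIn, viaOf);    // both objects hold the same uuid -> id mapping
```

    Under that assumption, indexing with the for...in variable behaves the same as reading fields from the for...of variable.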
    
    opened by vsimko 3
  • If 0 workers are requested on the command line don't start any

    It seems that there's no way to tell the server not to start any local workers. Using --local=0 was what I expected would turn off any local workers, but that didn't seem to work. Maybe I misunderstood how the command should be used, but if not this will fix that.

    opened by mark-bradshaw 3
  • skale-engine version 0.5.3 regression?

    $ skale create dd2bis
    create application dd2bis

    [email protected] preinstall /home/felix/skale/cli/dd2bis/node_modules/.staging/skale-engine-1126d865
    mkdir -p node_modules && echo "module.exports = require('..');" > node_modules/skale-engine.js

    [email protected] /home/felix/skale/cli/dd2bis
    [npm dependency tree: package names and versions lost in extraction]

    Project dd2bis is now ready.
    Pleas change directory to dd2bis: "cd dd2bis"
    To run your app: "skale run"
    To modify your app: edit dd2bis.js

    $ cd dd2bis
    $ skale run
    /home/felix/skale/cli/dd2bis/node_modules/skale-engine/lib/dataset.js:171
    result = combiner(result, res.data);
                                 ^
    TypeError: Cannot read property 'data' of null
        at taskDone (/home/felix/skale/cli/dd2bis/node_modules/skale-engine/lib/dataset.js:171:33)
        at Object.<anonymous> (/home/felix/skale/cli/dd2bis/node_modules/skale-engine/lib/context.js:129:4)
        at Consumer._transform (/home/felix/skale/cli/dd2bis/node_modules/skale-engine/lib/client.js:120:32)
        at Consumer.Transform._read (_stream_transform.js:167:10)
        at Consumer.Transform._write (_stream_transform.js:155:12)
        at doWrite (_stream_writable.js:301:12)
        at writeOrBuffer (_stream_writable.js:287:5)
        at Consumer.Writable.write (_stream_writable.js:215:11)
        at FromGrid.ondata (_stream_readable.js:536:20)
        at emitOne (events.js:77:13)

    opened by philippe56 3
  • Implement dependency injection into workers, resolves #203

    A new skale context method, sc.require, is added. It specifies a set of modules on the user side (master) which have to be deployed in workers for use by callbacks such as mappers, reducers, etc.

    Under the hood, browserify is used on the master side to build a bundle, which is serialized and sent to workers as part of the task. It is then evaluated in the worker's global context, and dependencies persist as long as the workers live.

    This makes any JavaScript module that can be browserified usable in workers, which covers almost any pure JS package.

    The current commit covers the local version only, not the distributed one (the code will be exactly the same). It is experimental for the moment.

    An additional method, sc.bundle, could also be added to inject pre-compiled modules, avoiding the cost of running browserify at each run.

    opened by mvertes 2
  • Rethinkdb connector

    Hi,

    For one of my customers, I'm interested in running a trial with Skale.

    However, I need to be able to extract the data from RethinkDB and then process it further. I read #144, but it was not clear whether the solution with sc.objectStream would scale.

    In other words, would you mind writing some instructions on how to implement the connector on the workers, so I could potentially get this sorted?

    Thank you.

    opened by thomasmodeneis 2
  • bin/server doesn't use nworker parameter

    It looks like the nworker parameter is ignored, and instead the internal variable of the same name is set to the value of the local parameter. It seems like nworker should either be removed, or it should be used to set the value of the internal variable instead of local.

    I'd be happy to provide a PR if that's desired.

    opened by mark-bradshaw 2
  • Consider rewriting the core engine in Rust (with Node binding)

    From what I understood meeting with @CedricArtigue , JS was chosen among other things because of the community and that you could have some predictable performance (by reusing objects which prevents triggering GC) and a more reasonable workflow than what's possible with Scala. Another idea for predictable performance would be to use Rust. It has C-like performance (and predictable memory characteristics unlike JS, see #52 ), static typing, awesome community, easy parallelism, actual threads (unlike Node which only has processes), safe memory.

    It's possible to expose a Node.js API via a module whose code is written in Rust. See:
    https://blog.risingstack.com/how-to-use-rust-with-node-when-performance-matters/
    http://calculist.org/blog/2015/12/23/neon-node-rust/

    wontfix discussion 
    opened by DavidBruant 2
  • sizeOf is incomplete and inaccurate

    Taking a step back, it's impossible to accurately assess the size of a JavaScript value, as it can vary across implementations and according to the context (especially for objects). Also, I'm not sure how much the accuracy of the function matters, so I don't know whether this feedback is of any value to you. That said, here is some feedback on https://github.com/skale-me/skale-engine/blob/master/lib/sizeof.js

    Boolean

    Boolean being 4 bytes seems idiotic. In all reasonable cases, boolean object properties will certainly be 1 byte if not 1 bit (packed in a byte if there are several). At least, it's a simple enough optimization that implementations have likely done it.

    Number

    For integers, it's very likely JS engines actually store them in 4 bytes instead of 8 as the spec suggests for numbers.

    String

    I don't know for strings, but I know some time ago, the Mozilla JS team was considering storing ECMAScript strings as UTF-8 (or latin1) and only convert to the spec encoding if that becomes necessary because of subtle string manipulations, so UCS-2 should not be assumed from implementations (I don't know where V8 stands).

    Object

    obj instanceof Array should certainly be Array.isArray(obj). The array length probably takes 4 bytes as well.

    Additionally, you probably want to distinguish Node's Buffer, because they'll likely be used as data and their size is deterministic (that's probably the only object type that is so).

    Your object size does not take the "hidden class" size into account. Not sure how much that matters.

    No clue how much a Date weighs, but your function certainly guesses wrong as it does not look for "internal" properties.

    Symbols

    In Node.js v6:

    > typeof Symbol()
    'symbol'
    

    This case is not taken into account by your function. 0 will be returned now. Maybe it's fine, but at least the code should be explicit about it.
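
    A minimal sketch of an estimator that applies the structural points above (Array.isArray, a dedicated Buffer case, an explicit symbol case) might look like this. `roughSizeOf` is a hypothetical name, and the byte counts are placeholders for illustration, not measurements of V8 or the values used in lib/sizeof.js:

```javascript
// Rough size estimator sketch. Byte counts are assumptions for illustration,
// not measurements of any JavaScript engine.
function roughSizeOf(value, seen = new Set()) {
  switch (typeof value) {
    case 'boolean': return 4;
    case 'number': return 8;
    case 'string': return 2 * value.length;   // assumes 2 bytes per UTF-16 code unit
    case 'symbol': return 0;                  // explicit, rather than an accidental 0
    case 'object': break;
    default: return 0;                        // undefined, function, bigint
  }
  if (value === null || seen.has(value)) return 0;   // null, or already visited (cycle)
  seen.add(value);
  if (Buffer.isBuffer(value)) return value.length;   // deterministic payload size
  if (Array.isArray(value)) {
    return value.reduce((sum, item) => sum + roughSizeOf(item, seen), 0);
  }
  // Plain objects: own enumerable keys plus their values; hidden classes are ignored.
  return Object.keys(value).reduce(
    (sum, key) => sum + roughSizeOf(key, seen) + roughSizeOf(value[key], seen), 0);
}

console.log(roughSizeOf(Buffer.alloc(16)), roughSizeOf([1, 'ab', true]));
```

    The `seen` set keeps the walk from looping forever on cyclic structures; it does not make the estimate any more accurate.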

    opened by DavidBruant 2
  • examples/core/parallelize.js fails with 2 workers

    To reproduce, in terminal 1:

    $ cd skale-engine
    $ ./bin/server -l 2
    

    In terminal 2:

    $ cd skale-engine
    $ ./examples/core/parallelize.js
    [ 1, 2, 3, 4 ]
    
    assert.js:89
      throw new assert.AssertionError({
      ^
    AssertionError: false == true
        at Console.assert (console.js:94:23)
        at /Users/marc/github/skale-engine/examples/core/parallelize.js:9:10
        at /Users/marc/github/skale-engine/node_modules/stream-to-array/index.js:54:9
        at _combinedTickCallback (node.js:376:9)
        at process._tickCallback (node.js:407:11)
    

    The output should be [ 1, 2, 3, 4, 5 ]

    bug 
    opened by mvertes 2
  • Bump ws from 6.1.2 to 7.4.6

    Bumps ws from 6.1.2 to 7.4.6.

    Release notes

    Sourced from ws's releases.

    7.4.6

    Bug fixes

    • Fixed a ReDoS vulnerability (00c425ec).

    A specially crafted value of the Sec-Websocket-Protocol header could be used to significantly slow down a ws server.

    for (const length of [1000, 2000, 4000, 8000, 16000, 32000]) {
      const value = 'b' + ' '.repeat(length) + 'x';
      const start = process.hrtime.bigint();

      value.trim().split(/ *, */);

      const end = process.hrtime.bigint();

      console.log('length = %d, time = %f ns', length, end - start);
    }

    The vulnerability was responsibly disclosed along with a fix in private by Robert McLaughlin from University of California, Santa Barbara.

    In vulnerable versions of ws, the issue can be mitigated by reducing the maximum allowed length of the request headers using the --max-http-header-size=size and/or the maxHeaderSize options.

    7.4.5

    Bug fixes

    • UTF-8 validation is now done even if utf-8-validate is not installed (23ba6b29).
    • Fixed an edge case where websocket.close() and websocket.terminate() did not close the connection (67e25ff5).

    7.4.4

    Bug fixes

    • Fixed a bug that could cause the process to crash when using the permessage-deflate extension (92774377).

    7.4.3

    Bug fixes

    • The deflate/inflate stream is now reset instead of reinitialized when context takeover is disabled (#1840).

    7.4.2

    Bug fixes

    ... (truncated)

    Commits
    • f5297f7 [dist] 7.4.6
    • 00c425e [security] Fix ReDoS vulnerability
    • 990306d [lint] Fix prettier error
    • 32e3a84 [security] Remove reference to Node Security Project
    • 8c914d1 [minor] Fix nits
    • fc7e27d [ci] Test on node 16
    • 587c201 [ci] Do not test on node 15
    • f672710 [dist] 7.4.5
    • 67e25ff [fix] Fix case where abortHandshake() does not close the connection
    • 23ba6b2 [fix] Make UTF-8 validation work even if utf-8-validate is not installed
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 1
Releases(1.2.0)
  • 1.2.0(Nov 15, 2017)

    This is a major feature release. Install it with npm.

    New

    • Skale-engine is renamed to skale. Version is now 1.2.0, identical to 0.8.0.
    • Add a machine learning library with classification, regression, clustering
    • Dependencies can now be deployed in workers with the new routine sc.require(). This considerably eases the integration of connectors to data sources, databases, etc.
    • Major improvements to documentation website

    Improvements

    • The test suite has been fully reworked, and now uses individual files that can be executed separately
    • Tests are considerably faster and easier to develop and debug
    • Both standalone and distributed engine are now systematically tested
    • save(): now supports output to CSV format
    • save(), textFile(): automatic forwarding of AWS environment and credentials to workers
    • Workers: control garbage collection with a command-line option
    • Modernize javascript syntax
    • Continuous integration: add MacOSX target in addition to Linux and Windows

    Fixes

    • Fix a problem in sample()
    • Fix support of undefined keys in aggregateByKey()
    • Fix debug traces
  • 0.7.1(May 17, 2017)

    This is a stability and bug fix release.

    • Documentation has been improved.
    • A new skale hacker's guide has been added.
    • A worker crash when using sample() with replacement has been fixed.
  • 0.7.0(Apr 4, 2017)

    This is a major release. It brings new features:

    • Support for Azure storage, for reading (textFile) and writing (save)
    • Support for the Apache Parquet file format, for reading and writing
    • Performance of wide transformations involving shuffling, such as aggregateByKey, reduceByKey, coGroup or join, has increased considerably compared with the 0.6 branch.
    • Many bug fixes and stability improvements

    Despite the new major version number, this release remains backward compatible with the previous 0.6.x branch.

    Also available as always through npm

  • 0.6.8(Dec 14, 2016)

    This is a stability and bug fix release. Documentation is improved, distributed mode is better: handling of tmp files and environment has been fixed.

  • 0.6.7(Nov 22, 2016)

    Performance and scalability improvement release.

    In distributed mode, a direct peer-to-peer shuffle data transfer between workers has been implemented. It improves scalability on large clusters when running with hundreds of simultaneous workers.

    Standalone and distributed modes are now described. Debug traces are improved.

  • 0.6.6(Nov 4, 2016)

    This is a stability and performance improvement release.

    Memory efficiency has been improved in presence of large datasets (thousands of partitions) and job complexity (hundreds of stages/steps).

    S3 support has been fixed, both for input and output.

    Multi-machine communications and debugging traces have been improved.
