The benchmark results depend heavily on the performance of the `denodeify` implementation: results can change by an order of magnitude depending on how fast `denodeify` is (the same holds for the `.all` implementations, though to a lesser extent). However, not all of the Promise implementations being tested by the benchmark provide this functionality themselves, and for those the Bluebird benchmark includes a "fake" implementation at `benchmark/lib/fakesP.js`.
This file also includes a fallback implementation:

```js
else {
    var lifter = require('when/node').lift;
}
```

which is used for at least one of the benchmarked implementations, `promises-dfilatov-vow.js` (beyond the obvious `promises-cujojs-when.js`).
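For context, a "lifter" of this sort converts a node-style callback function into one that returns a promise. The following is only a minimal sketch of the idea, assuming nothing but the standard Promise constructor; it is not the actual `fakesP.js` or `when/node` code:

```js
// Minimal sketch of a generic lifter/denodeify wrapper (illustration only).
// PromiseCtor is whichever Promise implementation is being benchmarked.
function lift(PromiseCtor, nodeFn) {
    return function () {
        var args = Array.prototype.slice.call(arguments);
        return new PromiseCtor(function (resolve, reject) {
            // Append a node-style callback that settles the promise.
            args.push(function (err, result) {
                if (err) reject(err);
                else resolve(result);
            });
            nodeFn.apply(null, args);
        });
    };
}
```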
I believe that the benchmark should only test what the library provides, and if the library's API is slightly different from what the benchmark code needs, then a wrapper should be used in a way that does not put that library at a disadvantage. However, I believe the wrapper at `fakesP.js` is very far from providing equal grounds.
The following examples use `parallel`, but a similar pattern also shows with `doxbee`.
Example: a normal run of the benchmark on my system, using `./bench parallel`:
```
results for 10000 parallel executions, 1 ms per I/O op

file                                 time(ms)  memory(MB)
promises-bluebird.js                      359      109.33
promises-bluebird-generator.js            422      113.92
promises-cujojs-when.js                   594      164.51
promises-tildeio-rsvp.js                  625      217.22
callbacks-caolan-async-parallel.js        922      225.85
promises-lvivski-davy.js                  922      280.55
promises-calvinmetcalf-lie.js            1062      380.36
callbacks-baseline.js                    1093       37.64
promises-dfilatov-vow.js                 2187      534.49
promises-ecmascript6-native.js           2391      542.76
promises-then-promise.js                 2453      695.89
promises-medikoo-deferred.js             3078      535.80
promises-obvious-kew.js                  4594      963.20

Platform info:
Windows_NT 6.3.9600 x64
Node.JS 1.8.1   <-- that's actually io.js v1.8.1
V8 4.1.0.27
Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz × 4
```
However, if we modify `fakesP.js` so that all of the libraries always use the same implementation, with the following patch, which changes the `when/node` lifter from a fallback into the one that is always used:

```diff
-else {
 var lifter = require('when/node').lift;
-}
```

then the results look like this (obviously same system and same invocation as above):
```
results for 10000 parallel executions, 1 ms per I/O op

file                                 time(ms)  memory(MB)
promises-cujojs-when.js                   563      163.62
promises-lvivski-davy.js                  625      198.71
promises-then-promise.js                  672      218.51
promises-bluebird.js                      891      251.84
promises-bluebird-generator.js            937      255.72
callbacks-caolan-async-parallel.js        969      225.84
promises-tildeio-rsvp.js                 1078      360.25
callbacks-baseline.js                    1110       37.63
promises-obvious-kew.js                  1203      319.76
promises-calvinmetcalf-lie.js            1750      431.94
promises-dfilatov-vow.js                 2157      535.04
promises-medikoo-deferred.js             3359      554.31
promises-ecmascript6-native.js           6188      937.73
```
and suddenly the numbers look very different. Some implementations go up the ladder while others go down (except for `promises-dfilatov-vow.js` and `promises-cujojs-when.js`, which were already using this implementation before the patch, though they may still appear higher or lower now because others have moved).
For instance:

- `promises-bluebird.js` "deteriorated" by a factor of ~2.5, from 359 ms / 109 MB to 891 ms / 251 MB
- `promises-then-promise.js` "improved" by a factor of ~3.5, from 2453 ms / 696 MB to 672 ms / 218 MB
- `promises-obvious-kew.js` "improved" by a factor of ~3.8, from 4594 ms / 963 MB to 1203 ms / 320 MB
Relative to each other, some of the results changed by a factor of ~10.
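(For example, before the patch `promises-obvious-kew.js` was ~12.8× slower than `promises-bluebird.js` (4594 / 359), while after it the ratio is only ~1.35× (1203 / 891) - a relative shift of roughly 9.5×.)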
The results with the patch can actually be considered much more "fair", since they make use of "external, independent" code (external to all three libraries listed above) as the lifter, thus putting them on equal grounds.
(I do realize that of those three, only kew has a wrapper implementation in `fakesP.js`, while the other two do provide an internal implementation, but the example is meant to show how the results can change when the same "construction" is used to test different implementations.)
So, what does this mean? It means that code outside the benchmarked library greatly affects the numbers which the benchmark reports as that library's score, and this effect (and its magnitude) differs between libraries.
How can it be improved?
IMO the best thing to do is to use the absolute minimum API possible from the libraries (probably what the specification defines and what the `promises-aplus-tests` suite expects - the plain Promise API), and then build the whole benchmark construction on top of that. This way, libraries which don't implement `denodeify` or `all` still get exactly the same treatment as those which do - a benchmark of their Promise implementation.
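For illustration only, here is a minimal sketch (my own, not code from the benchmark) of how an `all` equivalent could be built on top of nothing but the plain Promise constructor and `.then()`. Combined with a generic lifter like the one sketched earlier, this would give every implementation the exact same surrounding "construction":

```js
// Hypothetical helper: an "all" equivalent using only the plain Promise API.
// PromiseCtor is whichever Promise implementation is being benchmarked.
function allOf(PromiseCtor, promises) {
    return new PromiseCtor(function (resolve, reject) {
        var results = new Array(promises.length);
        var pending = promises.length;
        if (pending === 0) return resolve(results);
        promises.forEach(function (p, i) {
            p.then(function (value) {
                results[i] = value;
                if (--pending === 0) resolve(results);
            }, reject);
        });
    });
}
```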
Obviously, not everyone would like this suggestion, since clearly some/many library authors have put a lot of time and effort into improving and fine-tuning these features, and with such a suggestion this effort might not show up in the benchmark at all.
OTOH, authors of libraries which have not implemented those features (and possibly others - I didn't examine the benchmark code much) would finally have a benchmark which actually tests only what their library provides, and doesn't make their library "look bad" due to code which isn't part of the library.
The solution to this is, IMO, a separate benchmark for features which are not considered core Promise features. One (or several) benchmarks would build only on top of the common Promise features and test how well those work, both in themselves and compared to other libraries, while another benchmark could test the performance of the "extra" features. If a library provides such a feature, it would make sense to test it.
But what does not make much sense, IMO, is that the benchmark itself includes wrappers which put some implementations at a great disadvantage as far as the benchmark results go.