Introduction
I recently opened a pull request to add the ability to fetch snapshots of a document at a given version. This was first discussed in this other issue.
In that issue, the possibility of some optimisations was discussed briefly, but deemed out-of-scope for an initial offering. Now that it seems like that feature is close to being merged, I would like to start discussing how to optimise it further.
Motivation
These optimisations are far from just an academic exercise. We're currently running ShareDb on a production server, and we have some documents hitting 1,000,000+ operations.
Running the new getSnapshot
method on a document like this takes on the order of minutes to return on a development machine, which is not acceptable. This sort of operation would ideally take < 10s independent of number of operations.
Deliverable
What I would like to establish with this issue ticket is a sensible optimisation to start work on.
Possible approaches
There are a number of possible approaches, some of which were discussed in the original issue. Those are recapped here.
Reversible types
The current approach for getSnapshot
only works forwards, starting with the very first op and moving forwards. Depending on use case, it may often be likely that consumers are more interested in later versions of the document. In this case, starting from the current snapshot and working backwards would be much faster.
Pros
- Much faster to fetch recent versions of a document
- Roughly halves the worst-case time
Cons
Caching ops
A very large amount of time can be spent fetching ops. Anecdotally, it's responsible for over half the time spent retrieving a snapshot (using sharedb-mongo
).
Subsequent version queries could be sped up if the ops are cached. Any requests for a version lower than the latest version in the cache can be fed directly from the cache.
However, potentially this sort of optimisation belongs at the driver level?
Pros
- Would be a huge speed boost over database drivers
Cons
- ‘There are only two hard things in Computer Science: cache invalidation and naming things.’ — Phil Karlton
- First fetch is still unaffected
- Could lead to vastly increased memory usage
Current version
Whenever we try to fetch the current snapshot through getSnapshot
, it currently works its way through all the ops. We could instead trivially return the current snapshot. This is easy in the case of providing a null
version, but would require a lookup of the current snapshot first in all other cases.
Pros
- Much faster for fetching current version
Cons
- Would require an extra snapshot lookup before every fetch, which is added overhead for an (unlikely?) scenario
Build from latest creation
When a document is deleted and then re-created, if we want a version after re-creation, we only need to fetch and apply the ops after that new creation.
However, as far as I'm aware, there's currently nothing defined in the driver API that lets us fetch the latest creation operation, which means currently we would still need to fetch all the ops, and then perform an initial pass through the ops to determine if such an optimisation could be made.
Pros
- Faster retrieval in some cases
Cons
- Possibly requires a change to driver API
- Otherwise still fetches all ops, and requires a second pass
- (Re-creation may be relatively uncommon?)
Snapshot milestones
In a similar vein of thinking to building from the latest creation op, if fetching and applying ops is expensive, one of the best things you can do is fetch and apply fewer of them.
We could achieve this by periodically saving snapshots of documents (every 100,000 ops, say). Then we could fetch the latest applicable snapshot milestone and reconstruct the rest of the snapshot on top of that.
Pros
- Limits the time of any retrieval
- Could be made configurable?
Cons
- Large infrastructure change
- Requires a change to the driver API
- Hard to add retroactively to existing documents
Conclusion
I think that the inversible types optimisation is probably the lowest-hanging fruit, but for purely selfish reasons this is the one I'm least likely to implement, because we use rich-text
and our json0
ops fall into the trap I mentioned.
I also quite like the op caching idea, but the fact that first load is unaffected is quite a bummer.
I personally really like the idea of saving snapshot milestones. This is without doubt the optimisation that would be the most work, but I think that the ability to put a cap on the retrieval of any given document is absolutely invaluable. If people would be willing to support me on this change, I'd be more than keen to do this, because honestly I think we'll have to fork and implement this if I don't do it here.
Obviously, if there are any other optimisations I've missed, please pile in with comments.