This post is intended to be a ‘living document’ and should be modified as needed. Please share any questions, thoughts, or suggestions you might have.
I had been browsing around earlier looking for info on profiling tools, and while some ideas are bouncing around, it seems there’s not too much out there right now. I believe it would be beneficial for us to establish an interface for querying vault health, both to maximize overall network health and to provide the tooling necessary for future in-depth debugging and profiling work.
As a possible solution, I thought some sort of extensible, telemetry-style interface for querying vaults might be a good way to consolidate profiling efforts and offer an implementation for health checks, which have been proposed in the past.
This topic is for discussing an extensible telemetry implementation which allows vaults to remotely query another vault’s health and/or supplementary information. In this discussion, the primary concern is the interface structure and an initial implementation allowing the network to accept/reject/evict vaults based on “health”. That said, the interface should be extensible to allow for other supplementary information (e.g. performance info, structured-data logging, etc.) to be queried, which could pave the way for further optimizations and debugging tools down the line.
The existing solution is not integrated into the `safe-vault` library and is mostly a stop-gap solution by its own admission (see resource_proof below under Background and Related Information).
Goals
- Introduce a unified, easily-extensible interface for querying vaults for information to determine their fitness level and potentially other supplementary performance or diagnostic information
- Allow vaults to be compared to each other and to arbitrary standards based on health
- Define one or several metric(s) to define what it means for a network node to be “healthy” enough to participate in the network (separate from proof of resource; think CPU, bandwidth, etc.)
- Define behavior for nodes in/attempting to join the network which are not “healthy.”
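To make the goals above a little more concrete, here is a minimal sketch of a continuously valued health score and the accept/reject/evict decisions that could hang off it. Everything in it is an assumption for illustration: the `HealthSample` fields, the reference values, the weighting, and the two thresholds are hypothetical and not taken from any existing crate. Which inputs actually belong in the score is still an open question.

```rust
/// Hypothetical measurements reported for a vault. The fields and units are
/// assumptions for illustration only.
#[derive(Debug, Clone, Copy)]
pub struct HealthSample {
    pub bandwidth_mbps: f64,
    pub cpu_score: f64,
}

/// Hypothetical policy: reference values a "fully healthy" vault would reach,
/// a weighting between the two inputs, and thresholds for joining/staying.
/// Reference values are assumed to be strictly positive.
#[derive(Debug, Clone, Copy)]
pub struct HealthPolicy {
    pub reference_bandwidth_mbps: f64,
    pub reference_cpu_score: f64,
    /// Weight of bandwidth in [0, 1]; CPU gets the remainder.
    pub bandwidth_weight: f64,
    pub min_to_join: f64,
    pub min_to_stay: f64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Accept,
    Reject,
    Keep,
    Evict,
}

/// Maps a sample to a score in [0, 1]: each input is normalised against its
/// reference value, capped at 1, and the two are combined as a weighted sum.
pub fn health_score(sample: &HealthSample, policy: &HealthPolicy) -> f64 {
    let bandwidth = (sample.bandwidth_mbps / policy.reference_bandwidth_mbps).clamp(0.0, 1.0);
    let cpu = (sample.cpu_score / policy.reference_cpu_score).clamp(0.0, 1.0);
    policy.bandwidth_weight * bandwidth + (1.0 - policy.bandwidth_weight) * cpu
}

/// Decision for a vault that is attempting to join.
pub fn judge_candidate(score: f64, policy: &HealthPolicy) -> Verdict {
    if score >= policy.min_to_join {
        Verdict::Accept
    } else {
        Verdict::Reject
    }
}

/// Decision for a vault that is already a member.
pub fn judge_member(score: f64, policy: &HealthPolicy) -> Verdict {
    if score >= policy.min_to_stay {
        Verdict::Keep
    } else {
        Verdict::Evict
    }
}
```

Because the score is a plain number, two vaults can be compared directly (`score_a > score_b`) or against an arbitrary standard. Using a lower threshold for staying than for joining is one possible way to avoid evicting a vault over a small dip.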
Non-Goals
- Defining supplementary debug information to be supplied over the interface – these should be proposed separately if/when the proposed system is implemented, so that there is the appropriate focus to consider things like security/overhead of each additional query type
- Defining further behaviors beyond vault rejection/acceptance/eviction, based on the health metric. For the same reason as the above point, this should probably go in different topics
Background and Related Information
- resource_proof - A Rust crate which implements best-effort checking of node resources to provide “some indication that a machine has some capabilities”
- previous discussion - A discussion that led to the creation of this topic and covers some possible use cases and background info
- Github - Performance Checks Issue - A GitHub issue which alludes to a similar, unformalized version of the above, demonstrating interest as well as highlighting some advantages of an integrated solution over resource_proof
Current Ideas
Some of these are borrowed from existing documents, some I’ve come up with.
- Define health as the “ability to securely store and deliver data” as per the Safe Network Health Metrics Document, with the idea that this is not a binary value, but rather a continuously valued function (of bandwidth and CPU speed? see Open Questions below).
- Telemetry would take the form of a new `Rpc` variant in `safe-vault` and be handled as a routing message, and the response would be dispatched via an `Action`. Alternatively, retooling the existing `Request` variant is possible, but would require some more sweeping changes as the `Request` variant seems to refer specifically to get requests on network resources/data like files
- The RPC indicates what log page to fetch. The log page can be fetched from some existing memory or constructed on the fly.
- Health checks are performed on entering the section and at random intervals to ensure continued health.
- Only Elders can perform a health check of vaults in the same section under normal circumstances.
- In internal testnets and the like (e.g. testnets where security is not paramount as MaidSafe owns all the vaults in the testnet or the net is entirely local), the above can be relaxed so that any node can request a log page from any other node for the purposes of debugging and performance analysis.
- Similarly to the above, in internal testnets, we could expose conditionally-compiled log pages for insecure but useful debugging/profiling information. Naturally, vaults compiled in this testing mode aren’t interoperable with normal vaults.
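The ideas above can be sketched roughly as follows. This is a standalone illustration, not the actual `safe-vault` code: the `Rpc` and `Action` enums here are simplified stand-ins that only borrow the names, and `LogPage`, `Role`, `Requester`, and `Vault` are hypothetical types invented for the example. A runtime `testnet_mode` flag stands in for the conditional compilation mentioned above, purely so the sketch stays in one piece.

```rust
use std::collections::HashMap;

/// Identifies which log page a telemetry request asks for (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum LogPage {
    Health,
    /// Debug-only page; the proposal would hide this behind conditional
    /// compilation, a runtime flag stands in for that here.
    DebugLogs,
}

/// Simplified stand-in for the vault RPC enum with one new telemetry variant.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Rpc {
    Telemetry { page: LogPage },
    /// Placeholder for every other kind of message.
    Other,
}

/// Simplified stand-in for the action a vault dispatches in response.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Action {
    SendLogPage { page: LogPage, payload: Vec<u8> },
    Refuse,
    Ignore,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Elder,
    Adult,
}

#[derive(Debug, Clone, Copy)]
pub struct Requester {
    pub role: Role,
    pub same_section: bool,
}

pub struct Vault {
    testnet_mode: bool,
    cached_pages: HashMap<LogPage, Vec<u8>>,
}

impl Vault {
    pub fn new(testnet_mode: bool) -> Self {
        Vault { testnet_mode, cached_pages: HashMap::new() }
    }

    /// Stores a page in memory so later requests can be served from it.
    pub fn cache_page(&mut self, page: LogPage, payload: Vec<u8>) {
        self.cached_pages.insert(page, payload);
    }

    /// Normal mode: only Elders of the same section may fetch the health page.
    /// Testnet mode: any node may fetch any page.
    fn is_authorised(&self, requester: &Requester, page: LogPage) -> bool {
        if self.testnet_mode {
            return true;
        }
        page == LogPage::Health && requester.role == Role::Elder && requester.same_section
    }

    pub fn handle_rpc(&self, requester: &Requester, rpc: &Rpc) -> Action {
        let page = match rpc {
            Rpc::Telemetry { page } => *page,
            Rpc::Other => return Action::Ignore,
        };
        if !self.is_authorised(requester, page) {
            return Action::Refuse;
        }
        // Serve from existing memory if present, otherwise build on the fly.
        let payload = match self.cached_pages.get(&page) {
            Some(bytes) => bytes.clone(),
            None => format!("{:?}: built on request", page).into_bytes(),
        };
        Action::SendLogPage { page, payload }
    }
}
```

Keeping the page identifier as its own enum means new query types can be proposed and reviewed one at a time, which matches the non-goal of deciding supplementary pages up front.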
Potential Future Applications
Some of these aren’t necessarily secure, but are potentially useful in testnet scenarios and for debugging (implying conditional compilation).
- Collecting structured-data logs in a circular buffer and delivering them on request
  - E.g. providing the last `n` messages this vault signed and the messages they were attached to
  - E.g. providing a list of the previous `n` sections this vault has been a member of
- Collecting and aggregating plaintext logs from various vaults in a quick and easy fashion
- Tracking the route of a message as it travels through the network
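As an illustration of the circular-buffer idea from the list above, here is a minimal sketch built on the standard library’s `VecDeque`. The `LogEntry` and `LogRing` types and their fields are hypothetical names made up for this example; a real structured-data log would carry richer entries than a single string.

```rust
use std::collections::VecDeque;

/// Hypothetical structured log entry: a sequence number plus a description.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LogEntry {
    pub seq: u64,
    pub event: String,
}

/// Fixed-capacity buffer that overwrites its oldest entry once full.
pub struct LogRing {
    capacity: usize,
    next_seq: u64,
    entries: VecDeque<LogEntry>,
}

impl LogRing {
    pub fn new(capacity: usize) -> Self {
        LogRing {
            capacity,
            next_seq: 0,
            entries: VecDeque::with_capacity(capacity),
        }
    }

    /// Records an event, dropping the oldest entry if the buffer is full.
    pub fn record(&mut self, event: &str) {
        if self.capacity == 0 {
            return; // A zero-capacity ring keeps nothing.
        }
        if self.entries.len() == self.capacity {
            self.entries.pop_front();
        }
        self.entries.push_back(LogEntry {
            seq: self.next_seq,
            event: event.to_string(),
        });
        self.next_seq += 1;
    }

    /// Returns up to the last `n` entries, oldest first.
    pub fn last_n(&self, n: usize) -> Vec<LogEntry> {
        let skip = self.entries.len().saturating_sub(n);
        self.entries.iter().skip(skip).cloned().collect()
    }
}
```

A request for “the last `n` signed messages” would then map onto `last_n(n)`, and memory use stays bounded by the capacity regardless of how long the vault has been running.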
Open Questions
- What are good indications of health? CPU and bandwidth come to mind (see the GitHub issue), but I’m certain others exist. Potentially things like disk space (beyond the proof-of-resource required to be a vault in the first place).
- How to measure health remotely and trust the response?
  - Issue a challenge of some sort (e.g. hash these numbers for me) and give an arbitrary time limit to respond?
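A rough sketch of the challenge idea, under several assumptions: the `Challenge` and `Outcome` types are hypothetical, and the hash is a hand-written FNV-1a, which is not cryptographic and is used here only to keep the example dependency-free. A real check would need a cryptographic hash and some answer to the harder question of whether the response was computed by the vault itself.

```rust
use std::time::Duration;

/// Hypothetical challenge: numbers to hash and a deadline for the answer.
pub struct Challenge {
    pub numbers: Vec<u64>,
    pub time_limit: Duration,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Passed,
    WrongAnswer,
    TooSlow,
}

/// Hashes the numbers with 64-bit FNV-1a over their little-endian bytes.
/// FNV-1a is NOT cryptographic; it is only a placeholder for a real hash.
pub fn solve(numbers: &[u64]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for number in numbers {
        for byte in number.to_le_bytes().iter() {
            hash ^= u64::from(*byte);
            hash = hash.wrapping_mul(PRIME);
        }
    }
    hash
}

/// Checks a response: it must arrive within the time limit and match the
/// value the verifier computes for itself.
pub fn verify(challenge: &Challenge, answer: u64, elapsed: Duration) -> Outcome {
    if elapsed > challenge.time_limit {
        Outcome::TooSlow
    } else if answer != solve(&challenge.numbers) {
        Outcome::WrongAnswer
    } else {
        Outcome::Passed
    }
}
```

In practice the requester would measure `elapsed` itself (for example with `std::time::Instant`) rather than trust a timestamp from the responder. The measured time also includes network latency, so a single response time mixes CPU and bandwidth effects.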