FileTree CRDT for SAFE Network

happybeing · July 31, 2020, 5:06pm

Ditto. I find it useful to write things down even if they get torn up so don’t take anything I come up with as the way I think it should be.

I have been mixing research about filesystems with thinking about implementation of SAFE FS for FUSE and marrying these with how I imagine the FileTree will need to work, and have made notes about possible implementations. But rather than force a lot of reading on anyone I’ll respond to your questions and drop bit in where relevant.

The billion dollar question. I’ve been thinking about this and have more investigation to do to establish how closely SAFE FS should try to emulate typical inode features, or whether for example, it can create and destroy them more like file handles. I don’t know if that’s a useful idea or not, or whether to place that in SAFE FS or FileTree. For example, it may help us by decoupling the needs of the FileTree/TreeCDRT node identifiers from those exposed by SAFE FS.

I’ve summarised my understanding of Unix/Linux filesystem, including inodes (here). There are some things I probably need to look into further, so I made that a Wiki topic for this which can be updated and discussed separately.

I’d also like to know how the idea of a u64 node identifier might work (or not) with a TreeCDRT if you have any thoughts about this. More on that in a bit.

I don’t know about the CDRT side of this but see you’ve made progress. I have a couple of thoughts of how to manage the truncation to keep it trim without losing functionality:

I think it’s desirable to be able to access the full log rather than ever lose it completely. I think each replica can do this, by moving any pruned entries to a Sequence owned by the replica so that the full log can be be reconstructed, while only a shorter one is needed for most operations. This will preserve the ability to roll any FileTree back through all its previous states, or to merge any two FileTree replicas regardless of age and pruning. Maybe this is a ‘for later’ feature, but we could make preserving pruned entries the priority and implement features to make use of this later.
it might be helpful for users to be able to control how and when the log is pruned to suit different use cases. This could either be set when creating the first replica, or perhaps something that can be modified at any time.
EDIT: Another thought as I think truncation has a second aspect. The first is trimming the log so it doesn’t become so large that performance degrades unacceptably.

The second is deciding to have a cut off, before which no replica is merged if it has unmerged entries proper to the cut off. Why might that be useful? I’m not sure, but it occurred to me while I was thinking about the inode problem, trying to avoid clashes for numbers allocated independently. I came up with the following solution which I think would require us to limit how far back mergers should be allowed to go. The downside of that is that all topics will have to sync to stay ahead of the cut-off, or face their changes being blocked. That is recoverable though through a user managed re-sync, and I think for most use cases unlikely to be an issue, so how might having a cut off help as described in: Allocating inode numbers using pre-allocated blocks

I’ve not thought about this so idiot mode activated… The FileTree will be responsible for this because it will be a part of the CDRT operations, and so it will need to keep track of the metadata for each node, serialise it and retrieve it etc.

As for how to serialise it I’m not sure. Does it make sense to have a column in the log for each attribute or to have a single column for all metadata? In the latter case perhaps the metadata is serialised to a Sequence, in the former to the log itself but using run-length-encoding to avoid having to store values which rarely change, or maybe a bit of both?

Assuming this is just more metadata that’s something we could bolt on later so not an urgent question, but in general I’d say we try to implement everything we can in the long term in ways that don’t hurt performance until they are made use of by an application.

The init() operation is used to inform us about the characteristics and parameters of the FUSE kernel, and allows us to tweak some things if we want. Nothing exciting, but will need attention at some point (see docs).

You may be more interested in how we start up when a user tries to create or mount a SAFE FS filesystem. I expect we’ll have an API for creating a new FileTree object and from that object gain access to a SAFE FS API which SAFE FUSE can use to provide the filesystem operations on the FileTree.

So for example a safe-fuse CLI will be invoked with parameters telling it to either create a brand new filesystem (i.e. create an empty FileTree object for it), or an existing replica (i.e. fetch an existing FileTree from the network), and mount it at a given path. Once safe-fuse has a FileTree it obtains a SAFE FS API from it, then passes its FUSE operations to libfuse along with the requested mount point.

(Not necessarily in that order! As in, it might be better to create the mount first but block any operations until we have obtained the FileTree rather than keep the user waiting for that before failing the operation because they gave an invalid mount point.)

Another ‘I think’: from the application side you must always start with a path (you can’t just have an inode number and begin using that, is my understanding - except perhaps for the ‘root’). From the path you can obtain an inode which identifies the thing for subsequent operations on the same thing (typically a file or directory). This is hidden from view when using high-level FUSE. With low-level FUSE, the you call lookup() to get an inode number, which you then use for subsequent operations on the object it represents (e.g. a file or directory). The lookup() increments a hard-link count for the inode, so your process now effectively has a hard-link to the object, which is released when you call forget() which decrements the hard-link count. I see these hard-links as different (ephemeral) from hard-links from a directory entry to a file (enduring), and may be handled differently in our implementation. For example, if a process crashes before calling forget() it is probably better for the ‘link’ to die with it than to leave the inode in a state where its hard-link count will not reach zero when all its other hard-links have been removed.

Part of my to-do list is to look into this: how important it is that SAFE FS inodes correspond to typical filesystem inodes as it may be useful to deviate from this if it makes life easier for the FileTree/TreeCDRT implementation and its interface to SAFE FS.

I imagine it typical filesystem initialises an empty fixed sized table of inode references which has entries filled in or erased as file system objects are created and destroyed.

In terms of the APIs, the filesystem will call our FUSE API with a path in order to create an inode. We respond with the inode number, and it will use that for subsequent operations. So we would probably create an entry in the tree at this point, allocate it a u64 and respond with that.

For example:

OS-FUSE obtains the inode number for an entry in the root (‘/’) using an inode number of 1, so to get the inode for ‘/tmp’ it will call lookup(req_handle, 1, 'tmp')
If ‘/tmp’ exists we respond with a value, say I’ll call tmp_inode (which might be 2).
OS-FUSE then calls our create(req, tmp_inode, 'bar') to create the file entry.
We find the FileTree node which has an id of tmp_inode, create a child entry with name ‘bar’ and allocate an inode number for the entry (bar_inode) and initialise the metadata for the entry (e.g. creation time).
We respond with the inode number (bar_inode) of the new entry.

If that’s not complicated enough there’s a bit more to it than this so it is going to need a lot of reading and careful thought to ensure a clean, robust design.

Just read this part (possibly again), and it is what I had concluded after lots of thinking about how to implement hard-links. TL;DR: it works so long as we only need links to leaf nodes. This is ok IMO because only MacOS allows hard-links to directories.

Re a): Where the inode has only one reference, i.e. for directories, symlinks and files which don’t have extra hard-links, we could use the FileTree/TreeCDRT node to hold the metadata of the inode. In this case a file is just metadata, including an xor address (similar to a FilesContainer entry). So for most cases we just need a way to map between inode number and FileTree/TreeCDRT node in order to access the data of an inode.

In the case of a hard linked file, we will need a separate object to hold the inode data which can be referenced by every entry which hard-links to it from a directory. I have a scheme in mind where we store that in a Sequence but I won’t go into details here. If this passes scrutiny, the nice part is that we can start without support for hard-links and add it later.

Re b): I need a recap of this. Can you point me at anything or write a post explaining how the UUID is used, how the CDRT operates in this area? My head is filled with FUSE and I need a refresh of the CDRT side (or maybe a summary doc of what you envisage would help).

I agree with building in that way, and also find it useful to set out what I’m aiming at long term even if we end up somewhere else! I can also try to see the stepping stones we can use along that incremental path (e.g. the point about being able to add hard links later).