Great news @danda, keep going!
Yep, very nice.
Ok, I have more code (and tests) working.
In a nutshell:
Output of tests can be seen here.
The prototype filesystem code is on github here.
At this point, I feel I've demonstrated that the design of the data structure works. I may implement symlinks tmw, and then update the design docs, before starting on the Rust code next week…
It would also be good to get some kind of timing/perf numbers…
The MemFS/FileTreeFS doesn't implement Create(), but it is defined for the FUSE Operation trait in op.rs, so we could implement it if needed. On the inode side, the Linux VFS docs here appear to say that create() is needed for regular files, but maybe the FUSE kernel stuff allows it to be optional.
I've added an implementation document to the SAFE FS repo, so ./design now contains:
@happybeing here are a few issues worth discussing design-wise.
How best to implement stateful file handles?
How/where can we store local filesystem data?
What actions do we perform during mount and unmount?
In more detail:
How best to implement stateful file handles?
If a process has an open file handle, it can continue to use it even if the file has been deleted/unlinked by itself or another process. We need to ensure our design accommodates this. I plan to look into it more today. Let me know if you have any insights into what the usual design pattern for this is…
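To make the requirement concrete, here is a minimal in-memory sketch of the usual pattern (the names and types are hypothetical, not taken from the prototype): unlink drops the directory entry, but the inode is only reclaimed once both its link count and its open-handle count reach zero.

```rust
use std::collections::HashMap;

type Ino = u64;
type Fh = u64;

struct Inode {
    links: u32,      // directory entries pointing at this inode
    open_count: u32, // currently open file handles
}

struct Fs {
    inodes: HashMap<Ino, Inode>,
    handles: HashMap<Fh, Ino>, // open handle -> inode
    next_fh: Fh,
}

impl Fs {
    fn open(&mut self, ino: Ino) -> Option<Fh> {
        self.inodes.get_mut(&ino)?.open_count += 1;
        let fh = self.next_fh;
        self.next_fh += 1;
        self.handles.insert(fh, ino);
        Some(fh)
    }

    // unlink removes the name, but the inode survives while handles are open
    fn unlink(&mut self, ino: Ino) {
        if let Some(inode) = self.inodes.get_mut(&ino) {
            inode.links = inode.links.saturating_sub(1);
            if inode.links == 0 && inode.open_count == 0 {
                self.inodes.remove(&ino);
            }
        }
    }

    // release (close) drops the handle; the inode is reclaimed only when
    // both the link count and the open count reach zero
    fn release(&mut self, fh: Fh) {
        if let Some(ino) = self.handles.remove(&fh) {
            if let Some(inode) = self.inodes.get_mut(&ino) {
                inode.open_count -= 1;
                if inode.links == 0 && inode.open_count == 0 {
                    self.inodes.remove(&ino);
                }
            }
        }
    }
}

fn main() {
    let mut fs = Fs { inodes: HashMap::new(), handles: HashMap::new(), next_fh: 1 };
    fs.inodes.insert(42, Inode { links: 1, open_count: 0 });
    let fh = fs.open(42).unwrap();
    fs.unlink(42); // name is gone, but the inode is still usable via fh
    assert!(fs.inodes.contains_key(&42));
    fs.release(fh); // last handle closed -> inode reclaimed
    assert!(!fs.inodes.contains_key(&42));
    println!("unlinked-but-open inode handled as expected");
}
```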
How/where can we store local filesystem data?
We have the crdt-tree op logs, (possibly) a cache of the current replica state, and also the local inode entries and ino to uuid lookup indexes.
Typically one thinks of a filesystem as storing such data within itself, e.g. in the superblock of a block device.
However, as a network fs, we have no underlying block device. I am thinking that we somehow designate a file on an existing mounted filesystem as our storage area. Similar to (or exactly the same as?) mounting a loopback filesystem such as a CD-ROM ISO file.
It seems that any networked filesystem would have similar needs, so hopefully there is a standard way to go about this…?
What actions do we perform during mount and unmount?
In theory, we could download everything from the network at mount, replay logs, and arrive at current state. At unmount, we simply discard everything. Issues with this are that it is slow and isn't great for "local-first" operation because it means that if we unmount while offline, we can't re-mount the filesystem until we are online again.
So what seems desirable in the final production system is to locally cache logs and state, so that we can quickly re-mount without a potentially lengthy download.
For phase 1 we are building a local-only filesystem. My thinking is that we should not try to optimize/cache just yet. Instead, at unmount time, we simply write out the operation log to our local store (whatever it is). At mount, we read it in and replay the ops. This functionality will be needed for the initial mount (no local cache) from the network anyway.
Improvement: rather than writing the entire log/state at unmount, it should be written as events occur, and unmount would then perform a final sync.
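As a rough illustration of that flow (the Op variants, path, and one-op-per-line format below are placeholders, not the actual FileTree types or storage layout):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Result, Write};

// Placeholder op type; the real log would hold crdt-tree move operations.
#[derive(Debug)]
enum Op {
    Mkdir { parent: u64, name: String },
    Unlink { parent: u64, name: String },
}

// At unmount: persist the in-memory op log to local storage and sync it.
fn unmount_sync(log: &[Op], path: &str) -> Result<()> {
    let mut f = File::create(path)?;
    for op in log {
        // one op per line; a real implementation would use a proper serializer
        writeln!(f, "{:?}", op)?;
    }
    f.sync_all()
}

// At mount: read the log back and replay each op against the in-memory tree.
fn mount_replay(path: &str) -> Result<usize> {
    let f = File::open(path)?;
    let mut applied = 0;
    for line in BufReader::new(f).lines() {
        let _op_text = line?;
        // parse and apply the op to the in-memory tree state here
        applied += 1;
    }
    Ok(applied)
}

fn main() -> Result<()> {
    let log = vec![
        Op::Mkdir { parent: 1, name: "home".into() },
        Op::Unlink { parent: 1, name: "tmp.txt".into() },
    ];
    unmount_sync(&log, "/tmp/filetree-oplog.txt")?;
    println!("replayed {} ops", mount_replay("/tmp/filetree-oplog.txt")?);
    Ok(())
}
```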
I have boat chores to do over the next two days so I'll give some quick responses now and maybe revisit when I have more time…
I'm not sure, but do we need to do this for FUSE? We have ref counts to hold the inode in memory until it is no longer in use, and each file can only be open for write by one process, so I think we need a memory cache for inodes and one for file data until synced.
So I think we may just need to support cached writes (in memory) until sync. Syncer shows how to do this using multiple threads, more on this in a bit.
Syncer handles this with a local disk cache of inodes, and another of file data split into 1MB chunks (each with its own cache in memory for unsynced data).
I agree we want caching as an option. My thought about adapting Syncer was to start by using as much of its existing disk-based caching as possible (mentioned above), but then to remove this in favour of a "deeper" cache using SAFE chunks - i.e. the actual raw chunks put to and got from the network.
However, I'm not sure that will work because I was assuming everything was stored as chunks, whereas I'm not sure that all data types can be cached in this way. Do you know? For example, is Sequence chunked when put, and if so could we have a local cache of the chunks at the raw network side? From memory I think the whole Sequence is one object, but I'm not sure - I found the code difficult to decipher.
If so, we can have the benefit of caching not just for FileTree but for all SAFE apps, and the chunks will already be encrypted (although I understand we'll need an extra layer of encryption on top later).
So in the spirit of small steps, maybe we can implement a node cache and a file chunk cache (both in memory) like Syncer, which syncs to the FileTree and directly to the network.
Later we could add a cache for raw SAFE chunks to avoid the need to download everything on every mount, possibly with a special cache for the TreeCRDT if that's not catered for by a chunk cache.
A decent SAFE cache is probably a whole separate project.
I don't know if the above is feasible. It's how I was thinking when trying to adapt Syncer before beginning to work on this with you.
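For what it's worth, here is a very rough in-memory sketch of the Syncer-style chunked write cache mentioned above; the chunk size, names and flush hook are assumptions for illustration, not Syncer's or the FileTree's API.

```rust
use std::collections::HashMap;

const CHUNK_SIZE: usize = 1024 * 1024; // 1 MB chunks, as in Syncer's description

#[derive(Default)]
struct ChunkCache {
    // (inode, chunk index) -> dirty chunk bytes awaiting sync
    dirty: HashMap<(u64, u64), Vec<u8>>,
}

impl ChunkCache {
    // buffer a write in memory; real code would also track which byte ranges
    // are dirty and handle writes that span chunk boundaries
    fn write(&mut self, ino: u64, offset: u64, data: &[u8]) {
        let chunk_idx = offset / CHUNK_SIZE as u64;
        let within = (offset % CHUNK_SIZE as u64) as usize;
        let chunk = self
            .dirty
            .entry((ino, chunk_idx))
            .or_insert_with(|| vec![0; CHUNK_SIZE]);
        chunk[within..within + data.len()].copy_from_slice(data);
    }

    // sync: hand each dirty chunk to an uploader (network put) and drop it
    fn sync<F: FnMut(u64, u64, &[u8])>(&mut self, mut put: F) {
        for ((ino, idx), chunk) in self.dirty.drain() {
            put(ino, idx, &chunk[..]);
        }
    }
}

fn main() {
    let mut cache = ChunkCache::default();
    cache.write(42, 0, b"hello");
    cache.write(42, CHUNK_SIZE as u64, b"world"); // lands in the second chunk
    cache.sync(|ino, idx, bytes| println!("put ino {} chunk {} ({} bytes)", ino, idx, bytes.len()));
}
```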
This paper has a pretty good description of how FUSE works, including stateful vs stateless handling of opened files/dirs.
Also, not about FUSE or even Unix specifically, but I found this free book "Practical File System Design" that presents the BeOS filesystem, with mostly POSIX features. It goes into a lot of good detail.
@happybeing I'd like to update you and get your thoughts on this:
Also, I have some code for the crdt-tree in rust here:
See examples/tree.rs and src/tree/*
It isn't properly commented yet, needs unit tests, may move crates, etc.
Hi @danda,
Great to see such progress and that your work will be part of rust-crdt! The thesis sounds very interesting from your descriptions. I can't access it, but as you have now gained such a good understanding of the CRDT and FS areas, beyond mine in both I think, I'm not sure I can add much in those areas, although I am happy to continue where you think I can still help out. Maybe review, sounding board etc? I've been learning Rust so maybe before long there'll be something I can do on the coding side.
Your summary of the thesis sounds like you have a very good understanding of the issues, and some design options to evaluate. I don't have much to add about most of the points you explored there as you seem to have covered things well.
One area had me thinking, where you talk about high level operations such as mkdir etc, versus the low level tree-move. At present, SequenceCRDT and, as I understand it, the plan for TreeCRDT, is that the whole data structure is loaded/saved to the network, which is fine as a first step and can work up to a point, but comes with limitations. There are various ways to deal with that, but I'm just reminding myself of the idea that we could transmit operations rather than load/save the entire data structure. So I'm wondering if you have any thoughts on that side in the light of the thesis implementation, which sounds like it fits with that model?
I'll have a play with the tree-crdt examples, hopefully today.
That might or might not happen. The author of rust-crdt would need to accept/merge the changes. Also, we are moving towards small, focused crates now, e.g. the xorname crate. So I may end up making a dedicated crate for crdt-tree (which rust-crdt could choose to include or not). I'm still considering the best path here, but leaning towards the dedicated crate.
Both LSeq (Sequence) and crdt-tree transmit operations between replicas, not full state. (CmRDT, not CvRDT). For LSeq the ops are Insert and Delete. For crdt-tree, there is only the Move operation. Or perhaps I missed your meaning?
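Roughly, the op shapes being exchanged look like this (a sketch only, not the rust-crdt definitions):

```rust
// Illustrative op shapes only: the point is that each replica ships these
// small ops to its peers, never its full state.
#[derive(Debug)]
enum LseqOp<T> {
    Insert { id: Vec<u64>, value: T }, // dense position identifier + element
    Delete { id: Vec<u64> },
}

#[derive(Debug)]
enum TreeOp<Id, Meta> {
    // crdt-tree has a single op: move `child` under `parent` with `meta`;
    // create, rename and delete (move to trash) are all expressed as moves.
    Move { timestamp: u64, parent: Id, meta: Meta, child: Id },
}

fn main() {
    let ins: LseqOp<char> = LseqOp::Insert { id: vec![1, 5], value: 'x' };
    let del: LseqOp<char> = LseqOp::Delete { id: vec![1, 5] };
    // "mkdir /home" expressed as a move of a fresh node under the root:
    let mkdir: TreeOp<u64, &str> = TreeOp::Move { timestamp: 1, parent: 0, meta: "home", child: 7 };
    println!("{:?}\n{:?}\n{:?}", ins, del, mkdir);
}
```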
You're right, I was confusing the Sequence data type with the LSeq CRDT. So my question stands, but I'm asking it about the SAFE data types Sequence and FileTree rather than the underlying CRDTs.
My understanding is that the Sequence data type is loaded/saved as a single blob, and that each replica has a complete copy. I could be wrong about that; I tried to clarify How SAFE Sequence CRDT Data Type is implemented, so this is the basis of my understanding, and from there I thought you planned to implement FileTree in a similar way, at least to start? I don't have a picture of how the TreeCRDT fits into this, so maybe I just need to understand your plans for implementation (e.g. block diagram stuff) better.
So basically any op-based CRDT can be thought of as an ordered sequence of operations, which are stored locally in a log. For any replica to become "up to date" starting from zero, it must download the log entries from a peer and replay them. Each application of an operation results in a different local state. At any point, this local state can be saved to disk or loaded from disk. This is faster than replaying all log ops from the beginning. In this sense, one could think of a Sequence as being loaded/saved as a single blob. I haven't dug into the Sequence impl yet myself (beyond LSeq) so I can't say for certain if bootstrapping from peers is performed via log replay or by requesting the complete current state, though I would imagine the former as a first safe step, since the latter (I think) implies trusting the peer implicitly. @bochaco? Anyway, once a replica is "caught up", new operations must be transmitted/received/processed individually as they occur.
At least, this is my understanding but again, I haven't dug into the Sequence/AT2 code yet.
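As a toy illustration of that "state = replay of the op log" view (placeholder types, not the Sequence or FileTree implementation):

```rust
// A replica's state is just the fold of all ops applied in order; a saved
// snapshot lets it skip replaying the whole log from zero.
#[derive(Clone, Debug, Default, PartialEq)]
struct State {
    entries: Vec<String>,
}

#[derive(Clone, Debug)]
enum Op {
    Insert(String),
    Delete(String),
}

fn apply(state: &mut State, op: &Op) {
    match op {
        Op::Insert(v) => state.entries.push(v.clone()),
        Op::Delete(v) => state.entries.retain(|e| e != v),
    }
}

fn main() {
    let log = vec![
        Op::Insert("a".into()),
        Op::Insert("b".into()),
        Op::Delete("a".into()),
    ];

    // Bootstrapping from zero: fetch the log from a peer and replay everything.
    let mut from_zero = State::default();
    log.iter().for_each(|op| apply(&mut from_zero, op));

    // Faster: load a locally saved snapshot (state after op 2) and apply only newer ops.
    let mut from_snapshot = State { entries: vec!["a".into(), "b".into()] };
    apply(&mut from_snapshot, &log[2]);

    assert_eq!(from_zero, from_snapshot);
    println!("caught up: {:?}", from_zero.entries);
}
```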
I think this just shows that I don't understand how the SAFE data types are implemented. I tried reviewing the code and put my conclusions in the post linked above, some of which were wrong and then corrected. But things may well be changing under my feet anyway. I shall wait!
In an operation-based CRDT like Sequence we don't need to store the operations log, only the current state. A new replica gets the current state and, based on the information it contains (vclocks), it can send new operations which can be applied by other replicas.
Update:
There are plenty of usage examples in examples/tree.rs.
Naming of the data structures is a bit awkward and may yet change.
Wins:
Some Drawbacks:
Even with the drawbacks, I tend to favor this approach, in particular because it fixes the issue of replica states not matching, without requiring inverse operations.
Here is some output from the alternative implementation that shows the simplified data structure. Notice there is no local state, no ino → uuid mappings.
------- fs state after: Home dir created -------
- null => forest
- 281474976710656 => {"name":"root","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
- 281474976710659 => {"name":"home","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
- 281474976710660 => {"name":"bob","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
- 281474976710661 => {"name":"projects","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
- 281474976710663 => {"name":"homework.txt","inode_id":281474976710662}
- 281474976710664 => {"name":"homework-link.txt","inode_id":281474976710662}
- 281474976710657 => {"name":"fileinodes","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
- 281474976710662 => {"size":0,"ctime":1598629822,"mtime":1598629822,"kind":"file","links":2,"content":null}
- 281474976710658 => {"name":"trash","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
------- end state -------
Next, I may attempt another alternative impl using inverse operations just to get a better feel for that approach. It could potentially provide more flexibility later on with regards to replicas storing local state outside the Tree.
There's an issue regarding how to deal with conflicting names of directory children. @happybeing @here let me know your thoughts please.
Looks like great work @danda, so much to get your head around and you seem to fly through it! I've not delved into the code, so just some quick thoughts about the filename conflict at this point.
At first I thought having a conflicts directory was a good idea, but thinking about how conflicts arise and what might happen next, I think renaming to append the replica ID is the best option (also better than using the actor + clock). The reason is that I think this will be easier to understand and recover from, and it may also behave better if not noticed.
Consider a simple case of two replicas A and B like your example, they both create file.txt for different purposes. When A merges B, the result will be:
A has:
file-A.txt
file-B.txt
B has:
file.txt
B will continue working on file.txt unawares until it merges with some version of A. What happens next on A is likely to be one of two things:
If at any point replica B merges changes from A before A cleans up the conflict, B will end up with its own Situation 1 (its file.txt disappears with file-A.txt and file-B.txt in its place). And the above two scenarios can arise for B.
How to clean up the conflict? If A cleans up before B merges as follows:
Alternatively, A could delete file-A.txt when cleaning up the conflict, but probably not file-B.txt. Consider:
I think using the replica ID as described is easier for humans to understand and recover from than moving conflicts somewhere else. And using just the replica ID in the name avoids creating unnecessary extra files with each merge, compared to using "Actor + Clock", which would be confusing and could create large numbers of files over time.
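A minimal sketch of the renaming rule I have in mind (the function is made up, just to pin down the behaviour):

```rust
// When two siblings collide on a name during a merge, rename each one to
// append its replica id before the extension.
fn conflict_name(name: &str, replica: &str) -> String {
    match name.rfind('.') {
        Some(dot) if dot > 0 => format!("{}-{}{}", &name[..dot], replica, &name[dot..]),
        _ => format!("{}-{}", name, replica),
    }
}

fn main() {
    // Replica A merges replica B: both independently created file.txt.
    assert_eq!(conflict_name("file.txt", "A"), "file-A.txt");
    assert_eq!(conflict_name("file.txt", "B"), "file-B.txt");
    assert_eq!(conflict_name("notes", "A"), "notes-A");
    println!("conflict renaming behaves as described");
}
```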
I have a question! The replica log allows the structure of the tree to be recovered as it was at any point in the history, but does this include some, all or no metadata? I've been assuming this is all available, but am not sure.
All metadata.
LogMove is defined as (timestamp, parent_id, metadata, child_id, oldp), where oldp is (parent_id, metadata) and represents the previous state of child_id before this operation was applied, or None if it didn't exist.
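As a Rust sketch of that definition (the generics and field names are illustrative, not the exact crdt-tree types):

```rust
#[derive(Debug)]
struct LogMove<Time, Id, Meta> {
    timestamp: Time,
    parent_id: Id,
    metadata: Meta,
    child_id: Id,
    // previous (parent_id, metadata) of child_id before this op was applied,
    // or None if the child did not exist; this is what lets an op be undone
    // when concurrent moves must be re-applied in timestamp order.
    oldp: Option<(Id, Meta)>,
}

fn main() {
    // e.g. moving child 7 (previously under parent 1) to parent 2
    let entry = LogMove {
        timestamp: 1u64,
        parent_id: 2u64,
        metadata: "homework.txt",
        child_id: 7u64,
        oldp: Some((1u64, "homework.txt")),
    };
    println!("{:?}", entry);
}
```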
still thinking about the rest.
@happybeing I did some basic experiments with filename conflicts, and made a writeup.
I wouldn't say the matter is settled, but perhaps I've shed a bit more light on it.
Oh that's tricky indeed, naturally I hadn't thought of that. I think you have a good way to proceed and that we should make a note that this is an area it might be good to improve, or maybe not…
I think that in situations where the same app is being used in different replicas, and where it generates names automatically, merge errors could create a lot of pain for users; in other situations it will rarely happen and be easier for them to handle when it does (e.g. where users have named things manually).
However, I don't think any CRDT scheme can solve the former and it will instead require adaptation in the apps or in workflow by the users, so ultimately your current plan may be more than good enough.
Very nicely worked through @danda.