mirror of https://github.com/nestriness/cdc-file-transfer.git synced 2026-03-17 03:33:09 +02:00


chrschng 76bbdb01bb Merge dynamic manifest updates to Github (#7)

This change introduces dynamic manifest updates to asset streaming.

Asset streaming describes the directory to be streamed in a manifest, which is a proto definition of all content metadata. This information is sufficient to answer `stat` and `readdir` calls in the FUSE layer without additional round-trips to the workstation.

When a directory is streamed for the first time, the corresponding manifest is created in two steps:
1. The directory is traversed recursively and the inode information of all contained files and directories is written to the manifest.
2. The content of all identified files is processed to generate each file's chunk list. This list is part of the definition of a file in the manifest.
   * The chunk boundaries are identified using our implementation of the FastCDC algorithm.
   * The hash of each chunk is calculated using the BLAKE3 hash function.
   * The length and hash of each chunk are appended to the file's chunk list.
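
Step 2 can be sketched as follows. This is a minimal illustration, not the code used in this repository: it finds boundaries with a simplified gear-based rolling hash rather than the actual FastCDC implementation, and it hashes chunks with BLAKE2b from the Python standard library because BLAKE3 is not available there. The names `MIN_SIZE`, `MAX_SIZE`, `MASK`, `chunk_boundaries` and `chunk_list` are hypothetical.

```python
import hashlib
from typing import List, Tuple

# Hypothetical parameters, chosen for illustration only.
MIN_SIZE = 64          # never cut a chunk shorter than this
MAX_SIZE = 1024        # always cut once a chunk reaches this size
MASK = (1 << 7) - 1    # cut when the low 7 bits of the rolling hash are zero

# Deterministic 256-entry table of 64-bit values for the gear hash.
GEAR = [
    int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "little")
    for i in range(256)
]


def chunk_boundaries(data: bytes) -> List[int]:
    """Return the end offset of every chunk in data."""
    ends = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Gear rolling hash: shift out old bytes, mix in the new one.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if size < MIN_SIZE:
            continue
        if (h & MASK) == 0 or size >= MAX_SIZE:
            ends.append(i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        ends.append(len(data))  # trailing chunk, may be shorter than MIN_SIZE
    return ends


def chunk_list(data: bytes) -> List[Tuple[int, str]]:
    """Return (length, hash) for every chunk, in file order."""
    chunks = []
    start = 0
    for end in chunk_boundaries(data):
        piece = data[start:end]
        # BLAKE2b stands in for BLAKE3 in this sketch.
        digest = hashlib.blake2b(piece, digest_size=32).hexdigest()
        chunks.append((len(piece), digest))
        start = end
    return chunks
```

Because the cut points depend on the content rather than on fixed offsets, an edit in the middle of a file tends to change only the chunks around it, which is what makes the chunk list useful for diffing.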

Prior to this change, when the user mounted a workstation directory on a client, the asset streaming server pushed an intermediate manifest to the client (gamelet) as soon as step 1 was completed. At this point, the FUSE client started serving the virtual file system and was ready to answer `stat` and `readdir` calls. If the FUSE client received a call that required file contents, such as `read`, it blocked the caller until the server completed step 2 above and pushed the final manifest to the client. This worked well for large directories (> 100 GB) with a reasonable number of files (< 100k). But when dealing with millions of tiny files, creating the full manifest could take several minutes.

With this change, we introduce dynamic manifest updates. When the FUSE layer receives an `open` or `readdir` request for a file or directory that is incomplete, it sends an RPC to the workstation stating which information is missing from the manifest. The workstation identifies the corresponding file chunker or directory scanner tasks and moves them to the front of the queue. As soon as the task is completed, the workstation pushes an updated intermediate manifest to the client, which now includes the information needed to serve the FUSE request. The queued FUSE request is resumed and returns the result to the caller.
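
The reprioritization step can be pictured with a small sketch. It assumes a single queue of pending tasks keyed by path; the `TaskQueue` class and its methods are hypothetical and do not mirror the actual implementation in asset_stream_manager.

```python
from collections import deque
from typing import Deque, Iterable, List, Optional


class TaskQueue:
    """Pending chunker/scanner tasks, identified by path (hypothetical)."""

    def __init__(self, paths: Iterable[str]):
        self._pending: Deque[str] = deque(paths)
        self.done: List[str] = []

    def prioritize(self, path: str) -> bool:
        """Move the task for path to the front.

        Returns False if no such task is pending, for example because it
        has already been completed.
        """
        try:
            self._pending.remove(path)
        except ValueError:
            return False
        self._pending.appendleft(path)
        return True

    def run_next(self) -> Optional[str]:
        """Run the task at the front of the queue and return its path."""
        if not self._pending:
            return None
        path = self._pending.popleft()
        self.done.append(path)  # stand-in for chunking or scanning
        return path
```

In this picture, the RPC from the FUSE layer would call `prioritize` with the missing path, and the updated intermediate manifest would be pushed once `run_next` has processed that task.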

This does not reduce the time required to build the final manifest, but it splits the work into smaller tasks. This allows us to interrupt the current work and prioritize the tasks that are required to handle an incoming request from the client. Serving such a request still takes a round-trip to the workstation plus the processing time for the task, but an updated manifest is received within a few seconds, which is much better than blocking for several minutes.

This latency is only visible when serving data while the manifest is still being created. The situation improves as the manifest creation on the workstation progresses. As soon as the final manifest is pushed, all metadata can be served directly without having to wait for pending tasks.

2022-11-16 11:20:32 +01:00

| Name | Last commit | Date |
| --- | --- | --- |
| absl_helper | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| asset_stream_manager | Merge dynamic manifest updates to Github (#7) | 2022-11-16 11:20:32 +01:00 |
| cdc_fuse_fs | Merge dynamic manifest updates to Github (#7) | 2022-11-16 11:20:32 +01:00 |
| cdc_indexer | Improve cdc_fuse_fs and path (#2) | 2022-11-15 12:53:02 +01:00 |
| cdc_rsync | Remove GGP dependencies from CDC RSync (#1) | 2022-11-15 12:48:09 +01:00 |
| cdc_rsync_server | Remove GGP dependencies from CDC RSync (#1) | 2022-11-15 12:48:09 +01:00 |
| common | Merge dynamic manifest updates to Github (#7) | 2022-11-16 11:20:32 +01:00 |
| data_store | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| docs | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| fastcdc | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| manifest | Merge dynamic manifest updates to Github (#7) | 2022-11-16 11:20:32 +01:00 |
| metrics | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| proto | Merge dynamic manifest updates to Github (#7) | 2022-11-16 11:20:32 +01:00 |
| tests_asset_streaming_30 | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| tests_cdc_rsync | Remove GGP dependencies from CDC RSync (#1) | 2022-11-15 12:48:09 +01:00 |
| tests_common | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| third_party | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| tools | Remove GGP dependencies from CDC RSync (#1) | 2022-11-15 12:48:09 +01:00 |
| .bazelrc | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| .clang-format | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| .gitignore | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| .gitmodules | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| all_files.vcxitems | Remove GGP dependencies from CDC RSync (#1) | 2022-11-15 12:48:09 +01:00 |
| all_files.vcxitems.user | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| file_transfer.sln | Remove GGP dependencies from CDC RSync (#1) | 2022-11-15 12:48:09 +01:00 |
| LICENSE | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| manifest.natvis | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| NMakeBazelProject.targets | Remove GGP dependencies from CDC RSync (#1) | 2022-11-15 12:48:09 +01:00 |
| protobuf.natvis | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| README.md | Remove GGP dependencies from CDC RSync (#1) | 2022-11-15 12:48:09 +01:00 |
| rm_bazel_out_dir.bat | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |
| WORKSPACE | Releasing the former Stadia file transfer tools | 2022-11-03 10:39:10 +01:00 |

README.md

CDC File Transfer

This repository contains tools for syncing and streaming files. They are based on Content Defined Chunking (CDC), in particular FastCDC, to split up files into chunks.

CDC RSync

CDC RSync is a tool to sync files from a Windows machine to a Linux device, similar to the standard Linux rsync. It is basically a copy tool, but optimized for the case where there is already an old version of the files available in the target directory.

* It skips files quickly if timestamp and file size match.
* It uses fast compression for all data transfer.
* If a file changed, it determines which parts changed and only transfers the differences.
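
The skip check and the chunk-level diff can be illustrated with a short sketch. It assumes that each side describes a file by its size, its modification time, and a list of (length, hash) chunk entries; `FileInfo`, `can_skip` and `chunks_to_send` are hypothetical names, not part of the cdc_rsync code or its wire protocol.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class FileInfo:
    size: int   # file size in bytes
    mtime: int  # modification time in whole seconds


def can_skip(src: FileInfo, dst: Optional[FileInfo]) -> bool:
    """A file is skipped when it exists remotely with equal size and time."""
    return dst is not None and src.size == dst.size and src.mtime == dst.mtime


def chunks_to_send(
    src_chunks: List[Tuple[int, str]], dst_hashes: Set[str]
) -> List[Tuple[int, str]]:
    """Return the source chunks whose hash the destination does not have."""
    return [chunk for chunk in src_chunks if chunk[1] not in dst_hashes]
```

For a file that fails the skip check, only the chunks returned by `chunks_to_send` would need to cross the network; the remaining chunks can be taken from the old copy on the target.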

The remote diffing algorithm is based on CDC. In our tests, it is up to 30x faster than the one used in rsync (1500 MB/s vs 50 MB/s).

CDC Stream

CDC Stream is a tool to stream files and directories from a Windows machine to a Linux device. Conceptually, it is similar to sshfs, but it is optimized for read speed.

* It caches streamed data on the Linux device.
* If a file is re-read on Linux after it changed on Windows, only the differences are streamed again. The rest is read from cache.
* Stat operations are very fast since the directory metadata (filenames, permissions etc.) is provided in a streaming-friendly way.
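
The caching behavior can be pictured as a content-addressed chunk store. The sketch below is an assumption about how such a cache might look, not a description of the cdc_fuse_fs implementation; `ChunkCache` and its `fetch` callback are hypothetical.

```python
from typing import Callable, Dict, List


class ChunkCache:
    """Stores chunks by id and fetches only those that are missing."""

    def __init__(self, fetch: Callable[[str], bytes]):
        self._fetch = fetch              # stand-in for streaming from Windows
        self._store: Dict[str, bytes] = {}
        self.fetch_count = 0

    def get(self, chunk_id: str) -> bytes:
        if chunk_id not in self._store:
            self._store[chunk_id] = self._fetch(chunk_id)
            self.fetch_count += 1
        return self._store[chunk_id]

    def read_file(self, chunk_ids: List[str]) -> bytes:
        """Assemble a file from its chunk list."""
        return b"".join(self.get(chunk_id) for chunk_id in chunk_ids)
```

When a file changes, its new chunk list shares most chunk ids with the old one, so a re-read triggers a fetch only for the ids that are not yet in the store.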

To efficiently determine which parts of a file changed, the tool uses the same CDC-based diffing algorithm as CDC RSync. Changes to Windows files are almost immediately reflected on Linux, with a delay of roughly (0.5s + 0.7s x total size of changed files in GB).
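
As a worked example of the delay estimate, the formula can be written as a small function. The function name is hypothetical, and interpreting 1 GB as 10^9 bytes is an assumption; the text does not say which definition is meant.

```python
def estimated_delay_seconds(changed_bytes: int) -> float:
    """Rough delay until a change on Windows is visible on Linux.

    Implements 0.5s + 0.7s x (total size of changed files in GB),
    assuming 1 GB = 10**9 bytes.
    """
    changed_gb = changed_bytes / 1e9
    return 0.5 + 0.7 * changed_gb
```

Under this estimate, changing 2 GB worth of files gives roughly 0.5 + 0.7 x 2 = 1.9 seconds.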

The tool does not support writing files back from Linux to Windows; the Linux directory is read-only.

Getting Started

The project has to be built both on Windows and Linux.

Prerequisites

The following steps have to be executed on both Windows and Linux.

Download and install Bazel from https://bazel.build/install.

Clone the repository.

git clone https://github.com/google/cdc-file-transfer

Initialize submodules.

cd cdc-file-transfer
git submodule update --init --recursive

Finally, install an SSH client on the Windows device if not present. The file transfer tools require ssh.exe and scp.exe.

Building

The two tools can be built and used independently.

CDC RSync

Build Linux components

bazel build --config linux --compilation_mode=opt //cdc_rsync_server

Build Windows components

bazel build --config windows --compilation_mode=opt //cdc_rsync

Copy the Linux build output file cdc_rsync_server from bazel-bin/cdc_rsync_server on the Linux system to bazel-bin\cdc_rsync on the Windows machine.

CDC Stream

Build Linux components

bazel build --config linux --compilation_mode=opt //cdc_fuse_fs

Build Windows components

bazel build --config windows --compilation_mode=opt //asset_stream_manager

Copy the Linux build output files cdc_fuse_fs and libfuse.so from bazel-bin/cdc_fuse_fs on the Linux system to bazel-bin\asset_stream_manager on the Windows machine.

Usage

CDC RSync

To copy the contents of the Windows directory C:\path\to\assets to ~/assets on the Linux device linux.machine.com, run

cdc_rsync --ssh-command=C:\path\to\ssh.exe --scp-command=C:\path\to\scp.exe C:\path\to\assets\* user@linux.machine.com:~/assets -vr

Depending on your setup, you may have to specify additional arguments for the ssh and scp commands, including proper quoting, e.g.

cdc_rsync --ssh-command="\"C:\path with space\to\ssh.exe\" -F ssh_config_file -i id_rsa_file -oStrictHostKeyChecking=yes -oUserKnownHostsFile=\"\"\"known_hosts_file\"\"\"" --scp-command="\"C:\path with space\to\scp.exe\" -F ssh_config_file -i id_rsa_file -oStrictHostKeyChecking=yes -oUserKnownHostsFile=\"\"\"known_hosts_file\"\"\"" C:\path\to\assets\* user@linux.machine.com:~/assets -vr

Lengthy ssh/scp commands that rarely change can also be put into environment variables CDC_SSH_COMMAND and CDC_SCP_COMMAND, e.g.

set CDC_SSH_COMMAND="C:\path with space\to\ssh.exe" -F ssh_config_file -i id_rsa_file -oStrictHostKeyChecking=yes -oUserKnownHostsFile="""known_hosts_file"""

set CDC_SCP_COMMAND="C:\path with space\to\scp.exe" -F ssh_config_file -i id_rsa_file -oStrictHostKeyChecking=yes -oUserKnownHostsFile="""known_hosts_file"""

cdc_rsync C:\path\to\assets\* user@linux.machine.com:~/assets -vr