DATA AND METADATA COHERENCE
Some modern cluster file systems provide perfect cache coherence among their clients. Perfect cache coherence among disparate NFS clients is expensive to achieve, especially on wide area networks. As such, NFS settles for weaker cache coherence that satisfies the requirements of most file sharing types.
Close-to-open cache consistency
Typically file sharing is completely sequential. First client A opens a file, writes something to it, then closes it. Then client B opens the same file, and reads the changes.
When an application opens a file stored on an NFS version 3 server, the NFS client checks that the file exists on the server and that the opener is permitted to access it by sending a GETATTR or ACCESS request. The NFS client sends these requests regardless of the freshness of the file's cached attributes.
When the application closes the file, the NFS client writes back any pending changes to the file so that the next opener can view the changes. This also gives the NFS client an opportunity to report write errors to the application via the return code from close(2).
The behavior of checking at open time and flushing at close time is referred to as close-to-open cache consistency, or CTO. It can be disabled for an entire mount point using the nocto mount option.
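As a minimal sketch of the writer's side of close-to-open consistency, the Python function below writes data and lets an error from close(2) propagate to the caller instead of ignoring it. The function name write_and_close is hypothetical; os.open, os.write, and os.close are the standard library wrappers around the corresponding system calls.

```python
import os

def write_and_close(path: str, data: bytes) -> None:
    """Write data to path and surface errors reported at close time."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
    except BaseException:
        os.close(fd)
        raise
    # On an NFS mount, close(2) writes back pending changes. An OSError
    # raised here means the data may not have reached the server.
    os.close(fd)
```

A caller that wraps write_and_close in try/except OSError can then retry or report the failure before another client opens the file.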
Weak cache consistency
There are still opportunities for a client's data cache to contain stale data. The NFS version 3 protocol introduced "weak cache consistency" (also known as WCC) which provides a way of efficiently checking a file's attributes before and after a single request. This helps a client identify changes that could have been made by other clients.
When a client is using many concurrent operations that update the same file at the same time (for example, during asynchronous write behind), it is still difficult to tell whether it was that client's updates or some other client's updates that altered the file.
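The check that WCC makes possible can be illustrated with a toy model. This is a simplified sketch, not the Linux client's implementation: the Attrs class and the cache_still_valid function are hypothetical names, and the three fields mirror the size, mtime, and ctime carried in NFS version 3 pre-operation attributes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attrs:
    size: int
    mtime: float
    ctime: float

def cache_still_valid(cached: Attrs, pre_op: Attrs) -> bool:
    """True if nothing changed between the client's last view and this request."""
    return cached == pre_op

cached = Attrs(size=10, mtime=100.0, ctime=100.0)

# Reply to this client's own WRITE: the pre-operation attributes match the
# cache, so the only change is the client's own and the cache can be kept.
pre, post = Attrs(10, 100.0, 100.0), Attrs(20, 101.0, 101.0)
if cache_still_valid(cached, pre):
    cached = post

# A later reply whose pre-operation attributes differ from the cache
# suggests that another client modified the file in the meantime.
other_client_changed = not cache_still_valid(cached, Attrs(25, 102.0, 102.0))
```

With several writes in flight at once, replies can arrive out of order, so a mismatch no longer proves that another client was involved.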
Attribute caching
Use the noac mount option to achieve attribute cache coherence among multiple clients. Almost every file system operation checks file attribute information. The client keeps this information cached for a period of time to reduce network and server load. When noac is in effect, a client's file attribute cache is disabled, so each operation that needs to check a file's attributes is forced to go back to the server. This permits a client to see changes to a file very quickly, at the cost of many extra network operations.
Be careful not to confuse the noac option with "no data caching." The noac mount option prevents the client from caching file metadata, but there are still races that may result in data cache incoherence between client and server.
The NFS protocol is not designed to support true cluster file system cache coherence without some type of application serialization. If absolute cache coherence among clients is required, applications should use file locking. Alternatively, applications can also open their files with the O_DIRECT flag to disable data caching entirely.
File timestamp maintenance
NFS servers are responsible for managing file and directory timestamps (atime, ctime, and mtime). When a file is accessed or updated on an NFS server, the file's timestamps are updated just like they would be on a filesystem local to an application.
NFS clients cache file attributes, including timestamps. A file's timestamps are updated on NFS clients when its attributes are retrieved from the NFS server. Thus there may be some delay before timestamp updates on an NFS server appear to applications on NFS clients.
To comply with the POSIX filesystem standard, the Linux NFS client relies on NFS servers to keep a file's mtime and ctime timestamps properly up to date. It does this by flushing local data changes to the server before reporting mtime to applications via system calls such as stat(2).
The Linux client handles atime updates more loosely, however. NFS clients maintain good performance by caching data, but that means that application reads, which normally update atime, are not reflected to the server where a file's atime is actually maintained.
Because of this caching behavior, the Linux NFS client does not support generic atime-related mount options. See mount(8) for details on these options.
In particular, the atime/noatime, diratime/nodiratime, relatime/norelatime, and strictatime/nostrictatime mount options have no effect on NFS mounts.
/proc/mounts may report that the relatime mount option is set on NFS mounts, but in fact the atime semantics are always as described here, and are not like relatime semantics.
Directory entry caching
The Linux NFS client caches the result of all NFS LOOKUP requests. If the requested directory entry exists on the server, the result is referred to as a positive lookup result. If the requested directory entry does not exist on the server (that is, the server returned ENOENT), the result is referred to as a negative lookup result.
To detect when directory entries have been added or removed on the server, the Linux NFS client watches a directory's mtime. If the client detects a change in a directory's mtime, the client drops all cached LOOKUP results for that directory. Since the directory's mtime is a cached attribute, it may take some time before a client notices it has changed. See the descriptions of the acdirmin, acdirmax, and noac mount options for more information about how long a directory's mtime is cached.
Caching directory entries improves the performance of applications that do not share files with applications on other clients. Using cached information about directories can interfere with applications that run concurrently on multiple clients and need to detect the creation or removal of files quickly, however. The lookupcache mount option allows some tuning of directory entry caching behavior.
Before kernel release 2.6.28, the Linux NFS client tracked only positive lookup results. This permitted applications to detect new directory entries created by other clients quickly while still providing some of the performance benefits of caching. If an application depends on the previous lookup caching behavior of the Linux NFS client, you can use lookupcache=positive.
If the client ignores its cache and validates every application lookup request with the server, that client can immediately detect when a new directory entry has been either created or removed by another client. You can specify this behavior using lookupcache=none. The extra NFS requests needed if the client does not cache directory entries can exact a performance penalty. Disabling lookup caching should result in less of a performance penalty than using noac, and has no effect on how the NFS client caches the attributes of files.
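An application that waits for a file created by another client is the typical case affected by lookup caching. The sketch below polls for such a file. The function name wait_for_file and its default timeout and interval are assumptions chosen for illustration. With negative lookup results cached, a check may keep reporting that the file is absent until the client notices the directory's mtime change, so the timeout should allow for the directory attribute cache lifetime.

```python
import os
import time

def wait_for_file(path: str, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll until path exists or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        # Each check triggers a lookup, which the client may answer from
        # its cache unless the mount uses lookupcache=none.
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

On a mount with lookupcache=none or lookupcache=positive, the same loop should notice a new file sooner, at the cost of more requests to the server.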
The sync mount option
The NFS client treats the sync mount option differently than some other file systems (refer to mount(8) for a description of the generic sync and async mount options). If neither sync nor async is specified (or if the async option is specified), the NFS client delays sending application writes to the server until any of these events occur:
Memory pressure forces reclamation of system memory resources.
An application flushes file data explicitly with sync(2), msync(2), or fsync(2).
An application closes a file with close(2).
The file is locked/unlocked via fcntl(2).
In other words, under normal circumstances, data written by an application may not immediately appear on the server that hosts the file.
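An application that needs its data on the server at a known point can flush explicitly instead of waiting for one of the events above. The following is a minimal sketch; the function name write_and_flush is hypothetical.

```python
import os

def write_and_flush(path: str, data: bytes) -> None:
    """Write data and flush it explicitly before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's user-space buffer into the kernel
        os.fsync(f.fileno())  # fsync(2): push the kernel's cached data out
```

Both steps are needed in Python because f.write only fills a user-space buffer; os.fsync acts on data the kernel has already received.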
If the sync option is specified on a mount point, any system call that writes data to files on that mount point causes that data to be flushed to the server before the system call returns control to user space. This provides greater data cache coherence among clients, but at a significant performance cost.
Applications can use the O_SYNC open flag to force application writes to individual files to go to the server immediately without the use of the sync mount option.
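As a sketch of the per-file alternative, the function below opens a single file with O_SYNC so that each write is sent out before os.write returns. The name write_sync is hypothetical, and the example assumes a platform where os.O_SYNC is defined, as it is on Linux.

```python
import os

def write_sync(path: str, data: bytes) -> None:
    """Write data to one file with synchronous write semantics."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:
            # With O_SYNC, each write returns only after the data is flushed.
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)
```

This confines the performance cost to the files that need it, leaving other files on the same mount point with the default delayed-write behavior.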
Using file locks with NFS
The Network Lock Manager protocol is a separate sideband protocol used to manage file locks in NFS version 2 and version 3. To support lock recovery after a client or server reboot, a second sideband protocol -- known as the Network Status Manager protocol -- is also required. In NFS version 4, file locking is supported directly in the main NFS protocol, and the NLM and NSM sideband protocols are not used.
In most cases, NLM and NSM services are started automatically, and no extra configuration is required. Configure all NFS clients with fully-qualified domain names to ensure that NFS servers can find clients to notify them of server reboots.
NLM supports advisory file locks only. To lock NFS files, use fcntl(2) with the F_GETLK and F_SETLK commands. The NFS client converts file locks obtained via flock(2) to advisory locks.
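The sketch below shows one way an application might serialize a read-modify-write cycle with an advisory lock. It uses Python's fcntl.lockf, which wraps fcntl(2) byte-range locking. The function name increment_counter and the idea of a shared counter file are assumptions made for illustration, and every cooperating process must take the same lock for this to work.

```python
import fcntl
import os

def increment_counter(path: str) -> int:
    """Atomically increment a decimal counter stored in a shared file."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Block until an exclusive advisory lock on the whole file is granted.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        try:
            raw = os.pread(fd, 32, 0).decode().strip()
            value = int(raw or "0") + 1
            os.ftruncate(fd, 0)
            os.pwrite(fd, str(value).encode(), 0)
            os.fsync(fd)  # make the update visible before releasing the lock
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
        return value
    finally:
        os.close(fd)
```

Calling increment_counter from several clients against the same file should then yield distinct values, provided the server and mount support locking.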
When mounting servers that do not support the NLM protocol, or when mounting an NFS server through a firewall that blocks the NLM service port, specify the nolock mount option. NLM locking must be disabled with the nolock option when using NFS to mount /var because /var contains files used by the NLM implementation on Linux.
Specifying the nolock option may also be advised to improve the performance of a proprietary application which runs on a single client and uses file locks extensively.
NFS version 4 caching features
The data and metadata caching behavior of NFS version 4 clients is similar to that of earlier versions. However, NFS version 4 adds two features that improve cache behavior: change attributes and file delegation.
The change attribute is a new part of NFS file and directory metadata which tracks data changes. It replaces the use of a file's modification and change time stamps as a way for clients to validate the content of their caches. Unlike time stamps, change attributes are independent of the time stamp resolution on either the server or client.
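A toy model shows why this independence from time stamp resolution matters. The ServerFile class is hypothetical, the one-second mtime resolution is an assumption, and modeling the change attribute as a simple counter is a simplification: the sketch relies only on the value differing after each modification.

```python
from dataclasses import dataclass

@dataclass
class ServerFile:
    data: bytes = b""
    mtime: int = 0   # whole seconds: a coarse time stamp (assumed resolution)
    change: int = 0  # change attribute, modeled here as a counter

    def write(self, data: bytes, now: int) -> None:
        self.data = data
        self.mtime = now
        self.change += 1

f = ServerFile()
f.write(b"v1", now=1000)
seen_mtime, seen_change = f.mtime, f.change  # a client caches the file here
f.write(b"v2", now=1000)                     # second write within the same second

stale_by_mtime = f.mtime != seen_mtime       # False: the update goes unnoticed
stale_by_change = f.change != seen_change    # True: the update is detected
```

A client validating its cache by mtime alone would keep serving b"v1" in this model, while one comparing change attributes would refetch the data.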
A file delegation is a contract between an NFS version 4 client and server that allows the client to treat a file temporarily as if no other client is accessing it. The server promises to notify the client (via a callback request) if another client attempts to access that file. Once a file has been delegated to a client, the client can cache that file's data and metadata aggressively without contacting the server.
File delegations come in two flavors: read and write. A read delegation means that the server notifies the client about any other clients that want to write to the file. A write delegation means that the client gets notified about either read or write accessors.
Servers grant file delegations when a file is opened, and can recall delegations at any time when another client wants access to the file that conflicts with any delegations already granted. Delegations on directories are not supported.