From: Werner Almesberger (almesber@bernina.ethz.ch)
Date: 03/29/93


From: almesber@bernina.ethz.ch (Werner Almesberger)
Subject: Filesystem bugs and (maybe) a solution (was Re: 386bsd, linux: which runs more out of the box?)
Date: Mon, 29 Mar 1993 17:19:18 GMT


[ References removed to start a new thread and so escape [,]
  (trn's kill-thread key). ]

In article <SCT.93Mar25155522@garay.dcs.ed.ac.uk> sct@dcs.ed.ac.uk (Stephen Tweedie) writes:
> The 0.99pl7 kernel introduced a feature where all filenames passed to
> the kernel are copied to kernel space before they are used; this
> hopefully (!) prevents any races due to the kernel trying to access
> filename data which is in swapped user memory.
>
> The double-filename bug was supposed to be due to exactly this kind of
> problem, so we may well have now seen the last of it.

Unfortunately not. Those race conditions are still present, although the
name copying may help to further reduce the rate at which they occur.

There are at least two flavors of them. *_create assumes that the upper VFS
layers ensure uniqueness of the name, which they don't, at least not
reliably. Many others, like *_mknod, make a half-hearted attempt to do it
right, but they fail too. Example: minix_mknod calls minix_find_entry to
check that the name doesn't exist, and later adds the entry with
minix_add_entry. Race condition 1: a different process may add the file at
directory block M while minix_find_entry is waiting for block N > M to be
read in. Race condition 2: minix_add_entry may have to read blocks from disk
while looking for a free slot (we can't even assume the blocks previously
read by minix_find_entry are still in the buffer cache), and a different
process may create a file with the same name during that time.
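
To make race condition 1 concrete, here is a toy user-mode model in Python.
It is only a sketch and not the Minix FS code: find_entry, add_entry and
mknod are hypothetical stand-ins for minix_find_entry, minix_add_entry and
minix_mknod, SLOTS is an assumed block capacity, and each "yield" marks a
point where the kernel would sleep waiting for a directory block.

```python
SLOTS = 3   # assumed number of entries per directory block in this toy model


def find_entry(blocks, name):
    """Scan the directory block by block, sleeping before each read."""
    for block in blocks:
        yield                       # waiting for this block to be read in
        if name in block:
            return True
    return False


def add_entry(blocks, name):
    """Put the name into the first block that has a free slot."""
    for block in blocks:
        yield                       # the block may have to be re-read
        if len(block) < SLOTS:
            block.append(name)
            return True
    return False


def mknod(blocks, name):
    """Look the name up first, then add it: the check-then-add pattern."""
    found = yield from find_entry(blocks, name)
    if found:
        return "EEXIST"
    added = yield from add_entry(blocks, name)
    return "OK" if added else "ENOSPC"


def run(schedule, tasks):
    """Resume tasks in the given order and collect their return values."""
    results = {}
    for i in schedule:
        if i in results:            # task already finished
            continue
        try:
            next(tasks[i])
        except StopIteration as stop:
            results[i] = stop.value
    return results


def demo(schedule):
    blocks = [["a"], ["b", "c"]]    # block 0 has free slots, block 1 too
    tasks = [mknod(blocks, "foo"), mknod(blocks, "foo")]
    return run(schedule, tasks), blocks


# Process 0 has checked block 0 and waits for block 1; process 1 then runs
# to completion and adds "foo" to block 0; process 0 never looks there again.
results, blocks = demo([0, 0, 1, 1, 1, 1, 0, 0])
print(blocks)   # [['a', 'foo', 'foo'], ['b', 'c']]
```

With the sequential schedule [0, 0, 0, 0, 1, 1] the second process finds the
name and gets "EEXIST", so the duplicate only appears when the switch falls
between the lookup of block 0 and the add.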

I have to admit that I didn't trace all the functions involved, but I'm
pretty sure that my analysis is correct at least in principle.

Fortunately, this won't start a new "my FS is better than yours" war,
because all file systems derived from the Minix FS (in alphabetical order:
ext, ext2 and xia) share the same behaviour :-) The other "standard" file
systems (iso, msdos, nfs and proc) don't have this particular problem.

Although those race conditions exist, they don't happen frequently, so it's
quite normal to run Linux under heavy load for months without experiencing
them.

As this example shows, testing a file system simply by running it for a more
or less extended period of time is inadequate. What we really need is a good
automated testing environment that generates reproducible behaviour. A first
step can be test suites like the one I'm using with the MS-DOS FS (test.pl,
part of dosfs.N.tar.Z). Of course, this test is totally useless for detecting
race conditions or even file system corruption. Also, it is incomplete and
there is no way to easily verify its completeness.

A better approach might look like this (everything runs in user mode):

  - get some means to verify consistency of a file system. This could be an
    fsck program or a formal definition that is used by some yet to be
    written program. I'm leaning towards the formal definition, because that
    may lead to more concise descriptions.
  - write specifications (e.g. what happens if argument "foo" is NULL, may
    it sleep, does it consume "inode", etc.) of all involved functions
    (iget, bread, *_create).
  - define sets of file system states and the operations that cause
    transitions between them (or that yield errors).
  - write a program that generates tests from the above definitions.
  - write a library of replacements for iget, etc. that follows the
    specifications and performs the desired operations on a file that
    contains a minimal file system. Add points at which a context switch may
    occur. (Requires a simple threads system.)
  - implement a mechanism that lets you add "hints" to the file system code,
    e.g. "won't context switch here", "will only context switch once", etc.
  - run the file system code on the tests and verify file system consistency
    and that the file system assumes the expected state after each test.
    (E.g. the FS itself is consistent, the files are at their expected
    places and have the correct attributes, there are no referenced inodes
    sitting around, etc.)

    Three levels of race condition verification are possible:
      - sequential validation: no context switches at all.
      - validation for current environment: random or user-provided
        sequences of context switches are performed.
      - analysis of impact of possible future changes: as above, but hints
        promising that no context switch occurs are ignored.
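
The three levels might be sketched as follows, again as a toy model in
Python rather than real kernel code. Everything here is an assumption made
for illustration: tasks are generators, a "yield" is a possible switch
point, NOSWITCH is a hypothetical hint value, and the consistency check
only looks for duplicate names.

```python
import random

NOSWITCH = "noswitch"   # hypothetical hint: "won't context switch here"


def create(directory, name):
    """Toy *_create that relies on a hint between lookup and add."""
    if name in directory:
        return
    yield NOSWITCH
    directory.append(name)


def consistent(directory):
    """Minimal consistency check: no name appears twice."""
    return len(directory) == len(set(directory))


def scripted(order):
    """Turn a user-provided list of indices into a pick function."""
    it = iter(order)
    return lambda runnable: runnable[next(it, 0) % len(runnable)]


def run(tasks, level, pick):
    """Run tasks at level 'sequential' (no switches), 'current' (hints
    honoured) or 'future' (hints ignored)."""
    runnable = list(tasks)
    current = runnable[0]
    while runnable:
        try:
            hint = next(current)
        except StopIteration:
            runnable.remove(current)
            if runnable:
                if level == "sequential":
                    current = runnable[0]
                else:
                    current = pick(runnable)
            continue
        if level == "sequential":
            continue                    # never switch before completion
        if level == "current" and hint == NOSWITCH:
            continue                    # trust the hint, keep running
        current = pick(runnable)        # possible context switch


def check(level, pick):
    directory = []
    run([create(directory, "foo"), create(directory, "foo")], level, pick)
    return consistent(directory)


print(check("sequential", scripted([])))   # True
print(check("current", scripted([1])))     # True
print(check("future", scripted([1])))      # False
```

Passing random.Random(seed).choice as the pick function gives random but
reproducible sequences of context switches instead of a user-provided one.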

One could also introduce random read errors, etc.
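
A user-mode replacement for bread with error injection could look roughly
like this Python sketch. FakeDisk, ReadError, BLOCK_SIZE and the error_rate
parameter are all hypothetical names chosen for the illustration; the only
point is that a seeded generator keeps the injected failures reproducible.

```python
import random

BLOCK_SIZE = 1024   # assumed block size for the toy image


class ReadError(Exception):
    """Raised by the fake bread when a read fails."""


class FakeDisk:
    """Hypothetical stand-in for bread, backed by an in-memory image."""

    def __init__(self, image, error_rate=0.0, seed=0):
        self.image = image
        self.error_rate = error_rate
        self.rng = random.Random(seed)   # seeded: failures are reproducible

    def bread(self, block_no):
        # Inject a read error with the configured probability.
        if self.rng.random() < self.error_rate:
            raise ReadError(f"injected read error on block {block_no}")
        start = block_no * BLOCK_SIZE
        if start >= len(self.image):
            raise ReadError(f"block {block_no} is past the end of the image")
        return self.image[start:start + BLOCK_SIZE]


disk = FakeDisk(bytes(2 * BLOCK_SIZE), error_rate=0.1, seed=42)
for block_no in range(2):
    try:
        data = disk.bread(block_no)
    except ReadError as err:
        print("file system code must cope with:", err)
```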

Running a file system over an extended period under these conditions should
be a good way to detect most race conditions. In simple cases, it might even
be feasible to test _all_ possible combinations of race conditions.
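
For such a simple case, testing all combinations can be sketched in a few
lines of Python. This is a toy model under stated assumptions: tasks are
generators, each "yield" is a possible context switch, and create and
make_tasks are hypothetical helpers. Because a generator cannot be copied,
every schedule prefix is replayed from a fresh state.

```python
def create(directory, name):
    """Toy *_create: look the name up, possibly sleep, then add it."""
    if name in directory:
        return
    yield                       # possible context switch
    directory.append(name)


def make_tasks():
    """Build a fresh directory and two processes creating the same name."""
    directory = []
    return directory, [create(directory, "foo"), create(directory, "foo")]


def replay(schedule):
    """Run fresh tasks under a schedule; return the state and live tasks."""
    directory, tasks = make_tasks()
    alive = set(range(len(tasks)))
    for i in schedule:
        try:
            next(tasks[i])
        except StopIteration:
            alive.discard(i)
    return directory, alive


def all_schedules(prefix=()):
    """Depth-first enumeration of every complete interleaving."""
    directory, alive = replay(prefix)
    if not alive:
        yield prefix, directory
        return
    for i in sorted(alive):
        yield from all_schedules(prefix + (i,))


outcomes = list(all_schedules())
bad = [s for s, d in outcomes if len(d) != len(set(d))]
print(len(outcomes), "schedules,", len(bad), "leave a duplicate name")
# prints "6 schedules, 4 leave a duplicate name"
```

The number of schedules grows very quickly with the number of switch
points, which is why this only seems feasible in simple cases.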

Comments?

- Werner

-- 
   _________________________________________________________________________
  / Werner Almesberger, ETH Zuerich, CH           almesber@bernina.ethz.ch /
 /_IFW_A44__Tel._+41_1_254_7213___________________________________________/