From: Adam J. Richter (adam@netcom.com)
Date: 04/15/93


From: adam@netcom.com (Adam J. Richter)
Subject: compressed log structured filesystems
Date: Fri, 16 Apr 1993 04:17:02 GMT

In article <lsnb9uINNalq@priddy.cs.utexas.edu> danielsi@cs.utexas.edu (Daniel Aaron Supernaw-Issen) writes:
>
>about paging and random update:
>
>it seems to me that the solution is simple. don't compress the file,
>compress on the block level. You KNOW that when uncompressing a block how
>big it will be. You have to read in the whole block anyways the issue of how
>big it is on disk is orthogonal to everything else. prepend the block with
>a byte identifying what compression alg (if any was used) and all is taken
>care of. I'm currently working on implementing exactly this on a lfs. In
>this case, a number of different compression algorithms are used and the one
>with the best compression (including no compression) wins. the only real
>change to the surrounding lfs code was that the head position had to be
>byte indexed rather than block indexed. no biggie.

        I am also working, on and off, on a compressed log structured
filesystem for Linux. I'm busy with a lot of other things (like
making those Linux CDROM's that everyone uses), and it's slow going.
For example, just figuring out when you're out of disk space is
a tough problem on compressed log structured filesystem, and there
are some basic kernel modifications that I need to make to, for
example, cache file data that doesn't correspond to a particular disk
block. However, I thought that I would write this posting in the
hopes that even if I don't get this filesystem written, other people
doing Linux filesystems might be able to use these ideas.

        For the benefit of those of you who have not heard of log
structured filesystems, I will explain the basic idea. When a
traditional filesystem updates a particular block of a file, it moves
the disk head and overwrites the disk block corresponding to that
particular block of the file (allocating such a block if it doesn't
exist). On a log structured filesystem, the writes are recorded in a
contiguous log on the disk. So, rather than moving the disk head to
the old location of the corresponding disk block, the file block is
appended the end of the log. Once that block is written, the block
that points to it and the map of free blocks are updated to reflect
the fact that the file block has moved. These blocks later written to
the logs, followed by the blocks that keep track of them. Eventually
this chain of events terminates with a checkpoint block being written
to one of a small number of checkpoint regions (for example, there may
be two checkpoint regions so that the previous checkpoint is not
overwritten). There aren't that many extra blocks that have to be
written, because many of the blocks will belong to the same file, so
they will share the same indirect blocks and other bookkeeping blocks.
Although a log structured filesystem actually writes more data to the
disk, it makes up for this, because the disk head doesn't have to do
any seeks. By writing in a sequential log, a log structured
filesystem can theoretically do sustained random writes at close to
the maximum physical bandwidth of the disk.

        There are a number of ways in which a lot structured
filesystem can be implemented. One way is to treat the whole disk as
one gigantic circular log. The file system that Daniel is talking
about works that way. On such a system, one can guarantee a certain
amount of free space for the log by simply rewriting any live data
blocks that are within that amount of space from the front of the log.
The problem with this approach is that a disk is generally going to be
about 75% full (otherwise the user would have bought a smaller disk),
so about 3 blocks will be rewritten for every new block that is
written. This problem can be reduced substantially if the filesystem
simply skips the blocks that are in use, but, it takes just as long
for the disk head to skip over a single block as it does to write one.
In order to increase bandwidth, it would be necessary to skip whole
tracks, and whether even that is enough is debatable.

        A more popular way to organize a log structured filesystem
is to divide it into segments. The log is written into an empty
segment. When the system is running low on empty segments, it reads
the live data from a few live segments into one of the remaining empty
segments ("cleaning"). In this way, seldomly changed data is stuffed
into segments that rarely have to be cleaned.

        Proponents of log structured file systems argue that as RAM
size grows, read hit ratios on disk caches will increase toward 100%,
while a steady trickle of disk writes will continue to be necessary,
since dirty blocks are written to the disk every 30 seconds.
According to this reasoning, the relative performance of a filesystem
that maximizes write bandwidth should improve in comparison to a
traditional filesystem as RAM sizes grow.

        Log structured filesystems suffer from blocks that need to be
read together being dispersed all over the disk. In comparison, the
4.3 BSD filesystem tries to put the blocks of a all of the files in a
directory in the same part of the disk. When files that are read
together are also written together, then a log structured filesystem
has reasonable locality (e.g., a bunch of .o files are written during
a make and are then linked into an executable), otherwise a log
structured filesystem can do some other things to improve locality.
Firstly, a log structured filesystem can write all of the dirty blocks
for a file together. A log structured file system may also be able
to get similar locality by selecting how it rearranges blocks when it
does segment cleaning (or some other type of garbage collection if it
isn't a segmented LFS). The time that it takes to run the segment
cleaner is the other thing that brings the performance of log
structured filesystems down to the level of traditional filesystems.
On the other hand, if the workload of a system is primarily reads, it
can run the segment cleaner at night.

        Whether log structured filesystems are a win in practice is a
popular subject of debate. Current benchmarks suggest that log
structured filesystems perform somewhat better that traditional
filesystems in some areas and somewhat worse than others. Since a log
structured filesystem needs a segment cleaner, a fairer comparison
would probably be between a log structured filesystem and its cleaner
versus a traditional filesystem and a defragmenting program. My
primary interest in a log structured filesystem is not for improving
performance, but, rather, for doing a compressed filesystem.

        The advantage of using a log structured filesystem for
compression is that the data that is written to the log can be
written to byte boundaries rather than disk block boundaries.
Squeezing out every byte from compression is not the only benefit
of this approach. You can also adopt the convention of truncating
every block at its last non-zero byte. On a traditional filesystem,
you waste about half of a block per file, because the last block
will be, on average, only half full. If your average file size is
10kB, then your losing about 5% due to this internal fragmentation.
If you're running a news server, in which case, you'll have article
sizes of around 2kB, then you're losing 25% to internal fragmentation.
Currently, files larger than about 12kB need to use an indirect block
which is generally going to have a few non-zero entries followed by
mostly zeroes. By using a log structured filesystem you can save
space by making more efficient use of compression and by trimming
trailing zeros from each block. In addition, compression of metadata
such as indirect blocks and directory entries is possible (but not
required), random reads and random writes are fast, and data loss due
to media failure does not effect the decompression of other blocks
of same files that were corrupted by the media failure.

        Other possible features that can be added to log structured
filesystems include:

             1. RAID (Rapid Array of Inexpensive Disks). Given n
                identical disks, break each segments into n-1 parts.
                Distribute these n-1 parts to the same regions of each
                of the first n-1 drives. The last drive will hold the
                XOR of the n-1 parts of the segment, so that the
                contents of any drive can be reconstructed in the
                event of a drive failure. More complex arrangements
                of drives and parity sectors are possible. The
                problem with this scheme is that, for writes, all of
                the disk arms need to be used together, so the
                segments have to be fairly large for the system not to
                spend most of its time seeking.

             2. REPLICATING SEGMENTS. It may be advantageous to
                replicate frequently read but rarely modified segments
                to different parts of the disk to reduce seek times.
        
             3. more flexibility for ENCRYPTION. Some encryption
                algorithms change the size of their data. It's easier
                to support such algorithms if your block size is
                variable. Admittedly, this is a pretty contrived advantage.

        There are also other advantages of using a filesystem that
reallocates blocks instead of overwriting them, such as a log
structured filesystem:

             1. CRASH PROOFING in the sense that the filesystem will be
                in a consistent state when the system reboots, provided
                that a crash will leave any block that was being written
                to disk either completely written, completely unwritten,
                or in a detectable error state.

             2. FILESYSTEM FORKING: the ability to instantly clone the
                filesystem into basically two parallel universes
                that use additional disk sectors on a copy-on-write basis
                only for the things that they change (useful for taking a
                snapshot to do a disk backup).

             3. TIME TRAVEL (basically this involves holding onto a
                snapshot of the filesystem at every consistency point
                for the period that you want to be able to visit).

             4. easily adapted to WRITE-ONCE MEDIA

             5. DIGITAL SIGNATURES. Add a new field in the data
                structure that points to a disk block: the digital
                signature of the contents of that block. When you write
                a new block, you update not only the address of that block,
                but also the system's signature of that block. Any
                attempt to write to alter filesystem without the
                system's digital signature can be detected (perhaps
                the signature would have to be provided at mount
                time). The one exception is that an adversary could
                overwrite the disk with a snapshot of the disk at some
                previous time. It may also be possible to extend this
                scheme to have digital signatures for each user, but I
                haven't worked all of the details out. The problem is
                that the system has to be able to relocate a user's
                files without the user's digital signature without
                giving an adversary the ability to restore another
                user's file to a previous state. If these problems
                can be worked out, it would make it possible to
                require every user to sign his or her setuid files,
                for example.

             6. OBSCURING DELETED DATA. Similar to #5, but instead of
                signing each block, a random key is generated every
                time a new block is to be written. The new block is
                encrypted with that key, and the key is stored with
                the pointer the block. The pointers to the indirect
                blocks form a hierarchical encryption scheme, so that
                as soon as an old checkpoint is overwritten, the key
                to decrypt any data that is not pointed to by a
                current checkpoint is lost.

        It is worth noting that it would not be too difficult to adapt
a traditional filesystem like ext2fs to reallocate blocks instead of
overwriting them. The only changes in the disk format would be to use
data structures that could be updated by block reallocation for things
like the inode table and the free block table. The easiest way to
make these changes is to treat the inode table and the free block
table as regular files.

        Before anybody rushes off to change all of the Linux
filesystems to do reallocate on write, it is worth noting that
reallocating blocks is not without cost. Reallocating blocks requires
additional disk writes to update the block that pointed to the block
that has been updated, to update the list of free blocks, and to do
similar bookkeeping for these other blocks which had to be written.

        The only things in this posting that are actually my ideas are
the suggestions for trimming trailing zeros, encryption, replicated
segments, digital signatures, and obscuring deleted data. If you want
to learn about log structured filesystems, here are some references.

Ousterhout, John and Fred Douglis _Beating the I/O Bottleneck:
        A Case for Log-Structured File Systems_, Report UCB/CSD 88/467,
        October 1988

Burrows, Michael, Charles Jerian, Buttler Lampson, Timothy Mann
        _On-line Data Compression in a Log-structured File System_
        Report #85, April 15, 1992, Digital Systems Research Center

Rosenblum, Mendel, _The Design and Implementation of a Log-structured
        File System_, UCB/CSD 92/696, June 1992

Seltzer, Margo, Keith Bostic, Marshall Kirk McKusick and Carl Staelin
        _An Implementation of a Log-Structured File System for UNIX_
        January 1993 USENIX conference proceedings, pp. 307ff

-- 
Adam J. Richter                             Yggdrasil Computing, Incorporated
409 Evelyn Ave., Apt. 312, Albany CA 94706  PO Box 8418, Berkeley CA 94707-8418
(510) 528-3209                              (510) 526-7531, fax: (510) 528-8508
adam@netcom.com                             yggdrasil@netcom.com
Another member of the League for Programming Freedom (lpf@uunet.uu.net).