From: Drew Eckhardt (drew@ophelia.cs.colorado.edu)
Date: 07/06/92


From: drew@ophelia.cs.colorado.edu (Drew Eckhardt)
Subject: Re: Better FS's (was Re: New .96c , disappointing... : ( )
Date: Tue, 7 Jul 1992 04:52:49 GMT

In article <1992Jul6.113801.101@uniwa.uwa.edu.au> oreillym@tartarus.uwa.edu.au (Michael O'Reilly) writes:
>oles@wiliki.eng.hawaii.edu (Shawn Oles) writes:
>
>: I don't see an easy way to implement the "decent cache" that
>: you suggested. It sounds like a neat idea and a great
>: way to make use of unused memory, but how much real memory
>: really goes unused? Your strategy would require a cap
>
>On my 8 meg machine, typically 2 megs goes unused (not under X)
>That is, free space never goes below about 1.8 megs. Gcc ocasionally
>takes it to 1.8 meg but that is the mast I have seen it use.
>
>: on the amount of memory uses by malloc, otherwise the
>: file system could grind to a halt! When running X-windows,
>
>Yup. On the other hand, I would rather see X useing the memory than
>swapping...
>
>: gcc, or other memory intensive programs, this cap would
>: would equate to the current fixed size anyway? or did I
>: miss the point. Also, with a 4K block size, how do you
>: plan on handling fragmented blocks? Allocate them an
>: entire 4K block and waste all that precious memory?
>
>Nahh. The whole idea runs like this. Say we have a machine with 8 megs
>of memory. This give us 2000 pages. Lets give us a 4K block filesystem
>for ease of programming.
>
>Then we expand the memory manager and add one routine.
>
>void *get_page(int priority);
>
>This give us a new page. We specify how urgent our need is. runs from 2
>for a new process structure or other similar kernel need, 1 for user
>programs, and 0 for the buffer.
>
>If there isn't actually a page free, then the memory manager starts at
>the bottom and works it way up. It strips pages from the buffer (at
>least, call the buffer asking for some pages to be freed), then if that
>fails, does some swapping on users programs to free up a page.
>
>The curcial point here is that it never asks a pages that were allocated
>with a higher priority that we are asking for. i.e.
>The buffer can only get pages from the buffer or that are free.
>The user can get pages from other user process, the buffer, or free
> pages (in reverse order)
>The kernel can get it from damn near anywhere. ;)

I've been playing with this (meaning it works, is too kludgy and
being changed) on my system at home, and your implementation sugestion
is poor.

Managing buffer cache pages in the same way as the rest of the
pages is a Bad Idea (tm), for several reasons.

1. When you find a page, you will also have to look at the
        buffer structures to see if a buffer in it is locked -
        you can't discard a locked buffer.

2. This breaks the LRU replacement mechanism used for the buffer cache,
        and you stand a good chance of discarding a recently
        accessed buffer, causing it to be reread from disk.

        Currently, Linux does not use LRU for pages. Instead,
        it walks through the user process page tables,
        and pages the first one it sees to disk.

I changed swap_out() to call shrink_buffers() before swapping
pages, where shrink_buffers() uses an LRU algorithm as it should.

This would be easier if the BLOCK_SIZE was 4096 bytes, same
as page size.

Otherwise, you free the four least recently
used buffers, and consolodate the remaining buffers from
those pages into pages with empty slots.

The memcpy() of 3k is fairly cheap..

There are a few parameters that control buffer allocation :
1. minimum, maximum number of buffers
2. prefer_new - if num_free_pages <= prefer_new,
        then the oldest buffer is replaced, rather
        than allocating a new one from VM.
        
        Currently, these are set by a modified buffer_init()
        routine called from main.c in the startup code,
        but I plan to add a syscall to set these.

We should look at implementing a decent LRU algorithm for linux
paging - although I'm no VM expert

There are two bits in the page table / directory entries that
are for "user" use, a counter can be placed in there.

When you swap_out(), and can't shrink buffer cache, you should
probably update a timeout this field based on the accessed
bit, in all present directory entries, decrementing by 1
each time accessed is cleared, incrememnting when accessed
is set, clearing accessed.

I don't know how efficient it would be to extend this to
the individual page table entries, but you should definately
get something, because pages in idle tasks would be swapped
before pages in present tasks.
  
Also, it might be nice to swap (swap != page) tasks that are
idle > X seconds, like BSD does. This would free up
core for buffers.

>yes, you have to put a cap on it. My kernel currently allocates 2 megs
>for a file buffer. compiling under X makes my machine use about 1.2 megs
>of swap.
>so a 1/2 meg cap should see no swap being used..... and still give you a
>reasonable size buffer.
>
>This is nice and simple to implement, easy to understand and should help
>clean up a lot of kludges that currently exist in the kernel (Lots of
>structures are of fixed size. Why? becasue the kernel has no way to
>allocate memory for itself on the fly)

This is incorrect. The kernel malloc() function can allocate
memory in units of < 1 page. When the kernel free() function
is called, free pages are returned to the free page pool.

And, if you don't use kernel malloc(), you CAN use get_page.

Unfortunately, a malloc/free strategy doesn't work "too well"
for the kernel because as long as ONE thing is in a page,
that page stays allocated - you have no way to consolodate
things into pages so others can be freed.

That's one of the things I'm changing in
my buffer cache code before making it public - the dynamic allocation of
buffer_head structures. I'm changing it so that
"not full" pages get consolodated together when
possible. Since the buffer_head structures are stored
in doubly linked lists (both for the free list, and in
the chains used for the hash table collision chains),
this is fairly easy - if a buffer is not locked, you
can change a couple pointers, and move it to a new location.

>re framented blocks:
>The bitmap of free disk space is stored with a 512 byte granuality, so
>files smaller than 4K get a fragment (Need an extra 3 bit in the inode
>to implement this. ;). Yes, you do read the entire block into memory,
>all 4K of it, even if you only want 1/2 a K. This make nillo difference
>to the performance, and , given that the filesystem should strive from
>some locality, is just the same as read-ahead.
>

This is a good idea - you get the advantages of page granularity
block size (ie, simpler buffer cache management, larger reads from
the block device and less overhead), without the penalty in
disk usage.

>: I know that BSD uses virtual memory in a very elegant way
>: to implement a variable 8K cache. This eliminates the
>: memory wastage problem that will occur if a lot
>: of fragmented buffers are being accessed. Maybe this
>
>I think you have missed something about fragmented buffers. ONLY the
>last block in a file is ever fragmented. This means that fragmented
>buffers aren't something you will need to wrry about unduely. Some
>figures wouldn't go astray here but alas, I haven't got them.

>: is the kind of buffer you were talking about? If
>: so, it doesn't seem that easy to implement, but then again
>
>It isn't TOO bad. I should have it done by the end of the week (started
>it today.) I will HAVE to finish by end of week cos I am back at uni
>next week. ;)
>
>: I am no VM expert.
>You think I am?? ;) ;) ;)
>
>: Please tell us more about what it is you will have finished?
>: or at least an idea on how you want to extend the filesystem.
>:
>: -shawn oles@wiliki.eng.hawaii