From: Linus Benedict Torvalds (torvalds@klaava.Helsinki.FI)
Date: 08/20/92


From: torvalds@klaava.Helsinki.FI (Linus Benedict Torvalds)
Subject: Re: A question about Kernel system call mechanism
Date: 20 Aug 1992 12:20:51 GMT

In article <1992Aug19.174117.21233@ramsey.cs.laurentian.ca> ron@ramsey.cs.laurentian.ca (Ron Prediger [Velociraptor]) writes:
>I am relatively new to Linux and have been examining the kernel source.
>
>1) Does anyone know how linux passes parameters from the user process to the
>kernel service routine ? Below is what I think is happening and where I
>am confused.
>
>It appears that system calls are handled using interrupt or trap gates
>resident in the Interrupt descriptor table (IDT). From reading the Intel
>386 ref. manual I understand that a stack switch occurs automatically when
>a less privileged process accesses a gate for a more privileged subroutine.

Correct so far...

>What I can't see is how the kernel service routine gets the system call
>parameters (ie. addresses, etc) from the user process. Is there code
>somewhere which copies these parameters from the original (level 3) stack to
>the more privileged (level 0) stack ? If linux had used call gates to
>implement system calls, the parameters would automatically be copied to the
>privileged routine's stack by the 386. (This automatic
>copy of parameters does not occur when referencing interrupt/trap gates.)

I didn't like system call gates: they are too complicated for my taste
(besides, you have to know how many arguments to copy, or have a
specific system call gate for each type of argument: maybe not a bad
idea, but...). Anyway, things are easier than you make them out to be:
the arguments are simply passed in the normal registers.

Passing arguments in the registers allows you 6 (32-bit) direct
arguments (not counting %eax, which is used to tell which system call
you want handled), and more if you simply set up a pointer to a block in
user space. And the beauty of it all is that they are automatically put
on the stack as arguments to the system calls when the registers are
saved - see the file linux/kernel/sys_call.S, which saves all the
necessary state information. It's the simplest and fastest way I could
find: linux doesn't even save the state in some special task-structure
like other unices seem to do, but just leaves the regs on the stack,
ready for popping when the process returns from the interrupt.
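
A minimal sketch of the register-argument idea, written as an ordinary
user-space C simulation and not as kernel code. All names here
(demo_regs, demo_arg_block, demo_sys_add3, demo_sys_sum) are
hypothetical, and the register slots are widened to intptr_t so the
example also builds on 64-bit hosts. It shows a handler taking three
direct arguments, and one that receives a single pointer to a larger
argument block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the six argument registers as saved on the
   kernel stack. On a real 386 each slot is 32 bits wide. */
struct demo_regs {
    intptr_t ebx, ecx, edx, esi, edi, ebp;
};

/* Hypothetical block for calls needing more values than registers. */
struct demo_arg_block {
    long count;
    long values[8];
};

/* A handler with three direct arguments. */
long demo_sys_add3(long a, long b, long c)
{
    return a + b + c;
}

/* A handler with one argument: a pointer to a block. The real kernel
   would have to read such a block through %fs, since it lives in user
   space; this simulation just dereferences it. */
long demo_sys_sum(const struct demo_arg_block *blk)
{
    long total = 0;

    assert(blk != NULL);
    for (long i = 0; i < blk->count && i < 8; i++)
        total += blk->values[i];
    return total;
}

/* "User side": load three values into the register slots, then let the
   handler see them as its parameters. */
long demo_direct(long a, long b, long c)
{
    struct demo_regs r = { a, b, c, 0, 0, 0 };

    return demo_sys_add3((long)r.ebx, (long)r.ecx, (long)r.edx);
}

/* "User side": pass eight values through a single register. */
long demo_indirect(void)
{
    struct demo_arg_block blk = { 8, { 1, 2, 3, 4, 5, 6, 7, 8 } };
    struct demo_regs r = { (intptr_t)&blk, 0, 0, 0, 0, 0 };

    return demo_sys_sum((const struct demo_arg_block *)r.ebx);
}
```

In the simulation the copy from register slots to parameters is
explicit; in the scheme described above no copy is needed, because the
saved registers already sit where a C function expects its arguments.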

>2) It appears that Linux is making use of segment registers (FS,GS) and the
>LDT/GDT to transfer the actual data (ie. from a read system call) between
>user and kernel address spaces. Is this observation correct ?

Actually, only %fs is used: it points to the user-space segment when in
a system call. Thus linux never needs to check any bounds when copying
from/to user space: it's automatically handled by the hardware. The
get_fs_XXX() and put_fs_XXX() (XXX=byte, word, long) inline functions
can be used to transfer bytes from/to user space, and memcpy_tofs() or
memcpy_fromfs() can be used to move bigger blocks between kernel and
user segments.
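
The effect of going through %fs can be sketched in portable C by
modelling a segment as a base plus a size and doing the limit check in
software. This is only a simulation under stated assumptions: the sim_
names are hypothetical stand-ins and do not have the signatures of the
real get_fs_XXX(), put_fs_XXX() or memcpy_fromfs(), and on real hardware
an out-of-range access raises a fault instead of returning -1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a segment: where it starts and how big it is. */
struct sim_segment {
    unsigned char *base;
    size_t size;
};

/* Stands in for the %fs register while "inside a system call". */
struct sim_segment sim_fs;

/* Read one byte at an offset in the user segment, or -1 where the
   hardware would fault. */
int sim_get_fs_byte(size_t off)
{
    assert(sim_fs.base != NULL);
    if (off >= sim_fs.size)
        return -1;
    return sim_fs.base[off];
}

/* Write one byte at an offset in the user segment. */
int sim_put_fs_byte(unsigned char val, size_t off)
{
    assert(sim_fs.base != NULL);
    if (off >= sim_fs.size)
        return -1;
    sim_fs.base[off] = val;
    return 0;
}

/* Copy n bytes from the user segment into a kernel buffer. */
int sim_memcpy_fromfs(void *to, size_t from_off, size_t n)
{
    assert(sim_fs.base != NULL);
    if (n > sim_fs.size || from_off > sim_fs.size - n)
        return -1;
    memcpy(to, sim_fs.base + from_off, n);
    return 0;
}

/* Demo: write a byte into a 16-byte "user space" and read it back. */
int sim_roundtrip(size_t off, unsigned char val)
{
    static unsigned char user_memory[16];

    sim_fs.base = user_memory;
    sim_fs.size = sizeof user_memory;
    if (sim_put_fs_byte(val, off) < 0)
        return -1;
    return sim_get_fs_byte(off);
}

/* Demo: block copy from the same 16-byte "user space". */
int sim_copy_in(size_t off, size_t n)
{
    static unsigned char user_memory[16];
    unsigned char kernel_buf[16];

    sim_fs.base = user_memory;
    sim_fs.size = sizeof user_memory;
    return sim_memcpy_fromfs(kernel_buf, off, n);
}
```

The explicit comparisons in the sim_ functions are exactly the work the
kernel avoids: with a real segment register the limit check comes for
free on every access.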

What happens at a system call is roughly:

user space:
 - load the arguments into registers (%eax contains the system call
   index, %ebx... contain the parameters)
 - do an "int $0x80", moving to kernel mode:

kernel space:
 - clear the direction-flag, as gcc assumes this
 - save the system call number: a negative number means the interrupt
   was caused by a hardware IRQ or trap.
 - save all the segment registers
 - save %eax (which happens to be the same number we saved earlier if
   this is a normal system call)
 - save the other registers: they automatically form the stack frame for
   the system call.
 - call the appropriate system call handler by indexing the appropriate
   table with %eax.

The handler does its stuff - it /can/ change the stack frame if it
wants to, and thus return information in any registers it wants to, but
that is really discouraged, and all system calls currently just return
their result in %eax as part of their normal return.

 - check if there were any signals, and change the return stack (both in
   kernel and user space) appropriately if so, invoking the signal
   handler instead of returning directly.
 - pop all the saved registers, and do an iret, returning to user mode.
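
The steps above can be sketched as a small user-space simulation in C.
Everything named sim_ is hypothetical, the two handlers are invented for
illustration, and signal checking and segment handling are left out. The
out-of-range case returning a negative error value is an assumption of
this sketch, not something stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical saved-register frame, reduced to what the sketch needs. */
struct sim_frame {
    long ebx, ecx, edx;  /* first three arguments */
    long eax;            /* call number on entry, result on exit */
    long orig_eax;       /* saved call number; negative for IRQ/trap */
};

typedef long (*sim_handler)(long, long, long);

/* Two invented handlers. */
long sim_sys_add(long a, long b, long c)
{
    return a + b + c;
}

long sim_sys_max(long a, long b, long c)
{
    long m = a > b ? a : b;

    return m > c ? m : c;
}

/* The table indexed by %eax. */
const sim_handler sim_call_table[] = { sim_sys_add, sim_sys_max };

#define SIM_NR_CALLS (sizeof sim_call_table / sizeof sim_call_table[0])
#define SIM_EBADCALL 38  /* assumed error value for a bad call number */

/* "Kernel side": save the call number, index the table, and leave the
   result in the eax slot of the frame. */
void sim_system_call(struct sim_frame *f)
{
    assert(f != NULL);
    f->orig_eax = f->eax;
    if (f->eax < 0 || (size_t)f->eax >= SIM_NR_CALLS) {
        f->eax = -SIM_EBADCALL;
        return;
    }
    f->eax = sim_call_table[f->eax](f->ebx, f->ecx, f->edx);
}

/* "User side": load the registers, trap, and read the result back. */
long sim_int80(long nr, long a, long b, long c)
{
    struct sim_frame f = { a, b, c, nr, 0 };

    sim_system_call(&f);
    return f.eax;
}
```

For example, sim_int80(0, 1, 2, 3) selects sim_sys_add and comes back
with 6 in the eax slot, which is the same path a result takes when the
saved registers are popped before the iret.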

While the system call runs, the %ds and %es registers point to kernel
data space, and %fs points to user space. But the system calls may
change %fs for their own needs: for example symbolic links result in
changing %fs to kernel space for a while as the name is parsed directly
from the kernel buffers instead of from user space etc.

Note that normal faults/traps and IRQs do essentially exactly the same,
except for "fast" IRQs, which just save a minimal amount of information
and don't do the signal checking (used by things like the serial
handlers). Also, they naturally haven't got any "system call number",
but have their own routine that is called after the stack is set up.

As to the GDT: the GDT contains just two normal segment entries: GDT[1]
is the kernel code segment descriptor, and GDT[2] is the kernel data
descriptor. The rest of the global descriptor table is filled with TSS
and LDT descriptors. The local descriptor tables normally contain just
the user-space code/data descriptors in LDT[1] and LDT[2], but it's
flexible enough to be extended if something wants to have more segments
in user space (I think the xenix emulator uses this, although I haven't
looked at the code yet).
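
As a worked example of the layout above, a 386 segment selector packs
the table index into bits 3-15, a table indicator (0 = GDT, 1 = LDT)
into bit 2 and the requested privilege level into bits 0-1. The helper
below is a hypothetical sketch, assuming kernel selectors use privilege
level 0 and user selectors level 3.

```c
#include <assert.h>
#include <stdint.h>

enum { SEL_GDT = 0, SEL_LDT = 1 };

/* Build a 386 segment selector from its three fields. */
uint16_t make_selector(unsigned index, unsigned table, unsigned rpl)
{
    assert(index < 8192 && table <= 1 && rpl <= 3);
    return (uint16_t)((index << 3) | (table << 2) | rpl);
}
```

With these assumptions GDT[1] and GDT[2] give the kernel selectors 0x08
and 0x10, and LDT[1] and LDT[2] give the user selectors 0x0f and 0x17.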

                Linus