From: Josh Yelon <jyelon@cs.uiuc.edu> Subject: 'pklite' for Linux. Date: Mon, 16 Mar 1992 00:30:27 GMT
I considering writing binary file compressor for Linux, much like
pklite for msdos. The salient characteristic will be that you
can compress a program, and you won't have to decompress it
in order to run it. (It would decompress itself automatically
into RAM). I need ideas.
First of all, how do I pull it off?
Here is plan 1:
1. start the to-be-compressed executable using ptrace.
2. read out all of its memory into a file.
3. compress the file.
4. prepend a tiny in-memory decompression routine to the file,
5. prepend the proper headers for an executable program, and presto!
The disadvantages:
1. I like shared pages. However, a text segment that gets
decompressed at runtime isn't 'read-only', so that does
it in for shared pages.
2. Shared libraries would probably get linked in as soon as
ptrace ran the thing, so the libraries would become a part
of the compressed file. Yuck!
3. Thousands of copies of this decompression routine, one in
every executable, like a virus. Gross. Plus, the kernel
is constantly loading the same decompression routine from
disk. Wasteful.
The advantages:
1. It's pretty easy, considering that I already have the
code for dumping a running process into an executable file.
2. Nobody has to be aware of it. A compressed program would
run just like any other.
Here's another plan:
1. Edit the kernel code for exec, to look for a new 'compressed
executable' magic number.
2. Have it then decompress/load the file into ram, and
3. Have it then proceed as if it had just loaded a non-demand-paged
executable.
The advantages.
1. Its very clean.
2. The decompression routine is in the kernel, so it only
needs to be loaded once.
3. It sounds more reliable than the above approach.
The disadvantages.
1. The kernel is getting bigger (although not much,
assuming that the decompression routine is small).
2. You have to have the decompressing-kernel to run
compressed programs.
3. Again, compressed files are not demand-paged. (I, personally,
could care less - I tend to think that demand-paging slows
things down as much as it speeds things up).
Whaddaya think?
Also, about compression algorithms:
I have considered using 16-bit Huffman codes, but I'm only getting
a 70% compression ratio. Lempel-Ziv is okay (that's what compress
uses), but it takes gobs of ram to use it. Besides, I prefer a
static scheme- one where the decompression tables are fixed. (after
all, everything being compressed will be stylized gnu C 386 code...
why use an adaptive algorithm when there is nothing to adapt to?)
On that note, I was also considering inventing a specialized
algorithm for compressing 386 opcodes (it would still produce correct
output on other stuff, it just wouldn't be able to compress much).
Anybody know anything concrete about the feasibility of this?
PS: In order to do this I would need sample data - namely, assembly
listings containing BOTH opcodes and mnemonics, so I can pore them
over and see what kinds of regular patterns the gnu C compiler
typically produces. I cannot obtain this, since I do not have
access to a 386 right now. Would anybody be able to mail me a 100K
listing from the assembler?
y
Thanks in advance 4 ideas!
- Josh
PS: this might help alleviate some of the root-disk woes!