



           ____ ___________        1  File Compression



                                   **********
                                   * lzcomp *
                                   **********



        NAME:   lzcomp -- File Compression

        SYNOPSIS:

                lzcomp [-options] [infile [outfile]]

        DESCRIPTION:

                lzcomp  implements  the  Lempel-Ziv   file   compression
                algorithm.  (Files compressed by lzcomp are uncompressed
                by lzdcmp.) It operates by finding common substrings and
                replaces  them  with  a  variable-size  code.   This  is
                deterministic, and can be done with a single  pass  over
                the  file.   Thus,  the decompression procedure needs no
                input table, but can track the way the table was built.

                Options may be given in either case.

                -B      Input file  is  "binary",  not  "human  readable
                        text".   This  is  necessary  on  Dec  operating
                        systems, such as VMS  and  RSX-11M,  that  treat
                        these  files  differently.   (Note  that  binary
                        support is rudamentary and probably insufficient
                        as  yet.)  (On  VMS  version  4, this is ignored
                        unless the -x option is specified or  the  input
                        file is record-oriented.)

                -M bits Write using the specified number of bits in  the
                        code  -- necessary for big machines making files
                        for   little   machines.    For   example,    if
                        compressing a file on VMS which is to be read on
                        a PDP-11, you should select -M 12.

                -V [n]  Verbose if specified.  If a value is  specified,
                        it  will enable debugging code (if compiled in).
                        The values are bit-encoded as follows:
                          -V  0 Don't print anything
                          -V  1 Print a summary of the operation
                          -V  2 Print a running status on occassion
                          -V  4 Print some debug information.
                          -V  8 Print too much debug information
                          -V 16 Dump the compressed files

                                                                          Page 2
        lzcomp  File Compression


                -S      Perform  a  transformation  on  the  input  file
                        whereby   each   byte  is  subtracted  from  its
                        predecessor.  I.e., the difference between bytes
                        is  compressed,  rather than their actual value.
                        This improves the performance of  the  algorithm
                        on  certain  kinds  of  data files.  However, it
                        will worsen the compression of most other  kinds
                        of data.  Note that files compressed with the -S
                        option cannot be  decompressed  on  versions  of
                        Unix  compress  (or  lzdcmp) that do not support
                        this option.

                        The -S option  is  useful  for  digitized  video
                        images,  where each pixel element is represented
                        as a single byte.  It may also  be  useful  when
                        compressing digitized encoding of similar analog
                        data, where each sample is contained in a single
                        byte.

                -X [n]  "Export" -- write a file format that can be read
                        by  other  operating systems.  Only the bytes in
                        the file are copied;  file  attributes  are  not
                        preserved.   If  specified, the value determines
                        the level of compatiblity.  If not specified, or
                        specified  with  an  explicit value of zero, and
                        lzcomp is running on  Vax/VMS  version  4  under
                        VaxC  and  the  input  file is a disk or magtape
                        file  (block-oriented),  a  VMS-private   output
                        format  is  used  which is incompatible with the
                        Unix compress utility, but which  preserves  VMS
                        file  attributes.   -X may take on the following
                        values:

                         0  Choose VMS private format.  See restrictions
                            below.
                         1  Compatible with Unix compress  version  3.0,
                            except  that an end code is written that may
                            not be compatible with Unix compress.   This
                            is  the  default  if  -x  is given without a
                            value.  This is the correct mode to use when
                            transferring  files between VMS and RSX-11M,
                            RSTS/E, or RT11, or when transferring  files
                            to LZDCMP running on Unix.
                         2  As above, but do not  write  the  extra  end
                            code.    This   will   cause  problems  when
                            decompressing on RSX-11M and RT11 systems.
                         3  As  in  (2)  above,  but   suppress   "block
                            compression".
                         4  As  in  (3)  above,  but  do  not  output  a
                            compress   header   block.    This   is  for
                            compatiblity with a quite early  version  of
                            Unix       compress       (and      requires
                            conditional-compilation to use).

                                                                          Page 3
        lzcomp  File Compression


                        Note that the  -B  (binary)  option  is  ignored
                        unless the input file is "record-oriented", such
                        as a terminal or mailbox.

                The  other  two  arguments  are  the  input  and  output
                filenames   respectively.    Redirection  is  supported,
                however, the output must be a disk/tape file.

                The file format is almost identical to the current  Unix
                implementation  of  compress  (V4.0).   Files written by
                Unix compress  should  be  readable  by  lzdcmp.   Files
                written by lzcomp in export (-x) format will be readable
                by Unix compress (except that lzcomp outputs two "clear"
                codes  to  mark  EOF.   A  patch  to  Unix  compress  is
                available.)

        VMS COMMAND LANGUAGE INTERFACE:

                In addition  to  the  above  (Unix-style)  command  line
                interface, lzcomp supports a VMS command line interface.
                The following options are available:

                /BITS=<value>

                /EXPORT=(VMS, UNIX, BLOCK, HEADER, ENDMARKER)

                /MODE=(DELTA)

                /SHOW=(ALL, PROGRESS, STATISTICS, FDL,
                                DEBUG, DEBUG_SERIOUS, DEBUG_IO)

        VMS RESTRICTIONS:

                VMS Private mode stores the true name and attributes  of
                the  input  file  into  the  compressed  file and lzdcmp
                restores the attributes  (and  filename  if  requested).
                The  following  restrictions apply -- they may be lifted
                in the future as they are primarily due to the  author's
                lack of understanding of the intricacies of of VMS I/O:

                    All files must be stored on disk.
                    The lzcomp output file must be specified directly.

                Also, for all usage on VMS, the compressed file must  be
                written to, and read from disk.

                The following  file  attributes  are  not  preserved  by
                lzcomp:

                    File ownership and protection codes.
                    Date of creation, access, and backup.
                    Access control lists.

        RSX-11M RESTRICTIONS:

                                                                          Page 4
        lzcomp  File Compression


                lzcomp cannot determine the file attributes and may  not
                correctly read certain specialized file formats, such as
                "print image".  If a binary  file  is  compressed,  note
                that  it will be decompressed as "fixed-block, 512 byte"
                records.

        LZW COMPRESSION ALGORITHM:

                This section is abstracted from  Terry  Welch's  article
                referenced   below.    The  algorithm  builds  a  string
                translation table that maps substrings in the input into
                fixed-length  codes.   The  compress  algorithm  may  be
                described as follows:

                  1. Initialize table to contain single-character
                     strings.
                  2. Read the first character.  Set <w> (the prefix
                     string) to that character.
                  3. (step): Read next input character, K.
                  4. If at end of file, output code(<w>); exit.
                  5. If <w>K is in the string table:
                        Set <w> to <w>K; goto step 3.
                  6. Else <w>K is not in the string table.
                        Output code(<w>);
                        Put <w>K into the string table;
                        Set <w> to K; Goto step 3.

                "At each execution of the basic step an acceptable input
                string <w> has been parsed off.  The next character K is
                read and the extended string <w>K is tested to see if it
                exists  in  the  string table.  If it is there, then the
                extended string becomes the parsed string  <w>  and  the
                step  is  repeated.  If <w>K is not in the string table,
                then it is entered, the code for the successfully parsed
                string  <w> is put out as compressed data, the character
                K becomes the beginning of the next string, and the step
                is repeated."

                The decompression  algorithm  translates  each  received
                code   into  a  prefix  string  and  extension  [suffix]
                character.  The extension  character  is  stored  (in  a
                push-down stack), and the prefix translated again, until
                the  prefix  is  a  single  character,  which  completes
                decompression  of  this  code.   The entire code is then
                output by popping the stack.

                "An update to the string table is  made  for  each  code
                received  (except  the first one).  When a code has been
                translated, its final character is used as the extension
                character,  combined with the prior string, to add a new
                string to the string table.  This new string is assigned
                a  unique  code  value,  which is the same code that the
                compressor assigned to that string.  In  this  way,  the
                decompressor  incrementally reconstructs the same string

                                                                          Page 5
        lzcomp  File Compression


                table that the decompressor used....  Unfortunately  ...
                [the algorithm] does not work for an abnormal case.

                The abnormal case occurs  whenever  an  input  character
                string  contains  the  sequence  K<w>K<w>K,  where  K<w>
                already appears in the compressor string table."

                The decompression algorithm,  augmented  to  handle  the
                abnormal case, is as follows:

                  1. Read first input code;
                     Store in CODE and OLDcode;
                     With CODE = code(K), output(K);  FINchar = K;
                  2. Read next code to CODE; INcode = CODE;
                     If at end of file, exit;
                  3. If CODE not in string table (special case) then
                        Output(FINchar);
                        CODE = OLDcode;
                        INcode = code(OLDcode, FINchar);

                  4. If CODE == code(<w>K) then
                        Push K onto the stack;
                        CODE == code(<w>);
                        Goto 4.

                  5. If CODE == code(K) then
                        Output K;
                        FINchar = K;

                  6. While stack not empty
                        Output top of stack;
                        Pop stack;

                  7. Put OLDcode,K into the string table.
                     OLDcode = INcode;
                     Goto 2.

                The  algorithm  as  implemented  here   introduces   two
                additional complications.

                The actual codes are transmitted using a variable-length
                encoding.  The lowest-level routines increase the number
                of bits in the code when the largest  possible  code  is
                transmitted.

                Periodically, the algorithm checks that  compression  is
                still increasing.  If the ratio of input bytes to output
                bytes decreases, the entire process is reset.  This  can
                happen if the characteristics of the input file change.

        VMS PRIVATE FILE STRUCTURE:

                In VMS Private mode, the compressed data file contains a
                variable-length  (but  compressed)  file header with the

                                                                          Page 6
        lzcomp  File Compression


                file "attributes" needed  by  the  operating  system  to
                construct  the  file.   This  allows  the  decompression
                program to recreate the file  in  its  original  format,
                which is essential if ISAM databases are compressed.

                The overall file format is as follows:

                LZ_SOH  "start of  header"  signal  (this  value  cannot
                        appear in user data).

                        A  variable-length  data  record  (maximum   256
                        bytes)  containing  the header name, followed by
                        whitespace,    followed    by    header-specific
                        information.  In this case, the name record will
                        contain the string "vms$attributes" followed  by
                        the number of bytes in the attribute data block.
                        (I assume that the name record will consist of a
                        facility  name,  such  as  "vms",  followed by a
                        dollar  sign,  followed  by  a   facility-unique
                        word.)

                LZ_EOR  Signals "end of record".

                        This is followed by a VMS file attributes record
                        (generated by a VMS system library
                              routine).

                LZ_EOR  Signals "end of record" (optional).

                        This is followed by another "header record" with
                        additional information.  Currently, this is only
                        used to transmit  the  "longest  record  length"
                        field from the input file.

                        Additional header records may be defined in  the
                        future.

                LZ_ETX  Signals "end of segment".

                ST_STX  Signals "start of text"  (i.e.,  start  of  data
                        file).

                        This is followed by the user data file.

                LZ_ETX  Signals "end of segment"

                LZ_ETX  Two in a row signals "end of file".

                Note that this format can easily be extended to  include
                trailer  records (with file counts and checksums) and/or
                multiple data files in one compressed file.

                Note also that the LZ_CLEAR code may appear  in  headers
                or  data  files  to  cause  the decompression program to

                                                                          Page 7
        lzcomp  File Compression


                "readapt" to the  characteristics  of  the  input  data.
                LZ_STX  and  LZ_SOH  reset  the  compression  algorithm.
                LZ_EOR does not.

        AUTHORS:

                The algorithm is from "A Technique for High  Performance
                Data  Compression."  Terry A.  Welch.  IEEE Computer Vol
                17, No.  6 (June 1984), pp 8-19.

                This revision is by Martin Minow.

                Unix Compress authors are as follows:

                Spencer W. Thomas
                       (decvax!harpo!utah-cs!utah-gr!thomas)
                Jim McKie               (decvax!mcvax!jim)
                Steve Davies            (decvax!vax135!petsd!peora!srd)
                Ken Turkowski           (decvax!decwrl!turtlevax!ken)
                James A. Woods          (decvax!ihnp4!ames!jaw)
                Joe Orost               (decvax!vax135!petsd!joe)


