A KERMIT PROTOCOL EXTENSION FOR INTERNATIONAL CHARACTER SETS

                              Christine Gianone
                 Manager, Kermit Development and Distribution
             Columbia University Center for Computing Activities
                            612 West 115th Street
                           New York, NY 10025, USA

                                DRAFT NUMBER 3
                                 JULY 7, 1989

ABSTRACT

A two-level extension to the presentation layer of the Kermit file transfer
protocol is proposed to allow transfer of non-English-language text files
between unlike computers.  Level 1 allows substitution of single character
sets other than ASCII in Kermit's normal text-file transfer syntax.  Level 2
specifies a new transfer syntax in which multiple character sets may be used,
along with mechanisms for switching among them as defined in ISO Standard
2022.

This is still a DRAFT proposal.  Readers with knowledge of real-world
multi-alphabet applications and file formats are urged to comment on the
suitability of this proposal.  It is assumed the reader is familiar with the
Kermit file transfer protocol.


SUMMARY OF CHANGES SINCE DRAFT #2, March 30, 1989

 - Separation of extension into Levels 1 and 2.
 - Additional file attributes for preannouncement of character sets.
 - Criteria for selection of character sets.
 - Handling of unknown character sets.
 - Handling of "illegal" characters in data.
 - Preliminary specification of user-loadable translation tables.
 - Avoidance of cryptic ISO terminology in Kermit commands.


ACKNOWLEDGEMENTS

Many thanks to these people for their helpful and constructive comments on the
first two drafts.  In most cases, their suggestions or the information they
provided have been incorporated into the third draft.

  John Chandler (Harvard/Smithsonian Center for Astrophysics, USA)
  Alan Curtis (University of London, UK)
  Frank da Cruz (Columbia University, USA)
  Joe Doupnik (Utah State University, USA)
  Hirofumi Fujii (Japan National Laboratory of High Energy Physics, Tokyo)
  John Klensin (Massachusetts Institute of Technology, USA)
  Ken-ichiro Murakami (Nippon Telephone and Telegraph Research Labs, Tokyo)
  Vladimir Novikov (VNIIPAS, Moscow, USSR)
  Jacob Palme (Stockholm University, Sweden)
  Andre Pirard (University of Liege, Belgium)
  Paul Placeway (Ohio State University, USA)
  Gisbert W. Selke (University of Bonn, West Germany)
  Fridrik Skulason (University of Iceland, Reykjavik)
  Johan van Wingen (Leiden, Netherlands)
  Konstantin Vinogradov (ICSTI, Moscow, USSR)
  Amanda Walker (InterCon Systems Corp, USA)

Thanks also to the following people for organizing meetings or conferences
in their countries at which the issues of this proposal were discussed:

  Kohichi Nishimoto (Nihon DEC, Tokyo, Japan)
  Juri Gornostaev and A. Butrimenko (ICSTI, Moscow, USSR)

and thanks also to those who attended these gatherings!


STATEMENT OF THE PROBLEM

Kermit has always been able to transfer text files between unlike systems
(e.g. a UNIX system with ASCII stream text files and an IBM mainframe with
EBCDIC record-oriented text files).  To do the text file code conversion,
Kermit transfers text in ASCII.  But ASCII only includes enough letters and
symbols for English.

There are now computers capable of representing the characters of other
languages: Roman letters with diacritical marks, Cyrillic letters, Hebrew,
Arabic, and Greek characters, Japanese and Chinese ideograms.  But different
computer manufacturers use different codes for these characters.

For example, the IBM PS/2 and the Apple Macintosh have character sets that are
"8-bit ASCII".  When the character value is 32-126, the character is
(normally) a standard ASCII graphic (printable) character.  When the value is
128 or higher, it is a special character.  But the PS/2 and the Macintosh
assign different special characters to these values.  Here are just a few
examples:

   Value     PS/2 Character      Macintosh Character
    138       Small e grave       Small a umlaut
    143       Capital A ring      Small e grave
    144       Capital E acute     Small e circumflex
    136       Small e circumflex  Small a grave

When a file contains "8-bit ASCII", Kermit presently transfers it without any
character translation.  Therefore, a text file written in French, German,
Italian, or Norwegian transferred between a PS/2 and a Macintosh will contain
the wrong characters when it arrives at its destination: the PS/2's e-grave
becomes a-umlaut on the Macintosh, etc.
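
The mismatch described above can be reproduced with a short Python sketch.
It is an illustration only, not part of the proposal.  It assumes that the
PS/2 is using Code Page 437 and that Python's built-in "cp437" and "mac_roman"
codecs match the vendors' tables; describe() is a hypothetical helper name.

```python
import unicodedata

def describe(value: int, encoding: str) -> str:
    """Return the Unicode name of one byte value under a vendor code."""
    return unicodedata.name(bytes([value]).decode(encoding))

# The same four byte values, interpreted by each machine.
for value in (138, 143, 144, 136):
    print(value, "|", describe(value, "cp437"), "|",
          describe(value, "mac_roman"))
```

Each line of output shows one stored value and the two different characters
that the two machines would display for it.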

The problem is compounded when a file is composed of characters from more than
one character set, for example a Japanese text file that contains Kanji,
Katakana, and Roman characters.

There are many computer vendors in the world and nobody controls what codes
they use to represent characters.  Without a standard protocol for
transferring non-ASCII text, each computer would have to know the codes of all
the other computers in order for correct transfer of non-English text files to
occur between unlike systems.


NORMAL KERMIT FILE TRANSFER SYNTAX

The Kermit file transfer protocol makes a distinction between text and binary
files.  Binary files are transmitted with no translation or conversion.  For
text files, Kermit defines a standard transfer syntax, namely ASCII characters
with carriage return and linefeed (CRLF) after each line, so
that text may be stored in useful fashion on any computer to which it is
transferred.  Each Kermit program knows how to translate from the local
text-file storage conventions to ASCII/CRLF syntax, and vice versa.  This is
the basic, required, and default mode of operation for any Kermit program, and
it will be referred to as Kermit's "Normal" or "Level 0" syntax.
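
As a rough illustration of the Level 0 conversion just described, the
following Python sketch converts two kinds of local file to ASCII/CRLF and
back.  The function names are hypothetical.  The choice of Python's "cp037"
codec as the EBCDIC variant, and the stripping of trailing blanks from
fixed-length records, are assumptions made for the example and are not
specified by this proposal.

```python
def unix_to_level0(local_text: bytes) -> bytes:
    """UNIX stream file (LF after each line) -> ASCII with CRLF."""
    lines = local_text.decode("ascii").split("\n")
    if lines and lines[-1] == "":
        lines.pop()              # the final LF ends a line, it does not start one
    return b"".join(line.encode("ascii") + b"\r\n" for line in lines)

def ebcdic_records_to_level0(records: list[bytes]) -> bytes:
    """Record-oriented EBCDIC file -> ASCII with CRLF after each record."""
    out = []
    for record in records:
        text = record.decode("cp037").rstrip(" ")   # assumed blank padding
        out.append(text.encode("ascii") + b"\r\n")
    return b"".join(out)

def level0_to_unix(transfer_text: bytes) -> bytes:
    """ASCII/CRLF transfer syntax -> UNIX stream file."""
    return transfer_text.replace(b"\r\n", b"\n")
```

Both senders produce the same bytes for the same text, so the receiver needs
to know only the common syntax and its own local conventions.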

EXPANDED KERMIT TRANSFER SYNTAX

This proposal adds two levels of transfer syntax, Levels 1 and 2.  Level 1
permits the use of a single character set other than ASCII in the transfer
syntax.  These additional character sets are taken from recognized national or
international standards, such as ISO 8859-1 (Latin Alphabet 1), JIS X 0208
(Japanese), etc.

By using a standard character set (other than ASCII), it is possible to
transfer a file containing more than one language.  For example Latin Alphabet
1 can represent a file containing a mixture of Italian, Norwegian, French,
German, English, and Icelandic.

Level 2 allows a mixture of character sets to transfer mixed-language text
that requires characters from more than one standard character set, for
example a document written in Russian, French, and Greek.

The additional levels are optional features for Kermit programs, except that
Level 2 should not be provided without Level 1.

The additional overhead incurred by a Kermit program running in text mode at
any level can be avoided when transferring files between two computers that
use the same codes and formats.  Simply use the command SET FILE TYPE BINARY
to disable all translations and reformatting.

The following discussion applies to text-file transfer only.  When the Kermit
user has selected binary file transfer, none of the text-file conversions
discussed here apply.


EXPANDED SYNTAX, LEVEL 1

When all the characters in a text file can be represented by a single
character set, then that character set can be used in place of ASCII in
Kermit's text file transfer syntax.

As with ASCII, there must be a mapping between the local file character set
and the character set of the common transfer syntax.  That is, there must be a
pair of translation tables in the program, one from local to common, and one
from common to local.  Since this mode of operation is not Kermit's normal
behavior, it must be selected by the user.  The new Kermit commands are:

  SET FILE CHARACTER-SET <file-character-set-name>
  SET TRANSFER-SYNTAX CHARACTER-SET <transfer-character-set-name>
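
The pair of translation tables mentioned above might be built as in the
following Python sketch, which takes Code Page 437 as the file character set
and ISO 8859-1 as the transfer character set.  The names build_table,
LOCAL_TO_COMMON, and COMMON_TO_LOCAL are hypothetical, and replacing
untranslatable characters with a question mark is an assumption made here
for illustration, not a rule taken from this proposal.

```python
def build_table(source: str, target: str, substitute: int = 0x3F) -> bytes:
    """Build a 256-entry table mapping each source byte to a target byte."""
    table = bytearray(256)
    for code in range(256):
        character = bytes([code]).decode(source)
        try:
            table[code] = character.encode(target)[0]
        except UnicodeEncodeError:
            table[code] = substitute     # no equivalent in the target set
    return bytes(table)

# One table for each direction, as the text requires.
LOCAL_TO_COMMON = build_table("cp437", "latin_1")
COMMON_TO_LOCAL = build_table("latin_1", "cp437")

def translate(data: bytes, table: bytes) -> bytes:
    """Apply a translation table to a buffer of file or packet data."""
    return data.translate(table)
```

A sender would apply LOCAL_TO_COMMON to file data before building packets,
and a receiver would apply COMMON_TO_LOCAL before writing the file.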

The file character set is a system-dependent item.  Some computers have only
one character set, in which case the SET FILE CHARACTER-SET command would be
unnecessary.  But other computers allow the use of different character sets,
often without any way to identify a file's encoding.  For example, the IBM PC
family running MS-DOS 3.3 or later supports a variety of "code pages" and
allows users to switch among them, as described in Chapter 9, "Code Page
Switching", of the IBM DOS 3.3 manual.  Thus, on any given PC, a file may be
encoded using Code Page 437 (USA), Code Page 850 (Multilingual), Code Page 860
(Portugal), Code Page 865 (Norway), etc.  If you have set your Code Page to
437, you may display a file created using Code Page 865 on your screen but the
wrong characters are likely to appear.

Therefore, Kermit for the IBM PC family will require the SET FILE
CHARACTER-SET command, with operands to denote the code page such as CP437,
CP850, etc.  The default character set would be the PC's original set, CP437.
Those who use other sets can avoid keying in a SET FILE CHARACTER-SET command
every time Kermit is started, by including this command in the program's
initialization file.

Similar remarks apply to European computers that use the "national replacement
characters" allowed by ISO Standard 646.  This standard specifies a 7-bit
character set equivalent to ASCII, but with national variants in which certain
non-alphanumeric ASCII graphic characters are replaced by "national
characters", as shown in Table 1.

_____________________________________________________________________________

Column/Row   ASCII          German         Finnish   Norwegian     French

  04/00      at-sign        section       at-sign   at-sign       a-grave
  05/11      left-bracket   A-umlaut      A-umlaut  AE-diphthong  degree
  05/12      backslash      O-umlaut      O-umlaut  O-slash       c-cedilla
  05/13      right-bracket  U-umlaut      A-circle  A-circle      section  
  06/00      accent-grave   accent-grave  e-acute   accent-grave  accent-grave
  07/11      left-brace     a-umlaut      a-umlaut  ae-diphthong  e-acute
  07/12      vertical-bar   o-umlaut      o-umlaut  o-slash       u-grave
  07/13      right-brace    u-umlaut      a-circle  a-circle      e-grave
  07/14      tilde          ess-zet       u-umlaut  tilde         umlaut

           Table 1: ISO 646 Usage in Selected Countries
_____________________________________________________________________________

(See Figure 1 for an explanation of column/row notation.)

For example, the German phrase "Gr<u-umlaut><ess-zet> aus K<o-umlaut>ln" would
be rendered in ASCII as "Gr}~ aus K|ln", and the ASCII C-language phrase
"{~a[x]}" would become "<a-umlaut><ess-zet>a<A-umlaut>x<U-umlaut><u-umlaut>"
in German ISO 646.  The German user would want Kermit to interpret the local
file characters as German in the former case, and as ASCII in the latter.
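
The two readings of the same local file can be sketched in Python as follows.
The table is copied from the German column of Table 1; the function names are
hypothetical, and Unicode escapes stand in for the German letters so that the
example itself stays in ASCII.

```python
# German ISO 646 replacements for ASCII graphic characters (Table 1).
GERMAN_646 = {
    "@": "\u00a7",   # 04/00 section sign
    "[": "\u00c4",   # 05/11 A-umlaut
    "\\": "\u00d6",  # 05/12 O-umlaut
    "]": "\u00dc",   # 05/13 U-umlaut
    "{": "\u00e4",   # 07/11 a-umlaut
    "|": "\u00f6",   # 07/12 o-umlaut
    "}": "\u00fc",   # 07/13 u-umlaut
    "~": "\u00df",   # 07/14 ess-zet
}
TO_GERMAN = str.maketrans(GERMAN_646)

def read_as_german(local: bytes) -> str:
    """Interpret 7-bit file data as German ISO 646."""
    return local.decode("ascii").translate(TO_GERMAN)

def read_as_ascii(local: bytes) -> str:
    """Interpret the same 7-bit file data as plain ASCII."""
    return local.decode("ascii")
```

Nothing in the file data says which reading is intended, which is why the
user must tell Kermit the file character set.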

SPECIFYING THE TRANSFER SYNTAX

To select Level 1, the user must type the command

  SET TRANSFER-SYNTAX CHARACTER-SET <name>

where <name> is the name of a standard character set.  To minimize the work of
the programmer, the consternation of the user, and the memory requirements for
the Kermit program itself, the number of character sets which Kermit uses for
Level 1 transfer syntax should be kept to a minimum.  As a starting point, the
sets shown in Table 2 are recommended.  The criteria for including a character
set in this table are:

1. ASCII (= ISO-646 International Reference Version, IRV) is included.

2. Except for ASCII, each set should be either (a) the "right half" of an
   8-bit single-byte set whose "left half" is the same as ASCII/ISO-646-IRV,
   or (b) a multi-byte set.

3. Each character in the set should be self-contained, and not formed as
   a composite of other characters.

4. The set must be listed in the ISO International Register of Character 
   Sets, so that it has a unique registration number and designating escape
   sequence.  (But provisions are made for other registration authorities.)

5. The set must be a national or international standard graphic character 
   set, intended for use in computer text processing or programming (as
   opposed to Videotex, Teletex, OCR, device control, and other applications).
   This category may include line-drawing or technical character sets which
   fit the other criteria.

Note in particular that the national variants of ISO 646 are not included,
since these are covered adequately by ASCII and the ISO Latin alphabets.

Standard "Kermit names" (for use with the SET TRANSFER-SYNTAX command) are
given to these character sets so that they may be referred to uniformly in
all Kermit implementations.  These names are chosen to be mnemonic, so that
users don't have to remember long numbers like "ISO-8859-3".  The choice of
single words like "CYRILLIC" implies that there will not be more than one
transfer syntax for Cyrillic text.  However, if these standards change in the
future, it will be possible to append further identifying material to these
names, e.g. "CYRILLIC-2", "CYRILLIC-3", etc.

_____________________________________________________________________________

US 7-bit ASCII, equivalent to the ISO 646 International Reference Version
  (IRV) character set.  English, German without umlauts or ess-zet, etc.
  Kermit name: NORMAL.  ISO Registration Number: 2.

ISO 8859-1, Latin Alphabet 1, for Dutch, English, Faeroese, Finnish, French,
  German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, and
  Swedish.
  Kermit name: LATIN1.  ISO Registration Number: 100.

ISO 8859-2, Latin Alphabet 2.  Albanian, Czech, English, German, Hungarian,
  Polish, Romanian, Serbocroatian, Slovak, and Slovene.
  Kermit name: LATIN2.  ISO Registration Number: 101.

ISO 8859-3, Latin Alphabet 3, for Afrikaans, Catalan, English, Esperanto,
  French, Galician, German, Italian, Maltese, and Turkish.
  Kermit name: LATIN3.  ISO Registration Number: 109.

ISO 8859-4, Latin Alphabet 4, for Danish, English, Estonian, Finnish, German,
  Greenlandic, Lappish, Latvian, Lithuanian, Norwegian, and Swedish.
  Kermit name: LATIN4.  ISO Registration Number: 110.

ISO 8859-5, the Latin/Cyrillic Alphabet, for Bulgarian, Byelorussian,
  Macedonian, Russian, Serbocroatian, and Ukrainian (Compatible with USSR GOST
  Standard 19768-1987 and ECMA-113).
  Kermit name: CYRILLIC.  ISO Registration Number: 144.

ISO 8859-6, the Latin/Arabic Alphabet.
  Kermit name: ARABIC.  ISO Registration Number: 127.

ISO 8859-7, the Latin/Greek Alphabet.
  Kermit name: GREEK.  ISO Registration Number: 126.

ISO 8859-8, the Latin/Hebrew Alphabet.
  Kermit name: HEBREW.  ISO Registration Number: 138.

ISO DIS 8859-9, Latin Alphabet 5, in which six Icelandic letters from
  Latin Alphabet 1 were replaced by six other letters needed for Turkish.
  Kermit name: LATIN5.  ISO Registration Number: 148.

CSN 36 91 03, Czechoslovak Standard alphabet.
  Kermit name: CZECH.  ISO Registration Number: 139.

JIS X 0201, a 1-byte code including ASCII and Japanese Katakana.
  Kermit name: KATAKANA.  ISO Registration Number: 13 (Kana), 14 (Roman).

JIS X 0208, a 2-byte code containing Japanese Kanji, Katakana, Hiragana,
  Roman, Greek, and Russian characters, plus special symbols, etc.
  Kermit name: KANJI.  ISO Registration Number: 87.

Chinese Standard GB 2312-80, a 2-byte code for Chinese.
  Kermit name: CHINESE.  ISO Registration Number: 58.

KS C 5601 (1987), a 2-byte code for Korean.
  Kermit name: KOREAN.  ISO Registration Number: 149.

            Table 2: Standard 8-Bit Character Sets
_____________________________________________________________________________
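
One way a Kermit program might hold the contents of Table 2 is sketched below
in Python.  The Kermit names and ISO registration numbers are copied from the
table; the TransferSet type, the lookup function, and the decision to reject
unrecognized names with an error are assumptions made for illustration.

```python
from typing import NamedTuple

class TransferSet(NamedTuple):
    standard: str
    iso_registration: tuple[int, ...]
    bytes_per_char: int

TRANSFER_SETS = {
    "NORMAL":   TransferSet("ASCII / ISO 646 IRV", (2,), 1),
    "LATIN1":   TransferSet("ISO 8859-1", (100,), 1),
    "LATIN2":   TransferSet("ISO 8859-2", (101,), 1),
    "LATIN3":   TransferSet("ISO 8859-3", (109,), 1),
    "LATIN4":   TransferSet("ISO 8859-4", (110,), 1),
    "CYRILLIC": TransferSet("ISO 8859-5", (144,), 1),
    "ARABIC":   TransferSet("ISO 8859-6", (127,), 1),
    "GREEK":    TransferSet("ISO 8859-7", (126,), 1),
    "HEBREW":   TransferSet("ISO 8859-8", (138,), 1),
    "LATIN5":   TransferSet("ISO DIS 8859-9", (148,), 1),
    "CZECH":    TransferSet("CSN 36 91 03", (139,), 1),
    "KATAKANA": TransferSet("JIS X 0201", (13, 14), 1),
    "KANJI":    TransferSet("JIS X 0208", (87,), 2),
    "CHINESE":  TransferSet("GB 2312-80", (58,), 2),
    "KOREAN":   TransferSet("KS C 5601", (149,), 2),
}

def lookup(name: str) -> TransferSet:
    """Resolve the operand of SET TRANSFER-SYNTAX CHARACTER-SET."""
    key = name.strip().upper()
    if key not in TRANSFER_SETS:
        raise ValueError(f"unknown transfer character set: {name}")
    return TRANSFER_SETS[key]
```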

The ISO Latin alphabets and the Czech character set are 8-bit character sets
whose left half is identical with ASCII, and whose right half contains the
special characters.  The ISO registration number refers only to the right half
of each of these character sets.  But each of these sets must be used in its
entirety, because the unaccented Roman letters, the digits, and the
punctuation marks appear only in the ASCII left half.  Therefore, this
proposal considers an 8-bit character set composed of ASCII plus one of the
right-half sets to be a SINGLE character set.  The Kermit character-set name
refers to the two halves combined as a single set.  See Figure 2 for the
layout of an 8-bit character set.
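
The idea of two halves forming a single set can be illustrated with the
Python sketch below, which uses Python's "iso8859_5" codec for the
Latin/Cyrillic alphabet.  The half_of function is a hypothetical helper, and
it simplifies matters by testing only the high-order bit, ignoring the
distinction between control and graphic characters.

```python
import unicodedata

def half_of(byte_value: int) -> str:
    """Return 'left' for values 0-127 (ASCII), 'right' for 128-255."""
    return "right" if byte_value >= 0x80 else "left"

# "OK " followed by a three-letter Russian word, in one 8-bit character set.
sample = "OK \u041c\u0438\u0440".encode("iso8859_5")

for value in sample:
    character = bytes([value]).decode("iso8859_5")
    print(f"{value:3d}  {half_of(value):5s}  {unicodedata.name(character)}")
```

The Roman letters and the space come from the left half and the Cyrillic
letters from the right half, yet a single Kermit name (CYRILLIC) covers the
whole text.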

A particular Kermit program need not incorporate all of these character sets.
In many cases, a single 8-bit character set such as LATIN1 will suffice.  For
example, in the USSR there are at least five computer codes in use for
Cyrillic characters.  But all of them can be mapped to ISO Latin/Cyrillic,
which also includes ASCII.  So in all likelihood, a Soviet version of Kermit
need only use CYRILLIC in its Level-1 transfer syntax, allowing it to transfer
Russian and English language text files among computers using different codes.

When a language is representable in more than one character set from this
table, as are English, German, Finnish, Czech, Turkish, etc., the character
set highest on the list which adequately represents the language should be
preferred.  For example, NORMAL for English, LATIN1 for French, LATIN1 for
German (because it represents German better than ASCII), LATIN5 for Turkish
(because it represents Turkish better than LATIN3), etc.  This is to maximize
the chance that any two particular Kermit programs will recognize the same
character sets.

Unfortunately, but unavoidably, the burden of choosing the best transfer
syntax character set must be placed upon the user.  If a file containing a
mixture of Finnish, English, and Danish must be transferred, the user must
find a character set that can adequately represent all three languages, in
this case Latin Alphabet 4.  A table like Table 3 should be provided in the
user documentation to help the user make this selection.

_____________________________________________________________________________

    Arabic     ARABIC                      Italian        LATIN1,3
    Bulgarian  CYRILLIC                    Kanji          KANJI
    Chinese    CHINESE                     Katakana       KATAKANA, KANJI
    Czech      CZECH, LATIN2               Korean         KOREAN
    Danish     LATIN4                      Latvian        LATIN4
    Dutch      LATIN1,2,3,4                Lithuanian     LATIN4
    English    NORMAL,LATIN1,2,3,4,5,etc   Norwegian      LATIN1,4
    Esperanto  LATIN3                      Polish         LATIN2
    Estonian   LATIN4                      Portuguese     LATIN1
    Finnish    LATIN1,4                    Romanian       LATIN2
    Flemish    LATIN1,2,3,4,5              Russian        CYRILLIC
    French     LATIN1,3,5                  Serbocroatian  LATIN2
    German     LATIN1,2,3,4,5              Slovak         LATIN2
    Greek      GREEK                       Spanish        LATIN1
    Hebrew     HEBREW                      Swedish        LATIN1,4