shall be defined by the use of the appropriate locking-shift functions." Kermit programs should "agree otherwise" that the default G0 character set is the US-ASCII/ISO-646-IRV (International Reference Version) 7-bit character set; thus international transfer syntax can be identical to Normal Kermit transfer syntax when transferring 7-bit text files. There are no defaults for G1, G2, or G3, in the interest of fairness to all countries and peoples.

When the text contains characters outside the ASCII range, an escape sequence from Table 5 must be issued, designating the alphabet to which they belong (using the identification letters shown in Table 5) to the desired intermediate character set G0, G1, G2, or G3. This sequence must be given before the first occurrence of a character in that alphabet. If no such sequence is given, then all characters are treated as ASCII data, including <ESC>, the shift characters, and bytes with their 8th bits set to one. In other words, the file transfer behaves in the normal Kermit fashion for text files.

Since ISO 8859 character sets are subject to revision from time to time, an alphabet selector may be preceded by <ESC>&F, where F is the revision number (@ = 1, A = 2, B = 3, etc). For example, <ESC>&@<ESC>-A means Latin Alphabet Number One, Revision One. (This information is from ISO 2022 6.3.13.)

ISO 2022 escape sequences are inserted into the data, and are indistinguishable by the Kermit packet encoder/decoder from the data itself. Therefore these escape sequences may be broken across packets, just as any other data may be.

UNKNOWN ALPHABETS

It is not required that the sender preannounce all of a file's character sets prior to transfer. Suppose a file contains a mixture of alphabets, some known to the receiver, others not. At some point, an alphabet designator arrives which the receiving Kermit does not recognize. Should the receiving Kermit cancel the file transfer, or accept the unknown code?
A new command is provided to let the user control what happens in this situation: SET UNKNOWN-ALPHABET {KEEP, CANCEL}. If the user elects CANCEL, then the receiver will behave as if the user had manually cancelled the file, i.e. it will put the character "X" in the data field of its next acknowledgement, and the sender (assuming it supports this feature) will stop sending the file.

If the user elects KEEP, the file will be accepted in its entirety. But the unknown code should be marked in case the user wants to fix it afterwards. To do this, the receiving program accepts the designator for the unknown alphabet and stores it in the file as data, with subsequent characters stored untranslated. When the unknown character set is shifted out of (or the end of file arrives), the receiving Kermit stores the ISO-2022 Coding Method Delimiter, <ESC>d, and resumes translation. If the unknown alphabet is shifted back into, the designating escape sequence is stored again, and the process resumes. Unknown alphabets may be nested in this manner.

The default behavior should be "KEEP". This command should also be effective at Level 1, where it would simply prevent the receiving Kermit from refusing a file on the basis of the character set used to transfer it.

LOCAL FILE REPRESENTATION

This proposal assumes nothing about the representation of the file on the local storage medium. It may be ASCII, EBCDIC, a proprietary word processor format, IBM code page, or anything else. It is an implementation "detail" for the Kermit programmer to convert between the local file representation for multi-alphabet text files, and Kermit's file transfer syntax.

In some cases, the file itself (or its directory entry) might contain the necessary identifying information, in which case the sending Kermit program can automatically emit the appropriate escape sequences during file transfer. In others, the user will have to tell the sending program how the file is encoded.
The suggested command is SET FILE TYPE followed by a keyword that specifies how the file is (or when receiving, is to be) encoded on disk. This will necessarily be highly dependent on the system's conventions, or the conventions of the applications to be used with the file (e.g. a multi-language word processing program). Possibilities for the keyword might include application names like WORDPERFECT, XYWRITE, NOTA-BENE, MACWRITE, ALEPH-BET, PC-HANGUL.

BREAKING THE RULES

If the local file is not encoded according to ISO 2022 rules, it may contain escape and shift characters of its own. It is up to the Kermit program to know what these characters mean in the context of the file's format, and to either strip them from the file or translate them to something else. The ISO 2022 rules forbid the use of these characters as data to be transferred.

If a file is to be transferred using international syntax, and it contains any of the escape or shift characters significant to this syntax, then such characters should be prefixed during transmission with Datalink Escape, <DLE>, C0 character 01/00 (Control-P). Furthermore, if <DLE> itself occurs in the data, it should also be prefixed with <DLE>.

LEVEL-2 PERFORMANCE

Kermit programs may use the full range of ISO 2022 code extension techniques, including use of G0, G1, G2, and G3 in both the 7-bit and 8-bit environments, with both single-byte and multibyte character sets. In the general case, G0 will be used for ASCII and English, G1 for the "native language" of the local country or region, G2 for a third language, and G3 for a fourth. Additional character sets may be swapped in and out of G2 and G3 as required.

Transmission of 8-bit data in the 7-bit environment is accomplished by Kermit using 8th-bit prefixing, which is an optional feature of the Kermit protocol. However, most popular implementations of Kermit do include this feature. If a Kermit program cannot do 8th-bit prefixing, then it must operate in the ISO 2022 7-bit environment, shifting GL among the intermediate graphics sets G0-G3.
If the Kermit program can do 8th-bit prefixing, the choice of the ISO 2022 7-bit or 8-bit environment is entirely independent of the communication channel. Selection of the ISO 2022 7-bit or 8-bit environment should be made on other grounds, such as transmission efficiency or program simplicity. For example, if the ISO 2022 8-bit environment is used on a 7-bit channel, then Kermit will have to do 8th-bit prefixing.

On a 7-bit communication channel, the best choice of ISO 7-bit or 8-bit environment depends on the nature of the data to be transferred. If there is little or no 8-bit data (as in English text), it doesn't matter. If there is frequent shifting between 7-bit and 8-bit characters (as in French or Portuguese), then single shifts would tend to be more efficient than locking shifts, and Kermit's 8th-bit prefixing is equivalent to a single shift. Therefore, use the ISO 8-bit environment and let Kermit do the prefixing. If there are long strings of 8-bit characters, as in "right-sided" languages like Russian, Greek, Arabic, and Hebrew, then locking shifts are more efficient -- use the ISO 7-bit environment.

In Japan, many computer systems use at least three character sets, Roman (close to ASCII), Katakana (a 1-byte code), and Kanji (a 2-byte code). Kanji is specified in JIS X 0208, which also includes Roman, Hiragana, Katakana, and some other character sets, but these are double width and not normally used. Roman characters are usually taken from the left half of JIS X 0201, and Katakana from the right half.

Japanese text frequently shifts between Roman, Kana, and Kanji, and therefore requires three active character sets, for example G0 (Roman), G1 (Kana), and G2 or G3 (Kanji). In the 8-bit environment, data transfer can be quite efficient: locking shifts are used to shift GL between Roman and Kana, and any bytes with the 8th bit set to one automatically invoke Kanji in GR as a multi-byte character set.
In the 7-bit environment, locking shifts would also be used to select Kanji. Note that locking shifts are more efficient in this case than Kermit 8th-bit prefixing because Kanji characters consist of more than one byte, and tend to occur in runs. For Japanese, therefore, it is better to use the ISO 7-bit environment on a 7-bit communication channel.

The situation is summarized in Table 4.

_____________________________________________________________________________

                            ISO 2022 Environment
                     7-bit                          8-bit
        +------------------------------+-----------------------------+
        | Recommended for right-       | Recommended for 2-sided     |
 7-bit  | sided languages like Greek,  | languages like French,      |
 data   | Russian, Arabic, Hebrew.     | German, etc.  Use Kermit's  |
 path   | Use ISO 2022 locking shifts. | 8th-bit prefix for special  |
        | Also for Japanese.           | characters.                 |
        +------------------------------+-----------------------------+
        | No reason to use ISO 7-bit   | Clear transmission of 8-bit |
 8-bit  | environment on a clear 8-bit | characters.  Use for both   |
 data   | communication channel.       | left- and right-sided       |
 path   | OK for 7-bit ASCII, though.  | languages.                  |
        |                              |                             |
        +------------------------------+-----------------------------+

             Table 4:  Selecting ISO 7- vs 8-Bit Environment
_____________________________________________________________________________

The user should have control over whether the ISO-2022 7-bit or 8-bit environment is used. To allow this, the command SET TRANSFER-SYNTAX INTERNATIONAL may be extended as follows:

  SET TRANSFER-SYNTAX INTERNATIONAL [ {7, 8} ]

which means that an optional final field may be included to specify the 7- or 8-bit ISO-2022 environment. The default should be 8, since it is the most efficient method in most cases.

If Kermit -- at all levels -- offered locking shifts in addition to single shifts, then international syntax could always proceed in the 8-bit environment, and this would simplify implementation considerably.
A proposal on locking shifts for Kermit is forthcoming.

FILE TRANSFER SYNTAX EXAMPLES

A simple 7-bit ASCII text file can be transmitted in the normal Kermit manner for text files, without any escapes or shifts, even in ISO 2022 mode. The "encoding" file attribute, if used with international transfer syntax, could be "*#IAJ2"I2" (encoding = international with GL = G0, ISO 2022 7-bit environment, character set = ASCII). Or it could be simply "*!A" (ASCII).

A text file containing characters from a language or languages covered by a single alphabet other than ASCII can be transferred exactly like an ASCII text file, except that the attribute, if used, would denote the character set, e.g. "*!C2$I100" for Latin-1. In the 7-bit environment, international syntax can be used to cut down on Kermit's 8th-bit prefixing overhead, in which case the attributes might look like "*#IBJ2$144", and any strings of GR characters would be preceded by LS1 and transmitted with their high-order bits set to zero.

A multi-character-set text file will require an escape sequence to identify each alphabet. The attribute packet would show international encoding, optionally including the ISO 2022 facilities announcers, and the character sets, as in "*#ICK2)I100,I144".

In the 7-bit environment, <SO> and <SI> are used to shift between the G0 and G1 sets. In the absence of any specific designators, the G0 set is presumed to be ASCII. Example:

  A dangerous German word is "gef<ESC>-A<SO>d<SI>hrlich".

In this case, the only extended character is the umlaut-a in "gefaehrlich" (where ae is a way of writing umlaut-a without an umlaut). <ESC>-A designates Latin-1 into G1, <SO> shifts GL out to G1, "d" is the left-half equivalent of umlaut-a, and <SI> shifts GL back in to G0.

For clarity and consistency with the ISO-2022 recommendations, it is recommended that the text begin with explicit character set designations, and then explicitly shift into the G0 set, rather than defaulting to it:

  <ESC>(B<ESC>-A<SI>A dangerous German word is "gef<SO>d<SI>hrlich".
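The German example above can be reproduced with a short sketch. This is only an illustration under stated assumptions, not part of the proposal: the function name encode_latin1_7bit is hypothetical, only ASCII (G0) and the Latin-1 right half (G1) are handled, and prefixing of control characters that occur as data is left out.

```python
ESC, SO, SI = b"\x1b", b"\x0e", b"\x0f"


def encode_latin1_7bit(text: str) -> bytes:
    """Encode text for the ISO 2022 7-bit environment: ASCII in G0, Latin-1 in G1.

    Hypothetical helper for illustration; it does not prefix control
    characters that appear in the data.
    """
    # ESC ( B designates ASCII to G0, ESC - A designates Latin-1 to G1,
    # and SI explicitly invokes G0 into GL before any text is sent.
    out = bytearray(ESC + b"(B" + ESC + b"-A" + SI)
    shifted = False  # True while G1 is invoked into GL
    for byte in text.encode("latin-1"):
        if 0x80 <= byte < 0xA0:
            raise ValueError("C1 control characters are outside this sketch")
        high = byte >= 0xA0
        if high != shifted:
            out += SO if high else SI  # locking shift only when the half changes
            shifted = high
        out.append(byte & 0x7F)  # right-half characters go out with bit 8 cleared
    if shifted:
        out += SI  # leave GL pointing at G0 at the end of the text
    return bytes(out)


encoded = encode_latin1_7bit('A dangerous German word is "gef\u00e4hrlich".')
print(encoded)
# b'\x1b(B\x1b-A\x0fA dangerous German word is "gef\x0ed\x0fhrlich".'
```

Because the shift is locking, a run of several right-half characters costs one SO and one SI in total, which is why this form suits text with long runs of 8-bit characters.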
A text file containing characters from multiple ISO 8859 alphabets requires a designation sequence for each alphabet. In the 7-bit environment, <SO> and <SI> can be used to shift between G0 and G1 of the current alphabet, and <ESC>(B can be used to select G0 of any of the alphabets, since these are all the same. For example, the following text contains the same word in English, French, and Russian:

  <ESC>-ADisappointed, d<SO>ig<SI>u, <ESC>-L<SO>`PW^gP`^RP]]kY<SI>.

The first escape sequence assigns Latin Alphabet No. 1 to G1, and the subsequent <SO> and <SI> shifts apply to its G0 and G1 set, which is used to form the English and French words. The second escape sequence assigns the Latin/Cyrillic 96-character set to G1, and the subsequent shifts apply to this new set.

Another 7-bit example, in which the same word is repeated in English, Russian, and German, shows how a locking shift remains in effect when the alphabet is changed. We begin in Latin/Cyrillic, start with an English word from G0, shift to G1 for the Russian word, and while still in G1 switch to Latin Alphabet No. 1 for German to get the umlaut-A at the beginning of Aenderung (where Ae = umlaut-uppercase-A), and shift back to G0 for the rest of the word:

  <ESC>-LAlteration <SO>_U`UTU[ZP <ESC>-AD<SI>nderung.

Some rules and hints to remember:

 1. In the 8-bit communication environment, always use 8-bit character transmission -- it's more efficient.

 2. There can be no more than four character sets designated at one time. Generally designate ASCII to G0, the most frequently-used non-ASCII set to G2, less frequently used sets to G3 and G1. If a file has more than four sets, swap the least frequently used sets in and out of G3 and G1.

 3. Single shifts can only be used with G2 and G3. This is why G2 and G3 are preferred to G1.

 4. Only two character sets can be invoked at once in the 8-bit communication environment, and only one in the 7-bit environment.
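The receiving side of the 7-bit multi-alphabet examples in this section can be sketched as a small state machine. The sketch below is an assumption-laden illustration rather than a definitive implementation: the name decode_7bit and the G1_CODECS table are hypothetical, only the ESC ( B, ESC - A, and ESC - L designators are recognized, the Python codec names stand in for the registered character sets, and G2, G3, single shifts, and unknown-alphabet handling are omitted.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Final characters of the "ESC - F" designators used in the examples,
# mapped to Python codec names (an assumption of this sketch).
G1_CODECS = {b"A": "latin-1", b"L": "iso8859_5"}


def decode_7bit(data: bytes) -> str:
    """Decode ISO 2022 7-bit data that uses G0 = ASCII and a designated G1."""
    out = []
    g1 = None        # codec currently designated to G1, if any
    shifted = False  # True while G1 is invoked into GL (after SO)
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            seq = data[i + 1:i + 3]
            if seq == b"(B":
                pass  # ASCII to G0, which this sketch already assumes
            elif seq[:1] == b"-" and seq[1:] in G1_CODECS:
                # Redesignating G1 does not touch the shift state, so a
                # locking shift stays in effect across an alphabet change.
                g1 = G1_CODECS[seq[1:]]
            else:
                raise ValueError("designator not handled by this sketch")
            i += 3
        elif b == SO:
            shifted = True
            i += 1
        elif b == SI:
            shifted = False
            i += 1
        elif shifted and b >= 0x20:
            if g1 is None:
                raise ValueError("SO used before any G1 designation")
            # Restore bit 8 to recover the right-half code point.
            out.append(bytes([b | 0x80]).decode(g1))
            i += 1
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)


sample = b"\x1b-ADisappointed, d\x0eig\x0fu, \x1b-L\x0e`PW^gP`^RP]]kY\x0f."
print(decode_7bit(sample))
```

Keeping the designation (which set is in G1) separate from the invocation (whether GL currently points at G0 or G1) is what lets the second example switch from Latin/Cyrillic to Latin-1 without an intervening shift.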
TERMINAL EMULATION

While not part of the Kermit file transfer protocol, terminal emulation is a feature of many Kermit programs. It is hoped that these terminal emulators will evolve along the lines of the ISO standards mentioned above. In some cases, this is already a fact, insofar as DEC VT300 series terminals already follow these standards and Kermit programs are beginning to emulate these terminals.

In this regard, it is important to note that not all languages are written from left to right, top to bottom. Hebrew and Arabic are two examples of right-to-left languages, and Japanese and Chinese may be written top to bottom. The order of the text characters on disk or on the transmission line does not necessarily reflect their order on the screen or the printed page.

Kermit should be as easy to use as possible, but should still give the user the ability to specify exactly what character codes are in use for both terminal emulation and file transfer. There should also be a consistent set of commands for all Kermit programs.

SPECIAL EFFECTS

Today, most multi-alphabet files are produced by proprietary text processing programs. These programs have many functions besides switching among alphabets. They may also endow text with special attributes such as boldface, italic, underline, super- or subscript, color, etc, and render characters in a variety of type styles and sizes. Each text processing program may have its own unique formats and conventions.

These special effects are not addressed by this proposal. Nevertheless, it is likely that a multi-alphabet file produced by a text processing program also contains special effects. In order for a Kermit program to send a multi-alphabet file, it must have detailed knowledge of the file's format and coding conventions. Therefore, the Kermit program should be able to strip out the special effects, and send only the text.
Otherwise the result would be meaningless when received on an unlike system or for use with a different application. (When transferring such files between like systems or compatible applications, Kermit binary mode transfers will suffice.)

At some future time, it might be possible to adapt one of the popular document description languages to the Kermit protocol, so that Kermit will be able to transfer formatted documents between unlike systems and applications. Presently, there are many competing would-be standards including IBM DCA and DIA, DEC DDIF, US Navy DIF, Postscript. There are also two ISO standards emerging in this area: Standard Generalized Markup Language (ISO 8879, 9069, and 9573), and Office Document Architecture (ISO 8613). This is an area for further study.

APPENDIX A: STANDARDS

ANSI X3.4 (1986), "Coded Character Sets - 7-bit American Standard Code for Information Interchange" (US ASCII), is the 7-bit code currently used by Kermit for transferring text files.

ISO 646 (1983) (= ECMA-6), "Information Processing - ISO 7-bit Coded Character Sets for Information Interchange", gives us a 7-bit character set equivalent to ASCII with provision for substituting "national characters" in selected positions.

ISO 4873 (1986) (= ECMA-43), "Information Processing - ISO 8-bit Code for Information Interchange - Structure and Rules for Implementation", defines 8-bit character sets, their graphic and control regions, and how to extend an 8-bit character set by using multiple intermediate graphics sets.

ISO 2022 (1986) (= ECMA-35), "Information Processing - ISO 7-bit and 8-bit Coded Character Sets - Code Extension Techniques", describes how to use 8-bit character sets in both 7-bit and 8-bit environments, and how to switch among different character sets and alphabets.

ISO International Register of Coded Character Sets to be Used with Escape Sequences. This is the source of the ISO registration numbers.
ISO 2375 (1985) "Data Processing - Procedure for Registration of Escape Sequences". The procedure by which a character set gets into the above register and has a registration number and designating escape sequence assigned to it.

JIS X 0202, "Code Extension Techniques for Use with the Code for Information Interchange", the Japanese counterpart of ISO 2022.

ANSI X3.41-1974, "Code Extension Techniques for Use with the 7-Bit Coded Character Set of the American National Standard Code for Information Interchange", describes 7- and 8-bit codes and extension techniques in approximately the same manner as ISO 4873 and ISO 2022.

ISO 8859 (1987-present) (see Table 5 for ECMA equivalents), "Information Processing - 8-Bit Single-Byte Coded Graphic Character Sets", defines the actual 8-bit character sets to be used for many of the world's languages. The left half of each of these is the same as ASCII and ISO 646. Each character, including those with diacritics, is represented by a single byte.

ISO is the International Organization for Standardization, ANSI is the American National Standards Institute, ECMA is the European Computer Manufacturers Association. JIS means Japan Industrial Standard.

The ISO/ECMA standards discussed in this proposal may be obtained free of charge in their ECMA form by writing to:

  ECMA Headquarters
  Rue du Rhone 114
  CH-1204 Geneva
  SWITZERLAND

Be sure to specify the title and the ECMA number of each standard requested. ISO standards can also be ordered from the UN bookstore, but not for free:

  CCITT
  United Nations Bookstore
  United Nations Building
  New York, NY 10017

ANSI standards may be ordered, for a fee, from: