13-May-97 3:48:24-GMT,1383;000000000011
Received: from Unicode.ORG (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id XAA03046
	for ; Mon, 12 May 1997 23:48:23 -0400 (EDT)
Received: by Unicode.ORG (NX5.67g/NX3.0M)
	id AA29045; Mon, 12 May 97 20:13:02 -0700
Message-Id: <9705130313.AA29045@Unicode.ORG>
Errors-To: uni-bounce@Unicode.ORG
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
X-Uml-Sequence: 2583 (1997-5-13 03:12:48 GMT)
To: Multiple Recipients of 
Reply-To: "Mark H. David" 
From: "Unicode Discussion" 
Date: Mon, 12 May 1997 20:12:48 -0700 (PDT)
Subject: Line Separator Character

What is the deal with the Unicode line separator? Why would I want to use
it, as opposed to using, say, LF or CRLF? Microsoft's CF_UNICODETEXT
clipboard format apparently requires CRLF, and their notepad application
displays black blob characters when you feed it the Unicode line separator.
I've heard reports that Java similarly misdisplays this character,
preferring LF only. What was the idea behind the Unicode line separator?
Is there any advantage to using it? It seems to be different just to be
different. If I chose to use LF or CRLF, at least I'd be compatible with
many things. This way I'm compatible with just about nothing. Can anyone
provide any further information or insights?

13-May-97 22:32:30-GMT,2225;000000000001
Received: (from fdc@localhost)
	by watsun.cc.columbia.edu (8.8.5/8.8.5) id SAA06966;
	Tue, 13 May 1997 18:32:26 -0400 (EDT)
Date: Tue, 13 May 97 18:32:25 EDT
From: Frank da Cruz 
To: "Mark H. David" 
Cc: Multiple Recipients of 
Subject: Re: Line Separator Character
In-Reply-To: Your message of Mon, 12 May 1997 20:12:48 -0700 (PDT)
Message-ID: 

> What is the deal with the Unicode line separator? Why would I want to use
> it, as opposed to using, say, LF or CRLF? Microsoft's CF_UNICODETEXT
> clipboard format apparently requires CRLF, and their notepad application
> displays black blob characters when you feed it the Unicode line separator.
> I've heard reports that Java similarly misdisplays this character,
> preferring LF only. What was the idea behind the Unicode line separator?
> Is there any advantage to using it? It seems to be different just to be
> different. If I chose to use LF or CRLF, at least I'd be compatible with
> many things. This way I'm compatible with just about nothing. Can anyone
> provide any further information or insights?
> 
I suppose that as the one who proposed the Unicode line separator, I should
speak to this one. The following are statements from, or paraphrased from,
the Unicode standard:

 . Unicode encodes plain text;
 . Plain text should contain enough information to permit the text to be
   rendered legibly and nothing more;
 . The appearance of the text depends on an upper level protocol and not
   on ASCII or ISO control characters, which are retained only for
   compatibility.
 . Unicode does not prescribe specific semantics for U+000D (CR) and
   U+000A (LF); it is left to the application to interpret these codes.

In other words, without Line Separator U+2028, there would be no canonical
way to represent line breaks, as in (e.g.) poetry, in Unicode plain text.
Why? Because the semantics of CR, LF, CRLF, and other control characters
vary from platform to platform (e.g. Macintosh, UNIX, DOS).

Furthermore, the conventions for separating paragraphs are also platform
and application-specific. Thus the Paragraph Separator, U+2029.

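The platform dependence described above can be made concrete with a minimal Python sketch. It is not part of the original message. It assumes Python's built-in `str.splitlines()`, which treats CR, LF, CRLF, U+2028 and U+2029 (among others) as line boundaries; the sample text and the helper `to_unicode_separators` are invented for illustration.

```python
# Illustrative only: the same two-line text as three platforms would store it,
# plus the platform-neutral form using U+2028 LINE SEPARATOR.
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR

variants = {
    "UNIX (LF)": "roses are red\nviolets are blue",
    "Macintosh (CR)": "roses are red\rviolets are blue",
    "DOS (CRLF)": "roses are red\r\nviolets are blue",
    "Unicode (LS)": "roses are red" + LS + "violets are blue",
}

# str.splitlines() recognizes all four conventions, so each variant
# is split into the same two lines.
for name, text in variants.items():
    print(name, text.splitlines())


def to_unicode_separators(text: str) -> str:
    """Hypothetical helper: replace CRLF, CR and LF with U+2028."""
    # CRLF must be handled first so it becomes one separator, not two.
    return text.replace("\r\n", LS).replace("\r", LS).replace("\n", LS)
```

Whether a receiving application actually honours U+2028 is a separate matter, which is the compatibility concern raised in the original question.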
- Frank 14-May-97 5:29:38-GMT,1893;000000000011 Received: from unicode.unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id BAA09866 for ; Wed, 14 May 1997 01:29:37 -0400 (EDT) Received: by unicode.unicode.org (NX5.67g/NX3.0M) id AA01180; Tue, 13 May 97 21:35:01 -0700 Message-Id: <9705140435.AA01180@unicode.unicode.org> Errors-To: uni-bounce@unicode.unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=iso-2022-jp Content-Transfer-Encoding: 7bit X-Uml-Sequence: 2598 (1997-05-14 04:34:36 GMT) To: Multiple Recipients of Reply-To: Adrian Havill From: "Unicode Discussion" Date: Tue, 13 May 1997 21:34:34 -0700 (PDT) Subject: Re: Line Separator Character Unicode Discussion wrote: > In other words, without Line Separator U+2028, there would be no canonical > way to represent line breaks, as in (e.g.) poetry, in Unicode plain text. > Why? Because the semantics of CR, LF, CRLF, and other control characters > vary from platform to platform (e.g. Macintosh, UNIX, DOS). > > Furthermore, the conventions for separating paragraphs are also platform > and application-specific. Thus the Paragraph Separator, U+2029. Does this mean that new applications should refrain from using LF and CR and use the two new control characters instead? How many Unicode applications currently understand the Unicode line and paragraph separators? As for future Unicode apps what about Unicode supporting e-mail apps? Will the upcoming Netscape Communicator (most popular commercial Unicode capable e-mail client I can think of) send e-mail (and understand) with the new markers (providing they're Unicode encoded, of course). 
(targeted towards the Netscape/Unicode group) -- Adrian Havill Engineering Division, System Planning & Production Section 14-May-97 6:31:04-GMT,2078;000000000001 Received: from malmo.trab.se (malmo.trab.se [131.115.48.10]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id CAA22840 for ; Wed, 14 May 1997 02:31:02 -0400 (EDT) Received: from valinor.malmo.trab.se (valinor.malmo.trab.se [131.115.48.20]) by malmo.trab.se (8.7.5/TRAB-primary-2) with ESMTP id IAA24548; Wed, 14 May 1997 08:31:00 +0200 (MET DST) Received: by valinor.malmo.trab.se (8.7.5/TRM-1-KLIENT); Wed, 14 May 1997 08:30:59 +0200 (MET DST) (MET) Date: Wed, 14 May 1997 08:30:59 +0200 (MET DST) From: Dan Oscarsson Message-Id: <199705140630.IAA20207@valinor.malmo.trab.se> To: unicode@unicode.unicode.org, fdc@watsun.cc.columbia.edu Subject: Re: Line Separator Character Mime-Version: 1.0 Content-MD5: JuyMhI2YbpSiZuTwTb2uNw== Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit > I suppose that as the one who proposed the Unicode line separator, I should > speak to this one. The following are statements from, or paraphrased from, > the Unicode standard: > > . Unicode encodes plain text; > . Plain text should contain enough information to permit the text to be > rendered legibly and nothing more; > . The appearance of the text depends on an upper level protocol and not > on ASCII or ISO control characters, which are retained only for > compatibility. > . Unicode does not prescribe specific semantics for U+000D (CR) and > U+000A (LF); it is left the application to interpret these codes. > > In other words, without Line Separator U+2028, there would be no canonical > way to represent line breaks, as in (e.g.) poetry, in Unicode plain text. > Why? Because the semantics of CR, LF, CRLF, and other control characters > vary from platform to platform (e.g. Macintosh, UNIX, DOS). 
Why should we use a new Unicode special character for line separator when there is a line separator control character: NL (Next Line) defined in the 0200-0237 range. It would be better to to use that instead of CR/LF and U+2028. It can also be used in 8-bit byte text. Dan 14-May-97 11:30:04-GMT,3255;000000000001 Received: from unicode.unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id HAA17906 for ; Wed, 14 May 1997 07:30:01 -0400 (EDT) Received: by unicode.unicode.org (NX5.67g/NX3.0M) id AA02063; Wed, 14 May 97 03:43:11 -0700 Message-Id: <9705141043.AA02063@unicode.unicode.org> Errors-To: uni-bounce@unicode.unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=iso-2022-jp Content-Transfer-Encoding: 7bit X-Uml-Sequence: 2600 (1997-05-14 10:41:14 GMT) To: Multiple Recipients of Reply-To: Adrian Havill From: "Unicode Discussion" Date: Wed, 14 May 1997 03:41:12 -0700 (PDT) Subject: Re: Line Separator Character Martin J. Duerst wrote: > Email has very strict restrictions on this. You can't send doublebyte > UTF-16 or UCS-2 in Email. CRLF always has to be present as a line > separator. Unicode in Email is possible with UTF-7 (and CRLF as line > separator) or UTF-8 + BASE64/QuotedPrintable (and CRLF...). > Please see RFC 2045/6/7 for this. I'm aware of this. Allow me to clarify: encode the Unicode line and paragraph separators in UTF-7 and transmit no CR and LFs. Some protocols, such as SMTP, have a line limit (998 octets in the case of SMTP). However, as the behavior of CR and LF is system dependent, an e-mail client could theoretically ignore CR LF, etc and go by the UTF-7 encoded Unicode line and paragraph breaks, when RFC2046 says '[i]t should not be necessary to add any line breaks to display "text/plain" correctly....' So why not NOT use them and go with the Unicode ones? 
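As a rough illustration of the proposal above, the following Python sketch encodes text containing U+2028 and U+2029 as UTF-7 and checks that no CR or LF octets appear in the result. It is not from the original message; it assumes Python's standard `utf-7` codec and an invented sample string, and it says nothing about whether a mail transport would accept such a body (the SMTP line-length limit mentioned above still applies).

```python
# Sample text using only Unicode separators, no CR or LF.
text = "first line\u2028second line\u2029next paragraph"

# UTF-7 represents non-ASCII characters as modified base64 runs,
# so the separators travel as plain 7-bit characters.
encoded = text.encode("utf-7")
print(encoded)

# Decoding recovers the original text, separators included.
decoded = encoded.decode("utf-7")
print(decoded == text)
```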
I admit, I am not clear as to whether this phrase was referring specifically to the ASCII CR and LF control characters, or was referring to all types of line breaks in general. Is "plain text" Unicode with Unicode line breaks considered to be "text/plain" or "text/enriched" (which requires line breaks)? As there are few legacy Unicode-capable e-mail clients, is it not possible to push to get this functionality added now? Many e-mail clients today have an option which enables them to wrap/not-wrap long lines. Why not add a similar feature for Unicode capable clients, which allows a selection (under the "Unicode section" between "interpret CR and LF codes only", "interpret Unicode line and paragraph breaks only", "interpret both Unicode line and paragraph breaks AND CR and LF codes." (I'd also like a feature in future e-mail clients that says "display Unrenderable Unicode as...") Or am I overlooking something painfully obvious and being obtuse? If so, my apologies for wasting everybody's time. ;-) I can see how adding this kind of functionality might confuse the average end-user. But the current end-user which now has to deal with such cryptic functions such as "encode using MIME quoted-printable" or "8-bit", so I don't see how this functionality could make e-mail clients any more complicated, especially if the defaults are set for them for Unicode. Yet another reason why books like "The Complete Moron's Guide to E-Mail" continue to sell, I guess. 
(^_^) -- Adrian Havill Engineering Division, System Planning & Production Section 14-May-97 12:33:46-GMT,2283;000000000001 Received: from unicode.unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id IAA27390 for ; Wed, 14 May 1997 08:33:45 -0400 (EDT) Received: by unicode.unicode.org (NX5.67g/NX3.0M) id AA02381; Wed, 14 May 97 05:05:37 -0700 Message-Id: <9705141205.AA02381@unicode.unicode.org> Errors-To: uni-bounce@unicode.unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 2601 (1997-05-14 12:03:56 GMT) To: Multiple Recipients of Reply-To: Dan Oscarsson From: "Unicode Discussion" Date: Wed, 14 May 1997 05:03:54 -0700 (PDT) Subject: Re: Line Separator Character > On Tue, 13 May 1997, Dan Oscarson wrote: > > > Why should we use a new Unicode special character for line separator > > when there is a line separator control character: NL (Next Line) defined > > in the 0200-0237 range. It would be better to to use that instead of CR/LF and > > U+2028. It can also be used in 8-bit byte text. > > First: Please don't use octal numbers in an environment where everybody > is firmly used to hexadecimal. I had quite some problems figuring out > what you ment with the 0200-0237 range :-). Well, general use i octal if leading zero, hex if leading 0x, U+ is not hex, also octal is nicer. > > Second: Neither ISO 10646 nor Unicode define the CR control characters. > While for CL, virtually everybody uses the same assignement, and the > codepoints are even named in UNicode (but not in ISO 10646), there > are no stable conventions for CR. Many systems and encodings (Mac, > Windows, UTF-8) use the CR area for graphic characters. Yes, but neither the lower nor the upper range of control chaarcters is defined in ISO 10646, but both places are reserved for them, and there is a standard for both the upper and lower range. 
If we are going to extend the use of control characters it is better to use the control codes in the 8-bit range, especially if it is something as important as line separator. Then it can be used in many 8-bit character sets too. If is unfortunate that Mac, MS Win and UTF-8 have decided to use the upper control space for other things. Dan 14-May-97 17:26:33-GMT,4002;000000000001 Received: from unicode.unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id NAA24800 for ; Wed, 14 May 1997 13:26:29 -0400 (EDT) Received: by unicode.unicode.org (NX5.67g/NX3.0M) id AA03772; Wed, 14 May 97 10:18:12 -0700 Message-Id: <9705141718.AA03772@unicode.unicode.org> Errors-To: uni-bounce@unicode.unicode.org X-Uml-Sequence: 2603 (1997-05-14 17:17:01 GMT) To: Multiple Recipients of Reply-To: kenw@sybase.com (Kenneth Whistler) From: "Unicode Discussion" Date: Wed, 14 May 1997 10:16:59 -0700 (PDT) Subject: Re: Line Separator Character > > > On Tue, 13 May 1997, Dan Oscarson wrote: > > > > > Why should we use a new Unicode special character for line separator > > > when there is a line separator control character: NL (Next Line) defined > > > in the 0200-0237 range. It would be better to to use that instead of CR/LF and > > > U+2028. It can also be used in 8-bit byte text. > > > > First: Please don't use octal numbers in an environment where everybody > > is firmly used to hexadecimal. I had quite some problems figuring out > > what you ment with the 0200-0237 range :-). > Well, general use i octal if leading zero, hex if leading 0x, U+ is not hex, also > octal is nicer. U+ most assuredly is hex. Not only de facto, but now de jure. I cite from DAM No. 9 to ISO/IEC 10646-1:1: "The full syntax of the notation of a short identifier, in Backus-Naur form, is: {U|u}[{+}xxxx|{-}xxxxxxxx] where "x" represents one hexadecimal digit (0 to 9, A to F, or a to f),..." And I concur with the respondent. 
Some may agree with you that "octal is nicer", but on this list, octal will generally only confuse instead of communicating. By the way, octal 0200-0237, for those of you following this issue, corresponds to U+0080 - U+009F, also known in ISO documents as the C1 range, and referred to below as the "CR area". So what Dan is suggesting is making use of C1 controls for linebreak control, instead of U+2028 LINE SEPARATOR. > > > > > Second: Neither ISO 10646 nor Unicode define the CR control characters. > > While for CL, virtually everybody uses the same assignement, and the > > codepoints are even named in UNicode (but not in ISO 10646), there > > are no stable conventions for CR. Many systems and encodings (Mac, > > Windows, UTF-8) use the CR area for graphic characters. > Yes, but neither the lower nor the upper range of control chaarcters is defined > in ISO 10646, but both places are reserved for them, and there is a standard for > both the upper and lower range. If we are going to extend the use of > control characters it is better to use the control codes in the 8-bit range, especially > if it is something as important as line separator. Then it can be used in > many 8-bit character sets too. If is unfortunate that Mac, MS Win and UTF-8 have > decided to use the upper control space for other things. Use of C1 controls for 8-bit character sets is a logically separate issue from use of U+2028 LINE SEPARATOR (and U+2029 PARAGRAPH SEPARATOR) in Unicode. You may consider it unfortunate, but it is reality that in a world dominated by IBM, Microsoft, Apple, and even Hewlett-Packard 8-bit character encodings, most 8-bit data makes use of the 0x80..0x9F range for graphic characters. Implementations of the ISO 8859 series are the most notable exceptions. 
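The range conversion and the conflicting uses of that range can be checked with a short Python sketch. It is not part of the original message. It assumes that the NL Dan has in mind is the C1 control NEXT LINE at 0x85, and it uses Python's standard `latin-1`, `cp1252` and `mac_roman` codecs as stand-ins for ISO 8859-1, Windows and Macintosh 8-bit encodings.

```python
# Octal 0200-0237 is the same range as hexadecimal 0x80-0x9F (the C1 area).
c1_range_octal = (0o200, 0o237)
c1_range_hex = (0x80, 0x9F)
print(c1_range_octal == c1_range_hex)

NEL = 0x85  # assumed code position of NEXT LINE in the C1 set

# The same byte value read under three 8-bit encodings: a control
# character in one, graphic characters in the other two.
readings = {}
for codec in ("latin-1", "cp1252", "mac_roman"):
    ch = bytes([NEL]).decode(codec)
    readings[codec] = ch
    print(f"{codec:10} 0x85 -> U+{ord(ch):04X}")
```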
And if you want to talk unfortunate, we wouldn't be having nearly so many problems with the ISO 8-bit character sets if they had been built in the first place with graphic characters in 0x80..0x9F (an extra 32) instead of following an ill-conceived ISO 6937 attempt to extend control functions through character encodings in that space. For example, 8859-1 would have the French characters that are currently missing in it, and 8859-2 would not have had to make the ill-starred compromise between Romanian and Turkish letters! --Ken Whistler > > Dan > 15-May-97 0:10:05-GMT,3379;000000000001 Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id UAA04177 for ; Wed, 14 May 1997 20:10:04 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA04271; Wed, 14 May 97 16:01:54 -0700 Message-Id: <9705142301.AA04271@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 2610 (1997-05-14 23:01:30 GMT) To: Multiple Recipients of Reply-To: Mark Davis From: "Unicode Discussion" Date: Wed, 14 May 1997 16:01:28 -0700 (PDT) Subject: Re: Line Separator Character The discription of LINE SEPARATOR and PARAGRAPH SEPARATOR should be clear from the discussions on page 6-72 in The Unicode Standard, Version 2.0. Anyone using The Unicode Standard, Version 1.0 should "upgrade" to Version 2.0. The full current state of the standard is established by that document, supplemented by the Errata information on the Unicode web site (http://unicode.org). (By the way, there is also a listing of the table of contents on the web site.) Mark Unicode Discussion wrote: > > On 13 May 97 at 19:49, Frank da Cruz wrote: > > > . Unicode does not prescribe specific semantics for U+000D (CR) and > > U+000A (LF); it is left the application to interpret these codes. 
> > > > In other words, without Line Separator U+2028, there would be no canonical > > way to represent line breaks, as in (e.g.) poetry, in Unicode plain text. > > Why? Because the semantics of CR, LF, CRLF, and other control characters > > vary from platform to platform (e.g. Macintosh, UNIX, DOS). > > This sounded good until I looked up U2028 and found the name LINE > SEPARATOR and the comment "may be used to represent this semantic > unambiguously", but no explanation of what the semantic is! (I am > quoting from the 1.0 document, so I apologize in advance if this is > covered in 2.0, which I don't have here.) > > Several interpretations of the idea of LINE SEPARATOR are possible, > the obvious issue being whether a carriage return is implied. The > various EBCDIC character sets use the NEWLINE (NL, X'15') character > to mean "move to the leftmost position of the next line"; most ASCII- > like systems infer one of the motions from the other, or require that > both be specified (CR,LF). I think this all comes from the different > mechanical backgrounds: the EBCDIC concept from the IBM 2741 > terminal, which was incapable of executing a carriage return without > also doing a line feed, but which could line feed and backspace > independent of carriage return, and the various teletypewriter-like > devices which generally had no backspace, but could execute > independent carriage return and line feed. > > So what *is* the semantic represented by U2028 ? Is it perhaps a > higher level semantic than the low level detail of whether to return > to the originating margin ? If so, then presumably the notion of > line feed is also at a lower level, and U2028 might be implemented by > e.g. inserting bullets between the lines of poetry without actually > spacing down the page. Somehow this seems like the wrong level of > stuff to be encoding in a character set standard, though. 
> > Tony Harminc > tzha0@juts.ccc.amdahl.com 15-May-97 0:31:38-GMT,4425;000000000001 Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id UAA11230 for ; Wed, 14 May 1997 20:31:37 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA04536; Wed, 14 May 97 16:09:45 -0700 Message-Id: <9705142309.AA04536@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2611 (1997-05-14 23:09:27 GMT) To: Multiple Recipients of Reply-To: kenw@sybase.com (Kenneth Whistler) From: "Unicode Discussion" Date: Wed, 14 May 1997 16:09:26 -0700 (PDT) Subject: Re: Line Separator Character > > On 13 May 97 at 19:49, Frank da Cruz wrote: > > > . Unicode does not prescribe specific semantics for U+000D (CR) and > > U+000A (LF); it is left the application to interpret these codes. > > > > In other words, without Line Separator U+2028, there would be no canonical > > way to represent line breaks, as in (e.g.) poetry, in Unicode plain text. > > Why? Because the semantics of CR, LF, CRLF, and other control characters > > vary from platform to platform (e.g. Macintosh, UNIX, DOS). > > This sounded good until I looked up U2028 and found the name LINE > SEPARATOR and the comment "may be used to represent this semantic > unambiguously", but no explanation of what the semantic is! (I am > quoting from the 1.0 document, so I apologize in advance if this is > covered in 2.0, which I don't have here.) >From the Unicode Standard, Version 2.0, p 6-72: "[discussion of paragraph separator...] A line separator indicates that a line-break should occur at this point; although the text continues on the next line, it does not start a new paragraph: no interparagraph line spacing nor paragraphic indentation is applied. Since these are separator codes, it is not necessary to start the first line or paragraph, nor end the last line or paragraph with them." 
In other words, a U+2028 LINE SEPARATOR is to Unicode plain text formatting approximately as ";" is to Pascal statement syntax. > > Several interpretations of the idea of LINE SEPARATOR are possible, > the obvious issue being whether a carriage return is implied. The > various EBCDIC character sets use the NEWLINE (NL, X'15') character > to mean "move to the leftmost position of the next line"; most ASCII- > like systems infer one of the motions from the other, or require that > both be specified (CR,LF). I think this all comes from the different > mechanical backgrounds: the EBCDIC concept from the IBM 2741 > terminal, which was incapable of executing a carriage return without > also doing a line feed, but which could line feed and backspace > independent of carriage return, and the various teletypewriter-like > devices which generally had no backspace, but could execute > independent carriage return and line feed. No mechanical background is intended or implied. This is one reason to depart from the CR/LF/NL control code legacy. The Unicode LINE SEPARATOR implies a GUI model of text layout and formatting (although it is possible to implement on a terminal or virtual terminal). > > So what *is* the semantic represented by U2028 ? Is it perhaps a > higher level semantic than the low level detail of whether to return > to the originating margin ? If so, then presumably the notion of > line feed is also at a lower level, and U2028 might be implemented by > e.g. inserting bullets between the lines of poetry without actually > spacing down the page. Somehow this seems like the wrong level of > stuff to be encoding in a character set standard, though. It is the minimum information to encode in plain text to make it possible for a formatter (which is at a higher level abstraction, and which, indeed, has notions of margins, line advance, etc.) requires to render lines and paragraph breaks at appropriate places. 
While no one wants to encode all kinds of formatting details in plain text (it belongs in rich or fancy text protocols), neither does anyone want "plain text" to just be a completely unstructured stream of characters with no expressed or expressable chunking into lines and paragraphs. andifintroductionoflineseparatorandparagraphseparatorin tothecharacterencodingseemsobjectionableforplaintextrem embertoothatpunctuationcasingandspaceswereaddedtowritin gsystemstomakethemmorelegiblekenwhistler > > Tony Harminc > tzha0@juts.ccc.amdahl.com > 15-May-97 1:42:05-GMT,4598;000000000011 Received: from halon.sybase.com (halon.sybase.com [192.138.151.33]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id VAA21873 for ; Wed, 14 May 1997 21:42:03 -0400 (EDT) Received: from smtp1.sybase.com (sybgate.sybase.com [130.214.220.35]) by halon.sybase.com (8.8.4/8.8.4) with SMTP id SAA25446; Wed, 14 May 1997 18:03:53 -0700 (PDT) Received: from birdie.sybase.com by smtp1.sybase.com (4.1/SMI-4.1/SybH3.5-030896) id AA04531; Wed, 14 May 97 18:02:03 PDT Received: by birdie.sybase.com (5.x/SMI-SVR4/SybEC3.5) id AA15003; Wed, 14 May 1997 18:00:36 -0700 Date: Wed, 14 May 1997 18:00:36 -0700 From: kenw@sybase.com (Kenneth Whistler) Message-Id: <9705150100.AA15003@birdie.sybase.com> To: fdc@watsun.cc.columbia.edu Subject: Re: C0 contorls (was: Line Separator Character) Cc: unicode@unicode.org, kenw@sybase.com X-Sun-Charset: US-ASCII > > Does this mean that new applications should refrain from using LF and CR > > and use the two new control characters instead? How many Unicode > > applications currently understand the Unicode line and paragraph > > separators? > > > I would say that this would be the intention of the Unicode standard, in > which traditional control characters are emphatically deprecated. That would > include NL also. Yes. But implementation pressure to keep using CR, LF, or CRLF in their Unicode forms in plain text may result in other outcomes. Cf. 
Murray's note regarding Microsoft's de facto usage. > > I'm not saying this was necessarily the best decision. Unicode, although a > self-proclaimed "plain text" standard, is nevertheless strongly biased towards > use within systems, rather than between them, and particularly by high-end > "rendering engines" that can handle all the complexities of composed > characters, lookahead, and so forth. Control characters are largely intended > for use in communications, where it has always been necessary to mix pure > information in-band with control codes. Although such usage has long been archaic, replaced by clean communication protocols that transmit arbitrary binary data, or by full-blown device control languages implemented in plain text (e.g. PostScript). But of course "archaic" does not mean obsolete, since no computer communication protocol ever seems to go away. ...Well, maybe paper tape punchcodes. ... > > I don't think the status of control characters in Unicode would have been an > issue if the C0 control characters had been better defined and used > consistently throughout history. If CR (or LF, or CRLF) always meant "end of > line", there would have been no need for the Unicode Line Separator, but the > framers of ASCII did not view it as an internal encoding for files, only as an > interchange code Yep. Note that the only C0 control character with an assumed and required semantics in Unicode 2.0 is U+0009 TAB. Nobody implements a TAB *character* with other than 0x09, and it seemed superfluous to clone one. U+0009 TAB is referenced in the normative Unicode bidi algorithm. > (more thought -- or at least experience -- went into the ISO > C1 control set, but it never really caught on -- how many file systems have > you seen in which NL is the line terminator?). Exactly. The C1 control set is largely ignored, as far as I can tell. > > Unicode is the opposite -- it is an internal encoding, but not an interchange > code. I disagree with the implication of this. 
Unicode is emphatically intended as an interchange code (as well as an internal encoding, or processing code). It is just not designed to be consistent with C0/byte-oriented transmission protocols. It is an interchange code for plain text, in much the same way that GIF is an interchange code for graphics. I don't much care what layers of other transmission and communication protocols are involved in packing it up and delivering it down the wire, as long as it arrives with the same content that it left with. > It does not contain the control elements to be one, but rather pushes > that off on lower levels of the communications architecture (just as it leaves > rendering issues to higher levels); Unicode is the stuff inside the data > fields of TCP/X.25/ISDN/etc packets. But it's not the code on the wire > between a computer and a terminal or a plain-text printer. Thus, unlike ASCII > or ISO 8859-1 (etc), it can't easily be used in a communications setting > except in combination with "something else" that packages it up for > transmission, and another "something else" that renders it. Agreed. --Ken Whistler > > - Frank > 15-May-97 3:09:12-GMT,1728;000000000001 Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id XAA02618 for ; Wed, 14 May 1997 23:09:11 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA05976; Wed, 14 May 97 19:37:23 -0700 Message-Id: <9705150237.AA05976@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 2614 (1997-05-15 02:37:06 GMT) To: Multiple Recipients of Reply-To: "Mark H. David" From: "Unicode Discussion" Date: Wed, 14 May 1997 19:37:05 -0700 (PDT) Subject: Re: Line Separator Character At 04:01 PM 5/14/97 -0700, you wrote: >The discription of LINE SEPARATOR and PARAGRAPH SEPARATOR should be >clear from the discussions on page 6-72 in The Unicode Standard, Version >2.0. 
Yes, but could this list be used to get practical advice on implementation and on interpreting what the spec means in the real world? OK, so Unicode recommends LINE SEPARATOR (LS) with the clear description alluded to above. And let's say Java AWT does not handle LS. (That's more or less the report I'm getting, but let's consider this hypothetical for now.) Can we then conclude that Java AWT is actually not Unicode compliant? That is, it does handle line separation, but does not assign this semantics to the appropriate character. I.e., if Java AWT printed black blobs for LF and for LS, meaning that it just can't understand the concept of line breaking, that would be technically Unicode compliant, I guess. But if it actually can do line breaking, but but doesn't do it for LS, then that's non-compliant. Is that correct? 15-May-97 11:12:47-GMT,2670;000000000001 Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id HAA06876 for ; Thu, 15 May 1997 07:12:46 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA06852; Thu, 15 May 97 03:33:47 -0700 Message-Id: <9705151033.AA06852@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 2616 (1997-05-15 10:33:04 GMT) To: Multiple Recipients of Reply-To: "Kent Karlsson (\e\d\v \E\D\V \e\x\f \E\X\F \i\I)" From: "Unicode Discussion" Date: Thu, 15 May 1997 03:33:03 -0700 (PDT) Subject: Re: Line Separator Character > "[discussion of paragraph separator...] > A line separator indicates that a line-break should occur at this > point; although the text continues on the next line, it does not > start a new paragraph: no interparagraph line spacing nor paragraphic > indentation is applied. Since these are separator codes, it is not > necessary to start the first line or paragraph, nor end the last line > or paragraph with them." 
> 
> In other words, a U+2028 LINE SEPARATOR is to Unicode plain text
> formatting approximately as ";" is to Pascal statement syntax.

No, but U+2029 PARAGRAPH SEPARATOR ("PS") is. I.e., the PS character
should be the normally occurring character to indicate a new paragraph.
The LINE SEPARATOR is intended only for *rare* occasions where a new line
is strongly(?) advised, such as within a poetic verse, or saying "it is
a good place to break the line here, but don't start a new paragraph".
(This is similar to a soft hyphen.) I don't know how strong the advice
is, since it says "line-break should...", not "line-break shall...".

If in an HTML-document, I would guess that a U+2029 PARAGRAPH SEPARATOR
should be interpreted *exactly* as a <P>, and a U+2028 LINE SEPARATOR
should be interpreted as a <BR>. Maybe the strength of the advice to
break should differ between <BR> (always break) and LS (perhaps: break
here, if a break is needed and no better place is found), I don't know.

I make no argument as to the good- or ill-advisedness of having these
characters. I just note that they are there, and may (or should) be
used. Also, Unicode is going to be used with "higher level 'protocols'"
(such as HTML), and a clarification of the interpretation of the PS and
LS characters in such contexts is needed, perhaps exemplified with HTML.
(Note that HTML does NOT interpret NL or CR as indicating any kind of
line break, except in special circumstances (<PRE>).)

		/kent karlsson
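
The mapping suggested in the message above (PS treated like a paragraph
element, LS like a forced line break) can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not
anything specified by Unicode or HTML: the function name plain_to_html
is hypothetical, and it simply treats both characters as separators,
as the quoted passage from the standard describes.

```python
import html

PS = "\u2029"  # PARAGRAPH SEPARATOR
LS = "\u2028"  # LINE SEPARATOR


def plain_to_html(text: str) -> str:
    """Hypothetical helper: map PS to paragraph elements and LS to <br>.

    Both characters are separators, not terminators, so n PS characters
    yield n + 1 paragraphs and no trailing separator is required.
    """
    rendered = []
    for paragraph in text.split(PS):
        # Escape markup-significant characters before adding our own tags.
        lines = [html.escape(line) for line in paragraph.split(LS)]
        rendered.append("<p>" + "<br>".join(lines) + "</p>")
    return "\n".join(rendered)


if __name__ == "__main__":
    verse = "Roses are red,\u2028Violets are blue.\u2029A second paragraph."
    print(plain_to_html(verse))
```

Whether LS should always force a break, as <BR> does, is exactly the
open question raised in the message; the sketch assumes it does.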

15-May-97 16:37:45-GMT,2737;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id MAA05854
	for ; Thu, 15 May 1997 12:37:44 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA07630; Thu, 15 May 97 08:56:34 -0700
Message-Id: <9705151556.AA07630@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2618 (1997-05-15 15:55:59 GMT)
To: Multiple Recipients of 
Reply-To: Frank da Cruz 
From: "Unicode Discussion" 
Date: Thu, 15 May 1997 08:55:57 -0700 (PDT)
Subject: Re: C0 contorls (was: Line Separator Character)

> > ... Control characters are largely intended
> > for use in communications, where it has always been necessary to mix pure
> > information in-band with control codes.
> 
> Although such usage has long been archaic, replaced by clean communication
> protocols that transmit arbitrary binary data, or by full-blown device
> control languages implemented in plain text (e.g. PostScript). But of
> course "archaic" does not mean obsolete, since no computer communication
> protocol ever seems to go away.
> 
One can argue the merits and tradeoffs of older and newer protocols, but many
of the older ones were quite successful and continue to be by virtue of the
fact that they were unleashed only after a great deal of thought, and often
only after compromise and consensus among diverse groups with conflicting
interests.

I would be very happy if words like "archaic" and "legacy" were dropped from
the lexicon of serious people for use in describing existing practice, and
especially existing practice that conforms to hard-fought and hard-won
national and international standards such as ISO 2022, 8859, or even the early
ANSI standards specifying the use of control characters in communications
protocols, which forms the basis for many of our modern protocols.

These are emotionally-toned marketing terms used by greedy corporations that
want to shame you into discarding systems that work perfectly well and buy new
replacements from them.  Maybe new stuff has its advantages, but personally I
don't think that applying epithets to old stuff is the right way to point that
out.  (This is not directed at Ken -- I'm just airing one of my pet peeves.)

Speaking of which, on the other end of the spectrum is the profligate use of
the word "comply", which once carried some weight because it was used in
connection with the aforementioned hard-won standards, but now is used with
any three-letter acronym that any company can dream up on its own without any
sort of review, quality control, or consensus.

My goodness, nowadays we even have to "comply" with a year!

- Frank

15-May-97 19:16:06-GMT,2249;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id PAA10790
	for ; Thu, 15 May 1997 15:16:01 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA08132; Thu, 15 May 97 11:32:46 -0700
Message-Id: <9705151832.AA08132@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Uml-Sequence: 2620 (1997-05-15 18:32:10 GMT)
To: Multiple Recipients of 
Reply-To: "Martin J. Duerst" 
From: "Unicode Discussion" 
Date: Thu, 15 May 1997 11:32:09 -0700 (PDT)
Subject: Re: Line Separator Character

On Thu, 15 May 1997, Unicode Discussion wrote:

> I agree with this; the best explanation if you know HTML is: 
> 
> U+2029 PARAGRAPH       = <P>
> U+2028 LINE SEPARATOR  = <BR>
>
> As you say, there needs to be some clarification of the usage of these
> with HTML, since they occupy the same roles.

RFC 2070 has some explanation on some of the "control"-like characters
in Unicode. The main aim when working on RFC 2070 was to assure that
some basic quality of display could be achieved for a wide range of
languages, and that where possible, things could be brought in
alignment with Unicode.

Because HTML is not plain text, but plain text with markup, there are
two layers. The first is what you see in a raw text editor (you see the
markup). There are line breaks there, but to be consistent with the
rest of HTML around (according to the reference processing model
explained in RFC 2070), these have to be CR, LF, or CRLF. As explained
above, <P> and <BR> are already here for the second level (what you see
in a browser). So the above two characters never actually came into
play. If there is a need for specification, it would only be preemptive
(avoid that different people start to use it for different purposes).
There is no place where they currently would be needed.

That was different for other things, such as SHY (where we made a
recommendation in a Note) and all the "control" characters needed for
BIDI and joining (which are very instrumental for certain languages and
scripts).

Any comments welcome.

Regards,	Martin.

15-May-97 20:07:15-GMT,2569;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id QAA22628
	for ; Thu, 15 May 1997 16:07:13 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA08194; Thu, 15 May 97 11:36:22 -0700
Message-Id: <9705151836.AA08194@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Uml-Sequence: 2621 (1997-05-15 18:35:58 GMT)
To: Multiple Recipients of 
Reply-To: Glen Perkins 
From: "Unicode Discussion" 
Date: Thu, 15 May 1997 11:35:57 -0700 (PDT)
Subject: Re: Line Separator Character

Mark Davis wrote:
>
> The description of LINE SEPARATOR and PARAGRAPH SEPARATOR should be
> clear from the discussions on page 6-72 in The Unicode Standard, Version
> 2.0.
>

Actually, the description of LINE SEPARATOR doesn't seem to state
explicitly whether it means "just advance to the next line" or "both
advance to the next line *and* return to the beginning of the line":

From p. 6-72:
"A line separator indicates that a line-break should occur at this
point; although the text continues on the next line, it does not start
a new paragraph: no inter-paragraph line spacing nor paragraphic
indentation is applied."

I assume that "continues on the next line," implies "continues at the
beginning of the next line".
That's what the expression "line-break" means to me, but I'm not
completely sure that it *has* to have that meaning, and that everyone
knows that it has that meaning and no other. It probably ought to be
stated explicitly since the question of implied CR is answered
differently by unix (LF implies CR) and DOS/Win (LF has a CR welded to
it, at least implying that LF by itself wouldn't return to the
beginning of the following line.) On old line printers, I had no
trouble linefeeding without returning to the beginning of the line,
though I've long since forgotten the char used to do so (I was but a
child.) ;-)

This may just be a nit, but while I'm at it, the definition of the PS
includes "this *could* cause, *for example*,..." [emphasis mine.] That
sounds as though the PS could just as easily "cause, for example"
something else, so maybe the specific behavior of the LS is also "left
as an exercise for the reader." Perhaps it *could* include an implied
CR in one implementation and not in another, both conforming to the
standard.

What was the actual intent?

__Glen Perkins__
glen.perkins@NativeGuide.com

16-May-97 1:29:53-GMT,1364;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id VAA14542
	for ; Thu, 15 May 1997 21:29:52 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA08926; Thu, 15 May 97 14:04:53 -0700
Message-Id: <9705152104.AA08926@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2624 (1997-05-15 21:01:51 GMT)
To: Multiple Recipients of 
Reply-To: kenw@sybase.com (Kenneth Whistler)
From: "Unicode Discussion" 
Date: Thu, 15 May 1997 14:01:49 -0700 (PDT)
Subject: Re: Line Separator Character

I'll try one more time.

U+2028 LINE SEPARATOR indicates the separation of lines.

A formatter of Unicode plain text then does with the separated lines
what it will do with separated lines.

It is not intended to be abstruse.
And it should not be considered in the same context as the complexity
caused by the intermingling of device control semantics of CR and LF
(which after all came from the world of *physical* TTY platen and print
head control) and the text formatting semantics of CR, LF, and/or CRLF
in Mac, Unix, and/or the DOS/Win worlds as EOL, EOP, newline, and/or
line separators.

It is precisely because CR and LF are such a mess that Unicode has a
LINE SEPARATOR and a PARAGRAPH SEPARATOR distinctly encoded.

--Ken

16-May-97 1:30:04-GMT,1690;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id VAA14569
	for ; Thu, 15 May 1997 21:30:00 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA08662; Thu, 15 May 97 13:07:09 -0700
Message-Id: <9705152007.AA08662@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Uml-Sequence: 2623 (1997-05-15 20:06:13 GMT)
To: Multiple Recipients of 
Reply-To: John Cowan 
From: "Unicode Discussion" 
Date: Thu, 15 May 1997 13:06:11 -0700 (PDT)
Subject: Re: Line Separator Character

Martin J. Duerst wrote:

> That was different for other things, such as SHY (where we
> made a recommendation in a Note) and all the "control"
> characters needed for BIDI and joining (which are very
> instrumental for certain languages and scripts).

Line Separator and Paragraph Separator are essential for BIDI.
Paragraph Separator delimits the maximum scope of text that the BIDI
algorithm must consider all at once (roughly stated: even if the line
width is infinite, paragraphs are still stacked top to bottom, so there
is no need to reverse any text across a paragraph mark). Line Separator
also significantly affects BIDI behavior.

That said, I think that the suggestion that LS = <BR> and PS = <P> is
very sensible, and BIDI HTML renderers should be licensed to treat <P>
as PS and <BR> as LS for BIDI purposes. (This would be a "higher-level
protocol" within the meaning of Unicode 2.0.)

-- 
John Cowan						cowan@ccil.org
			e'osai ko sarji la lojban

16-May-97 1:30:04-GMT,3981;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id VAA14619
	for ; Thu, 15 May 1997 21:30:03 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA08530; Thu, 15 May 97 12:44:39 -0700
Message-Id: <9705151944.AA08530@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2622 (1997-05-15 19:44:16 GMT)
To: Multiple Recipients of 
Reply-To: Murray Sargent 
From: "Unicode Discussion" 
Date: Thu, 15 May 1997 12:44:14 -0700 (PDT)
Subject: FW: Line Separator Character

> I can report what MS Word and some other MS software products do.
> Microsoft text software typically follows Word's lead. On PCs,
> Unicode and ANSI plain text files use CRLF for the End Of Paragraph
> (EOP) mark. This is a little different in function from the Unicode
> Paragraph Separator (U+2029), since it can exist without being
> followed by another paragraph. On the Mac, the plain-text EOP is just
> a CR, whereas on Unix it's just a LF. Word97 accepts files with all
> three of these choices (but not U+2029, which doesn't translate,
> sigh), and translates them to a CR for internal use (including in its
> object model) and in Word's .doc file format. Word uses VT (0xB) for
> a line separator. This is handy, e.g., when you have numbered
> paragraphs and would like to insert a paragraph without the leading
> number.
>
> In RTF (Word's Rich Text Format), CRLFs are used for readability only,
> with \par representing the EOP and \line representing the line
> separator. Similarly, HTML uses CRLFs for readability only, using
> <BR> for the line separator and various paragraph tags for paragraph
> identification. For these rich-text formats, the Unicode PS and LS
> have no defined role and really shouldn't even be used.
>
> One advantage of using LF through CR for EOP, etc., is that
> they're relatively efficient to parse: you can single them out as a
> group with a single if statement instead of a more lengthy switch
> statement. Word uses other ASCII control characters for various
> things, e.g., 0x1F for the soft hyphen (instead of 0xAD, sigh) and 7
> for a table cell end. Using CRLF particularly for internal use is a
> real pain, since it has some of the navigation problems of DBCS. Note
> that it's more complicated to handle than the Unicode surrogates,
> since with the latter you always know whether a code is a lead word,
> trail word, or neither. With CR you have to check to see if it's
> followed by a LF. It gets worse on PCs: a "soft carriage return",
> i.e., just a word wrap point is represented by the system edit
> controls as a CRCRLF. So before you can conclude that a CR is an EOP,
> you have to check the two characters that follow! Similarly for a LF
> you have to check the preceding two characters. The silver lining in
> all of this is that it's pretty trivial to generalize such text
> software to handle the Unicode surrogates since they can tag along
> with the CRLF code, thereby keeping the caret where it belongs, etc.
>
> Personally I like Word's choices and have used them in the RichEdit
> 2.0 control, but ideally text software should recognize the Unicode
> General Punctuation symbols as well. RichEdit 2.0, for example, does
> translate U+2029/U+2028 to CR/VT, respectively, on reading in a file
> or pasting plain text. On plain-text output though, it uses CRLF and
> VT, respectively.
>
> Unfortunately at this late date, there isn't any unique approach to
> these issues.
> ASCII has been an amazingly successful character set,
> but one of its worst deficiencies has been in not specifying a single
> code for an EOP mark. Unix attempted to remedy the problem by using
> the LF, but it didn't catch on in general. My favorite among the
> alternatives is the lone CR, which as explained above is the default
> on the Mac and Word.
>
> Murray
>

16-May-97 18:09:29-GMT,2092;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id OAA29734
	for ; Fri, 16 May 1997 14:09:27 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA11467; Fri, 16 May 97 10:29:25 -0700
Message-Id: <9705161729.AA11467@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2626 (1997-05-16 17:28:03 GMT)
To: Multiple Recipients of 
Reply-To: "Pierre Lewis" 
From: "Unicode Discussion" 
Date: Fri, 16 May 1997 10:28:00 -0700 (PDT)
Subject: Re: Line Separator Character

Context: plain text unicode file.

Assuming we use LS to separate lines (I guess there's no answer to the
question "what should I use"), then doesn't that interact negatively
with bidi markup, in particular embedding markups? Ie. I have to
reestablish the proper embedding level at each line.

Say I have two lines, some English with embedded Yiddish (levels shown
here, in logical order):

   000 0000 00 00000 RLE 11 1111 NL     | English RLE Yiddish NL
   11 11111 1 11111 PDF 00 0000 ...     | Yiddish PDF English ...

Now if the newline (NL in above) is indicated by a LS (\u2028), the
bidi state is reset between the lines. If I now start the second line
with RLE (so as to say I'm reestablishing an embedding level), I can no
longer tell whether I have one embedded segment or two (with a 0-level
space between, where the LS is). Could be an issue if I later reformat
(reflow) this text (as I might want to do in an editor).
As a matter of fact, if the second line (after LS) starts with a strong
R2L character and I don't reissue RLE, won't the base level be set to 1?
This would put the following English at level 2 (not intended as the
English isn't embedded in the Yiddish here, but the other way around).

These problems go away if I use any combinations of CR/LF to indicate
newline.

Another question: does PS imply LS? Or would I end a paragraph with
LS PS? I presume it does.

Thanks in advance for any clarifications.

Pierre
lew@nortel.ca

16-May-97 19:12:00-GMT,1113;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id PAA08918
	for ; Fri, 16 May 1997 15:11:48 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA11815; Fri, 16 May 97 11:49:36 -0700
Message-Id: <9705161849.AA11815@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2627 (1997-05-16 18:49:12 GMT)
To: Multiple Recipients of 
Reply-To: kenw@sybase.com (Kenneth Whistler)
From: "Unicode Discussion" 
Date: Fri, 16 May 1997 11:49:10 -0700 (PDT)
Subject: Re: Line Separator Character

Pierre,

I'll let the bidi experts respond re the first part of your query.

> Another question: does PS imply LS?

Presence of a paragraph separator would imply a line break. It does
not imply a LS character.

> Or would I end a paragraph with LS PS?

No. You could, but it would imply presence of a blank line before the
end of the paragraph. And keep in mind these are *separators*. You
don't end a paragraph with anything. You separate two paragraphs by use
of a PS.
--Ken

16-May-97 22:00:09-GMT,14624;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id SAA11638
	for ; Fri, 16 May 1997 18:00:04 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA12353; Fri, 16 May 97 13:10:08 -0700
Message-Id: <9705162010.AA12353@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Uml-Sequence: 2630 (1997-05-16 20:09:43 GMT)
To: Multiple Recipients of 
Reply-To: Mark Davis 
From: "Unicode Discussion" 
Date: Fri, 16 May 1997 13:09:42 -0700 (PDT)
Subject: Re: Line Separator Character

Pierre,

Doug Felt here at Taligent was kind enough to take a pass at answering
your questions. His comments are marked with "**". I have added on in a
few places, marked with "@@", but haven't looked at the examples as
carefully as Doug.

Mark
===========================================

All,

I've been trying to get a clear picture of what a "plain-text unicode
file" should look like (wrt control chars, bidi markup, &c.).

By "plain-text unicode file" I mean something that would be output by a
plain-text editor, eg. a Unicode-capable vi (Unix) or brief (DOS). No
HTML or Web implications (altho such an editor could certainly be used
to prepare multi-lingual Web pages).

I have prepared a short text (not semantically very meaningful) with
mixed directionalities so I can ask some concrete questions. I took the
liberty to attach the GIF to this message (about same size as the
text). Postscript and GIF versions of this text can also be seen at URL

   http://www.centrcn.umontreal.ca/~lewis/LJL/uniplain.html

Below, the text is shown in logical order (and all in English), with an
indication of the language in the postscript page (A=Arabic, E=English,
F=French, G=German, Y=Yiddish), and what I believe the levels should be.

   Some examples of dates. In Yiddish, "Monday, the 24th February 1997".
1  E................................E Y............................Y
   000000000000000000000000000000000000011111111111122111111111111222200

   In German, "Monday, the 24th Febrary 1997".
2  E.......E G...........................G
   0000000000002222222222222222222222222222200

   In Arabic, "Saturday March 90\3\10" (March 10, 1990)
3  E.......E A....................A E............E
   0000000000001111111111111112222222000000000000000000

   "Shindler's List", so is called my favorite film. The jew has in the
4  E.............E Y...............................................Y
   12222222222222221111111111111111111111111111111111111111111111111111

   ring written: "All who preserve one soul of Israel the book makes up to
5  Y..........Y H......................................................H
   11111111111111133333333333333333333333333333333333333333333333333333333

   him as if he preserved a whole world.".
6  H..................................H
   333333333333333333333333333333333333111

   The guest has been in Berlin. He has said: "I am 49 years
7  Y........................................Y G...........G
   111111111111111111111111111111111111111111112222222222222

   old and am called Boutros". This means in Yiddish: "I am old 49 years and
8  G...............G A.....A Y...................Y Y...................Y
   2222222222222222223333333111111111111111111111111111111111111221111111111

   am called Boutros" (Pierre in French).
9  Y.......Y B.....B F....F Y.......Y
   11111111113333333111222222111111111111

Notes:

o Translations are fairly literal (and not always very accurate): just
  for general orientation. And there are surely imperfections in all
  but the French (with just my name, I'm pretty safe here).

o line 3: I'm not too sure what the logical order of the date in Arabic
  is. Could be 10\3\90 (levels 2212122 -- three level-2 numbers
  separated by level-1 backslashes) or 90\3\10 (all at level 2). Not
  too sure of the exact translation of words either.

** The logical order is, in general, the spoken order.
** The fields of the date
** would probably appear in the order the putative speaker would say them,
** however this is one place where writing and speaking can diverge. Here
** it depends on the order in which the putative speaker would type them.
** My description of what follows assumes the order you present is correct,
** and the desired appearance is what you present on your web site.
**
** Now as to the levels: This is very long, bear with me.
**
** Solidus (Slash) U+002F is a European Number Separator (ES).
** Reverse Solidus (Backslash) U+005C is Other Neutral (ON). You use
** reverse solidus but I'm not sure if this is to represent mirroring (neither
** character is mirrored). Either way, neither is a strong directional
** character.
**
** If the digits are Roman, by rule P0 all these numbers are treated as
** Arabic Numerals because the preceding strong directional character
** is Arabic text (the 'h' in March). You may have intended them to be
** Arabic-Indic digits from the start. Either way, the digits are AN.
**
** If you intended Solidus (ES) this is converted to ON by rule P3. So
** either solidus or reverse solidus is ON.
**
** ON between AN is converted to R by rule N3(c).
**
** The quoted string on line 3 is thus "L R... AN AN R AN R AN AN L" where
** the L characters are the quote marks surrounding the text. The
** base line direction is LTR because of the initial L (Roman 'I'), so
** the base level is 0. In rule I1 the levels thus become
** "0 1... 2 2 1 2 1 2 2 0". By application of rule L2 this first becomes
** "Saturday March 09\3\01" as the level 2 runs are reversed, then
** "10\3\90 hcraM yadrutaS" as the levels 1&2 run is reversed.
**
** This is not consistent with the output on your web page. To force the
** date to be formatted left to right assuming this logical order, you'd
** need to force all date characters to L.
** This can be done either using an LRM
** before the first Roman digit, if the digits are roman, or by surrounding
** the date with LRO..PDF, if the digits are arabic-indic. Note that LRE
** won't work because the reverse solidus, being between two AN, would
** still convert to R, instead of L as desired.
**
** For example, using "Saturday March [LRE]90\3\10[PDF]",
** assuming Arabic-indic digits, would resolve the levels to
** 01111111111111112443434420, progressively resulting in
**   "Saturday March 09\3\01" -- level 4 reversed
**   "Saturday March 10\3\90" -- levels 3 and above reversed
**   "Saturday March 09\3\01" -- levels 2 and above reversed
**   "10\3\90 hcraM yadrutaS" -- levels 1 and above reversed
** This is a direct result of the fact that the date is not a
** solid run of left-to-right text, because the solidus is still R.
**
** "Saturday March [LRO]90\3\10[PDF]" however would resolve to
** 01111111111111112222222220, progressively resulting in
**   "Saturday March 01\3\09" -- level 2 reversed
**   "90\3\10 hcraM yadrutaS" -- level 1 reversed.

o Quotes aren't the right ones (some should be low quotes, ...).

Questions

1) Do the levels in the above make sense (plus/minus some punctuation)?
   It may be that I've totally misunderstood levels.

** Generally, they make sense, see my discussion above. Text does not
** necessarily change level simply because of a quotation, or because of
** a change in language. So in line 2, the level wouldn't change simply
** because of a switch from English to German, since the German
** characters would be L. Only LRE or LRO would do that. Since you
** don't indicate strong formatting characters, I'd have to assume they
** were present to force the levels you indicate.

2) When embedding L2R in L2R (eg German in English, line 2) or R2L in
   R2L (eg. Arabic in Yiddish, line 9, or Hebrew in Yiddish, line 5),
   should I use LRE/PDF and RLE/PDF (even though the direction doesn't
   change)?

** Generally, you wouldn't need to.

3) The second and third paragraphs are right-aligned (R2L main
   direction). How do I indicate this? I thought of making each
   paragraph a block (separating them with PS, paragraph separator),
   and starting each block with a strong char of the appropriate
   directionality. In the second paragraph, this would mean starting
   the block with RLM (since the first letters are English). Ie. if
   base level is odd, main directionality is R2L and the text is right
   aligned. Or, other possibility, starting a right-adjusted paragraph
   with RLE? But then what about a left-adjusted paragraph that starts
   with R2L text.

** Either way would work. Alignment depends on the base line direction,
** which is determined by the first strong character in the block. The
** explicit directional formatting codes LRE, RLE, LRO, RLO as well as
** RLM and LRM are all strong directional characters. LTR text within
** a RLE embedding will still format LTR, but the overall run of text
** within the embedding will be RTL.

4) What should I use to separate lines? LS or CR or LF or CR/LF? If I
   use LS, which is a block separator, doesn't that interact negatively
   with bidi markup (control chars), in particular embedding markups?
   Ie. I have to reestablish the proper level at each line. And what
   happens with right alignment? Couldn't this cause confusion. If I
   have two lines (in logical order)

   000 0000 00 00000 RLE 11 1111 LS     | English RLE Yiddish LS
   11 11111 1 11111 00 0000 ...         | Yiddish English ...

   and reissue an RLE at start of second, I can no longer tell whether
   I have one embedded segment or two (with a 0-level space between,
   where the LS is). Could be an issue if I later reformat (reflow)
   this text (as I might want to do in an editor).

   As a matter of fact, if the second line (after LS) starts with a
   strong R2L character and I don't reissue RLE, won't the base level
   be set to 1?
   This would put the following English at level 2 (not intended as
   the English isn't embedded in the Yiddish here, but the other way
   around).

   (I haven't read the recent thread on LS very carefully yet, but
   it's not too reassuring: lots of opinions)

@@ The standard is pretty clear. Most of those opinions are from people
@@ who have not read it. Think of these characters in terms of what you
@@ use in a word processor.
@@ For Microsoft Word or FrontPage, think of LS as the
@@ character that you get with shift-Return
@@ (causing no paragraph spacing or indent),
@@ and PS as what you get with Return.
@@ (on the Mac, this would be option-Return).

** This is a good observation! We believe the current standard is in
** error and should categorize LS as whitespace instead of as a block
** separator.
**
** This would allow LS characters to be inserted wherever whitespace
** appears and not interfere with explicit formatting codes.
**
** That said, the explicit formatting codes are basically intended for static
** text interchange only. They pose several problems for editing. One is that it
** is easy to radically alter the text by inserting, copying, or deleting
** one of these codes. This can reorder the text within the block and
** completely change the text on several lines. Similarly, the default
** base line direction rule can be problematic, as changes to the text at
** the start of a block can change the base line direction. Users might
** have difficulty editing unless the editor provides some support (such
** as assisting the user to insert/delete explicit formatting codes and
** their matching PDFs as a unit).

@@ For actual editing of text with different directions, it is far easier to have
@@ out-of-band style information with explicit embedding levels,
@@ as mentioned briefly on page 3-22.

**
** Additionally, text reordering after levels are computed is done on a
** line by line basis.
** Depending on where line breaks occur, different
** text may appear on a line, and in different orders. This is independent
** of the issue of how to represent line breaks-- if they are represented
** external to the text (a line break table, based on wrapping to some
** width or character count, say) this still happens. This makes rebreaking
** lines somewhat more of an issue than it is with ASCII text.
**

5) Does PS imply LS? Or would I end a paragraph with LS PS?

** Yes, use only PS to separate paragraphs.

6) Imagine I want to start the third paragraph on a new page. Where do
   I put the FF (wrt the LS/CR/LF/ and bidi markup in the vicinity)?

** FF is higher-level formatting, you'd have to interpret it separately.

@@ In particular, you would definitely interpret it as a block separator.

7) Any specific bidi markup required around the numerals? In the Arabic
   date: if levels intended are 2212122, would I need extra markup? I
   would think I would need:

      LRO number PDF \ LRO number PDF \ LRO number PDF

   (so that the \s, which are "other neutral", stay at level 1)?

** Almost, see my example above. In your example, the separate runs
** of LTR text would occur in RTL order, reversing the year and day of
** the date from what your example shows.

8) What is the intent (as opposed to the effect which the algo surely
   makes clear) of RLE and LRE? When are they useful? (Relates to
   question 1).

** Quoted text where the text itself contains mixed directions is a common
** case. You can see it (implicitly) in the examples for rule L2. The quotes
** logically belong to the surrounding text, and the embedding codes are
** just inside the quotes.

@@ In the vast majority of cases, it is not necessary. The important cases are
@@ those that Doug mentioned.
@@ RLO and LRO are even more infrequent, and are designed to allow for cases
@@ such as part numbers with mixed numbers and letters, where the character
@@ order is forced.

9) A typesetting question.
   Where do quotes belong in mixed-directionality texts (eg. in line
   7)? Should they be at the same level as the text introducing the
   quote? Or at the level of the text being quoted. On line 7, should
   the quote be at the end of the line instead of where I put it (in
   the PS file)? Can't say I'm comfortable with either solution. And
   what style of quotes does one use? That of the quoting or of the
   quoted language?

** Quotes are at the same level as the text introducing the quote.

@@ In general, you expect the style of the quotes to be the same as the containing
@@ text, not the embedded text. However, that is up to the user's choice.

Thanks in advance for any clarifications.

Pierre
lew@nortel.ca

16-May-97 22:09:47-GMT,4669;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id SAA13190
	for ; Fri, 16 May 1997 18:09:45 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA12505; Fri, 16 May 97 13:19:41 -0700
Message-Id: <9705162019.AA12505@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Uml-Sequence: 2632 (1997-05-16 20:19:26 GMT)
To: Multiple Recipients of 
Reply-To: "Martin J. Duerst" 
From: "Unicode Discussion" 
Date: Fri, 16 May 1997 13:19:24 -0700 (PDT)
Subject: Re: Line Separator Character

On Fri, 16 May 1997, Pierre Lewis wrote:

> Context: plain text unicode file.

There are basically two models of plain text. The first is
line-oriented, the second is paragraph-oriented. Email or program code
is the traditional example of line-oriented plain text. Descriptive
text as it appears in word processors, minus formatting, is the typical
example of paragraph-oriented plain text.

In traditional encoding (using CR/LF/CRLF) and in "official" Unicode
encoding (using PS), the two models are made compatible by treating
each line in the line-oriented plain text as a paragraph.
On the other hand, the paragraph-oriented model can be reduced to the line-oriented model by splitting lines in a particular layout of the paragraph. This splitting is again done by paragraph separators (CR/LF/CRLF/PS), and not by LS. LS is only used for certain effects in the paragraph-oriented model that occur inside a paragraph. For example, I use it in some word processors to start a new line without having the last line aligned left in a justified paragraph and/or without having the new line indented like the first line of a paragraph. The use to avoid paragraph interspacing has also been mentioned. In summary, LS is an advanced device for paragraph-oriented plain text, and not to be used for line-oriented plain text. That said, let's now look at BIDI: > Assuming we use LS to separate lines (I guess there's no answer to the > question "what should I use"), then doesn't that interact negatively > with bidi markup, in particular embedding markups? Ie. I have to > reestablish the proper embedding level at each line. > > Say I have two lines, some English with embedded Yiddish (levels shown > here, in logical order): > 000 0000 00 00000 RLE 11 1111 NL | English RLE Yiddish NL > 11 11111 1 11111 PDF 00 0000 ... | Yiddish PDF English ... > > Now if the newline (NL in above) is indicated by a LS (\u2028), the > bidi state is reset between the lines. If I now start the second line > with RLE (so as to say I'm reestablishing an embedding level), I can no > longer tell whether I have one embedded segment or two (with a 0-level > space between, where the LS is). Could be an issue if I later reformat > (reflow) this text (as I might want to do in an editor). > > As a matter of fact, if the second line (after LS) starts with a strong > R2L character and I don't reissue RLE, won't the base level be set to 1? > This would put the following English at level 2 (not intended as the > English isn't embedded in the Yiddish here, but the other way around).
LS is defined as a block separator, so you are right. When you insert an LS to split the lines, your application could insert arbitrary additional codepoints such as RLE. What it does insert (or not) is outside of the Unicode BIDI spec, which only describes static behaviour (what has to happen when the insertions are done), and not dynamic interactive behaviour (which can be a lot more complex if you want it to follow user's expectations, and given that static BIDI is already difficult, I hope you get the point :-). But when you edit BIDI text, you really should work with paragraph-oriented plain text, without additional LSs. Then everything will run more or less smoothly. Reformatting (reflow) is done automatically and correctly. In those cases where you indeed insert LSs, they will in most cases not be in the middle of text, but at some logical interruption point, without the need for frequent reflow. > These problems go away if I use any combinations of CR/LF to indicate > newline. This might be a solution for some very special cases. But in general, for BIDI you should use paragraph-oriented plain text, with CR/LF/ CRLF/PS as paragraph separators. I'm pretty sure that when Microsoft implements BIDI (or the way they already do it), they will treat CR (what they use internally) as a block separator in the BIDI algorithm. Regards, Martin. 16-May-97 22:25:22-GMT,2878;000000000001 Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id SAA15567 for ; Fri, 16 May 1997 18:25:21 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA12695; Fri, 16 May 97 13:39:12 -0700 Message-Id: <9705162039.AA12695@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 2634 (1997-05-16 20:38:15 GMT) To: Multiple Recipients of Reply-To: "Martin J. 
Duerst" From: "Unicode Discussion" Date: Fri, 16 May 1997 13:38:13 -0700 (PDT) Subject: Re: Line Separator character On Wed, 14 May 1997, Adrian Havill wrote: > Martin J. Duerst wrote: > > Email has very strict restrictions on this. You can't send doublebyte > > UTF-16 or UCS-2 in Email. CRLF always has to be present as a line > > separator. Unicode in Email is possible with UTF-7 (and CRLF as line > > separator) or UTF-8 + BASE64/QuotedPrintable (and CRLF...). > > Please see RFC 2045/6/7 for this. > > I'm aware of this. Allow me to clarify: encode the Unicode line and > paragraph separators in UTF-7 and transmit no CR and LFs. Some > protocols, such as SMTP, have a line limit (998 octets in the case of > SMTP). SMTP email requires that line breaks be encoded as CRLF for all things that are text (i.e. Content-Type: text/*). The user (or the user agent) is also asked to limit line length to something like 80 characters (actually 80 bytes). > However, as the behavior of CR and LF is system dependent, an e-mail > client could theoretically ignore CR LF, etc and go by the UTF-7 encoded > Unicode line and paragraph breaks, when CR and LF are system dependent, but in mail, it's always CRLF, and mail user agents do the conversion. > RFC2046 says '[i]t should not be necessary to add any line breaks to > display "text/plain" correctly....' That's because text/plain (and all of text/*) is already defined to have these as CRLF, at 'short' intervals. > So why not NOT use them and go with > the Unicode ones? Because that may (or actually will) break some mail software. I know many people don't like that (I don't either), but some things in Internet mail are braindead, and will stay braindead. Too many influential people are too used to the way things are, and too many people are afraid of some software failing to work.
Of course, what you can do is to have your local user agent change from CRLF to whatever line breaking convention you use locally, which might very well be the "true" Unicode codes. > As there are few legacy Unicode-capable e-mail clients, is it not > possible to push to get this functionality added now? The problem is not the clients. The problem is all the software that the mail passes from one client to the other. Regards, Martin. 17-May-97 21:28:56-GMT,4627;000000000011 Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id RAA05910 for ; Sat, 17 May 1997 17:28:55 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA15437; Sat, 17 May 97 14:09:06 -0700 Message-Id: <9705172109.AA15437@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 2642 (1997-05-17 21:08:44 GMT) To: Multiple Recipients of Reply-To: Edward Cherlin From: "Unicode Discussion" Date: Sat, 17 May 1997 14:08:43 -0700 (PDT) Subject: Re: Line Separator Character "Martin J. Duerst" wrote: >On Fri, 16 May 1997, Pierre Lewis wrote: > > >> Context: plain text unicode file. > >There are basically two models of plain text. The first is line-oriented, >the second is paragraph-oriented. Email or programm code is the traditional >example of line-oriented plain text. Descriptive text as it appears in >word processors, minus formatting, is the typical example of paragraph- >oriented plain text. > >In traditional encoding (using CR/LF/CRLF) and in "official" Unicode >encoding (using PS), the two models are made compatible by treating >each line in the line-oriented plain text as a paragraph. On the other >hand, the paragraph-oriented model can be reduced to the line-oriented >model by splitting lines in a particular layout of the paragraph. >This splitting is again done by paragraph separators (CR/LF/CRLF/PS), >and not by LS. 
There are actually several other models for files of 7-bit or 8-bit character codes, commonly, but misleadingly, known as ASCII text files. The original model was control of a Teletype machine, where several control characters called for physical movement of the mechanism. Many of the bad habits used in text files are survivals of this model. Others, fortunately, have died out. (I am thinking of some of the uses of control characters in editors meant for hard copy terminals.) CRLF was *required* to initiate a new line, but CR by itself was sometimes used for overstriking (if BS was not available), including underlining and composition of APL characters, and also for imitating typewriter overstrikes such as c| for the cent sign and some accented letters such as u" or e`. HT and FF were very commonly used, and some others, such as SI and SO, less so, but each of these specified a mechanical action. SI and SO allowed a fairly standard way to control some dual-script devices including ASCII/Arabic, ASCII/Cyrillic, APL/ASCII, and other combinations. Many devices used ASCII control characters for new purposes, so that an ASCII character string could specify the hardware behavior needed for bold facing and so on. The actual process of printing might call for translation from a 'text file' to an ASCII command string file which would produce the same printed image by other means. For example, a printer driver for a bidirectional printer could save time by printing alternate lines in reverse order, with LF and some spacing commands between lines. We then had the glass Teletype, or dumb terminal, model, which might treat CR and LF as on mechanical devices, or might treat them both as new line characters, or might do something else. At the same time, 'text files' could still be used to control electronic printers, with varying interpretations of some of the control characters. 
Now, on computers with GUIs, we have different systems that expect CR, or LF, or CRLF, as the new line signal, and have other interpretations of other control characters. System software vendors are going off in all directions inventing new misinterpretations of Unicode characters and constructing yet other file designs. We want to have a uniform, portable definition of the meaning of a file of 16-bit character codes interpreted as Unicode, or "Unicode text file" for short. At the same time, we have several uses for such files, where different interpretations may be desired. If we want to do this right, I think we have to find the appropriate organization for defining such file formats and uses, and get down to some serious and at times difficult standard making. The Unicode character code standard does not seem to be the right place to do this. -- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein 17-May-97 23:00:51-GMT,6375;000000000001 Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id TAA21108 for ; Sat, 17 May 1997 19:00:50 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA15658; Sat, 17 May 97 15:40:09 -0700 Message-Id: <9705172240.AA15658@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2643 (1997-05-17 22:39:47 GMT) To: Multiple Recipients of Reply-To: Frank da Cruz From: "Unicode Discussion" Date: Sat, 17 May 1997 15:39:45 -0700 (PDT) Subject: Re: Line Separator Character > There are actually several other models for files of 7-bit or 8-bit > character codes, commonly, but misleadingly, known as ASCII text files. > > The original model was control of a Teletype machine, where several control > characters called for physical movement of the mechanism. 
Many of the bad > habits used in text files are survivals of this model. > I wouldn't call them bad habits necessarily. The primary bone of contention here is the distinction between LF and CR... > CRLF was *required* to initiate a new line, but CR by itself was sometimes > used for overstriking (if BS was not available), including underlining and > composition ... > Right. And LF was used by itself to go down one row. > We then had the glass Teletype, or dumb terminal, model, which might treat > CR and LF as on mechanical devices, or might treat them both as new line > characters... > Actually I think that practically all CRTs treat CR and LF just as the TTY did. CR positions the cursor to the left of the current row, LF moves it down one row. > Now, on computers with GUIs, we have different systems that expect CR, or > LF, or CRLF, as the new line signal, and have other interpretations of > other control characters. > Really the problem started when the UNIX designers decided that it was a good idea to have a storage model that was different than the transmission model. This allowed some space to be saved on disk, and it made text processing software a bit easier to write. However, it complicated the tty driver by requiring it to substitute CRLF for LF when displaying text files, which in turn has led to all sorts of confusion about "raw" vs "cooked" mode, etc, and the related distinction between NVT vs binary mode in Telnet protocol. (It is a simplification that UNIX was the first disk operating system to store textual files differently than it transmitted them, but it may have been the first *stream-oriented* one to do so -- or at least the one we remember.) Thus CRLF has always been the line terminator in ASCII (in the broad sense of "not EBCDIC") text transmission. Systems that chose to use different internal representations have had the obligation to convert back and forth during transmission.
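The conversion obligation described above can be sketched in a few lines. This is only an illustration, assuming a system that stores lines terminated by LF and must put CRLF on the wire; the names `to_wire` and `from_wire` are hypothetical and do not come from any system mentioned in this thread.

```python
def to_wire(stored: bytes) -> bytes:
    """Convert LF-terminated stored text to CRLF for transmission."""
    # Collapse any CRLF already present first, so the output never
    # contains CR CR LF. A bare CR (e.g. for overstriking) is left alone.
    return stored.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def from_wire(received: bytes) -> bytes:
    """Convert CRLF-terminated transmitted text back to LF for storage."""
    return received.replace(b"\r\n", b"\n")
```

Text that is stored with LF only survives a round trip through both functions unchanged, which is the property a converting system has to guarantee.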
It's interesting to speculate how different the world (of computing) might be today if only a few arbitrary and perhaps whimsical decisions had been made differently decades ago: if UNIX and several other popular platforms had used CRLF rather than LF (or CR) as the line terminator; if DOS had used "forward slash" (/) rather than "backward slash" (\) as the directory separator... How many person-eons of effort have gone into addressing the consequences of these decisions... > HT and FF were very commonly used... > (And still are...) Now there's an interesting point. Unicode has addressed the CR/LF/CRLF confusion with LS and PS, but what about formfeed? Isn't it sometimes just as necessary to specify a hard page break as it is to specify a hard line or paragraph break? I suppose there must be a boundary somewhere between "Trust your rendering engine" and "Mother, Please! I'd rather do it myself!" I don't have a copy handy, and I might be entirely wrong about this, but isn't the Holy Koran a document that must be paginated in a specific way? In any case, the strong Use-A-GUI thrust of Unicode will make it increasingly difficult for certain kinds of people to operate in the ways to which they have become accustomed over the past decades in which plain text was "good enough" save that one could not put lots of languages into it. For example, today I can write a letter that spills over to one or more "second sheets" in plain text and print it on a plain-text printer without a second thought, using any software at all on any platform, embedding hard line, paragraph, and page breaks in it, just as most of us still do with email (except for the page breaks). No "templates", "wizards", "profiles", "preferences", or "Buzzword-1.0 Compliance" involved. I can move this letter to practically any other platform and it will still be perfectly legible and printable -- no export or import or conversion or version skew to worry about. 
I think a lot of people would be perfectly happy to do the same in a plain-text Unicode world using plain-text Unicode terminals and printers, if there were such things. But there's a bigger issue... The idea that one must embed Unicode in a higher level wrapper (e.g. a Microsoft Word document, or even HTML) to make it useful has a certain frightening consequence: the loss of any expectancy of longevity for our new breed of documents. These higher-level systems will be overwhelmingly proprietary due to the vast amount of coding that must go into them, the voracious nature of the marketplace, etc, and so formats will become obsolete with ever-increasing frequency, and it will become ever harder to extract the plain-text characters -- the substance -- from them. That which is perceived at a critical moment in time to be worthy of preservation will be converted to the new format, the rest discarded or left for decipherment by future generations of information archaeologists. (If you don't believe this is a problem, think about what is happening to our (physical) libraries all over the world at this moment -- get ready to say goodbye forever to five millennia of history that was not worth digitizing.) (And then to do it all over again when the digital formats and media need conversion in another ten years.) (And then again five years after that, etc...)
So let's do our part and make some effort to accommodate traditional plain-text applications in Unicode, rather than discourage them :-) - Crank (Oops, I mean Frank) 18-May-97 0:13:19-GMT,2045;000000000001 Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id UAA29722 for ; Sat, 17 May 1997 20:13:18 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA15880; Sat, 17 May 97 16:56:42 -0700 Message-Id: <9705172356.AA15880@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2644 (1997-05-17 23:56:16 GMT) To: Multiple Recipients of Reply-To: Terry Allen From: "Unicode Discussion" Date: Sat, 17 May 1997 16:56:15 -0700 (PDT) Subject: Re: Line Separator Character Frank da Cruz asked: >(And still are...) Now there's an interesting point. Unicode has addressed the CR/LF/CRLF confusion with LS and PS, but what about formfeed? Isn't it sometimes just as necessary to specify a hard page break as it is to specify a hard line or paragraph break? I suppose there must be a boundary somewhere between "Trust your rendering engine" and "Mother, Please! I'd rather do it myself!" I don't have a copy handy, and I might be entirely wrong about this, but isn't the Holy Koran a document that must be paginated in a specific way? It isn't. My Egyptian Qur'an is one continuous text flow; the heading of a surah may even occur right at the bottom of a page. But there are such documents; the example of legal documents was brought up recently wrt SGML style sheets. From an SGML point of view, I want to separate lines and paragraphs in my SGML markup. That's how I'd expect to obtain longevity for the text, not through LS and PS. CR and LF and SGML's difficulty in dealing with them (now redressed partially in XML) are bad enough. In SGML I can't see using LS or PS.
Regards (and thanks for an interesting discussion), Terry Allen Electronic Publishing Consultant tallen[at]sonic.net http://www.sonic.net/~tallen/ Davenport and DocBook: http://www.ora.com/davenport/index.html T.A. at Passage Systems: terry.allen[at]passage.com 18-May-97 8:11:08-GMT,1439;000000000011 Received: from mtshasta.snowcrest.net (mtshasta.snowcrest.net [206.245.192.1]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id EAA07970 for ; Sun, 18 May 1997 04:11:06 -0400 (EDT) Received: from [206.245.192.57] (ttyD0.mtshasta.snowcrest.net [206.245.192.32]) by mtshasta.snowcrest.net (8.8.5/8.6.5) with ESMTP id BAA00515 for ; Sun, 18 May 1997 01:11:02 -0700 (PDT) X-Sender: cherlin@snowcrest.net Message-Id: In-Reply-To: References: Your message of Sat, 17 May 1997 14:08:43 -0700 (PDT) Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 17 May 1997 18:52:05 -0700 To: Frank da Cruz From: Edward Cherlin Subject: Re: Line Separator Character You wrote: [snip] >So let's do our part and make some effort to accommodate traditional >plain-text applications in Unicode, rather than discourage them :-) > >- Crank (Oops, I mean Frank) As you say. So do you think my suggestion of a formal standard for Unicode text files has merit? -- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. 
http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein 18-May-97 15:40:32-GMT,1713;000000000001 Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id LAA21787; Sun, 18 May 1997 11:40:31 -0400 (EDT) Date: Sun, 18 May 97 11:40:30 EDT From: Frank da Cruz To: Edward Cherlin Subject: Re: Line Separator Character In-Reply-To: Your message of Sat, 17 May 1997 14:08:43 -0700 (PDT) Message-ID: Oops, never mind -- it was this: > We want to have a uniform, portable definition of the meaning of a file of > 16-bit character codes interpreted as Unicode, or "Unicode text file" for > short. At the same time, we have several uses for such files, where > different interpretations may be desired. If we want to do this right, I > think we have to find the appropriate organization for defining such file > formats and uses, and get down to some serious and at times difficult > standard making. The Unicode character code standard does not seem to be > the right place to do this. > I'm not sure what you're after. I'm mainly concerned about the continued viability of files containing only graphic characters, spaces, line breaks, paragraph breaks, and formfeeds. Plain, literal text that can contain poetry, tables, source code, you name it, and stays like it is. Pretty much what we have today with 7- and 8-bit plain text, except without the confusion over CRLF/CR/LF, etc. I think that what's really valuable about these files is their self-contained and independent expressiveness -- they don't need a rendering engine, they don't need any special transport protocol -- they contain the text and the minimal control information to be transported and understood universally. 
- Frank 19-May-97 3:06:29-GMT,1723;000000000001 Received: from orpheus.amdahl.com (orpheus.amdahl.com [129.212.11.6]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id XAA09584 for ; Sun, 18 May 1997 23:06:28 -0400 (EDT) Received: from minerva.amdahl.com by orpheus.amdahl.com with smtp (Smail3.1.29.1 #3) id m0wTImI-0001JvC; Sun, 18 May 97 20:06 PDT Received: from juts.ccc.amdahl.com by minerva.amdahl.com with smtp (Smail3.1.29.1 #5) id m0wTIm0-0002ChC; Sun, 18 May 97 20:06 PDT Received: by juts.ccc.amdahl.com (/\../\ Smail3.1.14.4 #14.6) id ; Sun, 18 May 97 20:06 PDT Message-Id: Comments: Authenticated sender is From: "Tony Harminc" To: "Unicode Discussion" , fdc@watsun.cc.columbia.edu Date: Sun, 18 May 1997 23:04:41 -0400 MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-transfer-encoding: 7BIT Subject: Re: Line Separator Character Priority: normal In-reply-to: <9705172240.AA15682@unicode.org> X-mailer: Pegasus Mail for Win32 (v2.52) On 17 May 97 at 15:39, Frank da Cruz wrote: > It's interesting to speculate how different the world (of computing) might be > today if only a few arbitrary and perhaps whimsical decisions had been made > differently decades ago: if UNIX and several other popular platforms had used > CRLF rather than LF (or CR) as the line terminator; if DOS had used "forward > slash" (/) rather than "backward slash" (\) as the directory separator... How > many person-eons of effort have gone into addressing the consequences of these > decisions... If the original IBM PC had used EBCDIC instead of ASCII... 
Tony Harminc 19-May-97 17:48:10-GMT,5906;000000000001 Return-Path: Received: from halon.sybase.com (halon.sybase.com [192.138.151.33]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id NAA16182 for ; Mon, 19 May 1997 13:48:05 -0400 (EDT) Received: from smtp1.sybase.com (sybgate.sybase.com [130.214.220.35]) by halon.sybase.com (8.8.4/8.8.4) with SMTP id KAA10672; Mon, 19 May 1997 10:51:14 -0700 (PDT) Received: from birdie.sybase.com by smtp1.sybase.com (4.1/SMI-4.1/SybH3.5-030896) id AA06870; Mon, 19 May 97 10:49:25 PDT Received: by birdie.sybase.com (5.x/SMI-SVR4/SybEC3.5) id AA17679; Mon, 19 May 1997 10:47:55 -0700 Date: Mon, 19 May 1997 10:47:55 -0700 From: kenw@sybase.com (Kenneth Whistler) Message-Id: <9705191747.AA17679@birdie.sybase.com> To: fdc@watsun.cc.columbia.edu Subject: Unicode plain text (Was: Line Separator Character) Cc: unicode@unicode.org, kenw@sybase.com X-Sun-Charset: US-ASCII Crank, er... Frank, >> HT and FF were very commonly used... >> >(And still are...) Now there's an interesting point. Unicode has addressed >the CR/LF/CRLF confusion with LS and PS, but what about formfeed? Isn't it >sometimes just as necessary to specify a hard page break as it is to specify a >hard line or paragraph break? You can still use U+000C FORM FEED in Unicode plain text, and a renderer that knows about page breaks can do the "right thing", namely whatever it did with ^L for an ASCII text. FORM FEED, like HORIZONTAL TAB, was not considered to be ambiguous enough in usage (unlike CR/LF) to require any separate encoding in Unicode. > In any case, the strong Use-A-GUI thrust of Unicode will make it increasingly > difficult for certain kinds of people to operate in the ways to which they > have become accustomed over the past decades in which plain text was "good > enough" save that one could not put lots of languages into it. The goal of Unicode plain text is to recapture that portability in the encoding, but also allow you to put lots of languages into it. 
The "Use-A-GUI thrust" of Unicode acknowledges the fact that rendering of complex scripts (including the Latin script with generative use of combining marks) requires logic that is much more amenable to implementation in a GUI framework than in a terminal model. However, appropriate (and very large and useful) subsets of Unicode *can* be implemented with simple rendering models. (Cf. Windows NT until very recently. :-) ) > I can move this letter to practically any > other platform and it will still be perfectly legible and printable -- no > export or import or conversion or version skew to worry about. I think a lot > of people would be perfectly happy to do the same in a plain-text Unicode > world using plain-text Unicode terminals and printers, if there were such > things. That is exactly what Unicode plain text is all about. And, by the way, Notepad on Windows NT was pretty close to being a "plain-text Unicode terminal". > The idea that one must embed Unicode in a higher level wrapper (e.g. a > Microsoft Word document, or even HTML) to make it useful has a certain > frightening consequence: the loss of any expectancy of longevity for our new > breed of documents. There is absolutely nothing new about this. I was warning my linguistic colleagues about the longevity of their documents when they started using WordStar back around 82/83. 7-bit ASCII is the only encoding that stayed stable enough and was widely enough implemented to retain easy transmissibility across the computer generations without the intervention of information archaeologists. Well, 16-bit Unicode plain text is aimed at no less a goal than being the universal wide-ASCII plain text of the 21st century. Grumpy aside: This goal is not helped by people who treat Unicode as a standards dumping ground for assigning numbers to everybody's favorite collection of junk vaguely related to text, or who try to infiltrate mechanisms (such as language tags) that do not belong in plain text. 
> So let's do our part and make some effort to accommodate traditional > plain-text applications in Unicode, rather than discourage them :-) I agree completely. An excellent example of the appropriate place for a Unicode plain-text editor would be a Java IDE. If someone writes a good Unicode plain-text editor for such an application, it would have wider applicability. (I know I often use the editors of C++ IDE's to create (ASCII) plain text when I don't want it all gummed up as a Word or Frame document.) Ed Cherlin commented: > We want to have a uniform, portable definition of the meaning of a file of > 16-bit character codes interpreted as Unicode, or "Unicode text file" for > short. At the same time, we have several uses for such files, where > different interpretations may be desired. If we want to do this right, I > think we have to find the appropriate organization for defining such file > formats and uses, and get down to some serious and at times difficult > standard making. The Unicode character code standard does not seem to be > the right place to do this. I disagree about the last point. A Unicode plain text file consists of a stream of Unicode characters (and nothing else), interpreted according to the Unicode standard. It should be marked with an initial U+FEFF (though technically that is optional). This much is already clear from the standard, as is the usage of LINE SEPARATOR and PARAGRAPH SEPARATOR for minimal, unambiguous, plain text formatting consistent with the bidi algorithm. The situation is complicated by the two possible byte orders (which is one reason for the U+FEFF) and by the fact that the most widely implemented variant, namely that in Windows NT, chose LSB order instead of MSB order. But other than that, there is not much more to be said about a Unicode plain text file. The usefulness of the concept lies in its simplicity. 
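How an initial U+FEFF resolves the byte-order question can be shown with a short sketch. It assumes a file of 16-bit code units; the helper name `decode_unicode_text` is hypothetical, and falling back to MSB-first order when no U+FEFF is present is an assumption made for this example rather than something stated in this message.

```python
import codecs


def decode_unicode_text(data: bytes) -> str:
    """Decode 16-bit Unicode plain text, using an initial U+FEFF if present."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF: MSB first
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE: LSB first
        return data[2:].decode("utf-16-le")
    # No U+FEFF: assume MSB-first order (an assumption of this sketch).
    return data.decode("utf-16-be")
```

LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029) pass through this decoding like any other character, so the same text is recovered from either byte order.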
--Ken Whistler 20-May-97 20:29:52-GMT,4480;000000000011 Return-Path: Received: from mtshasta.snowcrest.net (mtshasta.snowcrest.net [206.245.192.1]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA02464 for ; Tue, 20 May 1997 16:29:41 -0400 (EDT) Received: from [206.245.192.36] (ttyD23.mtshasta.snowcrest.net [206.245.192.67]) by mtshasta.snowcrest.net (8.8.5/8.6.5) with ESMTP id NAA01464; Tue, 20 May 1997 13:29:30 -0700 (PDT) X-Sender: cherlin@snowcrest.net Message-Id: In-Reply-To: References: Your message of Sat, 17 May 1997 14:08:43 -0700 (PDT) Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 19 May 1997 23:57:56 -0700 To: Frank da Cruz From: Edward Cherlin Subject: Unicode plain text standard? (was Re: Line Separator Character) Cc: unicode@Unicode.ORG >Oops, never mind -- it was this: > >> We want to have a uniform, portable definition of the meaning of a file of >> 16-bit character codes interpreted as Unicode, or "Unicode text file" for >> short. At the same time, we have several uses for such files, where >> different interpretations may be desired. If we want to do this right, I >> think we have to find the appropriate organization for defining such file >> formats and uses, and get down to some serious and at times difficult >> standard making. The Unicode character code standard does not seem to be >> the right place to do this. >> >I'm not sure what you're after. I'm mainly concerned about the continued >viability of files containing only graphic characters, spaces, line breaks, >paragraph breaks, and formfeeds. Plain, literal text that can contain >poetry, tables, source code, you name it, and stays like it is. I can tell you don't know what table building in Sanskrit is like, and you don't understand BIDI direction marking. >Pretty much what we have today with 7- and 8-bit plain text, except without >the confusion over CRLF/CR/LF, etc. 
and the utter incompatibility of the extra 128 characters in the 8-bit sets between PC DOS, PC Windows, Mac, various Unix definitions, and all the other extended ASCII code sets such as PC code pages and the ISO 8859 series. Files of 8-bit characters are extremely non-portable. Having lived in Korea and Japan, and been a mathematician and APL programmer, I lost all faith in ASCII long ago. It is horribly inadequate for English, and more so for almost any other language, except for various computer programming languages and constructed languages like Lojban, which were deliberately built within the limits of ASCII, or in the old days EBCDIC. >I think that what's really valuable about >these files is their self-contained and independent expressiveness -- they >don't need a rendering engine, they don't need any special transport protocol >-- they contain the text and the minimal control information to be transported >and understood universally. >- Frank I agree on the transport protocol in principle, although today we need UTF-7, UTF-8, and other encodings, but the idea of full Unicode text without a rendering engine won't fly. That's fine for simple alphabetic scripts, and even for Chinese and Japanese. It doesn't work right for RTL scripts (Arabic and Hebrew), especially for mixtures of RTL and LTR, and for scripts that combine characters into larger groups, usually syllables. This includes Korean, all of the Indic scripts, Tibetan, and Ethiopic. Arabic script has a very large dependence on ligatures, some of them quite complex. There are also problems for rendering math expressions in plain text. Then there are various deprecated characters, the private use areas, and the surrogate character mechanism. Anyone who thought the CRLF business was bad should consider how many incompatible choices can be made in Unicode. 
Yes, it is true that the Unix file model of a sequence of uninterpreted bytes is very general, and so is a file of uninterpreted 16-bit codes, but files have to be interpreted to be useful. We gloss over the amount of interpretation we do on ASCII text files, but we cannot do that with Unicode. -- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein 20-May-97 21:39:41-GMT,7559;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id RAA20335 for ; Tue, 20 May 1997 17:39:38 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA25440; Tue, 20 May 97 13:31:38 -0700 Message-Id: <9705202031.AA25440@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 2653 (1997-05-20 20:29:36 GMT) To: Multiple Recipients of Reply-To: Edward Cherlin From: "Unicode Discussion" Date: Tue, 20 May 1997 13:29:34 -0700 (PDT) Subject: Re: Unicode plain text (Was: Line Separator Character) kenw@sybase.com (Kenneth Whistler) wrote: [snip] >You can still use U+000C FORM FEED in Unicode plain text, and a renderer that >knows about page breaks can do the "right thing", namely whatever it did with >^L for an ASCII text. FORM FEED, like HORIZONTAL TAB, was not considered to >be ambiguous enough in usage (unlike CR/LF) to require any separate encoding >in Unicode. > >> In any case, the strong Use-A-GUI thrust of Unicode will make it >>increasingly >> difficult for certain kinds of people to operate in the ways to which they >> have become accustomed over the past decades in which plain text was "good >> enough" save that one could not put lots of languages into it. 
> >The goal of Unicode plain text is to recapture that portability in the >encoding, but also allow you to put lots of languages into it. The "Use-A-GUI >thrust" of Unicode acknowledges the fact that rendering of complex scripts >(including the Latin script with generative use of combining marks) requires >logic that is much more amenable to implementation in a GUI framework than in >a terminal model. However, appropriate (and very large and useful) subsets of >Unicode *can* be implemented with simple rendering models. (Cf. Windows NT >until very recently. :-) ) > >> I can move this letter to practically any >> other platform and it will still be perfectly legible and printable -- no >> export or import or conversion or version skew to worry about. I think >>a lot >> of people would be perfectly happy to do the same in a plain-text Unicode >> world using plain-text Unicode terminals and printers, if there were such >> things. The Everson Mono fonts would suit such a product admirably, up to a point. >That is exactly what Unicode plain text is all about. And, by the way, >Notepad on Windows NT was pretty close to being a "plain-text Unicode >terminal". > >> The idea that one must embed Unicode in a higher level wrapper (e.g. a >> Microsoft Word document, or even HTML) to make it useful has a certain >> frightening consequence: the loss of any expectancy of longevity for our new >> breed of documents. > >There is absolutely nothing new about this. I was warning my linguistic >colleagues about the longevity of their documents when they started using >WordStar back around 82/83. 7-bit ASCII is the only encoding that stayed >stable enough and was widely enough implemented to retain easy >transmissibility >across the computer generations without the intervention of information >archaeologists. Well, 16-bit Unicode plain text is aimed at no less a >goal than being the universal wide-ASCII plain text of the 21st century. 
> [snip] > >> So let's do our part and make some effort to accommodate traditional >> plain-text applications in Unicode, rather than discourage them :-) > >I agree completely. An excellent example of the appropriate place for >a Unicode plain-text editor would be a Java IDE. If someone writes >a good Unicode plain-text editor for such an application, it would >have wider applicability. (I know I often use the editors of C++ >IDE's to create (ASCII) plain text when I don't want it all gummed up >as a Word or Frame document.) > >Ed Cherlin commented: > >> We want to have a uniform, portable definition of the meaning of a file of >> 16-bit character codes interpreted as Unicode, or "Unicode text file" for >> short. At the same time, we have several uses for such files, where >> different interpretations may be desired. If we want to do this right, I >> think we have to find the appropriate organization for defining such file >> formats and uses, and get down to some serious and at times difficult >> standard making. The Unicode character code standard does not seem to be >> the right place to do this. > >I disagree about the last point. A Unicode plain text file consists of >a stream of Unicode characters (and nothing else), interpreted according >to the Unicode standard. It should be marked with an initial U+FEFF (though >technically that is optional). This much is already clear from the standard, >as is the usage of LINE SEPARATOR and PARAGRAPH SEPARATOR for minimal, >unambiguous, plain text formatting consistent with the bidi algorithm. I'm not concerned about where. If the Unicode standard is an acceptable place to do this, I'm in. >The situation is complicated by the two possible byte orders (which is one >reason for the U+FEFF) and by the fact that the most widely implemented >variant, namely that in Windows NT, chose LSB order instead of MSB order. > >But other than that, there is not much more to be said about a Unicode >plain text file. 
The usefulness of the concept lies in its simplicity.
>
>--Ken Whistler

I disagree about the simplicity of the problem. Some of the leading issues are:

   byte order in storage and transmission
   line, paragraph, and page breaks
   BIDI (Hebrew, Arabic, etc.)
   non-linear scripts (Indic, Korean, Mongolian, Ethiopian, etc.)
   multiply accented characters (IPA, math, several human languages)
   math
   compatibility characters
   private use characters
   control codes
   other deprecated characters
   surrogates, especially unpaired surrogate codes
   non-character values
   text processing algorithms (sorting, upper and lower case, pattern matching)

Full portability of data requires some rules. If there is no standard, users of "Unicode text files" will make every possible choice about each of these issues. CRLF will be nothing in comparison. We have begun to see programs that can handle CRLF, CR alone, and LF alone, either line-by-line or in paragraph format, reading and writing in any option. The range of choices for Unicode is far greater, and I don't want to think about how long it would take to achieve unity if we don't do it now.

The process for dealing with byte order is fairly simple in itself, and the standard gives clear conformance requirements. Most of the other issues I listed have thorns, few in some cases, and many in others.

When I was in Korea in the 1960s, telegrams were printed linearly, so Koreans can read this form of their script if they have to. Indic scripts, Ethiopic, and a few others, would require special training to read as separate elements in a straight line. Do we wish to say that users of these scripts can't have text files? Do we say we have to come up with a suitable rendering method for Unicode text files including full BIDI and full character-->glyph composition? Do we say that there should be implementation levels? None of these alternatives is quite satisfactory at present.
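
[Editorial sketch, not part of the original message.] The byte-order and line-break choices discussed above can be made concrete in a few lines of Python. This is only a sketch under stated assumptions: the names decode_16bit and split_paragraphs are hypothetical, text without an initial U+FEFF is assumed here to be in MSB order, and treating CRLF, CR, LF, and U+2028 alike as line breaks (with U+2029 ending a paragraph) is one possible convention, not something prescribed by the standard or by any poster in this thread.

```python
import re

# One possible reading of the conventions named in the thread:
# CRLF, CR, LF and U+2028 LINE SEPARATOR all end a line;
# U+2029 PARAGRAPH SEPARATOR ends a paragraph.
# "\r\n" is listed first so that CRLF counts as a single break.
LINE_BREAK = re.compile("\r\n|\r|\n|\u2028")
PARAGRAPH_SEPARATOR = "\u2029"


def decode_16bit(data):
    """Decode 16-bit Unicode text, using an initial U+FEFF to pick the byte order."""
    if data[:2] == b"\xfe\xff":
        # U+FEFF stored MSB first
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xff\xfe":
        # The mark appears byte-swapped, so the file is LSB first
        return data[2:].decode("utf-16-le")
    # No mark at all: MSB order is an assumption of this sketch
    return data.decode("utf-16-be")


def split_paragraphs(text):
    """Return a list of paragraphs, each paragraph being a list of lines."""
    return [LINE_BREAK.split(paragraph)
            for paragraph in text.split(PARAGRAPH_SEPARATOR)]
```

A reader that wanted to "support everything" could call split_paragraphs(decode_16bit(data)) on the raw bytes of a file; a stricter reader could instead reject any line break other than U+2028 and U+2029.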
-- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein 20-May-97 22:11:38-GMT,4132;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id SAA25206 for ; Tue, 20 May 1997 18:11:32 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA25784; Tue, 20 May 97 14:49:30 -0700 Message-Id: <9705202149.AA25784@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2655 (1997-05-20 21:49:05 GMT) To: Multiple Recipients of Reply-To: Frank da Cruz From: "Unicode Discussion" Date: Tue, 20 May 1997 14:49:03 -0700 (PDT) Subject: Re: Unicode plain text standard? (was Re: Line Separator Character) > >I'm not sure what you're after. I'm mainly concerned about the continued > >viability of files containing only graphic characters, spaces, line breaks, > >paragraph breaks, and formfeeds. Plain, literal text that can contain > >poetry, tables, source code, you name it, and stays like it is. > > I can tell you don't know what table building in Sanskrit is like, and you > don't understand BIDI direction marking. > Not Sanskrit, certainly, but I know a little about Hebrew by virtue of having devoted some time to issues of Hebrew terminal emulation in the plain-text world, and our Kermit terminal emulators (the software we make here) are quite popular in Israel. But yes, one must go through more than a few contortions on one end or the other (or both) to handle BIDI issues in the terminal/host setting, to the extent that Hebrew is (according to my sources) hardly used at all in email. 
The contortions involve generation and interpretation of terminal-specific escape sequences for cursor positioning, reversal of writing direction, character insertion, etc, and of course character-set invocation and designation, all of which obviously add up to something more than plain text. So sure, of course I agree that plain streams of text are not adequate for writing systems that are intrinsically bidirectional (like Hebrew) or for which correct rendering is variable and context-dependent (Indic scripts, etc). (So where, you might ask, is Hebrew terminal emulation used? As far as I know, the major application by far is in library information systems like ALEPH; there are some others, like a Hebrew version of the "vi" editor and more recently, Mule (Multilingual EMACS). At one point some years ago I thought (naively) that the very same mechanisms could be used for Arabic (after all, PCs have an Arabic code page), but in practice, as far as I can tell, no speaker of Arabic would be satisfied with a character-cell representation of Arabic text, because of the way characters must change shape depending on their context (as you point out), which is evidently not an issue in Hebrew (although it might be in Yiddish).) > Having lived in Korea and Japan, and been a mathematician and APL > programmer, I lost all faith in ASCII long ago. > Right -- I wasn't suggesting we all revert to ASCII -- the ability to write text in as many languages as possible is why we're here! I am looking for the option to extend the simplicity (and success) of ASCII to Unicode -- or at least to the large subset of it (as Ken said) that can be used "like ASCII". To me this means the ability to compose a plain-text message containing a certain amount of formatting controls like line breaks, paragraph breaks, and page breaks, that are part of the same code, and without application-specific metacodes (SGML tags, Microsoft Word codes, etc). Let Unicode be able to stand on its own! 
(Of course, also let it be used in other applications -- but that's not the issue.) If additional considerations need to be applied to the world's more complex scripts in order to have a standard universal representation for plain text, to whatever extent the Unicode 2.0 standard does not already suffice, I'm all for it. Let's not repeat the confusing aspects of ASCII -- particularly CRLF/CR/LF semantics, and, as Ed suggests, let's not leave room for this kind of confusion in areas that are new to Unicode. - Frank 21-May-97 0:19:39-GMT,7895;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id UAA14733 for ; Tue, 20 May 1997 20:19:36 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA26328; Tue, 20 May 97 17:02:20 -0700 Message-Id: <9705210002.AA26328@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2656 (1997-05-21 00:01:51 GMT) To: Multiple Recipients of Reply-To: kenw@sybase.com (Kenneth Whistler) From: "Unicode Discussion" Date: Tue, 20 May 1997 17:01:49 -0700 (PDT) Subject: Re: Unicode plain text (Was: Line Separator Character) I (Ken) commented: >But other than that, there is not much more to be said about a Unicode >plain text file. The usefulness of the concept lies in its simplicity. And Ed Cherlin responded: > > I disagree about the simplicity of the problem. And now I think I understand where we were miscommunicating. I was speaking of a Unicode plain text *file*, which I thought was the issue. And for that the issue is simple. A Unicode plain text *file* is Unicode plain text in a file (preferably marked with U+FEFF and in MSB byte order). But what Ed is addressing here is the standardization of the meaning of Unicode *plain text*--an issue which should be considered outside instantiation of that plain text in transmissible computer files. On that point I agree that there are a vast number of issues which require specification and standardization. 
And I do believe that the Unicode Standard is the correct place to address many of them. I've made the point before that one of the big differences between ISO/IEC 10646 and the Unicode Standard is that 10646 standardizes the encodings and names of the characters, but that the Unicode Standard goes way beyond that and attempts to provide enough information (some normative and some informative) to enable meaningful and transmissible implementations of Unicode plain text.

Below is Ed's list of leading issues. I've interspersed my comments indicating what I think the current Unicode Standard's take is on many of them. (Others may disagree, or may feel that things which are not covered should be.)

> Some of the leading issues are:
>
> byte order in storage and transmission

Byte order is addressed by the Unicode Standard.

> line, paragraph, and page breaks

The Unicode Standard specifies LINE SEPARATOR and PARAGRAPH SEPARATOR, but considers page break to be out of scope.

> BIDI (Hebrew, Arabic, etc.)

The normative bidi algorithm is specified in great detail in the Unicode Standard.

> non-linear scripts (Indic, Korean, Mongolian, Ethiopian, etc.)

The Unicode Standard considers specification of script behavior to be part of the desired content of the standard. It doesn't do an equally detailed accounting of all cases, mostly due to resource and information constraints. But Devanagari and Tamil script handling are provided in significant detail as a guide to Indian script behavior, and there is an extensive discussion of Arabic script shaping behavior. There is a specification of normative behavior for Hangul combining jamo. If we could get equally detailed expert contributions for each complex script, I expect the inclination of the UTC and the editors would be to include them in the standard, for everybody's benefit.

> multiply accented characters (IPA, math, several human languages)

This is considered an integral part of the Unicode Standard, and is detailed with both normative and informative sections.

> math

There is a definite gap here, though the topic has been a continuing one for the UTC. The consensus seems to be that we would like to get a consistent model of plain text math formula construction stated, to make such information exchangeable in Unicode plain text.

> compatibility characters

These are now completely specified in the Unicode Standard names list.

> private use characters

Also specified by the standard, although the interpretation of particular usages of private use characters is, by definition, out of scope for the standard. But there has been some effort by people to make available specifications of their particular private or corporate private usage repertoires of private use characters.

> control codes

If you mean by this, U+0000..U+001F, U+0080..U+009F and the control chimera U+007F, then the Unicode Standard does provide an answer. It doesn't try to reinvent control function standards, but it says those characters should be interpreted as if they were 16-bit analogues of the 8-bit encodings of the corresponding control functions. Maybe unsatisfying, but probably the best we can expect, given existing control code usage.

> other deprecated characters

There may be room for improvement here, but the Unicode Standard has had to tread a little carefully here. There are political consequences in crying out too loudly that xyz are *deprecated* when xyz may be somebody else's favorite set they lobbied hard to get in!

> surrogates, especially unpaired surrogate codes

Surrogate usage (in general, as opposed to particular encodings for surrogate pairs, none of which exist yet) is fully specified by the Unicode Standard.

> non-character values

As opposed to unassigned character values, there are only two non-character values in Unicode: 0xFFFE and 0xFFFF. The standard specifies that 0xFFFE is the illegal byte-swapped version of U+FEFF. The use of 0xFFFF is deliberately unspecified and is untransmissible by design.

> text processing algorithms (sorting, upper and lower case, pattern matching)

Default case mapping is provided as an informative part of the Unicode Standard. Language-specific casing is effectively also a part of the standard, since everybody knows the few instances in question: Turkish i, the debatable French accents, German ß, etc., and they are discussed in the standard. Beyond that, sorting, pattern matching, etc. are out of scope of the Unicode Standard (though some implementation guidelines are provided), and, in my opinion, appropriately belong to other standards under development.

>
> Full portability of data requires some rules. If there is no standard,
> users of "Unicode text files" will make every possible choice about each of
> these issues. CRLF will be nothing in comparison. We have begun to see
> programs that can handle CRLF, CR alone, and LF alone, either line-by-line
> or in paragraph format, reading and writing in any option. The range of
> choices for Unicode is far greater, and I don't want to think about how
> long it would take to achieve unity if we don't do it now.

Yes, but...

The goal is interchangeable plain text that is legible when interpreted and rendered in accord with the standard. The goal is not to force everyone to "spell" multilingual text exactly the same way. The drafters of the Unicode Standard tried to place normative requirements on plain text where failure to do so would lead to complete chaos. Obvious examples are specification that combining marks must follow (not precede) their base character, and specification of the complete bidi algorithm. Failure to specify either of these would clearly have led to uninterpretable gibberish if everyone made up their own rules, and that was clearly understood by the members of the Unicode Technical Committee.
But one draws the line somewhere. No one wants to legislate against people, for example, making cross-linguistic puns in text by spelling out Russian words with Latin letters, or any other "inappropriate" or creative usage of the characters at their disposal, once Unicode implementations become more widely available. Half the joy of having universal multilingual text implemented on computers will be seeing what creative and fantastic new inventions millions of users put it to. --Ken Whistler 21-May-97 1:32:55-GMT,2729;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id VAA24596 for ; Tue, 20 May 1997 21:32:53 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA26556; Tue, 20 May 97 18:14:20 -0700 Message-Id: <9705210114.AA26556@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 X-Uml-Sequence: 2657 (1997-05-21 01:13:31 GMT) To: Multiple Recipients of Reply-To: clarkcb@corp.sykes.com From: "Unicode Discussion" Date: Tue, 20 May 1997 18:13:29 -0700 (PDT) Subject: Unicode Plain Text Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id VAA24596 I'm a little confused by this recent thread. I get the feeling that some people think Unicode needs additional features to be useable, whereas I think that the necessary features need to be present in Unicode-supporting applications and fonts. Maybe I'm misunderstanding, but I'll continue anyway. I think maybe the problem is that the definition of "plain text" needs some refining with respect to Unicode. To me, a Unicode plain text file would contain ANY Unicode character. It would be the writer's responsibility (together with an input editor, perhaps) to make sure the file contained the minimum necessary information to render correctly, eg. 
proper placement of directional indicators, etc., and it would in turn be the application's responsibility to render the file in a readable fashion, given the information contained in the file. Keep in mind that even 7-bit ASCII text still must be "rendered" by an editor on the screen.

Also, keep in mind that, according to the Unicode Standard, compliance does not necessarily mean full support. An application might not have bidirectional rendering capabilities, but that does not mean that a Unicode file with a mixture of English and Hebrew/Arabic with directional indicators is not a plain text file.

What makes a plain text file different from any other electronic document, in my opinion, is the lack vs. the presence of "style" information, such as font, font size, margins, etc., and additionally, in the case of SGML instances, procedural markup.

As for usage standards, such as CRLF vs. CR vs. LF vs. LS vs. PS, etc., we have two options:

1. agree on definitive standards now, and support nothing but, or
2. support everything

Now, I have done enough programming to know that supporting more means more headaches, but I still feel that the second option is the better one at this time.

Feedback?

Cary

21-May-97 19:00:12-GMT,5614;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id OAA04479 for ; Wed, 21 May 1997 14:59:51 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA29248; Wed, 21 May 97 11:11:19 -0700 Message-Id: <9705211811.AA29248@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2661 (1997-05-21 18:10:29 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Wed, 21 May 1997 11:10:27 -0700 (PDT) Subject: Re: Unicode plain-text file

Doug/Mark,

Thanks a lot for your answers. They clarify a lot of things.

> ** This is not consistent with the output on your web page. To force the
> ** date to be formatted left to right assuming this logical order, you'd
> ** need to force all date characters to L. This can be done either using an LRM
> ** before the first Roman digit, if the digits are roman, or by surrounding
> ** the date with LRO..PDF, if the digits are arabic-indic. Note that LRE
> ** won't work because the reverse solidus, being between two AN, would
> ** still convert to R, instead of L as desired.

I finally had a chance to chat with my Arab friend to whom I owe this short fragment. It is visually correct (on GIF/PS), but my logical ordering was wrong. The logical order is 10\3\90. So it seems that things should automatically fall into place with no extra markup. It is a reverse solidus. The digits are arabic-indic (U+066x). So the reverse solidus, an ON, stays R as needed by virtue of the ANs being treated as Rs for the purpose of resolving neutrals. Not simple, but effective. That section of the standard really requires careful reading and exploring :-).

> ... So in line 2, the level wouldn't change simply
> ** because of a switch from English to German, since the German
> ** characters would be L. Only LRE or LRO would do that. Since you
> ** don't indicate strong formatting characters, I'd have to assume they
> ** were present to force the levels you indicate.

The levels as shown are what I believe(d) they should be. I didn't include the required BIDI markup, but would assume that the application that outputs the file for this text would include whatever is necessary to achieve this result. So you assumed correctly.

> @@ The standard is pretty clear. Most of those opinions are from people
> @@ who have not read it. Think of these characters in terms of what you
> @@ use in a word processor.
> @@ For Microsoft word or FrontPage, think of LS as the
> @@ character that you get with shift-Return
> @@ (causing no paragraph spacing or indent),
> @@ and PS as what you get with Return.
> @@ (on the Mac, this would be option-Return). Thinking in terms of a word processor is what I'm trying to get away from, because it's not really open. (And I live on Unix :-)) When I open up a file using vi on Unix, I can't tell if this file was created with vi, emacs, pine, ed, sed, awk or whatever. There are still issues (CR/LF/CRLF, TAB, FF placement, top 128 codes) with plain-text ASCII files, but still, it is a very useful concept. Imagine if I had to open mail from user A with vi, from user B with emacs, from user C with pine because that's what each used to write to me. It would be chaos. Unfortunately, if we can't agree on some conventions for plain-text Unicode files, we're going to get into this situation to some extent. Right now, if I want to be as flexible as possible (in an editor, say), I have to deal with 4 new-line conventions (maybe 5): CR, LF, CRLF, LS, maybe NL. I have to deal with various placements of FFs. And I may have to deal with various uses and misuses of some of the new codes. > ** This is a good observation! We believe the current standard is in > ** error and should categorize LS as whitespace instead of as a block > ** separator. I'll consider it changed. > ** That said, the explicit formatting codes are basically intended for static > ** text interchange only. They pose several problems for editing. One is that it > ** is easy to radically alter the text by inserting, copying, or deleting I wouldn't let a user directly input/modify BIDI markup! Rather I'd have him/her tell the editor what a piece of text should look like, then let the editor issue whatever markup is required to achieve this at the time the file is written out. > ** FF is higher-level formatting, you'd have to interpret it separately. > @@ In particular, you would definitely interpret it as a block separator. That's one area where I'd love more guidance from Unicode. 
FF is, I think, a reasonable requirement for plain-text files, so I would have liked Unicode to tell me more about it, or provide a PAS -- page separator. Pierre lew@nortel.ca P.S.1. I was shocked, when I visited the IUC10 Web site, to find HTML pages in Unicode, but no plain-text files. Yes, let Unicode be able to stand on its own (as fdc@watsun.cc.columbia.edu writes)! P.S.2. Btw, one thing I love about "plain-text" files is that they have the best chances of surviving. If I write stuff today that my 3-year old will want to read when he turns 33, my only choice is plain text. To write for him in French, plain-text ASCII (with the Latin1 assumption) is just fine. But if I wanted to add some notes in Greek, Russian or Yiddish, I need more than just the ASCII conventions and Latin1 codepage. P.S.3. Someone in this thread stated that LF was a paragraph separator in Unix. I see it as a line separator. 22-May-97 8:33:13-GMT,1687;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id EAA13940 for ; Thu, 22 May 1997 04:33:12 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA01595; Thu, 22 May 97 01:07:55 -0700 Message-Id: <9705220807.AA01595@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 2666 (1997-05-22 08:07:03 GMT) To: Multiple Recipients of Reply-To: Edward Cherlin From: "Unicode Discussion" Date: Thu, 22 May 1997 01:06:53 -0700 (PDT) Subject: Re: Unicode plain-text file >> ** FF is higher-level formatting, you'd have to interpret it separately. >> @@ In particular, you would definitely interpret it as a block separator. No, no, please, no! Whitespace, please, or some new category. FF can come in the middle of a paragraph, or a sentence, or even a word. >That's one area where I'd love more guidance from Unicode. 
FF is, I think, >a reasonable requirement for plain-text files, so I would have liked >Unicode to tell me more about it, or provide a PAS -- page separator. >P.S.3. Someone in this thread stated that LF was a paragraph separator >in Unix. I see it as a line separator. Another good example of the confusion we need to prevent. -- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein
22-May-97 10:02:07-GMT,7689;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id GAA23212 for ; Thu, 22 May 1997 06:02:05 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA01479; Thu, 22 May 97 01:04:41 -0700 Message-Id: <9705220804.AA01479@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 2664 (1997-05-22 08:04:06 GMT) To: Multiple Recipients of Reply-To: Edward Cherlin From: "Unicode Discussion" Date: Thu, 22 May 1997 01:04:05 -0700 (PDT) Subject: Re: Unicode plain text (Was: Line Separator Character) kenw@sybase.com (Kenneth Whistler) wrote: [snip] >You can still use U+000C FORM FEED in Unicode plain text, and a renderer that >knows about page breaks can do the "right thing", namely whatever it did with >^L for an ASCII text.
FORM FEED, like HORIZONTAL TAB, was not considered to >be ambiguous enough in usage (unlike CR/LF) to require any separate encoding >in Unicode. > >> In any case, the strong Use-A-GUI thrust of Unicode will make it >>increasingly >> difficult for certain kinds of people to operate in the ways to which they >> have become accustomed over the past decades in which plain text was "good >> enough" save that one could not put lots of languages into it. > >The goal of Unicode plain text is to recapture that portability in the >encoding, but also allow you to put lots of languages into it. The "Use-A-GUI >thrust" of Unicode acknowledges the fact that rendering of complex scripts >(including the Latin script with generative use of combining marks) requires >logic that is much more amenable to implementation in a GUI framework than in >a terminal model. However, appropriate (and very large and useful) subsets of >Unicode *can* be implemented with simple rendering models. (Cf. Windows NT >until very recently. :-) ) > >> I can move this letter to practically any >> other platform and it will still be perfectly legible and printable -- no >> export or import or conversion or version skew to worry about. I think >>a lot >> of people would be perfectly happy to do the same in a plain-text Unicode >> world using plain-text Unicode terminals and printers, if there were such >> things. The Everson Mono fonts would suit such a product admirably, up to a point. >That is exactly what Unicode plain text is all about. And, by the way, >Notepad on Windows NT was pretty close to being a "plain-text Unicode >terminal". > >> The idea that one must embed Unicode in a higher level wrapper (e.g. a >> Microsoft Word document, or even HTML) to make it useful has a certain >> frightening consequence: the loss of any expectancy of longevity for our new >> breed of documents. > >There is absolutely nothing new about this. 
I was warning my linguistic >colleagues about the longevity of their documents when they started using >WordStar back around 82/83. 7-bit ASCII is the only encoding that stayed >stable enough and was widely enough implemented to retain easy >transmissibility >across the computer generations without the intervention of information >archaeologists. Well, 16-bit Unicode plain text is aimed at no less a >goal than being the universal wide-ASCII plain text of the 21st century. > [snip] > >> So let's do our part and make some effort to accommodate traditional >> plain-text applications in Unicode, rather than discourage them :-) > >I agree completely. An excellent example of the appropriate place for >a Unicode plain-text editor would be a Java IDE. If someone writes >a good Unicode plain-text editor for such an application, it would >have wider applicability. (I know I often use the editors of C++ >IDE's to create (ASCII) plain text when I don't want it all gummed up >as a Word or Frame document.) > >Ed Cherlin commented: > >> We want to have a uniform, portable definition of the meaning of a file of >> 16-bit character codes interpreted as Unicode, or "Unicode text file" for >> short. At the same time, we have several uses for such files, where >> different interpretations may be desired. If we want to do this right, I >> think we have to find the appropriate organization for defining such file >> formats and uses, and get down to some serious and at times difficult >> standard making. The Unicode character code standard does not seem to be >> the right place to do this. > >I disagree about the last point. A Unicode plain text file consists of >a stream of Unicode characters (and nothing else), interpreted according >to the Unicode standard. It should be marked with an initial U+FEFF (though >technically that is optional). 
This much is already clear from the standard, >as is the usage of LINE SEPARATOR and PARAGRAPH SEPARATOR for minimal, >unambiguous, plain text formatting consistent with the bidi algorithm. I'm not concerned about where. If the Unicode standard is an acceptable place to do this, I'm in. >The situation is complicated by the two possible byte orders (which is one >reason for the U+FEFF) and by the fact that the most widely implemented >variant, namely that in Windows NT, chose LSB order instead of MSB order. > >But other than that, there is not much more to be said about a Unicode >plain text file. The usefulness of the concept lies in its simplicity. > >--Ken Whistler I disagree about the simplicity of the problem. Some of the leading issues are: byte order in storage and transmission line, paragraph, and page breaks BIDI (Hebrew, Arabic, etc.) non-linear scripts (Indic, Korean, Mongolian, Ethiopian, etc.) multiply accented characters (IPA, math, several human languages) math compatibility characters private use characters control codes other deprecated characters surrogates, especially unpaired surrogate codes non-character values text processing algorithms (sorting, upper and lower case, pattern matching) Full portability of data requires some rules. If there is no standard, users of "Unicode text files" will make every possible choice about each of these issues. CRLF will be nothing in comparison. We have begun to see programs that can handle CRLF, CR alone, and LF alone, either line-by-line or in paragraph format, reading and writing in any option. The range of choices for Unicode is far greater, and I don't want to think about how long it would take to achieve unity if we don't do it now. The process for dealing with byte order is fairly simple in itself, and the standard gives clear conformance requirements. Most of the other issues I listed have thorns, few in some cases, and many in others. 
When I was in Korea in the 1960s, telegrams were printed linearly, so Koreans can read this form of their script if they have to. Indic scripts, Ethiopic, and a few others, would require special training to read as separate elements in a straight line. Do we wish to say that users of these scripts can't have text files? Do we say we have to come up with a suitable rendering method for Unicode text files including full BIDI and full character-->glyph composition? Do we say that there should be implementation levels? None of these alternatives is quite satisfactory at present. -- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein Ed Cherlin cherlin@cauce.org Support the anti-Spam amendment Text at Free signature--Inquire within. 22-May-97 10:24:51-GMT,10338;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id GAA26796 for ; Thu, 22 May 1997 06:24:49 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA01571; Thu, 22 May 97 01:07:10 -0700 Message-Id: <9705220807.AA01571@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Uml-Sequence: 2665 (1997-05-22 08:06:33 GMT) To: Multiple Recipients of Reply-To: Edward Cherlin From: "Unicode Discussion" Date: Thu, 22 May 1997 01:06:32 -0700 (PDT) Subject: Re: Unicode plain text (Was: Line Separator Character) Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id GAA26796 kenw@sybase.com (Kenneth Whistler), commenting on my previous message, did an admirable job of summarizing the state of the problem of Unicode plain text in terms of what the Unicode standard does and does not cover, and the fact that a standard for use of such files 
must address many more issues. I (Ed) agree with his summary entirely. My added comments here address the issues of function of editors and renderers. >I (Ken) commented: > >>But other than that, there is not much more to be said about a Unicode >>plain text file. The usefulness of the concept lies in its simplicity. > >And Ed Cherlin responded: > >> >> I disagree about the simplicity of the problem. > >And now I think I understand where we were miscommunicating. I was >speaking of a Unicode plain text *file*, which I thought was the >issue. And for that the issue is simple. A Unicode plain text *file* >is Unicode plain text in a file (preferably marked with U+FEFF >and in MSB byte order). > >But what Ed is addressing here is the standardization of the meaning >of Unicode *plain text*--an issue which should be considered outside >instantiation of that plain text in transmissible computer files. >On that point I agree that there are a vast number of issues which >require specification and standardization. And I do believe that the >Unicode Standard is the correct place to address many of them. I've >made the point before that one of the big differences between ISO/IEC >10646 and the Unicode Standard is that 10646 standardizes the encodings >and names of the characters, but that the Unicode Standard goes way >beyond that and attempts to provide enough information (some >normative and some informative) to enable meaningful and transmissible >implementations of Unicode plain text. > >Below is Ed's list of leading issues. I've interspersed my comments >indicating what I think the current Unicode Standard's take is on >many of them. (Others may disagree, or may feel that things which >are not covered should be.) > >> Some of the leading issues are: >> byte order in storage and transmission > >Byte order is addressed by the Unicode Standard. No problem there. We might want to go further and *require* a byte order mark. 
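[Editorial sketch, not part of the original message.] The byte-order point above can be illustrated with a short Python sketch that looks for U+FEFF at the start of a file of 16-bit codes. The names detect_byte_order and decode_plain_text are hypothetical, and rejecting unmarked input is an assumption made here to mirror the suggestion of requiring a byte order mark; the thread itself treats the mark as optional.

```python
def detect_byte_order(data: bytes) -> str:
    """Classify a 16-bit Unicode byte stream by its leading byte order mark."""
    if data[:2] == b"\xfe\xff":
        return "big-endian"      # U+FEFF stored MSB first
    if data[:2] == b"\xff\xfe":
        return "little-endian"   # U+FEFF stored LSB first (would read as 0xFFFE unswapped)
    return "unknown"             # no mark present


def decode_plain_text(data: bytes) -> str:
    """Decode marked 16-bit Unicode text; refuse unmarked input (an assumption)."""
    order = detect_byte_order(data)
    if order == "big-endian":
        return data[2:].decode("utf-16-be")
    if order == "little-endian":
        return data[2:].decode("utf-16-le")
    raise ValueError("no byte order mark found")
```

A reader that prefers to accept unmarked files could fall back to a default order instead of raising; which default to use is exactly the kind of choice the thread argues should be standardized.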
>> line, paragraph, and page breaks > >The Unicode Standard specifies LINE SEPARATOR and PARAGRAPH SEPARATOR, >but considers page break to be out of scope. That would have to be addressed, because it will be used. >> BIDI (Hebrew, Arabic, etc.) > >The normative bidi algorithm is specified in great detail in >the Unicode Standard. So Unicode text editors should be required to implement it correctly, if they handle BIDI at all. >> non-linear scripts (Indic, Korean, Mongolian, Ethiopian, etc.) > >The Unicode Standard considers specification of script behavior to >be part of the desired content of the standard. It doesn't do an >equally detailed accounting of all cases, mostly due to resource >and information constraints. But Devanagari and Tamil script >handling are provided in significant detail as a guide to Indian >script behavior, and there is an extensive discussion of Arabic >script shaping behavior. There is a specification >of normative behavior for Hangul combining jamo. If we could get >equally detailed expert contributions for each complex script, >I expect the inclination of the UTC and the editors would be to >include them in the standard, for everybody's benefit. That would be a very great improvement. >> multiply accented characters (IPA, math, several human languages) > >This is considered an integral part of the Unicode Standard, and >is detailed with both normative and informative sections. So should it be required in all editors? I think so. >> math > >There is a definite gap here, though the topic has been a continuing >one for the UTC. The consensus seems to be that we would like to >get a consistent model of plain text math formula construction >stated, to make such information exchangeable in Unicode plain text. There has been some good work on this reported at IUC conferences. An option in an editor, for now anyway. >> compatibility characters > >These are now completely specified in the Unicode Standard names list. 
It should be possible to use them, but the user should have to choose to activate them. >> private use characters > >Also specified by the standard, although the interpretation of >particular usages of private use characters is, by definition, out >of scope for the standard. But there has been some effort by people >to make available specifications of their particular private or >corporate private usage repertoires of private use characters. I don't know of any particular behavior that could be required of software, other than the option of marking them all as unrecognized. >> control codes > >If you mean by this, U+0000 .. U+001F, U+0080..U+009F and the >control chimera U+007F, then the Unicode Standard does provide >an answer. It doesn't try to reinvent control function standards, >but it says those characters should be interpreted as if they >were 16-bit analogues of the 8-bit encodings of the corresponding >control functions. Maybe unsatisfying, but probably the best we >can expect, given existing control code usage. More precision is required, I think, at least for CR, LF, HT, and FF. >> other deprecated characters > >There may be room for improvement here, but the Unicode Standard >has had to tread a little carefully here. There are political >consequences in crying out too loudly that xyz are *deprecated* >when xyz may be somebody else's favorite set they lobbied hard >to get in! We can't just forbid them, certainly. >> surrogates, especially unpaired surrogate codes > >Surrogate usage (in general, as opposed to particular encodings >for surrogate pairs, none of which exist yet) is fully specified >by the Unicode Standard. OK. Unpaired surrogate codes should be marked in some way in rendering plain text. >> non-character values > >As opposed to unassigned character values, there are only two >non-character values in Unicode: 0xFFFE and 0xFFFF. The standard >specifies that 0xFFFE is the illegal byte-swapped version of >U+FEFF.
The use of 0xFFFF is deliberately unspecified and is >untransmissible by design. Why do I think someone is going to decide to use it? :( >> text processing algorithms (sorting, upper and lower case, pattern matching) > >Default case mapping is provided as an informative part of the >Unicode Standard. Language-specific casing is effectively also >a part of the standard, since everybody knows the few instances >in question: Turkish i, the debatable French accents, German ß, etc., >and they are discussed in the standard. > >Beyond that, sorting, pattern matching, etc. are out of scope of >the Unicode Standard (though some implementation guidelines are >provided), and, in my opinion, appropriately belong to other standards >under development. The question is to some degree whether there is or will be a standard library of string functions, as there has been in C and C++. Of course I recognize that there were many such libraries, and perhaps that is unavoidable. >> Full portability of data requires some rules. If there is no standard, >> users of "Unicode text files" will make every possible choice about each of >> these issues. CRLF will be nothing in comparison. We have begun to see >> programs that can handle CRLF, CR alone, and LF alone, either line-by-line >> or in paragraph format, reading and writing in any option. The range of >> choices for Unicode is far greater, and I don't want to think about how >> long it would take to achieve unity if we don't do it now. > >Yes, but... The goal is interchangeable plain text that is legible >when interpreted and rendered in accord with the standard. The goal >is not to force everyone to "spell" multilingual text exactly the >same way. The drafters of the Unicode Standard tried to place normative >requirements on plain text where failure to do so would lead to >complete chaos.
Obvious examples are specification that combining >marks must follow (not precede) their base character, and specification >of the complete bidi algorithm. Failure to specify either of these >would clearly have led to uninterpretable gibberish if everyone >made up their own rules, and that was clearly understood by the >members of the Unicode Technical Committee. I think the best way to discuss this is over some sample texts. I don't know how much time I can put into this, but if I can I will go through the standard and see if I can pick out anything else that might be a problem. >But one draws the line somewhere. No one wants to legislate against >people, for example, making cross-linguistic puns in text by >spelling out Russian words with Latin letters, or any other >"inappropriate" or creative usage of the characters at >their disposal, once Unicode implementations become more widely >available. Half the joy of having universal multilingual text >implemented on computers will be seeing what creative and fantastic >new inventions millions of users put it to. > >--Ken Whistler Think of the smilies we can make. %-] -- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. 
http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein 22-May-97 22:24:46-GMT,1378;000000000011 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id SAA27950 for ; Thu, 22 May 1997 18:24:43 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA05533; Thu, 22 May 97 13:37:39 -0700 Message-Id: <9705222037.AA05533@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2673 (1997-05-22 20:37:12 GMT) To: Multiple Recipients of Reply-To: "Tony Harminc" From: "Unicode Discussion" Date: Thu, 22 May 1997 13:37:11 -0700 (PDT) Subject: Re: Unicode plain text How do record oriented file systems fit into this discussion ? (Remember those file systems that ruled the world before the UNIX idea of the byte stream came along...) I imagine the short answer is "they don't", and the longer one is something about record oriented files being fine, as long as the semantics of the defined control characters are honoured. What I'm getting at, though, is whether there is anything in the definition of Unicode plain text that disallows such files. Is there a mapping between the out-of-band record markers and Unicode separators ? It seems trivially obvious to map to/from . Or is this something that no one thinks should even be addressed ? 
Tony Harminc 22-May-97 22:26:48-GMT,2034;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id SAA28229 for ; Thu, 22 May 1997 18:26:46 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA05800; Thu, 22 May 97 14:34:52 -0700 Message-Id: <9705222134.AA05800@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2674 (1997-05-22 21:34:21 GMT) To: Multiple Recipients of Reply-To: Timothy Partridge From: "Unicode Discussion" Date: Thu, 22 May 1997 14:34:17 -0700 (PDT) Subject: Re: Unicode plain-text file In message <9705220812.AA01704@unicode.org> you recently said: > > >> ** FF is higher-level formatting, you'd have to interpret it separately. > >> @@ In particular, you would definitely interpret it as a block separator. > > No, no, please, no! Whitespace, please, or some new category. FF can come > in the middle of a paragraph, or a sentence, or even a word. I'm not sure I understand your reasoning. During rendering a page break can occur anywhere in the same way that a new line may be started anywhere as a line becomes too full. (I'm using anywhere rather loosely.) Wasn't the question about *forcing* a page break - surely this wouldn't normally be done within a paragraph or smaller part. (Or were you thinking of text streams that have already been formatted by some other process but are now plain text with line breaks etc. added where the formatting process felt they ought to be.) I feel that adding FF may be part of a slippery slope to pretty text. What about starting a new column or keeping text together? Someone else suggested that New Line should just be white space not a block separator. I don't agree - surely a paragraph is (usually) a new line with some extra white space added - this implies the semantics should be similar. Tim -- Tim Partridge.
Any opinions expressed are mine only and not those of my employer 22-May-97 23:00:17-GMT,2344;000000000001 Return-Path: Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id TAA04612; Thu, 22 May 1997 19:00:12 -0400 (EDT) Date: Thu, 22 May 97 19:00:11 EDT From: Frank da Cruz To: "Tony Harminc" Cc: Multiple Recipients of Subject: Re: Unicode plain text In-Reply-To: Your message of Thu, 22 May 1997 13:37:11 -0700 (PDT) Message-ID: > How do record oriented file systems fit into this discussion ? > (Remember those file systems that ruled the world before the UNIX > idea of the byte stream came along...) > They are far from dead; IBM VM/CMS and Digital (Open)VMS, to name two, are still widespread. But VM/CMS and other IBM mainframe and midrange operating systems use EBCDIC text encoding and I am not aware of any movement to support Unicode in this setting, at least not internally. In VMS, most text files are record oriented -- usually variable length records, with end of line *implied* for each record, but not recorded in any particular format. This is actually quite a sensible approach, given the wide variety of text-stream formats that abound for no good reason. In principle, it should be just as possible to fill records with Unicode as it is to fill them with ASCII, Latin-1, or JIS X 0208. The VMS file system also supports the notion of "carriage control", of which there are many types (like the once-familiar Fortran Hollerith style, in which the first character specified whether the line was to overprint the previous line, appear on the next line, appear 2 lines down, etc, or start on a new page). The carriage control information, again, is separate from the file's data. So again, in principle, there should be no clash with Unicode. In fact, I think a VMS implementation of Unicode text might be an interesting exercise. 
But this too begs the question of how to map Unicode plain text into this environment, which in turn calls for a Unicode plain-text standard for such things as page breaks. And no, I don't think this brings us anywhere near any slippery slopes. Page breaks have been an integral part of plain text since the 1950s when we were programming IBM 409 Electric Accounting Machines by sticking little wires into plugboards. - Frank 22-May-97 23:59:19-GMT,1434;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id TAA13839 for ; Thu, 22 May 1997 19:59:18 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA06394; Thu, 22 May 97 15:59:25 -0700 Message-Id: <9705222259.AA06394@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2676 (1997-05-22 22:59:12 GMT) To: Multiple Recipients of Reply-To: kenw@sybase.com (Kenneth Whistler) From: "Unicode Discussion" Date: Thu, 22 May 1997 15:59:10 -0700 (PDT) Subject: Re: Unicode plain-text file Tim Partridge wrote: > Someone else suggested that New Line should just be white space not a block > separator. I don't agree - surely a paragraph is (usully) a new line with > some extra white space added - this implies the semantics should be similar. Please be extra careful here. The suggestion specifically was that U+2028 LINE SEPARATOR (not NL nor LF functioning as newline) should be considered WS (a technical category of the bidi algorithm, not white space as processed, for example in a C preprocessor, or white space meaning unprinted area on a text page) rather than BS (another technical category of the bidi algorithm which is used to determine the boundaries of directional blocks). Cf. pages 3-15 and 3-17 of the Unicode Standard. 
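[Editorial sketch, not part of the original message.] The WS versus BS distinction drawn above can be checked against present-day Unicode data with Python's standard unicodedata module. This is only an illustration: the values come from whatever Unicode version the Python build ships, not from the 1997 standard, and the category written as BS in this message is reported as B in current data. The CANDIDATES table and the function name bidi_classes are hypothetical.

```python
import unicodedata

# Characters discussed in the thread as possible line, paragraph, or page breaks.
CANDIDATES = {
    "LINE SEPARATOR": "\u2028",
    "PARAGRAPH SEPARATOR": "\u2029",
    "LINE FEED": "\n",
    "CARRIAGE RETURN": "\r",
    "FORM FEED": "\f",
}


def bidi_classes() -> dict:
    """Return the bidirectional class recorded for each candidate character."""
    return {name: unicodedata.bidirectional(ch) for name, ch in CANDIDATES.items()}


if __name__ == "__main__":
    for name, cls in bidi_classes().items():
        print(f"{name}: {cls}")
```

In current data U+2028 is classed as WS and U+2029 as B, which matches the suggestion described in this message.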
--Ken Whistler 23-May-97 1:27:28-GMT,3553;000000000011 Return-Path: Received: from mail2.microsoft.com (mail2.microsoft.com [131.107.3.42]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id VAA26026 for ; Thu, 22 May 1997 21:27:28 -0400 (EDT) Received: by INET-02-IMC with Internet Mail Service (5.0.1458.30) id ; Thu, 22 May 1997 18:27:29 -0700 Message-ID: <61CDD2C9A961CF11B6A000805FD40AA90368E0AC@RED-84-MSG.dns.microsoft.com> From: Murray Sargent To: "'Frank da Cruz'" Cc: "'unicode@unicode.org'" Subject: RE: Unicode plain text Date: Thu, 22 May 1997 18:27:26 -0700 X-Priority: 3 X-Mailer: Internet Mail Service (5.0.1458.30) I think page breaks given by (0xC) belong in the block separator category and imply an end of paragraph. Page breaks that come in the middle of a paragraph or word should be called _soft_ page breaks much as we have soft line breaks. We could talk about adding an optional page-break analogous to the optional hyphen (0xAD), but computer folklore of the years clearly indicates that shouldn't be overloaded for this purpose. (Off hand, I don't think an optional pagebreak would be a useful code to have, since you'd really like to have the semantic "eject if within n lines of the page bottom." Such a semantic requires the number n, which doesn't fit into a single code position.) Murray > -----Original Message----- > From: Unicode Discussion [SMTP:unicode@unicode.org] > Sent: Thursday, May 22, 1997 4:00 PM > To: Multiple Recipients of > Subject: Re: Unicode plain text > > > How do record oriented file systems fit into this discussion ? > > (Remember those file systems that ruled the world before the UNIX > > idea of the byte stream came along...) > > > They are far from dead; IBM VM/CMS and Digital (Open)VMS, to name > two, are still widespread. But VM/CMS and other IBM mainframe > and midrange operating systems use EBCDIC text encoding and I am > not aware of any movement to support Unicode in this setting, > at least not internally. 
> > In VMS, most text files are record oriented -- usually variable > length records, with end of line *implied* for each record, but > not recorded in any particular format. This is actually quite a > sensible approach, given the wide variety of text-stream formats > that abound for no good reason. > > In principle, it should be just as possible to fill records with > Unicode as it is to fill them with ASCII, Latin-1, or JIS X 0208. > > The VMS file system also supports the notion of "carriage control", > of which there are many types (like the once-familiar Fortran > Hollerith style, in which the first character specified whether the > line was to overprint the previous line, appear on the next line, > appear 2 lines down, etc, or start on a new page). The carriage > control information, again, is separate from the file's data. So > again, in principle, there should be no clash with Unicode. > > In fact, I think a VMS implementation of Unicode text might be an > interesting exercise. But this too begs the question of how to > map Unicode plain text into this environment, which in turn calls > for a Unicode plain-text standard for such things as page breaks. > > And no, I don't think this brings us anywhere near any slippery > slopes. > Page breaks have been an integral part of plain text since the 1950s > when we were programming IBM 409 Electric Accounting Machines by > sticking little wires into plugboards. 
> > - Frank 23-May-97 1:28:50-GMT,4054;000000000001 Return-Path: Received: from halon.sybase.com (halon.sybase.com [192.138.151.33]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id VAA26150 for ; Thu, 22 May 1997 21:28:49 -0400 (EDT) Received: from smtp1.sybase.com (sybgate.sybase.com [130.214.220.35]) by halon.sybase.com (8.8.4/8.8.4) with SMTP id SAA03968; Thu, 22 May 1997 18:32:06 -0700 (PDT) Received: from birdie.sybase.com by smtp1.sybase.com (4.1/SMI-4.1/SybH3.5-030896) id AA28055; Thu, 22 May 97 18:30:19 PDT Received: by birdie.sybase.com (5.x/SMI-SVR4/SybEC3.5) id AA23641; Thu, 22 May 1997 18:28:46 -0700 Date: Thu, 22 May 1997 18:28:46 -0700 From: kenw@sybase.com (Kenneth Whistler) Message-Id: <9705230128.AA23641@birdie.sybase.com> To: fdc@watsun.cc.columbia.edu Subject: Re: Unicode plain text Cc: unicode@unicode.org, kenw@sybase.com X-Sun-Charset: US-ASCII > > > How do record oriented file systems fit into this discussion ? > > (Remember those file systems that ruled the world before the UNIX > > idea of the byte stream came along...) > > [snip] > > In principle, it should be just as possible to fill records with > Unicode as it is to fill them with ASCII, Latin-1, or JIS X 0208. And in practice. The portable Unicode backend library I have written merrily reads and writes Unicode plain text into MVS and VMS filing systems through standard C file interfaces. No problem. I just don't depend on MVS or VMS to provide any specific interpretations of *anything* in those files, nor would I want to, to stay portable. > > The VMS file system also supports the notion of "carriage control", > of which there are many types (like the once-familiar Fortran > Hollerith style, in which the first character specified whether the > line was to overprint the previous line, appear on the next line, > appear 2 lines down, etc, or start on a new page). The carriage > control information, again, is separate from the file's data. 
So > again, in principle, there should be no clash with Unicode. > > In fact, I think a VMS implementation of Unicode text might be an > interesting exercise. Only *interesting* in the sense you mean if you depended on VMS for anything other than basic system services underneath a C library. To be portable, everything else would be built on layers of support libraries independent of VMS. > But this too begs the question of how to > map Unicode plain text into this environment, which in turn calls > for a Unicode plain-text standard for such things as page breaks. I agree with Tim that page breaks are on the slippery slope to pretty text. Pagination is not necessary for legibility of plain text in the same sense that line breaking (forced in some instances) or paragraph breaking (required among other things for bidi directional control) are. Furthermore, since pagination assumes much more about actual rendering devices, forced pagination is as often a source of illegibility. (Think of all those preformatted documents you've seen at one time or another that on your device display or print with one or two lines spilled over to the next page for each forced page.) I suspect that the device dependency of pagination is one of the reasons why HTML doesn't use a built-in concept of page-break on display or FF. > > And no, I don't think this brings us anywhere near any slippery slopes. > Page breaks have been an integral part of plain text since the 1950s > when we were programming IBM 409 Electric Accounting Machines by > sticking little wires into plugboards. Again, think device dependency here. FF used to literally be the electronic control for the "Form Feed" on a particular device. It moved a mechanical device that shoved paper out and new paper in. In modern Page Description Languages such as PostScript, an operator such as showpage is a high-level operation that dumps a frame buffer to a smart raster device. 
Trying to control such operations by embedding an FF control character in plain text is pretty klutzy. --Ken > > - Frank > 23-May-97 4:12:51-GMT,4953;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id AAA17192 for ; Fri, 23 May 1997 00:12:50 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA07465; Thu, 22 May 97 20:50:55 -0700 Message-Id: <9705230350.AA07465@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2682 (1997-05-23 03:50:06 GMT) To: Multiple Recipients of Reply-To: Murray Sargent From: "Unicode Discussion" Date: Thu, 22 May 1997 20:50:05 -0700 (PDT) Subject: RE: Unicode plain text But back in the '60s and early '70s we had line printers (with fixed-width characters) and would ship "plain-text" documents to them preformatted with the desired line and page breaks. Such breaks consisted of hard CRLFs and FFs to control the line printer, and they could appear in the middle of a paragraph or word. Similarly these codes create such breaks on most modern printers. So in this sense, an FF can come in the middle of a paragraph or even a word. But this should be something down at the printer device-driver level. It would be a bad choice for file storage (unless it's a printer file). To date, Unicode has avoided defining control characters except for the TAB and NULL, precisely because there were multiple uses for these characters. The Unicode Standard states that "the others may be interpreted according to ISO/IEC 6429". Nevertheless, Frank's recommendation that Unicode fill in some of the other control-character semantics seems compelling, if only on a recommendation basis. We could, for example, enumerate the most common usages of the control characters CR, LF, VT, and FF in contemporary software. 
Murray > -----Original Message----- > From: Unicode Discussion [SMTP:unicode@unicode.org] > Sent: Thursday, May 22, 1997 6:27 PM > To: Multiple Recipients of > Subject: RE: Unicode plain text > > I think page breaks given by FF (0xC) belong in the block separator > category and imply an end of paragraph. Page breaks that come in the > middle of a paragraph or word should be called _soft_ page breaks much > as we have soft line breaks. We could talk about adding an optional > page-break analogous to the optional hyphen (0xAD), but computer > folklore of the years clearly indicates that FF shouldn't be > overloaded for this purpose. (Off hand, I don't think an optional > pagebreak would be a useful code to have, since you'd really like to > have the semantic "eject if within n lines of the page bottom." Such a > semantic requires the number n, which doesn't fit into a single code > position.) > > Murray > > > -----Original Message----- > > From: Unicode Discussion [SMTP:unicode@unicode.org] > > Sent: Thursday, May 22, 1997 4:00 PM > > To: Multiple Recipients of > > Subject: Re: Unicode plain text > > > > > How do record oriented file systems fit into this discussion ? > > > (Remember those file systems that ruled the world before the UNIX > > > idea of the byte stream came along...) > > > > > They are far from dead; IBM VM/CMS and Digital (Open)VMS, to name > > two, are still widespread. But VM/CMS and other IBM mainframe > > and midrange operating systems use EBCDIC text encoding and I am > > not aware of any movement to support Unicode in this setting, > > at least not internally. > > > > In VMS, most text files are record oriented -- usually variable > > length records, with end of line *implied* for each record, but > > not recorded in any particular format. This is actually quite a > > sensible approach, given the wide variety of text-stream formats > > that abound for no good reason.
> > > > In principle, it should be just as possible to fill records with > > Unicode as it is to fill them with ASCII, Latin-1, or JIS X 0208. > > > > The VMS file system also supports the notion of "carriage control", > > of which there are many types (like the once-familiar Fortran > > Hollerith style, in which the first character specified whether the > > line was to overprint the previous line, appear on the next line, > > appear 2 lines down, etc, or start on a new page). The carriage > > control information, again, is separate from the file's data. So > > again, in principle, there should be no clash with Unicode. > > > > In fact, I think a VMS implementation of Unicode text might be an > > interesting exercise. But this too begs the question of how to > > map Unicode plain text into this environment, which in turn calls > > for a Unicode plain-text standard for such things as page breaks. > > > > And no, I don't think this brings us anywhere near any slippery > > slopes. > > Page breaks have been an integral part of plain text since the 1950s > > when we were programming IBM 409 Electric Accounting Machines by > > sticking little wires into plugboards. > > > > - Frank 23-May-97 14:50:25-GMT,5993;000000000001 Return-Path: Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id KAA08237; Fri, 23 May 1997 10:50:06 -0400 (EDT) Date: Fri, 23 May 97 10:50:06 EDT From: Frank da Cruz To: Murray Sargent Cc: "'unicode@unicode.org'" Subject: RE: Unicode plain text In-Reply-To: Your message of Thu, 22 May 1997 18:27:26 -0700 Message-ID: Murray Sargent wrote: > I think page breaks given by FF (0xC) belong in the block separator > category and imply an end of paragraph. Page breaks that come in the > middle of a paragraph or word should be called _soft_ page breaks much > as we have soft line breaks. ... > This is GUI thinking. Think "plain text", no rendering engines. FF is a hard, unconditional page break.
Think of running off monthly paychecks on your lineprinter, or addressing envelopes (and spelling peoples' names correctly in hundreds of languages -- imagine that!). kenw@sybase.com (Kenneth Whistler) wrote: > > In principle, it should be just as possible to fill records with > > Unicode as it is to fill them with ASCII, Latin-1, or JIS X 0208. > > And in practice. The portable Unicode backend library I have > written merrily reads and writes Unicode plain text into MVS and > VMS filing systems through standard C file interfaces. No problem. > I just don't depend on MVS or VMS to provide any specific interpretations > of *anything* in those files, nor would I want to, to stay portable. > It's funny how the pendulum swings. Back in the old days we didn't even have file systems, just boxes of cards. Then we developed complex file systems based on punched-card ideas (look at your old OS/360 JCL manual). Then we reacted against all of that complexity and said "a file is just a stream of bytes" with imbedded control information. Now the simplicity of the stream approach is coming back to bite us because of all the differing interpretations of the imbedded controls, since no standard was ever set for their use in files. Now we see that there is something to be said for keeping the control information out of band -- it makes it really simple to change coding systems. But anybody who has ever done VMS Record Management System programming knows that the price is complexity and loss of portability. You can't just "copy" a VMS file to DOS or UNIX, you have to "export" it from the file system and convert its record information to the appropriate stream format. Nor can you run an RMS program on a non-VMS system. If we had it all to do over again -- and we do -- we could retain the simplicity of the stream model without the confusion by precisely defining a set of controls that may be imbedded, as we have done for LS and PS. 
This will allow for both portable data AND portable software. > I agree with Tim that page breaks are on the slippery slope to pretty > text. Pagination is not necessary for legibility of plain text in > the same sense that line breaking (forced in some instances) or > paragraph breaking (required among other things for bidi directional > control) are. Furthermore, since pagination assumes much more > about actual rendering devices, forced pagination is as often a > source of illegibility. (Think of all those preformatted documents > you've seen at one time or another that on your device display or print > with one or two lines spilled over to the next page for each forced > page.) I suspect that the device dependency of pagination is one > of the reasons why HTML doesn't use a built-in concept of page-break > on display or FF. > This is all true, but that does not mean there should be no such thing as a forced page break. Paychecks. Envelopes. Like any tool, a hard page break can be used for good or evil. It's not the tool's fault. > Again, think device dependency here. FF used to literally be the > electronic control for the "Form Feed" on a particular device. It > moved a mechanical device that shoved paper out and new paper in. > Yes, we still do these things. Murray Sargent said: > > But back in the '60s and early '70s we had line printers (with > fixed-width characters) and would ship "plain-text" documents to them > preformatted with the desired line and page breaks. Such breaks > consisted of hard CRLFs and FFs to control the line printer, and they > could appear in the middle of a paragraph or word. Similarly these > codes create such breaks on most modern printers. So in this sense, an > FF can come in the middle of a paragraph or even a word. But this > should be something down at the printer device-driver level. It would > be a bad choice for file storage (unless it's a printer file). 
> Again, printer files are common practice, and they are not sent only to printers. They are also viewed on terminals, "straight no chaser" or in a text editor, and they are shipped around among diverse platforms. There is no reason to try to stamp out this practice. It has its legitimate uses. > To date, Unicode has avoided defining control characters except for the > TAB and NULL, precisely because there were multiple uses for these > characters. The Unicode Standard states that "the others may be > interpreted according to ISO/IEC 6429". > I agree that ASCII and ISO 6429 control characters are a mess, and that is why it is important to precisely define a minimal set for use in Unicode plain text. This might be done by defining semantics for the existing C0 and C1 control characters, or by adding new ones. This will not only make Unicode able to stand on its own, but it will allow export and import of fancy text between incompatible GUI applications. And it will provide a Common Intermediate Representation for plain text that can last for decades, while the corporations slug it out in the marketplace over their three-letter acronyms du jour. - Frank 24-May-97 0:29:40-GMT,1048;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id UAA16623 for ; Fri, 23 May 1997 20:29:39 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA10499; Fri, 23 May 97 16:59:08 -0700 Message-Id: <9705232359.AA10499@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2686 (1997-05-23 23:58:54 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Fri, 23 May 1997 16:58:53 -0700 (PDT) Subject: Re: Unicode plain text In message "Re: Unicode plain text", 'fdc@watsun.cc.columbia.edu' writes: > And no, I don't think this brings us anywhere near any slippery slopes.
> Page breaks have been an integral part of plain text since the 1950s > when we were programming IBM 409 Electric Accounting Machines by > sticking little wires into plugboards. I have to agree. Don't RFCs all come with FFs in them? Pierre 25-May-97 7:08:25-GMT,2860;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id DAA02704 for ; Sun, 25 May 1997 03:08:24 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA13201; Sat, 24 May 97 23:43:12 -0700 Message-Id: <9705250643.AA13201@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 2689 (1997-05-25 06:42:40 GMT) To: Multiple Recipients of Reply-To: Edward Cherlin From: "Unicode Discussion" Date: Sat, 24 May 1997 23:42:37 -0700 (PDT) Subject: Re: Unicode plain text Timothy Partridge wrote: >We seem to have two different requirements for plain text here. >Now my assumption was that we would mostly want to use one type, whereas >there seems to be a strong demand for another. At the risk of teaching >you all to suck eggs I will contrast and compare them at some length. >I hope you will find a useful point or two. This is exactly what I was trying to get at in earlier messages. I would say that there are other requirements in other cases, and it would be worth our while to make a stab at enumerating them so we have some idea of what we are talking about. Here are some of the common uses of "plain text", each having a different purpose and different constraints: E-mail Printer command files--ASCII, PostScript Source code--programming, SGML, HTML, TeX Encoded binaries--UUencode, UTF-7 Transfer formats--RTF, APL Workspace Interchange Archiving Portability Database Application file formats Constraints on line length vary widely. 
I have seen database files with lines of nearly 1000 characters, and of course there is the theorem that any computable function can be expressed in one line of APL. :-) Other constraints will also vary widely. We must allow for this variation, and only specify what we have to. >First the type I had assumed as the default. >I would call this logical formatting. [snip] > The second type I would call physical formatting. [snip] The snipped analysis was quite good, although a few points might be argued. One of the best points is that we can require a certain competence from a Unicode renderer. The implementor can decide which character ranges to support, but having done that must support certain features in the way specified in the standard. This mechanism can be extended to cover some of the requirements of various text file usages. -- Edward Cherlin, Vice President, NewbieNet, Inc., http://www.newbie.net/ | Help outlaw Spam: http://www.cauce.org (1000 members and counting, 17 May 97) | "Everything should be made as simple as possible, __but no simpler__." Attributed to Albert Einstein 25-May-97 15:45:55-GMT,4499;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id LAA23257 for ; Sun, 25 May 1997 11:45:54 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA13742; Sun, 25 May 97 08:01:25 -0700 Message-Id: <9705251501.AA13742@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2691 (1997-05-25 15:01:09 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Sun, 25 May 1997 08:01:07 -0700 (PDT) Subject: Re: Unicode plain text In message "Re: Unicode plain text", 'timpart@perdix.demon.co.uk' writes: > We seem to have two different requirements for plain text here. > ... > > First the type I had assumed as the default. > I would call this logical formatting. > ...
This first type (usually the result of "save as text" from some WP) always causes me trouble and I usually have to reformat it before I can do anything with it (such as printing it). > The second type I would call physical formatting. > The text has already been formatted by the author into lines and > paragraphs... I think the second type is by far the most common and is what I consider to be plain text: o It's the format of all RFCs, perhaps the most widely-read plain-text files around, o It's the format of the vast majority of email and Usenet posts I read (but I do see some type 1 stuff), o It's the format of much e-documentation that comes with many S/W (eg. linux, TeX (at least installation), X.11, ...), o It's the natural format of all a2ps (ascii-to-postscript) converters I've come across, and (last but not least) o It's the format chosen by project Gutenberg, the wonderful collection of English texts. I have a dream here, of a multi-lingual project Gutenberg with classics in various languages, and, of course, in plain-text Unicode.... (URL: ftp://uiarchive.cso.uiuc.edu/pub/etext/ ) I'd be really curious to see how one would express RFC2070, on "Internationalization of the Hypertext Markup Language", as a type 1 plain-text file (for those looking for a challenge: type 2 plain-text file of this RFC is at: http://ds.internic.net/rfc/rfc2070.txt). Of course, type 2 means some assumptions. > * The author knows exactly how many characters fit on a line. (Often > there is also the assumption that each character is fixed width.) True enough, and that may break down somewhat with ideograms (surely one can't fit 80 of those on a line). But, in general, staying under 80 chars will give a plain-text file that most can print. I rarely have trouble printing a plain-text file of this second type. And I think this will work with a lot of scripts, eg. Russian, Greek, Hebrew, Arabic. > * The author knows exactly how many lines fit on a page. 
Most plain-text files have no FFs, but when they do (as RFCs do), it's not too difficult to be conservative so that again most folks can print them with no problem. I don't see FFs as being on the slippery slope to pretty text. Besides their use in RFCs (so the TOC can be paginated), they're also often used to separate "chapters". For example, I'll save all the posts on the current threads, and I'll probably put an FF between each one so that, if/when I print the whole thing, I'll get each post to start on a new page. > * The author knows in which sequence the characters in a line will > be printed. (Usually assumes left to right without any reordering.) That's where it gets interesting (and why I had a few questions a few days ago). The only ordering possible within the plain-text Unicode file is of course logical. So that means a bit more intelligence in the a2ps conversion or in the display engines. Or, in despair, such a file could be put thru a filter that would reorder it into visual ordering for local consumption. In summary, notwithstanding some difficulties, I still think a plain-text Unicode file of the second type above makes perfect sense and would be very useful. I'm still not too sure how exactly I would encode it (wrt controls), but this thread has been quite helpful. Btw, this type 1 vs type 2 is a very useful distinction, and I think therein lies the source of much confusion in the current threads. Pierre lew@nortel.ca P.S. It's probable that my view of things is somewhat colored by my Unix bigotry. But still... 
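As a rough illustration of how little machinery preformatted ("type 2") text needs, here is a minimal Python sketch that splits such a file into pages and lines. It rests on assumptions that are not settled anywhere in this thread: that FF (U+000C) is a hard page break, and that CR, LF, CRLF, LS (U+2028) and PS (U+2029) may all be treated as line ends. The function name `split_pages` is hypothetical.

```python
import re

FORM_FEED = "\x0c"  # FF, assumed here to be a hard page break

# CRLF is listed first so it counts as one line end, not two.
# U+2028 is LINE SEPARATOR, U+2029 is PARAGRAPH SEPARATOR.
LINE_END = re.compile("\r\n|\r|\n|\u2028|\u2029")

def split_pages(text):
    """Return a list of pages, each page being a list of lines."""
    pages = []
    for page in text.split(FORM_FEED):
        lines = LINE_END.split(page)
        # A line end just before FF (or end of text) leaves an empty tail; drop it.
        if lines and lines[-1] == "":
            lines.pop()
        pages.append(lines)
    return pages
```

Treating PS as just another line end throws away the paragraph distinction; a reader that cares about bidi block boundaries would have to keep it.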
25-May-97 23:42:24-GMT,3079;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id TAA17171 for ; Sun, 25 May 1997 19:42:23 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA14407; Sun, 25 May 97 16:23:13 -0700 Message-Id: <9705252323.AA14407@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2692 (1997-05-25 23:22:41 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Sun, 25 May 1997 16:22:39 -0700 (PDT) Subject: RE: Unicode plain text In message "RE: Unicode plain text", Murray writes: > The preformatted plain text works OK as long as you have no plans to > modify it. If you want to edit it, then you have to worry about > reflowing the lines ... Most decent plain-text editors have facilities for that. > ... But even much older software was adept at formatting text. > E.g., troff and TeX have been around for years and do beautiful jobs of > formatting text. Of course, so does HTML today. But none of that is plain text; troff, TeX, and HTML require some processing intelligence that may no longer be around in 30 years, and that may not be available everywhere. Is there a specification somewhere that tells me how type 1 plain text (using Tim's terminology again for a moment) will be formatted for display and printing? Will things such as the following be dealt with properly? This is a recursive bulleted list. o Bullet one, a very long line..... that folds: - subbullet one a, another long line.... that folds; - a second subbullet o Bullet two. Can I rely on this intelligence to always yield something that reflects my intentions? With recursive bullet lists? With tables? Etc. Ah, maybe that's what some folks mean when they ask for a standard for plain text in Unicode?!
Or am I not more likely to see things such as what your email software did to my original post: > > o It's the format of all RFCs, perhaps the most widely-read > > plain-text > > files around, The middle line got folded, but the software didn't realize it was a bulleted list :-) > Within the Microsoft email system, we use rich text ... Well I hope you won't send me such, as I won't know what to do with it. Is it HTML-like markup? Of course rich text can be nice, but only if everyone has it. The nice thing about plain text *is* that everyone has it by default. But I think that applies only to type 2, ie. plain text with hard line breaks, ie. preformatted. The big advantage I see of the type 2 plain text (with hard line breaks) is that it requires *no* intelligence to render correctly. Well Unicode requires BIDI I guess (and let's hope that won't change in the next 30 years). But otherwise, just adjust to line length convention (by choosing a decent point size) and you're in business. No reliance on some S/W to do some undefined reformatting and hope it won't misrepresent your intentions. Pierre 26-May-97 12:40:12-GMT,2068;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id IAA10853 for ; Mon, 26 May 1997 08:40:11 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA15892; Mon, 26 May 97 05:16:43 -0700 Message-Id: <9705261216.AA15892@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 2696 (1997-05-26 12:16:14 GMT) To: Multiple Recipients of Reply-To: "Martin J. Duerst" From: "Unicode Discussion" Date: Mon, 26 May 1997 05:16:12 -0700 (PDT) Subject: Re: Unicode plain text On Mon, 26 May 1997, Otto Stolz wrote: > On May 24, 11:04, Timothy Partridge wrote: > > We seem to have two different requirements for plain text here. > ... > > The text has already been formatted by the author into lines and > > paragraphs.
(Just as I have done with this e-mail. [...] > > Since NL usually does not denote any logical division in the text > > it is extremely annoying if the BiDi algorithm treats it as a new > > block. > > In contrary, it is annoying if it doesn't -- see below. The example you give doesn't apply. Independently of whether LS is a block separator or treated as whitespace, there will never be any text part B a line higher than a text part A when logically, text part A is before text part B. This is the very basic principle of the BIDI algorithm. What is affected by the decision whether LS is a block separator or treated as whitespace is whether bidirectional embedding and override codes are terminated (at the block boundary) or not. As long as you don't have any of these, the only effect may be that in the absence of any other convention, the first character of a block defines the block's base directionality. Thus if LS is a block separator, you risk that the second part of the paragraph has a different base directionality than the first. Regards, Martin. 26-May-97 15:26:43-GMT,4060;000000000001 Return-Path: Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id LAA01862; Mon, 26 May 1997 11:26:38 -0400 (EDT) Date: Mon, 26 May 97 11:26:38 EDT From: Frank da Cruz To: Timothy Partridge Cc: Multiple Recipients of Subject: Re: Unicode plain text In-Reply-To: Your message of Sat, 24 May 1997 11:04:00 -0700 (PDT) Message-ID: > We seem to have two different requirements for plain text here. > Now my assumption was that we would mostly want to use one type, whereas > there seems to be a strong demand for another. > ... > First the type I had assumed as the default. > I would call this logical formatting. > > Paragraph Separator is most commonly used. Text usually runs on without > any control characters until a new paragraph is needed.
Since this > is logical formatting the author does not know or care whether a > paragraph is indicated by a completely blank line or a new line is > started with an indent or some other convention. > I suppose this is, indeed, a form of plain text, but I would call it "input for a text formatter", not text to be used and viewed on its own as it stands. It is a degenerate case of a larger class, e.g. input for TeX, Scribe, Troff, IPFC, SGML, or HTML (for text formatting). It is only in the last few years that I began to receive "long-line" text in email, and I can only suppose that it was generated by some sort of editor that does its own word wrapping during input, but does not send the line breaks on the mistaken assumption that every email client in the world is (or should be) also a text formatter. [The second type of plain text...] > The assumptions behind this explicit approach include: > * The text will go straight to a printer that is not very bright. > * The author knows exactly how many characters fit on a line. (Often > there is also the assumption that each character is fixed width.) > * The author knows exactly how many lines fit on a page. > * The author knows in which sequence the characters in a line will > be printed. (Usually assumes left to right without any reordering.) > Right -- this is the kind people have been using for more decades than many of us have been alive. It does not deserve the bad rap. Of course we all find it irritating when the composer of such text assumes wider or longer pages than we have, but that is not a reason to abolish this, the most common form of plain text -- in fact, it is all the more reason to set standards for its use. "Standard lines are so wide; standard pages are so long", etc. Such standards tend to be set of their own volition, e.g. among e-mail and netnews users, where recipients of badly formatted messages tend to take it on themselves to educate the senders as to common practice.
Ideally, preformatted plain text can also be fed into your favorite rendering engine to produce the effect that most pleases your eye, and indeed we have been doing this sort of thing for decades with many formatters. I grant that automatic recognition of nested bullet lists or meticulously formatted tables might be a stretch, but it is certainly not difficult to treat blank lines as paragraph separators, and otherwise to ignore line breaks when reformatting prose such as this. But once any kind of markup ("this is a table", "this is a bullet list", "this is a section of preformatted text") is introduced, our plain text becomes "input for a text formatter". Incidentally, another form of plain text is "output from a text formatter", which often has been hyphenated. Such text is an end result, not intended for further processing. I think that living in a world of email has demonstrated the value of plain text, at least to most people. The lesson is that this is the only text form that can be sent without prior prearrangement with any reasonable expectation that it will be readable at its destination. - Frank 26-May-97 15:48:20-GMT,2862;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id LAA06491 for ; Mon, 26 May 1997 11:48:19 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA16506; Mon, 26 May 97 07:38:18 -0700 Message-Id: <9705261438.AA16506@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 X-Uml-Sequence: 2698 (1997-05-26 14:37:52 GMT) To: Multiple Recipients of Reply-To: Otto Stolz From: "Unicode Discussion" Date: Mon, 26 May 1997 07:37:50 -0700 (PDT) Subject: Rare Writing Directions Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id LAA06491 Some scripts are neither left-to-right, nor right-to-left. 1. 
Mongolian is written top-to-bottom; Japanese and Chinese used to be written this way, the lines were stacked right-to-left. Recently, somebody (sorry, I haven't kept that note) has said that mixing Latin with Japanese was impossible, hence modern Japanese is written left-to-right. However, there is a way to mix top-to-bottom with horizontally written scripts: about twenty years ago I saw a book in Japanese, written top-to-bottom, with German proper names and place names embedded. These were also written top-to-bottom, with the glyphs rotated by 90 degrees; so you could turn the book counter-clockwise to read these names, in the usual way. This embedding method would also work with left-to-right phrases in Mongolian text. For right-to-left scripts, you would have to turn the glyphs the other way. I think it would be useful to have this method described in a forthcoming Unicode standard. 2. Some old scripts (Greek, Latin, Hittite, Runes) were used to write boustropheda. A boustrophedon runs back and forth like a ploughing ox (hence the name), i.e. the lines are written, alternatingly, left-to-right and right-to-left. As Unicode will adopt the Runes alphabet (or rather: fuþark), it would probably be useful to have boustrophedon-markers akin to the existing LEFT-TO-RIGHT MARK and its siblings, U+200E .. U+200F and U+202A .. U+202E. These markers could be used to mark plain, logically formatted, Unicode text. (To mark physically formatted text, you could probably use the OVERRIDE characters, U+202D and U+202E.) Also a normative boustrophedon algorithm, akin to the existing bidi algorithm, would probably be nice to have. I guess this algorithm could be much simpler than the bidi algorithm, as the boustrophedon feature will apply only to whole paragraphs (it is more like a layout style, which does not have to allow for intrinsic character features). Opinions? Am I wrong, again?
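The paragraph-level idea in point 2 can be sketched roughly in Python. This is only an illustration under strong assumptions, not a proposal: the paragraph is taken to be already broken into lines in logical order, and reversing a line's code points stands in for laying it out right-to-left, which is wrong for combining sequences and ignores glyph mirroring. The function name `boustrophedon` is hypothetical and nothing like it is defined by Unicode.

```python
def boustrophedon(lines, first_ltr=True):
    """Lay out logically ordered lines with alternating direction.

    Assumption: a right-to-left line is approximated by reversing
    its code points, which is only adequate for simple text.
    """
    out = []
    for i, line in enumerate(lines):
        # Even-numbered lines keep the starting direction, odd ones flip it.
        ltr = first_ltr if i % 2 == 0 else not first_ltr
        out.append(line if ltr else line[::-1])
    return out
```

Because the direction depends only on the line's position in the paragraph, no per-character properties are consulted, which is the sense in which this is simpler than bidi.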
Best wishes, Otto Stolz 26-May-97 16:18:23-GMT,1285;000000000011 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id MAA10524 for ; Mon, 26 May 1997 12:18:22 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA16649; Mon, 26 May 97 08:21:01 -0700 Message-Id: <9705261521.AA16649@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2699 (1997-05-26 15:20:37 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Mon, 26 May 1997 08:20:35 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) In message "Multi-Lingual Project Gutenberg (was: Unicode plain text)", 'Otto.Stolz@uni-konstanz.de' writes: > You'll find the German project Gutenberg (in German, of course), under > . The format > is currently HTML, in ISO 8859-1 encoding. Thanks for the pointer, I don't think I had it. Well done (just had a look at Max and Moritz). HTML certainly is an interesting alternative to plain text because it is so universal (and, hopefully, with a stable foundation). And it allows to include illustrations, annotations, &c. Pierre 26-May-97 16:43:54-GMT,2364;000000000001 Return-Path: Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id MAA15283; Mon, 26 May 1997 12:42:51 -0400 (EDT) Date: Mon, 26 May 97 12:42:51 EDT From: Frank da Cruz To: "Pierre Lewis" Cc: Multiple Recipients of Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) In-Reply-To: Your message of Mon, 26 May 1997 08:20:35 -0700 (PDT) Message-ID: > HTML certainly is an interesting alternative to plain text because it > is so universal (and, hopefully, with a stable foundation). And it > allows to include illustrations, annotations, &c. > There is an infinite number of alternatives to plain text. Anybody, anywhere can make up whatever such alternatives they like -- and they do. 
HTML is controlled by Netscape and Microsoft, and changes every five minutes as each attempts to outdo and undercut the other. Plain text is an interesting alternative to HTML because nobody controls it but "just us chickens", and it alone stands a chance of surviving year after year, decade after decade, as the corporate giants pull the rug out from each other (and us) on a weekly basis, with their proclamations of ever more complex proprietary "standards" with which we all must "comply". This is not to say that a simple and stable form of HTML -- say 1.0, but augmented by some minimally adequate method of coping with character sets -- is not a suitable method for publishing literary classics on the Web -- after all, this is the sort of thing the Web was originally designed for, lest we forget... But this is not to say that even a stable form of HTML could be thought of as a replacement for plain text. My printer does not render HTML; my email client is not a Web browser. My text editor is not an HTML authoring system. My C compiler does not compile HTML. My Telnet client does not interpret HTML. And perhaps most important, the incomprehensibly enormous corpus of existing plain-text information does not need to be converted to HTML or anything else (except perhaps Unicode plain text), especially since any such requirement would leave most of it behind, and even that which was deemed worthy of conversion would become obsolete as soon as HTML is replaced by the next thing. 
- Frank 26-May-97 18:29:49-GMT,2166;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id OAA01077 for ; Mon, 26 May 1997 14:29:48 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA17491; Mon, 26 May 97 10:23:48 -0700 Message-Id: <9705261723.AA17491@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2706 (1997-05-26 17:23:31 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Mon, 26 May 1997 10:23:30 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) In message "re:Multi-Lingual Project Gutenberg (was: Unicode plain text)", 'fdc@watsun.cc.columbia.edu' writes: > ... HTML is > controlled by Netscape and Microsoft, and changes every five minutes as each > attempts to outdo and undercut the other. I thought at least some baseline HTML came from more neutral bodies than these two corporations?! Of course, HTML is an acceptable alternative to plain text *only* if it is corporation-neutral, widespread, and reasonably stable. I certainly wouldn't agree to any MSIE-or NN-specific extensions being used in the texts offered by these projects, but this specific site is quite legible with lynx, so I assume it doesn't use too many fancy features. > Plain text is an interesting alternative to HTML because nobody controls it > but "just us chickens", and it alone stands a chance of surviving year after > year, decade after decade, ... Well put. > This is not to say that a simple and stable form of HTML -- say 1.0, but > augmented by some minimally adequate method of coping with character sets -- Since the German Gutenberg project uses latin 1 (the HTML default), they don't even need any extensions over HTML 1.0. > ... My printer does not render HTML; my email > client is not a Web browser. ... Same here. Still, browsers are getting pretty common, so for a project Gutenberg, it's probably a reasonable choice. 
Pierre 26-May-97 19:25:43-GMT,4512;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id PAA10016 for ; Mon, 26 May 1997 15:25:41 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA17805; Mon, 26 May 97 11:35:09 -0700 Message-Id: <9705261835.AA17805@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2707 (1997-05-26 18:34:53 GMT) To: Multiple Recipients of Reply-To: Timothy Partridge From: "Unicode Discussion" Date: Mon, 26 May 1997 11:34:50 -0700 (PDT) Subject: Re: Unicode plain text Pierre Lewis recently said: > This first type (usually the result of "save as text" from some WP) > always causes me trouble and I usually have to reformat it before I can > do anything with it (such as printing it). In my opinion a Unicode renderer should cope with this automatically and divide paragraphs up into lines for you. This is mostly because of the intelligence of the BiDi algorithm. What you won't get is page headers and footers and page numbers since there is no way to specify them in Unicode plain text. Is there general agreement that text that is only split into paragraphs should be rendered properly by a Unicode engine? I.e. it is acceptable as plain text. > I think the second type is by far the most common and is what I > consider to be plain text: > > o It's the format of all RFCs, perhaps the most widely-read plain-text > files around, [snip] > o It's the format chosen by project Gutenberg, the wonderful collection > of English texts. I have a dream here, of a multi-lingual project > Gutenberg with classics in various languages, and, of course, in > plain-text Unicode.... 
> > (URL: ftp://uiarchive.cso.uiuc.edu/pub/etext/ ) > > I'd be really curious to see how one would express RFC2070, on > "Internationalization of the Hypertext Markup Language", as a type 1 > plain-text file (for those looking for a challenge: type 2 plain-text > file of this RFC is at: http://ds.internic.net/rfc/rfc2070.txt). Can I have the original source please! I suspect that documents like this have been prepared in some markup language and sent through something like troff. > Of course, type 2 means some assumptions. > > > * The author knows exactly how many characters fit on a line. (Often > > there is also the assumption that each character is fixed width.) > > True enough, and that may break down somewhat with ideograms (surely > one can't fit 80 of those on a line). But, in general, staying under 80 > chars will give a plain-text file that most can print. I rarely have > trouble printing a plain-text file of this second type. And I think this > will work with a lot of scripts, eg. Russian, Greek, Hebrew, Arabic. I'm not so sure that fixed width Arabic will look good but the general point holds. But should I need to fiddle with point sizes if Unicode renderers will accept type 1 text. Type 2 text is very common. And it is the published form. In some cases the original marked up text will have been lost. Where it hasn't a Unicode type 1 style plain text file could be produced from the original. I dug out some troff documentation and it says that the plain text output is a representation that is an approximation to the printed page. I suggest that much of the type 2 text is in this form, i.e. Formatting *including* BiDi has already been carried out. Does anyone have examples of mixed direction text in RFC style format that could confirm this? I think that for type 2 physical format files Unicode rendering is *too* intelligent and would scramble the preformatted lines if they contained BiDi text. 
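As a rough way to look for such cases, the following Python sketch lists the lines of a preformatted (type 2) file that contain right-to-left characters, i.e. the lines a BiDi-aware renderer might reorder. It uses the standard `unicodedata.bidirectional` call; the helper names and the choice of bidi classes to flag are assumptions made for this illustration, and the sketch detects candidates only, it does not run the BiDi algorithm.

```python
import unicodedata

# Bidi classes treated as "right-to-left" here (an assumption for this
# sketch): strong R and AL characters plus the explicit RLE/RLO controls.
RTL_CLASSES = {"R", "AL", "RLE", "RLO"}


def has_rtl(line):
    """True if any character in the line has a right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in line)


def lines_at_risk(text):
    """1-based numbers of the lines that contain right-to-left characters."""
    return [number
            for number, line in enumerate(text.splitlines(), start=1)
            if has_rtl(line)]


if __name__ == "__main__":
    sample = "plain ASCII line\nshalom: \u05e9\u05dc\u05d5\u05dd\nanother line"
    print(lines_at_risk(sample))
```

Note that `str.splitlines()` also treats U+2028 and U+2029 as line boundaries, so the same check works whether the file uses LF, CRLF, or Line Separator.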
(As well as getting horribly confused by the NLs which presumably have been converted to Line Separator.) I would propose a new control code - Disable BiDirectional Processing which would switch off BiDi altogether. It could be used with physical format files so that they come out as intended. (There needs to be an Enable code as well.) I'll also allow you a Page Separator. This would be treated as a block separator by BiDi and would cause a new page to be started. The introduction of a new control code would mean that existing text that uses the current standard would work in the same way, but additional control could be given to text that needs it. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer 27-May-97 14:26:28-GMT,1209;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id KAA28296 for ; Tue, 27 May 1997 10:26:27 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA21401; Tue, 27 May 97 06:34:30 -0700 Message-Id: <9705271334.AA21401@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2718 (1997-05-27 13:33:37 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Tue, 27 May 1997 06:33:36 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) With waivering faith I wrote: :-) > HTML certainly is an interesting alternative to plain text because it > is so universal (and, hopefully, with a stable foundation). And it > allows to include illustrations, annotations, &c. Coincidently, I was reading last nite (ironically, in "iX", a German magazine) about XML (eXtensible Markup Language) which, says the article, could replace (in the mid term) HTML as the lingua franca of the Web. So much for that idea... Es lebe plain text! 
(long live ~) Pierre 27-May-97 17:30:54-GMT,2490;000000000011 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id NAA03642 for ; Tue, 27 May 1997 13:30:52 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA22032; Tue, 27 May 97 09:17:26 -0700 Message-Id: <9705271617.AA22032@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 2720 (1997-05-27 16:16:36 GMT) To: Multiple Recipients of Reply-To: John Fieber From: "Unicode Discussion" Date: Tue, 27 May 1997 09:16:34 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) On Tue, 27 May 1997, Pierre Lewis > With waivering faith I wrote: > :-) > > > HTML certainly is an interesting alternative to plain text because it > > is so universal (and, hopefully, with a stable foundation). And it > > allows to include illustrations, annotations, &c. > > Coincidently, I was reading last nite (ironically, in "iX", a German > magazine) about XML (eXtensible Markup Language) which, says the > article, could replace (in the mid term) HTML as the lingua franca of > the Web. So much for that idea... Both HTML and XML rest on a very stable foundation: SGML. The Unicode standard defers quite a number of things to "higher level protocols". SGML is just such a protocol; XML represents a profile of the SGML standard that makes writing processing applications a lot easier. If you invest a lot of energy building a document system around HTML, you will be SOL when HTML falls out of fashion. If you spend the same energy building a document system on the SGML foundation, you can automatically deal with HTML and all its variants, XML, or whatever the next fad is. Real SGML tools are polymorphic. > Es lebe plain text! (long live ~) I find this a tragic position. Before unicode, the common denominator for cross-platform data transfer was 7 bit ASCII. 
Unicode charged ahead to raise the common denominator but statements like this essentially say that the common denominator should go no further. This is counter to the spirit that inspired Unicode and counter to the standard itself which explicitly defers a number of important dimensions of text processing to higher level protocols. Plain text is simply not an option for most anyone serious about their documents. -john 27-May-97 18:42:14-GMT,2820;000000000001 Return-Path: Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id OAA23125; Tue, 27 May 1997 14:40:39 -0400 (EDT) Date: Tue, 27 May 97 14:40:38 EDT From: Frank da Cruz To: John Fieber Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) In-Reply-To: Your message of Tue, 27 May 1997 09:16:34 -0700 (PDT) Message-ID: > > Es lebe plain text! (long live ~) > > I find this a tragic position. Before unicode, the common > denominator for cross-platform data transfer was 7 bit ASCII. > Unicode charged ahead to raise the common denominator but > statements like this essentially say that the common denominator > should go no further. This is counter to the spirit that > inspired Unicode and counter to the standard itself which > explicitly defers a number of important dimensions of text > processing to higher level protocols. > But that is to say that Unicode is useless except in combination with a higher level protocol over which it has no control. I have absolutely no faith in any higher level protocol. They come into fashion and then exit ignominiously with astounding speed. So perhaps the need for plain text is "tragic" (so too would be the fact that many citizens of earth do not possess high-end bit-mapped rendering engines, let alone sufficient food to eat), but it is nonetheless real. I think a lot of Unicoders have little idea what the real world is like. They know it is populated by people who speak many languages written in diverse writing systems, which is a step forward. 
But they don't pay much attention to the "low tech" computer-related components of everyday life -- not only in the less "developed" countries, but even in the rich ones. They seem to believe that the only use for computers any more is Web browsing and composition of glossy (multilingual) sales brochures. Try to remember all the real work that computers are doing every day in hidden places: medical and laboratory equipment, manufacturing equipment, telecommunications equipment, traffic control, POS, EDI, etc. Case in point: the imbedded microprocessors and microcontrollers whose interface to the outside world is a lowly serial port, and which have only a few K available for their control program. Countless millions of them, chosen precisely for their low cost. Now, isn't it our goal for Unicode to become, eventually, the world's one-and-only character set? Good! Then let's not lock out the low end. Let's see if we can't separate the concept of character set from the *necessity* for higher (and lower) level protocols and the need for a high-end rendering engine. (Sure, use them if you want, but that's a totally separate issue.) - Frank 27-May-97 21:05:57-GMT,2979;000000000001 Return-Path: Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id RAA00789 for ; Tue, 27 May 1997 17:05:54 -0400 (EDT) Received: from localhost (jfieber@localhost) by fallout.campusview.indiana.edu (8.8.5/8.8.5) with SMTP id QAA01376 for ; Tue, 27 May 1997 16:05:45 -0500 (EST) Date: Tue, 27 May 1997 16:05:44 -0500 (EST) From: John Fieber To: Frank da Cruz Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Tue, 27 May 1997, Frank da Cruz wrote: > > > Es lebe plain text! (long live ~) > > > > I find this a tragic position. 
Before unicode, the common > > denominator for cross-platform data transfer was 7 bit ASCII. > > Unicode charged ahead to raise the common denominator but > > statements like this essentially say that the common denominator > > should go no further. This is counter to the spirit that > > inspired Unicode and counter to the standard itself which > > explicitly defers a number of important dimensions of text > > processing to higher level protocols. > > > But that is to say that Unicode is useless except in combination > with a higher level protocol over which it has no control. I never said and most certainly did not mean to imply that Unicode is "useless" without higher level protocols. That proposition is absurd. > I have absolutely no faith in any higher level protocol. They come > into fashion and then exit ignominiously with astounding speed. Your opinion does not change the fact that a great many applications would be impossible without higher level protocols, transient or otherwise. (I'd hardly describe SGML as transient though--it dates back into the 1960s and has continuously gained in popularity ever since with no sign of fading in the future.) [statements about the real world] > Now, isn't it our goal for Unicode to become, eventually, the world's > one-and-only character set? Good! Then let's not lock out the low > end. Let's see if we can't separate the concept of character set > from the *necessity* for higher (and lower) level protocols and the > need for a high-end rendering engine. ...but I never said anything about a monolithic standard including low and high level protocols! I'd be the first to say it would be a Bad Idea for exactly the reasons you cite. I would also add that separation is critical because different applications may need different high level protocols. SGML works great for publishing-type applications, but it certainly is not an answer to every text processing application. 
-john 27-May-97 21:19:35-GMT,3042;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id RAA03062 for ; Tue, 27 May 1997 17:19:31 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA23523; Tue, 27 May 97 13:04:30 -0700 Message-Id: <9705272004.AA23523@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 2726 (1997-05-27 20:04:01 GMT) To: Multiple Recipients of Reply-To: John Fieber From: "Unicode Discussion" Date: Tue, 27 May 1997 13:03:59 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) On Tue, 27 May 1997, Marion Gunn > The general consensus there was that > incipient XML was being very heavily pushed as an alternative to html by > SUN and MICROSOFT in collaboration Sun has been actively involved in the development of XML, so their position is no surprise. Lately Microsoft has been jumping on the "standards" bandwagon (witness the ditching of WINS for DNS, adoption of Kerberos, etc.) and a move to XML in particular represents taking a distinctly different direction than Netscape, whose founder has publicly stated that SGML is stupid--a position I firmly believe will only hasten Netscape's death if it persists. > (as an alternative which would eliminate > markup language altogether from the actual text to be transferred). This is nonsensical. In the world of HTML, you have a fixed set of tags you can use in your documents, and you must assume that the browser knows how to do something sensible with them (not always safe). With XML, or SGML for that matter, your document gets marked up using tags appropriate for the data being marked up. The document gets sent to the browser along with a style sheet so that the browser can do something sensible when it encounters the markup. 
This allows for (a) more concise and precise markup of the document and (b) more precise control over the ultimate rendering by the browser. The push for XML represents a "back to the roots" movement. The basic premise of SGML is that it is impossible to define a markup language that is both general and precise. Thus, SGML is a meta-language; a language for defining markup languages. At a technical level, SGML standardizes parsing--how to distinguish markup from data. HTML is just a single markup language defined in terms of SGML. However, the promotion of HTML as a universal exchange format is fundamentally at odds with the spirit of SGML. A problem with using SGML in a web environment is the complexity of the software required to implement the parsing rules. Enter XML. XML basically does away with numerous non-essential features of SGML that complicate parsing, things like tag omission and minimization, shortrefs and the like. XML also raises the compliance bar on character encoding from 7 bit ASCII to Unicode. -john 27-May-97 21:23:11-GMT,2405;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id RAA03727 for ; Tue, 27 May 1997 17:23:10 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA23596; Tue, 27 May 97 13:08:43 -0700 Message-Id: <9705272008.AA23596@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2727 (1997-05-27 20:08:24 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Tue, 27 May 1997 13:08:22 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) In message "re:Multi-Lingual Project Gutenberg (was: Unicode plain text)", 'jfieber@indiana.edu' writes: > > Es lebe plain text! (long live ~) > > I find this a tragic position. Before unicode, the common > denominator for cross-platform data transfer was 7 bit ASCII. First, don't take anything I write too literally. 
I make available most of my project documentation in HTML. So I'm not religious about these things. The above is not an exclusive statement. HTML serves a most useful purpose and I'm not saying to ban it! Second, Unicode is something more or less orthogonal to the notion of plain text. So I don't really understand your comment above. Plain text does not mean 7-bit ASCII. It could just as well mean UTF-8 Unicode. Third, for all the great things that can be said for SGML, HTML, XML, and ML, it still remains that plain text is the most portable format, the simplest to deal with (on all platforms), and the only one that is likely to be legible in 30 years. For some things, it's still the best solution. > Plain text is simply not an option for most anyone serious about > their documents. That depends on the purpose. For example, I'm writing some biographical notes on myself (how pretentious can one get :-)?) so my son will know a bit about me should I leave early. I can't think of a better medium for that than plain text (Latin 1 here). Surely not some WP that will be so badly out of style by the time he gets to read the stuff (he doesn't talk yet)... And look at a typical novel. Plain text is all that's required to capture it. Marketing glossies are another matter of course. And so is most technical documentation. Anyway, getting off topic again! 
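The second point above (that "plain text" and "7-bit ASCII" are separate notions) can be illustrated with a small Python sketch. The sample strings and the helper name `encodable` are chosen for this illustration only; the sketch simply tries to encode the same plain text in three character encodings and shows which ones can carry it.

```python
def encodable(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot carry the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    samples = [
        "Es lebe plain text!",       # pure ASCII
        "fu\u00feark",               # contains LATIN SMALL LETTER THORN
        "line one\u2028line two",    # contains U+2028 LINE SEPARATOR
    ]
    for sample in samples:
        for encoding in ("ascii", "latin-1", "utf-8"):
            print(repr(sample), encoding, encodable(sample, encoding))
```

All three samples are plain text in the sense discussed in this thread; what changes is only which encodings can represent them. ASCII text is byte-for-byte identical in UTF-8, while U+2028 needs a Unicode encoding.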
Pierre 27-May-97 22:15:20-GMT,2216;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id SAA20653 for ; Tue, 27 May 1997 18:15:18 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA24108; Tue, 27 May 97 14:24:35 -0700 Message-Id: <9705272124.AA24108@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2729 (1997-05-27 21:24:18 GMT) To: Multiple Recipients of Reply-To: Timothy Partridge From: "Unicode Discussion" Date: Tue, 27 May 1997 14:24:16 -0700 (PDT) Subject: Re: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) Pierre Lewis recently said: > With waivering faith I wrote: > :-) > > > HTML certainly is an interesting alternative to plain text because it > > is so universal (and, hopefully, with a stable foundation). And it > > allows to include illustrations, annotations, &c. > > Coincidently, I was reading last nite (ironically, in "iX", a German > magazine) about XML (eXtensible Markup Language) which, says the > article, could replace (in the mid term) HTML as the lingua franca of > the Web. So much for that idea... > > Es lebe plain text! (long live ~) And what about the Standard Generalised Markup Language (SGML)? This has been around for ages. It lets you define a set of markup tags and then use them. HTML is a particular set of SGML tags and the SGML definition of HTML (the DTD) is available from W3. If you are writing text in HTML I would strongly recommend that you put a DTD version declaration at the top. e.g. which is English with HTML 3.2 markup. Then syntax check the HTML with a SGML parser to make sure it conforms. Finally keep a copy of the DTD somewhere safe along with a copy of the matching HTML standard so that future generations can always understand your text. (The copy of 3.2 that I have is about 12K in size.) You might want a copy of the SGML standard too - I don't know where to get a machine readable copy from. 
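As a small illustration of the first step recommended above, the following Python sketch checks whether an HTML file starts with a DTD version declaration and reports which public identifier it names. It is a sketch under stated assumptions: the function name is hypothetical, the regular expression only recognises the common `<!DOCTYPE HTML PUBLIC "...">` form, and the identifier in the sample is the W3C public identifier for HTML 3.2, which may not be the exact declaration given in the original message (that text did not survive in this archive). It does not validate the document; that still requires an SGML parser and the DTD.

```python
import re

# Matches a leading declaration of the form <!DOCTYPE HTML PUBLIC "...">
# and captures the public identifier between the quotes.
DOCTYPE_RE = re.compile(
    r'^\s*<!DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)


def declared_dtd(html_text):
    """Return the public identifier declared at the top, or None if absent."""
    match = DOCTYPE_RE.match(html_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    page = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\n'
            '<html><head><title>Example</title></head>'
            '<body>text</body></html>')
    print(declared_dtd(page))
    print(declared_dtd("<html><body>no declaration</body></html>"))
```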
Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer 28-May-97 0:12:27-GMT,2766;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id UAA10920 for ; Tue, 27 May 1997 20:12:26 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA24725; Tue, 27 May 97 16:51:28 -0700 Message-Id: <9705272351.AA24725@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2731 (1997-05-27 23:51:00 GMT) To: Multiple Recipients of Reply-To: kenw@sybase.com (Kenneth Whistler) From: "Unicode Discussion" Date: Tue, 27 May 1997 16:50:58 -0700 (PDT) Subject: Unstable foundations and wavering faith > With waivering faith I wrote: > :-) > > > HTML certainly is an interesting alternative to plain text because it > > is so universal (and, hopefully, with a stable foundation). > > Es lebe plain text! (long live ~) It is no accident that Silicon Valley thrives in Earthquake country. But while everything seems to be in constant turmoil, and yesterday's hot new item is today's trash -- try to take the long view. 1. The Information Technology industry is still in its adolescent phase (no longer its infancy, certainly), but maturing rapidly. As industrial technology matures, it tends to stabilize into well- understood, efficient patterns, with competition for innovations just fizzing around the edges. Handling of multilingual text as part of the general problem of automated information technology is still in ferment, but we can see the beginnings of the crystallizations of well-understood, accepted ways of dealing with the issues on computers. 2. Unicode is laying the (firm, we hope) foundation for plain text representation through the next century--perhaps longer. In any case, like ASCII, it should last long enough to gain the lustrous, comfortable patina of trusted age. 
Just as my nieces now find it hard to conceive of a political age before Ronald Reagan, people just being introduced to computer science and programming in Java will find it hard to conceive of character sets before Unicode. --Ken (Color me rosy) Whistler P.S. For those who, like me, worry that all electronic data not in plain text (and ASCII plain text at that) is in constant danger of disappearing into the enormous historical bit bucket of undecipherable formats using undecipherable encodings on obsolete media, consider the following: Perhaps the greatest source of information loss in the longrun was the shift by the publishing industry to use of cheap high-acid papers early in this century. Ask librarians about the conditions of their pre-War collections (my nieces just asked, "The Gulf war?") of books. Or how about all the nitrate movie film stock collapsing into dust? 28-May-97 1:36:52-GMT,2584;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id VAA24050 for ; Tue, 27 May 1997 21:36:49 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA25007; Tue, 27 May 97 18:18:07 -0700 Message-Id: <9705280118.AA25007@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 2733 (1997-05-28 01:17:40 GMT) To: Multiple Recipients of Reply-To: Giles S Martin From: "Unicode Discussion" Date: Tue, 27 May 1997 18:17:35 -0700 (PDT) Subject: Re: Unstable foundations and wavering faith It's getting a little off-topic, but ... . Arguably the single event causing the greatest information loss was the destruction of the library at Alexandria, which broke countless links in chains of transmission of unique manuscripts. 
Acid paper and nitrate film have destroyed lots of copies, but most information of any significance produced in this era has been reproduced in lots of copies, and procographically recopied at a trivial cost compared to the cost of copying a manuscript by hand (which is why there were so many unique copies in Alexandria). Giles #### ## Giles Martin ####### #### Quality Control Section ################# University of Newcastle Libraries #################### New South Wales, Australia ###################* E-mail: ulgsm@dewey.newcastle.edu.au ##### ## ### Phone: +61 49 215 828 (International) Fax: +61 49 215 833 (International) ## The web of our life is of a mingled yarn, good and ill together -- All's Well That Ends Well, IV.iii.98-99 On Tue, 27 May 1997, Kenneth Whistler wrote: > P.S. For those who, like me, worry that all electronic data > not in plain text (and ASCII plain text at that) is in constant > danger of disappearing into the enormous historical bit bucket > of undecipherable formats using undecipherable encodings on > obsolete media, consider the following: Perhaps the greatest source > of information loss in the longrun was the shift by the publishing > industry to use of cheap high-acid papers early in this century. > Ask librarians about the conditions of their pre-War collections > (my nieces just asked, "The Gulf war?") of books. Or how about > all the nitrate movie film stock collapsing into dust? 
28-May-97 2:32:39-GMT,3210;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id WAA01024 for ; Tue, 27 May 1997 22:32:38 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA25197; Tue, 27 May 97 18:53:00 -0700 Message-Id: <9705280153.AA25197@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 2734 (1997-05-28 01:52:43 GMT) To: Multiple Recipients of Reply-To: John Fieber From: "Unicode Discussion" Date: Tue, 27 May 1997 18:52:41 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) On Tue, 27 May 1997, Pierre Lewis > In message "re:Multi-Lingual Project Gutenberg (was: Unicode plain text)", > 'jfieber@indiana.edu' writes: > > > > Es lebe plain text! (long live ~) > > > > I find this a tragic position. Before unicode, the common > > denominator for cross-platform data transfer was 7 bit ASCII. > > Second, Unicode is something more or less orthogonal to the notion of > plain text. So I don't really understand your comment above. Plain text > does not mean 7-bit ASCII. It could just as well mean UTF-8 Unicode. From other replies I've received I guess I wasn't clear about my point. Within the domain of "plain text" Unicode is doing a lot to raise the common denominator. This is great, but a sentiment has been expressed in this thread that higher level protocols are a hopeless mess and if you want portability, stick with plain text. In the near term that may be a reality but Unicode was born out of frustration with the existing mess of character encoding standards and a determination to make things better. I was simply making the observation that swearing off high level protocols because they are messy now seems very out of character with the spirit of Unicode. 
To clarify another posting, I did not say or mean to imply that higher level protocols should be addressed by the Unicode standard. That would be a Bad Thing for numerous reasons I'm sure you can all figure out. > Third, for all the great things that can be said for SGML, HTML, XML, > and ML, it still remains that plain text is the most portable > format, the simplest to deal with (on all platforms), and the only one > that is likely to be legible in 30 years. For some things, it's still > the best solution. Explain to me how SGML is less portable than plain text? If you don't have something that understands the tags, any reasonable text editor can strip them out leaving you with plain text. You don't need anything fancier than a text editor to create and view SGML documents. You are no *worse* off using SGML than you would be using plain text, but chances are good that you will be better off. In 30 years, SGML will still be legible because, unlike other markup schemes, it is a public standard not bound to a particular transient software product. This is why you find SGML in places like the aircraft industry where documents have active lifespans longer than most software companies. -john 28-May-97 3:24:59-GMT,2803;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id XAA06051 for ; Tue, 27 May 1997 23:24:58 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA25358; Tue, 27 May 97 19:22:40 -0700 Message-Id: <9705280222.AA25358@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 2736 (1997-05-28 02:21:41 GMT) To: Multiple Recipients of Reply-To: John Fieber From: "Unicode Discussion" Date: Tue, 27 May 1997 19:21:39 -0700 (PDT) Subject: Re: Unstable foundations and wavering faith On Tue, 27 May 1997, Unicode Discussion wrote: > --Ken (Color me rosy) Whistler > > P.S. 
For those who, like me, worry that all electronic data > not in plain text (and ASCII plain text at that) is in constant > danger of disappearing into the enormous historical bit bucket > of undecipherable formats using undecipherable encodings on > obsolete media, consider the following: Perhaps the greatest source > of information loss in the long run was the shift by the publishing > industry to use of cheap high-acid papers early in this century. > Ask librarians about the conditions of their pre-War collections No need to worry about electronic data disappearing in the future; it has been disappearing for quite some time now thanks to being stored on flaky or obsolete media, or in undocumented data formats of long extinct software. In a former life as a librarian, I spent quite a bit of time dealing with electronic data sneaking into the library inside the back covers of books and in other ways. Librarians have been fretting over digital data for some time now. Unlike computer scientists, we have been through the preservation thing many times. It is true, a book published in the 1700s is as good as new (okay, I exaggerate a bit...) while relatively recent publications turn to dust thanks to cheap paper. Most of the computer science literature has been published after the "acid incident" so, as a discipline, computer scientists tend to be blissfully ignorant of the event. The problem is not really that data isn't in plain text format, although that is sometimes helpful, but that (a) the formats are not documented and (b) there are way too many of them. Even if they were documented, condition (b) makes it too expensive to deal with unless it is *really* important data. SGML makes a serious attack on both problems. I just hope the marriage of SGML and Unicode in the form of XML is successful in bringing portable, durable documents to the masses. Then continue ironing out the storage media quirks and librarians will be happy.
:) -john 28-May-97 3:40:48-GMT,4183;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id XAA09453 for ; Tue, 27 May 1997 23:40:48 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA25292; Tue, 27 May 97 19:17:48 -0700 Message-Id: <9705280217.AA25292@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="ISO-8859-1" X-Uml-Sequence: 2735 (1997-05-28 02:17:31 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Tue, 27 May 1997 19:17:30 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id XAA09453 In message "re:Multi-Lingual Project Gutenberg (was: Unicode plain text)", 'jfieber@indiana.edu' writes: > I was simply making the observation that swearing off high level > protocols because they are messy now seems very out of character > with the spirit of Unicode. I don't see them as messy, just as short-lived. I don't perceive HTML as messy, quite the opposite (notwithstanding frequent abuse by authors such as using tags to get bold/bigger), but I don't expect to still use it in 30 years. For my part, I'm not swearing off high level protocols, but I think a very good point can be made for plain text, and I had a few questions I wished clarified wrt Unicode. That's all. > Explain to me how SGML is less portable than plain text? If you > don't have something that understand the tags, any reasonable > text editor can strip them out leaving you with plain text. I don't know SGML, but let's try the exercise with an HTML page I wrote (chosen randomly amongst the ones I can show outside): HTML source

Connecting an HP LaserJet 5M at home

By Pierre Lewis (aka téléLew).

This short page provides some notes on using an HP LaserJet 5M connected to a home setup. If you have comments or encounter problems, don't hesitate to call me (x8207).

The description is specific to the HP LaserJet 5M. Some useful information can also be found on the page about connecting a LaserWriter II NTX to a home NCD.

Basic connectivity

  • The normal way to connect the LJ5M to your home setup is via the Ethernet port. This requires some kind of hub to interconnect the Gandalf box, the NCD Same with tags stripped (almost illegible: headings, bullets gone) Connecting an HP LaserJet 5M at home By Pierre Lewis (aka téléLew). This short page provides some notes on using an HP LaserJet 5M connected to a home setup. If you have comments or encounter problems, don't hesitate to call me (x8207). The description is specific to the HP LaserJet 5M. Some useful information can also be found on the page about connecting a LaserWriter II NTX to a home NCD. Basic connectivity The normal way to connect the LJ5M to your home setup is via the Ethernet port. This requires some kind of hub to interconnect the Gandalf box, the NCD Same as a decent plain text file (formatted by lynx -- Tim's type 2) Connecting an HP LaserJet 5M at home _By Pierre Lewis (aka téléLew)._ This short page provides some notes on using an HP LaserJet 5M connected to a home setup. If you have comments or encounter problems, don't hesitate to call me (x8207). The description is specific to the HP LaserJet 5M. Some useful information can also be found on the page about [1]connecting a LaserWriter II NTX to a home NCD. Basic connectivity * The normal way to connect the LJ5M to your home setup is via the Ethernet port. This requires some kind of hub to interconnect the Gandalf box, the NCD ... References 1. file://localhost/tmp/lw2ntx.html Wonder what the SGML version of above would look like. 
Pierre 28-May-97 13:18:29-GMT,1533;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id JAA13596 for ; Wed, 28 May 1997 09:18:28 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA26845; Wed, 28 May 97 05:56:52 -0700 Message-Id: <9705281256.AA26845@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 2737 (1997-05-28 12:56:14 GMT) To: Multiple Recipients of Reply-To: Kent Karlsson From: "Unicode Discussion" Date: Wed, 28 May 1997 05:56:12 -0700 (PDT) Subject: SGML (Was: Re: Multi-Lingual Project Gutenberg (was: Unicode plain text)) Hi! Sorry for asking a maybe trivial question (and for getting a bit off-track): > > which is English with HTML 3.2 markup What "in English"? English markup or English "proper text"? I could imagine (though there is none now) HTML 3.2 markup in, say, Swedish. But are you saying that if the "proper text" of the document is in, say, Swedish, I should write at the top, even if the markup is "in English"? (I thought that the "EN" meant that the **markup** is based on English words.) And language attributes are to become a part of HTML, suitable also for multilingual "proper texts"... (Sorry, I don't know SGML.) 
/kent k 28-May-97 15:48:53-GMT,3968;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id LAA15268 for ; Wed, 28 May 1997 11:48:47 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA27338; Wed, 28 May 97 07:27:59 -0700 Message-Id: <9705281427.AA27338@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2741 (1997-05-28 14:27:34 GMT) To: Multiple Recipients of Reply-To: Frank da Cruz From: "Unicode Discussion" Date: Wed, 28 May 1997 07:27:32 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) > From other replies I've received I guess I wasn't clear about my > point. Within the domain of "plain text" Unicode is doing a lot > to raise the common denominator. This is great, but a sentiment > has been expressed in this thread that higher level protocols are > a hopeless mess and if you want portability, stick with plain > text. In the near term that may be a reality but Unicode was > born out of frustration with the existing mess of character > encoding standards and a determination to make things better. > > I was simply making the observation that swearing off high level > protocols because they are messy now seems very out of character > with the spirit of Unicode. > Nobody advocates stamping out higher level protocols, even if that were possible. We all use them all the time. I, for one, use them with my eyes open -- i.e. with full knowledge that all the work I put into creating a "rich" document will need to be done again at some point when the current "standard" for richness has been replaced by a new one if I want the document to survive. And again. And again. I remember the excitement when it first became possible to produce typeset-quality documents with Troff, R, DSR, Scribe, TeX, and their relatives. 
But I also continued to produce plain-text "documents" on a daily basis: email; netnews; computer programs in assembly language, Sail, Simula, C, Fortran, Pascal, PL/I, etc; online documentation that had to be portable to hundreds of platforms; plain-text record-oriented databases -- mailing lists for example. There is no reason for most of this sort of information to be "rich", and no reason that this type of work should not continue in Unicode. What is needed is emphatic allowance and support for Unicode plain text in the Unicode standard, i.e. a precise and thorough definition of what constitutes a self-contained preformatted plain-text document. This is primarily a matter of adopting a small but complete set of control codes needed for line breaks, paragraph breaks, page breaks, and direction control (most of these are already there), and a clear statement of the role of the "traditional" control characters at U+0000 - U+001F, U+007F, and U+0080 - U+009F. And outside the scope of the Unicode standard is the problem of properly tagging files in the file system. This has never been done right, on any operating system. The use of the "extension" (the part of the name after the dot, e.g. "DOC") is just plain silly, especially now that GUI-based operating systems are using this to associate applications with files -- click on a data file, launch the associated application on that file. What's silly about it is that anybody can name a file any way they please and there is no registration authority for extensions; conflicts inevitably arise -- sometimes with disastrous consequences. Even sillier is the idea that each file must belong to one and only one application. Plain text files can be used by many applications, but how do we mark them as being written in Unicode? Or Latin-1? Or JIS X 0208, etc.? Ideally there should be information in the directory entry to specify the file type and encoding. That's an issue for each OS maker, but one whose resolution is long overdue.
- Frank 28-May-97 23:08:37-GMT,7018;000000000001 Return-Path: Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA12580 for ; Wed, 28 May 1997 19:08:34 -0400 (EDT) Received: from localhost (jfieber@localhost) by fallout.campusview.indiana.edu (8.8.5/8.8.5) with SMTP id QAA04102; Wed, 28 May 1997 16:14:11 -0500 (EST) Date: Wed, 28 May 1997 16:14:10 -0500 (EST) From: John Fieber Reply-To: John Fieber To: Frank da Cruz cc: Multiple Recipients of Subject: Plain text vs. markup (was: re:Multi-Lingual Project Gutenberg) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Wed, 28 May 1997, Frank da Cruz wrote: > Nobody advocates stamping out higher level protocols, even if that were > possible. We all use them all the time. I, for one, use them with my > eyes open -- i.e. with full knowledge that all the work I put into > creating a "rich" document will need to be done again at some point when > the current "standard" for richness has been replaced by a new one if I > want the document to survive. And again. And again. > > I remember the excitement when it first became possible to produce > typeset-quality documents with Troff, R, DSR, Scribe, TeX, and their > relatives. The transient nature of these markup languages is not a trait of markup languages, but a product of having a one-to-one relationship between the markup language and a specific piece of application software. TeX files go with TeX, troff files go with troff, Scribe files go with Scribe, WordPerfect files go with WordPerfect, MS-Word files go with MS-Word. If the application falls out of favor, it takes its markup language and data with it. Exactly the same thing happens if you depend on software that uses its own unique character encoding, or the glyph encoding of some oddball font. 
It is percicely this fatal one-to-one markup/application relationship that SGML is targeted at. SGML is very different beast and it is a mistake to throw it in with the rest. Claiming that SGML is just another transient markup language that doesn't address document portability is similar to saying that Unicode is just another transient character encoding scheme that doesn't address multilingual computing. Absurd? Of course. > But I also continued to produce plain-text "documents" on a > daily basis: email; netnews; computer programs in assembly language, > Sail, Simula, C, Fortran, Pascal, PL/I, etc; I think we differ on the notion of "plain text" and "markup". Lets see. In email for example, what is the difference between this markup: From: jfieber@indiana.edu To: Whoever@somewhere Subject: la de da blah blah blah blah... and this markup: jfieber@indiana.edu Whoever@somewhere la de da blah blah blah blah... Semantically identical. Furthermore, the correct delivery of mail and news depends critically on markup as does netnews. However you delimit it, it is still markup. Same for the computer languages. What are braces, semicolons, parentheses, and comment delimiters in C if not markup to guide the compiler in parsing the program? Incidentally, most computer languages could be expressed in SGML markup (although the utility would be dubious). Unlike other markup languages, SGML makes no assumptions about the processing application. SGML merely provides a standard way for an application to distinguish markup from data. This allows SGML to be used as a foundation for a much broader range of applications and helps ensure a long life. On the other hand, as you may guess, SGML is not a complete solution--if typesetting is your domain, for example, you will still need some software to do the layout of your data (TeX works quite well)--but SGML serves to protect your data from dependencies on specific applications. That protection facilitates exchange between applications. 
In one case you feed your document to a typesetter, in another case you feed it to a database, in a third case, an on-line document viewer. Portability between applications extrapolates to portability across time. HTML may be out of fashion in 20 years, but any SGML-compliant application can still process it even if the designers never heard of HTML. (You might have to make up a style sheet, but that is orders of magnitude easier than the digital archaeology required to re-invent, say, troff from a couple of sample documents. SGML documents come with their own Rosetta stone--the DTD, or document type definition.) In an SGML world, the data drives the application, not the other way around as is currently the status quo. That is the fundamental shift that sets SGML apart from the other markup languages cited here as examples of why markup languages are to be avoided when document portability is a concern. > What is needed is emphatic allowance and support for Unicode plain text > in the Unicode standard, i.e. a precise and thorough definition of what > constitutes a self-contained preformatted plain-text document. This is > primarily a matter of adopting a small but complete set of control codes > needed for line breaks, paragraph breaks, page breaks, and direction > control (most of these are already there), and a clear statement of the > role of the "traditional" control characters at U+0000 - U+001F, U+007F, > and U+0080 - U+009F. I think the notion of "plain text" is a little muddy as these sorts of codes represent markup that is conceptually no different than, say, SGML. I fully agree, however, that there is room and a historical precedent for a small set of control (markup) codes in Unicode, but getting people to agree on what constitutes "complete" is another matter. :) I would propose that "complete" be defined as a minimal set of markup codes necessary to make a document understandable by a human without resorting to anything outside the Unicode standard.
Machine processing, beyond doing the Right Thing with whitespace should not be a criteria. Except for directional control, most of the necessary markup should be covered by addressing compatibility with ASCII, although clarification would be helpful. > Plain text files can be used by many applications, but how do we mark > them as being written in Unicode? Or Latin-1? Or JIS X 0208, etc. SGML offers some options here by hiding file system (or any storage mechanism) behind an entity manager which provides for such tagging. The details are not currently covered by the standard (which treats the entity manager pretty much as a black box), but the entity manager in James Clark's SP system offers a good example of how it might be done. -john 28-May-97 23:24:54-GMT,4953;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id TAA14930 for ; Wed, 28 May 1997 19:24:53 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA29376; Wed, 28 May 97 15:30:28 -0700 Message-Id: <9705282230.AA29376@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2747 (1997-05-28 22:29:45 GMT) To: Multiple Recipients of Reply-To: Frank da Cruz From: "Unicode Discussion" Date: Wed, 28 May 1997 15:29:43 -0700 (PDT) Subject: Re: Plain text vs. markup (was: re:Multi-Lingual Project Gutenberg) > It is percicely this fatal one-to-one markup/application > relationship that SGML is targeted at. SGML is very different > beast and it is a mistake to throw it in with the rest. Claiming > that SGML is just another transient markup language that doesn't > address document portability ... > I don't think anybody did that. But this does not mean SGML can be used for everything. > Unlike other markup languages, SGML makes no assumptions about > the processing application. > Except that it can parse SGML. 
I'm not arguing against SGML -- quite the opposite: I'm heavily in favor of (almost) anything that has survived the international standards process AND sees use in the real world, as opposed to schemes that companies make up and unilaterally proclaim to be standards. But SGML is to mark up text for later formatting to fit the requirements of some output device or application that understands this kind of markup. As distinguished from plain text as we have known it since the 1960s, in which a repertoire of graphic characters is mixed with a small number of control codes (call them markup if you wish) for simple actions like line breaks and so on, in order to achieve the *final* result, not (necessarily) to be input for a higher-level reformatter. > I would propose that "complete" be defined as a minimal set of > markup codes necessary to make a document understandable by a > human without resorting to anything outside the Unicode standard. > Machine processing, beyond doing the Right Thing with whitespace > should not be a criteria. Except for directional control, most of > the necessary markup should be covered by addressing > compatibility with ASCII, although clarification would be > helpful. > Right. Something like the following (ignoring BIDI for the moment): . LS is a hard line break. The next graphic character appears at the left margin of the following line. Equivalent to CR and LF on a Teletype. . Two LSs result in a blank line. . Three LSs result in two blank lines, and so on. . PS is a hard paragraph break (more about this below). . (form separator), whatever its instantiation (a new Unicode character, or ASCII Formfeed with a well-defined use in Unicode), starts a new page. The next graphic character appears on the top line, leftmost position of the new page. . Two FSs result in a blank page, and so on. Plus whatever is needed for specifying writing direction, including expanding on what is meant by "left", "top", etc, in the preceding items. 
That should do it. Personally, I find text to be most portable when it is displayed in fixed-width font, and spaces are used to line things up, rather than tabs (because tabs require external agreement about the tab settings). I don't think Vertical Tab or other obscure formatting controls (such as Line Feed taken literally) are of any use; in my experience they have always been treated as "synonyms" for the controls listed above. Then what to do about ASCII controls in Unicode text? I'd say that since ASCII (and Latin-x, etc) must be converted to Unicode, then it is the responsibility of the conversion agent to understand the local conventions for line breaks (etc) in the source text, and to convert to the well-defined Unicode controls. About Paragraph Separator... It seems to me that this one was designed with the "export from word processor" type of file in mind (those files we were discussing earlier in which each paragraph is a long line, terminated by a "paragraph separator" such as CR). I would not call this type of file plain text -- I would call it "input for a text formatter"; it needs further processing to be readable. (For example, if I print such a file on the local Laserwriter, the long lines are truncated -- thus I only see the first 80 characters of each paragraph.) Clearly we can become increasingly epistemological about what constitutes plain text (yes, C source code is input for a C compiler, but it is also text to be read, understood, and edited by people, sent by email without being reformatted, etc). And obviously some details still need working out: treatment of soft hyphens and such. But I think we're on the right track. 
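As an illustration of the rules listed above, here is a minimal sketch in Python; it is not something proposed on the list. It assumes that ASCII Form Feed (U+000C) stands in for the "form separator", treats PS as nothing more than a line break, and ignores BIDI entirely. The function names convert_line_breaks and layout are made up for this example. The first plays the role of the conversion agent for legacy text; the second applies the LS/PS/FS rules to produce pages of lines.

```python
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR
FF = "\u000c"  # ASCII Form Feed, assumed here to act as the "form separator"


def convert_line_breaks(text):
    """Conversion agent: map CRLF, lone CR, and lone LF to LS."""
    # CRLF must be handled first so it becomes one LS rather than two.
    return text.replace("\r\n", LS).replace("\r", LS).replace("\n", LS)


def layout(text):
    """Split already-converted text into pages, each a list of lines."""
    pages = []
    for page in text.split(FF):
        lines = []
        # Assumption: PS ends the current line exactly as LS does.
        for paragraph in page.split(PS):
            lines.extend(paragraph.split(LS))
        pages.append(lines)
    return pages


if __name__ == "__main__":
    sample = convert_line_breaks("first\r\n\r\nthird") + FF + "page two"
    for number, page in enumerate(layout(sample), start=1):
        print(number, page)
```

In this sketch two consecutive LS characters yield an empty line, and two consecutive form feeds yield a page holding a single empty line, which matches the "blank line" and "blank page" rules above.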
- Frank 29-May-97 4:14:06-GMT,3937;000000000001 Return-Path: Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id AAA24371 for ; Thu, 29 May 1997 00:14:05 -0400 (EDT) Received: from localhost (jfieber@localhost) by fallout.campusview.indiana.edu (8.8.5/8.8.5) with SMTP id XAA06831; Wed, 28 May 1997 23:14:04 -0500 (EST) Date: Wed, 28 May 1997 23:14:03 -0500 (EST) From: John Fieber To: Frank da Cruz cc: Multiple Recipients of Subject: Re: Plain text vs. markup (was: re:Multi-Lingual Project Gutenberg) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Wed, 28 May 1997, Frank da Cruz wrote: > > It is percicely this fatal one-to-one markup/application > > relationship that SGML is targeted at. SGML is very different > > beast and it is a mistake to throw it in with the rest. Claiming > > that SGML is just another transient markup language that doesn't > > address document portability ... > I don't think anybody did that. But this does not mean SGML can > be used for everything. No, but its useful range of applications is quite a bit wider than any other markup scheme I know of. That helps a lot in building a solid foundation that won't fade away. > But SGML is to mark up text for later formatting to fit the > requirements of some output device or application that understands > this kind of markup. SGML is explicitly *not* about text formatting. It is about marking up documents describing what content *is*, not what to do with it. If markup represents typesetting instructions, that markup is good for little else. If your markup describes what the content is, you have far more options. For example, the introduction of a new term in a technical manual may be rendered in italics. 
You could mark it up like: new term which would be fine if the end target is a typesetter, but if you mark it up with: new term, you can still render it as italic, but you can also automatically add it to the index as the defining location of the term, or in an on-line environment if you encounter an unfamiliar term, the search engine can seek out the defining occurrence if it exists. But back to your point: > As distinguished from plain text as we have ... > so on, in order to achieve the *final* result, not (necessarily) to > be input for a higher-level reformatter. Yes, though I would argue at length why SGML markup is well worth the extra effort, I'll also agree that this minimalist approach to document portability deserves support. > Then what to do about ASCII controls in Unicode text? I'd say > that since ASCII (and Latin-x, etc) must be converted to Unicode, > then it is the responsibility of the conversion agent to > understand the local conventions for line breaks (etc) in the > source text, and to convert to the well-defined Unicode controls. The only hitch for 7-bit ASCII is UTF-8, which can be seen as a convenient way to avoid the explicit conversion process of legacy data. If your external storage is UTF-8, how can you reliably tell what has been converted and what has not? > Clearly we can become increasingly epistemological about what > constitutes plain text (yes, C source code is input for a C > compiler, but it is also text to be read, understood, and edited > by people, sent by email without being reformatted, etc). After pondering it for a while, I cut that section out of my last post. :) One-sentence summary: some markup schemes cater to human processing, others to machine processing, and yet others, most notably programming languages, work hard to satisfy both needs.
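The UTF-8 hitch raised above can be made concrete with a small Python sketch. It rests on one certain fact (7-bit ASCII bytes decode identically as ASCII and as UTF-8) and one assumption made only for this example: that "converted" text is recognisable by its use of U+2028 or U+2029. The helper name line_break_conventions is hypothetical.

```python
def line_break_conventions(data):
    """Report which line-break conventions appear in UTF-8 encoded bytes."""
    text = data.decode("utf-8")
    found = set()
    if "\u2028" in text or "\u2029" in text:
        found.add("unicode")
    if "\r\n" in text:
        found.add("crlf")
    # Remove CRLF pairs so that only lone CR and lone LF remain.
    remainder = text.replace("\r\n", "")
    if "\r" in remainder:
        found.add("cr")
    if "\n" in remainder:
        found.add("lf")
    return found


legacy = b"first line\r\nsecond line\r\n"  # DOS-style 7-bit ASCII
# The same bytes are valid ASCII and valid UTF-8, with identical content.
assert legacy.decode("ascii") == legacy.decode("utf-8")

print(line_break_conventions(legacy))
print(line_break_conventions("one\u2028two".encode("utf-8")))
print(line_break_conventions(b"no line breaks at all"))
```

The last call returns an empty set: a file with no line breaks carries no evidence either way, which is the point of the question.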
-john 29-May-97 14:32:08-GMT,1718;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id KAA14517 for ; Thu, 29 May 1997 10:32:07 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA01680; Thu, 29 May 97 06:53:02 -0700 Message-Id: <9705291353.AA01680@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2753 (1997-05-29 13:52:26 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Thu, 29 May 1997 06:52:24 -0700 (PDT) Subject: Re: Plain text vs. markup (was: re:Multi-Lingual Project Gutenberg) In message "Re: Plain text vs. markup (was: re:Multi-Lingual Project Gutenberg)", 'fdc@watsun.cc.columbia.edu' writes: > Right. Something like the following (ignoring BIDI for the moment): > ... (details removed) BIDI is what I think makes it difficult. Without BIDI, I would be tempted to stick to local Unix/Mac/DOS conventions for C0 chars, add maybe BOM and ISS (or whatever). But BIDI works in blocks. Currently both LS and PS are block separators. It's been said here that probably LS shouldn't be a BIDI block separator. That leaves PS. And I have to use it (in particular if I have both right- and left-aligned sections). So can I mix PS with LS (or LF) and FF? Looks funny. Maybe it is an error to have PS function as both a paragraph separator (whatever that is -- I too feel it probably comes from a WP context) *and* a BIDI block separator. Maybe it would have been better to have a BIDI block separator as a separate Unicode control char, independent of any formatting intent.
Just a thought, Pierre 6-Jun-97 2:39:06-GMT,3185;000000000001 Return-Path: Received: from mail-out1.apple.com (A17-254-0-52.apple.com [17.254.0.52]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id WAA10937 for ; Thu, 5 Jun 1997 22:39:05 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by mail-out1.apple.com (8.8.5/8.8.5) with SMTP id TAA14624; Thu, 5 Jun 1997 19:25:14 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA00269; Thu, 5 Jun 97 19:21:45 -0700 Message-Id: <9706060221.AA00269@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=iso-2022-jp Content-Transfer-Encoding: 7bit X-Uml-Sequence: 2832 (1997-06-06 02:21:05 GMT) To: Multiple Recipients of Reply-To: Adrian Havill From: "Unicode Discussion" Date: Thu, 5 Jun 1997 19:21:03 -0700 (PDT) Subject: Re: Comments on ? Tim Partridge wrote: > I agree with his point of view that the tags > should be at the character level and not just > in the UTF-8 format. > > How about using Escape sequences? Ugh. The relatively few escape sequences at the character level is what makes Unicode so ATTRACTIVE, esp. to those that currently use escape sequence based character sets. (Tools to repair broken escape codes in JIS are almost standard equipment with most Japanese computer systems) Not to mention the complexity they add to simple and elegant string manipulation functions... processing escape codes can sometimes bump the algorithm efficiency up by one O() level. Put in escape codes at the character level, and Unicode begins to lose the simplicity factor, and becomes just another mammoth character set that nobody can or will implement--there are plenty out there. If I wanted escape sequences, I could choose from a lot of other character sets that are already out there. If you want a complicated character system that does tags and everything, there are plenty to choose from-- Unicode basher Prof. Ken Sakamura (U. of Tokyo) and Co. 
would be more than happy to tout the virtues of TRON, which is loaded with escape sequences galore. The TRON project has made a religion out of bad-mouthing Unicode, much like the computer industry has made a religion out of bad-mouthing a certain software firm in Redmond, Washington (who make a darn fine Unicode based OS, I might add). They have to-- they have to justify that the years of blood, sweat, tears (and most importantly, money) they've used making -their- worldwide standard character set has not repeated work that's already here and in use and better. (see and ) Granted, Unicode is complicated. It will get more complicated. This is a fact of life as representing languages is complicated. But I'd hope the character level stays as simple as possible, for those that need simplicity. I do NOT agree that tags should be at the character level. -- Adrian Havill Engineering Division, System Planning & Production Section 6-Jun-97 14:37:44-GMT,2109;000000000011 Return-Path: Received: from mail-out1.apple.com (A17-254-0-52.apple.com [17.254.0.52]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id KAA18800 for ; Fri, 6 Jun 1997 10:37:43 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by mail-out1.apple.com (8.8.5/8.8.5) with SMTP id HAA12466; Fri, 6 Jun 1997 07:22:02 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA02593; Fri, 6 Jun 97 07:16:28 -0700 Message-Id: <9706061416.AA02593@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2847 (1997-06-06 14:14:18 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Fri, 6 Jun 1997 07:14:17 -0700 (PDT) Subject: Re: Comments on ? In message "Re: Comments on ?", 'glenn@spyglass.com' writes: > I'd like to briefly summarize some of the positions taken on various > sides in this discussion. Thanks, very useful (esp. for one who didn't have the time to read all the posts carefully). 
I haven't read the MLSF yet (will do this weekend), but I'm sure I still won't agree with putting this tagging in UTF-8. UTF-8 is nothing more than one of many possible transformation formats, and it must always be possible to move between it and UCS-2 and other UTFs. Filters surely will (and almost certainly already do) exist to transform between these various CESs. What would they do with language tagging? > My personal position on the above is that an alternative non-UCD (i.e., > standard code assignment) approach is preferred. Its only negatives are > (a) opposition from (1) above and (b) the time required to make actual > code assignments. Sounds to me like the only possible approach, assuming language tagging is needed at the plain-text level (I don't have the knowledge to comment on that). Pierre P.S. What happened to the "unicode plain-text file" thread? Seems it died very suddenly (with no closure)! Maybe it was displaced by this new thread :-). 6-Jun-97 15:15:46-GMT,1789;000000000001 Return-Path: Received: from mail-out1.apple.com (A17-254-0-52.apple.com [17.254.0.52]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id LAA26420 for ; Fri, 6 Jun 1997 11:15:45 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by mail-out1.apple.com (8.8.5/8.8.5) with SMTP id IAA11222; Fri, 6 Jun 1997 08:03:32 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA02791; Fri, 6 Jun 97 07:57:13 -0700 Message-Id: <9706061457.AA02791@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2848 (1997-06-06 14:56:16 GMT) To: Multiple Recipients of Reply-To: Frank da Cruz From: "Unicode Discussion" Date: Fri, 6 Jun 1997 07:56:15 -0700 (PDT) Subject: Re: Comments on ? > P.S. What happened to the "unicode plain-text file" thread? Seems it > died very suddenly (with no closure)! Maybe it was displaced by this > new thread :-). > It seems as if this is trying to become a plain-text issue. I hope not. 
Plain text is supposed to be a simple sequence of *characters* and minimal formatting information (hard spaces, line breaks, page breaks, and in the case of Unicode, directionality indicators), irrespective of language, containing no mysterious metacodes. (Let's agree that hard line and page breaks are not mysterious metacodes.) In view of the temperature surrounding the language-tagging issue, the solution is not going to be simple or stable or soon to come, and therefore I believe it falls outside the scope of plain text, which by definition should be simple and stable and long-lasting. Language tags will be constantly changing and surrounded by politics and emotion. - Frank 7-Jun-97 16:38:53-GMT,1467;000000000001 Return-Path: Received: from mail-out2.apple.com (A17-254-0-51.apple.com [17.254.0.51]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA02679 for ; Sat, 7 Jun 1997 12:38:52 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by mail-out2.apple.com (8.8.5/8.8.5) with SMTP id JAA07384; Sat, 7 Jun 1997 09:27:30 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA07340; Sat, 7 Jun 97 09:24:48 -0700 Message-Id: <9706071624.AA07340@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2857 (1997-06-07 16:24:32 GMT) To: Multiple Recipients of Reply-To: Frank da Cruz From: "Unicode Discussion" Date: Sat, 7 Jun 1997 09:24:30 -0700 (PDT) Subject: Re: Plane 14 codes for language tagging? > > > My personal preference is for number 2. I kind of like Martin's proposal > > > for introducing a plain-text language tag using a control code, and I > > > think the existing control codes are fine. > > Good idea. Indeed the C1 area is not used in the Internet as far as I know. > There are still such things as terminals that use C1 control codes such as CSI, APC, OSC, etc (primarily VT220 and higher, which are the predominant types used by emulators such Kermit, Xterm, DECterm, etc). 
Do we intend that Unicode and terminal-to-host communication will become mutually exclusive concepts? - Frank 7-Jun-97 17:14:31-GMT,1996;000000000011 Return-Path: Received: from josef.ifi.unizh.ch (josef.ifi.unizh.ch [130.60.48.10]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id NAA07882 for ; Sat, 7 Jun 1997 13:14:30 -0400 (EDT) Received: from ifi.unizh.ch by josef.ifi.unizh.ch with SMTP (PP) id <18036-0@josef.ifi.unizh.ch>; Sat, 7 Jun 1997 19:14:30 +0200 Date: Sat, 7 Jun 1997 19:14:28 +0200 (MET DST) From: "Martin J. Duerst" Sender: mduerst@enoshima To: Frank da Cruz cc: Multiple Recipients of , MLSF discussion -- IETF Languages , Multiple Recipients of Subject: Re: Plane 14 codes for language tagging? In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Sat, 7 Jun 1997, Frank da Cruz wrote: > > > > My personal preference is for number 2. I kind of like Martin's proposal > > > > for introducing a plain-text language tag using a control code, and I > > > > think the existing control codes are fine. > > > > Good idea. Indeed the C1 area is not used in the Internet as far as I know. > > > There are still such things as terminals that use C1 control codes such as > CSI, APC, OSC, etc (primarily VT220 and higher, which are the predominant > types used by emulators such Kermit, Xterm, DECterm, etc). Do we intend that > Unicode and terminal-to-host communication will become mutually exclusive > concepts? Frank - I understand your concerns. But one way of looking at what we need is some tagging format possibly used in ACAP and IMAP, which MUST not leak to other places. And what you probably worry about is the C1 area in terms of octets (which is already gone with UTF-8) and not the C1 character space in Unicode, which turns up as two bytes in UTF-8. Regards, Martin. 
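[Editorial illustration, not part of the original message.] Martin's distinction between C1 octets and C1 characters can be checked with a few lines of Python. This is only a sketch: the names C1_CONTROLS and utf8_bytes are made up for the illustration, and the code points are the ISO 6429 values for the three controls Frank mentions. It shows that each C1 character becomes a two-byte UTF-8 sequence, 0xC2 followed by the original value, whereas the C0 control ESC stays a single byte.

```python
# C1 control characters named in the thread, keyed by their usual abbreviations.
# (Illustrative table; the code points are the standard ISO 6429 assignments.)
C1_CONTROLS = {"CSI": 0x9B, "OSC": 0x9D, "APC": 0x9F}


def utf8_bytes(code_point: int) -> bytes:
    """Return the UTF-8 encoding of a single Unicode code point."""
    return chr(code_point).encode("utf-8")


def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02X}" for b in data)


for name, cp in sorted(C1_CONTROLS.items()):
    encoded = utf8_bytes(cp)
    print(f"{name} U+{cp:04X} -> {hex_dump(encoded)} ({len(encoded)} bytes)")

# ESC is a C0 control, so it stays a single octet in UTF-8.
print(f"ESC U+001B -> {hex_dump(utf8_bytes(0x1B))}")
```

Note that the second byte of each sequence is numerically the same as the old C1 octet, so a receiver that does not decode UTF-8 first would still see that octet on the wire.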
8-Jun-97 8:27:08-GMT,2730;000000000001 Return-Path: Received: from mail-out1.apple.com (mail-out1.apple.com [17.254.0.52]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id EAA06491 for ; Sun, 8 Jun 1997 04:27:07 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by mail-out1.apple.com (8.8.5/8.8.5) with SMTP id BAA08438; Sun, 8 Jun 1997 01:14:24 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA09568; Sun, 8 Jun 97 01:11:13 -0700 Message-Id: <9706080811.AA09568@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2866 (1997-06-08 08:10:50 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Sun, 8 Jun 1997 01:10:49 -0700 (PDT) Subject: Re: Comments on Finally got around to reading the MLSF Internet Draft. Couple of comments: 1) One thing really made me jump: the first sentence in the Abstract. "While UTF-8 solves most internationalization (I18N) problems, ..." That makes as much sense to me as saying that QuotedPrintable solves most I18N problems for Western Europe. It's not QP which does that, it's ISO 8859-1. QP is just one way to encode 8859-1 text so it can pass most mail relays without corruption. But Base64 is another way to do the same thing (which can make statistical sense for some languages). Similarly, it's not UTF-8 which solves the wider problem of world-wide I18N, it's Unicode (and/or ISO 10646). The canonical representation of Unicode is 16-bit quantities (UCS-2). UTF-8 is nothing more than one of many possible transformations (UTF-7 is another that's already defined: RFC 2152). If I understood right, UTF-8 was created mainly to make Unicode coexist reasonably well with existing OSs that use 8-bit characters, for example Unix.
Not that I agree with the proposal, but the MLSF Internet Draft should make clear what the implications are of trying to put language tags into UTF-8 (for example, assumption that UTF-8 becomes the canonical representation of Unicode, loss of tagging when converting to other CESs). I guess the pros and cons have been discussed at length here. 2) It would have been nice to put a few examples of actual UTF-8 strings with language tags (in hex of course) in the document. As to the fundamental issue of whether language tagging belongs in plain-text Unicode, I must say I'm pretty neutral at this point. I think they could be useful. But, as Frank was saying, if it's going to take 10 years to converge to an acceptable solution, then it doesn't belong in plain text, but at a higher level. Pierre 9-Jun-97 3:10:12-GMT,1193;000000000001 Return-Path: Received: from cam.spyglass.com (sapir.cam.spyglass.com [208.203.148.66]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA24496 for ; Sun, 8 Jun 1997 23:10:11 -0400 (EDT) Received: from mykhe.cam.spyglass.com (shivacam-1.cam.spyglass.com [208.203.149.181]) by cam.spyglass.com (8.7.5/8.7.3) with SMTP id XAA00525 for ; Sun, 8 Jun 1997 23:10:22 -0400 (EDT) Message-Id: <3.0.32.19970608224316.006e9e50@mailhost.cam.spyglass.com> X-Sender: glenn@mailhost.cam.spyglass.com X-Mailer: Windows Eudora Pro Version 3.0 (32) Date: Sun, 08 Jun 1997 22:57:16 -0400 To: Frank da Cruz From: Glenn Adams Subject: Re: Plane 14 codes for language tagging? Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" At 10:32 AM 6/7/97 -0700, you wrote: >and escape sequences would take in a "Unicode terminal"? Would it use >octets or hextets? The Unicode standard is clear that escape sequences and controls in canonical Unicode are encoded using 16-bit codes. Of course another encoding system which employs Unicode may choose a different tack. G. 
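[Editorial illustration, not part of the original messages.] Pierre's point that UTF-8 is only one of several interchangeable transformation formats, and Glenn's remark that controls are 16-bit codes in canonical Unicode, can both be seen in a short Python sketch. It assumes Python's built-in codecs as stand-ins for the "filters" Pierre describes, and uses UTF-16 (big-endian) in place of UCS-2, which is identical for the BMP characters used here. The sample string and the name round_trip are invented for the illustration.

```python
# Sample text: Latin-1 letter, Unicode LINE SEPARATOR, and the ESC control.
SAMPLE = "caf\u00e9\u2028\x1b"

# Encoding schemes discussed in the thread (UTF-16BE stands in for UCS-2).
SCHEMES = ["utf-8", "utf-16-be", "utf-7"]


def round_trip(text: str, scheme: str) -> str:
    """Encode text in the given scheme and decode it back again."""
    return text.encode(scheme).decode(scheme)


for scheme in SCHEMES:
    encoded = SAMPLE.encode(scheme)
    ok = round_trip(SAMPLE, scheme) == SAMPLE
    print(f"{scheme:10} {len(encoded):2} bytes, lossless round trip: {ok}")

# In the 16-bit form, even a C0 control such as ESC occupies a full code unit.
print("ESC in UTF-16BE:", "\x1b".encode("utf-16-be").hex())
```

Because every scheme carries the same sequence of code points, anything that exists only as a byte-level convention inside one scheme has no counterpart after conversion, which is the loss Pierre is worried about.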
4-Jul-97 0:38:37-GMT,4502;000000000001 Return-Path: Received: from mail-out2.apple.com (mail-out2.apple.com [17.254.0.51]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id UAA01503 for ; Thu, 3 Jul 1997 20:38:35 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by mail-out2.apple.com (8.8.5/8.8.5) with SMTP id RAA37606; Thu, 3 Jul 1997 17:27:11 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA11841; Thu, 3 Jul 97 17:22:24 -0700 Message-Id: <9707040022.AA11841@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 3064 (1997-07-04 00:22:02 GMT) To: Multiple Recipients of Reply-To: Randy Presuhn From: "Unicode Discussion" Date: Thu, 3 Jul 1997 17:22:01 -0700 (PDT) Subject: UTF-8 in SNMPv3 Hi - The SNMPv3 working group of the IETF is hoping to make use of UTF-8 for some human-readable information in the MIBs used to manage SNMPv3. The convention currently used for this kind of information is described on page 4 of RFC 1903. (For easy reference, I've appended the text to the end of this message.) We would like to define a new convention formulated in terms of UTF-8 for use in new MIBs. What we've not yet reached agreement on is the question of "non-printable stuff". Some believe that NVT ASCII's control characters are somehow less problematic than those of 10646, others find the problems equivalent. The questions that come to my mind are: 1) Is there any merit to the argument that the "non-printable stuff" in 10646 is any better or worse than the NVT ASCII definition? 2) Can we use standard character properties to identify a "printable" subset that would not break for any language? (The folks that want these also want to have CRLF...) Background information: In the SNMP protocol notions of equality and ordering have no "locale" component. There is no notion of character equivalence. It is very much a "bits is bits" environment.
The concerns of working group members appear to be arising from: 1) what does it mean to "support 10646" 2) how to display "weird stuff" 3) how to input "weird stuff" 4) the old CR/LF problem Is there a nice, concise, convincing answer I can take back to the working group? ========== Excerpt from RFC 1903, DisplayString Textual convention ========== "Represents textual information taken from the NVT ASCII character set, as defined in pages 4, 10-11 of RFC 854. To summarize RFC 854, the NVT ASCII repertoire specifies: - the use of character codes 0-127 (decimal) - the graphics characters (32-126) are interpreted as US ASCII - NUL, LF, CR, BEL, BS, HT, VT and FF have the special meanings specified in RFC 854 - the other 25 codes have no standard interpretation - the sequence 'CR LF' means newline - the sequence 'CR NUL' means carriage-return - an 'LF' not preceded by a 'CR' means moving to the same column on the next line. - the sequence 'CR x' for any x other than LF or NUL is illegal. (Note that this also means that a string may end with either 'CR LF' or 'CR NUL', but not with CR.) Any object defined using this syntax may not exceed 255 characters in length." ========== End Excerpt =============== --------------------------------------------------------------------- Randy Presuhn BMC Software, Inc.
(Silicon Valley Division) Voice: +1 408 556-0720 (Formerly PEER Networks) http://www.bmc.com Fax: +1 408 556-0735 1190 Saratoga Avenue, Suite 130 Email: rpresuhn@bmc.com San Jose, California 95129-3433 USA --------------------------------------------------------------------- In accordance with the BMC Communications Systems Use and Security Policy memo dated December 10, 1996, page 2, item (g) (the first of two), I explicitly state that although my affiliation with BMC may be apparent, implied, or provided, my opinions are not necessarily those of BMC Software and that all external representations on behalf of BMC must first be cleared with a member of "the top management team." --------------------------------------------------------------------- 30-Jun-99 19:29:47-GMT,1992;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id PAA19372 for ; Wed, 30 Jun 1999 15:29:45 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id MAA342738 ; Wed, 30 Jun 1999 12:18:25 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA07842; Wed, 30 Jun 99 12:01:45 -0700 Message-Id: <9906301901.AA07842@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8249 (1999-06-30 19:01:34 GMT) From: Frank da Cruz To: Unicode List Cc: Unicode List Date: Wed, 30 Jun 1999 12:01:33 -0700 (PDT) Subject: Re: Unicode selections for X11 (cont'd) Juliusz Chroboczek wrote: > I've got a question about the C0 and C1 control character ranges. > I call them `legacy control characters'. Do people object to this > terminology? > I hope so! The word "legacy" is emotionally toned and value-laden. It denigrates 30+ years of computing practice and standards activities, and it implies that plain text is a relic of the past to be discarded with all possible haste, and those who haven't done so yet have some sort of "character" defect. 
In fact, plain text is the only immutable format in computing. GUI and WYSIWYG formats change faster than anybody can keep up with them, and information encoded in these formats rapidly becomes inaccessible (or accessible only by utilities (like UNIX "strings") that extract the plain text from them, if there is any). > Does anyone have a better name? > C0 and C1 control characters. These are ISO standard character sets and ISO-standard terminology is available to refer to them. Finally, please remember that Unicode is a plain-text standard. The control characters are there for a reason: you need them in plain text. - Frank 30-Jun-99 19:54:27-GMT,2968;000000000011 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id PAA29133 for ; Wed, 30 Jun 1999 15:54:26 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id MAA188082 ; Wed, 30 Jun 1999 12:50:57 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA08427; Wed, 30 Jun 99 12:36:54 -0700 Message-Id: <9906301936.AA08427@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 (generated by tm-edit 7.104) Content-Type: text/plain; charset=US-ASCII X-Uml-Sequence: 8252 (1999-06-30 19:36:20 GMT) From: Juliusz Chroboczek To: Unicode List Date: Wed, 30 Jun 1999 12:36:18 -0700 (PDT) Subject: Re: Unicode selections for X11 (cont'd) >> I've got a question about the C0 and C1 control character ranges. >> I call them `legacy control characters'. Do people object to this >> terminology? Frank da Cruz : FdC> I hope so! The word "legacy" is emotionally toned and FdC> value-laden. 
It denigrates 30+ years of computing practice and FdC> standards activities, and it implies that plain text is a relic FdC> of the past to be discarded with all possible haste, It cannot be said that the C0 and C1 control characters are the greatest achievement of these ``30+ years etc.'' FdC> In fact, plain text is the only immutable format in computing. Agreed. And the only reason it is not portable is the poor standardisation of the C0 and C1 control characters. I've seen the following forms of plain text: NL is a line break, there's no paragraphs: Unix NL is a line break, NL NL is a paragraph separator: Unix NL is a paragraph separator, line breaks are implicit: ports of MS-DOS applications to Unix. CR LF is a line break: MS-DOS CR LF is a paragraph separator, line breaks are implicit: MS-DOS. CR LF is a paragraph separator, CR (or was it LF?) is a line break: MS-DOS. CR is a line break: MacOS. CR is a paragraph separator: MacOS. without counting, of course, systems on which record information is kept out-of-band (such as VMS). >> Does anyone have a better name? FdC> C0 and C1 control characters. These are ISO standard character FdC> sets and ISO-standard terminology is available to refer to them. Okay. Changed. FdC> Finally, please remember that Unicode is a plain-text standard. FdC> The control characters are there for a reason: you need them in FdC> plain text. You need a paragraph separator and possibly a line break (and perhaps a page break). Unicode defines well-standardised codepoints for those. If you use other control characters, such as SO/SI for controlling boldface or italics, or BS (or CR) for overstriking, or terminal control sequences, it ain't plain text no more. J. 
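[Editorial illustration, not part of the original message.] The conventions Juliusz lists can each be mapped onto the two Unicode separators he refers to, U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR. The following Python sketch assumes the caller already knows which convention a file follows; the function name to_unicode_separators and its parameters are hypothetical, chosen only for this illustration.

```python
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR


def to_unicode_separators(text, line_break=None, paragraph_break=None):
    """Rewrite platform break sequences as explicit LS / PS characters.

    line_break and paragraph_break are the control sequences the source
    convention uses (for example "\n" and "\n\n"); pass None for a kind
    of break the convention does not mark explicitly.
    """
    rules = []
    if line_break:
        rules.append((line_break, LS))
    if paragraph_break:
        rules.append((paragraph_break, PS))
    # Replace longer sequences first so "\n\n" is not consumed as two "\n".
    rules.sort(key=lambda rule: len(rule[0]), reverse=True)
    for old, new in rules:
        text = text.replace(old, new)
    return text


# Unix, second convention in the list: NL is a line break, NL NL a paragraph.
unix = to_unicode_separators("roses\nviolets\n\nnext stanza", "\n", "\n\n")

# MS-DOS word-processor style: CR LF ends a paragraph, lines wrap implicitly.
dos = to_unicode_separators("first para\r\nsecond para", None, "\r\n")

print(repr(unix))
print(repr(dos))
```

Once the breaks are explicit, generic tools can treat them uniformly; Python's own str.splitlines(), for instance, already splits on both U+2028 and U+2029.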
30-Jun-99 20:08:23-GMT,4025;000000000011 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA03904 for ; Wed, 30 Jun 1999 16:08:22 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id MAA200106 ; Wed, 30 Jun 1999 12:52:24 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA08371; Wed, 30 Jun 99 12:35:22 -0700 Message-Id: <9906301935.AA08371@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8250 (1999-06-30 19:35:08 GMT) From: Asmus Freytag To: Unicode List Date: Wed, 30 Jun 1999 12:35:07 -0700 (PDT) Subject: Re: Superscript asterisk Being able to do "plain text" math is one of the goals of the Unicode Technical Committee now. Since the publication of Unicode 2.0, three years ago, we have had a lot of expert input on what plain text math capabilities are needed, and also, where our existing repertoire of math operators is insufficient. (We are, incidentally, also interested in evaluating and improving our other technical symbol collections, but so far have not had the long and sustained input from experts in other fields, as we had for mathematics). Full layout of mathematical expressions will need some form of markup, although many formulas that do not need the full generality can be laid out correctly if the mathematical operator characters in Unicode are interpreted semantically. Semantics for formatting that one needs to distinguish e.g. between summation sign and sigma. They look the same, but summation sign can take limit expressions etc. Another aspect of semantics is the mathematical semantics. Here it's necessary to make enough distinctions so that, if a small and large form of an operator can occur in the same text, that they can be distinguished by their character code without recourse to font information. 
Doing so allows plain-text searches for math formulas. Caveat: If and where mathematicians have used 'operator overloading', to borrow a C++ term, and deliberately used the same operator with different mathematical meaning in another sub-discipline, we would not sub-divide the character, as the larger context would be enough to determine its meaning. Our foremost goal has therefore been to complete our repertoire and where necessary introduce additional distinctions for the two reasons I mentioned. In the case of ASTERISK, the analysis that is needed, and that, as far as I have seen, has not been made, is to present evidence that cases exist (or are easily conceivable) where *both* the ASCII asterisk and yet another asterisk are needed in the same text, and with consistent distinction in use or formatting. Ricardo has said that one could use the proposed asterisk in conjunction with the ASCII asterisk to denote a regular expression of zero or more asterisks. This is the one example that cannot serve, since by extension, it would require an infinite series of asterisks (suppose I wanted to define a regular expression consisting of zero or more instances of the proposed asterisk!). Typographically, asterisk may indeed show a variation between full-size and superscript forms. For standard text fonts, the full-size form of asterisk occurs only occasionally. In the vast majority of fonts on my system, as well as in the Unicode Standard, and ISO/IEC10646-1, ASTERISK is clearly depicted as a superscripted symbol (i.e. it's 1/2 height and extends upwards from the centerline of the font, which is just slightly below the x height). The asterisk and superscript 2 have the same location and dimension. Therefore, unless Ricardo is proposing a character that has the same dimension as a *superscripted* SUPERSCRIPT TWO, my conclusion would be that we already _have_ the character he wants, and that he is using a poor font for his purpose.
A./ 30-Jun-99 20:24:18-GMT,2893;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA08593 for ; Wed, 30 Jun 1999 16:24:17 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id NAA188518 ; Wed, 30 Jun 1999 13:10:34 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA08959; Wed, 30 Jun 99 13:00:48 -0700 Message-Id: <9906302000.AA08959@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8253 (1999-06-30 20:00:25 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Wed, 30 Jun 1999 13:00:24 -0700 (PDT) Subject: Re: Unicode selections for X11 (cont'd) > It cannot be said that the C0 and C1 control characters are the > greatest achievement of these ``30+ years etc.'' > Actually they served us all rather well considering how few of them there are and how long they lasted (and continue to last). We've covered this ground before... But (to cite only one example) do you know how many terminals and terminal emulators are "still" in use? I would venture to say the number has not declined significantly since the 1980s. It might well have increased. It's just that they are no longer the *only* form of online access, and they work well, so we ignore them. > FdC> In fact, plain text is the only immutable format in computing. > > Agreed. And the only reason it is not portable is the poor > standardisation of the C0 and C1 control characters. > The CR/LF/CRLF confusion is annoying of course, but we've lived with it all these years, and continue to live with it. But you're talking about file formats. The use of control characters in data communications is fairly well standardized, pretty much along the lines of a Teletype: CR moves the print head to the left margin, LF moves it down one line, and ESC introduces a device-dependent escape or control sequence, etc. 
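[Editorial illustration, not part of the original message.] The Teletype-style semantics Frank describes (CR returns to the left margin, LF moves down one line without changing the column) can be sketched as a tiny renderer in Python. This is a simplification under stated assumptions: the name render_teletype is invented here, only CR, LF, and BS are interpreted, escape sequences are ignored, and a later character replaces an earlier one in the same cell, as on a screen, rather than overstriking it as paper would.

```python
def render_teletype(stream: str) -> str:
    """Render a character stream using Teletype-like CR / LF / BS semantics."""
    rows = [[]]          # each row is a list of single characters
    row = col = 0
    for ch in stream:
        if ch == "\r":            # CR: back to the left margin, same row
            col = 0
        elif ch == "\n":          # LF: down one row, same column
            row += 1
            if row == len(rows):
                rows.append([])
        elif ch == "\b":          # BS: one column left, never past the margin
            col = max(0, col - 1)
        else:                     # printable: place at the cursor, advance
            line = rows[row]
            if col >= len(line):
                line.extend(" " * (col - len(line) + 1))
            line[col] = ch
            col += 1
    return "\n".join("".join(line).rstrip() for line in rows)


print(render_teletype("ab\r\ncd"))   # CR LF together: a conventional new line
print(render_teletype("ab\ncd"))     # bare LF: "cd" starts in column 3
print(render_teletype("ab\rc"))      # bare CR: "c" lands on top of "a"
```

Seen this way, CR LF as a "line terminator" is a file-format convention layered on top of two independent cursor movements, which is the distinction Frank draws between file formats and data communications.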
> FdC> Finally, please remember that Unicode is a plain-text standard. > FdC> The control characters are there for a reason: you need them in > FdC> plain text. > > You need a paragraph separator and possibly a line break (and perhaps > a page break). Unicode defines well-standardised codepoints for > those. If you use other control characters, such as SO/SI for > controlling boldface or italics, or BS (or CR) for overstriking, or > terminal control sequences, it ain't plain text no more. > But Unicode and the terminal acess model are not mutually exclusive. There can be (and are) Unicode-based terminal emulators, capable of handling (e.g.) UTF-8 on the wire. And when you have terminal communications, you have control characters. (When you emulate, say, a VT320, you have LOTS of control characters :-) - Frank 30-Jun-99 21:45:11-GMT,2978;000000000011 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id RAA04617 for ; Wed, 30 Jun 1999 17:45:10 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id OAA339770 ; Wed, 30 Jun 1999 14:33:51 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA09734; Wed, 30 Jun 99 14:17:35 -0700 Message-Id: <9906302117.AA09734@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Uml-Sequence: 8255 (1999-06-30 21:17:25 GMT) From: Markus Kuhn To: Unicode List Date: Wed, 30 Jun 1999 14:17:23 -0700 (PDT) Subject: Re: Plain Text Juliusz Chroboczek wrote on 1999-06-30 19:36 UTC: > You need a paragraph separator and possibly a line break (and perhaps > a page break). Unicode defines well-standardised codepoints for > those. If you use other control characters, such as SO/SI for > controlling boldface or italics, or BS (or CR) for overstriking, or > terminal control sequences, it ain't plain text no more. 
The only thing that is clear about "plain text" is that it is not well defined at all. There is certainly no ISO standard that gives you any indication of what "plain text" is. The Unix community feels somewhat confident about the notion of plain text, just because they have editors such as ed, vi, emacs, etc. that agree on a common text format that is so simple that it has become customary to refer to it as plaintext. Many aspects of "plain text" are ill-defined these days: a) how do you terminate lines and paragraphs b) is there a terminator after the last line/paragraph c) is the line formatting the task of the sending or the receiving process? For Unix the answers used to be a) LF and no paragraph concept b) yes c) the sender has to insert line breaks but thanks to the heterogeneity of the Internet, these strict rules have for some years been weakened significantly in common practice. Some aspects of the classical Unix plaintext definition (which came originally from tty output hardware interfaces) do not make sense any more. For example, the insertion of LFs in the middle of paragraphs causes these LFs to move around whenever a few words are changed, which seriously disrupts revision control systems (e.g., diff and RCS) and it is not adequate anymore at all today with reformatting web browsers now being a dominating output device and not 1960s ttys. I think the Unix community should slowly get used to the idea of abandoning LFs in the middle of paragraphs in plain text documents and let the editor and display tool perform the reformatting at display time. Markus -- Markus G.
Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: 30-Jun-99 22:46:24-GMT,2237;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id SAA22071 for ; Wed, 30 Jun 1999 18:46:24 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id PAA187464 ; Wed, 30 Jun 1999 15:36:20 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA10018; Wed, 30 Jun 99 15:25:38 -0700 Message-Id: <9906302225.AA10018@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8256 (1999-06-30 22:25:27 GMT) From: John Cowan To: Unicode List Date: Wed, 30 Jun 1999 15:25:26 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Markus Kuhn scripsit: > The only thing that is clear about "plain text" is that it is not well > defined at all. There is certainly no ISO standard that gives you any > indication of what "plain text" is. What a pity. Perhaps there should be one (no :-)). > The Unix community feels somewhat > confident about the notion of plain text, just because they have editors > such as ed, vi, emacs, etc. that agree on a common text format that is > so simple that it has become customary to refer to it as plaintext. The notion of plain text long predates Unix: it was exactly the same, for example, on the PDP-8, which is where I first learned computing. (Terminator was CR/LF, and the character code was 7-bit-ASCII-with-8th-bit- set, for uniformity with Model 33 Teletypes). > I think the Unix community should slowly get used to the idea of > abandoning LFs in the middle of paragraphs in plain text documents and > let the editor and display tool perform the reformatting at display > time. 
AFAIK, the "reformatting web browsers" you refer to do not reformat plain text at all, which means that infinite-line-length alleged plain text can be read only with difficulty and much scrolling, and printing is impossible. -- John Cowan cowan@ccil.org I am a member of a civilization. --David Brin 30-Jun-99 22:54:43-GMT,3347;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id SAA23739 for ; Wed, 30 Jun 1999 18:54:43 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id PAA57670 ; Wed, 30 Jun 1999 15:46:45 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA10105; Wed, 30 Jun 99 15:33:04 -0700 Message-Id: <9906302233.AA10105@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8257 (1999-06-30 22:32:56 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Wed, 30 Jun 1999 15:32:55 -0700 (PDT) Subject: Re: Plain Text > The only thing that is clear about "plain text" is that it is not well > defined at all. > Actually, it tends to be well-defined for each platform. And then the interchange methods among platforms tend to converge on a few simple conventions: ASCII (or the appropriate ISO character set, or now UTF-8 or other form of Unicode), as opposed to EBCDIC (or Baudot, or Sixbit); CRLFs separating lines, and paragraphs separated by blank lines. Somewhat less well defined, but nevertheless in common use, are bare Carriage Return or Backspace for overstriking, Formfeed for "new page", and Tab for tabbing (with several different conventions about tabstops). Lines are terminated at somewhere between 72 and 80 characters by convention, because that's how wide terminal screens are, and before them the Teletype carriage, and before that the most common kind of punchcard. 
Or for that matter, typewriters and sheets of paper (A4 or US, take your pick :-) To this day, we follow these conventions in newsgroups and email, although now it might be more a matter of "netiquette" than necessity (as in the BITNET days, when e-mail was, quite literally, 80-column card images). These simple conventions let us format our text exactly the way we want to. We can indent or not, we can put line breaks where we want them, we can have columns of numbers or other tabular presentations, mathematical expressions, and idiosyncratic forms of emphasis. Many people want their text to stay the way they wrote it. And many people also are not fond of receiving email in every kind of bizarre format that any application developer can dream up when it contains, in fact, nothing but words (but I stray). > I think the Unix community should slowly get used to the idea of > abandoning LFs in the middle of paragraphs in plain text documents and > let the editor and display tool perform the reformatting at display > time. > But what IS plain text? Maybe some people might like to have their email reformatted, but I don't think they want their C or Fortran or PostScript programs to receive the same treatment. Nor, for that matter, poetry or any other forms of text where line breaks, indentation, and blank lines serve a purpose. As in, for example, the preceding paragraph. No more plain-text bashing! No more "legacy" saying! Our focus should be not on stamping out plain text, but on promoting international multilingual communication through a universal character set that does not impose a particular modus vivendi upon its users.
- Frank 30-Jun-99 23:19:45-GMT,1376;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA26033 for ; Wed, 30 Jun 1999 19:19:43 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id QAA258932 ; Wed, 30 Jun 1999 16:07:28 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA10518; Wed, 30 Jun 99 15:53:44 -0700 Message-Id: <9906302253.AA10518@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8258 (1999-06-30 22:53:34 GMT) From: John Cowan To: Unicode List Date: Wed, 30 Jun 1999 15:53:33 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Frank da Cruz scripsit: > No more plain-text bashing! No more "legacy" saying! Our focus should be > not on stamping out plain text, but on promoting international multilingual > communication through a universal character set that does not impose a > a particular modus vivendi upon its users. Hear, hear! Unicode (n.): The *last* legacy character set. -- John Cowan cowan@ccil.org I am a member of a civilization. --David Brin 1-Jul-99 20:12:49-GMT,3132;000000000001 Return-Path: Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id QAA02796; Thu, 1 Jul 1999 16:11:21 -0400 (EDT) Date: Thu, 1 Jul 99 16:11:21 EDT From: Frank da Cruz To: Otto Stolz cc: unicode@unicode.org Subject: Re: Plain Text In-Reply-To: Your message of Thu, 1 Jul 1999 03:57:30 -0700 (PDT) Message-ID: > Am 1999-06-30 um 14:17 h PDT hat Markus Kuhn geschrieben: > > The only thing that is clear about "plain text" is that it is not well > > defined at all. > > Am 1999-06-30 um 15:32 h PDT hat Frank da Cruz geschrieben: > > Actually, it tends to be well-defined for each platform. 
> > In MS-DOS (or PC-DOS and other DOS variants) on the PC, it is not > well defined, at all: > Not to prolong this discussion, which took place once before, at great length, in May to July 1997... > - '0D0A'x (CR+LF) means either line-break or pararaph separator, > When/if it means pararaph separator it's not plain text. Plain text is what you TYPE at the DOS prompt. In such files (e.g. a READ.ME file) CRLF means Carriage Return (move the cursor to the left margin) and Line Feed (move the cursor down one row). > - '09'x (HT) means either a tabulator (and nobody knows where the > tab positions are supposed to be) or a line-break, > In DOS, when you TYPE a file at the DOS prompt, a Tab character is expanded to enough blanks to bring us to the next tab stop, which are set according to the most common convention: 1, 9, 17, ... (1-based). > - '1A'x (SUB, aka Ctrl-Z) either means end of text, or a > right-pointing arrow; when it is used as an end-of-text marker, > the remainder of the storage block may contain arbitrary characters > with some programs and must contain '00'x with other programs (nice > feature when one of the former writes a file one of the latter is > supposed to read). > That's not a plain-text issue, it's a character encoding and file format issue. Ctrl-Z as an EOF indicator is a relic of CP/M, carried forward into DOS for compatibility, used by some apps and ignored by others. Two years ago I suggested that we come up with a standard for Unicode plain text that can be used as a baseline when converting files from DOS, UNIX, the Macintosh, etc, to Unicode, and that says what control characters (C0, C1, as well as Line Separator, Paragraph Separator, etc) mean in a plain-text file or data stream. We made some good progress but eventually the discussion fizzled out. If I can summarize it briefly: . 
Yes, but plain text in this sense is inadequate for representing (list of writing systems that need higher-level formatting assistance, rendering engines, etc.) . Fine, but they need that anyway. For many other languages, plain text is possible, and there should be no reason not to settle on a standard representation for it in those cases where it can be used. If anybody would like to revisit that discussion, I've uploaded it to: ftp://kermit.columbia.edu/kermit/e/plain.txt (about 300K of plain text :-) - Frank 2-Jul-99 7:37:25-GMT,6658;000000000011 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id DAA21905 for ; Fri, 2 Jul 1999 03:37:24 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id AAA206792 ; Fri, 2 Jul 1999 00:29:42 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA18813; Fri, 2 Jul 99 00:08:06 -0700 Message-Id: <9907020708.AA18813@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8285 (1999-07-02 07:07:55 GMT) From: Edward Cherlin To: Unicode List Cc: unicode@unicode.org Date: Fri, 2 Jul 1999 00:07:54 -0700 (PDT) Subject: Re: Plain Text At 15:32 -0700 6/30/1999, Frank da Cruz wrote: >> The only thing that is clear about "plain text" is that it is not well >> defined at all. My experience is that ASCII plain text is sufficiently well defined but has been incredibly badly implemented, due in part to the requirement in the 1960s and 1970s for keeping programs as small as possible, and in part to the rarity of cross-platform file transfer until the 1990s. The original definition, as John Cowan has pointed out, was anything a Teletype could reliably render, including overstrikes. Thinking of ASCII as printer commands rather than text makes it easier to understand the origins of its problems. 
(I have used printing terminals and video terminals that permitted overstrikes, designed for APL in particular and for what you will in general. Overstriking used to be taught in typing textbooks for creating signs like cent, c BS /. The problems we have with ASCII plain text come mainly from a small set of common variant practices. Using CR, LF, or CR/LF as a line or paragraph end Different tab spacings Optional line wrap Formfeed codes vs. computed page breaks BS = DEL or BS-overstrike In the past, editors on one platform, or written for one purpose, ignored all other practices. I use two text editors, Alpha for Macintosh and Notespad (note extra 's') for Windows, which can handle all of these variations according to my preferences, including the ability to read and write text files with Mac, Windows, or Unix line break codes. Notespad even maintains an extensible list of file types where line breaking is never to be changed by the editor (mostly programming language source code). Alpha asks whether to wrap paragraphs when opening files. >Actually, it tends to be well-defined for each platform. And then the >interchange methods among platforms tend to converge on a few simple >conventions: ASCII (or the appropriate ISO character set, or now UTF-8 or >other form of Unicode), as opposed to EBCDIC (or Baudot, or Sixbit); CRLFs >separating lines, and paragraphs separated by blank lines. Somewhat less >well defined, but nevertheless in common use, are bare Carriage Return or >Backspace for overstriking, Formfeed for "new page", and Tab for tabbing >(with several different conventions about tabstops). That is, we agree on everything except our variant usages. >Lines are terminated at somewhere between 72 and 80 characters by >convention, because that's how wide terminal screens are, and before them >the Teletype carriage, and before that the most common kind of punchcard. 
>Or for that matter, typewriters and sheets of paper (A4 or US, take your >pick :-) > >To this day, we follow these conventions in newsgroups and email, although >now it might be more a matter of "netiquette" than necessity (as in the >BITNET days, when e-mail was, quite literally, 80-column card images). As long as e-mail readers cannot correctly reformat messages with bad line breaks (like this), it will be a matter of real necessity. >These simple conventions let us format our text exactly the way we want to. >We can indent or not, we can put line breaks where we want them, we can have >columns of numbers or other tabular presentations, mathematical expressions, which actually require several hundred non-ASCII characters, unless you mean, as so many do, arithmetic expressions. >and idiosyncratic forms of emphasis. Many people want their text to stay >the way they wrote it. And many people also are not fond of receiving email >in every kind of bizarre format than any application developer can dream up >when it contains, in fact, nothing but words (but I stray). When I want my text to stay as I wrote it, I put it into a PDF, not a text file. Others prefer TeX for this purpose, or PostScript. >> I think the Unix community should slowly get used to the idea of >> abandoning LFs in the middle of paragraphs in plain text documents and >> let the editor and display tool perform the reformatting at display >> time. >> >But what IS plain text? Maybe some people might like to have their email >reformatted, but I don't think they want their C or Fortran or PostScript >programs to receive the same treatment. Nor, for that matter poetry or any >other forms of text where line breaks, indentation, and blank lines serve a >purpose. As in, for example, the preceding paragraph. Yes, it's that old Devil cross-cultural ignorance again. It wouldn't surprise me if some people here had never even read a Fortran program. >No more plain-text bashing! No more "legacy" saying! 
Our focus should be >not on stamping out plain text, but on promoting international multilingual >communication through a universal character set that does not impose a >a particular modus vivendi upon its users. > >- Frank We raised the question of defining a Unicode plain text format about two years ago, but nothing seemed to come of it. We also discussed the possibility of actually *using* Unicode text in this discussion, but nothing came of that either. Does anyone else here feel excessively constrained by our lack of glyphs for the characters we talk about? Would anyone else like to get UTF-8-capable mailers and extensive sets of Unicode fonts and see what effect they have on our deliberations? I have made the suggestion before, but here goes again--Alis Technologies offers a 30-day free trial period of its Tango Browser with Tango E-mail, downloadable from http://www.alis.com/internet_products/try_form.html. It runs on Windows 95, 98, and NT. Would anyone care to try it with me? -- Edward Cherlin President Coalition Against Unsolicited Commercial E-mail Help outlaw Spam. Talk to us at 2-Jul-99 16:04:55-GMT,11158;000000000001 Return-Path: Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id MAA17085; Fri, 2 Jul 1999 12:02:27 -0400 (EDT) Date: Fri, 2 Jul 99 12:02:27 EDT From: Frank da Cruz To: Edward Cherlin Subject: Re: Plain Text In-Reply-To: Your message of Fri, 2 Jul 1999 00:07:54 -0700 (PDT) Cc: unicode@unicode.org Message-ID: > The problems we have with ASCII plain text come mainly from a small set of > common variant practices. > > Using CR, LF, or CR/LF as a line or paragraph end > Different tab spacings > Optional line wrap > Formfeed codes vs. computed page breaks > BS = DEL or BS-overstrike > We all have dealt with these annoyances throughout our careers. They are indeed annoying, but not impassible impediments. Also, let's not mix up: . File storage format . Interchange format . 
Data entry format > Using CR, LF, or CR/LF as a line or paragraph end > As a line end: This is a file storage issue. As a paragraph end: There is no such thing as a paragraph end or paragraph separator in traditional plain text. Here I am sitting at my VT100 terminal, which is plugged in to my UNIX computer. I type: This is a line Then I push the Return key (sometimes marked Enter), which sends a Carriage Return. I would enter a line in exactly the same way no matter what computer was on the far end of the wire. Now: . The UNIX terminal driver turns the CR into a LF before giving it to the application. If the application is storing the line into a file, the file gets "This is a line". Ditto for some other operating systems, like AOS/VS. . If I had OS-9 on the far end, it would store "This is a line". . If I had TOPS-10, TOPS-20, RT-11, etc, on the far end, it would store "This is a line". . If I had VMS, VOS, VM/CMS, MVS/TSO or other complex file system on the far end, who knows how the line would be stored -- it depends on chosen the file organization and record format. The point is, it doesn't matter. Each platform has its own format for internal use, but a standardized interface to the outside world. To further demonstrate this fact, if I then tell the computer on the far end to "type" or "cat" the file, it will, invariably, send: This is a line So who cares what the file format is -- except of course when we want to transfer the file to another platform. In that case, it is the responsibility of each file-transfer agent to convert between its peculiar local format and the common one. And that is exactly what they do, just as is done at the terminal/terminal-driver/data-entry level. FTP and Kermit are two examples that show it is not that hard to convert plain-text file record formats from one platform to another. (And in Kermit's case, the character set too.) 
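[Editorial sketch, not part of the original message.] The conversion step
described above, in which each file-transfer agent translates between its
local line format and a common one, can be illustrated in a few lines of
Python. This is only a minimal sketch: the function names are hypothetical,
CRLF is assumed as the common interchange form, and it does not reproduce
the actual logic of FTP or Kermit.

```python
import re

# Match CRLF first so that it is not counted as a CR followed by an LF.
_LINE_END = re.compile(r"\r\n|\r|\n")


def to_interchange(text: str) -> str:
    """Convert any mix of CR, LF, or CRLF line ends to CRLF (assumed common form)."""
    return _LINE_END.sub("\r\n", text)


def from_interchange(text: str, local_end: str) -> str:
    """Convert CRLF-terminated text to a platform's local line terminator."""
    return text.replace("\r\n", local_end)


if __name__ == "__main__":
    stored_with_lf = "This is a line\n"
    on_the_wire = to_interchange(stored_with_lf)
    print(repr(on_the_wire))
    print(repr(from_interchange(on_the_wire, "\r")))
```

The sender only needs to know its own local format and the common one, so
no pair of platforms has to know anything about each other.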
Of course life would have been simpler if there had been only ONE standard text-file format used on all platforms. But the early days of computing was a time of "Let the Hundred Flowers Bloom", and they did. Now, however, we are in a position to start over, and it is an opportunity we are not likely to have again. > Different tab spacings > I used to say this too, but the last platform I know about that did not assume tabstops at 1,9,17,25,... was MULTICS. Of course tabs are variable in word processors, etc, but that is not plain text. > Optional line wrap > This is a feature of the terminal or the application, not of "plain text". Files that do not contain line breaks and must rely on some form of postprocessing to insert line breaks at appropriate points is not really plain text, it is "input for a text formatter". Prior to the advent of word processors, the idea of "long line as paragraph" never came up. > Formfeed codes vs. computed page breaks > Page breaks are an issue worth discussing, and we discussed them at some length two years ago. Basically, you can let your "rendering engine" or printer driver insert them for you, or you can insert them yourself. One should be allowed the choice. (Why would anybody want "hard" page breaks? Because they are printing paychecks, invoices, envelopes, etc.) > BS = DEL or BS-overstrike > This is a data entry issue, unless you mean including BS in a file for overstriking. But in that case, there is never any confusion between BS and DEL, since DEL is never used for that purpose. In other words, the only confusion is at data entry, and this is entirely irrelevant to the definition of plain text. > >Lines are terminated at somewhere between 72 and 80 characters by > >convention, because that's how wide terminal screens are, and before them > >the Teletype carriage, and before that the most common kind of punchcard. 
> >Or for that matter, typewriters and sheets of paper (A4 or US, take your > >pick :-) > > > >To this day, we follow these conventions in newsgroups and email, although > >now it might be more a matter of "netiquette" than necessity (as in the > >BITNET days, when e-mail was, quite literally, 80-column card images). > > As long as e-mail readers cannot correctly reformat messages with bad > line breaks ^^^^^^^^^^^^^^^^^^^^^^^^^^^ > (like this), it will be a matter of real necessity. > What does "correctly reformat messages" mean? How can your mail client read my mind? How does it know that the message I sent you was not already formatted exactly the way I wanted it? Notice that to illustrate my point, I need your original formatting (above) preserved, with the "> " quote indicators added at the left margin, and with my emphasis added under the appropriate words. What is a "correct" mail client supposed to do with this? Something like this?: > As long as e-mail readers cannot correctly reformat messages with bad > line breaks ^^^^^^^^^^^^^^^^^^^^^^^^^^^ > (like this), it will be a matter of real necessity. No, a correct email client will leave it alone. Whether I want my email reformatted by your client should be my choice, since only I know what my intentions are in sending it. Granted, plain text requires some minimal level of agreement, for example that your screen is 72 (or 76, or 79) columns wide. I maintain that this convention is universal, except for Kanji, etc, which are displayed in two character cells each. People who use email, netnews, and other forms of open, interplatform communication have learned these conventions. We use them ourselves on this mailing list. Those of us who do not are often excoriated for our antisocial behavior. Especially when we send email or netnews in some application-specific format, assuming that everybody else uses the same platform and applications we do. 
> >These simple conventions let us format our text exactly the way we want > >to. We can indent or not, we can put line breaks where we want them, we > >can have columns of numbers or other tabular presentations, mathematical > >expressions, > > which actually require several hundred non-ASCII characters, unless you > mean, as so many do, arithmetic expressions. > Yes, that's what I meant, thanks. (All of us here recognize the shortcomings of ASCII -- that's why we're here! But let's not forget that ASCII can be used to write, say, Fortran programs that can handle far more in the way of mathematics than the repertoire of ASCII might suggest, and that people send Fortran-like expressions back and forth in email, etc, which could easily lose their meaning when reformatted.) > When I want my text to stay as I wrote it, I put it into a PDF, not a text > file. Others prefer TeX for this purpose, or PostScript. > My point exactly. And how do I read your PDF if I don't have a PDF reader? (Don't say "get one" -- I'm reading your mail on a DOS PC or a PDP-11, or a Cray supercomputer.) How do I read TeX if I don't have the software? How do I read PostScript if I don't have a PostScript printer or rendering engine. But the crucial point is: How will I read your PDF file 200 years from now, when PDF itself has been consigned to the "legacy" trashheap for the past 195 years? > We raised the question of defining a Unicode plain text format about two > years ago, but nothing seemed to come of it. > Then let's try again. Let me get the ball rolling with the following simple suggestion for Unicode Plain-Text File and Interchange Format: A monospaced character-cell display device is assumed for the purposes of line breaking. Characters that are too wide for a character cell (such as Kanjis) occupy a double-width cell. 
Of course, Unicode Plain Text can also be displayed on any other kind of
device, in any font, monospaced or not, in which case "all bets are off",
just as they are now with traditional plain text when displayed in a
proportional font.

Conversely, it is recognized that a monospaced (or duospaced) character-cell
device might be inadequate for display of certain writing systems, such as
Arabic or Indic scripts, and in this case intelligent rendering engines
might very well be required. This should, nevertheless, be possible with
plain text, without the aid of any particular markup scheme.

Plain text is composed only of Unicode characters, with no meta-level
of formatting information, presentation hints, etc, except:

 1. Spaces, such as U+0020 and U+00A0, which are "kept" (e.g.
    adjacent spaces are not collapsed).

 2. Horizontal Tabs are indicated by the HT character, U+0009. Tab
    stops shall be assumed every 8 columns, starting at the first. (This
    provision is primarily to facilitate conversion of ASCII and 8-bit
    text to Unicode. Alternatively, it would be OK to force all
    horizontal alignment to be accomplished by spaces.)

 3. Line breaks are indicated by Line Separator, U+2028. Preformatted
    text must break lines at column 79 or less to avoid unwanted
    reformatting. Column numbers are 1-based, relative to the left or
    right margin, according to the prevailing directionality, with
    single-width characters as the counting unit. A line break is
    required at the end of the final line if it is to be considered a
    line. (This is to allow append operations to work in the expected
    fashion.)

 4. Paragraph breaks are indicated by two successive Line Separators
    or by Paragraph Separator, U+2029.

 5. Hard page breaks are indicated by FF, U+000C.

C0 and C1 control characters other than HT and FF have no function
whatsoever in Unicode Plain Text.
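[Editorial sketch, not part of the original message.] To make the rules
above concrete, here is a rough Python illustration of converting
traditional text to the proposed form: tabs expanded to stops every 8
columns, each line ended with U+2028, and wide characters counted as two
cells. All function names are hypothetical. Two simplifications are
assumed: the double-width test uses the East Asian Width property, and
combining marks are counted as one cell each. The sketch does not strip
the other C0 and C1 controls.

```python
import unicodedata

LS = "\u2028"      # Line Separator, rule 3
TAB_WIDTH = 8      # tab stops at columns 1, 9, 17, ... (1-based), rule 2
MAX_COLUMN = 79    # suggested limit for preformatted text, rule 3


def cell_width(ch: str) -> int:
    """Cells occupied by ch: 2 for East Asian Wide/Fullwidth, otherwise 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def line_columns(line: str) -> int:
    """Number of character cells a tab-free line occupies."""
    return sum(cell_width(ch) for ch in line)


def expand_tabs(line: str) -> str:
    """Replace each HT with enough spaces to reach the next tab stop."""
    out, col = [], 0  # col = number of cells already filled
    for ch in line:
        if ch == "\t":
            pad = TAB_WIDTH - (col % TAB_WIDTH)
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += cell_width(ch)
    return "".join(out)


def to_unicode_plain_text(text: str) -> str:
    """Convert CR, LF, or CRLF terminated text so every line ends with LS."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # the final terminator does not start another line
    return "".join(expand_tabs(line) + LS for line in lines)


def fits(line: str) -> bool:
    """True if the expanded line stays within MAX_COLUMN cells."""
    return line_columns(expand_tabs(line)) <= MAX_COLUMN
```

An empty input line becomes two successive Line Separators, and every
line, including the last, ends with U+2028, as rule 3 requires for
appending to work.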
(If there were Unicode Horizontal Tab and Page Break characters, we wouldn't need C0 at all; however, the UTC -- or at least members of it, in previous discussions -- indicated that there is no good reason to duplicate the C0 characters that are already in Unicode.) A Unicode plain-text "rendering engine" shall not mess with the format of a plain-text file except, optionally, at the user's discretion, to wrap lines that are longer than the display or printing device. Higher-level rendering engines, of course, can do whatever they want. - Frank 2-Jul-99 16:32:42-GMT,2273;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA25758 for ; Fri, 2 Jul 1999 12:32:41 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id JAA248914 ; Fri, 2 Jul 1999 09:27:18 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA21218; Fri, 2 Jul 99 09:18:02 -0700 Message-Id: <9907021618.AA21218@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8293 (1999-07-02 16:17:51 GMT) From: Frank da Cruz To: Unicode List Date: Fri, 2 Jul 1999 09:17:48 -0700 (PDT) Subject: Plain text: Amendment 1 90 seconds later... 1. Spaces, such as U+0020 and U+00A0, which are are "kept" (e.g. adjacent spaces are not collapsed). 2. Horizontal Tabs are indicated by the HT character, U+0009. Tab stops shall be assumed every 8 columns, starting at the first. (This provision is primarily to facilitate conversion of ASCII and 8-bit text to Unicode. Alternatively, it would be OK to force all horizontal alignment to be accomplished by spaces.) 3. Line breaks are indicated by Line Separator, U+2028. Preformatted text must break lines at column 79 or less to avoid unwanted reformatting. 
Column numbers are 1-based, relative to the left or right margin, according to the previaling directionality, with single-width characters as the counting unit. A line break is required at the end of the final line if it is to be considered a line. (This is to allow append operations to work in the expected fashion.) 4. Paragraph breaks are indicated by two successive Line Separators or by Paragraph Separator, U+2029. 5. Hard page breaks are indicated by FF, U+000C. Change (4) to: 4. Paragraph breaks are indicated by Paragraph Separator, U+2029. Add to (3): A blank line is indicated by two successive Line Separators. Two blank lines are indicated by three of them, etc. This is to allow paragraphs like this one, which contain embedded "displays" set off by blank lines that are NOT paragraph separators. - Frank 2-Jul-99 17:17:52-GMT,4232;000000000011 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id NAA07783 for ; Fri, 2 Jul 1999 13:17:51 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id KAA281172 ; Fri, 2 Jul 1999 10:08:26 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA21632; Fri, 2 Jul 99 09:58:39 -0700 Message-Id: <9907021658.AA21632@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8294 (1999-07-02 16:58:29 GMT) From: Geoffrey Waigh To: Unicode List Date: Fri, 2 Jul 1999 09:58:27 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Frank da Cruz wrote: > > Then let's try again. Let me get the ball rolling with the following simple > suggestion for Unicode Plain-Text File and Interchange Format: > > A monospaced character-cell display device is assumed for the purposes of > line breaking. 
Characters that are too wide for a character cell (such as > Kanjis) occupy a double-width cell. Of course, Unicode Plain Text can also > be displayed on any other kind of device, in any font, monospaced or not, in > which case "all bets are off", just as they are now with traditional plain > text when displayed in a proportional font. Why are you specifying font characteristics for plain text? > Conversely, it is recognized that a monospaced (or duospaced) character-cell > device might be inadequate for display of certain writing systems, such as > Arabic or Indic scripts, and in this case intelligent rendering engines > might very well be required. This should, nevertheless, be possible with > plain text, without the aid of any particular markup scheme. And then saying that you don't really need a monospace font and it is still plain text even when you have to do a proper job of rendering it? > > Plain text is composed only of Unicode characters, with no meta-level > of formatting information, presentation hints, etc, except: > > 1. Spaces, such as U+0020 and U+00A0, which are are "kept" (e.g. > adjacent spaces are not collapsed). I don't see how barring all the other spacing and presentation codes (e.g. ZWNJ) improves plain text. > > 2. Horizontal Tabs are indicated by the HT character, U+0009. Tab > stops shall be assumed every 8 columns, starting at the first. (This > provision is primarily to facilitate conversion of ASCII and 8-bit > text to Unicode. Alternatively, it would be OK to force all > horizontal alignment to be accomplished by spaces.) > > 3. Line breaks are indicated by Line Separator, U+2028. Preformatted > text must break lines at column 79 or less to avoid unwanted > reformatting. Column numbers are 1-based, relative to the left or > right margin, according to the previaling directionality, with > single-width characters as the counting unit. A line break is > required at the end of the final line if it is to be considered a > line. 
(This is to allow append operations to work in the expected > fashion.) I don't see how specifying the maximum text width is in the purview of "plain text." That is suggesting that running my terminal in 132 column mode (or printing on wide paper/with narrow fonts,) involves something special. I suspect that all the attention to cell widths, column counting and what not is to make tab processing map nicely to the character cell terminal model. That model is responsible for some horrible hacks when it migrated to other countries and I believe the difficulties in adapting software that depends on it to writing systems it does not work for has been a serious drag on more advanced Unicode implementations. > > 4. Paragraph breaks are indicated by two successive Line Separators > or by Paragraph Separator, U+2029. If we are supporting Unicode and have a notion of Paragraph it seems reasonable to specify it is denoted with U+2029. > > 5. Hard page breaks are indicated by FF, U+000C. Geoffrey 2-Jul-99 18:15:24-GMT,5607;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA24570 for ; Fri, 2 Jul 1999 14:15:23 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id LAA270448 ; Fri, 2 Jul 1999 11:10:54 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA22476; Fri, 2 Jul 99 10:54:55 -0700 Message-Id: <9907021754.AA22476@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8297 (1999-07-02 17:54:45 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Fri, 2 Jul 1999 10:54:44 -0700 (PDT) Subject: Re: Plain Text > Why are you specifying font characteristics for plain text? > Only for purposes of getting across the idea that "long line = paragraph, break where you please" should not be considered well-formed plain text. 
Or, to look at it the other way, that plain text must allow for hard line breaks, and there should be a convention as to how long we might reasonably expect lines to be. "Columns" are the only measurement that makes sense (surely not picas, inches, millimeters, pixels, ...) and this presupposes fixed spacing. This might be a farfetched notion except that it is completely consonent with current practice. The fact that monospaced fonts have fallen out of fashion should not cloud our judgement. Naturally they present some difficulties for multilingual text, but they also provide numerous benefits. They let me compose a text document that anybody can read in -- barring "rendering engine" interference -- the same form in which I composed it. Tables line up, columns of numbers add up, comments in my C program are aligned, etc. All this without our having to agree in advance on which rendering engine or markup language to use. Parenthetically, look at the mess the craze for the typeset appearance has gotten us into. If I want to make a table on a Web page or in a typeset document, I have to use some kind of markup language or "table" package, rather than just spacing or tabbing the items appropriately. Which is fine until you consider that any markup language or tables package you are using today will be long forgotten a few years from now, and so your laboriously constructed document will either require conversion or be lost forever (or humans will need to read the markup language directly). As noted, I grant that the monospace-font model does not apply equally well to all writing systems, but for the many to which it does apply -- Roman, Hebrew, Cyrillic, Armenian, Greek, Georgian, etc, and to some extent CJK since, at least in Japan, they have been using mono- and duospaced fonts on terminals and PCs for decades, and care as much about things lining up as anybody else -- should guidelines not be stated up front? > > 1. 
Spaces, such as U+0020 and U+00A0, which are are "kept" (e.g. > > adjacent spaces are not collapsed). > > I don't see how barring all the other spacing and presentation codes > (e.g. ZWNJ) improves plain text. > They aren't barred -- they are Unicode characters that are not C0 or C1 control characters. And they aren't a higher-level markup language. > I don't see how specifying the maximum text width is in the purview of > "plain text." That is suggesting that running my terminal in 132 column > mode (or printing on wide paper/with narrow fonts,) involves something > special. I suspect that all the attention to cell widths, column > counting and what not is to make tab processing map nicely to the > character cell terminal model. That model is responsible for some > horrible hacks when it migrated to other countries and I believe the > difficulties in adapting software that depends on it to writing systems > it does not work for has been a serious drag on more advanced Unicode > implementations. > I suppose you're right about the intention. That's what the discussion is for -- to find suitable language for expressing a model for "text that is already formatted and stands on its own without additional formatting from any higher intelligence and that can displayed by the most minimalistic plain-text viewer", like this email message. You might be right about specifying a maximum line length. And yet, if there is to be such a thing as preformatted plain text, and none of us can deny that there already is such a thing since this is how we commicate, should there not be some form of guideline as to what is a safe default line-length, in the absence of any prior agreement to set a different one? That's what we do now, implicitly. Why not make it explicit? So how should the guideline be expressed? Let's assume you are composing some plain text, and you don't care how it's rendered. Then don't include Line Separators and let the viewer "flow" the text. 
That's fine for ordinary prose, but it assumes a viewer that knows how to flow text, and I'm not sure that a text-flowing viewer should be assumed or required. As somebody mentioned earlier, most printers will truncate long lines, as will many terminals and other display devices. If you do care how the text is rendered, include Line Separators. > > 4. Paragraph breaks are indicated by two successive Line Separators > > or by Paragraph Separator, U+2029. > > If we are supporting Unicode and have a notion of Paragraph it seems > reasonable to specify it is denoted with U+2029. > Agreed and amended already. - Frank 2-Jul-99 18:32:33-GMT,4626;000000000001 Return-Path: Received: from inergen.sybase.com (inergen.sybase.com [192.138.151.43]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA29877 for ; Fri, 2 Jul 1999 14:32:32 -0400 (EDT) Received: from smtp1.sybase.com (sybgate.sybase.com [130.214.220.35]) by inergen.sybase.com with ESMTP id LAA07740; Fri, 2 Jul 1999 11:33:15 -0700 (PDT) Received: from birdie.sybase.com (birdie.sybase.com [130.214.140.3]) by smtp1.sybase.com with SMTP id LAA03792; Fri, 2 Jul 1999 11:32:11 -0700 (PDT) Received: by birdie.sybase.com (5.x/SMI-SVR4/SybEC3.5) id AA03974; Fri, 2 Jul 1999 11:32:11 -0700 Date: Fri, 2 Jul 1999 11:32:11 -0700 From: kenw@sybase.com (Kenneth Whistler) Message-Id: <9907021832.AA03974@birdie.sybase.com> To: fdc@watsun.cc.columbia.edu Subject: Re: Plain text: Amendment 1 Cc: unicode@unicode.org, kenw@sybase.com X-Sun-Charset: US-ASCII The problem I am having with Frank's suggestions boil down essentially this: The Unicode concept of plain text is of a text stream consisting only of Unicode characters, interpreted according to the rules of the standard, and not including (or not interpreting the inclusion of) higher-level markup, however expressed. It does not involve specification of particular font behavior (including monospacing), details of terminal interaction, or line length. 
It is that concept of Unicode plain text that we intend and hope will be stable for the next century. Given the text stream itself, basic textual content should be derivable, although not necessarily any detailed layout information. The intended invariant is textual content, rather than document form including textual content. To specify invariant document form, it is clear that a higher-level protocol must be specified. And I see Frank's Unicode plain text proposal as just the bare-bottom, minimal common denominator for a document description standard. In that respect it is no different from PDF, except in complexity and faithfulness to original appearance of a document in all details. Some of the difficulty of this discussion, of course, derives from the fact that the Unicode Standard unavoidably had to contain some bare minimum of format control characters. We have had to specify format semantics for CR, LF, TAB, VT, FF because there was no way we were going to get from the past to the future without people converting existing documents using these (or carrying analogous practice into new documents); and LS and PS were added to provide a minimum, unambiguous set of format controls to organize plain text. Bidi format controls were added because they had to be: otherwise, you run into situations where intended content is inexpressible, or existing content is uninterpretable in plain text. And on the other hand, the situation is muddied by plain text markup conventions where the markup is carried around in the plain text: 9/23/98 38 widgets sold 65,416 --- 65,416 Where the "plain text" is: "NLF9/23/98NLF38 widgets soldNLF65,416NLF---NLF65,416NLF" But the plain text of the content is 5 strings: "9/23/98" "38 widgets sold" "65,416" "---" "65,416" And the full document description is, of course, not just these 5 strings, but includes the fact that they constitute a row embedded in a table, and are aligned in specified ways within the cells in that row.
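As a rough illustration of the widget example above, the sketch below splits a text stream on any new-line function (NLF) and recovers the five content strings. The name `split_on_nlf` and the particular set of characters treated as NLF here (CRLF, CR, LF, NEL, LS, PS) are assumptions made for this example only; the thread itself does not fix that set.

```python
import re

# Assumed set of new-line functions for this sketch. CRLF is listed first so
# that it counts as a single break rather than as CR followed by LF.
NLF = re.compile("\r\n|[\r\n\u0085\u2028\u2029]")

def split_on_nlf(text):
    """Return the non-empty strings found between new-line functions."""
    return [piece for piece in NLF.split(text) if piece]

# The same content, deliberately separated by a mix of NLF conventions.
sample = ("\r\n" + "9/23/98" + "\n" + "38 widgets sold" + "\u2028"
          + "65,416" + "\r" + "---" + "\u2029" + "65,416" + "\n")

print(split_on_nlf(sample))
```

Whichever NLF convention the sending platform used, the textual content comes out the same; the table structure and cell alignment are not recoverable from the stream, which is the distinction being drawn between content and document form.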
The Unicode vision is that the character encoding standard itself should be as robust and useful in its larger domain as the 7-bit ASCII standard was in its own constrained textual domain. But given the enormous complexities that are inherent in trying to deal with *all* of the writing systems of the world, it is inevitable that plain text *layout* conventions involving Unicode are going to be considerably more complex than plain text *layout* conventions involving ASCII only. At the bare minimum, for example, plain text in Unicode *must* take bidirectional layout into account--otherwise, you would be saying that you could express Unicode content in plain text, as long as you avoided Hebrew, Arabic, and Syriac characters. In some respects, the entire content of the Unicode Standard beyond just the code charts and names lists is an elaborate attempt to describe what it means to deal with plain text layout and interpretation for all of the Unicode characters. It cannot be encapsulated in the kind of constraints that Frank has suggested, in my opinion. --Ken 2-Jul-99 18:51:30-GMT,5169;000000000011 Return-Path: Received: from mail.rdc1.bc.home.com (ha1.rdc1.bc.wave.home.com [24.2.10.66]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA05270 for ; Fri, 2 Jul 1999 14:51:28 -0400 (EDT) Received: from home.com ([24.113.28.108]) by mail.rdc1.bc.home.com (InterMail v4.01.01.00 201-229-111) with ESMTP id <19990702185120.ZXVS29070.mail.rdc1.bc.home.com@home.com>; Fri, 2 Jul 1999 11:51:20 -0700 Message-ID: <377D0A96.86F53390@home.com> Date: Fri, 02 Jul 1999 11:53:10 -0700 From: Geoffrey Waigh X-Mailer: Mozilla 4.5 [en] (Win98; I) X-Accept-Language: en MIME-Version: 1.0 To: unicode@unicode.org Subject: Re: Plain Text References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Frank da Cruz wrote: > > > Why are you specifying font characteristics for plain text?
> > > Only for purposes of getting across the idea that "long line = paragraph, > break where you please" should not be considered well-formed plain text. > Or, to look at it the other way, that plain text must allow for hard line > breaks, and there should be a convention as to how long we might reasonably > expect lines to be. "Columns" are the only measurement that makes sense > (surely not picas, inches, millimeters, pixels, ...) and this presupposes > fixed spacing. See below for comments on maximum line length. When considering why other measurements were inappropriate I realized it is because "preformatted" plain text has no control over font size and thus cannot do position based formatting as someone would do on a sheet of paper. The cell model allows people to position text without recourse to a markup system but at the sacrifice of which scripts can be properly rendered. It happens that many of the commercially significant languages can cope with the cell model which is part of the reason it has survived so long. Unfortunately it just helps keep the hard writing systems in the ghetto because it isn't nearly as profitable and requires dealing with many cans of worms when trying to fit them to a system that depends on implicit positioning. > The fact that monospaced fonts have fallen out of fashion should not cloud > our judgement. Naturally they present some difficulties for multilingual > text, but they also provide numerous benefits. They let me compose a text > document that anybody can read in -- barring "rendering engine" interference > -- the same form in which I composed it. Tables line up, columns of numbers > add up, comments in my C program are aligned, etc. All this without our > having to agree in advance on which rendering engine or markup language to > use. Presumably the markup language specifies the semantics well enough to be rendering engine independent - if the rendering engine is capable of displaying the text as described.
For text that is being sent without any markup, then monospace for the bulk of the text is probably what the reader should use (at least if they believe the text to have horizontal structure.) I just don't think that it should be enforced. As for the concerns about the ephemeral nature of markup languages, hopefully we will someday reach some stability for systems that don't require a proprietary encoder, do not require extensive computer training to grok and do not have flavour of the week problems. These difficulties are not inherent in the design of markup languages but an artifact of the political and economic forces driving them. > You might be right about specifying a maximum line length. And yet, > if there is to be such a thing as preformatted plain text, and none of us > can deny that there already is such a thing since this is how we communicate, > should there not be some form of guideline as to what is a safe default > line-length, in the absence of any prior agreement to set a different one? > That's what we do now, implicitly. Why not make it explicit? So how should > the guideline be expressed? Because if it is made explicit, software writers will feel free to take such a limit as a hard one and do silly things for text that exceeds it. Right now most software will handle long lines albeit sometimes awkwardly. If someone preformats their text for 200 columns, then that is what they should get if the output device can cope. If it cannot, they need to consider why they think it has to be 200 columns. In the case of Usenet and public mailing lists people have to curtail their lines if they don't want them mangled. > Let's assume you are composing some plain text, and you don't care how it's > rendered. Then don't include Line Separators and let the viewer "flow" the > text. That's fine for ordinary prose, but it assumes a viewer that knows > how to flow text, and I'm not sure that a text-flowing viewer should be > assumed or required.
As somebody mentioned earlier, most printers will > truncate long lines, as will many terminals and other display devices. > > If you do care how the text is rendered, include Line Separators. I agree with this. Geoffrey 2-Jul-99 20:08:21-GMT,3467;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA25355 for ; Fri, 2 Jul 1999 16:08:20 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id MAA91712 ; Fri, 2 Jul 1999 12:56:37 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA24684; Fri, 2 Jul 99 12:47:49 -0700 Message-Id: <9907021947.AA24684@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8306 (1999-07-02 19:47:40 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Fri, 2 Jul 1999 12:47:39 -0700 (PDT) Subject: Re: Plain Text OK, then perhaps the idea of "recommended maximum line length" is an unnecessary complication. Perhaps it is enough to say that Line Separator means what it says. If I put one in my text, then it means to start a new line. If I make sure that there are no more than 79 characters between line separators (or whatever else is appropriate to my writing system), I'll get the desired effect. > As for the concerns about the ephemeral nature of markup languages, > hopefully we will someday reach some stability for systems that > don't require a proprietary encoder, do not require extensive > computer training to grok and do not have flavour of the week > problems. These difficulties are not inherent in the design of > markup languages but an artifact of the political and economic > forces driving them. > Right, of course. But can we trust the market to settle on a simple standard for plain text? Of course not; there's no money in it.
Does the market want an immutable standard for plain-text documents that can last for a century or an eon? Of course not. The market wants everything to change all the time, so everybody will have to "upgrade" constantly. That's great for business but bad for preservation of history and culture. And it shortens the productive lives of "content providers". There are ways to make money that don't require artificially induced instability. Furthermore, I would not like to think that in the Unicode world of the future, that it will not be possible to send preformatted email or netnews without the assistance of some specific markup language or embedded proprietary word-processor codes. Email has already deteriorated significantly from its original openness thanks to MIME's blessing of any kind of proprietary gewgaw any vendor wants to add to their GUI email clients. Thus a perfect application for Unicode plain text would be as a MIME type, specifically intended to proclaim and promote the adherence to a simple, universal, vendor-independent, self-contained standard. Hopefully the IETF would have the sense to see the value of a Unicode successor to RFC822. So I'd like to see a definition for plain text in the Unicode standard, that is totally independent of any external product, that allows a file or stream of Unicode text to stand on its own, for all time, and retain a minimum level of formatting, in those cases where the author of the text feels formatting is important. (In fact, all of us do, otherwise we wouldn't care so much about fonts and rendering engines and markup languages). I think email and netnews are two areas where the need for such a standard is evident. 
- Frank 2-Jul-99 20:31:40-GMT,1251;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA01309 for ; Fri, 2 Jul 1999 16:31:39 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id NAA185114 ; Fri, 2 Jul 1999 13:26:21 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA25362; Fri, 2 Jul 99 13:17:37 -0700 Message-Id: <9907022017.AA25362@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Uml-Sequence: 8308 (1999-07-02 20:17:28 GMT) From: "Paul Dempsey (Exchange)" To: Unicode List Date: Fri, 2 Jul 1999 13:17:27 -0700 (PDT) Subject: RE: Plain Text This would be a fine standard. However, it doesn't have to be part of the _Unicode_ standard, and I don't think it belongs as a normative part of Unicode. As minimal as it may be, it still falls into the domain of file formats and "higher-level protocol". It's a tribute to the success of Unicode that people want to piggyback on its success to solve closely related problems.
--- Paul 2-Jul-99 23:30:54-GMT,1326;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA29893 for ; Fri, 2 Jul 1999 19:30:53 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id QAA194924 ; Fri, 2 Jul 1999 16:24:01 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA28898; Fri, 2 Jul 99 16:10:32 -0700 Message-Id: <9907022310.AA28898@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8321 (1999-07-02 23:10:02 GMT) From: John Cowan To: Unicode List Date: Fri, 2 Jul 1999 16:10:01 -0700 (PDT) Subject: Re: Plain text: Amendment 1 Content-Transfer-Encoding: 7bit Frank da Cruz scripsit: > This is to allow paragraphs like this one, which contain embedded > "displays" set off by blank lines that are NOT paragraph separators. A great thing. It is only in plain text that I can compare 1) example A with 2) example B in a single paragraph without confusion. -- John Cowan cowan@ccil.org I am a member of a civilization. --David Brin 3-Jul-99 0:50:17-GMT,1114;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id UAA08384 for ; Fri, 2 Jul 1999 20:50:16 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id RAA339856 ; Fri, 2 Jul 1999 17:46:29 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA00424; Fri, 2 Jul 99 17:34:34 -0700 Message-Id: <9907030034.AA00424@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Uml-Sequence: 8325 (1999-07-03 00:34:07 GMT) From: "Christopher J. 
Fynn" To: Unicode List Date: Fri, 2 Jul 1999 17:34:05 -0700 (PDT) Subject: RE: Plain Text [**NOT**] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id UAA08384 Edward Cherlin wrote: > I know of no device which required the user to enter a CR followed > by an LF The manual typewriter? - Chris 3-Jul-99 1:13:59-GMT,1390;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id VAA09857 for ; Fri, 2 Jul 1999 21:13:59 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id SAA251700 ; Fri, 2 Jul 1999 18:07:10 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA00795; Fri, 2 Jul 99 17:51:32 -0700 Message-Id: <9907030051.AA00795@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8326 (1999-07-03 00:51:19 GMT) From: kenw@sybase.com (Kenneth Whistler) To: Unicode List Cc: unicode@unicode.org, kenw@sybase.com Date: Fri, 2 Jul 1999 17:51:17 -0700 (PDT) Subject: RE: Plain Text [**NOT**] Chris Fynn suggested: > > Edward Cherlin wrote: > > > I know of no device which required the user to enter a CR followed > > by an LF > > The manual typewriter? Hehe, not even that, since when you pull the "carriage return lever" to return the carriage to the left margin, the ratchet setting (for single space or double space) automatically feeds the line (or lines) on the platen to the ratchet stop before the lever locks and allows you to drag the carriage back. So nice try, but CRLF was already mechanically automated decades ago. 
--Ken > > - Chris > 3-Jul-99 3:06:18-GMT,1372;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA20716 for ; Fri, 2 Jul 1999 23:06:17 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id UAA266332 ; Fri, 2 Jul 1999 20:01:53 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA01966; Fri, 2 Jul 99 19:48:33 -0700 Message-Id: <9907030248.AA01966@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain X-Uml-Sequence: 8330 (1999-07-03 02:48:22 GMT) From: "Hohberger, Clive P." To: Unicode List Date: Fri, 2 Jul 1999 19:48:20 -0700 (PDT) Subject: RE: Plain Text [**NOT**] The Teletypes did, up through at least the KSR 33 and ASR 35. That's why CR and LF were made part of the control character set... along with a lot of other Teletype commands (SI, SO, HT, etc) Clive > -----Original Message----- > From: Christopher J. Fynn [SMTP:cfynn@dircon.co.uk] > Sent: Friday, July 02, 1999 7:34 PM > To: Unicode List > Subject: RE: Plain Text [**NOT**] > > Edward Cherlin wrote: > > > I know of no device which required the user to enter a CR followed > > by an LF > > The manual typewriter?
> > - Chris 3-Jul-99 3:41:00-GMT,1740;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA24627 for ; Fri, 2 Jul 1999 23:40:59 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id UAA268562 ; Fri, 2 Jul 1999 20:33:54 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA02522; Fri, 2 Jul 99 20:21:14 -0700 Message-Id: <9907030321.AA02522@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8332 (1999-07-03 03:20:58 GMT) From: Edward Cherlin To: Unicode List Date: Fri, 2 Jul 1999 20:20:57 -0700 (PDT) Subject: Re: Plain Text At 11:45 -0700 7/2/1999, Geoffrey Waigh wrote: >Frank da Cruz wrote: >> >> > Why are you specifying font characteristics for plain text? >> > >> Only for purposes of getting across the idea that "long line = paragraph, >> break where you please" should not be considered well-formed plain text. >> Or, to look at it the other way, that plain text must allow for hard line >> breaks, and there should be a convention as to how long we might reasonably >> expect lines to be. [much snippage] There cannot be an enforceable line length limit on plain text. One of the uses of plain text is for database interchange, where any number of fields of any length, plus separators, may constitute a line. -- Edward Cherlin President Coalition Against Unsolicited Commercial E-mail Help outlaw Spam. 
Talk to us at 3-Jul-99 3:46:36-GMT,22091;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA25014 for ; Fri, 2 Jul 1999 23:46:35 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id UAA186618 ; Fri, 2 Jul 1999 20:35:16 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA02526; Fri, 2 Jul 99 20:21:17 -0700 Message-Id: <9907030321.AA02526@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8333 (1999-07-03 03:21:01 GMT) From: Edward Cherlin To: Unicode List Cc: unicode@unicode.org Date: Fri, 2 Jul 1999 20:20:59 -0700 (PDT) Subject: Re: Plain Text At 08:58 -0700 7/2/1999, Frank da Cruz wrote: [failing to mention that Ed Cherlin wrote:] >> The problems we have with ASCII plain text come mainly from a small set of >> common variant practices. >> >> Using CR, LF, or CR/LF as a line or paragraph end >> Different tab spacings >> Optional line wrap >> Formfeed codes vs. computed page breaks >> BS = DEL or BS-overstrike >> >We all have dealt with these annoyances throughout our careers. They are >indeed annoying, but not impassible impediments. Also, let's not mix up: > > . File storage format > . Interchange format > . Data entry format . Rendering options On looking through the remainder of this message, I conclude that I disagree with Frank's attempts to make his own limited experience normative, but I heartily agree that his proposal for a bottom-level plain text Unicode format is on the right track, and that it allows us to deal with some of the issues listed above as file format issues, specifically line and paragraph ends and other control codes. Tab stops, wrapping, and page breaking must be left to the user's choice when rendering, since they are not file format issues. 
>> Using CR, LF, or CR/LF as a line or paragraph end >> >As a line end: > This is a file storage issue. > >As a paragraph end: > There is no such thing as a paragraph end or paragraph separator in > traditional plain text. > >Here I am sitting at my VT100 terminal, which is plugged in to my UNIX >computer. Here *I* am, sitting at my Mac, and recalling what I have been doing on an NT system and Silicon Graphics Indy and O2 computers running Irix for the last year and a half, when I was shuttling files back and forth between them. (The Indy is used as an embedded controller in a 750 kg laser microscope system for semiconductor wafer inspection, and the O2 to run the microscope software without the hardware for demos and simulations, none of which matters to this discussion.) >I type: > > This is a line > >Then I push the Return key (sometimes marked Enter), which sends a Carriage >Return. Whereas my VT100 simulator used to get its CR from the keyboard buffer, where it was deposited after the keyboard driver translated from the keyboard scan codes. Anyway, input technology is not at issue here. >I would enter a line in exactly the same way no matter what >computer was on the far end of the wire. Now: > > . The UNIX terminal driver turns the CR into a LF before giving it > to the application. If the application is storing the line into a > file, the file gets "This is a line". Ditto for some other > operating systems, like AOS/VS. > > . If I had OS-9 on the far end, it would store "This is a line". ^or Mac OS > . If I had TOPS-10, TOPS-20, RT-11, etc, on the far end, it would > store "This is a line". > > . If I had VMS, VOS, VM/CMS, MVS/TSO or other complex file system on > the far end, who knows how the line would be stored -- it depends on > the chosen file organization and record format. > >The point is, it doesn't matter. Each platform has its own format for >internal use, but a standardized interface to the outside world.
To further >demonstrate this fact, if I then tell the computer on the far end to "type" >or "cat" the file, it will, invariably, send: > > This is a line Your cultural ignorance/sheltered life-experience is showing. *You* may live in an environment where these changes are made automatically, but a lot of us don't. >So who cares what the file format is -- except of course when we want to >transfer the file to another platform. And since I don't use a VT100 simulator anymore, I only encounter this issue when transferring files to another platform, and as a result I care all the time. >In that case, it is the >responsibility of each file-transfer agent When reading floppy disks? >to convert between its peculiar >local format and the common one. And that is exactly what they do, just >as is done at the terminal/terminal-driver/data-entry level. FTP and Kermit >are two examples that show it is not that hard to convert plain-text file >record formats from one platform to another. (And in Kermit's case, the >character set too.) > >Of course life would have been simpler if there had been only ONE standard >text-file format used on all platforms. But the early days of computing >was a time of "Let the Hundred Flowers Bloom", and they did. Now, however, >we are in a position to start over, and it is an opportunity we are not >likely to have again. Yes, yes, everything *could* have been made to work, except for the parts that couldn't, you see, because management wouldn't allow the extra time and space required to make things portable, or worse still, was trying to lock customers into proprietary data formats. >> Different tab spacings >> >I used to say this too, but the last platform I know about that did not >assume tabstops at 1,9,17,25,... was MULTICS. Of course tabs are variable >in word processors, etc, but that is not plain text. Your limited experience again. I have rarely used an editor with fixed tab stops since about 1982 (EDLIN, IIRC).
I once knew the escape sequences for IBM, Diablo, and Qume *printing* terminal tab settings by heart. >> Optional line wrap >> >This is a feature of the terminal or the application, not of "plain text". This is a feature found in ASCII *files* which were written either with or without explicit line breaks, requiring a choice for appropriate rendering--a choice which the editor should be able to make, but which the user should actually make. >Files that do not contain line breaks and must rely on some form of >postprocessing to insert line breaks at appropriate points are not really >plain text; they are "input for a text formatter". But the text editor is frequently the chosen text reformatter. You are still claiming that text files as they occur in your computer subculture are for some reason normative for the rest of us. >Prior to the advent of >word processors, the idea of "long line as paragraph" never came up. Word processing began in the 1960s. I gather you had a later date in mind. Did you mean specifically WYSIWYG word processors, invented at Xerox in the late 1970s? >> Formfeed codes vs. computed page breaks >> >Page breaks are an issue worth discussing, and we discussed them at some >length two years ago. Basically, you can let your "rendering engine" or >printer driver insert them for you, or you can insert them yourself. One >should be allowed the choice. (Why would anybody want "hard" page breaks? >Because they are printing paychecks, invoices, envelopes, etc.) If we can establish that general principle and apply it to the previous cases, the problem will be solved in short order. The application determines the requirements for tab stops, page breaks, and paragraph or line formatting. >> BS = DEL or BS-overstrike >> >This is a data entry issue, unless you mean including BS in a file for >overstriking. But in that case, there is never any confusion between BS and >DEL, since DEL is never used for that purpose.
In other words, the only >confusion is at data entry, and this is entirely irrelevant to the >definition of plain text. > >> >Lines are terminated at somewhere between 72 and 80 characters by >> >convention, because that's how wide terminal screens are, and before them >> >the Teletype carriage, and before that the most common kind of punchcard. >> >Or for that matter, typewriters and sheets of paper (A4 or US, take your >> >pick :-) >> > >> >To this day, we follow these conventions in newsgroups and email, although >> >now it might be more a matter of "netiquette" than necessity (as in the >> >BITNET days, when e-mail was, quite literally, 80-column card images). >> >> As long as e-mail readers cannot correctly reformat messages with bad >> line breaks ^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> (like this), it will be a matter of real necessity. >> >What does "correctly reformat messages" mean? How can your mail client read >my mind? How does it know that the message I sent you was not already >formatted exactly the way I wanted it? I mean that it should have the ability to reformat such badly broken text, to use when I decide. Right now I have to reformat such text by hand, or leave it severely broken. Well, maybe I should learn Perl, but I prefer that someone else learn Perl and write the routines I and many others need. If any reader is interested, the spec is as follows. 1) Reflow paragraphs, removing extra white space, while preserving quoting marks '>' in the left margin. Don't get confused by angle brackets in the text. 2) Realign tables with "tab damage". Tables that are too wide should be broken into pages, rather than having lines folded. If you can manage those two, you're good, and I have some more little jobs for you. E-mail users will be eternally grateful (for a week or two, anyway, on Net time). 
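Item 1 of the spec above (reflow paragraphs while preserving the '>' quoting marks in the left margin) can be sketched in a few lines of Python rather than Perl. This is only a minimal sketch under stated assumptions: the name `reflow_quoted`, the default width of 72, and the rule that a paragraph ends at a blank line or a change in quote depth are all choices made for this example, and table realignment (item 2) is not attempted.

```python
import re
import textwrap

# Quote marks such as ">", ">>" or "> > " at the left margin only, so that
# angle brackets inside the text itself are never mistaken for quoting.
QUOTE_PREFIX = re.compile(r"^(?:>\s?)+")

def reflow_quoted(text, width=72):
    """Reflow runs of lines that share a quote depth; blank lines end a run."""
    out, words, depth = [], [], None

    def flush():
        # Emit the pending paragraph, re-wrapped under its quote prefix.
        if words:
            prefix = ">" * depth + (" " if depth else "")
            out.extend(textwrap.wrap(" ".join(words), width=width,
                                     initial_indent=prefix,
                                     subsequent_indent=prefix,
                                     break_long_words=False,
                                     break_on_hyphens=False))
            words.clear()

    for line in text.splitlines():
        match = QUOTE_PREFIX.match(line)
        marks = match.group(0) if match else ""
        line_depth = marks.count(">")
        body = line[len(marks):].split()   # also drops extra white space
        if not body or line_depth != depth:
            flush()
        if body:
            depth = line_depth
            words.extend(body)
        else:
            out.append(">" * line_depth)   # keep the blank (or bare ">") line
            depth = None
    flush()
    return "\n".join(out)

broken = (">> As long as e-mail readers cannot correctly reformat messages with bad\n"
          ">> line breaks\n"
          ">> (like this), it will be a matter of real necessity.")
print(reflow_quoted(broken))
```

Because the function is something the recipient runs on request, it leaves the stored message untouched, which fits the point that reformatting is a rendering choice rather than a property of the file.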
>Notice that to illustrate my point, I need your original formatting (above) >preserved, with the "> " quote indicators added at the left margin, and with >my emphasis added under the appropriate words. What is a "correct" mail >client supposed to do with this? Something like this?: > > > As long as e-mail readers cannot correctly > reformat messages with bad > line breaks > ^^^^^^^^^^^^^^^^^^^^^^^^^^^ > (like this), > it will be a matter of real necessity. > >No, a correct email client will leave it alone. Whether I want my email >reformatted by your client should be my choice, since only I know what my >intentions are in sending it. ^^^^^^^^^ However, it actually is the recipient's choice, and you can't stop us. The "correct" reformatting I had in mind would look like this. >> As long as e-mail readers cannot correctly reformat messages with bad >> line breaks (like this), it will be a matter of real necessity. or possibly >>As long as e-mail readers cannot correctly reformat messages with bad >>line breaks (like this), it will be a matter of real necessity. (**my choice**) >Granted, plain text requires some minimal level of agreement, for example >that your screen is 72 (or 76, or 79) columns wide. I maintain that this >convention is universal, except for Kanji, etc, which are displayed in two >character cells each. People who use email, netnews, and other forms of >open, interplatform communication have learned these conventions. We use >them ourselves on this mailing list. Those of us who do not are often >excoriated for our antisocial behavior. Universal, of course, except where it isn't, you know. No matter where we set the right margin, text quoted from e-mails will break against it if it can't be reflowed. >Especially when we send email or netnews in some application-specific >format, assuming that everybody else uses the same platform and applications >we do. > >> >These simple conventions let us format our text exactly the way we want >> >to. 
We can indent or not, we can put line breaks where we want them, we >> >can have columns of numbers or other tabular presentations, mathematical >> >expressions, >> >> which actually require several hundred non-ASCII characters, unless you >> mean, as so many do, arithmetic expressions. >> >Yes, that's what I meant, thanks. (All of us here recognize the >shortcomings of ASCII -- that's why we're here! But let's not forget that >ASCII can be used to write, say, Fortran programs that can handle far more >in the way of mathematics than the repertoire of ASCII might suggest, and >that people send Fortran-like expressions back and forth in email, etc, >which could easily lose their meaning when reformatted.) How do you express a vector inner product in FORTRAN? In TeX it's something like $\sum_{i=0}^n a_i \times b_i$, and in APL it's nearly "A+.xB", but with a real times symbol. >> When I want my text to stay as I wrote it, I put it into a PDF, not a text >> file. Others prefer TeX for this purpose, or PostScript. >> >My point exactly. No, your point was that ASCII text files stay formatted the way you write them. That would be true, I suppose, if we agreed with you that we could outlaw differences in tab stops, line breaking, and other options on different platforms, because your subworld is normative and there aren't any variant practices worthy of consideration. >And how do I read your PDF if I don't have a PDF reader? >(Don't say "get one" -- I'm reading your mail on a DOS PC or a PDP-11, or a >Cray supercomputer.) Yes, we had the same problem with SGI Irix 5.2, which doesn't support a PDF reader. But the field engineers have Windows on their laptops, so it's only a problem for the user manual, not the service manual, and only becomes vitally important in paperless fabs. >How do I read TeX if I don't have the software? How >do I read PostScript if I don't have a PostScript printer or rendering >engine.
But the crucial point is: > > How will I read your PDF file 200 years from now, when > PDF itself has been consigned to the "legacy" trashheap > for the past 195 years? along with ASCII, 8859, and 2022, and all of our removable storage media. Do you know someone with a functioning Teletype paper tape reader who can read legacy ASCII files from 1970? What would you suggest I archive my life's work on for the ages to come (if anyone cares)? >> We raised the question of defining a Unicode plain text format about two >> years ago, but nothing seemed to come of it. >Then let's try again. Let me get the ball rolling with the following simple >suggestion for Unicode Plain-Text File and Interchange Format: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The following discusses a file format and a number of rendering options, but fails to address interchange. UTF-8 is usually recommended for interchange, since it avoids the Endianness question, but transfer of files in other encodings will occur, and must be provided for. The file format must define permitted character codes and code sequences. I suggest that we permit any character code that can represent a character, even if no character is defined for that code, but that we not permit unmatched surrogate characters or codes which are defined not to have the possibility of representing a character. Error behavior for the rendering process when there are illegal codes or code sequences can be undefined, or we could specify error messages and continuation policies. The display rendering process does not change the file, so any display options such as word wrap, tab stops, character width, ligatures, combining characters, and so on are orthogonal to the file format. The user can change the text and save in the new form, but the software isn't allowed to on its own. Rendering behavior of control codes and other non-printing characters must be defined. 
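A minimal sketch, in Python, of the kind of code-sequence check suggested above for the file format: flag unmatched surrogates and codes defined never to represent a character. The function name find_illegal_code_units is hypothetical, the input is assumed to be a list of 16-bit code units, and treating only U+FFFE and U+FFFF as "defined not to represent a character" is an assumption made for illustration, not something the proposal spells out.

```python
def find_illegal_code_units(units):
    """Return (index, reason) pairs for 16-bit code units that cannot form characters.

    Assumption: only U+FFFE and U+FFFF are treated as noncharacters here.
    Unassigned but otherwise valid codes are deliberately allowed through.
    """
    problems = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: legal only when a low surrogate follows immediately.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            problems.append((i, "unpaired high surrogate"))
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            problems.append((i, "unpaired low surrogate"))
        elif u in (0xFFFE, 0xFFFF):
            problems.append((i, "noncharacter"))
        i += 1
    return problems
```

What a reader or renderer does with the reported positions (error message, substitute glyph, continue or stop) is left open here, as it is in the text above.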
>A monospaced character-cell display device is assumed for the purposes of >line breaking. Characters that are too wide for a character cell (such as >Kanjis) occupy a double-width cell. Users may choose to display all characters in cells of the same width, or to mix single- and double-cell display. Note that this is not the same as half-width and full-width CJK characters, which have been defined as separate characters. >Of course, Unicode Plain Text can also >be displayed on any other kind of device, in any font, monospaced or not, in >which case "all bets are off", just as they are now with traditional plain >text when displayed in a proportional font. Specifically, we will permit rendering in ATSUI on the Mac, in Java, on NT2K, in Plan 9, and on other platforms, all with whatever level of Unicode rendering and fonts happen to be available, and we will specify what should happen for missing characters, lack of BIDI capability, lack of ligatures, etc. >Conversely, it is recognized that a monospaced (or duospaced) character-cell >device might be inadequate for display of certain writing systems, such as >Arabic or Indic scripts, and in this case intelligent rendering engines >might very well be required. For some purposes a monospaced LTR rendering of these characters may be useful, and is permitted as a user option and as a fallback. >This should, nevertheless, be possible with >plain text, without the aid of any particular markup scheme. But with the use of Unicode markup characters, such as explicit ordering and joining characters. >Plain text is composed only of Unicode characters, ^printing ^including surrogate character pairs, >with no meta-level >of formatting information, presentation hints, etc, except: > > 1. Spaces, such as U+0020 and U+00A0, which are "kept" (e.g. > adjacent spaces are not collapsed). including spaces defined at code points U+2000-U+200B. > 2. Horizontal Tabs are indicated by the HT character, U+0009. 
Tab > stops shall be assumed every 8 columns, starting at the first. (This > provision is primarily to facilitate conversion of ASCII and 8-bit > text to Unicode. Alternatively, it would be OK to force all > horizontal alignment to be accomplished by spaces.) As on a typewriter, we have no control of the user's tab stop settings. I recommend that we legislate alignment of monospaced text using spaces only, and forget HT. That's what I have taught people to do for tabular e-mail such as resumes. > 3. Line breaks are indicated by Line Separator, U+2028. Preformatted > text must break lines at column 79 or less to avoid unwanted > reformatting. At present software is free to truncate long lines, wrap at the last column, or word wrap. I would recommend that we forbid truncation and allow the user to choose wrapping style. >Column numbers are 1-based, relative to the left or > right margin, according to the prevailing directionality, with > single-width characters as the counting unit. A line break is > required at the end of the final line if it is to be considered a > line. (This is to allow append operations to work in the expected > fashion.) > > 4. Paragraph breaks are indicated by two successive Line Separators legacy, deprecated in new software > or by Paragraph Separator, U+2029. > > 5. Hard page breaks are indicated by FF, U+000C. 6. BIDI modifiers: U+200E, LEFT-TO-RIGHT MARK; U+200F, RIGHT-TO-LEFT MARK 7. Joining modifiers: U+200C, ZERO-WIDTH NON-JOINER; U+200D, ZERO-WIDTH JOINER 8. Combining characters: numerous accents; vowels in Hebrew, Arabic, Indic scripts, etc. 9. U+FEFF, ZERO-WIDTH NO-BREAK SPACE=BYTE ORDER MARK, should be the first character in a Unicode text file in 16-bit encoding (is that UTF-16? I can't keep them all straight.) BOM is not required in UTF-8 encoding. Non-normative comment: >C0 and C1 control characters other than HT and FF have no function >whatsoever in Unicode Plain Text. 
(If there were Unicode Horizontal Tab and >Page Break characters, we wouldn't need C0 at all; however, the UTC -- or at >least members of it, in previous discussions -- indicated that there is no >good reason to duplicate the C0 characters that are already in Unicode.) End comment. >A Unicode plain-text "rendering engine" shall not mess with the format of a \\\\\\\\\change >plain-text file except, optionally, at the user's discretion, to wrap lines \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\. It may on the display >that are longer than the display or printing device. Higher-level rendering ^line length >engines, of course, can do whatever they want. And plain text can contain any markup for such engines using Unicode characters that is defined for a specific use, such as HTML, TeX source code, RTF, etc. >- Frank Ed The following non-printing characters may occur in the file, but will be treated as unavailable characters. U+206A INHIBIT SYMMETRIC SWAPPING U+206B ACTIVATE SYMMETRIC SWAPPING U+206C INHIBIT ARABIC SHAPING U+206D ACTIVATE ARABIC SHAPING U+206E NATIONAL DIGIT SHAPES U+206F NOMINAL DIGIT SHAPES Unicode Standard 2.0 describes them as "Alternate format characters (usage strongly discouraged)" Behavior for unavailable characters should be defined. Options include a single glyph for any unavailable character, glyphs indicating the code block of unavailable characters, and numeric rendering. Behavior for non-printing characters with no semantic significance in plain text should be defined. Should they be treated as unavailable characters, or as though they aren't there? A growing number of standards specify the use of Unicode text files, without explicitly defining them. If we get anywhere with this, we will have to run our proposal past these other groups, including the IETF, the POSIX committee, programming language standards committees, etc. -- Edward Cherlin President Coalition Against Unsolicited Commercial E-mail Help outlaw Spam. 
Talk to us at 3-Jul-99 11:14:03-GMT,1857;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id HAA14406 for ; Sat, 3 Jul 1999 07:14:03 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id EAA276564 ; Sat, 3 Jul 1999 04:11:00 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA06070; Sat, 3 Jul 99 03:53:40 -0700 Message-Id: <9907031053.AA06070@unicode.org> Errors-To: uni-bounce@unicode.org Content-Type: text/plain X-Uml-Sequence: 8345 (1999-07-03 10:53:24 GMT) From: dickey@clark.net To: Unicode List Cc: unicode@unicode.org Date: Sat, 3 Jul 1999 03:53:23 -0700 (PDT) Subject: Re: Plain Text > > At 08:58 -0700 7/2/1999, Frank da Cruz wrote: > [failing to mention that Ed Cherlin wrote:] > >> The problems we have with ASCII plain text come mainly from a small set of > >> common variant practices. > >> > >> Using CR, LF, or CR/LF as a line or paragraph end > >> Different tab spacings > >> Optional line wrap > >> Formfeed codes vs. computed page breaks > >> BS = DEL or BS-overstrike > >> > >We all have dealt with these annoyances throughout our careers. They are > >indeed annoying, but not impassible impediments. Also, let's not mix up: > > > > . File storage format > > . Interchange format > > . Data entry format > . Rendering options > > On looking through the remainder of this message, I conclude that I > disagree with Frank's attempts to make his own limited experience Perhaps you should introduce yourself - I know who Frank is, and the other contributors to this list at least give the impression of being polite and knowledgable. -- Thomas E. 
Dickey dickey@clark.net http://www.clark.net/pub/dickey 3-Jul-99 22:53:44-GMT,2716;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id SAA19468 for ; Sat, 3 Jul 1999 18:53:43 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id PAA321514 ; Sat, 3 Jul 1999 15:49:25 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA08130; Sat, 3 Jul 99 15:36:24 -0700 Message-Id: <9907032236.AA08130@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8350 (1999-07-03 22:36:12 GMT) From: Edward Cherlin To: Unicode List Cc: unicode@unicode.org Date: Sat, 3 Jul 1999 15:36:11 -0700 (PDT) Subject: Re: Plain Text At 03:56 -0700 7/3/1999, dickey@clark.net wrote: >> >> At 08:58 -0700 7/2/1999, Frank da Cruz wrote: >> [failing to mention that Ed Cherlin wrote:] >> >> The problems we have with ASCII plain text come mainly from a small >>set of >> >> common variant practices. >> >> >> >> Using CR, LF, or CR/LF as a line or paragraph end >> >> Different tab spacings >> >> Optional line wrap >> >> Formfeed codes vs. computed page breaks >> >> BS = DEL or BS-overstrike >> >> >> >We all have dealt with these annoyances throughout our careers. They are >> >indeed annoying, but not impassible impediments. Also, let's not mix up: >> > >> > . File storage format >> > . Interchange format >> > . Data entry format >> . Rendering options >> >> On looking through the remainder of this message, I conclude that I >> disagree with Frank's attempts to make his own limited experience > >Perhaps you should introduce yourself - I know who Frank is, and the other >contributors to this list at least give the impression of being polite >and knowledgable. > >-- >Thomas E. 
Dickey >dickey@clark.net >http://www.clark.net/pub/dickey Well, in no particular order, I am Edward Cherlin Spam fighter Participant in standards processes for APL, I18N, Unicode Experience in production of documents including APL, math, music, Chinese, Korean, Japanese, Greek, Russian, Hebrew, Yiddish Author and publisher of The Worldwide Impact of the Unicode Character Set Standard, 1994. BA Honors Math & Philosophy Yale 1967 Buddhist priest Author of The New Newbie Pages at http://www.newbie.net Member of this list for several years. I was part of the discussion with Frank about a Unicode text standard two years ago. Ed Cherlin, President, CAUCE "Everything should be made as simple as possible, __but no simpler__." Attributed to Albert Einstein 3-Jul-99 23:14:02-GMT,1568;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA20747 for ; Sat, 3 Jul 1999 19:14:02 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id QAA323778 ; Sat, 3 Jul 1999 16:09:39 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA08393; Sat, 3 Jul 99 16:00:26 -0700 Message-Id: <9907032300.AA08393@unicode.org> Errors-To: uni-bounce@unicode.org Content-Type: text/plain X-Uml-Sequence: 8351 (1999-07-03 23:00:16 GMT) From: dickey@clark.net To: Unicode List Cc: unicode@unicode.org Date: Sat, 3 Jul 1999 16:00:15 -0700 (PDT) Subject: Re: Plain Text > Well, in no particular order, I am > > Edward Cherlin > Spam fighter > Participant in standards processes for APL, I18N, Unicode > Experience in production of documents including APL, math, music, Chinese, > Korean, Japanese, Greek, Russian, Hebrew, Yiddish > Author and publisher of The Worldwide Impact of the Unicode Character Set > Standard, 1994. 
> BA Honors Math & Philosophy Yale 1967 > Buddhist priest > Author of The New Newbie Pages at http://www.newbie.net > Member of this list for several years. so? (I don't see any clue for berating Frank about "limited experience", except possibly your implied age ~55 -- for the rest, I don't see anything that matters much) -- Thomas E. Dickey dickey@clark.net http://www.clark.net/pub/dickey 4-Jul-99 9:31:30-GMT,3005;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id FAA18134 for ; Sun, 4 Jul 1999 05:31:29 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id CAA258396 ; Sun, 4 Jul 1999 02:24:16 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA09766; Sun, 4 Jul 99 02:16:33 -0700 Message-Id: <9907040916.AA09766@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8355 (1999-07-04 09:16:18 GMT) From: Edward Cherlin To: Unicode List Cc: unicode@unicode.org Date: Sun, 4 Jul 1999 02:16:17 -0700 (PDT) Subject: Re: Plain Text At 16:00 -0700 7/3/1999, dickey@clark.net wrote: [failing to note that Ed Cherlin wrote in reply to his request for identification] >> Well, in no particular order, I am >> >> Edward Cherlin >> Spam fighter >> Participant in standards processes for APL, I18N, Unicode >> Experience in production of documents including APL, math, music, Chinese, >> Korean, Japanese, Greek, Russian, Hebrew, Yiddish >> Author and publisher of The Worldwide Impact of the Unicode Character Set >> Standard, 1994. >> BA Honors Math & Philosophy Yale 1967 >> Buddhist priest >> Author of The New Newbie Pages at http://www.newbie.net >> Member of this list for several years. 
[and also omitting Ed's statement about having been in a similar discussion with Frank on this list two years ago, about creating a Unicode text format standard.] > >so? (I don't see any clue for berating Frank about "limited experience", Are you berating me? You didn't ask me for "clues for berating Frank", just who I am. Do you mean that my experience is irrelevant in discussing his experience? Frank? Am I being mean to you? Is my criticism too harsh? If so, I apologize. What did you think about my suggestions for the Unicode text standard? >except possibly your implied age ~55 -- for the rest, I don't see anything >that matters much) Frank's "limited experience" is not youth but insularity. He cites practices current on UNIX systems as though they applied universally. I have used UNIX, DOS, Windows, CP/M, Apple ][, IBM mainframes via timesharing, and several other kinds of computers, dealing with character set problems well outside Frank's range of experience. I forgot to mention that I instigated and managed a software development project for a highly portable APL that came out in English, French, German, Finnish, Russian, and Japanese, on a variety of computer architectures. >-- >Thomas E. Dickey >dickey@clark.net >http://www.clark.net/pub/dickey -- Edward Cherlin President Coalition Against Unsolicited Commercial E-mail Help outlaw Spam. 
Talk to us at 4-Jul-99 11:25:19-GMT,3107;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id HAA04675 for ; Sun, 4 Jul 1999 07:25:19 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id EAA204984 ; Sun, 4 Jul 1999 04:19:18 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA10393; Sun, 4 Jul 99 04:04:37 -0700 Message-Id: <9907041104.AA10393@unicode.org> Errors-To: uni-bounce@unicode.org Content-Type: text/plain X-Uml-Sequence: 8357 (1999-07-04 11:04:25 GMT) From: dickey@clark.net To: Unicode List Cc: unicode@unicode.org Date: Sun, 4 Jul 1999 04:04:23 -0700 (PDT) Subject: Re: Plain Text [omitted previous discussion] > [and also omitting Ed's statement about having been in a similar discussion > with Frank on this list two years ago, about creating a Unicode text format > standard.] [this appeared redundant, except as a note that you had been introduced to Frank] > > > >so? (I don't see any clue for berating Frank about "limited experience", > > Are you berating me? You didn't ask me for "clues for berating Frank", just hmm (though the nearest dictionary does not convey this, my sense of 'berating' is related to the repetition of the "limited experience"). There are indeed degrees here - but then we can argue about shades of meaning. > who I am. Do you mean that my experience is irrelevant in discussing his > experience? It doesn't make a good argument - and most of your listeners stop at that point. (If you wish to be convincing, leave that out and point out the places where his posting leaves out information - and _why_ that is more important than what he's presenting). > Frank? Am I being mean to you? Is my criticism too harsh? If so, I > apologize. What did you think about my suggestions for the Unicode text > standard? I have a hunch that Frank is home for the weekend. 
> >except possibly your implied age ~55 -- for the rest, I don't see anything > >that matters much) > > Frank's "limited experience" is not youth but insularity. He cites > practices current on UNIX systems as though they applied universally. > > I have used UNIX, DOS, Windows, CP/M, Apple ][, IBM mainframes via > timesharing, and several other kinds of computers, dealing with character > set problems well outside Frank's range of experience. I forgot to mention I wouldn't be surprised if many people on this list have also used a variety of systems (otherwise they'd not be reading this list ;-). > that I instigated and managed a software development project for a highly > portable APL that came out in English, French, German, Finnish, Russian, > and Japanese, on a variety of computer architectures. I suppose so - but APL itself has little to do with the natural language aspect (perhaps you managed the message library - that would be relevant to your statement). -- Thomas E. Dickey dickey@clark.net http://www.clark.net/pub/dickey 4-Jul-99 16:49:34-GMT,1507;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA12217 for ; Sun, 4 Jul 1999 12:49:33 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id JAA272994 ; Sun, 4 Jul 1999 09:42:06 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA12324; Sun, 4 Jul 99 09:27:16 -0700 Message-Id: <9907041627.AA12324@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8364 (1999-07-04 16:27:00 GMT) From: Curtis Clark To: Unicode List Cc: unicode@unicode.org Date: Sun, 4 Jul 1999 09:26:58 -0700 (PDT) Subject: Dickey vs. 
Cherlin, was Re: Plain Text I haven't been on this list long (I've found it interesting and useful), and I don't claim any qualifications at all; but I wonder, are these sorts of exchanges common? I can understand that Unicode could generate some strident differences of opinion, but I sense that I'm missing something here. ---------------------------------------------------------------- Curtis Clark http://www.csupomona.edu/~jcclark/ Biological Sciences Department Voice: (909) 869-4062 California State Polytechnic University FAX: (909) 869-4078 Pomona CA 91768-4032 USA jcclark@csupomona.edu 4-Jul-99 16:51:03-GMT,1926;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA12504 for ; Sun, 4 Jul 1999 12:51:03 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id JAA263944 ; Sun, 4 Jul 1999 09:45:02 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA12328; Sun, 4 Jul 99 09:27:17 -0700 Message-Id: <9907041627.AA12328@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8365 (1999-07-04 16:27:01 GMT) From: Curtis Clark To: Unicode List Date: Sun, 4 Jul 1999 09:26:59 -0700 (PDT) Subject: Re: dotless j At 07:40 AM 7/4/99 -0700, Jeroen Hellingman wrote: > The semantics of both i and j >should be that >they lose their dots if you put an accent on top of them, so there never >should be a problem. I'm puzzled by this: 1. Precomposed accented characters, I have read, are included in support of legacy character sets; the ideal is to use a combining accent with a non-accented character. 2. There are issues with combining accents needing to account for the height of the base letter, dots, as well, no doubt, as ascenders and descenders. These are semantic issues, which should be handled by the software. 3. 
Unicode, it is said, is a plain text standard. (2) and (3) seem to be at odds, unless programs that display plain text become a lot more sophisticated. ---------------------------------------------------------------- Curtis Clark http://www.csupomona.edu/~jcclark/ Biological Sciences Department Voice: (909) 869-4062 California State Polytechnic University FAX: (909) 869-4078 Pomona CA 91768-4032 USA jcclark@csupomona.edu 4-Jul-99 17:41:35-GMT,2050;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id NAA24093 for ; Sun, 4 Jul 1999 13:41:35 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id KAA196170 ; Sun, 4 Jul 1999 10:37:54 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA13836; Sun, 4 Jul 99 10:24:07 -0700 Message-Id: <9907041724.AA13836@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8370 (1999-07-04 17:23:59 GMT) From: John Cowan To: Unicode List Date: Sun, 4 Jul 1999 10:23:57 -0700 (PDT) Subject: Re: dotless j Content-Transfer-Encoding: 7bit Curtis Clark scripsit: > 1. Precomposed accented characters, I have read, are included in support of > legacy character sets; the ideal is to use a combining accent with a > non-accented character. Just so. > 2. There are issues with combining accents needing to account for the > height of the base letter, dots, as well, no doubt, as ascenders and > descenders. These are semantic issues, which should be handled by the > software. I don't know what you mean by "semantic". They are *rendering* issues, which must be handled by displaying-and-printing software. Much other software doesn't care a bit. For example, you can write Java code with comments and identifier names in Yoruba, using combining characters as needed. > 3. 
Unicode, it is said, is a plain text standard. So it is. > (2) and (3) seem to be at odds, unless programs that display plain text > become a lot more sophisticated. So they must, if they are to handle all of Unicode: BIDI, conjoining Hangul jamo, etc. etc. This is the escape from your dilemma. -- John Cowan cowan@ccil.org I am a member of a civilization. --David Brin 4-Jul-99 18:00:36-GMT,1450;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA27757 for ; Sun, 4 Jul 1999 14:00:36 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id KAA321522 ; Sun, 4 Jul 1999 10:54:52 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA14163; Sun, 4 Jul 99 10:40:44 -0700 Message-Id: <9907041740.AA14163@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 8371 (1999-07-04 17:40:36 GMT) From: Roozbeh Pournader To: Unicode List Cc: Unicode List Date: Sun, 4 Jul 1999 10:40:34 -0700 (PDT) Subject: Re: dotless j On Sun, 4 Jul 1999, Curtis Clark wrote: > 3. Unicode, it is said, is a plain text standard. > > (2) and (3) seem to be at odds, unless programs that display plain text > become a lot more sophisticated. Yes! Don't consider only simple scripts like Latin. If one wants to have plain text Arabic, what should he do? He needs sophisticated software to do that. Unicode is there for all scripts. When it sees that some processing is needed for scripts like Arabic or Devanagari, it allows some processing for scripts like Latin, to solve ambiguities etc. 
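On the precomposed-versus-combining point raised earlier in this thread, a small Python sketch may help. It relies on the unicodedata module from the present-day Python standard library (an assumption of this illustration, not something the posters had in mind), and the helper names compose and decompose are hypothetical. It shows only that the two spellings of an accented letter are equivalent as text; how the dot of the i is dropped when the accent is drawn remains a rendering matter, as John Cowan says.

```python
import unicodedata

def compose(text):
    """Fold base letter + combining accent into a precomposed character where one exists."""
    return unicodedata.normalize("NFC", text)

def decompose(text):
    """Split precomposed characters into base letter + combining accent(s)."""
    return unicodedata.normalize("NFD", text)

precomposed = "\u00ed"   # LATIN SMALL LETTER I WITH ACUTE
combining = "i\u0301"    # LATIN SMALL LETTER I + COMBINING ACUTE ACCENT

# Both spellings normalize to the same sequence, so software that does not
# render text (a compiler, a search tool) can treat them alike.
same_text = compose(combining) == precomposed and decompose(precomposed) == combining
```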
--Roozbeh 4-Jul-99 18:19:02-GMT,8795;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA01657 for ; Sun, 4 Jul 1999 14:19:01 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id LAA187394 ; Sun, 4 Jul 1999 11:11:02 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA14272; Sun, 4 Jul 99 10:51:50 -0700 Message-Id: <9907041751.AA14272@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8372 (1999-07-04 17:51:37 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Sun, 4 Jul 1999 10:51:36 -0700 (PDT) Subject: Re: Plain Text > I conclude that I disagree with Frank's attempts to make his own limited > experience normative... > I'm not sure why my experience has become an issue in this discussion but I can assure you I have a fair amount. My first programming experience was with plugboards and little wires on IBM EAM equipment. My current project, now in its 18th year, is precisely the interchange of text among divergent platforms, with full conversion of both record format and character set. I have written software to do this that has run, at one time or another, on more than 700 different hardware-and-OS platforms, many long dead, and at present on more than 150. This project, which I manage, also produces (and/or collects and distributes, and supports) similar software written by other people both here and abroad, and the entire collection spans practically every computer and operating system that has existed over the past 25-30 years with just a few exceptions. Part of the project is the definition of a protocol for meaningful text transfer. The protocol requires conversion of local formats and character sets to standard ones when sending, and the reverse procedure when receiving. 
Only international standard character sets are used on the wire, and are tagged using standard ISO-registered identifiers. This protocol has been in production for more than 10 years and is used in many parts of the world, especially Eastern and Western Europe, Israel, Greece, the former USSR, Japan, and the Americas. One of the key questions in designing and implementing such a protocol is "what is a text file?" What distinguishes it from a non-text, or "binary" file? Constant day-to-day experience with a worldwide user base helps me to form what I hope is an adequate grasp of the issues. > >The point is, it doesn't matter. Each platform has its own format for > >internal use, but a standardized interface to the outside world. To > >further demonstrate this fact, if I then tell the computer on the far > >end to "type" or "cat" the file, it will, invariably, send: > > > > This is a line > > Your cultural ignorance/sheltered life-experience is showing. *You* may > live in an environment where these changes are made automatically, but a > lot of us don't. > Then please give counterexamples. > >So who cares what the file format is -- except of course when we want to > >transfer the file to another platform. > > And since I don't use a VT100 simulator anymore, I only encounter this > issue when transferring files to another platform, and as a result I care > all the time. > > >In that case, it is the > >responsibility of each file-transfer agent > > When reading floppy disks? > Of course. One of the biggest problems facing any of us who wishes to live in a world of computing diversity is the failure of file system designers to develop a rational method for tagging files, and indeed, for developing standard interchange formats. That's what we're trying to do here. Consider a minimal platform like DOS. You can set up your DOS system to load different code pages, such as CP850 for West European languages, CP866 for Cyrillic, and so on. 
Then you can use standard DOS utilities to create and edit text files in many languages (but only one per file). However, no record is kept of the encoding (character set) of each file. This presents rather significant problems even when we stay on the PC, before we ever think about interchanging files. So at minimum, a text file should be tagged according to character set. To my knowledge, this has never been done at the file-system level. What about file type and record format? Data interchange can be done in various ways. One way involves cooperating agents at each end -- e.g. FTP client and server. They can use their own application-specific protocol to control the process. For example, one can say "I'm DOS" and the other "I'm UNIX" and then apply the appropriate conversions. Of course as platforms multiply, we have an n x n problem. Therefore we settle upon standard formats to be used on the wire. Each transfer partner converts to and from these standard formats. Moving files by magnetic media present numerous problems, but only because we have forgotten how to do it. Back in the 1970s, ANSI developed standards for data interchange by magnetic media (e.g. ANSI X3.26-1978) that worked perfectly well until the personal computer revolution came along and standards went out of style. A DOS (or Macintosh or IRIX or any other) diskette is simply not intended for export to other platforms. This is the kind of situation we would like to avoid in the future. Hence this discussion. > You are still claiming that text files as they occur in your computer > subculture are for some reason normative for the rest of us. 
> Actually I am attempting to achieve an agreement on a precise definition of Unicode plain text that allows the text to be already formatted, one that gives us the same capability that we have always had with ASCII (and Latin-x etc) of encoding and presenting information without *requiring* the use of any higher intelligence beyond what is needed to interpret Space, LS, PS, HT, and FF characters, plus whatever else is needed to accommodate bidi, etc. > >Prior to the advent of > >word processors, the idea of "long line as paragraph" never came up. > > Word processing began in the 1960s. I gather you had a later date in mind. > Did you mean specifically WYSIWYG word processors, invented at Xerox in the > late 1970s? > And, before it, NLS, used at government research institutes in the 1960s. But again, that's not plain text. It's "input for a text formatter". It does not stand on its own. > >No, a correct email client will leave it alone. Whether I want my email > >reformatted by your client should be my choice, since only I know what my > >intentions are in sending it. ^^^^^^^^^ > > However, it actually is the recipient's choice, and you can't stop us. > This sounds like quibbling but it's an important point. If I have the capability to compose and format a plain-text message exactly as I want you to see it, the mail system should allow me to mark it as "preformatted plain text" and then you would have to go out of your way to reformat it. Whereas if my mail client sends long lines with no formatting, it should mark it as "plain text to be flowed". Email issues, especially MIME, are a whole new topic, and a controversial one, best avoided here. But a clear statement from the Unicode Consortium on plain text that addresses the issue of formatting might motivate the "email community" to deal with these issues in a productive way. > A growing number of standards specify the use of Unicode text files, > without explicitly defining them. 
If we get anywhere with this, we will > have to run our proposal past these other groups, including the IETF, the > POSIX committee, programming language standards committees, etc. > Good. Let's try to keep making progress. We all have an intuitive grasp of the meaning of preformatted plain text. You'll find it in many places: . READ.ME files on your software disks. . Program source code. . Traditional (not "legacy") email and netnews. . Voluminous full-text information already online. and so on. We should find a way to carry this notion forward for Unicode in a way that: . Avoids the pitfalls of platform-dependent formatting conventions. . Allows straightforward and unambiguous conversion of 8-bit data to Unicode (and, to the extent possible, vice-versa). . Is independent of any higher-level protocol, markup language, product, or even standard. In other words, the Unicode definition should stand entirely on its own so that files encoded (or transmitted) in this format will be universally understood for years, decades, centuries to come, no matter what else might change, as long as Unicode itself lives on. 
- Frank 4-Jul-99 18:24:28-GMT,2544;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA02663 for ; Sun, 4 Jul 1999 14:24:28 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id LAA266440 ; Sun, 4 Jul 1999 11:17:45 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA14818; Sun, 4 Jul 99 11:04:45 -0700 Message-Id: <9907041804.AA14818@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Uml-Sequence: 8376 (1999-07-04 18:04:20 GMT) From: Markus Kuhn To: Unicode List Date: Sun, 4 Jul 1999 11:04:16 -0700 (PDT) Subject: Re: Frank and Plain Text Edward Cherlin wrote on 1999-07-04 09:16 UTC: > Frank's "limited experience" is not youth but insularity. He cites > practices current on UNIX systems as though they applied universally. > > I have used UNIX, DOS, Windows, CP/M, Apple ][, IBM mainframes via > timesharing, and several other kinds of computers, dealing with character > set problems well outside Frank's range of experience. Just for the record, let me quickly introduce your discussion partners: Frank da Cruz, whom you attested "limited experience" in the field of inter-platform plaintext exchange, is the author of KERMIT. KERMIT is a widely ported classic terminal emulator with built-in file transmission software. It is most likely available on *all* the platforms that you have ever used, and as the implementor of KERMIT's text-file transmission mechanism, Frank certainly had to worry about the plain text file conventions used on all these systems. He is probably one of the most qualified experts on matters related to the emulation of historic data-entry terminals and inter-platform plain-text format conventions. 
(In case you have never used or heard about KERMIT, please draw the appropriate conclusions regarding the scope of your own experience.) Thomas Dickey is the maintainer of xterm, probably the currently most widely used VT100 terminal emulator on this planet, and the application that primarily has to process all plaintext on Unix workstations in the end. (If you have never heard of xterm, same conclusion.) Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: 4-Jul-99 19:05:41-GMT,1529;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id PAA12778 for ; Sun, 4 Jul 1999 15:05:41 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id LAA190184 ; Sun, 4 Jul 1999 11:57:31 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA16486; Sun, 4 Jul 99 11:44:25 -0700 Message-Id: <9907041844.AA16486@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8380 (1999-07-04 18:44:09 GMT) From: John Cowan To: Unicode List Date: Sun, 4 Jul 1999 11:44:08 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Frank da Cruz scripsit: > One of the key questions in designing and implementing such a protocol is > "what is a text file?" Indeed. The GNU utilities go to great lengths to process all 256 bytes even in purely text utilities, but none of them (except specific conversion programs) handle multibyte text. > So at minimum, a text file should be tagged according to character set. To > my knowledge, this has never been done at the file-system level. Either that, or there needs to be only one character set! :-) -- John Cowan cowan@ccil.org I am a member of a civilization. 
--David Brin 4-Jul-99 20:08:35-GMT,2740;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA26524 for ; Sun, 4 Jul 1999 16:08:35 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id NAA255332 ; Sun, 4 Jul 1999 13:01:26 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA17563; Sun, 4 Jul 99 12:45:51 -0700 Message-Id: <9907041945.AA17563@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Uml-Sequence: 8386 (1999-07-04 19:45:25 GMT) From: "Paul Dempsey (Exchange)" To: Unicode List Date: Sun, 4 Jul 1999 12:45:20 -0700 (PDT) Subject: RE: Plain Text > > Frank da Cruz: > > So at minimum, a text file should be tagged according to character set. To > > my knowledge, this has never been done at the file-system level. > John Cowan: > Either that, or there needs to be only one character set! :-) We'll have to deal with multiple untagged codepages/encodings/charsets for a long time yet. It's unlikely we'll get file systems to carry any meta-information beyond the filename in any portable way and certainly not retroactively. What we CAN do is use encoding signatures for all Unicode files. The various forms of Unicode are still relatively new and we still have a chance to establish the conventions. The Unicode standard lists signatures for _some_ Unicode encodings, in section 13.6 Specials, Encoding Form Signature: UCS-2(UTF-16) FE FF UCS-4 00 00 FE FF However, this is incomplete. The most important thing we're missing from the standard is: UTF-8 EF BB BF These are all the ZERO WIDTH NO BREAK SPACE (a.k.a BYTE ORDER MARK) in the corresponding representation. Without a signature for UTF-8, you can't reliably assume you're working with UTF-8 and not some other MBCS. 
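As a rough illustration of the signature check described above, here is a minimal Python sketch. The function name sniff_signature and the label strings are made up for this example; only the three big-endian signatures quoted in the message are covered, and a file with no signature is simply reported as unknown rather than guessed at.

```python
# Signatures exactly as quoted in the message: each one is U+FEFF
# (ZERO WIDTH NO-BREAK SPACE / BYTE ORDER MARK) in the given encoding form.
# The label strings are illustrative, not official names.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UCS-4"),
    (b"\xfe\xff", "UCS-2/UTF-16"),
    (b"\xef\xbb\xbf", "UTF-8"),
]


def sniff_signature(data: bytes):
    """Return (label, signature_length), or (None, 0) if no signature matches."""
    for signature, label in SIGNATURES:
        if data.startswith(signature):
            return label, len(signature)
    return None, 0


# Each signature really is U+FEFF in the corresponding big-endian form.
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"
assert "\ufeff".encode("utf-16-be") == b"\xfe\xff"
assert "\ufeff".encode("utf-32-be") == b"\x00\x00\xfe\xff"
```

A caller would strip the first signature_length bytes before decoding the rest; what to do when nothing matches is left open here, as it is in the thread.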
A number of Microsoft programs (Notepad, Visual Studio, richedit) are using this signature for UTF-8. For the rest of what constitutes "plain text", the Unicode standard covers most of the issues, but not explicitly in one place. The grayer part of this discussion is about what constitutes "preformatted plain text". I don't think this can be standardized to practical effect. That is, you could write a standard, but would anyone use it? This quickly gets into the domain of presentation and document structure, which is beyond the scope of the Unicode standard proper. It is still worthwhile to capture the common conventions and make recommendations. --- Paul Chase Dempsey Microsoft Visual Studio Text Editor Development 4-Jul-99 20:46:36-GMT,2397;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA05825 for ; Sun, 4 Jul 1999 16:46:36 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id NAA12582 ; Sun, 4 Jul 1999 13:41:38 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA19319; Sun, 4 Jul 99 13:33:13 -0700 Message-Id: <9907042033.AA19319@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8391 (1999-07-04 20:33:04 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Sun, 4 Jul 1999 13:33:03 -0700 (PDT) Subject: RE: Plain Text > We'll have to deal with multiple untagged codepages/encodings/charsets > for a long time yet. It's unlikely we'll get file systems to carry any > meta-information beyond the filename in any portable way and certainly > not retroactively. > And I most emphatically recommend against using filenames for this purpose for at least the following reasons: . Different platforms have different filename formats and restrictions as to what can be in a filename, how long it can be, etc. . There is no central registry for filename associations. 
Horrible confusion arises when different software vendors choose the same association for two different products or, worse, when files are transferred across platforms that have different associations. > For the rest of what constitutes "plain text", the Unicode standard > covers most of the issues, but not explicitly in one place. The grayer > part of this discussion is about what constitutes "preformatted plain > text". I don't think this can be standardized to practical effect. That > is, you could write a standard, but would anyone use it? > Those who needed a guaranteed way to record preformatted plain text in documents that can persist over long periods of time and across all applications and platforms would use it. Even now, there exists such a standard, albeit unwritten, for 8-bit text. For example, almost every word processor and web browser has a "Save as" option for "plain text with line breaks" which, in the general case, is the only reliable interchange format. What will be the Unicode equivalent? - Frank 4-Jul-99 20:58:13-GMT,2175;000000000001 Return-Path: Received: from dfssl.exchange.microsoft.com (dfssl.exchange.microsoft.com [131.107.88.59]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA07800 for ; Sun, 4 Jul 1999 16:58:13 -0400 (EDT) Received: by dfssl with Internet Mail Service (5.5.2648.0) id <3DSG1TJV>; Sun, 4 Jul 1999 13:57:07 -0700 Message-ID: <01D6C7224936D211BA450000F805D5380809563E@TOTO> From: "Paul Dempsey (Exchange)" To: "'Frank da Cruz'" , "Paul Dempsey (Exchange)" Cc: unicode@unicode.org Subject: RE: Plain Text Date: Sun, 4 Jul 1999 13:56:57 -0700 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2648.0) Content-Type: text/plain; charset="iso-8859-1" > > Paul Chase Dempsey: > > We'll have to deal with multiple untagged codepages/encodings/charsets > > for a long time yet. It's unlikely we'll get file systems to carry any > > meta-information beyond the filename in any portable way and certainly > > not retroactively. 
> > Frank da Cruz > I most emphatically recommend against using filenames for this purpose .. I emphatically agree. I meant to say that the name is the only information you can expect a file system to maintain apart from the data in the file. I did not mean to imply that the name should be used to encode any other information. If I did, I would have proposed a notation. Without a reliable means to capture the encoding external to the bits in the file itself, I suggest the standardization of Unicode file signatures. These are already in common use except for UTF-8, and it's useful to extend the practice to UTF-8. ... > Frank da Cruz > Even now, there exists such a standard, albeit unwritten, for 8-bit text. > For example, almost every word processor and web browser has a "Save as" > option for "plain text with line breaks" which, in the general case, is the > only reliable interchange format. What will be the Unicode equivalent? Exactly the same, except Unicode data instead of 8-bit MBCS data. So let's write down the unwritten! Regards, --- Paul Chase Dempsey 4-Jul-99 22:58:59-GMT,2175;000000000011 Return-Path: Received: from light.dkuug.dk (55.ppp1-10.image.dk [212.54.73.247]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id SAA04769 for ; Sun, 4 Jul 1999 18:58:57 -0400 (EDT) Received: (from keld@localhost) by light.dkuug.dk (8.9.3/8.9.3) id AAA03372; Mon, 5 Jul 1999 00:58:56 +0200 Date: Mon, 5 Jul 1999 00:58:56 +0200 From: keld@dkuug.dk To: Frank da Cruz Cc: Unicode List Subject: Re: Plain text: Amendment 1 Message-ID: <19990705005856.B3289@light.dkuug.dk> References: <9907021618.AA21230@unicode.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: <9907021618.AA21230@unicode.org>; from Frank da Cruz on Fri, Jul 02, 1999 at 09:17:48AM -0700 On Fri, Jul 02, 1999 at 09:17:48AM -0700, Frank da Cruz wrote: > 90 seconds later... > > 3. Line breaks are indicated by Line Separator, U+2028. 
Preformatted > text must break lines at column 79 or less to avoid unwanted > reformatting. Column numbers are 1-based, relative to the left or > right margin, according to the prevailing directionality, with > single-width characters as the counting unit. A line break is > required at the end of the final line if it is to be considered a > line. (This is to allow append operations to work in the expected > fashion.) > > 4. Paragraph breaks are indicated by two successive Line Separators > or by Paragraph Separator, U+2029. > > Change (4) to: > > 4. Paragraph breaks are indicated by Paragraph Separator, U+2029. > > Add to (3): > > A blank line is indicated by two successive Line Separators. > Two blank lines are indicated by three of them, etc. > > This is to allow paragraphs like this one, which contain embedded > "displays" set off by blank lines that are NOT paragraph separators. could one not use C0 or C1 characters for these, so that the conventions could equally apply to say 8859 character sets? 3) could be something like one out of 3: 1. CR 2. LF 3. CR LF 4) could we use something like one of the C0 characters for that? Keld 4-Jul-99 23:42:12-GMT,3430;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA15758 for ; Sun, 4 Jul 1999 19:42:11 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id QAA270558 ; Sun, 4 Jul 1999 16:32:52 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA20741; Sun, 4 Jul 99 16:13:35 -0700 Message-Id: <9907042313.AA20741@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8397 (1999-07-04 23:13:11 GMT) From: Kermit Software Support To: Unicode List Cc: unicode@unicode.org Date: Sun, 4 Jul 1999 16:13:04 -0700 (PDT) Subject: Re: Plain text: Amendment 1 Keld wrote: > > Frank Wrote: > > > > 4. 
Paragraph breaks are indicated by two successive Line Separators > > or by Paragraph Separator, U+2029. > > > > Change (4) to: > > > > 4. Paragraph breaks are indicated by Paragraph Separator, U+2029. > > > > Add to (3): > > > > A blank line is indicated by two successive Line Separators. > > Two blank lines are indicated by three of them, etc. > > > > This is to allow paragraphs like this one, which contain embedded > > "displays" set off by blank lines that are NOT paragraph separators. > > could one not use C0 or C1 characters for these, so that the conventions > could equally apply to say 8859 character sets? > They could be, but I think we want to standardize on true Unicode characters whenever we can, since we have the power to define their semantics. The C0 and C1 sets are included for compatibility with existing sets over which the Unicode Consortium has no control, and over which we have been haggling the past few days ("the Mac does this, the PC does that, UNIX does something else"...) Anyway, we can't go back and change existing Latin-Alphabet or PC Code Page files to use consistent record formats -- that's an operating system and programming language issue, not to mention a conversion task that not even Hercules (or Xena) could handle. > 3) could be something like one out of 3: > > 1. CR > 2. LF > 3. CR LF > This is exactly why we should use LS rather than any of the above in Unicode text. Then converting existing 8-bit text to Unicode will have the happy by-product of erasing these differences. As noted previously, I would not object to adding two more "control characters" to Unicode to remove our dependence on C0 and C1 completely: 1. UHT "Unicode Horizontal Tab", which is just like C0 HT except that the tabstops are well-defined (should the tabbing concept be carried forward into Unicode Plain Text, rather than using only spaces). How to define them is, of course, another question. 2. UFF "Unicode Form Feed", like C0 Formfeed, except not in C0. 
I can't think of any applications for C0 Form Feed other than page feed or page eject, or the analogous action on video terminals, namely clear screen. But I'm sure that C0 FF has been misused in ways I never heard of and therefore a more clearly defined Unicode version might be warranted. However, I'm perfectly happy to stick with C0 HT and FF as long as they are given precise definitions for Unicode Plain Text, and nobody says "legacy" when referring to them :-) Whatever is chosen, let's keep it simple. - Frank 5-Jul-99 3:32:35-GMT,2165;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA18456 for ; Sun, 4 Jul 1999 23:32:34 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id UAA268622 ; Sun, 4 Jul 1999 20:27:50 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA22629; Sun, 4 Jul 99 20:11:42 -0700 Message-Id: <9907050311.AA22629@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8402 (1999-07-05 03:11:31 GMT) From: Jonathan Rosenne To: Unicode List Date: Sun, 4 Jul 1999 20:11:30 -0700 (PDT) Subject: Re: Plain Text I agree with John. The interchange standard should be UTF-8 or UTF-16. The sending and receiving systems should handle conversions. If the receiving system does not tag files, and uses just one encoding, it should convert the file as best it can. This way, the receiving system does not need to recognize a large number of character sets, only those it wishes to support. Since the meaning of CR, LF, CRLF, FF cannot be agreed, I agree additional Unicode characters look like a good solution. And again, the sending and receiving systems should handle conversions. I don't think tabs are needed. Spaces are sufficient. 
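The conversion step discussed in this thread, in which CR, LF and CRLF all collapse to Line Separator when existing 8-bit text is converted to Unicode, might be sketched in Python as follows. This is only an illustration under stated assumptions: the function name to_unicode_plain_text is hypothetical, the source charset defaults to Latin-1 purely for the example, and tabs and form feeds are passed through untouched since their treatment is still being debated above.

```python
import re

LS = "\u2028"  # LINE SEPARATOR

# CRLF is listed first so that a CR LF pair becomes one LS, not two.
_PLATFORM_NEWLINE = re.compile(r"\r\n|\r|\n")


def to_unicode_plain_text(raw: bytes, charset: str = "latin-1") -> str:
    """Decode 8-bit text and replace every platform line end with U+2028.

    The charset must be supplied by the caller, since (as noted in the
    thread) the file itself carries no record of its encoding.
    """
    text = raw.decode(charset)
    return _PLATFORM_NEWLINE.sub(LS, text)


dos_text = b"first line\r\nsecond line\r\n"
unix_text = b"first line\nsecond line\n"
mac_text = b"first line\rsecond line\r"

# All three platform conventions yield the same Unicode text.
assert (to_unicode_plain_text(dos_text)
        == to_unicode_plain_text(unix_text)
        == to_unicode_plain_text(mac_text))
```

Under this mapping an empty line in the source comes out as two successive Line Separators, which matches the blank-line rule in Amendment 1; deciding where a Paragraph Separator belongs needs more knowledge of the text than a byte-level conversion has.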
Jony At 11:44 04/07/99 -0700, John Cowan wrote: >Frank da Cruz scripsit: > >> One of the key questions in designing and implementing such a protocol is >> "what is a text file?" > >Indeed. The GNU utilities go to great lengths to process all 256 bytes >even in purely text utilities, but none of them (except specific conversion >programs) handle multibyte text. > >> So at minimum, a text file should be tagged according to character set. To >> my knowledge, this has never been done at the file-system level. > >Either that, or there needs to be only one character set! >:-) > >-- >John Cowan cowan@ccil.org > I am a member of a civilization. --David Brin > 5-Jul-99 3:44:29-GMT,2096;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA19615 for ; Sun, 4 Jul 1999 23:44:29 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id UAA188358 ; Sun, 4 Jul 1999 20:38:11 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA22757; Sun, 4 Jul 99 20:26:13 -0700 Message-Id: <9907050326.AA22757@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8403 (1999-07-05 03:26:05 GMT) From: Edward Cherlin To: Unicode List Cc: unicode@unicode.org Date: Sun, 4 Jul 1999 20:26:03 -0700 (PDT) Subject: Re: Dickey vs. Cherlin, was Re: Plain Text At 09:26 -0700 7/4/1999, Curtis Clark wrote: >I haven't been on this list long (I've found it interesting and useful), >and I don't claim any qualifications at all; but I wonder, are these sorts >of exchanges common? I can understand that Unicode could generate some >strident differences of opinion, but I sense that I'm missing something >here. 
> > >---------------------------------------------------------------- >Curtis Clark http://www.csupomona.edu/~jcclark/ >Biological Sciences Department Voice: (909) 869-4062 >California State Polytechnic University FAX: (909) 869-4078 >Pomona CA 91768-4032 USA jcclark@csupomona.edu I have to say it surprises me. I wasn't trying to flame Frank, and we haven't had anyone take exception to the tone of the discussion in the several years I've been here. We do tell each other quite plainly when an opinion seems ill-founded, as in Michael's comments on my notion of encoding IPA extensions using XML. -- Edward Cherlin President Coalition Against Unsolicited Commercial E-mail Help outlaw Spam. Talk to us at 5-Jul-99 5:59:35-GMT,2826;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id BAA01699 for ; Mon, 5 Jul 1999 01:59:34 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id WAA261196 ; Sun, 4 Jul 1999 22:53:27 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA23597; Sun, 4 Jul 99 22:43:55 -0700 Message-Id: <9907050543.AA23597@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8405 (1999-07-05 05:43:43 GMT) From: Edward Cherlin To: Unicode List Date: Sun, 4 Jul 1999 22:43:42 -0700 (PDT) Subject: Frank & Ed Evidently my diagnosis, that Frank da Cruz had insufficient experience in a cross-platform environment, was completely wrong, so I apologize for writing it. 
It puzzles me even more, then, that Frank writes in his Unicode text file proposal as if Unix practice, or more particularly his own practice (including practice in file format conversions in cross-platform data transfers), is normative, not just for other software, but for file formats on other platforms, without saying how this norm is to be implemented so that file format conversion ceases to be a problem for all applications. Also: How do we get agreement on such a standard from, e.g., Microsoft? How do we get users to stop using current methods? How do we deal with delimited database transfer files with a fixed limit on line length? How do we deal with legacy data? I find myself dealing with Unicode text created by Windows and Windows applications quite frequently now, with line ends marked in little-endian fashion as 0D 00 0A 00 What do we do about that? I entirely agree that cross-platform protocols should be defined so that we stop having conversion problems (such as translating text file formats upon transfer, as ftp does), but it can't be done within a character set standard, nor by defining a text file format without file format handling for applications on different platforms. I have had to collect or in some cases write conversion routines for text file transfer, including text files in ASCII, 8-bit character sets, and Unicode. I would much rather have the operating systems do it. If someone can explain to me how Frank's proposal will lead to that desired goal better than Frank's proposal with my suggested amendments, I'll be happy to go along. So can we discuss the issues now? -- Edward Cherlin President Coalition Against Unsolicited Commercial E-mail Help outlaw Spam. 
Talk to us at 5-Jul-99 9:13:48-GMT,8259;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id FAA19730 for ; Mon, 5 Jul 1999 05:13:47 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id CAA258406 ; Mon, 5 Jul 1999 02:01:43 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA25036; Mon, 5 Jul 99 01:46:42 -0700 Message-Id: <9907050846.AA25036@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8410 (1999-07-05 08:46:30 GMT) From: Edward Cherlin To: Unicode List Cc: unicode@unicode.org Date: Mon, 5 Jul 1999 01:46:28 -0700 (PDT) Subject: Re: Plain Text At 10:51 -0700 7/4/1999, Frank da Cruz wrote: [Ed Cherlin wrote:] >> I conclude that I disagree with Frank's attempts to make his own limited >> experience normative... [snip] I withdraw the remark, in view of other information received, and the answers to my objections which Frank has provided, like the next. [Frank] >> >So who cares what the file format is -- except of course when we want to >> >transfer the file to another platform. >> > >> >In that case, it is the >> >responsibility of each file-transfer agent >> [Ed] >> When reading floppy disks? >> [Frank] >Of course. One of the biggest problems facing any of us who wishes to live >in a world of computing diversity is the failure of file system designers to >develop a rational method for tagging files, and indeed, for developing >standard interchange formats. That's what we're trying to do here. > >Consider a minimal platform like DOS. You can set up your DOS system to >load different code pages, such as CP850 for West European languages, CP866 >for Cyrillic, and so on. Then you can use standard DOS utilities to create >and edit text files in many languages (but only one per file). 
However, no >record is kept of the encoding (character set) of each file. This presents >rather significant problems even when we stay on the PC, before we ever >think about interchanging files. > >So at minimum, a text file should be tagged according to character set. To >my knowledge, this has never been done at the file-system level. > >What about file type and record format? Data interchange can be done in >various ways. One way involves cooperating agents at each end -- e.g. FTP >client and server. They can use their own application-specific protocol >to control the process. For example, one can say "I'm DOS" and the other >"I'm UNIX" and then apply the appropriate conversions. Of course as >platforms multiply, we have an n x n problem. Therefore we settle upon >standard formats to be used on the wire. Each transfer partner converts to >and from these standard formats. > >Moving files by magnetic media present numerous problems, but only because >we have forgotten how to do it. Back in the 1970s, ANSI developed standards >for data interchange by magnetic media (e.g. ANSI X3.26-1978) that worked >perfectly well until the personal computer revolution came along and >standards went out of style. A DOS (or Macintosh or IRIX or any other) >diskette is simply not intended for export to other platforms. > >This is the kind of situation we would like to avoid in the future. Hence >this discussion. > >> You are still claiming that text files as they occur in your computer >> subculture are for some reason normative for the rest of us. 
>> >Actually I am attempting to achieve an agreement a precise definition of >Unicode plain text that allows the text to be already formatted, one that >gives us the same capability that we have always had with ASCII (and Latin-x >etc) of encoding and presenting information without *requiring* the use of >any higher intelligence beyond what is needed to interpret Space, LS, PS, >HT, and FF characters, plus whatever else is needed to accommodate bidi, >etc. [snip] [Frank] >> >Whether I want my email >> >reformatted by your client should be my choice, since only I know what my >> >intentions are in sending it. ^^^^^^^^^ >> >> However, it actually is the recipient's choice, and you can't stop us. >> >This sounds like quibbling but it's an important point. If I have the >capability to compose and format a plain-text message exactly as I want you >to see it, the mail system should allow me to mark it as "preformatted plain >text" and then you would have to go out of your way to reformat it. Whereas >if my mail client sends long lines with no formatting, it should mark it as >"plain text to be flowed". This is the key point for me. You acknowledge the need for flavors of text other than your preformatted plain text. I thought you were holding out for one flavor only. Now we can discuss the flavors, such as delimited database interchange files with lines of arbitrary length. Presumably we can define them using some of the apparatus that is becoming available in XML or as MIME data types. Would it make sense, then, to create a formal XML definition of plain text files, with a leading BOM, no interpretations for any tags, the minimum set of control characters, and the appropriate set of transformation formats? That would get around my earlier objection, about how to make an implementation available on all platforms. What about corresponding MIME types? >Email issues, especially MIME, are a whole new topic, and a controversial >one, best avoided here. 
But a clear statement from the Unicode Consortium >on plain text that addresses the issue of formatting might motivate the >"email community" to deal with these issues in a productive way. > >> A growing number of standards specify the use of Unicode text files, >> without explicitly defining them. If we get anywhere with this, we will >> have to run our proposal past these other groups, including the IETF, the >> POSIX committee, programming language standards committees, etc. >> >Good. Let's try to keep making progress. > >We all have an intuitive grasp of the meaning of preformatted plain text. >You'll find it in many places: > > . READ.ME files on your software disks. Preformatted or reflowable. > . Program source code. Preformatted. > . Traditional (not "legacy") email and netnews. There is presently no way to specify preformatted or reflowable. > . Voluminous full-text information already online. Including Unicode tables and other database interchange formats. >and so on. We should find a way to carry this notion forward for Unicode >in a way that: > > . Avoids the pitfalls of platform-dependent formatting conventions. > > . Allows straightforward and unambiguous conversion of 8-bit data to > Unicode (and, to the extent possible, vice-versa). > > . Is independent of any higher-level protocol, markup language, > product, or even standard. In other words, the Unicode definition > should stand entirely on its own so that files encoded (or transmitted) > in this format will be universally understood for years, decades, > centuries to come, no matter what else might change, as long as Unicode > itself lives on. Hear, hear. 
>- Frank To summarize your answer to my objections, we are defining a new format independent of previous conventions, in which we can specify usage of the minimal set of formatting characters regardless of usage in text files of 7-bit ASCII and 8-bit character sets of any kind, while allowing for a few variant flavors of text, such as preformatted, reflowable, and database. To which I add, that we can specify a portable implementation, too, and not have to wait for computer and OS vendors to get on board. Well, apparently there are no hard feelings from Frank over my earlier harsh words, so perhaps nobody else need be offended on his behalf. In case anybody missed it elsewhere, I apologize for misunderstanding Frank, and for giving the impression that I was attacking him personally. -- Edward Cherlin President Coalition Against Unsolicited Commercial E-mail Help outlaw Spam. Talk to us at 5-Jul-99 9:30:53-GMT,1554;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id FAA21152 for ; Mon, 5 Jul 1999 05:30:53 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id CAA93992 ; Mon, 5 Jul 1999 02:23:20 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA25135; Mon, 5 Jul 99 02:04:54 -0700 Message-Id: <9907050904.AA25135@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Uml-Sequence: 8411 (1999-07-05 09:04:46 GMT) From: Michael Everson To: Unicode List Date: Mon, 5 Jul 1999 02:04:44 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id FAA21152 Ar 10:51 -0700 1999-07-04, scríobh Frank da Cruz: >Moving files by magnetic media present numerous problems, but only because >we have forgotten how to do it. Oh, is that the reason? 
I thought it was a Y2K thing, that on January 1 all the magnetic tapes would go "fzzzzzzzzzzst!" like in Mission Impossible. Frivolously, -- Michael Everson * Everson Gunn Teoranta * http://www.indigo.ie/egt 15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland Guthán: +353 1 478 2597 ** Facsa: +353 1 478 2597 (by arrangement) 27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire 5-Jul-99 14:53:48-GMT,1211;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id KAA24315 for ; Mon, 5 Jul 1999 10:53:48 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id HAA278358 ; Mon, 5 Jul 1999 07:44:56 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA29395; Mon, 5 Jul 99 07:31:40 -0700 Message-Id: <9907051431.AA29395@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8426 (1999-07-05 14:31:31 GMT) From: Peter_Constable@sil.org To: Unicode List Date: Mon, 5 Jul 1999 07:31:30 -0700 (PDT) Subject: NLF (was Frank and Ed, was Plain Text) Content-Transfer-Encoding: 7bit >I find myself dealing with Unicode text created by Windows and Windows applications quite frequently now, with line ends marked in little-endian fashion as 0D 00 0A 00 Indeed, this practice has surprised me. Chris Pratley: can you comment on why Word 97 does this rather than using PS? 
Peter 5-Jul-99 15:00:09-GMT,3951;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id LAA25531 for ; Mon, 5 Jul 1999 11:00:09 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id HAA187530 ; Mon, 5 Jul 1999 07:46:42 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA29478; Mon, 5 Jul 99 07:33:11 -0700 Message-Id: <9907051433.AA29478@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8427 (1999-07-05 14:32:44 GMT) From: Peter_Constable@sil.org To: Unicode List Date: Mon, 5 Jul 1999 07:32:43 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit >Of course. One of the biggest problems facing any of us who wishes to live in a world of computing diversity is the failure of file system designers to develop a rational method for tagging files, and indeed, for developing standard interchange formats. That's what we're trying to do here. .. >What about file type and record format?... >Actually I am attempting to achieve an agreement a precise definition of Unicode plain text that allows the text to be already formatted, one that gives us the same capability that we have always had with ASCII (and Latin-x etc) of encoding and presenting information without *requiring* the use of any higher intelligence beyond what is needed to interpret Space, LS, PS, HT, and FF characters... I find myself in agreement with Ken W's comments a few messages back. I'm also inclined to say that you are wanting to define (in effect) a MIME type, and that part of the confusion / disagreement that has arisen in this thread comes about by calling this type "plain text". You want a file that is tagged with null markup to be interpreted in a specific way (as a text document as opposed, e.g. 
to a database) and with specific layout formatting. As was pointed out in an earlier message, and as we are all familiar with, sometimes files that contain only text characters and no tagging are used for purposes other than this, such as the CSV database. Also, there are times when I've had such text files in which I intend all of the text that exists between instances of { BOF, EOF, NLF } to appear on a single line, regardless of length (e.g. in source code), and other times when I expect it to wrap to whatever width is appropriate for the window in which it is viewed. All of these are legitimate things to want to be able to do with a file in this format that we have always known as "plain text". Neither the intended meaning of the content, nor the intended appearance has ever been part of the definition of plain text. Thus, I think you should expect some objection to any suggestion that "plain text" should refer to a file that is intended to be interpreted in a specific way, i.e. as a text document with specific layout formatting. Plain text can be neither more nor less than what it has always been. As we apply plain text to the Unicode context, Ken's comments were on the mark. That is not to say that it isn't reasonable, or desirable, to specify a file format to be used for text documents with specific layout formatting such that it will always appear as the author intended, and such that no markup is used beyond a standard interpretation of the characters (separating this file format from others such as PDF). We'd all benefit from it, if an agreement can be made. I just think that we may need to call it something else. And this is what Frank has acknowledged, though he may not have done so consciously: >the mail system should allow me to mark it as "preformatted plain text" We're not just talking about plain text here, we're talking about a specific kind of plain text.
Peter 5-Jul-99 17:04:57-GMT,7653;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id NAA24261 for ; Mon, 5 Jul 1999 13:04:57 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id KAA246240 ; Mon, 5 Jul 1999 10:00:25 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA01854; Mon, 5 Jul 99 09:45:38 -0700 Message-Id: <9907051645.AA01854@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8434 (1999-07-05 16:45:27 GMT) From: Frank da Cruz To: Unicode List Cc: Unicode List Date: Mon, 5 Jul 1999 09:45:26 -0700 (PDT) Subject: Re: Plain Text [Ed wrote...] > It puzzles me even more, then, that Frank writes in his Unicode text > file proposal as if Unix practice, or more particularly his own practice > (including practice in file format conversions in cross-platform data > transfers), is normative, not just for other software, but for file > formats on other platforms, without saying how this norm is to be > implemented so that file format conversion ceases to be a problem for > all applications. > I'll try to be more explicit. Whether we know it or not, text interchange methods are well-established in the pre-Unicode world, at least at the record-format level (character sets are another matter, but we know that). When I sit at my { terminal, terminal emulator, xterm window } and tell the host to "type" or "cat" a file, the internal text format is translated to the de facto canonical one, primarily that the local convention for line separation/termination is translated to CRLF. When I transfer a text file with FTP or any other file transfer protocol I know about, the same thing happens (see, e.g. RFC959). 
However, many of us are confused by the fact that local conventions differ, and perceive this as an obstacle to interchange because, for example, it is difficult to read a PC diskette on a UNIX workstation or a Macintosh, or because of the increasing amounts of email we get that uses some encoding or format we don't understand. These are problems that we have an opportunity to solve in the conversion of 8-bit text to Unicode. > How do we get agreement on such a standard from, e.g., Microsoft? > Hopefully Microsoft's representatives to the Unicode Consortium will be supportive, as some of the commentary already seems to indicate. > How do we get users to stop using current methods? > We don't have to. If the Unicode Standard defines what plain text is, then conversion of 8-bit text to Unicode will put all the divergent platform-specific formats into the same Unicode format. > How do we deal with delimited database transfer files with a fixed > limit on line length? > I don't see how these files would be affected. You can put line separators in them if you want, or leave them out. > How do we deal with legacy data? > How do we convert existing 7-bit and 8-bit plain-text files to Unicode plain text? The straightforward conversion is: . Source line -> Destination line terminated by LS. This is according to whatever the local definition of "line" is (UNIX, Macintosh, DOS, VMS, MVS, ...). And of course: . Source character set converted to Unicode. This seems obvious. C0 control characters are kept, including Horizontal Tab and Form Feed. C1 control characters are kept if the source character set has them (e.g. a Latin Alphabet) and translated otherwise (e.g. CP850). Additional wrinkles (options) might include: . Tabs expanded to spaces based on the desired tab stops, which should be 1,9,17,25,... BY DEFAULT (meaning you can supply your own tab stops). . Heuristics might be used to identify paragraphs and to separate them by Paragraph Separator.
For example, a blank line is replaced by PS. Obviously there are pitfalls. . Any conversion program would probably need an option to deal with files with "word processor" record format, in which a line is really a paragraph. > I find myself dealing with Unicode text created by Windows and Windows > applications quite frequently now, with line ends marked in > little-endian fashion as > > 0D 00 0A 00 > > What do we do about that? > I would say that this practice should be discouraged ("be conservative in what you 'send'") in any application that creates or saves Unicode text files. But it should be allowed for ("be liberal in what you 'receive'") in any conversion/import program. > I entirely agree that cross-platform protocols should be defined so that > we stop having conversion problems (such as translating text file formats > upon transfer, as ftp does), but it can't be done within a character set > standard, nor by defining a text file format without file format handling > for applications on different platforms. > I don't think anybody can presume to offer a panacea for differing application formats, other than to define a text-file format that can be used for export/import/interchange, as we have now with most popular applications. We simply need to extend this idea to Unicode. > I have had to collect or in some cases write conversion routines for text > file transfer, including text files in ASCII, 8-bit character sets, and > Unicode. I would much rather have the operating systems do it. > The operating system doesn't know what format or encoding is used in a file. It would be nice if this information was saved along with the file, but it usually isn't. 
If, in the transition to an all-Unicode computing environment, we specify not only the encoding but also a standard record format for interchange of plain text -- including (but not requiring) preformatted plain text -- we won't have to worry about operating systems, file systems, or presentation-layer issues in text-file transfer ever again. Obviously we will always have to worry about format conversions between applications that do NOT use plain text data files. But by defining a low-level baseline format for plain text, there will always be a method for recording and transmitting textual information that rises above ("sinks below") those differences, and that can always be used across platforms, distance, and time. > ... You acknowledge the need for flavors of text > other than your preformatted plain text. I thought you were holding out > for one flavor only. Now we can discuss the flavors, such as delimited > database interchange files with lines of arbitrary length. Presumably we > can define them using some of the apparatus that is becoming available in > XML or as MIME data types. > No, these are higher-level protocols that will go out of fashion some day, probably sooner than you think. Of course you can define or use all the higher level protocols you want, but you should bear in mind they are ephemeral. If you want something that lasts forever, do it in Unicode without reference to MIME, *ML, or anything else, and keep it extremely simple. > To summarize your answer to my objections, we are defining a new format > independent of previous conventions, in which we can specify usage of the > minimal set of formatting characters regardless of usage in text files of > 7-bit ASCII and 8-bit character sets of any kind, while allowing for a few > variant flavors of text, such as preformatted, reflowable, and > database. > Yes. > To which I add, that we can specify a portable implementation, > too, and not have to wait for computer and OS vendors to get on board.
> Double yes. - Frank 5-Jul-99 18:06:41-GMT,4435;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA08662 for ; Mon, 5 Jul 1999 14:06:40 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id LAA185694 ; Mon, 5 Jul 1999 11:02:46 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA02161; Mon, 5 Jul 99 10:50:02 -0700 Message-Id: <9907051750.AA02161@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8435 (1999-07-05 17:49:51 GMT) From: Frank da Cruz To: Unicode List Cc: Unicode List Date: Mon, 5 Jul 1999 10:49:50 -0700 (PDT) Subject: Re: Plain Text [Peter wrote] > I find myself in agreement with Ken W's comments a few messages back. I'm > also inclined to say that you are wanting to define (in effect) a MIME > type, and that part of the confusion / disagreement that has arisen in > this thread comes about by calling this type "plain text". > I most emphatically do not want to define a MIME type, because MIME will disappear some day but Unicode will last forever (if we do it right). > You want a file that is tagged with null markup to be interpreted in a > specific way (as a text document as opposed, e.g. to a database) and with > specific layout formatting. As was pointed out in an earlier message, and > as we are all familiar with, sometime files that contain only text > characters and no tagging are used for purposes other than this, such as > the CSV database. Also, there are times when I've had such text files in > which I intend all of the text that exists between instances of { BOF, > EOF, NLF } to appear on a single line, regardless of length (e.g. in > source code), and other times when I expect it to wrap to whatever width > is appropriate for the window in which it is viewed. > All of that is fine. I'm only proposing that we codify existing practice. 
If Unicode has a Line Separator (and it does), then if I put it in a file, it should serve its purpose. Ditto for Paragraph Separator. Ditto for C0 HT and FF (even though those purposes might be ill-defined), in the absence of "native" Unicode replacements for them. I agree that marking a "plain-text" stream as "preformatted" or "to be flowed" is a higher-level issue. However, we must also agree that plain text CAN be preformatted and not ALWAYS flowed, and that Unicode already contains the mechanisms to do it. > All of these are legitimate things to want to be able to do with a file in > this format that we have always known as "plain text". Neither the > intended meaning of the content, nor the intended appearance have ever > been part of the definition of plain text. Thus, I think you should expect > some objection to any suggestion that "plain text" should refer to a file > that is intended to be interpreted in a specific way, i.e. as a text > document with specific layout formatting. Plain text can be neither more > nor less than what is has always been. As we apply plain text to the > Unicode context, Ken's comments were on the mark. > > That is not to say that it isn't reasonable, or desireable, to specify a > file format to be used for text documents with specific layout formatting > such that it will always appear as the author intended, and such that no > markup is used beyond a standard interpretation of the characters > (separating this file format from others such as PDF). We'd all benefit > from it, if an agreement can be made. I just think that we may need to > call it something else. > "Preformatted plain text"? It's not catchy but I think it says what it means. > I certainly empathise with a desire to have a standard for preformatted > plain text. Here's the first paragraph of something in a message sent to > me recently. > Yes, "fractured plain text" comes from a flawed conversion algorithm, e.g. 
when pasting from a web page into an email window (a "double-ended break" in this case: misinterpretation of the left margin as leading spaces by the copier and gratuitous word wrapping by the paster). Obviously that's an application issue. However, I do believe that if we can establish a baseline for preformatted plain text, makers of such applications will have a better idea of how to interchange text. - Frank 5-Jul-99 20:45:18-GMT,3272;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA14621 for ; Mon, 5 Jul 1999 16:45:17 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id NAA243178 ; Mon, 5 Jul 1999 13:39:10 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA04382; Mon, 5 Jul 99 13:21:31 -0700 Message-Id: <9907052021.AA04382@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT X-Uml-Sequence: 8444 (1999-07-05 20:20:07 GMT) From: "Jonathan Coxhead" To: Unicode List Date: Mon, 5 Jul 1999 13:20:06 -0700 (PDT) Subject: Re: Plain text: Amendment 1 Content-Transfer-Encoding: 7BIT | As noted previously, I would not object to adding two more "control | characters" to Unicode to remove our dependence on C0 and C1 | completely: | | 1. UHT "Unicode Horizontal Tab", which is just like C0 HT except | that | the tabstops are well-defined (should the tabbing concept be | carried forward into Unicode Plain Text, rather than using only | spaces). How to define them is, of course, another question. My thoughts on this indicate that explicit tab widths are not appropriate: the only real requirement for plain text is that the columns line up. So we could have a character COLUMN SEPARATOR (CSEP) to go with LINE SEPARATOR (LSEP) and PARAGRAPH SEPARATOR (PSEP). It should interact with these as follows. 
"Within a paragraph that contains a CSEP, each LSEP-delimited line represents a row of a table. The table has as many columns as the maximum number of CSEP characters in any line. Each column should be wide enough to accommodate the longest column-contents in any line in that column. No inter-column spacing is provided: if there is to be space between columns, one column or the other must contain explicit space characters." So the general form of a table would be PSEP ... CSEP ... CSEP ... LSEP ... CSEP ... LSEP ... CSEP ... CSEP ... PSEP An unsophisticated renderer may choose to render CSEP as a tab to an 8-column tab stop, and this may often give acceptable results. | Whatever is chosen, let's keep it simple. This is simple to define, but not to render. Also, it doesn't give control over left/right/centre justifying each column. If this is important, I suppose the solution would be a SPACE FILL character, like \hfil in TeX, which (when occurring in a table, i e, a paragraph with at least one CSEP character) provides enough space to pad the entry it appears in to the full width available. This would allow a column to be right-justified (start all entries with SPACE FILL), centre-justified (put a SPACE FILL character before and after the entries), or even justified on a particular character, e g, the decimal point FULL STOP (break it into 2 columns, by writing CSEP, FULL STOP instead of FULL STOP, and right-justify the first, left-justify the second).
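The table rule proposed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: no COLUMN SEPARATOR character exists in Unicode, so a private-use code point (U+E000) stands in for CSEP here, and the function names are invented for the example. The sketch treats CSEP as a separator between fields, so a line with n CSEP characters has n+1 fields, which is one possible reading of the proposal. SPACE FILL is not handled.

```python
# Hypothetical stand-in: CSEP was only proposed, never encoded, so a
# private-use code point is borrowed here purely for illustration.
CSEP = "\uE000"
LSEP = "\u2028"  # LINE SEPARATOR
PSEP = "\u2029"  # PARAGRAPH SEPARATOR


def render_paragraph(paragraph: str) -> list[str]:
    """Render one paragraph; if it contains CSEP, align its lines as table rows."""
    lines = paragraph.split(LSEP)
    if CSEP not in paragraph:
        return lines  # ordinary paragraph, nothing to align
    rows = [line.split(CSEP) for line in lines]
    ncols = max(len(row) for row in rows)
    # Each column is as wide as its longest entry; no extra spacing is added.
    widths = [
        max(len(row[i]) for row in rows if i < len(row))
        for i in range(ncols)
    ]
    rendered = []
    for row in rows:
        # Pad every cell except the last one in the row.
        padded = [cell.ljust(widths[i]) for i, cell in enumerate(row[:-1])]
        rendered.append("".join(padded) + row[-1])
    return rendered


def render(text: str) -> list[list[str]]:
    """Split text into paragraphs at PSEP and render each one."""
    return [render_paragraph(p) for p in text.split(PSEP) if p]
```

Because no inter-column spacing is supplied, a caller who wants a gap has to put explicit spaces into the cell contents, as the proposal says. Character count is used as the width, which is only adequate for a fixed-pitch display of simple scripts.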
/| o o o (_|/ /| (_/ 5-Jul-99 22:33:53-GMT,1510;000000000011 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id SAA10623 for ; Mon, 5 Jul 1999 18:33:51 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id PAA200950 ; Mon, 5 Jul 1999 15:23:09 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA06878; Mon, 5 Jul 99 15:07:37 -0700 Message-Id: <9907052207.AA06878@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Uml-Sequence: 8453 (1999-07-05 22:07:14 GMT) From: keld@dkuug.dk To: Unicode List Date: Mon, 5 Jul 1999 15:07:08 -0700 (PDT) Subject: Re: Plain text: Amendment 1 On Mon, Jul 05, 1999 at 03:16:01AM -0700, keld@dkuug.dk wrote: > 3) could be something like one out of 3: > > 1. CR > 2. LF > 3. CR LF To clarify: I think "line break" could follow the conventions currently in use on the Internet: Accept all of the three above forms, but only generate one form, preferably the CR LF sequence. It seems like the Internet is going to standardize on UTF-8, and as UTF-8 encodes C0 as a single octet, I think there would be much sense in choosing a C0 sequence for the "line break" function. I think the paragraph break could then be chosen as one of the C0 Information separators, possibly the Record Separator aka control-^ .
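The "accept all three forms, generate only one" convention described above might look like the following minimal Python sketch. The function names are hypothetical, and treating a lone CR and a lone LF as two separate breaks when they appear in the order LF CR is an assumption of this example, not something the message specifies.

```python
import re

# CRLF is listed first so the pair is consumed as one break, not as two.
_ANY_BREAK = re.compile(r"\r\n|\r|\n")


def read_lines(text: str) -> list[str]:
    """Be liberal on input: accept CR, LF, or CR LF as a line break."""
    return _ANY_BREAK.split(text)


def write_lines(lines: list[str]) -> str:
    """Be conservative on output: always generate CR LF."""
    return "\r\n".join(lines)
```

For instance, `write_lines(read_lines(mixed))` rewrites a string with mixed line endings so that every break is CR LF, whatever mixture came in.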
Just my 2 eurocent Keld 5-Jul-99 22:58:31-GMT,2280;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id SAA15543 for ; Mon, 5 Jul 1999 18:58:31 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id PAA199038 ; Mon, 5 Jul 1999 15:54:19 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA08168; Mon, 5 Jul 99 15:43:35 -0700 Message-Id: <9907052243.AA08168@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8457 (1999-07-05 22:43:27 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Mon, 5 Jul 1999 15:43:26 -0700 (PDT) Subject: Re: Plain text: Amendment 1 > On Mon, Jul 05, 1999 at 03:16:01AM -0700, keld@dkuug.dk wrote: > > 3) could be something like one out of 3: > > > > 1. CR > > 2. LF > > 3. CR LF > > To clarify: I think "line break" could follow the conventions > currently in use on the Internet: Accept all of the three above forms, > but only generate one form, preferably the CR LF sequence. > > It seems like the Internet is going to standardize on UTF-8, > and as UTF-8 encodes C0 as a single octet, I think there would be > much sense in chosing a C0 sequence for the "line break" function. > > I think the paragraph break could then be chosen as one of > the C0 Information separators, possibly the Record Separator > aka control-^ . > I think the problem with this idea is that if we look at a Unicode text file and see CR and/or LF in it, we don't know if those characters came from the private text format of a 7- or 8-bit file that was converted to Unicode without any record-format conversion, or if they are the "Unicode" CR and LF. Therefore this would only move the problem of incompatible record formats from the old world (of DOS, Windows, UNIX, Macintosh) to the new one. 
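The ambiguity described here can be made concrete with a short Python sketch. It assumes three sample byte strings standing for the same two-line text as stored on DOS, UNIX, and classic Macintosh, and a hypothetical converter that is told which platform the data came from. Decoding without record-format conversion yields three different Unicode strings, while converting each local line to a line terminated by LS yields one.

```python
LS = "\u2028"  # LINE SEPARATOR

# Local line terminators, as assumed for this illustration.
LOCAL_NEWLINE = {"dos": "\r\n", "unix": "\n", "mac": "\r"}


def to_unicode_plain_text(data: bytes, charset: str, platform: str) -> str:
    """Decode and turn each local line into a line terminated by LS."""
    text = data.decode(charset)
    lines = text.split(LOCAL_NEWLINE[platform])
    if lines[-1] == "":
        lines.pop()  # the final terminator leaves an empty trailing piece
    return "".join(line + LS for line in lines)


samples = {
    "dos": b"roses\r\nviolets\r\n",
    "unix": b"roses\nviolets\n",
    "mac": b"roses\rviolets\r",
}

# Character-set conversion only: the platform history stays in the data.
kept = {name: data.decode("ascii") for name, data in samples.items()}

# Record-format conversion as well: the history is no longer needed.
converted = {
    name: to_unicode_plain_text(data, "ascii", name)
    for name, data in samples.items()
}
```

The converter still has to know the source platform at import time; what changes is that nothing downstream of the conversion needs to know it.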
It's better to have Unicode characters LS and PS (and I think also Tab/Column-Separator and Page Separator) than to recycle the C0 controls. This ensures round-trip integrity without having to know the history of the data ("it came originally from DOS so to convert it from Unicode to UNIX we need to...") - Frank 5-Jul-99 23:12:00-GMT,1857;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA18562 for ; Mon, 5 Jul 1999 19:11:58 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id QAA198948 ; Mon, 5 Jul 1999 16:05:19 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA08383; Mon, 5 Jul 99 15:51:57 -0700 Message-Id: <9907052251.AA08383@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Uml-Sequence: 8458 (1999-07-05 22:51:42 GMT) From: Otto Stolz To: Unicode List Date: Mon, 5 Jul 1999 15:51:36 -0700 (PDT) Subject: Re: Plain Text On 1999-07-01 at 13:00, Otto Stolz wrote: > In MS-DOS (or PC-DOS and other DOS variants) on the PC, it is not > well defined, at all: [...] > - '09'x (HT) means either a tabulator [...] or a line-break, I am no longer sure about the HT used as a line-break in plain text. It is indeed used in an internal Word-format (Word 2.0 for DOS, and perhaps in later versions) for this purpose, but I haven't kept an old Word implementation, so I cannot check Word's input conversion from plain text to this format. Current Word for Windows input conversions from plain text interpret some C0 characters thus (checked with Word 97):

  '09'x (TAB)        tabulator
  '0A'x (LF)         paragraph break
  '0B'x (VT)         line break
  '0C'x (FF)         page break
  '0D'x (CR)         ignored
  '0E'x (SO) [sic!]  column break

Still, my main point holds: In MS-DOS, plain text is not well defined, as there are wide variations in the usage and meaning of several control characters. Best wishes, Otto Stolz 6-Jul-99 3:34:39-GMT,2640;000000000001 Return-Path: Received: from proxy4.ba.best.com (proxy4.ba.best.com [206.184.139.15]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA04624 for ; Mon, 5 Jul 1999 23:34:25 -0400 (EDT) Received: from macchiato.com (dynamic45.pm03.mv.best.com [209.24.240.173]) by proxy4.ba.best.com (8.9.3/8.9.2/best.out) with ESMTP id UAA22744; Mon, 5 Jul 1999 20:32:14 -0700 (PDT) Message-ID: <37817931.41860B@macchiato.com> Date: Mon, 05 Jul 1999 20:34:09 -0700 From: Mark Davis X-Mailer: Mozilla 4.6 [en] (Win98; U) X-Accept-Language: en,de-CH,fr-CH,it MIME-Version: 1.0 To: Frank da Cruz CC: Unicode List Subject: Re: Plain text: Amendment 1 References: <9907052243.AA08164@unicode.org> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit A lot of the discussion of line termination relates to technical report #13. Any suggestions for additional information for that report would be welcome. (http://www.unicode.org/unicode/reports/tr13/) Mark Frank da Cruz wrote: > > On Mon, Jul 05, 1999 at 03:16:01AM -0700, keld@dkuug.dk wrote: > > > 3) could be something like one out of 3: > > > > > > 1. CR > > > 2. LF > > > 3. CR LF > > > > To clarify: I think "line break" could follow the conventions > > currently in use on the Internet: Accept all of the three above forms, > > but only generate one form, preferably the CR LF sequence. > > > > It seems like the Internet is going to standardize on UTF-8, > > and as UTF-8 encodes C0 as a single octet, I think there would be > > much sense in choosing a C0 sequence for the "line break" function. > > > > I think the paragraph break could then be chosen as one of > > the C0 Information separators, possibly the Record Separator > > aka control-^ .
> > > I think the problem with this idea is that if we look at a Unicode > text file and see CR and/or LF in it, we don't know if those > characters came from the private text format of a 7- or 8-bit file > that was converted to Unicode without any record-format conversion, > or if they are the "Unicode" CR and LF. Therefore this would only > move the problem of incompatible record formats from the old world > (of DOS, Windows, UNIX, Macintosh) to the new one. > > It's better to have Unicode characters LS and PS (and I think also > Tab/Column-Separator and Page Separator) than to recycle the C0 > controls. This ensures round-trip integrity without having to know > the history of the data ("it came originally from DOS so to convert > it from Unicode to UNIX we need to...") > > - Frank 6-Jul-99 15:00:44-GMT,2552;000000000005 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id LAA06958 for ; Tue, 6 Jul 1999 11:00:43 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id HAA200904 ; Tue, 6 Jul 1999 07:53:44 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA14263; Tue, 6 Jul 99 07:20:41 -0700 Message-Id: <9907061420.AA14263@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Uml-Sequence: 8478 (1999-07-06 14:18:21 GMT) From: Kevin Bracey To: Unicode List Date: Tue, 6 Jul 1999 07:18:20 -0700 (PDT) Subject: Re: NLF (was Frank and Ed, was Plain Text) In message <9907051432.AA29431@unicode.org> Peter_Constable@sil.org wrote: > > > >I find myself dealing with Unicode text created by Windows and Windows > applications quite frequently now, with line ends marked in little-endian > fashion as > > 0D 00 0A 00 > > Indeed, this practice has surprised me. > > Chris Pratley: can you comment on why Word 97 does this rather than using > PS? 
> I think I can partially answer this from experience on our (non-MS) environment. Our system continues to use our native line-ending type (LF only) when dealing with Unicode data, for compatibility. In particular, when converted to UTF-8, which is how Unicode is normally passed around our OS, the data will have standard looking line endings - if PS or LS were used, many non-UTF-8 aware parts of the system would get confused. Also, a lot of Unicode data is converted from non-Unicode sources - conversion will almost always leave C0 and C1 characters untouched. Changing to PS and LS would need knowledge of the source data's line ending conventions, which is hard to determine automatically. If you also need round-trip conversion (eg Shift-JIS data in an HTML form -> Unicode browser workings -> Shift-JIS submission to server), messing with line endings is almost out of the question. All other encodings use C0 controls for line endings - it's hard to make a change for one particular encoding that does it differently. 
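Two of the points above can be checked with a short Python sketch, offered as an illustration rather than as a description of any particular system. It assumes Python's standard `shift_jis` and `utf-8` codecs and an invented sample string standing for form data. LF occupies one octet in UTF-8 while LS (U+2028) occupies three, and text whose line endings have been changed to LS can no longer be converted back to Shift-JIS, because that character set has no LINE SEPARATOR.

```python
LS = "\u2028"  # LINE SEPARATOR


def can_encode(text: str, charset: str) -> bool:
    """Report whether every character of text exists in the target charset."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


# Sample form data as it might arrive from a Shift-JIS page (invented).
form_bytes = "日本語\nテキスト\n".encode("shift_jis")
text = form_bytes.decode("shift_jis")

# Line endings left alone: the bytes survive the round trip unchanged.
untouched_ok = text.encode("shift_jis") == form_bytes

# Line endings changed to LS: the text cannot go back to Shift-JIS.
converted = text.replace("\n", LS)
converted_ok = can_encode(converted, "shift_jis")

lf_octets = len("\n".encode("utf-8"))
ls_octets = len(LS.encode("utf-8"))
```

A converter that does rewrite line endings on import would therefore also need a reverse mapping on export, which is exactly the knowledge of source conventions that the message says is hard to determine automatically.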
-- Kevin Bracey, Senior Software Engineer Pace Micro Technology plc Tel: +44 (0) 1223 725228 645 Newmarket Road Fax: +44 (0) 1223 725328 Cambridge, CB5 8PB, United Kingdom WWW: http://www.acorn.co.uk/ 6-Jul-99 15:48:02-GMT,3748;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id LAA21789 for ; Tue, 6 Jul 1999 11:48:01 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id IAA246046 ; Tue, 6 Jul 1999 08:30:04 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA14360; Tue, 6 Jul 99 07:26:42 -0700 Message-Id: <9907061426.AA14360@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8480 (1999-07-06 14:23:46 GMT) From: John Cowan To: Unicode List Date: Tue, 6 Jul 1999 07:23:45 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Edward Cherlin wrote: > This is the key point for me. You acknowledge the need for flavors of text > other than your preformatted plain text. I thought you were holding out for > one flavor only. Indeed, but "preformatted plain text" has traditionally been called "plain text", or in MIME "text/plain", and this terminology ought not to be revised unwarrantedly. Other species of plain text should have a distinguishing adjective. > Now we can discuss the flavors, such as delimited database > interchange files with lines of arbitrary length. We can, but I think we would do well to nail down preformatted plain text (aka "plain text") first, as it is the most stable. > Presumably we can define > them using some of the apparatus that is becoming available in XML or as > MIME data types. 
Would it make sense, then, to create a formal XML > definition of plain text files, with a leading BOM, no interpretations for > any tags, the minimum set of control characters, and the appropriate set of > transformation formats? No, at least for the XML part. (You could create a full-SGML definition, but I question the purpose of it, except perhaps to help in defining a Unicode-preformatted-plain-text grove model.) XML compels special interpretations for "<" and "&" and requires matching enclosing tags; preformatted plain text has no such requirements. > That would get around my earlier objection, about > how to make an implementation available on all platforms. What about > corresponding MIME types? The corresponding MIME type is "text/plain; charset=utf-8" or "... utf-16". Anything else should have a different MIME type or at least different parameters. > Preformatted or reflowable. I have not seen ones that are not preformatted. > > . Traditional (not "legacy") email and netnews. > > There is presently no way to specify preformatted or reflowable. There is a widespread presumption for preformatted, although sometimes the formatting is done by the creating software, not the user, alas. Rendering software usually has at least an option to display as-is. > To summarize your answer to my objections, we are defining a new format > independent of previous conventions, in which we can specify usage of the > minimal set of formatting characters regardless of usage in text files of > 7-bit ASCII and 8-bit character sets of any kind, Yes. > while allowing for a few > variant flavors of text, such as preformatted, reflowable, and database. And of these, preformatted is the most important and stable, and should be specified first. The others can be specified ad libitum later. -- John Cowan http://www.ccil.org/~cowan cowan@ccil.org Schlingt dreifach einen Kreis um dies! 
/ Schliesst euer Aug vor heiliger Schau, Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies. -- Coleridge / Politzer 6-Jul-99 16:52:25-GMT,2051;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA10222 for ; Tue, 6 Jul 1999 12:52:24 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id JAA242372 ; Tue, 6 Jul 1999 09:38:54 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA16405; Tue, 6 Jul 99 08:49:29 -0700 Message-Id: <9907061549.AA16405@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8490 (1999-07-06 15:45:40 GMT) From: John Cowan To: Unicode List Date: Tue, 6 Jul 1999 08:45:35 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Frank da Cruz wrote: > I most emphatically do not want to define a MIME type, because MIME will > disappear some day but Unicode will last forever (if we do it right). Technically, "MIME types" are called "media types", and what they really are is named interchange formats. You *are* trying to develop an interchange format; making it a media type requires only finding a name and filling out a short registration form. As I said in an earlier message, MIME rules provide a strong case for distinguishing between "text/plain" and "application/character-stream", (where "application" here really means "other" i.e. "catchall".) The former must be composed of lines with a maximum length of (IIRC) 998 characters; the latter has no such restrictions. Text/plain could still include both reflowable and preformatted text, but I believe the weight of history is in favor of using that term for preformatted text only. -- John Cowan http://www.ccil.org/~cowan cowan@ccil.org Schlingt dreifach einen Kreis um dies! 
/ Schliesst euer Aug vor heiliger Schau, Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies. -- Coleridge / Politzer 6-Jul-99 16:52:29-GMT,4215;000000000005 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA10261 for ; Tue, 6 Jul 1999 12:52:28 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id JAA90060 ; Tue, 6 Jul 1999 09:38:57 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA16005; Tue, 6 Jul 99 08:36:26 -0700 Message-Id: <9907061536.AA16005@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8488 (1999-07-06 15:30:50 GMT) From: John Cowan To: Unicode List Date: Tue, 6 Jul 1999 08:30:34 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Frank da Cruz wrote: > We don't have to. If the Unicode Standard defines what plain text is, > then conversion of 8-bit text to Unicode will put all the divergent > platform-specific formats into the same Unicode format. Or some other widely accepted source of standardization, such as Oasis or ECMA or ISO or even W3C (though the first three, IMHO, have a better "fit" to the subject matter). > C1 control characters are kept if the source character > set has them (e.g. a Latin Alphabet) and translated otherwise > (e.g. CP850). I take this to mean "Characters 0x80 to 0x9F are zero-bit-extended if the source character set has C1 characters; if it does not (like CP850, CP1252, or VISCII), they are translated to their proper Unicode graphic equivalents." > . Heuristics might be used to identify paragraphs and to separate them > by Paragraph Separator. For example, a blank line is replaced by PS. > Obviously there are pitfalls. Indeed. For example, blank lines in source code, e.g., are not necessarily paragraph marks. 
This might be a reasonable QOI issue. > . Any conversion program would probably need an option to deal with > files with "word processor" record format, in which a line is really > a paragraph. Note that arbitrary-length lines do not meet the MIME definition of "text" (and nor does UTF-16 text); such things should really have a media type of "application/character-stream" or the like, analogous to "application/octet-stream" but with a charset parameter. > > 0D 00 0A 00 > > > > What do we do about that? > > > I would say that this practice should be discouraged ("be conservative in > what you 'send'") in any application that creates or saves Unicode text > files. But it should be allowed for ("be liberal in what you 'receive'") in > any conversion/import program. Does this Windows-Unicode text always have a proper little-endian BOM, as I believe it does? If so, then the only problem is the precise value of the line terminator. In practice, much of the Unicode text (perhaps all of it) in the world today uses old line terminators, and I think they must be explicitly allowed in a flexible definition of preformatted Unicode plain text, even if tagged with SHOULD NOT. > No, these are higher-level protocols that will go out of fashion some day, > probably sooner than you think. Of course you can define or use all the > higher level protocols you want, but you should bear in mind they are > ephemeral. SGML is almost as old, as computer things go, as plain text. Though it was not standardized until 1986, it was devised in 1974; ASCII itself only dates to 1963 or so. Moreover, unlike most file formats, SGML is character-based, not octet-based, and does not depend on any specific processing application, so whatever process refreshes Unicode data will refresh SGML data too. (XML is merely a special case of SGML.) I agree that preformatted plain text should not depend on SGML, though; that is putting Cart before Horse. [snip] > Yes. [snip] > Double yes.
Sounds like a case of violent agreement. -- John Cowan http://www.ccil.org/~cowan cowan@ccil.org Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau, Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies. -- Coleridge / Politzer 6-Jul-99 18:06:01-GMT,1998;000000000005 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA01861 for ; Tue, 6 Jul 1999 14:05:58 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id KAA258748 ; Tue, 6 Jul 1999 10:57:06 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA19726; Tue, 6 Jul 99 10:37:16 -0700 Message-Id: <9907061737.AA19726@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8500 (1999-07-06 17:34:29 GMT) From: John Cowan To: Unicode List Date: Tue, 6 Jul 1999 10:34:27 -0700 (PDT) Subject: UTR #13 comments (was: Plain text: Amendment 1) Content-Transfer-Encoding: 7bit Mark Davis wrote: > A lot of the discussion of line termination relates to technical report #13. > Any suggestions for additional information for that report would be welcome. My suggestions: 1) The NEL character in the C1 set (0x85) is the ISO equivalent of EBCDIC NL (0x15) and this mapping is duly given in the EBCDIC code page mappings on the Unicode FTP site. The text should therefore advise applications to treat U+0085 (NL/NEL) as a newline, not U+0015 (NAK). 2) There should be a warning that some old documents use bare CR (0x0D) to do underlining or other overstriking; an application that converts such text should do a more complex conversion, though treating bare CR as a NLF is marginally acceptable even for these documents (which may then wind up containing occasional lines with only spaces and underscores). 
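The two suggestions above can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not text from UTR #13: the names NLF, split_lines and has_bare_cr are hypothetical, and treating PS as just another line break is a simplification chosen here. The sketch accepts CRLF, CR, LF, U+0085 (NEL), LS and PS as line breaks, leaves U+0015 (NAK) alone, and separately flags bare CR so a converter can decide whether it is looking at overstriking:

```python
import re

# CRLF is listed first so that the pair counts as a single line break.
# U+0085 (NEL) is included; U+0015 (NAK) deliberately is not.
NLF = re.compile("\r\n|[\r\n\u0085\u2028\u2029]")


def split_lines(text: str) -> list:
    """Split text at any of the newline functions accepted by this sketch."""
    return NLF.split(text)


def has_bare_cr(text: str) -> bool:
    """True if some CR is not followed by LF (possible underlining/overstriking)."""
    return re.search("\r(?!\n)", text) is not None


print(split_lines("a\r\nb\rc\nd\u0085e\u2028f"))  # ['a', 'b', 'c', 'd', 'e', 'f']
print(has_bare_cr("title\r_____\n"))              # True
print(has_bare_cr("a\r\nb"))                      # False
```

A converter that finds has_bare_cr() true could fall back to the more complex overstrike handling instead of blindly splitting.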
-- John Cowan http://www.ccil.org/~cowan cowan@ccil.org Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau, Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies. -- Coleridge / Politzer 6-Jul-99 18:07:23-GMT,2852;000000000005 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA02179 for ; Tue, 6 Jul 1999 14:07:22 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id KAA265860 ; Tue, 6 Jul 1999 10:56:23 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA19432; Tue, 6 Jul 99 10:28:04 -0700 Message-Id: <9907061728.AA19432@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8499 (1999-07-06 17:25:42 GMT) From: John Cowan To: Unicode List Date: Tue, 6 Jul 1999 10:25:40 -0700 (PDT) Subject: Re: Plain text: Amendment 1 Content-Transfer-Encoding: 7bit Frank da Cruz wrote: > [I]f we look at a Unicode > text file and see CR and/or LF in it, we don't know if those > characters came from the private text format of a 7- or 8-bit file > that was converted to Unicode without any record-format conversion, > or if they are the "Unicode" CR and LF. The semantics of CR and LF in Unicode 2.x *are* the ambiguous ones inherited from the 7-bit controls; there are no other semantics. But this has been changed in Unicode 3.0: see UTR #13 (http://www.unicode.org/unicode/reports/tr13/), which will be a normative part of Unicode 3.0. Note well that UTR #13 does not solely prescribe the semantics of CR and LF during conversion to and from Unicode, but also the semantics of CR and LF *in* Unicode. XML, a major Unicode application, takes almost the same point of view. (IMHO, XML should be modified to accept LS as a line-end character.) 
> Therefore this would only > move the problem of incompatible record formats from the old world > (of DOS, Windows, UNIX, Macintosh) to the new one. Indeed. But the only real problem there is that some people and applications (notably nroff output) use bare CR in plain text to produce physical or notional overprinting. Otherwise, it is perfectly fine to take the UTR #13 viewpoint. > It's better to have Unicode characters LS and PS (and I think also > Tab/Column-Separator and Page Separator) than to recycle the C0 > controls. This ensures round-trip integrity without having to know > the history of the data ("it came originally from DOS so to convert > it from Unicode to UNIX we need to...") As for HT and FF, nobody uses them incompatibly, and introducing new characters for them is supererogation at best. -- John Cowan http://www.ccil.org/~cowan cowan@ccil.org Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau, Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies. 
-- Coleridge / Politzer 6-Jul-99 20:21:22-GMT,1739;000000000001 Return-Path: Received: from osiris.taz.de (osiris.taz.de [194.162.12.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA11732 for ; Tue, 6 Jul 1999 16:21:20 -0400 (EDT) Received: from track.hal.taz.de (track.hal.taz.de [10.1.0.1]) by osiris.taz.de (8.9.3/8.9.3) with ESMTP id WAA22660; Tue, 6 Jul 1999 22:21:18 +0200 Received: from diva.edv.taz.de (diva.edv.taz.de [10.1.1.44]) by track.hal.taz.de (8.9.3/8.9.3) with ESMTP id WAA13247; Tue, 6 Jul 1999 22:21:13 +0200 (MET DST) Date: Tue, 6 Jul 1999 22:21:13 +0200 (MEST) From: Roman Czyborra X-Sender: czyborra@diva.edv.taz.de To: Unicode List , John Cowan , Frank da Cruz Subject: Re: Plain Text In-Reply-To: <9907061616.AA17333@unicode.org> Message-ID: Organization: http://czyborra.com/ @ http://taz.de/ Gender: male MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII > Text/plain could still include both reflowable and preformatted > text, but I believe the weight of history is in favor of using > that term for preformatted text only. Please read http://imc.org/draft-gellens-format (also known as http://www.ietf.org/internet-drafts/draft-gellens-format-06.txt) about the Content-Type: text/plain;charset=UTF-8;format=flowed > MIME will disappear some day but Unicode will last forever The Internet and MIME will evolve but I don't see them vanish any earlier than Unicode. MIME has been integrated into the majority of platforms, browsers and mailreaders worldwide. Without MIME we wouldn't be able to properly send multilingual text anywhere. 
6-Jul-99 21:47:42-GMT,1792;000000000005 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id RAA07183 for ; Tue, 6 Jul 1999 17:47:41 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id OAA186772 ; Tue, 6 Jul 1999 14:35:38 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA24360; Tue, 6 Jul 99 14:25:17 -0700 Message-Id: <9907062125.AA24360@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8513 (1999-07-06 21:24:03 GMT) From: "Tony Harminc" To: Unicode List Date: Tue, 6 Jul 1999 14:24:01 -0700 (PDT) Subject: Re: Plain text: Amendment 1 On 6 Jul 99, at 10:25, John Cowan wrote: > As for HT and FF, nobody uses them incompatibly, and > introducing new characters for them is supererogation at best. Actually the question of HT and FF is the most bothersome one, for me. There are (at least) two problems: HT and FF both depend in some sense on the user's environment, e.g. page length (paper size if the "rendering engine" is a printer or hardcopy terminal), and tab stop settings. HT has ambiguous semantics when the HT occurs when the cursor is already at a tab stop. If the cursor got to a tab stop because of an HT, then there is no argument - another HT moves to the next tab stop. But if the cursor got there because of ordinary, implicit movement, then some systems ignore an HT (i.e. stay in the same place), while others move on to the next stop. Granted, this is mainly a problem of input methods rather than data storage or interchange, but I don't think it's quite fair to say that no one uses HT incompatibly. Tony H. 
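The HT ambiguity described above can be shown with a small Python sketch. It is only an illustration: the function expand_tabs, its ignore_at_stop flag, and the assumption of fixed tab stops every 8 columns are all hypothetical and not taken from any particular system. With ignore_at_stop=False an HT always advances to the next stop; with ignore_at_stop=True an HT is ignored when ordinary text has just brought the cursor exactly onto a stop:

```python
def expand_tabs(line: str, tab_width: int = 8, ignore_at_stop: bool = False) -> str:
    """Expand HT to spaces under one of two policies for an HT typed at a tab stop."""
    out = []
    col = 0
    typed = False  # has an ordinary character been written since the last HT?
    for ch in line:
        if ch == "\t":
            at_stop = col % tab_width == 0
            if ignore_at_stop and at_stop and typed:
                pass  # already on a stop via ordinary text: stay where we are
            else:
                pad = tab_width - (col % tab_width)
                out.append(" " * pad)
                col += pad
            typed = False
        else:
            out.append(ch)
            col += 1
            typed = True
    return "".join(out)


print(repr(expand_tabs("12345678\tx")))                       # '12345678        x'
print(repr(expand_tabs("12345678\tx", ignore_at_stop=True)))  # '12345678x'
print(repr(expand_tabs("abc\tx", ignore_at_stop=True)))       # 'abc     x'
```

The same stored line renders two different ways only when the text before the HT happens to end exactly on a stop, which is why the problem is easy to miss.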
6-Jul-99 22:57:17-GMT,1684;000000000005 Return-Path: Received: from inergen.sybase.com (inergen.sybase.com [192.138.151.43]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id SAA24819 for ; Tue, 6 Jul 1999 18:57:16 -0400 (EDT) Received: from smtp1.sybase.com (sybgate.sybase.com [130.214.220.35]) by inergen.sybase.com with ESMTP id PAA16767; Tue, 6 Jul 1999 15:58:18 -0700 (PDT) Received: from birdie.sybase.com (birdie.sybase.com [130.214.140.3]) by smtp1.sybase.com with SMTP id PAA23039; Tue, 6 Jul 1999 15:57:15 -0700 (PDT) Received: by birdie.sybase.com (5.x/SMI-SVR4/SybEC3.5) id AA04633; Tue, 6 Jul 1999 15:57:14 -0700 Date: Tue, 6 Jul 1999 15:57:14 -0700 From: kenw@sybase.com (Kenneth Whistler) Message-Id: <9907062257.AA04633@birdie.sybase.com> To: fdc@watsun.cc.columbia.edu Subject: Re: Plain Text Cc: unicode@unicode.org, kenw@sybase.com X-Sun-Charset: US-ASCII > So at minimum, a text file should be tagged according to character set. Whoa! Wait a minute. How do we get from here to there? If it's tagged, it's not a *plain* text file, but something else. The way ahead out of the character set identity morass for "text files" is to use the Universal Character Set -- that way, once again, we will know how to interpret plain text files. The rest of this discussion is about something else other than what the Unicode Standard means by "plain text", and has, as far as I can tell, more to do with devising a kind of a lowest common denominator document format standard for interoperability. While people on this list may find that interesting to discuss, it is rather orthogonal to the intended scope of the Unicode Standard. 
--Ken Whistler 6-Jul-99 23:07:50-GMT,1372;000000000005 Return-Path: Received: from inergen.sybase.com (inergen.sybase.com [192.138.151.43]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA27626 for ; Tue, 6 Jul 1999 19:07:49 -0400 (EDT) Received: from smtp1.sybase.com (sybgate.sybase.com [130.214.220.35]) by inergen.sybase.com with ESMTP id QAA18832; Tue, 6 Jul 1999 16:08:40 -0700 (PDT) Received: from birdie.sybase.com (birdie.sybase.com [130.214.140.3]) by smtp1.sybase.com with SMTP id QAA25245; Tue, 6 Jul 1999 16:07:37 -0700 (PDT) Received: by birdie.sybase.com (5.x/SMI-SVR4/SybEC3.5) id AA04637; Tue, 6 Jul 1999 16:07:37 -0700 Date: Tue, 6 Jul 1999 16:07:37 -0700 From: kenw@sybase.com (Kenneth Whistler) Message-Id: <9907062307.AA04637@birdie.sybase.com> To: fdc@watsun.cc.columbia.edu Subject: RE: Plain Text Cc: unicode@unicode.org, kenw@sybase.com X-Sun-Charset: US-ASCII > > The grayer > > part of this discussion is about what constitutes "preformatted plain > > text". I don't think this can be standardized to practical effect. That > > is, you could write a standard, but would anyone use it? > > > Those who needed a guaranteed way to record preformatted plain text in > documents that can persist over long periods of time and across all > applications and platforms would use it. At the moment, this format is called a "book". 
:-) --Ken 6-Jul-99 23:35:26-GMT,1544;000000000001 Return-Path: Received: from inergen.sybase.com (inergen.sybase.com [192.138.151.43]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA06353 for ; Tue, 6 Jul 1999 19:35:26 -0400 (EDT) Received: from smtp1.sybase.com (sybgate.sybase.com [130.214.220.35]) by inergen.sybase.com with ESMTP id QAA22879; Tue, 6 Jul 1999 16:36:29 -0700 (PDT) Received: from birdie.sybase.com (birdie.sybase.com [130.214.140.3]) by smtp1.sybase.com with SMTP id QAA27662; Tue, 6 Jul 1999 16:35:24 -0700 (PDT) Received: by birdie.sybase.com (5.x/SMI-SVR4/SybEC3.5) id AA04643; Tue, 6 Jul 1999 16:35:24 -0700 Date: Tue, 6 Jul 1999 16:35:24 -0700 From: kenw@sybase.com (Kenneth Whistler) Message-Id: <9907062335.AA04643@birdie.sybase.com> To: fdc@watsun.cc.columbia.edu Subject: Re: Plain text: Amendment 1 Cc: unicode@unicode.org, kenw@sybase.com X-Sun-Charset: US-ASCII > I think the problem with this idea is that if we look at a Unicode > text file and see CR and/or LF in it, we don't know if those > characters came from the private text format of a 7- or 8-bit file > that was converted to Unicode without any record-format conversion, > or if they are the "Unicode" CR and LF. Therefore this would only > move the problem of incompatible record formats from the old world > (of DOS, Windows, UNIX, Macintosh) to the new one. The unfortunate horse is already out of the burning barn on this one. So now we have to add a stable to the new Unicode garage. See Unicode Technical Report #13. 
--Ken 6-Jul-99 23:45:36-GMT,2026;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA08869 for ; Tue, 6 Jul 1999 19:45:35 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id QAA10440 ; Tue, 6 Jul 1999 16:42:01 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA26067; Tue, 6 Jul 99 16:32:14 -0700 Message-Id: <9907062332.AA26067@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8519 (1999-07-06 23:30:50 GMT) From: kenw@sybase.com (Kenneth Whistler) To: Unicode List Cc: unicode@unicode.org, kenw@sybase.com Date: Tue, 6 Jul 1999 16:30:44 -0700 (PDT) Subject: Re: Plain text: Amendment 1 Jonathan suggested: > > My thoughts on this indicate that explicit tab widths are not > appropriate: the only real requirement for plain text is that the > columns line up. So we could have a character > > COLUMN SEPARATOR > > (CSEP) to go with LINE SEPARATOR (LSEP) and PARAGRAPH SEPARATOR (PSEP). This isn't going to happen. Column alignment in tables is clearly a higher-level document formatting issue -- not a problem to be solved by attributing complex layout attributes to yet another format control character in the character encoding standard. > > So the general form of a table would be > > PSEP ... CSEP ... CSEP ... LSEP > ... CSEP ... LSEP > ... CSEP ... CSEP ... > PSEP > No, a table is an object defined at a higher level. > > | Whatever is chosen, let's keep it simple. Frank got that one right. We already got TAB's, ineluctably. So define some interoperable behavior on them, as is already done for the kind of preformatted plain text Frank is talking about. Otherwise, use spaces. Any other attempts to push more complex formatting down to the bare minimum preformatted plain text format is bound to fail, IMO.
--Ken > 7-Jul-99 0:15:33-GMT,3405;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id UAA16352 for ; Tue, 6 Jul 1999 20:15:32 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id RAA260764 ; Tue, 6 Jul 1999 17:11:34 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA26592; Tue, 6 Jul 99 16:58:21 -0700 Message-Id: <9907062358.AA26592@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8521 (1999-07-06 23:57:08 GMT) From: kenw@sybase.com (Kenneth Whistler) To: Unicode List Cc: unicode@unicode.org, kenw@sybase.com Date: Tue, 6 Jul 1999 16:57:07 -0700 (PDT) Subject: Re: Plain text: Amendment 1 John Cowan wrote: > > The semantics of CR and LF in Unicode 2.x *are* the ambiguous > ones inherited from the 7-bit controls; there are no other semantics. > But this has been changed in Unicode 3.0: see UTR #13 > (http://www.unicode.org/unicode/reports/tr13/), which will be a > normative part of Unicode 3.0. This is not the case. UTR #13 *is* to be considered part of the Unicode Standard, Version 3.0: http://www.unicode.org/unicode/standard/versions/Unicode3.0-beta.html However, UTR #13 constitutes "Unicode Newline *Guidelines*" [emphasis added]. There is no conformance specification and there are no normative implications. The scope constitutes: "a set of recommendations for handling these characters so as to minimize the effects on users." Think of UTR #13 as a late addition to Chapter 5, Implementation Guidelines, that did not make it into the actual printed text of The Unicode Standard, Version 3.0, forthcoming. > Note well that UTR #13 does not > solely prescribe the semantics of CR and LF during conversion to and > from Unicode, but also the semantics of CR and LF *in* Unicode. It makes suggestions. It does not normatively prescribe. 
> > As for HT and FF, nobody uses them incompatibly, and > introducing new characters for them is supererogation at best. I would agree with this. > Mark Davis wrote: > > > A lot of the discussion of line termination relates to technical report #13. > > Any suggestions for additional information for that report would be welcome. > > My suggestions: > > 1) The NEL character in the C1 set (0x85) is the ISO equivalent of > EBCDIC NL (0x15) and this mapping is duly given in the EBCDIC code page > mappings on the Unicode FTP site. The text should therefore advise > applications to treat U+0085 (NL/NEL) as a newline, not U+0015 (NAK). This was a typo/oversight in the text of UTR #13 and will be corrected. > > 2) There should be a warning that some old documents use bare > CR (0x0D) to do underlining or other overstriking; an application > that converts such text should do a more complex conversion, though > treating bare CR as a NLF is marginally acceptable even for these > documents (which may then wind up containing occasional lines > with only spaces and underscores). This is a good suggestion to add to the text of UTR #13. --Ken > > -- > John Cowan http://www.ccil.org/~cowan cowan@ccil.org > Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau, > Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies. 
> -- Coleridge / Politzer > 7-Jul-99 0:47:40-GMT,3104;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id UAA25012 for ; Tue, 6 Jul 1999 20:47:39 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id RAA252212 ; Tue, 6 Jul 1999 17:42:46 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA26896; Tue, 6 Jul 99 17:29:02 -0700 Message-Id: <9907070029.AA26896@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8523 (1999-07-07 00:26:52 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Tue, 6 Jul 1999 17:26:51 -0700 (PDT) Subject: Re: Plain Text > > So at minimum, a text file should be tagged according to character set. > > Whoa! Wait a minute. How do we get from here to there? > > If it's tagged, it's not a *plain* text file, but something else. > Sorry, I meant externally tagged, e.g. in the directory entry, along with the size, date, etc. (The lack of this kind of external tagging is a pet peeve of long duration, but is not exactly relevant to this discussion.) > The way ahead out of the character set identity morass for "text files" > is to use the Universal Character Set -- that way, once again, we > will know how to interpret plain text files. > Agreed! Well... At least if we are successful, and some new consortium doesn't come along xx years from now and declare Unicode to be "legacy" and its own new-and-improved universal encoding to be the only one to use from now on. At which point, we might need to differentiate "legacy" Unicode data from the new code, just as we now need to distinguish Unicode from Macintosh Quickdraw, Latin-1, etc. 
(Saying there will be only one character set in the future is like saying a network address can be 8 bits because there will never be more than 256 computers on a network :-) > The rest of this discussion is about something else other than what > the Unicode Standard means by "plain text", and has, as far as I can > tell, more to do with devising a kind of a lowest common denominator > document format standard for interoperability. While people on this list > may find that interesting to discuss, it is rather orthogonal to the > intended scope of the Unicode Standard. > If it is, it shouldn't be. If we rely on some other organization to worry about this (which one has the authority?) and Unicode outlives the standards and products of that organization, then we're back to "all bets are off". On the other hand, if we can back up the statement that Unicode is a plain-text standard with a definition of plain text that incorporates "lowest common denominator document format standard for interoperability" I think we will have added significant value and endurance to Unicode. The discussion seems to be trailing off -- I suppose I'll wait a few days to see what else comes up and then attempt to write something up (with full consideration of TR13). 
- Frank 7-Jul-99 2:44:42-GMT,1655;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id WAA07895 for ; Tue, 6 Jul 1999 22:44:41 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id TAA191402 ; Tue, 6 Jul 1999 19:37:18 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA27639; Tue, 6 Jul 99 19:27:23 -0700 Message-Id: <9907070227.AA27639@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8525 (1999-07-07 02:26:09 GMT) From: John Cowan To: Unicode List Date: Tue, 6 Jul 1999 19:26:07 -0700 (PDT) Subject: Re: Plain text: Amendment 1 Content-Transfer-Encoding: 7bit Kenneth Whistler scripsit: > However, UTR #13 constitutes "Unicode Newline *Guidelines*" [emphasis > added]. There is no conformance specification and there are no > normative implications. The scope constitutes: "a set of > recommendations for handling these characters so as to minimize the > effects on users." Think of UTR #13 as a late addition to Chapter 5, > Implementation Guidelines, that did not make it into the actual printed > text of The Unicode Standard, Version 3.0, forthcoming. Ah, I missed that point. But my point was that whereas Unicode 2.0 had nothing to say about CR and LF and N(E)L, Unicode 3.0 does. -- John Cowan cowan@ccil.org I am a member of a civilization. 
--David Brin 7-Jul-99 2:45:21-GMT,1919;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id WAA08109 for ; Tue, 6 Jul 1999 22:45:20 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id TAA243160 ; Tue, 6 Jul 1999 19:37:31 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA27576; Tue, 6 Jul 99 19:23:24 -0700 Message-Id: <9907070223.AA27576@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8524 (1999-07-07 02:22:09 GMT) From: John Cowan To: Unicode List Date: Tue, 6 Jul 1999 19:22:07 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Kenneth Whistler scripsit: > > > > So at minimum, a text file should be tagged according to character set. > > Whoa! Wait a minute. How do we get from here to there? > > If it's tagged, it's not a *plain* text file, but something else. I believe the reference was to file metadata like the application tag on the Mac, rather than to anything in-band. > The rest of this discussion is about something else other than what > the Unicode Standard means by "plain text", and has, as far as I can > tell, more to do with devising a kind of a lowest common denominator > document format standard for interoperability. While people on this list > may find that interesting to discuss, it is rather orthogonal to the > intended scope of the Unicode Standard. Just so. Historically, such documents have been called "plain text" documents. What Unicode means by "plain text" is simply a stream of characters. -- John Cowan cowan@ccil.org I am a member of a civilization.
--David Brin
7-Jul-99 15:43:51-GMT,4432;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id LAA20987 for ; Wed, 7 Jul 1999 11:43:51 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id IAA245160 ; Wed, 7 Jul 1999 08:35:39 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA01732; Wed, 7 Jul 99 08:11:57 -0700 Message-Id: <9907071511.AA01732@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8535 (1999-07-07 15:10:18 GMT) From: Mark Davis To: Unicode List Cc: Unicode List Date: Wed, 7 Jul 1999 08:10:16 -0700 (PDT) Subject: Re: Plain text: Tab stops Content-Transfer-Encoding: 7bit

HT has even more ambiguous semantics than you indicate. We did a survey a few years ago of word processors and desktop publishing programs, and found a wide range of different behaviors.

Suppose you have a set of tab stops, e.g. at 12pt, 36pt, 72pt, etc. You also have a string of text containing tabs. The tabs in the text divide up the text into a list of tab fields (the text between tabs). There are four problematic situations.

1. A tab field would touch or overlap a previous tab field if placed at the tab stop.* Possible behaviors we observed here were:
- go to the next tab stop
- go to the next line, at that tab stop.
- go to the next line, at the start
- ignore the tab, treat it as a space, and merge with the next tab field.

2. There are more tab fields than tab stops. Possible behaviors we observed here were:
- go to the next line, at that tab stop.
- go to the next line, at the start
- ignore the tab, treat it as a space, and merge with the next tab field.
- manufacture implicit tab stops past the end, e.g. at every 36 points, or at every 8 em.

3. A tab field would exceed the paragraph margin. Possible behaviors we observed here were:
- go to the next line, at the start
- go to the next line, at the first tab stop.

4. Tabs are used in non-left flush lines (e.g. with centered or right-flush lines). Possible behaviors we observed here were:
- ignore the flush setting on the line.
- apply the flush to just the first tab field.
- apply the flush to just the last tab field.
- lay out the tab fields as if the text were left-flush, then shift the entire line to center or right-flush it. (This comes up with pretty random looking tabulation.)

Some DTP programs, despite our best efforts to figure out the rules they were using, appeared to be pretty random in their behavior. This is especially the case with #4.

* Overlap (#1) does not only mean that the tab field is too big for the tab stop; it also happens with mixtures of left, right and center tabs. Look at the following example, where '[' means left tab stop, and '|' means centered tab stop, and '~' means tab (and use monospaced font to see properly):

[ |
aaaaaaaaaaaa~bbbbbbb

The bbbbbbb text can't be placed at the centered tab stop properly without overlapping the aaaaaaaaaaaa. Overlap can also happen when the second tab field is centered or right flush and is so large that it overlaps with the left margin.

Mark

Tony Harminc wrote:

> On 6 Jul 99, at 10:25, John Cowan wrote:
>
> > As for HT and FF, nobody uses them incompatibly, and
> > introducing new characters for them is supererogation at best.
>
> Actually the question of HT and FF is the most bothersome one, for
> me. There are (at least) two problems:
>
> HT and FF both depend in some sense on the user's environment, e.g.
> page length (paper size if the "rendering engine" is a printer or
> hardcopy terminal), and tab stop settings.
>