Erlang Mailing Lists


<  Erlang questions mailing list  ~  Language change proposal

hakan.stenholm at mbox304
Posted: Sat Nov 01, 2003 4:09 am Reply with quote
Guest
> Richard A. O'Keefe wrote:
>> One of the long term goals for Erlang is that it should support
>> Unicode;
>
> This is something that I'd advise against.
> True, it would be nice to be able to write your source code using
> native-language identifiers without having to worry about ASCII
> representation.
> However, there are two problems here:

There is also the problem of mixing native-language identifiers with
the English ones from the OTP libs, which is bound to look rather odd
and might be confusing in cases where the words mean different things
in each language. It also limits the portability of the code, as fewer
people can understand it - imagine Linux written in Finnish.

>
> 1) If somebody gives me software to maintain, I might hit a, say,
> Chinese glyph somewhere. I'd have to download the proper font just to
> be able to look at the sources.

It might also be just a bit tricky to figure out how to write the
glyph/s, if it's something like Japanese, Chinese or Korean.

> I have programmed in Java, which also uses Unicode. I tend to avoid
> the German special characters
joachim.durchholz at web.
Posted: Sat Nov 01, 2003 10:40 am Reply with quote
Guest
H
richardc at csd.uu.se
Posted: Sat Nov 01, 2003 12:30 pm Reply with quote
Guest
Joachim Durchholz wrote:

> That's also the reason why Linux is written in English - had Linus stuck
> to Finnish identifiers, Linux wouldn't be an international platform.

And please note that if Linus had coded using his mother tongue,
he would have written in Swedish, not Finnish.

/Richard



Post generated using Mail2Forum (http://m2f.sourceforge.net)
jonathan at meanwhile.fre
Posted: Sat Nov 01, 2003 10:57 pm Reply with quote
Guest
On 1 Nov 2003 at 13:30, Richard Carlsson wrote:

> Joachim Durchholz wrote:
>
> > That's also the reason why Linux is written in English - had Linus stuck
> > to Finnish identifiers, Linux wouldn't be an international platform.
>
> And please note that if Linus had coded using his mother tongue,
> he would have written in Swedish, not Finnish.
>
> /Richard

I suspect this is now a common belief - the Linux equivalent "Finux"
in Neal Stephenson's novel "Cryptonomicon" is Finnish.

- Jonathan Coupe




hakan.stenholm at mbox304
Posted: Sun Nov 02, 2003 1:18 am Reply with quote
Guest
>>> 1) If somebody gives me software to maintain, I might hit a, say,
>>> Chinese glyph somewhere. I'd have to download the proper font just
>>> to be able to look at the sources.

If you're lucky they may actually be included with the OS, but even if they
are not, I assume that a decent editor has at least some kind of
fallback method to display them, say as their Unicode integer code, e.g.
2345, which is probably just as in/comprehensible as the Chinese glyph.

>> It might also be just a bit tricky to figure out how to write the
>> glyph/s, if it's something like Japanese, Chinese or Korean.
>
> The software that displays Unicode is supposed to do that for you.

I don't think that's really going to help if you want to figure out how
to write, say, a Japanese or Chinese character on a regular keyboard -
you can of course always do a copy and paste of the text.

> Actually there are issues that I haven't seen properly handled yet;
> for example, one Far-East script (Indonesian IIRC) has glyphs that /go
> around/ their neighbouring glyph.
> Human writing is indeed a strange, aesthetically wonderful but
> technically over-complicated beast - and Unicode is designed for
> aesthetics and completeness, not for making life easy on the programs
> that use it.
>
>>> Unicode also has issues with letter case.
>> Isn't this really a kind of design error/bug/feature in Erlang?
>> While I personally would prefer code to be written in English I don't
>> see any real problems with using Unicode.
>
> Neither do I - but why use Unicode if you're writing in English
> anyway? Even 7-bit ASCII is enough. Heck, even the common subset of
> EBCDIC and ASCII would be enough!

Well, some additional symbol (non-letter) characters (like +, -, @ and ^)
might be nice and come in handy as operators in different kinds of
scientific notations.

>
>> The simplest way would
>> probably be to introduce some kind of standard upper case marker
>> (character) in the case that there is no upper case version of a
>> character. Another somewhat more confusing choice would be to require
>> that functions can only start with upper case Unicode letters
>> (possibly only the characters supplied in the current Erlang
>> character set).
>
> Too complicated, too much of a burden on the programmer to remember
> correctly, too much of a burden on the maintainer to interpret
> correctly.
>
> At least that was my initial reaction. Seeing a concrete example of
> how this is done elegantly in practice, I might reconsider :-)

(note: I'm assuming we only want non-English identifiers, strings and
atoms)

In Erlang we could require all variables/functions to start with either
upper case ASCII characters (for compatibility with the current OTP
libs) if the identifier is in pure ASCII, and otherwise start the
identifier with some special character (to mark them as upper case), say
@ or some other character that isn't used in the context in which
function and variable identifiers are used. It could look something
like this:

%% multiply
joachim.durchholz at web.
Posted: Sun Nov 02, 2003 12:49 pm Reply with quote
Guest
H
ok at cs.otago.ac.nz
Posted: Mon Nov 03, 2003 1:48 am Reply with quote
Guest
I wrote that:
> One of the long term goals for Erlang is that it should support Unicode;

Note that this wasn't a _recommendation_, it was a straightforward
report of fact. Jonas Barklund's "es_std_0.6.ps" states plainly in
section 3 that "Standard Erlang" uses Unicode. As it happens, I DO
recommend that Erlang should support Unicode, but it wasn't me that
said it back in 1998.

Joachim Durchholz <joachim.durchholz_at_web.de> wrote:
> This is something that I'd advise against.
> ...
> 1) If somebody gives me software to maintain, I might hit a,
> say, Chinese glyph somewhere. I'd have to download the proper
> font just to be able to look at the sources.

es_std_0.6.ps describes \u escapes exactly like Java/C++/C99.
This means that it is quite untrue that you would need special fonts
to look at sources. (Not that suitable free fonts haven't been available
for several years now, ...) Not that the result would be particularly
readable, but then, if I gave you source code full of identifiers like
"waea", "tuhituhi", "kupenga", "tiimata", and so on, you wouldn't find
that particularly readable despite it using none but ASCII letters.
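
As a rough illustration of that point, here is a minimal sketch in Python (used only for illustration; `escape_non_ascii` is a made-up helper, not something taken from es_std_0.6.ps). It rewrites every non-ASCII character as a Java/C99-style escape, so the text can be viewed with nothing but an ASCII font:

```python
def escape_non_ascii(source: str) -> str:
    """Rewrite non-ASCII characters as \\uXXXX (or \\UXXXXXXXX) escapes."""
    out = []
    for ch in source:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                 # plain ASCII stays as it is
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")     # Basic Multilingual Plane
        else:
            out.append(f"\\U{cp:08X}")     # C99-style escape for higher code points
    return "".join(out)


print(escape_non_ascii('Größe = "中"'))    # Gr\u00F6\u00DFe = "\u4E2D"
```

The result is viewable anywhere, though, as noted above, not any more readable.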

You cannot realistically maintain software in a language you can't at
least read, which of course is why Chinese Erlang programmers should be
allowed to use Chinese words.

> 2) There are many glyphs that look the same. For example, that
> "a" letter might actually have an entirely different encoding
> since it's from the Russian alphabet.

True: U+0430 CYRILLIC SMALL LETTER A. It's not clear how much of a
problem this is in practice. In any case, since people expect to work
with XML, this is a problem Erlang *has* to live with somehow.
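
A small Python sketch of the lookalike issue, using only the standard `unicodedata` module. The two characters usually render identically but are different code points, so they never compare equal:

```python
import unicodedata

latin_a = "a"            # U+0061
cyrillic_a = "\u0430"    # U+0430, usually drawn exactly like the Latin letter

for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Identifiers that differ only in this letter are different identifiers.
print(latin_a == cyrillic_a)                     # False
print("data" == "d" + cyrillic_a + "ta")         # False
```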

> Unicode also has issues with letter case.

More precisely, the world's scripts have issues with letter case.

> For one, there is no good mapping of lowercase and uppercase
> letters (and cannot be: for example, the German <ss> has no
> uppercase equivalent, it transliterates to SS or SZ depending on
> personal whim).

Case conversion is not a simple one-to-one mapping. That's not Unicode's
fault, that's just the way things are. There are, for example, two
conventions for converting lower case to upper case in French (lose the
accents/keep the accents). There's the point, spelled out in the Unicode
book itself, that the Turkish upper case equivalent of "i" is not "I" but
capital-I-with-dot-above, and the Turkish lower case equivalent of "I" is
not "i" but lower-case-dotless-i. Since Erlang is a case sensitive language,
this is a non-problem: you don't care about case conversion when processing
Erlang sources because you don't ever do it. When it comes to data, it's
up to the application to decide whether to use locale-sensitive case mapping
or the case mapping tables that are available free from unicode.org.
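
To make the "not one-to-one" point concrete, here is a short Python sketch. Python's built-in `str.upper()` and `str.lower()` apply the locale-independent Unicode mappings, so they show the default behaviour rather than the Turkish convention described above:

```python
# One character can map to two: sharp s has no single upper-case form here.
sharp_s = "\u00df"
print(sharp_s.upper(), len(sharp_s), len(sharp_s.upper()))   # SS 1 2

# The default (non-Turkish) mapping pairs "i" with "I" ...
print("i".upper(), "I".lower())                              # I i

# ... so the Turkish letters cannot round-trip through it.
dotted_capital_i = "\u0130"     # LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small_i = "\u0131"      # LATIN SMALL LETTER DOTLESS I
print([f"U+{ord(c):04X}" for c in dotted_capital_i.lower()])  # ['U+0069', 'U+0307']
print(dotless_small_i.upper())                                # I
```

A Turkish-aware application would need its own locale-sensitive mapping on top of this, which is the application-level choice described above.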

> Additionally, Unicode has /three/ lettercase categories: lower, upper,
> and title case. (The latter information is gleaned from the Haskell
> language report, I don't know anything further about Unicode.)

This is true. Again, I don't see what the problem is. If you want to find
a stick to beat Unicode with, there are stouter ones. (Like the fact that
the encoding of a glyph is not unique, and there is a bewildering choice of
normalisation forms.)
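
The non-unique encoding is easy to demonstrate. A minimal Python sketch using the standard `unicodedata.normalize` function, with "é" as the example character:

```python
import unicodedata

composed = "\u00e9"       # e-acute as one precomposed code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Same glyph on screen, different code point sequences.
print(composed == decomposed)                                  # False

# Normalising both sides to the same form makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True

# The compatibility forms go further, e.g. unfolding the "fi" ligature.
print(unicodedata.normalize("NFKC", "\ufb01"))                 # fi
```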

> (There's also a portability issue: there are still EBCDIC machines
> around that don't support Unicode. I don't think this is relevant for
> Erlang though *g*)

What machines are those? Certainly not IBM ones; z/Architecture has
hardware support for Unicode. If it comes to that, there are probably
still a few PDP-11s in service that only support ASCII. What of it?

> My personal idea about Unicode is that it is massively overengineered
> for simple tasks like representing source code.

It is, on the other hand, the only international widely supported large
character set standard around, and it _wasn't_ engineered just for simple
tasks.

> What are the advantages of keeping some XML data as atoms?

The same as the advantages of keeping any other data as atoms. Atoms are
physically compact and testing for atom equality is very fast; if you want
to write a program that transforms XML to something else, you'd be mad to
do it in XSLT if you could do it in Erlang, and that means pattern matching
against XML trees is interesting. SWI Prolog doesn't just store generic
identifiers and attribute names as atoms, it stores #PCDATA as atoms as well,
and SWI Prolog is used with very large RDF files. (Mind you, SWI Prolog is
a multithreaded system whose atom table _is_ garbage collected.)

> About ISO Latin and Windows: That's one of the reasons why I
> don't use umlauts in my source code, except when it comes to
> literal strings. And I'm painfully aware that having umlauts in
> strings makes my sources nonportable; the better solution is to
> have some internationalization support.

As a matter of fact, vowels with umlauts and the sharp-s character are
no trouble at all: ISO Latin 1, ISO Latin 9 (=8859-15), MacRoman, and
Windows all support them perfectly well. The big problem is things like
English quotation marks, which MacRoman and Windows support, but none of
the ISO 8859 character sets.
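
This can be checked against the codec tables that ship with Python. In the sketch below, `cp1252` is assumed to be what "Windows" means here (the Western European code page), and `encodable` is a made-up helper:

```python
def encodable(ch: str, codec: str) -> bool:
    """True if the character has a code in the given 8-bit character set."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


CODECS = ("latin_1", "iso8859_15", "mac_roman", "cp1252")

# u-umlaut, sharp s, and the English opening double quotation mark
for ch in ("\u00fc", "\u00df", "\u201c"):
    print(f"U+{ord(ch):04X}", {codec: encodable(ch, codec) for codec in CODECS})
```

Note that "supported" does not mean "same byte": MacRoman places these letters at different code positions than the ISO and Windows sets.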

Even if we confine Erlang to 8-bit character sets, people DO have reason
to use different 8-bit character sets, and some way of indicating _which_
8-bit character set was used is going to be increasingly important. (I
repeat my observation that ISO Latin 9 has a Euro character and ISO Latin 1
does not, so there is a strong incentive for Europeans to switch to
ISO Latin 9 as their default character set.)
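
A short Python sketch of why the declaration matters: the same byte decodes to different characters depending on which of the two character sets is assumed.

```python
import unicodedata

raw = b"\xa4"                              # one byte from a source file

as_latin1 = raw.decode("latin_1")          # ISO 8859-1
as_latin9 = raw.decode("iso8859_15")       # ISO 8859-15

print(unicodedata.name(as_latin1))         # CURRENCY SIGN
print(unicodedata.name(as_latin9))         # EURO SIGN
```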

PS: the words are "wire", "write", "net", and "start".


ok at cs.otago.ac.nz
Posted: Mon Nov 03, 2003 2:44 am Reply with quote
Guest
Håkan Stenholm <hakan.stenholm_at_mbox304.swipnet.se> wrote:
>> Unicode also has issues with letter case.
>
> Isn't this really a kind of design error/bug/feature in Erlang?

No. Erlang requires that some characters be classified as upper-case
letters (I'd include title-case letters in that set) and some other
characters be classified as not-upper-case letters (include lower case,
non-case, syllables, logograms). The upper case letters should contain
the 26 ASCII ones; the not-upper-case-letters should contain the 26 ASCII
ones; the two sets should be disjoint; various other characters (digits,
layout, punctuation) should also be disjoint. Works fine for Unicode.
This was all sorted out for the ISO Prolog standard.

> While I personally would prefer code to be written in English I don't
> see any real problems with using Unicode. The simplest way would
> probably be to introduce some kind of standard upper case marker
> (character) in the case that there is no upper case version of a
> character.

Erlang syntax doesn't *care* whether there is an upper case version of a
character or not. People writing in Chinese, Japanese, Korean, &c should
start their variables with an "_" (like people writing Prolog for those
languages); that's enough. The problem was first considered for Prolog
back in about 1983, as far as I know; Quintus implemented this solution
(variable starts with any upper case letter; if your script doesn't have
upper case letters, use a leading "_") by about 1985 or 1986.
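
The rule is small enough to sketch. The Python function below is only an illustration of the classification described here, not Erlang's or Quintus's actual tokenizer; treating the Unicode general categories Lu and Lt as "upper case" follows the suggestion earlier in this post, and `starts_variable` is a made-up name:

```python
import unicodedata


def starts_variable(name: str) -> bool:
    """True if a token would be read as a variable under the rule above."""
    if not name:
        return False
    first = name[0]
    # "_" always works; otherwise require an upper-case or title-case letter.
    return first == "_" or unicodedata.category(first) in ("Lu", "Lt")


for token in ("Count", "count", "\u00c4rger", "\u5909\u6570", "_\u5909\u6570"):
    print(token, starts_variable(token))
```

Scripts without case (category Lo, such as the kanji above) simply fall on the not-upper-case side unless the leading "_" is used.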

> Another somewhat more confusing choice would be to require that
> functions can only start with upper case Unicode letters
> (possibly only the characters supplied in the current Erlang
> character set).

That would certainly be confusing, since Erlang function names normally
start with not-upper-case letters.

By the way, the Unicode book spells out clear, simple, and usable rules
for identifier syntax. I wish people would read that before trying to
solve problems that don't actually exist. (There are more than enough
Unicode problems that _do_ exist...)
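
For a feel of what such rules accept, Python can serve as a stand-in: its own identifier syntax is documented as being based on the Unicode XID_Start and XID_Continue properties, with "_" additionally allowed as a first character. This shows Python's rule, not a proposal for Erlang's:

```python
candidates = ["count", "gr\u00f6\u00dfe", "\u5909\u6570", "_x", "2fast", "a-b", ""]

for name in candidates:
    # str.isidentifier() applies Python's Unicode-based identifier rule.
    print(repr(name), name.isidentifier())
```

Letters from any script may start an identifier; digits and punctuation may not.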

> [Unicode]
> might also be useful in comments, if they aren't written in English
> - Japanese, Russian and other languages that have completely different
> character sets will be rather tedious to encode in some kind of
> ASCII/Latin-1 version.

Heck, IBM mainframe programmers have been able to use wide characters
in strings and comments for at least 20 years. In Fortran, yet. My
point is that IF you are going to do this, you had better say up front
with a -erlang(Encoding,Version) declaration, which character set you are
using in those comments and strings, lest they be misunderstood.



ok at cs.otago.ac.nz
Posted: Mon Nov 03, 2003 3:07 am Reply with quote
Guest
Joachim Durchholz <joachim.durchholz_at_web.de> wrote:
> The software that displays Unicode is supposed to do that for you.
> Actually there are issues that I haven't seen properly handled yet; for
> example, one Far-East script (Indonesian IIRC) has glyphs that /go
> around/ their neighbouring glyph.

Indonesian can be written in ASCII. He may be thinking of some Indic script.

> Neither do I - but why use Unicode if you're writing in English anyway?

Because quite a lot of the characters you want for writing English
are not available in ASCII, most obviously, but not limited to,
the 6..9 66..99 quotation marks (and NO, '..' ".." are *NOT* adequate
substitutes). I can't even write my father's name in ASCII.

Not to mention the fact that this country has two official languages,
and one of them requires letters that are not available even in ISO Latin 1.
I can't even write the name of my University in Latin 1, far less ASCII.

> Agreed.
> Though the Russians tend to manage somehow - I've been seeing a lot of
> Russian software lately.
> Actually, all the non-Western languages have ways of transliterating to
> Western script. AFAIK there are even several schemes to choose from for
> any such language.

Note that many such transliteration schemes use diacritical marks,
which means that they don't map to ASCII, and may have trouble mapping
to Latin 1. The whole reason that Unicode has three alphabetic cases
is to support a historic Cyrillic->Latin transliteration scheme.
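
The three cases can be seen on one of the digraph letters that carry a separate title-case form. A minimal Python sketch, using U+01C4..U+01C6 as the example:

```python
import unicodedata

lower = "\u01c6"          # LATIN SMALL LETTER DZ WITH CARON
upper = lower.upper()     # both halves capitalised
title = lower.title()     # only the first half capitalised

for ch in (lower, title, upper):
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")
```

The general categories printed are Ll, Lt and Lu respectively; Lt is the "title case" category mentioned above.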



Bengt.Kleberg at ericsson
Posted: Mon Nov 03, 2003 8:21 am Reply with quote
Guest
jonathan_at_meanwhile.freeserve.co.uk wrote:
> On 1 Nov 2003 at 13:30, Richard Carlsson wrote:
>
>
>>Joachim Durchholz wrote:
>>
...deleted
>>And please note that if Linus had coded using his mother tongue,
>>he would have written in Swedish, not Finnish.
>>
>> /Richard
>
>
> I suspect this is now a common belief - the Linux equivalent "Finux"
> in Neal Stephenson's novel "Cryptonomicon" is Finnish.
>

presumably everybody already knows, but anyway:

finland is a bilingual country (like belgium, perhaps?). most people
living there speak finnish, but some speak swedish. mr torvalds could be
finnish, but speak/write swedish.


bengt



joachim.durchholz at web.
Posted: Mon Nov 03, 2003 12:42 pm Reply with quote
Guest
Richard A. O'Keefe wrote:
> By the way, the Unicode book spells out clear, simple, and usable rules
> for identifier syntax.

Ah, wonderful.
Do you have a URL, or a set of promising Google keywords?

Regards,
Jo



cyberlync at yahoo.com
Posted: Mon Nov 03, 2003 2:35 pm Reply with quote
Guest
> (There's also a portability issue: there are still
> EBCDIC machines
> around that don't support Unicode. I don't think
> this is relevant for
> Erlang though *g*)
>
> What machines are those? Certainly not IBM ones;
> z/Architecture has
> hardware support for Unicode. If it comes to that,
> there are probably
> still a few PDP-11s in service that only support
> ASCII. What of it?

Sure the IBM machines support Unicode, but at the
cost of doubling the size required to store your
character-based data. Most shops aren't going to go
this route; DASD on big iron is still not cheap.
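
The size claim can be checked with a few lines of Python. The sketch assumes a fixed 16-bit encoding (UTF-16) on the Unicode side and uses code page 037 as a representative EBCDIC page; both choices are assumptions made for illustration:

```python
text = "HELLO, WORLD"

ebcdic = text.encode("cp037")        # one byte per character
utf16 = text.encode("utf-16-be")     # two bytes per character, no byte order mark
utf8 = text.encode("utf-8")          # one byte per character for this text

print(len(ebcdic), len(utf16), len(utf8))    # 12 24 12
```

So the doubling holds for 16-bit storage of this kind of text; a variable-width encoding behaves differently.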

Also I wouldn't compare 390s and 400s to PDP-11s.
PDP-11s have not been produced in many years, nor have
they been supported. 390s and 400s have been
constantly supported and upgraded over
their lifetimes, and this looks to continue
indefinitely. 390s and 400s are not dead architectures
by any stretch of the imagination.



hal at vailsys.com
Posted: Mon Nov 03, 2003 2:48 pm Reply with quote
Guest
Joachim Durchholz <joachim.durchholz_at_web.de> writes:

> Richard A. O'Keefe wrote:
>> By the way, the Unicode book spells out clear, simple, and usable rules
>> for identifier syntax.
>
> Ah, wonderful.
> Do you have a URL, or a set of promising Google keywords?

I couldn't help it:

google: unicode "identifier syntax"

...

http://www.unicode.org/reports/tr31/


OTOH we could just say variables begin with capital Latin/Cyrillic
letter, katakana, zhuyin, or any hanzi with "ren" radical ... (j/k).


michael at hobbshouse.org
Posted: Mon Nov 03, 2003 4:15 pm Reply with quote
Guest
Richard A. O'Keefe said:
> My point
> is that IF you are going to do this, you had better say up front with a
> -erlang(Encoding,Version) declaration, which character set you are using
> in those comments and strings, lest they be misunderstood.

Not to sidetrack this topic even further, but I will... :-)

I've always found it interesting that the XML specification does not
explicitly specify which encoding should be used for the encoding
declaration. (e.g. "<?xml encoding='UTF-8'?>") The closest it comes to
making such a specification is the line, "it is an error for an entity
including an encoding declaration to be presented to the XML processor in
an encoding other than that named in the declaration, or for an entity
which begins with neither a Byte Order Mark nor an encoding declaration to
use an encoding other than UTF-8."

That line seems to imply that if an entity contains an encoding
declaration, then the whole entity must be encoded with that encoding.
This presents a chicken-or-egg problem in that how is an XML processor to
process an encoding declaration before it knows what the encoding is?

So, to bring the wagons back around to Erlang, if there ever is an
-erlang(Encoding, Version) declaration, it would be nice if it is clearly
stated what encoding should be used for the "-erlang(Encoding, Version)"
text.

- Michael Hobbs




joachim.durchholz at web.
Posted: Tue Nov 04, 2003 1:35 am Reply with quote
Guest
Michael Hobbs wrote:
> That line seems to imply that if an entity contains an encoding
> declaration, then the whole entity must be encoded with that encoding.
> This presents a chicken-or-egg problem in that how is an XML processor to
> process an encoding declaration before it knows what the encoding is?

The first byte of an entity is always a specific character (probably "<"
for XML).
Assuming the entity is correct, the XML processor can infer at least a
first estimate of what encoding was used, and later check it against the
encoding declarations.
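
A minimal Python sketch of that first estimate. It checks for a byte order mark, then for the ways the opening characters "<?xml" would look in a few encoding families; the XML 1.0 specification has a non-normative appendix describing autodetection along these lines. The function name and the returned labels are made up, and UTF-32 is ignored to keep the sketch short:

```python
from typing import Optional


def guess_encoding_family(data: bytes) -> Optional[str]:
    """Guess an encoding family from the first bytes of an XML entity."""
    # 1. A byte order mark settles it.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    # 2. Otherwise look at how "<?xml" itself was encoded.
    families = (
        ("utf-16-be", "utf-16-be"),
        ("utf-16-le", "utf-16-le"),
        ("cp037", "ebcdic"),             # cp037 stands in for the EBCDIC family
        ("ascii", "ascii-compatible"),   # UTF-8, Latin-1, ...: read the declaration
    )
    for codec, label in families:
        if data.startswith("<?xml".encode(codec)):
            return label
    return None                          # no declaration; XML then requires UTF-8


print(guess_encoding_family(b"<?xml version='1.0' encoding='ISO-8859-1'?>"))
print(guess_encoding_family("<?xml version='1.0'?>".encode("utf-16-le")))
```

Once the family is known, the declaration can be read in that family's common subset and the named encoding checked against the guess.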

> So, to bring the wagons back around to Erlang, if there ever is an
> -erlang(Encoding, Version) declaration, it would be nice if it is clearly
> stated what encoding should be used for the "-erlang(Encoding, Version)"
> text.

The declaration should use the same encoding as the rest of the source file.
Proceed as follows:

IF first two bytes are hex FE FF or FF FE (a byte order mark) THEN
    assume Unicode
ELSEIF first character is EBCDIC encoding of "-" THEN
    assume the intersection of all EBCDIC code pages
    parse first line
    IF it's not something like "-erlang(EBCDIC-whatever, Version)" THEN
        report error, abort compilation
    ENDIF
    load requested EBCDIC code page
ELSEIF first character is ASCII encoding of "-" THEN
    assume the intersection of all ASCII code pages (Latin-1, ...)
    parse first line
    IF it's not something like "-erlang(Encoding, Version)"
       (where "Encoding" is one of the supported ASCII code pages)
    THEN
        report error, abort compilation
    ENDIF
    load requested ASCII code page
ELSE
    assume 7-bit ASCII
ENDIF
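
The procedure above can be sketched in Python. Everything specific here is an assumption made for illustration: the `-erlang(Encoding, Version)` declaration is only a proposal from this thread, code page 037 and Latin-1 stand in for "the intersection of all EBCDIC/ASCII code pages", and "supported" is taken to mean "known to Python's codec registry". The sketch also does not check that the named code page belongs to the same family as the bootstrap one.

```python
import codecs
import re

# Hypothetical declaration syntax, as proposed in the thread.
DECLARATION = re.compile(r"-erlang\(\s*([A-Za-z0-9_.-]+)\s*,\s*([^)]*?)\s*\)")

EBCDIC_BOOTSTRAP = "cp037"     # assumed stand-in for the common EBCDIC subset
ASCII_BOOTSTRAP = "latin_1"    # assumed stand-in for the common ASCII subset


def detect_source_encoding(data: bytes) -> str:
    """Return the codec name to use for decoding the rest of the file."""
    if data[:2] in (b"\xfe\xff", b"\xff\xfe"):
        return "utf-16"                        # byte order mark: assume Unicode
    if data[:1] == "-".encode(EBCDIC_BOOTSTRAP):
        bootstrap = EBCDIC_BOOTSTRAP
    elif data[:1] == b"-":
        bootstrap = ASCII_BOOTSTRAP
    else:
        return "ascii"                         # no leading "-": assume 7-bit ASCII
    # Read the start of the file in the bootstrap code page.
    head = data[:200].decode(bootstrap)
    match = DECLARATION.match(head)
    if match is None:
        raise ValueError("leading '-' is not an -erlang(Encoding, Version) declaration")
    try:
        return codecs.lookup(match.group(1)).name   # "load" the requested code page
    except LookupError:
        raise ValueError(f"unsupported encoding: {match.group(1)}") from None


print(detect_source_encoding(b"-erlang(latin-1, 5).\n-module(m).\n"))
print(detect_source_encoding("-erlang(cp500, 5).\n".encode("cp037")))
```

As in the pseudocode, a file that starts with "-" but not with the declaration (for example a plain `-module(m).`) is rejected rather than guessed at.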

Just my 2c.

Regards,
Jo


