Erlang Mailing Lists

<  Erlang questions mailing list  ~  Language change proposal

michael at hobbshouse.org
Posted: Tue Nov 04, 2003 4:28 pm
Joachim Durchholz said:
> Michael Hobbs wrote:
>> That line seems to imply that if an entity contains an encoding
>> declaration, then the whole entity must be encoded with that encoding.
>> This presents a chicken-or-egg problem in that how is an XML processor
>> to process an encoding declaration before it knows what the encoding
>> is?
>
> The first byte of an entity is always a specific character (probably "<"
> for XML).
> Assuming the entity is correct, the XML processor can infer at least a
> first estimate of what encoding was used, and later check it against the
> encoding declarations.

Okay, before you wrote this, I hadn't realized that every character
encoding (besides UTF-16 and EBCDIC) is a superset of ASCII. I had
ass-u-me-d that there are some character encodings that have a wildly
different binary representation, like ASCII vs. EBCDIC. After doing some
searching though, I have discovered that every character encoding that I
could find uses the same standard Latin letters and symbols for the bytes
between 0x20 - 0x7F.
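
One way to check this observation is to decode the printable ASCII bytes under several codecs and compare the results. The sketch below is illustrative only: it relies on the codecs bundled with Python, the helper name is_ascii_superset is made up for this example, and it tests 0x20 - 0x7E so that the DEL byte is left out.

```python
# Printable ASCII bytes, 0x20 through 0x7E inclusive.
PRINTABLE = bytes(range(0x20, 0x7F))


def is_ascii_superset(codec: str) -> bool:
    """Return True if `codec` decodes printable ASCII bytes exactly as ASCII does."""
    try:
        return PRINTABLE.decode(codec) == PRINTABLE.decode("ascii")
    except UnicodeDecodeError:
        # The byte sequence is not even valid in this codec.
        return False


if __name__ == "__main__":
    # cp037 is one EBCDIC code page; the others are common 8-bit or UTF encodings.
    for name in ("latin-1", "utf-8", "koi8-r", "cp1252", "cp037", "utf-16"):
        print(name, is_ascii_superset(name))
```

Under these assumptions, the ASCII-derived codecs should pass the check, while the EBCDIC code page and UTF-16 should not.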

The world is a little less chaotic than I had thought,
- Michael Hobbs





Post generated using Mail2Forum (http://m2f.sourceforge.net)
joachim.durchholz at web.
Posted: Tue Nov 04, 2003 7:05 pm
Michael Hobbs wrote:

> Joachim Durchholz said:
>
>>Michael Hobbs wrote:
>>
>>>That line seems to imply that if an entity contains an encoding
>>>declaration, then the whole entity must be encoded with that encoding.
>>>This presents a chicken-or-egg problem in that how is an XML processor
>>>to process an encoding declaration before it knows what the encoding
>>>is?
>>
>>The first byte of an entity is always a specific character (probably "<"
>> for XML).
>>Assuming the entity is correct, the XML processor can infer at least a
>>first estimate of what encoding was used, and later check it against the
>> encoding declarations.
>
>
> Okay, before you wrote this, I hadn't realized that every character
> encoding (besides UTF-16 and EBCDIC) is a superset of ASCII.

Oh, there /are/ character sets that vary wildly. The Leibniz Computer
Center in Munich had several CDC computers, which sported non-standard
word sizes (48-bit words), non-standard character sets (6-bit; A-Z are
octal codes 01-32, 0-9 are octal 33-44), and a raw computing power that exceeded the
best IBM machines by a factor of ten, and stayed the fastest machine on
the market until about 1960 (when it was outdone by its own chief
engineer who had founded his own company, Cray Research *g*).

You can find this and other CDC-related character sets at
http://www.informatik.uni-hamburg.de/RZ/software/gnu/utilities/recode_9.html
if you're interested :-)
Actually, the "recode" tool still understands these character sets,
supposedly because recode originated on a CDC :-)

These encodings are more of historical interest than anything else,
though. I'm pretty sure that only a few machines with truly non-ASCII
non-EBCDIC encodings exist, and that even fewer would call for an Erlang
port...

Regards,
Jo



ok at cs.otago.ac.nz
Posted: Wed Nov 05, 2003 1:25 am
I wrote:
> By the way, the Unicode book spells out clear, simple, and usable rules
> for identifier syntax.

Joachim Durchholz <joachim.durchholz_at_web.de> replied:
> Ah, wonderful.
> Do you have a URL, or a set of promising Google keywords?

Well, it doesn't take the brain of a Feynman to figure out that
the Unicode book is the best place to look, or failing that, www.unicode.org.

In fact it's Section 5.15 "Identifiers" in the Unicode 4.0 book,
and a draft replacement for that section can be found in
http://www.unicode.org/reports/tr31/

"The formal syntax provided here is intended to capture the general
intent that an identifier consists of a string of characters that
begins with a letter or an ideograph, and then includes any number
of letters, ideographs, digits, or underscores. Each programming
language standard has its own identifier syntax; different
programming languages have different conventions for the use of
certain characters from the ASCII range ($, @, #, _) in identifiers.
To extend such a syntax to cover the full behavior of a Unicode
implementation, implementers need only combine these specific rules
with the sample syntax provided here.

Syntactic Rule

<identifier> := <identifier_start>
(<identifier_start> | <identifier_extend>)* "

Since Erlang _doesn't_ use anything other than letters, digits, and
underscores, the Unicode rules would apply exactly.

There are some subtleties to all this concerning normalisation
and the non-breaking format characters, but once you've figured out how
to represent a classification scheme for over a million characters
economically (not, actually, all that hard), the rest is easy.
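
As a rough illustration of the syntactic rule quoted above, here is a minimal sketch that classifies characters by Unicode general category. The category sets are an assumption (letters and letter numbers for identifier_start; combining marks, decimal digits and connector punctuation for identifier_extend), the function names are invented for this example, and normalisation and format characters are ignored entirely.

```python
import unicodedata

# Assumed approximation of the identifier classes; not taken from the thread.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}


def is_identifier_start(ch: str) -> bool:
    """Letters, ideographs and letter-like numbers may start an identifier."""
    return unicodedata.category(ch) in START_CATEGORIES


def is_identifier_extend(ch: str) -> bool:
    """Combining marks, digits and connectors (such as '_') may continue one."""
    return unicodedata.category(ch) in EXTEND_CATEGORIES


def is_identifier(text: str) -> bool:
    """<identifier> := <identifier_start> (<identifier_start> | <identifier_extend>)*"""
    if not text or not is_identifier_start(text[0]):
        return False
    return all(is_identifier_start(c) or is_identifier_extend(c) for c in text[1:])
```

Erlang's own distinction between variables (leading upper-case letter or underscore) and atoms (leading lower-case letter) would have to be layered on top of a check like this.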


ok at cs.otago.ac.nz
Posted: Wed Nov 05, 2003 1:46 am
Eric Merritt <cyberlync_at_yahoo.com> replied:
> Sure the IBM machines support unicode but at the
> cost of doubling the size required to store your
> character based data.

This claim is quite untrue.

First off, for the people who really REALLY need Unicode,
they were going to be using 16 bits per character anyway.
Their storage costs don't go up at all. As I believe I've
mentioned, IBM have supported "DBCS" (Double-Byte Character
Sets) for decades.

Second, in addition to UTF-8, which is good for ASCII, there is
Unicode Technical Report 6, which describes a compressed storage
format for Unicode which can handle Latin 1 with *no* expansion,
several other 8-bit schemes with 1 byte of overhead, and CJK
strings also with 1 byte of overhead, no matter what the length
of the string.

Typically what you do is store text in some compressed form on
disc, unpack it if and only if you are going to do some processing,
and then repack on the way out.
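
To put rough numbers on the storage argument, the sketch below measures how many bytes a string takes in a few encodings. It is only an illustration: the helper name encoded_sizes is hypothetical, and the compressed format from Technical Report 6 is not shown because Python's standard library does not ship a codec for it.

```python
from typing import Dict, Optional, Tuple


def encoded_sizes(
    text: str, codecs: Tuple[str, ...] = ("latin-1", "utf-8", "utf-16-be")
) -> Dict[str, Optional[int]]:
    """Map each codec name to the encoded byte length of `text`.

    A value of None means the text cannot be represented in that codec.
    """
    sizes: Dict[str, Optional[int]] = {}
    for codec in codecs:
        try:
            sizes[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            sizes[codec] = None
    return sizes


if __name__ == "__main__":
    for sample in ("hello", "Grüße", "日本語"):
        print(sample, encoded_sizes(sample))
```

Plain ASCII text only doubles in size if it is stored as 16-bit units. In UTF-8 it stays the same size, while the CJK sample is smaller in the 16-bit form than in UTF-8.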

> 390s and 400s are not dead architectures
> by any stretch of the imagination.

Someone who knows that the current 64-bit "360" architecture is
called z/Architecture clearly *knows* that; as does someone who
has read the z/Architecture Principles of Operation closely enough
to know about the Unicode support instructions.



ok at cs.otago.ac.nz
Posted: Wed Nov 05, 2003 2:19 am
I proposed
-erlang(Encoding, Version).

"Michael Hobbs" <michael_at_hobbshouse.org> came up with the obvious
question:
> This presents a chicken-or-egg problem in that how is an XML processor to
> process an encoding declaration before it knows what the encoding is?

The XML specification spells this out in as much detail as one could
possibly wish.

document ::= prolog element Misc*
prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'

Either an XML document has an XML declaration or it doesn't.
If it doesn't, the encoding must be UTF-8 (or, maybe UTF-16 with
a Byte Order Mark).
If it does have an XML declaration, then the first 5 characters
must literally be '<?xml'; no white space is permitted before
the XML declaration.

Appendix F of the XML specification is non-normative, but it explains
how you can automatically detect the encoding. Not perfectly. But
well enough to read the encoding declaration. You see, if there is
an XML declaration, then the first character MUST be '<', and whatever
the document begins with must be an encoding of that character.
So it's quite easy to distinguish between
UCS-4 (big-endian)
UCS-4 (little-endian)
UCS-4 (nuxi order)
UCS-2 (big-endian)
UCS-2 (little-endian)
some version or extension of ISO 646 (ASCII family)
some version of EBCDIC

The only characters that may appear in an XML declaration are
< ? > ' " =
space, tab, cr, lf,
a-z A-Z 0-9
- _ . :
and these are all in the invariant part of ISO 646. I'm not sure whether
"_" has the same encoding in every version of EBCDIC, but anything you
find in an encoding which is _not_ a letter, digit, hyphen, dot, or colon
may be assumed to be an underscore.
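
The detection step described above can be sketched as a lookup on the first four bytes of the document. This is a minimal sketch for the case where an XML declaration is present and there is no byte order mark. The byte patterns are the encodings of '<?xm' (or of '<' alone for UCS-4) as I read Appendix F, and the family labels and the name guess_xml_family are invented for this example.

```python
from typing import Optional

# (leading bytes, encoding family); patterns assume the text starts with '<?xml'.
_PATTERNS = [
    (b"\x00\x00\x00\x3c", "ucs-4-be"),
    (b"\x3c\x00\x00\x00", "ucs-4-le"),
    (b"\x00\x00\x3c\x00", "ucs-4-unusual-order"),
    (b"\x00\x3c\x00\x00", "ucs-4-unusual-order"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),
    (b"\x3c\x00\x3f\x00", "utf-16-le"),
    (b"\x3c\x3f\x78\x6d", "ascii-family"),
    (b"\x4c\x6f\xa7\x94", "ebcdic"),
]


def guess_xml_family(head: bytes) -> Optional[str]:
    """Guess the encoding family from the first bytes of an XML entity.

    Returns None when no pattern matches, e.g. when there is no XML declaration.
    """
    for prefix, family in _PATTERNS:
        if head.startswith(prefix):
            return family
    return None
```

The result is only a family, not a final answer. It is enough to read the declaration itself, after which the declared encoding name can be checked against the guess.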

The following encoding names are defined by XML:
UTF-8
UTF-16
ISO-10646-UCS-2
ISO-10646-UCS-4
ISO-8859-1 ... ISO-8859-9 (presumably this should go up to ISO-8859-15)
ISO-2022-JP
Shift_JIS
EUC-JP
with a recommendation that other encoding names be taken from the
IANA registry, and matching should ignore case.

Since an Erlang source file would either literally begin with an -erlang
declaration or else not have one at all, we could pull exactly the same
kind of auto-detection trick, looking for a "-" instead of "<". To
better fit Erlang syntax, we'd convert the XML/IANA names to lower case
and replace '-' by '_', so
-erlang(iso_8859_1, [10,3,1]).
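
A rough sketch of how that might look, purely as an illustration of the proposal: the -erlang declaration is hypothetical, as are both function names below. The first helper applies the lower-case and '-' to '_' renaming, and the second guesses the encoding family from how the leading "-" is encoded.

```python
from typing import Optional


def erlang_encoding_atom(iana_name: str) -> str:
    """Turn an XML/IANA encoding name into the proposed atom spelling."""
    return iana_name.lower().replace("-", "_")


def sniff_erlang_family(head: bytes) -> Optional[str]:
    """Guess the encoding family of a file that starts with '-erlang('.

    Only the leading '-' is inspected: 0x2D in the ASCII family, 0x60 in EBCDIC.
    Wider encodings are tested first because their patterns also begin with
    0x00 or 0x2D.
    """
    if head.startswith(b"\x00\x00\x00\x2d"):
        return "ucs-4-be"
    if head.startswith(b"\x2d\x00\x00\x00"):
        return "ucs-4-le"
    if head.startswith(b"\x00\x2d"):
        return "utf-16-be"
    if head.startswith(b"\x2d\x00"):
        return "utf-16-le"
    if head.startswith(b"\x2d"):
        return "ascii-family"
    if head.startswith(b"\x60"):
        return "ebcdic"
    return None
```

As with XML, the guess only needs to be good enough to read the declaration. The atom found there would then name the exact encoding.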

> So, to bring the wagons back around to Erlang, if there ever is
> an -erlang(Encoding, Version) declaration, it would be nice if
> it is clearly stated what encoding should be used for the
> "-erlang(Encoding, Version)" text.

The same as the encoding used for the rest of the file, of course.
Just exactly like XML. (People do actually _read_ the XML specification
before spouting about it, don't they?)



cyberlync at yahoo.com
Posted: Wed Nov 05, 2003 3:06 am
> This claim is quite untrue.
>
> First off, for the people who really REALLY need
> Unicode,
> they were going to be using 16 bits per character
> anyway.
> Their storage costs don't go up at all. As I
> believe I've
> mentioned, IBM have supported "DBCS" (Double-Byte
> Character
> Sets) for decades.

You are right, but the person I was responding to (I
forget who now) implied that ebcdic should simply be
replaced with unicode where erlang could be used. This
implies that it would force shops that do not currently use
unicode to use it.

> Typically what you do is store text in some
> compressed form on
> disc, unpack it if and only if you are going to do
> some processing,
> and then repack on the way out.

I am not familiar enough with the admin side to
verify this either way.


> 390s and 400s are not dead architectures
> by any stretch of the imagination.
>
> Someone who knows that the current 64-bit "360"
> architecture is
> called z/Architecture clearly *knows* that; as does
> someone who
> has read the z/Architecture Principles of Operation
> closely enough
> to know about the Unicode support instructions.

The 390s and 400s do have new names, but I am not aware
of too many people who use these names in day to day
speech. In fact, IBM has just named a new laptop line
the iSeries as well; if you call for support and use
iSeries instead of 400 you will most likely be
forwarded to support for the laptop (we have had this
experience several times). So yes, the terminology I
use is somewhat out of date, but there is a reason I
use it. As an aside, both the 390 and 400 are greater
than 64 bit architectures (though I forget which now, I
think 128 but I could be wrong).

In any case, we are very very off topic ;)




