Erlang Mailing Lists

Erlang questions mailing list ~ Strings (was: are Mnesia tables immutable?)

headspin at gmail.com
Posted: Sun Jul 02, 2006 12:14 am
Oh, by the way, UTF stands for Unicode *Transformation* Format, not
transmission...
http://www.unicode.org/faq/utf_bom.html#14

I guess we all miss a bit or two in the docs...

--
Didier

On 6/29/06, Richard A. O'Keefe <ok@cs.otago.ac.nz> wrote:

> True, there are all sorts of good things about UTF-8. It's really cool
> that modern systems come with UTF-8 locales set by default so I can type
> practically _anything_ in TextEdit. BUT it's a *Transmission* format,
> that's what the "T" and "F" stand for. It was never designed to be
> used for serious *processing*
Post generated from www.trapexit.org
Guest
Posted: Wed Jul 05, 2006 1:35 pm
I have been reading this discussion off-line so I have not been able to
reply quickly. As I see it we are discussing (at least) 3 different
types of strings:

1) internal mutable strings, strings we want to modify and work on (yes,
I KNOW Erlang doesn't really have mutable datatypes :-)
2) internal immutable strings, strings we don't want to modify
3) external representation of strings, term_to_binary

Unfortunately we are using the same name for these different things.

Before I go on I would like to point out that the convention of strings
being lists of integers >= 0, =< 255 originated in the code for fwrite
~p because I needed an easy way to decide when a list should be printed
as a string. Just to be helpful. Nothing more. I know because I wrote
the code and "invented" the convention. I am guessing that it became
part of binary encoding because such lists could be encoded efficiently
and were useful for strings. But that doesn't mean that strings must only
contain small integers.
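
A minimal sketch of the convention described above, not the actual io_lib code: a term is treated as a string for printing when it is a proper list of integers in 0..255. The module and function names are hypothetical, and the real check used by ~p may be stricter about which characters count as printable.

```erlang
-module(string_guess).
-export([looks_like_string/1]).

%% Hypothetical helper: true when Term is a proper list whose elements
%% are all integers in the range 0..255 (the convention from the post).
%% The empty list also returns true here, although ~p prints it as [].
looks_like_string([C | Rest]) when is_integer(C), C >= 0, C =< 255 ->
    looks_like_string(Rest);
looks_like_string([]) ->
    true;
looks_like_string(_) ->
    %% Non-lists, improper tails, and out-of-range elements end up here.
    false.
```

With this check, "abc" passes while [8364] (a single code point above 255) does not, even though both are perfectly good lists of characters.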

I think some people are putting WAY too much significance into a trivial
convention.

That being said some comments on representation:

1. Internal mutable strings. Sorry, I can't for the life of me
understand why they should be represented as anything else other than
one unicode character per list element. Easy to work with, backwards
compatible (which I NEVER worried about before, ask Joe) and relatively
efficient. Anything else at this level would be a serious pain in the arse.

2. Internal immutable strings. I am wondering if we really need to fix
this. These strings are VERY application dependent and the application
definitely knows what it needs in the way of encoding. Store them as
binaries and provide some libraries for converting between list strings
which can handle everything, and various encodings in the binaries.
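
For illustration only: the unicode module, which was added to OTP after this thread, is one example of such a conversion library. A minimal sketch, assuming UTF-8 as the binary encoding; the module and function names (encoding_demo, to_utf8/1, from_utf8/1) are hypothetical.

```erlang
-module(encoding_demo).
-export([to_utf8/1, from_utf8/1]).

%% List string (one Unicode code point per element) -> UTF-8 binary.
to_utf8(Chars) when is_list(Chars) ->
    unicode:characters_to_binary(Chars).

%% UTF-8 binary -> list string with one code point per element.
from_utf8(Bin) when is_binary(Bin) ->
    unicode:characters_to_list(Bin).
```

For example, to_utf8([8364]) gives <<226,130,172>> (the euro sign as three UTF-8 bytes), and from_utf8/1 turns those bytes back into [8364].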

3. External representation. In one respect I don't really see the
problem here, if you solve 1&2 then this problem goes away. What I want
from an external representation is that I get back out of it what I put
into it! Nothing more, nothing less! I have chosen the representation so
I don't want "help" in converting it. If I have a list of integers in
then I want the SAME list of integers out, if I have a binary string in
I want the SAME binary string out. If term_to_binary detects that my
list consists of only 8/16/24/32 bit integers and smart-codes that then
fine, as long as I get it back the same way.
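
The round-trip requirement can be stated as a one-line check. This is a sketch with a hypothetical module and function name; it only uses term_to_binary/1 and binary_to_term/1.

```erlang
-module(roundtrip_demo).
-export([roundtrips/1]).

%% True when decoding the external form yields exactly the same term,
%% whether it went in as a list of integers or as a binary.
roundtrips(Term) ->
    Term =:= binary_to_term(term_to_binary(Term)).
```

Both roundtrips([8364, 960]) and roundtrips(<<226,130,172>>) return true: the list comes back as a list and the binary as a binary.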

I am definitely not an expert on Unicode so I may have missed something
important. But keep it simple so the programmer knows what is happening
and can work with that. KISS principle.

My main worry is that if you start baking in hard-wired solutions into the
systems then you a) will get it wrong, b) make a lot of people unhappy
because you made the wrong choice and c) make the system bigger and
harder to maintain. Provide libraries and credit the programmer with
some intelligence in making their own choices as to what they need.

As you may understand I am definitely for having different
representations depending on what you are doing. One-size does
definitely NOT fit all.

Robert


Guest
Posted: Wed Jul 05, 2006 2:52 pm
On 7/5/06, Robert Virding <robert.virding@telia.com> wrote:
> I want the SAME binary string out. If term_to_binary detects that my
> list consists of only 8/16/24/32 bit integers and smart-codes that then
> fine, as long as I get it back the same way.

There is a term_to_binary(Term, Options) that allows compressed
external representations.
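
A small sketch of that existing option, assuming the [compressed] form of Options; the module and function names are hypothetical. It compares the encoded sizes and checks that the compressed form still decodes to the original term.

```erlang
-module(compressed_demo).
-export([sizes/1]).

%% Returns {PlainSize, CompressedSize} in bytes for the external
%% representation of Term, after checking that the compressed form
%% decodes back to the same term (crashes with badmatch otherwise).
sizes(Term) ->
    Plain = term_to_binary(Term),
    Packed = term_to_binary(Term, [compressed]),
    Term = binary_to_term(Packed),
    {byte_size(Plain), byte_size(Packed)}.
```

How much is saved depends entirely on the term; small or irregular terms may not shrink at all.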

So on the same path: how about adding different levels of compression?
term_to_binary(Term, {compressed, text}) that performs cdr coding for
various bit-length integers, but avoids zlib compression? It saves some
memory but doesn't make binary_to_term() cost anything. It could
optionally be done automatically on everything inserted in an ets table.

I don't experience problems with high memory use myself, so what are
people working with that do have real memory consumption problems?

Would it feel crufty to have an ets table around just to be able to
store large amounts of text compactly in memory?
Thomas Lindgren
Posted: Thu Jul 06, 2006 11:00 am
--- Robert Virding <robert.virding@telia.com> wrote:

> [lots of good stuff]

I mainly agree with your points. My only caveat here
is that strings with an implicit encoding run the
risk of that encoding being inadvertently forgotten --
for example, when storing them in a database, changing
maintainers/developers, changing specification
version, sending strings between nodes, ...

But that problem can perhaps (and hopefully) be solved
with appropriate libraries, rather than some sort of
low-level hacking.

Best,
Thomas


All times are GMT