Erlang Mailing Lists

Author Message

<  Erlang questions mailing list  ~  data sharing is outside the semantics of Erlang, but it sure

Guest
Posted: Mon Sep 14, 2009 4:23 pm Reply with quote
Guest
I've run into several cases where enforcing the sharing of data
resulted in a significant memory savings. I'm talking about a
reduction in heap size from 60MB to under half that. By "enforcing the
sharing of data" I mean making sure that identical elements in a data
structure are actually referencing the same locations in memory.

This is easy to do in Erlang, because the compiler is very literal:

fix_tuple({H, H}) -> {H, H};
...

That ensures that identical looking elements in the tuple are sharing
memory locations. But there is absolutely no reason the compiler has
to do this. It would be perfectly valid to optimize away the entire
function, just returning the original value.

Would any existing standard library functions make this nicer? What I
really want is to have a gb_trees:insert function that returns
{NewTree, InsertedValue} where InsertedValue references existing data
(unless it wasn't already in the tree; in that case, InsertedValue is
exactly what I passed in). Then I can happily use InsertedValue,
knowing data is being shared.

James

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Post received from mailinglist
Guest
Posted: Mon Sep 14, 2009 5:54 pm Reply with quote
Guest
> really want is to have a gb_trees:insert function that...

I meant gb_sets. Something like gb_sets:add_shared(Term, Set).

James

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Post received from mailinglist
Guest
Posted: Mon Sep 14, 2009 10:45 pm Reply with quote
Guest
James Hague writes:
> I've run into several cases where enforcing the sharing of data
> resulted in a significant memory savings. I'm talking about a
> reduction in heap size from 60MB to under half that. By "enforcing the
> sharing of data" I mean making sure that identical elements in a data
> structure are actually referencing the same locations in memory.
>
> This is easy to do in Erlang, because the compiler is very literal:
>
> fix_tuple({H, H}) -> {H, H};
> ...
>
> That ensures that identical looking elements in the tuple are sharing
> memory locations. But there is absolutely no reason the compiler has
> to do this. It would be perfectly valid to optimize away the entire
> function, just returning the original value.
>
> Would any existing standard library functions make this nicer? What I
> really want is to have a gb_trees:insert function that returns
> {NewTree, InsertedValue} where InsertedValue references existing data
> (unless it wasn't already in the tree; in that case, InsertedValue is
> exactly what I passed in). Then I can happily use InsertedValue,
> knowing data is being shared.

Sounds like you want "hash consing". A hash table keeps track of all
non-immediate terms seen so far. To "intern" a new term you recurse
down to the leaves and compute hashes, on the way up you check if an
equivalent node (e.g. cons/2 or tuple/N) has been seen, and if so
you use that one otherwise you add the new node to the hash table.
Either way you return the interned node and its hash on the way up.

This may lose sharing with the input term, but interned terms will
share everything's that's sharable.

Erlang's default memory model doesn't allow same-node processes to
share memory(*), so you will lose sharing in message sends.

(*) Except in one binaries-related case which is irrelevant here.

A major downside of hash consing is that it can leak memory:
if an interned term becomes unreferenced in the application, the
hash table will still keep a master copy of it, wasting memory.
VMs with built-in support for hash consing usually also support
"weak references" or "weak hashes" where the referenced data can
be nuked if the GC determines that it should be dead.

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Post received from mailinglist
Guest
Posted: Tue Sep 15, 2009 8:31 am Reply with quote
Guest
Mikael Pettersson wrote:
>
> Erlang's default memory model doesn't allow same-node processes to
> share memory(*), so you will lose sharing in message sends.

Yes, but there are two levels of sharing here.

1. The sharing of terms across processes, as with large binaries
2. The relative sharing within the term itself.

The latter could be preserved using a sharing-preserving copy.
As this is invariably more expensive than the current copying
algorithm when there is no sharing to preserve (likely a very
common case), it is reasonable that this isn't the default.

The problem today is that you cannot make a sharing-preserving
copy between processes at the Erlang level even if your life
depended on it, and in some cases, this may indeed be the case,
figuratively speaking. Some data structures simply cannot be
passed in a message or in a spawn, since the loss of sharing
leads to a memory explosion.

One problem is of course that if a process that has such a
term crashes, and EXIT messages are propagated that contain
the term, the EXIT propagation will kill the node. Even
without loss of sharing, this can happen, of course. I did
it once by spawn-linking 100,000 processes from the shell
and then mis-spelling length(processes()). This led to an
EXIT message, containing a list of 100,000 pids, that was
copied 100,000 times. My workstation was frozen for 10
minutes while the VM was trying to find a way to cope.

No mystery there, of course, and no sharing either.
All I could do was laugh at my stupidity and take an
extra coffee break.

BR,
Ulf W
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Post received from mailinglist
Guest
Posted: Tue Sep 15, 2009 10:14 am Reply with quote
Guest
Ulf Wiger writes:
> Mikael Pettersson wrote:
> >
> > Erlang's default memory model doesn't allow same-node processes to
> > share memory(*), so you will lose sharing in message sends.
>
> Yes, but there are two levels of sharing here.
>
> 1. The sharing of terms across processes, as with large binaries
> 2. The relative sharing within the term itself.
>
> The latter could be preserved using a sharing-preserving copy.
> As this is invariably more expensive than the current copying
> algorithm when there is no sharing to preserve (likely a very
> common case), it is reasonable that this isn't the default.
>
> The problem today is that you cannot make a sharing-preserving
> copy between processes at the Erlang level even if your life
> depended on it, and in some cases, this may indeed be the case,
> figuratively speaking. Some data structures simply cannot be
> passed in a message or in a spawn, since the loss of sharing
> leads to a memory explosion.

So a sharing-preserving term_to_binary might fix the worst
problems, even if it requires some non-default option and
requires users to explicitly wrap and unwrap parts of messages?

Just trying to see if this is something worth pursuing or not.

/Mikael

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Post received from mailinglist
baryluk
Posted: Tue Sep 15, 2009 7:43 pm Reply with quote
User Joined: 05 Aug 2009 Posts: 48
Dnia 2009-09-14, pon o godzinie 15:36 -0500, James Hague pisze:
> > I am missing something here. gb_sets (nor sets, ordsets, rbsets) does not
> > make a copy of the data which is put into the set. All that is copied is
> > enough of the *tree* to insert the new element. There is no need to copy the
> > new data as it is kept within the same process. Only ets makes a copy of the
> > data.
>
> Let's say you've got a long list of strings. Many of them duplicates.
> You don't just want to remove the duplicates because that will change
> the length of the list. The goals is to ensure that identical strings
> are shared, so there's only one copy in memory. What's a practical
> way of doing that?
>
> This is irrelevant most of the time, but there are some situations
> where it's a huge win.
>
> (My solution was to build a new list by adding each element to a
> binary tree. If a string is already in the tree, return the version
> that's already there (which is not something that gb_sets does). In
> the resulting list, elements are shared as much as possible. I'm
> clearly taking advantage of how the runtime works, but it shrunk the
> heap size by tens of megabytes.)

It looks for me as quite good solution, you depend on memory saving, so
explicitly manages to have shared strings:

Any other way i can see is to have this add/retrivial from binary tree,
to be performed implicitly on every step (via some sort of hashing) of
computation to ensure that identical elements are only once in memory
(of one process). But this have really big performance problems i think,
and most of times you really don't need perfect sharing.

You can limit this "hashing" to only terms which are complex enaugh,
like long lists, or nested tuples. But still this is very costly.

If you want to have some term (or list of many terms) and shrink them,
by some magic, you can using following pseudo code:

maximize_sharing(Term) ->
Tree = tree:new(),
NewTree = iterate_over_every_subterm_and_add_first_to_tree(Term),
NewTerm = iterate_over_every_subterm_and_find_in_tree(Term, NewTree).



Lack of sharing is also problematic when sending term_to_binary data,
because term_to_binary essentially flattens term be repeatedly copying
data. It would be nice to have term_to_binary which doesn't copies data
more than once, which already is in the same term.

Ie.
> erlang:process_info(self(), heap_size).
{heap_size,1597}

> X = {2,2,2,2},
Y1 = {X,X,X,X},
Y2 = {Y1,Y1,Y1,Y1},
Y3 = {Y2,Y2,Y2,Y2},
Y4 = {Y3,Y3,Y3,Y3},
Y5 = {Y4,Y4,Y4,Y4},
Y6 = {Y5,Y5,Y5,Y5},
Y7 = {Y6,Y6,Y6,Y6},
Y8 = {Y7,Y7,Y7,Y7},
Y9 = {Y8,Y8,Y8,Y8},
ok.

> erlang:process_info(self(), heap_size).
{heap_size,4181}

>

In memory Y9 will be pretty small, and whole process memory consumption
will be small, but term_to_binary(Y9) (binary with 2MB of data, but not
counting to the heap_size), or just displaying using io_lib:format("~w",
[Y9]) (2MB of data after flattening resulting list) or sending it to
different node will be disaster.

> erlang:process_info(self(), heap_size). % after io_lib:format(), ok.
{heap_size,9707190}


I don't know how it affects sending Y9 to process on the same node.
As we know Y9 need to be copied to heap of the destination process
(because sharing heaps between process destroy soft real-time semantic
of garbage collection, but improves other performance metrics for big
messages). But is it copied smart enough?


> P = spawn(fun() -> P = receive P2 -> P2 end, io:format("~p ~n ~p ~n",
[erlang:process_info(self(), heap_size), size(P)]), ok end), P ! Y9, ok.

{heap_size,2103540}
>

Unfortunately no.

First good step will be to have version of term_to_binary which
preserves sharing of (already shared) subterms (term_to_binary have
second argument for options, like minorversion or gzip compression), and
binary_to_term which understand how to unpack them preserving sharing
(also across
multiple process in the same node, or compact storage in file/db).


--
Witold Baryluk


Post received from mailinglist
View user's profile Send private message
warezio
Posted: Tue Sep 15, 2009 9:27 pm Reply with quote
User Joined: 05 May 2007 Posts: 107 Location: Yahoo
i use erlang:term_to_binary to persist items in databases all the time, so
this thread caught my attention.

to stimulate discussion i pounded out some code that does hash consing
(hashcons:share/1) and that does encoding and decoding suitable for
composition with erlang:term_to_binary/2 (hashcons:encode/1) and
erlang:binary_to_term/1 (hashcons:decode/1). encoded terms don't
necessarily have a smaller wire footprint even in a very favorable case
(see below), but should decode into a shared data structure (as opposed to
calling hashcons:share/1 after calling erlang:binary_to_term/1, which
could create a temporary which is unacceptably huge).

attached.

-- p

% erl
Erlang (BEAM) emulator version 5.6.5 [source] [async-threads:0] [kernel-poll:false]

Eshell V5.6.5 (abort with ^G)
1> hashcons:encode (lists:duplicate (10, wazzup)).
{[0,0,0,0,0,0,0,0,0,0],[{0,wazzup}]}
2> size (erlang:term_to_binary (hashcons:encode (lists:duplicate (10, wazzup)), [ compressed ])).
35
3> size (erlang:term_to_binary (lists:duplicate (10, wazzup), [ compressed ])).
31
4> % not always smaller
4> size (erlang:term_to_binary (hashcons:encode (lists:duplicate (100, wazzup)), [ compressed ])).
38
5> size (erlang:term_to_binary (lists:duplicate (100, wazzup), [ compressed ])).
37


Post received from mailinglist
View user's profile Send private message Yahoo Messenger
warezio
Posted: Tue Sep 15, 2009 9:36 pm Reply with quote
User Joined: 05 May 2007 Posts: 107 Location: Yahoo
It still passes the test either way (ugh, my test sucks), but I think this
is a bugfix:

-- p

--- hashcons.erl.orig 2009-09-15 14:29:30.000000000 -0700
+++ hashcons.erl 2009-09-15 14:29:32.000000000 -0700
@@ -53,7 +53,7 @@
Ref = dict:size (NewDict),
{ Ref,
NewTerm,
- dict:store (NewTerm, { Ref, NewTerm }, NewDict),
+ dict:store (Term, { Ref, NewTerm }, NewDict),
dict:store (Ref, NewTerm, NewCodec)
}
end.


On Tue, 15 Sep 2009, Paul Mineiro wrote:

> i use erlang:term_to_binary to persist items in databases all the time, so
> this thread caught my attention.
>
> to stimulate discussion i pounded out some code that does hash consing
> (hashcons:share/1) and that does encoding and decoding suitable for
> composition with erlang:term_to_binary/2 (hashcons:encode/1) and
> erlang:binary_to_term/1 (hashcons:decode/1). encoded terms don't
> necessarily have a smaller wire footprint even in a very favorable case
> (see below), but should decode into a shared data structure (as opposed to
> calling hashcons:share/1 after calling erlang:binary_to_term/1, which
> could create a temporary which is unacceptably huge).
>
> attached.
>
> -- p
>
> % erl
> Erlang (BEAM) emulator version 5.6.5 [source] [async-threads:0] [kernel-poll:false]
>
> Eshell V5.6.5 (abort with ^G)
> 1> hashcons:encode (lists:duplicate (10, wazzup)).
> {[0,0,0,0,0,0,0,0,0,0],[{0,wazzup}]}
> 2> size (erlang:term_to_binary (hashcons:encode (lists:duplicate (10, wazzup)), [ compressed ])).
> 35
> 3> size (erlang:term_to_binary (lists:duplicate (10, wazzup), [ compressed ])).
> 31
> 4> % not always smaller
> 4> size (erlang:term_to_binary (hashcons:encode (lists:duplicate (100, wazzup)), [ compressed ])).
> 38
> 5> size (erlang:term_to_binary (lists:duplicate (100, wazzup), [ compressed ])).
> 37
>


________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Post received from mailinglist
View user's profile Send private message Yahoo Messenger
Michael Turner
Posted: Wed Sep 16, 2009 6:22 am Reply with quote
User Joined: 29 Jul 2009 Posts: 27 Location: Tokyo
I'm a curious about how the subject line on this thread, which seems to
me too-easily generalized from the very specific problem James brings
up: saving space when you have a very long list of strings, with some
strings repeated.

If you're representing items of data of any kind in a very long list, I
can only assume it's because the cost of linear access is a non-issue
for your application. If you want to save space by storing only once
the repeated elements in a long, mostly-serial-access list, well, that
reminds of very much of the very general idea of "data compression",
which is also usually used when linear-search access is not much of an
issue, when memory is an issue, and when the data features significant
repetition.

So why not just use compression, if saving space is an important goal and
reducing random-access time is not?

Not sure that I'm totally sold on the Erlang Way of doing things, being
pretty new to the language. But in this particular case (or even for
generalizations of it), I don't see why the Erlang Way (insofar as I
understand it) is necessarily inferior to anything requiring that the
language break with its general shared-nothing approach. Did I miss
something?

-michael turner



On 9/15/2009, "James Hague" <james.hague@gmail.com> wrote:

>> Sounds like you want "hash consing".
>
>Hash consing is a heavyweight solution. It's got a fixed cost for
>something that's usually irrelevant. What I really want is a function
>that takes a data structure and returns a new version with maximal
>sharing. I can write special case versions of that in Erlang, but
>it's messy and feels like something that should be a general library
>service.
>
>________________________________________________________________
>erlang-questions mailing list. See http://www.erlang.org/faq.html
>erlang-questions (at) erlang.org
>
>

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Post received from mailinglist
View user's profile Send private message Send e-mail
Guest
Posted: Wed Sep 16, 2009 8:41 am Reply with quote
Guest
Michael Turner wrote:
>
> Not sure that I'm totally sold on the Erlang Way of doing things, being
> pretty new to the language. But in this particular case (or even for
> generalizations of it), I don't see why the Erlang Way (insofar as I
> understand it) is necessarily inferior to anything requiring that the
> language break with its general shared-nothing approach. Did I miss
> something?

Again, two kinds of sharing here.

'The share-nothing approach' refers to the fact that
there are no shared data structures between processes
(conceptually, as the VM certainly makes use of the
possibility to share stuff anytime it can do so safely.)

The other type of sharing is the implicit sharing /within
an Erlang term/. Consider:

X = lists:seq(1,1000),
Y = {X,X,X,X,X,X,X,X}.

The runtime system will only allocate 2 words for the
tuple + 8 words for the pointers to X. This is
an extremely efficient way of building what's conceptually
a structure of 8000 integers. This is used deliberately by
certain modules. QuickCheck, for example, builds
intricate structures that rely extensively on this type
of sharing. Passing such a structure to another process,
either by spawn or message passing, is generally fatal.

http://erlang.org/pipermail/erlang-bugs/2007-November/000488.html

(This post also contains a nice little program that could
be used to benchmark different solutions.)

I encountered this problem once in a very subtle way,
where I had deliberately relied on implicit sharing
in the state, but happened to pass the entire state
by mistake inside a message.

http://www.erlang.org/pipermail/erlang-questions/2005-November/017924.html

The bug existed for a while, but wasn't noticeable as long
as everything was on the process heap, since the deeply
recursive data structure didn't take up much space.

BR,
Ulf W
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Post received from mailinglist
Michael Turner
Posted: Wed Sep 16, 2009 2:01 pm Reply with quote
User Joined: 29 Jul 2009 Posts: 27 Location: Tokyo
On 9/16/2009, "Ulf Wiger" <ulf.wiger@erlang-consulting.com> wrote:

>Michael Turner wrote:
>>
>> Not sure that I'm totally sold on the Erlang Way of doing things, being
>> pretty new to the language. But in this particular case (or even for
>> generalizations of it), I don't see why the Erlang Way (insofar as I
>> understand it) is necessarily inferior to anything requiring that the
>> language break with its general shared-nothing approach. Did I miss
>> something?
>
>Again, two kinds of sharing here.

[Snip]

Interesting answer, but not an answer to my real question. (Maybe I went
on for a paragraph too long.)

It seems that James is interested in data structure sharing IN THIS
PARTICULAR CASE because it gives him a kind of data compression. Again,
if that's the goal, why is ordinary data compression *not* the answer
here? If compression isn't the goal, what did I miss about the
(implicit) problem statement, to which sharing semantics in Erlang *is*
the answer?

-michael turner


________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Post received from mailinglist
View user's profile Send private message Send e-mail
Guest
Posted: Thu Sep 17, 2009 4:07 pm Reply with quote
Guest
James Hague wrote:
> I've run into several cases where enforcing the sharing of data
> resulted in a significant memory savings.

BTW, here is an example from OTP, written by Robert Virding,
no less, so it has to be an example of the type of code
that for which Erlang was originally intended. Smile

%% expand_segs(Segs, EmptySeg) -> NewSegs.
%% contract_segs(Segs) -> NewSegs.
%% Expand/contract the segment tuple by doubling/halving the number
%% of segments. We special case the powers of 2 upto 32, this should
%% catch most case. N.B. the last element in the segments tuple is
%% an extra element containing a default empty segment.
expand_segs({B1}, Empty) ->
{B1,Empty};
expand_segs({B1,B2}, Empty) ->
{B1,B2,Empty,Empty};
expand_segs({B1,B2,B3,B4}, Empty) ->
{B1,B2,B3,B4,Empty,Empty,Empty,Empty};
expand_segs({B1,B2,B3,B4,B5,B6,B7,B8}, Empty) ->
{B1,B2,B3,B4,B5,B6,B7,B8,
Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty};
expand_segs({B1,B2,B3,B4,B5,B6,B7,B8,B9,B10,B11,B12,B13,B14,B15,B16},
Empty) ->
{B1,B2,B3,B4,B5,B6,B7,B8,B9,B10,B11,B12,B13,B14,B15,B16,
Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty,
Empty,Empty,Empty,Empty,Empty,Empty,Empty,Empty};
expand_segs(Segs, Empty) ->
list_to_tuple(tuple_to_list(Segs)
++ lists:duplicate(tuple_size(Segs), Empty)).


It is from dict.erl, and used to expand the hash table.

Very deliberate use of sharing.

BR,
Ulf W
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Post received from mailinglist
wuji
Posted: Mon Aug 27, 2012 2:59 am Reply with quote
User Joined: 10 Aug 2012 Posts: 654
I can't go back into my house," one angry homeowner homeowner [h4]replica designer *beep*[/h4] homeowner said. "I looked this up. I'm not flying off
cuff here. I have rights as a property owner."Evacuees were were [h3]designer replica *beep*[/h3] were called to a meeting at the University of Colorado
Anne Marie Borrego of the Red Cross said a list list [h4]cheap Ralph Lauren Polo[/h4] list was passed around with the homes that were damaged
destroyed."It was a really tough meeting," Borrego said. "I have have cheap Ralph Lauren have to say just sitting there, I kept thinking to
how difficult it must be to wait for that list list [h4]cheap designer *beep*[/h4] list to actually be handed out to look for your
View user's profile Send private message

Display posts from previous:  

All times are GMT
Page 1 of 1
This forum is locked: you cannot post, reply to, or edit topics.

Jump to:  

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You cannot download files in this forum