Erlang Mailing Lists

RabbitMQ mailing list: intermittent erl.exe crash

Posted: Wed Nov 04, 2009 5:23 pm
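I'm running a Rabbitmq cluster on two Windows Server 2008 machines and am having an issue with erl.exe crashing occasionally, which sometimes corrupts the Rabbitmq database forcing me to delete it. Here is the output from Windows event viewer:

Faulting application erl.exe, version 0.0.0.0, time stamp 0x491190a3, faulting module beam.smp.dll, version 0.0.0.0, time stamp 0x49118fbd, exception code 0x40000015, fault offset 0x00010831, process id 0x6e0, application start time 0x01ca5ce1246bff61.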

Posted: Thu Nov 05, 2009 9:43 am
JD,

JD Conley wrote:
> I'm running a Rabbitmq cluster on two Windows Server 2008 machines and
> am having an issue with erl.exe crashing occasionally, which sometimes
> corrupts the Rabbitmq database forcing me to delete it. Here is the
> output from Windows event viewer:
>
> Faulting application erl.exe, version 0.0.0.0, time stamp 0x491190a3,
> faulting module beam.smp.dll, version 0.0.0.0, time stamp 0x49118fbd,
> exception code 0x40000015, fault offset 0x00010831, process id 0x6e0,
> application start time 0x01ca5ce1246bff61.

That error doesn't tell us much :(

Did Erlang write an erl_crash.dump file? If so, that should provide some
clues as to the cause.

Also, is there anything unusual at all in the rabbit logs?


Regards,

Matthias.

_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
Post received from mailinglist

Posted: Thu Nov 05, 2009 4:32 pm
> > Faulting application erl.exe, version 0.0.0.0, time stamp 0x491190a3,
> > faulting module beam.smp.dll, version 0.0.0.0, time stamp 0x49118fbd,
> > exception code 0x40000015, fault offset 0x00010831, process id 0x6e0,
> > application start time 0x01ca5ce1246bff61.
>
> That error doesn't tell us much :(
>
> Did Erlang write an erl_crash.dump file? If so, that should provide some
> clues as to the cause.

I'm an Erlang newbie, so forgive the ignorance. Where would I find the dump
file on Windows? I haven't changed any of the default configuration -- other
than the setup required to run a cluster. I searched under the Rabbitmq
directory for the file but nothing was there. Is there any configuration I
can do to try to get more information?

> Also, is there anything unusual at all in the rabbit logs?

I don't know. The one thing I notice is Rabbitmq seems to get very upset and
disconnect the client when we try to unbind a non-existent binding. You can
see the log file on pastebin: http://pastebin.com/m614be2bb

Thank you,
JD


Posted: Thu Nov 05, 2009 4:52 pm
JD,

JD Conley wrote:
> Where would I find the dump file on Windows?

By default the Windows service setup of rabbitmq configures Erlang to
place the erl_crash.dump file in the same place as the rabbit logs.
Well, that's what the code in rabbitmq-service.bat purports to do - no
idea whether anything actually pays attention to that setting.

> Is there any configuration I can do to try to get more information?

If you can, try running the server directly rather than starting it as a
service. Then, if it crashes, a) you may see a message in the console
where you started it, and b) the erl_crash.dump should get written to
the dir where you started the server.

> The one thing I notice is Rabbitmq seems to get very upset and
> disconnect the client when we try to unbind a non-existent binding.

It doesn't really get upset about it, it just logs it as an error and
closes the offending connection with an appropriate error message.

> You can see the log file on pastebin: http://pastebin.com/m614be2bb

Is there anything in the sasl log?


Regards,

Matthias.

Posted: Thu Nov 05, 2009 5:15 pm
> By default the Windows service setup of rabbitmq configures Erlang to
> place the erl_crash.dump file in the same place as the rabbit logs.

It's not there. I did some file system searching in %appdata% as well, with
no luck. So, I guess it didn't write one.

> If you can, try running the server directly rather than starting it as a
> service.

I have the broker running interactively from a cmd prompt. I'll let you know
when it crashes. I hope it crashes. It would be annoying if it didn't. :)
The big difference with the interactive one is it is running with my user
account, where the service is running as local system. Hopefully that
doesn't make a difference (I copied the cookie).

> > The one thing I notice is Rabbitmq seems to get very upset and
> > disconnect the client when we try to unbind a non-existent binding.
>
> It doesn't really get upset about it, it just logs it as an error and
> closes the offending connection with an appropriate error message.

Is there any way to change this behavior through configuration? The log is
fine, but part of my use case is binding/unbinding occasionally and I would
prefer it if my clients didn't keep getting errors and having to reconnect.

> Is there anything in the sasl log?

No. It's using guest only.

Thanks,
JD


Posted: Thu Nov 05, 2009 5:59 pm
JD,

JD Conley wrote:
>>> The one thing I notice is Rabbitmq seems to get very upset and
>>> disconnect the client when we try to unbind a non-existent binding.
>> It doesn't really get upset about it, it just logs it as an error and
>> closes the offending connection with an appropriate error message.
>
> Is there any way to change this behavior through configuration? The log is
> fine, but part of my use case is binding/unbinding occasionally and I would
> prefer it if my clients didn't keep getting errors and having to reconnect.

Why are your clients trying to unbind non-existing bindings?

The AMQP spec requires that this results in an error.

>> Is there anything in the sasl log?
>
> No. It's using guest only.

I mean the rabbit-sasl.log; "SASL" in this case stands for Erlang's
"System Architecture Support Libraries" - nothing to do with authentication.


Regards,

Matthias.

Posted: Thu Nov 05, 2009 6:41 pm
> Why are your clients trying to unbind non-existing bindings?

They don't know the bindings are non-existent since the queues/exchange in
question aren't durable and, say, a server might crash and restart. A client
created a binding, then some time went by and it was time to remove the
binding, probably by a different client, that maybe wasn't connected yet and
doesn't know the binding was destroyed. There is a web farm with long
running connections/sessions to the Rabbitmq cluster.

I'm using this as a message passing system for a web application written
using the official .NET client. I have a Fanout exchange and a queue per
user session. Sort of like a chat room, when a user 'joins' her queue is
bound to the routing key for that 'room'. Users are long polling my web app,
which is doing a BasicConsume through a shared EventingBasicConsumer during
the 60 second polling period, and a BasicCancel when the poll is completed.
I persist the list of currently bound routing keys in a database and a
binding will usually last a few minutes before it is unbound. There is a
background cleanup process that runs and removes bindings and destroys
queues after users haven't polled in a while.

If there's a better way to do something like this, I'm open to suggestions. :)

> The AMQP spec requires that this results in an error.

That's fine, but that doesn't seem critical enough for the connection to
have to drop. As a newcomer in the space, it's ironic to me that I can
declare duplicate queues and exchanges without any errors but not silently
remove a binding that doesn't exist. It would be great if it were an option
to behave like the .NET generic Dictionary that returns 'true' on remove if
there was actually something there, or silently ignores the call and returns
'false' if not. Is there a DoesBindingExist method I can use to do things
cleanly today? I didn't see one.

> >> Is there anything in the sasl log?
> >
> > No. It's using guest only.
>
> I mean the rabbit-sasl.log; "SASL" in this case stands for Erlang's
> "System Architecture Support Libraries" - nothing to do with
> authentication.

Hah! Oops. Too much network protocol background in my past. The answer is
the same, even without the part of me looking like an idiot. The log is
empty.

-JD


Posted: Thu Nov 05, 2009 7:30 pm
JD,

JD Conley wrote:
>> Why are your clients trying to unbind non-existing bindings?
>
> They don't know the bindings are non-existent since the queues/exchange in
> question aren't durable and, say, a server might crash and restart.

Rabbit doesn't usually crash ;)

> I persist the list of currently bound routing keys in a database and a
> binding will usually last a few minutes before it is unbound. There is a
> background cleanup process that runs and removes bindings and destroys
> queues after users haven't polled in a while.
>
> If there's a better way to do something like this, I'm open to suggestion.

I think what you are doing is fine. One possible improvement is to
(re)declare the queues/bindings just before attempting to delete them.
It's a cheap operation for the server if the entities do exist already -
which is the common case - and it prevents the errors when they don't.
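
A minimal sketch of that redeclare-then-unbind idea, assuming the 1.7.0 .NET client and reusing the call signatures from the snippets and stack trace posted elsewhere in this thread. The names BindingCleanup and RedeclareThenUnbind are hypothetical, and the exchange is assumed to be the non-durable fanout exchange the application declares at startup:

```csharp
using RabbitMQ.Client;

// Hypothetical helper, not part of the RabbitMQ .NET client.
static class BindingCleanup
{
    public static void RedeclareThenUnbind(
        IModel model, string exchange, string queue, string routingKey)
    {
        // Re-assert the exchange, the queue and the binding. The argument
        // lists are copied from the snippets posted in this thread.
        model.ExchangeDeclare(exchange, ExchangeType.Fanout,
            false, false, false, false, false, null);
        model.QueueDeclare(queue, false, false, false, false, false, null);
        model.QueueBind(queue, exchange, routingKey, false, null);

        // Everything named here now exists, so the unbind should not
        // trigger a 404 NOT_FOUND channel error.
        model.QueueUnbind(queue, exchange, routingKey, null);
    }
}
```

If the broker had lost the queue in the meantime, this recreates it empty just before the unbind, which should be harmless when a cleanup pass deletes the queue afterwards.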

One possible extension of RabbitMQ could allow queues to have an
"automatically delete if idle for longer than X" setting. That would
actually be quite easy to implement and is a natural extension of the
existing exclusivity and auto-delete flags. Would that work for you or
do you actually need an idle timeout for *bindings*? Those would be a
lot harder to implement.

>> The AMQP spec requires that this results in an error.
>
> That's fine, but that doesn't seem critical enough for the connection to
> have to drop.

Actually, not_found (404) is defined to be a *soft* error, which means it
only closes the channel, not the connection. So I suspect your
application code is tripping over the channel error and causing the
connection to close.

> As a newcomer in the space, it's ironic to me that I can declare
> duplicate queues and exchanges without any errors but not silently
> remove a binding that doesn't exist.

Declaration is an assertion of existence, and hence is idempotent - you
can assert the existence of something as many times as you like; you
cannot create duplicates. Now, deletion/unbind should really be an
assertion of non-existence and hence be idempotent too, but
unfortunately it isn't. One day that will hopefully get fixed in the
protocol spec.

> It would be great if it were an option to behave like the .NET
> generic Dictionary that returns 'true' on remove if there was
> actually something there, or silently ignores the call and returns
> 'false' if not. Is there a DoesBindingExist method I can use to do
> things cleanly today? I didn't see one.

There is no such mechanism. Ideally the protocol would allow the listing
of existing entities. RabbitMQ currently allows that via the
'rabbitmqctl list_*' commands, but that is outside the protocol and not
designed for high rates of execution.


Regards,

Matthias.

Posted: Thu Nov 05, 2009 8:47 pm
> One possible extension of RabbitMQ could allow queues to have an
> "automatically delete if idle for longer than X" setting. That would
> actually be quite easy to implement and is a natural extension of the
> existing exclusivity and auto-delete flags. Would that work for you or
> do you actually need an idle timeout for *bindings*? Those would be a
> lot harder to implement.

That's interesting, and would save my app code a lot of headache. My use
case is an idle timeout for inactive queues, which by definition should
remove associated bindings, right? Activity in my case would be defined as
having a consumer within X amount of time.

> Actually, not_found (404) is defined to be a *soft* error, which means it
> only closes the channel, not the connection. So I suspect your
> application code is tripping over the channel error and causing the
> connection to close.

I'm using the .NET client, which is throwing an
OperationInterruptedException. Stack looks something like:

RabbitMQ.Client.Exceptions.OperationInterruptedException: The AMQP operation was interrupted: AMQP close-reason, initiated by Peer, code=404, text="NOT_FOUND - no queue 'youtopia.playerqueue.bc19a9767bc94cc18a20c7d8c4c9c1c6' in vhost '/'", classId=50, methodId=50, cause=
   at RabbitMQ.Client.Impl.SimpleBlockingRpcContinuation.GetReply()
   at RabbitMQ.Client.Impl.ModelBase.ModelRpc(MethodBase method, ContentHeaderBase header, Byte[] body)
   at RabbitMQ.Client.Framing.Impl.v0_8.Model.QueueUnbind(String queue, String exchange, String routingKey, IDictionary arguments)
   [My code here]

I am not closing the connection unless IModel.IsOpen is false after any
operation. You say it closes the channel. So, would that be the IModel in
.NET client speak? How would I recover from this exception without
rebuilding everything from the connection on up?

-JD


Posted: Thu Nov 05, 2009 9:17 pm
JD,

JD Conley wrote:
>> One possible extension of RabbitMQ could allow queues to have an
>> "automatically delete if idle for longer than X" setting.
>
> My use case is an idle timeout for inactive queues, which by
> definition should remove associated bindings, right?

Yes, bindings to a queue get removed when the queue is removed.

> Activity in my case would be defined as having a consumer within X amount of time.

Actually I'd define "in use" to mean either having consumers, or
performing any operation affecting the queue that one might expect a
(logical) consumer to perform, such as basic.get, basic.ack,
basic.consume, basic.cancel, queue.purge, queue.bind, queue.unbind,
queue.declare, queue.delete. Essentially the only excluded operations
are routing messages to queues and administrative operations, e.g.
rabbitmqctl commands.

>> Actually, not_found (404) is defined to be a *soft* error, which means it
>> only closes the channel, not the connection. So I suspect your
>> application code is tripping over the channel error and causing the
>> connection to close.
>
> I'm using the .NET client, which is throwing an
> OperationInterruptedException.
>
> I am not closing the connection unless IModel.IsOpen is false after any
> operation. You say it closes the channel. So, would that be the IModel in
> .NET client speak? How would I recover from this exception without
> rebuilding everything from the connection on up?

IModel does indeed correspond to AMQP's notion of a channel. You can
proceed after the error by opening a fresh channel with
IConnection.CreateModel().
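
With the .NET client, that recovery might look roughly like the following. This is a sketch only: the names ChannelRecovery and UnbindOrReopen are hypothetical, and it uses only calls that already appear in this thread (QueueUnbind, CreateModel, OperationInterruptedException):

```csharp
using RabbitMQ.Client;
using RabbitMQ.Client.Exceptions;

// Hypothetical helper, not part of the RabbitMQ .NET client.
static class ChannelRecovery
{
    // Returns the channel the caller should keep using: the original one
    // if the unbind succeeded, or a fresh one if the broker closed it.
    public static IModel UnbindOrReopen(
        IConnection connection, IModel model,
        string queue, string exchange, string routingKey)
    {
        try
        {
            model.QueueUnbind(queue, exchange, routingKey, null);
            return model;
        }
        catch (OperationInterruptedException)
        {
            // A soft error such as 404 NOT_FOUND closes only this channel.
            // The connection is assumed to be still open, so a replacement
            // channel can be created on it.
            return connection.CreateModel();
        }
    }
}
```

Callers would continue with the returned IModel. Consumers registered on the closed channel do not carry over, so they would need to be set up again on the new one.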


Regards,

Matthias.

Posted: Fri Nov 06, 2009 10:10 pm
> If you can, try running the server directly rather than starting it as a
> service. Then, if it crashes, a) you may see a message in the console
> where you started it, and b) the erl_crash.dump should get written to
> the dir where you started the server.

Ok, running it directly worked. I got a crash dump.
http://corp.hive7.com/download/erl_crash.zip

The logs weren't very interesting. Just more of the same stuff with
connections terminating and being re-created. Nothing in the SASL log.

It seems like it ran out of memory. How much memory overhead is there per
queue/stored message? I might need to make my timeout sweeping more
aggressive, or bring up more instances.

Or, more likely, my queue configuration or consuming code is wrong and my
messages are being stored forever in memory causing the memory leak. Here
are some c# snippets:

//init
_model.ExchangeDeclare("myexchangename", ExchangeType.Fanout, false, false,
    false, false, false, null);
_consumer = new EventingBasicConsumer();

...

//start consuming
_model.QueueDeclare(userQueue, false, false, false, false, false, null);
consumerTag = _model.BasicConsume(userQueue, true, null, _consumer);

...

//end consuming
_model.BasicCancel(consumerTag);

...

//bind user specific queue to routing key
_model.QueueDeclare(userQueue, false, false, false, false, false, null);
_model.QueueBind(userQueue, _exchange, key, false, null);

...

//unbind user queue from routing key
_model.QueueUnbind(userQueue, _exchange, key, null);

...

//publish message
var props = _model.CreateBasicProperties();
props.Expiration = "5000"; //is this right? desired expiration time is 5 seconds.
_model.BasicPublish(_exchange, key, false, false, props, body);

...

And, here's the console output:

node : rabbit@VMPRODWEB2
app descriptor: c:/Rabbitmq/rabbitmq_server-1.7.0/sbin/../ebin/rabbit.app
home dir : C:\Users\jdc
cookie hash : //1wK/zUYlPaC0GVhNfEsw==
log : C:/Rabbitmq/data/log/rabbit.log
sasl log : C:/Rabbitmq/data/log/rabbit-sasl.log
database dir : c:/Rabbitmq/data/db/rabbit-mnesia

starting database ...done
starting core processes ...done
starting recovery ...done
starting persister ...done
starting guid generator ...done
starting builtin applications ...done
starting TCP listeners ...done
starting SSL listeners ...done

broker running

Crash dump was written to: erl_crash.dump
temp_alloc: Cannot allocate 9176036 bytes of memory (of type "tmp_heap").

This application has requested the Runtime to terminate it in an unusual
way.
Please contact the application's support team for more information.

-JD


Posted: Fri Nov 20, 2009 12:38 am
On Tue, Nov 10, 2009 at 11:01:31AM -0800, JD Conley wrote:
> Yeah, that's what I thought. But yesterday I added acking just to see if it
> would help and the memory footprint no longer grows out of control and the
> process hasn't crashed since. I wonder if either the .NET client or rabbit
> isn't honoring the noack for this use case?

The .NET and Java clients definitely honour the noAck flag. Can you
produce a small test case that shows the issue you're seeing?

> > Expiration is not implemented. There is no notion of Time To Live or
> > messages expiring in RabbitMQ. This is an oft requested feature and is
> > high on our todo list, but is not in active development yet.
>
> Add one more to that request number.

Noted; rapidly approaching integer overflow...

> The dump is over 1GB. What is the VM limit for erlang by default? Maybe we
> just hit that? The system has 4gb ram and plenty free to use the 2GB imposed
> user land limit in Windows, not to mention swap space.

Well on a 64-bit Linux system, I've made erlang happily eat 10s of GBs
of memory. I would have thought that erlang should be able to make use
of 2GB in Windows, but that error message suggests to me that's coming
back from malloc, thus it's Windows that's causing the issue. We would
thoroughly recommend you use at least a 64-bit system. ;)

Matthew


All times are GMT