< Erlang bugs mailing list ~ [erlang-questions] epmd leaving ports in TIME_WAIT? |
| Guest |
Posted: Mon Mar 22, 2010 3:17 pm |
Escalating to erlang-bugs.

I've restarted both my server and laptop over the weekend.
On both machines, I restarted my 2 Erlang applications (4 nodes, connected
in pairs: A <-> B, C <-> D, with each pair on the same computer).

This was yesterday. This morning I ran netstat -t again, and indeed, I
have >100 sockets stuck in TIME_WAIT on both computers, split about equally
between connections to localhost and connections to the other PC.
No node has crashed or restarted. None of the nodes does anything fancy: they
simply call net_adm:ping to connect the nodes, and then data is exchanged
using messages.

The problem seems somewhat related to epmd apparently restarting
from time to time, as the OS gets confused and cannot retrieve the PID that
originally opened the sockets (although the port shows it is epmd).

I briefly looked at the epmd code and did see a few comments in there such as
"// should probably always close", plus a few other potential places where it
might leak sockets. Unfortunately I ran out of time.

Can anyone confirm whether they see similar behavior? Note that on both
computers, both nodes are started manually (not automated yet), so
it isn't a race to see which node can start epmd first. I do wonder, though,
whether it might be related to the epmd 100% CPU use problem: I believe
another poster made the point that it would happen when epmd runs out of
file descriptors (which would happen if it leaks sockets in TIME_WAIT).
On Mon, Mar 15, 2010 at 2:53 PM, Nicholas Frechette <zeno490@gmail.com> wrote:
> Hi,
> I recently started running 2 erlang applications in distributed mode (with
> -sname) on the same box.
> I am noticing now (doing netstat -t) that a _LOT_ of ports are left open at
> 4369 (the port used by epmd) on my ubuntu 9.10 box.
>
> In fact, of all active connections, 90%+ of my open ports will be
> localhost:4369 -> culpritbox:randomport.
> All are stuck in TIME_WAIT
>
> Any idea what could be causing this? I use a different computer to do my
> development and I see a similar pattern emerging (again ubuntu 9.10).
>
> My erlang version is R13B01.
>
> Here is an example output of `netstat -t` (note that even with -p, netstat
> doesn't display a program name for those ports). Any ideas?
>
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address Foreign Address State
> tcp 0 0 mercury:4369 mercury:49448 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:45420 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:41234 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:35179 TIME_WAIT
> * something else
> tcp 0 0 mercury:4369 mercury:44567 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:45846 TIME_WAIT
> tcp 0 0 localhost:4369 localhost:33424 ESTABLISHED
> tcp 0 0 mercury:4369 mercury:38486 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:38624 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:44724 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:44398 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:47189 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:45306 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:36997 TIME_WAIT
> tcp 0 0 localhost:48762 localhost:4369 ESTABLISHED
> tcp 0 0 mercury:4369 mercury:38627 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:37665 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:48427 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:57916 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:51098 TIME_WAIT
> * something else
> tcp 0 0 mercury:4369 mercury:55867 TIME_WAIT
> * something else
> * something else
> tcp 0 0 mercury:4369 mercury:36005 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:46053 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:35974 TIME_WAIT
> * something else
> tcp 0 0 mercury:4369 mercury:42211 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:33363 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:53662 TIME_WAIT
> * something else
> tcp 0 0 mercury:4369 mercury:37094 TIME_WAIT
> * something else
> tcp 0 0 mercury:4369 mercury:43824 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:51092 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:43258 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:43064 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:37111 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:54677 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:44286 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:49718 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:46809 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:46112 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:48825 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:44124 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:45203 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:51149 TIME_WAIT
> * something else
> tcp 0 0 mercury:4369 mercury:46636 TIME_WAIT
> * something else
> tcp 0 0 mercury:4369 mercury:48254 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:49424 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:59976 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:46730 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:44890 TIME_WAIT
> * something else
> tcp 0 0 mercury:4369 mercury:39385 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:57297 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:37066 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:50186 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:45703 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:42943 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:55328 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:44401 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:45791 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:56537 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:42194 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:33216 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:46544 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:47610 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:52892 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:38877 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:50983 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:45376 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:54394 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:45412 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:36546 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:32776 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:38289 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:35126 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:50964 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:47857 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:55772 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:41209 TIME_WAIT
> * something else
> tcp 0 0 mercury:4369 mercury:41426 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:52887 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:33961 TIME_WAIT
> * something else
> tcp 0 0 mercury:4369 mercury:58946 TIME_WAIT
> tcp 0 0 localhost:33424 localhost:4369 ESTABLISHED
> tcp 0 0 mercury:4369 mercury:46272 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:58219 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:60676 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:37091 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:34972 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:53706 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:52788 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:53221 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:57241 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:56398 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:40434 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:43636 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:41792 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:53162 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:41266 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:36990 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:37871 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:40089 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:58028 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:40347 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:55445 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:56130 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:37858 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:53709 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:45924 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:56969 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:33933 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:51305 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:53452 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:35840 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:49678 TIME_WAIT
> * something else
> tcp 0 0 mercury:4369 mercury:57573 TIME_WAIT
> tcp 0 0 localhost:4369 localhost:48762 ESTABLISHED
> tcp 0 0 mercury:4369 mercury:46680 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:41095 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:44073 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:43461 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:39410 TIME_WAIT
> tcp 0 0 mercury:4369 mercury:38881 TIME_WAIT
>
>
Post received from mailinglist |
| Guest |
Posted: Mon Mar 22, 2010 3:46 pm |
On Mon, Mar 22, 2010 at 11:17:25AM -0400, Nicholas Frechette wrote:
> Escalating to erlang-bugs.
> I've restarted both my server and laptop over the weekend.
> On both machines, I restarted my 2 erlang applications (4 nodes, connected
> in pairs: A <-> B, C <-> D, with pairs on the same computer)
>
> This was yesterday. This morning I did another netstat -t, and indeed, I
> have >100 sockets stuck in TIME_WAIT on both computers.
Sockets in TIME_WAIT state are normal. After the socket is closed,
the OS puts the socket into TIME_WAIT to ensure any pending packets
queued somewhere in the network for the socket pair have time to arrive.
Usually TIME_WAIT is 2 or 4 minutes.
It looks as if a number of TCP connections are being established and
closed to your epmd.
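
To see how many connections to epmd sit in each TCP state, the netstat output can be tallied with a few lines of Python. This is only a sketch: it assumes the six-column layout shown in the quoted output, and that the port appears as 4369 rather than as a service name (running netstat -tn keeps ports numeric). The function name and the sample rows are illustrative, with the rows copied from the output quoted in this thread.

```python
from collections import Counter

# Sample rows in `netstat -t` column order, copied from the output
# quoted earlier in this thread.
SAMPLE = """\
tcp 0 0 mercury:4369 mercury:49448 TIME_WAIT
tcp 0 0 mercury:4369 mercury:45420 TIME_WAIT
tcp 0 0 localhost:4369 localhost:33424 ESTABLISHED
tcp 0 0 localhost:48762 localhost:4369 ESTABLISHED
"""


def count_states(netstat_text, port="4369"):
    """Count TCP states for rows whose local address ends in :port."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        local, state = fields[3], fields[5]
        if local.rsplit(":", 1)[-1] == port:
            counts[state] += 1
    return counts


if __name__ == "__main__":
    print(dict(count_states(SAMPLE)))
```

Only rows whose local side is port 4369 are counted, so the client half of an ESTABLISHED pair (such as localhost:48762 -> localhost:4369) is skipped.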
> Both with outgoing
> on localhost and the other pc, in about equal proportion.
> No node has crashed/restarted. None of the nodes does anything fancy, simply
> net_adm:ping to connect the nodes and then data is exchanged using messages.
>
> The problem seems somewhat related to the fact that epmd seems to restart
> from time to time as the OS gets confused and cannot retrieve the PID that
> originally opened the sockets (although port shows it is epmd)
What is restarting epmd?
See anything in your logs? Maybe try running epmd in debug mode. Kill
epmd if it is running and run: epmd -d
> I briefly looked at the epmd code and did see a few comments in there about
> // should probably always close and a few other potential places where it
> might leak sockets. Unfortunately I ran out of time.
It doesn't appear to be leaking fds, but you can check with lsof.
> Can anyone confirm if they see similar behavior? Note that on both
> computers, both nodes are started manually (not automated yet) and as such
> it isn't a race to see which node can start epmd first. Although, I wonder
> if it might be related to the problem of the epmd 100% cpu use, I believe
> another poster made the point that it would happen when epmd runs out of
> file descriptor (which would happen if it leaks sockets in TIME_WAIT).
That's just one error condition; for example, the connection could have
been aborted or the socket could have been closed. Are you seeing a lot
of CPU usage?
________________________________________________________________
erlang-bugs (at) erlang.org mailing list.
See http://www.erlang.org/faq.html
To unsubscribe; mailto:erlang-bugs-unsubscribe@erlang.org
| Guest |
Posted: Tue Mar 23, 2010 9:04 pm |
On Tue, Mar 23, 2010 at 01:15:18PM -0400, Nicholas Frechette wrote:
> Hi,
> I did as you suggested and ran epmd -d.
> It ends up outputting something like:
> epmd: Tue Mar 23 09:26:39 2010: ** sent PORT2_RESP (error) for "rodc10"
> epmd: Tue Mar 23 09:26:40 2010: ** got PORT2_REQ
>
> Over and over.
> This is because one of my nodes pings (net_adm:ping) a node that doesn't
> exist from time to time. (Every couple seconds or so)
Right, so every time the node connects and disconnects the TCP session
will go into TIME_WAIT.
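
As a rough back-of-the-envelope check, the number of sockets sitting in TIME_WAIT levels off at about (connections closed per second) x (TIME_WAIT duration). The sketch below assumes one epmd lookup every 2 seconds ("every couple seconds or so") and tries three durations: 60 seconds, which is an assumption based on common Linux behaviour, and the 2 and 4 minutes mentioned earlier in the thread. The function name is made up for illustration.

```python
def steady_state_time_wait(connections_per_second, time_wait_seconds):
    """Expected number of sockets in TIME_WAIT once the count levels off.

    Each closed connection lingers for time_wait_seconds, so at a constant
    rate the population settles at rate * duration.
    """
    return connections_per_second * time_wait_seconds


if __name__ == "__main__":
    # Assumed rate: one epmd lookup every 2 seconds (0.5 per second).
    for seconds in (60, 120, 240):
        print(seconds, steady_state_time_wait(0.5, seconds))
```

Under those assumptions the count settles somewhere between a few dozen and a little over a hundred sockets, which is the same order of magnitude as the ">100 sockets" reported, without any leak being involved.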
> Also, when epmd dies, the ports are closed properly. In any case, I find it
> surprising that epmd has to open so many sockets to ask around if someone
> has seen the missing node.
1> [ begin {ok,S} = gen_tcp:connect({127,0,0,1},4369,[]), ok = gen_tcp:close(S) end || _ <- lists:seq(1,10000) ].
That will generate 10,000 sessions in TIME_WAIT. I guess the question
is why your nodes keep disappearing from the network.
> On Mon, Mar 22, 2010 at 11:45 AM, Michael Santos
> <michael.santos@gmail.com> wrote:
| Guest |
Posted: Thu Mar 25, 2010 12:25 am |
On Tue, Mar 23, 2010 at 06:43:37PM -0400, Nicholas Frechette wrote:
> Yes, I know why my node isn't 'up', my question is why is epmd opening a
> socket to query if a node is up? ie: what does it attempt to connect to?
epmd isn't opening the connection; a distributed Erlang node is asking epmd
whether a node exists on that host and which port it should connect to for
that node. epmd maps local Erlang nodes to ports.
> Said node was never up in epmd's lifetime, meaning it shouldn't be a cached
> value or the like. Supposing it connects to my other epmd process on my
> other computer, why would it not keep that as a permanent TCP connection? It
> also attempts to connect to ports on the same host, not just the other
> computer, so I'm a bit curious.
> Is epmd implemented as SOAP over an HTTP-like protocol, where queries are
> made over single-use socket connections?
When an erlang node is brought up as a distributed node, it connects
to the local epmd with a persistent TCP connection. If the node dies,
the TCP connection is closed and epmd deregisters the node (the atom
that identifies the node). Queries to epmd about which registered nodes
exist on a host are carried out in a single TCP transaction.
See the comments in epmd_srv.c and:
http://www.erlang.org/doc/apps/erts/erl_dist_protocol.html
epmd_srv.c:
* To keep track of when registered Erlang nodes are terminated this
* server keeps the socket open where the request for registration was
* made.
* In all but one case there is only one request for each connection made
* to this server so we can safely close the socket after sending the
* reply. The exception is ALIVE_REQ where we keep the connection
* open without sending any data. When we receive a "close" this is
* an indication that the Erlang node was terminated. The termination
* may have been "normal" or caused by a crash. The operating system
* ensure that the connection is closed either way.
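
To make the "one request per connection" lookup concrete, here is a minimal Python sketch of the query a node sends when it asks epmd for another node's port. The message tags (122 for PORT_PLEASE2_REQ, 119 for PORT2_RESP) and the 2-byte length prefix follow my reading of the distribution protocol document linked above; the function names are illustrative and are not part of epmd or OTP. The node name "rodc10" is the one from the epmd -d output quoted earlier in the thread.

```python
import struct

PORT_PLEASE2_REQ = 122  # request tag, per the distribution protocol docs
PORT2_RESP = 119        # response tag


def build_port_please2_req(node_name):
    """Frame a PORT_PLEASE2_REQ: 2-byte big-endian length, tag, node name."""
    payload = bytes([PORT_PLEASE2_REQ]) + node_name.encode("latin-1")
    return struct.pack(">H", len(payload)) + payload


def parse_port2_resp(data):
    """Return the node's distribution port, or None if epmd reports an error."""
    if len(data) < 2 or data[0] != PORT2_RESP:
        raise ValueError("not a PORT2_RESP")
    if data[1] != 0:
        # Non-zero result: the node is not registered; nothing else follows.
        return None
    if len(data) < 4:
        raise ValueError("truncated PORT2_RESP")
    (port,) = struct.unpack(">H", data[2:4])
    return port


if __name__ == "__main__":
    print(build_port_please2_req("rodc10"))
    print(parse_port2_resp(bytes([PORT2_RESP, 1])))  # error reply
```

A client would open a TCP connection to port 4369, send the framed request, read the reply, and then the connection is closed. Each such lookup therefore leaves one socket in TIME_WAIT on whichever side closes first.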
All times are GMT