Erlang/OTP Forums

Erlang bugs mailing list ~ [erlang-questions] R11B-2 SMP Timer Race Condition Bug

Posted: Fri Dec 22, 2006 8:49 pm
On 22-Dec-06, at 3:22 PM, Serge Aleynikov wrote:

> Additionally, I should say that we've been able to reproduce this bug on
> several Linux platforms ...
> It happens when all the CPUs in SMP mode are over 75% loaded. The bug
> doesn't happen immediately after starting a release, but after a period
> of 5 min to 3 hours, which makes it pretty hard to diagnose. The
> tracing method that we initially tried was to add printf statements
> in the emulator writing to stderr. However, this prevented the bug
> from showing up. We then switched to a SysV message queue to
> communicate the trace to an external process that dumped it to a
> file. This allowed us to gain further understanding of the problem, but, as
> Dmitriy indicated, any attempt to reduce the code to a minimal example
> made the problem disappear.

Could DTrace on Solaris help?

--Toby
_______________________________________________
erlang-bugs mailing list
erlang-bugs@erlang.org
http://www.erlang.org/mailman/listinfo/erlang-bugs
Post received from mailing list
Posted: Sun Dec 24, 2006 12:37 am
Thanks for your detailed bug report. I'll look at this as soon as possible.

BR,
Rickard Green, Erlang/OTP

Serge Aleynikov wrote:
> Additionally, I should say that we've been able to reproduce this bug on
> several Linux platforms (RH ES 4.0, Fedora Core 5.0, 32bit and 64bit) in
> R11B-0, R11B-1 and R11B-2. The bug (what appears to be a race
> condition) is seen only if the emulator is started in SMP mode, and it
> results in the following construct blocking indefinitely in the context of
> some Erlang process handling a message dispatching function:
>
> receive
> after N -> % Where N is between 1 and 999
>     ok
> end.
>
> It happens when all the CPUs in SMP mode are over 75% loaded. The bug
> doesn't happen immediately after starting a release, but after a period
> of 5 min to 3 hours, which makes it pretty hard to diagnose. The
> tracing method that we initially tried was to add printf statements
> in the emulator writing to stderr. However, this prevented the bug
> from showing up. We then switched to a SysV message queue to
> communicate the trace to an external process that dumped it to a
> file. This allowed us to gain further understanding of the problem, but, as
> Dmitriy indicated, any attempt to reduce the code to a minimal example
> made the problem disappear.
>
> The emulator code is quite involved, but hopefully someone in the OTP
> team could come up with a recommendation of how/where to put a missing
> synchronization. If needed we can arrange for a remote SSH login to the
> system(s) where the problem is reproducible.
>
> Regards,
>
> Serge
>
> Dmitriy Kargapolov wrote:
>> Unfortunately I cannot create a standalone test for this bug, even though
>> I have come much closer to understanding its effect.
>> This bug appears only in a highly loaded system.
>>
>> Recently I did manage to trace some points in the code and see at least
>> one scenario for the race condition bug.
>>
>>  1. Thread A  erl_set_timer (time.c)            Lock Timing Wheel
>>  2. Thread A  insert_timer (time.c)             Insert Timer T1
>>  3. Thread A  erl_set_timer (time.c)            Unlock Timing Wheel
>>  4. Thread B  bump_timer_internal (time.c)      Lock Timing Wheel
>>  5. Thread A  cancel_timer (erl_process.c)      Cancel timer T1
>>  6. Thread B  bump_timer_internal (time.c)      Build list of Expired Timers
>>  7. Thread A  erl_cancel_timer (time.c)         Cancel timer T1: Waiting for Timing Wheel Lock
>>  8. Thread B  bump_timer_internal (time.c)      Unlock Timing Wheel
>>  9. Thread C  set_timer (erl_process.c)         New Timeout Request (T2)
>> 10. Thread B  bump_timer_internal (time.c)      Call Expired Timers Callbacks
>> 11. Thread B  free_ptimer (utils.c)             Timer T1 callback invokes free_ptimer()
>> 12. Thread C  erts_create_smp_ptimer (utils.c)  Create Timer ErtsSmpPTimer for T2
>> 13. Thread B  free_ptimer (utils.c)             Free ErtsSmpPTimer memory block
>> 14. Thread C  erts_create_smp_ptimer (utils.c)  Allocate ErtsSmpPTimer for T2, block reused!
>> 15. Thread C  erl_set_timer (time.c)            erl_set_timer invoked for T2
>> 16. Thread C  erl_set_timer (time.c)            Lock Timing Wheel
>> 17. Thread C  insert_timer (time.c)             Insert Timer T2
>> 18. Thread C  erl_set_timer (time.c)            Unlock Timing Wheel
>> 19. Thread A  erl_cancel_timer (time.c)         Lock Timing Wheel
>> 20. Thread A  erl_cancel_timer (time.c)         Remove ex-T1 == T2 from the timing wheel
>> 21. Thread A  erl_cancel_timer (time.c)         Unlock Timing Wheel
>>
>> See also attached diagram.
>>
>> It looks like one more mutex is required, to prevent the timeout callback
>> from releasing the ErtsSmpPTimer memory block if a cancel request has been
>> issued for the timer, and vice versa. The two points of control, timer
>> cancellation and timer expiration, should not interfere.
>> This bug happens only in SMP mode, since there an additional timer control
>> structure, ErtsSmpPTimer, is used between the emulator and the timing wheel.
>>
>> Mikael Pettersson wrote:
>>> Dmitriy Kargapolov writes:
>>> > When running erl with -smp +S 2 option, sometimes process gets
>>> > stuck in timer:sleep/1.
>>> > Process code looks like:
>>> >
>>> > some_receiver(State) ->
>>> >     NewState = receive
>>> >         % legal packet
>>> >         {some_keyword, Address, Port, Packet} ->
>>> >             State1 = handle_packet(Address, Port, Packet, State),
>>> >             timer:sleep(get_loop_delay()),
>>> >             State1;
>>> >         % unknown message
>>> >         _ ->
>>> >             State
>>> >     end,
>>> >     some_receiver(NewState).
>>> >
>>> > Delay value varies in range 1..999
>>> >
>>> > Since timer:sleep/1 implemented as:
>>> > sleep(T) ->
>>> >     receive
>>> >     after T -> ok
>>> >     end.
>>> > it seems to be problem with "after" in smp implementation in R11B-0
>>> >
>>> > I don't have more details yet but will continue testing.
>>> > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
>>>
>>> Interesting. Please send us a small standalone module that exhibits
>>> the bug, and I'll see if I can reproduce it.
>>>
>>> /Mikael
>>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@erlang.org
>> http://www.erlang.org/mailman/listinfo/erlang-questions
>
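The block-reuse hazard in steps 11 to 20 of Dmitriy's trace can be replayed on a single thread. The following is only a minimal C sketch, not emulator code: the `ptimer` struct, the one-slot allocator and `cancel_by_pointer` are hypothetical stand-ins for ErtsSmpPTimer, its allocator and erl_cancel_timer, and the three threads are reduced to a fixed sequence of calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for ErtsSmpPTimer: only what the sketch needs. */
typedef struct ptimer {
    int owner;     /* logical timer that owns the block (1 = T1, 2 = T2) */
    int in_wheel;  /* nonzero while the block is linked into the "wheel" */
} ptimer;

/* One-slot allocator, so a freed block is always the next one handed out. */
static ptimer slot;
static int slot_free = 1;

static ptimer *ptimer_alloc(int owner)
{
    if (!slot_free)
        return NULL;
    slot_free = 0;
    slot.owner = owner;
    slot.in_wheel = 1;          /* models insert_timer */
    return &slot;
}

static void ptimer_free(ptimer *t)
{
    t->in_wheel = 0;            /* timer has expired and left the wheel */
    slot_free = 1;
}

/* Cancels by raw pointer without checking that the block still belongs to
 * the timer the caller meant. Returns the owner of the timer it removed,
 * or 0 if nothing was in the wheel. */
static int cancel_by_pointer(ptimer *t)
{
    if (!t->in_wheel)
        return 0;
    t->in_wheel = 0;
    return t->owner;
}

/* Replays the interleaving from the trace in a fixed order and returns the
 * owner of the timer that the late cancel actually removed. */
int replay_race(void)
{
    ptimer *t1, *stale, *t2;

    slot_free = 1;                   /* start from a clean allocator */
    t1 = ptimer_alloc(1);            /* Thread A: set T1 */
    stale = t1;                      /* Thread A: cancel started, waits for lock */
    ptimer_free(t1);                 /* Thread B: T1 expires, block is freed */
    t2 = ptimer_alloc(2);            /* Thread C: T2 reuses the same block */
    (void)t2;
    return cancel_by_pointer(stale); /* Thread A: means T1, removes T2 */
}
```

Calling `replay_race()` returns 2: the cancel that was aimed at T1 removes T2 from the wheel, which in the real system would leave the process owning T2 waiting for a timeout that never fires.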
Posted: Wed Dec 27, 2006 5:55 pm
The process lock plays an important role here. Unfortunately a faulty
optimization (blush) prevented the process lock from playing that role.
ptimer_timeout() has to acquire the process lock before looking at the
ptimer flags. I've attached a patch that should fix the problem.

$ tar -zxf otp_src_R11B-2.tar.gz
$ patch -p0 < ptimer.patch
patching file `otp_src_R11B-2/erts/emulator/beam/utils.c'

Please, report to us whether or not the problem went away.

Great work Dmitriy and Serge! Many thanks!

BR,
Rickard Green, Erlang/OTP
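
The rule stated above (take the process lock first, then look at the ptimer flags) can be illustrated with a small C model. This is a sketch under stated assumptions, not the emulator's code: `proc`, `ptimer`, `cancel`, `on_timeout` and `deliveries` are hypothetical names, and a C11 `atomic_flag` spinlock stands in for the process lock.

```c
#include <assert.h>
#include <stdatomic.h>

#define PTMR_FLG_CANCELLED 1u  /* hypothetical stand-in for ERTS_PTMR_FLG_CANCELLED */

typedef struct proc {
    atomic_flag lock;    /* stand-in for the process lock */
    int timeouts_seen;   /* number of timeouts delivered to the process */
} proc;

typedef struct ptimer {
    proc *p;
    unsigned flags;      /* read and written only while p->lock is held */
} ptimer;

static void proc_lock(proc *p)
{
    while (atomic_flag_test_and_set(&p->lock)) {
        /* spin until the lock is released */
    }
}

static void proc_unlock(proc *p)
{
    atomic_flag_clear(&p->lock);
}

/* Cancel side: mark the timer as cancelled while holding the lock. */
static void cancel(ptimer *t)
{
    proc_lock(t->p);
    t->flags |= PTMR_FLG_CANCELLED;
    proc_unlock(t->p);
}

/* Timeout side: acquire the lock BEFORE looking at the flags, so a cancel
 * that has already completed is always observed. */
static void on_timeout(ptimer *t)
{
    proc_lock(t->p);
    if (!(t->flags & PTMR_FLG_CANCELLED))
        t->p->timeouts_seen++;
    proc_unlock(t->p);
}

/* Runs one timer through an optional cancel followed by the timeout and
 * returns how many timeouts were delivered. */
int deliveries(int cancel_first)
{
    proc p = { .lock = ATOMIC_FLAG_INIT, .timeouts_seen = 0 };
    ptimer t = { &p, 0u };

    if (cancel_first)
        cancel(&t);
    on_timeout(&t);
    return p.timeouts_seen;
}
```

In this model `deliveries(0)` delivers one timeout and `deliveries(1)` delivers none. Because both sides touch the flags only under the same lock, the timeout path cannot act on a flag value that a concurrent cancel is about to change.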

Posted: Thu Dec 28, 2006 4:11 pm
Hi Rickard,

First of all, many thanks indeed for the very fast response time on
investigating and fixing issues like this! That level of
responsiveness really helps reassure new adopters of Erlang.

How risky is this patch? i.e. Should everyone apply it?

Is the patch ...

a) An experimental fix that needs testing by Serge and Dmitriy before
others consider it.

b) A definite fix for a definite problem, and has been tested. But
it may or may not be the problem that Serge and Dmitriy found.

regards,

Chris

On 12/27/06, Rickard Green <rickard.s.green@ericsson.com> wrote:
> The process lock plays an important role here. Unfortunately a faulty
> optimization (blush) prevented the process lock from playing that role.
> ptimer_timeout() has to acquire the process lock before looking at the
> ptimer flags. I've attached a patch that should fix the problem.
>
> $ tar -zxf otp_src_R11B-2.tar.gz
> $ patch -p0 < ptimer.patch
> patching file `otp_src_R11B-2/erts/emulator/beam/utils.c'
>
> Please, report to us whether or not the problem went away.
>
> Great work Dmitriy and Serge! Many thanks!
>
> BR,
> Rickard Green, Erlang/OTP
>
> --- otp_src_R11B-2/erts/emulator/beam/utils.c 2006-11-06 14:51:50.000000000 +0100
> +++ otp_src_R11B-2.ptimer_patch/erts/emulator/beam/utils.c 2006-12-27 18:11:44.772758000 +0100
> @@ -2999,15 +2999,16 @@
>  static void
>  ptimer_timeout(ErtsSmpPTimer *ptimer)
>  {
> -    if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>          if (is_internal_pid(ptimer->timer.id)) {
>              Process *p;
> -            p = erts_pid2proc(NULL,
> -                              0,
> -                              ptimer->timer.id,
> -                              ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS);
> +            p = erts_pid2proc_opt(NULL,
> +                                  0,
> +                                  ptimer->timer.id,
> +                                  ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS,
> +                                  ERTS_P2P_FLG_ALLOW_OTHER_X);
>              if (p) {
> -                if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
> +                if (!p->is_exiting
> +                    && !(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>                      ASSERT(*ptimer->timer.timer_ref == ptimer);
>                      *ptimer->timer.timer_ref = NULL;
>                      (*ptimer->timer.timeout_func)(p);
> @@ -3028,7 +3029,6 @@
>                  erts_smp_io_unlock();
>              }
>          }
> -    }
>      free_ptimer(ptimer);
>  }
>
Posted: Fri Dec 29, 2006 1:34 pm
The scenario described by Serge and Dmitriy can happen due to this bug.
The fix has been tested and I am quite sure it will fix the described
problem. There could of course exist yet another bug causing the same
problem, but I don't think so. The results of Serge's and Dmitriy's
tests are of course interesting, but regardless of them the patch fixes a
real bug. If you use the SMP emulator, apply the patch.

BR,
Rickard Green, Erlang/OTP

Chris Newcombe wrote:
> Hi Rickard,
>
> First of all, many thanks indeed for the very fast response time on
> investigating and fixing issues like this! That level of
> responsiveness really helps reassure new adopters of Erlang.
>
> How risky is this patch? i.e. Should everyone apply it?
>
> Is the patch ...
>
> a) An experimental fix that needs testing by Serge and Dmitriy before
> others consider it.
>
> b) A definite fix for a definite problem, and has been tested. But
> it may or may not be the problem that Serge and Dmitriy found.
>
> regards,
>
> Chris
Posted: Fri Dec 29, 2006 4:17 pm
The first test of the patch has completed successfully. We will
continue testing after the holidays, but so far the patch works fine.
Thank you very much for fixing the problem in such a short time!

BR and Happy New Year!


Rickard Green wrote:
> The scenario described by Serge and Dmitriy can happen due to this bug.
> The fix has been tested and I am quite sure it will fix the described
> problem. There could of course exist yet another bug causing the same
> problem, but I don't think so. The results of Serge's and Dmitriy's
> tests are of course interesting, but regardless of them the patch fixes a
> real bug. If you use the SMP emulator, apply the patch.
>
> BR,
> Rickard Green, Erlang/OTP
>
> Chris Newcombe wrote:
>> Hi Rickard,
>>
>> First of all, many thanks indeed for the very fast response time on
>> investigating and fixing issues like this! That level of
>> responsiveness really helps reassure new adopters of Erlang.
>>
>> How risky is this patch? i.e. Should everyone apply it?
>>
>> Is the patch ...
>>
>> a) An experimental fix that needs testing by Serge and Dmitriy before
>> others consider it.
>>
>> b) A definite fix for a definite problem, and has been tested. But
>> it may or may not be the problem that Serge and Dmitriy found.
>>
>> regards,
>>
>> Chris
>>
>> On 12/27/06, Rickard Green <rickard.s.green@ericsson.com> wrote:
>>> The process lock plays an important role here. Unfortunately a faulty
>>> optimization (blush) prevented the process lock from playing that role.
>>> ptimer_timeout() has to acquire the process lock before looking at the
>>> ptimer flags. I've attached a patch that should fix the problem.
>>>
>>> $ tar -zxf otp_src_R11B-2.tar.gz
>>> $ patch -p0 < ptimer.patch
>>> patching file `otp_src_R11B-2/erts/emulator/beam/utils.c'
>>>
>>> Please, report to us whether or not the problem went away.
>>>
>>> Great work Dmitriy and Serge! Many thanks!
>>>
>>> BR,
>>> Rickard Green, Erlang/OTP
>>>
>>> Rickard Green wrote:
>>> > Thanks for your detailed bug report. I'll look at this as soon as
>>> possible.
>>> >
>>> > BR,
>>> > Rickard Green, Erlang/OTP
>>> >
>>> > Serge Aleynikov wrote:
>>> >> Additionally, I should say that we've been able to reproduce this bug on
>>> >> several Linux platforms (RH ES 4.0, Fedora Core 5.0, 32bit and 64bit) in
>>> >> R11B-0, R11B-1 and R11B-2. The bug (what appears to be a race
>>> >> condition) is seen only if the emulator is started in the SMP mode and
>>> >> results in the following construct blocking indefinitely in the context of
>>> >> some Erlang process handling a message dispatching function:
>>> >>
>>> >> receive
>>> >> after N -> % Where N is between 1 and 999
>>> >>     ok
>>> >> end.
>>> >>
>>> >> It happens when all the CPUs in SMP mode are over 75% loaded. The bug
>>> >> doesn't happen immediately after starting a release, but after a period
>>> >> of 5 min to 3 hours, which makes it pretty hard to diagnose. The
>>> >> tracing method that we initially tried to use was to include printf
>>> >> statements in the emulator to stderr. However, this prevented the bug
>>> >> from showing up. We then switched to using a SysV message queue to
>>> >> communicate trace to an external process that dumped the trace to a
>>> >> file. This allowed us to gain further understanding of the problem, but as
>>> >> Dmitriy indicated, any attempt to reduce the code to a minimal example
>>> >> made the problem disappear.
>>> >>
>>> >> The emulator code is quite involved, but hopefully someone in the OTP
>>> >> team could come up with a recommendation of how/where to put the missing
>>> >> synchronization. If needed we can arrange for a remote SSH login to the
>>> >> system(s) where the problem is reproducible.
>>> >>
>>> >> Regards,
>>> >>
>>> >> Serge
>>> >>
>>> >> Dmitriy Kargapolov wrote:
>>> >>> Unfortunately I cannot create a standalone test for this bug, even though
>>> >>> I have come much closer to understanding the effect.
>>> >>> This bug appears only in a highly loaded system.
>>> >>>
>>> >>> Recently I did manage to trace some points in the code and see at least
>>> >>> one scenario for the race condition bug.
>>> >>>
>>> >>>  1. Thread A  erl_set_timer (time.c)            Lock Timing Wheel
>>> >>>  2. Thread A  insert_timer (time.c)             Insert Timer T1
>>> >>>  3. Thread A  erl_set_timer (time.c)            Unlock Timing Wheel
>>> >>>  4. Thread B  bump_timer_internal (time.c)      Lock Timing Wheel
>>> >>>  5. Thread A  cancel_timer (erl_process.c)      Cancel timer T1
>>> >>>  6. Thread B  bump_timer_internal (time.c)      Build list of Expired Timers
>>> >>>  7. Thread A  erl_cancel_timer (time.c)         Cancel timer T1: Waiting for Timing Wheel Lock
>>> >>>  8. Thread B  bump_timer_internal (time.c)      Unlock Timing Wheel
>>> >>>  9. Thread C  set_timer (erl_process.c)         New Timeout Request (T2)
>>> >>> 10. Thread B  bump_timer_internal (time.c)      Call Expired Timers Callbacks
>>> >>> 11. Thread B  free_ptimer (utils.c)             Timer T1 callback invokes free_ptimer()
>>> >>> 12. Thread C  erts_create_smp_ptimer (utils.c)  Create Timer ErtsSmpPTimer for T2
>>> >>> 13. Thread B  free_ptimer (utils.c)             Free ErtsSmpPTimer memory block
>>> >>> 14. Thread C  erts_create_smp_ptimer (utils.c)  Allocate ErtsSmpPTimer for T2, block reused!
>>> >>> 15. Thread C  erl_set_timer (time.c)            erl_set_timer invoked for T2
>>> >>> 16. Thread C  erl_set_timer (time.c)            Lock Timing Wheel
>>> >>> 17. Thread C  insert_timer (time.c)             Insert Timer T2
>>> >>> 18. Thread C  erl_set_timer (time.c)            Unlock Timing Wheel
>>> >>> 19. Thread A  erl_cancel_timer (time.c)         Lock Timing Wheel
>>> >>> 20. Thread A  erl_cancel_timer (time.c)         Remove ex-T1 == T2 from the timing wheel
>>> >>> 21. Thread A  erl_cancel_timer (time.c)         Unlock Timing Wheel
>>> >>>
>>> >>> See also attached diagram.
>>> >>>
>>> >>> It looks like one more mutex is required, to prevent the timeout callback
>>> >>> from releasing the ErtsSmpPTimer memory block if a cancel request has been
>>> >>> issued for the timer, and vice versa. The two points of control, timer
>>> >>> cancellation and timer expiration, should not interfere.
>>> >>> This bug happens only in SMP mode, since there an additional timer control
>>> >>> structure, ErtsSmpPTimer, is used between the emulator and the timing wheel.
>>> >>>
>>> >>> Mikael Pettersson wrote:
>>> >>>> Dmitriy Kargapolov writes:
>>> >>>> > When running erl with -smp +S 2 option, sometimes process gets
>>> >>>> > stuck in timer:sleep/1.
>>> >>>> > Process code looks like:
>>> >>>> >
>>> >>>> > some_receiver(State) ->
>>> >>>> >     NewState = receive
>>> >>>> >         % legal packet
>>> >>>> >         {some_keyword, Address, Port, Packet} ->
>>> >>>> >             State1 = handle_packet(Address, Port, Packet, State),
>>> >>>> >             timer:sleep(get_loop_delay()),
>>> >>>> >             State1;
>>> >>>> >         % unknown message
>>> >>>> >         _ ->
>>> >>>> >             State
>>> >>>> >     end,
>>> >>>> >     some_receiver(NewState).
>>> >>>> >
>>> >>>> > Delay value varies in range 1..999
>>> >>>> >
>>> >>>> > Since timer:sleep/1 implemented as:
>>> >>>> > sleep(T) ->
>>> >>>> >     receive
>>> >>>> >     after T -> ok
>>> >>>> >     end.
>>> >>>> > it seems to be problem with "after" in smp implementation in R11B-0
>>> >>>> >
>>> >>>> > I don't have more details yet but will continue testing.
>>> >>>> > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
>>> >>>>
>>> >>>> Interesting. Please send us a small standalone module that exhibits
>>> >>>> the bug, and I'll see if I can reproduce it.
>>> >>>>
>>> >>>> /Mikael
>>> >>>>
>>> >>>
>>> ------------------------------------------------------------------------
>>> >>>
>>> >>> _______________________________________________
>>> >>> erlang-questions mailing list
>>> >>> erlang-questions@erlang.org
>>> >>> http://www.erlang.org/mailman/listinfo/erlang-questions
>>> > _______________________________________________
>>> > erlang-bugs mailing list
>>> > erlang-bugs@erlang.org
>>> > http://www.erlang.org/mailman/listinfo/erlang-bugs
>>> >
>>>
>>>
>>>
>>>
>>>
>>> --- otp_src_R11B-2/erts/emulator/beam/utils.c 2006-11-06 14:51:50.000000000 +0100
>>> +++ otp_src_R11B-2.ptimer_patch/erts/emulator/beam/utils.c 2006-12-27 18:11:44.772758000 +0100
>>> @@ -2999,15 +2999,16 @@
>>> static void
>>> ptimer_timeout(ErtsSmpPTimer *ptimer)
>>> {
>>> - if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>>> if (is_internal_pid(ptimer->timer.id)) {
>>> Process *p;
>>> - p = erts_pid2proc(NULL,
>>> - 0,
>>> - ptimer->timer.id,
>>> - ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS);
>>> + p = erts_pid2proc_opt(NULL,
>>> + 0,
>>> + ptimer->timer.id,
>>> + ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS,
>>> + ERTS_P2P_FLG_ALLOW_OTHER_X);
>>> if (p) {
>>> - if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>>> + if (!p->is_exiting
>>> + && !(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>>> ASSERT(*ptimer->timer.timer_ref == ptimer);
>>> *ptimer->timer.timer_ref = NULL;
>>> (*ptimer->timer.timeout_func)(p);
>>> @@ -3028,7 +3029,6 @@
>>> erts_smp_io_unlock();
>>> }
>>> }
>>> - }
>>> free_ptimer(ptimer);
>>> }
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> erlang-bugs mailing list
>>> erlang-bugs@erlang.org
>>> http://www.erlang.org/mailman/listinfo/erlang-bugs
>>>
>>>
>>>
>>
>
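
Steps 11 to 20 of Dmitriy's trace quoted above can be replayed on a single thread to make the block-reuse hazard easier to see. This is only a rough sketch: the one-slot pool and every name in it (ptimer_block, create_ptimer, free_ptimer_block, cancel_by_pointer, replay_stale_cancel) are hypothetical stand-ins, not the emulator's real code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for ErtsSmpPTimer; NOT the real ERTS structure. */
typedef struct {
    int  id;      /* which logical timer currently owns the block */
    bool active;  /* true while the timer sits in the simulated timing wheel */
} ptimer_block;

/* Hypothetical one-slot pool: a free followed by an allocate hands back
 * the very same block, as in step 14 of the trace ("block reused!"). */
static ptimer_block slot;
static bool slot_in_use = false;

static ptimer_block *create_ptimer(int id) {
    if (slot_in_use) return NULL;
    slot_in_use = true;
    slot.id = id;
    slot.active = true;
    return &slot;
}

static void free_ptimer_block(ptimer_block *t) {
    t->active = false;
    slot_in_use = false;
}

/* Cancel through a raw pointer: the caller cannot tell whether the block
 * has been recycled for another timer in the meantime. */
static void cancel_by_pointer(ptimer_block *t) {
    t->active = false;
}

/* Replays the interleaving sequentially. Returns the id of the timer that
 * the stale cancel actually removed, or 0 if it removed nothing. */
static int replay_stale_cancel(void) {
    ptimer_block *t1 = create_ptimer(1);  /* thread A sets T1 */
    ptimer_block *stale = t1;             /* A keeps this while waiting for the wheel lock */
    free_ptimer_block(t1);                /* thread B: T1 expired, block freed */
    ptimer_block *t2 = create_ptimer(2);  /* thread C: same block reused for T2 */
    cancel_by_pointer(stale);             /* thread A finally cancels "T1" */
    return (t2 != NULL && !t2->active) ? t2->id : 0;
}
```

In this sketch replay_stale_cancel() reports timer 2 as the one removed, which matches the symptom in the report: T2 never fires, so the process waiting in `receive ... after` never wakes up.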

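
Rickard's note that ptimer_timeout() has to acquire the process lock before looking at the ptimer flags can be illustrated with a minimal sketch. It assumes a plain pthread mutex standing in for the process lock, and assumes that the cancel path sets the flag while holding that same lock; proc_t, ptimer_t, cancel_ptimer and ptimer_timeout_sketch are hypothetical names, not the ERTS API.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins; NOT the real ERTS Process / ErtsSmpPTimer types. */
typedef struct {
    pthread_mutex_t lock;          /* plays the role of the process lock */
    int timeouts_delivered;        /* how many timeouts reached the process */
} proc_t;

typedef struct {
    proc_t *owner;
    bool cancelled;                /* plays the role of ERTS_PTMR_FLG_CANCELLED */
} ptimer_t;

/* Cancel path (assumption: the flag is only written under the owner's lock). */
static void cancel_ptimer(ptimer_t *t) {
    pthread_mutex_lock(&t->owner->lock);
    t->cancelled = true;
    pthread_mutex_unlock(&t->owner->lock);
}

/* Expiry path: take the lock first and only then read the flag, so a
 * concurrent cancel is either fully visible or has not started yet.
 * The removed optimization read the flag once before locking. */
static bool ptimer_timeout_sketch(ptimer_t *t) {
    bool fired = false;
    pthread_mutex_lock(&t->owner->lock);
    if (!t->cancelled) {
        t->owner->timeouts_delivered++;
        fired = true;
    }
    pthread_mutex_unlock(&t->owner->lock);
    return fired;
}
```

Calling ptimer_timeout_sketch() on a timer that was cancelled beforehand returns false and delivers nothing, while an uncancelled timer delivers exactly one timeout. The sketch leaves out the `is_exiting` check and the port branch that the real patch also touches.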
Guest
Posted: Fri Dec 29, 2006 4:51 pm Reply with quote
Guest
Excellent -- many thanks again for fixing it so quickly.

Chris
