| Author |
Message |
|
| anders_n |
Posted: Tue Oct 23, 2007 7:07 pm |
|
|
|
User
Joined: 28 Feb 2005
Posts: 155
Location: Saltillo, Mexico
|
To summarize my progress on the widefinder problem:
A few days ago I started with Steve Vinoski's tbray16.erl.
As a baseline, on my 1.66 GHz dual-core Centrino
laptop running Linux,
tbray16 takes
real 0m7.067s
user 0m12.377s
sys 0m0.584s
I removed the dict used for the shift table,
and changed the min_heap_size.
That gave
real 0m2.713s
user 0m4.168s
sys 0m0.412s
(see tbray_tuple.erl and wfbm4_tuple.erl)
Steve reported that it ran in ~1.9 s on his 8 core server.
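A minimal sketch of the two changes described above. This is not the code from tbray_tuple.erl or wfbm4_tuple.erl, which is not shown in this thread. It assumes a Horspool-style shift table, and the module name, function names, and the 8192-word heap size are placeholders chosen for illustration.

```erlang
-module(shift_sketch).
-export([shift_table/1, shift/2, spawn_worker/1]).

%% Build a 256-slot tuple of shift distances for Pattern.
%% Slot C + 1 holds the shift for byte value C (tuples are 1-indexed).
shift_table(Pattern) when is_binary(Pattern), byte_size(Pattern) > 0 ->
    Len = byte_size(Pattern),
    %% Every byte except the last one gets a shift smaller than Len.
    Prefix = binary_to_list(binary:part(Pattern, 0, Len - 1)),
    Inits = [{C + 1, Len - I}
             || {I, C} <- lists:zip(lists:seq(1, Len - 1), Prefix)],
    %% For a repeated byte, make_tuple/3 keeps the last entry in the list,
    %% which is the rightmost occurrence and therefore the smallest shift.
    erlang:make_tuple(256, Len, Inits).

%% Constant-time lookup, replacing a dict lookup per inspected byte.
shift(Byte, Table) ->
    element(Byte + 1, Table).

%% Start a worker with a larger initial heap (size in words), so that it
%% does not have to grow its heap through repeated garbage collections.
%% The value 8192 is an assumption, not the setting used in the thread.
spawn_worker(Fun) ->
    spawn_opt(Fun, [{min_heap_size, 8192}]).
```

The tuple works here because the key space is fixed (one slot per byte value), so element/2 can replace a general-purpose dictionary.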
Then I removed the dicts that were used for collecting the
matches and used ets instead, and got some improvement
on my dual core laptop.
real 0m2.220s
user 0m3.252s
sys 0m0.344s
(see tbray_ets.erl and wfbm4_ets.erl)
Interestingly Steve reported that it actually performed
worse on his 8 core server.
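A minimal sketch of counting matches in an ets table instead of a dict. It is not the code from tbray_ets.erl. The module and function names are hypothetical, and ets:update_counter/4 with a default object comes from OTP releases later than the ones used in this thread.

```erlang
-module(ets_count_sketch).
-export([new/0, bump/2, count/2]).

%% A public table, so that several worker processes can update it.
new() ->
    ets:new(counts, [set, public]).

%% Atomically add 1 to the counter for Key, creating it at 0 if absent.
%% Returns the new count.
bump(Tab, Key) ->
    ets:update_counter(Tab, Key, 1, {Key, 0}).

%% Read the current count for Key (0 if Key was never bumped).
count(Tab, Key) ->
    case ets:lookup(Tab, Key) of
        [{Key, N}] -> N;
        [] -> 0
    end.
```

Because the update is destructive and atomic, no per-worker dict has to be built and merged afterwards. The cost is that all workers touch the same table.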
These versions all read the whole file into memory at the start.
On my laptop that takes ~400ms (when the file is already cached
in the OS).
So I changed it to read the file in chunks and spawn a worker
after each chunk is read.
tbray_blockread with 4 processes
real 0m1.992s
user 0m3.176s
sys 0m0.420s
(see tbray_blockread.erl and wfbm4_ets.erl)
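A minimal sketch of the read-a-block, spawn-a-worker idea, under simplifying assumptions. It is not tbray_blockread.erl. The workers here only count newlines, because that count does not depend on where block boundaries fall. A real matcher has to handle lines that straddle two blocks. All names are hypothetical.

```erlang
-module(blockread_sketch).
-export([count_newlines/2]).

%% Read Path in blocks of BlockSize bytes. A worker is spawned for each
%% block as soon as it has been read, so scanning overlaps with reading.
count_newlines(Path, BlockSize) ->
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    Spawned = read_loop(Fd, BlockSize, self(), 0),
    ok = file:close(Fd),
    collect(Spawned, 0).

read_loop(Fd, BlockSize, Parent, Spawned) ->
    case file:read(Fd, BlockSize) of
        {ok, Block} ->
            spawn_link(fun() -> Parent ! {done, newlines(Block)} end),
            read_loop(Fd, BlockSize, Parent, Spawned + 1);
        eof ->
            Spawned
    end.

newlines(Block) ->
    length(binary:matches(Block, <<"\n">>)).

%% Wait for one result per spawned worker and sum them.
collect(0, Acc) -> Acc;
collect(N, Acc) ->
    receive {done, C} -> collect(N - 1, Acc + C) end.
```

Note that this sketch puts no limit on how many blocks are alive at once. If the workers are slower than the reader, memory use grows with the file size.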
Running it in the erlang shell it takes ~1.8s.
Just starting and stopping the VM takes
time erl -pa ../../bfile/ebin/ -smp -noshell -run init stop
real 0m1.229s
user 0m0.208s
sys 0m0.020s
It would be interesting to see how it runs on other machines,
with more cores.
/Anders
Post received from mailing list |
|
|
| Back to top |
|
| Guest |
Posted: Tue Oct 23, 2007 8:44 pm |
|
|
|
Guest
|
On 10/23/07, Anders Nygren <anders.nygren@gmail.com (anders.nygren@gmail.com)> wrote:Quote: To summarize my progress on the widefinder problem
A few days ago I started with Steve Vinoski's tbray16.erl
As a baseline on my 1.66 GHz dual core Centrino
laptop, Linux,
tbray16
real |
|
|
| Back to top |
|
| anders_n |
Posted: Tue Oct 23, 2007 9:15 pm |
|
|
|
User
Joined: 28 Feb 2005
Posts: 155
Location: Saltillo, Mexico
|
In the last email I mentioned that
" Just starting and stopping the VM takes
time erl -pa ../../bfile/ebin/ -smp -noshell -run init stop
real 0m1.229s
user 0m0.208s
sys 0m0.020s"
But I just realized that a more useful measure for basic
startup and shutdown is
time erl -pa ../../bfile/ebin/ -smp -noshell -run erlang halt
real 0m0.201s
user 0m0.180s
sys 0m0.016s
/Anders
_______________________________________________
erlang-questions mailing list
erlang-questions@erlang.org
http://www.erlang.org/mailman/listinfo/erlang-questions
Post received from mailing list |
|
|
| Back to top |
|
| Guest |
Posted: Tue Oct 23, 2007 9:43 pm |
|
|
|
Guest
|
On 10/23/07, Anders Nygren <anders.nygren@gmail.com (anders.nygren@gmail.com)> wrote:Quote: On 10/23/07, Anders Nygren <anders.nygren@gmail.com (anders.nygren@gmail.com)> wrote:
> To summarize my progress on the widefinder problem
> A few days ago I started with Steve Vinoski's tbray16.erl
> As a baseline on my 1.66 GHz dual core Centrino
> laptop, Linux,
> tbray16
> real |
|
|
| Back to top |
|
| anders_n |
Posted: Tue Oct 23, 2007 10:19 pm |
|
|
|
User
Joined: 28 Feb 2005
Posts: 155
Location: Saltillo, Mexico
|
On 10/23/07, Steve Vinoski <vinoski@ieee.org> wrote:
> On 10/23/07, Anders Nygren <anders.nygren@gmail.com> wrote:
> > To summarize my progress on the widefinder problem
> > A few days ago I started with Steve Vinoski's tbray16.erl
> > As a baseline on my 1.66 GHz dual core Centrino
> > laptop, Linux,
> > tbray16
> > real 0m7.067s
> > user 0m12.377s
> > sys 0m0.584s
>
> Anders, thanks for collecting and posting these. I've just performed a set
> of new timings for all of them, as listed below. For each, I just ran this
> command:
>
> time erl -smp -noshell -run <test_case> main o1000k.ap >/dev/null
>
> where "<test_case>" is the name of the tbray test case file. All were looped
> ten times, and I took the best timing for each. All tests were done on my
> 8-core 2.33 GHz dual Intel Xeon with 2 GB RAM Linux box, in a local
> (non-NFS) directory.
>
I don't keep track of the finer details of different CPUs, but I have
a vague memory that the 8-core Xeon is really two 4-core CPUs
on one chip. Is that correct?
The reason I am asking is that I cannot figure out why your
measurements have shorter real times than mine, but more
than twice the user time.
Also, it does not seem to scale so well up to 8 cores.
Steve's best time is 0m1.546s and mine was 0m1.992s.
Steve, can you also do some tests on tbray_blockread using
different numbers of worker processes? A smaller block
size means that we start using all the cores earlier.
> My original tbray16 runs in
>
>
> real 0m3.162s
> user 0m16.513s
> sys 0m1.762s
> > I removed the dict used for the shift table,
> > and changed the min_heap_size.
> > That gave
> > real 0m2.713s
> > user 0m4.168s
> > sys 0m0.412s
> >
> > (see tbray_tuple.erl and wfbm4_tuple.erl)
> > Steve reported that it ran in ~1.9 s on his 8 core server.
>
>
> What I get for tbray_tuple is:
>
> real 0m2.285s
> user 0m8.615s
> sys 0m0.988s
>
>
> > Then I removed the dicts that were used for collecting the
> > matches and used ets instead, and got some improvement
> > on my dual core laptop.
> > real 0m2.220s
> > user 0m3.252s
> > sys 0m0.344s
> >
> > (see tbray_ets.erl and wfbm4_ets.erl)
> >
> > Interestingly Steve reported that it actually performed
> > worse on his 8 core server.
>
> The discrepancy seems to be gone. With your new file that you supplied in
> your message, the official timing for tbray_ets on the 8-core is:
>
>
> real 0m1.868s
> user 0m7.416s
> sys 0m0.509s
>
>
> > These versions all read the whole file into memory at the start.
> > On my laptop that takes ~400ms (when the file is already cached
> > in the OS).
> >
> > So I changed it to read the file in chunks and spawn a worker
> > after each chunk is read.
> >
> > tbray_blockread with 4 processes
> > real 0m1.992s
> > user 0m3.176s
> > sys 0m0.420s
> >
> > (see tbray_blockread.erl and wfbm4_ets.erl)
> >
> > Running it in the erlang shell it takes ~1.8s.
>
>
> Interestingly, some of my earlier attempts tried to overlap block reads and
> worker spawning, but the results were always worse, so that's why I went to
> reading in the whole file. This blockread approach may very well be The
> Ultimate Wide Finder.
>
> Timing for tbray_blockread on the 8-core:
>
> real 0m1.546s
> user 0m7.337s
> sys 0m0.662s
>
>
> > Just starting and stopping the VM takes
> > time erl -pa ../../bfile/ebin/ -smp -noshell -run init stop
> >
> > real 0m1.229s
> > user 0m0.208s
> > sys 0m0.020s
>
> On the 8-core this takes:
>
> real 0m1.093s
> user 0m0.072s
> sys 0m0.012s
>
> > It would be interesting to see how it runs on other machines,
> > with more cores.
>
> Tim Bray is traveling at the moment, but he told me by email that he hopes
> to get back to measuring these on the T5120 next week.
>
> thanks,
> --steve
Post received from mailing list |
|
|
| Back to top |
|
| Guest |
Posted: Tue Oct 23, 2007 10:58 pm |
|
|
|
Guest
|
On 10/23/07, Anders Nygren <anders.nygren@gmail.com (anders.nygren@gmail.com)> wrote:Quote: On 10/23/07, Steve Vinoski <vinoski@ieee.org (vinoski@ieee.org)> wrote:
> On 10/23/07, Anders Nygren <anders.nygren@gmail.com (anders.nygren@gmail.com)> wrote:
> > To summarize my progress on the widefinder problem
> > A few days ago I started with Steve Vinoski's tbray16.erl
> > As a baseline on my 1.66 GHz dual core Centrino
> > laptop, Linux,
> > tbray16
> > real |
|
|
| Back to top |
|
| anders_n |
Posted: Tue Oct 23, 2007 11:44 pm |
|
|
|
User
Joined: 28 Feb 2005
Posts: 155
Location: Saltillo, Mexico
|
On 10/23/07, Steve Vinoski <vinoski@ieee.org> wrote:
> On 10/23/07, Anders Nygren <anders.nygren@gmail.com> wrote:
> > On 10/23/07, Steve Vinoski <vinoski@ieee.org> wrote:
> > > On 10/23/07, Anders Nygren <anders.nygren@gmail.com> wrote:
> > > > To summarize my progress on the widefinder problem
> > > > A few days ago I started with Steve Vinoski's tbray16.erl
> > > > As a baseline on my 1.66 GHz dual core Centrino
> > > > laptop, Linux,
> > > > tbray16
> > > > real 0m7.067s
> > > > user 0m12.377s
> > > > sys 0m0.584s
> > >
> > > Anders, thanks for collecting and posting these. I've just performed a
> > > set of new timings for all of them, as listed below. For each, I just
> > > ran this command:
> > >
> > > time erl -smp -noshell -run <test_case> main o1000k.ap >/dev/null
> > >
> > > where "<test_case>" is the name of the tbray test case file. All were
> > > looped ten times, and I took the best timing for each. All tests were
> > > done on my 8-core 2.33 GHz dual Intel Xeon with 2 GB RAM Linux box, in
> > > a local (non-NFS) directory.
> > >
> >
> > I don't keep track of the finer details of different CPUs, but I have
> > a vague memory that the 8-core Xeon is really two 4-core CPUs
> > on one chip. Is that correct?
>
> Yes, I believe so.
> > The reason I am asking is that I cannot figure out why your
> > measurements have shorter real times than mine, but more
> > than twice the user time.
>
> It's because the user time includes CPU time on all the cores. More cores,
> and more things happening on those cores, means more CPU time and thus more
> user time. Tim saw the same phenomenon on his T5120 and blogged about it
> here:
>
But user time is supposed to be the time used executing instructions
for the process and its children, i.e. the CPU time used to solve
the task. So the user time should ideally remain constant when
more cores are added, and the real time should ideally be divided
by the number of cores.
But also Tim said
"Further poking dug up the answer: it seems that the hardware doesn't
tell the OS how it's sharing out the cycles among the threads that
it has runnable at any point in time. So Solaris just credits them
with user CPU time whenever they're in Run state. The results will be
correct when you have up to sixteen threads staying runnable; above
that they get funky. "
So basically you cannot trust the user time on the T1 or T2. But I
don't think that applies to other processors as well.
> <http://www.tbray.org/ongoing/When/200x/2007/10/09/Niagara-2-T2-T5120>
> > Also, it does not seem to scale so well up to 8 cores.
> > Steve's best time is 0m1.546s and mine was 0m1.992s.
>
> The default settings in the code are probably not ideal for the 8-core box.
> > Steve, can you also do some tests on tbray_blockread using
> > different numbers of worker processes? A smaller block
> > size means that we start using all the cores earlier.
>
>
> I ran a series of tests of different block sizes, and I found that for the 8
> core, dividing the file into 1024 chunks (for this file, this means a block
> size of 230606 bytes) produced the best time:
Yes, I also got the impression that there was an optimum around
a 200k block size. But the sample-to-sample variation is large enough
that I was not sure.
>
>
> real 0m1.103s
> user 0m6.651s
> sys 0m0.492s
>
> Which is pretty darn fast. Smaller chunk sizes are slower probably
> because there's more result collecting and merging to do,
There is no merging or collecting of results in the ets based versions.
I think the slowdown for even smaller blocks is because of the
scheduling of, and switching between, all the worker processes.
> while larger chunk
> sizes are slower because parallelism is reduced.
>
> I can't wait to see this thing run on Tim's T5120.
>
> BTW, I got a comment on my blog today from someone who essentially said I
> was making Erlang look bad by applying it to a problem for which it's not a
> good fit. My response was that I didn't agree; Tim's original goal was to
> maximize the use of a multicore system for solving the Wide Finder, and
> Erlang now does that better than anything else I've seen so far. Does anyone
> in the Erlang community agree with the person who made that comment that
> this Wide Finder project has made Erlang look bad?
I think Tim was unnecessarily strong in his initial comments.
He did a naive beginner's solution, and when it performed badly he
said that Erlang sucks (more or less), instead of asking how to
improve it.
It has been an interesting exercise, and quite useful for me, since
I am currently looking at a system that needs to process ~1 terabyte
of log files per day. Fortunately they are BER coded.
/Anders
Post received from mailing list |
|
|
| Back to top |
|
| Thomas Lindgren |
Posted: Wed Oct 24, 2007 10:43 am |
|
|
|
User
Joined: 09 Mar 2005
Posts: 284
|
--- Steve Vinoski <vinoski@ieee.org> wrote:
> Anders, thanks for collecting and posting these.
> I've just performed a set
> of new timings for all of them, as listed below. For
> each, I just ran this
> command:
>
> time erl -smp -noshell -run <test_case> main
> o1000k.ap >/dev/null
>
> where "<test_case>" is the name of the tbray test
> case file. All were
> looped ten times, and I took the best timing for
> each. All tests were done
> on my 8-core 2.33 GHz dual Intel Xeon with 2 GB RAM
> Linux box, in a local
> (non-NFS) directory.
So, looking at Steve's results on his 8-core system,
we have:
             real   user   tbray5/real  user/real
tbray5       9.8    --     1.0          --
tbray14      6.63   34.53  1.48         5.21
tbray15      4.12   25.14  2.38         6.10
tbray16      3.16   16.15  3.10         5.11
tbray_tuple  2.28   8.61   4.30         3.78
tbray_ets    1.87   7.42   5.24         3.97
tbray_blkr   1.55   7.34   6.32         4.74
tbray5/real is the speedup versus the baseline, while
user/real is the speedup for each version due to
parallelization.
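The two derived columns are plain ratios of the measured times. A small sketch of the arithmetic, with a hypothetical module and function name:

```erlang
-module(speedup_sketch).
-export([ratios/3]).

%% BaselineReal: real time of the baseline version (tbray5), in seconds.
%% Real, User:   real and user time of the version being compared.
%% Returns {speedup versus baseline, parallel speedup}.
ratios(BaselineReal, Real, User) ->
    {BaselineReal / Real, User / Real}.
```

For tbray_blkr, ratios(9.8, 1.55, 7.34) gives roughly {6.32, 4.74}, which matches the last row of the table.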
Thus, the latest version is 6.3 times faster than the
first one. The parallel speedup is about the same in
tbray5 and tbray_blkr, a very decent utilization of
>50%, but the amount of work (user) has shrunk from
(presumably more than) 34.53 seconds to 7.34 seconds.
Tim Bray's original Erlang number on "his macbook"
appears to be 34.16 seconds user (probably about the
same real?). How does this compare to Ruby? Tim Bray
reported that it needed 3.46 seconds real, again on
his macbook. (As I understand it, all results here are
for the big data set.)
Best,
Thomas
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
Post received from mailing list |
|
|
| Back to top |
|
| Guest |
Posted: Wed Oct 24, 2007 12:03 pm |
|
|
|
Guest
|
On 10/24/07, Thomas Lindgren <thomasl_erlang@yahoo.com (thomasl_erlang@yahoo.com)> wrote:Quote:
--- Steve Vinoski <vinoski@ieee.org (vinoski@ieee.org)> wrote:
Anders, thanks for collecting and posting these.
> I've just performed a set
> of new timings for all of them, as listed below. For
> each, I just ran this
> command:
>
> time erl -smp -noshell -run <test_case> main
> o1000k.ap >/dev/null
>
> where "<test_case>" is the name of the tbray test
> case file. All were
> looped ten times, and I took the best timing for
> each. All tests were done
> on my 8-core 2.33 GHz dual Intel Xeon with 2 GB RAM
> Linux box, in a local
> (non-NFS) directory.
So, looking at Steve's results on his 8-core system,
we have:
|
|
|
| Back to top |
|
| dcaoyuan |
Posted: Thu Oct 25, 2007 1:19 pm |
|
|
|
User
Joined: 28 Mar 2007
Posts: 34
|
It seems heap size is really a key for binary processing, and there
are other tips for binary processing too. With a proper heap size set,
the straightforward Erlang code (in 80 LOC) can achieve around 3.1 sec
on my 4-CPU Linux box (the Ruby code took about 4.1 sec on the same
machine). The code is pasted at:
http://blogtrader.net/page/dcaoyuan/entry/learning_coding_binary_was_tim
With the default heap size, the code may take 4.8+ sec.
On 10/24/07, Steve Vinoski <vinoski@ieee.org> wrote:
>
> On 10/24/07, Thomas Lindgren <thomasl_erlang@yahoo.com> wrote:
> >
> > --- Steve Vinoski <vinoski@ieee.org> wrote:
> >
> > Anders, thanks for collecting and posting these.
> > > I've just performed a set
> > > of new timings for all of them, as listed below. For
> > > each, I just ran this
> > > command:
> > >
> > > time erl -smp -noshell -run <test_case> main
> > > o1000k.ap >/dev/null
> > >
> > > where "<test_case>" is the name of the tbray test
> > > case file. All were
> > > looped ten times, and I took the best timing for
> > > each. All tests were done
> > > on my 8-core 2.33 GHz dual Intel Xeon with 2 GB RAM
> > > Linux box, in a local
> > > (non-NFS) directory.
> >
> > So, looking at Steve's results on his 8-core system,
> > we have:
> >
> >              real   user   tbray5/real  user/real
> > tbray5       9.8    --     1.0          --
> > tbray14      6.63   34.53  1.48         5.21
> > tbray15      4.12   25.14  2.38         6.10
> > tbray16      3.16   16.15  3.10         5.11
> > tbray_tuple  2.28   8.61   4.30         3.78
> > tbray_ets    1.87   7.42   5.24         3.97
> > tbray_blkr   1.55   7.34   6.32         4.74
> >
> > tbray5/real is the speedup versus the baseline, while
> > user/real is the speedup for each version due to
> > parallelization.
> >
> > Thus, the latest version is 6.3 times faster than the
> > first one. The parallel speedup is about the same in
> > tbray5 and tbray_blkr, a very decent utilization of
> > >50%, but the amount of work (user) has shrunk from
> > (presumably more than) 34.53 seconds to 7.34 seconds.
> >
> > Tim Bray's original Erlang number on "his macbook"
> > appears to be 34.16 seconds user (probably about the
> > same real?). How does this compare to Ruby? Tim Bray
> > reported that it needed 3.46 seconds real, again on
> > his macbook. (As I understand it, all results here are
> > for the big data set.)
> >
>
> Yes, all results are for o1000k.ap, Tim's original large dataset. As for
> Ruby, I just ran Tim's original code on the 8-core, and out of ten attempts
> the best was:
>
>
> real 0m2.210s
> user 0m2.095s
> sys 0m0.109s
> --steve
--
- Caoyuan
Post received from mailing list |
|
|
| Back to top |
|
| dcaoyuan |
Posted: Fri Oct 26, 2007 2:56 pm |
|
|
|
User
Joined: 28 Mar 2007
Posts: 34
|
For my code, the best +h Size option is 8192, and the best block size
of binary for processing and reading is 10240000 to 20480000 bytes, per
my testing. The result is now about 2.97 sec vs. Ruby's 4.1 sec on a
2.8 GHz 4-CPU box.
Attached is the newest code with some cleanup. To evaluate:
$ erlc -smp tbray5.erl
$ time erl +h 8192 -smp -noshell -run tbray5 start o1000k.ap -s erlang halt
real 0m2.972s
user 0m9.685s
sys 0m0.748s
--
- Caoyuan
Post received from mailing list |
|
|
| Back to top |
|
| anders_n |
Posted: Fri Oct 26, 2007 10:06 pm |
|
|
|
User
Joined: 28 Feb 2005
Posts: 155
Location: Saltillo, Mexico
|
So I have a new version that I think will break the 1 second barrier
on Steve's 8-core box.
The best I have seen on my dual core laptop is
real: 0m1.689s
user: 0m2.2756s
sys: 0m0.396s
The changes relative to my latest posted tbray_blockread.erl are
- reading the file is in a separate process
- never bind variables to sub binaries unless absolutely necessary
- only have a limited number of worker processes at any time
One lesson from this exercise is that binding variables to sub binaries
can be bad for performance. The result of changing the code to not bind
variables to sub binaries can be seen in the garbage collection statistics.
wfinder, (an unreleased version that ran in 1.050s on Steve's 8-core)
garbage collections: 46302
words reclaimed: 501768347
wfinder1
garbage collections: 13917
words reclaimed: 384561741
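A minimal sketch of what "not binding variables to sub binaries" can look like, together with a way to read the two counters quoted above. This is not the wfinder code. The names are hypothetical. Note also that current compilers usually optimize a tail-recursive `<<C, Rest/binary>>` loop so that no sub binary is created, so the difference may be much smaller today than it was in 2007.

```erlang
-module(subbin_sketch).
-export([count_byte/2, gc_delta/1]).

%% Count occurrences of Byte in Bin by advancing an integer offset.
%% No variable is ever bound to a tail of Bin.
count_byte(Bin, Byte) when is_binary(Bin), is_integer(Byte) ->
    count_byte(Bin, Byte, 0, byte_size(Bin), 0).

count_byte(_Bin, _Byte, Pos, Size, Acc) when Pos >= Size ->
    Acc;
count_byte(Bin, Byte, Pos, Size, Acc) ->
    case Bin of
        %% Pos and Byte are already bound, so this only tests one byte.
        <<_:Pos/binary, Byte, _/binary>> ->
            count_byte(Bin, Byte, Pos + 1, Size, Acc + 1);
        _ ->
            count_byte(Bin, Byte, Pos + 1, Size, Acc)
    end.

%% Run Fun and report how the VM-wide garbage collection counters
%% (number of collections, words reclaimed) changed in the meantime.
gc_delta(Fun) ->
    {GCs0, Words0, _} = statistics(garbage_collection),
    Result = Fun(),
    {GCs1, Words1, _} = statistics(garbage_collection),
    {Result, GCs1 - GCs0, Words1 - Words0}.
```

The counters are global to the VM, so a comparison between two versions is only meaningful when nothing else is running in the same node.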
/Anders
Post received from mailing list |
|
|
| Back to top |
|
| Guest |
Posted: Sun Oct 28, 2007 7:27 am |
|
|
|
Guest
|
Hello,
These results are interesting, but I object to this kind of solution.
Your and Steve's approaches have some caveats.
1/ The file is read entirely into memory. When the workers are slow
enough, this can happen in principle. 200 MB of Tim Bray's data is not
a problem on your 8-CPU box, but what if the file is bigger? What about
1 GB? No problem? And 1 TB? Still no problem? I know that current I/O
hardware can't deliver data fast enough to cause a problem for this
simple exercise of Tim Bray's (you also don't flush caches between
measurements, and the workers on the 8-CPU box are still fast enough),
but it is a problem in principle.
2/ The workers share a resource (the ets table), and that is bad in
principle. If you have a task that is more CPU-intensive, so that you
need more CPUs than the current task does to keep up with your input
bandwidth, and that at the same time produces more results, you are in
trouble again.
In conclusion, I think your solution scales badly at both ends. With a
small number of CPUs, you run out of memory on larger datasets. With
more CPUs, you hit the bottleneck of your shared resource. Of course,
Tim Bray's exercise is more CPU-intensive than result-intensive, so you
don't fall into the bottleneck trap. And file reading on current
hardware must be sequential, and I/O performance is so poor that 8 CPUs
are enough to consume data faster than the I/O can produce it, so you
don't run out of memory. But I think Tim Bray's exercise is not about
tuning a solution for this one task. I think it is about the multicore
crisis and solutions in principle.
Cheers,
--Hynek (Pichi) Vychodil
Post received from mailing list |
|
|
| Back to top |
|
| Thomas Lindgren |
Posted: Sun Oct 28, 2007 12:34 pm |
|
|
|
User
Joined: 09 Mar 2005
Posts: 284
|
--- Hynek Vychodil <vychodil.hynek@gmail.com> wrote:
> Hello,
> These results are interesting, but I object to this
> kind of solution. Your and Steve's approaches have
> some caveats.
>
> 1/ The file is read entirely into memory.
Hynek,
This is true for some versions, but not all. The
'block read' version reads the file in chunks.
> 2/ The workers share a resource (the ets table), and
> that is bad in principle. If you have a task that is
> more CPU-intensive, so that you need more CPUs than
> the current task does to keep up with your input
> bandwidth, and that at the same time produces more
> results, you are in trouble again.
Note that the ets table in all proposals but one is
managed by a single process. It is just used as a more
efficient data structure. So the potential problem
here is really if this process becomes a bottleneck.
So, we have so far looked at two extremes:
1. Every worker maintains a local count, these are
then merged into a global count.
2. A single process maintains the global count,
workers send it updates.
But if this becomes problematic, one could also
combine the two by having 1 to N centralized counting
processes to trade off the cost of merging versus the
cost of incrementally sending all counts to a
'master'. (And one could batch the sending of updates
too, come to think of it.)
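A minimal sketch of the batching idea: each worker counts locally and then sends one message per batch to a counting process, which merges it into the global totals. This is an illustration only, not code from any of the posted versions. It uses maps (which did not exist in Erlang at the time of this thread) where the posted versions use dict or ets, and all names are hypothetical.

```erlang
-module(batch_count_sketch).
-export([start_counter/0, count_keys/2, totals/1]).

%% The single process that owns the global counts.
start_counter() ->
    spawn(fun() -> counter_loop(#{}) end).

counter_loop(Totals) ->
    receive
        {batch, Local} ->
            counter_loop(merge(Local, Totals));
        {get, From, Ref} ->
            From ! {Ref, Totals},
            counter_loop(Totals)
    end.

%% Add every local count to the running totals.
merge(Local, Totals) ->
    maps:fold(fun(Key, N, Acc) ->
                      maps:update_with(Key, fun(Old) -> Old + N end, N, Acc)
              end, Totals, Local).

%% Worker side: count locally, then send one message for the whole
%% batch instead of one message per key.
count_keys(Counter, Keys) ->
    Local = lists:foldl(
              fun(Key, Acc) ->
                      maps:update_with(Key, fun(N) -> N + 1 end, 1, Acc)
              end, #{}, Keys),
    Counter ! {batch, Local},
    ok.

%% Ask the counter for its current totals.
totals(Counter) ->
    Ref = make_ref(),
    Counter ! {get, self(), Ref},
    receive {Ref, Totals} -> Totals end.
```

Starting several counters and hashing keys across them would give the "1 to N centralized counting processes" variant. The batch size then controls the trade-off between message traffic and merge cost.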
> In conclusion, I think your solution scales badly at
> both ends. With a small number of CPUs, you run out
> of memory on larger datasets.
Not necessarily. With the block read solution, it
doesn't seem like you run that risk.
The use of file:read_file/1 just showed that you
_could_ do fast I/O in Erlang, at a time when people
thought Erlang file I/O was very slow indeed. Showing
this was done by switching to a more suitable API
call. But you can be even more sophisticated than
that, e.g., by using file:pread.
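As a rough illustration of the file:pread approach (a sketch only, not the
code from tbray_blockread.erl): the function below folds over a file in
fixed-size chunks, reading each chunk at an explicit offset. The module name
pread_chunks and fold/4 are hypothetical, and the sketch ignores the fact
that a chunk boundary can fall in the middle of a log line.

```erlang
-module(pread_chunks).
-export([fold/4]).

%% fold(Path, ChunkSize, Fun, Acc0) calls Fun(Chunk, Acc) for every chunk
%% of at most ChunkSize bytes and returns the final accumulator.
fold(Path, ChunkSize, Fun, Acc0) ->
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    try
        loop(Fd, 0, ChunkSize, Fun, Acc0)
    after
        file:close(Fd)
    end.

loop(Fd, Offset, ChunkSize, Fun, Acc) ->
    case file:pread(Fd, Offset, ChunkSize) of
        {ok, Bin} ->
            %% Advance by the number of bytes actually read.
            loop(Fd, Offset + byte_size(Bin), ChunkSize, Fun, Fun(Bin, Acc));
        eof ->
            Acc
    end.
```

Only one chunk is held by the reader at a time, so memory use depends on
ChunkSize and not on the size of the file.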
> When
> you have more CPU, you fall in bottle neck of your
> shared resource.
Do you mean that the problem becomes I/O bound? Do
note that all sufficiently fast solutions will
ultimately be limited by a hardware bottleneck of some
sort: CPU, I/O, network ...
In this particular case, you could increase I/O
performance by, say, striping the disk. And you can
increase CPU performance by, say, distributing the
work to multiple hosts/nodes (fairly straightforward
with Erlang, by the way). But with these problems,
even with infinite hardware you will eventually run
into some sequential portion of the code, and that
will limit the speedup as per Amdahl's Law.
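For reference, Amdahl's Law says that if a fraction P of the run time can be
parallelised over N cores, the speedup is 1 / ((1 - P) + P / N). A small
sketch (the module name amdahl is made up, and the values of P used below
are illustrative, not measurements from this thread):

```erlang
-module(amdahl).
-export([speedup/2]).

%% P: fraction of the run time that can run in parallel (0.0 to 1.0).
%% N: number of cores.
speedup(P, N) when P >= 0, P =< 1, N >= 1 ->
    1 / ((1 - P) + P / N).
```

For example, amdahl:speedup(0.9, 8) is about 4.7, and no number of cores can
push it past 1 / (1 - 0.9) = 10, because the sequential tenth always remains.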
Best,
Thomas
|
| Guest |
Posted: Sun Oct 28, 2007 1:08 pm |
|
|
|
Guest
|
On 10/28/07, Thomas Lindgren <thomasl_erlang@yahoo.com> wrote:
>
> --- Hynek Vychodil <vychodil.hynek@gmail.com> wrote:
>
> > Hello,
> > These results are interesting, but I demur to kind
> > of solution. Your
> > and Steve's approach have some caveats.
> >
> > 1/ File is read all in memory.
>
> Hynek,
>
> This is true for some versions, but not all. The
> 'block read' version reads the file in chunks.
Which version do you mean? tbray_blockread.erl from
http://www.erlang.org/pipermail/erlang-questions/2007-October/030118.html
reads in chunks, but when the workers are slow you can still run out of
memory. Look at the scan_file/9 loop: there is no limit on the number of
blocks held in memory.
>
> > 2/ Workers share resource (ets table) and it is
> > principally bad. If
> > you have more CPU consuming task and you must use
> > more CPU than as
> > current task to consume your input data bandwidth
> > and simultaneously
> > more result extensive task, you fall in trouble
> > again.
>
> Note that the ets table in all proposals but one is
> managed by a single process. It is just used as a more
> efficient data structure. So the potential problem
> here is really if this process becomes a bottleneck.
>
> So, we have so far looked at two extremes:
>
> 1. Every worker maintains a local count, these are
> then merged into a global count.
>
> 2. A single process maintains the global count,
> workers send it updates.
>
> But if this becomes problematic, one could also
> combine the two by having 1 to N centralized counting
> processes to trade off the cost of merging versus the
> cost of incrementally sending all counts to a
> 'master'. (And one could batch the sending of updates
> too, come to think of it.)
>
> > As conclusion I think, your solution scale bad for
> > both end. When you
> > have small amount of CPUs, you run out memory on
> > larger datasets.
>
> Not necessarily. With the block read solution, it
> doesn't seem like you run that risk.
>
Yes, but where is this solution? I can't find it in this thread.
I may have missed some, but in the solutions I have read the pace is set
by the reader, and the reader does not wait for the workers.
>
> The use of file:read_file/1 just showed that you
> _could_ do fast I/O in Erlang, at a time when people
> thought Erlang file I/O was very slow indeed. Showing
> this was done by switching to a more suitable API
> call. But you can be even more sophisticated than
> that, e.g., by using file:pread.
>
> > When
> > you have more CPU, you fall in bottle neck of your
> > shared resource.
>
> Do you mean that the problem becomes I/O bound? Do
> note that all sufficiently fast solutions will
> ultimately be limited by a hardware bottleneck of some
> sort: CPU, I/O, network ...
>
> In this particular case, you could increase I/O
> performance by, say, striping the disk. And you can
> increase CPU performance by, say, distributing the
> work to multiple hosts/nodes (fairly straightforward
> with Erlang, by the way). But with these problems,
> even with infinite hardware you will eventually run
> into some sequential portion of the code, and that
> will limit the speedup as per Amdahl's Law.
>
Yes, you are right. There is no "best" solution. But we can at least make
a memory-safe solution.
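One possible shape for such a memory-safe reader, as a sketch under my own
assumptions (the module name bounded_reader, run/4 and the message format are
hypothetical, not from tbray_blockread.erl): the reader keeps a count of
blocks handed to workers and stops reading while that count is at the limit.

```erlang
-module(bounded_reader).
-export([run/4]).

%% run(Path, BlockSize, Max, Work): read Path in blocks of BlockSize bytes,
%% run Work(Block) in its own process, and never keep more than Max blocks
%% in flight. Returns the worker results in completion order.
run(Path, BlockSize, Max, Work) when Max >= 1 ->
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    try
        loop(Fd, BlockSize, Max, Work, 0, [])
    after
        file:close(Fd)
    end.

loop(Fd, BlockSize, Max, Work, InFlight, Acc) when InFlight < Max ->
    case file:read(Fd, BlockSize) of
        {ok, Block} ->
            Parent = self(),
            spawn_link(fun() -> Parent ! {done, Work(Block)} end),
            loop(Fd, BlockSize, Max, Work, InFlight + 1, Acc);
        eof ->
            drain(InFlight, Acc)
    end;
loop(Fd, BlockSize, Max, Work, InFlight, Acc) ->
    %% Limit reached: the reader waits for a worker before reading again.
    receive
        {done, R} ->
            loop(Fd, BlockSize, Max, Work, InFlight - 1, [R | Acc])
    end.

%% After eof, wait for the workers that are still running.
drain(0, Acc) ->
    Acc;
drain(N, Acc) ->
    receive
        {done, R} -> drain(N - 1, [R | Acc])
    end.
```

With this scheme the memory held in unprocessed blocks is bounded by roughly
Max * BlockSize, however slow the workers are.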
Cheers
-- Hynek (Pichi) Vychodil