| Author |
Message |
|
| anders_n |
Posted: Sun Oct 28, 2007 1:13 pm |
|
|
|
User
Joined: 28 Feb 2005
Posts: 155
Location: Saltillo, Mexico
|
On 10/28/07, Thomas Lindgren <thomasl_erlang@yahoo.com> wrote:
>
> --- Hynek Vychodil <vychodil.hynek@gmail.com> wrote:
>
> > Hello,
> > These results are interesting, but I demur to this kind
> > of solution. Your and Steve's approaches have some caveats.
> >
> > 1/ The whole file is read into memory.
>
> Hynek,
>
> This is true for some versions, but not all. The
> 'block read' version reads the file in chunks.
It is still "sort of" true for the blockread and later versions:
since there is no flow control, when the file is already cached
by the OS the reading is faster than the processing, and almost
all of the file ends up in memory.
I am aware of this but have not bothered to add flow control yet.
>
> > 2/ Workers share a resource (an ets table), which is bad in
> > principle. If you have a more CPU-intensive task, so that you
> > need more CPUs than the current task does to keep up with your
> > input data bandwidth, and at the same time a task that produces
> > more results, you run into trouble again.
>
> Note that the ets table in all proposals but one is
> managed by a single process. It is just used as a more
> efficient data structure. So the potential problem
> here is really if this process becomes a bottleneck.
>
> So, we have so far looked at two extremes:
>
> 1. Every worker maintains a local count, these are
> then merged into a global count.
>
> 2. A single process maintains the global count,
> workers send it updates.
>
> But if this becomes problematic, one could also
> combine the two by having 1 to N centralized counting
> processes to trade off the cost of merging versus the
> cost of incrementally sending all counts to a
> 'master'. (And one could batch the sending of updates
> too, come to think of it.)
>
I have not seen this as a problem yet, since there is a relatively
small number of concurrent workers. However, as the number of
cores grows it may become a problem.
An alternative is that each worker has its own ets table for its counters
and sends its results to the central ets table on termination.
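A minimal sketch of the "local table per worker, merge on termination" idea, not the code from the thread. The module and function names (local_counts, run/1, worker/2, merge/2) are made up for illustration, the input is simply a list of key lists, and ets:update_counter/4 with a default object is assumed to be available (it comes from a later OTP release than the one used in this thread).

```erlang
-module(local_counts).
-export([run/1]).

%% Chunks is a list of key lists; one worker is spawned per chunk.
%% Returns the merged {Key, Count} pairs, sorted by key.
run(Chunks) ->
    Central = ets:new(central, [set, private]),
    Parent = self(),
    Pids = [spawn_link(fun() -> worker(Parent, Keys) end) || Keys <- Chunks],
    lists:foreach(fun(Pid) -> merge(Central, Pid) end, Pids),
    Result = lists:sort(ets:tab2list(Central)),
    ets:delete(Central),
    Result.

%% Each worker counts into its own private table, so workers never
%% contend for the central table while they are scanning.
worker(Parent, Keys) ->
    Local = ets:new(local, [set, private]),
    lists:foreach(fun(K) -> ets:update_counter(Local, K, 1, {K, 0}) end, Keys),
    %% Hand over the totals once, just before terminating. The local
    %% table is freed automatically when the worker process exits.
    Parent ! {counts, self(), ets:tab2list(Local)}.

%% Only the owner of the central table touches it, once per worker.
merge(Central, Pid) ->
    receive
        {counts, Pid, Counts} ->
            lists:foreach(
              fun({K, N}) -> ets:update_counter(Central, K, N, {K, 0}) end,
              Counts)
    end.
```

With this layout the central process does one merge per worker instead of one update per match, at the price of holding a copy of each worker's counters until it terminates.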
> > In conclusion, I think your solution scales badly at
> > both ends. When you
> > have a small number of CPUs, you run out of memory on
> > larger datasets.
>
> Not necessarily. With the block read solution, it
> doesn't seem like you run that risk.
See above.
>
> The use of file:read_file/1 just showed that you
> _could_ do fast I/O in Erlang, at a time when people
> thought Erlang file I/O was very slow indeed. Showing
> this was done by switching to a more suitable API
> call. But you can be even more sophisticated than
> that, e.g., by using file:pread.
>
> > When
> > you have more CPUs, you hit the bottleneck of your
> > shared resource.
>
> Do you mean that the problem becomes I/O bound? Do
> note that all sufficiently fast solutions will
> ultimately be limited by a hardware bottleneck of some
> sort: CPU, I/O, network ...
>
> In this particular case, you could increase I/O
> performance by, say, striping the disk. And you can
> increase CPU performance by, say, distributing the
> work to multiple hosts/nodes (fairly straightforward
> with Erlang, by the way). But with these problems,
> even with infinite hardware you will eventually run
> into some sequential portion of the code, and that
> will limit the speedup as per Amdahl's Law.
>
Currently that sequential part is ~0.5 s on my 1.66 GHz
dual core laptop.
The part of the work that can be run in parallel takes
~2.254 s,
so theoretically we would get

Cores  Real time (s)  Speedup  Rel. speedup by doubling #cores
    1          2.754
    2          1.627    1.693    1.693
    4          1.064    2.590    1.530
    8          0.782    3.523    1.360
   16          0.641    4.297    1.220
   32          0.570    4.828    1.123
   64          0.535    5.146    1.066
  128          0.518    5.321    1.034
  256          0.509    5.413    1.017

Which is not very good after 8 cores.
So I am now looking at making this a 'real' distributed solution instead.
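The figures above follow from Amdahl's law with a 0.5 s sequential part and 2.254 s of parallelisable work: real time is Seq + Par / Cores. A small sketch that reproduces the table (the module and function names are illustrative, not from the thread):

```erlang
-module(amdahl_sketch).
-export([real_time/3, speedup/3, table/2]).

%% Seq:   seconds of work that cannot be parallelised
%% Par:   seconds of parallelisable work, measured on one core
%% Cores: number of cores
real_time(Seq, Par, Cores) ->
    Seq + Par / Cores.

%% Speedup relative to running on a single core.
speedup(Seq, Par, Cores) ->
    real_time(Seq, Par, 1) / real_time(Seq, Par, Cores).

%% One {Cores, RealTime, Speedup} row per power of two from 1 to 256.
table(Seq, Par) ->
    [{N, real_time(Seq, Par, N), speedup(Seq, Par, N)}
     || N <- [1 bsl K || K <- lists:seq(0, 8)]].
```

Calling amdahl_sketch:table(0.5, 2.254) gives the rows of the table. As the number of cores grows, the real time approaches the 0.5 s sequential part, so the speedup can never exceed 2.754 / 0.5, which is about 5.5.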
/Anders
_______________________________________________
erlang-questions mailing list
erlang-questions@erlang.org
http://www.erlang.org/mailman/listinfo/erlang-questions
Post received from mailing list |
| Guest |
Posted: Sun Oct 28, 2007 3:01 pm |
|
|
|
Guest
|
>
> Currently that sequential part is ~ 0.5s on my 1.66GHz
> dual core laptop.
> the part of the work that can be run in parallel takes
> ~2.254 s
> so theoretically we would get
> Cores Real time Speedup Rel. speedup by doubling #cores
> 1 2.754
> 2 1.627 1.693 1.693
> 4 1.064 2.590 1.530
> 8 0.782 3.523 1.360
> 16 0.641 4.297 1.220
> 32 0.570 4.828 1.123
> 64 0.535 5.146 1.066
> 128 0.518 5.321 1.034
> 256 0.509 5.413 1.017
>
> Which is not very good after 8 cores.
>
> So I am now looking at making this a 'real' distributed solution instead.
>
> /Anders
I wrote some code to try parallel splitting, but it still uses a
single dispatcher, because you must have a single reader. Sequential
reading is about twenty times faster than random access, and you must
keep track of block order. Have a look at chunk_reader and
nlt_reader.
The crazy thing is that chunk_reader is slower when compiled with the
native option. I don't know why.
-- Hynek (Pichi) Vychodil
| Guest |
Posted: Sun Oct 28, 2007 3:29 pm |
|
|
|
Guest
|
>
> I wrote some code to try parallel splitting, but it is still with
> single dispatcher, because you must have single reader. Sequential
> reading is about twenty times faster than random and you must keep
> information about block order. Let you see chunk_reader and
> nlt_reader.
>
> Crazy think that chunk_reader is slower when compiled with native
> option. I don't know why.
>
> -- Hynek (Pichi) Vychodil
>
>
And to complete it, here are file_map_reduce.erl and wf_pichi1.erl.
wf_pichi1.erl contains some very old code of mine, but it also includes
Anders's much better finder from wfbm4_ets1, which works well. This code
uses a single ets table, which can of course become a bottleneck with
many CPUs. I know.
-- Hynek (Pichi) Vychodil
| Thomas Lindgren |
Posted: Sun Oct 28, 2007 4:55 pm |
|
|
|
User
Joined: 09 Mar 2005
Posts: 284
|
--- Hynek Vychodil <vychodil.hynek@gmail.com> wrote:
> On 10/28/07, Thomas Lindgren
> <thomasl_erlang@yahoo.com> wrote:
> >
> > --- Hynek Vychodil <vychodil.hynek@gmail.com>
> wrote:
> >
> > > Hello,
> > > These results are interesting, but I demur to
> kind
> > > of solution. Your
> > > and Steve's approach have some caveats.
> > >
> > > 1/ File is read all in memory.
> >
> > Hynek,
> >
> > This is true for some versions, but not all. The
> > 'block read' version reads the file in chunks.
>
> What version do you mean? tbray_blockread.erl from
> http://www.erlang.org/pipermail/erlang-questions/2007-October/030118.html
> reads in chunks, but when workers are slow you run
> out of memory. Look
> at the scan_file/9 loop. There is no limit on the number of blocks in
> memory.
Correct, the code doesn't support it at the moment,
but neither is there an inherent problem with adding
such a limit. If you are processing terabytes of data,
this is obviously a more pressing issue than at the
moment, but you are not _required_ to read all of the
logfile at once, and you are not _unable_ to control
how many data chunks are resident at any time. All the
building blocks for a more robust file reader are
available if you need them.
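As a rough illustration of such a limit, here is a sketch of a reader that never has more than MaxInFlight blocks outstanding. It is not tbray_blockread.erl; the module name, the function names and the trivial "work" done per block (reporting its size) are assumptions made for the example, and error returns from file:read/2 are deliberately not handled.

```erlang
-module(bounded_reader).
-export([count_bytes/3]).

%% Reads Path sequentially in BlockSize-byte blocks and hands each block
%% to a short-lived worker. At most MaxInFlight blocks are outstanding.
%% Returns the total number of bytes the workers reported.
count_bytes(Path, BlockSize, MaxInFlight) when MaxInFlight >= 1 ->
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    try
        loop(Fd, BlockSize, MaxInFlight, 0, 0)
    after
        file:close(Fd)
    end.

%% InFlight = blocks handed to workers and not yet acknowledged.
loop(Fd, BlockSize, Max, InFlight, Acc) when InFlight >= Max ->
    %% Flow control: stop reading until some worker has finished.
    receive
        {done, N} -> loop(Fd, BlockSize, Max, InFlight - 1, Acc + N)
    end;
loop(Fd, BlockSize, Max, InFlight, Acc) ->
    case file:read(Fd, BlockSize) of
        {ok, Bin} ->
            Parent = self(),
            %% Placeholder for the real per-block work.
            spawn_link(fun() -> Parent ! {done, byte_size(Bin)} end),
            loop(Fd, BlockSize, Max, InFlight + 1, Acc);
        eof ->
            drain(InFlight, Acc)
    end.

%% Wait for the workers that are still running after end of file.
drain(0, Acc) -> Acc;
drain(InFlight, Acc) ->
    receive
        {done, N} -> drain(InFlight - 1, Acc + N)
    end.
```

The reader stays sequential, which matters for disk throughput, while memory use is bounded by roughly MaxInFlight * BlockSize bytes of file data.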
So I would say that while this code might not be
something you deploy in production, it still
definitely works as a proof of concept on how to write
efficient code that exploits multicore, etc.
Best,
Thomas
| Guest |
Posted: Sun Oct 28, 2007 5:34 pm |
|
|
|
Guest
|
Hi Anders,
I rewrote your code a little. I removed all remaining binary bindings
and it is noticeably faster again. Try wf_pichi3.erl.
It requires:
chunk_reader - http://www.erlang.org/pipermail/erlang-questions/attachments/20071028/16fc8af3/attachment-0002.obj
nlt_reader - http://www.erlang.org/pipermail/erlang-questions/attachments/20071028/16fc8af3/attachment-0003.obj
file_map_reduce - http://www.erlang.org/pipermail/erlang-questions/attachments/20071028/207cb882/attachment.obj
Have fun
--Hynek (Pichi) Vychodil
On 10/26/07, Anders Nygren <anders.nygren@gmail.com> wrote:
> On 10/23/07, Anders Nygren <anders.nygren@gmail.com> wrote:
> > To summarize my progress on the widefinder problem
> > A few days ago I started with Steve Vinoski's tbray16.erl
> > As a baseline on my 1.66 GHz dual core Centrino
> > laptop, Linux,
> > tbray16
> > real 0m7.067s
> > user 0m12.377s
> > sys 0m0.584s
> >
> > I removed the dict used for the shift table,
> > and changed the min_heap_size.
> > That gave
> > real 0m2.713s
> > user 0m4.168s
> > sys 0m0.412s
> >
> > (see tbray_tuple.erl and wfbm4_tuple.erl)
> > Steve reported that it ran in ~1.9 s on his 8 core server.
> >
> > Then I removed the dicts that were used for collecting the
> > matches and used ets instead, and got some improvement
> > on my dual core laptop.
> > real 0m2.220s
> > user 0m3.252s
> > sys 0m0.344s
> >
> > (see tbray_ets.erl and wfbm4_ets.erl)
> >
> > Interestingly Steve reported that it actually performed
> > worse on his 8 core server.
> >
> > These versions all read the whole file into memory at the start.
> > On my laptop that takes ~400ms (when the file is already cached
> > in the OS).
> >
> > So I changed it to read the file in chunks and spawn the worker
> > after each chunk is read.
> >
> > tbray_blockread with 4 processes
> > real 0m1.992s
> > user 0m3.176s
> > sys 0m0.420s
> >
> > (see tbray_blockread.erl and wfbm4_ets.erl)
> >
> > Running it in the erlang shell it takes ~1.8s.
> >
> > Just starting and stopping the VM takes
> > time erl -pa ../../bfile/ebin/ -smp -noshell -run init stop
> >
> > real 0m1.229s
> > user 0m0.208s
> > sys 0m0.020s
> >
> > It would be interesting to see how it runs on other machines,
> > with more cores.
> >
> > /Anders
> >
> >
>
> So I have a new version that I think will break the 1 second barrier
> on Steve's 8-core
> box.
> The best I have seen on my dual core laptop is
> real: 0m1.689s
> user: 0m2.2756s
> sys: 0m0.396s
>
> The changes relative to my latest posted tbray_blockread.erl are
> - reading the file is in a separate process
> - never bind variables to sub binaries unless absolutely necessary
> - only have a limited number of worker processes at any time
>
> One lesson from this exercise is that binding variables to sub
> binaries can be bad for performance; the result of changing the
> code to avoid it can be seen in the garbage collection statistics.
>
> wfinder, (an unreleased version that ran in 1.050s on Steve's 8-core)
> garbage collections: 46302
> words reclaimed: 501768347
>
> wfinder1
> garbage collections: 13917
> words reclaimed: 384561741
>
> /Anders
>
| anders_n |
Posted: Sun Oct 28, 2007 10:31 pm |
|
|
|
User
Joined: 28 Feb 2005
Posts: 155
Location: Saltillo, Mexico
|
On 10/28/07, Hynek Vychodil <vychodil.hynek@gmail.com> wrote:
> Hi Anders,
> I rewrote your code a little. I removed all remaining binary bindings
> and it is noticeable faster again. Try wf_pichi3.erl.
>
Hynek,
that was great. Your change brings my wfinder1_1 + wfbm4_ets1_1
down to
real 0m1.118s
user 0m1.640s
sys 0m0.368s
on my 1.66 GHz dual core laptop.
For comparison, your wf_pichi3 takes
real 0m1.854s
user 0m2.928s
sys 0m0.336s
on my laptop.
/Anders
| Guest |
Posted: Mon Oct 29, 2007 4:59 am |
|
|
|
Guest
|
On 10/28/07, Anders Nygren <anders.nygren@gmail.com> wrote:
> On 10/28/07, Hynek Vychodil <vychodil.hynek@gmail.com> wrote:
> > Hi Anders,
> > I rewrote your code a little. I removed all remaining binary bindings
> > and it is noticeable faster again. Try wf_pichi3.erl.
> >
> Hynek
> that was great, Your change brings my wfinder1_1 + wfbm4_ets1_1
> down to
real |
| Thomas Lindgren |
Posted: Mon Oct 29, 2007 8:38 am |
|
|
|
User
Joined: 09 Mar 2005
Posts: 284
|
--- Steve Vinoski <vinoski@ieee.org> wrote:
> On 10/28/07, Anders Nygren <anders.nygren@gmail.com>
> wrote:
> >
> > On 10/28/07, Hynek Vychodil
> <vychodil.hynek@gmail.com> wrote:
> > > Hi Anders,
> > > I rewrote your code a little. I removed all
> remaining binary bindings
> > > and it is noticeable faster again. Try
> wf_pichi3.erl.
> > >
> >
> > Hynek
> > that was great, Your change brings my wfinder1_1 +
> wfbm4_ets1_1
> > down to
> > real 0m1.118s
> > user 0m1.640s
> > sys 0m0.368s
> > on my 1.66 GHz dual core laptop.
>
>
> And on the 8-core 2.33GHz Intel Xeon Linux box with
> 2 GB RAM, this version
> is extremely fast:
>
> real 0m0.567s
> user 0m2.249s
> sys 0m0.956s
Impressive. However, can someone explain why
sys > real?
Is the kernel running in parallel too?
Best,
Thomas
| Thomas Lindgren |
Posted: Mon Oct 29, 2007 9:13 am |
|
|
|
User
Joined: 09 Mar 2005
Posts: 284
|
--- Steve Vinoski <vinoski@ieee.org> wrote:
> On 10/28/07, Anders Nygren <anders.nygren@gmail.com>
> wrote:
> >
> > On 10/28/07, Hynek Vychodil
> <vychodil.hynek@gmail.com> wrote:
> > > Hi Anders,
> > > I rewrote your code a little. I removed all
> remaining binary bindings
> > > and it is noticeable faster again. Try
> wf_pichi3.erl.
> > >
> >
> > Hynek
> > that was great, Your change brings my wfinder1_1 +
> wfbm4_ets1_1
> > down to
> > real 0m1.118s
> > user 0m1.640s
> > sys 0m0.368s
> > on my 1.66 GHz dual core laptop.
>
>
> And on the 8-core 2.33GHz Intel Xeon Linux box with
> 2 GB RAM, this version
> is extremely fast:
>
> real 0m0.567s
> user 0m2.249s
> sys 0m0.956s
(I'll ignore the unexplained sys time below. That
makes the discussion a bit preliminary; perhaps the
derived results should be computed some other way.
Apply grain of salt appropriately.)
For those keeping track, the latest result is fully
2.7 times faster than the best previous version (which
was block read), and 17.3 times faster than the
initial version. The latest speedup is basically due
to doing less work. However, note that user time fell
by a somewhat greater ratio than real time, which
might mean parallelization overheads are becoming
visible.
Also, the user time of 2.249 seconds is now close to
the Ruby user time, which was 2.095s on the same
hardware, while the Erlang parallelization speedup
(user/real) on top of this is 3.95 out of 8. Comparing
the real times of Erlang (0.567s) and Ruby (2.21s), we
get about the same execution time speedup, 3.9. Not
too shabby, huh?
Is there anything more to be wrung out of this
program? Well, apart from further tuning, one can note
that Ruby had 0.1s sys time, while Erlang apparently
needs a bit more, 0.5-1.0s. Why?
Taking a wider view, it would also be very interesting
to see how to apply these lessons to more general
problems and/or libraries.
Best,
Thomas
| Guest |
Posted: Mon Oct 29, 2007 10:37 am |
|
|
|
Guest
|
The Ruby code is slower on my old single-core home desktop, and the sys
times are almost the same.
http://pichis-blog.blogspot.com/2007/10/faster-than-ruby-but-scalable.html
--Hynek (Pichi) Vychodil
On 10/29/07, Thomas Lindgren <thomasl_erlang@yahoo.com> wrote:
>
> --- Steve Vinoski <vinoski@ieee.org> wrote:
>
> > On 10/28/07, Anders Nygren <anders.nygren@gmail.com>
> > wrote:
> > >
> > > On 10/28/07, Hynek Vychodil
> > <vychodil.hynek@gmail.com> wrote:
> > > > Hi Anders,
> > > > I rewrote your code a little. I removed all
> > remaining binary bindings
> > > > and it is noticeable faster again. Try
> > wf_pichi3.erl.
> > > >
> > >
> > > Hynek
> > > that was great, Your change brings my wfinder1_1 +
> > wfbm4_ets1_1
> > > down to
> > > real 0m1.118s
> > > user 0m1.640s
> > > sys 0m0.368s
> > > on my 1.66 GHz dual core laptop.
> >
> >
> > And on the 8-core 2.33GHz Intel Xeon Linux box with
> > 2 GB RAM, this version
> > is extremely fast:
> >
> > real 0m0.567s
> > user 0m2.249s
> > sys 0m0.956s
>
> (I'll ignore the unexplained sys time below. That
> makes the discussion a bit preliminary; perhaps the
> derived results should be computed some other way.
> Apply grain of salt appropriately.)
>
> For those keeping track, the latest result is fully
> 2.7 times faster than the best previous version (which
> was block read), and 17.3 times faster than the
> initial version. The latest speedup is basically due
> to doing less work. However, note that user time fell
> by a somewhat greater ratio than real time, which
> might mean parallelization overheads are becoming
> visible.
>
> Also, the user time of 2.249 seconds is now close to
> the Ruby user time, which were 2.095s on the same
> hardware, while the Erlang parallelization speedup
> (user/real) on top of this is 3.95 out of 8. Comparing
> the real times of Erlang (0.567s) and Ruby (2.21s), we
> get about the same execution time speedup, 3.9. Not
> too shabby, huh?
>
> Is there anything more to be wrung out of this
> program? Well, apart from further tuning, one can note
> that Ruby had 0.1s sys time, while Erlang apparently
> needs a bit more, 0.5-1.0s. Why?
>
> Taking a wider view, it would also be very interesting
> to see how to apply these lessons to more general
> problems and/or libraries.
>
> Best,
> Thomas
| Guest |
Posted: Mon Oct 29, 2007 11:05 am |
|
|
|
Guest
|
On 10/29/07, Steve Vinoski <vinoski@ieee.org> wrote:
> On 10/28/07, Anders Nygren <anders.nygren@gmail.com> wrote:
> > On 10/28/07, Hynek Vychodil <vychodil.hynek@gmail.com> wrote:
> > > Hi Anders,
> > > I rewrote your code a little. I removed all remaining binary bindings
> > > and it is noticeable faster again. Try wf_pichi3.erl.
> > >
> >
> > Hynek
> > that was great, Your change brings my wfinder1_1 + wfbm4_ets1_1
> > down to
> > real 0m1.118s
> > user 0m1.640s
> > sys 0m0.368s
> > on my 1.66 GHz dual core laptop.
>
>
> And on the 8-core 2.33GHz Intel Xeon Linux box with 2 GB RAM, this version
> is extremely fast:
>
> real 0m0.567s
> user 0m2.249s
> sys 0m0.956s
> --steve
Did you test wfinder1_1 + wfbm4_ets1_1 only, or wf_pichi3 + file_map_reduce
+ nlt_reader + chunk_reader too? I think nlt_reader should use less
memory than the reader from wfinder1_1, which could mean less sys
time, and parallel block splitting could give some speedup on 8 cores,
but I can't test it.
--Hynek (Pichi) Vychodil
| dcaoyuan |
Posted: Mon Oct 29, 2007 11:22 am |
|
|
|
User
Joined: 28 Mar 2007
Posts: 34
|
I think the system time is mostly the disk/io time.
On 10/29/07, Thomas Lindgren <thomasl_erlang@yahoo.com> wrote:
>
> --- Steve Vinoski <vinoski@ieee.org> wrote:
>
> > On 10/28/07, Anders Nygren <anders.nygren@gmail.com>
> > wrote:
> > >
> > > On 10/28/07, Hynek Vychodil
> > <vychodil.hynek@gmail.com> wrote:
> > > > Hi Anders,
> > > > I rewrote your code a little. I removed all
> > remaining binary bindings
> > > > and it is noticeable faster again. Try
> > wf_pichi3.erl.
> > > >
> > >
> > > Hynek
> > > that was great, Your change brings my wfinder1_1 +
> > wfbm4_ets1_1
> > > down to
> > > real 0m1.118s
> > > user 0m1.640s
> > > sys 0m0.368s
> > > on my 1.66 GHz dual core laptop.
> >
> >
> > And on the 8-core 2.33GHz Intel Xeon Linux box with
> > 2 GB RAM, this version
> > is extremely fast:
> >
> > real 0m0.567s
> > user 0m2.249s
> > sys 0m0.956s
>
> Impressive. However, can someone explain why
> sys > real?
> Is the kernel too running in parallel?
>
> Best,
> Thomas
--
- Caoyuan
| anders_n |
Posted: Mon Oct 29, 2007 3:28 pm |
|
|
|
User
Joined: 28 Feb 2005
Posts: 155
Location: Saltillo, Mexico
|
On 10/29/07, Thomas Lindgren <thomasl_erlang@yahoo.com> wrote:
>
> --- Steve Vinoski <vinoski@ieee.org> wrote:
>
> > On 10/28/07, Anders Nygren <anders.nygren@gmail.com>
> > wrote:
> > >
> > > On 10/28/07, Hynek Vychodil
> > <vychodil.hynek@gmail.com> wrote:
> > > > Hi Anders,
> > > > I rewrote your code a little. I removed all
> > remaining binary bindings
> > > > and it is noticeable faster again. Try
> > wf_pichi3.erl.
> > > >
> > >
> > > Hynek
> > > that was great, Your change brings my wfinder1_1 +
> > wfbm4_ets1_1
> > > down to
> > > real 0m1.118s
> > > user 0m1.640s
> > > sys 0m0.368s
> > > on my 1.66 GHz dual core laptop.
> >
> >
> > And on the 8-core 2.33GHz Intel Xeon Linux box with
> > 2 GB RAM, this version
> > is extremely fast:
> >
> > real 0m0.567s
> > user 0m2.249s
> > sys 0m0.956s
>
> (I'll ignore the unexplained sys time below. That
> makes the discussion a bit preliminary; perhaps the
> derived results should be computed some other way.
> Apply grain of salt appropriately.)
>
> For those keeping track, the latest result is fully
> 2.7 times faster than the best previous version (which
> was block read), and 17.3 times faster than the
> initial version. The latest speedup is basically due
> to doing less work.
Are we doing less work?
Not really. Over the 3 versions wfinder, wfinder1 and wfinder1_1
the only differences are in how we do binary matching:
- do NOT create sub binaries
- keep "pointers" for offsets inside one big binary
The reason that it got so much faster seems to be that
we have much less garbage collection.
wfinder
real 0m1.995s
user 0m3.164s
sys 0m0.396s
garbage collections: 46301
words reclaimed: 501744350
wfinder1
real 0m1.786s
user 0m2.792s
sys 0m0.364s
garbage collections: 36102
words reclaimed: 392187156
wfinder1_1
real 0m1.219s
user 0m1.660s
sys 0m0.376s
garbage collections: 10729
words reclaimed: 114930430
So the amount of garbage has been reduced by a factor > 4
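The thread does not say how these garbage collection figures were collected. One way to get comparable numbers, shown here only as an assumption, is to read the node-wide counters from erlang:statistics(garbage_collection) before and after a run (the module and function names are illustrative):

```erlang
-module(gc_stats).
-export([measure/1]).

%% Runs Fun and returns {Result, GCs, WordsReclaimed}, where the two
%% counters are the differences in the node-wide totals across the call.
%% Other processes running at the same time are counted as well.
measure(Fun) when is_function(Fun, 0) ->
    {GC0, Words0, _} = erlang:statistics(garbage_collection),
    Result = Fun(),
    {GC1, Words1, _} = erlang:statistics(garbage_collection),
    {Result, GC1 - GC0, Words1 - Words0}.
```

For the figures quoted above, 501744350 / 114930430 is about 4.4 for words reclaimed and 46301 / 10729 is about 4.3 for the number of collections, which matches the "factor > 4".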
> However, note that user time fell
> by a somewhat greater ratio than real time, which
> might mean parallelization overheads are becoming
> visible.
>
> Also, the user time of 2.249 seconds is now close to
> the Ruby user time, which were 2.095s on the same
> hardware, while the Erlang parallelization speedup
> (user/real) on top of this is 3.95 out of 8. Comparing
> the real times of Erlang (0.567s) and Ruby (2.21s), we
> get about the same execution time speedup, 3.9. Not
> too shabby, huh?
>
> Is there anything more to be wrung out of this
> program?
As I said yesterday, I think that the
current solution has a lower bound of ~0.5s no matter how
many cores you throw at it.
As Steve measured it:

Workers  Real time (s)
      1          1.252
      2          1.079
      4          0.701
      8          0.575
     16          0.567

I think we have reached the limit on this track. We need to look at
this another way to make a solution that is probably slower on
a low number of cores but that scales better on more than 4 cores.
> Well, apart from further tuning, one can note
> that Ruby had 0.1s sys time, while Erlang apparently
> needs a bit more, 0.5-1.0s. Why?
>
> Taking a wider view, it would also be very interesting
> to see how to apply these lessons to more general
> problems and/or libraries.
Some random thoughts
- The compiler should treat don't-care variables (_Var) the same
as (_), and not bind them.
I like to be able to write
<<_Type:TLen/binary,_X:XLen/binary, Val:Len/binary....>>
instead of
SkipLen=TLen+XLen,
<<_:SkipLen/binary,Val:Len/binary...>>
- When doing folds over large binaries, do not repeatedly
split the binary into head and tail parts for recursive calls.
Keep an offset counter to track the position in the binary.
- One reason that it is even possible to solve the wide finder
in parallel is that it is possible to "resync" at a random place
in the file by locating a newline.
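A minimal sketch of the last two points, with made-up names (offset_scan, count_lines/1, next_line_start/2). It walks one large binary by advancing an integer offset instead of rebinding a tail sub-binary, and it "resyncs" from an arbitrary offset by searching for the next newline. It reflects the advice as given in this thread; later compilers optimise head/tail matching as well, so it is worth measuring both styles.

```erlang
-module(offset_scan).
-export([count_lines/1, next_line_start/2]).

%% Count newline bytes in Bin. Only the integer Off changes between
%% iterations; Bin itself is passed along untouched.
count_lines(Bin) when is_binary(Bin) ->
    count_lines(Bin, 0, 0).

count_lines(Bin, Off, N) ->
    case Bin of
        <<_:Off/binary, $\n, _/binary>> -> count_lines(Bin, Off + 1, N + 1);
        <<_:Off/binary, _, _/binary>>   -> count_lines(Bin, Off + 1, N);
        _                               -> N    % Off reached the end
    end.

%% "Resync": starting at an arbitrary offset, return the offset of the
%% first byte after the next newline, or eof if there is none.
next_line_start(Bin, Off) ->
    case Bin of
        <<_:Off/binary, $\n, _/binary>> -> Off + 1;
        <<_:Off/binary, _, _/binary>>   -> next_line_start(Bin, Off + 1);
        _                               -> eof
    end.
```

A worker that is handed a block starting in the middle of a line can call next_line_start/2 to skip the partial line, which is the property that a sequentially framed format such as BER lacks.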
I have an interest in processing files with BER coded data, so
I have been trying to make a BER version of widefinder, but since
it is necessary to scan the data sequentially to identify each
(TLV) block, it does not lend itself to the type of solution we
have for widefinder. My BER version currently takes
real 0m7.378s
user 0m6.768s
sys 0m0.876s
/Anders
| Guest |
Posted: Mon Oct 29, 2007 3:56 pm |
|
|
|
Guest
|
"Anders Nygren" <anders.nygren@gmail.com> writes:
>
> Some random thoughts
> -The compiler should treat don't care variable (_Var) the same
> as (_), and not bind them.
> I like to be able to write
> <<_Type:TLen/binary,_X:XLen/binary, Val:Len/binary....>>
> instead of
> SkipLen=TLen+XLen,
> <<_:SkipLen/binary,Val:Len/binary...>>
The BEAM compiler treats all variables the same.
If the compiler finds that a variable is not used, it will NOT
match out the binary.
I assume that HiPE's native-code compiler does the same optimization.
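For reference, the two ways of writing the match that are being compared can be sketched as below. The module and function names are illustrative, and the trailing `_/binary` stands in for the part elided as "..." in the original snippets, so it is an assumption. Both functions return the same Val; whether they also cost the same is the compiler question answered above.

```erlang
-module(skip_fields).
-export([val_named/4, val_skipped/4]).

%% Skips two fields using named don't-care variables.
val_named(Bin, TLen, XLen, Len) ->
    <<_Type:TLen/binary, _X:XLen/binary, Val:Len/binary, _/binary>> = Bin,
    Val.

%% Skips the same bytes with one anonymous segment of precomputed size.
val_skipped(Bin, TLen, XLen, Len) ->
    SkipLen = TLen + XLen,
    <<_:SkipLen/binary, Val:Len/binary, _/binary>> = Bin,
    Val.
```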
/Bjorn
| anders_n |
Posted: Mon Oct 29, 2007 4:33 pm |
|
|
|
User
Joined: 28 Feb 2005
Posts: 155
Location: Saltillo, Mexico
|
On 29 Oct 2007 16:54:29 +0100, Bjorn Gustavsson <bjorn@erix.ericsson.se> wrote:
> "Anders Nygren" <anders.nygren@gmail.com> writes:
>
> >
> > Some random thoughts
> > -The compiler should treat don't care variable (_Var) the same
> > as (_), and not bind them.
> > I like to be able to write
> > <<_Type:TLen/binary,_X:XLen/binary, Val:Len/binary....>>
> > instead of
> > SkipLen=TLen+XLen,
> > <<_:SkipLen/binary,Val:Len/binary...>>
>
> The BEAM compiler treats all variables the same.
>
> If the compiler finds that a variable is not used, it will NOT
> match out the binary.
Good to hear that.
/Anders
>
> I assume that HiPE's native-code compiler does the same optimization.
>
> /Bjorn
All times are GMT