| Author |
Message |
|
| rvirding |
Posted: Mon Jun 04, 2007 12:06 am |
|
|
|
User
Joined: 30 Aug 2006
Posts: 452
Location: Stockholm, Sweden
|
I have uploaded a new regular expression module, re.erl, to trapexit.org:
http://forum.trapexit.org/viewtopic.php?t=8675
This is a new implementation of regular expressions which is sort of
compatible with regexp.erl with two major improvements:
1. It now works directly on binaries: all the functions accept binaries as
input, though the regexp itself must still be a string.
2. There are 2 new functions which extract and return sub-expressions,
smatch/2 and first_smatch/2. These are similar to match/2 and
first_match/2 but they also return the sub-expressions. For example:
2> re:smatch("-axxxb--", "a((x+)|(y+))b").
{match,2,5,"axxxb",{{3,3,"xxx"},{3,3,"xxx"},undefined}}
A sub-expr is 'undefined' if there is no match.
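Since the result is an ordinary tuple, it can be taken apart by pattern matching. A minimal sketch, assuming only the result shape shown in the example above; the module name smatch_demo and the helper subs/1 are made up for illustration and are not part of re.erl:

```erlang
-module(smatch_demo).
-export([subs/1, demo/0]).

%% Hypothetical helper (not part of re.erl): turn an smatch-style
%% result into a list of sub-expression strings, keeping 'undefined'
%% for sub-expressions that did not take part in the match.
subs({match, _Start, _Len, _Str, Subs}) ->
    [sub_string(S) || S <- tuple_to_list(Subs)];
subs(nomatch) ->
    [].

sub_string({_Start, _Len, Str}) -> Str;
sub_string(undefined) -> undefined.

demo() ->
    %% Literal copied from the example above; re.erl is not called here.
    Result = {match,2,5,"axxxb",{{3,3,"xxx"},{3,3,"xxx"},undefined}},
    subs(Result).
```

Calling smatch_demo:demo() gives ["xxx","xxx",undefined], so an unmatched sub-expression stays distinguishable from an empty string.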
It supports POSIX regexp as did the old one, but we now have POSIX
character classes but only for Latin-1. So we can write "[[:digit:]]" or
"[[:alnum:]]". The functions are the same as before.
The regexp engine should never explode irrespective of the regexp, which
many engines do, and it is about as fast as the old one, though that depends
on the regexp.
I would like some feedback on the speed and the interface.
N.B. It is not really possible to have both POSIX and PERL regexps in
the same module as, apart from the difference in features, they have
different semantics. If all goes well a PERL module might follow.
Robert
_______________________________________________
erlang-questions mailing list
erlang-questions@erlang.org
http://www.erlang.org/mailman/listinfo/erlang-questions
Post received from mailing list |
| tobbe |
Posted: Mon Jun 04, 2007 10:31 am |
|
|
|
User
Joined: 19 Jan 2005
Posts: 274
Location: Stockholm, Sweden
|
Code:
1> re:match("now/plus42hours/","^now/(plus|minus)(\d{1,2})hours/$").
nomatch
2> re:smatch("now/plus42hours/","^now/(plus|minus)([[:alnum:]])hours/$").
nomatch
3> re:smatch("now/plus42hours/","now/(plus|minus)([[:alnum:]])hours/").
nomatch
Que?
Also, it would be really nice with some docs with lots of examples.
Or, why not provide an Eunit test file? That would help you to do
regression testing and give good examples in one place.
Cheers, Tobbe |
| rvirding |
Posted: Mon Jun 04, 2007 8:39 pm |
|
|
|
User
Joined: 30 Aug 2006
Posts: 452
Location: Stockholm, Sweden
|
tobbe wrote:
> 1> re:match("now/plus42hours/","^now/(plus|minus)(\d{1,2})hours/$").
> nomatch
> 2> re:smatch("now/plus42hours/","^now/(plus|minus)([[:alnum:]])hours/$").
> nomatch
> 3> re:smatch("now/plus42hours/","now/(plus|minus)([[:alnum:]])hours/").
> nomatch
OK:
1) \d is a PERLism and as I wrote I only support POSIX style regexps. As
the regexp is a string it would have to be "\\d" as the '\' needs to be
seen by the regexp module. If there is interest I will do a PERL
compatible version.
2) [[:alnum:]] matches ONE alpha-numeric character, almost equivalent to
"[a-zA-Z0-9]" but for all of Latin-1.
3) Same comment here.
So:
2> re:match("now/plus42hours/","^now/(plus|minus)([[:alnum:]]+)hours/$").
{match,1,16}
3> re:smatch("now/plus42hours/","^now/(plus|minus)([[:alnum:]]+)hours/$").
{match,1,16,"now/plus42hours/",{{5,4,"plus"},{9,2,"42"}}}
4> re:smatch(<<"now/plus42hours/">>,"^now/(plus|minus)([[:alnum:]]+)hours/$").
{match,1,16,"now/plus42hours/",{{5,4,"plus"},{9,2,"42"}}}
5> re:smatch(<<"now/plus42hours/">>,"^now/(plus|minus)([[:alnum:]]{1,2})hours/$").
{match,1,16,"now/plus42hours/",{{5,4,"plus"},{9,2,"42"}}}
6> re:smatch("now/plus42hours/","^now/(plus|minus)([[:alnum:]]{1,2})hours/$").
{match,1,16,"now/plus42hours/",{{5,4,"plus"},{9,2,"42"}}}
I hope that's legible
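The escaping remark in point 1) can be checked in plain Erlang, without any regexp module. In an Erlang string literal, \d is itself an escape sequence (the DEL character, code 127), so the regexp module never sees a backslash unless it is doubled. A small sketch; the module name escape_demo is made up for illustration:

```erlang
-module(escape_demo).
-export([demo/0]).

demo() ->
    %% "\d" is ONE character: Erlang's escape for DEL (code 127).
    [127] = "\d",
    %% "\\d" is TWO characters: a backslash followed by the letter d.
    [$\\, $d] = "\\d",
    ok.
```

escape_demo:demo() returns ok only if both matches succeed, which shows that "\d" and "\\d" are different strings before any regexp parsing happens.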
> Also, it would be really nice with some docs with lots of examples.
> Or, why not provide an Eunit test file? That would help you to do
> regression testing and give good examples in one place.
>
> Cheers, Tobbe
As I said it is compatible with the old regexp except for the smatch/2,
first_smatch/2 functions.
It hasn't got to the stage of eunit regression testing yet; I would like
to get the interface nailed down first. What information needs to be
returned? Right now smatch returns everything. By the way, an unused sub-expr
returns 'undefined' to differentiate it from the empty string.
The reason I don't use the module name 'regexp' is to give me more
freedom in determining the interface. I will feed improvements back
into regexp of course.
Then I can split it up along the lines of rok's suggestions and add a
re-entrant interface as someone else was requesting. And leex. Could
even make it work directly on io-lists, but I don't really see the need.
In testing tobbe's examples I found a few bugs and have added a new
version to trapexit:
http://forum.trapexit.org/viewtopic.php?t=8676
Robert
P.S. Using the code tags in trapexit really screws it up for me in
Thunderbird, I get the raw HTML.
P.P.S. I found I couldn't directly use the update file option in trapexit
as it changes the name of the file; my update became re_478.erl.
Post received from mailing list |
| axel |
Posted: Tue Jun 05, 2007 6:49 am |
|
|
|
User
Joined: 03 Mar 2005
Posts: 271
|
On 2007-06-04 22:34, Robert Virding wrote:
> tobbe wrote:
>> 1> re:match("now/plus42hours/","^now/(plus|minus)(\d{1,2})hours/$").
>> nomatch
>> 2> re:smatch("now/plus42hours/","^now/(plus|minus)([[:alnum:]])hours/$").
>> nomatch
>> 3> re:smatch("now/plus42hours/","now/(plus|minus)([[:alnum:]])hours/").
>> nomatch
>
> OK:
> 1) \d is a PERLism and as I wrote I only support POSIX style regexps. As
> the regexp is a string it would have to be "\\d" as the '\' needs to be
if somebody is interested in something else than ''normal regular
expressions'' (where normal is awk, sed, posix, perl, etc) i can recommend
http://www.scsh.net/docu/html/man-Z-H-7.html#node_idx_1178
it is regexp for the scheme shell. it has s-expressions instead of
strings. i find it easier to use when the regular expression goes beyond
that which is possible to do with strstr and friends.
bengt
--
Those were the days...
EPO guidelines 1978: "If the contribution to the known art resides
solely in a computer program then the subject matter is not
patentable in whatever manner it may be presented in the claims."
Post received from mailing list |
| rvirding |
Posted: Thu Jun 14, 2007 11:04 pm |
|
|
|
User
Joined: 30 Aug 2006
Posts: 452
Location: Stockholm, Sweden
|
Bengt Kleberg wrote:
> On 2007-06-04 22:34, Robert Virding wrote:
>> tobbe wrote:
>>> 1> re:match("now/plus42hours/","^now/(plus|minus)(\d{1,2})hours/$").
>>> nomatch
>>> 2> re:smatch("now/plus42hours/","^now/(plus|minus)([[:alnum:]])hours/$").
>>> nomatch
>>> 3> re:smatch("now/plus42hours/","now/(plus|minus)([[:alnum:]])hours/").
>>> nomatch
>> OK:
>> 1) \d is a PERLism and as I wrote I only support POSIX style regexps. As
>> the regexp is a string it would have to be "\\d" as the '\' needs to be
>
> if somebody is interested in something else than ''normal regular
> expressions'' (where normal is awk, sed, posix, perl, etc) i can recommend
> http://www.scsh.net/docu/html/man-Z-H-7.html#node_idx_1178
>
> it is regexp for the scheme shell. it has s-expressions instead of
> strings. i find it easier to use when the regular expression goes beyond
> that which is possible to do with strstr and friends.
Sorry for taking so long to answer this.
This is definitely interesting. What it describes is along the same lines
as what Richard O'Keefe was suggesting, defining the regular expression
with a structure instead of with a string. They wrap the s-expr form
with a read macro which parses the s-expr and builds an internal
representation. One interesting point is that when matching it does not
return an explicit structure with the results of the match, but instead
an ADT with a set of access functions.
One benefit of doing this is that, as the internal structure of the ADT
is undefined and the data is only accessible through the access functions,
you are free to change the internals. The downside is not being able to
pattern match on the result. What do people feel is the best way to go?
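To make the trade-off concrete, here is a rough sketch of what an ADT-style result could look like in Erlang. Everything in it is hypothetical: the module name match_adt, the record, and the functions from_smatch/1, start/1, len/1 and sub/2 are invented for illustration and are not part of re.erl. It only assumes the smatch result shape shown earlier in the thread.

```erlang
-module(match_adt).
-export([from_smatch/1, start/1, len/1, sub/2]).

%% Hypothetical internal representation. Callers are meant to use only
%% the access functions below, so this record could change freely.
-record(m, {start, len, subs}).

%% Wrap a result shaped like the smatch examples in this thread.
from_smatch({match, Start, Len, _Str, Subs}) ->
    #m{start = Start, len = Len, subs = Subs}.

%% Start position and length of the whole match.
start(#m{start = S}) -> S.
len(#m{len = L}) -> L.

%% N-th sub-expression as a string, or 'undefined' if it did not match.
sub(N, #m{subs = Subs}) when N >= 1, N =< tuple_size(Subs) ->
    case element(N, Subs) of
        {_Start, _Len, Str} -> Str;
        undefined -> undefined
    end.
```

With this wrapper, match_adt:sub(2, match_adt:from_smatch({match,1,16,"now/plus42hours/",{{5,4,"plus"},{9,2,"42"}}})) returns "42", but the caller can no longer write a pattern against the tuple directly, which is exactly the downside described above.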
I rather like having both the string form for a regular expression and a
structural representation. It is easier to make it more beautiful in Lisp, I
think. For Erlang we could either use terms directly or have a more
functional way as Richard described. So instead of "[a-c]*|z+" you could
have:
{alt,{'*',{cc,"a-c"}},{'+',{c,$z}}}
or
alt('*'(cc("a-c")),'+'($z))
Can't think of better names for the closures right now, using kclosure
and pclosure seems so long.
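The functional form can be sketched as a handful of constructor functions that simply build the tuple form. This is only an illustration under stated assumptions: the module name re_terms, the c/1 constructor, and the rule that a bare character is wrapped as {c,Char} are all made up here and are not part of re.erl.

```erlang
-module(re_terms).
-export([alt/2, '*'/1, '+'/1, cc/1, c/1, example/0]).

%% Hypothetical constructors that build the tuple representation.
alt(A, B) -> {alt, wrap(A), wrap(B)}.
'*'(R)    -> {'*', wrap(R)}.
'+'(R)    -> {'+', wrap(R)}.

%% Character class and single character.
cc(Spec) when is_list(Spec)    -> {cc, Spec}.
c(Char)  when is_integer(Char) -> {c, Char}.

%% Assumption: a bare character such as $z is shorthand for {c,$z};
%% anything already in tuple form is passed through unchanged.
wrap(Char) when is_integer(Char) -> {c, Char};
wrap(R) when is_tuple(R)         -> R.

%% The functional form from the text, for "[a-c]*|z+".
example() ->
    alt('*'(cc("a-c")), '+'($z)).
```

Here re_terms:example() builds the same term as the tuple form written out by hand, {alt,{'*',{cc,"a-c"}},{'+',{c,$z}}}, so both notations could feed the same engine.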
Robert
Post received from mailing list |
| Guest |
Posted: Fri Jun 15, 2007 2:08 am |
|
|
|
Guest
|
| rvirding |
Posted: Sat Jun 16, 2007 9:48 am |
|
|
|
User
Joined: 30 Aug 2006
Posts: 452
Location: Stockholm, Sweden
|
All times are GMT