Discussion:
Bug in TRE regular expression library
Ralf Junker
2010-07-20 08:36:27 UTC
I am looking for someone who maintains the NetBSD TRE code in

http://cvsweb.netbsd.org/bsdweb.cgi/src/external/bsd/tre/

Reason is that I believe that I found a bug in TRE.

Unfortunately, the original TRE author is unresponsive to e-mail
messages. The original TRE mailing lists no longer function, and the
original TRE repository is no longer available. So it seems that the
NetBSD port of TRE might be the only maintained version of TRE - at
least I could find no other.

I am not sure if the problem is relevant to NetBSD, but it might still
be worth reporting. Below follows my message to Ville Laurikari.

Ralf

------

I am using TRE compiled with both TRE_WCHAR and TRE_MULTIBYTE defined.
My build passes all tests fine, except for a "special" case (see code
below).

The special case happens when I set TRE_MB_CUR_MAX = 2 for multi-byte
character support: Then GET_NEXT_WCHAR() no longer advances str_byte at
the end of a null-terminated string.

As a result, the example below does match with TRE_MB_CUR_MAX = 2, which
is *not* correct. With TRE_MB_CUR_MAX = 1 it fails as expected.

A simple fix would be to change tre-match-utils.h line 59 from

if (w == 0 && len >= 0)

to just

if (w == 0)

However, I am not sure if this causes any side effects?
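
The effect described above can be reproduced outside TRE. The sketch below is not TRE's code. It assumes that GET_NEXT_WCHAR() decodes with something that behaves like the standard C function mbrtowc(), which returns 0 when it decodes the terminating null character. The helper names step_naive() and step_fixed() are hypothetical and exist only for illustration. A caller that advances its pointer by the return value stays put at the terminator, while treating a return of 0 as one byte moves past it.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of TRE: decode one character at str_byte
   and return how far a caller doing "str_byte += w" would advance. */
size_t step_naive(const char *str_byte)
{
    mbstate_t state;
    wchar_t wc;
    size_t w;

    memset(&state, 0, sizeof state);   /* start in the initial shift state */
    w = mbrtowc(&wc, str_byte, 32, &state);

    /* This sketch only handles valid, complete input. */
    assert(w != (size_t)-1 && w != (size_t)-2);

    return w;                          /* 0 when the terminating NUL was decoded */
}

/* Hypothetical helper: same step, but a decoded NUL counts as one byte,
   which is what the unconditional "if (w == 0)" branch would amount to. */
size_t step_fixed(const char *str_byte)
{
    size_t w = step_naive(str_byte);

    return w == 0 ? 1 : w;
}
```

Calling step_naive("") yields 0 and step_fixed("") yields 1. Whether advancing past the terminator is safe everywhere GET_NEXT_WCHAR() is used remains the open question raised above.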
Christos Zoulas
2010-07-20 18:39:09 UTC
Post by Ralf Junker
I am looking for someone who maintains the NetBSD TRE code in
http://cvsweb.netbsd.org/bsdweb.cgi/src/external/bsd/tre/
Reason is that I believe that I found a bug in TRE.
Unfortunately, the original TRE author is unresponsive to e-mail
messages. The original TRE mailing lists no longer function, and the
original TRE repository is no longer available. So it seems that the
NetBSD port of TRE might be the only maintained version of TRE - at
least I could find no other.
I am not sure if the problem is relevant to NetBSD, but it might still
be worth reporting. Below follows my message to Ville Laurikari.
Ralf
------
I am using TRE compiled with both TRE_WCHAR and TRE_MULTIBYTE defined.
My build passes all tests fine, except for a "special" case (see code
below).
The special case happens when I set TRE_MB_CUR_MAX = 2 for multi-byte
character support: Then GET_NEXT_WCHAR() no longer advances str_byte at
the end of a null-terminated string.
As a result, the example below does match with TRE_MB_CUR_MAX = 2, which
is *not* correct. With TRE_MB_CUR_MAX = 1 it fails as expected.
A simple fix would be to change tre-match-utils.h line 59 from
if (w == 0 && len >= 0)
to just
if (w == 0)
However, I am not sure if this causes any side effects?
Which example?

christos
Ralf Junker
2010-07-20 19:44:13 UTC
Post by Christos Zoulas
Which example?
Sorry, I forgot it. It is now added below.

Ralf

-----------------------

#define p "(.)\\1$"

int _tmain(int argc, _TCHAR* argv[])
{
    regex_t RE;

    int nMatches, rc;

    regmatch_t *matches;

    TRE_MB_CUR_MAX = 2;

    rc = tre_regncomp(&RE, p, strlen(p), REG_EXTENDED);
    printf ("%d\n", rc);

    nMatches = RE.re_nsub + 1;
    matches = malloc (nMatches * sizeof(matches[0]));

    /* Should not match; with TRE_MB_CUR_MAX = 2 it does. */
    rc = tre_regexec(&RE, "foox", nMatches, matches, 0);
    printf("Result: %d\n", rc);

    free(matches);

    tre_regfree(&RE);

    printf ("\nDone");
    scanf ("*%s");

    return 0;
}
der Mouse
2010-07-20 19:58:16 UTC
Post by Ralf Junker
Post by Christos Zoulas
Which example?
Sorry, I forgot it. It is now added below.
scanf ("*%s");
While I can't see it as being your problem, this looks..questionable,
at the very least. If the input line at this point begins with a *,
scanf will, at best, misuse stack trash as if it were a pointer and
segfault immediately; more pessimistically, misuse stack trash as if it
were a pointer and scribble on some random data structure somewhere,
leading to cryptic misbehaviour at some difficult-to-predict later
point.

If the input doesn't begin with a *, I think our implementation will
not misbehave, but I also think this is an accident of the
implementation and should not be counted on - I don't think C and/or
stdio promise that omitting arguments like this is acceptable even if
they're not stored through. (For example, scanf might fetch the
pointer, even if it doesn't store through it, and if the machine has
trap representations for pointers it may crash when doing so.)
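
The difference between the two format strings can be seen safely with sscanf() and a real buffer. This is a minimal sketch, and the helper names scan_star_then_word() and scan_suppressed_word() are hypothetical. In "*%63s" the '*' is a literal that the input must match before %63s stores a word through the pointer argument. In "%*s" the '*' suppresses assignment, so no pointer is consumed at all.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format "*%63s": a literal '*' must match first, then %63s stores one
   word through the pointer argument. out must hold at least 64 bytes. */
int scan_star_then_word(const char *input, char *out)
{
    assert(input != NULL && out != NULL);
    return sscanf(input, "*%63s", out);
}

/* Format "%*s": one word is read and discarded, so no pointer argument
   is needed and nothing is counted as assigned. */
int scan_suppressed_word(const char *input)
{
    assert(input != NULL);
    return sscanf(input, "%*s");
}
```

With a 64-byte buffer, scan_star_then_word("hello", buf) returns 0 because the literal '*' does not match, and scan_star_then_word("*hello", buf) returns 1 and stores "hello". That second case is the one where the original call, which passes no pointer, has nowhere valid to store.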

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Ralf Junker
2010-07-20 20:26:23 UTC
Post by der Mouse
Post by Ralf Junker
scanf ("*%s");
While I can't see it as being your problem,
It surely is not my problem nor the point I wanted to draw attention
to. The focus is solely on the TRE regex functions.

Still, thanks for pointing this out. I never put any thought into these
scanf arguments, simply because they serve a single purpose: Stop the
application so I can read its output and press any key to quit quickly.
Works fine for me here.

I am sorry to say that I am not much concerned about any problems this
key press might cause elsewhere, as this is not a working application
but some simple code to demonstrate a problem somewhere else, possibly
in production code.

But if you'd like to suggest a proper way to achieve this objective, I'd
be eager to learn ...

Ralf
der Mouse
2010-07-20 20:53:31 UTC
Post by Ralf Junker
I never put any thought into these scanf arguments, simply because
they serve a single purpose: Stop the application so I can read its
output and press any key to quit quickly. Works fine for me here.
It actually looks as though you meant to write "%*s" but got the * and
% switched. %*s is not quite what you want either, since it will
silently consume whitespace, including newlines. You might want %*c,
but that will consume only the next character, rather than up to the
next newline (which I suspect is what you want) - scanf("%*c") is
pretty much the same as getchar() in this use.

There is actually a syntax that is very close to what I think you want:
scanf("%*[^\n]"). The * suppresses assignment, so you don't have to
pass a pointer, and the format spec says to consume a string of
unlimited length made up of non-newlines. However, it does not consume
the \n, so if you use it twice, the second call will return
immediately. It might work to do scanf("%*[^\n]%*1[\n]") (the second
format specifier attempts to consume at most one newline), but I'd want
to test that before depending on it - I suspect that's a piece of scanf
that gets little testing and thus is likely to be buggy.
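
A plain character loop avoids the scanf corner cases discussed above. The sketch below uses a hypothetical helper, discard_line(), that reads and drops everything up to and including the next newline, so calling it twice waits for two separate lines. One caveat about the two-directive format: by the C standard's rules on matching failures, a %[ directive that matches no characters ends the scanf call, so on an empty line "%*[^\n]%*1[\n]" would appear to leave the newline unread.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read and discard the rest of the current line,
   including its newline. Returns the number of characters discarded
   before the newline (or before end of file). */
long discard_line(FILE *in)
{
    long n = 0;
    int c;

    assert(in != NULL);
    while ((c = fgetc(in)) != EOF && c != '\n')
        n++;                /* count only the characters before the newline */
    return n;
}
```

Calling discard_line(stdin) at the end of a test program waits for RETURN, the same as getchar() for a single key plus RETURN, but it does not leave extra typed characters in the input buffer.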
Post by Ralf Junker
I am sorry to say that I am not much concerned about any problems
this key press might cause elsewhere, as this is not a working
application but some simple code to demonstrate a problem somewhere
else, possibly in production code.
Normally, I'd feel the same way. But this looks like not a standalone
test program but a small part of a larger test scaffold, and if running
one test can corrupt things so as to cause a later test to fail
spuriously, that sounds to me like a real problem.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Ralf Junker
2010-07-21 06:54:42 UTC
Post by der Mouse
I never put any thought into these scanf arguments, simply because
they serve a single purpose: Stop the application so I can read its
output and press any key to quit quickly. Works fine for me here.
It actually looks as though you meant to write "%*s" but got the * and
% switched. %*s is not quite what you want either, since it will
silently consume whitespace, including newlines.
Having thought about it once more and read some docs, I recall that
scanf("%*s") is indeed what I intended. Thanks for clearing this up.
Post by der Mouse
You might want %*c,
but that will consume only the next character, rather than up to the
next newline (which I suspect is what you want) - scanf("%*c") is
pretty much the same as getchar() in this use.
I tested and found that getchar(), scanf("%*c"), and scanf("%*s")
all work equally well for my purpose.
Post by der Mouse
scanf("%*[^\n]"). The * suppresses assignment, so you don't have to
pass a pointer, and the format spec says to consume a string of
unlimited length made up of non-newlines. However, it does not consume
the \n, so if you use it twice, the second call will return
immediately. It might work to do scanf("%*[^\n]%*1[\n]") (the second
format specifier attempts to consume at most one newline), but I'd want
to test that before depending on it - I suspect that's a piece of scanf
that gets little testing and thus is likely to be buggy.
What I really want is a single command to stop and wait for just _any_
key-press -- not the RETURN key only. Searching the web indicates that
this does not exist, at least not in a simple and platform independent
way. So I can happily live with getchar() as you suggested.
Post by der Mouse
I am sorry to say that I am not much concerned about any problems
this key press might cause elsewhere, as this is not a working
application but some simple code to demonstrate a problem somewhere
else, possibly in production code.
Normally, I'd feel the same way. But this looks like not a standalone
test program but a small part of a larger test scaffold, and if running
one test can corrupt things so as to cause a later test to fail
spuriously, that sounds to me like a real problem.
I appreciate your concern, but it is in fact just a standalone test program.

Now that this is settled, I would love to read from someone with the
same enthusiasm for my original TRE regular expression problem! ;-)

Below is the test code again, now properly using getchar().

Ralf

----------------

#define p "(.)\\1$"

int _tmain(int argc, _TCHAR* argv[])
{
    regex_t RE;

    int nMatches, rc;

    regmatch_t *matches;

    TRE_MB_CUR_MAX = 2;

    rc = tre_regncomp(&RE, p, strlen(p), REG_EXTENDED);
    printf ("%d\n", rc);

    nMatches = RE.re_nsub + 1;
    matches = malloc (nMatches * sizeof(matches[0]));

    /* Should not match; with TRE_MB_CUR_MAX = 2 it does. */
    rc = tre_regexec(&RE, "foox", nMatches, matches, 0);
    printf("Result: %d\n", rc);

    free(matches);

    tre_regfree(&RE);

    printf ("\nDone");
    getchar(); /* Keep app window open until key press. */

    return 0;
}
