Discussion:
wchar_t encoding?
Paul Koning
2010-05-19 15:29:38 UTC
Permalink
Gents,

I'm working on a patch to gdb 7.1 to make it work on NetBSD. The issue
is that GDB 7 uses iconv to handle character strings, and uses wide
chars internally so it can handle various non-ASCII scripts.

The trouble for NetBSD is that it asks iconv to translate to a character
set named "wchar_t". That means "whatever the encoding is for the
wchar_t data type". GNU libiconv supports that, so on platforms that
use that library things are fine.

NetBSD supports iconv, but it doesn't know the "wchar_t" encoding name.
So I proposed a patch that substitutes what appears to be used instead,
namely UCS-4 in platform native byte order (so "ucs-4le" on x86, for
example). This seems to work.

The trouble is that I'm getting pushback on the patch, because of
concerns that the encoding used for wchar_t is not actually UCS-4. In
particular, there is this article:
http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess
which says that on Solaris and FreeBSD the encoding of wchar_t is
"undocumented and locale dependent". (Ye gods!)

Now, NetBSD is not FreeBSD... so... what is the answer for NetBSD? Is
it like FreeBSD? (If so, it would be good to fix that.) Or is it a
fixed encoding, and if so, is it indeed ucs-4?

Thanks,
paul
Martin Husemann
2010-05-19 15:35:37 UTC
Permalink
Post by Paul Koning
NetBSD supports iconv, but it doesn't know the "wchar_t" encoding name.
It's probably easiest to add an alias for this and get that change pulled
up.

Martin
Paul Koning
2010-05-19 17:55:47 UTC
Permalink
Post by Martin Husemann
Post by Paul Koning
NetBSD supports iconv, but it doesn't know the "wchar_t" encoding name.
It's probably easiest to add an alias for this and get that change pulled
up.
That's one approach. Another is to teach gdb to ask for a different
name, which isn't a particularly hard bit of configure machinery.

The problem is "alias for what?"

paul
Martin Husemann
2010-05-19 19:10:03 UTC
Permalink
Post by Paul Koning
The problem is "alias for what?"
Yes - I don't know, and I'd argue, gdb shouldn't know either ;-)

Martin
Paul Koning
2010-05-19 19:44:21 UTC
Permalink
Post by Martin Husemann
Post by Paul Koning
The problem is "alias for what?"
Yes - I don't know, and I'd argue, gdb shouldn't know either ;-)
I guess I didn't explain well enough.

What's going on is this: the target being debugged has strings that gdb
needs to handle. It is told what encoding is used for those strings
(via user commands, defaulting in some suitable way).

To allow for maximum flexibility, any internal processing on those
strings is done in wchar_t form. Some of the work involves calling
various wide char support routines, like iswprint(). Those functions
assume (perhaps implicitly) a particular encoding, perhaps ucs-4,
perhaps something else. For example, in Solaris the answer (apparently)
is "something else and it depends on the locale".

So when GDB reads a string from the target it feeds it to iconv and asks
it to convert from whatever was specified as the target's encoding into
"the encoding that the wchar support routines expect to find in wchar_t
data".

GDB doesn't particularly want to know what that encoding is, but it has
to ask for a specific encoding or iswprint() will get the wrong answer.
This is why libiconv supports the encoding name "wchar_t" in the first
place.

If it's possible to add that encoding name to the iconv in NetBSD, that
would be a good solution. I tried to read the iconv code and got
completely lost, so I certainly have no idea how to do that. On the
other hand, I do know how to add code to configure.ac in gdb to teach it
to use a different name (essentially doing the aliasing in a #define in
config.h for gdb), all I need is knowledge of what to use. Or places to
look for the answer...

paul
Valeriy E. Ushakov
2010-05-20 03:55:38 UTC
Permalink
Post by Paul Koning
I'm working on a patch to gdb 7.1 to make it work on NetBSD. The issue
is that GDB 7 uses iconv to handle character strings, and uses wide
chars internally so it can handle various non-ASCII scripts.
The trouble for NetBSD is that it asks iconv to translate to a character
set named "wchar_t". That means "whatever the encoding is for the
wchar_t data type". GNU libiconv supports that, so on platforms that
use that library things are fine.
The trouble is that I'm getting pushback on the patch, because of
concerns that the encoding used for wchar_t is not actually UCS-4.
http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess
which says that on Solaris and FreeBSD the encoding of wchar_t is
"undocumented and locale dependent". (Ye gods!)
Why are they so surprised about that? C99 says:

3.7.3
[#1] wide character
bit representation that fits in an object of type wchar_t,
capable of representing any character in the current locale

It's simply impossible to always use unicode as the only encoding for
wchar_t, since not all charsets are 1:1 with unicode.

Besides, iconv does not return (fsvo "return") wide strings, it
returns good old pointer to char. Do they pass a pointer to wchar_t
as destination?

If they just assume it's going to be a pointer to wide string, then
correct implementation of "wchar_t" is for iconv to convert to a plain
string in current charset and then convert that to a wide string.

Or do they actually assume it's gonna be utf32?

SY, Uwe
--
***@stderr.spb.ru | Zu Grunde kommen
http://snark.ptc.spbu.ru/~uwe/ | Ist zu Grunde gehen
Paul Koning
2010-05-20 14:06:37 UTC
Permalink
Post by Valeriy E. Ushakov
Post by Paul Koning
http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess
which says that on Solaris and FreeBSD the encoding of wchar_t is
"undocumented and locale dependent". (Ye gods!)
Post by Valeriy E. Ushakov
3.7.3
[#1] wide character
bit representation that fits in an object of type wchar_t,
capable of representing any character in the current locale
It's simply impossible to always use unicode as the only encoding for
wchar_t, since not all charsets are 1:1 with unicode.
That wasn't "they" -- the editorial comment was mine. I thought that
Unicode by now is complete enough to be able to handle other charsets.
It sounds like that's not true, or at least wasn't 12 years ago. Can
you give an example of a charset for which Unicode is not sufficient?
Post by Valeriy E. Ushakov
Besides, iconv does not return (fsvo "return") wide strings, it
returns good old pointer to char. Do they pass a pointer to wchar_t
as destination?
Yes. The iconv documentation says that the arguments are buffer
pointers, so their type is whatever the source or destination encoding
name implies.
Post by Valeriy E. Ushakov
If they just assume it's going to be a pointer to wide string, then
correct implementation of "wchar_t" is for iconv to convert to a plain
string in current charset and then convert that to a wide string.
Or do they actually assume it's gonna be utf32?
No, that's exactly the issue.

The C99 rule you quoted says (or at least implies) that the encoding of
wchar_t is locale dependent. So the question is: how does a program
find out WHAT encoding wchar_t uses right now? I don't see any API for
obtaining that information. Clearly this is necessary -- how else can a
program construct properly encoded wide char data if it needs to do so
(as GDB does)?

paul
Valeriy E. Ushakov
2010-05-20 17:46:48 UTC
Permalink
Post by Paul Koning
Post by Valeriy E. Ushakov
Or do they actually assume it's gonna be utf32?
No, that's exactly the issue.
The C99 rule you quoted says (or at least implies) that the encoding of
wchar_t is locale dependent. So the question is: how does a program
find out WHAT encoding wchar_t uses right now? I don't see any API for
obtaining that information. Clearly this is necessary -- how else can a
program construct properly encoded wide char data if it needs to do so
(as GDB does)?
There's api to convert between plain chars/strings and wide
chars/strings, there is stdio api for wide chars/strings.

Why is that necessary to know the wide char bit patterns?

SY, Uwe
--
***@stderr.spb.ru | Zu Grunde kommen
http://snark.ptc.spbu.ru/~uwe/ | Ist zu Grunde gehen
Paul Koning
2010-05-20 19:57:36 UTC
Permalink
Post by Valeriy E. Ushakov
Post by Paul Koning
Post by Valeriy E. Ushakov
Or do they actually assume it's gonna be utf32?
No, that's exactly the issue.
The C99 rule you quoted says (or at least implies) that the encoding of
wchar_t is locale dependent. So the question is: how does a program
find out WHAT encoding wchar_t uses right now? I don't see any API for
obtaining that information. Clearly this is necessary -- how else can a
program construct properly encoded wide char data if it needs to do so
(as GDB does)?
There's api to convert between plain chars/strings and wide
chars/strings, there is stdio api for wide chars/strings.
Why is that necessary to know the wide char bit patterns?
Maybe it isn't. I'm trying to solve the problem GDB needs to solve with
minimal changes to GDB.

What it needs to do: it's given a string (narrow or wide) on a target
system. It's told (by the user, defaulted in some suitable way) what
encoding that string has. GDB reads that string from memory. It then
wants to do something with it, for example print it. It also needs to
do some basic processing, for example test for non-printable characters.

The current scheme is, in outline:

1. Read the string into a buffer (call it "targetbuf")
2. iconv_open ("wchar_t", target_string_encoding_name)
3. iconv (..., targetbuf, ... wcharbuf)

and it then has the target string, in wide char format, in the encoding
used by the host (which may be different from that used by the target).

The nice thing about iconv is that it converts between any source and
destination encoding in one step.

I see functions like mbtowc, and it looks like those could be used. (In
fact, that's how libiconv implements iconv().) But it becomes a multi-step
process: first convert the string read from the target to a
multibyte string, probably with iconv. I can't find any documentation
that says what the encoding of a multibyte string is, though. libiconv
clearly assumes that it's "Unicode" (meaning UTF-8???). If it's utf-8
or some other well-defined encoding, then that works. Then the second
step would be to feed that intermediate encoding to mbtowc, which is
defined to translate to wide chars according to the current locale.

I don't see a narrow string to wc conversion, though there is a narrow
char (single char) to wc conversion. But that doesn't do the conversion
GDB needs because it's defined to operate entirely in the current
locale, while the conversion GDB does is from a user-specified target
system encoding on input, to the locale host system encoding on output.
Note that the target OS may not be NetBSD, it may be a different byte
order, etc...

The two step conversion, if it does the job, seems acceptable. If it
gets a whole lot more complicated it becomes hard to swallow, and also
hard for me to justify spending the effort. After all, another way out
is to say that GDB 7 on NetBSD requires libiconv -- which eliminates the
problem entirely at the cost of having two libraries that implement
nearly identical versions of iconv -- libc and libiconv.

paul
Valeriy E. Ushakov
2010-05-20 18:11:37 UTC
Permalink
Post by Paul Koning
Post by Valeriy E. Ushakov
It's simply impossible to always use unicode as the only encoding for
wchar_t, since not all charsets are 1:1 with unicode.
That wasn't "they" -- the editorial comment was mine. I thought that
Unicode by now is complete enough to be able to handle other charsets.
It sounds like that's not true, or at least wasn't 12 years ago. Can
you give an example of a charset for which Unicode is not sufficient?
I can invent an infinite number of them - it's a matter of principle :).
The whole point is that the C locale API (warts and all) is supposed to be
*completely* charset-internals agnostic: you should be able to define
external locale information as grokked by your C library, set your
LC_CTYPE &co accordingly, and a well-behaved C program is supposed to
just work.

For a real life example, consider something like CSX (classical
sanskrit extended - a charset used to represent latin transliteration
of classical sanskrit). It has e.g. a character for "r with dot below
with macron with acute". Of course you can represent it using unicode
(you can iconv between csx and utf*), but you will need a sequence of
combining marks, i.e. it's not a 1:1 mapping, so a unicode wchar_t
cannot represent that character.


SY, Uwe
--
***@stderr.spb.ru | Zu Grunde kommen
http://snark.ptc.spbu.ru/~uwe/ | Ist zu Grunde gehen
Paul Koning
2010-05-20 15:01:04 UTC
Permalink
Post by Paul Koning
Post by Paul Koning
...
The trouble for NetBSD is that it asks iconv to translate to a character
set named "wchar_t". That means "whatever the encoding is for the
wchar_t data type". GNU libiconv supports that, so on platforms that
use that library things are fine.
I did some digging to see how libiconv implements that feature.

If __LIBC_ISO_10646__ is defined then it simply aliases this to an
appropriate width Unicode (ucs2 or ucs4). That applies to Linux, for
example.

If it isn't defined (as is the case on NetBSD) but mbrtowc() exists,
then it uses that function. More precisely, a conversion to "wchar_t"
first converts to Unicode, which is then fed into mbrtowc to produce the
wchar_t encoding. mbrtowc knows about any locale issues...

I guess that means that "multibyte" is Unicode, or UTF-8??? I don't see
that documented in any manpage. It also means that if you have a source
character that's not in Unicode but is in whatever encoding wchar_t
uses, it would not be handled by the libiconv implementation of iconv()
because it uses Unicode as an intermediate form.

paul
Valeriy E. Ushakov
2010-05-20 17:58:56 UTC
Permalink
Post by Paul Koning
Post by Paul Koning
Post by Paul Koning
...
The trouble for NetBSD is that it asks iconv to translate to a character
set named "wchar_t". That means "whatever the encoding is for the
wchar_t data type". GNU libiconv supports that, so on platforms that
use that library things are fine.
I did some digging to see how libiconv implements that feature.
If __LIBC_ISO_10646__ is defined then it simply aliases this to an
appropriate width Unicode (ucs2 or ucs4). That applies to Linux, for
example.
If it isn't defined (as is the case on NetBSD) but mbrtowc() exists,
then it uses that function. More precisely, a conversion to "wchar_t"
first converts to Unicode, which is then fed into mbrtowc to produce the
wchar_t encoding. mbrtowc knows about any locale issues...
I guess that means that "multibyte" is Unicode, or UTF-8??? I don't see
that documented in any manpage. It also means that if you have a source
character that's not in Unicode but is in whatever encoding wchar_t
uses, it would not be handled by the libiconv implementation of iconv()
because it uses Unicode as an intermediate form.
Yeah, this fallback seems bogus. mbtowc &co expect the source to be
in the current charset, so it's wrong to feed it unicode data (even if
wchar_t *is* always unicode internally).

SY, Uwe
--
***@stderr.spb.ru | Zu Grunde kommen
http://snark.ptc.spbu.ru/~uwe/ | Ist zu Grunde gehen
James Chacon
2010-05-20 18:30:15 UTC
Permalink
Post by Paul Koning
Post by Paul Koning
Post by Paul Koning
...
The trouble for NetBSD is that it asks iconv to translate to a character
set named "wchar_t". That means "whatever the encoding is for the
wchar_t data type". GNU libiconv supports that, so on platforms that
use that library things are fine.
I did some digging to see how libiconv implements that feature.
If  __LIBC_ISO_10646__ is defined then it simply aliases this to an
appropriate width Unicode (ucs2 or ucs4).  That applies to Linux, for
example.
If it isn't defined (as is the case on NetBSD) but mbrtowc() exists,
then it uses that function.  More precisely, a conversion to "wchar_t"
first converts to Unicode, which is then fed into mbrtowc to produce the
wchar_t encoding.  mbrtowc knows about any locale issues...
I guess that means that "multibyte" is Unicode, or UTF-8???  I don't see
that documented in any manpage.  It also means that if you have a source
character that's not in Unicode but is in whatever encoding wchar_t
uses, it would not be handled by the libiconv implementation of iconv()
because it uses Unicode as an intermediate form.
I think part of your problem here is mixing terminology. Unicode is
not an encoding, it's simply a definition of code points mapping to
specific glyphs. UTF-8/16/32/shift-JIS/etc are all "encodings".

I'll have to go dig out my C99 but locale dependent could mean the
number of bytes a wchar_t contains can vary by locale.

James
Valeriy E. Ushakov
2010-05-20 19:07:32 UTC
Permalink
Post by James Chacon
Post by Paul Koning
Post by Paul Koning
Post by Paul Koning
...
The trouble for NetBSD is that it asks iconv to translate to a character
set named "wchar_t". That means "whatever the encoding is for the
wchar_t data type". GNU libiconv supports that, so on platforms that
use that library things are fine.
I did some digging to see how libiconv implements that feature.
If __LIBC_ISO_10646__ is defined then it simply aliases this to an
appropriate width Unicode (ucs2 or ucs4). That applies to Linux, for
example.
If it isn't defined (as is the case on NetBSD) but mbrtowc() exists,
then it uses that function. More precisely, a conversion to "wchar_t"
first converts to Unicode, which is then fed into mbrtowc to produce the
wchar_t encoding. mbrtowc knows about any locale issues...
I guess that means that "multibyte" is Unicode, or UTF-8??? I don't see
that documented in any manpage. It also means that if you have a source
character that's not in Unicode but is in whatever encoding wchar_t
uses, it would not be handled by the libiconv implementation of iconv()
because it uses Unicode as an intermediate form.
I think part of your problem here is mixing terminology. Unicode is
not an encoding,
While it might be technically sloppy, it doesn't change the gist of
the argument.

If we want to be pedantic we should stick to the terminology in
http://unicode.org/reports/tr17/
Post by James Chacon
it's simply a definition of code points mapping to specific glyphs.
No :). If we are into nitpicking, then dragging glyphs into this is a
much worse sin against terminology :)
Post by James Chacon
I'll have to go dig out my C99 but locale dependent could mean the
number of bytes a wchar_t contains can vary by locale.
No, wchar_t has fixed size. It's the bit pattern of wide characters
that is locale dependent.


SY, Uwe
--
***@stderr.spb.ru | Zu Grunde kommen
http://snark.ptc.spbu.ru/~uwe/ | Ist zu Grunde gehen
Neil Booth
2010-05-23 03:32:39 UTC
Permalink
Paul Koning wrote:-
Post by Paul Koning
Gents,
I'm working on a patch to gdb 7.1 to make it work on NetBSD. The issue
is that GDB 7 uses iconv to handle character strings, and uses wide
chars internally so it can handle various non-ASCII scripts.
The trouble for NetBSD is that it asks iconv to translate to a character
set named "wchar_t". That means "whatever the encoding is for the
wchar_t data type". GNU libiconv supports that, so on platforms that
use that library things are fine.
NetBSD supports iconv, but it doesn't know the "wchar_t" encoding name.
So I proposed a patch that substitutes what appears to be used instead,
namely UCS-4 in platform native byte order (so "ucs-4le" on x86, for
example). This seems to work.
The trouble is that I'm getting pushback on the patch, because of
concerns that the encoding used for wchar_t is not actually UCS-4. In
particular, there is this article:
http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess
which says that on Solaris and FreeBSD the encoding of wchar_t is
"undocumented and locale dependent". (Ye gods!)
Now, NetBSD is not FreeBSD... so... what is the answer for NetBSD? Is
it like FreeBSD? (If so, it would be good to fix that.) Or is it a
fixed encoding, and if so, is it indeed ucs-4?
NetBSD uses Citrus. From what I've figured out, there are 2 wchar_t
encodings: ucs-4 and one I'll call "kuten". The latter is a natural
wide character encoding for some of the narrow character encodings
of far eastern character sets. The following page touches on kuten
a bit:

http://en.wikipedia.org/wiki/JIS_X_0208

This decision to not use UCS-4 universally for wchar_t is one that
raises much heat, and unfortunately leaves those of us requiring
a single wchar_t encoding somewhat stuck.

Because of this, if you want to convert to ucs-4, you need an extra
kuten<->ucs4 converter and step. I don't believe the ability to
do this is given by C, or POSIX, or even Citrus. It's a sad
situation -- there are legitimate reasons to need to be able to do
this; it is untrue that "you should not care". Consider the case
I had: a compiler front end needing to handle extended identifier
characters, and characters with UCNs in them, and wanting to ensure
that the same identifier written both ways was treated identically.
You need to be able to convert your identifiers to UCS-4, and there
is no portable way to do so.

I do believe that these 2 wchar_t encodings are the only ones you'll meet.
But you'll need to have a kuten<->UCS4 map somewhere.

Neil.
Neil Booth
2010-05-23 21:38:44 UTC
Permalink
Paul Koning wrote:-
So I now see three options.
1. Abandon all hope -- leaving GDB 7 seriously busted on NetBSD.
2. Use UCS-4 as the wchar_t encoding in NetBSD, ignoring the fact that in some cases that isn't what is used. That fixes NetBSD for most cases. It opens an opportunity for others to fix the remaining cases.
3. Tell people to use libiconv with GDB 7 on NetBSD. Given how it does the conversion, I'm not at all convinced that this is any different from option 2.
My inclination is to use option 2, i.e., submit a patch to gdb to do that with the argument that it fixes the problem for most cases.
I agree.

Neil.
Martin Husemann
2010-05-23 23:05:42 UTC
Permalink
I'm sorry, I must be missing something, but it still is not clear to me
why you can't use the standard conversion on the host (or, why you can
assume that host and target have the same wchar_t representation).

I would expect the side that runs the UI to tell the debugger about
the character encoding, all relevant parts of the communication to
happen in a locale dependent encoding according to that setting,
and the debugger to just use standard calls to parse those strings.

If there is no common locale setting for both parties, how can you assume
to be able to communicate in wchar_t streams? Passing wchar_t streams
between machines doesn't seem like a good idea, but I guess that is
where I'm missing something.

Martin
Paul Koning
2010-05-24 14:01:53 UTC
Permalink
Post by Martin Husemann
I'm sorry, I must be missing something, but it still is not clear to me
why you can't use the standard conversion on the host (or, why you can
assume that host and target have the same wchar_t representation).
I would expect the side that runs the UI to tell the debugger about
the character encoding, all relevant parts of the communication to
happen in a locale dependent encoding according to that setting,
and the debugger to just use standard calls to parse those strings.
If there is no common locale setting for both parties, how can you assume
to be able to communicate in wchar_t streams? Passing wchar_t streams
between machines doesn't seem like a good idea, but I guess that is
where I'm missing something.
As you said, you definitely do NOT want to assume that host and target
have the same representation, or for that matter the same locale. They
may well be different operating systems, or have different byte order,
and so on.

For that reason, gdb does a conversion when it reads string data from
target memory. It comes from target memory as a byte string, and it
needs to convert that into something the host can use.

Standard calls like mbtowc don't work for this because they are meant to
convert from one host encoding to another, from and to the same locale.
On the other hand, iconv is specifically designed to convert between
encodings chosen explicitly -- not implicitly by the host type and
locale.

So gdb lets the user specify what string encodings are used on the
target (separately for "char" and "wchar_t" target types). Gdb then
uses iconv to translate from that encoding to the one used internally on
the host. The internal encoding is not explicitly chosen by the user;
instead gdb supplies "wchar_t" which in libiconv is the name of "the
encoding of the wchar_t type on this host". For example, I might be
debugging an ACMEos target that uses KOI-8 for char and UCS-2 Big Endian
for its wchar_t; I would then set those two encodings as the target side
encodings for those two string types, and gdb would do the right thing
(display strings correctly).

That codeset name "wchar_t" may be a libiconv extension; in any case,
the Citrus iconv on NetBSD does not support that encoding name. The
result is that gdb 7 on NetBSD will not display string variables at all;
instead you get an error message. So I'm looking for the name to use
instead of "wchar_t". From what I've learned, "ucs-4" (more precisely,
"ucs-4be" or "ucs-4le" depending on the host byte order) is the right
answer most of the time but apparently not all the time. So I'm
inclined to use that on the grounds that it makes the situation much
better for NetBSD, and if someone else can dig up the rest of the answer
that can still be done as a later improvement.

paul
Martin Husemann
2010-05-24 17:58:27 UTC
Permalink
Post by Paul Koning
For that reason, gdb does a conversion when it reads string data from
target memory. It comes from target memory as a byte string, and it
needs to convert that into something the host can use.
Ok, I nearly understand what you are trying to explain, but I
don't understand why any conversion is needed at all. If you have
a char * variable on the target and the host wants to display the content -
it can only do that reasonably when knowing the target's current LC_CTYPE
and applying "a compatible LC_CTYPE" on the host. Why is a wchar_t string
different?
Post by Paul Koning
From what I've learned, "ucs-4" (more precisely,
"ucs-4be" or "ucs-4le" depending on the host byte order) is the right
answer most of the time but apparently not all the time.
So you are saying that the target reads the wchar_t * from memory, converts
it to ucs-4* and transfers the result to the host? Maybe for the purpose of
debugging this is close enough to be an acceptable solution; it should even
work (modulo some loss) when the target's internal wchar_t representation
currently is jis/kuten.

But why (besides gdb folks not having it designed this way) couldn't the
target convert the string to something the host and it agreed upon? Or
maybe even apply no conversion at all and have the user manually set some
compatible environment on the host?

Sorry for the stupid questions, I still feel quite confused.

Martin
Paul Koning
2010-05-24 19:24:18 UTC
Permalink
Post by Martin Husemann
Post by Paul Koning
For that reason, gdb does a conversion when it reads string data from
target memory. It comes from target memory as a byte string, and it
needs to convert that into something the host can use.
Ok, I nearly understand what you are trying to explain, but I
don't understand why any conversion is needed at all. If you have
a char * variable on the target and the host wants to display the content -
it can only do that reasonably when knowing the target's current LC_CTYPE
and applying "a compatible LC_CTYPE" on the host. Why is a wchar_t string
different?
I don't think it is. I probably didn't explain it well.

Say I'm running gdb on a host that uses UTF-8 char strings, so the
locale on the host side is set accordingly. But the target doesn't use
Unicode, it's using some national set, like KOI or Latin-6 or whatever.
So we need to translate the bytes read from the target in order to come
up with host side bits that look right when given to printf.
Post by Martin Husemann
Post by Paul Koning
From what I've learned, "ucs-4" (more precisely,
"ucs-4be" or "ucs-4le" depending on the host byte order) is the right
answer most of the time but apparently not all the time.
So you are saying that the target reads the wchar_t * from memory, converts
it to ucs-4* and transfers the result to the host? Maybe for the purpose of
debugging this is close enough to be an acceptable solution; it should even
work (modulo some loss) when the target's internal wchar_t representation
currently is jis/kuten.
The host to target data transfer is always in terms of byte strings.
The host interprets those, based on the data types. For example, if you
ask for the value of an int, gdb will read however many bytes that is on
the target, and use its knowledge of how the target encodes ints in
order to display the value. Similarly if you ask for the value of a
string (narrow or wide).

What makes it harder for strings is that their encoding depends on
locale, while the encoding of int does not.

As for jis/kuten, that's what Neil mentioned. I know next to nothing
about this but from what I read on Wikipedia it appears that JIS-0208 is
a subset of Unicode. So I'm puzzled why jis/kuten would be used as the
wchar_t encoding.
Post by Martin Husemann
But why (besides gdb folks not having it designed this way) couldn't the
target convert the string to soemthing the host and it agreed upon? Or
maybe even apply no conversion at all and have the user manually set some
compatible environment on the host?
The conversion is in fact in the host. And wchar_t is used on the
theory that this is the "handles everything" string type.

As for a compatible environment, that assumes there is one; there might
not be. Iconv is pretty general but a given OS might not have such a
wide range of locale values it knows. For example, in NetBSD is there a
locale that says you're using ucs-2? I don't think so.

paul
Neil Booth
2010-05-24 22:35:38 UTC
Permalink
Paul Koning wrote:-
Post by Paul Koning
As for jis/kuten, that's what Neil mentioned. I know next to nothing
about this but from what I read on Wikipedia it appears that JIS-0208 is
a subset of Unicode. So I'm puzzled why jis/kuten would be used as the
wchar_t encoding.
Because it's a very fast conversion; the single byte form just
encodes the ku / ten (row / column) with some bitshifts etc;
converting to other kuten-based encodings (Big5 etc) is then also
very simple. But there is no simple mapping from kuten to Unicode,
I think you'd need an 8000-entry table (and one to map back). I
also believe there are a few codepoints that don't have a one-to-one
mapping to Unicode, so a roundtrip conversion isn't guaranteed.

Neil.
Martin Husemann
2010-05-25 07:18:49 UTC
Permalink
Post by Neil Booth
I think you'd need an 8000-entry table (and one to map back). I
also believe there are a few codepoints that don't have a one-to-one
mapping to Unicode, so a roundtrip conversion isn't guaranteed.
Yeah, but I'd argue that maybe for debugging purposes you could live with
the (unwanted) aliasing of a few "equivalent" code points.

However, this would involve a conversion on the target before delivering the
stream to the host, which we do not have at all in Paul's scenario if I
understood correctly - so I'm out of ideas (besides doing what Paul suggested).

Martin
David Laight
2010-05-25 17:24:49 UTC
Permalink
Hmmm.... this whole discussion makes me wonder how gcc handles
wchar_t literals L"string" when cross compiling ...

Doesn't it have the same problem ??

David
--
David Laight: ***@l8s.co.uk