Re: tvb_get_string_enc() doesn't always return valid UTF-8
From: Evan Huus <eapache () gmail com>
Date: Wed, 29 Jan 2014 16:24:09 -0500
On Mon, Jan 27, 2014 at 7:17 PM, Evan Huus <eapache () gmail com> wrote:
Thank you Guy, this is excellent. It's more or less what I was looking
for in  back in October (which, coincidentally, links to the bug I
referred to earlier in this thread).
All of these conversations should be condensed (probably using your
email as a base) and put into a wiki page for future reference. I will
try and do this when I find the time, if nobody else beats me to it.
First draft wiki page is available:
Further comments and suggestions inline.
On Sun, Jan 26, 2014 at 6:13 PM, Guy Harris <guy () alum mit edu> wrote:
On Jan 26, 2014, at 1:27 PM, Jakub Zawadzki <darkjames-ws () darkjames pl> wrote:
On Tue, Jan 21, 2014 at 08:01:15AM -0500, Evan Huus wrote:
On Tue, Jan 21, 2014 at 2:40 AM, Guy Harris <guy () alum mit edu> wrote:
On Jan 20, 2014, at 5:59 PM, Evan Huus <eapache () gmail com> wrote:
In which case is dumb search-and-replace of tvb_get_string with
tvb_get_string_enc and ENC_ASCII an easy way to make (part of) the API
Did somebody say that had been done or suggest that it be done?
I thought it was kind of implied when you wrote "We should also
probably audit all calls to tvb_get_string() and tvb_get_stringz() in
dissectors and change them to tvb_get_string_enc() with the
If tvb_get_string() behaves identically to tvb_get_string_enc(...
ENC_ASCII) then there doesn't seem much point in having both.
We can also think about dropping STR_ASCII, STR_UNICODE stuff.
To be honest I'm not happy about it; I'd rather still display
non-printable ASCII in escaped-hex form.
OK, first-principles time:
A character string is a sequence of code points from a character set. It's represented as a sequence of octets
using a particular encoding for that character set, wherein each character is represented as a 1-or-more-octet
subsequence in that sequence.
In many of those encodings, not all subsequences of octets correspond to code points in the character set. For example:
the 8-bit encoding of ASCII encodes each code point as an octet, and octets with the uppermost bit set don't
correspond to ASCII code points;
the 8-bit encodings of "8-bit" character sets encode each code point as an octet and, in some of those
character sets, there are fewer than 256 code points, and some octet values don't correspond to code points in the
character set;
UTF-8 encodes each Unicode code point as 1 or more octets, and:
an octet sequence that begins with an octet with the uppermost bit set and the bit below it clear is
invalid and doesn't correspond to a code point in Unicode;
an octet sequence that begins with an octet with the uppermost two bits set, and where the 1 bits
below it indicate that the sequence is N bytes long, but that has fewer than N-1 octets-with-10-at-the-top following
it (either because it's terminated by an octet that doesn't have 10 at the top or it's terminated by the end of the
string), is invalid and doesn't correspond to a code point in Unicode;
an octet sequence that doesn't have the two problems above but that produces a value that's not a valid
Unicode code point is invalid and (by definition) doesn't correspond to a code point in Unicode;
UCS-2 encodes each code point in the Unicode Basic Multilingual Plane as 2 octets (big-endian or
little-endian), and not all values from 0 to 65535 correspond to Unicode code points (see next item...);
UTF-16 encodes each Unicode code point as 2 or 4 octets (big-endian or little-endian), with code points in
the Basic Multilingual Plane encoded as 2 octets and other code points encoded as a 2-octet "leading surrogate"
followed by a 2-octet "trailing surrogate" (those are values between 0 and 65535 that are *not* Unicode code points;
see previous item), and:
a leading surrogate not followed by a trailing surrogate (either because it's followed by a 2-octet
Unicode code point value or because it's at the end of the string) is not a valid UTF-16 sequence and doesn't
correspond to a code point in Unicode;
a trailing surrogate not preceded by a leading surrogate (either because it's at the beginning of
the string or because it's preceded by a 2-octet Unicode code point value) is not a valid UTF-16 sequence and
doesn't correspond to a code point in Unicode;
a leading surrogate followed by a trailing surrogate that gives a value that's not a valid Unicode
code point is invalid and (by definition) doesn't correspond to a code point in Unicode;
UCS-4 encodes each Unicode code point directly as 4 octets (big-endian or little-endian), and any value that
corresponds to a surrogate or a value larger than the largest possible Unicode code point value is invalid and
doesn't correspond to a code point in Unicode;
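The UTF-8 rules above can be sketched as a self-contained validity check. This is purely illustrative (it is not Wireshark code, and the function name is made up); it rejects lone continuation bytes, truncated sequences, overlong forms, surrogate values, and code points beyond U+10FFFF, matching the invalid cases listed above:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: check whether a buffer is well-formed UTF-8 per the rules
 * above -- no lone continuation bytes, no truncated sequences, no
 * overlong forms, no surrogates, no values above U+10FFFF. */
static bool is_valid_utf8(const uint8_t *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint8_t b = buf[i];
        size_t n;           /* total length of this sequence */
        uint32_t cp, min;   /* decoded code point, smallest legal value */
        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { n = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07; min = 0x10000; }
        else return false;  /* bare 10xxxxxx lead byte, or 0xF8 and up */
        if (i + n > len)
            return false;   /* sequence truncated by the end of the string */
        for (size_t j = 1; j < n; j++) {
            if ((buf[i + j] & 0xC0) != 0x80)
                return false;  /* expected octet-with-10-at-the-top missing */
            cp = (cp << 6) | (buf[i + j] & 0x3F);
        }
        if (cp < min) return false;                      /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate */
        if (cp > 0x10FFFF) return false;                 /* beyond Unicode */
        i += n;
    }
    return true;
}
```

With this, the 0x20 0xC0 0xC0 0xC0 example from later in the thread fails the check at the first 0xC0, since no continuation octet follows it.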
Strings in Wireshark are:
displayed to users, either directly in the packet containing them as part of the packet summary or detail,
or indirectly, for example, by being stored as a pathname or pathname component for a file and shown in packets
referring to the file by some ID rather than by pathname;
matched by packet-matching expressions (display filters, color filters, etc.);
processed internally by dissectors and taps (whether in C, Lua, or some other language);
handed to other programs by, for example, "tshark -T fields -e ...".
In all of those cases, we need to do *something* with the invalid octet sequences.
In the packet-matching expression case, I'd say that:
*all* comparison operations in which a string value from the packet is compared with a constant text string
should fail if the string has invalid octet sequences (so 0x20 0xC0 0xC0 0xC0, as a purportedly-UTF-8 string, is
neither equal to nor unequal to " " or "hello" or....);
In addition, we should have a monadic function "valid" (I believe the
matching engine already supports the syntax for monadic functions, so
this should be easy to do) which takes a string field and returns a
boolean indicating whether or not the string contains invalid octet sequences.
The other question here is matching non-printing characters. If I want
to match 0x13 0x10 I should probably be able to compare against
- most non-printing characters don't have obvious escape sequences
(see the bit on \uXX escape sequences below, I guess)
- if I want to match a literal "\" and I'm typing in the shell, that
means I need to type "\\\\" for all the escapes to process correctly.
comparison operations in which a string value from the packet is compared with an octet string (comparing
against something such as 20:c0:c0:c0) should do an octet-by-octet comparison of the raw octets of the string (so
0x20 0xC0 0xC0 0xC0, no matter what the encoding, compares equal to 20:c0:c0:c0);
equality comparison operations between two string values from the packet succeed if either
1) the two string values are valid in their encoding and translate to the same UTF-8 string
2) the two string values have the same encoding and have the same octets.
It took me a while to convince myself 2 was a good idea (since it
seems on the surface that it might confuse the
invalid-strings-are-not-comparable choice) but on reflection I think
it's probably the right thing to do.
That would require more work.
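Those two equality rules could be sketched as follows. Everything here is hypothetical (the names are not Wireshark's ENC_* values), and the encoding set is reduced so that the translate-to-UTF-8 step in rule 1 is the identity for valid ASCII; a real version would translate each encoding to UTF-8 before comparing:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative encodings only -- not Wireshark's ENC_* values. */
typedef enum { MY_ENC_ASCII, MY_ENC_UTF8 } my_enc;

/* Validity stand-in: ASCII is valid iff every octet is < 0x80; the full
 * UTF-8 well-formedness test is elided to keep the sketch short. */
static bool my_valid(my_enc enc, const unsigned char *s, size_t len)
{
    if (enc == MY_ENC_ASCII) {
        for (size_t i = 0; i < len; i++)
            if (s[i] >= 0x80)
                return false;
    }
    return true;
}

/* Equality per the rules above: rule 2 first (same encoding, same
 * octets), then rule 1 (both values valid in their encoding and equal
 * after translation to UTF-8; for valid ASCII that translation is the
 * identity, so a plain octet compare suffices here). */
static bool my_str_equal(my_enc ea, const unsigned char *a, size_t la,
                         my_enc eb, const unsigned char *b, size_t lb)
{
    if (ea == eb && la == lb && memcmp(a, b, la) == 0)
        return true;                                   /* rule 2 */
    if (!my_valid(ea, a, la) || !my_valid(eb, b, lb))
        return false;  /* invalid strings compare equal to nothing else */
    return la == lb && memcmp(a, b, la) == 0;          /* rule 1 */
}
```

Note how rule 2 lets two identically-encoded invalid strings still compare equal to each other, while rule 1's validity check keeps an invalid string from matching anything in a different encoding.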
In the display case, an argument could be made that invalid octet sequences should be displayed as a sequence of
\xNN escape sequences, one octet at a time.
In the "processed internally" case, if the part of the string that's being looked at contains an invalid octet
sequence, the processing should fail, otherwise the processing should still work. For example, an HTTP request
beginning with 0x47 0x45 0x54 0x20 0xC0 should be treated as a GET request with the operand being invalid, but an
HTTP request beginning with 0x47 0x45 0x54 0xC0 should be treated as an invalid request. That would *also* require more work.
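The GET example amounts to a per-token rule: validity matters only for the part of the string a dissector actually examines. A minimal sketch (a hypothetical helper, not dissector code):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the per-token rule above: the request is recognized as a GET
 * when the method token ("GET" plus the following space) is intact,
 * regardless of whether the operand after it is valid; a bad octet
 * inside the method token itself makes the request unrecognizable. */
static bool looks_like_get(const unsigned char *line, size_t len)
{
    return len >= 4 && memcmp(line, "GET ", 4) == 0;
}
```

So 0x47 0x45 0x54 0x20 0xC0 ("GET " followed by a bad octet) is recognized, while 0x47 0x45 0x54 0xC0 is not, matching the distinction drawn above.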
I'm not sure *what* the right thing to do when handing fields to other programs is.
I see two probable use cases:
- the program is interested in the raw bytes, in which case that's
what we should send; if the string isn't valid, the reading program
will deal with it
- the program is interested in the string, in which case we should
send it in the locale-appropriate encoding (most frequently UTF-8)
with invalid sequences mapped to the encoding-appropriate replacement character.
These two should cover 99% of cases I can think of with relatively
minimal effort on our part. The second should be default, since the
most frequent case of "other program" is probably "stdout of a shell"
or "text file".
So the functions that get strings from packets should not map invalid octet sequences to a sequence of \xNN escape
sequences, as that would interfere with proper handling of the string when doing packet matching and internal
processing. For those cases, perhaps a combination of
1) replacing invalid sequences with REPLACEMENT CHARACTER
2) providing a separate indication that this was done
would be the right approach.
I agree. This makes it easy for dissectors that don't care to
gracefully handle invalid strings. Dissectors that care about the type
of invalidity can access the individual bytes, though we may want a
helper function list_string_encoding_errors(tvb, offset, len, enc) or
some such thing.
However, that then throws away information, so that you can't *display* that string with the invalid sequences shown
as \xNN sequences.
I have a vague idea that a combination of
list_string_encoding_errors() and escape_encoding_errors() would be
sufficient, but this definitely bears more thought. Maybe just a
boolean parameter to tvb_get_string()? Or make tvb_get_string() always
do the escaping, and provide a helper unescape_string() which removes
the escaping and puts replacement characters in for invalid sequences?
For now, my inclination is to continue with the "replace invalid sequences with REPLACEMENT CHARACTER in
tvb_get_string* routines" strategy, but not treat that as the ultimate solution. (I'll talk about some thoughts
about what to do after that below.)
Non-printable characters are an orthogonal issue; they *can* be represented in our UTF-8 encoding of Unicode, but
they shouldn't be displayed in the UI as themselves.
My inclination there is to replace them, when displaying, with escape sequences:
for code points >= 0x80, I'd display them as a \uXXXXXX escape sequence (whether to trim leading zeroes, and
how many of them to trim, is left as an exercise for the reader; probably trim no more than two leading zeroes, but
I'm not sure what to do if there's only one leading zero) - note that this won't show the value(s) of the octet(s)
that make up the code point, it'll show the Unicode code point;
for 0x7F and most code points < 0x20, I'd display them either as \uXX, \xXX, or \ooo (whether to stick with
Traditional Octal(TM), hex, or Unicode is left as an exercise for the reader);
for the characters that have their own C string escape sequences (e.g., tab, CR, LF), I'd display them as
that escape sequence.
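Those display rules might be sketched per code point like so (purely illustrative; the leading-zero question is dodged by always printing six hex digits, and hex is chosen over Traditional Octal(TM) for the control characters):

```c
#include <stdio.h>
#include <string.h>

/* Sketch of the display rules above for one Unicode code point: named C
 * escapes for tab/CR/LF, \xXX for other control characters and DEL,
 * \uXXXXXX for code points >= 0x80 (all leading zeroes kept), and the
 * character itself otherwise.  'buf' must hold at least 10 bytes. */
static void escape_cp_for_display(unsigned long cp, char *buf)
{
    if (cp == '\t')
        strcpy(buf, "\\t");
    else if (cp == '\r')
        strcpy(buf, "\\r");
    else if (cp == '\n')
        strcpy(buf, "\\n");
    else if (cp < 0x20 || cp == 0x7F)
        sprintf(buf, "\\x%02lX", cp);
    else if (cp >= 0x80)
        sprintf(buf, "\\u%06lX", cp);  /* shows the code point, not octets */
    else {
        buf[0] = (char)cp;             /* printable ASCII, as itself */
        buf[1] = '\0';
    }
}
```

As noted above, the \uXXXXXX form shows the Unicode code point, not the octets of its encoding, which keeps the display independent of whether the string arrived as UTF-8, UTF-16, or something else.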
(For the future, we might want to have the "value", in a protocol tree, of a string be a combination of the encoding
and the raw octets; if some code wants the value for processing purposes, it'd call a routine that converts the
value to UTF-8 with REPLACEMENT CHARACTER replacing invalid sequences, and if it wants the value for display
purposes, it'd call a routine that converts it to UTF-8 with escape sequences replacing invalid sequences *and* with
non-printable characters shown as the appropriate escape sequence.
That raises the question of whether, when building a protocol tree, we need to put the value into the protocol tree
item at all if the item is created with proto_tree_create_item(), or whether we just postpone extracting the value
until we actually need it. Lazy processing FTW....)
Sent via: Wireshark-dev mailing list <wireshark-dev () wireshark org>