Re: Webtrends HTTP Server %20 bug (UTF-8)
From: Peter W <peterw () usa net>
Date: Fri, 8 Jun 2001 15:40:56 -0400
On Fri, Jun 08, 2001 at 04:51:57AM +0100, Glynn Clements wrote:
> Eric Hacker wrote:
> > Conveniently, UTF8 uses the same
> > values as ASCII for ASCII representation. Above the standard ASCII 127
> > character representation, UTF8 uses multi-byte strings beginning with 0xC1.
> No; the sequences for codes 128 to 255 begin with 0xC2 and 0xC3,
> and encodings for 256 - (2^31 - 1) use other values in the first octet.
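Glynn's correction is easy to check against any UTF-8 encoder; here is a quick sketch in (modern) Python, using Python's own encoder rather than anything Webtrends ships:

```python
# First octet UTF-8 produces for code points 0x80-0xFF:
# 0x80-0xBF encode as 0xC2 + continuation, 0xC0-0xFF as 0xC3 + continuation,
# so 0xC1 never appears as a lead octet in this range.
for cp in (0x80, 0xBF, 0xC0, 0xFF):
    encoded = chr(cp).encode("utf-8")
    print("U+%04X -> %s" % (cp, " ".join("%02X" % b for b in encoded)))
```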
Two points here:
1) Eric wrote "As a URL cannot contain spaces or other special characters,
URL encoding is used to transport them. Thus all UTF8 characters above ASCII
are supposed to be URL encoded in order to be sent."
It's not at all clear to me a) that UTF-8 sequences are allowed in *any*
HTTP headers (request or response) or b) how a server or client would decide
whether a possible UTF-8 sequence like %C3%B3 is UTF-8 for the single
character 0xF3 or the two-character sequence 0xC3 + 0xB3. All indications in
the RFCs (2068, 1738, 1808) suggest that only single octets 0x00 - 0xFF are
expected in the various headers, and that no UTF-8, double-byte, or other
multi-byte representations are allowed.
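The ambiguity is easy to demonstrate; a sketch in (modern) Python, where the same two escaped octets decode as either one character or two depending on which charset the receiver assumes:

```python
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("%C3%B3")      # the two octets 0xC3 0xB3
one_char = raw.decode("utf-8")        # as UTF-8: the single character 0xF3
two_chars = raw.decode("iso-8859-1")  # as Latin-1: the pair 0xC3, 0xB3
print(len(one_char), len(two_chars))  # -> 1 2
```

Nothing in the request itself tells the server which interpretation the client intended.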
2) The UTF-8 rules are kinda funny. 0xFE and 0xFF are illegal everywhere,
and other octets may be illegal depending on their placement, e.g. a
"starting" octet with bit 2^7 on and 2^6 off (the pattern reserved for
continuation octets), or a "subsequent" octet that doesn't have 2^7 on and
2^6 off. I wouldn't be surprised if some UTF-8 parsing routines don't handle
illegal sequences gracefully, or if applications don't gracefully trap
errors reported by the UTF-8 parsing routines, etc. This might be worth some
testing.
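A sketch of the sort of testing I mean, using Python's strict UTF-8 decoder as the guinea pig (a correct decoder should reject every one of these; a server's hand-rolled routine may not):

```python
# Byte strings that are illegal UTF-8 under the rules above
bad = [
    b"\xfe",      # 0xFE never appears in valid UTF-8
    b"\xff",      # nor does 0xFF
    b"\xb3",      # lone "subsequent" octet (2^7 on, 2^6 off) with no lead
    b"\xc3\x28",  # lead octet followed by a non-continuation octet
]
for seq in bad:
    try:
        seq.decode("utf-8")
        print(repr(seq), "accepted -- decoder is too lenient")
    except UnicodeDecodeError:
        print(repr(seq), "rejected")
```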