distinction between resource charset and format octet decoding

distinction between resource charset and format octet decoding

Garret Wilson
I have a question (using Tomcat 9.0.12 on Windows 10), and I'd like
someone on the Tomcat development team to clarify a distinction for me
regarding resource charsets and octet decoding in a particular format. I
am not a newbie, and although the answer to my question may seem
obvious, it goes to a critical issue that I believe to be a fundamental
bug in Tomcat encoding processing.

Let's say that as an HTTP client I retrieve a resource `readme.txt` from
Tomcat, and Tomcat clearly indicates via the HTTP response headers that
the `Content-Type` is `text/plain; charset=ISO-8859-1`. That file
contains, among other things, a line that says:

     See https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9 for more info.

I want to parse the text file and present a live link to the user (email
clients do this all the time), but I want to make the link "pretty" by
decoding the URL. The question is: do I decode the octets using UTF-8,
to show `…fullName=Flávio+José`, or do I use ISO-8859-1 to decode the
octets, so that I show `…fullName=FlÃ¡vio+JosÃ©`? (Flávio José is a
famous Brazilian forró singer and musician, by the way.)

The content type encoding of `readme.txt` is ISO-8859-1, so I must use
ISO-8859-1 to decode the octets in `Fl%C3%A1vio+Jos%C3%A9`, yielding
`…fullName=FlÃ¡vio+JosÃ©`, right??!

No, of course not. The decoding of the octet sequence is independent of
the resource encoding, and represents a separate layer of encoding _on
top_ of the resource encoding. It wouldn't matter whether the text file
were encoded in UTF-8, ISO-8859-1, or US-ASCII—the URL would still be
https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9, and its
octets should still be decoded using UTF-8 as per RFC 3986.
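
To make the two candidate decodings concrete, here is a minimal Java sketch. It relies only on the JDK's `java.net.URLDecoder`; the class name `PercentDecodeDemo` and the helper `decode` are made up for illustration. Note that `URLDecoder` applies form-style rules, so `+` becomes a space, which a strict RFC 3986 decoder would not do.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PercentDecodeDemo {

    /** Percent-decodes a value and interprets the resulting octets in the named charset. */
    static String decode(String encoded, String charsetName) {
        try {
            // URLDecoder also converts '+' to a space (form-style decoding).
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String encoded = "Fl%C3%A1vio+Jos%C3%A9";
        // Same octets, two interpretations.
        System.out.println("UTF-8:      " + decode(encoded, "UTF-8"));
        System.out.println("ISO-8859-1: " + decode(encoded, "ISO-8859-1"));
    }
}
```

The octets are identical in both calls; only the charset used to turn them into characters differs. With UTF-8 the two-octet sequence `C3 A1` becomes one character (`á`); with ISO-8859-1 it becomes two (`Ã` and `¡`).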

I'll get right to the point; the above was a rhetorical question used as
an analogy.

The Tomcat FAQ at
https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q9 indicates that
the default encoding for an HTTP POST is ISO-8859-1. That is true.
However, Tomcat then goes further and assumes that it should decode
_the octets of `application/x-www-form-urlencoded`_ using ISO-8859-1 as
well! This is simply wrong; the octets should be interpreted as a
sequence of UTF-8 octets; see
https://url.spec.whatwg.org/#concept-urlencoded-serializer . This means
that if my browser sends a POST with content `fullName=Fl%C3%A1vio+Jos%C3%A9`
using `application/x-www-form-urlencoded`, Tomcat will interpret this
request parameter as `FlÃ¡vio JosÃ©` in my servlet/JSP, when it should
interpret it as `Flávio José`. (Tomcat correctly decodes the octets when
they are used in a query parameter rather than a POST parameter.)
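
As a purely illustrative aside, the mis-decoding described above can be reproduced, and reversed, with plain JDK string conversions. This is a sketch and not something the FAQ recommends; the class and method names (`MojibakeDemo`, `misdecode`, `repair`) are hypothetical. The reversal works only because ISO-8859-1 maps every octet to exactly one character, so no information is lost in the wrong decoding step.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    /** Interprets UTF-8 octets as ISO-8859-1, which is the mistake under discussion. */
    static String misdecode(byte[] utf8Octets) {
        return new String(utf8Octets, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the intended text by turning the characters back into octets and decoding as UTF-8. */
    static String repair(String misdecoded) {
        byte[] octets = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(octets, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00e1 is 'á' and \u00e9 is 'é'; escapes keep the source file encoding-neutral.
        String intended = "Fl\u00e1vio Jos\u00e9";
        String wrong = misdecode(intended.getBytes(StandardCharsets.UTF_8));
        System.out.println(wrong);
        System.out.println(repair(wrong));
    }
}
```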

Now it may be that the FAQ is simply out of date; it still seems to
think that encoded URI octets should not be interpreted as UTF-8,
completely ignoring RFC 3986. If so, it is long out of date; RFC 3986
came out in 2005. (And indeed, Tomcat works with UTF-8 octets in URIs.)
But out of date or not, the FAQ at
https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8 then recommends
that to force Tomcat to interpret the
`application/x-www-form-urlencoded` octets as UTF-8, I must set the
`org.apache.catalina.filters.SetCharacterEncodingFilter` filter (in some
`web.xml` file) to `UTF-8`. (I can alternatively put `<%
request.setCharacterEncoding("UTF-8"); %>` in my JSP.) And sure enough,
it fixes the problem.

But as discussed above, this is completely wrong: the resource character
encoding of a request sent in `application/x-www-form-urlencoded` should
have absolutely no bearing on how the encoded octets within that
resource are decoded. They must be decoded as UTF-8, irrespective of
what "character encoding" Tomcat assumes the content to be. Tomcat has
updated the way it decodes URIs to support UTF-8; it is time Tomcat does
the same for `application/x-www-form-urlencoded` values. The current
approach is broken in the context of the modern web, and the workaround
is simply wrong.

I also raised this at https://stackoverflow.com/q/54094982/421049 .

I would have filed a Tomcat Bugzilla issue, but the bug report form
indicated I should report the problem on this list first.

Thank you in advance for your attention to this matter.

Garret Wilson
GlobalMentor, Inc.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: distinction between resource charset and format octet decoding

markt
On 08/01/2019 21:31, Garret Wilson wrote:

<snip/>

> But as discussed above, this is completely wrong: the resource character
> encoding of a request sent in `application/x-www-form-urlencoded` should
> have absolutely no bearing on how the encoded octets within that
> resource are decoded.

That is not the correct interpretation of section 3.12 of the Servlet
4.0 specification (note the section numbers do vary between spec
versions). Tomcat implements the correct interpretation - i.e. the
charset from the request content-type defines how encoded octets are
decoded and, if none is specified, ISO-8859-1 is used as the default.
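
The rule described here can be sketched in a few lines of standalone Java. This is not Tomcat's implementation, only an approximation of the behaviour under stated assumptions: the class `FormBodyDemo` and its methods are hypothetical, repeated parameter names simply overwrite each other, and the charset is taken from the `Content-Type` value with ISO-8859-1 as the fallback.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class FormBodyDemo {

    /** Returns the charset parameter of a Content-Type value, or the fallback if there is none. */
    static String charsetOf(String contentType, String fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                return p.substring("charset=".length()).replace("\"", "").trim();
            }
        }
        return fallback;
    }

    /** Parses an application/x-www-form-urlencoded body using the request charset. */
    static Map<String, String> parse(String body, String contentType) {
        // Assumption for this sketch: ISO-8859-1 when the request names no charset.
        String charset = charsetOf(contentType, "ISO-8859-1");
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            try {
                params.put(URLDecoder.decode(name, charset), URLDecoder.decode(value, charset));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException("Unsupported charset: " + charset, e);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String body = "fullName=Fl%C3%A1vio+Jos%C3%A9";
        System.out.println(parse(body, "application/x-www-form-urlencoded"));
        System.out.println(parse(body, "application/x-www-form-urlencoded; charset=UTF-8"));
    }
}
```

With no charset on the request, the value comes out as `FlÃ¡vio JosÃ©`; with `charset=UTF-8` it comes out as `Flávio José`. Changing the fallback is, in effect, what the configuration option below does.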

Yes, this default is now very out-dated. That is a side-effect of:
- how long the Servlet specification has been around
- the very conservative approach taken by Java EE in terms of backwards
   compatibility (once set, defaults are very rarely - if ever - changed)
- arguably missed opportunities to address this issue prior to
   Servlet 4.0

As of Servlet 4.0 there is a specification compliant configuration
option to change this default to any encoding of your choice. Obviously,
UTF-8 is one of the options. You can do this by adding the following to
your web.xml:

<request-character-encoding>UTF-8</request-character-encoding>

If you add it to conf/web.xml it applies to every web application
deployed to Tomcat.

Tomcat 9 uses this in the examples, manager and host-manager
applications in place of the SetCharacterEncodingFilter.

Whether Tomcat should ship with this setting present in conf/web.xml by
default is something that should probably be discussed for Tomcat 10.
Given the current state of the web, there is a reasonable case for doing
so. I'll add that to the TOMCAT-NEXT discussion list.

The Tomcat Wiki also needs to be updated to take account of this new
configuration option (and the related <response-character-encoding>).
Since it is a wiki and this is clearly an issue you care about would you
like to tackle that?

Mark


Re: distinction between resource charset and format octet decoding

Garret Wilson
Hi, Mark, and thanks for the quick response. You provided some info I
wasn't aware of. Some responses below:

On 1/8/2019 9:57 PM, Mark Thomas wrote:

> On 08/01/2019 21:31, Garret Wilson wrote:
>
> <snip/>
>
>> But as discussed above, this is completely wrong: the resource
>> character encoding of a request sent in
>> `application/x-www-form-urlencoded` should have absolutely no bearing
>> on how the encoded octets within that resource are decoded.
>
> That is not the correct interpretation of section 3.12 of the Servlet
> 4.0 specification (note the section numbers do vary between spec
> versions). Tomcat implements the correct interpretation - i.e. the
> charset from the request content-type defines how encoded octets are
> decoded and, if none is specified, ISO-8859-1 is used as the default.


Ah, I hadn't seen that in the servlet spec. Yes, it seems as if Tomcat
is correctly following the spec, but I would still say the servlet spec
is wrong to make any linkage at all between resource encoding and %nn
interpretation. In fact reading the prose it's not clear to me that the
servlet spec is even strongly tying the %nn interpretation to the
encoding. It just seems to say that, unless otherwise specified, the %nn
interpretation should be ISO-8859-1. And actually that's a step up from
the HTML 4.01 spec, which in
https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 indicates
that they should be interpreted as US-ASCII codes. :(

You indicate that this is all out of date, and I think we're in
agreement there. We really, really need to get the next servlet
specification to remove this part. In fact the servlet specification
should defer to the official `application/x-www-form-urlencoded`
specification, which at this point I think is the W3C HTML5 spec, which
in turn defers to the WHATWG spec (which clearly says that UTF-8 should
be used). What makes all of this more of a mess is that there seems to be
no way to work around this from the client side, e.g. by putting
something in the HTML to indicate UTF-8, as
`application/x-www-form-urlencoded` doesn't support a `charset` parameter.

Anyway if there are any openings on the committee to update the servlet
spec, let me know.


> ...
> As of Servlet 4.0 there is a specification compliant configuration
> option to change this default to any encoding of your choice.
> Obviously, UTF-8 is one of the options. You can do this by adding the
> following to your web.xml:
>
> <request-character-encoding>UTF-8</request-character-encoding>

Oh, that is really good to know, thanks!! Still I say that the request
character encoding is orthogonal to the %nn encoding, but, still, it's
good to have an implementation-agnostic way to do it.

>
>
> Whether Tomcat should ship with this setting present in conf/web.xml
> by default is something that should probably be discussed for Tomcat
> 10. Given the current state of the web, there is a reasonable case for
> doing so. I'll add that to the TOMCAT-NEXT discussion list.


Yes please! If I can help in any way, let me know.


>
> The Tomcat Wiki also needs to be updated to take account of this new
> configuration option (and the related <response-character-encoding>).
> Since it is a wiki and this is clearly an issue you care about would
> you like to tackle that?


Yes, I'd love to. Let me know what permissions I need, etc.

I have an international flight boarding right now so I have to go, and I
may not reply for the next few hours, but definitely sign me up.

Thanks,

Garret



Re: distinction between resource charset and format octet decoding

markt
On 09/01/2019 00:50, Garret Wilson wrote:

> Hi, Mark, and thanks for the quick response. You provided some info I
> wasn't aware of. Some responses below:
>
> On 1/8/2019 9:57 PM, Mark Thomas wrote:
>> On 08/01/2019 21:31, Garret Wilson wrote:
>>
>> <snip/>
>>
>>> But as discussed above, this is completely wrong: the resource
>>> character encoding of a request sent in
>>> `application/x-www-form-urlencoded` should have absolutely no bearing
>>> on how the encoded octets within that resource are decoded.
>>
>> That is not the correct interpretation of section 3.12 of the Servlet
>> 4.0 specification (note the section numbers do vary between spec
>> versions). Tomcat implements the correct interpretation - i.e. the
>> charset from the request content-type defines how encoded octets are
>> decoded and, if none is specified, ISO-8859-1 is used as the default.
>
>
> Ah, I hadn't seen that in the servlet spec. Yes, it seems as if Tomcat
> is correctly following the spec, but I would still say the servlet spec
> is wrong to make any linkage at all between resource encoding and %nn
> interpretation. In fact reading the prose it's not clear to me that the
> servlet spec is even strongly tying the %nn interpretation to the
> encoding. It just seems to say that, unless otherwise specified, the %nn
> interpretation should be ISO-8859-1. And actually that's a step up from
> the HTML 4.01 spec, which in
> https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 indicates
> that they should be interpreted as US-ASCII codes. :(
>
> You indicate that this is all out of date, and I think we're in
> agreement there. We really, really need to get the next servlet
> specification to remove this part. In fact the servlet specification
> should defer to the official `application/x-www-form-urlencoded`
> specification, which at this point I think is the W3C HTML5 spec, which
> in turn defers to the WHATWG spec (which clearly says that UTF-8 should
> be used). What makes all of this more of a mess is that there seems to be
> no way to work around this from the client side, e.g. by putting
> something in the HTML to indicate UTF-8, as
> `application/x-www-form-urlencoded` doesn't support a `charset` parameter.
>
> Anyway if there are any openings on the committee to update the servlet
> spec, let me know.

That has moved to Eclipse. The process to update the spec is still being
defined. The Jakarta EE Servlet API project is the project to get
involved in.


>> ...
>> As of Servlet 4.0 there is a specification compliant configuration
>> option to change this default to any encoding of your choice.
>> Obviously, UTF-8 is one of the options. You can do this by adding the
>> following to your web.xml:
>>
>> <request-character-encoding>UTF-8</request-character-encoding>
>
> Oh, that is really good to know, thanks!! Still I say that the request
> character encoding is orthogonal to the %nn encoding, but, still, it's
> good to have an implementation-agnostic way to do it.
>
>>
>>
>> Whether Tomcat should ship with this setting present in conf/web.xml
>> by default is something that should probably be discussed for Tomcat
>> 10. Given the current state of the web, there is a reasonable case for
>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
>
>
> Yes please! If I can help in any way, let me know.
>
>
>>
>> The Tomcat Wiki also needs to be updated to take account of this new
>> configuration option (and the related <response-character-encoding>).
>> Since it is a wiki and this is clearly an issue you care about would
>> you like to tackle that?
>
>
> Yes, I'd love to. Let me know what permissions I need, etc.

Create yourself an account at https://wiki.apache.org/tomcat (click
login then create an account) and let the list know your ID. Then one of
the admins can add you to the allowed editors.

Apologies for the hoop jumping required but without the manual approval
step for new accounts, the ASF project wikis were being deluged with spam.

Mark




Re: distinction between resource charset and format octet decoding

Garret Wilson
On 1/9/2019 2:30 AM, Mark Thomas wrote:
> …
> Create yourself an account at https://wiki.apache.org/tomcat (click
> login then create an account) and let the list know your ID. Then one of
> the admins can add you to the allowed editors.


I was just ready to create an account, but I want to verify the details
so I don't screw things up.

  * It asks for a "Name". Is this a username, I suppose? So we don't
    maintain our "name" separate from our "login username"?
  * It says to use "FirstnameLastName". Are you literally wanting us to
    use "JohnDoe", or can we use "johndoe"? Sorry for the questions; as
    one who works with protocols all the time, I automatically assume
    this stuff is important. But I prefer to use lowercase on my
    usernames; I'm a little confused about why this would want
    PascalCase for a login username. (I can't think of another system
    that I use that requires PascalCase usernames.)

My guess is that it's trying to maintain a "human name" and a "username"
but combine them both into one field or something. I can't say this
approach is typical…

Garret


Re: distinction between resource charset and format octet decoding

markt
On 15/01/2019 03:39, Garret Wilson wrote:

> On 1/9/2019 2:30 AM, Mark Thomas wrote:
>> …
>> Create yourself an account at https://wiki.apache.org/tomcat (click
>> login then create an account) and let the list know your ID. Then one of
>> the admins can add you to the allowed editors.
>
>
> I was just ready to create an account, but I want to verify the details
> so I don't screw things up.
>
>  * It asks for a "Name". Is this a username, I suppose? So we don't
>    maintain our "name" separate from our "login username"?

Yes, it is your username. Any linkage from that to your "public name"
would be maintained on your user page - if you wish.

>  * It says to use "FirstnameLastName". Are you literally wanting us to
>    use "JohnDoe", or can we use "johndoe"? Sorry for the questions; as
>    one who works with protocols all the time, I automatically assume
>    this stuff is important. But I prefer to use lowercase on my
>    usernames; I'm a little confused about why this would want
>    PascalCase for a login username. (I can't think of another system
>    that I use that requires PascalCase usernames.)

Think of it as a SHOULD rather than a MUST.

> My guess is that it's trying to maintain a "human name" and a "username"
> but combine them both into one field or something. I can't say this
> approach is typical…

Anything in PascalCase becomes a link to a wiki page of that name.
Usernames are created in this form so references to the user
automatically become links to that user's page in the wiki.

It isn't a feature we use much at the moment. A quick check shows that
most, but not all, contributors have created their user name in PascalCase.

For example, take a look at https://wiki.apache.org/tomcat/AndrewCarr

Mark


Re: distinction between resource charset and format octet decoding

Garret Wilson
On 1/15/2019 3:20 AM, Mark Thomas wrote:
> …
> Anything in PascalCase becomes a link to a wiki page of that name.
> Usernames are created in this form so references to the user
> automatically become links to that user's page in the wiki.


Ah, OK, that explains it. Very good to know. Maybe a little semantic
overloading, but as this is my first wiki account anywhere, I'm guessing
it's typical with whatever software you're using.

Anyway my account is created, with username `GarretWilson`. After I get
permissions I'll update the info on octet encoding for
application/x-www-form-urlencoded in relation to the servlet spec. It
may not be immediately, but I'll slowly but surely get to it.

Cheers,

Garret



Re: distinction between resource charset and format octet decoding

markt
On 23/01/2019 05:07, Garret Wilson wrote:

> On 1/15/2019 3:20 AM, Mark Thomas wrote:
>> …
>> Anything in PascalCase becomes a link to a wiki page of that name.
>> Usernames are created in this form so references to the user
>> automatically become links to that user's page in the wiki.
>
>
> Ah, OK, that explains it. Very good to know. Maybe a little semantic
> overloading, but as this is my first wiki account anywhere, I'm guessing
> it's typical with whatever software you're using.
>
> Anyway my account is created, with username `GarretWilson`. After I get
> permissions I'll update the info on octet encoding for
> application/x-www-form-urlencoded in relation to the servlet spec. It
> may not be immediately, but I'll slowly but surely get to it.

Karma granted. Happy editing.

Cheers,

Mark


Re: distinction between resource charset and format octet decoding

Garret Wilson
Good morning, I'm just getting to the editing. I'm going to list some
thoughts I have as I go through this, so you can verify things:

  * The servlet spec links are way out of date. I'll update them.
  * "There /is no default encoding for URIs/ specified anywhere, which
    is why there is a lot of confusion when it comes to decoding these
    values." Sheesh, this is ancient. I'll correct it as per
    https://tools.ietf.org/html/rfc3986#section-2.5 .
  * "Most of the web uses ISO-8859-1 as the default for query strings."
    Is this still true?! In light of the above, I would think it is not
    true, but I wanted to ask, as you know better about what you've seen
    "in the wild".

Garret


Re: distinction between resource charset and format octet decoding

Garret Wilson
On 2/1/2019 7:23 AM, Garret Wilson wrote:
> …
>  * "There /is no default encoding for URIs/ specified anywhere, which
>    is why there is a lot of confusion when it comes to decoding these
>    values." Sheesh, this is ancient. I'll correct it as per
>    https://tools.ietf.org/html/rfc3986#section-2.5 .


Amazing. A close reading of RFC 3986 reveals that there is no clear
mandate for UTF-8 in existing URI schemes, even though it is recommended
for new schemes. Anyway, everyone seems to have settled on UTF-8 (Tomcat
included), so I'll try to indicate that.
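
Why the missing mandate matters can be seen by encoding the same text under two charsets. The following is a minimal sketch using the JDK's `java.net.URLEncoder` (which applies form-style rules, so a space becomes `+`); the class name `PercentEncodeDemo` and the helper `encode` are invented for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PercentEncodeDemo {

    /** Percent-encodes a value after converting it to octets in the named charset. */
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // \u00e1 is 'á' and \u00e9 is 'é'; escapes keep the source file encoding-neutral.
        String name = "Fl\u00e1vio Jos\u00e9";
        System.out.println(encode(name, "UTF-8"));      // Fl%C3%A1vio+Jos%C3%A9
        System.out.println(encode(name, "ISO-8859-1")); // Fl%E1vio+Jos%E9
    }
}
```

The receiver sees only the `%nn` octets, so unless both sides agree on the charset (by specification or by convention) the same name arrives as two different octet sequences.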

Garret



Re: distinction between resource charset and format octet decoding

Christopher Schultz-2
Garret,

On 2/1/19 11:08, Garret Wilson wrote:

> On 2/1/2019 7:23 AM, Garret Wilson wrote:
>> … * "There /is no default encoding for URIs/ specified anywhere,
>> which is why there is a lot of confusion when it comes to
>> decoding these values." Sheesh, this is ancient. I'll correct
>> it as per https://tools.ietf.org/html/rfc3986#section-2.5 .
>
>
> Amazing. A close reading of RFC 3986 reveals that there is no
> clear mandate for UTF-8 in existing URI schemes, even though
> recommended for new schemes. Anyway, everyone seems to have settled
> on UTF-8 (Tomcat included), so I'll try to indicate that.

Wait... are you saying that _it's the Wild West out there?_ ;)

Yes. The web is indeed held together with duct tape and baling wire.
It's amazing that it works as well as it does.

-chris


Re: distinction between resource charset and format octet decoding

Garret Wilson
On 2/1/2019 9:38 AM, Christopher Schultz wrote:
>> Amazing. A close reading of RFC 3986 reveals that there is no
>> clear mandate for UTF-8 in existing URI schemes, even though
>> recommended for new schemes. Anyway, everyone seems to have settled
>> on UTF-8 (Tomcat included), so I'll try to indicate that.
> Wait... are you saying that _it's the Wild West out there?_ ;)
>
> Yes. The web is indeed held together with duct tape and baling wire.
> It's amazing that it works as well as it does.


Hahaha. I'm /so/ happy someone agrees with me! Here's to improving
things with a little JB Weld once in a while. (That's what my
grandparents used on the farm when the baling wire and duct tape
couldn't handle it.)

Garret


Re: distinction between resource charset and format octet decoding

Garret Wilson
OK, Mark, I've made my initial edits to the
https://wiki.apache.org/tomcat/FAQ/CharacterEncoding page. _Please check
them over!_ This is my first edit to the wiki.

That page has a lot of legacy information, some of which had to do with
internal Tomcat stuff, and some of which had to do with minute details
of obsolete RFCs and evolution of browser behavior. I didn't want to
spend the entire day (week?) on this, so I tried to surgically address
only the sections relating to POST of
application/x-www-form-urlencoded and how percent-encoded octets are
interpreted. I couldn't resist updating the specification links and
changing just a little prose about URL percent encoding.

There is the risk now that other sections of the page are still outdated
and conflict with my changes, but most importantly the FAQ should
provide more complete information on how Tomcat web applications can be
made to work with modern browsers.

Please let me know if I bungled anything or if I need to clarify something.

Thanks for letting me participate.

Garret



Re: distinction between resource charset and format octet decoding

markt
On 01/02/2019 17:58, Garret Wilson wrote:

> OK, Mark, I've made my initial edits to the
> https://wiki.apache.org/tomcat/FAQ/CharacterEncoding page. _Please check
> them over!_ This is my first edit to the wiki.
>
> That page has a lot of legacy information, some of which had to do with
> internal Tomcat stuff, and some of which had to do with minute details
> of obsolete RFCs and evolution of browser behavior. I didn't want to
> spend the entire day (week?) on this, so I tried to surgically address
> only the sections relating to POST of
> application/x-www-form-urlencoded and how percent-encoded octets are
> interpreted. I couldn't resist updating the specification links and
> changing just a little prose about URL percent encoding.
>
> There is the risk now that other sections of the page are still outdated
> and conflict with my changes, but most importantly the FAQ should
> provide more complete information on how Tomcat web applications can be
> made to work with modern browsers.
>
> Please let me know if I bungled anything or if I need to clarify something.

LGTM.

> Thanks for letting me participate.

No need to thank us. We should be thanking you. Thank you.

So, what do you want to work on next? ;)

Cheers,

Mark





Re: distinction between resource charset and format octet decoding

Garret Wilson
Sorry to bring up the non-UTF-8 escaped octets form POST problem again,
but …

On 1/8/2019 3:57 PM, Mark Thomas wrote:

> …
> As of Servlet 4.0 there is a specification compliant configuration
> option to change this default to any encoding of your choice.
> Obviously, UTF-8 is one of the options. You can do this by adding the
> following to your web.xml:
>
> <request-character-encoding>UTF-8</request-character-encoding>
>
> If you add it to conf/web.xml it applies to every web application
> deployed to Tomcat.
>
> Tomcat 9 uses this in the examples, manager and host-manager
> applications in place of the SetCharacterEncodingFilter.


As you know I've already updated the Tomcat FAQ with the options for
forcing Tomcat to interpret form POSTs with any escaped characters using
UTF-8 octet sequences (as modern browsers send, and as HTML5 requires)
instead of ISO-8859-1 (as the Servlet 4 spec says).

But the problem is worse with the Spring community. If someone is using
Spring Boot to create an executable JAR/WAR using embedded Tomcat,
Spring Boot does something to configure Tomcat to interpret the POSTs
correctly (that is, as the modern web likes it, not like the Servlet 4
spec says). Unfortunately, if I use Spring Boot to make a WAR which is
both a self-contained executing WAR /and/ a WAR deployable on Tomcat,
when I deploy the WAR on Tomcat the escaped octets are decoded as
ISO-8859-1, so my web app breaks. Yes, the WAR behaves differently
depending on whether it runs on Spring Boot's embedded Tomcat or is
deployed on standalone Tomcat.

Spring Boot ignores any `web.xml` file. I guess I could create a
`web.xml` file only for standalone Tomcat, but then this freezes Eclipse
(as I posted elsewhere) because Eclipse doesn't understand
`<request-character-encoding>`. So like so many things on the web, this
is a mess.

This is a serious issue, in my opinion. The Servlet 4 specification is
out of step with everything else in the ecosystem!

> Whether Tomcat should ship with this setting present in conf/web.xml
> by default is something that should probably be discussed for Tomcat
> 10. Given the current state of the web, there is a reasonable case for
> doing so. I'll add that to the TOMCAT-NEXT discussion list.

Yes, can I just re-second (third?) that motion, and underscore the need
for this to be changed in Tomcat 10?

Thanks,

Garret