DefaultServlet doesn't set charset

Markus Schönhaber-10
Hi,

AFAICT Tomcat's DefaultServlet doesn't add "; charset=..." to the
Content-Type header when serving static resources of content type text/*,
even when the corresponding resource isn't encoded in ISO-8859-1.
As I understand it, this is a violation of the HTTP 1.1 spec, since RFC
2616 says in section 3.7.1:
|  The "charset" parameter is used with some media types to define the
|  character set (section 3.4) of the data. When no explicit charset
|  parameter is provided by the sender, media subtypes of the "text"
|  type are defined to have a default charset value of "ISO-8859-1" when
|  received via HTTP. Data in character sets other than "ISO-8859-1" or
|  its subsets MUST be labeled with an appropriate charset value. See
|  section 3.4.1 for compatibility problems.

I'm seeing this with Tomcat 6.0.18, JDK 6u6 on 64-bit Ubuntu Hardy with
a platform default encoding of "UTF-8".
To reproduce this, simply put a UTF-8-encoded plain text file
containing non-ASCII characters in webapps/ROOT of a default Tomcat
6.0.18 installation and access it with a browser. Instead of the
non-ASCII characters, the browser will display the well-known garbage
one gets when UTF-8 is decoded using an 8-bit charset (provided the
browser doesn't guess the charset from the content).
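
The garbage described above can be reproduced without a browser. The following is a minimal sketch, assuming Java 7 or later for StandardCharsets; the class and method names are made up for illustration and are not part of Tomcat. It encodes a string as UTF-8 and decodes the bytes as ISO-8859-1, which is what a client does when no charset is declared.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: the class and method names are hypothetical.
class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes as ISO-8859-1,
    // mimicking a client that applies the HTTP default charset.
    static String decodeUtf8AsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00f6 is "o with diaeresis"; it becomes two characters.
        System.out.println(decodeUtf8AsLatin1("Sch\u00f6nhaber"));
    }
}
```

Pure ASCII text passes through unchanged, which is why the problem only shows up once a file contains non-ASCII characters.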

Doing a quick search on bugzilla I only came up with
https://issues.apache.org/bugzilla/show_bug.cgi?id=41773
Now I'm unsure whether I'm doing something completely wrong or whether my
interpretation of the spec and DefaultServlet's behaviour is correct -
which would mean that this is a bug.

Can someone shed some light on this?

Regards
  mks

---------------------------------------------------------------------
To start a new topic, e-mail: [hidden email]
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: DefaultServlet doesn't set charset

Mark Thomas-2
Markus Schönhaber wrote:
> Hi,
>
> AFAICT Tomcat's DefaultServlet doesn't add "; charset=..." to the
> Content-Type header when serving static resources of content type text/*
> and the corresponding resource isn't encoded in ISO-8859-1.
Correct.

> As I understand it, this is a violation of the HTTP 1.1 spec, since RFC
> 2616 says in section 3.7.1:
> |  The "charset" parameter is used with some media types to define the
> |  character set (section 3.4) of the data. When no explicit charset
> |  parameter is provided by the sender, media subtypes of the "text"
> |  type are defined to have a default charset value of "ISO-8859-1" when
> |  received via HTTP. Data in character sets other than "ISO-8859-1" or
> |  its subsets MUST be labeled with an appropriate charset value. See
> |  section 3.4.1 for compatibility problems.
Yes, but... it is debatable in a container environment who is responsible
for ensuring this requirement is met. If you have multiple text files each
with a different character set Tomcat is going to have to start guessing
the charset from the content - a path I wouldn't want to go down.

> I'm seeing this with Tomcat 6.0.18, JDK 6u6 on 64-bit Ubuntu Hardy with
> a platform default encoding of "UTF-8".
> To reproduce this, one can simply put a UTF-8-encoded plain text file
> containing non-ASCII characters in webapps/ROOT of a default Tomcat
> 6.0.18 installation and access this file via browser. Instead of the
> non-ASCII characters the browser should display the well-known garbage
> one gets when UTF-8 is decoded using an 8-bit charset (provided, the
> browser doesn't do some guessing of the charset based on the content).
And most of them do, don't they?

> Doing a quick search on bugzilla I only came up with
> https://issues.apache.org/bugzilla/show_bug.cgi?id=41773
> Now I'm unsure whether I do something completely wrong or my
> interpretation of the spec and DefaultServlet's behaviour is correct -
> which would mean that this is a bug.
You could argue, based on the spec extract above, that if the platform
default encoding isn't ISO-8859-1, Tomcat should add this to the
Content-Type header, although I am wary about what this might break. As
Remy points out in that bug, if you need that functionality it is easy to
extend the DefaultServlet, or you could write a simple Filter.

That said I wouldn't be against a patch that introduced a
useFileEncodingInCharset parameter (although a shorter name would be better ;)

> Can someone shed some light on this?
HTH,

Mark



Re: DefaultServlet doesn't set charset

Markus Schönhaber-10
Mark Thomas wrote:

>> As I understand it, this is a violation of the HTTP 1.1 spec, since RFC
>> 2616 says in section 3.7.1:
>> |  The "charset" parameter is used with some media types to define the
>> |  character set (section 3.4) of the data. When no explicit charset
>> |  parameter is provided by the sender, media subtypes of the "text"
>> |  type are defined to have a default charset value of "ISO-8859-1" when
>> |  received via HTTP. Data in character sets other than "ISO-8859-1" or
>> |  its subsets MUST be labeled with an appropriate charset value. See
>> |  section 3.4.1 for compatibility problems.
> Yes, but... it is debatable in a container environment who is responsible
> for ensuring this requirement is met.

I don't see that as debatable. In my understanding a web server that
serves non-ISO-8859-1-encoded content of type text/* without declaring
the charset is lying wrt the spec.

> If you have multiple text files each
> with a different character set Tomcat is going to have to start guessing
> the charset from the content - a path I wouldn't want to go down.

Agreed. I also consider having text resources with different encodings
as something non-standard, non-default which one shouldn't expect the
DefaultServlet to handle correctly. That's where the administrator's or
developer's responsibility starts.
But I'm talking about what I'd call the "default" case: where text
resources are created using the default platform encoding. And this is
something that, IMO, the DefaultServlet should be able to cope with.

Thinking about this, there are two things that seem odd to me:
1. I could find no place in the docs where it is mentioned that the
DefaultServlet is unable to serve text resources correctly if they are
not encoded in ISO-8859-1.
2. The existence of the fileEncoding init-param. Why should one care (or
be able to change) which encoding is used when reading text files from
disk if there's only one encoding for which serving them actually works?

>> one gets when UTF-8 is decoded using an 8-bit charset (provided, the
>> browser doesn't do some guessing of the charset based on the content).
> And most of them do, don't they?

I don't know. My Firefox doesn't. And I have yet to see a Firefox
installation where charset guessing is turned on by default. The
same applies to Opera, as far as I can tell. Looking at IE when it comes
to standards compliance seems to be nonsense to me.
But that's only my experience - YMMV.
Furthermore, whether charset guessing done by the client conforms to the
spec seems doubtful to me when I look at section 3.4.1.

Anyway, my question is whether or not Tomcat behaves correctly (which
seems not to be the case) not whether some - or even most - browsers do
something that reduces the impact of a server's wrong behaviour.

> That said I wouldn't be against a patch that introduced a
> useFileEncodingInCharset parameter (although a shorter name would be better ;)

Great! I'll dig into DefaultServlet's source and see what I can come up
with.
Speaking of the parameter name - that indeed seems problematic :-)

> HTH,

It did. Thanks for your response.

Regards
  mks

Re: DefaultServlet doesn't set charset

André Warnier
Markus Schönhaber wrote:
> Hi,
> (provided, the
> browser doesn't do some guessing of the charset based on the content).
>
Not in any way to distract from your main question, which is very
interesting, but that is a very big "provided", because IE does a lot of
second-guessing the server, infamously.
And considering that IE still covers at least 90% of the corporate sites
I know, that may be a reason for a bug like this - if bug there is - to
remain largely unnoticed.



Re: DefaultServlet doesn't set charset

Markus Schönhaber-10
André Warnier wrote:

> Markus Schönhaber wrote:
>> (provided, the
>> browser doesn't do some guessing of the charset based on the content).
>>
> Not in any way to distract from your main question, which is very
> interesting, but that is a very big "provided", because IE does a lot of
> second-guessing the server, infamously.
> And considering that IE still covers at least 90% of the corporate sites
> I know, that may be a reason for a bug like this - if bug there is - to
> remain largely unnoticed.

As I understand Mark's reply, he doesn't consider DefaultServlet's
behaviour, under the circumstances I'm talking about, as being correct
either. This supports my view of the issue as a bug.

And yes, you may be right that the widespread use of IE may have
helped to conceal this bug.

But:
1. As I said before: looking at IE to find out if a server behaves
correctly seems to me like asking Genghis Khan when you want to learn
about peace, freedom and human rights.
2. (much more important) this bug affects me. So I will have this fixed,
while the reasons why this has gone largely unnoticed for obviously
quite some time are of little relevance to me.

Regards
  mks

Re: DefaultServlet doesn't set charset

Markus Schönhaber-10
Markus Schönhaber wrote:

> Mark Thomas wrote:

>> That said I wouldn't be against a patch that introduced a
>> useFileEncodingInCharset parameter (although a shorter name would be better ;)
>
> Great! I'll dig into DefaultServlet's source and see what I can come up
> with.

OK, I think I have, by and large, understood how the DefaultServlet
works and added code to append the charset info (if wanted and
applicable) to the Content-Type response header.
What I ended up with is the same as if I had simply searched the code
for all places where Content-Type is set and added my code there. Seems
reassuring to me.
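
The idea of appending the charset to the Content-Type value can be sketched roughly as follows. This is not the patch itself; the class and method names are hypothetical, and the rule "only text/* types, and only if no charset is present yet" is an assumption made for illustration.

```java
import java.util.Locale;

// Illustrative helper; the names are hypothetical and are not taken
// from the patch or from Tomcat's DefaultServlet.
class ContentTypeCharset {
    static String withCharset(String contentType, String charsetName) {
        if (contentType == null || charsetName == null) {
            return contentType;
        }
        String lower = contentType.toLowerCase(Locale.ROOT);
        // Only text/* types get a charset, and an existing one is kept.
        if (!lower.startsWith("text/") || lower.contains("charset=")) {
            return contentType;
        }
        return contentType + "; charset=" + charsetName;
    }
}
```

With this sketch, withCharset("text/plain", "UTF-8") yields "text/plain; charset=UTF-8", while a non-text type such as "image/png" is returned unchanged.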

I'll do some more testing and then attach the patch to
https://issues.apache.org/bugzilla/show_bug.cgi?id=41773
Are there any (non-obvious) testcases that you think deserve special
attention?

Regards
  mks

Re: DefaultServlet doesn't set charset

Christopher Schultz-2
Markus,

Markus Schönhaber wrote:
| OK, I think I have, by and large, understood how the DefaultServlet
| works and added code to append the charset info (if wanted and
| applicable) to the Content-Type response header.
| What I ended up with is the same as if I had simply searched the code
| for all places where Content-Type is set and add my code there. Seems
| reassuring to me.

Have you rigged the servlet to add a static charset defined in, say,
web.xml or something like that? Is there any logic to guess the actual
charset? Are you actually setting the character set of the response's
Writer? I'd love to take a look at your patch.

I would definitely add some tests to verify correct behavior when the
charset is set to something that is not sane (like ";;;;").

-chris

Re: DefaultServlet doesn't set charset

Markus Schönhaber-10
Christopher Schultz wrote:

> Have you rigged the servlet to add a static charset defined in, say,
> web.xml or something like that?

In a way, yes. DefaultServlet already uses the value of the fileEncoding
init-param, if set, as encoding when reading static content from disk.
So, if fileEncoding is explicitly set in web.xml, I also use its value
for the charset info in the response header.

> Is there any logic to guess the actual
> charset?

Depends on what charset you mean.
- The charset of a file on disk? Then no, I haven't touched the code for
reading files from disk - and I don't intend to.
- The charset added to the Content-Type response header? Then yes. If
fileEncoding (see above) is not set, the value from
java.nio.charset.Charset.defaultCharset().name()
is used.
BTW: this is OK for Tomcat 6. But anyone interested in porting this
to an older version of Tomcat that is supposed to run on pre-1.5 JVMs
should keep in mind that this has to be changed, because
Charset.defaultCharset() was only added in Java 1.5. For
example into something like
(new OutputStreamWriter(new ByteArrayOutputStream())).getEncoding()
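
The selection logic described above might look roughly like this. It is a sketch, not the code from the patch: the class and method names are hypothetical, and treating a blank fileEncoding the same as an unset one is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

// Illustrative only; names are hypothetical, not from the patch.
class ResponseCharsetChooser {
    // Use the configured fileEncoding if present, otherwise fall back
    // to the platform default charset (Java 1.5 and later).
    static String choose(String fileEncoding) {
        if (fileEncoding != null && fileEncoding.trim().length() > 0) {
            return fileEncoding.trim();
        }
        return Charset.defaultCharset().name();
    }

    // Fallback for JVMs without Charset.defaultCharset(): ask a writer
    // created without an explicit encoding which one it uses.
    static String legacyDefaultEncoding() {
        return new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding();
    }
}
```

One thing to watch with the fallback: getEncoding() returns the encoding's historical name where one exists (for example "UTF8" rather than "UTF-8"), so the string may differ from what Charset.defaultCharset().name() reports even though both refer to the same charset.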

> Are you actually setting the character set of the response's
> Writer?

No. But a good point!
As I understand it, DefaultServlet always tries to use the
ServletOutputStream. Only if response.getOutputStream() fails with an
IllegalStateException and the media type is text/* or *xml does it try
to use the Response object's PrintWriter. So, if the latter is the case
and something other than the platform default encoding should be used,
it might be sensible to set the encoding for the writer.
I have to think about this some more - especially about a real world
example that triggers this.
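
Why the writer's encoding matters can be shown with plain java.io, independent of the servlet API. In this sketch (class and method names are hypothetical) the same text produces different bytes depending on the charset the Writer was created with, so a charset announced in the header has to match the one the writer actually uses.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Illustrative only; names are hypothetical.
class WriterEncodingDemo {
    // Write the text through a Writer bound to the given charset and
    // return the bytes that would go over the wire.
    static byte[] encode(String text, String charsetName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            Writer writer = new OutputStreamWriter(out, Charset.forName(charsetName));
            writer.write(text);
            writer.close();
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new IllegalStateException(e);
        }
        return out.toByteArray();
    }
}
```

For a single "\u00f6" this gives two bytes with UTF-8 and one byte with ISO-8859-1.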

> I'd love to take a look at your patch.

No problem. You can get it here:
http://www.ddt-consult.de/sendCharset.patch

> I would definitely add some tests to verify correct behavior when the
> charset is set to something that is not sane (like ";;;;").

Hm, yes, one could add a sanity check. But I'd expect people who set
fileEncoding explicitly to not only know what they're doing but to also
check if what they did actually works.
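
If a sanity check were wanted, java.nio.charset.Charset already provides the building blocks. A minimal sketch, with hypothetical class and method names, that rejects both syntactically illegal names such as ";;;;" and legal but unknown ones:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Illustrative only; names are hypothetical, not from the patch.
class CharsetSanityCheck {
    // True only if the name is legal and the JVM supports the charset.
    static boolean isUsable(String name) {
        if (name == null) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // Thrown for names containing illegal characters.
            return false;
        }
    }
}
```

A servlet could log a warning and skip the charset parameter when such a check fails, though whether that is worth the extra code is the judgment call discussed above.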

Thanks for your input, Chris.

Regards
  mks
