|
Hi,
AFAICT Tomcat's DefaultServlet doesn't add "; charset=..." to the Content-Type header when serving static resources of content type text/* and the corresponding resource isn't encoded in ISO-8859-1.

As I understand it, this is a violation of the HTTP 1.1 spec, since RFC 2616 says in section 3.7.1:

| The "charset" parameter is used with some media types to define the
| character set (section 3.4) of the data. When no explicit charset
| parameter is provided by the sender, media subtypes of the "text"
| type are defined to have a default charset value of "ISO-8859-1" when
| received via HTTP. Data in character sets other than "ISO-8859-1" or
| its subsets MUST be labeled with an appropriate charset value. See
| section 3.4.1 for compatibility problems.

I'm seeing this with Tomcat 6.0.18, JDK 6u6 on 64-bit Ubuntu Hardy with a platform default encoding of "UTF-8".

To reproduce this, one can simply put a UTF-8-encoded plain text file containing non-ASCII characters in webapps/ROOT of a default Tomcat 6.0.18 installation and access this file via browser. Instead of the non-ASCII characters the browser should display the well-known garbage one gets when UTF-8 is decoded using an 8-bit charset (provided, the browser doesn't do some guessing of the charset based on the content).

Doing a quick search on bugzilla I only came up with
https://issues.apache.org/bugzilla/show_bug.cgi?id=41773

Now I'm unsure whether I'm doing something completely wrong or my interpretation of the spec and DefaultServlet's behaviour is correct - which would mean that this is a bug.

Can someone shed some light on this?

Regards
mks

---------------------------------------------------------------------
To start a new topic, e-mail: [hidden email]
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
|
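The garbling described in the reproduction steps can be shown without a server. What follows is only a minimal sketch in Java of what a client does when it applies the ISO-8859-1 default to UTF-8 bytes; the class and method names are made up for illustration and nothing here is taken from Tomcat.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    /**
     * Encodes the text as UTF-8 and decodes the resulting bytes as
     * ISO-8859-1, which is what a client falling back to the RFC 2616
     * default would do with an unlabeled UTF-8 resource.
     */
    static String decodeUtf8AsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u00f6" is o-umlaut; in UTF-8 it is the two bytes C3 B6.
        String garbled = decodeUtf8AsLatin1("Sch\u00f6nhaber");
        // Print code points so the result does not depend on the console encoding.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Pure ASCII text passes through unchanged, which is why the problem only shows up once a file contains non-ASCII characters.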
|
Markus Schönhaber wrote:
> Hi,
>
> AFAICT Tomcat's DefaultServlet doesn't add "; charset=..." to the
> Content-Type header when serving static resources of content type text/*
> and the corresponding resource isn't encoded in ISO-8859-1.

Correct.

> As I understand it, this is a violation of the HTTP 1.1 spec, since RFC
> 2616 says in section 3.7.1:
> | The "charset" parameter is used with some media types to define the
> | character set (section 3.4) of the data. When no explicit charset
> | parameter is provided by the sender, media subtypes of the "text"
> | type are defined to have a default charset value of "ISO-8859-1" when
> | received via HTTP. Data in character sets other than "ISO-8859-1" or
> | its subsets MUST be labeled with an appropriate charset value. See
> | section 3.4.1 for compatibility problems.

Yes, but... it is debatable in a container environment who is responsible for ensuring this requirement is met. If you have multiple text files each with a different character set Tomcat is going to have to start guessing the charset from the content - a path I wouldn't want to go down.

> I'm seeing this with Tomcat 6.0.18, JDK 6u6 on 64-bit Ubuntu Hardy with
> a platform default encoding of "UTF-8".
> To reproduce this, one can simply put a UTF-8-encoded plain text file
> containing non-ASCII characters in webapps/ROOT of a default Tomcat
> 6.0.18 installation and access this file via browser. Instead of the
> non-ASCII characters the browser should display the well-known garbage
> one gets when UTF-8 is decoded using an 8-bit charset (provided, the
> browser doesn't do some guessing of the charset based on the content).

And most of them do, don't they?

> Doing a quick search on bugzilla I only came up with
> https://issues.apache.org/bugzilla/show_bug.cgi?id=41773
> Now I'm unsure whether I do something completely wrong or my
> interpretation of the spec and DefaultServlet's behaviour is correct -
> which would mean that this is a bug.
You could argue, based on the spec extract above, that if the platform default encoding isn't ISO-8859-1 Tomcat should add it to the Content-Type header, although I am wary about what this might break. As Remy points out in that bug, if you need that functionality it is easy to extend the DefaultServlet, or you could write a simple Filter.

That said, I wouldn't be against a patch that introduced a useFileEncodingInCharset parameter (although a shorter name would be better ;)

> Can someone shed some light on this?

HTH,

Mark
|
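Whichever route is taken (a Filter or a DefaultServlet subclass), the core of it is one small decision about the header value. The following is a rough sketch of that decision only, in plain Java without the Servlet API; the class and method names are hypothetical, and wiring the result into the response (for instance through the response's setContentType) is left out and assumed.

```java
import java.util.Locale;

final class ContentTypeCharset {

    /**
     * Returns the content type with "; charset=..." appended when the media
     * type is text/* and no charset parameter is present yet. Anything else
     * is returned unchanged.
     */
    static String withCharset(String contentType, String charset) {
        if (contentType == null) {
            return null;
        }
        String lower = contentType.toLowerCase(Locale.ROOT);
        if (!lower.startsWith("text/") || lower.contains("charset=")) {
            return contentType;
        }
        return contentType + "; charset=" + charset;
    }

    public static void main(String[] args) {
        System.out.println(withCharset("text/plain", "UTF-8"));
        System.out.println(withCharset("image/png", "UTF-8"));
        System.out.println(withCharset("text/html; charset=ISO-8859-1", "UTF-8"));
    }
}
```

Leaving an existing charset parameter untouched is a deliberate choice in this sketch, so that resources already labeled by some other mechanism are not relabeled.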
|
Mark Thomas wrote:
>> As I understand it, this is a violation of the HTTP 1.1 spec, since RFC
>> 2616 says in section 3.7.1:
>> | The "charset" parameter is used with some media types to define the
>> | character set (section 3.4) of the data. When no explicit charset
>> | parameter is provided by the sender, media subtypes of the "text"
>> | type are defined to have a default charset value of "ISO-8859-1" when
>> | received via HTTP. Data in character sets other than "ISO-8859-1" or
>> | its subsets MUST be labeled with an appropriate charset value. See
>> | section 3.4.1 for compatibility problems.
> Yes, but... it is debatable in a container environment who is responsible
> for ensuring this requirement is met.

I don't see that as debatable. In my understanding a web server that serves non-ISO-8859-1-encoded content of type text/* without declaring the charset is lying wrt the spec.

> If you have multiple text files each
> with a different character set Tomcat is going to have to start guessing
> the charset from the content - a path I wouldn't want to go down.

Agreed. I also consider having text resources with different encodings as something non-standard, non-default which one shouldn't expect the DefaultServlet to handle correctly. That's where the administrator's or developer's responsibility starts.

But I'm talking about what I'd call the "default" case: where text resources are created using the default platform encoding. And this is something that, IMO, the DefaultServlet should be able to cope with.

Thinking about this, there are two things that seem odd to me:
1. I could find no place in the docs where it is mentioned that the DefaultServlet is unable to serve text resources correctly if they are not encoded in ISO-8859-1.
2. The existence of the fileEncoding init-param. Why should one care (or be able to change) which encoding is used when reading text files from disk if there's only one encoding for which serving them actually works?
>> one gets when UTF-8 is decoded using an 8-bit charset (provided, the
>> browser doesn't do some guessing of the charset based on the content).
> And most of them do, don't they?

I don't know. My Firefox doesn't. And I have yet to see a Firefox installation where the charset guessing is turned on by default. The same applies to what I can say wrt Opera. Looking at IE when it comes to standards compliance seems to be nonsense to me. But that's only my experience - YMMV.

Furthermore, whether charset guessing done by the client conforms to the spec seems doubtful to me when I look at section 3.4.1.

Anyway, my question is whether or not Tomcat behaves correctly (which seems not to be the case), not whether some - or even most - browsers do something that reduces the impact of a server's wrong behaviour.

> That said I wouldn't be against a patch that introduced a
> useFileEncodingInCharset parameter (although a shorter name would be better ;)

Great! I'll dig into DefaultServlet's source and see what I can come up with. Speaking of the parameter name - that indeed seems problematic :-)

> HTH,

It did. Thanks for your response.

Regards
mks
|
|
Markus Schönhaber wrote:
> Hi,
> (provided, the
> browser doesn't do some guessing of the charset based on the content).
>

Not in any way to distract from your main question, which is very interesting, but that is a very big "provided", because IE does a lot of second-guessing the server, infamously.

And considering that IE still covers at least 90% of the corporate sites I know, that may be a reason for a bug like this - if bug there is - to remain largely unnoticed.
|
|
André Warnier wrote:
> Markus Schönhaber wrote:
>> (provided, the
>> browser doesn't do some guessing of the charset based on the content).
>>
> Not in any way to distract from your main question, which is very
> interesting, but that is a very big "provided", because IE does a lot of
> second-guessing the server, infamously.
> And considering that IE still covers at least 90% of the corporate sites
> I know, that may be a reason for a bug like this - if bug there is - to
> remain largely unnoticed.

As I understand Mark's reply, he doesn't consider DefaultServlet's behaviour, under the circumstances I'm talking about, as being correct either. This supports my view of the issue as a bug.

And yes, you may be right that the widespread use of IE may have helped to conceal this bug. But:
1. As I said before: looking at IE to find out if a server behaves correctly seems to me like asking Genghis Khan when you want to learn about peace, freedom and human rights.
2. (much more important) this bug affects me. So I will have this fixed, while the reasons why this has gone largely unnoticed for obviously quite some time are of little relevance to me.

Regards
mks
|
|
Markus Schönhaber wrote:
> Mark Thomas wrote:
>> That said I wouldn't be against a patch that introduced a
>> useFileEncodingInCharset parameter (although a shorter name would be better ;)
>
> Great! I'll dig into DefaultServlet's source and see what I can come up
> with.

OK, I think I have, by and large, understood how the DefaultServlet works and added code to append the charset info (if wanted and applicable) to the Content-Type response header.

What I ended up with is the same as if I had simply searched the code for all places where Content-Type is set and added my code there. Seems reassuring to me.

I'll do some more testing and then attach the patch to
https://issues.apache.org/bugzilla/show_bug.cgi?id=41773

Are there any (non-obvious) test cases that you think deserve special attention?

Regards
mks
|
|
Markus,

Markus Schönhaber wrote:
| OK, I think I have, by and large, understood how the DefaultServlet
| works and added code to append the charset info (if wanted and
| applicable) to the Content-Type response header.
| What I ended up with is the same as if I had simply searched the code
| for all places where Content-Type is set and add my code there. Seems
| reassuring to me.

Have you rigged the servlet to add a static charset defined in, say, web.xml or something like that? Is there any logic to guess the actual charset? Are you actually setting the character set of the response's Writer?

I'd love to take a look at your patch.

I would definitely add some tests to verify correct behavior when the charset is set to something that is not sane (like ";;;;").

-chris
|
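One way such a sanity check might look is sketched below, using the JDK's own charset lookup. This is only an illustration with made-up names, not code from the patch, and it assumes that "known to the JVM" is an acceptable stand-in for "sane"; whether a name is also a valid charset token in an HTTP header is a separate question that this sketch does not answer.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

final class CharsetSanity {

    /**
     * Returns true only if the name is syntactically legal as a Java charset
     * name and the running JVM supports it.
     */
    static boolean isUsableCharset(String name) {
        if (name == null) {
            return false;
        }
        try {
            // isSupported throws for syntactically illegal names such as ";;;;"
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isUsableCharset("UTF-8"));
        System.out.println(isUsableCharset(";;;;"));
        System.out.println(isUsableCharset("no-such-charset-xyz"));
    }
}
```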
|
Christopher Schultz wrote:
> Have you rigged the servlet to add a static charset defined in, say,
> web.xml or something like that?

In a way, yes. DefaultServlet already uses the value of the fileEncoding init-param, if set, as encoding when reading static content from disk. So, if fileEncoding is explicitly set in web.xml, I also use its value for the charset info in the response header.

> Is there any logic to guess the actual
> charset?

Depends on what charset you mean.
- The charset of a file on disk? Then no, I haven't touched the code for reading files from disk - and I don't intend to.
- The charset added to the Content-Type response header? Then yes. If fileEncoding (see above) is not set, the value from java.nio.charset.Charset.defaultCharset().name() is used.

BTW: this is OK for Tomcat 6. But if anyone were interested in porting this to an older version of Tomcat which is supposed to be able to run on pre-1.5 JVMs, he should keep in mind that this has to be changed. For example into something like

    (new OutputStreamWriter(new ByteArrayOutputStream())).getEncoding()

> Are you actually setting the character set of the response's
> Writer?

No. But a good point! As I understand it, DefaultServlet always tries to use the ServletOutputStream. Only if response.getOutputStream() fails with an ISE and the media type is text/* or *xml does it try to use the Response object's PrintWriter. So, if the latter is the case and something other than the platform default encoding should be used, it might be sensible to set the encoding for the writer. I have to think about this some more - especially about a real world example that triggers this.

> I'd love to take a look at your patch.

No problem. You can get it here:
http://www.ddt-consult.de/sendCharset.patch

> I would definitely add some tests to verify correct behavior when the
> charset is set to something that is not sane (like ";;;;").

Hm, yes, one could add a sanity check.
But I'd expect people who set fileEncoding explicitly to not only know what they're doing but to also check if what they did actually works.

Thanks for your input, Chris.

Regards
mks
|
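The two ways of finding the platform default encoding mentioned in this message can be compared side by side. This is a small sketch with hypothetical class and method names, not an excerpt from the patch. One thing to watch for: per the JDK documentation, OutputStreamWriter.getEncoding() returns the encoding's historical name where one exists (for example "UTF8" rather than "UTF-8"), so the fallback value would presumably need normalising before it goes into a Content-Type header.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

final class DefaultCharsetName {

    /** Java 5 and later: canonical name of the platform default charset. */
    static String viaCharsetApi() {
        return Charset.defaultCharset().name();
    }

    /**
     * Fallback for older JVMs as suggested in the message: ask a writer
     * created without an explicit encoding which encoding it uses.
     * May yield a historical name such as "UTF8".
     */
    static String viaWriter() {
        return new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding();
    }

    public static void main(String[] args) {
        System.out.println("Charset API: " + viaCharsetApi());
        System.out.println("Writer:      " + viaWriter());
        // Resolving the writer's name gives back the canonical form.
        System.out.println("Normalised:  " + Charset.forName(viaWriter()).name());
    }
}
```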
