
request parameters mishandle utf-8 encoding

Lajos Papp
Hi,

I think there is a bug in the handling of UTF-8 encoded request parameters
sent by an HTML form with the "get" method.
I created a simple JSP page:
=== encTest.jsp ===
<%@page contentType="text/html" pageEncoding="UTF-8"%>

<%
String query = request.getQueryString();
String queryDecoded = "-";
if (query != null) {
     queryDecoded = java.net.URLDecoder.decode(query,"utf-8");
}

request.setCharacterEncoding("UTF-8");
String reqParam = request.getParameter("param");
%>

<br> query = <%= query %>
<br> queryDecoded = <%= queryDecoded %>
<br> reqParam = <%= reqParam %>


<form action="encTest.jsp" method="get">
     <input name="param" />
     <input type="submit" value="send" />
</form>
=== end of jsp ===

When I fill out the form with some non-US characters (in this case
a Hungarian name), the browser URL-encodes it correctly, as I can see
from the URL:
http://localhost:8080/struts/encTest.jsp?param=b%C3%A9la

When I decode the query string by hand:
   queryDecoded = java.net.URLDecoder.decode(query,"utf-8");
I get the correct string, but when I call the getParameter() method
on the request:
   request.setCharacterEncoding("UTF-8");
   String reqParam = request.getParameter("param");
I get a miscoded string, as if the request.setCharacterEncoding("UTF-8")
call weren't there.
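
The difference can be reproduced outside the container. The following is only a minimal sketch: the class name DecodeDemo is made up for illustration, and it assumes the miscoded result comes from the same bytes being read as ISO-8859-1 instead of UTF-8. Under that assumption the first call yields "béla" and the second yields "bÃ©la".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, not part of Tomcat or the JSP above.
class DecodeDemo {

    // Percent-decode a raw value and interpret the resulting bytes in the given charset.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String raw = "b%C3%A9la"; // value taken from the URL in the post
        System.out.println(decode(raw, "UTF-8"));      // bytes C3 A9 read as one character
        System.out.println(decode(raw, "ISO-8859-1")); // bytes C3 A9 read as two characters
    }
}
```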

I checked the source code of Tomcat 6.0.16 and found that
Parameters.handleQueryParameters() does the real work, and that it is
called by Request.parseParameters().
The request has the correct encoding (UTF-8), but the Parameters object has
two different properties that store encoding information: encoding and
queryStringEncoding. In the case of a "GET", useBodyEncodingForURI is
false, so only parameters.setEncoding("utf-8") is called,
and parameters.setQueryStringEncoding("utf-8") isn't.
So when request.parseParameters() calls
parameters.handleQueryParameters(),
queryStringEncoding is still null, and of course it returns a
miscoded parameter.

Do you agree that it's a bug, or am I missing something?
Cheers,
lajos

=== org.apache.catalina.connector.Request ===

  protected void parseParameters() {

         ...
         String enc = getCharacterEncoding();

         boolean useBodyEncodingForURI = connector.getUseBodyEncodingForURI();
         if (enc != null) {
             parameters.setEncoding(enc);
             if (useBodyEncodingForURI) {
                 parameters.setQueryStringEncoding(enc);
             }
         }
         ...
         parameters.handleQueryParameters();

         ...
         if (!getMethod().equalsIgnoreCase("POST"))
             return;


=== org.apache.tomcat.util.http.Parameters ===
public void handleQueryParameters() {
    ...
    handleQueryParameters(decodedQuery, queryStringEncoding);
}
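
The branch quoted above can be modelled in isolation. This is a simplified stand-in, not Tomcat's code: the class QueryEncodingModel and its methods are hypothetical, and it assumes that a null queryStringEncoding falls back to ISO-8859-1. With the flag false the parameter comes back as "bÃ©la"; with the flag true it comes back as "béla".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical, simplified model of the Request/Parameters interaction.
class QueryEncodingModel {
    private String encoding;            // applies to the request body
    private String queryStringEncoding; // applies to the query string; null means "use the default"

    // Mirrors the if-block quoted from parseParameters().
    void applyRequestEncoding(String enc, boolean useBodyEncodingForURI) {
        if (enc != null) {
            encoding = enc;
            if (useBodyEncodingForURI) {
                queryStringEncoding = enc;
            }
        }
    }

    // Assumption: with no query-string encoding set, ISO-8859-1 is used.
    String decodeQueryValue(String raw) {
        String charset = (queryStringEncoding != null) ? queryStringEncoding : "ISO-8859-1";
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // Runs the model with request encoding UTF-8 and the given connector flag.
    static String demo(boolean useBodyEncodingForURI) {
        QueryEncodingModel model = new QueryEncodingModel();
        model.applyRequestEncoding("UTF-8", useBodyEncodingForURI);
        return model.decodeQueryValue("b%C3%A9la");
    }
}
```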



---------------------------------------------------------------------
To start a new topic, e-mail: [hidden email]
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: request parameters mishandle utf-8 encoding

Bill Barker-2
It's not a bug, it's a feature ;).  Seriously, if you open a bug report for
this, it will be closed quickly as either INVALID or as DUPLICATE to a bug
that was closed as INVALID.

The HTTP spec specifies that header information is encoded in iso-latin-1,
so this is what Tomcat uses by default when parsing the query-string.  If
you want the non-default behavior, then simply set
useBodyEncodingForURI="true" in the <Connector ... /> element of server.xml.


Re: request parameters mishandle utf-8 encoding

Christopher Schultz-2
Bill,

Bill Barker wrote:
| The HTTP spec specifies that header information is encoded in iso-latin-1

Could you provide a reference for this? Whenever I dig into the HTTP
specification, I end up having to read all over it to find things like
this. I seem to recall that:

1. I've located this information in the past
2. The real answer was that HTTP headers format inherits from SMTP
3. SMTP requires pure ASCII headers
4. The request line ("GET /whatever HTTP/[version]") does not count
   as a header

Unfortunately, I can't find my references and so my assertions are
pretty much worthless. :(

| so this is what Tomcat uses by default when parsing the query-string.  If
| you want the non-default behavior, then simply set
| useBodyEncodingForURI="true" in the <Connector ... /> element of
| server.xml.

I find it more useful to set URIEncoding="UTF-8" in the <Connector>,
since the page encoding and URI encoding are not guaranteed to be the
same. The OP should look to see what works best and feels more natural
in his environment.
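
For reference, the two settings would look roughly like this in server.xml. This is a sketch only: the port is taken from the URL in the original post, the protocol attribute is an assumption, and a real <Connector> will carry whatever other attributes it already has. Use one of the two, not both elements.

```xml
<!-- Option 1: decode the query string with the request's character encoding -->
<Connector port="8080" protocol="HTTP/1.1" useBodyEncodingForURI="true" />

<!-- Option 2: always decode the URI as UTF-8, independent of the body encoding -->
<Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8" />
```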

-chris


Re: request parameters mishandle utf-8 encoding

André Warnier
Christopher Schultz wrote:
[...]

Here is the definitive reference:
http://www.faqs.org/rfcs/rfc2396.html
See section 1.5, "URI Transcribability", and following, if you are courageous.

The HTTP 1.1 specification, RFC 2616, refers to the above RFC for
URL encoding.

The point is that the URL contained in the HTTP request line (the first
line) cannot be considered to be in any particular encoding, unless the
client and server somehow agree on a convention in advance.
All the specs say is that only certain ranges of bytes are
allowed "as is" in URLs, that the rest should be escaped, and
how they should be escaped.

To say this in lay language: you can decide to write a URL in pretty
much any encoding of any character set you want, but then, once you have
your encoded URL, you should scan it byte by byte, and any byte that is
not in the accepted "as is" range should be escaped as per the spec.
The accepted range is, generally speaking, the byte values that
correspond to the printable US-ASCII characters, minus
some "reserved" and "excluded" characters like /, #, <, > etc.

For example, if your choice of encoding was such that, after encoding,
position 30 of your URL string held a byte with hex value 0x20 (which
in iso-8859-1 is a space), then it should be replaced by the sequence %20
(or by a "+" in a form-submitted query string).
Similarly, if after the original encoding there happened to be a byte at
position 40 with a hex value of 0x0D (CR, a control character), it
should be replaced by the sequence %0D.  And so on.
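
To make the "encode to bytes first, then escape" order concrete, here is a small sketch using java.net.URLEncoder, which follows the HTML form convention (space becomes "+"). The class name PercentEncodeDemo is made up for illustration. The same name produces different escaped text depending on the charset chosen for the first step, which is exactly why the server cannot tell the encoding from the URL alone.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class: same text, different bytes, different escaped form.
class PercentEncodeDemo {

    // Convert the text to bytes in the given charset, then escape unsafe bytes.
    static String encode(String text, String charset) {
        try {
            return URLEncoder.encode(text, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String name = "b\u00e9la"; // "bela" with an acute accent on the e
        System.out.println(encode(name, "UTF-8"));      // e-acute is two bytes: C3 A9
        System.out.println(encode(name, "ISO-8859-1")); // e-acute is one byte: E9
        System.out.println(encode("a b", "UTF-8"));     // space uses the form convention
        System.out.println(encode("\r", "UTF-8"));      // control character CR
    }
}
```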

Now, whether the server will "understand" your URL is another matter.

The receiving HTTP server should first of all decode the received URL in
the same way, before any further decoding is done.  Thus, from left to
right, any "+" byte should be replaced by a byte 0x20, any sequence
"%0D" should be replaced by the single byte with hex value 0x0D, etc.

Then, in the absence of any other information or convention, the
resulting string is by default considered to be in iso-8859-1 (latin-1).

However, if the client and server have somehow made a convention that
they would exchange URLs containing Unicode characters, encoded as
UTF-8, that's fine.
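
When the server has already applied the latin-1 default to bytes that were really UTF-8, the original bytes are not lost, because iso-8859-1 maps every byte to exactly one character. A common application-level workaround, not something proposed in this thread, is to turn the string back into bytes and decode again. The sketch below uses a hypothetical class name and is only valid under the assumption that the server decoded as ISO-8859-1 and the client sent UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: undo a latin-1 decode of bytes that were really UTF-8.
class MojibakeRepair {

    static String reinterpretAsUtf8(String misdecoded) {
        // Recover the original bytes; lossless for ISO-8859-1 decoded text.
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Decode them with the charset the client actually used (assumed UTF-8).
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "b", A-tilde, copyright sign, "l", "a": the miscoded form of the name
        String broken = "b\u00c3\u00a9la";
        System.out.println(reinterpretAsUtf8(broken).equals("b\u00e9la"));
    }
}
```

If the parameter was decoded correctly in the first place, applying this conversion will corrupt it, so it is safer to fix the decoding at the connector than to patch strings afterwards.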

After the HTTP request line come any number of HTTP headers.  As far as
I remember, these should conform to the rules for MIME headers, which
may well specify that they should be limited to ASCII; I am too lazy to
check.

Then there may be a blank line, followed by the request body.
For that one, the situation is totally different, because a preceding
HTTP header should specify the content type and, if it is text, the
character set and encoding used.

By using the option in Tomcat that specifies "consider the request URL
as being in the same encoding as the request body", you are making the
big assumption that you know the client, and that you know that it will
send requests that way.
Between a client and a server that "don't know each other", it is very
unsafe to make that assumption.  Specifying this parameter in Tomcat is
not going to magically make your client respect that convention.

It's a pity, but that's the way it is with HTTP 1.1.
The people who designed the protocol and wrote the specs did a great
job, but did not include any unambiguous way to specify, in the URL
itself, which character set or encoding it was written in, if
it is not the default latin-1.

In the SMTP protocol, by contrast, there exists a way to specify the
encoding of a header value (e.g. the "Subject" header), at the beginning
of the header value itself.

André
