[PROPOSAL] Explicitly-set the request character encoding when it has been committed

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

[PROPOSAL] Explicitly-set the request character encoding when it has been committed

Christopher Schultz-2
All,

Yesterday, I struggled to determine why my application was behaving
differently than it had been in the past, and the problem turned out to
be that I had inserted a <filter> in the filter-chain before my
CharacterEncodingFilter. My new <filter> was reading a
request-parameter. The CharacterEncodingFilter looks like this:

     public void doFilter(ServletRequest request,
                         ServletResponse response,
                         FilterChain chain)
        throws IOException, ServletException
     {
        request.setCharacterEncoding(getCharacterEncoding(request));

        chain.doFilter(request, response);
     }

     protected String getCharacterEncoding(ServletRequest request)
     {
        String charset=request.getCharacterEncoding();

        if(null == charset)
            return this.getDefaultEncoding();
        else
            return charset;
     }

The "default encoding" is essentially always UTF-8.

When looking at the request in my servlet, the character encoding
reported by the request was "UTF-8", but the actual encoding used
appeared to be ISO-8859-1 (the protocol default).

It looks like Tomcat is defaulting to ISO-8859-1 but continuing to
return null for request.getCharacterEncoding().

My proposal is to have Tomcat set the request encoding field to
"ISO-8859-1" in the following situation:

1. The character encoding is null
2. A method is called which requires that the character encoding be
"committed"

Once that charset is determined, changing the request's charset has no
effect whatsoever other than to confuse the application developer.

If Tomcat were to explicitly-set that encoding, my
CharacterEncodingFilter wouldn't detect null and override that request
charset with "UTF-8", thereby lying to the rest of the application.

I might even go so far as to propose that calling
request.setCharacterEncoding() after the encoding has been committed
should throw IllegalStateException. (This may be a violation of the
spec, as the javadoc only declares UnsupportedEncodingException).

The javadoc state:

"
Overrides the name of the character encoding used in the body of this
request. This method must be called prior to reading request parameters
or reading input using getReader(). *Otherwise, it has no effect.*
"

(emphasis mine)

In Tomcat, if you call setCharacterEncoding("UTF-8") after request
parameters have been read, there *is* an effect: you change the value of
the charset field in the response.

This is the current implementation of setCharacterEncoding:

     /**
      * Overrides the name of the character encoding used in the body of
      * this request.  This method must be called prior to reading request
      * parameters or reading input using <code>getReader()</code>.
      *
      * @param enc The character encoding to be used
      *
      * @exception UnsupportedEncodingException if the specified encoding
      *  is not supported
      *
      * @since Servlet 2.3
      */
     public void setCharacterEncoding(String enc) throws
UnsupportedEncodingException {

         if (usingReader) {
             return;
         }

         // Confirm that the encoding name is valid
         Charset charset = B2CConverter.getCharset(enc);

         // Save the validated encoding
         coyoteRequest.setCharset(charset);
     }

The javadoc says that it must be called before reading any request
parameters OR calling getReader() but there is only a check for the reader.

Maybe we should change the check to:

         if (usingReader || parametersParsed) {
             return;
         }

And also, change Request.parseParameters to add:

             // getCharacterEncoding() may have been overridden to
search for
             // hidden form field containing request encoding
             Charset charset = getCharset();

             // Add this line, here:
             coyoteRequest.setCharset(charset);

It is at this point that the character set is truly committed, at least
when parsing parameters.

In getReader, we have similar logic, where we set the character set for
the request after determining what it actually is:

         if (coyoteRequest.getCharacterEncoding() == null) {
             // Nothing currently set explicitly.
             // Check the content
             Context context = getContext();
             if (context != null) {
                 String enc = context.getRequestCharacterEncoding();
                 if (enc != null) {
                     // Explicitly set the context default so it is
visible to
                     // InputBuffer when creating the Reader.
                     setCharacterEncoding(enc);
                 }
             }
         }

I'm open to any or all of the above, but I think something should be
done. It was surprising to see that the request's charset was "UTF-8"
and yet the UTF-8 bytes sent to the container were coming out mangled in
the application.

If the application were to see null coming back from
request.getCharacterEncoding() would have at least giving a clue as to
what was happening. Since the effective charset being used was
ISO-8859-1, it would have been better to return "ISO-8859-1" instead of
"null" knowing that the parameters had *already* been interpreted using
that character set.

Thanks,
-chris

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [PROPOSAL] Explicitly-set the request character encoding when it has been committed

markt
On 01/04/2021 17:08, Christopher Schultz wrote:

<snip/>

> The javadoc says that it must be called before reading any request
> parameters OR calling getReader() but there is only a check for the reader.
>
> Maybe we should change the check to:
>
>          if (usingReader || parametersParsed) {
>              return;
>          }

+1

> And also, change Request.parseParameters to add:
>
>              // getCharacterEncoding() may have been overridden to
> search for
>              // hidden form field containing request encoding
>              Charset charset = getCharset();
>
>              // Add this line, here:
>              coyoteRequest.setCharset(charset);

I'm less sure of this because of this line in the Javadoc for
getCharacterEncoding():

@return ... <code>null</code> if the request does not specify a
character encoding

> It is at this point that the character set is truly committed, at least
> when parsing parameters.
>
> In getReader, we have similar logic, where we set the character set for
> the request after determining what it actually is:

Which suggests the second of the above changes might be OK even if the
spec suggests otherwise.

At the moment, I'm leaning towards just the first of your proposed
changes but I'm open to arguments to do more.

Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [PROPOSAL] Explicitly-set the request character encoding when it has been committed

Christopher Schultz-2
Mark,

On 4/1/21 16:00, Mark Thomas wrote:

> On 01/04/2021 17:08, Christopher Schultz wrote:
>
> <snip/>
>
>> The javadoc says that it must be called before reading any request
>> parameters OR calling getReader() but there is only a check for the
>> reader.
>>
>> Maybe we should change the check to:
>>
>>          if (usingReader || parametersParsed) {
>>              return;
>>          }
>
> +1
>
>> And also, change Request.parseParameters to add:
>>
>>              // getCharacterEncoding() may have been overridden to
>> search for
>>              // hidden form field containing request encoding
>>              Charset charset = getCharset();
>>
>>              // Add this line, here:
>>              coyoteRequest.setCharset(charset);
>
> I'm less sure of this because of this line in the Javadoc for
> getCharacterEncoding():
>
> @return ... <code>null</code> if the request does not specify a
> character encoding
>
>> It is at this point that the character set is truly committed, at
>> least when parsing parameters.
>>
>> In getReader, we have similar logic, where we set the character set
>> for the request after determining what it actually is:
>
> Which suggests the second of the above changes might be OK even if the
> spec suggests otherwise.
>
> At the moment, I'm leaning towards just the first of your proposed
> changes but I'm open to arguments to do more.

I'm happy with incremental changes. Making the first change will be an
improvement for sure.

On the other hand, if code calls request.setCharacterEncoding(non-null)
then it will overwrite whatever the request had previously sent
(including null). So in that case, request.getCharacterEncoding isn't
always returning "what the client sent".

-chris

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [PROPOSAL] Explicitly-set the request character encoding when it has been committed

markt
On 05/04/2021 13:19, Christopher Schultz wrote:

<snip/>

> I'm happy with incremental changes. Making the first change will be an
> improvement for sure.
>
> On the other hand, if code calls request.setCharacterEncoding(non-null)
> then it will overwrite whatever the request had previously sent
> (including null). So in that case, request.getCharacterEncoding isn't
> always returning "what the client sent".

Another place where there is scope to add further clarity to the spec.

Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]