Adding regular expression support to CORS filter


Adding regular expression support to CORS filter

Carsten Klein
Hi there,

I'd like to contribute a CORS filter enhancement that makes it accept
both wildcard-based and regular-expression-based patterns in its
allowed origins list.

I know this from a Jetty-based project, which supports, at least,
simple wildcard matching (*). Specifying multiple allowed origins with
one pattern comes in quite handy if there are numerous developer
machines that all access a single server from their browsers.

The implementation shall support two flavors of expressions:

- Java Pattern Class based expressions

Enclosed between slashes (/.../), these bring the full power of regular
expressions to the filter's allowed origins configuration. This flavor
will also support pattern flags, specified by appending single
characters after the terminating slash (as in /^foo.bar$/i for
case-insensitive matching).

- Wildcard-based simple expressions

With a much simpler syntax, these expressions are more compact and may
be more intuitive for people not familiar with regular expressions.
Although less powerful than *real* regular expressions, the special
characters supported should provide enough options to specify allowed
origins efficiently:

?       matches any single character except the domain separator (.)

#       matches any single digit

*       matches any number of any characters except the domain
         separator (.) including none

**      matches any number of any characters including none (this also
         matches the domain separator and so matches several sub-
         domains)
         (Technically, any run of two or more consecutive asterisks
         is treated as a single ** pattern.)

[abc]   not yet sure about character classes
[a-z]
[^abc]

Wildcard-based expressions are implemented with regular expressions as
well. Such an expression is turned into an *anchored* regular
expression during initialization, e.g.

http://?*.devzone.intra  ==>  ^http://[^.][^.]*\.devzone\.intra$
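
To make this concrete, here is a minimal sketch of how such a
compilation step could look during filter initialization. All names
(OriginExpressions, compile, wildcardToRegex) are made up for
illustration, not existing filter code, and character classes are left
out since they are still undecided:

import java.util.regex.Pattern;

final class OriginExpressions {

    /** Compiles "/regex/flags" or a wildcard expression into a Pattern. */
    static Pattern compile(String expression) {
        int lastSlash = expression.lastIndexOf('/');
        if (expression.startsWith("/") && lastSlash > 0) {
            // slash-enclosed Java regex, e.g. /^foo.bar$/i
            String regex = expression.substring(1, lastSlash);
            int flags = 0;
            for (char c : expression.substring(lastSlash + 1).toCharArray()) {
                if (c == 'i') {
                    flags |= Pattern.CASE_INSENSITIVE;
                }
                // further flag characters could map to other Pattern flags
            }
            return Pattern.compile(regex, flags);
        }
        return Pattern.compile(wildcardToRegex(expression));
    }

    /** Translates a wildcard expression into an anchored regex. */
    static String wildcardToRegex(String wildcard) {
        StringBuilder regex = new StringBuilder("^");
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            switch (c) {
            case '?':
                regex.append("[^.]");      // any single char except '.'
                break;
            case '#':
                regex.append("\\d");       // any single digit
                break;
            case '*':
                int run = 1;               // collapse runs of asterisks
                while (i + 1 < wildcard.length()
                        && wildcard.charAt(i + 1) == '*') {
                    run++;
                    i++;
                }
                // '**' may cross the domain separator, '*' may not
                regex.append(run > 1 ? ".*" : "[^.]*");
                break;
            default:
                if ("\\^$.|+()[]{}".indexOf(c) >= 0) {
                    regex.append('\\');    // escape regex metacharacters
                }
                regex.append(c);
            }
        }
        return regex.append('$').toString();
    }
}

For example, compile("http://?*.devzone.intra") yields a pattern
equivalent to the anchored expression shown above.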

Of course, it is still possible to specify literal origins, as well as
the sole '*' to allow access from any origin. So, the value of the
'cors.allowed.origins' initialization parameter may look like this:


https://www.apache.org,
http://?*.devzone.intra,
/:\/\/staginghost-\d{1,3}.mycompany.corp$/i


As you can see, *real* Java regular expressions are not anchored by
default. The one above is end-anchored only (making it match any
protocol).

Obviously, forward slashes within the regular expression must be
escaped so as not to end the expression prematurely.


The current CORS filter implementation uses a HashSet<String> to
determine whether a given origin is valid or not. That's quite a fast
solution. Since evaluating a regular expression is much more expensive
(~25 times slower), some sort of caching mechanism is required.

The idea is to transparently add positive matches (allowed origins) to
the same HashSet<String> that already contains all literally specified
allowed origins. Since, in practice, these positives form a rather
small, finite set, we can simply add them without worrying about cache
eviction. There is no difference in memory consumption compared to the
current implementation: all allowed origins must be stored in that hash
set anyway.

There *may* be more disallowed origins than allowed ones, so another
cache, for non-matching origins, is certainly a good idea. However,
this adds extra memory consumption, and that cache should not grow
without limit. The idea is to use a LinkedHashMap in access-order mode
(accessOrder = true). This makes the map an LRU cache, removing the
least recently used entry when a defined maximum capacity is exceeded
while adding a new entry (see the sketch below). Maybe that cache's
capacity should be configurable.
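
A minimal sketch of such an LRU cache follows. Note that an
access-ordered LinkedHashMap does not evict entries on its own;
removeEldestEntry must be overridden to enforce the capacity:

import java.util.LinkedHashMap;
import java.util.Map;

final class LruCache<K, V> extends LinkedHashMap<K, V> {

    private static final long serialVersionUID = 1L;

    private final int maxCapacity;

    LruCache(int maxCapacity) {
        // accessOrder = true keeps entries in least-recently-used order
        super(16, 0.75f, true);
        this.maxCapacity = maxCapacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // evict the least recently used entry once the cap is exceeded
        return size() > maxCapacity;
    }
}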

Write access to these caches must be synchronized. Currently, I tend
towards using a ReadWriteLock.

So, the new algorithm for determining whether an origin is allowed or
not looks like this:

private boolean isOriginAllowed(final String origin) {

    if (anyOriginAllowed) {
        return true;
    }

    if (allowedOrigins.contains(origin)) {
        return true;
    }

    if (notAllowedOrigins.containsKey(origin)) {
        return false;
    }

    // not cached yet: evaluate the configured patterns; cacheLock is
    // a java.util.concurrent.locks.ReadWriteLock field guarding write
    // access to both caches
    cacheLock.writeLock().lock();
    try {
        boolean result = false;
        for (Pattern pattern : allowedOriginPatterns) {
            if (pattern.matcher(origin).matches()) {
                allowedOrigins.add(origin);
                result = true;
                break;
            }
        }

        if (!result) {
            // origin is definitely not allowed
            notAllowedOrigins.put(origin, null);
        }

        return result;
    } finally {
        cacheLock.writeLock().unlock();
    }
}

As you can see, no more than N extra evaluations of a regular
expression are required for an allowed origin during the filter's
lifetime (namely when the origin is seen for the first time), N being
the number of different expressions configured.

The same is true for not-allowed origins if the LRU cache is large
enough and there are requests from only a few different not-allowed
origins (that is, if the LRU cache is not smaller than the number of
different not-allowed origins).

However, since an access-ordered LinkedHashMap updates its linked list
on lookups (strictly speaking, per the Javadoc, only 'get' records an
access, not 'containsKey', so the cache lookup should use 'get'), such
a lookup is about 1.6 times slower than the corresponding call to
HashSet's 'contains' method. But we are only slowing down requests for
not-allowed origins.

Also, absolute timings per request are ~5.0 ns to ~7.8 ns (on a quite
recent i7 @ 3.6 GHz), so that should not really be an issue. With the
above algorithm, these two times add up for a single not-allowed
request whose origin is in the LRU cache, but ~12.8 ns is still not
that dramatic, right?

In all other cases, up to N extra evaluations of a regular expression
are required if a not-allowed origin is currently not in the LRU cache
(which, in theory, may be the case for every request). However, that
highly depends on the order of requests and the distribution of their
origins.

It would be great to get your support for contributing the described
CORS filter enhancement.

Carsten




Re: Adding regular expression support to CORS filter

Carsten Klein

Any comments on that? Is it worth preparing a PR?

Re: Adding regular expression support to CORS filter

Christopher Schultz-2
Carsten,

On 9/27/20 05:53, Carsten Klein wrote:
> Any comments on that? Is it worth preparing a PR?

Regular expressions are fairly expensive.

If there is a way to build the code such that some subset of wildcards
can be serviced without regex (and of course exact matches without using
regex), then I'm at least a +0 on this.

It may seem like over-engineering, but maybe creating a Matcher
interface and a Factory (I know, I know) which produces Matchers which
implement the optimal strategy for a particular value would be a good
way to do this.

A single matcher could be used for really simple values and maybe a
MultipleMatcher could be used for multiple different value-checks.
Something like that will likely lead to the highest-performing filter
processing.
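
One possible shape of that design, as a rough sketch (all names here
are hypothetical, not existing Tomcat code):

import java.util.List;
import java.util.regex.Pattern;

interface OriginMatcher {
    boolean matches(String origin);
}

final class OriginMatchers {

    /** Picks the cheapest strategy that can handle the configured value. */
    static OriginMatcher forValue(String configured) {
        if (configured.startsWith("/") && configured.lastIndexOf('/') > 0) {
            // slash-enclosed value: full regex matcher
            Pattern p = Pattern.compile(
                    configured.substring(1, configured.lastIndexOf('/')));
            return origin -> p.matcher(origin).matches();
        }
        if (configured.indexOf('*') < 0 && configured.indexOf('?') < 0) {
            // no wildcard characters: exact match, no regex involved
            return configured::equals;
        }
        if (configured.startsWith("https://*.")) {
            // simple leading-wildcard case serviced without regex
            // (simplified: ignores the one-label restriction of '*')
            String suffix = configured.substring("https://*".length());
            return origin -> origin.startsWith("https://")
                    && origin.endsWith(suffix);
        }
        // remaining wildcard forms would fall back to a compiled regex
        throw new UnsupportedOperationException("sketch only");
    }

    /** A "MultipleMatcher": true if any delegate accepts the origin. */
    static OriginMatcher anyOf(List<OriginMatcher> delegates) {
        return origin -> delegates.stream().anyMatch(m -> m.matches(origin));
    }
}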

-chris



Re: Adding regular expression support to CORS filter

Carsten Klein
Chris,

On 9/28/20 02:40, Christopher Schultz wrote:
> Carsten,
>
> On 9/27/20 05:53, Carsten Klein wrote:
>> Any comments on that? Is it worth preparing a PR?
>
> Regular expressions are fairly expensive.

Yes, but my measurements of the HashSet lookups were wrong: a String's
hashCode() gets cached, so, while measuring in a loop, the hash value
gets computed only once. Setting up a fair test is challenging (using
new strings in the loop causes memory allocations and GCs). I ended up
additionally calling a clone of String's hashCode() method in the loop.

Now, testing an origin with HashSet.contains(origin) (the current
solution) takes about 19 ns for a cache miss (empty list in bucket) and
30 ns for a cache hit on my box.

In contrast, evaluating a regular expression takes about 120 ns, so it
is about 6 times slower (NOT 25 times, as stated in my last mail).

Creating a new Matcher instance for every test is a significant
bottleneck (in my measurement loop): reusing the same Matcher instance
and resetting it with a new input string makes the test take only about
75 ns (only about 4 times slower than the HashSet test).
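
The reuse pattern looks like this (a minimal sketch; the pattern and
origin are just examples, and note that Matcher instances are not
thread-safe, so a reused Matcher would have to be confined to one
thread):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherReuse {

    public static void main(String[] args) {
        Pattern pattern =
                Pattern.compile("^http://[^.][^.]*\\.devzone\\.intra$");
        String origin = "http://pc42.devzone.intra";

        // variant 1: a fresh Matcher per test, as matches() implies
        boolean fresh = pattern.matcher(origin).matches();

        // variant 2: one Matcher, reset with a new input for each test;
        // this avoids the per-call Matcher allocation
        Matcher m = pattern.matcher("");
        boolean reused = m.reset(origin).matches();

        System.out.println(fresh + " " + reused);   // true true
    }
}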

So, regular expressions are not as bad as we thought?

>
> If there is a way to build the code such that some subset of wildcards
> can be serviced without regex (and of course exact matches without using
> regex), then I'm at least a +0 on this.

I never intended to implement tests for exact matches with regular
expressions. Configured "exact" origins (those without wildcards and
not enclosed between slashes) will still be tested with
HashSet.contains, of course. So, there's no change (performance-wise)
for Tomcat setups that only use exactly defined allowed origins.

And if someone uses the new "inexact" origin specifiers, is it not
understandable that these are more expensive from a performance point
of view?

> It may seem like over-engineering, but maybe creating a Matcher
> interface and a Factory (I know, I know) which produces Matchers which
> implement the optimal strategy for a particular value would be a good
> way to do this.
>
> A single matcher could be used for really simple values and maybe a
> MultipleMatcher could be used for multiple different value-checks.
> Something like that will likely lead to the highest-performing filter
> processing.

I did some tests on that. As I don't believe that a (self-made)
NFA-based solution could outperform Java's regular expression engine, I
was looking for a different (simpler) algorithm.

I started with a rule-driven globbing algorithm that supports

?   (any char except "."),
#   (any digit, using Character.isDigit) and
$   (any letter, using Character.isAlphabetic),

as well as literal character sequences.

(Actually, I followed your suggestion: your Matchers are my Rules,
together with a piece of code in a switch-case block; see the sketch
after this paragraph. Using real classes/instances for the matchers
requires method invocations and maybe instanceof tests. Both add extra
overhead, so I decided to use a more C-like approach.)
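
A minimal sketch of that rule-driven loop, limited to the
single-character rules (the * and ** backtracking part is omitted);
Rule, Kind and the other names are made up for illustration:

final class GlobRules {

    enum Kind { LITERAL, ANY_EXCEPT_DOT, DIGIT, LETTER }

    // a Rule is a simple bean telling the loop what to match next
    static final class Rule {
        final Kind kind;
        final String literal;   // only used by LITERAL rules
        Rule(Kind kind, String literal) {
            this.kind = kind;
            this.literal = literal;
        }
    }

    static boolean matches(Rule[] rules, String input) {
        int pos = 0;
        for (Rule rule : rules) {
            switch (rule.kind) {
            case LITERAL:
                if (!input.startsWith(rule.literal, pos)) {
                    return false;
                }
                pos += rule.literal.length();
                break;
            case ANY_EXCEPT_DOT:   // '?'
                if (pos >= input.length() || input.charAt(pos) == '.') {
                    return false;
                }
                pos++;
                break;
            case DIGIT:            // '#'
                if (pos >= input.length()
                        || !Character.isDigit(input.charAt(pos))) {
                    return false;
                }
                pos++;
                break;
            case LETTER:           // '$'
                if (pos >= input.length()
                        || !Character.isAlphabetic(input.charAt(pos))) {
                    return false;
                }
                pos++;
                break;
            }
        }
        return pos == input.length();   // anchored: consume all input
    }
}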

That simple algorithm takes about 42 ns and so is still 2 times slower
than the HashSet test. I have already come more than half the way
towards supporting the * and ** multi-matching wildcards. That
implementation uses non-recursive backtracking, similar to the
algorithms described at
https://www.codeproject.com/Articles/5163931/Fast-String-Matching-with-Wildcards-Globs-and-Giti

With * and ** partly in place, time consumption is about 50 ns. The
code additions needed to make the algorithm work in the many edge cases
will very likely add more nanoseconds, so we may soon end up at 60 ns
or even more. That's almost the same time required for evaluating a
regular expression (without the time needed to create the Matcher
instance).

The algorithm is optimized and uses only a few method calls and no OOP
constructs (by using the Rules, which are like beans that specify what
to match next, the whole logic can be implemented in a single method).
But it's still not much faster than a Java regular expression test.

I don't believe it's worth self-implementing such a rather complex
(error-prone) and hard-to-understand (and maintain) algorithm if it's
not significantly faster than real Java regular expressions.

Anyhow, if performance is not to degrade when wildcards are used in the
allowed origins, the goal is not just to be better than Java regular
expressions, but to be close to the HashSet test (~19 ns). And if real
regular expressions are to be supported as well (enclosed between
slashes), all of that won't help much for those.

That's why I wanted to combine this with a HashMap-based (LRU) cache,
so that regular expressions must only be evaluated if the current
request's origin is not yet in the cache. This cache's performance
nearly equals that of the HashSet test (depending on whether real LRU
is used or not). That way, the time required for testing a regular
expression is no longer significant in the average case.

Finally, we should not only compare the performance of the different
matching algorithms; the absolute time values should also be compared
to the request's overall processing time. We are talking about
intervals between ~20 and ~120 nano(!)seconds. I was using Tomcat's
access log with the %D placeholder, which reports the request's overall
processing time (before Tomcat 10, as an integer number of
milliseconds). On my box, a request to a simple servlet that echoes
some request properties (no database or filesystem access) takes
between 0 and 2 milliseconds. Using the average of 1 millisecond (that
is, 1,000,000 nanoseconds), every regular expression test adds ~0.012%
of extra time to the request.

In other words, we could easily test about 83 regular expressions per
request and still add only ~1% extra processing time (1% of 1 ms is
10,000 ns, and 10,000 ns / 120 ns ≈ 83). Few people will configure that
many different allowed origin patterns.

So, I still believe that adding support for specifying the CORS
filter's allowed origins with wildcards (globbing) or with real regular
expressions does not (and should not) depend on a new pattern-matching
algorithm. Java's regular expression engine is, in fact, not that slow
(and the new algorithm is probably not significantly faster).

Compared with the request's overall processing time, even Java's slow
but proven regular expression engine leaves enough headroom for testing
a few dozen regular expressions without degrading performance by more
than one percent.

With a cache, the number of regular expression tests can be minimized
even further, so that the allowed-state of the presented origin can be
determined by only two hash lookups for most requests.

Carsten
