Our webapp is running very slowly on one particular customer box


James H. H. Lampert
This is related to my query (thanks, Mr. Gregg) about "Tenured SOA."

It seems that on one of our customer installations, our webapp gets into
a state of running very slowly, and the dedicated subsystem it's running
in is showing massive levels of page-faulting.

I've compared the GC stats of the "problem" system with one that's
actually got more users connected, but isn't experiencing performance
issues. It seems that they're both going to GC about every 30-50
seconds, but GC on the "problem" system appears to be somewhat less
effective.
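
One way to gather comparable GC numbers on both boxes is the standard java.lang.management API. The following is only a minimal sketch under stated assumptions, not the tool used to collect the stats above: the class name GcSnapshot and the helper avgPauseMillis are hypothetical, and the code reports on the JVM it runs in, so it would have to run inside the Tomcat JVM (or be adapted to a remote JMX connection) to say anything about the webapp. It shows collection counts and cumulative collection time, not how much memory each cycle reclaims.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper class; the name is illustrative, not from the thread.
class GcSnapshot {
    // Average time per collection in milliseconds; 0 when no count is available.
    static double avgPauseMillis(long count, long totalMillis) {
        return count <= 0 ? 0.0 : (double) totalMillis / count;
    }

    public static void main(String[] args) {
        // One bean per collector; names differ between JVM vendors.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();  // -1 if undefined for this collector
            long millis = gc.getCollectionTime();  // -1 if undefined for this collector
            System.out.printf("%s: collections=%d, total=%d ms, avg=%.1f ms%n",
                    gc.getName(), count, millis, avgPauseMillis(count, millis));
        }
    }
}
```

Taking the same snapshot a few minutes apart on each system would give a per-interval count and time that can be compared directly.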

Also, I've looked at the threads on both. On the system that is behaving
normally, the "GC Slave" threads (7 of them) are showing total CPU (at
this moment) of around 150 seconds each, and Aux I/O of mostly zero,
with one showing 1 and one showing 3. Conversely, on the "problem"
system, I'm seeing 15(!) GC Slave threads, each with total CPU under 6
seconds, but Aux I/O ranging from 5800 to over 8000.

I'm not sure what to make of this. In both cases, Tomcat's JVM is
running in a subsystem of its own, with a private memory pool of around
7G, and they're both running with -Xms4096m -Xmx5120m.

--
JHHL

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Our webapp is running very slowly on one particular customer box

Christopher Schultz-2
James,

On 10/27/20 16:20, James H. H. Lampert wrote:

> This is related to my query (thanks, Mr. Gregg) about "Tenured SOA."
>
> It seems that on one of our customer installations, our webapp gets into
> a state of running very slowly, and the dedicated subsystem it's running
> in is showing massive levels of page-faulting.
>
> I've compared the GC stats of the "problem" system with one that's
> actually got more users connected, but isn't experiencing performance
> issues. It seems that they're both going to GC about every 30-50
> seconds, but GC on the "problem" system appears to be somewhat less
> effective.
>
> Also, I've looked at the threads on both. On the system that is behaving
> normally, the "GC Slave" threads (7 of them) are showing total CPU (at
> this moment) of around 150 seconds each, and Aux I/O of mostly zero,
> with one showing 1 and one showing 3. Conversely, on the "problem"
> system, I'm seeing 15(!) GC Slave threads, each with total CPU under 6
> seconds, but Aux I/O ranging from 5800 to over 8000.
>
> I'm not sure what to make of this. In both cases, Tomcat's JVM is
> running in a subsystem of its own, with a private memory pool of around
> 7G, and they're both running with -Xms4096m -Xmx5120m.

If you expect the service to be long-running, definitely set Xms=Xmx.
There's no reason to artificially restrict the heap "early" in the
process's lifetime only to completely re-size and re-organize the heap
over time. You may as well allocate the maximum right up front and leave
it that way.
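
To confirm what the running JVM actually got, the initial and maximum heap sizes can be read back at runtime. This is a rough sketch, assuming the standard MemoryMXBean is available; the class name HeapSizing and the helper growthRoomMiB are hypothetical. The reported values may not match the -Xms/-Xmx flags byte for byte on every JVM, so a small nonzero gap is not proof of a misconfiguration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper class; the name is illustrative, not from the thread.
class HeapSizing {
    // Room the heap may still grow into, in MiB; -1 if either size is undefined.
    static long growthRoomMiB(long initBytes, long maxBytes) {
        if (initBytes < 0 || maxBytes < 0) {
            return -1;
        }
        return (maxBytes - initBytes) / (1024L * 1024L);
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long mib = 1024L * 1024L;
        // getInit() and getMax() return -1 when the JVM does not define them.
        System.out.printf("init=%d MiB, max=%d MiB, growth room=%d MiB%n",
                heap.getInit() / mib, heap.getMax() / mib,
                growthRoomMiB(heap.getInit(), heap.getMax()));
    }
}
```

With the flags quoted above (4096m initial, 5120m maximum), the growth room works out to about 1024 MiB; with Xms=Xmx it would be at or near zero.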

The problem system certainly appears to be thrashing its GC. Are there
any environmental differences that you notice about the two systems? For
a JVM with a maximum heap of ~5GiB, I think that a 7GiB private memory
space (this is an AS/400 thing isn't it?) isn't large enough. The heap
space is just the "Java heap" and there are other things that need
memory, sometimes ~= to the heap size. It's sometimes surprising how
much "native" memory a JVM needs. Is the kernel+userspace running in
that "subsystem" as well? Or just the JVM process?

I'm guessing that your comment about page-faulting and "Aux I/O rang[es]
from 5800 - 8000 [sec]" means that you are actually paging the heap to
the disk. What happens if you shrink your max-heap to 2GiB and change
nothing else? This should make sure that your heap + native memory fits
into physical memory and that thrashing should stop.
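
The arithmetic behind that suggestion can be sketched as follows. The assumption that native memory is roughly equal to the maximum heap is only the pessimistic case mentioned above, not a measured figure, and the class and method names (PoolFit, estimatedFootprint, fits) are hypothetical.

```java
// Hypothetical helper class; the name is illustrative, not from the thread.
class PoolFit {
    static final long GIB = 1024L * 1024L * 1024L;

    // ASSUMPTION: native memory is about as large as the maximum heap.
    static long estimatedFootprint(long maxHeapBytes) {
        return 2 * maxHeapBytes;
    }

    // True when the estimated footprint fits inside the given pool size.
    static boolean fits(long maxHeapBytes, long poolBytes) {
        return estimatedFootprint(maxHeapBytes) <= poolBytes;
    }

    public static void main(String[] args) {
        long pool = 7 * GIB; // "around 7G" private pool from the original post
        System.out.println("5 GiB heap fits: " + fits(5 * GIB, pool)); // 10 GiB vs 7 GiB
        System.out.println("2 GiB heap fits: " + fits(2 * GIB, pool)); //  4 GiB vs 7 GiB
    }
}
```

Under that assumption a 5GiB heap would need about 10GiB and overflow the pool, while a 2GiB heap would need about 4GiB and fit.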

Maybe you *do* need a 5GiB heap, though. In that case, if the
heap-shrink works but you get OOMEs under load, then I think that simply
increasing the memory allocated to the "subsystem" should help a lot.

How much (real) memory does the system report is being used by the JVM
process? I think you'll find it much larger than 5GiB.
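
From inside the JVM, a first approximation of that number is the committed heap plus committed non-heap memory. The sketch below assumes the standard MemoryMXBean; the class name JvmFootprint and its helpers are hypothetical. It only covers memory the JVM itself tracks, so thread stacks, direct buffers and other native allocations are not included, and the figure the operating system reports for the process will normally be higher.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper class; the name is illustrative, not from the thread.
class JvmFootprint {
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Committed heap plus committed non-heap memory, as reported by the JVM.
    static long committedBytes(MemoryUsage heap, MemoryUsage nonHeap) {
        return heap.getCommitted() + nonHeap.getCommitted();
    }

    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        MemoryUsage nonHeap = mem.getNonHeapMemoryUsage();
        System.out.printf("heap committed=%d MiB, non-heap committed=%d MiB, total=%d MiB%n",
                toMiB(heap.getCommitted()), toMiB(nonHeap.getCommitted()),
                toMiB(committedBytes(heap, nonHeap)));
    }
}
```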

-chris

Re: Our webapp is running very slowly on one particular customer box

James H. H. Lampert
First, thanks once again, Mr. Schultz, for getting back to me.

I noticed something rather promising: it seems that maxThreads for the
Port 443 connector was set at 150 for System "A" (problem box), but 400
for System "J" (box that's quite happy).

I've restarted Tomcat with the maxThreads bumped up to 400, and so far,
it seems much happier than it was. That could have been the problem all
along.

My colleagues and I also observed that yesterday, when we did *not* shut
down and restart, the slowdown and the nearly-full "tenured-SOA" portion
of the heap eventually resolved itself, which suggests that it wasn't a
memory leak in any even remotely conventional sense of the term.

The page-faulting is a virtual memory term: on an AS/400, the entire
combined total of main storage and disk is addressable (the concept is
called "Single-Level Store"), and virtual storage paging is built into
the OS at a very low level; a "page fault" is when a process tries
to access something that's been paged out to disk.

As to the private memory pool, it's not that the subsystem is restricted
to its private pool; rather, everything else is kept *out* of that
private pool. It still has full access to the "Machine" and "Base"
shared pools.

--
JHHL

Re: Our webapp is running very slowly on one particular customer box

Christopher Schultz-2
James,

On 10/28/20 16:40, James H. H. Lampert wrote:
> First, thanks once again, Mr. Schultz, for getting back to me.
>
> I noticed something rather promising: it seems that maxThreads for the
> Port 443 connector was set at 150 for System "A" (problem box), but 400
> for System "J" (box that's quite happy).
>
> I've restarted Tomcat with the maxThreads bumped up to 400, and so far,
> it seems much happier than it was. That could have been the problem all
> along.

Hmm. That doesn't sound very satisfying to me, honestly. Allowing *more*
load uses *less* GC and/or fewer page-faults? Seems fishy.

> My colleagues and I also observed that yesterday, when we did *not* shut
> down and restart, the slowdown and the nearly-full "tenured-SOA" portion
> of the heap eventually resolved itself, which suggests that it wasn't a
> memory leak in any even remotely conventional sense of the term.

That's a Good Thing, but also not very satisfying when you just want it
to stop sucking and let your users get work done :)

> The page-faulting is a virtual memory term: on an AS/400, the entire
> combined total of main storage and disk is addressable (the concept is
> called "Single-Level Store"), and virtual storage paging is built into
> the OS at a very low level; a "page fault" is when a process tries
> to access something that's been paged out to disk.

Yes, this is the common definition of a page-fault, not just an AS/400
thing. Good to know for sure that AS/400 doesn't re-define that term,
though :)

How long has the process on System J been running? How about System A
(before you restarted the JVM)?

> As to the private memory pool, it's not that the subsystem is restricted
> to its private pool; rather, everything else is kept *out* of that
> private pool. It still has full access to the "Machine" and "Base"
> shared pools.


Okay, so it's like a guaranteed-minimum memory space?

-chris
