Discussion:
Request for review (L): 7087838: JSR 292: add throttling logic for optimistic call site optimizations
Rémi Forax
2011-09-17 09:33:10 UTC
Permalink
Hi Christian, hi all,
I understand why you need this kind of logic,
but I think it's not compatible with the approach taken by Mark Roos,
i.e. flushing all call sites once more than a predefined number of them
have installed an inline cache.

A possible solution is to add a way in the API to know whether a call site
will trigger a deoptimization if its target changes.

Rémi
[This change will be pushed after 7087357. So ignore the code removal in src/share/vm/classfile/javaClasses.cpp.]
http://cr.openjdk.java.net/~twisti/7087838/
7087838: JSR 292: add throttling logic for optimistic call site optimizations
The optimistic optimization for MutableCallSite and VolatileCallSite
invalidates compiled methods on every setTarget call, which possibly
results in a recompile. For ever-changing call sites this is a performance
hit.
The fix is to add throttling logic that disables the optimistic
optimization after a specified number of invalidations per CallSite
object.
This change also moves the flush_dependents_on methods from Universe
to CodeCache.
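To make the mechanism concrete, here is a plain-Java model of the idea under review. The real throttling lives in HotSpot's C++ code and is keyed off the CallSite object; the class and method names below (ThrottledCallSite, optimistic) are illustrative, not part of the patch:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MutableCallSite;

// Illustrative sketch only: models the per-call-site invalidation budget
// in Java; the actual implementation sits inside the VM.
public class ThrottledCallSite extends MutableCallSite {
    private final int limit;   // plays the role of the trap limit
    private int invalidations; // how many times the target was replaced

    public ThrottledCallSite(MethodHandle target, int limit) {
        super(target);
        this.limit = limit;
    }

    /** Hypothetical query: would the JIT still optimize this site optimistically? */
    public boolean optimistic() {
        return invalidations < limit;
    }

    @Override
    public void setTarget(MethodHandle newTarget) {
        invalidations++;           // each retarget spends some of the budget
        super.setTarget(newTarget);
    }

    public static void main(String[] args) {
        MethodHandle id = MethodHandles.identity(Object.class);
        ThrottledCallSite cs = new ThrottledCallSite(id, 2);
        cs.setTarget(id);
        cs.setTarget(id);
        System.out.println(cs.optimistic()); // budget exhausted: prints false
    }
}
```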
src/share/vm/c1/c1_GraphBuilder.cpp
src/share/vm/ci/ciCallSite.cpp
src/share/vm/ci/ciCallSite.hpp
src/share/vm/classfile/javaClasses.cpp
src/share/vm/classfile/javaClasses.hpp
src/share/vm/classfile/systemDictionary.cpp
src/share/vm/classfile/vmSymbols.hpp
src/share/vm/code/codeCache.cpp
src/share/vm/code/codeCache.hpp
src/share/vm/code/dependencies.cpp
src/share/vm/code/dependencies.hpp
src/share/vm/memory/universe.cpp
src/share/vm/memory/universe.hpp
src/share/vm/oops/methodOop.cpp
src/share/vm/opto/callGenerator.cpp
src/share/vm/opto/memnode.cpp
src/share/vm/prims/jvmtiRedefineClasses.cpp
src/share/vm/prims/methodHandles.cpp
src/share/vm/runtime/globals.hpp
Rémi Forax
2011-09-18 15:44:24 UTC
Permalink
Hi Mark,
the throttle kicks in if you deoptimize an already-JITed call site 10 times.
If the call site is not JITed, you can call setTarget() without triggering
the throttle logic.

So the main problem arises if your program is really big and calls the same
hot code very often. Because the code is big, you will have to flush the
call sites, and with them the hot code, which will be recompiled right away
(because it is hot). After 10 such flushes you will see a performance
degradation because your hot code will no longer be JITed.

Rémi

Mark Roos
2011-09-18 16:49:20 UTC
Permalink
Hi Rémi

you mentioned
After 10 flushes, you will see
a performance degradation because your hot code will be
no more JITed.

But I also have to call setTarget on the call sites when the method code
changes. The easy (and most efficient) way is to reset them all. So after
10 replacement methods I lose all JIT optimizations? Forever?
This seems extreme to me.

Should the throttle be sensitive to the rate of change rather than an
absolute quantity? Or should there be a means to setTarget which does not
count towards throttling? Or (gasp) if I choose to do lots of setTargets,
maybe I should just take the hit myself, which would lead me to avoid the
issue if it matters to me.

Since I don't understand the use case where deep/changing chains are
common and so need to be fast, this seems premature to me.

I am still wondering what I would do if this is the case. Drop and
replace all methods? Build my own call site to somehow limit target
changes (a call site with a call site as its target)?

regards
mark
Mark Roos
2011-09-18 17:16:15 UTC
Permalink
In looking at my code.

In general, 98% of the call sites see fewer than 3 targets.

Those that are larger I can catch and use a different lookup for. I also
believe that Charles Nutter limits his depth to 5.

So the general case I see will stay under the limit of 10.

But I do have some cases where 10 is not the right number.

Part of my app is a compiler, so some of its call sites see 20 or 30
classes (AST node types). And part is a data-flow evaluator where the data
has about 30 types.

And as I watch the JIT work, it attacks these sites pretty fast, so I
could imagine exceeding the 10.

I would like both of these to be JITed, as they are used a lot.

And my final use case is dynamic method replacement. Here I have to
reset the call sites so they get the correct code. In our system this
happens during feature loading (small dynamic patches), when changing math
operations (dataflow), and finally during editing. These easily exceed 10
over the execution life of the app (hopefully months).

I would be fine if there were a way to tell the JIT to reset its counts
for a call site, as all of my use cases can absorb time at that instant. I
really would have issues with losing JIT compilation when the app is
expected to be running at full speed, and would have to find some clever
hack around it.

Another point is that the classes call sites see during startup can be
quite different from those seen during normal operation.
This is another reason we do the bulk invalidation.

mark
Rémi Forax
2011-09-18 23:19:08 UTC
Permalink
Post by Mark Roos
In looking at my code.
In general 98% of the callsites are < 3 targets.
Those that are larger I can catch and use a different lookup. I also
believe that Charles Nutter limits his
depth to 5.
So the general case I see will not exceed the < 10.
But I do have some cases where 10 is not the right number
Part of my app is a compiler so some of its callsites see 20
or 30 classes ( AST node types).
And part is a data flow evaluator where the data has about 30
types
For these call sites you should use a dispatch table (a vtable dedicated
to the call site) instead of a sequence of 20 guards, because such a guard
chain is equivalent to looking up a class in a linked list, which is
awfully slow.
Moreover, are you sure the code that contains these call sites is JITed?
It's pretty easy to hit other thresholds, for example the maximum number
of internal nodes (ideal nodes).
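Rémi's per-call-site dispatch table can be sketched with standard java.lang.invoke APIs along these lines. The DispatchTable class and its shape are my invention; a production version would sit behind an invokedynamic call site and avoid the boxing of invokeWithArguments:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// One hash lookup on the receiver's class instead of a chain of N
// guardWithTest guards: a small "vtable dedicated to a call site".
public class DispatchTable {
    private final Map<Class<?>, MethodHandle> table = new ConcurrentHashMap<>();
    private final MethodHandles.Lookup lookup;
    private final String name;
    private final MethodType type; // (receiver, args...) -> result, receiver erased to Object

    public DispatchTable(MethodHandles.Lookup lookup, String name, MethodType type) {
        this.lookup = lookup;
        this.name = name;
        this.type = type;
    }

    public Object invoke(Object receiver, Object... args) throws Throwable {
        // Resolve the target once per receiver class, then reuse it.
        MethodHandle mh = table.computeIfAbsent(receiver.getClass(), c -> {
            try {
                return lookup.findVirtual(c, name, type.dropParameterTypes(0, 1))
                             .asType(type);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        });
        Object[] all = new Object[args.length + 1];
        all[0] = receiver;
        System.arraycopy(args, 0, all, 1, args.length);
        return mh.invokeWithArguments(all);
    }
}
```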
Post by Mark Roos
And as I watch the jit work it attacks these sites pretty fast so I
could image exceeding the > 10.
I would like both or these to jit as they are used a lot
And my final use case which is dynamic method replacement. Here I
have to reset the callsites so they
get the correct code. In our system this happens doing feature
loading ( small dynamic patches) and for
changing math operations ( dataflow) and finally during edit. Both
easily exceed the >10 during the execution
life of the app (hopefully months).
I would be fine if there was a way to tell the jitter to reset its
counts on a callsite as all of my use cases
can absorb time at that instance. I really would have issues with
losing the jitting when the app is expected
to be running at full speed and would have to find some clever hack
around it.
Another point is that the classes callsites see during startup can be
quite different than during normal operation.
This is another reason we do the bulk invalidation
mark
Rémi
Christian Thalinger
2011-09-21 08:34:51 UTC
Permalink
From Remi
Moreover, are you sure the code that contains these callsites is JITed,
No, but then how would I know, and what would I do if I did know? If it's JITed
and I do the 10th (or whatever) invalidation in three months because of uploading a patch,
and because of that the app slows down until I reboot it, that is a problem for me.
Smalltalk is dynamic not just in its use of variables but in its ability to redo methods on
the fly (and create new classes on demand, etc.). Because of this I will need to invalidate
call sites in the cache. I really don't think it will be helpful if I have to drop the methods and
recompile the bytecodes to get 'fresh' call sites.
Maybe I could have a way to put the site back into 'bootstrap' mode, maybe even with
a SwitchPoint.
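The SwitchPoint idea is workable with the existing API: thread one SwitchPoint through many call sites, then invalidate it once to drop them all back to a slow path, instead of calling setTarget on each site. A minimal sketch (class name and the "fast"/"slow" handles are mine):

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.SwitchPoint;

// One bulk invalidation flips every handle guarded by the SwitchPoint
// from its fast path to its fallback.
public class BulkInvalidate {
    public static void main(String[] args) throws Throwable {
        MethodHandle fast = MethodHandles.constant(String.class, "fast");
        MethodHandle slow = MethodHandles.constant(String.class, "slow");

        SwitchPoint sp = new SwitchPoint();
        MethodHandle guarded = sp.guardWithTest(fast, slow);

        System.out.println((String) guarded.invokeExact());  // fast path
        SwitchPoint.invalidateAll(new SwitchPoint[] { sp }); // one bulk flip
        System.out.println((String) guarded.invokeExact());  // fallback
    }
}
```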
I think I can deal with depth control; I am not sure I can deal with a 'helpful' compiler.
I got your point. It seems we have to rework this patch a little to meet the requirements, and I'm pretty sure it's too late for 7u2 anyway. A rate instead of an absolute counter seems to be the way to go.

-- Christian
regards
mark
_______________________________________________
mlvm-dev mailing list
http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
Christian Thalinger
2011-09-19 07:57:45 UTC
Permalink
Post by Mark Roos
In looking at my code.
In general 98% of the callsites are < 3 targets.
Those that are larger I can catch and use a different lookup. I also believe that Charles Nutter limits his
depth to 5.
So the general case I see will not exceed the < 10.
But I do have some cases where 10 is not the right number
You can adjust that number in your runtime startup script since the switch that drives that number is a product switch:

-XX:PerCallSiteSetTargetTrapLimit=50
Post by Mark Roos
Part of my app is a compiler so some of its callsites see 20 or 30 classes ( AST node types).
And part is a data flow evaluator where the data has about 30 types
And as I watch the jit work it attacks these sites pretty fast so I could image exceeding the > 10.
I would like both or these to jit as they are used a lot
And my final use case which is dynamic method replacement. Here I have to reset the callsites so they
get the correct code. In our system this happens doing feature loading ( small dynamic patches) and for
changing math operations ( dataflow) and finally during edit. Both easily exceed the >10 during the execution
life of the app (hopefully months).
I would be fine if there was a way to tell the jitter to reset its counts on a callsite as all of my use cases
can absorb time at that instance. I really would have issues with losing the jitting when the app is expected
to be running at full speed and would have to find some clever hack around it.
Tom suggested in a pre-review of the patch that we could (a) have a rate instead of a counter or (b) only count actual deoptimizations (which might be a simple fix for now and decide later if that's good or we need something else).

As Remi mentioned we need throttling like this but we don't have enough data yet to tell what is the best approach. Mark, could you try that patch with your implementation?

-- Christian
Post by Mark Roos
Another point is that the classes callsites see during startup can be quite different than during normal operation.
This is another reason we do the bulk invalidation
mark
John Rose
2011-09-19 19:44:45 UTC
Permalink
Post by Christian Thalinger
Post by Mark Roos
In looking at my code.
In general 98% of the callsites are < 3 targets.
Those that are larger I can catch and use a different lookup. I also believe that Charles Nutter limits his
depth to 5.
So the general case I see will not exceed the < 10.
But I do have some cases where 10 is not the right number
-XX:PerCallSiteSetTargetTrapLimit=50
This is the right short-term answer to see how our technique will affect Mark's system.

In choosing heuristics like this it is best to start simple and then see whether it causes problems in practice. In addition, providing tuning knobs allows customers to perform experiments to see if their performance is affected by the heuristic.

If there are only a few "megamutable" call sites, it may be the case that their performance is best managed by special techniques, rather than by adjusting the heuristics that affect all the other call sites. In particular, indirect (non-inlined) calls through method handles are slow right now; I'm working on fixing this.

-- John
Mark Roos
2011-09-21 06:21:10 UTC
Permalink
Thanks but I do have a few questions that are not being answered.


First: is the compiler's decision not to JIT a call site reversible
without restarting the program? I am very worried that the mere act of
invalidating call sites to force them to use new code (something that
happens a lot in Smalltalk) will send my app to interpreter purgatory. I
really need a way to invalidate the sites (which right now can only be
done via setTarget, I think) without forcing them to never be JITed.

Second: Christian mentioned, "As Remi mentioned we need throttling like
this." What is the use case that is being fixed by this? It seems that I
can fix it on my end by watching the call depth (which I do), so why does
the compiler need to do it for me?

Third: I am sure there is a very cool vtable lookup in the low-level
code. How can I get to that? I am concerned that writing a high-level
vtable will be slow, at least slower than a few GWTs.
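For reference, a single link of the GWT chain Mark mentions can be built from MethodHandles.guardWithTest like this (the GwtCache class and helper names are invented for illustration):

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// One inline-cache link: test the receiver's class, take the cached
// target on a hit, fall through to the next link (or lookup) on a miss.
public class GwtCache {
    private static final MethodHandle CHECK;
    static {
        try {
            CHECK = MethodHandles.lookup().findStatic(GwtCache.class, "checkClass",
                    MethodType.methodType(boolean.class, Class.class, Object.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static boolean checkClass(Class<?> expected, Object receiver) {
        return receiver.getClass() == expected;
    }

    // hit and miss must share a type whose first parameter is the receiver.
    static MethodHandle cacheFor(Class<?> expected, MethodHandle hit, MethodHandle miss) {
        MethodHandle test = CHECK.bindTo(expected)
                .asType(MethodType.methodType(boolean.class, hit.type().parameterType(0)));
        return MethodHandles.guardWithTest(test, hit, miss);
    }
}
```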

Thanks

mark
Mark Roos
2011-09-21 06:21:10 UTC
Permalink
Hi Christian

You requested:
Mark, could you try that patch with your implementation?

I am at a conference this week, but I will set up a test case when I get
back.

regards

mark