This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Card mark steal #25986

Merged
merged 33 commits into master on Oct 23, 2019

Conversation

PeterSolMS

Implement card marking stealing for better work balance in Server GC.

One of the last stages in the mark_phase is to mark objects referenced from older generations. This stage is often slow compared to the other stages, and it is also often somewhat unbalanced, i.e. some GC threads finish their work significantly sooner than others. The change also applies to the relocate_phase, but that phase usually takes significantly less time.

This change implements thread-safe enumeration of the older generations by dividing them into chunks (256 kB on 64-bit, 128 kB on 32-bit), and arranges it so that threads finishing their work early will help on other heaps. Each thread grabs a chunk and then looks through the card table section corresponding to that chunk. When it's done with a chunk, it grabs the next one, and so on.

There are changes at multiple levels:

  • at the top level, mark_phase and relocate_phase contain changes to check for work already done for both the heap associated with the thread and other heaps.
  • these routines call mark_through_cards_for_segments and mark_through_cards_for_large_objects which contain code to walk through the older generations in chunks.
  • ultimately card_marking_enumerator::move_next implements the thread safe enumeration, supplying chunks, and gc_heap::find_next_chunk supplies a chunk where all card bits are set.
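The chunk handout at the core of this can be pictured as a shared atomic cursor: each thread claims the next chunk index with an interlocked increment, so no chunk is processed twice, and a thread that exhausts its own heap simply moves on to another heap's enumerator. A minimal sketch with hypothetical names (the real card_marking_enumerator also tracks segments and card words):

```cpp
#include <atomic>
#include <cstddef>

// Simplified model of the thread-safe chunk enumeration: the older
// generations are split into fixed-size chunks, and a shared atomic cursor
// hands out chunk indices to whichever GC thread asks next.
struct chunk_enumerator
{
    std::atomic<size_t> next_chunk{0};
    size_t chunk_count;

    explicit chunk_enumerator(size_t count) : chunk_count(count) {}

    // Claims the next unprocessed chunk. Returns false once the heap is
    // exhausted; safe to call concurrently from many threads.
    bool move_next(size_t& chunk)
    {
        size_t c = next_chunk.fetch_add(1, std::memory_order_relaxed);
        if (c >= chunk_count)
            return false;
        chunk = c;
        return true;
    }
};

// Counts how many chunks a caller can still claim (for demonstration only).
inline size_t drain(chunk_enumerator& e)
{
    size_t claimed = 0, chunk;
    while (e.move_next(chunk))
        claimed++;
    return claimed;
}
```

Once move_next hands out a chunk, the thread scans only the card table section for that chunk; fetch_add guarantees two threads can never claim the same index.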

…d_first_object when finding the start of objects for marking interior pointers.
…t in find_first_object when finding the start of objects for marking interior pointers."

This reverts commit 9d53ff9.
… but call its function pointer arg fn on another heap to keep things straight.

Example: thread 3 (associated with heap 3) helps out marking through cards on heap 5. The way I set things up, mark_through_cards_for_xxx will get heap 5 as its this pointer. And that's fine as far as comparisons against gc_low, gc_high, etc. go. However, it needs to call mark_object_simple with heap 3 as its this pointer, because otherwise multiple threads may use heap 5's mark_stack etc., which would cause trouble.

So, mark_through_cards_helper gets passed two heaps, one implicitly as the this pointer, the other ("hpt") explicitly - this is the heap associated with the gc thread.
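That two-heap split can be sketched as follows (types and names are illustrative, not the actual gc.cpp declarations): the scanned heap supplies the address bounds through this, while hpt supplies the per-thread state such as the mark stack:

```cpp
#include <cstdint>
#include <vector>

// Illustrative model: 'this' is the heap being scanned (e.g. heap 5), 'hpt'
// is the heap tied to the current GC thread (e.g. heap 3). Bounds checks use
// the scanned heap; mutable per-thread state comes from hpt, so two threads
// scanning the same heap never share a mark stack.
struct gc_heap
{
    uintptr_t gc_low;
    uintptr_t gc_high;
    std::vector<uintptr_t> mark_stack;  // per-thread state, owned by hpt

    void mark_through_cards_helper(uintptr_t* poo, gc_heap* hpt)
    {
        if ((gc_low <= *poo) && (gc_high > *poo))  // scanned heap's range
            hpt->mark_stack.push_back(*poo);       // thread's own mark stack
    }
};
```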
…f an object straddles a chunk boundary. Added stress log instrumentation for card and card bundle clearing
…t one card bundle bit for each chunk.

In card_bundle_clear, use Interlocked::And because now several threads may clear bits in the same card bundle dword.
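The clearing pattern looks roughly like this, with std::atomic's fetch_and standing in for Interlocked::And (a sketch, assuming 32-bit card bundle words):

```cpp
#include <atomic>
#include <cstdint>

// Clear one card-bundle bit with an atomic AND. A plain read-modify-write
// ('word &= ~mask') could lose a concurrent thread's clear of a different
// bit in the same dword; the atomic AND cannot.
inline void card_bundle_clear_bit(std::atomic<uint32_t>& bundle_word, int bit)
{
    bundle_word.fetch_and(~(1u << bit), std::memory_order_relaxed);
}
```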

In relocate_phase, move other relocations *before* relocation of older generations, so the latter can make up for imbalances in the former.

Fix issue where mark_through_cards_for_segments was setting bricks incorrectly, causing too much time to be spent in find_first_object. The fix is to distinguish the "next object" in the heap walk from the "continuation object" where we continue the scan.
@PeterSolMS PeterSolMS requested a review from Maoni0 August 2, 2019 13:59
if ((gc_low <= *poo) && (gc_high > *poo))
{
    n_gen++;
-   call_fn(fn) (poo THREAD_NUMBER_ARG);
+   call_fn(hpt,fn) (poo THREAD_NUMBER_ARG);
Member


> call_fn(hpt,fn) (poo THREAD_NUMBER_ARG);

for mark phase, this will call mark_object_simple with hpt's gc_low/gc_high instead of with the gc_low/gc_high of the heap it's marking. so in gc_mark the order of which heap's gc_low/gc_high it compares with first is now different

Author


Agreed, but is this a correctness problem? It's probably a small perf problem, I wonder how we can fix it...

Member


right, not a correctness problem. one could construct a scenario where the perf matters - for example, if you allocated a bunch of static objects on your worker threads in a fairly balanced fashion, they'd be on their cores' respective heaps; these objects then refer to temp objects created in the same fashion, which means gc_mark will find the right gc_low/gc_high on the first try. now, let's say some of the work we do before marking through cards happens to be very unbalanced, so a lot of stealing will occur during card marking, and gc_mark will have to go to other heaps.

currently I don't have better ideas than having mark_object_simple also take a gc_heap* arg (which is of course also cost).

Author


I would think that it would still be a net win - if thread 1 is fast and does part of thread 2's work, then thread 2 will finish faster even if thread 1 is less effective at it than thread 2 would be. We are essentially using time where thread 1 would be idle otherwise.

I don't have better ideas for a fix to this issue either.


Hi. As this feature just landed, I'm very excited about it but would also like to understand the changes.

Is it guaranteed that the following scenario won't happen?
Thread 1 finishes fast and starts working on thread 2's work.
Thread 2 had almost finished by the time thread 1 started to do its work.
Because thread 2 is now idle it starts doing thread 3's work, and so on...
Ultimately resulting in every thread working on someone else's work.

Thanks so much for all the new great features in .NET 5!

Member


it's totally fine if T2 finishes the end of T1's work and T3 finishes the end of T2's work. threads with a lot of work left will still be working on their respective heaps. only threads that are done with their own heaps would go steal other threads' work. does this make sense?

In the situation where there was a large byte array at the end of a segment, mark_through_cards_for_large_objects moved to the next segment, but the card_marking_enumerator was still in the previous segment and got stuck there.

The fix is to explicitly exhaust the current segment in mark_through_cards_xxx before moving on to the next one. This keeps the card_marking_enumerator and its callers in mark_through_cards_xxx in sync.
Make sure we can optionally use STRESS_LOG for situations where that's advantageous.
As card_word_end is updated by find_next_chunk, moving the call to find_next_chunk into card_transition means card_word_end needs to be a reference parameter in card_transition; otherwise card_word_end in mark_through_cards_xxx would not get updated.
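The cursor-update shape described here can be reduced to a few lines (logic heavily simplified; only the by-reference parameter is the point):

```cpp
#include <cstddef>

// Sketch: card_transition claims a new chunk when the scan reaches the end of
// the current window. Because it advances card_word_end, the parameter must be
// a reference, or the caller's loop in mark_through_cards_xxx would keep
// scanning against a stale end value.
inline bool card_transition(size_t card_word, size_t& card_word_end, size_t chunk_words)
{
    if (card_word >= card_word_end)
    {
        // stand-in for find_next_chunk: extend the window by one chunk
        card_word_end += chunk_words;
        return true;  // transitioned to a new chunk
    }
    return false;
}
```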
@Maoni0
Member

thanks for making these changes, Peter!

I'm wondering about perf since CARD_MARKING_STEALING_GRANULARITY is pretty small. I do see an optimization opportunity in finding the next set of cards, since the chunk index we get can only increase. you could imagine that instead of finding the next chunk on the segment and then calling find_card (in some scenarios only to find out there are no set cards for that chunk), you could just get the next set card and get its chunk index. other threads may clear cards, but the one that clears cards is guaranteed to see all the set ones to begin with. but of course this is more complicated.

do we have an idea how much overhead stealing adds, especially in scenarios where set cards are sparse? if you haven't, you could do this perf measurement with workstation GC which makes it easier.
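The alternative outlined above - jump straight to the next set card and derive its chunk index, rather than claiming chunks and then searching them for set cards - might look roughly like this (illustrative names; cards_per_chunk would follow from the stealing granularity):

```cpp
#include <cstddef>
#include <cstdint>

// Returns the index of the next set card at or after 'start', scanning an
// array of 32-bit card words, or SIZE_MAX if no card is set. A thread could
// then claim the chunk containing that card instead of probing empty chunks.
inline size_t find_next_set_card(const uint32_t* card_words, size_t word_count, size_t start)
{
    for (size_t card = start; card < word_count * 32; card++)
    {
        if (card_words[card / 32] & (1u << (card % 32)))
            return card;
    }
    return SIZE_MAX;
}

inline size_t chunk_of_card(size_t card, size_t cards_per_chunk)
{
    return card / cards_per_chunk;
}
```

This matches the observation that the chunk index can only increase, so a monotonically advancing start is enough; a real version would scan word-at-a-time rather than bit-at-a-time.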

@PeterSolMS
Author

I played around a bit and found that it's pretty easy to construct an example where card marking becomes significantly slower (about 2x), with almost a 2x impact on pause time. This happens when you have a large gen2/LOH with a very low density of set cards, so that the impact of getting the next chunk actually matters a lot compared to the actual marking. I used 1 pointer for every 8 MB of memory for this extreme example, so there's nothing to do for most chunks. For a less extreme example I used 1 pointer for every 80 kB of memory (about 3 pointers per 256 kB chunk), and got about 30% impact on card marking and 20% impact on pause time. So, this means that there are good reasons not to enable this on workstation GC, and there may be scenarios on server GC where this change has a negative effect. But these scenarios are likely to be rare.

I am a bit wary of the more complicated scheme you outline above. But it would be nice to avoid the somewhat arbitrary CARD_MARKING_STEALING_GRANULARITY parameter, or make it auto-tuning in some fashion.

I found there is significant overhead to enabling FEATURE_CARD_MARKING_STEALING for workstation GC in the case of low card density, and of course there is no benefit, so it makes sense to enable it only for server GC where the additional overhead will be made up by better work balancing in most cases.
I fixed an off-by-one issue in find_card, and when I switched workstation GC back to no card marking stealing, this assert at the end of card_transition fired in the case where limit, end and card_address(end_card) all coincide, because we access the card for the end address which we really shouldn't look at.

The fix is simply to compare limit to end instead of card_address(end_card).
@Maoni0
Member

there may be scenarios on server GC where this change has a negative effect. But these scenarios are likely to be rare.

I wonder if they are that rare. the reason why we have card bundles is because of sparsely set cards, so that means at least we saw scenarios that warranted adding a new mechanism for it. if you have a 10GB heap, you could have pretty large regions where there are no set cards. 256k seems so small. how big of a heap were you testing with? many of the asp.net benchmarks have very tiny heaps.

@PeterSolMS
Author

You make a good point. You are right that most asp.net benchmarks have tiny heaps, and that 256 kB is quite small. I wouldn't be opposed to making the granularity 1 MB or even 8 MB (8 MB helped a lot with my synthetic example with a ~10 GB heap, but 1 MB helped more), but 8 MB would mean that the optimization would help only the scenarios with larger heaps. Perhaps we should make the granularity a fraction of the gen 2/loh heap size, within reasonable limits?

Another idea would be to scan the cards and card bundles ahead of time on a single thread, and construct a list of chunks that doesn't contain the large uninteresting regions anymore. That makes things more complicated though and I'm not sure it's worthwhile. Perhaps there is a cleverer idea around the corner here?
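The fraction-of-heap-size idea floated above could be sketched as a clamped computation (purely hypothetical - the PR ultimately picks a fixed 2 MB instead, and the divisor and bounds here are invented for illustration):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical auto-tuning: granularity grows with the gen2+LOH size, clamped
// so small heaps keep fine-grained stealing and huge sparse heaps pay less
// chunk-claiming overhead. All constants are illustrative.
inline size_t stealing_granularity(size_t older_gen_size)
{
    const size_t min_chunk = size_t(2) * 1024 * 1024;   // 2 MB floor
    const size_t max_chunk = size_t(64) * 1024 * 1024;  // 64 MB ceiling
    return std::min(max_chunk, std::max(min_chunk, older_gen_size / 256));
}
```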

Pass card_word_end as a value parameter to card_transition; when FEATURE_CARD_MARKING_STEALING is enabled, also pass it as a ref parameter. This gets rid of a perf regression in workstation GC.

Set CARD_MARKING_STEALING_GRANULARITY to a higher value - picked 2 MB on 64-bit (1 MB on 32-bit). Perhaps this is a reasonable compromise.
@PeterSolMS
Author

I increased the granularity to 2 MB for now. This should address the concern about the case of sparsely set cards. Let me know if you think of something better...

@PeterSolMS PeterSolMS merged commit 5ca444c into dotnet:master Oct 23, 2019
MichalStrehovsky pushed a commit to MichalStrehovsky/corert that referenced this pull request Mar 28, 2020
Implement card marking stealing for better work balance in Server GC.

One of the last stages in the mark_phase is to mark objects referenced from older generations. This stage is often slow compared to the other stages, and it is also often somewhat unbalanced, i.e. some GC threads finish their work significantly sooner than others. The change also applies to the relocate_phase, but that phase usually takes significantly less time.

This change implements thread-safe enumeration of older generations by dividing them into chunks (2 MB in 64-bits, 1 MB in 32-bits), and arranges it so threads finishing work early will help on other heaps. Each thread grabs a chunk and then looks through the card table section corresponding to this chunk. When it's done with a chunk, it grabs the next one and so on.
There are changes at multiple levels:

- at the top level, mark_phase and relocate_phase contain changes to check for work already done for both the heap associated with the thread and other heaps.
- these routines call mark_through_cards_for_segments and mark_through_cards_for_large_objects which contain code to walk through the older generations in chunks.
- ultimately card_marking_enumerator::move_next implements the thread safe enumeration, supplying chunks, and gc_heap::find_next_chunk supplies a chunk where all card bits are set.

Commit migrated from dotnet/coreclr@5ca444c
MichalStrehovsky pushed a commit to MichalStrehovsky/corert that referenced this pull request Mar 31, 2020
jkotas pushed a commit to dotnet/corert that referenced this pull request Apr 1, 2020