This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Card mark steal #25986

Merged
merged 33 commits into master on Oct 23, 2019

Conversation

PeterSolMS

Implement card marking stealing for better work balance in Server GC.

One of the last stages in the mark_phase is to mark objects referenced from older generations. This stage is often slow compared to the other stages, and it is also often somewhat unbalanced, i.e. some GC threads finish their work significantly sooner than others. The change also applies to the relocate_phase, but that phase usually takes significantly less time.

This change implements thread-safe enumeration of the older generations by dividing them into chunks (256 kB on 64-bit, 128 kB on 32-bit), and arranges it so that threads finishing their work early will help on other heaps. Each thread grabs a chunk and then looks through the card table section corresponding to that chunk. When it's done with a chunk, it grabs the next one, and so on.

There are changes at multiple levels:

  • at the top level, mark_phase and relocate_phase contain changes to check for work already done for both the heap associated with the thread and other heaps.
  • these routines call mark_through_cards_for_segments and mark_through_cards_for_large_objects which contain code to walk through the older generations in chunks.
  • ultimately card_marking_enumerator::move_next implements the thread safe enumeration, supplying chunks, and gc_heap::find_next_chunk supplies a chunk where all card bits are set.
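The chunk handout at the core of this can be pictured as a shared atomic cursor: each thread claims the next chunk index with an interlocked increment, so no chunk is processed twice, and a thread that exhausts its own heap simply moves on to another heap's enumerator. A minimal sketch with hypothetical names (the real card_marking_enumerator also tracks segments and card words):

```cpp
#include <atomic>
#include <cstddef>

// Simplified model of the thread-safe chunk enumeration: the older
// generations are split into fixed-size chunks, and a shared atomic cursor
// hands out chunk indices to whichever GC thread asks next.
struct chunk_enumerator
{
    std::atomic<size_t> next_chunk{0};
    size_t chunk_count;

    explicit chunk_enumerator(size_t count) : chunk_count(count) {}

    // Claims the next unprocessed chunk. Returns false once the heap is
    // exhausted; safe to call concurrently from many threads.
    bool move_next(size_t& chunk)
    {
        size_t c = next_chunk.fetch_add(1, std::memory_order_relaxed);
        if (c >= chunk_count)
            return false;
        chunk = c;
        return true;
    }
};

// Counts how many chunks a caller can still claim (for demonstration only).
inline size_t drain(chunk_enumerator& e)
{
    size_t claimed = 0, chunk;
    while (e.move_next(chunk))
        claimed++;
    return claimed;
}
```

Once move_next hands out a chunk, the thread scans only the card table section for that chunk; fetch_add guarantees two threads can never claim the same index.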

…d_first_object when finding the start of objects for marking interior pointers.
…t in find_first_object when finding the start of objects for marking interior pointers."

This reverts commit 9d53ff9.
… but call its function pointer arg fn on another heap to keep things straight.

Example: thread 3 (associated with heap 3) helps out marking through cards on heap 5. The way I set things up, mark_through_cards_for_xxx will get heap 5 as its this pointer. And that's fine as far as comparisons against gc_low, gc_high, etc. go. However, it needs to call mark_object_simple with heap 3 as its this pointer, because otherwise multiple threads may use heap 5's mark_stack etc., which would cause trouble.

So, mark_through_cards_helper gets passed two heaps, one implicitly as the this pointer, the other ("hpt") explicitly - this is the heap associated with the gc thread.
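That two-heap split can be sketched as follows (types and names are illustrative, not the actual gc.cpp declarations): the scanned heap supplies the address bounds through this, while hpt supplies the per-thread state such as the mark stack:

```cpp
#include <cstdint>
#include <vector>

// Illustrative model: 'this' is the heap being scanned (e.g. heap 5), 'hpt'
// is the heap tied to the current GC thread (e.g. heap 3). Bounds checks use
// the scanned heap; mutable per-thread state comes from hpt, so two threads
// scanning the same heap never share a mark stack.
struct gc_heap
{
    uintptr_t gc_low;
    uintptr_t gc_high;
    std::vector<uintptr_t> mark_stack;  // per-thread state, owned by hpt

    void mark_through_cards_helper(uintptr_t* poo, gc_heap* hpt)
    {
        if ((gc_low <= *poo) && (gc_high > *poo))  // scanned heap's range
            hpt->mark_stack.push_back(*poo);       // thread's own mark stack
    }
};
```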
…f an object straddles a chunk boundary. Added stress log instrumentation for card and card bundle clearing
…t one card bundle bit for each chunk.

In card_bundle_clear, use Interlocked::And because now several threads may clear bits in the same card bundle dword.
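The clearing pattern looks roughly like this, with std::atomic's fetch_and standing in for Interlocked::And (a sketch, assuming 32-bit card bundle words):

```cpp
#include <atomic>
#include <cstdint>

// Clear one card-bundle bit with an atomic AND. A plain read-modify-write
// ('word &= ~mask') could lose a concurrent thread's clear of a different
// bit in the same dword; the atomic AND cannot.
inline void card_bundle_clear_bit(std::atomic<uint32_t>& bundle_word, int bit)
{
    bundle_word.fetch_and(~(1u << bit), std::memory_order_relaxed);
}
```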

In relocate_phase, move other relocations *before* relocation of older generations, so the latter can make up for imbalances in the former.

Fix issue where mark_through_cards_for_segments was setting bricks incorrectly, causing too much time to be spent in find_first_object. The fix is to distinguish the "next object" in the heap walk from the "continuation object" where we continue the scan.
@PeterSolMS PeterSolMS requested a review from Maoni0 August 2, 2019 13:59
if ((gc_low <= *poo) && (gc_high > *poo))
{
    n_gen++;
-   call_fn(fn) (poo THREAD_NUMBER_ARG);
+   call_fn(hpt,fn) (poo THREAD_NUMBER_ARG);
Member


> call_fn(hpt,fn) (poo THREAD_NUMBER_ARG);

for mark phase, this will call mark_object_simple with hpt's gc_low/gc_high instead of with the gc_low/gc_high of the heap it's marking. so in gc_mark the order of which heap's gc_low/gc_high it compares with first is now different

Author


Agreed, but is this a correctness problem? It's probably a small perf problem, I wonder how we can fix it...

Member


right, not a correctness problem. one could construct a scenario where the perf matters - for example, if you allocated a bunch of static objects on your worker threads in a fairly balanced fashion, they'd be on their cores' respective heaps; these objects then refer to temp objects created in the same fashion, which means gc_mark will find the right gc_low/gc_high on the first try. now, let's say some of the work we do before marking through cards happens to be very unbalanced, so a lot of stealing will occur during card marking, and gc_mark will have to go to other heaps.

currently I don't have better ideas than having mark_object_simple also take a gc_heap* arg (which is of course also cost).

Author


I would think that it would still be a net win - if thread 1 is fast and does part of thread 2's work, then thread 2 will finish faster even if thread 1 is less effective at it than thread 2 would be. We are essentially using time where thread 1 would be idle otherwise.

I don't have better ideas for a fix to this issue either.


Hi. As this feature just landed, I'm very excited about it but would also like to understand the changes.

Is it guaranteed that the following scenario won't happen?
Thread 1 finishes fast and starts working on thread 2's work.
Thread 2 had almost finished by the time thread 1 started to do its work.
Because thread 2 is now idle it starts doing thread 3's work, and so on...
Ultimately resulting in every thread working on someone else's work.

Thanks so much for all the new great features in .NET 5!

Member


it's totally fine if T2 finishes the end of T1's work and T3 finishes the end of T2's work. threads with a lot of work left will still be working on their respective heaps. only threads that are done with their own heaps would go steal other threads' work. does this make sense?

In the situation where there was a large byte array at the end of a segment, mark_through_cards_for_large_objects moved to the next segment, but the card_marking_enumerator was still in the previous segment and got stuck there.

The fix is to explicitly exhaust the current segment in mark_through_cards_xxx before moving on to the next one. This keeps the card_marking_enumerator and its callers in mark_through_cards_xxx in sync.
Make sure we can optionally use STRESS_LOG for situations where that's advantageous.
As card_word_end is updated by find_next_chunk, moving the call to find_next_chunk into card_transition means card_word_end needs to be a reference parameter in card_transition; otherwise card_word_end in mark_through_cards_xxx would not get updated.
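The cursor-update shape described here can be reduced to a few lines (logic heavily simplified; only the by-reference parameter is the point):

```cpp
#include <cstddef>

// Sketch: card_transition claims a new chunk when the scan reaches the end of
// the current window. Because it advances card_word_end, the parameter must be
// a reference, or the caller's loop in mark_through_cards_xxx would keep
// scanning against a stale end value.
inline bool card_transition(size_t card_word, size_t& card_word_end, size_t chunk_words)
{
    if (card_word >= card_word_end)
    {
        // stand-in for find_next_chunk: extend the window by one chunk
        card_word_end += chunk_words;
        return true;  // transitioned to a new chunk
    }
    return false;
}
```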
@Maoni0
Member

thanks for making these changes, Peter!

I'm wondering about perf since CARD_MARKING_STEALING_GRANULARITY is pretty small. I do see an optimization opportunity in finding the next set of cards, since the chunk index we get can only increase. you could imagine that instead of finding the next chunk on the segment and then calling find_card (in some scenarios only to find out there are no set cards for that chunk), you could just get the next set card and get its chunk index. other threads may clear cards, but the one that clears cards is guaranteed to see all the set ones to begin with. but of course this is more complicated.

do we have an idea how much overhead stealing adds, especially in scenarios where set cards are sparse? if you haven't, you could do this perf measurement with workstation GC which makes it easier.
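The alternative outlined above - jump straight to the next set card and derive its chunk index, rather than claiming chunks and then searching them for set cards - might look roughly like this (illustrative names; cards_per_chunk would follow from the stealing granularity):

```cpp
#include <cstddef>
#include <cstdint>

// Returns the index of the next set card at or after 'start', scanning an
// array of 32-bit card words, or SIZE_MAX if no card is set. A thread could
// then claim the chunk containing that card instead of probing empty chunks.
inline size_t find_next_set_card(const uint32_t* card_words, size_t word_count, size_t start)
{
    for (size_t card = start; card < word_count * 32; card++)
    {
        if (card_words[card / 32] & (1u << (card % 32)))
            return card;
    }
    return SIZE_MAX;
}

inline size_t chunk_of_card(size_t card, size_t cards_per_chunk)
{
    return card / cards_per_chunk;
}
```

This matches the observation that the chunk index can only increase, so a monotonically advancing start is enough; a real version would scan word-at-a-time rather than bit-at-a-time.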

@PeterSolMS
Author

I played around a bit and found that it's pretty easy to construct an example where card marking becomes significantly slower (about 2x), with almost a 2x impact on pause time. This happens when you have a large gen2/LOH with a very low density of set cards, so that the impact of getting the next chunk actually matters a lot compared to the actual marking. I used 1 pointer for every 8 MB of memory for this extreme example, so there's nothing to do for most chunks. For a less extreme example I used 1 pointer for every 80 kB of memory (about 3 pointers per 256 kB chunk), and got about 30% impact on card marking and 20% impact on pause time. So, this means that there are good reasons not to enable this on workstation GC, and there may be scenarios on server GC where this change has a negative effect. But these scenarios are likely to be rare.

I am a bit wary of the more complicated scheme you outline above. But it would be nice to avoid the somewhat arbitrary CARD_MARKING_STEALING_GRANULARITY parameter, or make it auto-tuning in some fashion.

I found there is significant overhead to enabling FEATURE_CARD_MARKING_STEALING for workstation GC in the case of low card density, and of course there is no benefit, so it makes sense to enable it only for server GC where the additional overhead will be made up by better work balancing in most cases.
I fixed an off-by-one issue in find_card, and when I switched workstation GC back to no card marking stealing, this assert at the end of card_transition fired in the case where limit, end and card_address(end_card) all coincide, because we access the card for the end address which we really shouldn't look at.

The fix is simply to compare limit to end instead of card_address(end_card).
@Maoni0
Member

there may be scenarios on server GC where this change has a negative effect. But these scenarios are likely to be rare.

I wonder if they are that rare. the reason why we have card bundles is because of sparsely set cards, so that means at least we saw scenarios that warranted adding a new mechanism for it. if you have a 10GB heap, you could have pretty large regions where there are no set cards. 256k seems so small. how big of a heap were you testing with? many of the asp.net benchmarks have very tiny heaps.

@PeterSolMS
Author

You make a good point. You are right that most asp.net benchmarks have tiny heaps, and that 256 kB is quite small. I wouldn't be opposed to making the granularity 1 MB or even 8 MB (8 MB helped a lot with my synthetic example with a ~10 GB heap, but 1 MB helped more), but 8 MB would mean that the optimization would help only the scenarios with larger heaps. Perhaps we should make the granularity a fraction of the gen 2/loh heap size, within reasonable limits?

Another idea would be to scan the cards and card bundles ahead of time on a single thread, and construct a list of chunks that doesn't contain the large uninteresting regions anymore. That makes things more complicated though and I'm not sure it's worthwhile. Perhaps there is a cleverer idea around the corner here?
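The fraction-of-heap-size idea floated above could be sketched as a clamped computation (purely hypothetical - the PR ultimately picks a fixed 2 MB instead, and the divisor and bounds here are invented for illustration):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical auto-tuning: granularity grows with the gen2+LOH size, clamped
// so small heaps keep fine-grained stealing and huge sparse heaps pay less
// chunk-claiming overhead. All constants are illustrative.
inline size_t stealing_granularity(size_t older_gen_size)
{
    const size_t min_chunk = size_t(2) * 1024 * 1024;   // 2 MB floor
    const size_t max_chunk = size_t(64) * 1024 * 1024;  // 64 MB ceiling
    return std::min(max_chunk, std::max(min_chunk, older_gen_size / 256));
}
```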

Pass card_word_end as a value parameter to card_transition; when FEATURE_CARD_MARKING_STEALING is enabled, also pass it as a ref parameter. This gets rid of a perf regression in workstation GC.

Set CARD_MARKING_STEALING_GRANULARITY to a higher value - picked 2 MB on 64-bit (1 MB on 32-bit). Perhaps this is a reasonable compromise.
@PeterSolMS
Author

I increased the granularity to 2 MB for now. This should address the concern about the case of sparsely set cards. Let me know if you think of something better...

@PeterSolMS PeterSolMS merged commit 5ca444c into dotnet:master Oct 23, 2019
MichalStrehovsky pushed a commit to MichalStrehovsky/corert that referenced this pull request Mar 28, 2020
Implement card marking stealing for better work balance in Server GC.

One of the last stages in the mark_phase is to mark objects referenced from older generations. This stage is often slow compared to the other stages, and it is also often somewhat unbalanced, i.e. some GC threads finish their work significantly sooner than others. The change also applies to the relocate_phase, but that phase usually takes significantly less time.

This change implements thread-safe enumeration of older generations by dividing them into chunks (2 MB in 64-bits, 1 MB in 32-bits), and arranges it so threads finishing work early will help on other heaps. Each thread grabs a chunk and then looks through the card table section corresponding to this chunk. When it's done with a chunk, it grabs the next one and so on.
There are changes at multiple levels:

- at the top level, mark_phase and relocate_phase contain changes to check for work already done for both the heap associated with the thread and other heaps.
- these routines call mark_through_cards_for_segments and mark_through_cards_for_large_objects which contain code to walk through the older generations in chunks.
- ultimately card_marking_enumerator::move_next implements the thread safe enumeration, supplying chunks, and gc_heap::find_next_chunk supplies a chunk where all card bits are set.

Commit migrated from dotnet/coreclr@5ca444c
MichalStrehovsky pushed a commit to MichalStrehovsky/corert that referenced this pull request Mar 31, 2020
jkotas pushed a commit to dotnet/corert that referenced this pull request Apr 1, 2020