
Use xmm for stack prolog zeroing rather than rep stos #32538

Merged: 17 commits into dotnet:master from benaadams:zero-init-xmm, Mar 5, 2020

Conversation

@benaadams (Member) commented Feb 19, 2020:

Use xmm registers to clear prolog blocks rather than rep stosd.

rep stos does have a shallower growth rate, so it will eventually overtake; however, that is unlikely within the size range used for stack clearance:

[chart: zeroing cost, rep stos vs xmm, across block sizes]

Also, its variability under small changes in stack size is problematic: just adding an extra variable to a method (with no other changes) could add or remove 9ns from its running time. The xmm clear in this PR is much smoother, with more predictable changes.

Focusing on the more common <= 256 byte range:

[chart: zeroing cost for block sizes <= 256 bytes]

And throughput:

[chart: zeroing throughput]

As the prolog doesn't support jmp labels, I have taught the x86/x64 JIT to understand relative jmps by instruction count (as per Arm) for the prolog.

e.g. the output prolog for a method with 1024 bytes of stack to clear:

G_M1845_IG01:
       push     rdi
       push     rsi
       sub      rsp, 0x408
       mov      rax, -0x3F0
       vxorps   xmm4, xmm4
       vmovups  xmmword ptr [rsp], xmm4
       vmovups  xmmword ptr [rsp+rax+400H], xmm4
       vmovups  xmmword ptr [rsp+rax+410H], xmm4
       vmovups  xmmword ptr [rsp+rax+420H], xmm4
       add      rax, 48
       jne      SHORT  -5 instr
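Reading the loop: rax starts at -0x3F0 (-1008), each iteration stores three 16-byte xmm values and adds 48, and the jne falls through once rax reaches zero, i.e. after 21 iterations; together with the initial store that clears the full 1024 bytes. A minimal C++ sketch of the same pattern (my illustration using SSE intrinsics, not JIT output; assumes size is 16 plus a multiple of 48):

#include <immintrin.h>
#include <cassert>
#include <cstddef>

// Clear `size` bytes at `p`, where size == 16 + 48 * n. One straight-line
// store covers the first 16 bytes; the loop then walks a negative offset up
// to zero, so the add's flags drive the backwards jne with no separate compare.
void zeroFrame(unsigned char* p, std::ptrdiff_t size)
{
    assert(size > 16 && (size - 16) % 48 == 0);
    const __m128i zero = _mm_setzero_si128();
    _mm_storeu_si128(reinterpret_cast<__m128i*>(p), zero); // vmovups [rsp], xmm4

    unsigned char* end = p + size;       // loop addresses are end-relative
    std::ptrdiff_t offset = 16 - size;   // e.g. -0x3F0 for a 1024-byte clear
    do
    {
        _mm_storeu_si128(reinterpret_cast<__m128i*>(end + offset), zero);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(end + offset + 16), zero);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(end + offset + 32), zero);
        offset += 48;
    } while (offset != 0);               // jne SHORT -5 instr
}

Driving the loop off a negative offset means the single add both advances the offset and sets the zero flag for the jne, which is what makes the jump-by-instruction-count trick sufficient here.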

Switched to movups rather than movdqu, as its legacy SSE encoding (0F 10) is one byte shorter than movdqu's (F3 0F 6F) with the same functionality, reducing the code size regressions.

Top file regressions (bytes):
        2195 : System.Private.CoreLib.dasm (0.07% of base)

Contributes to #8890
Resolves #32396

@Dotnet-GitSync-Bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Feb 19, 2020
@benaadams (Member, Author) commented Feb 19, 2020:

Example change from ValueTuple`6:Equals(Object):bool:this

    sub      rsp, 88
-   mov      rsi, rcx
-   lea      rdi, [rsp+20H]
-   mov      ecx, 12
-   xor      rax, rax
-   rep stosd 
-   mov      rcx, rsi
+   xorps    xmm4, xmm4
+   movdqu   xmmword ptr [rsp+20H], xmm4
+   movdqu   xmmword ptr [rsp+30H], xmm4
+   movdqu   xmmword ptr [rsp+40H], xmm4
    mov      qword ptr [rsp+50H], rdx

CustomAttributeData:Init(DllImportAttribute):this

    sub      rsp, 304
-   mov      rsi, rcx
-   lea      rdi, [rsp+40H]
-   mov      ecx, 60
-   xor      rax, rax
-   rep stosd 
-   mov      rcx, rsi
+   xorps    xmm4, xmm4
+   mov      rax, -240
+   movups   xmmword ptr [rsp+rax+130H], xmm4
+   movups   xmmword ptr [rsp+rax+140H], xmm4
+   movups   xmmword ptr [rsp+rax+150H], xmm4
+   add      rax, 48
+   jne      SHORT  -5 instr
    mov      rbp, rcx

Though this is still within the amount of xmm zeroing performed by clang; see #32442 (comment).

// As we output multiple instructions for SIMD zeroing, we want to balance
// code size with throughput, so cap the max size at 16 * the SIMD register length
int initMaxSIMDSize = 16 * XMM_REGSIZE_BYTES;
if ((untrLclHi - untrLclLo) <= initMaxSIMDSize)
A reviewer (Member) commented:

There may be locals in this region that do not need to be zero-initialized. Does it add up to anything significant? We may be able to do better by skipping zero-initialization for them.

@benaadams (Author) replied:

That is handled by the else branch, when genUseBlockInit is false, where it zeros out just the GC fields (not sure of the heuristics that determine it); it starts just after the shown GitHub diff, at else if (genInitStkLclCnt > 0).

The reviewer (Member):

E.g. it may be worth building a bit-mask of what needs to be zero-initialized, and then going over the bit-mask to do the zero-init.
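For illustration, the suggestion might look something like this sketch (hypothetical names, with a printf standing in for the instruction emitter; not the JIT's actual data structures):

#include <cstdint>
#include <cstdio>

// Hypothetical: bit i set in slotMask means the 8-byte slot at
// baseOffset + i*8 must be zero-initialized. Walk runs of set bits and emit
// xmm stores for 16-byte chunks, plus a gpr store for an odd trailing slot.
void emitZeroInitFromMask(uint64_t slotMask, int baseOffset)
{
    int i = 0;
    while (i < 64)
    {
        if ((slotMask & (1ull << i)) == 0) { i++; continue; }
        int runStart = i;
        while (i < 64 && (slotMask & (1ull << i)) != 0) i++;
        int offset = baseOffset + runStart * 8;
        int bytes  = (i - runStart) * 8;
        for (; bytes >= 16; bytes -= 16, offset += 16)
            printf("vmovups  xmmword ptr [rsp+%02XH], xmm4\n", (unsigned)offset);
        if (bytes != 0)
            printf("mov      qword ptr [rsp+%02XH], rax\n", (unsigned)offset);
    }
}

Slots that never need zeroing simply fall outside any run, so they cost nothing.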

@benaadams (Author):

We do have an issue open for double zeroing: #2325.

@benaadams (Author):

Could we use the same logic the genInitStkLclCnt else block uses, to drop any xmm blocks that don't have anything to zero? (In a follow-up.)

@benaadams changed the title from "Use xmm for prolog zeroing <= 128 bytes" to "Use xmm for prolog zeroing <= 256 bytes" Feb 19, 2020
@AndyAyersMS (Member):

cc @erozenfeld

Do we need to go all the way to 256 bytes for methods in Kestrel?

It would be good to highlight the methods that need a lot of prolog zeroing so we can see if other parts of the heuristic need reworking, or whether other changes -- either to the sources or to the JIT -- might remove the need for this much zeroing.

Can you look at code size impact outside of corelib (say via -f to jit-diff)?

@tannergooding (Member):

Do we need to go all the way to 256 bytes for methods in Kestrel?

256 bytes looks to be what Clang/LLVM do by default (or 512 bytes for AVX2): https://godbolt.org/z/geQ9u9
160 bytes looks to be what MSVC does by default: https://godbolt.org/z/dphgZN
Both fall back to calling memset (rather than using rep stosb/w/d) above their cutoffs.
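In essence, the godbolt comparisons reduce to a function like this (my reduction; use() is just a sink so the buffer isn't optimized away):

extern void use(char*);

// A local the compiler must zero-initialize. Under Clang, below its cutoff
// this becomes unrolled xmm (or ymm) stores; above it, a call to memset.
void frame()
{
    char buf[256] = {};  // zero-initialized local
    use(buf);
}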

Perhaps this is one of the cases where we could generate the larger code unless the user has explicitly opted for "small code generation"?

@tannergooding (Member):

When optimizing for size, MSVC (-O1) looks to drop the cutoff to 64 bytes, while Clang (-Os) drops the cutoff to 128 bytes (XMM) or 256 bytes (YMM).

@benaadams (Member, Author):

When optimizing for size...

Is there an easy way to check if code is being produced for R2R? We could have lower limits for that.

@AndyAyersMS (Member):

Jitted code will have more zeroing than typical C/C++ code so a more conservative cutover may be appropriate.

Is there an easy way to check if code is being produced for R2R? Could have lower limits for that?

Yes, IsReadyToRun. But you would be better off checking opts.compCodeOpt; the goal can vary by method.
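A check along those lines might look like the following sketch (the enum and names are stand-ins for opts.compCodeOpt and its values, not the JIT's actual declarations; the 64/128 byte figures borrow the MSVC/Clang cut-offs quoted above):

// Hypothetical: pick the SIMD zeroing cut-over from the per-method
// optimization goal that opts.compCodeOpt expresses.
enum class CodeOpt { Blended, Small, Fast };

constexpr int kXmmRegSizeBytes = 16;

constexpr int simdZeroCutover(CodeOpt goal)
{
    // 64 bytes when optimizing for size (MSVC's -O1 figure), 128 otherwise
    return (goal == CodeOpt::Small) ? 4 * kXmmRegSizeBytes : 8 * kXmmRegSizeBytes;
}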

@jkotas (Member):

Is there an easy way to check if code is being produced for R2R? Could have lower limits for that?

We have important customers that are not using tiered-JIT and care about code quality. I do not think we should be changing the cut-off for R2R.

@benaadams (Member, Author) commented Feb 19, 2020:

As @tannergooding mentions, 16 SIMD instructions is the clang switch-over point, so either 16x xmm (256 bytes) or 16x ymm (512 bytes). Intel suggests 128 bytes as the switch-over for rep stos if the data is 32-byte aligned; it not being aligned here is probably adding to the problems.

Do we need to go all the way to 256 bytes for methods in Kestrel?

The Kestrel method I was looking at https://gist.github.com/benaadams/0a44b0cb1c0aab57d6525a0eedc0672f is 10x xmm

Even though it is high, hopefully it is infrequent?

@benaadams (Member, Author):

Rebased to kick-start CI.

@BruceForstall (Member):

cc @dotnet/jit-contrib

@AndyAyersMS (Member):

In TakeMessageHeaders the jit will zero 168 bytes, so yes, 10 xmm stores and one gpr store to cover it all.

Of those 168 bytes, only 64 bytes are GC refs (you can see this by looking at the untracked GC lifetimes, say via COMPlus_JitGCDump). So if we're suppressing initlocals, looks like 7 xmm stores would cover the necessary untracked GC initializations.

So perhaps the "bit vector" approach has some appeal?

@erozenfeld (Member):

So if we're suppressing initlocals, looks like 7 xmm stores would cover the necessary untracked GC initializations.

So perhaps the "bit vector" approach has some appeal?

If we go that route the logic should live in CodeGen::genCheckUseBlockInit. Also, we'll need to be careful not to break any assumptions in fgVarNeedsExplicitZeroInit.

@benaadams changed the title from "Use xmm for prolog zeroing <= 256 bytes" to "Use xmm for prolog zeroing <= 128 bytes; align > 128 bytes" Feb 19, 2020
@benaadams (Member, Author) commented Feb 19, 2020:

I think I discovered what was going wrong in the other PR with alignment, so I have included aligning rep stosb and reduced the cut-over to 128 bytes; all lengths should benefit, including at R2R time. (I didn't bring over the ymm-size version.)
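As I read that description, the large-block path is roughly the following sketch (illustration only: memset stands in for rep stosb, and 16-byte alignment is my assumption):

#include <immintrin.h>
#include <cstdint>
#include <cstring>

// For blocks above the cut-over: one unaligned xmm store covers the head,
// then the remainder is cleared from an aligned start, so the rep stos
// equivalent always works on an aligned destination.
void zeroLargeBlock(unsigned char* p, std::size_t size) // size >= 16
{
    const __m128i zero = _mm_setzero_si128();
    _mm_storeu_si128(reinterpret_cast<__m128i*>(p), zero);       // unaligned head

    std::uintptr_t start   = reinterpret_cast<std::uintptr_t>(p);
    std::uintptr_t aligned = (start + 16) & ~std::uintptr_t(15); // round up to 16
    std::memset(reinterpret_cast<void*>(aligned), 0, start + size - aligned);
}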

@benaadams changed the title from "Use xmm for prolog zeroing <= 128 bytes; align > 128 bytes" to "Use xmm for prolog zeroing <=128 bytes; align >128 bytes" Feb 19, 2020
@benaadams (Member, Author):

Probably worth breaking this method up since it is hard to follow with all the ifdefs, but that is something we can do in a follow-on PR.

I was wondering if it could be made a little more generic and used for genCodeForInitBlkUnroll; and, since there is now a mini-loop capability, also for genCodeForInitBlkRepStos, which seems to be for blocks of unspecified size.

@AndyAyersMS (Member):

Right, it would be nice to share logic with other similar cases, though there can be different constraints on codegen (gc safe vs non-gc safe, for instance) that impact the allowable code sequences.

@AndyAyersMS (Member):

Linux build issue looks like #2234:

OCI runtime exec failed: exec failed: container_linux.go:344: starting container process caused "process_linux.go:95: adding pid 75444 to cgroups caused \"failed to write 75444 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/docker/69bd2a5c8d374a2795cee96d4030553e1525133f7613269c286a24b66eb38215/cgroup.procs: invalid argument\"": unknown

@AndyAyersMS (Member):

Perf failures look like download issues, #32835.

[2020/03/03 20:07:13][INFO] Could not find/download: ".NET Core SDK" with version = 5.0.100-preview.2.20153.4

@AndyAyersMS (Member):

@benaadams can you share out your perf tests?

I'd like to see us run them on an AMD box and perhaps get them into the performance repo.

@benaadams (Member, Author):

can you share out your perf tests?

Sure https://gist.github.com/benaadams/fed05032ff6c27e2d245135fe51e5496

@tannergooding (Member) commented Mar 4, 2020:

Sure https://gist.github.com/benaadams/fed05032ff6c27e2d245135fe51e5496

I'll run these on my Ryzen 1800X and Ryzen 3900X tomorrow for additional reference points.

Edit: Started building the branch so I can run tests; results should follow shortly.

@tannergooding (Member) commented Mar 4, 2020:

[Ryzen 1800X / 3900X benchmark results]

@AndyAyersMS (Member):

Thanks, @tannergooding -- looks qualitatively similar to the Intel results. And @benaadams, thanks for all your hard work here; this is a really nice improvement.

@dotnet/jit-contrib I think this is ready to merge, any final comments?

@jkotas added the tenet-performance label (Performance related issue) Mar 4, 2020
@erozenfeld (Member):

I'll review this later today.

@AndyAyersMS merged commit 88ebe4a into dotnet:master Mar 5, 2020
@AndyAyersMS (Member):

@erozenfeld didn't see your comment in time -- we can revise if needed.

@tannergooding (Member):

@adamsitnik, @DrewScoggins

It might be good to check the perf numbers tomorrow after they have gotten a chance to run 😄

@benaadams deleted the zero-init-xmm branch Mar 5, 2020
@benaadams (Member, Author) commented Mar 5, 2020:

@erozenfeld I was wondering if it changed the cost/benefit for "JIT: Support object stack allocation" any? #11192

e.g.

class Foo
{
    public long f1;
    public long f2;
    public long f3;
    public long f4;
    public Foo(long f1, long f2)
    {
        this.f1 = f1;
        this.f2 = f2;
    }
}

class Test
{
    static long f1;
    static long f2;
    public static int Main()
    {
        Foo foo = new Foo(f1, f2);
        return (int)(foo.f1 + foo.f2);
    }
}

Before, with the object heap-allocated:

;* V00 loc0         [V00    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T00] (  5, 10   )     ref  ->  rax         class-hnd exact "NewObj constructor temp"
;  V03 tmp2         [V03,T01] (  2,  4   )    long  ->  rdx         "Inlining Arg"
;  V04 tmp3         [V04,T02] (  2,  4   )    long  ->  rcx         "Inlining Arg"
;
; Lcl frame size = 40

G_M59477_IG01:
       sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25
G_M59477_IG02:
       mov      rcx, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      rdx, qword ptr [reloc classVar[0xd1ffab1e]]
       mov      rcx, qword ptr [reloc classVar[0xd1ffab1e]]
       mov      qword ptr [rax+8], rdx
       mov      qword ptr [rax+16], rcx
       mov      edx, dword ptr [rax+8]
       add      edx, dword ptr [rax+16]
       mov      eax, edx
						;; bbWeight=1    PerfScore 9.50
G_M59477_IG03:
       add      rsp, 40
       ret      
						;; bbWeight=1    PerfScore 1.25

With COMPlus_JitObjectStackAllocation=1, it now drops the expensive 40-byte rep stosd for the cheaper vmovdqa:

;  V00 loc0         [V00,T02] (  3,  3   )    long  ->  rax         class-hnd exact
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "NewObj constructor temp"
;  V03 tmp2         [V03,T00] (  2,  4   )    long  ->  rax         "Inlining Arg"
;  V04 tmp3         [V04,T01] (  2,  4   )    long  ->  rdx         "Inlining Arg"
;  V05 tmp4         [V05    ] (  4,  4   )  struct (40) [rsp+0x08]   do-not-enreg[XSF] must-init addr-exposed "MorphAllocObjNodeIntoStackAlloc temp"

-; Lcl frame size = 48
+; Lcl frame size = 40

G_M59477_IG01:
-      push     rdi
-      sub      rsp, 48
-      lea      rdi, [rsp+08H]
-      mov      ecx, 10
-      xor      rax, rax
-      rep stosd 
+      sub      rsp, 40
+      vxorps   xmm4, xmm4
+      vmovdqa  xmmword ptr [rsp], xmm4
+      vmovdqa  xmmword ptr [rsp+10H], xmm4
+      xor      rax, rax
+      mov      qword ptr [rsp+20H], rax
-					;; bbWeight=1    PerfScore 27.25
+					;; bbWeight=1    PerfScore 3.83

G_M59477_IG02:
       mov      rax, 0xD1FFAB1E
-      mov      qword ptr [rsp+08H], rax
+      mov      qword ptr [rsp], rax
       mov      rax, qword ptr [reloc classVar[0xd1ffab1e]]
       mov      rdx, qword ptr [reloc classVar[0xd1ffab1e]]
-      mov      qword ptr [rsp+10H], rax
-      mov      qword ptr [rsp+18H], rdx
-      lea      rax, [rsp+08H]
+      mov      qword ptr [rsp+08H], rax
+      mov      qword ptr [rsp+10H], rdx
+      lea      rax, [rsp]
       mov      edx, dword ptr [rax+8]
       add      edx, dword ptr [rax+16]
       mov      eax, edx
						;; bbWeight=1    PerfScore 10.00
G_M59477_IG03:
-      add      rsp, 48
+      add      rsp, 40
-      pop      rdi
       ret      
						;; bbWeight=1    PerfScore 1.75

@benaadams (Member, Author):

From there, if the lea were dropped:

G_M59477_IG02:
       mov      rax, 0xD1FFAB1E
       mov      qword ptr [rsp], rax
       mov      rax, qword ptr [reloc classVar[0xd1ffab1e]]
       mov      rdx, qword ptr [reloc classVar[0xd1ffab1e]]
       mov      qword ptr [rsp+08H], rax
       mov      qword ptr [rsp+10H], rdx
+      mov      edx, dword ptr [rsp+08H]
+      add      edx, dword ptr [rsp+10H]
-      lea      rax, [rsp]
-      mov      edx, dword ptr [rax+8]
-      add      edx, dword ptr [rax+16]
       mov      eax, edx

Since it's operating on the frame pointer, we could then eliminate the reg->stack, stack->reg round trip:

G_M59477_IG02:
       mov      rax, 0xD1FFAB1E
       mov      qword ptr [rsp], rax
       mov      rax, qword ptr [reloc classVar[0xd1ffab1e]]
       mov      rdx, qword ptr [reloc classVar[0xd1ffab1e]]
-      mov      qword ptr [rsp+08H], rax
-      mov      qword ptr [rsp+10H], rdx
-      mov      edx, dword ptr [rsp+08H]
-      add      edx, dword ptr [rsp+10H]
+      add      edx, eax
       mov      eax, edx

Reversing the add:

G_M59477_IG02:
       mov      rax, 0xD1FFAB1E
       mov      qword ptr [rsp], rax
       mov      rax, qword ptr [reloc classVar[0xd1ffab1e]]
       mov      rdx, qword ptr [reloc classVar[0xd1ffab1e]]
-      add      edx, eax
-      mov      eax, edx
+      add      eax, edx

Then removing the write that is never read:

G_M59477_IG02:
-      mov      rax, 0xD1FFAB1E
-      mov      qword ptr [rsp], rax
       mov      rax, qword ptr [reloc classVar[0xd1ffab1e]]
       mov      rdx, qword ptr [reloc classVar[0xd1ffab1e]]
       add      eax, edx

From there, determine we don't need any stack at all? 🤔

@erozenfeld (Member):

I was wondering if it changed the cost/benefit for "JIT: Support object stack allocation" any? #11192

I'm glad this will potentially help with CQ of methods with object stack allocation. Currently the main issue is not the quality of the code that's generated but the fact that very few allocations can be safely moved to the stack without an interprocedural escape analysis. We have some ideas on how to approach that problem but this work is not in near-term plans.

@benaadams (Member, Author):

@adamsitnik, @DrewScoggins

It might be good to check the perf numbers tomorrow after they have gotten a chance to run 😄

System.Text.Json tests might be one to look at

@erozenfeld (Member):

didn't see your comment in time -- we can revise if needed.

No problem. I reviewed the changes and they look good.
@benaadams Thank you for this work, very nice improvement.

@DrewScoggins (Member) commented Mar 5, 2020:

@adamsitnik, @DrewScoggins

It might be good to check the perf numbers tomorrow after they have gotten a chance to run 😄

System.Text.Json tests might be one to look at

Thanks for the insight Ben!

Looking at the data we are seeing big wins (7%-20%) across the suite of JSON tests. I have added an image of a graph from the report below. That kind of drop is what we are seeing across over half of the JSON tests, with the other half within the noise range of the tests.

You can find the full reports here:
https://pvscmdupload.blob.core.windows.net/reports/JSONReport_x64_refs-heads-master_Daily_Ubuntu1804_dotnetruntime_Micro_2020-03-05.html

https://pvscmdupload.blob.core.windows.net/reports/JSONReport_x64_refs-heads-master_Daily_Windows1018362_dotnetruntime_Micro_2020-03-05.html

[chart: System.Text.Json benchmark showing the 2020-03-05 drop]

@AndyAyersMS (Member):

It might be good to check the perf numbers tomorrow

I looked but there were no new perf reports on our internal share. Must be jitted code is now too fast to measure.

@benaadams (Member, Author):

Courtesy of @sebastienros (source):

+5% to Plaintext (extra 200k requests per second)

PlaintextPlatform: 8,693M  -> 8,734M
Plaintext:         4,037M  -> 4,239M
Json:              838,137 -> 840,426
DbFortunesRaw:     268,438 -> 274,706


Merging this pull request closed #32396: Strange Span costs for as Memory.Span -> parameter