The Stack Overflow journey to .NET 6

2022-05-25

At Stack Overflow, we always try to run on the latest and greatest version of .NET. We love the extra bit of performance that the .NET team keeps squeezing out, and of course we don’t mind running a supported version so that we get important security and bug fixes.

Given that we’ve been on .NET 5 for a while and its end-of-life date was May 10, we’ve been busy upgrading to .NET 6. In this blog post we share a bit about what that means for us, some issues we ran into, and the end result.

First, I want to give a shout-out to Samo Prelog and Dan Roberts for all the work they did on this migration, and to Roberta Arcoverde for enabling our team to work on it.

A shared codebase for 3 products

You probably know our public site, stackoverflow.com. Stack Overflow and the rest of the Stack Exchange network are where we get the biggest part of our traffic. On an average day we serve 250 million requests and 85 million page views. Because of our size we run into unique issues, but it also gives us great opportunities to benefit from the optimizations in a new .NET version.

In addition to our public Stack Exchange network, we also have Stack Overflow for Teams and Stack Overflow Enterprise. Stack Overflow for Teams is a multi-tenant, private SaaS version of Stack Overflow where you can ask private questions within your organization. Stack Overflow Enterprise is a Stack Overflow environment where you can have a main site with questions and as many private Teams as you want. We can host it for you as SaaS, or you can host it yourself.

These three applications are all based on the same code base but deployed in different configurations and environments. This means that a fundamental upgrade like .NET 6 touches a lot of things and we have to be sure that they all work before we roll it out more broadly. This is why we created a branch for .NET 6 that we regularly update from main. This way, we can decide when we merge to main and push the change to all developers.

How we deploy high impact changes to production

Stack Overflow and Stack Overflow for Teams both run in our own datacenter. Stack Overflow runs on 11 web servers in our primary location and 11 in our secondary location. Servers 1 to 9 run stackoverflow.com and the Stack Exchange network. Servers 10 and 11 run our developer instance and all our Meta sites.

Our pipeline automatically deploys all changes to dev. Next in line are all Meta sites, and after that all network sites (including stackoverflow.com). At that point, the change is live for all public users of our site. We then deploy to our 3 Teams web servers.

Looking at some recent numbers, our Meta build and deploy takes 6 minutes, the Stack Exchange network takes 10 minutes, Teams takes 10 minutes, and the Teams API takes 4 minutes.

To learn more about our deployments, this podcast is a nice listen: https://stackoverflow.blog/2016/05/03/stack-overflow-how-we-do-deployment-2016-edition/

Pushing to production in a bomb suit

You can understand why a big change like upgrading EF or .NET isn’t merged to main and deployed to all servers in one go. We want to be able to roll back quickly if something goes wrong, and we want the blast radius of a potential issue to be as small as possible.

We first canary the .NET 6 branch on a single server for Meta and let it run for a while. We keep an eye on our metrics in Splunk and exceptions in OpServer and try to catch any issues we didn’t see on our own developer machines. Once we’ve let it bake for a while, we repeat the process and deploy it to one of the servers running our public network sites. Finally, we deploy the branch to all our servers and let it sit overnight. This is especially important because we have a lot of background processes that run at different times of the day, are particularly heavy on the database, and sometimes surface issues.

If we don’t see any exceptions or other issues, we’re ready to merge to main and then make sure that it is deployed everywhere.

That takes care of the Stack Exchange network and Teams. We also have Stack Overflow Enterprise. For Enterprise we create a new installation package daily and deploy that to test environments that are running main. We also have test environments that run the latest released Enterprise version, and those aren’t touched. Against the main test environments, we run an automated set of end-to-end tests based on Mabl. In addition, we have manual testers who look for certain edge cases and test brand-new functionality. Since we don’t release Enterprise continuously but only 3 times a year, we have some time to let .NET 6 run on Stack Overflow and Stack Overflow for Teams before we decide it’s stable enough for an Enterprise release.

Entity Framework 5

Before getting started on the .NET 6 upgrade, we first had to catch up on Entity Framework (EF Core, to be precise). We were still running EF 2.2, which is no longer supported on .NET 6. To avoid upgrading Entity Framework and .NET at the same time, we first upgraded to EF 5 and got that merged to main and running in all our environments.

The first step was to make the code compile. We had to do some simple method renames such as FromSql to FromSqlRaw, fix some namespaces, and remove dependencies on internal methods that were no longer there.

After everything compiled, we started running our integration tests. A lot of these tests use Entity Framework, and although we don’t have 100% coverage, they did catch some of the breaking changes that we needed to fix. One thing that bit us was the breaking change “Temporary key values are no longer set onto entity instances”. In EF 2, a newly created entity would get a temporary key value that you could use to set foreign key relationships. This was the cause of some integration tests failing, and we fixed it by setting the navigation property instead of the foreign key value.
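To make the navigation-property fix concrete, here is a minimal sketch. The Question and Answer entities and the AnswerFactory helper are hypothetical and are not taken from the Stack Overflow code base; the only EF Core calls used are DbContext.Add and DbContext.SaveChanges.

```csharp
using Microsoft.EntityFrameworkCore;

// Hypothetical entities, for illustration only.
public class Question
{
    public int Id { get; set; }
    public string Title { get; set; } = "";
}

public class Answer
{
    public int Id { get; set; }
    public int QuestionId { get; set; }
    public Question? Question { get; set; }
}

public static class AnswerFactory
{
    // Link the answer through the navigation property. Since EF Core 3.0,
    // question.Id stays at its default (0) until SaveChanges, so copying it
    // into QuestionId before saving would store the wrong foreign key.
    public static Answer CreateAnswerFor(Question question)
    {
        return new Answer { Question = question };
    }

    // Adding the answer also starts tracking the question it points to.
    // EF Core fills in Question.Id and Answer.QuestionId during SaveChanges.
    public static void Save(DbContext context, Answer answer)
    {
        context.Add(answer);
        context.SaveChanges();
    }
}
```

With this shape, code that creates related entities never reads a key value before the database has generated it.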

Another issue we ran into required some debugging and investigation to figure out what went wrong. In the end, it turned out we had hit the breaking change “Backing fields are used by default”. We have some logic in the setters of our properties that, in EF 2, was executed whenever an entity got loaded from the database. In EF 3, Microsoft changed this behavior: EF now writes directly to the backing field, which it finds by following a simple naming convention. This led to the property having a value but the extra logic in our setter not running. A simple modelBuilder.UsePropertyAccessMode(PropertyAccessMode.Property); solved the issue. If only finding the cause had been as easy as fixing it…
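For context, a sketch of where that call typically lives. The Post entity, its trimming setter, and the SiteContext class are made up for this example; only ModelBuilder.UsePropertyAccessMode and PropertyAccessMode.Property come from EF Core.

```csharp
using Microsoft.EntityFrameworkCore;

// Hypothetical entity whose setter carries extra logic.
public class Post
{
    private string _title = "";

    public int Id { get; set; }

    public string Title
    {
        get => _title;
        // Logic like this is skipped when EF Core writes to _title directly.
        set => _title = (value ?? "").Trim();
    }
}

// Hypothetical context, for illustration only.
public class SiteContext : DbContext
{
    public SiteContext(DbContextOptions<SiteContext> options) : base(options) { }

    public DbSet<Post> Posts => Set<Post>();

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // Restore the pre-3.0 behavior for the whole model: read and write
        // through property getters and setters, not backing fields.
        modelBuilder.UsePropertyAccessMode(PropertyAccessMode.Property);
    }
}
```

Setting the access mode on the model applies it everywhere at once. It can also be set per entity or per property if only a few setters carry logic.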

After fixing all the integration tests, we went through the deploy dance that I described above and then merged EF 5 to main.

.NET 6

We were then ready to start working on .NET 6. The basics are simple: just change net5.0 to net6.0 in all our project files. In addition we had to update several NuGet packages. We use central dependency management for all our NuGet packages so it’s easy to do that in one location.

We ran into some small and some bigger problems. For example, we have some code that deals with PNG files for gravatars. Since this code depends on Windows APIs, we got a bunch of errors of the form error CA1416: This call site is reachable on all platforms. ‘FontStyle.Bold’ is only supported on: ‘windows’. We solved it by adding an if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows)) return; guard. We know that’s not the nicest way to do this, so we added it to our list of required fixes to run Stack Overflow on Linux.
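As an illustration of the guard pattern, here is a small sketch. GravatarRenderer and its methods are hypothetical stand-ins, not the real code. It uses OperatingSystem.IsWindows(), available since .NET 5, which the platform compatibility analyzer recognizes in the same way as the RuntimeInformation check, together with the SupportedOSPlatform attribute to mark the Windows-only part.

```csharp
using System;
using System.Runtime.Versioning;

// Hypothetical renderer, for illustration only.
public static class GravatarRenderer
{
    // Returns null on platforms where rendering is not available.
    public static string? TryRender(string initials)
    {
        // Early return acts as a platform guard, so the call below
        // no longer triggers CA1416.
        if (!OperatingSystem.IsWindows())
        {
            return null;
        }

        return RenderOnWindows(initials);
    }

    // The attribute tells the analyzer that callers must be guarded.
    [SupportedOSPlatform("windows")]
    private static string RenderOnWindows(string initials)
    {
        // Windows-only drawing calls (for example FontStyle.Bold) would go here.
        return $"rendered:{initials}";
    }
}
```

Marking the Windows-only method keeps the guard in one place and lets the analyzer flag any new unguarded caller.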

Another issue hit our integration tests running on CircleCI. These tests run on Linux, and they kept failing by completely crashing the process. We didn’t see any error messages or logs; it just crashed. After debugging this locally using Windows + WSL + Remote Debugging from Visual Studio, we found a GitHub issue that helped us with a workaround. Manually calling NetSecurityNative_EnsureGssInitialized before opening a SQL connection solved the issue.

Another issue had to do with the way we sometimes render a Razor view in our integration tests. Previously, we compiled our Razor views to a separate assembly: StackOverflow.PrecompiledViews.dll. We loaded that DLL in our integration tests and could then render the views. In .NET 6, this changed with the move to source generators. ASP.NET doesn’t generate a separate assembly anymore but instead includes the views as code in the same assembly as the controllers.

We also use several custom analyzers in our code base. One of these analyzers implemented the IOperation interface from Roslyn in order to do a custom traversal of the IOperation tree (via OperationWalker). This originally gave us an RS1009 warning, which we suppressed. The Roslyn / CodeAnalysis packages in .NET SDK 6.0.300, which shipped with VS 17.2 after we merged to main, got bumped to 4.2, which added a new member to the IOperation interface. Our .NET branch originally targeted 6.0.200. In order to keep supporting both patch versions, we had to rewrite the analyzer to do the IOperation tree traversal without implementing the IOperation interface.

After getting all integration tests to green we started deploying the .NET 6 branch following the process outlined above.

A connection pool scare

While running .NET 6 as a test on all our servers, we suddenly noticed a spike in exceptions triggered by SQL connection pool exhaustion. After looking at historical logs, we noticed this issue had happened before, but .NET 6 seemed to make it much more frequent, to the point where we weren’t sure if we could deploy .NET 6 without fixing it.

Fortunately, we had encountered this issue before, and we have some code in Stack Overflow that automatically triggers a minidump of the process whenever this type of exception occurs. The method Dumpster.Fire is inspired by this code: https://github.com/dotnet/diagnostics/blob/8e5123a15b40902abd9813927ce09ddd8d2c688a/src/Tools/dotnet-dump/Dumper.Windows.cs. This gives us a minidump file that we can then analyze to see what’s happening.
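Dumpster.Fire itself isn’t shown in this post. As a rough idea of what such a helper can look like on Windows, here is a minimal sketch that calls the Win32 MiniDumpWriteDump function from dbghelp.dll through P/Invoke. The MiniDumper class, its method name, and the choice of the basic MiniDumpNormal dump type are assumptions made for this example.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Hypothetical helper, for illustration only.
public static class MiniDumper
{
    private const uint MiniDumpNormal = 0x00000000;

    [DllImport("dbghelp.dll", SetLastError = true)]
    private static extern bool MiniDumpWriteDump(
        IntPtr hProcess,
        uint processId,
        SafeFileHandle hFile,
        uint dumpType,
        IntPtr exceptionParam,
        IntPtr userStreamParam,
        IntPtr callbackParam);

    // Writes a minidump of the current process to the given path.
    // Returns false on non-Windows platforms or when the dump fails.
    public static bool TryWriteDump(string path)
    {
        if (!OperatingSystem.IsWindows())
        {
            return false;
        }

        using var process = Process.GetCurrentProcess();
        using var stream = new FileStream(
            path, FileMode.Create, FileAccess.ReadWrite, FileShare.None);

        return MiniDumpWriteDump(
            process.Handle,
            (uint)process.Id,
            stream.SafeFileHandle,
            MiniDumpNormal,
            IntPtr.Zero,
            IntPtr.Zero,
            IntPtr.Zero);
    }
}
```

A production version would likely pick a richer dump type and rate-limit how often dumps are written, since a burst of exceptions could otherwise fill the disk.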

The minidump, opened in Visual Studio, showed a lot of threads waiting synchronously for async database access.

[Figure: Viewing the minidump in Visual Studio shows blocked threads]

We discovered this had to do with an experiment we were running on our public sites, where we needed a lot of data loaded into our cache and waited synchronously for the cache to be ready. Fortunately, the experiment was already complete, and turning it off solved our connection pool issues.
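The pattern behind those blocked threads is often called sync-over-async. The sketch below contrasts it with awaiting. CacheWarmup and its methods are hypothetical, and Task.Delay stands in for a real database call. It does not reproduce the pool exhaustion itself; it only shows the two calling styles.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical cache loader, for illustration only.
public static class CacheWarmup
{
    // Stand-in for an async database query.
    public static async Task<int> LoadFromDatabaseAsync()
    {
        await Task.Delay(50);
        return 42;
    }

    // Sync-over-async: the calling thread is blocked until the task
    // completes, so it cannot serve other work in the meantime.
    public static int WarmBlocking()
    {
        return LoadFromDatabaseAsync().GetAwaiter().GetResult();
    }

    // Async all the way: the thread is released while the query is in flight.
    public static async Task<int> WarmAsync()
    {
        return await LoadFromDatabaseAsync();
    }
}
```

Under load, many callers stuck in the blocking version can tie up threads and delay the work that would hand connections back to the pool, which is one plausible way to end up with the symptoms described above.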

End result

After fixing this issue and letting .NET 6 run one more time on all our servers we were ready to merge the .NET 6 upgrade to main.

One graph that immediately caught our attention was the number of threads in thread pools.

[Figure: The number of threads in the thread pools exploded]

As you can see, on May 11, the moment we deployed .NET 6, the graph suddenly exploded. We contacted Microsoft and learned that this is by design: it results from a change they made to how threads are managed within thread pools. As long as the CPU usage and queue length don’t go up, it’s not an issue. After a couple of days we also saw the numbers return to normal.
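If you want to watch the same numbers in your own application, the runtime exposes them directly. The sketch below reads ThreadPool.ThreadCount, ThreadPool.PendingWorkItemCount, and ThreadPool.CompletedWorkItemCount, all available since .NET Core 3.0. The ThreadPoolStats type is a hypothetical wrapper, and how you ship the values to a dashboard is up to you.

```csharp
using System;
using System.Threading;

// Hypothetical wrapper, for illustration only.
public static class ThreadPoolStats
{
    public readonly record struct Snapshot(
        int ThreadCount,
        long PendingWorkItems,
        long CompletedWorkItems);

    // Captures the current thread pool counters in one value.
    public static Snapshot Capture()
    {
        return new Snapshot(
            ThreadPool.ThreadCount,
            ThreadPool.PendingWorkItemCount,
            ThreadPool.CompletedWorkItemCount);
    }
}
```

A rising thread count on its own is not alarming. A pending work item count that keeps growing is the signal worth alerting on, since it means work is queuing faster than it is being processed.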

Here are some more graphs from the performance dashboard we use internally.

[Figure: Overview of our performance dashboard]

Looking at all this data, we haven’t seen any real performance improvements (yet). We want to let this run for a couple of weeks to make sure there are no issues, and then we want to enable Dynamic PGO and see if that changes anything.

At least we’re happy to have finished this migration and are running on a supported version of .NET!

The future

One last step we want to take while everything is still fresh is upgrading Entity Framework to version 6. That would put us on the latest version of all our important packages.

If we have any updates, we’ll let you know! If you have any questions or comments, or would like to see a blog post on another part of our landscape, feel free to reach out through the comments below or on Twitter.