0x002E - CUDA - Memory De-confliction in an Asynchronous Execution Environment

There are a few major mistakes I made in understanding a proper memory/data process flow when dealing with CUDA GPU mass computing. I was under the impression that cores are able to effectively 'cross-communicate' while executing. In normal threaded computing at the CPU-core level, cores can finish at any time and then asynchronously update tables and so on. They do not need to wait for any other core and are independently pulling, assessing and pushing results. And after all, CUDA has the '__shared__' variable qualifier made for exactly that purpose - no? But in reality, while executing, a few rules apply at all times - do otherwise and you just get breakage:

  • Read-only data: can be read by any and all cores at any time.
  • Read/write data: never read across writing threads until they are done executing. '__syncthreads()' MUST complete before results can be compared. When comparing results, treat your result set as read-only!
  • Write data: always reserve a space for each core as its own scratch pad that only it has access to. No other core should attempt to read this section until a synchronization has completed. At that point the host/cores can copy all of the results over, analyze them and collect the high-return sets (a rough sketch of all three rules follows this list).
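
As a rough illustration of all three rules in one place - a hypothetical kernel, not taken from the real project, and it assumes a launch of 256 threads per block - each thread reads the read-only input freely, writes only to its own '__shared__' slot, and only after '__syncthreads()' does any thread compare the now read-only results:

__global__ void rulesKernel(const float *input, float *blockBest)
{
    __shared__ float scratch[256];               // one private slot per thread (assumes 256 threads/block)

    int  tid = threadIdx.x;
    long gid = (long)blockIdx.x * blockDim.x + tid;

    // Rule 1: read-only input can be read by any thread at any time.
    // Rule 3: each thread writes only to its own scratch slot.
    scratch[tid] = input[gid] * input[gid];

    // Rule 2: no thread reads another thread's slot until everyone has written.
    __syncthreads();

    // The results are now treated as read-only for comparison.
    if (tid == 0) {
        float best = scratch[0];
        for (int i = 1; i < blockDim.x; ++i)
            best = fmaxf(best, scratch[i]);
        blockBest[blockIdx.x] = best;
    }
}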

We can walk through a scenario. Presume we have a struct variable with 12 elements, and our code block has some if conditions. In any code block with an if condition you can encounter warp divergence. This occurs when some cores evaluate the condition as true while others evaluate it as false, and the resulting compute times differ between threads. In this state some cores will compute faster than others and they will not finish at the same time. The 'true' condition threads could already be writing to their structs while the 'false' condition threads have not written yet.

The problem is that the 'faster' threads then go off to have a look at the 'slower' threads' work to see if they have valuable results (we are looking at a scenario where the threads are going to run thousands of iterations without coming back to the CPU, or bothering to synchronize). However, the 'slower' threads are still half-done writing to their structs, and the faster threads read corrupted data.
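
To make the failure concrete, here is a sketch of the broken pattern (hypothetical names throughout; the struct simply stands in for the 12-element record above): a divergent 'if' lets the fast threads read a neighbour's record while the slow threads are still writing theirs.

// Hypothetical 12-element result record, reused in the sketches below.
struct Result {
    float score;
    float vec[12];
};

__global__ void brokenKernel(Result *scratch, const float *input)
{
    long tid = (long)blockIdx.x * blockDim.x + threadIdx.x;
    Result *mine = &scratch[tid];

    if (input[tid] > 0.0f) {
        // 'True' threads take the short path and finish almost immediately.
        mine->score = input[tid];
    } else {
        // 'False' threads do far more work and are still writing...
        mine->score = 0.0f;
        for (int i = 0; i < 12; ++i) {
            mine->vec[i] = sinf(input[tid] * i);   // stand-in for the slow path
            mine->score += mine->vec[i];
        }
    }

    // BUG: no synchronization here. A fast thread can read a neighbour's
    // record while that neighbour is still halfway through its writes.
    float neighbourScore = scratch[tid ^ 1].score;  // potentially half-written garbage
    if (neighbourScore > mine->score)
        mine->score = neighbourScore;               // decisions built on corrupted data
}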

So I tried to solve this: why not declare a single '__shared__' byte that each core can indicate its status in? The reality is no - because over thousands of cores executing millions of iterations, one core could be writing to the exact location at the same time as another is attempting to read from it, and according to the manual this returns an undefined value and/or triggers a race condition.
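
A quick sketch of why the status-byte idea breaks (again hypothetical; this is the pattern the manual warns about): the write to the flag and the read of the flag can land on the same location at the same instant, with nothing ordering them.

__global__ void statusByteKernel()
{
    __shared__ unsigned char done[256];    // naive per-thread "I'm finished" flag

    done[threadIdx.x] = 0;
    // ... divergent work happens here ...
    done[threadIdx.x] = 1;                 // mark myself done

    // Poll a neighbour's flag with no synchronization: the neighbour may be
    // writing that byte at this exact moment (a data race), and without any
    // fence the compiler may cache the read, so this can also spin forever.
    while (done[threadIdx.x ^ 1] == 0) { }
}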

A solution example:

Effectively, while producing results each thread will have its own contiguous block of RAM to use as a scratch pad. Mathematically we can define a 'start region / end region' for our threads. Each thread knows that space is strictly its own.

long offsetstart = ((long)blockIdx.x * blockDim.x + threadIdx.x) * bytesize;
long offsetend   = ((long)blockIdx.x * blockDim.x + threadIdx.x) * bytesize + bytesize - 1;

__syncthreads() then waits for all threads in the block to finish writing, and in the next code block everything can be treated as strictly read-only for assessment. One can build all kinds of modified models on top of this.
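
A minimal sketch of that arrangement, reusing the hypothetical Result struct from the broken example above: each thread writes only inside its own region of the global scratch buffer, and only after '__syncthreads()' do threads in the same block treat each other's regions as read-only.

__global__ void scratchKernel(Result *scratch)
{
    long tid = (long)blockIdx.x * blockDim.x + threadIdx.x;
    Result *mine = &scratch[tid];                 // this thread's private window

    // Write phase: only this thread ever touches *mine.
    mine->score = 0.0f;
    for (int i = 0; i < 12; ++i) {
        mine->vec[i] = (float)(tid % (i + 1));    // stand-in for real work
        mine->score += mine->vec[i];
    }

    __syncthreads();                              // every thread in this block has now written

    // Read phase: within the block, other threads' records are read-only.
    if (threadIdx.x == 0) {
        float best = mine->score;
        for (int t = 1; t < blockDim.x; ++t)
            best = fmaxf(best, scratch[(long)blockIdx.x * blockDim.x + t].score);
        // 'best' now holds this block's top score
    }
}

Note that __syncthreads() only covers the threads of one block; across blocks it is the host-side synchronization in the next section that closes the loop.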

Host Code Does Not Need to Wait!

CUDA kernel launches are non-blocking. Host code can continue executing, and this presents the same problem: the host can attempt to read results from a still-running kernel before it is done (with no real way to know, or it can hit a race condition returning confusing results).

The solution to this on the host side is as follows:

cudaDeviceSynchronize();

This will force the CPU to wait for all of its running device threads to finish.
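
A sketch of the host side under the same assumptions (the kernel and struct names are the hypothetical ones from above): launch, block on cudaDeviceSynchronize(), then copy the scratch buffer back.

#include <cuda_runtime.h>

void runOnce(Result *d_scratch, Result *h_results, int blocks, int threadsPerBlock)
{
    scratchKernel<<<blocks, threadsPerBlock>>>(d_scratch);   // returns to the host immediately

    cudaDeviceSynchronize();                                 // host blocks here until the kernel is done

    size_t bytes = (size_t)blocks * threadsPerBlock * sizeof(Result);
    cudaMemcpy(h_results, d_scratch, bytes, cudaMemcpyDeviceToHost);
}

(Strictly speaking, a cudaMemcpy on the default stream also waits for prior device work, but the explicit synchronize makes the intent obvious.)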

But what if our problem is days of computing, or an unbounded problem set with no known compute time???!

At this point one needs to look at building their own scheduler around the device-side clock()/clock64() cycle counter right inside the thread (the cycle count converts to milliseconds via the device clock rate). For example (a sketch follows the list):

  • Threads execute for roughly 15,000 milliseconds (15 seconds) before returning.
  • The host manages the cycle with cudaDeviceSynchronize().
  • The host harvests the top-returning datasets every 15 seconds and reseeds that data to the other threads, which now work in a seeded-deviation mode (taking top performers and adjusting them slightly).
  • Runtime is now indeterminate: the user runs for whatever time budget they want to allocate, results come back throughout, and the user selects from an array of potential results (which is what we went with).
  • In our real-world example each thread had room for its own top-ten results. Because the problem set was absolutely, horrifically large and basically unbounded, the solution was to try random vector sets and record the top performers. After hundreds of iterations, thousands of cores would collect and aggregate ALL of the result sets into one large super-group, which was iteratively fed back for reprocessing in a seed-deviation type arrangement. This was found to be very effective at finding high-returning results in very-large-N problems on very large quote sets.
  • When this was done in Python, synchronization across seed deviants took multiple servers running overnight.
  • When migrated to CUDA, random iteration and seed-deviation homing reduced to approximately 3-5 minutes per stock! This is a game-changer.
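
Here is a sketch of the thread-side half of that scheduler (all names, the RNG and the scoring are placeholders, not the real project's code): each thread loops until its cycle budget is spent, keeping its own best result, then returns so the host can synchronize, harvest and reseed.

__global__ void budgetedSearch(Result *scratch, long long budgetCycles, unsigned long long seed)
{
    long tid = (long)blockIdx.x * blockDim.x + threadIdx.x;
    Result *mine = &scratch[tid];     // assumes the host reseeded/zeroed this buffer before launch

    unsigned long long state = seed ^ (unsigned long long)tid;   // cheap per-thread RNG state
    long long start = clock64();                                 // SM cycle counter

    while (clock64() - start < budgetCycles) {                   // ~15 s worth of cycles, chosen by the host
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;   // LCG step
        float candidate = (float)(state >> 40);                  // stand-in for scoring a random vector
        if (candidate > mine->score)
            mine->score = candidate;                             // keep this thread's best so far
    }
    // Return; the host calls cudaDeviceSynchronize(), harvests the top
    // performers, reseeds, and launches the next 15-second round.
}

The host converts the 15-second budget into cycles using the device clock rate (available from cudaGetDeviceProperties, where clockRate is reported in kHz).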

In summary, learning to code this way felt like sending your code off to some distant planet to execute. While it runs you really have no way to communicate with it until it decides it is done. This is a very simplistic way to explain a highly complex arrangement of memory allocations across SMs and so on. Long-hanging kernels used to break older versions of CUDA, because at the driver level they indicated that your code was stopping the GPU card from even updating the display!

The only exception came when they added device-side printf, through which you can receive information from any thread as to its status. If you add a pipe in Linux to this output, you can technically harvest results directly from the running threads without bothering to copy them back.
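
A trivial sketch of that escape hatch (hypothetical kernel name): device-side printf sends buffered output back to the process's stdout, which on Linux can be piped like anything else.

#include <cstdio>

__global__ void statusReport(int iteration)
{
    // One thread per block reports in; the output is buffered on the device
    // and flushed to the host's stdout (e.g. at synchronization points).
    if (threadIdx.x == 0)
        printf("block %d finished iteration %d\n", (int)blockIdx.x, iteration);
}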

A fellow contributor raised a similar question in terms of thread-to-thread communication.
