0x0030 - The Clock Problem - or Iteration over Various Base N Sets in a Cuda Coding Environment
The Clock Problem - or Iteration over Various Base N Sets in a Cuda Coding Environment
The simplest way to understand the Various Base N problem is explained in a question:
I have 1283812831 seconds. How many days, hours, minutes and seconds does this index at? Silence... Head scratcher isn't it??
Naturally most programmers gloss over this simply relying in the fact that languages like Python offer timedelta which will do the heavy lifting, look up a quick solution, write some inefficient bloatware and move on.
But in Cuda coding you will find that the equivalent of the clock problem will requiring solving all the time - so you have no choice but to know how to do this. Because you are dealing with Base-N type problems - consider the following problem example set:
- Cuda kernel will launch with 32 block x 32 threads (1024 total)
- The A variable will range from 8 - 86 (78 Possibles)
- The B variable will range from 14-600 (586 Possibles)
- The C variable will range from 0.2 - 0.9 in 0.01 increments (70 Possibles)
- The D variable will range from 0.8 - 2.0 in 0.01 increments (120 Possibles)
We know we have 78 x 586 x 70 x 120 (383 million) possible permutations on the problem and our GPU will chunk out 1024 per iteration run.
Very similar to the BASE60 of seconds in a minute, BASE3600 seconds in a hour we have BASE76, BASE586.. etc.
We will double back and complete the clock problem and when we understand it - we can then solve the Cuda Base N problem, so:
I have 1283812831 seconds. How many days, hours, minutes and seconds does this index at? Silence... Head scratcher isn't it??
Our smallest unit of measurement is the second.
seconds in a minute = 60
seconds in a hour = 3600
seconds in a day = 86400
- Always work from the largest BASE(N) to the smallest.
days = math.floor(total_seconds / seconds_in_a_day)
days = math.floor(1283812831 / 86400)
days = 14858
remainder_seconds = 1283812831 - (14858 days * 86400 seconds in day)
remainder_seconds = 1283812831 - 1283731200 = 81631 seconds left over.
remainder_hours = math.floor(remainder_seconds / seconds_per_hour)
remainder_hours = math.floor(81631 / 3600)
remainder_hours = 22
remainder_min = math.floor(remainder_seconds_for_min / sec_per_min)
remainder_seconds_for_min = 81631 - (22 * 3600)
remainder_seconds_for_min = 2431 / 60
remainder_min = 40
remainder_sec_final = 2431 - (60 * 40)
remainder_sec_final = 31 seconds
When it's all done we have 14858 days, 22 hours, 40 minutes and 31 seconds.
Now the exact same method is applied to the cuda problem - except we are dealing with arbitrary base-N sets based upon the problem range and step.
- Total iterations 383 million or 383947200 iterations stepped at 1024 per warp call.
- The iteration cycle is the same. For instance for every iteration of Variable C we will have 120 iterations of Variable D - and for every iteration of Variable B we will have (70 x 120 iterations of D) and for every iteration of Variable A we will have (586 x 70 x 120 iterations of D)..
- Calculate your scalar offset into the problem. This is N/383947200. Since we are running 1024 threads at a instance it becomes:
- 374,948 runs per thread.
long per_thead_run_count = 383947200 / (blockDim.x * def_thread_count)
long scalar_offset = current_run * def_thread_count + cthread
Once you have your scalar_offset you simply apply the same rules as above math.floor dividing your modulus factors, consider:
I am on iteration 282828984 of 383947200. What is my value of A,B,C,D variables before it simulates?
Refreshing we have:
120 D iterations per Variable C
120 x 70 (8400 D iterations per variable B)
120 x 70 x 586 (4922400 D iterations per variable A)
A Pos = math.floor(282828984 / 4922400) = 57
A Remainder = (282828984 - (57 * 4922400) = 2252184
B Pos = math.floor(2252184 / 8400) = 268
B Remainder = 2252184 - (268 * 8400) = 984
C Pos = math.floor(984 / 120) = 8
C Remainder = 984 - (120 * 8) = 24
D = 24.
So when we are on iteration 282828984 our A=57, B=268, C=8, and D=24
Finally: application towards float ratios simply requirest a ratio:
24 / 120 * (2.0 - 0.8) + 0.8 = D = 1.04
A Real-world example implementation (note math.floor is simply discarded in a long division)
In the end none of this was practical - when it was put into production it was found that it took approximately 3 seconds to run 1 sim-cycle across the block/thread count. Multiplied by 374,000 expected runs was looking at 12.7 days per stock! Since we are looking at 5000 stocks to have a complete run time of multiple years is simply not timely. Random vector can completely blow this away and is the algorithm of choice.