at the same time. Taking into account that:
- A task mapped to run on the GPU needs its input data to be offloaded, and this data is transferred through the copy engine (GPU CE). A GPU kernel can also produce output data that must be copied back to the host.
- CPU tasks are modeled with a Read/Compute/Write semantic, with "Read" and "Write" being 100% memory-bound operations, and
- GPU CE, A57 and Denver have significantly different memory bandwidths, latencies and sensitivities to memory interference.
According to the following references, which measure the impact of memory interference:
https://ieeexplore.ieee.org/stamp/stamp ... er=8247615
http://hercules2020.eu/wp-content/uploa ... tforms.pdf
The idea is that the length of memory phases (read and write) depends on how many other memory controller clients
are accessing main memory at the same time.
Let us make the following assumptions for modeling interference:
- We are given a taskset of CPU and GPU tasks.
- The size of the buffers to read and to write is known in advance and is fixed for every instance of the periodic job.
- From the GPU side we only consider interference from the Copy Engine data movements (modeled in Amalthea as "runnables").
- A GPU CE data movement is a 100% memory-bound runnable.
- Every memory access is modeled as a sequential access pattern.
Model:
The model we derive from the cited literature is the following. It describes what happens to read/write latencies when more than one CPU core is accessing main memory at the same time, and it also accounts for the increase in latency caused by GPU CE activity during the observed time window.
For CPUs:
Lat(CPUtype,cacheLine)[ns] = baseline(CPUtype) + K(CPUtype)*#C + sGPU(CPUtype)*bGPU
with:
Lat(CPUtype,cacheLine) = time needed to read or write a cacheLine (64B) from main memory to CPU registers.
CPUtype = Observed CPU core; A57 or Denver
baseline = time taken to read or write a cacheLine (64B) from main memory to CPU registers in isolation (no interference).
baseline(A57) = 20 ns
baseline(Denver) = 8 ns
K(CPUtype) = increase in latency caused by a single interfering core. Do note: it does not matter whether the interfering core is a Denver or an A57; this number only depends on the observed CPU core (CPUtype).
K(A57) = 20 ns
K(Denver) = 2 ns
#C = number of interfering cores, ranging from 0 to 5 (one core is the observed CPU core; 0 means no interference from other CPUs).
sGPU = sensitivity to GPU CE activity. This represents the increase in latency when the GPU is performing operations on the copy engine.
sGPU(A57) = 100 ns
sGPU(Denver) = 20 ns
bGPU = boolean: 1 if the GPU is operating the copy engine, 0 otherwise.
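As an illustration only, here is a minimal Python sketch of the CPU-side formula (the function name and structure are ours; the constants are the ones listed above):

def cpu_line_latency_ns(cpu_type, interfering_cores, gpu_ce_active):
    # Lat(CPUtype, cacheLine) = baseline(CPUtype) + K(CPUtype)*#C + sGPU(CPUtype)*bGPU
    baseline = {"A57": 20, "Denver": 8}[cpu_type]    # ns, per 64B line in isolation
    k        = {"A57": 20, "Denver": 2}[cpu_type]    # ns added per interfering core
    s_gpu    = {"A57": 100, "Denver": 20}[cpu_type]  # ns added while the GPU CE is active
    b_gpu = 1 if gpu_ce_active else 0
    return baseline + k * interfering_cores + s_gpu * b_gpu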
For the GPU Copy Engine:
Lat(memcpy,64B) = GPUbaseline + 0.5*#C
Lat(memcpy,64B) = Time taken to transfer 64B using the copy engine (cudaMemcpy)
GPUbaseline = 3 ns. Time taken to transfer 64B using the copy engine with no interfering CPUs
Each CPU core active in the same time window as the CE operation increases this baseline by half a nanosecond (the 0.5*#C term).
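Again purely as a sketch under the same assumptions, the GPU CE side of the model can be written as:

def gpu_ce_line_latency_ns(active_cpu_cores):
    # Lat(memcpy, 64B) = GPUbaseline + 0.5*#C
    gpu_baseline = 3.0  # ns per 64B transfer with no interfering CPUs
    return gpu_baseline + 0.5 * active_cpu_cores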
Numerical Example:
A task mapped on an A57 core has a memory footprint (read) of 128B.
If no other CPU is accessing memory (#C=0)
and if the GPU CE is idle (bGPU=0),
then the time necessary to read the whole working set is:
( 128/cacheLine*Lat(A57,cacheLine) = 2*20 = 40 ns ).
If one interfering core is active (it does not matter whether it is a Denver or an A57):
(128/cacheLine*Lat(A57,cacheLine) = 2*(20 + 20*1 + 0) = 80 ns)
This would increase to 280 ns ( 2*(20 + 20*1 + 100) ) if the GPU CE were active for the whole duration of this memory phase.
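For completeness, the numbers above can be reproduced with the sketched function (cacheLine = 64B, footprint = 128B, hence 2 cache lines):

CACHE_LINE = 64   # bytes
footprint  = 128  # bytes read by the observed A57 task
lines = footprint // CACHE_LINE  # = 2
print(lines * cpu_line_latency_ns("A57", 0, False))  # 40 ns, no interference
print(lines * cpu_line_latency_ns("A57", 1, False))  # 80 ns, one interfering core
print(lines * cpu_line_latency_ns("A57", 1, True))   # 280 ns, one core + GPU CE active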
Please let us know if you have any further questions.
Nacho & Nicola