<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-gb">
	<link rel="self" type="application/atom+xml" href="http://localhost/app.php/feed/topic/125" />

	<title>Tools and Benchmarks for Real-Time Systems</title>
	<subtitle>ECRTS Community Forum</subtitle>
	<link href="http://localhost/index.php" />
	<updated>2019-05-14T16:17:26+01:00</updated>

	<author><name><![CDATA[Tools and Benchmarks for Real-Time Systems]]></name></author>
	<id>http://localhost/app.php/feed/topic/125</id>

		<entry>
		<author><name><![CDATA[Nacho_S]]></name></author>
		<updated>2019-05-14T16:17:26+01:00</updated>

		<published>2019-05-14T16:17:26+01:00</published>
		<id>http://localhost/viewtopic.php?t=125&amp;p=251#p251</id>
		<link href="http://localhost/viewtopic.php?t=125&amp;p=251#p251"/>
		<title type="html"><![CDATA[Re: Memory contention model (example)]]></title>

		
		<content type="html" xml:base="http://localhost/viewtopic.php?t=125&amp;p=251#p251"><![CDATA[
<blockquote class="uncited"><div>Hello,<br>thank you very much for your answer!<br><br>I am currently analyzing the Jetson TX2 platform and have another question:<br>Does anyone eventually know if the GPU of Jetson TX2 has accessible hardware performance counters and if yes, how to access them?<br><br>Many thanks in advance.<br><br>Best regards</div></blockquote>Hi,<br><br>We have never profiled an application at that level but as far as I know you can access it by using this API <a href="https://docs.nvidia.com/cuda/cupti/index.html#r_overview" class="postlink">https://docs.nvidia.com/cuda/cupti/inde ... r_overview</a><br><br>Nacho<p>Statistics: Posted by <a href="http://localhost/memberlist.php?mode=viewprofile&amp;u=1594">Nacho_S</a> — Tue May 14, 2019</p><hr />
]]></content>
	</entry>
		<entry>
		<author><name><![CDATA[lkrupp]]></name></author>
		<updated>2019-05-07T14:25:25+01:00</updated>

		<published>2019-05-07T14:25:25+01:00</published>
		<id>http://localhost/viewtopic.php?t=125&amp;p=250#p250</id>
		<link href="http://localhost/viewtopic.php?t=125&amp;p=250#p250"/>
		<title type="html"><![CDATA[Re: Memory contention model (example)]]></title>

		
		<content type="html" xml:base="http://localhost/viewtopic.php?t=125&amp;p=250#p250"><![CDATA[
Hello,<br>thank you very much for your answer!<br><br>I am currently analyzing the Jetson TX2 platform and have another question:<br>Does anyone happen to know whether the GPU of the Jetson TX2 has accessible hardware performance counters and, if so, how to access them?<br><br>Many thanks in advance.<br><br>Best regards<p>Statistics: Posted by <a href="http://localhost/memberlist.php?mode=viewprofile&amp;u=26053">lkrupp</a> — Tue May 07, 2019</p><hr />
]]></content>
	</entry>
		<entry>
		<author><name><![CDATA[Nacho_S]]></name></author>
		<updated>2019-04-29T16:49:45+01:00</updated>

		<published>2019-04-29T16:49:45+01:00</published>
		<id>http://localhost/viewtopic.php?t=125&amp;p=249#p249</id>
		<link href="http://localhost/viewtopic.php?t=125&amp;p=249#p249"/>
		<title type="html"><![CDATA[Re: Memory contention model (example)]]></title>

		
		<content type="html" xml:base="http://localhost/viewtopic.php?t=125&amp;p=249#p249"><![CDATA[
<blockquote class="uncited"><div>Hi,<br>the document describing the memory latency measurements states that LMBench (LAT_MEM_RD as specified in detail above) in conjuction with a custom-made program is used. I am interested in how exactly the latency is measured. Therefore, my questions are:<br><br>1) Are the measurements on the Tegra platforms conducted using standard Linux for Tegra or was the Linux kernel adapted?<br>2) In case of standard L4T: Are timing primitives of the OS (like gettime() or clocks()) used and how do you account for the overhead of the OS?<br><br>Thank you in advance!<br><br>Best regards,<br><br>Lukas Krupp</div></blockquote>Hi,<br>&gt;&gt; 1) Are the measurements on the Tegra platforms conducted using standard Linux for Tegra or was the Linux kernel adapted?<br>The experiments have been done by using the standard Linux for Tegra.<br><br>&gt;&gt; 2) In case of standard L4T: Are timing primitives of the OS (like gettime() or clocks()) used and how do you account for the overhead of the OS?<br>We used OS timing primitives, unfortunately we do not take into account the OS overhead, however, we tried to minimize this overhead as much as possible by launching the tasks in isolation (with synthetic parameters as input), maximum OS priority, etc...<br>Nacho.<p>Statistics: Posted by <a href="http://localhost/memberlist.php?mode=viewprofile&amp;u=1594">Nacho_S</a> — Mon Apr 29, 2019</p><hr />
]]></content>
	</entry>
		<entry>
		<author><name><![CDATA[lkrupp]]></name></author>
		<updated>2019-04-25T07:15:21+01:00</updated>

		<published>2019-04-25T07:15:21+01:00</published>
		<id>http://localhost/viewtopic.php?t=125&amp;p=248#p248</id>
		<link href="http://localhost/viewtopic.php?t=125&amp;p=248#p248"/>
		<title type="html"><![CDATA[Re: Memory contention model (example)]]></title>

		
		<content type="html" xml:base="http://localhost/viewtopic.php?t=125&amp;p=248#p248"><![CDATA[
Hi,<br>the document describing the memory latency measurements states that LMBench (LAT_MEM_RD, as specified in detail above) is used in conjunction with a custom-made program. I am interested in how exactly the latency is measured. Therefore, my questions are:<br><br>1) Are the measurements on the Tegra platforms conducted using standard Linux for Tegra, or was the Linux kernel adapted?<br>2) In the case of standard L4T: are timing primitives of the OS (like clock_gettime() or clock()) used, and how do you account for the OS overhead?<br><br>Thank you in advance!<br><br>Best regards,<br><br>Lukas Krupp<p>Statistics: Posted by <a href="http://localhost/memberlist.php?mode=viewprofile&amp;u=26053">lkrupp</a> — Thu Apr 25, 2019</p><hr />
]]></content>
	</entry>
		<entry>
		<author><name><![CDATA[Nacho_S]]></name></author>
		<updated>2019-04-12T15:44:20+01:00</updated>

		<published>2019-04-12T15:44:20+01:00</published>
		<id>http://localhost/viewtopic.php?t=125&amp;p=247#p247</id>
		<link href="http://localhost/viewtopic.php?t=125&amp;p=247#p247"/>
		<title type="html"><![CDATA[Re: Memory contention model (example)]]></title>

		
		<content type="html" xml:base="http://localhost/viewtopic.php?t=125&amp;p=247#p247"><![CDATA[
Hi,<br><blockquote class="uncited"><div>Hi<br>I guess the baseline can be derived from the read/write latencies and the PU's frequency, but what about the K and sGPU parameters? Are there any model entities from which these values can be derived? I am working on an implementation and would like to make it as flexible as possible.</div></blockquote>You can find those values here: <a href="http://hercules2020.eu/wp-content/uploads/2017/03/D2.2_Detailed_Characterization_of_Platforms.pdf" class="postlink">http://hercules2020.eu/wp-content/uploa ... tforms.pdf</a><br>Please check figure 17 (page 22) for the A57 cores and figure 19 (page 24) for the Denver cores.<br><br>To derive those results, we used a memory latency benchmark called "LAT_MEM_RD" (<a href="http://www.bitmover.com/lmbench/lat_mem_rd.8.html" class="postlink">http://www.bitmover.com/lmbench/lat_mem_rd.8.html</a>)<br><br>Best,<br><br>Nacho<p>Statistics: Posted by <a href="http://localhost/memberlist.php?mode=viewprofile&amp;u=1594">Nacho_S</a> — Fri Apr 12, 2019</p><hr />
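<p><em class="text-italics">For illustration, a minimal sketch of the pointer-chasing idea behind LAT_MEM_RD. This is not the LMbench source: the real benchmark sweeps array sizes and strides and takes care to defeat hardware prefetching.</em></p><pre><code>/* Sketch: build a chain of dependent loads at a cache-line stride,
 * then time how long each load takes on average. Every load depends
 * on the previous one, so the loop measures load-to-use latency. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_BYTES (64 * 1024 * 1024)  /* larger than the last-level cache */
#define STRIDE      64                  /* one cache line */
#define ITERS       (1L << 24)

int main(void)
{
    size_t n = ARRAY_BYTES / sizeof(void *);
    size_t step = STRIDE / sizeof(void *);
    void **arr = malloc(n * sizeof(void *));

    /* Each element points one stride ahead, wrapping at the end. */
    for (size_t i = 0; i < n; i++)
        arr[i] = &arr[(i + step) % n];

    struct timespec t0, t1;
    void **p = arr;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Print p so the compiler cannot optimize the chase away. */
    printf("%.1f ns per load (%p)\n", ns / ITERS, (void *)p);
    free(arr);
    return 0;
}</code></pre>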
]]></content>
	</entry>
		<entry>
		<author><name><![CDATA[zero212]]></name></author>
		<updated>2019-04-10T13:34:31+01:00</updated>

		<published>2019-04-10T13:34:31+01:00</published>
		<id>http://localhost/viewtopic.php?t=125&amp;p=246#p246</id>
		<link href="http://localhost/viewtopic.php?t=125&amp;p=246#p246"/>
		<title type="html"><![CDATA[Re: Memory contention model (example)]]></title>

		
		<content type="html" xml:base="http://localhost/viewtopic.php?t=125&amp;p=246#p246"><![CDATA[
Hi<br>I guess the baseline can be derived from the read/write latencies and the PU's frequency, but what about the K and sGPU parameters? Are there any model entities from which these values can be derived? I am working on an implementation and would like to make it as flexible as possible.<p>Statistics: Posted by <a href="http://localhost/memberlist.php?mode=viewprofile&amp;u=25212">zero212</a> — Wed Apr 10, 2019</p><hr />
]]></content>
	</entry>
		<entry>
		<author><name><![CDATA[arne.hamann]]></name></author>
		<updated>2019-03-26T08:26:08+01:00</updated>

		<published>2019-03-26T08:26:08+01:00</published>
		<id>http://localhost/viewtopic.php?t=125&amp;p=240#p240</id>
		<link href="http://localhost/viewtopic.php?t=125&amp;p=240#p240"/>
		<title type="html"><![CDATA[Re: Memory contention model (example)]]></title>

		
		<content type="html" xml:base="http://localhost/viewtopic.php?t=125&amp;p=240#p240"><![CDATA[
The description of the memory contention model is now also included in the appendix of the challenge description that can be found <a href="https://www.ecrts.org/forum/viewtopic.php?f=43&amp;t=124&amp;sid=adc73952620de96706b7e167fc3639e6" class="postlink">here</a>.<p>Statistics: Posted by <a href="http://localhost/memberlist.php?mode=viewprofile&amp;u=708">arne.hamann</a> — Tue Mar 26, 2019</p><hr />
]]></content>
	</entry>
		<entry>
		<author><name><![CDATA[Nacho_S]]></name></author>
		<updated>2019-03-06T17:46:43+01:00</updated>

		<published>2019-03-06T17:46:43+01:00</published>
		<id>http://localhost/viewtopic.php?t=125&amp;p=235#p235</id>
		<link href="http://localhost/viewtopic.php?t=125&amp;p=235#p235"/>
		<title type="html"><![CDATA[Memory contention model (example)]]></title>

		
		<content type="html" xml:base="http://localhost/viewtopic.php?t=125&amp;p=235#p235"><![CDATA[
In the challenge, we ask you to derive a memory contention model for when more than one CPU core and/or the GPU is accessing memory at the same time. Take into account that:<br><ul><li>A task mapped to run on the GPU needs offloaded data, which is acquired through the copy engine (GPU CE). A GPU kernel can also output data to be copied back to the host.</li><li>CPU tasks are modeled with a Read/Compute/Write semantics, with "Read" and "Write" being 100% memory-bound operations, and</li><li>the GPU CE, the A57 cores, and the Denver cores have significantly different memory bandwidths, latencies, and sensitivities to memory interference.</li></ul><br>The following references measure the impact of memory interference:<br><br>    <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8247615" class="postlink">https://ieeexplore.ieee.org/stamp/stamp ... er=8247615</a> <br>    <a href="http://hercules2020.eu/wp-content/uploads/2017/03/D2.2_Detailed_Characterization_of_Platforms.pdf" class="postlink">http://hercules2020.eu/wp-content/uploa ... tforms.pdf</a><br><br>The idea is that the length of the memory phases (read and write) depends on how many other memory controller clients are accessing <strong class="text-strong">main</strong> memory at the same time.<br><br>Let us make the following assumptions for modeling interference:<br>We are given a taskset of CPU and GPU tasks.<br>The size of the buffers to read and to write is known in advance and is fixed for every instance of the periodic job.<br>On the GPU side, we only consider interference from Copy Engine data movements (modeled in Amalthea as "runnables").<br>A GPU CE data movement is a 100% memory-bound runnable.<br>Every memory access is modeled as a sequential access pattern.<br><br>Model:<br>The model we derive from the literature cited above describes what happens to read/write latencies when more than one CPU core is accessing main memory at the same time. It also accounts for increased latencies due to GPU CE activity during the observed time window.<br><br><strong class="text-strong"><span style="text-decoration:underline">For CPUs:</span></strong><br><strong class="text-strong">Lat(CPUtype,cacheLine)[ns] = baseline(CPUtype) + K(CPUtype)*#C + sGPU(CPUtype)*bGPU</strong><br><br>with:<br><strong class="text-strong">Lat(x,y)</strong> = time needed to read or write a cache line (64B) from main memory to CPU registers.<br><strong class="text-strong">CPUtype</strong> = the observed CPU core: A57 or Denver.<br><strong class="text-strong">baseline</strong> = time taken to read or write a cache line (64B) from main memory to CPU registers in isolation (no interference).<br>baseline(A57) = 20 ns<br>baseline(Denver) = 8 ns<br><br><strong class="text-strong">K(CPUtype)</strong> = increase in latency caused by a single interfering core. Note: it does not matter whether the interfering core is a Denver or an A57; this number depends only on the observed CPU core (CPUtype).<br>K(A57) = 20 ns<br>K(Denver) = 2 ns<br><strong class="text-strong">#C</strong> = number of interfering cores, ranging from 0 to 5 (one of the six cores is the observed core; 0 means no interference from other CPUs).<br><br><strong class="text-strong">sGPU</strong> = sensitivity to GPU CE activity. This represents the increase in latency while the GPU is performing operations on the copy engine.<br>sGPU(A57) = 100 ns<br>sGPU(Denver) = 20 ns<br><br><strong class="text-strong">bGPU</strong> = boolean: 1 if the GPU is operating the copy engine, 0 otherwise.<br><br><span style="text-decoration:underline"><strong class="text-strong">For the GPU copy engine:</strong></span><br><strong class="text-strong">Lat(memcpy,64B) = GPUbaseline + 0.5*#C</strong><br><br><strong class="text-strong">Lat(memcpy,64B)</strong> = time taken to transfer 64B using the copy engine (cudaMemcpy).<br><strong class="text-strong">GPUbaseline</strong> = 3 ns: the time taken to transfer 64B using the copy engine with no interfering CPUs.<br>Each CPU core active in the same time window as the CE operation increases the baseline by half a nanosecond.<br><br>Numerical example:<br>A task mapped on an A57 core has a memory footprint (read) of 128B.<br>If no other CPU is accessing memory (#C=0) and the GPU CE is idle (bGPU=0), then the time necessary to perform the read operation over the working set is:<br>128/cacheLine * Lat(A57,cacheLine) = 2*20 = 40 ns.<br>If one other core is active (regardless of whether it is a Denver or an A57):<br>128/cacheLine * Lat(A57,cacheLine) = 2*(20 + 20*1 + 0) = 80 ns.<br>This would increase to 2*(20 + 20*1 + 100) = 280 ns if the GPU CE were active for the whole duration of this memory phase.<br><br>Please let us know if you have any further questions.<br><br><em class="text-italics">Nacho &amp; Nicola</em><p>Statistics: Posted by <a href="http://localhost/memberlist.php?mode=viewprofile&amp;u=1594">Nacho_S</a> — Wed Mar 06, 2019</p><hr />
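<p><em class="text-italics">For illustration, here is the model above written out as a small C program; the constants are exactly those given in the post, while the type and function names are only illustrative.</em></p><pre><code>/* Sketch of the contention model described in the post above. */
#include <stdio.h>

#define CACHE_LINE 64.0  /* bytes */

typedef enum { A57, DENVER } cpu_t;

/* Constants from the post (all in nanoseconds). */
static const double baseline[] = { [A57] = 20.0,  [DENVER] = 8.0 };
static const double K[]        = { [A57] = 20.0,  [DENVER] = 2.0 };
static const double sGPU[]     = { [A57] = 100.0, [DENVER] = 20.0 };

/* Lat(CPUtype,cacheLine)[ns] = baseline(CPUtype) + K(CPUtype)*#C + sGPU(CPUtype)*bGPU */
static double cpu_lat_ns(cpu_t cpu, int nC, int bGPU)
{
    return baseline[cpu] + K[cpu] * nC + sGPU[cpu] * bGPU;
}

/* Duration of a 100% memory-bound read or write phase of `bytes` bytes. */
static double phase_ns(cpu_t cpu, double bytes, int nC, int bGPU)
{
    return bytes / CACHE_LINE * cpu_lat_ns(cpu, nC, bGPU);
}

/* Lat(memcpy,64B) = GPUbaseline + 0.5*#C */
static double ce_lat_ns(int active_cores)
{
    return 3.0 + 0.5 * active_cores;
}

int main(void)
{
    /* Reproduces the numerical example: a 128B read on an A57 core. */
    printf("isolation:          %.0f ns\n", phase_ns(A57, 128, 0, 0)); /*  40 */
    printf("one interferer:     %.0f ns\n", phase_ns(A57, 128, 1, 0)); /*  80 */
    printf("interferer + CE:    %.0f ns\n", phase_ns(A57, 128, 1, 1)); /* 280 */
    printf("CE, 2 active cores: %.1f ns per 64B\n", ce_lat_ns(2));     /* 4.0 */
    return 0;
}</code></pre>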
]]></content>
	</entry>
	</feed>
