Meeting notes

October 4, 2010

Current progress:

  • Brandon investigated potential performance limiters in Nehalem and in an ARM core (a microbenchmark sketch for probing outstanding misses follows this list):
    • Nehalem allows 10 outstanding data misses per core
    • 32 outstanding L3/memory loads? 16 outstanding L3/memory stores?
    • 8 outstanding prefetches? The docs weren't clear about prefetches.
    • The ARM Brandon looked at (don't remember which) allows something like 3 outstanding misses per core.
  • Jacob is still looking at benchmarks/applications; he checked some into a new Git repository and will send a pointer.
  • Andrew is still looking at minimal context switches and Qthreads
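
A sketch of one way to probe the outstanding-miss limit (all sizes, counts, and the ~10-miss expectation are assumptions to be checked, not measured facts): chase k independent pointer chains through an array much larger than the caches. The k loads per step are independent, so the hardware can overlap them up to its outstanding-miss limit; time per step should stay near one memory latency as k grows toward that limit, then climb.

  /* mlp.c -- sketch; compile: gcc -O2 mlp.c (add -lrt on older glibc) */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define NODES ((size_t)1 << 24)   /* 16M entries * 8 B = 128 MB, >> L3 */
  #define STEPS (1L << 20)
  #define MAXK  16

  int main(void) {
      size_t *next = malloc(NODES * sizeof *next);
      for (size_t i = 0; i < NODES; i++) next[i] = i;
      srand(1);                     /* rand() is crude but fine for a sketch */
      for (size_t i = NODES - 1; i > 0; i--) {  /* Sattolo: one big cycle */
          size_t j = (size_t)rand() % i;
          size_t t = next[i]; next[i] = next[j]; next[j] = t;
      }
      for (int k = 1; k <= MAXK; k *= 2) {
          size_t cur[MAXK];
          for (int c = 0; c < k; c++) cur[c] = c * (NODES / MAXK);
          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (long s = 0; s < STEPS; s++)
              for (int c = 0; c < k; c++)
                  cur[c] = next[cur[c]];        /* k misses in flight */
          clock_gettime(CLOCK_MONOTONIC, &t1);
          double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                    + (t1.tv_nsec - t0.tv_nsec);
          size_t sum = 0;                       /* keep the loops live */
          for (int c = 0; c < k; c++) sum += cur[c];
          printf("chains=%2d  %7.1f ns/step  (checksum %zu)\n",
                 k, ns / STEPS, sum);
      }
      free(next);
      return 0;
  }

If ns/step stays roughly flat up to k~10 and then rises, that's consistent with the documented figure; the same program on the ARM should flatten much earlier.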

Issues to watch out for:

  • if thread stacks are all aligned the same way, L1 associativity may cause conflict misses (see the sketch below)
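
One mitigation to try, sketched below. The cache geometry (32 KB, 8-way, 64 B lines, so 4 KB per way) is an assumption for illustration, not a measured parameter: stagger each thread's stack by a different cache-line offset so that identical frame offsets in different threads index different L1 sets.

  #include <stdlib.h>

  #define STACK_SIZE  (64 * 1024)
  #define CACHE_LINE  64
  #define L1_WAY_SPAN (4 * 1024)    /* assumed: 64 sets * 64 B lines */

  /* Return a stack base offset by (id * line) mod one way's span, so
   * thread stacks don't all map to the same L1 sets.  A real allocator
   * would also keep the raw pointer around for free(). */
  void *alloc_stack(int thread_id) {
      void *raw = NULL;
      size_t pad = ((size_t)thread_id * CACHE_LINE) % L1_WAY_SPAN;
      if (posix_memalign(&raw, STACK_SIZE, STACK_SIZE + L1_WAY_SPAN))
          return NULL;
      return (char *)raw + pad;
  }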

Current thoughts on methodology:

  • we'll take 2.5 approaches:
    • what's a lower bound on performance?
      • (basically, build a system and make it run faster)
      • write some benchmarks using green threads packages
      • hack the benchmarks and green-threads packages for more performance
    • what's an upper bound on performance? What limits it?
      • First, "guess" at likely performance limiters in available hardware
        • build some synthetic kernels to exercise what we think are performance limiters
        • then, use performance counters to validate that limiters are behaving as we expect
      • Then, validate that these limiters are actually the problem.
        • Given a particular memory hierarchy (cache structure, link bandwidths/latencies, queue sizes), how fast can we execute a particular memory trace?
        • capture memory trace from benchmark on real system
        • find cache parameters, link bandwidths/latencies, queue sizes for real system
        • modify memory hierarchy model (Ruby or Sesc?) to match this system
        • replay memory trace through model, verifying that we get similar event counts to real hardware (a toy replay model is sketched after this list).
        • expand the limiting parameters. Do we get a speedup? If so, we were right about the limit.
        • repeat until the parameters are absurd. What's a reasonable design (set of parameters) that gives good performance?
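
To make the trace-replay step concrete, here is a toy single-level set-associative cache model with LRU replacement: it replays a trace of hex addresses from stdin and reports hit/miss counts. It is only a sketch; the real work would use Ruby or SESC as noted above, and the geometry constants are placeholders, not the real machine's.

  /* cachesim.c -- sketch; trace format assumed: one hex address per line */
  #include <inttypes.h>
  #include <stdio.h>

  #define LINE_BITS 6    /* 64 B lines */
  #define NSETS     64   /* 64 sets * 8 ways * 64 B = 32 KB (placeholder) */
  #define NWAYS     8

  static uint64_t tags[NSETS][NWAYS];   /* tags stored +1; 0 means empty */
  static uint64_t stamp[NSETS][NWAYS];  /* larger = more recently used */
  static uint64_t tick, hits, misses;

  static void cache_access(uint64_t addr) {
      uint64_t line = addr >> LINE_BITS;
      uint64_t set = line % NSETS, tag = line / NSETS + 1;
      int victim = 0;
      for (int w = 0; w < NWAYS; w++) {
          if (tags[set][w] == tag) {            /* hit: bump LRU stamp */
              hits++; stamp[set][w] = ++tick; return;
          }
          if (stamp[set][w] < stamp[set][victim]) victim = w;
      }
      misses++;                                 /* miss: evict LRU way */
      tags[set][victim] = tag;
      stamp[set][victim] = ++tick;
  }

  int main(void) {
      uint64_t addr;
      while (scanf("%" SCNx64, &addr) == 1) cache_access(addr);
      printf("hits %" PRIu64 "  misses %" PRIu64 "\n", hits, misses);
      return 0;
  }

Comparing these counts against hardware event counts for the same trace is the validation step; widening NSETS/NWAYS and re-running is the "expand limiting parameters" step.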


Next steps:

  • keep thinking about methodology. Jacob and Simon will talk more.
  • experiment with performance counters on the Nehalem system to validate the info in the docs: how many outstanding memory requests per core? per chip?
  • continue exploring applications and benchmarks.
  • continue exploring green threads and minimal context switches (see the context-switch sketch below).
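
For the context-switch angle, a useful baseline is plain ucontext switches; a hand-rolled switch that saves only callee-saved registers should beat it. A minimal sketch (Linux; iteration count arbitrary):

  /* switch.c -- sketch; compile: gcc -O2 switch.c (add -lrt on older glibc) */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <ucontext.h>

  #define ITERS (1L << 20)
  #define STACK_SIZE (64 * 1024)

  static ucontext_t main_ctx, fiber_ctx;

  static void fiber_fn(void) {
      for (long i = 0; i < ITERS; i++)
          swapcontext(&fiber_ctx, &main_ctx);   /* yield back to main */
  }

  int main(void) {
      getcontext(&fiber_ctx);
      fiber_ctx.uc_stack.ss_sp = malloc(STACK_SIZE);
      fiber_ctx.uc_stack.ss_size = STACK_SIZE;
      fiber_ctx.uc_link = &main_ctx;
      makecontext(&fiber_ctx, fiber_fn, 0);

      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (long i = 0; i < ITERS; i++)
          swapcontext(&main_ctx, &fiber_ctx);   /* two switches per trip */
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                + (t1.tv_nsec - t0.tv_nsec);
      printf("%.1f ns per one-way switch\n", ns / (2.0 * ITERS));
      free(fiber_ctx.uc_stack.ss_sp);
      return 0;
  }

On Linux, swapcontext also saves and restores the signal mask via a syscall, so this number is an upper bound on what a minimal user-level switch could achieve.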

Other questions to answer:

  • what platforms should we buy?
  • investigate potential talks?
    • jace? (extended memory semantics)
    • Qthreads guy?


September 30, 2010

Current progress:

  • We can get accounts on the PNNL XMT; Simon sent out an email.
  • An XMT simulator is also available.
  • Simon created a wiki for the project and sent account details to everybody.
  • Simon has been trying to get details of applications from various people, but has not yet been successful. Graph connectivity is the basic area. There is an effort to create a graph benchmark, http://www.graph500.org/, with source in the Graph 500 Git repository.

Thoughts on methodology:

  • We should target an abstract model, not the real ISA.
  • We will focus on performance for a single node for now.
  • It may be possible to modify processor microcode. Could that be useful?
  • Some questions we're interested in exploring:
    • How many thread contexts can we run concurrently on a modern Xeon before memory performance degrades? (a thread-scaling sketch follows this list)
    • What resource constraints in existing processors limit our performance?
    • Could we take a trace of memory operations, filter out stack references, and replay these into a simple architectural simulator to explore what resources are necessary to get performance?
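
As a starting point for the thread-context question, a sketch (pthreads; every size here is an illustrative assumption): run N threads, each chasing a private pointer chain that misses the caches, and report aggregate load throughput. The knee as N grows suggests where memory performance starts to degrade.

  /* scale.c -- sketch; compile: gcc -O2 -pthread scale.c */
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define NODES ((size_t)1 << 22)   /* 32 MB of size_t per thread */
  #define STEPS (1L << 20)
  #define MAXTHREADS 64

  static void *worker(void *arg) {
      size_t *next = arg;
      size_t cur = 0;
      for (long s = 0; s < STEPS; s++) cur = next[cur];
      return (void *)cur;           /* defeat dead-code elimination */
  }

  static size_t *make_chain(void) {
      size_t *next = malloc(NODES * sizeof *next);
      for (size_t i = 0; i < NODES; i++) next[i] = i;
      for (size_t i = NODES - 1; i > 0; i--) {  /* Sattolo: one big cycle */
          size_t j = (size_t)rand() % i;
          size_t t = next[i]; next[i] = next[j]; next[j] = t;
      }
      return next;
  }

  int main(int argc, char **argv) {
      int n = argc > 1 ? atoi(argv[1]) : 4;     /* thread count to test */
      if (n < 1) n = 1;
      if (n > MAXTHREADS) n = MAXTHREADS;
      pthread_t tid[MAXTHREADS];
      size_t *chains[MAXTHREADS];
      for (int i = 0; i < n; i++) chains[i] = make_chain();

      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (int i = 0; i < n; i++) pthread_create(&tid[i], NULL, worker, chains[i]);
      for (int i = 0; i < n; i++) pthread_join(tid[i], NULL);
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
      printf("%d threads: %.1f M loads/s aggregate\n",
             n, n * (double)STEPS / sec / 1e6);
      for (int i = 0; i < n; i++) free(chains[i]);
      return 0;
  }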

We'll explore a couple angles immediately:

  • look at green threads packages: Andrew is doing this already.
  • look at applications/benchmarks: Jacob will start here.
  • look at resource constraints in Intel and ARM (maybe AMD): Brandon is looking into this.