Meeting notes
From SoftXMT
October 4, 2010
Current progress:
- Brandon investigated potential performance limiters in Nehalem and in an ARM processor:
- Nehalem allows 10 outstanding data misses per core
- 32 outstanding L3/memory loads? 16 outstanding L3/memory stores?
- 8 outstanding prefetches? The docs weren't clear about prefetches.
- The ARM Brandon looked at (don't remember which) allows something like 3 outstanding misses per core.
- Jacob is still looking at benchmarks/applications. Jacob checked some into a new Git repository; he'll send a pointer.
- Andrew is still looking at minimal context switches and Qthreads
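A rough way to see why the outstanding-miss limits matter: by Little's law, sustained per-core miss bandwidth is bounded by (outstanding misses x line size) / latency. The sketch below is only a back-of-envelope estimate. The 64-byte line size and the 100 ns memory latency are assumptions, not numbers from the docs; only the miss counts (10 for Nehalem, roughly 3 for the ARM) come from the notes above.

```python
def max_bandwidth_bytes_per_sec(outstanding_misses, line_bytes, latency_ns):
    """Little's law bound: bytes in flight divided by the time one miss takes."""
    if latency_ns <= 0:
        raise ValueError("latency_ns must be positive")
    return outstanding_misses * line_bytes * 1e9 / latency_ns


LINE_BYTES = 64      # assumption: cache line size in bytes
LATENCY_NS = 100.0   # assumption: hypothetical loaded memory latency

for name, misses in (("Nehalem", 10), ("ARM", 3)):
    bw = max_bandwidth_bytes_per_sec(misses, LINE_BYTES, LATENCY_NS)
    print(f"{name}: {misses} outstanding misses -> {bw / 1e9:.2f} GB/s per core")
```

Under these assumed numbers the bound is about 6.4 GB/s per core for 10 misses and about 1.9 GB/s for 3, so the bound scales linearly with the number of misses the core can keep in flight.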
Issues to watch out for:
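- if thread stacks are all aligned to the same boundary, the same offsets in each stack map to the same L1 cache sets, so limited L1 associativity may cause conflict misses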
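To illustrate the alignment concern, here is a small sketch of how addresses map to L1 sets. The cache geometry (64-byte lines, 64 sets, as in a 32 KiB 8-way cache), the 16 KiB stack size, and the base address are all assumptions chosen for illustration; `l1_set_index` and `sets_touched` are hypothetical helper names.

```python
def l1_set_index(addr, line_bytes=64, num_sets=64):
    """Set index for a simple physically/virtually indexed cache model."""
    return (addr // line_bytes) % num_sets


def sets_touched(stack_bases, offset, line_bytes=64, num_sets=64):
    """Distinct L1 sets hit when every thread touches the same stack offset."""
    return {l1_set_index(base + offset, line_bytes, num_sets) for base in stack_bases}


STACK_BYTES = 16 * 1024  # assumption: hypothetical per-thread stack size
NUM_THREADS = 16

# Stacks aligned to their own size: every base maps to the same set.
aligned = [0x10000000 + i * STACK_BYTES for i in range(NUM_THREADS)]
# One possible mitigation: shift each stack by a different number of lines.
staggered = [base + i * 64 for i, base in enumerate(aligned)]

print("aligned stacks touch", len(sets_touched(aligned, 0)), "set(s)")
print("staggered stacks touch", len(sets_touched(staggered, 0)), "set(s)")
```

In this model the 16 aligned stacks all land in one set, which is more lines than an 8-way set can hold, while staggering the bases spreads them over 16 sets.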
Current thoughts on methodology:
- we'll take 2.5 approaches:
- what's a lower bound on performance?
- (basically, build a system and make it run faster)
- write some benchmarks using green threads packages
- hack on the benchmarks and green threads packages for more performance
- what's an upper bound on performance? What limits it?
- First, "guess" at likely performance limiters in available hardware
- build some synthetic kernels to exercise what we think are performance limiters
- then, use performance counters to validate that limiters are behaving as we expect
- Then, validate that these limiters are actually the problem.
- Given a particular memory hierarchy (cache structure, link bandwidths/latencies, queue sizes), how fast can we execute a particular memory trace?
- capture memory trace from benchmark on real system
- find cache parameters, link bandwidths/latencies, queue sizes for real system
- modify memory hierarchy model (Ruby or Sesc?) to match this system
- replay memory trace through model, verifying that we get similar event counts to real hardware.
- expand the limiting parameters. Do we get a speedup? If so, we were right about the limit.
- repeat until the parameters are absurd. What's a reasonable design (set of parameters) that gives good performance?
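The "expand the limiting parameter and look for a speedup" loop can be sketched with a toy replay model. This is not Ruby or Sesc, just a minimal stand-in under strong assumptions: the trace is reduced to a count of independent misses, every miss takes a fixed latency, at most one miss issues per cycle, and the only resource modeled is the number of misses in flight. The function name and all parameter values are hypothetical.

```python
import heapq


def replay_cycles(num_misses, max_outstanding, latency_cycles):
    """Cycles until the last of num_misses independent misses completes."""
    if max_outstanding < 1:
        raise ValueError("max_outstanding must be at least 1")
    in_flight = []  # min-heap of completion times for misses in flight
    now = 0
    for _ in range(num_misses):
        if len(in_flight) == max_outstanding:
            # All miss slots are busy: stall until the earliest one completes.
            now = max(now, heapq.heappop(in_flight))
        heapq.heappush(in_flight, now + latency_cycles)
        now += 1  # at most one issue per cycle
    return max(in_flight) if in_flight else 0


# Expand the suspected limiter and watch for diminishing returns.
baseline = replay_cycles(100, 10, 200)
for slots in (10, 20, 100, 1000):
    cycles = replay_cycles(100, slots, 200)
    print(f"{slots:5d} slots: {cycles} cycles, speedup {baseline / cycles:.2f}x")
```

In this toy setup, going from 10 to 20 slots nearly halves the cycle count, while going from 100 to 1000 changes nothing, which is the signal that the limiter has moved elsewhere (here, to issue rate and latency).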
Next steps:
- keep thinking about methodology. Jacob and Simon will talk more.
- experiment with performance counters on the Nehalem system to validate the info in the docs. How many outstanding memory requests per core? Per chip?
- continue exploring applications and benchmarks.
- continue exploring green threads, minimal context switch.
Other questions to answer:
- what platforms should we buy?
- investigate potential talks?
- jace? (extended memory semantics)
- qthreads guy?
September 30, 2010
Current progress:
- We can get accounts on the PNNL XMT; Simon sent out an email.
- An XMT simulator is also available.
- Simon created a wiki for the project and sent account details to everybody.
- Simon has been trying to get details of applications from various people, but has not yet been successful. Graph connectivity is the basic area. There is an effort to create a graph benchmark, Graph 500 (http://www.graph500.org/), with source available in the Graph 500 Git repository.
Thoughts on methodology:
- We should target an abstract model, not the real ISA.
- We will focus on performance for a single node for now.
- It may be possible to modify processor microcode. Could that be useful?
- Some questions we're interested in exploring:
- How many thread contexts can we run concurrently on a modern Xeon before memory performance degrades?
- What resource constraints in existing processors limit our performance?
- Could we take a trace of memory operations, filter out stack references, and replay these into a simple architectural simulator to explore what resources are necessary to get performance?
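The stack-filtering step in the last question could look something like the sketch below. The trace record format (an operation tag plus an address) and the stack address ranges are assumptions made for illustration; a real trace would need the actual stack bounds of each thread.

```python
def filter_stack_refs(trace, stack_ranges):
    """Yield (op, addr) records whose address lies outside every stack range.

    stack_ranges is a list of half-open (low, high) address intervals,
    one per thread stack.
    """
    for op, addr in trace:
        if not any(low <= addr < high for low, high in stack_ranges):
            yield op, addr


# Hypothetical trace: two stack references and two heap/global references.
trace = [
    ("R", 0x7FFF0010),
    ("W", 0x00601040),
    ("R", 0x7FFF0FF8),
    ("R", 0x00601080),
]
stacks = [(0x7FFF0000, 0x7FFF1000)]  # assumption: one 4 KiB stack

kept = list(filter_stack_refs(trace, stacks))
for op, addr in kept:
    print(op, hex(addr))
```

The filtered records are what would then be replayed into the architectural simulator.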
We'll explore a couple angles immediately:
- look at green threads packages: Andrew is doing this already.
- look at applications/benchmarks: Jacob will start here.
- look at resource constraints in Intel and ARM (maybe AMD): Brandon is looking into this.