Meeting notes
November 15, 2010
- New machine is up, prefetchers switched off
- 2 MB huge pages enabled; we think we know how to use them
- (old P4 Xeon cluster is alive again, too)
- Brandon reran the linked list traversal on the new machine (a sketch of this style of benchmark follows these notes)
- Single-socket max BW is what we expect
- Single-socket BW seems to plateau between 24 and 36 outstanding misses, no matter which way we slice it (2 threads with 12 misses each, or 12 threads with 2 misses each)
- This fits with our model of the processor (at most 32 slots in global queue for local reads)
- This leads us back to the SIMD idea: with a 12-thread or 16-thread processor, 2-way static SIMD might be enough to saturate bandwidth without context switching.
- todo: reslice the data to show max bandwidth by total outstanding requests
- todo: want to see BW for a single thread all the way to 50 concurrent misses, to verify there is a plateau
- two-socket bandwidth is not what we expect (same as single socket). Suspicion: the NUMA layout is not behaving how we expect. Will explore further.
- Mark suggests, if we want to avoid TLB problems altogether:
- limit Linux to 1 GB, set up the page map, and manage the rest of memory yourself (as done by Gridiron Systems)
- Jacob is grafting a timing model into a cache simulator to help explain Andrew's code
- Andrew is investigating his code
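The traversal code itself isn't in these notes. For reference, a minimal sketch of this style of benchmark, with illustrative names and sizes rather than Brandon's actual code: each thread walks several independent lists in lockstep, so its outstanding-miss count is simply the number of chains, and the lists are backed by 2 MB huge pages to keep TLB refills out of the picture.

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    #define CHAINS 12          /* independent chains = outstanding misses per thread */
    #define NODES  (1 << 20)   /* nodes per chain (64 MB per chain at 64 B/node) */

    typedef struct node {
        struct node *next;
        char pad[64 - sizeof(struct node *)];   /* one node per cache line */
    } node_t;

    /* Back each list with 2 MB huge pages to minimize TLB refills;
       fall back to normal pages if hugetlb isn't configured. */
    static node_t *alloc_chain(void) {
        size_t bytes = NODES * sizeof(node_t);
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED)
            p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (node_t *)p;
    }

    /* Link the nodes in a random permutation so the hardware can't
       guess the next address (prefetchers are off anyway). */
    static node_t *build_chain(node_t *n) {
        size_t i, j, t;
        size_t *perm = malloc(NODES * sizeof(size_t));
        for (i = 0; i < NODES; i++) perm[i] = i;
        for (i = NODES - 1; i > 0; i--) {
            j = rand() % (i + 1);
            t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (i = 0; i + 1 < NODES; i++)
            n[perm[i]].next = &n[perm[i + 1]];
        n[perm[NODES - 1]].next = &n[perm[0]];
        t = perm[0]; free(perm);
        return &n[t];
    }

    int main(void) {
        node_t *p[CHAINS];
        size_t step;
        int c;
        for (c = 0; c < CHAINS; c++)
            p[c] = build_chain(alloc_chain());
        /* One dependent load per chain per step; the CHAINS loads are
           mutually independent, so this thread keeps ~CHAINS cache
           misses in flight at once. Time this loop to get bandwidth. */
        for (step = 0; step < NODES; step++)
            for (c = 0; c < CHAINS; c++)
                p[c] = p[c]->next;
        return (int)(p[0] == p[1]);   /* keep the traversal live */
    }

Varying CHAINS and the thread count while timing the traversal loop gives the BW-vs-outstanding-misses curves discussed above.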
November 8, 2010
- Andrew discussed his context switching approach
- he sees speedups, but not what Simon expected
- next step: explain where speedups are coming from
November 1, 2010
- looked at the poster made last week for affiliates
- interesting conversations with Facebook; follow up in the future
- Convey training
- FPGAs sit on the front-side bus; seems like it could be useful
- talked to Allan Porterfield about pchase bandwidth limits
- he feels the bottleneck is DRAM page switching times
- linked list traversal status
- modified the allocator to minimize TLB refills; this gives us results similar to pchase
- we max out at 60% of theoretical peak bandwidth
- this requires ~6 concurrent requests with 4 threads, or ~3 concurrent requests with 8 threads
- global queue contention is minimal at this level
- latencies vary from ~30ns to ~100ns at peak bandwidth
- an XMT node can issue ~100M memory ops/s to the network; is there a commodity interconnect that can sustain this kind of rate?
- Andrew is still working on a graph benchmark
- discussed static scheduling instead of context switches (see the sketch after these notes)
- if we need only a few outstanding requests, this might be simpler
- for the single-socket case, it seems reasonable
- for the multi-node case, the thread model has benefits
- next time
- Andrew will have some sort of graph benchmark, and will discuss his approach
- Brandon/Jacob will have a more solid idea of where our bottleneck is
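To make the static-scheduling idea concrete, a sketch of the 2-way case (illustrative, not project code): one OS thread advances two independent list cursors in the same loop body, so the two misses overlap in the memory system with no switch overhead at all.

    #include <stddef.h>

    typedef struct node { struct node *next; long value; } node_t;

    /* Two traversal "lanes" statically scheduled into one loop body.
       The two loads are independent, so both misses are in flight at
       once; no context switch is needed to hide either latency. */
    long sum_two_way(node_t *a, node_t *b, size_t steps) {
        long sum = 0;
        while (steps--) {
            a = a->next;
            b = b->next;
            sum += a->value + b->value;
        }
        return sum;
    }

As noted above, the lane count is fixed at compile time, which is fine for the single-socket case but is part of why the thread model looks better for multi-node.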
October 25, 2010
- discussed porterfield paper
- they use pchase, a linked list traversal microbenchmark like the one we've discussed
- they run on a quad-core Xeon like ours, with more memory
- they get about 54% of theoretical peak. Why?
- Kyle Wheeler, the Qthreads developer, was here
- he shared some new context switch code that should be faster (see the baseline sketch below)
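Kyle's code isn't reproduced here. For context, the portable baseline any faster switch gets compared against is POSIX ucontext, roughly as follows; swapcontext() saves and restores the full register set plus the signal mask (a system call), which is precisely the overhead hand-tuned switch code tries to avoid.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, co_ctx;

    static void coroutine(void) {
        for (int i = 0; i < 3; i++) {
            printf("coroutine: %d\n", i);
            swapcontext(&co_ctx, &main_ctx);   /* yield back to main */
        }
    }

    int main(void) {
        static char stack[64 * 1024];
        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = stack;
        co_ctx.uc_stack.ss_size = sizeof stack;
        co_ctx.uc_link = &main_ctx;            /* return here when done */
        makecontext(&co_ctx, coroutine, 0);
        for (int i = 0; i < 3; i++) {
            printf("main: %d\n", i);
            swapcontext(&main_ctx, &co_ctx);   /* resume coroutine */
        }
        return 0;
    }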
October 18, 2010
- Jase was here
- discussed synchronization using microcode modification
October 11, 2010
October 4, 2010
Current progress:
- Brandon investigated potential performance limiters in Nehalem and in an ARM processor
- Nehalem allows 10 outstanding data misses per core
- 32 outstanding L3/memory loads? 16 outstanding L3/memory stores?
- 8 outstanding prefetches? The docs weren't clear about prefetches.
- The ARM Brandon looked at (we don't remember which) allows something like 3 outstanding misses per core.
- Jacob is still looking at benchmarks/applications. Jacob checked some into a new Git repository; he'll send a pointer.
- Andrew is still looking at minimal context switches and Qthreads
Issues to watch out for:
- if thread stacks are all identically aligned, the hot words near each stack top map to the same L1 sets, so associativity limits may cause conflict misses (see the sketch below)
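One standard mitigation, sketched below under assumed sizes (hypothetical, not project code): over-allocate each stack and skew its start by a per-thread multiple of the cache line size, spreading the hot stack tops across different L1 sets.

    #include <stdlib.h>

    #define STACK_BYTES (64 * 1024)
    #define LINE        64     /* cache line size */
    #define SKEW_SLOTS  64     /* spread stack tops across 64 lines */

    /* Hypothetical allocator: without the skew, power-of-two-aligned
       stacks would put every thread's hot frame in the same L1 set,
       thrashing an 8-way cache as soon as >8 threads are live.
       (Freeing the raw allocation is elided in this sketch.) */
    void *alloc_skewed_stack(int thread_id) {
        char *raw = malloc(STACK_BYTES + SKEW_SLOTS * LINE);
        return raw ? raw + (thread_id % SKEW_SLOTS) * LINE : NULL;
    }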
Current thoughts on methodology:
- we'll take 2.5 approaches:
- what's a lower bound on performance?
- (basically, build a system and make it run faster)
- write some benchmarks using green threads packages
- hack on the benchmarks and green threads packages for more performance
- what's an upper bound on performance? What limits it?
- First, "guess" at likely performance limiters in available hardware
- build some synthetic kernels to exercise what we think are performance limiters
- then use performance counters to validate that the limiters are behaving as we expect
- then validate that these limiters are actually the problem
- Given a particular memory hierarchy (cache structure, link bandwidths/latencies, queue sizes), how fast can we execute a particular memory trace?
- capture memory trace from benchmark on real system
- find cache parameters, link bandwidths/latencies, queue sizes for real system
- modify memory hierarchy model (Ruby or Sesc?) to match this system
- replay the memory trace through the model, verifying that we get event counts similar to real hardware (a toy sketch of this replay step follows this list)
- expand the limiting parameters. Do we get a speedup? If so, we were right about the limit.
- repeat until the parameters are absurd. What's a reasonable design (set of parameters) that gives good performance?
- First, "guess" at likely performance limiters in available hardware
- what's a lower bound on performance?
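To make the replay step concrete, a toy single-level cache model (illustrative only; the plan above is to reuse Ruby or Sesc). It consumes an assumed textual trace of "R 0xADDR" / "W 0xADDR" records on stdin and reports hit/miss totals that could be checked against hardware counter readings.

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    #define LINE_BITS 6       /* 64-byte lines */
    #define SETS      512     /* 512 sets x 8 ways x 64 B = 256 KB */
    #define WAYS      8

    static uint64_t tags[SETS][WAYS];   /* 0 = invalid */
    static long     used[SETS][WAYS];   /* larger = more recently used */
    static long     tick;

    static int access_line(uint64_t addr) {
        uint64_t line = (addr >> LINE_BITS) + 1;   /* +1 keeps 0 as "invalid" */
        int set = (int)(line % SETS), w, victim = 0;
        for (w = 0; w < WAYS; w++) {
            if (tags[set][w] == line) { used[set][w] = ++tick; return 1; }
            if (used[set][w] < used[set][victim]) victim = w;
        }
        tags[set][victim] = line;                  /* miss: fill the LRU way */
        used[set][victim] = ++tick;
        return 0;
    }

    int main(void) {
        char rw; uint64_t addr;
        long hits = 0, misses = 0;
        while (scanf(" %c %" SCNx64, &rw, &addr) == 2)
            access_line(addr) ? hits++ : misses++;
        printf("hits %ld  misses %ld\n", hits, misses);
        return 0;
    }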
Next steps:
- keep thinking about methodology. Jacob and Simon will talk more.
- experiment with performance counters on the Nehalem system to validate the info in the docs (see the sketch below). How many outstanding memory requests per core? Per chip?
- continue exploring applications and benchmarks.
- continue exploring green threads, minimal context switch.
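For the performance counter experiments, one self-contained option on Linux is the perf_event_open interface; a minimal sketch counting cache misses around a region of interest is below. The generic PERF_COUNT_HW_CACHE_MISSES event is a stand-in: validating the outstanding-request limits would need the model-specific offcore events from the Intel docs.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static long perf_open(struct perf_event_attr *a) {
        /* this thread, any CPU */
        return syscall(SYS_perf_event_open, a, 0, -1, -1, 0);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof attr;
        attr.config = PERF_COUNT_HW_CACHE_MISSES;
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = perf_open(&attr);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... region of interest: run the pointer-chase kernel here ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        long long misses;
        read(fd, &misses, sizeof misses);
        printf("cache misses: %lld\n", misses);
        close(fd);
        return 0;
    }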
Other questions to answer:
- what platforms should we buy?
- investigate potential talks?
- Jace? (extended memory semantics)
- the Qthreads guy (Kyle Wheeler)?
September 30, 2010
Current progress:
- We can get accounts on the PNNL XMT; Simon sent out an email.
- An XMT simulator is also available.
- Simon created a wiki for the project and sent account details to everybody.
- Simon has been trying to get details of applications from various people, but has not yet been successful. Graph connectivity is the basic area. There is an effort to create a graph benchmark, http://www.graph500.org/, with source in the Graph 500 Git repository.
Thoughts on methodology:
- We should target an abstract model, not the real ISA.
- We will focus on performance for a single node for now.
- It may be possible to modify processor microcode. Could that be useful?
- Some questions we're interested in exploring:
- How many thread contexts can we run concurrently on a modern Xeon before memory performance degrades?
- What resource constraints in existing processors limit our performance?
- Could we take a trace of memory operations, filter out stack references, and replay these into a simple architectural simulator to explore what resources are necessary to get performance? (a sketch of the filtering step follows this list)
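A sketch of the filtering step, assuming a hypothetical textual trace format ("R 0xADDR" / "W 0xADDR", one record per line) and known stack bounds; in practice the per-thread bounds would be captured when the trace is recorded.

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    int main(void) {
        /* hypothetical stack range; real bounds come from the tracer */
        uint64_t stack_lo = 0x7ffff0000000ULL;
        uint64_t stack_hi = 0x7fffffffffffULL;
        char rw; uint64_t addr;
        /* pass through only heap/global references for the simulator */
        while (scanf(" %c %" SCNx64, &rw, &addr) == 2)
            if (addr < stack_lo || addr > stack_hi)
                printf("%c 0x%" PRIx64 "\n", rw, addr);
        return 0;
    }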
We'll explore a couple angles immediately:
- look at green threads packages: Andrew is doing this already.
- look at applications/benchmarks: Jacob will start here.
- look at resource constraints in Intel and ARM (maybe AMD): Brandon is looking into this.