Meeting notes
November 15, 2010
- New machine is up, prefetchers switched off
- 2 MB huge pages enabled; we think we know how to use them
- (old P4 Xeon cluster is alive again, too)
- Brandon reran the linked list traversal on the new machine (a sketch of this style of benchmark follows these notes)
- Single-socket max BW is what we expect
- Single-socket BW seems to plateau between 24 and 36 outstanding misses, no matter which way we slice it (2 threads with 12 misses each, or 12 threads with 2 misses each)
- This fits with our model of the processor (at most 32 slots in global queue for local reads)
- This leads us back to the SIMD idea: with a 12-thread or 16-thread processor, 2-way static SIMD might be enough to saturate bandwidth without context switching.
- todo: reslice the data to show max bandwidth by total outstanding requests
- todo: want to see BW for a single thread all the way to 50 concurrent misses, to verify there is a plateau
- two-socket bandwidth is not what we expect (same as single socket). Suspicion: the NUMA layout is not behaving how we expect. Will explore further.
- Mark suggests, if we want to avoid TLB problems altogether:
- limit Linux to 1 GB, set up the page map, and manage the rest of memory yourself (as done by Gridiron Systems)
- Jacob is grafting a timing model into a cache simulator to help explain Andrew's code
- Andrew is investigating his code
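The traversal code itself isn't in these notes. For reference, a minimal sketch of this style of benchmark, with illustrative names and sizes rather than Brandon's actual code: each thread walks several independent lists in lockstep, so its outstanding-miss count is simply the number of chains, and the lists are backed by 2 MB huge pages to keep TLB refills out of the picture.

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    #define CHAINS 12          /* independent chains = outstanding misses per thread */
    #define NODES  (1 << 20)   /* nodes per chain (64 MB per chain at 64 B/node) */

    typedef struct node {
        struct node *next;
        char pad[64 - sizeof(struct node *)];   /* one node per cache line */
    } node_t;

    /* Back each list with 2 MB huge pages to minimize TLB refills;
       fall back to normal pages if hugetlb isn't configured. */
    static node_t *alloc_chain(void) {
        size_t bytes = NODES * sizeof(node_t);
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED)
            p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (node_t *)p;
    }

    /* Link the nodes in a random permutation so the hardware can't
       guess the next address (prefetchers are off anyway). */
    static node_t *build_chain(node_t *n) {
        size_t i, j, t;
        size_t *perm = malloc(NODES * sizeof(size_t));
        for (i = 0; i < NODES; i++) perm[i] = i;
        for (i = NODES - 1; i > 0; i--) {
            j = rand() % (i + 1);
            t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (i = 0; i + 1 < NODES; i++)
            n[perm[i]].next = &n[perm[i + 1]];
        n[perm[NODES - 1]].next = &n[perm[0]];
        t = perm[0]; free(perm);
        return &n[t];
    }

    int main(void) {
        node_t *p[CHAINS];
        size_t step;
        int c;
        for (c = 0; c < CHAINS; c++)
            p[c] = build_chain(alloc_chain());
        /* One dependent load per chain per step; the CHAINS loads are
           mutually independent, so this thread keeps ~CHAINS cache
           misses in flight at once. Time this loop to get bandwidth. */
        for (step = 0; step < NODES; step++)
            for (c = 0; c < CHAINS; c++)
                p[c] = p[c]->next;
        return (int)(p[0] == p[1]);   /* keep the traversal live */
    }

Varying CHAINS and the thread count while timing the traversal loop gives the BW-vs-outstanding-misses curves discussed above.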
November 8, 2010
- Andrew discussed his context switching approach
- he sees speedups, but not what Simon expected
- next step: explain where speedups are coming from
November 1, 2010
- looked at the poster made last week for affiliates
- interesting conversations with Facebook; follow up in the future
- Convey training
- FPGAs sit on the front-side bus; seems like it could be useful
- talked to Allan Porterfield about pchase bandwidth limits
- he feels the bottleneck is DRAM page switching times
- linked list traversal status
- modified the allocator to minimize TLB refills; this gives us results similar to pchase
- we max out at 60% of theoretical peak bandwidth
- this requires ~6 concurrent requests with 4 threads, or ~3 concurrent requests with 8 threads
- global queue contention is minimal at this level
- latencies vary from ~30ns to ~100ns at peak bandwidth
- an XMT node can issue ~100M memory ops/s to the network; is there a commodity interconnect that can sustain this kind of rate?
- Andrew is still working on a graph benchmark
- discussed static scheduling instead of context switches (see the sketch after these notes)
- if we need only a few outstanding requests, this might be simpler
- for the single-socket case, it seems reasonable
- for the multi-node case, the thread model has benefits
- next time
- Andrew will have some sort of graph benchmark, and will discuss his approach
- Brandon/Jacob will have a more solid idea of where our bottleneck is
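To make the static-scheduling idea concrete, a sketch of the 2-way case (illustrative, not project code): one OS thread advances two independent list cursors in the same loop body, so the two misses overlap in the memory system with no switch overhead at all.

    #include <stddef.h>

    typedef struct node { struct node *next; long value; } node_t;

    /* Two traversal "lanes" statically scheduled into one loop body.
       The two loads are independent, so both misses are in flight at
       once; no context switch is needed to hide either latency. */
    long sum_two_way(node_t *a, node_t *b, size_t steps) {
        long sum = 0;
        while (steps--) {
            a = a->next;
            b = b->next;
            sum += a->value + b->value;
        }
        return sum;
    }

As noted above, the lane count is fixed at compile time, which is fine for the single-socket case but is part of why the thread model looks better for multi-node.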
October 25, 2010
- discussed porterfield paper
- they use pchase, a linked list traversal microbenchmark like the one we've discussed
- they run on a quad-core Xeon like ours, with more memory
- they get about 54% of theoretical peak. Why?
- Kyle Wheeler, the Qthreads developer, was here
- he shared some new context switch code that should be faster (see the baseline sketch below)
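Kyle's code isn't reproduced here. For context, the portable baseline any faster switch gets compared against is POSIX ucontext, roughly as follows; swapcontext() saves and restores the full register set plus the signal mask (a system call), which is precisely the overhead hand-tuned switch code tries to avoid.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, co_ctx;

    static void coroutine(void) {
        for (int i = 0; i < 3; i++) {
            printf("coroutine: %d\n", i);
            swapcontext(&co_ctx, &main_ctx);   /* yield back to main */
        }
    }

    int main(void) {
        static char stack[64 * 1024];
        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = stack;
        co_ctx.uc_stack.ss_size = sizeof stack;
        co_ctx.uc_link = &main_ctx;            /* return here when done */
        makecontext(&co_ctx, coroutine, 0);
        for (int i = 0; i < 3; i++) {
            printf("main: %d\n", i);
            swapcontext(&main_ctx, &co_ctx);   /* resume coroutine */
        }
        return 0;
    }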
October 18, 2010
- Jase was here
- discussed synchronization using microcode modification
October 11, 2010
October 4, 2010
Current progress:
- Brandon investigated potential performance limiters in Nehalem and in an ARM processor
- Nehalem allows 10 outstanding data misses per core
- 32 outstanding L3/memory loads? 16 outstanding L3/memory stores?
- 8 outstanding prefetches? The docs weren't clear about prefetches.
- The ARM Brandon looked at (we don't remember which) allows something like 3 outstanding misses per core.
- Jacob is still looking at benchmarks/applications. Jacob checked some into a new Git repository; he'll send a pointer.
- Andrew is still looking at minimal context switches and Qthreads
Issues to watch out for:
- if thread stacks are all identically aligned, the hot words near each stack top map to the same L1 sets, so associativity limits may cause conflict misses (see the sketch below)
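One standard mitigation, sketched below under assumed sizes (hypothetical, not project code): over-allocate each stack and skew its start by a per-thread multiple of the cache line size, spreading the hot stack tops across different L1 sets.

    #include <stdlib.h>

    #define STACK_BYTES (64 * 1024)
    #define LINE        64     /* cache line size */
    #define SKEW_SLOTS  64     /* spread stack tops across 64 lines */

    /* Hypothetical allocator: without the skew, power-of-two-aligned
       stacks would put every thread's hot frame in the same L1 set,
       thrashing an 8-way cache as soon as >8 threads are live.
       (Freeing the raw allocation is elided in this sketch.) */
    void *alloc_skewed_stack(int thread_id) {
        char *raw = malloc(STACK_BYTES + SKEW_SLOTS * LINE);
        return raw ? raw + (thread_id % SKEW_SLOTS) * LINE : NULL;
    }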
Current thoughts on methodology:
- we'll take 2.5 approaches:
- what's a lower bound on performance?
- (basically, build a system and make it run faster)
- write some benchmarks using green threads packages
- hack on the benchmarks and green threads packages for more performance
- what's an upper bound on performance? What limits it?
- First, "guess" at likely performance limiters in available hardware
- build some synthetic kernels to exercise what we think are performance limiters
- then use performance counters to validate that the limiters are behaving as we expect
- then validate that these limiters are actually the problem
- Given a particular memory hierarchy (cache structure, link bandwidths/latencies, queue sizes), how fast can we execute a particular memory trace?
- capture memory trace from benchmark on real system
- find cache parameters, link bandwidths/latencies, queue sizes for real system
- modify memory hierarchy model (Ruby or Sesc?) to match this system
- replay the memory trace through the model, verifying that we get event counts similar to real hardware (a toy sketch of this replay step follows this list)
- expand the limiting parameters. Do we get a speedup? If so, we were right about the limit.
- repeat until the parameters are absurd. What's a reasonable design (set of parameters) that gives good performance?
- First, "guess" at likely performance limiters in available hardware
- what's a lower bound on performance?
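To make the replay step concrete, a toy single-level cache model (illustrative only; the plan above is to reuse Ruby or Sesc). It consumes an assumed textual trace of "R 0xADDR" / "W 0xADDR" records on stdin and reports hit/miss totals that could be checked against hardware counter readings.

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    #define LINE_BITS 6       /* 64-byte lines */
    #define SETS      512     /* 512 sets x 8 ways x 64 B = 256 KB */
    #define WAYS      8

    static uint64_t tags[SETS][WAYS];   /* 0 = invalid */
    static long     used[SETS][WAYS];   /* larger = more recently used */
    static long     tick;

    static int access_line(uint64_t addr) {
        uint64_t line = (addr >> LINE_BITS) + 1;   /* +1 keeps 0 as "invalid" */
        int set = (int)(line % SETS), w, victim = 0;
        for (w = 0; w < WAYS; w++) {
            if (tags[set][w] == line) { used[set][w] = ++tick; return 1; }
            if (used[set][w] < used[set][victim]) victim = w;
        }
        tags[set][victim] = line;                  /* miss: fill the LRU way */
        used[set][victim] = ++tick;
        return 0;
    }

    int main(void) {
        char rw; uint64_t addr;
        long hits = 0, misses = 0;
        while (scanf(" %c %" SCNx64, &rw, &addr) == 2)
            access_line(addr) ? hits++ : misses++;
        printf("hits %ld  misses %ld\n", hits, misses);
        return 0;
    }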
Next steps:
- keep thinking about methodology. Jacob and Simon will talk more.
- experiment with performance counters on the Nehalem system to validate the info in the docs (see the sketch below). How many outstanding memory requests per core? Per chip?
- continue exploring applications and benchmarks.
- continue exploring green threads, minimal context switch.
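For the performance counter experiments, one self-contained option on Linux is the perf_event_open interface; a minimal sketch counting cache misses around a region of interest is below. The generic PERF_COUNT_HW_CACHE_MISSES event is a stand-in: validating the outstanding-request limits would need the model-specific offcore events from the Intel docs.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static long perf_open(struct perf_event_attr *a) {
        /* this thread, any CPU */
        return syscall(SYS_perf_event_open, a, 0, -1, -1, 0);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof attr;
        attr.config = PERF_COUNT_HW_CACHE_MISSES;
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = perf_open(&attr);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... region of interest: run the pointer-chase kernel here ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        long long misses;
        read(fd, &misses, sizeof misses);
        printf("cache misses: %lld\n", misses);
        close(fd);
        return 0;
    }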
Other questions to answer:
- what platforms should we buy?
- investigate potential talks?
- Jace? (extended memory semantics)
- the Qthreads guy (Kyle Wheeler)?
September 30, 2010
Current progress:
- We can get accounts on the PNNL XMT; Simon sent out an email.
- An XMT simulator is also available.
- Simon created a wiki for the project and sent account details to everybody.
- Simon has been trying to get details of applications from various people, but has not yet been successful. Graph connectivity is the basic area. There is an effort to create a graph benchmark, http://www.graph500.org/, with source in the Graph 500 Git repository.
Thoughts on methodology:
- We should target an abstract model, not the real ISA.
- We will focus on performance for a single node for now.
- It may be possible to modify processor microcode. Could that be useful?
- Some questions we're interested in exploring:
- How many thread contexts can we run concurrently on a modern Xeon before memory performance degrades?
- What resource constraints in existing processors limit our performance?
- Could we take a trace of memory operations, filter out stack references, and replay these into a simple architectural simulator to explore what resources are necessary to get performance? (a sketch of the filtering step follows this list)
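A sketch of the filtering step, assuming a hypothetical textual trace format ("R 0xADDR" / "W 0xADDR", one record per line) and known stack bounds; in practice the per-thread bounds would be captured when the trace is recorded.

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    int main(void) {
        /* hypothetical stack range; real bounds come from the tracer */
        uint64_t stack_lo = 0x7ffff0000000ULL;
        uint64_t stack_hi = 0x7fffffffffffULL;
        char rw; uint64_t addr;
        /* pass through only heap/global references for the simulator */
        while (scanf(" %c %" SCNx64, &rw, &addr) == 2)
            if (addr < stack_lo || addr > stack_hi)
                printf("%c 0x%" PRIx64 "\n", rw, addr);
        return 0;
    }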
We'll explore a couple angles immediately:
- look at green threads packages: Andrew is doing this already.
- look at applications/benchmarks: Jacob will start here.
- look at resource constraints in Intel and ARM (maybe AMD): Brandon is looking into this.