Meeting notes

From SoftXMT
Revision as of 06:32, 3 November 2010 by Nelson (talk | contribs)

November 1, 2010

  • looked at poster made last week for affiliates
    • interesting conversations with facebook---follow up in future
  • Convey training
    • FPGAs sit on the front-side bus. seems like it could be useful.
  • talked to allan porterfield about pchase bandwidth limits
    • he feels the bottleneck is dram page switching times
  • linked list traversal status
    • modified allocator to minimize tlb refills. this gives us similar results to pchase
    • we max out at 60% of theoretical peak bandwidth
    • this requires ~6 concurrent requests with 4 threads, or ~3 concurrent requests with 8 threads
    • global queue contention is minimal at this level
    • latencies vary from ~30ns to ~100ns at peak bandwidth
  • XMT node can issue ~100M memory ops/s to network---is there a commodity interconnect that can give this kind of rate?
  • andrew is still working on a graph benchmark
  • discussed static scheduling instead of context switches
    • if we need only a few outstanding requests, this might be simpler
    • for the single socket case, seems reasonable
    • for multi-node, thread model has benefits
  • next time
    • andrew will have some sort of graph benchmark, and will discuss his approach
    • brandon/jacob will have a more solid idea of where our bottleneck is
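
The concurrency numbers above (~6 requests on 4 threads, or ~3 on 8) can be sanity-checked with Little's law: bytes in flight divided by latency gives sustained bandwidth. A minimal sketch follows. The 64-byte line size is an assumption, the function names are made up for illustration, and the outputs are arithmetic on the numbers in the notes, not measurements.

```python
# Little's law sketch: bandwidth = (requests in flight * bytes per request) / latency.
# ASSUMPTION: each request moves one 64-byte cache line; the notes do not state this.
CACHE_LINE_BYTES = 64


def total_in_flight(threads, requests_per_thread):
    """Total concurrent memory requests across all threads."""
    return threads * requests_per_thread


def sustained_bandwidth(outstanding, latency_ns, line_bytes=CACHE_LINE_BYTES):
    """Bytes per second sustained with `outstanding` requests at `latency_ns` each."""
    return outstanding * line_bytes / (latency_ns * 1e-9)


def required_outstanding(target_bytes_per_s, latency_ns, line_bytes=CACHE_LINE_BYTES):
    """Requests that must be in flight to reach a target bandwidth."""
    return target_bytes_per_s * latency_ns * 1e-9 / line_bytes


if __name__ == "__main__":
    # Both configurations from the notes keep the same number of requests in flight.
    print(total_in_flight(4, 6), total_in_flight(8, 3))
    # Bandwidth implied at the two ends of the observed latency range.
    for latency_ns in (30.0, 100.0):
        gb_per_s = sustained_bandwidth(24, latency_ns) / 1e9
        print(f"{latency_ns:.0f} ns -> {gb_per_s:.2f} GB/s")
```

Both configurations come out to the same total number of requests in flight, which is consistent with the bottleneck being in the memory system rather than in the threads.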

October 25, 2010

  • discussed porterfield paper
    • they use pchase, a linked list traversal microbenchmark like we've discussed
    • they run on quad core xeon like ours, with more memory
    • they get about 54% of theoretical peak. why?
  • Kyle Wheeler, a qthreads developer, was here
    • he shared some new context switch code that should be faster

October 18, 2010

  • jase was here
    • discussed synchronization using microcode modification

October 11, 2010

October 4, 2010

Current progress:

  • Brandon investigated potential performance limiters in Nehalem and in an ARM core
    • nehalem allows 10 outstanding data misses per core
    • 32 outstanding L3/memory loads? 16 outstanding L3/memory stores?
    • 8 outstanding prefetches? The docs weren't clear about prefetches.
    • The ARM Brandon looked at (don't remember which) allows something like 3 outstanding misses per core.
  • Jacob is still looking at benchmarks/applications. Jacob checked some into a new Git repository; he'll send a pointer.
  • Andrew is still looking at minimal context switches and Qthreads

Issues to watch out for:

  • if stacks are aligned, L1 associativity may cause problems

Current thoughts on methodology:

  • we'll take 2.5 approaches:
    • what's a lower bound on performance?
      • (basically, build a system and make it run faster)
      • write some benchmarks using green threads packages
      • hack benchmarks and green threads packages for more performance
    • what's an upper bound on performance? What limits it?
      • First, "guess" at likely performance limiters in available hardware
        • build some synthetic kernels to exercise what we think are performance limiters
        • then, use performance counters to validate that limiters are behaving as we expect
      • Then, validate that these limiters are actually the problem.
        • Given a particular memory hierarchy (cache structure, link bandwidths/latencies, queue sizes), how fast can we execute a particular memory trace?
        • capture memory trace from benchmark on real system
        • find cache parameters, link bandwidths/latencies, queue sizes for real system
        • modify memory hierarchy model (Ruby or Sesc?) to match this system
        • replay memory trace through model, verifying that we get similar event counts to real hardware.
        • expand limiting parameters. do we get a speedup? If so, we were right about the limit.
        • repeat until parameters are absurd. what's a reasonable design (set of parameters) that gives good performance?
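
The replay-and-expand loop above can be sketched with a toy model. This is a stand-in for illustration only, not Ruby or Sesc: it is a single-level set-associative LRU cache plus a crude timing estimate. The timing estimate assumes that misses are independent, that up to `max_outstanding` of them overlap fully, and that hits are free. All names and parameters are hypothetical.

```python
import math
from collections import OrderedDict


class SetAssocCache:
    """Single-level set-associative cache with LRU replacement."""

    def __init__(self, size_bytes, ways, line_bytes=64):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Record one access; return True on a hit."""
        line = addr // self.line_bytes
        lines = self.sets[line % self.num_sets]
        if line in lines:
            lines.move_to_end(line)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used line
        lines[line] = True
        return False


def replay(trace, size_bytes, ways, miss_latency_ns, max_outstanding):
    """Replay an address trace; return (hits, misses, estimated time in ns)."""
    cache = SetAssocCache(size_bytes, ways)
    for addr in trace:
        cache.access(addr)
    # Crude bound: misses issue in fully overlapped batches of max_outstanding.
    rounds = math.ceil(cache.misses / max_outstanding)
    return cache.hits, cache.misses, rounds * miss_latency_ns


if __name__ == "__main__":
    trace = [i * 64 for i in range(100)] * 2  # synthetic trace: 100 lines, twice
    for outstanding in (10, 20, 40):
        print(outstanding, replay(trace, 32 * 1024, 8, 100.0, outstanding))
```

Expanding one parameter at a time (here `max_outstanding`) and watching whether the estimated time drops is the "were we right about the limit" check. The hit and miss counts are what would be compared against hardware performance counters.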


Next steps:

  • keep thinking about methodology. Jacob and Simon will talk more.
  • experiment with performance counters on Nehalem system to validate info in docs. how many outstanding memory requests per core? per chip?
  • continue exploring applications and benchmarks.
  • continue exploring green threads, minimal context switch.

Other questions to answer:

  • what platforms should we buy?
  • investigate potential talks?
    • jace? (extended memory semantics)
    • qthreads guy?


September 30, 2010

Current progress:

  • We can get accounts on the PNNL XMT; Simon sent out an email.
  • An XMT simulator is also available.
  • Simon created a wiki for the project and sent account details to everybody.
  • Simon has been trying to get details of applications from various people, but has not yet been successful. Graph connectivity is the basic area. There is an effort to create a graph benchmark: http://www.graph500.org/, with source in the Graph 500 git repository

Thoughts on methodology:

  • We should target an abstract model, not the real ISA.
  • We will focus on performance for a single node for now.
  • It may be possible to modify processor microcode. Could that be useful?
  • Some questions we're interested in exploring:
    • How many thread contexts can we run concurrently on a modern Xeon before memory performance degrades?
    • What resource constraints in existing processors limit our performance?
    • Could we take a trace of memory operations, filter out stack references, and replay these into a simple architectural simulator to explore what resources are necessary to get performance?
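
The stack-filtering step in the last question could look something like the sketch below. The trace format (a list of `(op, addr)` pairs) and the stack address ranges are hypothetical; a real trace tool would define its own record layout and a way to find each thread's stack bounds.

```python
def filter_stack_refs(trace, stack_ranges):
    """Drop trace records whose address falls inside any stack region.

    trace: iterable of (op, addr) pairs, e.g. ("R", 0x1000).
    stack_ranges: list of (low, high) half-open address ranges, one per thread stack.
    """
    def in_stack(addr):
        return any(low <= addr < high for low, high in stack_ranges)

    return [(op, addr) for op, addr in trace if not in_stack(addr)]


if __name__ == "__main__":
    # Made-up addresses: one stack region and a few heap references.
    stacks = [(0x7FFF0000, 0x7FFF8000)]
    trace = [("R", 0x1000), ("W", 0x7FFF0010), ("R", 0x2000), ("R", 0x7FFF7FF8)]
    print(filter_stack_refs(trace, stacks))
```

The filtered trace is what would be fed to the architectural simulator, on the assumption that stack references mostly hit in cache and would only add noise.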

We'll explore a couple angles immediately:

  • look at green threads packages: Andrew is doing this already.
  • look at applications/benchmarks: Jacob will start here.
  • look at resource constraints in Intel and ARM (maybe AMD): Brandon is looking into this.