Meeting notes

From SoftXMT
Revision as of 21:30, 13 December 2010 by Nelson

December 13, 2010

  • Simon: report on review
  • Next steps:
    • how to increase concurrency?
      • FPGA?
      • software? dedicate a core?
    • EMS
    • global addressing
    • energy
    • current performance for kernel/real apps?
      • kernels on graphs?
      • sparse matrix?
      • branch and bound?


    • Reduce page miss overhead
      • 1GB huge pages
      • physical addressing
    • Using streaming accesses
      • prefetchnta (non-temporal prefetch)
      • nontemporal store (MOVNTQ)
      • nontemporal load
    • larger system
      • FPGA options
      • AMD HT3 HNC
      • Intel QPI spec
    • Other processors?
      • Intel MIC?
      • ARM?
    • Improve application performance with thread library
  • HotPar paper
    • get speedup with an application or two
      • bfs
      • sparse matrix something (elimination tree?)
      • us vs. many-core?
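On the "reduce page miss overhead" item above, a rough sketch of why larger pages help: the number of distinct pages (and so address translations) needed to cover a working set shrinks in proportion to the page size. The 16 GiB working set and the 64-entry TLB below are illustrative assumptions, not figures from our machine.

```python
# Page sizes available on x86-64, in bytes.
PAGE_SIZES = {"4KiB": 4 << 10, "2MiB": 2 << 20, "1GiB": 1 << 30}


def pages_needed(working_set_bytes, page_bytes):
    """Number of pages needed to cover the working set (rounded up)."""
    return -(-working_set_bytes // page_bytes)


def tlb_reach(entries, page_bytes):
    """Bytes of memory that can be mapped at once by `entries` TLB entries."""
    return entries * page_bytes


if __name__ == "__main__":
    working_set = 16 << 30  # hypothetical 16 GiB working set
    tlb_entries = 64        # hypothetical TLB size, for illustration only
    for name, size in PAGE_SIZES.items():
        print(name, pages_needed(working_set, size), tlb_reach(tlb_entries, size))
```

With random accesses over a large working set, almost every reference touches a different page, so the page count is a reasonable proxy for how often translations miss.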


December 6, 2010

  • simon went over the presentation he'll be giving
    • mark: dram page miss rate? why not 100%?
    • carlson benchmark: why is base case so much slower than list traversal?

November 15, 2010

  • New machine is up, prefetchers switched off
    • 2MB huge pages enabled, we think we know how to use them.
    • (old P4 Xeon cluster is alive again, too)
  • Brandon reran linked list traversal on new machine
    • Single socket max BW is what we expect
    • Single socket BW seems to plateau between 24 and 36 outstanding misses, no matter how we slice it (2 threads with 12 misses each, or 12 threads with 2 misses each).
    • This fits with our model of the processor (at most 32 slots in global queue for local reads)
    • This leads us back to the SIMD idea: with a 12-thread or 16-thread processor, 2-way static SIMD might be enough to saturate bandwidth without context switching.
    • todo: reslice data to show max bandwidth by total outstanding requests
    • todo: want to see BW for a single thread all the way to 50 concurrent misses, to verify there is a plateau
    • Two-socket bandwidth is not what we expect (it is the same as single socket). Suspicion: the NUMA layout is not behaving as we expect. Will explore further.
  • mark suggests, if we want to avoid TLB problems altogether:
    • limit Linux to 1 GB, set up the page map, and manage the rest of memory yourself (done by Gridiron Systems)
  • Jacob is grafting timing model into a cache simulator to help explain Andrew's code.
  • Andrew is investigating his code
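The plateau in Brandon's linked list results above is what Little's law would predict if the number of in-flight misses is capped: sustained bandwidth is roughly (outstanding misses × line size) / latency. A minimal sketch, assuming 64-byte cache lines and a fixed latency (the 100 ns below is a placeholder, not a measurement); the 32-slot cap is the global queue figure from our processor model.

```python
LINE_BYTES = 64          # assumed cache line size
GLOBAL_QUEUE_SLOTS = 32  # from our processor model: slots for local reads


def outstanding(threads, misses_per_thread, cap=GLOBAL_QUEUE_SLOTS):
    """Misses actually in flight: what software requests, capped by the queue."""
    return min(threads * misses_per_thread, cap)


def bandwidth_gb_s(threads, misses_per_thread, latency_ns, cap=GLOBAL_QUEUE_SLOTS):
    """Little's law: bytes in flight divided by latency (bytes/ns equals GB/s)."""
    return outstanding(threads, misses_per_thread, cap) * LINE_BYTES / latency_ns


if __name__ == "__main__":
    latency_ns = 100.0  # placeholder latency, not a measurement
    for threads, misses in [(2, 12), (12, 2), (1, 50)]:
        print(threads, misses, outstanding(threads, misses),
              bandwidth_gb_s(threads, misses, latency_ns))
```

In this model only the product threads × misses matters, which matches the "no matter how we slice it" observation, and a single thread asking for 50 misses gets no more bandwidth than one asking for 32. Real latency rises under load, so treat the absolute numbers as illustrative only.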

November 8, 2010

  • Andrew discussed his context switching approach
  • he sees speedups, but not what simon expected
  • next step: explain where speedups are coming from
  • linked list info: we exceeded what we thought was single-memory-controller bandwidth limit by having both socket 0 and socket 1 make requests to MC 0. This gave us another 2-3GB/s. This makes it seem like something before the memory controller (global queue?) is limiting the single-socket bandwidth.

November 1, 2010

  • looked at poster made last week for affiliates
    • interesting conversations with facebook---follow up in future
  • convey training
    • fpgas sit on front-side bus. seems like it could be useful.
  • talked to allan porterfield about pchase bandwidth limits
    • he feels the bottleneck is dram page switching times
  • linked list traversal status
    • modified allocator to minimize tlb refills. this gives us similar results to pchase
    • we max out at 60% of theoretical peak bandwidth
    • this requires ~6 concurrent requests with 4 threads, or ~3 concurrent requests with 8 threads
    • global queue contention is minimal at this level
    • latencies vary from ~30ns to ~100ns at peak bandwidth
  • XMT node can issue ~100M memory ops/s to network---is there a commodity interconnect that can give this kind of rate?
  • andrew is still working on a graph benchmark
  • discussed static scheduling instead of context switches
    • if we need only a few outstanding requests, this might be simpler
    • for the single socket case, seems reasonable
    • for multi-node, thread model has benefits
  • next time
    • andrew will have some sort of graph benchmark, and will discuss his approach
    • brandon/jacob will have a more solid idea of where our bottleneck is
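For reference, the access pattern of the linked list traversal (pchase-style) microbenchmark can be sketched as below: each thread advances several independent cursors in lockstep, so the number of cursors sets the number of concurrent requests per thread. This only illustrates the pattern; run in Python it will not reproduce hardware bandwidth numbers. The names `build_chain` and `chase` are made up for this sketch and are not taken from our benchmark or from pchase, and the real benchmark may use separate lists per cursor rather than one shared cycle.

```python
import random


def build_chain(n, seed=0):
    """Return a next-index array that links n nodes into one random cycle."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    nxt = [0] * n
    for i in range(n):
        nxt[order[i]] = order[(i + 1) % n]
    return nxt


def chase(nxt, starts, steps):
    """Advance one cursor per start node, `steps` times each, interleaved.

    Within one round the loads do not depend on each other, which is what
    would let hardware overlap them; across rounds each cursor's next load
    depends on its previous one.
    """
    cursors = list(starts)
    for _ in range(steps):
        for k in range(len(cursors)):
            cursors[k] = nxt[cursors[k]]
    return cursors


if __name__ == "__main__":
    nxt = build_chain(1 << 16)
    print(chase(nxt, starts=[0, 1, 2, 3, 4, 5], steps=1000))
```

Randomizing the node order defeats hardware prefetching and spreads accesses across DRAM pages, which is the behavior the bandwidth discussion above is concerned with.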

October 25, 2010

  • discussed porterfield paper
    • they use pchase, a linked list traversal microbenchmark like we've discussed
    • they run on quad core xeon like ours, with more memory
    • they get about 54% of theoretical peak. why?
  • Kyle Wheeler, a qthreads developer, was here
    • he shared some new context switch code that should be faster

October 18, 2010

  • jase was here
    • discussed synchronization using microcode modification

October 11, 2010

October 4, 2010

Current progress:

  • Brandon investigated potential performance limiters in Nehalem and in an ARM processor
    • nehalem allows 10 outstanding data misses per core
    • 32 outstanding L3/memory loads? 16 outstanding L3/memory stores?
    • 8 outstanding prefetches? The docs weren't clear about prefetches.
    • The ARM Brandon looked at (don't remember which) allows something like 3 outstanding misses per core.
  • Jacob is still looking at benchmarks/applications. Jacob checked some into a new Git repository; he'll send a pointer.
  • Andrew is still looking at minimal context switches and Qthreads

Issues to watch out for:

  • if stacks are aligned, L1 associativity may cause problems
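A small sketch of the stack alignment concern: if every thread's stack starts at the same large power-of-two alignment, the hot line at the same offset in each stack maps to the same L1 set, and more threads than ways will evict each other. The cache geometry below (32 KiB, 8-way, 64-byte lines) is assumed to match the Nehalem L1 data cache; the stack base addresses and the 64 KiB stack stride are hypothetical.

```python
LINE_BYTES = 64
L1_BYTES = 32 * 1024  # assumed L1 data cache size
L1_WAYS = 8           # assumed associativity
NUM_SETS = L1_BYTES // (LINE_BYTES * L1_WAYS)


def l1_set(addr):
    """Index of the L1 set that the line containing addr maps to."""
    return (addr // LINE_BYTES) % NUM_SETS


def sets_used(stack_bases, offset=0):
    """Distinct L1 sets touched by the line at `offset` in each stack."""
    return {l1_set(base + offset) for base in stack_bases}


def staggered(stack_bases):
    """Shift each stack base by one extra cache line to spread the sets."""
    return [base + i * LINE_BYTES for i, base in enumerate(stack_bases)]


if __name__ == "__main__":
    # Hypothetical: 16 green-thread stacks, each aligned to 64 KiB.
    bases = [0x10000000 + i * 0x10000 for i in range(16)]
    print(len(sets_used(bases)), "set(s) for aligned stacks,", L1_WAYS, "ways")
    print(len(sets_used(staggered(bases))), "set(s) after staggering")
```

Staggering the bases is just one possible mitigation, shown to make the effect visible; it is not something we have decided on.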

Current thoughts on methodology:

  • we'll take 2.5 approaches:
    • what's a lower bound on performance?
      • (basically, build a system and make it run faster)
      • write some benchmarks using green threads packages
      • hack benchmarks, green threads packages for more performance
    • what's an upper bound on performance? What limits it?
      • First, "guess" at likely performance limiters in available hardware
        • build some synthetic kernels to exercise what we think are performance limiters
        • then, use performance counters to validate that limiters are behaving as we expect
      • Then, validate that these limiters are actually the problem.
        • Given a particular memory hierarchy (cache structure, link bandwidths/latencies, queue sizes), how fast can we execute a particular memory trace?
        • capture memory trace from benchmark on real system
        • find cache parameters, link bandwidths/latencies, queue sizes for real system
        • modify memory hierarchy model (Ruby or Sesc?) to match this system
        • replay memory trace through model, verifying that we get similar event counts to real hardware.
        • expand limiting parameters. do we get a speedup? If so, we were right about limit.
        • repeat until parameters are absurd. what's a reasonable design (set of parameters) that gives good performance?
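The replay loop above can be illustrated with a toy model: a single set-associative LRU cache that consumes an address trace and reports event counts. This is only a sketch of the idea, not Ruby or Sesc; the `Cache` class, its parameters, and the synthetic trace are all made up for illustration.

```python
from collections import OrderedDict


class Cache:
    """Toy set-associative cache with LRU replacement."""

    def __init__(self, size_bytes, ways, line_bytes=64):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Look up one address; return True on a hit, False on a miss."""
        line = addr // self.line_bytes
        lines = self.sets[line % self.num_sets]
        if line in lines:
            lines.move_to_end(line)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used line
        lines[line] = True
        return False


def replay(trace, cache):
    """Feed every address in the trace to the cache; return (hits, misses)."""
    for addr in trace:
        cache.access(addr)
    return cache.hits, cache.misses


if __name__ == "__main__":
    # Synthetic trace: two sequential passes over 64 KiB, one access per line.
    trace = [i * 64 for i in range(1024)] * 2
    print("32 KiB:", replay(trace, Cache(32 * 1024, ways=8)))
    print("128 KiB:", replay(trace, Cache(128 * 1024, ways=8)))
```

Expanding the limiting parameter (here, capacity) and replaying the same trace is the "do we get a speedup?" step: if the event counts change substantially, that parameter was a limiter for this trace.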


Next steps:

  • keep thinking about methodology. Jacob and Simon will talk more.
  • experiment with performance counters on Nehalem system to validate info in docs. how many outstanding memory requests per core? per chip?
  • continue exploring applications and benchmarks.
  • continue exploring green threads, minimal context switch.

Other questions to answer:

  • what platforms should we buy?
  • investigate potential talks?
    • jace? (extended memory semantics)
    • qthreads guy?


September 30, 2010

Current progress:

  • We can get accounts on the PNNL XMT; Simon sent out an email.
  • An XMT simulator is also available.
  • Simon created a wiki for the project and sent account details to everybody.
  • Simon has been trying to get details of applications from various people, but has not yet been successful. Graph connectivity is the basic area. There is an effort to create a graph benchmark, Graph 500 (http://www.graph500.org/), with source in the Graph 500 git repository.

Thoughts on methodology:

  • We should target an abstract model, not the real ISA.
  • We will focus on performance for a single node for now.
  • It may be possible to modify processor microcode. Could that be useful?
  • Some questions we're interested in exploring:
    • How many thread contexts can we run concurrently on a modern Xeon before memory performance degrades?
    • What resource constraints in existing processors limit our performance?
    • Could we take a trace of memory operations, filter out stack references, and replay these into a simple architectural simulator to explore what resources are necessary to get performance?
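The trace filtering question in the last bullet could be approached as sketched below, assuming we know each thread's stack address range at capture time. The trace format (a list of `(op, addr)` tuples) and the function name are hypothetical; whatever tool captures the trace will dictate the real format.

```python
def filter_stack_refs(trace, stack_ranges):
    """Drop accesses whose address falls inside any [lo, hi) stack range."""

    def in_stack(addr):
        return any(lo <= addr < hi for lo, hi in stack_ranges)

    return [(op, addr) for op, addr in trace if not in_stack(addr)]


if __name__ == "__main__":
    # Hypothetical stack range and trace, for illustration only.
    stacks = [(0x7FFF0000, 0x7FFF8000)]
    trace = [("R", 0x1000), ("W", 0x7FFF1234), ("R", 0x2040), ("R", 0x7FFF8000)]
    print(filter_stack_refs(trace, stacks))
```

The filtered trace is what would then be replayed into a simple architectural simulator.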

We'll explore a couple angles immediately:

  • look at green threads packages: Andrew is doing this already.
  • look at applications/benchmarks: Jacob will start here.
  • look at resource constraints in Intel and ARM (maybe AMD): Brandon is looking into this.