Meeting notes

October 4, 2010

Current progress:

  • Brandon investigated potential performance limiters in Nehalem and in an ARM core (a microbenchmark sketch for probing outstanding misses follows this list):
    • Nehalem allows 10 outstanding data misses per core
    • 32 outstanding L3/memory loads? 16 outstanding L3/memory stores?
    • 8 outstanding prefetches? The docs weren't clear about prefetches.
    • The ARM Brandon looked at (don't remember which) allows something like 3 outstanding misses per core.
  • Jacob is still looking at benchmarks/applications; he checked some into a new Git repository and will send a pointer.
  • Andrew is still looking at minimal context switches and Qthreads
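
A sketch of one way to probe the outstanding-miss limit (all sizes, counts, and the ~10-miss expectation are assumptions to be checked, not measured facts): chase k independent pointer chains through an array much larger than the caches. The k loads per step are independent, so the hardware can overlap them up to its outstanding-miss limit; time per step should stay near one memory latency as k grows toward that limit, then climb.

  /* mlp.c -- sketch; compile: gcc -O2 mlp.c (add -lrt on older glibc) */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define NODES ((size_t)1 << 24)   /* 16M entries * 8 B = 128 MB, >> L3 */
  #define STEPS (1L << 20)
  #define MAXK  16

  int main(void) {
      size_t *next = malloc(NODES * sizeof *next);
      for (size_t i = 0; i < NODES; i++) next[i] = i;
      srand(1);                     /* rand() is crude but fine for a sketch */
      for (size_t i = NODES - 1; i > 0; i--) {  /* Sattolo: one big cycle */
          size_t j = (size_t)rand() % i;
          size_t t = next[i]; next[i] = next[j]; next[j] = t;
      }
      for (int k = 1; k <= MAXK; k *= 2) {
          size_t cur[MAXK];
          for (int c = 0; c < k; c++) cur[c] = c * (NODES / MAXK);
          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (long s = 0; s < STEPS; s++)
              for (int c = 0; c < k; c++)
                  cur[c] = next[cur[c]];        /* k misses in flight */
          clock_gettime(CLOCK_MONOTONIC, &t1);
          double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                    + (t1.tv_nsec - t0.tv_nsec);
          size_t sum = 0;                       /* keep the loops live */
          for (int c = 0; c < k; c++) sum += cur[c];
          printf("chains=%2d  %7.1f ns/step  (checksum %zu)\n",
                 k, ns / STEPS, sum);
      }
      free(next);
      return 0;
  }

If ns/step stays roughly flat up to k~10 and then rises, that's consistent with the documented figure; the same program on the ARM should flatten much earlier.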

Issues to watch out for:

  • if thread stacks are all aligned the same way, L1 associativity may cause conflict misses (see the sketch below)
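
One mitigation to try, sketched below. The cache geometry (32 KB, 8-way, 64 B lines, so 4 KB per way) is an assumption for illustration, not a measured parameter: stagger each thread's stack by a different cache-line offset so that identical frame offsets in different threads index different L1 sets.

  #include <stdlib.h>

  #define STACK_SIZE  (64 * 1024)
  #define CACHE_LINE  64
  #define L1_WAY_SPAN (4 * 1024)    /* assumed: 64 sets * 64 B lines */

  /* Return a stack base offset by (id * line) mod one way's span, so
   * thread stacks don't all map to the same L1 sets.  A real allocator
   * would also keep the raw pointer around for free(). */
  void *alloc_stack(int thread_id) {
      void *raw = NULL;
      size_t pad = ((size_t)thread_id * CACHE_LINE) % L1_WAY_SPAN;
      if (posix_memalign(&raw, STACK_SIZE, STACK_SIZE + L1_WAY_SPAN))
          return NULL;
      return (char *)raw + pad;
  }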

Current thoughts on methodology:

  • we'll take 2.5 approaches:
    • what's a lower bound on performance?
      • (basically, build a system and make it run faster)
      • write some benchmarks using green threads packages
      • hack the benchmarks and green-threads packages for more performance
    • what's an upper bound on performance? What limits it?
      • First, "guess" at likely performance limiters in available hardware
        • build some synthetic kernels to exercise what we think are performance limiters
        • then, use performance counters to validate that limiters are behaving as we expect
      • Then, validate that these limiters are actually the problem.
        • Given a particular memory hierarchy (cache structure, link bandwidths/latencies, queue sizes), how fast can we execute a particular memory trace?
        • capture memory trace from benchmark on real system
        • find cache parameters, link bandwidths/latencies, queue sizes for real system
        • modify memory hierarchy model (Ruby or Sesc?) to match this system
        • replay memory trace through model, verifying that we get similar event counts to real hardware (a toy replay model is sketched after this list).
        • expand the limiting parameters. Do we get a speedup? If so, we were right about the limit.
        • repeat until the parameters are absurd. What's a reasonable design (set of parameters) that gives good performance?
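
To make the trace-replay step concrete, here is a toy single-level set-associative cache model with LRU replacement: it replays a trace of hex addresses from stdin and reports hit/miss counts. It is only a sketch; the real work would use Ruby or SESC as noted above, and the geometry constants are placeholders, not the real machine's.

  /* cachesim.c -- sketch; trace format assumed: one hex address per line */
  #include <inttypes.h>
  #include <stdio.h>

  #define LINE_BITS 6    /* 64 B lines */
  #define NSETS     64   /* 64 sets * 8 ways * 64 B = 32 KB (placeholder) */
  #define NWAYS     8

  static uint64_t tags[NSETS][NWAYS];   /* tags stored +1; 0 means empty */
  static uint64_t stamp[NSETS][NWAYS];  /* larger = more recently used */
  static uint64_t tick, hits, misses;

  static void cache_access(uint64_t addr) {
      uint64_t line = addr >> LINE_BITS;
      uint64_t set = line % NSETS, tag = line / NSETS + 1;
      int victim = 0;
      for (int w = 0; w < NWAYS; w++) {
          if (tags[set][w] == tag) {            /* hit: bump LRU stamp */
              hits++; stamp[set][w] = ++tick; return;
          }
          if (stamp[set][w] < stamp[set][victim]) victim = w;
      }
      misses++;                                 /* miss: evict LRU way */
      tags[set][victim] = tag;
      stamp[set][victim] = ++tick;
  }

  int main(void) {
      uint64_t addr;
      while (scanf("%" SCNx64, &addr) == 1) cache_access(addr);
      printf("hits %" PRIu64 "  misses %" PRIu64 "\n", hits, misses);
      return 0;
  }

Comparing these counts against hardware event counts for the same trace is the validation step; widening NSETS/NWAYS and re-running is the "expand limiting parameters" step.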


Next steps:

  • keep thinking about methodology. Jacob and Simon will talk more.
  • experiment with performance counters on the Nehalem system to validate the info in the docs: how many outstanding memory requests per core? per chip?
  • continue exploring applications and benchmarks.
  • continue exploring green threads and minimal context switches (see the context-switch sketch below).
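
For the context-switch angle, a useful baseline is plain ucontext switches; a hand-rolled switch that saves only callee-saved registers should beat it. A minimal sketch (Linux; iteration count arbitrary):

  /* switch.c -- sketch; compile: gcc -O2 switch.c (add -lrt on older glibc) */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <ucontext.h>

  #define ITERS (1L << 20)
  #define STACK_SIZE (64 * 1024)

  static ucontext_t main_ctx, fiber_ctx;

  static void fiber_fn(void) {
      for (long i = 0; i < ITERS; i++)
          swapcontext(&fiber_ctx, &main_ctx);   /* yield back to main */
  }

  int main(void) {
      getcontext(&fiber_ctx);
      fiber_ctx.uc_stack.ss_sp = malloc(STACK_SIZE);
      fiber_ctx.uc_stack.ss_size = STACK_SIZE;
      fiber_ctx.uc_link = &main_ctx;
      makecontext(&fiber_ctx, fiber_fn, 0);

      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (long i = 0; i < ITERS; i++)
          swapcontext(&main_ctx, &fiber_ctx);   /* two switches per trip */
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                + (t1.tv_nsec - t0.tv_nsec);
      printf("%.1f ns per one-way switch\n", ns / (2.0 * ITERS));
      free(fiber_ctx.uc_stack.ss_sp);
      return 0;
  }

On Linux, swapcontext also saves and restores the signal mask via a syscall, so this number is an upper bound on what a minimal user-level switch could achieve.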

Other questions to answer:

  • what platforms should we buy?
  • investigate potential talks?
    • jace? (extended memory semantics)
    • Qthreads guy?


September 30, 2010

Current progress:

  • We can get accounts on the PNNL XMT; Simon sent out an email.
  • An XMT simulator is also available.
  • Simon created a wiki for the project and sent account details to everybody.
  • Simon has been trying to get details of applications from various people, but has not yet been successful. Graph connectivity is the basic area. There is an effort to create a graph benchmark, http://www.graph500.org/, with source in the Graph 500 Git repository.

Thoughts on methodology:

  • We should target an abstract model, not the real ISA.
  • We will focus on performance for a single node for now.
  • It may be possible to modify processor microcode. Could that be useful?
  • Some questions we're interested in exploring:
    • How many thread contexts can we run concurrently on a modern Xeon before memory performance degrades? (a thread-scaling sketch follows this list)
    • What resource constraints in existing processors limit our performance?
    • Could we take a trace of memory operations, filter out stack references, and replay these into a simple architectural simulator to explore what resources are necessary to get performance?
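
As a starting point for the thread-context question, a sketch (pthreads; every size here is an illustrative assumption): run N threads, each chasing a private pointer chain that misses the caches, and report aggregate load throughput. The knee as N grows suggests where memory performance starts to degrade.

  /* scale.c -- sketch; compile: gcc -O2 -pthread scale.c */
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define NODES ((size_t)1 << 22)   /* 32 MB of size_t per thread */
  #define STEPS (1L << 20)
  #define MAXTHREADS 64

  static void *worker(void *arg) {
      size_t *next = arg;
      size_t cur = 0;
      for (long s = 0; s < STEPS; s++) cur = next[cur];
      return (void *)cur;           /* defeat dead-code elimination */
  }

  static size_t *make_chain(void) {
      size_t *next = malloc(NODES * sizeof *next);
      for (size_t i = 0; i < NODES; i++) next[i] = i;
      for (size_t i = NODES - 1; i > 0; i--) {  /* Sattolo: one big cycle */
          size_t j = (size_t)rand() % i;
          size_t t = next[i]; next[i] = next[j]; next[j] = t;
      }
      return next;
  }

  int main(int argc, char **argv) {
      int n = argc > 1 ? atoi(argv[1]) : 4;     /* thread count to test */
      if (n < 1) n = 1;
      if (n > MAXTHREADS) n = MAXTHREADS;
      pthread_t tid[MAXTHREADS];
      size_t *chains[MAXTHREADS];
      for (int i = 0; i < n; i++) chains[i] = make_chain();

      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (int i = 0; i < n; i++) pthread_create(&tid[i], NULL, worker, chains[i]);
      for (int i = 0; i < n; i++) pthread_join(tid[i], NULL);
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
      printf("%d threads: %.1f M loads/s aggregate\n",
             n, n * (double)STEPS / sec / 1e6);
      for (int i = 0; i < n; i++) free(chains[i]);
      return 0;
  }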

We'll explore a couple angles immediately:

  • look at green threads packages: Andrew is doing this already.
  • look at applications/benchmarks: Jacob will start here.
  • look at resource constraints in Intel and ARM (maybe AMD): Brandon is looking into this.