Meeting notes

New meeting notes page: https://docs.google.com/a/cs.washington.edu/document/d/1VY1TtPGcmveI_fnx8bNpd-JeE2DBzmg8SOdzSVZMqTE/edit?hl=en&authkey=COuXvO0B


March 1, 2011

todo:

  • prefetch on the consumer for green threads on delegate only
  • return path for delegates
  • green threads on the producer side
  • energy per update gups w/ delegates
  • prefetching on the producer so the consumer goes faster?
  • full/empty bits


February 22, 2011

February 15, 2011

  • App: perhaps move to parallel depth-first search
  • Synch
    • bench: GUPS-style a[b[i]]++, where 'a' is a big array that does not fit in cache (see the kernel sketch after this list)
      • Try baseline and with aggregation
        • Different queues, e.g. FastForward
      • Look at the benefit of the scheme when 'a' is big (larger than cache) vs. small (fits in cache)
    • Single socket: ~80M atomic increments/s
    • Single core, atomic increments to a single variable: ~160M/s; non-atomic is ~10x better
    • Initial: aggregation to single core with FF queue ~40M updates
    • Intel
      • Intel's current memory model is close to TSO (partly for legacy reasons), but on new microarchitectures they are clearly trying to improve performance
        • Stores from a core become visible in order (a communication queue of buffers is an example where this might matter)
        • Causal ordering is preserved (core 0 writes; core 1 reads that value and then writes based on the read)
    • Decide on standard metric for synchronization ops throughput
    • List chase with atomic updates on separate lists on 6 cores: ~160M updates/s. This differs from the ~80M number for single-variable updates; perhaps because a given core is updating many locations, so an update to any single variable blocks less of its work
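
For reference, a minimal sketch of the a[b[i]]++ GUPS-style kernel above; sizes, names, and the index generation are illustrative, not taken from our benchmark code:

    #include <stdint.h>
    #include <stdlib.h>

    /* GUPS-style kernel: a[b[i]]++ over a table far larger than the LLC. */
    int main(void) {
        size_t a_size = 1UL << 27;                 /* 2^27 words = 1 GB, >> cache */
        size_t b_size = 1UL << 24;                 /* precomputed random indices */
        uint64_t *a = calloc(a_size, sizeof *a);
        uint64_t *b = malloc(b_size * sizeof *b);
        if (!a || !b) return 1;

        for (size_t i = 0; i < b_size; i++)
            b[i] = (((uint64_t)rand() << 31) ^ (uint64_t)rand()) % a_size;

        /* Baseline (no aggregation): nearly every iteration misses in cache.
         * The aggregated variant would instead push indices through an SPSC
         * queue (e.g. FastForward) to a core that owns a partition of 'a'. */
        for (size_t i = 0; i < b_size; i++)
            a[b[i]]++;

        free(a);
        free(b);
        return 0;
    }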

February 8, 2011

  • Application frustrations.
  • Experiments with Cray CX-1
    • older version, with 4-core Nehalems: 8 nodes, 2 sockets per node
    • 20 Gbit/s Mellanox InfiniBand, fully connected
    • Ran some OSU benchmarks with MPI
      • unidirectional send/receive latency is ~1.5 us up to 64 bytes per message, then grows linearly as 1.5 us + (x - 64)/(1.2 B/us)
      • bandwidth is 1.8 GB/s at around 5 KB ...
      • 1.7M messages per sec with 64 byte payload
  • Contention
    • throughput of increments (the variants are sketched after this list)
      • racy on same variable, racy update to local, flatlines at a little over 1.4G updates/s
      • racy on same variable, atomic update to global, plummets to 10M/s
      • independent updates with full fence after each is ~50M/s
      • ... with just store fence is 600M/s
  • todo/think about:
    • find good way to move stuff to another core (low synchro 1producer/1consumer)
    • Even having each node in a multi-node system hold the exclusive right to operate on the global data residing in its local memory still does not provide the same ordering guarantees as sync ops on a single chip, since there is reordering due to the network. So it may make just as much sense to partition local memory (the globally shared part) between all the cores on the chip, rather than have a single memory-manager core per chip.
    • use counters to find stalls in pipeline due to fences
    • experiment with performance vs. density of sync ops across the cores
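
A hedged sketch of the increment variants behind the contention numbers above, using GCC builtins and x86 fence intrinsics; the actual test harness may differ, these are just the per-iteration bodies that get timed:

    #include <stdint.h>
    #include <emmintrin.h>   /* _mm_mfence, _mm_sfence */

    /* 'ctr' is shared across threads in the "same variable" cases and
     * thread-private in the "independent updates" cases. */

    static inline void racy_inc(volatile uint64_t *ctr) {
        (*ctr)++;                        /* plain load/add/store, no ordering */
    }

    static inline void atomic_inc(volatile uint64_t *ctr) {
        __sync_fetch_and_add(ctr, 1);    /* LOCK-prefixed add (GCC builtin) */
    }

    static inline void inc_full_fence(volatile uint64_t *ctr) {
        (*ctr)++;
        _mm_mfence();                    /* full fence after each update */
    }

    static inline void inc_store_fence(volatile uint64_t *ctr) {
        (*ctr)++;
        _mm_sfence();                    /* store fence only */
    }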

February 1, 2011

  • Synchronization update
    • Brandon has been running list chasing (non-green-threads) with sync ops on the nodes.
      • With disjoint sync ops, we get up to 160MT/s
        • this is slower than the 275MT/s we saw before, but we're also doing more work now, so we don't know what this number means.
      • With multiple list-walks-with-sync on the same list, we see faster rates, but since the caches are helping, the numbers are also confusing.
      • Next step:
        • run old list walk with racy increment
        • try temporal / non-temporal prefetches, L1 vs. L2 (a sketch follows this list)
          • but it is hard to make sure the prefetches actually execute. Idea: try issuing loads after the prefetches to ensure they finished, or limit the number of concurrent requests
        • atomic op to disjoint locations
        • atomic op to same location
    • Jacob tried doing atomic increments to a single variable
      • with a single core, atomic increment is twice as slow as racy increment
      • atomic increments destroy scalability: throughput is the same for 2-6 threads.
      • Todo:
        • try unrelated sync op in racy
        • try load/add/cmpxchg (not likely to be fast, but interesting anyway)
        • use performance counters to measure pipeline fullness
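
A small sketch of the prefetch experiment idea, assuming SSE prefetch intrinsics; whether a later demand load is a reliable way to confirm the prefetch finished is exactly what the experiment would have to establish:

    #include <stdint.h>
    #include <xmmintrin.h>   /* _mm_prefetch */

    /* Prefetch a node with different locality hints, then optionally touch it
     * with a demand load to force the line to arrive before we depend on it. */
    static inline void prefetch_l1(const void *p)  { _mm_prefetch((const char *)p, _MM_HINT_T0);  }
    static inline void prefetch_l2(const void *p)  { _mm_prefetch((const char *)p, _MM_HINT_T1);  }
    static inline void prefetch_nta(const void *p) { _mm_prefetch((const char *)p, _MM_HINT_NTA); }

    static inline uint64_t touch(const uint64_t *p) {
        return *p;                       /* demand load: forces the line in */
    }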

December 13, 2010

  • Simon: report on review
  • Next steps:
    • how to increase concurrency?
      • FPGA?
      • software? dedicate a core?
    • EMS
    • global addressing
    • energy
    • current performance for kernel/real apps?
      • kernels on graphs?
      • sparse matrix?
      • branch and bound?
    • Reduce page miss overhead
      • 1GB huge pages
      • physical addressing
    • Using streaming accesses (see the intrinsics sketch after this list)
      • non-temporal prefetch (prefetchnta)
      • nontemporal store (MOVNTQ)
      • nontemporal load
    • larger system
      • FPGA options
      • AMD HT3 HNC
      • Intel QPI spec
    • Other processors?
      • Intel MIC?
      • ARM?
    • Improve application performance with thread library
  • HotPar paper
    • get speedup with an application or two
      • bfs
      • sparse matrix something (elimination tree?)
      • us vs. many-core?
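
One possible way to exercise the streaming accesses listed above, using SSE2/SSE4.1 intrinsics; the notes mention MOVNTQ, but the 128-bit MOVNTDQ/MOVNTDQA forms used here are an assumption about what we would actually try:

    #include <stddef.h>
    #include <emmintrin.h>   /* SSE2: _mm_stream_si128 (MOVNTDQ) */
    #include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */

    /* Copy a buffer using non-temporal hints so it does not pollute the caches.
     * dst/src must be 16-byte aligned and n a multiple of 16. */
    static void stream_copy(void *dst, void *src, size_t n) {
        __m128i *d = (__m128i *)dst;
        __m128i *s = (__m128i *)src;
        for (size_t i = 0; i < n / 16; i++) {
            __m128i v = _mm_stream_load_si128(&s[i]);  /* non-temporal load */
            _mm_stream_si128(&d[i], v);                /* non-temporal store */
        }
        _mm_sfence();    /* make the streaming stores globally visible */
    }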


December 6, 2010

  • simon went over the presentation he'll be giving
    • mark: dram page miss rate? why not 100%?
    • carlson benchmark: why is base case so much slower than list traversal?

November 15, 2010

  • New machine is up, prefetchers switched off
    • 2MB huge pages enabled; we think we know how to use them (a minimal allocation sketch follows this list).
    • (old P4 Xeon cluster is alive again, too)
  • Brandon reran linked list traversal on new machine
    • Single socket max BW is what we expect
    • Single socket BW seems to plateau between 24-36 outstanding misses, no matter which way we slice it. (2 threads, 12 misses, or 12 threads, 2 misses)
    • This fits with our model of the processor (at most 32 slots in global queue for local reads)
    • This leads us back to the SIMD idea: with a 12-thread or 16-thread processor, 2-way static SIMD might be enough to saturate bandwidth without context switching.
    • todo: reslice data to show max bandwidth by total outstanding requests
    • todo: want to see BW for a single thread all the way to 50 concurrent misses, to verify there is a plateau
    • two-socket bandwidth is not what we expect (same as single socket). Suspicion: the NUMA layout is not behaving how we expect. Will explore further.
  • mark suggests, if we want to avoid TLB problems altogether:
    • limit Linux to 1 GB, set up a page map, and manage the rest of memory yourself (done by Gridiron Systems)
  • Jacob is grafting timing model into a cache simulator to help explain Andrew's code.
  • Andrew is investigating his code
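
A minimal sketch of how we believe the 2 MB huge pages get allocated on the new machine, assuming hugetlbfs pages have been reserved (e.g. via /proc/sys/vm/nr_hugepages); the exact flags may differ on our kernel:

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define HUGE_2MB (2UL * 1024 * 1024)

    int main(void) {
        size_t len = 64 * HUGE_2MB;      /* 128 MB backed by 2 MB pages */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
        for (size_t i = 0; i < len; i += HUGE_2MB)
            ((volatile char *)p)[i] = 0; /* touch each page so it is faulted in */
        munmap(p, len);
        return 0;
    }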

November 8, 2010

  • Andrew discussed his context switching approach
  • he sees speedups, but not what simon expected
  • next step: explain where speedups are coming from
  • linked list info: we exceeded what we thought was single-memory-controller bandwidth limit by having both socket 0 and socket 1 make requests to MC 0. This gave us another 2-3GB/s. This makes it seem like something before the memory controller (global queue?) is limiting the single-socket bandwidth.

November 1, 2010

  • looked at poster made last week for affiliates
    • interesting conversations with facebook---follow up in future
  • convey training
    • fpgas sit on front-side bus. seems like it could be useful.
  • talked to allan porterfield about pchase bandwidth limits
    • he feels the bottleneck is dram page switching times
  • linked list traversal status
    • modified allocator to minimize tlb refills. this gives us similar results to pchase
    • we max out at 60% of theoretical peak bandwidth
    • this requires ~6 concurrent requests with 4 threads, or ~3 concurrent requests with 8 threads
    • global queue contention is minimal at this level
    • latencies vary from ~30ns to ~100ns at peak bandwidth
  • XMT node can issue ~100M memory ops/s to network---is there a commodity interconnect that can give this kind of rate?
  • andrew is still working on a graph benchmark
  • discussed static scheduling instead of context switches
    • if we need only a few outstanding requests, this might be simpler
    • for the single socket case, seems reasonable
    • for multi-node, thread model has benefits
  • next time
    • andrew will have some sort of graph benchmark, and will discuss his approach
    • brandon/jacob will have a more solid idea of where our bottleneck is

October 25, 2010

  • discussed porterfield paper
    • they use pchase, a linked-list-traversal microbenchmark like the one we've discussed (a minimal pointer-chase kernel is sketched after this list)
    • they run on quad core xeon like ours, with more memory
    • they get about 54% of theoretical peak. why?
  • Kyle Wheeler, qthreads developer was here
    • he shared some new context switch code that should be faster
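
A minimal pointer-chase kernel in the spirit of pchase (a sketch of the technique, not pchase's actual code); running several independent chains per thread is how the number of outstanding misses gets raised:

    #include <stdint.h>

    /* One node per cache line; 'next' follows a random permutation so the
     * hardware prefetchers cannot predict the walk and every hop is a
     * dependent miss once the working set exceeds the caches. */
    struct node {
        struct node *next;
        char pad[64 - sizeof(struct node *)];
    };

    static uintptr_t chase(struct node *p, uint64_t hops) {
        for (uint64_t i = 0; i < hops; i++)
            p = p->next;                 /* each load depends on the previous one */
        return (uintptr_t)p;             /* keep the compiler from deleting the loop */
    }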

October 18, 2010

  • jase was here
    • discussed synchronization using microcode modification

October 11, 2010

October 4, 2010

Current progress:

  • Brandon investigated potential performance limiters in Nehalem, and an ARM
    • nehalem allows 10 outstanding data misses per core
    • 32 outstanding L3/memory loads? 16 outstanding L3/memory stores?
    • 8 outstanding prefetches? The docs weren't clear about prefetches.
    • The ARM Brandon looked at (don't remember which) allows something like 3 outstanding misses per core.
  • Jacob is still looking at benchmarks/applications. Jacob checked some into a new Git repository; he'll send a pointer.
  • Andrew is still looking at minimal context switches and Qthreads

Issues to watch out for:

  • if stacks are aligned, L1 associativity may cause problems

Current thoughts on methodology:

  • we'll take 2.5 approaches:
    • what's a lower bound on performance?
      • (basically, build a system and make it run faster)
      • write some benchmarks using green threads packages
      • hack on the benchmarks and green-threads packages for more performance
    • what's an upper bound on performance? What limits it?
      • First, "guess" at likely performance limiters in available hardware
        • build some synthetic kernels to exercise what we think are performance limiters
        • then, use performance counters to validate that limiters are behaving as we expect
      • Then, validate that these limiters are actually the problem.
        • Given a particular memory hierarchy (cache structure, link bandwidths/latencies, queue sizes), how fast can we execute a particular memory trace?
        • capture memory trace from benchmark on real system
        • find cache parameters, link bandwidths/latencies, queue sizes for real system
        • modify memory hierarchy model (Ruby or Sesc?) to match this system
        • replay memory trace through model, verifying that we get similar event counts to real hardware.
        • expand limiting parameters. do we get a speedup? If so, we were right about limit.
        • repeat until parameters are absurd. what's a reasonable design (set of parameters) that gives good performance?


Next steps:

  • keep thinking about methodology. Jacob and Simon will talk more.
  • experiment with performance counters on Nehalem system to validate info in docs. how many outstanding memory requests per core? per chip?
  • continue exploring applications and benchmarks.
  • continue exploring green threads and minimal context switches (a bare-bones ucontext sketch follows this list).
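
For the minimal-context-switch exploration, a bare-bones sketch using POSIX ucontext; Qthreads and any green-threads layer we build will differ, this just shows the raw swapcontext mechanism whose cost we want to beat:

    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, worker_ctx;

    static void worker(void) {
        for (int i = 0; i < 3; i++) {
            printf("worker step %d\n", i);
            swapcontext(&worker_ctx, &main_ctx);    /* yield back to main */
        }
    }

    int main(void) {
        char *stack = malloc(64 * 1024);
        getcontext(&worker_ctx);
        worker_ctx.uc_stack.ss_sp = stack;
        worker_ctx.uc_stack.ss_size = 64 * 1024;
        worker_ctx.uc_link = &main_ctx;             /* where to go if worker returns */
        makecontext(&worker_ctx, worker, 0);

        for (int i = 0; i < 3; i++)
            swapcontext(&main_ctx, &worker_ctx);    /* switch into the worker */

        free(stack);
        return 0;
    }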

Other questions to answer:

  • what platforms should we buy?
  • investigate potential talks?
    • jace? (extended memory semantics)
    • qthreads guy?


September 30, 2010

Current progress:

  • We can get accounts on the PNNL XMT; Simon sent out an email.
  • An XMT simulator is also available.
  • Simon created a wiki for the project and sent account details to everybody.
  • Simon has been trying to get details of applications from various people, but has not yet been successful. Graph connectivity is the basic area. There is an effort to create a graph benchmark: http://www.graph500.org/, with source in the graph 500 git repository

Thoughts on methodology:

  • We should target an abstract model, not the real ISA.
  • We will focus on performance for a single node for now.
  • It may be possible to modify processor microcode. Could that be useful?
  • Some questions we're interested in exploring:
    • How many thread contexts can we run concurrently on a modern Xeon before memory performance degrades?
    • What resource constraints in existing processors limit our performance?
    • Could we take a trace of memory operations, filter out stack references, and replay these into a simple architectural simulator to explore what resources are necessary to get performance?

We'll explore a couple angles immediately:

  • look at green threads packages: Andrew is doing this already.
  • look at applications/benchmarks: Jacob will start here.
  • look at resource constraints in Intel and Arm (maybe AMD): Brandon is looking into this.