Meeting notes: Difference between revisions
From SoftXMT
Revision as of 21:30, 13 December 2010
December 13, 2010
- Simon: report on review
- Next steps:
  - how to increase concurrency?
    - FPGA?
    - software? dedicate a core?
  - EMS
  - global addressing
  - energy
  - current performance for kernel/real apps?
    - kernels on graphs?
    - sparse matrix?
    - branch and bound?
  - Reduce page miss overhead
    - 1GB huge pages
    - physical addressing
  - Using streaming accesses
    - prefetchnt0
    - nontemporal store (MOVNTQ)
    - nontemporal load
  - larger system
    - FPGA options
    - AMD HT3 HNC
    - Intel QPI spec
  - Other processors?
    - Intel MIC?
    - ARM?
  - Improve application performance with thread library
- HotPar paper
  - get speedup with an application or two
    - bfs
    - sparse matrix something (elimination tree?)
    - us vs. many-core?
December 6, 2010
- simon went over the presentation he'll be giving
- mark: dram page miss rate? why not 100%?
- carlson benchmark: why is base case so much slower than list traversal?
November 15, 2010
- New machine is up, prefetchers switched off
- 2MB huge pages enabled, we think we know how to use them.
- (old P4 Xeon cluster is alive again, too)
- Brandon reran linked list traversal on new machine
- Single socket max BW is what we expect
- Single socket BW seems to plateau between 24 and 36 total outstanding misses, no matter which way we slice it (2 threads with 12 misses each, or 12 threads with 2 misses each).
- This fits with our model of the processor (at most 32 slots in global queue for local reads)
- This leads us back to the SIMD idea: with a 12-thread or 16-thread processor, 2-way static SIMD might be enough to saturate bandwidth without context switching.
- todo: reslice data to show max bandwidth by total outstanding requests
- todo: want to see BW for a single thread all the way to 50 concurrent misses, to verify there is a plateau
- two-socket bandwidth is not what we expect (same as single socket). Suspicion: NUMA layout not behaving how we expect. Will explore further.
- mark suggests, if we want to avoid TLB problems altogether:
- limit Linux to 1 GB of memory, set up the page map, and manage the rest of memory yourself (as done by Gridiron Systems)
- Jacob is grafting timing model into a cache simulator to help explain Andrew's code.
- Andrew is investigating his code
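
The plateau at 24 to 36 outstanding misses can be related to bandwidth with Little's law (bytes in flight divided by latency). The following is a minimal sketch, not a model of the actual machine: it assumes 64-byte cache lines and a hypothetical 100 ns loaded latency, and it takes the 32-slot cap from the global queue limit mentioned above. All function names are made up for illustration.

```python
CACHE_LINE_BYTES = 64  # assumption: cache-line size of the machine


def total_outstanding(threads, misses_per_thread, queue_slots=32):
    """Requests in flight, capped by a shared queue (32 slots per the notes)."""
    return min(threads * misses_per_thread, queue_slots)


def bandwidth_bytes_per_s(outstanding, latency_ns, line_bytes=CACHE_LINE_BYTES):
    """Little's law: throughput = concurrency * transfer size / latency."""
    return outstanding * line_bytes / (latency_ns * 1e-9)


def misses_needed(target_bytes_per_s, latency_ns, line_bytes=CACHE_LINE_BYTES):
    """Concurrency required to sustain a target bandwidth at a given latency."""
    return target_bytes_per_s * latency_ns * 1e-9 / line_bytes


if __name__ == "__main__":
    # Both slicings from the notes give the same total concurrency.
    print(total_outstanding(2, 12), total_outstanding(12, 2))
    # Hypothetical: a full 32-entry queue at an assumed 100 ns latency.
    print(bandwidth_bytes_per_s(32, 100) / 1e9, "GB/s")
```

Under this model, adding threads or misses per thread beyond the queue cap cannot raise bandwidth, which is one way to read the plateau.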
November 8, 2010
- Andrew discussed his context switching approach
- he sees speedups, but not what simon expected
- next step: explain where speedups are coming from
- linked list info: we exceeded what we thought was single-memory-controller bandwidth limit by having both socket 0 and socket 1 make requests to MC 0. This gave us another 2-3GB/s. This makes it seem like something before the memory controller (global queue?) is limiting the single-socket bandwidth.
November 1, 2010
- looked at poster made last week for affiliates
- interesting conversations with facebook---follow up in future
- convey training
- fpgas sit on front-side bus. seems like it could be useful.
- talked to allan porterfield about pchase bandwidth limits
- he feels the bottleneck is dram page switching times
- linked list traversal status
- modified allocator to minimize tlb refills. this gives us similar results to pchase
- we max out at 60% of theoretical peak bandwidth
- this requires ~6 concurrent requests with 4 threads, or ~3 concurrent requests with 8 threads
- global queue contention is minimal at this level
- latencies vary from ~30ns to ~100ns at peak bandwidth
- XMT node can issue ~100M memory ops/s to network---is there a commodity interconnect that can give this kind of rate?
- andrew is still working on a graph benchmark
- discussed static scheduling instead of context switches
- if we need only a few outstanding requests, this might be simpler
- for the single socket case, seems reasonable
- for multi-node, thread model has benefits
- next time
- andrew will have some sort of graph benchmark, and will discuss his approach
- brandon/jacob will have a more solid idea of where our bottleneck is
October 25, 2010
- discussed porterfield paper
- they use pchase, a linked list traversal microbenchmark like we've discussed
- they run on quad core xeon like ours, with more memory
- they get about 54% of theoretical peak. why?
- Kyle Wheeler, qthreads developer was here
- he shared some new context switch code that should be faster
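
For reference, the structure of a linked list traversal (pointer chasing) microbenchmark can be sketched as follows. This is not pchase's code and it will not measure memory bandwidth in Python; it only illustrates the idea of one randomized cycle through all nodes, walked from several starting points so that each walker contributes one independent outstanding miss. The function names and the use of Sattolo's algorithm to build the cycle are assumptions made for this sketch.

```python
import random


def build_chain(n, seed=0):
    """Return a 'next index' array forming a single cycle over n nodes.

    Uses Sattolo's algorithm, which yields a random permutation with
    exactly one cycle, so a walk visits every node before repeating.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, start, steps):
    """Follow the chain from 'start'; each step depends on the previous load."""
    node = start
    for _ in range(steps):
        node = nxt[node]
    return node


def chase_interleaved(nxt, starts, steps):
    """Advance several independent walkers in lockstep.

    In a real benchmark each walker is one concurrent miss, so
    len(starts) sets the number of outstanding requests per thread.
    """
    nodes = list(starts)
    for _ in range(steps):
        nodes = [nxt[p] for p in nodes]
    return nodes
```

In a native implementation the node array would be sized well beyond the last-level cache so that nearly every step misses.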
October 18, 2010
- jase was here
- discussed synchronization using microcode modification
October 11, 2010
October 4, 2010
Current progress:
- Brandon investigated potential performance limiters in Nehalem, and an ARM
- nehalem allows 10 outstanding data misses per core
- 32 outstanding L3/memory loads? 16 outstanding L3/memory stores?
- 8 outstanding prefetches? The docs weren't clear about prefetches.
- The ARM Brandon looked at (don't remember which) allows something like 3 outstanding misses per core.
- Jacob is still looking at benchmarks/applications. Jacob checked some into a new Git repository; he'll send a pointer.
- Andrew is still looking at minimal context switches and Qthreads
Issues to watch out for:
- if stacks are aligned, L1 associativity may cause problems
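
To illustrate the stack alignment concern, here is a small sketch that computes which L1 set an address maps to. It assumes an L1 data cache of 32 KB, 8 ways, and 64-byte lines (believed to match Nehalem, but treat it as an assumption), and a hypothetical 64 KB stack alignment with 16 green threads. With those numbers, the top of every aligned stack lands in the same set, and 16 lines compete for 8 ways.

```python
from collections import Counter

L1_SIZE_BYTES = 32 * 1024  # assumption: L1D capacity
L1_WAYS = 8                # assumption: associativity
LINE_BYTES = 64            # assumption: cache-line size


def num_sets(size_bytes=L1_SIZE_BYTES, ways=L1_WAYS, line_bytes=LINE_BYTES):
    return size_bytes // (ways * line_bytes)


def set_index(addr, line_bytes=LINE_BYTES):
    """Cache set selected by an address: line number modulo set count."""
    return (addr // line_bytes) % num_sets(line_bytes=line_bytes)


def worst_set_load(addrs):
    """Largest number of the given addresses that share one L1 set."""
    return max(Counter(set_index(a) for a in addrs).values())


STACK_ALIGN = 64 * 1024   # hypothetical per-thread stack alignment
BASE = 0x7F0000000000     # hypothetical base address
aligned = [BASE + i * STACK_ALIGN for i in range(16)]
# One possible mitigation: offset each stack by a different number of lines.
staggered = [a + i * LINE_BYTES for i, a in enumerate(aligned)]

if __name__ == "__main__":
    print(num_sets(), worst_set_load(aligned), worst_set_load(staggered))
```

Staggering the stack starts is only one option; whether it matters in practice depends on how hot the top-of-stack lines are.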
Current thoughts on methodology:
- we'll take 2.5 approaches:
- what's a lower bound on performance?
- (basically, build a system and make it run faster)
- write some benchmarks using green threads packages
- hack benchmarks, green threads packages for more performance
- what's an upper bound on performance? What limits it?
- First, "guess" at likely performance limiters in available hardware
- build some synthetic kernels to exercise what we think are performance limiters
- then, use performance counters to validate that limiters are behaving as we expect
- Then, validate that these limiters are actually the problem.
- Given a particular memory hierarchy (cache structure, link bandwidths/latencies, queue sizes), how fast can we execute a particular memory trace?
- capture memory trace from benchmark on real system
- find cache parameters, link bandwidths/latencies, queue sizes for real system
- modify memory hierarchy model (Ruby or Sesc?) to match this system
- replay memory trace through model, verifying that we get similar event counts to real hardware.
- expand limiting parameters. do we get a speedup? If so, we were right about limit.
- repeat until parameters are absurd. what's a reasonable design (set of parameters) that gives good performance?
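
The replay-and-expand loop above can be sketched with a toy model. This is a stand-in for a real memory hierarchy simulator such as Ruby or Sesc, not a replacement: it models only a single set-associative LRU cache (no link bandwidths, latencies, or queue sizes), and the class and function names are invented for this example. The idea is the same, though: replay one trace, widen one parameter, and see whether the event counts improve.

```python
from collections import OrderedDict


class SetAssocCache:
    """Toy set-associative cache with LRU replacement."""

    def __init__(self, size_bytes, ways, line_bytes=64):
        self.line_bytes = line_bytes
        self.ways = ways
        self.n_sets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.n_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        s = self.sets[line % self.n_sets]
        if line in s:
            s.move_to_end(line)        # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(s) >= self.ways:
                s.popitem(last=False)  # evict least recently used line
            s[line] = True


def replay(trace, size_bytes, ways):
    """Replay a list of addresses and return the miss count."""
    cache = SetAssocCache(size_bytes, ways)
    for addr in trace:
        cache.access(addr)
    return cache.misses


def sweep(trace, sizes, ways=8):
    """Expand one limiting parameter (capacity) and record misses for each."""
    return {size: replay(trace, size, ways) for size in sizes}


if __name__ == "__main__":
    # Synthetic trace: four passes over a 64 KB working set.
    trace = list(range(0, 64 * 1024, 64)) * 4
    print(sweep(trace, [32 * 1024, 64 * 1024]))
```

If misses drop sharply when a parameter is expanded, that parameter was a limiter for this trace; if not, the guess was wrong and another parameter should be tried.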
Next steps:
- keep thinking about methodology. Jacob and Simon will talk more.
- experiment with performance counters on Nehalem system to validate info in docs. how many outstanding memory requests per core? per chip?
- continue exploring applications and benchmarks.
- continue exploring green threads, minimal context switch.
Other questions to answer:
- what platforms should we buy?
- investigate potential talks?
- jace? (extended memory semantics)
- qthreads guy?
September 30, 2010
Current progress:
- We can get accounts on the PNNL XMT; Simon sent out an email.
- An XMT simulator is also available.
- Simon created a wiki for the project and sent account details to everybody.
- Simon has been trying to get details of applications from various people, but has not yet been successful. Graph connectivity is the basic area. There is an effort to create a graph benchmark, Graph 500 (http://www.graph500.org/), with source in the Graph 500 git repository.
Thoughts on methodology:
- We should target an abstract model, not the real ISA.
- We will focus on performance for a single node for now.
- It may be possible to modify processor microcode. Could that be useful?
- Some questions we're interested in exploring:
- How many thread contexts can we run concurrently on a modern Xeon before memory performance degrades?
- What resource constraints in existing processors limit our performance?
- Could we take a trace of memory operations, filter out stack references, and replay these into a simple architectural simulator to explore what resources are necessary to get performance?
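
As a sketch of the trace-filtering question, the following drops stack references from a memory trace before it is replayed. The trace record format, the function name, and the example address ranges are all hypothetical; a real trace would come from an instrumentation tool, and the stack ranges from the process's memory map.

```python
from typing import Iterable, Iterator, NamedTuple, Sequence, Tuple


class MemOp(NamedTuple):
    addr: int
    is_write: bool


def filter_stack(
    trace: Iterable[MemOp],
    stack_ranges: Sequence[Tuple[int, int]],
) -> Iterator[MemOp]:
    """Yield only the operations whose address is outside every stack range.

    Each range is a half-open interval [low, high).
    """
    for op in trace:
        if not any(low <= op.addr < high for low, high in stack_ranges):
            yield op


if __name__ == "__main__":
    example = [
        MemOp(0x1000, False),      # heap load (kept)
        MemOp(0x7FFF0010, True),   # stack store (dropped)
        MemOp(0x2000, True),       # heap store (kept)
    ]
    for op in filter_stack(example, [(0x7FFF0000, 0x80000000)]):
        print(hex(op.addr), op.is_write)
```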
We'll explore a couple angles immediately:
- look at green threads packages: Andrew is doing this already.
- look at applications/benchmarks: Jacob will start here.
- look at resource constraints in Intel and ARM (maybe AMD): Brandon is looking into this.