### Database Architectures for New Hardware

### a tutorial by

### Anastassia Ailamaki

Database Group Carnegie Mellon University http://www.cs.cmu.edu/~natassa



### Trends in processor performance

- Scaling # of transistors, innovative microarchitecture
- Higher performance, despite technological hurdles!

[Figure: Moore's Law, transistors and processor performance]







### Modern storage managers

- Several decades of work to hide I/O
- Asynchronous I/O + Prefetch & Postwrite
  - Overlap I/O latency with useful computation
- Parallel data access
  - Partition data on modern disk arrays [PAT88]
- Smart data placement / clustering
  - Improve data locality
  - Maximize parallelism
  - Exploit hardware characteristics

...and larger main memories fit more data

- 1MB in the 80's, 10GB today, TBs coming soon

DB storage mgrs efficiently hide I/O latencies

©2004 Anastassia Ailamaki



### Breaking the Memory Wall


### Wish for a Database Architecture:

- that uses hardware intelligently
- that won't fall apart when new computers arrive
- that will adapt to alternate configurations

### Efforts from multiple research communities

- Cache-conscious data placement and algorithms
- Instruction stream optimizations
- Novel database software architectures
- Novel hardware designs (covered briefly)


### Detailed Outline

- Introduction and Overview
- New Hardware
  - Execution Pipelines
  - Cache memories
- Where Does Time Go?
  - Measuring Time (Tools and Benchmarks)
  - Analyzing DBs: Experimental Results
- Bridging the Processor/Memory Speed Gap
  - Data Placement
  - Access Methods
  - Query Processing Algorithms
  - Instruction Stream Optimizations
  - Staged Database Systems
  - Newer Hardware
- Hip and Trendy
  - Query co-processing
  - Databases on MEMStore
- Directions for Future Research

### This Section's Goals

- Understand how a program is executed
  - How new hardware parallelizes execution
  - What are the pitfalls
- Understand why database programs do not take advantage of microarchitectural advances
- Understand memory hierarchies
  - How they work
  - What are the parameters that affect program behavior
  - Why they are important to database performance

### Outline

- Introduction and Overview
- New Hardware
  - Execution Pipelines
  - Cache memories
- Where Does Time Go?
- Bridging the Processor/Memory Speed Gap
- Hip and Trendy
- Directions for Future Research


















### Miss Classification (3+1 C's)

- Compulsory (cold)
  - "Cold miss" on first access to a block
  - Defined as: miss in infinite cache
- Capacity
  - Misses occur because cache not large enough
  - Defined as: miss in fully-associative cache
- Conflict
  - Misses occur because of restrictive mapping strategy
  - Only in set-associative or direct-mapped cache
  - Defined as: not attributable to compulsory or capacity
- Coherence
  - Misses occur because of sharing among multiprocessors

Cold misses are unavoidable. Capacity, conflict, and coherence misses can be reduced.
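The three single-processor definitions can be turned into a small trace-driven classifier. The sketch below is only an illustration under stated assumptions: the cache under study is direct-mapped, the fully-associative reference cache uses LRU replacement, and the function name `classify_misses` and its parameters are hypothetical. Coherence misses are not modeled because the trace has a single processor.

```python
from collections import OrderedDict


def classify_misses(trace, num_lines, block_size=64):
    """Classify misses of a direct-mapped cache with `num_lines` lines.

    A miss is compulsory if an infinite cache would also miss,
    capacity if a fully-associative LRU cache of the same size would also miss,
    and conflict otherwise.
    """
    seen = set()              # infinite cache: every block touched so far
    fully_assoc = OrderedDict()  # fully-associative LRU cache (assumed policy)
    direct = {}               # direct-mapped cache: line index -> block
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for address in trace:
        block = address // block_size

        # Reference lookup in the fully-associative cache.
        fa_hit = block in fully_assoc
        if fa_hit:
            fully_assoc.move_to_end(block)
        else:
            fully_assoc[block] = True
            if len(fully_assoc) > num_lines:
                fully_assoc.popitem(last=False)  # evict least recently used

        # Lookup in the direct-mapped cache under study.
        index = block % num_lines
        dm_hit = direct.get(index) == block
        direct[index] = block

        if dm_hit:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1
        elif not fa_hit:
            counts["capacity"] += 1
        else:
            counts["conflict"] += 1
        seen.add(block)
    return counts
```

For example, with two lines and one-byte blocks, the trace `[0, 2, 0, 2]` maps both blocks to the same line, so the last two accesses are conflict misses even though a fully-associative cache of the same size would hit.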







### Summary: New Hardware

- Fundamental goal in processor design: max ILP
  - Pipelined, superscalar, speculative execution
  - Out-of-order execution
  - Non-blocking caches
  - Dependencies in instruction stream lower ILP
- Caches important for database performance
  - Level 1 instruction cache in critical execution path
  - Trips to memory most expensive
- DB workloads on new hardware
  - Too many load/store instructions
  - Tight dependencies in instruction stream
  - Algorithms not optimized for cache hierarchies
  - Long code paths
  - Large instruction and data footprints

### This Section's Goals

- Hardware takes time: how do we measure time?
- Understand how to efficiently analyze microarchitectural behavior of database workloads
  - Should we use simulators? When? Why?
  - How do we use processor counters?
  - Which tools are available for analysis?
  - Which database systems/benchmarks to use?
- Survey experimental results on workload characterization
- Discover what matters for database performance

### Outline

- Introduction and Overview
- New Hardware
- Where Does Time Go?
  - Measuring Time (Tools and Benchmarks)
  - Analyzing DBs: Experimental Results
- Bridging the Processor/Memory Speed Gap
- Hip and Trendy
- Directions for Future Research

### Simulator vs. Real Machine

|               | Real machine                                            | Simulator                                            |
|---------------|---------------------------------------------------------|------------------------------------------------------|
| Events        | Limited to available hardware counters/events           | Can measure any event                                |
| Configuration | Limited to (real) hardware configurations               | Vary hardware configurations                         |
| Speed         | Fast (real-life) execution                              | (Too) slow execution                                 |
| Workloads     | Enables testing real: large & more realistic workloads  | Often forces use of scaled-down/simplified workloads |
| Repeatability | Sometimes not repeatable                                | Always repeatable                                    |
| Tools         | Performance counters                                    | Virtutech Simics, SimOS, SimpleScalar, etc.          |

Real-machine experiments to locate problems. Simulation to evaluate solutions.

### Hardware Performance Counters

- What are they?
  - Special purpose registers that keep track of programmable events
  - Non-intrusive counts "accurately" measure processor events
  - Software APIs handle event programming/overflow
  - GUI interfaces built on top of APIs to provide higher-level analysis
- What can they count?
  - Instructions, branch mispredictions, cache misses, etc.
  - No standard set exists
- Issues that may complicate life
  - Provide only hard counts; analysis must be done by user or tools
  - Made specifically for each processor
    - Even processor families may have different interfaces
  - Vendors don't like to support them because they are not a profit contributor


### **Evaluating Behavior using HW Counters**

- Stall time (cycle) counters
  - very useful for time breakdowns
  - (e.g., instruction-related stall time)
- Event counters
  - useful to compute ratios
  - (e.g., # misses in L1-Data cache)
- Need to understand counters before using them
  - Often not easy from documentation
  - Best way: microbenchmark (run programs with precomputed events)
    - E.g., strided accesses to an array
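A strided-access microbenchmark is useful because the expected event count can be computed ahead of time and compared with what the counter reports. The sketch below only computes that expectation. It assumes a line-aligned array, a cold cache, a single pass over the array, and no hardware prefetching; the function names and the 64-byte default line size are illustrative choices, not values from the slides.

```python
def distinct_lines_touched(array_bytes, stride_bytes, line_bytes=64):
    """Cache lines touched by one pass of strided accesses over a cold cache."""
    offsets = range(0, array_bytes, stride_bytes)
    return len({offset // line_bytes for offset in offsets})


def expected_miss_ratio(array_bytes, stride_bytes, line_bytes=64):
    """Expected misses per access: one cold miss per distinct line touched."""
    accesses = len(range(0, array_bytes, stride_bytes))
    return distinct_lines_touched(array_bytes, stride_bytes, line_bytes) / accesses
```

With a 4-byte stride every sixteenth access is expected to miss; once the stride reaches the line size, every access is expected to miss. A counter whose reading disagrees with these precomputed values is probably not measuring what its name suggests.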


### Example:

| Measurement                | Event                |
|----------------------------|----------------------|
| Cycles                     | CPU_CLK_UNHALTED     |
| Instructions               | INST_RETIRED         |
| L1 Data (L1D) accesses     | DATA_MEM_REFS        |
| L1 Data (L1D) misses       | DCU_LINES_IN         |
| L2 Misses                  | L2_LINES_IN          |
| Instruction-related stalls | IFU_MEM_STALL        |
| Branches                   | BR_INST_DECODED      |
| Branch mispredictions      | BR_MISS_PRED_RETIRED |
| TLB misses                 | ITLB_MISS            |
| L1 Instruction misses      | IFU_IFETCH_MISS      |
| Dependence stalls          | PARTIAL_RAT_STALLS   |
| Resource stalls            | RESOURCE_STALLS      |

Lots more detail, measurable events, statistics. Often more than one way to measure the same thing.
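Event counts like these are usually combined into ratios before they say anything useful. A minimal sketch, assuming the counts have already been read into a dictionary keyed by the event names in the table; the function name and the pairing of numerator and denominator events are illustrative assumptions, and any counts passed in would be your own measurements.

```python
def derived_ratios(counts):
    """Combine raw event counts into ratios; keys follow the event table."""
    return {
        "cycles_per_instruction": counts["CPU_CLK_UNHALTED"] / counts["INST_RETIRED"],
        "l1d_miss_rate": counts["DCU_LINES_IN"] / counts["DATA_MEM_REFS"],
        "l2_misses_per_instruction": counts["L2_LINES_IN"] / counts["INST_RETIRED"],
        "branch_misprediction_rate": (
            counts["BR_MISS_PRED_RETIRED"] / counts["BR_INST_DECODED"]
        ),
    }
```

Note that the last ratio mixes a retired-event count with a decoded-event count, so it is only an approximation; this is one reason the counters need to be understood before they are used.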

### Producing time breakdowns

- Determine benchmark/methodology (more later)
- Devise formulae to derive useful statistics
- Determine (and test!) software
  - E.g., Intel VTune (GUI, sampling), or emon
  - Publicly available & universal (e.g., PAPI [DMM04])
- Determine time components T1...Tn
  - Determine how to measure each using the counters
  - Compute execution time as the sum
- Verify model correctness
  - Measure execution time (in #cycles)
  - Ensure measured time = computed time (or almost)
  - Validate computations using redundant formulae
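The verification step above can be sketched as a small check: sum the components T1...Tn, compare against the measured cycle count, and report each component's share. The component names and the 5% tolerance in the sketch are assumptions made for illustration, not values from the slides.

```python
def check_breakdown(measured_cycles, components, tolerance=0.05):
    """Compare the sum of time components with the measured execution time.

    `components` maps a component name to its cycle count as derived
    from the counters. Returns the computed total, whether it is within
    `tolerance` of the measurement, and each component's share.
    """
    computed = sum(components.values())
    relative_error = abs(computed - measured_cycles) / measured_cycles
    shares = {name: cycles / computed for name, cycles in components.items()}
    return computed, relative_error <= tolerance, shares
```

If the check fails, either a component is missing from the model or one of the formulae double-counts overlapping stalls, which is where the redundant formulae help.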






### What to measure?

- Decision Support System (DSS: TPC-H)
  - Complex queries, low-concurrency
  - Read-only (with rare batch updates)
  - Sequential access dominates
  - Repeatable (unit of work = query)
- On-Line Transaction Processing (OLTP: TPC-C, ODB)
  - Transactions with simple queries, high-concurrency
  - Update-intensive
  - Random access frequent
  - Not repeatable (unit of work = 5s of execution after rampup)

Often too complex to provide useful insight

### Microbenchmarks

[KPH98, ADH99, KP00, SAF04]

- What matters is basic execution loops
- Isolate three basic operations:
  - Sequential scan (no index)
  - Random access on records (non-clustered index)
  - Join (access on two tables)
- Vary parameters:
  - selectivity, projectivity, # of attributes in predicate
  - join algorithm, isolate phases
  - table size, record size, # of fields, type of fields
- Determine behavior and trends
  - Microbenchmarks can efficiently mimic TPC microarchitectural behavior!
  - Widely used to analyze query execution

**Excellent for microarchitectural analysis** 













### Summary: Where Does Time Go?

- Goal: discover bottlenecks
  - Hardware performance counters ⇒ time breakdown
  - Tools available for access and analysis (+simulators)
  - Run commercial DBMS and equivalent prototypes
  - Microbenchmarks offer valuable insight
- Database workloads: more than 50% stalls
  - Mostly due to memory delays
  - Cannot always reduce stalls by increasing cache size
- Crucial bottlenecks
  - Data accesses to L2 cache (esp. for DSS)
  - Instruction accesses to L1 cache (esp. for OLTP)





### This Section's Goals

- Survey techniques to improve locality
  - Relational data
  - Access methods
- Survey new query processing algorithms
- Present a new database system architecture
- Briefly explain Instruction Stream Optimizations
- Show how much good understanding of the platform can achieve

### Outline

- Introduction and Overview
- New Hardware
- Where Does Time Go?
- Bridging the Processor/Memory Speed Gap
  - Data Placement
  - Access Methods
  - Query Processing
  - Instruction Stream Optimizations
  - Staged Database Systems
  - Newer Hardware
- Hip and Trendy
- Directions for Future Research







### Decomposition Storage Model (DSM) [CK85]

| EID  | Name  | Age |
|------|-------|-----|
| 1237 | Jane  | 30  |
| 4322 | John  | 45  |
| 1563 |       | 20  |
| 7658 | Susan | 52  |
| 2534 |       | 43  |
| 8791 | Dan   | 37  |

Partition original table into *n* 1-attribute sub-tables
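The decomposition can be sketched in a few lines. In this sketch each sub-table pairs an attribute value with a record identifier so that records can be stitched back together; using the row position as that identifier is an assumption made here, and the function names `decompose` and `project` are hypothetical. Only rows whose values are fully legible in the table above are used.

```python
def decompose(columns, rows):
    """Split a table into one sub-table per attribute.

    Each sub-table is a list of (record id, value) pairs; the record id
    is simply the row position (an assumption of this sketch).
    """
    return {
        name: [(rid, row[position]) for rid, row in enumerate(rows)]
        for position, name in enumerate(columns)
    }


def project(subtables, names):
    """Partial-record access: read only the sub-tables that are needed."""
    selected = [subtables[name] for name in names]
    # All sub-tables are in record-id order, so zipping them rejoins records.
    return [tuple(value for _, value in pairs) for pairs in zip(*selected)]


employees = [(1237, "Jane", 30), (4322, "John", 45), (7658, "Susan", 52), (8791, "Dan", 37)]
subtables = decompose(["EID", "Name", "Age"], employees)
ages = project(subtables, ["Age"])
```

A query that needs only `Age` touches one sub-table, which is the partial-record advantage; a query that needs whole records has to rejoin all *n* sub-tables, which is the reconstruction cost.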





### Repairing NSM's cache performance

We need a data placement that...

- Eliminates unnecessary memory accesses
- Improves inter-record locality
- Keeps a record's fields together
- Does not affect NSM's I/O performance

and, most importantly, is...







### Dynamic PAX: Data Morphing [HP03]

- PAX random access: more cache misses in record
- Store attributes accessed together contiguously
- Dynamic partition updates with changing workloads
  - Optimize total cost based on cache misses
  - Partition algorithms: naïve & hill-climbing algorithms
- Fewer cache misses
  - Better projectivity and scalability for index scan queries
  - Up to 45% faster than NSM & 25% faster than PAX
- Same I/O performance as PAX and NSM
- Future work: how to handle conflicts?


### Alternatively: Repair DSM's I/O behavior

- We like DSM for partial record access
- We like NSM for full-record access

Solution: Fractured Mirrors [RDS02]

1. Get data placement right

[Figure: sparse B-Tree on ID over attribute values A1 to A5]

2. Faster record reconstruction

Instead of record- or page-at-a-time... Chunk-based merge algorithm!

1. Read in segments of M pages (a "chunk")
2. Merge segments in memory
3. Requires (N\*K)/M disk seeks
4. For a memory budget of B pages, each partition gets B/N pages in a chunk

[Figure: Lineitem (TPC-H), 1GB; series include NSM and page-at-a-time]

### Fractured Mirrors

3. Smart mirroring

[Figure: NSM copy and DSM copy placed across Disk 1 and Disk 2]

- Achieves 2-3x speedups on TPC-H
- Needs 2 copies of the database
- Future work:
  - A new optimizer
  - Smart buffer pool management
  - Updates

### Summary (no replication)

| Page layout | Cache-memory: full-record access | Cache-memory: partial record access | Memory-disk: full-record access | Memory-disk: partial record access |
|-------------|----------------------------------|-------------------------------------|---------------------------------|------------------------------------|
| NSM         | ☺                                | ☹                                   | ☺                               | ☹                                  |
| DSM         | ☹                                | ☺                                   | ☹                               | ☺                                  |
| PAX         | ☺                                | ☺                                   | ☺                               | ☹                                  |

### Need new placement method:

- Efficient full- and partial-record accesses
- Maximize utilization at all levels of memory hierarchy

Difficult!!!

- Different devices/access methods
- Different workloads on the same database




### Data Placement: Summary

- Smart data placement increases spatial locality
  - Research targets table (relation) data
  - Goal: Reduce number of non-cold cache misses
- Techniques focus on grouping attributes into cache lines for quick access
- PAX, Data morphing: Cache optimization techniques
- Fractured Mirrors: Cache-and-disk optimization
- Fates DB Storage Manager: Independent data layout support across the entire memory hierarchy


### Main-Memory Tree Indexes

- T Trees: proposed in mid-80s for MMDBs [LC86]
  - Aim: balance space overhead with searching time
  - Uniform memory access assumption (no caches)
- Main-memory B<sup>+</sup> Trees: better cache performance [RR99]
- Node width = cache line size (32-128 bytes)
  - Minimize number of cache misses for search
  - Tree height much higher than traditional disk-based B-Trees
- So now trees are too deep

How to make trees shallower?
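The depth problem follows from simple arithmetic: a node the size of a cache line holds few keys, so the fanout is small and the tree needs many levels. The sketch below estimates height from node size. The 4-byte keys and pointers, the assumption that every node is full, and the function name are illustrative simplifications, not figures from the slides.

```python
def tree_height(num_keys, node_bytes, key_bytes=4, pointer_bytes=4):
    """Estimate fanout and height of a tree whose nodes are `node_bytes` wide.

    Assumes every node is full and stores one pointer per key.
    """
    fanout = node_bytes // (key_bytes + pointer_bytes)
    height, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        height += 1
    return fanout, height


cache_line_tree = tree_height(10_000_000, node_bytes=64)    # cache-line-sized nodes
disk_page_tree = tree_height(10_000_000, node_bytes=8192)   # disk-page-sized nodes
```

Under these assumptions ten million keys need 8 levels with 64-byte nodes but only 3 levels with 8KB nodes. Anything that raises the fanout of a cache-line-sized node, such as storing fewer pointers per node, directly reduces the height.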

### What do we do with cold misses?

- Answer: hide latencies using prefetching
- Prefetching enabled by
  - Non-blocking cache technology
  - Prefetch assembly instructions
    - SGI R10000, Alpha 21264, Intel Pentium4

[Figure: instruction stream with prefetches: pref 0(r2), pref 4(r7), pref 0(r3), pref 8(r9)]

Prefetching hides cold cache miss latency. Efficiently used in pointer-chasing lookups!









### Access Methods: Summary

- Optimize B+ Tree pointer-chasing cache behavior
  - Reduce node size to few cache lines
  - Reduce pointers for larger fanout (CSB+)
  - "Next" pointers to lowest non-leaf level for easy prefetching (pB+)
  - Simultaneously optimize cache and disk (fpB+)
  - Bulk searches: Buffer index accesses

### Additional work:

- Cache-oblivious B-Trees [BDF00]
  - Optimal bound in number of memory transfers
  - Regardless of # of memory levels, block size, or level speed
- Survey of techniques for B-Tree cache performance [GL01]
  - Existing heretofore-folkloric knowledge
  - Key normalization/compression, alignment, separating keys/pointers

Lots more to be done in area – consider interference and scarce resources


### **Query Processing Algorithms**

Idea: Adapt query processing algorithms to caches

Related work includes:

- Improving data cache performance
  - Sorting
  - Join
- Improving instruction cache performance
  - DSS applications


### Sorting

- In-memory sorting / generating runs
- AlphaSort
  - Use quick sort rather than replacement selection
    - Sequential vs. random access
    - No cache misses after sub-arrays fit in cache
  - Sort (key-prefix, pointer) pairs rather than records
  - 3x CPU speedup for the Datamation benchmark
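The (key-prefix, pointer) idea can be sketched as follows: build a compact array of pairs, sort that array, and touch the full record only when two prefixes tie. This is an illustration rather than AlphaSort itself. Python's built-in sort is not quicksort, list indexes stand in for pointers, and the function name and the 4-character prefix length are assumptions.

```python
from functools import cmp_to_key


def key_prefix_sort(records, key_of, prefix_len=4):
    """Sort records by sorting small (key prefix, record index) pairs."""
    pairs = [(key_of(record)[:prefix_len], index) for index, record in enumerate(records)]

    def compare(a, b):
        # Common case: the prefixes differ, so no record is dereferenced.
        if a[0] != b[0]:
            return -1 if a[0] < b[0] else 1
        # Tie: follow the "pointers" and compare the full keys.
        full_a, full_b = key_of(records[a[1]]), key_of(records[b[1]])
        return (full_a > full_b) - (full_a < full_b)

    pairs.sort(key=cmp_to_key(compare))
    return [records[index] for _, index in pairs]


rows = [("delta1", "x"), ("alpha", "y"), ("delta0", "z"), ("beta", "w")]
ordered = key_prefix_sort(rows, key_of=lambda row: row[0])
```

The pairs array is much smaller than the records, so more of it fits in cache during the sort; the records themselves are read mostly once, when the output is produced.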

### Hash Join

[Figure: build relation, hash table, and probe relation]

- Random accesses to hash table
  - Both when building AND when probing!!!
- Poor cache performance
  - ≥ 73% of user time is CPU cache stalls [CAG04]
- Approaches to improving cache performance
  - Cache partitioning: maximizes locality
  - Prefetching: hides latencies







### Monet: Extracting Payload

- Two ways to extract payload:
  - Pre-projection: copy fields during cache partitioning
  - Post-projection: generate join index, then extract fields
- Monet: post-projection
  - Radix-decluster algorithm for good cache performance
- Post-projection good for DSM
  - Up to 2X speedup compared to pre-projection
- Post-projection is not recommended for NSM
  - Copying fields during cache partitioning is better

Paper presented in this conference!

### Optimizing non-DSM hash joins [CAG04]

[Figure: hash table layout: bucket headers, hash cells (hash code, build tuple ptr), and the build partition]

Simplified probing algorithm, foreach probe tuple:

- (0) compute bucket number;
- (1) visit header;
- (2) visit cell array;
- (3) visit matching build tuple;

[Figure: timeline in which each visit incurs a cache miss latency]

Idea: Exploit inter-tuple parallelism

### **Group Prefetching** [CAG04]

foreach group of probe tuples:

- foreach tuple in group: (0) compute bucket number; prefetch header;
- foreach tuple in group: (1) visit header; prefetch cell array;
- foreach tuple in group: (2) visit cell array; prefetch build tuple;
- foreach tuple in group: (3) visit matching build tuple;

[Figure: stages 0 to 3 executed for all tuples of a group before moving to the next stage]
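The loop restructuring can be shown in a short sketch. It only illustrates the control flow: Python has no prefetch instruction, so `prefetch` below is a do-nothing placeholder, and the hash table layout, function names, and group size are assumptions rather than the structures used in [CAG04].

```python
def prefetch(_obj):
    """Placeholder for a hardware prefetch; it does nothing in Python."""
    return None


def build_hash_table(build_tuples, num_buckets):
    """Bucket headers refer to cell arrays of (hash code, build tuple index)."""
    headers = [[] for _ in range(num_buckets)]
    for index, (key, _payload) in enumerate(build_tuples):
        code = hash(key)
        headers[code % num_buckets].append((code, index))
    return headers


def group_probe(headers, build_tuples, probe_tuples, group_size=4):
    """Probe in groups so that one stage runs for every tuple of the group."""
    results = []
    num_buckets = len(headers)
    for start in range(0, len(probe_tuples), group_size):
        group = probe_tuples[start:start + group_size]
        # Stage 0: compute bucket numbers; prefetch headers.
        codes = [hash(key) for key, _ in group]
        buckets = [code % num_buckets for code in codes]
        for bucket in buckets:
            prefetch(headers[bucket])
        # Stage 1: visit headers; prefetch cell arrays.
        cell_arrays = [headers[bucket] for bucket in buckets]
        for cells in cell_arrays:
            prefetch(cells)
        # Stage 2: visit cell arrays; prefetch candidate build tuples.
        candidates = []
        for code, cells in zip(codes, cell_arrays):
            matches = [index for cell_code, index in cells if cell_code == code]
            for index in matches:
                prefetch(build_tuples[index])
            candidates.append(matches)
        # Stage 3: visit matching build tuples.
        for (key, payload), matches in zip(group, candidates):
            for index in matches:
                build_key, build_payload = build_tuples[index]
                if build_key == key:
                    results.append((key, build_payload, payload))
    return results
```

In a language with real prefetch instructions, the work done for the other tuples of the group between a prefetch and the matching visit is what overlaps the miss latency.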

### Software Pipelining [CAG04]

```
Prologue;
for j=0 to N-4 do {
    tuple j+3:
      (0) compute bucket number;
          prefetch header;
    tuple j+2:
      (1) visit header;
          prefetch cell array;
    tuple j+1:
      (2) visit cell array;
          prefetch build tuple;
    tuple j:
      (3) visit matching build tuple;
}
Epilogue;
```

### Prefetching: Performance Results [CAG04]

- Techniques exhibit similar performance
- Group prefetching easier to implement
- Compared to cache partitioning:
  - Cache partitioning costly when tuples are large (>20 bytes)
  - Prefetching about 50% faster than cache partitioning
- 9X speedups over baseline at 1000 cycles
- Absolute numbers do not change!

[Figure: Baseline, Group Pref, and SP Pref at 150-cycle and 1000-cycle latencies]

### DSS: Reducing I-misses [PMA01,ZR04]

- Demand-pull execution model: one tuple at a time
  - ABABABABABABABABAB...
  - If A + B > L1 instruction cache size
  - Poor instruction cache utilization!
- Solution: multiple tuples at an operator
  - ABBBBBAAAAABBBBB...
- Modify operators to support block of tuples [PMA01]
- Insert "buffer" operators between A and B [ZR04]
  - "buffer" calls B multiple times
  - Stores intermediate tuple pointers to serve A's request
  - No need to change original operators

12% speedup for simple TPC-H queries
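The effect of a buffer operator on the execution order can be sketched with generators. The sketch records which operator does per-tuple work, so the trace starts with the child B rather than the parent A. The operator names, the trace mechanism, and the buffer size are illustrative assumptions, not the implementation from [ZR04].

```python
from itertools import islice


def scan(rows, trace):
    """Child operator B: produces one tuple per request."""
    for row in rows:
        trace.append("B")
        yield row


def select(child, predicate, trace):
    """Parent operator A: pulls tuples from its child one at a time."""
    for row in child:
        trace.append("A")
        if predicate(row):
            yield row


def buffer(child, size):
    """Buffer operator: calls the child `size` times, then serves the parent."""
    while True:
        batch = list(islice(child, size))  # holds references to the tuples
        if not batch:
            return
        yield from batch


def run(rows, buffer_size=None):
    """Run A over B, optionally with a buffer operator in between."""
    trace = []
    child = scan(rows, trace)
    if buffer_size:
        child = buffer(child, buffer_size)
    result = list(select(child, lambda row: row % 2 == 0, trace))
    return result, "".join(trace)
```

Calling `run(range(1, 7))` interleaves the two operators tuple by tuple, while `run(range(1, 7), buffer_size=3)` produces the same result with the operators executing in runs of three. Neither `scan` nor `select` had to change.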

### **Concurrency Control** [CHK01]

- Multiple CPUs share a tree
- Lock coupling: too much cost
  - Latching a node means writing
  - True even for readers !!!
  - Coherence cache misses due to writes from different CPUs
- Solution:
  - Optimistic approach for readers
  - Updaters still latch nodes
  - Updaters also set node versions
  - Readers check version to ensure correctness

Search throughput: 5x (=no locking case). Update throughput: 4x.
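The reader-side version check can be sketched on a single node. This is a single-threaded illustration of the protocol's logic only: a real implementation needs atomic operations and memory ordering, and the class, the function names, and the `interfere` hook (which simulates an updater running in the middle of a read) are all hypothetical.

```python
class Node:
    """A tree node with a latch flag and a version counter."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.version = 0
        self.latched = False


def update(node, key):
    """Updaters still latch the node, and also advance its version."""
    node.latched = True
    node.keys = sorted(node.keys + [key])
    node.version += 1
    node.latched = False


def optimistic_read(node, key, interfere=None):
    """Readers never write to the node: read, then validate the version."""
    attempts = 0
    while True:
        attempts += 1
        version_before = node.version
        if node.latched:
            continue  # an updater is active; try again
        found = key in node.keys
        if interfere is not None:
            interfere()       # simulate a concurrent updater (one time only)
            interfere = None
        if not node.latched and node.version == version_before:
            return found, attempts
```

Because the reader only loads the version and never stores to the node, it causes no coherence misses on other CPUs; the price is an occasional retry when an update slips in between the read and the validation.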

### Query processing: summary

- AlphaSort: use quicksort and key prefix-pointer
- Monet: MM-DBMS uses aggressive DSM
  - Optimize partitioning with hierarchical radix-clustering
  - Optimize post-projection with radix-declustering
  - Many other optimizations
- Traditional hash joins: aggressive prefetching
  - Efficiently hides data cache misses
  - Robust performance with future long latencies
- DSS I-misses: group computation (new operator)
- B-tree concurrency control: reduce readers' latching











### STEPS: Cache-Resident OLTP

- STEPS implementation runs full OLTP workloads (TPC-C)
- Groups threads per DB operator, then uses fast context-switch to reuse instructions in the cache
- Full-system TPC-C implementation

STEPS minimizes L1-I cache misses without increasing cache size







### Staged Database Systems


[HA03]

- Staged software design allows for
  - Cohort scheduling of queries to amortize loading time
  - Suspend at module boundaries to maintain context
- Break DBMS into stages
- Stages act as independent servers
- Queries exist in the form of "packets"
- Proposed query scheduling algorithms to address locality/wait time tradeoffs [HA02]
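The stage-and-packet structure can be sketched as follows. This is an illustration under stated assumptions, not the design of [HA03]: the `Stage` class, the three stage names, and the dictionary-based packets are hypothetical, and the scheduling policy is the simplest possible one (drain each stage's queue completely before moving on).

```python
from collections import deque


class Stage:
    """A stage (hypothetical) owns a queue of query packets and one piece of work."""

    def __init__(self, name, work):
        self.name = name
        self.work = work          # function applied to each packet's state
        self.queue = deque()
        self.next_stage = None

    def run_cohort(self, trace):
        # Serve every queued packet back-to-back so the stage's code and
        # data are loaded once and reused across the whole cohort.
        while self.queue:
            packet = self.queue.popleft()
            packet["state"] = self.work(packet["state"])
            trace.append((self.name, packet["query"]))
            if self.next_stage is not None:
                self.next_stage.queue.append(packet)


parse = Stage("parse", lambda s: s + ["parsed"])
optimize = Stage("optimize", lambda s: s + ["optimized"])
execute = Stage("execute", lambda s: s + ["executed"])
parse.next_stage, optimize.next_stage = optimize, execute

packets = [{"query": q, "state": []} for q in ("Q1", "Q2", "Q3")]
parse.queue.extend(packets)

trace = []
for stage in (parse, optimize, execute):
    stage.run_cohort(trace)

print(trace[:3])  # [('parse', 'Q1'), ('parse', 'Q2'), ('parse', 'Q3')]
```

Draining a whole queue favors locality at the expense of the wait time of the last packet in the cohort, which is the tradeoff the scheduling algorithms of [HA02] address.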




### Summary: Bridging the Gap


- Cache-aware data placement
  - Eliminates unnecessary trips to memory
  - Minimizes conflict/capacity misses
  - Fates: decouple memory from storage layout
- What about compulsory (cold) misses?
  - Can't avoid, but can hide latency with prefetching
  - Techniques for B-trees, hash joins
- Staged Database Systems: a scalable future
- Addressing instruction stalls
  - DSS: Call Graph Prefetching, SIMD, group operator
  - OLTP: STEPS, a promising direction for any platform


### Outline

- Introduction and Overview
- New Hardware
- Where Does Time Go?
- Bridging the Processor/Memory Speed Gap
- Newer Hardware
- Hip and Trendy
- Directions for Future Research







### Outline

- Introduction and Overview
- New Hardware
- Where Does Time Go?
- Bridging the Processor/Memory Speed Gap
- Hip and Trendy
  - Query co-processing
  - Databases on MEMStore
- Directions for Future Research


### **Optimizing Spatial Operations**


[SAA03]

- Spatial operations are computation-intensive
  - Intersection, distance computation
  - Number of vertices per object ↑, cost ↑
- Use graphics card to increase speed
- Idea: use color blending to detect intersection
  - Draw each polygon in gray
  - Intersected area is black because of the color-mixing effect
  - Algorithms cleverly use hardware features



Intersection selection: up to 64% improvement using graphics card
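The color-blending idea can be emulated in software to make it concrete. The sketch below is a CPU stand-in, not the graphics-card implementation of [SAA03]: the function names, the 16x16 grid, and the use of 0.5 as the gray value are assumptions, and the result is only as precise as the pixel resolution.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def draw_gray(framebuffer, polygon):
    """Add 50% darkness to every pixel whose center lies inside the polygon."""
    for row in range(len(framebuffer)):
        for col in range(len(framebuffer[row])):
            if point_in_polygon(col + 0.5, row + 0.5, polygon):
                framebuffer[row][col] += 0.5


def polygons_intersect(a, b, width=16, height=16):
    framebuffer = [[0.0] * width for _ in range(height)]
    draw_gray(framebuffer, a)
    draw_gray(framebuffer, b)
    # A pixel covered twice has blended to full darkness ("black").
    return any(pixel >= 1.0 for row in framebuffer for pixel in row)


square = [(1, 1), (6, 1), (6, 6), (1, 6)]
triangle = [(4, 4), (10, 4), (7, 9)]
far_square = [(10, 10), (14, 10), (14, 14), (10, 14)]

print(polygons_intersect(square, triangle))    # True
print(polygons_intersect(square, far_square))  # False
```

On a graphics card the two drawing passes and the blending run in hardware, so the cost no longer grows with a software loop over vertices and pixels as it does in this emulation.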


### Fast Computation of DB Operations Using Graphics Processors

[GLW04]

- Exploit graphics features for database operations
  - Predicate, Boolean operations, Aggregates
- Examples:
  - Predicate: attribute > constant
    - Graphics: test a set of pixels against a reference value
    - pixel = attribute value, reference value = constant
  - Aggregations: COUNT
    - Graphics: count number of pixels passing a test
- Good performance: e.g. over 2X improvement for predicate evaluations

Promising! Peak performance of graphics processor increases 2.5-3 times a year
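The mapping from predicates and COUNT to pixel tests can be written out as a small CPU emulation. It is a sketch only: the slide does not name the specific GPU features involved, so none are modeled here, and `pixels_passing`, `count_passing`, and the sample values are hypothetical.

```python
def pixels_passing(texture, reference, test):
    """Emulate a per-pixel test: one boolean per pixel that passes."""
    return [test(pixel, reference) for pixel in texture]


def count_passing(mask):
    """Emulate counting the pixels that passed the test (COUNT)."""
    return sum(1 for passed in mask if passed)


# Attribute values stored as "pixels"; the constant is the reference value.
salaries = [42, 77, 15, 93, 60, 77]
mask = pixels_passing(salaries, 60, lambda pixel, ref: pixel > ref)

print(mask)                 # [False, True, False, True, False, True]
print(count_passing(mask))  # 3
```

A Boolean combination of predicates would combine two such masks element by element; on a graphics processor every pixel is tested in parallel rather than in a Python loop.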

### Outline

- Introduction and Overview
- New Hardware
- Where Does Time Go?
- Bridging the Processor/Memory Speed Gap
- Hip and Trendy
  - Query co-processing
  - Databases on MEMStore
- Directions for Future Research


## MEMStore (MEMS\*-based storage)

- On-chip mechanical storage, using MEMS for media positioning

[Figure: read/write tips]

\* microelectromechanical systems


### Special thanks go to...









- Shimin Chen, Minglong Shao, Stavros Harizopoulos, and Nikos Hardavellas for invaluable contributions to this talk
- Steve Schlosser (MEMStore)
- Ravi Ramamurthy (fractured mirrors)
- Babak Falsafi and Chris Colohan (h/w architecture)

### REFERENCES (used in presentation)

### References: Where Does Time Go? (simulation only)

- [ADS02] Branch Behavior of a Commercial OLTP Workload on Intel IA32 Processors. M. Annavaram, T. Diep, J. Shen. International Conference on Computer Design: VLSI in Computers and Processors (ICCD), Freiburg, Germany, September 2002.
- [SBG02] A Detailed Comparison of Two Transaction Processing Workloads. R. Stets, L.A. Barroso, and K. Gharachorloo. IEEE Annual Workshop on Workload Characterization (WWC), Austin, Texas, November 2002.
- [BGN00] Impact of Chip-Level Integration on Performance of OLTP Workloads. L.A. Barroso, K. Gharachorloo, A. Nowatzyk, and B. Verghese. IEEE International Symposium on High-Performance Computer Architecture (HPCA), Toulouse, France, January 2000.
- [RGA98] Performance of Database Workloads on Shared Memory Systems with Out-of-Order Processors. P. Ranganathan, K. Gharachorloo, S. Adve, and L.A. Barroso. International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), San Jose, California, October 1998.
- [LBE98] An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors. J. Lo, L.A. Barroso, S. Eggers, K. Gharachorloo, H. Levy, and S. Parekh. ACM International Symposium on Computer Architecture (ISCA), Barcelona, Spain, June 1998.
- [EJL96] Evaluation of Multithreaded Uniprocessors for Commercial Application Environments. R. Eickemeyer, R.E. Johnson, S.R. Kunkel, M.S. Squillante, and S. Liu. ACM International Symposium on Computer Architecture (ISCA), Philadelphia, Pennsylvania, May 1996.


### References: Where Does Time Go? (real-machine/simulation)

- [RAD02] Comparing and Contrasting a Commercial OLTP Workload with CPU2000. J. Rupley II, M. Annavaram, J. DeVale, T. Diep and B. Black (Intel). IEEE Annual Workshop on Workload Characterization (WWC), Austin, Texas, November 2002.
- [CTT99] Detailed Characterization of a Quad Pentium Pro Server Running TPC-D. Q. Cao, J. Torrellas, P. Trancoso, J. Larriba-Pey, B. Knighten, Y. Won. International Conference on Computer Design (ICCD), Austin, Texas, October 1999.
- [ADH99] DBMSs on a Modern Processor: Experimental Results. A. Ailamaki, D.J. DeWitt, M.D. Hill, D.A. Wood. International Conference on Very Large Data Bases (VLDB), Edinburgh, UK, September 1999.
- [KPH98] Performance Characterization of a Quad Pentium Pro SMP using OLTP Workloads. K. Keeton, D.A. Patterson, Y.Q. He, R.C. Raphael, W.E. Baker. ACM International Symposium on Computer Architecture (ISCA), Barcelona, Spain, June 1998.
- [BGB98] Memory System Characterization of Commercial Workloads. L.A. Barroso, K. Gharachorloo, and E. Bugnion. ACM International Symposium on Computer Architecture (ISCA), Barcelona, Spain, June 1998.
- [TLZ97] The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors. P. Trancoso, J. Larriba-Pey, Z. Zhang, J. Torrellas. IEEE International Symposium on High-Performance Computer Architecture (HPCA), San Antonio, Texas, February 1997.

### References: Architecture-Conscious Data Placement

- [SSS04] Clotho: Decoupling memory page layout from storage organization. M. Shao, J. Schindler, S.W. Schlosser, A. Ailamaki, G.R. Ganger. International Conference on Very Large Data Bases (VLDB), Toronto, Canada, September 2004.
- [SSS04a] Atropos: A Disk Array Volume Manager for Orchestrated Use of Disks. J. Schindler, S.W. Schlosser, M. Shao, A. Ailamaki, G.R. Ganger. USENIX Conference on File and Storage Technologies (FAST), San Francisco, California, March 2004.
- Query Processing in main-memory database management systems. T.J. Lehman and M.J. Carey. ACM International Conference on Management of Data (SIGMOD), 1986.

### References: Architecture-Conscious Access Methods

- [ZR03a] Buffering Accesses to Memory-Resident Index Structures. J. Zhou and K.A. Ross. International Conference on Very Large Data Bases (VLDB), Berlin, Germany, September 2003.
- [HP03a] Effect of node size on the performance of cache-conscious B+ Trees. R.A. Hankins and J.M. Patel. ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), San Diego, California, June 2003.
- [CGM02] Fractal Prefetching B+ Trees: Optimizing Both Cache and Disk Performance. S. Chen, P.B. Gibbons, T.C. Mowry, and G. Valentin. ACM International Conference on Management of Data (SIGMOD), Madison, Wisconsin, June 2002.
- [GL01] B-Tree Indexes and CPU Caches. G. Graefe and P. Larson. International Conference on Data Engineering (ICDE), Heidelberg, Germany, April 2001.
- [CGM01] Improving Index Performance through Prefetching. S. Chen, P.B. Gibbons, and T.C. Mowry. ACM International Conference on Management of Data (SIGMOD), Santa Barbara, California, May 2001.
- [BMR01] Main-memory index structures with fixed-size partial keys. P. Bohannon, P. McIlroy, and R. Rastogi. ACM International Conference on Management of Data (SIGMOD), Santa Barbara, California, May 2001.
- [BDF00] Cache-Oblivious B-Trees. M.A. Bender, E.D. Demaine, and M. Farach-Colton. Symposium on Foundations of Computer Science (FOCS), Redondo Beach, California, November 2000.
- [RR00] Making B+ Trees Cache Conscious in Main Memory. J. Rao and K.A. Ross. ACM International Conference on Management of Data (SIGMOD), Dallas, Texas, May 2000.

### References: Architecture-Conscious Query Processing

- [MBN04] Cache-Conscious Radix-Decluster Projections. S. Manegold, P.A. Boncz, N. Nes, M. Kersten. International Conference on Very Large Data Bases (VLDB), Toronto, Canada, September 2004.
- [GLW04] Fast Computation of Database Operations using Graphics Processors. N.K. Govindaraju, B. Lloyd, W. Wang, M. Lin, D. Manocha. ACM International Conference on Management of Data (SIGMOD), Paris, France, June 2004.
- [CAG04] Improving Hash Join Performance through Prefetching. S. Chen, A. Ailamaki, P.B. Gibbons, and T.C. Mowry. International Conference on Data Engineering (ICDE), Boston, Massachusetts, March 2004.
- [ZR04] Buffering Database Operations for Enhanced Instruction Cache Performance. J. Zhou, K.A. Ross. ACM International Conference on Management of Data (SIGMOD), Paris, France, June 2004.
- [SAA03] Hardware Acceleration for Spatial Selections and Joins. C. Sun, D. Agrawal, A. El Abbadi. ACM International Conference on Management of Data (SIGMOD), San Diego, California, June 2003.
- [CHK01] Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems. S.K. Cha, S. Hwang, K. Kim, and K. Kwon. International Conference on Very Large Data Bases (VLDB), Rome, Italy, September 2001.
- [PMA01] Block Oriented Processing of Relational Database Operations in Modern Computer Architectures. S. Padmanabhan, T. Malkemus, R.C. Agarwal, A. Jhingran. International Conference on Data Engineering (ICDE), Heidelberg, Germany, April 2001.
- [MBK00] What Happens During a Join? Dissecting CPU and Memory Optimization Effects. S. Manegold, P.A. Boncz, and M.L. Kersten. International Conference on Very Large Data Bases (VLDB), Cairo, Egypt, September 2000.
- [SKN94] Cache Conscious Algorithms for Relational Query Processing. A. Shatdal, C. Kant, and J.F. Naughton. International Conference on Very Large Data Bases (VLDB), Santiago de Chile, Chile, September 1994.
- [NBC94] AlphaSort: A RISC Machine Sort. C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D.B. Lomet. ACM International Conference on Management of Data (SIGMOD), Minneapolis, Minnesota, May 1994.

### References: Instruction Stream Optimizations and DBMS Architectures

- [HA04] STEPS towards Cache-resident Transaction Processing. S. Harizopoulos and A. Ailamaki. International Conference on Very Large Data Bases (VLDB), Toronto, Canada, September 2004.
- [APD03] Call Graph Prefetching for Database Applications. M. Annavaram, J.M. Patel, and E.S. Davidson. ACM Transactions on Computer Systems, 21(4):412-444, November 2003.
- [SAG03] Lachesis: Robust Database Storage Management Based on Device-specific Performance Characteristics. J. Schindler, A. Ailamaki, and G.R. Ganger. International Conference on Very Large Data Bases (VLDB), Berlin, Germany, September 2003.
- [HA02] Affinity Scheduling in Staged Server Architectures. S. Harizopoulos and A. Ailamaki. Carnegie Mellon University, Technical Report CMU-CS-02-113, March 2002.
- [HA03] A Case for Staged Database Systems. S. Harizopoulos and A. Ailamaki. Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, January 2003.
- [B02] Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications. P.A. Boncz. Ph.D. Thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands, May 2002.
- [PMH02] Computation Regrouping: Restructuring Programs for Temporal Data Cache Locality. V.K. Pingali, S.A. McKee, W.C. Hsieh, and J.B. Carter. International Conference on Supercomputing (ICS), New York, New York, June 2002.
- [ZR02] Implementing Database Operations Using SIMD Instructions. J. Zhou and K.A. Ross. ACM International Conference on Management of Data (SIGMOD), Madison, Wisconsin, June 2002.

### References: Newer Hardware

- [BWS03] Improving the Performance of OLTP Workloads on SMP Computer Systems by Limiting Modified Cache Lines. J.E. Black, D.F. Wright, and E.M. Salgueiro. IEEE Annual Workshop on Workload Characterization (WWC), Austin, Texas, October 2003.
- [GH03] Technological impact of magnetic hard disk drives on storage systems. E. Grochowski and R.D. Halem. IBM Systems Journal 42(2), 2003.
- [DJN02] Shared Cache Architectures for Decision Support Systems. M. Dubois, J. Jeong, A. Nanda. Performance Evaluation 49(1), September 2002.
- [G02] Put Everything in Future (Disk) Controllers. Jim Gray, talk at the USENIX Conference on File and Storage Technologies (FAST), Monterey, California, January 2002.
- [BGM00] Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. L.A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. International Symposium on Computer Architecture (ISCA), Vancouver, Canada, June 2000.
- [AUS98] Active disks: Programming model, algorithms and evaluation. A. Acharya, M. Uysal, and J. Saltz. International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), San Jose, California, October 1998.
- [KPH98] A Case for Intelligent Disks (IDISKs). K. Keeton, D.A. Patterson, J. Hellerstein. SIGMOD Record, 27(3):42-52, September 1998.
- [PGK88] A Case for Redundant Arrays of Inexpensive Disks (RAID). D.A. Patterson, G.A. Gibson, and R.H. Katz. ACM International Conference on Management of Data (SIGMOD), June 1988.

### References: Methodologies and Benchmarks

- [DMM04] Accurate Cache and TLB Characterization Using Hardware Counters. J. Dongarra, S. Moore, P. Mucci, K. Seymour, H. You. International Conference on Computational Science (ICCS), Krakow, Poland, June 2004.
- [SAF04] DBmbench: Fast and Accurate Database Workload Representation on Modern Microarchitecture. M. Shao, A. Ailamaki, and B. Falsafi. Carnegie Mellon University Technical Report CMU-CS-03-161, 2004.
- [KP00] Towards a Simplified Database Workload for Computer Architecture Evaluation. K. Keeton and D. Patterson. IEEE Annual Workshop on Workload Characterization, Austin, Texas, October 1999.
| 70004 Anastassia Ailamaki                                                                                                                                    | 130                                      |  |
|                                                                                                                                                              |                                          |  |
|                                                                                                                                                              |                                          |  |
|                                                                                                                                                              |                                          |  |
|                                                                                                                                                              |                                          |  |
|                                                                                                                                                              | Databases<br>@Carnegie Mellon            |  |
| Useful Links                                                                                                                                                 |                                          |  |
| lefe en letel Destina A Desference Occuptors                                                                                                                 | _                                        |  |
| Info on Intel Pentium4 Performance Counters<br>ftp://download.intel.com/design/Pentium4/manuals/25366814.pu                                                  |                                          |  |
| □ AMD hardware performance counters                                                                                                                          | -                                        |  |
| http://www.amd.com/us-en/Processors/DevelopWithAMD/                                                                                                          |                                          |  |
| □ PAPI Performance Library                                                                                                                                   | _                                        |  |
| http://icl.cs.utk.edu/papi/                                                                                                                                  |                                          |  |
| □ Intel® VTune™ Performance Analyzers                                                                                                                        | _                                        |  |
|                                                                                                                                                              |                                          |  |
| http://developer.intel.com/software/products/v                                                                                                               | rtune/                                   |  |
| http://developer.intel.com/software/products/v                                                                                                               | rtune/                                   |  |
| http://developer.intel.com/software/products/v                                                                                                               | rtune/<br>-                              |  |