
Chap 5 Storage Structure


Memory Hierarchy Introduction

Programs can access any part of their address space at any time, but in practice they touch only a small portion of it in any given period: this is the principle of locality.

Locality:

  • Temporal locality: Items accessed recently are likely to be accessed again soon (e.g., loop).
  • Spatial locality: Items near those accessed recently are likely to be accessed soon (e.g., sequential instruction access).

Taking advantage of locality:

  • Copy recently accessed and nearby items from disk to the smaller DRAM (main memory);
  • Copy more recently accessed and nearby items from DRAM to the smaller SRAM (the cache, which sits inside the CPU).

The closer to the CPU, the faster, smaller, and more expensive the memory.
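For illustration, a minimal Python sketch of both kinds of locality (the array and loop are just an example):

```python
# Summing an array: `total` and `i` are reused on every iteration (temporal
# locality), while data[i] walks through consecutive addresses (spatial locality).
data = list(range(1024))

total = 0
for i in range(len(data)):   # sequential accesses -> spatial locality
    total += data[i]         # `total` touched every iteration -> temporal locality

print(total)
```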

Terminology

  • Block (aka. line): unit of copying, may be multiple words.
  • Hit: accessed data is present in upper level.
  • Hit time: the time to access the upper level of memory.
  • Hit ratio: hits/accesses.
  • Miss: not hit, need to copy block from lower level.
  • Miss penalty: time taken to copy block (replace+deliver).
  • Miss ratio: misses/accesses.

Cache

Simple implementation (direct mapping): each item of data at the lower level has exactly one possible location in the cache, so many items at the lower level share the same location in the upper level.

Two issues: How do we know if a data item is in the cache? If it is, how do we find it?

Locate the data item!

Direct-mapping algorithm: cache index = (memory block address) modulo #(cache blocks).

Usually the cache has \(2^n\) blocks, so the cache index equals the lowest \(n\) bits of the memory block address.

Check its presence!

Multiple memory blocks share one cache location, so how do we know which particular block is currently stored there?

Tag: store part of the memory address along with the data. Since the lower bits are the cache index, only the higher bits of the address need to be stored; these are called the tag.

Valid bit: 1 if present, 0 if empty. Initialized as 0.

Components of three addresses:

  • Physical address (main memory address): tag / index / byte offset. The byte offset is determined by the block size; the index is the cache address, i.e., the lower bits of the memory block address; the tag is the higher bits of the block address. Tag and index concatenated form the block address (see the sketch after this list).
  • Block address: tag / index (the main memory address with the byte offset removed).
  • Cache line: (index,) valid bit / tag / data.
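A short sketch of splitting a physical byte address into these fields, assuming a direct-mapped cache with \(2^n\) blocks and \(2^m\) 4-byte words per block (function name and example values are illustrative):

```python
def split_address(addr: int, n: int, m: int):
    """Split a byte address into (tag, index, byte_offset).

    n = log2(#cache blocks), m = log2(#words per block); words are 4 bytes,
    so the byte offset occupies m + 2 bits.
    """
    offset_bits = m + 2
    byte_offset = addr & ((1 << offset_bits) - 1)
    block_addr  = addr >> offset_bits            # address without the offset
    index       = block_addr & ((1 << n) - 1)    # low n bits of the block address
    tag         = block_addr >> n                # remaining high bits
    return tag, index, byte_offset

# E.g. 8 blocks (n=3), 1 word per block (m=0):
print(split_address(0b1011000, 3, 0))   # -> (0b10, 0b110, 0)
```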
E.g.1 cache access

8-blocks, 1 word/block, direct mapped. Access Sequence: 10110, 11010, 10110, 10010.


The access sequence here refers to block addresses. The index has 3 bits, so the tag has 2 bits.

10110 → valid=0 → miss → copy to cache (temporal locality) → cache[110].valid=1, cache[110].tag=10, cache[110].data=Mem[10110] → return data.

11010 → valid=0 → miss → copy to cache → omitted.

10110 → valid=1, tag is the same → hit → return data.

10010 → valid=1, but tag is not the same → miss → replace → cache[010].valid=1, cache[010].tag=10, cache[010].data=Mem[10010] → return data.
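The walk-through above can be reproduced with a tiny direct-mapped cache model that tracks only tags (a sketch; data and valid bits are folded into the tag array):

```python
def simulate_direct_mapped(block_addrs, n_blocks=8):
    """Return 'hit'/'miss' for each access to a direct-mapped cache (1 word/block)."""
    cache = [None] * n_blocks            # each entry stores a tag; None = invalid
    results = []
    for addr in block_addrs:
        index, tag = addr % n_blocks, addr // n_blocks
        if cache[index] == tag:
            results.append("hit")
        else:                            # miss: fetch the block, replacing the old one
            cache[index] = tag
            results.append("miss")
    return results

print(simulate_direct_mapped([0b10110, 0b11010, 0b10110, 0b10010]))
# -> ['miss', 'miss', 'hit', 'miss']
```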

E.g.2 cache size

64-bit addresses, direct-mapped, \(2^n\) blocks, \(2^m\) words/block. How many bits are needed for each part? What is the total number of bits in the cache?


  • Byte offset: \(2^m\) words = \(2^{m+2}\) bytes → m+2 bits.
  • Cache index: \(2^n\) blocks → n bits.
  • Tag: all remaining bits → 64-(n+m+2) bits.
  • Total size of cache = #blocks * (data size + tag size + valid size) = #blocks * (#(words/block)*32 + #(tag bits) + 1) = \(2^n\times (2^m\times 32+63-n-m)\) bits (see the helper after this list).
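The formula can be wrapped in a small helper (hypothetical name); the call below checks the numbers of E.g.3:

```python
def cache_total_bits(n: int, m: int, addr_bits: int = 64) -> int:
    """Total bits in a direct-mapped cache with 2**n blocks and 2**m 32-bit words/block."""
    data_bits = (2 ** m) * 32                  # data stored per block
    tag_bits  = addr_bits - n - (m + 2)        # address bits minus index and byte offset
    return (2 ** n) * (data_bits + tag_bits + 1)   # +1 for the valid bit

# E.g.3: 16 KB of data, 4-word blocks, 32-bit address
print(cache_total_bits(n=10, m=2, addr_bits=32) // 1024)   # -> 147 (Kbits)
```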
E.g.3 cache size

How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address?


Total data size is 16 KB = \(2^{12}\) words, while each block contains \(2^2\) words, so there are \(2^{10}\) blocks.

Field widths:

  • index: 10 bits
  • byte offset: 2+2 = 4 bits (2 bits to select the word within the block, 2 bits to select the byte within the word)
  • tag: 32-10-4=18 bits
  • valid: 1 bit
  • data: 4*32=128 bits

Size of cache: \(2^{10}\times (128+18+1)\) bits = 147 Kbits.

E.g.4 cache size

Consider a cache with 64 blocks and a block size of 16 bytes. What block number does byte address 1200 map to?


Block address = byte address / bytes per block = \(\lfloor\frac{1200}{16}\rfloor\) =75.

Index = block address modulo #(cache blocks) = 75 mod(64) = 11.
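The same two steps in code:

```python
byte_addr, block_bytes, n_blocks = 1200, 16, 64
block_addr = byte_addr // block_bytes    # 1200 // 16 = 75
index      = block_addr % n_blocks       # 75 mod 64  = 11
print(block_addr, index)                 # 75 11
```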

Handle hits and misses?

Read hits are the desired case.

Read misses have two cases: instruction cache misses and data cache misses.

When an instruction cache miss occurs: stall the CPU, fetch the block from memory, deliver it to the cache, and restart the instruction fetch.

Write hits lead to different strategies:

  • write-back: write only into the cache, and write back to memory later (when the block is evicted). Faster, but the cache and memory become inconsistent (a dirty bit is needed to mark blocks that must be written back on eviction).
  • write-through: write into both the cache and memory. Slower (a write buffer mitigates this), but keeps them consistent.

Write misses: read the entire block into the cache, then write the word. (See Q4.)

Q&As on Memory Hierarchy

Q1 (Block placement): Where can a block be placed in the upper level?

Fully Associative, Set Associative, Direct Mapped.

  • Direct mapped: block can only go in one place in the cache.
  • Fully associative: block can go anywhere in cache.
  • Set associative: block can go in one set of places in cache.

In a fully associative cache, there are no index bits.

Details on set associative: a set is a group of adjacent blocks in the cache. Index = block address mod #(sets). If each set has n blocks, the cache is said to be n-way set associative.

Q2 (Block identification): How is a block found if it is in the upper level?

Use tag and valid bit.

Tag comparison: for fully-associative caches, compare the requested tag with every block's tag in the cache to determine hit or miss. For set-associative caches, compare with every block's tag in the set.

Index field of the physical address: for set-associative caches, the index has \(\log_2(\#\text{sets})\) bits; for direct-mapped caches, the index has \(\log_2(\#\text{blocks})\) bits; for fully-associative caches, there is no index field.

Q3 (Block replacement): Which block should be replaced on a miss?

Random, LRU, FIFO.

The replacement issue only arises in set-associative and fully-associative caches; a direct-mapped cache has no choice to make.

  • Random replacement: randomly pick any block in the set. Easy to implement and allocates uniformly, but may evict a block that is about to be accessed.
  • Least-recently used (LRU): pick the block in the set that was least recently accessed. Requires extra bits to keep track of accesses (see the sketch after this list).
  • First in, first out (FIFO): pick the block in the set that entered the cache first.
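A minimal sketch of an n-way set-associative cache with LRU replacement, tracking only tags (names are illustrative); it also reproduces the two-way and fully associative cases of E.g.5 below:

```python
from collections import OrderedDict

def simulate_set_associative(block_addrs, n_sets, n_ways):
    """Return 'hit'/'miss' per access; each set is an LRU-ordered dict of tags."""
    sets = [OrderedDict() for _ in range(n_sets)]
    results = []
    for addr in block_addrs:
        index, tag = addr % n_sets, addr // n_sets
        s = sets[index]
        if tag in s:
            s.move_to_end(tag)             # mark as most recently used
            results.append("hit")
        else:
            if len(s) == n_ways:
                s.popitem(last=False)      # evict the least recently used tag
            s[tag] = True
            results.append("miss")
    return results

seq = [0, 8, 0, 6, 8]
print(simulate_set_associative(seq, n_sets=2, n_ways=2))
# -> ['miss', 'miss', 'hit', 'miss', 'miss']  (two-way: 4 misses, 1 hit)
print(simulate_set_associative(seq, n_sets=1, n_ways=4))
# -> ['miss', 'miss', 'hit', 'miss', 'hit']   (fully associative: 3 misses, 2 hits)
```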

Q4 (Write strategy): What happens on a write?

Write hit:

  • Write-through: cached data can always be discarded, since memory always holds the most up-to-date copy.
  • Write-back: cached data can't simply be discarded, because it may need to be written back to memory first; dirty bits are added to the cache control.

Advantages:

  • write-through: read misses never result in writes, the memory hierarchy stays consistent, and it is simple to implement.
  • write-back: writes occur at the speed of the cache, and less main-memory bandwidth is used when there are multiple writes to the same block.

Write stall: the CPU waits for writes to complete under write-through.

Write buffer: a small buffer that holds a few values waiting to be written to main memory. To avoid stalling on writes, many CPUs use a write buffer.

The write buffer does not entirely eliminate stalls (it can fill up).

Write miss:

  • Write allocate: the block is loaded into the cache on a miss before anything else occurs.
  • Write around: the block is only written to main memory, not stored in cache.

In general, write-back caches use write-allocate, and write-through caches use write-around.
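A sketch of how the two write-hit policies pair with the two write-miss policies (dictionaries stand in for the cache and main memory; eviction is omitted, so the dirty bit is only ever set here):

```python
def handle_write(cache, memory, block, value, write_back=True, write_allocate=True):
    """Apply one write. `cache` maps block -> [value, dirty]; `memory` maps block -> value."""
    if block in cache:                               # write hit
        cache[block][0] = value
        if write_back:
            cache[block][1] = True                   # dirty: memory updated only on eviction
        else:
            memory[block] = value                    # write-through: update memory too
    elif write_allocate:                             # write miss, write-allocate
        cache[block] = [memory.get(block), False]    # fetch the block first...
        handle_write(cache, memory, block, value, write_back, write_allocate)  # ...then write
    else:                                            # write miss, write-around
        memory[block] = value                        # memory only, cache untouched

cache, memory = {}, {7: 0}
handle_write(cache, memory, 7, 42)                   # write-back + write-allocate
print(cache, memory)                                 # {7: [42, True]} {7: 0}
```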

E.g.5 compare three mapping strategy

Three small caches, each consisting of four one-word blocks: one is direct-mapped, the second is two-way set associative, and the third is fully associative.

Access sequence: 0, 8, 0, 6, 8.


Direct-mapped: 5 misses, because blocks 0 and 8 map to the same cache position and keep evicting each other.

Two-way set associative: all of the accessed blocks map to set 0, but each set holds only 2 blocks, giving 4 misses and 1 hit (with LRU).

Fully associative: 3 misses and 2 hits.

E.g.6 cache size

Assume the cache has 4K blocks, the block size is 4 words, and the physical address is 32 bits. Find the total number of tag bits for 4-way associativity.


  • number of cache sets: \(2^{12}\)/4 = \(2^{10}\)
  • Index: 10 bits
  • Offset: 4 bits
  • Tag: 32-10-4 = 18 bits
  • Total bits for tag: 18*4K = 72 Kbits

Measure Cache Performance!

  • Average_memory_access_time = hit_time + miss_rate * miss_penalty
    = hit_rate * cache_time + miss_rate * memory_time
    (where memory_time is the total time of an access that misses).

Use CPU time to measure cache performance.

  • CPU_time = (CPU_execution_clock_cycles + Memory_stall_clock_cycles) * Clock_cycle_time.

In which:

  • CPU_execution_clock_cycles = CI * CPI, where CI is the instruction count.

  • Memory_stall_clock_cycles = #memory_accesses * miss_rate * miss_penalty
    = Read_stall_cycles + Write_stall_cycles.

For read stall:

  • Read_stall_cycles = Reads_per_program * Read_miss_rate * Read_miss_penalty.

For a write-through plus write buffer scheme:

  • Write_stall_cycles = (Writes_per_program * Write_miss_rate * Write_miss_penalty) + Write_buffer_stalls.

In most write-through cache organizations, the read and write miss penalties are the same.

E.g.7 calculate cache performance

Assume:

  • instruction cache miss rate: 2%
  • data cache miss rate: 4%
  • CPI without any memory stalls: 2
  • miss penalty: 100 cycles
  • The frequency of all loads and stores in gcc: 36%

How much faster would the processor run with a perfect cache?


  • Instruction miss cycles: CI * 2% * 100 = 2 CI
  • Data miss cycles: CI * 36% * 4% * 100 = 1.44 CI
  • Total memory-stall cycles: 3.44 CI

CPI with stalls = CPI with perfect cache + memory-stall cycles per instruction = 2 + 3.44 = 5.44, so a perfect cache would make the processor 5.44 / 2 = 2.72 times faster.
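The same arithmetic as a quick check (everything is per instruction, so CI cancels out):

```python
cpi_base     = 2
miss_penalty = 100
instr_stalls = 0.02 * miss_penalty             # 2    cycles per instruction
data_stalls  = 0.36 * 0.04 * miss_penalty      # 1.44 cycles per instruction
cpi_stalled  = cpi_base + instr_stalls + data_stalls
print(round(cpi_stalled, 2), round(cpi_stalled / cpi_base, 2))   # 5.44 2.72
```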


When the CPU gets faster, memory-stall cycles take up a larger fraction of execution time, so the overall performance gain is less than proportional to the improvement in CPU speed.

Improve Cache Performance!

Sol.1 More flexible placement of blocks to reduce cache misses

See Q1.

Sol.2 Choosing which block to replace

See Q3.

Sol.3 Choosing different block size

Larger blocks should reduce the miss rate, but for a fixed cache size they also mean fewer blocks, so competition increases and the miss rate may go back up (and the miss penalty grows).

Larger blocks exploit spatial locality. In practice, use split caches because there is more spatial locality in code.

Sol.4 Designing the memory system to support cache

Bandwidth = transferred words / transfer time (the miss penalty).

Performance with a wider main memory: when the width of main memory (and the bus) increases, fewer transfers are needed, so bandwidth increases.

Performance with four-way interleaved memory: accesses to four memory banks proceed in parallel.

Sol.5 Reducing miss penalty with multilevel caches

Add a second-level cache: the primary cache is often on the same chip as the processor; SRAM is used to add another cache level between the primary cache and main memory (DRAM). The miss penalty goes down when the data is found in the second-level cache.

E.g. multilevel cache miss penalty

CPI of 1.0 on a 5 GHz machine with a 2% miss rate and 100 ns DRAM access time. Adding a second-level cache with a 5 ns access time decreases the miss rate to main memory to 0.5%. Calculate the improvement in CPI.


  • Clock cycle time: 0.2 ns.
  • Miss penalty to main memory: 100 ns / 0.2 ns = 500 cycles.

CPI equals execution cycles plus stall cycles:

  • Total CPI with a one-level cache: 1.0 + 2%*500 = 11 cycles.
  • Miss penalty to the second-level cache: 5 ns / 0.2 ns = 25 cycles.
  • Total CPI with two cache levels: 1.0 + 2%*25 + 0.5%*500 = 4 cycles.

The processor is faster by 11 / 4 = 2.75 (≈ 2.8).
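The same calculation in code:

```python
clock_ns, dram_ns, l2_ns = 0.2, 100, 5
mem_penalty = dram_ns / clock_ns                 # 500 cycles
l2_penalty  = l2_ns / clock_ns                   # 25 cycles

cpi_one_level = 1.0 + 0.02 * mem_penalty                        # 11.0
cpi_two_level = 1.0 + 0.02 * l2_penalty + 0.005 * mem_penalty   # 4.0
print(cpi_one_level, cpi_two_level, cpi_one_level / cpi_two_level)   # 11.0 4.0 2.75
```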

Sol.6 Software optimization: blocking

This solution works through the software: misses depend on memory access patterns, so restructuring code to increase locality (e.g., blocking) reduces them.
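A sketch of the blocking idea applied to matrix multiplication: the loops visit one bsize × bsize tile at a time, so the tiles of A, B, and C are reused while they are still in the cache (pure Python here, so only the access pattern, not the speed, is representative):

```python
def blocked_matmul(A, B, n, bsize=32):
    """Compute C = A * B for n x n matrices (lists of lists), tile by tile."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bsize):
        for jj in range(0, n, bsize):
            for kk in range(0, n, bsize):
                # Each tile is small enough to stay cached and is reused
                # many times before the loops move on (temporal locality).
                for i in range(ii, min(ii + bsize, n)):
                    for k in range(kk, min(kk + bsize, n)):
                        a_ik = A[i][k]
                        for j in range(jj, min(jj + bsize, n)):
                            C[i][j] += a_ik * B[k][j]
    return C
```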

Virtual Memory

Motivation:

  1. Efficient and safe sharing of memory among multiple programs.
  2. Remove the programming burdens of a small, limited amount of main memory.

Main memory acts as a cache for secondary storage (disk).

Pages: the blocks of virtual memory.

Page fault: the requested data is not in main memory and must be retrieved from disk.

There can be more virtual pages than physical pages. Page faults incur a huge miss penalty, so pages are fairly large (e.g., 4 KB), faults are handled in software instead of hardware, and write-back is used because write-through would be too expensive.

Page table: translates virtual addresses to physical addresses.

Elements of page tables:

  • Index: virtual page number
  • Valid bit: 1 if the virtual page is currently in memory
  • Entry: physical page number for the virtual page

Each program has its own page table. Virtual memory systems use a fully associative mapping.
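A minimal sketch of virtual-to-physical address translation through a page table, assuming 4 KB pages (the dictionary-based table and names are illustrative):

```python
PAGE_BITS = 12                       # 4 KB pages

def translate(vaddr, page_table):
    """page_table maps virtual page number -> (valid, physical page number)."""
    vpn    = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    valid, ppn = page_table.get(vpn, (False, None))
    if not valid:
        # Page fault: the OS would load the page from disk and update the table.
        raise RuntimeError(f"page fault on virtual page {vpn:#x}")
    return (ppn << PAGE_BITS) | offset

page_table = {0x12345: (True, 0x42)}
print(hex(translate(0x12345678, page_table)))   # -> 0x42678
```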

Five techniques to reduce page table size

  1. Multi-level Paging
  2. Inverted Page Table
  3. Larger Page Size
  4. Page Table Entry Compression / Sharing
  5. Segmented Paging
E.g.8 page table size

Assume virtual page number is 32 bits, page size is 4KB, and each entry is 4 bytes. Calculate the number of page table entries and the size of page table.

  • Number of entries: \(2^{32}\)
  • Size of page table: 4*#entry = \(2^{34}\) bytes

Assume virtual address is 32 bits, page size is 4KB, and each entry is 4 bytes. Calculate the number of page table entries and the size of page table.

  • Page size: \(2^{12}\) bytes
  • Page offset: 12 bits
  • Virtual page number: virtual address bits - page offset bits = 32 - 12 = 20 bits
  • Number of entries: \(2^{20}\)
  • Size of page table: \(2^{22}\) bytes
E.g.9 size of page table

Consider a virtual memory system with the following properties: 36-bit virtual address, 4KB pages, 32-bit physical address.

What is the total size of the page table for each process on this processor, assuming that the valid bit and dirty bit are in use (and assuming that disk addresses are not stored in the page table)?


  • Page offset: 12 bits
  • Virtual page number: 36-12 = 24 bits
  • Number of entries: \(2^{24}\)
  • Entry size: 1 (valid) + 1 (dirty) + 20 (physical page number, 32 - 12) = 22 bits
  • Size of page table: 22 * \(2^{24}\) bits ≈ 44 MB (checked in the snippet below)
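The same numbers as a quick check:

```python
vaddr_bits, paddr_bits, page_bits = 36, 32, 12
entries    = 2 ** (vaddr_bits - page_bits)        # 2**24 virtual pages
entry_bits = 1 + 1 + (paddr_bits - page_bits)     # valid + dirty + PPN = 22 bits
total_bits = entries * entry_bits
print(entries, entry_bits, total_bits / 2 ** 23)  # 16777216 22 44.0 (MB)
```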

What about writes?

A write-back strategy must be used. The dirty bit is set when a page is first written; if the dirty bit is 1, the page must be written back to disk before being replaced.

TLB

The TLB (translation-lookaside buffer) acts as a cache for the page table.

Trends:

  • Synchronous SDRAMs (provide a burst of data)
  • Redesign DRAM chips to provide higher bandwidth or processing
  • Restructure code to increase locality
  • Use prefetching (make cache visible to ISA)

(Figures: the overall memory hierarchy; TLB and cache working together.)