java17-mesi-false-sharing-processor-optimisations-workshop

preface

  • goals of these workshops
    • understanding modern cache architecture in multi-core environment
    • discussing cache write policies
    • explaining cache coherence problems
      • with presentation of most known cache coherence protocol: MESI
    • exemplifying cache performance hits
      • false sharing
      • loop order
    • showing processor optimisations
      • vectorization
  • workshop plan
    • false sharing example
    • benchmarks
      • to trigger: gradlew jmh
      • vectorization
      • loop order

prerequisite

  • a processor's access to a cache involves one of two operations: read or write
    • each operation can have two outcomes
      • cache hit = processor accesses its private cache and finds the addressed item already in the cache
      • otherwise - cache miss
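
The impact of hits vs misses is easiest to see with loop order (one of the workshop benchmarks). Below is a minimal, hypothetical sketch (not part of the workshop sources): row-major traversal reuses each cache line and mostly hits, column-major traversal jumps to a different line on almost every access and mostly misses.

```java
// hypothetical illustration, not part of the workshop sources
public class LoopOrderDemo {

    static final int N = 4 * 1024;
    static final int[][] matrix = new int[N][N];

    // row-major: consecutive iterations touch consecutive addresses,
    // so most accesses are cache hits (the same cache line is reused)
    static long sumRowMajor() {
        long sum = 0;
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                sum += matrix[row][col];
        return sum;
    }

    // column-major: each iteration jumps to a different row, i.e. a different
    // cache line, so most accesses are cache misses
    static long sumColumnMajor() {
        long sum = 0;
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                sum += matrix[row][col];
        return sum;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        sumRowMajor();
        System.out.println("row-major:    " + (System.nanoTime() - start) / 1_000_000 + " ms");

        start = System.nanoTime();
        sumColumnMajor();
        System.out.println("column-major: " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```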

cache coherence

  • in modern CPUs (almost) all memory accesses go through the cache hierarchy
    • CPU core’s load/store (and instruction fetch) units normally can’t even access memory directly
      • physically impossible

      • they talk to their L1 caches

        • at this point, there are generally more cache levels involved
        • the L1 cache doesn't talk to memory directly anymore; it talks to an L2 cache, which in turn talks to memory
          • or maybe to a L3 cache
      • caches are organized into “lines”, corresponding to aligned blocks of memory

        • 32 bytes (older ARMs, 90s/early 2000s x86s/PowerPCs)
        • 64 bytes (newer ARMs and x86s)
        • 128 bytes (newer Power ISA machines)
  • cache coherence
    • a concern that arises in multi-core systems with distributed (per-core) caches
    • when multiple processors are operating on the same or nearby memory locations, they may end up sharing the same cache line
      • the unit of granularity of a cache entry is the cache line, typically 64 bytes (512 bits)

        • even if you read/write a single byte, the whole 64-byte line is transferred
      • it’s essential to keep those overlapping caches consistent with each other

        • benefits of multithreading can disappear if the threads are competing for the same cache line
        • note that the problem really is that we have multiple caches, not that we have multiple cores
          • we could solve the entire problem by sharing a single cache (L1) between all cores
            • each cycle, the L1 picks one lucky core that gets to do a memory operation this cycle, and runs it
              • problem: cores now spend most of their time waiting in line for their next turn at a L1 request
                • processors do a lot of those, at least one for every load/store instruction
                  • slow
                • solution: the next best thing is to have multiple caches and make them behave as if there were only one cache
                  • this is what cache coherency protocols are for
      • there are quite a few protocols to maintain the cache coherency between CPU cores

      • problem is not unique to parallel processing systems

        • strong resemblance to the "lost update" problem
      • general approach

        • getting read access to a cache line involves talking to the other cores
          • might cause them to perform memory transactions
        • writing to a cache line is a multi-step process
          • before you can write anything, you first need to acquire both exclusive ownership of the cache line and a copy of its existing contents
            • "Read For Ownership" request
        • each line in a cache is identified and referenced by a cache tag (block number)
          • allows the determination of the primary memory address associated with each element in the cache
        • each individual cache must monitor the traffic in cache tags
          • corresponds to the blocks being read from and written to the shared primary memory
          • done by a snooping cache (or snoopy cache, after the Peanuts comic strip)
            • basic idea behind snooping is that all memory transactions take place on a shared bus that’s visible to all cores
            • caches don’t just interact with the bus when they want to do a memory transaction themselves
              • instead, each cache continuously snoops on bus traffic to keep track of what the other caches are doing
              • if one cache wants to read from or write to memory on behalf of its core, all the other cores notice
                • that allows them to keep their caches synchronized
                • one core writes to a memory location => other cores know that their copies of the corresponding cache line are now stale and hence invalid
                  • problem: write-back model
                    • it’s not enough to broadcast just the writes to memory when they happen
                    • physical write-back to memory can happen a long time after the core executed the corresponding store
                    • for the intervening time, the other cores and their caches might themselves try to write to the same location, causing a conflict
                    • if we want to avoid conflicts, we need to tell other cores about our intention to write before we start changing anything in our local copy
            • memory itself is a shared resource
              • memory access needs to be arbitrated
                • only one cache gets to read data from, or write back to, memory in any given cycle
          • caches do not respond to bus events immediately
            • reason: cache is busy doing other things (sending data to the core for example)
              • it might not get processed that cycle
            • invalidation queue
              • place where bus message triggering a cache line invalidation sits for a while until the cache has time to process it
      • why do you need the volatile keyword in Java if you have cache coherence? (see the sketch after this list)

        • cache coherence ensures that all threads eventually see the change
          • no guarantee about when this happens
        • volatile also works at a higher level of abstraction
          • introduces happens-before relationships around reads and writes
  • cache write policies
    • write back
      • write operations are usually made only to the cache
      • main memory is only updated when the corresponding cache line is flushed from the cache
      • results in inconsistency
        • example: if two caches contain the same line, and the line is updated in one cache, the other cache will unknowingly have an invalid value
      • one fundamental implementation
        • MESI
          • each cache line can be in one of these four distinct states: Modified, Exclusive, Shared, or Invalid
          • key feature: delayed flush to main memory
            • example: as long as no one reads the data, there is no need to write it to main memory
              • better to write only to cache as it is much faster
    • write through
      • all write operations are made to main memory as well as to the cache
      • ensures that main memory is always valid
      • has consistency issues
        • occur unless other caches monitor the memory traffic or receive some direct notification of the update
      • two fundamental implementations
        1. with update protocol
          • after the write to main memory, a message with the updated data is broadcast to all processor modules in the system
            • each processor updates the contents of the affected cache block if this block is present in its cache
        2. with invalidation of copies
          • after the write to main memory, an invalidation request is broadcast through the system
            • all copies in other caches are invalidated
    • write-through caches are simpler, but write-back has some advantages
      • it can filter repeated writes to the same location
        • if most of the cache line changes on a write-back => one large memory transaction can be issued instead of several small ones
          • more efficient
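
To make the volatile question above concrete, here is a minimal, hypothetical sketch (not part of the workshop sources): without volatile on the flag, the reader thread may spin forever, not because cache coherence is broken, but because the JIT is free to hoist the field read out of the loop; volatile forbids that and adds a happens-before edge between the write and the read that observes it.

```java
// hypothetical illustration, not part of the workshop sources
public class VolatileVisibility {

    // without 'volatile' the reader may loop forever: the JIT may hoist the
    // field read out of the loop, since nothing in the reader thread's own
    // program order ever changes it
    static volatile boolean stop;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!stop) {
                // busy-wait until the write becomes visible
            }
            System.out.println("reader observed stop = true");
        });
        reader.start();

        Thread.sleep(100);
        stop = true; // volatile write: happens-before the read that sees it
        reader.join();
    }
}
```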

MESI protocol

  • formal mechanism for controlling cache coherency using snooping techniques
  • most widely used cache coherence protocol
  • each line in an individual processor's cache can exist in one of the four following states
    • (M)odified
      • result of a successful write hit on a cache line
        • its value is different from the main memory
      • indicates that the cache line is present in current cache only and is dirty
        • it must be in the I state for all other cores
      • modified line can be kept by a processor only as long as it's the only processor that has this copy
        • on another processor's cache miss for that line, the cache still needs to write back its data to memory if the line is in the modified state
          • processor must signal "Dirty" and write the data back to the shared primary memory
            • causing the other processor to abandon its memory fetch
        • it is necessary to first write in memory and then read it from there
          • reading cores can't directly read from the cache of the writing core
            • it's more expensive
            • example
              1. suppose both A and B share that line, and B got it directly from A's cache line
              2. C needs that line for a write
              3. both A and B would have to keep snooping that line
                • the number of shared copies grows
                • greater impact on performance due to the snooping done by all the 'shared' processors
        • example
          1. let's say processor A has that modified line
          2. processor B tries to read that same cache line (modified by A) from main memory
          3. A snoops B's read attempt for that line
            • content in the main memory is invalid now (because A modified the content)
          4. A has to write it back to main memory
            • in order to allow processor B (and others) to read that line
    • (E)xclusive
      • its value matches the main memory value
      • no other cache holds a copy of this line
      • main purpose: prevent the unnecessary broadcast of a Cache Invalidate signal on a write hit
        • reduces traffic on a shared bus
    • (S)hared
      • multiple caches may hold the line
      • main memory is up to date
      • if a core does not have exclusive access to a cache line when it wants to write, it first needs to send an "I want exclusive access" request to the bus
        • this tells all other cores to invalidate their copies of that cache line, if they have any
    • (I)nvalid
      • cache line does not contain valid data
      • it will reload that cache line the next time it needs to look at a value in it
        • even if that value is perfectly current => false sharing
  • requires 2 bits per cache line to hold the state
    • 4 values: M, E, S, I
  • transition between the states is controlled by memory accesses and bus snooping activity
    • suppose a requesting processor processes a write hit on its cache
      • by definition, any copy of the line in the caches of other processors must be in the Shared State

      • what happens depends on the state of the cache in the requesting processor

        1. Modified
          • protocol does not specify an action for the processor
        2. Shared
          • processor writes the data
          • marks the cache line as Modified
          • broadcasts a Cache Invalidate signal to other processors
        3. Exclusive
          • processor writes the data and marks the cache line as Modified
      • example

        | Step | Core A cache line | Core B cache line |
        |------|-------------------|-------------------|
        | 1    | Exclusive         | -                 |
        | 2    | Shared            | Shared            |
        | 3    | Invalid           | Modified          |
        | 4    | Shared            | Shared            |

        1. Core A reads a value
        2. Core B reads a value from the same area of memory
        3. Core B writes to that value
        4. Core A tries to read a value from that same part of memory
          • Core B's cache line is forced back to memory
          • Core A's cache line is re-loaded from memory
    • simulation
      1. exclusive
        • CPU 1
          • is the first (and only) processor to request block A from the shared memory
          • issues a BR (Bus Read) for the block and gets its copy
            • neither CPU 2 nor CPU 3 respond to the BR
          • cache line containing block A is marked Exclusive
            • subsequent reads to this block access the cached entry and not the shared memory
      2. shared
        • CPU 2 requests the same block A
        • snoop cache on CPU 1 notes the request and CPU 1 broadcasts Shared, announcing that it has a copy of the block
          • CPU 3 does not respond to the BR
        • both copies of the block are marked as shared
          • indicates that the block is in two or more caches for reading
          • copy in the shared primary memory is up to date
      3. modified
        • CPU 2 writes to the cache line it is holding in its cache
          • issues a BU (Bus Upgrade) broadcast, marks the cache line as Modified, and writes the data to the line
          • CPU 1 responds to the BU by marking the copy in its cache line as Invalid
          • CPU 3 does not respond to the BU
        • informally, CPU 2 can be said to "own the cache line"
      4. dirty
        • CPU 3 attempts to read block A from primary memory
        • in CPU 1, the cache line holding that block has been marked as Invalid
          • CPU 1 does not respond to the BR (Bus Read) request
        • CPU 2 has the cache line marked as Modified
          • asserts the signal Dirty on the bus, writes the data in the cache line back to the shared memory, and marks the line Shared
          • informally, CPU 2 asks CPU 3 to wait while it writes back the contents of its modified cache line to the shared primary memory
        • CPU 3 waits and then gets a correct copy
        • The cache line in each of CPU 2 and CPU 3 is marked as Shared
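
The transitions walked through above can be summarised in a toy state machine. The sketch below is hypothetical and heavily simplified (no invalidation queue, no explicit Dirty/write-back step, no bus arbitration); it only tracks how one line's state changes on local reads/writes and on snooped traffic from other caches.

```java
// hypothetical, heavily simplified model of the transitions described above
public class MesiLine {

    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    private State state = State.INVALID;

    // local read: a miss fetches the line; Exclusive if no other cache holds it, Shared otherwise
    void localRead(boolean otherCacheHasCopy) {
        if (state == State.INVALID) {
            state = otherCacheHasCopy ? State.SHARED : State.EXCLUSIVE;
        }
        // M, E, S: read hit, no state change
    }

    // local write: needs exclusive ownership; from Shared/Invalid this implies
    // a bus invalidate / Read-For-Ownership, from Modified/Exclusive it is silent
    void localWrite() {
        state = State.MODIFIED;
    }

    // another cache reads the line: a Modified copy is written back first, then both copies are Shared
    void snoopRead() {
        if (state == State.MODIFIED || state == State.EXCLUSIVE) {
            state = State.SHARED;
        }
    }

    // another cache announces it wants to write the line: our copy becomes stale
    void snoopWriteIntent() {
        state = State.INVALID;
    }

    State state() {
        return state;
    }

    public static void main(String[] args) {
        MesiLine cpu1 = new MesiLine();
        MesiLine cpu2 = new MesiLine();

        cpu1.localRead(false);   // step 1: CPU 1 reads block A -> Exclusive
        cpu2.localRead(true);    // step 2: CPU 2 reads the same block -> Shared
        cpu1.snoopRead();        //         CPU 1 snoops the read -> Shared
        cpu2.localWrite();       // step 3: CPU 2 writes -> Modified
        cpu1.snoopWriteIntent(); //         CPU 1 snoops the invalidate -> Invalid

        System.out.println("CPU 1: " + cpu1.state() + ", CPU 2: " + cpu2.state());
        // prints: CPU 1: INVALID, CPU 2: MODIFIED
    }
}
```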

false sharing

  • true sharing
    • CPUs are writing to the same variables stored within the same cache line
  • false sharing
    • CPUs are writing to independent variables stored within the same cache line
    • independent = each CPU doesn't really rely on the values written by the other CPU
    • steps
      1. the first thread modifies its variable
        • the cache line is invalidated in all other CPU caches
      2. the other CPUs reload the content of the invalidated cache line
        • even if they don't need the variable that was modified by the first thread
  • essence of problems with concurrent programming
    • more processors, more power, more electricity consumption, yet slower than a single thread
  • solution: padding, so that the independent variables end up on separate cache lines (see the sketch below)
  • note that false sharing doesn't cause incorrect results - just a performance hit
    • updates may still be lost if the writes are not atomic, but that is a separate correctness issue
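
A minimal, hypothetical sketch of the effect (the workshop's JMH benchmarks are the proper way to measure it): two threads increment two independent counters; when both counters sit on the same cache line, the writes keep invalidating each other's copy, while padding (or jdk.internal.vm.annotation.@Contended with -XX:-RestrictContended) pushes them onto separate lines.

```java
// hypothetical illustration; the workshop's JMH benchmarks do the real measurement
public class FalseSharingDemo {

    // both fields almost certainly land on the same 64-byte cache line
    static class SharedLine {
        volatile long a;
        volatile long b;
    }

    // manual padding is only a heuristic (the JVM may reorder fields);
    // @Contended with -XX:-RestrictContended is the more reliable approach
    static class PaddedLine {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7; // 56 bytes of padding
        volatile long b;
    }

    static final long ITERATIONS = 100_000_000L;

    static long run(Runnable writerA, Runnable writerB) throws InterruptedException {
        Thread t1 = new Thread(writerA);
        Thread t2 = new Thread(writerB);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        SharedLine shared = new SharedLine();
        PaddedLine padded = new PaddedLine();

        // each thread writes only to its own counter - no true sharing involved
        long sharedMs = run(
                () -> { for (long i = 0; i < ITERATIONS; i++) shared.a++; },
                () -> { for (long i = 0; i < ITERATIONS; i++) shared.b++; });
        long paddedMs = run(
                () -> { for (long i = 0; i < ITERATIONS; i++) padded.a++; },
                () -> { for (long i = 0; i < ITERATIONS; i++) padded.b++; });

        System.out.println("same cache line: " + sharedMs + " ms, padded: " + paddedMs + " ms");
    }
}
```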

processor optimisations

  • SIMD (Single Instruction, Multiple Data)
    • example: add four pairs of numbers together at once, rather than adding them sequentially
    • instructions supported by modern processors
    • allow a single instruction to operate on multiple data elements simultaneously
      • can significantly improve performance by exploiting parallelism at the instruction level
    • data should be aligned and contiguous in memory
  • vectorization
    • involves identifying portions of code that can be executed in parallel using SIMD instructions
    • typically involves operations like arithmetic operations, array computations, and data processing loops
    • example: assembly on the Intel platform
      • vpaddd (and the rest of the vpadd* family)
        • Vector Packed Add
        • used for adding packed integer values stored in SIMD registers (vaddps/vaddpd are the floating-point counterparts)
        • vs add - the typical instruction used for scalar additions
      • vmovdqu
        • Vector Move Unaligned
        • used for moving data between memory and SIMD registers
        • vs mov - the typical instruction used for scalar moves
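
A minimal, hypothetical sketch of the kind of loop HotSpot's C2 compiler can auto-vectorize into vmovdqu loads/stores and vpaddd adds on x86. Whether that actually happens depends on the JVM version, CPU and flags; the generated code can be inspected with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly (requires the hsdis disassembler).

```java
// hypothetical illustration; whether C2 actually emits SIMD instructions
// depends on the JVM version, flags and CPU
public class VectorizationDemo {

    // simple, branch-free loop over contiguous int arrays: a prime candidate
    // for auto-vectorization (several additions per SIMD instruction)
    static void add(int[] a, int[] b, int[] result) {
        for (int i = 0; i < result.length; i++) {
            result[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int n = 64 * 1024;
        int[] a = new int[n];
        int[] b = new int[n];
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }

        // warm up so C2 compiles (and possibly vectorizes) the loop
        for (int iter = 0; iter < 10_000; iter++) {
            add(a, b, result);
        }
        System.out.println(result[n - 1]);
    }
}
```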