- references
- https://jenkov.com/tutorials/java-concurrency/false-sharing.html
- https://www.baeldung.com/java-false-sharing-contended
- Cache Issues -- False Sharing -- Mike Bailey, Oregon State University
- JDD2019: Who ate my RAM?, Jarek Pałka
- JDD 2018: new java.io.File("jdd.json"); is this really that simple? by Jarek Pałka
- 2023 - Krzysztof Ślusarski - Java vs CPU
- https://techexpertise.medium.com/cache-coherence-problem-and-approaches-a18cdd48ee0e
- https://medium.com/@mallela.chandra76/cache-coherence-in-a-system-d9ba906b45f7
- https://www.brainkart.com/article/The-MESI-protocol_7651/
- https://www.geeksforgeeks.org/cache-coherence-protocols-in-multiprocessor-system/
- https://stackoverflow.com/questions/49983405/what-is-the-benefit-of-the-moesi-cache-coherency-protocol-over-mesi
- https://stackoverflow.com/questions/21126034/msi-mesi-how-can-we-get-read-miss-in-shared-state
- https://stackoverflow.com/questions/10058243/mesi-cache-protocol
- http://www.edwardbosworth.com/My5155_Slides/Chapter13/CacheCoherency.htm
- https://fgiesen.wordpress.com/2014/07/07/cache-coherency/
- https://github.com/melix/jmh-gradle-plugin
- https://chat.openai.com/
- 308. WJUG - Maciej Przepióra - "Java Memory Model for Mere Mortals" [EN]
- goals of these workshops
- understanding modern cache architecture in multi-core environment
- discussing cache write policies
- explaining cache coherence problems
- with a presentation of the best-known cache coherence protocol: MESI
- showing examples of cache-related performance hits
- false sharing
- loop order
- showing processor optimisations
- vectorization
- workshop plan
- false sharing example
- benchmarks
- to trigger: `gradlew jmh`
- vectorization
- loop order
- to trigger:
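- an illustrative JMH benchmark of the false-sharing shape measured here (class and field names are my own, not this repo's actual benchmark classes): two threads keep incrementing adjacent `volatile` fields of a shared object, so both writes very likely land on the same cache line
```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Group;
import org.openjdk.jmh.annotations.GroupThreads;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.Throughput)
@State(Scope.Group)                 // one instance shared by the whole thread group
public class FalseSharingBenchmark {

    public volatile long a;         // written by one thread
    public volatile long b;         // written by the other thread; adjacent to 'a',
                                    // so both fields very likely share a cache line

    @Benchmark
    @Group("sameCacheLine")
    @GroupThreads(1)
    public void writeA() {
        a++;
    }

    @Benchmark
    @Group("sameCacheLine")
    @GroupThreads(1)
    public void writeB() {
        b++;
    }
}
```
- compare the reported throughput with a variant where one of the fields is padded (or `@Contended`, see below) onto its own cache line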
- access to a cache by a processor involves one of two processes: read and write
- each process can have two results
- cache hit = processor accesses its private cache and finds the addressed item already in the cache
- otherwise - cache miss
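- a minimal sketch of how hit vs miss is decided, assuming a direct-mapped cache with 64-byte lines (class name and sizes are illustrative): the block address is split into an index that selects a line and a tag that is compared on every access
```java
// toy model: only tracks the hit/miss bookkeeping, not the cached data itself
final class DirectMappedCache {
    static final int LINE_SIZE = 64;            // bytes per cache line
    static final int LINES = 512;               // number of lines => 32 KB cache

    private final long[] tags = new long[LINES];      // tag stored per line
    private final boolean[] valid = new boolean[LINES];

    /** @return true on a cache hit, false on a miss (the line is then filled) */
    boolean access(long byteAddress) {
        long block = byteAddress / LINE_SIZE;   // which 64-byte block of memory
        int index = (int) (block % LINES);      // which cache line the block maps to
        long tag = block / LINES;               // identifies the block held in that line
        if (valid[index] && tags[index] == tag) {
            return true;                        // cache hit
        }
        tags[index] = tag;                      // cache miss: (re)fill the line
        valid[index] = true;
        return false;
    }
}
```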
- in modern CPUs (almost) all memory accesses go through the cache hierarchy
- CPU core’s load/store (and instruction fetch) units normally can’t even access memory directly
- physically impossible
- they talk to their L1 caches
- at this point, there’s generally more cache levels involved
- L1 cache doesn’t talk to memory directly anymore, it talks to an L2 cache – which in turn talks to memory
- or maybe to a L3 cache
- caches are organized into “lines”, corresponding to aligned blocks of memory
- 32 bytes (older ARMs, 90s/early 2000s x86s/PowerPCs)
- 64 bytes (newer ARMs and x86s)
- 128 bytes (newer Power ISA machines)
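- this line granularity is what the loop-order benchmark exploits; a sketch (assuming 64-byte lines and 4-byte ints): the row-by-row loop touches 16 consecutive elements per fetched line, while the column-by-column loop touches a different line on almost every access
```java
// same arithmetic, very different cache behaviour
static long sumRowMajor(int[][] matrix) {
    long sum = 0;
    for (int row = 0; row < matrix.length; row++) {
        for (int col = 0; col < matrix[row].length; col++) {
            sum += matrix[row][col];      // sequential accesses within one cache line
        }
    }
    return sum;
}

static long sumColumnMajor(int[][] matrix) {
    long sum = 0;
    for (int col = 0; col < matrix[0].length; col++) {
        for (int row = 0; row < matrix.length; row++) {
            sum += matrix[row][col];      // jumps to another row (another line) each time
        }
    }
    return sum;
}
```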
- cache coherence
- a concern that arises with multiple cores and distributed caches
- when multiple processors are operating on the same or nearby memory locations, they may end up sharing the same cache line
- unit of granularity of a cache entry is 64 bytes (512 bits)
- even if you read/write 1 byte you're writing 64 bytes
- it’s essential to keep those overlapping caches consistent with each other
- benefits of multithreading can disappear if the threads are competing for the same cache line
- note that the problem really is that we have multiple caches, not that we have multiple cores
- we could solve the entire problem by sharing all caches between all cores (L1)
- each cycle, the L1 picks one lucky core that gets to do a memory operation this cycle, and runs it
- problem: cores now spend most of their time waiting in line for their next turn at a L1 request
- processors do a lot of those, at least one for every load/store instruction
- slow
- solution: next best thing is to have multiple caches and then make them behave as if there was only one cache
- this is what cache coherency protocols are for
- there are quite a few protocols to maintain cache coherency between CPU cores
- the problem is not unique to parallel processing systems
- strong resemblance to the "lost update" problem
- general approach
- getting read access to a cache line involves talking to the other cores
- might cause them to perform memory transactions
- writing to a cache line is a multi-step process
- before you can write anything, you first need to acquire both exclusive ownership of the cache line and a copy of its existing contents
- "Read For Ownership" request
- each line in a cache is identified and referenced by a cache tag (block number)
- allows the determination of the primary memory address associated with each element in the cache
- each individual cache must monitor the traffic in cache tags
- corresponds to the blocks being read from and written to the shared primary memory
- done by a snooping cache (or snoopy cache, after the Peanuts comic strip)
- basic idea behind snooping is that all memory transactions take place on a shared bus that’s visible to all cores
- caches don’t just interact with the bus when they want to do a memory transaction themselves
- instead, each cache continuously snoops on bus traffic to keep track of what the other caches are doing
- if one cache wants to read from or write to memory on behalf of its core, all the other cores notice
- that allows them to keep their caches synchronized
- one core writes to a memory location => other cores know that their copies of the corresponding cache line are now stale and hence invalid
- problem: write-back model
- it’s not enough to broadcast just the writes to memory when they happen
- physical write-back to memory can happen a long time after the core executed the corresponding store
- for the intervening time, the other cores and their caches might themselves try to write to the same location, causing a conflict
- if we want to avoid conflicts, we need to tell other cores about our intention to write before we start changing anything in our local copy
- memory itself is a shared resource
- memory access needs to be arbitrated
- only one cache gets to read data from, or write back to, memory in any given cycle
- caches do not respond to bus events immediately
- reason: cache is busy doing other things (sending data to the core for example)
- it might not get processed that cycle
- invalidation queue
- place where bus message triggering a cache line invalidation sits for a while until the cache has time to process it
- why do you need the `volatile` keyword in Java if you have cache coherency?
- cache coherency ensures that all threads eventually see the change
- no guarantee about when this happens
- `volatile` works also on a higher level of abstraction
- introduces happens-before relationships around reads and writes
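- a minimal sketch of that happens-before guarantee (class name illustrative): because `ready` is `volatile`, a reader that observes `ready == true` is also guaranteed to see the earlier plain write to `data`
```java
class Handoff {
    private int data;                   // plain field
    private volatile boolean ready;     // volatile flag

    void writer() {
        data = 42;                      // 1. ordinary write
        ready = true;                   // 2. volatile write publishes everything before it
    }

    void reader() {
        if (ready) {                    // 3. volatile read
            System.out.println(data);   // guaranteed to print 42, never a stale 0
        }
    }
}
```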
- cache write policies
- write back
- write operations are usually made only to the cache
- main memory is only updated when the corresponding cache line is flushed from the cache
- results in inconsistency
- example: if two caches contain the same line, and the line is updated in one cache, the other cache will unknowingly have an invalid value
- one fundamental implementation
- MESI
- each cache line can be in one of these four distinct states: Modified, Exclusive, Shared, or Invalid
- key feature: delayed flush to main memory
- example: if no one else reads the data, there is no need to write it to main memory
- better to write only to cache as it is much faster
- write through
- all write operations are made to main memory as well as to the cache
- ensures that main memory is always valid
- has consistency issues
- they occur unless other caches monitor the memory traffic or receive some direct notification of the update
- two fundamental implementations
- with update protocol
- after the write to main memory, a message with the updated data is broadcast to all processor modules in the system
- each processor updates the contents of the affected cache block if this block is present in its cache
- with invalidation of copies
- after the write to main memory, an invalidation request is broadcast through the system
- all copies in other caches are invalidated
- write-through caches are simpler, but write-back has some advantages
- it can filter repeated writes to the same location
- if most of the cache line changes, the write-back can be issued as one large memory transaction instead of several small ones
- more efficient
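- a toy model of the write-back policy described above (one line, one word of data, names illustrative): stores only touch the cache and set a dirty flag, and main memory is updated when the line is evicted - which is exactly how repeated writes to the same location get filtered
```java
import java.util.HashMap;
import java.util.Map;

final class WriteBackLine {
    private final Map<Long, Long> memory = new HashMap<>(); // stands in for main memory

    private long tag = -1;          // which block the line currently holds (-1 = empty)
    private long value;             // cached data (one word, for simplicity)
    private boolean dirty;          // true => cache content differs from memory

    long read(long block) {
        if (tag != block) {         // read miss: write back the old line, fetch the new one
            evict();
            tag = block;
            value = memory.getOrDefault(block, 0L);
        }
        return value;               // read hit: no memory traffic at all
    }

    void write(long block, long newValue) {
        if (tag != block) {         // write miss: fetch the line first (write-allocate)
            evict();
            tag = block;
            value = memory.getOrDefault(block, 0L);
        }
        value = newValue;           // write-back: only the cache is updated...
        dirty = true;               // ...memory stays stale until eviction
    }

    private void evict() {
        if (dirty && tag != -1) {
            memory.put(tag, value); // delayed flush: many writes collapsed into one
        }
        dirty = false;
    }
}
```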
- MESI
- formal mechanism for controlling cache coherency using snooping techniques
- most widely used cache coherence protocol
- each line in an individual processor's cache can exist in one of the following four states
- (M)odified
- result of a successful write hit on a cache line
- its value is different from the main memory
- indicates that the cache line is present in current cache only and is dirty
- it must be in the I state for all other cores
- modified line can be kept by a processor only as long as it's the only processor that has this copy
- on a cache miss, the cache still needs to write-back data to memory if it is in the modified state
- processor must signal "Dirty" and write the data back to the shared primary memory
- causing the other processor to abandon its memory fetch
- it is necessary to first write to memory and then read it from there
- reading cores can't directly read from the cache of the writing core
- it's more expensive
- example
- suppose, both of A & B share that line and B got it directly from the cache line of A
- C needs that line for a write
- both A & B will have keep snooping that line
- shared copies grows
- greater impact on performance due to the snooping done by all the 'shared' processors
- example
- let's say processor A has that modified line
- processor B is trying to read that same cache line (modified by A) from main memory
- A snoops the read attempt for that line
- content in the main memory is invalid now (because A modified the content)
- A has to write it back to main memory
- in order to allow processor B (and others) to read that line
- (E)xclusive
- its value matches the main memory value
- no other cache holds a copy of this line
- main purpose: prevent the unnecessary broadcast of a Cache Invalidate signal on a write hit
- reduces traffic on a shared bus
- (S)hared
- multiple caches may hold the line
- main memory is up to date
- if a core does not have exclusive access to a cache line when it wants to write, it first needs to send an "I want exclusive access" request to the bus
- this tells all other cores to invalidate their copies of that cache line, if they have any
- (I)nvalid
- cache line does not contain valid data
- it will reload that cache line the next time it needs to look at a value in it
- even if that value is perfectly current => false sharing
- requires 2 bits per cache line to hold the state
- 4 values: M, E, S, I
- transition between the states is controlled by memory accesses and bus snooping activity
- suppose a requesting processor processes a write hit on its cache
- by definition, any copy of the line in the caches of other processors must be in the Shared State
- what happens depends on the state of the cache in the requesting processor
- Modified
- protocol does not specify an action for the processor
- Shared
- processor writes the data
- marks the cache line as Modified
- broadcasts a Cache Invalidate signal to other processors
- Exclusive
- processor writes the data and marks the cache line as Modified
- example

| Step | cache line A | cache line B |
|------|--------------|--------------|
| 1    | Exclusive    | -            |
| 2    | Shared       | Shared       |
| 3    | Invalid      | Modified     |
| 4    | Shared       | Shared       |

- Core A reads a value
- Core B reads a value from the same area of memory
- Core B writes into that value
- Core A tries to read a value from that same part of memory
- Core B's cache line is forced back to memory
- Core A's cache line is re-loaded from memory
- simulation
- exclusive
- CPU 1
- is the first (and only) processor to request block A from the shared memory
- issues a BR (Bus Read) for the block and gets its copy
- neither CPU 2 nor CPU 3 respond to the BR
- cache line containing block A is marked Exclusive
- subsequent reads to this block access the cached entry and not the shared memory
- shared
- CPU 2 requests the same block A
- snoop cache on CPU 1 notes the request and CPU 1 broadcasts Shared, announcing that it has a copy of the block
- CPU 3 does not respond to the BR
- both copies of the block are marked as shared
- indicates that the block is in two or more caches for reading
- copy in the shared primary memory is up to date
- modified
- CPU 2 writes to the cache line it is holding in its cache
- issues a BU (Bus Upgrade) broadcast, marks the cache line as Modified, and writes the data to the line
- CPU 1 responds to the BU by marking the copy in its cache line as Invalid
- CPU 3 does not respond to the BU
- informally, CPU 2 can be said to "own the cache line"
- dirty
- CPU 3 attempts to read block A from primary memory
- in CPU 1, the cache line holding that block has been marked as Invalid
- CPU 1 does not respond to the BR (Bus Read) request
- CPU 2 has the cache line marked as Modified
- asserts the signal Dirty on the bus, writes the data in the cache line back to the shared memory, and marks the line Shared
- informally, CPU 2 asks CPU 3 to wait while it writes back the contents of its modified cache line to the shared primary memory
- CPU 3 waits and then gets a correct copy
- The cache line in each of CPU 2 and CPU 3 is marked as Shared
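- the transitions above, condensed into a sketch from the point of view of a single cache line in one cache (bus messages are reduced to method calls and booleans; real protocols also handle the Dirty signal, invalidation queues, etc.)
```java
enum Mesi { MODIFIED, EXCLUSIVE, SHARED, INVALID }

final class CacheLineState {
    Mesi state = Mesi.INVALID;

    // this core reads the line; 'otherCopiesExist' is learned by snooping the bus read
    void onLocalRead(boolean otherCopiesExist) {
        if (state == Mesi.INVALID) {
            state = otherCopiesExist ? Mesi.SHARED : Mesi.EXCLUSIVE;
        }
        // M, E, S: read hit, no state change, no bus traffic
    }

    // this core writes the line
    void onLocalWrite() {
        switch (state) {
            case MODIFIED -> { }                      // write hit, nothing to broadcast
            case EXCLUSIVE -> state = Mesi.MODIFIED;  // silent upgrade, no bus traffic
            case SHARED -> state = Mesi.MODIFIED;     // after broadcasting Cache Invalidate
            case INVALID -> state = Mesi.MODIFIED;    // after a Read For Ownership request
        }
    }

    // another core announced a read of this line on the bus
    void onRemoteRead() {
        if (state == Mesi.MODIFIED) {
            // signal Dirty, write the data back so the reader gets a valid copy
            state = Mesi.SHARED;
        } else if (state == Mesi.EXCLUSIVE) {
            state = Mesi.SHARED;
        }
    }

    // another core announced a write (Bus Upgrade / Read For Ownership) of this line
    void onRemoteWrite() {
        state = Mesi.INVALID;   // our copy is now stale
    }
}
```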
- true sharing
- CPUs are writing to the same variables stored within the same cache line
- false sharing
- CPUs are writing to independent variables stored within the same cache line
- independent = each CPU doesn't really rely on the values written by the other CPU
- steps
- first thread modifies the variables
- cache line is invalidated in all CPU caches
- other CPUs reload the content of the invalidated cache line
- even if they don't need the variable that was modified by first thread
- essence of problems with concurrent programming
- more processors, more power, more electricity consumption, yet slower than a single thread
- solution: padding
- usually a cache line has 16 slots (e.g. 16 four-byte ints in a 64-byte line)
- if you use padding you could move a value to the next cache line
- JVM
- don't use volatile
- the write to main memory will be delayed until the very end
- assuming the threads modify non-overlapping sets of variables, that's OK
- example: each thread accumulates results in its own slot of a shared array
- `private static final int THREADS = 6;`
- `private static int[] results = new int[THREADS];`
- however, if we modify `results` using volatile machinery, e.g. `VH.setVolatile(results, offset, results[offset] + 1)`, we have false sharing
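- a runnable sketch of the setup the snippet assumes (`VH`, `offset`, thread count and class name are illustrative): six threads hammer neighbouring slots of the same `int[]`, and the volatile writes keep bouncing the shared cache line between cores
```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class FalseSharingDemo {
    private static final int THREADS = 6;
    private static final int[] results = new int[THREADS];   // all slots fit in one cache line
    private static final VarHandle VH = MethodHandles.arrayElementVarHandle(int[].class);

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) {
            final int offset = t;                             // each thread owns its own slot
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 10_000_000; i++) {
                    // volatile write => the line is invalidated in the other cores' caches
                    VH.setVolatile(results, offset, results[offset] + 1);
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
    }
}
```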
- change data structures so the independent variables are no longer stored within the same cache line
- `jdk.internal.vm.annotation.Contended`
- annotation to prevent false sharing
- introduced in Java 8 under the `sun.misc` package
- repackaged in Java 9
- by default adds 128 bytes of padding
- cache line size in many modern processors is around 64/128 bytes
- configurable through the `-XX:ContendedPaddingWidth` flag
- annotated field => JVM will add some paddings around it
- annotated class => JVM will add the same padding before all the fields
- `-XX:-RestrictContended`
- by default, `@Contended` is honoured only for trusted (JDK-internal) classes; this flag lifts the restriction so application code can use the annotation
- use cases
- `ConcurrentHashMap`
- `ForkJoinPool`
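- a sketch of both padding approaches (class names illustrative); note that compiling against `jdk.internal.vm.annotation` needs `--add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED`, and the annotation only takes effect outside the JDK with `-XX:-RestrictContended`
```java
import jdk.internal.vm.annotation.Contended;

// let the JVM do the padding: by default 128 bytes are reserved around the field
class ContendedCounter {
    @Contended
    public volatile long value;
}

// manual alternative, no JVM flags needed: surround the hot field with unused longs;
// less reliable, because HotSpot is free to reorder fields
class ManuallyPaddedCounter {
    long p1, p2, p3, p4, p5, p6, p7;    // padding before
    public volatile long value;
    long q1, q2, q3, q4, q5, q6, q7;    // padding after
}
```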
- note that false sharing doesn't cause incorrect results - just a performance hit
- updates may still be lost if writes are not atomic, but that is a data-race issue, not a false-sharing one
- SIMD (Single Instruction, Multiple Data)
- example: add four pairs of numbers together at once, rather than adding them sequentially
- instructions supported by modern processors
- allow a single instruction to operate on multiple data elements simultaneously
- can significantly improve performance by exploiting parallelism at the instruction level
- data should be aligned and contiguous in memory
- vectorization
- involves identifying portions of code that can be executed in parallel using SIMD instructions
- typically applies to arithmetic operations, array computations, and data-processing loops
- example: assembly on the Intel platform
- `vpadd*` (Vector Packed Add, e.g. `vpaddd` for 32-bit integers)
- used for adding packed values stored in SIMD registers
- vs `add` - the typical instruction used for scalar additions
- `vmovdqu` (Vector Move Unaligned)
- used for moving data between memory and SIMD registers
- vs `mov` - the typical instruction used for scalar moves
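- a plain Java loop of this shape is a typical auto-vectorization candidate; the C2 JIT commonly compiles it to `vmovdqu` loads/stores and `vpaddd` additions, processing several ints per instruction (method name illustrative; arrays assumed to be the same length)
```java
static void addArrays(int[] a, int[] b, int[] c) {
    for (int i = 0; i < a.length; i++) {
        c[i] = a[i] + b[i];     // scalar source code, vectorized machine code
    }
}
```
- the generated assembly can be inspected with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly` (requires the hsdis disassembler)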