Tools

  • (Gonzalez, 2016) JMH (Java Microbenchmark Harness) http://openjdk.java.net/projects/code-tools/jmh/
  • (Gonzalez, 2016) Java Path Finder (JPF) from NASA
  • (Evans, 2013) jmap; jhat
  • (Goetz, 2006) Unix: vmstat; Windows: perfmon
  • Collections
    • Fastutil http://fastutil.di.unimi.it/ Great default choice for collections of primitive types, like int or long. Also handles big collections with more than 2^31 elements well.
    • Guava https://github.com/google/guava
    • Eclipse Collections https://www.eclipse.org/collections/ this library includes almost any collection you might need: primitive type collections, multimaps, bidirectional maps and so on.
    • !! JCTools https://github.com/JCTools/JCTools for better performance in high-throughput concurrent applications.
      • check for Unsafe with Java 9+
      • https://www.baeldung.com/java-concurrency-jc-tools
  • ?? vmlens.com
  • LMAX Disruptor
    • https://martinfowler.com/articles/lmax.html
    • Disruptor pattern with event sourcing: https://github.com/mikeb01/ticketing.git
    • Disruptor pattern and Kafka: https://github.com/mykidong/high-performant-event-collector.git
      • https://medium.com/@mykidong/howto-high-performant-event-collector-with-disruptor-and-vert-x-2e1a1949a62c

General

  • (Gonzalez, 2016) Avoid deadlocks by ordering the locks.
  • (Gonzalez, 2016) Avoid executing inside the critical section the code you don’t control.
  • (Gonzalez, 2016) Avoid the use of blocking operations inside a critical section.
  • (Subramaniam, 2011) Division of labour - generally speaking, there are two types of programs that may profit from multi-threading: those with intensive computations and those with intensive I/O.
    • for computation-intensive tasks, use about as many threads as available cores (+ see pp. 31-32)
    • for I/O there should be more threads
      • When a task performs an IO operation, its thread gets blocked. The processor immediately context switches to run other eligible threads. If we had only as many threads as the number of available cores, even though we have tasks to perform, they can’t run because we haven’t scheduled them on threads for the processors to pick up.
    • the formula: N of threads = N of CPUs / (1 - blocking coefficient), where the blocking coefficient is 0 for pure computation and close to 1 for I/O-heavy work (see the sketch after this list)
      • It is a bit of an educated guess, but the java.lang.management API can be helpful.
    • see also rx.schedulers.Schedulers with io() and computation()
  • (Goetz, 2006) Division of labour (another formula): N of threads = N of CPUs * target CPU utilization * (1 + wait time / compute time), where target CPU utilization must be between 0 and 1.
  • (Goetz, 2006) Unless a field should be mutable make it final.
  • (Goetz, 2006) ThreadPool Saturation Policies (TBD see ch 6.4)
  • (Gonzalez, 2017) on volatile: happens-before and reordering guarantees - see the Volatile section below.
  • me summary:
    • 1 volatile - reads/writes go to main memory, not just the CPU cache => slower. Establishes happens-before on each access: code before a volatile write is always before it, and code after a volatile read is always after it; the accesses themselves cannot be reordered away by the compiler. Occasionally a good fit for many reads / few writes.
    • 2 atomic - CPU built-in support for compare-and-swap (CAS). Useful for "check-then-act" situations
    • 3 sync - Using the synchronized keyword also establishes a happens-before relationship between statements. Entering a synchronized method/block establishes a happens-before relationship between the statements that appear before it and the ones inside the method/block
  • me Threads are heavyweight
    • each thread reserves its own stack, typically about 1 MB by default (see the -Xss option / ThreadStackSize flag)
    • too many threads hurt data locality (the L1 & L2 cache contents built up by one thread get evicted when another thread is scheduled on the same core)
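
A minimal sketch of the two thread-pool sizing formulas above (the blocking coefficient, target utilization, and wait/compute ratio are illustrative guesses, not measurements):

// Subramaniam: threads = cores / (1 - blocking coefficient)
int cores = Runtime.getRuntime().availableProcessors();
double blockingCoefficient = 0.9; // ~0 for computation, close to 1 for I/O
int ioThreads = (int) (cores / (1 - blockingCoefficient));
// Goetz: threads = cores * target utilization * (1 + wait time / compute time)
double targetUtilization = 0.8;  // must be between 0 and 1
double waitToComputeRatio = 5.0; // estimate, e.g. via java.lang.management
int mixedThreads = (int) (cores * targetUtilization * (1 + waitToComputeRatio));
java.util.concurrent.ExecutorService pool =
    java.util.concurrent.Executors.newFixedThreadPool(ioThreads);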

Locks

  • [wiki] Read/Write Lock: a read-preferring lock can cause writer starvation; giving write requestors higher priority helps (me - can use a priority queue for the threads). Plus it should be reentrant.
  • [wiki] Read/Write Lock - read-preferring example (RL - read lock, WL - write lock; a Java sketch follows the copy-on-write code below):
    • start read: lock RL; counter++; if(counter == 1) lock WL; unlock RL;
    • end read: lock RL; counter--; if(counter == 0) unlock WL; unlock RL;
    • start write: lock WL
    • end write: unlock WL
  • TODO The read-copy-update (RCU) algorithm can be wait-free for readers.
  • me copy-on-write (see add/get of CopyOnWriteArrayList). On write: lock; copy the array; apply the change to the copy; publish the copy to the volatile field; release the lock. Reads just dereference the volatile field, so it is wait-free for readers.
    • avoids shared mutable state, but writes are slower (copying plus the volatile publish)
    • i.e. copy before modification - the whole backing array is copied on each write
    • example from vmlens
  • me check on spinlocks: spinning avoids the cost of a thread switch, but should not be overused.
    • -XX:+UseSpinning to enable (JDK 6; removed in JDK 7+, where adaptive spinning is built in)
    • -XX:PreBlockSpin=12 - N of spin tries before the thread blocks
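// copy-on-write sketch (cf. java.util.concurrent.CopyOnWriteArrayList's get/set)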
final transient ReentrantLock lock = new ReentrantLock();
private transient volatile Object[] array;
public E get(int index) {
  return (E) array[index];
}
public E set(int index, E element) {
  final ReentrantLock lock = this.lock;
  lock.lock();
  try {
    Object[] elements = array;
    E oldValue = (E) elements[index];
    if (oldValue != element) {
      int len = elements.length;
      Object[] newElements = Arrays.copyOf(elements, len);
      newElements[index] = element;
      array = newElements;
    } else {
      // Not quite a no-op; ensures volatile write semantics
      array = elements;
    }
    return oldValue;
  } finally {
    lock.unlock();
  }
}
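
A minimal Java translation of the read-preferring pseudocode above. WL is modeled with a Semaphore because, unlike ReentrantLock, a Semaphore may be released by a different thread than the one that acquired it (the last reader releases what the first reader acquired):

import java.util.concurrent.Semaphore;

class ReadPreferringRWLock {
  private final Semaphore rl = new Semaphore(1); // guards the reader counter
  private final Semaphore wl = new Semaphore(1); // held while readers or a writer are active
  private int readers = 0;

  void startRead() throws InterruptedException {
    rl.acquire();
    try {
      readers++;
      if (readers == 1) wl.acquireUninterruptibly(); // first reader locks out writers
    } finally { rl.release(); }
  }
  void endRead() throws InterruptedException {
    rl.acquire();
    try {
      readers--;
      if (readers == 0) wl.release(); // last reader lets writers in
    } finally { rl.release(); }
  }
  void startWrite() throws InterruptedException { wl.acquire(); }
  void endWrite() { wl.release(); }
}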

Semaphores

(Jenkov, 2014)

// avoids missed signals in case take() happens before release()
class SignalSemaphore {
  private boolean signal = false;
  public synchronized void take() { signal = true; notify(); }
  public synchronized void release() throws InterruptedException {
    while (!signal) wait();
    signal = false;
  }
}

class SendingThread extends Thread {
  private final SignalSemaphore s;
  public SendingThread(SignalSemaphore s) { this.s = s; }
  // do something, then signal
  public void run() { while (true) { doSomething(); s.take(); } }
}

class ReceivingThread extends Thread {
  private final SignalSemaphore s;
  public ReceivingThread(SignalSemaphore s) { this.s = s; }
  // wait for the signal, then do something
  public void run() {
    try { while (true) { s.release(); doSomething(); } }
    catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}

SignalSemaphore s = new SignalSemaphore();
SendingThread sender = new SendingThread(s);
ReceivingThread receiver = new ReceivingThread(s);
receiver.start();
sender.start();

class CountingSemaphore {
  private int signal = 0;
  public synchronized void take() { signal++; notify(); }
  public synchronized void release() throws InterruptedException {
    while (signal == 0) wait();
    signal--;
  }
}

class BoundedSemaphore {
  private int signal = 0;
  private final int bound;
  public BoundedSemaphore(int upperBound) { bound = upperBound; }
  public synchronized void take() throws InterruptedException {
    while (signal == bound) wait();
    signal++;
    notify();
  }
  public synchronized void release() throws InterruptedException {
    while (signal == 0) wait();
    signal--;
    notify();
  }
}
// when bound = 1 and take/release happen on the same thread => can be used as a lock
// can be used to limit the N of threads entering a section (bound = max threads allowed inside)
BoundedSemaphore lock = new BoundedSemaphore(1);
lock.take();
try { // critical section
} finally { lock.release(); }
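
In practice, java.util.concurrent.Semaphore already provides this bounded behavior; a minimal sketch of the same pattern (the permit count of 3 is illustrative):

java.util.concurrent.Semaphore semaphore = new java.util.concurrent.Semaphore(3); // at most 3 threads inside
semaphore.acquire(); // may throw InterruptedException
try { // critical section
} finally { semaphore.release(); }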

Collections

  • (Gonzalez, 2017) Blocking collections: this kind of collection includes operations to add and remove data. If an operation can't be done immediately, because the collection is either full or empty, the calling thread blocks until the operation can be carried out.
  • (Gonzalez, 2017) Non-blocking collections: This kind of collection also includes operations to add and remove data. But in this case, if the operation can’t be done immediately, it returns a null value or throws an exception; the thread that makes the call won’t be blocked here.
  • [jrebel] + [conc_j9]

    collection      thread-safe alternative
    ArrayList       CopyOnWriteArrayList
    HashMap/HashSet ConcurrentHashMap
    TreeMap         ConcurrentSkipListMap
    LinkedHashMap   TODO check
    PriorityQueue   PriorityBlockingQueue*
    ArrayDeque      ArrayBlockingQueue*
    Deque           ConcurrentLinkedDeque
    Queue           LinkedTransferQueue* **

    *blocking **good for producer/consumer

Mutable or Immutable

  • (Gonzalez, 2016) If you have a simple object without internal data structures, it's usually not a problem to make it immutable. However, making complex objects that incorporate collections of other objects immutable usually leads to serious performance problems.

Atomic

  • (Gonzalez, 2016) Prefer using atomic variables instead of synchronization
  • (Gonzalez, 2017) API: AtomicLong, LongAdder, DoubleAccumulator, AtomicIntegerArray
  • (Gonzalez, 2017) The individual AtomicInteger objects are thread-safe, but the array as a data structure AtomicInteger[] is not.
  • (Goetz, 2006) CAS (compare-and-swap; compare-and-set): When multiple threads attempt to update the same variable simultaneously using CAS, one wins and updates the variable’s value, and the rest lose. But the losers are not punished by suspension, as they could be if they failed to acquire a lock; instead, they are told that they didn’t win the race this time but can try again. Because a thread that loses a CAS is not blocked, it can decide whether it wants to try again, take some other recovery action, or do nothing (a failed CAS means that someone else already did the work you were planning to do).
    • typical pattern: CAS atomically updates the memory location V to the new value B, but only if the current value in V matches the expected old value A; otherwise it does nothing.
    • CAS-based counters outperform lock-based counters under low contention, and often under no contention
    • disadvantage: CAS forces the caller to deal with contention (by retrying, backing off, giving up, ...), whereas a lock deals with contention automatically by blocking until it is available.
    • another disadvantage: the difficulty of constructing the surrounding algorithms
  • me see also AtomicReferenceFieldUpdater - keeps objects smaller (a plain volatile field instead of a separate Atomic object), but uses reflection and may be slower.
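
A minimal sketch of the CAS retry pattern described above, using AtomicLong (the counter class is illustrative):

import java.util.concurrent.atomic.AtomicLong;

class CasCounter {
  private final AtomicLong value = new AtomicLong();
  long increment() {
    while (true) {
      long current = value.get(); // read the expected old value A
      long next = current + 1;    // compute the new value B
      if (value.compareAndSet(current, next)) return next; // we won the race
      // a failed CAS means another thread updated the value first; retry
    }
  }
}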

Volatile

  • (Gonzalez, 2017) The volatile keyword only works well when the value of the shared variable is only modified by one thread. If the variable is modified by multiple threads, the volatile keyword doesn’t protect you from possible data-race conditions. It also doesn’t make operations, such as + or -, atomic. For example, the ++ operator over a volatile variable is not thread-safe.
  • (Gonzalez, 2016) The volatile keyword ensures that when the stopped variable is set to true by another thread, this change will be visible in the main method. Without the volatile keyword, the change might not be visible due to CPU caching or compiler optimizations.
  • (Gonzalez, 2017) Since Java 5, Java Memory Model has a happens-before guarantee established with the volatile keyword. This fact has two implications:
    • When you modify a volatile variable, its value is sent to the main memory. The values of all the variables previously modified by the same thread are sent too.
    • Compilers can't reorder statements that modify a volatile variable for optimization purposes. They can reorder the operations before and after it, but not the modification of the volatile variable itself. The changes that happen before such a modification will be visible to the instructions that follow it.
  • (Gonzalez, 2016) Under some circumstances, you can use the volatile keyword and not use a synchronization mechanism. If only one of the tasks modifies the data and the rest of the tasks read it, you can use the volatile keyword without any synchronization or data inconsistency problem. In other scenarios, you need to use a lock, the synchronized keyword, or any other synchronization method.
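
A minimal stop-flag sketch for the single-writer scenario above (class and method names are illustrative):

class Worker implements Runnable {
  private volatile boolean stopped = false; // written by one thread, read by another
  public void stop() { stopped = true; }    // called from another thread
  public void run() {
    while (!stopped) {
      // do work; without volatile this loop might never observe the change
    }
  }
}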

Barriers

(Thompson, 2011)

  • Store Barrier: the sfence instruction on x86 waits for all store instructions prior to the barrier to be written from the store buffer to the L1 cache of the CPU on which it is issued. This makes the program state visible to other CPUs so they can act on it if necessary.
  • Load Barrier: the lfence instruction on x86 forces all load instructions after the barrier to happen after the barrier, then waits on the load buffer to drain for the issuing CPU. This makes program state exposed by other CPUs visible to this CPU before it makes further progress.
  • Full Barrier: the mfence instruction on x86 is a composite of both load and store barriers happening on a CPU.
  • In Java Memory model:
    • a volatile field has a store barrier inserted after a write to it and a load barrier inserted before a read of it.
    • qualified final fields of a class have a store barrier inserted after their initialization, to ensure these fields are visible once the constructor completes.
    • atomic instructions (e.g. locks) are effectively full barriers. Software locks usually employ memory barriers, or atomic instructions, to achieve visibility and preserve program order.
  • Memory barriers prevent a CPU from performing a lot of techniques to hide memory latency, and therefore have a significant performance cost. ... It is best to model the problem so the processor can do units of work, then have all the necessary memory barriers occur on the boundaries of these work units. This allows the processor to optimise the units of work without restriction. There is an advantage to grouping necessary memory barriers: buffers flushed after the first one will be less costly, because no work will be under way to refill them.

Future

  • (Gonzalez, 2016) You can cancel the execution of the task using the cancel() method. This method has a boolean parameter to specify whether you want to interrupt the task if it’s running or not.
  • (Gonzalez, 2016) You can check whether the task has been cancelled (with the isCancelled() method) or it has finished (with the isDone() method).
  • (Gonzalez, 2016) You can get the value returned by the task using the get() method.
  • (Gonzalez, 2016) One possible option is to create other tasks to process the Future objects associated with every task; the Java concurrency API provides an elegant way to implement this with the CompletionService interface and its implementation, the ExecutorCompletionService class.
  • (Gonzalez, 2016) A CompletionService object wraps an executor and allows you to decouple the production of tasks from the consumption of their results. You can send tasks to the executor using the submit() method and get the results of the tasks as they finish using the poll() or take() methods (see the sketch after the code below).
  • (Gonzalez, 2016) CompletableFuture, a new feature of Java 8, extends the Future mechanism of the executor tasks to generate the result of a task in an asynchronous way. You can specify tasks to be executed after the result is generated, so you can control the order of the execution of the tasks. It allows you to implement an event-driven model linking tasks that will only be executed when others have finished.
    • As with the Future interface, CompletableFuture must be parameterized with the type of the result that will be returned by the operation. As with a Future object, the CompletableFuture class represents a result of an asynchronous computation, but the result of CompletableFuture can be established by any thread.
  • (Urma et al., 2015) a slightly more complex example, synchronous vs. asynchronous, with CompletableFuture:
// synchronous example
List<String> findPrices(String product){
  return shops.stream()
    .map(shop -> shop.getPrice(product))
    .map(Quote::parse)
    .map(Discount::applyDiscount)
    .collect(toList());
}
// asynchronous with CompletableFutures
List<String> findPrices(String product){
  List<CompletableFuture<String>> prices = shops.stream()
    // async retrieve of shops' prices
    .map(shop -> CompletableFuture.supplyAsync(() -> shop.getPrice(product),executor))
    .map(future -> future.thenApply(Quote::parse))
    // another async task to apply the discount
    .map(future -> future.thenCompose(quote -> CompletableFuture.supplyAsync(() -> Discount.applyDiscount(quote), executor)))
    .collect(toList());
  // wait for all the futures in the stream to complete, then extract their results (the join() method)
  // alternatively, return the list of futures, which may not yet be completed - findPricesAsync(...)
  return prices.stream()
    .map(CompletableFuture::join)
    .collect(toList());
}
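
A minimal ExecutorCompletionService sketch for the decoupling described above (the task bodies and pool size are illustrative):

import java.util.concurrent.*;

ExecutorService executor = Executors.newFixedThreadPool(4);
CompletionService<String> service = new ExecutorCompletionService<>(executor);
for (int i = 0; i < 10; i++) {
  final int id = i;
  service.submit(() -> "result-" + id); // production of tasks
}
for (int i = 0; i < 10; i++) {
  // take() returns the next *completed* future, regardless of submission order
  // (InterruptedException/ExecutionException handling omitted)
  String result = service.take().get();
}
executor.shutdown();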

Measurement

  • (Evans, 2013) latency - end-to-end time to process a single work-unit for a given workflow (Y-axis response time, X-axis load)
  • (Evans, 2013) throughput - is the number of units of work that a system can perform in some time period with given resources (e.g number of transactions per second)
  • (Evans, 2013) utilization - represents the percentage of available resources that are being used to handle work units, instead of housekeeping tasks
  • (Evans, 2013) efficiency - is equal to the throughput divided by the resources used. A system that requires more resources to produce the same throughput is less efficient.
  • (Evans, 2013) capacity - the number of work units (such as transactions) that can be in flight through the system at any time. That is, it’s the amount of simultaneous processing available at specified latency or throughput.
  • (Evans, 2013) scalability - As resources are added to a system, the throughput (or latency) will change. This change in throughput or latency is the scalability of the system. If solution A doubles its throughput when the available servers in a pool are doubled, it’s scaling in a perfectly linear fashion. Perfect linear scaling is very, very difficult to achieve under most circumstances.
  • (Goetz, 2006) measure, don't guess (the actual cost varies across platforms, but a good rule of thumb is that a context switch costs the equivalent of 5,000 to 10,000 clock cycles, i.e. several microseconds).

Basic synchronization and Locks

  • (Goetz, 2006) join(): allows one thread to wait for the completion of another. t.join(); causes the current thread to pause execution until t’s thread terminates.
  • (Goetz, 2006) t.sleep(): sleep() is a static method, so it affects the calling thread, not t (same for yield()). Write Thread.sleep(time) instead to avoid this confusion.
  • (Goetz, 2006) yield(): Don’t rely on thread priorities when designing your multithreaded application. Because thread-scheduling priority behavior is not guaranteed, use thread priorities as a way to improve the efficiency of your program, but just be sure your program doesn’t depend on that behavior for correctness.
    • What yield() is supposed to do is make the currently running thread head back to runnable to allow other threads of the same priority to get their turn (from running to runnable state).
    • there’s no guarantee the yielding thread won’t just be chosen again over all the others!
  • (Goetz, 2006) Static methods/fields are guarded by a lock that is distinct from the lock of any instance of the class, because it is associated with the Class object, not with an instance.
  • (Goetz, 2006) Contended synchronization (lock contention: threads get suspended and switched, which is heavy and slow) vs. uncontended synchronization (synchronized or volatile with no competition)
    • Synchronization by one thread can also affect the performance of other threads. Synchronization creates traffic on the shared memory bus; this bus has a limited bandwidth and is shared across all processors. If threads must compete for synchronization bandwidth, all threads using synchronization will suffer. This aspect is sometimes used to argue against the use of non-blocking algorithms without some sort of back-off, because under heavy contention, non-blocking algorithms generate more synchronization traffic than lock-based ones.
    • Suspending a thread because it could not get a lock, or because it blocked on a condition wait or blocking I/O operation, entails two additional context switches and all the attendant OS and cache activity.
  • (Goetz, 2006) Fairness vs. barging (ch 10.2)
    • fairness has a significant performance cost
    • ReentrantLock is non-fair by default and permits barging: if the lock is free at the time of the request, the thread takes it without waiting in the queue.
    • tryLock() always barges, even on a fair lock
  • (Goetz, 2006) ReentrantReadWriteLock + ReadWriteMap.java (example)
    • In reality, ConcurrentHashMap’s performance is so good that you would probably use it rather than this approach if all you needed was a concurrent hash-based map, but this technique would be useful if you want to provide more concurrent access to an alternate Map implementation such as LinkedHashMap.
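
A condensed sketch of the ReadWriteMap idea mentioned above, wrapping an arbitrary Map (e.g. a LinkedHashMap) behind a ReentrantReadWriteLock:

import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReadWriteMap<K, V> {
  private final Map<K, V> map;
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  ReadWriteMap(Map<K, V> map) { this.map = map; }
  V put(K key, V value) {
    lock.writeLock().lock();
    try { return map.put(key, value); } finally { lock.writeLock().unlock(); }
  }
  V get(K key) {
    lock.readLock().lock();
    try { return map.get(key); } finally { lock.readLock().unlock(); }
  }
}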

Fork/Join Notes

  • (Urma et al., 2015) Work stealing: each thread holds a doubly linked queue of the tasks assigned to it, and as soon as it completes a task it pulls another from the head of its queue and starts executing it. When its queue is empty, instead of becoming idle, the thread randomly chooses the queue of a different thread and steals a task from its tail.
  • (Urma et al., 2015) Call join() only after the related task has been started (forked), as it blocks the caller until the result is produced.
  • (Urma et al., 2015) When splitting a computation into two tasks, call fork() on one of them, which will use another thread from the ForkJoinPool, and compute() on the other in order to reuse the current thread. This is more efficient.
  class MyRecursiveTask extends RecursiveTask<Long> {
    // final fields and constructor
    ...
    @Override protected Long compute(){
      //if(threshold matched) {return computeSequentially();} else ...
      MyRecursiveTask t1 = new MyRecursiveTask(...);
      // async exec using another thread from the ForkJoinPool
      t1.fork();
      MyRecursiveTask t2 = new MyRecursiveTask(...);
      // sync exec in the same thread allowing new recursive splits
      Long result2 = t2.compute();
      // read the result from the 1st task or wait until finished
      Long result1 = t1.join();
      return result1 + result2;
    }
  }
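
  Typical usage of the sketch above: hand the root task to a ForkJoinPool (the constructor arguments stay elided, as in the class above):

  ForkJoinPool pool = ForkJoinPool.commonPool();
  Long result = pool.invoke(new MyRecursiveTask(...));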

References

  • Evans, B. (2013). The well-grounded Java developer. Manning.
  • Goetz, B. (2006). Java concurrency in practice. Addison-Wesley.
  • Gonzalez, J. (2017). Java 9 concurrency cookbook. Packt Publishing.
  • Gonzalez, J. (2016). Mastering concurrency programming with Java 8. Packt Publishing.
  • Jenkov, J. (2014). Semaphores. http://tutorials.jenkov.com/java-concurrency/semaphores.html
  • Subramaniam, V. (2011). Programming concurrency on the JVM. Pragmatic Bookshelf.
  • Thompson, M. (2011). Mechanical Sympathy: Memory Barriers/Fences. In Mechanical Sympathy. https://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html
  • Urma, R.-G., Fusco, M., & Mycroft, A. (2015). Java 8 in action: lambdas, streams, and functional-style programming. Manning.