Architect notes #1

Design for production (anticipate it), not for QA.

It is better (though sometimes more expensive) to invest in development effort up front and cut the cost of support and maintenance after release.

Keep your hands dirty with source code, talk to other engineers.

Start design from the bottom – hardware, networks, regions, AZ

Fault – something went wrong inside the system (it may stay hidden from the user); faults happen all the time

Error – incorrect system behaviour visible from the customer’s POV

Failure – system is down

Cascading/chain failure – the load redistributed from a failed instance causes other instances in the pool to die.

To keep faults from propagating like a snowball, the interconnections/dependencies between system components should be weak (via a mediator service, messages, etc.) and their number should be minimized.

Treat every dependency, third party and external service as unreliable and as a potential point of failure.


Check your component (with JMeter/Gatling) against:

  • load (data grows):
    • impulse (short but very intensive load)
    • stress (constantly increasing load, from small to normal/high range)
  • longevity – to detect memory (or other resource) leaks, keep the system under usual load for a long time, for instance one week
  • production-like amount of dummy data
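The difference between the impulse and stress profiles above can be written down as a target request rate per second of the test. This is only a sketch: `LoadProfile` and its parameters are made-up names, and in practice JMeter or Gatling would drive the load from their own configuration.

```java
import java.util.function.IntUnaryOperator;

/** Hypothetical helper: maps "second of the test run" to a target requests-per-second value. */
final class LoadProfile {
    private LoadProfile() {}

    /** Impulse: steady baseline with one short but very intensive burst. */
    static IntUnaryOperator impulse(int baseRps, int peakRps, int burstStartSec, int burstLengthSec) {
        return second -> (second >= burstStartSec && second < burstStartSec + burstLengthSec)
                ? peakRps
                : baseRps;
    }

    /** Stress: load grows linearly from startRps to endRps over durationSec, then stays at endRps. */
    static IntUnaryOperator stress(int startRps, int endRps, int durationSec) {
        return second -> {
            if (second <= 0) return startRps;
            if (second >= durationSec) return endRps;
            return startRps + (int) ((long) (endRps - startRps) * second / durationSec);
        };
    }
}
```

For example, `LoadProfile.impulse(10, 500, 60, 5)` describes 10 requests per second with a five-second burst of 500 starting at the one-minute mark.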


  • not just “liveness” of some node, but the fact that it correctly performs its duties, e.g. a dummy mock client tries to perform some typical business action
  • whether the session object is as small as possible (weak references can help: the data is kept only until the next GC cycle)
  • test of the most resource-consuming user behaviour


  • tcpdump (minimalistic), wireshark


How to test REST, negative cases:

  • incorrectly formed response
  • no response, timeout
  • empty response
  • mismatch between content type specified and provided
  • very big response
  • weird response status
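Most of the negative cases above can be caught by one guard that inspects a response before the rest of the code touches it. The sketch below is an assumption-heavy illustration: `ResponseGuard`, `ResponseProblem` and the pluggable `parses` predicate are invented names, and the "no response, timeout" case is left to the HTTP client's own timeout setting.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical classification of what can be wrong with a response. */
enum ResponseProblem { UNEXPECTED_STATUS, EMPTY_BODY, TOO_LARGE, CONTENT_TYPE_MISMATCH, MALFORMED_BODY }

/** Hypothetical guard applied to every response from an integration point. */
final class ResponseGuard {
    private final String expectedContentType;
    private final int maxBodyChars;
    private final Predicate<String> parses; // stands in for a real parser of the expected format

    ResponseGuard(String expectedContentType, int maxBodyChars, Predicate<String> parses) {
        this.expectedContentType = expectedContentType.toLowerCase(Locale.ROOT);
        this.maxBodyChars = maxBodyChars;
        this.parses = parses;
    }

    /** Returns the first problem found, or empty if the response looks usable. */
    Optional<ResponseProblem> check(int status, String contentType, String body) {
        if (status < 200 || status > 299) {
            return Optional.of(ResponseProblem.UNEXPECTED_STATUS);
        }
        if (body == null || body.isEmpty()) {
            return Optional.of(ResponseProblem.EMPTY_BODY);
        }
        if (body.length() > maxBodyChars) {
            return Optional.of(ResponseProblem.TOO_LARGE);
        }
        if (contentType == null
                || !contentType.toLowerCase(Locale.ROOT).startsWith(expectedContentType)) {
            return Optional.of(ResponseProblem.CONTENT_TYPE_MISMATCH);
        }
        if (!parses.test(body)) {
            // Declared type was right but the body is not: malformed, or another format entirely.
            return Optional.of(ResponseProblem.MALFORMED_BODY);
        }
        return Optional.empty();
    }
}
```

Each negative test case then becomes one call to `check` with a deliberately broken status, header or body, which keeps the tests independent of any real remote service.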


Approaches to mitigate faults:

  • fault isolation:
    • timeouts
    • circuit breaker, with three states:
      • closed: all calls pass through and failed calls are counted; once the threshold is exceeded, the breaker switches to open
      • open: no calls pass through; after a specified period of time, the breaker switches to half-open
      • half-open: the first call is passed through and its result is analysed; on an error the breaker switches back to open, on a success it switches to closed
    • bulkhead (different scopes of grouping resources: threads, thread pools, process, VM, metal machines, AZ, regions)
  • resource manipulation:
    • everything that generates output, like log files, should have a corresponding component that tracks this output and removes obsolete files, to make sure the hard drive does not fill up
    • reserve resources
    • make queues explicit
  • fault behavior: accept that failures are unavoidable; fail fast when one happens and restart quickly. Track the restart frequency so that it does not exceed some threshold, and use hierarchical supervisors for this. This approach is not suited to monoliths
  • test the service in an environment that simulates production as closely as possible, with all possible faults, errors, network partitions and so on
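The circuit breaker state machine from the list above fits in a few dozen lines. What follows is a minimal sketch, not a production implementation: the class and parameter names are invented, failures are counted consecutively (an assumption, the notes do not say how), and the injected `clock` exists only to make the timing testable. Under concurrency more than one trial call may slip through in the half-open state.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Minimal circuit breaker sketch: CLOSED -> OPEN -> HALF_OPEN -> CLOSED or OPEN. */
final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // current time in milliseconds
    private State state = State.CLOSED;
    private int failures;
    private long openedAt;

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Current state; an OPEN breaker becomes HALF_OPEN once the open period has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    /** Runs the protected call, or fails fast to the fallback while the breaker is open. */
    <T> T call(Supplier<T> protectedCall, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();
        }
        try {
            T result = protectedCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        failures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        failures++;
        // A failed trial call in HALF_OPEN reopens immediately; in CLOSED the threshold decides.
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

In real code the clock would typically be `System::currentTimeMillis`, and the fallback would return a cached or default value instead of calling the failing dependency again.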


The system should serve its users before, during and after being updated.

During an update, sessions from users and from bots/crawlers should be distinguished: an open session can block the update, so non-human sessions can simply be dropped.

Blue-green deployment is better performed by groups of services.

When updating SQL tables, use shims: temporary triggers on the old table that duplicate every modification made to it into the new table. Remove the shims once the update is finished.

Load-test with noise and chaos

Dedicated devops team – is mostly platform or tools team

Chaos engineering – killing random nodes/containers in production


What to monitor:

  • traffic
  • business transactions
  • users
  • resource
  • database
  • data
  • integration points
  • cache

API versioning:

  • version number placed in the URL
  • standard header, like “content-type”
  • custom header
  • body of request (PUT and POST only)
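The first three options above can be combined into one lookup on the server side. A sketch under loud assumptions: the `/v2/...` path scheme, the `application/vnd.example.v3+json` media type and the `X-Api-Version` header are all made up for illustration, header names are matched with exact case for brevity, and versioning through the request body is left out because it needs a body parser.

```java
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical resolver: URL first, then a standard header, then a custom header. */
final class ApiVersionResolver {
    private static final Pattern URL_VERSION = Pattern.compile("^/v(\\d{1,4})(/|$)");
    private static final Pattern MEDIA_TYPE_VERSION =
            Pattern.compile("application/vnd\\.example\\.v(\\d{1,4})\\+json");
    static final String CUSTOM_HEADER = "X-Api-Version"; // assumed header name

    private ApiVersionResolver() {}

    static Optional<Integer> resolve(String path, Map<String, String> headers) {
        // 1. Version in the URL, e.g. /v2/orders
        Matcher url = URL_VERSION.matcher(path);
        if (url.find()) {
            return Optional.of(Integer.parseInt(url.group(1)));
        }
        // 2. Version inside a standard header, e.g. Content-Type: application/vnd.example.v3+json
        Matcher media = MEDIA_TYPE_VERSION.matcher(headers.getOrDefault("Content-Type", ""));
        if (media.find()) {
            return Optional.of(Integer.parseInt(media.group(1)));
        }
        // 3. Version in a custom header, e.g. X-Api-Version: 4
        String custom = headers.get(CUSTOM_HEADER);
        if (custom != null && custom.matches("\\d{1,4}")) {
            return Optional.of(Integer.parseInt(custom));
        }
        return Optional.empty();
    }
}
```

An empty result lets the caller decide on the policy for unversioned requests, for example falling back to the oldest supported version or rejecting the request.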

Java performance extract


  • micro
  • meso
  • macro


  • non Java: CPU, disk, network
  • Java tools: flags, heap, GC

Profilers: sampling, instrumentation, native

JIT: client (default threshold is 1500), server (default threshold is 10000), tiered compilation

Tuning JIT:

  • code cache
  • compilation thresholds – how many times code will be interpreted before gets compiled
  • print compilation process logs
  • compilation thread (amount can be adjusted)
  • inlining (size limit for a method to be inlined – default 325 bytes of bytecode for frequently called methods)
  • escape analysis: very effective, but it can break improperly synchronized code
  • de-optimisation
  • tiered compilation levels
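Before changing any of the JIT settings above, it helps to see what the running JVM actually uses. The sketch below reads a handful of HotSpot flags that correspond to the list; it assumes a HotSpot-based JVM where `com.sun.management.HotSpotDiagnosticMXBean` is available, and `JitSettings` is just an illustrative name. Flags that a particular JVM build does not know are reported as "unavailable".

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Reads the current values of JIT-related HotSpot flags (HotSpot-specific sketch). */
final class JitSettings {
    static final List<String> FLAGS = List.of(
            "ReservedCodeCacheSize", // code cache
            "CompileThreshold",      // compilation threshold
            "PrintCompilation",      // compilation log
            "CICompilerCount",       // compiler threads
            "MaxInlineSize",         // inlining limits
            "FreqInlineSize",
            "DoEscapeAnalysis",      // escape analysis
            "TieredCompilation",     // tiered compilation
            "TieredStopAtLevel");

    private JitSettings() {}

    static Map<String, String> current() {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        Map<String, String> values = new LinkedHashMap<>();
        for (String flag : FLAGS) {
            try {
                values.put(flag, bean == null ? "unavailable" : bean.getVMOption(flag).getValue());
            } catch (RuntimeException e) {
                values.put(flag, "unavailable"); // flag unknown to this JVM build
            }
        }
        return values;
    }
}
```

Printing `JitSettings.current()` at startup gives a baseline, and any of these flags can then be overridden on the command line in the usual `-XX:Name=value` or `-XX:+Name` form.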

GC: serial, throughput (parallel), concurrent (CMS), G1

GC generations: new (eden, survivor), old

All collectors do a stop-the-world pause while collecting eden. For the rest of the heap, CMS and G1 can work either with a stop-the-world pause (lower CPU consumption) or without one (higher CPU consumption).

Serial GC (32-bit, single-core machine or Windows) – for client:

  • single threaded
  • stop-the-world for new or old generation processing

Throughput (Unix, multi-core, 64-bit):

  • multi-threaded
  • stop-the-world for new or old generation processing


Concurrent (CMS):

  • multiple threads for new generation

  • for old generation, one thread scans for objects to free in the background without stopping the application, but the old generation remains fragmented; a stop-the-world pause still happens, though quite rarely, to defragment the old generation, usually when there is no space left to allocate a new object


  • G1: for large heaps (more than 4 GB), the heap is split into regions
  • System.gc() does a stop-the-world pause for all types of GC and performs a full scan

Tuning GC:

  • sizing the heap (too small – GC runs too often; too big – the OS starts swapping RAM to disk)
  • sizing generations
  • PermGen/Metaspace – holds information about loaded classes; resizing it is expensive, so it is better to set its size at startup
  • Controlling amount of GC threads
  • adaptive sizing (should be turned-on)
  • large object

Tuning threads:

  • pool size
  • thread stack size
  • avoid synchronization
  • thread priorities
  • adjusting spinning
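Pool size and thread stack size from the list above can both be set when the pool is created. A minimal sketch with invented names (`BoundedPools`, `newPool`); the stack size passed to the `Thread` constructor is only a hint that the JVM may ignore, and the bounded queue is an extra assumption so that overload shows up as a rejection instead of unbounded memory growth.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Factory for fixed-size pools with an explicit queue bound and thread stack size. */
final class BoundedPools {
    private BoundedPools() {}

    static ThreadPoolExecutor newPool(String name, int threads, int queueCapacity, long stackSizeBytes) {
        AtomicInteger counter = new AtomicInteger();
        ThreadFactory factory = task -> {
            // stackSizeBytes is a hint; 0 means "use the JVM default"
            Thread t = new Thread(null, task, name + "-" + counter.incrementAndGet(), stackSizeBytes);
            t.setDaemon(true);
            return t;
        };
        return new ThreadPoolExecutor(
                threads, threads,                        // fixed pool size
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity), // explicit, bounded queue
                factory,
                new ThreadPoolExecutor.AbortPolicy());   // reject instead of queueing forever
    }
}
```

For example, `BoundedPools.newPool("worker", 8, 100, 256 * 1024)` creates eight worker threads with a 256 KB stack hint and room for 100 waiting tasks; a task submitted beyond that fails with `RejectedExecutionException`.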


  • choose right driver (try different)
  • prepared statement and statement pooling
  • connection pools
  • transaction pools
  • cached queries

Other optimisations:

  • reuse Random
  • JNI is not a solution for performance problems
  • Exceptions are not always an issue
  • One-line string concatenation is faster than multi-line concatenation
  • Lambdas and anonymous classes have the same performance, but lambdas load faster
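Two of the items above (reusing `Random` and one-line versus multi-line concatenation) are easy to show side by side. This is an illustrative sketch with invented names, not a benchmark: both concatenation methods return the same string, and the difference is only in how many intermediate strings are built along the way.

```java
import java.util.Random;

/** Illustrative examples for "reuse Random" and one-line vs multi-line concatenation. */
final class SmallOptimisations {
    // Created once and reused, instead of "new Random()" on every call.
    private static final Random SHARED = new Random();

    private SmallOptimisations() {}

    static int rollDice() {
        return 1 + SHARED.nextInt(6); // value in 1..6
    }

    /** One expression: the pieces are joined in a single pass. */
    static String oneLine(String a, String b, String c) {
        return a + ", " + b + ", " + c;
    }

    /** Several statements: every += builds a new intermediate String. */
    static String multiLine(String a, String b, String c) {
        String s = a;
        s += ", ";
        s += b;
        s += ", ";
        s += c;
        return s;
    }
}
```

`java.util.Random` is safe to share between threads, but under heavy contention `ThreadLocalRandom.current()` is the usual alternative.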