Recently I read the excellent book Release It! by Michael Nygard. The book is 7 years old, and I don’t know how I managed to miss it until now.
Michael Nygard shows how to design and architect medium- and large-scale web applications. These are real lessons learnt in the trenches, not golden rules from ivory-tower architects.
This blog post is a dump of the notes I took while reading the book. The list could be used as a checklist by system architects and developers. The notes are in no particular order, and there may be duplicates too.
admin access – should use a separate network from regular traffic; otherwise the administrator will not be able to connect to the system when something is wrong.
network timeouts – should always be defined; otherwise our system could hang when there is a problem with a remote service.
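A minimal Java sketch of the timeout note above, using plain sockets. The class name `TimeoutDemo` and the method `respondsWithin` are made up for illustration; only `Socket.connect(SocketAddress, int)` and `Socket.setSoTimeout(int)` are standard JDK calls.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper, not taken from the book.
class TimeoutDemo {
    /**
     * Returns true if the remote side sends a byte (or closes the stream)
     * within readMillis, false if it stays silent for that long.
     * A connect that takes longer than connectMillis fails with an IOException.
     */
    static boolean respondsWithin(String host, int port, int connectMillis, int readMillis)
            throws IOException {
        try (Socket socket = new Socket()) {
            // Bound the time spent establishing the connection.
            socket.connect(new InetSocketAddress(host, port), connectMillis);
            // Bound every blocking read on this socket.
            socket.setSoTimeout(readMillis);
            try {
                socket.getInputStream().read();
                return true;
            } catch (SocketTimeoutException silentPeer) {
                return false;
            }
        }
    }
}
```

Without the two timeouts, both the connect and the read could block for as long as the remote side stays silent.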
firewall – be aware of timeouts in the firewall's connection tracking tables; if a connection is unused for a long time (e.g. a connection from the pool), the firewall could drop its packets silently.
failure probability – failures are dependent, unlike coin tosses.
3rd party vendors – their client libraries often suck: you cannot define timeouts, you cannot configure threading correctly.
method wait – always provide a timeout, do not use the method without one.
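A small Java sketch of a timed wait, assuming a one-slot mailbox shared between threads. The `Mailbox` class is hypothetical; `Object.wait(long)` is the standard call. The loop recomputes the remaining time because a thread can wake up before the condition holds.

```java
// Hypothetical one-slot mailbox, for illustration only.
class Mailbox {
    private final Object lock = new Object();
    private String message;

    void put(String newMessage) {
        synchronized (lock) {
            message = newMessage;
            lock.notifyAll();
        }
    }

    /** Returns the message, or null if none arrives within timeoutMillis. */
    String take(long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (lock) {
            while (message == null) {
                long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    return null; // give up instead of waiting forever
                }
                lock.wait(remainingMillis); // never wait(0): that means "no timeout"
            }
            String taken = message;
            message = null;
            return taken;
        }
    }
}
```

The caller has to handle the `null` result explicitly, which is the point: a missing reply becomes a visible case instead of a hung thread.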
massive email with deep links – do not send mass emails with deep links; a burst of requests to a single resource could kill your application.
threads ratio – check the ratio of front-end to back-end threads; the system is only as fast as its slowest part.
SLA – define different SLAs for different subsystems, not everything must have 99.99%.
high CPU utilization – check GC logs first.
JVM crash – typical after OOM, when native code is trying to allocate memory: malloc() returns an error, but only a few programmers handle this error.
Collection size – do not use unbounded collections; a huge data set will eventually kill your application.
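One way to keep a collection bounded in Java, sketched with the standard `ArrayBlockingQueue`. The wrapper class `BoundedInbox` and its method names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical wrapper, for illustration only.
class BoundedInbox {
    private final BlockingQueue<String> queue;

    BoundedInbox(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns false instead of growing without limit when the inbox is full. */
    boolean submit(String item) {
        return queue.offer(item);
    }

    int size() {
        return queue.size();
    }
}
```

Rejecting work at the limit pushes the problem back to the caller, which is usually easier to deal with than an out-of-memory error later.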
Outgoing communication – define timeouts.
Incoming communication – fail fast, be pleasant to other systems.
separate thread pool – for admin access, your last resort for fixing the system.
input validation – fail fast, use JS validation even if the validation must be duplicated.
circuit breaker – design pattern for handling unavailable remote services.
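A simplified Java sketch of the circuit breaker idea, not the book's implementation. The class, its parameters and the injected clock are assumptions made for illustration. After a number of consecutive failures the breaker opens and returns the fallback without touching the remote service; once the retry interval has passed it lets one trial call through.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified breaker; holding the lock during the call
// serializes callers, which a real implementation would avoid.
class CircuitBreaker {
    private final int failureThreshold;
    private final long retryAfterMillis;
    private final LongSupplier clockMillis;
    private int consecutiveFailures;
    private boolean open;
    private long openedAtMillis;

    CircuitBreaker(int failureThreshold, long retryAfterMillis, LongSupplier clockMillis) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
        this.clockMillis = clockMillis;
    }

    synchronized <T> T call(Supplier<T> remoteCall, T fallback) {
        if (open && clockMillis.getAsLong() - openedAtMillis < retryAfterMillis) {
            return fallback; // open: fail fast, do not call the remote service
        }
        try {
            T result = remoteCall.get(); // closed, or a half-open trial call
            consecutiveFailures = 0;
            open = false;
            return result;
        } catch (RuntimeException failure) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                open = true; // trip, or re-open after a failed trial call
                openedAtMillis = clockMillis.getAsLong();
            }
            return fallback;
        }
    }

    synchronized boolean isOpen() {
        return open;
    }
}
```

In production code the clock would typically be `System::currentTimeMillis`; it is injected here only so the behaviour can be exercised without sleeping.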
handshake in protocol – an alternative to the circuit breaker if you design your own protocol.
test harness – test using a production-like environment (but how to do that???)
capacity – always multiply by number of users, requests, etc.
safety limits on everything – nice general rule.
Oracle and connection pool – Oracle in its default configuration spawns a separate process for every connection; check how much memory is used just for handling client connections.
unbalanced resources – the underestimated part will fail first, and it could hang the whole system.
JSP and GC – be aware of the noclassgc JVM option, compiled JSP files use perm gen space.
HTTP sessions – users do not understand the concept, do not keep the shopping cart in the session :–)
whitespaces – remove any unnecessary whitespace from the pages; at large scale it saves a lot of traffic.
avoid hand-crafted SQL – hard to predict the outcome, and hard to optimize for performance.
database tests – use the real data volume.
unicast – could be used for up to ~10 servers, for bigger cluster use multicast.
cache – always limit cache size.
hit ratio – always monitor cache hit ratio.
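The two cache notes above can be combined in one small Java sketch. It assumes an LRU eviction policy, which the notes do not prescribe. The `BoundedCache` class is hypothetical; the access-ordered `LinkedHashMap` constructor and `removeEldestEntry` are standard JDK features.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical size-limited LRU cache that also tracks its hit ratio.
class BoundedCache<K, V> {
    private final Map<K, V> entries;
    private long hits;
    private long misses;

    BoundedCache(final int maxEntries) {
        // accessOrder = true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries; // evict once the limit is exceeded
            }
        };
    }

    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            misses++;
        } else {
            hits++;
        }
        return value;
    }

    synchronized void put(K key, V value) {
        entries.put(key, value);
    }

    synchronized int entryCount() {
        return entries.size();
    }

    /** Fraction of lookups served from the cache; 0.0 before any lookup. */
    synchronized double hitRatio() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```

A hit ratio that stays low suggests the cache is mostly consuming memory without saving work.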
precompute html – huge server resource saver, not everything changes on every request.
JVM tuning – is application release specific; memory utilization could be different in every release.
multihomed servers – in production the network topology is much more complex.
bonding – a single network configured with multiple network cards and multiple switch ports.
backup – use a separate network; backup always consumes your whole bandwidth.
virtual IP – always configure virtual IP, your configuration will be much more flexible.
technical accounts – do not share accounts between services; that would be a security flaw.
cluster configuration verification – periodically check configuration on the cluster nodes, even if the configuration is deployed automatically.
separate configuration specific to a single cluster node – keep node-specific configuration separate from shared configuration.
configuration property names – should be based on function, not nature (e.g. hostname is too generic).
graceful shutdown – do not terminate existing business transactions.
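A minimal Java sketch of a graceful shutdown, assuming the business transactions run as tasks on an `ExecutorService`. The `GracefulShutdown` class and the `drain` method are hypothetical; `shutdown`, `awaitTermination` and `shutdownNow` are standard JDK calls.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only.
class GracefulShutdown {
    /**
     * Stops accepting new tasks and lets running ones finish.
     * Returns true if everything completed within graceMillis;
     * otherwise interrupts the remaining tasks and returns false.
     */
    static boolean drain(ExecutorService workers, long graceMillis) throws InterruptedException {
        workers.shutdown(); // no new work, in-flight work continues
        if (workers.awaitTermination(graceMillis, TimeUnit.MILLISECONDS)) {
            return true;
        }
        workers.shutdownNow(); // grace period is over: interrupt what is left
        return false;
    }
}
```

The grace period is a trade-off: too short and transactions get cut off, too long and a restart takes forever.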
thread dumps – prepare scripts for that; during an incident, time is really precious (SLAs).
recovery oriented computing – be prepared to restart only part of the system, restarting everything is time consuming.
transparency – be able to monitor everything.
monitoring policy, alerts – should not be defined by the service, configure the policies outside (perhaps in a central place).
log format – should be human readable, humans are the best at pattern matching; use tabs and fixed-width columns.
CIM – superior to SNMP.
SSL accelerator – what is it really???
OpsDB monitoring – measurements and expectations, end to end business process monitoring.
Node Identifiers – assign to teams in blocks.
Observe, Orient, Decide, Act – military methodology, somehow similar to Agile :–)
review – periodically review tickets, stack traces in log files, volume of problems, data volumes, query statistics.
DB migration – expansion phase for incompatible schema changes.