Internet Reliability and Systems

A more technical post, but simplified for the layperson. It was inspired by an issue we saw in New Zealand last week (28/4/14), where we got reports of fairly widespread internet issues preventing access to local systems. It was very localised - we actually had no customer complaints - but we did send out an email to all affected clients and partners. This post seeks to show and explain what can happen and why.
We will keep this as simple as we can - so don’t be put off. If you have ever wondered why things on the internet go down or go slow, or what some of the problems can be, hopefully you will find this an accessible and informative read.

We have split this post into the following topic areas:
  • Internet Traffic
  • Malware and Internet attacks
  • System Capacity
  • System Maintenance and Downtime

Internet Traffic

The internet clearly gets lots of traffic. It used to be called the ‘Information Superhighway’, and like a traditional road highway it also suffers from traffic jams and rush-hour peaks. Its growth has been phenomenal too, of course, and this also creates problems; sometimes local infrastructure struggles to cope until more equipment and facilities are added. The telecoms companies do their best, but keeping up with this fast-growing demand is not easy. Most regions will see performance decreases and outages from time to time when demand exceeds the capacity of the network to supply.

Network failures can also seriously affect local services. Much of the internet was originally designed as a communications network for use by the military. It is thus designed to re-route traffic automatically when bottlenecks occur (i.e. in military terms, to re-route when exchanges got blown up in war - nice?!). This can mean that traffic flows are a bit unpredictable: a stoppage in one area can mean information is re-routed onto otherwise quieter routes, overloading them too. This all happens at very high speed, of course, but the good news is that it generally also gets sorted fairly fast.
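
For the curious, you can actually watch the route your own traffic takes. Most computers ship with a traceroute tool, and the little sketch below (using example.com as a stand-in for any site you care about) simply runs it from Python:

```python
# A minimal sketch: run the system's built-in traceroute tool to see
# the chain of routers your traffic passes through on its way to a site.
import platform
import subprocess

host = "example.com"  # stand-in - use any site you like

# Windows calls the tool "tracert"; most other systems call it "traceroute".
tool = "tracert" if platform.system() == "Windows" else "traceroute"

# Each line of output is one "hop" - a router along the path. When the
# internet re-routes around a problem, this path can change.
subprocess.run([tool, host])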

Internet Geography

If you want a picture of what is happening on the internet, see the screenshot below: a snapshot from a typical global monitoring service showing its view of how the internet is currently performing.

[Screenshot: global monitoring service showing internet reliability scores by region]

Like all such monitors, it’s a rough sample, and the scores can vary depending on when they are taken and what’s going on locally. They show relative reliability, scored out of 100 and summarised by region. This blog post was prompted by internet issues in Auckland, New Zealand just over a week ago. Taking that as an example, the regional Australian average at the time of writing was 89, while the two servers reporting for New Zealand on this monitoring service were averaging only 84. So within any region there can be quite a variance in performance. And, following our example, the New Zealand average was considerably below even that level during the period of fairly widespread slowdown and unavailability in Auckland (please note: such issues can and do happen almost anywhere; it is not a particularly New Zealand problem). Local provision of the internet varies enormously, but even the most developed countries with extensive broadband will still have ad-hoc local performance issues. Hardware can fail and systems can still get overloaded. This all shows up when you suddenly can’t access some, or even all, of your usual websites and services.
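
As a very rough illustration of how such a score might be arrived at - this is a sketch of the general idea, not how any particular monitoring service actually works - you can probe a host repeatedly and score the fraction of replies:

```python
# A rough illustration of deriving a reliability score: probe a host
# several times and score the percentage of successful replies.
# (Real monitoring services are far more sophisticated than this.)
import platform
import subprocess

def availability_score(host, probes=10):
    # The ping count flag is "-n" on Windows and "-c" elsewhere.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    successes = 0
    for _ in range(probes):
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            capture_output=True,  # discard the chatty output
        )
        if result.returncode == 0:  # 0 means the host replied
            successes += 1
    return 100 * successes / probes

print(availability_score("example.com"))  # e.g. 100.0 on a good day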


Malware and Internet attacks

This is often portrayed by Hollywood in the movies. The reality, though not as exciting, is very real and it does regularly happen. Malware, short for ‘malicious software’, is any software used to disrupt computer operation, gather sensitive information, or gain access to private computer systems. As well as trying to access data, it can, either as a byproduct or by design, slow down, overload and even shut down whole systems.

Keeping data secure is clearly very important. As an example, we secure data on our systems using SSL encryption and other methods to ensure no actual data is compromised by an outside source. Recently a programming error referred to as the Heartbleed Bug was discovered: a flaw in OpenSSL, a popular software library used by many web servers to encrypt data over the Internet. We had it very promptly patched on our systems (clients were emailed on 14/4/14) and thus, to the best of our knowledge, we had no data exposed. However, it does illustrate that protecting data is a continual battle, and that more and more software and systems are involved in managing data and general data flows. Each time a server is patched, it can run slow or even require some downtime, making sites and facilities unavailable. And it’s not just the server with your own system on it: your system can be affected by the other servers out on the net that route your information from your servers to your computer screen over internet communication lines.
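
For the more technically inclined: you can check for yourself that a site is serving a properly encrypted connection. A minimal sketch using Python’s standard library, with example.com standing in for any site:

```python
# A minimal sketch: open an encrypted (TLS/SSL) connection to a site and
# print its certificate details. If the certificate is invalid or the
# encryption handshake fails, this raises an error instead.
import socket
import ssl

host = "example.com"  # stand-in for any site
context = ssl.create_default_context()  # uses sensible, modern defaults

with socket.create_connection((host, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=host) as tls:
        print("Encryption in use:", tls.version())   # e.g. "TLSv1.3"
        cert = tls.getpeercert()
        print("Certificate expires:", cert["notAfter"])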

DDoS attacks

Distributed Denial-of-Service (DDoS) attacks have a history dating back to the "Smurf" attack in 1997 (named after the file smurf.c that contained its source code, not the small blue mythical people in the movie). One common variant is the DNS Amplification attack. These days such attacks are stronger and much more common, slowing down internet access and causing significant damage to the intended target server(s). A typical method involves saturating the target webserver(s) with external communications requests, so much so that they cannot respond to legitimate traffic, or respond so slowly as to be rendered essentially unavailable.
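
Defences against this sort of flood vary, but one common building block is rate limiting - refusing clients that ask too often, so legitimate traffic still gets through. A deliberately simplified sketch of the idea (the window and limit figures here are arbitrary assumptions):

```python
# A simplified sketch of rate limiting, one common defence against
# request floods: each client gets an allowance of requests per time
# window, and anything beyond that is refused rather than served.
import time
from collections import defaultdict

WINDOW_SECONDS = 10   # arbitrary illustrative figures
MAX_REQUESTS = 5      # per client, per window

request_log = defaultdict(list)  # client address -> recent request times

def allow_request(client):
    now = time.monotonic()
    # Keep only requests from this client inside the current window.
    recent = [t for t in request_log[client] if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_REQUESTS:
        request_log[client] = recent
        return False  # too many requests - refuse, protecting the server
    recent.append(now)
    request_log[client] = recent
    return True

# A legitimate visitor sails through; a flood quickly gets cut off.
for i in range(8):
    print(i, allow_request("203.0.113.5"))  # True five times, then False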

Malware and DDoS are just two of the issues that affect the use of data over the internet, and the result is having to use additional software, processes and systems to keep everything protected. This all has a cost, of course, as it can slow things down. Part of the issue that caused the outage in Auckland last week was attributable to a widespread DDoS attack on several data centres in Auckland (as reported to us at the time). This caused massive traffic spikes and some servers to overload. Fortunately these things get sorted: we have certainly seen no lasting damage, and usually things settle down fairly quickly, often getting fixed the same day. When you think of all that is going on, downtime and outages are really very short-lived.


System Capacity

How a system is designed and structured can materially affect how fast it is and how it can scale up to handle extra usage. Faster sites mean that visitors spend less time waiting for pages to load, and they will tend to look around the site and use it more. A question we get asked is: what happens if we get lots of people looking at a site all at once - will it cope?

We generally like this question, as it’s an area our systems and capabilities do very well in. The general term for the number of users a system can manage at the same time is ‘concurrency’. Some of our systems can get very large, and we have deployments that handle really very large numbers of users (e.g. tens of thousands of candidate enquiries). We thus design our systems so that they can handle many people using them at the same time, although clearly this is not limitless, however well we manage things. By way of a simplified explanation, we split our systems so that the dynamic data elements are handled by a separate, specialised application database server which does all the work of displaying fields, records and so on. The main website can thus be thought of as largely a static container. This means that, from a purely website perspective, we can serve large numbers of users from shared managed webservers, keeping performance up and handling very large volumes, while the database servers can be scaled up and down to match traffic.
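
To make that split concrete, here is a deliberately simplified sketch of the general pattern (not our actual setup): one handler serves the same static page shell to everyone cheaply, while only requests for data do real work against a stand-in for the separate database server.

```python
# A much-simplified sketch of splitting static from dynamic content:
# the web server hands out the same static page to everyone (cheap),
# while dynamic data comes from a separate, specialised service.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

STATIC_PAGE = b"<html><body>The page shell - same for every visitor</body></html>"

def fetch_from_database_server(query):
    # Stand-in for a call to the separate application database server.
    return {"records": ["..."], "query": query}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith("/data"):
            # Dynamic: only this path does real per-user work.
            body = json.dumps(fetch_from_database_server(self.path)).encode()
            content_type = "application/json"
        else:
            # Static: served straight from memory, so it scales cheaply.
            body = STATIC_PAGE
            content_type = "text/html"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("localhost", 8000), Handler).serve_forever()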

As to how many users a system can handle: there is really no single number - it all depends. By way of explanation, in typical use a user will visit the site and then read what is on the page. It is only when the webserver is loading a page, or the database server is rendering its data, that the system is doing any actual work; most of the time the system is simply waiting for the next instruction from the user. Since concurrency means multiple users at the same time, for any given number of users on the site the system can be operating at fairly low capacity or running full tilt, depending on the pattern of usage. For our larger implementations, or where site usage grows, we provision the system with separate webservers and database capacity, and we can scale up the IT resources as the system grows. It is thus very unlikely indeed that you will find any of our systems affected by an overload situation.
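
To put some rough numbers on that - the figures below are purely illustrative assumptions, not measurements of any real system - a thousand people "on" a site can translate into only a handful of requests actually being worked on at any instant:

```python
# Illustrative arithmetic only - the numbers are assumptions, not
# measurements. Most of the time users are reading, not clicking, so
# the work in progress at once is far below the number of visitors.
users_on_site = 1000
seconds_between_clicks = 30   # assumed average "think time" per user
seconds_per_request = 0.2     # assumed server work per request

requests_per_second = users_on_site / seconds_between_clicks
busy_requests_at_once = requests_per_second * seconds_per_request

print(round(requests_per_second, 1))    # ~33.3 requests arriving per second
print(round(busy_requests_at_once, 1))  # ~6.7 requests actually in progress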


System Maintenance and Downtime

We closely monitor the service at each of the datacentres we use (UK, Germany, Singapore, USA and NZ). We have monitoring tools that continually check performance, and we get notified very quickly if there is any outage. However, our servers, like all others, get upgraded periodically and require some maintenance every now and then.
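
At its heart, this kind of monitoring is conceptually simple. A minimal sketch of the general idea (real monitoring tools add alerting, dashboards and history on top; example.com stands in for a monitored site):

```python
# The essence of an uptime monitor: fetch the site regularly and raise
# the alarm if it stops responding.
import time
import urllib.request

SITE = "https://example.com"  # stand-in for the site being watched
CHECK_EVERY = 60              # seconds between checks

while True:
    try:
        with urllib.request.urlopen(SITE, timeout=10) as response:
            print("OK:", SITE, "answered with status", response.status)
    except Exception as error:
        # In a real system this would notify the on-call engineer.
        print("ALERT:", SITE, "is not responding:", error)
    time.sleep(CHECK_EVERY)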

In general we are able to keep downtime from any outage very minimal - likely in the order of only a few minutes. Sometimes servers have to be upgraded or “patched” very quickly, and we don’t issue advance notice in such instances. Where we know of a planned maintenance activity that will result in material downtime of longer than a few minutes during core business hours, we inform our users in advance - but this is also rare. In practical terms it is most unlikely users will encounter server outages, and on the rare occasions they do occur during a working day they will normally last only a few minutes.

Much more likely is users finding they can’t get access to our sites, some sites, or even all sites over the internet. It does not happen often, but we reckon that in around half the cases where we get a call about problems accessing one of our systems, the users concerned are having more general internet problems and can’t access even sites like Google. (Note: we always ask if you can access Google. If you can’t, that very likely means your own internet access is temporarily down or degraded for one of the reasons given above.) In the other half of enquiries we generally find that the user, whilst able to access Google, is finding it loads very slowly, or is having trouble with other local sites. Generally their ISP will confirm there has been a broader issue (as noted above, we know if one of our sites is really down, as our monitoring service will reliably advise us of such).


Finally - Advice on what to do if you can’t access a site

We have hopefully outlined, in broadly layman’s terms, some of the reasons why you can have problems accessing resources over the internet. Our best advice is to try again in a few minutes and check that you can access some widely known sites like Google. If you can’t access them, then you should speak to your IT department or your internet provider, as it’s almost certainly a broader issue and not one particular to accessing any of our systems. If you can access other sites but not one of ours, do contact us - we have tools and systems that monitor outages and we can very quickly advise you of the current situation. Happily, we can report that for some 5 years now we have consistently delivered in excess of 99.9% uptime across all our sites.
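
For completeness, that checklist can even be automated. A small sketch of the same logic, with example.com standing in for the site you were actually trying to reach:

```python
# Automates the advice above: if a well-known site is unreachable too,
# the problem is almost certainly your own internet connection, not the
# site you were trying to visit.
import urllib.request

def reachable(url):
    try:
        urllib.request.urlopen(url, timeout=10)
        return True
    except Exception:
        return False

KNOWN_GOOD = "https://www.google.com"
TARGET = "https://example.com"  # stand-in for the site you wanted

if not reachable(KNOWN_GOOD):
    print("Can't reach Google either - suspect your own internet access.")
elif not reachable(TARGET):
    print("Google works but the site doesn't - contact the site's support.")
else:
    print("Both respond - the earlier problem may have already cleared.")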