Strings, lists & gatherers

Problem statement: You need to send a dynamically constructed stream of bytes over some I/O channel.

An illustrative example

As always, I shall take an example from the internet space, and more specifically from HTML. Let us say that you have a classical two-dimensional array whose dimensions are not known at compile time. The contents of this array happen to be strings. This 2D array needs to be printed in the form of a valid HTML table. To make things more interesting, each cell in the table may be given a CSS class based on the content of the cell. The output could look as follows:
<table>
<tr>
<td class="ve">one</td>
<td class="ve">two</td>
<td class="ve">three</td>
</tr>
<tr>
<td>four</td>
<td class="ve">five</td>
<td>six</td>
</tr>
<tr>
<td>seven</td>
<td>eight</td>
<td class="ve">nine</td>
</tr>
</table>

The rule for having the optional CSS class is left as a puzzle for the bored reader.

A naïve approach

The most straightforward technique is to keep writing one chunk at a time, peeking into the array as necessary. In reality, very few people do it, since everyone is taught that unbuffered I/O is expensive. The problem originates from the fact that in most mainstream operating systems only the kernel is allowed to do actual I/O, and hence a system call needs to be made. The cost of switching from user space to kernel space and back is considered to be high.

The “fix”

Hence, the strategy is to accumulate a fair number of the bytes that need to be transmitted and then make fewer system calls. I/O libraries in programming languages usually provide transparent APIs where this accumulation happens. This usually comes with the burden of expecting the programmer to indicate the end of the stream so that any leftover bytes can be flushed by making one last syscall.
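To make the difference concrete, here is a minimal Python sketch that counts how many times the underlying "device" is written to, with and without buffering. The CountingRaw class and the write_cells function are hypothetical names invented for this illustration; CountingRaw merely stands in for a real file descriptor, where each write would be one syscall.

```python
import io

class CountingRaw(io.RawIOBase):
    """Hypothetical stand-in for a file descriptor: counts write calls."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.data = bytearray()

    def writable(self):
        return True

    def write(self, b):
        # On a real descriptor, each call here would be one syscall.
        self.calls += 1
        self.data += b
        return len(b)

def write_cells(cells, buffered):
    """Write each cell as a <td> element; return (write calls, bytes written)."""
    raw = CountingRaw()
    out = io.BufferedWriter(raw, buffer_size=8192) if buffered else raw
    for cell in cells:
        out.write(b"<td>")
        out.write(cell.encode("utf-8"))
        out.write(b"</td>\n")
    out.flush()  # the "one last syscall" that pushes out leftover bytes
    return raw.calls, bytes(raw.data)
```

With three short cells, the unbuffered variant hits the raw stream once per chunk, while the buffered variant reaches it only at the final flush. Forgetting that flush is the classic way to lose the tail of the output.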

Memory copy

One of the sources of overhead (not the largest) in making a syscall arises from having to copy data from user space to kernel space. The same kind of data copy happens ever so often in user space when strings are concatenated. Whether the user program concatenates directly or the buffering implementation does so in the end, a cost comparable to part of the syscall overhead is incurred. The plausible reason is that the most common form of output API that programmers are exposed to takes a single stream/string.

Fewer copies

A far less well-known technique called scatter/gather I/O exists that sort of addresses this problem. The gather operation is used for writing output in a single shot from multiple input byte buffers. This API (in both POSIX and Windows) accepts an array of buffers over which it sequentially iterates and writes the output. The problem is now reduced to having an array with pointers/references to all the buffers. If you have come down to having to operate at this level, chances are you might not want to deal with magically expanding arrays. Your option at that point would be to use a linked list to accumulate all the references to the buffers and then turn it into an array just before performing the write operation.
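A minimal sketch of the gather approach, applied to the HTML table from the example, could look like this in Python. It uses os.writev, Python's wrapper around the POSIX writev call (available on Unix-like systems only). The names table_buffers, gather_write and css_class_for, as well as the batch size of 64, are assumptions made for this illustration. A Python list is itself a growable array, so the linked-list step described above is not needed here; in C it would be.

```python
import os

def table_buffers(rows, css_class_for):
    """Collect references to byte buffers; no payload is concatenated here.

    css_class_for is a hypothetical caller-supplied rule that returns the
    class name as bytes, or None when the cell gets no class.
    """
    bufs = [b"<table>\n"]
    for row in rows:
        bufs.append(b"<tr>\n")
        for cell in row:
            cls = css_class_for(cell)
            if cls is None:
                bufs.append(b"<td>")
            else:
                bufs += [b'<td class="', cls, b'">']
            bufs += [cell, b"</td>\n"]
        bufs.append(b"</tr>\n")
    bufs.append(b"</table>\n")
    return bufs

def gather_write(fd, bufs, batch=64):
    """Write every buffer to fd using writev; return the number of calls made.

    batch is an assumed, conservative cap on buffers per call; the real
    system limit (IOV_MAX) is platform specific.
    """
    pending = [memoryview(b) for b in bufs if len(b)]
    calls = 0
    i = 0
    while i < len(pending):
        written = os.writev(fd, pending[i:i + batch])
        calls += 1
        # Skip buffers that went out completely; trim one that went out partly.
        while written > 0:
            n = len(pending[i])
            if written >= n:
                written -= n
                i += 1
            else:
                pending[i] = pending[i][written:]
                written = 0
    return calls
```

The glue bytes such as b"<td>" are shared constants and the cell payloads are passed by reference, so the only copy of the data is the one the kernel makes during the write itself.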

But why all this you ask

… because unless you are coding at what are considered fairly low levels these days (such as POSIX and C), you do not realize how many times you are ripping up and recreating little byte streams in multiple layers of code using both direct and indirect constructs. This problem becomes very evident when trying to generate output in formats such as XML or JSON, where there is a mix of a lot of what I call gluing bytes sprinkled liberally between the actual payload. Given an arbitrarily nested variable in a loosely typed language such as Perl, PHP, Python or JavaScript, I am wondering what is the most elegant way to arrive at a representation like JSON and perform an output operation without mindless string concatenation. Thoughts are welcome.
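One possible direction, offered as a sketch rather than an answer: Python's standard json module has JSONEncoder.iterencode, which yields the encoded document as a sequence of small string pieces (glue and payload alike) instead of returning one large string. The stream_json function below is a hypothetical helper built on it. It still encodes each piece to bytes, so it does not remove every copy; it only avoids assembling the whole document in memory first.

```python
import io
import json

def stream_json(obj, out):
    """Write obj as JSON to a binary stream piece by piece.

    Returns the number of pieces written. The whole document is never
    held as a single string.
    """
    pieces = 0
    for piece in json.JSONEncoder().iterencode(obj):
        out.write(piece.encode("utf-8"))
        pieces += 1
    return pieces
```

Passing a buffered stream as out keeps the number of syscalls down, and the encoded pieces could just as well be collected into a list and handed to a gather write.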

Why you can’t always just throw more hardware at it

A long time ago, people used to worry about the efficiency of the software they wrote. Then came a time when processors just kept getting faster every month, and the pace wouldn’t slow down even after crossing the 500MHz mark. Somewhere around this time, people started writing exceptionally bloated software and the bloat started to grow at a phenomenal pace. Then came the new catchphrase: “hardware is cheap, we can throw more hardware at it”. And in one magic swoop, all bloatware became perfectly acceptable since the bloat now seemed to be affordable. This was precisely the point at which most people forgot their CS fundamentals. If you have done a course on CPU scheduling, you would know these metrics:

  1. CPU utilisation
  2. Throughput
  3. Turnaround time
  4. Waiting time
  5. Response time

I will take the web application space as an example for the remainder of the discussion, since it has a fairly large development community and also because it is littered with the bloatware and “hardware is cheap” mentality. In web applications, the consumer is usually worried about response times and turnaround times. Let us say there is a solution A that takes a full second for the server to process a single web request and a solution B that takes 50 milliseconds to process a single web request. A very misplaced number that people chase is requests/second, and this is solved using the now infamous “throw more hardware” approach. Given enough machines, A can match B’s requests/second, yet each user of A still waits a full second. A focus on throughput works in businesses when your consumers have nowhere else to go and your notion of increasing business is increasing volumes. You don’t hear of people switching banks because of how fast (or slow) their websites load, and the reason is that the main product offering is the banking service and not the website, i.e. you would worry more about interest rates than about website response times. Businesses whose primary offering is the website itself cannot take such liberties.

Turnaround time

Turnaround time is the total time taken to service a request. So, if you have a slow-running web page, you can keep adding more hardware to take on more volume (assuming the solution can be scaled out infinitely) but the experience of each individual user is not going to improve. Also, real world experience suggests that, left to itself, a system starts to slow down as you scale out. A knee-jerk fix is to do things in parallel and use threads. That also usually doesn’t get you too far, thanks to what a certain Amdahl had to say. This is where all those classes on algorithms and architecture, and the abstinence from bloatware, begin to make some difference.
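For reference, what Amdahl had to say can be written down in a few lines. The sketch below uses the standard form of Amdahl's law; the function and parameter names are my own choice for this illustration.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Overall speedup when only parallel_fraction of the work is spread over workers."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

def amdahl_limit(parallel_fraction):
    """Upper bound on the speedup as the number of workers grows without limit.

    Only meaningful for parallel_fraction < 1.
    """
    return 1.0 / (1.0 - parallel_fraction)
```

If 90% of the work in a request can run in parallel, no number of threads will make that request more than about 10 times faster, because the serial 10% remains.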

Response time

Response time is what is usually called time to first byte in the internet world. In trying to solve the turnaround time problem, one of the speedup areas that people work on is minimizing the context switches from user space to kernel space. Zero copy is one example of work in this area. The most common example, however, happens to be buffered files (or streams, if you are from the Java world). Some people (and their software creations) take this to the extreme and try to send out the entire HTTP response in one shot, hoping to minimize the number of system calls needed to get the job done. It turns out that this makes for a worse user experience. Put another way, it is better to start sending something to the user after 200 milliseconds (ms) and finish in the next 4 seconds than to start sending something 2 seconds after the request was issued and get done in the next 500 ms. In fact, this is a harder problem to solve, for two reasons:

  • Left to themselves, most web servers aren’t eager to push out smaller chunks of data (the easier problem to solve)
  • Dynamic pages, especially the ones generated by MVC frameworks, do not make the response available to the web server until they have fully constructed the response body. Some of these solutions offer no straightforward way to push out data in parts, while others have explicit mechanisms for achieving this effect.

For those of you who are still wondering why something that puts extra load on the server and takes longer to finish is considered better by the user, there are two reasons:

  • Psychological: Giving the user an early indication of some progress creates some incentive for the user to wait, unlike sending no information at all. Even getting the status bar to say “receiving from …” as opposed to “sending request to …” makes a difference.
  • Pipeline effect: An average web page has references to various resources (images, external CSS files, etc.) that are needed to completely render a page. It turns out that most browsers can initiate the retrieval of those resources before the page loads completely. Pushing out a partial response early on gives the browser a chance to get started with other things early on. So while the additional flushes done on the server side might have slowed down the turnaround time for the basic page transmission, the overall turnaround time as seen by the user can still drop with this technique.
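As a rough sketch of pushing out a partial response early, a page can be produced as a sequence of chunks, with the head (which carries the references to external resources) handed over before any slow work begins. The code below does this with a Python generator wrapped in a WSGI-style application. The names render_page, make_app and fetch_rows, and the /site.css URL, are all hypothetical. Whether the first chunk actually reaches the browser early still depends on the web server not holding it back.

```python
def render_page(fetch_rows):
    """Yield the page in pieces so the head can go out before the slow work."""
    yield (b"<html><head>"
           b'<link rel="stylesheet" href="/site.css">'
           b"</head><body><table>\n")
    # fetch_rows (the hypothetical slow part) runs only after the first
    # chunk has been handed to the caller.
    for row in fetch_rows():
        yield b"<tr>\n"
        for cell in row:
            yield b"<td>" + cell + b"</td>\n"
        yield b"</tr>\n"
    yield b"</table></body></html>\n"

def make_app(fetch_rows):
    """Wrap render_page as a WSGI application that returns an iterable body."""
    def app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
        return render_page(fetch_rows)
    return app
```

Because the body is an iterable rather than one finished string, the server can start transmitting the head while the rows are still being produced.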

Throughput

Since throughput signifies the total amount of work that gets done in a unit of time, it turns out that throwing more hardware at it can sort of solve this problem. As I mentioned earlier, if your solution scales infinitely, then the hardware addition technique works. The reasons why things do not scale infinitely are:

  • Some components end up being hard to scale infinitely, such as the top-level load balancer and the pipes that it is connected to
  • Amdahl’s law

One of the most common fixes, and one that is borderline superstition, is to run more threads. In a CPU-bound world, having more threads of execution than the number of compute units slows things down. In a NUMA-based world, certain workloads can suffer even when the number of threads matches the number of compute units available. However, for workloads that are I/O bound, threads do help as long as the different threads are not contending for the same underlying I/O resource. The one exception is rotating storage media, where the amortized performance increases as concurrent requests increase, but only up to a certain point.

In effect, the reasons why throughput stops increasing beyond a certain point, whether you raise the concurrency level of task execution or throw in more hardware, are very real.

Closing remarks

We are now in an age where people believe not only that hardware is cheap but also in cloud computing that promises provisioning of infinite hardware (i.e. more than you can afford). The thing to remember is that you might extend the life of a given solution for quite some time by throwing in hardware (at a diminishing rate of return), but if you are chasing response times, you will have to constantly improve your design as opposed to relying on hardware.