The standard way of understanding the HTTP protocol is via the request-reply pattern. Each HTTP transaction consists of a finitely bounded HTTP request and a finitely bounded HTTP response.
However, it's also possible for both parts of an HTTP 1.1 transaction to stream possibly unbounded data. The advantage is that the sender can send data that exceeds its memory limit, and the receiver can act on the data stream in chunks immediately instead of waiting for all of the data to arrive. Basically you're either saving space or you're saving time. The advantages of streaming are elaborated in Wikipedia's Online algorithm article.
Note that HTTP streaming only involves the HTTP protocol and not websockets. Streaming is also the basis for HTML5 server-sent events.
So we're going to look at HTTP streaming architecture, and how to achieve streaming in a few different languages.
The first thing to understand is that HTTP streaming involves streaming within a single HTTP transaction. In a larger context, each HTTP transaction itself represents an event as part of a larger event stream. This reveals that "streaming" is a context-specific concept: it's relative to what we consider the "stream" to be.
Firstly we have to consider the HTTP headers that support streaming. Open https://en.wikipedia.org/wiki/List_of_HTTP_header_fields for reference:
The Content-Length header determines the byte length of the request/response body. If you neglect to specify the Content-Length header, HTTP servers will implicitly add a Transfer-Encoding: chunked header. The Content-Length and Transfer-Encoding headers should not be used together. With chunked encoding, the receiver has no idea what the total length of the body is and cannot estimate the download completion time. If you do add a Content-Length header, make sure it matches the entire body in bytes. If it is incorrect, the behaviour of receivers is undefined.
The Content-Length header will not allow streaming, but it is useful for large binary files, where you want to support partial content serving. This basically means resumable downloads, paused downloads, partial downloads, and multi-homed downloads. This requires the use of an additional header called Range. This technique is called byte serving.
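As a rough illustration of byte serving, here is a minimal Python sketch of how a server might answer a single-range request. The function name serve_range and its return shape are made up for this example, and it only handles one range per request; a real server also deals with multiple ranges and conditional headers.

```python
import re

def serve_range(body: bytes, range_header: str):
    """Return (status, headers, payload) for one 'bytes=start-end' range."""
    match = re.fullmatch(r"bytes=(\d*)-(\d*)", range_header.strip())
    if not match or (match.group(1) == "" and match.group(2) == ""):
        # Unparseable (or multi-range) header: fall back to the full body.
        return 200, {"Content-Length": str(len(body))}, body

    first, last = match.groups()
    total = len(body)
    if first == "":
        # Suffix form "bytes=-N" asks for the last N bytes.
        start = max(total - int(last), 0)
        end = total - 1
    else:
        start = int(first)
        end = min(int(last), total - 1) if last else total - 1

    if start >= total or start > end:
        # Range cannot be satisfied.
        return 416, {"Content-Range": f"bytes */{total}"}, b""

    payload = body[start:end + 1]  # Range positions are inclusive
    headers = {
        "Accept-Ranges": "bytes",
        "Content-Range": f"bytes {start}-{end}/{total}",
        "Content-Length": str(len(payload)),
    }
    return 206, headers, payload
```

A client that lost its connection after 7 bytes of a 10 byte file could resume with Range: bytes=7- and would receive only the remaining 3 bytes with a 206 status.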
The use of Transfer-Encoding: chunked is what allows streaming within a single request or response. This means that the data is transmitted in a chunked manner, and this does not impact the representation of the content.
Officially an HTTP client is meant to send a request with a TE header field that specifies what kinds of transfer encodings the client is willing to accept. This is not always sent, but most servers assume that clients can process chunked encodings.
The chunked transfer encoding makes better use of persistent TCP connections, which HTTP 1.1 uses by default.
Chunked data is represented in this manner:
4\r\n
Wiki\r\n
5\r\n
pedia\r\n
e\r\n
 in\r\n\r\nchunks.\r\n
0\r\n
\r\n
Each chunk starts with its byte length expressed as a hexadecimal number, followed by optional parameters (chunk extensions) and a terminating CRLF sequence, followed by the chunk data and another CRLF. The final chunk has a length of zero and is followed by a closing CRLF sequence.
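To make the framing concrete, here is a small Python sketch of an encoder and decoder for this format. The function names are made up for illustration, and the decoder deliberately ignores trailers and discards chunk extensions, so treat it as a teaching aid rather than a complete parser.

```python
def encode_chunked(pieces):
    """Frame an iterable of byte strings as an HTTP/1.1 chunked body."""
    out = bytearray()
    for piece in pieces:
        if not piece:
            continue  # a zero-length chunk would end the body early
        # hex length, CRLF, data, CRLF
        out += b"%x\r\n" % len(piece) + piece + b"\r\n"
    out += b"0\r\n\r\n"  # zero-length last chunk, no trailers
    return bytes(out)

def decode_chunked(data):
    """Split a chunked body back into the list of chunk payloads."""
    pieces, pos = [], 0
    while True:
        line_end = data.index(b"\r\n", pos)
        # Drop any ";extension" parameters after the hex size.
        size = int(data[pos:line_end].split(b";")[0], 16)
        if size == 0:
            return pieces
        start = line_end + 2
        pieces.append(data[start:start + size])
        pos = start + size + 2  # skip the CRLF that follows the data
```

Note that the payload itself may contain CRLF sequences. The decoder relies on the declared byte length, not on searching for line endings inside the data.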
Chunk extensions can be used to indicate a message digest or an estimated progress. They are just custom metadata that your layer 7 receiver needs to parse. There's no standardised format for them. Because of this, it's probably better to just add your metadata (if any) into the chunk itself for your layer 7.5 application to parse.
For your application to send out chunked data, you must first send out the Transfer-Encoding header, and then you must flush content in chunks according to the chunk format. If you don't have an appropriate HTTP server that handles this, then you need to implement the syntax generator yourself. Sometimes you can use a library to provide an abstract interface.
For example, in PHP there's the Symfony HTTP Foundation StreamedResponse, and in NodeJS the native HTTP module chunks responses by default.
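For comparison, here is a sketch of what implementing the chunk syntax by hand can look like, using only Python's standard library. The handler class, the helper fetch_streamed_body and the sample payload are all invented for this example. Python's http.server does not chunk for you, so the handler writes the framing itself, while http.client on the other side decodes it transparently.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class StreamingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # chunked encoding requires HTTP/1.1

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        for piece in (b"Wiki", b"pedia", b" in chunks."):
            # hex length, CRLF, data, CRLF
            self.wfile.write(b"%x\r\n%s\r\n" % (len(piece), piece))
            self.wfile.flush()
        self.wfile.write(b"0\r\n\r\n")  # terminating zero-length chunk

    def log_message(self, *args):
        pass  # keep the example quiet

def fetch_streamed_body():
    """Start a throwaway server, fetch '/', return (encoding, body)."""
    server = HTTPServer(("127.0.0.1", 0), StreamingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
        conn.request("GET", "/")
        response = conn.getresponse()
        encoding = response.getheader("Transfer-Encoding")
        body = response.read()  # http.client strips the chunk framing
        conn.close()
        return encoding, body
    finally:
        server.shutdown()
        server.server_close()
```

Calling fetch_streamed_body() binds to a random free local port, so it assumes the machine allows loopback connections.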
Chunking is a two-way street. The HTTP protocol allows the client to chunk HTTP requests, which lets the client stream the request body. This is useful for uploading large files. However, not many servers (except NGINX) support this feature, and most streaming upload implementations rely on Javascript libraries to cut up a binary file and send it in chunks to the server. Using Javascript gives you more control over the uploading experience, but relying on the HTTP protocol alone would be the simplest.
Browsers natively support chunked data. So if your server sends chunked data, they will start rendering data as soon as they receive it. However, browsers need to receive a minimum amount of data to fill a buffer before they start rendering. The limit is different for each browser, but generally it's 1KB. You can see the limits for various browsers here: http://stackoverflow.com/a/16909228/582917
If however you want to consume an API that supports streaming, you need to be aware of how your HTTP library handles chunked data. In most cases, you'll need to attach a callback handler that executes upon each chunk of data. This means that the API needs to frame each chunk in a useful manner. If the API is emitting too many small chunks, you may end up needing to buffer the data up into a "semantic protocol data unit" (PDU) before you can work on it. This of course defeats the purpose of chunking in the first place. For example in PHP, you can use the Guzzle library or curl.
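The buffering into a semantic PDU mentioned above can be sketched in a few lines of Python. This assumes, purely for illustration, that the API emits newline-delimited records and that chunk boundaries fall at arbitrary positions. The name iter_records is hypothetical.

```python
def iter_records(chunks, delimiter=b"\n"):
    """Reassemble delimiter-separated records from arbitrarily split chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while True:
            record, sep, rest = buffer.partition(delimiter)
            if not sep:
                break  # no complete record yet, wait for more data
            yield record
            buffer = rest
    if buffer:
        yield buffer  # trailing record without a final delimiter
```

In a callback-based HTTP library, the body of the for loop would live inside the per-chunk callback, with the buffer kept between calls.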
In considering performance, you want to make sure that you're not producing data that is way too chunky. The more "chunking" you do, the more overhead there is in both producing and parsing the chunks. Furthermore, it also results in more executions of buffering functions if the receiver can't make immediate use of the chunks. Chunking isn't always the right answer, as it adds extra complexity for the recipient. So if you're sending small units of things that won't gain much from streaming, don't bother with it!
Do note that byte serving is compatible with chunked encoding. This is applicable where you know the total content length and want to allow partial or resumable downloads, but you also want to stream each partial response to the client.
It is also possible to compress chunked or non-chunked data. In practice this is done via the Content-Encoding header.
Note that the Content-Length is equal to the length of the body after the Content-Encoding has been applied. This means if you have gzipped your response, then the length calculation happens after compression. You will need to be able to load the entire body in memory if you want to calculate the length (unless you have that information elsewhere).
When streaming using chunked encoding, the compression algorithm must also support online processing. Thankfully, gzip supports stream compression. The content gets compressed first, and is then cut up into chunks. That way, the chunks are received and reassembled, then decompressed to acquire the real content. If it were the other way around, you would get a compressed stream, and decompressing it would give you chunks, which doesn't make sense.
A typical compressed stream response may have these headers:
Content-Type: text/html
Content-Encoding: gzip
Transfer-Encoding: chunked
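The "compress first, then chunk" ordering can be sketched with Python's zlib module. The function names here are invented for the example. Using a sync flush after every piece is one possible trade-off: it lets the receiver decompress each chunk as it arrives, at the cost of a slightly worse compression ratio.

```python
import gzip
import zlib

def gzip_then_chunk(pieces):
    """Yield chunked-encoding frames carrying one continuous gzip stream."""
    compressor = zlib.compressobj(wbits=31)  # wbits=31 selects the gzip format
    for piece in pieces:
        # Z_SYNC_FLUSH emits everything compressed so far.
        compressed = compressor.compress(piece) + compressor.flush(zlib.Z_SYNC_FLUSH)
        if compressed:
            yield b"%x\r\n%s\r\n" % (len(compressed), compressed)
    tail = compressor.flush()  # finish the stream and write the gzip trailer
    if tail:
        yield b"%x\r\n%s\r\n" % (len(tail), tail)
    yield b"0\r\n\r\n"

def unchunk(body):
    """Remove chunk framing (no extensions or trailers expected)."""
    out, pos = b"", 0
    while True:
        eol = body.index(b"\r\n", pos)
        size = int(body[pos:eol], 16)
        if size == 0:
            return out
        out += body[eol + 2:eol + 2 + size]
        pos = eol + 2 + size + 2
```

The receiver reverses the order: it first removes the chunk framing with unchunk, then passes the result to gzip.decompress to recover the original content.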
Semantically the usage of Content-Encoding indicates an "end to end" encoding scheme, which means only the final client or final server is supposed to decode the content. Proxies in the middle are not supposed to decode the content.
If you want to allow proxies in the middle to decode the content, the correct header to use is in fact the Transfer-Encoding header. If the HTTP request possessed a TE: gzip header (chunked is always acceptable and does not need to be listed), then it is legal to respond with Transfer-Encoding: gzip, chunked.
However this is very rarely supported. So you should only use Content-Encoding for your compression right now.
The biggest problem when implementing HTTP streaming is understanding the effect of buffering. Buffering is the practice of accumulating reads or writes into a temporary fixed memory space. The advantages of buffering include reducing read or write call overhead. For example instead of writing 1KB 4096 times, you can just write 4096KB at once. This means your program can create a write buffer holding 4096KB of temporary data (which can be aligned to the disk blocksize), and once the space limit is reached, the buffer is flushed to disk.
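The write buffer described above can be modelled in a few lines of Python. Both classes are illustrative only: CountingSink stands in for a disk or socket and simply counts how many write calls reach it.

```python
class CountingSink:
    """Stand-in for a disk or socket that counts write calls."""
    def __init__(self):
        self.calls = 0
        self.data = bytearray()

    def write(self, data):
        self.calls += 1
        self.data += data

class WriteBuffer:
    """Accumulate writes and flush to the sink once capacity is reached."""
    def __init__(self, sink, capacity):
        self.sink = sink
        self.capacity = capacity
        self.pending = bytearray()

    def write(self, data):
        self.pending += data
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        if self.pending:
            self.sink.write(bytes(self.pending))
            self.pending.clear()
```

With a capacity of 4096KB, 4096 writes of 1KB each result in a single call on the sink. The flip side for streaming is that nothing reaches the sink until the buffer fills or flush() is called explicitly.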
Typical HTTP architectures include these components:
Client <--> Proxy <--> HTTP Server <--> Application Server <--> Database Server
Each one of these components can possess adjustable and varied buffering styles and limits.
To correctly perform streaming, you have to know and adjust the buffering limits at each component.
For example, let's investigate a typical PHP stack such as:
Browser <--> Proxy <--> NGINX <--> PHP <--> MySQL
Firstly browsers have a rendering buffer limit. You must send at least as much data as the limit before the browser will render the content. Having chunks smaller than the buffer will just make the browser hold the data until either the buffer is full or the connection is closed (or after some time limit).
At the proxy level, this could be your ISP or some custom proxy. If the proxy buffers data, your streamed data from upstream will be stored in the proxy's buffer before being sent to the browser. Some mobile wireless ISPs will buffer things and you won't be able to control this behaviour. This is a violation of the end-to-end principle, and technically there's nothing you can do about it.
At the NGINX level, buffering is dependent upon the type of the upstream connection. There are 3 common connection types for HTTP: "proxy", "uwsgi", "fastcgi". If you want your NGINX server to respect streaming, you can either switch off buffering for your connection type, or match the buffer size with the upstream chunk size. Switching off buffering can be done using a buffering directive (proxy_buffering, uwsgi_buffering, fastcgi_buffering), or you can use a special header X-Accel-Buffering: no which tells NGINX to not buffer the response. The special header is more flexible, as it lets NGINX keep buffering responses that don't need streaming. It also works for all 3 connection types.
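As a sketch of the header approach, here is a minimal Python WSGI application that opts a single response out of NGINX buffering. The application and its event payload are invented for illustration. Whether each yielded item is flushed immediately depends on the WSGI server sitting behind NGINX, which is an assumption you would need to verify for your setup.

```python
def streaming_app(environ, start_response):
    """WSGI app that asks NGINX not to buffer this particular response."""
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        # Per-response opt-out; other responses stay buffered by NGINX.
        ("X-Accel-Buffering", "no"),
        # No Content-Length: the length is not known up front.
    ])

    def body():
        for i in range(3):
            yield f"event {i}\n".encode()

    return body()
```

Only the endpoints that actually stream need to send this header, which is what makes it more flexible than switching the buffering directive off globally.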
If you instead try to match the buffer size with the chunk size, you have to make sure that the number of buffers multiplied by the buffer size (equal to a system memory page) is equal to a single chunk size. If it is greater than a single chunk from upstream, your chunks will be accumulated before they are sent downstream. If it is less than the chunk size, NGINX will buffer to disk. You want to avoid this, as it results in extra overhead when streaming. For more information on buffer size see this gist.
Just a note on buffering optimisation: the larger the total buffer size, the greater the likelihood of each connection using more memory. This is because if each buffer is large, there's a chance that you may not be using the buffer efficiently, which can cause memory fragmentation. In the end, each buffer size should match the system memory page size. The number of buffers is what can be dynamically allocated. If your total buffer size across all connections exceeds your OS's memory limit, you're either going to meet an OOM error or start paging to disk. To maintain your NGINX's availability, you have to consider the theoretical number of connections that a single NGINX server can handle before it exhausts your server's memory limit.
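The capacity estimate above is simple arithmetic, sketched here in Python. The figures (1 GiB reserved for buffers, 8 buffers per connection, 4 KiB pages) are made-up assumptions, and the calculation ignores every other per-connection cost, so it only gives a rough upper bound.

```python
def max_buffered_connections(memory_bytes, buffers_per_connection, buffer_size=4096):
    """Upper bound on connections if each one fills all of its buffers."""
    per_connection = buffers_per_connection * buffer_size
    return memory_bytes // per_connection

# Hypothetical figures: 1 GiB for buffers, 8 buffers of one 4 KiB page each.
# Each connection can then hold up to 32 KiB, giving 32768 connections.
limit = max_buffered_connections(1024 ** 3, buffers_per_connection=8)
```

Doubling the number of buffers per connection halves the bound, which is why the number of buffers rather than the buffer size is the knob to tune.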
Be aware of the real chunk size after compression. If your upstream is compressing the content, the resulting chunk size will be different. In most cases, NGINX should be doing the compression, and it does support compressing each chunk that arrives from upstream. You just need gzip on. This means your application layer should not be compressing or chunking the content. It should just flush raw data. NGINX is smart enough to understand this and will automatically compress each piece of received upstream data, and then format it into chunks, which are then flushed downstream.
There's an advantage in keeping buffers available or having a larger buffer size than the chunk size. It comes from dealing with slow clients. NGINX as a reverse proxy is very fast and can read the response from your upstream application server very quickly. NGINX itself can deal with any slow browsers that have a slower read rate than your upstream's write rate. Because NGINX is very lightweight (asynchronous IO), the cost of holding a connection in NGINX is far smaller than holding open a process (that is waiting for the client to finish reading) in your application server. This is of course relative, as your application server might also be very lightweight, and rely on either green threads or asynchronous IO. This problem does reveal an interesting property of streaming systems: any stream will only be as quick as the slowest link (reader or writer) in the chain. This is related to the issue of network back pressure in distributed systems.
To take advantage of NGINX's ability to handle slow clients while still streaming data as fast as possible, there will need to be some tuning of both the buffer size and potentially the *_busy_buffers_size option. You cannot just increase the total buffer size, as that will just make NGINX wait until the buffer is full. What you need is some buffer size that is allocated only for slow clients. This has something to do with *_busy_buffers_size, but this is poorly documented currently, so I do not know how to make this work.
Here are 2 quotes about *_busy_buffers_size:
When buffering of responses from the * server is enabled, limits the total size of buffers that can be busy sending a response to the client while the response is not yet fully read. In the meantime, the rest of the buffers can be used for reading the response and, if needed, buffering part of the response to a temporary file. By default, size is limited by the size of two buffers set by the *_buffer_size and *_buffers directives.
- NGINX documentation
proxy_busy_buffers_size: This directive sets the maximum size of buffers that can be marked "client-ready" and thus busy. While a client can only read the data from one buffer at a time, buffers are placed in a queue to send to the client in bunches. This directive controls the size of the buffer space allowed to be in this state.
At the PHP level, global buffers can be set inside the php.ini configuration file. There are 3 options defined: output_buffering, output_handler and implicit_flush. They are explained in the output control section of the PHP documentation.
It is interesting to note that for CLI applications, output buffering is off by default. This is so that your CLI application can show you results as it's running. This buffer is controlled by the server application programming interface (SAPI). You can control it inside your application by calling flush(), which will flush the entire SAPI buffer.
During runtime, custom buffers can also be created using ob_start(). Once you have added content to the buffer, you can then flush your custom buffer using ob_flush(). This only flushes the buffer that you created using ob_start(). Think of ob_start() as a kind of PHP-specific manual memory management. You're basically asking for some block of memory (fixed or variable), which you can then only use for your output statements and functions: echo and print.
If you have entered both levels of buffers, you need to call the flush functions in this order: ob_flush(); flush();
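Why the order matters can be shown with a toy model of two stacked buffers, written here in Python. This is a conceptual sketch and not how PHP implements its buffers: the variable ob stands in for the ob_start() buffer, sapi for the SAPI buffer, and Wire for the connection to the client.

```python
class Wire:
    """Stand-in for the client connection."""
    def __init__(self):
        self.sent = []

    def write(self, data):
        self.sent.append(data)

class Buffer:
    """Holds writes until flushed into the next layer down."""
    def __init__(self, downstream):
        self.downstream = downstream
        self.pending = []

    def write(self, data):
        self.pending.append(data)

    def flush(self):
        for data in self.pending:
            self.downstream.write(data)
        self.pending.clear()

wire = Wire()
sapi = Buffer(wire)   # outer buffer, flushed by flush() in PHP
ob = Buffer(sapi)     # inner buffer, flushed by ob_flush() in PHP

ob.write("hello")
sapi.flush()          # wrong order: the data is still in the inner buffer
sent_after_wrong_order = list(wire.sent)

ob.flush()            # inner first ...
sapi.flush()          # ... then outer
sent_after_right_order = list(wire.sent)
```

Flushing the outer buffer first sends nothing, because the data has not yet moved out of the inner buffer.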
Both the global SAPI buffer and the custom application buffer have settings that enable automatic flushing. This can depend on hitting the buffer limit, or on some function call. Check the documentation for more.
Finally we reach the MySQL level. This can be replaced with any upstream data source that you are calling in order to prepare a response. By default all SQL queries are buffered. There are 2 options to achieve unbuffered queries (writes and reads). The first is the unbuffered query option. This allows you to read large result sets and to process each row as it arrives (including flushing to the client). The second option works with just one single column of data. This is useful where a single column contains large binary or textual content, and you want to be able to work with a stream on this data specifically. This involves the usage of the large object option. You can also stream-write large binary or textual content into the database using the large object option. Streaming writes of rows is just done by running multiple insert queries.
With regards to the second method, there are some peculiarities you have to keep in mind: https://www.percona.com/blog/2007/07/06/php-large-result-sets-and-summary-tables/
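The idea of processing each row as it arrives can be sketched in Python. Note that this uses the standard library's sqlite3 module as a stand-in, since its cursors fetch rows lazily during iteration. With a MySQL driver you would need that driver's own unbuffered or server-side cursor option. The function name and the CSV-style row format are made up for the example.

```python
import sqlite3

def stream_rows(connection, query):
    """Yield each result row as an encoded line, one row at a time."""
    cursor = connection.execute(query)
    for row in cursor:  # rows are fetched lazily as iteration proceeds
        yield (",".join(str(value) for value in row) + "\n").encode()
```

Each yielded line could be written straight to the HTTP response and flushed, so the full result set never has to sit in the application's memory.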
NodeJS has great support for streaming. In fact its entire native HTTP module does streaming by default for both incoming requests and outgoing responses. Every time you call response.write, it is just writing a chunk of data (response.writeHead sends the headers first). However there may be a buffer size inside NodeJS, which is probably the highWaterMark setting, but I have not looked into this further.
NodeJS has a native stream module: https://nodejs.org/api/stream.html that serves as a base object for all other IO modules.