Browser file and network I/O #5
Node.js showed us that JavaScript can be a great choice for file and network programming when it's backed by non-blocking I/O under the hood. Node does this by interfacing with the OS in C++ code and exposing a JavaScript API. When Node was created it had the benefit of starting from scratch, and it focused on these I/O interfaces first:
- TCP
- UDP
- DNS
- File System
Each of these has a JavaScript API, e.g. `var net = require('net')`, but when you actually create a TCP socket with `net` it uses the C++ machinery under the hood in Node to create the TCP socket in your OS. Any data that goes in and out of the socket is relayed between you and the OS as node `Buffer` objects (which exist just to efficiently hold binary data, something JS couldn't do when node was created).
Under the hood the C++ I/O interfaces were written in a non-blocking way so that a streaming JavaScript interface could be written on top of them. Streaming support means you can process files in real time, and process large files without crashing the process (since you aren't trying to read too much data into memory at once). Both of these qualities make node great for processing data.
I/O is important for a variety of data processing and data science use cases. Reading large files, downloading large files, uploading large files, writing large files, etc.
I/O in browser JavaScript is much more limited when it comes to working with network and file data. I'll list all I/O options in browsers today and describe their weaknesses:
## HTTP (XHR)
XHR is the main HTTP client built into browsers. Similar to the `net` module in node, XHR (specifically XHR version 2) is implemented in native (C++) code in the browser and exposed as a JavaScript API.
### Major flaw: does not support streaming data
Say you want to download a 20GB file and write a grep function that counts how many times a certain word occurs in it. In node this would be a 5 line program. XHR, however, buffers the entire response body into a single buffer, which means your browser will eventually slow to a halt or crash as the response buffer exceeds the available RAM. Also, accessing the data from XHR as it arrives only works if the data is text, so pseudo-streaming binary data is totally impossible.
This is a subtle but important difference from the way node works, which is to split the response into many smaller buffers. Once your code has processed a buffer it can be garbage collected. However, there is no way to garbage collect processed buffers with XHR.
The only workaround is to use HTTP `Range` headers and make multiple HTTP requests for different byte ranges of the file. However, this incurs a significant performance penalty, especially over TLS connections, which have something like a 5 RTT handshake, and it requires the server to support `Range` headers (which is nowhere near universal).
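The arithmetic behind that workaround is simple enough to sketch. The chunk size is an arbitrary choice, and each range would become the `Range` header of one request (e.g. via `xhr.setRequestHeader('Range', ranges[i])`):

```javascript
// Split a file of known size into inclusive byte ranges, one per
// HTTP request, in the format the Range request header expects.
function byteRanges(fileSize, chunkSize) {
  var ranges = [];
  for (var start = 0; start < fileSize; start += chunkSize) {
    var end = Math.min(start + chunkSize, fileSize) - 1;
    ranges.push('bytes=' + start + '-' + end);
  }
  return ranges;
}
```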
## HTTP (Fetch)
Fetch is a replacement for XHR/XHR2. In Chrome as of v43 it supports streaming responses! https://googlechrome.github.io/samples/fetch-api/fetch-response-stream.html. It doesn't work in Firefox yet, but support is planned.
There is a 1-2 year old WHATWG initiative called Streams: https://streams.spec.whatwg.org/. It is trying to come up with a JS streaming API that can be shared across I/O interfaces in the browser. To quote this blog post, "The Streams API is important because it allows large resources to be processed in a memory efficient way." Fetch is (I think) the first thing to use it.
### Issue: not widely implemented / spec isn't finished
I think Fetch and Streams are The Future(tm), but they are still being evaluated and designed. There is an opportunity for us to share data science/data processing use cases with the spec designers to make sure they end up making our lives more awesome.
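The reading side of the Streams API looks roughly like this. It's sketched against a locally constructed `Response` rather than a real `fetch()`; the `Response` and stream-reader globals assumed here exist in Chrome 43+ and in recent versions of node:

```javascript
// Read a fetch Response body chunk by chunk instead of buffering it.
// Each chunk is a Uint8Array that can be processed and then dropped.
async function readChunks(response) {
  var reader = response.body.getReader();
  var received = 0;
  while (true) {
    var result = await reader.read();
    if (result.done) break; // stream is finished
    received += result.value.length; // process the chunk here
  }
  return received;
}
```

With a real download this would be `readChunks(await fetch(url))`, and memory use stays flat no matter how big the response is.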
## WebSockets
WebSockets are a message-based protocol on top of TCP, and are supported in most browsers today. There are issues deploying WebSockets on networks with proxies that don't understand how to route WebSocket traffic, but those issues go away if you use TLS (`wss://` instead of `ws://`), as that causes the headers to get encrypted and the proxies pass the messages through (since they can no longer read the headers and get confused).
### Two nice things about WebSockets
**You can read/write individual buffers out of them**
This is in contrast to XHR, which only lets you read or write a single buffer per request. With a WebSocket you open a connection and can write or read as many individual buffers as you want over the lifetime of the socket. This is much better for streaming large files either into or out of the browser, as the programmer can decide how big or small each buffer should be sliced.
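That slicing step might look like this (the 64KB chunk size in the usage comment is an arbitrary example; `socket.send()` is the standard WebSocket method):

```javascript
// Slice a large ArrayBuffer into fixed-size chunks, e.g. before
// sending a file over a WebSocket one chunk at a time.
function sliceBuffer(buffer, chunkSize) {
  var chunks = [];
  for (var offset = 0; offset < buffer.byteLength; offset += chunkSize) {
    // ArrayBuffer#slice copies the byte range [offset, offset + chunkSize)
    chunks.push(buffer.slice(offset, offset + chunkSize));
  }
  return chunks;
}

// usage: sliceBuffer(fileBuffer, 64 * 1024).forEach(function (chunk) {
//   socket.send(chunk);
// });
```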
**You can transfer binary data over them**
Whereas XHR is very limited in regards to binary data, WebSockets can run in `'arraybuffer'` mode, which lets you read and write binary buffers without resorting to crazy hacks like Base64 encoding, which is very inefficient, especially for large files.
### Issue: no backpressure
TCP has a mechanism built in for knowing when the other side of the connection is clogged, so that the side writing data can slow down. This is very important for memory efficiency in real-world networking: without backpressure, a single user on a slow connection could cause a server to buffer lots of data in RAM while waiting for that user to download it.
Unfortunately the WebSocket API does not expose the backpressure signals from TCP :( There is an opportunity here to request that the WebSocket API be improved. We actually got the standards bodies to fix WebRTC DataChannels for the same issue: feross/simple-peer#39
## WebRTC DataChannels
DataChannels are somewhat similar to WebSockets (in fact their design was almost copy-pasted), but instead of sitting on top of client-server TCP they sit on top of WebRTC PeerConnections, which can be direct connections between two browser users, or between a browser and a server process.
They are in nearly every browser today, the notable exception being Safari.
The two main differences between DataChannels and WebSockets are:
**DataChannels are always encrypted**
WebSockets use TLS and the HTTPS certificate system to do encryption, but DataChannels have built-in P2P encryption, with their own keys, that is turned on by default.
**DataChannels have a reliable and an unreliable mode**
The networking machinery underneath WebRTC is too complicated to detail here, but whereas WebSockets only run on top of TCP (a reliable transport), DataChannels can be either reliable or unreliable depending on the underlying transport (which may be UDP or TCP).
### Issue: connection overhead
Making a WebRTC connection can be relatively slow, with a ~5 RTT handshake plus ICE/STUN negotiation. However, Google is apparently working on a version of DataChannels powered by their new QUIC protocol, which boasts 0 RTT handshakes in the best case.
### Issue: no backpressure
DataChannels copied WebSockets and inherited the lack of backpressure. But we got in contact with Chrome and Firefox and got them to fix the spec! feross/simple-peer#39
## File System
Around 5 years ago there was a File System W3C specification, but as of 2014 it has been abandoned, and there doesn't seem to be any replacement in the works.
IMO the only good part of the File System API is `FileReader`, and it happens to be one of the only parts that was ever actually widely implemented.
### Good: reading files
FileReader lets the user select a file (or multiple files), after which browser JS gets random access to read the file's data in whatever chunk size the programmer specifies. This is really nice! I wrote a module that wraps this in a stream: https://github.com/maxogden/filereader-stream
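The pattern is: a selected `File` is a `Blob`, so you slice it and read one chunk at a time. This sketch uses Blob's promise-based `arrayBuffer()` method (available in modern browsers and recent node) in place of FileReader's event-based API, but the chunking logic is the same:

```javascript
// Read a Blob (e.g. a user-selected File) in fixed-size chunks so
// arbitrarily large files never have to fit in memory at once.
async function readInChunks(blob, chunkSize, onChunk) {
  for (var offset = 0; offset < blob.size; offset += chunkSize) {
    var slice = blob.slice(offset, offset + chunkSize);
    // in older browsers you'd hand `slice` to a FileReader instead
    var bytes = new Uint8Array(await slice.arrayBuffer());
    onChunk(bytes, offset);
  }
}
```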
### Bad: writing files
However, say you want to write a file to the user's hard drive. You can ask the user to choose where they want to save the file, but the File System `saveAs()` method only lets you write a single buffer to the file, and there is no way to append to a file! This means you can only write files as large as will fit in an `ArrayBuffer` in RAM (whose size is capped at a 32-bit integer).
The lack of a streaming file-write interface in the browser (it's missing from the W3C specs entirely) is a huge issue, and needs some community championing.
## IndexedDB
IndexedDB is a non-blocking key/value database available to browser JS, backed by LevelDB in Chrome and SQLite in Firefox. It's supported in most browsers.
The API is really complicated and there are bugs between implementations in different browsers, but it is an actual database that you can actually use to store key/value data. It doesn't work well for storing large data like files (for the same reasons you wouldn't store a filesystem in a MySQL table, but would use blobs instead), but it works relatively well for mutable data that you want random access to. It's also non-blocking, which means reading from and writing to it won't cause the browser UI to freeze up.
I don't have any major criticisms of IndexedDB, other than that it seems to be difficult to implement, as it is a very complicated spec. If it were less overengineered, I feel the inconsistencies between implementations would be easier to fix.
PHEW!
Ok that was a tour of the state of I/O in the browser as it relates to data processing/data science users.
If you have feedback, or ideas on how to improve this stuff, or questions, leave a comment below.