Rendering Millions of Interactive Points on the Web

AKA Big Data Web Clients

Summary: rendering millions of interactive elements in a browser is possible, but it requires a handful of departures from typical web development. The TL;DR: send binary data from the server and pass it, with as little handling as possible, straight to the client's GPU.

Most web developers are rightly concerned with keeping the amount of data their applications require to a minimum. There is a certain class of applications, however, where a lot of data is unavoidable. In fact, the more data you can deliver, process, and expose to the user for exploration, the better.

For lack of a better term I usually refer to these applications as Big Data Web Clients. Whether it's large amounts of detailed geospatial coordinates or a massive data graph for in-browser exploration, there are unique challenges to keeping things performant.

As an example, a detailed outline of all the counties in the US, encoded as GeoJSON, is about 20 MB of data. Pass that much text to JSON.parse() and you're liable to freeze or crash the user's tab. Existing mapping libraries might be optimized to handle specific scenarios like that one, but as a general-purpose approach to data deserialization it isn't feasible. We'll use geographic data in this example because it's easy to understand and render, but the technique is most valuable as a general-purpose approach.

Transmission protocols and wire types

I'd venture to guess that, over the past 15 years, 95+% of web applications have used JSON over HTTP as their client-server communication mechanism. (REST is a bit more specific, but you could probably swap that in here). Of the sliver that were not, a decent portion were likely written by Google engineers. I spent about two years at Google X, and one of the most valuable techniques I picked up there was the use of Protocol Buffers for remote communication in web apps. We used them for everything from embedded systems to servers to clients, and I really think they provide a host of benefits to projects, both technical and otherwise.

The tweet thread linked above goes into more detail on protocol buffers, but we’re going to go one small step further here and talk about gRPC.

gRPC takes the Protocol Buffer message format, defined in .proto files, and extends it so those same files can define services in addition to data types. Instead of just a User message, you can also have a UserService, with methods like AddUser and GetUsers. Can you guess what types those methods accept and return? That's right, your protobufs. So what you end up with is something like the sketch below: you define the remote procedures you wish to expose and the types they use, and then you can generate code in multiple languages that implements those typed methods.
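Here's a hypothetical sketch of what such a file might look like for the points app in this post. The messages, fields, and service names here are illustrative, not the app's actual schema:

```proto
syntax = "proto3";

package points;

// One renderable point.
message Point {
  double longitude = 1;
  double latitude = 2;
  float radius = 3;
  uint32 color = 4; // packed RGBA
}

message GetPointsRequest {
  // Rough viewport bounds so the server can cull what it sends.
  double min_longitude = 1;
  double min_latitude = 2;
  double max_longitude = 3;
  double max_latitude = 4;
}

message GetPointsResponse {
  repeated Point points = 1;
}

// The remote procedures the server exposes to clients.
service PointService {
  rpc GetPoints(GetPointsRequest) returns (GetPointsResponse);
}
```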

This is incredibly valuable for making the data model of your application clear. For an example app like this one, every essential detail of the application is captured in that one .proto file. The implementation details are separate from the clear outline of the "domain knowledge" or "business logic" of the application.

Protobufs implementation

This all means that when our client requests data from the server, it won't be served a massive, inefficient JSON representation of an object graph. It will be JSON, but only in letter, not in spirit: in this setup, each protobuf message is serialized as a bare JSON array, with values positioned by field number. It essentially flattens the values of the object graph and discards the keys, which makes for a much smaller payload. The trade-off is readability, because humans have a hard time interpreting that by-index arrangement of values.
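As a rough illustration of the idea (the exact layout depends on your field numbers and tooling), the same record in both forms might look like this:

```jsonc
// Conventional JSON: every key repeated for every object
{ "id": 42, "name": "Travis County", "lon": -97.78, "lat": 30.33 }

// Positional, protobuf-style encoding: values only, ordered by field number
[42, "Travis County", -97.78, 30.33]
```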

While the efficiency of the encoded format is essential, the differences from a "typical" web app don't end there. The code generated by protoc for dealing with those serialized data structures is also careful about how it deserializes them. The details depend on your exact toolchain, but generally speaking the generated code won't eagerly parse more data than it needs to. Creating a new instance with Type.deserializeBinary(chunk) will probably not parse the entire structure of chunk; it may selectively parse the top-level items, but the rest of the data stays serialized until it's requested via a getter like getName(). In other words, many implementations rely on lazy evaluation to avoid parsing bottlenecks.
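For instance, with protoc-generated JavaScript/TypeScript classes, fetching and reading a response might look like the sketch below. The endpoint and generated names follow from the hypothetical .proto above, and the exact API (including how lazy the parsing actually is) depends on your toolchain and plugins:

```typescript
import { GetPointsResponse } from "./generated/point_service_pb";

async function fetchPoints(): Promise<GetPointsResponse> {
  const resp = await fetch("/api/points");
  const bytes = new Uint8Array(await resp.arrayBuffer());

  // Wrap the serialized bytes in a typed message object. Depending on the
  // runtime, nested fields may not be fully materialized at this point.
  return GetPointsResponse.deserializeBinary(bytes);
}

const response = await fetchPoints();

// Values are pulled out through strongly typed getters, which is where a
// lazy implementation would pay the per-field parsing cost.
const points = response.getPointsList();
console.log(points.length, points[0].getLongitude(), points[0].getLatitude());
```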

This means we end up with references to strongly typed, lazily evaluated objects for use in our client application. chef’s kiss

Taking efficiency one step further means using binary data and something like Arrow, but let’s come back to that later.

Rendering with WebGL and deck.gl

I've kept everything as general as possible to this point, but this part is specific to web browsers as clients. The only way to draw even tens of thousands, let alone millions, of objects in a browser is to use WebGL. The DOM simply cannot handle that many items, and even canvas bogs down because it's CPU-bound and runs on the main thread. Only by offloading to the GPU, which is massively parallel and optimized for graphics (duh), can you hope to attain the kind of performance your users deserve and demand.

There are many libraries that make it easier to target WebGL, and for geospatial visualization none is better than deck.gl from Uber. deck.gl has a great API that allows you to compose various layers to create a lot of common visualizations easily, while also providing the tools necessary for creating completely custom visualizations. From your own layers to your own custom WebGL shaders, deck provides a fantastic baseline framework for geospatial visualization.

For this example we’ll use the ScatterplotLayer from deck to render simple “dots” on our map.
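A minimal sketch of how that layer might be wired up, continuing from the fetchPoints() sketch above (the view state and styling values are arbitrary):

```typescript
import { Deck } from "@deck.gl/core";
import { ScatterplotLayer } from "@deck.gl/layers";

// Map the protobuf response into the simple objects the layer's accessors expect.
// `response` comes from the fetchPoints() sketch earlier.
const data = response.getPointsList().map((p) => ({
  position: [p.getLongitude(), p.getLatitude()] as [number, number],
}));

const layer = new ScatterplotLayer({
  id: "points",
  data,
  getPosition: (d: { position: [number, number] }) => d.position,
  getFillColor: [255, 80, 0],
  getRadius: 50,        // meters
  radiusMinPixels: 1,   // keep dots visible when zoomed out
  pickable: true,       // enables hover/click interaction per point
});

// Standalone Deck instance; in a real app this usually sits on top of a base map.
new Deck({
  initialViewState: { longitude: -97.7, latitude: 30.3, zoom: 6 },
  controller: true,
  layers: [layer],
});
```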

Binary data and Apache Arrow

To squeeze even more bang out of every byte, and potentially simplify your graphics pipeline, you can transfer and use your data in a binary format rather than constructing your whole data graph out of objects and arrays. This avoids costly deserialization and memory allocations, allowing you to work with much larger quantities of data efficiently. As an example, passing tens of megabytes of text into JSON.parse() will lock up the UI for several seconds at best, and crash the tab at worst. Contrast that with loading 30 megabytes of binary data: memory grows by roughly 30 megabytes and the UI stays responsive.

This performance boost does come at the cost of decreased inspectability, as well as the need for dedicated tooling for working with data in binary form.

Sidebar: what do I mean by "binary data"? Rather than text that has to be parsed into a graph of JavaScript objects and arrays, the server sends raw bytes whose layout is agreed upon ahead of time. On the client, those bytes live in an ArrayBuffer and are read through typed arrays, so "loading" the data amounts to little more than copying a buffer into memory, and that same buffer can often be handed to the GPU as-is.

As it stands today, Apache Arrow is the obvious choice for this task. Arrow is a columnar format, meaning values are stored per column in contiguous, typed buffers rather than per record, and it has implementations for every major programming language.
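Here's a rough sketch of Arrow's JavaScript library feeding deck.gl's binary-attribute path. The endpoint, column names, and column types (float64) are assumptions for illustration, and the exact Arrow API varies a bit between versions:

```typescript
import { tableFromIPC } from "apache-arrow";
import { ScatterplotLayer } from "@deck.gl/layers";

const buffer = await (await fetch("/api/points-arrow")).arrayBuffer();
const table = tableFromIPC(new Uint8Array(buffer));

// Pull each column out as a typed array -- no per-point JS objects are created.
const lon = table.getChild("longitude")!.toArray() as Float64Array;
const lat = table.getChild("latitude")!.toArray() as Float64Array;

// Interleave into the [x0, y0, x1, y1, ...] layout deck.gl expects.
const positions = new Float32Array(table.numRows * 2);
for (let i = 0; i < table.numRows; i++) {
  positions[i * 2] = lon[i];
  positions[i * 2 + 1] = lat[i];
}

const layer = new ScatterplotLayer({
  id: "arrow-points",
  // Binary attribute form: deck.gl uploads these typed arrays to the GPU
  // directly instead of calling an accessor once per row.
  data: {
    length: table.numRows,
    attributes: {
      getPosition: { value: positions, size: 2 },
    },
  },
  getFillColor: [255, 80, 0],
  radiusMinPixels: 1,
});
```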

The client-side primitives underneath all of this are ArrayBuffer (a raw block of bytes), typed arrays like Float32Array and Uint8Array (fixed-type views over a buffer), and DataView (for reading mixed types at arbitrary byte offsets). Arrow columns, WebGL buffer uploads, and deck.gl's binary attributes are all built on these, which is what lets data flow from the network to the GPU with so little handling in between.
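If those terms are unfamiliar, here's a tiny standalone illustration, with no libraries involved:

```typescript
// An ArrayBuffer is just raw bytes; typed arrays and DataView are ways to read them.
const buffer = new ArrayBuffer(16);          // 16 bytes of zeroed memory

const floats = new Float32Array(buffer);     // view the same bytes as 4 x 32-bit floats
floats.set([1.5, -2.25, 3.0, 0.5]);

const view = new DataView(buffer);
// Read 4 bytes starting at offset 4 as a little-endian float: prints -2.25
// on little-endian platforms (which is effectively all of them).
console.log(view.getFloat32(4, true));

// Typed arrays like this are exactly what WebGL buffer uploads consume and
// what an Arrow numeric column's toArray() hands back.
```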