Creating a Google Cloud Platform Project for High Performance Web Graphics

Inspired by projects like kepler.gl, Unfolded Studio, and Mapbox Studio, this project is meant to serve as an exploration and documentation of how to build these kinds of systems from the ground up.

In particular, I'm interested in identifying and experimenting with the key components and tradeoffs that affect data throughput, all the way from the storage layer to the user's screen. Incremental loading, streaming, data formats, client side parsing, off-main thread processing, and CPU versus GPU management are all dials I'd like to be able to turn.

As of January 2021, I'd call this an early MVP. Everything beyond Part 1 is very tentative, and there are some tasks that have been implemented but not yet documented:

  • Loading and parsing Apache Arrow data on the client
  • Rendering Arrow data in deck.gl (see the sketch after this list)
  • Building a playable time scrubber with d3.brush
  • Using deck.gl's DataFilterExtension for High Performance Visualizations
    • This enables effects that previously required custom shaders
  • Using requestAnimationFrame to Build a Smoothly Animated Time Scrubber
  • Adding Mapbox GL JS v2 with elevation
  • Client side caching to avoid redundant network calls (and GCP costs)
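
To make the Arrow items above concrete, here's a minimal sketch of loading an Arrow payload and handing it to deck.gl as binary attributes. It assumes the apache-arrow JS package (v9+ API), a hypothetical /api/trips.arrow route on the Django side, and longitude/latitude column names, none of which are settled.

```ts
import { tableFromIPC } from 'apache-arrow';
import { Deck } from '@deck.gl/core';
import { ScatterplotLayer } from '@deck.gl/layers';

// Hypothetical route name; the real endpoint lives in the Django app.
const ARROW_URL = '/api/trips.arrow';

async function renderArrowPoints(): Promise<void> {
  // Fetch the Arrow IPC payload and parse it into a Table.
  const response = await fetch(ARROW_URL);
  const table = tableFromIPC(new Uint8Array(await response.arrayBuffer()));

  // Pull the coordinate columns out as typed arrays (column names are assumptions).
  const lon = table.getChild('longitude')!.toArray() as Float64Array;
  const lat = table.getChild('latitude')!.toArray() as Float64Array;

  // Interleave into one array so deck.gl can consume it as a binary attribute
  // instead of calling a per-row accessor.
  const positions = new Float64Array(table.numRows * 2);
  for (let i = 0; i < table.numRows; i++) {
    positions[i * 2] = lon[i];
    positions[i * 2 + 1] = lat[i];
  }

  new Deck({
    initialViewState: { longitude: -122.4, latitude: 37.8, zoom: 9 },
    controller: true,
    layers: [
      new ScatterplotLayer({
        id: 'arrow-points',
        data: {
          length: table.numRows,
          attributes: {
            getPosition: { value: positions, size: 2 }
          }
        },
        radiusMinPixels: 1,
        getFillColor: [255, 140, 0]
      })
    ]
  });
}

renderArrowPoints();
```

Interleaving the columns once up front lets deck.gl skip per-row accessor calls entirely, which is one of the bigger levers for throughput here.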

Improving the visual design is probably the next highest priority task. The latest version can be seen here, and the source code is here.

Part 1 - Project Setup and MVP Construction

  1. [Create a new Google Cloud Platform project]
  2. [Create a new Google App Engine application]
  3. [Create a Django Application in the Google App Engine Standard Environment]
  4. [Using Google BigQuery from Django]
  5. [Inspecting BigQuery Public Datasets from Google App Engine]
  6. [Returning BigQuery Results in Apache Arrow Format]
  7. [Serve a client side application from Django]
  8. [Create a vanilla JavaScript application with Mapbox and deck.gl]
  9. [Load API data from Django routes in a vanilla JavaScript application]
  10. [Returning BigQuery Data in JSON Format with Python]
  11. [Create an interactive scatter plot with vanilla JavaScript in deck.gl]
  12. [Returning BigQuery Data in Apache Arrow Format with Python]
  13. [Loading and Rendering Apache Arrow Data with deck.gl]
  14. [Using TypeScript in the Browser Without a Framework]

Part 2 - Dataset Creation and API Architecture

TBD

  • Identify the features that should be driven by the client
    • Record batch size and HTTP vs WebSockets split?
  • Can a web browser be an Apache Arrow Flight client?
  • How do you best deliver massive datasets? My guess is that the ideal experience comes from returning the first N results directly in the HTTP response and streaming the rest down as WebSocket messages. The server can return quickly, the client can show initial results immediately, and the remaining rows fill in as they arrive over the socket connection (see the sketch after this list).
  • Browsers can't be gRPC streaming clients, so WebSockets are about as close as a browser can get to true streaming.
  • I wonder if there is a benefit to running a server-side gRPC streaming or Arrow Flight client that relays the data as WebSocket messages. Without one of those, the server needs some other way of getting incremental results to relay.
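
To make the hybrid idea concrete, here's a rough client-side sketch. The /api/points and /ws/points routes, the limit parameter, and the one-Arrow-payload-per-message framing are all placeholder assumptions, not decisions.

```ts
import { Table, tableFromIPC } from 'apache-arrow';

// Both paths are hypothetical placeholders for routes in the Django app.
const FIRST_BATCH_URL = '/api/points?limit=100000';
const STREAM_URL = `wss://${location.host}/ws/points`;

// Fetch the first N rows over plain HTTP so something can render immediately,
// then open a WebSocket and hand over the remaining batches as they arrive.
async function loadIncrementally(onBatch: (table: Table) => void): Promise<void> {
  const response = await fetch(FIRST_BATCH_URL);
  onBatch(tableFromIPC(new Uint8Array(await response.arrayBuffer())));

  const socket = new WebSocket(STREAM_URL);
  socket.binaryType = 'arraybuffer';
  socket.onmessage = (event: MessageEvent<ArrayBuffer>) => {
    // Each message is assumed to be a self-contained Arrow IPC payload.
    onBatch(tableFromIPC(new Uint8Array(event.data)));
  };
  socket.onclose = () => console.log('stream finished');
}

// Usage: append each batch to the layer's data and re-render.
loadIncrementally((table) => console.log(`received ${table.numRows} rows`));
```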

Part 3 - Parsing and Rendering Large Data Payloads on the Web

  • Compare the JSON and Arrow APIs in terms of memory, jank/lockup, etc
  • Client side Arrow parsing strategies
  • Experiment with record batch sizes
  • Parsing the data to support zooming to bounds (see the sketch after this list)
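
For the zoom-to-bounds item, the parsing work is mostly a single pass over the coordinate columns. Here's a sketch using deck.gl's WebMercatorViewport.fitBounds; the column names and viewport dimensions are assumptions.

```ts
import { Table } from 'apache-arrow';
import { WebMercatorViewport } from '@deck.gl/core';

// Scan the coordinate columns once for the bounding box, then ask deck.gl
// for the view state that frames it.
function zoomToBounds(table: Table, width: number, height: number) {
  const lon = table.getChild('longitude')!.toArray() as Float64Array;
  const lat = table.getChild('latitude')!.toArray() as Float64Array;

  let minLon = Infinity, maxLon = -Infinity;
  let minLat = Infinity, maxLat = -Infinity;
  for (let i = 0; i < lon.length; i++) {
    if (lon[i] < minLon) minLon = lon[i];
    if (lon[i] > maxLon) maxLon = lon[i];
    if (lat[i] < minLat) minLat = lat[i];
    if (lat[i] > maxLat) maxLat = lat[i];
  }

  const { longitude, latitude, zoom } = new WebMercatorViewport({ width, height })
    .fitBounds([[minLon, minLat], [maxLon, maxLat]], { padding: 40 });

  // Feed this into Deck's viewState (or the Mapbox camera) to frame the data.
  return { longitude, latitude, zoom };
}
```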

Part 4 - High Performance Graphics in Web Browsers

  • Why binary data
  • Why Arrow
  • deck.gl and high performance best practices
  • Scatterplots
  • Lines (roads? waterways?)
  • Polygons
  • Custom shaders
  • Building an Interactive Time Scrubber for Geotemporal Data (see the sketch after this list)
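
As a preview of the scrubber piece, here's a hedged sketch of driving the DataFilterExtension window from a requestAnimationFrame loop (deck.gl v9 TypeScript API); the row shape, time range, and step size are all made up for illustration.

```ts
import { Deck } from '@deck.gl/core';
import { ScatterplotLayer } from '@deck.gl/layers';
import { DataFilterExtension, DataFilterExtensionProps } from '@deck.gl/extensions';

// Hypothetical row shape; the real app would feed these from the Arrow table.
type Point = { longitude: number; latitude: number; timestamp: number };

const WINDOW = 3600; // width of the visible time window, in seconds (assumption)
const START = 0;     // animation start timestamp (assumption)
const END = 86400;   // animation end timestamp (assumption)
const STEP = 60;     // seconds advanced per frame (assumption)

function makeLayer(data: Point[], currentTime: number) {
  return new ScatterplotLayer<Point, DataFilterExtensionProps<Point>>({
    id: 'scrubber-points',
    data,
    getPosition: (d): [number, number] => [d.longitude, d.latitude],
    radiusMinPixels: 2,
    getFillColor: [0, 128, 255],
    // DataFilterExtension filters on the GPU, so sliding the window only
    // updates uniforms; the data itself is never re-uploaded.
    getFilterValue: (d) => d.timestamp,
    filterRange: [currentTime, currentTime + WINDOW],
    extensions: [new DataFilterExtension({ filterSize: 1 })]
  });
}

// Drive the window from requestAnimationFrame so the scrubber animates at
// display refresh rate, looping back to START when it reaches END.
function animate(deck: Deck, data: Point[]): void {
  let t = START;
  const frame = () => {
    t = t + STEP > END ? START : t + STEP;
    deck.setProps({ layers: [makeLayer(data, t)] }); // only filterRange changes per frame
    requestAnimationFrame(frame);
  };
  requestAnimationFrame(frame);
}
```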

TBD

  • Need to decide if I'm sticking with vanilla JS or not
  • Do I need protobufs to avoid things getting messy?