Encoding Strings in Apache Arrow

Because [Apache Arrow] uses a prefix sum pattern to encode strings, it uses more space than comparable approaches like [Protocol Buffers], especially when storing small strings. Since the offsets vector reserves 4 bytes for each entry, an Arrow string column will always be at least row_count * 5 bytes in size. Four bytes for the offset and one byte minimum for the data.

Since [Protocol Buffers] are not columnar, they encode strings and their lengths in one buffer. This means they can avoid wasting any space, because byte length of the string length (?!?) can be encoding in a single byte if the data fits. So a Protocol Buffer might store string data like <len><char_codes><len><char_codes>, but Arrow encodes it like this.

offset: [0, 4, 8, 12, 16],
data: [0001000200030004]

Why do this if it uses more space? The columnar approach allows for fast read operations without loading extra data. Row based approaches like Protocol Buffers mean you cannot read the second string without reading the first string's length, because you won't know where it starts.Not a big deal until you need the one millionth string's value and you have to read 999,999 others first. With a columnar approach you can do random parallel accesses simply by using the row index to find the length or the value of a string in the buffer. You jump to offsets[str_idx] to know where to start reading in the data buffer.

References

Slack chat with Paul Taylor on 2020-09-04