pyarrow.Buffer wraps the C++ arrow::Buffer type. A Buffer can be created from any Python object implementing the buffer protocol (such as a bytes object) by calling the py_buffer() function:

arrow_buf = pa.py_buffer(python_bytes_object)
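For example, a minimal runnable sketch (the payload bytes are arbitrary); the resulting buf is reused with input_stream() below:

import pyarrow as pa

data = b'abcdefghijklmnopqrstuvwxyz'
buf = pa.py_buffer(data)
buf.size   # 26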
The Arrow C++ libraries have several abstract interfaces for different kinds of IO objects: read-only streams, read-only files supporting random access, write-only streams, and write-only files supporting random access.
In the interest of making these objects behave more like Python’s built-in file objects, we have defined a NativeFile base class which implements the same API as regular Python file objects.
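As an illustrative sketch (the file name is hypothetical), an OSFile is one such NativeFile and supports the familiar write/read/seek calls:

with pa.OSFile('example.dat', 'wb') as f:
    f.write(b'some example data')

with pa.OSFile('example.dat', 'rb') as f:
    f.seek(5)
    f.read(7)   # b'example'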
pa.input_stream() will return a BufferReader when passed native Python buffers or memoryview objects:

stream = pa.input_stream(buf)

When passed a file path, input_stream() will open the given file on disk for reading, creating an OSFile. Optionally, the file can be compressed: if its filename ends with a recognized extension such as .gz, its contents will automatically be decompressed on reading.
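For instance, a minimal sketch of the compressed case (the .gz file below is hypothetical and created only for the example):

import gzip

with gzip.open('example.txt.gz', 'wb') as f:
    f.write(b'compressed text')

# input_stream() detects the .gz extension and decompresses on read
with pa.input_stream('example.txt.gz') as stream:
    stream.read()   # b'compressed text'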
output_stream() is the equivalent function for output streams and allows creating a writable NativeFile.
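A minimal sketch (the file name is hypothetical; as with input_stream(), a recognized extension such as .gz enables transparent compression on write):

with pa.output_stream('output.dat') as stream:
    stream.write(b'some data')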
To assist with serialization and deserialization of in-memory data, we have file interfaces that can read and write to Arrow Buffers.
# write into an in-memory buffer ...
writer = pa.BufferOutputStream()
writer.write(b'hello, friends')

buf = writer.getvalue()
print(buf)

# ... then read it back through a BufferReader
reader = pa.BufferReader(buf)
reader.seek(7)
reader.read(7)   # b'friends'
# create a small record batch
import pyarrow as pa
data = [
    pa.array([1, 2, 3, 4]),
    pa.array(['foo', 'bar', 'baz', None]),
    pa.array([True, None, False, True])
]
batch = pa.record_batch(data, names=['f0', 'f1', 'f2'])
Now, we can begin writing a stream containing some number of these batches. For this we use RecordBatchStreamWriter, which can write to a writeable NativeFile object or a writeable Python object. For convenience, it can be created with new_stream():
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, batch.schema)
When creating the StreamWriter, we pass the schema, since the schema (column names and types) must be the same for all of the batches sent in this particular stream. Now we can do:
# write multiple record batches into the stream writer
for i in range(5):
    writer.write_batch(batch)

writer.close()

buf = sink.getvalue()
Now buf contains the complete stream as an in-memory byte buffer. We can read such a stream with RecordBatchStreamReader or the convenience function pyarrow.ipc.open_stream:
reader = pa.ipc.open_stream(buf)
reader.schema
# f0: int64
# f1: string
# f2: bool
batches = [b for b in reader]
len(batches)
# 5
An important point is that if the input source supports zero-copy reads (e.g. a memory map or a pyarrow.BufferReader), then the returned batches are also zero-copy and do not allocate any new memory on read.
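As a sketch of the zero-copy path (writing the stream to a hypothetical file on disk and memory-mapping it back):

with pa.OSFile('example_stream.arrows', 'wb') as f:
    f.write(buf)

# batches read from the memory map reference the mapped memory directly
with pa.memory_map('example_stream.arrows', 'r') as source:
    loaded = pa.ipc.open_stream(source).read_all()   # a Table combining all 5 batches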
When you read a full file (written with the IPC file format via pa.ipc.new_file() rather than the stream format), you immediately know how many batches it contains and can read them from any location: the file ends with a footer recording the schema and the offset of every record batch, and the reader exposes this as num_record_batches and get_batch().
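A minimal sketch of the file-format round trip (this rebinds buf to the file-format payload, which is what the read_pandas() call below operates on):

file_sink = pa.BufferOutputStream()
with pa.ipc.new_file(file_sink, batch.schema) as file_writer:
    for i in range(5):
        file_writer.write_batch(batch)

buf = file_sink.getvalue()

file_reader = pa.ipc.open_file(buf)
file_reader.num_record_batches   # 5
file_reader.get_batch(3)         # read the fourth batch directly, without scanning the others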
The stream and file reader classes have a special read_pandas method to simplify reading multiple record batches and converting them to a single DataFrame output:
df = pa.ipc.open_file(buf).read_pandas()