GPU Memory Store (gstore_fdw)

Overview

Usually, PG-Strom uses GPU device memory for temporary purposes only. It allocates a certain amount of device memory needed for query execution, transfers data blocks, and launches GPU kernels to process the SQL workload. Once the GPU kernel is finished, these device memory regions are released soon, so that the unused device memory can be re-allocated to other workloads.

This design allows multiple concurrent sessions, and allows scan workloads on tables larger than the GPU device memory. However, it may not be optimal depending on the circumstances.

A typical example is repeated calculations under various conditions on data that is not so large, small enough to fit into GPU device memory. This applies to workloads such as machine learning, pattern matching or similarity search.

For modern GPUs, it is not so difficult to process a few gigabytes of data in memory, but setting up the data to be loaded onto GPU device memory and transferring it is a costly process.

In addition, since variable length data in PostgreSQL has a size limitation of up to 1GB, this restricts the data format when it is given as an argument of a PL/CUDA function, even if the data itself would fit in GPU device memory.

GPU memory store (gstore_fdw) is a feature to preserve GPU device memory and to load data onto that memory in advance. It makes it unnecessary to set up arguments and load data for each invocation of a PL/CUDA function, and it eliminates the 1GB limitation on variable length data, because GPU device memory can be allocated up to its capacity.

As the name suggests, gstore_fdw is implemented using the foreign data wrapper interface of PostgreSQL. You can modify the data structure on GPU device memory using INSERT, UPDATE or DELETE commands on a foreign table managed by gstore_fdw. In the same way, you can also read the data using the SELECT command.

PL/CUDA functions can reference the data stored in GPU device memory through the foreign table. Right now, GPU programs that are transparently generated from SQL statements cannot reference this device memory region; however, we plan to enhance this feature in a future release.

[Figure: GPU memory store]

Setup

Usually it takes the 3 steps below to create a foreign table.

  • Define a foreign-data-wrapper using CREATE FOREIGN DATA WRAPPER command
  • Define a foreign server using CREATE SERVER command
  • Define a foreign table using CREATE FOREIGN TABLE command

The first two steps above are performed by the CREATE EXTENSION pg_strom command. The only command you need to run individually is the last one, CREATE FOREIGN TABLE.

CREATE FOREIGN TABLE ft (
    id int,
    signature smallint[] OPTIONS (compression 'pglz')
)
SERVER gstore_fdw OPTIONS(pinning '0', format 'pgstrom');

You can specify several options when creating the foreign table with the CREATE FOREIGN TABLE command.

SERVER gstore_fdw is a mandatory option. It indicates the new foreign table is managed by gstore_fdw.

The options below are supported in the OPTIONS clause.

name         target   description
-----------  -------  ------------------------------------------------------------------------
pinning      table    Specifies the device number of the GPU where the device memory is preserved.
format       table    Specifies the internal data format on GPU device memory. Default is 'pgstrom'.
compression  column   Specifies whether variable length data is compressed or not. Default is uncompressed.

Right now, only pgstrom is supported for the format option. It is the same data format that PG-Strom uses for its in-memory columnar cache. In most cases, there is no need to pay attention to the internal data format when writing or reading the GPU data store using SQL. On the other hand, you do need to consider it when you write a PL/CUDA function or share the GPU device memory with external applications using an IPC handle.

Right now, only pglz is supported for the compression option. This compression logic adopts the same data format and algorithm that PostgreSQL uses to compress variable length data larger than its threshold. It can be decompressed by the GPU internal function pglz_decompress() from a PL/CUDA function. Due to the characteristics of the compression algorithm, it is valuable for representing a sparse matrix that is mostly zero.

Operations

Loading data

Like normal tables, you can write to the GPU device memory behind the foreign table using the INSERT, UPDATE and DELETE commands.

Note that gstore_fdw acquires a SHARE UPDATE EXCLUSIVE lock at the beginning of these commands. It means that only a single transaction can update a gstore_fdw foreign table at any point in time. This is a trade-off: in exchange, we don't need to check visibility per record when a PL/CUDA function references the gstore_fdw foreign table.

Any contents written to a gstore_fdw foreign table are not visible to other sessions until the transaction gets committed, just like regular tables. This is a significant feature to ensure the atomicity of transactions; however, it also means that the older revision of the gstore_fdw foreign table contents must be kept on the GPU device memory until any concurrent transaction that may reference the older revision gets committed or aborted.

So, even though you can run INSERT, UPDATE or DELETE commands as if it were a regular table, you should avoid updating a few rows and then committing the transaction many times. Basically, an INSERT of massive rows at once (bulk loading) is recommended, as in the sketch below.
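A minimal bulk-loading sketch for the foreign table ft defined above, assuming a hypothetical source table source_data with matching columns:

BEGIN;
-- a single bulk INSERT keeps only one revision of the contents on the GPU device memory
INSERT INTO ft (id, signature)
    SELECT id, signature FROM source_data;
COMMIT;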

Unlike regular tables, the contents of a gstore_fdw foreign table are volatile. It is therefore very easy to lose the contents of the gstore_fdw foreign table on power-down or a PostgreSQL restart. So, what you load onto a gstore_fdw foreign table should be reconstructable from another data source.

Checking the memory consumption

See the pgstrom.gstore_fdw_chunk_info system view to check the amount of device memory consumed by gstore_fdw.

postgres=# select * from pgstrom.gstore_fdw_chunk_info ;
 database_oid | table_oid | revision | xmin | xmax | pinning | format  |  rawsize  |  nitems
--------------+-----------+----------+------+------+---------+---------+-----------+----------
        13806 |     26800 |        3 |    2 |    0 |       0 | pgstrom | 660000496 | 15000000
        13806 |     26797 |        2 |    2 |    0 |       0 | pgstrom | 440000496 | 10000000
(2 rows)
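The view can also be aggregated with ordinary SQL. A small sketch, using only the columns shown above, to sum up the consumption per GPU device:

SELECT pinning, pg_size_pretty(sum(rawsize)::bigint) AS rawsize_total
  FROM pgstrom.gstore_fdw_chunk_info
 GROUP BY pinning;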

Using the nvidia-smi command, you can check how much device memory is consumed on each GPU device. The "PG-Strom GPU memory keeper" process actually acquires and manages the device memory area used by gstore_fdw. In this example, 1211MiB is allocated in advance for the total of the above rawsize values (about 1100MB) plus CUDA internal usage.

$ nvidia-smi
Wed Apr  4 15:11:50 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:02:00.0 Off |                    0 |
| N/A   39C    P0    52W / 250W |   1221MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      6885      C   ...bgworker: PG-Strom GPU memory keeper     1211MiB |
+-----------------------------------------------------------------------------+

Internal Data Format

See the notes for details of the internal data format that gstore_fdw uses when it writes onto GPU device memory.

Inter-process Data Collaboration

CUDA provides the special APIs cuIpcGetMemHandle() and cuIpcOpenMemHandle(). The former allows an application to get a unique identifier of the GPU device memory it has allocated. The latter allows another application to map and reference that shared GPU device memory region. In other words, they support something like shared memory on the host system.

This unique identifier is a CUipcMemHandle object, which is simple binary data of 64 bytes. This section introduces the SQL functions that exchange GPU device memory with other applications using the CUipcMemHandle identifier.

SQL Functions

gstore_export_ipchandle(reggstore)

This function gets the CUipcMemHandle identifier of the GPU device memory preserved by the gstore_fdw foreign table, then returns it as binary data of bytea type. If the foreign table is empty and preserves no GPU device memory, it returns NULL.

  • 1st arg(ftable_oid): OID of the foreign table. Because it is reggstore type, you can specify the foreign table by name string.
  • result: CUipcMemHandle identifier in the bytea type.
# select gstore_export_ipchandle('ft');
                                                      gstore_export_ipchandle

------------------------------------------------------------------------------------------------------------------------------------
 \xe057880100000000de3a000000000000904e7909000000000000800900000000000000000000000000020000000000005c000000000000001200d0c10101005c
(1 row)

lo_import_gpu(int, bytea, bigint, bigint, oid=0)

This function temporarily opens the GPU device memory region acquired by an external application, reads the region, and writes it out as a PostgreSQL largeobject. If the largeobject already exists, its contents are replaced by the data read from the GPU device memory, while the owner and permission configuration are kept. Otherwise, it creates a new largeobject and writes out the data read from the GPU device memory.

  • 1st arg(device_nr): GPU device number where device memory is acquired
  • 2nd arg(ipc_mhandle): CUipcMemHandle identifier in bytea type
  • 3rd arg(offset): offset of the head position to read, from the GPU device memory region.
  • 4th arg(length): size to read in bytes
  • 5th arg(loid): OID of the largeobject to be written. 0 is assumed if no valid value is supplied.
  • result: OID of the written largeobject
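As a hedged usage sketch, reusing the handle exported from the foreign table ft above (the offset and length values are arbitrary, chosen only for illustration), the call below reads the first 1MB of the region into a new largeobject:

-- device 0, handle of ft, offset 0, read 1MB, loid 0 creates a new largeobject
SELECT lo_import_gpu(0, gstore_export_ipchandle('ft'), 0, 1048576, 0);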

lo_export_gpu(oid, int, bytea, bigint, bigint)

This function temporarily opens the GPU device memory region acquired by an external application, then writes out the contents of the specified largeobject onto that region.

  • 1st arg(loid): OID of the largeobject to be read
  • 2nd arg(device_nr): GPU device number where device memory is acquired
  • 3rd arg(ipc_mhandle): CUipcMemHandle identifier in bytea type
  • 4th arg(offset): offset of the head position to write, from the GPU device memory region.
  • 5th arg(length): size to write in bytes
  • result: Number of bytes actually written. If the length of the largeobject is less than the supplied length, it may return a value smaller than length.
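Conversely, a hedged sketch of the call shape that writes the contents of an existing largeobject back onto the head of the same region (the largeobject OID 16384 is a placeholder):

-- largeobject 16384, device 0, handle of ft, offset 0, write 1MB
SELECT lo_export_gpu(16384, 0, gstore_export_ipchandle('ft'), 0, 1048576);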