This chapter introduces advanced features of PG-Strom.

SSD-to-GPU Direct SQL Execution


For fast execution of SQL workloads, processors must be supplied with a rapid data stream from storage or memory, in addition to the processors' own execution efficiency. A processor runs idle if the data stream is not delivered.

SSD-to-GPU Direct SQL Execution directly connects an NVMe-SSD, which enables high-speed I/O by direct attachment to the PCIe bus, with a GPU device attached to the same PCIe bus, and runs SQL workloads at very high speed by supplying a data stream close to the wire speed of the hardware.

Usually, PostgreSQL data blocks on the storage are first loaded to CPU/RAM through the PCIe bus; then PostgreSQL evaluates WHERE-clauses for filtering, or JOIN/GROUP BY, according to the query execution plan. Due to the characteristics of analytic workloads, the result data set is much smaller than the source data set. For example, it is not a rare case to read billions of rows but output just hundreds of rows after aggregation with GROUP BY.

In other words, we consume PCIe bus bandwidth to move junk data; however, we cannot determine whether rows are necessary or not prior to their evaluation by SQL workloads on the CPU. So this is an unavoidable restriction in the usual implementation.

SSD2GPU Direct SQL Execution Overview

SSD-to-GPU Direct SQL Execution changes this data flow. It loads data blocks directly to the GPU using peer-to-peer DMA over the PCIe bus, then runs SQL workloads on the GPU device to reduce the number of rows to be processed by the CPU. In other words, it utilizes the GPU as a pre-processor of SQL located between the storage and CPU/RAM to reduce the CPU's load, and as a result accelerates I/O processing.

This feature internally uses NVIDIA GPUDirect RDMA. It allows peer-to-peer data transfer over the PCIe bus between GPU device memory and third-party devices, coordinated by a custom Linux kernel module. So this feature requires the NVMe-Strom driver, a Linux kernel module, in addition to PG-Strom, which is a PostgreSQL extension module.

Also note that this feature supports only NVMe-SSD; it does not support SAS or SATA SSD. We have tested several NVMe-SSD models. You can refer to the 002: HW Validation List for more information.

System Setup

Driver Installation

The nvme_strom package is required to activate SSD-to-GPU Direct SQL Execution. This package contains a custom Linux kernel module which intermediates P2P DMA from NVMe-SSD to GPU. You can obtain the package from the HeteroDB Software Distribution Center.

If the heterodb-swdc package is already installed, you can install the package using the yum command.

$ sudo yum install nvme_strom
 Package             Arch            Version            Repository         Size
 nvme_strom          x86_64          0.8-1.el7          heterodb          178 k

Transaction Summary
Install  1 Package
DKMS: install completed.
  Verifying  : nvme_strom-0.8-1.el7.x86_64                                  1/1

  nvme_strom.x86_64 0:0.8-1.el7


Once the nvme_strom package is installed, you can confirm the nvme_strom module is loaded using the lsmod command below.

$ lsmod | grep nvme
nvme_strom             12625  0
nvme                   27722  4
nvme_core              52964  9 nvme

Designing Tablespace

SSD-to-GPU Direct SQL Execution is invoked when the following conditions are met.

  • The target table to be scanned is located on a partition consisting of NVMe-SSD.
    • A /dev/nvmeXXXX block device, or an md-raid0 volume which consists of NVMe-SSDs only.
  • The target table size is larger than pg_strom.nvme_strom_threshold.
    • You can adjust this configuration. Its default is the physical RAM size of the system plus 1/3 of the shared_buffers configuration.


Striped read from multiple NVMe-SSD using md-raid0 requires the enterprise subscription provided by HeteroDB,Inc.

In order to deploy tables on a partition consisting of NVMe-SSD, you can either construct the entire database cluster on the NVMe-SSD volume, or use the tablespace feature of PostgreSQL to place particular tables or databases on the NVMe-SSD volume.

For example, you can create a new tablespace as follows, if the NVMe-SSD is mounted at /opt/nvme.

CREATE TABLESPACE my_nvme LOCATION '/opt/nvme';


In order to create a new table on the tablespace, specify the TABLESPACE option with the CREATE TABLE command, as below.

CREATE TABLE my_table (...) TABLESPACE my_nvme;

Or, use the ALTER DATABASE command as follows to change the default tablespace of the database. Note that the tablespace of existing tables is not changed in this case.

ALTER DATABASE my_database SET TABLESPACE my_nvme;   -- "my_database" is a placeholder name



Controls using GUC parameters

There are two GUC parameters related to SSD-to-GPU Direct SQL Execution.

The first is pg_strom.nvme_strom_enabled, which simply turns the SSD-to-GPU Direct SQL Execution feature on or off. If off, SSD-to-GPU Direct SQL Execution is never used regardless of the table size or physical location. Default is on.

The other is pg_strom.nvme_strom_threshold, which specifies the minimum table size needed to invoke SSD-to-GPU Direct SQL Execution.

PG-Strom chooses SSD-to-GPU Direct SQL Execution when the target table is located on an NVMe-SSD volume (or an md-raid0 volume which consists of NVMe-SSDs only) and the table size is larger than this parameter. Its default is the sum of the physical memory size and 1/3 of shared_buffers. This means the default configuration invokes SSD-to-GPU Direct SQL Execution only for tables that certainly cannot be processed in memory.
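As a rough illustration, the default threshold works out as follows. The sizes here are hypothetical; the real values come from the running system and its postgresql.conf.

```python
# Sketch of the default pg_strom.nvme_strom_threshold heuristic:
# physical RAM size plus 1/3 of the shared_buffers setting.
# The sizes below are assumptions for illustration only.
GiB = 1 << 30
phys_ram = 64 * GiB          # assumed physical memory of the host
shared_buffers = 16 * GiB    # assumed shared_buffers configuration

threshold = phys_ram + shared_buffers // 3
print(round(threshold / GiB, 1))   # -> 69.3
```

Only tables larger than this (about 69GiB under these assumptions) are read via SSD-to-GPU Direct SQL Execution by default.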

Even though SSD-to-GPU Direct SQL Execution has an advantage on a single table scan, use of the disk cache may work better on the second or later execution for tables small enough to be loaded onto main memory.

Of course, this assumption is not always right, depending on the workload characteristics.

Ensure usage of SSD-to-GPU Direct SQL Execution

The EXPLAIN command allows you to check whether SSD-to-GPU Direct SQL Execution will be used for the target query or not.

In the example below, the scan on the lineorder table by Custom Scan (GpuJoin) shows NVMe-Strom: enabled. In this case, SSD-to-GPU Direct SQL Execution will be used to read from the lineorder table.

# explain (costs off)
select sum(lo_revenue), d_year, p_brand1
from lineorder, date1, part, supplier
where lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12'
and s_region = 'AMERICA'
  group by d_year, p_brand1
  order by d_year, p_brand1;
                                          QUERY PLAN
   Group Key: date1.d_year, part.p_brand1
   ->  Sort
         Sort Key: date1.d_year, part.p_brand1
         ->  Custom Scan (GpuPreAgg)
               Reduction: Local
               GPU Projection: pgstrom.psum((lo_revenue)::double precision), d_year, p_brand1
               Combined GpuJoin: enabled
               ->  Custom Scan (GpuJoin) on lineorder
                     GPU Projection: date1.d_year, part.p_brand1, lineorder.lo_revenue
                     Outer Scan: lineorder
                     Depth 1: GpuHashJoin  (nrows 2406009600...97764190)
                              HashKeys: lineorder.lo_partkey
                              JoinQuals: (lineorder.lo_partkey = part.p_partkey)
                              KDS-Hash (size: 10.67MB)
                     Depth 2: GpuHashJoin  (nrows 97764190...18544060)
                              HashKeys: lineorder.lo_suppkey
                              JoinQuals: (lineorder.lo_suppkey = supplier.s_suppkey)
                              KDS-Hash (size: 131.59MB)
                     Depth 3: GpuHashJoin  (nrows 18544060...18544060)
                              HashKeys: lineorder.lo_orderdate
                              JoinQuals: (lineorder.lo_orderdate = date1.d_datekey)
                              KDS-Hash (size: 461.89KB)
                     NVMe-Strom: enabled
                     ->  Custom Scan (GpuScan) on part
                           GPU Projection: p_brand1, p_partkey
                           GPU Filter: (p_category = 'MFGR#12'::bpchar)
                     ->  Custom Scan (GpuScan) on supplier
                           GPU Projection: s_suppkey
                           GPU Filter: (s_region = 'AMERICA'::bpchar)
                     ->  Seq Scan on date1
(31 rows)

Attention for visibility map

Right now, the GPU routines of PG-Strom cannot run MVCC visibility checks per row, because only the host code has the special data structures needed for visibility checks. This leads to a problem.

We cannot know which rows are visible or invisible at the time when PG-Strom issues P2P DMA requests for the NVMe-SSD, because the contents of the storage blocks are not yet loaded to CPU/RAM, and the MVCC-related attributes are stored within the individual records. PostgreSQL had a similar problem when it introduced IndexOnlyScan.

To address the problem, PostgreSQL has the visibility map infrastructure: a set of flags indicating whether all records in a particular data block are visible to all transactions. If the associated bit is set, we know the block has no invisible records without reading the block itself.

SSD-to-GPU Direct SQL Execution utilizes this infrastructure. It checks the visibility map first, then issues SSD-to-GPU P2P DMA reads only for the "all-visible" blocks.
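The block-routing decision can be sketched as follows. This is a toy model; the real visibility map is a compact bitmap maintained by VACUUM, not a Python list.

```python
# Toy model: one "all-visible" flag per data block.
all_visible = [True, True, False, True]   # block 2 may contain invisible rows

# Blocks whose flag is set can be read via SSD-to-GPU P2P DMA;
# the others must go through the normal PostgreSQL buffer path,
# where per-row MVCC visibility checks run on the CPU.
p2p_blocks    = [b for b, v in enumerate(all_visible) if v]
buffer_blocks = [b for b, v in enumerate(all_visible) if not v]
print(p2p_blocks, buffer_blocks)   # -> [0, 1, 3] [2]
```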

VACUUM constructs the visibility map, so you can force PostgreSQL to construct it by explicitly launching the VACUUM command.

VACUUM ANALYZE lineorder;

In-memory Columnar Cache


PG-Strom has another storage-related feature to supply processors with a data stream.

The in-memory columnar cache reads data blocks of the target table, converts them from PostgreSQL's row format to a columnar format suitable for summary and analytics, and caches them in memory.

This feature requires no special hardware, unlike SSD-to-GPU Direct SQL Execution. On the other hand, RAM capacity is still smaller than SSD, so this feature is suitable for handling a "not so large" data set, up to 60%-75% of the system RAM size.

This feature is not "a columnar store". The cached and converted data blocks are flushed once the PostgreSQL server process restarts, for example. When any cached rows get updated, PG-Strom invalidates the columnar cache block which contains the updated rows.

This design reflects the fact that columnar formats are vulnerable to update workloads. If we tried to update the columnar store while keeping it consistent with updates to the row store, a huge degradation of write performance would be unavoidable. On the other hand, invalidating the columnar cache block which contains the updated row is a lightweight operation. PG-Strom can switch the GPU kernel to be invoked between row and columnar formats according to the format of the loaded data blocks, so it works flexibly even if a columnar cache block gets invalidated and PG-Strom has to load data blocks from the shared buffer of PostgreSQL.

overview of in-memory columnar cache

System Setup

Location of the columnar cache

The pg_strom.ccache_base_dir parameter allows you to specify the path under which the columnar cache is stored. The default is /dev/shm, where general Linux distributions mount a tmpfs filesystem, so files under the directory are "volatile", with no backing store.

Custom configuration of this parameter enables construction of the columnar cache on larger and reasonably fast storage with a backing store, like NVMe-SSD. However, note that an update of cached rows invalidates the whole chunk (128MB) which contains the updated rows. This may lead to unexpected performance degradation if the workload mixes frequent reads and writes involving I/O operations.
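The invalidation granularity can be sketched as follows. The 128MB chunk size comes from the text; the byte offset is a hypothetical example, and the real cache is keyed by block numbers rather than raw offsets.

```python
# Updating any row invalidates the whole 128MB chunk containing it.
CHUNK = 128 << 20   # 128MB chunk size of the columnar cache

def chunk_range(offset):
    """Return the (start, end) byte range invalidated by a write at `offset`."""
    start = (offset // CHUNK) * CHUNK
    return start, start + CHUNK

start, end = chunk_range(300 << 20)   # a write at the 300MB position...
print(start >> 20, end >> 20)         # -> 256 384  (invalidates 256MB-384MB)
```

So a single-row update can discard 128MB of cached columnar data, which is why write-heavy workloads suffer.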

Columnar Cache Builder Configuration

PG-Strom can build in-memory columnar cache automatically and asynchronously using one or multiple background workers. These background workers are called columnar cache builder.

A columnar cache builder scans the target tables in the specified database, in round-robin fashion, converts them to columnar format, and keeps the result in the cache. It is an asynchronous job, separate from the backend processes which handle users' SQL.

Once a columnar cache is built, any other backend process can reference it. Unlike the usual disk cache mechanism, PG-Strom never constructs columnar cache on demand, even on access to an area where the columnar cache is not built yet. In this case, PG-Strom loads row data through the normal storage system of PostgreSQL.

The number of columnar cache builders is fixed at startup, so you need to set the pg_strom.ccache_num_builders parameter and then restart PostgreSQL to increase the number of workers.

The pg_strom.ccache_databases parameter configures the databases associated with the columnar cache builders. It requires superuser privilege to set, and is updatable while PostgreSQL is running. (Of course, it can also be assigned in the postgresql.conf configuration at startup.)

Once a comma-separated list of database names is assigned, the columnar cache builders are associated with the specified databases in rotation. For example, if 5 columnar cache builders are running and 3 databases (postgres,my_test,benchmark) are assigned in pg_strom.ccache_databases, then 2 columnar cache builders are assigned to each of the postgres and my_test databases, and 1 columnar cache builder is assigned to the benchmark database.
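The rotation described above can be sketched as follows, reproducing the example of 5 builders over 3 databases:

```python
# Round-robin association of columnar cache builders to databases.
databases = ["postgres", "my_test", "benchmark"]
builders = 5

assignment = [databases[i % len(databases)] for i in range(builders)]
print(assignment)
# -> ['postgres', 'my_test', 'benchmark', 'postgres', 'my_test']
```

As in the text, postgres and my_test each get 2 builders, and benchmark gets 1.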

Source Table Configuration

The DBA needs to specify the target tables for which to build the columnar cache.

The SQL function pgstrom_ccache_enabled(regclass) adds the supplied table to the targets for columnar cache construction. Conversely, the SQL function pgstrom_ccache_disabled(regclass) removes the supplied table from the targets.

Internally, this is implemented as a special trigger function which invalidates the columnar cache on writes to the target table. In other words, PG-Strom does not build a columnar cache for tables that provide no way to invalidate it.

postgres=# select pgstrom_ccache_enabled('t0');
(1 row)


Check status of columnar cache

The pgstrom.ccache_info system view provides the status of the current columnar cache.

You can check the table, block number, cache creation time and last access time per chunk.

contrib_regression_pg_strom=# SELECT * FROM pgstrom.ccache_info ;
 database_id | table_id | block_nr | nitems  |  length   |             ctime             |             atime
-------------+----------+----------+---------+-----------+-------------------------------+-------------------------------
       13323 | 25887    |   622592 | 1966080 | 121897472 | 2018-02-18 14:31:30.898389+09 | 2018-02-18 14:38:43.711287+09
       13323 | 25887    |   425984 | 1966080 | 121897472 | 2018-02-18 14:28:39.356952+09 | 2018-02-18 14:38:43.514788+09
       13323 | 25887    |    98304 | 1966080 | 121897472 | 2018-02-18 14:28:01.542261+09 | 2018-02-18 14:38:42.930281+09
         :       :             :         :          :                :                               :
       13323 | 25887    |    16384 | 1963079 | 121711472 | 2018-02-18 14:28:00.647021+09 | 2018-02-18 14:38:42.909112+09
       13323 | 25887    |   737280 | 1966080 | 121897472 | 2018-02-18 14:34:32.249899+09 | 2018-02-18 14:38:43.882029+09
       13323 | 25887    |   770048 | 1966080 | 121897472 | 2018-02-18 14:28:57.321121+09 | 2018-02-18 14:38:43.90157+09
(50 rows)

Check usage of columnar cache

You can check whether a particular query may reference the columnar cache or not using the EXPLAIN command.

The query below joins the tables t0 and t1, and the Custom Scan (GpuJoin) which contains the scan on t0 shows CCache: enabled. It means the columnar cache may be referenced at the scan on t0; however, whether it is actually referenced is not certain until query execution, because the columnar cache may be invalidated by concurrent updates.

postgres=# EXPLAIN SELECT id,ax FROM t0 NATURAL JOIN t1 WHERE aid < 1000;

                                  QUERY PLAN
 Custom Scan (GpuJoin) on t0  (cost=12398.65..858048.45 rows=1029348 width=12)
   GPU Projection:,
   Outer Scan: t0  (cost=10277.55..864623.44 rows=1029348 width=8)
   Outer Scan Filter: (aid < 1000)
   Depth 1: GpuHashJoin  (nrows 1029348...1029348)
            HashKeys: t0.aid
            JoinQuals: (t0.aid = t1.aid)
            KDS-Hash (size: 10.78MB)
   CCache: enabled
   ->  Seq Scan on t1  (cost=0.00..1935.00 rows=100000 width=12)
(10 rows)

The EXPLAIN ANALYZE command tells how many times the columnar cache was referenced during query execution.

After execution of this query, the Custom Scan (GpuJoin) which contains the scan on t0 shows CCache Hits: 50, meaning the columnar cache was referenced 50 times. Because the chunk size of the columnar cache is 128MB, 6.4GB of storage access was replaced by columnar cache references.

postgres=# EXPLAIN ANALYZE SELECT id,ax FROM t0 NATURAL JOIN t1 WHERE aid < 1000;

                                    QUERY PLAN

 Custom Scan (GpuJoin) on t0  (cost=12398.65..858048.45 rows=1029348 width=12)
                              (actual time=91.766..723.549 rows=1000224 loops=1)
   GPU Projection:,
   Outer Scan: t0  (cost=10277.55..864623.44 rows=1029348 width=8)
                   (actual time=7.129..398.270 rows=100000000 loops=1)
   Outer Scan Filter: (aid < 1000)
   Rows Removed by Outer Scan Filter: 98999776
   Depth 1: GpuHashJoin  (plan nrows: 1029348...1029348, actual nrows: 1000224...1000224)
            HashKeys: t0.aid
            JoinQuals: (t0.aid = t1.aid)
            KDS-Hash (size plan: 10.78MB, exec: 64.00MB)
   CCache Hits: 50
   ->  Seq Scan on t1  (cost=0.00..1935.00 rows=100000 width=12)
                       (actual time=0.011..13.542 rows=100000 loops=1)
 Planning time: 23.390 ms
 Execution time: 1409.073 ms
(13 rows)
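The 6.4GB figure above is simple arithmetic on the hit count and chunk size:

```python
# 50 columnar cache hits, each replacing one 128MB chunk of storage access.
chunk_mb = 128
hits = 50
print(hits * chunk_mb / 1000)   # -> 6.4 (GB, using decimal units as in the text)
```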

Attention for DROP DATABASE command

When a columnar cache builder constructs columnar cache asynchronously, the background worker process is internally connected to the specified database. When the DROP DATABASE command tries to drop a database, PostgreSQL checks whether any session is connected to that database. At that time, even if no user session is connected, DROP DATABASE will fail because of the columnar cache builder which keeps a connection to the database.

Please remove the database name from the pg_strom.ccache_databases parameter prior to execution of the DROP DATABASE command. The columnar cache builders will restart soon, then try to connect to the databases according to the new configuration.

GPU Memory Store (gstore_fdw)


Usually, PG-Strom uses GPU device memory for temporary purposes only. It allocates a certain amount of device memory needed for query execution, transfers the data blocks, and launches a GPU kernel to process the SQL workload. Once the GPU kernel finishes, these device memory regions are released immediately, so that unused device memory can be re-allocated to other workloads.

This design allows multiple concurrent sessions, and scan workloads on tables larger than the GPU device memory. However, it may not be optimal in some circumstances.

A typical example is repeated calculation under various conditions on a data set that is not so large, with a scale small enough to fit in the GPU device memory. This applies to workloads such as machine learning, pattern matching or similarity search.

For modern GPUs, it is not difficult to process a few gigabytes of data in memory, but setting up the data to be loaded onto GPU device memory and transferring it is a costly process.

In addition, since variable length data in PostgreSQL has a size limitation of up to 1GB, it restricts the data format when given as an argument to a PL/CUDA function, even if the data size itself fits in the GPU device memory.

GPU memory store (gstore_fdw) is a feature that preserves a region of GPU device memory and loads data onto it beforehand. It makes argument setup and data loading for each invocation of a PL/CUDA function unnecessary, and eliminates the 1GB limitation on variable length data, because it allows GPU device memory allocation up to the device capacity.

As the name suggests, gstore_fdw is implemented using the foreign data wrapper interface of PostgreSQL. You can modify the data structure on GPU device memory using INSERT, UPDATE or DELETE commands on the foreign table managed by gstore_fdw. In a similar way, you can read the data using the SELECT command.

A PL/CUDA function can reference the data stored in GPU device memory through the foreign table. Right now, the GPU programs which are transparently generated from SQL statements cannot reference this device memory region; however, we plan to enhance this feature in a future release.

GPU memory store


Usually it takes the 3 steps below to create a foreign table.

  • Define a foreign-data-wrapper using CREATE FOREIGN DATA WRAPPER command
  • Define a foreign server using CREATE SERVER command
  • Define a foreign table using CREATE FOREIGN TABLE command

The first 2 steps above are included in the CREATE EXTENSION pg_strom command. All you need to run individually is the CREATE FOREIGN TABLE command at the end.

CREATE FOREIGN TABLE ft (
    id int,
    signature smallint[] OPTIONS (compression 'pglz')
)
SERVER gstore_fdw OPTIONS(pinning '0', format 'pgstrom');

You can specify several options on creation of the foreign table with the CREATE FOREIGN TABLE command.

SERVER gstore_fdw is a mandatory option. It indicates the new foreign table is managed by gstore_fdw.

The options below are supported in the OPTIONS clause.

name         target  description
-----------  ------  ------------------------------------------------------------------------------
pinning      table   Specifies the device number of the GPU where device memory is preserved.
format       table   Specifies the internal data format on GPU device memory. Default is pgstrom.
compression  column  Specifies whether variable length data is compressed or not. Default is uncompressed.

Right now, only pgstrom is supported for the format option. It is the same data format PG-Strom uses for the in-memory columnar cache. In most cases, there is no need to pay attention to the internal data format when writing or reading the GPU data store using SQL. On the other hand, you do need to consider it when you program a PL/CUDA function or share the GPU device memory with external applications using an IPC handle.

Right now, only pglz is supported for the compression option. This compression logic adopts the same data format and algorithm PostgreSQL uses to compress variable length data larger than its threshold. It can be decompressed by the GPU internal function pglz_decompress() from a PL/CUDA function. Due to the characteristics of the compression algorithm, it is valuable for representing sparse matrices that are mostly zero.


Loading data

Like normal tables, you can write to GPU device memory through the foreign table using the INSERT, UPDATE and DELETE commands.

Note that gstore_fdw acquires a SHARE UPDATE EXCLUSIVE lock at the beginning of these commands. This means only a single transaction can update a gstore_fdw foreign table at any given time. It is a trade-off: in exchange, we don't need to check visibility per record when a PL/CUDA function references a gstore_fdw foreign table.

As with regular tables, contents written to a gstore_fdw foreign table are not visible to other sessions until the transaction is committed. This is a significant feature to ensure the atomicity of transactions; however, it also means that the older revision of the gstore_fdw foreign table contents must be kept in GPU device memory until every concurrent transaction which may reference the older revision is committed or aborted.

So, even though you can run INSERT, UPDATE or DELETE commands as if it were a regular table, you should avoid updating a few rows and committing the transaction many times. Basically, INSERT of massive rows at once (bulk loading) is recommended.

Unlike regular tables, the contents of a gstore_fdw foreign table are volatile, so they are easily lost by a power-down or PostgreSQL restart. Therefore, what you load onto a gstore_fdw foreign table should be reconstructable from another data source.

Checking the memory consumption

You can check the chunks preserved on GPU device memory by gstore_fdw using the pgstrom.gstore_fdw_chunk_info system view, and the actual device memory allocation with the nvidia-smi command.

postgres=# select * from pgstrom.gstore_fdw_chunk_info ;
 database_oid | table_oid | revision | xmin | xmax | pinning | format  |  rawsize  |  nitems
--------------+-----------+----------+------+------+---------+---------+-----------+----------
        13806 |     26800 |        3 |    2 |    0 |       0 | pgstrom | 660000496 | 15000000
        13806 |     26797 |        2 |    2 |    0 |       0 | pgstrom | 440000496 | 10000000
(2 rows)
$ nvidia-smi
Wed Apr  4 15:11:50 2018
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P40           Off  | 00000000:02:00.0 Off |                    0 |
| N/A   39C    P0    52W / 250W |   1221MiB / 22919MiB |      0%      Default |

| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0      6885      C   ...bgworker: PG-Strom GPU memory keeper     1211MiB |

Internal Data Format

See the notes below for details of the internal data format gstore_fdw writes to GPU device memory.

  • Detail of the pgstrom format

CUDA provides the special APIs cuIpcGetMemHandle() and cuIpcOpenMemHandle(). The first allows an application to get a unique identifier for GPU device memory it has allocated. The second allows another application to map the shared GPU device memory region. In other words, they support something like host-system shared memory between processes.

This unique identifier is a CUipcMemHandle object, which is simple binary data of 64 bytes. This section introduces the SQL functions which exchange GPU device memory with other applications using the CUipcMemHandle identifier.


gstore_export_ipchandle(reggstore)

This function gets the CUipcMemHandle identifier of the GPU device memory preserved by the gstore_fdw foreign table, and returns it as binary data of bytea type. If the foreign table is empty and has no GPU device memory, it returns NULL.

  • 1st arg(ftable_oid): OID of the foreign table. Because it is of reggstore type, you can specify the foreign table by its name string.
  • result: CUipcMemHandle identifier in the bytea type.
# select gstore_export_ipchandle('ft');

(1 row)

lo_import_gpu(int, bytea, bigint, bigint, oid=0)

This function temporarily opens the GPU device memory region acquired by an external application, reads from the region, and writes the data out as a PostgreSQL largeobject. If the largeobject already exists, its contents are replaced by the data read from GPU device memory, keeping the owner and permission configuration. Otherwise, it creates a new largeobject and writes out the data read from GPU device memory.

  • 1st arg(device_nr): GPU device number where device memory is acquired
  • 2nd arg(ipc_mhandle): CUipcMemHandle identifier in bytea type
  • 3rd arg(offset): offset of the head position to read, within the GPU device memory region.
  • 4th arg(length): size to read in bytes
  • 5th arg(loid): OID of the largeobject to be written. 0 is assumed if no valid value is supplied.
  • result: OID of the written largeobject

lo_export_gpu(oid, int, bytea, bigint, bigint)

This function reads the contents of a PostgreSQL largeobject and writes them out to the GPU device memory region acquired by an external application.

  • 1st arg(loid): OID of the largeobject to be read
  • 2nd arg(device_nr): GPU device number where device memory is acquired
  • 3rd arg(ipc_mhandle): CUipcMemHandle identifier in bytea type
  • 4th arg(offset): offset of the head position to write, from the GPU device memory region.
  • 5th arg(length): size to write in bytes
  • result: Number of bytes actually written. If the length of the largeobject is less than length, it may return a value less than length.