PG-Strom v2.0 Release
Major enhancements in PG-Strom v2.0 include:
- Overall redesign and stabilization of the internal infrastructure that manages GPUs
- CPU+GPU hybrid parallel execution
- SSD-to-GPU Direct SQL Execution
- In-memory columnar cache
- GPU memory store (gstore_fdw)
- Redesign and speed-up of GpuJoin and GpuPreAgg
- GpuPreAgg + GpuJoin + GpuScan combined GPU kernel
You can download the summary of new features from: PG-Strom v2.0 Technical Brief.
Prerequisites
- PostgreSQL v9.6, v10
- CUDA Toolkit 9.1
- Linux distributions supported by CUDA Toolkit
- Intel x86 64bit architecture (x86_64)
- NVIDIA GPU CC 6.0 or later (Pascal or Volta)
Complete redesign and stabilization of the internal infrastructure that manages GPU devices.
- Each PostgreSQL backend process uses at most one GPU at a time. On multi-GPU installations, PG-Strom assumes that the GPUs are used in combination with PostgreSQL's CPU parallel execution. This is usually not a problem, because the throughput at which CPUs can supply data to a GPU is much lower than the processing capability of the GPU. We prioritized simplicity of the software architecture.
- We began to use the demand paging feature of GPU device memory, which is supported on GPU models of the Pascal generation and later. In most SQL workloads, the exact size of the required result buffer cannot be known before execution, so we used to allocate a buffer larger than the estimate and retry part of the workload when the estimated size turned out to be insufficient. That design tied up GPU resources that could otherwise have been used by concurrent processes, and the complicated error-retry logic was a nightmare for software quality. Demand paging lets us eliminate or simplify this logic.
- We stopped using the CUDA asynchronous interface. With demand paging on GPU device memory, asynchronous APIs for DMA (like cuMemCpyHtoD) perform synchronously, which reduces concurrency and the utilization of GPU kernels. Instead of the CUDA asynchronous APIs, PG-Strom manages its own worker threads, each of which calls synchronous APIs. As a by-product, we could also eliminate asynchronous callbacks (cuStreamAddCallback), which makes it possible to use the MPS daemon, which places a restriction on this API.
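The worker-thread approach in the last item can be pictured with a minimal Python sketch. This is not PG-Strom's code: `blocking_gpu_task` is a hypothetical stand-in for a sequence of synchronous CUDA calls, and the thread count is arbitrary. The only point is that concurrency comes from several threads that each issue blocking calls, rather than from asynchronous APIs and callbacks.

```python
import queue
import threading


def blocking_gpu_task(chunk):
    """Hypothetical stand-in for a synchronous copy + kernel run + wait."""
    return sum(chunk)


def worker(tasks, results):
    # Each worker simply loops over blocking calls; no callbacks are registered.
    while True:
        item = tasks.get()
        if item is None:              # sentinel: no more work for this thread
            break
        task_id, chunk = item
        results.put((task_id, blocking_gpu_task(chunk)))


def run_tasks(chunks, num_workers=4):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for task_id, chunk in enumerate(chunks):
        tasks.put((task_id, chunk))
    for _ in threads:
        tasks.put(None)               # one sentinel per worker
    for t in threads:
        t.join()
    # All workers have exited, so the result queue can be drained safely.
    collected = {}
    while not results.empty():
        task_id, value = results.get()
        collected[task_id] = value
    return [collected[i] for i in range(len(chunks))]
```

Calling `run_tasks([[1, 2], [3, 4]])` returns the per-chunk results in submission order, regardless of which worker handled each chunk.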
CPU+GPU Hybrid Parallel Execution
- CPU parallel execution, introduced in PostgreSQL v9.6, is newly supported.
- The CustomScan logic of GpuScan, GpuJoin and GpuPreAgg provided by PG-Strom can run in parallel on multiple PostgreSQL background worker processes.
- Limitation: PG-Strom's own statistics shown by EXPLAIN ANALYZE are not reliable under CPU parallel execution. PostgreSQL v9.6 does not provide the ShutdownCustomScan callback of the CustomScan interface, so the coordinator process has no way to collect information from the worker processes before the DSM (Dynamic Shared Memory) segment is released.
SSD-to-GPU Direct SQL Execution
- In cooperation with the nvme_strom Linux kernel module, PG-Strom can load PostgreSQL data blocks from NVMe-SSD directly into GPU device memory, bypassing the CPU and host buffers. This makes PG-Strom applicable to workloads that must process data sets larger than system RAM.
- It can deliver very high throughput, close to the hardware limit, because the data stream skips the block-device and filesystem layers. The GPU then runs SQL workloads that usually reduce the amount of data the CPU has to process. Together, these characteristics redefine the GPU's role as an accelerator for I/O-intensive workloads as well, not only compute-intensive ones.
In-memory Columnar Cache
- For medium-sized data sets that fit in system RAM, it caches data blocks in a columnar format, which is better suited to GPU computing. If cached data blocks are found during a table scan, PG-Strom references the columnar cache in preference to PostgreSQL's shared buffers.
- The in-memory columnar cache can be built synchronously, or asynchronously by background workers.
- You may remember that a very early revision of PG-Strom had a similar feature. In the columnar cache newly implemented in v2.0, when a cached tuple is updated, the cache block that contains the updated tuple is invalidated. The columnar cache is never updated in place to follow updates of the row store, so performance degradation is quite limited.
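The invalidate-on-update policy described above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not PG-Strom's implementation: the `ColumnarCache` class, its method names, and the use of plain dictionaries keyed by block number are all hypothetical.

```python
class ColumnarCache:
    """Toy per-block columnar cache; all names here are illustrative."""

    def __init__(self, row_store):
        self.row_store = row_store    # block number -> list of row tuples
        self.cache = {}               # block number -> tuple of column lists

    def build(self, blockno):
        rows = self.row_store[blockno]
        # Transpose the rows of one block into per-column lists.
        self.cache[blockno] = tuple(list(col) for col in zip(*rows))

    def update_row(self, blockno, idx, new_row):
        self.row_store[blockno][idx] = new_row
        # Invalidate the whole cached block instead of patching it in place.
        self.cache.pop(blockno, None)

    def scan_column(self, blockno, colno):
        cols = self.cache.get(blockno)
        if cols is not None:
            return cols[colno], "cache"       # prefer the columnar copy
        # Fall back to the row store when the block is not cached.
        return [row[colno] for row in self.row_store[blockno]], "row-store"
```

After `update_row`, a scan of that block falls back to the row store until the block is rebuilt, which mirrors the idea that updates only cost an invalidation rather than a cache rewrite.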
GPU Memory Store (gstore_fdw)
- It allows SQL-level reads from and writes to a reserved region of GPU device memory with SELECT/INSERT/UPDATE/DELETE, through the foreign table interface.
- In v2.0, only the pgstrom internal data format is supported. It stores written data in PG-Strom's KDS_FORMAT_COLUMN buffer format, and can compress variable-length data using an LZ algorithm.
- In v2.0, the GPU memory store can be used as a data source for PL/CUDA user-defined functions.
Redesign and performance improvement of GpuJoin and GpuPreAgg
- We stopped using Dynamic Parallelism, which GpuJoin and GpuPreAgg used internally, and revised the entire logic of these operations. The old design suffered from low GPU utilization, because a GPU kernel that launched a sub-kernel and merely waited for its completion still occupied one of the GPU's execution slots.
- A by-product of this redesign is suspend/resume of GpuJoin. In principle, a SQL JOIN may generate more rows than it receives as input, and the number cannot be predicted in advance. The new design suspends the GPU kernel once the result buffer runs out of space, then resumes it with a new result buffer. This simplifies the logic that estimates the result buffer size, and eliminates run-time GPU kernel retries caused by a lack of buffer space.
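The suspend/resume idea can be sketched with a Python generator. This is only an analogy under stated assumptions: the function name, the nested-loop join, and the fixed `buffer_size` are illustrative and say nothing about how the actual GPU kernel is structured.

```python
def join_with_suspend(outer, inner, buffer_size):
    """Nested-loop equi-join that hands back a result buffer whenever it fills up.

    `outer` and `inner` are lists of (key, value) pairs; `buffer_size` must be >= 1.
    """
    buf = []
    for o_key, o_val in outer:
        for i_key, i_val in inner:
            if o_key == i_key:
                buf.append((o_key, o_val, i_val))
                if len(buf) == buffer_size:
                    yield buf     # suspend: the caller takes the full buffer
                    buf = []      # resume: continue with a fresh buffer
    if buf:
        yield buf                 # flush the final, partially filled buffer
```

The caller never needs to estimate the total result size up front; it just consumes buffers until the generator is exhausted, and no work is redone when a buffer fills.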
GpuPreAgg+GpuJoin+GpuScan combined GPU kernel
- When GPU-executable SCAN, JOIN and GROUP BY steps are cascaded in series, a single GPU kernel invocation runs the series of tasks equivalent to GpuScan, GpuJoin and GpuPreAgg. This approach minimizes data exchange between the CPU and GPU. For example, the result buffer of GpuJoin is used directly as the input buffer of GpuPreAgg.
- This feature is especially valuable if combined with SSD-to-GPU Direct SQL Execution.
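As a rough illustration of why fusing the three steps avoids intermediate data exchange, the sketch below performs the filter, the join lookup and the partial aggregation in a single pass over the input. The function name, the dictionary-based join and the SUM/COUNT aggregate are assumptions made for the example, not a description of PG-Strom's kernel.

```python
from collections import defaultdict


def fused_scan_join_preagg(fact_rows, dim_table, predicate):
    """One pass: filter (scan), hash lookup (join), partial SUM/COUNT (pre-agg)."""
    groups = defaultdict(lambda: [0, 0])      # group key -> [sum, count]
    for dim_id, amount in fact_rows:
        if not predicate(amount):             # scan step: apply the WHERE clause
            continue
        category = dim_table.get(dim_id)      # join step: inner join on dim_id
        if category is None:
            continue
        acc = groups[category]                # pre-aggregation step
        acc[0] += amount
        acc[1] += 1
    # Only the small per-group partial results leave the fused pass.
    return {key: tuple(val) for key, val in groups.items()}
```

No filtered row set or joined row set is ever materialized; only the per-group partial aggregates are returned.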
PL/CUDA Related Enhancements
- #plcuda_include is enhanced to accept a SQL function that returns text type. Because the injected code block can change according to the function's arguments, it can also be used to generate multiple GPU kernel variations, not only to include externally defined functions.
- If a PL/CUDA function takes a reggstore type argument, the GPU kernel function receives a pointer to the GPU memory store. Note that it does not receive the OID value.
Other Enhancements
- The lo_import_gpu and lo_export_gpu functions allow you to directly import the contents of GPU device memory acquired by external applications, or to export the contents of a largeobject to GPU device memory.
- Added RPM packages that track the PostgreSQL packages distributed by the PostgreSQL Global Development Group.
- All software packages can be downloaded from the HeteroDB SWDC (Software Distribution Center).
- The PG-Strom documentation was entirely rewritten using markdown and mkdocs. This makes the documentation easier to maintain than the previous HTML-based approach, so we expect timely updates as new features are developed.
- The regression tests for PG-Strom are built on top of PostgreSQL's regression test framework.
PostgreSQL v9.5 Support
- PostgreSQL v9.6 brought big changes to both the optimizer and the executor to support CPU parallel query execution. The biggest change for extension modules that interact with them is an interface enhancement called "upper planner path-ification". It allows the planner to choose an optimal execution plan from multiple candidates based on estimated cost, even for aggregation or sorting.
- This is fundamentally different from the older approach, where we used hooks to rewrite the query execution plan and inject GpuPreAgg. It lets us inject the GpuPreAgg node in a more reasonable and reliable way, and we could drop the complicated (and buggy) logic that rewrote an already constructed execution plan.
- The CustomScan interface was also enhanced to support CPU parallel execution. For these reasons, we dropped PostgreSQL v9.5 support in order to follow the new enhancements.
- We dropped GpuSort because it offered little performance advantage.
- Sorting is a workload well suited to GPUs. However, to sort data larger than GPU device memory, we have to split it into multiple chunks, sort each chunk separately, and then merge the chunks on the CPU to produce the final result.
- A larger chunk size reduces the CPU load of merging multiple chunks; on the other hand, a larger chunk size means a longer lead time before the GPU kernel that sorts it can be launched. This trade-off prevents PG-Strom from using asynchronous processing to hide data transfer latency.
- Because this problem is hard to solve, or it is too early to solve it, we have dropped the GpuSort feature for now.
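The chunked sort described above follows the general pattern sketched below. Python's built-in `sorted` stands in for the per-chunk GPU sort, and `heapq.merge` stands in for the CPU-side merge; both substitutions, and the function name, are assumptions for illustration only.

```python
import heapq


def chunked_sort(values, chunk_size):
    """Sort fixed-size chunks independently, then k-way merge them."""
    # Stand-in for sorting each chunk on the GPU.
    chunks = [sorted(values[i:i + chunk_size])
              for i in range(0, len(values), chunk_size)]
    # Stand-in for the CPU merge step; its work grows with the number of chunks.
    merged = list(heapq.merge(*chunks))
    return merged, len(chunks)
```

With a smaller `chunk_size` the same input produces more chunks for the merge step, which is the CPU-side cost in the trade-off described above.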