PG-Strom v2.3 Release

PG-Strom Development Team (1-Apr-2020)

Overview

Major changes in PG-Strom v2.3 include:

  • GpuJoin supports parallel construction of the inner buffer
  • Arrow_Fdw becomes writable; INSERT/TRUNCATE supported
  • The pg2arrow command supports an 'append' mode
  • A new mysql2arrow command was added

Prerequisites

  • PostgreSQL v10, v11, v12
  • CUDA Toolkit 10.1 or later
  • Linux distributions supported by CUDA Toolkit
  • Intel x86 64bit architecture (x86_64)
  • NVIDIA GPU CC 6.0 or later (Pascal or Volta)

New Features

  • GpuJoin supports parallel construction of inner buffer
    • In older versions, only the backend process could construct the inner buffer of GpuJoin. This restriction caused a serious slowdown in parallel scans of partitioned tables.
    • This version allows both the backend and worker processes to construct the inner buffer. When scanning a partitioned table, any process assigned to a particular child table can start its GpuJoin operations immediately.
  • Refactoring of the partition-wise asymmetric GpuJoin
    • After refactoring of the partition-wise asymmetric GpuJoin, the optimizer now prefers multi-level GpuJoin when it offers a cheaper execution cost.
  • Arrow_Fdw becomes writable; INSERT/TRUNCATE supported
    • An Arrow_Fdw foreign table now allows bulk loading by INSERT and data elimination by pgstrom.arrow_fdw_truncate, as sketched in the example after this list.
  • The pg2arrow command supports an 'append' mode.
    • We added the --append option to the pg2arrow command. As the name suggests, it appends query results to an existing Apache Arrow file.
    • The -t table option was also added as a shorthand for SELECT * FROM table.
  • A new mysql2arrow command was added.
    • We added the mysql2arrow command, which connects to a MySQL server instead of PostgreSQL and writes out SQL query results as Apache Arrow files.
    • Its functionality is equivalent to pg2arrow, except for the enum data type: mysql2arrow saves enum values as flat Utf8 values, without DictionaryBatch chunks.
  • Regression tests were added
    • Several test cases were added on top of the PostgreSQL regression test framework.
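
Below is a minimal sketch of the writable Arrow_Fdw; the table, column, and file names are illustrative, and the exact foreign table options are described in the manual.

      -- an Arrow_Fdw foreign table backed by an Apache Arrow file
      CREATE FOREIGN TABLE flogdata (
          ts      timestamp,
          dev_id  int,
          temp    float4
      ) SERVER arrow_fdw
        OPTIONS (file '/path/to/logdata.arrow');

      -- bulk loading by INSERT appends query results to the file
      INSERT INTO flogdata
          SELECT ts, dev_id, temp FROM logdata_pg;

      -- data elimination; Arrow_Fdw has no per-row DELETE, so the file
      -- is truncated as a whole
      SELECT pgstrom.arrow_fdw_truncate('flogdata');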

Significant bug fixes

  • Revised cache invalidation logic for GPU device functions / types
    • Older versions invalidated all metadata cache entries for GPU device functions and types whenever an ALTER command was executed. This was revised so that only the entries actually affected are invalidated.
  • Fixed extreme performance degradation when GROUP BY contains the same grouping key twice, or any even number of times.
    • GpuPreAgg combined the hash values of the GROUP BY grouping keys using XOR, so whenever the same column appeared an even number of times, the combined hash was always zero. A randomization step was added for better hash distribution; see the illustration after this list.
  • Potential infinite loop on GpuScan
    • Due to uninitialized values, GpuScan could fall into an infinite loop when SSD-to-GPU Direct SQL is available.
  • Potential GPU kernel crash on GpuJoin
    • Due to uninitialized values, GpuJoin could crash the GPU kernel when three or more tables are joined.
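
To illustrate the GROUP BY fix: the per-key hash values were combined by XOR, so a key listed an even number of times cancelled itself out (hash(x) XOR hash(x) = 0), and every row fell into the same hash slot. The query below is a hypothetical example of the problematic shape.

      -- the same grouping key appears twice
      SELECT dev_id, dev_id, count(*)
        FROM measurement      -- illustrative table name
       GROUP BY 1, 2;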

Deprecated Features

  • PostgreSQL v9.6 Support
    • The CustomScan API in PostgreSQL v9.6 lacks a few entry points to handle dynamic shared memory (DSM), which made it hard to maintain code common with v10 and later. To avoid this problem, we dropped PostgreSQL v9.6 support in this version.
  • PL/CUDA
    • According to our use-case analysis, users prefer a familiar programming environment such as Python over our own special one.
    • The combination of Arrow_Fdw's GPU export functionality and CuPy invoked from PL/Python is the successor to PL/CUDA for in-database machine learning and statistical analytics.
  • Gstore_Fdw
    • This feature is replaced by the writable Arrow_Fdw and its GPU export functionality.
  • Largeobject export to/import from GPU
    • According to our use-case analysis, we determined that this feature is no longer needed.

PG-Strom v2.2 Release

PG-Strom Development Team (1-May-2019)

Overview

Major enhancements in PG-Strom v2.2 include:

  • Table partitioning support
  • Columnar store support with Arrow_Fdw
  • Pre-built GPU binary support
  • Support for GPU functions that return variable-length data
  • GpuSort support on GPU memory store (Gstore_Fdw)
  • NVME-oF support (Experimental)

Prerequisites

  • PostgreSQL v9.6, v10, v11
  • CUDA Toolkit 10.1
  • Linux distributions supported by CUDA Toolkit
  • Intel x86 64bit architecture (x86_64)
  • NVIDIA GPU CC 6.0 or later (Pascal or Volta)

New Features

  • Table partitioning support
    • In a multi-GPU configuration, an optimal GPU is chosen according to the physical distance between each GPU and the child tables that make up a partition. If PG-Strom cannot determine the distance from the PCIe-bus topology, as in an NVME-oF configuration, the DBA can configure the mapping between GPUs and NVME-SSDs using pg_strom.nvme_distance_map.
    • When a partitioned table is joined with non-partitioned tables, this version can produce a query execution plan that first joins the non-partitioned table with each partition child table, then gathers the results from the child tables; see the sketch after this list. This feature has been proposed to the PostgreSQL v13 core as Asymmetric Partition-wise JOIN.
  • Columnar store support with Arrow_Fdw
    • External Apache Arrow files can be read through foreign tables.
    • It also supports SSD-to-GPU Direct SQL on Apache Arrow files.
  • Pre-built GPU binary support
    • When GPU binary code was generated from SQL, older versions wrote out the entire CUDA C source code, including static portions such as libraries, and NVRTC (NVIDIA Run-Time Compiler) built it on the fly. However, some complicated functions took much longer to compile.
    • v2.2 builds the static functions beforehand, so only the portion generated dynamically from SQL is compiled at run time. This reduces the time needed for GPU binary generation.
  • JSONB data type support
    • This version allows elements of a JSONB object to be referenced and used as numeric or text values; see the example after this list.
  • Support for GPU functions that return variable-length data
    • This version makes it possible to implement SQL functions that return variable-length data, such as textcat, on GPU devices.
  • GpuSort support on GPU memory store (Gstore_Fdw)
    • This version allows data in the GPU memory store to be read by general SQL workloads, not only by PL/CUDA.
  • Addition of regression tests
    • Several simple regression tests were added.
  • NVME-oF support (Experimental)
    • SSD-to-GPU Direct SQL is supported on remote SSDs mounted via NVME-over-Fabric. Note that this is an experimental feature, and it requires replacing the nvme_rdma kernel module on Red Hat Enterprise Linux 7.x / CentOS 7.x.
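
The sketch below shows the plan shape of the asymmetric partition-wise join, assuming ptable is partitioned into ptable_p0..p2 and dtable is a plain table; all names are illustrative.

      SELECT count(*)
        FROM ptable p
        JOIN dtable d ON p.dev_id = d.dev_id;

      -- old plan shape:  Join(Append(p0, p1, p2), dtable)
      -- new plan shape:  Append(Join(p0, dtable),
      --                         Join(p1, dtable),
      --                         Join(p2, dtable))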
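
And a minimal example of the JSONB support, using PostgreSQL's standard ->> operator; the table and key names are illustrative.

      -- reference JSONB elements and use them as numeric or text values
      SELECT (doc->>'price')::numeric * 0.9 AS discounted,
             doc->>'label' AS label
        FROM jtable
       WHERE (doc->>'price')::numeric > 1000;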

Features to be deprecated

  • PostgreSQL v9.6 support

    • The CustomScan API in PostgreSQL v9.6 lacks a few entry points to handle dynamic shared memory (DSM), so run-time statistics cannot be collected.
    • PostgreSQL v9.6 also keeps expression objects differently internally, so we had to put #if ... #endif blocks in quite a few places, which damaged code maintainability.
    • Due to these problems, this is the last version to support PostgreSQL v9.6. If you run PG-Strom on PostgreSQL v9.6, we recommend moving to PostgreSQL v11 as soon as possible.
  • The pgstrom format of Gstore_Fdw foreign table

    • The internal data format of the GPU memory store (Gstore_Fdw) was originally designed as a data source for PL/CUDA procedures. It is our own format, and it uses PostgreSQL's data representations as-is: variable-length data, numeric, and so on.
    • Since then, NVIDIA has released RAPIDS (cuDF), based on Apache Arrow, for data exchange on GPUs, and its adoption has widened across machine-learning applications and the Python software stack.
    • PG-Strom will switch the internal data format of Gstore_Fdw to improve interoperability with this machine-learning software, so the existing data format will be deprecated.

Dropped Features

  • In-memory columnar cache
    • As a result of use-case analysis, we concluded that Arrow_Fdw can replace this feature in most cases. Due to the duplication, we dropped the in-memory columnar cache.

PG-Strom v2.0 Release

PG-Strom Development Team (17-Apr-2018)

Overview

Major enhancements in PG-Strom v2.0 include:

  • Overall redesign and stabilization of the internal infrastructure to manage GPUs
  • CPU+GPU hybrid parallel execution
  • SSD-to-GPU Direct SQL Execution
  • In-memory columnar cache
  • GPU memory store (gstore_fdw)
  • Redesign of GpuJoin and GpuPreAgg and speed-up
  • GpuPreAgg + GpuJoin + GpuScan combined GPU kernel

A summary of the new features is available for download as the PG-Strom v2.0 Technical Brief.

Prerequisites

  • PostgreSQL v9.6, v10
  • CUDA Toolkit 9.1
  • Linux distributions supported by CUDA Toolkit
  • Intel x86 64bit architecture (x86_64)
  • NVIDIA GPU CC 6.0 or later (Pascal or Volta)

New Features

  • Overall redesign and stabilization of the internal infrastructure to manage GPU devices.

    • A PostgreSQL backend process now uses at most one GPU at a time. Multi-GPU installations are assumed to be used in combination with PostgreSQL's CPU parallel execution. This is usually not a problem, because the throughput at which a CPU can supply data to a GPU is much narrower than the processing capability of the GPU. We prioritized the simplicity of the software architecture.
    • We began to utilize the demand paging feature of GPU device memory, supported by GPU models since the Pascal generation. In most SQL workloads, the exact size of the required result buffer cannot be known prior to execution, so we used to allocate a larger buffer than estimated, and retry parts of the workload when the estimate turned out to be insufficient. This design withheld GPU resources that could have been used by other concurrent processes, and the complicated error-retry logic was a nightmare for software quality. The demand paging feature allowed us to eliminate and simplify all of this.
    • We stopped using the CUDA asynchronous interface. With the demand paging feature of GPU device memory, the asynchronous DMA APIs (like cuMemcpyHtoD) perform synchronously, which reduces the concurrency and usage ratio of GPU kernels. Instead of the CUDA asynchronous APIs, PG-Strom now manages its own worker threads, each of which calls the synchronous APIs. As a by-product, we could also eliminate asynchronous callbacks (cuStreamAddCallback), which allows use of the MPS daemon, which restricts this API.
  • CPU+GPU Hybrid Parallel Execution

    • CPU parallel execution, introduced in PostgreSQL v9.6, is newly supported.
    • The CustomScan logic of GpuScan, GpuJoin, and GpuPreAgg provided by PG-Strom can run in parallel on multiple PostgreSQL background worker processes.
    • Limitation: PG-Strom's own statistics are not displayed by EXPLAIN ANALYZE under CPU parallel execution. Because PostgreSQL v9.6 does not provide the ShutdownCustomScan callback of the CustomScan interface, the coordinator process has no way to collect information from the worker processes before the DSM (Dynamic Shared Memory) segment is released.
  • SSD-to-GPU Direct SQL Execution

    • In cooperation with the nvme_strom Linux kernel module, PostgreSQL data blocks on NVMe-SSD can be loaded directly into GPU device memory, bypassing the CPU and host buffer. This feature makes PG-Strom applicable to workloads that must process data sets larger than system RAM.
    • Because the data stream skips the block-device and filesystem layers, it can deliver throughput close to the hardware limit. The GPU then runs SQL workloads that usually reduce the amount of data the CPU has to process. Together, these characteristics redefine the GPU's role as an accelerator of I/O workloads as well, not only compute-intensive ones.
  • In-memory Columnar Cache

    • For mid-sized data sets that fit into system RAM, data blocks can be cached in a columnar format more suitable for GPU computing. If cached data blocks are found during a table scan, PG-Strom prefers the columnar cache over PostgreSQL's shared buffers.
    • The in-memory columnar cache can be built synchronously, or asynchronously by background workers.
    • You may remember that a very early revision of PG-Strom had a similar feature. When a cached tuple is updated, the in-memory columnar cache newly implemented in v2.0 invalidates the cache block that contains the updated tuple. It never rewrites the columnar cache in response to row-store updates, so the performance degradation is quite limited.
  • GPU Memory Store (gstore_fdw)

    • A reserved GPU device memory region can be written to and read from at the SQL level by SELECT/INSERT/UPDATE/DELETE, using the foreign table interface; see the sketch after this feature list.
    • In v2.0, only the pgstrom internal data format is supported. Written data is saved in PG-Strom's KDS_FORMAT_COLUMN buffer format, and variable-length data can be compressed with an LZ algorithm.
    • In v2.0, the GPU memory store can be used as a data source for PL/CUDA user-defined functions.
  • Redesign and performance improvement of GpuJoin and GpuPreAgg

    • We stopped using Dynamic Parallelism, which was used internally in GpuJoin and GpuPreAgg, and revised the entire logic of these operations. The old design suffered from a low GPU usage ratio, because a GPU kernel that launched a sub-kernel and simply waited for its completion still occupied a GPU execution slot.
    • A by-product of this redesign is suspend/resume support in GpuJoin. In principle, a SQL JOIN may generate more rows than its input, and the exact number cannot be predicted in advance. The new design suspends the GPU kernel once the result buffer runs out of space, then resumes it with a new buffer. This simplifies the size estimation logic for the result buffer and eliminates run-time GPU kernel retries caused by buffer shortage.
  • GpuPreAgg+GpuJoin+GpuScan combined GPU kernel

    • When GPU-executable SCAN, JOIN, and GROUP BY steps are cascaded in series, a single GPU kernel invocation runs the whole series of tasks equivalent to GpuScan, GpuJoin, and GpuPreAgg. This approach minimizes data exchange between CPU and GPU; for example, the result buffer of GpuJoin is used directly as the input buffer of GpuPreAgg.
    • This feature is especially valuable if combined with SSD-to-GPU Direct SQL Execution.
  • PL/CUDA Enhancement

    • #plcuda_include has been enhanced to accept a SQL function that returns text. Since the code block to inject can vary with the arguments, it can also generate multiple GPU kernel variations, not only include externally defined functions.
    • If a PL/CUDA function takes an argument of reggstore type, the GPU kernel function receives a pointer to the GPU memory store, not the OID value.
  • Other Enhancement

    • The lo_import_gpu and lo_export_gpu functions directly import the contents of GPU device memory acquired by external applications into a largeobject, or export the contents of a largeobject to GPU device memory.
  • Packaging

    • RPM packages were added that follow the PostgreSQL packages distributed by the PostgreSQL Global Development Group.
    • All software packages are available for download at the HeteroDB SWDC (Software Distribution Center).
  • Document

    • The PG-Strom documentation was entirely rewritten using Markdown and mkdocs. This makes documentation maintenance easier than the previous HTML-based approach, so expect more timely updates as new features are developed.
  • Test

    • A regression test suite for PG-Strom was built on top of PostgreSQL's regression test framework.
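
Below is a minimal sketch of the GPU memory store (gstore_fdw); treat the option names as illustrative and check the v2.0 manual for the exact foreign table options (pinning selects the GPU device that preserves the memory region).

      CREATE FOREIGN TABLE ft_gstore (
          id  int,
          x   real,
          y   real
      ) SERVER gstore_fdw
        OPTIONS (pinning '0', format 'pgstrom');

      -- the written rows reside in the preserved GPU device memory
      INSERT INTO ft_gstore
          SELECT i, random(), random()
            FROM generate_series(1, 1000000) i;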

Dropped features

  • PostgreSQL v9.5 Support

    • PostgreSQL v9.6 made big changes in both the optimizer and the executor to support CPU parallel query execution. The biggest change for extension modules that interact with them is the interface enhancement called "upper planner path-ification". It allows an optimal execution plan to be chosen from multiple candidates based on estimated cost, even for aggregation or sorting.
    • This is fundamentally different from the older approach, in which we rewrote the query execution plan through hooks to inject GpuPreAgg. It allows a GpuPreAgg node to be injected in a more reasonable and reliable way, and we could drop the complicated (and buggy) logic that rewrote an already-constructed query execution plan.
    • The CustomScan interface was also enhanced to support CPU parallel execution. For these reasons, we dropped PostgreSQL v9.5 support to follow the new enhancements.
  • GpuSort feature

    • We dropped GpuSort because it provided little performance advantage.
    • Sorting is a workload well suited to GPUs. However, to sort data larger than GPU device memory, the data must be split into multiple chunks, each chunk sorted on the GPU, and the sorted chunks merged by the CPU to produce the final result.
    • A larger chunk size reduces the CPU's merge workload; on the other hand, it increases the lead time before the GPU sorting kernel can launch. This trade-off prevents PG-Strom from using asynchronous processing to hide the data transfer latency.
    • Since this problem is hard to solve, or at least too early to solve, we dropped the GpuSort feature for now.