Trouble Shooting

Identify the problem

In case when a particular workloads produce problems, it is the first step to identify which stuff may cause the problem.

Unfortunately, much smaller number of developer supports the PG-Strom development community than PostgreSQL developer's community, thus, due to the standpoint of software quality and history, it is a reasonable estimation to suspect PG-Strom first.

The pg_strom.enabled parameter allows to turn on/off all the functionality of PG-Strom at once. The configuration below disables PG-Strom, thus identically performs with the standard PostgreSQL.

# SET pg_strom.enabled = off;

In addition, we provide parameters to disable particular execution plan like GpuScan, GpuJoin and GpuPreAgg.

See references/GUC Parameters for more details.

Collecting crash dump

Crash dump is very helpful for analysis of serious problems which lead system crash for example. This session introduces the way to collect crash dump of the PostgreSQL and PG-Strom process (CPU side) and PG-Strom's GPU kernel, and show the back trace on the serious problems.

Add configuration on PostgreSQL startup

For generation of crash dump (CPU-side) on process crash, you need to change the resource limitation of the operating system for size of core file PostgreSQL server process can generate.

For generation of crash dump (GPU-size) on errors of GPU kernel, PostgreSQL server process has CUDA_ENABLE_COREDUMP_ON_EXCEPTIONenvironment variable, and its value has 1.

You can put a configuration file at /etc/systemd/system/postgresql-<version>.service.d/ when PostgreSQL is kicked by systemd.

In case of RPM installation, a configuration file pg_strom.conf is also installed on the directory, and contains the following initial configuration.


In CUDA 9.1, it usually takes more than several minutes to generate crash dump of GPU kernel, and it entirely stops response of the PostgreSQL session which causes an error. So, we recommend to set CUDA_ENABLE_COREDUMP_ON_EXCEPTION environment variable only if you investigate errors of GPU kernels which happen on a certain query. The default configuration on RPM installation comments out the line of CUDA_ENABLE_COREDUMP_ON_EXCEPTION environment variable.

PostgreSQL server process should have unlimited Max core file size configuration, after the next restart.

You can check it as follows.

# cat /proc/<PID of postmaster>/limits
Limit                     Soft Limit           Hard Limit           Units
    :                         :                    :                  :
Max core file size        unlimited            unlimited            bytes
    :                         :                    :                  :

Installation of debuginfo package

# yum install postgresql10-debuginfo pg_strom-PG10-debuginfo
 Package                  Arch    Version             Repository           Size
 pg_strom-PG10-debuginfo  x86_64  1.9-180301.el7      heterodb-debuginfo  766 k
 postgresql10-debuginfo   x86_64  10.3-1PGDG.rhel7    pgdg10              9.7 M

Transaction Summary
Install  2 Packages
  pg_strom-PG10-debuginfo.x86_64 0:1.9-180301.el7
  postgresql10-debuginfo.x86_64 0:10.3-1PGDG.rhel7


Checking the back-trace on CPU side

The kernel parameter kernel.core_pattern and kernel.core_uses_pid determine the path where crash dump is written out. It is usually created on the current working directory of the process, check /var/lib/pgdata where the database cluster is deployed, if you start PostgreSQL server using systemd.

Once core.<PID> file gets generated, you can check its back-trace to reach system crash using gdb.

gdb speficies the core file by -c option, and the crashed program by -f option.

# gdb -c /var/lib/pgdata/core.134680 -f /usr/pgsql-10/bin/postgres
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1
(gdb) bt
#0  0x00007fb942af3903 in __epoll_wait_nocancel () from /lib64/
#1  0x00000000006f71ae in WaitEventSetWaitBlock (nevents=1,
    occurred_events=0x7ffee51e1d70, cur_timeout=-1, set=0x2833298)
    at latch.c:1048
#2  WaitEventSetWait (set=0x2833298, timeout=timeout@entry-1,
    nevents=nevents@entry1, wait_event_info=wait_event_info@entry100663296)
    at latch.c:1000
#3  0x00000000006210fb in secure_read (port=0x2876120,
    ptr=0xcaa7e0 <PqRecvBuffer>, len=8192) at be-secure.c:166
#4  0x000000000062b6e8 in pq_recvbuf () at pqcomm.c:963
#5  0x000000000062c345 in pq_getbyte () at pqcomm.c:1006
#6  0x0000000000718682 in SocketBackend (inBuf=0x7ffee51e1ef0)
    at postgres.c:328
#7  ReadCommand (inBuf=0x7ffee51e1ef0) at postgres.c:501
#8  PostgresMain (argc=<optimized out>, argv=argv@entry0x287bb68,
    dbname=0x28333f8 "postgres", username=<optimized out>) at postgres.c:4030
#9  0x000000000047adbc in BackendRun (port=0x2876120) at postmaster.c:4405
#10 BackendStartup (port=0x2876120) at postmaster.c:4077
#11 ServerLoop () at postmaster.c:1755
#12 0x00000000006afb7f in PostmasterMain (argc=argc@entry3,
    argv=argv@entry0x2831280) at postmaster.c:1363
#13 0x000000000047bbef in main (argc=3, argv=0x2831280) at main.c:228

bt command of gdb displays the backtrace. In this case, I sent SIGSEGV signal to the PostgreSQL backend which is waiting for queries from the client for intentional crash, the process got crashed at __epoll_wait_nocancel invoked by WaitEventSetWait.

Checking the backtrace on GPU

Crash dump of GPU kernel is generated on the current working directory of PostgreSQL server process, unless you don't specify the path using CUDA_COREDUMP_FILE environment variable explicitly. Check /var/lib/pgdata where the database cluster is deployed, if systemd started PostgreSQL. Dump file will have the following naming convension.


Note that the dump-file of GPU kernel contains no debug information like symbol information in the default configuration. It is nearly impossible to investigate the problem, so enable inclusion of debug information for the GPU programs generated by PG-Strom, as follows.

Also note than we don't recommend to turn on the configuration for daily usage, because it makes query execution performan slow down. Turn on only when you investigate the troubles.

nvme=# set pg_strom.debug_jit_compile_options = on;

You can check crash dump of the GPU kernel using cuda-gdb command.

# /usr/local/cuda/bin/cuda-gdb
NVIDIA (R) CUDA Debugger
9.1 release
Portions Copyright (C) 2007-2017 NVIDIA Corporation
For help, type "help".
Type "apropos word" to search for commands related to "word".

Run cuda-gdb command, then load the crash dump file above using target command on the prompt.

(cuda-gdb) target cudacore /var/lib/pgdata/core_1521131828_magro.heterodb.com_216238.nvcudmp
Opening GPU coredump: /var/lib/pgdata/core_1521131828_magro.heterodb.com_216238.nvcudmp
[New Thread 216240]

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7ff4dc82f930 (cuda_gpujoin.h:1159)
[Current focus set to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
#0  0x00007ff4dc82f938 in _INTERNAL_8_pg_strom_0124cb94::gpujoin_exec_hashjoin (kcxt=0x7ff4f7fffbf8, kgjoin=0x7fe9f4800078,
    kmrels=0x7fe9f8800000, kds_src=0x7fe9f0800030, depth=3, rd_stack=0x7fe9f4806118, wr_stack=0x7fe9f480c118, l_state=0x7ff4f7fffc48,
    matched=0x7ff4f7fffc7c "") at /usr/pgsql-10/share/extension/cuda_gpujoin.h:1159
1159            while (khitem && khitem->hash != hash_value)

You can check backtrace where the error happened on GPU kernel using bt command.

(cuda-gdb) bt
#0  0x00007ff4dc82f938 in _INTERNAL_8_pg_strom_0124cb94::gpujoin_exec_hashjoin (kcxt=0x7ff4f7fffbf8, kgjoin=0x7fe9f4800078,
    kmrels=0x7fe9f8800000, kds_src=0x7fe9f0800030, depth=3, rd_stack=0x7fe9f4806118, wr_stack=0x7fe9f480c118, l_state=0x7ff4f7fffc48,
    matched=0x7ff4f7fffc7c "") at /usr/pgsql-10/share/extension/cuda_gpujoin.h:1159
#1  0x00007ff4dc9428f0 in gpujoin_main<<<(30,1,1),(256,1,1)>>> (kgjoin=0x7fe9f4800078, kmrels=0x7fe9f8800000, kds_src=0x7fe9f0800030,
    kds_dst=0x7fe9e8800030, kparams_gpreagg=0x0) at /usr/pgsql-10/share/extension/cuda_gpujoin.h:1347

Please check CUDA Toolkit Documentation - CUDA-GDB for more detailed usage of cuda-gdb command.