Troubleshooting
Identify the problem
When a particular workload causes a problem, the first step is to identify which component is likely responsible.
Unfortunately, the PG-Strom development community has far fewer developers than the PostgreSQL community, so from the standpoint of software quality and track record it is reasonable to suspect PG-Strom first.
The pg_strom.enabled parameter turns all PG-Strom functionality on or off at once.
The configuration below disables PG-Strom, so the server behaves identically to standard PostgreSQL.
# SET pg_strom.enabled = off;
In addition, parameters are provided to disable individual execution plan types such as GpuScan, GpuJoin, and GpuPreAgg.
See references/GUC Parameters for more details.
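If you need to narrow the problem down further, each plan type can be disabled individually. The parameter names below are shown as typical examples only; verify the exact names available in your version against the GUC Parameters reference.
# SET pg_strom.enable_gpuscan = off;
# SET pg_strom.enable_gpujoin = off;
# SET pg_strom.enable_gpupreagg = off;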
Collecting crash dump
A crash dump is very helpful for analyzing serious problems, such as those that lead to a system crash. This section introduces how to collect crash dumps of the PostgreSQL / PG-Strom process (CPU side) and of PG-Strom's GPU kernels, and how to check the backtrace of such serious problems.
Add configuration for PostgreSQL startup
To generate a crash dump (CPU side) on a process crash, you need to change the operating system's resource limit on the size of core files the PostgreSQL server process may generate.
To generate a crash dump (GPU side) on a GPU kernel error, the PostgreSQL server process must have the CUDA_ENABLE_COREDUMP_ON_EXCEPTION environment variable set to 1.
When PostgreSQL is launched by systemd, you can put a configuration file under /etc/systemd/system/postgresql-<version>.service.d/.
In the case of RPM installation, a configuration file pg_strom.conf is installed in this directory with the following initial configuration.
[Service]
LimitNOFILE=65536
LimitCORE=infinity
#Environment=CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
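After editing the drop-in file, reload the systemd configuration and restart PostgreSQL so that the new limits and environment take effect. A minimal sketch, assuming the service unit is named postgresql-10:
# systemctl daemon-reload
# systemctl restart postgresql-10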
In CUDA 9.1, generating a crash dump of a GPU kernel usually takes more than several minutes, and the PostgreSQL session that caused the error becomes completely unresponsive during that time.
So we recommend setting the CUDA_ENABLE_COREDUMP_ON_EXCEPTION environment variable only while you are investigating GPU kernel errors that occur on a particular query.
The default configuration on RPM installation comments out the CUDA_ENABLE_COREDUMP_ON_EXCEPTION line.
After the next restart, the PostgreSQL server process should have an unlimited "Max core file size" setting.
You can check it as follows.
# cat /proc/<PID of postmaster>/limits
Limit Soft Limit Hard Limit Units
: : : :
Max core file size unlimited unlimited bytes
: : : :
Installation of debuginfo package
The debuginfo packages provide the symbol information needed to read a meaningful backtrace, so install them for both PostgreSQL and PG-Strom.
# yum install postgresql10-debuginfo pg_strom-PG10-debuginfo
:
================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
pg_strom-PG10-debuginfo x86_64 1.9-180301.el7 heterodb-debuginfo 766 k
postgresql10-debuginfo x86_64 10.3-1PGDG.rhel7 pgdg10 9.7 M
Transaction Summary
================================================================================
Install 2 Packages
:
Installed:
pg_strom-PG10-debuginfo.x86_64 0:1.9-180301.el7
postgresql10-debuginfo.x86_64 0:10.3-1PGDG.rhel7
Complete!
Checking the backtrace on the CPU side
The kernel parameters kernel.core_pattern and kernel.core_uses_pid determine the path where the crash dump is written.
It is usually created in the current working directory of the process, so if you start the PostgreSQL server using systemd, check /var/lib/pgdata where the database cluster is deployed.
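To confirm the effective settings and locate the generated file, you can query the kernel parameters and list the database cluster directory; the path below assumes the default /var/lib/pgdata cluster.
# sysctl kernel.core_pattern kernel.core_uses_pid
# ls -l /var/lib/pgdata/core.*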
Once a core.<PID> file has been generated, you can use gdb to check the backtrace that led to the crash.
gdb specifies the core file by the -c option, and the crashed program by the -f option.
# gdb -c /var/lib/pgdata/core.134680 -f /usr/pgsql-10/bin/postgres
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1
:
(gdb) bt
#0 0x00007fb942af3903 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x00000000006f71ae in WaitEventSetWaitBlock (nevents=1,
occurred_events=0x7ffee51e1d70, cur_timeout=-1, set=0x2833298)
at latch.c:1048
#2 WaitEventSetWait (set=0x2833298, timeout=timeout@entry=-1,
occurred_events=occurred_events@entry=0x7ffee51e1d70,
nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=100663296)
at latch.c:1000
#3 0x00000000006210fb in secure_read (port=0x2876120,
ptr=0xcaa7e0 <PqRecvBuffer>, len=8192) at be-secure.c:166
#4 0x000000000062b6e8 in pq_recvbuf () at pqcomm.c:963
#5 0x000000000062c345 in pq_getbyte () at pqcomm.c:1006
#6 0x0000000000718682 in SocketBackend (inBuf=0x7ffee51e1ef0)
at postgres.c:328
#7 ReadCommand (inBuf=0x7ffee51e1ef0) at postgres.c:501
#8 PostgresMain (argc=<optimized out>, argv=argv@entry=0x287bb68,
dbname=0x28333f8 "postgres", username=<optimized out>) at postgres.c:4030
#9 0x000000000047adbc in BackendRun (port=0x2876120) at postmaster.c:4405
#10 BackendStartup (port=0x2876120) at postmaster.c:4077
#11 ServerLoop () at postmaster.c:1755
#12 0x00000000006afb7f in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x2831280) at postmaster.c:1363
#13 0x000000000047bbef in main (argc=3, argv=0x2831280) at main.c:228
The bt command of gdb displays the backtrace.
In this case, a SIGSEGV signal was intentionally sent to a PostgreSQL backend that was waiting for queries from the client; the process crashed at __epoll_wait_nocancel invoked by WaitEventSetWait.
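Since the debuginfo packages are installed, the usual gdb commands can be used to dig into individual frames; for example (frame numbers follow the backtrace above):
(gdb) frame 2
(gdb) info locals
(gdb) info threads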
Checking the backtrace on the GPU side
A crash dump of the GPU kernel is generated in the current working directory of the PostgreSQL server process, unless you explicitly specify a path with the CUDA_COREDUMP_FILE environment variable.
If systemd started PostgreSQL, check /var/lib/pgdata where the database cluster is deployed. The dump file follows the naming convention below.
core_<timestamp>_<hostname>_<PID>.nvcudmp
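If you prefer a fixed dump location, the CUDA_COREDUMP_FILE environment variable can be added to the same systemd drop-in file shown earlier. A sketch, assuming the %h (hostname) and %p (PID) placeholders supported by the CUDA driver and the example path below:
[Service]
Environment=CUDA_COREDUMP_FILE=/var/lib/pgdata/gpucore.%h.%p.nvcudmp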
Note that, in the default configuration, the GPU kernel dump file contains no debug information such as symbols, which makes it nearly impossible to investigate the problem. So enable inclusion of debug information in the GPU programs generated by PG-Strom, as follows.
Also note that we do not recommend keeping this setting enabled for daily use, because it slows down query execution. Turn it on only while investigating trouble.
nvme=# set pg_strom.debug_jit_compile_options = on;
SET
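Once the failing query has been reproduced and the dump collected, turn the option off again in the same session, since it slows down query execution:
nvme=# set pg_strom.debug_jit_compile_options = off;
SET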
You can inspect the crash dump of the GPU kernel using the cuda-gdb command.
# /usr/local/cuda/bin/cuda-gdb
NVIDIA (R) CUDA Debugger
9.1 release
Portions Copyright (C) 2007-2017 NVIDIA Corporation
:
For help, type "help".
Type "apropos word" to search for commands related to "word".
(cuda-gdb)
Run the cuda-gdb command, then load the above crash dump file using the target command at the prompt.
(cuda-gdb) target cudacore /var/lib/pgdata/core_1521131828_magro.heterodb.com_216238.nvcudmp
Opening GPU coredump: /var/lib/pgdata/core_1521131828_magro.heterodb.com_216238.nvcudmp
[New Thread 216240]
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7ff4dc82f930 (cuda_gpujoin.h:1159)
[Current focus set to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
#0 0x00007ff4dc82f938 in _INTERNAL_8_pg_strom_0124cb94::gpujoin_exec_hashjoin (kcxt=0x7ff4f7fffbf8, kgjoin=0x7fe9f4800078,
kmrels=0x7fe9f8800000, kds_src=0x7fe9f0800030, depth=3, rd_stack=0x7fe9f4806118, wr_stack=0x7fe9f480c118, l_state=0x7ff4f7fffc48,
matched=0x7ff4f7fffc7c "") at /usr/pgsql-10/share/extension/cuda_gpujoin.h:1159
1159 while (khitem && khitem->hash != hash_value)
You can check the backtrace at the point where the error occurred in the GPU kernel using the bt command.
(cuda-gdb) bt
#0 0x00007ff4dc82f938 in _INTERNAL_8_pg_strom_0124cb94::gpujoin_exec_hashjoin (kcxt=0x7ff4f7fffbf8, kgjoin=0x7fe9f4800078,
kmrels=0x7fe9f8800000, kds_src=0x7fe9f0800030, depth=3, rd_stack=0x7fe9f4806118, wr_stack=0x7fe9f480c118, l_state=0x7ff4f7fffc48,
matched=0x7ff4f7fffc7c "") at /usr/pgsql-10/share/extension/cuda_gpujoin.h:1159
#1 0x00007ff4dc9428f0 in gpujoin_main<<<(30,1,1),(256,1,1)>>> (kgjoin=0x7fe9f4800078, kmrels=0x7fe9f8800000, kds_src=0x7fe9f0800030,
kds_dst=0x7fe9e8800030, kparams_gpreagg=0x0) at /usr/pgsql-10/share/extension/cuda_gpujoin.h:1347
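Because cuda-gdb is built on gdb, the usual inspection commands also work here; for example, to examine the failing frame (depth is one of the arguments shown in the backtrace above):
(cuda-gdb) frame 0
(cuda-gdb) print depth
(cuda-gdb) info locals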
See the CUDA Toolkit Documentation - CUDA-GDB for more detailed usage of the cuda-gdb command.