GPU-Sort

This chapter introduces GPU-Sort and related features.

GPU Task Execution

PG-Strom execution mode

To explain the conditions under which GPU-Sort can be used, we first need to understand how PG-Strom processes SQL workloads.

When GPU-Scan and GPU-Join read a table, they divide the data into chunks of about 64MB, load each chunk from storage into the GPU, evaluate WHERE clauses and JOIN conditions on the GPU, and write the results back to the CPU.

At this time, the GPU memory used for Scan/Join processing is released and reused to process the next 64MB chunk. Normally, reading from storage, Scan/Join processing on the GPU, and writing back to the CPU are executed concurrently, and GPU buffers are released and reused one after another, so PG-Strom does not consume much GPU memory.

On the other hand, GPU-PreAgg is different. Due to the nature of the workload, the results of the aggregation are accumulated on the GPU (dam-style execution) and then written back to the CPU in one go. Therefore, when GPU-PreAgg completes, almost all of the result data resides in GPU memory.

Due to the nature of the sorting process, all of the data to be sorted must be loaded onto the GPU before GPU-Sort can run. In other words, even for a GPU-Sort that involves no aggregation and has only GPU-Scan and GPU-Join as its lower nodes, the execution results must be kept in GPU memory just as in GPU-PreAgg. This is the same execution method used by the Inner Pinned Buffer of GpuJoin: when the results of GPU-Scan/Join are used in the next step, they are retained in GPU memory instead of being returned to the CPU.

Enabling GPU-Sort

Due to the nature of sorting workloads, all the target data must be stored in GPU memory.

Therefore, the basic premise is that the data to be sorted must fit within GPU memory capacity, but there is one more thing to consider.
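
As a rough sanity check, you can compare the size of the relations being scanned with the installed GPU memory. This is plain PostgreSQL, not a PG-Strom feature, and the actual sort input is usually much smaller because WHERE/JOIN filtering and projection are applied before the sort:

=# select pg_size_pretty(pg_total_relation_size('lineorder'));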

Various GPU processes in PG-Strom have a mechanism called CPU-Fallback, which re-executes operators and SQL functions on the CPU when they cannot be processed on the GPU. Typically this happens when variable-length data does not fit into the PostgreSQL block size (8kB) and is stored in an external TOAST table. CPU-Fallback ensures continuity of processing even for such highly exceptional data, but rows subject to CPU-Fallback are kept on the CPU and do not exist in GPU memory, which is an obstacle to sorting.

Therefore, the GPU-Sort function works only when the CPU-Fallback function is disabled, i.e., when pg_strom.cpu_fallback=off is set.
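
For example, in a psql session:

=# set pg_strom.cpu_fallback = off;
SET
=# show pg_strom.cpu_fallback;
 pg_strom.cpu_fallback
-----------------------
 off
(1 row)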

GPU-Sort with Window-Functions

If CPU-Fallback is disabled, the complete results of GPU-Scan/Join/PreAgg are guaranteed to be in GPU memory, so PG-Strom can perform parallel sorting based on the Bitonic-Sorting algorithm and return the sorted results to the CPU.

When used with a LIMIT clause or a window function that limits the number of rows, such as rank() < 4, PG-Strom uses these optimization hints to reduce the number of rows returned to the CPU. This contributes to speedup by reducing the amount of data the CPU has to process.

The following execution plan shows the result of narrowing down the number of rows using a window function (rank() < 4) without enabling GPU-Sort.

=# explain analyze
   select * from (
       select c_region, c_nation, c_city, lo_orderdate, sum(lo_revenue) lo_rev,
              rank() over(partition by c_region, c_nation, c_city
                          order by sum(lo_revenue)) cnt
         from lineorder, customer
        where lo_custkey = c_custkey
          and lo_shipmode in ('RAIL','SHIP')
         group by c_region, c_nation, c_city, lo_orderdate
   ) subqry
   where cnt < 4;
                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=32013352.01..33893039.51 rows=75187500 width=84) (actual time=13158.987..13335.106 rows=750 loops=1)
   Run Condition: (rank() OVER (?) < 4)
   ->  Sort  (cost=32013352.01..32201320.76 rows=75187500 width=76) (actual time=13158.976..13238.136 rows=601500 loops=1)
         Sort Key: customer.c_region, customer.c_nation, customer.c_city, (pgstrom.sum_numeric((pgstrom.psum(lineorder.lo_revenue))))
         Sort Method: quicksort  Memory: 76268kB
         ->  HashAggregate  (cost=15987574.35..18836475.71 rows=75187500 width=76) (actual time=9990.801..10271.543 rows=601500 loops=1)
               Group Key: customer.c_region, customer.c_nation, customer.c_city, lineorder.lo_orderdate
               Planned Partitions: 8  Batches: 1  Memory Usage: 516113kB
               ->  Custom Scan (GpuPreAgg) on lineorder  (cost=4967906.38..5907750.13 rows=75187500 width=76) (actual time=9175.476..9352.529 rows=1203000 loops=1)
                     GPU Projection: pgstrom.psum(lo_revenue), c_region, c_nation, c_city, lo_orderdate
                     GPU Scan Quals: (lo_shipmode = ANY ('{RAIL,SHIP}'::bpchar[])) [plan: 600046000 -> 171773200, exec: 1311339 -> 362780]
                     GPU Join Quals [1]: (lo_custkey = c_custkey) [plan: 171773200 -> 171773200, exec: 362780 -> 322560
                     GPU Outer Hash [1]: lo_custkey
                     GPU Inner Hash [1]: c_custkey
                     GPU Group Key: c_region, c_nation, c_city, lo_orderdate
                     Scan-Engine: GPU-Direct with 2 GPUs <0,1>; direct=11395910, ntuples=1311339
                     ->  Seq Scan on customer  (cost=0.00..81963.11 rows=3000011 width=46) (actual time=0.008..519.064 rows=3000000 loops=1)
 Planning Time: 1.395 ms
 Execution Time: 13494.808 ms
(19 rows)

After GPU-PreAgg, HashAggregate is run to integrate the partial aggregation results, and Sort is run to order the aggregated values. Finally, WindowAgg narrows the results down to the top three by sum(lo_revenue) for each c_region, c_nation, and c_city. GPU-PreAgg takes 9.352 seconds, so we can see that roughly 4 seconds of the latter half of the processing were spent on the CPU.

On the other hand, the following execution plan disables CPU-Fallback by setting pg_strom.cpu_fallback=off (i.e. enables GPU-Sort).

=# set pg_strom.cpu_fallback = off;
SET
=# explain analyze
   select * from (
       select c_region, c_nation, c_city, lo_orderdate, sum(lo_revenue) lo_rev,
              rank() over(partition by c_region, c_nation, c_city
                          order by sum(lo_revenue)) cnt
         from lineorder, customer
        where lo_custkey = c_custkey
          and lo_shipmode in ('RAIL','SHIP')
         group by c_region, c_nation, c_city, lo_orderdate
   ) subqry
   where cnt < 4;
                                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=5595978.47..5602228.47 rows=125000 width=84) (actual time=9596.930..9598.194 rows=750 loops=1)
   Run Condition: (rank() OVER (?) < 4)
   ->  Result  (cost=5595978.47..5599415.97 rows=125000 width=76) (actual time=9596.918..9597.292 rows=750 loops=1)
         ->  Custom Scan (GpuPreAgg) on lineorder  (cost=5595978.47..5597540.97 rows=125000 width=76) (actual time=9596.912..9597.061 rows=750 loops=1)
               GPU Projection: pgstrom.psum(lo_revenue), c_region, c_nation, c_city, lo_orderdate
               GPU Scan Quals: (lo_shipmode = ANY ('{RAIL,SHIP}'::bpchar[])) [plan: 600046000 -> 171773200, exec: 1311339 -> 362780]
               GPU Join Quals [1]: (lo_custkey = c_custkey) [plan: 171773200 -> 171773200, exec: 362780 -> 322560
               GPU Outer Hash [1]: lo_custkey
               GPU Inner Hash [1]: c_custkey
               GPU Group Key: c_region, c_nation, c_city, lo_orderdate
               Scan-Engine: GPU-Direct with 2 GPUs <0,1>; direct=11395910, ntuples=1311339
               GPU-Sort keys: c_region, c_nation, c_city, pgstrom.fsum_numeric((pgstrom.psum(lo_revenue)))
               Window-Rank Filter: rank() over(PARTITION BY c_region, c_nation, c_city ORDER BY pgstrom.fsum_numeric((pgstrom.psum(lo_revenue)))) < 4
               ->  Seq Scan on customer  (cost=0.00..81963.11 rows=3000011 width=46) (actual time=0.006..475.006 rows=3000000 loops=1)
 Planning Time: 0.381 ms
 Execution Time: 9710.616 ms
(16 rows)

The HashAggregate and Sort in the original query plan are gone; instead, the GPU-Sort keys and Window-Rank Filter lines have appeared as options of GpuPreAgg.

This indicates that GpuPreAgg builds the complete aggregation result on the GPU, then sorts it and outputs the result.

In addition, this query filters rows using the window function rank(). By pushing this condition down to the lower GpuPreAgg node, rows that are known in advance to be filtered out are removed from the result set, reducing both the amount of data transferred from the GPU to the CPU and the number of rows the CPU has to copy. Since this work runs in parallel in GPU memory, it is generally faster than the equivalent CPU processing.

Row reductions by GPU-Sort

There are several conditions that must be met for GPU-Sort to be triggered. As explained in the previous section, the first prerequisite is that CPU-Fallback is disabled.

The optimizer will insert GPU-Sort in the following cases:

  • Upper nodes require sorted results, such as for an ORDER BY clause.
  • Window functions are used, and upper nodes require sorted results.

For these workloads, PG-Strom will generate GPU-PreAgg/Join/Scan execution plans with GPU-Sort.

In addition, in the following cases, completing the sort on the GPU can secondarily reduce the number of rows written back from the GPU to the CPU:

  • When a LIMIT clause is attached to ORDER BY, so the maximum number of rows to output is known in advance.
  • When a window function such as rank() is used, so the number of top rows to output from each partition is known in advance.

The following execution plan joins several tables, performs aggregation operations, sorts the results by d_year and revenue, and outputs the top 10.

=# set pg_strom.cpu_fallback = off;
SET
=# explain
    select c_nation, s_nation, d_year, sum(lo_revenue) as revenue
      from customer, lineorder, supplier, date1
     where lo_custkey = c_custkey
       and lo_suppkey = s_suppkey
       and lo_orderdate = d_datekey
       and c_region = 'ASIA'  and s_region = 'ASIA'
       and d_year >= 1992 and d_year <= 1997
     group by c_nation, s_nation, d_year
     order by d_year, revenue
     limit 10;
                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=19266896.24..19266897.46 rows=10 width=68) (actual time=44024.638..44056.484 rows=10 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Result  (cost=19265896.24..19265896.46 rows=10 width=68) (actual time=44016.179..44016.185 rows=3 loops=3)
         ->  Parallel Custom Scan (GpuPreAgg) on lineorder  (cost=19265896.24..19265896.31 rows=10 width=68) (actual time=44016.177..44016.181 rows=3 loops=3)
               GPU Projection: pgstrom.psum(lo_revenue), c_nation, s_nation, d_year
               GPU Join Quals [1]: (s_suppkey = lo_suppkey) [plan: 2500011000 -> 494752100, exec: 0 -> 0
               GPU Outer Hash [1]: lo_suppkey
               GPU Inner Hash [1]: s_suppkey
               GPU Join Quals [2]: (c_custkey = lo_custkey) [plan: 494752100 -> 97977410, exec: 0 -> 0
               GPU Outer Hash [2]: lo_custkey
               GPU Inner Hash [2]: c_custkey
               GPU Join Quals [3]: (d_datekey = lo_orderdate) [plan: 97977410 -> 84024450, exec: 0 -> 0
               GPU Outer Hash [3]: lo_orderdate
               GPU Inner Hash [3]: d_datekey
               GPU Group Key: c_nation, s_nation, d_year
               Scan-Engine: GPU-Direct with 2 GPUs <0,1>; direct=114826068, ntuples=0
               GPU-Sort keys: d_year, pgstrom.fsum_numeric((pgstrom.psum(lo_revenue)))
               GPU-Sort Limit: 10
               ->  Parallel Custom Scan (GpuScan) on supplier  (cost=100.00..78805.87 rows=824560 width=22) (actual time=16.090..104.527 rows=666540 loops=3)
                     GPU Projection: s_nation, s_suppkey
                     GPU Scan Quals: (s_region = 'ASIA'::bpchar) [plan: 9999718 -> 824560, exec: 10000000 -> 1999620]
                     Scan-Engine: GPU-Direct with 2 GPUs <0,1>; direct=168663, ntuples=10000000
               ->  Parallel Custom Scan (GpuScan) on customer  (cost=100.00..79517.76 rows=2475728 width=22) (actual time=14.848..226.354 rows=2000770 loops=3)
                     GPU Projection: c_nation, c_custkey
                     GPU Scan Quals: (c_region = 'ASIA'::bpchar) [plan: 30003780 -> 2475728, exec: 30000000 -> 6002311]
                     Scan-Engine: GPU-Direct with 2 GPUs <0,1>; direct=519628, ntuples=30000000
               ->  Parallel Seq Scan on date1  (cost=0.00..69.55 rows=1289 width=8) (actual time=0.009..0.132 rows=731 loops=3)
                     Filter: ((d_year >= 1992) AND (d_year <= 1997))
                     Rows Removed by Filter: 121
 Planning Time: 0.786 ms
 Execution Time: 44093.114 ms
(32 rows)

Without GPU-Sort, GPU-PreAgg would generate 4375 partial aggregation results, HashAggregate would integrate them, Sort would order them, and Limit would output only the top 10.

Depending on the scale of the problem, a strategy may be adopted in which each PostgreSQL parallel worker produces a partial sort result and the results are then merge-sorted. In any case, sorting places a heavy load on the CPU and becomes a serious problem as the number of rows grows.

However, if it is known in advance that only the top 10 sorted rows will be output, there is no need to return all 4375 rows to the CPU in the first place.

This query execution plan contains neither a CPU-side Sort nor a HashAggregate. Instead, as a post-processing step of GPU-PreAgg, the results are sorted by the key values shown in the GPU-Sort keys line, and only the 10 rows indicated by the GPU-Sort Limit line are returned. In other words, the plan returns the GPU processing results to the client immediately, without executing Sort or Limit on the CPU side.

In addition to the LIMIT clause, window functions can be used as hints to reduce the number of rows. Currently, the following conditional expressions can be used as hints:

  • rank() OVER(...) < CONST
  • rank() OVER(...) <= CONST
  • dense_rank() OVER(...) < CONST
  • dense_rank() OVER(...) <= CONST
  • row_number() OVER(...) < CONST
  • row_number() OVER(...) <= CONST
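
For example, a row_number()-based top-N query of the same shape as the earlier rank() example should also qualify for this pushdown. The following is a schematic sketch against the same tables as above (output not shown; pg_strom.cpu_fallback=off is assumed):

=# explain
   select * from (
       select c_region, c_nation, c_city, lo_orderdate, sum(lo_revenue) lo_rev,
              row_number() over(partition by c_region, c_nation, c_city
                                order by sum(lo_revenue) desc) rn
         from lineorder, customer
        where lo_custkey = c_custkey
        group by c_region, c_nation, c_city, lo_orderdate
   ) subqry
   where rn <= 3;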

If you want to disable CPU-Fallback but do not want to use GPU-Sort, you can enable or disable GPU-Sort alone with pg_strom.enable_gpusort = [on|off].
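
For example:

=# set pg_strom.cpu_fallback = off;
SET
=# set pg_strom.enable_gpusort = off;
SET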