[gnome-remote-desktop] rdp: Add class for NVENC and CUDA support



commit ab2231c1084fef37220d0ed584b8fb858eee500f
Author: Pascal Nowack <Pascal Nowack gmx de>
Date:   Mon Jul 5 17:12:42 2021 +0200

    rdp: Add class for NVENC and CUDA support
    
    This is the first step towards hardware acceleration support in
    gnome-remote-desktop.
    Implementing hardware acceleration with NVENC and CUDA is considerably
    easier than implementing it with VAAPI.
    To be able to use NVENC and CUDA, use the ffnvcodec headers. These
    headers provide functions to easily load NVENC and CUDA using dlopen.
    NVENC itself will be used to encode AVC420 content, which will then
    be pushed via the graphics pipeline.
    CUDA will be used to perform the BGRX to YUV420 colour conversion, as
    that massively improves performance compared to a colour conversion
    computed on the CPU.
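
    To illustrate, loading both APIs takes very little code (a minimal
    sketch of the setup done in grd-rdp-nvenc.c below):

      #include <ffnvcodec/dynlink_loader.h>

      CudaFunctions *cuda_funcs = NULL;
      NvencFunctions *nvenc_funcs = NULL;

      /* Both loaders dlopen the driver libraries at runtime */
      cuda_load_functions (&cuda_funcs, NULL);
      nvenc_load_functions (&nvenc_funcs, NULL);

      /* NULL function tables mean no usable NVIDIA driver is installed */
      if (!cuda_funcs || !nvenc_funcs)
        return NULL;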
    
    To be able to encode AVC420 frames, the frame size (both width and
    height) needs to be aligned to a multiple of 16, and the colour format
    needs to be converted to YUV420.
    In addition to this, NVENC seems to require the frame height to be
    aligned to a multiple of 64; otherwise, the resulting frame on the
    client may contain a black strip in the middle.
    Since the FreeRDP primitives are way too slow (17-24ms for a FullHD
    frame) to take care of the colour conversion, use a CUDA kernel to
    perform this operation.
    This reduces the conversion time to under 400µs (313µs according to
    the NVIDIA Visual Profiler) on a GeForce GTX 660.
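
    For reference, the required alignment is plain integer arithmetic; a
    sketch of the computation used in grd_rdp_nvenc_create_encode_session ()
    below:

      uint16_t aligned_width;
      uint16_t aligned_height;

      /* Round the width up to the next multiple of 16 */
      aligned_width = surface_width +
                      (surface_width % 16 ? 16 - surface_width % 16 : 0);
      /* Round the height up to the next multiple of 64 for NVENC */
      aligned_height = surface_height +
                       (surface_height % 64 ? 64 - surface_height % 64 : 0);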
    
    The resulting image (YUV420 in the NV12 format) will then be passed to
    NVENC, which then encodes the frame.
    When using NVENC with MBAFF, NVENC requires the image to be already
    interlaced.
    If it is not interlaced, even lines end up in the resulting image at
    position y / 2 instead of y, while odd lines end up at position
    y / 2 + aligned_height / 2 instead of y.
    To take care of this situation, calculate the interlaced position
    directly in the CUDA kernel function.
    The resulting image will then be correct on the client side.
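
    The per-line remapping is cheap to compute; the CUDA kernel below
    implements it as a small device function:

      __device__ uint16_t
      nv12_get_interlaced_y_1x1 (uint16_t y_1x1,
                                 uint16_t aligned_height)
      {
        /* Lines from the top half map to even rows,
         * lines from the bottom half to odd rows */
        if (y_1x1 < aligned_height >> 1)
          return y_1x1 << 1;
        return (y_1x1 << 1) - aligned_height + 1;
      }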
    
    NVENC support was introduced with the Kepler generation.
    Since the CUDA toolkit removed Kepler support with version 11, and most
    distributions don't ship a CUDA package, ship the generated PTX code
    with gnome-remote-desktop.
    The PTX code is generated with the CUDA toolkit version 10 and will
    work for all Kepler and later GPUs, since PTX code is forward
    compatible.
    
    PTX code is not a binary; it is a human-readable text file.
    CUDA code is generated in two steps:
    
    First, the PTX code: PTX code is generated for a specific compute
    capability, but can also be processed by GPUs that support a newer
    compute capability.
    However, it cannot be processed by GPUs with an older (lower) compute
    capability.
    
    Second, the CUDA binary: the binary that will eventually end up on the
    GPU.
    Technically, gnome-remote-desktop could ship that binary, but this is
    not suitable:
    First, since it is a binary, it cannot be easily verified.
    Second, the binary is GPU specific, meaning gnome-remote-desktop would
    have to ship a fat binary to cover all GPUs, which is not feasible.
    
    The NVIDIA driver ships a JIT compiler, which can load PTX code and
    generate the CUDA binary at runtime.
    
    When gnome-remote-desktop starts, the NVIDIA driver will automatically
    use the JIT compiler to produce the device-specific CUDA binary.
    This is a fast process, and it will also load the module.
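
    In terms of the CUDA driver API, this JIT step is a plain module load
    from the PTX text (a condensed sketch of grd_rdp_nvenc_new () below,
    error handling omitted):

      g_autofree char *ptx = NULL;
      CUmodule cu_module;
      CUfunction cu_kernel;

      g_file_get_contents (GRD_DATA_DIR "/grd-cuda-avc-utils_30.ptx",
                           &ptx, NULL, NULL);
      /* The driver JIT-compiles the PTX for the installed GPU here */
      cuda_funcs->cuModuleLoadData (&cu_module, ptx);
      cuda_funcs->cuModuleGetFunction (&cu_kernel, cu_module,
                                       "convert_2x2_bgrx_area_to_yuv420_nv12");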
    
    gnome-remote-desktop then uses the module to perform the colour
    conversion when encoding an AVC420 frame.
    
    In the future, the NVENC and CUDA implementation will be extended to
    also produce AVC444 frames.
    The NVENC capable GPU doesn't need to support AVC444 frames for this,
    since RDP uses a special way to create AVC444 frames (composed of two
    AVC420 frames: one main view, one auxiliary view).

 .gitlab-ci.yml                 |   3 +-
 config.h.meson                 |   6 +
 data/README                    |  59 +++++
 data/grd-cuda-avc-utils_30.ptx | 226 +++++++++++++++++
 data/meson.build               |   5 +
 meson.build                    |  33 ++-
 meson_options.txt              |   5 +
 src/grd-cuda-avc-utils.cu      | 210 ++++++++++++++++
 src/grd-rdp-nvenc.c            | 539 +++++++++++++++++++++++++++++++++++++++++
 src/grd-rdp-nvenc.h            |  51 ++++
 src/grd-types.h                |   1 +
 src/meson.build                |  12 +
 12 files changed, 1144 insertions(+), 6 deletions(-)
---
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
index 17272d6..46bbe32 100644
--- a/.gitlab-ci.yml
+++ b/.gitlab-ci.yml
@@ -9,7 +9,7 @@ stages:
 .gnome-remote-desktop.fedora:33@common:
   variables:
     FDO_DISTRIBUTION_VERSION: 34
-    BASE_TAG: '2021-08-06.0'
+    BASE_TAG: '2021-08-06.1'
     FDO_UPSTREAM_REPO: GNOME/gnome-remote-desktop
     FDO_DISTRIBUTION_EXEC: |
       dnf -y update && dnf -y upgrade &&
@@ -25,6 +25,7 @@ stages:
       dnf remove -y pipewire0.2-devel pipewire0.2-libs &&
       dnf install -y 'pkgconfig(libpipewire-0.3)' &&
       dnf install -y 'pkgconfig(fuse3)' &&
+      dnf install -y 'pkgconfig(ffnvcodec)' &&
       dnf install -y dbus-daemon xorg-x11-server-Xvfb python3-dbus \
                      python3-gobject gnome-settings-daemon mesa-dri-drivers \
                      xorg-x11-server-Xwayland mutter &&
diff --git a/config.h.meson b/config.h.meson
index dd1b26e..98647d7 100644
--- a/config.h.meson
+++ b/config.h.meson
@@ -11,3 +11,9 @@
 
 /* Defined if VNC backend is enabled */
 #mesondefine HAVE_VNC
+
+/* Defined if NVENC is available */
+#mesondefine HAVE_NVENC
+
+/* Path of the data dir */
+#mesondefine GRD_DATA_DIR
diff --git a/data/README b/data/README
new file mode 100644
index 0000000..653af7f
--- /dev/null
+++ b/data/README
@@ -0,0 +1,59 @@
+How to produce the PTX instructions for CUDA kernels
+====================================================
+
+For the generation of the PTX instructions, the CUDA toolkit needs to be
+installed (See for this below in "Retrieving the CUDA toolkit").
+
+Generation:
+-----------
+
+When the CUDA toolkit is installed and the current directory is `src`,
+generate the PTX instructions via:
+/opt/cuda/bin/nvcc -arch=compute_30 -ptx grd-cuda-avc-utils.cu -o ../data/grd-cuda-avc-utils_30.ptx
+
+The nvcc path differs from OS to OS. In the case above, Arch Linux is used,
+which uses the path `/opt/cuda/bin/nvcc` for nvcc.
+
+`-arch=compute_30` tells nvcc to generate instructions for GPUs with compute
+capability 3.0.
+CUDA GPUs with a higher compute capability can also run CUDA kernels built
+for a lower compute capability.
+The CUDA kernel for gnome-remote-desktop uses compute capability 3.0, as that
+is the compute capability of Kepler GPUs.
+Kepler GPUs are the first GPU generation that supports NVENC. To remain
+compatible with these GPUs, generate the instructions for compute capability
+3.0.
+
+Also append a suffix, as in the example above, denoting the compute
+capability of the generated PTX instructions.
+If a kernel might be more efficient with newer CUDA features, generate a kernel
+for the higher necessary compute capability and another one for older GPUs as
+a fallback.
+Then use the CUDA functions to check the compute capability of the selected
+GPU at runtime to determine which PTX instructions should be loaded.
+
+Retrieving the CUDA toolkit:
+----------------------------
+
+Retrieving the CUDA toolkit depends on the distribution. It should be noted
+that the generation of PTX instructions for compute capability 3.0 was removed
+in CUDA toolkit version 11.
+So, an older version of the CUDA toolkit (version 10) is needed to generate
+PTX instructions for Kepler GPUs.
+
+Instructions to retrieve the CUDA toolkit version 10 on Arch Linux:
+-------------------------------------------------------------------
+
+While the current version of the CUDA toolkit can be found in the `community`
+repository (included by default) under the name `cuda`, the older version 10 is
+available via the Arch Linux Archive:
+
+For this, the easiest way is to use the `downgrade` tool (AUR):
+
+Run `downgrade gcc8 gcc8-libs cuda` (might require root privileges), choose the
+latest gcc8 and gcc8-libs versions (gcc8 is a dependency of CUDA 10), and for
+cuda choose the latest CUDA 10 release.
+The downgrade utility will then download and install these packages.
+
+After this, you can use nvcc to generate PTX instructions for Kepler GPUs as
+well.
diff --git a/data/grd-cuda-avc-utils_30.ptx b/data/grd-cuda-avc-utils_30.ptx
new file mode 100644
index 0000000..5fb4469
--- /dev/null
+++ b/data/grd-cuda-avc-utils_30.ptx
@@ -0,0 +1,226 @@
+//
+// Generated by NVIDIA NVVM Compiler
+//
+// Compiler Build ID: CL-27506705
+// Cuda compilation tools, release 10.2, V10.2.89
+// Based on LLVM 3.4svn
+//
+
+.version 6.5
+.target sm_30
+.address_size 64
+
+       // .globl       convert_2x2_bgrx_area_to_yuv420_nv12
+
+.visible .entry convert_2x2_bgrx_area_to_yuv420_nv12(
+       .param .u64 convert_2x2_bgrx_area_to_yuv420_nv12_param_0,
+       .param .u64 convert_2x2_bgrx_area_to_yuv420_nv12_param_1,
+       .param .u16 convert_2x2_bgrx_area_to_yuv420_nv12_param_2,
+       .param .u16 convert_2x2_bgrx_area_to_yuv420_nv12_param_3,
+       .param .u16 convert_2x2_bgrx_area_to_yuv420_nv12_param_4,
+       .param .u16 convert_2x2_bgrx_area_to_yuv420_nv12_param_5,
+       .param .u16 convert_2x2_bgrx_area_to_yuv420_nv12_param_6,
+       .param .u16 convert_2x2_bgrx_area_to_yuv420_nv12_param_7
+)
+{
+       .reg .pred      %p<15>;
+       .reg .b16       %rs<23>;
+       .reg .b32       %r<127>;
+       .reg .b64       %rd<36>;
+
+
+       ld.param.u64    %rd6, [convert_2x2_bgrx_area_to_yuv420_nv12_param_0];
+       ld.param.u64    %rd7, [convert_2x2_bgrx_area_to_yuv420_nv12_param_1];
+       ld.param.u16    %rs7, [convert_2x2_bgrx_area_to_yuv420_nv12_param_2];
+       ld.param.u16    %rs8, [convert_2x2_bgrx_area_to_yuv420_nv12_param_3];
+       ld.param.u16    %rs9, [convert_2x2_bgrx_area_to_yuv420_nv12_param_4];
+       ld.param.u16    %rs10, [convert_2x2_bgrx_area_to_yuv420_nv12_param_6];
+       ld.param.u16    %rs11, [convert_2x2_bgrx_area_to_yuv420_nv12_param_7];
+       mov.u32         %r30, %ntid.x;
+       mov.u32         %r31, %ctaid.x;
+       mov.u32         %r32, %tid.x;
+       mad.lo.s32      %r1, %r30, %r31, %r32;
+       mov.u32         %r33, %ntid.y;
+       mov.u32         %r34, %ctaid.y;
+       mov.u32         %r35, %tid.y;
+       mad.lo.s32      %r2, %r33, %r34, %r35;
+       and.b32         %r36, %r1, 65535;
+       ld.param.u16    %r37, [convert_2x2_bgrx_area_to_yuv420_nv12_param_5];
+       shr.u32         %r38, %r37, 1;
+       and.b32         %r3, %r2, 65535;
+       cvt.u32.u16     %r4, %rs10;
+       shr.u32         %r5, %r4, 1;
+       setp.ge.u32     %p1, %r3, %r5;
+       setp.ge.u32     %p2, %r36, %r38;
+       or.pred         %p3, %p1, %p2;
+       @%p3 bra        BB0_10;
+
+       cvta.to.global.u64      %rd8, %rd6;
+       cvt.u32.u16     %r42, %rs9;
+       and.b32         %r43, %r1, 32767;
+       shl.b32         %r44, %r1, 1;
+       and.b32         %r6, %r44, 65534;
+       mov.u32         %r45, 1;
+       shl.b32         %r46, %r2, 1;
+       and.b32         %r47, %r46, 65534;
+       mul.lo.s32      %r48, %r47, %r42;
+       cvt.u64.u32     %rd9, %r48;
+       shl.b32         %r49, %r43, 3;
+       cvt.u64.u32     %rd10, %r49;
+       add.s64         %rd1, %rd9, %rd10;
+       cvta.to.global.u64      %rd11, %rd7;
+       add.s64         %rd2, %rd11, %rd1;
+       setp.lt.u32     %p4, %r47, %r5;
+       shl.b32         %r50, %r2, 2;
+       sub.s32         %r51, %r45, %r4;
+       selp.b32        %r52, 0, %r51, %p4;
+       mov.u32         %r124, 0;
+       add.s32         %r53, %r52, %r50;
+       cvt.u64.u32     %rd12, %r53;
+       and.b64         %rd13, %rd12, 65535;
+       cvt.u64.u16     %rd14, %rs11;
+       mul.lo.s64      %rd15, %rd13, %rd14;
+       cvt.u64.u32     %rd16, %r6;
+       add.s64         %rd17, %rd15, %rd16;
+       add.s64         %rd3, %rd8, %rd17;
+       add.s32         %r7, %r47, 1;
+       and.b32         %r54, %r7, 65535;
+       setp.lt.u32     %p5, %r54, %r5;
+       shl.b32         %r55, %r7, 1;
+       selp.b32        %r56, 0, %r51, %p5;
+       add.s32         %r57, %r56, %r55;
+       cvt.u64.u32     %rd18, %r57;
+       and.b64         %rd19, %rd18, 65535;
+       mul.lo.s64      %rd20, %rd19, %rd14;
+       add.s64         %rd21, %rd20, %rd16;
+       add.s64         %rd4, %rd8, %rd21;
+       shr.u32         %r58, %r4, 2;
+       setp.lt.u32     %p6, %r3, %r58;
+       sub.s32         %r59, %r45, %r5;
+       selp.b32        %r60, 0, %r59, %p6;
+       shl.b32         %r61, %r3, 1;
+       add.s32         %r62, %r61, %r60;
+       cvt.u64.u32     %rd22, %r62;
+       and.b64         %rd23, %rd22, 65535;
+       cvt.u64.u16     %rd24, %rs10;
+       add.s64         %rd25, %rd23, %rd24;
+       mul.lo.s64      %rd26, %rd25, %rd14;
+       add.s64         %rd27, %rd26, %rd16;
+       add.s64         %rd5, %rd8, %rd27;
+       cvt.u32.u16     %r63, %rs7;
+       setp.ge.u32     %p7, %r6, %r63;
+       cvt.u32.u16     %r64, %rs8;
+       setp.ge.u32     %p8, %r47, %r64;
+       mov.u16         %rs21, 0;
+       or.pred         %p9, %p7, %p8;
+       mov.u16         %rs20, %rs21;
+       mov.u32         %r125, %r124;
+       mov.u32         %r126, %r124;
+       @%p9 bra        BB0_3;
+
+       ld.global.u8    %rs13, [%rd2];
+       cvt.u32.u16     %r126, %rs13;
+       ld.global.u8    %r125, [%rd2+1];
+       ld.global.u8    %r124, [%rd2+2];
+       mul.wide.u16    %r65, %rs13, 18;
+       mad.lo.s32      %r66, %r125, 183, %r65;
+       mad.lo.s32      %r67, %r124, 54, %r66;
+       shr.u32         %r68, %r67, 8;
+       cvt.u16.u32     %rs20, %r68;
+
+BB0_3:
+       and.b32         %r73, %r2, 32767;
+       shl.b32         %r74, %r73, 1;
+       setp.ge.u32     %p10, %r74, %r64;
+       st.global.u8    [%rd3], %rs20;
+       add.s32         %r14, %r6, 1;
+       setp.ge.u32     %p11, %r14, %r63;
+       or.pred         %p12, %p11, %p10;
+       @%p12 bra       BB0_5;
+
+       ld.global.u8    %rs15, [%rd2+4];
+       cvt.u32.u16     %r77, %rs15;
+       add.s32         %r126, %r77, %r126;
+       ld.global.u8    %r78, [%rd2+5];
+       add.s32         %r125, %r78, %r125;
+       ld.global.u8    %r79, [%rd2+6];
+       add.s32         %r124, %r79, %r124;
+       mul.wide.u16    %r80, %rs15, 18;
+       mad.lo.s32      %r81, %r78, 183, %r80;
+       mad.lo.s32      %r82, %r79, 54, %r81;
+       shr.u32         %r83, %r82, 8;
+       cvt.u16.u32     %rs21, %r83;
+
+BB0_5:
+       st.global.u8    [%rd3+1], %rs21;
+       setp.lt.u32     %p13, %r7, %r64;
+       @%p13 bra       BB0_7;
+       bra.uni         BB0_6;
+
+BB0_7:
+       cvt.u64.u16     %rd28, %rs9;
+       add.s64         %rd29, %rd1, %rd28;
+       add.s64         %rd31, %rd11, %rd29;
+       ld.global.u8    %rs18, [%rd31];
+       cvt.u32.u16     %r86, %rs18;
+       add.s32         %r126, %r86, %r126;
+       ld.global.u8    %r87, [%rd31+1];
+       add.s32         %r125, %r87, %r125;
+       ld.global.u8    %r88, [%rd31+2];
+       add.s32         %r124, %r88, %r124;
+       mul.wide.u16    %r89, %rs18, 18;
+       mad.lo.s32      %r90, %r87, 183, %r89;
+       mad.lo.s32      %r91, %r88, 54, %r90;
+       shr.u32         %r92, %r91, 8;
+       st.global.u8    [%rd4], %r92;
+       mov.u16         %rs22, 0;
+       @%p11 bra       BB0_9;
+
+       add.s32         %r94, %r42, 4;
+       and.b32         %r95, %r94, 65535;
+       cvt.u64.u32     %rd32, %r95;
+       add.s64         %rd33, %rd1, %rd32;
+       add.s64         %rd35, %rd11, %rd33;
+       ld.global.u8    %rs19, [%rd35];
+       cvt.u32.u16     %r96, %rs19;
+       add.s32         %r126, %r96, %r126;
+       ld.global.u8    %r97, [%rd35+1];
+       add.s32         %r125, %r97, %r125;
+       ld.global.u8    %r98, [%rd35+2];
+       add.s32         %r124, %r98, %r124;
+       mul.wide.u16    %r99, %rs19, 18;
+       mad.lo.s32      %r100, %r97, 183, %r99;
+       mad.lo.s32      %r101, %r98, 54, %r100;
+       shr.u32         %r102, %r101, 8;
+       cvt.u16.u32     %rs22, %r102;
+       bra.uni         BB0_9;
+
+BB0_6:
+       mov.u16         %rs22, 0;
+       st.global.u8    [%rd4], %rs22;
+
+BB0_9:
+       st.global.u8    [%rd4+1], %rs22;
+       bfe.u32         %r103, %r124, 2, 8;
+       mul.lo.s32      %r104, %r103, -29;
+       bfe.u32         %r105, %r125, 2, 8;
+       mad.lo.s32      %r106, %r105, -99, %r104;
+       bfe.u32         %r107, %r126, 2, 8;
+       shl.b32         %r108, %r107, 7;
+       add.s32         %r109, %r106, %r108;
+       shr.u32         %r110, %r109, 8;
+       add.s32         %r111, %r110, 128;
+       st.global.u8    [%rd5], %r111;
+       shl.b32         %r112, %r124, 5;
+       and.b32         %r113, %r112, 32640;
+       mad.lo.s32      %r114, %r105, -116, %r113;
+       mad.lo.s32      %r115, %r107, -12, %r114;
+       shr.u32         %r116, %r115, 8;
+       add.s32         %r117, %r116, 128;
+       st.global.u8    [%rd5+1], %r117;
+
+BB0_10:
+       ret;
+}
+
+
diff --git a/data/meson.build b/data/meson.build
new file mode 100644
index 0000000..a65423c
--- /dev/null
+++ b/data/meson.build
@@ -0,0 +1,5 @@
+if have_nvenc
+  install_data(['grd-cuda-avc-utils_30.ptx'],
+    install_dir: grd_datadir,
+  )
+endif
diff --git a/meson.build b/meson.build
index c3a53d8..7e5f399 100644
--- a/meson.build
+++ b/meson.build
@@ -6,11 +6,14 @@ project('gnome-remote-desktop', 'c',
 
 freerdp_req = '>= 2.3.0'
 fuse_req = '>= 3.9.1'
+nvenc_req = '>= 11'
 xkbcommon_req = '>= 1.0.0'
 
 gnome = import('gnome')
 i18n  = import('i18n')
 
+cc = meson.get_compiler('c')
+
 cairo_dep = dependency('cairo')
 glib_dep = dependency('glib-2.0', version: '>= 2.68')
 gio_dep = dependency('gio-2.0')
@@ -22,11 +25,16 @@ libnotify_dep = dependency('libnotify')
 
 have_rdp = get_option('rdp')
 have_vnc = get_option('vnc')
+have_nvenc = get_option('nvenc')
 
 if not have_rdp and not have_vnc
   error('Must enable at least one backend')
 endif
 
+if have_nvenc and not have_rdp
+  error('Support for hardware acceleration using NVENC requires the RDP backend')
+endif
+
 if have_rdp
   add_global_arguments('-D_GNU_SOURCE', language : 'c')
 
@@ -36,6 +44,11 @@ if have_rdp
   fuse_dep = dependency('fuse3', version: fuse_req)
   winpr_dep = dependency('winpr2', version: freerdp_req)
   xkbcommon_dep = dependency('xkbcommon', version: xkbcommon_req)
+
+  if have_nvenc
+    dl_dep = cc.find_library('dl', required: true)
+    nvenc_dep = dependency('ffnvcodec', version: nvenc_req)
+  endif
 endif
 
 if have_vnc
@@ -43,12 +56,22 @@ if have_vnc
   libvncclient_dep = dependency('libvncclient')
 endif
 
+prefix = get_option('prefix')
+libexecdir = join_paths(prefix, get_option('libexecdir'))
+datadir = join_paths(prefix, get_option('datadir'))
+schemadir = join_paths(datadir, 'glib-2.0', 'schemas')
+
+grd_datadir = join_paths(datadir, 'gnome-remote-desktop')
+
 cdata = configuration_data()
 cdata.set_quoted('GETTEXT_PACKAGE', 'gnome-remote-desktop')
 cdata.set_quoted('VERSION', meson.project_version())
 
 cdata.set('HAVE_RDP', have_rdp)
 cdata.set('HAVE_VNC', have_vnc)
+cdata.set('HAVE_NVENC', have_nvenc)
+
+cdata.set_quoted('GRD_DATA_DIR', grd_datadir)
 
 configure_file(input: 'config.h.meson',
                output: 'config.h',
@@ -68,11 +91,7 @@ endif
 top_srcdir = meson.current_source_dir()
 builddir = meson.current_build_dir()
 
-prefix = get_option('prefix')
-libexecdir = join_paths(prefix, get_option('libexecdir'))
-datadir = join_paths(prefix, get_option('datadir'))
-schemadir = join_paths(datadir, 'glib-2.0', 'schemas')
-
+subdir('data')
 subdir('src')
 subdir('tests')
 subdir('po')
@@ -96,6 +115,10 @@ output = [
   '        RDP...................... ' + have_rdp.to_string(),
   '        VNC...................... ' + have_vnc.to_string(),
   '',
+  '    Options for the RDP backend:',
+  '',
+  '        Support for hardware acceleration using NVENC and CUDA........' + have_nvenc.to_string(),
+  '',
   '  Now type \'ninja -C ' + meson.build_root() + '\' to build ' + meson.project_name(),
   '',
   '',
diff --git a/meson_options.txt b/meson_options.txt
index ca2908b..fb96cd1 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -8,6 +8,11 @@ option('vnc',
        value: true,
        description: 'Enable the VNC backend')
 
+option('nvenc',
+       type: 'boolean',
+       value: true,
+       description: 'Build with support for hardware acceleration using NVENC and CUDA')
+
 option('systemd_user_unit_dir',
        type: 'string',
        value: '',
diff --git a/src/grd-cuda-avc-utils.cu b/src/grd-cuda-avc-utils.cu
new file mode 100644
index 0000000..82d87f2
--- /dev/null
+++ b/src/grd-cuda-avc-utils.cu
@@ -0,0 +1,210 @@
+/*
+ * Copyright (C) 2021 Pascal Nowack
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
+ * 02111-1307, USA.
+ */
+
+/*
+ * Generate the PTX instructions with:
+ * nvcc -arch=compute_30 -ptx grd-cuda-avc-utils.cu -o grd-cuda-avc-utils_30.ptx
+ *
+ * Note: This requires CUDA < 11, since the generation of Kepler capable
+ * PTX code was removed from CUDA 11.
+ */
+
+#include <stdint.h>
+
+extern "C"
+{
+  __device__ uint16_t
+  nv12_get_interlaced_y_1x1 (uint16_t y_1x1,
+                             uint16_t aligned_height)
+  {
+    if (y_1x1 < aligned_height >> 1)
+      return y_1x1 << 1;
+    return (y_1x1 << 1) - aligned_height + 1;
+  }
+
+  __device__ uint16_t
+  nv12_get_interlaced_y_2x2 (uint16_t y_2x2,
+                             uint16_t aligned_height)
+  {
+    if (y_2x2 < aligned_height >> 2)
+      return y_2x2 << 1;
+    return (y_2x2 << 1) - (aligned_height >> 1) + 1;
+  }
+
+  __device__ uint8_t
+  rgb_to_y (uint8_t r,
+            uint8_t g,
+            uint8_t b)
+  {
+    return (54 * r + 183 * g + 18 * b) >> 8;
+  }
+
+  __device__ uint8_t
+  rgb_to_u (uint8_t r,
+            uint8_t g,
+            uint8_t b)
+  {
+    return ((-29 * r - 99 * g + 128 * b) >> 8) + 128;
+  }
+
+  __device__ uint8_t
+  rgb_to_v (uint8_t r,
+            uint8_t g,
+            uint8_t b)
+  {
+    return ((128 * r - 116 * g - 12 * b) >> 8) + 128;
+  }
+
+  __global__ void
+  convert_2x2_bgrx_area_to_yuv420_nv12 (uint8_t  *dst_data,
+                                        uint8_t  *src_data,
+                                        uint16_t  src_width,
+                                        uint16_t  src_height,
+                                        uint16_t  src_stride,
+                                        uint16_t  aligned_width,
+                                        uint16_t  aligned_height,
+                                        uint16_t  aligned_stride)
+  {
+    uint8_t *src, *dst_y0, *dst_y1, *dst_y2, *dst_y3, *dst_u, *dst_v;
+    uint16_t s0, s1, s2, s3;
+    int32_t r_a, g_a, b_a;
+    uint8_t r, g, b;
+    uint16_t x_1x1, y_1x1;
+    uint16_t x_2x2, y_2x2;
+
+    x_2x2 = blockIdx.x * blockDim.x + threadIdx.x;
+    y_2x2 = blockIdx.y * blockDim.y + threadIdx.y;
+
+    if (x_2x2 >= aligned_width >> 1 || y_2x2 >= aligned_height >> 1)
+      return;
+
+    /*
+     *  -------------
+     *  | d_0 | d_1 |
+     *  -------------
+     *  | d_2 | d_3 |
+     *  -------------
+     */
+    s0 = 0;
+    s1 = 4;
+    s2 = src_stride;
+    s3 = src_stride + 4;
+    /*
+     * Technically, the correct positions for the Y data in the resulting NV12
+     * image would be the following:
+     *
+     * d0 = 0;
+     * d1 = 1;
+     * d2 = aligned_stride;
+     * d3 = aligned_stride + 1;
+     *
+     * However, since MBAFF is used as frame field mode, NVENC requires the input
+     * frame to be interlaced.
+     * If the frame is not interlaced, then even lines end up at the position
+     * y / 2 instead of y, and odd lines end up at the position y / 2 +
+     * aligned_height / 2 instead of y.
+     * So, calculate the interlaced y position via a dedicated function, which
+     * ensures that the lines of the input frame end up at the correct position
+     * in the resulting frame.
+     * Doing this now in the kernel here, instead of after the BGRX -> YUV420
+     * conversion, saves a huge amount of time, since each thread only has a
+     * super tiny overhead to perform this action, while a normal
+     * device-to-device copy operation can take at least several milliseconds.
+     */
+
+    x_1x1 = x_2x2 << 1;
+    y_1x1 = y_2x2 << 1;
+    src = src_data + y_1x1 * src_stride + (x_1x1 << 2);
+
+    dst_y0 = dst_data +
+             nv12_get_interlaced_y_1x1 (y_1x1, aligned_height) * aligned_stride +
+             x_1x1;
+    dst_y1 = dst_y0 + 1;
+    dst_y2 = dst_data +
+             nv12_get_interlaced_y_1x1 (y_1x1 + 1, aligned_height) * aligned_stride +
+             x_1x1;
+    dst_y3 = dst_y2 + 1;
+    dst_u = dst_data + aligned_height * aligned_stride +
+            nv12_get_interlaced_y_2x2 (y_2x2, aligned_height) * aligned_stride +
+            x_1x1;
+    dst_v = dst_u + 1;
+
+    /* d_0 */
+    if (x_1x1 < src_width && y_1x1 < src_height)
+      {
+        b_a = b = src[s0 + 0];
+        g_a = g = src[s0 + 1];
+        r_a = r = src[s0 + 2];
+        *dst_y0 = rgb_to_y (r, g, b);
+      }
+    else
+      {
+        b_a = b = 0;
+        g_a = g = 0;
+        r_a = r = 0;
+        *dst_y0 = 0;
+      }
+
+    if (x_1x1 + 1 < src_width && y_1x1 < src_height)
+      {
+        /* d_1 */
+        b_a += b = src[s1 + 0];
+        g_a += g = src[s1 + 1];
+        r_a += r = src[s1 + 2];
+        *dst_y1 = rgb_to_y (r, g, b);
+      }
+    else
+      {
+        *dst_y1 = 0;
+      }
+
+    if (y_1x1 + 1 < src_height)
+      {
+        /* d_2 */
+        b_a += b = src[s2 + 0];
+        g_a += g = src[s2 + 1];
+        r_a += r = src[s2 + 2];
+        *dst_y2 = rgb_to_y (r, g, b);
+
+        if (x_1x1 + 1 < src_width)
+          {
+            /* d_3 */
+            b_a += b = src[s3 + 0];
+            g_a += g = src[s3 + 1];
+            r_a += r = src[s3 + 2];
+            *dst_y3 = rgb_to_y (r, g, b);
+          }
+        else
+          {
+            *dst_y3 = 0;
+          }
+      }
+    else
+      {
+        *dst_y2 = 0;
+        *dst_y3 = 0;
+      }
+
+    b_a >>= 2;
+    g_a >>= 2;
+    r_a >>= 2;
+    *dst_u = rgb_to_u (r_a, g_a, b_a);
+    *dst_v = rgb_to_v (r_a, g_a, b_a);
+  }
+}
diff --git a/src/grd-rdp-nvenc.c b/src/grd-rdp-nvenc.c
new file mode 100644
index 0000000..89fb374
--- /dev/null
+++ b/src/grd-rdp-nvenc.c
@@ -0,0 +1,539 @@
+/*
+ * Copyright (C) 2021 Pascal Nowack
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
+ * 02111-1307, USA.
+ */
+
+#include "config.h"
+
+#include "grd-rdp-nvenc.h"
+
+#include <ffnvcodec/dynlink_loader.h>
+
+typedef struct _NvEncEncodeSession
+{
+  void *encoder;
+
+  uint16_t enc_width;
+  uint16_t enc_height;
+
+  NV_ENC_OUTPUT_PTR buffer_out;
+} NvEncEncodeSession;
+
+struct _GrdRdpNvenc
+{
+  GObject parent;
+
+  CudaFunctions *cuda_funcs;
+  NvencFunctions *nvenc_funcs;
+  NV_ENCODE_API_FUNCTION_LIST nvenc_api;
+
+  CUdevice cu_device;
+  CUcontext cu_context;
+  gboolean initialized;
+
+  CUmodule cu_module_avc_utils;
+  CUfunction cu_bgrx_to_yuv420;
+
+  GHashTable *encode_sessions;
+
+  uint32_t next_encode_session_id;
+};
+
+G_DEFINE_TYPE (GrdRdpNvenc, grd_rdp_nvenc, G_TYPE_OBJECT);
+
+static uint32_t
+get_next_free_encode_session_id (GrdRdpNvenc *rdp_nvenc)
+{
+  uint32_t encode_session_id = rdp_nvenc->next_encode_session_id;
+
+  while (g_hash_table_contains (rdp_nvenc->encode_sessions,
+                                GUINT_TO_POINTER (encode_session_id)))
+    ++encode_session_id;
+
+  rdp_nvenc->next_encode_session_id = encode_session_id + 1;
+
+  return encode_session_id;
+}
+
+gboolean
+grd_rdp_nvenc_create_encode_session (GrdRdpNvenc *rdp_nvenc,
+                                     uint32_t    *encode_session_id,
+                                     uint16_t     surface_width,
+                                     uint16_t     surface_height,
+                                     uint16_t     refresh_rate)
+{
+  NvEncEncodeSession *encode_session;
+  NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS open_params = {0};
+  NV_ENC_INITIALIZE_PARAMS init_params = {0};
+  NV_ENC_CONFIG encode_config = {0};
+  NV_ENC_CREATE_BITSTREAM_BUFFER create_bitstream_buffer = {0};
+  uint16_t aligned_width;
+  uint16_t aligned_height;
+
+  aligned_width = surface_width + (surface_width % 16 ? 16 - surface_width % 16 : 0);
+  aligned_height = surface_height + (surface_height % 64 ? 64 - surface_height % 64 : 0);
+
+  *encode_session_id = get_next_free_encode_session_id (rdp_nvenc);
+  encode_session = g_malloc0 (sizeof (NvEncEncodeSession));
+  encode_session->enc_width = aligned_width;
+  encode_session->enc_height = aligned_height;
+
+  open_params.version = NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS_VER;
+  open_params.deviceType = NV_ENC_DEVICE_TYPE_CUDA;
+  open_params.device = rdp_nvenc->cu_context;
+  open_params.apiVersion = NVENCAPI_VERSION;
+
+  if (rdp_nvenc->nvenc_api.nvEncOpenEncodeSessionEx (
+        &open_params, &encode_session->encoder) != NV_ENC_SUCCESS)
+    {
+      g_debug ("[HWAccel.NVENC] Failed to open encode session");
+      g_free (encode_session);
+      return FALSE;
+    }
+
+  encode_config.version = NV_ENC_CONFIG_VER;
+  encode_config.profileGUID = NV_ENC_H264_PROFILE_PROGRESSIVE_HIGH_GUID;
+  encode_config.gopLength = NVENC_INFINITE_GOPLENGTH;
+  encode_config.frameIntervalP = 1;
+  encode_config.frameFieldMode = NV_ENC_PARAMS_FRAME_FIELD_MODE_MBAFF;
+  encode_config.mvPrecision = NV_ENC_MV_PRECISION_QUARTER_PEL;
+  encode_config.rcParams.version = NV_ENC_RC_PARAMS_VER;
+  encode_config.rcParams.rateControlMode = NV_ENC_PARAMS_RC_VBR;
+  encode_config.rcParams.averageBitRate = 0;
+  encode_config.rcParams.maxBitRate = 0;
+  encode_config.rcParams.targetQuality = 22;
+  encode_config.encodeCodecConfig.h264Config.idrPeriod = NVENC_INFINITE_GOPLENGTH;
+  encode_config.encodeCodecConfig.h264Config.chromaFormatIDC = 1;
+
+  init_params.version = NV_ENC_INITIALIZE_PARAMS_VER;
+  init_params.encodeGUID = NV_ENC_CODEC_H264_GUID;
+  init_params.encodeWidth = aligned_width;
+  init_params.encodeHeight = aligned_height;
+  init_params.darWidth = surface_width;
+  init_params.darHeight = surface_height;
+  init_params.frameRateNum = refresh_rate;
+  init_params.frameRateDen = 1;
+  init_params.enablePTD = 1;
+  init_params.encodeConfig = &encode_config;
+  if (rdp_nvenc->nvenc_api.nvEncInitializeEncoder (
+        encode_session->encoder, &init_params) != NV_ENC_SUCCESS)
+    {
+      NV_ENC_PIC_PARAMS pic_params = {0};
+
+      g_warning ("[HWAccel.NVENC] Failed to initialize encoder");
+      pic_params.encodePicFlags = NV_ENC_PIC_FLAG_EOS;
+      rdp_nvenc->nvenc_api.nvEncEncodePicture (encode_session->encoder,
+                                               &pic_params);
+      rdp_nvenc->nvenc_api.nvEncDestroyEncoder (encode_session->encoder);
+
+      g_free (encode_session);
+      return FALSE;
+    }
+
+  create_bitstream_buffer.version = NV_ENC_CREATE_BITSTREAM_BUFFER_VER;
+  if (rdp_nvenc->nvenc_api.nvEncCreateBitstreamBuffer (
+        encode_session->encoder, &create_bitstream_buffer) != NV_ENC_SUCCESS)
+    {
+      NV_ENC_PIC_PARAMS pic_params = {0};
+
+      g_warning ("[HWAccel.NVENC] Failed to create bitstream buffer");
+      pic_params.encodePicFlags = NV_ENC_PIC_FLAG_EOS;
+      rdp_nvenc->nvenc_api.nvEncEncodePicture (encode_session->encoder,
+                                               &pic_params);
+      rdp_nvenc->nvenc_api.nvEncDestroyEncoder (encode_session->encoder);
+
+      g_free (encode_session);
+      return FALSE;
+    }
+  encode_session->buffer_out = create_bitstream_buffer.bitstreamBuffer;
+
+  g_hash_table_insert (rdp_nvenc->encode_sessions,
+                       GUINT_TO_POINTER (*encode_session_id),
+                       encode_session);
+
+  return TRUE;
+}
+
+void
+grd_rdp_nvenc_free_encode_session (GrdRdpNvenc *rdp_nvenc,
+                                   uint32_t     encode_session_id)
+{
+  NvEncEncodeSession *encode_session;
+  NV_ENC_PIC_PARAMS pic_params = {0};
+
+  if (!g_hash_table_steal_extended (rdp_nvenc->encode_sessions,
+                                    GUINT_TO_POINTER (encode_session_id),
+                                    NULL, (gpointer *) &encode_session))
+    return;
+
+  rdp_nvenc->nvenc_api.nvEncDestroyBitstreamBuffer (encode_session->encoder,
+                                                    encode_session->buffer_out);
+
+  pic_params.encodePicFlags = NV_ENC_PIC_FLAG_EOS;
+  rdp_nvenc->nvenc_api.nvEncEncodePicture (encode_session->encoder,
+                                           &pic_params);
+  rdp_nvenc->nvenc_api.nvEncDestroyEncoder (encode_session->encoder);
+
+  g_free (encode_session);
+}
+
+gboolean
+grd_rdp_nvenc_avc420_encode_bgrx_frame (GrdRdpNvenc  *rdp_nvenc,
+                                        uint32_t      encode_session_id,
+                                        uint8_t      *src_data,
+                                        uint16_t      src_width,
+                                        uint16_t      src_height,
+                                        uint16_t      aligned_width,
+                                        uint16_t      aligned_height,
+                                        uint8_t     **bitstream,
+                                        uint32_t     *bitstream_size)
+{
+  NvEncEncodeSession *encode_session;
+  CUDA_MEMCPY2D cu_memcpy_2d = {0};
+  NV_ENC_REGISTER_RESOURCE register_res = {0};
+  NV_ENC_MAP_INPUT_RESOURCE map_input_res = {0};
+  NV_ENC_PIC_PARAMS pic_params = {0};
+  NV_ENC_LOCK_BITSTREAM lock_bitstream = {0};
+  CUstream cu_stream = NULL;
+  CUdeviceptr bgrx_buffer = 0, nv12_buffer = 0;
+  size_t bgrx_pitch = 0, nv12_pitch = 0;
+  unsigned int grid_dim_x, grid_dim_y, grid_dim_z;
+  unsigned int block_dim_x, block_dim_y, block_dim_z;
+  void *args[8];
+
+  if (!g_hash_table_lookup_extended (rdp_nvenc->encode_sessions,
+                                     GUINT_TO_POINTER (encode_session_id),
+                                     NULL, (gpointer *) &encode_session))
+    return FALSE;
+
+  g_assert (encode_session->enc_width == aligned_width);
+  g_assert (encode_session->enc_height == aligned_height);
+
+  if (rdp_nvenc->cuda_funcs->cuStreamCreate (&cu_stream, 0) != CUDA_SUCCESS)
+    {
+      g_warning ("[HWAccel.CUDA] Failed to create stream");
+      return FALSE;
+    }
+
+  if (rdp_nvenc->cuda_funcs->cuMemAllocPitch (
+        &bgrx_buffer, &bgrx_pitch, src_width * 4, src_height, 4) != CUDA_SUCCESS)
+    {
+      g_warning ("[HWAccel.CUDA] Failed to allocate BGRX buffer");
+      rdp_nvenc->cuda_funcs->cuStreamDestroy (cu_stream);
+      return FALSE;
+    }
+
+  cu_memcpy_2d.srcMemoryType = CU_MEMORYTYPE_HOST;
+  cu_memcpy_2d.srcHost = src_data;
+  cu_memcpy_2d.srcPitch = src_width * 4;
+
+  cu_memcpy_2d.dstMemoryType = CU_MEMORYTYPE_DEVICE;
+  cu_memcpy_2d.dstDevice = bgrx_buffer;
+  cu_memcpy_2d.dstPitch = bgrx_pitch;
+
+  cu_memcpy_2d.WidthInBytes = src_width * 4;
+  cu_memcpy_2d.Height = src_height;
+
+  if (rdp_nvenc->cuda_funcs->cuMemcpy2DAsync (
+        &cu_memcpy_2d, cu_stream) != CUDA_SUCCESS)
+    {
+      g_warning ("[HWAccel.CUDA] Failed to initiate H2D copy");
+      rdp_nvenc->cuda_funcs->cuMemFree (bgrx_buffer);
+      rdp_nvenc->cuda_funcs->cuStreamDestroy (cu_stream);
+      return FALSE;
+    }
+
+  if (rdp_nvenc->cuda_funcs->cuMemAllocPitch (
+        &nv12_buffer, &nv12_pitch,
+        aligned_width, aligned_height + aligned_height / 2, 4) != CUDA_SUCCESS)
+    {
+      g_warning ("[HWAccel.CUDA] Failed to allocate NV12 buffer");
+      rdp_nvenc->cuda_funcs->cuStreamSynchronize (cu_stream);
+      rdp_nvenc->cuda_funcs->cuMemFree (bgrx_buffer);
+      rdp_nvenc->cuda_funcs->cuStreamDestroy (cu_stream);
+      return FALSE;
+    }
+
+  /* Threads per block */
+  block_dim_x = 32;
+  block_dim_y = 8;
+  block_dim_z = 1;
+  /* Number of blocks per grid */
+  grid_dim_x = aligned_width / 2 / block_dim_x +
+               (aligned_width / 2 % block_dim_x ? 1 : 0);
+  grid_dim_y = aligned_height / 2 / block_dim_y +
+               (aligned_height / 2 % block_dim_y ? 1 : 0);
+  grid_dim_z = 1;
+
+  args[0] = &nv12_buffer;
+  args[1] = &bgrx_buffer;
+  args[2] = &src_width;
+  args[3] = &src_height;
+  args[4] = &bgrx_pitch;
+  args[5] = &aligned_width;
+  args[6] = &aligned_height;
+  args[7] = &aligned_width;
+
+  if (rdp_nvenc->cuda_funcs->cuLaunchKernel (
+        rdp_nvenc->cu_bgrx_to_yuv420, grid_dim_x, grid_dim_y, grid_dim_z,
+        block_dim_x, block_dim_y, block_dim_z, 0, cu_stream, args, NULL) != CUDA_SUCCESS)
+    {
+      g_warning ("[HWAccel.CUDA] Failed to launch BGRX_TO_YUV420 kernel");
+      rdp_nvenc->cuda_funcs->cuStreamSynchronize (cu_stream);
+      rdp_nvenc->cuda_funcs->cuMemFree (nv12_buffer);
+      rdp_nvenc->cuda_funcs->cuMemFree (bgrx_buffer);
+      rdp_nvenc->cuda_funcs->cuStreamDestroy (cu_stream);
+      return FALSE;
+    }
+
+  if (rdp_nvenc->cuda_funcs->cuStreamSynchronize (cu_stream) != CUDA_SUCCESS)
+    {
+      g_warning ("[HWAccel.CUDA] Failed to synchronize stream");
+      rdp_nvenc->cuda_funcs->cuMemFree (nv12_buffer);
+      rdp_nvenc->cuda_funcs->cuMemFree (bgrx_buffer);
+      rdp_nvenc->cuda_funcs->cuStreamDestroy (cu_stream);
+      return FALSE;
+    }
+
+  rdp_nvenc->cuda_funcs->cuStreamDestroy (cu_stream);
+  rdp_nvenc->cuda_funcs->cuMemFree (bgrx_buffer);
+
+  register_res.version = NV_ENC_REGISTER_RESOURCE_VER;
+  register_res.resourceType = NV_ENC_INPUT_RESOURCE_TYPE_CUDADEVICEPTR;
+  register_res.width = aligned_width;
+  register_res.height = aligned_height;
+  register_res.pitch = aligned_width;
+  register_res.resourceToRegister = (void *) nv12_buffer;
+  register_res.bufferFormat = NV_ENC_BUFFER_FORMAT_NV12;
+  register_res.bufferUsage = NV_ENC_INPUT_IMAGE;
+
+  if (rdp_nvenc->nvenc_api.nvEncRegisterResource (
+        encode_session->encoder, &register_res) != NV_ENC_SUCCESS)
+    {
+      g_warning ("[HWAccel.NVENC] Failed to register resource");
+      rdp_nvenc->cuda_funcs->cuMemFree (nv12_buffer);
+      return FALSE;
+    }
+
+  map_input_res.version = NV_ENC_MAP_INPUT_RESOURCE_VER;
+  map_input_res.registeredResource = register_res.registeredResource;
+
+  if (rdp_nvenc->nvenc_api.nvEncMapInputResource (
+        encode_session->encoder, &map_input_res) != NV_ENC_SUCCESS)
+    {
+      g_warning ("[HWAccel.NVENC] Failed to map input resource");
+      rdp_nvenc->nvenc_api.nvEncUnregisterResource (encode_session->encoder,
+                                                    register_res.registeredResource);
+      rdp_nvenc->cuda_funcs->cuMemFree (nv12_buffer);
+      return FALSE;
+    }
+
+  pic_params.version = NV_ENC_PIC_PARAMS_VER;
+  pic_params.inputWidth = aligned_width;
+  pic_params.inputHeight = aligned_height;
+  pic_params.inputPitch = aligned_width;
+  pic_params.inputBuffer = map_input_res.mappedResource;
+  pic_params.outputBitstream = encode_session->buffer_out;
+  pic_params.bufferFmt = map_input_res.mappedBufferFmt;
+  pic_params.pictureStruct = NV_ENC_PIC_STRUCT_FRAME;
+
+  if (rdp_nvenc->nvenc_api.nvEncEncodePicture (
+        encode_session->encoder, &pic_params) != NV_ENC_SUCCESS)
+    {
+      g_warning ("[HWAccel.NVENC] Failed to encode frame");
+      rdp_nvenc->nvenc_api.nvEncUnmapInputResource (encode_session->encoder,
+                                                    map_input_res.mappedResource);
+      rdp_nvenc->nvenc_api.nvEncUnregisterResource (encode_session->encoder,
+                                                    register_res.registeredResource);
+      rdp_nvenc->cuda_funcs->cuMemFree (nv12_buffer);
+      return FALSE;
+    }
+
+  lock_bitstream.version = NV_ENC_LOCK_BITSTREAM_VER;
+  lock_bitstream.outputBitstream = encode_session->buffer_out;
+
+  if (rdp_nvenc->nvenc_api.nvEncLockBitstream (
+        encode_session->encoder, &lock_bitstream) != NV_ENC_SUCCESS)
+    {
+      g_warning ("[HWAccel.NVENC] Failed to lock bitstream");
+      rdp_nvenc->nvenc_api.nvEncUnmapInputResource (encode_session->encoder,
+                                                    map_input_res.mappedResource);
+      rdp_nvenc->nvenc_api.nvEncUnregisterResource (encode_session->encoder,
+                                                    register_res.registeredResource);
+      rdp_nvenc->cuda_funcs->cuMemFree (nv12_buffer);
+      return FALSE;
+    }
+
+  *bitstream_size = lock_bitstream.bitstreamSizeInBytes;
+  *bitstream = g_memdup2 (lock_bitstream.bitstreamBufferPtr, *bitstream_size);
+
+  rdp_nvenc->nvenc_api.nvEncUnlockBitstream (encode_session->encoder,
+                                             lock_bitstream.outputBitstream);
+
+  rdp_nvenc->nvenc_api.nvEncUnmapInputResource (encode_session->encoder,
+                                                map_input_res.mappedResource);
+  rdp_nvenc->nvenc_api.nvEncUnregisterResource (encode_session->encoder,
+                                                register_res.registeredResource);
+  rdp_nvenc->cuda_funcs->cuMemFree (nv12_buffer);
+
+  return TRUE;
+}
+
+GrdRdpNvenc *
+grd_rdp_nvenc_new (void)
+{
+  GrdRdpNvenc *rdp_nvenc;
+  gboolean nvenc_device_found = FALSE;
+  CUdevice cu_device = 0;
+  int cu_device_count = 0;
+  g_autofree char *avc_ptx_path = NULL;
+  g_autofree char *avc_ptx_instructions = NULL;
+  g_autoptr (GError) error = NULL;
+  int i;
+
+  rdp_nvenc = g_object_new (GRD_TYPE_RDP_NVENC, NULL);
+  cuda_load_functions (&rdp_nvenc->cuda_funcs, NULL);
+  nvenc_load_functions (&rdp_nvenc->nvenc_funcs, NULL);
+
+  if (!rdp_nvenc->cuda_funcs || !rdp_nvenc->nvenc_funcs)
+    {
+      g_debug ("[HWAccel.CUDA] Failed to load CUDA or NVENC library");
+      g_clear_object (&rdp_nvenc);
+      return NULL;
+    }
+
+  rdp_nvenc->cuda_funcs->cuInit (0);
+  rdp_nvenc->cuda_funcs->cuDeviceGetCount (&cu_device_count);
+
+  g_debug ("[HWAccel.CUDA] Found %i CUDA devices", cu_device_count);
+  for (i = 0; i < cu_device_count; ++i)
+    {
+      int cc_major = 0, cc_minor = 0;
+
+      rdp_nvenc->cuda_funcs->cuDeviceGet (&cu_device, i);
+      rdp_nvenc->cuda_funcs->cuDeviceComputeCapability (&cc_major, &cc_minor,
+                                                        cu_device);
+
+      g_debug ("[HWAccel.CUDA] Device %i compute capability: [%i, %i]",
+               i, cc_major, cc_minor);
+      if (cc_major >= 3)
+        {
+          g_debug ("[HWAccel.NVENC] Choosing CUDA device with id %i", i);
+          nvenc_device_found = TRUE;
+          break;
+        }
+    }
+
+  if (!cu_device_count || !nvenc_device_found)
+    {
+      g_debug ("[HWAccel.NVENC] No NVENC capable gpu found");
+      g_clear_object (&rdp_nvenc);
+      return NULL;
+    }
+
+  rdp_nvenc->cu_device = cu_device;
+  if (rdp_nvenc->cuda_funcs->cuDevicePrimaryCtxRetain (
+        &rdp_nvenc->cu_context, rdp_nvenc->cu_device) != CUDA_SUCCESS)
+    {
+      g_warning ("[HWAccel.CUDA] Failed to retain CUDA context");
+      g_clear_object (&rdp_nvenc);
+      return NULL;
+    }
+
+  rdp_nvenc->nvenc_api.version = NV_ENCODE_API_FUNCTION_LIST_VER;
+  if (rdp_nvenc->nvenc_funcs->NvEncodeAPICreateInstance (&rdp_nvenc->nvenc_api) != NV_ENC_SUCCESS)
+    {
+      g_warning ("[HWAccel.NVENC] Could not create NVENC API instance");
+
+      rdp_nvenc->cuda_funcs->cuDevicePrimaryCtxRelease (rdp_nvenc->cu_device);
+      g_clear_object (&rdp_nvenc);
+
+      return NULL;
+    }
+
+  if (rdp_nvenc->cuda_funcs->cuCtxPushCurrent (rdp_nvenc->cu_context) != CUDA_SUCCESS)
+    {
+      g_warning ("[HWAccel.CUDA] Failed to push CUDA context");
+      rdp_nvenc->cuda_funcs->cuDevicePrimaryCtxRelease (rdp_nvenc->cu_device);
+      g_clear_object (&rdp_nvenc);
+      return NULL;
+    }
+
+  rdp_nvenc->initialized = TRUE;
+
+  avc_ptx_path = g_strdup_printf ("%s/grd-cuda-avc-utils_30.ptx", GRD_DATA_DIR);
+  if (!g_file_get_contents (avc_ptx_path, &avc_ptx_instructions, NULL, &error))
+    g_error ("[HWAccel.CUDA] Failed to read PTX instructions: %s", error->message);
+
+  if (rdp_nvenc->cuda_funcs->cuModuleLoadData (
+        &rdp_nvenc->cu_module_avc_utils, avc_ptx_instructions) != CUDA_SUCCESS)
+    {
+      g_warning ("[HWAccel.CUDA] Failed to load CUDA module");
+      g_clear_object (&rdp_nvenc);
+      return NULL;
+    }
+
+  if (rdp_nvenc->cuda_funcs->cuModuleGetFunction (
+        &rdp_nvenc->cu_bgrx_to_yuv420, rdp_nvenc->cu_module_avc_utils,
+        "convert_2x2_bgrx_area_to_yuv420_nv12") != CUDA_SUCCESS)
+    {
+      g_warning ("[HWAccel.CUDA] Failed to get AVC CUDA kernel");
+      g_clear_object (&rdp_nvenc);
+      return NULL;
+    }
+
+  return rdp_nvenc;
+}
+
+static void
+grd_rdp_nvenc_dispose (GObject *object)
+{
+  GrdRdpNvenc *rdp_nvenc = GRD_RDP_NVENC (object);
+
+  if (rdp_nvenc->initialized)
+    {
+      rdp_nvenc->cuda_funcs->cuCtxPopCurrent (&rdp_nvenc->cu_context);
+      rdp_nvenc->cuda_funcs->cuDevicePrimaryCtxRelease (rdp_nvenc->cu_device);
+
+      rdp_nvenc->initialized = FALSE;
+    }
+
+  g_clear_pointer (&rdp_nvenc->cu_module_avc_utils,
+                   rdp_nvenc->cuda_funcs->cuModuleUnload);
+
+  nvenc_free_functions (&rdp_nvenc->nvenc_funcs);
+  cuda_free_functions (&rdp_nvenc->cuda_funcs);
+
+  g_assert (g_hash_table_size (rdp_nvenc->encode_sessions) == 0);
+  g_clear_pointer (&rdp_nvenc->encode_sessions, g_hash_table_destroy);
+
+  G_OBJECT_CLASS (grd_rdp_nvenc_parent_class)->dispose (object);
+}
+
+static void
+grd_rdp_nvenc_init (GrdRdpNvenc *rdp_nvenc)
+{
+  rdp_nvenc->encode_sessions = g_hash_table_new (NULL, NULL);
+}
+
+static void
+grd_rdp_nvenc_class_init (GrdRdpNvencClass *klass)
+{
+  GObjectClass *object_class = G_OBJECT_CLASS (klass);
+
+  object_class->dispose = grd_rdp_nvenc_dispose;
+}
diff --git a/src/grd-rdp-nvenc.h b/src/grd-rdp-nvenc.h
new file mode 100644
index 0000000..483bf3d
--- /dev/null
+++ b/src/grd-rdp-nvenc.h
@@ -0,0 +1,51 @@
+/*
+ * Copyright (C) 2021 Pascal Nowack
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
+ * 02111-1307, USA.
+ */
+
+#ifndef GRD_RDP_NVENC_H
+#define GRD_RDP_NVENC_H
+
+#include <glib-object.h>
+#include <stdint.h>
+
+#define GRD_TYPE_RDP_NVENC (grd_rdp_nvenc_get_type ())
+G_DECLARE_FINAL_TYPE (GrdRdpNvenc, grd_rdp_nvenc,
+                      GRD, RDP_NVENC, GObject);
+
+GrdRdpNvenc *grd_rdp_nvenc_new (void);
+
+gboolean grd_rdp_nvenc_create_encode_session (GrdRdpNvenc *rdp_nvenc,
+                                              uint32_t    *encode_session_id,
+                                              uint16_t     surface_width,
+                                              uint16_t     surface_height,
+                                              uint16_t     refresh_rate);
+
+void grd_rdp_nvenc_free_encode_session (GrdRdpNvenc *rdp_nvenc,
+                                        uint32_t     encode_session_id);
+
+gboolean grd_rdp_nvenc_avc420_encode_bgrx_frame (GrdRdpNvenc  *rdp_nvenc,
+                                                 uint32_t      encode_session_id,
+                                                 uint8_t      *src_data,
+                                                 uint16_t      src_width,
+                                                 uint16_t      src_height,
+                                                 uint16_t      aligned_width,
+                                                 uint16_t      aligned_height,
+                                                 uint8_t     **bitstream,
+                                                 uint32_t     *bitstream_size);
+
+#endif /* GRD_RDP_NVENC_H */
diff --git a/src/grd-types.h b/src/grd-types.h
index edea7f0..c3bb763 100644
--- a/src/grd-types.h
+++ b/src/grd-types.h
@@ -32,6 +32,7 @@ typedef struct _GrdRdpGfxFrameLog GrdRdpGfxFrameLog;
 typedef struct _GrdRdpGfxSurface GrdRdpGfxSurface;
 typedef struct _GrdRdpGraphicsPipeline GrdRdpGraphicsPipeline;
 typedef struct _GrdRdpNetworkAutodetection GrdRdpNetworkAutodetection;
+typedef struct _GrdRdpNvenc GrdRdpNvenc;
 typedef struct _GrdRdpSAMFile GrdRdpSAMFile;
 typedef struct _GrdRdpServer GrdRdpServer;
 typedef struct _GrdRdpSurface GrdRdpSurface;
diff --git a/src/meson.build b/src/meson.build
index ef29d7b..2fe3923 100644
--- a/src/meson.build
+++ b/src/meson.build
@@ -71,6 +71,18 @@ if have_rdp
     winpr_dep,
     xkbcommon_dep,
   ]
+
+  if have_nvenc
+    daemon_sources += files([
+      'grd-rdp-nvenc.c',
+      'grd-rdp-nvenc.h',
+    ])
+
+    deps += [
+      dl_dep,
+      nvenc_dep,
+    ]
+  endif
 endif
 
 if have_vnc