In order to simplify development, the ADM-PA101 has been set up to run PetaLinux, allowing the soft-cores to be added to the Slurm cluster since the card has Ethernet access. To enable this, we need to configure PetaLinux to boot via ‘tftp’ and mount its root filesystem over NFS.
By default, PetaLinux configures the Ethernet port with a random MAC address. To allow a DHCP-assigned IP address based on the MAC address, the following variables need to be set:
CONFIG_SUBSYSTEM_ETHERNET_VERSAL_CIPS_0_PSPMC_0_PSV_ETHERNET_0_MAC="00:c0:ff:ee:00:00"
CONFIG_SUBSYSTEM_ETHERNET_VERSAL_CIPS_0_PSPMC_0_PSV_ETHERNET_0_USE_DHCP=y
The hostname can be set using CONFIG_SUBSYSTEM_HOSTNAME="fpga01".
The default PetaLinux configuration will set up root and petalinux users. This configuration can be overridden as follows:
CONFIG_ADD_EXTRA_USERS="root:root;user1:initialpassword;"
CONFIG_CREATE_NEW_GROUPS="aie;"
CONFIG_ADD_USERS_TO_GROUPS="user1:audio,video,aie;"
CONFIG_ADD_USERS_TO_SUDOERS="user1"
NOTE: This sets the default root password to ‘root’ and should be changed. The petalinux-build command will raise a warning to remind you to change this.
In the above example, user1 has sudo access through the addition of CONFIG_ADD_USERS_TO_SUDOERS="user1". The example also shows how groups can be added.
NOTE: The first build of PetaLinux should be used to create the root filesystem (or use petalinux-build -c rootfs to rebuild), which should then be expanded into the NFS share directory (e.g. /tftpboot/nfsroot).
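For reference, a minimal sketch of expanding the generated root filesystem into the NFS share is shown below; the archive path images/linux/rootfs.tar.gz is the usual PetaLinux output location, but verify it for your build before copying:
sudo mkdir -p /tftpboot/nfsroot
sudo tar -xzf images/linux/rootfs.tar.gz -C /tftpboot/nfsroot   # expand the rootfs into the NFS export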
Using NFS for the root filesystem should be a trivial configuration change using petalinux-config. However, by default, the Xilinx PetaLinux configuration uses the NFS v4 protocol for the client. Unfortunately, this is incompatible with the default Debian NFS server running on our login node. The answer is to force the PetaLinux boot to use NFS v3, which can be set in the BOOTARGS using the PetaLinux config UI or in the BOOTARGS variable of the project-spec/configs/config file in the PetaLinux project directory (sw/petalinux/base):
CONFIG_SUBSYSTEM_BOOTARGS_GENERATED="console=ttyAMA0 earlycon=pl011,mmio32,0xFF000000,115200n8 clk_ignore_unused root=/dev/nfs nfsroot=c0.ff.ee.00:/tftpboot/nfsroot,tcp,v3 ip=dhcp rw"
Here we can see that the root file system is being set to an NFS mount (root=/dev/nfs) with the nfsroot option including the server and path, as well as forcing tcp and v3 of the NFS protocol.
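On the server side, the directory must of course be exported; a sketch of an /etc/exports entry on the login node is shown below (the subnet and options are purely illustrative):
/tftpboot/nfsroot 192.168.0.0/24(rw,sync,no_root_squash,no_subtree_check)
followed by sudo exportfs -ra to re-export the shares.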
Unfortunately, the CONFIG_SUBSYSTEM_BOOTARGS_GENERATED setting, as the name suggests, is generated and gets wiped during the build. Therefore, the documentation states that the boot command arguments need to be placed in the chosen section of sw/petalinux/system-user.dtsi as follows:
chosen {
stdout-path = "serial0:115200";
bootargs = "console=ttyAMA0 earlycon=pl011,mmio32,0xFF000000,115200n8 clk_ignore_unused root=/dev/nfs nfsroot=c0.ff.ee.00:/tftpboot/nfsroot,tcp,v3 ip=dhcp rw";
};
However, this breaks the build when petalinux-build generates other .dtsi files, and we are unable to proceed further.
After much experimentation, the following approach can be used to build a PetaLinux image for the uSD card that will boot over ‘tftp’ and mount the root filesystem over NFS.
1. Expand ps_base_sw-admpa101-v1_2_0.tar.gz in a working directory
2. source <petalinux_tools_directory>/settings.sh
3. source <vivado_directory>/settings64.sh
4. cd ps_base_sw-admpa101-v1_2_0/fpga/proj/base
5. vivado -mode batch -source mkxpr-base.tcl
6. vivado -mode batch -source do_build.tcl
7. cd ps_base_sw-admpa101-v1_2_0/sw/petalinux
8. petalinux-create -t project -s ../../os/simple.bsp
9. cd simple
10. petalinux-build
11. Copy the diff (config.patch) to ps_base_sw-admpa101-v1_2_0/sw/petalinux/simple
12. patch -b project-spec/configs/config config.patch, or edit project-spec/configs/config directly to make the required changes above
13. petalinux-build
14. petalinux-package --boot --u-boot (builds BOOT.BIN)
15. Copy image.ub, boot.scr and BOOT.BIN from /tftpboot to the uSD card (petalinux-build will place the files in /tftpboot by default).

Note: Ignore the following warning, as once NFS is enabled the user accounts will be configured from the NFS root file system:
WARNING: petalinux-image-minimal-1.0-r0 do_rootfs: Enabling autologin to user root. This configuration should NOT be used in production!
As mentioned above, this build assumes that there is an expanded rootfs for the ARM cores in /tftpboot/nfsroot (from a previous petalinux-build -c rootfs).
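Once the card has booted, it is worth a quick sanity check that the root filesystem really is the NFS export, for example:
cat /proc/cmdline   # should contain root=/dev/nfs and the nfsroot=... option
findmnt /           # the source for / should be the NFS server path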
RAJAPerf is a benchmark suite of loop-based computational kernels relevant to HPC.
The DongshanNezhaSTU board contains the Allwinner D1 C906, which supports the V vector extension (version 0.7.1). The chip contains 128-bit wide vector registers and supports element sizes up to 32-bit. Because of this, we compiled RAJAPerf with single-precision floating-point numbers to enable speedup from vectorization.
We also compare the performance against the StarFive JH7110 (VF2), which contains a quad-core SiFive U74, and a Fujitsu Arm A64FX system, which has SIMD instructions (NEON) as well as scalable vectors (SVE). The A64FX processor is designed for HPC applications and is completely different in nature from the RISC-V cores, which are designed for embedded systems and single-board computers (SBCs). However, a comparison against the A64FX is still useful as it can highlight important differences and potential design improvements for an HPC-class RISC-V processor in the future. Because the C906 only contains a single core, all benchmarks are run on a single core to enable direct comparison across CPUs, and only NEON with 128-bit vector width is used on the A64FX.
The RISC-V results are compiled using the XuanTie GCC 8.4, with -O3 -march=rv64gcv0p7 -ffast-math for vector and -O3 -march=rv64gc -ffast-math for scalar; for Arm we used GCC 11.2 with -O3 -ffast-math -mcpu=a64fx -march=armv8.2-a+simd+nosve for vector and -O3 -ffast-math -mcpu=a64fx -march=armv8.2-a+nosimd+nosve for scalar.
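To make the build concrete, a configuration along the following lines could be used for the D1 vector runs (a sketch only: the repository URL is the upstream LLNL one, and the single-precision switch is omitted because its exact option name is not given here):
git clone --recursive https://github.com/LLNL/RAJAPerf.git
cd RAJAPerf && mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=riscv64-unknown-linux-gnu-g++ \
      -DCMAKE_CXX_FLAGS="-O3 -march=rv64gcv0p7 -ffast-math" ..
make -j4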
In the following plots we show runtimes for the RAJAPerf kernel normalised against the kernel’s scalar runtime. For the A64FX, normalisation is against running in scalar mode on the A64FX, whereas for the Allwinner D1 and StarFive JH7110 it is normalised against running scalar on the D1. The orange and purple bars show the vectorisation performance difference on the A64FX and D1 respectively, and the green bars show a comparison of the scalar performance between the JH7110 (VF2) and the D1.
It can be observed from these plots that for most linear algebra kernels, the vectorised code on the RISC-V D1 is faster compared to its scalar counterpart.
Below we also tested LLVM 15.0, which is able to vectorize more kernels than XuanTie GCC 8.4, but generates RVV 1.0 code. We utilized the RVV-rollback tool https://github.com/RISCVtestbed/rvv-rollback to translate some of the kernels, and the speedup can be seen in the plots below.
Kernels vectorized by GCC:
Kernels not vectorized by GCC:
Kernels vectorized by GCC, but no vector instructions were executed at runtime:
Clang contains settings for vector-length-specific code (VLS - via -riscv-v-vector-bits-min=128) and vector-length-agnostic code (VLA - via -scalable-vectorization=on), which we showed in the plots above. It can be seen that Clang and GCC have different performance in terms of vectorizing and executing vector instructions for the different kernels.
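As an illustration, the two modes correspond to invocations along these lines (kernel.c is a hypothetical source file; older clang versions may additionally need -menable-experimental-extensions, as noted later):
clang -target riscv64 -march=rv64gcv -O2 -mllvm -riscv-v-vector-bits-min=128 -c kernel.c   # VLS
clang -target riscv64 -march=rv64gcv -O2 -mllvm -scalable-vectorization=on -c kernel.c     # VLA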
For more details of the above results, see the following publications:
ld (linker), as (assembler), and objdump (displays object file information).
The first toolchain is the RISC-V GNU Compiler Toolchain, which is available at https://github.com/riscv-collab/riscv-gnu-toolchain. The README provides comprehensive instructions to compile the toolchain.
Different versions of this toolchain have already been installed on the login node and can be loaded directly using module load, following the instructions here. Once loaded, the compilers and binutils can be called directly, e.g.
[username@riscv-login ~]$ module load riscv64-linux/gnu-12.2
[username@riscv-login ~]$ riscv64-unknown-linux-gnu-gcc --version
riscv64-unknown-linux-gnu-gcc (g) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Notes:
- Binaries are prefixed riscv(32/64)-unknown-(elf/linux-gnu)- for (32/64)-bit and (newlib/glibc) respectively
- The target ISA is selected with -march=ISA-string, e.g. -march=rv64gc. For more options, see https://gcc.gnu.org/onlinedocs/gcc/RISC-V-Options.html

The toolchain also includes a simulator (e.g. QEMU), which allows us to run RISC-V binaries on the host. To build the simulator, after configuring and building the GNU toolchain, additionally run $ make build-sim SIM=qemu. To use the simulator, just run $ qemu-riscv64 (application).
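Putting the two together, a quick check of the toolchain and simulator might look like this (hello.c is a hypothetical source file):
$ riscv64-unknown-linux-gnu-gcc -O2 -march=rv64gc -o hello hello.c
$ qemu-riscv64 ./hello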
Note:
- The QEMU simulator is already included in the riscv64-linux/gnu-12.2 module on riscv-login
- To build with a specific host compiler, edit Makefile.in under build-qemu and add the following flags to configure:
--cc=[c compiler] \
--cxx=[c++ compiler]
LLVM also supports RISC-V, and at the moment provides better vector (1.0) support than gcc. To build the LLVM project, the gnu toolchain has to be built first. For reference see https://llvm.org/docs/CMake.html and https://llvm.org/docs/GettingStarted.html. Most importantly for building LLVM for RISC-V, the following flags have to be added to cmake (e.g. for 64-bit):
cmake ... -DLLVM_TARGETS_TO_BUILD="RISCV" \
-DLLVM_ENABLE_PROJECTS="clang;lld" \
-DLLVM_ENABLE_RUNTIMES="compiler-rt;libcxx;libcxxabi;libunwind" \
-DLLVM_DEFAULT_TARGET_TRIPLE="riscv64-linux-gnu" \
-DDEFAULT_SYSROOT="$(INSTALL_DIR)/sysroot"
where $(INSTALL_DIR) is the gcc toolchain install directory. However, since -DDEFAULT_SYSROOT is set, the flag -DGCC_INSTALL_PREFIX will be ignored, which is actually necessary to find libgcc. A workaround is to merge the paths.
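One possible reading of "merging the paths" is to copy the GCC runtime tree into the sysroot so that clang can locate libgcc without -DGCC_INSTALL_PREFIX; a rough sketch (the exact layout is an assumption, and the PR below avoids the problem altogether):
cp -a $INSTALL_DIR/lib/gcc $INSTALL_DIR/sysroot/lib/   # make the libgcc tree visible inside the sysroot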
This has been implemented in a PR https://github.com/riscv-collab/riscv-gnu-toolchain/pull/1166, which is currently the easiest way to build the LLVM project. To build this toolchain:
$ git clone https://github.com/cmuellner/riscv-gnu-toolchain.git
$ cd riscv-gnu-toolchain/
$ git checkout origin/llvm-new
$ ./configure --prefix=$(prefix) --with-arch=rv64gc --with-abi=lp64d --enable-llvm --enable-linux
$ make
The LLVM binaries will be built in the same location, in $prefix.
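The resulting clang defaults to the riscv64-linux-gnu target (set via -DLLVM_DEFAULT_TARGET_TRIPLE above), so it can be used directly, for example (hello.c is a hypothetical source file):
$ $prefix/bin/clang -O2 -march=rv64gc -o hello hello.c
$ qemu-riscv64 ./hello   # assuming the QEMU simulator was also built, as described above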
Notes:
- Run git submodule update --init --recursive, then cd LLVM and git fetch to pull the latest LLVM.
- To build with a specific host compiler, edit Makefile.in under build-llvm-linux and add the following flags to cmake:
-DCMAKE_C_COMPILER="[c compiler]" \
-DCMAKE_CXX_COMPILER="[c++ compiler]"
The upstream LLVM compiler (clang) by default supports the vector extension and auto-vectorization. To build gcc with vector support and auto-vectorization, the rvv-next branch needs to be checked out.
Notes:
- For clang, compile with -march=rv64gcv -menable-experimental-extensions -O2 -mllvm --riscv-v-vector-bits-min=128 or -march=rv64gcv -menable-experimental-extensions -O2 -mllvm -scalable-vectorization=on
- For gcc (rvv-next), configure with --with-arch=rv64gcv and compile with -O3
The toolchain contains the debugger riscv64-unknown-linux-gnu-gdb. To debug RISC-V executables on the host, we need to use it in conjunction with the QEMU simulator. To do so, we first connect QEMU to the application by adding the -g (port) flag, e.g.
$ qemu-riscv64 -g 1234 ./hello-world
Next we need to set up gdb to connect to the QEMU instance. In a separate terminal, create the file .gdbinit, and include the target to connect to the port. For example,
$ cat .gdbinit
target remote localhost:1234
tui enable
layout asm
break main
This will allow us to debug with the text user interface, with a breakpoint at main.
Then, we can simply run the debugger
$ riscv64-unknown-linux-gnu-gdb ./hello-world
and commence debugging. There may be additional instructions prompted on screen here, which should be followed.
A major caveat is that the first ratified RVV is version 1.0 (spec), whereas the C920 and C906 cores in Sophon SG2042 and the Allwinner D1 SoCs were designed to support RVV 0.7.1 (spec). The two specs are similar but not compatible. For more information, see 1 2.
On riscv-login, the following compilers modules (see Getting Started) support RVV 0.7.1:
riscv64-linux/gnu-8.4-rvv
riscv64-linux/gnu-9.2-rvv
riscv64-linux/gnu-10.2-rvv
The following compiler modules support RVV 1.0
riscv64-linux/gnu-10.2-rvv
riscv64-linux/llvm-15.0
riscv64-linux/llvm-16.0
The simplest way to work with RVV 0.7.1 is in assembly language. The spec provides some examples of how to do so. Tests of memcpy and strcpy speeds on Allwinner D1 hardware using RVV 0.7.1 have been recorded here.
Notes:
- Compile with -march=...v (e.g. -march=rv64gcv to include the vector extension; to specify the version, -march=rv64gcv0p7)
- riscv64-linux/gnu-8.4-rvv provides the best auto-vectorisation
- RVV 0.7.1 intrinsics are supported by the riscv64-linux/gnu-10.2-rvv compiler: https://occ-oss-prod.oss-cn-hangzhou.aliyuncs.com/resource//1663142187133/Xuantie+900+Series+RVV-0.7.1+Intrinsic+Manual.pdf

Because RVV 1.0 is the ratified version, there is significantly more support by compilers. The latest LLVM compiler and toolchain provide support for vector intrinsics (v0.10) and auto-vectorization.
Notes:
- Compile with -march=...v (e.g. -march=rv64gcv to include the vector extension; to specify the version, -march=rv64gcv1p0)
- When building the rvv-next branch toolchain, also pull the riscv-gcc-rvv-next branch in riscv-gcc
- For gcc (rvv-next), configure with --with-arch=rv64gcv and compile with -ftree-vectorize or -O3 (see 1 2)
- For clang, use -march=rv64gv -target riscv64 -O2 -mllvm --riscv-v-vector-bits-min=N (e.g. N = 128) for vector length specific code, and -march=rv64gv -target riscv64 -O2 -mllvm -scalable-vectorization=on for vector length agnostic code
- Auto-vectorization reports can be obtained with -fopt-info-vec-all for gcc or -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize for clang. (See https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info-1337 and https://llvm.org/docs/Vectorizers.html)

Examples:
We have introduced a tool to translate RVV 1.0 assembly code to RVV 0.7, which is available at https://github.com/RISCVtestbed/rvv-rollback. It is tested for the following workflow:
1. Compile the source to RVV 1.0 assembly (.s)
2. Translate the RVV 1.0 .s to RVV 0.7 .s
3. Compile the .s to .o
The tool does not support some features introduced in v1.0, such as fractional LMUL and 64-bit elements.
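In practice, the workflow above might look roughly as follows (file names are hypothetical, and the exact rvv-rollback invocation should be taken from the tool's README rather than from this sketch):
clang -target riscv64 -march=rv64gcv -O2 -S kernel.c -o kernel_rvv1p0.s         # generate RVV 1.0 assembly
<rvv-rollback> kernel_rvv1p0.s > kernel_rvv0p7.s                                # translate to RVV 0.7 (invocation assumed)
riscv64-unknown-linux-gnu-gcc -march=rv64gcv0p7 -c kernel_rvv0p7.s -o kernel.o  # assemble with the XuanTie toolchain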