
Written by Noda.

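(Added 2017/10/19)
I am stuck trying to move the AOCL environment from v16.1 to v17.0.
The environment on the host, neutrino, is now v17.0.
I also updated the RTE on the Arria 10 to v17.0, but "aocl diagnose" reports that the kernel driver is still v16.1.
Apparently the SD card image also needs to be changed to v17.0.
I do not know how to do that.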

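A memo for people who want to do high-level synthesis with OpenCL. With this you can at least get an addition working.
The host code feels messy, but it works for now, so please forgive me.
Circuit optimization is not covered here; if you want the details, see the references below.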

This is a simple guide to implementing a circuit on the Arria 10 SoC using the high-level synthesis environment "Intel FPGA SDK for OpenCL". After reading this memo, you should be able to implement a simple addition circuit on the Arria 10 with this environment.

If you want to optimize your code, please read the following references.
https://www.altera.com/en_US/pdfs/literature/hb/opencl-sdk/aocl_getting_started.pdf
https://www.altera.com/en_US/pdfs/literature/hb/opencl-sdk/aocl_programming_guide.pdf

Introduction

The Arria 10 SoC is a system on chip from Intel that combines an ARM CPU and an FPGA. Its strength is that hard macros (DSP blocks) for float operations are embedded in the FPGA. Please note that these DSP blocks do not support double operations.
OpenCL is a framework for parallel computation in heterogeneous environments (CPU + GPU, CPU + FPGA, etc.). Intel FPGA SDK for OpenCL is an OpenCL-based HLS environment for FPGAs provided by Intel, and it is said that it lets you describe a high-performance FPGA circuit in a short period of time (is it true?).
In this environment we write host code and kernel code. The former is code for the control processor (the ARM on the Arria 10 SoC in this case) and is written in C++ with the OpenCL API. The latter is code for the arithmetic cores (the FPGA on the Arria 10 SoC in this case) and is written in the OpenCL C language.
In addition, a Board Support Package (BSP) is provided for each board, so users do not need to develop the I/O interfaces that connect the CPU, the FPGA, and memory. The BSP can be rewritten, but doing so is likely to break the environment, so it is better not to touch it. If you do want to try, take a backup first.

How to develop

Preparation for implementation

The sample addition code is placed in the following directory.

/home/asap2/noda/arria_test

There are two directories, "add" and "common". The addition code is in "add", and it will not work without the "common" directory. Copy the whole "arria_test" directory and move into the "add" directory. Below, we assume that your current directory is "add".
The host code is in "host/src/main.cpp", and the kernel code is in "device/add.cl". There are also several shell scripts, the so-called "secret sauce" (秘伝のタレ).
This code sums 100 randomly generated elements. To verify the result, the value calculated on the FPGA is compared with the value calculated on the CPU. The code also measures the kernel calculation time and the turn-around time.
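
The comparison between the FPGA result and the CPU result can be pictured with the following minimal sketch. It is not the actual "main.cpp": the helper names (make_input, sum_reference, verify), the input range [0, 1), the fixed seed, and the tolerance are all assumptions made for illustration. A tolerance is used because float addition is not associative, so a device that adds in a different order may differ from the CPU in the last digits.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: n pseudo-random floats in [0, 1).
// The range and the seed are assumptions, not taken from main.cpp.
std::vector<float> make_input(std::size_t n, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> v(n);
    for (float &x : v) x = dist(rng);
    return v;
}

// CPU reference: plain sequential sum in single precision.
float sum_reference(const std::vector<float> &v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}

// Hypothetical check: PASS if the device output matches the reference
// within a relative tolerance (assumed value, tune as needed).
bool verify(float output, float reference, float rel_tol = 1e-5f) {
    const float scale = std::fmax(1.0f, std::fabs(reference));
    return std::fabs(output - reference) <= rel_tol * scale;
}
```

In a real host program, the "output" argument would be the value read back from the device buffer, and the reference would be computed from the same input array.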

We will ssh and scp to the Arria 10 later, but before that your ".ssh/id_rsa.pub" has to be added to authorized_keys on the Arria 10. If you email your public key to asap@am.ics.keio.ac.jp, we will add it. After that you can connect to the Arria 10.

Emulation

Before implementing the circuit on the FPGA, we debug it on a CPU, on the host neutrino. First, ssh to neutrino, copy "~noda/.bash_profile", and source it.
Now that PATH is set, you can try emulation. The "add" directory contains a shell script "emu_go"; running it starts the emulation.
There are various compile options; check the references for details.
The output looks like this.

bash-4.1$ ./emu_go 
aoc: Environment checks are completed successfully.
You are now compiling the full flow!!
aoc: Selected target board a10soc_2ddr
aoc: Running OpenCL parser....
aoc: OpenCL parser completed successfully.
aoc: Compiling for Emulation ....
aoc: Emulator Compilation completed successfully.
Emulator flow is successful.
To execute emulated kernel, invoke host with 
	env CL_CONTEXT_EMULATOR_DEVICE_ALTERA=1 <host_program>
 For multi device emulations replace the 1 with the number of devices you which to emulate
Initializing OpenCL
Platform: Altera SDK for OpenCL
Using 1 device(s)
  EmulatorDevice : Emulated Device
Using AOCX: add.aocx

Arria 10 SoC
Turn_around_Time: 0.712237 ms
Kernel time (device 0)(getStartEndTime): 0.619050 ms

Output: 93.649620
Reference: 93.649620

Verification: PASS
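
As the log above notes, the host has to be started with CL_CONTEXT_EMULATOR_DEVICE_ALTERA set in order to use the emulated kernel. If you want your own host code to report which mode it is in, a check along the following lines may help. This is only a sketch: the function name emulated_device_count is hypothetical and nothing like it is known to exist in the sample "main.cpp".

```cpp
#include <cstdlib>

// Hypothetical helper: number of emulated devices requested through the
// CL_CONTEXT_EMULATOR_DEVICE_ALTERA environment variable.
// Returns 0 when the variable is unset or is not a positive integer.
int emulated_device_count() {
    const char *value = std::getenv("CL_CONTEXT_EMULATOR_DEVICE_ALTERA");
    if (value == nullptr) return 0;

    char *end = nullptr;
    const long n = std::strtol(value, &end, 10);
    // Reject empty strings, trailing garbage, and non-positive values.
    if (end == value || *end != '\0' || n <= 0) return 0;
    return static_cast<int>(n);
}
```

A host program could call this at startup and print a line such as "running in emulation" when the result is non-zero, which makes it harder to mistake emulated timings for FPGA timings.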

This lets you check the flow of the calculation on the CPU. Debug the host and kernel code until they work properly.
Emulation confirms that the calculation is performed correctly, but it cannot predict the execution time (it reports values, but they are unreliable). Execution time can be measured only after implementing the design on the FPGA.
We can also check the estimated FPGA resource usage. Run "emu_resource" in the "device" directory. The output is shown below.
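
For reference, a wall-clock figure such as the turn-around time can be taken on the host with a small helper like the one sketched here. This is an assumption about how such a number might be obtained, not a copy of the sample code, and the name elapsed_ms is hypothetical. The kernel time in the log is labeled "getStartEndTime", which presumably comes from OpenCL event profiling (clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END) rather than from a host-side timer.

```cpp
#include <chrono>

// Hypothetical helper: run `work` once and return the elapsed
// wall-clock time in milliseconds, using a monotonic clock.
template <typename F>
double elapsed_ms(F &&work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping the whole sequence of buffer write, kernel launch, and result read-back in one call would give a turn-around style figure, which is normally larger than the kernel time alone because it includes data transfer and launch overhead.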

bash-4.1$ cd device/
bash-4.1$ ./emu_resource
aoc: Environment checks are completed successfully.
aoc: Selected target board a10soc_2ddr
aoc: Running OpenCL parser....
aoc: OpenCL parser completed successfully.
aoc: Compiling....
aoc: Linking with IP library ...

+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ;    2%                     ;
; ALUTs                                  ;    1%                     ;
; Dedicated logic registers              ;    1%                     ;
; Memory blocks                          ;    3%                     ;
; DSP blocks                             ;    0%                     ;
+----------------------------------------+---------------------------;
aoc: First stage compilation completed successfully.
aoc: To compile this project, run "aoc add.aoco"

Float operations automatically use the DSP. The table above shows a DSP usage of 0%, but that seems to be only because the circuit is so small; the DSP is presumably being used properly.
We have confirmed that the sample addition code works properly, so we move on to the next section.

Compile kernel code

First, run the shell script "aocl_shell" to launch the Altera Embedded Command Shell.
Then run the shell script "aocx_go" to compile the kernel code.
This is very time-consuming: even this sample code takes more than 1 hour.
Once it finishes, we have the compiled kernel code (aocx file) in the directory "device".
We will compile the host code later on the ARM of the Arria 10. The output is shown below.

bash-4.1$ ./aocx_go 
aoc: Environment checks are completed successfully.
You are now compiling the full flow!!
aoc: Selected target board a10soc_2ddr
aoc: Running OpenCL parser....
aoc: OpenCL parser completed successfully.
aoc: Compiling....
aoc: Linking with IP library ...
+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ;    2%                     ;
; ALUTs                                  ;    1%                     ;
; Dedicated logic registers              ;    1%                     ;
; Memory blocks                          ;    3%                     ;
; DSP blocks                             ;    0%                     ;
+----------------------------------------+---------------------------;
aoc: First stage compilation completed successfully.
aoc: Hardware generation completed successfully.

When compilation starts, the directory "to_a10soc" specified in the shell script is created. It contains an intermediate file "add.aoco" and a directory "add" holding various data. After compilation, the binary file "add.aocx" is generated in "to_a10soc".

Transfer aocx file and host code to Arria 10

After compiling the kernel code, transfer the generated aocx file and the (uncompiled) host code to the Arria 10 with scp. Here we use the shell script "go_scp" in the directory "to_a10soc". Change the transfer destination to suit your own setup.

./to_a10soc/go_scp

Compile the host code on Arria 10

SSH to the Arria 10.

ssh root@131.113.69.239

Currently everyone logs in as the superuser, so be careful what you do.
Before compiling, run the following incantation on the Arria 10. Ignore any errors it prints.

source ~/init_opencl.sh

After that, move to the transfer destination directory on the Arria 10. In this example, "~/test/" contains the "aocx file" and "main.cpp", along with a previously prepared "Makefile".
If you prepared your own directory, copy "~/test/Makefile" into it.
Finally, run make to compile "main.cpp". The output is shown below. You can ignore the warnings.

root@Arria10_linaro:~/test/test_add# make clean
root@Arria10_linaro:~/test/test_add# make all
../common/src/AOCLUtils/opencl.cpp: In function ‘void* aocl_utils::alignedMalloc(size_t)’:
../common/src/AOCLUtils/opencl.cpp:55:49: warning: ignoring return value of ‘int posix_memalign(void**, size_t, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
   posix_memalign (&result, AOCL_ALIGNMENT, size);
                                                 ^
../common/src/AOCLUtils/opencl.cpp: In function ‘bool aocl_utils::setCwdToExeDir()’:
../common/src/AOCLUtils/opencl.cpp:278:14: warning: ignoring return value of ‘int chdir(const char*)’, declared with attribute warn_unused_result [-Wunused-result]
   chdir(path);
              ^

A directory "bin" is then created. Inside it is "host", the compiled host code.
Finally, move the "aocx file" into the "bin" directory and run "./bin/host". The output is as follows.

root@Arria10_linaro:~/test/test_add# ./bin/host 
Initializing OpenCL
Platform: Altera SDK for OpenCL
Using 1 device(s)
  a10soc_2ddrArria 10 SoC Development Kit
Using AOCX: add.aocx
Reprogramming device with handle 1

Arria 10 SoC
Turn_around_Time: 1.022762 ms
Kernel time (device 0)(getStartEndTime): 0.107940 ms

Output: 93.649620
Reference: 93.649620 

Verification: PASS

Congrats! Now we are the kings of addition!!!

Others (GUI Profiler)

If you compile the kernel code with the "--profile" option and then run it on the FPGA, "profile.mon" is generated in the directory "bin". Transfer this "mon file" back to neutrino (using go_mon) and run the "aocl report" command with it and the "aocx" (or "aoco") file. The GUI profiler then launches. (Do not forget to enable X forwarding.)
Note that adding the "--profile" option degrades performance.

bash-4.1$ ./go_mon 
Enter passphrase for key '/home/hlab/hoge/.ssh/id_rsa': 
profile.mon                                   100%   97     0.1KB/s   00:00    
bash-4.1$ aocl report profile.mon add.aocx &

I have run out of energy, so the rest will have to wait for another time. Please add your knowledge to this wiki!!!


Last-modified: 2019-08-29 (Thu) 00:03:48