POLAR CODE DECODER HARDWARE DESIGN FOR 5G IMPLEMENTED ON FPGA

<span class="text_page_counter">Trang 1</span><div class="page_container" data-page="1">

<b>HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY </b>

THIS THESIS IS COMPLETED AT
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM

Supervisor: Doctor Trần Hoàng Linh

Examiner 1: Doctor Bùi Trọng Tú

Examiner 2: Doctor Nguyễn Minh Sơn

This master's thesis was defended at Ho Chi Minh City University of Technology, VNU-HCM, on January 12th, 2024.

Master’s Thesis Committee:

1. Associate Professor – Doctor Trương Quang Vinh: Chairman
2. Associate Professor – Doctor Hoàng Trang: Commissioner
3. Doctor Nguyễn Lý Thiên Trường: Secretary
4. Doctor Bùi Trọng Tú: Reviewer 1
5. Doctor Nguyễn Minh Sơn: Reviewer 2

Approval of the Chair of the Master's Thesis Committee and the Dean of the Faculty of Electrical and Electronics Engineering after the thesis has been corrected (if any).

CHAIR OF THESIS COMMITTEE        DEAN OF FACULTY OF ELECTRICAL AND ELECTRONICS ENGINEERING

VIETNAM NATIONAL UNIVERSITY – HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

THE TASK SHEET OF MASTER'S THESIS

Full name: Nguyễn Đức Nam
Student ID: 2070356
Date of birth: 29-04-1996
Place of birth: Bà Rịa – Vũng Tàu
Major: Electronics Engineering
Major ID: 8520203

I. THESIS TITLE (In Vietnamese): THIẾT KẾ PHẦN CỨNG GIẢI MÃ CỰC (POLAR CODE) CHO 5G THỰC HIỆN TRÊN FPGA/ASIC

II. THESIS TITLE (In English): POLAR CODE DECODER HARDWARE DESIGN FOR 5G IMPLEMENTED ON FPGA/ASIC

III. TASKS AND CONTENTS:

To design and implement a polar code SC decoder on an FPGA using Verilog and improve the throughput of the semi-parallel successive cancellation decoder.

To reduce the latency in clock cycles by improving the architecture to decode the codeword in parallel, based on the semi-parallel successive cancellation decoder.

To improve fmax by analyzing and reducing the most critical delay path of the semi-parallel successive cancellation decoder.

To evaluate the performance of the FPGA-based polar code implementation in terms of resource utilization, fmax, latency and throughput.

IV. THESIS START DAY: 4/9/2023

V. THESIS COMPLETION DAY: 18/12/2023

VI. SUPERVISOR: DOCTOR TRẦN HOÀNG LINH

Ho Chi Minh City, 18/12/2023

SUPERVISOR
(Full name and signature)

HEAD OF DEPARTMENT
(Full name and signature)

DEAN OF FACULTY OF ELECTRICAL AND ELECTRONICS ENGINEERING
(Full name and signature)

ACKNOWLEDGEMENTS

During my time studying and training at Ho Chi Minh City University of Technology, I received enthusiastic guidance and teaching from my lecturers, especially those of the Department of Electrical and Electronics Engineering, who have imparted both theoretical and practical knowledge to me. While writing this thesis, I also received encouragement, guidance and valuable help from teachers, family and friends. With the deepest respect and gratitude, I would like to send my sincere thanks to the lecturers of Ho Chi Minh City University of Technology, and in particular those of the Department of Electrical and Electronics Engineering, who tirelessly and enthusiastically impart valuable knowledge to us.

In particular, I would like to send my sincere thanks to my supervisor, Dr. Trần Hoàng Linh. Thanks to your enthusiastic help and guidance, I have gained valuable and useful experience. Your comments and encouragement have been the driving force for me to try my best to complete this thesis. I also send my sincere thanks to my family and friends, and especially my parents, who always care for, support and assist me in successfully completing this thesis.

With limited time and experience, this thesis will inevitably have shortcomings. I look forward to receiving comments and guidance from my teachers so that I can improve, supplement my knowledge, and better serve practical work in the future.

I sincerely thank you. Best regards,

Ho Chi Minh City, 18/12/2023

Nguyễn Đức Nam

ABSTRACT

This thesis demonstrates an efficient field-programmable gate array (FPGA) implementation of a successive-cancellation (SC) decoder for polar codes, which are standardized in the 5G wireless system. We focus on improving the best contributing architecture, the semi-parallel SC decoder. Based on it, we show that an SC decoder of length N can be further optimized by decoding the codeword in parallel to save N/2 latency cycles, and that the maximum clock frequency can be improved by refining the architecture of the processing elements. We demonstrate an FPGA implementation of the decoder architecture for a 1024-bit polar code and show that our FPGA decoder achieves more than 50% higher throughput compared to the semi-parallel SC decoder without significantly increasing the hardware resources.

TÓM TẮT LUẬN VĂN THẠC SĨ

Luận văn này chứng minh việc triển khai mảng cổng lập trình trường (FPGA) một cách hiệu quả của bộ giải mã successive-cancellation (SC) cho mã Polar – mã tiêu chuẩn trong hệ thống không dây 5G. Chúng tôi tập trung vào việc cải thiện kiến trúc của thiết kế đóng góp tốt nhất: Bộ giải mã SC bán song song (Semi-Parallel SC Decoder). Dựa trên đó, chúng tôi cho thấy bộ giải mã SC có độ dài N có thể được tối ưu hóa hơn nữa bằng cách giải mã các từ mã một cách song song để giảm N/2 chu kỳ độ trễ và cải thiện tần số tối đa bằng cách tinh chỉnh kiến trúc của các bộ tính toán giải mã. Chúng tôi chứng minh việc triển khai FPGA của kiến trúc bộ giải mã cho mã cực có độ dài 1024 bit và cho thấy rằng bộ giải mã FPGA của chúng tôi có thể đạt được thông lượng cao hơn 50% so với bộ giải mã SC bán song song mà không làm tăng đáng kể tài nguyên phần cứng.

THE COMMITMENT

The author hereby declares that this is his own research work. The research results and conclusions in this thesis are truthful and have not been copied from any other source in any form. Reference sources (if any) have been cited and recorded according to regulations.

Thesis author, Nguyễn Đức Nam

Ho Chi Minh City University of Technology

Ho Chi Minh City, 18/12/2023

TABLE OF CONTENTS

1. INTRODUCTION
1.1 Overview
1.2 Related research
1.3 Tasks and expected results
2. PRELIMINARIES
2.1 Polar Code Construction and Encoding
2.2 Successive Cancellation (SC) Decoding
2.4 Processing Elements
3. MAIN DESIGN, ALGORITHM OF THE THESIS
3.1 Architectural improvements
3.2 Optimized PE Implementation
3.3 LLR Memory
3.4 Partial Sum Registers
3.5 Partial Sum Update Logic
3.6 Frozen Channel ROM
3.7 Controller
4. IMPLEMENTATION RESULTS
4.1 Polar Code Encoder & Decoder on Matlab
4.2 Polar Code Decoder Simulation on Model Sim
4.3 Polar Code Decoder Function test on FPGA DE10
4.4 Polar Code Decoder Synthesis Result on FPGA Stratix IV
5. CONCLUSION AND FUTURE IMPROVING WORK
6. REFERENCE
7. APPENDIX

LIST OF FIGURES

Figure 1-1 Roadmap of channel coding in wireless communication systems
Figure 2-1 Polar code encoder with N=8
Figure 2-2 Butterfly-based SC decoder with N=8
Figure 2-3 Scheduling for the butterfly-based SC decoder with N=8
Figure 2-4 Scheduling and LR data flow graph of a semi-parallel SC decoder with N=8 and P=2
Figure 2-5 Utilization rate α_SP and relative-speed factor σ_SP for the semi-parallel SC decoder
Figure 2-6 Semi-parallel SC decoder architecture
Figure 2-7 Sign and magnitude processing element architecture
Figure 3-1 Enhanced semi-parallel SC decoder high-level architecture
Figure 3-2 Schedule for original reference and enhanced semi-parallel SC decoder
Figure 3-3 RTL architecture of a standard PE
Figure 3-4 Mirrored decoding graph for N=8
Figure 3-5 Organization of the LLR memory for N=8 and P=2 with uniform memory block size
Figure 3-6 Architecture of the partial sum registers with N=8
Figure 3-7 Frozen Channel ROM
Figure 3-8 RTL design of the stage number
Figure 3-9 RTL design of the portion p_s of a stage
Figure 3-10 RTL design of the LLR memory read/write address
Figure 3-11 RTL design of the partial sum register read address
Figure 3-12 RTL design of the F/G function selecting signal
Figure 4-1 Polar code system with BPSK-AWGN channels
Figure 4-2 MATLAB test bench data generating scripts
Figure 4-3 Test bench data generated by MATLAB
Figure 4-4 BPSK modulation over an AWGN channel
Figure 4-5 ModelSim test bench scripts
Figure 4-6 Memory loading and decoding process in ModelSim
Figure 4-7 ModelSim simulation result
Figure 4-8 Function test on FPGA DE10
Figure 4-9 Resource usage
Figure 4-10 Max clock frequency result
Figure 7-1 Flow Summary
Figure 7-2 Flow Non-Default Global Settings
Figure 7-3 Flow Elapsed Time
Figure 7-4 Analysis & Synthesis Summary
Figure 7-5 Analysis & Synthesis Settings
Figure 7-6 Analysis & Synthesis Source Files Read
Figure 7-7 Analysis & Synthesis Resource Usage Summary
Figure 7-8 Analysis & Synthesis Resource Utilization by Entity
Figure 7-9 Analysis & Synthesis Post-Synthesis Netlist Statistics for Top Partition
Figure 7-10 Fitter Summary
Figure 7-11 Fitter Settings
Figure 7-12 Fitter Resource Usage Summary
Figure 7-13 Fitter Resource Utilization by Entity
Figure 7-14 Timing Analyzer Summary
Figure 7-15 Timing Analyzer SDC File List
Figure 7-16 Timing Analyzer Clocks
Figure 7-17 Timing Analyzer Fmax Summary at Slow 900mV 85C Model
Figure 7-18 Timing Analyzer Fmax Summary at Slow 900mV 0C Model
Figure 7-19 Timing Analyzer Multicorner Timing Analysis Summary
Figure 7-20 MATLAB decoder results, same as original message (1-12 test cases)
Figure 7-21 ModelSim decoder simulation and verification results (1-12 test cases)
Figure 7-22 MATLAB decoder results, same as original message (13-24 test cases)
Figure 7-23 ModelSim decoder simulation and verification results (13-24 test cases)

LIST OF TABLES

Table 1-1 Comparison of implementations for 1024-bit polar codes using SCD architectures on Stratix IV FPGA
Table 4-1 Simulation result: frame error rate of polar codes of length N = 1024 with the SC decoding of the 3GPP 5G standard under offset min-sum decoding. All simulations were performed using BPSK modulation over an AWGN channel
Table 4-2 Comparison of our result for 1024-bit polar codes with other architectures on Stratix IV FPGA

1. INTRODUCTION

1.1 Overview

In 2008, Arıkan [1] introduced polar codes as a significant theoretical breakthrough aimed at achieving the capacity of symmetric channels. These codes are grounded in the concept of channel polarization, wherein the combination and division of channels lead to the transformation of a set of N identical binary-input discrete memoryless channels (B-DMC) into a group of polarized channels. Within this transformation, some channels become noiseless, approaching a capacity of one (termed as good channels), while others become noisy, with their capacity diminishing to zero (referred to as bad channels). This innovative approach enables the optimization of channel performance through strategic channel polarization techniques.

As the channel number, or code length, approaches infinity, the proportion of good channels to the total channels converges toward the capacity of the original channel. This phenomenon distinguishes polar codes from traditional channel codes like Turbo/LDPC codes. Polar codes introduce a novel concept in coding design, departing from the conventional approaches and showcasing a unique perspective on optimizing communication systems.

In practical applications, channel coding serves as a crucial technology for ensuring reliable transmission, particularly in wireless communications. Figure 1-1 illustrates the progression of channel code applications across 3G to 5G wireless systems. This roadmap highlights the pivotal role of channel coding in advancing the reliability and performance of wireless communication technologies over the years.

The 5G systems introduce more stringent requirements for transmission latency (1 ms) and reliability (99.999%), posing challenges that traditional Turbo codes struggle to meet. In 2019, the IEEE Communications Society published its best readings in polar coding online [2], showing that polar codes provide excellent error-correcting performance with low decoding complexity at practical block lengths when list SC decoding is combined with a CRC check. These favorable traits have led to polar codes being adopted in the 5G wireless standard, which is a testament to their outstanding performance.

Figure 1-1 Roadmap of channel coding in wireless communication systems.

For the decoding of polar codes, the concept of list successive cancellation decoding (LSCD) [3] [4] was introduced. LSCD generates L decoding paths by employing L parallel successive cancellation decodings (SCDs) [5] [6]. However, this method comes with increased implementation complexity and decoding latency. Enhancements in the implementation of the successive cancellation (SC) decoder therefore play a crucial role in improving the overall implementation of LSCD. Consequently, our focus centers on optimizing the FPGA implementation of the semi-parallel SC decoder [5], which forms the core of the original LSCD approach. Designing a high-throughput, low-latency architecture is the key issue of the hardware implementation.

1.2 Related research

In the realm of hardware implementation, there is a pursuit of high-throughput and low-latency architectures for both successive cancellation (SC) and successive cancellation list (SCL) decoders in practical applications. Leroux et al. [7] introduced the pipelined-tree architecture to enhance the throughput of the SC decoder, while in [5] they proposed a semi-parallel architecture for a similar purpose. Building upon these advancements, Zhang and Parhi [8] designed sequential and overlapped architectures to further reduce the decoding latency of the SC decoder. Additionally, Yuan and Parhi [9] introduced the concept of multi-bit decision to improve the throughput of the SC decoder. These efforts underscore the ongoing endeavors to optimize hardware implementations by exploring various architectural designs and decoding techniques, aiming to strike a balance between high throughput and low latency in the decoding process.

Several papers have delved into the FPGA implementation of polar decoders, each offering unique insights and methodologies [10] [5] [11] [12]. Pamuk [10] presented an FPGA implementation of a belief propagation decoder tailored for polar codes. Leroux et al. [5] introduced a semi-parallel successive cancellation (SC) decoder architecture designed to use FPGA resources efficiently. However, the latency of the semi-parallel decoder architecture presented by Leroux et al. [5] is constrained to at least 2N − 2 cycles. Consequently, its throughput is limited to approximately N·fmax/(2N) = fmax/2, where N denotes the length of the considered polar code and fmax represents the maximum clock frequency. In another approach, Dizdar et al. [12] proposed an SC decoder architecture leveraging only combinational logic circuits. Their work demonstrated that latency could be minimized by combining combinational and synchronous SC decoders of shorter lengths. These diverse approaches highlight the ongoing efforts to enhance the FPGA implementation of polar decoders by addressing factors such as resource utilization, latency reduction, and overall decoding efficiency.

Y. Ideguchi contributes to the field with a notable proposal for an efficient FPGA implementation of a successive cancellation (SC) decoder for polar codes [13]. In that work, the FPGA implementation of the decoder architecture is tailored for a 1024-bit-length polar code. Remarkably, their FPGA decoder achieves a threefold increase in throughput compared to the conventional sequential semi-parallel decoder, while avoiding a substantial increase in hardware resource utilization. The emphasis on achieving higher throughput with optimized resource utilization is a significant stride in FPGA implementations of polar decoders.

As part of future work, the focus is directed towards further enhancing the frequency, highlighting an ongoing commitment to advancing the performance and efficiency of FPGA implementations in the realm of polar code decoding. This continuous pursuit of improvement reflects the dynamic nature of research in FPGA-based polar code decoders.

Table 1-1 Comparison of implementations for 1024-bit polar codes using SCD architectures on Stratix IV FPGA.

| Design | N | # of ALUTs | # of Registers | RAM (kbits) | Fmax (MHz) | Latency (cycles) | TP (Mbps) |
|---|---|---|---|---|---|---|---|
| Ref. [5] (2013) | 1024 | 4130 | 1691 | 15 | 173 | 2084 | 85 |
| Ref. [6] (2014) | 1024 | 4324 | 2046 | 808 | 223 | - | 73 |
| Ref. [6] (2014) | 1024 | 4223 | 2069 | 808 | 212 | - | 104 |
| Ref. [13] (2019) | 1024 | 8558 | 13056 | 0 | 97 | 312 | 318 |

1.3 Tasks and expected results

 To design and implement a polar code SC decoder on an FPGA using Verilog and improve the throughput of the semi-parallel successive cancellation decoder.
 To reduce the latency in clock cycles by improving the architecture to decode the codeword in parallel, based on the semi-parallel successive cancellation decoder.
 To improve fmax by analyzing and reducing the most critical delay path of the semi-parallel successive cancellation decoder.
 To evaluate the performance of the FPGA-based polar code implementation in terms of resource utilization, fmax, latency and throughput.

By improving the semi-parallel successive cancellation decoder, our expected results are (a consistency check is sketched right after this list):

 A reduction of the latency by N/2 clock cycles, i.e. 512 clock cycles in the case N = 1024.
 An improvement of fmax from 173 MHz to more than 200 MHz.
 An improvement of throughput from 85 Mbps to more than 130 Mbps.

2. PRELIMINARIES

2.1 Polar Code Construction and Encoding

Polar codes represent linear block codes with a length of N = 2^n, where their generator matrix is formulated through the n-th Kronecker power of the matrix $F=\begin{bmatrix}1&0\\1&1\end{bmatrix}$. For example, for n = 3,

$$F^{\otimes 3}=\begin{bmatrix}1&0&0&0&0&0&0&0\\1&1&0&0&0&0&0&0\\1&0&1&0&0&0&0&0\\1&1&1&1&0&0&0&0\\1&0&0&0&1&0&0&0\\1&1&0&0&1&1&0&0\\1&0&1&0&1&0&1&0\\1&1&1&1&1&1&1&1\end{bmatrix}\qquad(1)$$

Figure 2-1 depicts the equivalent graph representation of F^⊗3, where u = u_0^7 represents the information-bit vector and x = x_0^7 represents the codeword transmitted through the channel. The vector notation adheres to the conventions established in [1], namely u_a^b consists of bits u_a, ..., u_b of the vector u.

Figure 2-1 Polar code encoder with N=8

When received vectors are decoded with an SC decoder, the error probability of each estimated bit û_i, under the assumption that bits u_0^{i−1} were decoded correctly, tends toward either 0 or 0.5. Additionally, as established in [1], the fraction of estimated bits with a low error probability converges toward the capacity of the underlying channel. Polar codes leverage this phenomenon, known as channel polarization, by utilizing the K most reliable bit positions for information transmission while "freezing" the remaining N − K bits, i.e. setting them to a predetermined value, often 0.
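To make the encoding structure concrete, the following is a minimal combinational Verilog sketch of the N = 8 encoder graph of Figure 2-1 (module and signal names are illustrative assumptions, not taken from the thesis RTL). Each stage XORs positions whose indices differ in a single bit, which realizes x = u·F^⊗3 over GF(2).

```verilog
// Minimal combinational sketch of the N = 8 polar encoder graph (Figure 2-1).
// Module and signal names are illustrative only.
module polar_encoder8 (
    input  wire [7:0] u,   // u[0] = u0 ... u[7] = u7, frozen positions already set to 0
    output wire [7:0] x    // codeword x0 ... x7
);
    wire [7:0] s0, s1;

    // stage 0: XOR positions whose indices differ in bit 0
    assign s0[0] = u[0] ^ u[1];   assign s0[1] = u[1];
    assign s0[2] = u[2] ^ u[3];   assign s0[3] = u[3];
    assign s0[4] = u[4] ^ u[5];   assign s0[5] = u[5];
    assign s0[6] = u[6] ^ u[7];   assign s0[7] = u[7];

    // stage 1: XOR positions whose indices differ in bit 1
    assign s1[0] = s0[0] ^ s0[2]; assign s1[1] = s0[1] ^ s0[3];
    assign s1[2] = s0[2];         assign s1[3] = s0[3];
    assign s1[4] = s0[4] ^ s0[6]; assign s1[5] = s0[5] ^ s0[7];
    assign s1[6] = s0[6];         assign s1[7] = s0[7];

    // stage 2: XOR positions whose indices differ in bit 2
    assign x[0] = s1[0] ^ s1[4];  assign x[1] = s1[1] ^ s1[5];
    assign x[2] = s1[2] ^ s1[6];  assign x[3] = s1[3] ^ s1[7];
    assign x[4] = s1[4];          assign x[5] = s1[5];
    assign x[6] = s1[6];          assign x[7] = s1[7];
endmodule
```

For example, setting only u[7] = 1 produces the all-ones codeword, which is the last row of F^⊗3 in (1).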

2.2 Successive Cancellation (SC) Decoding

When provided with a received vector corresponding to a transmitted codeword, the SC decoder sequentially estimates the transmitted bits, starting with u_0 and ending with u_{N−1}. At step i, if i is not in the frozen set, the SC decoder estimates û_i such that:

$$\hat{u}_i=\begin{cases}0, & \text{if }\ \dfrac{\Pr\left(y,\hat{u}_0^{\,i-1}\mid u_i=0\right)}{\Pr\left(y,\hat{u}_0^{\,i-1}\mid u_i=1\right)}>1\\[1.5ex]1, & \text{otherwise,}\end{cases}\qquad(2)$$

where Pr(y, û_0^{i−1} | u_i = b) represents the probability that y was received, given the previously decoded bits û_0^{i−1} and the currently decoded bit b, with b ∈ {0, 1}. In this context, the ratio of probabilities above serves as the likelihood ratio (LR) of bit û_i.

The SC decoding algorithm sequentially assesses the likelihood ratio (LR) L_i of each bit û_i. Arıkan demonstrated that these LR computations can be carried out efficiently in a recursive manner using a data flow graph resembling the structure of a fast Fourier transform. This structure, illustrated in Figure 2-2, is referred to as a butterfly-based decoder. The messages exchanged within the decoder are LR values denoted L_{l,i}, where l and i represent the graph stage index and row index, respectively. Additionally, L_{0,i} = L(û_i), and L_{n,i} is the LR directly calculated from the channel output y_i. The nodes in the decoder graph compute the messages using one of two functions:

$$L_{l,i}=\begin{cases}f\left(L_{l+1,i},\,L_{l+1,i+2^{l}}\right), & \text{if } B(l,i)=0\\ g\left(\hat{s}_{l,i-2^{l}},\,L_{l+1,i-2^{l}},\,L_{l+1,i}\right), & \text{if } B(l,i)=1\end{cases}\qquad(3)$$

where ŝ is a modulo-2 partial sum of decoded bits, B(l, i) ≜ ⌊i/2^l⌋ mod 2, 0 ≤ l < n, and 0 ≤ i < N. In the LR domain, functions f and g can be expressed as:

$$f(a,b)=\frac{1+ab}{a+b}\qquad(4)$$

$$g(\hat{s},a,b)=a^{\,1-2\hat{s}}\,b\qquad(5)$$
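For a concrete feel of (4) and (5), take the arbitrarily chosen LR values a = 2 and b = 4 (these numbers are purely illustrative):

$$f(2,4)=\frac{1+2\cdot 4}{2+4}=1.5,\qquad g(0,2,4)=2^{1}\cdot 4=8,\qquad g(1,2,4)=2^{-1}\cdot 4=2.$$

Since an LR above 1 favors the decision û_i = 0 in (2), the g output, and hence the decision it supports, depends directly on the previously accumulated partial sum ŝ.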

The computation of function f becomes feasible once a = L_{l+1,i} and b = L_{l+1,i+2^l} are available. In contrast, the calculation of g relies on knowledge of ŝ, which is derived using the factor graph of the code. As illustrated in Figure 2-1, for example, ŝ_{2,1} is estimated by propagating û_0^3 through the factor graph: ŝ_{2,1} = û_1 ⊕ û_3. This partial sum of û_0^3 is then utilized to compute L_{2,5} = g(ŝ_{2,1}; L_{3,1}; L_{3,5}).

Figure 2-2 Butterfly-based SC decoder with N=8

The necessity for partial sum computations introduces significant data dependencies in the SC algorithm, imposing constraints on the sequence in which the likelihood ratios (LRs) can be computed in the graph. In Figure 2-3, the scheduling of the decoding process for N = 8 is depicted using a butterfly-based SC decoder. At each clock cycle (CC), LRs are evaluated by computing either function f or g. It is assumed that these functions are calculated promptly upon the availability of the required data. As the channel information y_0^{N−1} becomes accessible on the right-hand side of the decoder, the estimation of bits û_i unfolds successively by updating the relevant nodes in the graph from right to left. Upon the estimation of bit û_i, all partial sums involving û_i are updated, facilitating subsequent evaluations of function g.

Figure 2-3 Scheduling for the butterfly-based SC decoder with N=8

It is evident that when stage l is activated, a maximum of 2^l operations can be executed simultaneously. Additionally, only one type of function (either f or g) is employed during the activation of a specific stage. Furthermore, a stage l is activated 2^{n−l} times throughout the decoding process of a vector. Consequently, assuming one clock cycle per stage activation, the total number of clock cycles needed to decode a vector is:

$$\mathcal{L}_{ref}=\sum_{l=0}^{n-1}2^{\,n-l}=2N-2\qquad(6)$$

In spite of the apparent parallel structure in this decoder, robust data dependencies impose constraints on the decoding process, rendering the decoder less efficient. Specifically, defining an active node as one with ready inputs capable of executing operations, it becomes apparent that only a fraction of the nodes are active during each decoding clock cycle, as depicted in Figure 2-3. To quantify the efficiency of an SC decoder architecture, the utilization rate, denoted as 𝛼, is employed. This rate signifies the average number of active nodes per clock cycle:

$$\alpha\triangleq\frac{\text{total number of node updates}}{\text{computational resource complexity}\times\text{computation time}}\qquad(7)$$

In SC decoding, N log₂ N node updates are required to decode one vector. A butterfly-based SC decoder performs this amount of computation with N log₂ N node processors that are used during 2N − 2 clock cycles; its utilization rate is thus

$$\alpha_{butterfly}=\frac{N\log_2 N}{\left(N\log_2 N\right)\left(2N-2\right)}=\frac{1}{2N-2}\qquad(8)$$

which is below 0.05% for N = 1024. The line decoder of [7] performs the same computation with only N/2 PEs while retaining the 2N − 2 cycle latency, which already raises the utilization rate considerably. Given that in the line decoder all N/2 PEs are simultaneously activated only twice during the decoding of a vector, irrespective of the code size, it becomes evident that we can further enhance the utilization rate of a decoder by reducing the number of PEs without considerably affecting throughput. For instance, a modified line decoder incorporating only N/4 PEs would incur only a 2-clock-cycle penalty compared to a full line decoder. This simplified architecture, termed the semi-parallel decoder, exhibits lower complexity at the cost of a marginal increase in latency.

This approach can be extended to a reduced number of processing elements (PEs). Let us designate P < N/2 as the number of implemented PEs. Figure 2-4 illustrates the scheduling of a semi-parallel decoder with {P = 2; N = 8}, revealing that this schedule demands only 2 extra clock cycles compared to the equivalent line decoder. Notably, the computations conducted during clock cycles {0, 1} and {8, 9} in the semi-parallel decoder are accomplished within a single clock cycle in a line decoder. Furthermore, Figure 2-4 also presents the data flow graph of the likelihood ratios (LRs) generated throughout the decoding process for {P = 2; N = 8}. The data generated during CC = {0, 1} becomes unnecessary after CC = 5 and can thus be substituted with the data produced in CC = {8, 9}. Consequently, the same memory element can serve to store the results of both computations.

Generally, the memory requirements remain unaltered when compared to the line decoder: the semi-parallel SC decoder necessitates 𝑁 memory elements (MEs) for the channel information y and 𝑁 − 1 MEs for intermediate results. Consequently, for a code of length 𝑁, the memory requirements of the semi-parallel decoder remain consistent, irrespective of the number of implemented processing elements (PEs).

Figure 2-4 Scheduling and LR data flow graph of a semi-parallel SC decoder with N=8 and P=2

It is essential to highlight that the data dependencies related to ŝ are not depicted in Figure 2-4. Consequently, despite the appearance that the data generated at CC = {8, 9} could have been produced earlier, this is not feasible, as the value of û_3 must be known to compute L_{2,4}, L_{2,5}, L_{2,6} and L_{2,7}.

While the decreased count of processing elements in a semi-parallel SC decoder results in heightened latency, this increase predominantly impacts stages that require more than P node updates. Building upon this observation, we proceed to assess the specific impact of reducing the number of processing elements on latency. To maintain scheduling regularity, we assume that the implemented number of processing elements, denoted P, is a power of 2, i.e. P = 2^p.

Within a semi-parallel decoder, a restricted set of processing elements is employed, so several clock cycles may be needed to finalize a stage update. The stages satisfying 2^l ≤ P remain unaffected, and their latency is unchanged. However, stages demanding more LR computations than the number of implemented processing elements require multiple clock cycles to complete their update; specifically, 2^l/P clock cycles are needed for stage l whenever 2^l > P. The overall latency of the semi-parallel decoder is therefore

$$\mathcal{L}_{SP}=\sum_{l=0}^{p}2^{\,n-l}+\sum_{l=p+1}^{n-1}\frac{2^{l}}{P}\,2^{\,n-l}=2N\left(1-\frac{1}{2P}\right)+(n-p-1)\frac{N}{P}=2N+\frac{N}{P}\log_2\frac{N}{4P}\qquad(9)$$

As anticipated, the latency of the semi-parallel decoder rises as the number of implemented processing elements (PEs) decreases. However, this latency penalty does not exhibit a linear correlation with P. To quantify the trade-off between the latency ℒ_SP of the semi-parallel decoder and P, we introduce the relative-speed factor:

$$\sigma_{SP}=\frac{\mathcal{L}_{ref}}{\mathcal{L}_{SP}}\approx\frac{2P}{2P+\log_2\frac{N}{4P}}\qquad(10)$$

This metric defines the throughput achievable by the semi-parallel decoder relative to that of the line decoder. It is important to note that the definition of σ_SP assumes both decoders can be clocked at the same frequency, T_{clk-line} = T_{clk-SP}. Synthesis results reveal that, due to the substantial number of PEs in the line decoder, we in fact have T_{clk-line} > T_{clk-SP}. Consequently, (10) represents the least favorable case for the semi-parallel architecture.
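As a small worked instance of (9), consider the N = 8, P = 2 case of Figure 2-4 (the values below are simply substituted into the formula):

$$\mathcal{L}_{SP}=2\cdot 8+\frac{8}{2}\log_2\frac{8}{4\cdot 2}=16+4\log_2 1=16\ \text{clock cycles},$$

which matches the 16-cycle schedule of Figure 2-4 and is two cycles more than the ℒ_ref = 2N − 2 = 14 cycles of the fully parallel butterfly decoder, as noted above.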

The utilization rate of a semi-parallel decoder, on the other hand, is defined as

$$\alpha_{SP}=\frac{N\log_2 N}{2P\,\mathcal{L}_{SP}}\qquad(11)$$

The reduction in the number of PEs by a factor of N/(2P), which is 8192 for N = 2^20 and P = 64, demonstrates a significant improvement. For P = 64 and N = 1024, the utilization rate (α_SP = 3.5%) is enhanced by a factor of 8 compared to the line decoder, showcasing a more efficient utilization of processing resources during the decoding process.

This substantial reduction in complexity renders the size of processing resources very small compared to the memory resources required by this architecture.

The next sections furnish a detailed description of the various modules of the semi-parallel decoder, whose top-level architecture is depicted in Figure 2-6.

Figure 2-6 Semi-parallel SC decoder architecture.

2.4 Processing Elements

SC polar code decoders perform likelihood estimations through the update rules (4) and (5). However, these equations involve divisions and multiplications, rendering them impractical for hardware implementation. To mitigate this complexity, [7] proposed substituting these likelihood ratio (LR) updates with equivalent functions in the logarithmic domain. Throughout this thesis, log-likelihood ratio (LLR) values are denoted by λ_X = log(X), where X is an LR.

In the LLR domain, functions f and g become the sum-product algorithm (SPA) equations:

$$\lambda_f(\lambda_a,\lambda_b)=2\tanh^{-1}\left(\tanh\frac{\lambda_a}{2}\,\tanh\frac{\lambda_b}{2}\right)\qquad(12)$$

$$\lambda_g(\hat{s},\lambda_a,\lambda_b)=\lambda_a(-1)^{\hat{s}}+\lambda_b\qquad(13)$$

Upon initial inspection, λ_f may seem more intricate than its LR-domain counterpart (4) due to the hyperbolic functions involved. However, as demonstrated in [14], it can be approximated using the minimum function, resulting in the simpler min-sum (MS) equations:

$$\lambda_f(\lambda_a,\lambda_b)\approx\psi^{*}(\lambda_a)\,\psi^{*}(\lambda_b)\,\min\left(|\lambda_a|,|\lambda_b|\right)\qquad(14)$$

$$\lambda_g(\hat{s},\lambda_a,\lambda_b)=\lambda_a(-1)^{\hat{s}}+\lambda_b\qquad(15)$$

where |X| represents the magnitude of variable X and ψ*(X) its sign, defined as:

$$\psi^{*}(X)=\begin{cases}1 & \text{when } X\geq 0\\-1 & \text{otherwise}\end{cases}\qquad(16)$$

Equations (14) and (15) indeed propose a considerably simpler hardware implementation compared to their counterparts in the LR domain. Furthermore, Figure 2-5 illustrates that, despite the approximation involved in (14), its influence on decoding performance is minimal.

From a hardware perspective, our proposal consolidates λ_f and λ_g into a single processing element using the sign-and-magnitude (SM) representation for LLR values, as this simplifies the implementation of (14):

$$\psi(\lambda_f)=\psi(\lambda_a)\oplus\psi(\lambda_b)\qquad(17)$$

$$|\lambda_f|=\min\left(|\lambda_a|,|\lambda_b|\right)\qquad(18)$$

where ψ(X), like ψ*(X), describes the sign of variable X, although in a way that is compatible with the sign-and-magnitude representation:

$$\psi(X)=\begin{cases}0 & \text{when } X\geq 0\\1 & \text{otherwise}\end{cases}\qquad(19)$$

These calculations are executed using a single XOR gate and a (Q − 1)-bit compare-select (CS) operator, as depicted in Figure 2-7. Conversely, function λ_g is realized using an SM adder/subtractor. In SM format, ψ(λ_g) and |λ_g| depend not only on ŝ, ψ(λ_a), ψ(λ_b), |λ_a| and |λ_b|, but also on the relation between the magnitudes |λ_a| and |λ_b|. For instance, if ŝ = 0, ψ(λ_a) = 0, ψ(λ_b) = 1, and |λ_a| > |λ_b|, then ψ(λ_g) = ψ(λ_a) and |λ_g| = |λ_a| − |λ_b|. This relation between |λ_a| and |λ_b| is represented by bit γ_ab, which is generated using a magnitude comparator:

$$\gamma_{ab}=\begin{cases}1 & \text{if } |\lambda_a|>|\lambda_b|\\0 & \text{otherwise}\end{cases}\qquad(20)$$

The sign ψ(λ_g) relies on four binary variables: ψ(λ_a), ψ(λ_b), ŝ and γ_ab. Applying conventional logic minimization to the truth table of ψ(λ_g), we derive the following simplified Boolean equation:

$$\psi(\lambda_g)=\overline{\gamma_{ab}}\cdot\psi(\lambda_b)+\gamma_{ab}\cdot\left(\hat{s}\oplus\psi(\lambda_a)\right)\qquad(21)$$

where ⊕, · and + represent binary XOR, AND and OR, respectively.

As shown in Figure 2-7, the computation of ψ(λ_g) requires only an XOR gate and a multiplexer. Notably, γ_ab is already available from the CS operator, which is shared between λ_f and λ_g.

On the other hand, the magnitude |λ_g| is the addition or subtraction of max(|λ_a|, |λ_b|) and min(|λ_a|, |λ_b|):

$$|\lambda_g|=\max\left(|\lambda_a|,|\lambda_b|\right)+(-1)^{\chi}\min\left(|\lambda_a|,|\lambda_b|\right)\qquad(22)$$

$$\chi=\hat{s}\oplus\psi(\lambda_a)\oplus\psi(\lambda_b)\qquad(23)$$

where bit χ dictates whether min(|λ_a|, |λ_b|) should be negated or not. The implementation of |λ_g| involves an unsigned adder, a multiplexer, and a two's-complement operator. The two's-complement operator negates a number, allowing the unsigned adder to perform subtraction through overflow. This implementation also reuses the shared CS operator.

Figure 2-7 Sign and magnitude processing element architecture.

Finally, the output of the processing element is selected by bit B(l, i): the PE outputs λ_f when B(l, i) = 0 and λ_g when B(l, i) = 1.
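To make the datapath of Figure 2-7 concrete, the following is a minimal combinational Verilog sketch of such a sign-and-magnitude PE implementing (17)-(23). Parameter and signal names are illustrative assumptions (they are not taken from the thesis RTL), and quantization and saturation details are simplified.

```verilog
// Hedged sketch of a sign-and-magnitude min-sum PE, eqs. (17)-(23).
// LLRs use Q bits: {sign, (Q-1)-bit magnitude}. Names are illustrative.
module pe_sm #(parameter Q = 6) (
    input  wire         sel_g,   // 0: output the f node, 1: output the g node (bit B(l,i))
    input  wire         s_hat,   // partial sum feeding the g node
    input  wire [Q-1:0] la, lb,  // input LLRs in sign-magnitude form
    output wire [Q-1:0] lo       // selected output LLR
);
    wire         sa = la[Q-1],    sb = lb[Q-1];
    wire [Q-2:0] ma = la[Q-2:0],  mb = lb[Q-2:0];

    // shared compare-select operator: gamma = 1 when |la| > |lb|   (20)
    wire         gamma = (ma > mb);
    wire [Q-2:0] min_m = gamma ? mb : ma;
    wire [Q-2:0] max_m = gamma ? ma : mb;

    // f node: XOR of signs (17), minimum of magnitudes (18)
    wire [Q-1:0] lf = {sa ^ sb, min_m};

    // g node sign (21) and magnitude (22)-(23)
    wire         sg  = gamma ? (s_hat ^ sa) : sb;
    wire         chi = s_hat ^ sa ^ sb;                    // 1 -> subtract magnitudes
    wire [Q-1:0] add = {1'b0, max_m} + {1'b0, min_m};      // keeps the carry bit
    wire [Q-2:0] mg  = chi ? (max_m - min_m)
                           : (add[Q-1] ? {(Q-1){1'b1}} : add[Q-2:0]); // saturate
    wire [Q-1:0] lg  = {sg, mg};

    assign lo = sel_g ? lg : lf;
endmodule
```

Note that the comparator result gamma feeds both the f and g paths, mirroring the shared CS operator described above; the compare-then-add chain it creates is exactly the critical path revisited in Section 3.2.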

3. MAIN DESIGN, ALGORITHM OF THE THESIS

3.1 Architectural improvements

Figure 3-1 Enhanced semi-parallel SC decoder high-level architecture

[Figure 3-2: cycle-by-cycle schedule of the enhanced decoder (CC_parallel) versus the reference semi-parallel decoder (CC_ref [5]) for N = 8, P = 4; pairs of stage-0 f/g cycles of the reference schedule are merged into single cycles in the enhanced schedule.]

To enhance the decoder's throughput, the architecture is modified to reduce the number of cycles required to decode one codeword by N/2. This improvement is achieved by decoding two bits in parallel when the decoder is in the last stage (s = 0), exploiting the fact that two subsequent bits û_i are obtained from f and g nodes that share the same input LLRs. However, the g node needs the decision of the preceding f node as its partial-sum input û_s, and the original semi-parallel architecture computes the g node in the cycle after the f node [5]. In the modified architecture, both possible g-node outputs are calculated speculatively while the f output is computed; the correct g-node output is then selected with a negligible additional combinational delay. Figure 3-2 illustrates the new, shortened schedule of f and g nodes for parallel decoding of two bits in the case of N = 8 and P = 4, denoted CC_parallel. This modification reduces the number of decoding cycles by N/2.

In terms of area, it's crucial to note that only one of the P processing elements needs to perform this parallel decoding, resulting in a barely noticeable increase in area. However, since two bits are decoded in parallel, both must be considered in the û_s memory update logic, introducing a slight increase in overall complexity.
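A hedged sketch of this speculative stage-0 decision is given below (signal names and the frozen-bit handling are illustrative assumptions, not the thesis RTL): the even bit is decided from the f output, both g outcomes are formed in parallel, and the matching sign is selected in the same cycle. For clarity, the two speculative g values are evaluated in two's complement rather than sign-magnitude form.

```verilog
// Illustrative sketch of two-bit parallel decoding at stage 0.
module stage0_two_bit #(parameter Q = 6) (
    input  wire [Q-1:0] la, lb,      // shared input LLRs, sign-magnitude
    input  wire         frozen_even, // frozen flags (e.g. from a frozen-channel ROM)
    input  wire         frozen_odd,
    output wire         u_hat_even,  // two bits decided in the same cycle
    output wire         u_hat_odd
);
    wire sa = la[Q-1], sb = lb[Q-1];

    // f decision: sign of the min-sum f output; frozen bits are forced to 0
    assign u_hat_even = frozen_even ? 1'b0 : (sa ^ sb);

    // convert magnitudes to signed values for the speculative g sums
    wire signed [Q:0] a_s = sa ? -$signed({1'b0, la[Q-2:0]})
                               :  $signed({1'b0, la[Q-2:0]});
    wire signed [Q:0] b_s = sb ? -$signed({1'b0, lb[Q-2:0]})
                               :  $signed({1'b0, lb[Q-2:0]});

    // both speculative g outputs: g = la*(-1)^u + lb (only the sign is needed)
    wire signed [Q+1:0] g_u0 = a_s + b_s;   // assumes u_hat_even = 0
    wire signed [Q+1:0] g_u1 = b_s - a_s;   // assumes u_hat_even = 1

    // late selection of the correct speculative result
    wire sign_g = u_hat_even ? g_u1[Q+1] : g_u0[Q+1];
    assign u_hat_odd = frozen_odd ? 1'b0 : sign_g;
endmodule
```

In the actual decoder only a single PE (PE0, Section 3.2) needs this extra logic; the sketch above isolates the idea.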

3.2 Optimized PE Implementation

The architecture proposed for all processing elements (PEs) is an enhancement of the PE architecture in [5], where functions f and g are merged into a single PE, sharing a comparator and an XOR gate between the two functions. In the improved architecture, the LLRs are stored in sign-and-magnitude form. The value of sign(L_f) is determined by sign(L_a) ⊕ sign(L_b), whereas |L_f| is min(|L_a|, |L_b|).

To improve the maximum clock frequency fmax, the critical path of the processing element architecture of [5] must be shortened. The proposed architecture achieves this by calculating all possible values of |L_g| in parallel, that is, the three candidate magnitudes (|L_a| − |L_b|), (|L_b| − |L_a|) and (|L_a| + |L_b|), then selecting the correct output based on û_s, sign(L_a) and sign(L_b), and finally saturating as required. This optimized architecture is marked as "PE enhanced" in Figure 3-3. The value of sign(L_g) is given by û_s ⊕ sign(L_a) when |L_a| > |L_b|, and by sign(L_b) otherwise. This enhanced architecture reduces the delay within the processing element compared to [5] at the cost of a slight increase in area. Given that the PEs account for only 5.5% of the total ALUTs (for P = 64), the total area impact is small, while the circuit fmax improves significantly.

Figure 3-3 RTL architecture of a standard PE

Additionally, a special PE, named PE0 in Figure 3-3, is introduced; it is capable of computing two decoded bits in parallel as described in Section 3.1. PE0 has an additional g-node output for parallel decoding, which is used in stage 0, as depicted in Figure 3-1. PE0 does not replicate a full g node but shares the speculative computations of the standard PE; it employs eight additional 2-input MUXes compared to a standard PE. PE0 also functions as a standard PE when used in stages 1, 2, and so on. Although PE0 is larger than the other standard PEs, its impact on the total area is minimal, as this change affects only a single PE in the entire design. The delay through PE0 is virtually the same as that of the other PEs.
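The core of the enhanced PE's |L_g| path can be sketched as follows (an illustrative fragment under the same sign-magnitude assumptions as before, not the thesis RTL; only the magnitude path is shown since the f path and sign logic are unchanged). The three candidate magnitudes are computed concurrently, so the comparator no longer sits in series with an adder on the critical path; it only drives the final selection.

```verilog
// Critical-path-oriented g-magnitude datapath: compute all candidates in
// parallel, then select. Widths assume Q-bit sign-magnitude LLRs.
module g_mag_parallel #(parameter Q = 6) (
    input  wire         u_hat,
    input  wire         sign_a, sign_b,
    input  wire [Q-2:0] mag_a, mag_b,
    output wire [Q-2:0] mag_g
);
    // three speculative results, computed concurrently
    wire [Q-2:0] a_minus_b = mag_a - mag_b;   // valid when |La| >= |Lb|
    wire [Q-2:0] b_minus_a = mag_b - mag_a;   // valid when |Lb| >  |La|
    wire [Q-1:0] a_plus_b  = mag_a + mag_b;   // one extra bit for the carry

    wire subtract = u_hat ^ sign_a ^ sign_b;  // signs differ after applying u_hat
    wire a_ge_b   = (mag_a >= mag_b);

    // late, cheap selection instead of a comparator feeding an adder
    wire [Q-2:0] diff = a_ge_b ? a_minus_b : b_minus_a;
    wire [Q-2:0] sum  = a_plus_b[Q-1] ? {(Q-1){1'b1}} : a_plus_b[Q-2:0]; // saturate

    assign mag_g = subtract ? diff : sum;
endmodule
```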

3.3 LLR Memory

During the decoding process, the processing elements (PEs) compute log-likelihood ratios (LLRs) that are reused in subsequent steps. To facilitate this reuse, the decoder must store intermediate estimates in memory. As demonstrated in [7], 2N − 1 Q-bit memory elements suffice to store the received vector and all intermediate Q-bit LLR estimates. Conceptually, this memory is represented as a tree structure, where each level stores the LLRs of a decoding graph stage l, with 0 ≤ l ≤ n. Channel LLRs are stored in the leaves, while decoded bits are read from the root.

To ensure single-clock-cycle operation of the processing elements without introducing additional delays, inputs must be read and outputs written within the same clock cycle. Although a register-based architecture, proposed in [7] for the line decoder, could be a straightforward solution, preliminary synthesis results revealed routing and multiplexing challenges, especially for the very large code lengths needed by polar codes. Instead, a parallel-access approach using random access memory (RAM) is proposed. In a polar code decoder, PEs consume twice as much information as they produce. Therefore, the semi-parallel architecture employs a dual-port RAM with a write port of width PQ and a read port of width 2PQ, along with a specific data placement in memory. This RAM-based approach not only meets the operational requirements but also reduces the area per stored bit compared to the register-based approach.
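A hedged sketch of such an asymmetric dual-port memory is given below (depth, widths and names are illustrative assumptions, not the thesis RTL): each write stores one P·Q-bit word, while each read returns two adjacent words, i.e. 2·P·Q bits, matching the 2:1 read/write width ratio described above.

```verilog
// Asymmetric LLR memory sketch: narrow write port, wide read port.
module llr_ram #(parameter P = 64, Q = 6, AW = 5) (   // 2**AW narrow words
    input  wire              clk,
    input  wire              we,
    input  wire [AW-1:0]     waddr,     // narrow (P*Q-bit) word address
    input  wire [P*Q-1:0]    wdata,
    input  wire [AW-2:0]     raddr,     // wide (2*P*Q-bit) word address
    output reg  [2*P*Q-1:0]  rdata
);
    reg [P*Q-1:0] mem [0:(1<<AW)-1];

    always @(posedge clk) begin
        if (we)
            mem[waddr] <= wdata;
        // one wide read = two adjacent narrow words (lower word in the LSBs)
        rdata <= {mem[{raddr, 1'b1}], mem[{raddr, 1'b0}]};
    end
endmodule
```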

Figure 3-4 Mirrored decoding graph for N=8
