key: cord-0584907-l61cqsnb
authors: Yin, Xunzhao; Muller, Franz; Huang, Qingrong; Li, Chao; Imani, Mohsen; Yang, Zeyu; Cai, Jiahao; Lederer, Maximilian; Olivo, Ricardo; Laleni, Nellie; Deng, Shan; Zhao, Zijian; Zhuo, Cheng; Kampfe, Thomas; Ni, Kai
title: An Ultra-Compact Single FeFET Binary and Multi-Bit Associative Search Engine
date: 2022-03-15
journal: nan
DOI: nan
sha: b9ec0ccdb588030a9a7cc6a0ed1a2b28c388e381
doc_id: 584907
cord_uid: l61cqsnb

Content addressable memory (CAM) is widely used in associative search tasks for its highly parallel pattern matching capability. To accommodate increasingly complex and data-intensive pattern matching tasks, it is critical to keep improving CAM density to enhance performance and area efficiency. In this work, we demonstrate: i) a novel ultra-compact 1FeFET CAM design that enables parallel associative search and in-memory Hamming distance calculation; ii) a multi-bit CAM for exact search using the same CAM cell; iii) compact device designs that integrate the series resistor current limiter into the intrinsic FeFET structure to turn the 1FeFET1R into an effective 1FeFET cell; iv) a successful 2-step search operation and a sufficient sensing margin of the proposed binary and multi-bit 1FeFET1R CAM array at sizes of practical interest in both experiments and simulations, given the existing unoptimized FeFET device variation; v) 89.9x speedup and 66.5x energy efficiency improvement over state-of-the-art alignment tools on GPU in accelerating genome pattern matching applications through the hyperdimensional computing paradigm.

In the era of artificial intelligence (AI) and the Internet of Things (IoT), the ever-growing amount of data generated by various machine learning (ML) models and devices at the edge and in data centers has placed severe demands on efficient computational hardware and architectures to support high-performance applications.
However, conventional von Neumann architectures incur significant energy and latency costs due to the massive data transfer between storage and processing units, the so-called memory wall issue. Emerg- [...]

Besides matrix multiplications, search operations are also prevalent and lie at the core of many applications; accelerating searches over a class of data vectors can directly benefit various computational models and improve system performance. As a special form of in-memory computing (IMC) solution, content addressable memories (CAMs) can accelerate parallel search operations throughout an entire memory array, thus showing promise for modern computing platforms [4, 5, 6]. Depending on the stored value (i.e., binary, ternary, or multi-bit), a CAM can be classified as a binary CAM (BCAM), a ternary CAM (TCAM) (i.e., with a third "don't care" or wildcard state), or a multi-bit CAM (MCAM) [4, 7]. When given an input query, a CAM simultaneously compares each of its stored memory entries with the input and returns the stored entries that match the input, as shown in Fig. 1(a). The search can be performed in either exact mode or approximate mode. In the former, only entirely matched entries are identified, whereas in the latter, the distances (i.e., Hamming distance for BCAM/TCAM [5] and a novel distance metric for MCAM [8]) between the stored entries and the input query are calculated, as shown in Fig. 1(a), thus serving as a distance kernel for various applications [4, 5, 8]. One promising application that can significantly benefit from CAM is hyperdimensional computing (HDC), which can perform cognition tasks such as image classification and speech recognition [9, 10, 11].
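The exact and approximate search modes of a binary CAM can be mimicked in software as a reference model. The following is a minimal sketch, not the hardware implementation: the function names, the example words, and the `max_distance` parameter are illustrative assumptions, and the loop visits entries one after another whereas a CAM compares all entries in parallel.

```python
from typing import List, Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of bit positions in which two equal-length binary words differ."""
    if len(a) != len(b):
        raise ValueError("word lengths differ")
    return sum(x != y for x, y in zip(a, b))


def cam_distances(entries: Sequence[Sequence[int]], query: Sequence[int]) -> List[int]:
    """Distance kernel: Hamming distance between the query and every stored entry."""
    return [hamming_distance(entry, query) for entry in entries]


def cam_search(entries: Sequence[Sequence[int]], query: Sequence[int],
               max_distance: int = 0) -> List[int]:
    """Indices of entries within max_distance of the query.

    max_distance=0 models exact search; a positive value models
    approximate search with a distance threshold (hypothetical parameter).
    """
    return [i for i, d in enumerate(cam_distances(entries, query)) if d <= max_distance]


# Hypothetical stored words and query, 8 bits each.
stored = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 1, 0, 1],
]
query = [1, 0, 1, 1, 0, 0, 1, 0]

exact = cam_search(stored, query)                   # exact mode
approx = cam_search(stored, query, max_distance=1)  # approximate mode
dists = cam_distances(stored, query)
```

In this sketch, exact mode returns only the entry identical to the query, while approximate mode also returns the entry that differs in a single bit.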
As a brain-inspired computing model, HDC represents class vectors as almost orthogonal hypervectors in a high-dimensional space (e.g., thousands of dimensions, with each dimension independent and identically distributed), as shown in Fig. 1(b) [11]. HDC inference for classification is performed by identifying the classes that are closest to the input query. CAM can greatly accelerate HDC inference by storing class hypervectors (HD_N in Fig. 1(b)) and calculating the distances between the stored class hypervectors and the input search query vector (HD_Q in Fig. 1(b)) in memory and in a massively parallel fashion. HDC also finds wide application beyond cognition tasks; this work takes genome sequencing as an example. Genome sequencing is a typical pattern matching problem that is widely applied in bioinformatics. In this task, the overall genome library is searched for entries that contain a query genome sequence. Despite its importance, efficient acceleration of pattern matching for genome sequencing is still an open question. HDC has been proposed as an effective solution, as it can transform the inherently sequential processes of pattern matching into highly parallelizable computation tasks and translate the complex distance metric between the patterns into Hamming distance [12]. In the following, the FeFET device is first explained, and the 1FeFET universal CAM design is proposed and validated. The proposed design is then leveraged for the genome sequencing application through HDC.

As the microstructure of the ferroelectric layer plays a vital role in its application inside the FeFET memory, we analyzed the microstructure of a planar film before gate structuring using STEM with a dedicated detector with high dynamic range at each pixel, as shown in Fig. 2 [...]terns for silicon and ferroelectric HfO2 are shown in Fig. 2(b).
Based on the set of electron diffraction images, the phase and grain orientation can be indexed in the scan, as shown by the phase and orientation maps in Fig. 2(c) and 2(d) for in-plane and cross-section, respectively. We observe a high fraction of the polar orthorhombic phase in the film. A small monoclinic phase fraction can still be identified, with an extracted area ratio of less than 5%. The extracted grain orientation is homogeneous within the grains. A preferred tilted out-of-plane orientation along the <110> axes can be identified.

IMC, such as the crossbar array for matrix-vector multiplications or the CAM for bit-wise XNOR operations, typically operates in the current domain, [...] as shown in Fig. 3(b) ([...] which are chosen to turn on the device in the LVT and HVT states, respectively, and yield the same ON current), it can be observed that the ON current of the device with LVT is not constant, but highly dependent on the gate bias V_G. This places a stringent requirement on controlling V_TH variation in FeFETs, which is rapidly improving with the optimization of the integration process yet still faces challenges, especially with FeFET scaling [19, 20]. Two approaches are studied to address the variation challenges. The first, straightforward method is continual process optimization, such as poly-crystalline phase control, grain orientation control, etc. In this work, we adopt an alternative method proposed in [21], which is to limit the ON current variability with a current limiter. In this way, the ON current of the FeFET is effectively independent of both the applied V_G and the stored V_TH state; therefore, the V_TH variation of Fe[...]

Integration of the series resistor with the FeFET is necessary to fully exploit the benefit of the current limiter. Ref. [22] proposed a TiN/SiO2 tunneling-junction-based resistor integrated in the back-end-of-line (BEOL) and connected to the FeFET drain. This approach, though effective, is inconvenient to implement.
Another design is to adopt a split-gate structure, similar to the split-gate embedded FLASH memory [23], where a conven[...]

Thanks to its inherent transistor structure, a FeFET with [...]

[Figure: two-step BCAM search. Step 1: detect the mismatch case "stored '0', search '1'", which has high current. Step 2: detect the mismatch case "stored '1', search '0'" (St1Sr0), which has low current; St1Sr0 is identified at step 2. Step 1 ML current: [...]. Step 2 ML current: I_MLS2 = I_ON * (N_tot - N_st1sr0). Output: Hamming distance between storage and search.]

One can note that such BCAM design with our pro[...]

[Figure: ML currents for words storing all '0's and all '1's; step 1 and step 2 search decision boundaries for exact match; step 1 match I_MLS1 region and step 2 match I_MLS2 region; step 2 search identifies St1Sr0.]

[...] representing the search bits "00", "01", "10", and "11" are applied to each cell, respectively. For the step 1 search shown in Fig. 6(a), a below-V_TH search is applied for search bit "10", i.e., a search voltage V_SL3l that is below the V_TH corresponding to state '10' and above the V_TH corresponding to state '01'. In this way, any mismatched cell that receives a search voltage above the V_TH corresponding to its stored state, i.e., the 'above-V_TH' scenario (search bits "10" and "11" in this example), will conduct a high ON cur[...]

[Figure: step 1 has a low I_MLS1 and step 2 has a large I_MLS2: exact match. Other scenarios: mismatch.]

[...] be adopted yet at the cost of area and power overhead [26].

To evaluate the efficiency of the proposed 1FeFET CAM design beyond the array level, we exploit the CAM array as an associative search engine for HDC running multiple genome sequencing tasks, which are essential techniques in many bioinformatics applications. In a genome sequencing task, a query DNA sequence, represented by a string of the nucleotide bases A, C, G, and T, is searched in a reference DNA string which generally consists of 100 millions of DNA bases.
Such pattern matching helps identify the existence of the query sequence in the reference sequence to discover potential diseases or accelerate DNA alignment techniques [27, 28]. Fig. S7(a) shows the HDC architecture that efficiently parallelizes the genome sequencing tasks. In the architecture, the sequences from genome databases (E. coli, Human CHR14 and COVID-19) are encoded and stored in the associative memory (Fig. S7(c)) for genome pattern matching. A new genome query is encoded and searched across multiple CAM banks in parallel, and the memory entries containing sequences whose Hamming distances to the query are within a threshold are identified, as illustrated in Fig. S7(b). Fig. S7(d) [...]

We proposed an ultra-compact and scalable 1FeFET CAM design with enhanced search function and improved CAM density for low-power pattern matching applications via the HDC paradigm. We fabricated the 1FeFET1R structure that integrates the series resistor into [...]

For current-based in-memory computing using FeFETs, such as the crossbar array for matrix-vector multiplication (multiply-and-accumulate, MAC) operation (Fig. S1(a)) and the content addressable memory for associative search (Fig. S1(b)), it is highly desirable to have a tight distribution of the FeFET ON current. For conventional FeFETs, the device current variation is closely related to the transistor V_TH variation. For the ON current in particular, the V_TH variation has a one-to-one correspondence to the I_D variation, as shown in Fig. S1(c). The ON current variability may cause the outputs to overlap and thus produce sensing errors, which could degrade the operation accuracy. This can be addressed through continual FeFET material and process optimization to control both the extrinsic sources and the intrinsic polarization switching sources of V_TH variation. Alternatively, a new cell structure can be designed such that the cell ON current is independent of V_G and V_TH, which is the approach adopted in this work.
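Why a series resistor can make the cell ON current insensitive to V_G and V_TH can be illustrated with a toy resistive model. The sketch below is an assumption-laden illustration rather than the device model used in this work: the linear-region resistance formula, the gain factor `k`, the drain bias, the V_TH samples, and the 100 kΩ limiter value are all hypothetical.

```python
import math


def fet_on_resistance(v_g: float, v_th: float, k: float = 1e-3) -> float:
    """Toy linear-region channel resistance in ohms: R = 1 / (k * (V_G - V_TH)).

    k (A/V^2) is a hypothetical gain factor; the device is OFF (infinite
    resistance) when V_G does not exceed V_TH.
    """
    overdrive = v_g - v_th
    if overdrive <= 0:
        return math.inf
    return 1.0 / (k * overdrive)


def cell_current(v_g: float, v_th: float, v_d: float = 0.1,
                 r_limiter: float = 0.0) -> float:
    """Cell current in amperes for the FET in series with a limiter resistor."""
    r_on = fet_on_resistance(v_g, v_th)
    if math.isinf(r_on):
        return 0.0
    return v_d / (r_on + r_limiter)


def relative_spread(currents) -> float:
    """(max - min) / mean, a simple measure of ON current variation."""
    mean = sum(currents) / len(currents)
    return (max(currents) - min(currents)) / mean


# Hypothetical V_TH samples standing in for device-to-device variation.
v_th_samples = [0.3, 0.4, 0.5, 0.6, 0.7]
v_g = 1.5

bare = [cell_current(v_g, vt) for vt in v_th_samples]
limited = [cell_current(v_g, vt, r_limiter=100e3) for vt in v_th_samples]

spread_bare = relative_spread(bare)        # follows the V_TH spread
spread_limited = relative_spread(limited)  # set almost entirely by the limiter
```

With these assumed numbers, the channel resistance stays near 1 kΩ while the limiter is 100 kΩ, so the limited current varies by well under 1% across the V_TH samples, whereas the bare transistor current varies by tens of percent.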
Using a series current limiter on the FeFET drain, once the FeFET turns ON, I_D is independent of V_G and V_TH, as shown in Fig. S1(d). This simple design, with its significantly suppressed I_D variation, can enable various applications, such as the 1FeFET CAM proposed in this work.

Figure S1. For current-based in-memory computing (IMC) elements, i.e., (a) crossbar for neural network and (b) content addressable memory for Hamming distance computation, (c) the device's stored V_TH state variability and the ON current I_D variation are correlated. With large I_D variation due to the V_TH variation, strong overlaps between different output currents result in an increased error rate when used in IMC circuits. (d) A device with V_G- and V_TH-independent ON current is desirable to minimize the ON current variation. With an ON current that is robust to V_TH variation, different output currents can be distinguished when used in IMC circuits.

The series current limiter can be applied to FeFETs of different sizes, as long as the equivalent ON resistance of the FeFET is much less than that of the current limiter, so that the cell ON current is dominated by the current limiter and independent of V_G. The series current limiter can be integrated into the FeFET structure such that a true 1FeFET CAM cell can be realized. One intermediate design would be the split-gate structure typically used in embedded NOR flash transistors. However, in our design, the purpose of the series transistor is to limit the channel current. Also, the gate-tunable resistance provides additional flexibility, which can be leveraged to demonstrate different functions, such as a distance kernel for multi-bit CAM. Other designs are also considered here. One is to incorporate a Schottky barrier in the source/drain, as shown in Fig. S2(c).
From TCAD simulations, the conduction band diagrams at different V_G, shown in Fig. S2(d), indicate that once the channel turns ON, the current transport barrier is dominated by the Schottky barrier, which is V_G independent. Therefore, the I_D-V_G curves in Fig. S2(e) show that the ON current is independent of V_G. Another proposed design is the underlap FDSOI FeFET, as shown in Fig. S2(f), where an ungated channel is inserted between the source/drain and the gated channel. Because this region is ungated, when the gated region turns on, the carrier transport is limited by the ungated region, as shown in the conduction band diagram in Fig. S2(g). The [...]

[Figure: Step 1: St0Sr1 identified at step 1; step 1 search is robust, no issue. Step 2: St1Sr0 cannot be identified at step 2; different I_ON, IMC not possible; step 2 search is not possible.]

The 1FeFET BCAM operation is further verified using SPICE simulations. The experimental FeFET device-to-device variation shown in Fig. 3(b) is adopted for the Monte Carlo simulations. We consider two scenarios, in which all the cells store bit '0' and all the cells store bit '1', respectively. The worst-case mismatch is considered, where only a 1-bit mismatch needs to be detected. Due to the strong variation in I_ON and the V_G-dependent I_ON of the conventional FeFET, even a small CAM word (i.e., a word length of 2) fails in the step 2 search. With the series current limiter, i.e., a resistor, on the drain of the FeFET, the variation in I_ON is significantly suppressed. As a result, a clear boundary can be defined between different degrees of mismatch, allowing the Hamming distance to be detected reliably.
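How the two search steps combine into a Hamming distance can be sketched with an idealized model in which every conducting cell contributes the same limited ON current. The step 2 expression I_MLS2 = I_ON * (N_tot - N_st1sr0) is taken from the text; the step 1 expression I_MLS1 = I_ON * N_st0sr1, the value of `I_ON`, and all function names are assumptions made for illustration only.

```python
from typing import Sequence, Tuple

I_ON = 1e-6  # hypothetical limited ON current per cell, in amperes


def two_step_ml_currents(stored: Sequence[int], query: Sequence[int],
                         i_on: float = I_ON) -> Tuple[float, float]:
    """Idealized match-line currents of the two search steps for one CAM word."""
    n_tot = len(stored)
    n_st0sr1 = sum(s == 0 and q == 1 for s, q in zip(stored, query))
    n_st1sr0 = sum(s == 1 and q == 0 for s, q in zip(stored, query))
    i_mls1 = i_on * n_st0sr1            # assumed: only St0Sr1 cells conduct in step 1
    i_mls2 = i_on * (n_tot - n_st1sr0)  # from the text: St1Sr0 cells stay off in step 2
    return i_mls1, i_mls2


def hamming_from_currents(i_mls1: float, i_mls2: float, n_tot: int,
                          i_on: float = I_ON) -> int:
    """Recover the Hamming distance by counting ON-current units in each step."""
    n_st0sr1 = round(i_mls1 / i_on)
    n_st1sr0 = n_tot - round(i_mls2 / i_on)
    return n_st0sr1 + n_st1sr0


# Hypothetical 8-bit stored word and query: two St0Sr1 and one St1Sr0 mismatch.
stored_word = [0, 1, 1, 0, 1, 0, 0, 1]
search_word = [1, 1, 0, 0, 1, 0, 1, 1]

i1, i2 = two_step_ml_currents(stored_word, search_word)
distance = hamming_from_currents(i1, i2, len(stored_word))

# An exact match gives zero step 1 current and the full step 2 current.
m1, m2 = two_step_ml_currents(stored_word, stored_word)
match_distance = hamming_from_currents(m1, m2, len(stored_word))
```

The rounding step only works when the per-cell ON current is tightly distributed, which is the role of the current limiter; with a wide I_ON spread, neighbouring current levels would overlap and the count could not be recovered.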
[Figure: step 1 ML current I_MLS1 (mA) and step 2 ML current I_MLS2. Matched search: 64 '01'. Mismatched searches: 63 '01' & 1 '11'; 63 '01' & 1 '10'; 63 '01' & 1 '00'. Marked regions: step 1 search I_MLS1 match region, step 2 search I_MLS2 match region, worst above-V_TH search scenario.]

The sense amplifier converts the ML current from the CAM array into voltages that indicate the Hamming distance between the input query and the stored CAM word. The sense amplifier (Fig. S6(a) [...]

1FeFET CAM Array for Bioinformatics (e.g., DNA Genome Sequencing)

As an example of 1FeFET CAM application benchmarking, DNA genome sequencing through hyperdimensional computing (HDC) is considered. HDC has been proposed as an effective solution for genome sequencing, as it can transform the inherently sequential processes of pattern matching into highly parallelizable computation tasks and translate the complex distance metric into Hamming distance. Fig. S7(a) shows the overall flow of performing genome sequencing with HDC, where the reference genome library is first encoded into random binary hypervectors and then stored in the 1FeFET CAM array, as shown in Fig. S7(c). When a query genome arrives to be checked against the reference library, it also goes through the hypervector encoder, and the resulting hypervector is applied as a search query to the CAM array. The encoder, shown in Fig. S7(b), maps genome sequences into almost orthogonal hypervectors such that only genome sequences similar to the query have small distances. By applying an appropriate distance threshold to the output of the CAM array sense amplifier, whether the reference library contains the query genome can be answered within a single CAM search.

Figure S7. Benchmarking the 1FeFET CAM in the HDC system for DNA genome sequencing tasks. (a) Overall flow of HDC for DNA genome sequencing.
The genome library is encoded into random binary hypervectors, which are stored in the 1FeFET CAM. A query sequence goes through the same encoder to generate the query hypervector, which is then applied to the 1FeFET CAM to search for the entries in the reference library within a given distance threshold. (b) The hypervector encoder maps genome sequences into almost orthogonal hypervectors such that the library entries that are identical to the query hypervector have much higher similarity than the library entries that are different. (c) 1FeFET CAM array architecture. (d) The 1FeFET CAM structure can provide significant speedup and energy savings compared with GPU.

Four-Dimensional Scanning Transmission Electron Microscopy (4D-STEM): From Scanning Nanodiffraction to Ptychography and Beyond
Impact of the SiO2 interface layer on the crystallographic texture of ferroelectric hafnium oxide
A FeFET based super-low-power ultra-fast embedded NVM technology for 22nm FD-SOI and beyond
FeFET: A versatile CMOS compatible device with game-changing potential
Ultra-low power flexible precision FeFET based analog in-memory computing
Analog In-memory Computing in FeFET-based 1T1R Array for Edge AI Applications
A cost-efficient 28nm split-gate eFLASH memory featuring a HKMG hybrid bit cell and HV device
A circuit compatible accurate compact model for ferroelectric-FETs
Ferroelectric FET analog synapse for acceleration of deep neural network training
Analog-to-digital converter design exploration for compute-in-memory accelerators
Bioinformatics algorithms: an active learning approach
BLAST+: architecture and applications
GPU-BLAST: using graphics processors to accelerate protein sequence alignment