# Future Computer Architectures: Computing in Memory

#### Said Hamdioui

Delft University of Technology
The Netherlands

## **ASCI Spring School**

on Heterogeneous Computing Systems May 29 - June 1, 2017



# Outline

- Motivation
  - The need for new technology and architectures
- Memristor (memristive devices)
  - Promising device, working principle, potential
- Memristor for memories
  - Straightforward application
- Memristor for logic
  - Different styles
- Computation-in-memory architecture
  - Combining it all together
- Some results / potential of CIM
  - Does it make sense?
- Conclusion

June 2, 2017

#### Motivation: Computing walls

1. Power wall
   - Chip-level energy trends: dominated by communication & memory
   - 70 to 90% for data-intensive applications [S. Borkar, "Exascale Computing: a fact or a fiction?," IPDPS]
2. Memory wall
   - Slow
   - Limited bandwidth
   - Communication bottleneck
   - Stored program principle
3. ILP wall
   - Insufficient parallelism at instruction level
   - Programmability: complexity & overhead

=> Reduced / saturated performance

- Enhancement based on expensive on-chip memory (~70% of area)
- Requires LD & ST: killers of overall performance

[Table: energy per operation (8-bit operand, 45 nm) and cost relative to an ALU operation; surviving row: ALU operation, 0.05 pJ, 1x]

**Need for new architectures**











## Memristor: Advantages

- **Dual functionality**
  - Realize both memory and logic functions
  - Enable new computing paradigms
  - Reduce (eliminate) memory wall
- Low energy consumption
  - Low/zero leakage: non-volatility
  - Reduce the overall power consumption
- Scalability / nanometric dimensions
  - Extreme density at low price and reduced area
  - Sustain the profitability of Moore's law
- CMOS compatibility
  - Enable heterogeneous integration
  - Enhance manufacturing at low cost
- Two-terminal passive device structure
  - Realize dense crossbar architectures
  - Stack on CMOS
- Good endurance & good reliability?





# Memristor-based memories?

- OxRAM seems to be the most promising
  - Very high density (cross-point array structure)
  - Smaller and simpler than MRAM
  - Lower consumption than PCM
  - Lower programming voltage and faster

| Features      | DRAM             | NAND Flash                    | STT-MRAM                        | PCM              | ReRAM (OxRAM)                  | ReRAM (CBRAM)                 |
|---------------|------------------|-------------------------------|---------------------------------|------------------|--------------------------------|-------------------------------|
| Integration   | FE               | FE                            | BE                              | BE               | BE                             | BE                            |
| Scalability   | 32 nm            | 15 nm                         | 20-30 nm                        | 10-20 nm         | 10 nm                          | 10-20 nm                      |
| Density       | 4-6F<sup>2</sup> | 4F<sup>2</sup>                | 35-40F<sup>2</sup>              | 6-8F<sup>2</sup> | 4-6F<sup>2</sup>               | 4-6F<sup>2</sup>              |
| Write voltage | V<sub>NOM</sub>  | >10 V                         | 1 V                             | 3-5 V            | 1-2.5 V                        | 1-2.5 V                       |
| Write time    | 50 ns            | 0.1 ms                        | 20 ns                           | 10 ns            | 10-50 ns                       | 100-1000 ns                   |
| Write energy  | 90 fJ/bit        |                               | 2.5 pJ/bit                      | 20 pJ/bit        | 10-100 fJ/bit                  | 10-100 fJ/bit                 |
| Endurance     | 10<sup>15</sup>  | 10<sup>4</sup>-10<sup>5</sup> | 10<sup>12</sup>-10<sup>15</sup> | 10<sup>9</sup>   | 10<sup>6</sup>-10<sup>10</sup> | 10<sup>5</sup>-10<sup>6</sup> |

[ref: Clermidy-2014]










# Memristor for logic

- Threshold logic
  - $f(x_1, x_2, \ldots, x_n) = \begin{cases} 1, & \text{if } \sum_{i=1}^{n} x_i \ge T \\ 0, & \text{otherwise} \end{cases}$
  - Two control voltages: $V_{dd}$ & GND
  - Two logic states: 0 & 1
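
The threshold function above can be sketched in a few lines of Python. This is a minimal illustration of the Boolean behaviour only; it does not model the memristor circuit or its voltages, and the function names are illustrative rather than taken from the slides.

```python
def threshold_gate(inputs, threshold):
    """Return 1 if the sum of the inputs reaches the threshold T, else 0."""
    return 1 if sum(inputs) >= threshold else 0


# With binary inputs and n = 3, the choice of T selects the Boolean function.
def or3(a, b, c):
    return threshold_gate([a, b, c], 1)  # T = 1: at least one input is 1


def majority3(a, b, c):
    return threshold_gate([a, b, c], 2)  # T = 2: at least two inputs are 1


def and3(a, b, c):
    return threshold_gate([a, b, c], 3)  # T = n: all inputs are 1
```

Changing only the threshold turns the same gate into OR, majority, or AND, which is why threshold logic can be a compact logic style.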



#### Example

Assume $n = 3$ and $T = V_{th} = V_{dd}/2$.

1. Program all devices to $R_{on}$
2. Provide the input voltages
   - $V_{dd}$, $V_{dd}$, 0
3. $V_f = 1$ ($R_{off}$)
   - $V_x = (4/7)\,V_{dd} > V_{th}$



# Computation-in-memory: Are there any benefits?



Assume a program with $n_i$ instructions.

[Figure: block diagrams labelled CPU, CIM, DRAM, and External Memory]

Conventional architecture:

- $n_p$ processors
- Latency $\propto (t_L + t_S + t_{ALU}) \cdot (n_i/n_p)$

CIM architecture:

- $n_a$ parallel crossbar arrays
- Latency $\propto (t_S' + t_{ALU}') \cdot (n_i/n_a)$
- Data already loaded in CIM



## **Better overall performance**

- $t_S' + t_{ALU}' \ll t_L + t_S + t_{ALU}$
- $t_S' + t_{ALU}'$ is ~ constant
- $t_L$ depends on miss rate
- E.g., large data sizes => higher miss rate
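
The two latency expressions can be compared with a small Python sketch. Several things here are assumptions and not from the slides: $t_L$ and $t_S$ are read as load and store times, $t_L$ is modelled as hit time plus miss rate times miss penalty, and every numeric value is hypothetical and in arbitrary time units.

```python
def load_time(t_hit, miss_rate, miss_penalty):
    """Average load time t_L (assumed model: hit time plus expected miss cost)."""
    return t_hit + miss_rate * miss_penalty


def conventional_latency(n_i, n_p, t_l, t_s, t_alu):
    """Latency proportional to (t_L + t_S + t_ALU) * (n_i / n_p)."""
    return (t_l + t_s + t_alu) * (n_i / n_p)


def cim_latency(n_i, n_a, t_s_cim, t_alu_cim):
    """Latency proportional to (t_S' + t_ALU') * (n_i / n_a)."""
    return (t_s_cim + t_alu_cim) * (n_i / n_a)


# Hypothetical numbers, chosen only for illustration.
n_i = 1_000_000
for miss_rate in (0.0, 0.25, 0.5):
    t_l = load_time(t_hit=1.0, miss_rate=miss_rate, miss_penalty=100.0)
    conv = conventional_latency(n_i, n_p=4, t_l=t_l, t_s=1.0, t_alu=1.0)
    cim = cim_latency(n_i, n_a=1000, t_s_cim=2.0, t_alu_cim=2.0)
    print(f"miss rate {miss_rate:.2f}: conventional {conv:.0f}, CIM {cim:.0f}")
```

In this sketch the conventional latency grows with the miss rate while the CIM term stays fixed, which mirrors the two claims in the list: $t_L$ depends on the miss rate, and $t_S' + t_{ALU}'$ is roughly constant.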

#### Reduced energy

- Significant communication reduction
- Reduce memory & power wall

#### Parallelism is program dependent

- CIM consumes much less power than cores
- Higher n<sub>p</sub> => higher power => dark silicon

#### Potential applications

- Loops on the same data sets
- Bit-wise operations
- High data volume and reuse
- E.g., bio-sequencing, graph processing

















# Computation-in-memory: Potential

## Examples

- Healthcare: DNA sequencing
  - Assume 200 GB of DNA data to be compared against a healthy reference of 3 GB for 50% coverage\*\*

[\*\*E. A. Worthey, Current Protocols in Human Genetics, 2001]

- Mathematics: 10<sup>6</sup> parallel additions

## Assumptions

- Conventional architecture
  - FinFET 22 nm multi-core implementation, with a scalable number of clusters, each with 32 ALUs (e.g., comparators)
  - 64 clusters; each cluster shares an 8 KB L1 cache
- CIM architecture
  - Memristor 10 nm crossbar implementation
  - The crossbar size equals the total cache size of the CMOS computer
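
As a quick check of what these assumptions imply for the crossbar size, the arithmetic can be written out in Python. This sketch counts only the L1 caches listed above, takes 1 KB as 1024 bytes, and assumes one memristor cell stores one bit; none of these conventions is stated on the slide.

```python
CLUSTERS = 64
L1_CACHE_KB_PER_CLUSTER = 8

# Total L1 capacity of the conventional machine.
total_cache_kb = CLUSTERS * L1_CACHE_KB_PER_CLUSTER

# Assumption: 1 KB = 1024 bytes, 8 bits per byte.
total_cache_bits = total_cache_kb * 1024 * 8

# Assumption: one memristor cell stores one bit.
crossbar_cells = total_cache_bits

print(f"total cache: {total_cache_kb} KB, crossbar cells: {crossbar_cells}")
```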

[Source: S. Hamdioui et al., DATE 2015]


# Computation-in-memory: Potential

## Metrics

- Energy-delay/operation
- Computing efficiency: number of operations per required energy
- Performance area: number of operations per required area
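
The three metrics can be expressed as simple Python helpers. This is a sketch under assumptions: energy-delay/operation is read as energy multiplied by delay and divided by the number of operations, and units are left to the caller because the slide does not state them.

```python
def energy_delay_per_operation(energy, delay, operations):
    """Energy-delay product divided by the number of operations (lower is better)."""
    return energy * delay / operations


def computing_efficiency(operations, energy):
    """Number of operations per required energy (higher is better)."""
    return operations / energy


def performance_area(operations, area):
    """Number of operations per required area (higher is better)."""
    return operations / area
```

Note the direction of each metric: the first improves as it shrinks, the other two as they grow.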

#### Results

| Metric                 | Architecture | DNA sequencing | 10<sup>6</sup> additions | CIM improvement |
|------------------------|--------------|----------------|--------------------------|-----------------|
| Energy-delay/operation | Conv.        | 2.02e-03       | 1.5043e-18               | > x100          |
|                        | CIM          | 2.34e-06       | 9.2570e-21               |                 |
| Computing efficiency   | Conv.        | 4.11e+01       | 6.5226e+09               | > x100          |
|                        | CIM          | 3.70e+04       | 3.9063e+12               |                 |
| Performance area       | Conv.        | 5.73e+06       | 5.1118e+09               | > x100          |
|                        | CIM          | 8.28e+09       | 4.9164e+12               |                 |

Key drivers: Reduced memory bottleneck, non-volatile technology & parallelism
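
The "> x100" claim can be checked directly from the table values with a short Python sketch. The values are copied from the results table; the CIM energy-delay entry for additions is taken as 9.2570e-21, which is an assumed reading of the table's notation. The helper name `improvement_factor` is illustrative.

```python
# (metric, workload, conventional, CIM, lower_is_better)
RESULTS = [
    ("energy-delay/operation", "DNA sequencing", 2.02e-03, 2.34e-06, True),
    ("energy-delay/operation", "1e6 additions", 1.5043e-18, 9.2570e-21, True),
    ("computing efficiency", "DNA sequencing", 4.11e01, 3.70e04, False),
    ("computing efficiency", "1e6 additions", 6.5226e09, 3.9063e12, False),
    ("performance area", "DNA sequencing", 5.73e06, 8.28e09, False),
    ("performance area", "1e6 additions", 5.1118e09, 4.9164e12, False),
]


def improvement_factor(conventional, cim, lower_is_better):
    """Ratio by which CIM improves on the conventional architecture."""
    return conventional / cim if lower_is_better else cim / conventional


FACTORS = {
    (metric, workload): improvement_factor(conv, cim, lower)
    for metric, workload, conv, cim, lower in RESULTS
}

for (metric, workload), factor in FACTORS.items():
    print(f"{metric:24s} {workload:15s} x{factor:.0f}")
```

Under this reading, all six factors exceed 100, consistent with the annotations in the table.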


## Conclusion

- Von Neumann-based computers
  - Memory & communication bottleneck
  - Complex programmability of multi-cores
  - Higher power consumption
  - => Unable to solve today's and future applications at affordable cost
- Short term
  - Specialization: application-specific accelerators (reduced programmability)
  - Near-memory computing, accelerators around memories (data-centric model)
- Long term
  - Alternative architectures, beyond Von Neumann & using new device technologies
  - Resistive computing has huge potential (CIM architecture)
  - But many open questions: device & materials, HW & SW, algorithms, etc.





