# Introduction to Tenstorrent



#### Tenstorrent at a Glance



## Software, Silicon, and Systems to Run AI, ML, and Compute Cheaper and Faster than Anyone Else



### **Tenstorrent Product Summary**

#### IP (TT-Ascalon<sup>™</sup> / Tensix NEO)



- Scales from mW to MW for efficiency and performance
- IP available for licensing
- Industry-leading performance
- Modular design available in varied configurations





Figure: chiplet floorplan with rows of Tensix NEO cores between rows of D2D blocks.

- Portfolio of products powered by scalable Tensix AI cores
- Inference and training, CNN and NLP, recommendation engines, all on the same silicon
- Hardware available for purchase as well as IP available for licensing
- Multi-component modular chiplets

#### Servers (Tenstorrent Galaxy<sup>™</sup>)



- Galaxy Server: 32 high-performance ASICs in a custom chassis
- Easily combine servers into a Galaxy Rack with high-bandwidth chip-to-chip connectivity





- ML compilers that scale from one chip to thousands
- TT-Buda<sup>™</sup>: Automated AI/ML compiler
- TT-Metalium<sup>™</sup>: Bare-metal software stack



#### Core Silicon Roadmap

### Wormhole Product Portfolio

#### PCIe Cards



- n300d: Two Wormhole<sup>™</sup> ASICs operating at up to 300W, active axial fan cooler
- n300s: Two Wormhole<sup>™</sup> ASICs operating at up to 300W, passive cooler
- n150d: One Wormhole<sup>™</sup> ASIC operating at up to 160W, active axial fan cooler
- n150s: One Wormhole<sup>™</sup> ASIC operating at up to 160W, passive cooler

#### TT-QuietBox



- Liquid-cooled desktop workstation
- Four n300 cards (8 Wormhole<sup>™</sup> ASICs)
  - 512 Tensix Cores
  - 96GB GDDR6
  - 192MB SRAM

#### TT-LoudBox

- Air-cooled 4U server for datacenter deployments
- Four n300s cards (8 Wormhole<sup>™</sup> ASICs)
  - 512 Tensix Cores
  - 96GB GDDR6
  - 192MB SRAM

#### Tenstorrent Galaxy<sup>™</sup> Wormhole Server

- 6U UBB design for enterprise use
- 32 Wormhole<sup>™</sup> ASICs for ultra-dense/high-performance data center deployment
- DGX-level inference with higher efficiency and lower cost

### Add-In Board Overview



- PCIe Gen 4
- 8GB 256-bit LPDDR4
- 100GbE connectivity
- Moves to 12GB 192-bit GDDR6
- GDDR6 at faster speed
- Moves to PCIe Gen 5
- Upgrades to 400GbE connectivity
- Inter-card and inter-system expansion

### Tenstorrent Galaxy Roadmap

| 2023 | 2024 | 2025 |
|------|------|------|
| Wormhole™ | Wormhole™ | Blackhole™ |
| Prototype | ODM Redesign | RISC-V & AI Generation |
| <ul> <li>32 Wormhole<sup>™</sup> cards in a highly optimized, highly dense, custom 4U chassis</li> <li>5.2 PetaFLOPS at BLOCKFP8</li> <li>384GB of globally accessible GDDR6 memory</li> <li>3.8GB SRAM</li> <li>Expansion beyond a single server accomplished via standard network cabling</li> </ul> | <ul> <li>32 Wormhole<sup>™</sup> cards in a highly optimized, highly dense, custom 6U chassis</li> <li>Internal head node for built-in host capability</li> <li>Increased PetaFLOPS and GDDR6 memory bandwidth</li> <li>UBB module for flexible customer infrastructure</li> </ul> | <ul> <li>Utilize 6U Wormhole UBB design for easy customer transitions</li> <li>Mesh connections and</li> <li>Training vs. inference focus being considered</li> </ul> |
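As a rough sanity check on the 2023 column, the system totals can be divided across the 32 cards. The sketch below assumes resources are spread evenly across cards and uses decimal units (1 PetaFLOPS = 1000 TeraFLOPS, 1GB = 1000MB); the variable names are illustrative only.

```python
# Per-card figures derived from the 2023 Galaxy totals quoted in the table.
# Assumption: resources are divided evenly across all 32 cards.
CARDS = 32
TOTAL_PFLOPS_BLOCKFP8 = 5.2
TOTAL_GDDR6_GB = 384
TOTAL_SRAM_GB = 3.8

tflops_per_card = TOTAL_PFLOPS_BLOCKFP8 * 1000 / CARDS
gddr6_gb_per_card = TOTAL_GDDR6_GB / CARDS
sram_mb_per_card = TOTAL_SRAM_GB * 1000 / CARDS

print(f"{tflops_per_card:.1f} TFLOPS per card")
print(f"{gddr6_gb_per_card:.0f} GB GDDR6 per card")
print(f"{sram_mb_per_card:.2f} MB SRAM per card")
```

The 12GB per-card result agrees with the 12GB GDDR6 figure quoted in the Add-In Board Overview.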

### Tenstorrent Open Source Software

- TT-Forge: MLIR-based compiler integrated into various frameworks; AI/ML models from domain-specific compilers to custom kernel generation
- TT-NN<sup>™</sup>: Library of optimized operators
  - ATen coverage
  - PyTorch-like API
- TT-Metalium<sup>™</sup>: Low-level programming model and entry point
  - Build your own kernels
  - User-facing host API


### TT-Metalium<sup>™</sup>: Built for AI and Scale-Out

- Kernels are plain C++ with APIs
- Dedicated data movement and compute kernels
  - Optimize data movement and compute overlap directly
- Any core can read/write/sync to any core or chip directly
- Full control of data layout and persistency in SRAM and DRAM
- Different cores can run different kernels and flow data directly between them
- Native multi-device kernels
  - Fused and overlapped compute and inter-chip communication within the kernels
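The split between data movement kernels and compute kernels can be pictured as a producer/consumer pipeline connected by small bounded buffers. The following is a conceptual Python sketch only, not TT-Metalium code: the `reader`, `compute`, and `writer` functions, the two-slot buffers, and the tile values are all assumptions made for illustration. Real kernels are written in C++.

```python
import queue
import threading

# Conceptual model only: three "kernels" run concurrently and hand tiles to
# each other through bounded buffers, so data movement overlaps with compute.
# None of these names are TT-Metalium APIs.
SENTINEL = None
NUM_TILES = 8

def reader(out_buf):
    # Data movement in: stands in for fetching tiles from DRAM.
    for tile_id in range(NUM_TILES):
        out_buf.put([tile_id] * 4)   # blocks while the buffer is full
    out_buf.put(SENTINEL)

def compute(in_buf, out_buf):
    # Compute: doubles every element of each tile as it arrives.
    while (tile := in_buf.get()) is not SENTINEL:
        out_buf.put([2 * x for x in tile])
    out_buf.put(SENTINEL)

def writer(in_buf, results):
    # Data movement out: stands in for writing tiles back to DRAM.
    while (tile := in_buf.get()) is not SENTINEL:
        results.append(tile)

def run_pipeline():
    in_buf = queue.Queue(maxsize=2)   # two slots: one filling, one draining
    out_buf = queue.Queue(maxsize=2)
    results = []
    threads = [
        threading.Thread(target=reader, args=(in_buf,)),
        threading.Thread(target=compute, args=(in_buf, out_buf)),
        threading.Thread(target=writer, args=(out_buf, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

results = run_pipeline()
print(len(results), results[3])
```

The bounded buffers are what create the overlap: the reader can fetch the next tile while compute is still working on the current one, and back-pressure stalls a stage only when its neighbor falls behind.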

Figure: TT-Metalium<sup>™</sup> software stack (native multi-device kernels and ops, TT-NN<sup>™</sup>, TT-Metalium C++ Host API, TT-Metalium C++ Kernel API) shown alongside GPU programming.



### Software Ecosystem and Integrations



- General: https://github.com/tenstorrent
- TT-Metalium<sup>™</sup>: https://github.com/tenstorrent/tt-metal
- TT-MLIR: https://github.com/tenstorrent/tt-mlir



### TT-Metalium™: Tensix Core to Multi-Chip Scale-Out



### Simple, practical and intuitive tooling and debug utilities



#### Perf Analyzer

Our Perf Analyzer utility shows the sequence of on-chip operations at runtime, exposing individual op performance.

This helps users spot operations that are stalled in a waiting state, so that suboptimal workflows can be found and fixed.
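To make the idea concrete, the sketch below ranks ops by the share of their runtime spent waiting. The trace records, field names, and numbers are entirely hypothetical and do not reflect Perf Analyzer's actual output format.

```python
# Hypothetical per-op trace: total runtime and time spent stalled, in microseconds.
# Invented sample data for illustration; not real Perf Analyzer output.
trace = [
    {"op": "matmul_0", "total_us": 120.0, "wait_us": 12.0},
    {"op": "softmax_0", "total_us": 40.0, "wait_us": 30.0},
    {"op": "layernorm_0", "total_us": 25.0, "wait_us": 5.0},
]

def wait_fraction(record):
    """Share of an op's runtime spent waiting (0.0 to 1.0)."""
    return record["wait_us"] / record["total_us"]

def stalled_ops(records, threshold=0.5):
    """Return names of ops whose wait share meets the threshold, worst first."""
    flagged = [r for r in records if wait_fraction(r) >= threshold]
    flagged.sort(key=wait_fraction, reverse=True)
    return [r["op"] for r in flagged]

print(stalled_ops(trace))
```

Lowering `threshold` widens the net; in this invented sample only `softmax_0` spends more than half of its runtime waiting.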



#### Reportify

Tenstorrent AI Hardware processes graph-type applications (such as Deep Learning).

To facilitate the understanding of the graph architecture, we have included a visualizer to help expose the connections between the layers of graphs and help refine the application workflow.

#### Human Readable IR/Netlist

Tenstorrent software compiles a human-readable netlist describing how operations will be mapped to Tensix cores, just before compiling the machine-readable binary.

This gives expert developers the flexibility to modify placement and routing on the chip for fine-grained optimizations.
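As an illustration of what hand-editing placement might involve, the sketch below models a placement as a mapping from ops to rectangular blocks of cores and checks that no core is assigned twice. The data layout, op names, and 8x8 grid are assumptions for this example and are not the actual Tenstorrent netlist format.

```python
# Hypothetical placement: each op occupies a rectangle of cores given as
# (x, y, width, height) on an assumed 8x8 grid. Not the real netlist schema.
GRID_W, GRID_H = 8, 8

placement = {
    "conv_0": (0, 0, 4, 2),
    "matmul_0": (4, 0, 4, 4),
    "relu_0": (0, 2, 2, 1),
}

def cores_of(rect):
    """Expand a rectangle into the set of (x, y) core coordinates it covers."""
    x0, y0, w, h = rect
    return {(x, y) for x in range(x0, x0 + w) for y in range(y0, y0 + h)}

def validate(placement):
    """Return a list of problems: out-of-grid ops or cores claimed by two ops."""
    problems = []
    owner = {}
    for op, rect in placement.items():
        for core in sorted(cores_of(rect)):
            x, y = core
            if not (0 <= x < GRID_W and 0 <= y < GRID_H):
                problems.append(f"{op} uses core {core} outside the grid")
            elif core in owner:
                problems.append(f"{op} and {owner[core]} both use core {core}")
            else:
                owner[core] = op
    return problems

print(validate(placement))                       # sample placement is conflict-free
moved = dict(placement, relu_0=(3, 1, 2, 1))     # hand-edit one op's position
print(validate(moved))
```

A check like this is the kind of guard one would want before committing a manual placement change, since a moved op can silently collide with its neighbors.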



#### RouteUI

To help developers visualize the spatial mapping of operations on Tensix cores, we offer the RouteUI utility, which displays on-chip operations.

This enhances understanding of our dataflow approach and enables bottleneck identification for performance improvement.

### Why AI Needs Both RISC-V Cores and AI Accelerators

Tensix cores are ideal for big math operations:

- Vector calculations
- Matrix arithmetic
- Large data sets

Merging Tensix cores and CPU cores on the same die:

- Lowers latency
- Boosts utilization
- Increases ML performance



#### CPU cores are ideal for:

- Conditionality
- Traditional math
- High performance
- Robust programmability

ML developers need both CPU and AI cores to build the dynamic models of the future, which are not feasible today because of the latency and utilization penalties of relying on the host CPU.

### Tenstorrent RISC-V O-o-O Processor Family




### IP Customization Advantage

#### Silicon Providers

Choose from a small set of available options





#### Tenstorrent

### Tenstorrent CPU IP Licenses

#### Tenstorrent offers two CPU IP licensing options

| Innovation CPU IP License | Standard CPU IP License |
|---------------------------|-------------------------|
| Fully modifiable license to Tenstorrent CPU IP enabling faster time to market and differentiated CPU products | License Tenstorrent CPU IP without modification rights (limited rights are negotiable) |
| Source RTL provided for faster time-to-market | Encrypted RTL provided |
| Access to infrastructure IP required for modifying and extending Tenstorrent CPU IP design | Licensees can configure IP with various parameters but cannot extend beyond allowed design space |
| Custom instructions could be possible | Any unauthorized modification voids support and maintenance, warranties, indemnification, etc. |
| Licensable option for branding/naming rights | Licensable option for branding/naming rights |

#### **Innovation License**



- Fully customizable
- Complete ownership
- Source RTL for faster TTM
- Change the ISA (do what x86 and ARM cannot)
- Optimize performance for your specific workloads
- No crazy license restrictions

# Thank You