# Glaze3D<sup>™</sup>

Petri Nordlund, Chief Architect, Bitboys Oy (pnord@bitboys.fi)

**Bitboys Oy** 

# Introduction

- Glaze3D<sup>™</sup> is a new consumer-level 2D/3D-graphics accelerator chip
- Fillrate: 1200 million texels / second
- Designed and developed by Bitboys Oy, a Finnish 3D-graphics hardware company
- Uses Infineon Technologies' 0.20 μm eDRAM process
- 9 MB of embedded framebuffer memory, 128 MB (max) of external video memory

# **Design goals**

- Traditional, proven rendering architecture
- PC'99, Microsoft Windows, Direct3D and OpenGL compatibility
- Multi-chip support, two- and four-chip configurations
- Support for an additional geometry processor, also in multi-chip configurations
- Takes full advantage of embedded DRAM
- Small and efficient rendering core required: embedded DRAM takes most of the available silicon in Glaze3D

### Performance

- Quad-pixel pipeline @ 150 MHz
- 600 million pixels / second (dual textured)
- 1200 million texels / second
- 4.5 million fully featured triangles / second (sustained)
- Cycle-accurate, bit-accurate simulator together with in-house developed PCIBuilder allows performance tuning with real-world applications (Quake III Arena, Viewperf)
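
The fill-rate figures above follow from the pipeline width and the clock. A minimal sketch of that arithmetic, assuming one pixel per pipe per clock and two texels per pixel in dual-textured mode (the function and parameter names are illustrative, not Bitboys terminology):

```python
def fill_rates(pipes, clock_mhz, textures_per_pixel):
    """Return (Mpixels/s, Mtexels/s), assuming one pixel per pipe per clock."""
    mpix = pipes * clock_mhz
    mtex = mpix * textures_per_pixel
    return mpix, mtex

# Quad-pixel pipeline at 150 MHz, dual textured
mpix, mtex = fill_rates(pipes=4, clock_mhz=150, textures_per_pixel=2)
print(mpix, mtex)  # 600 1200
```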

## Performance

- Texture cache: 16 KB cache for even mipmap levels and surface textures, 8 KB cache for odd mipmap levels and lightmaps. Both caches two-way set associative.
- Block coverage issue: pixels are processed in 4-pixel horizontal blocks; expect 90% coverage with average-size triangles
- Quake III Arena: 200 FPS with all features on @ 800x600x32
  - 400 MPIX/s
  - 350,000 drawn triangles/s
  - 3.5 MB of textures / frame, 670 MB/s of texture bandwidth
  - Depth complexity of 4
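
The Quake III Arena figures can be cross-checked with a back-of-the-envelope estimate. The sketch below assumes the pixel rate is simply resolution × depth complexity × frame rate and that texture traffic is per-frame texture data × frame rate; the slide's 400 MPIX/s and 670 MB/s are the presenter's own figures, so the results only need to land in the same range.

```python
def pixel_rate_mpix(width, height, depth_complexity, fps):
    """Rendered pixels per second, in millions (every layer of overdraw counted)."""
    return width * height * depth_complexity * fps / 1e6

def texture_bandwidth_mb_s(mb_per_frame, fps):
    """Texture traffic in MB/s, assuming each frame fetches mb_per_frame once."""
    return mb_per_frame * fps

print(pixel_rate_mpix(800, 600, 4, 200))  # 384.0
print(texture_bandwidth_mb_s(3.5, 200))   # 700.0
```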

# **Features**

- 4 simultaneous textures with trilinear filtering
- DXTC texture compression
- Full-scene, order independent anti-aliasing
- Environment bump mapping
- GDI+ features
- Multiple scaled transparent video overlays
- Digital flat-panel and TV-out support

# The Glaze3D<sup>™</sup> chip

- 304 pin BGA
- 1.5M logic gates
- 130 mm<sup>2</sup> die size
- External SDR SDRAM interface
  - depth and/or color buffer stored here in higher resolutions
  - max 128 MB
  - 64- or 128-bit interface
- PCI and 2X/4X AGP interfaces: AGP interface supports direct AGP texturing

#### Glaze3D architecture




# **Triangle setup engine**




# **Pixel pipeline**




# Why embedded DRAM?

- A graphics accelerator needs GB/s of memory bandwidth: rendering at 600 MPIX/s in true color with 32-bit Z requires 7.2 GB/s
- External memory can no longer provide enough bandwidth for future graphics accelerators
- Cost-efficient: fewer chips on board
- Reduced power consumption
- Customized size: we needed exactly 9 MB (= 72 Mbits)
- Customized organization in terms of bus width, banks, etc.
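
The 7.2 GB/s requirement is not broken down on the slide. One accounting that reproduces it, offered as an assumption, is 12 bytes of framebuffer traffic per pixel: a 4-byte Z read, a 4-byte Z write and a 4-byte color write.

```python
# Assumed per-pixel framebuffer traffic (not stated in the slides)
BYTES_Z_READ = 4       # 32-bit Z test
BYTES_Z_WRITE = 4      # 32-bit Z update
BYTES_COLOR_WRITE = 4  # 32-bit true color

def required_bandwidth_gb_s(mpix_per_s):
    """Framebuffer bandwidth in GB/s for a given pixel rate in Mpixels/s."""
    bytes_per_pixel = BYTES_Z_READ + BYTES_Z_WRITE + BYTES_COLOR_WRITE
    return mpix_per_s * 1e6 * bytes_per_pixel / 1e9

print(required_bandwidth_gb_s(600))  # 7.2
```

Texture fetches and alpha-blend reads would add to this figure; they are left out here because the slide quotes only color and Z.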



#### **Cell-concepts: Trench versus stack**





#### Embedded DRAM



- 72 Mbits (9 MB) of eDRAM
- 150 MHz core/memory clock
- 9.6 GB/s memory bandwidth
- 512-bit interface
- divided into four 18 Mbit modules of 3 banks each
- Stores framebuffer and Z buffer
  - enough for 1024x768x32 bit
- Wide internal buses, need lots of metal layers!
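
Both headline eDRAM numbers can be sanity-checked. The sketch below assumes one 512-bit transfer per clock for the bandwidth, and assumes the 9 MB holds a front buffer, a back buffer and a Z buffer at 32 bits each; the buffer count is a guess that happens to fill the memory exactly, not something the slides state.

```python
def peak_bandwidth_gb_s(bus_bits, clock_mhz):
    """Peak bandwidth in GB/s, assuming one full-width transfer per clock."""
    return bus_bits / 8 * clock_mhz / 1000

def buffers_bytes(width, height, bytes_per_pixel, num_buffers):
    """Total size in bytes of num_buffers full-screen buffers."""
    return width * height * bytes_per_pixel * num_buffers

EDRAM_BYTES = 72 * 2**20 // 8  # 72 Mbit expressed in bytes

print(peak_bandwidth_gb_s(512, 150))                  # 9.6
print(buffers_bytes(1024, 768, 4, 3) == EDRAM_BYTES)  # True
```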

### **Multichip configurations**





- Custom bus interface built into Glaze3D<sup>™</sup>, a cost-effective multi-chip solution
- Thor is a geometry processor
- The monster configuration is capable of 2400 MPIX/s, 10M triangles/s sustained and 4.8 gigatexels/s
- Target markets are:
  - PC desktop high-end
  - Arcade systems

# **Tiled rendering order**

- Full linear framebuffer in video memory but primitives rendered as tiles instead of scanlines
- Framebuffer is divided into tiles (16x16, 32x32, 64x64)
- SLI is not sufficient: it thrashes texture caches!
- In a four-chip configuration, each chip renders one quarter of the tiles
- A Glaze3D<sup>™</sup> rendering chip ignores a primitive if it doesn't fall into one of the tiles that chip renders
- Framebuffer is split between the rendering chips: the monster configuration has a 36 MB embedded framebuffer
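
The slides do not say how tiles are assigned to chips. As an illustration only, the sketch below assumes a repeating 2x2 pattern for four chips (a checkerboard for two) and uses a primitive's pixel bounding box to decide whether a chip can skip it; `tile_owner` and `chip_renders` are hypothetical names.

```python
def tile_owner(tx, ty, num_chips):
    """Hypothetical interleave: 2x2 pattern for four chips, checkerboard otherwise."""
    if num_chips == 4:
        return (tx % 2) + 2 * (ty % 2)
    return (tx + ty) % num_chips

def chip_renders(bbox, chip, num_chips, tile=32):
    """True if the bounding box (x0, y0, x1, y1, inclusive) touches a tile owned by chip."""
    x0, y0, x1, y1 = bbox
    for ty in range(y0 // tile, y1 // tile + 1):
        for tx in range(x0 // tile, x1 // tile + 1):
            if tile_owner(tx, ty, num_chips) == chip:
                return True
    return False

small = (0, 0, 31, 31)  # fits inside tile (0, 0)
large = (0, 0, 63, 63)  # spans a full 2x2 group of tiles
print([chip_renders(small, c, 4) for c in range(4)])  # [True, False, False, False]
print([chip_renders(large, c, 4) for c in range(4)])  # [True, True, True, True]
```

A bounding-box test is conservative: a thin diagonal triangle may be accepted by a chip whose tiles it never actually covers, which costs some setup work but never drops pixels.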

# Key parameters for next technologies

| Technology                                 | C9DD1                    | C10DD0                   | C10DD1                     |
|--------------------------------------------|--------------------------|--------------------------|----------------------------|
| feature size                               | 0.20 µm                  | 0.17 µm                  | 0.15 µm                    |
| 1Mb block size                             | 0.64 mm<sup>2</sup>      | 0.38 mm<sup>2</sup>      | 0.30 mm<sup>2</sup>        |
| raw gate density                           | 45 Kgates/mm<sup>2</sup> | 90 Kgates/mm<sup>2</sup> | ~115 Kgates/mm<sup>2</sup> |
| max. clock rate                            | 200 MHz                  | 250 MHz                  | 300 MHz                    |
| bus width                                  | 512 bit                  | 1024 bit                 | 1024 bit                   |
| max. bandwidth                             | 12 GByte/s               | 32 GByte/s               | 37 GByte/s                 |
| memory / logic<br>on 150 mm<sup>2</sup>    | 100 Mbit<br>2.5 Mgates   | 140 Mbit<br>5 Mgates     | 180 Mbit<br>6.4 Mgates     |
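
The last row of the table can be related to the block-size and gate-density rows. The sketch below is a rough area estimate under the assumption that memory area is Mbit × 1Mb block size and logic area is gate count ÷ raw gate density; it ignores routing, I/O pads and any overhead, which may account for the remainder of the 150 mm<sup>2</sup>.

```python
# (1Mb block size in mm^2, raw gate density in Kgates/mm^2, Mbit, Kgates), from the table
TECHNOLOGIES = {
    "C9DD1":  (0.64, 45, 100, 2500),
    "C10DD0": (0.38, 90, 140, 5000),
    "C10DD1": (0.30, 115, 180, 6400),
}

def raw_area_mm2(block_mm2, kgates_per_mm2, mbit, kgates):
    """Raw memory plus logic area in mm^2, with no routing or pad overhead."""
    memory_area = mbit * block_mm2
    logic_area = kgates / kgates_per_mm2
    return memory_area + logic_area

for name, params in TECHNOLOGIES.items():
    print(name, round(raw_area_mm2(*params), 1), "mm^2 of 150")
```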



# **Future**

- Pump more and more triangles through the pipeline
  - Critical: CPU-to-3D-hardware interface, drivers
  - Geometry processors, advanced geometry processing
- More pixels and texels
  - Expect 8 gigatexels/s in 2001
  - 48 GB/s of memory bandwidth: embedded DRAM is the only solution!
- More features per pixel
  - better texture filtering (anisotropic for 2D only)
  - programmability (procedural textures)
  - realistic materials and surface properties
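
The 48 GB/s projection is consistent with scaling Glaze3D's own ratio of framebuffer bandwidth to texel rate (7.2 GB/s at 1200 million texels/s). The sketch below assumes that ratio stays constant, which is an extrapolation made here and not a claim from the slides.

```python
def scaled_bandwidth_mb_s(target_mtexels, ref_mtexels=1200, ref_bandwidth_mb_s=7200):
    """Scale a reference bandwidth linearly with texel rate (integer MB/s)."""
    return target_mtexels * ref_bandwidth_mb_s // ref_mtexels

# 8 gigatexels/s expected in 2001
print(scaled_bandwidth_mb_s(8000))  # 48000, i.e. 48 GB/s
```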



