Float16 is Swift's half-precision, 16-bit binary floating-point type, implementing the IEEE 754 binary16 format. Each value occupies exactly two bytes, trading precision and dynamic range for reduced memory consumption.

Memory Layout

Under the IEEE 754 standard for binary16, the 16 bits of a Float16 are allocated as follows:
  • Sign bit: 1 bit (determines positive or negative).
  • Exponent: 5 bits (determines the magnitude, with a bias of 15).
  • Significand (Fraction): 10 bits (stores the significant digits). Because normal numbers have an implicit leading 1, it effectively provides 11 bits of precision.
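The layout above can be checked by masking the raw bits. The following is a minimal sketch, not a canonical decoder: the value -1.5 and the variable names are chosen purely for illustration, and the rebuild step only covers the bit-for-bit round trip.

```swift
// Split a Float16 into its three IEEE 754 binary16 fields.
let value: Float16 = -1.5
let bits = value.bitPattern               // raw 16 bits as a UInt16

let signBit = bits >> 15                  // 1 bit
let exponentBits = (bits >> 10) & 0x1F    // 5 bits, stored with a bias of 15
let fractionBits = bits & 0x3FF           // 10 bits, implicit leading 1 not stored

// -1.5 = -(1 + 0.5) * 2^0, so the biased exponent is 0 + 15 = 15
// and the fraction field is 0.5 * 2^10 = 512.
print(signBit, exponentBits, fractionBits) // 1 15 512

// Reassembling the fields yields the original value.
let rebuilt = Float16(bitPattern: (signBit << 15) | (exponentBits << 10) | fractionBits)
print(rebuilt == value)                    // true
```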

Technical Specifications

Due to its constrained bit-width, Float16 has strict mathematical boundaries:
  • Maximum finite magnitude: 65504.0
  • Minimum positive normal magnitude: 2^-14 (approximately 0.000061035)
  • Decimal precision: Approximately 3.3 decimal digits.

let halfPrecision: Float16 = 3.14 // stored as 3.140625, the nearest representable value

// Inspecting IEEE 754 boundaries
let maxFinite = Float16.greatestFiniteMagnitude // 65504.0
let minNormal = Float16.leastNormalMagnitude    // 0.000061035156
let minNonzero = Float16.leastNonzeroMagnitude  // 0.000000059604645
let machineEpsilon = Float16.ulpOfOne           // 0.0009765625
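To make the precision limit concrete, the sketch below looks at the spacing between adjacent values. With 11 significant bits, consecutive integers are representable only up to 2048. The constants here are illustrative and not part of the original listing.

```swift
let one: Float16 = 1
let big: Float16 = 2048

// Near 1.0 the spacing between neighbours equals ulpOfOne (2^-10).
print(one.ulp == Float16.ulpOfOne) // true

// From 2048 up to 4096 the spacing grows to 2, so odd integers are skipped.
print(big.ulp == 2)                // true
print(big.nextUp == 2050)          // true

// 2049 lies exactly halfway between 2048 and 2050; ties-to-even picks 2048.
print(big + 1 == big)              // true
```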

Type Conversion and Arithmetic

Swift enforces strict type safety and does not implicitly promote or demote floating-point types. Arithmetic that combines Float16 with Float (32-bit) or Double (64-bit) requires an explicit conversion. When converting from a higher-precision type, Swift rounds to the nearest representable Float16 value using the default IEEE 754 rounding mode (round to nearest, ties to even). Because of that rounding, source values slightly above 65504.0 still convert to 65504.0; only magnitudes of 65520.0 or more (the midpoint between 65504 and 2^16) overflow to infinity with the matching sign.

let doubleValue: Double = 70000.5
let floatValue: Float = 3.14159265

// Explicit narrowing conversions
let halfFromDouble = Float16(doubleValue) // Evaluates to +infinity (overflow)
let halfFromFloat = Float16(floatValue)   // Evaluates to 3.140625 (rounded to the nearest Float16)

// Arithmetic requires matching types
let a: Float16 = 5.0
let b: Float = 10.0
// let result = a + b // Compiler error: Binary operator '+' cannot be applied to operands of type 'Float16' and 'Float'
let result = a + Float16(b) // 15.0
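The rounding and overflow behaviour of narrowing conversions can be probed directly. This sketch uses illustrative constants and the failable `init(exactly:)` initializer, which returns nil whenever the conversion would change the value.

```swift
// Values just above the largest finite Float16 still round down to it.
let justBelow = Float16(65519.0 as Double)   // 65504.0
// 65520 is the midpoint between 65504 and 2^16; the tie resolves to infinity.
let atMidpoint = Float16(65520.0 as Double)  // +infinity

print(justBelow == Float16.greatestFiniteMagnitude) // true
print(atMidpoint.isInfinite)                         // true

// init(exactly:) succeeds only when no rounding is needed.
let lossless = Float16(exactly: 0.5 as Double)       // 0.5 is a power of two
let lossy = Float16(exactly: 3.14159265 as Double)   // needs more than 11 bits
print(lossless != nil, lossy == nil)                 // true true
```

Using `init(exactly:)` is one way to detect precision loss before it silently enters a computation. Whether that check is worth its cost depends on the workload.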

Hardware Architecture Dependency

The availability and performance of Float16 depend on the underlying instruction set architecture (ISA).
  • ARM Architecture: On Apple Silicon (M-series) and A11 Bionic or newer, Float16 arithmetic maps to native instructions from the ARMv8.2-A half-precision (FP16) extension, so values do not need to be widened to 32 bits for computation.
  • x86_64 Architecture: On Intel-based Macs, the standard library marks Float16 as unavailable, so the type cannot be used there at all. On other x86_64 platforms such as Linux, native half-precision arithmetic is generally absent, so the LLVM backend typically converts the 16-bit values to 32-bit Float for computation and rounds the results back to 16 bits for storage. These extra conversions add computational overhead.
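The widen, compute, narrow pattern can also be applied by hand: keep data in Float16 for compact storage but accumulate in Float. The following is a sketch under that assumption; `sum(_:)` is a hypothetical helper, not a standard library function.

```swift
// Hypothetical helper: stores samples as Float16 but accumulates in Float.
func sum(_ values: [Float16]) -> Float {
    var total: Float = 0
    for value in values {
        total += Float(value) // widening Float16 to Float is always exact
    }
    return total
}

let samples = [Float16](repeating: 1, count: 4096)

let wide = sum(samples)                           // accumulated in Float
let narrow: Float16 = samples.reduce(0, +)        // accumulated in Float16

print(wide == 4096)   // true
// The Float16 accumulator stalls at 2048, where the spacing between values becomes 2.
print(narrow == 2048) // true
```

Accumulating in the wider type avoids rounding after every addition. The two-byte storage format is unchanged.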