NEF — Neural Essence Format
execution pipeline
nef.matmul(a,b) → graph node
No execution at this stage
01 — INPUT
User Code · API Call
Python / Go / C++
raw tensor ops
↳ no compute yet
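The lazy-call behavior above can be modeled in a few lines. This is an illustrative sketch, not the real NEF API: every name here (`Node`, `tensor`, `matmul`) is hypothetical. The point is that an op call only records a graph node with shape/dtype metadata; no arithmetic runs.

```python
# Hypothetical sketch of lazy op recording: calling an op builds a
# graph node carrying metadata instead of computing anything.

class Node:
    def __init__(self, op, inputs, shape, dtype):
        self.op, self.inputs = op, inputs
        self.shape, self.dtype = shape, dtype  # metadata recorded at build time

def tensor(shape, dtype="f32"):
    return Node("input", [], shape, dtype)

def matmul(a, b):
    # (m, k) @ (k, n) -> (m, n); only shapes are checked, no numbers touched
    assert a.shape[1] == b.shape[0], "inner dims must match"
    return Node("matmul", [a, b], (a.shape[0], b.shape[1]), a.dtype)

x = tensor((4, 8))
y = tensor((8, 2))
z = matmul(x, y)  # returns a graph node, not a result
```

Calling `matmul` here costs a shape check and one object allocation, which is what makes "no compute yet" cheap enough to do eagerly on every API call.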
Builds DAG of ops with shape,
dtype, FLOPs metadata. Device-agnostic.
02 — GRAPH BUILDER
Lazy DAG / IR
Directed Acyclic Graph · no execution
nodes = ops
edges = tensors
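With nodes as ops and edges as tensors, the builder's core job is keeping the DAG in a dependency-respecting order. A minimal sketch (the `Op` type is invented for illustration) of the post-order walk an executor would use:

```python
# Sketch: topological (post-order) walk over a lazy DAG.
from dataclasses import dataclass, field

@dataclass(eq=False)
class Op:
    name: str
    inputs: list = field(default_factory=list)  # edges = producing ops

def toposort(root):
    order, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for inp in n.inputs:
            visit(inp)       # producers first
        order.append(n)      # then the consumer
    visit(root)
    return order

a, b = Op("input"), Op("input")
mm = Op("matmul", [a, b])
out = Op("relu", [mm])
names = [n.name for n in toposort(out)]  # inputs before matmul before relu
```

Acyclicity is what guarantees this walk terminates; a cycle would mean a tensor depending on itself.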
OPTIMIZER PASSES
Merges adjacent
elementwise ops
03A
Node Fusion
+ Constant Fold
Removes dead nodes,
finds memory reuse
03B
Dead Elim.
+ Mem Reuse
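Of the passes above, dead-node elimination is the simplest to sketch: keep only nodes reachable from the outputs. The dict-based graph shape here is invented for illustration and is not NEF's IR; fusion and memory reuse would be sibling passes over the same structure.

```python
# Sketch of pass 03B's dead-node elimination: mark everything reachable
# from the outputs, then drop the rest.

def dead_code_elim(graph, outputs):
    # graph: {name: (op, [input names])}
    live, stack = set(), list(outputs)
    while stack:
        n = stack.pop()
        if n in live:
            continue
        live.add(n)
        stack.extend(graph[n][1])  # inputs of a live node are live
    return {k: v for k, v in graph.items() if k in live}

g = {
    "x":    ("input",  []),
    "y":    ("input",  []),
    "mm":   ("matmul", ["x", "y"]),
    "dead": ("relu",   ["y"]),   # never feeds an output
    "out":  ("relu",   ["mm"]),
}
pruned = dead_code_elim(g, ["out"])  # "dead" is removed
```

Running elimination after fusion matters: fusing ops can orphan intermediate nodes that this pass then sweeps away.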
Heuristic op→hw mapping.
Inserts memory transfer nodes.
04 — DEVICE PLANNER
Hardware Assignment
MatMul→GPU · Quant→NPU · fallback→CPU
zero manual device
management required
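A toy version of that heuristic placement (the preference table and fallback rule here are invented for illustration): each op is looked up in a device table, unmatched ops fall back to CPU, and a transfer node is recorded whenever an edge crosses devices.

```python
# Sketch of heuristic op->hw mapping with transfer insertion.
PREFER = {"matmul": "GPU", "quantize": "NPU"}  # illustrative rules

def plan(ops):
    # ops: topologically ordered list of (name, op, [input names])
    placement, transfers = {}, []
    for name, op, ins in ops:
        dev = PREFER.get(op, "CPU")  # fallback -> CPU
        placement[name] = dev
        for i in ins:
            if placement[i] != dev:  # edge crosses devices
                transfers.append((i, placement[i], dev))
    return placement, transfers

ops = [("x", "input", []), ("q", "quantize", ["x"]), ("mm", "matmul", ["q"])]
placement, transfers = plan(ops)
# x stays on CPU, q goes to NPU (CPU->NPU copy), mm to GPU (NPU->GPU copy)
```

Making transfers explicit graph nodes lets later stages schedule and overlap them like any other op.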
Cached by (op, shape, dtype, backend).
Warm re-run = zero recompile.
05 — KERNEL COMPILER
Backend Lowering
PTX · HIP · SPIR-V · AVX-512 · NPU SDK
kernel cache
≥ 95% hit rate
compile once
run forever
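The `(op, shape, dtype, backend)` cache key above can be modeled with a plain dict; identical keys on a warm re-run never reach the compiler. The function names are stand-ins, not NEF internals.

```python
# Sketch of a kernel cache keyed by (op, shape, dtype, backend).
compile_calls = 0

def compile_kernel(op, shape, dtype, backend):
    global compile_calls
    compile_calls += 1  # the expensive step we want to run at most once
    return f"{backend}-kernel[{op} {shape} {dtype}]"  # stand-in for PTX etc.

_cache = {}

def get_kernel(op, shape, dtype, backend):
    key = (op, shape, dtype, backend)
    if key not in _cache:
        _cache[key] = compile_kernel(op, shape, dtype, backend)  # miss: compile once
    return _cache[key]                                           # hit: skip compile

k1 = get_kernel("matmul", (4, 2), "f32", "PTX")
k2 = get_kernel("matmul", (4, 2), "f32", "PTX")  # warm re-run, no recompile
```

Including shape and dtype in the key is what makes the cache safe: a specialized kernel is only reused for byte-identical launch configurations.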
Async dispatch · parallel branches
· zero-copy where hardware supports it
06 — EXECUTION RUNTIME
Async Graph Dispatch
parallel streams · mem transfer · sync barriers
GPU util
≥ 85% sustained
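Async dispatch with a sync barrier can be sketched with a thread pool standing in for device streams: two independent branches run concurrently, and the consuming op blocks on both futures (the barrier) before it fires. This models the scheduling shape only, not NEF's runtime.

```python
# Sketch: parallel branch dispatch with a join barrier.
from concurrent.futures import ThreadPoolExecutor

def run_branch(name):
    # stand-in for a stream of kernels on one device
    return f"{name}-done"

with ThreadPoolExecutor() as pool:
    left = pool.submit(run_branch, "branch-a")   # dispatched without waiting
    right = pool.submit(run_branch, "branch-b")  # runs concurrently with left
    joined = (left.result(), right.result())     # sync barrier: wait for both
```

Keeping branches independent until the barrier is what lets the scheduler keep the device busy while transfers for the other branch are in flight.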
Tensor pulled to CPU memory
only when explicitly accessed
07 — OUTPUT
Materialized Tensor
→ .numpy() · .execute() · hydrad consumer
lazy → concrete
on demand only
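On-demand materialization reduces to a cached thunk: nothing runs until the first explicit access, and later accesses reuse the result. `LazyTensor` and `materialize` are hypothetical names for illustration, not the NEF API.

```python
# Minimal model of lazy -> concrete on explicit access.
class LazyTensor:
    def __init__(self, compute):
        self._compute = compute  # thunk over the whole pending graph
        self._value = None

    def materialize(self):
        if self._value is None:          # first access triggers execution
            self._value = self._compute()
        return self._value               # later accesses reuse the result

t = LazyTensor(lambda: [[1, 2], [3, 4]])
# nothing has executed yet at this point
result = t.materialize()  # lazy -> concrete, on demand only
```

This is why intermediate tensors never touch host memory: only the tensors a user actually asks for pay the execution and transfer cost.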
hardware targets
NVIDIA · CUDA
AMD · ROCm
CPU · AVX-512
NPU · Vendor
Intel · SPIR-V
cache hit → skip compile
planning overhead < 1 ms for graphs under 10K nodes · write once · run anywhere