deep learning · pure c# · no python

Predictable ML inference
for .NET production.

Overfit is a from-scratch deep learning engine for .NET 10. Train in PyTorch or in-process, then run zero-allocation inference with no native dependencies and no Python runtime. AOT-compatible single-binary deployment.

0 B

allocated per GPT-2 token
(KV-cache decode)

7.6×

faster than ONNX Runtime
(Linear 784→10 inference)

native deps. no Python.
AOT compatible.

View on GitHub→ Book a 20-min demo→

What it does

01 · zero-alloc inference

Run trained models with no GC pressure.

Pre-allocated buffers, span-based math, no per-call allocations on the hot path. Predictable tail latency under concurrent load.

using var engine = InferenceEngine.FromSequential(model, 784, 10);

Span<float> input  = stackalloc float[784];
Span<float> output = stackalloc float[10];
engine.Run(input, output);  // zero-allocation

02 · transformers in pure C#

GPT-2 / Llama / Qwen / Mistral inference, no Python.

Load weights from HuggingFace or directly from GGUF files (F32, F16, BF16, Q8_0, Q4_K, Q6_K). KV-cache decode at 0 bytes per token. Top-10 logit overlap 10/10 vs PyTorch reference (maxAbsDiff = 0.000107).

using var gpt2    = Gpt2.LoadSmall("./models/gpt2");
using var session = gpt2.CreateSession();

session.Reset(gpt2.Tokenizer.Encode("Hello, world."));

for (int i = 0; i < 32; i++)
{
    var token = session.GenerateNextToken(in sampling);
    Console.Write(gpt2.Tokenizer.DecodeToken(token));
}

03 · onnx import

Bring PyTorch-trained models, run them natively.

Direct ONNX import. 14 operators supported (Conv, Gemm, ReLU, Tanh, BatchNorm, MaxPool, Add for skip connections, etc.). Branching DAG topology works (ResNet, DenseNet, EfficientNet).

var model = OnnxImporter.Load("classifier.onnx");
model.Eval();

using var engine = InferenceEngine.FromSequential(model, 784, 10);
var prediction = engine.Predict(input);  // zero-alloc

Benchmarks

AMD Ryzen 9 9950X3D · Windows 11 · .NET 10 · BenchmarkDotNet 0.15.8. Reproducible from the repo.

Single inference — Linear(784→10)

Engine	Mean	Allocated	vs ONNX
Overfit	250.7 ns	0 B	7.6× faster
ONNX Runtime (pre-allocated)	1,899 ns	224 B	baseline
ONNX Runtime (standard)	3,388 ns	952 B	0.56×

GPT-2 Small (124M params) — KV-cache decode, 64 tokens

Path	Mean	Allocated
Legacy (full forward per token)	6,318 ms	62.0 MB · grows
KV-cache	973 ms	0 B / token *

* One-time session creation cost (~74 MB constant). Per-token allocation verified zero by a CI assertion that fails the build if anything allocates during decode.

Concurrent inference — 8 threads × 1,000 calls

Engine	Mean	Allocated	vs ONNX
Overfit	522 ms	0 B	3.6× faster
ONNX Runtime	1,894 ms	117 MB	baseline

Built for

infra

Kubernetes anomaly detection

Self-hosted pod-level anomaly detection on AKS / EKS. Replace Datadog or Dynatrace anomaly modules without sending metrics out of cluster. Sub-minute detection latency, no SaaS bill.

.net enterprise

Embedded ML in production services

Add ML inference to existing .NET services without shipping a Python sidecar. One process, one runtime, one deploy pipeline, one security audit.

low-latency

High-frequency / ad-tech inference

Sub-microsecond P99.9 (0.80 µs vs 5.70 µs for ONNX). Zero GC pauses under sustained concurrent load. Deterministic tail latency matters more than peak throughput.

embedded

Edge & on-device deployment

Native AOT single-binary deployment. No CUDA libraries to ship, no Python runtime to install. Industrial automation, gaming anti-cheat, IoT.

What it is NOT

Where this is the wrong tool. Read this before deciding.

Not a PyTorch / TensorFlow replacement for research workloads.
Not GPU-first. Pure CPU. If your workload needs GPU, use PyTorch.
Not for training large transformers from scratch. PyTorch CPU is still 2.2–3.6× faster on training (their GEMM is Intel MKL — decades of AVX-512 assembly that pure-C# does not out-run).
Not a hosted SaaS. Self-hosted, deployed inside your infrastructure.
Not magic. The wins are predictable allocation behaviour and zero P/Invoke overhead — not raw matmul throughput.

Pricing

Dual-licensed: free under AGPLv3 for open-source projects and research. Commercial licenses for closed-source production use:

Self-Service

$4,800

/ year

Commercial license, perpetual for the licensed version
NuGet package + private GitHub support
Email response within 5 business days
Self-managed deployment
Minor version updates included

Get started

Production

$18,000

/ year

Everything in Self-Service
Hands-on deployment to your AKS / EKS / on-prem cluster
Custom model training on your data (1 model)
Quarterly model performance reviews
48-hour SLA for production issues

Book a demo

Enterprise

$48,000

/ year

Everything in Production
24h SLA + on-call support
Dedicated retraining pipeline
Custom feature development
Architecture review and roadmap input

Talk to us

Predictable ML inferencefor .NET production.

What it does

Run trained models with no GC pressure.

GPT-2 / Llama / Qwen / Mistral inference, no Python.

Bring PyTorch-trained models, run them natively.

Benchmarks

Single inference — Linear(784→10)

GPT-2 Small (124M params) — KV-cache decode, 64 tokens

Concurrent inference — 8 threads × 1,000 calls

Built for

Kubernetes anomaly detection

Embedded ML in production services

High-frequency / ad-tech inference

Edge & on-device deployment

What it is NOT

Pricing

Get in touch

Predictable ML inference
for .NET production.