
On-Premise LLM Deployment

A local language model running entirely on your own hardware or a dedicated server, built on Ollama and OpenWebUI, with vLLM as an option for higher load. Employees ask questions, summarise text, and draft emails, all processed locally where data class, cost, or availability make that the right choice.

Why on-premise

  • Data protection and compliance: sensitive documents and customer data can be processed locally where that is useful or legally required
  • Control: model selection, update timing, logging, and access control sit with you
  • Economics at volume: above a certain usage intensity, the monthly API bill exceeds hardware depreciation
  • Pragmatic architecture: we also integrate cloud models like OpenAI or Anthropic when quality, speed, or cost make them the better fit
  • Availability: less dependency on individual API providers and their limits

Stack options

  • Ollama + OpenWebUI for most standard use cases — easy to operate, good model selection, ChatGPT-like interface
  • vLLM or TGI when higher load needs to be served (parallel multi-user with lower latency)
  • Recommended models (as of 2026): Qwen 3 / Qwen 3.5, Llama 4 (Scout / Maverick), Mistral Small 3.2 or Mistral 3 — choice depends on language quality, license, and available resources. We update the recommendation continuously.
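With the Ollama stack, applications talk to a local HTTP API (by default on port 11434). A minimal sketch of a chat request using only the standard library; the model name is an example and must match a model you have pulled locally:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_chat_request(model: str, prompt: str) -> dict:
    """Build the JSON payload for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one complete response instead of a token stream
    }

def ask(model: str, prompt: str) -> str:
    """POST a prompt to the locally running Ollama server, return the answer text."""
    payload = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Usage (requires a running Ollama server with the model pulled):
# print(ask("qwen3:32b", "Summarise the attached meeting notes in three bullets."))
```

Because the interface is plain HTTP on your own network, the same call works from internal tools, scripts, or OpenWebUI without any data leaving the server.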

Hardware requirements

  • Minimum: 24 GB VRAM (e.g. RTX 4090, RTX 5090, or RTX A5000) for 14B models in 4-bit quantisation
  • Comfortable: 48 GB VRAM (e.g. RTX 6000 Ada or two linked consumer cards) for 32B models
  • High: dedicated server with H100 or multi-GPU setup for 70B models and multi-user load
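The tiers above follow a simple rule of thumb: weight memory is roughly parameter count times bits per weight divided by eight, plus headroom for the KV cache, activations, and runtime buffers. A sketch of that arithmetic; the 20% overhead factor is an assumption for illustration, not a measured value, and real headroom grows with context length and concurrent users:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 0.20) -> float:
    """Rough VRAM estimate: quantised weights plus a fixed fractional
    overhead for KV cache, activations, and runtime buffers."""
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bit ~ 1 GB
    return round(weights_gb * (1 + overhead), 1)

# 14B model, 4-bit quantisation: ~8.4 GB -> fits a 24 GB card with room to spare
print(estimate_vram_gb(14, 4))
# 32B model, 4-bit: ~19.2 GB -> tight on 24 GB, comfortable on 48 GB
print(estimate_vram_gb(32, 4))
# 70B model, 4-bit: ~42 GB -> multi-GPU or H100 territory
print(estimate_vram_gb(70, 4))
```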

What’s included

  • Optional hardware consulting before purchase
  • Server setup, OS hardening, GPU drivers, CUDA stack
  • Model deployment and inference parameter tuning
  • Monitoring integration with an existing Grafana stack, or setup of a new one
  • Onboarding for end users (internal documentation, example prompts)
  • Written operations documentation for your IT team

What’s not included

  • Hardware procurement: we recommend specific hardware, but purchasing is handled by you
  • Fine-tuning custom models: by separate arrangement

Realistic expectations

A locally running 32B model is not a direct replacement for the largest cloud frontier models (GPT-5, Claude Opus 4, Gemini 3). For internal use cases — summarising, translating, drafting, code review, RAG over internal documents — it is often more than sufficient. For other tasks, a cloud model is simply the better choice; in that case we plan contracts, data flows, and access controls properly.

Worth it for you?

Before reaching out: run the numbers yourself. Our cost calculator compares the OpenAI API against on-premise operation for your employee count and usage intensity — including the break-even point over three years.