
On-Premise LLM Deployment

A local language model running entirely on your own hardware or a dedicated server, built on Ollama and OpenWebUI, with vLLM as an option for higher load. Employees ask questions, summarise text, and draft emails, all processed locally where data class, cost, or availability make that the right choice.

Why on-premise

  • Data protection and compliance: sensitive documents and customer data can be processed locally where that is useful or legally required
  • Control: model selection, update timing, logging, and access control sit with you
  • Economics at volume: above a certain usage intensity, the monthly API bill exceeds hardware depreciation
  • Pragmatic architecture: we also integrate cloud models like OpenAI or Anthropic when quality, speed, or cost make them the better fit
  • Availability: less dependency on individual API providers and their limits

Stack options

  • Ollama + OpenWebUI for most standard use cases — easy to operate, good model selection, ChatGPT-like interface
  • vLLM or TGI when higher load needs to be served (parallel multi-user with lower latency)
  • Recommended models (as of 2026): Qwen 3 / Qwen 3.5, Llama 4 (Scout / Maverick), Mistral Small 3.2 or Mistral 3 — choice depends on language quality, license, and available resources. We update the recommendation continuously.
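With the Ollama stack, applications talk to a local HTTP API (by default on port 11434). A minimal sketch of a chat request using only the standard library; the model name is an example and must match a model you have pulled locally:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_chat_request(model: str, prompt: str) -> dict:
    """Build the JSON payload for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one complete response instead of a token stream
    }

def ask(model: str, prompt: str) -> str:
    """POST a prompt to the locally running Ollama server, return the answer text."""
    payload = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Usage (requires a running Ollama server with the model pulled):
# print(ask("qwen3:32b", "Summarise the attached meeting notes in three bullets."))
```

Because the interface is plain HTTP on your own network, the same call works from internal tools, scripts, or OpenWebUI without any data leaving the server.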

Hardware requirements

  • Minimum: 24 GB VRAM (e.g. RTX 4090, RTX 5090, or RTX A5000) for 14B models in 4-bit quantisation
  • Comfortable: 48 GB VRAM (e.g. RTX 6000 Ada or two linked consumer cards) for 32B models
  • High: dedicated server with H100 or multi-GPU setup for 70B models and multi-user load
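The tiers above follow a simple rule of thumb: weight memory is roughly parameter count times bits per weight divided by eight, plus headroom for the KV cache, activations, and runtime buffers. A sketch of that arithmetic; the 20% overhead factor is an assumption for illustration, not a measured value, and real headroom grows with context length and concurrent users:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 0.20) -> float:
    """Rough VRAM estimate: quantised weights plus a fixed fractional
    overhead for KV cache, activations, and runtime buffers."""
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bit ~ 1 GB
    return round(weights_gb * (1 + overhead), 1)

# 14B model, 4-bit quantisation: ~8.4 GB -> fits a 24 GB card with room to spare
print(estimate_vram_gb(14, 4))
# 32B model, 4-bit: ~19.2 GB -> tight on 24 GB, comfortable on 48 GB
print(estimate_vram_gb(32, 4))
# 70B model, 4-bit: ~42 GB -> multi-GPU or H100 territory
print(estimate_vram_gb(70, 4))
```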

What’s included

  • Optional hardware consulting before purchase
  • Server setup, OS hardening, GPU drivers, CUDA stack
  • Model deployment and inference parameter tuning
  • Monitoring integration with an existing Grafana stack, or setup of a new one
  • Onboarding for end users (internal documentation, example prompts)
  • Written operations documentation for your IT team

What’s not included

  • Hardware procurement: we recommend specific hardware, but purchasing is handled by you
  • Fine-tuning custom models: by separate arrangement

Realistic expectations

A locally running 32B model is not a direct replacement for the largest cloud frontier models (GPT-5, Claude Opus 4, Gemini 3). For internal use cases — summarising, translating, drafting, code review, RAG over internal documents — it is often more than sufficient. For other tasks, a cloud model is simply the better choice; in that case we plan contracts, data flows, and access controls properly.

Worth it for you?

Before reaching out: run the numbers yourself. Our cost calculator compares the OpenAI API against on-premise operation for your employee count and usage intensity — including the break-even point over three years.