B2B — Automation & Build
On-Premise LLM Deployment
A local language model on your own hardware or a dedicated server: Ollama and OpenWebUI, optionally vLLM for higher load.
Employees ask questions, summarise text, and draft emails, entirely on your infrastructure, where data class, cost, or availability make that the right choice.
Why on-premise
- Data protection and compliance: sensitive documents and customer data can be processed locally where that is useful or legally required
- Control: model selection, update timing, logging, and access control sit with you
- Economics at volume: above a certain usage intensity, the monthly API bill exceeds hardware depreciation
- Pragmatic architecture: we also integrate cloud models like OpenAI or Anthropic when quality, speed, or cost make them the better fit
- Availability: less dependency on individual API providers and their limits
Stack options
- Ollama + OpenWebUI for most standard use cases — easy to operate, good model selection, ChatGPT-like interface
- vLLM or TGI when higher load needs to be served (parallel multi-user with lower latency)
- Recommended models (as of 2026): Qwen 3 / Qwen 3.5, Llama 4 (Scout / Maverick), Mistral Small 3.2 or Mistral 3 — choice depends on language quality, license, and available resources. We update the recommendation continuously.
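For the standard Ollama + OpenWebUI pairing, deployment can be as small as a single docker-compose file. This is a minimal sketch, not a production config: image tags, the published port, and the volume name are illustrative assumptions, and GPU passthrough presumes the NVIDIA container toolkit is installed.

```yaml
services:
  ollama:
    image: ollama/ollama              # inference server, API on port 11434
    volumes:
      - ollama-models:/root/.ollama   # model weights persist across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia          # assumes NVIDIA container toolkit
              count: all
              capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main  # ChatGPT-like frontend
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    ports:
      - "3000:8080"                   # UI reachable at http://<host>:3000
    depends_on:
      - ollama
volumes:
  ollama-models:
```

For a vLLM tier, the frontend stays the same; only the backend container and its OpenAI-compatible endpoint change.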
Hardware requirements
- Minimum: 24 GB VRAM (e.g. RTX 4090, RTX 5090, or RTX A5000) for 14B models in 4-bit quantisation
- Comfortable: 48 GB VRAM (e.g. RTX 6000 Ada or two linked consumer cards) for 32B models
- High: dedicated server with H100 or multi-GPU setup for 70B models and multi-user load
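The tiers above follow from simple arithmetic: weight memory is roughly parameters × bits per weight ÷ 8, plus headroom for KV cache and activations. A back-of-the-envelope sketch; the flat 20% overhead factor is our own working assumption, not a vendor figure:

```python
def estimate_vram_gb(params_billion: float, bits: int, overhead: float = 0.2) -> float:
    """Rough VRAM estimate: quantised weights plus a flat overhead
    factor for KV cache and activations."""
    weights_gb = params_billion * bits / 8   # e.g. 14B at 4-bit -> 7 GB of weights
    return weights_gb * (1 + overhead)

# 14B at 4-bit fits comfortably in 24 GB; 32B at 4-bit fits only with
# little headroom for long contexts, hence the 48 GB tier above.
print(round(estimate_vram_gb(14, 4), 1))  # 8.4
print(round(estimate_vram_gb(32, 4), 1))  # 19.2
print(round(estimate_vram_gb(70, 4), 1))  # 42.0
```

Real KV-cache demand grows with context length and concurrent users, which is why the multi-user tier jumps to H100 or multi-GPU rather than scaling linearly.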
What’s included
- Optional hardware consulting before purchase
- Server setup, OS hardening, GPU drivers, CUDA stack
- Model deployment and inference parameter tuning
- Monitoring integration with an existing Grafana stack, or setup of a new one
- Onboarding for end users (internal documentation, example prompts)
- Written operations documentation for your IT team
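For the monitoring point above, a minimal Prometheus scrape config is usually enough as a starting point. Hostnames, ports, and job names here are hypothetical; the target is vLLM's built-in Prometheus `/metrics` endpoint (Ollama would need a separate exporter):

```yaml
scrape_configs:
  - job_name: vllm                        # vLLM exposes Prometheus metrics on /metrics
    static_configs:
      - targets: ["llm-server:8000"]      # hypothetical host:port of the inference server
  - job_name: node                        # host and GPU metrics via exporters
    static_configs:
      - targets: ["llm-server:9100"]
```

Grafana then dashboards request latency, queue depth, and GPU utilisation from these series.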
What’s not included
Hardware procurement: we recommend components, but you purchase them yourself. Fine-tuning custom models is a separate engagement — happy to discuss.
Realistic expectations
A locally running 32B model is not a direct replacement for the largest cloud frontier models (GPT-5, Claude Opus 4, Gemini 3). For internal use cases — summarising, translating, drafting, code review, RAG over internal documents — it is often more than sufficient. For other tasks, a cloud model is simply the better choice; in that case we plan contracts, data flows, and access controls properly.
Worth it for you?
Before reaching out: run the numbers yourself. Our cost calculator compares the OpenAI API against on-premise operation for your employee count and usage intensity — including the break-even point over three years.
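The calculator's core arithmetic can be sketched like this. All figures in the example (hardware price, API spend per user, running costs) are placeholders you would replace with your own numbers, not quoted prices:

```python
def break_even_months(hardware_eur: float,
                      api_eur_per_user_month: float,
                      users: int,
                      onprem_running_eur_month: float) -> float:
    """Months until cumulative API spend exceeds the hardware cost
    plus cumulative on-prem running costs (power, maintenance)."""
    monthly_saving = api_eur_per_user_month * users - onprem_running_eur_month
    if monthly_saving <= 0:
        return float("inf")  # on-prem never pays off at this usage level
    return hardware_eur / monthly_saving

# Placeholder figures: 12,000 EUR server, 40 EUR/user/month API spend,
# 50 users, 400 EUR/month for power and maintenance
print(round(break_even_months(12_000, 40, 50, 400), 1))  # 7.5
```

Below a certain usage intensity the saving turns negative and the function returns infinity, which is exactly the "cloud is simply the better choice" case described above.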