How to Set Up Ollama with Gemma 4 on Your VPS: Complete 2026 Guide
Posted by CodingMantra Team on April 13, 2026

Imagine having your own AI assistant running on a server you control—no API costs, no rate limits, no data leaving your infrastructure. Just six months ago, this would have required a PhD in machine learning operations and a budget worthy of a tech giant. Today, it takes about 15 minutes and costs less than a fancy dinner.
That's the promise of Ollama + Gemma 4. Google's March 2026 release of Gemma 4 brought flagship-level AI performance to models small enough to run on consumer hardware, while Ollama's streamlined toolchain handles all the complexity of model inference, GPU acceleration, and API management. Together, they let you deploy a production-ready AI server on any VPS—whether it's a budget Hostinger instance, a DigitalOcean droplet, or a full AWS GPU cluster.
This guide walks you through everything: installation, security hardening, performance tuning, and real-world API integration. By the end, you'll have your own self-hosted AI infrastructure that rivals what enterprises were paying six figures for just two years ago.
What You'll Build
- A secure, self-hosted Ollama server running Gemma 4
- HTTPS-enabled API endpoints for production use
- Systemd service for automatic startup and monitoring
- Firewall rules to protect your inference server
- Example integrations for web and mobile apps
Quick Reference: Gemma 4 Models at a Glance
| Model | Best For | Min RAM | Context |
|---|---|---|---|
| gemma4:e2b | Edge devices, testing | 4 GB | 128K |
| gemma4 (default) | Most deployments | 8 GB | 128K |
| gemma4:26b | Production (MoE) | 16 GB | 256K |
| gemma4:31b | Maximum quality | 32 GB | 256K |
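If you're unsure which tag to pull, the table above maps cleanly onto a selection rule. Here's an illustrative Python helper; the thresholds are the minimum-RAM column from the table, and the function name is ours, not part of Ollama:

```python
def pick_gemma4_tag(ram_gb: int, max_quality: bool = False) -> str:
    """Pick a model tag from the quick-reference table based on available RAM."""
    if ram_gb >= 32 and max_quality:
        return "gemma4:31b"   # maximum quality, 32 GB+
    if ram_gb >= 16:
        return "gemma4:26b"   # MoE variant, production
    if ram_gb >= 8:
        return "gemma4"       # default E4B, most deployments
    return "gemma4:e2b"       # edge devices, testing

print(pick_gemma4_tag(8))   # a VPS with 8 GB RAM gets the default model
```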
Section 1: Understanding the Stack
What is Ollama?
Ollama is an open-source toolchain that simplifies running LLMs locally. It handles model downloading, quantization, GPU acceleration, and provides a clean REST API. Think of it as "Docker for LLMs"—it abstracts away the complexity of model inference.
Why Gemma 4?
Google's Gemma 4 (released March 2026) is the latest iteration of their open model family. Key advantages:
- Four Model Variants: E2B (2.3B), E4B (4.5B), 31B dense, and 26B MoE (Mixture of Experts)
- Massive Context Window: 128K tokens on E-models, 256K on 31B and MoE variants
- Multimodal Support: E2B and E4B handle text, images, audio, and video input
- Apache 2.0 License: Fully open for commercial use without restrictions
- Native Function Calling: Built-in tool use and configurable thinking modes
- 140+ Languages: Pre-trained with strong multilingual performance
Section 2: VPS Requirements and Selection
Minimum Hardware Requirements
| Model | RAM (INT4) | RAM (BF16) | VRAM (GPU) | Storage |
|---|---|---|---|---|
| Gemma 4 E2B | 3.2 GB | 9.6 GB | 4 GB recommended | 10 GB |
| Gemma 4 E4B | 5 GB | 15 GB | 8 GB recommended | 15 GB |
| Gemma 4 26B (MoE) | 15.6 GB | 48 GB | 24 GB recommended | 35 GB |
| Gemma 4 31B | 17.4 GB | 58 GB | 32 GB recommended | 40 GB |
Note: INT4 = quantized 4-bit (recommended for most deployments), BF16 = full 16-bit precision. MoE = Mixture of Experts (only 3.8B parameters active during inference).
VPS Provider Comparison
- Hostinger: Best value for beginners. VPS plans starting at ₹399/month with 4 GB RAM (perfect for Gemma 4 E2B/E4B). Use code codingmantra for exclusive discounts.
- DigitalOcean / Linode: Reliable, predictable pricing. GPU droplets available but pricier for sustained workloads.
- AWS EC2 (g5/g6 instances): Best for production with NVIDIA A10G/L4 GPUs. Pay-per-hour flexibility ideal for scaling.
- Google Cloud (G2 instances): Native Gemma support, often includes $300 free credits for new accounts.
- Hetzner: Excellent price-to-performance ratio in Europe. GPU servers starting at €40/month—best value for dedicated hardware.
- OVHcloud: Competitive pricing for dedicated GPU instances, good for long-term deployments with DDoS protection.
- Oracle Cloud Free Tier: Always-free ARM instances with up to 24 GB RAM—perfect for running Gemma 4 E4B at zero cost.
Cost Comparison: Self-Hosted vs API
Why go through the trouble of self-hosting? Let's talk numbers:
| Approach | Monthly Cost | Rate Limits | Data Privacy |
|---|---|---|---|
| Gemini API (Google) | $0.50 per 1M tokens | Yes, tier-based | Data processed by Google |
| Self-Hosted (Hostinger VPS) | ~₹399 fixed | Unlimited | Full control, data stays on your server |
At sustained volumes of a few hundred thousand tokens per day, the fixed VPS cost matches the metered API bill. Beyond that it's pure savings, and you get unlimited experimentation without watching every token.
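To find the break-even point for your own workload, run the arithmetic directly. A sketch using the API price from the table and an assumed VPS cost of about $4.80/month (₹399 converted at an illustrative exchange rate):

```python
def breakeven_tokens_per_day(api_usd_per_mtok: float,
                             vps_usd_per_month: float,
                             days_per_month: int = 30) -> float:
    """Daily token volume at which the fixed VPS cost equals metered API spend."""
    daily_budget = vps_usd_per_month / days_per_month
    return daily_budget / api_usd_per_mtok * 1_000_000

# With $0.50 per 1M tokens and ~$4.80/month for the VPS:
print(f"{breakeven_tokens_per_day(0.50, 4.80):,.0f} tokens/day")
```

Below the break-even volume a metered API is cheaper; above it, every additional token on the VPS is effectively free.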
Pro Tip:
For most use cases, start with the default gemma4 model (the E4B variant) on a VPS with 8-16 GB RAM. You can always scale up or run multiple smaller instances for load balancing.
Section 3: Installing Ollama on Your VPS
Step 1: Connect to Your VPS
ssh username@your-vps-ip-address
Ensure your VPS is running Ubuntu 22.04+ or a comparable Linux distribution. This guide uses Ubuntu as the reference OS.
Step 2: Update System Packages
sudo apt update && sudo apt upgrade -y
Step 3: Install Ollama
Ollama provides a one-line installation script that handles all dependencies:
curl -fsSL https://ollama.com/install.sh | sh
The script automatically:
- Downloads the latest Ollama binary
- Installs required CUDA drivers if an NVIDIA GPU is detected
- Creates the ollama user and service
- Configures systemd for automatic startup
Step 4: Verify Installation
ollama --version
You should see output like ollama version 0.20.x or higher (required for Gemma 4 support). Next, check if the service is running:
systemctl status ollama
Press q to exit the status view. If the service isn't running, start it with sudo systemctl start ollama.
Section 4: Downloading and Running Gemma 4
Pull the Gemma 4 Model
Ollama's model library includes Gemma 4 in various quantized sizes. Choose based on your VPS resources:
# Default model (E4B variant, best for most use cases - 8GB+ RAM)
ollama pull gemma4
# Edge model (E2B variant, minimal resources - 4GB+ RAM)
ollama pull gemma4:e2b
# Mixture of Experts (best quality/efficiency ratio - 16GB+ RAM)
ollama pull gemma4:26b
# Maximum quality dense model (32GB+ RAM with GPU)
ollama pull gemma4:31b
Ollama automatically downloads the INT4 quantized version by default, which offers the best balance between performance and resource usage. You can also request a specific quantization tag, such as gemma4:e4b-q4_K_M, if needed.
Test the Model
Run a quick one-off prompt to verify everything works:
ollama run gemma4 "Hello! Explain quantum computing in 2 sentences."
You should see a coherent response streamed in real time. Running ollama run gemma4 without a prompt opens an interactive chat; type /bye or press Ctrl+D to exit.
Check GPU Acceleration (if applicable)
If your VPS has an NVIDIA GPU, verify Ollama is using it:
nvidia-smi
You should see the Ollama process listed under running processes. GPU inference can be 10-50x faster than CPU-only mode.
Section 5: Configuring Ollama for Production
Environment Variables
Ollama supports several environment variables for customization. Edit the systemd service:
sudo systemctl edit ollama.service
Add the following configuration:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_KEEP_ALIVE=5m"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Key settings explained:
- OLLAMA_HOST=0.0.0.0:11434: Listen on all network interfaces (required for external access)
- OLLAMA_ORIGINS=*: Allow CORS from any origin (restrict this in production)
- OLLAMA_KEEP_ALIVE=5m: Keep models loaded for 5 minutes after the last request
- OLLAMA_NUM_PARALLEL=4: Handle up to 4 concurrent requests
- OLLAMA_MAX_LOADED_MODELS=2: Keep at most 2 models in memory simultaneously
Reload and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama
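A quick way to confirm the service answers after a restart is to hit its version endpoint from code. A minimal stdlib-only Python sketch; adjust the host URL to wherever your server listens:

```python
import json
import urllib.request

def ollama_version(host="http://localhost:11434"):
    """Return the server's version string, or None if it isn't reachable."""
    try:
        with urllib.request.urlopen(f"{host}/api/version", timeout=5) as resp:
            return json.load(resp).get("version")
    except OSError:
        return None

print(ollama_version() or "Ollama is not reachable")
```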
Section 6: Securing Your Ollama Server
Configure Firewall (UFW)
Only allow necessary ports. By default, Ollama uses port 11434:
# Allow SSH (always required)
sudo ufw allow 22/tcp
# Allow Ollama port (restrict to your application server IP if possible)
sudo ufw allow from your_app_server_ip to any port 11434 proto tcp
# If you need public access (not recommended without additional auth)
# sudo ufw allow 11434/tcp
# Enable firewall
sudo ufw enable
Set Up Reverse Proxy with Nginx
A reverse proxy provides HTTPS, rate limiting, and additional security layers:
sudo apt install nginx -y
Create the Nginx configuration:
sudo nano /etc/nginx/sites-available/ollama
Add this configuration:
server {
    listen 80;
    server_name ai.yourdomain.com;

    # Redirect HTTP to HTTPS
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name ai.yourdomain.com;

    # SSL certificates (use Certbot for free Let's Encrypt)
    ssl_certificate /etc/letsencrypt/live/ai.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.yourdomain.com/privkey.pem;

    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;

    location / {
        proxy_pass http://localhost:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Increase timeouts for long-running inference
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;

        # Disable request buffering for streaming responses
        proxy_buffering off;
    }
}
Enable the site and test configuration:
sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
Install SSL Certificate with Certbot
sudo apt install certbot python3-certbot-nginx -y
sudo certbot --nginx -d ai.yourdomain.com
Section 7: Using the Ollama API
Basic API Endpoints
Ollama provides a RESTful API at http://localhost:11434. Here are the essential endpoints:
Chat Completion (Recommended)
curl http://localhost:11434/api/chat -d '{
"model": "gemma4",
"messages": [
{"role": "user", "content": "Explain machine learning in simple terms"}
],
"stream": false
}'
OpenAI-Compatible Endpoint
Ollama also supports OpenAI-compatible API calls for easy integration with existing tools:
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gemma4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Generate Completion (Legacy)
curl http://localhost:11434/api/generate -d '{
"model": "gemma4",
"prompt": "What is the capital of France?",
"stream": false
}'
List Local Models
curl http://localhost:11434/api/tags
Integration Example: Node.js
const OLLAMA_URL = 'https://ai.yourdomain.com/api/chat';

async function chatWithGemma(prompt) {
  const response = await fetch(OLLAMA_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'gemma4',
      messages: [{ role: 'user', content: prompt }],
      stream: false,
      options: {
        temperature: 0.7,
        num_predict: 512,
      },
    }),
  });
  const data = await response.json();
  return data.message.content;
}

// Usage
const answer = await chatWithGemma('How do I optimize database queries?');
console.log(answer);
Integration Example: Python
import requests

OLLAMA_URL = "https://ai.yourdomain.com/api/chat"

def chat_with_gemma(prompt: str) -> str:
    payload = {
        "model": "gemma4",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {
            "temperature": 0.7,
            "num_predict": 512,
        },
    }
    response = requests.post(OLLAMA_URL, json=payload)
    response.raise_for_status()
    return response.json()["message"]["content"]

# Usage
answer = chat_with_gemma("Explain REST API design principles")
print(answer)
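The examples above wait for the complete response. For chat UIs you usually want streaming: with "stream": true, /api/chat returns one JSON object per line, each carrying a fragment of message.content, ending with a line where "done" is true. A stdlib-only sketch (the function names are ours, not Ollama's):

```python
import json
import urllib.request

def extract_tokens(ndjson_lines):
    """Pull content fragments out of raw NDJSON lines (testable offline)."""
    tokens = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        if chunk.get("done"):
            break
        tokens.append(chunk.get("message", {}).get("content", ""))
    return tokens

def stream_chat(prompt, url="http://localhost:11434/api/chat", model="gemma4"):
    """Print response tokens as they arrive from the server."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}],
               "stream": True}
    req = urllib.request.Request(url,
                                 data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # the response body is newline-delimited JSON
            for token in extract_tokens([line.decode()]):
                print(token, end="", flush=True)

# stream_chat("Explain REST API design principles")  # needs a running server
```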
Section 8: Monitoring and Maintenance
Check Ollama Logs
# View recent logs
sudo journalctl -u ollama -n 50
# Follow logs in real-time
sudo journalctl -u ollama -f
Monitor Resource Usage
# Install htop for interactive monitoring
sudo apt install htop -y
htop
# Check GPU usage (if applicable)
watch -n 1 nvidia-smi
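Before reaching for a full metrics stack, a few lines of Python give a serviceable latency summary. A nearest-rank percentile over recorded request durations; the sample timings below are made up for illustration:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (p in 0-100) of a list of latency samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[max(rank, 0)]

# Illustrative per-request latencies in seconds
latencies = [0.8, 1.1, 0.9, 4.2, 1.0, 1.3, 0.7, 1.2]
print("p50:", percentile(latencies, 50), "p95:", percentile(latencies, 95))
```

The p95 figure is usually the one to watch: a single slow model load or cold start shows up there long before it moves the median.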
Update Ollama
To update to the latest version, re-run the installation script, which replaces the binary in place and leaves your downloaded models untouched:
curl -fsSL https://ollama.com/install.sh | sh
Backup Your Models
Ollama stores models in /usr/share/ollama/.ollama/models. To backup:
sudo tar -czf ollama-backup-$(date +%Y%m%d).tar.gz /usr/share/ollama/.ollama/models
Section 9: Real-World Use Cases
Now that you have a working AI server, what can you actually build? Here are some production use cases:
Use Case 1: Customer Support Chatbot
Deploy a domain-specific chatbot trained on your product documentation. Gemma 4's 128K context means you can feed it your entire knowledge base in a single prompt.
curl http://localhost:11434/api/chat -d '{
"model": "gemma4",
"messages": [
{"role": "system", "content": "You are a helpful support agent for our SaaS product. Answer questions based on the documentation provided."},
{"role": "user", "content": "[Paste 50 pages of docs here]... Now answer: How do I reset my API key?"}
]
}'
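To keep a large docs dump inside the model's context window, it helps to budget characters before building the payload. A rough sketch, assuming the common ~4-characters-per-token heuristic; the helper name and budget are illustrative:

```python
def build_support_messages(docs: str, question: str,
                           max_doc_chars: int = 400_000) -> list:
    """Assemble the chat payload: docs go first, truncated to a rough budget
    (~4 chars/token keeps 400K chars safely inside a 128K-token window)."""
    system = ("You are a helpful support agent for our SaaS product. "
              "Answer questions based on the documentation provided.")
    user = f"{docs[:max_doc_chars]}\n\nNow answer: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

msgs = build_support_messages("...product docs...", "How do I reset my API key?")
print(msgs[0]["role"], len(msgs))
```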
Use Case 2: Content Moderation API
Run user-generated content through Gemma 4 to detect spam, harassment, or policy violations before publishing. At tens of tokens per second on a GPU instance, a short moderation verdict completes in well under a second, enough to screen thousands of posts per hour.
Use Case 3: Internal Knowledge Assistant
Connect Gemma 4 to your company's Notion, Confluence, or Slack archives. Employees can ask natural language questions and get answers sourced from internal documents—all without data leaving your infrastructure.
Section 10: Troubleshooting Common Issues
Issue: "Connection Refused" on Port 11434
Solution: Ensure Ollama is configured to listen on all interfaces:
sudo systemctl edit ollama.service
# Add: Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl daemon-reload
sudo systemctl restart ollama
Issue: Slow Inference Speed
Solution: Check if GPU is being utilized:
nvidia-smi
If the GPU isn't being used, ensure CUDA drivers are installed and that Ollama was installed with GPU support. You can also drop to a smaller or more aggressively quantized model (e.g., gemma4:e4b-q4_K_M).
Issue: Out of Memory (OOM) Errors
Solution: Reduce concurrent requests or use a smaller model:
# Reduce parallel requests
Environment="OLLAMA_NUM_PARALLEL=2"
# Or switch to a smaller model (E2B variant)
ollama pull gemma4:e2b
Conclusion: Your Self-Hosted AI Infrastructure
You now have a fully functional, production-ready AI inference server running Gemma 4 on Ollama. This setup gives you:
- Full Data Control: All inference happens on your infrastructure—no data leaves your VPS. Perfect for healthcare, legal, and enterprise applications with strict compliance requirements.
- Cost Predictability: Fixed monthly VPS costs (~$5-50/month) instead of per-token API pricing that scales with your success.
- Unlimited Experimentation: Test wild ideas without watching every token. Build internal tools, prototypes, and R&D projects that would be prohibitively expensive on paid APIs.
- Customization: Fine-tune Gemma 4 for your specific use case, add custom system prompts, or chain multiple models together without vendor restrictions.
- Scalability: Deploy multiple instances behind a load balancer as traffic grows. Horizontal scaling is as simple as spinning up another identical VPS.
What's Next?
- Explore fine-tuning: Train Gemma 4 on your domain-specific data for even better results
- Build a UI: Create a chat interface using the API examples above
- Set up monitoring: Use tools like Prometheus + Grafana to track inference latency and throughput
- Experiment with RAG: Connect Gemma 4 to a vector database for retrieval-augmented generation
The AI infrastructure landscape has fundamentally shifted. What required a team of ML engineers and a six-figure budget in 2024 now fits in a single VPS. The question isn't whether you can afford to self-host—it's whether you can afford not to.
Ready to integrate AI into your applications? Explore CodingMantra's AI-powered tools for inspiration, or start building your own custom solutions with the Ollama API.
Need Help Scaling Your AI Infrastructure?
CodingMantra specializes in custom AI deployments and enterprise-grade LLM integrations. Let's build something amazing together.
Contact Our Team