How to Set Up Ollama with Gemma 4 on Your VPS: Complete 2026 Guide
Posted by CodingMantra Team on April 13, 2026

Imagine having your own AI assistant running on a server you control—no API costs, no rate limits, no data leaving your infrastructure. Just six months ago, this would have required a PhD in machine learning operations and a budget worthy of a tech giant. Today, it takes about 15 minutes and costs less than a fancy dinner.
That's the promise of Ollama + Gemma 4. Google's March 2026 release of Gemma 4 brought flagship-level AI performance to models small enough to run on consumer hardware, while Ollama's streamlined toolchain handles all the complexity of model inference, GPU acceleration, and API management. Together, they let you deploy a production-ready AI server on any VPS—whether it's a budget Hostinger instance, a DigitalOcean droplet, or a full AWS GPU cluster.
This guide walks you through everything: installation, security hardening, performance tuning, and real-world API integration. By the end, you'll have your own self-hosted AI infrastructure that rivals what enterprises were paying six figures for just two years ago.
What You'll Build
- A secure, self-hosted Ollama server running Gemma 4
- HTTPS-enabled API endpoints for production use
- Systemd service for automatic startup and monitoring
- Firewall rules to protect your inference server
- Example integrations for web and mobile apps
Quick Reference: Gemma 4 Models at a Glance
| Model | Best For | Min RAM | Context |
|---|---|---|---|
| gemma4:e2b | Edge devices, testing | 4 GB | 128K |
| gemma4 (default) | Most deployments | 8 GB | 128K |
| gemma4:26b | Production (MoE) | 16 GB | 256K |
| gemma4:31b | Maximum quality | 32 GB | 256K |
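If you're unsure which tag to pull, the table above maps cleanly onto a selection rule. Here's an illustrative Python helper; the thresholds are the minimum-RAM column from the table, and the function name is ours, not part of Ollama:

```python
def pick_gemma4_tag(ram_gb: int, max_quality: bool = False) -> str:
    """Pick a model tag from the quick-reference table based on available RAM."""
    if ram_gb >= 32 and max_quality:
        return "gemma4:31b"   # maximum quality, 32 GB+
    if ram_gb >= 16:
        return "gemma4:26b"   # MoE variant, production
    if ram_gb >= 8:
        return "gemma4"       # default E4B, most deployments
    return "gemma4:e2b"       # edge devices, testing

print(pick_gemma4_tag(8))   # a VPS with 8 GB RAM gets the default model
```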
Section 1: Understanding the Stack
What is Ollama?
Ollama is an open-source toolchain that simplifies running LLMs locally. It handles model downloading, quantization, GPU acceleration, and provides a clean REST API. Think of it as "Docker for LLMs"—it abstracts away the complexity of model inference.
Why Gemma 4?
Google's Gemma 4 (released March 2026) is the latest iteration of their open model family. Key advantages:
- Four Model Variants: E2B (2.3B), E4B (4.5B), 31B dense, and 26B MoE (Mixture of Experts)
- Massive Context Window: 128K tokens on E-models, 256K on 31B and MoE variants
- Multimodal Support: E2B and E4B handle text, images, audio, and video input
- Apache 2.0 License: Fully open for commercial use without restrictions
- Native Function Calling: Built-in tool use and configurable thinking modes
- 140+ Languages: Pre-trained with strong multilingual performance
Section 2: VPS Requirements and Selection
Minimum Hardware Requirements
| Model | RAM (INT4) | RAM (BF16) | VRAM (GPU) | Storage |
|---|---|---|---|---|
| Gemma 4 E2B | 3.2 GB | 9.6 GB | 4 GB recommended | 10 GB |
| Gemma 4 E4B | 5 GB | 15 GB | 8 GB recommended | 15 GB |
| Gemma 4 26B (MoE) | 15.6 GB | 48 GB | 24 GB recommended | 35 GB |
| Gemma 4 31B | 17.4 GB | 58 GB | 32 GB recommended | 40 GB |
Note: INT4 = quantized 4-bit (recommended for most deployments), BF16 = full 16-bit precision. MoE = Mixture of Experts (only 3.8B parameters active during inference).
VPS Provider Comparison
- Hostinger: Best value for beginners. VPS plans starting at ₹399/month with 4 GB RAM (perfect for Gemma 4 E2B/E4B). Use code codingmantra for exclusive discounts.
- DigitalOcean / Linode: Reliable, predictable pricing. GPU droplets available but pricier for sustained workloads.
- AWS EC2 (g5/g6 instances): Best for production with NVIDIA A10G/L4 GPUs. Pay-per-hour flexibility ideal for scaling.
- Google Cloud (G2 instances): Native Gemma support, often includes $300 free credits for new accounts.
- Hetzner: Excellent price-to-performance ratio in Europe. GPU servers starting at €40/month—best value for dedicated hardware.
- OVHcloud: Competitive pricing for dedicated GPU instances, good for long-term deployments with DDoS protection.
- Oracle Cloud Free Tier: Always-free ARM instances with up to 24 GB RAM—perfect for running Gemma 4 E4B at zero cost.
Cost Comparison: Self-Hosted vs API
Why go through the trouble of self-hosting? Let's talk numbers:
| Approach | Monthly Cost | Rate Limits | Data Privacy |
|---|---|---|---|
| Gemini API (Google) | $0.50 per 1M tokens | Yes, tier-based | Data processed by Google |
| Self-Hosted (Hostinger VPS) | ~₹399 fixed | Unlimited | Full control, data stays on your server |
At sustained volumes of a few hundred thousand tokens per day, the fixed VPS cost matches the metered API bill. Beyond that it's pure savings, and you get unlimited experimentation without watching every token.
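To find the break-even point for your own workload, run the arithmetic directly. A sketch using the API price from the table and an assumed VPS cost of about $4.80/month (₹399 converted at an illustrative exchange rate):

```python
def breakeven_tokens_per_day(api_usd_per_mtok: float,
                             vps_usd_per_month: float,
                             days_per_month: int = 30) -> float:
    """Daily token volume at which the fixed VPS cost equals metered API spend."""
    daily_budget = vps_usd_per_month / days_per_month
    return daily_budget / api_usd_per_mtok * 1_000_000

# With $0.50 per 1M tokens and ~$4.80/month for the VPS:
print(f"{breakeven_tokens_per_day(0.50, 4.80):,.0f} tokens/day")
```

Below the break-even volume a metered API is cheaper; above it, every additional token on the VPS is effectively free.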
Pro Tip:
For most use cases, start with the default gemma4 model (the E4B variant) on a VPS with 8-16 GB RAM. You can always scale up or run multiple smaller instances for load balancing.
Section 3: Installing Ollama on Your VPS
Step 1: Connect to Your VPS
ssh username@your-vps-ip-address
Ensure your VPS is running Ubuntu 22.04+ or a comparable Linux distribution. This guide uses Ubuntu as the reference OS.
Step 2: Update System Packages
sudo apt update && sudo apt upgrade -y
Step 3: Install Ollama
Ollama provides a one-line installation script that handles all dependencies:
curl -fsSL https://ollama.com/install.sh | sh
The script automatically:
- Downloads the latest Ollama binary
- Installs required CUDA drivers if an NVIDIA GPU is detected
- Creates the ollama user and service
- Configures systemd for automatic startup
Step 4: Verify Installation
ollama --version
You should see output like ollama version 0.20.x or higher (required for Gemma 4 support). Next, check if the service is running:
systemctl status ollama
Press q to exit the status view. If the service isn't running, start it with sudo systemctl start ollama.
Section 4: Downloading and Running Gemma 4
Pull the Gemma 4 Model
Ollama's model library includes Gemma 4 in various quantized sizes. Choose based on your VPS resources:
# Default model (E4B variant, best for most use cases - 8GB+ RAM)
ollama pull gemma4
# Edge model (E2B variant, minimal resources - 4GB+ RAM)
ollama pull gemma4:e2b
# Mixture of Experts (best quality/efficiency ratio - 16GB+ RAM)
ollama pull gemma4:26b
# Maximum quality dense model (32GB+ RAM with GPU)
ollama pull gemma4:31b
Ollama automatically downloads the INT4 quantized version by default, which offers the best balance between performance and resource usage. You can also request a specific quantization tag, such as gemma4:e4b-q4_K_M, if needed.
Test the Model
Run a quick one-off prompt to verify everything works:
ollama run gemma4 "Hello! Explain quantum computing in 2 sentences."
You should see a coherent response streamed in real time. Running ollama run gemma4 without a prompt opens an interactive chat; type /bye or press Ctrl+D to exit.
Check GPU Acceleration (if applicable)
If your VPS has an NVIDIA GPU, verify Ollama is using it:
nvidia-smi
You should see the Ollama process listed under running processes. GPU inference can be 10-50x faster than CPU-only mode.
Section 5: Configuring Ollama for Production
Environment Variables
Ollama supports several environment variables for customization. Edit the systemd service:
sudo systemctl edit ollama.service
Add the following configuration:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_KEEP_ALIVE=5m"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Key settings explained:
- OLLAMA_HOST=0.0.0.0:11434: Listen on all network interfaces (required for external access)
- OLLAMA_ORIGINS=*: Allow CORS from any origin (restrict this in production)
- OLLAMA_KEEP_ALIVE=5m: Keep models loaded for 5 minutes after the last request
- OLLAMA_NUM_PARALLEL=4: Handle up to 4 concurrent requests
- OLLAMA_MAX_LOADED_MODELS=2: Keep at most 2 models in memory simultaneously
Reload and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama
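A quick way to confirm the service answers after a restart is to hit its version endpoint from code. A minimal stdlib-only Python sketch; adjust the host URL to wherever your server listens:

```python
import json
import urllib.request

def ollama_version(host="http://localhost:11434"):
    """Return the server's version string, or None if it isn't reachable."""
    try:
        with urllib.request.urlopen(f"{host}/api/version", timeout=5) as resp:
            return json.load(resp).get("version")
    except OSError:
        return None

print(ollama_version() or "Ollama is not reachable")
```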
Section 6: Securing Your Ollama Server
Configure Firewall (UFW)
Only allow necessary ports. By default, Ollama uses port 11434:
# Allow SSH (always required)
sudo ufw allow 22/tcp
# Allow Ollama port (restrict to your application server IP if possible)
sudo ufw allow from your_app_server_ip to any port 11434 proto tcp
# If you need public access (not recommended without additional auth)
# sudo ufw allow 11434/tcp
# Enable firewall
sudo ufw enable
Set Up Reverse Proxy with Nginx
A reverse proxy provides HTTPS, rate limiting, and additional security layers:
sudo apt install nginx -y
Create the Nginx configuration:
sudo nano /etc/nginx/sites-available/ollama
Add this configuration:
server {
    listen 80;
    server_name ai.yourdomain.com;

    # Redirect HTTP to HTTPS
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name ai.yourdomain.com;

    # SSL certificates (use Certbot for free Let's Encrypt)
    ssl_certificate /etc/letsencrypt/live/ai.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.yourdomain.com/privkey.pem;

    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;

    location / {
        proxy_pass http://localhost:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Increase timeouts for long-running inference
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;

        # Disable request buffering for streaming responses
        proxy_buffering off;
    }
}
Enable the site and test configuration:
sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
Install SSL Certificate with Certbot
sudo apt install certbot python3-certbot-nginx -y
sudo certbot --nginx -d ai.yourdomain.com
Section 7: Using the Ollama API
Basic API Endpoints
Ollama provides a RESTful API at http://localhost:11434. Here are the essential endpoints:
Chat Completion (Recommended)
curl http://localhost:11434/api/chat -d '{
"model": "gemma4",
"messages": [
{"role": "user", "content": "Explain machine learning in simple terms"}
],
"stream": false
}'
OpenAI-Compatible Endpoint
Ollama also supports OpenAI-compatible API calls for easy integration with existing tools:
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gemma4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Generate Completion (Legacy)
curl http://localhost:11434/api/generate -d '{
"model": "gemma4",
"prompt": "What is the capital of France?",
"stream": false
}'
List Local Models
curl http://localhost:11434/api/tags
Integration Example: Node.js
const OLLAMA_URL = 'https://ai.yourdomain.com/api/chat';

async function chatWithGemma(prompt) {
  const response = await fetch(OLLAMA_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'gemma4',
      messages: [{ role: 'user', content: prompt }],
      stream: false,
      options: {
        temperature: 0.7,
        num_predict: 512,
      },
    }),
  });
  const data = await response.json();
  return data.message.content;
}

// Usage
const answer = await chatWithGemma('How do I optimize database queries?');
console.log(answer);
Integration Example: Python
import requests

OLLAMA_URL = "https://ai.yourdomain.com/api/chat"

def chat_with_gemma(prompt: str) -> str:
    payload = {
        "model": "gemma4",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {
            "temperature": 0.7,
            "num_predict": 512,
        },
    }
    response = requests.post(OLLAMA_URL, json=payload)
    response.raise_for_status()
    return response.json()["message"]["content"]

# Usage
answer = chat_with_gemma("Explain REST API design principles")
print(answer)
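The examples above wait for the complete response. For chat UIs you usually want streaming: with "stream": true, /api/chat returns one JSON object per line, each carrying a fragment of message.content, ending with a line where "done" is true. A stdlib-only sketch (the function names are ours, not Ollama's):

```python
import json
import urllib.request

def extract_tokens(ndjson_lines):
    """Pull content fragments out of raw NDJSON lines (testable offline)."""
    tokens = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        if chunk.get("done"):
            break
        tokens.append(chunk.get("message", {}).get("content", ""))
    return tokens

def stream_chat(prompt, url="http://localhost:11434/api/chat", model="gemma4"):
    """Print response tokens as they arrive from the server."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}],
               "stream": True}
    req = urllib.request.Request(url,
                                 data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # the response body is newline-delimited JSON
            for token in extract_tokens([line.decode()]):
                print(token, end="", flush=True)

# stream_chat("Explain REST API design principles")  # needs a running server
```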
Section 8: Monitoring and Maintenance
Check Ollama Logs
# View recent logs
sudo journalctl -u ollama -n 50
# Follow logs in real-time
sudo journalctl -u ollama -f
Monitor Resource Usage
# Install htop for interactive monitoring
sudo apt install htop -y
htop
# Check GPU usage (if applicable)
watch -n 1 nvidia-smi
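Before reaching for a full metrics stack, a few lines of Python give a serviceable latency summary. A nearest-rank percentile over recorded request durations; the sample timings below are made up for illustration:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (p in 0-100) of a list of latency samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[max(rank, 0)]

# Illustrative per-request latencies in seconds
latencies = [0.8, 1.1, 0.9, 4.2, 1.0, 1.3, 0.7, 1.2]
print("p50:", percentile(latencies, 50), "p95:", percentile(latencies, 95))
```

The p95 figure is usually the one to watch: a single slow model load or cold start shows up there long before it moves the median.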
Update Ollama
To update to the latest version, re-run the installation script, which replaces the binary in place and leaves your downloaded models untouched:
curl -fsSL https://ollama.com/install.sh | sh
Backup Your Models
Ollama stores models in /usr/share/ollama/.ollama/models. To backup:
sudo tar -czf ollama-backup-$(date +%Y%m%d).tar.gz /usr/share/ollama/.ollama/models
Section 9: Real-World Use Cases
Now that you have a working AI server, what can you actually build? Here are some production use cases:
Use Case 1: Customer Support Chatbot
Deploy a domain-specific chatbot trained on your product documentation. Gemma 4's 128K context means you can feed it your entire knowledge base in a single prompt.
curl http://localhost:11434/api/chat -d '{
"model": "gemma4",
"messages": [
{"role": "system", "content": "You are a helpful support agent for our SaaS product. Answer questions based on the documentation provided."},
{"role": "user", "content": "[Paste 50 pages of docs here]... Now answer: How do I reset my API key?"}
]
}'
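To keep a large docs dump inside the model's context window, it helps to budget characters before building the payload. A rough sketch, assuming the common ~4-characters-per-token heuristic; the helper name and budget are illustrative:

```python
def build_support_messages(docs: str, question: str,
                           max_doc_chars: int = 400_000) -> list:
    """Assemble the chat payload: docs go first, truncated to a rough budget
    (~4 chars/token keeps 400K chars safely inside a 128K-token window)."""
    system = ("You are a helpful support agent for our SaaS product. "
              "Answer questions based on the documentation provided.")
    user = f"{docs[:max_doc_chars]}\n\nNow answer: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

msgs = build_support_messages("...product docs...", "How do I reset my API key?")
print(msgs[0]["role"], len(msgs))
```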
Use Case 2: Content Moderation API
Run user-generated content through Gemma 4 to detect spam, harassment, or policy violations before publishing. At tens of tokens per second on a GPU instance, a short moderation verdict completes in well under a second, enough to screen thousands of posts per hour.
Use Case 3: Internal Knowledge Assistant
Connect Gemma 4 to your company's Notion, Confluence, or Slack archives. Employees can ask natural language questions and get answers sourced from internal documents—all without data leaving your infrastructure.
Section 10: Troubleshooting Common Issues
Issue: "Connection Refused" on Port 11434
Solution: Ensure Ollama is configured to listen on all interfaces:
sudo systemctl edit ollama.service
# Add: Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl daemon-reload
sudo systemctl restart ollama
Issue: Slow Inference Speed
Solution: Check if GPU is being utilized:
nvidia-smi
If the GPU isn't being used, ensure CUDA drivers are installed and that Ollama was installed with GPU support. You can also drop to a smaller or more aggressively quantized model (e.g., gemma4:e4b-q4_K_M).
Issue: Out of Memory (OOM) Errors
Solution: Reduce concurrent requests or use a smaller model:
# Reduce parallel requests
Environment="OLLAMA_NUM_PARALLEL=2"
# Or switch to a smaller model (E2B variant)
ollama pull gemma4:e2b
Conclusion: Your Self-Hosted AI Infrastructure
You now have a fully functional, production-ready AI inference server running Gemma 4 on Ollama. This setup gives you:
- Full Data Control: All inference happens on your infrastructure—no data leaves your VPS. Perfect for healthcare, legal, and enterprise applications with strict compliance requirements.
- Cost Predictability: Fixed monthly VPS costs (~$5-50/month) instead of per-token API pricing that scales with your success.
- Unlimited Experimentation: Test wild ideas without watching every token. Build internal tools, prototypes, and R&D projects that would be prohibitively expensive on paid APIs.
- Customization: Fine-tune Gemma 4 for your specific use case, add custom system prompts, or chain multiple models together without vendor restrictions.
- Scalability: Deploy multiple instances behind a load balancer as traffic grows. Horizontal scaling is as simple as spinning up another identical VPS.
What's Next?
- Explore fine-tuning: Train Gemma 4 on your domain-specific data for even better results
- Build a UI: Create a chat interface using the API examples above
- Set up monitoring: Use tools like Prometheus + Grafana to track inference latency and throughput
- Experiment with RAG: Connect Gemma 4 to a vector database for retrieval-augmented generation
The AI infrastructure landscape has fundamentally shifted. What required a team of ML engineers and a six-figure budget in 2024 now fits in a single VPS. The question isn't whether you can afford to self-host—it's whether you can afford not to.
Ready to integrate AI into your applications? Explore CodingMantra's AI-powered tools for inspiration, or start building your own custom solutions with the Ollama API.
Need Help Scaling Your AI Infrastructure?
CodingMantra specializes in custom AI deployments and enterprise-grade LLM integrations. Let's build something amazing together.
Contact Our Team