


    How to Set Up Ollama with Gemma 4 on Your VPS: Complete 2026 Guide

    Posted by CodingMantra Team on April 13, 2026


    Imagine having your own AI assistant running on a server you control—no API costs, no rate limits, no data leaving your infrastructure. Just six months ago, this would have required a PhD in machine learning operations and a budget worthy of a tech giant. Today, it takes about 15 minutes and costs less than a fancy dinner.

    That's the promise of Ollama + Gemma 4. Google's March 2026 release of Gemma 4 brought flagship-level AI performance to models small enough to run on consumer hardware, while Ollama's streamlined toolchain handles all the complexity of model inference, GPU acceleration, and API management. Together, they let you deploy a production-ready AI server on any VPS—whether it's a budget Hostinger instance, a DigitalOcean droplet, or a full AWS GPU cluster.

    This guide walks you through everything: installation, security hardening, performance tuning, and real-world API integration. By the end, you'll have your own self-hosted AI infrastructure that rivals what enterprises were paying six figures for just two years ago.

    What You'll Build

    • A secure, self-hosted Ollama server running Gemma 4
    • HTTPS-enabled API endpoints for production use
    • Systemd service for automatic startup and monitoring
    • Firewall rules to protect your inference server
    • Example integrations for web and mobile apps

    Quick Reference: Gemma 4 Models at a Glance

    Model            | Best For              | Min RAM | Context
    gemma4:e2b       | Edge devices, testing | 4 GB    | 128K
    gemma4 (default) | Most deployments      | 8 GB    | 128K
    gemma4:26b       | Production (MoE)      | 16 GB   | 256K
    gemma4:31b       | Maximum quality       | 32 GB   | 256K

    Section 1: Understanding the Stack

    What is Ollama?

    Ollama is an open-source toolchain that simplifies running LLMs locally. It handles model downloading, quantization, GPU acceleration, and provides a clean REST API. Think of it as "Docker for LLMs"—it abstracts away the complexity of model inference.

    Why Gemma 4?

    Google's Gemma 4 (released March 2026) is the latest iteration of their open model family. Key advantages:

    • Four Model Variants: E2B (2.3B), E4B (4.5B), 31B dense, and 26B MoE (Mixture of Experts)
    • Massive Context Window: 128K tokens on E-models, 256K on 31B and MoE variants
    • Multimodal Support: E2B and E4B handle text, images, audio, and video input
    • Apache 2.0 License: Fully open for commercial use without restrictions
    • Native Function Calling: Built-in tool use and configurable thinking modes
    • 140+ Languages: Pre-trained with strong multilingual performance

    Section 2: VPS Requirements and Selection

    Minimum Hardware Requirements

    Model             | RAM (INT4) | RAM (BF16) | VRAM (GPU)        | Storage
    Gemma 4 E2B       | 3.2 GB     | 9.6 GB     | 4 GB recommended  | 10 GB
    Gemma 4 E4B       | 5 GB       | 15 GB      | 8 GB recommended  | 15 GB
    Gemma 4 26B (MoE) | 15.6 GB    | 48 GB      | 24 GB recommended | 35 GB
    Gemma 4 31B       | 17.4 GB    | 58 GB      | 32 GB recommended | 40 GB

    Note: INT4 = quantized 4-bit (recommended for most deployments), BF16 = full 16-bit precision. MoE = Mixture of Experts (only 3.8B parameters active during inference).
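    As a sanity check on the table, weight memory scales with parameter count times bits per weight. The sketch below is a back-of-the-envelope estimator only (the exact figures above differ because of runtime overhead, KV cache, and architecture tricks like MoE and shared embeddings):

```python
def estimate_ram_gb(params_billion: float, bits_per_weight: int, overhead: float = 1.0) -> float:
    """Naive weight-memory estimate: parameters x bytes per weight.
    Real usage adds KV cache and runtime overhead on top of this."""
    return round(params_billion * bits_per_weight / 8 * overhead, 1)

# 31B dense model: 4-bit vs 16-bit weights
print(estimate_ram_gb(31, 4))   # 15.5 -- in the ballpark of the table's 17.4 GB once overhead is added
print(estimate_ram_gb(31, 16))  # 62.0 -- in the ballpark of the table's 58 GB figure
```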

    VPS Provider Comparison

    • Hostinger: Best value for beginners. VPS plans starting at ₹399/month with 4 GB RAM (perfect for Gemma 4 E2B/E4B). Use code codingmantra for exclusive discounts.
    • DigitalOcean / Linode: Reliable, predictable pricing. GPU droplets available but pricier for sustained workloads.
    • AWS EC2 (g5/g6 instances): Best for production with NVIDIA A10G/L4 GPUs. Pay-per-hour flexibility ideal for scaling.
    • Google Cloud (G2 instances): Native Gemma support, often includes $300 free credits for new accounts.
    • Hetzner: Excellent price-to-performance ratio in Europe. GPU servers starting at €40/month—best value for dedicated hardware.
    • OVHcloud: Competitive pricing for dedicated GPU instances, good for long-term deployments with DDoS protection.
    • Oracle Cloud Free Tier: Always-free ARM instances with up to 24 GB RAM—perfect for running Gemma 4 E4B at zero cost.

    Cost Comparison: Self-Hosted vs API

    Why go through the trouble of self-hosting? Let's talk numbers:

    Approach                    | Monthly Cost        | Rate Limits     | Data Privacy
    Gemini API (Google)         | $0.50 per 1M tokens | Yes, tier-based | Data processed by Google
    Self-Hosted (Hostinger VPS) | ~₹399 fixed         | Unlimited       | Full control, data stays on your server

    At the rates above, the break-even point works out to roughly 10 million tokens per month (around 320K per day). Beyond that it's pure savings—and you get unlimited experimentation without watching every token.
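    Using the pricing in the table, the break-even arithmetic is straightforward (the rupee-to-dollar conversion is an assumption for illustration):

```python
# Break-even between the metered API rate and a fixed-cost VPS.
API_USD_PER_MILLION_TOKENS = 0.50   # Gemini API rate from the table above
VPS_USD_PER_MONTH = 4.80            # assumed: ~Rs 399/month converted to USD

break_even_monthly = VPS_USD_PER_MONTH / API_USD_PER_MILLION_TOKENS * 1_000_000
print(f"{break_even_monthly:,.0f} tokens/month")     # ~9.6M tokens/month
print(f"{break_even_monthly / 30:,.0f} tokens/day")  # ~320K tokens/day
```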

    Pro Tip:

    For most use cases, start with Gemma 4 E4B (the default gemma4 tag) on a VPS with 8-16 GB RAM. You can always scale up or run multiple smaller instances for load balancing.


    Section 3: Installing Ollama on Your VPS

    Step 1: Connect to Your VPS

    ssh username@your-vps-ip-address

    Ensure your VPS is running Ubuntu 22.04+ or a comparable Linux distribution. This guide uses Ubuntu as the reference OS.

    Step 2: Update System Packages

    sudo apt update && sudo apt upgrade -y

    Step 3: Install Ollama

    Ollama provides a one-line installation script that handles all dependencies:

    curl -fsSL https://ollama.com/install.sh | sh

    The script automatically:

    • Downloads the latest Ollama binary
    • Installs required CUDA drivers if NVIDIA GPU is detected
    • Creates the ollama user and service
    • Configures systemd for automatic startup

    Step 4: Verify Installation

    ollama --version

    You should see output like ollama version 0.20.x or higher (required for Gemma 4 support). Next, check if the service is running:

    systemctl status ollama

    Press q to exit the status view. If the service isn't running, start it with sudo systemctl start ollama.


    Section 4: Downloading and Running Gemma 4

    Pull the Gemma 4 Model

    Ollama's model library includes Gemma 4 in various quantized sizes. Choose based on your VPS resources:

    # Default model (E4B variant, best for most use cases - 8GB+ RAM)
    ollama pull gemma4
    
    # Edge model (E2B variant, minimal resources - 4GB+ RAM)
    ollama pull gemma4:e2b
    
    # Mixture of Experts (best quality/efficiency ratio - 16GB+ RAM)
    ollama pull gemma4:26b
    
    # Maximum quality dense model (32GB+ RAM with GPU)
    ollama pull gemma4:31b

    Ollama automatically downloads the INT4 quantized version by default, which offers the best balance between performance and resource usage. You can also specify a particular quantization tag, such as gemma4:e4b-q4_K_M, if needed.

    Test the Model

    Run a quick one-shot prompt to verify everything works:

    ollama run gemma4 "Hello! Explain quantum computing in 2 sentences."

    You should see a coherent response generated in real-time. For an interactive chat session, run ollama run gemma4 with no prompt argument, then type /bye or press Ctrl+D to exit.

    Check GPU Acceleration (if applicable)

    If your VPS has an NVIDIA GPU, verify Ollama is using it:

    nvidia-smi

    You should see the Ollama process listed under running processes. GPU inference can be 10-50x faster than CPU-only mode.


    Section 5: Configuring Ollama for Production

    Environment Variables

    Ollama supports several environment variables for customization. Edit the systemd service:

    sudo systemctl edit ollama.service

    Add the following configuration:

    [Service]
    Environment="OLLAMA_HOST=0.0.0.0:11434"
    Environment="OLLAMA_ORIGINS=*"
    Environment="OLLAMA_KEEP_ALIVE=5m"
    Environment="OLLAMA_NUM_PARALLEL=4"
    Environment="OLLAMA_MAX_LOADED_MODELS=2"

    Key settings explained:

    • OLLAMA_HOST=0.0.0.0:11434: Listen on all network interfaces (required for external access)
    • OLLAMA_ORIGINS=*: Allow CORS from any origin (restrict this in production)
    • OLLAMA_KEEP_ALIVE=5m: Keep models loaded for 5 minutes after last request
    • OLLAMA_NUM_PARALLEL=4: Handle up to 4 concurrent requests
    • OLLAMA_MAX_LOADED_MODELS=2: Keep maximum 2 models in memory simultaneously

    Reload and restart the service:

    sudo systemctl daemon-reload
    sudo systemctl restart ollama
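    Once restarted, you can verify the server is reachable and recent enough for Gemma 4 by polling Ollama's /api/version endpoint. A minimal stdlib sketch (the 0.20.0 minimum mirrors the version noted earlier in this guide):

```python
import json
import urllib.request

def version_at_least(version: str, minimum: str) -> bool:
    """Numeric comparison of dotted version strings, e.g. '0.20.1' >= '0.20.0'."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(version) >= as_tuple(minimum)

def check_server(base_url: str = "http://localhost:11434") -> bool:
    """Return True if the Ollama server responds and meets the minimum version."""
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=5) as resp:
        version = json.load(resp)["version"]
    return version_at_least(version, "0.20.0")
```

    Run check_server() from the VPS itself, or point base_url at your domain once the reverse proxy from Section 6 is in place.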

    Section 6: Securing Your Ollama Server

    Configure Firewall (UFW)

    Only allow necessary ports. By default, Ollama uses port 11434:

    # Allow SSH (always required)
    sudo ufw allow 22/tcp
    
    # Allow Ollama port (restrict to your application server IP if possible)
    sudo ufw allow from your_app_server_ip to any port 11434 proto tcp
    
    # If you need public access (not recommended without additional auth)
    # sudo ufw allow 11434/tcp
    
    # Enable firewall
    sudo ufw enable

    Set Up Reverse Proxy with Nginx

    A reverse proxy provides HTTPS, rate limiting, and additional security layers:

    sudo apt install nginx -y

    Create the Nginx configuration:

    sudo nano /etc/nginx/sites-available/ollama

    Add this configuration:

    server {
        listen 80;
        server_name ai.yourdomain.com;
    
        # Redirect HTTP to HTTPS
        return 301 https://$server_name$request_uri;
    }
    
    server {
        listen 443 ssl http2;
        server_name ai.yourdomain.com;
    
        # SSL certificates (use Certbot for free Let's Encrypt)
        ssl_certificate /etc/letsencrypt/live/ai.yourdomain.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/ai.yourdomain.com/privkey.pem;
    
        # Security headers
        add_header X-Frame-Options "SAMEORIGIN" always;
        add_header X-Content-Type-Options "nosniff" always;
        add_header X-XSS-Protection "1; mode=block" always;
    
        location / {
            proxy_pass http://localhost:11434;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
    
            # Increase timeouts for long-running inference
            proxy_read_timeout 300s;
            proxy_connect_timeout 75s;
    
            # Disable request buffering for streaming responses
            proxy_buffering off;
        }
    }

    Enable the site and test the configuration. Note that nginx -t will fail if the certificate files referenced above don't exist yet—in that case, obtain the certificate first (next step) or temporarily comment out the HTTPS server block:

    sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
    sudo nginx -t
    sudo systemctl reload nginx

    Install SSL Certificate with Certbot

    sudo apt install certbot python3-certbot-nginx -y
    sudo certbot --nginx -d ai.yourdomain.com

    Section 7: Using the Ollama API

    Basic API Endpoints

    Ollama provides a RESTful API at http://localhost:11434. Here are the essential endpoints:

    Chat Completion (Recommended)

    curl http://localhost:11434/api/chat -d '{
      "model": "gemma4",
      "messages": [
        {"role": "user", "content": "Explain machine learning in simple terms"}
      ],
      "stream": false
    }'

    OpenAI-Compatible Endpoint

    Ollama also supports OpenAI-compatible API calls for easy integration with existing tools:

    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gemma4",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'

    Generate Completion (Legacy)

    curl http://localhost:11434/api/generate -d '{
      "model": "gemma4",
      "prompt": "What is the capital of France?",
      "stream": false
    }'

    List Local Models

    curl http://localhost:11434/api/tags

    Integration Example: Node.js

    const OLLAMA_URL = 'https://ai.yourdomain.com/api/chat';
    
    async function chatWithGemma(prompt) {
      const response = await fetch(OLLAMA_URL, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          model: 'gemma4',
          messages: [{ role: 'user', content: prompt }],
          stream: false,
          options: {
            temperature: 0.7,
            num_predict: 512,
          },
        }),
      });
    
      const data = await response.json();
      return data.message.content;
    }
    
    // Usage
    const answer = await chatWithGemma('How do I optimize database queries?');
    console.log(answer);

    Integration Example: Python

    import requests
    
    OLLAMA_URL = "https://ai.yourdomain.com/api/chat"
    
    def chat_with_gemma(prompt: str) -> str:
        payload = {
            "model": "gemma4",
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
            "options": {
                "temperature": 0.7,
                "num_predict": 512,
            }
        }
    
        response = requests.post(OLLAMA_URL, json=payload)
        response.raise_for_status()
    
        return response.json()["message"]["content"]
    
    # Usage
    answer = chat_with_gemma("Explain REST API design principles")
    print(answer)
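    Both examples above set "stream": false for simplicity. With streaming enabled, /api/chat returns newline-delimited JSON chunks, each carrying a fragment of message.content—this is why the Nginx config disables proxy buffering. A sketch of assembling a streamed reply with only the standard library:

```python
import json
import urllib.request

def collect_chunks(lines) -> str:
    """Join content fragments from newline-delimited JSON chat chunks."""
    parts = []
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):  # final chunk carries done=true
            break
    return "".join(parts)

def stream_chat(prompt: str, url: str = "https://ai.yourdomain.com/api/chat") -> str:
    payload = json.dumps({
        "model": "gemma4",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        # The HTTP response is iterable line by line; each line is one JSON chunk.
        return collect_chunks(line.decode("utf-8") for line in response)
```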

    Section 8: Monitoring and Maintenance

    Check Ollama Logs

    # View recent logs
    sudo journalctl -u ollama -n 50
    
    # Follow logs in real-time
    sudo journalctl -u ollama -f

    Monitor Resource Usage

    # Install htop for interactive monitoring
    sudo apt install htop -y
    htop
    
    # Check GPU usage (if applicable)
    watch -n 1 nvidia-smi

    Update Ollama

    Ollama has no built-in self-update command on Linux; to update, re-run the installation script, which replaces the binary with the latest release:

    curl -fsSL https://ollama.com/install.sh | sh

    Afterwards, confirm the new version with ollama --version.

    Backup Your Models

    Ollama stores models in /usr/share/ollama/.ollama/models. To backup:

    sudo tar -czf ollama-backup-$(date +%Y%m%d).tar.gz /usr/share/ollama/.ollama/models

    Section 9: Real-World Use Cases

    Now that you have a working AI server, what can you actually build? Here are some production use cases:

    Use Case 1: Customer Support Chatbot

    Deploy a domain-specific chatbot trained on your product documentation. Gemma 4's 128K context means you can feed it your entire knowledge base in a single prompt.

    curl http://localhost:11434/api/chat -d '{
      "model": "gemma4",
      "messages": [
        {"role": "system", "content": "You are a helpful support agent for our SaaS product. Answer questions based on the documentation provided."},
        {"role": "user", "content": "[Paste 50 pages of docs here]... Now answer: How do I reset my API key?"}
      ]
    }'
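    A 128K-token window is large but finite—once the system prompt and chat history are added, "50 pages of docs" can still overflow it. A crude guard using the common ~4-characters-per-token heuristic (an approximation; exact counts require the model's tokenizer):

```python
def trim_to_token_budget(text: str, max_tokens: int = 120_000,
                         chars_per_token: float = 4.0) -> str:
    """Truncate context so it roughly fits a token budget, keeping the start.
    ~4 chars/token is a rough heuristic for English text, not an exact count."""
    max_chars = int(max_tokens * chars_per_token)
    return text if len(text) <= max_chars else text[:max_chars]
```

    Apply it to the documentation string before building the messages array, leaving headroom for the system prompt and the model's reply.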

    Use Case 2: Content Moderation API

    Run user-generated content through Gemma 4 to detect spam, harassment, or policy violations before publishing. At tens of tokens per second on a GPU instance, you can moderate thousands of posts per hour.

    Use Case 3: Internal Knowledge Assistant

    Connect Gemma 4 to your company's Notion, Confluence, or Slack archives. Employees can ask natural language questions and get answers sourced from internal documents—all without data leaving your infrastructure.


    Section 10: Troubleshooting Common Issues

    Issue: "Connection Refused" on Port 11434

    Solution: Ensure Ollama is configured to listen on all interfaces:

    sudo systemctl edit ollama.service
    # Add: Environment="OLLAMA_HOST=0.0.0.0:11434"
    sudo systemctl daemon-reload
    sudo systemctl restart ollama

    Issue: Slow Inference Speed

    Solution: Check if GPU is being utilized:

    nvidia-smi

    If GPU isn't being used, ensure CUDA drivers are installed and Ollama was installed with GPU support. Consider switching to a smaller model or a more aggressive quantization (e.g., gemma4:e2b).

    Issue: Out of Memory (OOM) Errors

    Solution: Reduce concurrent requests or use a smaller model:

    # Reduce parallel requests
    Environment="OLLAMA_NUM_PARALLEL=2"
    
    # Or switch to a smaller model (E2B variant)
    ollama pull gemma4:e2b

    Conclusion: Your Self-Hosted AI Infrastructure

    You now have a fully functional, production-ready AI inference server running Gemma 4 on Ollama. This setup gives you:

    • Full Data Control: All inference happens on your infrastructure—no data leaves your VPS. Perfect for healthcare, legal, and enterprise applications with strict compliance requirements.
    • Cost Predictability: Fixed monthly VPS costs (~$5-50/month) instead of per-token API pricing that scales with your success.
    • Unlimited Experimentation: Test wild ideas without watching every token. Build internal tools, prototypes, and R&D projects that would be prohibitively expensive on paid APIs.
    • Customization: Fine-tune Gemma 4 for your specific use case, add custom system prompts, or chain multiple models together without vendor restrictions.
    • Scalability: Deploy multiple instances behind a load balancer as traffic grows. Horizontal scaling is as simple as spinning up another identical VPS.

    What's Next?

    • Explore fine-tuning: Train Gemma 4 on your domain-specific data for even better results
    • Build a UI: Create a chat interface using the API examples above
    • Set up monitoring: Use tools like Prometheus + Grafana to track inference latency and throughput
    • Experiment with RAG: Connect Gemma 4 to a vector database for retrieval-augmented generation
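    As a taste of the RAG idea from the last bullet: retrieval just means scoring your documents against the question and prompting the model with the best matches. A toy bag-of-words sketch (a real deployment would use an embedding model and a vector database):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_k(query: str, docs: list, k: int = 2) -> list:
    """Return the k documents most similar to the query."""
    qv = Counter(query.lower().split())
    return sorted(docs,
                  key=lambda d: cosine(qv, Counter(d.lower().split())),
                  reverse=True)[:k]

docs = [
    "To reset your API key, open Settings and click Regenerate.",
    "Billing invoices are emailed on the first of each month.",
    "Gemma 4 supports function calling and long contexts.",
]
# The Settings/Regenerate snippet ranks first for this query.
print(top_k("how do I reset my api key", docs, k=1))
```

    The retrieved snippets then go into the prompt exactly as in the customer support example in Section 9.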

    The AI infrastructure landscape has fundamentally shifted. What required a team of ML engineers and a six-figure budget in 2024 now fits in a single VPS. The question isn't whether you can afford to self-host—it's whether you can afford not to.

    Ready to integrate AI into your applications? Explore CodingMantra's AI-powered tools for inspiration, or start building your own custom solutions with the Ollama API.

    Need Help Scaling Your AI Infrastructure?

    CodingMantra specializes in custom AI deployments and enterprise-grade LLM integrations. Let's build something amazing together.

    Contact Our Team