How to Run Llama 3 8B Locally in 2025 (CPU/GPU Complete Guide)

AI Development · 1/10/2025 · 5 min read · Kuldeep (Software Engineer)

Learn how to run Meta's Llama 3 8B locally on CPU or GPU using Ollama or llama.cpp. Complete guide with RAM requirements, installation steps, and performance optimization tips.

If you’re a developer with 16GB RAM looking to run state-of-the-art AI locally without cloud dependencies, you’re in the right place. In this comprehensive guide, you’ll learn how to set up Meta’s Llama 3 8B model on your local machine using multiple methods, optimize performance for your hardware, and understand when Llama 3 beats even larger models.

AI Platforms: Compare local Llama 3 with Google AI Studio, ChatGPT, and Claude Sonnet for your development needs.

What you’ll learn:

  • Complete hardware requirements for Llama 3 8B
  • Step-by-step installation with Ollama (easiest method)
  • Advanced setup with llama.cpp for maximum performance
  • Performance optimization and troubleshooting
  • Real-world comparisons with Mistral, Gemma, and other models

Why Llama 3 matters now: With 8K context length, improved reasoning capabilities, and open weights, Llama 3 represents a significant leap in accessible AI technology. The 8B parameter model delivers performance comparable to much larger models while remaining feasible to run on consumer hardware.

What’s New in Llama 3?

Meta’s Llama 3 represents a major evolution in open-source language models:

Key Improvements Over Llama 2

Enhanced Architecture:

  • 8B & 70B Models: More efficient parameter usage
  • 8K Context Length: Double the context of Llama 2 (4K)
  • Improved Training: Trained on 15T tokens (vs 2T for Llama 2)
  • Better Reasoning: Enhanced mathematical and logical reasoning
  • Code Generation: Superior coding capabilities

Technical Specifications:

Model Size: 8B parameters
Context Length: 8,192 tokens
Training Data: 15 trillion tokens
Vocabulary Size: 128,256
Architecture: Transformer with Grouped Query Attention
Quantization: GGUF format support

Can Your Machine Run Llama 3?

Hardware Requirements

Minimum Requirements:

  • RAM: 16GB (32GB recommended)
  • Storage: 8GB for model files
  • CPU: Modern multi-core processor (Intel i5/AMD Ryzen 5 or better)

GPU Requirements (Recommended):

  • VRAM: 8GB minimum (16GB+ optimal)
  • CUDA: Compatible NVIDIA GPU (RTX 3060 12GB or better)
  • Alternative: AMD GPU with ROCm support

Memory Requirements by Quantization

# VRAM requirements for different quantization levels
quantization_requirements = {
    "Q2_K": "3.2GB VRAM",
    "Q3_K_L": "4.3GB VRAM", 
    "Q4_0": "4.7GB VRAM",
    "Q4_K_M": "5.4GB VRAM",
    "Q5_0": "5.6GB VRAM",
    "Q6_K": "6.6GB VRAM",
    "Q8_0": "8.5GB VRAM",
    "FP16": "16GB VRAM"
}
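
If you're not sure which quantization your machine can handle, a quick sanity check like the sketch below can help. It is only a rough estimate under stated assumptions: it requires the psutil package (pip install psutil), uses the VRAM table above as a stand-in for memory needs, and adds a few GB of headroom for the OS and context processing.

# Rough check: which quantization levels fit in this machine's RAM?
# Assumes `pip install psutil`; thresholds come from the table above.
import psutil

quantization_requirements_gb = {
    "Q2_K": 3.2, "Q3_K_L": 4.3, "Q4_0": 4.7, "Q4_K_M": 5.4,
    "Q5_0": 5.6, "Q6_K": 6.6, "Q8_0": 8.5, "FP16": 16.0,
}

total_ram_gb = psutil.virtual_memory().total / 1024**3
headroom_gb = 6  # OS, other applications, KV cache

for quant, needed in quantization_requirements_gb.items():
    fits = "OK" if needed + headroom_gb <= total_ram_gb else "too big"
    print(f"{quant:7s} needs ~{needed:.1f}GB -> {fits}")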

System Compatibility:

  • Windows 10/11 (with WSL2 recommended)
  • macOS (Intel and Apple Silicon)
  • Linux (Ubuntu 20.04+, Debian, Arch)

Step-by-Step: Run Llama 3 with Ollama (Easiest Method)

Ollama is the simplest way to get Llama 3 running locally. Here’s the complete setup:

Installation

Windows:

# Download and install Ollama
winget install Ollama.Ollama

# Or download from: https://ollama.com/download

macOS:

# Install via Homebrew
brew install ollama

# Or download from: https://ollama.com/download

Linux:

# Install via curl
curl -fsSL https://ollama.com/install.sh | sh

# Some distributions also package Ollama (e.g. Arch Linux: sudo pacman -S ollama);
# on Ubuntu/Debian, prefer the install script above

Running Llama 3 8B

# Download and run Llama 3 8B (first time)
ollama run llama3

# The model will be downloaded automatically (~4.7GB)
# Once downloaded, you can start chatting immediately

Example Interaction:

>>> Write a Python function to calculate fibonacci numbers

def fibonacci(n):
    """
    Calculate the nth Fibonacci number using dynamic programming.
    
    Args:
        n (int): The position in the Fibonacci sequence
        
    Returns:
        int: The nth Fibonacci number
    """
    if n <= 1:
        return n
    
    # Initialize base cases
    prev, curr = 0, 1
    
    # Calculate Fibonacci number iteratively
    for i in range(2, n + 1):
        prev, curr = curr, prev + curr
    
    return curr

# Test the function
print(fibonacci(10))  # Output: 55
print(fibonacci(15))  # Output: 610

Advanced Ollama Usage

Custom Model Configuration:

# Create a custom model file
cat > Modelfile << EOF
FROM llama3

# Set custom parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 8192

# Set system message
SYSTEM """You are a helpful coding assistant. 
Provide clear, well-commented code with explanations."""
EOF

# Create custom model
ollama create coding-assistant -f Modelfile

# Run custom model
ollama run coding-assistant

API Usage:

import requests
import json

def query_llama3(prompt, model="llama3"):
    """Query Llama 3 via Ollama API"""
    url = "http://localhost:11434/api/generate"
    
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.7,
            "top_p": 0.9,
            "num_ctx": 8192
        }
    }
    
    response = requests.post(url, json=data)
    return response.json()["response"]

# Example usage
result = query_llama3("Explain quantum computing in simple terms")
print(result)
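
The same endpoint also supports streaming: set "stream": True and Ollama returns one JSON object per line, each carrying a chunk of the response. A minimal sketch:

import json
import requests

def stream_llama3(prompt, model="llama3"):
    """Stream tokens from the Ollama API as they are generated."""
    url = "http://localhost:11434/api/generate"
    data = {"model": model, "prompt": prompt, "stream": True}

    with requests.post(url, json=data, stream=True) as response:
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                break
    print()

stream_llama3("Explain quantum computing in simple terms")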

Advanced: Use Llama 3 with llama.cpp (Maximum Performance)

For maximum performance and flexibility, llama.cpp offers the best optimization:

Installation and Setup

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build with CUDA support (older Makefile-based builds)
make LLAMA_CUDA=1

# Recent llama.cpp versions build with CMake instead:
# cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release

# Windows: build with CMake and Visual Studio,
# or use a prebuilt binary from the GitHub releases page

Download GGUF Model

# Download a Llama 3 8B GGUF model (Q4_K_M is a good default quantization)
# Note: repositories and file names vary by uploader; if this repo is unavailable,
# search Hugging Face for "Llama-3-8B-Instruct GGUF" (e.g. QuantFactory or bartowski)
wget https://huggingface.co/TheBloke/Llama-3-8B-Instruct-GGUF/resolve/main/llama-3-8b-instruct.Q4_K_M.gguf

# Alternative: Use huggingface-hub
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    repo_id='TheBloke/Llama-3-8B-Instruct-GGUF',
    filename='llama-3-8b-instruct.Q4_K_M.gguf',
    local_dir='./models'
)
"

Running with llama.cpp

# Basic inference (newer llama.cpp builds name this binary llama-cli)
./main -m llama-3-8b-instruct.Q4_K_M.gguf -p "Explain machine learning"

# Interactive mode
./main -m llama-3-8b-instruct.Q4_K_M.gguf -i

# With GPU acceleration (if available)
./main -m llama-3-8b-instruct.Q4_K_M.gguf -ngl 32 -p "Write a Python script"

# Quick generation timing (llama.cpp also ships a dedicated llama-bench tool)
./main -m llama-3-8b-instruct.Q4_K_M.gguf -p "Hello" -n 100 --log-disable

Python Integration

import subprocess
import json

class LlamaCPP:
    def __init__(self, model_path, executable_path="./main"):
        self.model_path = model_path
        self.executable_path = executable_path
    
    def generate(self, prompt, max_tokens=512, temperature=0.7):
        """Generate text using llama.cpp"""
        cmd = [
            self.executable_path,
            "-m", self.model_path,
            "-p", prompt,
            "-n", str(max_tokens),
            "--temp", str(temperature),
            "--log-disable"
        ]
        
        result = subprocess.run(cmd, capture_output=True, text=True)
        return result.stdout.strip()
    
    def interactive(self):
        """Start interactive session"""
        cmd = [self.executable_path, "-m", self.model_path, "-i"]
        subprocess.run(cmd)

# Usage
llama = LlamaCPP("llama-3-8b-instruct.Q4_K_M.gguf")
response = llama.generate("Explain neural networks")
print(response)
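
If you would rather stay in-process than shell out to the binary, the llama-cpp-python bindings (pip install llama-cpp-python) load the same GGUF files. A minimal sketch, assuming the package is installed and using the model file downloaded above:

# In-process alternative to the subprocess wrapper above.
# Requires: pip install llama-cpp-python (build with CUDA/Metal support for GPU offload)
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-8b-instruct.Q4_K_M.gguf",
    n_ctx=8192,        # context length
    n_gpu_layers=32,   # set to 0 for CPU-only inference
)

output = llm("Explain neural networks", max_tokens=256, temperature=0.7)
print(output["choices"][0]["text"])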

Performance Optimization Tips

GPU Acceleration

NVIDIA GPU (CUDA):

# Check CUDA availability
nvidia-smi

# Run with GPU layers (adjust -ngl based on VRAM)
./main -m llama-3-8b-instruct.Q4_K_M.gguf -ngl 32

# For 8GB VRAM: -ngl 20
# For 16GB VRAM: -ngl 32
# For 24GB+ VRAM: -ngl 40
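
To pick -ngl without guessing, you can query free VRAM with nvidia-smi and apply the rough thresholds above. A sketch under simple assumptions (nvidia-smi on PATH, single GPU, Llama 3 8B at Q4):

# Suggest an -ngl value from free VRAM, using the rough thresholds above.
import subprocess

def suggest_gpu_layers():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    free_mb = int(out.stdout.splitlines()[0])
    if free_mb >= 24_000:
        return 40
    if free_mb >= 16_000:
        return 32
    if free_mb >= 8_000:
        return 20
    return 0  # fall back to CPU-only

print(f"Suggested -ngl: {suggest_gpu_layers()}")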

AMD GPU (ROCm):

# Install ROCm (Ubuntu)
wget https://repo.radeon.com/amdgpu-install/5.7/ubuntu/jammy/amdgpu-install_5.7.50700-1_all.deb
sudo dpkg -i amdgpu-install_5.7.50700-1_all.deb

# Build llama.cpp with ROCm
make LLAMA_HIPBLAS=1

# Run with GPU acceleration
./main -m llama-3-8b-instruct.Q4_K_M.gguf -ngl 32

Memory Optimization

# Optimize for low-memory systems
# (conceptual settings; with llama.cpp these map to flags such as -b, -c, -t)
optimization_settings = {
    "batch_size": 512,        # Reduce if you hit out-of-memory errors (-b)
    "context_length": 4096,   # Reduce from 8K if needed (-c)
    "threads": 8,             # Match your physical CPU cores (-t)
    "memory_f16": True,       # Use FP16 precision for the KV cache
    "no_kv_offload": False    # Keep KV cache offloading enabled
}

Speed Benchmarks

Hardware            Tokens/Second   Memory Usage   Method
RTX 4090 24GB       85              8.2GB          llama.cpp + CUDA
RTX 3090 24GB       65              8.2GB          llama.cpp + CUDA
RTX 3060 12GB       35              6.8GB          llama.cpp + CUDA
M2 Max 32GB         25              6.2GB          llama.cpp + Metal
CPU (16 cores)      8               12GB           llama.cpp CPU
Ollama              15              6.5GB          Ollama default

Common Issues & Solutions

"Model Not Found" Error

# Ensure correct model name
ollama run llama3:8b  # Not llama3:8B or llama-3

# Check available models
ollama list

# Pull model explicitly
ollama pull llama3:8b

Out of Memory Errors

# Use lower quantization
ollama run llama3:8b-instruct-q4_0

# Reduce context length
./main -m model.gguf -c 2048  # Reduce from 8192

# Lock the model in RAM to prevent swapping
./main -m model.gguf --mlock

Slow Inference Speed

# Optimize for speed
speed_optimizations = {
    "batch_size": 1024,       # Increase batch size
    "threads": -1,            # Use all CPU cores
    "gpu_layers": 32,         # More GPU layers
    "memory_f16": True,       # FP16 precision
    "no_mmap": False,         # Enable memory mapping
    "no_mlock": False         # Lock memory pages
}

Installation Issues

Windows WSL2 Setup:

# Enable WSL2 and install Ubuntu
wsl --install -d Ubuntu

# Install CUDA in WSL2
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda

Llama 3 vs Alternatives Comparison

Performance Comparison Table

Model        Size   Context   Speed        Memory   License          Best For
Llama 3 8B   8B     8K        ⭐⭐⭐⭐     6GB      Custom           General tasks, coding
Mistral 7B   7B     32K       ⭐⭐⭐⭐⭐   5GB      Apache 2.0       Long context
Gemma 7B     7B     8K        ⭐⭐⭐       6GB      Gemma Terms      Research
Phi-3 Mini   3.8B   4K        ⭐⭐⭐⭐⭐   3GB      MIT              Mobile/edge
Qwen 7B      7B     32K       ⭐⭐⭐⭐     6GB      Tongyi Qianwen   Multilingual

Detailed Comparison

Llama 3 8B Advantages:

  • Superior reasoning capabilities
  • Better code generation
  • Strong mathematical performance
  • Active community support
  • Regular updates from Meta

Mistral 7B Advantages:

  • Longer context (32K tokens)
  • Faster inference
  • More permissive license
  • Better for long documents

When to Choose Llama 3:

  • Need strong reasoning and problem-solving
  • Working on coding projects
  • Want the latest model architecture
  • Need good balance of performance and size

Building a Chatbot with Llama 3

Here’s a complete example of building a web-based chatbot:

from flask import Flask, render_template, request, jsonify
import requests
import json

app = Flask(__name__)

class LlamaChatbot:
    def __init__(self, model_name="llama3"):
        self.model_name = model_name
        self.api_url = "http://localhost:11434/api/generate"
    
    def generate_response(self, message, context=""):
        """Generate chatbot response"""
        prompt = f"""
        You are a helpful AI assistant. Answer the user's question clearly and concisely.
        
        Context: {context}
        User: {message}
        Assistant:"""
        
        data = {
            "model": self.model_name,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.7,
                "top_p": 0.9,
                "num_ctx": 4096
            }
        }
        
        try:
            response = requests.post(self.api_url, json=data, timeout=30)
            return response.json()["response"]
        except Exception as e:
            return f"Error: {str(e)}"

chatbot = LlamaChatbot()

@app.route('/')
def index():
    return render_template('chat.html')

@app.route('/chat', methods=['POST'])
def chat():
    user_message = request.json.get('message', '')
    response = chatbot.generate_response(user_message)
    return jsonify({'response': response})

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=5000)
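
Once both Ollama and the Flask server are running, you can smoke-test the /chat endpoint from another terminal before wiring up the frontend:

# Quick smoke test for the /chat endpoint defined above.
import requests

resp = requests.post(
    "http://localhost:5000/chat",
    json={"message": "What is a context window?"},
    timeout=60,
)
print(resp.json()["response"])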

HTML Template (templates/chat.html):

<!DOCTYPE html>
<html>
<head>
    <title>Llama 3 Chatbot</title>
    <style>
        .chat-container { max-width: 800px; margin: 0 auto; padding: 20px; }
        .message { margin: 10px 0; padding: 10px; border-radius: 10px; }
        .user { background: #007bff; color: white; text-align: right; }
        .bot { background: #f8f9fa; color: black; }
        #chat-messages { height: 400px; overflow-y: auto; border: 1px solid #ddd; padding: 10px; }
        #message-input { width: 100%; padding: 10px; margin-top: 10px; }
    </style>
</head>
<body>
    <div class="chat-container">
        <h1>Llama 3 Chatbot</h1>
        <div id="chat-messages"></div>
        <input type="text" id="message-input" placeholder="Type your message...">
    </div>

    <script>
        const chatMessages = document.getElementById('chat-messages');
        const messageInput = document.getElementById('message-input');

        function addMessage(content, isUser) {
            const messageDiv = document.createElement('div');
            messageDiv.className = `message ${isUser ? 'user' : 'bot'}`;
            messageDiv.textContent = content;
            chatMessages.appendChild(messageDiv);
            chatMessages.scrollTop = chatMessages.scrollHeight;
        }

        async function sendMessage() {
            const message = messageInput.value.trim();
            if (!message) return;

            addMessage(message, true);
            messageInput.value = '';

            try {
                const response = await fetch('/chat', {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({ message: message })
                });
                const data = await response.json();
                addMessage(data.response, false);
            } catch (error) {
                addMessage('Error: Could not get response', false);
            }
        }

        messageInput.addEventListener('keypress', (e) => {
            if (e.key === 'Enter') sendMessage();
        });
    </script>
</body>
</html>

Frequently Asked Questions

Can I use Llama 3 commercially?

Yes, with restrictions. Llama 3 uses Meta’s custom license that allows commercial use for most applications, but excludes companies with 700M+ monthly active users. Check the official license for specific terms.

Is Llama 3 truly open source?

Open weights, not fully open source. Llama 3 provides model weights and inference code but doesn’t include training code or data. This is “open weights” rather than fully open source.

How much RAM do I need for Llama 3 8B?

Minimum 16GB, recommended 32GB. The model itself requires about 6GB with Q4 quantization, but you need additional RAM for:

  • Operating system (2-4GB)
  • Other applications (4-8GB)
  • Buffer for context processing (4-8GB)
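
Putting those numbers together gives a rough budget (a sketch using the midpoints above, not a precise measurement):

# Rough RAM budget for running Llama 3 8B (Q4) comfortably, in GB.
budget = {
    "model (Q4 quantization)": 6,
    "operating system": 3,      # 2-4GB
    "other applications": 6,    # 4-8GB
    "context processing": 6,    # 4-8GB
}
print(sum(budget.values()), "GB total")  # ~21GB: tight on 16GB, comfortable on 32GB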

Can I run Llama 3 without a GPU?

Yes, but slower. CPU-only inference works but is 5-10x slower than GPU. Modern CPUs with 16+ cores can achieve reasonable performance for casual use.

What’s the difference between Llama 3 and Llama 3.1?

Llama 3.1 improvements:

  • Longer context window (128K tokens, up from 8K)
  • Better instruction following
  • Improved code generation
  • Enhanced reasoning capabilities
  • Better multilingual support
  • Reduced hallucination

How do I fine-tune Llama 3?

Use QLoRA for consumer hardware:

# Example QLoRA fine-tuning setup: 4-bit base model + LoRA adapters
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Configure LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

Conclusion

Running Llama 3 8B locally opens up powerful AI capabilities without cloud dependencies. Whether you choose Ollama for simplicity or llama.cpp for maximum performance, you now have the tools to:

  • Run state-of-the-art AI on your hardware
  • Build custom applications with local AI
  • Optimize performance for your specific setup
  • Understand when Llama 3 beats alternatives

Next Steps:

  1. Try it yourself: Start with Ollama for the easiest setup
  2. Optimize performance: Experiment with different quantization levels
  3. Build something: Create a chatbot, coding assistant, or content generator
  4. Share your results: Document your performance benchmarks and use cases

Ready to get started? Run ollama run llama3 and begin exploring the future of local AI today!
