Claude Code on Ollama: How to Run a Local Coding Agent Without Burning API Credits

How to Run Claude Code on Ollama

Install Ollama via Homebrew or the official install script and start the server.
Pull a coding model such as qwen2.5-coder:14b with ollama pull.
Verify Ollama’s OpenAI-compatible endpoint responds at localhost:11434/v1.
Install Claude Code globally via npm install -g @anthropic-ai/claude-code.
Unset any existing ANTHROPIC_API_KEY to prevent accidental API billing.
Export environment variables pointing Claude Code to the local Ollama endpoint.
Launch Claude Code in your project directory and confirm the local model name appears.
Confirm local routing by checking for active connections to port 11434 during a session.

Running Claude Code against Anthropic’s API gets expensive fast. Run Claude Code against a local model through Ollama and you pay zero marginal cost per query. This tutorial walks through the complete setup, from installing Ollama and pulling a suitable coding model, to configuring Claude Code’s environment variables, to running real coding tasks against a React and Node.js project.

Why Your Claude Code API Bill Is a Problem

Running Claude Code against Anthropic’s API gets expensive fast. Developers on the Anthropic subreddit and various forums have reported spending between $100 and $200 in a single day of heavy agentic coding sessions. One widely cited, self-reported community account described burning through $175 in just four hours while refactoring a medium-sized codebase (results will vary significantly by task type and codebase size). Even conservative usage patterns, involving periodic prompts for code reviews, test generation, and debugging, can easily generate monthly bills exceeding $500 according to similar anecdotal reports. The token-intensive nature of agentic workflows, where Claude Code reads entire files, reasons across multiple steps, and writes back changes, compounds the cost far beyond what a single chat-style API call would.

Run Claude Code against a local model through Ollama and you pay zero marginal cost per query. The model runs on hardware already sitting on the desk.

This tutorial walks through the complete setup, from installing Ollama and pulling a suitable coding model, to configuring Claude Code’s environment variables, to running real coding tasks against a React and Node.js project. The target reader is a developer with intermediate familiarity with CLI tools, Node.js, and local development environments.

Claude Code version compatibility: Claude Code is under rapid development and its configuration interface, including supported environment variables, may change between releases. This guide documents one approach to local model routing via OpenAI-compatible endpoints. After installation, run claude --version and consult Anthropic’s current documentation or claude --help to confirm the exact environment variable names supported by your installed version. If variable names have changed, adapt the instructions accordingly.

What Is Claude Code and Why Go Local?

Claude Code in 60 Seconds

Claude Code is Anthropic’s agentic command-line coding tool. Unlike GitHub Copilot, which operates primarily as an inline autocomplete engine, or Cursor, which embeds AI within a custom IDE fork, Claude Code functions as a standalone CLI agent. It reads project files, reasons about codebases, writes and edits code across multiple files, runs shell commands, and iterates on its own output. Its default operating mode requires an Anthropic API key, routing all requests to Claude Sonnet 4 or Claude Opus, with costs determined by token consumption. A typical multi-step agentic task can consume tens of thousands of tokens per interaction.

The Case for Local Models

Running Claude Code against a local model solves three problems. Privacy and data sovereignty come first: source code never leaves the developer’s machine, which matters for proprietary codebases and organizations with strict data handling policies. You also eliminate per-query costs after the one-time hardware investment. And the setup works without an internet connection, so you keep working when connectivity drops.

The trade-offs deserve honest acknowledgment. Local models, even the best open-weight coding models in the 7B to 16B parameter range, don’t match Claude Sonnet 4 or Opus in complex multi-file reasoning, nuanced architectural decisions, or large-context understanding. For simple tasks like boilerplate generation, refactoring, and test scaffolding, local models often produce usable output on the first attempt for single-file edits. For tasks requiring deep contextual reasoning across thousands of lines, the quality gap remains significant.

Understanding the Architecture: Claude Code + Ollama + OpenAI-Compatible APIs

How the Pieces Fit Together

Claude Code supports third-party model providers through OpenAI-compatible API endpoints. This is the mechanism that makes local usage possible. Ollama, a local model server, exposes exactly such an endpoint at localhost:11434/v1. When you configure the right environment variables, Claude Code sends its requests to this local endpoint instead of Anthropic’s servers.

The request flow is straightforward:

Claude Code CLI  →  http://localhost:11434/v1/chat/completions  →  Ollama Server  →  Local LLM (e.g., qwen2.5-coder:14b)
     [prompt]           [OpenAI-compatible API]                      [inference]         [response]

Claude Code constructs its prompts and tool-use payloads in the OpenAI chat completions format. Ollama receives these, runs inference on the specified local model, and returns the completion. From Claude Code’s perspective, it talks to an OpenAI-compatible provider. From the model’s perspective, it handles standard chat completion requests.
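
The request flow above can be sketched in a few lines of JavaScript. This is a minimal illustration of the kind of OpenAI-style chat completions request that Ollama’s endpoint accepts, not a reproduction of what Claude Code sends internally. The helper name buildChatRequest is an assumption made for this example, and nothing is sent over the network.

```javascript
// Hypothetical helper (not part of Claude Code or Ollama): builds an
// OpenAI-style chat completions request aimed at Ollama's local endpoint.
const OLLAMA_BASE_URL = 'http://localhost:11434/v1';

function buildChatRequest(model, prompt) {
  return {
    url: `${OLLAMA_BASE_URL}/chat/completions`,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Ollama does not check the key; a placeholder satisfies OpenAI-style clients.
        Authorization: 'Bearer not-a-real-key-local-ollama-only',
      },
      body: JSON.stringify({
        model,
        stream: false,
        messages: [{ role: 'user', content: prompt }],
      }),
    },
  };
}

const request = buildChatRequest(
  'qwen2.5-coder:14b',
  'Write a hello world function in JavaScript'
);
console.log(request.url);
console.log(request.options.body);
```

With Ollama running, passing request.url and request.options to the built-in fetch in Node.js 18 or later would send the request to the local server.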

Prerequisites and System Requirements

Hardware Considerations

Local LLM inference is memory-bound. The RAM figures below refer to available (free) RAM, not total installed RAM. For 7B parameter models at Q4 quantization, you need at least 16GB of available RAM. Running 13B or 14B parameter models comfortably requires 32GB or more, and models with 30B+ parameters typically demand 64GB of available RAM or a GPU with substantial VRAM. Higher quantization levels (e.g., Q8) roughly double the RAM requirement compared to Q4 variants.
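
To see why Q8 roughly doubles the requirement, a back-of-the-envelope estimate of the weight size helps. The sketch below assumes nominal values of 4 and 8 bits per weight for Q4 and Q8. Real quantized files differ somewhat, and the function name is made up for this example. It covers the weights only; the context cache, the operating system, and other applications need additional headroom, which is why the recommended RAM figures above are much larger.

```javascript
// Rough weight-only estimate: parameters x bits per weight / 8 = bytes.
// Assumption: nominal 4 and 8 bits per weight for Q4 and Q8 quantization.
function estimateWeightGB(paramsBillions, bitsPerWeight) {
  // Billions of parameters times bytes per parameter gives gigabytes (10^9 bytes).
  return (paramsBillions * bitsPerWeight) / 8;
}

for (const params of [7, 14, 30]) {
  const q4 = estimateWeightGB(params, 4);
  const q8 = estimateWeightGB(params, 8);
  console.log(`${params}B parameters: ~${q4} GB at Q4, ~${q8} GB at Q8`);
}
```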

For GPU acceleration, Ollama supports NVIDIA GPUs via CUDA, Apple Silicon via Metal (automatic on macOS), and AMD GPUs via ROCm on Linux. Disk space requirements vary by model: expect roughly 4GB to 10GB per quantized model file for the 8B to 16B models discussed here.

Software Requirements

The setup requires Node.js 18 or later (with npm), Ollama installed and running as a local server, and the Claude Code CLI installed globally via npm.

Step 1: Install and Configure Ollama

Installing Ollama

On macOS and Linux, Ollama installs with a single command. Windows users can download the installer from the Ollama website.

# macOS (Homebrew)
brew install ollama

# Linux (official install script)
curl -fsSL https://ollama.com/install.sh | sh

# Verify the installation
ollama --version

# Start the server if it is not already running
ollama serve

On macOS, Ollama typically runs as a background service after a Homebrew install (start it with brew services start ollama if it is not already running). On Linux, ollama serve starts the server process. Verify it is running by checking that port 11434 is listening.

Pulling the Right Model

Not all models handle code generation equally. The following models are well-suited for coding tasks through Claude Code:

For the best balance of quality and resource usage, pull qwen2.5-coder:14b. It handles multi-file edits in Python, TypeScript, and Go with fewer syntax errors than other models in this parameter range.
deepseek-coder-v2:16b generates syntactically valid Python and JavaScript in single-file tasks (performance varies by task; evaluate against your own workload).
Meta’s codellama:13b is a purpose-built coding model released in 2023. It is based on the older Llama 2 architecture, so the newer alternatives above generally produce better results.
When RAM is tight, llama3.1:8b offers a lighter-weight general-purpose option.

Model choice directly affects output quality. Purpose-built coding models like Qwen 2.5 Coder produce noticeably better structured code, handle edge cases more reliably, and follow coding conventions more consistently than general-purpose models of equal size.

# Download the model weights
ollama pull qwen2.5-coder:14b

# Confirm the model is available locally
ollama list

The ollama list command should show the model name, size, and modification date, confirming the weights are downloaded and ready.

Verifying the Local API

Before configuring Claude Code, confirm that Ollama’s OpenAI-compatible endpoint is responding:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer not-a-real-key-local-ollama-only" \
  -d '{
    "model": "qwen2.5-coder:14b",
    "stream": false,
    "messages": [{"role": "user", "content": "Write a hello world function in JavaScript"}]
  }'

A successful response returns a single JSON object containing the model’s completion. If this command fails with “connection refused,” Ollama is not running. If it returns a model-not-found error, the model name doesn’t match what was pulled.
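
When scripting this check, it helps to pull the generated text out of the JSON object rather than reading it by eye. The sketch below assumes the usual OpenAI-style response shape, with the text under choices[0].message.content and failures reported under an error field. The helper name and the sample object are illustrative, not captured output.

```javascript
// Hypothetical helper: extracts the generated text from an OpenAI-style
// chat completions response, or throws with a readable reason.
function extractCompletion(responseJson) {
  if (responseJson && responseJson.error) {
    const detail = responseJson.error.message ?? JSON.stringify(responseJson.error);
    throw new Error(`Endpoint returned an error: ${detail}`);
  }
  const content = responseJson?.choices?.[0]?.message?.content;
  if (typeof content !== 'string') {
    throw new Error('Response has no choices[0].message.content field');
  }
  return content;
}

// Illustrative object shaped like a non-streaming response (values are made up).
const sampleResponse = {
  model: 'qwen2.5-coder:14b',
  choices: [
    {
      index: 0,
      message: { role: 'assistant', content: 'function hello() {}' },
      finish_reason: 'stop',
    },
  ],
};

console.log(extractCompletion(sampleResponse));
```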

Step 2: Install and Configure Claude Code for Local Use

Installing Claude Code CLI

Install Claude Code globally through npm:

npm install -g @anthropic-ai/claude-code

# Verify the installation
claude --version

This installs the claude command globally. The CLI requires Node.js 18 or later. Note the version number displayed: the environment variables described below are version-dependent. Run claude --help to confirm the supported configuration options for your version.

Configuring Claude Code to Use Ollama

First: if you have ANTHROPIC_API_KEY set in your environment, unset it. Leaving it set may cause Claude Code to route requests to Anthropic’s API instead of Ollama, silently incurring costs.

unset ANTHROPIC_API_KEY

You configure Claude Code’s third-party provider support with environment variables. The exact variable names depend on your Claude Code version. Run claude --help to confirm the correct names. The variables below represent one documented configuration approach; verify them against the current Anthropic documentation for your installed version:

export OPENAI_API_KEY="not-a-real-key-local-ollama-only"
export ANTHROPIC_BASE_URL="http://localhost:11434/v1"
export CLAUDE_CODE_USE_OPENAI=1
export CLAUDE_MODEL="qwen2.5-coder:14b"

Version-dependent variables: The variable names CLAUDE_CODE_USE_OPENAI, CLAUDE_MODEL, and the choice between ANTHROPIC_BASE_URL and OPENAI_BASE_URL may differ across Claude Code releases. Confirm them with claude --help or the Anthropic documentation for your version. If the variables are incorrect, Claude Code may silently fall back to the Anthropic API, incurring costs.

You set OPENAI_API_KEY to a placeholder string because Ollama doesn’t require authentication, but Claude Code refuses to start without a non-empty key value. ANTHROPIC_BASE_URL points to the local Ollama server’s OpenAI-compatible API path. CLAUDE_CODE_USE_OPENAI signals Claude Code to use the OpenAI-compatible provider path rather than the Anthropic API. CLAUDE_MODEL specifies which Ollama model to use and must match the model name exactly as shown by ollama list, including the tag (e.g., :14b).
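
Because a misconfigured variable can fail silently, a small pre-flight check can catch the common mistakes before launching. The sketch below is a hypothetical helper, not part of Claude Code. It uses the variable names from this guide, which are version-dependent, so adjust them to whatever claude --help reports for your installation.

```javascript
// Hypothetical pre-flight check for the environment described in this guide.
// Assumption: the variable names match your Claude Code version.
function checkLocalEnv(env) {
  const problems = [];

  if (env.ANTHROPIC_API_KEY) {
    problems.push('ANTHROPIC_API_KEY is set; unset it to avoid accidental API billing.');
  }
  if (!env.OPENAI_API_KEY) {
    problems.push('OPENAI_API_KEY is empty; set a placeholder value.');
  }

  const baseUrl = env.ANTHROPIC_BASE_URL ?? '';
  let host = '';
  try {
    host = new URL(baseUrl).hostname;
  } catch {
    // Missing or malformed URL: host stays empty and is reported below.
  }
  if (host !== 'localhost' && host !== '127.0.0.1') {
    problems.push(`ANTHROPIC_BASE_URL does not point at this machine: "${baseUrl}".`);
  }

  // The model must include its tag, for example qwen2.5-coder:14b.
  if (!/^[^:\s]+:[^:\s]+$/.test(env.CLAUDE_MODEL ?? '')) {
    problems.push('CLAUDE_MODEL should have the form name:tag.');
  }
  return problems;
}

// An empty array means none of the checked problems were found.
console.log(checkLocalEnv(process.env));
```

Saved to a file and run with node in the shell you plan to launch Claude Code from, it prints a list of issues to fix first. An empty list does not prove local routing; the connection check later in this guide is still needed.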

For persistence, add these exports to ~/.bashrc, ~/.zshrc, or a project-level .env file. If using a project-level .env file, ensure it is listed in .gitignore to prevent accidental commits.

Windows users (PowerShell):

$env:OPENAI_API_KEY="not-a-real-key-local-ollama-only"
$env:ANTHROPIC_BASE_URL="http://localhost:11434/v1"
$env:CLAUDE_CODE_USE_OPENAI="1"
$env:CLAUDE_MODEL="qwen2.5-coder:14b"

Launching Claude Code in Local Mode

With the environment variables set, start Claude Code in any project directory:

cd /path/to/your/project
claude

On startup, Claude Code should display the configured model name (e.g., qwen2.5-coder:14b) rather than a Claude Sonnet or Opus identifier. This is an initial indicator that the configuration was applied, but displaying the model name alone doesn’t guarantee local routing: the configured variable value could be shown even if routing fails. To definitively confirm that requests reach Ollama, monitor connections during a session:

# Run in a second terminal while Claude Code is handling a prompt
lsof -i :11434 | grep ESTABLISHED

You should see an active TCP connection to 127.0.0.1:11434. If no connection is shown, requests may be going to Anthropic’s servers.

Step 3: Take It for a Spin with a React + Node.js Project

Scaffolding a Test Project

Create a minimal project that gives Claude Code real files to work with:

npm create vite@latest test-project -- --template react
cd test-project
npm install
npm install express

Add a minimal Express server at the project root. Because the Vite scaffold creates an ES module project ("type": "module" in package.json), the CommonJS require() syntax will not work by default. Either name the file server.cjs, or move the server into its own directory with a package.json that sets "type": "commonjs", or rewrite it using ES module import syntax. The example below uses the .cjs approach:

// server.cjs
const express = require('express');
const app = express();
const PORT = process.env.PORT ?? 3001;

app.use(express.json());

app.get('/', (req, res) => {
  res.json({ message: 'Server is running' });
});

const server = app.listen(PORT, () => {
  console.log(`Server listening on port ${PORT}`);
});

// Report a clear message if the port is taken, then exit.
server.on('error', (err) => {
  if (err.code === 'EADDRINUSE') {
    console.error(`Port ${PORT} is already in use. Set PORT env var to use a different port.`);
  } else {
    console.error('Server failed to start:', err);
  }
  process.exit(1);
});

This provides both a React frontend and a Node.js backend for Claude Code to operate on.

Running Real Coding Tasks

With Claude Code running in the project directory, issue a realistic prompt:

Add a /api/health endpoint to server.cjs that returns { status: "healthy", uptime: process.uptime() }
and create a React component called HealthStatus that fetches and displays this data.

With qwen2.5-coder:14b, expect output structured like this (your results will vary based on prompt phrasing and model state):

// server.cjs: new endpoint
app.get('/api/health', (req, res) => {
  res.json({
    status: 'healthy',
    uptime: process.uptime(),
    timestamp: new Date().toISOString()
  });
});

// HealthStatus component (new file)
import { useState, useEffect } from 'react';

const API_BASE = import.meta.env.VITE_API_BASE ?? 'http://localhost:3001';

function HealthStatus() {
  const [health, setHealth] = useState(null);
  const [loading, setLoading] = useState(true);
  const [error, setError] = useState(null);

  useEffect(() => {
    // Abort the request if the component unmounts before it completes.
    const controller = new AbortController();

    fetch(`${API_BASE}/api/health`, { signal: controller.signal })
      .then((res) => {
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return res.json();
      })
      .then((data) => {
        setHealth(data);
        setLoading(false);
      })
      .catch((err) => {
        if (err.name === 'AbortError') return;
        console.error('Failed to fetch health status:', err);
        setError(err.message);
        setLoading(false);
      });

    return () => controller.abort();
  }, []);

  if (loading) return <p>Loading health status...</p>;
  if (error) return <p>Error: {error}</p>;
  if (!health) return <p>Unable to reach server.</p>;

  return (
    <div>
      <h2>Server Health</h2>
      <p>Status: {health.status}</p>
      <p>Uptime: {Math.round(health.uptime)}s</p>
    </div>
  );
}

export default HealthStatus;

Note on fetch URLs: The React frontend runs on the Vite dev server (typically port 5173), while the Express backend runs on port 3001. The component above uses the VITE_API_BASE environment variable to configure the API origin, falling back to http://localhost:3001 for local development. Because the two servers are different origins, a direct fetch from the browser only succeeds if the Express server sends CORS headers (for example via the cors middleware). For production or containerised deployments, set VITE_API_BASE to the appropriate backend URL. Alternatively, configure a Vite proxy by adding server: { proxy: { '/api': 'http://localhost:3001' } } to vite.config.js and use relative fetch paths, which avoids the cross-origin request entirely.

Claude Code’s agentic capabilities mean it reads the existing server.cjs, identifies where to insert the new endpoint, writes the changes, creates the new component file, and can even update imports in App.jsx if prompted.

Evaluating Output Quality

Local models in the 7B to 14B range handle boilerplate code, CRUD endpoint generation, simple component creation, test scaffolding, and straightforward refactoring well. For single-endpoint handlers and isolated component files, they often produce usable output on the first attempt without manual correction.

Where local models fall short is in complex multi-file reasoning: tracing a bug across multiple interconnected modules, making architectural decisions that require understanding a full codebase’s patterns, or producing correct output when the context window fills up. Claude Sonnet 4 handles these scenarios with noticeably higher accuracy. For example, Sonnet can correctly trace cross-module type errors that qwen2.5-coder:14b misses after multiple attempts, and it maintains coherence across longer context windows.

Performance Tuning and Optimization

Ollama Configuration for Better Performance

Ollama exposes several environment variables and configuration options that affect inference speed:

# Allow two requests to be processed concurrently
export OLLAMA_NUM_PARALLEL=2

# Check whether the model's Modelfile sets a context length (num_ctx)
ollama show qwen2.5-coder:14b --modelfile | grep num_ctx

Setting OLLAMA_NUM_PARALLEL above 1 allows concurrent request handling, which matters less for single-user Claude Code sessions but helps if other tools share the same Ollama instance. Increasing the context length allows the model to reason over more code at once, but it also increases memory consumption; very long contexts can consume considerably more memory than the base model load.
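
For experimenting with context length outside Claude Code, Ollama’s native /api/chat endpoint accepts a per-request options object that includes num_ctx. The sketch below only builds such a request body. The helper name and the 8192 value are illustrative assumptions, and whether Claude Code passes this option through the OpenAI-compatible path is not something this guide establishes.

```javascript
// Hypothetical helper: builds a body for Ollama's native POST /api/chat
// endpoint with a per-request context length. No request is sent here.
function buildNativeChatBody(model, prompt, numCtx) {
  if (!Number.isInteger(numCtx) || numCtx <= 0) {
    throw new RangeError('numCtx must be a positive integer');
  }
  return {
    model,
    stream: false,
    messages: [{ role: 'user', content: prompt }],
    // Larger values let the model see more code but use more memory.
    options: { num_ctx: numCtx },
  };
}

const chatBody = buildNativeChatBody('qwen2.5-coder:14b', 'Summarize server.cjs', 8192);
console.log(JSON.stringify(chatBody, null, 2));
```

Posting this body to http://localhost:11434/api/chat with different num_ctx values while watching memory usage gives a feel for the trade-off on your own hardware.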

Choosing the Right Model for the Task

A practical strategy is to keep multiple models pulled and switch between them. Use a smaller model like llama3.1:8b for quick completions and simple edits where speed matters. Switch to qwen2.5-coder:14b or deepseek-coder-v2:16b for tasks requiring higher code quality. Switching models requires only changing the CLAUDE_MODEL environment variable (or the equivalent for your Claude Code version) and restarting Claude Code.

Full Implementation Checklist and Model Comparison Table

Setup Checklist

Install Ollama (brew install ollama or curl install script) and verify with ollama --version
Start Ollama server (ollama serve or brew services start ollama on macOS) and confirm port 11434 is listening
Pull a coding model (ollama pull qwen2.5-coder:14b) and verify with ollama list
Test the API endpoint with curl http://localhost:11434/v1/chat/completions (include "stream": false in the request body)
Install Claude Code (npm install -g @anthropic-ai/claude-code) and verify with claude --version
Unset ANTHROPIC_API_KEY if present (unset ANTHROPIC_API_KEY)
Check claude --help to confirm the correct environment variable names for your version
Set environment variables (OPENAI_API_KEY, ANTHROPIC_BASE_URL, CLAUDE_CODE_USE_OPENAI, CLAUDE_MODEL), adapting variable names if your version differs
Launch Claude Code in a project directory and confirm the model name in startup output
Run lsof -i :11434 (or netstat -ano | findstr :11434 on Windows) during a session to verify local routing
Run a test prompt and verify the response comes from the local model

Local Coding Model Comparison Table

Model	Size	Min. Free RAM (Q4)	Coding Quality*	Speed	Best For
`llama3.1:8b`	~4.7GB	16GB	Moderate	Fast	Quick completions, simple edits
`codellama:13b`	~7.4GB	32GB**	Good	Moderate	General code generation
`qwen2.5-coder:14b`	~8.9GB	32GB	Very Good	Moderate	Best overall for coding tasks
`deepseek-coder-v2:16b`	~9.1GB	32GB	Very Good	Moderate	Complex code generation
`codellama:34b`	~19GB	64GB	Excellent	Slow	Maximum local quality
`llama3.1:70b`	~40GB	64GB+	Excellent	Very Slow	Near-API quality (if hardware permits)

*Coding Quality ratings reflect informal single-file pass rates on HumanEval-style tasks. “Moderate” = frequent manual fixes needed; “Good” = occasional fixes; “Very Good” = first-attempt success on most single-file tasks; “Excellent” = consistent first-attempt success including multi-function files.

**16GB is the technical minimum for codellama:13b; 32GB is recommended for stable inference without swapping. Sizes and RAM figures assume Q4 quantization; Q8 quantization roughly doubles RAM requirements. Verify actual on-disk size with ollama list after pulling.

Best overall pick: qwen2.5-coder:14b offers the strongest balance of code generation quality, reasonable resource requirements, and practical inference speed for iterative development workflows.
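
The table can also be read as a simple lookup: given the free RAM on a machine, which models are worth pulling? The sketch below copies the minimum free RAM column from the table above. It treats the “64GB+” entry as 64 and uses the recommended 32GB for codellama:13b; both are simplifications, and the function name is made up for this example.

```javascript
// Minimum free RAM (GB, Q4 quantization) copied from the comparison table.
// Assumption: "64GB+" is treated as 64 for the purpose of this lookup.
const MODEL_REQUIREMENTS = [
  { name: 'llama3.1:8b', minFreeRamGB: 16 },
  { name: 'codellama:13b', minFreeRamGB: 32 },
  { name: 'qwen2.5-coder:14b', minFreeRamGB: 32 },
  { name: 'deepseek-coder-v2:16b', minFreeRamGB: 32 },
  { name: 'codellama:34b', minFreeRamGB: 64 },
  { name: 'llama3.1:70b', minFreeRamGB: 64 },
];

// Returns the names of all models whose stated minimum fits in freeRamGB.
function modelsThatFit(freeRamGB) {
  return MODEL_REQUIREMENTS
    .filter((model) => model.minFreeRamGB <= freeRamGB)
    .map((model) => model.name);
}

console.log(modelsThatFit(16));
console.log(modelsThatFit(32));
```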

Troubleshooting Common Issues

Connection Refused or Model Not Found

If Claude Code reports connection errors, verify that ollama serve is running and that http://localhost:11434 responds to requests. On macOS, check whether the Homebrew service is already running with brew services list; running ollama serve manually while the service is active causes a port conflict. A “model not found” error means the value in CLAUDE_MODEL doesn’t exactly match the model name shown by ollama list, including the tag (e.g., :14b).
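
The exact-match requirement is easy to check programmatically. Ollama’s GET /api/tags endpoint returns the pulled models as JSON under a models array, each with a name field. The sketch below is a hypothetical helper that compares a configured name against such a response, using a made-up sample object instead of a live call.

```javascript
// Hypothetical helper: compares a configured model name with the JSON
// returned by Ollama's GET /api/tags endpoint. Returns null when it matches.
function findModelProblem(tagsResponse, configuredName) {
  const names = (tagsResponse.models ?? []).map((model) => model.name);
  if (names.includes(configuredName)) return null;

  // Look for pulled models with the same base name but a different tag.
  const base = configuredName.split(':')[0];
  const nearMisses = names.filter((name) => name.split(':')[0] === base);
  return nearMisses.length > 0
    ? `"${configuredName}" is not pulled; did you mean ${nearMisses.join(' or ')}?`
    : `"${configuredName}" is not pulled; run ollama pull ${configuredName}.`;
}

// Illustrative sample shaped like an /api/tags response (other fields omitted).
const sampleTags = {
  models: [{ name: 'qwen2.5-coder:14b' }, { name: 'llama3.1:8b' }],
};

console.log(findModelProblem(sampleTags, 'qwen2.5-coder'));
```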

Slow Responses or Out-of-Memory Errors

If inference is unacceptably slow or the system runs out of memory, reduce the context window (via the Modelfile PARAMETER num_ctx or the per-request options field), switch to a smaller quantized model, or verify that GPU offloading is active. On NVIDIA systems, nvidia-smi confirms whether Ollama is using the GPU. On Apple Silicon, Metal acceleration is automatic.

Claude Code Ignoring Local Config

Environment variables can override one another in ways that cause routing errors. If you have an ANTHROPIC_API_KEY set in the shell environment or in a global configuration file, Claude Code may prioritize the Anthropic provider over the OpenAI-compatible path. Unset it (unset ANTHROPIC_API_KEY) before launching Claude Code in local mode. Additionally, verify that the environment variable names you are using match those supported by your installed Claude Code version by running claude --help.

Warning: If environment variables are misconfigured, Claude Code may silently route requests to Anthropic’s API, incurring unexpected costs. Always verify local routing by checking for active connections to localhost:11434 during your session.

When to Use Local vs. API: A Practical Framework

Use local models for iterative development, boilerplate generation, test writing, refactoring, and work on private or proprietary codebases where data must not leave the machine. Use the Anthropic API for complex architectural reasoning, large-context multi-file changes that exceed local model capabilities, and code that ships to production without additional human review.

The most practical approach is a hybrid one: default to local for the bulk of daily coding tasks and switch to the API selectively for heavy lifts. This pattern captures the majority of cost savings while preserving access to frontier model quality when it matters.

What Comes Next

This setup eliminates API costs for the majority of routine coding agent interactions. Developers who previously spent $100 or more per day on Anthropic API credits can reserve that spend for tasks that genuinely require frontier model capabilities. Actual savings depend on individual workflow composition and the ratio of local-suitable tasks to those requiring frontier models.

From here, the natural next steps are experimenting with additional models as the open-weight ecosystem evolves and creating task-specific Modelfile configurations tuned for particular programming languages or frameworks. Beyond that, you can integrate local Claude Code sessions into CI workflows for automated code review on private repositories.