Goodbye Copilot: Running Llama 3 & Qwen-Coder Locally on an RTX 3090

Data privacy in software development isn't just a preference—for many, it's a legal requirement. While GitHub Copilot is convenient, it requires sending your proprietary logic to a third-party server.
With the massive 24GB VRAM of the NVIDIA RTX 3090, we no longer need to compromise. We can run state-of-the-art models like Llama 3 and the specialized Qwen2.5-Coder (which currently rivals GPT-4 in coding benchmarks) entirely on our own hardware.
Why the RTX 3090?
The RTX 3090 is the "sweet spot" for local LLMs. Its 24GB of GDDR6X VRAM allows you to fit:
- Llama 3 (8B): Fits with room to spare, generating tokens near-instantly.
- Qwen2.5-Coder (32B): Using 4-bit or 8-bit quantization, providing a massive upgrade in logic and reasoning over smaller models.
Step 1: Setting up the Backend with Ollama
Ollama makes managing local models as easy as managing Docker containers. If you haven't installed it yet, head over to ollama.com.
Once installed, open your terminal and pull the models we need:
ollama pull llama3:8b
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:32b
Step 2: Integrating with VS Code (OpenCode / Continue)
To get the full "Copilot experience," you need a bridge between your IDE and Ollama. While there are many extensions, OpenCode (for agentic tasks) and Continue (for inline completions) are the current top-tier choices.
1. Install the Extensions
Search the VS Code Marketplace for:
- Continue: Best for the side-panel chat and "Apply to File" features.
- OpenCode: Best for "Agent Mode" where the AI can actually run terminal commands and read your whole directory.
2. Configure for your RTX 3090
With 24GB of VRAM, we don't need to settle for the tiny models. We will use Llama 3 (8B) for lightning-fast "Tab-Autocomplete" and Qwen-Coder (32B) for deep architectural questions.
Open your config.json (usually found in ~/.continue/config.json) and paste this optimized configuration:
{
  "models": [
    {
      "title": "Qwen-Coder 32B (RTX 3090)",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b",
      "contextLength": 32768
    },
    {
      "title": "Llama 3 8B",
      "provider": "ollama",
      "model": "llama3:8b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Tab Autocomplete",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  },
  "allowAnonymousTelemetry": false
}
3. Why the 3090 + Qwen-Coder is a Cheat Code
While many developers try to run smaller 7B or 14B models for speed, the RTX 3090 opens a different door: high-precision, large-scale coding intelligence.
The Qwen2.5-Coder 32B model is the current "sweet spot" for 24GB VRAM cards. Here is why this specific combination is so powerful:
- VRAM Efficiency: Using a Q4_K_M or Q5_K_M quantization, the 32B model occupies roughly 19GB to 22GB of VRAM. This leaves just enough room for a 32k context window—allowing the AI to "read" your entire small-to-medium project structure at once.
- Performance Benchmarks: On a 3090, you can expect generation speeds of ~35 to 45 tokens per second. For context, that is faster than the average human can read, making the "lag" of cloud-based APIs a thing of the past.
- Logic over Luck: Unlike the smaller 8B models (like base Llama 3), the 32B version of Qwen-Coder has significantly higher reasoning capabilities for complex tasks like multi-file refactoring and SQL optimization, where smaller models often "hallucinate" non-existent library functions.
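The numbers in the bullets above can be sanity-checked with quick integer arithmetic. A back-of-envelope sketch, assuming ~4.8 bits per weight for Q4_K_M and ~0.75 words per token (both rough rules of thumb, not exact figures):

```shell
# Rough VRAM estimate for the quantized weights (KV cache and overhead excluded).
PARAMS_B=32        # parameters, in billions
BITS_X10=48        # ~4.8 bits/weight for Q4_K_M, scaled by 10 for integer math
WEIGHTS_GB_X10=$(( PARAMS_B * BITS_X10 / 8 ))
echo "Quantized weights: ~$(( WEIGHTS_GB_X10 / 10 )).$(( WEIGHTS_GB_X10 % 10 )) GB"

# Throughput vs. reading speed: 40 tok/s at ~0.75 words per token.
TOK_PER_SEC=40
WORDS_PER_MIN=$(( TOK_PER_SEC * 3 * 60 / 4 ))
echo "Generation: ~${WORDS_PER_MIN} words/min (average reading is ~250 words/min)"
```

The KV cache for a 32k context adds a few more gigabytes on top of the ~19GB of weights, which is what pushes the real footprint toward the 22GB ceiling.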
4. Running OpenCode with Ollama
If you want to move beyond simple chat and let the AI actually interact with your local files, you can use OpenCode. Instead of messing with environment variables or manual JSON edits, you can now use a single command to link your local models to the agent.
The Launch Command
In your terminal, simply run:
ollama launch opencode
This command starts a guided setup where you can select the model you want to use. Since the RTX 3090 gives you 24GB of VRAM, I recommend selecting Qwen3-Coder or GLM-4.7-flash.
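If your Ollama version predates the launch command, OpenCode can still be pointed at the local server by hand via its custom-provider config. A sketch of ~/.config/opencode/opencode.json — treat the exact schema as an assumption and check the current OpenCode docs, since it changes between releases:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "qwen2.5-coder:32b": {
          "name": "Qwen2.5-Coder 32B (local)"
        }
      }
    }
  }
}
```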
Put it to work
Once launched, you can give it a task that targets your local workspace:
Prompt: "Explain the tech stack of this project."
Prompt: "Check the /src/components folder for any outdated prop types. Refactor them to use the new TypeScript interfaces in types.ts and let me know if any imports are missing."
Because the RTX 3090 has 24GB of VRAM, the agent can hold a significant amount of your code in its context while it works. You’ll see it scan your files, propose the diff, and wait for your approval—all without your data ever hitting a cloud server.
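Under the hood, both Continue and the agent talk to the same local HTTP endpoint (http://localhost:11434), so it is easy to confirm everything resolves to localhost. A sample request body for Ollama's /api/generate route that you can POST with curl; the num_ctx value mirrors the 32k context configured earlier:

```json
{
  "model": "qwen2.5-coder:32b",
  "prompt": "Write a function that reverses a linked list in TypeScript.",
  "stream": false,
  "options": {
    "num_ctx": 32768
  }
}
```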