Run 70B models
on a MacBook
Squish compresses model weights into memory-mapped tensors that load in milliseconds — served through a fully OpenAI-compatible API on Apple Silicon. No GPU required.
Install via Homebrew
brew install squishai/squish/squish
Pull a pre-squished model (18 GB, < 2s)
squish pull llama3.3:70b
✓ Pulled llama3.3:70b in 1.4s
Start the OpenAI-compatible server
squish serve &
→ http://localhost:11435
Or chat interactively
squish run llama3.3:70b
You: Hello!
Squish: Hi! How can I help
Three steps to local AI
Squish handles everything from compression to serving. You just pull and run.
One Homebrew or pip command. No Docker, no CUDA drivers, no Python environment wrestling.
Stream any pre-squished model from HuggingFace. Weights arrive as memory-mapped INT8 tensors — ready to load instantly.
Chat in the terminal with squish run, or start a full OpenAI-compatible API with squish serve.
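The instant-load claim comes down to mmap(2): instead of parsing and decoding a serialized weights file, the OS maps the raw tensor bytes and pages them in on demand. A minimal sketch of the idea with numpy.memmap — illustrative only, not Squish's actual on-disk format:

```python
import numpy as np

# Toy "weight shard": raw INT8 bytes on disk, no container format.
weights = np.arange(-4, 4, dtype=np.int8)
weights.tofile("shard.int8")

# "Loading" just maps the file into the address space; nothing is
# copied or decoded, and pages are faulted in lazily on first touch.
mapped = np.memmap("shard.int8", dtype=np.int8, mode="r")

print(mapped.dtype)                     # → int8
print(int(mapped[0]), int(mapped[-1]))  # → -4 3
```

Because no decode pass is needed, the time to "load" is independent of model size — which is why a 70B model can be ready in well under the time a format like GGUF takes to parse.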
Built for speed at every layer
From storage format to HTTP serving, every decision in Squish is optimised for Apple Silicon performance.
Memory-mapped weights bypass all decode overhead. 70B models are ready in under 2 seconds — every time.
mmap → zero decode
Works with LangChain, LlamaIndex, the OpenAI SDK, and any tool that speaks /v1/chat/completions.
Process multiple requests in parallel in a single call, something Ollama and LM Studio simply don't offer.
batch: [req1, req2 …]
Two quantisation tiers: INT8 for near-lossless accuracy; INT4 for maximum density on 16 GB Macs.
squish push model:8b
Pull, run, serve, quantise, push, list, remove — composable commands for every workflow.
squish pull llama3.3:70b
Pre-squished models hosted on HuggingFace. Pull any model directly — no manual conversion needed.
hf://squish-community/…
Why choose Squish?
See how Squish stacks up against the most popular local inference tools.
| Feature | Ollama | LM Studio | Squish ✦ |
|---|---|---|---|
| Cold start (70B) | ~30 s | ~20 s | < 2 s |
| RAM for 70B | ~40 GB | ~40 GB | ~18 GB |
| OpenAI API | ✓ | ✓ | ✓ |
| Batch requests | ✗ | ✗ | ✓ |
| Pre-compressed weights | ✗ | ✗ | ✓ HuggingFace |
| Zero-copy mmap | ✗ | ✗ | ✓ |
| Weight format | GGUF | GGUF | INT8 mmap |
| Platform | macOS / Linux | macOS / Windows | macOS (M1–M5) |
Up and running in 30 seconds
macOS via Homebrew (recommended)
brew install squishai/squish/squish
✓ squish 9.0.0 installed
Or from PyPI (Python 3.10+)
pip install squish
Verify
squish --version
squish 9.0.0
Browse available models
squish search llama
llama3.3:70b 18.2 GB INT8 ★ popular
llama3.2:3b 1.5 GB INT8
Pull a model
squish pull llama3.3:70b
████████████████████ 100% — 18.2 GB
✓ Pulled in 1.4s — ready
List local models
squish ls
llama3.3:70b 18.2 GB INT8 ✓ loaded
Interactive REPL
squish run llama3.3:70b
Loading model… 1.4s
You: Explain quantum entanglement like I'm 12
Squish: Imagine two magic coins that always
land on opposite sides, no matter how far apart…
Pass a system prompt
squish run llama3.3:70b --system "You are a pirate"
Start the server
squish serve &
→ Listening on http://localhost:11435
Query exactly like OpenAI
curl http://localhost:11435/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"llama3.3:70b",
"messages":[{"role":"user","content":"Hello!"}]}'
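Squish also advertises batched requests in a single call. The wire format isn't spelled out above, so the payload shape below is a guess for illustration — the `batch` field name and structure are assumptions, not documented Squish API:

```python
import json

# Hypothetical batched request body: the "batch" field and its shape
# are assumptions for illustration, not documented Squish API.
body = {
    "model": "llama3.3:70b",
    "batch": [
        {"messages": [{"role": "user", "content": "Hello!"}]},
        {"messages": [{"role": "user", "content": "Summarise mmap in one line."}]},
    ],
}

payload = json.dumps(body)
print(len(json.loads(payload)["batch"]))  # → 2
```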
Works with the OpenAI Python SDK too
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="x")
resp = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Hi"}],
)
print(resp.choices[0].message.content)
Quantise any HuggingFace model to INT8
squish push meta-llama/Llama-3.3-70B-Instruct --bits 8
Downloading weights… 70B params
Quantising to INT8…
✓ Pushed to hf://squish-community/llama3.3-70b-int8
Or INT4 for half the size
squish push meta-llama/Llama-3.3-70B-Instruct --bits 4
✓ Pushed to hf://squish-community/llama3.3-70b-int4
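squish push hides the conversion, but symmetric INT8 quantisation generally comes down to one scale per tensor: q = round(w / scale) with scale = max|w| / 127. A toy sketch of the arithmetic — not Squish's implementation:

```python
import numpy as np

def quantise_int8(w: np.ndarray):
    """Symmetric per-tensor INT8: q = round(w / scale), scale = max|w| / 127.
    Assumes w is not all zeros."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([-1.27, 0.0, 0.5, 1.27], dtype=np.float32)
q, scale = quantise_int8(w)
print(q.tolist())  # → [-127, 0, 50, 127]
# Round-trip error is bounded by the scale (half a quantisation step, roughly):
print(bool(np.max(np.abs(dequantise(q, scale) - w)) < scale))  # → True
```

INT4 works the same way with a ±7 range and two weights packed per byte, which is why it halves the file size again.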
Join the Squish community
Chat, contribute, and share models with people running local AI on Apple Silicon.
Ready to squish your models?
Install in 30 seconds and run 70B models on your MacBook today. Free for personal use, open-source on Jan 1, 2030.