AI & ML interests

Text Generation & Chat Assistants; Model Compression & Quantization (Q4/Q6/Q8, gs32); Inference & Serving (on-prem, low-latency); RAG / Retrieval; Agents & Tool Use; Distillation / LoRA / Fine-tuning

Recent Activity

Halley AI on Hugging Face

High-quality, Apple-Silicon–optimized MLX builds, tools, and evals — focused on practical, on-prem inference for small teams.

We publish Mixture-of-Experts (MoE) models and MLX quantizations tuned for M-series Macs (Metal + unified memory).
Target use: fast, reliable interactive chat and light batch workloads.


🚀 Featured models

gpt-oss-20b (MLX)

Repo Bits/GS Footprint Notes
halley-ai/gpt-oss-20b-MLX-5bit-gs32 Q5 / 32 ~15.8 GB Small drop vs 6-bit (~3–6% PPL); “fits‑24GB” unified memory.
halley-ai/gpt-oss-20b-MLX-6bit-gs32 Q6 / 32 ~18.4 GB Best of the group; strong quality/footprint tradeoff.

gpt-oss-120b (MLX)

Repo Bits/GS Memory Notes
halley-ai/gpt-oss-120b-MLX-8bit-gs32 Q8 / 32 ~63.42 GB Reference int8; stable and simple to use.
halley-ai/gpt-oss-120b-MLX-bf16 bf16 ~65.28 GB Non-quantized reference for evaluation/ground truth.

Format: MLX (not GGUF). For Linux/Windows or non-MLX stacks, use a GGUF build with llama.cpp.

datasets 0

None public yet