// the find

abacusai/Long-Context

★ 604 · Python · Apache-2.0 · updated Nov 2023

This repository contains code and tooling for the Abacus.AI LLM Context Expansion project. Also included are evaluation scripts and benchmark tasks that evaluate a model’s information retrieval capabilities with context expansion. We also include key experimental results and instructions for reproducing and building on them.

Research code and benchmarks from Abacus.AI's 2023 experiments extending Llama's context window beyond 2048 tokens via RoPE linear scaling and fine-tuning. The main contribution is the Giraffe-13b model (scale=16) and the WikiQA evaluation datasets. This is a research artifact, not a library you'd integrate into a project.
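
For readers unfamiliar with the technique, here is a minimal sketch of what RoPE linear scaling does to position indices. It is not the repository's code: the function name `rope_angles`, the head dimension of 128, and the base of 10000 are illustrative assumptions, and only the 2048-token pretraining length and the scale of 16 come from the description above.

```python
import math

def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angles for one token position under linearly scaled RoPE.

    scale=1.0 is plain RoPE. scale=k divides the position index by k, so
    k times as many tokens fit inside the position range seen in pretraining.
    (Hypothetical helper for illustration, not the repo's implementation.)
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    scaled_position = position / scale
    # One angle per pair of embedding dimensions.
    return [scaled_position * base ** (-2 * i / dim) for i in range(dim // 2)]

PRETRAIN_CONTEXT = 2048  # Llama's pretraining context length
SCALE = 16               # scale factor of the released Giraffe-13b model

# Nominal reach of linear scaling: pretraining length times the scale factor.
nominal_context = PRETRAIN_CONTEXT * SCALE

# The last nominal position, once scaled, still falls inside [0, 2048).
last_angle = rope_angles(nominal_context - 1, dim=128, scale=SCALE)[0]
```

With these numbers `nominal_context` is 32768, which is where the "32k" figure for scale=16 comes from. Fine-tuning is still needed because the model never saw fractional positions during pretraining.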

The WikiQA benchmark design is thoughtful: altering numeric answers so a model cannot score by reciting memorized facts is the right move, and something many evals skip. The zero-shot scale generalization finding (train at scale 4, evaluate at scale 8, and still get non-zero accuracy) is a concrete and interesting empirical result. Both datasets are released on Hugging Face, so the evaluation methodology is actually reusable. The write-up honestly reports where approaches failed (XPOS never converged) rather than cherry-picking wins.

The last commit was in November 2023, and RoPE scaling has moved fast since then: YaRN and LongRoPE have appeared, and most major models now ship 128k+ context natively, so the practical relevance of extending Llama-13b to 16k is close to zero today. The repo is training scripts and notebooks, not a usable package: no pip install, no clean API, and you should expect friction getting anything running. Presence accuracy (substring match) is a weak eval metric that will over-count partial answers and miss paraphrases. The scale=16 model caps out around 16k tokens in practice, even though the math (2048 × 16) suggests 32k, and the README admits they don't have a fix for that.
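
To make the metric complaint concrete, here is a rough sketch of a substring-match score. The function `presence_accuracy`, the case-insensitive comparison, and the sample strings are all assumptions made for illustration; the repository's actual scoring code may normalize text differently.

```python
def presence_accuracy(predictions, references):
    """Fraction of predictions that contain the reference answer as a substring.

    Hypothetical helper: lowercasing and whitespace stripping are assumptions
    made here, not confirmed details of the repo's metric.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        ref.strip().lower() in pred.lower()
        for pred, ref in zip(predictions, references)
    )
    return hits / len(predictions)

# Invented examples, one per failure mode.
predictions = [
    "The committee had 12 members.",   # correct answer, counted
    "The committee had 120 members.",  # wrong answer, still counted ("12" is inside "120")
    "The committee had twelve members.",  # correct paraphrase, not counted
]
references = ["12", "12", "12"]

score = presence_accuracy(predictions, references)
```

Here two of three predictions count as hits, but one hit is a wrong answer and the one miss is a correct answer, so the score says little about how many answers were right.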
