# Beyond Retrieval: A Multitask Benchmark and Model for Code Search

Page: https://stenobird.com/podcast/daily-paper-cast-7079649/beyond-retrieval-a-multitask-benchmark-and-model-for-code-search
Text version: https://stenobird.com/podcast/daily-paper-cast-7079649/beyond-retrieval-a-multitask-benchmark-and-model-for-code-search.md
Podcast: [Daily Paper Cast](https://stenobird.com/podcast/daily-paper-cast-7079649)
Published: 2026-05-12T04:02:02+00:00
Episode link: https://share.transistor.fm/s/7e0e20ae
Audio file: https://media.transistor.fm/7e0e20ae/1a6fdd80.mp3
Processing state: not_requested
JSON: https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/beyond-retrieval-a-multitask-benchmark-and-model-for-code-search
Duration seconds: 1274

## Resource

🤗 Upvotes: 22 | cs.SE, cs.AI

Authors: Siqiao Xue, Zihan Liao, Jin Qin, Ziyin Zhang, Yixiang Mu, Fan Zhou, Hang Yu

Title: Beyond Retrieval: A Multitask Benchmark and Model for Code Search

Arxiv: http://arxiv.org/abs/2605.04615v2

Abstract: Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce CoREB, a contamination-limited, multitask **co**de **r**etrieval and r**e**ranking **b**enchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. CoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: (1) code-specialised embeddings dominate code-to-code retrieval (~2× over general encoders), yet no single model wins all three tasks; (2) short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; (3) off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline net-positive across all tasks; (4) our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks. The data and model are released.
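The abstract reports results as nDCG@10 over graded relevance judgments. As a rough illustration of that metric, the sketch below computes nDCG@k from a list of graded labels in ranked order. It is not the paper's evaluation code: the function names, the linear gain (the label itself instead of `2**rel - 1`), and the log2 rank discount are assumptions chosen because they are common defaults.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results.

    Assumption: linear gain (the graded label itself) with a log2 discount.
    """
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top hit divides by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    all_relevances: Optional[Sequence[float]] = None,
    k: int = 10,
) -> float:
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: graded labels of the returned items, in ranked order.
    all_relevances: labels of every judged item for the query; if omitted,
        the ideal ranking is built from the returned items only.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant item exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

With graded labels, a ranking that places a partially relevant item above a fully relevant one scores below 1.0, a distinction that binary relevance cannot express.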

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/beyond-retrieval-a-multitask-benchmark-and-model-for-code-search/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/daily-paper-cast-7079649/beyond-retrieval-a-multitask-benchmark-and-model-for-code-search.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
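The two actions above are plain HTTP calls, so they can be sketched with the Python standard library alone. The URLs and methods come from this page; everything else is an assumption, including the helper names, the empty POST body, the absence of authentication headers, and the handling of the response as raw bytes, since the page does not document a request or response schema.

```python
import urllib.request
from typing import Tuple

BASE = "https://stenobird.com"
PODCAST = "daily-paper-cast-7079649"
EPISODE = "beyond-retrieval-a-multitask-benchmark-and-model-for-code-search"


def build_transcript_request() -> urllib.request.Request:
    """Build the request_transcript call (POST).

    Assumption: the endpoint accepts an empty body; the page lists no payload.
    """
    url = f"{BASE}/v1/public/podcasts/{PODCAST}/episodes/{EPISODE}/transcription-requests"
    return urllib.request.Request(url, data=b"", method="POST")


def build_markdown_request() -> urllib.request.Request:
    """Build the read_markdown call (GET) for the Markdown view of the episode."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return urllib.request.Request(url, method="GET")


def send(req: urllib.request.Request, timeout: float = 30.0) -> Tuple[int, bytes]:
    """Send a prepared request and return (HTTP status, raw body bytes)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read()


# Hypothetical usage (performs network I/O, so it is left commented out):
# status, body = send(build_transcript_request())
# status, markdown = send(build_markdown_request())
```

Because the page describes `request_transcript` as idempotent, repeating the POST for the same episode should be safe, though the sketch does not rely on that.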

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.