Using an LLM to Pick Follow-Up Quiz Questions Without Letting It Run Wild on Cost

I was building a follow-up-quiz feature: a student takes a short quiz, the system figures out which topics they're weak on, and the next round of practice questions should target those weak spots instead of being random. The obvious approach is to ask an LLM "given this list of weak topics, pick the most relevant questions from the question bank." The obvious approach also breaks the moment the question bank has more than a couple hundred rows — you can't paste the whole bank into one prompt, and you can't call the LLM once per question either.

First attempt: one LLM call, whole bank

The naive version was a single prompt with the entire bank's questions and a list of weak topics, asking the model to return the IDs of the relevant ones. It worked for the demo data (a few dozen questions). It fell over with a realistic bank size — too many tokens, slow, and the model started missing obviously-relevant questions buried in the middle of a long list (the classic "lost in the middle" problem with long contexts).

The fix wasn't a smarter prompt. It was changing the shape of the problem: stop asking the LLM to search the whole bank in one shot, and instead feed it small batches, repeating until enough relevant questions are found or a hard cap is hit.

REVIEW_BATCH_SIZE = 30   # how many candidate questions go in a single LLM call
REVIEW_MAX_BATCHES = 3   # hard cap — at most 90 candidates get reviewed, ever

The batching loop pulls a random slice of unused questions from the bank, asks the LLM which ones in that slice match the weak topics, and keeps going until it has enough or runs out of batches:

async def select_followup_questions(weak_topics: list[str], target_count: int, bank: list[Question]) -> list[str]:
    excluded: set[str] = set()  # IDs already shown to the LLM during this call
    selected: list[str] = []
    batches_tried = 0

    while len(selected) < target_count and batches_tried < REVIEW_MAX_BATCHES:
        batch = pick_random_unused(bank, excluded, size=REVIEW_BATCH_SIZE)
        if not batch:
            break  # bank exhausted before the cap was reached
        batches_tried += 1
        matched = await ask_llm_which_match(weak_topics, batch)
        # keep only IDs that were really in this batch, and never exceed the quota
        hits = [q.id for q in batch if q.id in matched]
        selected += hits[: target_count - len(selected)]
        excluded |= {q.id for q in batch}

    return selected  # may be shorter than target_count; the caller fills the gap
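
The loop leans on helpers the post doesn't show. As a rough sketch, assuming `Question` is a simple record with a string `id` (the real model presumably has more fields), `pick_random_unused` could be as small as the following. The body is a guess at the behavior, not the original implementation:

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    # Hypothetical minimal shape; only `id` is needed by the helper.
    id: str
    text: str = ""


def pick_random_unused(bank: list[Question], excluded: set[str], size: int) -> list[Question]:
    """Return up to `size` random questions whose IDs are not in `excluded`."""
    candidates = [q for q in bank if q.id not in excluded]
    # random.sample raises ValueError if asked for more items than exist,
    # so clamp the sample size to what is actually available.
    return random.sample(candidates, min(size, len(candidates)))
```

An empty return value is what makes the `if not batch: break` branch fire once the bank is exhausted.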

Tip

The hard cap on batches matters more than it looks. Without REVIEW_MAX_BATCHES, a student with an obscure or oddly-worded weak topic — one that genuinely doesn't match many questions in the bank — turns into an unbounded loop of "ask the LLM again with a fresh batch," burning calls for diminishing returns. Decide upfront how many tries is "enough," and stop.

The part that actually matters: what happens when the LLM doesn't find enough

This is the detail that's easy to skip and expensive to skip: what do you return if, after the max batches, the LLM still hasn't found target_count relevant questions? The two bad answers are "return fewer questions than promised" (breaks the UI contract) and "keep calling the LLM until it finds enough" (the unbounded loop above). The actual answer is a non-AI fallback that fills the remainder with random unused questions: no model call, and a guaranteed count as long as the bank isn't empty.

import random

def fallback_fill(bank: list[Question], exclude: set[str], limit: int) -> list[Question]:
    if limit <= 0:
        return []  # guards against a negative slice below
    # prefer questions that are not in the exclusion set
    fresh = [q for q in bank if q.id not in exclude]
    random.shuffle(fresh)
    if len(fresh) >= limit:
        return fresh[:limit]
    # bank is smaller than the quiz size, or mostly exhausted: allow repeats
    # rather than returning an incomplete quiz
    picked = list(fresh)
    # next, reuse excluded questions, each at most once
    remaining = [q for q in bank if q.id in exclude]
    random.shuffle(remaining)
    picked.extend(remaining[: limit - len(picked)])
    # last resort: the whole bank is smaller than limit, so duplicate at random
    while len(picked) < limit and bank:
        picked.append(random.choice(bank))
    return picked
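
One way to wire the AI selection and the fallback together, sketched under assumptions: `build_next_round` and its `ai_select` parameter are hypothetical names, `ai_select` stands in for the batched LLM selection above, and the simplified `fill_random` only stands in for `fallback_fill` (it skips the allow-repeats branch). Treating a failed AI call like an empty result is also an assumption, not something the original code is known to do.

```python
import asyncio
import random
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass(frozen=True)
class Question:
    id: str  # hypothetical minimal shape


def fill_random(bank: list[Question], exclude: set[str], limit: int) -> list[str]:
    # Simplified stand-in for fallback_fill: random unused questions, no repeats.
    fresh = [q.id for q in bank if q.id not in exclude]
    random.shuffle(fresh)
    return fresh[: max(limit, 0)]


async def build_next_round(
    bank: list[Question],
    target_count: int,
    ai_select: Callable[[int], Awaitable[list[str]]],
) -> list[str]:
    """AI picks first; a non-AI fill tops the list up to target_count."""
    try:
        chosen = (await ai_select(target_count))[:target_count]
    except Exception:
        # Assumption: a failed AI call is treated the same as "found nothing".
        chosen = []
    missing = target_count - len(chosen)
    if missing > 0:
        # Exclude what the AI already picked so the fill adds no duplicates.
        chosen += fill_random(bank, exclude=set(chosen), limit=missing)
    return chosen
```

The caller always gets `target_count` IDs when the bank is large enough, whether the model found five matches, two, or none.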

Warning

Every AI-driven selection feature with a target count needs an answer to "what if the AI can't find enough." If you don't write that fallback explicitly, you'll discover the gap in production the first time a student's weak topic is too narrow for the question bank to satisfy — not in testing, where your sample data is too clean to expose it.

Don't block the response on the AI call

Grading a quiz and analyzing why the student got things wrong have very different costs: grading is fast (string comparisons, a tally); the AI analysis that derives weak topics and pre-selects next-round questions is not. Making the student wait for the slow part before they see their score buys nothing, so the slow part runs as a background job after the response is already sent:

Figure: timeline with the synchronous grade-and-respond path on top and, below it, the background AI analysis and follow-up question selection running after the response has been sent.

async def submit_quiz(...) -> QuizAttempt:
    # objective grading happens synchronously — it's just arithmetic
    score, wrong_answers = grade_objective(answers, answer_key)
    attempt = save_attempt(score=score, ai_summary=None)  # None = "still generating"
    return attempt  # client gets this immediately

async def finalize_analysis(attempt_id, wrong_answers):
    # runs after the response above is already on its way back to the client
    analysis = await analyze_weak_topics(wrong_answers)
    next_round_ids = await select_followup_questions(analysis.weak_topics, ...)
    update_attempt(attempt_id, ai_summary=analysis.summary, next_round_ids=next_round_ids)
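
The snippet above doesn't show how `finalize_analysis` gets scheduled. Here is a minimal sketch using plain `asyncio`, with an in-memory `ATTEMPTS` dict and a stubbed analysis standing in for the real storage and LLM calls (all hypothetical). A web framework's background-task facility or a job queue would fill the same role.

```python
import asyncio

ATTEMPTS: dict[int, dict] = {}            # hypothetical in-memory store
_background: set[asyncio.Task] = set()    # strong references to running tasks


async def analyze_stub(wrong_answers: list[str]) -> str:
    await asyncio.sleep(0.01)             # stands in for the slow LLM call
    return f"{len(wrong_answers)} wrong answers analyzed"


async def finalize_analysis(attempt_id: int, wrong_answers: list[str]) -> None:
    ATTEMPTS[attempt_id]["ai_summary"] = await analyze_stub(wrong_answers)


async def submit_quiz(attempt_id: int, score: int, wrong_answers: list[str]) -> dict:
    ATTEMPTS[attempt_id] = {"score": score, "ai_summary": None}
    task = asyncio.create_task(finalize_analysis(attempt_id, wrong_answers))
    # The event loop keeps only weak references to tasks, so hold one until it finishes.
    _background.add(task)
    task.add_done_callback(_background.discard)
    return dict(ATTEMPTS[attempt_id])     # snapshot returned immediately


async def demo() -> tuple[dict, dict]:
    early = await submit_quiz(1, score=7, wrong_answers=["q3", "q9"])
    await asyncio.gather(*_background)    # in a server, the loop simply keeps running
    return early, ATTEMPTS[1]
```

In `demo`, `early` still has `ai_summary` set to `None`, while the stored attempt has the summary once the background task completes.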

The client polls or re-fetches and sees ai_summary go from None to a real value a few seconds later. The one thing this requires on the read side: if a student clicks "practice again" before the background job finishes, the next-round-question endpoint needs a synchronous fallback path too (the same fallback_fill from above works fine here) — otherwise the impatient-click case returns nothing.
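
The read-side guard might look like the following sketch. The `Attempt` record and `next_round_for` are hypothetical names, and the inline random fill stands in for `fallback_fill`:

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    # Hypothetical record; next_round_ids stays None until the background job writes it.
    id: int
    next_round_ids: Optional[list[str]] = None


def next_round_for(attempt: Attempt, bank_ids: list[str], seen: set[str], count: int) -> list[str]:
    """Serve the AI-selected round if it is ready, otherwise a random fill."""
    if attempt.next_round_ids:            # background job has finished
        return attempt.next_round_ids[:count]
    # Job still running (or it produced nothing): answer now without any AI call.
    fresh = [qid for qid in bank_ids if qid not in seen]
    random.shuffle(fresh)
    return fresh[:count]
```

The endpoint never waits on the background job: it serves whatever is ready at the moment of the click.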

Aggregating across a whole class: one AI call, not one per student

A related problem showed up at the class level: after everyone submits, a teacher wants to see common weak points across the whole class, not a list of forty individual summaries to read through. The tempting shortcut is to generate each student's weak-topic labels individually and then merge the lists — but independently-generated labels for the same underlying gap come out worded differently per student ("struggles with dosage calculation" vs. "weak on medication dosing"), and merging near-duplicate free-text labels reliably is its own hard problem.

The fix: don't generate labels per-student and merge afterward. Collect every student's wrong answers first, then make a single AI call over the whole class's data, explicitly asking it to merge same-topic mistakes from different students into one labeled bucket:

CLASS_INSIGHT_PROMPT = """Here is the wrong-answer detail for every student in this quiz, one
block per student:
{all_students_wrong_answers}

Group these into common weak points across the whole class — even if different students
missed different specific questions, merge them into one entry if they reflect the same
underlying topic. Do not list mistakes question-by-question.
"""

One call, one consistent vocabulary, no merge step needed afterward.
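
Filling the `{all_students_wrong_answers}` placeholder is plain string formatting. A sketch, assuming each wrong answer is stored with the question text, the student's answer, and the correct answer (the `WrongAnswer` shape, `format_class_blocks`, and the shortened template are illustrative, not from the original code):

```python
from dataclasses import dataclass


@dataclass
class WrongAnswer:
    # Hypothetical record shape for one missed question.
    question: str
    given: str
    correct: str


def format_class_blocks(by_student: dict[str, list[WrongAnswer]]) -> str:
    """Render one text block per student, skipping students with no mistakes."""
    blocks = []
    for student, mistakes in by_student.items():
        if not mistakes:
            continue
        lines = [f"Student {student}:"]
        for m in mistakes:
            lines.append(f"- Q: {m.question} | answered: {m.given} | correct: {m.correct}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Shortened stand-in for CLASS_INSIGHT_PROMPT so the sketch runs on its own.
PROMPT_TEMPLATE = (
    "Wrong answers, one block per student:\n"
    "{all_students_wrong_answers}\n\n"
    "Group these into common weak points across the whole class."
)


def build_class_prompt(by_student: dict[str, list[WrongAnswer]]) -> str:
    return PROMPT_TEMPLATE.format(all_students_wrong_answers=format_class_blocks(by_student))
```

Because the student data is passed as a format argument, braces inside answers are safe; only literal braces in the template itself would need doubling.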

Note

Anything that's purely a count — most frequent wrong questions, score distribution, how many students fell below a threshold — doesn't need AI at all. Compute that from the stored records directly and only spend the AI call on the part that genuinely requires synthesizing free text across many students: naming and explaining the shared gaps.
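
The count-only statistics are a few lines of standard-library Python. This is a sketch over assumed in-memory shapes (a dict of wrong question IDs per student and a dict of scores); in practice these would come from the stored records:

```python
from collections import Counter


def most_missed(wrong_by_student: dict[str, list[str]], top_n: int = 3) -> list[tuple[str, int]]:
    """Question IDs ranked by how many students got them wrong."""
    # set(missed) counts each question at most once per student.
    counts = Counter(qid for missed in wrong_by_student.values() for qid in set(missed))
    return counts.most_common(top_n)


def below_threshold(scores: dict[str, int], threshold: int) -> list[str]:
    """Students whose score is strictly below the threshold, in sorted order."""
    return sorted(s for s, score in scores.items() if score < threshold)


def score_distribution(scores: dict[str, int], bucket: int = 10) -> dict[int, int]:
    """Histogram keyed by the lower bound of each score bucket."""
    hist = Counter((score // bucket) * bucket for score in scores.values())
    return dict(sorted(hist.items()))
```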

FAQ

Why not just embed all the questions and use a vector search instead of batching through an LLM?

Embeddings would work for "find questions similar to this weak topic" and would scale better than batched LLM calls. The batching approach here was simpler to ship first and the bank size was modest enough that it stayed fast; if the bank grows by an order of magnitude, embeddings become the better trade. Worth treating LLM batching as the "good enough to start" version, not the permanent one.
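
For comparison, the embedding route reduces to a nearest-neighbor ranking once every question has a vector. The sketch below assumes the vectors were already produced by some embedding model (not shown, and not part of the system described here); `top_k_similar` is a hypothetical name, and a real deployment would likely use a vector index instead of a full sort:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(topic_vec: list[float], question_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Question IDs ranked by cosine similarity to the weak-topic vector."""
    ranked = sorted(
        question_vecs,
        key=lambda qid: cosine(topic_vec, question_vecs[qid]),
        reverse=True,
    )
    return ranked[:k]
```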

How do you pick the batch size and max-batches numbers?

Batch size is bounded by how many questions you're comfortable putting in one prompt and having the model actually read carefully — bigger batches risk the same "lost in the middle" problem that broke the single-call version, just at a smaller scale. Max batches is a cost/latency budget: decide how many LLM calls one student's "next round" is allowed to cost, then size the cap to that.

What happens to questions the AI rejects — are they tried again later?

In this design, rejected questions get tracked and excluded from future batches for that student, so the system doesn't keep spending AI calls re-judging the same "not relevant" question over and over. Questions that match but didn't make the cut (because the quota was already full) aren't excluded — they're left available for a future round.
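
The loop shown earlier excludes every question in a reviewed batch for the duration of one call. Carrying rejections across rounds means splitting each batch result three ways. Here is a sketch of that bookkeeping; `split_batch` is a hypothetical name, and where the per-student rejected set is persisted is left open:

```python
def split_batch(
    batch_ids: list[str], matched: set[str], quota: int
) -> tuple[list[str], list[str], set[str]]:
    """Split one reviewed batch into (selected, overflow, rejected).

    selected: matches that fit within the remaining quota
    overflow: matches that did not fit; left available for a later round
    rejected: questions the model judged not relevant; excluded from now on
    """
    # Filtering by batch_ids also drops any ID the model returned that was not in the batch.
    hits = [qid for qid in batch_ids if qid in matched]
    quota = max(quota, 0)
    selected, overflow = hits[:quota], hits[quota:]
    rejected = {qid for qid in batch_ids if qid not in matched}
    return selected, overflow, rejected
```

Only `rejected` would be merged into the student's long-lived exclusion set; `overflow` stays eligible for future batches.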

Bottom Line

AI-driven selection features need the same engineering discipline as any external API call: bound how much you're willing to spend, decide explicitly what happens on partial failure, and don't make a student wait on a slow AI call for a result you can compute instantly. The intelligence is in choosing which part of the problem needs a model — not in routing everything through one.