Optimizing Prompts for Language Model Pipelines: DSPy MIPROv2
2024-10-29
I’ve been diving deep into prompt optimization for large language models (LLMs) lately, especially as we build more complex NLP pipelines — or Language Model Programs.
These are workflows that chain together multiple LLM calls to tackle sophisticated tasks.
While powerful, these pipelines aren't straightforward to design: each module needs its own prompt, those prompts have to work well together, and crafting them by hand is both time-consuming and inefficient.
Recently, I came across a workflow by Karthik Kalyanaraman that offers a practical approach to prompt optimization for multi-stage LLM programs.
But before diving into that, I want to quickly summarize the two main challenges:
- The proposal problem: The first challenge is the sheer size of the prompt space. With multiple modules, the number of possible prompts becomes intractably large. We need a way to efficiently generate high-quality prompt candidates without exhaustively searching the entire space.
- The credit assignment problem: The second challenge is figuring out which parts of your prompt are actually contributing to better performance. In multi-stage pipelines, it’s tough to determine how changes in one module’s prompt affect the overall outcome. We lack intermediate labels or metrics for individual LLM calls, so we need strategies to assign credit to different prompt components effectively.
MIPROv2 tackles both the proposal and credit assignment challenges by:
- Bootstrapping Few-Shot Examples: Generating candidate examples by running inputs through your current pipeline and collecting successful outputs.
- Proposing Grounded Instructions: Creating instructions based on various aspects of the task, including data summaries and program code.
- Bayesian Optimization: Efficiently searching for the best combination of instructions and examples.
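If you just want the optimized program, DSPy ships MIPROv2 as a ready-made optimizer. Here is a minimal sketch of how it can be invoked on a small GSM8K subset; the dataset and metric are stand-ins for your own task, and argument names may differ slightly between DSPy versions:

import dspy
from dspy.teleprompt import MIPROv2
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric

# Configure the LM and load a small math QA dataset as a stand-in task.
dspy.settings.configure(lm=dspy.LM(model='gpt-4', max_tokens=250))
trainset = GSM8K().train[:50]

# A one-module program to optimize; real pipelines chain several modules.
program = dspy.ChainOfThought("question -> answer")

# MIPROv2 bootstraps demos, proposes instructions, and searches over both.
optimizer = MIPROv2(metric=gsm8k_metric, auto="light")
optimized_program = optimizer.compile(
    program,
    trainset=trainset,
    requires_permission_to_run=False  # skip the interactive confirmation (older versions)
)

The rest of this post reimplements the same three stages by hand to show what the optimizer is actually doing.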
Here’s a breakdown of how it works.
Step 1: Generating Effective Demos
The first step is to create a solid set of demonstration examples — or “demos” — that showcase what ideal input-output pairs look like.
Using a labeled training dataset, we generate multiple sets of demos.
Each set includes:
- Labeled Demos: Directly sampled from the dataset.
- Bootstrapped Demos: Generated by running the model on randomly sampled inputs from the dataset and keeping only the outputs that meet the criteria set by our evaluation function (e.g., correctness).
For instance, if we want to generate two sets of ten demos each, we might select five labeled examples and create five bootstrapped ones for each set.
This mix helps the model see both real and generated examples that meet our standards.
import dspy
import random
from loguru import logger

from .signatures import Step1BootstrapFewShot, GenerateExampleResponse


class Step1BootstrapFewShotModule(dspy.Module):
    # ... check implementation in source code


NUM_INSTRUCTIONS = 2
NUM_SETS = 2

# Generating few-shot examples
demo_generator = Step1BootstrapFewShotModule(
    trainset=trainset[:20],
    num_sets=NUM_SETS,
    num_labeled_shots=5,
    num_shuffled_shots=3,
    metric="accuracy"
)
bootstrap_few_shot_examples = demo_generator()
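Under the hood, bootstrapping is conceptually a filtered run of the pipeline over the training data. The sketch below is hypothetical (the real logic lives in Step1BootstrapFewShotModule in the source code); task_program and metric_fn are assumed stand-ins for the current pipeline and the evaluation function, and it reuses the random import above:

def bootstrap_demos(trainset, task_program, metric_fn, num_shots):
    """Hypothetical sketch: run the pipeline on shuffled training inputs
    and keep only the outputs that pass the evaluation metric."""
    demos = []
    for example in random.sample(trainset, len(trainset)):
        prediction = task_program(question=example["question"])  # run the pipeline
        if metric_fn(example, prediction):  # e.g., exact-match correctness
            demos.append(
                {"question": example["question"], "answer": prediction.answer}
            )
        if len(demos) >= num_shots:
            break
    return demos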
Step 2: Crafting the Instructions
Next, we aim to generate instructions that will guide the model to produce the desired outputs. We use two main inputs:
- Summaries of the Demos: Generated by the language model to capture the essence of the examples.
- Program Intent: Derived from the code of our program, which helps infer what we’re trying to achieve.
Using these inputs, we generate a set of instructions that reflect both the nature of the problems illustrated by the demos and the overall goal of our program.
import dspy
from loguru import logger

# pylint: disable=relative-beyond-top-level
from .signatures import (
    Step2GenerateDatasetIntent,
    Step2GenerateProgramSummary,
    Step2GenerateInstruction
)


class Step2GenerateInstructionModule(dspy.Module):
    # ... check implementation in source code


# Generating instructions
instruction_generator = Step2GenerateInstructionModule(
    few_shot_prompts=bootstrap_few_shot_examples,
    program_code=str(program_code),
    num_instructions=NUM_INSTRUCTIONS
)
instructions = instruction_generator()
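In DSPy terms, each of these proposal steps is just another LLM call defined by a signature. The class below is a hypothetical sketch of what an instruction-proposal signature could look like (the real ones live in the signatures module imported above):

class ProposeInstruction(dspy.Signature):
    """Propose an instruction for a module, grounded in the task at hand."""

    dataset_summary = dspy.InputField(desc="LLM-written summary of the few-shot demos")
    program_summary = dspy.InputField(desc="summary of the program code and its intent")
    proposed_instruction = dspy.OutputField(desc="a candidate instruction for the module")

# The module would wrap this in a predictor and call it num_instructions times.
propose_instruction = dspy.Predict(ProposeInstruction)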
Step 3: Optimizing the Prompt
Finally, we use a Bayesian Optimization approach to find the best combination of demos and instructions. This involves running several evaluation trials where we:
- Randomly select a combination of demos and instructions.
- Evaluate the model’s performance on a batch of validation examples.
- Keep track of the best-performing combinations based on our evaluation metric (a sketch of this trial loop follows the code below).
import dspy
from loguru import logger
from dspy.datasets.gsm8k import GSM8K
from dotenv import find_dotenv, load_dotenv

from src.simple_miprov2.programs.step1_bootstrap_few_shot.program import (
    Step1BootstrapFewShotModule
)
from src.simple_miprov2.programs.step2_bootstrap_instruction.program import (
    Step2GenerateInstructionModule
)
from src.simple_miprov2.programs.step3_generate_final_prompt.program import (
    Step3GenerateFinalPromptModule
)

# ... check implementation in source code

lm = dspy.LM(model='gpt-4', max_tokens=250, cache=False)
dspy.settings.configure(lm=lm)

if __name__ == "__main__":
    # ... check implementation in source code

    # Run the generate final prompt program to generate a final prompt
    logger.info("Step 3: Running generate final prompt program")
    final_prompts = []
    for instruction, few_shot_examples in zip(
        instructions, bootstrap_few_shot_examples
    ):
        # Convert few_shot_examples to a string
        few_shot_examples_str = ""
        for example in few_shot_examples:
            try:
                input_str = example["question"]
                output_str = example["answer"]
                few_shot_examples_str += (
                    f"Question: {input_str}\nExpected Answer: {output_str}\n\n"
                )
            # pylint: disable=broad-exception-caught
            except Exception as e:
                logger.error(f"Error: {e}")

        generate_final_prompt_program = Step3GenerateFinalPromptModule(
            instruction=instruction,
            few_shot_examples=few_shot_examples_str
        )
        final_prompt = generate_final_prompt_program()
        final_prompts.append(final_prompt["final_prompt"])

    logger.info("Final prompts:")
    for i, prompt in enumerate(final_prompts, 1):
        logger.info(f" Prompt {i}: {prompt}")
By the end of this process, we have prompts that are optimized to guide the model effectively, based purely on our initial labeled dataset and without requiring module-level labels or gradients.
For more information, please check the source code, the original blog post, and the MIPRO docs.
If you have any questions, please leave a comment!
Bonus Content: Building with AI
And don’t forget to have a look at some practitioner resources that we published recently.
Thank you for stopping by, and being an integral part of our community.
Happy building!