Prompts are simply the requests or inputs we send to AI models. Prompt engineering, as the name suggests, goes a little deeper than basic prompting: it's about crafting specialized inputs that guide a model toward noticeably better outputs.
You don't strictly need a programming language or an IDE for this; as many people point out, you can just use ChatGPT's front end. That's technically true, but it isn't nearly as much "fun" as prompt engineering a model through code, and it isn't as effective either.
In this article, we'll walk through how to do it in Python, using Microsoft's Phi-3-mini-4k-instruct model. We'll use the Hugging Face Inference API, so you won't have to download a 7 GB model locally.
Think of it as manipulating a model from the inside rather than through ordinary chat messages. Messing with it, so to speak.
Setting up the Environment
-
Create a Hugging Face account and generate an access token (your profile > Access Tokens). A token with "read" access is enough for the Inference API, though a "write" token works too.
This isn't a sponsored post. If you're working with LLMs, you'll need a Hugging Face account at some point anyway.
-
Make sure you have Python 3.10+ installed and an IDE set up. Or you can use this notebook on Google Colab.
-
Install the `huggingface_hub` library with `pip install huggingface_hub` (or the equivalent command for your environment).
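To double-check that the install worked, you can run a quick import test (optional):
# Optional sanity check: confirm the library imports and print its version
import huggingface_hub
print(huggingface_hub.__version__)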
Understanding the Basics
Before jumping into the code, let’s learn a bit about prompt engineering.
As I mentioned before, prompt engineering is basically creating specialized inputs to control the model outputs based on your requirements.
Different LLMs respond differently to different prompt engineering techniques, which means you can't reuse the same prompt template for every LLM. It also means you need to read each model's documentation to figure out which technique works best.
Here are some popular ones:
-
Zero-shot learning: Asking the model to perform a task without any examples
Classify the following text as positive or negative: "I really enjoyed this movie!"
This works with well-trained models like GPT-4, Claude 3 Opus, and Gemini Ultra.
In my experience, Mistral-7B, despite being a small LLM, also has impressive results in zero-shot learning.
-
Few-shot learning: Providing a few examples before asking the model to perform a task.
Text: "The food was terrible." Sentiment: Negative
Text: "I had a wonderful time." Sentiment: Positive
Ideal for tasks that might be slightly vague for the model, or where you want to demonstrate a specific format (a quick Python sketch of both zero-shot and few-shot prompts follows this list).
-
Chain-of-thought (CoT) prompting: Encouraging the model to explain its reasoning step by step.
Question: If John has 5 apples and gives 2 to Mary, how many does he have left? Let's think through this step by step:
The first model that comes to mind here is probably DeepSeek R1. That's fair; it was one of the first widely released models with a visible chain of thought, which is a big part of why it was a game changer.
-
Role-based prompting: Asking the model to assume a specific role or persona.
You are an expert Python programmer. Please review this code and suggest improvements:
This must be the most popular prompting technique among non-programmers. ChatGPT, Claude, and most other chatbots excel at providing role-based outputs.
-
System prompt: Setting up context & instructions before the actual user query
This is my favourite way of "messing" with an LLM, and in most cases you can only do it in the backend, which is exactly what makes it fascinating.
The system prompt acts as the "personality and instruction" set for a model. It's useful for defining rules or constraints.
What's more, a system message lets you do things a plain input message can't. Take a small LLM: ask it something harmful through a normal input message and it will refuse. Change the system prompt, however, and there's a real chance it will ignore its safety guardrails and try to answer anyway, at least in some models.
(It’s a serious oversight in LLMs, I agree.)
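To make the first two techniques concrete, here's what zero-shot and few-shot prompts look like as plain Python strings (the text mirrors the examples above; the final few-shot input line is made up for illustration):
# Zero-shot: just the task, no examples
zero_shot_prompt = (
    'Classify the following text as positive or negative: '
    '"I really enjoyed this movie!"'
)

# Few-shot: a couple of labeled examples, then the real input
few_shot_prompt = (
    'Text: "The food was terrible." Sentiment: Negative\n'
    'Text: "I had a wonderful time." Sentiment: Positive\n'
    'Text: "The plot was predictable." Sentiment:'
)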
All of the above techniques can be used in ChatGPT's or another chatbot's UI, except for the system prompt and chain-of-thought prompting (technically you can attempt those too, but it's not very effective).
Therefore, those two are what we'll cover in the next sections.
Chain-of-Thought
In most LLMs, you can't see the chain of thought behind a response, but you can make it visible through prompt engineering in Python.
Before writing the function, import the library and define the client:
from huggingface_hub import InferenceClient
# Replace with your Hugging Face token
client = InferenceClient(token="YOUR_HF_API_TOKEN")
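By the way, if you'd rather not hardcode the token (a good habit), you can read it from an environment variable instead; a minimal sketch, assuming you've exported it as HF_TOKEN:
import os
from huggingface_hub import InferenceClient

# Read the token from an environment variable (assumed to be named HF_TOKEN)
client = InferenceClient(token=os.environ["HF_TOKEN"])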
Then we have to determine how we can implement the chain of thought.
Most current LLMs don't have a built-in switch to make their internal chain of thought visible (DeepSeek R1, where it's built in, is the exception).
That means that if we want to make it happen, we'll have to use a system prompt. Don't confuse this with the techniques we discussed earlier, though: the system prompt here acts as the mechanism for implementing CoT, not as a prompting technique of its own.
This is how we can tell it:
Format your response as follows
1. THINKING: First, show all mental steps, considerations, and explorations. Include alternative hypotheses you consider and reject. Think about edge cases.
2. VERIFICATION: Double-check your logic and facts, identifying any potential errors.
3. ANSWER: Only after showing all thinking, provide your final answer.
Here’s how we can integrate it into the function to generate an output:
def generate_chain_of_thought_response(user_input):
    # System message defines the expected response structure
    system_prompt = (
        "Format your response as follows:\n"
        "1. THINKING: First, show all mental steps, considerations, and explorations. "
        "Include alternative hypotheses you consider and reject. Think about edge cases.\n"
        "2. VERIFICATION: Double-check your logic and facts, identifying any potential errors.\n"
        "3. ANSWER: Only after showing all thinking, provide your final answer."
    )

    # Nudge the user input to encourage visible, step-by-step reasoning
    formatted_user_input = f"{user_input}\nLet's think through this step by step."

    # Phi-style chat formatting
    prompt = (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{formatted_user_input}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

    # Call the model
    response = client.text_generation(
        prompt,
        model="microsoft/Phi-3-mini-4k-instruct",
        max_new_tokens=500,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.1,
        stop_sequences=["<|im_end|>"]
    )

    # Cleanup: drop anything after the end-of-turn marker
    answer = response.strip().split("<|im_end|>")[0].strip()
    return answer
In this code, we've also set the model's generation parameters. Let me explain them one by one.
-
`max_new_tokens=500`: This sets the maximum number of tokens the model is allowed to generate for a given prompt. A token may be a word or part of a word (depending on the model's tokenizer), and the cap keeps responses from running too long.
-
`temperature=0.7`: This controls the randomness of the output. Lower values, like 0.2, make responses more focused and relevant, but can also lead to repetition and a lack of creativity. Higher values produce more diverse, creative output, at the risk of occasional irrelevant content. 0.7 sits in the middle and works well for this model.
-
`top_p=0.95`: This uses nucleus sampling to pick from the smallest set of tokens whose cumulative probability is at least 95%. Unlike top_k, which limits choices to a fixed number, top_p adjusts the token pool dynamically based on probability, which is the smarter approach here.
-
`repetition_penalty=1.1`: This applies a penalty to tokens that have already appeared, making them less likely to show up again and again. Values above 1.0 noticeably reduce repetition.
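If you want tighter, more deterministic reasoning, you could swap the call inside the function for something like this; a rough sketch, with illustrative (not tuned) values:
# Same call as above, but tuned toward shorter, more focused output
response = client.text_generation(
    prompt,
    model="microsoft/Phi-3-mini-4k-instruct",
    max_new_tokens=300,       # cap the response length sooner
    temperature=0.2,          # more focused, less creative sampling
    top_p=0.9,                # slightly narrower nucleus
    repetition_penalty=1.1,
    stop_sequences=["<|im_end|>"]
)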
Also note how we format the prompt here:
f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
f"<|im_start|>user\n{formatted_user_input}<|im_end|>\n"
f"<|im_start|>assistant\n"
This format, with the `<|im_start|>` and `<|im_end|>` markers, depends on the specific LLM. The easiest way to figure out the right format is to check the model's documentation (or ask ChatGPT to dig through it for you).
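As a side note, recent versions of huggingface_hub also expose a chat_completion method that applies the model's chat template for you, so you don't have to hand-build these markers. A quick sketch, assuming a reasonably up-to-date library and that the endpoint supports the chat-completion task:
# Let the library apply the chat template instead of building it by hand
result = client.chat_completion(
    messages=[
        {"role": "system", "content": "Format your response as THINKING, VERIFICATION, ANSWER."},
        {"role": "user", "content": "What is 7 x 9 + 100? Let's think through this step by step."},
    ],
    model="microsoft/Phi-3-mini-4k-instruct",
    max_tokens=500,
    temperature=0.7,
)
print(result.choices[0].message.content)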
Finally, for the interactive chat experience, implement this loop:
print("Chain-of-Thought Phi (type 'exit' to quit)")

while True:
    user_input = input("\nYou: ")
    if user_input.lower().strip() in {"exit", "quit"}:
        break
    output = generate_chain_of_thought_response(user_input)
    print("\nAssistant:\n", output)
Time for a quick test. Run the script and ask a question like "What is 7 x 9 + 100?". You can expect an output like the one below:
Firstly, let us break down the expression into two parts according to the order of operations (PEMDAS/BODMAS): parentheses first then exponents or powers, followed by multiplication and division from left to right, and finally addition and subtraction from left to right. There are no parentheses or exponents in our case; so we move on to multiplication before dealing with addition. Here’s how it breaks down:
Step 1 – Multiplication part: We need to multiply 7 times 9 which gives us \(7 \times 9 = 63\).
Next Step - Addition part: Now take that result and add 100 to it (\(63 + 100\)).
Adding these together yields \(63 + 100 = 163\).
So, when calculating \(7 \times 9 + 100\), following the correct arithmetic sequence will give us a total of 163.
That might not look like a big deal, but if you used Phi-3-mini-4k-instruct without any prompt engineering, the output would be much simpler.
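If you want to see the difference yourself, here's a quick comparison call that skips the system prompt and the step-by-step nudge entirely (a sketch reusing the same client and chat formatting):
# Same question, but with no system prompt and no reasoning nudge
plain_prompt = (
    "<|im_start|>user\nWhat is 7 x 9 + 100?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
plain_response = client.text_generation(
    plain_prompt,
    model="microsoft/Phi-3-mini-4k-instruct",
    max_new_tokens=100,
    stop_sequences=["<|im_end|>"]
)
print(plain_response.strip())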
And that's it for CoT; let's move on to system prompts.
System Prompts
One way to declare a sort-of system message without code is to paste it at the start of every chat. But as the conversation goes on, most models drift away from that initial instruction as the context window fills up.
When you declare a system prompt in the backend, however, the model sticks to it throughout the entire conversation. Why? Because the system message is sent along with every request, so the model reads it before generating each response, for the whole conversation.
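To make that concrete, here's a hypothetical sketch of a multi-turn loop where the system message is prepended to every single request, which is exactly why the model never "forgets" it. The chat function and history list are my own illustration (reusing the client from earlier), not part of the article's final code:
# Hypothetical multi-turn chat: the system message is re-sent on every call
history = []

def chat(user_input, system_message):
    history.append(("user", user_input))
    prompt = f"<|im_start|>system\n{system_message}<|im_end|>\n"
    for role, text in history:
        prompt += f"<|im_start|>{role}\n{text}<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"
    reply = client.text_generation(
        prompt,
        model="microsoft/Phi-3-mini-4k-instruct",
        max_new_tokens=200,
        stop_sequences=["<|im_end|>"]
    )
    reply = reply.strip()
    history.append(("assistant", reply))
    return reply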
As for the code, start with authentication, as we did earlier:
from huggingface_hub import InferenceClient
# Replace 'YOUR_HF_API_TOKEN' with your actual Hugging Face API token
client = InferenceClient(token="YOUR_HF_API_TOKEN")
In this case, I'll write a system message that makes the model calm and peaceful, in the spirit of Zen Buddhism. Note that Phi models have content moderation built in (good job, Microsoft), so you won't be able to steer the system prompt toward anything considered harmful.
Here’s the code we can use:
def generate_response(user_input):
    system_message = (
        "Use words often used in Zen Buddhism. "
        "Act like you are a monk, staying calm and peaceful. "
        "Encourage the user to be calm and follow Zen practices too."
    )
    prompt = (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_input}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    # Call the model (same parameters as in the CoT example)
    response = client.text_generation(
        prompt,
        model="microsoft/Phi-3-mini-4k-instruct",
        max_new_tokens=500,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.1,
        stop_sequences=["<|im_end|>"]
    )
For some reason, this model's output tends to end with <|im_end|>. That doesn't affect the quality of the response, but we can clean it up anyway:
    # Clean up the result
    answer = response.strip()
    if answer.endswith("<|im_end|>"):
        answer = answer.replace("<|im_end|>", "").strip()

    # Wrap the answer at roughly 100 characters per line for readability
    formatted_answer = '\n'.join(answer[i:i + 100] for i in range(0, len(answer), 100))
    return formatted_answer
And that’s it. Complete the code with a user-input loop as follows:
print("Zen AI (type 'quit' to exit)")

while True:
    user_input = input("\nYou: ")
    if user_input.lower() in ["quit", "exit"]:
        break
    response = generate_response(user_input)
    print("Assistant:", response)
Run a quick test and see how beautifully the model's output sticks to the system message.
You: hello
Assistant: Namaste. May your day unfold with tranquility and mindfulness as guiding principles.
Feel free to change max_new_tokens or the other values to suit your needs.
And voila! We successfully prompted the Phi-3-mini model to show its chain of thought and then to become a Zen monk.
Summing Up
Prompt engineering, despite sounding like a big deal, isn't that complicated. What matters is how you ask the model for what you want. And remember, you can't FORCE a model to do what it should do; you coax it gently, like a mother getting a toddler into a jacket without setting off a tantrum.
For example, if we tell the Phi-3-mini model "You are a freakin Zen monk! Act like one! Don't make me repeat myself!", it will try to comply, but not nearly as well. Worse, you'll keep getting responses like "Please remember that as an AI developed by Microsoft, named Phi (or GPT)…".
And that’s it for today. Thanks for reading so far. See you in… two weeks?