Ten tips and tricks to employ in your Gen AI projects

Lessons from a Production-ready Generative AI Application

Oct 11, 2023

There are not many Generative AI applications in production use today, by which I mean that they are deployed and actively used by end-users. (Demos, POCs, and Extractive AI do not count.) The Gen AI applications that are used in production (e.g. Duet in Google Workspace, sales email creation in Salesforce’s Einstein GPT) are closed-source, and so you can’t learn from them.

That’s why I was excited when defog.ai open-sourced SqlCoder, the NLP-to-SQL model that they have been using as part of automating several Generative AI workflows at their customers. They also helpfully wrote a set of blog posts detailing their approach and their thinking. That gives me a concrete example to point to.

In this article, I’ll use SqlCoder to showcase concrete examples of things you could be doing in your own GenAI projects.

1. Devise an Evaluation Metric that is computed on how the generated text will be used.

As in traditional machine learning, the loss metric that is used to optimize an LLM does not capture its real-world utility. Classification models are trained using a cross-entropy loss but evaluated using metrics such as AUC/F-score or by assigning an economic cost to false positives, etc.

Similarly, foundational LLMs are trained with a next-token cross-entropy loss and typically benchmarked using metrics such as BLEU or ROUGE. At some level, all these do is measure the overlap in tokens between the generated text and the label. Obviously, this is no good for SQL generation: the label “SELECT list_price” and the generated text “SELECT cost_price” mean quite different things, yet an overlap metric rates them as nearly identical (tokens in LLMs are subwords, so here the two strings differ by only 1 token!).

The way defog solves this is explained in this blog post on how they did evaluation. Basically, instead of comparing the SQL strings directly, they run the generated SQL on a small dataset and compare the results. This allows them to accept equivalent SQL as long as the SQL ends up doing the same thing as the label. However, what happens if the columns are aliased differently? How do you handle out-of-order results? What happens if the generated SQL is a superset of the label? Lots of corner cases and nuances need to be addressed. Do read their blog post on evaluation if you are interested in this specific problem. The larger point, though, is valid for all sorts of Gen AI problems — devise an evaluation metric that is computed, not on the generated string, but on how that generated string will be used.
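To make the idea concrete, here is a minimal sketch of execution-based comparison. It is not defog's evaluation code: the in-memory SQLite database, the toy `products` table, and the helper names are all my assumptions for illustration. It treats two queries as equivalent when they return the same multiset of rows, which tolerates different column aliases and row order but deliberately ignores the harder corner cases (supersets of columns, queries where ordering matters).

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Run a query and return its rows as a list of tuples."""
    return conn.execute(sql).fetchall()


def results_match(conn, generated_sql, label_sql):
    """Compare results as multisets of rows (ignores aliases and row order)."""
    try:
        generated_rows = run_query(conn, generated_sql)
    except sqlite3.Error:
        return False  # generated SQL that does not run counts as wrong
    label_rows = run_query(conn, label_sql)
    return Counter(generated_rows) == Counter(label_rows)


# Hypothetical toy dataset, small enough to run on every evaluation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY, name TEXT, list_price REAL, cost_price REAL
);
INSERT INTO products VALUES (1, 'widget', 10.0, 6.0), (2, 'gadget', 25.0, 18.0);
""")

label = "SELECT name, list_price FROM products ORDER BY name"
equivalent = "SELECT p.name AS product, p.list_price AS price FROM products p"
wrong_column = "SELECT name, cost_price FROM products ORDER BY name"

print(results_match(conn, equivalent, label))    # differently written, same rows
print(results_match(conn, wrong_column, label))  # one token away, different rows
```

The second query is textually almost identical to the label and would score well on token overlap, but it fails here because the rows it returns are different.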

Many research papers use an LLM (usually GPT-4) to “score” the generated text and use this as a metric. This is not as good as devising a proper evaluation metric because LLM scores are heavily biased toward GPT algorithms and against many of the smart optimizations that you can do. Also, recall that OpenAI had to turn off their service that attempted to detect AI-generated text; if they couldn’t get LLM-generated scores to work, why do you think you will?

2. Set up Experimentation Tracking

Before you start to do anything, make sure you have a system to keep records and share the results of your experiments. You will carry out a lot of experiments, and you want to make sure that you are capturing everything you have tried.

This could be as simple as a spreadsheet with the following columns: experiment, experiment descriptors (approach, parameters, dataset, etc.), training cost, inference cost, metrics (sliced by subtask: see below), qualitative notes. Or it could be more complex, taking advantage of an ML experiment tracking framework such as those built into Vertex AI, SageMaker, neptune.ai, Databricks, DataRobot, etc.

If you are not recording experiments in a repeatable way that is consistent across all the members of your team, it will be hard to make downstream decisions.
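The spreadsheet option can be as small as the sketch below. The column names mirror the list above, but the exact field names, the JSON encoding of the sliced metrics, and every value in the two records are assumptions made up for illustration.

```python
import csv
import io
import json

# Assumed column names, following the spreadsheet columns described above.
FIELDS = ["experiment", "approach", "parameters", "dataset",
          "training_cost_usd", "inference_cost_usd", "metrics", "notes"]


def log_experiment(stream, record, write_header=False):
    """Write one experiment as a CSV row; an unknown column raises ValueError."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    if write_header:
        writer.writeheader()
    writer.writerow(record)


# io.StringIO stands in for a shared file opened in append mode.
log = io.StringIO()
log_experiment(log, {
    "experiment": "exp-001",
    "approach": "zero-shot",
    "parameters": json.dumps({"temperature": 0.0}),
    "dataset": "eval-v1",
    "training_cost_usd": 0,
    "inference_cost_usd": 12.5,
    "metrics": json.dumps({"group_by": 0.8, "join": 0.6}),  # sliced by subtask
    "notes": "fails on ratio questions",
}, write_header=True)
log_experiment(log, {
    "experiment": "exp-002",
    "approach": "fine-tuned",
    "parameters": json.dumps({"epochs": 3}),
    "dataset": "eval-v1",
    "training_cost_usd": 140,
    "inference_cost_usd": 1.5,
    "metrics": json.dumps({"group_by": 0.9, "join": 0.7}),
    "notes": "",
})

rows = list(csv.DictReader(io.StringIO(log.getvalue())))
print(len(rows), rows[1]["approach"])
```

Rejecting unknown columns is the point of the fixed `FIELDS` list: it keeps every team member recording the same things in the same way.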

3. Break down your problem into subtasks

You will often want to do all your evaluations not on the entire evaluation dataset but on subsets of that dataset broken down by task. For example, see how defog are reporting performance on different types of queries:

SqlCoder computes metrics on subsets of tasks. Table from defog’s results on HuggingFace.

There are three reasons why you’d want to do such sliced evaluations:

You will eventually run into a logjam between model size, performance and cost. One way to break out of the box is to have multiple ML models, each tuned on a different subtask. Many people suspect that GPT-4 is itself an ensemble of GPT 3.5-quality models. [As an aside, this is one of the reasons that individual LLMs fare poorly against GPT-4 — you need to build an ensemble of models to beat it.]
If you have multiple stakeholders, they might be interested in different things. In that case, make sure to devise and track metrics corresponding to each of their goals. You can treat these differing goals as subtasks too, and start by tracking them. You’re likely to end up having to create multiple models, one for each stakeholder. Again, you can now treat these models as members of the ensemble.
A third reason to do sliced evaluation on subtasks is that the gold standard for ML evaluation is to present it to a panel of human experts. That tends to be too expensive. However, if you ever do human evaluation, make sure you do it in such a way that you can later use computed metrics to “predict” what a human evaluation might be. Having more attributes of the problem can be helpful in doing such calibration.
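Sliced evaluation needs very little machinery. The sketch below assumes each evaluation example has already been tagged with a subtask and judged correct or not; the subtask names and the results are invented for illustration.

```python
from collections import defaultdict


def sliced_accuracy(results):
    """results: iterable of (subtask, is_correct) pairs.

    Returns accuracy per subtask plus an 'overall' entry.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subtask, is_correct in results:
        for key in (subtask, "overall"):
            totals[key] += 1
            correct[key] += int(is_correct)
    return {key: correct[key] / totals[key] for key in totals}


# Hypothetical per-example outcomes.
results = [
    ("group_by", True), ("group_by", True),
    ("join", True), ("join", False),
    ("ratio", False), ("ratio", False),
]
sliced = sliced_accuracy(results)
for subtask, accuracy in sorted(sliced.items()):
    print(f"{subtask}: {accuracy:.2f}")
```

An overall number of 0.50 hides the fact that one slice is solved and another fails completely, which is exactly the information you need when deciding where a specialized model would pay off.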

4. Apply prompt engineering tricks

All the approaches to using Gen AI ultimately require sending a text prompt to a trained LLM. Over time, the community has learned quite a few tips and tricks for creating good prompts. Usually, the LLM’s documentation tells you what works (examples: OpenAI cookbook, Llama2, Google PaLM); make sure to read these and employ the suggested techniques!

The defog prompt is:

prompt = """### Instructions:
Your task is convert a question into a SQL query, given a Postgres database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use Table Aliases** to prevent ambiguity. For example, `SELECT table1.col1, table2.col1 FROM table1 JOIN table2 ON table1.id = table2.id`.
- When creating a ratio, always cast the numerator as float
### Input:
Generate a SQL query that answers the question `{question}`.
This query will run on a database whose schema is represented in this string:
CREATE TABLE products (
product_id INTEGER PRIMARY KEY, -- Unique ID for each product
name VARCHAR(50), -- Name of the product
price DECIMAL(10,2), -- Price of each unit of the product
quantity INTEGER -- Current quantity in stock
);
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY, -- Unique ID for each customer
name VARCHAR(50), -- Name of the customer
address VARCHAR(100) -- Mailing address of the customer
);
…
-- sales.product_id can be joined with products.product_id
-- sales.customer_id can be joined with customers.customer_id
-- sales.salesperson_id can be joined with salespeople.salesperson_id
-- product_suppliers.product_id can be joined with products.product_id
### Response:
Based on your instructions, here is the SQL query I have generated to answer the question `{question}`:
```sql
"""

This illustrates several tricks:

1. Task Input. The preamble (“Your task is to … SQL … Postgres database …”) is called a Task Input. This is the input to the instruction-tuning stage of an LLM’s training regimen. Recall that, fundamentally, an LLM is a text completion machine. Anything that you can do to boost the probability of words in the right sector of the word space will help. So, many LLMs will work better if your preamble guides the LLM to the part of the word space that you care about. Defog’s use of words like SQL, Postgres, etc. in the preamble is critical here.
2. System Prompt. The rules (“go through question and schema word by word, use table aliases, etc.”) form what is called a System Prompt. This is used to guide and constrain behavior. [My suggestion to the defog team would be to avoid 10-dollar words like “Adhere” and use 10-cent words like “Always” and “Never”; they tend to work better.] LLMs are trained to honor system prompts (this is how they guard against toxicity, for example). Use them to your advantage.
3. Beginning and end of context. The question to be answered occurs twice: once in the section on Input and once in the section on Response. This placement, at the beginning and at the end, is not accidental. LLMs tend to weight the middle of the context lower, especially if your prompt (as here) is very long. Put the most important things at the beginning and at the end. Repetition may help (experiment to see if it does).
4. Structured input. The attention mechanism in each “head” determines how much weight every token in the context receives. Given this, using consistent and unique token sequences like “### Input:” helps train the LLM to use the words that follow differently.
5. Rules in context (?). Defog has a section on rules about what columns can be joined with others. It’s quite interesting that they are putting rules on what columns can be joined as part of the input context. I haven’t seen this before (and I am not sure why this works), but it is something I will start to watch out for. It’s always a learning process: read the prompts that work, so you pick up new tricks.
6. Sentence Completion. Note how the prompt ends with “here is the SQL query I have generated …”. This is a trick to help the LLM do its natural thing of completing the prompt. Chat-based LLMs are trained to take a question and generate the corresponding answer, but if you are fine-tuning, you’ll usually be fine-tuning a base LLM that doesn’t have this capability. Setting the prompt up for sentence completion is a good idea.
7. Context length. All LLMs have a context length. For example, the context length for Llama2 is 4K by default, but you can increase this window by changing the source or during fine-tuning. Many modern LLMs (Llama2 among them) tokenize on sub-words using a library called sentencepiece, and a good rule of thumb is to think of each token as being two characters. Stay aware of the length of your prompt and make sure it remains under the context length of your LLM. Otherwise, the LLM will truncate your request! And if the sentence completion is not included in the truncated prompt, the LLM will just continue the question!
8. Structured response. There are several tricks to get the LLM to generate a response that you can parse. One is to use the system prompt to ask it to generate YAML. This is hit-and-miss. Another is to use few-shot examples in the context to illustrate the desired response format. This tends to work better. The third trick is the most reliable: add a special sequence of characters (defog is using three backticks) in the sentence completion piece of the prompt. Then, in the postprocessing, retain only the part of the response that follows the special sequence.
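Several of these tricks (repeating the question at both ends, structured section markers, sentence completion, and the special-sequence postprocessing) fit in a few lines. This is a sketch rather than defog's code: the prompt is an abbreviated paraphrase of theirs, the function names are mine, and I am assuming the text being postprocessed is the prompt followed by the model's completion, which is what many completion-style APIs return.

```python
FENCE = "`" * 3  # three backticks, the special sequence used in the prompt


def build_prompt(question, schema):
    """Assemble the prompt; the question appears near the start and at the end."""
    return (
        "### Instructions:\n"
        "Your task is to convert a question into a SQL query, "
        "given a Postgres database schema.\n"
        "### Input:\n"
        f"Generate a SQL query that answers the question `{question}`.\n"
        "This query will run on a database whose schema is represented "
        "in this string:\n"
        f"{schema}\n"
        "### Response:\n"
        "Based on your instructions, here is the SQL query I have generated "
        f"to answer the question `{question}`:\n"
        f"{FENCE}sql\n"
    )


def extract_sql(full_text):
    """Keep only what follows the last opening marker, up to the closing fence."""
    marker = FENCE + "sql"
    if marker not in full_text:
        return None  # truncated or malformed output: do not guess
    after_marker = full_text.rsplit(marker, 1)[1]
    return after_marker.split(FENCE, 1)[0].strip()


prompt = build_prompt(
    "How many products are in stock?",
    "CREATE TABLE products (product_id INTEGER, quantity INTEGER);",
)
# Simulated model output: the prompt, the completion, then some trailing chatter.
simulated = prompt + "SELECT SUM(quantity) FROM products;\n" + FENCE + "\nHope this helps!"
print(extract_sql(simulated))
```

Returning `None` when the marker is missing matters: that is the situation in which a truncated prompt would otherwise leak instructions into the output.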

As an aside, you can see #7 and #8 in Google Workspace Duet. Unless the bug has been fixed, try selecting an overly long paragraph (longer than the context) and ask it to summarize. The result will contain the word “Instruction”, which is part of the System Prompt. The reason you get to see it is that the special characters that delineate the output didn’t exist in the response. Much red-team hackery of LLMs starts with overstuffing the response — truncation exposes a lot of bugs and unanticipated behavior.
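One way to guard against this kind of truncation is to budget the prompt before sending it, and to shrink only the variable middle so that the instructions at the start and the sentence completion at the end always survive. The sketch below uses a characters-per-token heuristic rather than a real tokenizer, and dropping trailing schema lines is a deliberately crude trimming policy; the function names and the numbers in the example are assumptions.

```python
def estimate_tokens(text, chars_per_token=2.0):
    """Rough token estimate from character count; a heuristic, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_context(prompt, context_length, max_new_tokens):
    """True if the prompt plus room reserved for the response fits the window."""
    return estimate_tokens(prompt) + max_new_tokens <= context_length


def fit_prompt(head, schema_lines, tail, context_length, max_new_tokens):
    """Drop trailing schema lines until the prompt fits; never touch head or tail."""
    lines = list(schema_lines)
    while True:
        prompt = head + "".join(line + "\n" for line in lines) + tail
        if fits_context(prompt, context_length, max_new_tokens):
            return prompt
        if not lines:
            raise ValueError("head and tail alone exceed the context budget")
        lines.pop()  # crude policy: drop the last schema line and retry


# Tiny numbers so the behavior is easy to follow.
fitted = fit_prompt("H\n", ["a" * 10] * 5, "T", context_length=20, max_new_tokens=5)
print(repr(fitted))
```

In a real system you would count tokens with the model's own tokenizer and choose which schema lines to drop based on relevance to the question.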

5. Intelligently mix different approaches in your architecture

There are now five approaches to building on top of Generative AI:

1. Zero-shot: simply sending a prompt to the LLM. You are relying solely on the training data of the LLM.
2. Few-shot: Including 1–2 example inputs and responses in the context. These examples could be fixed, or could be retrieved based on which examples are most relevant to the query. This is usually just a way to guide the LLM, not a way of teaching it new information or new tasks.
3. Retrieval Augmented Generation (RAG): Pulling relevant data, usually from a vector database based on similarity search, and including it in the context. This is a way to teach the LLM new information (today’s LLMs cannot be taught new skills using RAG).
4. Fine Tuning. Typically, this is done in a parameter-efficient way (PEFT) using the low-rank adaptation (LoRA) approach of training small low-rank matrices that are added to the frozen weights of the LLM so that the LLM is able to handle new tasks. In fine-tuning you teach the LLM how to handle a new instruction (today’s LLMs cannot learn new information by fine-tuning).
5. Agent framework. Get an LLM to generate the parameters that you will pass to an external API. This can be used to add more skills and knowledge to an LLM, but can be dangerous without a human in the loop.

As you can see, each approach has its strengths and drawbacks. So, what defog is doing is a mix of several of these approaches. Ultimately, they are doing #5 (generating the SQL that will be sent to a database), but putting the SQL in the path of a complex workflow that is guided by a human user. They pull the necessary schema and join rules (#3) based on the query. They have fine-tuned (#4) a small model to efficiently manage cost. And they are invoking the fine-tuned model in a zero-shot way (#1).
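The retrieval step (pulling only the schema that is relevant to the question) can be illustrated without a vector database. The sketch below swaps in a much simpler technique, word-overlap (Jaccard) scoring, purely to show the shape of the step; I do not know how defog actually selects schema, and the table descriptions here are made up.

```python
import re


def tokenize(text):
    """Lowercased alphabetic words as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_tables(question, table_docs, top_k=2):
    """Rank tables by Jaccard overlap between the question and each description."""
    question_words = tokenize(question)
    scored = []
    for name, doc in table_docs.items():
        doc_words = tokenize(doc)
        union = question_words | doc_words
        score = len(question_words & doc_words) / len(union) if union else 0.0
        scored.append((score, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [name for score, name in scored[:top_k] if score > 0]


# Hypothetical one-line descriptions of each table.
table_docs = {
    "products": "products product_id name price quantity in stock",
    "customers": "customers customer_id name address mailing",
    "salespeople": "salespeople salesperson_id name region",
}
selected = select_tables("What is the total quantity of products in stock?", table_docs)
print(selected)
```

Only the selected tables' CREATE TABLE statements and join rules would then be placed in the prompt, which keeps the context short. A production system would typically replace the overlap score with embedding similarity.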

This kind of intelligent mix of approaches is necessary to take advantage of the strengths of the various approaches and guard against their weaknesses.

6. Clean up and organize the dataset

It is becoming clear that in Gen AI, both quantity and quality of your data matter. Defog set themselves a goal of getting 10k training examples in order to fine-tune a custom model (it appears they fine-tune models for each customer: see discussion earlier about subtasks) and a big part of their effort is to clean up the dataset.

Here’s a quick checklist when it comes to ensuring your dataset is optimal:

Correctness. Ensure that the labels are all correct. Defog ensured this by requiring that each label SQL run and produce a dataframe that can be compared with dataframes created from generated text.
Curate the data. Platypus was able to improve on Llama2 by simply removing duplicates from the training dataset, removing gray area questions, etc.
Data diversity. It’s important to use the 10k examples wisely, and show the LLM a good variety of what it will see in production. Note how Platypus uses lots of open datasets or how defog used 10 separate sets of schemas instead of training on just one set of tables.
Evol-instruct. The “Textbooks are all you need” paper showcases the importance of choosing simple examples in increasing order of difficulty. Defog uses an LLM to adapt a set of instructions into more complex ones.
Assign difficulty level to examples. There are lots of cases where segmenting the training dataset by difficulty can be useful. You can compute sliced evaluation metrics (see Tip #3), train simpler models for simpler tasks, use it as an effective ensembling mechanism, teach the model in stages of difficulty, etc.
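Two items on this checklist, removing duplicates and assigning a difficulty level, are easy to prototype. The sketch below is an assumption-laden illustration: the normalization rule, the list of SQL constructs, and the easy/medium/hard thresholds are all invented, not taken from Platypus or defog.

```python
import re


def normalize(question):
    """Lowercase and drop punctuation so near-identical questions collide."""
    return " ".join(re.findall(r"[a-z0-9_]+", question.lower()))


def deduplicate(examples):
    """Keep the first example for each normalized question."""
    seen = set()
    unique = []
    for example in examples:
        key = normalize(example["question"])
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


def difficulty(sql):
    """Hypothetical heuristic: count constructs that tend to make queries harder."""
    score = len(re.findall(r"\b(JOIN|GROUP BY|HAVING|UNION|OVER)\b", sql,
                           flags=re.IGNORECASE))
    score += max(0, sql.upper().count("SELECT") - 1)  # each nested SELECT adds one
    if score == 0:
        return "easy"
    if score <= 2:
        return "medium"
    return "hard"


examples = [
    {"question": "How many products are in stock?",
     "sql": "SELECT SUM(quantity) FROM products"},
    {"question": "how many products are in stock",
     "sql": "SELECT SUM(quantity) FROM products"},
    {"question": "Total sales per customer?",
     "sql": "SELECT c.name, SUM(s.amount) FROM sales s "
            "JOIN customers c ON s.customer_id = c.customer_id GROUP BY c.name"},
]
cleaned = deduplicate(examples)
print(len(cleaned), [difficulty(example["sql"]) for example in cleaned])
```

The difficulty label can then double as the slicing key for evaluation and as the ordering key if you train in stages.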

By far, this is the tip that will give you the biggest performance boost.

7. Decide on build-vs-buy on a case-by-case basis

Large models are expensive to serve. You can get competitive results by fine-tuning a smaller model on curated datasets. These can be 1/10th or less of the cost. Plus, you can serve the fine-tuned model on-premises, on the edge, etc. When calculating the ROI, don’t ignore the financial/strategic benefits of owning the model.

That said, GPT-4 from OpenAI often gives you great performance out of the box. If you can anticipate the scale at which you will call the OpenAI API, you can estimate what it will cost you in production. If your requests will be few enough, fine-tuning does not make financial sense because of the development cost involved. Even if you start down the fine-tuning approach, benchmark against the state-of-the-art model and be ready to pivot your approach if necessary.
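The build-vs-buy arithmetic is simple enough to put in a function. Every number below is hypothetical (per-request prices, the fixed serving cost, and the development cost are placeholders, with the self-hosted per-request cost set to one tenth of the API cost to match the ratio mentioned above); plug in your own quotes.

```python
def monthly_cost_api(requests_per_month, cost_per_request):
    """Pay-per-call cost of a bought model."""
    return requests_per_month * cost_per_request


def monthly_cost_self_hosted(requests_per_month, cost_per_request, fixed_serving_cost):
    """Always-on serving cost plus the marginal cost of each request."""
    return fixed_serving_cost + requests_per_month * cost_per_request


def break_even_months(requests_per_month, api_cost_per_request,
                      own_cost_per_request, fixed_serving_cost, development_cost):
    """Months until monthly savings repay the development cost; None if never."""
    saving = (monthly_cost_api(requests_per_month, api_cost_per_request)
              - monthly_cost_self_hosted(requests_per_month, own_cost_per_request,
                                         fixed_serving_cost))
    if saving <= 0:
        return None  # at this volume the bought model is cheaper every month
    return development_cost / saving


# Hypothetical inputs, in USD.
assumptions = dict(api_cost_per_request=0.03, own_cost_per_request=0.003,
                   fixed_serving_cost=2_000, development_cost=60_000)

high_volume = break_even_months(requests_per_month=1_000_000, **assumptions)
low_volume = break_even_months(requests_per_month=50_000, **assumptions)
print(high_volume, low_volume)
```

With these made-up inputs, a million requests a month repays the development cost in roughly two and a half months, while fifty thousand requests a month never does.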

It is unlikely that you will have the bandwidth to create custom models for everything you need. So, you will likely have a mix of bought models and built ones. Do not fall into the trap of always building or always buying.

8. Abstract away the specific LLM

OpenAI is not the only game in town. Google keeps hinting that their upcoming Gemini model is better than GPT-4. It’s likely to be a situation where there’s a new state of the art (SoTA) model every few months. Your evaluation mix should definitely have whichever model (GPT-4, Gemini, or GPT-5) is SoTA by the time you read this. However, make sure that you also compare performance and cost against other near-SoTA models like those from Cohere or Anthropic and previous generation ones like GPT 3.5 and PaLM2.

Which LLM you buy is mostly a business decision. Small differences in performance are rarely worth large differences in cost. So, compare performance and cost for several options.

Use langchain to abstract away the LLM and have the cost-benefits captured in your experimentation framework. This will help you negotiate effectively.
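To show what such an abstraction buys you, here is a plain-Python sketch of the idea. It is not langchain's API: the `TextModel` interface, the `CannedModel` stand-in, and the per-call costs are all hypothetical. The point is that the comparison code never mentions a vendor.

```python
from dataclasses import dataclass
from typing import Protocol


class TextModel(Protocol):
    """The only interface the rest of the application is allowed to depend on."""
    name: str
    cost_per_call: float

    def complete(self, prompt: str) -> str: ...


@dataclass
class CannedModel:
    """Stand-in backend with a fixed answer; a real one would call a vendor SDK."""
    name: str
    cost_per_call: float
    answer: str

    def complete(self, prompt: str) -> str:
        return self.answer


def compare(models, eval_set, is_correct):
    """Run every model over the same eval set; report accuracy and total cost."""
    report = {}
    for model in models:
        hits = sum(is_correct(model.complete(prompt), label)
                   for prompt, label in eval_set)
        report[model.name] = {
            "accuracy": hits / len(eval_set),
            "cost": model.cost_per_call * len(eval_set),
        }
    return report


eval_set = [("question 1", "SELECT 1"), ("question 2", "SELECT 2")]
models = [CannedModel("big", 0.03, "SELECT 1"), CannedModel("small", 0.003, "SELECT 3")]
report = compare(models, eval_set, lambda output, label: output == label)
print(report)
```

Feeding this report into your experiment log gives you the cost-benefit numbers for each candidate side by side.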

9. Deploy as an API

Even your “small” 13B-parameter fine-tuned LLM takes forever to load and requires a bank of GPUs to serve. Serve it as an API, even to internal users, and use a gateway service to meter and monitor it.

If your end users are application programmers, document the API interface, and if you support different prompts, document them, and provide unit tests to ensure that you don’t break downstream workflows that use those specific prompts.
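One lightweight form of such a unit test checks that each documented prompt still accepts exactly the fields its documentation promises. The registry layout, the prompt id, and the template text below are assumptions for illustration.

```python
import string
import unittest

# Hypothetical registry: each documented prompt lists the fields callers supply.
DOCUMENTED_PROMPTS = {
    "nl_to_sql": {
        "template": ("Generate a SQL query that answers the question `{question}`.\n"
                     "Schema:\n{schema}\n"),
        "fields": {"question", "schema"},
    },
}


def template_fields(template):
    """Names of the {placeholders} that appear in a format-style template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


class PromptContractTest(unittest.TestCase):
    def test_documented_fields_match_template(self):
        for prompt_id, spec in DOCUMENTED_PROMPTS.items():
            with self.subTest(prompt=prompt_id):
                self.assertEqual(template_fields(spec["template"]), spec["fields"])
```

Run it with `python -m unittest` as part of your build; renaming or dropping a placeholder then fails the build instead of silently breaking a downstream workflow.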

If your end users are non-technical, an API is not enough. As defog illustrates, it is a good idea to provide a playground interface with example queries (“chips”) using something like Streamlit. If your end users are ML developers, use HuggingFace for your playground functionality.

10. Automate your training

Make sure your fine-tuning pipeline is fully automated.

Default to whichever hyperscaler you normally use for your cloud ML platform, but do make sure to cost it out and ensure GPU/TPU availability in your region. There are also several startups that provide “LLMops” as a service and are often more cost-effective than the big cloud providers because they use spot instances, own their hardware, or are spending other people’s money.

A good way to preserve choice here is to containerize the entire pipeline. That way, you can easily port your (re)training pipeline to wherever the GPUs are.