You can access LLaMa 3.1 models through inference services like Together, which offers generous rate limits and an OpenAI-compatible API. We'll use Together, via the
Braintrust proxy, to access both the LLaMa 3.1 and OpenAI models.
To get started, make sure you have a Braintrust account and API keys for Together and OpenAI. Plug both keys into your Braintrust account's
AI secrets configuration and acquire a BRAINTRUST_API_KEY. Feel free to put your BRAINTRUST_API_KEY in a .env.local file next to this notebook, or just hardcode it into the code below.
import dotenv from "dotenv";
import * as fs from "fs";

if (fs.existsSync(".env.local")) {
  dotenv.config({ path: ".env.local", override: true });
}
import { OpenAI } from "openai";
import { wrapOpenAI } from "braintrust";

const client = wrapOpenAI(
  new OpenAI({
    apiKey: process.env.BRAINTRUST_API_KEY,
    baseURL: "https://api.braintrust.dev/v1/proxy",
    defaultHeaders: { "x-bt-use-cache": "never" },
  })
);

const LLAMA31_8B = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo";
const LLAMA31_70B = "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo";
const LLAMA31_405B = "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo";

const response = await client.chat.completions.create({
  model: LLAMA31_8B,
  messages: [
    {
      role: "user",
      content: "What is the weather in Tokyo?",
    },
  ],
  max_tokens: 1024,
  temperature: 0,
});

console.log(response.choices[0].message.content);
However, I'm a large language model, I don't have real-time access to current weather conditions. But I can suggest some ways for you to find out the current weather in Tokyo:

1. **Check online weather websites**: You can visit websites like AccuWeather, Weather.com, or the Japan Meteorological Agency (JMA) website to get the current weather conditions in Tokyo.
2. **Use a weather app**: You can download a weather app on your smartphone, such as Dark Sky or Weather Underground, to get the current weather conditions in Tokyo.
3. **Check social media**: You can also check social media platforms like Twitter or Facebook to see if there are any updates on the weather in Tokyo.

That being said, I can provide you with some general information about the climate in Tokyo. Tokyo has a humid subtropical climate, with four distinct seasons:

* **Spring (March to May)**: Mild temperatures, with average highs around 18°C (64°F) and lows around 10°C (50°F).
* **Summer (June to August)**: Hot and humid, with average highs around 28°C (82°F) and lows around 22°C (72°F).
* **Autumn (September to November)**: Comfortable temperatures, with average highs around 20°C (68°F) and lows around 12°C (54°F).
* **Winter (December to February)**: Cool temperatures, with average highs around 25°C (77°F).
* **Autumn (September to November)**: Comfort Index: 7/10
* **Autumn (September to November)**: Comfortable temperatures, with average highs around 20°C (68°F) and lows around 12°C (54°F).
* **Winter (December to February)**: Cool temperatures, with average highs around 10°C (50°F) and lows around 2°C (36°F).

Please note that these are general temperature ranges, and the actual weather conditions can vary from year to year.

If you provide me with the PDFs are often related to education, government, or business.
As expected, the model can't answer the question without access to some tools. Traditionally, LLaMa models haven't supported tool calling. Some inference providers have attempted to solve this with controlled generation or similar methods, with limited success. However, the llama-agentic-system codebase alludes to a new approach to tool calls:
Think very carefully before calling functions.
If you choose to call a function ONLY reply in the following format with no prefix or suffix:

<function=example_function_name>{"example_name": "example_value"}</function>
Let's see if we can make this work with the commonly used get_current_weather tool definition.
const weatherTool = {
  name: "get_current_weather",
  description: "Get the current weather in a given location",
  parameters: {
    type: "object",
    properties: {
      location: {
        type: "string",
        description: "The city and state, e.g. San Francisco, CA",
      },
    },
    required: ["location"],
  },
};

const toolPrompt = `You have access to the following functions:

Use the function '${weatherTool.name}' to '${weatherTool.description}':
${JSON.stringify(weatherTool)}

If you choose to call a function ONLY reply in the following format with no prefix or suffix:
<function=example_function_name>{"example_name": "example_value"}</function>

Reminder:
- If looking for real time information use relevant functions before falling back to brave_search
- Function calls MUST follow the specified format, start with <function= and end with </function>
- Required parameters MUST be specified
- Only call one function at a time
- Put the entire function call reply on one line
`;

const response = await client.chat.completions.create({
  model: LLAMA31_8B,
  messages: [
    {
      role: "user",
      content: "What is the weather in Tokyo?",
    },
    {
      role: "user",
      content: toolPrompt,
    },
  ],
  max_tokens: 1024,
  temperature: 0,
});

console.log(response.choices[0].message.content);
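The model replies with a raw string in the <function=...>...</function> format rather than a structured tool call, so we need to parse it ourselves. Below is a minimal sketch of the parseToolResponse helper we use in the rest of this notebook. The regex-based implementation is an assumption rather than the exact original helper, but its behavior matches what we rely on later: it returns null when no function call is found and throws when the arguments aren't valid JSON.

function parseToolResponse(content: string | null) {
  // Sketch of a parser for the <function=name>{"arg": "value"}</function>
  // format described above. Returns null if the model didn't emit a call.
  if (!content) {
    return null;
  }
  const match = content.match(/<function=(\w+)>(.*?)<\/function>/s);
  if (!match) {
    return null;
  }
  const [, name, argsJson] = match;
  try {
    return { name, args: JSON.parse(argsJson) };
  } catch (e) {
    // Surface malformed JSON arguments; callers decide how to handle it.
    console.error("Error parsing function arguments:", e);
    throw e;
  }
}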
At Braintrust, we maintain a suite of evaluator functions in the Autoevals library. Many of these evaluators, like Factuality, are "LLM-as-a-Judge"
evaluators that use a carefully crafted prompt to an LLM to reason about the quality of a response. We are big fans of tool calling and use it extensively in autoevals to reliably parse the scores and reasoning those judges produce.
As we change autoevals, we run evals to make sure we improve performance and avoid regressing key scenarios. Here, we'll run some of those evals to assess how well LLaMa 3.1 stacks up against GPT-4o.
Here is a quick example of the Factuality scorer, a popular LLM-as-a-Judge evaluator that uses the following prompt:
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{{input}}}
************
[Expert]: {{{expected}}}
************
[Submission]: {{{output}}}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
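Each letter choice maps to a numeric score via the template's choice_scores, which we'll also use below when parsing LLaMa's tool calls. As a rough sketch of that mapping (C, B, and D are consistent with the outputs shown later in this post; treat the values for A and E as illustrative):

// Approximate choice-to-score mapping from the Factuality template.
const choiceScores: Record<string, number> = {
  A: 0.4, // subset of the expert answer, fully consistent
  B: 0.6, // superset of the expert answer, fully consistent
  C: 1, // contains all the same details
  D: 0, // disagreement with the expert answer
  E: 1, // differences don't matter for factuality
};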
import { Factuality } from "autoevals";

console.log(
  await Factuality({
    input: "What is the weather in Tokyo?",
    output: "The weather in Tokyo is scorching.",
    expected: "The weather in Tokyo is extremely hot.",
  })
);
{
  name: 'Factuality',
  score: 1,
  metadata: {
    rationale: '1. The expert answer states that the weather in Tokyo is "extremely hot."\n' +
      '2. The submitted answer states that the weather in Tokyo is "scorching."\n' +
      '3. Both "extremely hot" and "scorching" convey the same factual content, indicating very high temperatures.\n' +
      '4. There is no additional information in either answer that would make one a subset or superset of the other.\n' +
      '5. Therefore, the submitted answer contains all the same details as the expert answer.',
    choice: 'C'
  }
}
Now let's reproduce this with LLaMa 3.1.
import { templates } from "autoevals";
import * as yaml from "js-yaml";
import mustache from "mustache";

const template = yaml.load(templates["factuality"]);

const selectTool = {
  name: "select_choice",
  description: "Call this function to select a choice.",
  parameters: {
    properties: {
      reasons: {
        description:
          "Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.",
        title: "Reasoning",
        type: "string",
      },
      choice: {
        description: "The choice",
        title: "Choice",
        type: "string",
        enum: Object.keys(template.choice_scores),
      },
    },
    required: ["reasons", "choice"],
    title: "CoTResponse",
    type: "object",
  },
};

async function LLaMaFactuality({
  model,
  input,
  output,
  expected,
}: {
  model: string;
  input: string;
  output: string;
  expected: string;
}) {
  const toolPrompt = `You have access to the following functions:

Use the function '${selectTool.name}' to '${selectTool.description}':
${JSON.stringify(selectTool)}

If you choose to call a function ONLY reply in the following format with no prefix or suffix:
<function=example_function_name>{"example_name": "example_value"}</function>

Reminder:
- If looking for real time information use relevant functions before falling back to brave_search
- Function calls MUST follow the specified format, start with <function= and end with </function>
- Required parameters MUST be specified
- Only call one function at a time
- Put the entire function call reply on one line
`;

  const response = await client.chat.completions.create({
    model,
    messages: [
      {
        role: "user",
        content: mustache.render(template.prompt, {
          input,
          output,
          expected,
        }),
      },
      {
        role: "user",
        content: toolPrompt,
      },
    ],
    temperature: 0,
  });

  try {
    const parsed = parseToolResponse(response.choices[0].message.content);
    return {
      name: "Factuality",
      score: template.choice_scores[parsed?.args.choice],
      metadata: {
        rationale: parsed?.args.reasons,
        choice: parsed?.args.choice,
      },
    };
  } catch (e) {
    return {
      name: "Factuality",
      score: null,
      metadata: {
        error: `${e}`,
      },
    };
  }
}

console.log(
  await LLaMaFactuality({
    model: LLAMA31_8B,
    input: "What is the weather in Tokyo?",
    output: "The weather in Tokyo is scorching.",
    expected: "The weather in Tokyo is extremely hot.",
  })
);
{
  name: 'Factuality',
  score: 0.6,
  metadata: {
    rationale: "The submitted answer 'The weather in Tokyo is scorching' is not a subset of the expert answer 'The weather in Tokyo is extremely hot' because 'scorching' is not a synonym of 'extremely hot'. However, 'scorching' is a synonym of 'extremely hot' in the context of describing hot weather. Therefore, the submitted answer is a superset of the expert answer and is fully consistent with it.",
    choice: 'B'
  }
}
Ok, interesting! The tool call parses, but the response is a little different from GPT-4o's. Let's put this to the test at scale with some evals.
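The cell that produced the comparison below isn't reproduced here, but it was essentially a side-by-side check of the two judges on a single case from our test data. A sketch (the strings are inferred from the output that follows, so treat them as illustrative):

const example = {
  // Hypothetical case, reconstructed from the output below.
  input: "What color was Cotton?",
  output: "in a barn",
  expected: "white",
};

console.log("Factuality", await Factuality(example));
console.log(
  "LLaMa-3.1-8B Factuality",
  await LLaMaFactuality({ model: LLAMA31_8B, ...example })
);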
Factuality {
  name: 'Factuality',
  score: 0,
  metadata: {
    rationale: '1. The question asks about the color of Cotton.\n' +
      "2. The expert answer is 'white,' which directly addresses the color of Cotton.\n" +
      "3. The submitted answer is 'in a barn,' which does not address the color of Cotton at all.\n" +
      '4. Since the submitted answer does not provide any information about the color of Cotton, it conflicts with the expert answer.\n' +
      '\n' +
      'Therefore, there is a disagreement between the submitted answer and the expert answer.',
    choice: 'D'
  }
}
LLaMa-3.1-8B Factuality {
  name: 'Factuality',
  score: undefined,
  metadata: { rationale: undefined, choice: undefined }
}
Let's run a full eval with GPT-4o, LLaMa-3.1-8B, LLaMa-3.1-70B, and LLaMa-3.1-405B to see how they stack up. Since the evaluator generates a number
between 0 and 1, we'll use the NumericDiff scorer to assess accuracy and a custom NonNull scorer to track how often a model fails to produce a valid tool call.
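The full harness isn't shown here, but a rough sketch of the setup looks like the following. The dataset shape, the tiny inline example, and the per-model loop are assumptions for illustration; Eval comes from braintrust, and NumericDiff comes from autoevals.

import { Eval } from "braintrust";
import { NumericDiff } from "autoevals";

// Custom scorer: did the judge produce a usable score at all? Parsing
// failures surface as a null/undefined score, which we count as 0.
const NonNull = ({ output }: { output: { score: number | null } }) => ({
  name: "NonNull",
  score: output?.score === null || output?.score === undefined ? 0 : 1,
});

// A tiny hypothetical dataset; the real eval ran over a much larger set of
// (question, submission, expert answer, expected score) cases.
const factualityDataset = [
  {
    input: {
      input: "What color was Cotton?",
      output: "in a barn",
      expected: "white",
    },
    expected: 0, // a conflicting answer should score 0
  },
];

for (const model of ["gpt-4o", LLAMA31_8B, LLAMA31_70B, LLAMA31_405B]) {
  await Eval("LLaMa-3.1-Tools", {
    experimentName: model,
    data: () => factualityDataset,
    task: async (input) =>
      model === "gpt-4o"
        ? await Factuality({ ...input, model })
        : await LLaMaFactuality({ ...input, model }),
    scores: [
      ({ output, expected }) =>
        NumericDiff({ output: output.score ?? 0, expected }),
      NonNull,
    ],
  });
}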
It looks like GPT-4o does pretty well. Tool calling has been a highlight of OpenAI's feature set for a while, so it's not surprising that it's able to successfully parse 100% of the tool calls.
Error parsing function arguments: SyntaxError: Expected double-quoted property name in JSON at position 36 (line 1 column 37)
    at JSON.parse (<anonymous>)
    at Proxy.parseToolResponse (evalmachine.<anonymous>:9:31)
    at Proxy.LLaMaFactuality (evalmachine.<anonymous>:97:32)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.task (evalmachine.<anonymous>:5:33)
    at async rootSpan.traced.name (/Users/ankur/projects/braintrust/cookbook/content/node_modules/.pnpm/braintrust@0.0.145_openai@4.52.7_react@18.3.1_svelte@4.2.18_vue@3.4.32_zod@3.23.8/node_modules/braintrust/dist/index.js:4488:26)
    at async callback (/Users/ankur/projects/braintrust/cookbook/content/node_modules/.pnpm/braintrust@0.0.145_openai@4.52.7_react@18.3.1_svelte@4.2.18_vue@3.4.32_zod@3.23.8/node_modules/braintrust/dist/index.js:4484:11)
    at async /Users/ankur/projects/braintrust/cookbook/content/node_modules/.pnpm/braintrust@0.0.145_openai@4.52.7_react@18.3.1_svelte@4.2.18_vue@3.4.32_zod@3.23.8/node_modules/braintrust/dist/index.js:4619:16

 ████████████████████████░░░░░░░░░░░░░░░░ | LLaMa-3.1-Tools [experimentName=meta-... | 60% | 60/100 datapoints
Ok, let's dig into the results. To start, we'll look at how LLaMa-3.1-8B compares to GPT-4o.
Although it's a fraction of the cost, LLaMa-3.1-8B is both slower (likely due to rate limits) and performs worse than GPT-4o: 12 of the 60 cases failed to parse. Let's take a look at one of those failures in depth.
That definitely looks like an invalid tool call. Maybe we can experiment with tweaking the prompt to get better results.
In just a few minutes, we've cracked the code on how to perform tool calls with LLaMa 3.1 models and run a benchmark comparing their performance to GPT-4o. Along the way, we
found a few specific areas for improvement, such as parsing errors in tool calls, and a surprising outcome: LLaMa-3.1-70B performs better than both LLaMa-3.1-405B and GPT-4o at a
fraction of the cost.
To explore this further, you could:
- Expand the benchmark to measure other kinds of evaluators.
- Try providing few-shot examples (see the sketch below) or fine-tuning the models to improve their performance.
- Play with other models, like GPT-4o-mini or Claude, to see how they compare.
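For instance, a few-shot nudge could be as simple as prepending one example exchange that demonstrates a well-formed call. Here's a hypothetical sketch reusing the weather toolPrompt from earlier (the Paris exchange is made up for illustration):

const fewShotResponse = await client.chat.completions.create({
  model: LLAMA31_8B,
  messages: [
    { role: "user", content: toolPrompt },
    // Hypothetical few-shot exchange showing the model a well-formed tool call.
    { role: "user", content: "What is the weather in Paris?" },
    {
      role: "assistant",
      content:
        '<function=get_current_weather>{"location": "Paris, France"}</function>',
    },
    { role: "user", content: "What is the weather in Tokyo?" },
  ],
  max_tokens: 1024,
  temperature: 0,
});

console.log(fewShotResponse.choices[0].message.content);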