
All about test suites

This document looks at all the ins and outs of creating test suites.

The step being validated by the test suite in this example is the same one described in A step in detail.

A test suite is composed of one or more tests. A test is defined by three things:

  • one or more validators
  • a workflow (or step) and version to run those validators against
  • a specific input value to use in the test
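
Putting those together, a minimal test definition might look like the following sketch (the width/length input and area validator are illustrative; the workflow or step and version to run against are supplied separately, for example via the --workflow flag shown later in this document):

{"inputs": [{"width": 10, "length": 12}],
 "validators": [{"regex": "12.*", "jpath": "$.area"}]}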

Input

In order to have repeatable tests, it is necessary to specify the input(s) for each test. For example:


{/* single input */}
{
  "inputs": [
    {"width": 10, "length": 12}
  ]
}

{/* multiple inputs, each tested separately */}
{
  "inputs": [
    {"width": 10, "length": 12},
    {"width": 5, "length": 10}
  ]
}
note

Some examples in this document are shown in full (multiline) JSON for clarity, but when it comes time to use the tests, they will be in single-line JSON for use on the command line or in a JSONL file (see below for more).
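
If it's easier to author tests as readable multiline JSON first, a few lines of Python can collapse them into the single-line form. This is a convenience sketch, not part of the Sandgarden tooling:

import json

# Read a pretty-printed test and emit the single-line form suitable
# for the command line or a JSONL file.
pretty = '''
{
  "inputs": [
    {"width": 10, "length": 12}
  ]
}
'''
print(json.dumps(json.loads(pretty)))
# -> {"inputs": [{"width": 10, "length": 12}]}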

Types of validators

Sandgarden currently supports two types of batch test validators: regex and workflow. A single test may contain multiple validators, as validators is a JSON list.

regex

A regex validator is exactly what it sounds like: a regular expression. If the expression specified by the validator is matched somewhere in the workflow/step output, then the test passes.

A regex validator can be fine-tuned with the jpath parameter, which restricts the regular expression to matching only the portion of the output selected by a JSONPath expression.

{/* Simple regex */}
{"validators": [
  {
    "regex": "e.+\\s+"
  }
]}

{/* With jpath */}
{"validators": [
  {
    "regex": "e.+\\s+",
    "jpath": "$.response"
  }
]}

The first validator looks for the specified regex anywhere in the data returned by the workflow, while the second requires the regex to match within the $.response JSONPath.
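
To see what this pattern matches in isolation, here is a quick Python check (illustrative only; Sandgarden evaluates the regex against the workflow output for you):

import re

# The pattern from the examples above: a literal "e", one or more
# characters, then at least one whitespace character.
pattern = re.compile(r"e.+\s+")

print(bool(pattern.search("Hello there, world! ")))  # True
print(bool(pattern.search("Hi")))                    # False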

workflow

While the regex validator is a good simple check for specific output characteristics, some workflows and steps may require validation with something more complex than a regular expression. Here's where things get meta: validator-specific workflows/steps can be created to test another workflow/step. Here's how it works:

  1. Write a new workflow/step with the following characteristics:
    • It takes the output block of the workflow under test as its input
    • It outputs a short JSON block: {"passed": true} (or false)
  2. Push the validator workflow/step to Sandgarden
  3. Invoke the validator workflow/step as follows:
{"validators": [
{
"resource": "workflow-or-step:version-or-tag"
}
]}

Example

Here's a very simple 'brevity checker' step that validates whether a response is 50 words or fewer.

steps/brevity.py
def brevity_handler(input, sandgarden):
    # Count the words in the answer; pass if there are 50 or fewer
    answer = input['answer']
    wc = len(answer.split())
    return {"passed": wc <= 50}

These validator steps/workflows can be very tiny. All that is really needed is the fixed handler signature of (input, sandgarden) and returned JSON with passed as a key and true or false as the value. Push the step:

sand steps push docker \
--json \
--name brevity-checker \
--entrypoint brevity.brevity_handler \
--tag latest \
--file steps/brevity.py

{
  "id": "wrk_01jba9qp49fgf83a3mnky426af",
  "name": "brevity-checker",
  "version": 1,
  "awsLambda": {
    "entrypoint": "brevity.brevity_handler",
    "baseImage": "public.ecr.aws/lambda/python:3.12",
    "role": "arn:aws:iam::011528256002:role/DemoLambdaExecutionRole",
    "memorySizeMb": 128,
    "timeoutSeconds": 20
  },
  "modules": null
}

This test step can now be invoked as follows:

{"validators": [
{
"resource": "brevity-checker:latest"
}
]}

From the command line

A batch test can be specified using the --runs flag on sand runs start.

note

When putting regular expressions into the command line, note that additional backslashes may be necessary to prevent premature evaluation.

$ sand runs start --workflow hello-world-workflow-2:latest --runs '{"inputs": [{"question":"How does a normal person say hello to the world?"}], "validators":[{"regex":"ello"},{"regex":"orld"},{"resource":"brevity-workflow:latest"}]}'

From a JSONL file

Because the above can get unwieldy very quickly when specifying multiple inputs and validators, there is also the option to provide a JSONL file that contains batch tests in the following format, one per line:

{"inputs":[{"input-field":"input value"}],"validators":[{"regex":"regex1","jpath":"$path1"}]}

For example, the test from the command above, plus a second test in the same style, can be represented in a JSONL file as follows.

tests.jsonl
{"inputs":[{"question":"How does a normal person say hello to the world?"}],"validators":[{"regex":"ello"},{"regex":"orld"},{"resource":"brevity-workflow:latest"}]}
{"inputs":[{"question":"What is reputed to have crashed at Area 51?"}],"validators":[{"regex":"lying"},{"regex":"aucer"},{"resource":"brevity-workflow:latest"}]}

To run the tests in a JSONL file, pass the file name to the same flag (--runs) at the command line.

sand runs start --workflow hello-world-workflow-2:latest --runs tests.jsonl

Comparing

Once there is a working test batch for a workflow, it can be used to ensure that any changes to the workflow don't go in the wrong direction. sand batches compare provides an easy interface for comparing two batch runs, for example between a baseline workflow and an updated one with an improved prompt.

$ sand batches compare run_01jbaab2bafgf94p6kqf00azwr run_01jbaag2txfgfa2wwwkj5ewkrk
╭──────────────────────────────────────────╮  ╭──────────────────────────────────────────╮
│ Batch run_01jbaab2bafgf94p6kqf00azwr     │  │ Shared tests between batches             │
│                                          │  │                                          │
│ Passed        0                          │  │ Passing       0                          │
│ Failed        2                          │  │ Failing       2                          │
│ Total         2                          │  │ Total         2                          │
│                                          │  │ -----                                    │
│ Percentage    0.00%                      │  │ Percentage    0.00%                      │
│ ──────────────────────────────────────── │  │                                          │
│ Batch run_01jbaag2txfgfa2wwwkj5ewkrk     │  │ Improvements  0/0 (0.00%)                │
│                                          │  │ Regressions   0/0 (0.00%)                │
│ Passed        0                          │  ╰──────────────────────────────────────────╯
│ Failed        2                          │
│ Total         2                          │
│                                          │
│ Percentage    0.00%                      │
╰──────────────────────────────────────────╯

Saving test suites

Once you have a test suite you're satisfied with, it makes sense to save it in the Sandgarden test suite library for easier access, modification, and reuse. This can be accomplished via the web UI or the CLI.

More to come

Lifecycle of a test suite

Start with workflow v1

When you have a workflow that is ready for testing, prepare the test suite.

$ sand workflows push --name=test-workflow --file=./examples/workflows/multifile-aws --entrypoint="handler.handler" 

✅ Workflow pushed successfully!
Workflow ID: wrk-1234
Workflow Version: 1

Run tests and review results

Run the test suite against the workflow for baseline results.

$ sand runs start --workflow=test-workflow:1 --runs '{"inputs":[{"width":10,"length":12}],"validators":[{"regex":"12.*","jpath":"$.area"}]}'

Batch created:
Batch ID: run_123
Status: running

Push update to workflow v2

Update your workflow, then re-run the test suite.

$ sand workflows push --name=test-workflow --file=./examples/workflows/multifile-aws --entrypoint="handler.handler" 

✅ Workflow pushed successfully!
Workflow ID: wrk-1234
Workflow Version: 2
$ sand runs start --workflow=test-workflow:2 --runs '{"inputs":[{"width":10,"length":12}],"validators":[{"regex":"12.*","jpath":"$.area"}]}'

Batch created:
Batch ID: run_124
Status: running

Compare results

Finally, compare the results of the two workflow versions with sand batches compare. This will show you whether there were improvements, regressions, or no changes from the earlier version of the workflow.

$ sand batches compare $BATCH_ID_1 $BATCH_ID_2

Batch Results                                 Batch Comparison
╭──────────────────────────────────────────╮  ╭──────────────────────────────────────────╮
│ Batch run_01jb9z9brcenqa5m6t803956mp     │  │ Shared tests between batches             │
│                                          │  │                                          │
│ Passed        2                          │  │ Passing       2                          │
│ Failed        2                          │  │ Failing       0                          │
│ Total         4                          │  │ Total         2                          │
│                                          │  │ -----                                    │
│ Percentage    50.00%                     │  │ Percentage    100.00%                    │
│ ──────────────────────────────────────── │  │                                          │
│ Batch run_01jb9zc1psenr9q7fajp3y7p2r     │  │ Improvements  0/0 (0.00%)                │
│                                          │  │ Regressions   0/0 (0.00%)                │
│ Passed        4                          │  ╰──────────────────────────────────────────╯
│ Failed        0                          │
│ Total         4                          │
│                                          │
│ Percentage    100.00%                    │
│                                          │
│ (+2 tests added)   (50.00% passing)      │
│ (-2 tests removed) (0.00% passing)       │
╰──────────────────────────────────────────╯