Verify that a new model version is better

When a new model version is trained, you need to check whether it actually performs better than the current one before promoting it. This guide walks through using EyVz to run both models against the same set of images and compare the results.

The workflow is:

  1. Select a representative set of images
  2. Run both models against those images
  3. Review the results, focusing on where the models disagree
  4. Decide whether to promote the new version
To select the images:

  1. Open the Home view and filter for a representative set of images. Choose images that cover the range of conditions the model should handle: different products, defect types, and lighting conditions.

  2. Include both easy and hard cases. Filter for annotated images so you have a human-verified ground truth to compare against. Consider including images that the current model struggles with. One way to assemble such a set programmatically is sketched after this list.

  3. Select the images. Use Select all or manually select a representative sample. The number of selected images is shown in the filter pane.

  4. Click “Compare selected” to open the comparison page.
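
If you would rather assemble the evaluation set programmatically than through the filters, the following is a minimal Python sketch of drawing a balanced, annotated sample. It is only an illustration under assumed inputs: the metadata list and its field names (id, product, is_annotated) are hypothetical, not part of EyVz.

    import random
    from collections import defaultdict

    def pick_representative_sample(images, per_group=20, seed=0):
        """Draw an annotated sample that covers every product group.

        `images` is a hypothetical list of metadata dicts such as
        {"id": "img_001", "product": "widget-a", "is_annotated": True}.
        """
        random.seed(seed)
        groups = defaultdict(list)
        for img in images:
            if img["is_annotated"]:               # keep only human-verified images
                groups[img["product"]].append(img)

        sample = []
        for product, members in groups.items():
            k = min(per_group, len(members))      # up to per_group images per product
            sample.extend(random.sample(members, k))
        return sample

Grouping by defect type or lighting condition instead of product works the same way.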

To run the comparison:

  1. Choose models. On the comparison page, select the models you want to compare. At minimum, select the current production model and the new candidate.

  2. Set parameters. Adjust the sleep between entries and the batch size if needed. The default values work for most cases; what these two parameters control is illustrated in the sketch after this list.

  3. Click “Run Comparison” to start. The run appears in the Runs view.
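
To make the two parameters concrete, here is a rough Python sketch of the kind of loop a comparison run performs: images are scored in batches, with a pause between entries so the inference service is not overloaded. It is an illustration only; run_inference is a hypothetical stand-in supplied by the caller, not an EyVz function.

    import time

    def compare_models(image_ids, models, run_inference,
                       batch_size=8, sleep_between_entries=0.5):
        """Illustrative loop: score every image with every model, batch by batch.

        `models` is a list of model names and `run_inference(model, image_id)`
        is a caller-supplied stand-in for however predictions are produced.
        """
        results = []
        for start in range(0, len(image_ids), batch_size):
            # batch size controls how many images are taken per chunk
            batch = image_ids[start:start + batch_size]
            for image_id in batch:
                entry = {"image_id": image_id}
                for model in models:
                    entry[model] = run_inference(model, image_id)
                results.append(entry)
                time.sleep(sleep_between_entries)  # the "sleep between entries" setting
        return results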

To monitor the run:

  1. Open the Runs view. The table shows your comparison run with a progress bar.

  2. Wait for completion or click View to watch progress in real time. The detail page auto-refreshes every 5 seconds.

  3. Once the run completes, the status turns green and the results table populates.

The results table on the run detail page is your primary tool for evaluation. Entries are sorted by disagreement — the images where models differ most appear first.

  1. Start with the “Difference” column. Look at how often the models agree (“All Match”) versus disagree. A high rate of agreement suggests the models behave similarly. Frequent disagreement means the new model has learned something different. (A way to tally agreement, confidence differences, and regressions outside the UI is sketched after this list.)

  2. Check the “Conf. diff” column. Even when models agree on the label, large confidence differences reveal that one model is more certain than the other. Green (under 10%) means similar confidence. Red (over 30%) means one model is significantly more certain than the other.

  3. Open disagreeing images. Click View on entries where the models differ. The annotation view shows the original image with each model’s predictions overlaid. Use the Showing dropdown to switch between models and see their predictions side by side.

  4. Compare against the human annotation. Switch to Annotation mode to see the human-verified label. Check which model agrees with the human — that model is correct for this image.

  5. Look for regressions. Pay special attention to images that the current model gets right but the new model gets wrong. Even a single regression on a critical defect type may outweigh improvements elsewhere.
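
If you want to tally the same signals outside the results table, the sketch below computes the agreement rate, buckets confidence differences using the same bands as the “Conf. diff” column, and lists regressions against the human annotation. The input format is an assumption: each model's output maps an image id to a (label, confidence) pair, and ground_truth maps an image id to the human-verified label.

    def review_comparison(current, candidate, ground_truth):
        """Summarise agreement, confidence differences, and regressions.

        `current` and `candidate` map image id -> (label, confidence);
        `ground_truth` maps image id -> human-verified label.
        """
        matches = 0
        regressions = []
        conf_bands = {"under_10%": 0, "10-30%": 0, "over_30%": 0}

        for image_id, truth in ground_truth.items():
            cur_label, cur_conf = current[image_id]
            new_label, new_conf = candidate[image_id]

            if cur_label == new_label:
                matches += 1

            # Bucket the confidence difference the way the "Conf. diff" column does.
            diff = abs(cur_conf - new_conf)
            if diff < 0.10:
                conf_bands["under_10%"] += 1
            elif diff > 0.30:
                conf_bands["over_30%"] += 1
            else:
                conf_bands["10-30%"] += 1

            # A regression: the current model matches the human label, the candidate does not.
            if cur_label == truth and new_label != truth:
                regressions.append(image_id)

        return {
            "agreement_rate": matches / len(ground_truth),
            "confidence_bands": conf_bands,
            "regressions": regressions,
        }

Anything listed under regressions deserves a manual look in the annotation view, as described in step 5.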

After reviewing the comparison results, you should be able to answer:

  • Does the new model fix known problems? Check images that the current model struggles with. If the new model handles them better, that is a meaningful improvement.
  • Does the new model introduce new problems? Look for regressions — images the current model handles correctly but the new one does not.
  • How do confidence levels compare? If the new model is consistently more confident on correct predictions and less confident on incorrect ones, that is a sign of better calibration (a quick way to check this is sketched after this list).
  • Is the improvement worth the risk? A small improvement with no regressions is usually safe to promote. A large improvement with some regressions requires judgment.
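
Calibration in the sense used above can be checked with a simple per-model summary: average confidence on correct predictions versus average confidence on incorrect ones. The sketch below reuses the hypothetical (label, confidence) prediction format from the earlier sketch.

    def confidence_summary(predictions, ground_truth):
        """Mean confidence on correct vs. incorrect predictions for one model.

        `predictions` maps image id -> (label, confidence);
        `ground_truth` maps image id -> human-verified label.
        """
        correct, incorrect = [], []
        for image_id, truth in ground_truth.items():
            label, confidence = predictions[image_id]
            (correct if label == truth else incorrect).append(confidence)

        def mean(values):
            return sum(values) / len(values) if values else float("nan")

        return {"mean_conf_correct": mean(correct),
                "mean_conf_incorrect": mean(incorrect)}

Run it once for each model; a better-calibrated model shows a larger gap between the two numbers, with high confidence on correct predictions and low confidence on incorrect ones.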

Once you decide to promote the new model:

  1. Revisit the Thresholds Statistics page. The new model may need different confidence thresholds than the previous one; a rough way to scan candidate thresholds is sketched after this list.
  2. Monitor performance using the Home view’s filters. Check low-confidence predictions and watch for unexpected behavior.
  3. Keep annotating — your ongoing corrections continue to improve future model versions.
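
For step 1, one rough way to get a starting point before fine-tuning on the Thresholds Statistics page is to scan candidate thresholds on the annotated comparison set and look at the coverage/precision trade-off. The sketch below again assumes the hypothetical (label, confidence) prediction format; it is not an EyVz feature.

    def threshold_table(predictions, ground_truth,
                        thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
        """For each candidate threshold, report how many predictions are kept
        (coverage) and how many of the kept ones are correct (precision).

        `predictions` maps image id -> (label, confidence);
        `ground_truth` maps image id -> human-verified label.
        """
        rows = []
        for threshold in thresholds:
            kept = [(label, ground_truth[image_id])
                    for image_id, (label, confidence) in predictions.items()
                    if confidence >= threshold]
            coverage = len(kept) / len(predictions) if predictions else 0.0
            precision = (sum(label == truth for label, truth in kept) / len(kept)
                         if kept else float("nan"))
            rows.append({"threshold": threshold,
                         "coverage": round(coverage, 3),
                         "precision": round(precision, 3)})
        return rows

A threshold where precision is acceptable and coverage has not collapsed is a reasonable starting value; confirm and adjust it on the Thresholds Statistics page.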