Table 3:

Performance metrics (macro-F1, micro-F1, sensitivity, specificity, and precision) for AI-VTRA/AI-VTRA_ET predictions of radiologist-based response assessment. Within each category, we binarized the BT-RADS scores and the AI predictions against the target score (target score versus all other scores) and computed the metrics.

	Imaging Improvement (BT-RADS 1)		No Significant Imaging Change (BT-RADS 2)		Imaging Worsening (BT-RADS 3)		Imaging Worsening Equivalent to RANO Progression (BT-RADS 4)
	AI-VTRA_ET	AI-VTRA	AI-VTRA_ET	AI-VTRA	AI-VTRA_ET	AI-VTRA	AI-VTRA_ET	AI-VTRA
Macro-F1	0.747	0.755	0.760	0.750	0.561	0.587	0.705	0.705
Micro-F1	0.857	0.870	0.765	0.757	0.695	0.689	0.831	0.831
Sensitivity	0.747	0.700	0.793	0.746	0.222	0.298	0.596	0.596
Specificity	0.873	0.895	0.746	0.765	0.920	0.875	0.872	0.872
Precision	0.474	0.526	0.672	0.675	0.568	0.530	0.450	0.450
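
The per-category binarization described in the caption can be illustrated with a minimal sketch. This is not the study's code: the function names and the toy labels are hypothetical, and it assumes that macro-F1 is the unweighted mean of the F1 scores of the binarized positive and negative classes and that micro-F1 over both binarized classes equals accuracy.

```python
from typing import Dict, Sequence


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from counts; returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def ratio(num: int, denom: int) -> float:
    """Safe division; returns 0.0 when the denominator is zero."""
    return num / denom if denom else 0.0


def binarized_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], target: int
) -> Dict[str, float]:
    """One-vs-rest metrics for a single target score (hypothetical helper)."""
    # Binarize: 1 if the score equals the target category, else 0.
    t = [int(s == target) for s in y_true]
    p = [int(s == target) for s in y_pred]

    tp = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
    tn = sum(1 for a, b in zip(t, p) if a == 0 and b == 0)
    fp = sum(1 for a, b in zip(t, p) if a == 0 and b == 1)
    fn = sum(1 for a, b in zip(t, p) if a == 1 and b == 0)

    f1_pos = f1(tp, fp, fn)
    # For the negative class, TN plays the role of TP and FP/FN swap.
    f1_neg = f1(tn, fn, fp)

    return {
        "macro_f1": (f1_pos + f1_neg) / 2,    # assumption: mean over both classes
        "micro_f1": ratio(tp + tn, len(t)),   # assumption: equals accuracy here
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Made-up toy labels for illustration only; not data from the study.
radiologist_scores = [1, 2, 2, 3, 4, 2, 1, 4]
ai_scores = [1, 2, 3, 3, 4, 2, 2, 2]

for target in (1, 2, 3, 4):
    metrics = binarized_metrics(radiologist_scores, ai_scores, target)
    print(target, {name: round(value, 3) for name, value in metrics.items()})
```

Repeating the call for each target score, once per model, would fill the eight columns of a table laid out like the one above.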