This article was automatically translated from the original Turkish version.
In machine learning and artificial intelligence systems, model training and testing are the processes by which a data-driven system acquires learning capability and the accuracy of that learning is evaluated. Model training aims for an algorithm to learn patterns from given labeled data, while model testing measures how well that learning applies to new, real-world data. These processes may vary across supervised, unsupervised, and reinforcement learning methods, but they share a similar fundamental structure.
Model training typically consists of several stages: collecting and preprocessing data, selecting a model, iteratively optimizing its parameters against a loss function, and validating the result on held-out data.
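The training stages can be sketched as a minimal loop. The example below fits a line y = w·x + b to toy data with batch gradient descent; the data, learning rate, and epoch count are illustrative choices, not prescriptions from the article.

```python
# Toy dataset following the relation y = 2x (illustrative)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w, b = 0.0, 0.0   # model parameters, initialized at zero
lr = 0.02         # learning rate (a hyperparameter)

for epoch in range(2000):
    # forward pass: predictions under current parameters
    preds = [w * x + b for x in xs]
    # error of each prediction against its label
    errs = [p - y for p, y in zip(preds, ys)]
    # backward pass: gradients of mean squared error w.r.t. w and b
    grad_w = 2 * sum(e * x for e, x in zip(errs, xs)) / len(xs)
    grad_b = 2 * sum(errs) / len(xs)
    # parameter update step
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))
```

After training, w converges near 2 and b near 0, recovering the underlying pattern in the data.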
The model testing phase aims to measure the performance of a trained algorithm on a dataset it has never encountered before. The metrics used in this process evaluate how well the model generalizes the knowledge it has learned—that is, its generalization capability. Model performance is assessed through both quantitative and qualitative analysis.
Accuracy: The ratio of total correct predictions to the total number of samples. It is a meaningful metric for datasets with balanced class distributions. However, it can be misleading in imbalanced datasets. For example, in a dataset where 95% of samples belong to the negative class, if all predictions are negative, accuracy will be 95%, even though the model has not learned anything meaningful.
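The imbalanced-dataset pitfall described above can be reproduced in a few lines. The labels below mirror the 95%-negative example from the text; a "model" that always predicts the negative class still scores 95% accuracy despite learning nothing.

```python
y_true = [0] * 95 + [1] * 5   # 95% negative, 5% positive samples
y_pred = [0] * 100            # degenerate model: always predict negative

# accuracy = correct predictions / total samples
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95
```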
Precision: Indicates what proportion of samples predicted as positive are truly positive. It is especially important in scenarios where false positives are costly—for example, in spam filters.
Precision = TP / (TP + FP)
Recall (Sensitivity): Indicates what proportion of truly positive samples were correctly predicted by the model. It is critical in scenarios where missing a positive case is costly—for example, in medical diagnosis.
Recall = TP / (TP + FN)
F1 Score: The harmonic mean of Precision and Recall. It is used to assess balanced model performance. The F1 score ranges from 0 to 1. Scores of 0.8 and above typically represent successful models. Scores between 0.6 and 0.8 are considered acceptable, while scores below 0.6 indicate models that generally require improvement.
F1 = 2 · (Precision · Recall) / (Precision + Recall)
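The three formulas above can be computed directly from the confusion-matrix counts, using plain Python on illustrative labels (no library assumed):

```python
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # ground-truth labels (illustrative)
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # model predictions (illustrative)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here the model catches only half the positives (recall 0.5), so the F1 score sits below both headline counts, illustrating how the harmonic mean penalizes imbalance between precision and recall.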
ROC-AUC Curve (Receiver Operating Characteristic – Area Under Curve): Measures the model’s ability to distinguish between classes. The ROC curve plots the true positive rate (Recall) against the false positive rate. An AUC score of 0.5 indicates random guessing. Models with an AUC greater than 0.8 are generally considered strong.
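AUC has an equivalent probabilistic reading: it is the probability that a randomly chosen positive sample receives a higher model score than a randomly chosen negative one (the Mann-Whitney formulation). The sketch below computes it that way on illustrative scores, without plotting a curve.

```python
pos_scores = [0.9, 0.8, 0.4]  # model scores for positive samples (illustrative)
neg_scores = [0.7, 0.3, 0.2]  # model scores for negative samples (illustrative)

# Count positive/negative pairs where the positive sample scores higher;
# ties count as half a win.
wins = 0.0
for p in pos_scores:
    for n in neg_scores:
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5

auc = wins / (len(pos_scores) * len(neg_scores))
print(round(auc, 3))
```

An AUC of 0.5 would mean the positive and negative score distributions are indistinguishable, matching the "random guessing" baseline mentioned above.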
Loss Function: A function that measures the difference between the model’s predicted values and the true values. It quantifies how “wrong” the model is during training and testing. For example, Mean Squared Error (MSE) is commonly used in regression models, while Binary Cross Entropy is frequently used in binary classification problems.
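The two loss functions named above can be written out directly from their definitions; the input values below are illustrative.

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error, commonly used for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Binary Cross Entropy; y_prob holds predicted probabilities in (0, 1)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

print(mse([3.0, 5.0], [2.5, 5.5]))                       # 0.25
print(round(binary_cross_entropy([1, 0], [0.9, 0.2]), 4))
```

Both losses shrink toward zero as predictions approach the true values, which is what the optimizer exploits during training.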
Overfitting: This occurs when a model learns the training data too well and fails to generalize to test data. In this case, the model exhibits very low error on training data but high error on new data. A typical sign of overfitting is low training loss but high validation or test loss.
Solutions: collect more data; apply regularization (such as L1/L2 penalties or dropout); simplify the model; or stop training early once validation loss begins to rise.
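One common remedy for overfitting is early stopping, which exploits the telltale sign described above: validation loss stops improving and starts rising while training loss keeps falling. The helper below is a minimal sketch with an illustrative loss sequence; the function name and patience value are assumptions, not from the article.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which to stop training, or None if no stop triggers."""
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch   # new best validation loss
        elif epoch - best_epoch >= patience:
            return epoch                     # no improvement for `patience` epochs
    return None

# Validation loss falls, then rises: the classic overfitting signature.
losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.61, 0.65]
print(early_stop_epoch(losses))  # 6
```

Stopping at that epoch and restoring the parameters from the best validation epoch keeps the model at its most generalizable point.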
Underfitting: This occurs when the model fails to capture the underlying pattern in both training and test data. It is usually due to insufficient model complexity or inadequate training time.
Model training and testing are fundamental processes that determine the accuracy and reliability of artificial intelligence projects. The quality of training, data integrity, and the correctness of testing protocols directly affect the success of the application. Therefore, the model development process must proceed iteratively and be continuously evaluated from both technical and ethical perspectives.