Rethinking Clinical AI Applications in Stroke - Pitfalls, Misconceptions, and Directions for Responsible Use
Abstract
Artificial intelligence (AI), particularly deep learning, continues to advance in medical image analysis and clinical prediction. In stroke care—where timely, accurate decisions are critical—AI is seen as a promising tool, with potential to detect complex imaging patterns and enhance clinical workflows. However, practical application in real-world settings remains limited, often due to structural issues in model design, evaluation, and insufficient integration of clinical context. This narrative review examines common pitfalls in developing and applying AI models in stroke care. High performance alone does not ensure clinical value; what matters is whether the predicted target (label) is clinically meaningful and well-defined. If the label is ambiguous or fails to reflect the underlying clinical condition, even highly accurate models may produce misleading or unhelpful outputs. We also discuss limitations of current explainability tools and emphasize that lack of interpretability hinders trust and adoption in high-stakes decisions. Rather than functioning as autonomous decision-makers, AI models are better positioned as coordinators or accelerators—supporting, not replacing, clinical judgment. For responsible integration into practice, developers must disclose key aspects of the model, including training data, label definitions, and performance conditions. Clinicians, in turn, should be prepared to interpret evaluation metrics in the context of real-world care. Ultimately, clinical AI should focus not merely on maximizing performance but on solving problems relevant to clinical practice, with transparency and explainability as essential prerequisites for adoption.
INTRODUCTION
Recent advances in medical artificial intelligence (AI) have opened new possibilities in clinical care, including improvements in diagnostic accuracy, workflow efficiency, and outcome prediction. In particular, deep learning-based AI models have shown remarkable capabilities in processing high-dimensional, complex, and unstructured data—such as medical imaging—enabling the detection of subtle patterns that may elude human observers.1,2 Stroke care has been one of the most active areas for applying such AI technologies. Numerous studies have sought to estimate stroke onset time from imaging, quantify infarct extent, or determine eligibility for reperfusion therapies.3,4
However, in real-world practice, the adoption of AI tools remains challenging. Model outputs may conflict with clinical judgment, and the lack of interpretability often undermines trust among clinicians. A common misconception in medical AI is the belief that “as long as we have enough data, AI can solve any problem.” In reality, AI models are fundamentally tools that learn from human-defined tasks and labels. If the prediction target is not clinically meaningful—or if the training data are biased—the resulting model may achieve high accuracy but still be of limited or even misleading value in clinical settings.5,6
This review explores these issues in the context of stroke AI, highlighting fundamental structural pitfalls in current model development that hinder real-world utility. We focus on overlooked aspects such as the clinical appropriateness of label design, the imperative for interpretability, and the crucial role of clinicians in guiding AI development. Ultimately, we argue for a shift in perspective—from evaluating AI models solely by performance metrics to assessing their ability to address clinically meaningful problems and deliver real-world value.
WHY DEEP LEARNING SEEMS PROMISING IN STROKE
Deep learning has emerged as one of the most significant advancements in medical artificial intelligence in recent years, particularly in domains like stroke care where diagnosis and treatment decisions must be made based on complex imaging data. Its promise in stroke stems primarily from two core strengths.
First, deep learning demonstrates exceptional capability in processing unstructured data. Brain imaging—such as magnetic resonance imaging (MRI) and computed tomography (CT)—forms the foundation of stroke diagnosis and evaluation, yet these high-dimensional and structurally complex datasets pose significant challenges for traditional statistical models and conventional machine learning techniques. These earlier approaches often struggled with preprocessing, feature extraction, and interpretation. In contrast, deep learning models, especially convolutional neural networks, can automatically learn and extract spatial features from images.7,8 This enables the precise detection of lesions, blurred boundaries, or subtle patterns within complex scans. For example, subtle signal changes on diffusion-weighted imaging (DWI) and fluid-attenuated inversion recovery (FLAIR) images, the distribution of the ischemic core, or the loss of gray-white matter differentiation are features that even experienced radiologists may overlook. By learning from thousands of annotated examples, deep learning models can consistently recognize such features. Importantly, they do so with a level of consistency that is unaffected by external factors such as fatigue or individual expertise, making them clinically valuable.
Second, deep learning models can detect patterns that lie beyond the limits of human perception and traditional medical understanding. By capturing complex, high-dimensional, and nonlinear relationships in the data, these models reveal insights that clinicians may overlook.9 For instance, a model that accurately predicted previously undetected low ejection fraction from electrocardiogram (ECG) waveforms illustrates AI’s ability to identify subtle electrical signals that are imperceptible to human observers and indicative of critical cardiac conditions.10 The same principle applies to stroke imaging. Deep learning models can analyze pixel-level intensity distributions, texture differences, and contrast with surrounding tissues that may not be consciously perceived by human observers. This allows AI to potentially outperform humans in early lesion detection, infarct growth prediction, and patient selection. In acute stroke care—where every minute counts—AI’s ability to provide fast, consistent, and objective assessments can play a key role in accelerating clinical workflows.
In summary, deep learning is not only well-suited for interpreting high-dimensional unstructured data such as stroke imaging but also offers the potential to detect patterns beyond the limits of human perception. This positions it as a powerful tool to support diagnosis and treatment decisions in stroke care. However, for this technological potential to translate into meaningful clinical application, model development must prioritize not only technical performance but also clinical validity and trustworthiness.
STRUCTURAL PITFALLS IN CLINICAL AI
The limited adoption of AI in clinical practice is not solely due to insufficient performance. Rather, the more fundamental obstacle often lies in the structural design of the AI model itself—namely, how the data are collected, how labels are defined, how tasks are framed, and whether the model provides interpretable outputs. These foundational design choices create what can be termed “structural pitfalls.” Models that ignore these underlying issues may demonstrate high performance on paper but ultimately fail to translate into meaningful clinical use, becoming “solutions in search of a problem.”
1. Dataset bias
The ability of an AI model to generalize depends heavily on the diversity and representativeness of the training data. Various factors that introduce data variability and lead to out-of-distribution risks are summarized in Table 1. However, many medical AI studies rely on data from a single institution or a homogenous patient population. For instance, a model trained solely on data from a tertiary care center may perform poorly when applied to patients in a community hospital with different clinical characteristics and workflows.11,12
Temporal drift is another source of bias.13 Over time, changes in clinical practice, such as the adoption of new treatments or diagnostic protocols, can alter the characteristics of the data.14 For example, small infarcts that were previously undetectable may now be visible due to advances in MRI resolution, leading to shifts in labeling standards that a static model would not account for.
Domain shifts related to equipment or imaging protocols also pose problems.15 Variations in MRI vendor, field strength, slice thickness, or reconstruction techniques can affect image appearance and degrade model performance when these factors differ from the training environment. Similarly, regional or ethnic differences can limit model generalizability. For example, conditions such as Moyamoya disease, which are more prevalent in East Asia,16 may be underrepresented in Western training datasets, leading to potential blind spots during inference.
These biases not only reduce predictive accuracy but may lead to incorrect or unsafe decisions when the model is applied outside its original data domain. In clinical settings, such errors can directly impact patient safety.
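One practical mitigation is to screen incoming data for distribution shift before relying on model outputs. The sketch below is a minimal, hypothetical check—not a substitute for formal external validation—that compares simple image-intensity statistics between a training cohort and a new site using a two-sample Kolmogorov–Smirnov test; the array names, synthetic data, and significance threshold are illustrative assumptions.

```python
# Minimal sketch: flag a possible domain shift by comparing scan-level
# intensity distributions between the training site and a new deployment site.
# The synthetic data and the 0.01 significance threshold are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

def mean_intensities(volumes):
    """Summarize each scan by its mean voxel intensity (one value per scan)."""
    return np.array([v.mean() for v in volumes])

# Hypothetical data: lists of 3D arrays, one per patient scan.
rng = np.random.default_rng(0)
training_scans = [rng.normal(100, 15, size=(32, 64, 64)) for _ in range(50)]
new_site_scans = [rng.normal(110, 20, size=(32, 64, 64)) for _ in range(50)]  # different scanner/protocol

stat, p_value = ks_2samp(mean_intensities(training_scans),
                         mean_intensities(new_site_scans))
if p_value < 0.01:
    print(f"Possible domain shift (KS statistic={stat:.2f}, p={p_value:.1e}); "
          "consider site-specific validation or recalibration before use.")
```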
2. Trust deficit and explainability gap
One of the most critical barriers to clinical integration of AI is the lack of explainability. This issue becomes particularly acute in high-stakes decisions, such as those encountered in stroke diagnosis and acute treatment.1,17
Many AI models output a numerical prediction—such as a binary classification or probability—without providing an explanation for how that conclusion was reached. This lack of transparency makes it difficult for clinicians to trust the results, particularly when the clinical picture is complex.
Consider a model that determines whether a patient is eligible for thrombolytic therapy. If the patient has multiple comorbidities or a borderline presentation, a clinician may reasonably hesitate to act solely on the model’s output. Without an explanation of the model’s reasoning, discrepancies between the AI’s output and the physician’s judgment can lead to hesitation and reduce trust in the system.18
Moreover, while AI systems may process thousands of variables and detect patterns imperceptible to human cognition, clinicians are still held accountable for treatment decisions. This asymmetry—between the model’s opacity and the clinician’s need to justify actions—can erode the utility of AI tools and lead to their rejection in practice.
To address this, recent efforts in explainable AI have focused on providing visual or interpretable cues, such as heatmaps, attention maps, or surrogate markers, to show the basis of a model’s decision.19 These tools aim to align AI outputs with clinical reasoning, thereby facilitating trust and integration into clinical workflows. In practice, explainability may be more important than raw accuracy, as only interpretable models can be safely and responsibly adopted in real-world medical decision-making.
CHOOSING THE RIGHT TARGET: WHY LABEL DESIGN MATTERS MORE THAN DATA QUANTITY
In the development of medical AI, one of the most commonly overlooked truths is that _what_ we choose to predict is often far more decisive for both model performance and clinical utility than the size of the dataset or the complexity of the model architecture. Simply collecting large volumes of data does not automatically yield meaningful predictions. If the target label is inappropriate, ambiguous, or poorly defined, even the most sophisticated model may end up producing results that are ultimately irrelevant or unusable in clinical practice.
One key principle is that models should aim to address clinically essential questions. For example, while many AI studies attempt to estimate time of onset in acute stroke patients using the DWI-FLAIR mismatch, what clinicians actually want to know is not the exact time, but whether the affected brain tissue is still salvageable.20 Time is merely a surrogate marker—used to determine eligibility for treatments like thrombolysis within a 4.5-hour window. Since AI’s utility is directly linked to how the clinical question is framed, it is more appropriate to design models that directly target the more fundamental issue: the presence or absence of salvageable tissue.
Similarly, although widely used, etiological classification systems such as the Trial of Org 10172 in Acute Stroke Treatment (TOAST) classification do not always reflect the underlying pathophysiology with sufficient clarity, raising concerns about their validity as ground-truth labels for AI training. Additionally, scales such as the National Institutes of Health Stroke Scale (NIHSS) or the modified Rankin Scale (mRS) are ordinal rather than interval variables, and using them in regression-based prediction tasks may lead to misleading interpretations or clinical misapplications.
For AI to deliver robust and actionable results, it must be trained on labels that are objective, well-defined, and clinically meaningful. Labels based on clear and reproducible criteria allow AI systems to learn consistently and also make it easier for clinicians to interpret the results. For example, a model trained to distinguish male from female using ECG data has achieved over 97% accuracy—an outcome largely enabled by the binary and clearly defined nature of the label.21 In contrast, subjective and ambiguous labels such as pain scores or cognitive function ratings are far more difficult to model reliably due to inconsistency and noise.
Another crucial condition is that labels should be directly observable from the input data. For instance, predicting structural abnormalities like left ventricular hypertrophy from chest X-rays is feasible because the label is grounded in image-based evidence.22 However, attempting to predict long-term mortality from the same images introduces numerous confounding variables—such as renal function, treatment adherence, or socioeconomic status—that are not captured in the imaging data alone. In such cases, AI models are likely to yield poor predictive performance and may foster unrealistic expectations in clinical settings.
Ultimately, the responsibility for defining what the model should predict must rest with clinicians. AI is merely a tool for computation and pattern recognition; it does not decide what matters. Clinicians should not passively rely on pre-existing labels or convenience-based targets, but instead must take the lead in defining what is clinically necessary. In short, AI models should not merely predict what is predictable—they should be built to predict what is actually worth predicting.
EXPLAINABILITY AS THE BRIDGE TO ADOPTION AND AI’S REALISTIC ROLE
High predictive accuracy alone is not sufficient for AI technologies to be widely adopted in clinical settings. This is especially true in high-stakes medical decisions that directly impact patient outcomes. In such scenarios, clinicians must be able to understand why an AI model made a particular prediction. This brings us to the critical concept of explainability, which plays a central role not only in building clinicians’ trust but also in allowing for the review of the model’s reasoning, ensuring its alignment with medical judgment.
Technologies that offer visual explanations for AI predictions can greatly enhance clinicians’ trust in the model. For instance, if a model detects an abnormal finding in a specific brain region, a heatmap highlighting the area of interest, or a surrogate marker that indicates gray-white matter blurring or large vessel occlusion, can help the clinician interpret and accept the model’s suggestion. This form of visual explanation is especially helpful for less experienced readers and can serve as a learning aid. However, most current explainable AI techniques still have limitations. A heatmap may show the region where the model focused its attention, but it often fails to clarify which features informed its decision.23,24 Furthermore, these visual explanations do not always align with the clinical criteria used by physicians, meaning that a model’s “explanation” may not be interpretable or reliable from a medical perspective. Therefore, explainability should not be reduced to a mere visualization layer; it must reflect the clinician’s cognitive process and decision-making logic to be truly meaningful. As such, explainability must be incorporated into the model design from the early stages, considering how clinicians will interact with the tool. It should go beyond surface-level visualization and evolve into a mechanism that provides interpretable and clinically valid rationales for model outputs.
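To make the idea of such heatmaps concrete, the sketch below computes a Grad-CAM-style saliency map with PyTorch hooks. It is a minimal illustration under stated assumptions: the ResNet-18 backbone, chosen layer, and random input are generic stand-ins rather than any stroke imaging model, and—consistent with the limitation noted above—such a map shows where the network looked, not why it decided.

```python
# Minimal Grad-CAM-style sketch (illustrative only): highlights the image regions
# that most influenced a chosen output class of a convolutional network.
# The ResNet-18 backbone and random input are stand-ins, not a stroke model.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
activations, gradients = {}, {}

def save_activation(module, inputs, output):
    activations["value"] = output.detach()

def save_gradient(module, grad_input, grad_output):
    gradients["value"] = grad_output[0].detach()

# Hook the last convolutional block to capture its feature maps and their gradients.
model.layer4.register_forward_hook(save_activation)
model.layer4.register_full_backward_hook(save_gradient)

x = torch.randn(1, 3, 224, 224)               # placeholder for a preprocessed image
scores = model(x)
target_class = scores.argmax(dim=1).item()
scores[0, target_class].backward()            # gradients of the target class score

# Grad-CAM: weight each feature map by the spatial mean of its gradient, then ReLU.
weights = gradients["value"].mean(dim=(2, 3), keepdim=True)        # (1, C, 1, 1)
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)           # normalize to [0, 1]
print(cam.shape)  # (1, 1, 224, 224) heatmap that can be overlaid on the input image
```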
Rather than acting as an autonomous decision-maker, AI should be seen as a coordinator and accelerator that organizes clinical information, supports decision-making, and shortens the time to treatment. This framing is more realistic and effective for clinical integration. In particular, AI can make meaningful contributions in workflow optimization by automatically detecting large vessel occlusions immediately after CT or CT angiography scans and notifying stroke specialists, thereby preventing treatment delays and allowing parallel task execution across care teams.25,26 In the prehospital setting, AI can analyze emergency call transcripts or prehospital imaging to assess the likelihood of stroke and assist in determining appropriate transport destinations or levels of care.27 Additionally, AI can function as a decision support tool by converting visual data into consistent numerical values, such as calculating Alberta Stroke Program Early CT Score (ASPECTS) scores, estimating infarct core volumes, or detecting intracranial hemorrhages.28 This is especially helpful for clinicians with less diagnostic experience. By supporting clinicians in these ways, AI can reduce workload, increase the speed and consistency of stroke care, and promote more equitable access to timely interventions. Particularly in time-critical conditions like stroke, AI has the potential to alleviate workflow bottlenecks and automate intermediate decision points, thereby significantly improving the responsiveness of the entire treatment system.
LIMITATIONS IN SCOPE AND UNINTENDED USAGE
For AI models to be used safely and effectively in clinical settings, their scope of use and intended purpose must be clearly defined. However, many existing medical AI models are developed and deployed without explicitly stating these boundaries. This lack of clarity can lead to out-of-scope use—applying the model in situations beyond its original design—which increases the risk of errors. Similarly, when models encounter out-of-distribution data that differ significantly from the training data, their predictions may become unreliable or invalid.
One of the key limitations stems from the ambiguity surrounding the model’s intended scope.1,29 AI models are optimized based on the distribution of input data and labels present during training. When applied to different patient populations, imaging equipment, or clinical objectives, performance may degrade substantially.30,31 Yet, some AI tools are introduced under broad and nonspecific labels such as “stroke detection,” “large vessel occlusion detection,” or “infarct segmentation,” which can mislead clinicians into overestimating the model’s reliability across diverse scenarios.
Out-of-distribution performance degradation is a critical concern, particularly in medical imaging.32 Factors such as vendor variability (e.g., Siemens vs. GE vs. Philips), changes in acquisition protocols (slice thickness, field strength, echo time), and demographic variation (age, ethnicity, disease prevalence) can introduce unfamiliar features to the model. These unseen variations can trigger unexpected errors,31,33 yet many models are deployed clinically without robust validation under such conditions.5
Several real-world examples illustrate how unclear scope and misuse can compromise outcomes. For instance, some large vessel occlusion detection models show high accuracy for proximal occlusions such as the M1 segment of the middle cerebral artery but perform poorly on more distal occlusions (M2–M4), largely because such cases are underrepresented in training datasets. If clinicians are unaware of this limitation, false negatives or misinterpretation can occur. Similarly, infarct segmentation models trained on ischemic stroke cases have at times been misapplied to patients with hemorrhages or tumors—scenarios beyond the model’s design—leading to potential diagnostic errors. Another common issue arises when screening models, tuned to detect early infarcts with low thresholds, are mistakenly used for diagnostic confirmation. Because of their high false-positive rate, such misuse can result in unnecessary testing or overtreatment.
These examples underscore the importance of not just evaluating model performance metrics, but also defining the boundaries of clinical applicability. Developers must clearly document the model’s training conditions, expected input characteristics, and appropriate use cases, while users must remain cautious when applying the model beyond these intended settings. Only with transparent disclosure and adherence to intended use can AI be responsibly and effectively integrated into clinical care.
WHAT AI DEVELOPERS MUST DISCLOSE
To ensure that AI is trusted and used safely in clinical settings, it is not enough to simply demonstrate high performance metrics. Medical AI models must be accompanied by clear documentation of their training process, intended use, and data characteristics. This transparency is essential for helping users—especially clinicians—understand the appropriate scope and limitations of the model. In healthcare, where AI malfunctions or misuse can directly impact patient safety, it is critical that developers or vendors provide the following key information in a clearly documented format.
First, the specific usage conditions of the model must be clearly stated. This includes what data the model was trained on, what task it was designed for, and how the labels were defined.34 For instance, a model trained solely on data from a single institution may have limited generalizability compared to one trained on data from multiple hospitals. Label definitions also matter: in a model predicting ASPECTS scores, there is a significant difference in reliability depending on whether the labels were generated through expert consensus or by automated methods.
It is equally important to clarify the clinical purpose for which the model was developed. Whether an infarct segmentation model is meant for research analysis, for screening purposes, or as a diagnostic aid to guide treatment decisions will affect how its thresholds are set and how its outputs should be interpreted. The basis on which any threshold or cutoff was determined must be disclosed, as this information is critical for clinicians to properly contextualize and apply the model’s predictions.
Second, the characteristics of the training data must be described in detail. A model’s performance depends not just on the volume of data, but also on the quality and context of that data. For example, if the prevalence of a particular label was very low in the training set, the model may have learned to under-detect that class.35 Therefore, information on label distribution should always be disclosed.
Differences across populations must also be considered. Diseases vary in prevalence by region and ethnicity. For instance, neurocysticercosis is relatively common in Latin America, Southeast Asia, and parts of Africa, but rare in Western countries.36 AI models trained solely on Western data may miss its key imaging features, such as cysts with a scolex or calcifications. Temporal drift is another important factor: changes in treatment practices, insurance policies, or diagnostic technology over time can alter data distributions, making older models less representative of current clinical settings.13
Finally, developers must clearly disclose any potential domain shifts, even within the same imaging modality. For instance, although a model may be described as “MRI-based,” its performance can vary significantly depending on the vendor, field strength, acquisition protocol, and slice thickness of the scans. These variations affect image appearance and can undermine model reliability when applied in different clinical environments. Therefore, rather than relying on generic descriptors, developers should specify the imaging conditions under which the model was trained and validated. This is not merely a technical detail—it is a critical factor that enables end users to safely interpret and apply the model in their own clinical context.
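One lightweight way to operationalize this kind of disclosure is a structured “model card” shipped with the tool. The sketch below is purely hypothetical—the fields and values are illustrative assumptions, not an established reporting standard—but it captures the minimum context (task, label definition, training data, label prevalence, imaging conditions, intended and out-of-scope use) a clinician would need to judge applicability.

```python
# Hypothetical "model card" sketch: structured disclosure of training context,
# label definitions, and intended use. All field names and values are illustrative.
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    task: str
    label_definition: str
    intended_use: str
    out_of_scope_use: str
    training_sites: list = field(default_factory=list)
    label_prevalence: float = 0.0
    imaging_conditions: dict = field(default_factory=dict)

card = ModelCard(
    task="Large vessel occlusion (LVO) detection on CT angiography",
    label_definition="M1/ICA occlusion confirmed by two neuroradiologists in consensus",
    intended_use="Triage notification to the stroke team; not diagnostic confirmation",
    out_of_scope_use="Distal occlusions (M2-M4), pediatric patients, non-contrast CT",
    training_sites=["Tertiary center A (2018-2022)", "Community hospital B (2020-2022)"],
    label_prevalence=0.18,
    imaging_conditions={"vendors": ["Siemens", "GE"], "slice_thickness_mm": "0.6-1.25"},
)
print(card.intended_use)
```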
CLINICIANS’ UNDERSTANDING OF EVALUATION METRICS
To effectively utilize AI models in clinical practice, it is essential for clinicians to understand how these models have been evaluated. Simply stating that a model has “high accuracy” is insufficient to assess its reliability or clinical utility. AI performance is measured using a variety of metrics, each carrying different clinical implications. Therefore, clinicians must not only understand how each metric is calculated but also what it means in the context of real-world medical decision-making.
Among the most commonly used evaluation tools for binary classification is the area under the receiver operating characteristic curve (AUROC), which summarizes the trade-off between sensitivity and specificity across classification thresholds (the ROC curve plots sensitivity against 1 − specificity). This metric provides an overall assessment of how well the model can distinguish between positive and negative cases. However, AUROC may overestimate clinical utility in situations with extreme class imbalance, such as when the disease prevalence is very low. In such cases, the area under the precision-recall curve (AUPRC) is often a more appropriate alternative.37 AUPRC is particularly useful in rare disease settings, where the focus is on how effectively the model can identify true positives among a small set of positive cases.
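The contrast between AUROC and AUPRC under class imbalance can be seen directly in a small simulation. The sketch below uses scikit-learn on a synthetic, highly imbalanced dataset; the numbers are illustrative and not drawn from any stroke cohort.

```python
# Illustrative comparison of AUROC vs. AUPRC on a rare-positive (~1%) synthetic dataset.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)
n = 10_000
y_true = rng.binomial(1, 0.01, size=n)              # ~1% prevalence
# A mediocre risk score: positives score somewhat higher on average.
y_score = rng.normal(0, 1, size=n) + 1.5 * y_true

auroc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)    # area under the precision-recall curve
print(f"AUROC = {auroc:.2f}")   # often looks reassuring despite many false positives
print(f"AUPRC = {auprc:.2f}")   # much lower, reflecting few true positives per alert
# The baseline AUPRC of a random classifier equals the prevalence (~0.01),
# so AUPRC should be judged against prevalence, not against 0.5.
```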
Sensitivity and specificity refer to the model’s ability to correctly identify patients with the disease and correctly exclude those without it, respectively. These metrics form the foundation for evaluating the reliability of diagnostic tools, though their importance may vary depending on the clinical context. For instance, in early screening for life-threatening conditions, sensitivity may take precedence, whereas in situations where overdiagnosis is a concern, specificity might be more critical. Positive predictive value and negative predictive value are also frequently used, but these are highly dependent on disease prevalence.38 Even with the same model, positive predictive value tends to be higher in high-risk populations, while negative predictive value tends to increase in low-risk groups. Therefore, interpretation of these metrics must consider the patient distribution of the specific clinical environment in which the model is used.
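The prevalence dependence of predictive values follows directly from Bayes’ theorem, and a few lines of arithmetic make the point. The sketch below fixes sensitivity and specificity and recomputes PPV and NPV across illustrative prevalences; the values are assumptions chosen for demonstration.

```python
# Positive/negative predictive value as a function of disease prevalence,
# for fixed sensitivity and specificity (illustrative values).
def ppv_npv(sensitivity: float, specificity: float, prevalence: float):
    tp = sensitivity * prevalence            # expected true-positive fraction
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    return tp / (tp + fp), tn / (tn + fn)

sens, spec = 0.90, 0.90
for prev in (0.01, 0.10, 0.50):              # screening vs. high-risk populations
    ppv, npv = ppv_npv(sens, spec, prev)
    print(f"prevalence={prev:.0%}: PPV={ppv:.2f}, NPV={npv:.2f}")
# At 1% prevalence the same test yields a PPV of about 0.08; at 50% prevalence it rises to 0.90.
```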
In image-based models, spatial overlap metrics such as the Dice coefficient and Intersection over Union are often used to quantify how well the predicted lesion area matches the ground truth.38 While these metrics may appear similar numerically, their clinical interpretation can differ greatly depending on lesion location and context. For example, in cases where the exact boundary of a clot affects interventional access routes, small boundary differences can have significant consequences. The F1 score, which balances precision and recall, is valuable in tasks where both are critical—such as in the prediction of adverse events or detection of rare findings.39
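For readers less familiar with these overlap metrics, the short sketch below computes the Dice coefficient and IoU from two binary lesion masks; the masks are synthetic placeholders rather than real segmentations.

```python
# Dice coefficient and Intersection over Union (IoU) for binary lesion masks.
# The masks here are synthetic placeholders for illustration.
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum() + 1e-8)

def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return intersection / (union + 1e-8)

truth = np.zeros((64, 64), dtype=bool); truth[20:40, 20:40] = True   # "ground-truth" lesion
pred = np.zeros((64, 64), dtype=bool);  pred[25:45, 25:45] = True    # predicted lesion, shifted
print(f"Dice = {dice(pred, truth):.2f}, IoU = {iou(pred, truth):.2f}")
# Dice is always at least as large as IoU for the same masks; both reach 1.0 only with
# perfect overlap, and neither conveys *where* the boundary error occurs, which may matter clinically.
```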
For prognostic models involving time-to-event outcomes, evaluation metrics must incorporate the temporal dimension. AUROC is inadequate for this purpose because it does not account for time. Instead, the Concordance Index is commonly used, as it measures how well the predicted risk ranks align with the actual sequence of events.40 When finer evaluation is needed, metrics like the integrated AUC (iAUC), which averages performance across the time axis, or the Brier score, which calculates the mean squared error between predicted probabilities and actual outcomes, can be employed. The Brier score is particularly useful in assessing how well a model is calibrated—i.e., how accurate its predicted probabilities are—making it a valuable tool when probabilistic reasoning is essential for decision-making.41,42
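A simplified, pairwise version of the concordance index and an uncensored Brier score can be written in a few lines, which also makes their assumptions explicit. The sketch below ignores censoring weights (which proper survival implementations handle, e.g., via inverse probability of censoring weighting) and uses made-up risk scores and event times.

```python
# Simplified concordance index (higher predicted risk should mean earlier event)
# and an uncensored Brier score. Censoring-aware versions are needed in practice;
# the data below are illustrative.
import numpy as np

def concordance_index(times, events, risk_scores):
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue                                  # comparable pairs are anchored on observed events
        for j in range(n):
            if times[i] < times[j]:                   # i failed before j was last observed
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

times  = np.array([5.0, 8.0, 12.0, 20.0])    # months to event or censoring
events = np.array([1,   1,   0,    1])       # 1 = event observed, 0 = censored
risks  = np.array([0.9, 0.6, 0.4,  0.2])     # predicted risk (higher = worse prognosis)
print(f"C-index = {concordance_index(times, events, risks):.2f}")

# Brier score for a binary outcome at a fixed horizon (no censoring adjustment):
y_observed  = np.array([1, 1, 0, 0])           # e.g., event by 12 months
p_predicted = np.array([0.8, 0.7, 0.3, 0.1])   # model's predicted probabilities
print(f"Brier score = {np.mean((p_predicted - y_observed) ** 2):.3f}")
```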
Ultimately, each evaluation metric reflects a different perspective, and no single metric can fully capture the practical utility of an AI model. Clinicians must critically assess which metrics were used to validate the model and what those metrics imply in their specific clinical context. When clinicians develop the capacity to interpret these metrics appropriately, AI can move beyond being a technical novelty and become a meaningful tool for improving patient care.
CONCLUSIONS
Medical artificial intelligence has the potential to revolutionize various areas of healthcare, including diagnosis, treatment, and prognosis prediction. In fields like stroke care, where urgent decision-making is critical, the performance demonstrated by image-based deep learning models can leave a strong impression on clinicians. However, performance metrics or quantitative superiority alone are not sufficient to determine whether an AI model is truly viable in real-world clinical settings.

This review has highlighted several structural issues that are often overlooked in the development and application of AI models in the field of stroke. The optimistic assumption that data quantity alone can resolve all problems often leads to the development of models based on clinically inappropriate or ambiguous labels. Such models, regardless of their high accuracy, may hold limited clinical value if there has been little reflection on whether the target label is clinically meaningful, whether its definition is clear, and how the model’s outputs might actually impact patient care. Moreover, explainability is not merely a desirable feature but a critical condition for earning trust—often more important than raw performance.

Rather than acting as autonomous decision-makers, AI models are more realistically and effectively positioned as tools that support clinical reasoning and streamline workflows. To achieve this, close collaboration between model developers and clinicians is essential. Developers must transparently disclose key information such as the model’s intended use, data characteristics, and label design. At the same time, clinicians need the capability to interpret performance metrics appropriately and critically.

Ultimately, for AI to deliver true value in clinical practice, its development must be grounded not in performance-driven competition but in clinically meaningful problem definition and thoughtful, transparent design. By shifting from a technology-centered approach to one rooted in patient-centered medical reasoning and responsible implementation, we move closer to creating an environment where AI can genuinely serve as a useful tool for clinicians.
Notes
Ethics Statement
Not applicable. This study is a narrative review and did not involve human participants or animal subjects.
Availability of Data and Material
Not applicable. This article is a narrative review based on previously published literature and contains no new data.
Acknowledgments
This research was supported by the K-Brain Project of the National Research Foundation (NRF) funded by the Korean government (MSIT) (No. RS-2023-00265393). This research has been supported by the HANDOK JESEOK FOUNDATION.
Sources of Funding
None.
Conflicts of Interest
No potential conflicts of interest relevant to this article were reported.
