Significant improvements in automated image analysis have been achieved in recent years, and such tools are now increasingly used in computer-assisted syndromology. However, the recognizability of the facial gestalt may depend on the syndrome and may also be confounded by the severity of the phenotype, the size of the available training sets, ethnicity, age, and sex. Benchmarking and comparing the performance of deep-learned classification processes is therefore inherently difficult. For a systematic analysis of these influencing factors, we chose the lysosomal storage diseases Mucolipidosis as well as Mucopolysaccharidosis types I and II, which are known for their wide and overlapping phenotypic spectra. For comparison of dysmorphic features, we used Smith-Lemli-Opitz syndrome, a metabolic disease, and Nicolaides-Baraitser syndrome, another disorder that is also characterized by coarse facies. A classifier trained on these five cohorts, comprising 288 patients in total, achieved a mean accuracy of 62%. The performance of the automated image analysis is not only significantly better than expected by chance but also exceeds that of previous approaches, which might in part be explained by our large training sets. We therefore set up a simulation pipeline suited to analyzing the effect of different potential confounders, such as cohort size, age, sex, or ethnic background, on the recognizability of phenotypes. We found that the true positive rate increases with growing cohort size (n=[10…40]) for all analyzed disorders, whereas ethnicity and sex have no significant influence. The dynamics of the accuracies strongly suggest that the maximum recognizability is a phenotype-specific value that has not yet been reached for any of the studied disorders. This should also be a motivation to further intensify data-sharing efforts, as computer-assisted syndrome classification can still be improved by enlarging the available training sets.
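The cohort-size effect described above can be illustrated with a toy simulation; this is a hedged sketch, not the authors' actual pipeline. It stands in a nearest-centroid classifier on synthetic Gaussian "facial feature" vectors for the deep-learned classifier, subsamples training cohorts of increasing size, and records the true positive rate. The feature dimension, class separation, and all function names are hypothetical choices for illustration only.

```python
# Toy simulation of the cohort-size effect (hypothetical stand-in for the
# deep-learned classifier): subsample training cohorts of size n, fit a
# nearest-centroid classifier on synthetic feature vectors, measure TPR.
import random

DIM = 32      # dimensionality of the synthetic feature vectors (illustrative)
CLASSES = 5   # five syndrome cohorts, as in the study
SEP = 1.5     # class separation of the synthetic features (illustrative)

def sample(cls, rng):
    """Draw one synthetic feature vector for class `cls`."""
    return [rng.gauss(SEP if i == cls else 0.0, 1.0) for i in range(DIM)]

def centroid(vectors):
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]

def tpr_for_cohort_size(n, trials=60, seed=0):
    """Mean true positive rate when each class has n training samples."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(trials):
        cents = [centroid([sample(c, rng) for _ in range(n)])
                 for c in range(CLASSES)]
        for c in range(CLASSES):  # one held-out test case per class
            x = sample(c, rng)
            pred = min(range(CLASSES),
                       key=lambda k: sum((x[i] - cents[k][i]) ** 2
                                         for i in range(DIM)))
            hits += pred == c
            total += 1
    return hits / total

if __name__ == "__main__":
    for n in (10, 20, 40):
        print(f"cohort size {n}: TPR = {tpr_for_cohort_size(n):.3f}")
```

With these (arbitrary) settings, the estimated true positive rate grows with the per-class cohort size, mirroring the trend reported for the real cohorts, because larger training sets yield less noisy class centroids.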