Machine learning models are expected to work across a variety of settings, or inputs, with reasonable performance. A model that does not work reliably across settings cannot be trusted for real-world use. The measure of a model's adaptability across inputs is its generalization, or generalizability. Traditionally, a machine learning model is evaluated for generalizability by testing it on unseen data, that is, inputs it did not encounter during training. The justification is that a model which cannot adapt to this unseen data cannot be relied on to adapt to other unseen data. This evaluation assesses performance on the unseen data with correctness metrics such as accuracy, precision, recall, specificity, and F1 score. Testing on a sample of unseen inputs effectively weeds out models that cannot hope to generalize by showing that they fail. However, this is no guarantee; a model that succeeds on this set of unseen inputs may still fail on another.

In this work, I show that generalization behavior can be inferred by evaluating models across an intentional range of inputs and studying how internal behavior varies across that input range, specifically in correspondence with the way instance similarity varies across the same range. The way an input progresses through a model's internal processes can be thought of as the internal behavior of the model in reaction to that input. For neural networks, internal behavior is literally the activation patterns of the network's neurons in response to the input stimuli. A neural model's internal behavior for two inputs can be abstractly compared by correlating the activations elicited by each instance, and a measure of the model's internal behavior is obtained by repeating this comparison across the full set of stimulus pairs (a minimal sketch of this computation appears at the end of this section). Consistent behavior manifests as similar behavior for similar inputs and dissimilar behavior for dissimilar inputs. If a model's internal behavior is inconsistent, or unpredictable, then it cannot be trusted to generalize to unseen input. A drawback of this evaluation is that consistent internal behavior must be defined relative to some ground-truth notion of input similarity. I implement a practical workaround by treating the activations of the human brain as ground truth for the relationship between instance similarity and behavioral similarity.

In this dissertation, I propose and study a new model evaluation technique that assesses models on the consistency of their internal processes in relation to the similarity of different instances. I first present, evaluate, and discuss evaluation issues with two machine-learned models: one trained and evaluated on relating learners' physiological data to their attention (mind wandering), and the other trained and evaluated on identifying the number of questions asked in a class session from teacher speech. I contrast my evaluations of these traditional models with my proposed evaluation, a Human-Model Similarity measure of internal model behavior. I study my proposed evaluation in relation to other performance metrics, across various architectures, and for potential use in model search. I find that the metric is predictive of traditional accuracy metrics and can be used to predict which models will succeed well before training is complete. I define and test how this evaluation could be used in model search, specifically finding that it can reduce training time by 67% with no loss in final model performance.
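
The pairwise activation comparison referenced above can be summarized in a short sketch. The snippet below is a minimal illustration under stated assumptions, not the dissertation's exact implementation: the array names (model_acts, brain_acts), the use of Pearson correlation for the pairwise step, and the use of Spearman rank correlation for the final model-to-brain comparison are all choices made only for this example.

    # Minimal sketch: compare a model's internal similarity structure to a
    # reference (brain) similarity structure over the same stimulus set.
    # Hypothetical inputs: model_acts (n_stimuli x n_units),
    # brain_acts (n_stimuli x n_voxels).
    import numpy as np
    from scipy.stats import spearmanr

    def pairwise_similarity(acts):
        # Correlate every pair of rows (stimuli) -> n_stimuli x n_stimuli matrix.
        return np.corrcoef(acts)

    def human_model_similarity(model_acts, brain_acts):
        # Second-order comparison: how well does the model's pattern of
        # pairwise similarities track the brain's pattern?
        m_sim = pairwise_similarity(model_acts)
        b_sim = pairwise_similarity(brain_acts)
        iu = np.triu_indices_from(m_sim, k=1)  # unique stimulus pairs only
        rho, _ = spearmanr(m_sim[iu], b_sim[iu])
        return rho

    # Placeholder data for 20 stimuli, purely for illustration.
    rng = np.random.default_rng(0)
    model_acts = rng.normal(size=(20, 128))  # 20 stimuli x 128 hidden units
    brain_acts = rng.normal(size=(20, 500))  # 20 stimuli x 500 voxels/sensors
    print(human_model_similarity(model_acts, brain_acts))

Under these assumptions, a higher value indicates that the model responds similarly to stimuli that humans respond to similarly, and dissimilarly otherwise, which is the notion of behavioral consistency the proposed evaluation is built on.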