In the past decade, smartphone ownership has grown rapidly, and the computational power and number of sensors available on these devices have grown as well. This growth has enabled research that uses the smartphone to measure various aspects of daily life, including social aspects such as emotion in spoken communication. Sensing emotion from speech, however, is a challenging task; it is made harder when the audio is collected on a smartphone, and harder still when sensing must continue over a long period of time, as in a longitudinal study. This thesis describes the Speech Collection and Analysis of Longitudinal Emotion (SCALE) system, which is designed to address these challenges for a scalable number of users over a long period of time. The system comprises a mobile client for the smartphone and a back-end server with an analysis pipeline that processes and stores the audio features essential to emotion recognition. This thesis discusses the design of the SCALE system, evaluates its performance, and presents potential methods for assessing emotion.