Large-scale hardware-supported multithreading, an attractive means of increasing computational power, benefits significantly from low per-thread costs. Hardware support for lightweight threads and synchronization is a developing area of research. Shared-memory parallel systems are flourishing, but with a wide variety of architectures, synchronization mechanisms, and topologies. Portable abstractions are needed that provide basic lightweight thread control, synchronization primitives, and topology information to enable scalable application development on the full range of shared-memory parallel system designs. Additionally, programmers need to be able to understand, analyze, tune, and troubleshoot the resulting large-scale multithreaded programs. This thesis discusses the implementation of scalable software for massively parallel computers based on locality-aware lightweight threads and lightweight synchronization. First, this thesis presents an example lightweight threading API, the qthread library, that supports the necessary features in a portable manner. Doing so exposes the need for a structural understanding of parallel applications. ThreadScope, a tool and visual language for structural analysis of multithreaded parallel programs, is presented to address this need. A strong understanding of algorithm structure combined with a locality-aware portable threading library leads to the development of three distributed data structures (a memory pool, an array, and a queue) that adapt to system topology at runtime. Such adaptive data structures enable the development of three example adaptive computational templates (sorting, all-pairs, and wavefront) that hide the parallelism details without sacrificing scalable performance.