Improvements in fabrication technology, guided by Moore's law, have provided a significant boost in the clock rate, and hence the performance, of microprocessors. Over the years, large amounts of chip area have been dedicated to components that try to extract more parallelism from a single instruction stream. However, as process technology scales to smaller feature sizes, traditional techniques for exploiting Instruction Level Parallelism (ILP), especially those based on superscalar processors, have yielded diminishing returns in terms of cost/performance. This thesis focuses on computer architectures that increase the opportunities for concurrency beyond what is usually possible in systems based on complex superscalar cores. We are specifically interested in the LWP architecture, which couples a light-weight multithreading capability with memory on the same die by using Processing-In-Memory (PIM) or embedded DRAM technology. The primary objective of this thesis is to explore scalable synchronization mechanisms for the LWP architecture. We explore the design space for efficient, low-overhead mutex and barrier implementations that scale as the number of threads increases, using a combination of hardware and software techniques chosen according to the target requirements. Towards this goal, we attempt to answer the following questions:
• How can current synchronization mechanisms best be implemented in the LWP architecture?
• What additional hardware or software support should be added to the LWP architecture to enhance these implementations?
• How do the LWP implementations compare to those on current architectures/ISAs?
• Are there any new techniques that can be developed for the LWP architecture?
• Are any of these LWP-based ideas applicable to conventional systems?