Although Moore's Law held for many decades, the gap between Moore's prediction and current capability has grown. As the cost and performance benefits associated with Moore's Law scaling slow, researchers are studying alternative architectures (e.g., based on analog and/or spiking circuits) and computational models (e.g., convolutional and recurrent neural networks) to perform application-level tasks faster, more energy efficiently, and/or more accurately. On one side, domain-specific architectures are being proposed, where a hardware-centric approach tailors the architecture to a specific problem; such designs can exploit the parallelism of the target domain and make more effective use of the memory hierarchy. On the other side, new computational models, especially machine learning algorithms, have achieved tremendous success in many application domains, most notably deep neural networks (DNNs). The design space of accelerators for such computations is extremely large, as they can employ different datapaths, data mapping strategies, circuits, and device technologies, and it is hard to quickly find an optimal design that meets the requirements of a given application. This dissertation addresses these challenges in several ways.

First, we investigate the design of a domain-specific architecture: the cellular neural network (CeNN). A CeNN is a powerful processor that can significantly improve the performance of spatio-temporal applications when compared to the more traditional von Neumann architecture. We first show how tunneling field-effect transistors (TFETs) and MOSFETs can be utilized to enhance the performance of CeNNs. We also show that TFETs can be useful for realizing non-linear voltage-controlled current sources (VCCSs), which are either not possible or exhibit degraded performance when implemented via CMOS.

We then investigate CeNN-based co-processors at the application level in terms of accuracy, delay, and energy, using two case studies: a CeNN-based tracking algorithm and a CeNN-based convolutional neural network (CoNN). In the first case study, a CeNN-friendly target-tracking algorithm is developed and mapped to an array architecture designed in conjunction with the algorithm. We compare the energy, delay, and accuracy of our architecture/algorithm (including all overheads) to those of the most accurate von Neumann algorithm (Struck), with the von Neumann CPU data measured on an Intel i5 chip.

In the second case study, we present the design and evaluation of an accelerator for CoNNs whose system-level architecture is based on mixed-signal cellular neural networks. Specifically, we present (i) implementations of different CoNN layers, including convolution, ReLU, and pooling, using CeNNs; (ii) modified CoNN structures with CeNN-friendly layers that reduce the computational overheads typically associated with a CoNN; (iii) a mixed-signal CeNN architecture that performs CoNN computations in the analog and mixed-signal domain; and (iv) a design space exploration that identifies which CeNN-based algorithmic and architectural features fare best compared to existing algorithms and architectures when evaluated on common datasets, MNIST and CIFAR-10.
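For reference, the standard Chua-Yang CeNN cell dynamics underlying both case studies are

\[
C \frac{dx_{ij}}{dt} = -\frac{x_{ij}}{R} + \sum_{(k,l) \in N_r(i,j)} A(i,j;k,l)\, y_{kl} + \sum_{(k,l) \in N_r(i,j)} B(i,j;k,l)\, u_{kl} + z,
\qquad
y_{ij} = \frac{1}{2}\bigl( |x_{ij} + 1| - |x_{ij} - 1| \bigr),
\]

where $x_{ij}$, $u_{ij}$, and $y_{ij}$ are the state, input, and output of cell $(i,j)$, $N_r(i,j)$ is its radius-$r$ neighborhood, $A$ and $B$ are the feedback and feedforward templates, and $z$ is a bias; the notation here is the textbook one rather than that of any particular chapter. Note that the feedforward term is itself a small-kernel convolution of the input, which is what makes CeNNs a natural substrate for the convolution layers of a CoNN.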
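To make that correspondence concrete, the short Python sketch below integrates the cell equation with forward Euler and a zeroed feedback template, so the array settles to a saturated convolution of its input. It is purely illustrative: the kernel, array size, and integration constants are our own arbitrary choices, not values from the dissertation.

    # Toy CeNN: with A = 0, the steady state is x = R*(B*u + z), i.e. a
    # convolution of the input followed by the saturating output function.
    import numpy as np
    from scipy.signal import correlate2d

    def cenn_settle(u, A, B, z, steps=200, dt=0.05, R=1.0, C=1.0):
        """Integrate C*dx/dt = -x/R + A*y + B*u + z; return the output y."""
        x = np.zeros_like(u)
        for _ in range(steps):
            y = 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))   # saturating output
            dx = (-x / R
                  + correlate2d(y, A, mode="same")           # feedback template
                  + correlate2d(u, B, mode="same")           # feedforward template
                  + z) / C
            x = x + dt * dx
        return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))

    A = np.zeros((3, 3))                      # no feedback: feedforward-only CeNN
    B = np.array([[-1., -1., -1.],            # a generic 3x3 edge-detection kernel
                  [-1.,  8., -1.],
                  [-1., -1., -1.]])
    u = np.random.rand(16, 16) * 2.0 - 1.0    # toy input image in [-1, 1]
    y = cenn_settle(u, A, B, z=0.0)

The same machinery realizes other layers by choosing different templates (and, for some layers, a nonzero feedback template), which is the degree of freedom the second case study exploits.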
We also focus on benchmarking and evaluating DNN architectures. We present a uniform modeling framework, EvaDNN, to estimate the dynamic energy (a major component of total energy) consumed by a DNN accelerator. EvaDNN can accurately model energy contributions from the device technology, circuits, architecture, a given data mapping strategy, and the underlying network structure. We apply EvaDNN to three accelerator architectures from the literature, namely Eyeriss, ShiDianNao, and TrueNorth, and demonstrate that the model can reliably estimate the energy (with a maximum error of 15%) required by the different components of different accelerator architecture topologies for a given workload or network.

In the last chapter, we focus on architecture implementations based on emerging devices. Emerging memory devices are an attractive choice for implementing very energy-efficient in-situ matrix-vector multiplication (MVM) for use in intelligent edge platforms. Despite their great potential, device-level nonidealities have a large impact on the application-level accuracy of DNN inference. We introduce a low-density parity-check (LDPC) code based approach to correct nonideality-induced errors encountered during in-situ MVM. Design space explorations demonstrate that we can leverage the resilience endowed by error-correcting codes to improve energy efficiency by reducing the operating voltage: at iso-accuracy, we achieve a 1.6X energy-efficiency improvement for DNN inference with ResNet-20 on the CIFAR-10 dataset.
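The system-level idea in this last chapter (protecting stored weight bits with an error-correcting code so that the memory can be operated at a lower, noisier voltage) can be illustrated with a toy Monte Carlo sketch. For brevity, the sketch below substitutes a Hamming(7,4) code with single-error syndrome decoding for the LDPC codes actually used in the dissertation, and the i.i.d. bit-flip channel and the 2% flip rate are our own illustrative assumptions rather than a measured device model.

    # Toy model of ECC-protected weight storage: bits read from a noisy
    # memory flip independently with probability p_flip (standing in for
    # nonideality-induced errors at a reduced operating voltage).
    import numpy as np

    rng = np.random.default_rng(0)

    def hamming74_encode(d):
        """4 data bits -> 7-bit codeword, parity bits at positions 1, 2, 4."""
        d1, d2, d3, d4 = d
        return np.array([d1 ^ d2 ^ d4,      # p1 checks positions 1,3,5,7
                         d1 ^ d3 ^ d4,      # p2 checks positions 2,3,6,7
                         d1,
                         d2 ^ d3 ^ d4,      # p4 checks positions 4,5,6,7
                         d2, d3, d4])

    def hamming74_decode(c):
        """Correct at most one flipped bit; return the 4 data bits."""
        c = c.copy()
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
        pos = 4 * s4 + 2 * s2 + s1          # syndrome = 1-based error position
        if pos:
            c[pos - 1] ^= 1
        return c[[2, 4, 5, 6]]

    p_flip, trials = 0.02, 20000
    raw_fail = ecc_fail = 0
    for _ in range(trials):
        d = rng.integers(0, 2, size=4)
        # Unprotected storage: any single flipped bit corrupts the weight.
        raw_fail += bool((rng.random(4) < p_flip).any())
        # Protected storage: one flip per 7-bit codeword is corrected.
        noisy = (hamming74_encode(d) + (rng.random(7) < p_flip)) % 2
        ecc_fail += not np.array_equal(hamming74_decode(noisy), d)
    print(f"unprotected weight error rate: {raw_fail / trials:.4f}")
    print(f"ECC-protected weight error rate: {ecc_fail / trials:.4f}")

Real LDPC codes are instead decoded iteratively over a sparse parity-check matrix and scale to much longer codewords; the Hamming stand-in above only conveys the basic trade-off, spending a few redundant bits to absorb the extra errors introduced by a lower operating voltage.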