key: cord-0490433-7xnxfiio authors: Dzalbs, Ivars; Kalganova, Tatiana title: Accelerating supply chains with Ant Colony Optimization across range of hardware solutions date: 2020-01-22 journal: nan DOI: nan sha: 8d016028622d751414d112af07f3c5c40ba358b1 doc_id: 490433 cord_uid: 7xnxfiio Ant Colony algorithm has been applied to various optimization problems, however most of the previous work on scaling and parallelism focuses on Travelling Salesman Problems (TSPs). Although, useful for benchmarks and new idea comparison, the algorithmic dynamics does not always transfer to complex real-life problems, where additional meta-data is required during solution construction. This paper looks at real-life outbound supply chain problem using Ant Colony Optimization (ACO) and its scaling dynamics with two parallel ACO architectures - Independent Ant Colonies (IAC) and Parallel Ants (PA). Results showed that PA was able to reach a higher solution quality in fewer iterations as the number of parallel instances increased. Furthermore, speed performance was measured across three different hardware solutions - 16 core CPU, 68 core Xeon Phi and up to 4 Geforce GPUs. State of the art, ACO vectorization techniques such as SS-Roulette were implemented using C++ and CUDA. Although excellent for TSP, it was concluded that for the given supply chain problem GPUs are not suitable due to meta-data access footprint required. Furthermore, compared to their sequential counterpart, vectorized CPU AVX2 implementation achieved 25.4x speedup on CPU while Xeon Phi with its AVX512 instruction set reached 148x on PA with Vectorized (PAwV). PAwV is therefore able to scale at least up to 1024 parallel instances on the supply chain network problem solved. Supply chain optimization has become an integral part of any global company with a complex manufacturing and distribution network. For many companies, inefficient distribution plan can make a significant difference to the bottom line. Modelling a complete distribution network from the initial materials to the delivery to the customer is very computationally intensive. With increasing supply chain modelling complexity in ever changing global geo-political environment, fast adoptability is an edge. A company can model impact of currency exchange rate changes, import tax policy reforms, oil price fluctuations and political events such as Brexit before they happen. This requires fast optimization algorithms. Mixed Integer Linear Programming (MILP) tools such as Cplex are commonly used to optimize various supply chain networks [1] . Although MILP tools are able to obtain optimum solution for large variety of linear models, not all real-world supply chain models are linear. Furthermore, MILP is computationally expensive and on large instances can fail to produce an optimal solution. For that reason, many alterative algorithmic approaches (heuristics, meta-heuristics, fuzzy methods) have been explored to solve large-complex SC models [1] . One of these algorithms is the Ant Colony Optimization (ACO), which can be well mapped to real world problems such as routing [2] and scheduling [3] . Supply Chain Optimization Problem (SCOP) includes both, finding the best route to ship a specific order and finding the most optimal time to ship it, such that it reaches expected customer satisfaction while minimizing the total cost occurred. Ant colony algorithms try to mimic the observed behavior of ants inside colonies, in order to solve a large range of optimization problems. Since the introduction by Marco Dorigo in 1992, many variations and hybrid approaches of Ant Colony algorithms have been explored [4] [5] . Most ant colony algorithms consist of two distinct stagessolution construction and pheromone feedback to other ants. Typically, an artificial ant builds its solution from the pheromone left from previous ants, therefore allowing communication over many iterations via a pheromone matrix and converges to a better solution. The process of solution creation and pheromone update is repeated over many iterations until the termination criterion is reached, this can be either total number of iterations, total computation time or dynamic termination. Researchers in [6] compared an industrial optimization-based tool -IBM ILOG Cplex with their proposed ACO algorithm. It was concluded that the proposed algorithm covered 94% of optimal solutions on small problems and 88% for large-size problems while consuming significantly less computation time. Similarly, [7] compared ACO and Cplex performance on multi-product and multi-period Inventory Routing Problem. On small instances ACO reached 95% of optimal solution while on large instances performed better than timeconstrained Cplex solver. Furthermore, ACO implementations of Closed-Loop Supply Chain (CLSC) have been proposed; CLSC contains two parts of the supply chainforward supply and reverse/return. [8] solved CLSC models, where the ACO implementation outperformed commercial MILP (Cplex) on nonlinear instances and obtained 98% optimal solution with 40% less computation time on linear instances. The aim of this paper is to explore parallelism techniques across multiple hardware solutions for a real-world supply chain optimization problem (where meta-data overhead during solution construction plays a significant role on the total compute time). The paper is structured as follows: Section 2 explores current state of the art parallel implementations of ACO across CPU, GPU and Xeon Phi; Section 3 introduces the hardware and software solutions used; Section 4 described the real-world problem being solved; Section 5 details the parallel ACO implementations and Section 6 compares the results. Finally, Section 7 concludes the paper. Since the introduction of ACO in 1992, countless ACO algorithms have been applied to many different problems and many different parallel architectures have been explored previously. [9] specifies 5 of such architectures: • Parallel Independent Ant Colonieseach ant colony develop their own solutions in parallel without any communication in-between; • Parallel Interacting Ant Colonieseach colony creates solution in parallel and some information is shared between the colonies; • Parallel Antseach ant creates solution independently, then all the resulting pheromones are shared for the next iteration; • Parallel Evaluation of Solution Elementsfor problems where fitness function calculations take considerably more time than the solution creation; • Parallel Combination of Ants and Evaluation of Solution Elementsa combination of any of the above. Researchers have tried to exploit the parallelism offered from recent multi core CPUs [10] , along with clusters of CPUs ( [11] [12] ) and most recently GPUs [13] and Intel's many core architectures such as Xeon Phi [14] . Breakdown of the strategies and problems solved are shown in Table 1 . Table 2 shows the most common problems solved by ACO and their corresponding associated constraints / meta-data required during solution creation. Parallel ACO CPU architectures have been applied to various tasksfor example, [15] applied ACO for mining supply chain scheduling problem. Authors managed to reduce the execution time from one hour (serial) to around 7 minutes. Both [30] and [31] used ACO for image edge detection with varying results, [30] achieved a speedup of 3-5 times while [31] managed to reduce sequential runtime by 30%. Most commonly, ACO has been applied to Travelling Salesman Problem (TSP) benchmarks. For instance, [17] proposed ACO approach with randomly synchronized ants, the approach showed a faster convergence compared to other TSP approaches. Moreover, authors in [19] proposed new multi-core SIMD model for solving TSPs. Similarly, both [32] and [33] [26] implemented parallel ACO for solving Flow shop scheduling problem with restrictions using MPI. Compared to sequential version of the algorithm, 93 node cluster achieved a speedup of 16x. [37] compared ACO parallel implementation on MPI and OpenMP on small vector estimation problem. It was found that maximum speedup of OpenMP was 24x while MPI -16x. Furthermore, [18] explored multi-core SIMD CPU with OpenCL and compared it to the performance of GPU. It was found optimized parallel CPU-SIMD version can achieve similar solution quality and computation time than the state of art GPU implementation solving TSP. Intel's Xeon Phi Many Integrated Core (MIC) architecture offers many cores on the CPU (60-72 cores per node) while offering lower clock frequency. Few researchers have had the opportunity to research ACO on the Xeon Phi architecture. For instance, [27] showed how utilizing L1 and L2 cache on Xeon Phi coprocessor allowed a speedup of 42x solving TSP compared to a sequential execution. Due to the nature of SIMD features such as AVX-512 on Xeon Phi, researchers in both [29] and [28] proposed a vectorization model for roulette wheel selection in TSP. In case of [29] a 16.6x speedup was achieved compared to a sequential execution. To the best of authors knowledge, Xeon Phi and ACO parallelism has not been explored to any other problem except TSP. General Purpose GPU (GPGPU) programming is a growing field in computer science and machine learning. Many researchers have tried exploiting latest GPU architectures in order to speed optimize the convergence of ACO. ACO GPU implementation expands to many fields, such as edge detection ( [25] [38] ), protein folding [20] , solving Multidimensional Knapsack Problems (MKPs) [21] and Vertex coloring problems [39] . [22] ), the majority of implementations use CUDA. For instance, [46] implemented a GPU-based ACO and achieved a speedup of 40x compared to sequential ACS. Similarly, a 22x speedup was achieved in [47] solving pr1002 TSP and 44x on fnl4461 TSP instance in [48] . However, there are also various hybrid approaches for solving TSP - [49] solves parallel Cultural ACO (pCACO) (a hybrid of genetic algorithm and ACO). Research showed that pCACO outperformed sequential and parallel ACO implementations in terms of solution quality. Furthermore, [50] solved TSP instances using ACO-PSO hybrid and authors in [51] explored heterogenous computing with multiple GPU architectures for TSP. Although task parallelism has potential for a speedup, [23] showed how data parallelism (vectorization) on GPU can achieve better performance by proposed Independent Roulette wheel (I-Roulette). Authors then expanded the I-Roulette implementation in [24] , where SS-Roulette wheel was proposed. Further, [52] implements a G-Roulettea grouped roulette wheel selection based on I-Roulette, where cities in TSP selection is grouped in CUDA warps. An impressive speedup of 172x was achieved compared to the sequential counterpart. Comparing parallel performances of different hardware architectures fairly is by no means trivial. Most research compares a sequential CPU ACO implementation to the one of the parallel GPUs, which is hardly fair [53] . To amplify the issue, unoptimized sequential code is compared to highly optimized GPU code. This results in misleading and inflated speedups [13] . Furthermore, [22] argues that often the parameter settings chosen for the sequential implementation is biased in favor of GPU. [13] proposes a criteria to calculate the real-world efficiency of two different hardware architectures by comparing the theoretical peak performances of GPU and CPU. While the proposed method is more appropriate, it still doesn't account for real-life scenarios where memory latency/speed, cache size, compilers and operating systems all play a role of the final execution time. Therefore, two different systems with similar theoretical floating-point operations per second running the same executable can have significantly different execution times. Furthermore, in some instances only execution time or solution quality is compared, rarely both are taken into consideration when comparing results. This section briefly covers the tools and hardware specific languages used in the implementation. OpenMP is set of directives to a compiler that allows programmer to create parallel tasks as well as vectorization (Single Instruction Multiple Data -SIMD) in order to speed up execution of a program. Program containing parallel OpenMP directives starts as single thread, when directive such as #pragma omp parallel is reached, main thread will create a thread pool and all methods within pragma region will be executed in parallel by each thread in the thread group. Once the thread reaches the end of the region, it will wait for all other threads to finish before dissolving the thread group and only the main thread will continue. Furthermore, OpenMP also supports nesting, meaning a thread in a thread-group can create its own individual thread-group and become the master thread for the newly created thread-group. However, thread-group creation and elimination can have significant overhead and therefore, thread-group re-use is highly recommended [54] . This paper utilizes both omp parallel and omp simd directives. Compute Unified Device Architecture (CUDA) is a General-purpose computing model on GPU developed by Nvidia in 2006. Since then this proprietary framework has been utilized in the high-performance computing space via multiple Artificial Intelligence (AI) and Machine Learning (ML) interfaces and libraries/APIs. CUDA allows to write C programs that takes advantage of any recent Nvidia GPU found in laptops, workstations and data centers. Each GPU contains multiple Streaming Multiprocessors (SM) that are designed to execute hundreds of threads concurrently. In order to achieve that, CUDA implements SIMT (Single Instruction Multiple-Threads) architecture, where instructions are pipelined for instruction level parallelism. Threads are grouped in sets of 32called warps. Each warp executes one instruction at a time on each thread. Furthermore, CUDA threads can access multiple memory spacesglobal memory (large size, slower), texture memory (read only), shared memory (shared across threads in the same SM, lower latency) and local memory (limited set of registers within each thread, fastest) [55] . A batch of threads are grouped into a thread-block. Multiple thread-blocks create a grid of thread blocks. Programmer specifies the grid dimensionality at kernel launch time, by providing the number of thread-blocks and the number of threads per thread-block. Kernel launch fails if the program exceeds the hardware resource boundaries. Knights Landing is a product code name for Intel's second-generation Intel Xeon Phi processors. First generation of Xeon Phi, named Knights Corner, was a PCI-e coprocessor card based on many Intel Atom processor cores and support for Vector Processing Units (VPUs). The main advancement over Knights Corner was the standalone processor that can boot stock operating systems, along with improved power efficiency and vector performance. Furthermore, it also introduced a new high bandwidth MCDRAM memory. Xeon phi support for standard x86 and x86-64 instructions, allows majority CPU compiled binaries to run without any modification. Moreover, support for 512-bit Advanced Vector Extensions (AVX-512) allows high throughput vector manipulations. Figure 1 . Knights Landing tile with larger processor die [56] The Knights Landing cores are divided into tiles (typically between 32 and 36 tiles in total). Each tile contains two processor cores and each core is connected to two vector processing units (VPUs). Utilizing AVX-512 and two VPUs, each core can deliver 32 dual-precision (DP) or 64 single-precision (SP) operations each cycle [56] . Furthermore, each individual core supports up to four threads of executionhyper threads where instructions are pipelined. Another introduction with the Knights Landing is the cluster modes and MCDRAM/DRAM management. Processor offers three primary cluster modes -All to all mode, Quadrant mode and Sub-Numa Cluster (SNC) mode and three memory modescache mode, flat mode and hybrid mode. For detailed description of the Knights Landing Xeon Phi architecture refer to [56] . A real-world dataset of an outbound logistics network is provided by a global microchip producer. The company provided demand data for 9216 orders that need to be routed via their outbound supply chain network of 15 warehouses, 11 origin ports and 1 destination port (see Figure 2 ). Warehouses are limited to a specific set of products that they stock, furthermore, some warehouses are dedicated for supporting only a particular set of customers. Moreover, warehouses are limited by the number of orders that can be processed in a single day. A customer making an order decides what sort of service level they require -DTD (Door to Door), DTP (Door to Port) or CRF (Customer Referred Freight). In case of CRF, customer arranges the freight and company only incurs the warehouse cost. In most instances, an order can be shipped via one of 9 couriers offering different rates for different weight bands and service levels. Although the majority of the shipments are done via air transport, some orders are shipped via groundby trucks. The majority of couriers offer discounted rates as the total shipping weight increases based on different weight bands. However, a minimum charge for shipment still applies. Furthermore, faster shipping tends to be more expensive, but offer better customer satisfaction. Customer service level is out of the scope of this research. Dataset [57] is divided into 7 tables, one table for all orders that needs to be assigned a route -OrderList The main goal of optimization is to find a set of warehouses, shipping lanes and couriers to use for the most cost-effective supply chain. Therefore the fitness function is derived from two incurred costswarehouse cost and transportation cost in equation (1). The totaling cost is then calculated across all orders in the dataset. Where is warehouse cost for order at warehouse and is transportation cost for order between warehouse port and customer port , total number of orders . Where warehouse cost for order at warehouse is calculated in (2), by the number of units in order multiplied by the warehouse storage rate (WhCosts table) . Where is the service level for order . is the minimum charge for given line , is the weight in kilograms for order shipped from warehouse port to customer port via courier using service level , delivery time and transportation mode . is the freight rate (dollars per kilogram) for given weight gap based on total weight for the line (FreightRates table) . Furthermore, transportation cost for a given order and chosen line is calculated by algorithm in Figure 3 .The algorithm first check what kind of service level the order requires, if the service level is equal to CRF (Customer Referred Freight)transportation cost is 0. Furthermore, if order transportation mode is equal to GROUND (order transported via truck), order transportation cost is proportional to the weight consumed by the order ( ) in respect of the total weight for given line and the rate charged by courier for full track . Moreover, a minimum charge of is applied in cases where the transportation cost is less than the minimum charge. In all other cases, the transportation cost is calculated based on order weight and the freight rate . The freight rate is determined based on total weight on any given line and the corresponding weight band in the freight rate table. Problem being solved complies with the following constraints: Where = 1 if order was shipped from warehouse and 0 otherwise. is the order limit per day for warehouse (WhCapacities table) . Where is the weight in kilograms for order shipped from warehouse port to customer port via courier using service level , delivery time and transportation mode . is the upper weight gap limit for line (FreightRates table) . Where product for order belongs to supported products at warehouse (ProductsPerPlant In order to solve the transportation network optimization problem, we are using an Ant Colony System algorithm first proposed by [58] . Because ACO is an iterative algorithm, it does require sequential execution. Therefore, the most naïve approach for parallel ACO is running multiple Independent Ant Colonies (IAC) with a unique seed for the pseudo random number generator for each colony (high level pseudo code in Figure 4 ). Due to the stochastic nature of solution creation, it is therefore more probabilistic to reach a better solution than a single colony. This approach has the advantage of low overhead as it requires no synchronization between the parallel instances during search. At the very end of the search, the best solution of all parallel colonies is chosen as the final solution. Main disadvantage of IAC is that if one of the colonies finds a better solution, there is no way to improve all the other colony's fitness values. 1. for all parallel instances m parallel do 2. for all iterations i do 3. for all local ants a do 4. local pheromone = global pheromone 5. construct solution 6. local pheromone update 7. end for 8. update global pheromone update based on best solution 9. end for 10. end for 11. find best solution across parallel instances Alternatively, the ACO search algorithm could also be letting the artificial ant colonies synchronize after every iteration and therefore all parallel instances are aware of the best solution and can share pheromones accordingly. High level pseudo code of such Parallel Ant (PA) implementation is shown in Figure 5 . Main advantage of this architecture is that it allows efficient pheromone sharing, therefore converging faster. However, there is a high risk of getting stuck into local optima as all ants start iteration with the same pheromone matrix. Furthermore, synchronization of all parallel instances after every iteration is costly. 1. for all iterations i do 2. for all parallel instances m parallel do 3. for all local ants a do 4. local pheromone = global pheromone 5. construct solution 6. local pheromone update 7. end for 8. find best solution across parallel instances 9. update global pheromone update based on best solution 10. end for 11. end for Both IAC and PA implementations are exploiting task parallelismeach parallel instance (thread) gets set of tasks to complete. An alternative approach would be to look at data parallelism and vectorizationeach thread processes a specific section of the data and cooperatively complete the given task. Due to the highly sequential parts of ACO, it would not be practical to only use vectorization alone. A more desirable path would be to implement vectorization in conjugate to the task parallelism. In case of CPU, task parallelism can be done by the threads, while vectorization done by Vector Processing Units (VPUs) based on Advanced Vector Extensions 2 (AVX2) or AVX512. Moreover, in case of GPU and CUDAtask parallelism would be done at thread-block level while data parallelism would exploit WARP structures. Parallel Ants with Vectorization (PAwV) expands on the Parallel Ants architecture by introducing data-parallelism of solution creation and an alternative roulette wheel implementation -SS-Wheel, first proposed in [24] . SS-Wheel mimics a sequential roulette wheel while allowing higher throughput due to parallelism. Local search in Figure 6 expands on the implementation in Figure 5 (lines 3-7). First the choiceMatrix is calculated by multiplying the probability of the route to be chosen with the tabu lista list of still available routes (where 0 represents not available and 1route still can be selected). A random number between 0 and 1 is generated in order to determine if a given route will be chosen based on exploitation or exploration. In case of exploitation, the choiceMatrix is reduced to obtain the maximum and the corresponding route index. Furthermore, in case of exploration, the route is chosen based on the SS-Roulette wheel described by [24] . A sequential implementation of ACO described in [58] is adapted from [59] by altering the heuristic information calculation for a given routedefined as a proportion of order's weight and the maximum weight gap (see Equation (2)). Furthermore, the Ant Colony System set of parameters for all configurations and architectures are shown in Table 3 . Moreover, we then implement three different Parallel ACO architectures -Independent Ant Colonies (IAC), Parallel Ants (PA) and Parallel Ants with Vectorization (PAwV) in C++ and CUDA C. Experiments were conducted on three different hardware configurations -CPU, GPU and Xeon Phi. Where Hardware A is a host system for Hardware C. • GPUs: 4x Nvidia GTX1070, 8GB GDDR5 per GPU, 1.9GHz core, 4.1GHz memory. PCIe with 16x/8x/16x/8x. • Toolchain: Visual Studio v140 toolset, Windows SDK version 8.1, x64, CUDA 9.0, compute_35, sm_35 It is important to take both elapsed time and solution quality into consideration when referring to speed optimization of optimization algorithms. One could get superior convergence within iteration but, take twice as long to compute. Similarly, one could claim that algorithm is much faster completing defined number of iterations, but sacrifice solution quality. Furthermore, there is little point comparing sequential execution of one hardware platform to a parallel implementation of another. Comparison should take into consideration all platform strengths and weaknesses and set up the most suitable configuration for given platform. In order to obtain a baseline fitness convergence rate at various number of parallel instances, we create Iterations vs Parallel Instances matrix for all architectures. An example of such matrix for Parallel Ants is shown in Table 4 . The matrix is derived by averaging the resulting fitness obtained from 10 independent simulations with a unique seed value for each given Parallel Instances configuration. All configurations are run for x number of iterations, where x is based on the total number of solutions explored and is a function of the number of Parallel Instances. The total number of solutions explored is set to 768k. The number of Parallel Instances is varied by 2 −1 with maximum n of 11, i.e. 1024 parallel instances. The best value after every 5 iterations is also recorded. We then compute the number of iterations required to reach a specific solution quality for different ACO architectures in Table 5 , expressed as proximity to the best-known optimal solution . For the specific problem and dataset, the best solution is a total cost of 2,701,367.58. There are 6 checkpoints of solution quality ranging from 99% to 99.9%. Although at first 1% gain might not seem significant, one has to remember that global supply chain costs are measured in hundreds of millions, and even 1% savings do affect bottom line. Empty fields (-) represent instances where the ACO was not able to converge to given solution quality. On all experiments, IAC was able to obtain solution quality only below 99.6%, whereas PA and PA with 5 ant local search was able to obtain optimal solution with 512 and 1024 parallel instances. Furthermore, IAC did not see any significant benefit of adding more parallel instances for 99% and 99.25% checkpoints. In contrast, PA does benefit from the increase in number of parallel instances. For instance, PA is able to obtain the same solution quality in half the number of iterations at 99% checkpoint (scaling of 2x for sequential vs 1024 parallel instances). Scaling of 633.7x in case of 99.5% checkpoint for sequential counterpart. Similarly, PA with 5 ant sequential local search has the same dynamics, with scaling of 4x at 99% checkpoint compared to sequential and 140x at 99.6% checkpoint compared to 2 and 1024 parallel instances. One can also note that at increased solution quality and little number of parallel instances, PA with 5 ant local search also offers increased efficiency in terms of total solutions explored. For example, at the 99.5% checkpoint with 2 parallel instances, PA takes 2590 iterations, while PA with 5 ant local search only requires 65 (decrease of 40x iterations, or 8x total solutions explored). However, in most instances, PA without any local search is more efficient. To evaluate speed performance, we ran each given configuration and parallel architecture for 500 iterations or 10 minutes wall-clock time (whichever happens first) and recorded total number of iterations and wall-clock time for 3 independent runs. Then, average wallclock time per iteration was calculated. It is important to measure the execution time correctly, just purely comparing computation per kernel/method may not show the real-life impact. For that reason, total time is measured from the start of the memory allocation to the freeing of the allocated memory, however it does not include time required to load the dataset into memory. This allows us to estimate, with reasonable accuracy, what is the wall-clock time required to run a specific architecture and configuration in order to converge to a given fitness quality. Although, running each given architecture and configuration 10 times would produce more accurate convergence rate estimates, it would also require significantly more computation time. Furthermore, all vectorized implementations went through iterative profiling and optimization process to obtain the fastest execution time. To the best of the authors' knowledge, all vectorized implementations have been fully optimized for the given hardware. ACO implementation of IAC, PA and PAwV was implemented in C++ and multiple experiments of the configurations are shown in Table 6 The number of iterations required to reach specific solution quality Parallel Ants was used to compile the implementation. KMP 1 (an extension of OpenMP) config was varied based on total hardware core and logical core count (16c,2t = 32 OpenMP threads). Very similar results were obtained for both IAC double precision and PA double precision, with PA having around 5% overhead compared to IAC. In both instances, running 32 OpenMP threads offered around 24% speed reduction compared to 16 threads. Furthermore, PAwV with double precision vectorization using AVX2 offered speed reduction of 26%, while scaling from 16 OpenMP threads to 32 offered almost no scaling at 256 parallel instances upwards. The nature of ACO pheromone sharing and probability calculations does not require double precision and therefore can be substituted with single precision calculations. AVX2 offers 256-bit manipulations, therefore increasing theoretical throughput by factor of 2, compared to double precision. 36% decrease in execution time was obtained, as not all parts of the code are able to take advantage of SIMD. Furthermore, doing 5 ant sequential local search within each parallel instance increases time linearly and produces little time savings in terms of solutions explored. The overall scaling factor at 1024 parallel instances compared to sequential execution at PAwV (single precision with AVX2 and 16c2t) is therefore 25.4x. Similar experiments were conducted also on the Xeon Phi hardware, PAwV, single precision, AVX2, with 5 sequential ant local search precision and offered near perfect scaling on 1024 parallel instances with 4 hyper-threads per core, or 40% overall speed improvement compared to PAwV with double precision (3 seconds vs 1.804 seconds). Alike Hardware A, having 5 sequential local ants does not provide any time savings and time increases linearly. The overall scaling factor at 1024 parallel instances compared to sequential execution at PAwV (single precision with AVX512 and 64c4t) is therefore 148x. A further set of experiments were also conducted for GPU, Table 8 . The implementation with no vectorization (Blocks x1), uses 1 thread per CUDA block to compute one solution, therefore 1024 parallel instances require 1024 blocks. Similarly, for (Blocks x32), 32 threads are used per block, each thread computing its own solution independently. For parallel instances of 32, only 1 block would be used with 32 threads. The implementation of no vectorization utilizes no shared memory, however, all static problem meta data is stored as textures. A single kernel is launched and best solution across all parallel instances is returned. Vectorized version implements architecture described in [24] , storing the route choice matrix in shared memory and utilizing local warp reduction for sum and max operations. Each thread-block builds its own solution, while the extra 32 threads assist with the reduction operations, memory copies and fitness evaluation. Table 8 shows the comparison between the two implementations. Implementation without vectorization performs on average 2 times slower compared to the vectorized version. Furthermore, 64 threads per block (Blocks x64) performs slower than 32 threads per block (Block x32). Next, scaling across multiple GPUs were explored. Each device takes a proportion of 1024 instances with unique seed values and after each iteration, best overall solution is reduced. In case of 2 GPUs and 1024 parallel instances, each device will compute 512 parallel GPU implementation is therefore one magnitude of order slower than that of CPU, though this could be explained by the nature of the problem and not be specific to ACO architecture, as there have been a lot of success on GPUs solving simple, low memory footprint TSP instances [24] [46] [47] . However, the problem being solved in this paper requires a lot of random global memory access to check for all restrictions such as order limits, capacity constraints and weight limits, which are too big to be stored on the shared memory. If both convergence rate of the architecture and the speed of the hardware is taken into account, an estimate can be made on what would be the average wall-clock time to converge to a specific solution quality. The fastest configuration for both Hardware A (Table 6 ) and Hardware B ( Table 7) was chosen and then multiplied by the number of iterations required to reach a specific solution quality (Table 5 ) to obtain an estimate of the compute time required (Table 9 ). Therefore, a fairer real-life impact can be derived. GPU results (Hardware C) were not included as they are significantly slower. Nature-inspired meta-heuristic algorithms such as Ant Colony Optimization (ACO) have been successfully applied to multiple different optimization problems. Most work focuses on the Travelling Salesman Problem (TSP). While TSPs are a good benchmark for new idea comparison, the dynamics of the proposed algorithms for benchmarks do not always match to a real-world performance where problem has more constraints (more meta-data during solution creation). Furthermore, speed and fitness performance comparisons are not always completely fair when compared to a sequential implementation. This work solves a real-world outbound supply chain network optimization problem and compares two different ACO architectures -Independent Ant Colonies (IAC) and Parallel Ants (PA). It was concluded that PA outperformed IAC in all instances, as IAC failed to find any better solution than 99.5% of optimal. In comparison, PA was able to find near optimal solution (99.9%) in less iterations due to efficient pheromone sharing across ants after each iteration. Furthermore, PA shows that it consistently finds a better solution with the same number of iterations as the number of parallel instances increase. Moreover, a detailed speed performance was measured for 3 different hardware architectures -16 core 32 thread workstation CPU, 68 core server grade Xeon Phi and general purpose Nvidia GPUs. Results showed that due to the nature of the real-world problem, memory access footprint required to check capacity limits and weight constraints did not fit on small shared memory on GPU and therefore it performed 29 times slower than the other two hardware solutions even when running 4 GPUs in parallel. When compared to a real-life impact on time required to reach a specific solution quality, both CPU and Xeon Phi optimized-vectorized implementations showed comparable speed performance; with CPU taking the lead with lower Parallel Instances count due to the much higher clock frequency. At near optimal solution (99.75%+) and 1024 parallel instances, Xeon Phi was able to take full advantage of AVX512 instruction set and outperformed CPU in terms of speed. Therefore, compared to an equivalent sequential implementation at 1024 parallel instances, CPU was able to scale 25.4x while Xeon Phi achieved a speedup of 148x. The number of parallel instances Hardware A -TR1950x Hardware B -Xeon Phi 7250F Due to the fact that PA fitness performance increases as the number of parallel instances increase, it would be worth looking into scaling above 1024 instances using either clusters of CPUs or clusters of Xeon Phi's, which will be part of the future work. Furthermore, Field Programmable Gate Arrays (FPGAs) might have potential to take advantage of highly vectorized ACO, which is another area of possible future research. Tactical supply chain planning models with inherent flexibility: definition and review An ant colony system for responsive dynamic vehicle routing Multi-satellite control resource scheduling based on ant colony optimization HOPNET: A hybrid ant colony optimization routing algorithm for mobile ad hoc network A novel hybrid approach based on Particle Swarm Optimization and Ant Colony Algorithm to forecast energy demand of Turkey A revised ant algorithm for solving locationallocation problem with risky demand in a multi-echelon supply chain network Ant Colony Optimization For Split Delivery Inventory Routing Problem Designing closed-loop supply chains with nonlinear dimensioning factors using ant colony optimization A Parallel Implementation of Ant Colony Optimization Metaheuristic algorithms and probabilistic behaviour: a comprehensive analysis of Ant Colony Optimization and its variants A parallel cooperative hybrid method based on ant colony optimization and 3-Opt algorithm for solving traveling salesman problem Parallel Performance of an Ant Colony Optimization Algorithm for TSP A Survey on GPU-Based Implementation of Swarm Intelligence Algorithms First Results of Performance Comparisons on Many-core Processors in Solving QAP with ACO: Kepler GPU versus Xeon Phi Parallel ant colony optimization for resource constrained job scheduling Comparative evaluation of platforms for parallel Ant Colony Optimization RMACO :a randomly matched parallel ant colony optimization Parallel ant colony optimization on multicore SIMD CPUs Parallel ant colony optimization on multi-core SIMD CPUs Parallel Ant Colony Optimization for the HP Protein Folding Problem A CUDA Based Solution to the Multidimensional Knapsack Problem Using the Ant Colony Optimization CPU Versus GPU Parallelization of an Ant Colony Optimization for the Longest Common Subsequence Problem Enhancing data parallelism for Ant Colony Optimization on GPUs High-throughput Ant Colony Optimization on graphics processing units Accelerating ant colony optimization-based edge detection on the GPU using CUDA Parallel Ant Colony Optimization for Flow Shop Scheduling Subject to Limited Machine Availability Using a coprocessor to solve the Ant Colony Optimization algorithm Efficient exploitation of the Xeon Phi architecture for the Ant Colony Optimization (ACO) metaheuristic A highly parallelized and vectorized implementation of Max-Min Ant System on Intel® Xeon Phi™ An Effective Parallelism Topology in Ant Colony Optimization algorithm for Medical Image Edge Detection with Critical Path Methodology (PACO-CPM) Multi-threading based implementation of Ant-Colony Optimization algorithm for image edge detection Applying ACO to Large Scale TSP Instances Effective heuristics for ant colony optimization to handle largescale problems Performance Analysis and Tuning for Parallelization of Ant Colony Optimization by Using OpenMP Parallel ant colony optimization for the determination of a point heat source position in a 2-D domain A Parallel Ant Colony Optimization for the Maximum-Weight Clique Problem Evaluation of parallelism in ant colony optimization method for numerical solution of optimal control problems GENERIC TECHNIQUES IN GENERAL PURPOSE GPU PROGRAMMING WITH APPLICATIONS TO ANT COLONY AND IMAGE PROCESSING ALGORITHMS Accelerating Ant Colony Optimization for the Vertex Coloring Problem on the GPU Optimizing PolyACO Training with GPU-Based Parallelization GPU implementation of ant colony optimization-based band selections for hyperspectral data classification A GPU-based Parallel Ant Colony Algorithm for Scientific Transit stop inspection and maintenance scheduling: A GPU accelerated metaheuristics approach Research on Solving Travelling Salesman Problem using Rank Based Ant System on GPU Multi Colony Ant System based Solution to Travelling Salesman Problem using OpenCL GACO: A GPU-based High Performance Parallel Multi-ant Colony Optimization Algorithm Accelerating ant colony optimisation for the travelling salesman problem on the GPU Dynamic strategy based parallel ant colony optimization on GPUs for TSPs Cultural Ant Colony Optimization on GPUs for Travelling Salesman Problem ACO-PSO Optimization for Solving TSP Problem with GPU Acceleration Dynamic load balancing on heterogeneous clusters for parallel ant colony optimization A GPU-based parallel MAX-MIN Ant System algorithm with grouped roulette wheel selection The GPU-based parallel Ant Colony System CUDA toolkit documentation Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition Supply Chain Logistics Problem Dataset Ant colony system: a cooperative learning approach to the traveling salesman problem Composite goal methods for transportation network optimization Authors would like to thank Intel Corporation for donating the Xeon Phi hardware.