
As part of the MEng (Structured) programme with a focus on Data Science, our students are required to complete a final 60-credit data science research project in which they apply and consolidate the data science knowledge gained throughout the programme. For this purpose, students solve a real-world data science problem, providing solutions for each step of the data science project life cycle and documenting the work in a research assignment.
For these projects, we collaborate with industry and academic partners who are willing to propose a topic, to provide the necessary data (if not publicly available), and to act as domain mentors. The data set needs to be complete.
If you are interested in partnering with us for such a project, please contact DS-PROJECTS@sun.ac.za for further information about a short project proposal and deadlines.
Project proposals reviewed by the end of term 3 of a given year will be assigned to students for the following year.
Below is a list of completed research assignments, grouped by year of graduation.
March 2025 Graduation

Title | Abstract |
---|---|
Set-based Particle Swarm Optimization for Training Support Vector Machines | This research explores the application of set-based particle swarm optimization (SBPSO) to the training of support vector machines (SVMs), addressing challenges in hyperparameter tuning, noisy datasets, and computational efficiency. SVMs, celebrated for their classification precision, often face limitations due to their sensitivity to parameter selection and difficulties in handling high-dimensional or noisy data. SBPSO, an extension of traditional particle swarm optimization (PSO), is tailored for discrete optimization problems, making it a promising approach for optimizing SVM performance.
The study investigates two approaches: standard SBPSO-SVM training and SBPSO-SVM training with Tomek links preprocessing, which enhances data quality by reducing noise and refining decision boundaries. Experiments conducted on five benchmark datasets reveal that both methods significantly reduce the number of support vectors while maintaining competitive accuracy and F1 scores. However, training times were substantially longer than those of standard SVMs, highlighting a need for further optimization. To address these challenges, dynamic control of SBPSO parameters was introduced, alongside advanced preprocessing techniques such as principal component analysis (PCA) with Gaussian mixture model (GMM) noise filtering and Wilson editing. While these enhancements improve training efficiency and performance for complex and noisy datasets, the algorithm still struggles to scale effectively to very noisy, large, and highly complex datasets. This research contributes to the ongoing development of hybrid optimization frameworks, providing insights into balancing computational costs with classification performance. The findings underscore the potential of SBPSO-SVM as a robust tool for advancing machine learning applications in diverse, real-world scenarios. |
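
For illustration, a minimal sketch of the Tomek links preprocessing step named above, paired with an ordinary scikit-learn SVM (SBPSO itself has no off-the-shelf implementation, and the dataset here is a synthetic stand-in):

```python
# Tomek links preprocessing before SVM training; data and parameters are
# illustrative stand-ins, not the study's benchmarks.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=0)  # flip_y injects label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Remove Tomek links: points in cross-class nearest-neighbour pairs, which
# tend to sit on noisy decision boundaries.
X_clean, y_clean = TomekLinks().fit_resample(X_tr, y_tr)

svm = SVC(kernel="rbf", C=1.0).fit(X_clean, y_clean)
print("support vectors:", svm.n_support_.sum(),
      "F1:", f1_score(y_te, svm.predict(X_te)))
```
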
Property tax validation using automatic building footprint extraction from aerial images | Property valuation is essential to determine the rates and taxes needed for municipal services. Property valuation depends on the number and size of buildings on a property. A tedious manual process is used to create the outlines of buildings and calculate the building area. Therefore, this project aims to develop a process to generate building outlines from unmanned aerial vehicle raster images with as little human intervention as possible. The solution developed uses semantic pixel classification to detect buildings and find the building outlines. The outlines can then be used to validate property valuations.
To perform semantic pixel classification, a U-Net architecture was selected. Various experiments were conducted to find the optimal U-Net architecture. The output of the semantic pixel classification was used along with a contour extraction method to extract the building’s outline. Similarly, experiments were conducted to select the optimal contour extraction method. The U-Net model and contour extraction method are combined to create a process capable of extracting building outlines from raster images. Experiments were performed using a human-in-the-loop approach, a variant of active learning. The training results show accuracy, recall, precision, and intersection over union above 90%. Even though the training showed excellent training and validation metrics for the experiments, the project shows how critical the training data is to predicting test data and determining the quality of image segmentation and building outline extraction. Finally, the process produces vector data that accurately represents 80 to 90% of buildings with an area error of less than one square meter. |
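
A minimal sketch of the contour-extraction step, under the assumption of a binary U-Net-style building mask; cv2.findContours with polygon simplification is one common choice among the methods the study compared:

```python
# Turn a binary building mask (e.g. U-Net output) into polygon outlines
# and pixel areas. The rectangle stands in for a real prediction.
import cv2
import numpy as np

mask = np.zeros((256, 256), dtype=np.uint8)
cv2.rectangle(mask, (50, 60), (180, 200), 255, -1)  # stand-in "building"

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    # Simplify the outline; epsilon controls how aggressively it is smoothed.
    outline = cv2.approxPolyDP(c, 0.01 * cv2.arcLength(c, True), True)
    area_px = cv2.contourArea(outline)
    print(f"vertices: {len(outline)}, area: {area_px:.0f} px^2")
```
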
Credit Scoring and Risk Assessment Using Machine Learning and Overdraft History | Credit scoring is a quantitative evaluation method by which lenders assess whether a borrower (either an individual or a business) is able to repay a debt if credit is granted. A credit score is typically generated at the end of the credit scoring process, and it is a fundamental element that influences an individual’s access to credit. It acts as a gateway to financial resources such as loans, credit cards, and others, highlighting the importance of fairness, non-discrimination, and ethical practices to ensure equitable access to credit, free from prejudice and bias.
Credit history is typically the key factor in traditional scoring methods, including the FICO score, logit models, and expert judgment-based models, among others. As a result, individuals who have never borrowed may be overlooked or subjected to high-interest rates. To address these limitations, this study leverages overdraft information to develop a dynamic, inclusive, and effective credit scoring framework. This framework integrates both traditional credit history data and overdraft data, which is often underutilized but can potentially serve as an indicator of good versus bad borrowers. Additionally, the literature does not identify which machine learning method is best suited for credit scoring tasks. To overcome this uncertainty, the following algorithms are trained: KNN, Naïve Bayes, decision trees, ANN, and SVM, to predict which bank customers are likely to default on credit using three distinct datasets: overdraft, credit history, and a combination of both. The performance of these algorithms is evaluated to determine the most accurate predictive method. Through a series of hyperparameter tuning across the algorithms, the results of this study suggest that Naïve Bayes is particularly effective when both credit history and overdraft data are available, as it demonstrated minimal misclassifications and robustness in classifying customers correctly. The algorithm performed best on the three tested datasets, achieving accuracy rates of 99.01% for the credit history dataset, 99.5% for the hybrid dataset, and 100% for the overdraft dataset. KNN also performed well, with accuracy rates of 98.93% for credit history, 99.3% for the hybrid dataset, and 99.97% for overdraft. Additionally, a comparison of the overdraft credit scores versus credit history scores indicated that overdraft-based scores reflect a more optimistic distribution, with a significant reduction in the percentage of customers categorized as poor when using overdraft data alongside credit history. The combination of both datasets resulted in more accurate credit assessments, increasing the number of customers qualifying for credit approval. Specifically, 75% of customers qualified using the combined dataset, compared to 65% with overdraft data and 45% with credit history alone. The results of this study offer new perspectives for financial institutions that traditionally rely solely on credit history data to profile individuals. This unique study represents a potential game-changer in the field, with the capacity to bring about a significant paradigm shift in lending and borrowing practices. If successfully adopted, this approach could create a mutually beneficial situation for both lenders and borrowers. Individuals often denied credit due to a lack of credit history would no longer be excluded, thereby enhancing decision-making processes and potentially increasing profitability. |
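
The comparison setup described above can be sketched as follows; the synthetic data stands in for the non-public overdraft and credit-history datasets, and hyperparameter tuning is omitted:

```python
# Cross-validated comparison of the five algorithm families named in the
# abstract, on a synthetic stand-in dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
models = {
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "ANN": MLPClassifier(max_iter=1000, random_state=0),
    "SVM": SVC(),
}
for name, model in models.items():
    clf = make_pipeline(StandardScaler(), model)  # scale, then classify
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```
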
Machine Unlearning of Convolutional Neural Networks to Address the Right to Be Forgotten | This research assignment examines whether personally identifiable information can be removed from a convolutional neural network using a machine unlearning algorithm and verified as removed, to ensure compliance with the right to be forgotten as outlined in the General Data Protection Regulation. Machine unlearning investigates whether data removal can be achieved while preserving model performance, without fully retraining the machine learning model.
In this research assignment, a convolutional neural network is trained on facial images. The performance of the convolutional neural network before and after applying a machine unlearning algorithm is then established. The evaluation examines the extent of data required for machine unlearning, such as whether a single image, multiple images, or all images used during training are necessary to remove the presence of data associated with an individual. Machine unlearning demonstrated effectiveness in removing specific data from the convolutional neural network, as measured by a membership inference attack. The machine unlearning algorithm, which utilises Kullback-Leibler divergence and weight regularisation, enabled the removal of data for a single individual as well as for a forget set composed of a sampled group of individuals without requiring full retraining. The study shows that unlearning can be successfully achieved while preserving the generalisation capabilities of a convolutional neural network. |
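
One plausible formulation of such an unlearning objective, sketched here as an assumption rather than the authors' exact algorithm, combines a Kullback-Leibler term that pushes forget-set predictions toward the uniform distribution with an L2 penalty that keeps weights close to the originally trained model:

```python
# Speculative sketch of a KL-plus-weight-regularisation unlearning loss.
import torch
import torch.nn.functional as F

def unlearning_loss(model, original_params, x_forget, num_classes, lam=1e-2):
    log_probs = F.log_softmax(model(x_forget), dim=1)
    uniform = torch.full_like(log_probs, 1.0 / num_classes)
    # KL term on the forget set: erase class evidence for those samples.
    kl = F.kl_div(log_probs, uniform, reduction="batchmean")
    # Weight regularisation: stay close to the trained model elsewhere.
    reg = sum((p - p0).pow(2).sum()
              for p, p0 in zip(model.parameters(), original_params))
    return kl + lam * reg

model = torch.nn.Linear(8, 4)               # toy stand-in for the CNN
orig = [p.detach().clone() for p in model.parameters()]
loss = unlearning_loss(model, orig, torch.randn(16, 8), num_classes=4)
loss.backward()                             # gradients drive the unlearning step
print(loss.item())
```
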
Towards an automated medical image classification pipeline | Radiological departments have high demands for efficiency and diagnostic quality, and the interpretation of radiographs is highly variable between radiographers. The process followed in a radiological department to support patients with health services can be made more efficient. Parts of the process, such as retrieving and processing data, can be automated with artificial intelligence to expedite the process and increase the quality of services offered.
Deep learning is a subfield of artificial intelligence, and transfer learning is a subfield of deep learning. Transfer learning can be applied to image classification tasks to improve the predictive accuracy of classes. Medical images cover several modalities such as X-rays, ultrasound, magnetic resonance imaging, and angiographs, amongst others. Several transfer learning methods are compared to perform classification for two model components. The first component is a machine learning model that can predict the medical image modality type of an image. The second component is a machine learning model that can predict the body part from human anatomy. This research assignment covers the creation of a medical imaging dataset that is sourced from open-source data sets. A variety of transfer learning models such as residual neural networks, dense neural networks, and efficient neural networks are evaluated on this data set. The results of this research assignment show that lightweight transfer learning methods can successfully be applied to perform classification on medical imagery. The best performing models of both components are combined in a transfer learning classification pipeline. The transfer learning pipeline produced a predictive accuracy of 96.3034% on testing data. |
Evolving Oblique Decision Trees | This study investigates the induction of oblique decision trees for classification using genetic programming, with constraints imposed on the genetic operators and the fitness function. Additionally, the study examines the effect of introducing pre-defined genetic programs in the initial population of the evolutionary process on the performance of the genetic programs in solving classification tasks. The pre-defined individuals in the initial population were generated by leveraging clustering techniques and methodology inspired by the Cline decision tree [24].
The goals were achieved by developing constrained genetic programs to induce oblique decision trees. The results demonstrate that using genetic programming with applied constraints for classification purposes is feasible and results in decision trees that perform exceptionally well compared to standard axis-aligned and oblique decision trees, albeit at the cost of increased computational resources. Results from the experiment also highlight that the overall performance of genetic programming-based algorithms relies more heavily on the evolutionary process itself rather than the introduction of initial population diversifying techniques. |
Horticulture Supplier Delivery Forecast | Supermarket retailers rely on suppliers to meet customer demands, but suppliers often face disruptions that prevent them from delivering the agreed quantities. This is true in the horticultural sector, where weather and logistical challenges affect the delivery reliability. Accurate forecasting of horticultural supplier deliveries is critical for supermarket retailers, as fresh fruit is a key source of revenue. This highlights the need for improved forecasting methods that use predictive analytics to improve forecast accuracy.
The main objective was to develop a predictive analytics solution to forecast deliveries from horticultural suppliers, focusing on fresh fruit. The research aims to help retailers align supply with demand, reducing stock shortages and managing variability in deliveries. The study employs machine learning models trained on 24 months of historical data, incorporating derived features that represent factors influencing delivery reliability. The models, including a baseline model, are evaluated over a 6-month period, using 69 exclusive suppliers and 32 product types. The research assignment found that the majority of the models outperformed the baseline, with random forest and GRU models performing the best based on standard evaluation metrics. The baseline model achieved a mean absolute error (MAE) of 30.35, while the random forest model reduced the MAE to 0.47, demonstrating a significant improvement in forecasting accuracy. The findings show that the integration of predictive analytics and the incorporation of influential factors address key challenges faced by retailers, such as inconsistent supplier deliveries, and can improve forecast visibility and customer satisfaction. This study contributes to predictive analytics in the horticulture supply chain, highlighting the importance of integrating factors to optimise forecasting. |
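
The evaluation pattern, a naive baseline versus a learned model scored by MAE on a held-out final period, can be sketched as follows; the feature names and synthetic series are illustrative assumptions:

```python
# Baseline vs. random forest on lagged delivery features, scored with MAE
# over the final quarter of a synthetic two-year daily series.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 730  # ~24 months of daily records
df = pd.DataFrame({"delivered": rng.gamma(5.0, 10.0, n)})
df["lag_7"] = df["delivered"].shift(7)
df["rolling_mean_28"] = df["delivered"].shift(1).rolling(28).mean()
df = df.dropna()

split = int(len(df) * 0.75)  # last quarter ~ a 6-month test window
train, test = df.iloc[:split], df.iloc[split:]

baseline = test["lag_7"]     # naive baseline: last week's delivery
model = RandomForestRegressor(random_state=0).fit(
    train[["lag_7", "rolling_mean_28"]], train["delivered"])
pred = model.predict(test[["lag_7", "rolling_mean_28"]])

print("baseline MAE:", mean_absolute_error(test["delivered"], baseline))
print("random forest MAE:", mean_absolute_error(test["delivered"], pred))
```
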
Behavioural Scorecards Development and Machine Learning | This study compares traditional behavioural scorecards based on logistic regression (LR) with machine learning (ML) for credit risk assessment. The study aims to improve predictive performance while maintaining model interpretability to comply with Basel regulatory standards. To achieve this, the study introduces the Bayesian Weight of Evidence Optimizer (BWOpt) for binning optimization in LR models and proposes the interpretable prepruned penalized logistic tree regression (P-PLTR) alongside RuleFit. It also explores the effects of sampling strategies (undersampling and oversampling) on model performance with imbalanced datasets.
Results show that traditional scorecards outperform ML models, particularly with oversampled data. While RuleFit and P-PLTR show competitive performance with undersampling, P-PLTR suffers from instability in rule sets. BWOpt-enhanced LR models outperform both ML methods, highlighting the value of feature engineering. These findings align with existing literature, which suggests that ML models do not significantly outperform statistical models such as LR in structured data, though ML may offer advantages with unstructured data. Given their balance of interpretability and predictive power, traditional scorecards remain well-suited for regulated environments. |
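
The weight-of-evidence encoding that underlies such scorecard binning is WoE_bin = ln(%good / %bad), with the information value summing (%good - %bad) x WoE over bins. A minimal pandas sketch on toy data follows; the BWOpt binning optimizer itself is the study's own contribution and is not reproduced:

```python
# Weight of evidence (WoE) and information value (IV) for one binned
# feature; column names and data are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "utilisation_bin": ["low", "low", "low", "mid", "mid", "high", "high"],
    "default":         [0,     0,     1,     0,     1,     1,      0],
})

grp = df.groupby("utilisation_bin")["default"]
good = (grp.count() - grp.sum()) / (len(df) - df["default"].sum())
bad = grp.sum() / df["default"].sum()
woe = np.log(good / bad)          # positive WoE = safer-than-average bin
iv = ((good - bad) * woe).sum()   # information value of the feature
print(woe, f"\nIV = {iv:.3f}")
```
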
Evaluating Heterogeneous Graph Embeddings for Product Substitute Identification with LLM-Generated Attributes | In the context of the food retail sector, the identification of product substitutes is crucial for several reasons, including the determination of the assortment of store products, the design of marketing campaigns, the promotion of items, and the avoidance of potential cannibalisation when introducing new products. Given the extensive range of products and categories, understanding product relationships and consumer purchasing behaviours is essential. Product relationships can be categorized as complements, substitutes, or irrelevant product pairs. This study seeks to investigate product substitutes through the process of product clustering. The first of three research objectives is to determine whether the use of product attributes leads to the formation of consistent and informative product groups. The second objective aims to determine whether usable and accurate product attribute values can be derived from product descriptions using large language models (LLMs). The final objective is to evaluate the impact of the LLM-generated product attribute values on the formation of substitute product groups.
To determine product attributes, a combination of structured and unstructured data sources from a prominent South African food retailer is utilised, with the intention of elucidating product substitute relationships while integrating a degree of explainability derived from the product attributes. The framework known as Product Attribute Value Extraction (PAVE) offers a prompt engineering template as an efficient method for extracting explicit and implicit attributes from product descriptions, with an accuracy level of up to 85% in this study depending on the chosen model. While high overall accuracy is obtained, there are slight nuances to the accuracy of the different attributes, where some have a significantly lower extraction accuracy for most of the models tested. However, even with high accuracy rates, the LLMs can be further fine-tuned for use-case-specific tasks, allowing for even higher accuracy. In the pursuit of identifying product substitutes, product attributes are utilised in conjunction with transaction data to capture purchasing behaviours. Various graph embedding and graph clustering models are evaluated to identify a model that can fulfil the dual objectives of substitutability and explainability. A heterogeneous graph embedding is chosen for conducting the substitutability analysis, in combination with similarity-based and graph-based clustering algorithms. The heterogeneous model is selected due to its higher potential for offering context-specific explainability amidst the continuously evolving domain of product relationships. Experiments are conducted to attempt to cluster products into substitute categories that correspond with both the retailer’s groupings and those of a baseline model. The findings indicate that the use of product attributes does not constitute the most effective and scalable approach to achieve substitute product categorisation. This limitation arises from the inherent sensitivity of heterogeneous graphs to both configuration settings and input data, which may require tailored and context-specific model calibrations. Further investigations are warranted to explore the potential integration of product attributes into heterogeneous graph embeddings for substitute categorisation. Alternatives could include knowledge graphs and link prediction or the adaptation of the PAVE framework to facilitate the extraction of product substitutes from a list, potentially enriched with external data. |
The application of feature reduction to reduce the training time of neural networks | The rise of big data has revolutionised the way businesses operate, but it has also created computational challenges when training deep learning models. Training machine learning models on large datasets is a complex and time-consuming task that requires significant computational resources. This research investigates whether feature reduction techniques can be used as a solution to reduce training time. A review of the literature on transfer learning, data-centric approaches, feature selection, and dataset reduction is conducted, and a methodology is developed to evaluate the efficiency of these techniques on image datasets. The techniques are assessed based on the number of features used, training time of the model, accuracy, and precision. The results demonstrate the extent to which feature reduction methods can decrease training times and improve model accuracy. This research culminates in recommending the best-performing techniques, providing valuable insights for optimising machine learning processes in image analysis. |
Deriving an Agricultural Soil Quality Index from Soil Microbiome using Autoencoders | Soil quality plays a pivotal role in sustaining ecosystems, influencing climate change and supporting agricultural productivity. Degradation of soil can severely threaten food security and exacerbate global warming. Current definitions and indices for assessing soil quality concentrate on a single soil function or fail to consider the important interrelationships and dynamics between soil properties. Principal component analysis is commonly used to establish a soil quality index through additive or weighted additive models. Principal component analysis is however inadequate when nonlinear relationships or high correlation exist among variables. Moreover, additive methods require prior knowledge of how specific soil properties impact quality without considering interdependencies among them. These limitations complicate the integration of the soil microbiome into a soil quality index. Given the complexity and diversity of microbial communities in soil, there are limited studies that define soil quality from a microbial perspective. Yet, the soil microbiome is essential for maintaining soil functionality and preventing degradation. Thus, there is a need to develop a soil quality index that incorporates microbial activity to enhance food security and promote sustainable agriculture.
This study proposes the use of autoencoders to develop a soil quality index derived from soil microbiome data. To address the high dimensionality of the microbiome dataset, four feature selection techniques — principal component analysis, Pearson correlation, agglomerative hierarchical clustering, and Louvain community detection — were implemented to generate minimum datasets which were used to train various autoencoder designs. The output from the autoencoder’s bottleneck layer was used to derive a soil quality index, which was evaluated against microbial diversity indices. The soil quality index showed a strong correlation with the Chao1 diversity index and moderate correlations with the Shannon and Simpson diversity indices. Among the minimum datasets used, the dataset generated using agglomerative hierarchical clustering produced a soil quality index with the highest correlations to microbial diversity indices. The soil quality index derived using a sparse autoencoder was particularly favored due to its simplicity, as it reduces to a sigmoid function during inference, enhancing explainability and interpretability. |
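
A minimal Keras sketch of the favoured design: an autoencoder whose one-unit sigmoid bottleneck becomes the soil quality index at inference. The input size, sparsity penalty, and random data are assumptions:

```python
# Sparse autoencoder with a one-unit sigmoid bottleneck used as a 0-1 index.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

n_features = 50  # size of a "minimum dataset" of microbiome features
inputs = keras.Input(shape=(n_features,))
# At inference the index reduces to a sigmoid of a weighted feature sum.
index = layers.Dense(1, activation="sigmoid",
                     activity_regularizer=regularizers.l1(1e-4))(inputs)
outputs = layers.Dense(n_features)(index)  # decoder reconstructs the input

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(200, n_features).astype("float32")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)

sqi = keras.Model(inputs, index)  # soil quality index model
print(sqi.predict(X[:5], verbose=0).ravel())
```
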
Incremental Feature Learning: A Constructive Approach to Training Neural Networks with Dynamic Particle Swarm Optimisation | Incremental feature learning (IFL) is a supervised machine learning (ML) paradigm for feedforward neural networks (NNs), where the input layer of the NN is incrementally constructed over time. The benefits of such a paradigm are twofold: the first is the ability afforded to a NN to dynamically incorporate new features as they become available over time without the need for retraining; the second is a reduction in overfitting behaviour and model complexity, and hence improved NN generalisation ability. A feature ranking approach based on feature importance is used to determine the order in which features are integrated into the model. The incremental addition of features to a NN results in a dynamic optimisation problem (DOP); more specifically, a DOP with dimensionality expansion, where both the surface and the dimensionality of the search space evolve over time. Particle swarm optimisation (PSO) is an established method for training feedforward NNs, and has been shown in multiple studies to outperform traditional backpropagation (BP). Modified PSO algorithms have been developed to deal with dynamic environments, and have been successfully applied to train feedforward NNs in dynamic environments. This study adapts various dynamic PSO variants for use in DOPs with dimensionality expansion. The adapted dynamic PSO variants are used to train incrementally constructed NNs (INNs) using the proposed IFL framework, and the results are compared to those of fully constructed NNs (FNNs) trained using traditional BP and standard PSO on a complete dataset. Experiments were conducted on fifteen diverse datasets spanning regression and classification tasks. The results show that IFL effectively enables NNs to dynamically incorporate new features as they become available over time, and that IFL provides desirable performance in terms of overfitting behaviour and can be used as a regularisation technique. |
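
As a baseline reference, training a fixed feedforward NN with standard gbest PSO can be sketched as below; the network size and PSO coefficients are common textbook values rather than the study's settings, and the dynamic and incremental extensions are omitted:

```python
# Standard (gbest) PSO training of a small feedforward NN on a toy
# regression task; each particle is a flattened weight vector.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 3))
y = np.sin(X).sum(axis=1, keepdims=True)            # toy regression target

n_in, n_hid, n_out = 3, 5, 1
dim = n_in * n_hid + n_hid + n_hid * n_out + n_out  # flattened weights

def mse(w):
    i = 0
    W1 = w[i:i + n_in * n_hid].reshape(n_in, n_hid); i += n_in * n_hid
    b1 = w[i:i + n_hid]; i += n_hid
    W2 = w[i:i + n_hid * n_out].reshape(n_hid, n_out); i += n_hid * n_out
    b2 = w[i:]
    h = np.tanh(X @ W1 + b1)
    return np.mean((h @ W2 + b2 - y) ** 2)

n_particles, w_inertia, c1, c2 = 30, 0.729, 1.494, 1.494
pos = rng.uniform(-1, 1, (n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_f = pos.copy(), np.array([mse(p) for p in pos])
gbest = pbest[pbest_f.argmin()].copy()

for _ in range(500):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = w_inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos += vel
    f = np.array([mse(p) for p in pos])
    better = f < pbest_f
    pbest[better], pbest_f[better] = pos[better], f[better]
    gbest = pbest[pbest_f.argmin()].copy()

print("training MSE:", pbest_f.min())
```
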
Grading Infrastructure Conditions through Machine Learning using Infrastructure Report Cards and Media Reports | Public infrastructure is of critical importance to advance job creation, equity, sustainable development and economic growth, yet there is a lack of information regarding infrastructure conditions to enable informed infrastructure investment decisions in South Africa. The South African Institution of Civil Engineering publishes Infrastructure Report Cards where ratings are applied to different infrastructure sectors based on factors such as condition, capacity and performance. A general lack of information, however, restricts the compilation of infrastructure report cards. Online news articles are targeted as an alternative data source to compile infrastructure report cards due to their availability, real-time and geographical coverage, as well as the insight they provide into socio-political infrastructure issues that are not adequately captured in technical reports. However, online news articles do not rate infrastructure conditions explicitly, which makes it difficult to summarise and extract findings. In this research assignment, a machine learning model that automatically rates infrastructure conditions from online news articles is developed.
A cross-domain modelling approach was adopted where the knowledge gained from training machine learning models on the source domain was utilised to make predictions on the target domain. Label descriptions from The South African Institution of Civil Engineering and the American Society of Civil Engineers infrastructure report cards were collected and compiled to form a source domain, while extracted online news articles were used as a target domain. An in-domain modelling approach was adopted to determine the feasibility of the datasets. In the cross-domain modelling approach, six machine learning models were trained on the scorecard dataset and evaluated on an annotated sample of the online news article dataset. The six models included three ordinal regression models, a long short-term memory model and two hybrid models where active learning and random sampling were combined with the long short-term memory model. The hybrid options were applied with the objective of improving the domain adaptation of the long short-term memory model. The logistic ordinal regression all-threshold model achieved the best mean squared error score of 1.255 on the test dataset, with the ordinal ridge regression model achieving the best mean absolute error of 0.788. These results suggest that the models in this research assignment can, on average, predict the article labels within a margin of less than one grade from the true label. The findings suggest that the models performed well in in-domain learning and are able to label news articles, but struggled with domain adaptation in cross-domain learning due to misalignment between the features and labels of the two datasets. |
Automated Road Detection and Classification for Urban and Rural Areas Using Aerial Imagery | This research assignment presents an automated approach to digitise roads from aerial imagery using deep learning techniques, focusing on distinguishing between paved and gravel roads. This work addresses the need for efficient and accurate road mapping in geographical information systems, supporting applications in urban planning, autonomous driving, and infrastructure management.
The solution utilises a DeepLab model based on the EfficientNetV2M architecture to identify and extract roads from aerial images and perform road quality condition assessment on the extracted road. The DeepLab model developed achieved a mean Intersection over Union score of 0.87 and a mean F1 score of 0.91. After segmentation, the segmented masks are converted into polygons using image processing techniques. These are then compiled into geographical information system-compatible shapefiles with detailed attribute mapping for road type classification. The developed pipeline incorporates parallel processing and optimised contour detection algorithms to efficiently handle large datasets, along with error handling and logging mechanisms to maintain robustness. This automated approach significantly reduces the manual effort required for road digitisation, offering a scalable solution for updating digital maps and enhancing geographical information system capabilities. This research assignment demonstrates the potential of deep learning in automating and improving the accuracy of spatial data extraction from aerial imagery, contributing to the fields of autonomous navigation and smart city infrastructure development. |
Utilizing unsupervised machine learning to identify patterns and anomalies in the JSE Top 40 equities | This study investigates using unsupervised machine learning to uncover hidden relationships and anomalies among the Johannesburg Stock Exchange (JSE) Top 40 equities. By transforming raw time-series data into informative metrics that include returns, volatility, average trading volume, and fundamental indicators such as earnings per share (EPS) and the price-to-earnings (P/E) ratio, the research aims to uncover patterns that can be used to inform investment management strategies. The data is analyzed as a snapshot, with the intention that this process can be continuously applied over different time frames to gain insights into opportunities and manage portfolio risk. This approach is not designed as a long-term buy-and-hold strategy but rather to spot changes in snapshot data.
Different clustering algorithms, namely K-Means, DBSCAN, and hierarchical clustering, were employed in combination with dimension reduction techniques such as Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). The models were evaluated using internal metrics, namely the Silhouette Score and the Davies-Bouldin Index. Additionally, the JSE sector classifications served as an external ground truth for validation and to identify anomalies that could be leveraged for investment opportunities. The results indicate that t-SNE combined with hierarchical clustering produced the most well-organised clusters, achieving a Silhouette Score of 0.5023 and a Davies-Bouldin Index of 0.5296. The analysis uncovered both expected sector groupings and notable anomalies, such as companies clustering outside their designated sectors due to similar financial characteristics. Shapley Additive Explanations (SHAP) analysis was used to provide insights into feature importance within clusters, enhancing the interpretability of the results. In conclusion, the study demonstrates that unsupervised machine learning techniques are effective in detecting meaningful patterns and anomalies in stock market data. These insights offer practical implications for investment management by providing a data-driven approach to portfolio diversification and risk assessment. This research contributes to the financial literature by showcasing the utility of advanced clustering methods in the context of the South African equity market, which could guide future studies in emerging markets. |
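
The best-performing combination reported above can be reproduced in outline with scikit-learn; the random matrix stands in for the engineered JSE Top 40 metrics:

```python
# Standardise engineered equity metrics, embed with t-SNE, cluster
# hierarchically, then score with the two internal metrics used.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))  # returns, volatility, volume, EPS, P/E, ...
X_std = StandardScaler().fit_transform(X)

emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X_std)
labels = AgglomerativeClustering(n_clusters=5).fit_predict(emb)

print("silhouette:", silhouette_score(emb, labels))
print("Davies-Bouldin:", davies_bouldin_score(emb, labels))
```
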
Advancing Distal Radius Fracture Classification Using Metric Learning: A Triplet Neural Network Approach | Recent advancements in computer vision and deep learning have enhanced distal radius fracture analysis, offering the potential to alleviate challenges in medical diagnostics in developing countries. This research investigates the application of metric learning architectures, particularly triplet neural networks, for the classification of distal radius fractures according to the Arbeitsgemeinschaft für Osteosynthesefragen/Orthopedic Trauma Association (AO/OTA) fracture classification system. The study aims to address challenges associated with data scarcity and model generalisation while improving automated fracture detection and classification accuracy.
The research followed the cross-industry standard process for data mining (CRISP-DM), progressing through business understanding, data preparation, modelling, and evaluation phases. The GRAZPEDWRI-DX dataset was utilised as the source domain for transfer learning to perform fracture object detection on a small target distal radius dataset (DIRAD), alongside traditional data augmentation techniques to mitigate data limitations. The object detection model, based on the eighth version of the you only look once (YOLOv8) architecture, achieved a mean average precision (mAP) of 93.8% at 50% Intersection over Union (IoU) on the GRAZPEDWRI-DX dataset and 73.1% on the DIRAD dataset. The feature extractor of a visual geometry group (VGG) 19-layer convolutional neural network (CNN), alongside a custom embedding neural network, was employed as the foundation of the developed triplet neural network, which classified distal radius fractures according to the AO/OTA classification system. The triplet neural network incorporated the triplet margin loss function and a semi-hard triplet sampling strategy and was trained separately on posteroanterior (PA) and lateral radiograph projections. Despite the triplet neural network achieving high training F1-scores of up to 97% for the PA projection, the models exhibited limited generalisation, underscoring the need for additional data or refined augmentation strategies. A comparative analysis with prior research highlighted the strengths and limitations of the proposed approach. Independent evaluations of PA and lateral projection models revealed complementary strengths, which could be integrated into ensemble modelling strategies. The findings demonstrate the feasibility of triplet neural networks for distal radius fracture classification but emphasise the necessity of future work to address generalisation challenges. Proposed enhancements include integrating generative adversarial networks for data synthesis, employing segmentation to simplify fracture classification, and using ensemble models for improved diagnostic accuracy and reliability. This research represents a first step in applying metric learning architectures to the AO/OTA distal radius fracture classification system and provides insights for further research in the field. |
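
The metric-learning core can be sketched with PyTorch's built-in triplet margin loss; the random tensors stand in for VGG19 features, and the semi-hard mining strategy is omitted:

```python
# Embedding network trained with triplet margin loss so same-class
# radiographs embed closer together than different-class ones.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 64))
criterion = nn.TripletMarginLoss(margin=1.0)

# anchor/positive share an AO/OTA class; negative comes from another class.
anchor_feat = torch.randn(32, 512)    # e.g. backbone features of anchors
positive_feat = torch.randn(32, 512)
negative_feat = torch.randn(32, 512)

loss = criterion(embed(anchor_feat), embed(positive_feat),
                 embed(negative_feat))
loss.backward()  # gradients flow into the embedding network
print(loss.item())
```
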
Office Carbon Dioxide level prediction using model confidence and signal resolution | Indoor air quality (IAQ) is considered to have major health and wellness implications, with indoor air pollution (IAP) estimated to have tenfold the negative impact on people relative to outdoor pollution. It is estimated that people spend an average of 80%-90% of their time indoors. It is therefore in the interest of overall population health to develop robust IAQ monitoring and control systems. In contributing to the advancement of monitoring IAQ, this study focuses on the prediction of indoor CO2 concentrations using machine learning algorithms.
Indoor CO2 concentrations can change rapidly, and thus a purely real-time monitoring system is likely inadequate for maintaining healthy IAQ conditions. Therefore, although sensors are available, there is a requirement for predictive monitoring systems that rely on robust and accurate models. Researchers have explored the development and implementation of physics-based and machine learning algorithms for this purpose. The consensus in the literature is that machine learning algorithms outperform physics-based algorithms. This outcome is dependent on the availability and quality of the data used as inputs to the models. The problem is that limited focus has been placed on the data quality and characteristics in the development of indoor CO2 prediction algorithms. This research project addresses the impact of noise in the data on overall prediction performance. It is expected that the input variables used in the development of CO2 prediction models are nonstationary and susceptible to noise. This may influence the prediction performance of the algorithms. Wavelets can be used to filter signals, thereby removing noise and retaining essential information from the original signal. The caveat is that implementing the wrong filter at suboptimal levels could lead to signal distortion and information loss, thus negatively influencing the prediction performance. To minimize this risk, dynamic signal resolution can be used in training. In this project, a method that implements various wavelets at varying decomposition levels is employed. The outputs of the wavelet transforms are used to train an ensemble of LSTMs, from which the most confident models are selected for prediction. For comparative analysis, a predictive model based on fixed signal resolution was developed. The performance of the two models was compared using the mean absolute error (MAE), mean absolute percentage error (MAPE), root mean squared error (RMSE) and coefficient of determination (R2). The implementation of the dynamic signal resolution model framework required nearly three times the execution time of the fixed-resolution approach. This additional computation resulted in no performance improvements. Instead, it was observed that using dynamic signal resolution resulted in limited prediction capability in areas of high CO2 concentrations, confirming the potential risks of information loss associated with signal filtering. Additionally, the fixed-resolution model demonstrated superior performance, reporting a MAE, MAPE, RMSE and R2 of 1.02 ppm, 0.3%, 2.365 ppm and 0.99, respectively, whilst the dynamic signal resolution model's metrics deteriorated to 14.96 ppm, 2.7%, 27.48 ppm and 0.91. |
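
The fixed-resolution wavelet filtering step can be sketched with PyWavelets; the wavelet family, decomposition level, thresholding rule, and synthetic CO2-like signal are all illustrative assumptions:

```python
# Discrete wavelet decomposition, soft-thresholding of detail coefficients,
# and reconstruction of a noisy CO2-like signal.
import numpy as np
import pywt

rng = np.random.default_rng(0)
t = np.linspace(0, 24, 1024)                    # one day, arbitrary sampling
co2 = 600 + 150 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 20, t.size)

coeffs = pywt.wavedec(co2, "db4", level=3)
sigma = np.median(np.abs(coeffs[-1])) / 0.6745  # robust noise estimate
thresh = sigma * np.sqrt(2 * np.log(co2.size))  # universal threshold
denoised_coeffs = [coeffs[0]] + [
    pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]
]
co2_denoised = pywt.waverec(denoised_coeffs, "db4")
print("residual std:", np.std(co2 - co2_denoised[: co2.size]))
```
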
Data-Driven Predictive Maintenance for Enhanced Reliability of Continuous Miners in Underground Coal Mining | The global coal mining industry faces increasing challenges due to deteriorating operational conditions and ageing machinery. As mining companies strive to optimise their processes, there is a significant opportunity to enhance maintenance strategies to ensure machines operate at the lowest possible cost. This research assignment explores the application of data science in analysing electrical data from continuous miners to identify anomalies and alert maintenance personnel of potential failures before they occur. By employing both conventional machine learning and deep learning techniques, the aim is to determine the most effective approach for predictive maintenance. This study represents a pioneering effort in South Africa, focusing on the application of Markov chains for anomaly detection in the coal mining sector. By leveraging the Markov property and integrating it with the Mahalanobis distance, the research developed a robust framework that enhances anomaly identification. This dual approach not only enriches data science analytical capabilities but also introduces innovative perspectives in industrial maintenance. By bridging traditional clustering techniques with advanced statistical methods, the research opens new avenues for enhanced anomaly detection, offering valuable insights for future research and practical applications. |
Advancing the Argument for Shallow Models: A Comparative Analysis against Deep Learning Approaches | The rapid adoption of deep learning in various domains has led to an overreliance on complex architectures, often at the expense of simpler models that may be equally effective. This trend raises concerns about unnecessary computational costs, reduced interpretability, and increased carbon footprints, particularly in cases where shallow models could provide comparable results. This research assignment aims to evaluate the necessity of deep learning models by conducting a comparative analysis against shallow models. The study seeks to determine under what circumstances simpler models are preferable, providing a more resource-efficient and interpretable alternative to deep learning approaches. The research employs a mixed-methods approach, combining scientometric analysis, an extensive literature review, and selected case studies. The study critically assesses the performance of shallow versus deep models across various applications, focussing on criteria such as accuracy, computational efficiency, and scalability. The findings reveal that shallow models, when properly optimised, can achieve performance levels comparable to those of deep learning models in several contexts. Moreover, these models offer benefits in terms of lower computational demands and greater interpretability, challenging the prevailing trend of defaulting to deep learning solutions. The study concludes that deep learning is not always the best choice, advocating for a more thoughtful selection of models based on the specific needs of the application. By highlighting the strengths of shallow models, this research contributes to a more balanced approach to machine learning, encouraging the industry to reconsider when deep architectures are truly necessary. |
Data Science Approaches for Addressing Missing Values in the Transcriptome of Plasmodium falciparum | The development of new antimalarial drugs and vaccines relies heavily on the understanding of the genetics of Plasmodium falciparum. Transcriptomic data, a valuable resource for such insights, is often documented in ’omics datasets. However, these datasets are often plagued by missing values. Missing values significantly hinder downstream biological analysis. Accurate imputation of these missing values is imperative for the analysis of ’omics datasets and the discovery of novel antimalarial drugs.
This research assignment investigates missing value imputation techniques to identify a suitable method for accurate imputation of missing values in a transcriptomic dataset. Various approaches, including single imputation, multiple imputation, machine learning imputation, and deep learning imputation are explored. Single imputation methods, such as mean/median imputation, lowest of detection (LOD), and random tail imputation (RTI), often fail to capture the complex relationships inherent in gene expression data. Consequently, advanced methods, namely multiple imputation by chained equations (MICE), expectation maximisation (EM), k-means, fuzzy c-means (FCM), k-nearest neighbours (KNN), self-organising maps (SOM), density-based spatial clustering of applications with noise (DBSCAN), feedforward neural network (FNN), autoencoder (AE), and generative adversarial imputation network (GAIN), are investigated for their suitability in handling missing value imputation in transcriptomic data. Each method is assessed against a set of criteria derived from the literature. The selected imputation method is evaluated on datasets with varying percentages and mechanisms of missing data. The quality of imputation is assessed using quantitative metrics, such as root mean squared error (RMSE) and mean absolute error (MAE). Additionally, the impact of imputation on downstream analysis tasks, such as clustering, is examined. The SOM is selected as the imputation method. The imputation results consistently yield RMSE and MAE values lower than the standard deviation of the data. These results indicate that the errors fall within the acceptable range given the natural variability of gene expression data. Subsequent k-means clustering performed on the imputed data showed that imputation did not affect the quality of the clusters. This finding underscores that SOM imputation adequately preserves the biological structure of the data. |
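
The evaluation protocol, masking known entries, imputing, and comparing RMSE and MAE against the data's standard deviation, can be sketched as follows; scikit-learn's KNNImputer stands in here for the SOM imputer the study selected:

```python
# Mask 10% of known entries, impute, and score RMSE/MAE against the
# data's natural variability.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X_true = rng.normal(loc=5.0, scale=2.0, size=(100, 20))  # toy expression data

mask = rng.random(X_true.shape) < 0.10  # 10% missing completely at random
X_missing = X_true.copy()
X_missing[mask] = np.nan

X_imp = KNNImputer(n_neighbors=5).fit_transform(X_missing)

err = X_imp[mask] - X_true[mask]
print("RMSE:", np.sqrt(np.mean(err ** 2)))
print("MAE:", np.mean(np.abs(err)))
print("data std:", X_true.std())        # acceptability yardstick
```
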
An Automated Computer Vision System to Measure Excavator Productivity | This study develops an excavator productivity model to measure and optimise construction productivity using computer vision techniques, addressing inefficiencies in construction operations and improving excavator performance. The model analyses video input with computer vision, focusing on near-optimal real-time tracking techniques. Object detection algorithms, including you only look once (YOLO) and Faster Region-based Convolutional Neural Network (Faster R-CNN), were initially explored for accurate tracking of excavator movements on construction sites. Results indicated that YOLO offered superior generalisation and performance, yielding more accurate bounding box coordinates for tracking excavators.
The dataset developed included resized and labelled excavator videos, uniformly processed for consistent colour and format. Each video was divided into three-second intervals, annotated by activity. To measure productivity, a two-phase activity recognition model was developed. Initially, a VGG16 feature extractor combined with a simple long short-term memory (LSTM) model classified the excavator as static or moving, which achieved 100% accuracy in movement detection. The second phase involved designing an advanced activity recognition model to classify specific excavator tasks, including soil pick-up, hauling, and drop-off, focusing on task duration analysis and process optimisation. Various models and processes were tested, considering different angles, excavators, and backgrounds. The model performed well but required extensive training data and computational power for optimal accuracy. With 300 to 400 labelled videos containing three-second activity segments, the accuracy of the model ranged between 80% and 100%, depending on the similarity of the test data to the training environment. Despite challenges such as lighting variations and insufficient data quality, the model demonstrates potential in tracking excavator activities. Future efforts will aim to expand the model to other machinery and enhance real-time performance, potentially yielding significant efficiency improvements and cost reductions. |
December 2024 Graduation

Title | Abstract |
---|---|
Active Learning in Bagging Ensembles | This study investigates the integration of dynamic pattern selection (DPS) and ensemble learning (EL) to enhance the performance of feed-forward neural networks trained with gradient descent backpropagation, particularly addressing the bias-variance dilemma while reducing computational complexity. DPS, introduced by Röbel (1994), is an active learning technique that incrementally adds patterns with the highest errors to the training data, aiming to achieve similar generalization results as standard backpropagation with less computational expense. Bagging-based EL combines the predictions of multiple models trained on resampled subsets of the original data to improve generalization performance, albeit with increased computational demands.
In this research, DPS and EL were applied independently and in combination to neural networks, evaluated on four classification problems and two regression problems. The experiments tested four scenarios: standard NNs, NNs with only EL, NNs with only DPS, and a combination of both (referred to as EL AL NN). The results demonstrated that DPS achieved similar performance to standard backpropagation while reducing computational cost. Specifically, for the iris and hepatitis classification problems, DPS showed better generalization, possibly due to reduced overfitting. EL improved generalization across all classification and regression problems, confirming its effectiveness despite higher computational complexity. When combining DPS with EL, the study found that for two of the four classification problems and both regression problems, the EL AL NN matched the generalization of EL while reducing computational complexity. However, for the iris and wine classification problems, the EL AL NN did not generalize as well, with a reduction in the generalization factor below one, suggesting overfitting as a possible cause. |
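
Dynamic pattern selection can be sketched schematically as below; the MLP, error proxy, and increment size are placeholders rather than Röbel's exact formulation:

```python
# Train on a small subset, then repeatedly add the candidate patterns on
# which the current model makes the largest errors.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
selected = list(range(0, len(X), 15))  # small seed set spanning all classes

for _ in range(8):  # incremental training rounds
    model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                          random_state=0).fit(X[selected], y[selected])
    pool = np.setdiff1d(np.arange(len(X)), selected)
    if pool.size == 0:
        break
    # Error proxy: one minus the probability assigned to the true class.
    errors = 1.0 - model.predict_proba(X[pool])[np.arange(pool.size), y[pool]]
    selected += list(pool[np.argsort(errors)[-5:]])  # add 5 worst patterns

print("patterns used:", len(selected), "of", len(X),
      "training accuracy:", model.score(X, y))
```
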
Adaptive Machine Learning for the Optimization of a Water Treatment Clarification System | Digitalization is currently a major topic of discussion within industry, with the aim of providing descriptive, diagnostic, predictive and prescriptive feedback to all levels of business involvement. The water treatment industry is no exception; to ensure live responses to ever-changing feed water conditions, often under-utilized data sources can be incorporated into an intelligent system for optimal control.
The subject of this study was a clarification system employed as pre-treatment for the production of purified water from organically rich wastewater. The controls of the concerned system before the study were mostly static and linear in nature, with a large degree of human interaction required to reach a sub-optimal target of minimum overflow turbidity [measure of water clarity] based on feed turbidity. It was therefore desired to develop a system that models the process and optimizes the overflow quality by adjusting the feed coagulant [chemical that neutralizes charge and encourages clumping] and flocculant [chemical that binds clumped solids together] dosages on a continuous basis as a prescriptive feedback system. To address the problem at hand, features were selected based on expert knowledge, after which R was used for the data handling and analysis. Raw data was ingested from an MSSQL database and MS Excel data files, which were assessed for quality issues and then combined and processed accordingly. The resulting input features were feed coagulant and flocculant dosages, tank level, turbidity, pH, temperature and flow rates to the 4 parallel clarifiers. The target feature consisted of a 1:1 weighting between overflow turbidity and COD [chemical oxygen demand] after min-max scaling between 0 and 1. Of the 3 tree-based models tested, the random forest model was found to be optimal with a testing RMSE [root mean square error] of 0.0761 units, which compared well with the median target of 0.22 ±0.09 units. An XGBoost model was then used to optimize the fitness function consisting of overflow quality and coagulant and flocculant dosages through grid search based on cost-benefit principles. This procedure yielded promising simulation results with median relative improvements of 36.4 ±19.7% for the overflow quality, as well as 28.6 ±14.1% and 7.71 ±15.4% for the coagulant and flocculant dosages respectively. Upon live testing these results were verified as 49.1 ±32.3%, 28.6 ±6.34% and 8.52 ±10.4% respectively. These improvements were confirmed through analysis of raw feed conditions. Online retraining and exploration were also tested in simulation. Online retraining was based on deployed model predictive accuracy within a 1-day moving average. Once this surpassed 0.1 units, a minimum of 1 day needed to have passed before retraining. In simulation, on average, the retraining rate was once per 3.12 ±8.99 days with RMSE accuracy on the deployed data range of 0.0838 ±0.0380 units. Exploration was performed by adding randomization to the optimization routine through randomly selecting 100 solutions and subjecting them to a 10-instance tournament obtained through roulette wheel selection and mutation within 10% of the dosage training ranges. It was found that the existing linear correlation between the dosages could be reduced from 0.83 to 0.14 units with a 50% increase in improvement variability, with only between 29 and 62% of the exploited improvements being realized. |
Automated Screening of Chronic Sinusitis from Voice Recordings using Machine Learning | Chronic sinusitis is a common illness that affects millions of individuals worldwide. Currently, the screening of chronic sinusitis involves evaluating patient symptoms, conducting endoscopic examination, or using medical imaging methods such as computed tomography or magnetic resonance imaging. Symptom-based diagnosis is often inaccurate. Endoscopic examinations are invasive and limited by anatomical variability. Medical imaging procedures are expensive and expose patients to ionising radiation, which can raise the risk of cancer. Alternatively, chronic sinusitis could potentially be screened for automatically using voice recordings, an approach that has not been investigated before.
This research assignment proposes the use of machine learning to automatically distinguish between the speech of patients with chronic sinusitis and that of healthy individuals. The dataset used in this research comprises voice recordings of patients who underwent tonsillectomy, septoplasty, functional endoscopic sinus surgery, and minor surgeries unrelated to the nasal cavity or vocal tract. The collected data was down-sampled, noise was reduced using a pre-emphasis filter, and unvoiced speech segments were removed through short-term energy analysis. Several audio features were extracted from the processed audio data, with the most relevant being Mel-frequency cepstral coefficients, spectral contrast, Mel-spectrogram, spectral centroid, spectral flatness, spectral bandwidth, and spectral roll-off. These features were concatenated and reduced in dimensionality using principal component analysis before being used to train various machine learning models, including logistic regression, k-nearest neighbours, decision tree, extreme gradient boosting (XGBoost), random forest, support vector machine and a deep neural network (DNN). The DNN model outperformed all other models considered and was selected for further evaluation using accuracy, precision, and recall metrics. The performance results indicated that the DNN model achieved an accuracy of 0.67 ± 0.0089 and 0.63 ± 0.0089 on the train and test sets, respectively. The obtained performance results are comparable with findings from several voice-based diagnosis studies. Therefore, this study has demonstrated that chronic sinusitis can be detected from voice recordings using machine learning. However, the moderate accuracy of 0.63 on the test set suggests there is still room for improvement. This could be due to factors such as the dataset size, preprocessing techniques, feature selection, and/or machine learning models considered. |
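
The feature-extraction stage maps naturally onto librosa; a minimal sketch follows, with a synthetic tone standing in for the clinical recordings:

```python
# Pre-emphasis filtering, the named spectral features, then PCA.
import numpy as np
import librosa
from sklearn.decomposition import PCA

sr = 16000
y = librosa.tone(220, sr=sr, duration=2.0)  # stand-in voice signal
y = librosa.effects.preemphasis(y)          # boost high frequencies

feats = np.vstack([
    librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
    librosa.feature.spectral_contrast(y=y, sr=sr),
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=20),
    librosa.feature.spectral_centroid(y=y, sr=sr),
    librosa.feature.spectral_flatness(y=y),
    librosa.feature.spectral_bandwidth(y=y, sr=sr),
    librosa.feature.spectral_rolloff(y=y, sr=sr),
])
# One concatenated feature vector per frame; PCA reduces dimensionality.
reduced = PCA(n_components=10).fit_transform(feats.T)
print(feats.shape, "->", reduced.shape)
```
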
Predicting patient outcomes based on adverse drug events using graph neural networks | This research explores the application of graph neural networks (GNNs) in pharmacovigilance, particularly in predicting adverse drug events (ADEs) using data from the Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS). The study begins with an in-depth analysis of the graph data model, representing complex relationships between patients, drugs, reactions, and outcomes. The GNN architecture, specifically a graph multi-layer perceptron (graph MLP), is configured and trained on this graph-structured data to enhance ADE prediction accuracy. Various evaluation metrics, including the F1 score, precision, recall, and accuracy, are employed to assess the performance of the model, alongside a comparative analysis with baseline methods.
The results demonstrate that the GNN model, when properly configured, outperforms conventional approaches in several key metrics, offering deeper insights into drug safety and patient outcomes. Furthermore, the research highlights the potential of GNNs in improving clinical decision-making, strengthening regulatory frameworks, and advancing personalized medicine. However, limitations such as data quality and model interpretability are acknowledged, prompting recommendations for future research. This study contributes to the growing body of knowledge on the use of graph-based models in healthcare, showcasing their applicability in real-world pharmacovigilance practices and paving the way for further advancements in this domain. |
March 2024 Graduation |
|
Title | Abstract |
---|---|
Convolutional neural network filter selection using genetic algorithms | Ever since the release of large language models like ChatGPT, machine learning has garnered worldwide attention from laymen and scholars alike. However, the field of machine learning predates the development of these models by some time and has a rich history of successful applications in a variety of fields. Genetic algorithms and computer vision are two such areas of machine learning that have shown great promise in solving complex problems. Genetic algorithms are a type of evolutionary algorithm that can solve a wide range of optimization problems, while computer vision involves the use of machine learning models to extract insights from image and video data.
The model most commonly used in computer vision applications is a form of neural network called the convolutional neural network. Neural networks are a type of machine learning model that takes inspiration from the structure and functioning of the human brain. Convolutional neural networks refer to a type of neural network model that is especially adept at computer vision tasks such as image classification, object detection and video analysis owing to the use of convolutional layers. Convolutional neural networks can consist of millions of parameters, the majority of which are stored in the filters that the networks use during convolution operations. Thus, one major problem hampering more widespread adoption of convolutional neural network models in practice is the size of these models. The storage required to deploy these models is not trivial, leading to a need for methods to compress these models without materially degrading their predictive capabilities. One such method is filter selection and pruning, which refers to methods that assess the filters in a convolutional neural network and remove the least important filters to reduce the size of the model. This project proposes the use of a genetic algorithm to optimise the process of filter selection, allowing multiple filter selection methods to be applied concurrently. The proposed algorithm allows filters to be pruned adaptively, with the removal methods and number of filters removed being optimised for the network being pruned. When applying the proposed algorithm, we achieve 90.91% model compression at the cost of a 0.13 percentage point accuracy drop for a network trained on audio data. When applied to the classic Fashion-MNIST data set, 91.37% compression is achieved with a corresponding 0.39 percentage point drop in accuracy. We also achieved 86.06% compression while increasing accuracy by 2.37 percentage points on a model trained on the CIFAR-10 data set. These results show the utility of the algorithm and its ability to compress networks adaptively with different architectures trained on different data sets. This study reveals that genetic algorithms can be applied successfully to prune filters from convolutional neural networks and provides the underpinnings for a comprehensive genetic algorithm capable of pruning filters from any given convolutional neural network architecture. |
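A toy sketch of the idea, assuming a fitness function that trades accuracy against compression; the `fitness` stub, population sizes and rates below are illustrative placeholders, since evaluating real masks would involve pruning the CNN and briefly fine-tuning it.

```python
# Genetic filter selection: individuals are binary masks over a layer's filters.
import numpy as np

rng = np.random.default_rng(0)
N_FILTERS, POP, GENS = 64, 20, 30

def fitness(mask):
    # Placeholder: the real algorithm would prune the CNN according to `mask`,
    # fine-tune briefly, and return accuracy minus a model-size penalty.
    accuracy_proxy = 0.9 - 0.002 * abs(mask.sum() - 20)
    compression = 1.0 - mask.mean()
    return accuracy_proxy + 0.1 * compression

pop = rng.integers(0, 2, size=(POP, N_FILTERS))
for _ in range(GENS):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-POP // 2:]]        # truncation selection
    cuts = rng.integers(1, N_FILTERS, size=POP // 2)     # one-point crossover
    kids = np.array([np.r_[parents[i % len(parents)][:c],
                           parents[(i + 1) % len(parents)][c:]]
                     for i, c in enumerate(cuts)])
    flip = rng.random(kids.shape) < 0.02                 # bit-flip mutation
    kids = np.where(flip, 1 - kids, kids)
    pop = np.vstack([parents, kids])

best = pop[np.argmax([fitness(m) for m in pop])]
print("kept filters:", int(best.sum()), "of", N_FILTERS)
```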
The value of Zero-rating internet services to provide essential services to low-income communities | This research assignment explored user interests and usage patterns on a zero-rated internet platform, MoyaApp, in South Africa to determine the value of zero-rating essential services in low-income communities.
This study focused on understanding how users interact with different categories of essential services offered through the MoyaApp platform, a Datafree subsidiary, particularly on grants, education, jobs, and other information services such as weather and electricity. The researcher used data mining techniques such as temporal association rule mining and other statistical methods to analyze user interests and usage patterns. The findings revealed that many low-income users initially registered on MoyaApp to access grant services; users gradually explored other essential services over time and became regular platform users. The researcher proposed a few recommendations to improve the benefits MoyaApp provides to low-income communities: Firstly, MoyaApp should consider expanding the jobs category to cater to users with varying levels of education. Secondly, targeting grant users with information services like weather and electricity encourages engagement. Once users engage regularly with the platform, they are more likely to use more beneficial services such as education and jobs, which leads to improved socio-economic status. Thirdly, the results of this study can be used to develop a recommendation engine to suggest relevant essential services to low-income users. In conclusion, this research assignment demonstrated that providing zero-rated internet services or, more accurately, reverse-billed data to low-income communities can be an effective strategy to enhance access to essential services and bridge the digital divide. |
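To make the rule-mining step concrete, here is a hedged sketch using mlxtend's apriori implementation on invented usage sessions; the category names and thresholds are placeholders, not MoyaApp data.

```python
# Association rule mining over service-usage sessions (toy data).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

sessions = [["grants", "weather"], ["grants", "jobs"],
            ["grants", "electricity", "weather"], ["education", "jobs"]]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(sessions).transform(sessions),
                      columns=te.columns_)
frequent = apriori(onehot, min_support=0.25, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```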
Intelli-Bone: Automated fracture detection and classification in radiographs using transfer learning | Suspected fractures are one of the most common reasons for patients to visit the emergency department (ED) in hospitals [79]. Radiographs, the primary diagnostic tool for suspected fractures, are often assessed by emergency healthcare professionals without specialised orthopaedic expertise. This restriction leads to a high number of diagnostic errors in EDs, with incorrectly diagnosed fractures accounting for over 80% of reported diagnostic mistakes [79].
Given this problem with fracture diagnostics, there is an opportunity to use artificial intelligence (AI) to assist with the diagnosis of fractures. Successful implementation of an AI system that correctly locates and classifies fractures would lead to more accurate prognosis and treatment advice. The selected fracture classification system for this research assignment is the Arbeitsgemeinschaft für Osteosynthesefragen / Orthopaedic Trauma Association (AO/OTA) classification [90]. The object detection models selected in this research to evaluate whether AI can be used for accurate location and classification of fractures according to the AO/OTA classification are the faster region-based convolutional neural network (Faster R-CNN) [115], you only look once version 8 nano (YOLOv8n) [54], you only look once version 8 large (YOLOv8l) [54], and RetinaNet [76]. A secondary problem that this research assignment addresses is that of data scarcity. Deep learning algorithms require large amounts of data to achieve exceptional performance. The target dataset in this research assignment, the distal radius dataset (DIRAD), only consists of 776 images, where roughly half of the images contain fractures. The technique applied to overcome the data scarcity problem is transfer learning. With transfer learning, the object detection models are pretrained on larger datasets such as common objects in context (COCO) [77] and the Graz Paediatric Wrist Digital X-rays (GRAZPEDWRI-DX) dataset [95] before being trained on the target dataset. This research assignment shows that pretraining of object detectors on larger datasets leads to superior performance on scarce datasets. Furthermore, pretraining an object detection model on a large dataset from a similar domain with a similar task, such as GRAZPEDWRI-DX, leads to even better results. The pretraining of the Faster R-CNN, YOLOv8n, YOLOv8l, and RetinaNet on the GRAZPEDWRI-DX improved mean average precision at an intersection over union threshold of 0.5 (mAP50) by an average of 33.6% compared to the same models trained from randomly initialised weights. The best performing model, namely the YOLOv8l, achieved a mAP50 of 59.7% on the DIRAD dataset. |
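A compact sketch of the staged transfer-learning recipe, assuming the ultralytics API; the dataset YAML file names are hypothetical stand-ins for the GRAZPEDWRI-DX and DIRAD configurations.

```python
# Pretrain-then-fine-tune staging for a YOLOv8 detector (illustrative only).
from ultralytics import YOLO

model = YOLO("yolov8l.pt")                      # COCO-pretrained weights
model.train(data="grazpedwri.yaml", epochs=50)  # similar-domain wrist X-rays
model.train(data="dirad.yaml", epochs=100)      # fine-tune on the target set
metrics = model.val()                           # reports mAP50, among others
```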
Evolutionary multi-objective optimisation algorithms for a multi-objective truck and drone scheduling problem | In the rapidly evolving landscape of e-commerce, the efficiency of last-mile delivery emerges as a critical bottleneck in the logistics chain. This research addresses the complexities of last mile delivery, a process significantly burdened by high costs, environmental concerns, and the increasing consumer demand for quick and convenient service. By focusing on the integration of drones with traditional truck delivery systems, this study explores an innovative solution to the challenges faced in business-to-consumer (B2C) logistics. The utilization of a combined truck and drone system presents a novel approach to optimizing delivery routes and reducing both delivery times and operational costs. This assignment introduces a multi-objective traveling salesman problem with drone interception (TSPDi), which simultaneously minimizes total delivery time and distance, thereby addressing the inherent trade-offs in last-mile logistics.
In this assignment, the non-dominated sorting genetic algorithm II (NSGA-II) and the strength Pareto evolutionary algorithm 2 (SPEA2) were adapted for the TSPDi problem, with modifications and enhancements to optimise their performance. A custom population initialisation function was added to both algorithms, improving the starting point for the evolutionary process. In addition, a heuristic mutation method was developed that produces feasible, high-quality solutions. To create a more varied solution pool, a mechanism for selecting unique solutions for both the parent and archive populations was implemented to ensure that no duplicate solutions occurred. This approach was especially successful in keeping a wide range of solutions during extended iterations. Empirical results showed that NSGA-II is better than SPEA2 in scenarios with larger datasets and many delivery nodes, while SPEA2 has a slight advantage in smaller datasets with fewer delivery nodes. Further analysis was performed to compare the performance of the algorithms with those of Ernst [29] and Moremi [52]. Delivery time was the most important factor in the comparison, as it was the objective optimised by Ernst [29] and Moremi [52]. The results showed that the new multi-objective evolutionary algorithms (MOEAs) performed similarly to the single-objective algorithms on the smaller datasets (i.e. 10 and 20 nodes) in terms of the delivery time metric; however, in most cases they did not perform better. For larger data sets (i.e. 50 to 500 nodes), the MOEAs outperformed all algorithms developed by Moremi [52] and were more competitive compared to algorithms developed by Ernst [29], surpassing them in performance on most large data sets. For the truck distance metric, the MOEAs outperformed most of the single-objective evolutionary algorithms (EAs) for smaller and larger datasets. This was expected, since the single-objective EAs were designed to optimise time rather than distance. |
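Central to both NSGA-II and SPEA2 is ranking schedules by Pareto dominance over the two objectives; the helper below is a minimal, illustrative extraction of the non-dominated front for (delivery time, distance) pairs, not the assignment's implementation.

```python
# Non-dominated front for a bi-objective minimisation problem.
import numpy as np

def pareto_front(objectives: np.ndarray) -> np.ndarray:
    """Boolean mask of non-dominated rows; both objectives are minimised."""
    n = len(objectives)
    nondominated = np.ones(n, dtype=bool)
    for i in range(n):
        # Row i is dominated if some row is <= on both objectives and < on one.
        better_eq = (objectives <= objectives[i]).all(axis=1)
        strictly = (objectives < objectives[i]).any(axis=1)
        dominated_by = better_eq & strictly
        dominated_by[i] = False
        if dominated_by.any():
            nondominated[i] = False
    return nondominated

objs = np.array([[10.0, 40.0], [12.0, 35.0], [11.0, 45.0], [9.0, 50.0]])
print(objs[pareto_front(objs)])   # [11, 45] is dominated by [10, 40]
```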
Evolving encapsulated neural network blocks using a genetic algorithm | In recent years, artificial intelligence, with its subfields of deep learning and evolutionary computation, has experienced remarkable growth. This expansion can be attributed to the increased availability of computational power and the potential value these domains offer. Consequently, this growth has fueled intensified research and attention, presenting the challenge of staying current with the rapid advancements. Furthermore, the advent of deep learning has led to the ever-increasing size and complexity of neural networks, pushing the boundaries of computational capabilities. This project investigates the viability of utilising a genetic-based evolutionary algorithm to automate the discovery of subnetworks within convolutional neural networks (CNNs), referred to as blocks, for image classification. Inspired by architectural elements in well-known CNNs like ResNet and GoogLeNet, these blocks are designed to be reusable, repeatable and modular.
The first part of this project entailed the development of a framework to represent CNN architectures, which drew inspiration from the concept of neuroevolution of augmenting topologies (NEAT). This developed representation framework was used to define the composition and layout of CNN architectures. Next, a genetic algorithm was adapted to fit within the framework, thus enabling the evolution of CNN blocks using various evolutionary operators, including mutation, speciation and crossover. The representation framework and genetic algorithm were combined to evolve a population of 100 CNN blocks over 30 generations. Throughout the evolution process, the search was guided by the measured quality of the blocks, defined by a fitness function that was designed to balance complexity and performance. Five repetitions of the experiment were performed and compared to randomly generated blocks to assess the overall success of this approach. Additionally, the performance of the evolved blocks was evaluated against manually designed blocks such as ResNet and GoogLeNet’s Inception. The results of the comparison between the genetic algorithm and random procedures demonstrated the effectiveness of the genetic algorithm in producing high-quality solutions based on the fitness evaluation. The results showing the distribution of the evolutionary operators across the population also explained how the subprocedures can be used to control the search effectively. Furthermore, the results obtained using a small sample of the best performing evolved blocks proved to be highly competitive when compared to manually designed counterparts, namely ResNet and Inception. This study validates the concept of using evolutionary algorithms for neural network block generation and emphasises their ability to rival manually designed networks. The findings suggest that evolutionary computation successfully automates the discovery of competitive blocks within CNN architectures, offering new avenues for neuroevolution and overcoming limitations in the manual design processes. |
Machine Learning for Aquaponic System Mortality Prediction and Planting Area Optimisation | Aquaponics is a sustainable farming method that combines aquaculture with hydroponics. Machine learning and the internet of things (IoT) can be used to improve the profitability and efficiency of aquaponic plants. This project proposes a machine learning-based IoT system for aquaponics that can predict fish mortality and optimize crop growing areas. The system collects data on water quality, fish behaviour, and plant growth. This data is then used to train machine learning models to predict fish mortality and to optimize crop growing areas. The proposed machine learning-based IoT system has the potential to improve the profitability and efficiency of aquaponic plants. This could lead to wider adoption of aquaponics as a sustainable farming method. |
Spatio-Temporal Modelling of Road Traffic Fatalities in the Western Cape | Road traffic accidents are a serious problem in South Africa. Responding to the World Health Organisation’s Decade of Action for Road Safety, the Western Cape sought new techniques and initiated the application of data science and machine learning tools to act as a decision support system. In this light, this project seeks to develop a machine learning model capable of predicting, in time and space, the probability of a fatal road event. This is done by aggregating relevant features of the Western Cape into an H3 grid whereby patterns in fatal events are learned. Traditional machine learning techniques and deep learning techniques are used to learn the relationship between the aggregated features and fatal road events, with the aim of outperforming the historical average models currently used in industry. This is the first attempt at using machine learning techniques to model road traffic fatalities in South Africa and the Western Cape. |
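A hedged sketch of the H3 aggregation step, assuming the h3 v4 Python API (the function was named `geo_to_h3` in v3); the coordinates, resolution and column names are illustrative.

```python
# Bucket accident points into H3 hexagons so features and fatality counts
# can be joined per cell.
import pandas as pd
import h3

accidents = pd.DataFrame({
    "lat": [-33.93, -33.92, -34.05],
    "lon": [18.42, 18.44, 18.60],
    "fatal": [1, 0, 1],
})
res = 7  # hexagon resolution; higher values mean smaller cells
accidents["cell"] = [h3.latlng_to_cell(la, lo, res)
                     for la, lo in zip(accidents.lat, accidents.lon)]
per_cell = accidents.groupby("cell")["fatal"].agg(["sum", "count"])
print(per_cell)  # fatality counts per hexagon, ready for feature joins
```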
Using Tree-Based Machine Learning Models to Improve Upon the Least-Squares Method of Quantifying Mineralogy using Bulk Chemical Compositional Data | Geometallurgy is an interdisciplinary science that utilises geological and metallurgical data to optimise ore-to-metal processing routes. Knowledge of the spatial distribution of minerals (and hence metals) within the ore body forms the basis of a geometallurgical model. Information about an ore body’s chemistry and quantitative mineralogy can be obtained through drill core logging exercises. The process of drilling cores, collecting samples, and analysing them is costly and time-consuming. As a result, other quick and inexpensive methods of deriving modal mineralogy have been proposed.
Element-to-mineral conversion (EMC) refers to the method of using bulk rock compositional data to calculate mineral grade quantities. EMC is a chemical mass balancing technique that utilises the bulk rock chemistry, b, and the minerals’ compositional data, A, to solve for modal mineralogy, x. Chemical mass balances are expressed as a set of simultaneous equations, Ax = b, that can be solved using the least-squares approach (LS-EMC). LS-EMC can only be applied if the number of unknowns (minerals) is less than or equal to the number of known variables (elements). It is often the case that there are more minerals than elements. However, minerals can be grouped such that the number of resultant mineral sets is equal to the number of elements. Although this method of grouping minerals is sufficient for geometallurgical models, it is insufficient for mineral processing models which require exact quantities for individual minerals. This study sought to investigate alternative data-science-based methods to LS-EMC. Data science is an interdisciplinary field that focuses on the application of computational statistical methods, such as machine learning, for the extraction of knowledge from data. Three tree-based machine learning (ML) algorithms, namely decision tree, random forest, and extra trees, were trained to predict mineral grade quantities using positional and geochemical data. The dataset used in the investigation consisted of 135 observations sourced from a geological study conducted on the Kalahari Manganese Deposit (KMD) (Blignaut, 2017). LS-EMC was also applied, and the mineral grade estimates obtained by this method were compared to the ML models’ output. The R2 statistic was used to quantify how well the LS-EMC and ML-EMC output agreed with the modal mineralogy measurements obtained through quantitative X-ray diffraction (QXRD). In comparison to the other techniques, the modal mineralogy results from the extra trees regressor correlated the most with the QXRD measurements, achieving R2 scores > 0.5 for six out of the eight mineral groups. Furthermore, the extra trees algorithm outperformed the other two tree-based models in a test designed to see which ML algorithm provided the most reliable mineral quantity predictions for ungrouped minerals. The results of this study support the conclusion that tree-based machine learning algorithms can be used to improve upon the shortcomings of LS-EMC. |
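The two families of approaches can be contrasted on toy numbers, as in the sketch below: a least-squares EMC solve (here with a non-negativity constraint via SciPy's nnls, since mineral fractions cannot be negative) next to an extra-trees regressor; all matrices and targets are invented, not KMD data.

```python
# LS-EMC versus a tree-based regressor on invented compositions.
import numpy as np
from scipy.optimize import nnls
from sklearn.ensemble import ExtraTreesRegressor

# A: element composition of each mineral (elements x minerals); b: bulk assay.
A = np.array([[0.40, 0.05, 0.10],
              [0.02, 0.55, 0.20],
              [0.10, 0.10, 0.45]])
b = np.array([0.21, 0.25, 0.24])
x, residual = nnls(A, b)          # solves A @ x ~= b subject to x >= 0
print("LS-EMC mineral fractions:", x.round(3))

# Tree-based alternative: learn mineral grades directly from assay features.
rng = np.random.default_rng(0)
X_train = rng.random((135, 3))                 # stand-in geochemical features
y_train = X_train @ rng.random((3, 3))         # stand-in grade targets
model = ExtraTreesRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("ML-EMC prediction:", model.predict(b.reshape(1, -1)).round(3))
```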
Optimisation algorithms for a dynamic truck and drone scheduling problem | With the increasing popularity of online shopping and higher customer demand for better service delivery, the importance of last-mile delivery is growing. The last mile, the final delivery to the customer, comes at a high cost to the retail industry and the environment through pollution caused by delivery vehicles. With the advancement in drone technology, delivery strategies like a truck and drone combination, which makes deliveries in parallel, have become viable. Improving the routing and scheduling of these combined vehicles reduces the high cost of last-mile delivery. Therefore, shortening the route by having the drone intercept the truck offers a significant benefit.
In this research assignment, the coordinates of customer nodes are randomly changed to simulate a dynamic environment while a truck and drone system performs deliveries. This problem is referred to as the dynamic travelling salesperson problem with drone with interception (DTSPDi). This research assignment solves the problem using the ant colony system (ACS) [30], the MAX-MIN ant system (MMAS) [87] and a modified ACS that transfers pheromone knowledge to the next time slice (ACS-KT). The research assignment builds on the algorithm designed by Moremi [64] for the travelling salesperson problem with drone with interception (TSPDi). The three algorithms use 30 datasets of different sizes and spatial patterns for input. The result from the benchmarking was that ACS-KT outperformed the other two algorithms in both the time and distance dimensions. Interestingly, a lower wait time does not mean a lower time or distance for a route. There was also no correlation between drone and truck distances. Therefore, it seems that ACS-KT is better at handling dynamic environmental changes for the DTSPDi problem. |
Review of Big Data clustering methods | In an era defined by the challenges of processing vast and complex datasets, the study delves into the evolving landscape of big data clustering. It introduces a novel taxonomy categorizing clustering models into four distinct groups, offering a roadmap for understanding their scalability and efficiency in the face of increasing data volume and complexity.
The essence of this research lies in its pursuit to critically review, analyze, and evaluate various clustering models, focusing on their suitability and adaptability in handling big data, characterized by the four Vs, i.e. velocity, variety, volume, and veracity. The aim is to discern the operational dynamics of diverse clustering models, considering the findings of prior literature, which have demonstrated varying degrees of performance of these models based on selected metrics. The methodology is firmly rooted in the execution of a series of experiments on chosen clustering methods, metrics, and datasets. This empirical method is crucial to extrapolate how each model fares across different metrics and datasets, offering a comparative perspective on their performance. Subsequent to the experimental phase, an extensive analysis was conducted, breaking down the selected approaches into their algorithmic components. This decomposition is pivotal to identify the origins of gains, losses, or tradeoffs in performance, allowing for an in-depth understanding of why certain models outperform others concerning given metrics and datasets. Insights from this research highlighted the scalability and efficiency of models like parallel k-means and mini-batch k-means, both theoretically and empirically, marking them as exemplary for large-scale applications. Conversely, it unveiled the computational constraints of models like selective sampling based scalable sparse subspace clustering (S5C) and purity weighted consensus clustering (PWCC), showing their limitations in scaling to big data. Acknowledging the limitations imposed by the resource constraints of Google Colab Pro+, the study presents the constraints faced during the evaluation process. The culmination of this project is marked by a comprehensive performance summary, offering key insights into the strengths and weaknesses of the models considered and proffering informed advice on the contextual utilization of each model. It lays the foundation for a centralized database for clustering research, aiming to fill existing knowledge gaps and facilitate optimal model discovery tailored to specific needs and infrastructural capabilities. In conclusion, this research stands as an exploration and analysis in the field of big data clustering, to uncover the potentials and bottlenecks of various models, and offers valuable insights and recommendations, all while reconciling theoretical complexities with empirical validations. |
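As a concrete instance of the scalable end of that spectrum, the sketch below runs scikit-learn's mini-batch k-means on synthetic blobs; the cluster count and batch size are illustrative choices.

```python
# Mini-batch k-means: fits on streamed mini-batches rather than the full set.
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)
model = MiniBatchKMeans(n_clusters=5, batch_size=4096, random_state=0)
labels = model.fit_predict(X)
# Silhouette on a subsample keeps the evaluation itself scalable.
print("silhouette:", silhouette_score(X, labels, sample_size=5000,
                                      random_state=0))
```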
Clustering free text procurement data | The mining industry, like most others, is faced with a diverse range of challenges. Mining companies are now looking into leveraging advanced data analytics to gain insights from their data to make data-driven decisions and inform process debottlenecking to improve throughput and operating costs. Company A grapples with 50% of its group-wide procurement spend stored as unstructured text data, hindering in-depth cost analysis due to variations in describing the same items. The difficulty associated with free-text descriptions in procurement spending is that a single-item purchase can be articulated using various string expressions. Given the thousands of records generated monthly, manually aggregating these diverse strings for in-depth analysis or relying on simple lookups would prove laborious and inefficient. The literature review underscores the rising trend of organisations adopting text-mining techniques to extract insights from unstructured data. This research assignment delved into various techniques such as TF-IDF feature selection, LSA, and word embedding feature transformation, leveraging data from Company A’s procurement database. The exploration of k-means and agglomerative hierarchical clustering (AHC) text clustering techniques revealed that AHC performed better, yielding a high silhouette coefficient and passing validation inspection by a domain expert. Clustering results were analysed in Power BI, leading to the conclusion that while traditional text clustering techniques are effective, modern approaches to feature selection and dimension reduction are essential for optimal results. The research assignment successfully achieved its goal of enabling data analysis through the clustering of free text data. |
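A minimal, hedged version of this pipeline on invented item descriptions: TF-IDF features, LSA reduction, then agglomerative clustering scored with the silhouette coefficient; the vectoriser settings and cluster count are illustrative choices, not the assignment's configuration.

```python
# TF-IDF -> LSA -> agglomerative clustering on toy procurement strings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

descriptions = ["BOLT M12 GALV", "GALVANISED BOLT 12MM", "BEARING 6205 SKF",
                "SKF BEARING 6205-2RS", "PUMP SEAL KIT", "SEAL KIT FOR PUMP"]
# Character n-grams cope with abbreviations and word-order variations.
X = TfidfVectorizer(analyzer="char_wb",
                    ngram_range=(3, 4)).fit_transform(descriptions)
X_lsa = TruncatedSVD(n_components=4, random_state=0).fit_transform(X)
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_lsa)
print(labels, silhouette_score(X_lsa, labels))
```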
Few-shot learning for passive acoustic monitoring of endangered species | The Hainan gibbon is a primate from the Chinese island-province of Hainan. The population of this primate has been in decline because of poaching, and is now facing extinction. Bioacoustics is a field concerned with the acquisition and study of animal sounds. Passive acoustic monitoring is an important step in data capture, and often captures months of data. Due to the low population numbers of endangered species, experts spend a large amount of time on the analysis and identification of bioacoustic signatures.
Machine learning can be used to automate the bioacoustic identification of species, which would reduce analysis costs and time. Unfortunately, many machine learning algorithms require large amounts of data to perform reliably. Few-shot learning is a loosely defined structure in machine learning that aims to solve the limited data problem with unique approaches. This assignment explores the viability of accurate, image-based classification models when subject to low data volumes. Audio data is converted to spectrograms and used in image analysis. A Siamese framework, which has roots in convolutional neural networks (CNN), is the foundation of the few-shot learning approach. Within this CNN-based framework, contrastive-loss and triplet loss architectures, data augmentation techniques, transfer learning methods, and reduced image resolution datasets are investigated. The results indicate that the triplet-loss architecture produces the most accurate models, with excellent precision, recall, and F1-score statistics. The triplet-loss models prefer lower resolution images, which reduce computation time and cost. Importantly, the performance of the triplet-loss models is not affected by low data volumes. On the other hand, contrastive-loss models show significant performance degradation on lower data volumes. Overall, the triplet-loss “base CNN” model is the recommended network. This network achieves an accuracy of 99.08% and F1-score of 0.995. The Siamese framework has demonstrated a strong ability to identify the bioacoustic signature of the Hainan gibbon. Recommendations are provided for further research in this domain. |
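The sketch below shows the skeleton of a triplet-loss embedding network for spectrogram inputs, assuming PyTorch; the small CNN is an illustrative stand-in, not the assignment's "base CNN" architecture, and the random tensors stand in for real spectrogram batches.

```python
# Triplet-loss embedding network for spectrogram images.
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.head = nn.Linear(32 * 4 * 4, dim)

    def forward(self, x):
        z = self.head(self.features(x).flatten(1))
        return nn.functional.normalize(z, dim=1)   # unit-length embeddings

net = EmbeddingNet()
criterion = nn.TripletMarginLoss(margin=0.5)
# anchor/positive: gibbon-call spectrograms; negative: background noise.
a, p, n = (torch.randn(8, 1, 64, 64) for _ in range(3))
loss = criterion(net(a), net(p), net(n))
loss.backward()
```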
Digitization Of Test Pit Log Documents For Development Of A Smart Digital Ground Investigation Companion | Various geotechnical companies in South Africa have, over the years, conducted ground investigations using the test pit method. A test pit involves digging a hole into the ground and making observations of the ground conditions. These companies have documented their observations in PDF format. However, given recent technological advancements, there is a growing need to digitize these documents for thorough analysis. In response to this requirement, these companies have furnished these documents to the Civil Engineering Department of Stellenbosch University.
Digitization is a way of converting PDF documents into a format that can be analyzed using a computer. There are two common ways to digitize documents, namely manual and automatic. Manual digitization includes copying and pasting information from documents to a database or retyping information contained in the document into a database. This process is laborious, time-consuming, prone to errors and costly. This project explored and presented an automated way of digitizing documents, using an object detection model for document layout analysis and optical character recognition for extracting alphanumeric characters from images. The object detection model was developed by fine-tuning a pre-trained Faster R-CNN model available in the Detectron2 framework. This process involved leveraging a blend of manually annotated images and synthetically generated annotations. The results demonstrated model R-101 (a variant of R101-FPN) as having a balanced performance based on accuracy and inference time. The values of mAR, mAP and inference time for model R-101 are 74.3%, 71.0% and 0.371 seconds/image, respectively. This object detection model was used to identify and provide ROI coordinates and labels to the optical character recognition algorithm. Various optical character recognition algorithms were evaluated and compared across various image qualities. PaddleOCR outperformed the other three algorithms, achieving a word recognition rate of 96%. Nevertheless, the performance of these algorithms was lower on blurred images as compared to other image qualities. Spelling checking and correction was conducted to improve the recognition rate of PaddleOCR outputs by a further 1.2%. An interactive application, which can be accessed online via a web link or offline on a desktop, was developed for exploring the dataset. This application allows for creating scenarios using multiple slicers to visualize a word cloud of common words and the frequency of characteristics (e.g. soil type, moisture condition and particle size) used to describe each scenario. A semantic search algorithm was fine-tuned using sentence transformers to allow users to query the dataset using natural language, and a separate desktop application was developed to facilitate this. Evaluating the semantic search algorithm revealed precision, recall and F1 score of 68.3%, 65.7% and 67.0%, respectively. Suggestions for further work include performing exhaustive data analysis to discover insights and hidden patterns, training a language model for improving spelling correction, collecting more documents for developing a large geological and engineering dataset, as well as training a question-answering machine learning model to make data and insights more discoverable. |
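For the recognition step, a hedged PaddleOCR sketch is shown below; the image path is a placeholder, and in the full pipeline the inputs would be the ROI crops produced by the Detectron2 layout model rather than a whole page.

```python
# Text recognition with PaddleOCR (illustrative usage).
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")   # loads detection + recognition
result = ocr.ocr("test_pit_log_page1.png", cls=True)
for line in result[0]:                           # one entry per detected line
    box, (text, confidence) = line
    print(f"{confidence:.2f}  {text}")
```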
Comparison of machine learning models on financial time series data | The efficient market hypothesis states that financial markets are efficient and that investors can therefore not make excess profits consistently, because all public information is instantly reflected in the share price. Academia and investors have shown that the efficient market hypothesis does not always hold true and that market prices can be exploited when the right financial trading and price models are used to model the relationship in the underlying data. This research assignment focuses on the development of multiple machine learning models, in combination with a financial trading strategy that utilises a mixture of technical indicators, to compare the performance of different machine learning algorithms on financial time series data.
The financial time series data collected for this research assignment were 10-year minute ticker data. Two foreign exchange rate data sets, the USD/ZAR and ZAR/JPY rates, were used. The other three data sets collected were the S&P 500 index, the FTSE 100 index, and the Brent crude oil index. The first step was to analyse the quality of the data sets. After the quality had been assessed, a trading strategy and financial trading model were used to combine the 20-period moving average, the relative strength index, and the average directional index for the labelling process. Twelve machine learning models were developed to forecast the financial time series data sets. These were the baseline logistic regression, support vector machine, k-nearest neighbour, decision tree, random forest, Elman recurrent neural network, Jordan recurrent neural network, Jordan-Elman recurrent neural network, long short-term memory neural network, time-delay neural network, resilient back propagation feed-forward neural network, and particle swarm optimisation feed-forward neural network. The results from the experiments indicate that the support vector machine performed the best out of all the machine learning models considered, while the baseline logistic regression model outperformed all the remaining models. The random forest and resilient back propagation feed-forward neural network models performed third and fourth best. These two models had higher recall scores than most models, but their accuracy scores were significantly lower than those of the baseline and support vector machine models. The recurrent neural network models had very poor performance. Specifically, the Elman and Jordan-Elman models had the poorest performance of the models investigated. It was determined that the non-neural-network machine learning models were less computationally complex, and were less dependent on a balanced data set, than the neural network models. |
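To illustrate the indicator-based labelling idea, the sketch below computes a 20-period moving average and a standard 14-period RSI with pandas; the long/flat rule is a simplified stand-in for the assignment's trading model, and the price series is synthetic.

```python
# Moving average and RSI as labelling inputs for a trading signal.
import numpy as np
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    return 100 - 100 / (1 + gain / loss)

# Synthetic random-walk price series standing in for minute ticker data.
close = pd.Series(np.cumsum(np.random.default_rng(0).normal(size=500)) + 100)
ma20 = close.rolling(20).mean()
signal = ((close > ma20) & (rsi(close) < 70)).astype(int)  # 1 = long, 0 = flat
print(signal.tail())
```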
Trends in Infrastructure Delivery from Media Reports | It has been shown that investment in public infrastructure such as roads and electricity generally leads to economic growth, and economic growth in turn helps fight poverty and income inequality. It is therefore not surprising that the need to monitor the condition of infrastructure arises. Infrastructure report cards (IRCs) assess the condition of a country’s infrastructure. The South African Institution of Civil Engineering (SAICE) publishes IRCs for South Africa. However, limited data availability for some infrastructure sectors hampers the compilation of the SAICE IRCs. Online news articles are a promising alternative data source to assist in the compilation of the SAICE IRCs, since they are in the public domain and there is an abundance of reputable news websites covering virtually all regions of South Africa. The task of extracting information from a large volume of online news articles can be automated to a large extent by making use of various natural language processing techniques.
In this research assignment, online news articles are collected from nine South African news websites. Topic modelling is then applied to each of the collected data sets with the goal of grouping together news articles related to specific infrastructure issues, e.g., all news articles about potholes or all news articles about sewage spills, and then representing each group of news articles as a topic. A summary for each topic is then generated by making use of a large language model. Lastly, a dashboard is designed to effectively visualise the topics and the summaries generated for these topics. This dashboard can then be used as a tool by SAICE to identify and monitor prevalent infrastructure issues in various regions of South Africa, while also providing SAICE with additional data for the compilation of IRCs. This research assignment concludes that it is feasible to apply topic modelling to South African news data sets for the extraction of infrastructure-related topics. It is furthermore concluded that topic modelling can help address the lack of data in compiling the SAICE IRCs. Lastly, it is concluded that it is feasible to generate summaries for the extracted topics using large language models, although the generated topic summaries can be improved upon. |
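As a compact stand-in for the topic-modelling step, the sketch below fits scikit-learn's LDA to a few invented headlines; a real run would use the scraped article corpora and pass each topic's representative articles to a large language model for summarisation.

```python
# Latent Dirichlet allocation over toy infrastructure headlines.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["potholes damage cars on provincial roads",
        "municipality struggles with sewage spill in river",
        "new pothole repair programme for road network",
        "raw sewage contaminates beaches after pump failure"]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", ", ".join(vocab[topic.argsort()[-4:]]))
```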
Investigating sales forecasting in the formal liquor market using deep learning techniques | This research assignment focuses on forecasting sales in the liquor industry, examining the effectiveness of deep learning techniques and a stacked ensemble approach. Time-series forecasting is a widely used technique in various fields such as economics, finance, and operations research.
A thorough literature review was conducted to gain an in-depth understanding of the topic and to survey existing solutions in the field. The study involved a thorough analysis of datasets to understand the inherent structures of the series. Evaluation metrics and various algorithms were used to assess the effectiveness of time-series forecasting techniques. The research assignment found that deep learning techniques and ensemble theory can successfully be applied to forecast sales in the liquor industry. A stacked ensemble approach was effective in improving the overall performance. The findings have the potential to significantly improve current implementations of time-series forecasting, while reducing the computational complexity and expenses associated with granular forecasting models. The research assignment concludes that deep learning and ensemble models offer a promising avenue for efficient and accurate sales forecasting in the liquor industry, being more time-efficient and computationally less complex than traditional methods. |
Automated Localisation and Classification of Trauma Implants in Leg X-rays through Deep Learning | Revision surgery often requires orthopedic surgeons to pre-operatively identify failed implants in order to reduce the complexity and cost of the surgery. Surgeons typically examine the X-rays of a patient for preoperative implant identification, even though this method is time-consuming and occasionally unsuccessful. This study investigates the use of deep learning to automate the identification of trauma implants in leg X-rays. The investigation assesses the performance of various object detection and classification models on a dataset of trauma implants, aiming to identify the optimal deep learning solution. Challenges related to this research include limited data, imbalanced class distributions, and the presence of multiple implants in the X-ray images.
The results of the investigation indicate that the optimal deep learning solution is a two-model pipeline that employs a you only look once (YOLO) object detection model and a densely connected convolutional neural network (DenseNet) classification model. The DenseNet classification model classifies the trauma implants localised by the YOLO object detection model. The proposed pipeline achieves a mean average precision (intersection over union threshold of 0.5) of 0.967 for implant localisation and an accuracy of 73.7% for implant classification. The results of the study provide evidence that deep learning models are capable of identifying trauma implants. Additionally, the study offers a deep learning solution that can be utilised in future research related to identifying trauma implants. |
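An outline of such a two-model pipeline is sketched below, assuming the ultralytics and torchvision APIs; the weight files, class count and image path are hypothetical placeholders, and both models would first be fine-tuned on the implant dataset.

```python
# Two-stage pipeline: YOLO proposes implant boxes, DenseNet classifies crops.
import torch
from PIL import Image
from torchvision import models, transforms
from ultralytics import YOLO

detector = YOLO("implant_detector.pt")          # fine-tuned YOLO weights
classifier = models.densenet121(weights="DEFAULT")
classifier.classifier = torch.nn.Linear(classifier.classifier.in_features, 10)
classifier.eval()
prep = transforms.Compose([transforms.Resize((224, 224)),
                           transforms.ToTensor()])

image = Image.open("leg_xray.png").convert("RGB")
for box in detector(image)[0].boxes.xyxy.tolist():   # [x1, y1, x2, y2]
    crop = prep(image.crop(tuple(int(v) for v in box))).unsqueeze(0)
    with torch.no_grad():
        implant_class = classifier(crop).argmax(1).item()
    print(box, "->", implant_class)
```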
Association between the features used by a convolutional neural network for skin cancer diagnosis and the ABC-criteria and 7-point skin lesion malignancy checklist | Melanoma cases and the associated mortality rate are rising rapidly. The early detection of melanoma is crucial in decreasing the mortality rate. However, traditional methods employed by dermatologists to diagnose skin lesions are time-consuming and vulnerable to human error. Convolutional neural networks (CNNs) show promise in improving the efficiency and accuracy of classifying skin lesions as malignant or benign. However, the lack of transparency in the decision-making process of CNNs prevents these models from clinical application. For a CNN to be approved for clinical application, it must be shown that the features used by a CNN to classify skin lesions are clinical indicators of melanoma, i.e. the ABCDE criteria and 7-point skin lesion malignancy checklist.
In this research assignment, a methodology is developed to evaluate whether the features used by a CNN to classify skin lesions correspond to the ABC-criteria and the 7-point skin lesion malignancy checklist. A CNN model is developed, trained, and tested to assess the application of the formulated methodology. The association between the ABC-criteria and the 7-point skin lesion malignancy checklist features and melanoma in the test dataset is investigated using statistical methods to establish a ground truth. The association between the ABC-criteria and the 7-point skin lesion malignancy checklist features and the features extracted by the CNN is determined using t-distributed stochastic neighbour embedding (t-SNE) and statistical tests. The importance of colour is evaluated by testing the performance of the CNN on a grayscale dataset. The association of dataset issues with the extracted features is examined using statistical tests, and misclassifications are investigated based on features and dataset issues. Local interpretable model-agnostic explanations (LIME) is employed to explain misclassifications and correctly classified images, providing insights into the decision-making process of the CNN. The InceptionResNetV2 model with a leaky ReLU activation was selected to evaluate the formulated methodology. The correlation tests between the ABC-criteria and the 7-point skin lesion malignancy checklist features and the melanoma diagnosis in the curated test dataset showed a strong association between all the features and melanoma, except for vascular structures, brown, red and black. These results were reflected in the evaluation of the association between the features used by the CNN and the ABC-criteria and the 7-point malignancy checklist, since there was a strong association between the extracted features and the ABC-criteria and the 7-point malignancy checklist features, except for vascular structures, brown, red and black. The decrease in performance of the InceptionResNetV2 model on the grayscale dataset indicated that colour is a feature that the CNN uses to detect melanoma. The CNN demonstrated robustness to dataset issues but showed sensitivity to the presence of hair and immersion fluid, suggesting the need for further preprocessing of the images. Overall, it was concluded that the developed methodology can determine whether a CNN uses the features in the ABC-criteria and the 7-point malignancy checklist to classify skin lesions as malignant or benign. The developed methodology showed that the CNN uses the features of the ABC-criteria and the 7-point malignancy checklist to determine whether a skin lesion is malignant or benign. |
December 2023 Graduation |
|
Title | Abstract |
---|---|
A dynamic optimisation approach to training feed-forward neural networks that form part of an active learning paradigm | Active learning describes a paradigm of continually selecting the most informative patterns to train a model while training progresses. Literature indicates that the parameter search landscape of feed-forward neural networks (FFNNs) that form part of an active learning paradigm does not generalise to the parameter search landscape of FFNNs trained by a static training set. The parameter search landscape of FFNNs that form part of an active learning paradigm is theorised to change while the search progresses. This research assignment investigates the effect of changing the optimiser of a FFNN that forms part of an active learning paradigm from backpropagation to a dynamic optimisation algorithm. To this end, the cooperative quantum-behaved particle swarm optimisation (CQPSO) algorithm was implemented to train FFNNs that form part of two different active learning paradigms. The active learning paradigms investigated were dynamic pattern selection (DPS) and sensitivity analysis selective learning (SASLA). Six data sets were used for the investigation. A novel hyperparameter tuning procedure was implemented to ensure efficient optimiser performance for each problem set. It was found that the CQPSO algorithm located and tracked the global minimum of four out of the six problem sets more effectively than the backpropagation algorithm in the DPS active learning paradigm. Conversely, the backpropagation algorithm located and tracked the global minimum of four out of the six problem sets more effectively than the CQPSO algorithm in the SASLA active learning paradigm. The CQPSO algorithm performance was found to depend on the dimensionality of the search space as well as the interdependence of the input training patterns. |
|
Course Recommendation Based on Content Affinity with Browsing Behaviour | A recommender, or recommendation system (RS), filters and provides relevant content to a user based on many factors such as their historic behaviour during interactions with a particular system or software. A RS is aimed at improving user experience and overcoming issues such as the distressing search problem experienced in massive open online course (MOOC) platforms. One such online platform is Physioplus, whose subscribers generally have very specific educational needs and thus can greatly benefit from targeted responses when interacting with the system. It can therefore be argued that an enhanced course recommender engine possesses great potential to increase Physioplus subscribers’ satisfaction and thus reduce cancellations. The current search feature in Physioplus has some limitations, as it uses keywords, static course recommendations, and elastic site search without considering historic user site visits. The purpose of this study is to build a better course recommender system for Physioplus. The recommender takes a user’s recent Physiopedia browsing history and provides the user with a tailored and rank-ordered list of those courses that are most relevant to their entire content history. The content of a user browsing history is highly correlated with the content of the most relevant courses for that user. The recommender is built using a collaborative filtering (CF) technique, with item-based and user-based approaches. Natural language processing and neighbourhood similarity methods are used to complement collaborative filtering in achieving quality recommendations. The course recommender system in this study uses a training and testing dataset from a real-world Physioplus system to assess the overall performance of the proposed approach. The experiment evaluation is measured by comparing recommended versus completed courses. The results show that the proposed RS has a recall score of 76% and an accuracy rate of 53% obtained in the offline experiment exercise. The assumption is that the performance metrics score will improve once the proposed RS integrates with the existing Physioplus production system. All in all, the proposed RS can play an essential role in assisting users with relevant courses. |
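A toy item-based collaborative filtering step in the spirit of the recommender described: cosine similarity between course interaction vectors, then scores for unseen courses; the interaction matrix is invented, not Physioplus data.

```python
# Item-based collaborative filtering on a toy user-course matrix.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = courses; 1 = viewed/completed.
R = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [1, 0, 1, 1],
              [0, 0, 1, 1]])
item_sim = cosine_similarity(R.T)        # course-by-course similarity
user = 0
scores = R[user] @ item_sim              # affinity of the user with each course
scores[R[user] == 1] = -np.inf           # mask already-seen courses
print("recommend course:", int(np.argmax(scores)))
```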
|
An Evolutionary Algorithm for the Vehicle Routing Problem with Drones with Interceptions | The use of trucks and drones as a solution to address last-mile delivery challenges is a new and promising research direction explored in this assignment. The variation of the problem where the drone can intercept the truck while in movement or at the customer location is part of an optimisation problem called the vehicle routing problem (VRP) with drones with interception (VRPDi). This study proposes an evolutionary algorithm (EA) to solve the VRPDi. The study demonstrates a metaheuristic strategy by applying an evolution-based algorithm to solve the VRPDi. In this variation of the VRPDi, multiple pairs of trucks and drones need to be scheduled. The pairs leave and return to a depot location together or separately to make deliveries to customer nodes. The drone can intercept the truck after the delivery or meet up with the truck at the following customer location. The algorithm was executed on the travelling salesman problem with drones (TSPD) datasets by Bouman et al. (2015), and the performance of the algorithm was compared by benchmarking the results of the VRPDi against the results of the VRP on the same dataset. This comparison showed improvements in total delivery time of between 39% and 60%. Further detailed analysis of the algorithm results examined the total delivery time, total distance, the node delivery scheduling and the degree of diversity during the algorithm execution. This analysis also considered how the algorithm handled the VRPDi constraints. The results of the algorithm were then benchmarked against algorithms in Dillon et al. (2023) and Ernst (2024). The latter solved the problem with a maximum drone distance constraint added to the VRPDi. The analysis and benchmarking of the algorithm results showed that the algorithm satisfactorily solved 50- and 100-node problems in a reasonable amount of time, and the solutions found were better than those found by the algorithms in Dillon et al. (2023) and Ernst (2024) for the same problems. However, the algorithm performance deteriorated considerably as the number of nodes in the problems increased. This deterioration was both in terms of the quality of the solution and the computation time required to solve the problem. |
|
Metaheuristics for Training Deep Neural Networks | Presently, artificial neural networks (ANNs) are popular among researchers as well as in commercial settings. The use of ANNs continues to expand into different fields. The increase in interest in ANNs has led researchers to explore various new and innovative ways to improve the performance of ANNs. One such way is to explore the use of metaheuristics in the training of ANNs. This research assignment theoretically and empirically compares the use of metaheuristics as an alternative to the traditional training algorithm, i.e. backpropagation with stochastic gradient descent (SGD), to train deep neural networks (DNNs). Three specific metaheuristics are considered, namely particle swarm optimisation (PSO), genetic algorithm (GA) and differential evolution (DE). An in-depth analysis of SGD is conducted to highlight some potential disadvantages which might occur in the training process. The field of metaheuristics is explored as an alternative source of training algorithms, with specific emphasis placed on the three specified metaheuristics. Five different experiments are conducted to empirically compare the backpropagation SGD training algorithm with the PSO, GA and DE training algorithms. The experiments are conducted on an image dataset. The DNN used in the experiments is a convolutional neural network (CNN). The results conclude that SGD performs better than the metaheuristics considered. Potential future work is also discussed based on the findings of this research assignment. |
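To make the metaheuristic-as-trainer idea concrete, the sketch below uses a bare-bones PSO to train a tiny one-hidden-layer network on XOR; the swarm size and coefficients are textbook defaults, not the assignment's tuned settings, and the real experiments targeted a CNN.

```python
# PSO as a gradient-free trainer for a 2-4-1 network on XOR.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([0, 1, 1, 0], float)
DIM = 2 * 4 + 4 + 4 + 1          # weights and biases of a 2-4-1 network

def loss(w):
    W1, b1 = w[:8].reshape(2, 4), w[8:12]
    W2, b2 = w[12:16], w[16]
    h = np.tanh(X @ W1 + b1)
    out = 1 / (1 + np.exp(-(h @ W2 + b2)))
    return np.mean((out - y) ** 2)

pos = rng.normal(size=(30, DIM))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([loss(p) for p in pos])
for _ in range(300):
    gbest = pbest[pbest_val.argmin()]
    r1, r2 = rng.random((2, 30, DIM))
    # Inertia plus cognitive and social pulls (standard PSO update).
    vel = 0.72 * vel + 1.49 * r1 * (pbest - pos) + 1.49 * r2 * (gbest - pos)
    pos = pos + vel
    vals = np.array([loss(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
print("best MSE:", pbest_val.min())
```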
|
Diversity preservation for decomposition particle swarm optimization as feed-forward neural network training algorithm under the presence of concept drift | Time series forecasting is an important area of research that lends itself to various fields in which it is practically applied. The importance of time series forecasting has led to much research in efforts to improve the accuracy of predictions. The use of artificial neural networks for time series forecasting has grown, especially with the development of simple recurrent neural networks (SRNNs). SRNNs have been shown to handle temporal sequences efficiently. Specialised architectures for SRNNs increase the computational cost due to the increase in the number of weights that require optimisation during training. Therefore, the training process of neural networks can be rephrased as an optimisation problem. Recent work has shown how specialised dynamic particle swarm optimisation (PSO) algorithms can replace traditional backpropagation as a learning algorithm for feed-forward neural networks (FFNNs). Dynamic PSO algorithms to train FFNNs have been shown to outperform SRNNs using traditional backpropagation. Due to the increased dimensions for larger problems, various cooperative PSO algorithms have been developed to address the credit assignment problem as well as to better cope with variable dependency; one such PSO variant is the decomposition cooperative particle swarm optimisation algorithm. One limitation of using PSO variants for training in dynamic environments is that as the particles in a swarm converge in a specific region, the swarm diversity decays, making it difficult to adapt to environmental changes. Dynamic PSO algorithms have been successfully used in the sub-swarms of decomposition cooperative particle swarm optimisers (DCPSOs). However, these dynamic DCPSO algorithms have been shown to struggle under specific classes of dynamism. Therefore, the preservation of swarm diversity is directly linked to the ability to adapt in the presence of concept drift. This research project proposes various diversity preservation techniques to promote swarm diversity throughout various environmental changes. The diversity preservation techniques investigated are the use of random decomposition for dynamic DCPSO and a diversity-based penalty function for regularization. For this purpose, experiments were conducted on five well-known nonstationary forecasting problems under various classes of dynamism. Results obtained on two implementations of the DCPSO using the proposed diversity preservation techniques showed success in promoting swarm diversity. Two main implementations of DCPSOs were investigated, namely dynamic and static sub-swarms. When a static PSO algorithm was used for the sub-swarms of the DCPSO, the diversity preservation showed a significant impact. The proposed diversity preservation techniques also significantly affected swarm diversity for the DCPSO using the quantum particle swarm optimisation algorithm (QSO) as sub-swarms. The use of the diversity-based penalty function for regularization showed superior performance on the training and generalization error for dynamic DCPSO. Still, it did not show a statistically significant effect on preserving swarm diversity. The use of static PSO algorithms as sub-swarms for DCPSO showed that random decomposition ranked high across the various experiments, while swarm diversity was significantly impacted.
The proposed diversity preservation techniques for the dynamic DCPSO algorithms showed a trade-off between diversity preservation and performance. |
|
March 2023 Graduation |
|
Title | Abstract |
---|---|
Adaptive thresholding for microplot segmentation | Food security remains a global concern as flagged by the Food and Agriculture Organization of the United Nations (FAO). They report that globally one in three people do not have access to adequate food, with a third of those living in Africa. The effect of climate change on crop yields adds to these concerns. Wheat makes up a substantial share of food consumption globally at 18.3% and it is particularly sensitive to the rising temperatures associated with global warming. The FAO emphasises that agricultural technology has a significant role to play in food security, with research contributing to the breeding of high-yield and heat-resistant crops as an important focus area. The Department of Genetics at Stellenbosch University has a wheat pre-breeding programme that develops and tests novel crop variants. This programme monitors several experimental sites that contain microplots: relatively small wheat plots. At a single pre-breeding experimental site, there are often hundreds of microplots that must be monitored and evaluated. The within-season evaluation of microplots is performed by using digital high throughput phenotyping (HTP) analysis performed on orthomosaic images collected using unmanned aerial vehicles (UAVs). One of the phases of HTP is the plot identification phase, also referred to as microplot segmentation. The current method used to perform microplot segmentation in the programme makes use of a grid that a user must impose over the orthomosaic image and manually adjust to ensure accurate segmentation. This method is manual and requires extensive post-processing to get a good fit. In addition, the current method does not generalise well to conditions that will pragmatically vary between orthomosaic collection iterations. To reduce the time spent by researchers to segment microplots, this research assignment developed an automated microplot segmentation method that requires minimal input from the user. The microplot segmentation approach, referred to as the adaptive thresholding procedure (ATP), was developed for this research assignment. The ATP uses unsupervised learning to identify and localise microplots. Unlike a grid segmentation approach, the ATP does not require any prior knowledge of the microplot layout and does not require the user to adjust a grid. The performance of the ATP microplot segmentation procedure was evaluated on thirteen orthomosaic images from four different experimental sites and subsequently compared against two manual microplot segmentation procedures. The three different microplot segmentation approaches were compared using three objective criteria, namely accuracy, intersection over union, and the level of user input required. The ATP yielded superior performance in comparison to the other two segmentation methods when the conditions at the experimental sites were favourable. In the presence of weeds, the ATP did not yield satisfactory performance, as the approach finds it challenging to differentiate between vegetation, weeds and non-vegetation. Despite this limitation, the ATP contributes to the existing body of knowledge on microplot segmentation methods by providing an automated microplot segmentation method that requires minimal user input. |
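A simplified take on the unsupervised segmentation idea, assuming OpenCV: an excess-green vegetation index followed by adaptive thresholding and contour extraction to localise plot-like regions; the tile path, block size and area cut-off are illustrative, not the ATP's settings.

```python
# Vegetation index + adaptive threshold + contours for plot candidates.
import cv2
import numpy as np

bgr = cv2.imread("orthomosaic_tile.png").astype(np.float32)
b, g, r = cv2.split(bgr / 255.0)
exg = 2 * g - r - b                               # excess-green index
exg8 = cv2.normalize(exg, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
mask = cv2.adaptiveThreshold(exg8, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                             cv2.THRESH_BINARY, 51, -5)
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 500]
print(f"{len(boxes)} candidate microplot regions")
```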
Decision Support Guidelines for Selecting Modern Business Intelligence Platforms in Manufacturing to Support Business Decision Making | Globally, the generation of data is increasing rapidly, and the increasing competitiveness of global markets constantly challenges the business world due to globalisation. Companies rely on sophisticated technology to manage and make decisions in this dynamic business environment and ever-evolving market. Executives are under constant pressure to maximise profits from new offerings and operational efficiencies and to improve customer and employee experience. As digitalisation in the manufacturing industry increases, the role of data analytics and business intelligence (BI) in decision-making is significantly increasing. Manufacturers generate abundant structured and unstructured business information throughout the product lifecycle that can be used to achieve their business objectives. However, the manufacturing industry is amongst the laggard sectors pertaining to digitisation and often lacks the technological and organisational foundations required to implement data tools as part of their ecosystem. BI provides business insights to better understand the company’s large amounts of data, operations and customers. This, in turn, can contribute to better decision-making and consequently improve results and profit. Rationalisation of the technologies, tools and techniques can be challenging. The selection of an appropriate tool can be time-consuming, complex and overwhelming due to the wide variety of available BI software products, each claiming that their solution offers distinctive and business-essential features. This research assignment aims to address the need for a useful approach to BI tool evaluation and selection by identifying guidelines to support decision-makers in selecting BI tools. A thematic analysis approach was used to collect, analyse and interpret the information from semi-structured interviews with professionals from the manufacturing industry. The research gauged respondents’ views on the utilisation of BI, the data challenges experienced in manufacturing, the essential criteria BI tools should fulfil, and the approaches followed in practice to select software. The research revealed that BI plays a significant role in decision-making and the prioritisation of tasks in manufacturing. The results showed that respondents valued different BI criteria requirements and decision-making processes. The findings and insights gleaned from the literature review were used to propose guidelines that support manufacturers in their decision. These guidelines elucidate the dimensions to evaluate and provide a nine-step selection process to compare BI software. |
|
An Investigation into the Automatic Behaviour Classification of the African Penguin | In this modern era, climate change, deforestation, and the rapid decline of natural resources are issues that seem ever-increasing. With the extinction of many fauna and flora species in past decades, renewed focus on conservation efforts is advocated globally. The escalation of digitization brings with it an opportunity to improve conservation efforts and, consequently, reduce the rapid decline of biodiversity. Modelling and forecasting the progression of invasive species, ascertaining the presence of endangered species prior to the sanctioning of construction projects, and monitoring threatened ecosystems are some of the many ecologically beneficial possibilities technology provides. A prevalent application gaining much momentum is the notion of applying machine learning and artificial intelligence to the domain of ecology. One such application considers animal behavioural studies — a predominantly manual endeavour requiring mounted sensors, tracking devices and/or the continued presence and attention of a human. Ascribed to the invasive nature of many such studies, behaviour is often distorted or (at the very least) influenced. Modern computerised and digitised approaches address many of these drawbacks by providing a means of evaluating behaviour in a non-invasive (or less-invasive) manner. Mounted video cameras are, for example, less cumbersome than traditional wearable sensors. In addition, the presence of a human within or near the animal is no longer required. Considering the potential benefits to conservation, incorporating this technology into the field of behavioural studies is well warranted. This project is dedicated to investigating the applicability of modern machine learning, specifically deep learning, to behaviour analysis in the endangered African penguin. The aim of this project is to investigate, develop, and deploy a model facilitating automatic behaviour classification in these penguins — a foundational contribution to improve current conservation efforts (improving passive monitoring systems and anomaly detection within a colony could potentially reduce response time in times of distress). The project considers a dual implementation — coordinates detailing animal movement are first extracted and subsequently presented to a suitable classifier facilitating behaviour classification. Three case studies are considered: single penguins, two individuals, and three individuals (regarded as multiple individuals). A comprehensive investigation into the algorithmic performance associated with these models is performed and presented. Ultimately, the case evaluating three individuals based on the behaviours excitement and normal achieves an AUC of 72.9%. The case evaluating two individuals based on the behaviours interaction and no interaction achieves an AUC of 84.2%. Finally, the case evaluating one individual based on the behaviours braying, flapping, preening, resting, standing, and walking achieves an AUC of 82.1%. This yields valuable insight into the utility, applicability, and feasibility of automatic behaviour classification of the African penguin. Pivotal to this work is the foundation it provides to the design, development, and implementation of a passive monitoring system, as well as its benefits and contributions towards a holistic goal — aiding conservation efforts to preserve fauna and flora for future generations. |
Set-based Particle Swarm Optimization for Medoids-based Clustering of Stationary and Non-Stationary Data | Data clustering is the grouping of data instances so that similar instances are placed in the same group or cluster. Clustering has a wide range of applications and is a highly studied field of data science and computational intelligence. In particular, population-based algorithms such as particle swarm optimization (PSO) have shown to be effective at data clustering. Set-based particle swarm optimization (SBPSO) is a generic set-based variant of PSO that substitutes the vector-based mechanisms of PSO with set theory. SBPSO is designed for problems that can be formulated as sets of elements, and its aim is to find the optimal subset of elements from the optimization problem universe. When applied to clustering, SBPSO searches for an optimal set of medoids from the dataset through the optimization of an internal cluster validation criterion. In this research assignment, SBPSO is used to cluster fifteen datasets with diverse characteristics such as dimensionality, cluster counts, cluster sizes, and the presence of outliers. The SBPSO hyperparameters are tuned for optimal clustering performance on these datasets, which is compared in depth to the performance of seven other tuned clustering algorithms. Then, a sensitivity analysis of the SBPSO hyperparameters is performed to determine the effect that variation in these hyperparameters has on swarm diversity and other measures, to enable future research into the clustering of non-stationary data with SBPSO. It was found that SBPSO is a viable clustering algorithm. SBPSO ranked third among the algorithms evaluated, although it appeared less effective on datasets with more clusters. A significant trade-off between swarm diversity and clustering ability was discovered, and the hyperparameters that control this trade-off were determined. Strategies to address these shortcomings were suggested. |
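The set-based formulation lends itself to a simple fitness sketch: given a candidate medoid set (one SBPSO particle), score it with an internal validation value. The criterion below (total distance to the closest medoid) is a stand-in for whichever criterion the assignment actually optimises:

```python
import numpy as np

def medoid_fitness(data: np.ndarray, medoid_idx: set[int]) -> float:
    """Internal validation value of a candidate medoid set: the sum of
    distances from each instance to its closest medoid (lower is better)."""
    medoids = data[list(medoid_idx)]
    dists = np.linalg.norm(data[:, None, :] - medoids[None, :, :], axis=2)
    return float(dists.min(axis=1).sum())

rng = np.random.default_rng(2)
data = rng.normal(size=(100, 4))
print(medoid_fitness(data, {3, 41, 77}))   # one candidate "particle" (a set)
```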
|
An Extension of the CRISP-DM Framework to Incorporate Change Management to Improve the Adoption of Digital Projects | Digital transformation brings technology such as artificial intelligence (AI) into the core operations of businesses, increasing their revenue while reducing their costs. AI deployments tripled in 2019, having grown by 270% in just four years. However, digital transformation is a challenging task to complete successfully. A total of 45% of large digital projects run over budget, while only 44% of digital projects ever achieve the predicted value. The primary reason for these failures can be attributed to the human aspects of these projects. Examples of these human aspects are the difficulty of access to software, the lack of understanding of technology, and the lack of knowledge to operate the technology. The continued success of digital transformation requires both technical and change management drivers to be in place before, during, and after AI implementations. The project starts by describing digital projects. Digital projects, which include data science and AI, have an extremely low success rate, with change management as a fundamental barrier to the success of these projects. To address the change management challenges, five different change management models are compared, from which a generalised change management model is constructed. From the literature, it is concluded that the CRISP-DM framework is one of the most widely used analytics models for implementing digital projects. Using the generalised change management model, the change management gaps within the CRISP-DM framework are identified. An extended CRISP-DM framework is constructed by filling the identified gaps in the original CRISP-DM framework with the tasks in the generalised change management model. The extended CRISP-DM framework is then detailed and validated against a real-world case study. The validation shows that the extended CRISP-DM framework indicates change management improvement areas which would most likely have improved the adoption of the project. For this research project, success ultimately lies in the ability of the developed framework to provide an effective way to guide data specialists through tasks that will ease the challenges of digital transformation. All the objectives of this research assignment are achieved, and the validation indicates that use of the extended framework by a data specialist has the potential to improve the success rate of digital projects at a lower risk of failure. |
|
An evaluation of state-of-the-art approaches to short-term dynamic forecasting | Order volume forecasting (OVF) is a strategic tool used by logistics companies to reduce operating costs and improve service delivery for their clients. It provides business units with the ability to anticipate demand, based on historical data and external factors, so that resources can be deployed effectively to enable the aforementioned improvements. Until recently, statistical models have been the standard for forecasting. However, recent research into the use of state-of-the-art (SOTA) approaches to forecasting has yielded promising results. Most notably, these approaches are able to leverage covariates, which enable models to incorporate auxiliary information such that the predictions are responsive to their respective environments. This is critical for short-term forecasts, which are inherently more stochastic than long-term forecasts. This research paper seeks to compare the use of a statistical forecasting approach to a SOTA approach in the case of short-term order volume forecasting. More specifically, the NBEATS model is developed using various exogenous variables and is compared to the Exponential Smoothing (ETS) model. Both models have been developed to provide forecasts three hours into the future and are evaluated using RMSE and MAE. It was found that NBEATS provided a 36.01% improvement on the RMSE of the ETS model and a 31.6% improvement on the MAE of the ETS model. Additionally, two variations of NBEATS are compared – one trained with covariates and another without – to evaluate the improvement that covariates provide. It was found that providing models with exogenous variables resulted in a 16.15% improvement in RMSE and a 14.74% improvement in MAE. The results of this paper suggest that SOTA approaches provide more consistent and accurate short-term forecasts. |
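The percentage improvements quoted above follow the usual relative-error comparison; a minimal sketch (with made-up numbers) of how such RMSE and MAE improvements are computed:

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2)))

def mae(y, yhat):
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(yhat))))

def pct_improvement(baseline: float, candidate: float) -> float:
    """Relative improvement of a candidate model over a baseline."""
    return 100.0 * (baseline - candidate) / baseline

# Toy example: NBEATS-style forecasts vs an ETS baseline on the same series.
y = [100, 120, 130, 110]
ets_pred, nbeats_pred = [90, 130, 120, 125], [98, 122, 127, 113]
print(pct_improvement(rmse(y, ets_pred), rmse(y, nbeats_pred)))
print(pct_improvement(mae(y, ets_pred), mae(y, nbeats_pred)))
```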
Cross-Camera Vehicle Tracking in an Industrial Plant Using Computer Vision and Deep Learning | One of the key actors in the paper recycling process is buy-back centres. Buy-back centres buy or collect recyclable materials from individuals, formal and informal collection businesses, and institutions. Buy-back centres are important because they divert recyclable material away from landfills, which reduces the leaching of pollutants into the soil and groundwater as well as the generation of harmful gases and chemicals. However, buy-back centres face several threats, of which fraud is one of the most difficult to detect and prevent. Fraud occurs when the amount and/or the grade of the waste paper being sold to the buy-back centre is misrepresented by the sellers in order to earn a greater income. A misrepresentation of the waste paper grade and weight being sold to the buy-back centre influences not only the availability of stock and the volume of sales to the paper mills but also the sustainability of the entire recycling ecosystem in the area. To facilitate the detection of fraud at buy-back centres, a multi-vehicle multi-camera tracking (MVMCT) framework is developed to track the movement of vehicles throughout a paper buy-back centre located in South Africa. The MVMCT framework developed can aid the buy-back centre in estimating the amount of material expected to be collected at a loading bay prior to stocktaking. When there is a large discrepancy between how much material is expected to be collected and how much is present at the loading bay, the buy-back centre can use the MVMCT framework to track and identify suspicious vehicles for further investigation. This research assignment shows that the Faster R-CNN and DeepSORT detector-tracker pair exhibits superior performance in terms of IDF1 scores. Furthermore, this research assignment addresses the vehicle re-identification problem by using a Siamese network to match vehicles across several video sequences and to manage the global ID assignment process. The MVMCT framework developed in this research assignment exhibits an IDF1 score of 0.58, a multi-object tracking accuracy of 0.62, and a multi-object tracking precision of 0.53. Moreover, the MVMCT framework successfully tracks vehicles across all video sequences except for the sequence with a top-down view, and shows a reasonable counting accuracy for counting the number of stationary vehicles at a loading bay. |
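A minimal sketch of the kind of global ID assignment the Siamese re-identification step performs: compare a new track's appearance embedding against a gallery of known global IDs by cosine similarity, and open a new ID when nothing matches. The threshold and embedding size are illustrative assumptions, not the assignment's actual values:

```python
import numpy as np

def assign_global_id(gallery: dict[int, np.ndarray],
                     query: np.ndarray, threshold: float = 0.7) -> int:
    """Match a track embedding (e.g. from a Siamese network) against
    known global IDs; create a new ID if no match is similar enough."""
    query = query / np.linalg.norm(query)
    best_id, best_sim = None, threshold
    for gid, emb in gallery.items():
        sim = float(emb @ query / np.linalg.norm(emb))
        if sim > best_sim:
            best_id, best_sim = gid, sim
    if best_id is None:                     # unseen vehicle: new global ID
        best_id = max(gallery, default=-1) + 1
        gallery[best_id] = query
    return best_id

rng = np.random.default_rng(3)
gallery = {0: rng.normal(size=128)}
print(assign_global_id(gallery, rng.normal(size=128)))
```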
A Bagging Approach to Training Neural Networks using Metaheuristics | Stochastic gradient descent has become the go-to algorithm to train neural networks. As neural network architectures and the datasets used to train them grow larger, so does the computational cost of training. Metaheuristics have successfully been used to train neural networks, and are more robust to noisy objective functions. This research assignment investigates whether metaheuristics, specifically genetic algorithms, differential evolution, evolutionary programming and particle swarm optimisation, can be used to train an artificial neural network with a subsample of the training set. Different bagging training approaches that reduce the amount of training data are put forward, and the performances of the trained neural networks are evaluated. These performances are compared against the performance of a neural network trained with stochastic gradient descent, and against neural networks trained with the metaheuristic algorithms using the entire training dataset. The evaluation compares the validation accuracy and the generalisation factor to detect whether overfitting occurs. The research assignment also answers the question of whether overfitting is reduced when the suggested training methods are used. The results indicate that a sub-sample of the training set can be used per iteration or generation of the metaheuristic algorithm when training a neural network, with similar accuracy and similar or better overfitting performance as when training is performed using the complete training set. The best performance was achieved with a bagging strategy using the same sample size for each class to classify. |
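The core idea, evaluating each generation of the metaheuristic on a freshly drawn bag rather than the full training set, can be sketched as follows. This is a toy PSO on a single-layer network with made-up coefficients, not the assignment's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def nn_error(weights: np.ndarray, Xb: np.ndarray, yb: np.ndarray) -> float:
    """Error of a single-layer sigmoid network (illustrative stand-in)."""
    preds = 1 / (1 + np.exp(-(Xb @ weights[:-1] + weights[-1])))
    return float(np.mean((preds - yb) ** 2))

# Minimal PSO: each iteration evaluates particles on a fresh bootstrap
# "bag" instead of the complete training set.
n_particles, dim, bag_size = 20, X.shape[1] + 1, 100
pos = rng.normal(size=(n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_err = pos.copy(), np.full(n_particles, np.inf)
for it in range(50):
    idx = rng.choice(len(X), size=bag_size, replace=True)   # new bag
    errs = np.array([nn_error(p, X[idx], y[idx]) for p in pos])
    improved = errs < pbest_err
    pbest[improved], pbest_err[improved] = pos[improved], errs[improved]
    gbest = pbest[pbest_err.argmin()]
    r1, r2 = rng.uniform(size=(2, n_particles, dim))
    vel = 0.7 * vel + 1.4 * r1 * (pbest - pos) + 1.4 * r2 * (gbest - pos)
    pos = pos + vel
print("best bagged-training error:", pbest_err.min())
```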
|
Link prediction of clients and merchants in a rewards program using graph neural networks | Rewards programs have become an offering for businesses to increase client engagement, nurture long-term relationships and maintain client retention. A host company is an intermediary network provider that connects entities within a rewards program. Identifying future relationships between entities is formulated as a link prediction task. The network is represented as a graph of interconnected entities. Graphs are complex high-dimensional structures, dynamic in shape and size. A research field called graph neural networks (GNNs) has gained traction to handle the challenges posed by graph properties. A real-world scenario has been instantiated to apply a GNN technique to a link prediction task. The investigation aims to identify potential relationships between clients and merchants in a rewards program offered at a bank. A framework design is created for the model architecture: a GNN encoder and an MLP decoder. A GNN variation called GraphSAGE is selected as the encoder. GraphSAGE is an inductive framework, able to generalise to unseen data and leverage node attributes. A sensitivity analysis indicates that the model is sensitive to the dropout and learning rate hyperparameters. Limited attributes and connections are present, which accounts for this sensitivity. The model is fitted to the optimal architecture and tested on unseen data. The model performance resulted in an area under the receiver operating characteristic curve (ROC AUC) of 0.65. Although acceptable, a higher ROC AUC value is desirable. Another evaluation metric highlighted an area that requires further improvement: the precision versus recall results emphasised the effects of the sparse network, with most of the correct predictions made for the negative class. Although a weighted loss strategy mitigated some of these drawbacks, it could not overcome the challenges. The encoder outputs embeddings which are visualised for interpretation. The embedding illustrations reveal similarities in the representations of both clients and merchants. The embeddings identified two distinct merchant groups, while the client embeddings showed clusters of clients which are best represented in a non-Euclidean dimension space. An entity characteristic prediction analysis is done to gain insight into the distribution of the client and merchant features; note that the purpose is not to validate which features the GNN learnt from. Among the correct positive class predictions, female clients account for 99% of the predictions, half of the correct links are associated with a rewards program client, and the Homeware and Decor Store merchant service type accounts for 100% of the correct positive predictions. Implications of the data quality issues are also emphasised. Overall, the GNN demonstrates that it can learn representations in a rewards program network of clients and merchants. The network topology and relations among the clients and merchants are well detected, and the GNN is capable of predicting the existence of links between the entities. Opportunities are identified to further enrich the graph, and improvements are proposed. The investigation provides a positive contribution to the financial industry, rewards programs and GNNs as an emerging research field. |
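A minimal sketch of the encoder/decoder design described above, using PyTorch Geometric's `SAGEConv`; the dimensions, depth, and toy graph are assumptions, not the assignment's actual architecture:

```python
import torch
from torch import nn
from torch_geometric.nn import SAGEConv

class LinkPredictor(nn.Module):
    """GraphSAGE encoder with an MLP decoder over node-pair embeddings."""
    def __init__(self, in_dim: int, hid_dim: int = 64):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hid_dim)
        self.conv2 = SAGEConv(hid_dim, hid_dim)
        self.decoder = nn.Sequential(
            nn.Linear(2 * hid_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, 1))

    def forward(self, x, edge_index, pairs):
        h = torch.relu(self.conv1(x, edge_index))
        h = self.conv2(h, edge_index)
        # Concatenate the two node embeddings of each candidate link.
        z = torch.cat([h[pairs[0]], h[pairs[1]]], dim=1)
        return self.decoder(z).squeeze(-1)   # one logit per candidate link

# Toy graph: 6 nodes with 16 features, a few edges, 2 candidate links.
x = torch.randn(6, 16)
edge_index = torch.tensor([[0, 1, 2, 3], [3, 4, 5, 0]])
pairs = torch.tensor([[0, 2], [4, 5]])
print(torch.sigmoid(LinkPredictor(16)(x, edge_index, pairs)))
```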
Evaluating active learning strategies to reduce the number of labelled medical images required to train a CNN classifier | CNNs have proven to provide human-comparable performance in the field of computer vision; however, one basic limitation of ANNs is that they rely largely on large volumes of labelled data, and manually labelling data is a costly and time-consuming task. This study investigates how varied sizes of the initially labelled set of medical images affect the effectiveness of CNN-based active learning. Active learning is a framework in which the data to be labelled by human annotators are not selected randomly, but rather selected in such a fashion that the amount of data required to train a machine learning model is reduced. Two CNN architectures were chosen to run the experiment using a well-known chest x-ray pneumonia dataset from the Kaggle repository, and uncertainty-based active learning was used to measure the informativeness of the data. Eight simulations were run on varying sizes of initially labelled training images. The simulations demonstrate how active learning can reduce the cost and time required for image labelling. The performance of the two CNN architectures was assessed using the AUC metric, while fewer labelled images were required. In conclusion, the use of DenseNet-121 with least confidence sampling reduced the number of labelled images by 39% compared to the random sampling technique used as the baseline. |
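Least confidence sampling, the strategy that performed best above, reduces to ranking the unlabelled pool by the top softmax probability; a small sketch:

```python
import numpy as np

def least_confidence_batch(probs: np.ndarray, k: int) -> np.ndarray:
    """Select the k unlabelled images whose top predicted class
    probability is lowest, i.e. where the CNN is least confident."""
    confidence = probs.max(axis=1)     # top softmax probability per image
    return np.argsort(confidence)[:k]  # indices to send for labelling

# Toy pool of 5 unlabelled images with softmax outputs over 2 classes.
probs = np.array([[0.95, 0.05], [0.55, 0.45], [0.70, 0.30],
                  [0.51, 0.49], [0.85, 0.15]])
print(least_confidence_batch(probs, k=2))   # -> indices 3 and 1
```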
A Dynamic Optimization Approach to Active Learning in Neural Networks | Artificial neural networks are popular predictive models which have a broad range of applications. Artificial neural networks have been of great interest in the field of machine learning, and as a result, they have received large research efforts to improve their predictive performance. Active learning is a strategy that aims to improve the performance of artificial neural networks through an active selection of training instances. The motivation for the research assignment is to determine if there is an improvement in predictive performance when a model is trained only on instances that the model deems informative. Through the continuous selection of informative training sets, the training times of these networks can also be reduced. The training process of artificial neural networks can be seen as an optimisation problem that uses a learning algorithm to determine an optimal set of network parameters. Backpropagation is a popular learning algorithm which computes the derivatives of the loss function and uses the gradient descent algorithm to make appropriate parameter updates. Metaheuristic optimisation algorithms, such as particle swarm optimisation, have been shown to be efficient as neural network training algorithms. The training process is assumed to be static under fixed set learning, a process in which the model randomly samples instances from a training set that remains fixed during the training process. However, under an active training strategy, the training set continuously changes and should therefore be modelled as a dynamic optimisation problem. This study investigates if the performance of active learners can be improved if dynamic metaheuristics are used as learning algorithms. Different training strategies were implemented in the investigation, which include a sensitivity analysis selective learning algorithm and the accelerated learning by active sample selection algorithm. The analysis utilised different learning algorithms, which included backpropagation, static particle swarm optimisation, and dynamic variations of the particle swarm optimisation algorithm. These training strategies were applied to seven benchmark classification datasets obtained from the UCI repository. Improved performance in the generalisation factor is produced for three of the seven classification problems in which a dynamic metaheuristic is used in an active learning setting. Although these improvements are observed, generally all training configurations achieved similar performance. The conclusion drawn from the study was that it is not definitive that dynamic metaheuristics improve the performance of active learners, because performance improvements are not consistent across all classification problems and evaluation metrics. |
|
Rule Extraction from Financial Time Series | The ability to predict future events is very important in scientific fields. Data mining tools extract relationships among features and feature values, and how these relationships map to the target concept. The main goal is to extract knowledge and understand trends. The resulting rule set can then be used for prediction purposes. For many real-world applications, the actual values of a time series are irrelevant. The shape of the time series can also be used to predict future events. Unfortunately, most research efforts related to this area have had limited success. Rule induction and rule extraction techniques are often unsuccessful for real-valued time series analysis due to the lack of systematic effort to find relevant trends in the data. Rule induction and rule extraction methods are applied to data describing trends in financial time series data. The purpose of this study is to explore the benefits of rule extraction and rule induction, specifically on financial time series. A review of rule extraction and rule induction approaches is conducted as a first step. Thereafter, a rule extraction and rule induction framework is developed and evaluated. The most important finding of this study was the importance of balanced data: performance was significantly better when excessive class distributions were minimised, while the difference in predictive performance between the different rule extraction and rule induction algorithms was not statistically significant. |
Binning Continuous-Valued Features using Meta-Heuristics | The success of any machine learning model implementation is heavily dependent on the quality of the input data. Discretization, which is a widely used data preprocessing step, partitions continuous-valued features into bins, which transforms the data into discrete-valued features. Not only does discretization improve the interpretability of a data set, but it also provides the opportunity to implement machine learning models which require discrete input data. This report proposes a new discretization algorithm that partitions multivariate classification problems into bins through the use of swarm intelligence. The particle swarm optimization algorithm is utilized to find the bin boundary values of each continuous-valued feature which lead to the optimal classification performance of classification models. The classification accuracy of the naïve Bayes classifier, the C4.5 decision tree classifier and the one-rule classifier, resulting from the implementation of the discretizers, is used as the evaluation measure in this report. The performance of the proposed method is compared with equal width binning, equal frequency binning and the evolutionary cut-points selection for discretization algorithm, on different data sets that have mixed data types. The proposed discretizer is outperformed by the evolutionary cut-points selection for discretization algorithm when paired with the C4.5 decision tree classifier. Similarly, the equal width binning discretizer outperforms the proposed discretizer when paired with the C4.5 decision tree. |
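The wrapper-style evaluation described above can be sketched as follows: a candidate particle encodes cut points per feature, and its fitness is the accuracy of a classifier trained on the digitised data. The use of scikit-learn's decision tree in place of C4.5, and the random candidate, are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

def binning_fitness(cuts_per_feature: list) -> float:
    """Fitness of one candidate particle: classifier accuracy after
    discretising each feature at the particle's cut points."""
    Xd = np.column_stack([np.digitize(X[:, j], np.sort(c))
                          for j, c in enumerate(cuts_per_feature)])
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           Xd, y, cv=5).mean()

# One candidate solution: three random cut points per feature.
rng = np.random.default_rng(5)
cuts = [rng.uniform(X[:, j].min(), X[:, j].max(), size=3)
        for j in range(X.shape[1])]
print(f"candidate fitness: {binning_fitness(cuts):.3f}")
```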
|
A Genetic Algorithm Approach to Tree Bucking using Mechanical Harvester Data | Crosscutting of trees into timber logs is known as bucking. The logs are mainly used for producing saw logs at a mill. The logs have different values based on the length of the log and the small end diameter of the log. Maximisation of the value of the logs bucked from a tree can be viewed as an optimisation problem. This problem has been researched in the literature, with most solutions using dynamic programming. This research assignment solves the problem using a metaheuristic approach, specifically a genetic algorithm. The main research question is whether an existing bucking, on a series of stands in a forest, could have been done more optimally. The dataset used to solve the problem comes from the bucking outputs of two mechanical harvesters. The value of a log is the product of its volume and the value per cubic metre of the log class to which it belongs, and the value of a tree is the sum of the values of its logs. It was found that the genetic algorithm outperformed the existing bucking in terms of value. The research method firstly solved the problem for a randomly selected set of trees with dynamic programming, comparing it to the solutions obtained from the genetic algorithm. It was found that the genetic algorithm obtained bucking values very close to the optimal values for the trees. Secondly, a genetic algorithm uses hyperparameters, namely population size, probability of crossover and probability of mutation. The hyperparameters were estimated using a particle swarm optimisation algorithm wrapped around the genetic algorithm, applied to a randomly selected set of trees. The hyperparameters found were used to optimise the total value of each of the five stands. The total value of the optimised stands outperformed the value of the existing bucking by a large margin. |
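The value computation used as the GA objective can be written down directly; the log classes and prices below are hypothetical:

```python
# Value per cubic metre for each log class (hypothetical figures).
LOG_CLASS_PRICE = {"A": 900.0, "B": 650.0, "C": 400.0}

def log_value(volume_m3: float, log_class: str) -> float:
    """Value of a log: its volume multiplied by its class price."""
    return volume_m3 * LOG_CLASS_PRICE[log_class]

def tree_value(logs: list) -> float:
    """Value of a tree: the sum of the values of its bucked logs.
    This is the quantity the genetic algorithm maximises per tree."""
    return sum(log_value(v, c) for v, c in logs)

# One candidate bucking pattern (a GA individual) for a single tree.
pattern = [(0.42, "A"), (0.35, "B"), (0.18, "C")]
print(f"tree value: {tree_value(pattern):.2f}")
```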
Crop recommendation system for precision farming: Malawi use case | Machine learning (ML) has received attention from the global audience, with adoption and rapid scaling being reported across multiple industrial sectors, including agriculture, for application in the automation and optimisation of processes. The advent of new farming concepts like precision farming (PF) has introduced the use of ML-powered decision support systems (DSS). These systems assist farmers in making decisions by providing data-driven recommendations that boost farming productivity and sustainability. Despite being widely developed in many parts of the world, these technologies have not yet been adopted in the sub-Saharan region, particularly in Malawi, where infrastructure and government policy have been barriers. However, changes in policymaking and the introduction of data centres have drawn agricultural stakeholders who are pushing for the development of ICT-based technologies. The desired innovations are to support farmers in making data-guided decisions for climate change mitigation, increased productivity, and environmental sustainability. The goal of this project was to create a crop recommendation system that makes use of an ML model to forecast the best crop for farmland based on its physical, chemical, and meteorological parameters. Firstly, unlabelled data for the central region of Malawi was collected from the Department of Land and the Department of Climate Change and Meteorological Services. The data were merged, cleaned, and formatted using three methods: label encoding of categorical features; label encoding of categorical features and normalisation; and label encoding of ordinal features, one-hot encoding of nominal features, normalisation, and principal component analysis (PCA) dimensionality reduction. A K-means clustering data preprocessing step was applied, and five centroids were extracted, analysed by an expert agronomist, and labelled as conducive for maize, cassava, rice, beans, and sugarcane crops, respectively. Then, ten classifier algorithms, namely Logistic Regression (LRC), K-Nearest Neighbours (KNC), Support Vector Machine (SVC), Multilayer Perceptron (MLPC), Decision Tree (DTC), Random Forest (RFC), Gradient Boosting (GBC), Adaptive Boosting (ABC), eXtreme Gradient Boosting (XGBC), and Multinomial Naïve Bayes (MNBC), were trained on the three kinds of formatted datasets. A 5-fold cross-validation (CV) technique was used to assess the performance of the models on the three formatted datasets, evaluated based on the F1 score and accuracy metrics. Lastly, the models were scored based on the CV's average F1 and accuracy scores, the model's structural complexity, and training times. Formatting technique 1 resulted in poor performance across models that use Euclidean distance measures, and formatting technique 3 was the most conducive for all the models except for ABC and MNBC. On formatting 3, the KNC outperformed the other models with F1 and accuracy scores of 99%, a fast training speed, and a simple model structure. The KNC was later integrated into a test web application as its proposed method of deployment. The proof-of-concept model shows reliable results but requires further development for real-time implementation. |
Financial Time Series Modelling using Gramian Angular Summation Fields | Gramian angular summation fields (GASF) and Markov transition fields (MTF) have been developed as an approach to encode time series into images, which allows the use of techniques from computer vision for time series classification and imputation. These techniques have been evaluated on a number of different time series problems. This research assignment applies GASF and MTF to financial time series. As a first step, a suitable financial time series is collected from a real-world system and analyzed. The data quality is determined to identify data quality issues to be addressed. The cleaned financial time series is encoded into images and validated using an appropriate technique to determine if a logical mapping between the time series and image planes exists. The financial time series is analyzed to determine its characteristics, which are used to guide the formulation of a modeling problem. The modeling problem compares the usefulness of the GASF and MTF approaches against conventional time series modeling and analysis techniques. The four models considered for the formulated modeling problem consist of time series and image modeling approaches. The results from the experiment indicate that the time series approaches are better suited to this specific modeling problem. The GASF and MTF approaches do, however, provide promising outcomes when used in a combinatorial fashion. The usage of a combination of GASF and MTF images does allow a model to learn better features when combined with sequence-based approaches, which improves model performance. |
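The GASF and MTF encodings are available in the pyts library; below is a minimal sketch of encoding a toy price series and stacking the two images channel-wise, in the spirit of the combinatorial usage described above (the `image_size` and `n_bins` values are illustrative choices, not the assignment's):

```python
import numpy as np
from pyts.image import GramianAngularField, MarkovTransitionField

# Toy "price" series reshaped to (n_samples, n_timestamps) as pyts expects.
prices = np.cumsum(np.random.default_rng(6).normal(size=128)).reshape(1, -1)

gasf = GramianAngularField(image_size=32, method="summation")
mtf = MarkovTransitionField(image_size=32, n_bins=8)

gasf_img = gasf.fit_transform(prices)[0]   # (32, 32) GASF image
mtf_img = mtf.fit_transform(prices)[0]     # (32, 32) MTF image

# Stacking both encodings channel-wise is one way to combine them
# before feeding an image model.
combined = np.stack([gasf_img, mtf_img], axis=-1)
print(combined.shape)                      # (32, 32, 2)
```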
|
Machine Learning-based Nitrogen Fertilizer Guidelines for Canola in Conservation Agriculture Systems | Soil degradation is a major problem that South African agriculture faces, and policy-makers pay special attention to it. South Africa has especially looked at the negative effects of applying poor land practices in the agriculture sector. This research assignment attempts to use machine learning (ML) algorithms to predict the amount of nitrogen (N) to add to canola to achieve an approximately optimal yield. This is displayed in the form of a table, known as the fertiliser recommendation system, which can be used by a farmer to achieve the desired yield. The ML algorithms used in this assignment include: random forest regressor, extra trees regressor, artificial neural network, deep neural network, k-nearest neighbour, multiple linear regression and multivariate adaptive regression splines. The primary objective of precision agriculture (PA) is to increase agricultural productivity and quality while lowering costs and emissions. Furthermore, early detection and mitigation of crop yield limiting factors may contribute to increased output and profit, and yield prediction is critical for a number of crop management and economic decisions discussed in this research assignment. The random forest regressor proved to be the most accurate in forecasting yield. The resulting random forest regressor model demonstrated that machine learning could potentially forecast canola production given some characteristics. These characteristics include, but are not limited to, average rainfall, year of planting, the amount of N remaining in the soil from the previous harvest, and the rainfall for each month from the planting date to the harvest date. |
The use of historical tracking data to estimate or predict vehicle travel speeds | York Timbers is an integrated forestry company that grows and manufactures lumber and plywood products. The plantations owned and maintained by York Timbers contain an expansive road network consisting of 26 661 road segments with a total length of approximately 10 000 km. In order to optimize the delivery of timber from the plantations to the mill sites, the travel speed of each road segment must be estimated. To estimate the speed of each road segment in the road network, global positioning system (GPS) measurements are first matched to the self-owned road network. Map-matching is done in two parts. First, the GPS measurements are assigned to the closest road segment based on Euclidean distance. Next, the connectivity of road segments is analyzed to fix any errors introduced during map-matching. The average travel speed is then calculated for each road segment using the matched GPS measurements. The majority of the road segments, however, do not have GPS measurements associated with them. To estimate the travel speed of road segments without GPS measurements, five different predictive models are developed. The best performance is obtained using a regression tree, which achieves a mean absolute error of 10.02 km/h on data not used to train the model. To improve the speed estimation accuracy, further refinement of the speed estimation model and speed prediction model is required. Increasing the number of GPS measurements used in the estimation and prediction of travel speed will improve the model performance. Including other data that influences safe travel speed, such as weather data, will further improve the model performance. Identifying dangerous portions of the road network is also suggested before a model is implemented. |
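A minimal sketch of the first map-matching pass and the per-segment speed averaging described above, using distance to segment midpoints as a simplification of true point-to-segment distance:

```python
import numpy as np

def match_and_average(points: np.ndarray, speeds: np.ndarray,
                      segment_midpoints: np.ndarray) -> dict:
    """Assign each GPS point to the nearest road segment (here by
    distance to a segment midpoint, a simplification), then average
    the observed speeds per segment."""
    dists = np.linalg.norm(points[:, None, :] - segment_midpoints[None], axis=2)
    nearest = dists.argmin(axis=1)
    return {int(s): float(speeds[nearest == s].mean())
            for s in np.unique(nearest)}

rng = np.random.default_rng(7)
points = rng.uniform(0, 10, size=(200, 2))    # GPS fixes (x, y)
speeds = rng.uniform(20, 60, size=200)        # observed km/h at each fix
segments = rng.uniform(0, 10, size=(15, 2))   # segment midpoints
print(match_and_average(points, speeds, segments))
```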
A Review and Analysis of Imputation Approaches | Missing data is a common and major challenge which almost all data practitioners and researchers face, and which greatly affects the accuracy of any decision making process. Data mining and data preparation require that the data is prepared, cleaned, transformed, and reduced in order to ensure that the integrity of the dataset has been maintained. Missing data is found and addressed within the data cleaning process, during which the user needs to decide on how to handle the missing data so as to not introduce significant bias into the dataset. Current methods of handling missing data include deletion and imputation methods. This research assignment investigates the performance of different imputation methods, specifically statistical and machine learning imputation methods. The statistical imputation methods investigated are mean, hot deck, regression, maximum likelihood, Markov chain Monte Carlo (MCMC), multiple imputation by chained equations, and expectation-maximization with bootstrapping imputation. The machine learning methods investigated are k-nearest neighbor (kNN), k-means, and self-organizing maps imputation. This research paper uses an empirical procedure to facilitate the formatting and transformation of the data, and the implementation of the imputation methods. Two experiments are conducted in this research: one in which the imputation methods are evaluated against datasets which are clean, and another in which the imputation methods are evaluated against datasets which contain outliers. The performance achieved in both experiments is evaluated using the root mean squared error, mean absolute error, percent bias, and predictive accuracy. For both experiments, it is found that MCMC imputation resulted in the best performance out of all 10 imputation methods, with an overall accuracy of 75.71%. kNN imputation resulted in the second highest accuracy, with an overall accuracy of 69.85%, but introduced a large percent bias into the imputed dataset. This research concludes that single statistical imputation methods (mean, hot deck, and regression imputation) should not be used to replace missing data in any situation, while multiple imputation methods are shown to have a consistent performance. MCMC imputation in particular performs the best out of all 10 imputation methods in this research, producing a high accuracy and low bias in the imputed dataset. The performance of MCMC imputation, along with its ease-of-use, makes the imputation method a suitable choice when dealing with missing data. |
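The evaluation protocol, masking known values, imputing, and scoring the error on the masked entries only, can be sketched with scikit-learn imputers standing in for the methods studied (mean and kNN here; MCMC and the other multiple-imputation methods are not part of scikit-learn):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(8)
X_true = rng.normal(size=(200, 5))

# Remove 10% of entries completely at random.
mask = rng.uniform(size=X_true.shape) < 0.10
X_missing = X_true.copy()
X_missing[mask] = np.nan

def imputation_rmse(imputer) -> float:
    """RMSE between imputed and true values on the masked entries only."""
    X_imp = imputer.fit_transform(X_missing)
    return float(np.sqrt(np.mean((X_imp[mask] - X_true[mask]) ** 2)))

print("mean:", imputation_rmse(SimpleImputer(strategy="mean")))
print("kNN: ", imputation_rmse(KNNImputer(n_neighbors=5)))
```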
|
Crawler Detection Decision Support: A Neural Network with Particle Swarm Optimisation Approach | Website crawlers are popularly used to retrieve information for search engines. The concept of website crawlers was first introduced in the early nineties. Website crawling entails the deployment of automated crawling algorithms that crawl websites with the purpose of collecting and storing information about the state of other websites. Website crawlers are categorized as good website crawlers or bad website crawlers. Good website crawlers are used by search engines and do not cause harm when crawling websites. Bad website crawlers crawl websites with malicious intent and could potentially cause harm to websites or website owners. Traffic indicators on websites are inflated if website crawlers are incorrectly identified. In some cases, bad crawlers are used to intentionally crash websites. The consequences of bad website crawlers highlight the importance of successfully distinguishing human users from website crawler sessions in website traffic classification. The focus of this research assignment is to design and implement artificial neural network algorithms capable of successfully classifying website traffic as a human user, good website crawler session, or bad website crawler session. The artificial neural network algorithms are trained with particle swarm optimizers and are validated in case studies. First, the website traffic classification problem is considered in a stationary environment and is treated as a standard classification problem. For the standard classification problem, an artificial neural network with particle swarm optimization is applied. The constraints associated with this initial problem assume that the behavioural characteristics of humans and the behavioural characteristics of web crawlers remain constant over a period of time. Thereafter, the classification problem is considered in a non-stationary environment. The dynamic classification problem exhibits concept drift due to the assumption that website crawlers change behavioural characteristics over time. To solve the dynamic classification problem, artificial neural networks are formulated and optimized with quantum-inspired particle swarm optimisation. Results demonstrate the ability of the artificial neural networks optimised with particle swarms to classify website traffic in both stationary and non-stationary environments successfully to a reasonable extent. |
|
A comparative study of different single-objective metaheuristics for hyper-parameter optimisation of machine learning algorithms | Over the past three decades machine learning has evolved from a research curiosity to a practical technology that enjoys widespread commercial success. In the continuous quest to gain a competitive advantage and thereby market share, companies are highly incentivised to adopt technologies that reduce costs and/or increase productivity. Machine learning has proved to be one of these technologies. A significant trend in the contemporary machine learning landscape has been the rise of deep learning, which has experienced tremendous growth in its popularity and usefulness, predominantly driven by larger data sets, increases in computational power and more efficient training procedures. The recent interest in deep learning (along with automated machine learning frameworks), both having many hyper-parameters and large computational expenditure, has prompted a resurgence of research on hyper-parameter optimisation. Stochastic gradient descent and other derivative-based optimisation methods are seldom used for hyper-parameter optimisation, because derivatives of the objective function with respect to the hyper-parameters are generally not available. The objective function for hyper-parameter optimisation is therefore considered to be a black-box function. Conventionally, hyper-parameter optimisation is performed manually by a domain expert to keep the number of trials at a minimum; however, with modern compute clusters and graphics processing units it is possible to run more trials, in which case algorithmic approaches are favoured. The process of finding a high-quality set of hyper-parameter values for a machine learning algorithm is often time-consuming and compute-intensive, therefore efficiency is considered one of the most important metrics by which to evaluate the effectiveness of a hyper-parameter optimisation technique. Popular algorithmic methods for hyper-parameter optimisation include grid search, random search, and more recently Bayesian optimisation. Metaheuristics, defined as high-level problem-independent frameworks that serve as guidelines for the design of underlying heuristics to solve a specific problem, are investigated as an alternative to traditional hyper-parameter optimisation techniques. Genetic algorithms, particle swarm optimisation, and estimation of distribution algorithms were selected to represent metaheuristic algorithms. To compare traditional and metaheuristic hyper-parameter optimisation algorithms on the basis of efficiency, a test suite comprising various data sets and machine learning algorithms is constructed. The machine learning algorithms considered in this research assignment are support vector machines, multi-layer perceptrons, and convolutional neural networks. The efficiency of hyper-parameter optimisation algorithms is compared using independent case studies, where the hyper-parameters of a different machine learning algorithm are optimised in each case. Friedman omnibus tests are employed to determine whether a difference in average rank exists between the outcomes obtained using the respective hyper-parameter optimisation techniques. Upon rejection of the null hypothesis of the Friedman test, Nemenyi post hoc tests are performed to identify pairwise differences between hyper-parameter optimisation techniques. Other fitting metrics of solution quality, such as computational expenditure, are also investigated. |
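The Friedman-then-Nemenyi procedure described above can be sketched with scipy and the scikit-posthocs package; the scores below are made up, with rows as case studies and columns as HPO techniques:

```python
import numpy as np
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare

# Efficiency scores of three HPO techniques over six case studies
# (rows = case studies, columns = techniques); numbers are illustrative.
scores = np.array([[0.81, 0.84, 0.86],
                   [0.75, 0.79, 0.80],
                   [0.90, 0.91, 0.94],
                   [0.62, 0.66, 0.65],
                   [0.88, 0.87, 0.92],
                   [0.70, 0.74, 0.78]])

stat, p = friedmanchisquare(scores[:, 0], scores[:, 1], scores[:, 2])
print(f"Friedman: chi2={stat:.3f}, p={p:.4f}")
if p < 0.05:
    # Pairwise Nemenyi post hoc test on the same blocked data.
    print(sp.posthoc_nemenyi_friedman(scores))
```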
Predicting employee burnout using machine learning techniques | While artificial intelligence techniques and methods, and the subsequent possibilities of using these to solve business problems, are well understood in some industries, including life insurance and banking, applying them to the domain of human capital management has been met with varying levels of success and value. Models that assist in recruitment activities or predict employee attrition have been successfully implemented by many organisations. However, there are also many pitfalls to guard against, including managing inherent bias in the data used as well as how the output of such models is used, often leading to ethical concerns. In this research assignment, multiple classification models and machine learning algorithms are applied to the problem of identifying employees at risk of burnout, with the aim of producing outputs that can be used to ethically and proactively guide wellbeing-related interventions across the business. The results show that none of the approaches were successful in accurately meeting this objective, with an artificial neural network approach assessed as the most accurate of all the models implemented. By evaluating each classification model’s performance, it was found that none of the implemented approaches were more than 50% accurate. |
Title | Abstract |
---|---|
Comparison of Machine Learning Models for the Classification of Fluorescent Microscopy Images | The lasting health consequences of a COVID-19 infection, referred to as Long COVID, can be severe and debilitating for the individual afflicted. Symptoms of Long COVID include fatigue and brain fog. These symptoms are caused by microclots that form in the bloodstream and are not broken up by the body. Microclots in the bloodstream can entangle with other proteins and can limit oxygen exchange. This inhibition of the oxygen exchange process can cause most of the symptoms experienced with Long COVID. Diagnosis and identification of individuals suffering from Long COVID is the first step in any process that aims to alleviate the symptoms of the individual, or cure them. Current identification processes are manual and as such limited by the amount of manpower applied to the task. Automating parts of the process with machine learning can greatly speed up this process and allow more efficient use of manpower. The purpose of this research assignment is to investigate whether or not machine learning algorithms can be used to classify fluorescent microscopy images as being indicative of Long COVID or not. This is done by training models on, and predicting from, features extracted from fluorescent microscopy images using computer vision techniques. A comparison between the performance of the machine learning algorithms used in this research assignment is also explored. It was found that logistic regression is a good choice of classifier, with strong performance in the classification of both the positive and negative classes. |
Anomaly Detection in Support of Predictive Maintenance of Coal Mills using Supervised Machine Learning Techniques | Since the beginning of time, people have been dependent on technology. With each industrial revolution, people became more reliant on machines and, in parallel, on the need to maintain them. The goal of any maintenance organisation is always the same: to maximise asset availability. Massive strides in technology have paved the way for Industry 4.0, where the focus starts to shift from preventive maintenance to predictive maintenance. Predictive maintenance does not follow a schedule like preventive maintenance; instead, maintenance is performed when it is necessary, not too early or too late. This research assignment identifies an area of research where a study is performed in support of the predictive maintenance of coal mills through supervised machine learning. The assignment uses the coal mill data from a case study company to identify data quality issues, address these issues, prepare the data for machine learning and, finally, build a machine learning model which aims to predict when failure is most likely to occur. The assignment evaluates the feasibility of building a supervised machine learning model using the given data and methodology, draws conclusions about the findings and identifies opportunities for future research. |
Comparison of unsupervised machine learning models for identification of financial time series regimes and regime changes | Financial stock data has been studied extensively over many years with the objective of generating the best possible return on an investment. It is known that financial markets move through periods where securities are increasing in value (bull markets) and periods where these securities decrease in value (bear markets). These periods that exhibit similarities over different time frames are often referred to as regimes, which are not necessarily limited to bull and bear regimes, but include any sequences of data that experience correlated trends. Regime extraction and detection of regime shift changes in financial time series data can be of great value to an investor. An understanding of when these financial regimes will change, and what type of regime the financial market is tending towards, can help improve investment decisions and strengthen financial portfolios. This research deals with reviewing and comparing the viability of different regime shift detection algorithms when applied to multivariate financial time series data. The selected algorithms are applied to different stocks from the Johannesburg Stock Exchange (JSE), where the algorithms’ performances are compared with respect to regime shift detection accuracy and the profitability of regimes in selected investment strategies. |
Detection of chronic kidney disease using machine learning algorithms | Chronic Kidney Disease (CKD) is a significant public health concern worldwide that affects one in every ten people globally. CKD results from a poorly functioning kidney that fails at basic functionalities, including removing toxins, waste, and extra fluids from the blood. The build-up of this problematic material in the body can cause complications such as hypertension, anaemia, weak bones, and nerve damage. CKD often occurs in individuals that suffer from additional chronic illnesses such as diabetes, heart disease, and hypertension, in addition to the existence of unfavourable health habits and practices that lead to the kidney’s deplorable state. The presence of additional illnesses that occur in tandem with CKD hinders its successful and early detection. The onset of CKD can be clinically detected using laboratory tests focusing on specific standard parameters such as the glomerular filtration rate (GFR) and the albumin-creatinine ratio. Kidney damage occurs in stages, with each subsequent stage indicating a severe reduction in the glomerular filtration rate. The GFR parameter is considered a facet of the indication of renal failure and the final stage of chronic kidney disease. It is therefore imperative to use early detection methods to assist in the early administration of treatment to alleviate the symptoms of the disease and combat its progression. Early-stage diagnosis involves medications, diet adjustments and invasive procedures. In developing countries, especially in Africa, the prevalence of CKD is estimated at 3 to 4 times that of developed countries in Europe, America and Asia. The current dialysis treatment rate in South Africa stands at approximately 70 per-million population (pmp), and the transplant rate stands at approximately 9.2 pmp. The recorded prevalence rate mainly considers individuals with access to private health care options through affordability or medical insurance; however, most South Africans (approximately 84%) depend on the under-resourced, government-funded public health systems. The disparity in treatment affordability among South Africans of different economic classes introduces a two-tiered health system that affects access to quality treatments. Early detection and diagnosis using machine learning algorithms is therefore an important endeavour in the field of CKD and the other chronic illnesses plaguing the nation. Machine learning applications in the health care sector aim to revolutionise the early detection and treatment of chronic illness for the greater global population. Since early detection and management are vital in preventing disease progression and reducing the risk of complications, some machine learning (ML) models have been developed to detect CKD. The primary purpose of this study is to review, develop and recommend various machine learning classification models for the efficient detection of chronic kidney disease using three datasets. These datasets include two UCI Machine Learning Repository datasets, Chronic Kidney Disease and Risk Factor Prediction of Chronic Kidney Disease, and the PLOS ONE dataset Chronic kidney disease in patients at high risk of cardiovascular disease in the United Arab Emirates: A population-based study. The final aim is to construct a high-performing ML model that has effectively and accurately learned the hidden correlations in the symptoms exhibited by CKD patients. |
Feature engineering approaches for financial time series forecasting using machine learning | This research assignment investigates feature engineering methods for financial time series forecasting using machine learning. The goal of the work is to investigate methods that overcome some of the time series characteristics which make forecasting difficult, namely noise and non-stationarity. A literature review is conducted to identify suitable feature engineering methods and machine learning approaches for financial time series forecasting. A case study is developed to test the identified feature engineering methods with an empirical machine learning process, and multiple machine learning models are tested. To understand the benefit of the feature engineering methods, the forecasting results are compared with and without the application of the feature engineering methods. Several feature engineering methods are identified: differencing and log-transforms are two methods investigated to address non-stationarity, while moving averages, exponentially weighted moving averages, and Fourier and wavelet transforms are all methods investigated to reduce noise. The feature engineering methods are implemented as preprocessing steps prior to training machine learning models for a supervised learning problem. The supervised learning problem is to forecast the asset price a single day ahead, given ten days of previous prices. Four machine learning models commonly used for financial time series forecasting are investigated, namely linear regression, support vector regression (SVR), multilayer perceptron (MLP), and long short-term memory (LSTM) neural networks. The work investigates the feature engineering methods and machine learning models on four univariate time series signals. The results of the investigation found that no feature engineering method is universally helpful in improving forecasting results. For the SVR, MLP and LSTM models, denoising or smoothing the signals did improve their performance, but the best denoising or smoothing technique varies depending on the dataset used. Differencing and log-transforms caused the models to forecast a constant value near the mean of the expected daily price returns, which, when inverted back to the price domain, caused poor regression evaluation metrics but good directional accuracy. The findings of this research assignment are that the investigated feature engineering methods may improve forecasting performance for financial time series, but that the gains are not large. There appears to be limited improvement to be gained from feature engineering past price data to predict future prices, at least for the investigated feature engineering methods. It is therefore recommended that future work focus on finding alternative data sources with predictive power for the financial time series. |
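The stationarity and denoising transforms listed above are one-liners in pandas; a minimal sketch, including the ten-lag supervised framing, with window lengths as illustrative choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
price = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))

features = pd.DataFrame({
    "log_return": np.log(price).diff(),   # log-transform + differencing
    "diff": price.diff(),                 # plain differencing
    "sma_10": price.rolling(10).mean(),   # moving-average smoothing
    "ewma_10": price.ewm(span=10).mean(), # exponentially weighted MA
}).dropna()

# Supervised framing as above: ten lagged prices -> next-day price.
lags = pd.concat({f"lag_{k}": price.shift(k) for k in range(1, 11)}, axis=1)
target = price.shift(-1)
print(features.head(), lags.dropna().shape, target.dropna().shape)
```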
Forecasting armed conflict using long short-term memory recurrent neural networks | Various recent studies, arriving amid the big-data revolution, have shown an optimistic future for social conflict forecasting through more data-driven approaches. Conflict forecasting models can be used to reduce the severity of events or to intervene to prevent these events from materialising or escalating. As such, these predictive models are of interest to numerous institutions and organisations, such as governments, non-governmental organisations, humanitarian agencies, and even insurance companies. In this mini-dissertation, long short-term memory recurrent neural network modelling is applied to forecast armed conflict events in the Afghanistan conflict, which started in October 2011. This model utilises world news data from the Global Database of Events, Language, and Tone (GDELT) platform and georeferenced event data from the Uppsala Conflict Data Program (UCDP) to make its predictions. The results show that GDELT data can improve conventional baseline forecasting models to an extent by incorporating actor and event attributes that are unique to the conflict at hand. Furthermore, the results indicate that news media data can be consolidated with actual recorded deaths in the forecasting model, which enables predictions that are grounded in reality. (An illustrative sketch follows this table.) |
Comparison of machine learning models on different financial time series | The efficient market hypothesis implies that shrewd market predictions are not profitable, because each asset remains correctly priced by the weighted intelligence of the market participants. Several companies have, however, shown that the efficient market hypothesis does not hold in practice. Consequently, a considerable amount of research has been conducted to understand the performance and behaviour exhibited by financial markets, as such insights would prove valuable in the quest to identify which products will provide a positive future return. Recent advancements in artificial intelligence have presented researchers with exciting opportunities to develop models for forecasting financial markets. This dissertation investigated the capabilities of different machine learning models to forecast the future percentage change of various assets in financial markets. The financial time series (FTS) data employed are the S&P 500 index, the US 10-year bond yield, the USD/ZAR currency pair, gold futures and Bitcoin. Only the closing price data for each FTS was used. The machine learning (ML) models investigated are linear regression, autoregressive integrated moving average, support vector regression (SVR), multilayer perceptron (MLP), recurrent neural network, long short-term memory and gated recurrent unit. This dissertation uses an empirical procedure to facilitate the formatting, transformation, and modelling of the various FTS data sets on the ML models of interest. Two validation techniques are also investigated, namely single out-of-sample validation and walk-forward validation. The performance capabilities of the models are then evaluated with the mean square error (MSE) and accuracy metrics. Within the context of FTS forecasting, the accuracy metric refers to the ratio of correct guesses about whether the price moved up or down to the total number of guesses. An accuracy even one percentage point above 50% is considered substantial when forecasting FTS, because a 1% edge on the market can result in a higher average return, which outperforms the market. In the individual analyses of the single out-of-sample and walk-forward validation techniques, the linear regression model was the best ML model for all FTS, because it is the most parsimonious model. Parsimony was disregarded when comparing and contrasting the two validation techniques. The ML models applying walk-forward validation performed best in terms of MSE on the S&P 500 index and the US 10-year bond yield: the SVR model obtained the highest accuracy of 52.94% on the S&P 500 index, and the MLP model obtained the highest accuracy of 51.26% on the US 10-year bond yield. The ML models applying single out-of-sample validation performed best in terms of MSE on the USD/ZAR currency pair, gold futures and Bitcoin: the MLP model obtained the highest accuracy of 51.77% and 53.51% for the USD/ZAR currency pair and gold futures, respectively, and the linear regression model obtained the highest accuracy of 55.04% for Bitcoin. (An illustrative sketch follows this table.) |
Proximal methods for seedling detection and height assessment using RGB photogrammetry and machine learning | An ever-growing global population, coupled with increasing per capita consumption and higher demand for wood-based products, has contributed towards growing demand for planted forests. The efficiencies of such forests are in no small part due to ensuring planted seedlings are well suited to the local environment. This, in turn, has resulted in a growing demand for nurseries to cultivate such seedlings. Nursery operators are faced with the challenge of monitoring stock levels and determining the growth stage of the stock on hand. This typically involves laborious manual assessments based on statistical sampling of only a small percentage of the stock on hand. In this study, a framework for the proximal detection and height assessment of seedlings is proposed. Photogrammetry is employed using red-green-blue (RGB) imagery captured with a smartphone to produce digital surface models (DSMs) and orthomosaic images. Three image collection strategies are proposed and evaluated based on ground control point accuracy. A RetinaNet object detection model, pre-trained on unmanned aerial vehicle (UAV) derived RGB imagery, is utilised for the object detection task. Transfer learning is leveraged by retraining the detection model on a single seedling tray consisting of 98 seedlings, with the model trained on the orthomosaics produced by the photogrammetry process. To determine the heights of these seedlings, two proposals for sampling the seedling height from the DSM are proffered and evaluated. Finally, a number of regression algorithms are investigated as tools to refine the sampled height; ultimately, the ensemble-based AdaBoost regression algorithm achieves the best performance. The proposed pipeline is able to detect 98.97% of seedlings at an intersection over union (IOU) of 76.93%, with only a single instance misclassified. The final root mean squared error (RMSE) of 17.26 mm achieved by the height refinement process on the test data suggests performance sufficient to enable an improved understanding of stock quantities and growth stage without the need for manual intervention. (An illustrative sketch follows this table.) |
Automated tree position detection and height estimation from RGB aerial imagery using a combination of a local-maxima based algorithm, deep learning and traditional machine learning approaches | Forest mensuration is a pivotal aspect of forest management, particularly when determining the total biomass, and subsequently the fiscal value, of forest plantations. Terrestrial measurement of phenotypic tree attributes tends to be laborious and time-consuming. Remote sensing (RS) approaches have revolutionised the way in which forest mensuration is conducted, especially given the reduced costs and increased accessibility associated with leveraging unmanned aerial vehicles (UAVs) that incorporate high-resolution imaging sensors. The rapid development of digital aerial photogrammetry (DAP) technologies has provided a viable alternative to airborne laser scanning (ALS), a technology that has typically been reserved for applications in which high accuracy is required and budget constraints are not a major concern. Furthermore, machine learning (ML), and particularly computer vision (CV), are becoming increasingly commonplace in the processing of orthomosaic rasters and canopy height models (CHMs). Traditionally, an ALS- or DAP-derived CHM has been utilised, together with a local maxima-type model, to detect tree crown apexes and estimate tree heights. In this study, a forest stand located in KwaZulu-Natal, South Africa, comprising 4 968 Eucalyptus dunnii tree positions spaced at 3×2 metres, was considered. A local maxima (LM) algorithm was employed as the baseline model to improve on. The output of the LM algorithm was, however, also utilised in an ensemble of ML models designed to better estimate tree positions and heights. A hybrid approach was proposed that integrates object detection, classification, and regression models in an ML model framework, with the intention of improving on the accuracies achieved by the LM algorithm. The object detection model was built on the RetinaNet one-stage detection model, which comprises a feature pyramid network (FPN) and employs a focal loss (FL) function rather than the typical cross-entropy (CE) loss function, addressing the extreme class imbalance typically encountered by object detection models. This RetinaNet was made available as part of the DeepForest (DF) Python package, and the underlying network had been pretrained on a substantial amount of forest canopy imagery. To improve the model, hand annotations of trees depicted in the DAP-derived orthomosaic were generated and subsequently employed in further training the DF model through transfer learning. A support vector machine (SVM) model was built to filter misclassified tree positions and to act as a differentiator between legitimate and illegitimate tree positions. Furthermore, a multi-layer perceptron (MLP) was trained to address the inherent bias present in the CHM and improve tree height estimations sampled from the CHM. The improvements in tree position and height accuracies were noticeable. Tree position MAE was improved by 15.68%, from 0.3515 metres to 0.2964 metres. Tree height RMSE was improved by 25.30%, from 0.6435 metres to 0.4807 metres, while R2, with respect to height, was increased by 15.22%, from 0.6662 to 0.7676. The proportion of total trees detected was reduced by 3.33%, from 98.77% to 95.48%. The numbers of dead and invalid tree positions detected were, however, also reduced by 82.35% and 36.36%, respectively, suggesting a substantial improvement in the quality of tree positions detected. The results demonstrate the potential improvements that can be realised by incorporating ML approaches and DAP-derived data. (An illustrative sketch follows this table.) |
Fantasy Premier League Decision Support: A Meta-learner Approach | The Fantasy Premier League is a popular online fantasy sport game in which players, known as managers, construct so-called dream-teams based on soccer players in the English Premier League. Each player in the dream-team is assigned a points score based on their performance in each gameweek's fixtures, and the goal of the fantasy sport is to maximize the points accumulated over the course of an entire season. Each season consists of thirty-eight gameweeks, with managers required to select eleven starting players, a captain, and four substitute players for each gameweek. Unless a so-called special chip is used, only eleven of the fifteen players can accumulate points during each gameweek. The manager's selected dream-team is carried over to the successive gameweek, with managers allowed to transfer players into and out of their teams each gameweek. Managers are penalized for excessive player transfers and, adding to the strategic complexity of the fantasy game, face strict constraints when formulating their teams. The so-called dream-team formulation problem can be decomposed into an initial dream-team formulation sub-problem and a subsequent player-transfer sub-problem. The constraints associated with these sub-problems can be expressed as a system of linear constraints and, given an estimate of a player's expected performance in a fixture, a set of suggested player transfers can be obtained by using linear programming. The focus in this project is to design and implement a set of machine learning algorithms capable of forecasting the expected points of the players in a gameweek's fixtures, after which a decision support system is designed and implemented to obtain a suggested initial dream-team and a set of player transfers for the subsequent gameweeks. A total of five machine learning algorithms are considered, with each algorithm selected from a distinctly-functioning family of learning algorithms: linear regression techniques, as well as kernel-based, neural network, decision tree ensemble, and nearest-neighbour algorithms. The applicability of using a stacked meta-learner is investigated, where the meta-learner is provided with predictions generated by the five implemented algorithms. A case study is performed on the 2020/21 Fantasy Premier League season, in which the quality of the suggested player transfers is validated. The final results demonstrate that the decision support system performs favorably: the best set of suggested player transfers would have placed in the top 5.98% of eight million real-world managers in the 2020/21 season. (An illustrative sketch follows this table.) |
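For the CKD detection study above, the following is a minimal sketch of how candidate classifiers might be screened on tabular kidney-disease data. The file name, target column, and model choices are illustrative assumptions, not the study's actual pipeline; categorical attributes are dropped here purely for brevity.

```python
# Hedged sketch: screening classifiers for CKD detection on tabular data.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("ckd.csv")                  # hypothetical local export of the UCI data
X = df.drop(columns=["class"])               # "class" as target column is an assumption
y = (df["class"] == "ckd").astype(int)

numeric_cols = X.select_dtypes(include="number").columns
preprocess = ColumnTransformer(
    [("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                       ("scale", StandardScaler())]), numeric_cols)],
    remainder="drop")                        # categorical columns omitted for brevity

# Two representative model families; the study compares several more.
for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    pipe = Pipeline([("pre", preprocess), ("clf", clf)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
    print(f"{type(clf).__name__}: mean F1 = {scores.mean():.3f}")
```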
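For the feature engineering study above, this sketch shows the transforms named in the abstract (differencing, log-transform, moving average, exponentially weighted moving average) and the ten-day windowing of the supervised problem, assuming a pandas price series. Window sizes are illustrative, and the Fourier and wavelet transforms are omitted here.

```python
# Hedged sketch of the feature engineering transforms on a univariate price series.
import numpy as np
import pandas as pd

def engineer_features(prices: pd.Series) -> pd.DataFrame:
    feats = pd.DataFrame(index=prices.index)
    feats["log_return"] = np.log(prices).diff()      # log-transform + differencing (non-stationarity)
    feats["sma_10"] = prices.rolling(10).mean()      # simple moving average (smoothing)
    feats["ewma_10"] = prices.ewm(span=10).mean()    # exponentially weighted moving average
    return feats.dropna()

def make_windows(prices: pd.Series, lags: int = 10):
    # Supervised framing: ten lagged prices predict the next day's price.
    X = np.column_stack([prices.shift(i) for i in range(lags, 0, -1)])
    y = prices.values
    mask = ~np.isnan(X).any(axis=1)
    return X[mask], y[mask]
```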
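For the armed conflict forecasting study above, the following shows the general shape of a many-to-one LSTM forecaster. The window length, feature count, and all hyperparameters are placeholder assumptions for illustration, not those of the mini-dissertation.

```python
# Hedged sketch: a many-to-one LSTM mapping windows of event features
# (e.g. GDELT-derived counts) to a single fatality target (e.g. UCDP data).
import tensorflow as tf

n_steps, n_features = 12, 8          # window length and feature count (assumptions)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_steps, n_features)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="relu"),  # fatality counts are non-negative
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_train, y_train, epochs=50, validation_split=0.2)
```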
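For the model comparison study above, this sketch contrasts the two validation schemes it investigates: a single out-of-sample split versus expanding-window walk-forward validation. Linear regression stands in for any of the seven models; split sizes are illustrative.

```python
# Hedged sketch of single out-of-sample versus walk-forward validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def single_out_of_sample_mse(X, y, train_frac=0.8):
    n = int(len(X) * train_frac)                      # one fixed train/test boundary
    model = LinearRegression().fit(X[:n], y[:n])
    return mean_squared_error(y[n:], model.predict(X[n:]))

def walk_forward_mse(X, y, start=200, step=20):
    preds, truth = [], []
    for end in range(start, len(X), step):            # refit on an expanding window
        model = LinearRegression().fit(X[:end], y[:end])
        preds.extend(model.predict(X[end:end + step]))
        truth.extend(y[end:end + step])
    return mean_squared_error(truth, preds)

def directional_accuracy(y_true, y_pred):
    # Fraction of correct up/down guesses, computed on percentage changes.
    return np.mean(np.sign(y_true) == np.sign(y_pred))
```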
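For the seedling assessment study above, this is a minimal sketch of the final height refinement step, in which an AdaBoost regressor maps DSM-sampled heights to measured heights. The data here is synthetic and the single input feature is an assumption; the detection stage (RetinaNet) is not reproduced.

```python
# Hedged sketch: refining DSM-sampled seedling heights with AdaBoost regression.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
sampled_mm = rng.uniform(40, 160, 98)              # DSM-sampled heights (synthetic placeholder)
true_mm = sampled_mm * 0.9 + rng.normal(0, 8, 98)  # synthetic "measured" ground truth

X = sampled_mm.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, true_mm, random_state=0)
model = AdaBoostRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
print(f"height RMSE: {rmse:.2f} mm")
```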
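For the tree position and height study above, this sketch shows one common form of the local maxima (LM) baseline on a CHM raster; the window size and minimum-height threshold are illustrative, and ties on flat crowns are not handled.

```python
# Hedged sketch: local-maxima tree top detection on a canopy height model (CHM).
import numpy as np
from scipy import ndimage

def detect_tree_tops(chm: np.ndarray, window: int = 5, min_height: float = 2.0):
    # A pixel is a candidate apex if it equals the maximum in its neighbourhood
    # and exceeds a minimum canopy height.
    local_max = ndimage.maximum_filter(chm, size=window) == chm
    tops = local_max & (chm > min_height)
    rows, cols = np.nonzero(tops)
    return list(zip(rows, cols)), chm[rows, cols]  # pixel positions and sampled heights
```

In the study, the positions and heights produced by such a baseline are then refined by the SVM filter and MLP bias-correction models.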
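For the Fantasy Premier League study above, this sketch casts a toy version of the dream-team formulation sub-problem as a binary linear programme: maximise predicted points subject to budget and squad-size constraints. Player data, the budget, and the squad size are placeholders, and several real FPL constraints (positions, per-club limits) are omitted.

```python
# Hedged sketch: dream-team selection as a binary linear programme (PuLP).
import pulp

players = [  # (name, position, cost, predicted_points): illustrative values
    ("A", "GK", 4.5, 3.2), ("B", "DEF", 5.0, 4.1), ("C", "MID", 8.0, 6.3),
    ("D", "FWD", 7.5, 5.8), ("E", "MID", 6.0, 4.9), ("F", "DEF", 4.0, 3.0),
]
x = pulp.LpVariable.dicts("pick", range(len(players)), cat="Binary")

prob = pulp.LpProblem("fpl_team", pulp.LpMaximize)
prob += pulp.lpSum(x[i] * players[i][3] for i in range(len(players)))          # expected points
prob += pulp.lpSum(x[i] * players[i][2] for i in range(len(players))) <= 25.0  # toy budget
prob += pulp.lpSum(x[i] for i in range(len(players))) == 4                     # toy squad size
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([players[i][0] for i in range(len(players)) if x[i].value() == 1])
```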
Title | Abstract |
---|---|
Requirements for 3D stock assessment of timber on landings and terminals | This project aims to address the issue of an unreliable stock assessment system in the timber supply chain, which leads to inaccurate estimations of stock volumes in log piles. The system developed in this project needs to satisfy the practical constraints of the supply chain, while generating results that are frequent and accurate. The data capturing process is required to be low-tech due to the vast rural areas covered by the timber supply chain. The method identified for achieving this is terrestrial structure from motion (SFM), using a consumer-grade camera or a smartphone. The final data used for the project takes the form of point clouds, generated both from SFM and from Unity, in order to increase the amount of data available. For the system to determine the volume of log piles, the first step is to distinguish log pile from terrain within the point cloud. To this end, a classification algorithm is developed as part of this project. The algorithm makes use of neighbourhood statistics calculated during the feature engineering process, along with features in the original point cloud dataset, and classifies log piles using K-means clustering. Once the log piles are extracted from the point cloud, an alpha shape is generated from the extracted points and used to predict the volume of the log piles. The results of the final system show that the methodology developed achieves predicted volumes of an acceptable level for the intended use case. The results of this project thus provide evidence that computer vision can benefit the timber supply chain by enabling accurate stock assessments. Finally, the project acknowledges that further work is needed to improve accuracy and to implement the system. (An illustrative sketch follows this table.) |
A predictive model for precision tree measurements using applied machine learning | Accurately determining biological asset values is of great importance for forestry enterprises; the process ought to be characterised by the proper collection of tree data by means of appropriate enumeration practices conducted at managed forest compartments. Currently, only between 5% and 20% of a forest area is enumerated to serve as a representative sample for the entire enclosing compartment. For forestry companies, timber volume estimations and future growth projections are based on these statistics, which may be accompanied by numerous unintentional errors during the data collection process. Many alternative methods for estimating and inferring tree data accurately are available in the literature; the most popular characteristic is the so-called diameter at breast height (DBH), which can also be measured by means of remote sensing techniques. Advances in laser scanning measurement apparatus over recent decades have been significant; however, these approaches are notably expensive and require specialised technical skills to operate. One of the main drawbacks associated with measuring DBH by means of laser scanning is the lack of scalability: equipment setup and data capture are arduous processes that take a significant amount of time to complete. Algorithmic breakthroughs in the domain of data science, predominantly spanning machine learning (ML) and deep learning (DL) approaches, warrant the selection and practical application of computer vision (CV) procedures. More specifically, an algorithmic approach based on monocular depth estimation (MDE) techniques, employed for the extraction of tree data features from video recordings captured using no more than an ordinary smartphone device, is investigated in this thesis. Towards this end, a suitable forest study area was identified for the experiment, and the industry partner of the project, the South African Forestry Company SOC Limited (SAFCOL), granted the necessary plantation access. The research methodology adopted for this thesis includes fieldwork at the given site, which involved first performing data collection steps according to accepted and standardised operating procedures developed for tree enumerations. This data set is regarded as the “ground truth” and comprises the target feature (i.e. actual DBH measurements) later used for modelling purposes. The video files were processed in a structured manner in order to extract tree segment patterns from the corresponding imagery. Various ML models were then trained and tested on the resulting input feature data file, producing a relative root mean squared error (RMSE%) of between 14.1% and 18.3% for the study. The relative bias yields a score between 0.08% and 1.13%, indicating that the proposed workflow solution exhibits consistent predictions, but at an undesirable error rate (i.e. RMSE) deviation from the target output. Additionally, the suggested CV/ML workflow model is capable of generating a discernibly similar spatial representation upon visual inspection, when compared with the ground-truth tree coordinates captured during fieldwork. In the pursuit of precision forestry, the proposed predictive model developed for accurate tree measurements produces DBH estimations that approximate real-world values with a fair degree of accuracy. (An illustrative sketch follows this table.) |
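For the timber stock assessment study above, this sketch covers the two core steps described: clustering point cloud points into pile versus terrain using simple neighbourhood features, and then estimating volume. A convex hull is used here as a crude stand-in for the alpha shape named in the abstract, and the feature choices and pile-selection heuristic are illustrative assumptions.

```python
# Hedged sketch: K-means pile/terrain separation plus a volume estimate.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors
from scipy.spatial import ConvexHull

def classify_and_measure(points: np.ndarray, k: int = 10) -> float:
    """points: (N, 3) array of x, y, z coordinates."""
    # Neighbourhood statistics: each point's height plus local height variation.
    nn = NearestNeighbors(n_neighbors=k).fit(points)
    _, idx = nn.kneighbors(points)
    local_std = points[idx, 2].std(axis=1)
    feats = np.column_stack([points[:, 2], local_std])

    labels = KMeans(n_clusters=2, n_init=10).fit_predict(feats)
    pile_label = labels[points[:, 2].argmax()]     # heuristic: pile contains the highest point
    pile = points[labels == pile_label]
    return ConvexHull(pile).volume                 # stand-in for the alpha-shape volume
```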
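For the precision tree measurement study above, this is a minimal sketch of the two evaluation metrics it reports, relative RMSE (RMSE%) and relative bias, computed from predicted DBH against field-measured ground truth; the exact formulations used in the thesis may differ.

```python
# Hedged sketch of the reported evaluation metrics for DBH prediction.
import numpy as np

def rmse_percent(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    return 100.0 * rmse / y_true.mean()            # RMSE relative to mean observed DBH

def bias_percent(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return 100.0 * (y_pred - y_true).mean() / y_true.mean()  # mean signed error, relative
```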