
As part of the MEng (Structured) programme with a focus on Data Science, our students are required to complete a final 60-credit data science research project in which they apply and consolidate the data science knowledge gained throughout the programme. For this purpose, students solve a real-world data science problem, providing solutions for each step of the data science project life cycle and documenting the work in a research assignment.
For these projects, we collaborate with industry and academic partners who are willing to propose a topic, to provide the necessary data (if not publicly available), and to act as domain mentors. The data set needs to be complete.
If you are interested in partnering with us for such a project, please contact DS-PROJECTS@sun.ac.za for further information about a short project proposal and deadlines.
Project proposals reviewed by the end of term 3 of a given year will be assigned to students for the following year.
Below is a list of completed research assignments, grouped by year of graduation.
March 2025 Graduation

Title | Abstract |
---|---|
Set-based Particle Swarm Optimization for Training Support Vector Machines | This research explores the application of set-based particle swarm optimization (SBPSO) to the training of support vector machines (SVMs), addressing challenges in hyperparameter tuning, noisy datasets, and computational efficiency. SVMs, celebrated for their classification precision, often face limitations due to their sensitivity to parameter selection and difficulties in handling high-dimensional or noisy data. SBPSO, an extension of traditional particle swarm optimization (PSO), is tailored for discrete optimization problems, making it a promising approach for optimizing SVM performance.
The study investigates two approaches: standard SBPSO-SVM training and SBPSO-SVM training with Tomek links preprocessing, which enhances data quality by reducing noise and refining decision boundaries. Experiments conducted on five benchmark datasets reveal that both methods significantly reduce the number of support vectors while maintaining competitive accuracy and F1 scores. However, training times were substantially longer than those of standard SVMs, highlighting a need for further optimization. To address these challenges, dynamic control of SBPSO parameters was introduced, alongside advanced preprocessing techniques such as principal component analysis (PCA) with Gaussian mixture model (GMM) noise filtering and Wilson editing. While these enhancements improve training efficiency and performance for complex and noisy datasets, the algorithm still struggles to scale effectively to very noisy, large, and highly complex datasets. This research contributes to the ongoing development of hybrid optimization frameworks, providing insights into balancing computational costs with classification performance. The findings underscore the potential of SBPSO-SVM as a robust tool for advancing machine learning applications in diverse, real-world scenarios. |
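
For illustration, a minimal sketch of the Tomek links preprocessing step named above, paired with an ordinary scikit-learn SVM (SBPSO itself has no off-the-shelf implementation, and the dataset here is a synthetic stand-in):

```python
# Tomek links preprocessing before SVM training; data and parameters are
# illustrative stand-ins, not the study's benchmarks.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=0)  # flip_y injects label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Remove Tomek links: points in cross-class nearest-neighbour pairs, which
# tend to sit on noisy decision boundaries.
X_clean, y_clean = TomekLinks().fit_resample(X_tr, y_tr)

svm = SVC(kernel="rbf", C=1.0).fit(X_clean, y_clean)
print("support vectors:", svm.n_support_.sum(),
      "F1:", f1_score(y_te, svm.predict(X_te)))
```
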
Property tax validation using automatic building footprint extraction from aerial images | Property valuation is essential to determine the rates and taxes needed for municipal services. Property valuation depends on the number and size of buildings on a property. A tedious manual process is used to create the outlines of buildings and calculate the building area. Therefore, this project aims to develop a process to generate building outlines from unmanned aerial vehicle raster images with as little human intervention as possible. The solution developed uses semantic pixel classification to detect buildings and find the building outlines. The outlines can then be used to validate property valuations.
To perform semantic pixel classification, a U-Net architecture was selected. Various experiments were conducted to find the optimal U-Net architecture. The output of the semantic pixel classification was used along with a contour extraction method to extract the building’s outline. Similarly, experiments were conducted to select the optimal contour extraction method. The U-Net model and contour extraction method are combined to create a process capable of extracting building outlines from raster images. Experiments were performed using a human-in-the-loop approach, a variant of active learning. The training results show accuracy, recall, precision, and intersection over union above 90%. Even though the training showed excellent training and validation metrics for the experiments, the project shows how critical the training data is to predicting test data and determining the quality of image segmentation and building outline extraction. Finally, the process produces vector data that accurately represents 80 to 90% of buildings with an area error of less than one square meter. |
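
A minimal sketch of the contour-extraction step, under the assumption of a binary U-Net-style building mask; cv2.findContours with polygon simplification is one common choice among the methods the study compared:

```python
# Turn a binary building mask (e.g. U-Net output) into polygon outlines
# and pixel areas. The rectangle stands in for a real prediction.
import cv2
import numpy as np

mask = np.zeros((256, 256), dtype=np.uint8)
cv2.rectangle(mask, (50, 60), (180, 200), 255, -1)  # stand-in "building"

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    # Simplify the outline; epsilon controls how aggressively it is smoothed.
    outline = cv2.approxPolyDP(c, 0.01 * cv2.arcLength(c, True), True)
    area_px = cv2.contourArea(outline)
    print(f"vertices: {len(outline)}, area: {area_px:.0f} px^2")
```
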
Credit Scoring and Risk Assessment Using Machine Learning and Overdraft History | Credit scoring is a quantitative evaluation method by which lenders assess whether a borrower (either an individual or a business) is able to repay a debt if credit is granted. A credit score is typically generated at the end of the credit scoring process, and it is a fundamental element that influences an individual’s access to credit. It acts as a gateway to financial resources such as loans, credit cards, and others, highlighting the importance of fairness, non-discrimination, and ethical practices to ensure equitable access to credit, free from prejudice and bias.
Credit history is typically the key factor in traditional scoring methods, including the FICO score, logit models, and expert judgment-based models, among others. As a result, individuals who have never borrowed may be overlooked or subjected to high-interest rates. To address these limitations, this study leverages overdraft information to develop a dynamic, inclusive, and effective credit scoring framework. This framework integrates both traditional credit history data and overdraft data, which is often underutilized but can potentially serve as an indicator of good versus bad borrowers. Additionally, the literature does not identify which machine learning method is best suited for credit scoring tasks. To overcome this uncertainty, the following algorithms are trained: KNN, Naïve Bayes, decision trees, ANN, and SVM, to predict which bank customers are likely to default on credit using three distinct datasets: overdraft, credit history, and a combination of both. The performance of these algorithms is evaluated to determine the most accurate predictive method. Through a series of hyperparameter tuning across the algorithms, the results of this study suggest that Naïve Bayes is particularly effective when both credit history and overdraft data are available, as it demonstrated minimal misclassifications and robustness in classifying customers correctly. The algorithm performed best on the three tested datasets, achieving accuracy rates of 99.01% for the credit history dataset, 99.5% for the hybrid dataset, and 100% for the overdraft dataset. KNN also performed well, with accuracy rates of 98.93% for credit history, 99.3% for the hybrid dataset, and 99.97% for overdraft. Additionally, a comparison of the overdraft credit scores versus credit history scores indicated that overdraft-based scores reflect a more optimistic distribution, with a significant reduction in the percentage of customers categorized as poor when using overdraft data alongside credit history. The combination of both datasets resulted in more accurate credit assessments, increasing the number of customers qualifying for credit approval. Specifically, 75% of customers qualified using the combined dataset, compared to 65% with overdraft data and 45% with credit history alone. The results of this study offer new perspectives for financial institutions that traditionally rely solely on credit history data to profile individuals. This unique study represents a potential game-changer in the field, with the capacity to bring about a significant paradigm shift in lending and borrowing practices. If successfully adopted, this approach could create a mutually beneficial situation for both lenders and borrowers. Individuals often denied credit due to a lack of credit history would no longer be excluded, thereby enhancing decision-making processes and potentially increasing profitability. |
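
The comparison setup described above can be sketched as follows; the synthetic data stands in for the non-public overdraft and credit-history datasets, and hyperparameter tuning is omitted:

```python
# Cross-validated comparison of the five algorithm families named in the
# abstract, on a synthetic stand-in dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
models = {
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "ANN": MLPClassifier(max_iter=1000, random_state=0),
    "SVM": SVC(),
}
for name, model in models.items():
    clf = make_pipeline(StandardScaler(), model)  # scale, then classify
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```
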
Machine Unlearning of Convolutional Neural Networks to Address the Right to Be Forgotten | This research assignment examines whether personally identifiable information can be removed from a convolutional neural network using a machine unlearning algorithm and verified as removed, to ensure compliance with the right to be forgotten as outlined in the General Data Protection Regulation. Machine unlearning investigates whether data removal can be achieved while preserving model performance, without fully retraining the machine learning model.
In this research assignment, a convolutional neural network is trained on facial images. The performance of the convolutional neural network before and after applying a machine unlearning algorithm is then established. The evaluation examines the extent of data required for machine unlearning, such as whether a single image, multiple images, or all images used during training are necessary to remove the presence of data associated with an individual. Machine unlearning demonstrated effectiveness in removing specific data from the convolutional neural network, as measured by a membership inference attack. The machine unlearning algorithm, which utilises Kullback-Leibler divergence and weight regularisation, enabled the removal of data for a single individual as well as for a forget set composed of a sampled group of individuals without requiring full retraining. The study shows that unlearning can be successfully achieved while preserving the generalisation capabilities of a convolutional neural network. |
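
One plausible formulation of such an unlearning objective, sketched here as an assumption rather than the authors' exact algorithm, combines a Kullback-Leibler term that pushes forget-set predictions toward the uniform distribution with an L2 penalty that keeps weights close to the originally trained model:

```python
# Speculative sketch of a KL-plus-weight-regularisation unlearning loss.
import torch
import torch.nn.functional as F

def unlearning_loss(model, original_params, x_forget, num_classes, lam=1e-2):
    log_probs = F.log_softmax(model(x_forget), dim=1)
    uniform = torch.full_like(log_probs, 1.0 / num_classes)
    # KL term on the forget set: erase class evidence for those samples.
    kl = F.kl_div(log_probs, uniform, reduction="batchmean")
    # Weight regularisation: stay close to the trained model elsewhere.
    reg = sum((p - p0).pow(2).sum()
              for p, p0 in zip(model.parameters(), original_params))
    return kl + lam * reg

model = torch.nn.Linear(8, 4)               # toy stand-in for the CNN
orig = [p.detach().clone() for p in model.parameters()]
loss = unlearning_loss(model, orig, torch.randn(16, 8), num_classes=4)
loss.backward()                             # gradients drive the unlearning step
print(loss.item())
```
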
Towards an automated medical image classification pipeline | Radiological departments have high demands for efficiency and diagnostic quality, and the interpretation of radiographs is highly variable between radiographers. The process followed in a radiological department to support patients with health services can be made more efficient. Parts of the process, such as retrieving and processing data, can be automated with artificial intelligence to expedite the process and increase the quality of services offered.
Deep learning is a subfield of artificial intelligence, and transfer learning is a subfield of deep learning. Transfer learning can be applied to image classification tasks to improve the predictive accuracy of classes. Medical images cover several modalities such as X-rays, ultrasound, magnetic resonance imaging, and angiographs, amongst others. Several transfer learning methods are compared to perform classification for two model components. The first component is a machine learning model that can predict the medical image modality type of an image. The second component is a machine learning model that can predict the body part from human anatomy. This research assignment covers the creation of a medical imaging dataset that is sourced from open-source data sets. A variety of transfer learning models such as residual neural networks, dense neural networks, and efficient neural networks are evaluated on this data set. The results of this research assignment show that lightweight transfer learning methods can successfully be applied to perform classification on medical imagery. The best performing models of both components are combined in a transfer learning classification pipeline. The transfer learning pipeline produced a predictive accuracy of 96.3034% on testing data. |
Evolving Oblique Decision Trees | This study investigates the induction of oblique decision trees for classification using genetic programming, with constraints imposed on the genetic operators and the fitness function. Additionally, the study examines the effect of introducing pre-defined genetic programs in the initial population of the evolutionary process on the performance of the genetic programs in solving classification tasks. The pre-defined individuals in the initial population were generated by leveraging clustering techniques and methodology inspired by the Cline decision tree [24].
The goals were achieved by developing constrained genetic programs to induce oblique decision trees. The results demonstrate that using genetic programming with applied constraints for classification purposes is feasible and results in decision trees that perform exceptionally well compared to standard axis-aligned and oblique decision trees, albeit at the cost of increased computational resources. Results from the experiment also highlight that the overall performance of genetic programming-based algorithms relies more heavily on the evolutionary process itself rather than the introduction of initial population diversifying techniques. |
Horticulture Supplier Delivery Forecast | Supermarket retailers rely on suppliers to meet customer demands, but suppliers often face disruptions that prevent them from delivering the agreed quantities. This is true in the horticultural sector, where weather and logistical challenges affect the delivery reliability. Accurate forecasting of horticultural supplier deliveries is critical for supermarket retailers, as fresh fruit is a key source of revenue. This highlights the need for improved forecasting methods that use predictive analytics to improve forecast accuracy.
The main objective was to develop a predictive analytics solution to forecast deliveries from horticultural suppliers, focusing on fresh fruit. The research aims to help retailers align supply with demand, reducing stock shortages and managing variability in deliveries. The study employs machine learning models trained on 24 months of historical data, incorporating derived features that represent factors influencing delivery reliability. The models, including a baseline model, are evaluated over a 6-month period, using 69 exclusive suppliers and 32 product types. The research assignment found that the majority of the models outperformed the baseline, with random forest and GRU models performing the best based on standard evaluation metrics. The baseline model achieved a mean absolute error (MAE) of 30.35, while the random forest model reduced the MAE to 0.47, demonstrating a significant improvement in forecasting accuracy. The findings show that the integration of predictive analytics and the incorporation of influential factors address key challenges faced by retailers, such as inconsistent supplier deliveries, and can improve forecast visibility and customer satisfaction. This study contributes to predictive analytics in the horticulture supply chain, highlighting the importance of integrating factors to optimise forecasting. |
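
The evaluation pattern, a naive baseline versus a learned model scored by MAE on a held-out final period, can be sketched as follows; the feature names and synthetic series are illustrative assumptions:

```python
# Baseline vs. random forest on lagged delivery features, scored with MAE
# over the final quarter of a synthetic two-year daily series.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 730  # ~24 months of daily records
df = pd.DataFrame({"delivered": rng.gamma(5.0, 10.0, n)})
df["lag_7"] = df["delivered"].shift(7)
df["rolling_mean_28"] = df["delivered"].shift(1).rolling(28).mean()
df = df.dropna()

split = int(len(df) * 0.75)  # last quarter ~ a 6-month test window
train, test = df.iloc[:split], df.iloc[split:]

baseline = test["lag_7"]     # naive baseline: last week's delivery
model = RandomForestRegressor(random_state=0).fit(
    train[["lag_7", "rolling_mean_28"]], train["delivered"])
pred = model.predict(test[["lag_7", "rolling_mean_28"]])

print("baseline MAE:", mean_absolute_error(test["delivered"], baseline))
print("random forest MAE:", mean_absolute_error(test["delivered"], pred))
```
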
Behavioural Scorecards Development and Machine Learning | This study compares traditional behavioural scorecards based on logistic regression (LR) with machine learning (ML) for credit risk assessment. The study aims to improve predictive performance while maintaining model interpretability to comply with Basel regulatory standards. To achieve this, the study introduces the Bayesian Weight of Evidence Optimizer (BWOpt) for binning optimization in LR models and proposes the interpretable prepruned penalized logistic tree regression (P-PLTR) alongside RuleFit. It also explores the effects of sampling strategies (undersampling and oversampling) on model performance with imbalanced datasets.
Results show that traditional scorecards outperform ML models, particularly with oversampled data. While RuleFit and P-PLTR show competitive performance with undersampling, P-PLTR suffers from instability in rule sets. BWOpt-enhanced LR models outperform both ML methods, highlighting the value of feature engineering. These findings align with existing literature, which suggests that ML models do not significantly outperform statistical models such as LR in structured data, though ML may offer advantages with unstructured data. Given their balance of interpretability and predictive power, traditional scorecards remain well-suited for regulated environments. |
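
The weight-of-evidence encoding that underlies such scorecard binning is WoE_bin = ln(%good / %bad), with the information value summing (%good - %bad) x WoE over bins. A minimal pandas sketch on toy data follows; the BWOpt binning optimizer itself is the study's own contribution and is not reproduced:

```python
# Weight of evidence (WoE) and information value (IV) for one binned
# feature; column names and data are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "utilisation_bin": ["low", "low", "low", "mid", "mid", "high", "high"],
    "default":         [0,     0,     1,     0,     1,     1,      0],
})

grp = df.groupby("utilisation_bin")["default"]
good = (grp.count() - grp.sum()) / (len(df) - df["default"].sum())
bad = grp.sum() / df["default"].sum()
woe = np.log(good / bad)          # positive WoE = safer-than-average bin
iv = ((good - bad) * woe).sum()   # information value of the feature
print(woe, f"\nIV = {iv:.3f}")
```
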
Evaluating Heterogeneous Graph Embeddings for Product Substitute Identification with LLM-Generated Attributes | In the context of the food retail sector, the identification of product substitutes is crucial for several reasons, including the determination of the assortment of store products, the design of marketing campaigns, the promotion of items, and the avoidance of potential cannibalisation when introducing new products. Given the extensive range of products and categories, understanding product relationships and consumer purchasing behaviours is essential. Product relationships can be categorized as complements, substitutes, or irrelevant product pairs. This study seeks to investigate product substitutes through the process of product clustering. The first of three research objectives is to determine whether the use of product attributes leads to the formation of consistent and informative product groups. The second objective aims to determine whether usable and accurate product attribute values can be derived from product descriptions using large language models (LLMs). The final objective is to evaluate the impact of the LLM-generated product attribute values on the formation of substitute product groups.
To determine product attributes, a combination of structured and unstructured data sources from a prominent South African food retailer is utilised, with the intention of elucidating product substitute relationships while integrating a degree of explainability derived from the product attributes. The framework known as Product Attribute Value Extraction (PAVE) offers a prompt engineering template as an efficient method for extracting explicit and implicit attributes from product descriptions, with an accuracy level of up to 85% in this study depending on the chosen model. While high overall accuracy is obtained, there are slight nuances to the accuracy of the different attributes, where some have a significantly lower extraction accuracy for most of the models tested. However, even with high accuracy rates, the LLMs can be further fine-tuned for use-case-specific tasks, allowing for even higher accuracy. In the pursuit of identifying product substitutes, product attributes are utilised in conjunction with transaction data to capture purchasing behaviours. Various graph embedding and graph clustering models are evaluated to identify a model that can fulfil the dual objectives of substitutability and explainability. A heterogeneous graph embedding is chosen for conducting the substitutability analysis, in combination with similarity-based and graph-based clustering algorithms. The heterogeneous model is selected due to its higher potential for offering context-specific explainability amidst the continuously evolving domain of product relationships. Experiments are conducted to attempt to cluster products into substitute categories that correspond with both the retailer’s groupings and those of a baseline model. The findings indicate that the use of product attributes does not constitute the most effective and scalable approach to achieve substitute product categorisation. This limitation arises from the inherent sensitivity of heterogeneous graphs to both configuration settings and input data, which may require tailored and context-specific model calibrations. Further investigations are warranted to explore the potential integration of product attributes into heterogeneous graph embeddings for substitute categorisation. Alternatives could include knowledge graphs and link prediction or the adaptation of the PAVE framework to facilitate the extraction of product substitutes from a list, potentially enriched with external data. |
The application of feature reduction to reduce the training time of neural networks | The rise of big data has revolutionised the way businesses operate, but it has also created computational challenges when training deep learning models. Training machine learning models on large datasets is a complex and time-consuming task that requires significant computational resources. This research investigates whether feature reduction techniques can be used as a solution to reduce training time. A review of the literature on transfer learning, data-centric approaches, feature selection, and dataset reduction is conducted, and a methodology is developed to evaluate the efficiency of these techniques on image datasets. The techniques are assessed based on the number of features used, training time of the model, accuracy, and precision. The results demonstrate the extent to which feature reduction methods can decrease training times and improve model accuracy. This research culminates in recommending the best-performing techniques, providing valuable insights for optimising machine learning processes in image analysis. |
Deriving an Agricultural Soil Quality Index from Soil Microbiome using Autoencoders | Soil quality plays a pivotal role in sustaining ecosystems, influencing climate change and supporting agricultural productivity. Degradation of soil can severely threaten food security and exacerbate global warming. Current definitions and indices for assessing soil quality concentrate on a single soil function or fail to consider the important interrelationships and dynamics between soil properties. Principal component analysis is commonly used to establish a soil quality index through additive or weighted additive models. Principal component analysis is however inadequate when nonlinear relationships or high correlation exist among variables. Moreover, additive methods require prior knowledge of how specific soil properties impact quality without considering interdependencies among them. These limitations complicate the integration of the soil microbiome into a soil quality index. Given the complexity and diversity of microbial communities in soil, there are limited studies that define soil quality from a microbial perspective. Yet, the soil microbiome is essential for maintaining soil functionality and preventing degradation. Thus, there is a need to develop a soil quality index that incorporates microbial activity to enhance food security and promote sustainable agriculture.
This study proposes the use of autoencoders to develop a soil quality index derived from soil microbiome data. To address the high dimensionality of the microbiome dataset, four feature selection techniques — principal component analysis, Pearson correlation, agglomerative hierarchical clustering, and Louvain community detection — were implemented to generate minimum datasets which were used to train various autoencoder designs. The output from the autoencoder’s bottleneck layer was used to derive a soil quality index, which was evaluated against microbial diversity indices. The soil quality index showed a strong correlation with the Chao1 diversity index and moderate correlations with the Shannon and Simpson diversity indices. Among the minimum datasets used, the dataset generated using agglomerative hierarchical clustering produced a soil quality index with the highest correlations to microbial diversity indices. The soil quality index derived using a sparse autoencoder was particularly favored due to its simplicity, as it reduces to a sigmoid function during inference, enhancing explainability and interpretability. |
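
A minimal Keras sketch of the favoured design: an autoencoder whose one-unit sigmoid bottleneck becomes the soil quality index at inference. The input size, sparsity penalty, and random data are assumptions:

```python
# Sparse autoencoder with a one-unit sigmoid bottleneck used as a 0-1 index.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

n_features = 50  # size of a "minimum dataset" of microbiome features
inputs = keras.Input(shape=(n_features,))
# At inference the index reduces to a sigmoid of a weighted feature sum.
index = layers.Dense(1, activation="sigmoid",
                     activity_regularizer=regularizers.l1(1e-4))(inputs)
outputs = layers.Dense(n_features)(index)  # decoder reconstructs the input

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(200, n_features).astype("float32")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)

sqi = keras.Model(inputs, index)  # soil quality index model
print(sqi.predict(X[:5], verbose=0).ravel())
```
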
Incremental Feature Learning: A Constructive Approach to Training Neural Networks with Dynamic Particle Swarm Optimisation | Incremental feature learning (IFL) is a supervised machine learning (ML) paradigm for feedforward neural networks (NNs), where the input layer of the NN is incrementally constructed over time. The benefits of such a paradigm are twofold: the first is the ability afforded to a NN to dynamically incorporate new features as they become available over time without the need for retraining; the second is a reduction in overfitting behaviour and model complexity, and hence improved NN generalisation ability. A feature ranking approach based on feature importance is used to determine the order in which features are integrated into the model. The incremental addition of features to a NN results in a dynamic optimisation problem (DOP); more specifically, a DOP with dimensionality expansion, where both the surface and the dimensionality of the search space evolve over time. Particle swarm optimisation (PSO) is an established method for training feedforward NNs, and has been shown in multiple studies to outperform traditional backpropagation (BP). Modified PSO algorithms have been developed to deal with dynamic environments, and have been successfully applied to train feedforward NNs in dynamic environments. This study adapts various dynamic PSO variants for use in DOPs with dimensionality expansion. The adapted dynamic PSO variants are used to train incrementally constructed NNs (INNs) using the proposed IFL framework, and the results are compared to those of fully constructed NNs (FNNs) trained using traditional BP and standard PSO on a complete dataset. Experiments were conducted on fifteen diverse datasets spanning regression and classification tasks. The results show that IFL effectively enables NNs to dynamically incorporate new features as they become available over time, and that IFL provides desirable performance in terms of overfitting behaviour and can be used as a regularisation technique. |
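
As a baseline reference, training a fixed feedforward NN with standard gbest PSO can be sketched as below; the network size and PSO coefficients are common textbook values rather than the study's settings, and the dynamic and incremental extensions are omitted:

```python
# Standard (gbest) PSO training of a small feedforward NN on a toy
# regression task; each particle is a flattened weight vector.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 3))
y = np.sin(X).sum(axis=1, keepdims=True)            # toy regression target

n_in, n_hid, n_out = 3, 5, 1
dim = n_in * n_hid + n_hid + n_hid * n_out + n_out  # flattened weights

def mse(w):
    i = 0
    W1 = w[i:i + n_in * n_hid].reshape(n_in, n_hid); i += n_in * n_hid
    b1 = w[i:i + n_hid]; i += n_hid
    W2 = w[i:i + n_hid * n_out].reshape(n_hid, n_out); i += n_hid * n_out
    b2 = w[i:]
    h = np.tanh(X @ W1 + b1)
    return np.mean((h @ W2 + b2 - y) ** 2)

n_particles, w_inertia, c1, c2 = 30, 0.729, 1.494, 1.494
pos = rng.uniform(-1, 1, (n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_f = pos.copy(), np.array([mse(p) for p in pos])
gbest = pbest[pbest_f.argmin()].copy()

for _ in range(500):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = w_inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos += vel
    f = np.array([mse(p) for p in pos])
    better = f < pbest_f
    pbest[better], pbest_f[better] = pos[better], f[better]
    gbest = pbest[pbest_f.argmin()].copy()

print("training MSE:", pbest_f.min())
```
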
Grading Infrastructure Conditions through Machine Learning using Infrastructure Report Cards and Media Reports | Public infrastructure is of critical importance to advance job creation, equity, sustainable development and economic growth, yet there is a lack of information regarding infrastructure conditions to enable informed infrastructure investment decisions in South Africa. The South African Institution of Civil Engineering publishes Infrastructure Report Cards where ratings are applied to different infrastructure sectors based on factors such as condition, capacity and performance. A general lack of information, however, restricts the compilation of infrastructure report cards. Online news articles are targeted as an alternative data source to compile infrastructure report cards due to their availability, real-time and geographical coverage, as well as the insight they provide into socio-political infrastructure issues that are not adequately captured in technical reports. However, online news articles do not rate infrastructure conditions explicitly, which makes it difficult to summarise and extract findings. In this research assignment, a machine learning model that automatically rates infrastructure conditions from online news articles is developed.
A cross-domain modelling approach was adopted where the knowledge gained from training machine learning models on the source domain was utilised to make predictions on the target domain. Label descriptions from The South African Institution of Civil Engineering and the American Society of Civil Engineers infrastructure report cards were collected and compiled to form a source domain, while extracted online news articles were used as a target domain. An in-domain modelling approach was adopted to determine the feasibility of the datasets. In the cross-domain modelling approach, six machine learning models were trained on the scorecard dataset and evaluated on an annotated sample of the online news article dataset. The six models included three ordinal regression models, a long short-term memory model and two hybrid models where active learning and random sampling were combined with the long short-term memory model. The hybrid options were applied with the objective of improving the domain adaptation of the long short-term memory model. The logistic ordinal regression all-threshold model achieved the best mean squared error score of 1.255 on the test dataset, with the ordinal ridge regression model achieving the best mean absolute error of 0.788. These results suggest that the models in this research assignment can, on average, predict the article labels within a margin of less than one grade from the true label. The findings suggest that the models performed well in in-domain learning and are able to label news articles, but struggled with domain adaptation in cross-domain learning due to misalignment between the features and labels of the two datasets. |
Automated Road Detection and Classification for Urban and Rural Areas Using Aerial Imagery | This research assignment presents an automated approach to digitise roads from aerial imagery using deep learning techniques, focusing on distinguishing between paved and gravel roads. This work addresses the need for efficient and accurate road mapping in geographical information systems, supporting applications in urban planning, autonomous driving, and infrastructure management.
The solution utilises a DeepLab model based on the EfficientNetV2M architecture to identify and extract roads from aerial images and perform road quality condition assessment on the extracted road. The DeepLab model developed achieved a mean Intersection over Union score of 0.87 and a mean F1 score of 0.91. After segmentation, the segmented masks are converted into polygons using image processing techniques. These are then compiled into geographical information system-compatible shapefiles with detailed attribute mapping for road type classification. The developed pipeline incorporates parallel processing and optimised contour detection algorithms to efficiently handle large datasets, along with error handling and logging mechanisms to maintain robustness. This automated approach significantly reduces the manual effort required for road digitisation, offering a scalable solution for updating digital maps and enhancing geographical information system capabilities. This research assignment demonstrates the potential of deep learning in automating and improving the accuracy of spatial data extraction from aerial imagery, contributing to the fields of autonomous navigation and smart city infrastructure development. |
Utilizing unsupervised machine learning to identify patterns and anomalies in the JSE Top 40 equities | This study investigates using unsupervised machine learning to uncover hidden relationships and anomalies among the Johannesburg Stock Exchange (JSE) Top 40 equities. By transforming raw time-series data into informative metrics that include returns, volatility, average trading volume, and fundamental indicators such as earnings per share (EPS) and the price-to-earnings (P/E) ratio, the research aims to uncover patterns that can be used to inform investment management strategies. The data is analyzed as a snapshot, with the intention that this process can be continuously applied over different time frames to gain insights into opportunities and manage portfolio risk. This approach is not designed as a long-term buy-and-hold strategy but rather to spot changes in snapshot data.
Different clustering algorithms, namely K-Means, DBSCAN, and hierarchical clustering, were employed in combination with dimension reduction techniques such as Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). The models were evaluated using internal metrics, namely the Silhouette Score and the Davies-Bouldin Index. Additionally, the JSE sector classifications served as an external ground truth for validation and to identify anomalies that could be leveraged for investment opportunities. The results indicate that t-SNE combined with hierarchical clustering produced the most well-organised clusters, achieving a Silhouette Score of 0.5023 and a Davies-Bouldin Index of 0.5296. The analysis uncovered both expected sector groupings and notable anomalies, such as companies clustering outside their designated sectors due to similar financial characteristics. Shapley Additive Explanations (SHAP) analysis was used to provide insights into feature importance within clusters, enhancing the interpretability of the results. In conclusion, the study demonstrates that unsupervised machine learning techniques are effective in detecting meaningful patterns and anomalies in stock market data. These insights offer practical implications for investment management by providing a data-driven approach to portfolio diversification and risk assessment. This research contributes to the financial literature by showcasing the utility of advanced clustering methods in the context of the South African equity market, which could guide future studies in emerging markets. |
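
The best-performing combination reported above can be reproduced in outline with scikit-learn; the random matrix stands in for the engineered JSE Top 40 metrics:

```python
# Standardise engineered equity metrics, embed with t-SNE, cluster
# hierarchically, then score with the two internal metrics used.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))  # returns, volatility, volume, EPS, P/E, ...
X_std = StandardScaler().fit_transform(X)

emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X_std)
labels = AgglomerativeClustering(n_clusters=5).fit_predict(emb)

print("silhouette:", silhouette_score(emb, labels))
print("Davies-Bouldin:", davies_bouldin_score(emb, labels))
```
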
Advancing Distal Radius Fracture Classification Using Metric Learning: A Triplet Neural Network Approach | Recent advancements in computer vision and deep learning have enhanced distal radius fracture analysis, offering the potential to alleviate challenges in medical diagnostics in developing countries. This research investigates the application of metric learning architectures, particularly triplet neural networks, for the classification of distal radius fractures according to the Arbeitsgemeinschaft für Osteosynthesefragen/Orthopedic Trauma Association (AO/OTA) fracture classification system. The study aims to address challenges associated with data scarcity and model generalisation while improving automated fracture detection and classification accuracy.
The research followed the cross-industry standard process for data mining (CRISP-DM), progressing through business understanding, data preparation, modelling, and evaluation phases. The GRAZPEDWRI-DX dataset was utilised as the source domain for transfer learning to perform fracture object detection on a small target distal radius dataset (DIRAD), alongside traditional data augmentation techniques to mitigate data limitations. The object detection model, based on the eighth version of the you only look once (YOLOv8) architecture, achieved a mean average precision (mAP) of 93.8% at 50% Intersection over Union (IoU) on the GRAZPEDWRI-DX dataset and 73.1% on the DIRAD dataset. The feature extractor of a visual geometry group (VGG) 19-layer convolutional neural network (CNN), alongside a custom embedding neural network, was employed as the foundation of the developed triplet neural network, which classified distal radius fractures according to the AO/OTA classification system. The triplet neural network incorporated the triplet margin loss function and a semi-hard triplet sampling strategy and was trained separately on posteroanterior (PA) and lateral radiograph projections. Despite the triplet neural network achieving high training F1-scores of up to 97% for the PA projection, the models exhibited limited generalisation, underscoring the need for additional data or refined augmentation strategies. A comparative analysis with prior research highlighted the strengths and limitations of the proposed approach. Independent evaluations of PA and lateral projection models revealed complementary strengths, which could be integrated into ensemble modelling strategies. The findings demonstrate the feasibility of triplet neural networks for distal radius fracture classification but emphasise the necessity of future work to address generalisation challenges. Proposed enhancements include integrating generative adversarial networks for data synthesis, employing segmentation to simplify fracture classification, and using ensemble models for improved diagnostic accuracy and reliability. This research represents a first step in applying metric learning architectures to the AO/OTA distal radius fracture classification system and provides insights for further research in the field. |
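
The metric-learning core can be sketched with PyTorch's built-in triplet margin loss; the random tensors stand in for VGG19 features, and the semi-hard mining strategy is omitted:

```python
# Embedding network trained with triplet margin loss so same-class
# radiographs embed closer together than different-class ones.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 64))
criterion = nn.TripletMarginLoss(margin=1.0)

# anchor/positive share an AO/OTA class; negative comes from another class.
anchor_feat = torch.randn(32, 512)    # e.g. backbone features of anchors
positive_feat = torch.randn(32, 512)
negative_feat = torch.randn(32, 512)

loss = criterion(embed(anchor_feat), embed(positive_feat),
                 embed(negative_feat))
loss.backward()  # gradients flow into the embedding network
print(loss.item())
```
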
Office Carbon Dioxide level prediction using model confidence and signal resolution | Indoor air quality (IAQ) is considered to have major health and wellness implications, with indoor air pollution (IAP) estimated to have tenfold the negative impact on people relative to outdoor pollution. It is estimated that people spend an average of 80%-90% of their time indoors. It is therefore in the interest of overall population health to develop robust IAQ monitoring and control systems. In contributing to the advancement of monitoring IAQ, this study focuses on the prediction of indoor CO2 concentrations using machine learning algorithms.
Indoor CO2 concentrations can change rapidly, and thus a purely real-time monitoring system is likely inadequate for maintaining healthy IAQ conditions. Therefore, although sensors are available, there is a requirement for predictive monitoring systems that rely on robust and accurate models. Researchers have explored the development and implementation of physics-based and machine learning algorithms for this purpose. The consensus in the literature is that machine learning algorithms outperform physics-based algorithms. This outcome is dependent on the availability and quality of the data used as inputs to the models. The problem is that limited focus has been placed on the data quality and characteristics in the development of indoor CO2 prediction algorithms. This research project addresses the impact of noise in the data on overall prediction performance. It is expected that the input variables used in the development of CO2 prediction models are nonstationary and susceptible to noise. This may influence the prediction performance of the algorithms. Wavelets can be used to filter signals, thereby removing noise and retaining essential information from the original signal. The caveat is that implementing the wrong filter at suboptimal levels could lead to signal distortion and information loss, thus negatively influencing the prediction performance. To minimize this risk, dynamic signal resolution can be used in training. In this project, a method that implements various wavelets at varying decomposition levels is employed. The outputs of the wavelet transforms are used to train an ensemble of LSTMs, from which the most confident models are selected for prediction. For comparative analysis, a predictive model based on fixed signal resolution was developed. The performance of the two models was compared using the mean absolute error (MAE), mean absolute percentage error (MAPE), root mean squared error (RMSE) and coefficient of determination (R2). The implementation of the dynamic signal resolution model framework required nearly three times the execution time of the fixed-resolution approach. This additional computation resulted in no performance improvements. Instead, it was observed that using dynamic signal resolution resulted in limited prediction capability in areas of high CO2 concentrations, confirming the potential risks of information loss associated with signal filtering. Additionally, the fixed-resolution model demonstrated superior performance, reporting a MAE, MAPE, RMSE and R2 of 1.02 ppm, 0.3%, 2.365 ppm and 0.99, respectively, whilst the dynamic signal resolution model's metrics deteriorated to 14.96 ppm, 2.7%, 27.48 ppm and 0.91. |
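
The fixed-resolution wavelet filtering step can be sketched with PyWavelets; the wavelet family, decomposition level, thresholding rule, and synthetic CO2-like signal are all illustrative assumptions:

```python
# Discrete wavelet decomposition, soft-thresholding of detail coefficients,
# and reconstruction of a noisy CO2-like signal.
import numpy as np
import pywt

rng = np.random.default_rng(0)
t = np.linspace(0, 24, 1024)                    # one day, arbitrary sampling
co2 = 600 + 150 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 20, t.size)

coeffs = pywt.wavedec(co2, "db4", level=3)
sigma = np.median(np.abs(coeffs[-1])) / 0.6745  # robust noise estimate
thresh = sigma * np.sqrt(2 * np.log(co2.size))  # universal threshold
denoised_coeffs = [coeffs[0]] + [
    pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]
]
co2_denoised = pywt.waverec(denoised_coeffs, "db4")
print("residual std:", np.std(co2 - co2_denoised[: co2.size]))
```
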
Data-Driven Predictive Maintenance for Enhanced Reliability of Continuous Miners in Underground Coal Mining | The global coal mining industry faces increasing challenges due to deteriorating operational conditions and ageing machinery. As mining companies strive to optimise their processes, there is a significant opportunity to enhance maintenance strategies to ensure machines operate at the lowest possible cost. This research assignment explores the application of data science in analysing electrical data from continuous miners to identify anomalies and alert maintenance personnel of potential failures before they occur. By employing both conventional machine learning and deep learning techniques, the aim is to determine the most effective approach for predictive maintenance. This study represents a pioneering effort in South Africa, focusing on the application of Markov chains for anomaly detection in the coal mining sector. By leveraging the Markov property and integrating it with the Mahalanobis distance, the research developed a robust framework that enhances anomaly identification. This dual approach not only enriches data science analytical capabilities but also introduces innovative perspectives in industrial maintenance. By bridging traditional clustering techniques with advanced statistical methods, the research opens new avenues for enhanced anomaly detection, offering valuable insights for future research and practical applications. |
Advancing the Argument for Shallow Models: A Comparative Analysis against Deep Learning Approaches | The rapid adoption of deep learning in various domains has led to an overreliance on complex architectures, often at the expense of simpler models that may be equally effective. This trend raises concerns about unnecessary computational costs, reduced interpretability, and increased carbon footprints, particularly in cases where shallow models could provide comparable results. This research assignment aims to evaluate the necessity of deep learning models by conducting a comparative analysis against shallow models. The study seeks to determine under what circumstances simpler models are preferable, providing a more resource-efficient and interpretable alternative to deep learning approaches. The research employs a mixed-methods approach, combining scientometric analysis, an extensive literature review, and selected case studies. The study critically assesses the performance of shallow versus deep models across various applications, focussing on criteria such as accuracy, computational efficiency, and scalability. The findings reveal that shallow models, when properly optimised, can achieve performance levels comparable to those of deep learning models in several contexts. Moreover, these models offer benefits in terms of lower computational demands and greater interpretability, challenging the prevailing trend of defaulting to deep learning solutions. The study concludes that deep learning is not always the best choice, advocating for a more thoughtful selection of models based on the specific needs of the application. By highlighting the strengths of shallow models, this research contributes to a more balanced approach to machine learning, encouraging the industry to reconsider when deep architectures are truly necessary. |
Data Science Approaches for Addressing Missing Values in the Transcriptome of Plasmodium falciparum | The development of new antimalarial drugs and vaccines relies heavily on the understanding of the genetics of Plasmodium falciparum. Transcriptomic data, a valuable resource for such insights, is often documented in ’omics datasets. However, these datasets are often plagued by missing values. Missing values significantly hinder downstream biological analysis. Accurate imputation of these missing values is imperative for the analysis of ’omics datasets and the discovery of novel antimalarial drugs.
This research assignment investigates missing value imputation techniques to identify a suitable method for accurate imputation of missing values in a transcriptomic dataset. Various approaches, including single imputation, multiple imputation, machine learning imputation, and deep learning imputation are explored. Single imputation methods, such as mean/median imputation, lowest of detection (LOD), and random tail imputation (RTI), often fail to capture the complex relationships inherent in gene expression data. Consequently, advanced methods, namely multiple imputation by chained equations (MICE), expectation maximisation (EM), k-means, fuzzy c-means (FCM), k-nearest neighbours (KNN), self-organising maps (SOM), density-based spatial clustering of applications with noise (DBSCAN), feedforward neural network (FNN), autoencoder (AE), and generative adversarial imputation network (GAIN), are investigated for their suitability in handling missing value imputation in transcriptomic data. Each method is assessed against a set of criteria derived from the literature. The selected imputation method is evaluated on datasets with varying percentages and mechanisms of missing data. The quality of imputation is assessed using quantitative metrics, such as root mean squared error (RMSE) and mean absolute error (MAE). Additionally, the impact of imputation on downstream analysis tasks, such as clustering, is examined. The SOM is selected as the imputation method. The imputation results consistently yield RMSE and MAE values lower than the standard deviation of the data. These results indicate that the errors fall within the acceptable range given the natural variability of gene expression data. Subsequent k-means clustering performed on the imputed data showed that imputation did not affect the quality of the clusters. This finding underscores that SOM imputation adequately preserves the biological structure of the data. |
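
The evaluation protocol, masking known entries, imputing, and comparing RMSE and MAE against the data's standard deviation, can be sketched as follows; scikit-learn's KNNImputer stands in here for the SOM imputer the study selected:

```python
# Mask 10% of known entries, impute, and score RMSE/MAE against the
# data's natural variability.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X_true = rng.normal(loc=5.0, scale=2.0, size=(100, 20))  # toy expression data

mask = rng.random(X_true.shape) < 0.10  # 10% missing completely at random
X_missing = X_true.copy()
X_missing[mask] = np.nan

X_imp = KNNImputer(n_neighbors=5).fit_transform(X_missing)

err = X_imp[mask] - X_true[mask]
print("RMSE:", np.sqrt(np.mean(err ** 2)))
print("MAE:", np.mean(np.abs(err)))
print("data std:", X_true.std())        # acceptability yardstick
```
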
An Automated Computer Vision System to Measure Excavator Productivity | This study develops an excavator productivity model to measure and optimise construction productivity using computer vision techniques, addressing inefficiencies in construction operations and improving excavator performance. The model analyses video input with computer vision, focusing on near-optimal real-time tracking techniques. Object detection algorithms, including you only look once (YOLO) and Faster Region-based Convolutional Neural Network (Faster R-CNN), were initially explored for accurate tracking of excavator movements on construction sites. Results indicated that YOLO offered superior generalisation and performance, yielding more accurate bounding box coordinates for tracking excavators.
The dataset developed included resized and labelled excavator videos, uniformly processed for consistent colour and format. Each video was divided into three-second intervals, annotated by activity. To measure productivity, a two-phase activity recognition model was developed. Initially, a VGG16 feature extractor combined with a simple long short-term memory (LSTM) model classified the excavator as static or moving, which achieved 100% accuracy in movement detection. The second phase involved designing an advanced activity recognition model to classify specific excavator tasks, including soil pick-up, hauling, and drop-off, focusing on task duration analysis and process optimisation. Various models and processes were tested, considering different angles, excavators, and backgrounds. The model performed well but required extensive training data and computational power for optimal accuracy. With 300 to 400 labelled videos containing three-second activity segments, the accuracy of the model ranged between 80% and 100%, depending on the similarity of the test data to the training environment. Despite challenges such as lighting variations and insufficient data quality, the model demonstrates potential in tracking excavator activities. Future efforts will aim to expand the model to other machinery and enhance real-time performance, potentially yielding significant efficiency improvements and cost reductions. |
December 2024 Graduation

Title | Abstract |
---|---|
Active Learning in Bagging Ensembles | This study investigates the integration of dynamic pattern selection (DPS) and ensemble learning (EL) to enhance the performance of feed-forward neural networks trained with gradient descent backpropagation, particularly addressing the bias-variance dilemma while reducing computational complexity. DPS, introduced by Röbel (1994), is an active learning technique that incrementally adds patterns with the highest errors to the training data, aiming to achieve similar generalization results as standard backpropagation with less computational expense. Bagging-based EL combines the predictions of multiple models trained on resampled subsets of the original data to improve generalization performance, albeit with increased computational demands.
In this research, DPS and EL were applied independently and in combination to neural networks, evaluated on four classification problems and two regression problems. The experiments tested four scenarios: standard NNs, NNs with only EL, NNs with only DPS, and a combination of both (referred to as EL AL NN). The results demonstrated that DPS achieved similar performance to standard backpropagation while reducing computational cost. Specifically, for the iris and hepatitis classification problems, DPS showed better generalization, possibly due to reduced overfitting. EL improved generalization across all classification and regression problems, confirming its effectiveness despite higher computational complexity. When combining DPS with EL, the study found that for two of the four classification problems and both regression problems, the EL AL NN matched the generalization of EL while reducing computational complexity. However, for the iris and wine classification problems, the EL AL NN did not generalize as well, with a reduction in the generalization factor below one, suggesting overfitting as a possible cause. |
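
Dynamic pattern selection can be sketched schematically as below; the MLP, error proxy, and increment size are placeholders rather than Röbel's exact formulation:

```python
# Train on a small subset, then repeatedly add the candidate patterns on
# which the current model makes the largest errors.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
selected = list(range(0, len(X), 15))  # small seed set spanning all classes

for _ in range(8):  # incremental training rounds
    model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                          random_state=0).fit(X[selected], y[selected])
    pool = np.setdiff1d(np.arange(len(X)), selected)
    if pool.size == 0:
        break
    # Error proxy: one minus the probability assigned to the true class.
    errors = 1.0 - model.predict_proba(X[pool])[np.arange(pool.size), y[pool]]
    selected += list(pool[np.argsort(errors)[-5:]])  # add 5 worst patterns

print("patterns used:", len(selected), "of", len(X),
      "training accuracy:", model.score(X, y))
```
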
Adaptive Machine Learning for the Optimization of a Water Treatment Clarification System | Digitalization is currently a major topic of discussion within industry, with the aim of providing descriptive, diagnostic, predictive and prescriptive feedback to all levels of business involvement. The water treatment industry is no exception; to ensure live responses to ever-changing feed water conditions, often under-utilized data sources can be incorporated into an intelligent system for optimal control.
The subject of this study was a clarification system employed as pre-treatment for the production of purified water from organically rich wastewater. The controls of the concerned system before the study were mostly static and linear in nature, with a large degree of human interaction required to reach a sub-optimal target of minimum overflow turbidity [measure of water clarity] based on feed turbidity. It was therefore desired to develop a system that models the process and optimizes the overflow quality by adjusting the feed coagulant [chemical that neutralizes charge and encourages clumping] and flocculant [chemical that binds clumped solids together] dosages on a continuous basis as a prescriptive feedback system. To address the problem at hand, features were selected based on expert knowledge, after which R was used for the data handling and analysis. Raw data was ingested from an MSSQL database and MS Excel data files, which were assessed for quality issues and then combined and processed accordingly. The resulting input features were feed coagulant and flocculant dosages, tank level, turbidity, pH, temperature and flow rates to the 4 parallel clarifiers. The target feature consisted of a 1:1 weighting between overflow turbidity and COD [chemical oxygen demand] after min-max scaling between 0 and 1. Of the 3 tree-based models tested, the random forest model was found to be optimal with a testing RMSE [root mean square error] of 0.0761 units, which compared well with the median target of 0.22 ±0.09 units. An XGBoost model was then used to optimize the fitness function consisting of overflow quality and coagulant and flocculant dosages through grid search based on cost-benefit principles. This procedure yielded promising simulation results with median relative improvements of 36.4 ±19.7% for the overflow quality, as well as 28.6 ±14.1% and 7.71 ±15.4% for the coagulant and flocculant dosages respectively. Upon live testing these results were verified as 49.1 ±32.3%, 28.6 ±6.34% and 8.52 ±10.4% respectively. These improvements were confirmed through analysis of raw feed conditions. Online retraining and exploration were also tested in simulation. Online retraining was based on deployed model predictive accuracy within a 1-day moving average. Once this surpassed 0.1 units, a minimum of 1 day needed to have passed before retraining. In simulation, on average, the retraining rate was once per 3.12 ±8.99 days with RMSE accuracy on the deployed data range of 0.0838 ±0.0380 units. Exploration was performed by adding randomization to the optimization routine through randomly selecting 100 solutions and subjecting them to a 10-instance tournament obtained through roulette wheel selection and mutation within 10% of the dosage training ranges. It was found that the existing linear correlation between the dosages could be reduced from 0.83 to 0.14 units with a 50% increase in improvement variability, with only between 29 and 62% of the exploited improvements being realized. |
Automated Screening of Chronic Sinusitis from Voice Recordings using Machine Learning | Chronic sinusitis is a common illness that affects millions of individuals worldwide. Currently, the screening of chronic sinusitis involves evaluating patient symptoms, conducting endoscopic examination, or using medical imaging methods such as computed tomography or magnetic resonance imaging. Symptom-based diagnosis is often inaccurate. Endoscopic examinations are invasive and limited by anatomical variability. Medical imaging procedures are expensive and expose patients to ionising radiation, which can raise the risk of cancer. Alternatively, chronic sinusitis could potentially be screened for automatically using voice recordings, an approach that has not been investigated before.
This research assignment proposes the use of machine learning to automatically distinguish between the speech of patients with chronic sinusitis and that of healthy individuals. The dataset used in this research comprises voice recordings of patients who underwent tonsillectomy, septoplasty, functional endoscopic sinus surgery, and minor surgeries unrelated to the nasal cavity or vocal tract. The collected data was down-sampled, noise was reduced using a pre-emphasis filter, and unvoiced speech segments were removed through short-term energy analysis. Several audio features were extracted from the processed audio data, with the most relevant being Mel-frequency cepstral coefficients, spectral contrast, Mel-spectrogram, spectral centroid, spectral flatness, spectral bandwidth, and spectral roll-off. These features were concatenated and reduced in dimensionality using principal component analysis before being used to train various machine learning models, including logistic regression, k-nearest neighbours, decision tree, extreme gradient boosting (XGBoost), random forest, support vector machine and a deep neural network (DNN). The DNN model outperformed all other models considered and was selected for further evaluation using accuracy, precision, and recall metrics. The performance results indicated that the DNN model achieved an accuracy of 0.67 ± 0.0089 and 0.63 ± 0.0089 on the train and test sets, respectively. The obtained performance results are comparable with findings from several voice-based diagnosis studies. Therefore, this study has demonstrated that chronic sinusitis can be detected from voice recordings using machine learning. However, the moderate accuracy of 0.63 on the test set suggests there is still room for improvement. This could be due to factors such as the dataset size, preprocessing techniques, feature selection, and/or machine learning models considered. |
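
The feature-extraction stage maps naturally onto librosa; a minimal sketch follows, with a synthetic tone standing in for the clinical recordings:

```python
# Pre-emphasis filtering, the named spectral features, then PCA.
import numpy as np
import librosa
from sklearn.decomposition import PCA

sr = 16000
y = librosa.tone(220, sr=sr, duration=2.0)  # stand-in voice signal
y = librosa.effects.preemphasis(y)          # boost high frequencies

feats = np.vstack([
    librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
    librosa.feature.spectral_contrast(y=y, sr=sr),
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=20),
    librosa.feature.spectral_centroid(y=y, sr=sr),
    librosa.feature.spectral_flatness(y=y),
    librosa.feature.spectral_bandwidth(y=y, sr=sr),
    librosa.feature.spectral_rolloff(y=y, sr=sr),
])
# One concatenated feature vector per frame; PCA reduces dimensionality.
reduced = PCA(n_components=10).fit_transform(feats.T)
print(feats.shape, "->", reduced.shape)
```
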
Predicting patient outcomes based on adverse drug events using graph neural networks | This research explores the application of graph neural networks (GNNs) in pharmacovigilance, particularly in predicting adverse drug events (ADEs) using data from the Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS). The study begins with an in-depth analysis of the graph data model, representing complex relationships between patients, drugs, reactions, and outcomes. The GNN architecture, specifically a graph multi-layer perceptron (graph MLP), is configured and trained on this graph-structured data to enhance ADE prediction accuracy. Various evaluation metrics, including the F1 score, precision, recall, and accuracy, are employed to assess the performance of the model, alongside a comparative analysis with baseline methods.
The results demonstrate that the GNN model, when properly configured, outperforms conventional approaches in several key metrics, offering deeper insights into drug safety and patient outcomes. Furthermore, the research highlights the potential of GNNs in improving clinical decision-making, strengthening regulatory frameworks, and advancing personalized medicine. However, limitations such as data quality and model interpretability are acknowledged, prompting recommendations for future research. This study contributes to the growing body of knowledge on the use of graph-based models in healthcare, showcasing their applicability in real-world pharmacovigilance practices and paving the way for further advancements in this domain. |
March 2024 Graduation |
|
Title | Abstract |
---|---|
Convolutional neural network filter selection using genetic algorithms | Ever since the release of large language models like ChatGPT, machine learning has garnered worldwide attention from laymen and scholars alike. However, the field of machine learning predates the development of these models by some time and has a rich history of successful applications in a variety of fields. Genetic algorithms and computer vision are two such areas of machine learning that have shown great promise in solving complex problems. Genetic algorithms are a type of evolutionary algorithm that can solve a wide range of optimization problems, while computer vision involves the use of machine learning models to extract insights from image and video data.
The model most commonly used in computer vision applications is a form of neural network called the convolutional neural network. Neural networks are a type of machine learning model that takes inspiration from the structure and functioning of the human brain. Convolutional neural networks refer to a type of neural network model that is especially adept at computer vision tasks such as image classification, object detection and video analysis owing to the use of convolutional layers. Convolutional neural networks can consist of millions of parameters, the majority of which are stored in the filters that the networks use during convolution operations. Thus, one major problem hampering more widespread adoption of convolutional neural network models in practice is the size of these models. The storage required to deploy these models is not trivial, leading to a need for methods to compress these models without materially degrading their predictive capabilities. One such method is filter selection and pruning, which refers to methods that assess the filters in a convolutional neural network and remove the least important filters to reduce the size of the model. This project proposes the use of a genetic algorithm to optimise the process of filter selection, allowing multiple filter selection methods to be applied concurrently. The proposed algorithm allows filters to be pruned adaptively, with the removal methods and number of filters removed being optimised for the network being pruned. When applying the proposed algorithm, we achieve 90.91% model compression at the cost of a 0.13 percentage point accuracy drop for a network trained on audio data. When applied to the classic Fashion-MNIST data set, 91.37% compression is achieved with a corresponding 0.39 percentage point drop in accuracy. We also achieved 86.06% compression while increasing accuracy by 2.37 percentage points on a model trained on the CIFAR-10 data set. These results show the utility of the algorithm and its ability to compress networks adaptively with different architectures trained on different data sets. This study reveals that genetic algorithms can be applied successfully to prune filters from convolutional neural networks and provides the underpinnings for a comprehensive genetic algorithm capable of pruning filters from any given convolutional neural network architecture. |
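A toy sketch of the idea, assuming a fitness function that trades accuracy against compression; the `fitness` stub, population sizes and rates below are illustrative placeholders, since evaluating real masks would involve pruning the CNN and briefly fine-tuning it.

```python
# Genetic filter selection: individuals are binary masks over a layer's filters.
import numpy as np

rng = np.random.default_rng(0)
N_FILTERS, POP, GENS = 64, 20, 30

def fitness(mask):
    # Placeholder: the real algorithm would prune the CNN according to `mask`,
    # fine-tune briefly, and return accuracy minus a model-size penalty.
    accuracy_proxy = 0.9 - 0.002 * abs(mask.sum() - 20)
    compression = 1.0 - mask.mean()
    return accuracy_proxy + 0.1 * compression

pop = rng.integers(0, 2, size=(POP, N_FILTERS))
for _ in range(GENS):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-POP // 2:]]        # truncation selection
    cuts = rng.integers(1, N_FILTERS, size=POP // 2)     # one-point crossover
    kids = np.array([np.r_[parents[i % len(parents)][:c],
                           parents[(i + 1) % len(parents)][c:]]
                     for i, c in enumerate(cuts)])
    flip = rng.random(kids.shape) < 0.02                 # bit-flip mutation
    kids = np.where(flip, 1 - kids, kids)
    pop = np.vstack([parents, kids])

best = pop[np.argmax([fitness(m) for m in pop])]
print("kept filters:", int(best.sum()), "of", N_FILTERS)
```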
The value of Zero-rating internet services to provide essential services to low-income communities | This research assignment explored user interests and usage patterns on a zero-rated internet platform, MoyaApp, in South Africa to determine the value of zero-rating essential services in low-income communities.
This study focused on understanding how users interact with different categories of essential services offered through the MoyaApp platform, a Datafree subsidiary, particularly on grants, education, jobs, and other information services such as weather and electricity. The researcher used data mining techniques such as temporal association rule mining and other statistical methods to analyze user interests and usage patterns. The findings revealed that many low-income users initially registered on MoyaApp to access grant services; users gradually explored other essential services over time and became regular platform users. The researcher proposed a few recommendations to improve the benefits MoyaApp provides to low-income communities: Firstly, MoyaApp should consider expanding the jobs category to cater to users with varying levels of education. Secondly, targeting grant users with information services like weather and electricity encourages engagement. Once users engage regularly with the platform, they are more likely to use more beneficial services such as education and jobs, which leads to improved socio-economic status. Thirdly, the results of this study can be used to develop a recommendation engine to suggest relevant essential services to low-income users. In conclusion, this research assignment demonstrated that providing zero-rated internet services or, more accurately, reverse-billed data to low-income communities can be an effective strategy to enhance access to essential services and bridge the digital divide. |
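To make the rule-mining step concrete, here is a hedged sketch using mlxtend's apriori implementation on invented usage sessions; the category names and thresholds are placeholders, not MoyaApp data.

```python
# Association rule mining over service-usage sessions (toy data).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

sessions = [["grants", "weather"], ["grants", "jobs"],
            ["grants", "electricity", "weather"], ["education", "jobs"]]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(sessions).transform(sessions),
                      columns=te.columns_)
frequent = apriori(onehot, min_support=0.25, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```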
Intelli-Bone: Automated fracture detection and classification in radiographs using transfer learning | Suspected fractures are one of the most common reasons for patients to visit the emergency department (ED) in hospitals [79]. Radiographs, the primary diagnostic tool for suspected fractures, are often assessed by emergency healthcare professionals without specialised orthopaedic expertise. This restriction leads to a high number of diagnostic errors in EDs, with incorrectly diagnosed fractures accounting for over 80% of reported diagnostic mistakes [79].
Given this problem with fracture diagnostics, there is an opportunity to use artificial intelligence (AI) to assist with the diagnosis of fractures. Successful implementation of an AI system that correctly locates and classifies fractures would lead to more accurate prognosis and treatment advice. The selected fracture classification system for this research assignment is the Arbeitsgemeinschaft für Osteosynthesefragen / Orthopaedic Trauma Association (AO/OTA) classification [90]. The object detection models selected in this research to evaluate whether AI can be used for accurate location and classification of fractures according to the AO/OTA classification are the faster region-based convolutional neural network (Faster R-CNN) [115], you only look once version 8 nano (YOLOv8n) [54], you only look once version 8 large (YOLOv8l) [54], and RetinaNet [76]. A secondary problem that this research assignment addresses is that of data scarcity. Deep learning algorithms require large amounts of data to achieve exceptional performance. The target dataset in this research assignment, the distal radius dataset (DIRAD), only consists of 776 images, where roughly half of the images contain fractures. The technique applied to overcome the data scarcity problem is transfer learning. With transfer learning, the object detection models are pretrained on larger datasets such as common objects in context (COCO) [77] and the Graz Paediatric Wrist Digital X-rays (GRAZPEDWRI-DX) dataset [95] before being trained on the target dataset. This research assignment shows that pretraining of object detectors on larger datasets leads to superior performance on scarce datasets. Furthermore, pretraining an object detection model on a large dataset from a similar domain with a similar task, such as GRAZPEDWRI-DX, leads to even better results. The pretraining of the Faster R-CNN, YOLOv8n, YOLOv8l, and RetinaNet on the GRAZPEDWRI-DX improved mean average precision at an intersection over union threshold of 0.5 (mAP50) by an average of 33.6% compared to the same models trained from randomly initialised weights. The best performing model, namely the YOLOv8l, achieved a mAP50 of 59.7% on the DIRAD dataset. |
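A compact sketch of the staged transfer-learning recipe, assuming the ultralytics API; the dataset YAML file names are hypothetical stand-ins for the GRAZPEDWRI-DX and DIRAD configurations.

```python
# Pretrain-then-fine-tune staging for a YOLOv8 detector (illustrative only).
from ultralytics import YOLO

model = YOLO("yolov8l.pt")                      # COCO-pretrained weights
model.train(data="grazpedwri.yaml", epochs=50)  # similar-domain wrist X-rays
model.train(data="dirad.yaml", epochs=100)      # fine-tune on the target set
metrics = model.val()                           # reports mAP50, among others
```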
Evolutionary multi-objective optimisation algorithms for a multi-objective truck and drone scheduling problem | In the rapidly evolving landscape of e-commerce, the efficiency of last-mile delivery emerges as a critical bottleneck in the logistics chain. This research addresses the complexities of last mile delivery, a process significantly burdened by high costs, environmental concerns, and the increasing consumer demand for quick and convenient service. By focusing on the integration of drones with traditional truck delivery systems, this study explores an innovative solution to the challenges faced in business-to-consumer (B2C) logistics. The utilization of a combined truck and drone system presents a novel approach to optimizing delivery routes and reducing both delivery times and operational costs. This assignment introduces a multi-objective traveling salesman problem with drone interception (TSPDi), which simultaneously minimizes total delivery time and distance, thereby addressing the inherent trade-offs in last-mile logistics.
In this assignment, the non-dominated sorting genetic algorithm II (NSGA-II) and the strength Pareto evolutionary algorithm 2 (SPEA2) were adapted for the TSPDi problem, with modifications and enhancements to optimise their performance. A custom population initialisation function was added to both algorithms, improving the starting point for the evolutionary process. In addition, a heuristic mutation method was developed that produces feasible, high-quality solutions. To create a more varied solution pool, a mechanism for selecting unique solutions for both the parent and archive populations was implemented to ensure that no duplicate solutions occurred. This approach was especially successful in keeping a wide range of solutions during extended iterations. Empirical results showed that NSGA-II is better than SPEA2 in scenarios with larger datasets and many delivery nodes, while SPEA2 has a slight advantage in smaller datasets with fewer delivery nodes. Further analysis was performed to compare the performance of the algorithms with those of Ernst [29] and Moremi [52]. Delivery time was the most important factor in the comparison, as it was the objective optimised by Ernst [29] and Moremi [52]. The results showed that the new multi-objective evolutionary algorithms (MOEAs) performed similarly to the single-objective algorithms on the smaller datasets (i.e. 10 and 20 nodes) in terms of the delivery time metric; however, in most cases they did not perform better. For larger data sets (i.e. 50 to 500 nodes), the MOEAs outperformed all algorithms developed by Moremi [52] and were more competitive compared to algorithms developed by Ernst [29], surpassing them in performance on most large data sets. For the truck distance metric, the MOEAs outperformed most of the single-objective evolutionary algorithms (EAs) for smaller and larger datasets. This was expected, since the single-objective EAs were designed to optimise time rather than distance. |
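Central to both NSGA-II and SPEA2 is ranking schedules by Pareto dominance over the two objectives; the helper below is a minimal, illustrative extraction of the non-dominated front for (delivery time, distance) pairs, not the assignment's implementation.

```python
# Non-dominated front for a bi-objective minimisation problem.
import numpy as np

def pareto_front(objectives: np.ndarray) -> np.ndarray:
    """Boolean mask of non-dominated rows; both objectives are minimised."""
    n = len(objectives)
    nondominated = np.ones(n, dtype=bool)
    for i in range(n):
        # Row i is dominated if some row is <= on both objectives and < on one.
        better_eq = (objectives <= objectives[i]).all(axis=1)
        strictly = (objectives < objectives[i]).any(axis=1)
        dominated_by = better_eq & strictly
        dominated_by[i] = False
        if dominated_by.any():
            nondominated[i] = False
    return nondominated

objs = np.array([[10.0, 40.0], [12.0, 35.0], [11.0, 45.0], [9.0, 50.0]])
print(objs[pareto_front(objs)])   # [11, 45] is dominated by [10, 40]
```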
Evolving encapsulated neural network blocks using a genetic algorithm | In recent years, artificial intelligence, with its subfields of deep learning and evolutionary computation, has experienced remarkable growth. This expansion can be attributed to the increased availability of computational power and the potential value these domains offer. Consequently, this growth has fueled intensified research and attention, presenting the challenge of staying current with the rapid advancements. Furthermore, the advent of deep learning has led to the ever-increasing size and complexity of neural networks, pushing the boundaries of computational capabilities. This project investigates the viability of utilising a genetic-based evolutionary algorithm to automate the discovery of subnetworks within convolutional neural networks (CNNs), referred to as blocks, for image classification. Inspired by architectural elements in well-known CNNs like ResNet and GoogLeNet, these blocks are designed to be reusable, repeatable and modular.
The first part of this project entailed the development of a framework to represent CNN architectures, which drew inspiration from the concept of neuroevolution of augmenting topologies (NEAT). This developed representation framework was used to define the composition and layout of CNN architectures. Next, a genetic algorithm was adapted to fit within the framework, thus enabling the evolution of CNN blocks using various evolutionary operators, including mutation, speciation and crossover. The representation framework and genetic algorithm were combined to evolve a population of 100 CNN blocks over 30 generations. Throughout the evolution process, the search was guided by the measured quality of the blocks, defined by a fitness function that was designed to balance complexity and performance. Five repetitions of the experiment were performed and compared to randomly generated blocks to assess the overall success of this approach. Additionally, the performance of the evolved blocks was evaluated against manually designed blocks such as ResNet and GoogLeNet’s Inception. The results of the comparison between the genetic algorithm and random procedures demonstrated the effectiveness of the genetic algorithm in producing high-quality solutions based on the fitness evaluation. The results showing the distribution of the evolutionary operators across the population also explained how the subprocedures can be used to control the search effectively. Furthermore, the results obtained using a small sample of the best performing evolved blocks proved to be highly competitive when compared to manually designed counterparts, namely ResNet and Inception. This study validates the concept of using evolutionary algorithms for neural network block generation and emphasises their ability to rival manually designed networks. The findings suggest that evolutionary computation successfully automates the discovery of competitive blocks within CNN architectures, offering new avenues for neuroevolution and overcoming limitations in the manual design processes. |
Machine Learning for Aquaponic System Mortality Prediction and Planting Area Optimisation | Aquaponics is a sustainable farming method that combines aquaculture with hydroponics. Machine learning and the internet of things (IoT) can be used to improve the profitability and efficiency of aquaponic plants. This project proposes a machine learning-based IoT system for aquaponics that can predict fish mortality and optimize crop growing areas. The system collects data on water quality, fish behaviour, and plant growth. This data is then used to train machine learning models to predict fish mortality and to optimize crop growing areas. The proposed machine learning-based IoT system has the potential to improve the profitability and efficiency of aquaponic plants. This could lead to wider adoption of aquaponics as a sustainable farming method. |
Spatio-Temporal Modelling of Road Traffic Fatalities in the Western Cape | Road traffic accidents are a serious problem in South Africa. Responding to the World Health Organisation’s Decade of Action for Road Safety, the Western Cape sought new techniques and initiated the application of data science and machine learning tools to act as a decision support system. In this light, this project seeks to develop a machine learning model capable of predicting, in time and space, the probability of a fatal road event. This is done by aggregating relevant features of the Western Cape into an H3 grid whereby patterns in fatal events are learned. Traditional machine learning techniques and deep learning techniques are used to learn the relationship between the aggregated features and fatal road events, with the aim of outperforming the historical average models currently used in industry. This is the first attempt at using machine learning techniques to model road traffic fatalities in South Africa and the Western Cape. |
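A hedged sketch of the H3 aggregation step, assuming the h3 v4 Python API (the function was named `geo_to_h3` in v3); the coordinates, resolution and column names are illustrative.

```python
# Bucket accident points into H3 hexagons so features and fatality counts
# can be joined per cell.
import pandas as pd
import h3

accidents = pd.DataFrame({
    "lat": [-33.93, -33.92, -34.05],
    "lon": [18.42, 18.44, 18.60],
    "fatal": [1, 0, 1],
})
res = 7  # hexagon resolution; higher values mean smaller cells
accidents["cell"] = [h3.latlng_to_cell(la, lo, res)
                     for la, lo in zip(accidents.lat, accidents.lon)]
per_cell = accidents.groupby("cell")["fatal"].agg(["sum", "count"])
print(per_cell)  # fatality counts per hexagon, ready for feature joins
```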
Using Tree-Based Machine Learning Models to Improve Upon the Least-Squares Method of Quantifying Mineralogy using Bulk Chemical Compositional Data | Geometallurgy is an interdisciplinary science that utilises geological and metallurgical data to optimise ore-to-metal processing routes. Knowledge of the spatial distribution of minerals (and hence metals) within the ore body forms the basis of a geometallurgical model. Information about an ore body’s chemistry and quantitative mineralogy can be obtained through drill core logging exercises. The process of drilling cores, collecting samples, and analysing them is costly and time-consuming. As a result, other quick and inexpensive methods of deriving modal mineralogy have been proposed.
Element-to-mineral conversion (EMC) refers to the method of using bulk rock compositional data to calculate mineral grade quantities. EMC is a chemical mass balancing technique that utilises the bulk rock chemistry, b, and the minerals’ compositional data, A, to solve for modal mineralogy, x. Chemical mass balances are expressed as a set of simultaneous equations, Ax = b, that can be solved using the least-squares approach (LS-EMC). LS-EMC can only be applied if the number of unknowns (minerals) is less than or equal to the number of known variables (elements). It is often the case that there are more minerals than elements. However, minerals can be grouped such that the number of resultant mineral sets is equal to the number of elements. Although this method of grouping minerals is sufficient for geometallurgical models, it is insufficient for mineral processing models which require exact quantities for individual minerals. This study sought to investigate alternative data-science-based methods to LS-EMC. Data science is an interdisciplinary field that focuses on the application of computational statistical methods, such as machine learning, for the extraction of knowledge from data. Three tree-based machine learning (ML) algorithms, namely decision tree, random forest, and extra trees, were trained to predict mineral grade quantities using positional and geochemical data. The dataset used in the investigation consisted of 135 observations sourced from a geological study conducted on the Kalahari Manganese Deposit (KMD) (Blignaut, 2017). LS-EMC was also applied, and the mineral grade estimates obtained by this method were compared to the ML models’ output. The R2 statistic was used to quantify how well the LS-EMC and ML-EMC output agreed with the modal mineralogy measurements obtained through quantitative X-ray diffraction (QXRD). In comparison to the other techniques, the modal mineralogy results from the extra trees regressor correlated the most with the QXRD measurements, achieving R2 scores > 0.5 for six out of the eight mineral groups. Furthermore, the extra trees algorithm outperformed the other two tree-based models in a test designed to see which ML algorithm provided the most reliable mineral quantity predictions for ungrouped minerals. The results of this study support the conclusion that tree-based machine learning algorithms can be used to improve upon the shortcomings of LS-EMC. |
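The two families of approaches can be contrasted on toy numbers, as in the sketch below: a least-squares EMC solve (here with a non-negativity constraint via SciPy's nnls, since mineral fractions cannot be negative) next to an extra-trees regressor; all matrices and targets are invented, not KMD data.

```python
# LS-EMC versus a tree-based regressor on invented compositions.
import numpy as np
from scipy.optimize import nnls
from sklearn.ensemble import ExtraTreesRegressor

# A: element composition of each mineral (elements x minerals); b: bulk assay.
A = np.array([[0.40, 0.05, 0.10],
              [0.02, 0.55, 0.20],
              [0.10, 0.10, 0.45]])
b = np.array([0.21, 0.25, 0.24])
x, residual = nnls(A, b)          # solves A @ x ~= b subject to x >= 0
print("LS-EMC mineral fractions:", x.round(3))

# Tree-based alternative: learn mineral grades directly from assay features.
rng = np.random.default_rng(0)
X_train = rng.random((135, 3))                 # stand-in geochemical features
y_train = X_train @ rng.random((3, 3))         # stand-in grade targets
model = ExtraTreesRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("ML-EMC prediction:", model.predict(b.reshape(1, -1)).round(3))
```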
Optimisation algorithms for a dynamic truck and drone scheduling problem | With the increasing popularity of online shopping and higher customer demand for better service delivery, the importance of last-mile delivery is growing. The last mile, the final delivery to the customer, comes at a high cost to the retail industry and the environment through pollution caused by delivery vehicles. With the advancement in drone technology, delivery strategies like a truck and drone combination, which makes deliveries in parallel, have become viable. Improving the routing and scheduling of these combined vehicles reduces the high cost of last-mile delivery. Therefore, shortening the route by having the drone intercept the truck offers a significant benefit.
In this research assignment, the coordinates of customer nodes are randomly changed to simulate a dynamic environment while a truck and drone system performs deliveries. This problem is referred to as the dynamic travelling salesperson problem with drone with interception (DTSPDi). This research assignment solves the problem using the ant colony system (ACS) [30], the MAX-MIN ant system (MMAS) [87] and a modified ACS that transfers pheromone knowledge to the next time slice (ACS-KT). The research assignment builds on the algorithm designed by Moremi [64] for the travelling salesperson problem with drone with interception (TSPDi). The three algorithms use 30 datasets of different sizes and spatial patterns for input. The result from the benchmarking was that ACS-KT outperformed the other two algorithms in both the time and distance dimensions. Interestingly, a lower wait time does not mean a lower time or distance for a route. There was also no correlation between drone and truck distances. Therefore, it seems that ACS-KT is better at handling dynamic environmental changes for the DTSPDi problem. |
Review of Big Data clustering methods | In an era defined by the challenges of processing vast and complex datasets, the study delves into the evolving landscape of big data clustering. It introduces a novel taxonomy categorizing clustering models into four distinct groups, offering a roadmap for understanding their scalability and efficiency in the face of increasing data volume and complexity.
The essence of this research lies in its pursuit to critically review, analyze, and evaluate various clustering models, focusing on their suitability and adaptability in handling big data, characterized by the four Vs, i.e. velocity, variety, volume, and veracity. The aim is to discern the operational dynamics of diverse clustering models, considering the findings of prior literature, which have demonstrated varying degrees of performance of these models based on selected metrics. The methodology is firmly rooted in the execution of a series of experiments on chosen clustering methods, metrics, and datasets. This empirical method is crucial to extrapolate how each model fares across different metrics and datasets, offering a comparative perspective on their performance. Subsequent to the experimental phase, an extensive analysis was conducted, breaking down the selected approaches into their algorithmic components. This decomposition is pivotal to identify the origins of gains, losses, or tradeoffs in performance, allowing for an in-depth understanding of why certain models outperform others concerning given metrics and datasets. Insights from this research highlighted the scalability and efficiency of models like parallel k-means and mini-batch k-means, both theoretically and empirically, marking them as exemplary for large-scale applications. Conversely, it unveiled the computational constraints of models like selective sampling based scalable sparse subspace clustering (S5C) and purity weighted consensus clustering (PWCC), showing their limitations in scaling to big data. Acknowledging the limitations imposed by the resource constraints of Google Colab Pro+, the study presents the constraints faced during the evaluation process. The culmination of this project is marked by a comprehensive performance summary, offering key insights into the strengths and weaknesses of the models considered and proffering informed advice on the contextual utilization of each model. It lays the foundation for a centralized database for clustering research, aiming to fill existing knowledge gaps and facilitate optimal model discovery tailored to specific needs and infrastructural capabilities. In conclusion, this research stands as an exploration and analysis in the field of big data clustering, to uncover the potentials and bottlenecks of various models, and offers valuable insights and recommendations, all while reconciling theoretical complexities with empirical validations. |
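As a concrete instance of the scalable end of that spectrum, the sketch below runs scikit-learn's mini-batch k-means on synthetic blobs; the cluster count and batch size are illustrative choices.

```python
# Mini-batch k-means: fits on streamed mini-batches rather than the full set.
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)
model = MiniBatchKMeans(n_clusters=5, batch_size=4096, random_state=0)
labels = model.fit_predict(X)
# Silhouette on a subsample keeps the evaluation itself scalable.
print("silhouette:", silhouette_score(X, labels, sample_size=5000,
                                      random_state=0))
```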
Clustering free text procurement data | The mining industry, like most others, is faced with a diverse range of challenges. Mining companies are now looking into leveraging advanced data analytics to gain insights from their data to make data-driven decisions and inform process debottlenecking to improve throughput and operating costs. Company A grapples with 50% of its group-wide procurement spend stored as unstructured text data, hindering in-depth cost analysis due to variations in describing the same items. The difficulty associated with free-text descriptions in procurement spending is that a single-item purchase can be articulated using various string expressions. Given the thousands of records generated monthly, manually aggregating these diverse strings for in-depth analysis or relying on simple lookups would prove laborious and inefficient. The literature review underscores the rising trend of organisations adopting text-mining techniques to extract insights from unstructured data. This research assignment delved into various techniques such as TF-IDF feature selection, LSA, and word embedding feature transformation, leveraging data from Company A’s procurement database. The exploration of k-means and agglomerative hierarchical clustering (AHC) text clustering techniques revealed that AHC performed better, yielding a high silhouette coefficient and passing validation inspection by a domain expert. Clustering results were analysed in Power BI, leading to the conclusion that while traditional text clustering techniques are effective, modern approaches to feature selection and dimension reduction are essential for optimal results. The research assignment successfully achieved its goal of enabling data analysis through the clustering of free text data. |
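A minimal, hedged version of this pipeline on invented item descriptions: TF-IDF features, LSA reduction, then agglomerative clustering scored with the silhouette coefficient; the vectoriser settings and cluster count are illustrative choices, not the assignment's configuration.

```python
# TF-IDF -> LSA -> agglomerative clustering on toy procurement strings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

descriptions = ["BOLT M12 GALV", "GALVANISED BOLT 12MM", "BEARING 6205 SKF",
                "SKF BEARING 6205-2RS", "PUMP SEAL KIT", "SEAL KIT FOR PUMP"]
# Character n-grams cope with abbreviations and word-order variations.
X = TfidfVectorizer(analyzer="char_wb",
                    ngram_range=(3, 4)).fit_transform(descriptions)
X_lsa = TruncatedSVD(n_components=4, random_state=0).fit_transform(X)
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_lsa)
print(labels, silhouette_score(X_lsa, labels))
```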
Few-shot learning for passive acoustic monitoring of endangered species | The Hainan gibbon is a primate from the Chinese island-province of Hainan. The population of this primate has been in decline because of poaching, and is now facing extinction. Bioacoustics is a field concerned with the acquisition and study of animal sounds. Passive acoustic monitoring is an important step in data capture, and often captures months of data. Due to the low population numbers of endangered species, experts spend a large amount of time on the analysis and identification of bioacoustic signatures.
Machine learning can be used to automate the bioacoustic identification of species, which would reduce analysis costs and time. Unfortunately, many machine learning algorithms require large amounts of data to perform reliably. Few-shot learning is a loosely defined structure in machine learning that aims to solve the limited data problem with unique approaches. This assignment explores the viability of accurate, image-based classification models when subject to low data volumes. Audio data is converted to spectrograms and used in image analysis. A Siamese framework, which has roots in convolutional neural networks (CNN), is the foundation of the few-shot learning approach. Within this CNN-based framework, contrastive-loss and triplet loss architectures, data augmentation techniques, transfer learning methods, and reduced image resolution datasets are investigated. The results indicate that the triplet-loss architecture produces the most accurate models, with excellent precision, recall, and F1-score statistics. The triplet-loss models prefer lower resolution images, which reduce computation time and cost. Importantly, the performance of the triplet-loss models is not affected by low data volumes. On the other hand, contrastive-loss models show significant performance degradation on lower data volumes. Overall, the triplet-loss “base CNN” model is the recommended network. This network achieves an accuracy of 99.08% and F1-score of 0.995. The Siamese framework has demonstrated a strong ability to identify the bioacoustic signature of the Hainan gibbon. Recommendations are provided for further research in this domain. |
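The sketch below shows the skeleton of a triplet-loss embedding network for spectrogram inputs, assuming PyTorch; the small CNN is an illustrative stand-in, not the assignment's "base CNN" architecture, and the random tensors stand in for real spectrogram batches.

```python
# Triplet-loss embedding network for spectrogram images.
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.head = nn.Linear(32 * 4 * 4, dim)

    def forward(self, x):
        z = self.head(self.features(x).flatten(1))
        return nn.functional.normalize(z, dim=1)   # unit-length embeddings

net = EmbeddingNet()
criterion = nn.TripletMarginLoss(margin=0.5)
# anchor/positive: gibbon-call spectrograms; negative: background noise.
a, p, n = (torch.randn(8, 1, 64, 64) for _ in range(3))
loss = criterion(net(a), net(p), net(n))
loss.backward()
```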
Digitization Of Test Pit Log Documents For Development Of A Smart Digital Ground Investigation Companion | Various geotechnical companies in South Africa have, over the years, conducted ground investigations using the test pit method. A test pit involves digging a hole into the ground and making observations of the ground conditions. These companies have documented their observations in PDF format. However, given recent technological advancements, there is a growing need to digitize these documents for thorough analysis. In response to this requirement, these companies have furnished these documents to the Civil Engineering Department of Stellenbosch University.
Digitization is a way of converting PDF documents into a format that can be analyzed using a computer. There are two common ways to digitize documents, namely manual and automatic. Manual digitization includes copying and pasting information from documents to a database or retyping information contained in the document into a database. This process is laborious, time-consuming, prone to errors and costly. This project explored and presented an automated way of digitizing documents, using an object detection model for document layout analysis and optical character recognition for extracting alphanumeric characters from images. The object detection model was developed by fine-tuning a pre-trained Faster R-CNN model available in the Detectron2 framework. This process involved leveraging a blend of manually annotated images and synthetically generated annotations. The results demonstrated model R-101 (a variant of R101-FPN) as having a balanced performance based on accuracy and inference time. The values of mAR, mAP and inference time for model R-101 are 74.3%, 71.0% and 0.371 seconds/image, respectively. This object detection model was used to identify and provide ROI coordinates and labels to the optical character recognition algorithm. Various optical character recognition algorithms were evaluated and compared across various image qualities. PaddleOCR outperformed the other three algorithms, achieving a word recognition rate of 96%. Nevertheless, the performance of these algorithms was lower on blurred images as compared to other image qualities. Spelling checking and correction was conducted to improve the recognition rate of PaddleOCR outputs by a further 1.2%. An interactive application, which can be accessed online via a web link or offline on a desktop, was developed for exploring the dataset. This application allows for creating scenarios using multiple slicers to visualize a word cloud of common words and the frequency of characteristics (e.g. soil type, moisture condition and particle size) used to describe each scenario. A semantic search algorithm was fine-tuned using sentence transformers to allow users to query the dataset using natural language, and a separate desktop application was developed to facilitate this. Evaluating the semantic search algorithm revealed precision, recall and F1 score of 68.3%, 65.7% and 67.0%, respectively. Suggestions for further work include performing exhaustive data analysis to discover insights and hidden patterns, training a language model for improving spelling correction, collecting more documents for developing a large geological and engineering dataset, as well as training a question-answering machine learning model to make data and insights more discoverable. |
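For the recognition step, a hedged PaddleOCR sketch is shown below; the image path is a placeholder, and in the full pipeline the inputs would be the ROI crops produced by the Detectron2 layout model rather than a whole page.

```python
# Text recognition with PaddleOCR (illustrative usage).
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")   # loads detection + recognition
result = ocr.ocr("test_pit_log_page1.png", cls=True)
for line in result[0]:                           # one entry per detected line
    box, (text, confidence) = line
    print(f"{confidence:.2f}  {text}")
```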
Comparison of machine learning models on financial time series data | The efficient market hypothesis states that financial markets are efficient and that investors can therefore not make excess profits consistently, because all public information is instantly reflected in the share price. Academia and investors have shown that the efficient market hypothesis does not always hold true and that market prices can be exploited when the right financial trading and price models are used to model the relationship in the underlying data. This research assignment focuses on the development of multiple machine learning models, in combination with a financial trading strategy that utilises a mixture of technical indicators, to compare the performance of different machine learning algorithms on financial time series data.
The financial time series data collected for this research assignment were 10-year minute ticker data. Two foreign exchange rate data sets, the USD/ZAR and ZAR/JPY rates, were used. The other three data sets collected were the S&P 500 index, the FTSE 100 index, and the Brent crude oil index. The first step was to analyse the quality of the data sets. After the quality had been assessed, a trading strategy and financial trading model were used to combine the 20-period moving average, the relative strength index, and the average directional index for the labelling process. Twelve machine learning models were developed to forecast the financial time series data sets. These were the baseline logistic regression, support vector machine, k-nearest neighbour, decision tree, random forest, Elman recurrent neural network, Jordan recurrent neural network, Jordan-Elman recurrent neural network, long short-term memory neural network, time-delay neural network, resilient back propagation feed-forward neural network, and particle swarm optimisation feed-forward neural network. The results from the experiments indicate that the support vector machine performed the best out of all the machine learning models considered, while the baseline logistic regression model outperformed all the remaining models. The random forest and resilient back propagation feed-forward neural network models performed third and fourth best. These two models had higher recall scores than most models, but their accuracy scores were significantly lower than those of the baseline and support vector machine models. The recurrent neural network models had very poor performance. Specifically, the Elman and Jordan-Elman models had the poorest performance of the models investigated. It was determined that the non-neural-network machine learning models were less computationally complex, and were less dependent on a balanced data set, than the neural network models. |
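To illustrate the indicator-based labelling idea, the sketch below computes a 20-period moving average and a standard 14-period RSI with pandas; the long/flat rule is a simplified stand-in for the assignment's trading model, and the price series is synthetic.

```python
# Moving average and RSI as labelling inputs for a trading signal.
import numpy as np
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    return 100 - 100 / (1 + gain / loss)

# Synthetic random-walk price series standing in for minute ticker data.
close = pd.Series(np.cumsum(np.random.default_rng(0).normal(size=500)) + 100)
ma20 = close.rolling(20).mean()
signal = ((close > ma20) & (rsi(close) < 70)).astype(int)  # 1 = long, 0 = flat
print(signal.tail())
```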
Trends in Infrastructure Delivery from Media Reports | It has been shown that investment in public infrastructure such as roads and electricity generally leads to economic growth, and economic growth in turn helps fight poverty and income inequality. It is therefore not surprising that the need to monitor the condition of infrastructure arises. Infrastructure report cards (IRCs) assess the condition of a country’s infrastructure. The South African Institution of Civil Engineering (SAICE) publishes IRCs for South Africa. However, limited data availability for some infrastructure sectors hampers the compilation of the SAICE IRCs. Online news articles are a promising alternative data source to assist in the compilation of the SAICE IRCs, since they are in the public domain and there is an abundance of reputable news websites covering virtually all regions of South Africa. The task of extracting information from a large volume of online news articles can be automated to a large extent by making use of various natural language processing techniques.
In this research assignment, online news articles are collected from nine South African news websites. Topic modelling is then applied to each of the collected data sets with the goal of grouping together news articles related to specific infrastructure issues, e.g., all news articles about potholes or all news articles about sewage spills, and then representing each group of news articles as a topic. A summary for each topic is then generated by making use of a large language model. Lastly, a dashboard is designed to effectively visualise the topics and the summaries generated for these topics. This dashboard can then be used as a tool by SAICE to identify and monitor prevalent infrastructure issues in various regions of South Africa, while also providing SAICE with additional data for the compilation of IRCs. This research assignment concludes that it is feasible to apply topic modelling to South African news data sets for the extraction of infrastructure-related topics. It is furthermore concluded that topic modelling can help address the lack of data in compiling the SAICE IRCs. Lastly, it is concluded that it is feasible to generate summaries for the extracted topics using large language models, although the generated topic summaries can be improved upon. |
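As a compact stand-in for the topic-modelling step, the sketch below fits scikit-learn's LDA to a few invented headlines; a real run would use the scraped article corpora and pass each topic's representative articles to a large language model for summarisation.

```python
# Latent Dirichlet allocation over toy infrastructure headlines.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["potholes damage cars on provincial roads",
        "municipality struggles with sewage spill in river",
        "new pothole repair programme for road network",
        "raw sewage contaminates beaches after pump failure"]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", ", ".join(vocab[topic.argsort()[-4:]]))
```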
Investigating sales forecasting in the formal liquor market using deep learning techniques | This research assignment focuses on forecasting sales in the liquor industry, examining the effectiveness of deep learning techniques and a stacked ensemble approach. Time-series forecasting is a widely used technique in various fields such as economics, finance, and operations research.
A thorough literature review was conducted to gain an in-depth understanding of the topic and to survey existing solutions in the field. The study involved a thorough analysis of datasets to understand the inherent structures of the series. Evaluation metrics and various algorithms were used to assess the effectiveness of time-series forecasting techniques. The research assignment found that deep learning techniques and ensemble theory can successfully be applied to forecast sales in the liquor industry. A stacked ensemble approach was effective in improving the overall performance. The findings have the potential to significantly improve current implementations of time-series forecasting, while reducing the computational complexity and expenses associated with granular forecasting models. The research assignment concludes that deep learning and ensemble models offer a promising avenue for efficient and accurate sales forecasting in the liquor industry, being more time-efficient and computationally less complex than traditional methods. |
Automated Localisation and Classification of Trauma Implants in Leg X-rays through Deep Learning | Revision surgery often requires orthopedic surgeons to pre-operatively identify failed implants in order to reduce the complexity and cost of the surgery. Surgeons typically examine the X-rays of a patient for preoperative implant identification, even though this method is time-consuming and occasionally unsuccessful. This study investigates the use of deep learning to automate the identification of trauma implants in leg X-rays. The investigation assesses the performance of various object detection and classification models on a dataset of trauma implants, aiming to identify the optimal deep learning solution. Challenges related to this research include limited data, imbalanced class distributions, and the presence of multiple implants in the X-ray images.
The results of the investigation indicate that the optimal deep learning solution is a two-model pipeline that employs a you only look once (YOLO) object detection model and a densely connected convolutional neural network (DenseNet) classification model. The DenseNet classification model classifies the trauma implants localised by the YOLO object detection model. The proposed pipeline achieves a mean average precision (intersection over union threshold of 0.5) of 0.967 for implant localisation and an accuracy of 73.7% for implant classification. The results of the study provide evidence that deep learning models are capable of identifying trauma implants. Additionally, the study offers a deep learning solution that can be utilised in future research related to identifying trauma implants. |
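An outline of such a two-model pipeline is sketched below, assuming the ultralytics and torchvision APIs; the weight files, class count and image path are hypothetical placeholders, and both models would first be fine-tuned on the implant dataset.

```python
# Two-stage pipeline: YOLO proposes implant boxes, DenseNet classifies crops.
import torch
from PIL import Image
from torchvision import models, transforms
from ultralytics import YOLO

detector = YOLO("implant_detector.pt")          # fine-tuned YOLO weights
classifier = models.densenet121(weights="DEFAULT")
classifier.classifier = torch.nn.Linear(classifier.classifier.in_features, 10)
classifier.eval()
prep = transforms.Compose([transforms.Resize((224, 224)),
                           transforms.ToTensor()])

image = Image.open("leg_xray.png").convert("RGB")
for box in detector(image)[0].boxes.xyxy.tolist():   # [x1, y1, x2, y2]
    crop = prep(image.crop(tuple(int(v) for v in box))).unsqueeze(0)
    with torch.no_grad():
        implant_class = classifier(crop).argmax(1).item()
    print(box, "->", implant_class)
```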
Association between the features used by a convolutional neural network for skin cancer diagnosis and the ABC-criteria and 7-point skin lesion malignancy checklist | Melanoma cases and the associated mortality rate are rising rapidly. The early detection of melanoma is crucial in decreasing the mortality rate. However, traditional methods employed by dermatologists to diagnose skin lesions are time-consuming and vulnerable to human error. Convolutional neural networks (CNNs) show promise in improving the efficiency and accuracy of classifying skin lesions as malignant or benign. However, the lack of transparency in the decision-making process of CNNs prevents these models from clinical application. For a CNN to be approved for clinical application, it must be shown that the features used by a CNN to classify skin lesions are clinical indicators of melanoma, i.e. the ABCDE criteria and 7-point skin lesion malignancy checklist.
In this research assignment, a methodology is developed to evaluate whether the features used by a CNN to classify skin lesions correspond to the ABC-criteria and the 7-point skin lesion malignancy checklist. A CNN model is developed, trained, and tested to assess the application of the formulated methodology. The association between the ABC-criteria and the 7-point skin lesion malignancy checklist features and melanoma in the test dataset is investigated using statistical methods to establish a ground truth. The association between the ABC-criteria and the 7-point skin lesion malignancy checklist features and the features extracted by the CNN is determined using t-distributed stochastic neighbour embedding (t-SNE) and statistical tests. The importance of colour is evaluated by testing the performance of the CNN on a grayscale dataset. The association of dataset issues with the extracted features is examined using statistical tests, and misclassifications are investigated based on features and dataset issues. Local interpretable model-agnostic explanations (LIME) is employed to explain misclassifications and correctly classified images, providing insights into the decision-making process of the CNN. The InceptionResNetV2 model with a leaky ReLU activation was selected to evaluate the formulated methodology. The correlation tests between the ABC-criteria and the 7-point skin lesion malignancy checklist features and the melanoma diagnosis in the curated test dataset showed a strong association between all the features and melanoma, except for vascular structures, brown, red and black. These results were reflected in the evaluation of the association between the features used by the CNN and the ABC-criteria and the 7-point malignancy checklist, since there was a strong association between the extracted features and the ABC-criteria and the 7-point malignancy checklist features, except for vascular structures, brown, red and black. The decrease in performance of the InceptionResNetV2 model on the grayscale dataset indicated that colour is a feature that the CNN uses to detect melanoma. The CNN demonstrated robustness to dataset issues but showed sensitivity to the presence of hair and immersion fluid, suggesting the need for further preprocessing of the images. Overall, it was concluded that the developed methodology can determine whether a CNN uses the features in the ABC-criteria and the 7-point malignancy checklist to classify skin lesions as malignant or benign. The developed methodology showed that the CNN uses the features of the ABC-criteria and the 7-point malignancy checklist to determine whether a skin lesion is malignant or benign. |
December 2023 Graduation |
|
Title | Abstract |
---|---|
A dynamic optimisation approach to training feed-forward neural networks that form part of an active learning paradigm | Active learning describes a paradigm of continually selecting the most informative patterns to train a model while training progresses. Literature indicates that the parameter search landscape of feed-forward neural networks (FFNNs) that form part of an active learning paradigm does not generalise to the parameter search landscape of FFNNs trained by a static training set. The parameter search landscape of FFNNs that form part of an active learning paradigm is theorised to change while the search progresses. This research assignment investigates the effect of changing the optimiser of a FFNN that forms part of an active learning paradigm from backpropagation to a dynamic optimisation algorithm. To this end, the cooperative quantum-behaved particle swarm optimisation (CQPSO) algorithm was implemented to train FFNNs that form part of two different active learning paradigms. The active learning paradigms investigated were dynamic pattern selection (DPS) and sensitivity analysis selective learning (SASLA). Six data sets were used for the investigation. A novel hyperparameter tuning procedure was implemented to ensure efficient optimiser performance for each problem set. It was found that the CQPSO algorithm located and tracked the global minimum of four out of the six problem sets more effectively than the backpropagation algorithm in the DPS active learning paradigm. Conversely, the backpropagation algorithm located and tracked the global minimum of four out of the six problem sets more effectively than the CQPSO algorithm in the SASLA active learning paradigm. The CQPSO algorithm performance was found to depend on the dimensionality of the search space as well as the interdependence of the input training patterns. |
|
Course Recommendation Based on Content Affinity with Browsing Behaviour | A recommender, or recommendation system (RS), filters and provides relevant content to a user based on many factors such as their historic behaviour during interactions with a particular system or software. A RS is aimed at improving user experience and overcoming issues such as the distressing search problem experienced in massive open online course (MOOC) platforms. One such online platform is Physioplus, whose subscribers generally have very specific educational needs and thus can greatly benefit from targeted responses when interacting with the system. It can therefore be argued that an enhanced course recommender engine possesses great potential to increase Physioplus subscribers’ satisfaction and thus reduce cancellations. The current search feature in Physioplus has some limitations, as it uses keywords, static course recommendations, and elastic site search without considering historic user site visits. The purpose of this study is to build a better course recommender system for Physioplus. The recommender takes a user’s recent Physiopedia browsing history and provides the user with a tailored and rank-ordered list of those courses that are most relevant to their entire content history. The content of a user browsing history is highly correlated with the content of the most relevant courses for that user. The recommender is built using a collaborative filtering (CF) technique, with item-based and user-based approaches. Natural language processing and neighbourhood similarity methods are used to complement collaborative filtering in achieving quality recommendations. The course recommender system in this study uses a training and testing dataset from a real-world Physioplus system to assess the overall performance of the proposed approach. The experiment evaluation is measured by comparing recommended versus completed courses. The results show that the proposed RS has a recall score of 76% and an accuracy rate of 53% obtained in the offline experiment exercise. The assumption is that the performance metrics score will improve once the proposed RS integrates with the existing Physioplus production system. All in all, the proposed RS can play an essential role in assisting users with relevant courses. |
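A toy item-based collaborative filtering step in the spirit of the recommender described: cosine similarity between course interaction vectors, then scores for unseen courses; the interaction matrix is invented, not Physioplus data.

```python
# Item-based collaborative filtering on a toy user-course matrix.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = courses; 1 = viewed/completed.
R = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [1, 0, 1, 1],
              [0, 0, 1, 1]])
item_sim = cosine_similarity(R.T)        # course-by-course similarity
user = 0
scores = R[user] @ item_sim              # affinity of the user with each course
scores[R[user] == 1] = -np.inf           # mask already-seen courses
print("recommend course:", int(np.argmax(scores)))
```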
|
An Evolutionary Algorithm for the Vehicle Routing Problem with Drones with Interceptions | The use of trucks and drones as a solution to address last-mile delivery challenges is a new and promising research direction explored in this assignment. The variation of the problem where the drone can intercept the truck while in movement or at the customer location is part of an optimisation problem called the vehicle routing problem (VRP) with drones with interception (VRPDi). This study proposes an evolutionary algorithm (EA) to solve the VRPDi. The study demonstrates a metaheuristic strategy by applying an evolution-based algorithm to solve the VRPDi. In this variation of the VRPDi, multiple pairs of trucks and drones need to be scheduled. The pairs leave and return to a depot location together or separately to make deliveries to customer nodes. The drone can intercept the truck after the delivery or meet up with the truck at the following customer location. The algorithm was executed on the travelling salesman problem with drones (TSPD) datasets by Bouman et al. (2015), and the performance of the algorithm was compared by benchmarking the results of the VRPDi against the results of the VRP on the same dataset. This comparison showed improvements in total delivery time of between 39% and 60%. Further detailed analysis of the algorithm results examined the total delivery time, total distance, the node delivery scheduling and the degree of diversity during the algorithm execution. This analysis also considered how the algorithm handled the VRPDi constraints. The results of the algorithm were then benchmarked against algorithms in Dillon et al. (2023) and Ernst (2024). The latter solved the problem with a maximum drone distance constraint added to the VRPDi. The analysis and benchmarking of the algorithm results showed that the algorithm satisfactorily solved 50- and 100-node problems in a reasonable amount of time, and the solutions found were better than those found by the algorithms in Dillon et al. (2023) and Ernst (2024) for the same problems. However, the algorithm performance deteriorated considerably as the number of nodes in the problems increased. This deterioration was both in terms of the quality of the solution and the computation time required to solve the problem. |
|
Metaheuristics for Training Deep Neural Networks | Presently, artificial neural networks (ANNs) are popular among researchers as well as in commercial settings. The use of ANNs continues to expand into different fields. The increase in interest in ANNs has led researchers to explore various new and innovative ways to improve the performance of ANNs. One such way is to explore the use of metaheuristics in the training of ANNs. This research assignment theoretically and empirically compares the use of metaheuristics as an alternative to the traditional training algorithm, i.e. backpropagation with stochastic gradient descent (SGD), to train deep neural networks (DNNs). Three specific metaheuristics are considered, namely particle swarm optimisation (PSO), genetic algorithm (GA) and differential evolution (DE). An in-depth analysis of SGD is conducted to highlight some potential disadvantages which might occur in the training process. The field of metaheuristics is explored as an alternative source of training algorithms, with specific emphasis placed on the three specified metaheuristics. Five different experiments are conducted to empirically compare the backpropagation SGD training algorithm with the PSO, GA and DE training algorithms. The experiments are conducted on an image dataset. The DNN used in the experiments is a convolutional neural network (CNN). The results conclude that SGD performs better than the metaheuristics considered. Potential future work is also discussed based on the findings of this research assignment. |
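To make the metaheuristic-as-trainer idea concrete, the sketch below uses a bare-bones PSO to train a tiny one-hidden-layer network on XOR; the swarm size and coefficients are textbook defaults, not the assignment's tuned settings, and the real experiments targeted a CNN.

```python
# PSO as a gradient-free trainer for a 2-4-1 network on XOR.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([0, 1, 1, 0], float)
DIM = 2 * 4 + 4 + 4 + 1          # weights and biases of a 2-4-1 network

def loss(w):
    W1, b1 = w[:8].reshape(2, 4), w[8:12]
    W2, b2 = w[12:16], w[16]
    h = np.tanh(X @ W1 + b1)
    out = 1 / (1 + np.exp(-(h @ W2 + b2)))
    return np.mean((out - y) ** 2)

pos = rng.normal(size=(30, DIM))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([loss(p) for p in pos])
for _ in range(300):
    gbest = pbest[pbest_val.argmin()]
    r1, r2 = rng.random((2, 30, DIM))
    # Inertia plus cognitive and social pulls (standard PSO update).
    vel = 0.72 * vel + 1.49 * r1 * (pbest - pos) + 1.49 * r2 * (gbest - pos)
    pos = pos + vel
    vals = np.array([loss(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
print("best MSE:", pbest_val.min())
```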
|
Diversity preservation for decomposition particle swarm optimization as feed-forward neural network training algorithm under the presence of concept drift | Time series forecasting is an important area of research that lends itself to various fields in which it is practically applied. The importance of time series forecasting has led to much research in efforts to improve the accuracy of predictions. The use of artificial neural networks for time series forecasting has grown, especially with the development of simple recurrent neural networks (SRNNs). SRNNs have been shown to handle temporal sequences efficiently. Specialised architectures for SRNNs increase the computational cost due to the increase in the number of weights that require optimisation during training. Therefore, the training process of neural networks can be rephrased as an optimisation problem. Recent work has shown how specialised dynamic particle swarm optimisation (PSO) algorithms can replace traditional backpropagation as a learning algorithm for feed-forward neural networks (FFNNs). Dynamic PSO algorithms to train FFNNs have been shown to outperform SRNNs using traditional backpropagation. Due to the increased dimensions for larger problems, various cooperative PSO algorithms have been developed to address the credit assignment problem as well as to better cope with variable dependency; one such PSO variant is the decomposition cooperative particle swarm optimisation algorithm. One limitation of using PSO variants for training in dynamic environments is that as the particles in a swarm converge in a specific region, the swarm diversity decays, making it difficult to adapt to environmental changes. Dynamic PSO algorithms have been successfully used in the sub-swarms of decomposition cooperative particle swarm optimisers (DCPSOs). However, these dynamic DCPSO algorithms have been shown to struggle under specific classes of dynamism. Therefore, the preservation of swarm diversity is directly linked to the ability to adapt in the presence of concept drift. This research project proposes various diversity preservation techniques to promote swarm diversity throughout various environmental changes. The diversity preservation techniques investigated are the use of random decomposition for dynamic DCPSO and a diversity-based penalty function for regularization. For this purpose, experiments were conducted on five well-known nonstationary forecasting problems under various classes of dynamism. Results obtained on two implementations of the DCPSO using the proposed diversity preservation techniques showed success in promoting swarm diversity. Two main implementations of DCPSOs were investigated, namely dynamic and static sub-swarms. When a static PSO algorithm was used for the sub-swarms of the DCPSO, the diversity preservation showed a significant impact. The proposed diversity preservation techniques also significantly affected swarm diversity for the DCPSO using the quantum particle swarm optimisation algorithm (QSO) as sub-swarms. The use of the diversity-based penalty function for regularization showed superior performance on the training and generalization error for dynamic DCPSO. Still, it did not show a statistically significant effect on preserving swarm diversity. The use of static PSO algorithms as sub-swarms for DCPSO showed that random decomposition ranked high across the various experiments, while swarm diversity was significantly impacted.
The proposed diversity preservation techniques for the dynamic DCPSO algorithms showed a trade-off between diversity preservation and performance. |
|
March 2023 Graduation |
|
Title | Abstract |
---|---|
Adaptive thresholding for microplot segmentation | Food security remains a global concern as flagged by the Food and Agriculture Organization of the United Nations (FAO). They report that globally one in three people do not have access to adequate food, with a third of those living in Africa. The effect of climate change on crop yields adds to these concerns. Wheat makes up a substantial share of food consumption globally at 18.3% and it is particularly sensitive to the rising temperatures associated with global warming. The FAO emphasises that agricultural technology has a significant role to play in food security, with research contributing to the breeding of high-yield and heat-resistant crops as an important focus area. The Department of Genetics at Stellenbosch University has a wheat pre-breeding programme that develops and tests novel crop variants. This programme monitors several experimental sites that contain microplots: relatively small wheat plots. At a single pre-breeding experimental site, there are often hundreds of microplots that must be monitored and evaluated. The within-season evaluation of microplots is performed by using digital high throughput phenotyping (HTP) analysis performed on orthomosaic images collected using unmanned aerial vehicles (UAVs). One of the phases of HTP is the plot identification phase, also referred to as microplot segmentation. The current method used to perform microplot segmentation in the programme makes use of a grid that a user must impose over the orthomosaic image and manually adjust to ensure accurate segmentation. This method is manual and requires extensive post-processing to get a good fit. In addition, the current method does not generalise well to conditions that will pragmatically vary between orthomosaic collection iterations. To reduce the time spent by researchers to segment microplots, this research assignment developed an automated microplot segmentation method that requires minimal input from the user. The microplot segmentation approach, referred to as the adaptive thresholding procedure (ATP), was developed for this research assignment. The ATP uses unsupervised learning to identify and localise microplots. Unlike a grid segmentation approach, the ATP does not require any prior knowledge of the microplot layout and does not require the user to adjust a grid. The performance of the ATP microplot segmentation procedure was evaluated on thirteen orthomosaic images from four different experimental sites and subsequently compared against two manual microplot segmentation procedures. The three different microplot segmentation approaches were compared using three objective criteria, namely accuracy, intersection over union, and the level of user input required. The ATP yielded superior performance in comparison to the other two segmentation methods when the conditions at the experimental sites were favourable. In the presence of weeds, the ATP did not yield satisfactory performance, as the approach finds it challenging to differentiate between vegetation, weeds and non-vegetation. Despite this limitation, the ATP contributes to the existing body of knowledge on microplot segmentation methods by providing an automated microplot segmentation method that requires minimal user input. |
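A simplified take on the unsupervised segmentation idea, assuming OpenCV: an excess-green vegetation index followed by adaptive thresholding and contour extraction to localise plot-like regions; the tile path, block size and area cut-off are illustrative, not the ATP's settings.

```python
# Vegetation index + adaptive threshold + contours for plot candidates.
import cv2
import numpy as np

bgr = cv2.imread("orthomosaic_tile.png").astype(np.float32)
b, g, r = cv2.split(bgr / 255.0)
exg = 2 * g - r - b                               # excess-green index
exg8 = cv2.normalize(exg, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
mask = cv2.adaptiveThreshold(exg8, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                             cv2.THRESH_BINARY, 51, -5)
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 500]
print(f"{len(boxes)} candidate microplot regions")
```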
Decision Support Guidelines for Selecting Modern Business Intelligence Platforms in Manufacturing to Support Business Decision Making | Globally, the generation of data is increasing rapidly, and the increasing competitiveness of global markets constantly challenges the business world due to globalisation. Companies rely on sophisticated technology to manage and make decisions in this dynamic business environment and ever-evolving market. Executives are under constant pressure to maximise profits from new offerings and operational efficiencies and to improve customer and employee experience. As digitalisation in the manufacturing industry increases, the role of data analytics and business intelligence (BI) in decision-making is significantly increasing. Manufacturers generate abundant structured and unstructured business information throughout the product lifecycle that can be used to achieve their business objectives. However, the manufacturing industry is amongst the laggard sectors pertaining to digitisation and often lacks the technological and organisational foundations required to implement data tools as part of their ecosystem. BI provides business insights to better understand the company’s large amounts of data, operations and customers. This, in turn, can contribute to better decision-making and consequently improve results and profit. Rationalisation of the technologies, tools and techniques can be challenging. The selection of an appropriate tool can be time-consuming, complex and overwhelming due to the wide variety of available BI software products, each claiming that their solution offers distinctive and business-essential features. This research assignment aims to address the need for a useful approach to BI tool evaluation and selection by identifying guidelines to support decision-makers in selecting BI tools. A thematic analysis approach was used to collect, analyse and interpret the information from semi-structured interviews with professionals from the manufacturing industry. The research gauged respondents’ views on the utilisation of BI, the data challenges experienced in manufacturing, the essential criteria BI tools should fulfil, and the approaches followed in practice to select software. The research revealed that BI plays a significant role in decision-making and the prioritisation of tasks in manufacturing. The results showed that respondents valued different BI criteria requirements and decision-making processes. The findings and insights gleaned from the literature review were used to propose guidelines that support manufacturers in their decision. These guidelines elucidate the dimensions to evaluate and provide a nine-step selection process to compare BI software. |
|
An Investigation into the Automatic Behaviour Classification of the African Penguin | In this modern era, climate change, deforestation, and the rapid decline of natural resources are issues that seem ever-increasing. With the extinction of many fauna and flora species in past decades, renewed focus on conservation efforts is advocated globally. The escalation of digitization brings with it an opportunity to improve conservation efforts and, consequently, reduce the rapid decline of biodiversity. Modelling and forecasting the progression of invasive species, ascertaining the presence of endangered species prior to the sanctioning of construction projects, and monitoring threatened ecosystems are some of the many ecologically beneficial possibilities technology provides. A prevalent application gaining much momentum is the notion of applying machine learning and artificial intelligence to the domain of ecology. One such application considers animal behavioural studies — a predominantly manual endeavour requiring mounted sensors, tracking devices and/or the continued presence and attention of a human. Ascribed to the invasive nature of many such studies, behaviour is often distorted or (at the very least) influenced. Modern computerised and digitised approaches address many of these drawbacks by providing a means of evaluating behaviour in a non-invasive (or less-invasive) manner. Mounted video cameras are, for example, less cumbersome than traditional wearable sensors. In addition, the presence of a human within or near the animal is no longer required. Considering the potential benefits to conservation, incorporating this technology into the field of behavioural studies is well warranted. This project is dedicated to investigating the applicability of modern machine learning, specifically deep learning, to behaviour analysis in the endangered African penguin. The aim of this project is to investigate, develop, and deploy a model facilitating automatic behaviour classification in these penguins — a foundational contribution to improve current conservation efforts (improving passive monitoring systems and anomaly detection within a colony could potentially reduce response time in times of distress). The project considers a dual implementation — coordinates detailing animal movement are first extracted and subsequently presented to a suitable classifier facilitating behaviour classification. Three case studies are considered: single penguins, two individuals, and three individuals (regarded as multiple individuals). A comprehensive investigation into the algorithmic performance associated with these models is performed and presented. Ultimately, the case evaluating three individuals based on the behaviours excitement and normal achieves an AUC of 72.9%. The case evaluating two individuals based on the behaviours interaction and no interaction achieves an AUC of 84.2%. Finally, the case evaluating one individual based on the behaviours braying, flapping, preening, resting, standing, and walking achieves an AUC of 82.1%. This yields valuable insight into the utility, applicability, and feasibility of automatic behaviour classification of the African penguin. Pivotal to this work is the foundation it provides to the design, development, and implementation of a passive monitoring system, as well as its benefits and contributions towards a holistic goal — aiding conservation efforts to preserve fauna and flora for future generations. |
Set-based Particle Swarm Optimization for Medoids-based Clustering of Stationary and Non-Stationary Data | Data clustering is the grouping of data instances so that similar instances are placed in the same group or cluster. Clustering has a wide range of applications and is a highly studied field of data science and computational intelligence. In particular, population-based algorithms such as particle swarm optimization (PSO) have shown to be effective at data clustering. Set-based particle swarm optimization (SBPSO) is a generic set-based variant of PSO that substitutes the vector-based mechanisms of PSO with set theory. SBPSO is designed for problems that can be formulated as sets of elements, and its aim is to find the optimal subset of elements from the optimization problem universe. When applied to clustering, SBPSO searches for an optimal set of medoids from the dataset through the optimization of an internal cluster validation criterion. In this research assignment, SBPSO is used to cluster fifteen datasets with diverse characteristics such as dimensionality, cluster counts, cluster sizes, and the presence of outliers. The SBPSO hyperparameters are tuned for optimal clustering performance on these datasets, which is compared in depth to the performance of seven other tuned clustering algorithms. Then, a sensitivity analysis of the SBPSO hyperparameters is performed to determine the effect that variation in these hyperparameters has on swarm diversity and other measures, to enable future research into the clustering of non-stationary data with SBPSO. It was found that SBPSO is a viable clustering algorithm. SBPSO ranked third among the algorithms evaluated, although it appeared less effective on datasets with more clusters. A significant trade-off between swarm diversity and clustering ability was discovered, and the hyperparameters that control this trade-off were determined. Strategies to address these shortcomings were suggested. |
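The set-based formulation lends itself to a simple fitness sketch: given a candidate medoid set (one SBPSO particle), score it with an internal validation value. The criterion below (total distance to the closest medoid) is a stand-in for whichever criterion the assignment actually optimises:

```python
import numpy as np

def medoid_fitness(data: np.ndarray, medoid_idx: set[int]) -> float:
    """Internal validation value of a candidate medoid set: the sum of
    distances from each instance to its closest medoid (lower is better)."""
    medoids = data[list(medoid_idx)]
    dists = np.linalg.norm(data[:, None, :] - medoids[None, :, :], axis=2)
    return float(dists.min(axis=1).sum())

rng = np.random.default_rng(2)
data = rng.normal(size=(100, 4))
print(medoid_fitness(data, {3, 41, 77}))   # one candidate "particle" (a set)
```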
|
An Extension of the CRISP-DM Framework to Incorporate Change Management to Improve the Adoption of Digital Projects | Digital transformation brings technology such as artificial intelligence (AI) into the core operations of businesses, increasing their revenue while reducing their costs. AI deployments tripled in 2019, having grown by 270% in just four years. However, digital transformation is a challenging task to complete successfully. A total of 45% of large digital projects run over budget, while only 44% of digital projects ever achieve the predicted value. The primary reason for these failures can be attributed to the human aspects of these projects. Examples of these human aspects are the difficulty of access to software, the lack of understanding of technology, and the lack of knowledge to operate the technology. The continued success of digital transformation requires both technical and change management drivers to be in place before, during, and after AI implementations. The project starts by describing digital projects. Digital projects, which include data science and AI, have an extremely low success rate, with change management as a fundamental barrier to the success of these projects. To address the change management challenges, five different change management models are compared, from which a generalised change management model is constructed. From the literature, it is concluded that the CRISP-DM framework is one of the most widely used analytics models for implementing digital projects. Using the generalised change management model, the change management gaps within the CRISP-DM framework are identified. An extended CRISP-DM framework is constructed by filling the identified gaps in the original CRISP-DM framework with the tasks in the generalised change management model. The extended CRISP-DM framework is then detailed and validated against a real-world case study. The validation shows that the extended CRISP-DM framework indicates change management improvement areas which would most likely have improved the adoption of the project. For this research project, success ultimately lies in the ability of the developed framework to provide an effective way to guide data specialists through tasks that will ease the challenges of digital transformation. All the objectives of this research assignment are achieved, and the validation indicates that use of the extended framework by a data specialist has the potential to improve the success rate of digital projects at a lower risk of failure. |
|
An evaluation of state-of-the-art approaches to short-term dynamic forecasting | Order volume forecasting (OVF) is a strategic tool used by logistics companies to reduce operating costs and improve service delivery for their clients. It provides business units with the ability to anticipate demand, based on historical data and external factors, so that resources can be deployed effectively to enable the aforementioned improvements. Until recently, statistical models have been the standard for forecasting. However, recent research into the use of state-of-the-art (SOTA) approaches to forecasting has yielded promising results. Most notably, these approaches are able to leverage covariates, which enable models to incorporate auxiliary information such that the predictions are responsive to their respective environments. This is critical for short-term forecasts, which are inherently more stochastic than long-term forecasts. This research paper seeks to compare the use of a statistical forecasting approach to a SOTA approach in the case of short-term order volume forecasting. More specifically, the NBEATS model is developed using various exogenous variables and is compared to the Exponential Smoothing (ETS) model. Both models have been developed to provide forecasts three hours into the future and are evaluated using RMSE and MAE. It was found that NBEATS provided a 36.01% improvement on the RMSE of the ETS model and a 31.6% improvement on the MAE of the ETS model. Additionally, two variations of NBEATS are compared – one trained with covariates and another without – to evaluate the improvement that covariates provide. It was found that providing models with exogenous variables resulted in a 16.15% improvement in RMSE and a 14.74% improvement in MAE. The results of this paper suggest that SOTA approaches provide more consistent and accurate short-term forecasts. |
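The percentage improvements quoted above follow the usual relative-error comparison; a minimal sketch (with made-up numbers) of how such RMSE and MAE improvements are computed:

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2)))

def mae(y, yhat):
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(yhat))))

def pct_improvement(baseline: float, candidate: float) -> float:
    """Relative improvement of a candidate model over a baseline."""
    return 100.0 * (baseline - candidate) / baseline

# Toy example: NBEATS-style forecasts vs an ETS baseline on the same series.
y = [100, 120, 130, 110]
ets_pred, nbeats_pred = [90, 130, 120, 125], [98, 122, 127, 113]
print(pct_improvement(rmse(y, ets_pred), rmse(y, nbeats_pred)))
print(pct_improvement(mae(y, ets_pred), mae(y, nbeats_pred)))
```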
Cross-Camera Vehicle Tracking in an Industrial Plant Using Computer Vision and Deep Learning | One of the key actors in the paper recycling process is buy-back centres. Buy-back centres buy or collect recyclable materials from individuals, formal and informal collection businesses, and institutions. Buy-back centres are important because they divert recyclable material away from landfills, which reduces the leaching of pollutants into the soil and groundwater as well as the generation of harmful gases and chemicals. However, buy-back centres face several threats, of which fraud is one of the most difficult to detect and prevent. Fraud occurs when the amount and/or the grade of the waste paper being sold to the buy-back centre is misrepresented by the sellers in order to earn a greater income. A misrepresentation of the waste paper grade and weight being sold to the buy-back centre influences not only the availability of stock and the volume of sales to the paper mills but also the sustainability of the entire recycling ecosystem in the area. To facilitate the detection of fraud at buy-back centres, a multi-vehicle multi-camera tracking (MVMCT) framework is developed to track the movement of vehicles throughout a paper buy-back centre located in South Africa. The MVMCT framework developed can aid the buy-back centre in estimating the amount of material expected to be collected at a loading bay prior to stocktaking. When there is a large discrepancy between how much material is expected to be collected and how much is present at the loading bay, the buy-back centre can use the MVMCT framework to track and identify suspicious vehicles for further investigation. This research assignment shows that the Faster R-CNN and DeepSORT detector-tracker pair exhibits superior performance in terms of IDF1 scores. Furthermore, this research assignment addresses the vehicle re-identification problem by using a Siamese network to match vehicles across several video sequences and to manage the global ID assignment process. The MVMCT framework developed in this research assignment exhibits an IDF1 score of 0.58, a multi-object tracking accuracy of 0.62, and a multi-object tracking precision of 0.53. Moreover, the MVMCT framework successfully tracks vehicles across all video sequences except for the sequence with a top-down view, and shows a reasonable counting accuracy for counting the number of stationary vehicles at a loading bay. |
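A minimal sketch of the kind of global ID assignment the Siamese re-identification step performs: compare a new track's appearance embedding against a gallery of known global IDs by cosine similarity, and open a new ID when nothing matches. The threshold and embedding size are illustrative assumptions, not the assignment's actual values:

```python
import numpy as np

def assign_global_id(gallery: dict[int, np.ndarray],
                     query: np.ndarray, threshold: float = 0.7) -> int:
    """Match a track embedding (e.g. from a Siamese network) against
    known global IDs; create a new ID if no match is similar enough."""
    query = query / np.linalg.norm(query)
    best_id, best_sim = None, threshold
    for gid, emb in gallery.items():
        sim = float(emb @ query / np.linalg.norm(emb))
        if sim > best_sim:
            best_id, best_sim = gid, sim
    if best_id is None:                     # unseen vehicle: new global ID
        best_id = max(gallery, default=-1) + 1
        gallery[best_id] = query
    return best_id

rng = np.random.default_rng(3)
gallery = {0: rng.normal(size=128)}
print(assign_global_id(gallery, rng.normal(size=128)))
```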
A Bagging Approach to Training Neural Networks using Metaheuristics | Stochastic gradient descent has become the go-to algorithm to train neural networks. As neural network architectures and the datasets used to train them grow larger, so does the computational cost of training. Metaheuristics have successfully been used to train neural networks, and are more robust to noisy objective functions. This research assignment investigates whether metaheuristics, specifically genetic algorithms, differential evolution, evolutionary programming and particle swarm optimisation, can be used to train an artificial neural network with a subsample of the training set. Different bagging training approaches that reduce the amount of training data are put forward, and the performances of the trained neural networks are evaluated. These performances are compared against the performance of a neural network trained with stochastic gradient descent, and against neural networks trained with the metaheuristic algorithms using the entire training dataset. The evaluation compares the validation accuracy and the generalisation factor to detect whether overfitting occurs. The research assignment also answers the question of whether overfitting is reduced when the suggested training methods are used. The results indicate that a sub-sample of the training set can be used per iteration or generation of the metaheuristic algorithm when training a neural network, with similar accuracy and similar or better overfitting performance as when training is performed using the complete training set. The best performance was achieved with a bagging strategy using the same sample size for each class to classify. |
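The core idea, evaluating each generation of the metaheuristic on a freshly drawn bag rather than the full training set, can be sketched as follows. This is a toy PSO on a single-layer network with made-up coefficients, not the assignment's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def nn_error(weights: np.ndarray, Xb: np.ndarray, yb: np.ndarray) -> float:
    """Error of a single-layer sigmoid network (illustrative stand-in)."""
    preds = 1 / (1 + np.exp(-(Xb @ weights[:-1] + weights[-1])))
    return float(np.mean((preds - yb) ** 2))

# Minimal PSO: each iteration evaluates particles on a fresh bootstrap
# "bag" instead of the complete training set.
n_particles, dim, bag_size = 20, X.shape[1] + 1, 100
pos = rng.normal(size=(n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_err = pos.copy(), np.full(n_particles, np.inf)
for it in range(50):
    idx = rng.choice(len(X), size=bag_size, replace=True)   # new bag
    errs = np.array([nn_error(p, X[idx], y[idx]) for p in pos])
    improved = errs < pbest_err
    pbest[improved], pbest_err[improved] = pos[improved], errs[improved]
    gbest = pbest[pbest_err.argmin()]
    r1, r2 = rng.uniform(size=(2, n_particles, dim))
    vel = 0.7 * vel + 1.4 * r1 * (pbest - pos) + 1.4 * r2 * (gbest - pos)
    pos = pos + vel
print("best bagged-training error:", pbest_err.min())
```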
|
Link prediction of clients and merchants in a rewards program using graph neural networks | Rewards programs have become an offering for businesses to increase client engagement, nurture long-term relationships and maintain client retention. A host company is an intermediary network provider that connects entities within a rewards program. Identifying future relationships between entities is formulated as a link prediction task. The network is represented as a graph of interconnected entities. Graphs are complex high-dimensional structures, dynamic in shape and size. A research field called graph neural networks (GNNs) has gained traction to handle the challenges posed by graph properties. A real-world scenario has been instantiated to apply a GNN technique to a link prediction task. The investigation aims to identify potential relationships between clients and merchants in a rewards program offered at a bank. A framework design is created for the model architecture: a GNN encoder and an MLP decoder. A GNN variation called GraphSAGE is selected as the encoder. GraphSAGE is an inductive framework, able to generalise to unseen data and leverage node attributes. A sensitivity analysis indicates that the model is sensitive to the dropout and learning rate hyperparameters. Limited attributes and connections are present, which accounts for this sensitivity. The model is fitted to the optimal architecture and tested on unseen data. The model performance resulted in an area under the receiver operating characteristic curve (ROC AUC) of 0.65. Although acceptable, a higher ROC AUC value is desirable. Another evaluation metric highlighted an area that requires further improvement: the precision versus recall results emphasised the effects of the sparse network, with most of the correct predictions made for the negative class. Although a weighted loss strategy mitigated some of these drawbacks, it could not overcome the challenges. The encoder outputs embeddings which are visualised for interpretation. The embedding illustrations reveal similarities in the representations of both clients and merchants. The embeddings identified two distinct merchant groups, while the client embeddings showed clusters of clients which are best represented in a non-Euclidean dimension space. An entity characteristic prediction analysis is done to gain insight into the distribution of the client and merchant features; note that the purpose is not to validate which features the GNN learnt from. Among the correct positive class predictions, female clients account for 99% of the predictions, half of the correct links are associated with a rewards program client, and the Homeware and Decor Store merchant service type accounts for 100% of the correct positive predictions. Implications of the data quality issues are also emphasised. Overall, the GNN demonstrates that it can learn representations in a rewards program network of clients and merchants. The network topology and relations among the clients and merchants are well detected, and the GNN is capable of predicting the existence of links between the entities. Opportunities are identified to further enrich the graph, and improvements are proposed. The investigation provides a positive contribution to the financial industry, rewards programs and GNNs as an emerging research field. |
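A minimal sketch of the encoder/decoder design described above, using PyTorch Geometric's `SAGEConv`; the dimensions, depth, and toy graph are assumptions, not the assignment's actual architecture:

```python
import torch
from torch import nn
from torch_geometric.nn import SAGEConv

class LinkPredictor(nn.Module):
    """GraphSAGE encoder with an MLP decoder over node-pair embeddings."""
    def __init__(self, in_dim: int, hid_dim: int = 64):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hid_dim)
        self.conv2 = SAGEConv(hid_dim, hid_dim)
        self.decoder = nn.Sequential(
            nn.Linear(2 * hid_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, 1))

    def forward(self, x, edge_index, pairs):
        h = torch.relu(self.conv1(x, edge_index))
        h = self.conv2(h, edge_index)
        # Concatenate the two node embeddings of each candidate link.
        z = torch.cat([h[pairs[0]], h[pairs[1]]], dim=1)
        return self.decoder(z).squeeze(-1)   # one logit per candidate link

# Toy graph: 6 nodes with 16 features, a few edges, 2 candidate links.
x = torch.randn(6, 16)
edge_index = torch.tensor([[0, 1, 2, 3], [3, 4, 5, 0]])
pairs = torch.tensor([[0, 2], [4, 5]])
print(torch.sigmoid(LinkPredictor(16)(x, edge_index, pairs)))
```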
Evaluating active learning strategies to reduce the number of labelled medical images required to train a CNN classifier | CNNs have proven to provide human-comparable performance in the field of computer vision; however, one basic limitation of ANNs is that they rely largely on large volumes of labelled data, and manually labelling data is a costly and time-consuming task. This study investigates how varied sizes of the initially labelled set of medical images affect the effectiveness of CNN-based active learning. Active learning is a framework in which the data to be labelled by human annotators are not selected randomly, but rather selected in such a fashion that the amount of data required to train a machine learning model is reduced. Two CNN architectures were chosen to run the experiment using a well-known chest x-ray pneumonia dataset from the Kaggle repository, and uncertainty-based active learning was used to measure the informativeness of the data. Eight simulations were run on varying sizes of initially labelled training images. The simulations demonstrate how active learning can reduce the cost and time required for image labelling. The performance of the two CNN architectures was assessed using the AUC metric, while fewer labelled images were required. In conclusion, the use of DenseNet-121 with least confidence sampling reduced the number of labelled images by 39% compared to the random sampling technique used as the baseline. |
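Least confidence sampling, the strategy that performed best above, reduces to ranking the unlabelled pool by the top softmax probability; a small sketch:

```python
import numpy as np

def least_confidence_batch(probs: np.ndarray, k: int) -> np.ndarray:
    """Select the k unlabelled images whose top predicted class
    probability is lowest, i.e. where the CNN is least confident."""
    confidence = probs.max(axis=1)     # top softmax probability per image
    return np.argsort(confidence)[:k]  # indices to send for labelling

# Toy pool of 5 unlabelled images with softmax outputs over 2 classes.
probs = np.array([[0.95, 0.05], [0.55, 0.45], [0.70, 0.30],
                  [0.51, 0.49], [0.85, 0.15]])
print(least_confidence_batch(probs, k=2))   # -> indices 3 and 1
```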
A Dynamic Optimization Approach to Active Learning in Neural Networks | Artificial neural networks are popular predictive models which have a broad range of applications. Artificial neural networks have been of great interest in the field of machine learning, and as a result, they have received large research efforts to improve their predictive performance. Active learning is a strategy that aims to improve the performance of artificial neural networks through an active selection of training instances. The motivation for the research assignment is to determine if there is an improvement in predictive performance when a model is trained only on instances that the model deems informative. Through the continuous selection of informative training sets, the training times of these networks can also be reduced. The training process of artificial neural networks can be seen as an optimisation problem that uses a learning algorithm to determine an optimal set of network parameters. Backpropagation is a popular learning algorithm which computes the derivatives of the loss function and uses the gradient descent algorithm to make appropriate parameter updates. Metaheuristic optimisation algorithms, such as particle swarm optimisation, have been shown to be efficient as neural network training algorithms. The training process is assumed to be static under fixed set learning, a process in which the model randomly samples instances from a training set that remains fixed during the training process. However, under an active training strategy, the training set continuously changes and should therefore be modelled as a dynamic optimisation problem. This study investigates if the performance of active learners can be improved if dynamic metaheuristics are used as learning algorithms. Different training strategies were implemented in the investigation, which include a sensitivity analysis selective learning algorithm and the accelerated learning by active sample selection algorithm. The analysis utilised different learning algorithms, which included backpropagation, static particle swarm optimisation, and dynamic variations of the particle swarm optimisation algorithm. These training strategies were applied to seven benchmark classification datasets obtained from the UCI repository. Improved performance in the generalisation factor is produced for three of the seven classification problems in which a dynamic metaheuristic is used in an active learning setting. Although these improvements are observed, generally all training configurations achieved similar performance. The conclusion drawn from the study was that it is not definitive that dynamic metaheuristics improve the performance of active learners, because performance improvements are not consistent across all classification problems and evaluation metrics. |
|
Rule Extraction from Financial Time Series | The ability to predict future events is very important in scientific fields. Data mining tools extract relationships among features and feature values, and how these relationships map to the target concept. The main goal is to extract knowledge and understand trends. The resulting rule set can then be used for prediction purposes. For many real-world applications, the actual values of a time series are irrelevant. The shape of the time series can also be used to predict future events. Unfortunately, most research efforts related to this area have had limited success. Rule induction and rule extraction techniques are often unsuccessful for real-valued time series analysis due to the lack of systematic effort to find relevant trends in the data. Rule induction and rule extraction methods are applied to data describing trends in financial time series data. The purpose of this study is to explore the benefits of rule extraction and rule induction, specifically on financial time series. A review of rule extraction and rule induction approaches is conducted as a first step. Thereafter, a rule extraction and rule induction framework is developed and evaluated. The most important finding of this study was the importance of balanced data: performance was significantly better when excessive class distributions were minimised, while the difference in predictive performance between the different rule extraction and rule induction algorithms was not statistically significant. |
Binning Continuous-Valued Features using Meta-Heuristics | The success of any machine learning model implementation is heavily dependent on the quality of the input data. Discretization, which is a widely used data preprocessing step, partitions continuous-valued features into bins, which transforms the data into discrete-valued features. Not only does discretization improve the interpretability of a data set, but it also provides the opportunity to implement machine learning models which require discrete input data. This report proposes a new discretization algorithm that partitions multivariate classification problems into bins through the use of swarm intelligence. The particle swarm optimization algorithm is utilized to find the bin boundary values of each continuous-valued feature which lead to the optimal classification performance of classification models. The classification accuracy of the naïve Bayes classifier, the C4.5 decision tree classifier and the one-rule classifier, resulting from the implementation of the discretizers, is used as the evaluation measure in this report. The performance of the proposed method is compared with equal width binning, equal frequency binning and the evolutionary cut-points selection for discretization algorithm, on different data sets that have mixed data types. The proposed discretizer is outperformed by the evolutionary cut-points selection for discretization algorithm when paired with the C4.5 decision tree classifier. Similarly, the equal width binning discretizer outperforms the proposed discretizer when paired with the C4.5 decision tree. |
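The wrapper-style evaluation described above can be sketched as follows: a candidate particle encodes cut points per feature, and its fitness is the accuracy of a classifier trained on the digitised data. The use of scikit-learn's decision tree in place of C4.5, and the random candidate, are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

def binning_fitness(cuts_per_feature: list) -> float:
    """Fitness of one candidate particle: classifier accuracy after
    discretising each feature at the particle's cut points."""
    Xd = np.column_stack([np.digitize(X[:, j], np.sort(c))
                          for j, c in enumerate(cuts_per_feature)])
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           Xd, y, cv=5).mean()

# One candidate solution: three random cut points per feature.
rng = np.random.default_rng(5)
cuts = [rng.uniform(X[:, j].min(), X[:, j].max(), size=3)
        for j in range(X.shape[1])]
print(f"candidate fitness: {binning_fitness(cuts):.3f}")
```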
|
A Genetic Algorithm Approach to Tree Bucking using Mechanical Harvester Data | Crosscutting of trees into timber logs is known as bucking. The logs are mainly used for producing saw logs at a mill. The logs have different values based on the length of the log and the small end diameter of the log. Maximisation of the value of the logs bucked from a tree can be viewed as an optimisation problem. This problem has been researched in the literature, with most solutions using dynamic programming. This research assignment solves the problem using a metaheuristic approach, specifically a genetic algorithm. The main research question is whether an existing bucking, on a series of stands in a forest, could have been done more optimally. The dataset used to solve the problem comes from the bucking outputs of two mechanical harvesters. The value of a log is the product of its volume and the value per cubic metre of the log class to which it belongs, and the value of a tree is the sum of the values of its logs. It was found that the genetic algorithm outperformed the existing bucking in terms of value. The research method firstly solved the problem for a randomly selected set of trees with dynamic programming, comparing it to the solutions obtained from the genetic algorithm. It was found that the genetic algorithm obtained bucking values very close to the optimal values for the trees. Secondly, a genetic algorithm uses hyperparameters, namely population size, probability of crossover and probability of mutation. The hyperparameters were estimated using a particle swarm optimisation algorithm wrapped around the genetic algorithm, applied to a randomly selected set of trees. The hyperparameters found were used to optimise the total value of each of the five stands. The total value of the optimised stands outperformed the value of the existing bucking by a large margin. |
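The value computation used as the GA objective can be written down directly; the log classes and prices below are hypothetical:

```python
# Value per cubic metre for each log class (hypothetical figures).
LOG_CLASS_PRICE = {"A": 900.0, "B": 650.0, "C": 400.0}

def log_value(volume_m3: float, log_class: str) -> float:
    """Value of a log: its volume multiplied by its class price."""
    return volume_m3 * LOG_CLASS_PRICE[log_class]

def tree_value(logs: list) -> float:
    """Value of a tree: the sum of the values of its bucked logs.
    This is the quantity the genetic algorithm maximises per tree."""
    return sum(log_value(v, c) for v, c in logs)

# One candidate bucking pattern (a GA individual) for a single tree.
pattern = [(0.42, "A"), (0.35, "B"), (0.18, "C")]
print(f"tree value: {tree_value(pattern):.2f}")
```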
Crop recommendation system for precision farming: Malawi use case | Machine learning (ML) has received attention from the global audience, with adoption and rapid scaling being reported across multiple industrial sectors, including agriculture, for application in the automation and optimisation of processes. The advent of new farming concepts like precision farming (PF) has introduced the use of ML-powered decision support systems (DSS). These systems assist farmers in making decisions by providing data-driven recommendations that boost farming productivity and sustainability. Despite being widely developed in many parts of the world, these technologies have not yet been adopted in the sub-Saharan region, particularly in Malawi, where infrastructure and government policy have been barriers. However, changes in policymaking and the introduction of data centres have drawn agricultural stakeholders who are pushing for the development of ICT-based technologies. The desired innovations are to support farmers in making data-guided decisions for climate change mitigation, increased productivity, and environmental sustainability. The goal of this project was to create a crop recommendation system that makes use of an ML model to forecast the best crop for farmland based on its physical, chemical, and meteorological parameters. Firstly, unlabelled data for the central region of Malawi was collected from the Department of Land and the Department of Climate Change and Meteorological Services. The data were merged, cleaned, and formatted using three methods: label encoding of categorical features; label encoding of categorical features and normalisation; and label encoding of ordinal features, one-hot encoding of nominal features, normalisation, and principal component analysis (PCA) dimensionality reduction. A K-means clustering data preprocessing step was applied, and five centroids were extracted, analysed by an expert agronomist, and labelled as conducive for maize, cassava, rice, beans, and sugarcane crops, respectively. Then, ten classifier algorithms, namely Logistic Regression (LRC), K-Nearest Neighbours (KNC), Support Vector Machine (SVC), Multilayer Perceptron (MLPC), Decision Tree (DTC), Random Forest (RFC), Gradient Boosting (GBC), Adaptive Boosting (ABC), eXtreme Gradient Boosting (XGBC), and Multinomial Naïve Bayes (MNBC), were trained on the three kinds of formatted datasets. A 5-fold cross-validation (CV) technique was used to assess the performance of the models on the three formatted datasets, evaluated based on the F1 score and accuracy metrics. Lastly, the models were scored based on the CV's average F1 and accuracy scores, the model's structural complexity, and training times. Formatting technique 1 resulted in poor performance across models that use Euclidean distance measures, and formatting technique 3 was the most conducive for all the models except for ABC and MNBC. On formatting 3, the KNC outperformed the other models with F1 and accuracy scores of 99%, a fast training speed, and a simple model structure. The KNC was later integrated into a test web application as its proposed method of deployment. The proof-of-concept model shows reliable results but requires further development for real-time implementation. |
Financial Time Series Modelling using Gramian Angular Summation Fields | Gramian angular summation fields (GASF) and Markov transition fields (MTF) have been developed as an approach to encode time series into images, which allows the use of techniques from computer vision for time series classification and imputation. These techniques have been evaluated on a number of different time series problems. This research assignment applies GASF and MTF to financial time series. As a first step, a suitable financial time series is collected from a real-world system and analyzed. The data quality is determined to identify data quality issues to be addressed. The cleaned financial time series is encoded into images and validated using an appropriate technique to determine if a logical mapping between the time series and image planes exists. The financial time series is analyzed to determine its characteristics, which are used to guide the formulation of a modeling problem. The modeling problem compares the usefulness of the GASF and MTF approaches against conventional time series modeling and analysis techniques. The four models considered for the formulated modeling problem consist of time series and image modeling approaches. The results from the experiment indicate that the time series approaches are better suited to this specific modeling problem. The GASF and MTF approaches do, however, provide promising outcomes when used in a combinatorial fashion. The usage of a combination of GASF and MTF images does allow a model to learn better features when combined with sequence-based approaches, which improves model performance. |
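The GASF and MTF encodings are available in the pyts library; below is a minimal sketch of encoding a toy price series and stacking the two images channel-wise, in the spirit of the combinatorial usage described above (the `image_size` and `n_bins` values are illustrative choices, not the assignment's):

```python
import numpy as np
from pyts.image import GramianAngularField, MarkovTransitionField

# Toy "price" series reshaped to (n_samples, n_timestamps) as pyts expects.
prices = np.cumsum(np.random.default_rng(6).normal(size=128)).reshape(1, -1)

gasf = GramianAngularField(image_size=32, method="summation")
mtf = MarkovTransitionField(image_size=32, n_bins=8)

gasf_img = gasf.fit_transform(prices)[0]   # (32, 32) GASF image
mtf_img = mtf.fit_transform(prices)[0]     # (32, 32) MTF image

# Stacking both encodings channel-wise is one way to combine them
# before feeding an image model.
combined = np.stack([gasf_img, mtf_img], axis=-1)
print(combined.shape)                      # (32, 32, 2)
```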
|
Machine Learning-based Nitrogen Fertilizer Guidelines for Canola in Conservation Agriculture Systems | Soil degradation is a major problem that South African agriculture faces, and policy-makers pay special attention to it. South Africa has especially looked at the negative effects of applying poor land practices in the agriculture sector. This research assignment attempts to use machine learning (ML) algorithms to predict the amount of nitrogen (N) to add to canola to achieve an approximately optimal yield. This is displayed in the form of a table, known as the fertiliser recommendation system, which can be used by a farmer to achieve the desired yield. The ML algorithms used in this assignment include: random forest regressor, extra trees regressor, artificial neural network, deep neural network, k-nearest neighbour, multiple linear regression and multivariate adaptive regression splines. The primary objective of precision agriculture (PA) is to increase agricultural productivity and quality while lowering costs and emissions. Furthermore, early detection and mitigation of crop yield limiting factors may contribute to increased output and profit, and yield prediction is critical for a number of crop management and economic decisions discussed in this research assignment. The random forest regressor proved to be the most accurate in forecasting yield. The resulting random forest regressor model demonstrated that machine learning could potentially forecast canola production given some characteristics. These characteristics include, but are not limited to, average rainfall, year of planting, the amount of N remaining in the soil from the previous harvest, and the rainfall for each month from the planting date to the harvest date. |
The use of historical tracking data to estimate or predict vehicle travel speeds | York Timbers is an integrated forestry company that grows and manufactures lumber and plywood products. The plantations owned and maintained by York Timbers contain an expansive road network consisting of 26 661 road segments with a total length of approximately 10 000 km. In order to optimize the delivery of timber from the plantations to the mill sites, the travel speed of each road segment must be estimated. To estimate the speed of each road segment in the road network, global positioning system (GPS) measurements are first matched to the self-owned road network. Map-matching is done in two parts. First, the GPS measurements are assigned to the closest road segment based on Euclidean distance. Next, the connectivity of road segments is analyzed to fix any errors introduced during map-matching. The average travel speed is then calculated for each road segment using the matched GPS measurements. The majority of the road segments, however, do not have GPS measurements associated with them. To estimate the travel speed of road segments without GPS measurements, five different predictive models are developed. The best performance is obtained using a regression tree, which achieves a mean absolute error of 10.02 km/h on data not used to train the model. To improve the speed estimation accuracy, further refinement of the speed estimation model and speed prediction model is required. Increasing the number of GPS measurements used in the estimation and prediction of travel speed will improve the model performance. Including other data that influences safe travel speed, such as weather data, will further improve the model performance. Identifying dangerous portions of the road network is also suggested before a model is implemented. |
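A minimal sketch of the first map-matching pass and the per-segment speed averaging described above, using distance to segment midpoints as a simplification of true point-to-segment distance:

```python
import numpy as np

def match_and_average(points: np.ndarray, speeds: np.ndarray,
                      segment_midpoints: np.ndarray) -> dict:
    """Assign each GPS point to the nearest road segment (here by
    distance to a segment midpoint, a simplification), then average
    the observed speeds per segment."""
    dists = np.linalg.norm(points[:, None, :] - segment_midpoints[None], axis=2)
    nearest = dists.argmin(axis=1)
    return {int(s): float(speeds[nearest == s].mean())
            for s in np.unique(nearest)}

rng = np.random.default_rng(7)
points = rng.uniform(0, 10, size=(200, 2))    # GPS fixes (x, y)
speeds = rng.uniform(20, 60, size=200)        # observed km/h at each fix
segments = rng.uniform(0, 10, size=(15, 2))   # segment midpoints
print(match_and_average(points, speeds, segments))
```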
A Review and Analysis of Imputation Approaches | Missing data is a common and major challenge which almost all data practitioners and researchers face, and which greatly affects the accuracy of any decision making process. Data mining and data preparation require that the data is prepared, cleaned, transformed, and reduced in order to ensure that the integrity of the dataset has been maintained. Missing data is found and addressed within the data cleaning process, during which the user needs to decide on how to handle the missing data so as to not introduce significant bias into the dataset. Current methods of handling missing data include deletion and imputation methods. This research assignment investigates the performance of different imputation methods, specifically statistical and machine learning imputation methods. The statistical imputation methods investigated are mean, hot deck, regression, maximum likelihood, Markov chain Monte Carlo (MCMC), multiple imputation by chained equations, and expectation-maximization with bootstrapping imputation. The machine learning methods investigated are k-nearest neighbor (kNN), k-means, and self-organizing maps imputation. This research paper uses an empirical procedure to facilitate the formatting and transformation of the data, and the implementation of the imputation methods. Two experiments are conducted in this research: one in which the imputation methods are evaluated against datasets which are clean, and another in which the imputation methods are evaluated against datasets which contain outliers. The performance achieved in both experiments is evaluated using the root mean squared error, mean absolute error, percent bias, and predictive accuracy. For both experiments, it is found that MCMC imputation resulted in the best performance out of all 10 imputation methods, with an overall accuracy of 75.71%. kNN imputation resulted in the second highest accuracy, with an overall accuracy of 69.85%, but introduced a large percent bias into the imputed dataset. This research concludes that single statistical imputation methods (mean, hot deck, and regression imputation) should not be used to replace missing data in any situation, while multiple imputation methods are shown to have a consistent performance. MCMC imputation in particular performs the best out of all 10 imputation methods in this research, producing a high accuracy and low bias in the imputed dataset. The performance of MCMC imputation, along with its ease-of-use, makes the imputation method a suitable choice when dealing with missing data. |
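The evaluation protocol, masking known values, imputing, and scoring the error on the masked entries only, can be sketched with scikit-learn imputers standing in for the methods studied (mean and kNN here; MCMC and the other multiple-imputation methods are not part of scikit-learn):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(8)
X_true = rng.normal(size=(200, 5))

# Remove 10% of entries completely at random.
mask = rng.uniform(size=X_true.shape) < 0.10
X_missing = X_true.copy()
X_missing[mask] = np.nan

def imputation_rmse(imputer) -> float:
    """RMSE between imputed and true values on the masked entries only."""
    X_imp = imputer.fit_transform(X_missing)
    return float(np.sqrt(np.mean((X_imp[mask] - X_true[mask]) ** 2)))

print("mean:", imputation_rmse(SimpleImputer(strategy="mean")))
print("kNN: ", imputation_rmse(KNNImputer(n_neighbors=5)))
```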
|
Crawler Detection Decision Support: A Neural Network with Particle Swarm Optimisation Approach | Website crawlers are popularly used to retrieve information for search engines. The concept of website crawlers was first introduced in the early nineties. Website crawling entails the deployment of automated crawling algorithms that crawl websites with the purpose of collecting and storing information about the state of other websites. Website crawlers are categorized as good website crawlers or bad website crawlers. Good website crawlers are used by search engines and do not cause harm when crawling websites. Bad website crawlers crawl websites with malicious intent and could potentially cause harm to websites or website owners. Traffic indicators on websites are inflated if website crawlers are incorrectly identified. In some cases, bad crawlers are used to intentionally crash websites. The consequences of bad website crawlers highlight the importance of successfully distinguishing human users from website crawler sessions in website traffic classification. The focus of this research assignment is to design and implement artificial neural network algorithms capable of successfully classifying website traffic as a human user, good website crawler session, or bad website crawler session. The artificial neural network algorithms are trained with particle swarm optimizers and are validated in case studies. First, the website traffic classification problem is considered in a stationary environment and is treated as a standard classification problem. For the standard classification problem, an artificial neural network with particle swarm optimization is applied. The constraints associated with this initial problem assume that the behavioural characteristics of humans and the behavioural characteristics of web crawlers remain constant over a period of time. Thereafter, the classification problem is considered in a non-stationary environment. The dynamic classification problem exhibits concept drift due to the assumption that website crawlers change behavioural characteristics over time. To solve the dynamic classification problem, artificial neural networks are formulated and optimized with quantum-inspired particle swarm optimisation. Results demonstrate the ability of the artificial neural networks optimised with particle swarms to classify website traffic in both stationary and non-stationary environments successfully to a reasonable extent. |
|
A comparative study of different single-objective metaheuristics for hyper-parameter optimisation of machine learning algorithms | Over the past three decades machine learning has evolved from a research curiosity to a practical technology that enjoys widespread commercial success. In the continuous quest to gain a competitive advantage and thereby market share, companies are highly incentivised to adopt technologies that reduce costs and/or increase productivity. Machine learning has proved to be one of these technologies. A significant trend in the contemporary machine learning landscape has been the rise of deep learning, which has experienced tremendous growth in its popularity and usefulness, predominantly driven by larger data sets, increases in computational power and more efficient training procedures. The recent interest in deep learning (along with automated machine learning frameworks), both having many hyper-parameters and large computational expenditure, has prompted a resurgence of research on hyper-parameter optimisation. Stochastic gradient descent and other derivative-based optimisation methods are seldom used for hyper-parameter optimisation, because derivatives of the objective function with respect to the hyper-parameters are generally not available. The objective function for hyper-parameter optimisation is therefore considered to be a black-box function. Conventionally, hyper-parameter optimisation is performed manually by a domain expert to keep the number of trials at a minimum; however, with modern compute clusters and graphics processing units it is possible to run more trials, in which case algorithmic approaches are favoured. The process of finding a high-quality set of hyper-parameter values for a machine learning algorithm is often time-consuming and compute-intensive, therefore efficiency is considered one of the most important metrics by which to evaluate the effectiveness of a hyper-parameter optimisation technique. Popular algorithmic methods for hyper-parameter optimisation include grid search, random search, and more recently Bayesian optimisation. Metaheuristics, defined as high-level problem-independent frameworks that serve as guidelines for the design of underlying heuristics to solve a specific problem, are investigated as an alternative to traditional hyper-parameter optimisation techniques. Genetic algorithms, particle swarm optimisation, and estimation of distribution algorithms were selected to represent metaheuristic algorithms. To compare traditional and metaheuristic hyper-parameter optimisation algorithms on the basis of efficiency, a test suite comprising various data sets and machine learning algorithms is constructed. The machine learning algorithms considered in this research assignment are support vector machines, multi-layer perceptrons, and convolutional neural networks. The efficiency of hyper-parameter optimisation algorithms is compared using independent case studies, where the hyper-parameters of a different machine learning algorithm are optimised in each case. Friedman omnibus tests are employed to determine whether a difference in average rank exists between the outcomes obtained using the respective hyper-parameter optimisation techniques. Upon rejection of the null hypothesis of the Friedman test, Nemenyi post hoc tests are performed to identify pairwise differences between hyper-parameter optimisation techniques. Other fitting metrics of solution quality, such as computational expenditure, are also investigated. |
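The Friedman-then-Nemenyi procedure described above can be sketched with scipy and the scikit-posthocs package; the scores below are made up, with rows as case studies and columns as HPO techniques:

```python
import numpy as np
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare

# Efficiency scores of three HPO techniques over six case studies
# (rows = case studies, columns = techniques); numbers are illustrative.
scores = np.array([[0.81, 0.84, 0.86],
                   [0.75, 0.79, 0.80],
                   [0.90, 0.91, 0.94],
                   [0.62, 0.66, 0.65],
                   [0.88, 0.87, 0.92],
                   [0.70, 0.74, 0.78]])

stat, p = friedmanchisquare(scores[:, 0], scores[:, 1], scores[:, 2])
print(f"Friedman: chi2={stat:.3f}, p={p:.4f}")
if p < 0.05:
    # Pairwise Nemenyi post hoc test on the same blocked data.
    print(sp.posthoc_nemenyi_friedman(scores))
```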
Predicting employee burnout using machine learning techniques | While artificial intelligence techniques and methods, and the subsequent possibilities of using these to solve business problems, are well understood in some industries, including life insurance and banking, applying them to the domain of human capital management has been met with varying levels of success and value. Models that assist in recruitment activities or predict employee attrition have been successfully implemented by many organisations. However, there are also many pitfalls to guard against, including managing inherent bias in the data used as well as how the output of such models is used, often leading to ethical concerns. In this research assignment, multiple classification models and machine learning algorithms are applied to the problem of identifying employees at risk of burnout, with the aim of producing outputs that can be used to ethically and proactively guide wellbeing-related interventions across the business. The results show that none of the approaches were successful in accurately meeting this objective, with an artificial neural network approach assessed as the most accurate of all the models implemented. By evaluating each classification model’s performance, it was found that none of the implemented approaches were more than 50% accurate. |
Title | Abstract |
---|---|
Comparison of Machine Learning Models for the Classification of Fluorescent Microscopy Images | The lasting health consequences of a COVID-19 infection, referred to as Long COVID, can be severe and debilitating for the individual afflicted. Symptoms of Long COVID include fatigue and brain fog. These symptoms are caused by microclots that form in the bloodstream and are not broken up by the body. Microclots in the bloodstream can entangle with other proteins and can limit oxygen exchange. This inhibition of the oxygen exchange process can cause most of the symptoms experienced with Long COVID. Diagnosis and identification of individuals suffering from Long COVID is the first step in any process that aims to alleviate the symptoms of the individual, or cure them. Current identification processes are manual and as such limited by the amount of manpower applied to the task. Automating parts of the process with machine learning can greatly speed up this process and allow more efficient use of manpower. The purpose of this research assignment is to investigate whether or not machine learning algorithms can be used to classify fluorescent microscopy images as being indicative of Long COVID or not. This is done by training models on, and predicting from, features extracted from fluorescent microscopy images using computer vision techniques. A comparison between the performance of the machine learning algorithms used in this research assignment is also explored. It was found that logistic regression is a good choice of classifier, with strong performance in the classification of both the positive and negative classes. |
Anomaly Detection in Support of Predictive Maintenance of Coal Mills using Supervised Machine Learning Techniques | Since the beginning of time, people have been dependent on technology. With each industrial revolution, people became more reliant on machines and, in parallel, on the need to maintain them. The goal of any maintenance organisation is always the same: to maximise asset availability. Massive strides in technology have paved the way for Industry 4.0, where the focus starts to shift from preventive maintenance to predictive maintenance. Predictive maintenance does not follow a schedule like preventive maintenance; instead, maintenance is performed when it is necessary, not too early or too late. This research assignment identifies an area of research where a study is performed in support of the predictive maintenance of coal mills through supervised machine learning. The assignment uses the coal mill data from a case study company to identify data quality issues, address these issues, prepare the data for machine learning and, finally, build a machine learning model which aims to predict when failure is most likely to occur. The assignment evaluates the feasibility of building a supervised machine learning model using the given data and methodology, draws conclusions about the findings and identifies opportunities for future research. |
Comparison of unsupervised machine learning models for identification of financial time series regimes and regime changes | Financial stock data has been studied extensively over many years with the objective of generating the best possible return on an investment. It is known that financial markets move through periods where securities are increasing in value (bull markets) and periods where these securities decrease in value (bear markets). These periods that exhibit similarities over different time frames are often referred to as regimes, which are not necessarily limited to bull and bear regimes, but include any sequences of data that experience correlated trends. Regime extraction and detection of regime shift changes in financial time series data can be of great value to an investor. An understanding of when these financial regimes will change, and what type of regime the financial market is tending towards, can help improve investment decisions and strengthen financial portfolios. This research deals with reviewing and comparing the viability of different regime shift detection algorithms when applied to multivariate financial time series data. The selected algorithms are applied to different stocks from the Johannesburg Stock Exchange (JSE), where the algorithms’ performances are compared with respect to regime shift detection accuracy and the profitability of regimes in selected investment strategies. |
Detection of chronic kidney disease using machine learning algorithms | Chronic Kidney Disease (CKD) is a significant public health concern worldwide that affects one in every ten people globally. CKD results from a poorly functioning kidney that fails at basic functionalities, including removing toxins, waste, and extra fluids from the blood. The build-up of this problematic material in the body can cause complications such as hypertension, anaemia, weak bones, and nerve damage. CKD often occurs in individuals that suffer from additional chronic illnesses such as diabetes, heart disease, and hypertension, in addition to the existence of unfavourable health habits and practices that lead to the kidney’s deplorable state. The presence of additional illnesses that occur in tandem with CKD hinders its successful and early detection. The onset of CKD can be clinically detected using laboratory tests focusing on specific standard parameters such as the glomerular filtration rate (GFR) and the albumin-creatinine ratio. Kidney damage occurs in stages, with each subsequent stage indicating a severe reduction in the glomerular filtration rate. The GFR parameter is considered a facet of the indication of renal failure and the final stage of chronic kidney disease. It is therefore imperative to use early detection methods to assist in the early administration of treatment to alleviate the symptoms of the disease and combat its progression. Early-stage diagnosis involves medications, diet adjustments and invasive procedures. In developing countries, especially in Africa, the prevalence of CKD is estimated at 3 to 4 times that of developed countries in Europe, America and Asia. The current dialysis treatment rate in South Africa stands at approximately 70 per-million population (pmp), and the transplant rate stands at approximately 9.2 pmp. The recorded prevalence rate mainly considers individuals with access to private health care options through affordability or medical insurance; however, most South Africans (approximately 84%) depend on the under-resourced, government-funded public health systems. The disparity in treatment affordability among South Africans of different economic classes introduces a two-tiered health system that affects access to quality treatments. Early detection and diagnosis using machine learning algorithms is therefore an important endeavour in the field of CKD and the other chronic illnesses plaguing the nation. Machine learning applications in the health care sector aim to revolutionise the early detection and treatment of chronic illness for the greater global population. Since early detection and management are vital in preventing disease progression and reducing the risk of complications, some machine learning (ML) models have been developed to detect CKD. The primary purpose of this study is to review, develop and recommend various machine learning classification models for the efficient detection of chronic kidney disease using three datasets. These datasets include two UCI Machine Learning Repository datasets, Chronic Kidney Disease and Risk Factor Prediction of Chronic Kidney Disease, and the PLOS ONE dataset Chronic kidney disease in patients at high risk of cardiovascular disease in the United Arab Emirates: A population-based study. The final aim is to construct a high-performing ML model that has effectively and accurately learned the hidden correlations in the symptoms exhibited by CKD patients. |
Feature engineering approaches for financial time series forecasting using machine learning | This research assignment investigates feature engineering methods for financial time series forecasting using machine learning. The goal of the work is to investigate methods that overcome some of the time series characteristics which make forecasting difficult, namely noise and non-stationarity. A literature review is conducted to identify suitable feature engineering methods and machine learning approaches for financial time series forecasting. A case study is developed to test the identified feature engineering methods with an empirical machine learning process, and multiple machine learning models are tested. To understand the benefit of the feature engineering methods, the forecasting results are compared with and without the application of the feature engineering methods. Several feature engineering methods are identified: differencing and log-transforms are two methods investigated to address non-stationarity, while moving averages, exponentially weighted moving averages, and Fourier and wavelet transforms are all methods investigated to reduce noise. The feature engineering methods are implemented as preprocessing steps prior to training machine learning models for a supervised learning problem. The supervised learning problem is to forecast the asset price a single day ahead, given ten days of previous prices. Four machine learning models commonly used for financial time series forecasting are investigated, namely linear regression, support vector regression (SVR), multilayer perceptron (MLP), and long short-term memory (LSTM) neural networks. The work investigates the feature engineering methods and machine learning models on four univariate time series signals. The results of the investigation found that no feature engineering method is universally helpful in improving forecasting results. For the SVR, MLP and LSTM models, denoising or smoothing the signals did improve their performance, but the best denoising or smoothing technique varies depending on the dataset used. Differencing and log-transforms caused the models to forecast a constant value near the mean of the expected daily price returns, which, when inverted back to the price domain, caused poor regression evaluation metrics but good directional accuracy. The findings of this research assignment are that the investigated feature engineering methods may improve forecasting performance for financial time series, but that the gains are not large. There appears to be limited improvement to be gained from feature engineering past price data to predict future prices, at least for the investigated feature engineering methods. It is therefore recommended that future work focus on finding alternative data sources with predictive power for the financial time series. |
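The stationarity and denoising transforms listed above are one-liners in pandas; a minimal sketch, including the ten-lag supervised framing, with window lengths as illustrative choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
price = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))

features = pd.DataFrame({
    "log_return": np.log(price).diff(),   # log-transform + differencing
    "diff": price.diff(),                 # plain differencing
    "sma_10": price.rolling(10).mean(),   # moving-average smoothing
    "ewma_10": price.ewm(span=10).mean(), # exponentially weighted MA
}).dropna()

# Supervised framing as above: ten lagged prices -> next-day price.
lags = pd.concat({f"lag_{k}": price.shift(k) for k in range(1, 11)}, axis=1)
target = price.shift(-1)
print(features.head(), lags.dropna().shape, target.dropna().shape)
```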
Forecasting armed conflict using long short-term memory recurrent neural networks | Various recent studies, arriving amid the big-data revolution, have shown an optimistic future for social conflict forecasting through more data-driven approaches. Conflict forecasting models can be used to reduce the severity of events or to intervene to prevent these events from materialising or escalating. As such, these predictive models are of interest to numerous institutions and organisations, such as governments, non-governmental organisations, humanitarian agencies, and even insurance companies. In this mini-dissertation, long short-term memory recurrent neural network modelling is applied to forecast armed conflict events in the Afghanistan conflict, which started in October 2011. This model utilises world news data from the Global Database of Events, Language, and Tone (GDELT) platform and georeferenced event data from the Uppsala Conflict Data Program (UCDP) to make its predictions. The results show that GDELT data can improve conventional baseline forecasting models to an extent by incorporating actor and event attributes that are unique to the conflict at hand. Furthermore, the results indicate that news media data can be consolidated with actual recorded deaths in the forecasting model, which enables predictions that are grounded in reality. (An illustrative sketch follows this table.) |
Comparison of machine learning models on different financial time series | The efficient market hypothesis implies that shrewd market predictions are not profitable, because each asset remains correctly priced by the weighted intelligence of the market participants. Several companies have, however, shown that the efficient market hypothesis does not hold in practice. Consequently, a considerable amount of research has been conducted to understand the performance and behaviour exhibited by financial markets, as such insights would prove valuable in the quest to identify which products will provide a positive future return. Recent advancements in artificial intelligence have presented researchers with exciting opportunities to develop models for forecasting financial markets. This dissertation investigated the capabilities of different machine learning models to forecast the future percentage change of various assets in financial markets. The financial time series (FTS) data employed are the S&P 500 index, the US 10-year bond yield, the USD/ZAR currency pair, gold futures and Bitcoin. Only the closing price data for each FTS was used. The machine learning (ML) models investigated are linear regression, autoregressive integrated moving average, support vector regression (SVR), multilayer perceptron (MLP), recurrent neural network, long short-term memory and gated recurrent unit. This dissertation uses an empirical procedure to facilitate the formatting, transformation, and modelling of the various FTS data sets on the ML models of interest. Two validation techniques are also investigated, namely single out-of-sample validation and walk-forward validation. The performance capabilities of the models are then evaluated with the mean square error (MSE) and accuracy metrics. Within the context of FTS forecasting, the accuracy metric refers to the ratio of correct guesses about whether the price moved up or down to the total number of guesses. An accuracy even one percentage point above 50% is considered substantial when forecasting FTS, because a 1% edge on the market can result in a higher average return, which outperforms the market. In the individual analyses of the single out-of-sample and walk-forward validation techniques, the linear regression model was the best ML model for all FTS, because it is the most parsimonious model. Parsimony was disregarded when comparing and contrasting the two validation techniques. The ML models applying walk-forward validation performed best in terms of MSE on the S&P 500 index and the US 10-year bond yield: the SVR model obtained the highest accuracy of 52.94% on the S&P 500 index, and the MLP model obtained the highest accuracy of 51.26% on the US 10-year bond yield. The ML models applying single out-of-sample validation performed best in terms of MSE on the USD/ZAR currency pair, gold futures and Bitcoin: the MLP model obtained the highest accuracy of 51.77% and 53.51% for the USD/ZAR currency pair and gold futures, respectively, and the linear regression model obtained the highest accuracy of 55.04% for Bitcoin. (An illustrative sketch follows this table.) |
Proximal methods for seedling detection and height assessment using RGB photogrammetry and machine learning | An ever-growing global population, coupled with increasing per capita consumption and higher demand for wood-based products, has contributed towards growing demand for planted forests. The efficiencies of such forests are in no small part due to ensuring planted seedlings are well suited to the local environment. This, in turn, has resulted in a growing demand for nurseries to cultivate such seedlings. Nursery operators are faced with the challenge of monitoring stock levels and determining the growth stage of the stock on hand. This typically involves laborious manual assessments based on statistical sampling of only a small percentage of the stock on hand. In this study, a framework for the proximal detection and height assessment of seedlings is proposed. Photogrammetry is employed using red-green-blue (RGB) imagery captured with a smartphone to produce digital surface models (DSMs) and orthomosaic images. Three image collection strategies are proposed and evaluated based on ground control point accuracy. A RetinaNet object detection model, pre-trained on unmanned aerial vehicle (UAV) derived RGB imagery, is utilised for the object detection task. Transfer learning is leveraged by retraining the detection model on a single seedling tray consisting of 98 seedlings, with the model trained on the orthomosaics produced by the photogrammetry process. To determine the heights of these seedlings, two proposals for sampling the seedling height from the DSM are proffered and evaluated. Finally, a number of regression algorithms are investigated as tools to refine the sampled height; ultimately, the ensemble-based AdaBoost regression algorithm achieves the best performance. The proposed pipeline is able to detect 98.97% of seedlings at an intersection over union (IOU) of 76.93%, with only a single instance misclassified. The final root mean squared error (RMSE) of 17.26 mm achieved by the height refinement process on the test data suggests performance sufficient to enable an improved understanding of stock quantities and growth stage without the need for manual intervention. (An illustrative sketch follows this table.) |
Automated tree position detection and height estimation from RGB aerial imagery using a combination of a local-maxima based algorithm, deep learning and traditional machine learning approaches | Forest mensuration is a pivotal aspect of forest management, particularly when determining the total biomass, and subsequently the fiscal value, of forest plantations. Terrestrial measurement of phenotypic tree attributes tends to be laborious and time-consuming. Remote sensing (RS) approaches have revolutionised the way in which forest mensuration is conducted, especially given the reduced costs and increased accessibility associated with leveraging unmanned aerial vehicles (UAVs) that incorporate high-resolution imaging sensors. The rapid development of digital aerial photogrammetry (DAP) technologies has provided a viable alternative to airborne laser scanning (ALS), a technology that has typically been reserved for applications in which high accuracy is required and budget constraints are not a major concern. Furthermore, machine learning (ML), and particularly computer vision (CV), are becoming increasingly commonplace in the processing of orthomosaic rasters and canopy height models (CHMs). Traditionally, an ALS- or DAP-derived CHM has been utilised, together with a local maxima-type model, to detect tree crown apexes and estimate tree heights. In this study, a forest stand located in KwaZulu-Natal, South Africa, comprising 4 968 Eucalyptus dunnii tree positions spaced at 3×2 metres, was considered. A local maxima (LM) algorithm was employed as the baseline model to improve on. The output of the LM algorithm was, however, also utilised in an ensemble of ML models designed to better estimate tree positions and heights. A hybrid approach was proposed that integrates object detection, classification, and regression models in an ML model framework, with the intention of improving on the accuracies achieved by the LM algorithm. The object detection model was built on the RetinaNet one-stage detection model, which comprises a feature pyramid network (FPN) and employs a focal loss (FL) function rather than the typical cross-entropy (CE) loss function, addressing the extreme class imbalance typically encountered by object detection models. This RetinaNet was made available as part of the DeepForest (DF) Python package, and the underlying network had been pretrained on a substantial amount of forest canopy imagery. To improve the model, hand annotations of trees depicted in the DAP-derived orthomosaic were generated and subsequently employed in further training the DF model through transfer learning. A support vector machine (SVM) model was built to filter misclassified tree positions and to act as a differentiator between legitimate and illegitimate tree positions. Furthermore, a multi-layer perceptron (MLP) was trained to address the inherent bias present in the CHM and improve tree height estimations sampled from the CHM. The improvements in tree position and height accuracies were noticeable. Tree position MAE was improved by 15.68%, from 0.3515 metres to 0.2964 metres. Tree height RMSE was improved by 25.30%, from 0.6435 metres to 0.4807 metres, while R2, with respect to height, was increased by 15.22%, from 0.6662 to 0.7676. The proportion of total trees detected was reduced by 3.33%, from 98.77% to 95.48%. The numbers of dead and invalid tree positions detected were, however, also reduced by 82.35% and 36.36%, respectively, suggesting a substantial improvement in the quality of tree positions detected. The results demonstrate the potential improvements that can be realised by incorporating ML approaches and DAP-derived data. (An illustrative sketch follows this table.) |
Fantasy Premier League Decision Support: A Meta-learner Approach | The Fantasy Premier League is a popular online fantasy sport game in which players, known as managers, construct so-called dream-teams based on soccer players in the English Premier League. Each player in the dream-team is assigned a points score based on their performance in each gameweek's fixtures, and the goal of the fantasy sport is to maximize the points accumulated over the course of an entire season. Each season consists of thirty-eight gameweeks, with managers required to select eleven starting players, a captain, and four substitute players for each gameweek. Unless a so-called special chip is used, only eleven of the fifteen players can accumulate points during each gameweek. The manager's selected dream-team is carried over to the successive gameweek, with managers allowed to transfer players into and out of their teams each gameweek. Managers are penalized for excessive player transfers and, adding to the strategic complexity of the fantasy game, face strict constraints when formulating their teams. The so-called dream-team formulation problem can be decomposed into an initial dream-team formulation sub-problem and a subsequent player-transfer sub-problem. The constraints associated with these sub-problems can be expressed as a system of linear constraints and, given an estimate of a player's expected performance in a fixture, a set of suggested player transfers can be obtained by using linear programming. The focus in this project is to design and implement a set of machine learning algorithms capable of forecasting the expected points of the players in a gameweek's fixtures, after which a decision support system is designed and implemented to obtain a suggested initial dream-team and a set of player transfers for the subsequent gameweeks. A total of five machine learning algorithms are considered, with each algorithm selected from a distinctly-functioning family of learning algorithms: linear regression techniques, as well as kernel-based, neural network, decision tree ensemble, and nearest-neighbour algorithms. The applicability of using a stacked meta-learner is investigated, where the meta-learner is provided with predictions generated by the five implemented algorithms. A case study is performed on the 2020/21 Fantasy Premier League season, in which the quality of the suggested player transfers is validated. The final results demonstrate that the decision support system performs favorably: the best set of suggested player transfers would have placed in the top 5.98% of eight million real-world managers in the 2020/21 season. (An illustrative sketch follows this table.) |
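For the CKD detection study above, the following is a minimal sketch of how candidate classifiers might be screened on tabular kidney-disease data. The file name, target column, and model choices are illustrative assumptions, not the study's actual pipeline; categorical attributes are dropped here purely for brevity.

```python
# Hedged sketch: screening classifiers for CKD detection on tabular data.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("ckd.csv")                  # hypothetical local export of the UCI data
X = df.drop(columns=["class"])               # "class" as target column is an assumption
y = (df["class"] == "ckd").astype(int)

numeric_cols = X.select_dtypes(include="number").columns
preprocess = ColumnTransformer(
    [("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                       ("scale", StandardScaler())]), numeric_cols)],
    remainder="drop")                        # categorical columns omitted for brevity

# Two representative model families; the study compares several more.
for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    pipe = Pipeline([("pre", preprocess), ("clf", clf)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
    print(f"{type(clf).__name__}: mean F1 = {scores.mean():.3f}")
```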
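For the feature engineering study above, this sketch shows the transforms named in the abstract (differencing, log-transform, moving average, exponentially weighted moving average) and the ten-day windowing of the supervised problem, assuming a pandas price series. Window sizes are illustrative, and the Fourier and wavelet transforms are omitted here.

```python
# Hedged sketch of the feature engineering transforms on a univariate price series.
import numpy as np
import pandas as pd

def engineer_features(prices: pd.Series) -> pd.DataFrame:
    feats = pd.DataFrame(index=prices.index)
    feats["log_return"] = np.log(prices).diff()      # log-transform + differencing (non-stationarity)
    feats["sma_10"] = prices.rolling(10).mean()      # simple moving average (smoothing)
    feats["ewma_10"] = prices.ewm(span=10).mean()    # exponentially weighted moving average
    return feats.dropna()

def make_windows(prices: pd.Series, lags: int = 10):
    # Supervised framing: ten lagged prices predict the next day's price.
    X = np.column_stack([prices.shift(i) for i in range(lags, 0, -1)])
    y = prices.values
    mask = ~np.isnan(X).any(axis=1)
    return X[mask], y[mask]
```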
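For the armed conflict forecasting study above, the following shows the general shape of a many-to-one LSTM forecaster. The window length, feature count, and all hyperparameters are placeholder assumptions for illustration, not those of the mini-dissertation.

```python
# Hedged sketch: a many-to-one LSTM mapping windows of event features
# (e.g. GDELT-derived counts) to a single fatality target (e.g. UCDP data).
import tensorflow as tf

n_steps, n_features = 12, 8          # window length and feature count (assumptions)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_steps, n_features)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="relu"),  # fatality counts are non-negative
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_train, y_train, epochs=50, validation_split=0.2)
```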
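For the model comparison study above, this sketch contrasts the two validation schemes it investigates: a single out-of-sample split versus expanding-window walk-forward validation. Linear regression stands in for any of the seven models; split sizes are illustrative.

```python
# Hedged sketch of single out-of-sample versus walk-forward validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def single_out_of_sample_mse(X, y, train_frac=0.8):
    n = int(len(X) * train_frac)                      # one fixed train/test boundary
    model = LinearRegression().fit(X[:n], y[:n])
    return mean_squared_error(y[n:], model.predict(X[n:]))

def walk_forward_mse(X, y, start=200, step=20):
    preds, truth = [], []
    for end in range(start, len(X), step):            # refit on an expanding window
        model = LinearRegression().fit(X[:end], y[:end])
        preds.extend(model.predict(X[end:end + step]))
        truth.extend(y[end:end + step])
    return mean_squared_error(truth, preds)

def directional_accuracy(y_true, y_pred):
    # Fraction of correct up/down guesses, computed on percentage changes.
    return np.mean(np.sign(y_true) == np.sign(y_pred))
```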
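For the seedling assessment study above, this is a minimal sketch of the final height refinement step, in which an AdaBoost regressor maps DSM-sampled heights to measured heights. The data here is synthetic and the single input feature is an assumption; the detection stage (RetinaNet) is not reproduced.

```python
# Hedged sketch: refining DSM-sampled seedling heights with AdaBoost regression.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
sampled_mm = rng.uniform(40, 160, 98)              # DSM-sampled heights (synthetic placeholder)
true_mm = sampled_mm * 0.9 + rng.normal(0, 8, 98)  # synthetic "measured" ground truth

X = sampled_mm.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, true_mm, random_state=0)
model = AdaBoostRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
print(f"height RMSE: {rmse:.2f} mm")
```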
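For the tree position and height study above, this sketch shows one common form of the local maxima (LM) baseline on a CHM raster; the window size and minimum-height threshold are illustrative, and ties on flat crowns are not handled.

```python
# Hedged sketch: local-maxima tree top detection on a canopy height model (CHM).
import numpy as np
from scipy import ndimage

def detect_tree_tops(chm: np.ndarray, window: int = 5, min_height: float = 2.0):
    # A pixel is a candidate apex if it equals the maximum in its neighbourhood
    # and exceeds a minimum canopy height.
    local_max = ndimage.maximum_filter(chm, size=window) == chm
    tops = local_max & (chm > min_height)
    rows, cols = np.nonzero(tops)
    return list(zip(rows, cols)), chm[rows, cols]  # pixel positions and sampled heights
```

In the study, the positions and heights produced by such a baseline are then refined by the SVM filter and MLP bias-correction models.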
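For the Fantasy Premier League study above, this sketch casts a toy version of the dream-team formulation sub-problem as a binary linear programme: maximise predicted points subject to budget and squad-size constraints. Player data, the budget, and the squad size are placeholders, and several real FPL constraints (positions, per-club limits) are omitted.

```python
# Hedged sketch: dream-team selection as a binary linear programme (PuLP).
import pulp

players = [  # (name, position, cost, predicted_points): illustrative values
    ("A", "GK", 4.5, 3.2), ("B", "DEF", 5.0, 4.1), ("C", "MID", 8.0, 6.3),
    ("D", "FWD", 7.5, 5.8), ("E", "MID", 6.0, 4.9), ("F", "DEF", 4.0, 3.0),
]
x = pulp.LpVariable.dicts("pick", range(len(players)), cat="Binary")

prob = pulp.LpProblem("fpl_team", pulp.LpMaximize)
prob += pulp.lpSum(x[i] * players[i][3] for i in range(len(players)))          # expected points
prob += pulp.lpSum(x[i] * players[i][2] for i in range(len(players))) <= 25.0  # toy budget
prob += pulp.lpSum(x[i] for i in range(len(players))) == 4                     # toy squad size
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([players[i][0] for i in range(len(players)) if x[i].value() == 1])
```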
Title | Abstract |
---|---|
Requirements for 3D stock assessment of timber on landings and terminals | This project aims to address the issue of an unreliable stock assessment system in the timber supply chain, which leads to inaccurate estimations of stock volumes in log piles. The system developed in this project needs to satisfy the practical constraints of the supply chain, while generating results that are frequent and accurate. The data capturing process is required to be low-tech due to the vast rural areas covered by the timber supply chain. The method identified for achieving this is terrestrial structure from motion (SFM), using a consumer-grade camera or a smartphone. The final data used for the project takes the form of point clouds, generated both from SFM and from Unity, in order to increase the amount of data available. For the system to determine the volume of log piles, the first step is to distinguish log pile from terrain within the point cloud. To this end, a classification algorithm is developed as part of this project. The algorithm makes use of neighbourhood statistics calculated during the feature engineering process, along with features in the original point cloud dataset, and classifies log piles using K-means clustering. Once the log piles are extracted from the point cloud, an alpha shape is generated from the extracted points and used to predict the volume of the log piles. The results of the final system show that the methodology developed achieves predicted volumes of an acceptable level for the intended use case. The results of this project thus provide evidence that computer vision can benefit the timber supply chain by enabling accurate stock assessments. Finally, the project acknowledges that further work is needed to improve accuracy and to implement the system. (An illustrative sketch follows this table.) |
A predictive model for precision tree measurements using applied machine learning | Accurately determining biological asset values is of great importance for forestry enterprises; the process ought to be characterised by the proper collection of tree data by means of appropriate enumeration practices conducted at managed forest compartments. Currently, only between 5% and 20% of a forest area is enumerated to serve as a representative sample for the entire enclosing compartment. For forestry companies, timber volume estimations and future growth projections are based on these statistics, which may be accompanied by numerous unintentional errors during the data collection process. Many alternative methods for estimating and inferring tree data accurately are available in the literature; the most popular characteristic is the so-called diameter at breast height (DBH), which can also be measured by means of remote sensing techniques. Advances in laser scanning measurement apparatus over recent decades have been significant; however, these approaches are notably expensive and require specialised technical skills to operate. One of the main drawbacks associated with measuring DBH by means of laser scanning is the lack of scalability: equipment setup and data capture are arduous processes that take a significant amount of time to complete. Algorithmic breakthroughs in the domain of data science, predominantly spanning machine learning (ML) and deep learning (DL) approaches, warrant the selection and practical application of computer vision (CV) procedures. More specifically, an algorithmic approach based on monocular depth estimation (MDE) techniques, employed for the extraction of tree data features from video recordings captured using no more than an ordinary smartphone device, is investigated in this thesis. Towards this end, a suitable forest study area was identified for the experiment, and the industry partner of the project, the South African Forestry Company SOC Limited (SAFCOL), granted the necessary plantation access. The research methodology adopted for this thesis includes fieldwork at the given site, which involved first performing data collection steps according to accepted and standardised operating procedures developed for tree enumerations. This data set is regarded as the “ground truth” and comprises the target feature (i.e. actual DBH measurements) later used for modelling purposes. The video files were processed in a structured manner in order to extract tree segment patterns from the corresponding imagery. Various ML models were then trained and tested on the resulting input feature data file, producing a relative root mean squared error (RMSE%) of between 14.1% and 18.3% for the study. The relative bias yields a score between 0.08% and 1.13%, indicating that the proposed workflow solution exhibits consistent predictions, but at an undesirable error rate (i.e. RMSE) deviation from the target output. Additionally, the suggested CV/ML workflow model is capable of generating a discernibly similar spatial representation upon visual inspection, when compared with the ground-truth tree coordinates captured during fieldwork. In the pursuit of precision forestry, the proposed predictive model developed for accurate tree measurements produces DBH estimations that approximate real-world values with a fair degree of accuracy. (An illustrative sketch follows this table.) |
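For the timber stock assessment study above, this sketch covers the two core steps described: clustering point cloud points into pile versus terrain using simple neighbourhood features, and then estimating volume. A convex hull is used here as a crude stand-in for the alpha shape named in the abstract, and the feature choices and pile-selection heuristic are illustrative assumptions.

```python
# Hedged sketch: K-means pile/terrain separation plus a volume estimate.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors
from scipy.spatial import ConvexHull

def classify_and_measure(points: np.ndarray, k: int = 10) -> float:
    """points: (N, 3) array of x, y, z coordinates."""
    # Neighbourhood statistics: each point's height plus local height variation.
    nn = NearestNeighbors(n_neighbors=k).fit(points)
    _, idx = nn.kneighbors(points)
    local_std = points[idx, 2].std(axis=1)
    feats = np.column_stack([points[:, 2], local_std])

    labels = KMeans(n_clusters=2, n_init=10).fit_predict(feats)
    pile_label = labels[points[:, 2].argmax()]     # heuristic: pile contains the highest point
    pile = points[labels == pile_label]
    return ConvexHull(pile).volume                 # stand-in for the alpha-shape volume
```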
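For the precision tree measurement study above, this is a minimal sketch of the two evaluation metrics it reports, relative RMSE (RMSE%) and relative bias, computed from predicted DBH against field-measured ground truth; the exact formulations used in the thesis may differ.

```python
# Hedged sketch of the reported evaluation metrics for DBH prediction.
import numpy as np

def rmse_percent(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    return 100.0 * rmse / y_true.mean()            # RMSE relative to mean observed DBH

def bias_percent(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return 100.0 * (y_pred - y_true).mean() / y_true.mean()  # mean signed error, relative
```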