As part of the MEng (Structured) programme with a focus on Data Science, our students are required to complete a final 60-credit data science research project in which they apply and consolidate the data science knowledge gained throughout the programme. For this purpose, students solve a real-world data science problem, providing solutions for each step of the data science project life cycle, and document their work in a research assignment.
For these projects, we collaborate with industry and academic partners who are willing to propose a topic, provide the necessary data (if not publicly available), and act as domain mentors. The data set needs to be complete.
If you are interested in partnering with us for such a project, please contact DS-PROJECTS@sun.ac.za for further information about a short project proposal and deadlines.
Project proposals reviewed by the end of term 3 of a given year will be assigned to students for the following year.
Below is a list of completed research assignments, grouped by year of graduation.
March 2024 Graduation

Title | Abstract
---|---
Convolutional neural network filter selection using genetic algorithms | Ever since the release of large language models like ChatGPT, machine learning has garnered worldwide attention from laymen and scholars alike. However, the field of machine learning predates the development of these models by some time and has a rich history of successful applications in a variety of fields. Genetic algorithms and computer vision are two such areas of machine learning that have shown great promise in solving complex problems. Genetic algorithms are a type of evolutionary algorithm that can solve a wide range of optimization problems, while computer vision involves the use of machine learning models to extract insights from image and video data.
The models most commonly used in computer vision applications are a form of neural network called convolutional neural networks. Neural networks are a type of machine learning model that takes inspiration from the structure and functioning of the human brain. Convolutional neural networks refer to a type of neural network model that is especially adept at computer vision tasks such as image classification, object detection and video analysis owing to the use of convolutional layers. Convolutional neural networks can consist of millions of parameters, the majority of which are stored in the filters that the networks use during convolution operations. Thus, one major problem hampering more widespread adoption of convolutional neural network models in practice is the size of these models. The storage required to deploy these models is not trivial, leading to a need for methods to compress these models without materially degrading their predictive capabilities. One such method is filter selection and pruning, which refers to methods that assess the filters in a convolutional neural network and remove the least important filters to reduce the size of the model. This project proposes the use of a genetic algorithm to optimise the process of filter selection, allowing multiple filter selection methods to be applied concurrently. The proposed algorithm allows filters to be pruned adaptively, with the removal methods and number of filters removed being optimised for the network being pruned. When applying the proposed algorithm, we achieve 90.91% model compression at the cost of a 0.13 percentage point accuracy drop for a network trained on audio data. When applied to the classic Fashion-MNIST data set, 91.37% compression is achieved with a corresponding 0.39 percentage point drop in accuracy. We also achieved 86.06% compression while increasing accuracy by 2.37 percentage points on a model trained on the CIFAR-10 data set. These results show the utility of the algorithm and its ability to compress networks adaptively with different architectures trained on different data sets. This study reveals that genetic algorithms can be applied successfully to prune filters from convolutional neural networks and provides the underpinnings for a comprehensive genetic algorithm capable of pruning filters from any given convolutional neural network architecture. |
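To make the pruning mechanism concrete, here is a minimal, self-contained sketch of GA-driven filter selection. It is not the project's code: the L1-norm importance proxy, the toy layer, and the fitness weighting are all assumptions made for illustration.

```python
# Toy sketch of genetic filter selection: individuals are binary keep/prune
# masks over one convolutional layer's filters; fitness trades retained
# filter importance (an assumed L1-norm proxy for accuracy) against the
# compression achieved by pruning.
import numpy as np

rng = np.random.default_rng(0)
filters = rng.normal(size=(64, 3, 3, 3))           # one toy conv layer
importance = np.abs(filters).sum(axis=(1, 2, 3))   # L1 norm per filter

def fitness(mask):
    retained = importance[mask.astype(bool)].sum() / importance.sum()
    compression = 1.0 - mask.mean()                # fraction of filters pruned
    return retained + 0.5 * compression            # assumed trade-off weight

pop = rng.integers(0, 2, size=(20, 64))            # 20 random masks
for generation in range(50):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]        # keep the 10 fittest
    cut = rng.integers(1, 63)                      # one-point crossover
    children = np.concatenate([parents[:5, :cut], parents[5:, cut:]], axis=1)
    flips = rng.random(children.shape) < 0.02      # bit-flip mutation
    children = np.where(flips, 1 - children, children)
    pop = np.concatenate([parents, children, rng.integers(0, 2, (5, 64))])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print(f"kept {int(best.sum())} of 64 filters")
```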
The value of Zero-rating internet services to provide essential services to low-income communities | This research assignment explored user interests and usage patterns on a zero-rated internet platform, MoyaApp, in South Africa to determine the value of zero-rating essential services in low-income communities.
This study focused on understanding how users interact with different categories of essential services offered through the MoyaApp platform, a Datafree subsidiary, particularly on grants, education, jobs, and other information services such as weather and electricity. The researcher used data mining techniques such as temporal association rule mining and other statistical methods to analyze user interests and usage patterns. The findings revealed that many low-income users initially registered on MoyaApp to access grant services; users gradually explored other essential services over time and became regular platform users. The researcher proposed a few recommendations to improve the benefits MoyaApp provides to low-income communities: Firstly, MoyaApp should consider expanding the jobs category to cater to users with varying levels of education. Secondly, targeting grant users with information services like weather and electricity encourages engagement. Once users are regular users of the platform, they are more likely to use more beneficial services such as education and jobs, which leads to improved socio-economic status. Thirdly, the results of this study can be used to develop a recommendation engine to suggest relevant essential services to low-income users. In conclusion, this research assignment demonstrated that providing zero-rated internet services or, more accurately, reverse-billing data to low-income communities can be an effective strategy to enhance access to essential services and bridge the digital divide. |
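For readers unfamiliar with the rule-mining step, the sketch below shows plain (non-temporal) association rule mining with the mlxtend library on hypothetical session data; the study itself used temporal association rule mining, which additionally orders events in time.

```python
# Mine "users of X also use Y" rules over service-category usage flags.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One row per user session; columns are service categories (hypothetical data).
sessions = pd.DataFrame(
    [{"grants": 1, "jobs": 0, "education": 0, "weather": 1},
     {"grants": 1, "jobs": 1, "education": 0, "weather": 1},
     {"grants": 0, "jobs": 1, "education": 1, "weather": 0},
     {"grants": 1, "jobs": 0, "education": 1, "weather": 1}]
).astype(bool)

frequent = apriori(sessions, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```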
Intelli-Bone: Automated fracture detection and classification in radiographs using transfer learning | Suspected fractures are one of the most common reasons for patients to visit the emergency department (ED) in hospitals [79]. Radiographs, the primary diagnostic tool for suspected fractures, are often assessed by emergency healthcare professionals without specialised orthopaedic expertise. This restriction leads to a high number of diagnostic errors in EDs, with incorrectly diagnosed fractures accounting for over 80% of reported diagnostic mistakes [79].
Given this problem with fracture diagnostics, there is an opportunity to use artificial intelligence (AI) to assist with the diagnosis of fractures. Successful implementation of an AI system that correctly locates and classifies fractures would lead to more accurate prognosis and treatment advice. The selected fracture classification system for this research assignment is the Arbeitsgemeinschaft für Osteosynthesefragen / Orthopaedic Trauma Association (AO/OTA) classification [90]. The object detection models selected in this research to evaluate whether AI can be used for accurate location and classification of fractures according to the AO/OTA classification are the faster region-based convolutional neural network (Faster R-CNN) [115], you only look once version 8 nano (YOLOv8n) [54], you only look once version 8 large (YOLOv8l) [54], and RetinaNet [76]. A secondary problem that this research assignment addresses is that of data scarcity. Deep learning algorithms require large amounts of data to achieve exceptional performance. The target dataset in this research assignment, the distal radius dataset (DIRAD), only consists of 776 images, where roughly half of the images contain fractures. The technique applied to overcome the data scarcity problem is transfer learning. With transfer learning, the object detection models are pretrained on larger datasets such as the common objects in context (COCO) [77] and the Graz Paediatric Wrist Digital X-rays (GRAZPEDWRI-DX) dataset [95] before being trained on the target dataset. This research assignment shows that pretraining of object detectors on larger datasets leads to superior performance on scarce datasets. Furthermore, when pretraining an object detection model on a large dataset from a similar domain to perform a similar task, such as GRAZPEDWRI-DX, it leads to even better results. The pretraining of the Faster R-CNN, YOLOv8n, YOLOv8l, and RetinaNet on the GRAZPEDWRI-DX improved mean average precision at an intersection over union of 50 (mAP50) by an average of 33.6% compared to the same models trained from randomly initialised weights. The best performing model, namely the YOLOv8l, achieved a mAP50 of 59.7% on the DIRAD dataset. |
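The transfer-learning setup described here can be sketched with the ultralytics YOLOv8 API; starting from COCO-pretrained weights replaces random initialisation, and the dataset YAML path is a hypothetical stand-in for the DIRAD annotations.

```python
# Fine-tune a COCO-pretrained YOLOv8 model on a small target dataset.
from ultralytics import YOLO

# Start from pretrained weights rather than random initialisation; in the
# study the models were additionally pretrained on GRAZPEDWRI-DX first.
model = YOLO("yolov8n.pt")

# "dirad.yaml" is a hypothetical dataset config pointing at the target images.
model.train(data="dirad.yaml", epochs=100, imgsz=640)
metrics = model.val()          # validation metrics, including mAP50
print(metrics.box.map50)
```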
Evolutionary multi-objective optimisation algorithms for a multi-objective truck and drone scheduling problem | In the rapidly evolving landscape of e-commerce, the efficiency of last-mile delivery emerges as a critical bottleneck in the logistics chain. This research addresses the complexities of last mile delivery, a process significantly burdened by high costs, environmental concerns, and the increasing consumer demand for quick and convenient service. By focusing on the integration of drones with traditional truck delivery systems, this study explores an innovative solution to the challenges faced in business-to-consumer (B2C) logistics. The utilization of a combined truck and drone system presents a novel approach to optimizing delivery routes and reducing both delivery times and operational costs. This assignment introduces a multi-objective traveling salesman problem with drone interception (TSPDi), which simultaneously minimizes total delivery time and distance, thereby addressing the inherent trade-offs in last-mile logistics.
In this assignment, the non-dominated sorting genetic algorithm II (NSGA-II) and the strength Pareto evolutionary algorithm 2 (SPEA2) were adapted for the TSPDi, with modifications and enhancements to optimise their performance. A custom population initialisation function was added to both algorithms, improving the starting point for the evolutionary process. In addition, a heuristic mutation method was developed that produces feasible, high-quality solutions. To create a more varied solution pool, a mechanism for selecting unique solutions for both the parent and archive populations was implemented to ensure that no duplicate solutions occurred. This approach was especially successful in keeping a wide range of solutions during extended iterations. Empirical results showed that NSGA-II is better than SPEA2 in scenarios with larger datasets and many delivery nodes, while SPEA2 has a slight advantage in smaller datasets with fewer delivery nodes. Further analysis was performed to compare the performance of the algorithms with those of Ernst [29] and Moremi [52]. Delivery time was the most important factor in the comparison, as it was the objective optimised by Ernst [29] and Moremi [52]. The results showed that the new multi-objective evolutionary algorithms (MOEAs) performed similarly to the single-objective algorithms on the smaller datasets (i.e. 10 and 20 nodes) in terms of the delivery time metric; however, in most cases they did not perform better. For larger data sets (i.e. 50 to 500 nodes), MOEAs outperformed all algorithms developed by Moremi [52] and were more competitive compared to algorithms developed by Ernst [29], surpassing them in performance on most large data sets. For the truck distance metric, the MOEAs outperformed most of the single-objective evolutionary algorithms (EAs) for smaller and larger datasets. This was expected, since the single-objective EAs were designed to optimise delivery time rather than distance. |
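At the core of both NSGA-II and SPEA2 is the Pareto dominance test over the two objectives used here (total delivery time and total distance). A minimal sketch with toy values:

```python
# Pareto dominance and first-front extraction for a minimisation problem.
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (both objectives minimised)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_front(objectives):
    """Indices of solutions not dominated by any other solution."""
    return [i for i, a in enumerate(objectives)
            if not any(dominates(b, a) for j, b in enumerate(objectives) if j != i)]

# (time, distance) for candidate truck-and-drone schedules (toy values)
candidates = [(120, 300), (110, 340), (150, 280), (130, 330)]
print(non_dominated_front(candidates))  # -> [0, 1, 2]; (130, 330) is dominated
```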
Evolving encapsulated neural network blocks using a genetic algorithm | In recent years, artificial intelligence, with its subfields of deep learning and evolutionary computation, has experienced remarkable growth. This expansion can be attributed to the increased availability of computational power and the potential value these domains offer. Consequently, this growth has fueled intensified research and attention, presenting the challenge of staying current with the rapid advancements. Furthermore, the advent of deep learning has led to the ever-increasing size and complexity of neural networks, pushing the boundaries of computational capabilities. This project investigates the viability of utilising a genetic-based evolutionary algorithm to automate the discovery of subnetworks within convolutional neural networks (CNNs), referred to as blocks, for image classification. Inspired by architectural elements in well-known CNNs like ResNet and GoogLeNet, these blocks are designed to be reusable, repeatable and modular.
The first part of this project entailed the development of a framework to represent CNN architectures, which drew inspiration from the concept of neuroevolution of augmenting topologies (NEAT). This developed representation framework was used to define the composition and layout of CNN architectures. Next, a genetic algorithm was adapted to fit within the framework, thus enabling the evolution of CNN blocks using various evolutionary operators, including mutation, speciation and crossover. The representation framework and genetic algorithm were combined to evolve a population of 100 CNN blocks over 30 generations. Throughout the evolution process, the search was guided by the measured quality of the blocks, defined by a fitness function that was designed to balance complexity and performance. Five repetitions of the experiment were performed and compared to randomly generated blocks to assess the overall success of this approach. Additionally, the performance of the evolved blocks was evaluated against manually designed blocks such as ResNet and GoogLeNet’s Inception. The results of the comparison between the genetic algorithm and random procedures demonstrated the effectiveness of the genetic algorithm in producing high-quality solutions based on the fitness evaluation. The results on the distribution of the evolutionary operators across the population also showed how these subprocedures can be used to control the search effectively. Furthermore, the results obtained using a small sample of the best-performing evolved blocks proved to be highly competitive when compared to manually designed counterparts, namely ResNet and Inception. This study validates the concept of using evolutionary algorithms for neural network block generation and emphasises their ability to rival manually designed networks. The findings suggest that evolutionary computation successfully automates the discovery of competitive blocks within CNN architectures, offering new avenues for neuroevolution and overcoming limitations in the manual design processes. |
Machine Learning for Aquaponic System Mortality Prediction and Planting Area Optimisation | Aquaponics is a sustainable farming method that combines aquaculture with hydroponics. Machine learning and the internet of things (IoT) can be used to improve the profitability and efficiency of aquaponic plants. This project proposes a machine learning-based IoT system for aquaponics that can predict fish mortality and optimize crop growing areas. The system collects data on water quality, fish behaviour, and plant growth. This data is then used to train machine learning models to predict fish mortality and to optimize crop growing areas. The proposed machine learning-based IoT system has the potential to improve the profitability and efficiency of aquaponic plants. This could lead to wider adoption of aquaponics as a sustainable farming method. |
Spatio-Temporal Modelling of Road Traffic Fatalities in Western Cape | Road traffic accidents are a problem in South Africa. Responding to the World Health Organisation’s Decade of Action for Road Safety, the Western Cape sought new techniques and initiated the application of data science and machine learning tools to act as a decision support system. In this light, this project seeks to develop a machine learning model capable of predicting, in time and space, the probability of a fatal road event. This is done by aggregating relevant features of the Western Cape into an H3 grid, whereby patterns in fatal events are learned. Traditional machine learning techniques and deep learning techniques are used to learn the relationship between the aggregated features and fatal road events, with the aim of outperforming the historical average models currently used in industry. This is the first attempt at using machine learning techniques to model road traffic fatalities in South Africa and the Western Cape. |
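The H3 aggregation step can be sketched as follows; the coordinates are hypothetical, and the h3-py v3 API is assumed (v4 renames `geo_to_h3` to `latlng_to_cell`).

```python
# Bucket point events into H3 hexagons so features and fatality counts can
# be joined per cell for model training.
import h3
from collections import Counter

# (lat, lon) of recorded fatal events (hypothetical coordinates)
events = [(-33.93, 18.42), (-33.94, 18.43), (-33.91, 18.60)]

resolution = 7  # roughly 5 km^2 hexagons
counts = Counter(h3.geo_to_h3(lat, lon, resolution) for lat, lon in events)
for cell, n in counts.items():
    print(cell, n)  # per-hexagon fatality counts
```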
Using Tree-Based Machine Learning Models to Improve Upon the Least-Squares Method of Quantifying Mineralogy using Bulk Chemical Compositional Data | Geometallurgy is an interdisciplinary science that utilises geological and metallurgical data to optimise ore-to-metal processing routes. Knowledge of the spatial distribution of minerals (and hence metals) within the ore body forms the basis of a geometallurgical model. Information about an ore body’s chemistry and quantitative mineralogy can be obtained through drill core logging exercises. The process of drilling cores, collecting samples, and analysing them is costly and time-consuming. As a result, other quick and inexpensive methods of deriving modal mineralogy have been proposed.
Element-to-mineral conversion (EMC) refers to the method of using bulk rock compositional data to calculate mineral grade quantities. EMC is a chemical mass balancing technique that utilises the bulk rock chemistry and the minerals’ compositional data to solve for modal mineralogy. Chemical mass balances are expressed as a set of simultaneous equations that can be solved using the least-squares approach (LS-EMC). LS-EMC can only be applied if the number of unknowns (minerals) is less than or equal to the number of known variables (elements). It is often the case that there are more minerals than elements. However, minerals can be grouped such that the number of resultant mineral sets is equal to the number of elements. Although this method of grouping minerals is sufficient for geometallurgical models, it is insufficient for mineral processing models, which require exact quantities for individual minerals. This study sought to investigate alternative data-science-based methods to LS-EMC. Data science is an interdisciplinary field that focuses on the application of computational statistical methods, such as machine learning, for the extraction of knowledge from data. Three tree-based machine learning (ML) algorithms, namely Decision Tree, Random Forest, and Extra Trees, were trained to predict mineral grade quantities using positional and geochemical data. The dataset used in the investigation consisted of 135 observations sourced from a geological study conducted on the Kalahari Manganese Deposit (KMD) (Blignaut, 2017). LS-EMC was also applied, and the mineral grade estimates obtained by this method were compared to the ML models’ output. The R² statistic was used to quantify how well the LS-EMC and ML-EMC output agreed with the modal mineralogy measurements obtained through quantitative X-ray diffraction (QXRD). In comparison to the other techniques, the modal mineralogy results from the Extra Trees regressor correlated the most with the QXRD measurements, achieving R² scores > 0.5 for six out of the eight mineral groups. Furthermore, the Extra Trees algorithm outperformed the other two tree-based models in a test designed to see which ML algorithm provided the most reliable mineral quantity predictions for ungrouped minerals. The results of this study support the conclusion that tree-based machine learning algorithms can be used to improve upon the shortcomings of LS-EMC. |
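The abstract omits the symbols of the mass balance, but EMC is commonly posed as solving A x ≈ b, where the columns of A hold each mineral's elemental composition, b is the bulk assay, and x is the modal mineralogy. The sketch below solves a toy instance with a non-negative least-squares variant (LS-EMC itself uses an ordinary least-squares solve); all values are illustrative, not from the KMD dataset.

```python
# Solve A x ~= b for mineral proportions x, constrained to be non-negative.
import numpy as np
from scipy.optimize import nnls

# rows: elements (e.g. Mn, Fe, Ca); columns: minerals (toy compositions, wt%)
A = np.array([[60.0, 10.0,  0.0],
              [ 5.0, 45.0,  1.0],
              [ 1.0,  2.0, 40.0]])
b = np.array([35.0, 20.0, 15.0])   # measured bulk chemistry (wt%)

x, residual = nnls(A, b)           # modal mineralogy estimate and fit residual
print(x, residual)
```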
Optimisation algorithms for a dynamic truck and drone scheduling problem | With the increasing popularity of online shopping and higher customer demand for better service delivery, the importance of last-mile delivery is growing. The last mile, the final delivery to the customer, comes at a high cost to the retail industry and the environment through pollution caused by delivery vehicles. With the advancement in drone technology, delivery strategies like a truck and drone combination, which performs deliveries in parallel, have become viable. Improving the routing and scheduling of these combined vehicles reduces the high cost of last-mile delivery. Therefore, shortening the route through the drone intercepting the truck has a significant benefit.
In this research assignment, the coordinates of customer nodes are randomly changed to simulate a dynamic environment while a truck and drone system performs deliveries. This problem is referred to as the dynamic travelling salesperson problem with drone with interception (DTSPDi). This research assignment solves the problem using the ant colony system (ACS) [30], MAX-MIN ant system (MMAS) [87] and a modified ACS that transfers pheromone knowledge to the next time slice (ACS-KT). The research assignment builds on the algorithm designed by Moremi [64] for the travelling salesperson problem with drone with interception (TSPDi). The three algorithms were benchmarked on 30 datasets of different sizes and spatial patterns. The result from the benchmarking was that ACS-KT outperformed the other two algorithms in both the time and distance dimensions. Interestingly, a lower wait time does not mean a lower time or distance for a route. There was also no correlation between drone and truck distances. Therefore, it seems that ACS-KT is better at handling dynamic environmental changes for the DTSPDi problem. |
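The abstract does not spell out the knowledge-transfer mechanics, but a plausible reading (an assumption here, not the author's published code) is that the pheromone matrix learned in one time slice seeds the next one instead of a cold restart:

```python
# Sketch of carrying pheromone knowledge across time slices in a dynamic
# ant colony system: blend the previous slice's converged pheromone matrix
# toward the default initial level rather than restarting from scratch.
import numpy as np

def transfer_pheromone(tau_prev, tau0, alpha=0.5):
    """alpha=0 restarts completely; alpha=1 reuses the old matrix as-is."""
    return alpha * tau_prev + (1.0 - alpha) * tau0

n = 10                                               # number of nodes
tau0 = np.full((n, n), 0.1)                          # fresh initial pheromone
tau_prev = np.random.default_rng(1).random((n, n))   # learned in slice t-1
tau_next = transfer_pheromone(tau_prev, tau0)        # starting matrix for slice t
```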
Review of Big Data clustering methods | In an era defined by the challenges of processing vast and complex datasets, the study delves into the evolving landscape of big data clustering. It introduces a novel taxonomy categorizing clustering models into four distinct groups, offering a roadmap for understanding their scalability and efficiency in the face of increasing data volume and complexity.
The essence of this research lies in its pursuit to critically review, analyze, and evaluate various clustering models, focusing on their suitability and adaptability in handling big data, characterized by the four Vs, i.e. velocity, variety, volume, and veracity. The aim is to discern the operational dynamics of diverse clustering models, considering the findings of prior literature, which have demonstrated varying degrees of performance of these models based on selected metrics. The methodology is firmly rooted in the execution of a series of experiments on chosen clustering methods, metrics, and datasets. This empirical method is crucial to extrapolate how each model fares across different metrics and datasets, offering a comparative perspective on their performance. Subsequent to the experimental phase, an extensive analysis was conducted, breaking down the selected approaches into their algorithmic components. This decomposition is pivotal to identify the origins of gains, losses, or trade-offs in performance, allowing for an in-depth understanding of why certain models outperform others concerning given metrics and datasets. Insights from this research highlighted the scalability and efficiency of models like parallel k-means and mini-batch k-means, both theoretically and empirically, marking them as exemplary for large-scale applications. Conversely, it unveiled the computational constraints of models like selective sampling based scalable sparse subspace clustering (S5C) and purity weighted consensus clustering (PWCC), showing their limitations in scaling to big data. Acknowledging the limitations imposed by the resource constraints of Google Colab Pro+, the study presents the constraints faced during the evaluation process. The culmination of this project is marked by a comprehensive performance summary, offering key insights into the strengths and weaknesses of the approached models and proffering informed advice on the contextual utilization of each model. It lays the foundation for a centralized database for clustering research, aiming to fill existing knowledge gaps and facilitate optimal model discovery tailored to specific needs and infrastructural capabilities. In conclusion, this research stands as an exploration and analysis in the field of big data clustering, to uncover the potentials and bottlenecks of various models, and offers valuable insights and recommendations, all while reconciling theoretical complexities with empirical validations. |
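The scalability point about mini-batch k-means is easy to demonstrate with scikit-learn: the mini-batch variant updates centroids from small random batches, which keeps per-iteration cost and memory use flat as the sample count grows.

```python
# Contrast full-batch and mini-batch k-means on a synthetic large sample;
# mini-batch trades slightly worse inertia for much lower per-step cost.
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=8, random_state=0)

full = KMeans(n_clusters=8, random_state=0).fit(X)
mini = MiniBatchKMeans(n_clusters=8, batch_size=1024, random_state=0).fit(X)
print(full.inertia_, mini.inertia_)
```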
Clustering free text procurement data | The mining industry, like most others, is faced with a diverse range of challenges. Mining companies are now looking into leveraging advanced data analytics to gain insights from their data to make data-driven decisions and inform process debottlenecking to improve throughput and operating costs. Company A grapples with 50% of its group-wide procurement spend stored as unstructured text data, hindering in-depth cost analysis due to variations in describing the same items. The difficulty associated with free-text descriptions in procurement spending is that a single-item purchase can be articulated using various string expressions. Given the thousands of records generated monthly, manually aggregating these diverse strings for in-depth analysis or relying on simple lookups would prove laborious and inefficient. The literature review underscores the rising trend of organisations adopting text-mining techniques to extract insights from unstructured data. This research assignment delved into various techniques such as TF-IDF feature selection, LSA, and word-embedding feature transformation, leveraging data from Company A’s procurement database. The exploration of k-means and agglomerative hierarchical clustering (AHC) techniques revealed that AHC performed better, yielding a high silhouette coefficient and passing validation inspection by a domain expert. Clustering results were analysed in Power BI, leading to the conclusion that while traditional text clustering techniques are effective, modern approaches to feature selection and dimension reduction are essential for optimal results. The research assignment successfully achieved its goal of enabling data analysis through the clustering of free text data. |
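A hedged sketch of the pipeline this abstract describes, with toy strings standing in for Company A's free-text purchase descriptions: TF-IDF features, LSA via truncated SVD, agglomerative clustering, and silhouette validation.

```python
# TF-IDF -> LSA -> agglomerative clustering -> silhouette validation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

descriptions = [
    "bolt m12 galv 50mm", "m12 galvanised bolt 50 mm", "bearing 6205 skf",
    "skf bearing 6205-2rs", "conveyor belt 1200mm", "belt conveyor 1200 mm",
]

tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(descriptions)
lsa = TruncatedSVD(n_components=4, random_state=0).fit_transform(tfidf)
labels = AgglomerativeClustering(n_clusters=3).fit_predict(lsa)
print(labels, silhouette_score(lsa, labels))
```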
Few-shot learning for passive acoustic monitoring of endangered species | The Hainan gibbon is a primate from the Chinese island-province of Hainan. The population of this primate has been in decline because of poaching, and is now facing extinction. Bioacoustics is a field concerned with the acquisition and study of animal sounds. Passive acoustic monitoring is an important step in data capture, and often captures months of data. Due to the low population numbers of endangered species, experts spend a large amount of time on the analysis and identification of bioacoustic signatures.
Machine learning can be used to automate the bioacoustic identification of species, which would reduce analysis costs and time. Unfortunately, many machine learning algorithms require large amounts of data to perform reliably. Few-shot learning is a loosely defined structure in machine learning that aims to solve the limited data problem with unique approaches. This assignment explores the viability of accurate, image-based classification models when subject to low data volumes. Audio data is converted to spectrograms and used in image analysis. A Siamese framework, which has roots in convolutional neural networks (CNN), is the foundation of the few-shot learning approach. Within this CNN-based framework, contrastive-loss and triplet-loss architectures, data augmentation techniques, transfer learning methods, and reduced image resolution datasets are investigated. The results indicate that the triplet-loss architecture produces the most accurate models, with excellent precision, recall, and F1-score statistics. The triplet-loss models prefer lower resolution images, which reduce computation time and cost. Importantly, the performance of the triplet-loss models is not affected by low data volumes. On the other hand, contrastive-loss models show significant performance degradation on lower data volumes. Overall, the triplet-loss “base CNN” model is the recommended network. This network achieves an accuracy of 99.08% and F1-score of 0.995. The Siamese framework has demonstrated a strong ability to identify the bioacoustic signature of the Hainan gibbon. Recommendations are provided for further research in this domain. |
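The triplet-loss objective at the heart of the recommended network can be sketched in a few lines of PyTorch; the embedding network below is a stand-in, not the project's "base CNN".

```python
# Triplet loss over spectrogram embeddings: pull (anchor, positive) pairs of
# the same call together and push a different sound (negative) away.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Sequential(            # stand-in embedding network
    nn.Conv2d(1, 8, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
    nn.Flatten(), nn.Linear(8 * 4 * 4, 32),
)

anchor = embed(torch.randn(16, 1, 64, 64))    # gibbon call spectrograms
positive = embed(torch.randn(16, 1, 64, 64))  # same class
negative = embed(torch.randn(16, 1, 64, 64))  # background / other species

loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
loss.backward()                               # then step an optimiser as usual
```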
Digitization Of Test Pit Log Documents For Development Of A Smart Digital Ground Investigation Companion | Various geotechnical companies in South Africa have, over the years, conducted ground investigations using the test pit method. A test pit involves digging a hole into the ground and making observations of the ground conditions. These companies have documented their observations in PDF format. However, given recent technological advancements, there is a growing need to digitize these documents for thorough analysis. In response to this requirement, these companies have furnished these documents to the Civil Engineering Department of Stellenbosch University.
Digitization is a way of converting PDF documents into a format that can be analyzed using a computer. There are two common ways to digitize documents, namely manual and automatic. Manual digitization includes copying and pasting information from documents into a database, or retyping the information contained in the documents into a database. This process is laborious, time-consuming, prone to errors and costly. This project explored and presented an automated way of digitizing documents, using an object detection model for document layout analysis and optical character recognition for extracting alphanumeric characters from images. The object detection model was developed by fine-tuning a pre-trained Faster R-CNN model available in the Detectron2 framework. This process involved leveraging a blend of manually annotated images and synthetically generated annotations. The results demonstrated that model R-101 (a variant of R101-FPN) has a balanced performance based on accuracy and inference time. The values of mAR, mAP and inference time for model R-101 are 74.3%, 71.0% and 0.371 seconds/image, respectively. This object detection model was used to identify and provide ROI coordinates and labels to the optical character recognition algorithm. Various optical character recognition algorithms were evaluated and compared across various image qualities. PaddleOCR outperformed the other three algorithms, achieving a word recognition rate of 96%. Nevertheless, the performance of these algorithms was lower on blurred images as compared to other image qualities. Spelling check and correction was conducted to improve the recognition rate of PaddleOCR outputs by a further 1.2%. An interactive application, which can be accessed online via a web link or offline on a desktop, was developed for exploring the dataset. This application allows for creating scenarios using multiple slicers to visualize a word cloud of common words and the frequency of characteristics (e.g. soil type, moisture condition and particle size) used to describe each scenario. A semantic search algorithm was fine-tuned using sentence transformers to allow users to query the dataset using natural language, and a separate desktop application was developed to facilitate this. Evaluating the semantic search algorithm revealed precision, recall and F1 score of 68.3%, 65.7% and 67.0%, respectively. Suggestions for further work include performing exhaustive data analysis to discover insights and hidden patterns, training a language model for improving spelling correction, collecting more documents for developing a large geological and engineering dataset, as well as training a question-and-answering machine learning model to make data and insights more discoverable. |
Comparison of machine learning models on financial time series data | The efficient market hypothesis states that financial markets are efficient and that investors can therefore not make excess profits consistently, because all public information is instantly reflected in the share price. Academia and investors have shown that the efficient market hypothesis does not always hold true and that market prices can be exploited when the right financial trading and price models are used to model the relationship in the underlying data. This research assignment focuses on the development of multiple machine learning models, in combination with a financial trading strategy that utilises a mixture of technical indicators, to compare the performance of different machine learning algorithms on financial time series data.
The financial time series data collected for this research assignment were 10-year minute ticker data. Two foreign exchange rate data sets, the USD/ZAR and ZAR/JPY exchange rates, were used. The other three data sets collected were the S&P 500 index, the FTSE 100 index, and the Brent crude oil index. The first step was to analyse the quality of the data sets. After the quality had been assessed, a trading strategy and financial trading model were used to combine the 20-period moving average, the relative strength index, and the average directional index for the labelling process. Twelve machine learning models were developed to forecast the financial time series data sets. These were the baseline logistic regression, support vector machine, k-nearest neighbour, decision tree, random forest, Elman recurrent neural network, Jordan recurrent neural network, Jordan-Elman recurrent neural network, long short-term memory neural network, time-delay neural network, resilient backpropagation feed-forward neural network, and particle swarm optimisation feed-forward neural network. The results from the experiments indicate that the support vector machine performed the best out of all the machine learning models considered. The baseline logistic regression model outperformed all the remaining machine learning models. The random forest and resilient backpropagation feed-forward neural network models performed third and fourth best. These two models had higher recall scores than most models, but their accuracy scores were significantly lower than those of the baseline and support vector machine models. The recurrent neural network models had very poor performance. Specifically, the Elman and Jordan-Elman models had the poorest performance of the models investigated. It was determined that the non-neural-network machine learning models were less computationally complex, and were less dependent on a balanced data set, than the neural network models. |
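The indicator computations behind the labelling step can be sketched with pandas (20-period moving average and RSI shown; the toy prices and the exact labelling rule, which in the study also involves the average directional index, are assumptions).

```python
# Compute a 20-period moving average and a 14-period RSI on toy minute closes,
# then derive an illustrative long/flat label from them.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 0.1, 500).cumsum())  # toy minute closes

ma20 = close.rolling(20).mean()

delta = close.diff()
gain = delta.clip(lower=0).rolling(14).mean()
loss = (-delta.clip(upper=0)).rolling(14).mean()
rsi = 100 - 100 / (1 + gain / loss)

# One plausible rule: long when price is above its moving average and RSI is
# not overbought (the study's actual rule also uses the ADX).
label = ((close > ma20) & (rsi < 70)).astype(int)
print(label.tail())
```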
Trends in Infrastructure Delivery from Media Reports | It has been shown that investment in public infrastructure such as roads and electricity generally leads to economic growth, and economic growth in turn helps fight poverty and income inequality. It is therefore not surprising that the need to monitor the condition of infrastructure arises. Infrastructure report cards (IRCs) assess the condition of a country’s infrastructure. The South African Institution of Civil Engineering (SAICE) publishes IRCs for South Africa. However, limited data availability for some infrastructure sectors hampers the compilation of the SAICE IRCs. Online news articles are a promising alternative data source to assist in the compilation of the SAICE IRCs, since they are in the public domain and there is an abundance of reputable news websites covering virtually all regions of South Africa. The task of extracting information from a large volume of online news articles can be automated to a large extent by making use of various natural language processing techniques.
In this research assignment, online news articles are collected from nine South African news websites. Topic modelling is then applied to each of the collected data sets with the goal of grouping the collected news articles related to specific infrastructure issues together, e.g., grouping all news articles about potholes, or all news articles about sewage spills, and then representing each group of news articles as a topic. A summary for each topic is then generated by making use of a large language model. Lastly, a dashboard is designed to effectively visualise the topics and the summaries generated for these topics. This dashboard can then be used as a tool by SAICE to identify and monitor prevalent infrastructure issues in various regions of South Africa, while also providing SAICE with additional data for the compilation of IRCs. This research assignment concludes that it is feasible to apply topic modelling to South African news data sets for the extraction of infrastructure-related topics. It is furthermore concluded that topic modelling can help address the lack of data in compiling the SAICE IRCs. Lastly, it is concluded that it is feasible to generate summaries for the extracted topics using large language models, although the generated topic summaries can be improved upon. |
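As a rough illustration of the topic-modelling step, the sketch below fits LDA to a toy news corpus with scikit-learn; the study's actual model choice and preprocessing may differ.

```python
# Fit a small LDA model so articles about the same infrastructure issue
# (potholes vs. sewage spills) land in the same topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

articles = [
    "potholes damage cars on the N2 highway",
    "municipality repairs potholes after protests",
    "sewage spill contaminates river in the region",
    "residents report another sewage spill near the school",
]

counts = CountVectorizer(stop_words="english").fit_transform(articles)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts).argmax(axis=1))  # topic assignment per article
# Each topic's top words can then be passed to an LLM to generate a summary.
```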
Investigating sales forecasting in the formal liquor market using deep learning techniques | This research assignment focuses on forecasting sales in the liquor industry, examining the effectiveness of deep learning techniques and a stacked ensemble approach. Time-series forecasting is a widely used technique in various fields such as economics, finance, and operations research.
A thorough literature review was conducted to gain an in-depth understanding of the topic and to survey existing solutions in the field. The study involved a detailed analysis of datasets to understand the inherent structures of the series. Evaluation metrics and various algorithms were used to assess the effectiveness of time-series forecasting techniques. The research assignment found that deep learning techniques and ensemble theory can successfully be applied to forecast sales in the liquor industry. A stacked ensemble approach was effective in improving the overall performance. The findings have the potential to significantly improve current implementations of time-series forecasting, while reducing the computational complexity and expenses associated with granular forecasting models. The research assignment concludes that deep learning and ensemble models offer a promising avenue for efficient and accurate sales forecasting in the liquor industry, being more time-efficient and computationally less complex than traditional methods. |
Automated Localisation and Classification of Trauma Implants in Leg X-rays through Deep Learning | Revision surgery often requires orthopedic surgeons to pre-operatively identify failed implants in order to reduce the complexity and cost of the surgery. Surgeons typically examine the X-rays of a patient for preoperative implant identification, even though this method is time-consuming and occasionally unsuccessful. This study investigates the use of deep learning to automate the identification of trauma implants in leg X-rays. The investigation assesses the performance of various object detection and classification models on a dataset of trauma implants, aiming to identify the optimal deep learning solution. Challenges related to this research include limited data, imbalanced class distributions, and the presence of multiple implants in the X-ray images.
The results of the investigation indicate that the optimal deep learning solution is a two-model pipeline that employs a you only look once (YOLO) object detection model and a densely connected convolutional neural network (DenseNet) classification model. The DenseNet classification model classifies the trauma implants localised by the YOLO object detection model. The proposed pipeline achieves a mean average precision (intersection over union threshold of 0.5) of 0.967 for implant localisation and an accuracy of 73.7% for implant classification. The results of the study provide proof that deep learning models are capable of identifying trauma implants. Additionally, the study offers a deep learning solution that can be utilised in future research related to identifying trauma implants. |
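A hedged sketch of such a two-model pipeline: a YOLO detector localises implants and a DenseNet classifies each crop. The weight files, image path, and class count are hypothetical stand-ins, not the study's artefacts.

```python
# Detect implants with YOLO, then classify each detected crop with DenseNet.
import torch
from PIL import Image
from torchvision import models, transforms
from ultralytics import YOLO

detector = YOLO("implant_detector.pt")            # hypothetical trained weights
classifier = models.densenet121(num_classes=7)    # 7 implant classes (assumed)
classifier.load_state_dict(torch.load("implant_classifier.pt"))
classifier.eval()

prep = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

image = Image.open("leg_xray.png").convert("RGB")
for box in detector(image)[0].boxes.xyxy:         # one (x1, y1, x2, y2) per hit
    crop = image.crop(tuple(box.tolist()))
    with torch.no_grad():
        logits = classifier(prep(crop).unsqueeze(0))
    print(int(logits.argmax()))                   # predicted implant class
```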
Association between the features used by a convolutional neural network for skin cancer diagnosis and the ABC-criteria and 7-point skin lesion malignancy checklist | Melanoma cases and the associated mortality rate are rising rapidly. The early detection of melanoma is crucial in decreasing the mortality rate. However, traditional methods employed by dermatologists to diagnose skin lesions are time-consuming and vulnerable to human error. Convolutional neural networks (CNNs) show promise in improving the efficiency and accuracy of classifying skin lesions as malignant or benign. However, the lack of transparency in the decision-making process of CNNs prevents these models from being applied clinically. For a CNN to be approved for clinical application, it must be shown that the features used by the CNN to classify skin lesions are clinical indicators of melanoma, i.e. the ABCDE criteria and 7-point skin lesion malignancy checklist.
In this research assignment, a methodology is developed to evaluate whether the features used by a CNN to classify skin lesions correspond to the ABC-criteria and the 7-point skin lesion malignancy checklist. A CNN model is developed, trained, and tested to assess the application of the formulated methodology. The association between the ABC-criteria and the 7-point skin lesion malignancy checklist features and melanoma in the test dataset is investigated using statistical methods to establish a ground truth. The association between ABC-criteria and the 7-point skin lesion malignancy checklist features and the features extracted by the CNN is determined using t-distributed stochastic neighbour embedding (t-SNE) and statistical tests. The importance of colour is evaluated by testing the performance of the CNN on a grayscale dataset. The association of dataset issues with the extracted features is examined using statistical tests, and misclassifications are investigated based on features and dataset issues. Local interpretable model-agnostic explanations (LIME) is employed to explain misclassifications and correctly classified images, providing insights into the decision-making process of the CNN. The InceptionResNetV2 model with a leaky ReLU activation was selected to evaluate the formulated methodology. The correlation tests between the ABC-criteria and the 7-point skin lesion malignancy checklist features and the melanoma diagnosis in the curated test dataset showed a strong association between all the features and melanoma except for vascular structures, brown, red and black. These results were reflected in the evaluation of the association between the features used by the CNN and the ABC-criteria and the 7-point malignancy checklist, since there was a strong association between the extracted features and the ABC-criteria and the 7-point malignancy checklist features except for vascular structures, brown, red and black. The decrease in performance of the InceptionResNetV2 model on the grayscale dataset indicated that colour is a feature that the CNN uses to detect melanoma. The CNN demonstrated robustness to dataset issues but showed sensitivity to the presence of hair and immersion fluid, suggesting the need for further preprocessing of the images. Overall, it was concluded that the developed methodology can determine whether a CNN uses the features in the ABC-criteria and the 7-point malignancy checklist to classify skin lesions as malignant or benign. The developed methodology showed that the CNN uses the features of the ABC-criteria and the 7-point malignancy checklist to determine whether a skin lesion is malignant or benign. |
December 2023 Graduation

Title | Abstract
---|---
A dynamic optimisation approach to training feed-forward neural networks that form part of an active learning paradigm | Active learning describes a paradigm of continually selecting the most informative patterns to train a model while training progresses. Literature indicates that the parameter search landscape of feed-forward neural networks (FFNNs) that form part of an active learning paradigm does not generalise to the parameter search landscape of FFNNs trained on a static training set. The parameter search landscape of FFNNs that form part of an active learning paradigm is theorised to change while the search progresses. This research assignment investigates the effect of changing the optimiser of a FFNN that forms part of an active learning paradigm from backpropagation to a dynamic optimisation algorithm. To this end, the cooperative quantum-behaved particle swarm optimisation (CQPSO) algorithm was implemented to train FFNNs that form part of two different active learning paradigms. The active learning paradigms investigated were dynamic pattern selection (DPS) and sensitivity analysis selective learning (SASLA). Six data sets were used for the investigation. A novel hyperparameter tuning procedure was implemented to ensure efficient optimiser performance for each problem set. It was found that the CQPSO algorithm located and tracked the global minimum of four out of the six problem sets more effectively than the backpropagation algorithm in the DPS active learning paradigm. Conversely, the backpropagation algorithm located and tracked the global minimum of four out of the six problem sets more effectively than the CQPSO algorithm in the SASLA active learning paradigm. The CQPSO algorithm performance was found to depend on the dimensionality of the search space as well as the interdependence of the input training patterns. |
Course Recommendation Based on Content Affinity with Browsing Behaviour | A recommender, or recommendation system (RS), filters and provides relevant content to a user based on many factors, such as their historic behaviour during interactions with a particular system or software. A RS is aimed at improving user experience and overcoming issues such as the distressing search problem experienced on massive open online course (MOOC) platforms. One such online platform is Physioplus, whose subscribers generally have very specific educational needs and can thus greatly benefit from targeted responses when interacting with the system. It can therefore be argued that an enhanced course recommender engine possesses great potential to increase Physioplus subscribers' satisfaction and thus reduce cancellations. The current search feature in Physioplus has some limitations, as it uses keywords, static course recommendations, and elastic site search without considering historic user site visits. The purpose of this study is to build a better course recommender system for Physioplus. The recommender takes a user's recent Physiopedia browsing history and provides the user with a tailored and rank-ordered list of those courses that are most relevant to their entire content history. The content of a user's browsing history is highly correlated with the content of the most relevant courses for that user. The recommender is built using a collaborative filtering (CF) technique, with item-based and user-based approaches. Natural language processing and neighbourhood similarity methods are used to complement collaborative filtering in achieving quality recommendations. The course recommender system in this study uses a training and testing dataset from a real-world Physioplus system to assess the overall performance of the proposed approach. The experimental evaluation is measured by comparing recommended versus completed courses. The results show that the proposed RS has a recall score of 76% and an accuracy rate of 53%, obtained in the offline experiment exercise. The assumption is that the performance metric scores will improve once the proposed RS integrates with the existing Physioplus production system. All in all, the proposed RS can play an essential role in assisting users with relevant courses. |
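The item-based step can be sketched as cosine similarity between course content vectors and a user's aggregated browsing-history vector; the vectors below are random stand-ins for the Physioplus content features.

```python
# Rank courses by cosine similarity to the mean vector of the pages a user
# recently browsed (toy data; real features would come from NLP on content).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

course_vecs = np.random.default_rng(0).random((50, 128))    # 50 courses
history_vec = course_vecs[[3, 17, 42]].mean(axis=0, keepdims=True)

scores = cosine_similarity(history_vec, course_vecs).ravel()
top5 = np.argsort(scores)[::-1][:5]     # rank-ordered recommendations
print(top5)
```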
An Evolutionary Algorithm for the Vehicle Routing Problem with Drones with Interceptions | The use of trucks and drones as a solution to address last-mile delivery challenges is a new and promising research direction explored in this assignment. The variation of the problem where the drone can intercept the truck while in movement or at the customer location is part of an optimisation problem called the vehicle routing problem (VRP) with drones with interception (VRPDi). This study proposes an evolutionary algorithm (EA) to solve the VRPDi. The study demonstrates a metaheuristic strategy by applying an evolution-based algorithm to solve the VRPDi. In this variation of the VRPDi, multiple pairs of trucks and drones need to be scheduled. The pairs leave and return to a depot location together or separately to make deliveries to customer nodes. The drone can intercept the truck after the delivery or meet up with the truck at the following customer location. The algorithm was executed on the travelling salesman problem with drones (TSPD) datasets by Bouman et al. (2015), and the performance of the algorithm was compared by benchmarking the results of the VRPDi against the results of the VRP of the same dataset. This comparison showed improvements in total delivery time between 39% and 60%. Further detailed analysis of the algorithm results examined the total delivery time, total distance, the node delivery scheduling and the degree of diversity during the algorithm execution. This analysis also considered how the algorithm handled the VRPDi constraints. The results of the algorithm were then benchmarked against algorithms in Dillon et al. (2023) and Ernst (2024). The latter solved the problem with a maximum drone distance constraint added to the VRPDi. The analysis and benchmarking of the algorithm results showed that the algorithm satisfactorily solved 50- and 100-node problems in a reasonable amount of time, and the solutions found were better than those found by the algorithms in Dillon et al. (2023) and Ernst (2024) for the same problems. However, the algorithm performance deteriorated considerably as the number of nodes in the problems increased. This deterioration was both in terms of the quality of the solution and the computation time required to solve the problem. |
Metaheuristics for Training Deep Neural Networks | Presently, artificial neural networks (ANNs) are popular among researchers as well as in commercial settings. The use of ANNs continues to expand into different fields. The increase in interest in ANNs has led researchers to explore various new and innovative ways to improve the performance of ANNs. One such way is to explore the use of metaheuristics in the training of ANNs. This research assignment theoretically and empirically compares the use of metaheuristics as an alternative to the traditional training algorithm, i.e. backpropagation with stochastic gradient descent (SGD), to train deep neural networks (DNNs). Three specific metaheuristics are considered, namely particle swarm optimisation (PSO), genetic algorithm (GA) and differential evolution (DE). An in-depth analysis of SGD is conducted to highlight some potential disadvantages which might occur in the training process. The field of metaheuristics is explored as an alternative training algorithm, with specific emphasis placed on the three specified metaheuristics. Five different experiments are conducted to empirically compare the backpropagation SGD training algorithm with the PSO, GA and DE training algorithms. The experiments are conducted on an image dataset. The DNN used in the experiments is a convolutional neural network (CNN). The results conclude that SGD performs better than the metaheuristics considered. Potential future work is also discussed based on the findings of this research assignment. |
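For reference, the PSO update rule that underlies treating network training as an optimisation problem is sketched below; each particle is a flattened weight vector, and the sphere function stands in for the CNN loss (a toy assumption, not the study's setup).

```python
# Canonical PSO over a weight vector: velocity blends inertia, a pull toward
# each particle's personal best, and a pull toward the swarm's global best.
import numpy as np

rng = np.random.default_rng(0)
n_particles, dim = 30, 100                  # dim = number of network weights
x = rng.normal(size=(n_particles, dim))     # particle positions (weights)
v = np.zeros_like(x)
pbest, pbest_f = x.copy(), np.full(n_particles, np.inf)

def loss(w):                                # stand-in for the network loss
    return np.sum(w * w)

for _ in range(100):
    f = np.array([loss(p) for p in x])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = x[improved], f[improved]
    gbest = pbest[np.argmin(pbest_f)]
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = 0.72 * v + 1.49 * r1 * (pbest - x) + 1.49 * r2 * (gbest - x)
    x = x + v

print(pbest_f.min())
```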
Diversity preservation for decomposition particle swarm optimization as feed-forward neural network training algorithm under the presence of concept drift | Time series forecasting is an important area of research that lends itself to various fields in which it is practically applied. The importance of time series forecasting has led to much research in efforts to improve the accuracy of predictions. The use of artificial neural networks for time series forecasting has grown, especially with the development of simple recurrent neural networks (SRNNs). SRNNs have been shown to handle temporal sequences efficiently. Specialised architectures for SRNNs increase the computational cost due to the increase in the number of weights that require optimisation during training. Therefore, the training process of neural networks can be rephrased as an optimisation problem. Recent work has shown how specialised dynamic particle swarm optimisation (PSO) algorithms can replace traditional backpropagation as a learning algorithm for feed-forward neural networks (FFNNs). Dynamic PSO algorithms to train FFNNs have been shown to outperform SRNNs using traditional backpropagation. Due to the increased dimensions for larger problems, various cooperative PSO algorithms have been developed to address the credit assignment problem as well as to better cope with variable dependency; one such PSO variant is the decomposition cooperative particle swarm optimisation algorithm. One limitation of using PSO variants for training in dynamic environments is that as the particles in a swarm converge in a specific region, the swarm diversity decays, making it difficult to adapt to environmental changes. Dynamic PSO algorithms have been successfully used in the sub-swarms of decomposition cooperative particle swarm optimisers (DCPSOs). However, these dynamic DCPSO algorithms have been shown to struggle under specific classes of dynamism. Therefore, the preservation of swarm diversity is directly linked to the ability to adapt in the presence of concept drift. This research project proposes various diversity preservation techniques to promote swarm diversity throughout various environmental changes. The diversity preservation techniques investigated are the use of random decomposition for dynamic DCPSO and a diversity-based penalty function for regularisation. For this purpose, experiments were conducted on five well-known nonstationary forecasting problems under various classes of dynamism. Results obtained on two implementations of the DCPSO using the proposed diversity preservation techniques showed success in promoting swarm diversity. Two main implementations of DCPSOs were investigated, namely dynamic and static sub-swarms. When a static PSO algorithm was used for the sub-swarms of the DCPSO, the diversity preservation showed a significant impact. The proposed diversity preservation techniques also significantly affected swarm diversity for the DCPSO using the quantum particle swarm optimisation algorithm (QSO) as sub-swarms. The use of the diversity-based penalty function for regularisation showed superior performance on the training and generalisation error for dynamic DCPSO. Still, it did not show a statistically significant effect on preserving swarm diversity. The use of static PSO algorithms as sub-swarms for DCPSO showed that random decomposition ranked high across the various experiments, while swarm diversity was significantly impacted.
The proposed diversity preservation techniques for the dynamic DCPSO algorithms showed a trade-off between diversity preservation and performance. |
March 2023 Graduation

Title | Abstract
---|---
Adaptive thresholding for microplot segmentation | Food security remains a global concern as flagged by the Food and Agriculture Organization of the United Nations (FAO). They report that globally one in three people do not have access to adequate food, with a third of those living in Africa. The effect of climate change on crop yields adds to these concerns. Wheat makes up a substantial share of food consumption globally at 18.3%, and it is particularly sensitive to the rising temperatures associated with global warming. The FAO emphasises that agricultural technology has a significant role to play in food security, with research contributing to breeding high-yield and heat-resistant crops as an important focus area. The Department of Genetics at Stellenbosch University has a wheat pre-breeding programme that develops and tests novel crop variants. This programme monitors several experimental sites that contain microplots: relatively small wheat plots. At a single pre-breeding experimental site, there are often hundreds of microplots that must be monitored and evaluated. The within-season evaluation of microplots is performed by using digital high throughput phenotyping (HTP) analysis performed on orthomosaic images collected using unmanned aerial vehicles (UAVs). One of the phases of HTP is the plot identification phase, also referred to as microplot segmentation. The current method used to perform microplot segmentation in the programme makes use of a grid that a user must impose over the orthomosaic image and manually adjust to ensure accurate segmentation. This method is manual and requires extensive post-processing to get a good fit. In addition, the current method does not generalise well to conditions that will pragmatically vary between orthomosaic collection iterations. To reduce the time spent by researchers to segment microplots, this research assignment developed an automated microplot segmentation method that requires minimal input from the user. The microplot segmentation approach, referred to as the adaptive thresholding procedure (ATP), was developed for this research assignment. The ATP uses unsupervised learning to identify and localise microplots. Unlike a grid segmentation approach, the ATP does not require any prior knowledge of the microplot layout and does not require the user to adjust a grid. The performance of the ATP microplot segmentation procedure was evaluated on thirteen orthomosaic images from four different experimental sites and subsequently compared against two manual microplot segmentation procedures. The three different microplot segmentation approaches were compared using three objective criteria, namely accuracy, intersection over union, and the level of user input required. The ATP yielded superior performance in comparison to the other two segmentation methods when the conditions at the experimental sites were favourable. In the presence of weeds, the ATP did not yield satisfactory performance, as the approach finds it challenging to differentiate between vegetation, weeds and non-vegetation. Despite this limitation, the ATP contributes to the existing body of knowledge on microplot segmentation methods by providing an automated microplot segmentation method that requires minimal user input. |
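The core idea of thresholding vegetation per region without a user-adjusted grid can be sketched as follows; the excess-green index and Otsu's method are illustrative choices, not necessarily those of the ATP.

```python
# Separate vegetation from soil in an orthomosaic tile with a per-tile
# threshold, so the cut-off adapts to local conditions automatically.
import numpy as np
from skimage.filters import threshold_otsu

rgb = np.random.default_rng(0).random((512, 512, 3))  # stand-in tile
exg = 2 * rgb[..., 1] - rgb[..., 0] - rgb[..., 2]     # excess-green index

mask = exg > threshold_otsu(exg)   # True where vegetation (candidate plots)
# Connected components of `mask` can then be localised as microplot regions.
print(mask.mean())
```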
Decision Support Guidelines for Selecting Modern Business Intelligence Platforms in Manufacturing to Support Business Decision Making | Globally, the generation of data is increasing rapidly, and globalisation and the increasing competitiveness of global markets constantly challenge the business world. Companies rely on sophisticated technology to manage and make decisions in this dynamic business environment and ever-evolving market. Executives are under constant pressure to maximise profits from new offerings and operational efficiencies and to improve customer and employee experience. As digitalisation in the manufacturing industry increases, the role of data analytics and business intelligence (BI) in decision-making is significantly increasing. Manufacturers generate abundant structured and unstructured business information throughout the product lifecycle that can be used to achieve their business objectives. However, the manufacturing industry is among the laggards in digitisation and often lacks the technological and organisational foundations required to implement data tools as part of their ecosystem. BI provides business insights to better understand the company's large amounts of data, operations and customers. This, in turn, can contribute to better decision-making and consequently improve results and profit. Rationalisation of the technologies, tools and techniques can be challenging. The selection of an appropriate tool can be time-consuming, complex and overwhelming due to the wide variety of available BI software products, each claiming that their solution offers distinctive and business-essential features. This research assignment aims to address the need for a useful approach to BI tool evaluation and selection by identifying guidelines to support decision-makers in selecting BI tools. A thematic analysis approach was used to collect, analyse and interpret the information from semi-structured interviews with professionals from the manufacturing industry. The research gauged respondents' views on the utilisation of BI, the data challenges experienced in manufacturing, the essential criteria BI tools should fulfil, and the approaches followed in practice to select software. The research revealed that BI plays a significant role in decision-making and the prioritisation of tasks in manufacturing. The results showed that respondents valued different BI criteria requirements and decision-making processes. The findings and insights gleaned from the literature review were used to propose guidelines that support manufacturers in their decision. The guidelines elucidate the dimensions to evaluate and provide a nine-step selection process to compare BI software. |
An Investigation into the Automatic Behaviour Classification of the African Penguin | In this modern era, climate change, deforestation, and the rapid decline of natural resources are issues that seem ever-increasing. With the extinction of many fauna and flora species in past decades, renewed focus on conservation efforts is advocated globally. The escalation of digitization brings with it an opportunity to improve conservation efforts and, consequently, reduce the rapid decline of biodiversity. Modelling and forecasting the progression of invasive species, ascertaining the presence of endangered species prior to the sanctioning of construction projects, and monitoring threatened ecosystems are some of the many ecologically beneficial possibilities technology provides. A prevalent application gaining much momentum is the notion of applying machine learning and artificial intelligence to the domain of ecology. One such application considers animal behavioural studies — a predominantly manual endeavour requiring mounted sensors, tracking devices and/or the continued presence and attention of a human. Owing to the invasive nature of many such studies, behaviour is often distorted or (at the very least) influenced. Modern computerised and digitised approaches address many of these drawbacks by providing a means of evaluating behaviour in a non-invasive (or less invasive) manner. Mounted video cameras are, for example, less cumbersome than traditional wearable sensors. In addition, the presence of a human within or near the animal enclosure is no longer required. Considering the potential benefits to conservation, incorporating this technology into the field of behavioural studies is well warranted. This project is dedicated to investigating the applicability of modern machine learning, specifically deep learning, to behaviour analysis in the endangered African penguin. The aim of this project is to investigate, develop, and deploy a model facilitating automatic behaviour classification in these penguins — a foundational contribution to improving current conservation efforts (improving passive monitoring systems and anomaly detection within a colony could potentially reduce response time in times of distress). The project considers a dual implementation — coordinates detailing animal movement are first extracted and subsequently presented to a suitable classifier facilitating behaviour classification. Three case studies are considered: single penguins, two individuals, and three individuals (regarded as multiple individuals). A comprehensive investigation into the algorithmic performance associated with these models is performed and presented. Ultimately, the case evaluating three individuals based on the behaviours excitement and normal achieves an AUC of 72.9%. The case evaluating two individuals based on the behaviours interaction and no interaction achieves an AUC of 84.2%. Finally, the case evaluating one individual based on the behaviours braying, flapping, preening, resting, standing, and walking achieves an AUC of 82.1%. This yields valuable insight into the utility, applicability, and feasibility of automatic behaviour classification of the African penguin. Pivotal to this work is the foundation it provides to the design, development, and implementation of a passive monitoring system as well as its benefits and contributions towards a holistic goal — aiding conservation efforts to preserve fauna and flora for future generations. |
Set-based Particle Swarm Optimization for Medoids-based Clustering of Stationary and Non-Stationary Data | Data clustering is the grouping of data instances so that similar instances are placed in the same group or cluster. Clustering has a wide range of applications and is a highly studied field of data science and computational intelligence. In particular, population-based algorithms such as particle swarm optimization (PSO) have been shown to be effective at data clustering. Set-based particle swarm optimization (SBPSO) is a generic set-based variant of PSO that substitutes the vector-based mechanisms of PSO with set theory. SBPSO is designed for problems that can be formulated as sets of elements, and its aim is to find the optimal subset of elements from the optimization problem universe. When applied to clustering, SBPSO searches for an optimal set of medoids from the dataset through the optimization of an internal cluster validation criterion. In this research assignment, SBPSO is used to cluster fifteen datasets with diverse characteristics such as dimensionality, cluster counts, cluster sizes, and the presence of outliers. The SBPSO hyperparameters are tuned for optimal clustering performance on these datasets, which is compared in depth to the performance of seven other tuned clustering algorithms. Then, a sensitivity analysis of the SBPSO hyperparameters is performed to determine the effect that variation in these hyperparameters has on swarm diversity and other measures, to enable future research into the clustering of non-stationary data with SBPSO. It was found that SBPSO is a viable clustering algorithm. SBPSO ranked third among the algorithms evaluated, although it appeared less effective on datasets with more clusters. A significant trade-off between swarm diversity and clustering ability was discovered, and the hyperparameters that control this trade-off were determined. Strategies to address these shortcomings were suggested. |
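The fitness that SBPSO optimises when clustering can be illustrated with a simple internal validation criterion. The sketch below scores a candidate medoid set by total within-cluster distance; the actual criterion used in the assignment may differ.

```python
import numpy as np

def medoid_fitness(X, medoid_idx):
    """Internal validation of a candidate medoid set: assign every point to
    its nearest medoid and return the total within-cluster distance (lower
    is better). SBPSO would minimise this over subsets of data indices."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

# Score one random 3-medoid candidate on toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
print(medoid_fitness(X, rng.choice(len(X), size=3, replace=False)))
```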
An Extension of the CRISP-DM Framework to Incorporate Change Management to Improve the Adoption of Digital Projects | Digital transformation brings technology such as artificial intelligence (AI) into the core operations of businesses, increasing their revenue while reducing their costs. AI deployments tripled in 2019, having grown by 270% in just four years. However, digital transformation is a challenging task to complete successfully. A total of 45% of large digital projects run over budget, while only 44% of digital projects ever achieve the predicted value. The primary reason for these failures can be attributed to the human aspects of these projects. Examples of these human aspects are difficulty of access to software, a lack of understanding of the technology, and a lack of the knowledge required to operate it. The continued success of digital transformation requires both technical and change management drivers to be in place before, during, and after AI implementations. The project starts by describing digital projects. Digital projects, which include data science and AI, have an extremely low success rate, with change management a fundamental barrier to their success. To address the change management challenges, five different change management models are compared, from which a generalised change management model is constructed. From the literature, it is concluded that the CRISP-DM framework is one of the most widely used analytics models for implementing digital projects. Using the generalised change management model, the change management gaps within the CRISP-DM framework are identified. An extended CRISP-DM framework is constructed by filling the identified gaps in the original CRISP-DM framework with the tasks in the generalised change management model. Thereafter, the extended CRISP-DM framework is validated against a real-world case study. The validation shows that the extended CRISP-DM framework indicates change management improvement areas which would most likely have improved the adoption of the project. The success of this research project ultimately lies in the ability of the developed framework to provide an effective way to guide data specialists through tasks that ease the challenges of digital transformation. All the objectives of the research assignment are achieved, and the validation shows that use of the extended framework by a data specialist has the potential to improve the success rate of digital projects at a lower risk of failure. |
An evaluation of state-of-the-art approaches to short-term dynamic forecasting | Order volume forecasting (OVF) is a strategic tool used by logistics companies to reduce operating costs and improve service delivery for their clients. It provides business units with the ability to anticipate demand based on historical data and external factors, so that resources can be deployed effectively to enable the aforementioned improvements. Until recently, statistical models have been the standard for forecasting. However, recent research into the use of state-of-the-art (SOTA) approaches to forecasting has yielded promising results. Most notably, these approaches are able to leverage covariates, which enable models to incorporate auxiliary information such that the predictions are responsive to their respective environments. This is critical to short-term forecasts, which are inherently more stochastic than long-term forecasts. This research paper seeks to compare the use of a statistical forecasting approach to a SOTA approach in the case of short-term order volume forecasting. More specifically, the NBEATS model is developed using various exogenous variables and is compared to the Exponential Smoothing (ETS) model. Both models have been developed to provide forecasts three hours into the future and are evaluated using RMSE and MAE. It was found that NBEATS provided a 36.01% improvement on the RMSE of the ETS model and a 31.6% improvement on the MAE of the ETS model. Additionally, two variations of NBEATS are compared – one trained with covariates and another without – to evaluate the improvement that covariates provide. It was found that providing models with exogenous variables resulted in a 16.15% increase in the RMSE and a 14.74% increase in MAE. The results of this paper suggest that SOTA approaches provide more consistent and accurate short-term forecasts. |
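A minimal sketch of the comparison described above, using the darts forecasting library; the file name, column names, covariates, and model settings are illustrative assumptions rather than the study's configuration.

```python
import pandas as pd
from darts import TimeSeries
from darts.models import ExponentialSmoothing, NBEATSModel
from darts.metrics import rmse, mae

df = pd.read_csv("order_volumes.csv")                       # hypothetical file
series = TimeSeries.from_dataframe(df, "timestamp", "orders")
covs = TimeSeries.from_dataframe(df, "timestamp", ["temperature", "is_holiday"])
train, test = series[:-3], series[-3:]                      # 3-step-ahead horizon

ets = ExponentialSmoothing().fit(train)
nbeats = NBEATSModel(input_chunk_length=24, output_chunk_length=3, n_epochs=20)
nbeats.fit(train, past_covariates=covs)                     # covariate-aware SOTA model

for name, pred in [("ETS", ets.predict(3)),
                   ("NBEATS", nbeats.predict(3, past_covariates=covs))]:
    print(name, rmse(test, pred), mae(test, pred))
```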
Cross-Camera Vehicle Tracking in an Industrial Plant Using Computer Vision and Deep Learning | One of the key actors in the paper recycling process is buy-back centres. Buy-back centres buy or collect recyclable materials from individuals, formal and informal collection businesses, and institutions. Buy-back centres are important because they divert recyclable material away from landfills, which reduces the leaching of pollutants into the soil and groundwater as well as the generation of harmful gases and chemicals. However, buy-back centres face several threats, of which fraud is one of the most difficult to detect and prevent. Fraud occurs when the amount and/or the grade of the waste paper being sold to the buy-back centre is misrepresented by the sellers in order to earn a greater income. A misrepresentation of the waste paper grade and weight being sold to the buy-back centre influences not only the availability of stock and the volume of sales to the paper mills but also the sustainability of the entire recycling ecosystem in the area. To facilitate the detection of fraud at buy-back centres, a multi-vehicle multi-camera tracking (MVMCT) framework is developed to track the movement of vehicles throughout a paper buy-back centre located in South Africa. The MVMCT framework developed can aid the buy-back centre in estimating the amount of material expected to be collected at a loading bay prior to stocktaking. When there is a large discrepancy between how much material is expected to be collected and how much is present at the loading bay, the buy-back centre can use the MVMCT framework to track and identify suspicious vehicles for further investigation. This research assignment shows that the Faster R-CNN and DeepSORT detector-tracker pair exhibits superior performance in terms of IDF1 scores. Furthermore, this research assignment addresses the vehicle re-identification problem by using a Siamese network to match vehicles across several video sequences and to manage the global ID assignment process. The MVMCT framework developed in this research assignment exhibits an IDF1 score of 0.58, a multi-object tracking accuracy of 0.62, and a multi-object tracking precision of 0.53. Moreover, the MVMCT framework successfully tracks vehicles across all video sequences except for the sequence with a top-down view, and achieves reasonable accuracy in counting the number of stationary vehicles at a loading bay. |
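The re-identification step described above can be illustrated by greedily matching appearance embeddings across cameras; the cosine-similarity threshold and greedy strategy below are illustrative assumptions, not necessarily the Siamese-network matching logic used in the assignment.

```python
import numpy as np

def match_across_cameras(emb_a, emb_b, threshold=0.7):
    """Greedy matching of vehicle tracks from camera A to camera B.
    emb_a, emb_b: (n, d) and (m, d) L2-normalised appearance embeddings,
    e.g. produced by a Siamese network. Returns a list of (i, j) matches;
    matched pairs would receive the same global vehicle ID."""
    sim = emb_a @ emb_b.T                   # cosine similarity matrix
    matches = []
    while sim.size and sim.max() > threshold:
        i, j = np.unravel_index(sim.argmax(), sim.shape)
        matches.append((int(i), int(j)))
        sim[i, :] = -1.0                    # remove the matched row and
        sim[:, j] = -1.0                    # column from the candidate pool
    return matches
```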
A Bagging Approach to Training Neural Networks using Metaheuristics | Stochastic gradient descent has become the go-to algorithm to train neural networks. As neural network architectures and the datasets used to train them have grown, so has the computational cost of training. Metaheuristics have successfully been used to train neural networks, and are furthermore more robust to noisy objective functions. This research assignment investigates whether metaheuristics, specifically genetic algorithms, differential evolution, evolutionary programming and particle swarm optimisation, can be used to train an artificial neural network with a subsample of the training set. Different bagging training approaches that reduce the amount of training data are put forward, and the performances of the trained neural networks are evaluated. These performances are compared against those of a neural network trained with stochastic gradient descent and of neural networks trained with the metaheuristic algorithms using the entire training dataset. The evaluation compares the validation accuracy and the generalisation factor to detect whether overfitting occurs. The research assignment also answers the question of whether the suggested training methods reduce overfitting. The results indicate that a subsample of the training set can be used per iteration or generation of the metaheuristic algorithm to train a neural network with similar accuracy and similar or better overfitting behaviour compared to training on the complete training set. The best performance was achieved with a bagging strategy using the same sample size for each class. |
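The core of the proposed approach, evaluating fitness on a fresh class-balanced bag each generation, can be sketched as follows; the helper name and per-class sample size are illustrative.

```python
import numpy as np

def balanced_bag(X, y, per_class=64, rng=None):
    """Draw a class-balanced bootstrap sample to serve as the fitness
    evaluation set for one generation/iteration of a metaheuristic."""
    if rng is None:
        rng = np.random.default_rng()
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=per_class, replace=True)
        for c in np.unique(y)
    ])
    return X[idx], y[idx]

# Inside the metaheuristic loop, the fitness of a candidate weight vector w
# would be the network loss evaluated on a fresh bag each generation, e.g.:
# Xb, yb = balanced_bag(X_train, y_train)
# fitness = loss(forward(Xb, w), yb)
```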
Link prediction of clients and merchants in a rewards program using graph neural networks | Rewards programs have become an offering for businesses to increase client engagement, nurture long-term relationships and maintain client retention. A host company is an intermediary network provider that connects entities within a rewards program. Identifying future relationships between entities is framed as a link prediction task. The network is represented as a graph of interconnected entities. Graphs are complex high-dimensional structures, dynamic in shape and size. A research field called graph neural networks (GNNs) has gained traction for handling the challenges posed by these graph properties. A real-world scenario has been instantiated to apply a GNN technique to a link prediction task. The investigation aims to identify potential relationships between clients and merchants in a rewards program offered at a bank. A framework design is created for the model architecture: a GNN encoder and an MLP decoder. A GNN variation called GraphSAGE is selected as the encoder. GraphSAGE is an inductive framework, able to generalise to unseen data and leverage node attributes. A sensitivity analysis indicates that the model is sensitive to the dropout and learning rate hyperparameters. Limited attributes and connections are present, which explains this sensitivity. The model is fitted to the optimal architecture and tested on unseen data. The model achieved an area under the receiver operating characteristic curve (ROC AUC) of 0.65. Although acceptable, a higher ROC AUC value is desirable. The precision versus recall results emphasised the effects of the sparse network and highlighted an area that requires further improvement. Most of the correct predictions are for the negative class. Although a weighted loss strategy mitigated some of these drawbacks, it could not overcome the challenges. The encoder output reveals embeddings which are visualised for interpretation. The embedding illustrations reveal similarities in the representations of both clients and merchants. The embeddings identified two distinct merchant groups. The client embedding representations showed clusters of clients which are best represented in a non-Euclidean space. An entity characteristic prediction analysis is done to gain insight into the distribution of the client and merchant features; note that the purpose is not to validate which features the GNN learnt from. Among the correct positive class predictions, female clients account for 99%. Half of the correct links are associated with a rewards program client. The Homeware and Decor Store merchant service type accounts for 100% of the correct positive predictions. Implications of the data quality issues are also emphasised. Overall, the GNN demonstrates that it can learn representations in a rewards program network of clients and merchants. The network topology and relations among the clients and merchants are well detected. The GNN is capable of predicting the existence of links between the entities. Opportunities are identified to further enrich the graph, and improvements are proposed. The investigation provides a positive contribution to the financial industry, rewards programs and GNNs as an emerging research field. |
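The encoder-decoder design described above maps naturally onto PyTorch Geometric. The sketch below pairs a two-layer GraphSAGE encoder with an MLP decoder that scores client-merchant pairs; dimensions and layer counts are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class LinkPredictor(torch.nn.Module):
    """GraphSAGE encoder + MLP decoder for link prediction."""
    def __init__(self, in_dim, hid_dim=64):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hid_dim)
        self.conv2 = SAGEConv(hid_dim, hid_dim)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(2 * hid_dim, hid_dim), torch.nn.ReLU(),
            torch.nn.Linear(hid_dim, 1))

    def encode(self, x, edge_index):
        # Two rounds of neighbourhood aggregation over node attributes
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

    def decode(self, z, pairs):
        # pairs: (2, n) tensor of client/merchant node indices to score;
        # training would use BCE loss on positive and sampled negative pairs
        h = torch.cat([z[pairs[0]], z[pairs[1]]], dim=-1)
        return torch.sigmoid(self.mlp(h)).squeeze(-1)
```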
Evaluating active learning strategies to reduce the number of labelled medical images required to train a CNN classifier | CNNs have proven to provide human-comparable performance in the field of computer vision; however, one basic limitation of ANNs is that they rely heavily on large labelled datasets, and manually labelling data is a costly and time-consuming task. This study investigates how varied sizes of initially labelled medical images affect the effectiveness of CNN-based active learning. Active learning is a framework in which the data to be labelled by human annotators are not selected randomly, but rather in such a fashion that the amount of data required to train a machine learning model is reduced. Two CNN architectures were chosen to run the experiment using a well-known chest x-ray pneumonia dataset from the Kaggle repository, and uncertainty-based active learning was used to measure the informativeness of the data. Eight simulations were run on varying sizes of initial labelled training images. The simulations demonstrate how active learning can reduce the cost and time required for image labelling. The performance of the two CNN architectures was assessed using the AUC metric. In conclusion, the use of DenseNet-121 with least confidence sampling reduced the number of labelled images required by 39% compared to the random sampling technique used as the baseline. |
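Least confidence sampling, the query strategy that performed best above, reduces to a few lines given the CNN's softmax outputs; this sketch is a generic illustration rather than the study's exact implementation.

```python
import numpy as np

def least_confidence_query(probs, k):
    """Select the k unlabelled images whose top predicted class probability
    is lowest, i.e. the images the CNN is least confident about.
    probs: (n_unlabelled, n_classes) array of softmax outputs."""
    confidence = probs.max(axis=1)
    return np.argsort(confidence)[:k]   # indices to send to human annotators

# Each active-learning round: train the CNN on the labelled pool, score the
# unlabelled pool, label the queried images, move them to the pool, repeat.
```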
A Dynamic Optimization Approach to Active Learning in Neural Networks | Artificial neural networks are popular predictive models which have a broad range of applications. They have been of great interest in the field of machine learning, and as a result, large research efforts have been devoted to improving their predictive performance. Active learning is a strategy that aims to improve the performance of artificial neural networks through an active selection of training instances. The motivation for the research assignment is to determine if there is an improvement in predictive performance when a model is trained only on instances that the model deems informative. Through the continuous selection of informative training sets, the training times of these networks can also be reduced. The training process of artificial neural networks can be seen as an optimisation problem that uses a learning algorithm to determine an optimal set of network parameters. Backpropagation is a popular learning algorithm which computes the derivatives of the loss function and uses the gradient descent algorithm to make appropriate parameter updates. Metaheuristic optimisation algorithms, such as particle swarm optimisation, have been shown to be efficient as neural network training algorithms. The training process is assumed to be static under fixed set learning, a process in which the model randomly samples instances from a training set that remains fixed during the training process. However, under an active training strategy, the training set continuously changes and should therefore be modelled as a dynamic optimisation problem. This study investigates if the performance of active learners can be improved when dynamic metaheuristics are used as learning algorithms. Different training strategies were implemented in the investigation, including a sensitivity analysis selective learning algorithm and the accelerated learning by active sample selection algorithm. The analysis utilised different learning algorithms, including backpropagation, static particle swarm optimisation, and dynamic variations of the particle swarm optimisation algorithm. These training strategies were applied to seven benchmark classification datasets obtained from the UCI repository. Improved performance in the generalisation factor is produced for three of the seven classification problems in which a dynamic metaheuristic is used in an active learning setting. Although these improvements are observed, generally all training configurations achieved similar performance. The conclusion drawn from the study was that it is not definitive that dynamic metaheuristics improve the performance of active learners, because performance improvements are not consistent across all classification problems and evaluation metrics. |
Rule Extraction from Financial Time Series | The ability to predict future events is very important in scientific fields. Data mining tools extract relationships among features and feature values, and how these relationships map to the target concept. The main goal is to extract knowledge and understand trends. The resulting rule set can then be used for prediction purposes. For many real-world applications, the actual values of a time series are irrelevant; the shape of the time series can also be used to predict future events. Unfortunately, most research efforts in this area have had limited success. Rule induction and rule extraction techniques are often unsuccessful for real-valued time series analysis due to the lack of systematic effort to find relevant trends in the data. In this study, rule induction and rule extraction methods are applied to data describing trends in financial time series data. The purpose of the study is to explore the benefits of rule extraction and rule induction, specifically on financial time series. A review of rule extraction and rule induction approaches is conducted as a first step. Thereafter, a rule extraction and rule induction framework is developed and evaluated. The most important finding of this study was the importance of balanced data: performance improved significantly when excessive class imbalances were minimised, while the differences in predictive performance between the different rule extraction and rule induction algorithms were not statistically significant. |
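Rule induction of the kind discussed above can be illustrated with a decision tree whose learned rules are printed in readable form; the trend features and synthetic data below are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical trend features extracted from a price series: short- and
# long-window slopes plus volatility, with target = next-period direction.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
# Print the induced rule set in human-readable if/else form
print(export_text(tree, feature_names=["slope_5d", "slope_20d", "volatility"]))
```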
Binning Continuous-Valued Features using Meta-Heuristics | The success of any machine learning model implementation is heavily dependent on the quality of the input data. Discretization, a widely used data preprocessing step, partitions continuous-valued features into bins, which transforms the data into discrete-valued features. Not only does discretization improve the interpretability of a data set, but it also provides the opportunity to implement machine learning models which require discrete input data. This report proposes a new discretization algorithm that partitions the continuous-valued features of multivariate classification problems into bins through the use of swarm intelligence. The particle swarm optimization algorithm is utilized to find the bin boundary values of each continuous-valued feature that lead to the optimal performance of classification models. The classification accuracy of the naïve Bayes classifier, the C4.5 decision tree classifier and the one-rule classifier, resulting from the implementation of the discretizers, is used as the evaluation measure in this report. The performance of the proposed method is compared with equal width binning, equal frequency binning and the evolutionary cut-points selection for discretization algorithm, on different data sets that have mixed data types. The proposed discretizer is outperformed by the evolutionary cut-points selection for discretization algorithm when paired with the C4.5 decision tree classifier. Similarly, the equal width binning discretizer outperforms the proposed discretizer when paired with the C4.5 decision tree. |
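The proposed discretizer's fitness function can be sketched as follows: a particle encodes candidate cut-points for every feature, and its cost is the negated cross-validated accuracy of a classifier on the binned data. A PSO implementation would minimise this function over cut-point vectors; the feature-major encoding below is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import cross_val_score

def binning_fitness(cuts, X, y, bins_per_feature):
    """Cost of one particle: digitize each feature with its slice of the
    cut-point vector, then score a classifier on the discretized data."""
    Xd = np.empty_like(X, dtype=int)
    start = 0
    for j in range(X.shape[1]):
        b = np.sort(cuts[start:start + bins_per_feature - 1])
        Xd[:, j] = np.digitize(X[:, j], b)
        start += bins_per_feature - 1
    clf = CategoricalNB(min_categories=bins_per_feature)
    return -cross_val_score(clf, Xd, y, cv=5).mean()   # negate: PSO minimises

X, y = load_iris(return_X_y=True)
cuts = np.quantile(X, [0.33, 0.66], axis=0).T.ravel()  # one feasible particle
print(binning_fitness(cuts, X, y, bins_per_feature=3))
```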
A Genetic Algorithm Approach to Tree Bucking using Mechanical Harvester Data | Crosscutting of trees into timber logs is known as bucking. The logs are mainly used for producing saw logs at a mill. A log's value depends on its length and its small-end diameter. Maximising the value of the logs bucked from a tree can be viewed as an optimisation problem. This problem has been researched in the literature, with most solutions using dynamic programming. This research assignment solves the problem using a metaheuristic approach, specifically a genetic algorithm. The main research question is whether an existing bucking, on a series of stands in a forest, could have been done more optimally. The dataset used to solve the problem comes from the bucking outputs of two mechanical harvesters. The value of a log is its volume multiplied by the value per cubic metre of the log class to which it belongs, and the value of a tree is the sum of the values of its logs. It was found that the genetic algorithm outperformed the existing bucking in terms of value. The research method firstly solved the problem for a randomly selected set of trees with dynamic programming, comparing it to the solutions obtained from the genetic algorithm. It was found that the genetic algorithm obtained very similar optimal bucking values for the trees. Secondly, a genetic algorithm uses hyperparameters, namely population size, probability of crossover and probability of mutation. The hyperparameters were estimated using a particle swarm optimisation algorithm wrapped around the genetic algorithm, using a randomly selected set of trees. The hyperparameters found were used to optimise the total value of each of the five stands. The total value of the optimised stands exceeded the value of the existing bucking by a large margin. |
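The value calculation defined above doubles as the GA's fitness function. The sketch below scores a chromosome of cut lengths along one stem; the taper model, log grades, and prices are illustrative assumptions, not the study's data.

```python
import math

PRICE_PER_M3 = {"saw": 1200.0, "pulp": 450.0}           # illustrative prices

def small_end_diameter(h_m, dbh_cm=40.0):
    """Crude linear taper: diameter shrinks to zero at 30 m stem height."""
    return max(dbh_cm * (1 - h_m / 30.0), 0.0)

def log_volume(length_m, sed_cm):
    """Cylinder volume on the small-end diameter, in cubic metres."""
    return math.pi * (sed_cm / 200.0) ** 2 * length_m

def tree_value(cut_lengths):
    """GA fitness: one gene per log; sum of volume x class price per log."""
    value, height = 0.0, 0.0
    for length in cut_lengths:
        sed = small_end_diameter(height + length)
        grade = "saw" if sed >= 20.0 and length >= 3.0 else "pulp"
        value += log_volume(length, sed) * PRICE_PER_M3[grade]
        height += length
    return value                                        # GA maximises this

print(tree_value([4.9, 4.9, 3.1]))                      # score one chromosome
```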
Crop recommendation system for precision farming: Malawi use case | Machine learning (ML) has received global attention, with adoption and rapid scaling being reported across multiple industrial sectors, including agriculture, for application in the automation and optimisation of processes. The advent of new farming concepts like precision farming (PF) has introduced the use of ML-powered decision support systems (DSS). These systems assist farmers in making decisions by providing data-driven recommendations that boost farming productivity and sustainability. Despite being widely developed in many parts of the world, these technologies have not yet been adopted in the sub-Saharan region, particularly in Malawi, where infrastructure and government policy have been barriers. However, changes in policymaking and the introduction of data centres have drawn agricultural stakeholders who are pushing for the development of ICT-based technologies. The desired innovations are to support farmers in making data-guided decisions for climate change mitigation, increased productivity, and environmental sustainability. The goal of this project was to create a crop recommendation system that makes use of an ML model to forecast the best crop for farmland based on its physical, chemical, and meteorological parameters. Firstly, unlabelled data for the central region of Malawi was collected from the Department of Land and the Department of Climate Change and Meteorological Services. The data were merged, cleaned, and formatted using three methods: label encoding of categorical features; label encoding of categorical features and normalisation; and label encoding of ordinal features, one-hot encoding of nominal features, normalisation, and principal component analysis (PCA) dimensionality reduction. A K-means clustering data preprocessing step was applied, and five centroids were extracted, analysed by an expert agronomist, and labelled as conducive for maize, cassava, rice, beans, and sugarcane crops, respectively. Then, ten classifier algorithms, namely Logistic Regression (LRC), K-Nearest Neighbours (KNC), Support Vector Machine (SVC), Multilayer Perceptron (MLPC), Decision Tree (DTC), Random Forest (RFC), Gradient Boosting (GBC), Adaptive Boosting (ABC), eXtreme Gradient Boosting (XGBC), and Multinomial Naïve Bayes (MNBC), were trained on the three kinds of formatted datasets. A 5-fold cross-validation (CV) technique was used to assess the performances of the models on the three formatted datasets, evaluated using the F1 score and accuracy metrics. Lastly, the models were scored based on the CV's average F1 and accuracy scores, the model's structural complexity, and training times. Formatting technique 1 resulted in poor performance across models that use Euclidean distance measures, and formatting technique 3 was the most conducive for all the models except ABC and MNBC. On formatting 3, the KNC outperformed the other models with F1 and accuracy scores of 99%, a fast training speed, and a simple model structure. The KNC was later integrated into a test web application as its proposed method of deployment. The proof-of-concept model shows reliable results but requires further development for real-time implementation. |
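The cluster-then-classify pipeline described above can be sketched compactly with scikit-learn; the file name is a placeholder, and the cluster-to-crop mapping stands in for the agronomist's labelling.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical file of numeric soil/weather parameters per farmland record
X = StandardScaler().fit_transform(pd.read_csv("malawi_farmland.csv"))
clusters = KMeans(n_clusters=5, random_state=0).fit_predict(X)

# Expert-assigned crop per cluster id (stand-in for the agronomist's analysis)
crop_of = {0: "maize", 1: "cassava", 2: "rice", 3: "beans", 4: "sugarcane"}
y = [crop_of[c] for c in clusters]

# 5-fold CV of the KNC, the best-scoring model in the study
knc = KNeighborsClassifier(n_neighbors=5)
print(cross_val_score(knc, X, y, cv=5, scoring="f1_macro").mean())
```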
Financial Time Series Modelling using Gramian Angular Summation Fields | Gramian angular summation fields (GASF) and Markov transition fields (MTF) have been developed as an approach to encode time series into images, which allows the use of techniques from computer vision for time series classification and imputation. These techniques have been evaluated on a number of different time series problems. This research assignment applies GASF and MTF to financial time series. As a first step, a suitable financial time series is collected from a real-world system and analyzed. The data quality is assessed to identify issues to be addressed. The cleaned financial time series is encoded into images and validated using an appropriate technique to determine if a logical mapping between the time series and image planes exists. The financial time series is analyzed to determine its characteristics, and these characteristics are used to guide the formulation of a modeling problem. The modeling problem compares the usefulness of the GASF and MTF approaches against conventional time series modeling and analysis techniques. The four models considered for the formulated modeling problem consist of time series and image modeling approaches. The results from the experiment indicate that the time series approaches are better suited to this specific modeling problem. The GASF and MTF approaches do, however, provide promising outcomes when used in a combinatorial fashion: the use of a combination of GASF and MTF images does allow a model to learn better features when combined with sequence-based approaches, which improves model performance. |
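The GASF encoding itself is short enough to show directly: rescale the series to [-1, 1], map each value to an angle, and form the pairwise cosine-sum matrix. This follows the standard GASF definition; the sample series is synthetic.

```python
import numpy as np

def gasf(x):
    """Gramian angular summation field of a 1-D series: rescale to [-1, 1],
    map to angles phi = arccos(x), and form G[i, j] = cos(phi_i + phi_j)."""
    x = (2 * (x - x.min()) / (x.max() - x.min())) - 1
    phi = np.arccos(np.clip(x, -1, 1))
    return np.cos(phi[:, None] + phi[None, :])      # (n, n) image

# Encode a synthetic price path; the image can be fed to a CNN as one channel
prices = np.cumsum(np.random.default_rng(0).normal(size=64))
image = gasf(prices)
print(image.shape)
```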
Machine Learning-based Nitrogen Fertilizer Guidelines for Canola in Conservation Agriculture Systems | Soil degradation is a major problem that South African agriculture faces, and policy-makers pay special attention to it. South Africa has especially looked at the negative effects of applying poor land practices in the agriculture sector. This research assignment uses machine learning (ML) algorithms to predict the amount of nitrogen (N) to add to canola to achieve an approximately optimal yield. The output is displayed in the form of a table, known as the fertiliser recommendation system, which can be used by a farmer to achieve the desired yield. The ML algorithms used in this assignment include: random forest regressor, extra trees regressor, artificial neural network, deep neural network, k-nearest neighbour, multiple linear regression and multivariate adaptive regression splines. The primary objective of precision agriculture (PA) is to increase agricultural productivity and quality while lowering costs and emissions. Furthermore, early detection and mitigation of crop yield limiting factors may contribute to increased output and profit, and yield prediction is critical for a number of crop management and economic decisions discussed in this research assignment. The random forest regressor proved to be the most accurate in forecasting yield. The resulting random forest regressor model demonstrated that machine learning could potentially forecast canola production given some characteristics. These characteristics include, but are not limited to, average rainfall, year of planting, the amount of N remaining in the soil from the previous harvest, and rainfall each month from the planting date to the harvest date. |
The use of historical tracking data to estimate or predict vehicle travel speeds | York Timbers is an integrated forestry company that grows and manufactures lumber and plywood products. The plantations owned and maintained by York Timbers contain an expansive road network consisting of 26 661 road segments with a total length of approximately 10 000 km. In order to optimize the delivery of timber from the plantations to the mill sites, the travel speed of each road segment must be estimated. To estimate the speed of each road segment in the road network, global positioning system (GPS) measurements are first matched to the self-owned road network. Map-matching is done in two parts. First, the GPS measurements are assigned to the closest road segment based on Euclidean distance. Next, the connectivity of road segments is analyzed to fix any errors introduced during map-matching. The average travel speed is then calculated for each road segment using the matched GPS measurements. The majority of the road segments, however, do not have GPS measurements associated with them. To estimate the travel speed of road segments without GPS measurements, five different predictive models are developed. The best performance is obtained using a regression tree, which achieves a mean absolute error of 10.02 km/h on data not used to train the model. To improve the speed estimation accuracy, further refinement of the speed estimation model and speed prediction model is required. Increasing the number of GPS measurements used in the estimation and prediction of travel speed will improve the model performance. Including other data that influences safe travel speed, such as weather data, will further improve the model performance. Identifying dangerous portions of the road network is also suggested before a model is implemented. |
A Review and Analysis of Imputation Approaches | Missing data is a common and major challenge which almost all data practitioners and researchers face, and which greatly affects the accuracy of any decision-making process. Data mining and data preparation require that the data is prepared, cleaned, transformed, and reduced in order to ensure that the integrity of the dataset has been maintained. Missing data is found and addressed within the data cleaning process, during which the user needs to decide how to handle the missing data so as not to introduce significant bias into the dataset. Current methods of handling missing data include deletion and imputation methods. This research assignment investigates the performance of different imputation methods, specifically statistical and machine learning imputation methods. The statistical imputation methods investigated are mean, hot deck, regression, maximum likelihood, Markov chain Monte Carlo (MCMC), multiple imputation by chained equations, and expectation-maximization with bootstrapping imputation. The machine learning methods investigated are k-nearest neighbor (kNN), k-means, and self-organizing map imputation. This research uses an empirical procedure to facilitate the formatting and transformation of the data and the implementation of the imputation methods. Two experiments are conducted: one in which the imputation methods are evaluated against datasets which are clean, and another in which the imputation methods are evaluated against datasets which contain outliers. The performance achieved in both experiments is evaluated using the root mean squared error, mean absolute error, percent bias, and predictive accuracy. For both experiments, it is found that MCMC imputation resulted in the best performance out of all 10 imputation methods, with an overall accuracy of 75.71%. kNN imputation resulted in the second highest accuracy, with an overall accuracy of 69.85%; however, it introduced a large percent bias into the imputed dataset. This research concludes that single statistical imputation methods (mean, hot deck, and regression imputation) should not be used to replace missing data in any situation, while multiple imputation methods are shown to have consistent performance. MCMC imputation in particular performs the best out of all 10 imputation methods in this research, producing a high accuracy and low bias in the imputed dataset. The performance of MCMC imputation, along with its ease of use, makes it a suitable choice when dealing with missing data. |
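The mask-impute-score evaluation used above can be illustrated with scikit-learn imputers standing in for the kNN and multiple-imputation families discussed; the missingness rate and data are illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer

rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 5))
X_miss = X_true.copy()
mask = rng.random(X_true.shape) < 0.2        # 20% missing completely at random
X_miss[mask] = np.nan

for name, imp in [("kNN", KNNImputer(n_neighbors=5)),
                  ("iterative", IterativeImputer(random_state=0))]:
    X_hat = imp.fit_transform(X_miss)
    # RMSE on the masked entries only, where the ground truth is known
    rmse = np.sqrt(((X_hat[mask] - X_true[mask]) ** 2).mean())
    print(name, round(rmse, 3))
```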
Crawler Detection Decision Support: A Neural Network with Particle Swarm Optimisation Approach | Website crawlers are popularly used to retrieve information for search engines. The concept of website crawlers was first introduced in the early nineties. Website crawling entails the deployment of automated crawling algorithms that crawl websites with the purpose of collecting and storing information about the state of other websites. Website crawlers are categorized as good website crawlers or bad website crawlers. Good website crawlers are used by search engines and do not cause harm when crawling websites. Bad website crawlers crawl websites with malicious intent and could potentially cause harm to websites or website owners. Traffic indicators on websites are inflated if website crawlers are incorrectly identified. In some cases, bad crawlers are used to intentionally crash websites. The consequences of bad website crawlers highlight the importance of successfully distinguishing human users from website crawler sessions in website traffic classification. The focus of this research assignment is to design and implement artificial neural network algorithms capable of successfully classifying website traffic as a human user, a good website crawler session, or a bad website crawler session. The artificial neural network algorithms are trained with particle swarm optimizers and are validated in case studies. First, the website traffic classification problem is considered in a stationary environment and is treated as a standard classification problem. For the standard classification problem, an artificial neural network with particle swarm optimization is applied. The constraints associated with this initial problem assume that the behavioural characteristics of humans and of web crawlers remain constant over a period of time. Thereafter, the classification problem is considered in a non-stationary environment. The dynamic classification problem exhibits concept drift due to the assumption that website crawlers change behavioural characteristics over time. To solve the dynamic classification problem, artificial neural networks are formulated and optimized with quantum-inspired particle swarm optimisation. Results demonstrate the ability of the artificial neural networks optimised with particle swarms to classify website traffic in both stationary and non-stationary environments successfully to a reasonable extent. |
A comparative study of different single-objective metaheuristics for hyper-parameter optimisation of machine learning algorithms | Over the past three decades machine learning has evolved from a research curiosity to a practical technology that enjoys widespread commercial success. In the continuous quest to gain a competitive advantage and thereby market share, companies are highly incentivised to adopt technologies that reduce costs and/or increase productivity. Machine learning has proved to be one of these technologies. A significant trend in the contemporary machine learning landscape has been the rise of deep learning, which experienced tremendous growth in its popularity and usefulness, predominantly driven by larger data sets, increases in computational power and more efficient training procedures. The recent interest in deep learning (along with automated machine learning frameworks), both having many hyper-parameters and large computational expenditure, has prompted a resurgence of research on hyper-parameter optimisation. Stochastic gradient descent and other derivative-based optimisation methods are seldom used for hyper-parameter optimisation, because derivatives of the objective function with respect to hyper-parameters are generally not available. The objective function for hyper-parameter optimisation is therefore considered to be a black-box function. Conventionally, hyper-parameter optimisation is performed manually by a domain expert to keep the number of trials to a minimum; however, with modern compute clusters and graphics processing units it is possible to run more trials, in which case algorithmic approaches are favoured. The process of finding a high-quality set of hyper-parameter values for a machine learning algorithm is often time-consuming and compute-intensive, and efficiency is therefore considered one of the most important metrics for evaluating the effectiveness of a hyper-parameter optimisation technique. Popular algorithmic methods for hyper-parameter optimisation include grid search, random search, and more recently Bayesian optimisation. Metaheuristics, defined as high-level problem-independent frameworks that serve as guidelines for the design of underlying heuristics to solve a specific problem, are investigated as an alternative to traditional hyper-parameter optimisation techniques. Genetic algorithms, particle swarm optimisation, and estimation of distribution algorithms were selected to represent metaheuristic algorithms. To compare traditional and metaheuristic hyper-parameter optimisation algorithms on the basis of efficiency, a test suite comprising various data sets and machine learning algorithms is constructed. The machine learning algorithms considered in this research assignment are support vector machines, multi-layer perceptrons, and convolutional neural networks. The efficiency of hyper-parameter optimisation algorithms is compared using independent case studies, where the hyper-parameters of a different machine learning algorithm are optimised in each case. Friedman omnibus tests are employed to determine whether a difference in average rank exists for the outcomes obtained using the respective hyper-parameter optimisation techniques. Upon rejection of the null hypothesis of the Friedman test, Nemenyi post hoc tests are performed to identify pairwise differences between hyper-parameter optimisation techniques. Other metrics of solution quality, such as computational expenditure, are also investigated. |
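The statistical protocol described above, a Friedman omnibus test followed by Nemenyi post hoc comparisons, can be sketched with scipy and the scikit-posthocs package; the score matrix below is illustrative, not the study's results.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# Rows = case studies (data set / model pairs), columns = HPO techniques,
# e.g. grid search, random search, Bayesian optimisation, PSO (illustrative).
scores = np.array([[0.91, 0.93, 0.95, 0.94],
                   [0.85, 0.88, 0.90, 0.91],
                   [0.78, 0.80, 0.83, 0.84],
                   [0.88, 0.90, 0.92, 0.92],
                   [0.70, 0.74, 0.77, 0.78]])

stat, p = friedmanchisquare(*scores.T)       # omnibus test over techniques
print(f"Friedman p-value: {p:.4f}")
if p < 0.05:                                 # pairwise post hoc comparisons
    print(sp.posthoc_nemenyi_friedman(scores))
```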
Predicting employee burnout using machine learning techniques | While artificial intelligence techniques, and the possibilities of using them to solve business problems, are well understood in some industries, including life insurance and banking, applying them to the domain of human capital management has been met with varying levels of success and value. Models that assist in recruitment activities or predict employee attrition have been successfully implemented by many organisations. However, there are also many pitfalls to guard against, including inherent bias in the data used as well as how the output of such models is used, which often leads to ethical concerns. In this research assignment, multiple classification models and machine learning algorithms are applied to the problem of identifying employees at risk of burnout, with the aim of producing outputs that can be used to ethically and proactively guide wellbeing-related interventions across the business. The results show that none of the approaches were successful in accurately meeting this objective, with an artificial neural network approach assessed as the most accurate of the models implemented. By evaluating each classification model's performance, it was found that none of the implemented approaches were more than 50% accurate. |
Title | Abstract |
---|---|
Comparison of Machine Learning Models for the Classification of Fluorescent Microscopy Images | The lasting health consequences of a COVID-19 infection, referred to as Long COVID, can be severe and debilitating for the individual afflicted. Symptoms of Long COVID include fatigue and brain fog. These symptoms are caused by microclots that form in the bloodstream and are not broken up by the body. Microclots in the bloodstream can entangle with other proteins and can limit oxygen exchange. This inhibition of the oxygen exchange process can cause most of the symptoms experienced with Long COVID. Diagnosis and identification of individuals suffering from Long COVID is the first step in any process that aims to alleviate the symptoms of the individual, or cure them. Current identification processes are manual and as such are limited by the amount of manpower applied to the task. Automating parts of the process with machine learning can greatly speed up this process and allow more efficient use of manpower. The purpose of this research assignment is to investigate whether machine learning algorithms can be used to classify fluorescent microscopy images as being indicative of Long COVID. This is done by training models on, and predicting from, features extracted from fluorescent microscopy images using computer vision techniques. A comparison of the performance of the machine learning algorithms used in this research assignment is also presented. It was found that logistic regression is a good choice of classifier, with strong performance in the classification of both the positive and negative classes. |
Anomaly Detection in Support of Predictive Maintenance of Coal Mills using Supervised Machine Learning Techniques | Since the beginning of time, people have been dependent on technology. With each industrial revolution, people became more reliant on machines and, in parallel, on the need to maintain them. The goal of any maintenance organisation is always the same: to maximise asset availability. Our massive strides in technology have paved the way for the birth of Industry 4.0, where the focus starts to shift from preventive maintenance to predictive maintenance. Predictive maintenance does not follow a schedule like preventive maintenance; instead, it performs maintenance when it is necessary, not too early or too late. This research assignment performs a study in support of predictive maintenance of coal mills through supervised machine learning. The assignment uses the coal mill data from a case study company to identify data quality issues, address these issues, prepare the data for machine learning, and finally build a machine learning model which aims to predict when failure is most likely to occur. The assignment evaluates the feasibility of building a supervised machine learning model using the given data and methodology, draws conclusions about the findings, and identifies opportunities for future research. |
Comparison of unsupervised machine learning models for identification of financial time series regimes and regime changes | Financial stock data has been studied extensively over many years with the objective of generating the best possible return on an investment. It is known that financial markets move through periods where securities are increasing in value (bull markets) and periods where these securities decrease in value (bear markets). These periods that exhibit similarities over different time frames are often referred to as regimes; they are not necessarily limited to bull and bear regimes, but include any sequences of data that experience correlated trends. Regime extraction and detection of regime shift changes in financial time series data can be of great value to an investor. An understanding of when these financial regimes will change and what type of regime the financial market is tending towards can help improve investment decisions and strengthen financial portfolios. This research deals with reviewing and comparing the viability of different regime shift detection algorithms when applied to multivariate financial time series data. The selected algorithms are applied to different stocks from the Johannesburg Stock Exchange (JSE), where the algorithms' performances are compared with respect to regime shift detection accuracy and the profitability of the regimes in selected investment strategies. |
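One widely used family of regime-extraction methods consistent with the setting above is a hidden Markov model over returns, in which each hidden state acts as a regime and state switches mark regime shifts. The sketch below uses hmmlearn on synthetic returns; the study's actual algorithms are not specified in the abstract.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
returns = np.concatenate([rng.normal(0.001, 0.01, 250),    # calm bull regime
                          rng.normal(-0.002, 0.03, 250)])  # volatile bear regime

hmm = GaussianHMM(n_components=2, covariance_type="diag", n_iter=100)
hmm.fit(returns.reshape(-1, 1))
states = hmm.predict(returns.reshape(-1, 1))   # one regime label per day
changes = np.flatnonzero(np.diff(states)) + 1  # detected regime shift points
print(states[:10], changes[:5])
```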
Detection of chronic kidney disease using machine learning algorithms | Chronic Kidney Disease (CKD) is a significant public health concern worldwide that affects one in every ten people globally. CKD results from a poorly functioning kidney that fails at its basic functions, including removing toxins, waste, and extra fluids from the blood. The build-up of this problematic material in the body can cause complications such as hypertension, anaemia, weak bones, and nerve damage. CKD often occurs in individuals that suffer from additional chronic illnesses such as diabetes, heart disease, and hypertension, in addition to unfavourable health habits and practices that contribute to the kidney's deplorable state. The presence of additional illnesses that occur in tandem with CKD hinders its successful and early detection. The onset of CKD can be clinically detected using laboratory tests focusing on specific standard parameters such as the glomerular filtration rate (GFR) and the albumin-creatinine ratio. Kidney damage occurs in stages, with each subsequent stage indicating a severe reduction in the glomerular filtration rate. The GFR parameter is considered a key indicator of renal failure and of the final stage of chronic kidney disease. It is therefore imperative to use early detection methods to assist in the early administration of treatment to alleviate the symptoms of the disease and combat its progression. Treatment following early-stage diagnosis involves medications, diet adjustments and invasive procedures. In developing countries, especially in Africa, the prevalence of CKD is estimated at three to four times higher than in developed countries in Europe, America and Asia. The current dialysis treatment rate in South Africa stands at approximately 70 per million population (pmp), and the transplant rate stands at approximately 9.2 pmp. The reported prevalence rate mainly reflects individuals with access to private health care options through affordability or medical insurance; however, most South Africans (approximately 84%) depend on the under-resourced, government-funded public health system. The disparity in treatment affordability among South Africans of different economic classes introduces a two-tiered health system that affects access to quality treatments. Early detection and diagnosis using machine learning algorithms is thus an important need in the field of CKD and the other chronic illnesses plaguing the nation. Machine learning applications in the health care sector aim to revolutionise the early detection and treatment of chronic illness for the greater global population. Since early detection and management are vital in preventing disease progression and reducing the risk of complications, some machine learning (ML) models have been developed to detect CKD. The primary purpose of this study is to review, develop and recommend various machine learning classification models for the efficient detection of chronic kidney disease using three datasets. These datasets include two UCI Machine Learning Repository datasets, Chronic Kidney Disease and Risk Factor Prediction of Chronic Kidney Disease, and the PLOS ONE dataset Chronic kidney disease in patients at high risk of cardiovascular disease in the United Arab Emirates: A population-based study. The final aim is to construct a high-performing ML model that has effectively and accurately learned the hidden correlations in the symptoms exhibited by CKD patients. |
Feature engineering approaches for financial time series forecasting using machine learning | This research assignment investigates feature engineering methods for financial time series forecasting using machine learning. The goal of the work is to investigate methods that overcome some of the time series characteristics which make forecasting difficult, namely noise and non-stationarity. A literature review is conducted to identify suitable feature engineering methods and machine learning approaches for financial time series forecasting. A case study is developed to test the identified feature engineering methods with an empirical machine learning process. Multiple machine learning models are tested. To understand the benefit of the feature engineering methods, the forecasting results are compared with and without the application of the feature engineering methods. Several feature engineering methods are identified: differencing and log-transforms are investigated to address non-stationarity, while moving averages, exponentially weighted moving averages, and Fourier and wavelet transforms are investigated to reduce noise. The feature engineering methods are implemented as preprocessing steps prior to training machine learning models for a supervised learning problem. The supervised learning problem is to forecast the next day's asset price, given ten days of previous prices. Four machine learning models commonly used for financial time series forecasting are investigated, namely linear regression, support vector regression (SVR), multilayer perceptron (MLP), and long short-term memory (LSTM) neural networks. The work investigates the feature engineering methods and machine learning models on four univariate time series signals. The investigation found that no feature engineering method is universally helpful in improving forecasting results. For the SVR, MLP and LSTM models, denoising or smoothing the signals did improve their performance, but the best denoising or smoothing technique varies depending on the dataset used. Differencing and log-transforms caused the models to forecast a constant value near the mean of expected daily price returns, which, when inverted back to the price domain, causes poor regression evaluation metrics but good directional accuracy. The findings of this research assignment are that the investigated feature engineering methods may improve forecasting performance for financial time series, but that the gains are not large. There appears to be limited improvement to be gained through feature engineering past price data to predict future prices, at least for the investigated feature engineering methods. It is therefore recommended that future work focus on finding alternative data sources with predictive power for the financial time series. |
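Two of the feature engineering steps named above, differencing via log-returns and EWMA smoothing, can be sketched together with the ten-day windowing of the supervised problem; the EWMA span and helper name are illustrative.

```python
import numpy as np
import pandas as pd

def make_features(prices: pd.Series, window: int = 10):
    """Build a supervised dataset from a price series: log-returns address
    non-stationarity, an EWMA of the returns reduces noise, and a sliding
    ten-day window forms each training instance."""
    log_ret = np.log(prices).diff()                 # differenced log-transform
    smooth = log_ret.ewm(span=5).mean()             # EWMA denoising
    feats = pd.concat([log_ret, smooth], axis=1, keys=["ret", "ewma"]).dropna()
    X = np.stack([feats.iloc[i - window:i].to_numpy().ravel()
                  for i in range(window, len(feats))])
    y = feats["ret"].iloc[window:].to_numpy()       # next-day return target
    return X, y
```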
Forecasting armed conflict using long short-term memory recurrent neural networks | Various recent studies have shown an optimistic future for social conflict forecasting by taking more data-driven approaches, which are particularly timely amid the big-data revolution. Conflict forecasting models can be used to reduce the severity of events or to intervene to prevent these events from materialising or escalating. As such, these predictive models are of interest to numerous institutions and organisations, such as governments and non-governmental organisations, humanitarian agencies, and even insurance companies. In this mini-dissertation, long short-term memory recurrent neural network modelling is applied to forecast armed conflict events in the Afghanistan conflict, which started in October 2011. This model utilises world news data from the Global Database of Events, Language, and Tone (GDELT) platform and georeferenced event data from the Uppsala Conflict Data Program (UCDP) to make its predictions (see the illustrative sketch after this table). The results show that GDELT data can improve conventional baseline forecasting models to an extent by incorporating actor and event attributes that are unique to the conflict at hand. Furthermore, the results indicate that news media data can be consolidated with actual recorded deaths in the forecasting model, which enables predictions that are grounded in reality. |
Comparison of machine learning models on different financial time series | The efficient market hypothesis implies that shrewd market predictions are not profitable, because each asset remains correctly priced by the weighted intelligence of the market participants. Several companies have, however, shown that the efficient market hypothesis does not hold in practice. Consequently, a considerable amount of research has been conducted to understand the performance and behaviour exhibited by financial markets, as such insights would prove valuable in the quest to identify which products will provide a positive future return. Recent advancements in artificial intelligence have presented researchers with exciting opportunities to develop models for forecasting financial markets. This dissertation investigated the capabilities of different machine learning models to forecast the future percentage change of various assets in financial markets. The financial time series (FTS) data employed are the S&P 500 index, the US 10-year bond yield, the USD/ZAR currency pair, gold futures and Bitcoin. Only the closing price data for each FTS were used. The machine learning (ML) models investigated are linear regression, autoregressive integrated moving average, support vector regression (SVR), multilayer perceptron (MLP), recurrent neural network, long short-term memory and gated recurrent unit. This dissertation uses an empirical procedure to facilitate the formatting, transformation, and modelling of the various FTS data sets on the ML models of interest. Two validation techniques are also investigated, namely single out-of-sample validation and walk-forward validation (see the illustrative sketch after this table). The performance capabilities of the models are then evaluated with the mean square error (MSE) and an accuracy metric. Within the context of FTS forecasting, the accuracy metric refers to the ratio of the number of correct guesses about whether the price moved up or down to the total number of guesses. An accuracy even one percentage point above 50% is considered substantial when forecasting FTS, because a 1% edge on the market can yield an average return that outperforms the market. In the individual analyses of the single out-of-sample and walk-forward validation techniques, the linear regression model was the best ML model for all FTS, because it is the most parsimonious model. Parsimony was disregarded when comparing and contrasting the two validation techniques. The ML models applying walk-forward validation performed best in terms of MSE on the S&P 500 index and the US 10-year bond yield. The SVR model obtained the highest accuracy of 52.94% on the S&P 500 index, and the MLP model obtained the highest accuracy of 51.26% on the US 10-year bond yield. The ML models applying single out-of-sample validation performed best in terms of MSE on the USD/ZAR currency pair, gold futures and Bitcoin. The MLP model obtained the highest accuracy of 51.77% and 53.51% for the USD/ZAR currency pair and gold futures, respectively. The linear regression model obtained the highest accuracy of 55.04% for Bitcoin. |
Proximal methods for seedling detection and height assessment using RGB photogrammetry and machine learning | An ever-growing global population, coupled with increasing per capita consumption and higher demand for wood-based products, has contributed towards a growing demand for planted forests. The efficiency of such forests is in no small part due to ensuring that planted seedlings are well suited to the local environment. This, in turn, has resulted in a growing demand for nurseries to cultivate such seedlings. Nursery operators are faced with the challenge of monitoring stock levels and determining the growth stage of the stock on hand, which typically involves laborious manual assessments based on statistical sampling of only a small percentage of the stock. In this study, a framework for the proximal detection and height assessment of seedlings is proposed. Photogrammetry is employed using red-green-blue (RGB) imagery captured with a smartphone to produce digital surface models (DSMs) and orthomosaic images. Three image collection strategies are proposed and evaluated based on ground control point accuracy. A RetinaNet object detection model, pre-trained on unmanned aerial vehicle (UAV) derived RGB imagery, is utilised for the object detection task. Transfer learning is leveraged by retraining the detection model on a single seedling tray consisting of 98 seedlings, using the orthomosaics produced by the photogrammetry process. To determine the heights of these seedlings, two proposals for sampling the seedling height from the DSM are proffered and evaluated. Finally, a number of regression algorithms are investigated as a tool to refine the sampled height, of which the ensemble-based AdaBoost regression algorithm achieves the best performance (see the illustrative sketch after this table). The proposed pipeline is able to detect 98.97% of seedlings at an intersection over union (IOU) of 76.93%, with only a single instance left undetected. The final root mean squared error (RMSE) of 17.26 mm achieved by the height refinement process on the test data suggests performance sufficient to enable an improved understanding of stock quantities and growth stage without the need for manual intervention. |
Automated tree position detection and height estimation from RGB aerial imagery using a combination of a local-maxima based algorithm, deep learning and traditional machine learning approaches | Forest mensuration is a pivotal aspect of forest management, particularly when determining the total biomass, and subsequently the fiscal value, of forest plantations. Terrestrial measurement of phenotypic tree attributes tends to be laborious and time-consuming. Remote sensing (RS) approaches have revolutionised the way in which forest mensuration is conducted, especially owing to the reduced costs and increased accessibility associated with leveraging unmanned aerial vehicles (UAVs) that incorporate high-resolution imaging sensors. The rapid development of digital aerial photogrammetry (DAP) technologies has provided a viable alternative to airborne laser scanning (ALS), a technology that has typically been reserved for applications in which high accuracy is required and budget constraints are not a major concern. Furthermore, machine learning (ML), and particularly computer vision (CV), are becoming increasingly commonplace in the processing of orthomosaic rasters and canopy height models (CHMs). Traditionally, an ALS- or DAP-derived CHM has been utilised, together with a local maxima-type model, to detect tree crown apexes and estimate tree heights (see the illustrative sketch after this table). In this study, a forest stand located in KwaZulu-Natal, South Africa, comprising 4 968 Eucalyptus dunnii tree positions spaced at 3×2 metres, was considered. A local maxima (LM) algorithm was employed as the baseline model to improve upon. The output of the LM algorithm was, however, also utilised in an ensemble of ML models designed to better estimate tree positions and heights. A hybrid approach was proposed that integrates object detection, classification, and regression models in an ML framework, with the intention of improving on the accuracies achieved by the LM algorithm. The object detection model was built on the RetinaNet one-stage detection model, which comprises a feature pyramid network (FPN) and employs a focal loss (FL) function rather than the typical cross-entropy (CE) loss function, addressing the extreme class imbalance typically encountered by object detection models. This RetinaNet was made available as part of the DeepForest (DF) Python package, and the underlying network had been pretrained on a substantial amount of forest canopy imagery. To improve the model, hand annotations of trees depicted in the DAP-derived orthomosaic were generated and subsequently employed in further training the DF model through transfer learning. A support vector machine (SVM) model was built to filter misclassified tree positions, acting as a differentiator between legitimate and illegitimate tree positions. Furthermore, a multilayer perceptron (MLP) was trained to address the inherent bias present in the CHM and improve tree height estimations sampled from it. The improvements in tree position and height accuracies were noticeable. Tree position MAE was improved by 15.68%, from 0.3515 metres to 0.2964 metres. Tree height RMSE was improved by 25.30%, from 0.6435 metres to 0.4807 metres, while R2 with respect to height was increased by 15.22%, from 0.6662 to 0.7676. The proportion of total trees detected was reduced by 3.33%, from 98.77% to 95.48%. The numbers of dead and invalid tree positions detected were, however, also decreased by 82.35% and 36.36%, respectively, suggesting a substantial improvement in the quality of the tree positions detected. The results demonstrate the potential improvements that can be realised by incorporating ML approaches and DAP-derived data. |
Fantasy Premier League Decision Support: A Meta-learner Approach | The Fantasy Premier League is a popular online fantasy sport game in which players, known as managers, construct so-called dream-teams based on soccer players in the English Premier League. Each player in the dream-team is assigned a points score based on their performance in each gameweek’s fixtures, and the goal of the fantasy sport is to maximize the points accumulated over the course of an entire season. Each season consists of thirty-eight gameweeks, with managers required to select eleven starting players, a captain, and four substitute players for each gameweek. Unless a so-called special chip is used, only eleven of the fifteen players can accumulate points during each gameweek. The manager’s selected dream-team is carried over to the successive gameweek, with managers allowed to transfer players into and out of their teams each gameweek. Managers are penalized for excessive player transfers and, adding to the strategic complexity of the fantasy game, face strict constraints when formulating their teams. The so-called dream-team formulation problem can be decomposed into an initial dream-team formulation sub-problem and a subsequent player-transfer sub-problem. The constraints associated with these sub-problems can be expressed as a system of linear equations and, given an estimate of a player’s expected performance in a fixture, a set of suggested player transfers can be obtained using linear programming (see the illustrative sketch after this table). The focus in this project is to design and implement a set of machine learning algorithms capable of forecasting the expected points of the players in a gameweek’s fixtures, after which a decision support system is designed and implemented to obtain a suggested initial dream-team and a set of player transfers for the subsequent gameweeks. A total of five machine learning algorithms are considered, each selected from a distinctly functioning family of learning algorithms: linear regression techniques, as well as kernel-based, neural network, decision tree ensemble, and nearest-neighbour algorithms. The applicability of a stacked meta-learner is investigated, where the meta-learner is provided with the predictions generated by the five implemented algorithms. A case study is performed on the 2020/21 Fantasy Premier League season, in which the quality of the suggested player transfers is validated. The final results demonstrate that the decision support system performs favorably: the best set of suggested player transfers would have placed in the top 5.98% of eight million real-world managers in the 2020/21 season. |
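For the CKD detection project above, the following is a minimal, hypothetical sketch of the kind of classification pipeline described in the abstract; the file name `ckd.csv` and its `class` column are placeholder assumptions, not the study's actual data layout or chosen models.

```python
# Illustrative sketch (not the study's actual pipeline): training a
# classifier for CKD detection with scikit-learn. "ckd.csv" and the
# "class" column are hypothetical placeholders for the UCI data.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("ckd.csv")                      # hypothetical file name
X, y = df.drop(columns="class"), df["class"]     # "class": ckd / notckd

numeric = X.select_dtypes("number").columns
categorical = X.columns.difference(numeric)

# Impute the many missing clinical values, then one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]),
     categorical),
])
model = Pipeline([("prep", preprocess),
                  ("clf", RandomForestClassifier(random_state=0))])

# Recall matters clinically: a missed CKD case is costlier than a false alarm.
print(cross_val_score(model, X, y, cv=5, scoring="recall_macro").mean())
```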
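For the feature-engineering project above, this sketch illustrates, on stand-in data and with assumed window lengths, the differencing, log-return and moving-average transforms together with the ten-day supervised framing; the Fourier and wavelet transforms are omitted for brevity.

```python
# A minimal sketch of the described preprocessing ideas on a synthetic
# price series; window lengths are illustrative, not tuned values.
import numpy as np
import pandas as pd

prices = pd.Series(np.cumsum(np.random.randn(500)) + 100)  # stand-in data

# Non-stationarity: differencing and log-returns.
diff = prices.diff()
log_ret = np.log(prices).diff()

# Noise reduction: simple and exponentially weighted moving averages.
sma = prices.rolling(window=5).mean()
ewma = prices.ewm(span=5).mean()

# Supervised framing: ten lagged prices as features, next-day price as target.
def make_windows(series: pd.Series, n_lags: int = 10) -> pd.DataFrame:
    cols = {f"lag_{i}": series.shift(i) for i in range(1, n_lags + 1)}
    cols["target"] = series.shift(-1)
    return pd.DataFrame(cols).dropna()

windows = make_windows(sma.dropna())
print(windows.head())
```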
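For the armed-conflict forecasting project above, the following is a minimal many-to-one LSTM sketch in Keras on synthetic data; the construction of real GDELT/UCDP features, and all dimensions shown, are assumptions for illustration only, not the mini-dissertation's architecture.

```python
# Minimal many-to-one LSTM forecaster sketch on synthetic sequences.
import numpy as np
from tensorflow import keras

T, F = 12, 8   # 12 time steps, 8 event features per step (illustrative)
X = np.random.rand(1000, T, F).astype("float32")
y = np.random.rand(1000, 1).astype("float32")  # e.g. fatalities next step

model = keras.Sequential([
    keras.layers.Input(shape=(T, F)),
    keras.layers.LSTM(32),        # summarises the event-history window
    keras.layers.Dense(1),        # single-step forecast
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
print(model.predict(X[:1]))
```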
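For the model-comparison project above, this sketch contrasts a single out-of-sample split with walk-forward (expanding-window) validation on a toy return series, scored with the directional-accuracy metric defined in the abstract; the lag count and split sizes are illustrative assumptions.

```python
# Sketch: single out-of-sample split vs walk-forward validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
returns = rng.normal(0, 0.01, 600)                 # toy daily returns
X = np.column_stack([np.roll(returns, i) for i in range(1, 11)])[10:]
y = returns[10:]

def directional_accuracy(y_true, y_pred):
    # Fraction of days where the predicted direction matches reality.
    return np.mean(np.sign(y_true) == np.sign(y_pred))

# Single out-of-sample split: train on the first 80%, test on the rest.
cut = int(0.8 * len(y))
model = LinearRegression().fit(X[:cut], y[:cut])
print("single split:", directional_accuracy(y[cut:], model.predict(X[cut:])))

# Walk-forward validation: repeatedly refit on an expanding window.
scores = []
for tr, te in TimeSeriesSplit(n_splits=5).split(X):
    fold = LinearRegression().fit(X[tr], y[tr])
    scores.append(directional_accuracy(y[te], fold.predict(X[te])))
print("walk-forward:", np.mean(scores))
```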
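For the seedling assessment project above, this sketch shows only the height-refinement idea: regressing a refined height from simple DSM patch statistics with AdaBoost. All data, features and units are synthetic stand-ins, not the study's pipeline.

```python
# Sketch of AdaBoost height refinement on synthetic DSM patch statistics.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 300
# Per-seedling DSM samples, e.g. max/mean/percentile heights (mm, synthetic).
patch_stats = rng.normal(150, 30, size=(n, 3))
true_height = patch_stats[:, 0] * 0.9 + rng.normal(0, 10, n)  # synthetic truth

X_tr, X_te, y_tr, y_te = train_test_split(patch_stats, true_height,
                                          random_state=0)
reg = AdaBoostRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, reg.predict(X_te)) ** 0.5
print(f"refined-height RMSE: {rmse:.2f} mm")
```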
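For the tree position and height project above, this sketch implements a generic local-maxima baseline on a synthetic canopy height model; the window size and height threshold are assumptions, not the study's calibrated values.

```python
# Sketch of a local-maxima baseline: detect crown apexes in a CHM raster
# with a sliding maximum filter (synthetic CHM, illustrative parameters).
import numpy as np
from scipy.ndimage import maximum_filter

rng = np.random.default_rng(2)
chm = rng.gamma(shape=2.0, scale=3.0, size=(200, 200))  # stand-in CHM (m)

window = 7        # neighbourhood size, assumed to match stem spacing
min_height = 5.0  # ignore ground/shrub returns

# A pixel is an apex if it equals the local maximum and clears the threshold.
local_max = (chm == maximum_filter(chm, size=window)) & (chm > min_height)
rows, cols = np.nonzero(local_max)
heights = chm[rows, cols]
print(f"detected {len(rows)} candidate tree apexes")
```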
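For the Fantasy Premier League project above, this sketch poses a stripped-down squad-selection sub-problem as an integer linear programme with PuLP; the prices, point forecasts and two constraints shown are fabricated simplifications of the game's full rules (positions, per-club limits and transfers are omitted).

```python
# Sketch: squad selection as an integer linear programme (PuLP).
import pulp

players = [f"p{i}" for i in range(30)]
price = {p: 4.0 + (i % 9) * 0.5 for i, p in enumerate(players)}   # fabricated
points = {p: 2.0 + (i % 7) * 0.8 for i, p in enumerate(players)}  # forecasts

prob = pulp.LpProblem("dream_team", pulp.LpMaximize)
pick = pulp.LpVariable.dicts("pick", players, cat="Binary")

prob += pulp.lpSum(points[p] * pick[p] for p in players)          # objective
prob += pulp.lpSum(pick[p] for p in players) == 15                # squad size
prob += pulp.lpSum(price[p] * pick[p] for p in players) <= 100.0  # budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([p for p in players if pick[p].value() == 1])
```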
Title | Abstract |
---|---|
Requirements for 3D stock assessment of timber on landings and terminals | This project aims to address the issue of an unreliable stock assessment system in the timber supply chain, which leads to inaccurate estimations of stock volumes in log piles. The system developed in this project needs to satisfy the practical constraints of the supply chain while generating results that are frequent and accurate. The data capturing process is required to be low-tech because of the vast rural areas covered by the timber supply chain; the method identified for achieving this is terrestrial structure from motion (SFM), using a consumer-grade camera or a smartphone. The final data used in the project take the form of point clouds, generated both from SFM and from Unity, the latter to increase the amount of data available. For the system to determine the volume of log piles, the first step is to distinguish log pile from terrain within the point cloud. To do this, a classification algorithm is developed as part of this project. The algorithm makes use of neighbourhood statistics calculated during the feature engineering process, along with features in the original point cloud dataset, and applies K-means clustering to classify the log piles. Once the log piles have been extracted from the point cloud, an alpha shape is generated from their points, which is then used to predict the final volume of the log piles (see the illustrative sketch after this table). The results of the final system show that the methodology achieves volume predictions of an acceptable level of accuracy for the intended use case. The results of this project thus provide evidence that computer vision can benefit the timber supply chain by enabling accurate stock assessments. Finally, the project acknowledges that further work is needed to improve the accuracy and to implement the system. |
A predictive model for precision tree measurements using applied machine learning | Accurately determining biological asset values is of great importance for forestry enterprises, and the process ought to be characterised by the proper collection of tree data by means of appropriate enumeration practices conducted at managed forest compartments. Currently, only between 5% and 20% of forest areas are enumerated, serving as a representative sample for the entire enclosing compartment. For forestry companies, timber volume estimations and future growth projections are based on these statistics, which may be accompanied by numerous unintentional errors arising during the data collection process. Many alternative methods for estimating and inferring tree data accurately are available in the literature; the most popular characteristic is the so-called diameter at breast height (DBH), which can also be measured by means of remote sensing techniques. Advancements in laser scanning measurement apparatus have been significant in recent decades; however, these approaches are notably expensive and require specialised technical skills to operate. One of the main drawbacks associated with the measurement of DBH by means of laser scanning is the lack of scalability, as equipment setup and data capture are arduous processes that take a significant amount of time to complete. Algorithmic breakthroughs in the domain of data science, predominantly spanning machine learning (ML) and deep learning (DL) approaches, warrant the selection and practical application of computer vision (CV) procedures. More specifically, monocular depth estimation (MDE) techniques are investigated in this thesis for the extraction of tree data features from video recordings, captured using no more than an ordinary smartphone device. Towards this end, a suitable forest study area was identified in which to conduct the experiment, and the industry partner of the project, the South African Forestry Company SOC Limited (SAFCOL), granted the necessary plantation access. The research methodology adopted for this thesis includes fieldwork at the given site, which involved first performing data collection steps according to accepted and standardised operating procedures developed for tree enumerations. This data set is regarded as the “ground truth” and comprises the target feature (i.e. actual DBH measurements) later used for modelling purposes. The video files were processed in a structured manner in order to extract tree segment patterns from the corresponding imagery. Various ML models were then trained and tested on the basic input feature data file, producing a relative root mean squared error (RMSE%) of between 14.1% and 18.3% for the study (see the illustrative sketch after this table). The relative bias lies between 0.08% and 1.13%, indicating that the proposed workflow solution produces consistent predictions, but at an undesirable error rate (i.e. RMSE) deviation from the target output. Additionally, the suggested CV/ML workflow model is capable of generating a discernibly similar spatial representation upon visual inspection, when compared with the ground truth data set of tree coordinates captured during fieldwork. In the pursuit of precision forestry, the proposed predictive model developed for accurate tree measurements produces DBH estimations that approximate real-world values with a fair degree of accuracy. |
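For the timber stock assessment project above, this sketch mirrors the two core steps on a synthetic point cloud: K-means clustering to separate pile from terrain, followed by a hull-based volume estimate. A convex hull is used here as a simpler stand-in for the alpha shape described in the abstract.

```python
# Sketch: separate pile points from terrain, then estimate pile volume.
import numpy as np
from scipy.spatial import ConvexHull
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Synthetic stand-in cloud: flat terrain plus a raised "pile" region (m).
terrain = np.column_stack([rng.uniform(0, 20, 2000),
                           rng.uniform(0, 20, 2000),
                           rng.normal(0.0, 0.05, 2000)])
pile = np.column_stack([rng.uniform(8, 12, 800),
                        rng.uniform(8, 12, 800),
                        rng.uniform(0.2, 2.5, 800)])
cloud = np.vstack([terrain, pile])

# Cluster on height (real features would include neighbourhood statistics).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    cloud[:, 2].reshape(-1, 1))
pile_label = labels[np.argmax(cloud[:, 2])]   # the taller cluster is the pile
pile_pts = cloud[labels == pile_label]

# Convex hull volume as a crude stand-in for the alpha-shape volume.
print(f"estimated pile volume: {ConvexHull(pile_pts).volume:.1f} m^3")
```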
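For the DBH prediction project above, this sketch computes the two quoted evaluation metrics, relative RMSE (RMSE%) and relative bias, on fabricated observed and predicted values; both are expressed against the mean observed DBH, a common convention that the thesis may define slightly differently.

```python
# Sketch: relative RMSE (RMSE%) and relative bias for DBH predictions.
import numpy as np

dbh_obs = np.array([18.2, 21.5, 19.8, 24.1, 20.3])   # cm, fabricated truth
dbh_pred = np.array([17.1, 23.0, 19.2, 26.0, 19.5])  # cm, fabricated output

rmse = np.sqrt(np.mean((dbh_pred - dbh_obs) ** 2))
rmse_pct = 100 * rmse / dbh_obs.mean()                 # error vs mean DBH
bias_pct = 100 * np.mean(dbh_pred - dbh_obs) / dbh_obs.mean()
print(f"RMSE% = {rmse_pct:.1f}%, relative bias = {bias_pct:.2f}%")
```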