All ETDs from UAB

Advisory Committee Chair

Leon Jololian

Advisory Committee Members

Mohammad Haider

Karthikeyan Lingasubramanian

Murat M Tanik

Earl Wells

Document Type

Dissertation

Date of Award

2022

Degree Name by School

Doctor of Philosophy (PhD) School of Engineering

Abstract

Cybercrime has risen and now threatens ordinary internet users. The growing use of online social networks (OSNs) lets people easily share opinions, personal information, and other content. Because most OSN content is textual, a large body of research has focused on text analysis using machine learning and Natural Language Processing (NLP). One important branch of this research is forensic text analysis. Digital forensics is the discipline concerned with finding, preserving, and presenting evidence that is admissible in court. Unfortunately, the convenience of OSNs also makes them an ideal venue for cybercriminals to carry out malicious activities. Anonymous texts have been associated with suspicious activities; thus, deanonymization techniques have been a focal research interest in recent years. Forensic authorship profiling, or characterization, is one area that needs further investigation because it can help direct the course of a cybercrime investigation. Most authorship profiling techniques are based on machine learning and deep learning, and they use stylometric or statistical features to build their models. Several factors affect the quality of such techniques, e.g., dataset size and quality, preprocessing techniques, feature selection, and classification methods. Recently, a promising architecture known as the transformer has emerged in NLP and has made transfer learning effective. In transfer learning, a model trained on a general-domain dataset can be reapplied to similar or different specific tasks. Transfer learning is a relatively older technique in other fields such as computer vision, but it has recently been widely applied to many NLP tasks and has shown strong performance. As a result, we chose to examine the application of transfer learning techniques to profiling the age and gender of an author.
After an extensive review of authorship attribution and profiling research from the past ten years, we noticed gaps in the field of forensic authorship profiling that need to be addressed. The techniques currently proposed for authorship profiling have serious limitations in the quality and size of the datasets examined. Moreover, they face serious issues at larger scales. Another limitation we observed is that the proposed methods are mostly based on machine learning, which in turn depends on preprocessing techniques and feature engineering. In our study, we offer a thorough literature review that covers different methods along with their evaluations and limitations. Most machine learning and deep learning models go through the same phases of text preprocessing, feature extraction, feature selection, and model training; thus, we used transfer learning, a recently popular technique that is considered feature-independent, to profile anonymous authors by revealing their characteristics using data from the PAN authorship profiling tasks. In doing so, we examined the effect of the most commonly used text-preprocessing techniques on profiling the age and gender of anonymous authors, using transfer learning with BERT as an example. In another case study, we compared BERT, RoBERTa, and BERTweet for categorizing the age and gender of anonymous authors, using recommended values for the selected models’ hyperparameters, to understand how these values relate to the overall performance of each model. Experimentally, we tested the impact of text tokenization in transfer learning, using BERT’s tokenizer, WordPiece, as an example, and examined how a well-known issue such as out-of-vocabulary (OOV) words limits what BERT’s tokenizer can represent. As a result, we applied different techniques, such as text enrichment and dictionaries of missing words and emojis, to mitigate the effect of text misrepresentation.
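The out-of-vocabulary issue mentioned above can be illustrated with a minimal sketch of greedy longest-match-first subword splitting, the matching strategy WordPiece uses at tokenization time. The toy vocabulary, the `tokenize` helper, and the replacement dictionary below are hypothetical and chosen only for illustration; they are not the dissertation's vocabulary, dictionaries, or code.

```python
# Toy vocabulary (an assumption for illustration); real BERT vocabularies are far larger.
TOY_VOCAB = {"[UNK]", "play", "##ing", "##ed", "game", "##s", "the", "good"}


def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word collapses to the unknown token.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


def tokenize(text, vocab, replacements=None):
    """Whitespace-split, optionally map words via a dictionary, then split into subwords.

    `replacements` is a hypothetical stand-in for a missing-word or emoji dictionary.
    """
    replacements = replacements or {}
    tokens = []
    for word in text.lower().split():
        word = replacements.get(word, word)
        for part in word.split():  # a replacement may expand to several words
            tokens.extend(wordpiece_tokenize(part, vocab))
    return tokens


print(tokenize("playing games", TOY_VOCAB))
print(tokenize("the game 😀", TOY_VOCAB))
print(tokenize("the game 😀", TOY_VOCAB, replacements={"😀": "good"}))
```

In this sketch the emoji is absent from the vocabulary, so it is reduced to `[UNK]` and its meaning is lost to the model; mapping it to an in-vocabulary word before tokenization keeps a usable token. This is one simple way a dictionary-based mitigation could work, under the assumptions stated above.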

Included in

Engineering Commons
