A Multi-Source Machine Learning Framework for Phishing E-Mail Detection Using TF-IDF Features

Dilek A. E., TONKAL Ö.

18th International Conference on Information Security and Cryptology, ISCTurkiye 2025, Ankara, Türkiye, 22-23 October 2025 (Full-Text Paper)

Publication Type: Conference Paper / Full-Text Paper
DOI: 10.1109/isctrkiye68593.2025.11224854
City of Publication: Ankara
Country of Publication: Türkiye
Keywords: cybersecurity, e-mail classification, machine learning, natural language processing, phishing detection
Affiliated with Samsun Üniversitesi: Yes

Abstract

Phishing attacks delivered via email remain a significant cybersecurity problem for both individuals and organizations. This study presents a classification model that uses traditional machine learning algorithms to improve the detection accuracy of phishing emails. Eleven established datasets, including Enron, SpamAssassin, Nazario, and TREC, were combined into a single heterogeneous dataset of approximately 200,000 labeled email samples, each labeled as either phishing or legitimate. The text of each sample was formed by combining the email's subject and body. The text was extensively preprocessed and then converted into features using the Term Frequency-Inverse Document Frequency (TF-IDF) method. Five machine learning models were evaluated: Support Vector Machines (LinearSVC), Random Forest, XGBoost, Multinomial Naive Bayes, and Logistic Regression. The Logistic Regression model achieved a classification accuracy of 97.7%. The proposed phishing detection framework, implemented in Python, was validated through both theoretical evaluation and empirical experiments, which support its effectiveness in improving email security.
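To make the feature step concrete, the following is a minimal sketch of combining subject and body and computing TF-IDF weights, using only the Python standard library. It is not the authors' implementation: the function names (`combine_fields`, `tokenize`, `tfidf_vectors`), the tokenization rule, the toy emails, and the particular smoothed IDF formula are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import re
from collections import Counter


def combine_fields(subject, body):
    """Join subject and body into one text, as the abstract describes."""
    return f"{subject} {body}".strip()


def tokenize(text):
    """Lowercase and keep alphanumeric runs (an assumed, minimal scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length,
    idf = ln((1 + N) / (1 + df)) + 1 (a common smoothed form).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty text
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy emails as (subject, body) pairs; not taken from the datasets.
emails = [
    ("Verify your account", "Click the link to verify your password now"),
    ("Meeting notes", "Attached are the notes from the project meeting"),
    ("Account suspended", "Your account is suspended, click to restore access"),
]
texts = [combine_fields(subject, body) for subject, body in emails]
vectors = tfidf_vectors(texts)
```

In this sketch, a term that occurs in fewer emails receives a larger weight than an equally frequent term that occurs in many, which is the property that lets a linear classifier such as Logistic Regression separate distinctive phishing vocabulary from common words. In practice, a library vectorizer would typically replace this hand-written version.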