A Pipeline of Augmentation and Sequence Embedding for Classification of Imbalanced Network Traffic

Abstract

Network Traffic Classification (NTC) is one of the most important tasks in network management. The imbalanced nature of traffic classes on the internet presents a critical challenge for classification: some application classes, such as HTTP, are much more prevalent than others. As a result, machine learning classifiers perform poorly on the classes with fewer samples. To address this problem, we propose a pipeline that balances the dataset and classifies it using a robust and accurate embedding technique. First, we generate artificial data using Long Short-Term Memory (LSTM) networks and Kernel Density Estimation (KDE). Next, we propose replacing the one-hot encoding of categorical features with a novel embedding framework based on the "Flow as a Sentence" perspective, which we name FS-Embedding. This framework treats the source and destination ports, together with the packet's direction, as one word in a flow, and trains an embedding vector space over these words while learning the classification task. Finally, we compare our pipeline against a Convolutional Recurrent Neural Network (CRNN) and Transformers trained on both imbalanced and sampled datasets, as well as against the one-hot encoding approach. We demonstrate that the proposed augmentation pipeline, combined with FS-Embedding, increases convergence speed and significantly reduces the number of model parameters, while maintaining the same accuracy.
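
To make the "Flow as a Sentence" idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each packet is a dictionary with hypothetical keys `src_port`, `dst_port`, and `direction`, joins those three fields into one word, and looks each word up in an embedding table. All function names and the packet layout are illustrative assumptions. The table here is randomly initialised, whereas the abstract states that the embedding space is trained jointly with the classifier.

```python
import random

def flow_to_words(packets):
    """Turn each packet into one 'word': src port, dst port, direction."""
    return [f"{p['src_port']}_{p['dst_port']}_{p['direction']}" for p in packets]

def build_vocab(flows):
    """Assign an integer index to every distinct word; index 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for flow in flows:
        for word in flow_to_words(flow):
            vocab.setdefault(word, len(vocab))
    return vocab

def init_embeddings(vocab_size, dim, seed=0):
    """Random, untrained embedding table (assumption: a real model would learn these rows)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def embed_flow(flow, vocab, table):
    """Map a flow (a 'sentence') to a sequence of embedding vectors, one per packet."""
    return [table[vocab.get(word, 0)] for word in flow_to_words(flow)]

# Two toy flows with made-up port numbers and directions.
flows = [
    [{"src_port": 51000, "dst_port": 443, "direction": "out"},
     {"src_port": 443, "dst_port": 51000, "direction": "in"}],
    [{"src_port": 51000, "dst_port": 443, "direction": "out"}],
]
vocab = build_vocab(flows)
table = init_embeddings(len(vocab), dim=4)
vectors = embed_flow(flows[0], vocab, table)
```

In this sketch each packet contributes a dense vector of length `dim` rather than a one-hot vector whose length equals the vocabulary size, which is the kind of substitution the abstract describes.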

@article{shokri2025_2502.18909,
  title={A Pipeline of Augmentation and Sequence Embedding for Classification of Imbalanced Network Traffic},
  author={Matin Shokri and Ramin Hasibi},
  journal={arXiv preprint arXiv:2502.18909},
  year={2025}
}