An Intelligent Email Response System (IERS)

Luo, Zili
Natural Language Processing, Email Classification, Question Detection, Response Generation, Email Response System
Email is one of the most common methods of official and personal communication for exchanging information. For an administration department, handling hundreds of emails containing the same types of inquiries or requests creates a huge operational overhead. In this work, we propose an intelligent email response system that performs three main steps: (i) automatically classify emails, (ii) detect questions or requests for information, and (iii) generate a response from an FAQ database or a knowledge base. In the first stage, we explore email classification models. An email classification system should understand the topics in the email content in order to categorize emails and indicate whether an incoming email should be handled by the mailbox owner. Email categorization based on topics is a multi-label classification task, whereas most existing email categorization models perform binary classification to identify spam, phishing, or malware attacks. We propose a CNN-BiLSTM model for multi-class email classification. Our experiments show that the CNN-BiLSTM model (83.33%) performs much better than the CNN (76.19%) and BiLSTM (61.9%) models. The second stage in our pipeline addresses question detection. We explore applying machine learning (ML) approaches to the International Computer Science Institute (ICSI) Meeting Recorder Dialogue Act (MRDA) corpus. On the dataset with question marks retained, the Random Forest (RF) and CNN models achieve F1-scores of 0.99 and 1.00 respectively, while the rule-based model gives a worse result (0.89 F1-score). If the question marks are removed, the F1-score decreases to 0.85 for both the RF and CNN models. Moreover, parse tree graph embeddings with FEATHER and GL2Vec fail to detect questions. Finally, TF-IDF weighted GloVe embeddings give worse results compared to plain GloVe embeddings for both the RF and CNN models. In response generation, we use the SOTA BART model on the ELI5 dataset as the baseline.
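The rule-based question-detection baseline compared above can be illustrated with a minimal sketch. The thesis does not specify its exact rules, so the cue words and logic below are assumptions: a sentence is flagged as a question if it ends with a question mark or opens with a common interrogative or auxiliary word.

```python
# Hypothetical rule-based question detector: a simplified stand-in for the
# rule-based baseline, NOT the exact rule set used in the thesis.
WH_WORDS = {"who", "what", "when", "where", "why", "how", "which",
            "do", "does", "did", "is", "are", "can", "could",
            "would", "will", "should", "may"}

def is_question(sentence: str) -> bool:
    """Return True if the sentence looks like a question."""
    text = sentence.strip()
    if not text:
        return False
    # Rule 1: explicit question mark at the end.
    if text.endswith("?"):
        return True
    # Rule 2: opens with an interrogative or auxiliary cue word.
    first = text.lower().split()[0].strip(".,!")
    return first in WH_WORDS
```

Rule 1 explains why removing question marks hurts detection so sharply: without that cue, only the weaker word-order heuristic (Rule 2) remains.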
We find that using complete sentences and including a feature-selection step provides better results in response generation. Using complete sentences gives a 0.2929 F1-score, compared to the baseline's 0.2729. Moreover, using a pre-trained extractive QA model for feature selection gives the best performance (0.2932 F1-score).
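The TF-IDF weighted embedding averaging compared in the question-detection stage can be sketched as follows. This is a toy illustration under stated assumptions: real GloVe vectors are 50-300 dimensional, while the placeholder vectors in the test are 2-dimensional, and the smoothed-IDF formula here is one common variant, not necessarily the one used in the thesis.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Per-document TF-IDF weight for each token (smoothed IDF variant)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            t: (tf[t] / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t in tf
        })
    return weights

def weighted_sentence_vector(doc, vectors, weights):
    """Average the word vectors of `doc`, each scaled by its TF-IDF weight."""
    dim = len(next(iter(vectors.values())))
    acc = [0.0] * dim
    total = 0.0
    for tok in doc:
        if tok in vectors:
            w = weights[tok]
            total += w
            for i, v in enumerate(vectors[tok]):
                acc[i] += w * v
    return [a / total for a in acc] if total else acc
```

With unweighted averaging, every word contributes equally; TF-IDF weighting instead down-weights frequent, low-information words, which the experiments above found to hurt rather than help question detection on this corpus.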