Problem description
I am parsing emails. When I see a reply to an email, I would like to remove the quoted text so that I can append the new text to the previous email (even if it's a reply).
Typically, you will see:
1st email (start of conversation)
This is the first email
2nd email (reply to first)
This is the second email
Tim said:
This is the first email
The output for this would be only "This is the second email". Different email clients quote text differently, so a method that recovers mostly just the new email text would also be acceptable.
Recommended answer
I use the following regexes to match the lead-in for quoted text (the last one is the one that counts):
import java.util.regex.Pattern;

/** general spacers for time and date */
private static final String spacers = "[\\s,/\\.\\-]";
/** matches times */
private static final String timePattern = "(?:[0-2])?[0-9]:[0-5][0-9](?::[0-5][0-9])?(?:(?:\\s)?[AP]M)?";
/** matches day of the week */
private static final String dayPattern = "(?:(?:Mon(?:day)?)|(?:Tue(?:sday)?)|(?:Wed(?:nesday)?)|(?:Thu(?:rsday)?)|(?:Fri(?:day)?)|(?:Sat(?:urday)?)|(?:Sun(?:day)?))";
/** matches day of the month (number and st, nd, rd, th) */
private static final String dayOfMonthPattern = "[0-3]?[0-9]" + spacers + "*(?:(?:th)|(?:st)|(?:nd)|(?:rd))?";
/** matches months (numeric and text) */
private static final String monthPattern = "(?:(?:Jan(?:uary)?)|(?:Feb(?:ruary)?)|(?:Mar(?:ch)?)|(?:Apr(?:il)?)|(?:May)|(?:Jun(?:e)?)|(?:Jul(?:y)?)" +
"|(?:Aug(?:ust)?)|(?:Sep(?:tember)?)|(?:Oct(?:ober)?)|(?:Nov(?:ember)?)|(?:Dec(?:ember)?)|(?:[0-1]?[0-9]))";
/** matches years (only 1000's and 2000's, because we are matching emails) */
private static final String yearPattern = "(?:[1-2]?[0-9])[0-9][0-9]";
/** matches a full date */
private static final String datePattern = "(?:" + dayPattern + spacers + "+)?(?:(?:" + dayOfMonthPattern + spacers + "+" + monthPattern + ")|" +
"(?:" + monthPattern + spacers + "+" + dayOfMonthPattern + "))" +
spacers + "+" + yearPattern;
/** matches a date and time combo (in either order);
* the outer group keeps the alternation from leaking into patterns built on top of this one
*/
private static final String dateTimePattern = "(?:(?:" + datePattern + "[\\s,]*(?:(?:at)|(?:@))?\\s*" + timePattern + ")|" +
"(?:" + timePattern + "[\\s,]*(?:on)?\\s*" + datePattern + "))";
/** matches a leading line such as
* ----Original Message----
* or simply
* ------------------------
*/
private static final String leadInLine = "-+\\s*(?:Original(?:\\sMessage)?)?\\s*-+\n";
/** matches a header line indicating the date */
private static final String dateLine = "(?:(?:date)|(?:sent)|(?:time)):\\s*" + dateTimePattern + ".*\n";
/** matches a subject or address line */
private static final String subjectOrAddressLine = "(?:(?:from)|(?:subject)|(?:b?cc)|(?:to)):.*\n";
/** matches gmail style quoted text beginning, i.e.
* On Mon Jun 7, 2010 at 8:50 PM, Simon wrote:
*/
private static final String gmailQuotedTextBeginning = "(On\\s+" + dateTimePattern + ".*wrote:\n)";
/** matches the start of a quoted section of an email */
private static final Pattern QUOTED_TEXT_BEGINNING = Pattern.compile("(?i)(?:(?:" + leadInLine + ")?" +
"(?:(?:" +subjectOrAddressLine + ")|(?:" + dateLine + ")){2,6})|(?:" +
gmailQuotedTextBeginning + ")"
);
I know that in some ways this is overkill (and might be slow!), but it works pretty well. Please let me know if you find anything that this doesn't match so I can improve it!
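The answer shows the pattern but not how to apply it. One plausible way to use a "start of quoted text" pattern is to find its first match and keep only what comes before it. The sketch below assumes exactly that. To stay short and self-contained it uses a much simpler stand-in pattern (`QUOTE_START`, a hypothetical name) that only recognizes "On ... wrote:" lines and "-----Original Message-----" lines; the class name `QuotedTextStripper` and the method `stripQuotedText` are likewise illustrative, not part of the original answer. In real use, `QUOTED_TEXT_BEGINNING` from above would presumably take the place of `QUOTE_START`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedTextStripper {

    // Simplified stand-in for QUOTED_TEXT_BEGINNING (an assumption for this sketch):
    // a whole line of the form "On ... wrote:" or "-----Original Message-----".
    // (?i) = case-insensitive, (?m) = ^ and $ match at line boundaries.
    private static final Pattern QUOTE_START = Pattern.compile(
            "(?im)^(?:On\\s.+wrote:|-+\\s*Original Message\\s*-+)\\s*$");

    /**
     * Returns the text that precedes the first quote lead-in,
     * or the whole (trimmed) body if no lead-in is found.
     */
    public static String stripQuotedText(String body) {
        Matcher m = QUOTE_START.matcher(body);
        if (m.find()) {
            return body.substring(0, m.start()).trim();
        }
        return body.trim();
    }

    public static void main(String[] args) {
        String reply = "This is the second email\n\n"
                + "On Mon Jun 7, 2010 at 8:50 PM, Simon wrote:\n"
                + "> This is the first email\n";
        System.out.println(stripQuotedText(reply));
    }
}
```

Cutting at the first match means everything after the lead-in is dropped, including any text the sender typed below the quote, so this only suits top-posted replies. A lead-in such as "Tim said:" from the question is not covered by this stand-in and would need its own alternative in the pattern.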