This article shows how to build a .NET Core tool that periodically scrapes articles from a website and sends them to your mailbox.
A tool that runs continuously needs logging. I use NLog for this, and its log archiving feature is very handy. HTTP requests can fail because of network problems, so I use Polly to retry them. HtmlAgilityPack parses the web pages, which requires some familiarity with XPath. The details are below:
Component | Purpose | GitHub |
---|---|---|
NLog | Logging | https://github.com/NLog/NLog |
Polly | Retrying failed HTTP requests | https://github.com/App-vNext/Polly |
HtmlAgilityPack | HTML parsing | https://github.com/zzzprojects/html-agility-pack |
MailKit | Sending email | https://github.com/jstedfast/MailKit |
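If any of these components is unfamiliar, its GitHub page has the documentation.
Reference article
https://www.jb51.net/article/112595.htm
Fetching and parsing the cnblogs home page
I use HttpWebRequest to make the HTTP requests. Here is the simple wrapper class I wrote: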
    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    namespace CnBlogSubscribeTool
    {
        /// <summary>
        /// Simple Http Request Class
        /// .NET Framework >= 4.0
        /// Author:stulzq
        /// CreatedTime:2017-12-12 15:54:47
        /// </summary>
        public class HttpUtil
        {
            static HttpUtil()
            {
                // Set connection limit, default limit is 2
                ServicePointManager.DefaultConnectionLimit = 1024;
            }

            /// <summary>Default timeout, 20 s</summary>
            public static int DefaultTimeout = 20000;

            /// <summary>Is auto redirect</summary>
            public static bool DefalutAllowAutoRedirect = true;

            /// <summary>Default encoding</summary>
            public static Encoding DefaultEncoding = Encoding.UTF8;

            /// <summary>Default UserAgent</summary>
            public static string DefaultUserAgent =
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36";

            /// <summary>Default Referer</summary>
            public static string DefaultReferer = "";

            /// <summary>
            /// HTTP GET request
            /// </summary>
            /// <param name="url">Internet Address</param>
            /// <returns>string</returns>
            public static string GetString(string url)
            {
                var stream = GetStream(url);
                string result;
                using (StreamReader sr = new StreamReader(stream))
                {
                    result = sr.ReadToEnd();
                }
                return result;
            }

            /// <summary>
            /// HTTP POST request
            /// </summary>
            /// <param name="url">Internet Address</param>
            /// <param name="postData">Post request data</param>
            /// <returns>string</returns>
            public static string PostString(string url, string postData)
            {
                var stream = PostStream(url, postData);
                string result;
                using (StreamReader sr = new StreamReader(stream))
                {
                    result = sr.ReadToEnd();
                }
                return result;
            }

            /// <summary>
            /// Create Response
            /// </summary>
            /// <param name="url"></param>
            /// <param name="post">Is post Request</param>
            /// <param name="postData">Post request data</param>
            /// <returns></returns>
            public static WebResponse CreateResponse(string url, bool post, string postData = "")
            {
                var httpWebRequest = WebRequest.CreateHttp(url);
                httpWebRequest.Timeout = DefaultTimeout;
                httpWebRequest.AllowAutoRedirect = DefalutAllowAutoRedirect;
                httpWebRequest.UserAgent = DefaultUserAgent;
                httpWebRequest.Referer = DefaultReferer;
                if (post)
                {
                    var data = DefaultEncoding.GetBytes(postData);
                    httpWebRequest.Method = "POST";
                    httpWebRequest.ContentType = "application/x-www-form-urlencoded;charset=utf-8";
                    httpWebRequest.ContentLength = data.Length;
                    using (var stream = httpWebRequest.GetRequestStream())
                    {
                        stream.Write(data, 0, data.Length);
                    }
                }

                try
                {
                    var response = httpWebRequest.GetResponse();
                    return response;
                }
                catch (Exception e)
                {
                    throw new Exception(string.Format("Request error,url:{0},IsPost:{1},Data:{2},Message:{3}", url, post, postData, e.Message), e);
                }
            }

            /// <summary>
            /// HTTP GET request
            /// </summary>
            /// <param name="url"></param>
            /// <returns>Response Stream</returns>
            public static Stream GetStream(string url)
            {
                var stream = CreateResponse(url, false).GetResponseStream();
                if (stream == null)
                {
                    throw new Exception("Response error,the response stream is null");
                }
                else
                {
                    return stream;
                }
            }

            /// <summary>
            /// HTTP POST request
            /// </summary>
            /// <param name="url"></param>
            /// <param name="postData">post data</param>
            /// <returns>Response Stream</returns>
            public static Stream PostStream(string url, string postData)
            {
                var stream = CreateResponse(url, true, postData).GetResponseStream();
                if (stream == null)
                {
                    throw new Exception("Response error,the response stream is null");
                }
                else
                {
                    return stream;
                }
            }
        }
    }
Fetching the home page

    string html = HttpUtil.GetString("https://www.cnblogs.com");
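Parsing the data
We now have the HTML, but how do we extract the information we need (article title, URL, summary, author and publish time)? This is where HtmlAgilityPack comes in: it is a library that parses web pages and lets you query them with XPath.
Load the HTML we fetched earlier:

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);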
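Inspecting the page's HTML shows that all the information for each article sits in a div with class post_item, and that the fields we want are in the div with class post_item_body. We start by selecting all of those post_item_body divs:

    // Select the body element of every article item
    var itemBodys = doc.DocumentNode.SelectNodes("//div[@class='post_item_body']");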
Looking further, the article title is in the a tag under the h4 tag inside the post_item_body div, the summary is in the p tag with class post_item_summary, and the publish time and author are in the div with class post_item_foot. With that worked out, we can extract the data we want:

    // requires: using System.Linq; using System.Text.RegularExpressions;
    // SelectNodes returns null when nothing matches, so fall back to an empty sequence
    foreach (var itemBody in itemBodys ?? Enumerable.Empty<HtmlNode>())
    {
        // Title element
        var titleElem = itemBody.SelectSingleNode("h4/a");
        // Title text
        var title = titleElem?.InnerText;
        // Article URL
        var url = titleElem?.Attributes["href"]?.Value;
        // Summary element
        var summaryElem = itemBody.SelectSingleNode("p[@class='post_item_summary']");
        // Summary text
        var summary = summaryElem?.InnerText.Replace("\r\n", "").Trim();
        // Footer element of the item
        var footElem = itemBody.SelectSingleNode("div[@class='post_item_foot']");
        // Author
        var author = footElem?.SelectSingleNode("a")?.InnerText;
        // Publish time; Regex.Match throws on a null input, so pass "" when the footer is missing
        var publishTime = Regex.Match(footElem?.InnerText ?? "", @"\d+-\d+-\d+ \d+:\d+").Value;

        Console.WriteLine($"Title: {title}");
        Console.WriteLine($"URL: {url}");
        Console.WriteLine($"Summary: {summary}");
        Console.WriteLine($"Author: {author}");
        Console.WriteLine($"Publish time: {publishTime}");
        Console.WriteLine("----------------------------------------");
    }
Running this prints the title, URL, summary, author and publish time of each article.
We have the information we wanted. Now we define a Blog class to hold it.

    public class Blog
    {
        /// <summary>
        /// Title
        /// </summary>
        public string Title { get; set; }

        /// <summary>
        /// Article URL
        /// </summary>
        public string Url { get; set; }

        /// <summary>
        /// Summary
        /// </summary>
        public string Summary { get; set; }

        /// <summary>
        /// Author
        /// </summary>
        public string Author { get; set; }

        /// <summary>
        /// Publish time
        /// </summary>
        public DateTime PublishTime { get; set; }
    }
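The parsing loop yields the publish time as a string, while `Blog.PublishTime` is a `DateTime`, so a conversion step is needed somewhere. The following is a minimal sketch of that step. It assumes the matched text has the form `yyyy-MM-dd HH:mm`, which the regex suggests but the source does not confirm; `PublishTimeParser` and `TryParse` are hypothetical names, not part of the original project.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not from the original project).
public static class PublishTimeParser
{
    // Assumed format of the matched text, e.g. "2017-12-14 09:05".
    private const string Format = "yyyy-MM-dd HH:mm";

    // Returns true and sets 'result' when 'text' matches the assumed format;
    // returns false (and leaves 'result' at its default) otherwise.
    public static bool TryParse(string text, out DateTime result)
    {
        return DateTime.TryParseExact(
            text ?? string.Empty,
            Format,
            CultureInfo.InvariantCulture,
            DateTimeStyles.None,
            out result);
    }
}
```

With a helper like this, the loop could fill `PublishTime` only when `PublishTimeParser.TryParse(publishTime, out var time)` succeeds, and skip or log the items that fail to parse.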
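Retrying failed HTTP requests
We use Polly to retry our HTTP requests when they fail, with the retry count set to 3.

    // Initialize the retry policy (it retries 3 times, despite the field name)
    _retryTwoTimesPolicy =
        Policy
            .Handle<Exception>()
            .Retry(3, (ex, count) =>
            {
                _logger.Error("Execution failed! Retry {0}", count);
                _logger.Error("Exception from {0}", ex.GetType().Name);
            });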
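Testing it shows that when an exception occurs, Polly retries up to three times. If all three retries fail as well, it gives up and the exception propagates to the caller.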
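To make these retry semantics concrete, here is a hand-rolled sketch that does not use Polly. It only mimics the behaviour described above: one initial attempt, then up to `retryCount` retries with a callback before each retry, and the last exception rethrown once the retries are used up. `RetrySketch` and its parameters are hypothetical names chosen for illustration.

```csharp
using System;

// Hypothetical stand-in that imitates a "retry N times" policy without Polly.
public static class RetrySketch
{
    // Runs 'action'. On failure, calls 'onRetry(exception, retryNumber)' and tries
    // again, at most 'retryCount' times. The final failure is not caught.
    public static T Execute<T>(Func<T> action, int retryCount, Action<Exception, int> onRetry)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return action();
            }
            catch (Exception ex) when (attempt < retryCount)
            {
                // The filter is false on the last allowed attempt,
                // so that exception propagates to the caller.
                onRetry(ex, attempt + 1);
            }
        }
    }
}
```

In the real tool the policy itself would wrap the call, along the lines of `_retryTwoTimesPolicy.Execute(() => HttpUtil.GetString(url))`.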
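Sending email
MailKit handles the email sending. It supports IMAP, POP3 and SMTP and is cross-platform, which makes it an excellent choice. Below is a helper class I wrote, based on what fellow cnblogs users have shared: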
    using System.Collections.Generic;
    using CnBlogSubscribeTool.Config;
    using MailKit.Net.Smtp;
    using MimeKit;

    namespace CnBlogSubscribeTool
    {
        /// <summary>
        /// send email
        /// </summary>
        public class MailUtil
        {
            private static bool SendMail(MimeMessage mailMessage, MailConfig config)
            {
                // Dispose the client even if sending throws; exceptions propagate to the caller
                using (var smtpClient = new SmtpClient())
                {
                    smtpClient.Timeout = 10 * 1000; // timeout in milliseconds
                    // Connect to the remote SMTP server (SecureSocketOptions.None means no TLS)
                    smtpClient.Connect(config.Host, config.Port, MailKit.Security.SecureSocketOptions.None);
                    smtpClient.Authenticate(config.Address, config.Password);
                    smtpClient.Send(mailMessage); // send the message
                    smtpClient.Disconnect(true);
                }
                return true;
            }

            /// <summary>
            /// Send an email
            /// </summary>
            /// <param name="config">Mail configuration</param>
            /// <param name="receives">Recipients</param>
            /// <param name="sender">Reply-to address</param>
            /// <param name="subject">Subject</param>
            /// <param name="body">HTML body</param>
            /// <param name="attachments">Attachment content</param>
            /// <param name="fileName">Attachment file name</param>
            /// <returns></returns>
            public static bool SendMail(MailConfig config, List<string> receives, string sender, string subject, string body, byte[] attachments = null, string fileName = "")
            {
                var fromMailAddress = new MailboxAddress(config.Name, config.Address);

                var mailMessage = new MimeMessage();
                mailMessage.From.Add(fromMailAddress);

                foreach (var add in receives)
                {
                    // Parse the address string into a mailbox
                    var toMailAddress = MailboxAddress.Parse(add);
                    mailMessage.To.Add(toMailAddress);
                }

                if (!string.IsNullOrEmpty(sender))
                {
                    var replyTo = new MailboxAddress(config.Name, sender);
                    mailMessage.ReplyTo.Add(replyTo);
                }

                var bodyBuilder = new BodyBuilder() { HtmlBody = body };

                // Attachment
                if (attachments != null)
                {
                    if (string.IsNullOrEmpty(fileName))
                    {
                        fileName = "未命名文件.txt";
                    }
                    var attachment = bodyBuilder.Attachments.Add(fileName, attachments);

                    // Work around garbled Chinese file names
                    var charset = "GB18030";
                    attachment.ContentType.Parameters.Clear();
                    attachment.ContentDisposition.Parameters.Clear();
                    attachment.ContentType.Parameters.Add(charset, "name", fileName);
                    attachment.ContentDisposition.Parameters.Add(charset, "filename", fileName);

                    // Work around file names longer than 41 characters
                    foreach (var param in attachment.ContentDisposition.Parameters)
                        param.EncodingMethod = ParameterEncodingMethod.Rfc2047;
                    foreach (var param in attachment.ContentType.Parameters)
                        param.EncodingMethod = ParameterEncodingMethod.Rfc2047;
                }

                mailMessage.Body = bodyBuilder.ToMessageBody();
                mailMessage.Subject = subject;

                return SendMail(mailMessage, config);
            }
        }
    }
A quick test is to send yourself a message and check that it arrives.
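`MailUtil` relies on a `MailConfig` type and an HTML body that the article does not show. The sketch below fills in both under stated assumptions: the `MailConfig` properties are inferred only from the members `MailUtil` reads (`Host`, `Port`, `Name`, `Address`, `Password`), and `MailBodySketch` is a hypothetical helper, not the author's code, that renders scraped articles as a simple HTML list.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text;

// Assumed shape of MailConfig, inferred from the members MailUtil reads.
public class MailConfig
{
    public string Host { get; set; }      // SMTP server host
    public int Port { get; set; }         // SMTP server port
    public string Name { get; set; }      // display name of the sender
    public string Address { get; set; }   // sender address, also used to log in
    public string Password { get; set; }  // SMTP password
}

// Hypothetical helper that renders scraped articles as an HTML mail body.
public static class MailBodySketch
{
    public static string Build(IEnumerable<(string Title, string Url, string Summary)> posts)
    {
        var sb = new StringBuilder();
        foreach (var post in posts)
        {
            // HTML-encode each value (assumed to be plain text) so it cannot break the markup.
            sb.Append("<p><a href=\"")
              .Append(WebUtility.HtmlEncode(post.Url))
              .Append("\">")
              .Append(WebUtility.HtmlEncode(post.Title))
              .Append("</a><br/>")
              .Append(WebUtility.HtmlEncode(post.Summary))
              .Append("</p>");
        }
        return sb.ToString();
    }
}
```

A call would then look something like `MailUtil.SendMail(config, new List<string> { "you@example.com" }, "", "Daily cnblogs digest", body)`, where the recipient address and subject are placeholders.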
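Notes
I won't go into detail here about scheduling the scraping and the mail sending, or about handling data when the program exits abnormally. If you are interested, see the source code (the GitHub link is at the end of the article).
Scraping is incremental. I don't use the RSS feed because it updates relatively slowly.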
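The article does not show how the incremental update is implemented. One plausible approach, sketched here purely as an assumption, is to remember the URLs that have already been collected and keep only the unseen ones from each scrape. `IncrementalCollector` and `TakeNew` are hypothetical names.

```csharp
using System.Collections.Generic;

// Hypothetical sketch: keeps track of article URLs seen so far.
public class IncrementalCollector
{
    private readonly HashSet<string> _seenUrls = new HashSet<string>();

    // Returns only the URLs that no earlier call has returned.
    public List<string> TakeNew(IEnumerable<string> scrapedUrls)
    {
        var fresh = new List<string>();
        foreach (var url in scrapedUrls)
        {
            // HashSet.Add returns false when the URL is already known.
            if (url != null && _seenUrls.Add(url))
            {
                fresh.Add(url);
            }
        }
        return fresh;
    }
}
```

Because the home page is scraped repeatedly, most items in each pass are duplicates; a set lookup keeps that check cheap.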
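How the complete program behaves at run time:
Every time an email is sent, the program resets the recorded time to 9:00 today. After each scrape it checks whether the current time minus the recorded time is at least 24 hours; if so, it sends the email and updates the recorded time.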
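The check described above can be sketched as two small functions. This is an illustration of the stated rule only; `SendScheduleSketch`, `ShouldSend` and `NextRecordTime` are hypothetical names, and how the recorded time is stored is not covered here.

```csharp
using System;

// Hypothetical helper illustrating the send-time check described above.
public static class SendScheduleSketch
{
    // True when at least 24 hours have passed since the recorded time.
    public static bool ShouldSend(DateTime now, DateTime recordTime)
    {
        return now - recordTime >= TimeSpan.FromHours(24);
    }

    // After sending, the recorded time becomes 09:00 on the current day.
    public static DateTime NextRecordTime(DateTime now)
    {
        return now.Date.AddHours(9);
    }
}
```

Because the recorded time is snapped back to 09:00 rather than set to the actual send time, the daily mail does not drift later from one day to the next.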
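About the received email:
In the original screenshot, the email subject is dated the 13th while the content is from the 14th. That is only because, for demonstration purposes, I copied today's data (the 14th) into the data for the 13th, so don't let it mislead you.
The email also carries an attachment to make collecting and organizing the articles easier.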