Web Scraping with Cheerio and Node.js: A Step-by-Step Guide - PIA S5 Proxy


Tina . 2024-08-19

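Web scraping has become an important way to collect data. By extracting content from web pages, you can gather market information, competitor data, and other useful material. This article shows how to scrape web pages with Cheerio and Node.js so that you can get up to speed quickly.

What are Cheerio and Node.js?

Node.js: a JavaScript runtime built on Chrome's V8 engine that lets developers write server-side programs in JavaScript. Node.js is well suited to I/O-bound tasks such as web scraping.

Cheerio: a fast, flexible, lean implementation of core jQuery for manipulating the DOM on the server. Cheerio makes it simple to parse and manipulate HTML documents in Node.js.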

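Setting up the environment

Before you start, make sure Node.js is installed on your computer. You can download the installer from the official Node.js website.

Create a project folder

Create a new folder in your working directory, for example web-scraping.

    mkdir web-scraping
    cd web-scraping

Initialize the Node.js project

Run the following command inside the project folder to initialize the project:

    npm init -y

Install the dependencies

Install axios (for sending HTTP requests) and cheerio (for parsing HTML):

    npm install axios cheerio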

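Basic usage

The basic steps for scraping a web page with Cheerio and Node.js are as follows.

1. Send an HTTP request

Use axios to send a GET request and retrieve the page content. Here is an example:

    const axios = require('axios');

    // Download the page and return its HTML, or null if the request fails.
    async function fetchData(url) {
      try {
        const response = await axios.get(url);
        return response.data;
      } catch (error) {
        console.error(`Error fetching data: ${error.message}`);
        return null;
      }
    }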

2. Parse the HTML content

Use Cheerio to parse the HTML you retrieved and extract the data you need. Here is a parsing example:

    const cheerio = require('cheerio');

    // Extract the text of every h2 element with the class "title".
    function parseData(html) {
      const $ = cheerio.load(html);
      const titles = [];
      $('h2.title').each((index, element) => {
        titles.push($(element).text().trim());
      });
      return titles;
    }

3. Putting it all together

Combine the two steps above into a complete scraping program:

    const axios = require('axios');
    const cheerio = require('cheerio');

    // Download the page and return its HTML, or null if the request fails.
    async function fetchData(url) {
      try {
        const response = await axios.get(url);
        return response.data;
      } catch (error) {
        console.error(`Error fetching data: ${error.message}`);
        return null;
      }
    }

    // Extract the text of every h2 element with the class "title".
    function parseData(html) {
      const $ = cheerio.load(html);
      const titles = [];
      $('h2.title').each((index, element) => {
        titles.push($(element).text().trim());
      });
      return titles;
    }

    (async () => {
      const url = 'https://example.com'; // replace with the target site
      const html = await fetchData(url);
      if (!html) return; // the request failed; the error was already logged
      const titles = parseData(html);
      console.log(titles);
    })();

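Running the program

Save the program as index.js, then run the following command in a terminal to start the scraper:

    node index.js

If you saved the file under a different name, use that name instead of index.js. The program prints the titles it scraped.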

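Things to keep in mind

Follow the site's crawling rules: before scraping, check the target site's robots.txt file and make sure you comply with its crawling policy.

Rate control: avoid sending a large number of requests in a short time, or the site may block you. You can use the setTimeout function to space out your requests.

Dynamic content: Cheerio only parses the HTML it is given and does not execute scripts. If the target site loads content dynamically with JavaScript, consider a tool such as Puppeteer.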
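The first two points can be sketched in plain JavaScript with no extra packages. This is a simplified illustration, not a complete robots.txt implementation: it only reads Disallow rules for "User-agent: *" and ignores Allow lines, wildcards, and crawler-specific groups. The function names parseDisallowRules, isAllowed, sleep, and crawlPolitely are hypothetical.

```javascript
// Collect the Disallow rules that apply to all crawlers ("User-agent: *").
function parseDisallowRules(robotsTxt) {
  const rules = [];
  let appliesToAll = false;
  let inAgentLines = false; // true while reading consecutive User-agent lines
  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const separator = line.indexOf(':');
    if (separator === -1) continue;
    const field = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      appliesToAll = inAgentLines ? appliesToAll || value === '*' : value === '*';
      inAgentLines = true;
    } else {
      inAgentLines = false;
      if (field === 'disallow' && appliesToAll && value !== '') {
        rules.push(value);
      }
    }
  }
  return rules;
}

// A path is allowed if it does not start with any disallowed prefix.
function isAllowed(path, rules) {
  return !rules.some((rule) => path.startsWith(rule));
}

// Promise wrapper around setTimeout so that it can be awaited.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch URLs one at a time, pausing delayMs between requests.
// fetchPage is any async function that takes a URL, such as fetchData.
async function crawlPolitely(urls, fetchPage, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // no pause before the first request
    results.push(await fetchPage(urls[i]));
  }
  return results;
}
```

A typical flow would be to download /robots.txt once, filter the URL list with isAllowed, and then pass the remaining URLs to crawlPolitely together with the fetch function from the earlier steps.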

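Using a proxy to improve scraping efficiency

Routing your requests through a proxy can make scraping more efficient and more secure. PIA S5 Proxy is a proxy service that offers the following advantages:

High anonymity: PIA S5 Proxy hides your real IP address, which lowers the risk of being blocked.

Fast and stable: efficient connection speeds and stable connections keep data collection running smoothly.

Flexible configuration: support for multiple proxy types to suit different scraping needs.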

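Web scraping with Cheerio and Node.js is a powerful and flexible technique for collecting useful data, and the steps in this guide are enough to get you started. Combining it with PIA S5 Proxy can further improve the security and efficiency of your scraping. We hope this article helps you succeed on your web scraping journey!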
