Once the goal is clear, the next step is deciding how to reach it. The idea behind a web crawler is simple: visit a page, grab the site's source code, and locate the target strings via XPath or by walking the HTML nodes.
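As a minimal illustration of those two approaches, here is a sketch run on an inline HTML fragment (the fragment is made up for demonstration; the real pages are fetched later with requests). It uses the limited XPath subset in Python's standard xml.etree.ElementTree alongside BeautifulSoup's node traversal:

```python
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

# A made-up fragment standing in for a page's source code.
page = '<html><body><div class="info"><h1> Demo Shop </h1></div></body></html>'

# XPath style: address the node by its path in the document tree.
# (ElementTree only supports a subset of XPath, enough for this.)
root = ET.fromstring(page)
by_xpath = root.find(".//div[@class='info']/h1").text.strip()

# HTML-node style: walk the parsed tree with BeautifulSoup.
soup = BeautifulSoup(page, 'html.parser')
by_node = soup.find('div', {'class': 'info'}).h1.string.strip()

print(by_xpath, by_node)  # both yield: Demo Shop
```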

Following the previous post, [Python][Tutorial] Web crawler in practice (part 1) -- parsing page elements, our crawling strategy is roughly:


  • Enter the search page > find the shop URLs > enter each shop page > extract the data
Breaking that flow down into steps closer to the crawler's logic:
  • Enter the search page
  • The results span multiple pages; vary a URL parameter to fetch n search pages in one run (n search pages)
  • Parse the shop URLs out of each search page, m shop URLs per page (m shop pages per search page)
  • Enter each shop URL and parse out the information we need (pages fetched in total = n * m)
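The page arithmetic above can be sketched directly. The URL pattern and parameter names are taken from the script below, while m here is just an assumed figure for illustration (the real count varies per page):

```python
n = 10  # number of search-result pages to fetch
search_pages = [
    'http://www.ipeen.com.tw/search/all/000/1-100-0-0/?p=%d&adkw=東區&so=commno' % (i + 1)
    for i in range(n)
]
m = 20  # assumed shop links per search page
print(len(search_pages), n * m)  # 10 search pages, 200 shop pages in total
```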
Without further ado, here is the code:
  ## import the required packages
  import requests
  from bs4 import BeautifulSoup
  import time
  from random import randint
  import sys
  from IPython.display import clear_output

  ## collect the shop URLs from the search pages (the phone numbers on
  ## the search pages are images, which are hard to scrape)
  links = ['http://www.ipeen.com.tw/search/all/000/1-100-0-0/?p=' + str(i+1) + '&adkw=東區&so=commno' for i in range(10)]
  shop_links = []
  for link in links:
      res = requests.get(link)
      soup = BeautifulSoup(res.text, 'html.parser')
      shop_table = soup.findAll('h3', {'class': 'name'})
      ## pull the URL out of the enclosing a tag
      for shop_link in shop_table:
          link = 'http://www.ipeen.com.tw' + [tag['href'] for tag in shop_link.findAll('a', {'href': True})][0]
          shop_links.append(link)
      ## nap a little to avoid getting blocked
      time.sleep(1)

  ## build the header row for the output file
  title = "shop,category,tel,addr,cost,rank,counts,share,collect"
  shop_list = open('shop_list.txt', 'w', encoding='utf-8')
  ## write the header first
  shop_list.write(title + "\n")

  for i in range(len(shop_links)):

      res = requests.get(shop_links[i])
      soup = BeautifulSoup(res.text, 'html.parser')
      header = soup.find('div', {'class': 'info'})

      shop = header.h1.string.strip()

      ## handle exceptions: not every page has every field
      try:
          category = header.find('p', {'class': 'cate i'}).a.string
      except Exception:
          category = ""

      try:
          tel = header.find('p', {'class': 'tel i'}).a.string.replace("-", "")
      except Exception:
          tel = ""

      try:
          addr = header.find('p', {'class': 'addr i'}).a.string.strip()
      except Exception:
          addr = ""

      try:
          cost = header.find('p', {'class': 'cost i'}).string.split()[1]
      except Exception:
          cost = ""

      try:
          rank = header.find('span', {'itemprop': 'average'}).string
      except Exception:
          rank = ""

      try:
          counts = header.find_all('em')[0].string.replace(',', '')
      except Exception:
          counts = ""

      try:
          share = header.find_all('em')[1].string.replace(',', '')
      except Exception:
          share = ""

      try:
          collect = header.find_all('em')[2].string.replace(',', '')
      except Exception:
          collect = ""

      ## join the fields with commas (there is probably a nicer way, but this will do)
      result = ",".join([shop, category, tel, addr, cost, rank, counts, share, collect])
      shop_list.write(result + "\n")

      ## sleep a random interval
      time.sleep(randint(1, 5))
      clear_output()
      print(i)
      sys.stdout.flush()

  shop_list.close()
   
The crawl flow this time is simple, but a few points still deserve attention:
  1. time.sleep: this run fetches n * m pages in total. Firing off a large number of requests in a short time eats the site's resources and can disrupt its operation, so a well-mannered crawler sets sleep intervals to avoid burdening the target host.
  2. try/except: when scraping many fields automatically, you must account for the fact that not every page provides every field you want. Wrap each extraction in exception handling so a missing field does not crash the program.
  3. More generally applicable XPath/selectors: if you only scrape one or two pages, any XPath will do; you can even locate tags simply by counting them. But across a large number of pages the node counts differ from page to page, so inspect the source of several pages and pin each field to a structurally stable position to avoid grabbing the wrong one.
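The repeated try/except blocks in the script can be factored into a small helper. This is just a sketch; the `safe` name is my own, not part of the original post:

```python
def safe(extract, default=""):
    """Run an extraction callable; return default if any step fails
    (missing tag, missing attribute, index out of range, ...)."""
    try:
        return extract()
    except Exception:
        return default

# Usage on a parsed page, with `header` as in the script above:
#   category = safe(lambda: header.find('p', {'class': 'cate i'}).a.string)
#   tel      = safe(lambda: header.find('p', {'class': 'tel i'}).a.string.replace('-', ''))
```

Passing a lambda delays the attribute lookups until they run inside the try block, so any failure along the chain falls back to the default.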
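Point 3 in action: the two snippets below are made up to stand in for two shop pages whose node counts differ. Positional indexing like find_all('p')[0] would grab a different tag on each page, while anchoring on the class that marks the field works on both:

```python
from bs4 import BeautifulSoup

# Two made-up fragments standing in for two shop pages.
page_a = '<div><p class="tel i"><a>02-1234-5678</a></p></div>'
page_b = ('<div><p class="note">opening hours</p>'
          '<p class="tel i"><a>02-8765-4321</a></p></div>')

tels = []
for page in (page_a, page_b):
    soup = BeautifulSoup(page, 'html.parser')
    # Fragile: soup.find_all('p')[0] points at a different tag on each page.
    # Robust: anchor on the class that marks the field on every page.
    tels.append(soup.find('p', {'class': 'tel i'}).a.string.replace('-', ''))

print(tels)  # ['0212345678', '0287654321']
```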

Source: Bryan's Marketing Research and Data Analysis Notes

