
Web Scraping: HTTP Requests and HTML Parsing (Scraping Zhihu)

1. Sending web requests

1.1  requests

  Use the get() method of the requests library to send a GET request. You will usually need to supply a "user-agent" request header and a login "cookie", among other parameters.

1.1.1  user-agent

  Log in to the site and copy the "user-agent" value into a text file.

1.1.2  cookie

  Log in to the site and copy the "cookie" value into a text file.
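If you would rather hand requests the cookies as a dict (via its cookies= parameter) than paste the raw header string, the copied value can be split apart. A minimal sketch; the helper name and the sample cookie values are my own placeholders:

```python
def cookie_str_to_dict(cookie_str):
    # Turn a raw browser cookie string "k1=v1; k2=v2" into a dict
    # suitable for requests.get(url, cookies=...).
    cookies = {}
    for pair in cookie_str.split(';'):
        pair = pair.strip()
        if not pair or '=' not in pair:
            continue
        name, value = pair.split('=', 1)
        cookies[name] = value
    return cookies


# placeholder values for illustration only
print(cookie_str_to_dict('_zap=abc; d_c0=xyz'))  # {'_zap': 'abc', 'd_c0': 'xyz'}
```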

1.1.3  Test code

import requests
from requests.exceptions import RequestException

headers = {
    'cookie': '',  # replace with your own cookie
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}


def get_page(url):
    try:
        html = requests.get(url, headers=headers, timeout=5)
        if html.status_code == 200:
            print('Request succeeded')
            return html.text
        else:   # this else branch is optional
            return None
    except RequestException:
        print('Request failed')


if __name__ == '__main__':
    input_url = 'https://www.zhihu.com/hot'
    get_page(input_url)

 

1.2  selenium

  Most sites can detect a selenium crawler through the value of window.navigator.webdriver, so the first job is to stop the site from recognizing the selenium-driven browser. As with requests, a selenium session also usually needs a "user-agent" request header and a login "cookie", among other parameters.

1.2.1  Removing the window.navigator.webdriver flag from Selenium

  Add the following code to your program (this applies to older versions of Chrome):

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time


option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
time.sleep(10)

1.2.2  user-agent

  Log in to the site, copy the "user-agent" value into a text file, then run the following code to add the request header:

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions


option = ChromeOptions()
option.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36')

1.2.3  cookie

   Selenium expects each cookie to be a dict with "name" and "value" keys plus their values. If the site's cookie is a single string, pasting it in directly will not meet selenium's format; use selenium's get_cookies() method instead to capture the login cookies:
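Alternatively, if all you have is the raw cookie string copied from the browser, it can be reshaped into the name/value dicts that add_cookie() expects. A sketch under that assumption; the helper name and sample values are mine:

```python
def cookie_str_to_selenium(cookie_str):
    # Split "k1=v1; k2=v2" into [{'name': ..., 'value': ...}, ...],
    # the shape accepted by driver.add_cookie() (one dict per call).
    cookies = []
    for pair in cookie_str.split(';'):
        pair = pair.strip()
        if not pair or '=' not in pair:
            continue
        name, value = pair.split('=', 1)
        cookies.append({'name': name, 'value': value})
    return cookies


# placeholder values for illustration only
print(cookie_str_to_selenium('_zap=abc; d_c0=xyz'))
```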

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time
import json

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36')
driver = Chrome(options=option)
time.sleep(10)

driver.get('https://www.zhihu.com/signin?next=%2F')
time.sleep(30)  # log in manually during this pause
driver.get('https://www.zhihu.com/')
cookies = driver.get_cookies()
jsonCookies = json.dumps(cookies)

with open('cookies.txt', 'a') as f:  # choose your own file name and location
    f.write(jsonCookies)
    f.write('\n')
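To reuse the saved login later, the JSON line written above can be read back and replayed through add_cookie(). A sketch assuming the cookies.txt format produced above; load_cookies is my own helper name:

```python
import json


def load_cookies(path):
    # cookies.txt holds one json.dumps(driver.get_cookies()) dump per line;
    # parse the most recent non-empty line back into a list of cookie dicts.
    with open(path) as f:
        lines = [line for line in f if line.strip()]
    return json.loads(lines[-1])


# Usage with an open driver session (not run here):
# for c in load_cookies('cookies.txt'):
#     driver.add_cookie({'name': c['name'], 'value': c['value']})
```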

 

1.2.4  Test code example

  Copy the cookie captured above into the program below and it is ready to run:

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
time.sleep(10)

driver.get('https://www.zhihu.com')
time.sleep(10)

driver.delete_all_cookies()   # clear the cookies set so far
time.sleep(2)

cookie = {}  # replace with your own cookie ({'name': ..., 'value': ...})
driver.add_cookie(cookie)
driver.get('https://www.zhihu.com/')
time.sleep(5)
# note: Selenium 4 replaced this call with driver.find_elements(By.CSS_SELECTOR, ...)
for i in driver.find_elements_by_css_selector('div[itemprop="zhihu:question"] > a'):
    print(i.text)

2. HTML parsing (locating elements)

  To scrape the target data you first have to locate the elements that hold it; both BeautifulSoup and selenium make it easy to traverse the HTML.

2.1  Locating elements with BeautifulSoup

  In the code below, BeautifulSoup first locates the "h2" tags whose class is "HotItem-title", then extracts the string value with the .get_text() method:

import requests
from requests.exceptions import RequestException
from bs4 import BeautifulSoup

headers = {
    'cookie': '',  # replace with your own cookie
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}


def get_page(url):
    try:
        html = requests.get(url, headers=headers, timeout=5)
        if html.status_code == 200:
            print('Request succeeded')
            return html.text
        else:   # this else branch is optional
            return None
    except RequestException:
        print('Request failed')


def parse_page(html):
    html = BeautifulSoup(html, 'html.parser')
    titles = html.find_all('h2', {'class': 'HotItem-title'})[:10]
    for title in titles:
        print(title.get_text())


if __name__ == '__main__':
    input_url = 'https://www.zhihu.com/hot'
    parse_page(get_page(input_url))
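The same lookup can also be written as a CSS selector with BeautifulSoup's select() method. A small self-contained illustration on made-up markup (the sample HTML is a stand-in, not the real Zhihu page):

```python
from bs4 import BeautifulSoup

# miniature stand-in for the hot-list page; the real page has many
# <h2 class="HotItem-title"> entries
sample = '''
<div><h2 class="HotItem-title">First question</h2></div>
<div><h2 class="HotItem-title">Second question</h2></div>
'''

soup = BeautifulSoup(sample, 'html.parser')
for h2 in soup.select('h2.HotItem-title')[:10]:
    print(h2.get_text())
```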

 

2.2  Locating elements with selenium

  Selenium's element-location syntax is somewhat different from the requests/BeautifulSoup approach. The code example below (the test code from 1.2.4) uses a hierarchical locator, 'div[itemprop="zhihu:question"] > a', which I find more dependable.

  Note that selenium reads an element's text through the .text attribute, as opposed to the .get_text() method call used with BeautifulSoup.

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
time.sleep(10)

driver.get('https://www.zhihu.com')
time.sleep(10)

driver.delete_all_cookies()   # clear the cookies set so far
time.sleep(2)

cookie = {}  # replace with your own cookie ({'name': ..., 'value': ...})
driver.add_cookie(cookie)
driver.get('https://www.zhihu.com/')
time.sleep(5)
# note: Selenium 4 replaced this call with driver.find_elements(By.CSS_SELECTOR, ...)
for i in driver.find_elements_by_css_selector('div[itemprop="zhihu:question"] > a'):
    print(i.text)
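For comparison, the same hierarchical selector also works in BeautifulSoup's select(), where the text comes back via .get_text() rather than a .text attribute read on a selenium element. A sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# made-up single entry; not real Zhihu markup
sample = '<div itemprop="zhihu:question"><a href="#">Some question</a></div>'

soup = BeautifulSoup(sample, 'html.parser')
for a in soup.select('div[itemprop="zhihu:question"] > a'):
    print(a.get_text())
```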


Original article: https://www.cnblogs.com/ydkh/p/14779012.html
