Scrapy: Crawling Dangdang Book Data and Saving It to MySQL
1.1 Task
Become proficient with the serialized output of Item and Pipeline data in Scrapy.
Use the Scrapy + XPath + MySQL storage technology route to crawl book data from the Dangdang website. Candidate site: www.dangdang.com/
1.2 Approach
1.2.1 settings.py
In settings.py we need to: enable the request headers, add the database connection info, set ROBOTSTXT_OBEY to False, and enable the pipelines. A sketch of all four is shown below.
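Put together, settings.py might look like the following sketch. The HOSTNAME/PORT/DATABASE/USERNAME/PASSWORD keys match what pipelines.py reads later in this post; the User-Agent value, the pipeline path, and the connection values are placeholders:

```python
# settings.py (sketch; values are placeholders)
ROBOTSTXT_OBEY = False  # don't let robots.txt block the crawl

# enable the request headers (the User-Agent value is just an example)
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
}

# enable the pipeline (assumed module/class name -- adjust to your project)
ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,
}

# MySQL connection info read by the pipeline
HOSTNAME = '127.0.0.1'
PORT = 3306
DATABASE = 'spider_db'
USERNAME = 'root'
PASSWORD = 'your_password'
```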
1.2.2 items.py
Define the fields in items.py:
```python
import scrapy


class DangdangItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    publisher = scrapy.Field()
    date = scrapy.Field()
    price = scrapy.Field()
    detail = scrapy.Field()
```
1.2.3 db_Spider.py
Inspect the site and look at how pagination works. Comparing the URLs of the second and third result pages, it is easy to see that page_index is the pagination parameter.
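With that, the spider can generate page URLs from a template. Here is a minimal skeleton for db_Spider.py, assuming Dangdang's search URL format and a hypothetical keyword; the class name, keyword, and URL template are illustrative, and only the parse method shown next comes from the post:

```python
import scrapy

from dangdang.items import DangdangItem  # assumed project module name


class DbSpider(scrapy.Spider):
    name = 'db_Spider'
    keyword = 'python'  # hypothetical search keyword
    page_index = 1      # the pagination parameter found above
    total = 0           # running count of scraped items
    # assumed search URL template: %s = keyword, %d = page number
    next_url = 'http://search.dangdang.com/?key=%s&act=input&page_index=%d'

    def start_requests(self):
        yield scrapy.Request(self.next_url % (self.keyword, self.page_index))
```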
Grab the node information with XPath:
```python
def parse(self, response):
    lis = response.xpath('//*[@id="component_59"]')
    # pull each field list out of the result container
    titles = lis.xpath(".//p[1]/a/@title").extract()
    authors = lis.xpath(".//p[5]/span[1]/a[1]/text()").extract()
    publishers = lis.xpath('.//p[5]/span[3]/a/text()').extract()
    dates = lis.xpath(".//p[5]/span[2]/text()").extract()
    prices = lis.xpath('.//p[3]/span[1]/text()').extract()
    details = lis.xpath('.//p[2]/text()').extract()
    for title, author, publisher, date, price, detail in zip(
            titles, authors, publishers, dates, prices, details):
        item = DangdangItem(
            title=title,
            author=author,
            publisher=publisher,
            date=date,
            price=price,
            detail=detail,
        )
        self.total += 1
        print(self.total, item)
        yield item
    # request the next page
    self.page_index += 1
    yield scrapy.Request(self.next_url % (self.keyword, self.page_index),
                         callback=self.next_parse)
```
Cap the number of items crawled: the spider scrapes 102 items in total (see the sketch below).
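The parse method above hands pagination off to next_parse, which the post doesn't show. A minimal sketch, assuming the 102-item cap is enforced there:

```python
def next_parse(self, response):
    # stop paginating once the target of 102 items has been reached
    if self.total >= 102:
        return
    # later result pages share the first page's structure, so reuse parse()
    yield from self.parse(response)
```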
1.2.4 pipelines.py
Database connection. The original doesn't show where the `settings` object comes from; obtaining it via Scrapy's `get_project_settings()` (included in the snippet below) is one standard approach.
```python
import pymysql
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


def __init__(self):
    # get the host name, port and database info from settings.py
    host = settings['HOSTNAME']
    port = settings['PORT']
    dbname = settings['DATABASE']
    username = settings['USERNAME']
    password = settings['PASSWORD']
    self.conn = pymysql.connect(host=host, port=port, user=username,
                                password=password, database=dbname,
                                charset='utf8')
    self.cursor = self.conn.cursor()
```
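The post never closes the connection; Scrapy's standard close_spider hook is the natural place to do it:

```python
def close_spider(self, spider):
    # called once when the spider finishes; release MySQL resources
    self.cursor.close()
    self.conn.close()
```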
Insert the data:
```python
def process_item(self, item, spider):
    data = dict(item)
    sql = "INSERT INTO spider_dangdang(title, author, publisher, b_date, price, detail)" \
          " VALUES (%s, %s, %s, %s, %s, %s)"
    try:
        # execute the parameterized INSERT, then commit the transaction
        self.cursor.execute(sql, [data["title"], data["author"],
                                  data["publisher"], data["date"],
                                  data["price"], data["detail"]])
        self.conn.commit()
        print("insert succeeded")
    except Exception as err:
        print("insert failed", err)
    return item
```
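The post doesn't show the table's DDL. Here is a one-off setup script with an assumed schema: the column names match the INSERT above, the types are guesses, and id is AUTO_INCREMENT as the author notes below:

```python
import pymysql

# one-off setup: create the table the pipeline inserts into
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='your_password', database='spider_db',
                       charset='utf8')
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS spider_dangdang (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255),
            author VARCHAR(255),
            publisher VARCHAR(255),
            b_date VARCHAR(64),
            price VARCHAR(32),
            detail TEXT
        ) DEFAULT CHARSET = utf8
    """)
conn.commit()
conn.close()
```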
Checking the results: there are 102 rows in total. The id column is set to auto-increment, and since rows from earlier tests had already been inserted, the ids do not start from 1.
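To verify the count, a quick standalone query against the table (hypothetical check, reusing the placeholder connection values from the settings.py sketch):

```python
import pymysql

# quick sanity check: count the rows the pipeline inserted
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='your_password', database='spider_db',
                       charset='utf8')
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM spider_dangdang")
    print(cur.fetchone()[0])  # expected: 102
conn.close()
```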
Author: 小生凡一
Link: https://juejin.cn/post/7032204479403556878