Crawlers run into many problems, some simple and some complex. Here I only cover a few of the simple ones, such as exception handling and coping with anti-crawler measures.
Exception Handling
The previous chapter already contained some exception handling; here are a few more common failure cases:
- The page is not found, or fetching it raises an error
- The server cannot be found
- The fetched page contains nothing we want
The third case was touched on last time, so below we only look at how to handle the first two.
Page not found or fetch error
In these cases the server returns an HTTP error, perhaps "404 Page Not Found" or "500 Internal Server Error". The raise_for_status() method raises an exception for such responses.
Run the following code and watch what happens: when there is an error, a message is printed instead of the program crashing.
import requests
from requests.exceptions import HTTPError

try:
    r = requests.get('http://httpbin.org/status/200')
    r.raise_for_status()
except HTTPError:
    print('Could not download page')
else:
    print(r.url, 'downloaded successfully')

try:
    r = requests.get('http://httpbin.org/status/404')
    r.raise_for_status()
except HTTPError:
    print('Could not download', r.url)
else:
    print(r.url, 'downloaded successfully')
Server not found
What if the server itself cannot be found, for example because you typed a domain that does not exist, such as "www.meiyou.zhenmeiyou"? In that case requests raises a ConnectionError as soon as the lookup fails. A related problem is a server that never answers at all; for that you can set a timeout, after which the request raises an error:
r = requests.get('https://github.com', timeout=5)
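A minimal sketch of handling both failure modes together; ConnectionError and Timeout both come from requests.exceptions, and the dead domain is just a placeholder:

from requests.exceptions import ConnectionError, Timeout

try:
    r = requests.get('http://www.meiyou.zhenmeiyou', timeout=5)
except ConnectionError:
    print('Server not found: DNS lookup or connection failed')
except Timeout:
    print('Server did not respond within 5 seconds')
else:
    print(r.url, 'downloaded successfully')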
Dealing with anti-crawler measures
A lot more could be said about anti-crawler techniques; for typical cases it is enough to know the following two mechanisms:
One counts the request rate per User-Agent: if a single User-Agent exceeds a threshold, it gets blocked.
The other limits the request rate per IP: if a single IP exceeds a threshold, it gets blocked. This one can be worked around by pausing (time.sleep, used in the previous chapter) or by using proxies.
应对userAgent封锁
在requests的请求的头中存在着User-Agent,我们可以通过下图方法看到它:
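One quick way to look at it, assuming httpbin.org is reachable (its /headers endpoint simply echoes the request back):

import requests

r = requests.get('http://httpbin.org/headers')
print(r.request.headers['User-Agent'])  # something like 'python-requests/2.x.x'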
If a site throttles frequent requests from the same User-Agent, or rejects requests without one, we can build a pool of User-Agent strings and draw one at random for each request.
First, find such a collection online; there are plenty. Assign the strings to one long Python string (most entries omitted here to save space):
user_agents = '''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
...
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'''
print(user_agents.split('\n'))
Check the output to make sure it looks right. The code above splits the string on "\n", turning it into a list.
To use a random User-Agent each time, do the following:
import random
import requests

random.seed()
user_agent = random.choice(user_agents.split('\n'))
headers = {'User-Agent': user_agent}
response = requests.get(url, headers=headers)  # url is whatever page you are crawling
In the code above we first call random.seed() to reseed the random generator (with no argument it seeds from the system clock; strictly speaking this is optional, since the random module seeds itself on import), then pick one User-Agent with random.choice(), and finally pass it to requests.get() via the headers parameter.
With this in place, every request goes out under a randomly chosen User-Agent.
Dealing with IP blocking
Now for beating the mechanism that limits the request rate per IP. We have already lowered the rate by pausing with time.sleep, but that remedy cuts both ways: used heavily, it drags the crawl speed way down. One small refinement is to randomize the pause, as sketched below.
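A minimal sketch (my addition, not part of the original code): a jittered delay is harder to fingerprint than a fixed one, and the bounds let you trade speed against politeness:

import random
import time

time.sleep(random.uniform(1, 3))  # pause between 1 and 3 seconds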
A better option is to set up proxies and let other machines make the requests for us. For that we need a reasonably large proxy pool; we can use the proxies published at "http://haoip.cc/tiqu.htm". To always work with that site's latest proxies, we first scrape its proxy list.
The scraping itself is simple; the key is cleaning the data. The raw text is full of whitespace (spaces, newlines, and so on) that has to be discarded:
import requests
from bs4 import BeautifulSoup

def get_ips(url):
    ips = list()
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    # drop spaces, then split on newlines; empty entries are filtered out below
    raw_ips = soup.select("div.col-xs-12")[0].text.replace(' ', '').split('\n')
    for raw_ip in raw_ips:
        if raw_ip != '':
            ips.append(raw_ip)
    return ips
The result can then be used when making the connection:

random.seed()
ip = random.choice(ips)
proxy = {'http': ip}  # build a proxy dict
response = requests.get(url, proxies=proxy)  # fetch the response through the proxy
For convenience, we can also control when a proxy is used and what the retry rules are:
def get_response(url, use_proxy=True, num_retrials=4):
    """
    Fetch the response for url, with optional proxying and a retry limit.
    :param url: the URL to fetch
    :param use_proxy: whether to route the request through a proxy
    :param num_retrials: how many retries are left
    :return: the response
    """
    def random_ip():
        """
        get random ip from an ip list
        """
        def get_ips(url='http://haoip.cc/tiqu.htm'):
            """
            find the ip list shown in url
            """
            ips = list()
            wb_data = requests.get(url)
            soup = BeautifulSoup(wb_data.text, 'lxml')
            raw_ips = soup.select("div.col-xs-12")[0].text.replace(' ', '').split('\n')
            for raw_ip in raw_ips:
                if raw_ip != '':
                    ips.append(raw_ip)
            return ips
        ips = get_ips()
        random.seed()
        ip = random.choice(ips)
        return ip

    if not use_proxy:  # without a proxy
        try:
            response = requests.get(url)
            response.raise_for_status()
        except HTTPError:  # the request above failed
            if num_retrials > 0:  # retries left
                time.sleep(10)  # wait ten seconds
                print('Fetch failed; retrying in 10s,', num_retrials, 'attempts left')
                return get_response(url, False, num_retrials - 1)  # recurse with one fewer retry
            else:
                print('Switching to a proxy')
                return get_response(url, True)  # out of plain retries, try a proxy
        else:
            return response
    else:  # with a proxy
        ip = random_ip()
        proxy = {'http': ip}  # build a proxy dict
        try:
            response = requests.get(url, proxies=proxy)  # fetch through the proxy
        except HTTPError:
            if num_retrials > 0:
                time.sleep(10)
                print('Switching proxies; retrying in 10s,', num_retrials, 'attempts left')
                return get_response(url, True, num_retrials - 1)
            else:
                print('Proxies are failing too; dropping the proxy')
                return get_response(url, False)
        else:
            print('Current proxy:', proxy)
            return response

get_response('http://www.baidu.com', False)
The proxy-handling parts of the code above are heavily commented, so they should not be hard to follow.
Putting it all together
Having covered all the pieces, let's now combine them into one robust crawler that also stores its data in MongoDB:
import random
from bs4 import BeautifulSoup
import requests
import time
import pymongo
from multiprocessing import Pool
# MongoDB setup
client = pymongo.MongoClient('localhost', 27017)
tongcheng_db = client["tongcheng"]
products = tongcheng_db["products"]
# entry URL
start_url = 'http://sz.58.com/'
user_agents = '''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Firefox/52.0
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36
Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30
Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/56.0.2924.76 Chrome/56.0.2924.76 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
Mozilla/5.0 (Windows NT 6.1; rv:52.0) Gecko/20100101 Firefox/52.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:51.0) Gecko/20100101 Firefox/51.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36
Mozilla/5.0 (Windows NT 6.3; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:52.0) Gecko/20100101 Firefox/52.0
Mozilla/5.0 (iPad; CPU OS 10_2_1 like Mac OS X) AppleWebKit/602.4.6 (KHTML, like Gecko) Version/10.0 Mobile/14D27 Safari/602.1
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36
Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 OPR/43.0.2442.1144
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; Trident/5.0)
Mozilla/5.0 (iPhone; CPU iPhone OS 10_2_1 like Mac OS X) AppleWebKit/602.4.6 (KHTML, like Gecko) Version/10.0 Mobile/14D27 Safari/602.1
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/602.3.12 (KHTML, like Gecko) Version/10.0.2 Safari/602.3.12
Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0
Mozilla/5.0 (Windows NT 6.1; rv:51.0) Gecko/20100101 Firefox/51.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7 (KHTML, like Gecko) Version/9.1.2 Safari/601.7.7
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
Mozilla/5.0 (X11; CrOS x86_64 9000.91.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.110 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:51.0) Gecko/20100101 Firefox/51.0
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko
Mozilla/5.0 (X11; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0
Mozilla/5.0 (Windows NT 6.3; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Safari/602.1.50
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14
Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0
Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:51.0) Gecko/20100101 Firefox/51.0
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.100 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/53.0.2785.143 Chrome/53.0.2785.143 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'''
proxy_list = list()  # caches the proxy list so we don't hit the proxy site too often
def get_response(url, use_proxy=True, num_retrials=4, timeout=5):
    """
    Fetch the response for url, with optional proxying and a retry limit.
    :param url: the URL to fetch
    :param use_proxy: whether to route the request through a proxy
    :param num_retrials: how many retries are left
    :param timeout: seconds to wait for the server before giving up
    :return: the response
    """
    def random_ip():
        """
        get random ip from an ip list
        """
        def get_ips(url='http://haoip.cc/tiqu.htm', ips=proxy_list):
            """
            find the ip list shown in url
            """
            if len(proxy_list) != 0:  # proxy_list is already populated, just return it
                return ips
            try:
                response = requests.get(url, timeout=6)
                response.raise_for_status()
            except requests.HTTPError:
                print('The proxy-pool server returned an error')
            except requests.ReadTimeout:
                print('Check the proxy-pool address')
            else:
                soup = BeautifulSoup(response.text, 'lxml')
                raw_ips = soup.select("div.col-xs-12")[0].text.replace(' ', '').split('\n')
                for raw_ip in raw_ips:
                    if raw_ip == '':
                        continue
                    ips.append(raw_ip)
                return ips
        ips = get_ips()
        random.seed()
        if ips is None:
            raise ValueError('Proxy pool unavailable')
        ip = random.choice(ips)
        return ip

    def random_headers():
        """
        Build headers with a randomly chosen User-Agent.
        """
        random.seed()
        user_agent = random.choice(user_agents.split('\n'))
        headers = {'User-Agent': user_agent}
        return headers

    if not use_proxy:  # without a proxy
        try:
            response = requests.get(url, headers=random_headers(), timeout=timeout)
            response.raise_for_status()
        except requests.RequestException:  # the request above failed
            if num_retrials > 0:  # retries left
                time.sleep(3)  # wait three seconds
                print('Fetch failed; retrying in 3s,', num_retrials, 'attempts left')
                return get_response(url, False, num_retrials - 1)  # recurse with one fewer retry
            else:
                print('Switching to a proxy')
                return get_response(url, True)  # out of plain retries, try a proxy
        else:
            return response

    # using a proxy (every path above returns, so no else is needed)
    try:
        ip = random_ip()
    except ValueError as e:
        print(str(e))
    else:
        proxy = {'http': ip}  # build a proxy dict
        try:
            response = requests.get(url, headers=random_headers(), proxies=proxy, timeout=timeout)  # fetch through the proxy
        except requests.RequestException:
            if num_retrials > 0:
                time.sleep(10)
                print('Switching proxies; retrying in 10s,', num_retrials, 'attempts left')
                return get_response(url, True, num_retrials - 1)
            else:
                print('Proxies are failing too; dropping the proxy')
                return get_response(url, False)
        else:
            print('Current proxy:', proxy)
            return response
def get_categories(url):
    """
    Collect the category links on a page.
    :param url: the page URL
    :return: a set of category URLs
    """
    categories = set()  # a set, to avoid duplicate links
    r = get_response(url)
    soup = BeautifulSoup(r.text, 'lxml')
    links = soup.select('.colWrap em a')
    for link in links:
        cat_url = url + link.get('href').split('/')[1]  # build the link
        if cat_url == start_url:  # same as the entry URL, skip
            continue
        if cat_url.endswith(".shtml"):  # ends with .shtml, skip
            continue
        categories.add(cat_url)  # add to the set
    return categories  # return the set of category URLs
def get_products_of_all_pages(page):
    """
    Collect all product listings in one category and write them to the database.
    :param page: the URL of the category's first page
    """
    def get_products_of_one_page(url):
        """
        Collect the product listings on one page of the category and write them to the database.
        :param url: the page URL
        """
        r = get_response(url)
        soup = BeautifulSoup(r.text, 'lxml')
        if soup.find('div', {"class": "noinfotishi"}):  # no more listings
            raise ValueError("no more pages")
        if not soup.find("td", {"class": "t"}):  # not the kind of listing we want
            raise ValueError("skipping this kind of page")
        items = soup.findAll("td", {"class": "t"})
        for item in items:
            try:
                title = item.find("a").text
                price = item.find("span", {"class": "price"}).text
                location = item.find("span", {"class": "fl"}).text.replace('\n', '')
            except AttributeError:  # some field is missing, catch and skip
                print(url + " has a listing with missing fields")
                continue
            products.insert_one({"title": title,
                                 "price": price,
                                 "location": location})
            print("inserted")
        print("ok")
        time.sleep(1.5)

    page_num = 1
    while True:
        try:
            get_products_of_one_page(page + "/pn" + str(page_num))
        except ValueError as error:
            print(str(error))
            break
        page_num += 1
def start():
    categories = get_categories(start_url)
    print(categories)
    for cat in categories:
        get_products_of_all_pages(cat)

start()
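One loose end: Pool is imported above but never used. As a sketch only (my assumption about the intent, not the original design), the categories could be crawled in parallel with it; in real code each worker process should create its own MongoClient, since pymongo clients are not fork-safe:

def start_parallel():
    categories = get_categories(start_url)
    with Pool(processes=4) as pool:  # four worker processes; tune as needed
        pool.map(get_products_of_all_pages, list(categories))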
This crawler will very likely stop working on 58.com before long, but that's fine: all the basic ideas are here.