Implementing a Lagou Crawler with requests and Selenium

Background

  • Next Monday I have an on-site coding interview at Tencent, and the interviewer said it will cover web crawling and data processing, so I figured I would write a crawler for practice. Since I have been job hunting on Lagou for the past few days anyway, I used Lagou as the practice target.

Environment

  • macOS 10.14.6 on MacBook Pro 2017
  • Selenium==3.11.0
  • Python 3.6.4
  • Chromedriver 70.0.3538.97

GitHub Repository

Implementation Reference

Notes

  • The requests version of the code is copied verbatim from the original author. The Selenium version only borrows the two lines that click the next-page button; the rest is copied from a crawler I wrote earlier. You can read the original article and compare it with my code below.

Selenium Version Code

# coding = 'utf-8'

from selenium.webdriver.chrome.webdriver import WebDriver as ChromeDriver
from user_agent import generate_user_agent
from os.path import join as os_join
from bs4 import BeautifulSoup
import os
import time
import pandas


class Lagou(object):
    CURR_PATH = os.path.abspath('.')
    CHROME_DRIVER_PATH: str = os_join(CURR_PATH, 'chromedriver_mac')

    def __init__(self, initial_url: str):
        from selenium.webdriver.chrome.options import Options as ChromeOptions
        self.initial_url: str = initial_url
        options: ChromeOptions = ChromeOptions()
        # Headless mode is disabled because it triggers an
        # "element is not clickable at point" error on the next-page button.
        # options.add_argument('headless')
        custom_ua: str = generate_user_agent(os='win')
        options.add_argument('user-agent=' + custom_ua)
        self.driver = ChromeDriver(executable_path=self.CHROME_DRIVER_PATH,
                                   chrome_options=options)

    def get_first_page_source(self):
        """Open the initial search URL and return its page source."""
        try:
            self.driver.implicitly_wait(5)
            self.driver.set_script_timeout(20)
            self.driver.set_page_load_timeout(25)
            self.driver.get(self.initial_url)
            page_source: str = self.driver.page_source
            return page_source
        except KeyboardInterrupt:
            self.driver.quit()
            print('KeyboardInterrupt detected and the chrome driver has quit.')
        except Exception as e:
            print(str(e))
            self.driver.quit()
            exit(1)

    def process_data(self, page_source):
        """Extract company name, job name and salary from the job list items."""
        soup = BeautifulSoup(page_source, 'lxml')
        company_list = soup.select('ul.item_con_list li')
        data_list = []
        for company in company_list:
            attrs = company.attrs
            company_name = attrs['data-company']
            job_name = attrs['data-positionname']
            job_salary = attrs['data-salary']
            data_list.append(company_name + ',' + job_name + ',' + job_salary)
        return data_list

    def get_next_page_source(self):
        """Click the next-page button and return the new page source."""
        try:
            self.driver.implicitly_wait(5)
            self.driver.set_script_timeout(20)
            self.driver.set_page_load_timeout(25)
            next_page = self.driver.find_elements_by_xpath("//span[@class='pager_next ']")
            next_page[0].click()
            page_source: str = self.driver.page_source
            return page_source
        except KeyboardInterrupt:
            self.driver.quit()
            print('KeyboardInterrupt detected and the chrome driver has quit.')
        except Exception as e:
            print(str(e))
            self.driver.quit()
            exit(1)

    @staticmethod
    def save_data(data, csv_header):
        # Unused alternative: dump a whole data list at once with pandas.
        table = pandas.DataFrame(data)
        table.to_csv(r'/Users/sharp/Desktop/LaGou.csv', header=csv_header, index=False, mode='a+')

    def save_data_into_csv(self, line_data):
        # Append a single CSV line to the output file.
        with open(r'/Users/sharp/Desktop/LaGou.csv', 'a+') as f:
            f.write(line_data + '\n')


url = 'https://www.lagou.com/jobs/list_linux%E8%BF%90%E7%BB%B4?labelWords=&fromSearch=true&suginput='
lagou = Lagou(url)

# Handle the first page, then walk through the remaining pages in order.
print('Get page {} source'.format(str(1)))
first_page_source = lagou.get_first_page_source()
first_page_data = lagou.process_data(first_page_source)
lagou.save_data_into_csv('company_name,job_name,job_salary')
for data in first_page_data:
    lagou.save_data_into_csv(data)

for i in range(1, 30):
    print('Get page {} source'.format(str(i + 1)))
    next_page_source = lagou.get_next_page_source()
    next_page_data = lagou.process_data(next_page_source)
    for data in next_page_data:
        lagou.save_data_into_csv(data)
    time.sleep(8)

Notes on the Selenium Version Code

A few points worth noting:

  • After locating the next-page button with XPath, you have to take the element at index 0 and then call click() on it:
    next_page = self.driver.find_elements_by_xpath("//span[@class='pager_next ']")
    next_page[0].click()
  • Running in headless mode raises an "element is not clickable at point" error, so the options.add_argument('headless') line is commented out (a possible workaround is sketched after this list).
  • Another point is the sleep interval. On my first run I slept 5 seconds between pages and got redirected to the login page at page 12; after raising it to 8 seconds, all 30 pages could be crawled in one go.
  • Finally, the original post's code is rather hard to read. Mine is purely procedural: it handles the first page, then processes the remaining pages in order, so the flow is obvious and the code stays clear and easy to follow.
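
For reference, the clickable-point error in headless Chrome is often related to the default headless window size. A possible workaround, which is my own assumption and was not tested against Lagou in this post, is to give the headless browser an explicit window size, or to fall back to a JavaScript click, which does not require the element to sit at a clickable point:

# Hypothetical tweak to __init__: run headless but with an explicit window size.
options.add_argument('headless')
options.add_argument('window-size=1920,1080')

# Hypothetical alternative inside get_next_page_source(): click the next-page
# button via JavaScript instead of the normal click().
next_page = self.driver.find_elements_by_xpath("//span[@class='pager_next ']")
self.driver.execute_script("arguments[0].click();", next_page[0])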

Data Screenshot


The top half of the screenshot was generated by the Selenium version of the crawler, and the bottom half by the requests version.
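
Since this CSV is what the data-processing half of the practice would work on, a quick sanity check with pandas might look like the sketch below (my own addition, not part of either crawler script; it reads the same path and header the script writes):

import pandas

# Load the CSV generated by the crawler and take a quick look at it.
df = pandas.read_csv(r'/Users/sharp/Desktop/LaGou.csv')
print(df.shape)                           # number of rows and columns crawled
print(df['job_salary'].value_counts())    # rough distribution of salary ranges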

Aside

Both versions of the script currently crawl Linux ops (linux运维) positions by default. If you would like support for specifying an arbitrary position on the command line, you can email me (sharp.gan1993@gmail.com) for a paid, customized script.
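
For a rough idea of what that would involve, a minimal hypothetical sketch (not part of either script) is to take the keyword from the command line and URL-encode it into the search URL:

import argparse
from urllib.parse import quote

# Hypothetical command-line interface: crawl an arbitrary job keyword.
parser = argparse.ArgumentParser(description='Lagou crawler')
parser.add_argument('keyword', help='job keyword to search for, e.g. linux运维')
args = parser.parse_args()

url = ('https://www.lagou.com/jobs/list_' + quote(args.keyword)
       + '?labelWords=&fromSearch=true&suginput=')
lagou = Lagou(url)  # then proceed exactly as in the script above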

Author: Sharp
Link: http://sharpgan.com/2019/09/22/implement-lagou-crawler/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.