A crawler for downloading MP3 audio from the 故事FM (Story FM) website

September 5, 2020

Development and runtime environment

  • macOS: 10.13.6
  • Python: 3.7.7
  • requests==2.24.0
  • youtube-dl==2020.7.28
  • user-agent==0.1.9
  • html5lib==1.1
  • beautifulsoup4==4.9.1

Install the dependencies:

$ sudo pip3 install requests==2.24.0
$ sudo pip3 install youtube-dl==2020.7.28
$ sudo pip3 install user-agent==0.1.9
$ sudo pip3 install html5lib==1.1
$ sudo pip3 install beautifulsoup4==4.9.1

The code

The first script is the crawler that collects the episode URLs:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@Time    : 2020/9/5-14:13
@Author  : sharp
@FileName: collect_url.py
@Software: PyCharm
@Blog    :https://www.sharpgan.com/
"""

import time
import random
import requests
from bs4 import BeautifulSoup
from user_agent import generate_user_agent

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;"
              "q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,"
              "application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7",
    # "br" is left out: requests only decodes Brotli when a brotli package is installed.
    "Accept-Encoding": "gzip, deflate",
    "User-Agent": generate_user_agent(os="win"),
}

# Adjust the page range to match the site's actual page count.
with open("urls.txt", "w") as f:
    for num in range(2, 70):
        host = "https://storyfm.cn/page/" + str(num)
        data = requests.get(host, headers=headers, timeout=30).text
        soup = BeautifulSoup(data, 'html5lib')
        links = soup.select("div.isotope-index-text a")
        # De-duplicate the links on this page; skip <a> tags without an href.
        url_unique_list = {a.get('href') for a in links if a.get('href')}
        for u in url_unique_list:
            print(u)
            f.write(u + '\n')
        # Sleep 1.0 to 5.9 seconds between pages to keep the load on the site low.
        seconds = random.choice([i / 10 for i in range(10, 60)])
        print("sleep: " + str(seconds))
        time.sleep(seconds)
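
The crawler above de-duplicates links only within a single page, so a URL that shows up on more than one page would be written to urls.txt more than once. A minimal sketch of an order-preserving clean-up pass, assuming the lines of urls.txt have already been read into a list; `dedupe_keep_order` is a hypothetical helper and the example.com URLs are placeholders, neither is part of the original script:

```python
def dedupe_keep_order(urls):
    """Return the URLs with blanks and repeats removed, keeping first-seen order."""
    cleaned = (u.strip() for u in urls)
    # dict keeps insertion order (Python 3.7+), so fromkeys keeps the first occurrence.
    return list(dict.fromkeys(u for u in cleaned if u))


# Placeholder input that mimics lines read from urls.txt.
sample = [
    "https://example.com/a/\n",
    "https://example.com/b/\n",
    "https://example.com/a/\n",
    "\n",
]
print(dedupe_keep_order(sample))
```

Running the result back through `f.write(u + '\n')` would give the download step a file without repeats.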

Once all the URLs are collected, download them with youtube-dl:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@Time    : 2020/9/5-15:00
@Author  : sharp
@FileName: download.py
@Software: PyCharm
@Blog    :https://www.sharpgan.com/
"""
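import random
import subprocess
import time

# Read the collected URLs, skipping blank lines.
with open('urls.txt') as f:
    urls_list = [line.strip() for line in f if line.strip()]

for url in urls_list:
    # Pass the URL as a separate argument so the shell never interprets it.
    subprocess.run(["youtube-dl", url])
    # Sleep 6.0 to 19.9 seconds between downloads to keep the load on the site low.
    seconds = random.choice([i / 10 for i in range(60, 200)])
    print("sleep: " + str(seconds))
    time.sleep(seconds)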
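
With a pause of several seconds per episode, a full run takes a long time, and an interrupted run would start again from the first URL. One possible way to make the loop resumable is sketched below, assuming a second file that records finished URLs. The names `pending_urls`, `download_all`, `default_pause` and the file name done.txt are hypothetical and not part of the original script; the pause between downloads is kept.

```python
import random
import subprocess
import time


def pending_urls(all_urls, done_urls):
    """Return the URLs not yet recorded as downloaded, in their original order."""
    done = set(done_urls)
    return [u for u in all_urls if u not in done]


def default_pause():
    # Roughly the same 6 to 20 second range as the loop in download.py.
    time.sleep(random.uniform(6.0, 20.0))


def download_all(urls, done_path="done.txt", run=subprocess.run, pause=default_pause):
    """Download each pending URL and append it to done_path when youtube-dl succeeds."""
    try:
        with open(done_path) as f:
            done = [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        done = []  # first run: nothing has been downloaded yet
    for url in pending_urls(urls, done):
        result = run(["youtube-dl", url])
        # Record the URL only if youtube-dl exited with status 0.
        if result.returncode == 0:
            with open(done_path, "a") as f:
                f.write(url + "\n")
        pause()
```

Calling `download_all(urls_list)` in place of the plain loop would skip anything already listed in done.txt. The `run` and `pause` parameters exist only so the function can be exercised without touching the network.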

 

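Note: do not remove the time.sleep(seconds) line or narrow the range of values for seconds. Doing so could put excessive load on the site and disrupt its normal operation. This site accepts no responsibility for any consequences!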
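
One way to make that rule harder to break by accident is to pick the delay through a helper that rejects ranges that are too short. This is only a sketch: `random_delay` and `MIN_DELAY` are hypothetical names, and the 6-second floor is an assumption taken from the lower bound used in download.py.

```python
import random

MIN_DELAY = 6.0  # assumed floor, taken from the lower bound used in download.py


def random_delay(low=6.0, high=20.0, floor=MIN_DELAY):
    """Pick a random delay in [low, high] seconds, refusing ranges below the floor."""
    if low < floor:
        raise ValueError("delay range starts below the minimum of %.1f s" % floor)
    if high < low:
        raise ValueError("high must not be smaller than low")
    return random.uniform(low, high)
```

In the download loop this would be used as `time.sleep(random_delay())`.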

Sharp

"A Linux user and a Python{}".format('er')
