2019 Python Web Scraping Series: Scraping Qiushibaike (糗事百科)

程序员文章站 (Programmer Article Site), 2023-11-19 16:33:10


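**Qiushibaike's URL has changed, and the regular expressions needed to parse it have changed with it, so much of the code you find online no longer works. I wrote this post in the hope that it helps. Thanks!**

Without further ado, here is the code.

To make extracting the data easier, I use the BeautifulSoup library together with requests.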

![Using requests and bs4](https://img-blog.csdnimg.cn/20191017093920758.png)
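
Before the full script, here is a minimal sketch of how the BeautifulSoup lookups behave, run against an inline HTML string instead of the live site. The markup in `SAMPLE_HTML` and the helper name `extract_posts` are made up for illustration. Only the selectors (`id='main'`, `div.cat_llb`, `h3`, `div#endtext`) come from the script in this post, and the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: it only mimics the selectors used in this post,
# and is not copied from the real site.
SAMPLE_HTML = """
<div id="main">
  <div class="cat_llb">
    <h3>alice</h3>
    <div id="endtext">First joke.</div>
  </div>
  <div class="cat_llb">
    <h3>bob</h3>
    <div id="endtext">Second joke.</div>
  </div>
</div>
"""


def extract_posts(html):
    """Return a list of (author, content) pairs found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    main = soup.find(id="main")  # the container that holds all posts
    posts = []
    for block in main.find_all("div", class_="cat_llb"):  # one div per post
        author = block.find("h3").get_text(strip=True)
        content = block.find("div", id="endtext").get_text(strip=True)
        posts.append((author, content))
    return posts


print(extract_posts(SAMPLE_HTML))
```

Returning the pairs instead of writing them straight to a file makes the parsing step easy to check on its own, without any network access.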

## The code

```
import requests
from bs4 import BeautifulSoup


def download_page(url):
    # Send a browser-like User-Agent header with the request
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0"}
    r = requests.get(url, headers=headers)
    return r.text


def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    con = soup.find(id='main')
    con_list = con.find_all('div', class_="cat_llb")
    for i in con_list:
        # get_text() is used instead of .string, which returns None
        # when the tag contains nested tags
        author = i.find('h3').get_text()  # get the author's name
        content = i.find('div', id="endtext").get_text()  # get the post text
        save_txt(author, content)


def save_txt(*args):
    # Append each piece of text to qiubai.txt, followed by a blank line
    for i in args:
        with open('qiubai.txt', 'a', encoding='utf-8') as f:
            f.write(i + '\n' + '\n')


# def save_txt(str):
#     for i in str:
#         with open('qiubai.txt', 'a', encoding='utf-8') as f:
#             f.write(str + '\n')
#             f.write(i)


def main():
    # The page URLs can be built as follows (pages 1 through 19)
    for i in range(1, 20):
        url = 'http://www.lovehhy.net/joke/detail/qsbk/{}'.format(i)
        html = download_page(url)
        get_content(html)


if __name__ == '__main__':
    main()
```
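
Two details of the script are easy to miss: `range(1, 20)` stops at page 19, and `save_txt` reopens the file for every string it writes. The sketch below shows both in isolation, using only the standard library. The helper names `page_urls` and `save_records` and the demo file name are hypothetical, not part of the original script. The output format matches `save_txt` (each string followed by a blank line).

```python
import os
import tempfile

BASE_URL = "http://www.lovehhy.net/joke/detail/qsbk/{}"


def page_urls(first=1, last=19):
    """Build the page URLs for pages first..last, inclusive."""
    return [BASE_URL.format(n) for n in range(first, last + 1)]


def save_records(path, records):
    """Append (author, content) pairs to path, opening the file only once."""
    with open(path, "a", encoding="utf-8") as f:
        for author, content in records:
            f.write(author + "\n\n" + content + "\n\n")


urls = page_urls()
print(len(urls), urls[0], urls[-1])

# Write to a temporary directory so the demo leaves qiubai.txt untouched
demo_path = os.path.join(tempfile.mkdtemp(), "qiubai_demo.txt")
save_records(demo_path, [("alice", "First joke.")])
with open(demo_path, encoding="utf-8") as f:
    print(repr(f.read()))
```

Opening the file once per batch is a small design choice, not a requirement. The original approach works too, it just performs more open and close calls.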

Oh, and by the way, the new site's address is http://www.lovehhy.net/joke/detail/qsbk/
If anything is unclear, feel free to leave a comment.
