
Python with functions: scraping Douban movie data

程序员文章站 2022-09-21 09:02:56


Import the modules and print the genre codes


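import requests
import pandas as pd
import json
import time

# Show the genre codes accepted by the Douban chart endpoint.
print('''
    1-Documentary; 2-Biography; 3-Crime; 4-History; 5-Action;
    6-Erotic; 7-Musical; 8-Children; 10-Mystery; 11-Drama;
    12-Disaster; 13-Romance; 14-Music; 15-Adventure; 16-Fantasy;
    17-Sci-Fi; 18-Sports; 19-Thriller; 20-Horror; 22-War;
    23-Short; 24-Comedy; 25-Animation; 26-LGBT; 27-Western;
    28-Family; 29-Wuxia; 30-Costume drama; 31-Film noir
''')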


Enter the genre code and the number of movies to fetch


leixing = input("Enter the code of the genre you want to download: ")
num = input("Enter how many top-ranked movies you want to download: ")
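Both values come back from input() as strings and are used without any checking. A minimal sketch of validating them first, assuming the valid genre codes are exactly those in the printed list (9 and 21 do not appear there). `VALID_GENRE_CODES` and `parse_request` are hypothetical names, not part of the original script:

```python
# Genre codes copied from the printed list; 9 and 21 are not listed there.
VALID_GENRE_CODES = set(range(1, 32)) - {9, 21}


def parse_request(code_text, count_text):
    """Convert the two raw input() strings into validated integers.

    Hypothetical helper: raises ValueError for anything that is not a
    listed genre code or a positive count.
    """
    code = int(code_text.strip())    # int() raises ValueError on non-numeric text
    count = int(count_text.strip())
    if code not in VALID_GENRE_CODES:
        raise ValueError(f"unknown genre code: {code}")
    if count < 1:
        raise ValueError(f"count must be at least 1, got {count}")
    return code, count
```

Calling `parse_request(leixing, num)` before the download step would turn a mistyped code into an immediate error rather than a failed request.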


Fetch the information for each movie


def download(leixing, num):
    # A single request is enough: limit sets how many movies are returned.
    url = (
        "https://movie.douban.com/j/chart/top_list"
        f"?type={leixing}&interval_id=100%3A90&action=&start=0&limit={int(num)}"
    )
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # stop early on an HTTP error
    dt = json.loads(response.text)  # the response body is a JSON array of movies

    # One column per field, one row per movie.
    df = pd.DataFrame({
        'Title': [m['title'] for m in dt],
        'Rank': [m['rank'] for m in dt],
        'Score': [m['score'] for m in dt],
        'Region': [m['regions'] for m in dt],
        'Release date': [m['release_date'] for m in dt],
        'Genre': [m['types'] for m in dt],
        'Cast': [m['actors'] for m in dt],
        'Cover URL': [m['cover_url'] for m in dt],
    })
    df.index = df.index + 1  # number the rows from 1 instead of 0
    df.to_excel('e:/豆瓣电影排行榜.xlsx')


download(leixing, num)
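Some fields in the response appear to be lists rather than plain strings, in which case they would land in the spreadsheet as Python list literals. The sketch below assumes `types`, `regions` and `actors` are lists of strings and joins them into one readable cell each. `flatten_record`, `LIST_FIELDS` and the sample record are hypothetical and not taken from the original:

```python
LIST_FIELDS = ("types", "regions", "actors")  # assumed to hold lists of strings


def flatten_record(record, sep="/"):
    """Return a copy of one movie record with list fields joined into strings."""
    row = dict(record)  # shallow copy so the caller's record is left untouched
    for field in LIST_FIELDS:
        value = row.get(field)
        if isinstance(value, list):
            row[field] = sep.join(str(item) for item in value)
    return row


# Made-up sample shaped like one element of the JSON array.
sample = {"title": "Example", "rank": 1, "score": "9.0",
          "types": ["Drama", "Crime"], "regions": ["USA"], "actors": ["A", "B"]}
print(flatten_record(sample))
```

A list of such flattened rows can be passed straight to `pd.DataFrame(...)`, which accepts a list of dictionaries.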


Complete code


import requests
import pandas as pd
import json
import time

# Show the genre codes accepted by the Douban chart endpoint.
print('''
    1-Documentary; 2-Biography; 3-Crime; 4-History; 5-Action;
    6-Erotic; 7-Musical; 8-Children; 10-Mystery; 11-Drama;
    12-Disaster; 13-Romance; 14-Music; 15-Adventure; 16-Fantasy;
    17-Sci-Fi; 18-Sports; 19-Thriller; 20-Horror; 22-War;
    23-Short; 24-Comedy; 25-Animation; 26-LGBT; 27-Western;
    28-Family; 29-Wuxia; 30-Costume drama; 31-Film noir
''')

leixing = input("Enter the code of the genre you want to download: ")
num = input("Enter how many top-ranked movies you want to download: ")


def download(leixing, num):
    # A single request is enough: limit sets how many movies are returned.
    url = (
        "https://movie.douban.com/j/chart/top_list"
        f"?type={leixing}&interval_id=100%3A90&action=&start=0&limit={int(num)}"
    )
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # stop early on an HTTP error
    dt = json.loads(response.text)  # the response body is a JSON array of movies

    # One column per field, one row per movie.
    df = pd.DataFrame({
        'Title': [m['title'] for m in dt],
        'Rank': [m['rank'] for m in dt],
        'Score': [m['score'] for m in dt],
        'Region': [m['regions'] for m in dt],
        'Release date': [m['release_date'] for m in dt],
        'Genre': [m['types'] for m in dt],
        'Cast': [m['actors'] for m in dt],
        'Cover URL': [m['cover_url'] for m in dt],
    })
    df.index = df.index + 1  # number the rows from 1 instead of 0
    df.to_excel('e:/豆瓣电影排行榜.xlsx')


download(leixing, num)


Approach

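#1. The page to scrape is a dynamic page, so the data comes from a separate request rather than from the page's HTML.
#2. The real URL carries the count (paging) information in its start and limit parameters: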
https://movie.douban.com/j/chart/top_list?type={leixing}&interval_id=100%3A90&action=&start=0&limit={i}
#3. The data returned is JSON.
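Point 2 can be made concrete by assembling the query string from its parts. This is a minimal sketch using only the standard library. `build_url` and `page_urls` are hypothetical helpers, and fetching a large ranking in pages through `start` is an assumption about how the endpoint behaves, not something the script above does:

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/j/chart/top_list"


def build_url(genre, start, limit):
    """Build the request URL; urlencode turns '100:90' into '100%3A90'."""
    params = {"type": genre, "interval_id": "100:90", "action": "",
              "start": start, "limit": limit}
    return f"{BASE_URL}?{urlencode(params)}"


def page_urls(genre, total, page_size=20):
    """Yield one URL per page until `total` movies are covered."""
    for start in range(0, total, page_size):
        yield build_url(genre, start, min(page_size, total - start))


for url in page_urls(11, 45):
    print(url)
```

Keeping the parameters in a dictionary avoids hand-encoding characters such as the colon in `interval_id`.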


Original article: https://blog.csdn.net/weixin_43422435/article/details/108249430
