Python Daily: Scraping 250 Movie Records from Douban
Published: 2019-06-08


import csv

import lxml.html
import requests

doubanUrl = 'https://movie.douban.com/top250?start={}&filter='
# Assumption: the site may reject the default requests User-Agent, so send a browser-like one.
headers = {'User-Agent': 'Mozilla/5.0'}


def getSource(doubanUrl):
    """Download one page and return the raw HTML bytes."""
    response = requests.get(doubanUrl, headers=headers)
    response.raise_for_status()
    # Return bytes so lxml can detect the encoding itself.
    return response.content


def getEveryItem(source):
    """Parse one page and return a list of movie dicts."""
    selector = lxml.html.document_fromstring(source)
    movieItemList = selector.xpath('//div[@class="info"]')
    movieList = []
    for eachMovie in movieItemList:
        movieDict = {}
        title = eachMovie.xpath('div[@class="hd"]/a/span[@class="title"]/text()')
        otherTitle = eachMovie.xpath('div[@class="hd"]/a/span[@class="other"]/text()')
        link = eachMovie.xpath('div[@class="hd"]/a/@href')[0]
        # Search descendants so the lookup does not depend on which child div holds the rating and quote.
        star = eachMovie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()')
        quote = eachMovie.xpath('.//p[@class="quote"]/span/text()')

        movieDict['title'] = ''.join(title + otherTitle)
        movieDict['url'] = link
        # xpath() returns lists; store plain strings (some movies have no quote).
        movieDict['star'] = star[0] if star else ''
        movieDict['quote'] = quote[0] if quote else ''
        movieList.append(movieDict)
    return movieList


def writeData(movieList):
    """Write all movie records to Douban.csv."""
    with open('./Douban.csv', 'w', encoding='UTF-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'star', 'quote', 'url'])
        writer.writeheader()
        for each in movieList:
            writer.writerow(each)


if __name__ == '__main__':
    movieList = []
    for i in range(10):
        pageLink = doubanUrl.format(i * 25)
        print(pageLink)
        source = getSource(pageLink)
        # Accumulate every page instead of overwriting the previous one.
        movieList += getEveryItem(source)
    print(movieList[:10])
    writeData(movieList)
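The pagination and CSV-writing steps of the script above can be checked without touching the network. The following is a minimal sketch, not the original author's code: `build_page_urls` and `rows_to_csv` are hypothetical helper names introduced here, and the sample row is placeholder data rather than a real scrape result. It assumes each page holds 25 entries, as the `i*25` offset in the script implies.

```python
import csv
import io

DOUBAN_URL = 'https://movie.douban.com/top250?start={}&filter='
FIELDNAMES = ['title', 'star', 'quote', 'url']


def build_page_urls(pages=10, page_size=25):
    """Return one URL per result page (start=0, 25, 50, ...)."""
    return [DOUBAN_URL.format(i * page_size) for i in range(pages)]


def rows_to_csv(rows):
    """Serialize a list of movie dicts to CSV text using csv.DictWriter."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == '__main__':
    urls = build_page_urls()
    print(len(urls), urls[0], urls[-1])

    # Placeholder row (hypothetical values) to show the column layout.
    sample = [{
        'title': 'Example Movie',
        'star': '9.0',
        'quote': 'Placeholder quote',
        'url': 'https://example.com/movie/1',
    }]
    print(rows_to_csv(sample))
```

Writing to an in-memory buffer first makes it easy to inspect the header and rows before committing them to `Douban.csv`.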

Reposted from: https://www.cnblogs.com/zxycb/p/9823311.html
