Python Daily: Scraping Douban's Top 250 Movie Records - 白红宇



Published: 2019-06-08


import csv

import lxml.html
import requests

doubanUrl = 'https://movie.douban.com/top250?start={}&filter='

# Assumption: Douban tends to reject requests that lack a browser-like
# User-Agent, so one is sent with every request.
headers = {'User-Agent': 'Mozilla/5.0'}


def getSource(doubanUrl):
    """Download one list page and return its raw bytes."""
    response = requests.get(doubanUrl, headers=headers)
    response.raise_for_status()
    return response.content


def getEveryItem(source):
    """Extract title, link, rating and quote for every movie on a page."""
    selector = lxml.html.document_fromstring(source)
    movieItemList = selector.xpath('//div[@class="info"]')
    movieList = []
    for eachMovie in movieItemList:
        movieDict = {}
        # The title and link sit in div.hd; the rating and quote sit in div.bd.
        title = eachMovie.xpath('div[@class="hd"]/a/span[@class="title"]/text()')
        otherTitle = eachMovie.xpath('div[@class="hd"]/a/span[@class="other"]/text()')
        link = eachMovie.xpath('div[@class="hd"]/a/@href')[0]
        star = eachMovie.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()')
        quote = eachMovie.xpath('div[@class="bd"]/p[@class="quote"]/span/text()')

        movieDict['title'] = ''.join(title + otherTitle)
        movieDict['url'] = link
        # xpath() returns lists; join them so the CSV holds plain strings.
        movieDict['star'] = ''.join(star)
        movieDict['quote'] = ''.join(quote)
        movieList.append(movieDict)
    return movieList


def writeData(movieList):
    """Write all collected movies to Douban.csv."""
    with open('./Douban.csv', 'w', encoding='UTF-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'star', 'quote', 'url'])
        writer.writeheader()
        for each in movieList:
            writer.writerow(each)


if __name__ == '__main__':
    movieList = []
    for i in range(10):
        # 10 pages of 25 movies each: start = 0, 25, ..., 225
        pageLink = doubanUrl.format(i * 25)
        print(pageLink)
        source = getSource(pageLink)
        # Accumulate results instead of overwriting the previous pages.
        movieList.extend(getEveryItem(source))
    print(movieList[:10])
    writeData(movieList)
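
The XPath extraction step can be checked offline before sending any requests. The following is a minimal sketch, not the original author's code: `SAMPLE_HTML` is a made-up fragment that only imitates the structure the script expects (title and link in `div.hd`, rating and quote in `div.bd`), and `parse_items` is a hypothetical helper name. The real page markup may differ or change over time.

```python
import lxml.html

# Hypothetical fragment imitating one entry of the Top 250 list.
# This is an assumption about the markup, not a copy of the real page.
SAMPLE_HTML = """
<ol>
  <li>
    <div class="info">
      <div class="hd">
        <a href="https://example.com/movie/1">
          <span class="title">Sample Title</span>
          <span class="other"> / Alternate Title</span>
        </a>
      </div>
      <div class="bd">
        <div class="star"><span class="rating_num">9.0</span></div>
        <p class="quote"><span>A sample quote.</span></p>
      </div>
    </div>
  </li>
</ol>
"""


def parse_items(source):
    """Return one dict per div.info block found in the HTML source."""
    selector = lxml.html.document_fromstring(source)
    movies = []
    for item in selector.xpath('//div[@class="info"]'):
        # Each xpath() call returns a list of matching text nodes or attributes.
        title = item.xpath('div[@class="hd"]/a/span[@class="title"]/text()')
        other = item.xpath('div[@class="hd"]/a/span[@class="other"]/text()')
        link = item.xpath('div[@class="hd"]/a/@href')
        star = item.xpath(
            'div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()')
        quote = item.xpath('div[@class="bd"]/p[@class="quote"]/span/text()')
        movies.append({
            'title': ''.join(title + other).strip(),
            'url': link[0] if link else '',  # tolerate entries without a link
            'star': ''.join(star),
            'quote': ''.join(quote),         # empty string when no quote exists
        })
    return movies


movies = parse_items(SAMPLE_HTML)
print(movies)
```

Joining each result list into a string, and falling back to an empty string when nothing matches, keeps a single missing field from raising an `IndexError` or putting a Python list literal into the CSV.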

Reposted from: https://www.cnblogs.com/zxycb/p/9823311.html
