Scrapy Getting Started Tutorial
1. Installation
pip install Scrapy
2. Create a project
scrapy startproject myspider
Once the project has been created, its directory structure looks like this:
myspider/
    scrapy.cfg            # deploy configuration file
    myspider/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
3. Create a spider
cd myspider
scrapy genspider example example.com
This generates a spider file, example.py, under /myspider/myspider/spiders. Sample code:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass
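allowed_domains lists the domains the spider is allowed to crawl, and start_urls lists the URLs where the crawl begins.
Put your own parsing logic in parse(); each item it yields is passed on to the pipelines defined in pipelines.py.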
4. Run the spider to crawl data
scrapy crawl example
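While the crawl runs, every item yielded by the spider goes through the enabled item pipelines. A minimal sketch of what could go in pipelines.py is shown below; the class name StripTitlePipeline and the "title" field are hypothetical and assume the spider yields dict-like items containing that key.

```python
class StripTitlePipeline:
    """Hypothetical pipeline that trims whitespace from an item's title."""

    def process_item(self, item, spider):
        # Scrapy calls this method once for every item the spider yields.
        title = item.get("title")
        if title is not None:
            item["title"] = title.strip()
        # Return the item so that later pipelines (if any) receive it.
        return item
```

A pipeline only takes effect once it is listed in the ITEM_PIPELINES setting in settings.py, for instance with the entry "myspider.pipelines.StripTitlePipeline": 300, where the number sets the order in which pipelines run.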