## Create a [scrapy](http://www.scrapyd.cn/) project
```
E:\pythonText> scrapy startproject mySlider
E:\pythonText\mySlider>tree
E:.
└─mySlider
    └─spiders
```
### **Create the first spider**
**1. From inside the spiders directory, run:**
```
scrapy genspider myspider baidu.com
```
**2. Or create a file yourself in the spiders directory, e.g. itcastspider.py:**
```
#!/usr/bin/env python
# _*_ coding:utf-8 _*_
import scrapy
from mySlider.items import MysliderItem


class ItcastSpider(scrapy.Spider):
    """
    A spider class that scrapes teacher profiles
    """
    # Spider name, used with `scrapy crawl itcast`
    name = "itcast"
    # Domains the spider is allowed to crawl (just the domain, no scheme)
    allowed_domains = ["itcast.cn"]
    # Start URLs for the spider; you can list more than one
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml"]

    def parse(self, response):
        # with open("teacher.html", "w") as f:
        #     f.write(response.body)

        # Use Scrapy's built-in XPath support to match the root node
        # of every teacher entry
        teacher_list = response.xpath('//div[@class="li_txt"]')
        # Iterate over the matched nodes
        for each in teacher_list:
            # An Item object holds the scraped data; create a fresh one
            # per teacher so earlier results are not overwritten
            item = MysliderItem()
            # extract() converts the matched result into a list of
            # unicode strings
            name = each.xpath('./h3/text()').extract()
            title = each.xpath('./h4/text()').extract()
            info = each.xpath('./p/text()').extract()
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]
            # yield hands each scraped item over to the pipelines
            yield item
        # Alternatively, collect the items in a list and return it,
        # but returned data does not pass through the pipelines:
        # teacherItem = []
        # teacherItem.append(item)
        # return teacherItem
```
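The loop above creates a fresh `MysliderItem` for each teacher. A plain-dict sketch (no Scrapy needed; the names are made up for illustration) shows why reusing one mutable object across iterations is fragile, especially when collecting items into a list as the commented-out `teacherItem` variant does:

```python
# Hypothetical stand-in for a Scrapy Item: a plain dict.
rows = [("Alice", "lecturer"), ("Bob", "tutor")]

# Buggy: one shared dict; every stored reference ends up
# pointing at the same object, which holds only the last values
shared = {}
buggy = []
for name, title in rows:
    shared["name"] = name
    shared["title"] = title
    buggy.append(shared)

# Correct: a fresh dict per iteration,
# like `item = MysliderItem()` inside the loop
correct = []
for name, title in rows:
    item = {"name": name, "title": title}
    correct.append(item)

print([d["name"] for d in buggy])    # both entries show "Bob"
print([d["name"] for d in correct])  # "Alice", "Bob"
```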
The corresponding mySlider/items.py file looks like this:
```
import scrapy


class MysliderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
```
**3. Run the spider**
```
E:\pythonText\mySlider>scrapy crawl itcast
```
>[danger] On Windows, running a scrapy (python2.7) spider with `scrapy crawl dmoz` may fail with: exceptions.ImportError: No module named win32api
In that case, install [pywin32](https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/)
**4. Save the crawled content in JSON format**
```
E:\pythonText\mySlider>scrapy crawl itcast -o teacherItem.json
# or another feed format, e.g. teacherItem.csv
# a csv file may need re-encoding before it displays correctly:
# str.encode("gbk", 'ignore')
```
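A minimal sketch of the re-encoding step mentioned above: GBK covers common Chinese characters, and the `'ignore'` error handler silently drops any character GBK cannot represent instead of raising `UnicodeEncodeError`. The sample string is made up for illustration:

```python
# -*- coding: utf-8 -*-
text = u"数据分析 - data analysis"

# Encode to GBK bytes, dropping unencodable characters
gbk_bytes = text.encode("gbk", "ignore")

# Decode back to verify the round trip survived
print(gbk_bytes.decode("gbk"))
```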
**Running the spider from PyCharm**
>[success] In PyCharm, go to Edit Configurations -> python/itcastspider -> Configuration -> Script path, and select the newly created start.py in the project
```
#!/usr/bin/env python
# _*_ coding:utf-8 _*_
from scrapy import cmdline
cmdline.execute("scrapy crawl itcast -o teacherItem.csv".split())
```
**Processing yielded items with a custom pipeline**
1. Enable the pipeline via ITEM_PIPELINES in the project's settings.py
~~~
ITEM_PIPELINES = {
'mySlider.pipelines.MysliderPipeline': 300,
}
~~~
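The integer assigned to each pipeline (in the range 0-1000) sets the order in which pipelines run: lower values run first. With more than one pipeline registered, each item passes through them in ascending order. A sketch with a hypothetical second pipeline class (`CsvExportPipeline` is not part of this project, it is only for illustration):

```python
# settings.py
# MysliderPipeline (300) processes each item before
# the hypothetical CsvExportPipeline (800)
ITEM_PIPELINES = {
    'mySlider.pipelines.MysliderPipeline': 300,
    'mySlider.pipelines.CsvExportPipeline': 800,
}
```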
2. In pipelines.py:
~~~
import codecs
from json import dumps


class MysliderPipeline(object):
    def __init__(self):
        # Open the output file once, with an explicit encoding so
        # non-ASCII text is written correctly
        self.filename = codecs.open("teacher.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Convert the item to JSON, one object per line
        jsontext = dumps(dict(item), ensure_ascii=False) + "\n"
        # Write the JSON data to the file
        self.filename.write(jsontext)
        return item

    def close_spider(self, spider):
        self.filename.close()
~~~
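The `ensure_ascii=False` argument in the pipeline matters for Chinese text: by default, `json.dumps` escapes every non-ASCII character into a `\uXXXX` sequence, which makes the output file hard to read. A quick demonstration with a made-up item:

```python
from json import dumps

item = {"name": u"张三", "title": u"老师"}

# Default: non-ASCII characters are escaped to \uXXXX sequences
print(dumps(item))
# With ensure_ascii=False, as in the pipeline above,
# the characters are written out directly
print(dumps(item, ensure_ascii=False))
```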
3. Run scrapy
```
E:\pythonText\mySlider>scrapy crawl itcast
```