Web Scraping With Scrapy


1 – Problem Description

How to create a simple scraper with Scrapy.

2 – Solution

Scrapy is a web scraping framework for Python. We need to extract specific data from web pages and either export it to a file (CSV/JSON/XML) or import it into a database for further processing. In this example we show how to achieve this by writing the output to a CSV file.

  1. Install scrapy
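
  For example, assuming you use pip, Scrapy can be installed with:

    pip install scrapy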

  2. Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and then run:

    scrapy startproject myProject
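
  This creates a basic project skeleton which, for the Scrapy version used here, should look roughly like this:

    myProject/
        scrapy.cfg
        myProject/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py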

  3. Now, our aim is to capture the name and address from lists of persons, so the first thing to do is to model our item. So, we edit the items.py file:

from scrapy.item import Item, Field

class PersonData(Item):
    name = Field()
    address = Field()
    category = Field()
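
A Scrapy Item behaves much like a Python dictionary whose keys are restricted to the declared fields, so, just as a quick illustration with made-up values, it can be used like this:

person = PersonData(name='John Doe', address='1 Example Street')
person['category'] = 'plumber'
print(person['name'])   # prints: John Doe
print(dict(person))     # all declared fields as a plain dict
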
  4. Next, we create our spider file, myProject_spider.py, in the spiders/ folder.

  5. We won’t go into details for every line, because this is a specific piece of code for a particular project. We’ll just mention the key points.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from myProject.items import PersonData
from scrapy.http import Request

class myProjectSpider(CrawlSpider):
    name = 'myProject'
    allowed_domains = ['mydomain.com']
    start_urls = ['http://www.mydomain.com/1',
                  'http://www.mydomain.com/2',
                  'http://www.mydomain.com/3']
    rules = (Rule(SgmlLinkExtractor(allow=(r'\?page=\d',)), 'parse_start_url', follow=True),)

    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)
        names = hxs.select('//a[@class="name"]/text()').extract()
        addresses = hxs.select('//div[@class="address"]/text()').extract()
        subpages = hxs.select('//a[@class="subpage"]/@href').extract()
        for name, address, subpage in zip(names, addresses, subpages):
            person = PersonData()
            person['name'] = name
            person['address'] = address
            request = Request(subpage, callback=self.subPage)
            request.meta['person'] = person
            yield request

    def subPage(self, response):
        person = response.meta['person']
        hxs = HtmlXPathSelector(response)
        if hxs.select('//div[@class="category"]/span/text()').extract():
            person['category'] = hxs.select('//div[@class="category"]/span/text()').extract()[0]
        else:
            person['category'] = ""
        yield person
  • start_urls: here we define the URLs from which scraping will start.

  • rules: here we define the rule we will use for sequential pages (if there is pagination). In order to use rules like this you must base your spider on CrawlSpider instead of BaseSpider, or create your own custom spider. So:

\?page=\d

is a regular expression where \? matches a literal question mark and \d matches any digit. For our specific project, the paginated pages have this format:

http://www.mydomain.com/1?page=x
http://www.mydomain.com/2?page=x
http://www.mydomain.com/3?page=x

where x is any digit. The next point we have to mention is the callback function parse_start_url. CAUTION: if you want your spider to also scrape the URLs you define in start_urls, then you have to use this exact callback name. This was a bit tricky, because it is undocumented, and it was only solved after a lot of searching.
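
A quick way to sanity-check the pattern is Python's re module with a couple of made-up URLs:

import re

# the same pattern we pass to SgmlLinkExtractor's allow argument
pattern = re.compile(r'\?page=\d')

print(pattern.search('http://www.mydomain.com/1?page=3') is not None)  # True  - paginated page
print(pattern.search('http://www.mydomain.com/1') is not None)         # False - no pagination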

  • In parse_start_url we extract the text we are after from the web pages, using XPath syntax.
  • For every person there is a subpage with details, so we need to scrape this subpage to get the category. We build a Request with a second callback function for the subpage and pass our item along in request.meta.

  • In subPage we extract the category, if it exists, and yield the person item.
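
By the way, a convenient way to try out XPath expressions like these before putting them into the spider is the Scrapy shell; in this Scrapy version it gives you an hxs selector for the fetched page (the class names below are of course just our example's assumptions about the markup):

    scrapy shell 'http://www.mydomain.com/1'

    >>> hxs.select('//a[@class="name"]/text()').extract()
    >>> hxs.select('//div[@class="address"]/text()').extract()
    >>> hxs.select('//a[@class="subpage"]/@href').extract()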

  6. Next, we need to post-process our scraped data, so we edit pipelines.py:

import csv

class myExporter(object):

    def __init__(self):
        self.myCSV = csv.writer(open('output.csv', 'wb'))
        self.myCSV.writerow(['name', 'address', 'category'])

    def process_item(self, item, spider):
        self.myCSV.writerow([item['name'].encode('utf-8'),
                             item['address'].encode('utf-8'),
                             item['category'].encode('utf-8')])
        return item

So, we create a file called output.csv with write permission and write one row per scraped person. Keep in mind that the scraped text is unicode, so if we want to encode it, for example to UTF-8, we do it as shown above. Last but not least, we have to declare our exporter in settings.py:

ITEM_PIPELINES = ['myProject.pipelines.myExporter']
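
As a side note, newer Scrapy releases expect ITEM_PIPELINES to be a dict that maps each pipeline to an order value, so on those versions the equivalent setting would be:

ITEM_PIPELINES = {'myProject.pipelines.myExporter': 300}
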
  7. Go to the root of your project directory and type:

    scrapy crawl myProject
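
As an aside, for simple cases Scrapy's built-in feed exports can also write the items straight to CSV without a custom pipeline, e.g.:

    scrapy crawl myProject -o output.csv -t csv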

BUYAKASHA! :D

3 – References

[1] Scrapy

[2] Regular Expressions

[3] XPath
