In this tutorial I use Scrapy to collect data from craigslist.org; specifically, the Seattle vacation rentals listings under housing. You can find the page at this link: https://seattle.craigslist.org/d/vacation-rentals/search/vac
In the example, I collected the following information:
- Title
- Posted Date
- Rental Price
- Number of bedrooms
- Neighborhood
- Description
For more information or the full code, please go to my GitHub page.
PREPARATION
INSTALLATION
You can install Scrapy through pip:
$ pip install scrapy
or through conda:
$ conda install -c conda-forge scrapy
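To confirm that the installation worked, you can print the installed version from the command line:
$ scrapy version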
CREATE PROJECT
Before we start coding, we can use the scrapy startproject command to quickly create a project. In a terminal or CMD, navigate to your desired folder and execute the following command:
$ scrapy startproject scrapy_craigslist
Here scrapy_craigslist is the name of the project.
After that, we can use the genspider command to create a Scrapy spider. Here we name it vacation_rentals and designate a start URL; we use the craigslist.org Seattle vacation rentals listing page as an example.
$ scrapy genspider vacation_rentals seattle.craigslist.org/d/vacation-rentals/search/vac
This will create a directory with the following structure:
scrapy_craigslist
├── __init__.py
├── __pycache__
│   ├── __init__.cpython-36.pyc
│   └── settings.cpython-36.pyc
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
└── spiders
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-36.pyc
    │   └── vacation_rentals.cpython-36.pyc
    └── vacation_rentals.py
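A quick note on settings.py before we move on: projects generated by startproject include ROBOTSTXT_OBEY = True, which makes Scrapy download the site's robots.txt first and drop any request it disallows. If your spider later logs a message like "Forbidden by robots.txt", this setting is the first place to look:
# settings.py, as generated by scrapy startproject
ROBOTSTXT_OBEY = True  # set to False to skip the robots.txt check, at your own discretion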
EDITING
Navigate to the spiders folder and open the spider's .py file in your favorite editor. There is some pre-written code, but you need to make sure that allowed_domains and start_urls are in the right form.
import scrapy


class VacationRentalsSpider(scrapy.Spider):
    name = 'vacation_rentals'
    allowed_domains = ['craigslist.org']
    start_urls = ['https://seattle.craigslist.org/d/vacation-rentals/search/vac/']

    def parse(self, response):
        pass
Let's write our own code under def parse(self, response):. You can also check the full code on my GitHub page.
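If you want to test XPath expressions before putting them in the spider, scrapy shell gives you an interactive response object for any URL (shown here with the listing page from above and the selectors we are about to use):
$ scrapy shell 'https://seattle.craigslist.org/d/vacation-rentals/search/vac/'
>>> response.xpath('//p[@class="result-info"]')
>>> response.xpath('//p[@class="result-info"]/a/text()').extract_first()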
# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy import Request


class VacationRentalsSpider(scrapy.Spider):
    name = 'vacation_rentals'
    allowed_domains = ['craigslist.org']
    start_urls = ['https://seattle.craigslist.org/d/vacation-rentals/search/vac/']

    def parse(self, response):
        # Extract the wrapper of each list item: <p class="result-info">...</p>
        vacs = response.xpath('//p[@class="result-info"]')
        # Get the next-page button URL from
        # <a href="/search/vac?s=120" class="button next" title="next page">next ></a>
        next_rel_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        if next_rel_url:
            # Build the full address and follow it, so we go through all the pages.
            # (On the last page there is no next link, hence the guard above.)
            next_url = response.urljoin(next_rel_url)
            yield Request(next_url, callback=self.parse)
        # Loop over the items to extract title, posted date, rental price,
        # number of bedrooms, and neighborhood.
        for vac in vacs:
            # Get the title from the <a></a> tag.
            title = vac.xpath('a/text()').extract_first()
            # Get the posted date from
            # <time class="result-date" datetime="2019-03-06 18:34" title="Wed 06 Mar 06:34:28 PM">Mar 6</time>.
            # Use @datetime for the datetime attribute.
            pdate = vac.xpath('time/@datetime').extract_first().split()[0]
            # Get the rental price from <span class="result-price">$84</span>.
            rprice = vac.xpath('span/span[@class="result-price"]/text()').extract_first()
            # Get the number of bedrooms from <span class="housing">2br - 760ft<sup>2</sup> - </span>
            # and clean up the extra parts.
            nbedroom = str(vac.xpath('span/span[@class="housing"]/text()').extract_first()).split('-')[0].strip()
            # Get the neighborhood from <span class="result-hood"> (*** - *****)</span>.
            hood = re.sub('[()]', '', str(vac.xpath('span/span[@class="result-hood"]/text()').extract_first())).strip()
            # Get the address of each vacation house's description page.
            vacaddress = vac.xpath('a/@href').extract_first()
            # We need to open the URL of each house and scrape the description,
            # passing the fields collected so far to parse_page via meta.
            yield Request(vacaddress, callback=self.parse_page,
                          meta={'URL': vacaddress, 'Title': title, 'Posted Date': pdate,
                                'Rental Price': rprice, 'Number of bedrooms': nbedroom,
                                'Neighborhood': hood})

    # Extract the description from each vacation house's page.
    def parse_page(self, response):
        # Pull the variables back out of meta.
        url = response.meta.get('URL')
        title = response.meta.get('Title')
        pdate = response.meta.get('Posted Date')
        rprice = response.meta.get('Rental Price')
        nbedroom = response.meta.get('Number of bedrooms')
        hood = response.meta.get('Neighborhood')
        # Get the description.
        description = "".join(line for line in response.xpath('//*[@id="postingbody"]/text()').extract())
        yield {'Title': title, 'Posted Date': pdate, 'Rental Price': rprice,
               'Number of bedrooms': nbedroom, 'Neighborhood': hood,
               'Description': description}
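A side note on the meta dictionary: it works for hand-offs like this, but meta is also shared with Scrapy's own middlewares. Since Scrapy 1.7, cb_kwargs is the dedicated channel for passing values to a callback; each key arrives as a named parameter of the callback function. Below is a minimal sketch of the same hand-off (the spider name is hypothetical and the fields are trimmed for brevity):
import scrapy
from scrapy import Request


class CbKwargsSketchSpider(scrapy.Spider):
    # Hypothetical minimal spider, only to illustrate cb_kwargs (Scrapy 1.7+).
    name = 'cb_kwargs_sketch'
    start_urls = ['https://seattle.craigslist.org/d/vacation-rentals/search/vac/']

    def parse(self, response):
        for vac in response.xpath('//p[@class="result-info"]'):
            # Each key in cb_kwargs becomes a keyword argument of the callback.
            yield Request(vac.xpath('a/@href').extract_first(),
                          callback=self.parse_page,
                          cb_kwargs={'title': vac.xpath('a/text()').extract_first()})

    def parse_page(self, response, title):
        yield {'Title': title,
               'Description': "".join(response.xpath('//*[@id="postingbody"]/text()').extract())}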
RUN SPIDER
To put our spider to work, run the crawl command in a terminal or CMD:
$ scrapy crawl vacation_rentals -o result-titles.csv
The -o flag means output the data into a file, and result-titles.csv is the file's name.
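Because Scrapy infers the export format from the file extension, the same command can produce other formats without touching the spider:
$ scrapy crawl vacation_rentals -o result.json
$ scrapy crawl vacation_rentals -o result.jl
The .json file holds one JSON array of all items, while .jl is JSON Lines with one item per line, which is easier to stream and to append to.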