Scraping a JSON response with Scrapy


Scrapy is a popular Python library for web scraping, which provides an easy and efficient way to extract data from websites for a variety of tasks including data mining and information processing. In addition to being a general-purpose web crawler, Scrapy may also be used to retrieve data via APIs.

One of the most common data formats returned by APIs is JSON, which stands for JavaScript Object Notation. In this article, we’ll look at how to scrape a JSON response using Scrapy.

To install Scrapy, run the following command in your terminal:

pip install scrapy

Example

Now we’ll look at an example that extracts data from the Bored API public endpoint (https://www.boredapi.com/api/activity).

Here’s what the actual data returned looks like:

{
  "activity": "Learn calligraphy",
  "type": "education",
  "participants": 1,
  "price": 0.1,
  "link": "",
  "key": "4565537",
  "accessibility": 0.1
}
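Before wiring this into a spider, you can confirm the structure of the payload with the standard-library json module alone. This is a standalone sketch using the sample response shown above:

```python
import json

# The sample payload returned by the Bored API, copied from above.
payload = '''{
  "activity": "Learn calligraphy",
  "type": "education",
  "participants": 1,
  "price": 0.1,
  "link": "",
  "key": "4565537",
  "accessibility": 0.1
}'''

# json.loads turns the JSON text into a plain Python dictionary,
# which is exactly what the spider's parse method will do later.
data = json.loads(payload)
print(data["activity"], data["type"], data["participants"])
```

The same three keys are the ones the spider below pulls out of the live response.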

Python3

import scrapy
import json


class Spider(scrapy.Spider):
    name = "bored"

    def start_requests(self):
        # The Bored API endpoint that returns a random activity as JSON.
        url = "https://www.boredapi.com/api/activity"
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Deserialize the JSON body into a Python dictionary.
        data = json.loads(response.text)

        activity = data["activity"]
        activity_type = data["type"]  # renamed to avoid shadowing the built-in "type"
        participants = data["participants"]

        yield {"Activity": activity,
               "Type": activity_type,
               "Participants": participants}

Explanation:

Here we have a Scrapy spider named Spider. The spider has 3 main parts:

  • The name variable – sets the name of the spider to “bored”.
  • The start_requests method – initiates the request to the API endpoint at “https://www.boredapi.com/api/activity”. The method yields a Scrapy request object and passes it to the parse method.
  • The parse method – handles the response from the API endpoint. The method loads the JSON response data into a Python dictionary using the json.loads function. Then, it extracts the values of the “activity”, “type”, and “participants” keys from the dictionary and stores them in local variables. Finally, it yields a dictionary with the activity, type, and participants as keys and their corresponding values.
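As an aside, on Scrapy 2.2 and later the response object exposes a .json() helper, so the json module isn’t strictly needed inside parse. A minimal sketch of that variant (the FakeResponse class below is only a stand-in so the parsing logic can be exercised outside a running crawl):

```python
import json

class FakeResponse:
    """Stand-in for a Scrapy text response, for illustration only."""
    def __init__(self, text):
        self.text = text

    def json(self):
        # Mirrors Response.json() (Scrapy >= 2.2), which deserializes
        # the response body as JSON.
        return json.loads(self.text)

def parse(response):
    # Same logic as the spider's parse method, using response.json().
    data = response.json()
    yield {"Activity": data["activity"],
           "Type": data["type"],
           "Participants": data["participants"]}

sample = '{"activity": "Learn calligraphy", "type": "education", "participants": 1}'
item = next(parse(FakeResponse(sample)))
print(item)
```

Inside a real spider, you would simply call response.json() instead of json.loads(response.text).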

To run this file, type the following into your terminal:

scrapy runspider <file name>

Output:

[Screenshot: the output of the above command]

Now, this output will contain a lot of log lines, so it’s better to store your parsed items in a separate file. You can do this by adding the -o flag to the command, followed by the output file name:

scrapy runspider <file name> -o activity.json -L ERROR

The “-L ERROR” flag suppresses all output other than error messages.

activity.json should then contain the yielded item, e.g.:

[
{"Activity": "Learn calligraphy", "Type": "education", "Participants": 1}
]


