Site parser in Node.js


In this article we will build a site parser in Node.js, or simply in JavaScript. I think you will find it interesting.

Project Preparation:


You must have Node.js installed; if you don’t know how to do that, read these articles:


Now you need to prepare the project. To do this, create a folder where it will be stored, open it in a terminal, and enter this command:

npm init -y

This initializes the project. Now we install all the libraries we need via npm:

npm install --save request request-promise cheerio

That’s the end of the preparation.

Let’s write a JavaScript parser:


We’ll start with a simple example and gradually move on to more complex ones.

Getting the HTML page:


As an example, let’s fetch the Wikipedia page listing the American presidents. Open a text editor and write a function that gets the page’s HTML code.

const rp = require('request-promise');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';
 
rp(url)
  .then(function(html){
    // Success: log the raw HTML of the page
    console.log(html);
  })
  .catch(function(err){
    // Handle request errors
    console.error(err);
  });
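Save the code to a file, for example index.js (the file name here is just an assumption, any name works), and run it with Node:

node index.js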

The entire HTML document should appear in the terminal.

Using Chrome DevTools:


Cool, we got the raw HTML from the web page! But now we need to make sense of this giant chunk of text. To do that, we need to use Chrome DevTools so we can easily search for what we need in the HTML.

Using Chrome DevTools is simple: just open Google Chrome and right-click on the item you want to look at (here I right-click on George Washington):

(Screenshot: looking at the tags for the parser)

Now just click “Inspect” and Chrome will open the DevTools panel, allowing you to easily examine the page’s source HTML.

(Screenshot: checking the tag for the Node.js parser)

Parsing the HTML with Cheerio.js:


Great, Chrome DevTools now shows us the exact pattern we should look for in the HTML: a big tag with a hyperlink (an a tag) inside it.

Let’s use Cheerio.js to parse the HTML we received earlier and return a list of links to the individual Wikipedia pages of the US presidents.

const rp = require('request-promise');
const $ = require('cheerio');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';
 
rp(url)
  .then(function(html){
    // Got the HTML: count the matches and inspect them
    console.log($('big > a', html).length);
    console.log($('big > a', html));
  })
  .catch(function(err){
    // Handle request errors
    console.error(err);
  });

This is what should be displayed in the terminal:

45
{ '0':
  { type: 'tag',
    name: 'a',
    attribs: { href: '/wiki/George_Washington', title: 'George Washington' },
    children: [ [Object] ],
    next: null,
    prev: null,
    parent:
      { type: 'tag',
        name: 'big',
        attribs: {},
        children: [Array],
        next: null,
        prev: null,
        parent: [Object] } },
  '1':
    { type: 'tag'
  ...

In essence, the Cheerio.js library lets you pick elements out of an HTML string by CSS selector and get back objects with all of their properties; to learn more about the library, see the Cheerio documentation.
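To make that concrete, here is a minimal sketch of my own (not from the original article) that loads a hard-coded HTML string with cheerio.load and queries it with the same kind of selector:

const cheerio = require('cheerio');
 
// A tiny hard-coded HTML string, just for illustration
const $ = cheerio.load('<big><a href="/wiki/George_Washington">George Washington</a></big>');
 
console.log($('big > a').attr('href')); // /wiki/George_Washington
console.log($('big > a').text());       // George Washington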

We check that exactly 45 elements (the number of US presidents) are returned, which means that there are no additional hidden big tags on the page.

Now we can loop through and collect the links to the Wikipedia pages of all 45 presidents, taking them from the attribs property of each element.

const rp = require('request-promise');
const $ = require('cheerio');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';
 
rp(url)
  .then(function(html){
    // Got the page: collect the href of each of the 45 matched links
    const wikiUrls = [];
    for (let i = 0; i < 45; i++) {
      wikiUrls.push($('big > a', html)[i].attribs.href);
    }
    console.log(wikiUrls);
  })
  .catch(function(err){
    // Handle request errors
    console.error(err);
  });

This is what should appear in the terminal:

[
  '/wiki/George_Washington',
  '/wiki/John_Adams',
  '/wiki/Thomas_Jefferson',
  '/wiki/James_Madison',
  '/wiki/James_Monroe',
  '/wiki/John_Quincy_Adams',
  '/wiki/Andrew_Jackson',
  ...
]
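As a side note, the count does not have to be hard-coded. Cheerio’s .map() together with .get() collects every matched link into a plain array; here is a sketch of that variation (my own, not from the original article):

const rp = require('request-promise');
const $ = require('cheerio');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';
 
rp(url)
  .then(function(html){
    // .map() visits every matched element; .get() converts the
    // Cheerio collection into a plain JavaScript array
    const wikiUrls = $('big > a', html)
      .map(function(i, el){
        return el.attribs.href;
      })
      .get();
    console.log(wikiUrls);
  })
  .catch(function(err){
    console.error(err);
  });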

We’ve got links to all 45 US presidents. In the same way, you could build new tags from these links and render them on a site, or send them via a REST API.
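For example, here is a rough sketch of the REST idea using Node’s built-in http module (the port and the JSON response shape are my assumptions, not part of the original article):

const http = require('http');
const rp = require('request-promise');
const $ = require('cheerio');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';
 
// Assumption: serve the scraped list as JSON on port 3000
http.createServer(function(req, res){
  rp(url)
    .then(function(html){
      const wikiUrls = [];
      for (let i = 0; i < 45; i++) {
        wikiUrls.push($('big > a', html)[i].attribs.href);
      }
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify(wikiUrls));
    })
    .catch(function(err){
      res.writeHead(500);
      res.end('Failed to fetch the page');
    });
}).listen(3000);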

This isn’t the whole translated article yet, but I think it’s enough to build a good Node.js parser. If you’d like the rest translated, feel free to say so in the comments.

Conclusion:


In this article you’ve seen how to make a Node.js site parser. I hope you found it interesting and useful.
