🔮 A Node.js scraper for humans.
Want to save time, or are you not using Node.js? Try our hosted API.
```sh
# Using npm
npm install --save scrape-it

# Using yarn
yarn add scrape-it
```
:bulb: ProTip: You can install the CLI version of this module by running `npm install --global scrape-it-cli` (or `yarn global add scrape-it-cli`).
Here are some frequent questions and their answers.
`scrape-it` has only a simple request module for making requests. That means you cannot directly parse ajax pages with it, but in general you will have these scenarios:

- You can call with `scrape-it` the ajax url (e.g. `example.com/api/that-endpoint`) and you will be able to parse the response.
- Use the `.scrapeHTML` method from scrape-it once you get the HTML loaded on the page.
There is no fancy way to crawl pages with `scrape-it`. For simple scenarios, you can parse the list of urls from the initial page and then, using Promises, parse each page. Also, you can use a different crawler to download the website and then use the `.scrapeHTML` method to scrape the local files.
Use the `.scrapeHTML` method to parse the HTML read from the local files using `fs.readFile`.
```js
const scrapeIt = require("scrape-it")

// Promise interface
scrapeIt("https://ionicabizau.net", {
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}).then(({ data, response }) => {
    console.log(`Status Code: ${response.statusCode}`)
    console.log(data)
})

// Callback interface
scrapeIt("https://ionicabizau.net", {
    // Fetch the articles
    articles: {
        listItem: ".article"
      , data: {

            // Get the article date and convert it into a Date object
            createdAt: {
                selector: ".date"
              , convert: x => new Date(x)
            }

            // Get the title
          , title: "a.article-title"

            // Nested list
          , tags: {
                listItem: ".tags > span"
            }

            // Get the content
          , content: {
                selector: ".article-content"
              , how: "html"
            }

            // Get attribute value of root listItem by omitting the selector
          , classes: {
                attr: "class"
            }
        }
    }

    // Fetch the blog pages
  , pages: {
        listItem: "li.page"
      , name: "pages"
      , data: {
            title: "a"
          , url: {
                selector: "a"
              , attr: "href"
            }
        }
    }

    // Fetch some other data from the page
  , title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}, (err, { data }) => {
    console.log(err || data)
})
// { articles:
//    [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
//        title: 'Pi Day, Raspberry Pi and Command Line',
//        tags: [Object],
//        content: 'Everyone knows (or should know)...',
//        classes: [Object] },
//      { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
//        title: 'How I ported Memory Blocks to modern web',
//        tags: [Object],
//        content: 'Playing computer games is a lot of fun. ...',
//        classes: [Object] },
//      { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
//        title: 'How to convert JSON to Markdown using json2md',
//        tags: [Object],
//        content: 'I love and ...',
//        classes: [Object] } ],
//   pages:
//    [ { title: 'Blog', url: '/' },
//      { title: 'About', url: '/about' },
//      { title: 'FAQ', url: '/faq' },
//      { title: 'Training', url: '/training' },
//      { title: 'Contact', url: '/contact' } ],
//   title: 'Ionică Bizău',
//   desc: 'Web Developer, Linux geek and Musician',
//   avatar: '/images/logo.png' }
```
There are a few ways to get help:
`scrapeIt(url, opts, cb)`

A scraping module for humans.

- `url`: The page url or request options.
- `opts`: The options passed to the `scrapeHTML` method.
- `cb`: The callback function.

The returned promise resolves with (and the callback receives) an object containing:

- `data` (Object): The scraped data.
- `$` (Function): The Cheerio function. This may be handy to do some other manipulation on the DOM, if needed.
- `response` (Object): The response object.
- `body` (String): The raw body as a string.
`scrapeIt.scrapeHTML($, opts)`

Scrapes the data in the provided element. For the format of the selectors, please refer to the Selectors section of the Cheerio library.
- `$`: The input element.
- `opts` (Object): An object containing the scraping information. If you want to scrape a list, you have to use the `listItem` selector:
  - `listItem` (String): The list item selector.
  - `data` (Object): The fields to include in the list objects:
    - `selector` (String): The selector.
    - `convert` (Function): An optional function to change the value.
    - `how` (Function|String): A function or function name to access the value.
    - `attr` (String): If provided, the value will be taken based on the attribute name.
    - `trim` (Boolean): If `false`, the value will not be trimmed (default: `true`).
    - `closest` (String): If provided, returns the first ancestor of the given element.
    - `eq` (Number): If provided, it will select the nth element.
    - `texteq` (Number): If provided, it will select the nth direct text child. Deep text child selection is not possible yet. Overwrites the `how` key.
    - `listItem` (Object): An object keeping the recursive schema of the `listItem` object. This can be used to create nested lists.
Example:
```js
{
    articles: {
        listItem: ".article"
      , data: {
            createdAt: {
                selector: ".date"
              , convert: x => new Date(x)
            }
          , title: "a.article-title"
          , tags: {
                listItem: ".tags > span"
            }
          , content: {
                selector: ".article-content"
              , how: "html"
            }
          , traverseOtherNode: {
                selector: ".upperNode"
              , closest: "div"
              , convert: x => x.length
            }
        }
    }
}
```
If you want to collect specific data from the page, just use the same schema used for the `data` field.
Example:
```js
{
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}
```
Have an idea? Found a bug? See how to contribute.
I open-source almost everything I can, and I try to reply to everyone needing help using these projects. Obviously, this takes time. You can integrate and use these projects in your applications for free! You can even change the source code and redistribute it (even resell it).

However, if you get some profit from this or just want to encourage me to continue creating stuff, there are a few ways you can do it:
Bitcoin: You can send me bitcoins at this address:
1P9BRsmazNQcuyTxEqveUsnf5CERdq35V6
Thanks! :heart:
If you are using this library in one of your projects, add it to this list. :sparkles:
@web-master/node-web-scraper
proxylist
mit-ocw-scraper
beervana-scraper
cnn-market
@tryghost/mg-webscraper
bandcamp-scraper
blockchain-notifier
dncli
degusta-scrapper
trump-cabinet-picks
cevo-lookup
camaleon
scrape-vinmonopolet
do-fn
university-news-notifier
selfrefactor
parn
picarto-lib
mix-dl
jishon
sahibinden
sahibindenServer
sgdq-collector
ubersetzung
ui-studentsearch
paklek-cli
egg-crawler
@thetrg/gibson
jobs-fetcher
fmgo-marketdata
rayko-tools
leximaven
codinglove-scraper
vandalen.rhyme.js
uniwue-lernplaetze-scraper
spon-market
macoolka-net-scrape
gatsby-source-bandcamp
salesforcerelease-parser
yu-ncov-scrape-dxy
rs-api
startpage-quick-search
fa.js
helyesiras
flamescraper
covidau
3abn
scrape-it-cli
codementor
u-pull-it-ne-parts-finder
blankningsregistret
steam-workshop-scraper
scrapos-worker
node-red-contrib-scrape-it
growapi
@ben-wormald/bandcamp-scraper
bible-scraper