Arachnid Web Crawler

This library will crawl all unique internal links found on a given website up to a specified maximum page depth.

This library uses the symfony/panther and FriendsOfPHP/Goutte libraries to scrape site pages and extract the main SEO-related information, including: title, h1 elements, h2 elements, statusCode, contentType, meta description, meta keywords and canonicalLink.

This library is based on the original blog post by Zeid Rashwani here:

http://zrashwani.com/simple-web-spider-php-goutte

Josh Lockhart adapted the original blog post's code (with permission) for Composer and Packagist and updated the syntax to conform with the PSR-2 coding standard.


How to Install

You can install this library with Composer. Drop this into your composer.json manifest file:

```json
{
    "require": {
        "zrashwani/arachnid": "dev-master"
    }
}
```

Then run `composer install`.

Getting Started

Basic Usage:

Here's a quick demo to crawl a website:

```php
<?php
require 'vendor/autoload.php';

$url = 'http://www.example.com';
$linkDepth = 3;
// Initiate the crawl; by default it will use the HTTP client (GoutteClient)
$crawler = new \Arachnid\Crawler($url, $linkDepth);
$crawler->traverse();

// Get link data
$links = $crawler->getLinksArray(); // to get links as objects, use the getLinks() method
print_r($links);
```
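
Each entry of the returned array describes one crawled page and carries the SEO fields listed above. The exact array keys are not spelled out in this README, so the sketch below, which just prints a few of those fields per page, treats the key names ('title', 'statusCode', 'metaDescription') and the URL-keyed layout as assumptions; check the output of print_r($links) to confirm them for your installed version:

```php
<?php
require 'vendor/autoload.php';

$crawler = new \Arachnid\Crawler('http://www.example.com', 2);
$links   = $crawler->traverse()->getLinksArray();

// ASSUMPTIONS: the array is keyed by page URL, and each entry exposes
// 'title', 'statusCode' and 'metaDescription' keys matching the field list
// in this README; adjust the keys after inspecting print_r($links).
foreach ($links as $url => $info) {
    echo $url, PHP_EOL;
    echo '  title:       ', ($info['title'] ?? '-'), PHP_EOL;
    echo '  status code: ', ($info['statusCode'] ?? '-'), PHP_EOL;
    echo '  description: ', ($info['metaDescription'] ?? '-'), PHP_EOL;
}
```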

Enabling Headless Browser mode:

Headless browser mode can be enabled so that the crawler uses the Chrome engine in the background, which is useful for getting the contents of JavaScript-based sites.

The enableHeadlessBrowserMode method sets the scraping adapter to PantherChromeAdapter, which is based on the Symfony Panther library:

```php
$crawler = new \Arachnid\Crawler($url, $linkDepth);
$crawler->enableHeadlessBrowserMode()
        ->traverse()
        ->getLinksArray();
```

In order to use this, you need to have chromedriver installed on your machine. You can use dbrekelmans/browser-driver-installer to install chromedriver locally:

```
composer require --dev dbrekelmans/bdi
./vendor/bin/bdi driver:chromedriver drivers
```

Advanced Usage:

Set additional options on the underlying HTTP client by specifying an array of options in the constructor, or by creating an HTTP client scraper with the desired options:

```php
<?php
// third parameter is an array of options for the underlying HTTP client
// (here, HTTP basic-auth credentials)
$clientOptions = ['auth_basic' => array('username', 'password')];
$crawler = new \Arachnid\Crawler('http://github.com', 2, $clientOptions);

// or by creating and setting a scrap client
$options = array(
    'verify_host' => false,
    'verify_peer' => false,
    'timeout' => 30,
);

$scrapperClient = CrawlingFactory::create(CrawlingFactory::TYPE_HTTP_CLIENT, $options);
$crawler->setScrapClient($scrapperClient);
```

You can inject a PSR-3 compliant logger object to monitor crawler activity (for example, Monolog):

```php
<?php
$crawler = new \Arachnid\Crawler($url, $linkDepth); // ... initialize crawler

// set logger for crawler activity (compatible with PSR-3)
$logger = new \Monolog\Logger('crawler logger');
$logger->pushHandler(new \Monolog\Handler\StreamHandler(sys_get_temp_dir().'/crawler.log'));
$crawler->setLogger($logger);
```

You can set the crawler to visit only pages matching specific criteria by specifying a callback closure via the `filterLinks` method:

```php
<?php
// filter links according to a specific callback given as a closure
$links = $crawler->filterLinks(function ($link) {
        // crawl only links containing the /blog/ path
        return (bool) preg_match('/.*\/blog.*$/u', $link);
    })
    ->traverse()
    ->getLinks();
```

You can use the `LinksCollection` class to get simple statistics about the links, as follows:

```php
<?php
$links = $crawler->traverse()
                 ->getLinks();
$collection = new LinksCollection($links);

// getting broken links
$brokenLinks = $collection->getBrokenLinks();

// getting links for a specific depth
$depth2Links = $collection->getByDepth(2);

// getting external links inside the site
$externalLinks = $collection->getExternalLinks();
```
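
The collection methods above return filtered sets of the same link data. As a minimal follow-up sketch, assuming the returned collection can be iterated with foreach and counted, and that it is keyed by page URL (neither assumption is documented in this README), you could report broken links like this:

```php
<?php
// ASSUMPTIONS: $brokenLinks (from $collection->getBrokenLinks() above) is
// iterable and countable, and its keys are the page URLs; verify against
// the actual LinksCollection behaviour before relying on this.
foreach ($brokenLinks as $url => $link) {
    echo 'broken link: ', $url, PHP_EOL;
}
echo count($brokenLinks), ' broken link(s) found', PHP_EOL;
```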

How to Contribute

  1. Fork this repository
  2. Create a new branch for each feature or improvement
  3. Apply your code changes along with corresponding unit test
  4. Send a pull request from each feature branch

It is very important to separate new features or improvements into separate feature branches, and to send a pull request for each branch. This allows me to review and pull in new features or improvements individually.

All pull requests must adhere to the PSR-2 coding standard.

System Requirements

  • PHP 7.2.0+

Authors

  • Zeid Rashwani
  • Josh Lockhart

License

MIT Public License
