Page Scanner

Find dead links, 404 and 301 responses, and more in your content, from the command line or the admin.

Install

composer require pushword/page-scanner

Usage

Command line

php bin/console pw:page-scan                  # scan all hosts
php bin/console pw:page-scan localhost.dev    # scan a specific host
php bin/console pw:page-scan --skip-external  # skip external URL checks
php bin/console pw:page-scan --limit=100      # stop after 100 errors

Admin

The scanner is accessible via the admin menu. Results are cached and refreshed automatically.

API

With the API extension installed, the same scan is available over REST at /api/page-scan for scripted and agent workflows: POST dispatches the scan in the background and GET polls for the result.

What it checks

  • Internal links: page exists, is published, is not a redirect
  • External links: HTTP status codes (parallel checking with caching)
  • Anchor links: target element exists in the page
  • Media files: referenced images and files exist
  • Parent pages: parent-child host consistency
  • TODO comments: deferred actions tied to page publication (see below)

TODO comments

When writing a page, you can leave TODO comments to remind yourself of actions to take when another page gets published.

Use <!--TODO:linkWhenPublished slug --> where you want a link to appear once the target page is published:

Read more about this topic
<!--TODO:linkWhenPublished my-upcoming-article -->
in a future article.

You can include the intended anchor text:

<!--TODO:linkWhenPublished my-upcoming-article "read our detailed guide" -->

Action when published

Use <!--TODO:doWhenPublished slug "instruction" --> for generic actions:

<!--TODO:doWhenPublished product-launch "add comparison table here" -->

Multi-host support

By default, the slug is resolved against the current page's host. To reference a page on another host, prefix the slug with the host name:

<!--TODO:linkWhenPublished other-site.com/target-slug "see also" -->
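
To make the comment syntax above concrete, here is a minimal Python sketch that extracts these TODO comments from page content. It is not the scanner's actual parser: the regex, the `Todo` tuple, and `parse_todos` are hypothetical names, and the rule "a first segment containing a dot is a host" is an assumption made only for this illustration.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical regex covering the two comment forms shown above:
#   <!--TODO:linkWhenPublished [host/]slug ["anchor text"] -->
#   <!--TODO:doWhenPublished   [host/]slug "instruction" -->
TODO_RE = re.compile(
    r'<!--\s*TODO:(?P<kind>linkWhenPublished|doWhenPublished)\s+'
    r'(?P<target>[^\s"]+)'          # slug, optionally prefixed with a host
    r'(?:\s+"(?P<text>[^"]*)")?'    # optional quoted anchor text or instruction
    r'\s*-->'
)


class Todo(NamedTuple):
    kind: str
    host: Optional[str]   # None means "the current page's host"
    slug: str
    text: Optional[str]


def parse_todos(content: str) -> list:
    """Return every TODO comment found in a page's content, in order."""
    todos = []
    for match in TODO_RE.finditer(content):
        target = match.group("target")
        head, sep, tail = target.partition("/")
        # Assumption: a first path segment containing a dot is a host name.
        if sep and "." in head:
            host, slug = head, tail
        else:
            host, slug = None, target
        todos.append(Todo(match.group("kind"), host, slug, match.group("text")))
    return todos
```

Calling `parse_todos` on the multi-host example above would yield one `Todo` with host `other-site.com`, slug `target-slug`, and text `see also`.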

Scanner behavior

Target page state          Scanner action
Slug not found             Warning: unknown page
Exists but not published   Silent (still waiting)
Now published              Warning: replace TODO with a link (or follow the instruction)
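
The three cases above amount to a small decision function. The Python sketch below mirrors them under stated assumptions: `PageState`, `scanner_warning`, and the exact warning wording are hypothetical and are not taken from the scanner's source.

```python
from enum import Enum
from typing import Optional


class PageState(Enum):
    NOT_FOUND = "slug not found"
    UNPUBLISHED = "exists but not published"
    PUBLISHED = "now published"


def scanner_warning(kind: str, target: str, state: PageState,
                    text: Optional[str] = None) -> Optional[str]:
    """Return a warning for one TODO comment, or None when the scanner stays silent.

    The message wording is illustrative only.
    """
    if state is PageState.NOT_FOUND:
        return f"unknown page: {target}"
    if state is PageState.UNPUBLISHED:
        return None  # still waiting for publication, nothing to report
    # The target page is now published.
    if kind == "doWhenPublished":
        return f"{target} is published: {text}"
    return f"{target} is published: replace TODO with a link"
```

Returning `None` for the unpublished case keeps "nothing to report" distinct from an empty warning string.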

Configuration

# config/packages/pushword_page_scanner.yaml
pushword_page_scanner:
  min_interval_between_scan: 'PT5M'        # minimum interval between scans (ISO 8601 duration; PT5M = 5 minutes)
  external_url_cache_ttl: 86400            # external URL cache TTL in seconds (24h)
  parallel_batch_size: 50                  # URLs checked in parallel per batch
  url_check_timeout_ms: 10000              # timeout per external URL check (ms)
  skip_external_url_check: false           # skip external URL validation
  links_to_ignore:                         # glob patterns for links to skip
    - 'https://www.example.tld/*'
    - '/admin/*'
  errors_to_ignore: []                     # error message patterns to suppress

errors_to_ignore accepts two kinds of patterns: global patterns ("message"), which apply to every page, and per-route patterns ("host/slug: message"), which apply only to the matching page. Both support fnmatch wildcards.
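
As a rough illustration of how such patterns could behave, the Python sketch below uses the standard `fnmatch` module. It rests on several assumptions: that a global pattern is compared with the error message alone, that a per-route pattern is compared with the string "host/slug: message", and that `links_to_ignore` globs follow the same wildcard rules. The function names and sample messages are hypothetical, and Python's `fnmatch` may differ in edge cases from the matching the scanner actually performs.

```python
from fnmatch import fnmatchcase


def error_is_ignored(host: str, slug: str, message: str, patterns: list) -> bool:
    """True if any errors_to_ignore pattern matches this error."""
    # Assumed form of the string a per-route pattern is matched against.
    route_message = f"{host}/{slug}: {message}"
    return any(
        fnmatchcase(message, pattern) or fnmatchcase(route_message, pattern)
        for pattern in patterns
    )


def link_is_ignored(url: str, patterns: list) -> bool:
    """True if any links_to_ignore glob matches the link."""
    return any(fnmatchcase(url, pattern) for pattern in patterns)


# Hypothetical configuration values, for illustration only.
errors_to_ignore = ["*404*", "localhost.dev/blog/*: *redirect*"]
links_to_ignore = ["https://www.example.tld/*", "/admin/*"]
```

With these values, a redirect error on `localhost.dev/blog/post` would be suppressed by the per-route pattern, while the same error on `localhost.dev/about` would still be reported.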