Description

Site-specific article extraction rules to aid content extractors, feed readers, and 'read later' applications.

Full-Text RSS site config files

Full-Text RSS, our article extraction tool, uses site-specific extraction rules to improve results. Each time a URL is processed, it checks whether there are extraction rules for that site. If no rules are found, it tries to detect the content block automatically.

This repository contains the site-specific extraction rules we rely on in Full-Text RSS.

Contributing changes

We run automated tests on these files to detect issues. If you'd like to help keep these up to date, please look at the test results and see which files you'd like to contribute fixes for.

We chose GitHub for this set of files because they offer one feature which we hope will make contributing changes easier: file editing through the web interface.

You can now make changes to any of our site config files and request that your changes be pulled into the main set we maintain. This is what GitHub calls the Fork and Pull model:

The Fork & Pull Model lets anyone fork an existing repository and push changes to their personal fork without requiring access be granted to the source repository. The changes must then be pulled into the source repository by the project maintainer. This model reduces the amount of friction for new contributors and is popular with open source projects because it allows people to work independently without upfront coordination.

When we receive a pull request we'll review the changes and if everything's okay we'll update our copy.

If a site is not in our set, you can create a file for it in the same way. See Creating files on GitHub.

How to write a site config file

The quickest and simplest way is to use our point-and-click interface. It is a simple tool, intended only to create a rule that extracts the correct content block.

For further refinements, e.g. selecting the title, stripping elements, dealing with multi-page articles, please see our help page.
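To give a rough idea of what such a file looks like, here is a minimal sketch that parses a small, made-up site config. It assumes the files are plain text with one "directive: value" rule per line and "#" for comments; the directive names used (title, body, strip, test_url) and every XPath expression are illustrative assumptions, so check the help page for the directives that are actually supported. parse_site_config is a hypothetical helper, not part of Full-Text RSS.

```python
from collections import defaultdict

# Made-up rules for example.com. Directive names and XPath values are
# assumptions for illustration; the help page is the authoritative reference.
SAMPLE_CONFIG = """\
# Hypothetical rules for example.com
title: //h1[@class='headline']
body: //div[@id='article-body']
strip: //div[@class='related-links']
strip: //aside
test_url: http://example.com/news/some-article
"""


def parse_site_config(text):
    """Collect 'directive: value' lines into a dict of lists.

    Hypothetical helper: a directive may appear more than once, so each
    name maps to a list of values in file order.
    """
    rules = defaultdict(list)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first colon only, since values (URLs, XPath) may
        # contain colons themselves.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'directive: value' line
        rules[name.strip()].append(value.strip())
    return dict(rules)


if __name__ == "__main__":
    for directive, values in parse_site_config(SAMPLE_CONFIG).items():
        print(directive, values)
```

Keeping every directive as a list means repeated rules, such as several strip lines, are preserved in the order they were written.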

File naming

Use example.com.txt for:
  • www.example.com
  • example.com

Use .example.com.txt for:
  • sport.example.com
  • news.example.com
  • environment.example.com
  • etc.

Use sport.example.com.txt to target just that sub-domain:
  • sport.example.com

Note: .example.com.txt will not match www.example.com or example.com.
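
The naming rules above can be sketched as a small matching function. This is only an illustration of the rules as described here, not the code Full-Text RSS uses; config_matches is a hypothetical name, and treating a leading "www." as optional for sub-domain files such as sport.example.com.txt is an assumption that goes beyond what is stated above.

```python
def config_matches(filename, hostname):
    """Return True if a site config file name applies to a hostname.

    Hypothetical helper that follows the file naming rules in this README:
      example.com.txt   -> example.com and www.example.com
      .example.com.txt  -> any sub-domain except www, and not the bare domain
    """
    pattern = filename[:-4] if filename.endswith(".txt") else filename
    pattern = pattern.lower()
    host = hostname.lower()

    if pattern.startswith("."):
        # Wildcard file: the host must be a sub-domain of the pattern.
        # www is treated like the bare domain, so it is excluded.
        return host.endswith(pattern) and host != "www" + pattern

    # Exact file: match the host itself or its www. variant.
    # (Applying the www. rule to deeper sub-domains is an assumption.)
    return host == pattern or host == "www." + pattern


if __name__ == "__main__":
    for name, host in [
        ("example.com.txt", "www.example.com"),
        (".example.com.txt", "news.example.com"),
        (".example.com.txt", "example.com"),
        ("sport.example.com.txt", "sport.example.com"),
    ]:
        print(name, host, config_matches(name, host))
```

The order in which an exact file and a wildcard file are tried when both exist is not stated here, so the sketch only answers whether a single file applies to a host.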

Instapaper

When we introduced site patterns, we chose to adopt the same format used by Instapaper. This allows us to make use of the existing extraction rules contributed by Instapaper users.

Marco, Instapaper's creator, graciously opened up the database of contributions to everyone:

And, recognizing that your efforts could be useful to a wide range of other tools and services, I'll make the list of all of these site-specific configurations available to the public, free, with no strings attached.

Most of the extraction rules in our set are borrowed from Instapaper. Instapaper maintained its list at instapaper.com/bodytext/, which has not been available since Instapaper was sold.

Testing site config files

Currently, you need a copy of Full-Text RSS to test changes to the site config files. We will try to make this process easier in the future.
