Developer: nipunsadvilkar
License: MIT

Description

🐍💯 pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out of the box.

pySBD: Python Sentence Boundary Disambiguation (SBD)

pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out of the box.

This project is a direct port of the Ruby gem Pragmatic Segmenter, which provides rule-based sentence boundary detection.
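
To give a feel for what "rule-based" means here, the following is a deliberately tiny sketch. It is not pySBD's or Pragmatic Segmenter's actual rule set: `ABBREVIATIONS` and `naive_rule_based_split` are hypothetical names invented for illustration, and the only rules assumed are that a period does not end a sentence when it follows a listed abbreviation or a single capital initial.

```python
import re

# Hypothetical, deliberately small abbreviation list (not pySBD's rules).
ABBREVIATIONS = {"p", "mr", "mrs", "dr", "e.g", "i.e"}


def naive_rule_based_split(text):
    """Split after ., ! or ? unless a period follows an abbreviation or initial."""
    sentences = []
    start = 0
    # Candidate boundaries: terminal punctuation followed by whitespace.
    for match in re.finditer(r"[.!?]\s+", text):
        words = text[start:match.start()].split()
        last_word = words[-1] if words else ""
        is_period = text[match.start()] == "."
        is_abbreviation = last_word.lower() in ABBREVIATIONS
        is_initial = len(last_word) == 1 and last_word.isupper()
        if is_period and (is_abbreviation or is_initial):
            continue  # Rule says this is not a boundary; keep scanning.
        sentences.append(text[start:match.start() + 1])
        start = match.end()
    # Whatever remains after the last accepted boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(naive_rule_based_split("My name is Jonas E. Smith. Please turn to p. 55."))
# ['My name is Jonas E. Smith.', 'Please turn to p. 55.']
```

Even this toy version breaks easily (for example, a sentence ending in the pronoun "I" is mistaken for an initial), which is the kind of edge case a maintained rule set has to cover.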

Highlights

The short research paper 'PySBD: Pragmatic Sentence Boundary Disambiguation' was accepted at the 2nd Workshop for Natural Language Processing Open Source Software (NLP-OSS) at EMNLP 2020.

Research Paper:

https://arxiv.org/abs/2010.09657

Recorded Talk: [video: pysbd_talk]

Poster: [image]

Install

Python

    pip install pysbd

Usage

  • Currently pySBD supports 22 languages.

    import pysbd

    text = "My name is Jonas E. Smith. Please turn to p. 55."
    seg = pysbd.Segmenter(language="en", clean=False)
    print(seg.segment(text))
    # ['My name is Jonas E. Smith.', 'Please turn to p. 55.']

  • pySBD can also be used as a spaCy pipeline component.

    import spacy
    from pysbd.utils import PySBDFactory

    nlp = spacy.blank('en')

    # Note: add_pipe and create_pipe are called here in their spaCy 2.x form.
    # explicitly adding component to pipeline
    # (recommended - makes it more readable to tell what's going on)
    nlp.add_pipe(PySBDFactory(nlp))

    # or you can use it implicitly with keyword
    # pysbd = nlp.create_pipe('pysbd')
    # nlp.add_pipe(pysbd)

    doc = nlp('My name is Jonas E. Smith. Please turn to p. 55.')
    print(list(doc.sents))
    # [My name is Jonas E. Smith., Please turn to p. 55.]
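
When the input spans several paragraphs, one option is to segment each paragraph separately so that no sentence crosses a blank line. The helper below is a sketch under assumptions, not part of pySBD: `segment_paragraphs` is a hypothetical name, and it relies only on the `segment(text)` method shown in the usage example, so any object exposing that method will do.

```python
import re


def segment_paragraphs(segmenter, document):
    """Return one list of sentences per non-empty paragraph.

    `segmenter` is any object with a `segment(text)` method that returns
    a list of strings, such as the Segmenter instance in the usage example.
    """
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", document)
    return [
        [sentence.strip() for sentence in segmenter.segment(paragraph)]
        for paragraph in paragraphs
        if paragraph.strip()  # Skip empty chunks left by trailing newlines.
    ]
```

With pySBD installed, this could be called as `segment_paragraphs(pysbd.Segmenter(language="en", clean=False), document)`. Passing the segmenter in, rather than constructing it inside the helper, keeps the language choice with the caller.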

Contributing

If you want to contribute a new feature or language support, or you have found text that pySBD segments incorrectly, please see CONTRIBUTING.md for details and follow these steps.

  1. Fork it ( https://github.com/nipunsadvilkar/pySBD/fork )
  2. Create your feature branch ( git checkout -b my-new-feature )
  3. Commit your changes ( git commit -am 'Add some feature' )
  4. Push to the branch ( git push origin my-new-feature )
  5. Create a new Pull Request
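
When reporting text that is segmented incorrectly, it can help to show the expected and actual output side by side. The helper below is a hypothetical sketch, not a pySBD utility and not a format required by CONTRIBUTING.md: `diff_segmentation` and its parameters are names made up for illustration, and `segment` is assumed to be any callable that maps a string to a list of sentences (for example the `segment` method of a `pysbd.Segmenter`).

```python
import difflib


def diff_segmentation(segment, text, expected):
    """Return a unified diff between expected and actual sentences.

    An empty string means the segmentation matched the expectation.
    """
    # Strip surrounding whitespace so only sentence content is compared.
    actual = [sentence.strip() for sentence in segment(text)]
    diff = difflib.unified_diff(
        expected, actual, fromfile="expected", tofile="actual", lineterm=""
    )
    return "\n".join(diff)
```

A non-empty result can be pasted into an issue together with the input text and the language code used.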

Citation

If you use the pysbd package in your projects or research, please cite PySBD: Pragmatic Sentence Boundary Disambiguation.

@inproceedings{sadvilkar-neumann-2020-pysbd,
    title = "{P}y{SBD}: Pragmatic Sentence Boundary Disambiguation",
    author = "Sadvilkar, Nipun  and
      Neumann, Mark",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.15",
    pages = "110--114",
    abstract = "We present a rule-based sentence boundary disambiguation Python package that works out-of-the-box for 22 languages. We aim to provide a realistic segmenter which can provide logical sentences even when the format and domain of the input text is unknown. In our work, we adapt the Golden Rules Set (a language specific set of sentence boundary exemplars) originally implemented as a ruby gem pragmatic segmenter which we ported to Python with additional improvements and functionality. PySBD passes 97.92{\%} of the Golden Rule Set examplars for English, an improvement of 25{\%} over the next best open source Python tool.",
}

Credit

This project wouldn't be possible without the great work done by the Pragmatic Segmenter team.
