douban / linguist

Linguist

Language Savant, Python clone of github/linguist.

Installation

PIP

```bash
pip install linguist
```

Easy_install

```bash
easy_install linguist
```

Features

Language detection

Linguist defines the list of all known languages in a YAML file. For a file to be highlighted, its language and lexer must be defined there.

Most languages are detected by their file extension. This is the fastest and most common situation.
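As a rough illustration of extension-based detection, the sketch below looks up candidate languages in a plain dictionary. The `EXTENSIONS` table and the `detect_by_extension` helper are hypothetical stand-ins for the data Linguist loads from its YAML file; they are not part of the library's API.

```python
import os

# Hypothetical, hand-written subset of an extension table. Linguist itself
# loads its language definitions from a YAML file.
EXTENSIONS = {
    '.py': ['Python'],
    '.js': ['JavaScript'],
    '.sh': ['Shell'],
    '.h': ['C', 'C++', 'Objective-C'],  # one extension, several candidates
}


def detect_by_extension(path):
    """Return the candidate language names for a path, based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    return EXTENSIONS.get(ext, [])


print(detect_by_extension('test.py'))   # ['Python']
print(detect_by_extension('vector.h'))  # ['C', 'C++', 'Objective-C']
print(detect_by_extension('README'))    # []
```

A single candidate settles the question immediately; more than one candidate means the extension alone is not enough.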

To disambiguate between files that share a common extension, we use a Bayesian classifier. For example, this helps us tell apart `.h` files, which could be C, C++, or Objective-C.
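To make the idea of Bayesian disambiguation concrete, here is a minimal naive Bayes sketch with add-one smoothing. The `TinyBayes` class, its tokenizer, and the three training snippets are assumptions made up for this illustration; Linguist's real classifier and training data are not shown here and may work differently.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(source):
    """Split source text into identifier-like tokens and single punctuation marks."""
    return re.findall(r'[A-Za-z_]\w*|[^\s\w]', source)


class TinyBayes(object):
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token -> count
        self.doc_counts = Counter()               # language -> training samples

    def train(self, language, source):
        self.token_counts[language].update(tokenize(source))
        self.doc_counts[language] += 1

    def classify(self, source, candidates):
        """Return the candidate language with the highest log-probability."""
        vocab = set()
        for counts in self.token_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(source)
        best, best_score = None, float('-inf')
        for language in candidates:
            counts = self.token_counts[language]
            total = sum(counts.values())
            # Smoothed log prior for the language.
            prior = (self.doc_counts[language] + 1.0) / (total_docs + len(candidates))
            score = math.log(prior)
            # Smoothed log likelihood of every token in the sample.
            for token in tokens:
                score += math.log((counts[token] + 1.0) / (total + len(vocab)))
            if score > best_score:
                best, best_score = language, score
        return best


clf = TinyBayes()
clf.train('C', '#include <stdio.h>\nint main(void) { printf("hi"); return 0; }')
clf.train('C++', '#include <iostream>\nclass Foo { public: std::string name; };')
clf.train('Objective-C', '#import <Foundation/Foundation.h>\n@interface Foo : NSObject\n@end')

header = '@interface Bar : NSObject\n@end'
print(clf.classify(header, ['C', 'C++', 'Objective-C']))  # Objective-C
```

Restricting `classify` to the candidates for a given extension keeps the classifier out of the common case, where the extension alone already decides the language.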

For testing, there is a simple FileBlob API:

```python
from linguist.libs.file_blob import FileBlob

FileBlob('test.py').language.name    #=> 'Python'

FileBlob('test_file').language.name  #=> 'Python'
```

See linguist/libs/language.py and lib/linguist/languages.yml.

Syntax Highlighting

The actual syntax highlighting is handled by Pygments, which also provides a Lexer abstraction that determines which highlighter should be used on a file.

Stats

The Language Graph you see on every repository is built by aggregating the languages of all the blobs in the repo.

The repository stats API can be used on a directory:

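```python
from linguist.libs.repository import Repository

project = Repository.from_directory(".")

# The dominant language of the project.
project.language.name  #=> 'Python'

# A defaultdict keyed by Language objects, e.g. Python: 53446, JavaScript: 1991
project.languages

# Print each language with its count (works on Python 2 and 3).
for lang, count in project.languages.items():
    print('%s %s' % (lang.name, count))
#=> Python 53446
#=> JavaScript 1991
```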

These stats are also printed by the `pylinguist` command-line script. Try running `pylinguist [dir_path|file_path]`:

```bash
$ pylinguist ~/douban/proj/code/
60.8% JavaScript
39.1% Python
0.1% Shell

$ pylinguist static/js/lib/jquery.min.js
static/js/lib/jquery.min.js: 2 lines (2 sloc)
  type: Text
  language: JavaScript
  appears to be generated source code
  appears to be a vendored file

$ pylinguist config.py
config.py: 34 lines (23 sloc)
  type: Text
  language: Python
```
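A percentage breakdown like the one printed for a directory can be derived from per-language counts. The sketch below is one plausible way to do it; `language_percentages` is a hypothetical helper, and the rounding and ordering `pylinguist` actually uses are not confirmed here.

```python
def language_percentages(sizes):
    """Turn a {language name: size} mapping into (name, percent) pairs, largest first."""
    total = float(sum(sizes.values()))
    if total == 0:
        return []
    ranked = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    return [(name, round(100.0 * size / total, 1)) for name, size in ranked]


# Counts taken from the Repository example earlier in this section.
for name, percent in language_percentages({'Python': 53446, 'JavaScript': 1991}):
    print('%.1f%% %s' % (percent, name))
# 96.4% Python
# 3.6% JavaScript
```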

Ignore vendored files

Checking third-party code into your git repo is common practice, but it often inflates your project's language stats and may even cause your project to be labeled as another language. Linguist can identify some of these files and directories and exclude them.

```python
from linguist.libs.file_blob import FileBlob

FileBlob('static/js/jquery-2.0.0.min.js').is_vendored  #=> True
```

See BlobHelper#is_vendored and linguist/libs/vendor.yml.
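One common way to implement a check like this is to match the path against a list of regular expressions. The patterns and the `is_vendored` function below are hypothetical examples written for this sketch; the actual rules live in `vendor.yml` and may differ.

```python
import re

# Hypothetical patterns in the spirit of a vendor list.
VENDORED_PATTERNS = [
    r'(^|/)vendor/',
    r'(^|/)node_modules/',
    r'(^|/)jquery(\.min)?\.js$',
    r'(^|/)jquery-\d\.\d+(\.\d+)?(\.min)?\.js$',
]
VENDORED_RE = re.compile('|'.join(VENDORED_PATTERNS))


def is_vendored(path):
    """Return True when the path matches any of the vendored patterns."""
    return VENDORED_RE.search(path) is not None


print(is_vendored('static/js/jquery-2.0.0.min.js'))  # True
print(is_vendored('vendor/lib/helper.py'))           # True
print(is_vendored('app/models/user.py'))             # False
```

Joining the patterns into one compiled expression means each path is scanned once, which matters when a whole repository is walked.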

Generated file detection

```python
from linguist.libs.file_blob import FileBlob

FileBlob('jquery-2.0.0.min.js').is_generated  #=> True
FileBlob('app.coffee').is_generated           #=> True
```

See Generated#is_generated.
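The README does not describe how generated files are recognized. As an assumption-laden sketch, two simple heuristics are shown below: a very long average line length for `.js`/`.css` files (typical of minified code) and a "Generated by" banner on the first line (as emitted by the CoffeeScript compiler). The function names and the threshold of 100 characters are made up for this example and are not taken from the library.

```python
import os


def looks_minified(path, data, threshold=100):
    """Guess that a .js/.css file is minified when its average line is very long."""
    if os.path.splitext(path)[1].lower() not in ('.js', '.css'):
        return False
    lines = data.splitlines()
    if not lines:
        return False
    average = sum(len(line) for line in lines) / float(len(lines))
    return average > threshold


def has_generator_banner(data):
    """Guess that a file is compiler output when its first line says so."""
    lines = data.splitlines()
    return bool(lines) and lines[0].startswith('// Generated by ')


def looks_generated(path, data):
    """Combine both heuristics; either one is enough to flag the file."""
    return looks_minified(path, data) or has_generator_banner(data)


minified = '(function(){' + 'a=1;' * 200 + '})();'
compiled = '// Generated by CoffeeScript 1.6.2\n(function() {}).call(this);\n'

print(looks_generated('jquery-2.0.0.min.js', minified))       # True
print(looks_generated('app.js', 'var a = 1;\nvar b = 2;\n'))  # False
print(looks_generated('out.js', compiled))                    # True
```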

Contributing

* Fork the repository.
* Create a topic branch.
* Implement your feature or bug fix.
* Add, commit, and push your changes.
* Submit a pull request.

Testing

```bash
cd tests/
python run.py
```

Changelog

v0.1.1 [2014-11-03]

* Updated the required Pygments version

v0.1.0 [2013-11-19]

* Better performance: created and now require `scanner`
* Synced with the latest version of github/linguist
* Use MIME types: created and now require `mime`
* Compatible with GitHub custom lexers: created and now require `pygments-github-lexers`

v0.0.3 [2013-05-20]

* Bugfix: ignore dir if dir.startswith('.')

v0.0.2 [2013-04-25]

* Added the `pylinguist` script
* Disabled detection of files with unknown extensions
* Bugfix: blob sloc count
* Added some unit tests

v0.0.1 [2013-04-22]

* Release v0.0.1
