scandir

by benhoyt

benhoyt / scandir

Better directory iterator and faster os.walk(), now in the Python 3.5 stdlib

462 Stars 59 Forks Last release: over 1 year ago (v1.10.0) BSD 3-Clause "New" or "Revised" License 443 Commits 12 Releases

Available items

No Items, yet!

The developer of this repository has not created any items for sale yet. Need a bug fixed? Help with integration? A different license? Create a request here:

scandir, a better directory iterator and faster os.walk()

.. image:: https://img.shields.io/pypi/v/scandir.svg :target: https://pypi.python.org/pypi/scandir :alt: scandir on PyPI (Python Package Index)

.. image:: https://travis-ci.org/benhoyt/scandir.svg?branch=master :target: https://travis-ci.org/benhoyt/scandir :alt: Travis CI tests (Linux)

.. image:: https://ci.appveyor.com/api/projects/status/github/benhoyt/scandir?branch=master&svg=true :target: https://ci.appveyor.com/project/benhoyt/scandir :alt: Appveyor tests (Windows)

scandir()
is a directory iteration function like
os.listdir()
, except that instead of returning a list of bare filenames, it yields
DirEntry
objects that include file type and stat information along with the name. Using
scandir()
increases the speed of
os.walk()
by 2-20 times (depending on the platform and file system) by avoiding unnecessary calls to
os.stat()
in most cases.

Now included in a Python near you!

scandir
has been included in the Python 3.5 standard library as
os.scandir()
, and the related performance improvements to
os.walk()
have also been included. So if you're lucky enough to be using Python 3.5 (release date September 13, 2015) you get the benefit immediately, otherwise just
download this module from PyPI 
_, install it with
pip install scandir
, and then do something like this in your code:

.. code-block:: python

# Use the built-in version of scandir/walk if possible, otherwise
# use the scandir module version
try:
    from os import scandir, walk
except ImportError:
    from scandir import scandir, walk

PEP 471 
, which is the PEP that proposes including
scandir
in the Python standard library, was
accepted 
in July 2014 by Victor Stinner, the BDFL-delegate for the PEP.

This

scandir
module is intended to work on Python 2.7+ and Python 3.4+ (and it has been tested on those versions).

Background

Python's built-in

os.walk()
is significantly slower than it needs to be, because -- in addition to calling
listdir()
on each directory -- it calls
stat()
on each file to determine whether the filename is a directory or not. But both
FindFirstFile
/
FindNextFile
on Windows and
readdir
on Linux/OS X already tell you whether the files returned are directories or not, so no further
stat
system calls are needed. In short, you can reduce the number of system calls from about 2N to N, where N is the total number of files and directories in the tree.

In practice, removing all those extra system calls makes

os.walk()
about 7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X. So we're not talking about micro-optimizations. See more benchmarks in the "Benchmarks" section below.

Somewhat relatedly, many people have also asked for a version of

os.listdir()
that yields filenames as it iterates instead of returning them as one big list. This improves memory efficiency for iterating very large directories.

So as well as a faster

walk()
, scandir adds a new
scandir()
function. They're pretty easy to use, but see "The API" below for the full docs.

Benchmarks

Below are results showing how many times as fast

scandir.walk()
is than
os.walk()
on various systems, found by running
benchmark.py
with no arguments:

==================== ============== ============= System version Python version Times as fast ==================== ============== ============= Windows 7 64-bit 2.7.7 64-bit 10.4 Windows 7 64-bit SSD 2.7.7 64-bit 10.3 Windows 7 64-bit NFS 2.7.6 64-bit 36.8 Windows 7 64-bit SSD 3.4.1 64-bit 9.9 Windows 7 64-bit SSD 3.5.0 64-bit 9.5 Ubuntu 14.04 64-bit 2.7.6 64-bit 5.8 Mac OS X 10.9.3 2.7.5 64-bit 3.8 ==================== ============== =============

All of the above tests were done using the fast C version of scandir (source code in

_scandir.c
).

Note that the gains are less than the above on smaller directories and greater on larger directories. This is why

benchmark.py
creates a test directory tree with a standardized size.

The API

walk() ~~~~~~

The API for

scandir.walk()
is exactly the same as
os.walk()
, so just
read the Python docs 
_.

scandir() ~~~~~~~~~

The full docs for

scandir()
and the
DirEntry
objects it yields are available in the
Python documentation here 
_. But below is a brief summary as well.
scandir(path='.') -> iterator of DirEntry objects for given path

Like

listdir
,
scandir
calls the operating system's directory iteration system calls to get the names of the files in the given
path
, but it's different from
listdir
in two ways:
  • Instead of returning bare filename strings, it returns lightweight

    DirEntry
    objects that hold the filename string and provide simple methods that allow access to the additional data the operating system may have returned.
  • It returns a generator instead of a list, so that

    scandir
    acts as a true iterator instead of returning the full list immediately.

scandir()
yields a
DirEntry
object for each file and sub-directory in
path
. Just like
listdir
, the
'.'
and
'..'
pseudo-directories are skipped, and the entries are yielded in system-dependent order. Each
DirEntry
object has the following attributes and methods:
  • name
    : the entry's filename, relative to the scandir
    path
    argument (corresponds to the return values of
    os.listdir
    )
  • path
    : the entry's full path name (not necessarily an absolute path) -- the equivalent of
    os.path.join(scandir_path, entry.name)
  • is_dir(*, follow_symlinks=True)
    : similar to
    pathlib.Path.is_dir()
    , but the return value is cached on the
    DirEntry
    object; doesn't require a system call in most cases; don't follow symbolic links if
    follow_symlinks
    is False
  • is_file(*, follow_symlinks=True)
    : similar to
    pathlib.Path.is_file()
    , but the return value is cached on the
    DirEntry
    object; doesn't require a system call in most cases; don't follow symbolic links if
    follow_symlinks
    is False
  • is_symlink()
    : similar to
    pathlib.Path.is_symlink()
    , but the return value is cached on the
    DirEntry
    object; doesn't require a system call in most cases
  • stat(*, follow_symlinks=True)
    : like
    os.stat()
    , but the return value is cached on the
    DirEntry
    object; does not require a system call on Windows (except for symlinks); don't follow symbolic links (like
    os.lstat()
    ) if
    follow_symlinks
    is False
  • inode()
    : return the inode number of the entry; the return value is cached on the
    DirEntry
    object

Here's a very simple example of

scandir()
showing use of the
DirEntry.name
attribute and the
DirEntry.is_dir()
method:

.. code-block:: python

def subdirs(path):
    """Yield directory names not starting with '.' under given path."""
    for entry in os.scandir(path):
        if not entry.name.startswith('.') and entry.is_dir():
            yield entry.name

This

subdirs()
function will be significantly faster with scandir than
os.listdir()
and
os.path.isdir()
on both Windows and POSIX systems, especially on medium-sized or large directories.

Further reading

  • The Python docs for scandir 
    _
  • PEP 471 
    _, the (now-accepted) Python Enhancement Proposal that proposed adding
    scandir
    to the standard library -- a lot of details here, including rejected ideas and previous discussion

Flames, comments, bug reports

Please send flames, comments, and questions about scandir to Ben Hoyt:

http://benhoyt.com/

File bug reports for the version in the Python 3.5 standard library

here 
_, or file bug reports or feature requests for this module at the GitHub project page:

https://github.com/benhoyt/scandir

We use cookies. If you continue to browse the site, you agree to the use of cookies. For more information on our use of cookies please see our Privacy Policy.