pandas-profiling


Create HTML profiling reports from pandas DataFrame objects

6.2K Stars 945 Forks Last release: about 2 months ago (v2.9.0) MIT License 704 Commits 29 Releases


Pandas Profiling


Documentation | Slack | Stack Overflow

Generates profile reports from a pandas `DataFrame`. The pandas `df.describe()` function is great but a little basic for serious exploratory data analysis. `pandas_profiling` extends the pandas DataFrame with `df.profile_report()` for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

  • Type inference: detect the types of columns in a dataframe.
  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
  • Missing values: matrix, count, heatmap and dendrogram of missing values
  • Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
  • File and Image analysis: extract file sizes, creation dates and dimensions and scan for truncated images or those containing EXIF information.
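
To make a few of the listed statistics concrete, here is a minimal sketch that computes them by hand with plain pandas for a single numeric column. It only illustrates the definitions and is not the code `pandas-profiling` runs; the helper name `summarize_column` is hypothetical.

```python
import pandas as pd


def summarize_column(series: pd.Series) -> dict:
    """Compute a handful of per-column statistics by hand (illustrative only)."""
    values = series.dropna()
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    mean = values.mean()
    std = values.std()  # sample standard deviation (ddof=1), the pandas default
    return {
        "n_missing": int(series.isna().sum()),
        "n_unique": int(values.nunique()),
        "min": values.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": values.max(),
        "range": values.max() - values.min(),
        "iqr": q3 - q1,
        "mean": mean,
        "std": std,
        # Coefficient of variation: standard deviation relative to the mean.
        "cv": std / mean,
        # Median absolute deviation around the median.
        "mad": (values - median).abs().median(),
    }


column = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, None], name="a")
stats = summarize_column(column)
```

The profile report presents these values, and more, for every column at once, which is the main convenience over computing them one by one.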

Announcements

Version v2.9.0 released

The release candidate for v2.9.0 had been out for a while, and v2.9.0 is now finally released. See the changelog to learn what has changed.

Spark backend in progress

We can happily announce that we're working on a Spark backend for generating profile reports. Stay tuned.

Support `pandas-profiling`

The development of `pandas-profiling` relies completely on contributions. If you find value in the package, we welcome you to support the project through GitHub Sponsors! It's extra exciting that GitHub matches your contribution for the first year.

Find more information here:

September 2, 2020 💘


Contents: Examples | Installation | Documentation | Large datasets | Command line usage | Advanced usage | Support | Types | How to contribute | Editor Integration | Dependencies


Examples

The following examples can give you an impression of what the package can do:

  • Census Income (US Adult Census data relating income)
  • NASA Meteorites (comprehensive set of meteorite landings)
  • Titanic (the "Wonderwall" of datasets)
  • NZA (open data from the Dutch Healthcare Authority)
  • Stata Auto (1978 Automobile data)
  • Vektis (Vektis Dutch Healthcare data)
  • Colors (a simple colors dataset)

Specific features:

  • Russian Vocabulary (demonstrates text analysis)
  • Cats and Dogs (demonstrates image analysis from the file system)
  • Celebrity Faces (demonstrates image analysis with EXIF information)
  • Website Inaccessibility (demonstrates URL analysis)
  • Orange prices and Coal prices (showcases report themes)

Tutorials:

  • Tutorial: report structure using Kaggle data (advanced) (modify the report's structure)

Installation

Using pip


You can install using the pip package manager by running

pip install pandas-profiling[notebook]

Alternatively, you can install the latest version directly from GitHub:

pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Using conda


You can install using the conda package manager by running

conda install -c conda-forge pandas-profiling

From source

Download the source code by cloning the repository or by pressing 'Download ZIP' on this page. Install by navigating to the proper directory and running

python setup.py install

Documentation

The documentation for `pandas_profiling` can be found here. Previous documentation is still available here.

Getting started

Start by loading in your pandas DataFrame, e.g. by using:

```python
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])
```

To generate the report, run:

```python
profile = ProfileReport(df, title="Pandas Profiling Report")
```

Explore deeper

You can configure the profile report in any way you like. The example code below loads the explorative configuration file, which includes many features for text (length distribution, Unicode information), files (file size, creation time) and images (dimensions, EXIF information). If you are interested in exactly which settings were used, you can compare with the default configuration file.

profile = ProfileReport(df, title='Pandas Profiling Report', explorative=True)

Learn more about configuring `pandas-profiling` on the Advanced usage page.

Jupyter Notebook

We recommend generating reports interactively by using the Jupyter notebook. There are two interfaces: through widgets and through an HTML report.

Notebook Widgets

This is achieved by simply displaying the report. In the Jupyter Notebook, run:

```python
profile.to_widgets()
```

The HTML report can be included in a Jupyter notebook:

HTML

Run the following code:

profile.to_notebook_iframe()

Saving the report

If you want to generate an HTML report file, save the `ProfileReport` to an object and use the `to_file()` function:

```python
profile.to_file("your_report.html")
```

Alternatively, you can obtain the data as JSON:

```python
# As a string
json_data = profile.to_json()

# As a file
profile.to_file("your_report.json")
```
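
Since `to_file()` is used above for both the `.html` and the `.json` output, a small wrapper can write both in one go. The sketch below is a convenience under stated assumptions: `export_report` is a hypothetical helper, and it relies only on the `to_file()` method being called with a target filename as shown above.

```python
from pathlib import Path


def export_report(profile, out_dir, stem="your_report"):
    """Write `profile` as both HTML and JSON and return the two paths.

    `profile` is assumed to be a ProfileReport; only its `to_file()` method
    is used, with the output format chosen by the file extension.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"{stem}.html", out_dir / f"{stem}.json"]
    for path in paths:
        profile.to_file(str(path))
    return paths
```

For example, `export_report(profile, "reports")` would be expected to leave `reports/your_report.html` and `reports/your_report.json` on disk.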

Large datasets

Version 2.4 introduces minimal mode. This is a default configuration that disables expensive computations (such as correlations and dynamic binning). Use the following syntax:

profile = ProfileReport(large_dataset, minimal=True)
profile.to_file("output.html")
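
One way to use minimal mode is to switch it on only when a DataFrame is large. The sketch below assumes a row threshold (`MAX_FULL_ROWS`, an arbitrary illustrative value) and two hypothetical helpers, `report_options` and `profile_dataframe`; only the `minimal=True` argument comes from the text above.

```python
MAX_FULL_ROWS = 100_000  # arbitrary cut-off, chosen for illustration only


def report_options(n_rows, max_full_rows=MAX_FULL_ROWS):
    """Return keyword arguments for ProfileReport based on dataset size."""
    if n_rows > max_full_rows:
        # Large dataset: skip the expensive computations.
        return {"minimal": True}
    return {}


def profile_dataframe(df, output_file="output.html"):
    """Profile `df`, using minimal mode when it has many rows."""
    from pandas_profiling import ProfileReport  # imported lazily

    profile = ProfileReport(df, **report_options(len(df)))
    profile.to_file(output_file)
    return profile
```

A sensible threshold depends on the number of columns and the available memory, so treat the value above as a starting point to tune.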

Command line usage

For standard formatted CSV files that can be read immediately by pandas, you can use the `pandas_profiling` executable. Run

pandas_profiling -h

for information about options and arguments.
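
When profiling several CSV files, the executable can also be driven from a script. The sketch below assumes the positional form `pandas_profiling <input> <output>`, which is the argument order used in the PyCharm integration later in this document. `build_command` and `profile_csv` are hypothetical helper names, and the `_report.html` suffix is just one possible naming convention.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(csv_path, executable="pandas_profiling"):
    """Return the argument list: executable, input CSV, output HTML report."""
    csv_path = Path(csv_path)
    report_path = csv_path.with_name(f"{csv_path.stem}_report.html")
    return [executable, str(csv_path), str(report_path)]


def profile_csv(csv_path):
    """Run the executable on one CSV file, if it is installed."""
    if shutil.which("pandas_profiling") is None:
        raise RuntimeError("pandas_profiling executable not found on PATH")
    subprocess.run(build_command(csv_path), check=True)
```

Run `pandas_profiling -h` first to confirm the arguments accepted by your installed version.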

Advanced usage

A set of options is available in order to adapt the report generated.

  • `title` (`str`): Title for the report ('Pandas Profiling Report' by default).
  • `pool_size` (`int`): Number of workers in thread pool. When set to zero, it is set to the number of CPUs available (0 by default).
  • `progress_bar` (`bool`): If True, `pandas-profiling` will display a progress bar.

More settings can be found in the default configuration file, minimal configuration file and dark themed configuration file.

Example

```python
profile = df.profile_report(title='Pandas Profiling Report', plot={'histogram': {'bins': 8}})
profile.to_file("output.html")
```
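
The `pool_size` rule described above (zero means "use every available CPU") can be sketched as follows. This illustrates the documented behaviour and is not the library's own code; `resolve_pool_size` is a hypothetical name.

```python
import os


def resolve_pool_size(pool_size=0):
    """Map the `pool_size` setting to an actual number of workers."""
    if pool_size < 0:
        raise ValueError("pool_size must be zero or a positive integer")
    if pool_size == 0:
        # Zero is the default and means "one worker per available CPU".
        return os.cpu_count() or 1
    return pool_size


# Options from the list above, collected as keyword arguments.
options = {
    "title": "Pandas Profiling Report",
    "pool_size": 4,
    "progress_bar": False,
}
workers = resolve_pool_size(options["pool_size"])
```

Assuming these settings are accepted as keyword arguments in the same way as `title` and `plot` in the example above, they could be passed as `df.profile_report(**options)`.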

Supporting open source

Maintaining and developing the open-source code for pandas-profiling, with millions of downloads and thousands of users, would not be possible without the support of our gracious sponsors.

Lambda Labs

Lambda workstations, servers, laptops, and cloud services power engineers and researchers at Fortune 500 companies and 94% of the top 50 universities. Lambda Cloud offers 4 & 8 GPU instances starting at $1.50 / hr. Pre-installed with TensorFlow, PyTorch, Ubuntu, CUDA, and cuDNN.

We would like to thank our generous GitHub Sponsors supporters who make pandas-profiling possible:

Martin Sotir, Joseph Yuen, Brian Lee, Stephanie Rivera, nscsekhar, abdulAziz

More info if you would like to appear here: GitHub Sponsor page

Types

Types are a powerful abstraction for effective data analysis that goes beyond the logical data types (integer, float, etc.). `pandas-profiling` currently recognizes the following types: Boolean, Numerical, Date, Categorical, URL, Path, File and Image.

We have developed a type system for Python, tailored for data analysis: `visions`. Selecting the right typeset drastically reduces the complexity of the code of your analysis. Future versions of `pandas-profiling` will have extended type support through `visions`!
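
As a rough illustration of how such types differ from storage dtypes, the sketch below maps a pandas column to one of four of the type names listed above. It is a deliberately simplified stand-in: it does not use `visions` and does not reproduce the rules `pandas-profiling` applies, and `guess_type` is a hypothetical helper.

```python
import pandas as pd
from pandas.api import types as ptypes


def guess_type(series: pd.Series) -> str:
    """Guess a coarse analysis type for a column (simplified illustration)."""
    # Check booleans first, because pandas also counts them as numeric.
    if ptypes.is_bool_dtype(series):
        return "Boolean"
    if ptypes.is_datetime64_any_dtype(series):
        return "Date"
    if ptypes.is_numeric_dtype(series):
        return "Numerical"
    # Everything else (strings, mixed objects) is treated as categorical here.
    return "Categorical"


df = pd.DataFrame(
    {
        "survived": [True, False, True],
        "age": [22.0, 38.0, 26.0],
        "boarded": pd.to_datetime(["1912-04-10", "1912-04-10", "1912-04-11"]),
        "port": ["S", "C", "S"],
    }
)
guessed = {name: guess_type(df[name]) for name in df.columns}
```

Types such as URL, Path, File and Image cannot be read off the dtype alone, since they all arrive as strings; distinguishing them requires inspecting the values, which is where a dedicated type system helps.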

Contributing

Read about getting involved in the Contribution Guide. A low-threshold place to ask questions or start contributing is the pandas-profiling Slack. Join the Slack community.

Editor integration

PyCharm integration

  1. Install `pandas-profiling` via the instructions above
  2. Locate your `pandas-profiling` executable.

    On macOS / Linux / BSD:

    $ which pandas_profiling
    (example) /usr/local/bin/pandas_profiling
    

    On Windows:

    $ where pandas_profiling
    (example) C:\ProgramData\Anaconda3\Scripts\pandas_profiling.exe
    
  3. In PyCharm, go to Settings (or Preferences on macOS) > Tools > External tools

  4. Click the + icon to add a new external tool

  5. Insert the following values

    • Name: Pandas Profiling
    • Program: The location obtained in step 2
    • Arguments: "$FilePath$" "$FileDir$/$FileNameWithoutAllExtensions$_report.html"
    • Working Directory: $ProjectFileDir$


To use the PyCharm Integration, right click on any dataset file: External Tools > Pandas Profiling.
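
To see what the Arguments template from step 5 expands to, the sketch below mimics it in Python for a given dataset path. It assumes that `$FileNameWithoutAllExtensions$` drops every extension (so `data.csv.gz` becomes `data`); `expand_arguments` is a hypothetical helper and not part of PyCharm or `pandas-profiling`.

```python
from pathlib import PurePosixPath


def expand_arguments(file_path):
    """Expand the external tool's Arguments template for one file path."""
    path = PurePosixPath(file_path)
    # Strip all extensions, e.g. "data.csv.gz" -> "data".
    name_without_extensions = path.name.split(".")[0]
    report = path.parent / f"{name_without_extensions}_report.html"
    # First argument: the dataset itself. Second: the report next to it.
    return [str(path), str(report)]


args = expand_arguments("/home/user/project/titanic.csv")
```

With this configuration the report is written next to the dataset, so profiling `titanic.csv` should produce `titanic_report.html` in the same directory.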

Other integrations

Other editor integrations may be contributed via pull requests.

Dependencies

The profile report is written in HTML and CSS, which means pandas-profiling requires a modern browser.

You need Python 3 to run this package. Other dependencies can be found in the requirements files:

| Filename | Requirements |
|----------|--------------|
| requirements.txt | Package requirements |
| requirements-dev.txt | Requirements for development |
| requirements-test.txt | Requirements for testing |
| setup.py | Requirements for Widgets etc. |
