About the developer

brandomr
483 Stars 340 Forks 31 Commits 14 Opened issues

Description

A guide to document clustering in Python

Document Clustering with Python

In this guide, I will explain how to cluster a set of documents using Python. My motivating example is to identify the latent structures within the synopses of the top 100 films of all time (per an IMDB list). See the original post for a more detailed discussion of the example. This guide covers:

The 'clusteranalysis' workbook is fully functional; the 'clusteranalysisweb' workbook has been trimmed down for the purpose of creating this walkthrough. Feel free to download the repo and use 'clusteranalysis' to step through the guide yourself.
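
For readers who want a feel for what clustering documents involves before opening the workbook, here is a minimal, standard-library-only sketch. It is not the code from 'clusteranalysis'. It assumes one common approach: weight words by tf-idf, then group the resulting vectors with a basic k-means. The function names (`tokenize`, `tfidf_vectors`, `kmeans`) and the four toy synopses are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep purely alphabetic, whitespace-separated tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    """Turn each document into a length-normalised tf-idf vector over a shared vocabulary."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed inverse document frequency (the +1 keeps every weight positive).
    idf = {w: math.log(n_docs / df[w]) + 1.0 for w in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Basic k-means. Assumption for reproducibility: the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented toy "synopses", purely for illustration.
toy_synopses = [
    "mafia family crime boss",
    "space ship alien planet",
    "crime boss mafia betrayal",
    "alien planet space war",
]

labels = kmeans(tfidf_vectors(toy_synopses), k=2)
print(labels)  # documents with the same label landed in the same cluster
```

Seeding the centroids with the first k documents keeps the toy run reproducible; a real analysis would more likely rely on a library implementation with randomised or smarter initialisation.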

How the repo is set up

Once you've pulled down the repo, all you need to do is run 'clusteranalysis.ipynb'; it will find the various lists of synopses and titles. The 'FilmScrape.ipynb' notebook contains the code I used to scrape the synopses, in case you are interested. The other items in the repo are mostly incidentals for setting up the webpage walkthrough. There is also one pickled model.

At some point in the future I'll write up how I executed the web scraping in case it's of interest.
