About the developer

JohannesKaufmann
261 Stars 39 Forks MIT License 74 Commits 6 Opened issues

Description

⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.

html-to-markdown

A gopher standing on top of a machine that converts a box of HTML to blocks of Markdown

Convert HTML into Markdown with Go. It uses an HTML parser to avoid the use of regexp as much as possible. That should prevent some weird cases and allows it to be used where the input is totally unknown.
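
To illustrate the kind of "weird case" a parser-based approach sidesteps, here is a minimal sketch using only the standard library. The pattern and the `naiveConvert` helper are hypothetical and deliberately simplistic; they are not taken from this library.

```go
package main

import (
	"fmt"
	"regexp"
)

// naiveStrong is a deliberately simplistic pattern (hypothetical, for illustration only).
var naiveStrong = regexp.MustCompile(`<strong>(.*?)</strong>`)

// naiveConvert rewrites <strong>…</strong> to **…** using only the regexp above.
func naiveConvert(html string) string {
	return naiveStrong.ReplaceAllString(html, "**$1**")
}

func main() {
	// The exact form the pattern expects is converted.
	fmt.Println(naiveConvert(`<strong>Important</strong>`)) // **Important**

	// An attribute or different casing is enough to make the pattern miss,
	// so the HTML passes through untouched.
	fmt.Println(naiveConvert(`<strong class="x">Important</strong>`))
	fmt.Println(naiveConvert(`<STRONG>Important</STRONG>`))
}
```

An HTML parser sees all three inputs as the same element, which is why it is the safer basis when the input is unknown.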

Installation

go get github.com/JohannesKaufmann/html-to-markdown

Usage

import md "github.com/JohannesKaufmann/html-to-markdown"

converter := md.NewConverter("", true, nil)

html := `<strong>Important</strong>`

markdown, err := converter.ConvertString(html)
if err != nil {
  log.Fatal(err)
}
fmt.Println("md ->", markdown)

If you are already using goquery, you can pass a selection to Convert.

markdown, err := converter.Convert(selec)

Using it on the command line

If you want to use html-to-markdown on the command line without any Go coding, check out html2md, a CLI wrapper for html-to-markdown that has all the following options and plugins built in.

Options

The third parameter to md.NewConverter is *md.Options.

For example, you can change the delimiter around bold text from "**" to a different one (for example "__") by changing the value of StrongDelimiter.

opt := &md.Options{
  StrongDelimiter: "__", // default: **
  // ...
}
converter := md.NewConverter("", true, opt)

For all the possible options, look at the godocs; for an example, look at the example.

Adding Rules

converter.AddRules(
  md.Rule{
    Filter: []string{"del", "s", "strike"},
    Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
      // You need to return a pointer to a string (md.String is just a helper function).
      // If you return nil the next function for that html element
      // will be picked. For example you could only convert an element
      // if it has a certain class name and fallback if not.
      content = strings.TrimSpace(content)
      return md.String("~" + content + "~")
    },
  },
  // more rules
)

For more information have a look at the example add_rules.
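
The "return nil to fall back" behaviour described in the rule comment can be pictured with a small standalone model. This is only a sketch under stated assumptions: the `element` type, the `firstMatch` helper, and the order in which candidates are tried are hypothetical and do not reflect the library's internals.

```go
package main

import (
	"fmt"
	"strings"
)

// element is a hypothetical stand-in for a parsed HTML element.
type element struct {
	Tag   string
	Class string
}

// replacement has the shape described above: return a *string to handle
// the element, or nil to let the next candidate try.
type replacement func(content string, el element) *string

// str returns a pointer to s (the role md.String plays in the real rule).
func str(s string) *string { return &s }

// firstMatch tries each candidate in turn and returns the first non-nil result.
// The ordering here is an assumption made for this sketch.
func firstMatch(content string, el element, candidates ...replacement) string {
	for _, r := range candidates {
		if out := r(content, el); out != nil {
			return *out
		}
	}
	return content
}

// onlyDeleted converts the element only if it has the class "deleted".
func onlyDeleted(content string, el element) *string {
	if el.Class != "deleted" {
		return nil // decline, so the fallback gets a chance
	}
	return str("~" + strings.TrimSpace(content) + "~")
}

// plain is the fallback: emit the content unchanged.
func plain(content string, el element) *string { return str(content) }

func main() {
	fmt.Println(firstMatch(" old ", element{Tag: "span", Class: "deleted"}, onlyDeleted, plain)) // ~old~
	fmt.Println(firstMatch("kept", element{Tag: "span"}, onlyDeleted, plain))                    // kept
}
```

In a real rule, the same check would be made on the *goquery.Selection passed to Replacement.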

Using Plugins

If you want plugins (GitHub Flavored Markdown features like strikethrough, tables, ...), you can pass them to Use.

import "github.com/JohannesKaufmann/html-to-markdown/plugin"

// Use the GitHubFlavored plugin from the plugin package.
converter.Use(plugin.GitHubFlavored())

Or, if you only want the Strikethrough plugin, pass it on its own. You can change the delimiter that marks crossed-out text by setting the first argument to a different value (for example "~~" instead of "~").

converter.Use(plugin.Strikethrough(""))

For more information have a look at the example github_flavored.

Writing Plugins

Have a look at the plugin folder for a reference implementation. The most basic one is Strikethrough.

Security

This library produces markdown that is readable and can be changed by humans.

Once you convert this markdown back to HTML (e.g. using goldmark or blackfriday) you need to be careful of malicious content.

This library does NOT sanitize untrusted content. Use an HTML sanitizer such as bluemonday before displaying the HTML in the browser.

Other Methods

Godoc

func (c *Converter) Keep(tags ...string) *Converter

Determines which elements are to be kept and rendered as HTML.

func (c *Converter) Remove(tags ...string) *Converter

Determines which elements are to be removed altogether i.e. converted to an empty string.

Issues

If you find HTML snippets (or even full websites) that don't produce the expected results, please open an issue!

Contributing & Testing

Please first discuss the change you wish to make, by opening an issue. I'm also happy to guide you to where a change is most likely needed.

Note: The outside API should not change because of backwards compatibility...

You don't have to be afraid of breaking the converter, since there are many "Golden File Tests":

Add your problematic HTML snippet to one of the input.html files in the testdata folder. Then run go test -update and have a look at which .golden files changed in Git.

You can now change the internal logic and inspect what impact your change has by running go test -update again.

Note: Before submitting your change as a PR, make sure that you run those tests and check the files into Git...

Related Projects

License

This project is licensed under the terms of the MIT license.
