RStudio Pandoc – HTML To Markdown

[This article was first published on R on YIHAN WU, and kindly contributed to R-bloggers.]

The knitr and rmarkdown packages are used in conjunction with pandoc to convert R code and figures to a variety of formats, including PDF and Word. Here, I’m exploring how to convert HTML back to markdown. This post came about when I was searching for a way to convert XML to markdown, which I still haven’t found an easy way to do. Pandoc is not the only way to convert HTML to markdown (see turndown, html2text).

Pandoc is bundled with RStudio; on Windows, the executables are located in Program Files/RStudio/bin/pandoc. The rmarkdown package contains wrapper functions for calling pandoc from R.

Here, I am trying to convert this example HTML page back to markdown using the function pandoc_convert. Note that pandoc_convert requires an actual file, so it does not accept a quoted string of HTML code as its input argument.
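Since pandoc_convert only reads from files, one workaround for HTML held in a string is to write it to a temporary file first. The following is a minimal sketch rather than part of the workflow in this post: html_string and tmp_html are hypothetical names, and the conversion step only runs if rmarkdown is installed and can find pandoc.

```r
# Hypothetical HTML snippet held in a character string
html_string <- "<h1>Title</h1><p>Some <b>bold</b> text.</p>"

# Write the string to a temporary .html file so pandoc has a file to read
tmp_html <- tempfile(fileext = ".html")
writeLines(html_string, tmp_html)

# Convert only if rmarkdown can locate pandoc; the markdown goes to the console
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::pandoc_convert(tmp_html, from = "html", to = "markdown_strict")
}
```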

The example html:

<html>
<!-- Text between angle brackets is an HTML tag and is not displayed.
Most tags, such as the HTML and /HTML tags that surround the contents of
a page, come in pairs; some tags, like HR, for a horizontal rule, stand 
alone. Comments, such as the text you're reading, are not displayed when
the Web page is shown. The information between the HEAD and /HEAD tags is 
not displayed. The information between the BODY and /BODY tags is displayed.-->
<head>
<title>Enter a title, displayed at the top of the window.</title>
</head>
<!-- The information between the BODY and /BODY tags is displayed.-->
<body>
<h1>Enter the main heading, usually the same as the title.</h1>
<p>Be <b>bold</b> in stating your key points. Put them in a list: </p>
<ul>
<li>The first item in your list</li>
<li>The second item; <i>italicize</i> key words</li>
</ul>
<p>Improve your image by including an image. </p>
<p><img src="http://www.mygifs.com/CoverImage.gif" alt="A Great HTML Resource"></p>
<p>Add a link to your favorite <a href="https://www.dummies.com/">Web site</a>.
Break up your page with a horizontal rule or two. </p>
<hr>
<p>Finally, link to <a href="page2.html">another page</a> in your own Web site.</p>
<!-- And add a copyright notice.-->
<p>© Wiley Publishing, 2011</p>
</body>
</html>

I saved the HTML example here as example.html.

html_page <- readLines("../../static/files/example.html")

We can print the object in R.

cat(html_page)
## <html> <!-- Text between angle brackets is an HTML tag and is not displayed. Most tags, such as the HTML and /HTML tags that surround the contents of a page, come in pairs; some tags, like HR, for a horizontal rule, stand  alone. Comments, such as the text you're reading, are not displayed when the Web page is shown. The information between the HEAD and /HEAD tags is  not displayed. The information between the BODY and /BODY tags is displayed.--> <head> <title>Enter a title, displayed at the top of the window.</title> </head> <!-- The information between the BODY and /BODY tags is displayed.--> <body> <h1>Enter the main heading, usually the same as the title.</h1> <p>Be <b>bold</b> in stating your key points. Put them in a list: </p> <ul> <li>The first item in your list</li> <li>The second item; <i>italicize</i> key words</li> </ul> <p>Improve your image by including an image. </p> <p><img src="http://www.mygifs.com/CoverImage.gif" alt="A Great HTML Resource"></p> <p>Add a link to your favorite <a href="https://www.dummies.com/">Web site</a>. Break up your page with a horizontal rule or two. </p> <hr> <p>Finally, link to <a href="page2.html">another page</a> in your own Web site.</p> <!-- And add a copyright notice.--> <p>© Wiley Publishing, 2011</p> </body> </html>
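The output above runs together on one line because cat() separates the elements of a character vector with a space by default. A small sketch, using a short made-up vector (example_lines is a hypothetical stand-in for the real file contents), shows how sep = "\n" prints one element per line instead.

```r
# Hypothetical stand-in for the character vector returned by readLines()
example_lines <- c("<html>", "<body>", "</body>", "</html>")

# Default separator is a space, so everything lands on one line
cat(example_lines)

# A newline separator prints each element on its own line
cat(example_lines, sep = "\n")
```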

Pandoc can convert between many different formats. For markdown, it supports multiple variants, including GitHub-flavored markdown (for GitHub) and PHP Markdown Extra (the variant used by WordPress sites).

The safest variant to pick is markdown_strict, which is the original markdown variant.

Pandoc requires the path to the file, which in my case is in a different directory from my working directory.

library(rmarkdown)
file_path <- "../../static/files/example.html"
pandoc_convert(file_path, to = "markdown_strict")
Enter the main heading, usually the same as the title.
======================================================

Be **bold** in stating your key points. Put them in a list:

-   The first item in your list
-   The second item; *italicize* key words

Improve your image by including an image.

![A Great HTML Resource](http://www.mygifs.com/CoverImage.gif)

Add a link to your favorite [Web site](https://www.dummies.com/). Break
up your page with a horizontal rule or two.

------------------------------------------------------------------------

Finally, link to [another page](page2.html) in your own Web site.

© Wiley Publishing, 2011

Notice that the level-1 heading is written in the Setext style (underlined with ====) rather than the ATX style (prefixed with #) that RMarkdown seems to favor. We can require pandoc to use the # style during the conversion by adding an argument.

pandoc_convert(file_path, to = "markdown_strict", options = c("--atx-headers"))
# Enter the main heading, usually the same as the title.

Be **bold** in stating your key points. Put them in a list:

-   The first item in your list
-   The second item; *italicize* key words

Improve your image by including an image.

![A Great HTML Resource](http://www.mygifs.com/CoverImage.gif)

Add a link to your favorite [Web site](https://www.dummies.com/). Break
up your page with a horizontal rule or two.

------------------------------------------------------------------------

Finally, link to [another page](page2.html) in your own Web site.

© Wiley Publishing, 2011
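The --atx-headers flag matches the pandoc version used in this post. As far as I know, pandoc 2.11.2 and later replace it with --markdown-headings=atx and use # headings by default, so the flag may need to change on a newer installation. The helper below is only a sketch under that assumption, and atx_option is a hypothetical name, not an rmarkdown function.

```r
# Hypothetical helper: pick the heading flag for a given pandoc version.
# Assumes --markdown-headings was introduced in pandoc 2.11.2.
atx_option <- function(version = rmarkdown::pandoc_version()) {
  if (version >= "2.11.2") "--markdown-headings=atx" else "--atx-headers"
}

# Explicit versions, so the helper can be tried without pandoc installed
atx_option(numeric_version("1.19.2.1"))
atx_option(numeric_version("2.19"))
```

With pandoc available, the result could then be passed along as pandoc_convert(file_path, to = "markdown_strict", options = atx_option()).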

Right now, the output is printed to the console. A file can be created instead with:

pandoc_convert(file_path, to = "markdown_strict", output = "example.md")
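To check the result, the written file can be read back into R. This sketch is self-contained: it uses a small temporary HTML file as a stand-in for example.html and an absolute output path built with tempdir(). Both are assumptions for illustration rather than the paths used above, and the conversion is skipped if rmarkdown cannot find pandoc.

```r
# Stand-in input file so the example does not depend on example.html
input_file <- tempfile(fileext = ".html")
writeLines("<h1>Heading</h1><p>Some <i>text</i>.</p>", input_file)

# Absolute path for the markdown output
output_file <- file.path(tempdir(), "example.md")

if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::pandoc_convert(input_file, to = "markdown_strict", output = output_file)
  # Read the converted markdown back in and print it line by line
  cat(readLines(output_file), sep = "\n")
}
```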

Pandoc has a multitude of styling extensions for markdown variants, all listed on the manual page.

Pandoc ignores everything enclosed in <!-- -->. When converting from markdown to HTML, these comments are usually passed through as-is into the HTML document, but the reverse does not seem to be true.

Lastly, this was tested using pandoc version 1.19.2.1. Pandoc 2.5 was released last month.
