Zapier Formatter Hates Google Docs
Recently updated on March 13th, 2024 at 10:35 pm
This is a small gripe for me, but for anyone who has experienced it, it can be a huge headache. Zapier is one of the largest and most widely used consumer-oriented automation platforms on the internet, and yet there are so many things they ignore to the detriment of their clients.
For example, the Zapier for WordPress integration plugin hasn’t been updated in over two years. The plugin has terrible reviews, sitting at a sad 2 out of 5 stars, hasn’t been tested with the last three major releases of WordPress, and doesn’t work with most websites that have 2-Factor Authentication enabled via WordFence, All-In-One WP Security, or other security plugin providers. After that long without an update, it may well have unpatched security flaws by now.
For a company with over 500 employees and more than 2.2 million users, you would think someone at Zapier would have said “Maybe we should update our plugin for the largest website content management system on the planet” at some point in the last three iterations of WordPress.
But that’s a different topic altogether. Today, we’re talking about Zapier’s shortcomings with their “Formatter by Zapier”. It advertises very clearly in SERP “Zapier Formatter: Automatically format text the way you want”, but when it comes to getting HTML text into WordPress from Google Docs, that’s not quite the case. There’s a huge shortcoming with the formatter for this simple task, and I know I’m not the only person that’s experienced this issue.
When setting up a Zap to get HTML data from a Google Doc into any other data destination that Zapier allows, it’s advisable to strip the “junk” HTML that’s imported into Zapier’s Google Docs integration. This helps to cut down unnecessary code, and therefore unnecessary work, by the human reviewing and entering the data, and by the machines that transport and process said data.
To remove the junk HTML data, the Zapier way:
- Get the data from the Google Docs Trigger or Action
- Create a new action, using “Formatter by Zapier”
- Set the Event to “Text” and click “Continue”
- Set the Transform Action to “Convert HTML to Markdown”
- Click inside the Input field, click the drop-down arrow to the right of your Google Doc trigger/action, select “File Raw Html Content”, click out of the drop-down, and click “Continue”
- Click “Test Step” to preview your data
That strips all the junk HTML (html, head, and meta tags, etc.) from the content and turns most of the formatting rules into Markdown, a lightweight markup syntax used on forums and other social-oriented websites. If you need the content in HTML format, for example to publish it to a WordPress website, then you need to convert the remaining “cleaned” content back into HTML.
To convert Markdown back to HTML, the Zapier way:
- Create a new action following the “Text in Formatter by Zapier” action you just created, and select “Formatter by Zapier”
- Set the Event to “Text” and click “Continue”
- Set the Transform Action to “Convert Markdown to HTML”
- Click inside the input field, click the drop-down arrow to the right of your “Text in Formatter by Zapier”, click the line that begins with “Output”, click out of the drop-down, and click “Continue”
- Click “Test Step” to preview your data
Now you’ve taken the stripped-down Markdown content and converted it back into HTML. Great, or at least that’s how it seems. If you finish your Zap and import the data into your destination, like a WordPress post, you’ll find that there are some issues.
Issues Zapier Formatter Causes with HTML from Google Docs
Random New Line Insertion
The biggest issue is that, despite the clean code, Zapier has inserted line breaks in seemingly arbitrary places. It’s almost as if, because the preview of the Zapier output window is only so wide, Zapier inserts a new line character whenever the text reaches the edge of that window. We discovered this problem when we were setting up a Google Docs to WordPress automation and generated this blog article on WordPress’s Illegal String Offset in class-wp-roles.php:292 error.
Example:
Words will suddenly wrap to a new line without
any rhyme or reason as to the sentence structure and the
only logical reason behind it is that Zapier’s parsing
mechanism simply ran out of room on that line so it started
a new one.
The text winds up looking like that if you send it straight to the destination. Fixing it by hand can be tedious and time consuming, especially if you’re writing long form for SEO.
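If you are stuck with output that has already been wrapped like this, one possible cleanup is to rejoin the lines inside each paragraph. The following is only a sketch, not something Zapier provides: `unwrapParagraphs` is a hypothetical helper, and it assumes that the stray breaks are single new lines while real paragraph breaks are blank lines.

```javascript
// Hypothetical helper: rejoin hard-wrapped lines inside each paragraph.
// Assumes blank lines separate paragraphs and single new lines are unwanted wraps.
function unwrapParagraphs(text) {
  return text
    .split(/\n{2,}/) // blank lines mark the real paragraph breaks
    .map((paragraph) => {
      // Leave lists, headings, and blockquotes alone; their line breaks matter
      if (/^\s*([*\-+] |\d+\. |#{1,6} |> )/m.test(paragraph)) {
        return paragraph;
      }
      // Replace each wrap (and any spaces around it) with a single space
      return paragraph.replace(/\s*\n\s*/g, ' ').trim();
    })
    .join('\n\n');
}

const wrapped =
  'Words will suddenly wrap to a new line without\n' +
  'any rhyme or reason as to the sentence structure.\n\n' +
  '* first item\n* second item';

console.log(unwrapParagraphs(wrapped));
```

The first paragraph comes back as a single line, while the two list items keep their own lines. This only treats the symptom, which is why the rest of this article replaces the conversion step instead.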
Loss of Google Images in the Content
Another issue that arises with stripping the junk HTML and then adding clean HTML back is that Zapier’s Formatter does not like hyphens in the Google Docs’ image links, which look like this:
https://lh7-us.googleusercontent.com/54zWUdZrhe1zATU9Dc2wo6N2xnKYmpz-Xil0F4ht3lQXc5loksHyJ7Vlq6BNSNyVbZmLvgyLGOYWylT0qExXYsf9TRB2_hgFo82GHoOFsHXi76LsYPMfdIURDJiR2T50NjvW56McV1M6m11gK-uK4F0
This causes the HTML reconstruction phase to lose the image data, which requires more manual processing to correct and can be even more time consuming, because part of the image link ends up wrapped in alt and title attributes.
Google serves these images from Google Docs without an image extension, so parsing the data by pattern recognition (finding the end of the file using popular image format extensions like jpg, jpeg, png, gif, bmp, etc) is practically impossible. So, what’s the solution?
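One way around the missing extension is to stop looking at the file name and read the src attribute of each img tag instead. The sketch below illustrates that idea. `extractImages` is a hypothetical helper name, and the sample HTML (including its shortened image URL) is made up for illustration.

```javascript
// Hypothetical helper: collect src and alt from every <img> tag,
// regardless of attribute order or whether the URL has a file extension.
function extractImages(html) {
  const images = [];
  for (const [tag] of html.matchAll(/<img\b[^>]*>/gi)) {
    const src = /\ssrc="([^"]*)"/i.exec(tag);
    const alt = /\salt="([^"]*)"/i.exec(tag);
    if (src) {
      images.push({ src: src[1], alt: alt ? alt[1] : '' });
    }
  }
  return images;
}

// Made-up sample in the style of a Google Docs export
const sampleHtml =
  '<p><span><img alt="" src="https://lh7-us.googleusercontent.com/abc-DEF_123" title=""></span></p>';

console.log(extractImages(sampleHtml));
```

Because the match is anchored on the tag and its quoted attribute, hyphens and underscores in the URL make no difference.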
Custom HTML to Markdown via Javascript in Code by Zapier
The issue with paragraphs wrapping prematurely is caused by the HTML to Markdown conversion in Formatter by Zapier. Since we know that’s where the problem lies for paragraphs, here’s how we fix it. Instead of creating a “Formatter by Zapier” action, do the following:
To remove the junk HTML, the RAD way:
- Create a “Code by Zapier” action
- In the event drop down, select “Run Javascript”, and click “Continue”
- In the “Input Data” input field, type “html” (no quotes)
- In the “Input Data” dropdown, select your “File Raw Html Content” from the Google Docs step
- Copy and paste the following into the “Code” input field:
const decodeHtmlEntities = (text) => {
  // Manually replace common HTML entities.
  // Add more entities here as needed, but keep &amp; last so that
  // text such as "&amp;lt;" is not decoded twice.
  return text
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&rsquo;/g, "’")
    .replace(/&lsquo;/g, "‘")
    .replace(/&ldquo;/g, '“')
    .replace(/&rdquo;/g, '”')
    .replace(/&hellip;/g, '…')
    .replace(/&amp;/g, '&');
};
const cleanHtml = (html) => {
  // Decode HTML entities and remove CSS styles and unwanted HTML attributes
  let decodedHtml = decodeHtmlEntities(html);
  let cleanedHtml = decodedHtml
    .replace(/<style[^>]*>.*?<\/style>/gs, '') // Remove <style> tags
    .replace(/class="[^"]*"/g, '') // Remove class attributes
    .replace(/id="[^"]*"/g, '') // Remove id attributes
    .replace(/<!--.*?-->/gs, '') // Remove HTML comments
    .replace(/<script[^>]*>.*?<\/script>/gs, ''); // Remove <script> tags
  return cleanedHtml;
};
function htmlToMarkdown(html) {
  return html
    // Convert headers
    .replace(/<h1[^>]*>(.*?)<\/h1>/gi, '# $1\n\n')
    .replace(/<h2[^>]*>(.*?)<\/h2>/gi, '## $1\n\n')
    .replace(/<h3[^>]*>(.*?)<\/h3>/gi, '### $1\n\n')
    .replace(/<h4[^>]*>(.*?)<\/h4>/gi, '#### $1\n\n')
    .replace(/<h5[^>]*>(.*?)<\/h5>/gi, '##### $1\n\n')
    .replace(/<h6[^>]*>(.*?)<\/h6>/gi, '###### $1\n\n')
    // Convert links (href may not be the first attribute once class/id are stripped)
    .replace(/<a\b[^>]*?\shref="([^"]+)"[^>]*>(.*?)<\/a>/gi, '[$2]($1)')
    // Convert emphasis (\b keeps <b> and <i> from matching <br>, <body>, <img>, etc.)
    .replace(/<strong[^>]*>(.*?)<\/strong>/gi, '**$1**')
    .replace(/<em[^>]*>(.*?)<\/em>/gi, '*$1*')
    .replace(/<b\b[^>]*>(.*?)<\/b>/gi, '**$1**')
    .replace(/<i\b[^>]*>(.*?)<\/i>/gi, '*$1*')
    // Convert ordered lists first, so their items are numbered rather than bulleted
    .replace(/<ol[^>]*>(.*?)<\/ol>/gis, (match, items) =>
      '\n' + items.replace(/<li\b[^>]*>(.*?)<\/li>/gis, '1. $1\n') + '\n')
    // Convert unordered lists
    .replace(/<ul[^>]*>/gi, '\n')
    .replace(/<\/ul>/gi, '\n')
    .replace(/<li\b[^>]*>(.*?)<\/li>/gi, '* $1\n')
    // Convert line breaks
    .replace(/<br\s*\/?>/gi, '\n')
    // Convert blockquotes
    .replace(/<blockquote[^>]*>(.*?)<\/blockquote>/gi, '> $1\n')
    // Convert images to Markdown, keeping the src and any alt text
    .replace(/<img\b[^>]*>/gi, (tag) => {
      const src = /\ssrc="([^"]*)"/i.exec(tag);
      const alt = /\salt="([^"]*)"/i.exec(tag);
      return src ? `![${alt ? alt[1] : ''}](${src[1]})` : '';
    })
    // End each paragraph with a blank line so paragraphs stay separate
    .replace(/<\/p>/gi, '\n\n')
    // Remove all other HTML tags
    .replace(/<[^>]+>/g, '')
    // Decode HTML entities (&amp; last, so nothing is decoded twice)
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&nbsp;/g, ' ')
    .replace(/&amp;/g, '&')
    // Collapse runs of blank lines into a single blank line
    .replace(/\n{3,}/g, '\n\n');
}
const inputHtml = inputData.html;
const cleanedHtml = cleanHtml(inputHtml);
const outputMarkdown = htmlToMarkdown(cleanedHtml);
output = [{markdown: outputMarkdown}];
- Click “Continue”
- Click “Test” to preview your output data
This should keep all of the paragraph formatting the way you intended it. No random new lines are inserted, so it should save you loads of time. Now to fix the issue where images are lost in returning the Markdown to HTML.
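One detail that is easy to get wrong when extending the entity list in a script like this is the decoding order. If the ampersand entity is decoded first, text that was escaped on purpose gets decoded a second time. The sketch below shows the safer order. `decodeBasicEntities` and the entity table are illustrative names, not part of Zapier.

```javascript
// Illustrative entity table: the ampersand entity comes LAST on purpose.
const ENTITY_TABLE = [
  ['&lt;', '<'],
  ['&gt;', '>'],
  ['&quot;', '"'],
  ['&#39;', "'"],
  ['&nbsp;', ' '],
  ['&amp;', '&'],
];

// Hypothetical helper: decode each entity once, in table order.
function decodeBasicEntities(text) {
  return ENTITY_TABLE.reduce(
    (decoded, [entity, character]) => decoded.split(entity).join(character),
    text
  );
}

// "&amp;lt;p&amp;gt;" is how a document writes the literal text "&lt;p&gt;"
console.log(decodeBasicEntities('Tom &amp; Jerry'));           // Tom & Jerry
console.log(decodeBasicEntities('Type &amp;lt;p&amp;gt; here')); // Type &lt;p&gt; here
```

With the ampersand rule first, the second example would turn into a real `<p>` tag instead of staying as escaped text.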
To Convert Markdown to HTML, the RAD way:
- Create a “Code by Zapier” action
- In the event drop down, select “Run Javascript”, and click “Continue”
- In the “Input Data” input field, type “markdown” (no quotes)
- In the “Input Data” dropdown, select your “Markdown” output from the HTML to Markdown step
- Copy and paste the following into the “Code” input field:
function markdownToHtml(markdown) {
  let html = markdown
    // Preliminary cleanup of URLs in Markdown links
    .replace(/\[(.*?)\]\((.*?)\)/gim, (match, text, url) => {
      // Remove unwanted query parameters starting from "&sa="
      const cleanedUrl = url.replace(/&sa=.*$/, '');
      return `[${text}](${cleanedUrl})`;
    })
    // Convert headers
    .replace(/^###### (.*$)/gim, '<h6>$1</h6>')
    .replace(/^##### (.*$)/gim, '<h5>$1</h5>')
    .replace(/^#### (.*$)/gim, '<h4>$1</h4>')
    .replace(/^### (.*$)/gim, '<h3>$1</h3>')
    .replace(/^## (.*$)/gim, '<h2>$1</h2>')
    .replace(/^# (.*$)/gim, '<h1>$1</h1>')
    // Convert blockquotes
    .replace(/^\> (.*$)/gim, '<blockquote>$1</blockquote>')
    // Convert unordered lists (before emphasis, so a leading "* " is not read as italics)
    .replace(/^\* (.*$)/gim, '<ul>\n<li>$1</li>\n</ul>')
    // Convert ordered lists
    .replace(/^1\. (.*$)/gim, '<ol>\n<li>$1</li>\n</ol>')
    // Combine adjacent list items into a single list
    .replace(/<\/ul>\n<ul>/gim, '')
    .replace(/<\/ol>\n<ol>/gim, '')
    // Convert bold and italic
    .replace(/\*\*\*(.*?)\*\*\*/gim, '<strong><em>$1</em></strong>')
    .replace(/\*\*(.*?)\*\*/gim, '<strong>$1</strong>')
    .replace(/\*(.*?)\*/gim, '<em>$1</em>')
    // Convert images first, then links (an image is a link with a leading "!")
    .replace(/\!\[(.*?)\]\((.*?)\)/gim, '<img alt="$1" src="$2" />')
    .replace(/\[(.*?)\]\((.*?)\)/gim, '<a href="$2">$1</a>')
    // Convert line breaks to <br>
    .replace(/\n/gim, '<br />');
  // Wrap non-block elements in <p> tags
  const blockTags = ['<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>',
    '<ul>', '<ol>', '<li>', '<blockquote>'];
  html = html.split('<br /><br />').map(paragraph => {
    if (!blockTags.some(tag => paragraph.startsWith(tag))) {
      return `<p>${paragraph.replace(/<br \/>/g, '')}</p>`;
    }
    return paragraph.replace(/<br \/>/g, '\n');
  }).join('\n\n');
  return html;
}
const inputMarkdown = inputData.markdown;
const outputHtml = markdownToHtml(inputMarkdown);
output = [{html: outputHtml}];
- Click “Continue”
- Click “Test” to preview your output data
This Javascript will reconstruct your HTML content from the Markdown you created in the previous step, and it’s written to convert Markdown image syntax back into img tags with a src attribute. This solves the second half of our Formatter by Zapier issue.
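To see the image handling on its own, here is a small sketch that isolates the image and link rules from the script above. The function name `convertImagesAndLinks` and the sample URLs are made up for illustration. The point is that the image rule has to run before the link rule, and that hyphens and underscores in the URL pass through untouched.

```javascript
// Isolated sketch of the image and link rules.
// The image rule must run first: "![alt](url)" also matches the link pattern.
function convertImagesAndLinks(markdown) {
  return markdown
    .replace(/!\[(.*?)\]\((.*?)\)/g, '<img alt="$1" src="$2" />')
    .replace(/\[(.*?)\]\((.*?)\)/g, '<a href="$2">$1</a>');
}

// Made-up sample: an extension-less image URL with a hyphen and an underscore
const sampleMarkdown =
  '![chart](https://lh7-us.googleusercontent.com/abc-DEF_123) and [a link](https://example.com/some-page)';

console.log(convertImagesAndLinks(sampleMarkdown));
// <img alt="chart" src="https://lh7-us.googleusercontent.com/abc-DEF_123" /> and <a href="https://example.com/some-page">a link</a>
```

If the two rules were swapped, the image would come out as a stray "!" followed by a link.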
Conclusion
When you run these two Javascript steps consecutively, you get clean HTML output from Google Docs while maintaining your paragraph formatting and your images. It takes some initial work to set up, but if you’re working on a Zapier automation to do this anyway, it only takes a few minutes longer to set it up this way, and those few extra minutes will pay for themselves in spades the first time you run it.
Hopefully you found this information useful. What do you use Zapier for? Have any of you tried its biggest competitor, Pabbly, or their HIPAA-compliant brethren, Keragon? Let us know in the comments below!
About the Author: Mark Bush
NOTE: Some links on this page may be affiliate links, and help support our business. These links do not alter the cost of the product, but provide a small percentage of the sale to us as the referral source.