Text Cleaner
Paste messy text and choose which cleanup operations to apply. Useful for cleaning up text copied from PDFs, websites, Word documents, or emails before pasting into another tool or publishing.
When Do You Need a Text Cleaner?
Text copied from PDFs, websites, and word processors often carries hidden formatting baggage that causes problems when pasted elsewhere:
- PDFs: Line breaks are inserted at every visual line wrap, turning a paragraph into dozens of short lines. Soft hyphens and ligature characters may appear as garbage text.
- Microsoft Word / Google Docs: Smart quotes (“ ” ‘ ’) look elegant in a document but can break code, JSON, CSV files, and some web forms that expect straight ASCII quotes.
- Websites: Copying from a webpage often captures hidden HTML tags, non-breaking spaces (&nbsp;), and zero-width characters that are invisible but interfere with search, sorting, and string operations.
- Emails: Quoted reply threads can add ">" characters, trailing spaces, and mixed line endings (Windows CRLF vs Unix LF).
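As a rough illustration of two of the problems above (hard-wrapped PDF lines and mixed line endings), here is a minimal Python sketch. The function names `normalize_line_endings` and `unwrap_lines` are hypothetical and not taken from the tool itself. The sketch assumes that a blank line marks a paragraph break and that wrapped lines should be joined with a single space.

```python
import re


def normalize_line_endings(text: str) -> str:
    """Convert Windows (CRLF) and bare CR line endings to Unix LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def unwrap_lines(text: str) -> str:
    """Join hard-wrapped lines within each paragraph.

    Assumption: paragraphs are separated by at least one blank line.
    Trailing spaces on each line are dropped as a side effect.
    """
    text = normalize_line_endings(text)
    paragraphs = re.split(r"\n\s*\n", text)
    joined = [
        " ".join(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
    ]
    return "\n\n".join(joined)


sample = "A line  \r\nwrapped here.\r\n\r\nNext paragraph."
print(unwrap_lines(sample))
```

A real cleaner would likely also need to handle words hyphenated across line breaks, which this sketch does not attempt.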
Smart Quotes vs. Straight Quotes
Smart (curly) quotes, such as “ ” and ‘ ’, are the typographically correct form for printed and web content. But they are separate Unicode code points from the ASCII straight quotes (" and ') and can cause problems in:
- Source code (Python, JavaScript, SQL strings)
- Command-line terminals
- CSV files and spreadsheet imports
- Database queries
- Search fields that don't normalize Unicode
The text cleaner's "Replace smart quotes" option converts all four curly variants to their plain ASCII equivalents.
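The same conversion can be sketched in a few lines of Python. This is an illustrative approximation, not the tool's actual code; `replace_smart_quotes` is a hypothetical name, and the mapping covers only the four curly variants mentioned above.

```python
# Map the four curly quote code points to their ASCII equivalents.
SMART_QUOTES = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (also used as an apostrophe)
})


def replace_smart_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(SMART_QUOTES)


print(replace_smart_quotes("\u201cHello\u201d, it\u2019s \u2018fine\u2019"))
```

Using `str.translate` with a prebuilt table handles all four replacements in a single pass over the string.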
Zero-Width and Invisible Characters
Zero-width characters (ZWC) are Unicode code points that produce no visible glyph but exist in the text. Common examples include the zero-width space (U+200B), zero-width non-joiner (U+200C), zero-width joiner (U+200D), and the byte-order mark (U+FEFF). They're commonly found in:
- Text copied from mobile apps and messaging platforms
- Text that has been watermarked with invisible steganographic characters
- Content translated or generated by some AI tools
These characters are invisible to the reader but can break exact-match searches, cause incorrect character counts, and produce unexpected behavior in code. The "Remove zero-width and invisible characters" option strips them all.
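A minimal Python sketch of this kind of stripping is shown below. It assumes only the four code points named above; the tool's full list is not documented here and may be longer. `remove_zero_width` is a hypothetical name.

```python
import re

# Character class with the four code points listed above:
# zero-width space, zero-width non-joiner, zero-width joiner, byte-order mark.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")


def remove_zero_width(text: str) -> str:
    """Delete zero-width characters without touching visible text."""
    return ZERO_WIDTH.sub("", text)


dirty = "\ufeffca\u200bfe"
clean = remove_zero_width(dirty)
print(len(dirty), len(clean))  # lengths differ although both display as "cafe"
```

Note that the zero-width joiner and non-joiner carry meaning in some scripts and in emoji sequences, so removing them blindly can change how such text renders.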
Frequently Asked Questions
Can I undo the cleaning?
The original text remains in the top textarea — simply copy the cleaned output separately. If you want to compare the original and cleaned versions side by side, use the text diff tool.
Does cleaning remove blank lines?
The "Collapse multiple blank lines into one" option reduces runs of two or more empty lines down to a single blank line — preserving paragraph breaks while removing excessive spacing. If you want to remove all blank lines, use the "Remove all line breaks" option, which joins the entire text into one block. For removing only duplicate lines (not blank lines), see the remove duplicate lines tool.
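The difference between the two options can be sketched in Python as follows. Both function names are hypothetical, the input is assumed to use LF line endings, and joining lines with a single space in `remove_line_breaks` is an assumption about how the text is merged into one block.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Reduce runs of two or more empty (or whitespace-only) lines to one."""
    return re.sub(r"\n(?:[ \t]*\n){2,}", "\n\n", text)


def remove_line_breaks(text: str) -> str:
    """Join all non-empty lines into one block (assumed separator: a space)."""
    return " ".join(line.strip() for line in text.splitlines() if line.strip())


sample = "First paragraph.\n\n\n\nSecond paragraph.\nStill second."
print(collapse_blank_lines(sample))
print(remove_line_breaks(sample))
```

The first function keeps paragraph breaks intact; the second discards them along with every other line break.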
What does "decode HTML entities" do?
HTML-encoded text uses sequences like &amp;amp; for &, &amp;lt; for <, and &amp;nbsp; for a non-breaking space. When you copy text from a web page's source or from a CMS, these entities sometimes appear literally rather than as the characters they represent. This option converts them back to their original characters.
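For comparison, Python's standard library performs the same kind of decoding with `html.unescape`. The wrapper name `decode_entities` below is hypothetical, and this is only a sketch of the idea rather than the tool's implementation.

```python
import html


def decode_entities(text: str) -> str:
    """Convert HTML entities such as &amp; and &lt; back to characters."""
    return html.unescape(text)


print(decode_entities("Fish &amp; chips &lt;b&gt;"))
# &nbsp; decodes to U+00A0 (a non-breaking space), not an ordinary space.
print(repr(decode_entities("a&nbsp;b")))
```

Because a decoded non-breaking space is still a distinct character, a separate step is needed if ordinary spaces are wanted.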