
Strip unwanted tags and attributes from XHTML

2009 July 1
by Richard Knop

Have you ever found strange tags and attributes in the XHTML markup produced by a rich text editor? And were you sure the rich text editor doesn't produce those tags on its own?

The answer to this mysterious behaviour is MS Word. Many users simply copy and paste text from Word documents, unaware that the text comes with garbage XHTML tags and attributes that only Word uses.

I wrote a Zend action helper that strips all unwanted tags and attributes from XHTML markup. All credit goes to nick's comment in the PHP documentation for the strip_tags() function:

  class My_Controller_Action_Helper_StripTagsAttributes extends Zend_Controller_Action_Helper_Abstract
  {
      public function direct($string, $allowedTags = null, $allowedAttributes = null)
      {
          // A quoted attribute value. [^']* and [^"]* stop at the closing quote,
          // so one match cannot run across several attributes or tags.
          $value = '(\'[^\']*\'|"[^"]*")';

          if ($allowedAttributes) {
              // Accept 'href,src,alt' as well as array('href', 'src', 'alt').
              if (!is_array($allowedAttributes)) {
                  $allowedAttributes = explode(',', $allowedAttributes);
              }
              $names = implode('|', array_map('trim', $allowedAttributes));

              // Protect the allowed attributes: swap "=" for the placeholder "_-_-".
              $string = preg_replace('/\s(' . $names . ')=' . $value . '/i', ' $1_-_-$2', $string);
          }

          // Remove every remaining name="value" pair that sits inside a tag.
          $string = preg_replace('/\s[\w:-]+=' . $value . '(?=[^<>]*>)/', '', $string);

          if ($allowedAttributes) {
              // Restore the protected attributes.
              $string = preg_replace('/\s(' . $names . ')_-_-' . $value . '/i', ' $1=$2', $string);
          }

          // Finally drop every tag that is not explicitly allowed.
          return strip_tags($string, $allowedTags);
      }
  }

How to use it:

  /*
   * This strips all tags except
   * <p>, <strong>, <em>, <ul>, <ol>, <li>, <a>, <h1>, <h2> and <h3>,
   * and all attributes except
   * href, src and alt, from the $xhtml variable.
   */
  $xhtml = $this->_helper->stripTagsAttributes($xhtml,
                                               '<p><strong><em><ul><ol><li><a><h1><h2><h3>',
                                               'href,src,alt');
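
To see the effect without setting up a controller, here is a minimal standalone sketch of the same placeholder technique (protect the allowed attributes, strip the rest, restore). The function name strip_tags_attributes() and the sample markup are made up for illustration, and the regular expressions are a simplified take that only handles quoted attribute values:

```php
<?php
// Hypothetical standalone function, not part of Zend Framework or PHP itself.
function strip_tags_attributes($string, $allowedTags = '', array $allowedAttributes = array())
{
    // A quoted attribute value that ends at its own closing quote.
    $value = '(\'[^\']*\'|"[^"]*")';
    $names = implode('|', array_map('preg_quote', $allowedAttributes));

    if ($names !== '') {
        // Protect allowed attributes by replacing "=" with a placeholder.
        $string = preg_replace('/\s(' . $names . ')=' . $value . '/i', ' $1_-_-$2', $string);
    }

    // Strip every other name="value" pair found inside a tag.
    $string = preg_replace('/\s[\w:-]+=' . $value . '(?=[^<>]*>)/', '', $string);

    if ($names !== '') {
        // Put the "=" back into the protected attributes.
        $string = preg_replace('/\s(' . $names . ')_-_-' . $value . '/i', ' $1=$2', $string);
    }

    return strip_tags($string, $allowedTags);
}

// Made-up sample resembling markup pasted from Word.
$dirty = '<p class="MsoNormal" style="margin:0cm"><span lang="EN-GB">Hello '
       . '<a href="http://example.com/" target="_blank">world</a></span></p>';

echo strip_tags_attributes($dirty, '<p><a>', array('href'));
// prints: <p>Hello <a href="http://example.com/">world</a></p>
```

Regular expressions are a pragmatic choice here rather than a full parser, so unquoted attribute values or a ">" inside an attribute value will slip through.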

Finally, I strongly recommend using HTML Purifier as protection against cross-site scripting (often referred to as XSS).
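
As a rough sketch of what that could look like, assuming the HTML Purifier library is installed and its HTMLPurifier.auto.php file is on the include path (the whitelist below simply mirrors the tags from the usage example and is an assumption, not a recommended policy):

```php
<?php
// Assumes the HTML Purifier library is available on the include path.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

// Whitelist of elements; allowed attributes go in square brackets.
$config->set('HTML.Allowed', 'p,strong,em,ul,ol,li,a[href],h1,h2,h3');

$purifier = new HTMLPurifier($config);

// $xhtml is the untrusted markup, e.g. the content submitted from the rich text editor.
$xhtml = $purifier->purify($xhtml);
```

Unlike the regular-expression helper, HTML Purifier parses the markup, so it also deals with malformed tags and dangerous attribute values such as javascript: URLs.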
