Strip unwanted tags and attributes from XHTML
Have you ever found strange tags and attributes in XHTML markup produced by a rich text editor, even though you were sure the editor doesn’t produce them on its own?
The answer to this mysterious behaviour is MS Word. Many users copy and paste text from Word documents, unaware that the text brings along junk tags and attributes that only Word uses.
I wrote a Zend Framework action helper that strips all unwanted tags and attributes from XHTML markup. All credit goes to nick’s comment in the documentation of PHP’s strip_tags() function:
    class My_Controller_Action_Helper_StripTagsAttributes extends Zend_Controller_Action_Helper_Abstract
    {
        /**
         * Strips every tag not listed in $allowedTags and every quoted
         * attribute not listed in $allowedAttributes.
         */
        public function direct($string, $allowedTags = null, $allowedAttributes = null)
        {
            if ($allowedAttributes) {
                // Accept either an array or a comma-separated list
                if (!is_array($allowedAttributes)) {
                    $allowedAttributes = array_map('trim', explode(',', $allowedAttributes));
                }
                // Build a regex alternation such as "href|src|alt"
                $allowedAttributes = implode('|', $allowedAttributes);

                // Pass 1: mask allowed attributes by replacing "=" with "_-_-"
                $rep = '/([^>]*) (' . $allowedAttributes . ')(=)(\'.*\'|".*")/i';
                $string = preg_replace($rep, '$1 $2_-_-$4', $string);
            }

            // Pass 2: drop every attribute that still contains "="
            $string = preg_replace('/([^>]*) (.*)(=\'.*\'|=".*")(.*)/i', '$1$4', $string);

            if ($allowedAttributes) {
                // Pass 3: restore the masked attributes
                $rep = '/([^>]*) (' . $allowedAttributes . ')(_-_-)(\'.*\'|".*")/i';
                $string = preg_replace($rep, '$1 $2=$4', $string);
            }

            // Finally remove every tag that is not explicitly allowed
            return strip_tags($string, $allowedTags);
        }
    }
How to use it:
    /*
     * This strips all tags except
     * <p>, <strong>, <em>, <ul>, <ol>, <li>, <a>, <h1>, <h2> and <h3>,
     * and all attributes except
     * href, src and alt, from the $xhtml variable.
     */
    $xhtml = $this->_helper->stripTagsAttributes(
        $xhtml,
        '<p><strong><em><ul><ol><li><a><h1><h2><h3>',
        'href,src,alt'
    );
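The helper's patterns use greedy `.*` matches, which can span more than one attribute or tag on the same line. As an illustration only, here is a minimal standalone sketch of a stricter variant that filters attributes one tag at a time. The function name `stripTagsAttributes` and its signature are hypothetical (they are not part of nick's original comment), and the sketch assumes that attribute values are always quoted and never contain a `>` character.

```php
<?php
/**
 * Hypothetical standalone variant: keeps only whitelisted, quoted attributes
 * on each opening tag, then removes disallowed tags with strip_tags().
 */
function stripTagsAttributes(string $html, string $allowedTags = '', array $allowedAttributes = []): string
{
    $allowed = array_map('strtolower', $allowedAttributes);

    // Visit every opening tag: "<name" followed by its attribute text up to ">".
    $html = preg_replace_callback(
        '/<([a-z][a-z0-9]*)\b([^>]*)>/i',
        function (array $m) use ($allowed): string {
            $kept = '';
            // Collect name="value" or name='value' pairs inside this tag only.
            preg_match_all(
                '/([a-z_:][-a-z0-9_:.]*)\s*=\s*("[^"]*"|\'[^\']*\')/i',
                $m[2],
                $attrs,
                PREG_SET_ORDER
            );
            foreach ($attrs as $attr) {
                if (in_array(strtolower($attr[1]), $allowed, true)) {
                    $kept .= ' ' . $attr[1] . '=' . $attr[2];
                }
            }
            // Preserve the self-closing slash of XHTML tags such as <br />.
            $selfClosing = preg_match('~/\s*$~', $m[2]) ? ' /' : '';
            return '<' . $m[1] . $kept . $selfClosing . '>';
        },
        $html
    );

    return strip_tags($html, $allowedTags);
}

// Typical Word residue: class, style, target and lang attributes plus a <span>.
$dirty = '<p class="MsoNormal" style="margin:0">Read '
       . '<a href="/doc" target="_blank">this</a><span lang="EN-US"> now</span></p>';

echo stripTagsAttributes($dirty, '<p><a>', ['href']);
// prints: <p>Read <a href="/doc">this</a> now</p>
```

Because each tag is processed in isolation, an allowed attribute on one tag cannot shield a disallowed attribute on another. Unquoted attribute values are simply dropped by this sketch, which may or may not be what you want.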
Finally, I strongly recommend using HTML Purifier to protect against cross-site scripting (often referred to as XSS).
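For orientation, a minimal sketch of what a comparable whitelist might look like with HTML Purifier. It assumes the library is installed and its classes can be autoloaded (for example through Composer). The wrapper function `purifyXhtml` is a hypothetical name, and the allowed list only roughly mirrors the helper call shown earlier.

```php
<?php
// Assumption: HTML Purifier is installed and autoloadable,
// for example via require 'vendor/autoload.php';

/**
 * Hypothetical wrapper: cleans markup against a fixed element/attribute whitelist.
 */
function purifyXhtml(string $dirty): string
{
    $config = HTMLPurifier_Config::createDefault();
    // Elements to keep, with permitted attributes in square brackets.
    $config->set('HTML.Allowed', 'p,strong,em,ul,ol,li,a[href],h1,h2,h3');

    $purifier = new HTMLPurifier($config);
    return $purifier->purify($dirty);
}
```

Unlike the regex-based helper, HTML Purifier parses the markup and filters it against its whitelist, so it is the safer choice whenever the input comes from untrusted users.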