Python and PHP implementations of an HTML parser based on the WHATWG HTML5 specification, for maximum compatibility with major desktop web browsers.

Note that the separate ports are not kept in sync; they are effectively different projects offering similar functionality for their respective languages.

Notes

  • Users of the sanitizer must serialize with quoted attribute values to avoid known script injection holes in older browsers, including IE < 8
  • The Ruby port is currently unmaintained
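
To illustrate why quoted attribute values matter, here is a minimal sketch that uses only the Python standard library. It is not html5lib's serializer; the helper name `serialize_attrs` and the sample attribute are hypothetical and chosen only for illustration. It shows the general principle that an unquoted value containing whitespace can spill into a second attribute, while a quoted and escaped value cannot.

```python
from html import escape


def serialize_attrs(attrs):
    """Serialize name/value pairs with every value double-quoted (hypothetical helper)."""
    parts = []
    for name, value in attrs.items():
        # quote=True also escapes " and ', so the value cannot close its quotes early.
        parts.append('{}="{}"'.format(name, escape(value, quote=True)))
    return " ".join(parts)


# A hostile value containing whitespace and something that looks like an event handler.
hostile = {"title": "x onmouseover=alert(1)"}

# Naive unquoted output: a browser would read onmouseover as a separate attribute.
unquoted = " ".join("{}={}".format(k, v) for k, v in hostile.items())

# Quoted output: the whole string stays inside the title attribute.
quoted = serialize_attrs(hostile)

print(unquoted)
print(quoted)
```

When serializing with html5lib itself, check the serializer options of the installed release for the setting that forces attribute quoting, since its name and accepted values are not stated here.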

Python 0.95 Release Features

  • Parses valid and invalid HTML documents to a tree
  • Support for minidom, ElementTree (including cElementTree and lxml.etree), BeautifulSoup (deprecated) and custom simpletree output formats
  • DOM to SAX converter
  • Reports parse errors
  • Character encoding detection
  • Filtering and serializing of trees
  • HTML+CSS sanitizer
  • Many unit tests
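
As a rough sketch of the first few features, the following parses a deliberately invalid fragment into an ElementTree tree and inspects the collected parse errors. It assumes a recent html5lib release is installed; the call names used here (`HTMLParser`, `treebuilders.getTreeBuilder`, `parser.errors`) are the ones known from later releases and may differ in detail in 0.95. The sample markup is made up for illustration.

```python
import html5lib
from html5lib import treebuilders

# Deliberately malformed input: no doctype, and <p> and <b> are never closed.
markup = "<p>Unclosed <b>paragraph"

# Ask for an ElementTree-based tree without XHTML namespaces on element names.
parser = html5lib.HTMLParser(
    tree=treebuilders.getTreeBuilder("etree"),
    namespaceHTMLElements=False,
)
root = parser.parse(markup)

# The parser recovers the way a browser would and still yields a complete tree.
paragraph = root.find(".//p")
bold = paragraph.find("b")
print(paragraph.text, bold.text)

# Parse errors are collected on the parser instead of being raised,
# because strict mode is off by default.
error_count = len(parser.errors)
print(error_count, "parse errors reported")
```

Choosing a different tree type, such as minidom or lxml.etree, should only require passing a different name to `getTreeBuilder`, provided the corresponding library is available.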

Documentation

Using html5Lib

Getting help/getting involved
