The following is a proposed standard for bringing more semanticity to articles on the Web. In our efforts to provide quality content without the superfluous leavings, we've seen that the Web is a pretty messy place. We hope that by providing some simple guidelines we can help publishers make their content a little more presentable with Readability while also making the Web a bit more semantic.
By and large, you'll find that our guidelines just follow other specifications. We lean heavily on the work of the hNews microformat as well as the new elements provided within HTML5. If anything is unclear, please refer to the hNews microformat specification and to the guide to semantic elements in HTML5 from Mark Pilgrim's Dive Into HTML5.
The hNews Microformat
Readability recommends and parses the hNews microformat for Articles. To quote the microformat specification, “hNews is a microformat for news content. hNews extends hAtom, introducing a number of fields that more completely describe a journalistic work.”
Below are a few explanations of the hNews microformat and how Readability uses it. For full information, view the hNews spec at microformats.org.
hentry
The hentry class denotes the beginning of an Entry, which is the wrapper within which all of our content is found.
entry-title
The entry-title class denotes the title of the Article. This is intentionally distinct from the title tag, which often differs because it also carries the organization name or SEO content.
entry-content
The entry-content class denotes which part of the Article is the body content. Readability will use this as the body, if found.
entry-summary
The entry-summary class denotes the lede, subhead or dek of the Article. If this exists, it should be distinct from both the title and the body content, and should give a brief summary (one or two sentences) of the Article itself.
byline
The byline vcard denotes who wrote the article, typically a person. The fn class within it denotes the person's full name. See hCard for more information.
source-org
The source of the article is the organization or group backing the article. If the article is solely the work of an individual, the individual will suffice (and you may add the source-org class to the author vcard).
source-org also follows the hCard spec.
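To make the class names above concrete, here is a minimal sketch of how a parser might pick the hNews fields out of a page. It is not Readability's actual implementation: the sample markup, the HNewsSketch class and the extract_hnews helper are all hypothetical, and the code assumes well-formed markup in which every non-void element has a closing tag.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the names and text are invented for illustration.
SAMPLE = """
<div class="hentry">
  <h1 class="entry-title">City Council Approves New Park</h1>
  <p class="entry-summary">The vote ends a two-year debate.</p>
  <p class="byline vcard">By <span class="fn">Jane Doe</span></p>
  <p class="source-org vcard"><span class="fn org">Example Times</span></p>
  <div class="entry-content"><p>The council voted on Tuesday.</p></div>
</div>
"""

FIELDS = {"entry-title", "entry-summary", "entry-content", "byline", "source-org"}
# Void elements never get an end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HNewsSketch(HTMLParser):
    """Collects the text found inside each hNews class (assumed helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one set of labels per open element
        self.text = {}   # label -> text collected so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = set((dict(attrs).get("class") or "").split())
        labels = classes & FIELDS
        # An fn inside the byline vcard is the author's full name.
        in_byline = "byline" in classes or any("byline" in s for s in self.stack)
        if "fn" in classes and in_byline:
            labels.add("author")
        self.stack.append(labels)

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text belongs to every labelled element that is currently open.
        for label in set().union(*self.stack):
            self.text[label] = self.text.get(label, "") + data


def extract_hnews(html):
    parser = HNewsSketch()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup's indentation.
    return {label: " ".join(text.split()) for label, text in parser.text.items()}


fields = extract_hnews(SAMPLE)
for label in sorted(fields):
    print(label, "->", fields[label])
```

The sketch keeps the byline text and the author's fn separately, since the byline often carries extra words such as "By" that are not part of the name.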
HTML5 Recommendations
The capabilities in HTML5 afford a great deal more semanticity, which is very helpful when trying to understand content. These are a few guidelines that will help your markup be easily understandable as an article.
<article>
Use the article tag to wrap an entry. It's semantic and easy for Readability to spot.
<time>
We’ll be looking for time elements with the pubdate attribute within articles we process. This will help us understand when the article was published. To quote the HTML5 Working Draft, the pubdate attribute “is a boolean attribute. If specified, it indicates that the date and time given by the element is the publication date and time of the nearest ancestor article element, or, if the element has no ancestor article element, of the document as a whole.”
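As an illustration, the following sketch shows one way a parser could find the publication date of an article. The sample markup and the PubdateFinder class are hypothetical, and the sketch only handles time elements that sit inside an article element and carry a datetime attribute in ISO 8601 form.

```python
from datetime import datetime
from html.parser import HTMLParser

# Hypothetical sample markup; only the first time element is the pubdate.
SAMPLE = """
<article>
  <h1>City Council Approves New Park</h1>
  <time datetime="2011-03-15T09:30:00-05:00" pubdate>March 15, 2011</time>
  <p>The council voted on Tuesday.</p>
  <time datetime="2011-03-01">March 1</time>
</article>
"""


class PubdateFinder(HTMLParser):
    """Records the datetime value of time[pubdate] elements inside articles."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0
        self.pubdates = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "time" and self.article_depth > 0:
            attr_map = dict(attrs)
            # A boolean attribute such as pubdate is reported with value None,
            # so test for the key rather than for a truthy value.
            if "pubdate" in attr_map:
                self.pubdates.append(attr_map.get("datetime"))

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1


finder = PubdateFinder()
finder.feed(SAMPLE)
finder.close()

# Parse the first pubdate found; fromisoformat accepts this offset form.
published = datetime.fromisoformat(finder.pubdates[0])
print(published.isoformat())
```

The second time element in the sample is ignored because it has no pubdate attribute, which is the distinction the attribute exists to make.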
<aside>, <header>, <nav> and <footer>
By using these tags, you give parsers a big head start in figuring out what is not the primary content of the page.
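A rough sketch of that head start: the parser below simply drops any text found inside the four elements. The sample page and the PrimaryText class are hypothetical, and a real parser would be more careful (for example, a header nested inside an article may hold the article's own title).

```python
from html.parser import HTMLParser

# Hypothetical sample page with one article surrounded by page furniture.
SAMPLE = """
<body>
  <header><h1>Example Times</h1></header>
  <nav><a href="/">Home</a> <a href="/news">News</a></nav>
  <article><p>The council voted on Tuesday.</p></article>
  <aside><p>Subscribe to our newsletter.</p></aside>
  <footer><p>Copyright Example Times</p></footer>
</body>
"""

NON_PRIMARY = {"aside", "header", "nav", "footer"}


class PrimaryText(HTMLParser):
    """Keeps only text that is outside aside, header, nav and footer."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many non-primary elements are open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NON_PRIMARY:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NON_PRIMARY and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


parser = PrimaryText()
parser.feed(SAMPLE)
parser.close()

# Collapse the whitespace left behind by the removed elements.
primary_text = " ".join("".join(parser.chunks).split())
print(primary_text)
```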
<figure> and <figcaption>
These tags should be used for media related to an article. This allows us to pull media nicely into an article's flow. Images are the most typical case, but other media is also allowed, as per the W3C spec: “The element can thus be used to annotate illustrations, diagrams, photos, code listings, etc.” Please note that Readability may strip content such as Flash and images, depending on user preference.
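The sketch below shows how a figure and its caption might be collected together so that they can be placed in the article's flow. The sample markup, the MEDIA_TAGS set and the FigureCollector class are hypothetical, and nested figures are not handled.

```python
from html.parser import HTMLParser

# Hypothetical sample markup with a single captioned image.
SAMPLE = """
<article>
  <p>The council voted on Tuesday.</p>
  <figure>
    <img src="park.jpg" alt="The proposed park site">
    <figcaption>The proposed park site on Elm Street.</figcaption>
  </figure>
</article>
"""

# Assumed set of media elements of interest; a real parser may use another.
MEDIA_TAGS = {"img", "video", "audio"}


class FigureCollector(HTMLParser):
    """Pairs the media inside each figure with its figcaption text."""

    def __init__(self):
        super().__init__()
        self.figures = []
        self.current = None      # the figure being read, if any
        self.in_caption = False

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self.current = {"media": [], "caption": ""}
        elif self.current is not None:
            if tag == "figcaption":
                self.in_caption = True
            elif tag in MEDIA_TAGS:
                self.current["media"].append((tag, dict(attrs).get("src")))

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self.in_caption = False
        elif tag == "figure" and self.current is not None:
            self.current["caption"] = " ".join(self.current["caption"].split())
            self.figures.append(self.current)
            self.current = None

    def handle_data(self, data):
        if self.in_caption and self.current is not None:
            self.current["caption"] += data


collector = FigureCollector()
collector.feed(SAMPLE)
collector.close()
figures = collector.figures
print(figures)
```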
Readability-Specific Directives
These are guidelines specifically created to help Readability or similar parsers with your content. They may not have much semantic value outside of a parser context.
.entry-unrelated
This is a special class that explicitly tells Readability (and other parsers) to ignore the content within it. It can be used on any element.
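One way a parser could honor this class is sketched below: it tracks every open element and drops text whenever any ancestor is marked entry-unrelated. The sample markup and the SkipUnrelated class are hypothetical, and the sketch assumes well-formed markup in which every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Hypothetical sample markup with a promotional block inside the body.
SAMPLE = """
<div class="entry-content">
  <p>The council voted on Tuesday.</p>
  <div class="entry-unrelated"><p>Most read: <a href="/x">Ten best parks</a></p></div>
  <p>Construction begins in May.</p>
</div>
"""

# Void elements never get an end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SkipUnrelated(HTMLParser):
    """Drops all text nested inside an element with class entry-unrelated."""

    def __init__(self):
        super().__init__()
        self.stack = []  # True for each open element marked entry-unrelated
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append("entry-unrelated" in classes)

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Keep the text only when no open ancestor is marked as unrelated.
        if not any(self.stack):
            self.chunks.append(data)


parser = SkipUnrelated()
parser.feed(SAMPLE)
parser.close()
kept_text = " ".join("".join(parser.chunks).split())
print(kept_text)
```

Because the check looks at every open ancestor, the class works on any element and hides everything nested inside it, however deep.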
.entry-content-asset
This is a special class that explicitly tells Readability (and other parsers) that the content within it is related to the article. This is particularly useful when you have content that should be an asset in a figure tag but cannot yet switch to HTML5.
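As a sketch of how this class could stand in for a figure tag, the parser below records media elements that are marked with the class or nested inside an element that is. The sample markup, the MEDIA set and the AssetCollector class are hypothetical, and the code assumes well-formed markup.

```python
from html.parser import HTMLParser

# Hypothetical pre-HTML5 markup that uses a div in place of a figure.
SAMPLE = """
<div class="entry-content">
  <p>The council voted on Tuesday.</p>
  <div class="entry-content-asset">
    <img src="park.jpg" alt="The proposed park site">
  </div>
  <img src="tracking-pixel.gif" alt="">
</div>
"""

# Assumed set of media elements of interest; a real parser may use another.
MEDIA = {"img", "video", "audio", "embed", "object", "iframe"}
# Void elements never get an end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class AssetCollector(HTMLParser):
    """Collects media marked, directly or via an ancestor, as an asset."""

    def __init__(self):
        super().__init__()
        self.stack = []   # True for each open element marked as an asset
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        marked = "entry-content-asset" in (attr_map.get("class") or "").split()
        if tag in MEDIA and (marked or any(self.stack)):
            self.assets.append((tag, attr_map.get("src")))
        if tag not in VOID:
            self.stack.append(marked)

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()


collector = AssetCollector()
collector.feed(SAMPLE)
collector.close()
assets = collector.assets
print(assets)
```

The second image in the sample is not reported because nothing marks it as related to the article.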
.comment
A comment class will help Readability to better filter (or, in the future, display) extraneous comments from an article's text.