One of the stumbling blocks in working out the effects of new (and old) HTML5 features on accessibility is the lack of publicly accessible usage data. Without data it is difficult to argue for the inclusion of features in HTML5 or to work out how features should be supported for accessibility. I have made an initial attempt to rectify this by collecting the HTML content of the home pages of the top 10,000 web sites.
I spent most of the Easter long weekend collecting the HTML pages. The original source for the “top 10,000” site URLs was a URL list I found on Pastebin. I used the HTTrack website copier to capture the HTML files. The initial pass was somewhat affected by redirects, so I went through the error log and collected a second list of URLs from the captured pages that had resulted in “page has moved” files. The resulting 8915 HTML pages come from these 2 sets of URLs. The HTML content (including URL lists) is provided as a zip file:
Top 10000 HTML files zip file – 121 MB (Please only download if you are going to make use of the data)
hgroup element usage
I have only just started to analyse the data. The first analysis is of the new HTML5 hgroup element, and so far it is only a simple count of instances of its use. No attempt has yet been made, for example, to analyse what percentage of its use conforms to HTML5 author conformance requirements.
Of the top 8915 HTML pages, 79 (0.89%) were found to include use of the HTML5 hgroup element. A total of 418 instances of the hgroup element were found within the 79 pages.
Instances of hgroup element use in top 10000 web sites – home pages
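For anyone wanting to reproduce this kind of count from the zip file, the following is a minimal sketch in Python. It is not the method used for the figures above; the names `HgroupCounter`, `count_hgroup` and `summarise` are hypothetical, and it assumes the captured pages have already been read into a mapping of file name to HTML text.

```python
from html.parser import HTMLParser


class HgroupCounter(HTMLParser):
    """Counts <hgroup> start tags in a single HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <HGROUP> is counted too.
        if tag == "hgroup":
            self.count += 1


def count_hgroup(html_text):
    """Return the number of hgroup elements opened in html_text."""
    parser = HgroupCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count


def summarise(pages):
    """pages: hypothetical mapping of file name -> HTML text.

    Returns (pages containing hgroup, total instances, percentage of pages).
    """
    per_page = {name: count_hgroup(text) for name, text in pages.items()}
    pages_with = sum(1 for n in per_page.values() if n > 0)
    total = sum(per_page.values())
    share = 100.0 * pages_with / len(per_page) if per_page else 0.0
    return pages_with, total, share
```

Using a parser rather than a plain text search avoids counting the string "hgroup" where it only appears inside scripts, style sheets or comments.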
Inclusion of hgroup in HTML5
Note: I am a proponent of the removal and/or replacement of hgroup in HTML5. There are currently 5 change proposals on this subject being reviewed by the W3C HTML working group chairs:
- Change Proposal: replace hgroup with the subline element
- Change Proposal: no-change hgroup
- Change Proposal: replace hgroup with a simple element
- Change Proposal: remove hgroup add an outlineMask attribute
- Change Proposal: Replace <hgroup> with an element that has a simple content model and backwards compatibility.
Comments
Hi Steve,
Over at Mozilla we’ve been doing something similar, but with regard to CSS properties. You might find some of the data sets in this ticket to be of use: https://bugzilla.mozilla.org/show_bug.cgi?id=708406 . If not, ping me directly and I can probably get you more.
Hi Steve,
It would be interesting to know how many of the sites actually use (or try to use) HTML5. For example, how many use a non-HTML4/XHTML doctype?
You might find that the 0.89% might be 10% of sites using HTML5, or 1%, or 50%…
It would help put the figures in perspective.
Hi Alastair, I am crunching the data at the moment, and will provide more details soon. I have looked at how many use the HTML5 doctype and found that approx 17% of the sample pages use it.
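A check along the following lines could identify pages that declare the HTML5 doctype. This is only a sketch and not necessarily how the figure above was obtained; the function name `uses_html5_doctype` is hypothetical, and it assumes the doctype is the first thing in the file apart from whitespace, a byte order mark or comments.

```python
import re

# Matches <!DOCTYPE html> (any letter case) at the start of the document,
# allowing leading whitespace and HTML comments before it.
HTML5_DOCTYPE = re.compile(
    r"^\s*(?:<!--.*?-->\s*)*<!doctype\s+html\s*>",
    re.IGNORECASE | re.DOTALL,
)


def uses_html5_doctype(html_text):
    """Return True if html_text starts with the short HTML5 doctype."""
    # Drop a byte order mark if the file was decoded with one left in place.
    return HTML5_DOCTYPE.match(html_text.lstrip("\ufeff")) is not None
```

HTML4 and XHTML doctypes carry a PUBLIC identifier after the word html, so they do not match this pattern.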
Hi John, thanks for the heads up. The CSS data will be useful; for instance, I want to look at the use of outline:none.
Ah, great, so about 5% of sites trying to use HTML5 also used hgroup. Not many.
Hi Alastair, of the 1454 pages using the HTML5 doctype, 77 also used hgroup, so yes around 4.9%.
Sounds cool and interesting work you have started here. I have a little “hmmm”, because we all do the same thing when we try to do surveys on the Web: we often sample only the home page of web sites, which I guess might create a bias. I wonder if we should add at least one secondary page for each of these sites. The issue then being which one. 🙂
It would be interesting to know whether the pages in question are using WordPress. WordPress has used hgroup for quite a while now. I will have to check your URL list to verify it. Thanks for the effort.
Hi Sylvia, have a look at generator.html. It contains all pages with a meta name=generator element. The pages identified as using WordPress are contained within the results.
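For readers who would rather check the generator values directly against the HTML files, here is a minimal Python sketch. It is an illustration only, not the script that produced generator.html; the names `GeneratorFinder`, `find_generators` and `looks_like_wordpress` are hypothetical, and it assumes that a WordPress site announces itself in the content of its meta generator element, which not every WordPress site does.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collects the content of every <meta name="generator"> element."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)  # attribute names arrive lowercased
        if (attr.get("name") or "").lower() == "generator":
            self.generators.append(attr.get("content") or "")


def find_generators(html_text):
    """Return the list of generator strings declared in html_text."""
    finder = GeneratorFinder()
    finder.feed(html_text)
    finder.close()
    return finder.generators


def looks_like_wordpress(html_text):
    """True if any generator value mentions WordPress (case-insensitive)."""
    return any("wordpress" in g.lower() for g in find_generators(html_text))
```

Combining this with a count of hgroup elements per page would show how much of the hgroup use comes from pages that declare WordPress as their generator.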