HTML data for the masses: data dump

Steve Faulkner, April 12, 2012, Development

I have been running regex searches on the HTML of the 8900 or so home pages (out of the top 10000) that I collected over Easter, and am providing the results of the searches conducted so far, in raw form:

Top 10000 web sites home pages HTML code data dump

Searches on the HTML of the 8900 sample pages were conducted on various HTML elements and attributes.
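
For readers who want to run similar searches, here is a minimal sketch in Python of a regex search for an element or an attribute across a set of pages. The post does not publish the expressions actually used, so the patterns, the helper names (element_pattern, attribute_pattern, find_matches) and the sample pages are all assumptions for illustration only. Regex matching on HTML is approximate: an attribute value containing ">" would, for example, cut a match short.

```python
import re

# Hypothetical patterns: the expressions actually used for the data dump
# are not given in the post.
def element_pattern(name):
    """Match an opening tag such as <nav> or <nav class="x">, ignoring case."""
    return re.compile(r"<" + re.escape(name) + r"(?=[\s>/])[^>]*>", re.IGNORECASE)

def attribute_pattern(name):
    """Match any opening tag that carries the named attribute with a value."""
    return re.compile(r"<[a-z][^>]*\s" + re.escape(name) + r"\s*=[^>]*>", re.IGNORECASE)

def find_matches(pages, pattern):
    """Return {url: [matched tag, ...]} for pages with at least one match."""
    results = {}
    for url, html in pages.items():
        hits = pattern.findall(html)
        if hits:
            results[url] = hits
    return results

# Made-up sample pages standing in for the collected home pages.
pages = {
    "http://example.com/": '<nav id="main"><a href="/">Home</a></nav><input placeholder="Search">',
    "http://example.org/": '<div class="navigation">no nav element here</div>',
}

nav_hits = find_matches(pages, element_pattern("nav"))
placeholder_hits = find_matches(pages, attribute_pattern("placeholder"))
```

The lookahead in element_pattern keeps a search for "nav" from matching longer tag names that merely start with the same letters, and closing tags such as </nav> are not counted.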

NOTE: the resulting data output files are sometimes large and their HTML code is woeful; they are supplied as is. As time permits, I will analyse the data and clean up the HTML code.

data dump
element/attribute	HTML file size	last modified date
address.html	338 KB	11/04/2012
alt.html	23573 KB	12/04/2012
aria.html	2566 KB	11/04/2012
audio.html	5 KB	10/04/2012
doctypeall-clean.zip	5 KB	11/04/2012
figure-figcaption.html	3034 KB	11/04/2012
footer.html	1853 KB	10/04/2012
generator.html	1548 KB	10/04/2012
header.html	2659 KB	11/04/2012
hgroup.html	247 KB	10/04/2012
label-placeholder.htm	258 KB	12/04/2012
longdesc.html	2194 KB	10/04/2012
nav.html	2194 KB	11/04/2012
placeholder-title.html	467 KB	12/04/2012
placeholder.html	1489 KB	12/04/2012
section.html	4202 KB	10/04/2012
summaryattribute.html	1068 KB	12/04/2012
tabindex.html	6848 KB	12/04/2012
th.html	5557 KB	12/04/2012
u.html	2363 KB	10/04/2012
video.html	143 KB	10/04/2012
top10000URL1.txt	330 KB	11/04/2012
top10000URL2.txt	79 KB	09/04/2012

About Steve Faulkner

Steve was the Chief Accessibility Officer at TPGi before he left in October 2023. He joined TPGi in 2006 and was previously a Senior Web Accessibility Consultant at Vision Australia. Steve is a member of several groups, including the W3C Web Platforms Working Group and the W3C ARIA Working Group. He is an editor of several specifications at the W3C, including ARIA in HTML and HTML Accessibility API Mappings 1.0. He also develops and maintains HTML5accessibility and the JAWS bug tracker/standards support.
