I have been running regex searches on the HTML of the 8900 or so home pages, out of the top 10000, that I collected over Easter, and am providing the results of the searches conducted so far in raw form:
Top 10000 web sites home pages HTML code data dump
Searches on the HTML of the 8900 sample pages were conducted on various HTML elements and attributes.
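As a rough illustration of this kind of search, here is a minimal Python sketch that looks for one element (`nav`) and one attribute (`placeholder`) in a page's HTML. The patterns, the `find_matches` helper, and the sample markup are all assumptions made for the example; they are not the expressions actually used to produce the files below.

```python
import re

# Hypothetical patterns for illustration only; the expressions actually
# used for the searches are not given in this post.
# Matches an opening <nav> tag, with or without attributes.
NAV_ELEMENT = re.compile(r"<nav\b[^>]*>", re.IGNORECASE)
# Matches an opening tag up to and including a placeholder attribute value.
PLACEHOLDER_ATTR = re.compile(
    r"<[a-z][^>]*\splaceholder\s*=\s*(\"[^\"]*\"|'[^']*'|[^\s>]+)",
    re.IGNORECASE,
)


def find_matches(pattern, html):
    """Return every substring of html matched by the compiled pattern."""
    return [m.group(0) for m in pattern.finditer(html)]


# Made-up sample markup standing in for one collected home page.
sample = """<body>
<nav class="main"><a href="/">Home</a></nav>
<input type="text" placeholder="Search">
<navigation>not a nav element</navigation>
</body>"""

nav_hits = find_matches(NAV_ELEMENT, sample)
placeholder_hits = find_matches(PLACEHOLDER_ATTR, sample)

print(len(nav_hits), "nav match(es)")
print(len(placeholder_hits), "placeholder match(es)")
```

Regex matching on HTML is only approximate: markup inside comments, scripts, or attribute values can produce false hits, which is one reason raw results like these need cleaning up before analysis.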
NOTE: the resulting data output files are sometimes large and the HTML code is woeful; they are supplied as is. As time permits, I will analyse the data and also clean up the HTML code.
element/attribute | HTML file size | last modified date |
---|---|---|
address.html | 338 KB | 11/04/2012 |
alt.html | 23573 KB | 12/04/2012 |
aria.html | 2566 KB | 11/04/2012 |
audio.html | 5 KB | 10/04/2012 |
doctypeall-clean.zip | 5 KB | 11/04/2012 |
figure-figcaption.html | 3034 KB | 11/04/2012 |
footer.html | 1853 KB | 10/04/2012 |
generator.html | 1548 KB | 10/04/2012 |
header.html | 2659 KB | 11/04/2012 |
hgroup.html | 247 KB | 10/04/2012 |
label-placeholder.htm | 258 KB | 12/04/2012 |
longdesc.html | 2194 KB | 10/04/2012 |
nav.html | 2194 KB | 11/04/2012 |
placeholder-title.html | 467 KB | 12/04/2012 |
placeholder.html | 1489 KB | 12/04/2012 |
section.html | 4202 KB | 10/04/2012 |
summaryattribute.html | 1068 KB | 12/04/2012 |
tabindex.html | 6848 KB | 12/04/2012 |
th.html | 5557 KB | 12/04/2012 |
u.html | 2363 KB | 10/04/2012 |
video.html | 143 KB | 10/04/2012 |
top10000URL1.txt | 330 KB | 11/04/2012 |
top10000URL2.txt | 79 KB | 09/04/2012 |