More HTML Tidy
Jason Clinton
me at jasonclinton.com
Fri Oct 10 15:58:57 CDT 2003
Jonathan Hutchins wrote:
>Jason, you suggested HTML Tidy for dealing with Word 10 - was that just a
>theoretical suggestion, based on the documentation claims, or have you made
>this process work? All I get are a stream of error warnings. What good is
>HTML tidy if you have to manually clean the document before you can feed it
>to tidy?
>
>
Sample Word 10 "HTML" file with a single line of text:
---------------------------------------------------------------------------
<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 10">
<meta name=Originator content="Microsoft Word 10">
<link rel=File-List href="This%20is%20a%20test_files/filelist.xml">
<title>This is a test</title>
<!--[if gte mso 9]><xml>
<o:DocumentProperties>
<o:Author>Jason Clinton</o:Author>
<o:LastAuthor>Jason Clinton</o:LastAuthor>
<o:Revision>1</o:Revision>
<o:TotalTime>0</o:TotalTime>
<o:Created>2003-10-10T15:48:00Z</o:Created>
<o:LastSaved>2003-10-10T15:48:00Z</o:LastSaved>
<o:Pages>1</o:Pages>
<o:Words>2</o:Words>
<o:Characters>14</o:Characters>
<o:Company>UMKC-IHD</o:Company>
<o:Lines>1</o:Lines>
<o:Paragraphs>1</o:Paragraphs>
<o:CharactersWithSpaces>15</o:CharactersWithSpaces>
<o:Version>10.4219</o:Version>
</o:DocumentProperties>
<o:OfficeDocumentSettings>
<o:AllowPNG/>
</o:OfficeDocumentSettings>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:WordDocument>
<w:GrammarState>Clean</w:GrammarState>
</w:WordDocument>
</xml><![endif]-->
<style>
<!--
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-parent:"";
margin:0in;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
mso-bidi-font-size:10.0pt;
font-family:"Times New Roman";
mso-fareast-font-family:"Times New Roman";
mso-bidi-font-family:Arial;}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 1.25in 1.0in 1.25in;
mso-header-margin:.5in;
mso-footer-margin:.5in;
mso-paper-source:0;}
div.Section1
{page:Section1;}
-->
</style>
<!--[if gte mso 10]>
<style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-parent:"";
mso-padding-alt:0in 5.4pt 0in 5.4pt;
mso-para-margin:0in;
mso-para-margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:10.0pt;
font-family:"Times New Roman";}
</style>
<![endif]--><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026"/>
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1"/>
</o:shapelayout></xml><![endif]-->
</head>
<body lang=EN-US style='tab-interval:.5in'>
<div class=Section1>
<p class=MsoNormal>This is a test.</p>
</div>
</body>
</html>
---------------------------------------------------------------------------
Tidy command:
---------------------------------------------------------------------------
tidy --word-2000 yes "This is a test.htm"
---------------------------------------------------------------------------
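File names with spaces, like the one in this example, are easy to get wrong on the command line. For scripted use, the same invocation can be built as an argument list so that no shell quoting is involved. This is only a minimal sketch in Python: the helper names tidy_command and run_tidy and the ".tidy" output naming are made up for illustration, and it assumes a Tidy build on the PATH that accepts -q, -o and --show-warnings in addition to the --word-2000 option used above.

```python
import subprocess
from pathlib import Path


def tidy_command(source, output=None, tidy="tidy"):
    """Build the argument list for cleaning a Word-exported HTML file."""
    source = Path(source)
    if output is None:
        # Hypothetical naming scheme: "x.htm" -> "x.tidy.htm".
        output = source.with_suffix(".tidy" + source.suffix)
    return [
        tidy,
        "-q",                      # suppress nonessential output
        "--word-2000", "yes",      # same option as in the command above
        "--show-warnings", "no",   # hide warnings; errors are still shown
        "-o", str(output),         # write the result to a file, not stdout
        str(source),               # one list item, so spaces need no quoting
    ]


def run_tidy(source, output=None):
    """Run Tidy; return True if it exited with status 0 or 1."""
    result = subprocess.run(tidy_command(source, output), check=False)
    # Tidy exits with 0 for success, 1 for warnings, 2 for errors.
    return result.returncode in (0, 1)
```

Calling run_tidy("This is a test.htm") would then write "This is a test.tidy.htm" next to the original. An exit status of 1 only means there were warnings, which is the normal case for Word output.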
Output:
---------------------------------------------------------------------------
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 1st October 2003), see www.w3.org">
<title>This is a test</title>
</head>
<body>
<div class="Section1">
<p>This is a test.</p>
</div>
</body>
</html>
---------------------------------------------------------------------------
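To check a larger document than this one-liner, a quick scan for leftover Word markup can help. The following is a rough Python sketch, not anything Tidy itself provides: the function name word_residue is hypothetical, and the pattern list is drawn only from the markup visible in the sample file above, so it is illustrative rather than complete.

```python
import re

# Patterns typical of the Word 10 sample above; an illustrative list only.
WORD_MARKERS = {
    "conditional comment": re.compile(r"<!--\[if\s", re.I),
    "office namespace tag": re.compile(r"</?[ovw]:\w+", re.I),
    "mso- style property": re.compile(r"\bmso-[a-z-]+\s*:", re.I),
    "Mso class name": re.compile(r"class\s*=\s*[\"']?Mso\w+", re.I),
}


def word_residue(html):
    """Return the names of Word-specific markers still present in html."""
    return [name for name, pattern in WORD_MARKERS.items()
            if pattern.search(html)]
```

An empty list means none of these markers were found; it does not prove the document is valid HTML.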
YMMV