More HTML Tidy

Jason Clinton me at jasonclinton.com
Fri Oct 10 15:58:57 CDT 2003


Jonathan Hutchins wrote:

>Jason, you suggested HTML Tidy for dealing with Word 10 - was that just a 
>theoretical suggestion, based on the documentation claims, or have you made 
>this process work?  All I get are a stream of error warnings.  What good is 
>HTML tidy if you have to manually clean the document before you can feed it 
>to tidy?
>  
>
Sample Word 10 "HTML" file with a single line of text:
---------------------------------------------------------------------------
<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 10">
<meta name=Originator content="Microsoft Word 10">
<link rel=File-List href="This%20is%20a%20test_files/filelist.xml">
<title>This is a test</title>
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>Jason Clinton</o:Author>
  <o:LastAuthor>Jason Clinton</o:LastAuthor>
  <o:Revision>1</o:Revision>
  <o:TotalTime>0</o:TotalTime>
  <o:Created>2003-10-10T15:48:00Z</o:Created>
  <o:LastSaved>2003-10-10T15:48:00Z</o:LastSaved>
  <o:Pages>1</o:Pages>
  <o:Words>2</o:Words>
  <o:Characters>14</o:Characters>
  <o:Company>UMKC-IHD</o:Company>
  <o:Lines>1</o:Lines>
  <o:Paragraphs>1</o:Paragraphs>
  <o:CharactersWithSpaces>15</o:CharactersWithSpaces>
  <o:Version>10.4219</o:Version>
 </o:DocumentProperties>
 <o:OfficeDocumentSettings>
  <o:AllowPNG/>
 </o:OfficeDocumentSettings>
</xml><![endif]--><!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:GrammarState>Clean</w:GrammarState>
 </w:WordDocument>
</xml><![endif]-->
<style>
<!--
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
    {mso-style-parent:"";
    margin:0in;
    margin-bottom:.0001pt;
    mso-pagination:widow-orphan;
    font-size:12.0pt;
    mso-bidi-font-size:10.0pt;
    font-family:"Times New Roman";
    mso-fareast-font-family:"Times New Roman";
    mso-bidi-font-family:Arial;}
@page Section1
    {size:8.5in 11.0in;
    margin:1.0in 1.25in 1.0in 1.25in;
    mso-header-margin:.5in;
    mso-footer-margin:.5in;
    mso-paper-source:0;}
div.Section1
    {page:Section1;}
-->
</style>
<!--[if gte mso 10]>
<style>
 /* Style Definitions */
 table.MsoNormalTable
    {mso-style-name:"Table Normal";
    mso-tstyle-rowband-size:0;
    mso-tstyle-colband-size:0;
    mso-style-noshow:yes;
    mso-style-parent:"";
    mso-padding-alt:0in 5.4pt 0in 5.4pt;
    mso-para-margin:0in;
    mso-para-margin-bottom:.0001pt;
    mso-pagination:widow-orphan;
    font-size:10.0pt;
    font-family:"Times New Roman";}
</style>
<![endif]--><!--[if gte mso 9]><xml>
 <o:shapedefaults v:ext="edit" spidmax="1026"/>
</xml><![endif]--><!--[if gte mso 9]><xml>
 <o:shapelayout v:ext="edit">
  <o:idmap v:ext="edit" data="1"/>
 </o:shapelayout></xml><![endif]-->
</head>

<body lang=EN-US style='tab-interval:.5in'>

<div class=Section1>

<p class=MsoNormal>This is a test.</p>

</div>

</body>

</html>
---------------------------------------------------------------------------

Tidy command:
---------------------------------------------------------------------------
tidy --word-2000 yes "This is a test.htm"
---------------------------------------------------------------------------

Output:
---------------------------------------------------------------------------
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 1st October 2003), see www.w3.org">
<title>This is a test</title>
</head>
<body>
<div class="Section1">
<p>This is a test.</p>
</div>
</body>
</html>
---------------------------------------------------------------------------
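
On the stream of warnings: the following is only a minimal sketch, assuming a POSIX shell with tidy on the PATH. The -q switch and the show-warnings option quiet the chatter, and tidy's exit status (0 for no problems, 1 for warnings, 2 for errors) says whether the run was usable. The helper names tidy_status_text and clean_word_html, and the ".tidy.htm" output suffix, are made up for illustration and are not part of tidy.

```shell
#!/bin/sh
# Sketch: run tidy quietly on a Word export and report how it went.
# tidy_status_text and clean_word_html are hypothetical helper names.

# Map tidy's exit status to a short description
# (0 = no problems, 1 = warnings, 2 = errors).
tidy_status_text() {
    case "$1" in
        0) echo "clean" ;;
        1) echo "warnings only" ;;
        2) echo "errors" ;;
        *) echo "unexpected status $1" ;;
    esac
}

# Convert one Word HTML file. Quoting "$in" keeps file names
# that contain spaces in one piece.
clean_word_html() {
    in=$1
    out=${in%.htm}.tidy.htm   # assumed naming scheme for the cleaned copy
    status=0
    tidy -q --show-warnings no --word-2000 yes "$in" > "$out" || status=$?
    echo "$in: $(tidy_status_text "$status")"
}

# Only run the conversion when tidy is installed and a file name was given.
if command -v tidy >/dev/null 2>&1 && [ "$#" -gt 0 ]; then
    clean_word_html "$1"
fi
```

Saved as a script, it would be called with the quoted file name as its only argument. A status of 2 means tidy gave up on the markup, in which case the output file will be empty.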

YMMV



