A Characterization of Compound Documents on the Web
Rice Computer Science, Technical Report TR99-351, November 1999
Recent developments in office productivity suites make it easier for users to publish rich compound documents on the Web. Compound documents appear as a single unit of information but may contain data generated by different applications, such as text, images, and spreadsheets. Given the popularity enjoyed by these office suites and the pervasiveness of the Web as a publication medium, we expect that in the near future these compound documents will become an increasing proportion of the Web’s content. As a result, the content handled by servers, proxies, and browsers may change considerably from what is currently observed. Furthermore, these compound documents are currently treated as opaque byte streams, but future Web infrastructure may wish to understand their internal structure to provide higher-quality service. In order to guide the design of this future Web infrastructure, we characterize compound documents currently found on the Web. Previous studies of Web content either ignored these document types altogether or did not consider their internal structure. We study compound documents originated by the three most popular applications from the Microsoft Office suite: Word, Excel, and PowerPoint. Our study encompasses over 12,500 documents retrieved from 935 different Web sites. Our main conclusions are:
1. Compound documents are in general much larger than current HTML documents.
2. For large documents, embedded objects and images make up a large part of the documents’ size.
3. For small documents, XML format produces much larger documents than OLE. For large documents, there is little difference.
4. Compression considerably reduces the size of documents in both formats.