The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often
interesting web data are not in database systems but in HTML pages, XML pages, or text files.
Data in these formats is not directly usable by standard SQL-like query processing engines that
support sophisticated querying and reporting beyond keyword-based retrieval. Hence, web users
and applications need a smart way of extracting data from these web sources. One of the popular
approaches is to write wrappers around the sources, either manually or with software assistance, to
bring the web data within the reach of more sophisticated query tools and general mediator-based
information integration systems.
In this paper, we describe the methodology and the software development of an XML-enabled
wrapper construction system, XWRAP, for semi-automatic generation of wrapper programs. By
XML-enabled we mean that the metadata about information content that are implicit in the original
web pages will be extracted and encoded explicitly as XML tags in the wrapped documents. In
addition, the query-based content filtering process is performed against the XML documents. The
XWRAP wrapper generation framework has three distinct features. First, it explicitly separates
tasks of building wrappers that are specific to a Web source from the tasks that are repetitive for any
source, and uses a component library to provide basic building blocks for wrapper programs. Second,
it provides inductive learning algorithms that derive or discover wrapper patterns by reasoning about
sample pages or sample specifications. Third and most importantly, we introduce and develop a
two-phase code generation framework.
The first phase utilizes an interactive interface facility to encode
the source-specific metadata knowledge identified by individual wrapper developers as declarative
information extraction rules. The second phase combines the information extraction rules generated
in the first phase with the XWRAP component library to construct an executable wrapper program
for the given web source. The two-phase code generation approach exhibits a number of advantages
over existing approaches. First, it provides a user-friendly interface program to allow users to generate
their information extraction rules with a few mouse clicks. Second, it provides a clean separation
of the information extraction semantics from the generation of procedural wrapper programs (e.g.,
Java code). Such separation allows new extraction rules to be incorporated into a wrapper program
incrementally. Third, it facilitates the use of the micro-feedback approach to revisit and tune the
wrapper programs at run time. We report on the performance of XWRAP and on experiments that
demonstrate the benefit of building wrappers for a number of Web sources in different domains
using the XWRAP generation system.
The architecture of XWRAP for data wrapping consists of four components: Syntactical Structure Normalization,
Information Extraction, Code Generation, and Program Testing and Packaging. Figure 1 illustrates how the wrapper
generation process works in the context of a data wrapping scenario.
Syntactical Structure Normalization is the first component, also called the Syntactical Normalizer. It
prepares and sets up the environment for the information extraction process by performing the following three tasks.
First, the syntactical normalizer accepts a URL selected and entered by the XWRAP user, issues an HTTP
request to the remote server identified by the given URL, and fetches the corresponding web document (the
so-called page object). This page object is used as a sample for XWRAP to interact with the user to learn and
derive the important information extraction rules. Second, it cleans up bad HTML tags and syntactical errors.
Third, it transforms the retrieved page object into a parse tree, the so-called syntactic token tree.
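To make the second and third tasks concrete, the following is a minimal sketch in Python, not the XWRAP implementation. It assumes the page object has already been fetched into a string (the HTTP request is omitted), and the names Node, TreeBuilder, normalize, and dump are hypothetical. Unclosed elements are closed implicitly and stray end tags are ignored, which is one simple way to tolerate bad HTML while building a token tree.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One node of a (hypothetical) syntactic token tree."""

    def __init__(self, tag, parent=None, text=None):
        self.tag = tag          # element name, or "#text" for a text leaf
        self.text = text        # text content for "#text" leaves, else None
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree from possibly malformed HTML, repairing missing end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name; elements opened
        # in between are implicitly closed. Stray end tags are ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(Node("#text", self.current, text))


def normalize(html):
    """Parses an HTML string and returns the root of the token tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def dump(node, depth=0):
    """Returns an indented outline of the tree, one node per list entry."""
    label = node.text if node.tag == "#text" else node.tag
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines
```

For example, normalize("<td><b>Temp</td>") yields a tree in which the unclosed b element is a child of td and is closed together with it. A production normalizer would need a fuller set of HTML repair rules than this sketch applies.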
Information Extraction is the second component, which is responsible for deriving extraction rules that use
a declarative specification to describe how to extract information content of interest from its HTML formatting.
XWRAP performs the information extraction task in three steps: (1) identifying interesting regions in the
retrieved document, (2) identifying the important semantic tokens and their logical paths and node positions in
the parse tree, and (3) identifying the useful hierarchical structures of the retrieved document. Each step results
in a set of extraction rules specified in declarative languages.
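The three kinds of rules can be pictured with a small Python sketch. The rule syntax here is invented for illustration and is not XWRAP's rule language; the sample page, the dotted path notation, and the names el, resolve, extract, PAGE, and RULES are all assumptions. The rule set names a region by its path in the parse tree, names semantic tokens by their paths and node positions within a record, and states which child elements form the repeating records.

```python
import re


def el(tag, *children):
    """Builds an element node; plain strings serve as text leaves."""
    return {"tag": tag, "children": list(children)}


def children_named(node, tag):
    """Returns the element children of node that carry the given tag."""
    return [c for c in node["children"] if isinstance(c, dict) and c["tag"] == tag]


def resolve(node, path):
    """Follows a dotted path such as 'body[0].table[0]' downward from node."""
    for step in path.split("."):
        tag, index = re.fullmatch(r"(\w+)\[(\d+)\]", step).groups()
        node = children_named(node, tag)[int(index)]
    return node


def text_of(node):
    """Concatenates all text beneath a node, in document order."""
    if isinstance(node, str):
        return node
    return "".join(text_of(c) for c in node["children"])


def extract(root, rules):
    """Applies a declarative rule set to a parse tree and returns records."""
    region = resolve(root, rules["region"])
    records = []
    for row in children_named(region, rules["record"]):
        records.append({name: text_of(resolve(row, path)).strip()
                        for name, path in rules["tokens"].items()})
    return records


# A made-up parse tree standing in for a retrieved page.
PAGE = el("html", el("body",
    el("h1", "Current conditions"),
    el("table",
        el("tr", el("td", el("b", "Atlanta")), el("td", "72")),
        el("tr", el("td", el("b", "Portland")), el("td", "58")))))

# Hypothetical declarative rules, one entry per extraction step.
RULES = {
    "region": "body[0].table[0]",                    # (1) region of interest
    "tokens": {"city": "td[0]", "temp": "td[1]"},    # (2) tokens and their paths
    "record": "tr",                                  # (3) repeating structure
}
```

Calling extract(PAGE, RULES) returns one dictionary per table row, with keys city and temp. Because the rules are data rather than code, changing what is extracted means editing RULES and leaving extract untouched.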
Code Generation is the third component, which generates the wrapper program code by applying the
three sets of information extraction rules produced by the second component.
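As a loose illustration of generating procedural code from declarative rules, the Python sketch below emits the source text of a wrapper function and then compiles it. It is not XWRAP's generator: the rules are reduced to a hypothetical mapping from field names to column positions, the names generate_wrapper and build_wrapper are invented, and the standard-library escape function stands in for a shared component library. The generated wrapper encodes each field explicitly as an XML tag.

```python
from xml.sax.saxutils import escape


def generate_wrapper(record_tag, fields):
    """Returns Python source for a wrapper function specialized to the rules.

    fields maps an XML tag name to the column index of the value in a row.
    """
    lines = ["def wrap(rows):",
             "    out = ['<results>']",
             "    for row in rows:",
             f"        out.append('  <{record_tag}>')"]
    for name, column in fields.items():
        lines.append(
            f"        out.append('    <{name}>' + escape(row[{column}]) + '</{name}>')")
    lines += [f"        out.append('  </{record_tag}>')",
              "    out.append('</results>')",
              "    return '\\n'.join(out)"]
    return "\n".join(lines)


def build_wrapper(record_tag, fields):
    """Compiles the generated source and returns the executable wrapper."""
    namespace = {"escape": escape}   # stands in for a reusable component library
    exec(generate_wrapper(record_tag, fields), namespace)
    return namespace["wrap"]
```

For instance, build_wrapper("observation", {"city": 0, "temp": 1}) returns a function that turns rows such as ["Atlanta", "72"] into observation elements with city and temp children. Keeping the rules separate from the generator means that adding a field only requires regenerating the wrapper from an extended rule set.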