WIL: A Tool for Web-data Extraction


By Nadim Barsoum

Technical Advisor: Prof. David Evans

Abstract

This report describes the design and use of a software library. The Web Integration Library (WIL) facilitates the integration of software applications with the web. The library provides functions that search for and retrieve data from a website according to the programmer's input. It thus provides an interface to web pages and abstracts away lower-level HTML interpretation. With data retrieval handled by the library, programmers are free to focus on more critical areas of their code. The library provides a complete client-side solution to web integration that requires no work on the server side and that allows a program to take advantage of updates to a website.


Using WIL

WIL provides two main data retrieval functions, GetRowTP and GetColTP, that use the inputs described in section 2.5 to realize the relationship between sets of data in a table. The following sections explain how the input should be formatted and what the input means for the GetRowTP and GetColTP functions. A number of examples are provided to help clarify the library's use.


Consider a website with table 1 below as its only content:

Name              Age   Height
Cher              101   5'9"
Laughayette       25    5'8"
Michael Jackson   10    5'4"
Table 1

To tell WIL how the data should be extracted, the programmer must first provide a container object or, in this case, a structure that represents a person. This object embodies the format in which the data will be extracted. Figure 11 shows an example of the type of object that the programmer might provide. The number of the target table must then be given. Next, the programmer must identify the mutator functions that modify the data members of the container object. In Figure 11, setname(string *name), setage(string *age), and setheight(string *height) are examples of such mutator functions. Pointers to these functions must be put into an array that is then passed to one of the two retrieval functions, GetRowTP and GetColTP. The programmer picks which function to use depending on how data is formatted in the target table. If the records are kept in rows, then the GetRowTP function should be used. If the records are kept in columns, then the GetColTP function should be used. These functions assume that the first row or column of a table contains titles and thus do not extract it. Finally, if the target table is embedded within another table, a location string is required to locate the target table on the page. Otherwise, no more information is needed.

Figure 11. Programmer-defined class
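
A container class of the kind described above could be sketched as follows. Only the mutator names and their string pointer parameters come from the text; the class name Person is taken from the later discussion of Figure 12, and the data members and accessor functions are assumptions made for illustration.

```cpp
#include <string>

// Hypothetical container class for one record of Table 1.
// The data members and accessors are assumptions; only the
// mutator names and parameter types follow the text.
class Person {
public:
    // Mutators: each receives a pointer to the text of one table cell.
    void setname(std::string *name)     { name_ = *name; }
    void setage(std::string *age)       { age_ = *age; }
    void setheight(std::string *height) { height_ = *height; }

    // Accessors, useful for printing a record after extraction.
    const std::string &name() const   { return name_; }
    const std::string &age() const    { return age_; }
    const std::string &height() const { return height_; }

private:
    std::string name_;
    std::string age_;
    std::string height_;
};
```

All three mutators share one signature, which is what allows pointers to them to be stored in a single array.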

After the programmer has provided the needed information, WIL begins extracting the target table's entries into copies of the container object. When there are no more entries to extract, WIL returns a vector of the objects that were extracted. Figure 12 shows an example program that uses WIL to extract the people in Table 1 and output their information to the screen.


Figure 12. A program that uses WIL

The prototypes of the retrieval functions, GetRowTP and GetColTP, are described as follows:

GetRowTP(T obj, int tablenum, int numofFuncs, void (T::*f[])(string *name), fstream &sin, int numargs, ...)
GetColTP(T obj, int tablenum, int numofFuncs, void (T::*f[])(string *name), fstream &sin, int numargs, ...)

The programmer must pass in a container object as the first parameter of the retrieval functions. Figure 12 shows the object "Guy", which is of type Person, passed in as the first parameter. The second parameter is the number of the target table. The third parameter is the number of mutator function pointers in the function pointer array, which is passed in as the fourth parameter. The fifth parameter is an input file stream that provides WIL with the HTML code. The sixth parameter states how many numbers follow it in the optional variable argument list that forms the seventh parameter. The variable argument list is used only when the target table is embedded in another table. When dealing with embedded tables, the programmer must provide the complete path to the target table as the seventh parameter of the retrieval function. In Figure 12, since the target table (Table 1) was not embedded, no location string was required and the sixth parameter was set to 0. Appendix B provides my implementation of GetRowTP and GetColTP.
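
The relationship between the sixth parameter and the variable argument list can be illustrated with a small standalone function. This is only a sketch of the standard C++ variable-argument mechanism; collectLocation is a hypothetical name and is not part of WIL, whose actual implementation is given in Appendix B.

```cpp
#include <cstdarg>
#include <vector>

// Hypothetical helper, not part of WIL: gathers the `numargs` integers
// that follow the count into a vector, in the order they were passed.
std::vector<int> collectLocation(int numargs, ...) {
    std::vector<int> location;
    va_list args;
    va_start(args, numargs);          // start reading after the count
    for (int i = 0; i < numargs; ++i) {
        location.push_back(va_arg(args, int));
    }
    va_end(args);
    return location;
}
```

Because a variadic function cannot discover on its own how many arguments were supplied, the count must be passed explicitly. A count of 0 means the list is empty, which corresponds to a target table that is not embedded.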


Embedded Tables

Some websites, having shied away from frames because of security problems, adopt tables for their layout control. Such websites rely on embedded tables frequently and thus contain tables within tables within tables. In this example we use the Report function to come up with a location string for the target table that is found inside of Table 2. This location string becomes the seventh parameter of the retrieval function. Notice that Table 2 contains two entries: the first is our target table and the second is a text string.

Name              Age   Height
Cher              101   5'9"
Laughayette       25    5'8"
Michael Jackson   10    5'4"

This is an empty Entry
Table 2

The Report function must first be used to generate a report of the web page so the programmer can locate the target table. In this example it is obvious where the table is located, but many websites nest numerous tables within each other, which makes it hard to work out the location string of the target table by hand. The prototype for the Report function is:

Report(fstream &fin, ofstream &fout)

The first parameter is an input stream providing WIL with the HTML file. The second parameter is the output stream to which the generated report of location strings will be written. Figure 13 shows the code on the left and, on the right, the resulting report that describes the location of the data in the file "test.html", which contains Table 2 from above. The report completely describes the page to the programmer; the indentation reflects how one element is embedded within another. Since our target table's entries are indented one tab further than the initial integer string of 1 1 1, we know that the desired table is within another table. The string within the parentheses and followed by a colon is the entry of Table 2 that contains our target table. We must still provide the number of the target table because it is an element found within the first entry of Table 2.


Figure 13. A report on test.html

Thus our code from Figure 12 changes to the code shown in Figure 14. The only difference in the code is how the function GetRowTP is called. The sixth argument says that there are 3 arguments to follow it: 1, 1 and 1. These 3 arguments tell GetRowTP that the table in question is found inside the first column of the first row of the first table on the website.


Figure 14. Program for embedded tables

If the table were embedded even deeper within other tables, the programmer would have to provide a longer location string to describe the target table's position on the page. To obtain the location string of a deeply embedded target table, the location strings of every parent, from the root table down to the target table, should be concatenated in order. Figure 15 illustrates a deeply embedded table and the location string of every parent table down to the target table.


Figure 15. Embedded tables

If this were the scenario for our previous example then the GetRowTP function call would be as follows:

GetRowTP(Guy, 1, 3, f, sin, 6, 1, 1, 1, 1, 1, 1);
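
The concatenation of parent location strings can be sketched as below. The sketch assumes that each level of nesting contributes three numbers and that they are ordered as table, row, column; the text describes 1, 1, 1 as the first column of the first row of the first table but does not state the order explicitly. The names Hop and buildLocation are hypothetical and are not part of WIL.

```cpp
#include <vector>

// One level of nesting: which table, and which cell of that table,
// contains the next table down. The field order is an assumption.
struct Hop {
    int table;
    int row;
    int column;
};

// Concatenate the hops from the root table down to the target table's
// parent into one flat location string.
std::vector<int> buildLocation(const std::vector<Hop> &hops) {
    std::vector<int> location;
    for (const Hop &h : hops) {
        location.push_back(h.table);
        location.push_back(h.row);
        location.push_back(h.column);
    }
    return location;
}
```

Under this assumption, two levels of nesting produce six numbers, so the sixth argument of the retrieval function is the size of the resulting vector and its elements are the arguments that follow.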

Notice that no location string is needed to reference the contents of the root table A itself, but the tables nested within A require location strings. If a certain table entry is not desired but happens to be in the target table, the programmer must provide a dummy function inside the given class that is executed when that entry is extracted. The programmer should also take care when creating the function pointer array. The first pointer must point to the mutator responsible for the first attribute of a record in the target table. For example, a pointer to setname(string *name) must be the first pointer in the array in the examples above. A pointer to setage(string *age) should be the next pointer in the array, and so on. The array must contain the mutator function pointers in the same order that their target attributes appear in the target table.
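
The ordering rule and the dummy function can be illustrated with a small standalone sketch. The function fillRecord below is a hypothetical stand-in for what a retrieval function might do with a single record; it is not WIL's implementation. The struct Entry and its skip function are likewise assumptions, with skip playing the role of the dummy function for an unwanted Age column.

```cpp
#include <string>
#include <vector>

// Hypothetical record type that keeps the name and height but not the age.
struct Entry {
    std::string name;
    std::string height;
    void setname(std::string *s)   { name = *s; }
    void skip(std::string *)       { /* dummy: ignore this attribute */ }
    void setheight(std::string *s) { height = *s; }
};

// Hypothetical stand-in for one step of a retrieval function:
// the i-th mutator in the array is called with the i-th cell of the record.
template <class T>
T fillRecord(T obj, int numofFuncs, void (T::*f[])(std::string *),
             std::vector<std::string> &cells) {
    int count = static_cast<int>(cells.size());
    for (int i = 0; i < numofFuncs && i < count; ++i) {
        (obj.*f[i])(&cells[i]);   // call through the member function pointer
    }
    return obj;
}
```

With an array holding &Entry::setname, &Entry::skip and &Entry::setheight in that order, the cells "Cher", "101" and 5'9" fill the name and height and discard the age. Swapping two pointers in the array would store attributes in the wrong members, which is why the array order must match the order of the attributes in the table.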



The following are the files of the library:
The Report function
The Select Mechanism
GetRowTP and GetColTP functions

A demonstration of using the library:
The Main program
The Example Class