In this section, we prepare the web page representation before applying classification algorithms. The preparation has two steps. The first is web page segmentation, which divides the main page into five blocks. The second is feature manipulation, which uses three feature types: spatial, location, and presentation.

Web Page Segmentation

Following the web page segmentation approach of [5], we divide the main web page into five blocks: top, bottom, left, right, and center, as shown in Fig. 1(a). This lets us investigate the area relationship of each block within the web page. In addition to dividing the page, we consider the locations that matter for e-commerce web sites, which are used in the location description of Section 3.2. E-commerce web sites vary widely in design. A previous study [4] identified the locations where components attract attention, such as the main navigation on the top or left, the “about us” navigation on the bottom, and advertising on the right or top center of the screen. We analyze these findings and synthesize them into the web template shown in Fig. 1(b). We also identify the area portions of the navigation bar, product index, customer service bar, content, news, and advertisement.
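The five-block division can be illustrated with a minimal sketch. The source does not state how the block boundaries are chosen, so the fractional margins below (`top`, `bottom`, `left`, `right`) and the names `Block` and `segment_page` are hypothetical, chosen only to show one layout in which the five blocks tile the page without overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Axis-aligned rectangle in page coordinates (origin at top-left)."""
    x: float
    y: float
    width: float
    height: float


def segment_page(page_w, page_h, top=0.15, bottom=0.10, left=0.20, right=0.15):
    """Split a page into top, bottom, left, right, and center blocks.

    The margin fractions are illustrative assumptions, not values from the paper.
    Top and bottom span the full width; left, right, and center share the
    remaining middle band.
    """
    if top + bottom >= 1 or left + right >= 1:
        raise ValueError("margins leave no room for the center block")
    top_h = page_h * top
    bottom_h = page_h * bottom
    left_w = page_w * left
    right_w = page_w * right
    mid_h = page_h - top_h - bottom_h
    return {
        "top": Block(0, 0, page_w, top_h),
        "bottom": Block(0, page_h - bottom_h, page_w, bottom_h),
        "left": Block(0, top_h, left_w, mid_h),
        "right": Block(page_w - right_w, top_h, right_w, mid_h),
        "center": Block(left_w, top_h, page_w - left_w - right_w, mid_h),
    }
```

Under these assumed margins, the block areas sum to the full page area, which is the property the area ratios in the spatial feature set rely on.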

Feature Description

We define three feature types: spatial, location, and presentation, as follows.

1) Spatial feature set:

The spatial feature set consists of two ratios: the area ratio, calculated as the area of each block divided by the full page area, and the sizing ratio, calculated as the width of each block divided by its height. We obtain 11 spatial attributes, such as the header area, the footer area, and the width-to-height ratio of the header. Table 1 summarizes the area data. The largest areas are in the content block: 51.6% for web sites from the Top 500 Guide (the Top-500 group) and 58% for randomly chosen e-commerce web sites (the random group). The most interesting block is the bottom block, whose areas are 14% and 9%, respectively. The left area does not differ between the two groups. The smallest areas are in the right block: 6% and 4.4%, respectively. The largest difference in sizing ratio is in the bottom block: 4.94 in the Top-500 group and 11.58 in the random group. As shown in Table 2, the width-to-height ratios of the left and right components are similar in the two groups.
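The two ratios defined above can be written out directly. This is a minimal sketch; the function names and the example dimensions are hypothetical and do not come from the data in Tables 1 and 2.

```python
def area_ratio(block_w, block_h, page_w, page_h):
    """Block area divided by full page area."""
    page_area = page_w * page_h
    if page_area <= 0:
        raise ValueError("page area must be positive")
    return (block_w * block_h) / page_area


def sizing_ratio(block_w, block_h):
    """Block width divided by block height."""
    if block_h <= 0:
        raise ValueError("block height must be positive")
    return block_w / block_h


# Illustrative values only: a 1000 x 200 header on a 1000 x 2000 page.
header_area = area_ratio(1000, 200, 1000, 2000)
header_sizing = sizing_ratio(1000, 200)
```

A wide, short block such as a header or footer yields a small area ratio but a large sizing ratio, which is why the two ratios are kept as separate attributes.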

2) Location feature set:

The location feature set describes where e-commerce components appear in the areas of the web page. We obtain 11 location attributes, such as the navigation, customer service, and search positions. The salient components are the navigation bar, product index, customer service bar, content, news, and advertisement. Table 3 summarizes the statistics for the Top-500 group and the random group. In the Top-500 group, the features with very high occurrence are content in the center area and navigation, logo, cart, account, and search in the top area. In the random group, the features with very high occurrence are navigation and logo in the top area.
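One way to tabulate occurrence statistics of this kind is to count, for each group, how often a component appears in a given block. The sketch below assumes each page has already been annotated as a mapping from component name to block name; that input format, the function name `occurrence_rates`, and the sample pages are hypothetical, not the paper's data or procedure.

```python
from collections import Counter


def occurrence_rates(pages):
    """Return the fraction of pages in which each (component, block) pair occurs.

    `pages` is a list of dicts mapping a component name to the block it
    appears in, e.g. {"navigation": "top", "logo": "top"}.
    """
    counts = Counter()
    for page in pages:
        for component, block in page.items():
            counts[(component, block)] += 1
    n = len(pages)
    return {pair: count / n for pair, count in counts.items()}


# Made-up annotations for two pages, for illustration only.
sample_pages = [
    {"navigation": "top", "logo": "top", "content": "center"},
    {"navigation": "left", "logo": "top", "content": "center"},
]
rates = occurrence_rates(sample_pages)
```

In this made-up sample, the logo appears in the top block on every page, while navigation in the top block appears on half of them.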

3) Presentation feature set:

The presentation feature set describes the format or style of e-commerce web site components, such as the navigation and search formats, which take types such as highlight, underline, pop-up, text modification, typographic, and selective. Besides the navigation format, we define other presentation attributes, such as text density, which measures the number of characters per content area, and image alignment, which captures how images are aligned in the content area. We obtain 15 presentation attributes, such as the navigation and customer service styles. We also observe that no web page in the Top-500 group uses the frame-set style.
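Text density, as described above, is the number of characters per unit of content area. The sketch below is one possible reading: the source does not say whether whitespace is counted or which area unit is used, so excluding whitespace and measuring area in square pixels are assumptions, and the function name `text_density` is hypothetical.

```python
def text_density(text, content_w, content_h):
    """Characters per unit of content area.

    Assumptions (not stated in the paper): whitespace characters are
    excluded, and the area is content_w * content_h in pixels.
    """
    area = content_w * content_h
    if area <= 0:
        raise ValueError("content area must be positive")
    char_count = sum(1 for ch in text if not ch.isspace())
    return char_count / area
```

For example, `text_density("ab cd", 2, 2)` counts four non-whitespace characters over an area of four units.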