Since the earliest days of the Internet, information technology professionals have struggled to devise a scaled-down, in-house version for their corporations. The fruit of their labor is the intranet, a smaller network for transporting information within and among corporations that promised the same hyperlinked cohesion that made the Internet such a compelling distribution tool. With the evolution of the intranet, of course, came a need to organize and access the information contained within it and to live up to the expectations placed upon it. And with those needs came the emergence of intranet search systems.
A quick glance at the market reveals a variety of choices. Some run on the platforms through which a corporation's Web is hosted, thus providing a virtual out-of-the-box indexing of the system's HTML files. In many cases, users can begin searching the entire collection in short order, perhaps as they do with a commercial Internet search engine. What more, then, could anyone want? Isn't this problem like buying a new car--more a matter of taste and affordability than a subject of lengthy analysis?
More often than not, a corporation's intranet is maintained not by a single individual, but by a team comprising varied interests, points of view, and responsibilities. Moreover, Web technologies continue to grow exponentially. Every month, in fact, brings nearly a year's worth of changes. And corporations often have already selected search systems or, increasingly, want to be able to search beyond their own intranet.
The sheer quantity of information to be searched and the range of users' needs compound the problem of selecting and configuring an intranet search system. For instance, will colleagues be happy when a routine search behaves as it does on a typical Internet search engine, giving them 20,000 possible items to review in response to a simple question? To understand fully the mechanisms that enable a useful, query-sensitive response, one must first understand the nature of a document and the process by which that document is categorized and accessed in a search.
YOU HAVE TO START SOMEWHERE: THE EVOLUTION OF THE DOCUMENT
Everybody knows what a document is, yet surprisingly, few can define the term. What's worse, attempting to pinpoint a definition typically leads to more questions than answers. For instance, is a document a single Web page or a collection of them? Are the sound or animation objects part of the document, or separate documents? Despite the countless questions that emerge, it is reasonable to begin with the circular assumption that a document is a book-like collection of related information objects; an information object, in turn, is any meaningful set of data that can be tied to other sets of data in a comprehensive search.
A typical document search involves a scan of its text for designated words or concepts. The search systems that scan for these words include full-text indices with pointers to essentially every word in a collection of documents. Queries using this index can range from simple words or phrases to Boolean AND-OR operations to extended operators like proximity. As the number of indexed documents increases, the need for more sophisticated search techniques increases as well. These search aids often include thesauruses, language support, and even facilities for searching general concepts.
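To make the idea of a full-text index concrete, here is a minimal sketch in Python. It assumes a tiny in-memory collection; the sample memos and the function names (build_index, search_and, search_or, search_near) are illustrative assumptions and do not come from any particular product. The index records every position of every word, which is what allows a proximity operator on top of plain Boolean AND and OR.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to {doc_id: [positions]} across the whole collection."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def search_and(index, *words):
    """Documents that contain every one of the words."""
    sets = [set(index.get(w, {})) for w in words]
    return set.intersection(*sets) if sets else set()

def search_or(index, *words):
    """Documents that contain at least one of the words."""
    return set().union(*(index.get(w, {}) for w in words))

def search_near(index, a, b, distance):
    """Documents where words a and b occur within `distance` positions."""
    hits = set()
    for doc_id in search_and(index, a, b):
        if any(abs(pa - pb) <= distance
               for pa in index[a][doc_id]
               for pb in index[b][doc_id]):
            hits.add(doc_id)
    return hits

# Hypothetical sample collection, for illustration only.
DOCS = {
    "memo1": "The travel policy covers airfare and hotel expenses.",
    "memo2": "Hotel bookings require a manager's approval under the new policy.",
    "memo3": "Quarterly sales figures for the hardware division.",
}
INDEX = build_index(DOCS)

print(search_and(INDEX, "policy", "hotel"))
print(search_or(INDEX, "airfare", "sales"))
print(search_near(INDEX, "hotel", "policy", 4))
```

Both memos about hotels satisfy the AND query, but only the first satisfies the proximity query, because its two words sit four positions apart. A production index would also handle stemming, stop words, and storage on disk, all of which this sketch omits.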
One company that has successfully addressed intranet document management issues is Sunnyvale, California-based Verity. Verity's comprehensive search engine--embedded in many of its competitors' products--offers "topic" queries that can be combined for increasingly rich concept searches. For instance, a basic topic query for "garden" quickly expands to subordinate queries like "vegetable," "flower," and "herb" gardens, which are themselves distinct queries. By building well-designed families of topics, users can search for and find extraordinarily detailed concepts in large document collections.
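The way a topic fans out into subordinate queries can be sketched as a small tree walk. This is not Verity's implementation or syntax; it is a minimal Python illustration under stated assumptions. The TOPICS table, the leaf terms under each garden type, and the function names are all hypothetical.

```python
# Hypothetical topic family: each topic lists its subordinate topics or terms.
# The leaf terms (tomato, basil, and so on) are invented for illustration.
TOPICS = {
    "garden": ["vegetable garden", "flower garden", "herb garden"],
    "vegetable garden": ["tomato", "lettuce"],
    "herb garden": ["basil", "thyme"],
}

def expand_topic(topic, topics=TOPICS, _seen=None):
    """Return the topic plus every subordinate term, depth first."""
    seen = set() if _seen is None else _seen
    if topic in seen:  # guard against circular topic definitions
        return []
    seen.add(topic)
    terms = [topic]
    for child in topics.get(topic, []):
        terms.extend(expand_topic(child, topics, seen))
    return terms

def matches_topic(text, topic, topics=TOPICS):
    """True if the text mentions the topic or any term beneath it."""
    lowered = text.lower()
    return any(term in lowered for term in expand_topic(topic, topics))

print(expand_topic("garden"))
print(matches_topic("Fresh basil and thyme for the kitchen", "garden"))
print(matches_topic("Quarterly sales report", "garden"))
```

A document that never uses the word "garden" can still match the topic through one of its subordinate terms, which is the point of building topic families. The substring test used here is deliberately crude; a real engine would match against its index and weight the terms.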
Although the most talked-about stage of document development is the search-and-use phase, effective intranets must also carefully consider how they will receive documents, what binary types of documents will be accessible, how long documents will remain available to users, and what becomes of them after their expiration date has passed.
NOT ALL DOCUMENTS ARE CREATED EQUAL: WHERE AND HOW TO SEARCH
In many companies, departmental or divisional HTML pages are constructed without consideration for how the content will be searched. The home page usually provides access to content, frequently subdivided by corporate organization or function. When the amount of content is relatively small, searching is not difficult. A point-and-click approach, in these cases, is all that's needed.
When the amount of content grows, however, searching by navigation alone becomes far less simple. New categories of information are set up, and if these mirror the firm's organization, the information layout changes with the corporation. Add to that difficulty the reality that different groups may begin growing their own subnets, or have preferences for their own document management and search needs, and you have either a disaster brewing or a great opportunity, depending on the color of the lenses through which you view the world.
If paper documents are part of the content to be searched, they are typically digitized using an Optical Character Recognition (OCR) system. OCR renders text searchable and creates files that are typically far smaller than their image counterparts. Unfortunately, OCR systems do not preserve word processor structures; they discard font and layout information, as well as pictures and graphics. Systems based on Adobe Acrobat Capture, however, will create Portable Document Format (PDF) renditions that are not only searchable, but preserve many of these elements, including graphics.
The information objects inherent to documents almost always have attributes that simplify any given search. PDF documents, for one, have built-in attributes that can assist in searching; new attributes can also be created, if appropriate. Even word processor files have attributes--saved as "summary" or "cover" page information--that generally include the author's name and the file's subject matter. With these document attributes, users can divide and conquer portions of their document base and then apply full-text queries to the remaining database of information.
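The divide-and-conquer approach can be sketched in a few lines of Python: narrow the collection by attribute first, then run the full-text query over whatever remains. The Document fields, the sample records, and the function names below are assumptions made for illustration, not the attribute scheme of PDF or of any word processor.

```python
from dataclasses import dataclass

@dataclass
class Document:
    title: str
    author: str
    subject: str
    body: str

def filter_by_attributes(docs, **wanted):
    """Keep documents whose attributes equal every requested value."""
    return [
        d for d in docs
        if all(getattr(d, name).lower() == value.lower()
               for name, value in wanted.items())
    ]

def full_text(docs, phrase):
    """Keep documents whose body contains the phrase, ignoring case."""
    needle = phrase.lower()
    return [d for d in docs if needle in d.body.lower()]

# Hypothetical records, for illustration only.
LIBRARY = [
    Document("Q1 Claims", "Rivera", "claims", "Fire losses rose in the first quarter."),
    Document("Q1 Sales", "Chen", "sales", "Fire sale pricing boosted unit volume."),
    Document("Q2 Claims", "Rivera", "claims", "Flood losses dominated the quarter."),
]

# Step 1: divide by attribute. Step 2: full-text query on the remainder.
claims_only = filter_by_attributes(LIBRARY, subject="claims", author="Rivera")
hits = full_text(claims_only, "fire")
print([d.title for d in hits])
```

Filtering first keeps the sales memo out of the results even though its body also contains the word "fire," so the full-text step has less material to scan and fewer false hits to return.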
Equally important, though often forgotten, is the host language in which the documents are written. While simple 7-bit ASCII is adequate for English-language pages, it cannot represent the text used in many other parts of the world. And while HTML tags are written in English, what lies between them may not be. Unicode, for instance, supports non-English languages ranging from the common European FIGS (French, Italian, German, and Spanish) to kanji. Likewise, Acrobat PDF files can express non-English text. Particularly useful to the development of a search system strategy is an upcoming standard called the eXtensible Markup Language (XML), which is designed to support foreign languages and add more structure to electronic documents.
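The limits of ASCII are easy to demonstrate. The short Python sketch below checks whether a few sample words fit in ASCII and how many bytes each needs in UTF-8, which is one common encoding of Unicode. The choice of UTF-8 and the sample words are assumptions made for illustration.

```python
# Hypothetical sample words in four languages.
SAMPLES = {
    "English": "search",
    "French": "requête",
    "German": "Schlüssel",
    "Japanese": "検索",
}

def fits_ascii(text):
    """True if every character in the text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

def utf8_length(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

for language, word in SAMPLES.items():
    print(language, word, fits_ascii(word), utf8_length(word))
```

Only the English word survives the ASCII check. A search system that indexes bytes rather than characters, or that assumes one byte per character, will mishandle the other three.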
AN INTRANET AND ITS SEARCH SYSTEM: MAKING IT WORK FOR YOU
Further complicating the quest for an effective search system is the purpose of the corporate intranet itself. Is it simply a communications vehicle, or is it a virtual workplace where employees work and share files? If it is a virtual environment, users will want to extend their boundaries and search content within and beyond their company's intranet. Likewise, they will demand a unified process of searching. For corporations that have already settled on a separate search system for accessing information outside their own intranets, the process of integrating these search systems and techniques creates yet another obstacle.
Given the diversity of factors that must be considered, it is not surprising that the search for an intranet search system can seem daunting. The investigation becomes more manageable, however, through needs analysis. Specifically, the user must consider a series of questions about the types of documents to be examined, the characteristics of the people who will be using the search system, and the required capabilities of the search system itself to identify those products that are most likely to take full advantage of an intranet's information archives. For example, the IT professional responsible for initiating a mechanism for intranet searches must ask what types of documents will be indexed and searched, whether paper legacy documents will be searched, and whether documents will be searched by predefined attributes.
Also critical to the search for an appropriate intranet searching mechanism is an understanding of the user and his or her needs. If the average user does not meet certain qualifications, or if funds are limited, some searching mechanisms will be better than others.
A final consideration centers on the demands that will be placed on the search system being investigated. Questions of importance include the platform upon which the system will run and the types of operations and level of customization desired. When reviewing the possibilities, it is wise to have a test suite of documents, representing the kinds your organization uses, readily available for indexing and searching.
MATCHING THE SYSTEM TO THE NEED
Every intranet is different, just as almost every living organism is different. However, some systems stand out for their ability to minimize clutter and streamline the search process. While Web search agents within programs are common, few are fully capable of delivering only what you want. To assist agents, document collections must themselves be sifted into rational categories, a process that can be automated.
One system that successfully improves the value of Web search agents and automates category building is Information Access Systems' Judgement Space, or J-space®. Originating from the U.S. Air Force's Artificial Intelligence Center, J-space has been commercially available through integrated products for more than ten years.
As companies grow, their document collections tend to evolve into islands of information searched and managed by incompatible systems. To solve this problem, Infodata Systems, of Fairfax, Virginia, has developed a product called Virtual File Cabinet (VFC). This customizable, Web-based system allows users to access, organize, and share documents. With VFC, users can search, retrieve, edit, and file information throughout the enterprise, regardless of where the documents originated or are stored, by navigating a hierarchy of collections that uses an intuitively obvious metaphor.
But all that power may go for naught if a system is too difficult to use. Several years ago, a Massachusetts-based commercial property insurance company implemented a powerful industry-leading search system. It was second-to-none in its power and customized to provide search features appropriate to the business's needs. Unfortunately, it was not easy to use, and the system never met the company's usage goals. Lesson learned: If a system flunks the usability test, or does not meet a user's needs, it will either be underused or replaced.
The best systems not only include useful search aids, but will chunk document collections into Yahoo-like categories, thus reducing the number of responses. Furthermore, systems should ably rank results by relevancy and provide summaries of results, clusters of results, and the option to "find me more like this." While doing all this, the systems should also provide an automated setup of categories for searching, and search constantly growing, heterogeneous groups and types of information, including nontextual media and structured database information.
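Relevancy ranking and "find me more like this" can both be built on the same foundation: represent each document as a weighted word vector and compare vectors. The Python sketch below uses TF-IDF weights and cosine similarity, a common textbook technique chosen here for illustration; the products discussed in this article may rank quite differently. The sample documents and function names are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    """Return ({doc_id: {word: weight}}, idf) using TF-IDF weighting."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())
    n = len(docs)
    # Rare words get larger weights; the +1 keeps common words above zero.
    idf = {w: math.log(n / df) + 1.0 for w, df in doc_freq.items()}
    vectors = {doc_id: {w: tf * idf[w] for w, tf in c.items()}
               for doc_id, c in counts.items()}
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank(query, vectors, idf):
    """Document ids ordered from most to least relevant to the query."""
    q = {w: tf * idf.get(w, 0.0) for w, tf in Counter(tokenize(query)).items()}
    scored = [(cosine(q, vec), doc_id) for doc_id, vec in vectors.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

def more_like_this(doc_id, vectors):
    """Other documents ordered by similarity to the given document."""
    scored = [(cosine(vectors[doc_id], vec), other)
              for other, vec in vectors.items() if other != doc_id]
    return [other for score, other in sorted(scored, reverse=True) if score > 0]

# Hypothetical sample collection, for illustration only.
DOCS = {
    "a": "fire insurance claims for commercial property",
    "b": "commercial property insurance rates and fire coverage",
    "c": "cafeteria menu for the week",
}
VECTORS, IDF = tfidf_vectors(DOCS)
print(rank("fire insurance", VECTORS, IDF))
print(more_like_this("a", VECTORS))
```

The cafeteria menu never appears in the ranked results for "fire insurance" because it shares no query words, while "more like this" on the first document places the second document ahead of the menu. Clustering and automatic category building can reuse the same similarity measure.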
EXTENDING HTML: WHERE SEARCHING IS HEADED
Given the need for continuously improved search systems, there are several key areas to watch. For example, anyone considering implementing or using intranet searching systems must pay close attention to XML, which will likely facilitate more tailored searching.
Metaphorically, HTML can be compared to Henry Ford's Model T: It made automobiles available to the masses, was simple and affordable, and you could get it in any color you wanted as long as that color was black. SGML--HTML's parent standard--is like a Mercedes-Benz, with a full range of mix-and-match options. It is built for the long haul, but is quite expensive. XML, like the majority of vehicles on the market, is fully customizable and affordable, but lacks some of SGML's capabilities. Although the XML specification is only about one-tenth the size of the SGML specification, it is still remarkably powerful; thus, the mad rush to make everything, including browsers, publishing systems, and search engines, support XML.
The biggest benefit of XML to search systems will be its ability to enable zoned searching, that is, full-text searching within custom document elements. Likewise, XML supports ISO 10646 (the international character set aligned with the Unicode standard), enabling support for international languages and the use of those languages in XML tags. Though no search engine or document management system currently supports XML, it is a sure bet that many will be pledging support within the next year.
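Zoned searching is simple to picture once a document carries its own element names. The sketch below uses the XML parser in Python's standard library to restrict a text match to one named element. The sample document, its element names (policy, title, summary, body), and the zoned_search function are hypothetical and stand in for whatever custom elements an organization defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with custom elements, for illustration only.
SAMPLE = """<policy>
  <title>Travel Expenses</title>
  <summary>Rules for airfare and hotel reimbursement.</summary>
  <body>Employees must book travel through the approved agency.</body>
</policy>"""

def zoned_search(xml_text, zone, word):
    """Return the text of each `zone` element that contains the word."""
    root = ET.fromstring(xml_text)
    needle = word.lower()
    hits = []
    for element in root.iter(zone):
        # itertext() gathers text from the element and any nested children.
        text = "".join(element.itertext())
        if needle in text.lower():
            hits.append(text.strip())
    return hits

print(zoned_search(SAMPLE, "title", "travel"))
print(zoned_search(SAMPLE, "summary", "travel"))
print(zoned_search(SAMPLE, "body", "travel"))
```

The word "travel" is found in the title and body zones but not in the summary, so a user who asks only for policies with "travel" in the title gets a far shorter list than a search of the whole text would return.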
Yet another point that deserves consideration by potential users is a vendor's membership in the World Wide Web Consortium. By accessing its Web site, at http://www.w3.org/Consortium/Member/List, one may determine whether a particular vendor is committed to emerging Web standards, including XML. Microsoft, Netscape, and Digital Equipment, for instance, are all members of the World Wide Web Consortium, yet only one company's product--Microsoft's Internet Explorer 4.0--had sufficiently committed to XML in early 1998.
Of course, developing and marketing search systems has not proven to be the highly profitable endeavor originally anticipated. Vendors who seemed to hold solid positions in the top tier of vendor comparison lists have struggled recently. Verity, for one, is not the economic powerhouse it used to be. Fulcrum, once a top-tier vendor with references including Microsoft, was acquired by PC DOCS, developers of the document management system of the same name. And new search system vendors like Oracle are emerging from other disciplines, too.
Given these trends, and the evolution of the intranet into a viable repository for important data, the need for effective searching mechanisms seems more important than ever. The demand for intranet searches--which, at first, seemed to be a simple problem--is rapidly becoming a noticeably complex undertaking. Luckily, search technologies and vendor offerings are maturing, with no end, but at least a light, in sight.