WNILS Working Group                                        Chris Weider
INTERNET-DRAFT                                      Merit Network, Inc.
                                                            Jim Fullton
                                                        UNC Chapel Hill
                                                            Simon Spero
11/10/92                                                UNC Chapel Hill

              Architecture of the Whois++ Index Service

Status of this memo:

The authors describe an architecture for indexing in distributed
databases, and apply this to the WHOIS++ protocol.

This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its Areas,
and its Working Groups. Note that other groups may also distribute
working documents as Internet Drafts.

Internet Drafts are draft documents valid for a maximum of six months.
Internet Drafts may be updated, replaced, or obsoleted by other
documents at any time. It is not appropriate to use Internet Drafts as
reference material or to cite them other than as a "working draft" or
"work in progress."

Please check the I-D abstract listing contained in each Internet Draft
directory to learn the current status of this or any other Internet
Draft.

This Internet Draft expires May 10, 1993.

1. Purpose:

The WHOIS++ directory service [GDS, 1992] is intended to provide a
simple, extensible directory service predicated on a template-based
information model and a flexible query language. This document
describes an architecture designed to link together many of these
WHOIS++ servers into a distributed, searchable wide area directory
service.

2. Scope:

This document details a distributed, easily maintained architecture
for providing a unified index to a large number of distributed WHOIS++
servers. This architecture can be used with systems other than WHOIS++
to provide a distributed directory service which is also searchable.

3. Motivation and Introduction:

It seems clear that with the vast amount of directory information
potentially available on the Internet, it is simply infeasible to
build a centralized directory to serve all this information.
Therefore, we should look at building a distributed directory service.
If we are to distribute the directory service, the easiest (although
not necessarily the best) way of building it is to build a hierarchy
of directory information collection agents. In this architecture, a
directory query is delivered to a certain agent in the tree, and then
handed up or down, as appropriate, until it reaches the agent which
holds the information that satisfies the query. This approach has been
tried before, most notably in some implementations of the X.500
standard. However, there are two major flaws with the approach as it
has been taken. This new Index Service is designed to fix these flaws.

3.1 The search problem

Current implementations of this hierarchical architecture require that
a search query issued at a certain location in the directory agent
tree be replicated to _all_ subtrees, because there is no way to tell
which subtrees might contain the desired information. This has rather
extreme scaling problems, and in fact the search facility has been
turned off in the X.500 architecture because of this problem.

Our new WHOIS++ architecture solves this problem by keeping a set of
'forward information' at each level of the tree. That is, each level
of the tree has some idea of where to look lower in the tree to find
the requested information. Consequently, the search tree can be pruned
enormously, making search feasible at all levels of the tree.

We have chosen a certain set of information to hand up the tree as
forward information; this may or may not be exactly the set of
information required to build a truly searchable directory. However,
it seems clear that without some sort of forward information, the
search problem becomes intractable.

3.2 The location problem

Current implementations of this hierarchical architecture also encode
details about the directory agent hierarchy in the location
information for a specific entry.
With search turned off, this requires a user to know exactly how the
hierarchy of servers is laid out and how they are named, which leads
to acrimonious debate about the shape of the name space and really
massive headaches whenever it becomes apparent that the current
namespace is unsuited to the current usages and must be changed.

The new Index Service gets around this by a) not enforcing a true
hierarchy on the directory agents, b) dissociating the directory
service from the information served, and c) allowing new hierarchies
to be built whenever necessary, without destroying the hierarchies
already in place. Thus a user does not need to know in advance where
in the hierarchy the information served is contained, and the
information a user enters to guide the search never has to show up
explicitly in the hierarchy.

Although there are provisions in the WHOIS++ query syntax to watch the
directory service as it hands the query around, and consequently to
divine the structure of the directory service hierarchy, that
structure is not relevant to the user, and never has to be taken into
consideration.

3.3 The Yellow Pages problem

Current implementations of this hierarchical architecture have also
been unsuited to solving the Yellow Pages problem; that is, the
problem of easily and flexibly building special-purpose directories
(say of molecular biologists) and of automatically maintaining these
directories once they have been built. In particular, with the current
systems one has to build into the name space the attributes
appropriate to the new directory. Since our new Index Service very
easily allows directory servers to pick and choose between information
proffered by a given entry server, and because we have an architecture
which allows for automatic polling of data, Yellow Pages capabilities
fall very naturally out of the design.
Although the ability to search all levels of the tree(s) gets us a
long way towards the Yellow Pages, it is this capacity to locate,
gather, and maintain information in a distributed and selective way
that really solves the problem.

4. Components of the Index Service:

4.1 WHOIS++ servers

The whois++ service is described in [GDS, 1992]. As that service
specifies only the query language, the information model, and the
server responses, whois++ services can be provided by a wide variety
of databases and directory services. However, to participate in the
Index Service, that underlying database must also be able to generate
a 'centroid' for the data it serves.

4.2 Centroids as forward knowledge

The centroid of a server consists of a list of the templates and
attributes used by that server, and a word list for each attribute.
The word list for a given attribute contains one occurrence of every
word which appears at least once in that attribute in some record in
that server's data, and nothing else.

For example, if a whois++ server contains exactly three records, as
follows:

   Record 1                        Record 2
   Template: User                  Template: User
   First Name: John                First Name: Joe
   Last Name: Smith                Last Name: Smith
   Favourite Drink: Labatt Beer    Favourite Drink: Molson Beer

   Record 3
   Template: Domain
   Domain Name: foo.edu
   Contact Name: Mike Foobar

the centroid for this server would be

   Template: User
   First Name: Joe John
   Last Name: Smith
   Favourite Drink: Beer Labatt Molson

   Template: Domain
   Domain Name: foo.edu
   Contact Name: Mike Foobar

It is this information which is handed up the tree to provide forward
knowledge. As mentioned above, this may not turn out to be the ideal
solution for forward knowledge, and we suspect that there may be a
number of different sets of forward knowledge used in the Index
Service.
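The centroid construction described in section 4.2 can be sketched in
a few lines of Python. This is only an illustration and not part of
the WHOIS++ specification: the function names build_centroid and
may_hold, the dictionary representation of records, whitespace word
splitting, case-sensitive comparison, and the alphabetical ordering of
each word list are all assumptions made for the example.

```python
from collections import defaultdict


def build_centroid(records):
    """Return {template: {attribute: sorted list of distinct words}}.

    Each record is a dict with a "Template" key plus attribute/value
    pairs (a hypothetical representation chosen for this sketch).
    """
    words = defaultdict(lambda: defaultdict(set))
    for record in records:
        template = record["Template"]
        for attribute, value in record.items():
            if attribute == "Template":
                continue
            # Assumption: a "word" is a whitespace-delimited token.
            words[template][attribute].update(value.split())
    return {
        template: {attr: sorted(ws) for attr, ws in attrs.items()}
        for template, attrs in words.items()
    }


def may_hold(centroid, template, attribute, word):
    """True if a server with this centroid might hold a matching record."""
    return word in centroid.get(template, {}).get(attribute, [])


# The three records from the example in section 4.2.
records = [
    {"Template": "User", "First Name": "John", "Last Name": "Smith",
     "Favourite Drink": "Labatt Beer"},
    {"Template": "User", "First Name": "Joe", "Last Name": "Smith",
     "Favourite Drink": "Molson Beer"},
    {"Template": "Domain", "Domain Name": "foo.edu",
     "Contact Name": "Mike Foobar"},
]

centroid = build_centroid(records)
```

A server holding this centroid could use a check such as
may_hold(centroid, "User", "Last Name", "Smith") to decide whether a
query is worth passing down to the server the centroid came from.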
However, the directory architecture is in a very real sense
independent of what types of forward knowledge are handed around, and
it is entirely possible to build a unified directory which uses many
types of forward knowledge.

4.3 Index servers and Index server Architecture

A whois++ index server collects and collates the centroids (or other
forward knowledge) of either a number of whois++ servers or of a
number of other index servers. An index server must be able to
generate a centroid for the information it contains.

4.3.1 Queries to index servers

An index server will take a query in standard whois++ format, search
its collections of centroids, determine which servers hold records
which may fill that query, and then forward the query to the
appropriate servers.

4.3.2 Index server distribution model and centroid propagation

The diagram below illustrates how a tree of index servers is created
for a set of whois++ servers.

     whois++               index                   index
     servers               servers                 servers
                           for                     for
     _______               whois++                 lower-level
    |       |              servers                 index servers
    |   A   |__
    |_______|  \            _______
                \----------|       |
     _______               |   D   |__             ______
    |       |   /----------|_______|  \           |      |
    |   B   |__/                       \----------|      |
    |_______|                                     |  F   |
                                       /----------|______|
                                      /
     _______                _______  /
    |       |              |       |-
    |   C   |--------------|   E   |
    |_______|              |_______|

In the portion of the index tree shown above, whois++ servers A and B
hand their centroids up to index server D, whois++ server C hands its
centroid up to index server E, and index servers D and E hand their
centroids up to index server F.

The number of levels of index servers, and the number of index servers
at each level, will depend on the number of whois++ servers deployed,
and the response time of individual layers of the server tree. These
numbers will have to be determined in the field.

4.3.4 Centroid propagation and changes to centroids

Centroid propagation is initiated by an authenticated POLL command
(sec. 5.2).
The format of the POLL command allows the poller to request the
centroid of any or all templates and attributes held by the polled
server. After the polled server has authenticated the poller, it
determines which of the requested centroids the poller is allowed to
request, and then issues a CENTROID-CHANGES report (sec. 5.3) to
transmit the data. When the poller receives the CENTROID-CHANGES
report, it can authenticate the pollee to determine whether to add the
centroid changes to its data. Additionally, if a given pollee knows
what pollers hold centroids from the pollee, it can signal to those
pollers the fact that its centroid has changed by issuing a
DATA-CHANGED command. The poller can then determine if and when to
issue a new POLL request to get the updated information. The
DATA-CHANGED command is included in this protocol to allow
'interactive' updating of critical information.

4.3.5 Query handling and passing algorithm

When an index server receives a query, it searches its collection of
centroids, and determines which servers hold records which may fill
that query. As whois++ becomes widely deployed, it is expected that
some index servers may specialize in indexing certain whois++
templates or perhaps even certain fields within those templates. If an
index server obtains a match with the query _for those template fields
and attributes the server indexes_, it is to be considered a match for
the purpose of forwarding the query.

When the index server has completed its search to match the query to a
server, it then forwards the request as shown in 5.4. Each server in
the chain can then use the authentication information included in the
FORWARDED-QUERY command to determine whether to continue forwarding
the query. Also, a whois++ query can specify the 'trace' option, which
sends to the user a string containing the IANA handle and an
identification string for each index server the query is handed to.

5. Syntax for operations of the Index Service:

5.1 Data changed syntax

The data changed template looks like this:

DATA-CHANGED:
   Version-number: // version number of index service software, used
      // to ensure compatibility
   Time-of-latest-centroid-change: // time stamp of latest centroid
      // change, GMT
   Time-of-message-generation: // time when this message was
      // generated, GMT
   Server-handle: // IANA unique identifier for this server
   Best-time-to-poll: // For heavily used servers, this will identify
      // when the server is likely to be lightly loaded
      // so that response to the poll will be speedy, GMT
   Authentication-type: // Type of authentication used by server, or
      // NONE
   Authentication-data: // data for authentication
END DATA-CHANGED // This line must be used to terminate the data
      // changed message

5.2 Polling syntax

POLL:
   Version-number: // version number of poller's index software, used
      // to ensure compatibility
   Start-time: // give me all the centroid changes starting at this
      // time, GMT
   End-time: // ending at this time, GMT
   Template: // a standard whois++ template name, or the keyword ALL,
      // for a full update.
   Field: // used to limit centroid update information to specific
      // fields, is either a specific field name, a list of field
      // names, or the keyword ALL
   Server-handle: // IANA unique identifier for the polling server.
      // this handle may optionally be cached by the polled
      // server to announce future changes
   Authentication-type: // Type of authentication used by poller, or
      // NONE
   Authentication-data: // Data for authentication
END POLL // This line must be used to terminate the poll message

5.3 Centroid change report

CENTROID-CHANGES:
   Version-number: // version number of pollee's index software, used
      // to ensure compatibility
   Start-time: // change list starting time, GMT
   End-time: // change list ending time, GMT
   Server-handle: // IANA unique identifier of the responding server
   Authentication-type: // Type of authentication used by pollee, or
      // NONE
   Authentication-data: // Data for authentication
   Compression-type: // Type of compression used on the data, or NONE
   Size-of-compressed-data: // size of compressed data if compression
      // is used
   Operation: // One of 3 keywords: ADD, DELETE, FULL
      // ADD - add these entries to the centroid for this server
      // DELETE - delete these entries from the centroid of this
      // server
      // FULL - the full centroid as of end-time follows
 Multiple occurrences of the following block of fields:
   Template: // a standard whois++ template name
   Field: // a field name within that template
   Data: // the word list itself, one per line, cr/lf terminated
 end of multiply repeated block
END CENTROID-CHANGES // This line must be used to terminate the
      // centroid change report

5.4 Forwarded query

FORWARDED-QUERY:
   Version-number: // version number of forwarder's index software,
      // used to ensure compatibility
   Forwarded-From: // IANA unique identifier of the server forwarding
      // query
   Forwarded-time: // time this query forwarded, GMT (used for
      // debugging)
   Trace-option: // YES if query has 'trace' option listed, NO if not.
      // used at message reception time to generate trace information
   Query-origination-address: // address of origin of query
   Body-of-Query: // The original query goes here
   Authentication-type: // Type of authentication used by queryer
   Authentication-data: // Data for authentication
END FORWARDED-QUERY // This line must be used to terminate the body of
      // the query

6. Authors' Addresses

Chris Weider
clw@merit.edu
Industrial Technology Institute, Pod G
2901 Hubbard Rd, Ann Arbor, MI 48105
O: (313) 747-2730
F: (313) 747-3185

Jim Fullton
fullton@mdewey.ga.unc.edu
310 Wilson Library CB #3460
University of North Carolina
Chapel Hill, NC 27599-3460
O: (919) 962-9107
F: (919) 962-5604

Simon Spero
ses@sunsite.unc.edu
310 Wilson Library CB #3460
University of North Carolina
Chapel Hill, NC 27599-3460
O: (919) 962-9107
F: (919) 962-5604