ASSOCIATES (vol. 1, no. 1, July 1994) - associates.ucr.edu
[This is the first in a series of articles on full text data management compiled by John Lozowsky, Assistant Director, Information Management, and the staff at The Treasury, Wellington, New Zealand. Further articles will cover their problems, the search for replacement software, its installation, interface development and data conversion, and the results. The Editors.]

FULL TEXT REVISITED. The Treasury Experience, circa 1992

New Zealand Library and Information Services Conference, March 1992

A speech presented by Megan Clark, former Information Manager, Library Services, The Treasury

Not long after I started work at Treasury I spoke at New Plymouth on what Treasury was doing in its Information Centre. Three years down the track I can proudly assert that the experiment has succeeded. We still have problems, both people and computer problems. Anyone who has implemented a large network system will understand there is always something round the corner. I am not going to dwell on the personnel issues of the Information Centre, however. My purpose today is to revisit our full text database and to discuss its pros and cons.

1. Why did Treasury go for a full text database?
2. Is it worth the expense and technical input needed?
3. Do the people it was created for like it/use it?
4. Do the information specialists like it?
5. Wider issues of document retrieval.

These are the questions I will address today.

1. Why did Treasury choose full-text retrieval?

Treasury went the full text way to provide an information resource for its staff who write, in total, hundreds of papers each year for Ministers of the Crown. These papers cover an incredibly wide range of topics. The idea was that if the information staff need to write these papers were online, they would access it themselves. Analysts kept filing cabinets full of copies of their own and other people's papers for "quick referral".
Many of you will know the monetary cost of that form of storage plus the inefficiency costs of retrieving information that way. Information stored locally and unindexed is often narrow and doesn't usually contain more than one point of view. In fact, writing a paper from these sources is usually called "top of the head stuff". Papers were often stored in the central files and duplicated elsewhere or, worse still, not stored in central files and, therefore, accessible only to the author from a personal file. Corporate knowledge was not shared and, in an organisation with many new staff each year and a high turnover, this meant valuable institutional knowledge was often lost forever.

A consulting company, Logica Pty, was chosen to look at Treasury's information needs. What happened then is history. Basis was chosen and a complete, fully linked office automation package was bought. The hardware is Digital, the electronic mail is All-In-1 and the word processor is WPS-PLUS. Treasury upgraded to Basis Plus in November 1990.

Data, text and references can be sent electronically all around the building, loaded into the database, text and all, and transferred back out. A copy can be edited or reworded and the new edition loaded again. Documents can be replaced if necessary, although that has archival implications and, therefore, is not done. In other words, a sophisticated and powerful information retrieval system was introduced. Basis Plus has a sophisticated search facility for the information specialist as well as the easier interface module used by all users.

2. Is it worth the expense and technical input needed?

The database is now vast and growing fast. It has about 90,000 records, many of which are long text. It also has many references to journal articles received within the centre, plus the library catalogue and the files index. Incoming correspondence is indexed so that it can be tracked and outgoing records matched to the incoming.
It is used for Ministerial tracking and could, if we wished, be used for serials maintenance, acquisitions and book/file issuing. It is a successful database and certainly makes Treasury generated material much easier to locate and deal with.

All documents are assigned index terms from the thesaurus. The thesaurus used by our staff has been developed from within the organisation. Many hours of work have gone into its development. Indexing is currently done by a variety of staff within the Information Centre.

To get to this stage has required an almost full time database administrator and a full time technical computer analyst, plus support from Digital, the supplier, and some support from IDI, the software creators. It falls over on occasions; like all software, it has bugs and sometimes we could kill it. Still, the system now makes life easier for us.

I think that the main differences between full text and bibliographic databases are the staff input needed and the rapid growth in database size. A bibliographic database will retrieve records and can be used for tracking in the same way as our system. In that case, though, the paper copy still has to be found. One has to hope that the author sent a copy for filing, or that no person has removed the paper you want off the file, or that the file hasn't disappeared, etc., etc.

A database such as the one we have created is not a recipe to reduce staffing levels. All staff create templates for documents being input onto the system. Where text is available, a number of staff are responsible for attaching the text to the relevant templates. Most staff assist with assigning index terms, but not all are able to. The Treasury database is full text with thesaurus control in order to enhance retrieval both bibliographically and textually. For these reasons, the database works and is a success. Whether it is worth the expense probably depends on what side of the political spectrum you are on.
In my view, the database is worth the time and money spent on it. I am not saying this is the only software that will do these things. Others will be similar for different costs; ours is, however, a complete integrated package, hence the expense.

3. Do our users like or use the database?

Sievert and McKinnin (Ojala, Aug 1990) admit that teaching the techniques of relevant and precise full text searching to novice searchers and to end-users is extremely difficult. There is a real danger of end-users becoming end-losers if they are unable to come to grips with full text searching. Whether non-information specialists get the best out of a full text database depends on how keen they are on computers, information or both.

Many of our staff are still non-users and we are yet to isolate why. They tell us it is hard to use, especially for infrequent searchers. Massive training of users is necessary to encourage use of the database. Search skills are easy to forget if they are only used now and then. Infrequent users expect or prefer menu driven options which experienced searchers loathe. Inexperienced searchers often get huge hits or zero postings. Either outcome turns them off. In our experience, a user who fails to find what they want is less likely to try again. A librarian used to online searching would find the interface clumsy and difficult. However, our users often find their searching facility unfriendly.

The Treasury database is one database containing all types of material. Users do not always want all types of material. For example, if they wish to search only for correspondence, they can get disillusioned fast if they cannot remember how to eliminate the other types of material.

With a database this large, groups of people are needed to input the material and assign the subject headings. Herein lies one of our main problems.
The quality of indexing varies markedly and this presents problems in itself because documents can disappear with bad indexing. The plus of full text data is that a text search can often retrieve an item that the indexing missed, and the reverse is also true: good indexing can retrieve a document that natural language searching may not turn up. Full text databases need a controlled vocabulary of index or descriptor terms in order to increase recall and precision. I will discuss this a little more fully at the end.

4. Do information specialists like it?

The answer is yes and no. As with all database software, it takes quite some time to get used to a different style. Compared to searching some full text compact disc databases, ours seems a little clumsy. However, a thesaurus can be loaded on to Basis and all records checked against it. This form of data validation can also be used to check data in other fields. This is a plus for those information specialists who like thesaurus control. It can aid document retrieval but it also involves a thesaurus manager in a lot of work keeping the thesaurus up to date.

We are in an important marketing phase at the moment. Our users need to be cajoled into using the system because it is Treasury-wide. It is considered to have failed if all staff don't use it. I don't consider that valid. I believe that if the information staff can retrieve the documents required by the user and perform the essential tracking functions, then the system is a success. I believe our experiment is reasonably successful but we do still have a long way to go in terms of total compliance.

I read recently a quote from John M. Clark in _Records Management Quarterly_: "No information management system can store or classify the sum total of human ideas, that is, knowledge in such a way that the user will be able to access all relevant data promptly and efficiently in the desired format."
This quote is relevant considering the volume of information we are dealing with all the time; by comparison, our 55,000 records are minuscule. It is, however, realistic to expect a database such as ours to retrieve the documents contained within, promptly and efficiently.

5. What are the wider issues of document retrieval?

I would like to dwell for a while on the issue of bibliographic vs full-text databases, which I touched on earlier. Enclosed is a list of articles I read whilst preparing for this talk. It seems to me the debate on which type of database produces the higher precision results is very relevant to those of us who have introduced or are introducing full-text databases into our organisations. Your users will expect very high precision and equally high recall. This is not possible if you go the full-text way with no subject indexing of your data. All of the articles I have included verify this. The indexing, which complements the full text, adds controlled terms to your data and so overcomes many of the problems of natural language searching.

Many full-text software packages automatically index most terms in a document (Basis certainly does). This is referred to as automatic indexing. If you index those documents using a controlled thesaurus, this is referred to as manual indexing. In the experience of these writers and, to a limited extent, from our own database, both methods are essential if you are to meet the high expectations of your users.

This is not cheap and it is not easy. Indexing is a highly skilled and vital part of any useful database. Staff time and skills will need to be allocated to that task. Indexing has been one of the most difficult aspects of our computerisation. Teaching people to index accurately and concisely is particularly difficult. It is a task few people enjoy doing and, therefore, few people understand the importance of indexing.
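
The complementary roles of automatic and manual indexing described above can be illustrated with a minimal sketch. It does not represent how Basis works internally; the three sample records, the thesaurus term TRADE POLICY, the set of "relevant" records and every function name are assumptions made purely for illustration. It shows a word index built automatically from the text, a second index built from manually assigned terms, and the effect on precision and recall of searching both.

```python
import re
from collections import defaultdict

# Hypothetical toy records: free text plus manually assigned thesaurus terms.
RECORDS = {
    1: {"text": "Report on tariff reform and import duties", "terms": {"TRADE POLICY"}},
    2: {"text": "Briefing on trade policy options for the Minister", "terms": {"TRADE POLICY"}},
    3: {"text": "Letter about office furniture", "terms": {"ADMINISTRATION"}},
}


def build_text_index(records):
    """Automatic indexing: map each word in the text to the ids of records containing it."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for word in re.findall(r"[a-z]+", rec["text"].lower()):
            index[word].add(rec_id)
    return index


def build_term_index(records):
    """Manual indexing: map each assigned thesaurus term to the ids of records carrying it."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for term in rec["terms"]:
            index[term].add(rec_id)
    return index


def text_search(text_index, query):
    """Return ids of records whose text contains every word of the query."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return set()
    result = set(text_index.get(words[0], set()))
    for word in words[1:]:
        result &= text_index.get(word, set())
    return result


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


text_index = build_text_index(RECORDS)
term_index = build_term_index(RECORDS)
relevant = {1, 2}  # assumed: both papers concern trade policy

by_text = text_search(text_index, "trade policy")
by_term = term_index.get("TRADE POLICY", set())

print(precision_recall(by_text, relevant))            # text search alone
print(precision_recall(by_text | by_term, relevant))  # text search plus thesaurus term
```

With these toy records the text search finds only record 2, because record 1 discusses tariffs without using the words "trade policy"; precision is 1.0 but recall is 0.5. Adding the records indexed under the controlled term brings recall up to 1.0, which is the point made above about the two methods complementing each other.
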
Indexing is also very subjective as no two indexers view a document in the same way. With poor indexing, documents can become very hard to retrieve.

In conclusion, I would like to assert that a fully indexed full-text database is an essential part of our information service. Treasury staff like the facility of having all types of information available in the one database. It is a mixture of full-text Treasury generated material and bibliographic external material.

The project we are reviewing for 1991/92 is imaging. We are experimenting with scanning documents into the database and perhaps at next year's conference I can share with you the successes and pitfalls of scanning and imaging. I would like to reiterate here that all of the problems mentioned above are going to be relevant with imaging. When imaging becomes more prevalent, indexing and its associated problems are going to be just as important.

If any of you have any questions I will be happy to answer them if I can.

Sources

Basch, Reva. `The seven deadly sins of full-text searching'. _Database_. Aug 1989. pp. 15-23.
Blair, David C. and Maron, M. E. `Full-text information retrieval: Further analysis and clarification'. _Information Processing and Management_. V.26, no. 3, 1990. pp. 437-447.
Clark, John M. `Using image scanners to create and access electronically stored documents'. _Records Management Quarterly_. V.25, no. 3, July 1991. pp. 9-10, 12-13.
Locke, Christopher. `The dark side of DIP : Do you know where your documents are? : Do you know what's in them?'. _Byte_. April 1991. pp. 193-204.
Ojala, Marydee. `Research into full-text retrieval'. _Database_. Aug 1990. pp. 78-80.
Tenopir, Carol. `Full-text database retrieval performance'. _Online Review_. V.9, no. 2, 1985. pp. 149-165.