Associates Volume 1 Number 1 July 1994 - Feature

ASSOCIATES (vol. 1, no. 1, July 1994) - associates.ucr.edu

[This is the first in a series of articles on full text data
management compiled by John Lozowsky, Assistant Director,
Information Management and the staff at The Treasury, Wellington,
New Zealand.  Further articles will cover their problems, the
search for replacement software, its installation, interface
development and data conversion, and the results.  The Editors.]
 
 
 
		     FULL TEXT REVISITED.
	       The Treasury Experience, circa 1992
 
	  New Zealand Library and Information Services
		     Conference, March 1992
 
		A speech presented by Megan Clark
		   former Information Manager
			Library Services
			  The Treasury
 
 
Not long after I started work at Treasury I spoke at New Plymouth
on what Treasury was doing in its Information Centre.  Three
years down the track I can proudly assert that the experiment has
succeeded.  We still have problems, both people and computer
problems.  Anyone who has implemented a large network system
will understand there is always something round the corner. I am
not going to dwell on the personnel issues of the Information
Centre, however.  My purpose today is to revisit our full text
database and to discuss its pros and cons.
 
1.   Why did Treasury go for a full text database?
 
2.   Is it worth the expense and technical input needed?
 
3.   Do the people it was created for like it/use it?
 
4.   Do the information specialists like it?
 
5.   Wider issues of document retrieval.
 
These are the questions I will address today.
 
 
1.   Why did Treasury choose full-text retrieval?
 
Treasury went the full text way to provide an information
resource for its staff who write, in total, hundreds of papers
each year for Ministers of the Crown.  These papers cover an
incredibly wide range of topics.  If the information the staff
need to write these papers was online, the idea was staff would
access that information themselves.  Analysts kept filing
cabinets full of copies of their own and other people's papers for
"quick referral".  Many of you will know the monetary cost of
that form of storage plus the inefficiency costs of retrieving
information that way.
 
Information stored locally and unindexed is often narrow and
doesn't usually contain more than one point of view.  In fact
writing a paper from these sources is usually called "top of the
head stuff".  Paper was often stored in the central files and
duplicated elsewhere or, worse still, not stored in central files
and, therefore, accessible only to the author from a personal
file.  Corporate knowledge was not shared and, in an organisation
with many new staff each year and a high turnover, this meant
valuable institutional knowledge was often lost forever.
 
A consulting company, Logica Pty, was chosen to look at
Treasury's information needs.  What happened then is history.
Basis was chosen and a complete office automation package was
bought which all linked together.  The hardware is Digital, the
electronic mail is All-In-1 and the word processor is WPS-PLUS.
Treasury upgraded to Basis Plus in November 1990.
 
Data, text and references can be sent electronically all around
the building and can be loaded into the database, text and all,
and can be transferred back out.  A copy can be edited or
reworded and the new edition loaded again.  Documents can be
replaced if necessary, although that has archival implications
and, therefore, is not done.
 
In other words a sophisticated and powerful information retrieval
system was introduced.  Basis Plus has a sophisticated search
facility for the information specialist as well as the easier
interface module used by all users.
 
2.   Is it worth the expense and technical input needed?
 
The database is now vast and growing fast.  It has about 90,000
records, many of which are long text.  It also has many references
to journal articles received within the centre plus the library
catalogue and the files index.  Incoming correspondence is
indexed in order that it may be tracked and outgoing records
matched to the incoming.  It is used for Ministerial tracking
and could, if we wish, be used for serials maintenance,
acquisitions and book/file issuing.
 
It is a successful database and certainly makes Treasury
generated material much easier to locate and deal with.
All documents are assigned index terms from the thesaurus.  The
thesaurus used by our staff has been developed from within the
organisation.  Many hours of work have gone into its development.
Indexing is currently done by a variety of staff within the
Information Centre.
 
To get to this stage has required an almost full time Database
administrator and a full time technical computer analyst plus
support from Digital, the supplier, and some support from IDI,
the software creators.  It falls over on occasions; like all
software, it has bugs and sometimes we could kill it.  Still, the
system now makes life easier for us.
 
I think that one of the main differences between full text and
bibliographic databases is the staff input needed and the rapid
growth in database size.  A bibliographic database will retrieve
records and can be used for tracking in the same way as our
system.  In this case, though, the paper copy still has to be
found.  One has to hope that the author sent a copy for filing,
or that no person has removed the paper you want off the file, or
that the file hasn't disappeared, etc., etc.
 
A database such as the one we have created is not a recipe to
reduce staffing levels.  All staff create templates for documents
being input onto the system.  Where text is available, a number
of staff are responsible for attaching the text to the relevant
templates.  Most staff assist with assigning index terms, but not
all are able to.  The Treasury database is full text with
thesaurus control in order to enhance retrieval both
bibliographically and textually.
 
For these reasons, the database works and is a success.
 
Whether it is worth the expense probably depends on what side of
the political spectrum you are on.  In my view, the database is
worth the time and money spent on it.  I am not saying this is
the only software that will do these things.  Others will be
similar for different costs; ours is, however, a complete
integrated package, hence the expense.
 
3.   Do our users like or use the database?
 
Sievert and McKinin (Ojala, Aug 1990) admit that teaching the
techniques of relevant and precise full text searching to novice
searchers and to end-users is extremely difficult.  There is a
real danger of end-users becoming end-losers if they are unable
to come to grips with full text searching.  Whether
non-information specialists get the best out of a full text
database depends on how keen they are on computers, information
or both.  Many of our staff are still non-users and we are yet
to isolate why.  They tell us it is hard to use, especially for
infrequent searchers.
 
Massive training of users is necessary to encourage use of the
database.  Search skills are easy to forget if they are only used
now and then.  Infrequent users expect or prefer menu driven
options which experienced searchers loathe.  Inexperienced
searchers often get huge hits or zero postings. Either option
turns them off.  In our experience, a user who fails to find what
they want is less likely to try again.  A librarian used to
online searching would find the interface clumsy and difficult.
Even so, our users often find their searching facility
unfriendly.
 
The Treasury database is one database containing all types of
material.  Users do not always want all types of material.  For
example, if they wish to eliminate some types of material and
only search for correspondence, they can get disillusioned fast
if they cannot remember how to eliminate the other types of
material.
 
With a database this large, groups of people are needed to input
the material and assign the subject headings.  Herein lies one of
our main problems.  The quality of indexing varies markedly and
this presents problems in itself because documents can disappear
with bad indexing.  The plus of full text data is that a text
search can often retrieve a missed indexed item and the reverse
is true also.  Good indexing can retrieve a document that natural
language searching may not turn up.  Full text databases need a
controlled vocabulary of index or descriptor terms in order to
increase recall and precision.  I will discuss this a little more
fully at the end.
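 
The way text searching and index terms cover for each other can
be sketched in a few lines of Python.  This is an illustration
only: the two records, their index terms and the helper functions
are invented for the example and say nothing about how Basis
works internally.

```python
# Hypothetical records: free text plus index terms assigned by an indexer.
RECORDS = {
    1: {"text": "Report on the sale of state forests to private buyers.",
        "terms": {"PRIVATISATION", "FORESTRY"}},
    2: {"text": "Options for privatisation of the ports.",
        "terms": set()},  # the indexer missed this document
}

def text_search(word):
    """Ids of records whose text contains the word (case-insensitive)."""
    word = word.lower()
    return {rid for rid, rec in RECORDS.items() if word in rec["text"].lower()}

def term_search(term):
    """Ids of records that carry the given controlled index term."""
    return {rid for rid, rec in RECORDS.items() if term in rec["terms"]}

def search(word, term):
    """Union of both routes, so that either one can rescue the other."""
    return text_search(word) | term_search(term)

print(sorted(text_search("privatisation")))   # finds the unindexed record
print(sorted(term_search("PRIVATISATION")))   # finds the indexed record
print(sorted(search("privatisation", "PRIVATISATION")))
```

Record 1 never uses the word "privatisation", so only its index
term finds it; record 2 has no index terms, so only the text
search finds it.  Searching both ways returns both.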
 
4.   Do information specialists like it?
 
Yes and no is the answer to 'do information specialists like it?'
Like all database software, it takes quite some time to get used
to different sorts of styles.  Compared to searching some full
text compact disc databases, ours seems a little clumsy.
However, a thesaurus can be loaded on to Basis and have all
records checked against it.  This form of data validation can be
used to check data in other fields.  This is a plus for those
information specialists who like thesaurus control. This can aid
document retrieval but it also involves a thesaurus manager in a
lot of work keeping the thesaurus up to date.
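 
As a rough sketch of what checking records against a thesaurus
amounts to, the Python below flags index terms that are not
approved.  The thesaurus contents and the function name are
assumptions made for the example; this is not the Basis
validation facility itself.

```python
# Hypothetical thesaurus of approved index terms.
THESAURUS = {"TAXATION", "FORESTRY", "PRIVATISATION"}

def invalid_terms(record_terms, thesaurus=THESAURUS):
    """Return, sorted, the terms on a record that are not in the thesaurus."""
    return sorted(set(record_terms) - thesaurus)

# "TAXES" is not an approved term, so it is reported for correction.
print(invalid_terms(["TAXATION", "TAXES", "FORESTRY"]))
```

The same kind of check works for any field with a fixed list of
allowed values, which is why the thesaurus has to be kept up to
date.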
 
We are in an important marketing phase at the moment.  Our users
need to be cajoled into using the system because it is
Treasury-wide.  It is considered to have failed unless all staff
use it.
 
I don't consider that valid.  I believe if the information staff
can retrieve the documents required by the user and perform the
essential tracking functions, then the system is a success.  I
believe our experiment is reasonably successful but we do still
have a long way to go in terms of total compliance.
 
I read recently a quote from John M. Clark in _Records Management
Quarterly_:  "No information management system can store or
classify the sum total of human ideas, that is, knowledge in such
a way that the user will be able to access all relevant data
promptly and efficiently in the desired format."
 
This quote is relevant considering the volume of information we
are dealing with all the time, and our 55,000 records are
minuscule by comparison.  It is, however, realistic to expect a
database such as ours to retrieve the documents contained
within it promptly and efficiently.
 
5.   What are wider issues of document retrieval?
 
I would like to dwell for a while on the issue of bibliographic
vs full-text databases.  Enclosed is a list of articles I read
whilst preparing for this talk.  I touched on these issues
earlier.
 
It seems to me the debate on which type of database produces the
higher precision results is very relevant to those of us who have
introduced or are introducing full-text databases into our
organisation.  Your users will expect very high precision and
equally high recall.  This is not possible if you go the
full-text way with no subject indexing of your data.  All of the
articles I have included verify this.
 
The indexing, which complements the full-text, will enable many
of the problems of natural language searching to be overcome by
adding the enhancement of controlled terms to your data.  Many
full-text software packages automatically index most terms in a
document (Basis certainly does).  This is referred to as Automatic
Indexing.  If you index those documents using a controlled
thesaurus this is referred to as manual indexing.  In the
experience of these writers and, to a limited extent, from
our own database, both methods are essential if you are to meet
the high expectations of your users.
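 
For readers less familiar with the two measures, precision is the
share of retrieved documents that are relevant, and recall is the
share of relevant documents that are retrieved.  A minimal Python
sketch follows; the function name and the document numbers are
made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, two of them relevant, out of five
# relevant documents held in the database.
p, r = precision_recall({1, 2, 3, 4}, {3, 4, 5, 6, 7})
print(p, r)  # 0.5 0.4
```

A search that pulls back everything has perfect recall and poor
precision; a very narrow search tends to do the opposite.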
 
This is not cheap and it is not easy.  Indexing is a highly
skilled and vital part of any useful database.  Staff time and
skills will need to be allocated to that task.  Indexing has been
one of the most difficult aspects of our computerisation.
Teaching people to index accurately and concisely is particularly
difficult.  It is a task few people enjoy doing and, therefore,
few people understand the importance of indexing.  Indexing is
also very subjective as no two indexers view a document in the
same way.  With poor indexing, documents can become very hard to
recall.
 
In conclusion, I would like to assert that a fully indexed
full-text database is an essential part of our information
service.  Treasury staff like the facility of having all types of
information available in the one database.  It is a mixture of
full-text Treasury generated material and bibliographic external
material.
 
The project we are reviewing for 1991/92 is imaging.  We are
experimenting with scanning documents into the database and
perhaps at next year's conference I can share with you the
successes and pitfalls of scanning and imaging.  I would like to
reiterate here that all of the problems mentioned above are going
to be relevant with imaging.  When imaging becomes more
prevalent, indexing and the problems associated are going to be
just as important. If any of you have any questions I will be
happy to answer them if I can.
 
			     Sources
 
Basch, Reva.  `The seven deadly sins of full-text searching'.
     _Database_. Aug 1989.  pp. 15-23.
 
Blair, David C. and Maron, M. E.  `Full-text information
     retrieval: Further analysis and clarification'.
     _Information Processing and Management_.  V.26, no. 3, 1990.
     pp. 437-447.
 
Clark, John M.  `Using image scanners to create and access
     electronically stored documents'.  _Records Management
     Quarterly_.  V.25, no.3, July 1991.   pp. 9-10, 12-13.
 
Locke, Christopher.  `The dark side of DIP : Do you know where
     your documents are? : Do you know what's in them?'.  _Byte_.
     April 1991.  pp. 193-204.
 
Ojala, Marydee.  `Research into full-text retrieval'.  _Database_.
     Aug 1990.  pp. 78-80.
 
Tenopir, Carol.  `Full-text database retrieval performance'.
     _Online Review_.  V.9, no. 2, 1985.  pp. 149-165.