Key Words and Phrases

Key Words and Phrases – The Key to Scholarly Visibility and Efficiency in an Information Explosion

Edward E. Gbur, Jr. and Bruce E. Trumbo

 

From The American Statistician, February 1995, Vol. 49, pages 29-33, with non-substantive modifications in format, typography, punctuation, etc. for more effective display on the web.


ABSTRACT: The statistical literature is growing at an increasing rate and has become diffused among more journals and publishers. Most libraries now have neither the funds nor the space to provide anything approaching complete coverage. As a result, access to the literature by statisticians is increasingly mediated by computer-assisted bibliographic searches.

Authors can do much to help bibliographers provide relevant and complete information about their papers, including the selection of appropriate key words and phrases (KWs). Complete bibliographic information can also help prevent duplication of results, and thus wasted time and journal space. Guidelines for the selection of optimal KWs are provided and discussed.

KEY WORDS: Acknowledgment of priority, Bibliography, Computer-readable databases, Duplicate publication, Electronic publication, Information management, Information retrieval

AUTHOR INFORMATION, printed with the original paper, correct as of 1995:

Edward E. Gbur, Jr., is Associate Professor, Agricultural Statistics Laboratory, University of Arkansas, Fayetteville, AR 72701-1201. He is presently editor of the Current Index to Statistics.

Bruce E. Trumbo is Professor of Statistics and Mathematics, California State University, Hayward, CA 94542. He was the first editor of the Current Index to Statistics Extended Database and is presently a member of the Committee on Publications of the American Statistical Association.


Navigation by Section:

1. Rapid Growth of Statistical Literature

2. Statistical Bibliographies and Information Retrieval

3. Advice to Authors: Strategies for Visibility

4. Preventing Publication of Duplicate Results

5. Selecting Key Words and Phrases

6. Conclusions

General References

References for Section 4


1. The Rapid Growth of the Statistical Literature

Several decades ago, it was feasible for a statistician to keep informed of developments in his or her areas of specialization by subscribing to several of the most important statistics journals and periodically browsing several more journals and recently received monographs in a local library. As the number of papers, journals, and publishers has increased, it has now become a daunting or impossible task even to scan the tables of contents of the relevant publications. For many years scientists have been commenting on the bewildering growth of the scientific literature in general (e.g., Price 1956) and the mathematical sciences in particular (e.g., May 1968, 1971).

A decade ago, the situation throughout science was such that Naisbitt (1982, p. 24) complained:

The level of information is clearly impossible to handle by present means. Uncontrolled and unorganized information is no longer a resource in an information society. Instead, it becomes the enemy of the information worker. Scientists who are overwhelmed with technical data complain of information pollution and charge that it takes less time to do an experiment than to find out whether or not it has already been done.

The statistical literature has not escaped this phenomenal growth pattern. In 1993, Current Index to Statistics (CIS) included articles from approximately 125 statistics journals and 300 nonstatistics journals. Of the 100 journals currently indexed in their entirety in CIS, 41 did not exist in 1975, the year in which CIS was begun. Since 1990, 12 new statistics journals have appeared, and at least two more are set to begin publication in the near future. Not only has the number of journals increased, but established journals have also grown larger. For example:

  • In 1975, the Journal of the American Statistical Association contained 980 pages; in 1993, it contained 1,488 pages–a 51.8% increase in the number of pages;
  • The Annals of Statistics has seen a 58.1% increase in the number of pages during this period, while
  • The Annals of Probability increased 113.4% despite the creation of the Annals of Applied Probability by the Institute of Mathematical Statistics.
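The percentage quoted for the Journal of the American Statistical Association can be checked directly from the two page counts given above. The following is a minimal sketch; the function name `percent_increase` is ours, and only the JASA figures are used because the text does not give page counts for the other two journals.

```python
def percent_increase(old: float, new: float) -> float:
    """Return the percentage increase from `old` to `new`."""
    return (new - old) / old * 100.0


# Page counts quoted in the text for JASA.
jasa_1975 = 980
jasa_1993 = 1488

jasa_growth = round(percent_increase(jasa_1975, jasa_1993), 1)
print(f"JASA page growth, 1975 to 1993: {jasa_growth}%")
```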

2. Statistical Bibliographies and Information Retrieval

In response to the retrieval problems posed by a growing scientific literature, bibliographic publications have been established in most scientific fields. In statistics, the “Tukey Indexes” were among the first (Tukey 1973; Ross and Tukey 1973, 1975), providing several decades of coverage up through the mid-1960s. By the early 1970s, many scholars in the statistical sciences began to give serious consideration to bibliographic issues (see, for example, Lancaster 1970, and Gani 1970). As part of this bibliographic awakening, by the mid-1970s most statistics journals had begun to print abstracts and keywords with each article: crucial, readily available data for information retrieval that they had almost never provided ten years earlier.

The Current Index to Statistics was founded (Joiner 1976) as a joint project of the American Statistical Association and the Institute of Mathematical Statistics to provide bibliographic coverage of the field of “statistics,” defined in a broad sense. CIS and Statistical Theory and Method Abstracts, published by the International Statistical Institute, have been the primary printed references devoted exclusively to statistics. Several other bibliographies of mathematics, physical science, social science, and medicine have provided partial coverage of the statistical literature.

More recently, CIS began publication of a computer-searchable database (Trumbo 1990). Early versions of the database were taken exclusively from the computer files used to produce CIS volumes. The scope has since been expanded to include pre-1975 information on selected journals, partially closing the gap between the Tukey Indexes and the beginning of CIS. To accentuate these and other differences, it has been renamed the Current Index to Statistics Extended Database (CIS/ED).

Computer-searchable databases make possible fast, flexible, automated literature searches. A researcher can use a computer database such as CIS/ED to generate a tailor-made bibliography on any desired topic. The success of such an effort depends, however, upon the completeness of the database and the researcher’s ingenuity in knowing what “search strings” to use. Information retrieval problems have plagued researchers regardless of the medium (May 1971) and will likely continue to do so in the future.

A database that relied only on titles and authors would be severely limited because titles, at best, are rarely able to give a complete description of important topics covered and, at worst, may be deliberately vague or “cute.” Appropriately chosen key words and phrases (KWs) can be included in a bibliographic record to give a more complete description of the article or book being indexed, thus improving the precision of the retrieval process.
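To make the point concrete, here is a minimal sketch of a toy bibliographic file searched first by title alone and then by title plus KWs. Everything in it is an assumption for illustration: the `Record` class, the `matches` function, the KWs attached to each record, and the third title are hypothetical, and the sketch does not describe how CIS/ED itself stores or searches records.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    title: str
    keywords: list[str] = field(default_factory=list)


def matches(record: Record, term: str, use_keywords: bool) -> bool:
    """Case-insensitive substring match against the title and, optionally, the KWs."""
    term = term.lower()
    haystack = [record.title]
    if use_keywords:
        haystack += record.keywords
    return any(term in text.lower() for text in haystack)


# Hypothetical records: the KWs are invented for this illustration.
records = [
    Record("A note on a theorem of Smith", ["Weibull distribution", "Variance estimation"]),
    Record("New uses for familiar statistics", ["Empirical Bayes", "Tumor growth"]),
    Record("Estimating lifetimes under a Weibull model", ["Reliability"]),
]

title_only = [r.title for r in records if matches(r, "weibull", use_keywords=False)]
with_kws = [r.title for r in records if matches(r, "weibull", use_keywords=True)]

print("Title search:     ", title_only)
print("Title + KW search:", with_kws)
```

In this toy file the title-only search misses the first record, whose vague title gives no hint of its subject; the KWs are what make it retrievable.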

In the future, computer databases for statistics may provide searchable text of abstracts or of entire articles. As the number of records and the information associated with each record increases, the unwary user may run just as great a risk of being overwhelmed with unwanted retrieved information as of failing to retrieve a complete collection of desired items.

Whatever the problems of compiling databases may be, and however steep the learning curve for their artful use, it seems clear that bibliographic searches have now replaced visual browsing as the method of choice for retrieving information on developments in the sciences (Abelson 1989).

In the face of this information explosion, even the largest research libraries now find that they have neither the space nor the funds to archive all relevant journals. One response has been to form consortia of libraries so that articles in known locations can be retrieved within a few hours or overnight from the collection of some consortium member. Availability is maintained, but browsing of actual journals is not possible. While there are now bibliographic services that contain thousands of journal tables of contents and, in some cases, abstracts, their scope is so broad that very few statistics journals are included, and those that are tend to be widely available already.

In the future, it is likely that many journals will be archived electronically. Refereed journals already exist in many disciplines only in electronic format, never having been distributed in print at all. The first such journal in statistics was the Journal of Statistics Education, which began publication in 1993. Solomon, Arnold, Trumbo, and Velleman (1994) specifically discussed electronic publications in statistics.

3. Advice to Authors: Strategies for Visibility

As an author, your main goals of publication are to disseminate information to the widest possible audience and to establish your claim to origination of the ideas presented. Traditionally, you might have tried to do this by submitting to a prestigious journal of wide circulation and perhaps hoping to have your paper selected to be the lead article for an issue.

The new realities of distribution and retrieval require a fundamental change in strategy: as an author, you should do everything possible to assist bibliographers in providing a complete and accurate record of the information included so that your paper can be retrieved by exactly the right audience. All serious bibliographic databases include titles, author names, and KWs. Some include abstracts and citation information from your reference list as well as more specialized information (e.g., Mathematical Reviews codes). As an author, you should attempt to use each of these record parts to its fullest possible advantage.

Title. The title of your article should be informative. Titles such as “A note on a theorem of Smith” and “New uses for familiar statistics” contain few, if any, words usable in a bibliographic search. Likewise, titles such as “On the use of least squares regression” contain statistical terminology that, by itself, would retrieve such an overwhelming number of records that it would render the search useless.

Key words. KWs should convey any important information not conveniently included in the title. Advice on the selection of KWs is discussed separately in Section 5.

Abstracts. Although some databases do not yet routinely include abstracts, abstracts already have some bibliographic uses. In some disciplines, abstracts are deemed important enough that authors are required to follow explicitly stated journal guidelines for their construction. While this is not the case in statistics, careful attention should be given to the preparation of abstracts.

Abstracts that merely repeat the title or reference information of another paper are essentially useless. Your abstract should give a clear but brief statement of the topic you discuss or the problem you solve, and a concise statement of your major results or conclusions. A brief account of the methods used can be helpful. Try to put the paper in perspective with respect to the rest of the literature. Often a colleague uninvolved in the work can help you to select the crucial ingredients of a good abstract.

Mathematical notation is generally not computer-searchable. You should try crossing out any mathematical notation and asking yourself what useful information remains.
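The crossing-out exercise can be roughly automated for an abstract drafted in LaTeX. This is only a sketch under stated assumptions: it treats anything between a pair of dollar signs as mathematical notation, the function name `strip_math` is ours, and the sample sentence is invented.

```python
import re


def strip_math(abstract: str) -> str:
    """Remove $...$ segments, then collapse the remaining whitespace."""
    no_math = re.sub(r"\$[^$]*\$", " ", abstract)
    return re.sub(r"\s+", " ", no_math).strip()


# Hypothetical abstract sentence, written with LaTeX inline math.
abstract = (
    r"We show that $\hat{\theta}_n$ is consistent when "
    r"$n \to \infty$ under mild conditions."
)

searchable = strip_math(abstract)
print(searchable)
```

What survives is the text a word-based search could actually match; if it no longer says what is being estimated or why, the abstract probably leans too heavily on notation.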

Professional name. Because some retrieval is naturally done using authors’ names, especially as a reputation or association with a particular topic is established, you should decide as early as possible in your career on the exact form of your professional name and then use that name without variation throughout your career. Regardless of how uncommon your name may be, you should consistently use one or more given names, perhaps together with initials, chosen to minimize potential confusion with other authors. Once the form of your professional name is established, you should avoid abbreviating your chosen given names with initials and should consider not changing professional names because of changes in marital status.

4. Preventing Publication of Duplicate Results

The quote above from Naisbitt (1982) could apply to reproving theorems and reinventing methodology as well as to redoing experiments. Of course, a problem arises when the duplicated results are submitted for publication. The problem of publication of duplicate results is certainly not new. May (1968) studied the mathematics literature prior to 1920 on determinants and concluded that of the almost 2,000 papers, 21% were duplications and that 43% were trivial modifications of previous results “easier to deduce than to look up, even if within reach.”

Have duplicate publications become more common in statistics as a result of the information explosion? A search of CIS/ED shows that the number of acknowledgments of priority has been essentially constant since 1975.

  • This surprising result may be due partially to inconsistent handling of correction and acknowledgment notes over time by CIS editors and changes in the breadth of coverage of some peripheral and unrefereed journals.
  • While preparing pre-1975 material for CIS/ED, one of us (Trumbo) has noticed that errata and acknowledgments of priority in those earlier years sometimes dealt with typographical errors and duplications that editors of most journals would not bother mentioning today.
  • Finally, the correction rate for recent papers is undoubtedly lower owing to the time lag, sometimes a long one, necessary to discover the duplication. For example, a recent issue of The Annals of Statistics carries an acknowledgment of priority by Nabeya and Tanaka (1994) for Nabeya and Tanaka (1988), which included results previously published in several papers from the period 1969 to 1975.

We know of no enlightening way to discuss the issue of repeat publications without referring to specific instances. We stress that we are sympathizing with, rather than criticizing, the authors involved. We have listed the references for Section 4 separately because they deal with illustrations from various fields, not with the general subject matter of this article. Citing a recent instance in which a search of CIS/ED might have prevented a duplicate publication in The American Statistician, Olkin (1993) suggested that systematic computer-aided searches ought to become obligatory. A few journals (e.g., Communications in Statistics) already include such admonitions in their instructions to authors. Search requirements are an excellent suggestion that would benefit both authors and publishers. However, CIS would not have helped Nabeya and Tanaka because the original versions of the results in question appeared before CIS was established. Moreover, in 1988 there was no CIS database with pre-1975 material, and coverage in the Tukey Indexes ended in the mid-1960s.

Bibliographic searches can also miss their mark when they need to extend across boundaries between subdisciplines. In those cases especially, differing terminologies or a failure to understand the far-reaching consequences of a theoretical result can obscure a target item. We give two illustrations.

  • Simpkin and Downham (1989) acknowledged the priority of some results of Worsley (1983) appearing in Simpkin and Downham (1988), but they observed that the duplication occurred “Because the title and key words [in Worsley (1983)] do not (in our view) reflect the breadth of the contents….” In retrospect, we doubt that Worsley could reasonably have been expected to have included KWs in his paper that would have helped Simpkin and Downham to find it.
  • More recently, Cormack (1993) acknowledged results of Hook and Regal (1982, 1992) and Regal and Hook (1984, 1991) on estimation of population sizes, which he repeated in Cormack (1992), commenting that “…my only, and feeble, excuse is that their earlier papers are not indexed under either capture-recapture or mark-recapture in Current Index to Statistics.” He further laments:

Biometrics used to be seen as the natural home for all papers concerned with such a general methodology [as mark-recapture], with a parallel paper describing the details of the application in a particular scientific area. It is a matter of regret that it no longer fulfills this function.

In following up this case, Gbur noted in internal CIS correspondence, “I doubt that I would have put capture-recapture as a keyword [to Regal and Hook 1991], even in hindsight,” and Trumbo added that KW to some relevant records in the 1993 edition of CIS/ED, if only to close the barn door after the horse had gotten out. We note, with the 20-20 vision of hindsight, that Cormack would have found two of the earlier papers if he had done a computer search for the intersection of “closed,” “population,” and “size.”
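An intersection search of this kind is easy to sketch. The example below runs one over the titles of the four earlier papers exactly as they appear in the references for Section 4. It is an illustration only: the text does not say which record fields the CIS/ED search would have covered, so searching title words alone is our assumption, as are the names `words` and `search_all`.

```python
import re

# Titles as listed in the references for Section 4.
titles = {
    "Hook and Regal (1982)": (
        "Validity of Bernoulli Census, Truncated Binomial and Log-Linear Methods "
        "for Corrections for Underestimates in Prevalence Studies"
    ),
    "Regal and Hook (1984)": (
        "Goodness-of-Fit Based Confidence Intervals: For Estimates of the Size "
        "of a Closed Population"
    ),
    "Regal and Hook (1991)": (
        "The Effects of Model Selection on Confidence Intervals for the Size "
        "of the Closed Population"
    ),
    "Hook and Regal (1992)": (
        "The Value of Capture-Recapture Methods Even for Apparent Exhaustive Surveys"
    ),
}


def words(text: str) -> set[str]:
    """Split text into a set of lowercase alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def search_all(terms: list[str], records: dict[str, str]) -> list[str]:
    """Return the keys of records whose title contains every search term."""
    wanted = {t.lower() for t in terms}
    return sorted(key for key, title in records.items() if wanted <= words(title))


hits = search_all(["closed", "population", "size"], titles)
print(hits)
```

Under these assumptions the search retrieves the two Regal and Hook papers and misses the other two, whose titles share none of the three terms.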

These are difficult cases, meant to illustrate that no bibliographic system can lead routinely to retrieval of all articles that might be of interest. They also serve to illustrate that bibliographers who are not, and cannot be expected to be, experts in all fields should not be expected to fill in all KW connections left unfilled by authors. On a more positive note, we hope they also illustrate the importance of Item 10 in the list provided in the next section.

5. Suggestions for Authors: Selecting Key Words and Phrases

The main purpose of KWs is to complement title and author information to help interested readers retrieve your article from a printed or computer-searchable bibliography. Try to consider all types of potential readers and how they might look for papers such as yours. In addition, keep in mind that KWs, even if not used in a particular search, may help a reader to identify your article as being of interest among the many retrieved using other search words; that is, taken together, your title and KWs may be viewed as a mini-abstract of your paper.

Based on our recent intensive experience with CIS and CIS/ED, we offer the following ten suggestions on the selection of KWs by authors. We believe they will increase the chances that your KWs will be used by bibliographers and will be useful to your intended audience.

  1. Use simple, specific noun clauses. For example, use variance estimation, not estimate the variance. Generalized exponential used in epidemiology is better split into generalized exponential distribution and epidemiology.
  2. Some potential KWs are too common to be useful. For example, the use of regression as a KW without modification will retrieve a dauntingly long list of papers (602 records in the 1993 CIS printed volume file). Statistic unmodified is next to useless (932 records in the same single-year file).
  3. It is not necessary to repeat information already provided in the title. For example, Weibull distribution might be a good KW, but not if the title is “On the estimation of lifetimes when the underlying distribution is Weibull.”
  4. Try to avoid unnecessary prepositions, especially of and in, except in standard phrases. For example, use data quality rather than quality of data; use reliability rather than theory of reliability.
  5. Avoid acronyms. Acronyms such as ANOVA, ARIMA, and SPRT are widely recognized among statisticians, but an acronym may fall out of favor over the years and may be unknown to those in related fields or puzzling to beginners.
  6. Spell out Greek letters and try to avoid mathematical symbols. There may be a very few ideas so well known by their mathematical notation as to be exceptions. In these few cases, provide an alternative nonsymbolic KW. There are many ways that mathematical notation can be encoded in a database, making computer searches on mathematical notation impractical.
  7. Include names of people as part of KWs only if they are truly part of established terminology; for example, Pitman efficiency, Weibull distribution, or James-Stein estimator. Avoid making a Smith-Chen-Rao estimator into a KW if this just refers to something from a paper in your reference section. Do not refer to a result of your own in this way unless the terminology is truly well-established; the article is already retrievable by your name as an author.
  8. Where applicable, include crucial mathematical or computer techniques such as generating function, Chebyshev inequality, or Monte Carlo used to derive results, and a statistical philosophy or approach such as maximum likelihood, fiducial inference, or empirical Bayes.
  9. Where appropriate, note areas of applications such as tumor growth or labor force and special attributes of the article such as a real-world dataset that illustrates the present method and might be used to illustrate others, a table that is of value beyond the exposition of the article, or a useful computer algorithm provided in the article.
  10. Be especially alert to include alternative or inclusive terminology. If a concept is, or has been, known by several terminologies, include any KW that might help a user conducting a search across a span of time or from outside your subspecialty. For example, the statistician’s characteristic function is the mathematician’s Fourier transform. Another well-known example is the Taguchi jargon, much of which can be translated into standard statistical terminology.

This last guideline is perhaps the most difficult and the most important of all. Its implementation requires a wider view and often benefits from consultation with knowledgeable colleagues. While alternative established terminologies are useful, KWs should never be used to try to introduce new terminology.
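Several of the more mechanical suggestions (2 through 5) lend themselves to a rough automated check before submission. The following is a minimal sketch, not a tool used by CIS: the function `check_keyword`, the two small word lists, and the acronym pattern are all our assumptions, and the candidate KWs are taken from the examples in the list above.

```python
import re

# Hypothetical word lists; the "too common" entries are the examples from suggestion 2.
TOO_COMMON = {"regression", "statistic"}
PREPOSITIONS = {"of", "in"}


def check_keyword(kw: str, title: str) -> list[str]:
    """Return warnings for one candidate key word or phrase."""
    warnings = []
    tokens = kw.split()
    kw_words = set(re.findall(r"[a-z]+", kw.lower()))
    title_words = set(re.findall(r"[a-z]+", title.lower()))

    if kw.lower() in TOO_COMMON:
        warnings.append("too common without modification (suggestion 2)")
    if kw_words and kw_words <= title_words:
        warnings.append("every word already appears in the title (suggestion 3)")
    if any(t.lower() in PREPOSITIONS for t in tokens):
        warnings.append("contains 'of' or 'in' (suggestion 4)")
    if any(re.fullmatch(r"[A-Z]{2,}", t) for t in tokens):
        warnings.append("looks like an acronym (suggestion 5)")
    return warnings


title = "On the estimation of lifetimes when the underlying distribution is Weibull"
candidates = ["Weibull distribution", "quality of data", "ANOVA", "regression", "data quality"]

report = {kw: check_keyword(kw, title) for kw in candidates}
for kw, warnings in report.items():
    print(f"{kw}: {warnings or 'no warnings'}")
```

A check like this can only flag candidates for a second look. It cannot recognize the standard phrases that suggestion 4 exempts, and it offers no help with suggestion 10, which depends on knowing the terminology of neighboring fields.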

6. Conclusions

Even though the statistical literature is growing at an increasing rate, modern bibliographic resources provide important assistance in keeping up to date in a subfield and in retrieving articles on a specific topic. By using standard forms of their names, selecting titles carefully, and especially, by making wise choices of key words and phrases, authors can help bibliographers to make their publications more easily retrievable from bibliographic databases and thus potentially increase the scholarly impact of their work.


GENERAL REFERENCES

Abelson, P. H. (1989), “Retrieval of Scientific and Technical Data,” Science, 245, 9.

Gani, J. (1970), “On Coping with New Information in Probability and Statistics,” Journal of the Royal Statistical Society, Ser. A, 133, 422-450.

Joiner, B. (ed.) (1976), Current Index to Statistics: Applications, Methods, and Theory (Vol. 1, 1975), published by Management Committee of the Current Index to Statistics, distributed by American Statistical Association, Alexandria, VA, and by Institute of Mathematical Statistics, Hayward, CA.

Lancaster, H. O. (1970), “Problems in the Bibliography of Statistics,” Journal of the Royal Statistical Society, Ser. A, 133, 409-421.

May, K. O. (1968), “Growth and Quality of the Mathematical Literature,” Isis, 59, 363-371.

May, K. O. (1971), “Problems of Information Retrieval in Mathematics,” in Proceedings of the 25th Summer Meeting of the Canadian Mathematical Congress, Lakehead University, Thunder Bay, Ontario.

Naisbitt, J. (1982), Megatrends. Ten New Directions Transforming Our Lives, New York: Warner Books.

Olkin, I. (1993), Letter to the editor of The American Statistician, unpublished.

Price, D. J. (1956), “The Exponential Curve of Science,” Discovery, 17, 240-243.

Ross, I. C., and Tukey, J. W. (1973), Index to Statistics and Probability: Locations and Authors, Los Altos, CA: R&D Press. (Now distributed by American Mathematical Society, Providence, RI.)

Ross, I. C., and Tukey, J. W. (1975), Index to Statistics and Probability: Permuted Titles, Los Altos, CA: R&D Press. (Now distributed by American Mathematical Society, Providence, RI.)

Solomon, D. L., Arnold, J. T., Trumbo, B. E., and Velleman, P. F. (1994), “Electronic Publications in Statistics–Ready or Not, Here They Come,” The American Statistician, 48, 191-196.

Trumbo, B. E. (ed.) (1990), Current Index to Statistics Cumulative Database (1980-1989) (computer-readable bibliography of the statistical literature, 12 MBytes), published by Management Committee of Current Index to Statistics, distributed by Institute of Mathematical Statistics, Hayward, CA.

Tukey, J. W. (1973), Index to Statistics and Probability: Citation Index, Los Altos, CA: R&D Press. (Now distributed by American Mathematical Society, Providence, RI.)

REFERENCES FOR SECTION 4

Cormack, R. (1992), “Interval Estimation for Mark-Recapture Studies of Closed Populations,” Biometrics, 48, 567-576.

Cormack, R. (1993), “Acknowledgment of Priority,” Biometrics, 49, 315.

Hook, E. B., and Regal, R. R. (1982), “Validity of Bernoulli Census, Truncated Binomial and Log-Linear Methods for Corrections for Underestimates in Prevalence Studies,” American Journal of Epidemiology, 116, 168-172.

Hook, E. B., and Regal, R. R. (1992), “The Value of Capture-Recapture Methods Even for Apparent Exhaustive Surveys,” American Journal of Epidemiology, 135, 1060-1067.

Nabeya, S., and Tanaka, K. (1988), “Asymptotic Theory of a Test for the Constancy of Regression Coefficients Against the Random Walk Alternative,” Annals of Statistics, 16, 218-235.

Nabeya, S., and Tanaka, K. (1994), “Acknowledgment of Priority,” Annals of Statistics, 22, 563.

Regal, R. R., and Hook, E. B. (1984), “Goodness-of-Fit Based Confidence Intervals: For Estimates of the Size of a Closed Population,” Statistics in Medicine, 3, 287-291.

Regal, R. R., and Hook, E. B. (1991), “The Effects of Model Selection on Confidence Intervals for the Size of the Closed Population,” Statistics in Medicine, 10, 717-721.

Simpkin, J. M., and Downham, D. Y. (1988), “Testing for a Change-Point in Registry Data with an Example on Hypospadias,” Statistics in Medicine, 7, 387-393.

Simpkin, J. M., and Downham, D. Y. (1989), “Acknowledgment of Priority,” Statistics in Medicine, 8, 1414-1416.

Worsley, K. J. (1983), “The Power of Likelihood Ratio and Cumulative Sum Tests for a Change in a Binomial Probability,” Biometrika, 70, 455-460.