Introduction

eDiscovery practice keeps changing, and organizations must continually adapt their strategies and processes to keep their methods defensible. This article describes strategies an organization can employ to conduct a defensible eDiscovery analysis, so that the analysis is practical, timely, and, above all else, defensible.

Why Defensible Strategies Matter

Our clients often tell us about their experiences gathering selected electronic data in response to production requests. One client, under federal investigation, asked its IT department to search more than 50 sites for files containing the phrase “price increase.” Formulated by the client’s attorneys, the request was passed down to the IT department, which searched for exact matches on “price increase” without considering “increased prices,” “rate increases,” and several other synonymous equivalents. As a result, the search missed many vital documents. Ultimately, the investigation did not satisfy federal regulators and had to be reformulated and redone, wasting time, effort, and money.


In another instance, executives at a construction firm were asked to search for emails about a particular project and then copy the responsive hits to a folder for later production. And in a third case, the legal team performed searches directly in Outlook, moving emails to folders named for production categories. These folder names subsequently appeared in the footers of some Microsoft Office documents, which would have revealed work product had the files been produced that way. Are these methods safe, defensible, or repeatable? Almost certainly not—yet, they are being routinely employed.


In today’s litigation environment, failing to produce relevant electronic documents can be detrimental to one’s case and often very costly. At the same time, high volumes of data and the complexity of data review make the task of reviewing for relevance, responsiveness, and privilege daunting, not to mention expensive. One effective way to reduce the volume of data to be reviewed is to employ defensible filtering methodologies to remove non-discoverable data from the review set.


This paper describes relatively straightforward ways to plan, develop, and apply filter terms to electronic documents, emails, their attachments, and other data files to isolate groups of potentially relevant and privileged documents. The suggested methodology is appropriate for collections ranging from a few thousand to hundreds of thousands, or even millions, of emails, attachments, and electronic files. Newly emerging and highly sophisticated technologies that apply concept searching, clustering, and natural language analysis, among other techniques, are outside the scope of this paper, but they should be investigated when dealing with extensive collections of files. While far more costly than traditional Boolean-based filter methodologies, these technologies can be very effective, and even cost-effective, in filtering large data populations for review and production.

Electronic Filtering To Reduce Review Volume

In the days when discovery consisted of reviewing only paper documents, the legal team gathered those documents by first identifying file folders that were likely to contain relevant or responsive documents and then checking the documents contained in each file. Those days are over.

Even though the world’s most popular operating system, Microsoft’s Windows, utilizes “folders” to store data on hard disk drives, searching such “folders” on a file-by-file basis is not practical due to the sheer volume of data to be reviewed. In addition, most of the highly relevant data has been scattered to the four winds by email. Therefore, a more intelligent approach is needed to permit the review and production of large volumes of electronic documents efficiently and defensibly.

Before any defensible search strategy can be employed, developing and executing a competent collection strategy targeting specific locations and data types is necessary. Under-collection of data may miss substantial evidence. Conversely, the over-collection of data (which is usually the case) creates excessive burdens and costs. Therefore, it is essential to determine who is most likely to have potentially relevant data, what file types comprise such data, and where that data is stored. However, filtering by search terms should not be part of the collection strategy for several reasons, not least because the set of words might change throughout discovery. Filtering must only be done on a data collection that contains or is likely to contain all discoverable or potentially relevant electronic documents. While the methodologies involved in properly collecting electronic files are outside this paper’s scope, they are worth further attention and study.

We will assume that a proper and thorough data collection has been completed and the universe of data is of sufficient volume to require reduction. The first step in defensible filtering is to filter out duplicate email groups and files (“deduping”). Filtering by date for the applicable time frame and filtering by agreed-upon file types are defensible actions if they are performed by a reliable electronic evidence (“ediscovery”) vendor that understands date-related metadata issues and uses a method more sophisticated than simply identifying file extensions to determine file types.
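The deduplication step can be illustrated with a minimal sketch that treats two files as duplicates when their contents hash to the same value. This is an assumption made for illustration: vendor tools often hash normalized email fields and keep emails together with their attachments, which this sketch does not attempt. The function name `dedupe_by_hash` and the sample document IDs are hypothetical.

```python
import hashlib

def dedupe_by_hash(documents):
    """Keep the first document seen for each distinct SHA-256 content hash.

    `documents` maps a (hypothetical) document ID to its raw bytes.
    Returns (unique_ids, duplicates), where `duplicates` maps each dropped
    ID to the ID of the copy that was kept.
    """
    seen = {}          # content hash -> ID of the first document with that hash
    unique_ids = []
    duplicates = {}
    for doc_id, content in documents.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            duplicates[doc_id] = seen[digest]
        else:
            seen[digest] = doc_id
            unique_ids.append(doc_id)
    return unique_ids, duplicates

# Illustrative data only: two custodians hold identical copies of one file.
sample = {
    "custodianA/budget.xls": b"Q3 price increase plan",
    "custodianB/budget-copy.xls": b"Q3 price increase plan",
    "custodianB/notes.txt": b"meeting notes",
}
unique_ids, duplicates = dedupe_by_hash(sample)
print(unique_ids)
print(duplicates)
```

Recording which copy was kept for each dropped duplicate, rather than silently discarding it, keeps the step explainable if it is later questioned.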

Finally, filtering by a set of well-developed search terms and phrases, the subject of this paper, can dramatically reduce the time it takes to review an extensive collection of electronic documents by limiting the volume of data to be reviewed.

Given an adequately gathered collection of emails and files, we recommend the following steps to develop a defensible filter term reduction methodology. Each step will be explained in the following sections.

  1. Find out what kinds of tools will be available for the filtering, i.e., can you specify wildcards, phrases, one or more words in a defined proximity to another term, etc.? How many filter terms may be used simultaneously?
  2. Develop two sets of initial search terms: a) the first derived from the issues in the case or a production request—the relevance filter; and b) the second derived from terms and names that might indicate privileged status—the privilege filter.
  3. Consider asking your ediscovery vendor for a project dictionary, a list of all terms found in the data set or a sample, and the frequency at which each term appears. This may not be a trivial undertaking, but it may be cost-effective for the information it provides.
  4. Working with a qualified consultant, your ediscovery vendor, and/or legal team members, review the list vis-à-vis sample documents and/or a project dictionary to expand or limit your list.
  5. Perform a test filter run on a representative sample. It may be possible to test the filter set on a subset of the emails and files to see whether any terms yield unexpectedly high or low results. The terms can then be edited based on the results.
  6. Review sample documents from those portions of the data collection where no hits were found to ensure that no responsive data exists in those collections. Revise and retest the search terms to maximize results.
  7. Negotiate a sign-off agreement with opposing counsel in which all parties agree on the search terms for filtering.
  8. Document each step of the filtering process, preserving, in particular, the final set of filter terms as input into the filtering software and a copy of the filtering software in the version used to do the actual filtering.

Determine Available Filtering Tools

Understanding the tools that can be used for filtering will help you formulate your list. You may want to be sure that attachments to emails and the emails themselves will be subjected to filtering as well as all files with text content. If files or attachments have no text content, e.g., jpegs or tiffs, decide whether you want them to pass or fail the filter.

The kinds of search capabilities you may expect your ediscovery vendor to have include the ability to search for:

  • Words or strings of characters
  • Wildcarded words
  • Pairs of words (wildcarded or not) with proximity requirements, e.g., “quick” within two terms of “fox*” (where * indicates a wildcard)
  • Exact phrases or strings
  • Proximity sets, i.e., a set of words all within a certain proximity, e.g., “quick,” “fox,” “jumped,” “lazy,” and “dog,” all within 12 words of one another
  • “And” sets, e.g., the existence of two or more terms or words anywhere in a document
  • “Not” conditions, e.g., “Bell” but not within two words of “Lab*” to eliminate “Bell Labs” or “Bell Laboratories”
  • “Contains” searches, which are essential for finding email addresses that may not be delimited. Such a search term may also be expressed as a double wildcard, e.g., *mickeymouse@disney.com*.
  • Terms limited by date range or field. Your ediscovery vendor’s search capabilities may allow you to limit the searches for all or subsets of terms to specific date ranges or fields in an email.
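To make several of these capabilities concrete, the following is a minimal sketch of wildcard, proximity, “not,” and “contains” searches over plain extracted text. It is not how any particular vendor's tool works. The helper names are hypothetical, and the sketch assumes that “within N words” means the token positions differ by at most N.

```python
import re
from fnmatch import fnmatchcase

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def positions(tokens, pattern):
    """Indexes of the tokens matching a term; '*' in the term is a wildcard."""
    return [i for i, tok in enumerate(tokens) if fnmatchcase(tok, pattern)]

def within(tokens, term_a, term_b, distance):
    """True if a match of term_a lies within `distance` tokens of a match of term_b."""
    pos_b = positions(tokens, term_b)
    return any(abs(a - b) <= distance for a in positions(tokens, term_a) for b in pos_b)

def not_within(tokens, term, excluded, distance):
    """True if `term` occurs at least once with no `excluded` match within `distance` tokens."""
    pos_x = positions(tokens, excluded)
    return any(all(abs(p - x) > distance for x in pos_x) for p in positions(tokens, term))

def contains(text, fragment):
    """Case-insensitive substring test, useful for undelimited email addresses."""
    return fragment.lower() in text.lower()

fox_tokens = tokenize("The quick brown foxes jumped over the lazy dog.")
bell_tokens = tokenize("Call Bell Labs about the Bell invoice.")

print(within(fox_tokens, "quick", "fox*", 2))        # True: "quick" and "foxes" are two tokens apart
print(not_within(bell_tokens, "bell", "lab*", 2))    # True: the second "Bell" is not near "Labs"
print(not_within(tokenize("Bell Laboratories called."), "bell", "lab*", 2))  # False
print(contains("From:<MickeyMouse@disney.com>", "mickeymouse@disney.com"))   # True
```

Knowing exactly how a tool counts distance and treats punctuation matters when the terms are later written into a sign-off agreement, so it is worth asking the vendor for the precise definition.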

Develop Initial Relevance and Privilege Filter Terms

This first list can come from a production request or a brainstorming session of the legal team and one or more client representatives, and it should include terms of art, names, and email addresses. Perhaps the best source of search terms is key emails, letters, and documents, because such writings contain the terminology and jargon used by the key players to describe the critical issues in the case. Usually, such documents are readily available, particularly if they were part of the initial information exchange between attorney and client. Typically, specific terms on the list will suggest additional terms and synonyms that should be employed. It is essential to be sure that each concept or subject in the production request is covered in the relevance filter and that each name or term that might indicate privilege is in the privilege filter.

Developing the list of privilege filter terms may be challenging. If you merely want to find potentially privileged documents based on a few attorney names, you might search on last names alone when they are unusual, e.g., Kurbowsky, with a wildcard indicator to pick up the possessive. For a common name such as Smith, where the only Smith you want to filter on is Elizabeth Smith, Esq., it may not be safe to require both the first and last name, since a document may refer to her by last name alone. However, if you know of a Charles Smith in IT whose emails are not privileged, there may be an “and not” tool allowing you to exclude hits on Smith within one or two words of Charles.

Moreover, ensure you develop ways to search for advice rendered by an attorney without referencing that attorney’s name. Phrases such as “our counsel advised us . . .” or “based on legal advice . . .” are not uncommon in documents written by non-lawyer executives or managers who cite specific legal advice when issuing instructions or developing policy without concurrently mentioning the name of the lawyer, firm or department that provided the advice. Finally, don’t forget that there are all sorts of “privilege” beyond attorney-client privilege. Therefore, make sure you include in your privilege filter trade secrets and matters that are protected by constitutional or statutory rights of privacy.
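The privilege-filter ideas above can be sketched with regular expressions. This is an illustrative simplification, not a recommended term list. The pattern labels are hypothetical, the advice phrases are only the two quoted above, and the “Smith but not Charles” rule excludes a hit only when “Charles” immediately precedes “Smith” with a single space, which is narrower than a true proximity search.

```python
import re

# Hypothetical privilege patterns, keyed by a descriptive label.
PRIVILEGE_PATTERNS = {
    "unusual attorney surname, wildcarded": r"\bkurbowsky\w*",
    "Smith, unless directly preceded by Charles": r"\b(?<!charles )smith\b",
    "legal advice cited without naming counsel": r"\bcounsel advised\b|\blegal advice\b",
}

def privilege_hits(text):
    """Return the labels of all privilege patterns found in the text."""
    return [
        label
        for label, pattern in PRIVILEGE_PATTERNS.items()
        if re.search(pattern, text, flags=re.IGNORECASE)
    ]

print(privilege_hits("Per Kurbowsky's memo, hold the shipment."))
print(privilege_hits("Charles Smith reset the mail server."))
print(privilege_hits("Our counsel advised us to wait; ask Liz Smith."))
```

The second call returns an empty list because the only “Smith” is Charles Smith. The third is flagged twice, once for the surname and once for the advice phrase.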

Request a Project Dictionary

Your ediscovery vendor may be able to provide you with a project dictionary: a list of words and terms found in the extracted text of the target emails and files. If the collection is vast, ask for a dictionary built from a sample of key custodians’ collections. This list will show you variations of terms; an extract from a project dictionary might, for example, show all the variations of the word “cost.” Examining the project dictionary is one way to understand the vocabulary used in the target collection of electronic documents. The dictionary can be sorted alphabetically and by frequency of hits, which makes it more useful in developing further search terms.
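As a rough illustration of what a project dictionary contains, the sketch below counts term frequencies across extracted text and lists the variations of a given stem. The function names and sample sentences are hypothetical, and a vendor's dictionary would be built from the full extracted text of the collection or sample.

```python
import re
from collections import Counter

def build_project_dictionary(texts):
    """Count how often each term appears across the extracted texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts

def variations(dictionary, prefix):
    """Terms starting with `prefix`, most frequent first, then alphabetical."""
    matches = [(term, n) for term, n in dictionary.items() if term.startswith(prefix)]
    return sorted(matches, key=lambda item: (-item[1], item[0]))

# Illustrative extracted text only.
texts = [
    "Cost overruns raised costs again.",
    "The costing model ignores cost of freight.",
    "Costly delays.",
]
project_dictionary = build_project_dictionary(texts)
print(variations(project_dictionary, "cost"))
# [('cost', 2), ('costing', 1), ('costly', 1), ('costs', 1)]
```

A listing like this shows at a glance which variations a wildcard such as cost* would sweep in.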

Review of Initial Relevance and Privilege Filter Terms Lists

Your lists should be reviewed and edited by someone knowledgeable about the filtering techniques to be used, generally a consultant or someone at your vendor who is familiar with its tools. The reviewer should return the revised list with explanations of how the filtering or search will be done. If a project dictionary was generated, it may be helpful in the review, depending on the power of the search tools. For example, using a wildcard for the name “Cash” to pick up “Cash’s” is not wise if the dictionary reveals high frequencies of other terms that start with “cash.”

Test and Retest the Filter Run; Revise Filter Terms and Sample as Necessary

Your vendor should be able to provide statistics on how the set of filter terms performs. You may wish to run the filter on a subset of your population to see whether you are getting too many unwanted documents or too few responsive ones. This feedback strategy sets the standard for keyword or Boolean searching before review. Using a project dictionary and test results, an iterative search and review process will help the reviewer either develop more specific search terms and phrases, to limit frequently occurring hits to truly relevant documents, or broaden search terms and phrases across data populations where no hits are being returned. Finally, when filtering returns no or few hits, it is wise to review samples of document collections from key custodians that would otherwise be expected to contain responsive data. Looking outside the source of actual or expected results helps validate the search term lists.
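The kind of per-term statistics described here can be sketched as follows. It assumes a small sample of extracted text and wildcard terms matched against single words. The function names and the 40 percent threshold are hypothetical choices for illustration, not a standard.

```python
import re
from fnmatch import fnmatchcase

def term_hit_report(sample_docs, terms):
    """For each term, count how many sample documents contain a matching word."""
    report = {}
    for term in terms:
        hits = 0
        for text in sample_docs:
            tokens = re.findall(r"[a-z0-9]+", text.lower())
            if any(fnmatchcase(tok, term) for tok in tokens):
                hits += 1
        report[term] = hits
    return report

def flag_terms(report, sample_size, max_share):
    """Split out terms with no hits and terms hitting more than `max_share` of the sample."""
    no_hits = [term for term, n in report.items() if n == 0]
    too_broad = [term for term, n in report.items() if n / sample_size > max_share]
    return no_hits, too_broad

# Illustrative sample only.
sample_docs = [
    "Cash flow is tight this quarter.",
    "Ask Cash about the cashier's check.",
    "Price increases take effect in May.",
    "Lunch on Friday?",
]
report = term_hit_report(sample_docs, ["cash*", "increase*", "rebate*"])
no_hits, too_broad = flag_terms(report, len(sample_docs), max_share=0.4)
print(report)      # {'cash*': 2, 'increase*': 1, 'rebate*': 0}
print(no_hits)     # ['rebate*']
print(too_broad)   # ['cash*']
```

Terms flagged as too broad are candidates for narrowing, and terms with no hits are a prompt to sample the unfiltered documents before concluding that nothing responsive exists.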

Execute a Sign-off Agreement

A sign-off agreement is a wise step to ensure that the filtering strategy is defensible and will satisfy the court in case of a discovery dispute or challenge to the admissibility of evidence. Such an agreement should include a list of the proposed filter terms with unambiguous indications of how they will be used. Demonstrating the efficacy of the search terms with acceptable filter test results can strengthen your hand when negotiating a sign-off agreement. In addition, such an agreement should provide for the revision and re-approval of search terms if the initial search terms fail to find the great majority of relevant documents or return a large volume of irrelevant documents. By relevant documents we mean here those that are responsive and discoverable, that is, those that would be selected by an attorney responding to a production request.

Preserve Filter Terms and Methodology

A document that details clearly and unambiguously the final set of filter terms and how they are to be searched, e.g., wildcards, pairs, phrases, etc., should be preserved along with information about the software used for the filtering. Ask your ediscovery vendor to maintain a copy of the software used in the primary filter run, in the version used. Also ask that the set of terms, exactly as input into the filtering software, be preserved or delivered to you. If there is any question about why a document did not appear in the filtered set, the input filter terms and the software should show that the failure to include the document was systematic rather than intentional.
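One way to preserve the final terms in a verifiable form is to store them with the software name and version and a checksum of the term list. The sketch below is only an illustration. The record layout, function names, and the software name and version shown are hypothetical, and nothing here replaces keeping the vendor's actual input files.

```python
import hashlib
import json

def _terms_digest(terms):
    """SHA-256 of a canonical JSON encoding of the term list."""
    canonical = json.dumps(terms, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def build_filter_record(terms, software_name, software_version):
    """Bundle the exact filter inputs with a checksum of the term list."""
    return {
        "software": software_name,
        "version": software_version,
        "terms": list(terms),
        "terms_sha256": _terms_digest(list(terms)),
    }

def verify_filter_record(record):
    """True if the stored terms still match the stored checksum."""
    return _terms_digest(record["terms"]) == record["terms_sha256"]

# Hypothetical software name and version, for illustration only.
record = build_filter_record(
    ["price w/2 increas*", "rate w/2 increas*", "kurbowsky*"],
    software_name="ExampleFilterTool",
    software_version="1.0",
)
print(verify_filter_record(record))

# Any later change to the term list is detectable.
tampered = dict(record, terms=record["terms"] + ["rebate*"])
print(verify_filter_record(tampered))
```

Serializing such a record with `json.dumps` and storing it alongside the production gives a simple way to show later that the terms on file are the terms that were run.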

Conclusion

The discovery of electronically stored information (emails and attachments, spreadsheets, databases, and the like) is increasing exponentially and will soon become the norm in litigation. Unfortunately, such information is so voluminous and disorganized that a file-by-file review is no longer an effective or prudent means of searching for discoverable information. On the other hand, the content of most electronically stored data can be searched word-by-word, allowing such information to be filtered electronically by utilizing Boolean search terms and phrases, statistical analysis and clustering, and linguistic analysis, among other automated or nearly automated techniques. The use of filtering dramatically reduces the volume of electronic information that must be reviewed before production, leading to greater efficiencies in the discovery phase of litigation.

However, to take advantage of these efficiencies, filtering must be done thoughtfully, logically, thoroughly, and ultimately defensibly. The goal is to identify, through an automated search of the collected data that returns “hits” (a list of data files containing the responsive search terms or phrases), all of the electronic information that is i) potentially relevant to the issues in a given case, ii) potentially privileged, and/or iii) responsive to a given set of discovery requests. Overly simple search strategies will yield hits on large volumes of irrelevant information while, at the same time, missing potentially essential information. Moreover, failing to produce discoverable, non-privileged electronic information can result in costly and even case-crippling sanctions.

Assuming the proper foundation for data review has been laid by competent data collection, defensible filtering can be accomplished by employing the following steps:

  • Determine the search capabilities of the software that will be used to filter the data set.
  • Develop an initial set of search terms and phrases for both a relevance and privilege filter.
  • Develop a project dictionary.
  • Review the initial search terms with the legal team using the project dictionary and key documents to ensure comprehensiveness.
  • Test the filters against a representative sample of data to ensure that expected results are being returned. Revise filters and retest as necessary.
  • Sample review data from key custodians where the search returned no hits to ensure the filters are not missing essential data. Revise filters and retest as necessary.
  • Negotiate a sign-off agreement that i) memorializes the steps taken and search terms utilized in filtering and ii) permits downstream modification of the filtering process to deal with new issues or unforeseeable results.
  • Document each step taken in the filtering process, preserving with particularity the filter terms as input into the filtering software and the specific version of the filtering software used to search.

The defensible filtering techniques outlined above will dramatically reduce the time, expense, and frustration involved in reviewing all but the most significant volumes of electronically stored information before document production. Moreover, if the production of such information is challenged, these techniques should demonstrate to the judge, magistrate, or commissioner that the filtering was conducted reasonably, thus avoiding sanctions. Finally, the effectiveness of defensible filtering depends entirely on the training and experience of the personnel or service bureau employed and the quality of the technology used to conduct the filtering.

Endnotes

Coleman (Parent) Holdings, Inc. v. Morgan Stanley & Co., Inc., 2005 WL 679071 (Fla. Cir. Ct. Mar. 1, 2005). In a suit alleging a fraudulent stock sale, the plaintiff filed a motion for an adverse inference instruction against the defendant for destroying emails and failing to comply with a court order to compel email discovery. Throughout the discovery process, the defendant overwrote emails, failed to notify and timely process hundreds of DLT and 8mm tapes, and failed to produce emails and attachments. The court found the plaintiff did not receive relevant email due to the defendant’s discovery tactics and granted the motion for an adverse inference instruction noting “[t]he conclusion is inescapable that [the defendant] sought to thwart discovery.” The court ordered the defendant to pay the plaintiff’s motion costs. The court also noted the defendant “gave no thought to using an outside contractor to expedite the process of completing the discovery, though it had certified completion months earlier; it lacked the technological capacity to upload and search the data at that time, and would not attain that capacity for months.” See also Coleman (Parent) Holdings, Inc. v. Morgan Stanley & Co., Inc., 2005 WL 674885 (Fla. Cir. Ct. Mar. 23, 2005).
E*Trade Securities LLC v. Deutsche Bank AG, et al., No. 02-3711 RHK/AJB and No. 02-3682 RHK/AJB (D. Minn. Feb. 17, 2005). The plaintiffs, who alleged the defendants engaged in a fraudulent securities lending scheme, sought sanctions against the defendants for “convert[ing] the litigation process into a sport of dirty tricks and obfuscation” by allegedly destroying evidence, suppressing discoverable material, and failing to search for responsive documents. The magistrate found sanctions appropriate because the defendants acted in bad faith by permanently erasing all of the company’s hard drives in mid-2002, despite being notified of potential litigation in January 2002. The magistrate also found the defendants’ failure to preserve relevant phone calls recorded on DVDs warranted sanctions because they should have realized the recordings would be highly relevant and, as a result, should have halted their recycling policy. The magistrate recommended an adverse inference instruction and $10,000 in sanctions, declaring the “destruction of potentially relevant evidence … prejudiced the plaintiffs.”
Beck v. Atlantic Coast PLC, 2005 WL 352437 (Del. Ch. Feb. 11, 2005). The plaintiff brought a breach of warranty and fraud class action against the defendants. In the complaint, the plaintiff accused the defendants of employing deceptive marketing and sales techniques to sell software on the Internet. Contrary to his representation in the complaint, the plaintiff never actually used or paid for the product. However, a consultant evaluated the product and determined it did not do what it purported to do. Discovery revealed that the plaintiff, using the guise of an organization interested in purchasing the software, corresponded with the defendants via email and then posted the information he learned on his Internet Web page. During discovery, the plaintiff withheld much of the Web page’s content from the defendants. Meanwhile, the defendants discovered the entire content of the page, which the plaintiff did not disclose, upon performing an Internet search. The defendants moved for dismissal and sought attorney fees, stating the plaintiff and its counsel acted frivolously and in bad faith. Finding the plaintiff’s behavior “inexcusable,” the court dismissed the suit and ordered the plaintiff and his counsel to pay the defendants $25,000. The court stated, “[w]ithout ensuring that [the plaintiff] and his counsel bear appropriate responsibility for their inappropriate conduct by awarding a substantial, but fair, sanction of fees and costs against them, this court would do a disservice not only to [the defendants] and other litigants in this court, but also to those plaintiffs whose interests are well served by the availability of class action suits.”
Nartron Corp. v. General Motors Corp., 2005 WL 26991 (Mich. Ct. App. Jan. 6, 2005). In a case involving a breach of contract claim, the plaintiff appealed the trial court’s decision granting costs, sanctions for discovery abuse, and prejudgment interest in favor of the defendant. The trial court determined the plaintiff’s failure to produce a database in response to the defendant’s discovery request “tainted, corrupted, or permeated all of the discovery in the case.” The appellate court affirmed the trial court’s order in part and awarded the defendant attorney fees, costs, and expert witness fees.
Zubulake v. UBS Warburg LLC (“Zubulake V”), SDNY 02 CV 1234 (SAS) 7/20/04; 2004 US Dist. LEXIS (SDNY, July 20, 2004). On July 20, 2004, Judge Scheindlin imposed sanctions against UBS for destroying relevant email messages during the litigation. The court ordered UBS to pay Zubulake’s expenses and attorney fees incurred in pursuing the missing emails. In addition to the monetary sanctions, Judge Scheindlin also granted the plaintiff’s request for additional discovery and for a jury instruction permitting an adverse inference to be drawn from the missing evidence.
Invision Media Communications, Inc. v. Federal Insurance Co., No. 02 Civ. 5461 (NRB)(KNF), 2004 WL 396037 (S.D.N.Y). In an insurance suit stemming from business disruption caused by the 9/11 attacks, the plaintiff and the defendant filed cross-motions to compel discovery and for sanctions. Two of the many incidents alleged involved electronic discovery. In the first incident, plaintiff’s general counsel testified that as the company’s offices were closed and employees laid off, she directed that hard drives of those employees’ computers be “wiped.” The defendant requested sanctions for spoliation, which the court denied in the absence of any showing that the wiped hard drives would have rendered relevant evidence. In the second incident, the defendant requested emails from three months, around September 2001. The plaintiff initially responded that there were no responsive emails, as the policy had been to delete all emails after two weeks. However, the emails were eventually found and produced. The court found that a “reasonable inquiry by the plaintiff’s counsel before responding to Federal’s document request . . . would have alerted counsel that the plaintiff possessed electronic mail that fell within the scope of Federal’s document request.” The plaintiff was directed to pay costs and reasonable attorneys’ fees resulting from the additional discovery required.
Kucala Enters., Ltd. v. Auto Wax Co., No. 02 C 1403, 2003 WL 21230605 (ND Ill. May 27, 2003). In a patent infringement case, the defendant repeatedly requested documents from the plaintiff, including business records and correspondence from the plaintiff’s computer system. After three motions to compel production, the defendant was allowed access to the plaintiff’s computer to conduct an inspection. The computer forensics expert conducting the inspection discovered that the plaintiff had used commercially available disk-wiping software, “Evidence Eliminator,” to “clean” approximately 3,000 files three days before the inspection and another 12,000 on the night before the inspection between the hours of midnight and 4:00 a.m. The magistrate judge found that, based on the totality of the circumstances, the spoliation was intentional and recommended to the trial judge that the plaintiff’s case be dismissed with prejudice and that the plaintiff pay the defendant’s attorneys’ fees and costs from the time Evidence Eliminator was first used. On de novo review, the district court judge rejected the recommendation to dismiss the plaintiff’s case with prejudice, favoring adjudication of the claims and counterclaims, but upheld the recommendation that the plaintiff bear attorneys’ fees and costs. Kucala Enters., Ltd. v. Auto Wax Co., No. 02 C 1403, 2003 WL 22433095 (ND Ill. Oct. 27, 2003) (Rulings on Objections dated October 27, 2003).
Procter & Gamble Co. v. Haugen, 2003 WL 22080734 (D. Utah Aug. 19, 2003) (Order dated Aug. 19, 2003). Procter & Gamble (P&G) sued several independent distributors of rival Amway products, claiming unfair trade practices for allegedly distributing email associating P&G with Satanism. P&G immediately informed the defendants of their duty to preserve computer evidence crucial to the case, but neglected to impose a similar duty upon itself, destroying email records of five key P&G employees. Without citing Federal Rule of Civil Procedure 37, the court granted the defendant’s motion to dismiss the case on three grounds, each of which the court stated was sufficient alone to grant dismissal. The three grounds were i) the plaintiff failed to preserve evidence it knew was “critical” to the case, ii) the plaintiff’s actions rendered an adequate defense “basically impossible,” and iii) the plaintiff destroyed the very evidence it would need to support its proposed expert testimony on damages, rendering the testimony inadmissible on Daubert grounds. In a previous decision, the trial court sanctioned the plaintiff $10,000— $2,000 for each of the five key employees whose files had been destroyed. Procter & Gamble Co. v. Haugen, 179 FRD 622 (D. Utah 1998), rev’d on other grounds, 222 F.3d 1262 (10th Cir. 2000).
Metadata is data contained within a file that is about the content or form of the file or the application used to create or edit that file. Examples of metadata include the version of the processing application and the creation, modification, or access date/time stamps (not to be confused with the file system date). Metadata may also include data about how to format or display the content, e.g., fonts and sizes. It may also include identification of the registered user name of the computer system. Although metadata may be “hidden” during typical use, it is not to be confused with hidden data, including deletions, font changes, etc. in word processing files. Different types of metadata are available for different types of files.
While a rough idea of file types may be obtained by looking at extensions, file types are more accurately determined by examining the content of a file for signature strings and characters. Typically, electronic evidence collections are filtered by file types, i.e., limited to emails and their attachments, office-type files, specialty files, database files, etc. Certain file types are typically excluded from further processing and review, e.g., system files, specialty files, files that come with Windows, etc. Thus the accurate determination of file types is essential. For example, Unix has a standard process called Magic to determine file types. D-M Information Systems has developed its own tools to ensure that all files containing discoverable text are included in the set of electronic data subjected to defensible filtering rather than being discarded at the outset due to simple file-type filtering.
Why bother with more sophisticated methods of determining file types? To overcome user changes and chicanery. Users change file extensions or use non-mnemonic extensions for a variety of reasons. Two common reasons to change file extensions include: i) allowing executable files to be emailed to restricted systems and ii) hiding the true file type so that specific files are unlikely to be detected and opened. Such files are often of particular interest and should not be casually discarded. Thus, a more sophisticated method of file-type determination is necessary to ensure a completely defensible filtering process.
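A minimal sketch of signature-based file-type detection follows. It checks the leading bytes of a file against a handful of well-known signatures. The table is deliberately tiny and the function name `detect_type` is hypothetical; production tools recognize far more formats and also look inside container files.

```python
# A few well-known leading-byte signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip container (e.g., docx, xlsx)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (e.g., legacy doc, xls)"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"MZ", "windows executable"),
]

def detect_type(header):
    """Return a label for the first signature that the header bytes start with."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"

# A file named "notes.txt" whose content begins with an executable signature
# would be misclassified by extension alone.
print(detect_type(b"MZ\x90\x00\x03"))   # windows executable
print(detect_type(b"%PDF-1.4\n"))       # pdf
print(detect_type(b"plain text"))       # unknown
```

In practice the header would be read from disk, for example with `open(path, "rb").read(8)`, and any mismatch between the detected type and the file extension would be logged for follow-up rather than silently discarded.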