In this series of articles, we are discussing the role of Big Data in litigation. Our first post, published a year ago (it has been a busy year spent digging up old databases), defined “structured data” and “big data” in the context of litigation.

In this next post, we will discuss Data Archaeology, that is to say, the process of discovering, extracting, and organizing data.

When we work with Big Data in litigation, the structured data generally comes from one of three sources:

  1. Data we "dig up" from our client’s systems and archives as part of the disclosure process;
  2. Data we receive from the opposing party as part of their production; and
  3. Third-party data, such as market data, that is relevant to the case.

In this post, we’ll focus on the first two of these.

When we work with Big Data, whether we are on the defendant side or the plaintiff/claimant side, we need to have our fedora and bullwhip ready to channel our inner Indiana Jones. We know where we need to get to: a consolidated repository of the structured data relevant to the case, ready to be produced and analyzed. The question we usually get is: “How do you start?” We can’t just start aimlessly digging for data in the hope of turning up something relevant.

We must start by understanding the case and, more importantly, the business context and processes that underpin the matter in dispute. For example, if this is a trading dispute, we must understand what products are being traded and the processes at our client that supported this trading. Once we understand this, we can start to identify what systems supported these business processes and what data stores those systems kept. We often, perhaps unimaginatively, refer to this phase of work as Process and Systems Discovery. Failing to properly understand the legal case and the client’s business processes is a key pitfall: if you skip this step, you can end up digging aimlessly, turning up data that is not relevant, and failing to unearth data that is.

The output from Process and Systems Discovery is an inventory of relevant data sources, often accompanied by samples of the data obtained from the client. As tempting as it may be to start full-blown extraction of this data, it is necessary at this stage to consult with the legal team and consider factors such as:

  • Data redundancy: Often the same, or similar, data exists in multiple systems (e.g., trading data may be stored in systems that interface with exchanges, in the primary trading system, and in settlement systems). In such cases, an extract from a single data source may be sufficient to provide the required data.
  • Proportionality: Certain data may exist in legacy formats, meaning that extracting and transforming it into a usable form will be complex, very costly, and take months to complete. This is normally because the software that could easily have ingested and interpreted the proprietary data format is no longer available or licensed to the client. We have certainly encountered instances where obtaining the requisite hardware and software would have cost many hundreds of thousands of dollars.
  • Legal considerations: Every case has its nuances, often depending on whether we are working for the plaintiff or for the defendant. These nuances need careful consideration and close collaboration between us and the legal team.

Once we have agreed on what data should be obtained, we proceed to the fun (!) bit: data extraction. We receive data in multiple formats, often from disparate systems, and none of it is necessarily produced for us in a neat and orderly form. We rarely have access to a data dictionary (a technical document that explains the nature and format of the data captured in the columns of the tables). We are rarely given any detailed instructions on how the underlying database is structured, and the only firsthand accounts of the systems housing the information are usually gleaned from depositions or our discussions with counsel.

Occasionally we know the underlying system is an Oracle, MySQL, IBM DB2, or Microsoft SQL Server database. We can obtain the data from these systems in a wide variety of formats, ranging from database backup files to .CSV exports or even a series of text file extracts. All of this data is securely imported into a specialized data analysis environment, often hosted by us but sometimes hosted by the client. Increasingly, the data volumes on our projects require cloud-based data analytics platforms such as Microsoft Azure or AWS.
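As a rough illustration of what loading such an extract can look like, the sketch below imports a small pipe-delimited text extract into a staging table. The table name, column names, and sample rows are all hypothetical, and SQLite stands in for the enterprise databases mentioned above.

```python
import csv
import io
import sqlite3

# Hypothetical pipe-delimited text extract, as might arrive from a
# legacy trading system (values are made-up examples).
raw_extract = io.StringIO(
    "trade_id|trade_date|notional\n"
    "T-001|2019-03-01|1500000\n"
    "T-002|2019-03-02|250000\n"
)

conn = sqlite3.connect(":memory:")
# Stage everything as TEXT first; typing comes later, once we
# understand what each column actually holds.
conn.execute(
    "CREATE TABLE staging_trades (trade_id TEXT, trade_date TEXT, notional TEXT)"
)

reader = csv.reader(raw_extract, delimiter="|")
header = next(reader)  # keep the header row for later reconciliation
rows = list(reader)
conn.executemany("INSERT INTO staging_trades VALUES (?, ?, ?)", rows)
conn.commit()

# Record how many rows were loaded so the count can be reconciled
# against the source system's own record counts.
loaded = conn.execute("SELECT COUNT(*) FROM staging_trades").fetchone()[0]
```

Staging every column as text before typing it is a deliberate choice: legacy extracts frequently contain malformed dates or numbers, and an eager type conversion on import can silently drop or mangle them.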

Before we can start on this fun expedition, we need to make sure the data is stable. What does that mean to us? It means the data has been imported into our system completely and correctly. We spend time making sure the imported data is usable – we go table by table, column by column, to verify each cell has been imported correctly, although the toolkit we have developed over the years means this can be completed quickly and efficiently. It is always a fun discussion with client and counsel when we explain that before we can start the exploration, we need to make sure we have the right height for our “Staff of Data” (to use another Raiders of the Lost Ark analogy). Otherwise, we will always be digging in the wrong place.

Once we get the data into our system, we want to start building our own data dictionary. We want to understand the tables and columns available to us, and we want to work with the client and counsel to understand the specific requests within the litigation.
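When no data dictionary is provided, one way to bootstrap our own is to introspect the imported schema itself. The sketch below (SQLite as a stand-in, with an illustrative table) records each table's columns and declared types as a skeleton to annotate with the client and counsel:

```python
import sqlite3

# Illustrative schema standing in for imported client data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging_trades (trade_id TEXT, trade_date TEXT, notional REAL)"
)

def build_dictionary(conn):
    """Map each table to a list of (column name, declared type) pairs."""
    dictionary = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        dictionary[table] = [(col[1], col[2]) for col in cols]
    return dictionary

dd = build_dictionary(conn)
```

The introspected skeleton only captures names and types; the substantive half of a data dictionary – what each column actually means in the client's business process – still has to come from the Process and Systems Discovery work.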

In our work, this is where the adventure hits its high point. We work to understand any limitations of the data. Do we have data covering all relevant time periods, business processes, legal entities, and business units? Once we are comfortable with the data we have – or we are told “this is as comfortable as you are going to get” – we work with clients and counsel to identify the relevant questions to answer. But that is for our next post: data analysis.
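A simple first pass at the time-period question can be sketched as a coverage check: compare the earliest and latest dates observed in an extract against the relevant period. The dates below are made-up examples.

```python
from datetime import date

# Hypothetical relevant period agreed with counsel.
relevant_period = (date(2019, 1, 1), date(2019, 12, 31))

# Earliest and latest trade dates observed in the extract (illustrative).
extract_dates = [date(2019, 1, 1), date(2019, 6, 30)]

def covers(period, observed_min, observed_max):
    """True if the observed data spans the entire relevant period."""
    start, end = period
    return observed_min <= start and observed_max >= end

complete = covers(relevant_period, min(extract_dates), max(extract_dates))
# Here the extract stops in June, so coverage of the period is incomplete
# and the gap must be raised with the client before analysis begins.
```

The same check is typically repeated per legal entity and business unit, since an extract can span the full period overall while still missing months for a single entity.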

About the authors:

Tom is a Partner in our Forensic Data Analytics team, based in Dallas. He has over 20 years of experience leading clients through complex litigation and regulatory matters involving large amounts of structured and unstructured data with a focus on the financial services industry.

David is a Partner in our Forensic Data Analytics team, based in London. He has over 18 years of experience assisting clients with high-profile regulatory and legal challenges, working to support blue chip companies in areas such as Anti-trust, Corporate investigations, Commercial litigation, and Financial crime.