Every organization uses databases to manage operations and store critical information. These databases capture the actions, the processes, and the intentions of those organizations. For large legal cases such as class-action suits, false claims matters, and anti-merger challenges, fully utilizing this information can make or break the case. Because of this, knowing how to work with databases has become an important skill in corporate law. In this post, we will show you how to work with databases using an approach that emphasizes the legal aspects of the analysis and takes advantage of the full power of databases.
Going beyond spreadsheets and consultants
Spreadsheets are an easy-to-use tool for viewing structured data, making them the de facto choice for attorneys faced with analyzing data from corporate databases. Be aware, however, that spreadsheets most often do not represent the full picture. Often a spreadsheet is a narrow slice of data created by an IT professional who is not familiar with the legal matter. Spreadsheets may also be missing critical dimensions such as geography, system and process changes, and relationships to other events.
Another common approach is to rely on outside consultants to analyze the data. While this allows the attorney to remain detached from technical considerations, it can hide pertinent information that would otherwise be revealed in the data. Consultants may not be inclined to pursue legal theories that emerge during data analysis.
The data, its structure, and its relationships can convey worlds of information for a legal matter. The side that can take full advantage of this hidden information has a huge advantage. In a defense case, for example, individual claims by the plaintiff can be refuted with an aggregate understanding of the policies and actions of the defendant organization as revealed in the data.
The challenge is to skillfully handle the technical analysis while continuing to think like an attorney.
The Power is in SQL (Structured Query Language)
Almost all enterprise databases are SQL databases because SQL has unique advantages that have made it popular in the IT world. Among these advantages is the ability to investigate data with a methodological precision that is also extremely useful in law. Unlike keyword searches and artificial intelligence algorithms, SQL lets you show exactly how a conclusion was reached, even for the most complex analyses: the SQL query itself precisely describes the steps of the analysis. SQL can also be used to reverse engineer business processes, since those processes can often be inferred from the structure of the data in a database. Business processes are critical in legal matters involving compliance with the law. We recommend using SQL to your advantage and will show you how. Read on!
The Process: “Database Discovery” vs. eDiscovery
The process of working with documents in law is well understood and there is a large industry of tools and services surrounding eDiscovery. Databases, however, require different tools and different expertise. The process of “Database Discovery” can be compared to eDiscovery:
We will go through each step of this process in the following sections. Here are the steps:
Gather: work with the client’s IT department to collect data from different systems and databases
Clean: run manual and automated processes to remove duplicates and generate profile information about each data set
Analyze: analyze the data using the SQL query language to clarify key questions about dates, locations, business processes, and calculations
Review: execute a records review with experts to analyze, verify, and mark up critical details. Incorporate this new information back into the analysis.
Produce: extract productions from the database into files. Develop interactive dashboards and data visualizations for presentation in court.
Gather: Extracting data from a SQL Database
Databases often contain operationally critical, sensitive, and proprietary information, and so are protected behind the walls of IT organizations. Typically, when a legal data request is made, an IT analyst with secure access will need to run a report, write a query, or perform a dump of the requested data. Large organizations have many systems and sometimes hundreds of databases, which makes them hard to navigate. Based on our experience in legal discovery, here are a few points we recommend for consideration:
Watch out for incomplete data! IT analysts like to filter data (i.e. exclude certain records) to reduce the risk of exposing sensitive information. Ask for the complete set of records. It is better to remove columns (like a social security number column) than to remove entire records.
Ask for Specifics. Once you receive the data files from the IT organization, it might not be obvious what they are or where they came from. Ask questions. What system did the data come from? What type of database? Are there other systems in the organization that contain the same type of data? What selection criteria were used to reduce the data? What time frame was used to query the data?
Be careful with sensitive data. Ask whether there is sensitive information in the data such as personal and financial information. Data leaves a footprint as it is copied from one place to another. Assume that all deleted data is recoverable.
Look out for duplicates. Are there duplicate records in the data? Duplication can easily happen if a data export or SQL query is done incorrectly. Duplicates matter because they can cause counts and summation to be incorrect.
Be aware that data can get modified or corrupted. Some tools (like Excel) alter the number of decimal places for numbers or remove leading zeros from numeric codes. For example, “0783” and “003” could become “783” and “3” when you save an Excel file into another format. Also, column headers in tabular data can get removed or become misaligned with the data fields. Dates can easily be misinterpreted as well: in Europe, the date 04/06/2022 is read as June 4th, 2022. Because of this, where possible, ask for an unambiguous date format such as “YYYY-MM-DD”.
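Once the files are loaded into a database, concerns like duplication can be checked directly with a query. Here is a minimal sketch, assuming a hypothetical transactions table with a trans_id identifier column:

```sql
-- Find trans_id values that appear more than once
-- (table and column names are illustrative)
SELECT trans_id, COUNT(*) AS copies
FROM transactions
GROUP BY trans_id
HAVING COUNT(*) > 1;
```

If this query returns any rows, the export likely duplicated records, and counts and summations over the data cannot be trusted until the duplicates are resolved.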
When data is extracted from a database, files of different types are created depending on how the extraction is done. Some of these files, such as CSV (comma separated value) files can be opened by applications like Excel or text editors. Other files, such as database backups, require special programs and expertise to open and use. The best practice is to keep and track all files as-received in a secure location accessible by multiple team members. Once organized, these files will need to be loaded into a place where they can be cleaned and analyzed.
In our experience, the best place to analyze the data is in a SQL database, since that will best take advantage of the inherent relational (SQL) structure of the data. Any type of file extracted from a database by an IT department can be loaded back into a SQL database. The resulting database can manage all of the data from the many files, allowing you to “join” across data from different files or even from different source systems.
Once the data files are received, it is time to load the data. The loading process depends on the tools chosen and the types of files being used. Most often we use CSV files and spreadsheets to load data into individual database tables but sometimes we see entire backups of a database. For the loading process we recommend having a technician on the team that is knowledgeable about the tools and the loading process.
Clean: How to Prepare Data for Analysis
After data is loaded into the database we want to make sure that everything is correct: that there are no file corruption issues or human mistakes that occurred in the gathering step. Since the extraction of data from an enterprise database is a bit of an art form, don’t assume that the IT department extracted the data correctly or even that all the relevant data was collected. The cleaning process will involve using the structured query language (SQL) to review and fix the data. A typical cleaning process might involve:
Running a data profile. These automated scripts create a table of statistics that gives an overview of the data as well as an initial indication of issues. A profile may include the total number of records, the number of duplicate records, and the number of empty values in critical fields.
Manual spot checking. Here a technician looks through a small set of the records in each table looking for issues such as misaligned columns and truncated codes.
Identifying sensitive data. Typically we ask that no sensitive information be included in the import files unless absolutely necessary, but it is always good practice to double check.
Optimizing the data. Databases allow you to query millions of rows of data but may require that certain optimization features (such as indexes) be applied to achieve good performance. We will want to identify the areas of the database that need this optimization.
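As a sketch of what a profiling script computes, here is a single query that produces a few of the statistics mentioned above, assuming a hypothetical claims table with claim_id and claim_date columns:

```sql
-- Basic profile of a hypothetical "claims" table
SELECT
    COUNT(*)                                            AS total_records,
    COUNT(*) - COUNT(DISTINCT claim_id)                 AS duplicate_ids,
    SUM(CASE WHEN claim_date IS NULL THEN 1 ELSE 0 END) AS missing_dates
FROM claims;
```

Running a query like this against every imported table gives the team an early warning about duplication and missing values before any analysis begins.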
Once the data is checked and cleaned we are ready to begin the analysis.
Analyze: Using Structured Query Language (SQL) for Analysis
While there are many tools available that make it easy to analyze SQL data, those tools can oversimplify the task and gloss over critical details. Here we will discuss a more powerful approach: using the SQL language directly. SQL is a script-like language for asking questions about data in a database. Since SQL is an industry standard, there are many training courses available, ranging from beginner to advanced.
Using a SQL query tool, you connect to a database with a connection string, a username and a password. In the tool’s editor, you type a SQL statement (or query), hit the run button, and the database returns a tabular spreadsheet-like result that answers the query.
Take the following SQL query for example:
SELECT sales_rep, commission_amount
FROM commissions
WHERE commission_amount > 100000;
When you hit “run”, the database would return all the sales representatives that have received a commission over $100,000. The results would appear in a tabular format like this:
sales_rep   | commission_amount
------------|------------------
Jim Sales   | 110,000
Anna Tonken | 256,000
As you can see from the SQL query above, the “SELECT” clause indicates the fields that you want to show, the “FROM” clause identifies the table (or collection of data) that you want to query and the “WHERE” clause is the filtering criteria that asks the question you want answered.
The above is just a simple example, but SQL can be very powerful. In a “WHERE” clause you could query for all geographic sites within a 10-mile radius of some location. In the “SELECT” clause you can compute summations, averages, standard deviations, and other mathematical functions. There is also a “GROUP BY” clause that lets you specify the groups of data over which the aggregation functions (summation, average, etc.) are performed. You can combine different tables of data using “JOIN” or “UNION” clauses. And finally, there are many plugins for databases that extend what you can do in a SQL query. SQL databases have been around for over 40 years and have an enormous set of features. We are continually surprised at what can be done with such an easy-to-use (once you know how) language.
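To illustrate a few of these clauses working together, here is a hedged sketch that extends the earlier commissions example, assuming a hypothetical sales_reps table related to commissions by a rep_id column:

```sql
-- Total commissions per region, largest first
-- (sales_reps, rep_id, and region are illustrative names)
SELECT r.region, SUM(c.commission_amount) AS total_commissions
FROM commissions c
JOIN sales_reps r ON r.rep_id = c.rep_id
GROUP BY r.region
ORDER BY total_commissions DESC;
```

In one short statement this joins two tables, aggregates by group, and sorts the result, which is exactly the kind of question that is tedious to answer from a stack of spreadsheets.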
Even though the analysis step can be highly technical, this is where the attorneys should be most involved because it is where we test our legal theories. Using SQL we follow a legal line of reasoning like this:
First, we identify the data records that are involved in the plaintiff’s claims. We seek to understand the structure of the data and the underlying process by which the data was created.
Next, we understand the claims in terms of the actual data and look for inconsistencies, unfounded assumptions, and scopes in which the claims do and don’t apply.
Finally we pose several theories for alternate positions and model those theories using SQL queries. Is the theory consistent with the data? Does the theory produce a better outcome for our client?
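As an illustration of modeling a theory, suppose the plaintiff alleges a blanket discount policy. A query along these lines, against a hypothetical transactions table with product and discount_amount columns, would show whether discounts actually appear across all products or only a subset:

```sql
-- How widespread were the discounts, really?
-- (table and column names are illustrative)
SELECT product, COUNT(*) AS discounted_sales
FROM transactions
WHERE discount_amount > 0
GROUP BY product
ORDER BY discounted_sales DESC;
```

If the result shows discounts concentrated in a handful of products, the “blanket policy” claim starts to look like an unfounded assumption, and the query itself documents how that conclusion was reached.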
Review: How to Cull and Tag SQL Records
When working with SQL data sets, the purpose of the review is to create additional information that can be added to our analysis efforts. This review can be done automatically (using SQL!), by having knowledgeable, trained staff look over the data, or by having AI perform the review. Just as in document review, we review the data critical to the case, such as a set of disputed transactions. In a SQL database we create the sets for review using a SQL query. When run, the query returns all the records and data to be reviewed. Once we have the query in hand, there are several different approaches to the review, which are outlined below.
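For example, a review set for a hypothetical collection of disputed transactions might be defined as:

```sql
-- The records routed to reviewers
-- (status = 'disputed' and the column names are illustrative)
SELECT trans_id, trans_date, product, amount
FROM transactions
WHERE status = 'disputed'
ORDER BY trans_date;
```

Whichever review method is chosen, this query becomes the single, repeatable definition of what is being reviewed.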
AI Review of SQL Data
Reviewing documents (and data) using artificial intelligence is new in the eDiscovery world and so the techniques are still being worked out. Document AI review is done by presenting the AI model with each document in turn and asking a question (prompt) for each document. At Sapling, we perform AI review in a similar way on SQL data, but rather than documents we have rows of data. Currently, this type of review requires either a tool (such as an eDiscovery system or Sapling’s platform) or custom code.
Automated Review by Creating Derivative Data using SQL
In addition to querying data, a SQL statement can be used to create and update data. We typically prefer to create new data so that the original data is preserved. Here is an example of a SQL statement that pulls all transaction records between the dates of interest:
SELECT *
FROM transactions
WHERE trans_date BETWEEN '2007-02-07' AND '2007-02-15';
Now let’s suppose that certain transactions were found to have errors: it appears that a discount was applied, but the operator really intended to change the price to account for a publicized price change. First we create a new corrected_price column in the transactions table using the SQL tool. Next we want to find the records where the error was made and set the corrected_price column to account for the price change. Maybe something like this:
UPDATE transactions
SET corrected_price = 200.00
WHERE trans_date BETWEEN '2007-02-07' AND '2007-02-15'
  AND product = 'Resound Plus x15';
This is a simple example, but it demonstrates the process. It is typical to need multiple SQL statements to catch all the variations of an issue. Now that the data is updated, we can analyze the difference between the original assumptions about the discount and the corrected amounts. This might help us reduce the scope of the damages.
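For instance, with the corrected_price column populated, a query along these lines could compare the totals over the period in question (this sketch assumes the table stores the charged amount in a hypothetical price column; COALESCE falls back to the original price where no correction was made):

```sql
-- Original vs. corrected totals for the disputed period
SELECT SUM(price)                             AS original_total,
       SUM(COALESCE(corrected_price, price))  AS corrected_total
FROM transactions
WHERE trans_date BETWEEN '2007-02-07' AND '2007-02-15';
```

The difference between the two columns is a defensible, reproducible number that can anchor a damages argument.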
Human Review using Custom Forms
Another way to review data in a database is to build a custom form where reviewers can tag and add data to existing records. Typically we build custom forms that can be used in a web browser so that they are easy for users to access. Behind the scenes, these forms perform SQL updates to reflect the user’s changes.
Human Review via Excel
A SQL query can easily be used to produce an Excel file because the output of a SQL query is tabular data. To perform a review using Excel, we export the data using SQL, open the file in Excel, add columns for our review questions, and ask the reviewers to go through the rows in the spreadsheet and update the new columns. When the reviewers are done, we collect the spreadsheets and load them back into the database.
Produce: How to Present Data from a SQL Database
Presentation of data is important in legal cases, and with SQL quite a lot is possible. When we work with database data, we will have a SQL query that gives us the data we want to present or produce. We must be careful to write the query in such a way as to limit the information to only what is needed; this can be done using the myriad filtering features of the SQL language. Of course, the query isn’t the same as the presentation. Once you have the tabular output from a query, there are many ways of creating a presentation from it.
Production as a CSV or Excel file
The simplest and most universal presentation of SQL data is as a file of tabular data which can be opened in a text editor. In this case we run a SQL query to export the output to a CSV file which can also be imported into Excel. This format is one of the standards for producing discovered data in a legal matter.
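The exact export command depends on the database. In PostgreSQL, for example, a query’s output can be written directly to a CSV file with a command like this (the file path and table are illustrative):

```sql
-- PostgreSQL: export query results to a CSV file with a header row
COPY (SELECT sales_rep, commission_amount FROM commissions)
TO '/secure/productions/commissions.csv'
WITH (FORMAT csv, HEADER true);
```

Other databases have equivalent export features, and most SQL query tools can also save a result grid to CSV with a click.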
Production as Screen Shots of the Application
Database data exists in a raw, tabular form that can be lacking in context. A table of numbers and text may not be enough to provide the proper context in a courtroom. In addition to providing Excel files of data, it may be necessary to include screenshots of the data within the application as the users see it. While it is possible to produce screenshots of a few records, it is impractical to create and work with screenshots for thousands of records.
Presenting Data as Charts and Graphs
Another option is to use a database analytics tool to create a graph or chart similar to the ones that can be created in Excel. In our work we use a tool built into our platform for this purpose, but other tools are available as well. These tools allow you to write a SQL query and hook the results into a presentation such as a graph or chart. Many tools allow the presentation to be interactive where a user can click or scroll to drill down into data.
Custom Presentations
Yet another option is to create a completely custom presentation such as an image of a mobile phone screen that shows the relevant text messages between two people. These custom presentations can be precisely fitted to the legal argument that they are presenting. To build custom presentations you will need a tool that allows you to develop custom UI. In our case we use a low-code development tool that can build full web based applications.
Attorneys can do SQL
While there are certainly technical requirements, it is worthwhile to get the legal experts (i.e., the attorneys) directly into the database. Within the database, the attorney can operate in rapid-fire exploratory mode without having to go through intermediaries. And ultimately, the analysis in legal discovery is not too different from data analysis in databases. We believe that attorneys can and should do SQL.
In practice we don’t expect every attorney to be a database expert, although we know some who are. We typically set up a team that has the appropriate set of expertise which includes the attorneys as team members. This team works together to produce some fast turnaround legal analytics. While the attorneys don’t usually set up the database and load the data, they absolutely can dig into SQL.
Hopefully this how-to was helpful. We have a ton of experience working in “Database Discovery” and are happy to share the knowledge. Let us know if you have any future topics you would like to see us cover.