
Access Limits

All metadata files, and bulk data files for our open jurisdictions, are available to everyone without a login.

Bulk data files for the remaining jurisdictions are available to research scholars who sign a research agreement. You can request a research agreement by creating an account and then visiting your account page.

See our About page for details on our data access restrictions.

Downloading

You can download bulk data manually from our website, or use the manifest.csv file to select URLs to download programmatically.

When downloading bulk files, you may find that the download times out on the largest files. In that case, use wget, which retries when it encounters a network problem. Here's an example for the U.S. file with case body in text format:

wget --header="Authorization: Token your-api-token" "https://case.law/download/bulk_exports/latest/by_jurisdiction/case_text_restricted/us_text.zip"

Because this is a restricted file, the request must include an Authorization header. Replace your-api-token with your API token from the user details page.
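If wget is not available, the same download can be approximated from Python. The following is a minimal sketch using only the standard library; the function names, retry count, and backoff schedule are illustrative assumptions, not part of CAP's tooling. Unlike a resumed wget transfer, it restarts the file from the beginning after a failure.

```python
import http.client
import time
import urllib.error
import urllib.request


def build_request(url, api_token):
    """Build a request carrying the Authorization header for restricted files."""
    return urllib.request.Request(url, headers={"Authorization": f"Token {api_token}"})


def download(url, dest, api_token, attempts=5):
    """Stream url to dest, starting over from byte zero after a network error."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(build_request(url, api_token), timeout=60) as resp:
                with open(dest, "wb") as out:
                    while chunk := resp.read(1 << 20):  # copy 1 MiB at a time
                        out.write(chunk)
            return
        except urllib.error.HTTPError as err:
            # Client errors such as a rejected token will not succeed on retry.
            if err.code < 500 or attempt == attempts:
                raise
        except (OSError, http.client.HTTPException):
            # Timeouts, dropped connections, and truncated reads land here.
            if attempt == attempts:
                raise
        time.sleep(2 ** attempt)  # simple exponential backoff before retrying
```

A call would look like download(url, "us_text.zip", "your-api-token"), with url set to the same address passed to wget above.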

API Equivalence

Each file that we offer for download is equivalent to a particular query to our API. For example, the file ill_text.zip contains all cases that would be returned by an API query with full_case=true&jurisdiction=ill&body_format=text. We offer files for each possible jurisdiction value and each possible reporter value, combined with body_format=text, body_format=xml, and plain metadata-only export.

The JSON objects returned by the API and in bulk files differ only in that bulk JSON objects do not include "url" fields, which can be reconstructed from object IDs.
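As a rough illustration of this equivalence, the sketch below turns a jurisdiction file name into the matching query string. It assumes file names follow the jurisdiction_format.zip pattern seen in ill_text.zip and us_text.zip; the helper name is hypothetical, and reporter files and metadata-only exports are deliberately not handled because their naming is not described here.

```python
from urllib.parse import urlencode


def query_for_jurisdiction_file(filename):
    """Return the API query string equivalent to a jurisdiction bulk file.

    Assumes names of the form "<jurisdiction>_<body_format>.zip".
    """
    if not filename.endswith(".zip"):
        raise ValueError(f"not a zip file name: {filename!r}")
    stem = filename[: -len(".zip")]
    # Split on the last underscore so the format suffix is separated cleanly.
    jurisdiction, _, body_format = stem.rpartition("_")
    if not jurisdiction or body_format not in ("text", "xml"):
        raise ValueError(f"unrecognized file name: {filename!r}")
    return urlencode(
        {"full_case": "true", "jurisdiction": jurisdiction, "body_format": body_format}
    )


print(query_for_jurisdiction_file("ill_text.zip"))
# prints: full_case=true&jurisdiction=ill&body_format=text
```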

Data Format

Bulk data files are provided as zipped directories. Each directory is in BagIt format, with a layout like this:

.
├── bag-info.txt
├── bagit.txt
├── data/
│   └── data.jsonl.xz
└── manifest-sha512.txt

Because the zip file provides no additional compression, we recommend unzipping it once and keeping the extracted directory on disk for convenience.

Caselaw data is stored within the data/data.jsonl.xz file. The .jsonl.xz suffix indicates that the file is compressed with xz, and that the underlying text file is in JSON Lines format, where each line represents one JSON object.
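One benefit of the BagIt layout is that the manifest-sha512.txt file lets you check that a download arrived intact. The sketch below assumes the standard BagIt manifest format, in which each line holds a hex digest followed by a path relative to the bag directory; the function name is illustrative, and percent-encoded paths are not handled.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir):
    """Return the payload paths whose SHA-512 digest does not match the manifest."""
    root = Path(bag_dir)
    mismatched = []
    manifest = (root / "manifest-sha512.txt").read_text(encoding="utf-8")
    for entry in manifest.splitlines():
        if not entry.strip():
            continue
        # Each manifest line is "<hex digest> <relative path>".
        expected, rel_path = entry.split(maxsplit=1)
        rel_path = rel_path.strip()
        digest = hashlib.sha512()
        with open(root / rel_path, "rb") as payload:
            # Hash in 1 MiB blocks so large files never sit fully in memory.
            for block in iter(lambda: payload.read(1 << 20), b""):
                digest.update(block)
        if digest.hexdigest() != expected.lower():
            mismatched.append(rel_path)
    return mismatched
```

An empty list from verify_bag("path/to/unzipped/bag") means every listed file matched its recorded checksum.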

Using Bulk Data

The data.jsonl.xz file can be decompressed using third-party GUI programs like The Unarchiver (Mac) or 7-zip (Windows), or from the command line with a command like unxz -k data/data.jsonl.xz.

However, decompressing increases the disk space needed by about 500%, and in most cases is unnecessary. Instead, we recommend working directly with the compressed files.

To read the file from the command line, run:

xzcat data/data.jsonl.xz | less

If you install jq you can get nicely formatted output ...

xzcat data/data.jsonl.xz | jq . | less

... or run more sophisticated queries. For example, to extract the name of each case:

xzcat data/data.jsonl.xz | jq .name | less

You can also interact directly with the compressed files from code. The following example prints the name of each case using Python:

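import json
import lzma

# Open in text mode so each line arrives as an already-decoded str.
with lzma.open("data/data.jsonl.xz", "rt", encoding="utf-8") as in_file:
    for line in in_file:
        case = json.loads(line)  # one JSON object (one case) per line
        print(case["name"])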
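The same streaming pattern extends to simple filtering without ever writing the decompressed file to disk. This is a minimal sketch: the helper names are hypothetical, and only the name field shown above is assumed to exist on each case.

```python
import json
import lzma
from typing import Iterator


def iter_cases(path: str) -> Iterator[dict]:
    """Yield one case dict per non-empty line of a .jsonl.xz file."""
    with lzma.open(path, "rt", encoding="utf-8") as in_file:
        for line in in_file:
            if line.strip():
                yield json.loads(line)


def names_containing(path: str, needle: str) -> list:
    """Return case names that contain needle, ignoring letter case."""
    needle = needle.lower()
    return [
        case["name"]
        for case in iter_cases(path)
        if needle in case["name"].lower()
    ]
```

For example, names_containing("data/data.jsonl.xz", "railroad") returns every case name that mentions "railroad", reading the file one line at a time.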

To load the compressed data file into an R data frame, do something like this:

> install.packages("jsonlite")
> library(jsonlite)
> ark <- stream_in(xzfile("Arkansas-20190416-text/data/data.jsonl.xz"))

Other repositories

You can also explore our Illinois Public Bulk Data on Harvard Dataverse and Kaggle.


Bulk Data

Our bulk data files contain the same information that is available via our API, but are much faster to download if you want to interact with a large number of cases. Each file contains all of the cases from a single jurisdiction or reporter.
Site text is licensed CC BY-SA 4.0. Source code is MIT licensed. Harvard asserts no copyright in caselaw retrieved from this site. ©2023 The President and Fellows of Harvard University.