Data Specifications

Bulk

Bulk User Guide

Structure

Bulk data files are provided as zipped directories. Each directory is in BagIt format, with a layout like this:

.
├── bag-info.txt
├── bagit.txt
├── data/
│   └── data.jsonl.xz
└── manifest-sha512.txt

Data Format

Caselaw data is stored in the data/data.jsonl.xz file. The .jsonl.xz suffix indicates a text file compressed with xz, in which each line is one JSON object. Each of those objects has the same format as the corresponding object returned by the API.
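As a minimal sketch of reading such a file, the Python standard library's lzma and json modules are enough. The function name iter_bulk_records is an illustrative assumption, not part of any CAP tooling, and the path is whatever location you unzipped the bag to.

```python
import json
import lzma


def iter_bulk_records(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl.xz file.

    `iter_bulk_records` is a hypothetical helper name used for illustration.
    """
    # Text mode decompresses on the fly, so the whole file never has to
    # fit in memory at once.
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Typical use would be a loop such as `for record in iter_bulk_records("data/data.jsonl.xz"): ...`, processing one record at a time.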

API

API queries always return JSON. Here's what they look like. For more details on queries, check out the API Reference.

Individual Records

If you specify an individual record (reachable through the "url" value present in most types of records) then you'll receive a single JSON object as formatted below.

Query Results

If you don't request a single record by its primary key (usually an id), the response wraps the matching objects in a results array, even if only one object matches your query.

{
    "count": (int),
    "next": (url with pagination cursor),
    "previous": (url with pagination cursor),
    "results": (array of json objects, as listed below)
}
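The sketch below shows one way to walk a paginated result set by following "next" until it is null. It assumes only the four keys shown above; fetch_json and iter_results are illustrative names, and the default fetcher sends no authentication header, so it would only suit requests that don't need one.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and decode its JSON body (hypothetical helper, no auth header)."""
    with urlopen(url) as response:
        return json.load(response)


def iter_results(first_url, fetch=fetch_json):
    """Yield every object in "results", following "next" until it is null."""
    url = first_url
    while url:
        page = fetch(url)
        yield from page["results"]
        # "next" is null (None in Python) on the last page.
        url = page.get("next")
```

Passing the fetcher as a parameter keeps the loop testable without network access and leaves room to substitute a function that adds your own request headers.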

Individual Objects

Case

{
    "id": (int),
    "url": (API url to this case),
    "name": (string),
    "name_abbreviation": (string),
    "decision_date": (string),
    "docket_number": (string),
    "first_page": (string),
    "last_page": (string),
    "citations": [array of citation objects],
    "volume": {Volume Object},
    "reporter": {Reporter Object},
    "court": {Court Object},
    "jurisdiction": {Jurisdiction Object},
    "cites_to": [array of cases this case cites to],
    "frontend_url": (url of case on our website),
    "frontend_pdf_url": (url of case pdf),
    "preview": [array of snippets that contain search term],
    "analysis": {
        "cardinality": (int),
        "char_count": (int),
        "ocr_confidence": (float),
        "sha256": (str),
        "simhash": (str),
        "word_count": (int)
    },
    "last_updated": (datetime),
    "casebody": {
        "status": "ok"/(error),
        "data": (null if status is not ok) {
            "judges": [array of strings that contain judges names],
            "parties": [array of strings containing party names],
            "opinions": [
                {
                    "text": (case text),
                    "type": (string),
                    "author": (string)
                }
            ],
            "attorneys": [array of strings that contain attorneys names],
            "corrections": (string. May include formatting notes.),
            "head_matter": (elements before the case text)
        }
    }
}

Casebody

Without the full_case=true parameter, a query returns no casebody. This is useful when you want to browse the metadata of many cases but fetch case text only for specific ones, conserving your 500-case-per-day limit.

The Case object above shows the default output for casebody: a JSON field with structured plain text. You can change that to HTML or XML by setting the body_format query parameter to either html or xml.

This is what you can expect from different format specifications using the body_format parameter:

Text Format (default)

https://api.case.law/v1/cases/?jurisdiction=ill&full_case=true

The default text format is best for natural language processing. Example response data:

"data": {
      "head_matter": "Fifth District\n(No. 70-17;\nThe People of the State of Illinois ...",
      "opinions": [
          {
              "author": "Mr. PRESIDING JUSTICE EBERSPACHER",
              "text": "Mr. PRESIDING JUSTICE EBERSPACHER\ndelivered the opinion of the court: ...",
              "type": "majority"
          }
      ],
      "judges": [],
      "attorneys": [
          "John D. Shulleriberger, Morton Zwick, ...",
          "Robert H. Rice, State’s Attorney, of Belleville, for the Peop ..."
      ]
  }
}

In this example, "head_matter" is a string containing all text printed in the volume before the text prepared by judges. "opinions" is an array containing a dictionary for each opinion in the case. "judges" and "attorneys" are substrings of "head_matter" that we believe refer to entities involved with the case.
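A small sketch of pulling opinion text out of a case record in the default text format follows. The helper name opinion_texts is an assumption for illustration; it relies only on the "status", "data", and "opinions" keys shown in the Case object, and returns nothing when the casebody is absent, not ok, or in a non-text format.

```python
def opinion_texts(case):
    """Return (type, author, text) tuples for each opinion, or [] if unavailable.

    `opinion_texts` is a hypothetical helper, not part of the API.
    """
    casebody = case.get("casebody") or {}
    data = casebody.get("data")
    # "data" is null when status is not ok, and is a string rather than a
    # dictionary when body_format is html or xml.
    if casebody.get("status") != "ok" or not isinstance(data, dict):
        return []
    return [
        (op.get("type"), op.get("author"), op.get("text", ""))
        for op in data.get("opinions", [])
    ]
```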

XML Format

https://api.case.law/v1/cases/?jurisdiction=ill&full_case=true&body_format=xml

The XML format is best if your analysis requires more information about pagination, formatting, or page layout. It contains a superset of the information available from body_format=text, but requires parsing XML data. Example response data:

"data": "<?xml version='1.0' encoding='utf-8'?>\n<casebody ..."
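As a rough sketch of handling this string in Python, the standard library's xml.etree.ElementTree can parse it. Because the element names inside casebody are not listed here, the example below makes no assumption about them and simply collects every text node in document order; xml_text_chunks is an illustrative name.

```python
import xml.etree.ElementTree as ET


def xml_text_chunks(xml_string):
    """Return the non-empty text nodes of an XML casebody, in document order.

    `xml_text_chunks` is a hypothetical helper; it does not depend on any
    particular element or attribute names.
    """
    # Encoding to bytes lets the parser honor the encoding declaration
    # at the top of the document.
    root = ET.fromstring(xml_string.encode("utf-8"))
    return [chunk.strip() for chunk in root.itertext() if chunk.strip()]
```

For layout-aware analysis you would inspect the actual elements and attributes of a real response rather than flattening them this way.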

HTML Format

https://api.case.law/v1/cases/?jurisdiction=ill&full_case=true&body_format=html

The HTML format is best if you want to show readable, formatted caselaw to humans. It represents a best-effort attempt to transform our XML-formatted data to semantic HTML ready for CSS formatting of your choice. Example response data:

"data": "<section class=\"casebody\" data-firstpage=\"538\" data-lastpage=\"543\"> ..."
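The three example URLs above differ only in their query string, so they can be assembled programmatically. This is a minimal sketch using urllib.parse; the function name cases_url and its argument names are assumptions for illustration, while the endpoint and parameter names are the ones shown in the examples.

```python
from urllib.parse import urlencode

CASES_ENDPOINT = "https://api.case.law/v1/cases/"
BODY_FORMATS = ("text", "html", "xml")


def cases_url(jurisdiction, full_case=False, body_format=None):
    """Build a cases query URL (hypothetical helper)."""
    params = {"jurisdiction": jurisdiction}
    if full_case:
        params["full_case"] = "true"
    if body_format is not None:
        if body_format not in BODY_FORMATS:
            raise ValueError(f"body_format must be one of {BODY_FORMATS}")
        params["body_format"] = body_format
    # urlencode keeps the insertion order of the dictionary.
    return CASES_ENDPOINT + "?" + urlencode(params)
```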

Analysis Fields

Each case result in the API returns an analysis section, such as:

"analysis": { 
    "word_count": 1110, 
    "sha256": "0876189e8ac20dd03b7...", 
    "ocr_confidence": 0.654, 
    "char_count": 6890, 
    "pagerank": { 
        "percentile": 0.31980916105919643, 
        "raw": 5.770123949632993e-08 
     }, 
    "cardinality": 390,
    "simhash": "1:3459aad720da314e" 
}

Analysis fields are values calculated by processing the raw case text. They can be searched with filters.

All analysis fields are optional, and may or may not appear for a given case.

Analysis fields have the following meanings:

Cardinality (cardinality)

The number of unique words in the full case text including head matter.

Character count (char_count)

The number of unicode characters in the full case text including head matter.

OCR Confidence (ocr_confidence)

A relative score of the predicted accuracy of optical character recognition in the case, from 0.0 to 1.0. ocr_confidence is generated by averaging the OCR engine's reported confidence for each word in the case. The score has no objective interpretation, other than that a case with a lower score is likely to have more typographical errors than a case with a higher score.

PageRank (pagerank)

Example: "pagerank": {"raw": 0.00278, "percentile": 0.997}

An estimate of the all-time significance of this case in the citation graph, from 0.0 to 1.0, calculated using the PageRank algorithm. Cases with no inbound citations will not have this field, and implicitly have a rank of 0.

The "raw" score can be interpreted as the probability of encountering that case if you start at a random case and follow random citations. The "percentile" score indicates the fraction of cases, between 0.0 and 1.0, that have a lower raw score than the given case.
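Because the pagerank field is absent for cases with no inbound citations, code that ranks cases needs a fallback. The sketch below treats a missing field as 0.0, which follows the implicit rank of 0 described above; applying that default to the percentile as well is an assumption, and pagerank_percentile is an illustrative name.

```python
def pagerank_percentile(case):
    """Return the PageRank percentile of a case, or 0.0 when the field is absent.

    `pagerank_percentile` is a hypothetical helper; defaulting the
    percentile to 0.0 is an assumption.
    """
    analysis = case.get("analysis") or {}
    pagerank = analysis.get("pagerank") or {}
    return pagerank.get("percentile", 0.0)
```

A list of case records could then be ordered with `sorted(cases, key=pagerank_percentile, reverse=True)`.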

SHA-256 (sha256)

The hex-encoded SHA-256 hash of the full case text including head matter. This will match only if two cases have identical text, and will change if case text is edited (such as for OCR correction).

SimHash (simhash)

The hex-encoded, 64-bit SimHash of the full case text including head matter. The more similar the text of two cases, the lower the Hamming distance between their simhashes.

Simhashes are prepended by a version number, such as "1:33e68120ecb2d7de", to allow for algorithmic improvements. Simhashes with different version numbers may have been calculated using different parameters (such as hash algorithm or tokenization) and may not be directly comparable.
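A short sketch of comparing two simhash values follows. It splits off the version prefix, refuses to compare values with different versions, and counts differing bits with XOR. The function names are illustrative assumptions, and returning None for mismatched versions is one possible design choice rather than documented behavior.

```python
def parse_simhash(value):
    """Split "version:hex" into (version string, integer hash)."""
    version, _, hex_digits = value.partition(":")
    return version, int(hex_digits, 16)


def simhash_distance(a, b):
    """Return the Hamming distance between two simhashes, or None if versions differ.

    Both helpers are hypothetical names used for illustration.
    """
    version_a, bits_a = parse_simhash(a)
    version_b, bits_b = parse_simhash(b)
    if version_a != version_b:
        # Different versions may use different parameters, so the
        # distance would not be meaningful.
        return None
    # XOR leaves a 1 in every bit position where the hashes differ.
    return bin(bits_a ^ bits_b).count("1")
```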

Word count (word_count)

The number of words in the full case text including head matter.

Jurisdiction

    {
        "url": (url),
        "id": (int),
        "slug": (string),
        "name": (string),
        "name_long": (string),
        "whitelisted": true/false
    }

Court

    {
        "id": (int),
        "url": (url),
        "name": (string),
        "name_abbreviation": (string),
        "jurisdiction": (string),
        "jurisdiction_url": (url),
        "slug": (string)
    }

Volume

    {
        "url": (url),
        "barcode": (string),
        "volume_number": (string),
        "title": (string),
        "publisher": (string),
        "publication_year": (int),
        "start_year": (int),
        "end_year": (int),
        "nominative_volume_number": (string),
        "nominative_name": (string),
        "series_volume_number": (string),
        "reporter": (string),
        "reporter_url": (url),
        "jurisdictions": [list of jurisdiction objects],
        "pdf_url": (url),
        "frontend_url": (url)
    }

Reporter

    {
        "id": (int),
        "url": (url),
        "full_name": (string),
        "short_name": (string),
        "start_year": (int),
        "end_year": (int),
        "jurisdictions": [list of jurisdiction objects],
        "frontend_url": (url)
    }

Citation

    {
        "id": (int),
        "cite": (string),
        "cited_by": (url)
    }

Ngrams

    (search term): {
        (string jurisdiction)/"total": [
            {
                "year": (string),
                "count": [
                    (int),
                    (int)
                ],
                "doc_count": [
                    (int),
                    (int)
                ]
            }
        ]
    }
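The nested shape above can be awkward to work with directly, so the sketch below flattens it into one row per term, jurisdiction, and year. It assumes you have already extracted a mapping with exactly this shape from a response, keeps the two-element "count" and "doc_count" arrays as tuples without interpreting them, and uses the illustrative name flatten_ngrams.

```python
def flatten_ngrams(ngrams):
    """Flatten the nested ngrams mapping into a list of row tuples.

    Each row is (term, jurisdiction, year, count, doc_count), where the
    jurisdiction key may also be "total". `flatten_ngrams` is a
    hypothetical helper.
    """
    rows = []
    for term, by_jurisdiction in ngrams.items():
        for jurisdiction, years in by_jurisdiction.items():
            for entry in years:
                rows.append(
                    (
                        term,
                        jurisdiction,
                        entry["year"],
                        tuple(entry["count"]),
                        tuple(entry["doc_count"]),
                    )
                )
    return rows
```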

©2021 The President and Fellows of Harvard University. Site text is licensed CC BY-SA 4.0. Source code is MIT licensed. Harvard asserts no copyright in caselaw retrieved from this site.
