I need someone to scrape several government law database sites. They should be quite easy to scrape, as they are simply formatted information sites.
# CA legislature
It seems the old site is much easier to scrape.
# CA Courts
and its linked pages
As you can see, these are both very simple server-side HTML sites, so they should be quite easy to scrape.
# San Francisco Court
This is a PDF file, so it's harder to scrape, but please let me know if you have expertise in this.
We want a clean JSON file of the results, with HTML tags removed and a structure that preserves the headings.
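To make the expected output concrete, here is a minimal sketch of stripping tags while keeping headings, using only Python's standard `html.parser`. The class name `SectionExtractor`, the helper `html_to_sections`, and the output keys `heading` and `text` are illustrative assumptions, not a required design, and the real pages may need different tag handling.

```python
import json
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}  # content we never want in the output
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "ul", "ol", "table"}


class SectionExtractor(HTMLParser):
    """Groups plain text under the heading that precedes it (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sections = []
        self._current_heading = None
        self._in_heading = False
        self._skip_depth = 0
        self._heading_parts = []
        self._text_parts = []

    def flush(self):
        # Collapse whitespace and store the text gathered since the last heading.
        text = " ".join("".join(self._text_parts).split())
        if self._current_heading is not None or text:
            self.sections.append({"heading": self._current_heading, "text": text})
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in HEADING_TAGS:
            self.flush()
            self._in_heading = True
            self._heading_parts = []
        elif tag in BLOCK_TAGS:
            self._text_parts.append(" ")  # keep words in adjacent blocks apart

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in HEADING_TAGS and self._in_heading:
            self._in_heading = False
            self._current_heading = " ".join("".join(self._heading_parts).split())
        elif tag in BLOCK_TAGS:
            self._text_parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_heading:
            self._heading_parts.append(data)
        else:
            self._text_parts.append(data)


def html_to_sections(html):
    """Return a list of {"heading", "text"} dicts with all tags removed."""
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # store the text that follows the final heading
    return parser.sections


if __name__ == "__main__":
    sample = "<h1>Title 2</h1><p>First <b>bold</b> text.</p>"
    print(json.dumps(html_to_sections(sample), indent=2))
```

Text that appears before the first heading is stored with `heading` set to `null`, so nothing is silently dropped.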
The next stage of the project is applying natural language processing to extract keywords and tags so that we can search across all this content. Please advise if you have know-how in this area too.
# Headings / meta-data
For some laws, the heading and section are critical data that must be retained, for example:
lawCode = CIV
division = 2
title = 2
part = 1
chapter = 2
article = 1
So as you walk through the site, this information would need to be retained.
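A rough sketch of what retaining this while walking the site could look like: the scraper keeps a running context and clears the deeper levels whenever a higher-level heading changes. The event tuples, the function names, the `section` and `text` keys, and the section identifiers `"S1"` and `"S2"` are hypothetical placeholders. Only the six level names come from the example above.

```python
# Hierarchy levels from the example above, ordered from outermost to innermost.
LEVELS = ["lawCode", "division", "title", "part", "chapter", "article"]


def update_context(context, level, value):
    """Return a new context with `level` set and every deeper level cleared."""
    idx = LEVELS.index(level)
    new_context = {k: v for k, v in context.items() if LEVELS.index(k) < idx}
    new_context[level] = value
    return new_context


def attach_hierarchy(events):
    """Turn a stream of heading/section events into records carrying their hierarchy.

    `events` is an assumed intermediate format produced by the scraper:
      ("heading", level, value)  or  ("section", section_id, text)
    """
    context = {}
    records = []
    for kind, first, second in events:
        if kind == "heading":
            context = update_context(context, first, second)
        elif kind == "section":
            records.append({**context, "section": first, "text": second})
    return records


if __name__ == "__main__":
    events = [
        ("heading", "lawCode", "CIV"),
        ("heading", "division", "2"),
        ("heading", "title", "2"),
        ("heading", "part", "1"),
        ("heading", "chapter", "2"),
        ("heading", "article", "1"),
        ("section", "S1", "(section text)"),
        ("heading", "chapter", "3"),  # a new chapter clears the old article
        ("section", "S2", "(section text)"),
    ]
    for record in attach_hierarchy(events):
        print(record)
```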
We would like all of the content sites to be normalized to the same structure so we can search across them.
Please recommend how you would structure these different documents in JSON format. For example, for every chapter of content, should we include that hierarchy as tags?
Or should we apply the hierarchy within the JSON document itself while keeping the JSON flat?
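To illustrate the two options, here is one possible shape for a normalized record that stays flat but carries the hierarchy both as individual fields and as tags. This is only a sketch for discussion: the function `to_flat_document` and the field names `source`, `tags`, `path`, `heading`, and `text` are assumptions, and the source label `"ca_legislature"` is a made-up identifier.

```python
import json

# Hierarchy levels from the example above, ordered from outermost to innermost.
LEVELS = ["lawCode", "division", "title", "part", "chapter", "article"]


def to_flat_document(source, hierarchy, heading, text):
    """Build one flat JSON-ready record (hypothetical schema)."""
    # Tags such as "chapter:2" allow filtering without any nested structure.
    tags = [f"{level}:{hierarchy[level]}" for level in LEVELS if level in hierarchy]
    doc = {"source": source}
    # Every level is always present, so records from all sites share the same keys.
    for level in LEVELS:
        doc[level] = hierarchy.get(level)
    doc["tags"] = tags
    doc["path"] = "/".join(tags)  # the full hierarchy as a single sortable string
    doc["heading"] = heading
    doc["text"] = text
    return doc


if __name__ == "__main__":
    hierarchy = {"lawCode": "CIV", "division": "2", "title": "2",
                 "part": "1", "chapter": "2", "article": "1"}
    doc = to_flat_document("ca_legislature", hierarchy, "(heading)", "(section text)")
    print(json.dumps(doc, indent=2))
```

Sites that lack a level (for example, no `article`) would simply have `null` in that field, which keeps every record comparable across sources.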
Our eventual goal is to produce a searchable information resource for this content.
We will be scraping many other public legal information sites going forward, but this is just an initial sample. Please give a cost estimate for this as a one-off project.