
A crawler/scraper based on golang + colly, configurable via JSON


Super-Simple Scraper

This is a very thin layer on top of Colly that is configured from a JSON file. The output is JSONL, ready to be imported into Typesense.
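To make the output format concrete, here is a minimal sketch of writing JSONL (one JSON object per line) with Go's standard library. The `Document` type and its fields are hypothetical; the real records contain whatever your configured selectors extract.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Document is a hypothetical record shape used only for illustration.
// In practice the fields are determined by the selectors in the config.
type Document struct {
	URL   string `json:"url"`
	Title string `json:"title"`
	Body  string `json:"body"`
}

// toJSONL serialises each document as a single JSON object on its own line.
func toJSONL(docs []Document) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode writes a trailing newline after each value
	for _, d := range docs {
		if err := enc.Encode(d); err != nil {
			return "", err
		}
	}
	return buf.String(), nil
}

func main() {
	docs := []Document{
		{URL: "https://example.com/a", Title: "Page A", Body: "First page"},
		{URL: "https://example.com/b", Title: "Page B", Body: "Second page"},
	}
	out, err := toJSONL(docs)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Because every line is a complete JSON object, a file in this shape can be streamed or imported record by record without loading the whole crawl into memory.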

  • Scrape HTML & PDF documents based on the configured selectors
  • Selectors can be plain CSS selectors or template-based selectors, which have the sprig functions available.
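A template-based selector can be pictured as a Go template evaluated against data from the scraped page. The sketch below is an assumption about how that might look, not the project's actual code: to stay within the standard library it registers two stand-in functions named after sprig's `upper` and `trim` rather than importing sprig, and the page data keys (`Title`, `URL`) are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"text/template"
)

// funcs stands in for sprig's function map so this sketch needs only the
// standard library. sprig offers functions with these names; a real
// implementation would presumably register sprig's own FuncMap instead.
var funcs = template.FuncMap{
	"upper": strings.ToUpper,
	"trim":  strings.TrimSpace,
}

// renderSelector evaluates a template-based selector against page data.
// The keys in data are hypothetical and chosen for illustration.
func renderSelector(tmpl string, data map[string]string) (string, error) {
	t, err := template.New("selector").Funcs(funcs).Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	page := map[string]string{
		"Title": "  Super-Simple Scraper  ",
		"URL":   "https://example.com/docs",
	}
	out, err := renderSelector(`{{ .Title | trim | upper }} ({{ .URL }})`, page)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

The appeal of templates over plain CSS selectors is that a field can be cleaned up or combined with other values before it reaches the output.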

See the example configuration. Many of its options are passed straight through to their Colly equivalents.

We have an image on DockerHub, so after installing Docker and jq, something like this will work:

docker run -it -v `pwd`:/go/src/app -e "CONFIG=$(cat ./path/to/your/config.json | jq -r tostring)" gotripod/ssscraper:main
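The command above hands the whole configuration to the container as a single JSON string in the `CONFIG` environment variable. As a rough sketch of how a Go program could consume it: the `Config` fields below are hypothetical, named after common Colly collector options, so check the example configuration for the real schema.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// Config uses hypothetical field names modelled on Colly collector options
// (allowed domains, maximum depth, user agent). The real schema may differ.
type Config struct {
	AllowedDomains []string `json:"allowedDomains"`
	MaxDepth       int      `json:"maxDepth"`
	UserAgent      string   `json:"userAgent"`
}

// parseConfig decodes the JSON string passed in through the environment.
func parseConfig(raw string) (Config, error) {
	var cfg Config
	if raw == "" {
		return cfg, errors.New("CONFIG is empty")
	}
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return cfg, fmt.Errorf("parsing CONFIG: %w", err)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig(os.Getenv("CONFIG"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Running the config through jq first collapses it to one line, which keeps the value safe to pass as a single `-e` argument.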

The manual method is:

docker build -t ssscraper .
docker run -v `pwd`:/go/src/app -it --rm --name ssscraper-ahoy ssscraper

# you're now in the docker container

cd src/app
go build
./ssscraper

To develop in VS Code, install the Dev Containers extension (formerly Remote - Containers), then clone the repo and open its directory in the container.

Planned improvements:

  • Webhook support – POST the output to a URL on completion
  • Different output formats
  • Custom weighting for selectors
  • Extract the selector/template logic to a common function
  • Add Word doc support

GitHub

https://github.com/gotripod/ssscraper/



