From 4c9a7dfaf4855c8b0e98c59306fd5bdb3b51ca58 Mon Sep 17 00:00:00 2001
From: Austin Davis
Date: Thu, 29 Feb 2024 00:39:17 -0700
Subject: [PATCH] docs

---
 README.md          | 16 ++++++++++++++++
 headless/README.md | 31 +++++++++++++++++++++++++++++++
 proxy/README.md    |  7 +++++++
 scraper/README.md  |  6 ++++++
 4 files changed, 60 insertions(+)
 create mode 100644 headless/README.md
 create mode 100644 proxy/README.md
 create mode 100644 scraper/README.md

diff --git a/README.md b/README.md
index b97e26f..f163fb5 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,19 @@
 # job-scraper
 A timed event that once a day scraps relevant jobs links and sends them to discord.
 ![job-scraper (1)](https://github.com/austin1237/job-scraper/assets/1394341/39688936-66f2-4819-93bf-fcafb83930c4)
+
+## Deployment
+Deployment currently uses [Terraform](https://www.terraform.io/) to set up AWS services.
+### Prerequisites
+This repo needs a private [Amazon ECR repo](https://us-east-1.console.aws.amazon.com/ecr/repositories?region=us-east-1) created in the same region that the container-based lambda is deployed to (in our case, us-east-1). Name the private repo `headless`.
+
+### Setting up remote state
+Terraform has a feature called [remote state](https://www.terraform.io/docs/state/remote.html) which keeps the state of your infrastructure in sync across multiple team members as well as any CI system.
+
+This project **requires** this feature to be configured. To configure it, **RUN THE FOLLOWING COMMANDS ONCE PER TEAM**.
+
+```bash
+cd terraform/remote-state
+terraform init
+terraform apply
+```
\ No newline at end of file
diff --git a/headless/README.md b/headless/README.md
new file mode 100644
index 0000000..5ccff97
--- /dev/null
+++ b/headless/README.md
@@ -0,0 +1,31 @@
+# headless
+A lambda that invokes a headless browser to render a page (including its JavaScript) and passes along the rendered HTML.
+
+## Why is this lambda using a container deployment rather than the standard zip deployment?
+[Puppeteer](https://pptr.dev/) requires a Chrome/Chromium binary, which exceeds the standard [lambda size limit](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html#function-configuration-deployment-and-execution). Using a container image greatly increases the limit and allows the binary to be deployed. Currently this service also uses [@sparticuz/chromium](https://github.com/Sparticuz/chromium) because the standard Puppeteer Chromium install has permission issues when running in the deployed AWS environment.
+
+## Prerequisites
+You must have the following installed/configured on your system for this to work correctly:
+1. [Docker](https://www.docker.com/)
+2. [Docker-Compose](https://docs.docker.com/compose/)
+
+## Development Environment
+The development environment uses a pinned version of [AWS's Node 18 image](https://gallery.ecr.aws/lambda/nodejs) to mimic the running lambda.
+
+```bash
+docker-compose up
+```
+
+The output is similar to what you would see in CloudWatch logs, e.g.:
+
+```bash
+headless-lambda-1 | 18 Aug 2023 09:47:04,515 [INFO] (rapid) exec '/var/runtime/bootstrap' (cwd=/var/task, handler=)
+```
+
+The local container's endpoint is `localhost:3000/2015-03-31/functions/function/invocations`; send a POST request with the following body:
+```json
+{
+  "queryStringParameters": {
+    "url": "https://www.google.com"
+  }}
+```
\ No newline at end of file
diff --git a/proxy/README.md b/proxy/README.md
new file mode 100644
index 0000000..4234a87
--- /dev/null
+++ b/proxy/README.md
@@ -0,0 +1,7 @@
+# Proxy
+This is a Go lambda that receives a URL as a query string and passes along that website's HTML. This lambda does not render any JavaScript; for that functionality, see the `headless` folder.
+
+## Prerequisites
+You must have the following installed/configured on your system for this to work correctly:
+1. [Go](https://go.dev/doc/install)
+
diff --git a/scraper/README.md b/scraper/README.md
new file mode 100644
index 0000000..b53d317
--- /dev/null
+++ b/scraper/README.md
@@ -0,0 +1,6 @@
+# Scraper
+This is a Go lambda that goes through the proxy API to receive website HTML. Once received, it parses the HTML and does a keyword check on the job description. If any keyword exists in the description, the job link and company are sent to Discord for manual review.
+
+## Prerequisites
+You must have the following installed/configured on your system for this to work correctly:
+1. [Go](https://go.dev/doc/install)
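
The keyword check that scraper/README.md describes (scan a job description, flag it if any keyword matches) could be sketched roughly as below. This is a minimal illustration, not the project's actual code; the function and variable names are hypothetical, and a case-insensitive substring match is assumed:

```go
package main

import (
	"fmt"
	"strings"
)

// containsKeyword reports whether any of the given keywords appears in the
// job description. Matching here is a case-insensitive substring check;
// the real scraper may match differently.
func containsKeyword(description string, keywords []string) bool {
	lowered := strings.ToLower(description)
	for _, kw := range keywords {
		if strings.Contains(lowered, strings.ToLower(kw)) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical sample input, for illustration only.
	keywords := []string{"golang", "terraform"}
	desc := "We are hiring a Golang engineer to build AWS lambdas."
	fmt.Println(containsKeyword(desc, keywords)) // prints "true" for this sample
}
```

A description that matches would then have its job link and company posted to Discord for manual review, per the README above.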