Automatic monthly processing of HackerNews' Who's Hiring thread into markdown tables


Amber Williams

Last edited - February 20, 2024


Navigating the dense and often unstructured world of HackerNews hiring threads can be akin to finding a needle in a haystack.

There are already several similar solutions out there, but let's build our own anyway.

GitHub repo here

Collecting HackerNews hiring thread data

I first performed a Google search to find the latest HackerNews hiring threads for a specific month and year, and then extracted the top link to the target thread. Next, I used BeautifulSoup to scrape data from this thread, focusing on comments structured like job listings. Finally, I compiled these comments into a dictionary and continued scraping across multiple pages to gather all relevant hiring data.
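The scraping code isn't reproduced in this post, so here's a minimal sketch of the idea, assuming the requests/BeautifulSoup stack. The thread id is a placeholder, and the CSS selectors (td.ind, commtext, a.morelink) are based on HackerNews' current comment markup, so they may differ from what the project actually uses.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder thread URL; in practice the id comes from the Google search step.
THREAD_URL = "https://news.ycombinator.com/item?id=00000000"


def scrape_thread(url: str) -> dict[str, str]:
    """Collect top-level comments (the job listings) keyed by comment id."""
    listings: dict[str, str] = {}
    page = 1
    while True:
        resp = requests.get(url, params={"p": page}, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        for row in soup.select("tr.athing.comtr"):
            # Only top-level comments (indent width 0) are job posts; replies are nested.
            indent_img = row.select_one("td.ind img")
            if indent_img is None or indent_img.get("width") != "0":
                continue
            text = row.select_one("div.commtext, span.commtext")
            if text:
                listings[row["id"]] = text.get_text(" ", strip=True)

        # Keep following the "More" link until pagination runs out.
        if soup.select_one("a.morelink") is None:
            break
        page += 1
    return listings
```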

I realize now that these threads have an id that makes the Google search redundant. Oh well.

Token Size Estimation and GPT Prompt Formulation

Estimating the token size of each data point was done using OpenAI's tiktoken. This estimation was crucial for batching the rows into OpenAI requests, optimizing both speed and expenditure.
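The post doesn't state which model or exact token budget was used, so here's a rough sketch of the counting-and-batching step, assuming a greedy packing strategy and an illustrative limit:

```python
import tiktoken

# Assumed model/encoding and budget; the post doesn't specify either.
ENCODING = tiktoken.encoding_for_model("gpt-3.5-turbo")
MAX_TOKENS_PER_BATCH = 3000


def batch_rows(rows: list[str]) -> list[list[str]]:
    """Greedily pack listings into batches that stay under the token budget."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for row in rows:
        n = len(ENCODING.encode(row))
        if current and used + n > MAX_TOKENS_PER_BATCH:
            batches.append(current)
            current, used = [], 0
        current.append(row)
        used += n
    if current:
        batches.append(current)
    return batches
```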

Then came the task of formulating the GPT prompt (a sketch follows the list below), which included:

  • Data Extraction: Transforming the raw scraped rows into structured columns of interest, like salary.

  • Few-Shot Prompting: A technique to enable in-context learning where we provide demonstrations in the prompt to steer the model to better performance.

  • JSON Mode: Enforcing output to conform to valid JSON.

  • Error Handling
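Putting those pieces together, a hedged sketch of such a prompt is below. The model name, the few-shot example, and the columns beyond salary (company, role, location, remote) are illustrative rather than taken from the actual project.

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Extract job listings into JSON with a `listings` array. "
    "Each listing has: company, role, location, remote, salary. "
    "Use null when a field is not stated."
)

# A single few-shot demonstration to steer the output shape.
FEW_SHOT_USER = "Acme Corp | Senior Backend Engineer | Berlin or remote (EU) | EUR 80k-95k"
FEW_SHOT_ASSISTANT = json.dumps({
    "listings": [{
        "company": "Acme Corp",
        "role": "Senior Backend Engineer",
        "location": "Berlin",
        "remote": True,
        "salary": "EUR 80k-95k",
    }]
})


def extract_batch(raw_listings: list[str]) -> dict:
    """Send one batch of scraped listings to the model and parse its JSON reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",  # assumed; any JSON-mode capable model works
        response_format={"type": "json_object"},  # JSON mode
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": FEW_SHOT_USER},
            {"role": "assistant", "content": FEW_SHOT_ASSISTANT},
            {"role": "user", "content": "\n\n".join(raw_listings)},
        ],
    )
    try:
        return json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        # Basic error handling: skip a batch whose output can't be parsed.
        return {"listings": []}
```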

Processing and cleaning

Once GPT had processed the data, it was time to rejoin the batches into a cohesive set. This was followed by a thorough cleaning pass to make sure the data was accurate and useful.
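A sketch of the rejoin-and-clean step, assuming the processed batches land on disk as CSVs and that a company column exists (the file names and cleaning rules here are illustrative, not the project's exact pipeline):

```python
import glob

import pandas as pd


def rejoin_and_clean(batch_glob: str = "batch-*.csv") -> pd.DataFrame:
    """Concatenate the processed batches and apply basic cleaning."""
    frames = [pd.read_csv(path) for path in sorted(glob.glob(batch_glob))]
    df = pd.concat(frames, ignore_index=True)

    # Illustrative cleaning: drop exact duplicates and rows with no company name.
    df = df.drop_duplicates()
    df["company"] = df["company"].str.strip()
    df = df.dropna(subset=["company"])
    return df


rejoin_and_clean().to_csv("clean.csv", index=False)
```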

Automation

To keep this process up-to-date without monthly manual intervention, I set up a monthly GitHub Action cron job. This job scrapes the latest data, processes it, and commits the updated table data to the repository, with a link provided in the README. Finally, it opens an easy-to-manage pull request assigned to me.
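For the committed artifact itself, the markdown table can be generated from the cleaned CSV in a few lines; the output filename below is hypothetical, and pandas' to_markdown needs the tabulate package installed:

```python
import pandas as pd

# Render the cleaned data as the markdown table the Action commits each month.
df = pd.read_csv("clean.csv")
with open("hiring-table.md", "w") as f:  # hypothetical output path
    f.write(df.to_markdown(index=False))
```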

Putting it all together

graph TB
    AAA[scrape hackernews] --> AA{Token counter} --> A
    A[scraped.csv] -. chunk data .-> B(batch-0.csv)
    A -. chunk data .-> C(batch-1.csv)
    A -. chunk data .-> D(batch-2.csv)

    B -..-> Q[Processing Queue]
    C -..-> Q
    D -..-> Q

    Q --> E{LLM System<br/>data extraction}
    E -. batch n .-> E

    E --> | Rejoin processed batches | F[summary.csv]
    F --> G{Clean data}
    G --> H[clean.csv]

Imperfect results

This project, while not perfect and certainly in need of improvements, showcases the practical application of web scraping, AI, and automation in extracting and analyzing data from unstructured sources. It was an engaging journey to automate this process, highlighting how these technologies can be effectively utilized for insightful data analysis. Looking ahead, I aim to refine the system and compare parsed data over time to reveal emerging trends.