Automatic monthly processing of Hacker News "Who's Hiring" threads into markdown tables
Amber Williams
Last edited - February 20, 2024
Navigating the dense and often unstructured world of Hacker News hiring threads can be akin to finding a needle in a haystack.
There are already several similar solutions out there, but let's build our own anyway.
Collecting Hacker News hiring thread data
I first performed a Google search to find the latest Hacker News hiring thread for a given month and year, then extracted the top link to the target thread. Next, I used BeautifulSoup to scrape that thread, focusing on comments structured like job listings. Finally, I compiled these comments into a dictionary, continuing across the thread's multiple pages to gather all relevant hiring data.
I realize now there is an id for these sorts of threads that makes the Google search redundant. Oh well.
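As a rough sketch of the scraping step (not the project's exact code): it assumes the thread's item id is already known, requests each page of the thread, and keeps only the top-level comments, which are the job listings. The example id, the CSS selectors, and the page limit are assumptions and may need adjusting to Hacker News' current markup.

```python
# Minimal sketch: collect top-level comments from a Hacker News
# "Who is hiring?" thread. THREAD_ID is a placeholder; selectors
# reflect HN's markup as I understand it and may need tweaking.
import requests
from bs4 import BeautifulSoup

THREAD_ID = "39217310"  # hypothetical example id
BASE_URL = f"https://news.ycombinator.com/item?id={THREAD_ID}"

def scrape_thread(url: str, max_pages: int = 10) -> dict[str, str]:
    listings: dict[str, str] = {}
    for page in range(1, max_pages + 1):
        resp = requests.get(url, params={"p": page}, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        rows = soup.select("tr.athing.comtr")
        if not rows:
            break  # ran out of pages
        for row in rows:
            indent_img = row.select_one("td.ind img")
            text_el = row.select_one("div.commtext") or row.select_one("span.commtext")
            # Top-level comments (indent width 0) are the job listings.
            if indent_img and int(indent_img.get("width", 0)) == 0 and text_el:
                listings[row["id"]] = text_el.get_text(" ", strip=True)
    return listings

listings = scrape_thread(BASE_URL)
print(f"Collected {len(listings)} listings")
```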
Token size estimation and GPT prompt formulation
Estimating the token size of each data point was done using OpenAI's tiktoken. This estimation was crucial for batching the rows into OpenAI requests, optimizing both speed and expenditure.
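For illustration, here is a minimal sketch of how that batching might look with tiktoken; the encoding name and the per-batch token budget are assumptions rather than the project's actual numbers.

```python
# Minimal sketch: estimate token counts with tiktoken and pack listings
# into batches that stay under a per-request budget.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
TOKEN_BUDGET = 3000  # hypothetical per-batch budget, leaving room for the prompt

def batch_listings(listings: dict[str, str],
                   budget: int = TOKEN_BUDGET) -> list[list[tuple[str, str]]]:
    batches, current, current_tokens = [], [], 0
    for comment_id, text in listings.items():
        n_tokens = len(enc.encode(text))
        # Start a new batch once adding this listing would blow the budget.
        if current and current_tokens + n_tokens > budget:
            batches.append(current)
            current, current_tokens = [], 0
        current.append((comment_id, text))
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches
```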
Then came the task of formulating the GPT prompt (see the sketch after this list), which included:
- Data Extraction: Transforming the raw scraped rows into structured columns of interest, like `salary`.
- Few-Shot Prompting: A technique to enable in-context learning where we provide demonstrations in the prompt to steer the model to better performance.
- JSON Mode: Enforcing output to conform to valid JSON.
- Error Handling
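Putting those pieces together, a minimal sketch of the per-batch extraction call could look like the following. The model name, column set, and example listing are placeholders, not the project's actual prompt or schema.

```python
# Minimal sketch of the per-batch extraction call: few-shot example,
# JSON mode, and basic error handling. Model, columns, and the example
# listing are assumptions for illustration.
import json
from openai import OpenAI, OpenAIError

client = OpenAI()

SYSTEM_PROMPT = (
    "Extract job listings into JSON with a top-level key 'rows', where each "
    "row has: company, role, location, remote, salary. Use null when unknown."
)

# One demonstration pair for few-shot / in-context learning.
EXAMPLE_INPUT = "Acme | Senior Backend Engineer | Remote (US) | $150k-$180k"
EXAMPLE_OUTPUT = json.dumps({"rows": [{
    "company": "Acme", "role": "Senior Backend Engineer",
    "location": "US", "remote": True, "salary": "$150k-$180k",
}]})

def extract_batch(batch: list[tuple[str, str]], retries: int = 2) -> list[dict]:
    batch_text = "\n\n".join(text for _, text in batch)
    for attempt in range(retries + 1):
        try:
            response = client.chat.completions.create(
                model="gpt-3.5-turbo-0125",  # assumed model
                response_format={"type": "json_object"},  # JSON mode
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": EXAMPLE_INPUT},
                    {"role": "assistant", "content": EXAMPLE_OUTPUT},
                    {"role": "user", "content": batch_text},
                ],
            )
            return json.loads(response.choices[0].message.content)["rows"]
        except (OpenAIError, json.JSONDecodeError, KeyError):
            if attempt == retries:
                raise  # give up after the final retry
    return []
```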
Processing and cleaning
Once GPT had processed the batches, it was time to rejoin them into a cohesive dataset. This was followed by a thorough cleaning pass to ensure the data was accurate and useful.
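A minimal sketch of that rejoin-and-clean step, assuming pandas, that each processed batch is written back out as a CSV, and the file names shown in the diagram below; the actual cleaning rules are more involved.

```python
# Minimal sketch: rejoin processed batch files and apply basic cleaning.
# Paths, column names, and cleaning rules are assumptions.
from pathlib import Path
import pandas as pd

batch_files = sorted(Path("data").glob("batch-*.csv"))
summary = pd.concat((pd.read_csv(f) for f in batch_files), ignore_index=True)
summary.to_csv("data/summary.csv", index=False)

clean = (
    summary
    .drop_duplicates()
    .dropna(subset=["company", "role"])  # assumed required columns
    .assign(company=lambda df: df["company"].str.strip())
)
clean.to_csv("data/clean.csv", index=False)
```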
Automation
To keep this process up-to-date without monthly manual intervention, I set up a monthly GitHub Actions cron job. This job scrapes the latest data, processes it, and commits the updated table data to a repository, with a link provided in the README. It then opens an easy-to-manage pull request assigned to me.
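For reference, a workflow along these lines could look roughly like this; the schedule, entry-point script, assignee, and action versions are assumptions, not the repository's actual workflow file.

```yaml
# Rough sketch of a monthly scrape-and-PR workflow, not the repo's real file.
name: monthly-hiring-scrape
on:
  schedule:
    - cron: "0 6 2 * *"   # shortly after the new thread appears each month
  workflow_dispatch: {}
jobs:
  scrape-and-open-pr:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python main.py   # scrape, batch, extract, clean (hypothetical entry point)
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - uses: peter-evans/create-pull-request@v6
        with:
          commit-message: "Add hiring table for the latest month"
          title: "Monthly hiring table update"
          assignees: amberwilliams   # placeholder username
```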
Putting it all together
graph TB
AAA[scrape hackernews] --> AA{Token counter} --> A
A[scraped.csv] -. chunk data .-> B(batch-0.csv)
A -. chunk data .-> C(batch-1.csv)
A -. chunk data .-> D(batch-2.csv)
B -..-> Q[Processing Queue]
C -..-> Q
D -..-> Q
Q --> E{LLM System<br/>data extraction}
E -. batch n .-> E
E --> | Rejoin processed batches | F[summary.csv]
F --> G{Clean data}
G --> H[clean.csv]
Imperfect results
This project, while not perfect and certainly in need of improvements, showcases the practical application of web scraping, AI, and automation in extracting and analyzing data from unstructured sources. Automating the process was an engaging journey, and it highlights how these technologies can be put to work for insightful data analysis. Looking ahead, I aim to refine the system and compare parsed data over time to reveal emerging trends.