Bundled RDS shrunk from 6.4 MB to 3.7 MB by storing only 14 core columns with xz compression. The 82 indicator columns are now computed on the fly by gao_links() and cached in memory.
New gao_update_data() function downloads the latest data from GitHub Releases using base R download.file() — no curl-impersonate needed. gao_links() checks for user-local cached data before the bundled copy.
auto_download() now offers to check for updated data in interactive sessions before proceeding.
Daily CI workflow now uploads the RDS to a pinned GitHub Release (data-latest) for gao_update_data() to fetch.
Fixed missing lifecycle-deprecated.svg badge referenced by extract_pdf_links.Rd.
Updated CITATION.cff to match current version and license.
gao 0.4.0
New extract_text() function for extracting text from downloaded PDFs. Requires pdftools (added to Suggests).
Bundled dataset now includes page_count, topics, and subject_terms columns. gao_links() returns a 9-column data.frame.
Full metadata backfill: title, published, released are now 100% populated across all 56,000+ reports. summary at 97.5%.
Missing report_id values filled from URL slugs (now 100% complete).
Page counts extracted from 55,000+ PDF archive and matched to metadata via URL slug and report ID (80.7% coverage).
Daily CI workflow now backfills page_count for newly added reports.
Fixed update_links() column mismatch when bundled data has columns that new scrape results lack.
License changed from MIT to GPL (>= 3).
gao 0.3.0
Breaking:gao_links() now returns a data.frame with columns url, title, report_id, published, released, and summary instead of a character vector.
Bundled dataset switched from text (.txt) to RDS (.rds) for compression with rich metadata.
Year filtering in auto_download() now uses published date and fiscal year calculation instead of regex on report IDs, fixing ~29% of reports with legacy ID formats that previously yielded NA years.
extract_links() and update_links() now return data.frames with full report metadata.