Changelog • gao

gao 0.5.0

CRAN preparation release.
Bundled RDS shrunk from 6.4 MB to 3.7 MB by storing only 14 core columns with xz compression. The 82 indicator columns are now computed on the fly by gao_links() and cached in memory.
New gao_update_data() function downloads the latest data from GitHub Releases using base R download.file(), so curl-impersonate is not needed. gao_links() now checks for user-local cached data before falling back to the bundled copy.
auto_download() now offers to check for updated data in interactive sessions before proceeding.
Daily CI workflow now uploads the RDS to a pinned GitHub Release (data-latest) for gao_update_data() to fetch.
Fixed missing lifecycle-deprecated.svg badge referenced by extract_pdf_links.Rd.
Updated CITATION.cff to match current version and license.
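
The cache-before-bundled lookup described above can be sketched as follows. This is a minimal illustration, not the package's actual code: the helper name resolve_data_path(), the file name gao_links.rds, and the temporary directories are all assumptions made for the example.

```r
# Hypothetical helper (not part of gao): return the cached file if it exists,
# otherwise fall back to the copy shipped with the package.
resolve_data_path <- function(cache_dir, bundled_path, file = "gao_links.rds") {
  cached <- file.path(cache_dir, file)
  if (file.exists(cached)) {
    return(cached)
  }
  bundled_path
}

# Temporary files stand in for the real bundled and cached locations.
bundled <- tempfile(fileext = ".rds")
saveRDS(data.frame(url = "bundled"), bundled, compress = "xz")

cache_dir <- tempfile("cache")
dir.create(cache_dir)

# No cached file yet, so the bundled copy is used.
path_before <- resolve_data_path(cache_dir, bundled)

# After a (simulated) update writes to the cache, the cache wins.
saveRDS(data.frame(url = "cached"),
        file.path(cache_dir, "gao_links.rds"),
        compress = "xz")
path_after <- resolve_data_path(cache_dir, bundled)
```

In a real package the cache directory would more likely come from something like tools::R_user_dir("gao", which = "cache") (available in R 4.0.0 and later) than from a temporary directory.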

New extract_text() function for extracting text from downloaded PDFs. Requires pdftools (added to Suggests).
Bundled dataset now includes page_count, topics, and subject_terms columns. gao_links() returns a 9-column data.frame.
Full metadata backfill: title, published, and released are now 100% populated across all 56,000+ reports; summary is 97.5% populated.
Missing report_id values filled from URL slugs (now 100% complete).
Page counts extracted from an archive of 55,000+ PDFs and matched to metadata via URL slug and report ID (80.7% coverage).
Daily CI workflow now backfills page_count for newly added reports.
Fixed update_links() column mismatch when bundled data has columns that new scrape results lack.
License changed from MIT to GPL (>= 3).
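
The column mismatch fixed in update_links() is the usual rbind() failure when two data.frames have different column sets. One way to handle it, shown here as a sketch rather than the fix actually applied, is to add the missing columns as NA before binding. The helper name bind_with_missing() and the sample rows are hypothetical.

```r
# Hypothetical helper (not part of gao): append `new` rows to `old`, filling
# any columns that `new` lacks with NA so both sides have the same columns.
# Columns present only in `new` are dropped by the final subsetting step.
bind_with_missing <- function(old, new) {
  missing_cols <- setdiff(names(old), names(new))
  for (col in missing_cols) {
    new[[col]] <- rep(NA, nrow(new))
  }
  rbind(old, new[names(old)])
}

# Illustrative data: the bundled table has page_count, a fresh scrape does not.
old <- data.frame(url = "a", title = "A", page_count = 10L,
                  stringsAsFactors = FALSE)
new <- data.frame(url = "b", title = "B", stringsAsFactors = FALSE)

combined <- bind_with_missing(old, new)
```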

Breaking: gao_links() now returns a data.frame with columns url, title, report_id, published, released, and summary instead of a character vector.
Bundled dataset switched from text (.txt) to RDS (.rds), which allows compression and richer metadata.
Year filtering in auto_download() now uses published date and fiscal year calculation instead of regex on report IDs, fixing ~29% of reports with legacy ID formats that previously yielded NA years.
extract_links() and update_links() now return data.frames with full report metadata.
Fixed R-CMD-check GitHub Action syntax error (args parameter).
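
A date-based fiscal year calculation of the kind mentioned above might look like the sketch below. It assumes the US federal convention, in which a fiscal year starts on October 1 and is named for the calendar year in which it ends; whether auto_download() applies exactly this rule is an assumption. The function name fiscal_year() and the sample report IDs and dates are made up for illustration.

```r
# Hypothetical helper (not part of gao): map a publication date to a fiscal
# year, assuming fiscal years begin on October 1.
fiscal_year <- function(published) {
  published <- as.Date(published)
  year  <- as.integer(format(published, "%Y"))
  month <- as.integer(format(published, "%m"))
  # October through December belong to the next fiscal year.
  year + (month >= 10L)
}

# Illustrative rows; the IDs and dates are invented for this example.
reports <- data.frame(
  report_id = c("example-1", "example-2", "example-3"),
  published = as.Date(c("2023-09-30", "2023-10-01", "2024-03-15")),
  stringsAsFactors = FALSE
)

reports$fiscal_year <- fiscal_year(reports$published)

# Filtering on the computed year works regardless of the report ID format.
fy2024 <- reports[reports$fiscal_year == 2024L, ]
```

Because the year comes from the date rather than from the ID string, a report whose ID does not follow the expected pattern still gets a year as long as its published date is present.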

Added auto_download() convenience wrapper that handles the full pipeline (load links, filter by year, download as PDF/HTML) in one call.
PDF URLs are now constructed directly from report IDs, avoiding one HTTP request per report compared to extract_pdf_links().
Interactive sessions prompt for format and year range when those arguments are omitted.
Non-interactive safety: in non-interactive sessions, confirm = TRUE raises an error, so confirm must be set to FALSE explicitly; this prevents accidental mass downloads.
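
The non-interactive safety check can be sketched roughly as below. The function name check_confirm(), its error message, and the is_interactive argument are assumptions for illustration; the argument exists only so the behaviour can be exercised without an interactive session.

```r
# Hypothetical guard (not part of gao): refuse to proceed in a non-interactive
# session unless the caller has explicitly passed confirm = FALSE.
check_confirm <- function(confirm = TRUE, is_interactive = interactive()) {
  if (!is_interactive && !isFALSE(confirm)) {
    stop("Refusing to download in a non-interactive session; ",
         "set confirm = FALSE to proceed.", call. = FALSE)
  }
  invisible(TRUE)
}

# Simulate a script run that leaves confirm unset: the guard raises an error,
# which is captured here as a message string.
result <- tryCatch(
  check_confirm(is_interactive = FALSE),
  error = function(e) conditionMessage(e)
)

# An explicit opt-out passes the guard.
ok <- check_confirm(confirm = FALSE, is_interactive = FALSE)
```

Failing closed in scripts means a scheduled job cannot start a bulk download just because an argument was left out.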