Conversation
Pull request overview
This PR adds two new Python scripts to calculate financial margins from SEC 10-K filings. Both scripts fetch company financial data and compute key margins (gross margin, operating margin, net profit margin, and EBITDA margin), but use different approaches:
Changes:
- Added `calculate_margins_xbrl.py`, which uses SEC's XBRL companyfacts API for deterministic data extraction
- Added `calculate_margins_llm.py`, which uses HTML parsing combined with DeepSeek LLM for financial statement extraction
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 27 comments.
| File | Description |
|---|---|
| calculate_margins_xbrl.py | Implements deterministic margin calculation by querying SEC XBRL companyfacts API with fallback concept matching for various financial statement items |
| calculate_margins_llm.py | Implements margin calculation by downloading 10-K HTML, extracting income statement tables, and using DeepSeek LLM to parse the financial data |
```
python3 tenk_margins_xbrl.py TICKER [YEAR]

Examples:
python3 tenk_margins_xbrl.py AAPL 2023
python3 tenk_margins_xbrl.py PGR 2024  # insurers should also work
```
The usage comment refers to 'tenk_margins_xbrl.py' but the actual filename is 'calculate_margins_xbrl.py'. This inconsistency will confuse users trying to run the script.
Suggested change:

```diff
-python3 tenk_margins_xbrl.py TICKER [YEAR]
-Examples:
-python3 tenk_margins_xbrl.py AAPL 2023
-python3 tenk_margins_xbrl.py PGR 2024  # insurers should also work
+python3 calculate_margins_xbrl.py TICKER [YEAR]
+Examples:
+python3 calculate_margins_xbrl.py AAPL 2023
+python3 calculate_margins_xbrl.py PGR 2024  # insurers should also work
```
```python
def main():
    if len(sys.argv) < 2:
        print("Usage: python tenk_margins.py TICKER [YEAR]")
```
The usage example refers to 'tenk_margins.py' but the actual filename is 'calculate_margins_llm.py'. This inconsistency will confuse users trying to run the script.
| print("Usage: python tenk_margins.py TICKER [YEAR]") | |
| print("Usage: python calculate_margins_llm.py TICKER [YEAR]") |
```python
    Send the income statement table to DeepSeek and ask for structured JSON.
    Tries to be robust to non‑pure‑JSON responses.
    """
    if not DEEPSEEK_API_KEY:
```
The DEEPSEEK_API_KEY is retrieved from environment variables without validation. An empty string is falsy and would be caught, but a whitespace-only value is truthy, so the check on line 199 would pass and the API call would likely fail. Consider also rejecting blank values: 'if not DEEPSEEK_API_KEY or not DEEPSEEK_API_KEY.strip():'
Suggested change:

```diff
-    if not DEEPSEEK_API_KEY:
+    if not DEEPSEEK_API_KEY or not str(DEEPSEEK_API_KEY).strip():
```
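For illustration, a minimal sketch of the stricter check, assuming the key is read via `os.environ.get("DEEPSEEK_API_KEY", "")` (the exact retrieval in the script may differ):

```python
import os

# Assumed retrieval; the real script may read the variable differently.
DEEPSEEK_API_KEY = os.environ.get("DEEPSEEK_API_KEY", "")

def have_usable_key(key) -> bool:
    # Rejects None, "", and whitespace-only values alike.
    return bool(key and str(key).strip())

for candidate in (None, "", "   ", "sk-example"):
    print(repr(candidate), have_usable_key(candidate))
# None False, '' False, '   ' False, 'sk-example' True
```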
```python
        if require_fy:
            fp = f.get("fp")
            # 'FY' or sometimes 'Y'
            if fp not in ("FY", "Y", None):
```
The fiscal period filter accepts 'FY', 'Y', or None. Accepting None when require_fy=True is contradictory - if fp is None, we cannot verify it's a full year fiscal period. This could lead to selecting facts from quarterly filings when a full year is required. Consider removing None from the accepted values when require_fy=True, or add a comment explaining why None is acceptable.
Suggested change:

```diff
-            if fp not in ("FY", "Y", None):
+            if fp not in ("FY", "Y"):
```
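To make the behavioral difference concrete, here is a hedged sketch of a full-year filter using the stricter check; the fact fields (`fp`, `fy`, `val`) follow the SEC companyfacts format, but the surrounding function and sample values are illustrative only:

```python
# Hypothetical filter: keep only facts explicitly labelled as full fiscal years.
def full_year_facts(facts, require_fy=True):
    kept = []
    for f in facts:
        if require_fy and f.get("fp") not in ("FY", "Y"):
            continue  # drops quarterly ("Q1".."Q4") and unlabelled (None) facts
        kept.append(f)
    return kept

facts = [
    {"fp": "FY", "fy": 2023, "val": 383_285_000_000},
    {"fp": "Q2", "fy": 2023, "val": 94_836_000_000},
    {"fp": None, "fy": 2023, "val": 123_456},  # previously slipped through
]
print(len(full_year_facts(facts)))  # 1
```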
```python
            try:
                return float(val), concept_name
            except Exception:
                continue
```
Bare except clauses (except Exception) without logging or specific error handling silently ignore conversion errors. This makes debugging difficult when values cannot be converted to float. Consider logging the error or at least the problematic value before continuing to the next candidate.
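A minimal sketch of what the suggested logging could look like, wrapped in a hypothetical helper (`first_numeric_value` and its inputs are illustrative, not the script's actual function):

```python
import logging

logger = logging.getLogger(__name__)

def first_numeric_value(candidates):
    """Return (value, concept) for the first candidate that converts to float."""
    for concept_name, val in candidates:
        try:
            return float(val), concept_name
        except (TypeError, ValueError) as exc:
            # Log the offending value instead of silently skipping it.
            logger.debug("Skipping %s: cannot convert %r (%s)", concept_name, val, exc)
            continue
    return None, None

print(first_numeric_value([("Revenues", "N/A"), ("SalesRevenueNet", "1000")]))
# (1000.0, 'SalesRevenueNet')
```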
```python
    'CONSOLIDATED STATEMENTS OF OPERATIONS', 'STATEMENTS OF INCOME', etc.
    Return the HTML of that table as a string.
    """
    soup = BeautifulSoup(html, "lxml")
```
The BeautifulSoup parser uses "lxml", which requires the lxml package to be installed. However, this dependency is not documented in the script's docstring or in installation instructions. Because the parser is requested explicitly, BeautifulSoup will not silently fall back when lxml is missing; it raises bs4.FeatureNotFound and the script fails at this call. Consider documenting the lxml dependency, or explicitly falling back to 'html.parser' (noting that the two parsers can produce slightly different results).
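If the intent is to keep the script running without lxml, a fallback along these lines is one option (a sketch, not the script's current behavior):

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html: str) -> BeautifulSoup:
    """Prefer lxml when installed; otherwise fall back to the stdlib parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # html.parser needs no extra dependency but may parse edge cases differently.
        return BeautifulSoup(html, "html.parser")
```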
```python
def get_ticker_cik_map() -> Dict[str, int]:
    """
    Download the official SEC ticker -> CIK mapping JSON and return a dict
    {ticker_upper: cik_int}.
    """
    tickers_url = "https://www.sec.gov/files/company_tickers.json"

    # Use proper headers for www.sec.gov
    headers = {
        "User-Agent": "MyResearchScript/1.0 your_email@example.com",
        "Accept-Encoding": "gzip, deflate",
    }

    resp = requests.get(tickers_url, headers=headers)
    resp.raise_for_status()
    data = resp.json()

    mapping = {}
    # Format: { "0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."}, ...}
    for _, entry in data.items():
        ticker = entry["ticker"].upper()
        cik = int(entry["cik_str"])
        mapping[ticker] = cik
    return mapping


def get_cik_for_ticker(ticker: str) -> int:
    mapping = get_ticker_cik_map()
    t = ticker.upper()
    if t not in mapping:
        raise ValueError(f"Ticker {ticker} not found in SEC ticker list.")
    return mapping[t]
```
The functions 'get_ticker_cik_map', 'get_cik_for_ticker', and 'get_10k_metadata' are duplicated between 'calculate_margins_xbrl.py' and 'calculate_margins_llm.py'. This code duplication violates the DRY (Don't Repeat Yourself) principle and makes maintenance harder. Consider extracting these common functions into a shared utility module that both scripts can import.
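One way to apply this suggestion is a small shared module imported by both scripts; the module name `sec_common.py` and the exact contents below are a hypothetical sketch (`get_10k_metadata` would move there as well):

```python
# sec_common.py -- hypothetical shared home for the duplicated SEC helpers.
from typing import Dict

import requests

SEC_HEADERS = {
    "User-Agent": "MyResearchScript/1.0 your_email@example.com",
    "Accept-Encoding": "gzip, deflate",
}

def get_ticker_cik_map() -> Dict[str, int]:
    resp = requests.get("https://www.sec.gov/files/company_tickers.json",
                        headers=SEC_HEADERS)
    resp.raise_for_status()
    return {e["ticker"].upper(): int(e["cik_str"]) for e in resp.json().values()}

def get_cik_for_ticker(ticker: str) -> int:
    mapping = get_ticker_cik_map()
    try:
        return mapping[ticker.upper()]
    except KeyError:
        raise ValueError(f"Ticker {ticker} not found in SEC ticker list.")

# calculate_margins_xbrl.py and calculate_margins_llm.py would then both use:
#     from sec_common import get_cik_for_ticker
```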
```python
def call_deepseek_for_income_statement(table_text: str) -> Dict[str, Any]:
    """
    Send the income statement table to DeepSeek and ask for structured JSON.
    Tries to be robust to non‑pure‑JSON responses.
```
The docstring uses a non-standard hyphen character (U+2011 NON-BREAKING HYPHEN) in "non‑pure‑JSON" instead of a standard ASCII hyphen. While this doesn't break functionality, it's inconsistent with standard Python coding conventions. Consider using regular ASCII hyphens for consistency.
Suggested change:

```diff
-    Tries to be robust to non‑pure‑JSON responses.
+    Tries to be robust to non-pure-JSON responses.
```
```python
OPERATING_INCOME_CONCEPTS = [
    "OperatingIncomeLoss",
    "IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest",
    "IncomeLossFromContinuingOperationsBeforeIncomeTaxesMinorityInterestAndIncomeLossFromEquityMethodInvestments",
```
The OPERATING_INCOME_CONCEPTS list includes "IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest" and "IncomeLossFromContinuingOperationsBeforeIncomeTaxesMinorityInterestAndIncomeLossFromEquityMethodInvestments" which represent income before taxes, not operating income. These concepts include interest and other non-operating items, making them different from operating income (which excludes interest and non-operating items). Using these as fallbacks for operating income will produce incorrect operating margins. Consider removing these concepts or adding a comment explaining why they are acceptable proxies in certain contexts.
Suggested change:

```diff
-OPERATING_INCOME_CONCEPTS = [
-    "OperatingIncomeLoss",
-    "IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest",
-    "IncomeLossFromContinuingOperationsBeforeIncomeTaxesMinorityInterestAndIncomeLossFromEquityMethodInvestments",
+# Only include true operating income concepts here. Pre-tax income concepts
+# (which include interest and other non-operating items) are intentionally
+# excluded to avoid mis-stating operating margins.
+OPERATING_INCOME_CONCEPTS = [
+    "OperatingIncomeLoss",
```
```python
    Convert an HTML table to a text representation suitable for LLM parsing.
    We'll create a simple tab-separated format.
    """
    soup = BeautifulSoup(table_html, "lxml")
```
The BeautifulSoup parser uses "lxml" but this dependency is not documented anywhere. Consider adding a note about required dependencies (requests, beautifulsoup4, lxml) in the module docstring.
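For example, the module docstring could spell out the third-party requirements along these lines (a sketch; the wording and the environment variable name are assumptions):

```python
"""
calculate_margins_llm.py

Compute gross, operating, net profit, and EBITDA margins from a company's
10-K by extracting the income statement table and parsing it with DeepSeek.

Third-party dependencies:
    pip install requests beautifulsoup4 lxml

    - requests        HTTP access to SEC endpoints and the DeepSeek API
    - beautifulsoup4  HTML parsing of the filing
    - lxml            parser backend used by BeautifulSoup(..., "lxml")

Environment:
    DEEPSEEK_API_KEY  API key used for the LLM extraction step
"""
```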