Skip to content

Instantly share code, notes, and snippets.

@thinkphp
Last active June 21, 2025 13:44
Show Gist options
  • Save thinkphp/eac67f4c6f770add8b4938799870e9f9 to your computer and use it in GitHub Desktop.
Save thinkphp/eac67f4c6f770add8b4938799870e9f9 to your computer and use it in GitHub Desktop.
example-NYSE R Data Science

Part 1: Stock Performance Analysis (25 points)

Dataset stock_performance.xlsx contains data on NYSE-listed companies' stock performance, market capitalization, and sector information. Variables are as follows:

  • ticker: Stock ticker symbol (e.g., AAPL, MSFT)
  • company_name: Full company name
  • sector: Business sector (Technology, Healthcare, Finance, Energy, Consumer, Industrial)
  • market_cap: Market capitalization in billions of USD
  • price_change_1y: One-year percentage price change
  • dividend_yield: Annual dividend yield as percentage
  • pe_ratio: Price-to-earnings ratio
  • volume_avg_30d: Average daily trading volume over 30 days (in millions)

The following questions concern stock_performance.xlsx.

  1. Report the mean and median of market_cap, in billions of USD (2 points).

  2. Within each sector, compute the mean and median one-year price change percentage. (12 points)

    Sector Technology Healthcare Finance Energy Consumer Industrial
    Mean Price Change ____% ____% ____% ____% ____% ____%
    Median Price Change ____% ____% ____% ____% ____% ____%
  3. Identify the top 5 stocks with the highest dividend yield. Report their ticker symbols in descending order of dividend yield. (5 points)

  4. Create a scatterplot that shows the relationship between market capitalization and average trading volume. The plot will be graded on the below components. (Total 6 points)

    • The plot is a scatterplot. (1 point)
    • The scatterplot has volume_avg_30d on the y-axis. (1 point)
    • The scatterplot has market_cap on the x-axis. (1 point)
    • The y-axis should run from 0 up to and including 200 (millions). (1 point)
    • The x-axis should run from 0 up to and including 3000 (billions). (1 point)
    • The plot contains a line of best fit with confidence band. (1 point)

Part 2: Trading Behavior Survey (25 points)

Dataset trader_survey.xlsx contains data on a survey of retail investors' trading behavior and demographics. Variables are as follows:

  • investor_id: Unique identifier for each investor
  • age: Investor's age
  • experience_years: Years of trading experience
  • portfolio_value: Total portfolio value in thousands of USD
  • trades_per_month: Number of trades executed per month
  • risk_tolerance: Self-reported risk tolerance (1-5 scale, 5 = highest risk tolerance)
  • annual_return: Self-reported annual return percentage
  • investment_style: Primary investment approach (Growth, Value, Index, Day_Trading)

The following questions concern trader_survey.xlsx.

  1. This dataset contains errors. Identify the percentage of observations containing apparent errors in each of the below variables. (3 points) a. risk_tolerance _____% b. annual_return _____%
    c. trades_per_month _____%

For questions 2-4, replace all apparent errors in the dataset with NAs.

  1. Define 'active traders' as those who execute 20 or more trades per month, and 'high-return investors' as those with annual returns > 15%. Report the percentage of high-return investors for each combination of trading activity and investment style, ignoring participants with errors. (8 points)

    Investment Style Growth Value Index Day_Trading
    Active Trader ____% ____% ____% ____%
    Not Active Trader ____% ____% ____% ____%
  2. Report the mean and median portfolio values by investment style, ignoring participants with errors in portfolio_value. (8 points)

    Investment Style Growth Value Index Day_Trading
    Mean Portfolio Value $____k $____k $____k $____k
    Median Portfolio Value $____k $____k $____k $____k
  3. Create a bar plot displaying the average number of trades per month by investment style. The plot will be graded on the below components. (6 points) a. The plot is a bar plot. (1 point) b. Investment styles are correctly displayed on the x-axis. (1 point) c. Average trades_per_month values are on the y-axis. (1 point) d. The y-axis ranges from 0 up to and including 50. (1 point) e. The y-axis varies in intervals of 10. (1 point) f. Bars represent mean monthly trades by investment style. (1 point)

Part 3: Corporate Financial Health (25 points)

Dataset corporate_financials.xlsx contains financial metrics for publicly traded companies. Variables are as follows:

  • company_id: Unique identifier for companies
  • sector: Business sector
  • revenue: Annual revenue in millions of USD
  • net_income: Annual net income in millions of USD
  • total_debt: Total debt in millions of USD
  • cash_reserves: Cash and cash equivalents in millions of USD
  • employee_count: Total number of employees

Dataset esg_ratings.xlsx contains Environmental, Social, and Governance ratings for the same companies. Variables are as follows:

  • company_id: Unique identifier for companies
  • environmental_score: Environmental performance score (0-100)
  • social_score: Social responsibility score (0-100)
  • governance_score: Corporate governance score (0-100)
  • overall_esg_score: Combined ESG score (0-100)

The following questions concern both corporate_financials.xlsx and esg_ratings.xlsx.

  1. Report the mean profit margin (net_income/revenue * 100) for each sector. (6 points)

    Sector Technology Healthcare Finance Energy Consumer Industrial
    Mean Profit Margin ____% ____% ____% ____% ____% ____%
  2. Define 'financially healthy' companies as those with a debt-to-cash ratio < 2.0 (total_debt/cash_reserves). Report the mean revenue and employee count by financial health status. (4 points)

    Mean Revenue Mean Employee Count
    Financially Healthy $____M _____
    Not Financially Healthy $____M _____
  3. Divide companies into terciles by overall ESG score (lowest 33%, middle 34%, highest 33%). Within each tercile, report the mean environmental, social, and governance scores. (9 points)

    Mean Environmental Score Mean Social Score Mean Governance Score
    Lowest ESG Tercile _____ _____ _____
    Middle ESG Tercile _____ _____ _____
    Highest ESG Tercile _____ _____ _____
  4. Create a box plot showing the distribution of revenue-per-employee ratios (revenue/employee_count * 1000) by sector. The plot will be graded on the following components. (6 points) a. The plot is a box and whisker plot. (1 point) b. Sectors are clearly labeled on the x-axis. (1 point) c. Revenue per employee (in thousands) is on the y-axis. (1 point) d. The y-axis spans from 0 to 800 (thousands USD). (1 point) e. The y-axis varies in intervals of 100. (1 point) f. All six sectors are represented. (1 point)

Part 4: Economic Indicators & Market Sentiment (25 points)

Dataset market_indicators.xlsx contains monthly economic indicators and market performance data. Variables are as follows:

  • month_year: Month and year (e.g., "2023-01")
  • sp500_return: Monthly S&P 500 return percentage
  • unemployment_rate: Monthly unemployment rate percentage
  • inflation_rate: Monthly inflation rate percentage
  • interest_rate: Federal funds rate percentage
  • consumer_confidence: Consumer confidence index (0-200)
  • vix_index: Volatility index (fear gauge)
  • gdp_growth: Quarterly GDP growth rate (annualized percentage)

The following questions concern market_indicators.xlsx.

  1. Report the percentage of missing values in each of the below variables. (3 points)

    Variable gdp_growth consumer_confidence vix_index
    Percent Missing ____% ____% ____%
  2. Create a histogram of S&P 500 monthly returns. Your plot will be graded on the following attributes. (8 points) a. The plot is a histogram. (1 point) b. Monthly returns are on the x-axis. (1 point) c. The x-axis ranges from -15 to and including 15 (percent). (1 point) d. The x-axis bins vary in intervals of 2.5. (2 points) e. Frequency counts are on the y-axis. (1 point) f. The y-axis ranges from 0 up to and including 25. (1 point) g. The y-axis varies in intervals of 5. (1 point)

  3. Calculate the correlation between S&P 500 returns and each economic indicator. Report correlation coefficients for the relationships, ignoring missing values. (8 points)

    Economic Indicator Correlation with S&P 500 Returns
    Unemployment Rate _____
    Inflation Rate _____
    Interest Rate _____
    Consumer Confidence _____
    VIX Index _____
  4. Classify months as 'High Volatility' if VIX > 25 and 'Low Volatility' otherwise. Report the mean values of S&P 500 returns, unemployment rate, and consumer confidence for each volatility regime, ignoring missing values. (6 points)

    Mean S&P 500 Return Mean Unemployment Rate Mean Consumer Confidence
    High Volatility ____% ____% _____
    Low Volatility ____% ____% _____
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment