milwright committed
Commit fa85a62 · 1 Parent(s): e91302e

Prepare Reddit Scraper for Hugging Face deployment
.env.template CHANGED
@@ -1,10 +1,12 @@
 # Reddit API Credentials
+# Replace these values with your own credentials from https://www.reddit.com/prefs/apps
+# Do not include quotes around the values
+
+# Your Reddit API client ID
 REDDIT_CLIENT_ID=your_client_id_here
+
+# Your Reddit API client secret
 REDDIT_CLIENT_SECRET=your_client_secret_here
-REDDIT_USER_AGENT=your_user_agent_here
-REDDIT_USERNAME=your_username_here
-REDDIT_PASSWORD=your_password_here
-
-# Optional Configuration
-MAX_POSTS_PER_SUBREDDIT=100
-CLUSTERING_THRESHOLD=0.3
+
+# Your Reddit API user agent (convention: <platform>:<app ID>:<version> by /u/<reddit username>)
+REDDIT_USER_AGENT=RedditScraperApp/1.0
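For orientation, here is a minimal sketch of how the three variables kept by this template are typically consumed — it assumes python-dotenv and PRAW, both of which this commit's code already depends on; the variable names come from the template, the rest is illustrative:

```python
# Sketch: loading the .env.template variables with python-dotenv and PRAW.
import os

import praw
from dotenv import load_dotenv

load_dotenv()  # reads a .env file from the working directory, if one exists

reddit = praw.Reddit(
    client_id=os.environ.get("REDDIT_CLIENT_ID", ""),
    client_secret=os.environ.get("REDDIT_CLIENT_SECRET", ""),
    user_agent=os.environ.get("REDDIT_USER_AGENT", "RedditScraperApp/1.0"),
)
print(reddit.read_only)  # app-only credentials yield read-only access
```

Dropping REDDIT_USERNAME and REDDIT_PASSWORD also moves the app from a password grant to app-only, read-only access, which is all a scraper needs.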
.gitignore CHANGED
@@ -1,9 +1,14 @@
-# Python
+# Environment variables and credentials
+.env
+.streamlit/secrets.toml
+
+# Python cache files
 __pycache__/
 *.py[cod]
 *$py.class
 *.so
 .Python
+env/
 build/
 develop-eggs/
 dist/
@@ -15,35 +20,22 @@ lib64/
 parts/
 sdist/
 var/
-wheels/
 *.egg-info/
 .installed.cfg
 *.egg
 
-# Virtual Environment
+# Virtual environments
 venv/
-env/
 ENV/
+env/
 
-# Environment Variables
-.env
-.env.local
-.env.*.local
-
-# IDE
-.idea/
-.vscode/
-*.swp
-*.swo
-
-# Logs
-*.log
-logs/
-
-# OS
+# Data files that might be generated by the app
+*.csv
+*.json
+csv/
+results/
+
+# System files
 .DS_Store
 Thumbs.db
-
-# Data directories
-csv/
-results/
+.ipynb_checkpoints
.streamlit/secrets.toml.template ADDED
@@ -0,0 +1,12 @@
+# Reddit API Credentials for Hugging Face Space
+# Copy this file to secrets.toml and fill in your credentials
+# Or set these values in the Hugging Face Space settings under "Repository Secrets"
+
+# Your Reddit API client ID
+REDDIT_CLIENT_ID = ""
+
+# Your Reddit API client secret
+REDDIT_CLIENT_SECRET = ""
+
+# Your Reddit API user agent
+REDDIT_USER_AGENT = "RedditScraperApp/1.0"
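In a Space, the keys in this template surface through Streamlit's st.secrets mapping, whether they come from a local secrets.toml or from Repository Secrets — a short sketch of the read pattern the new app.py below relies on:

```python
# Sketch: reading the template's keys from Streamlit's secrets mapping.
import streamlit as st

client_id = st.secrets.get("REDDIT_CLIENT_ID", "")  # "" when the key is unset
user_agent = st.secrets.get("REDDIT_USER_AGENT", "RedditScraperApp/1.0")
```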
README-HF.md ADDED
@@ -0,0 +1,69 @@
+# Reddit Scraper
+
+![Reddit Scraper Logo](https://raw.githubusercontent.com/huggingface/hub-docs/main/static/icons/streamlit.svg)
+
+A comprehensive tool for scraping Reddit data with a user-friendly interface for data collection, analysis, and visualization.
+
+## Features
+
+- 🔍 **Search multiple subreddits** simultaneously
+- 🔑 **Filter posts by keywords** and various criteria
+- 📊 **Visualize data** with interactive charts
+- 💾 **Export results** to CSV or JSON
+- 📜 **Track search history**
+- 🔐 **Secure credentials** management
+
+## How to Use
+
+### 1. Set up Reddit API Credentials
+
+To use this app, you will need Reddit API credentials. You can get these from the [Reddit Developer Portal](https://www.reddit.com/prefs/apps).
+
+- Click "Create App" or "Create Another App"
+- Fill in the details (name, description)
+- Select "script" as the application type
+- Use "http://localhost:8000" as the redirect URI (this doesn't need to be a real endpoint)
+- Click "Create app"
+- Take note of the client ID (the string under "personal use script") and the client secret
+
+Enter these credentials in the app's sidebar or set them up as secrets in the Hugging Face Space settings (if you've duplicated this Space).
+
+### 2. Searching Reddit
+
+1. Enter one or more subreddits to search (one per line)
+2. Specify keywords to search for (one per line)
+3. Adjust parameters like post limit, sorting method, etc.
+4. Click "Run Search" to start scraping
+
+### 3. Working with Results
+
+- Use the tabs to navigate between different views
+- Apply additional filters to the search results
+- Visualize the data with built-in charts
+- Export results to CSV or JSON for further analysis
+
+## Privacy & API Usage
+
+This tool uses the official Reddit API and follows Reddit's API terms of service. Your API credentials are never stored on our servers unless you explicitly save them to your own copy of this Space.
+
+## Set Up Your Own Copy
+
+If you want to run this app with your own credentials always available:
+
+1. Duplicate this Space to your account
+2. Go to Settings → Repository Secrets
+3. Add the following secrets:
+   - `REDDIT_CLIENT_ID`: Your Reddit API client ID
+   - `REDDIT_CLIENT_SECRET`: Your Reddit API client secret
+   - `REDDIT_USER_AGENT`: (Optional) A custom user agent string
+
+## Tech Stack
+
+- [Streamlit](https://streamlit.io/): UI framework
+- [PRAW](https://praw.readthedocs.io/): Reddit API wrapper
+- [Pandas](https://pandas.pydata.org/): Data processing
+- [Plotly](https://plotly.com/): Data visualization
+
+## Feedback & Contributions
+
+If you find any issues or have suggestions for improvements, please open an issue on the [GitHub repository](https://github.com/yourusername/reddit-scraper) or create a discussion on this Hugging Face Space.
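A quick way to sanity-check the credentials from step 1 before running a search — a sketch using PRAW directly (the same wrapper listed in the Tech Stack), with placeholder strings standing in for your real values:

```python
# Sketch: sanity-check Reddit API credentials before using the app.
import praw

reddit = praw.Reddit(
    client_id="your_client_id_here",        # string under "personal use script"
    client_secret="your_client_secret_here",
    user_agent="RedditScraperApp/1.0",
)

# Script-app credentials without a username/password give read-only access;
# successfully fetching one post confirms the pair is valid.
for post in reddit.subreddit("all").hot(limit=1):
    print("Credentials OK:", post.title)
```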
advanced_scraper_ui.py CHANGED
@@ -6,6 +6,7 @@ import time
 import os
 import json
 from datetime import datetime
+from dotenv import load_dotenv
 from enhanced_scraper import EnhancedRedditScraper
 
 # Page configuration
@@ -209,13 +210,49 @@ def main():
 
     # Credentials
     with st.expander("Reddit API Credentials", expanded=not st.session_state.scraper):
-        client_id = st.text_input("Client ID", value="aBHOo9oQ3D-liyfGOc34cQ")
-        client_secret = st.text_input("Client Secret", value="4__ziHwdOBNYjlGUG0k7XvK-r5OJDw", type="password")
-        user_agent = st.text_input("User Agent", value="WebScraperUI/1.0")
+        st.markdown("""
+        ### Reddit API Credentials
+        Please enter your Reddit API credentials below. You can obtain these from the
+        [Reddit Developer Portal](https://www.reddit.com/prefs/apps).
+
+        If you don't have your own credentials, you can leave these fields empty and the app
+        will try to use credentials from environment variables if available.
+        """)
+
+        # Try to load from .env file
+        load_dotenv()
+        default_client_id = os.environ.get("REDDIT_CLIENT_ID", "")
+        default_client_secret = os.environ.get("REDDIT_CLIENT_SECRET", "")
+        default_user_agent = os.environ.get("REDDIT_USER_AGENT", "RedditScraperApp/1.0")
+
+        client_id = st.text_input("Client ID", value=default_client_id)
+        client_secret = st.text_input("Client Secret", value=default_client_secret, type="password")
+        user_agent = st.text_input("User Agent", value=default_user_agent)
+
+        save_as_env = st.checkbox("Save credentials for future use (saved in .env file)", value=False)
 
         if st.button("Initialize API Connection"):
+            # Save credentials if requested
+            if save_as_env and (client_id or client_secret):
+                env_vars = []
+                if client_id:
+                    env_vars.append(f"REDDIT_CLIENT_ID={client_id}")
+                if client_secret:
+                    env_vars.append(f"REDDIT_CLIENT_SECRET={client_secret}")
+                if user_agent and user_agent != "RedditScraperApp/1.0":
+                    env_vars.append(f"REDDIT_USER_AGENT={user_agent}")
+
+                # Write to .env file
+                with open(".env", "w") as f:
+                    f.write("\n".join(env_vars))
+                st.success("Credentials saved to .env file")
+
             if initialize_scraper(client_id, client_secret, user_agent):
                 st.success("API connection established!")
+                # Set environment variables for the current session
+                os.environ["REDDIT_CLIENT_ID"] = client_id
+                os.environ["REDDIT_CLIENT_SECRET"] = client_secret
+                os.environ["REDDIT_USER_AGENT"] = user_agent
 
     # Search Parameters
     st.subheader("Search Parameters")
@@ -476,4 +513,4 @@
         st.info("No search history yet.")
 
 if __name__ == "__main__":
-    main()
+    main()
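The diff calls initialize_scraper without showing its body; a plausible sketch of what it must do, inferred from how it is called above — the error handling and exact wording here are assumptions, not the file's actual code:

```python
# Hypothetical sketch of initialize_scraper, inferred from its call site above.
import streamlit as st
from enhanced_scraper import EnhancedRedditScraper

def initialize_scraper(client_id: str, client_secret: str, user_agent: str) -> bool:
    """Store a working scraper in session state; report failure to the UI."""
    try:
        st.session_state.scraper = EnhancedRedditScraper(
            client_id=client_id,
            client_secret=client_secret,
            user_agent=user_agent,
        )
        return True
    except Exception as e:  # bad credentials, network errors, etc.
        st.error(f"Failed to initialize Reddit API connection: {e}")
        return False
```

One design note on the new checkbox: a Space's filesystem is typically ephemeral, so a .env file written at runtime will not survive a restart; Repository Secrets remain the durable option there.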
app.py ADDED
@@ -0,0 +1,38 @@
+# Reddit Scraper Hugging Face Space Launcher
+# This file serves as the entry point for our Hugging Face Space
+
+import os
+import streamlit as st
+from dotenv import load_dotenv
+
+# Load environment variables from .streamlit/secrets.toml if running in a Hugging Face Space environment
+def load_huggingface_secrets():
+    try:
+        # HF Spaces store secrets in st.secrets
+        client_id = st.secrets.get("REDDIT_CLIENT_ID", "")
+        client_secret = st.secrets.get("REDDIT_CLIENT_SECRET", "")
+        user_agent = st.secrets.get("REDDIT_USER_AGENT", "RedditScraperApp/1.0")
+
+        # Set as environment variables for other modules to use
+        if client_id:
+            os.environ["REDDIT_CLIENT_ID"] = client_id
+        if client_secret:
+            os.environ["REDDIT_CLIENT_SECRET"] = client_secret
+        if user_agent:
+            os.environ["REDDIT_USER_AGENT"] = user_agent
+
+        return True
+    except Exception:
+        # Fall back to the regular .env file if not in an HF Space
+        return False
+
+# Try to load secrets (first from HF secrets, then from .env)
+load_huggingface_secrets()
+load_dotenv()
+
+# Import the main app after environment setup to ensure it has access to the variables
+from advanced_scraper_ui import main
+
+# Run the main app
+if __name__ == "__main__":
+    main()
enhanced_scraper.py CHANGED
@@ -4,7 +4,9 @@ import datetime
 import re
 import json
 import os
+import os.path
 from typing import List, Dict, Any, Optional
+from dotenv import load_dotenv
 
 class EnhancedRedditScraper:
     """
@@ -194,26 +196,44 @@
 
 # Example usage
 if __name__ == "__main__":
+    # Load environment variables from .env file
+    load_dotenv()
+
+    # Get credentials from environment variables or use defaults for development
+    client_id = os.environ.get("REDDIT_CLIENT_ID", "")
+    client_secret = os.environ.get("REDDIT_CLIENT_SECRET", "")
+    user_agent = os.environ.get("REDDIT_USER_AGENT", "RedditScraperApp/1.0")
+
+    if not client_id or not client_secret:
+        print("Warning: Reddit API credentials not found in environment variables.")
+        print("Please set REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET in .env file")
+        print("or as environment variables for proper functionality.")
+        # For development only, you could set default credentials here
+
     # Create the scraper instance
     scraper = EnhancedRedditScraper(
-        client_id="aBHOo9oQ3D-liyfGOc34cQ",
-        client_secret="4__ziHwdOBNYjlGUG0k7XvK-r5OJDw",
-        user_agent="rcuny"
+        client_id=client_id,
+        client_secret=client_secret,
+        user_agent=user_agent
     )
 
     # Simple example
-    results = scraper.scrape_subreddit(
-        subreddit_name="cuny",
-        keywords=["question", "help", "confused"],
-        limit=25,
-        sort_by="hot",
-        include_comments=True
-    )
-
-    print(f"Found {len(results)} matching posts")
-
-    # Save results to file
-    if results:
-        csv_path = scraper.save_results_to_csv("reddit_results")
-        json_path = scraper.save_results_to_json("reddit_results")
-        print(f"Results saved to {csv_path} and {json_path}")
+    try:
+        results = scraper.scrape_subreddit(
+            subreddit_name="cuny",
+            keywords=["question", "help", "confused"],
+            limit=25,
+            sort_by="hot",
+            include_comments=True
+        )
+
+        print(f"Found {len(results)} matching posts")
+
+        # Save results to file
+        if results:
+            csv_path = scraper.save_results_to_csv("reddit_results")
+            json_path = scraper.save_results_to_json("reddit_results")
+            print(f"Results saved to {csv_path} and {json_path}")
+    except Exception as e:
+        print(f"Error: {str(e)}")
+        print("This may be due to missing or invalid API credentials.")
packages.txt ADDED
@@ -0,0 +1,4 @@
+# System dependencies for the Reddit Scraper app
+# This file is used by Hugging Face Spaces to install system packages
+
+# No additional system packages are required for this app
requirements.txt CHANGED
@@ -3,3 +3,4 @@ pandas>=1.3.0
 streamlit>=1.3.0
 plotly>=5.5.0
 matplotlib>=3.5.0
+python-dotenv>=0.20.0
setup_for_hf.sh ADDED
@@ -0,0 +1,107 @@
+#!/bin/bash
+
+# Setup script for pushing Reddit Scraper to Hugging Face
+
+echo "==== Reddit Scraper: Hugging Face Setup ===="
+echo ""
+
+# Check for required tools
+echo "Checking for required tools..."
+
+if ! command -v git &> /dev/null; then
+    echo "❌ Git not found. Please install Git before continuing."
+    exit 1
+else
+    echo "✅ Git installed"
+fi
+
+if ! command -v python3 &> /dev/null; then
+    echo "❌ Python 3 not found. Please install Python 3 before continuing."
+    exit 1
+else
+    echo "✅ Python 3 installed"
+fi
+
+if ! command -v pip3 &> /dev/null; then
+    echo "❌ pip not found. Please install pip before continuing."
+    exit 1
+else
+    echo "✅ pip installed"
+fi
+
+if ! command -v huggingface-cli &> /dev/null; then
+    echo "⚠️ Hugging Face CLI not installed. Installing now..."
+    pip install huggingface_hub
+    if ! command -v huggingface-cli &> /dev/null; then
+        echo "❌ Failed to install Hugging Face CLI. Please install manually: pip install huggingface_hub"
+        exit 1
+    else
+        echo "✅ Hugging Face CLI installed"
+    fi
+else
+    echo "✅ Hugging Face CLI installed"
+fi
+
+echo ""
+echo "Verifying project files..."
+
+# Check for required files
+required_files=("app.py" "requirements.txt" "enhanced_scraper.py" "advanced_scraper_ui.py" "README-HF.md")
+missing_files=0
+
+for file in "${required_files[@]}"; do
+    if [ ! -f "$file" ]; then
+        echo "❌ Missing required file: $file"
+        missing_files=$((missing_files+1))
+    else
+        echo "✅ Found $file"
+    fi
+done
+
+if [ $missing_files -gt 0 ]; then
+    echo ""
+    echo "❌ Some required files are missing. Please make sure all project files exist."
+    exit 1
+fi
+
+echo ""
+echo "All required files are present."
+echo ""
+
+# Check for Hugging Face login
+echo "Checking Hugging Face login status..."
+huggingface-cli whoami &> /dev/null
+if [ $? -ne 0 ]; then
+    echo "You need to login to Hugging Face first."
+    echo "Run the following command and follow the instructions:"
+    echo ""
+    echo "huggingface-cli login"
+    echo ""
+    exit 1
+else
+    echo "✅ Already logged in to Hugging Face"
+fi
+
+echo ""
+echo "==== Ready to push to Hugging Face! ===="
+echo ""
+echo "To create a new Hugging Face Space and push your code:"
+echo ""
+echo "1. Go to https://huggingface.co/new-space"
+echo "2. Choose a Space name (e.g., 'reddit-scraper')"
+echo "3. Select 'Streamlit' as the SDK"
+echo "4. Create the Space"
+echo ""
+echo "Then run the following commands to push your code:"
+echo ""
+echo "git init"
+echo "git add ."
+echo "git commit -m \"Initial commit of Reddit Scraper\""
+echo "git branch -M main"
+echo "git remote add origin https://huggingface.co/spaces/YOUR_USERNAME/reddit-scraper"
+echo "git push -u origin main"
+echo ""
+echo "Replace YOUR_USERNAME with your Hugging Face username."
+echo ""
+echo "Remember to set up your Reddit API credentials in the Space settings!"
+echo ""