milwright committed
Commit fa85a62 · 1 Parent(s): e91302e

Prepare Reddit Scraper for Hugging Face deployment
.env.template CHANGED
@@ -1,10 +1,12 @@
 # Reddit API Credentials
+# Replace these values with your own credentials from https://www.reddit.com/prefs/apps
+# Do not include quotes around the values
+
+# Your Reddit API client ID
 REDDIT_CLIENT_ID=your_client_id_here
+
+# Your Reddit API client secret
 REDDIT_CLIENT_SECRET=your_client_secret_here
-REDDIT_USER_AGENT=your_user_agent_here
-REDDIT_USERNAME=your_username_here
-REDDIT_PASSWORD=your_password_here
-
-# Optional Configuration
-MAX_POSTS_PER_SUBREDDIT=100
-CLUSTERING_THRESHOLD=0.3
+
+# Your Reddit API user agent (convention: <platform>:<app ID>:<version> by /u/<reddit username>)
+REDDIT_USER_AGENT=RedditScraperApp/1.0
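For orientation, here is a minimal sketch of how the three variables kept by this template are typically consumed — it assumes python-dotenv and PRAW, both of which this commit's code already depends on; the variable names come from the template, the rest is illustrative:

```python
# Sketch: loading the .env.template variables with python-dotenv and PRAW.
import os

import praw
from dotenv import load_dotenv

load_dotenv()  # reads a .env file from the working directory, if one exists

reddit = praw.Reddit(
    client_id=os.environ.get("REDDIT_CLIENT_ID", ""),
    client_secret=os.environ.get("REDDIT_CLIENT_SECRET", ""),
    user_agent=os.environ.get("REDDIT_USER_AGENT", "RedditScraperApp/1.0"),
)
print(reddit.read_only)  # app-only credentials yield read-only access
```

Dropping REDDIT_USERNAME and REDDIT_PASSWORD also moves the app from a password grant to app-only, read-only access, which is all a scraper needs.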
.gitignore CHANGED
@@ -1,9 +1,14 @@
-# Python
+# Environment variables and credentials
+.env
+.streamlit/secrets.toml
+
+# Python cache files
 __pycache__/
 *.py[cod]
 *$py.class
 *.so
 .Python
+env/
 build/
 develop-eggs/
 dist/
@@ -15,35 +20,22 @@ lib64/
 parts/
 sdist/
 var/
-wheels/
 *.egg-info/
 .installed.cfg
 *.egg
 
-# Virtual Environment
+# Virtual environments
 venv/
-env/
 ENV/
+env/
 
-# Environment Variables
-.env
-.env.local
-.env.*.local
-
-# IDE
-.idea/
-.vscode/
-*.swp
-*.swo
-
-# Logs
-*.log
-logs/
-
-# OS
+# Data files that might be generated by the app
+*.csv
+*.json
+csv/
+results/
+
+# System files
 .DS_Store
 Thumbs.db
-
-# Data directories
-csv/
-results/
+.ipynb_checkpoints
.streamlit/secrets.toml.template ADDED
@@ -0,0 +1,12 @@
+# Reddit API Credentials for Hugging Face Space
+# Copy this file to secrets.toml and fill in your credentials
+# Or set these values in the Hugging Face Space settings under "Repository Secrets"
+
+# Your Reddit API client ID
+REDDIT_CLIENT_ID = ""
+
+# Your Reddit API client secret
+REDDIT_CLIENT_SECRET = ""
+
+# Your Reddit API user agent
+REDDIT_USER_AGENT = "RedditScraperApp/1.0"
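In a Space, the keys in this template surface through Streamlit's st.secrets mapping, whether they come from a local secrets.toml or from Repository Secrets — a short sketch of the read pattern the new app.py below relies on:

```python
# Sketch: reading the template's keys from Streamlit's secrets mapping.
import streamlit as st

client_id = st.secrets.get("REDDIT_CLIENT_ID", "")  # "" when the key is unset
user_agent = st.secrets.get("REDDIT_USER_AGENT", "RedditScraperApp/1.0")
```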
README-HF.md ADDED
@@ -0,0 +1,69 @@
+# Reddit Scraper
+
+![Reddit Scraper Logo](https://raw.githubusercontent.com/huggingface/hub-docs/main/static/icons/streamlit.svg)
+
+A comprehensive tool for scraping Reddit data with a user-friendly interface for data collection, analysis, and visualization.
+
+## Features
+
+- 🔍 **Search multiple subreddits** simultaneously
+- 🔑 **Filter posts by keywords** and various criteria
+- 📊 **Visualize data** with interactive charts
+- 💾 **Export results** to CSV or JSON
+- 📜 **Track search history**
+- 🔐 **Secure credentials** management
+
+## How to Use
+
+### 1. Set up Reddit API Credentials
+
+To use this app, you will need Reddit API credentials. You can get these from the [Reddit Developer Portal](https://www.reddit.com/prefs/apps).
+
+- Click "Create App" or "Create Another App"
+- Fill in the details (name, description)
+- Select "script" as the application type
+- Use "http://localhost:8000" as the redirect URI (this doesn't need to be a real endpoint)
+- Click "Create app"
+- Take note of the client ID (the string under "personal use script") and the client secret
+
+Enter these credentials in the app's sidebar or set them up as secrets in the Hugging Face Space settings (if you've duplicated this Space).
+
+### 2. Searching Reddit
+
+1. Enter one or more subreddits to search (one per line)
+2. Specify keywords to search for (one per line)
+3. Adjust parameters like post limit, sorting method, etc.
+4. Click "Run Search" to start scraping
+
+### 3. Working with Results
+
+- Use the tabs to navigate between different views
+- Apply additional filters to the search results
+- Visualize the data with built-in charts
+- Export results to CSV or JSON for further analysis
+
+## Privacy & API Usage
+
+This tool uses the official Reddit API and follows Reddit's API terms of service. Your API credentials are never stored on our servers unless you explicitly save them to your own copy of this Space.
+
+## Set Up Your Own Copy
+
+If you want to run this app with your own credentials always available:
+
+1. Duplicate this Space to your account
+2. Go to Settings → Repository Secrets
+3. Add the following secrets:
+   - `REDDIT_CLIENT_ID`: Your Reddit API client ID
+   - `REDDIT_CLIENT_SECRET`: Your Reddit API client secret
+   - `REDDIT_USER_AGENT`: (Optional) A custom user agent string
+
+## Tech Stack
+
+- [Streamlit](https://streamlit.io/): UI framework
+- [PRAW](https://praw.readthedocs.io/): Reddit API wrapper
+- [Pandas](https://pandas.pydata.org/): Data processing
+- [Plotly](https://plotly.com/): Data visualization
+
+## Feedback & Contributions
+
+If you find any issues or have suggestions for improvements, please open an issue on the [GitHub repository](https://github.com/yourusername/reddit-scraper) or create a discussion on this Hugging Face Space.
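A quick way to sanity-check the credentials from step 1 before running a search — a sketch using PRAW directly (the same wrapper listed in the Tech Stack), with placeholder strings standing in for your real values:

```python
# Sketch: sanity-check Reddit API credentials before using the app.
import praw

reddit = praw.Reddit(
    client_id="your_client_id_here",        # string under "personal use script"
    client_secret="your_client_secret_here",
    user_agent="RedditScraperApp/1.0",
)

# Script-app credentials without a username/password give read-only access;
# successfully fetching one post confirms the pair is valid.
for post in reddit.subreddit("all").hot(limit=1):
    print("Credentials OK:", post.title)
```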
advanced_scraper_ui.py CHANGED
@@ -6,6 +6,7 @@ import time
 import os
 import json
 from datetime import datetime
+from dotenv import load_dotenv
 from enhanced_scraper import EnhancedRedditScraper
 
 # Page configuration
@@ -209,13 +210,49 @@ def main():
 
     # Credentials
     with st.expander("Reddit API Credentials", expanded=not st.session_state.scraper):
-        client_id = st.text_input("Client ID", value="aBHOo9oQ3D-liyfGOc34cQ")
-        client_secret = st.text_input("Client Secret", value="4__ziHwdOBNYjlGUG0k7XvK-r5OJDw", type="password")
-        user_agent = st.text_input("User Agent", value="WebScraperUI/1.0")
+        st.markdown("""
+        ### Reddit API Credentials
+        Please enter your Reddit API credentials below. You can obtain these from the
+        [Reddit Developer Portal](https://www.reddit.com/prefs/apps).
+
+        If you don't have your own credentials, you can leave these fields empty and the app
+        will try to use credentials from environment variables if available.
+        """)
+
+        # Try to load from .env file
+        load_dotenv()
+        default_client_id = os.environ.get("REDDIT_CLIENT_ID", "")
+        default_client_secret = os.environ.get("REDDIT_CLIENT_SECRET", "")
+        default_user_agent = os.environ.get("REDDIT_USER_AGENT", "RedditScraperApp/1.0")
+
+        client_id = st.text_input("Client ID", value=default_client_id)
+        client_secret = st.text_input("Client Secret", value=default_client_secret, type="password")
+        user_agent = st.text_input("User Agent", value=default_user_agent)
+
+        save_as_env = st.checkbox("Save credentials for future use (saved in .env file)", value=False)
 
         if st.button("Initialize API Connection"):
+            # Save credentials if requested
+            if save_as_env and (client_id or client_secret):
+                env_vars = []
+                if client_id:
+                    env_vars.append(f"REDDIT_CLIENT_ID={client_id}")
+                if client_secret:
+                    env_vars.append(f"REDDIT_CLIENT_SECRET={client_secret}")
+                if user_agent and user_agent != "RedditScraperApp/1.0":
+                    env_vars.append(f"REDDIT_USER_AGENT={user_agent}")
+
+                # Write to .env file
+                with open(".env", "w") as f:
+                    f.write("\n".join(env_vars))
+                st.success("Credentials saved to .env file")
+
             if initialize_scraper(client_id, client_secret, user_agent):
                 st.success("API connection established!")
+                # Set environment variables for the current session
+                os.environ["REDDIT_CLIENT_ID"] = client_id
+                os.environ["REDDIT_CLIENT_SECRET"] = client_secret
+                os.environ["REDDIT_USER_AGENT"] = user_agent
 
     # Search Parameters
     st.subheader("Search Parameters")
@@ -476,4 +513,4 @@
         st.info("No search history yet.")
 
 if __name__ == "__main__":
-    main()
+    main()
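The diff calls initialize_scraper without showing its body; a plausible sketch of what it must do, inferred from how it is called above — the error handling and exact wording here are assumptions, not the file's actual code:

```python
# Hypothetical sketch of initialize_scraper, inferred from its call site above.
import streamlit as st
from enhanced_scraper import EnhancedRedditScraper

def initialize_scraper(client_id: str, client_secret: str, user_agent: str) -> bool:
    """Store a working scraper in session state; report failure to the UI."""
    try:
        st.session_state.scraper = EnhancedRedditScraper(
            client_id=client_id,
            client_secret=client_secret,
            user_agent=user_agent,
        )
        return True
    except Exception as e:  # bad credentials, network errors, etc.
        st.error(f"Failed to initialize Reddit API connection: {e}")
        return False
```

One design note on the new checkbox: a Space's filesystem is typically ephemeral, so a .env file written at runtime will not survive a restart; Repository Secrets remain the durable option there.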
app.py ADDED
@@ -0,0 +1,38 @@
+# Reddit Scraper Hugging Face Space Launcher
+# This file serves as the entry point for our Hugging Face Space
+
+import os
+import streamlit as st
+from dotenv import load_dotenv
+
+# Load environment variables from .streamlit/secrets.toml if running in a Hugging Face Space environment
+def load_huggingface_secrets():
+    try:
+        # HF Spaces store secrets in st.secrets
+        client_id = st.secrets.get("REDDIT_CLIENT_ID", "")
+        client_secret = st.secrets.get("REDDIT_CLIENT_SECRET", "")
+        user_agent = st.secrets.get("REDDIT_USER_AGENT", "RedditScraperApp/1.0")
+
+        # Set as environment variables for other modules to use
+        if client_id:
+            os.environ["REDDIT_CLIENT_ID"] = client_id
+        if client_secret:
+            os.environ["REDDIT_CLIENT_SECRET"] = client_secret
+        if user_agent:
+            os.environ["REDDIT_USER_AGENT"] = user_agent
+
+        return True
+    except Exception:
+        # Fall back to the regular .env file if not in an HF Space
+        return False
+
+# Try to load secrets (first from HF secrets, then from .env)
+load_huggingface_secrets()
+load_dotenv()
+
+# Import the main app after environment setup to ensure it has access to the variables
+from advanced_scraper_ui import main
+
+# Run the main app
+if __name__ == "__main__":
+    main()
enhanced_scraper.py CHANGED
@@ -4,7 +4,9 @@ import datetime
 import re
 import json
 import os
+import os.path
 from typing import List, Dict, Any, Optional
+from dotenv import load_dotenv
 
 class EnhancedRedditScraper:
     """
@@ -194,26 +196,44 @@
 
 # Example usage
 if __name__ == "__main__":
+    # Load environment variables from .env file
+    load_dotenv()
+
+    # Get credentials from environment variables or use defaults for development
+    client_id = os.environ.get("REDDIT_CLIENT_ID", "")
+    client_secret = os.environ.get("REDDIT_CLIENT_SECRET", "")
+    user_agent = os.environ.get("REDDIT_USER_AGENT", "RedditScraperApp/1.0")
+
+    if not client_id or not client_secret:
+        print("Warning: Reddit API credentials not found in environment variables.")
+        print("Please set REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET in .env file")
+        print("or as environment variables for proper functionality.")
+        # For development only, you could set default credentials here
+
     # Create the scraper instance
     scraper = EnhancedRedditScraper(
-        client_id="aBHOo9oQ3D-liyfGOc34cQ",
-        client_secret="4__ziHwdOBNYjlGUG0k7XvK-r5OJDw",
-        user_agent="rcuny"
+        client_id=client_id,
+        client_secret=client_secret,
+        user_agent=user_agent
     )
 
     # Simple example
-    results = scraper.scrape_subreddit(
-        subreddit_name="cuny",
-        keywords=["question", "help", "confused"],
-        limit=25,
-        sort_by="hot",
-        include_comments=True
-    )
-
-    print(f"Found {len(results)} matching posts")
-
-    # Save results to file
-    if results:
-        csv_path = scraper.save_results_to_csv("reddit_results")
-        json_path = scraper.save_results_to_json("reddit_results")
-        print(f"Results saved to {csv_path} and {json_path}")
+    try:
+        results = scraper.scrape_subreddit(
+            subreddit_name="cuny",
+            keywords=["question", "help", "confused"],
+            limit=25,
+            sort_by="hot",
+            include_comments=True
+        )
+
+        print(f"Found {len(results)} matching posts")
+
+        # Save results to file
+        if results:
+            csv_path = scraper.save_results_to_csv("reddit_results")
+            json_path = scraper.save_results_to_json("reddit_results")
+            print(f"Results saved to {csv_path} and {json_path}")
+    except Exception as e:
+        print(f"Error: {str(e)}")
+        print("This may be due to missing or invalid API credentials.")
packages.txt ADDED
@@ -0,0 +1,4 @@
+# System dependencies for the Reddit Scraper app
+# This file is used by Hugging Face Spaces to install system packages
+
+# No additional system packages are required for this app
requirements.txt CHANGED
@@ -3,3 +3,4 @@ pandas>=1.3.0
 streamlit>=1.3.0
 plotly>=5.5.0
 matplotlib>=3.5.0
+python-dotenv>=0.20.0
setup_for_hf.sh ADDED
@@ -0,0 +1,107 @@
+#!/bin/bash
+
+# Setup script for pushing Reddit Scraper to Hugging Face
+
+echo "==== Reddit Scraper: Hugging Face Setup ===="
+echo ""
+
+# Check for required tools
+echo "Checking for required tools..."
+
+if ! command -v git &> /dev/null; then
+    echo "❌ Git not found. Please install Git before continuing."
+    exit 1
+else
+    echo "✅ Git installed"
+fi
+
+if ! command -v python3 &> /dev/null; then
+    echo "❌ Python 3 not found. Please install Python 3 before continuing."
+    exit 1
+else
+    echo "✅ Python 3 installed"
+fi
+
+if ! command -v pip3 &> /dev/null; then
+    echo "❌ pip not found. Please install pip before continuing."
+    exit 1
+else
+    echo "✅ pip installed"
+fi
+
+if ! command -v huggingface-cli &> /dev/null; then
+    echo "⚠️ Hugging Face CLI not installed. Installing now..."
+    pip install huggingface_hub
+    if ! command -v huggingface-cli &> /dev/null; then
+        echo "❌ Failed to install Hugging Face CLI. Please install manually: pip install huggingface_hub"
+        exit 1
+    else
+        echo "✅ Hugging Face CLI installed"
+    fi
+else
+    echo "✅ Hugging Face CLI installed"
+fi
+
+echo ""
+echo "Verifying project files..."
+
+# Check for required files
+required_files=("app.py" "requirements.txt" "enhanced_scraper.py" "advanced_scraper_ui.py" "README-HF.md")
+missing_files=0
+
+for file in "${required_files[@]}"; do
+    if [ ! -f "$file" ]; then
+        echo "❌ Missing required file: $file"
+        missing_files=$((missing_files+1))
+    else
+        echo "✅ Found $file"
+    fi
+done
+
+if [ $missing_files -gt 0 ]; then
+    echo ""
+    echo "❌ Some required files are missing. Please make sure all project files exist."
+    exit 1
+fi
+
+echo ""
+echo "All required files are present."
+echo ""
+
+# Check for Hugging Face login
+echo "Checking Hugging Face login status..."
+huggingface-cli whoami &> /dev/null
+if [ $? -ne 0 ]; then
+    echo "You need to login to Hugging Face first."
+    echo "Run the following command and follow the instructions:"
+    echo ""
+    echo "huggingface-cli login"
+    echo ""
+    exit 1
+else
+    echo "✅ Already logged in to Hugging Face"
+fi
+
+echo ""
+echo "==== Ready to push to Hugging Face! ===="
+echo ""
+echo "To create a new Hugging Face Space and push your code:"
+echo ""
+echo "1. Go to https://huggingface.co/new-space"
+echo "2. Choose a Space name (e.g., 'reddit-scraper')"
+echo "3. Select 'Streamlit' as the SDK"
+echo "4. Create the Space"
+echo ""
+echo "Then run the following commands to push your code:"
+echo ""
+echo "git init"
+echo "git add ."
+echo "git commit -m \"Initial commit of Reddit Scraper\""
+echo "git branch -M main"
+echo "git remote add origin https://huggingface.co/spaces/YOUR_USERNAME/reddit-scraper"
+echo "git push -u origin main"
+echo ""
+echo "Replace YOUR_USERNAME with your Hugging Face username."
+echo ""
+echo "Remember to set up your Reddit API credentials in the Space settings!"
+echo ""