Data Card for JioPay RAG Chatbot
Dataset Overview
This dataset contains publicly accessible information from JioPay's business website and help center, collected for building a customer support RAG chatbot.
Data Sources
Primary Sources
JioPay Business Website
- URL: https://jiopay.com/business
- Content: Business information, features, pricing, integration details
- Collection Method: Web scraping with multiple pipelines
JioPay Help Center/FAQs
- URL: https://jiopay.com/help (or similar)
- Content: Frequently asked questions, troubleshooting guides
- Collection Method: Structured data extraction
Additional Sources
- Any other publicly accessible JioPay documentation
- Public API documentation (if available)
- Community forums and public discussions
Data Collection Details
Scraping Pipelines Used
- requests + BeautifulSoup4: Primary scraping method for static pages (see the sketch after this list)
- trafilatura: Readability-focused extraction of main-body text
- Playwright: Headless-browser rendering for JavaScript-heavy pages, used only where static fetching misses content
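A minimal sketch of the static-page pipeline, assuming `requests`, `beautifulsoup4`, and `trafilatura` are installed; the User-Agent string and helper names are illustrative, not the project's actual scraper code.

```python
import requests
from bs4 import BeautifulSoup
import trafilatura

URL = "https://jiopay.com/business"  # one of the sources listed above


def scrape_with_bs4(url: str) -> str:
    """Fetch a page and strip obvious boilerplate tags with BeautifulSoup."""
    resp = requests.get(url, timeout=30, headers={"User-Agent": "jiopay-rag-scraper/0.1"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()  # drop non-content elements before extracting text
    return soup.get_text(separator="\n", strip=True)


def scrape_with_trafilatura(url: str) -> str | None:
    """Readability-focused extraction; returns None when no main content is found."""
    downloaded = trafilatura.fetch_url(url)
    return trafilatura.extract(downloaded) if downloaded else None
```

For pages that render content client-side, Playwright would replace the `requests.get` call with a headless-browser page load before the same cleanup steps.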
Collection Metrics
- Total Pages Scraped: [To be filled after scraping]
- Total Tokens: [To be filled after processing]
- Coverage: Business pages, FAQ sections, help documentation
- Noise Ratio: [To be calculated during processing]
- Throughput: [To be measured during scraping]
Data Quality
- Cleanliness: HTML boilerplate removed, structure preserved
- Completeness: All publicly accessible content included
- Accuracy: Source URLs maintained for verification
Data Processing
Chunking Strategies
- Fixed Chunking: 256-, 512-, and 1024-token windows with 0, 64, or 128 tokens of overlap (see the sketch after this list)
- Semantic Chunking: Sentence/paragraph boundary detection
- Structural Chunking: HTML tag and heading-based segmentation
- Recursive Chunking: Hierarchical fallback approach
- LLM-based Chunking: Instruction-aware segmentation
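To illustrate the first strategy, a minimal fixed-size chunker with overlap; it treats whitespace-split words as tokens, whereas the actual pipeline may count tokens with a model tokenizer.

```python
from typing import List


def fixed_chunks(tokens: List[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split a token sequence into fixed-size windows that overlap by `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than the chunk size")
    if not tokens:
        return []
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]


# Example: 512-token chunks with 64 tokens of overlap over a whitespace-tokenized document
chunks = fixed_chunks("long scraped document text ...".split(), size=512, overlap=64)
```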
Embedding Models Tested
- OpenAI: text-embedding-3-small, text-embedding-3-large
- E5: intfloat/e5-base, intfloat/e5-large
- BGE: BAAI/bge-small-en-v1.5, BAAI/bge-base-en-v1.5 (used in the sketch below)
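The E5 and BGE checkpoints load through the `sentence-transformers` library; a sketch using one of the BGE models listed above (the sample chunks are placeholders, and E5 models additionally expect "query: "/"passage: " prefixes on their inputs):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
chunks = [
    "Placeholder chunk about JioPay Business features.",
    "Placeholder chunk from the help center FAQ.",
]
# Normalized embeddings so cosine similarity reduces to a dot product.
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (2, 768) for bge-base-en-v1.5
```

The OpenAI models are called through the OpenAI embeddings API instead and return 1536-dimensional (small) or 3072-dimensional (large) vectors.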
Compliance and Ethics
Legal Compliance
- ✅ Respects robots.txt (see the check sketched below)
- ✅ Follows website Terms & Conditions
- ✅ Only accesses publicly available content
- ✅ No user data or gated content accessed
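A simple way to honor the robots.txt claim before fetching any URL, using only the Python standard library; the User-Agent string is illustrative:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://jiopay.com/robots.txt")
rp.read()

# False means robots.txt disallows this path for our crawler, so we skip it.
allowed = rp.can_fetch("jiopay-rag-scraper/0.1", "https://jiopay.com/business")
```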
Data Usage
- Purpose: Customer support automation
- Scope: Public business and help documentation only
- Retention: Data stored locally for the RAG system
- Sharing: Not redistributed, used only for this project
Data Statistics
Collection Statistics
- Start Date: [To be filled]
- End Date: [To be filled]
- Total Collection Time: [To be filled]
- Success Rate: [To be calculated]
Content Statistics
- Business Pages: [To be counted]
- FAQ Items: [To be counted]
- Help Articles: [To be counted]
- Total Documents: [To be calculated]
Processing Statistics
- Chunks Generated: [To be calculated per strategy]
- Average Chunk Size: [To be calculated]
- Embedding Dimensions: [Varies by model]
Data Access
File Structure
data/
├── scraped/      # Raw scraped HTML/text
├── processed/    # Cleaned and chunked data
├── chroma_db/    # Vector database
└── evaluation/   # Test datasets
Data Formats
- Raw Data: HTML files, JSON metadata
- Processed Data: JSON with chunks and metadata
- Vector Data: ChromaDB collections (see the sketch after this list)
- Evaluation Data: JSON test sets
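A minimal sketch of how processed chunks land in the ChromaDB store under data/chroma_db/; the collection name, document text, and metadata fields here are assumptions, not the project's actual schema:

```python
import chromadb

client = chromadb.PersistentClient(path="data/chroma_db")  # matches the file structure above
collection = client.get_or_create_collection("jiopay_help")  # hypothetical collection name

collection.add(
    ids=["faq-0001"],
    documents=["Placeholder FAQ chunk about generating a payment link."],
    metadatas=[{"source_url": "https://jiopay.com/help", "chunking": "fixed-512-64"}],
)

# Retrieve the top-3 chunks for a support query; Chroma falls back to its default
# embedding function unless one of the models listed above is passed in explicitly.
results = collection.query(query_texts=["how do I create a payment link"], n_results=3)
```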
Updates and Maintenance
Data Freshness
- Last Updated: [To be filled]
- Update Frequency: As needed for accuracy
- Version Control: Git-tracked data processing scripts
Quality Assurance
- Regular validation of scraped content
- Monitoring for website structure changes
- Periodic re-scraping for updated content
Contact Information
For questions about this dataset or data collection process:
- Project Repository: [GitHub URL]
- Maintainer: [Your Name]
- Last Updated: [Date]