# Data Card for JioPay RAG Chatbot

## Dataset Overview

This dataset contains publicly accessible information from JioPay's business website and help center, collected for building a customer support RAG chatbot.

## Data Sources

### Primary Sources

1. **JioPay Business Website**
   - URL: https://jiopay.com/business
   - Content: Business information, features, pricing, integration details
   - Collection Method: Web scraping with multiple pipelines
2. **JioPay Help Center/FAQs**
   - URL: https://jiopay.com/help (or similar)
   - Content: Frequently asked questions, troubleshooting guides
   - Collection Method: Structured data extraction

### Additional Sources

- Any other publicly accessible JioPay documentation
- Public API documentation (if available)
- Community forums and public discussions

## Data Collection Details

### Scraping Pipelines Used

1. **requests + BeautifulSoup4**: Primary scraping method
2. **trafilatura**: Readability-focused extraction
3. **Playwright**: Dynamic content handling (if needed)

### Collection Metrics

- **Total Pages Scraped**: [To be filled after scraping]
- **Total Tokens**: [To be filled after processing]
- **Coverage**: Business pages, FAQ sections, help documentation
- **Noise Ratio**: [To be calculated during processing]
- **Throughput**: [To be measured during scraping]

### Data Quality

- **Cleanliness**: HTML boilerplate removed, structure preserved
- **Completeness**: All publicly accessible content included
- **Accuracy**: Source URLs maintained for verification

## Data Processing

### Chunking Strategies

1. **Fixed Chunking**: 256, 512, 1024 tokens with 0, 64, 128 overlap
2. **Semantic Chunking**: Sentence/paragraph boundary detection
3. **Structural Chunking**: HTML tag and heading-based segmentation
4. **Recursive Chunking**: Hierarchical fallback approach
5. **LLM-based Chunking**: Instruction-aware segmentation

### Embedding Models Tested

1. **OpenAI**: text-embedding-3-small, text-embedding-3-large
2. **E5**: intfloat/e5-base, intfloat/e5-large
3. **BGE**: BAAI/bge-small-en-v1.5, BAAI/bge-base-en-v1.5

## Compliance and Ethics

### Legal Compliance

- ✅ Respects robots.txt
- ✅ Follows website Terms & Conditions
- ✅ Only accesses publicly available content
- ✅ No user data or gated content accessed

### Data Usage

- Purpose: Customer support automation
- Scope: Public business and help documentation only
- Retention: Data stored locally for RAG system
- Sharing: Not redistributed, used only for this project

## Data Statistics

### Collection Statistics

- **Start Date**: [To be filled]
- **End Date**: [To be filled]
- **Total Collection Time**: [To be filled]
- **Success Rate**: [To be calculated]

### Content Statistics

- **Business Pages**: [To be counted]
- **FAQ Items**: [To be counted]
- **Help Articles**: [To be counted]
- **Total Documents**: [To be calculated]

### Processing Statistics

- **Chunks Generated**: [To be calculated per strategy]
- **Average Chunk Size**: [To be calculated]
- **Embedding Dimensions**: [Varies by model]

## Data Access

### File Structure

```
data/
├── scraped/       # Raw scraped HTML/text
├── processed/     # Cleaned and chunked data
├── chroma_db/     # Vector database
└── evaluation/    # Test datasets
```

### Data Formats

- **Raw Data**: HTML files, JSON metadata
- **Processed Data**: JSON with chunks and metadata
- **Vector Data**: ChromaDB collections
- **Evaluation Data**: JSON test sets

## Updates and Maintenance

### Data Freshness

- **Last Updated**: [To be filled]
- **Update Frequency**: As needed for accuracy
- **Version Control**: Git-tracked data processing scripts

### Quality Assurance

- Regular validation of scraped content
- Monitoring for website structure changes
- Periodic re-scraping for updated content

## Contact Information

For questions about this dataset or data collection process:

- **Project Repository**: [GitHub URL]
- **Maintainer**: [Your Name]
- **Last Updated**: [Date]
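
The fixed chunking strategy listed under Data Processing (256, 512, or 1024 tokens with 0, 64, or 128 tokens of overlap) can be illustrated with a minimal sketch. It is not the project's actual implementation: the function name `fixed_chunks`, the `CHUNK_SIZES` and `OVERLAPS` constants, and the use of whitespace-separated words as "tokens" are all assumptions made for illustration. A real pipeline would more likely count tokens with the tokenizer of the embedding model being tested.

```python
from typing import List

# Grid of settings named in the data card; the constant names are hypothetical.
CHUNK_SIZES = (256, 512, 1024)
OVERLAPS = (0, 64, 128)


def fixed_chunks(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens.

    Consecutive windows share `overlap` tokens, so each new window
    starts chunk_size - overlap tokens after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, so no trailing chunk
        # consisting only of already-covered overlap is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    # Assumption: whitespace splitting stands in for a real tokenizer.
    text = "JioPay lets merchants accept payments online and in store " * 200
    tokens = text.split()
    for size in CHUNK_SIZES:
        for overlap in OVERLAPS:
            n_chunks = len(fixed_chunks(tokens, size, overlap))
            print(f"size={size} overlap={overlap} chunks={n_chunks}")
```

Running the grid this way yields one chunk count per (size, overlap) pair, which is the kind of figure the "Chunks Generated" entry under Processing Statistics asks for per strategy.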