{
  "en": {
    "Title": "# ✨Easy-to-use LLM Training Data Generation Framework✨\n\n",
    "Intro": "### [GraphGen](https://github.com/open-sciencelab/GraphGen) is a framework for synthetic data generation guided by knowledge graphs, designed to tackle the challenges of knowledge-intensive QA generation.\n\nBy uploading your text chunks (such as knowledge in agriculture, healthcare, or marine science) and filling in an LLM API key, you can generate the training data required by **[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)** and **[xtuner](https://github.com/InternLM/xtuner)** online. We will automatically delete user information after completion.",
    "Use Trainee Model": "Use the Trainee Model to identify knowledge blind spots; please keep this disabled when using SiliconCloud",
    "Synthesizer URL Info": "Base URL for the Synthesizer Model API; SiliconFlow is used by default",
    "Synthesizer Model Info": "Model used to construct knowledge graphs and generate QAs",
    "Trainee URL Info": "Base URL for the Trainee Model API; SiliconFlow is used by default",
    "Trainee Model Info": "Model to be trained",
    "SiliconFlow Token for Trainee Model": "SiliconFlow API Key for the Trainee Model",
    "Model Config": "Model Configuration",
    "SiliconFlow Token Info": "Get a SiliconFlow API key at \"https://cloud.siliconflow.cn/account/ak\" for efficient and stable access to LLM interfaces",
    "SiliconFlow Token": "SiliconFlow API Key",
    "Upload File": "Upload File",
    "Example Files": "Example Files",
    "File Preview": "File Preview",
    "Split Config Info": "If the input text is a long text without chunks, the system will split it into appropriate paragraphs based on the following parameters.",
    "Chunk Size Info": "The long text is split according to this value. A value that is too small leads to incomplete knowledge, while one that is too large makes the LLM input too long",
    "Chunk Size": "chunk_size(Chunk Size)",
    "Chunk Overlap Info": "The overlapping part between two adjacent chunks, which helps maintain context continuity",
    "Chunk Overlap": "chunk_overlap(Chunk Overlap)",
    "Split Config": "Split Config",
    "Quiz & Judge Config Info": "The Synthesizer Model generates quiz questions for each knowledge unit in the knowledge graph to assess the Trainee Model's understanding of that knowledge and obtain its comprehension loss.",
    "Quiz Samples Info": "Configure how many quiz questions to generate for each knowledge unit",
    "Quiz Samples": "quiz_samples(Quiz Samples)",
    "Quiz & Judge Config": "Quiz & Judge Config",
    "Partition Config Info": "Partition the knowledge graph into multiple communities (subgraphs); each community is the smallest unit for generating QAs. An appropriate partitioning method can improve relevance and diversity.",
    "Which algorithm to use for graph partitioning.": "Which algorithm to use for graph partitioning.",
    "Partition Method": "method(Partition Method)",
    "DFS intro": "The DFS partitioning method uses a depth-first search algorithm to traverse the knowledge graph, starting from one unit and exploring as deeply as possible along connected units until a preset community size is reached or there are no more unvisited units. It then starts a new community from another unvisited unit, repeating this process until all units are assigned to communities.",
    "Max Units Per Community Info": "The maximum number of knowledge units (nodes) allowed in each community. If a community exceeds this limit, it will be further partitioned. A unit refers to a node in the knowledge graph, which can be an entity or a relation.",
    "Max Units Per Community": "max_units_per_community(Max Units Per Community)",
    "BFS intro": "The BFS partitioning method uses a breadth-first search algorithm to traverse the knowledge graph, starting from one unit and exploring all of its neighboring units before moving on to the neighbors' neighbors. This process continues until a preset community size is reached or there are no more unvisited units. It then starts a new community from another unvisited unit, repeating this process until all units are assigned to communities.",
    "Leiden intro": "The Leiden partitioning method is a community detection algorithm based on modularity optimization, designed to identify tightly connected subgraphs within a graph. The algorithm iteratively optimizes the assignment of nodes to communities, maximizing the density of connections within communities while minimizing connections between communities. The Leiden algorithm handles large-scale graph data effectively and typically produces higher-quality community partitions than other community detection algorithms, such as the Louvain algorithm.",
    "Maximum Size of Communities Info": "The maximum number of nodes allowed in a community. If a community exceeds this limit, it will be further partitioned.",
    "Maximum Size of Communities": "max_size(Maximum Size of Communities)",
    "Use Largest Connected Component Info": "The largest connected component is the largest subset of nodes in a graph in which any two nodes are connected by a path. When this option is enabled, the partitioning algorithm only considers the largest connected component of the knowledge graph, ignoring the smaller connected components. This helps ensure that the generated communities have higher connectivity and relevance.",
    "Use Largest Connected Component": "use_lcc(Use Largest Connected Component)",
    "Random Seed Info": "The random seed changes the initial state of the graph partitioning and thereby affects the partitioning results. Setting different random seeds produces different community partitioning schemes, which helps improve the diversity of the generated QAs.",
    "Random Seed": "random_seed(Random Seed)",
    "ECE intro": "ECE is an original graph partitioning method based on the principle of model calibration. It evaluates the performance of each unit under the current model by computing its calibration error (referred to as the comprehension loss) and partitions the graph according to this loss.",
    "Min Units Per Community Info": "The minimum number of nodes allowed in each community. If a community has fewer nodes than this limit, it will be discarded.",
    "Min Units Per Community": "min_units_per_community(Min Units Per Community)",
    "Max Tokens Per Community Info": "The maximum number of tokens allowed in each community. If a community exceeds this limit, it will be further partitioned.",
    "Max Tokens Per Community": "max_tokens_per_community(Max Tokens Per Community)",
    "Unit Sampling Strategy Info": "The unit sampling strategy determines how units are selected from the candidates when constructing communities. Available strategies are random, max_loss, and min_loss: random selects units at random, max_loss prioritizes units with higher comprehension loss, and min_loss prioritizes units with lower comprehension loss.\n\n(Note: comprehension loss is only available after the Trainee Model has been enabled and evaluated, which unlocks the max_loss and min_loss strategies; otherwise only the random strategy can be used.)",
    "Unit Sampling Strategy": "unit_sampling(Unit Sampling Strategy)",
    "Partition Config": "Knowledge Graph Partition Config",
    "Generation Config Info": "Generation configuration includes the generation mode and the output data format.",
    "Mode Info": "Includes various generation modes such as atomic, aggregated, multi-hop, and chain-of-thought, suitable for tasks of different complexity.",
    "Mode": "mode(Mode)",
    "Output Data Format Info": "Includes various output formats such as Alpaca, Sharegpt, and ChatML.",
    "Output Data Format": "data_format(Output Data Format)",
    "Generation Config": "Generation Config",
    "Output File": "Output File"
  },
  "zh": {
    "Title": "# ✨开箱即用的LLM训练数据生成框架✨\n\n",
    "Intro": "### [GraphGen](https://github.com/open-sciencelab/GraphGen) 是一个基于知识图谱的数据合成框架,旨在解决知识密集型任务中的问答生成难题。\n\n上传你的文本块(如农业、医疗、海洋知识),填写 LLM API key,即可在线生成 **[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)**、**[xtuner](https://github.com/InternLM/xtuner)** 所需训练数据。结束后我们将自动删除用户信息。",
    "Use Trainee Model": "使用学生模型(Trainee Model)识别知识盲区,使用硅基流动时请保持禁用",
    "Synthesizer URL Info": "调用合成模型 API 的 URL,默认使用硅基流动",
    "Synthesizer Model Info": "用于构建知识图谱和生成问答的模型",
    "Trainee URL Info": "调用学生模型 API 的 URL,默认使用硅基流动",
    "Trainee Model Info": "用于训练的模型",
    "SiliconFlow Token for Trainee Model": "学生模型的硅基流动 API 密钥",
    "Model Config": "模型配置",
    "SiliconFlow Token Info": "在 \"https://cloud.siliconflow.cn/account/ak\" 获取硅基流动 API 密钥,使用高效稳定的 LLM 接口",
    "SiliconFlow Token": "硅基流动 API 密钥",
    "Upload File": "上传文件",
    "Example Files": "示例文件",
    "File Preview": "文件预览",
    "Split Config Info": "如果输入文本是未分块的长文本,系统会根据以下参数将文本分成合适的段落。",
    "Chunk Size Info":
"将按照该值分割长文本,太短会导致知识不完整,太长会导致 LLM 输入过长",
    "Chunk Size": "chunk_size(分割大小)",
    "Chunk Overlap Info": "两个相邻块之间的重叠部分,有助于保持上下文的连续性",
    "Chunk Overlap": "chunk_overlap(分割重叠大小)",
    "Split Config": "文本分割配置",
    "Quiz & Judge Config Info": "合成模型根据知识图谱中的每个知识单元生成判断题,用于评估学生模型对知识的理解程度,得到理解误差。",
    "Quiz Samples Info": "配置每个知识单元生成多少判断题",
    "Quiz Samples": "quiz_samples(判断题数量)",
    "Quiz & Judge Config": "测试与评判配置",
    "Partition Config Info": "将知识图谱划分为多个社区(子图),每个社区是生成问答的最小单位。合适的划分方法可以提高关联性和多样性。",
    "Which algorithm to use for graph partitioning.": "选择用于图划分的算法。",
    "Partition Method": "method(划分方法)",
    "DFS intro": "DFS 划分方法使用深度优先搜索算法遍历知识图谱,从一个单元开始,沿着与之连接的单元深入探索,直到达到预设的社区大小或没有更多未访问的单元为止。然后,它会从另一个未访问的单元开始新的社区,重复这一过程,直到所有单元都被分配到社区中。",
    "Max Units Per Community Info": "每个社区允许的知识单元(节点)的最大数量。如果一个社区超过这个限制,它将被进一步划分。一个单元指的是知识图谱中的一个节点,可以是实体或关系。",
    "Max Units Per Community": "max_units_per_community(每个社区的最大单元数)",
    "BFS intro": "BFS 划分方法使用广度优先搜索算法遍历知识图谱,从一个单元开始,探索所有与之直接连接的单元,然后再从这些单元出发,继续探索它们的直接连接单元。这个过程会持续直到达到预设的社区大小或没有更多未访问的单元为止。然后,它会从另一个未访问的单元开始新的社区,重复这一过程,直到所有单元都被分配到社区中。",
    "Leiden intro": "Leiden 划分方法是一种基于模块度优化的社区检测算法,旨在识别图中的紧密连接子图。该算法通过迭代地优化节点的社区分配,最大化社区内的连接密度,同时最小化社区间的连接。Leiden 算法能够有效处理大规模图数据,并且通常比其他社区检测算法(如 Louvain 算法)产生更高质量的社区划分结果。",
    "Maximum Size of Communities Info": "一个社区中允许的最大节点数量。如果一个社区的节点数超过这个限制,它将被进一步划分。",
    "Maximum Size of Communities": "max_size(社区的最大尺寸)",
    "Use Largest Connected Component Info": "最大连通分量是指图中任意两个节点之间都存在路径连接的最大节点子集。启用此选项后,划分算法将仅考虑知识图谱中的最大连通分量进行社区划分,忽略其他较小的连通分量。这有助于确保生成的社区具有更高的连通性和相关性。",
    "Use Largest Connected Component": "use_lcc(使用最大连通分量)",
    "Random Seed Info": "随机种子改变图划分的初始状态,从而影响划分结果。通过设置不同的随机种子,可以生成不同的社区划分方案,有助于提高生成问答的多样性。",
    "Random Seed": "random_seed(随机种子)",
    "ECE intro": "ECE 是一种基于模型校准原理的原创图划分方法。ECE 通过计算单元的校准误差来评估其在当前模型下的表现(记为理解误差),并根据理解误差对图进行划分。",
    "Min Units Per Community Info": "限制每个社区中允许的最小节点数量。如果一个社区的节点数少于这个限制,它将被舍弃。",
    "Min Units Per Community": "min_units_per_community(每个社区的最小单元数)",
    "Max Tokens Per Community Info": "每个社区允许的最大 Token 数量。如果一个社区的 Token 数超过这个限制,它将被进一步划分。",
    "Max Tokens Per Community": "max_tokens_per_community(每个社区的最大Token数)",
    "Unit Sampling Strategy Info": "单元采样策略决定在构建社区时如何从候选单元中选择单元。单元采样策略包括 random、max_loss 和 min_loss:random 表示随机选择单元,max_loss 表示优先选择理解误差较大的单元,min_loss 表示优先选择理解误差较小的单元。\n\n(注意:只有在启用学生模型并完成评测后才会得到理解误差,才能使用 max_loss 和 min_loss 策略;否则只能使用 random 策略)",
    "Unit Sampling Strategy": "unit_sampling(单元采样策略)",
    "Partition Config": "知识图谱划分配置",
    "Generation Config Info": "生成配置包括生成模式和输出数据格式。",
    "Mode Info": "包括原子、聚合、多跳、思维链等多种生成模式,适用于不同复杂度的任务。",
    "Mode": "mode(生成模式)",
    "Output Data Format Info": "包括 Alpaca、Sharegpt、ChatML 等多种输出格式。",
    "Output Data Format": "data_format(输出数据格式)",
    "Generation Config": "生成配置",
    "Output File": "输出文件"
  }
}