Commit · 740ed29
1 Parent(s): f4c750f

modified readme

README.md CHANGED
@@ -1,42 +1,93 @@
 ---
 library_name: transformers
 language:
-- en
 - fr
 - de
+- en
+- it
+- lb
+license: agpl-3.0
 tags:
--
+- language-identification
+- multilingual
+- historical
+- impresso
 ---
 
+# Model Card for impresso-project/language-identifier
+
+## Overview
+
+`impresso-project/language-identifier` is a multilingual language identification model fine-tuned for use on historical newspaper content. It supports **German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb)** — the core languages of the [Impresso Project](https://impresso-project.ch), which focuses on analyzing historical media across national and linguistic borders.
+
+This model has been adapted for short, OCR-noisy and fragmentary inputs typical of historical digitized texts.
 
-
+## Model Details
 
-
+- **Model type:** Language identification
+- **Interface:** Hugging Face `transformers` pipeline
+- **Languages supported:** fr, de, en, it, lb
+- **License:** AGPL-3.0
+- **Developed by:** UZH, Switzerland
+- **Training data:** Historical newspapers from the impresso corpus and related sources
+
+## How to Use
 
-<!-- Provide a longer summary of what this model is. -->
 ```python
 from transformers import pipeline
 
-MODEL_NAME = "
-
-lang_pipeline = pipeline("lang-detect", model=MODEL_NAME,
-                        trust_remote_code=True,
-                        device='cpu')
+MODEL_NAME = "impresso-project/language-identifier"
 
-
-
+lang_pipeline = pipeline(
+    "langident",
+    model=MODEL_NAME,
+    trust_remote_code=True,
+    device="cpu",
+)
 
-
-
+text = """En l'an 1348, au plus fort des ravages de la peste noire à travers
+l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et
+face à une opportunité."""
 
+langs = lang_pipeline(text)
+print(langs)
 ```
 
+## Output Format
+
+The output is a single dictionary with the predicted language and confidence score:
+
+```python
+{
+  "language": "fr",
+  "score": 1.0
+}
 ```
-{'label': 'fr', 'confidence': 99.87}
-```
-Works with lists of sentences also.
 
-### BibTeX entry and citation info
 
+## Use Cases
+
+- Preprocessing for OCR and NLP tasks on historical corpora
+- Document and segment-level language tagging
+- Filtering and sorting multilingual newspaper archives
+
+## Limitations
+
+- Works best on **sentence- or paragraph-length** texts
+- May struggle with code-switching or OCR-degraded text that mixes languages
+- Primarily optimized for **Impresso-like sources** (19th–20th century newspapers)
+
+## Installation
+
+```bash
+pip install transformers floret
 ```
-
+
+## Contact
+
+- Website: [https://impresso-project.ch](https://impresso-project.ch)
+
+<p align="center">
+  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
+</p>
+
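
The new README describes the output as a single dictionary with `language` and `score` keys and lists "filtering and sorting multilingual newspaper archives" as a use case. A minimal sketch of that post-processing step follows; it is an illustration, not part of the commit. The names `group_by_language`, `fake_identify`, `SUPPORTED`, the `min_score` threshold, and the `"und"` bucket are all hypothetical, and the stub stands in for `lang_pipeline` so the example runs without downloading the model.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List

# Language codes listed in the README's "Languages supported" line.
SUPPORTED = {"de", "fr", "it", "en", "lb"}


def group_by_language(
    texts: Iterable[str],
    identify: Callable[[str], Dict[str, Any]],
    min_score: float = 0.5,  # hypothetical threshold; tune on your own data
) -> Dict[str, List[str]]:
    """Bucket texts by predicted language; uncertain ones go under "und"."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        # Assumed shape, per the README: {"language": str, "score": float}
        pred = identify(text)
        lang = pred["language"]
        score = float(pred["score"])
        if lang in SUPPORTED and score >= min_score:
            buckets[lang].append(text)
        else:
            # "und" (undetermined) is this sketch's choice, not model output.
            buckets["und"].append(text)
    return dict(buckets)


def fake_identify(text: str) -> Dict[str, Any]:
    """Stand-in for lang_pipeline; returns canned predictions."""
    if "le " in text.lower():
        return {"language": "fr", "score": 0.99}
    return {"language": "de", "score": 0.30}


result = group_by_language(["Le Royaume de France", "xx yy zz"], fake_identify)
print(result)
```

With the real model, `lang_pipeline` could be passed as `identify`, assuming it returns one dictionary per input string as the README states. The function calls `identify` once per text, so it does not rely on batch or list input being supported.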
