---
title: Spanish Embeddings Api
emoji: 🐨
colorFrom: green
colorTo: green
sdk: docker
pinned: false
---

# Multilingual & Legal Embeddings API

A FastAPI application that serves **5 specialized embedding models** for Spanish, Catalan, English, and multilingual text. Each model has its own dedicated endpoint, so clients select a model by URL.

🌐 **Live API**: [https://aurasystems-spanish-embeddings-api.hf.space](https://aurasystems-spanish-embeddings-api.hf.space)

📖 **Interactive Docs**: [https://aurasystems-spanish-embeddings-api.hf.space/docs](https://aurasystems-spanish-embeddings-api.hf.space/docs)

## 🚀 Quick Start

### Basic Usage

```bash
# Test jina-v3 endpoint (multilingual, loads at startup)
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina-v3" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Hello world", "Hola mundo"], "normalize": true}'

# Test Catalan RoBERTa endpoint
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/roberta-ca" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Bon dia", "Com estàs?"], "normalize": true}'
```

## 📚 Available Models & Endpoints

| Endpoint | Model | Languages | Dimensions | Max Tokens | Loading Strategy |
|----------|-------|-----------|------------|------------|------------------|
| `/embed/jina-v3` | jinaai/jina-embeddings-v3 | Multilingual (30+) | 1024 | 8192 | **Startup** |
| `/embed/roberta-ca` | projecte-aina/roberta-large-ca-v2 | Catalan | 1024 | 512 | On-demand |
| `/embed/jina` | jinaai/jina-embeddings-v2-base-es | Spanish, English | 768 | 8192 | On-demand |
| `/embed/robertalex` | PlanTL-GOB-ES/RoBERTalex | Spanish Legal | 768 | 512 | On-demand |
| `/embed/legal-bert` | nlpaueb/legal-bert-base-uncased | English Legal | 768 | 512 | On-demand |

### Model Recommendations

- **🌍 General multilingual**: Use `/embed/jina-v3` - Best overall performance
- **🇪🇸 Spanish general**: Use `/embed/jina` - Excellent for Spanish/English
- **🇪🇸 Spanish legal**: Use `/embed/robertalex` - Specialized for legal texts
- **🏴󠁧󠁢󠁣󠁡󠁴󠁿 Catalan**: Use `/embed/roberta-ca` - Best for Catalan text
- **🇬🇧 English legal**: Use `/embed/legal-bert` - Specialized for legal documents

## 🔗 API Endpoints

### Model-Specific Embedding Endpoints

Each model has its dedicated endpoint:

```
POST /embed/jina-v3      # Multilingual (startup model)
POST /embed/roberta-ca   # Catalan
POST /embed/jina         # Spanish/English
POST /embed/robertalex   # Spanish Legal
POST /embed/legal-bert   # English Legal
```

### Utility Endpoints

```
GET /         # API information
GET /health   # Health check and model status
GET /models   # List all models with specifications
```

## 📖 Usage Examples

### Python

```python
import requests

API_URL = "https://aurasystems-spanish-embeddings-api.hf.space"

# Example 1: Multilingual with Jina v3 (startup model - fastest)
response = requests.post(
    f"{API_URL}/embed/jina-v3",
    json={
        "texts": [
            "Hello world",    # English
            "Hola mundo",     # Spanish
            "Bonjour monde",  # French
            "こんにちは世界"   # Japanese
        ],
        "normalize": True
    }
)
result = response.json()
print(f"Jina v3: {result['dimensions']} dimensions")  # 1024

# Example 2: Catalan text with RoBERTa-ca
response = requests.post(
    f"{API_URL}/embed/roberta-ca",
    json={
        "texts": [
            "Bon dia, com estàs?",
            "Barcelona és una ciutat meravellosa",
            "M'agrada la cultura catalana"
        ],
        "normalize": True
    }
)
catalan_result = response.json()
print(f"Catalan: {catalan_result['dimensions']} dimensions")  # 1024

# Example 3: Spanish legal text with RoBERTalex
response = requests.post(
    f"{API_URL}/embed/robertalex",
    json={
        "texts": [
            "Artículo primero de la constitución",
            "El contrato será válido desde la fecha de firma",
            "La jurisprudencia establece que..."
        ],
        "normalize": True
    }
)
legal_result = response.json()
print(f"Spanish Legal: {legal_result['dimensions']} dimensions")  # 768

# Example 4: English legal text with Legal-BERT
response = requests.post(
    f"{API_URL}/embed/legal-bert",
    json={
        "texts": [
            "This agreement is legally binding",
            "The contract shall be governed by English law",
            "The party hereby agrees and covenants"
        ],
        "normalize": True
    }
)
english_legal_result = response.json()
print(f"English Legal: {english_legal_result['dimensions']} dimensions")  # 768

# Example 5: Spanish/English bilingual with Jina v2
response = requests.post(
    f"{API_URL}/embed/jina",
    json={
        "texts": [
            "Inteligencia artificial y machine learning",
            "Artificial intelligence and machine learning",
            "Procesamiento de lenguaje natural"
        ],
        "normalize": True
    }
)
bilingual_result = response.json()
print(f"Bilingual: {bilingual_result['dimensions']} dimensions")  # 768
```

### JavaScript/Node.js

```javascript
const API_URL = 'https://aurasystems-spanish-embeddings-api.hf.space';

// Function to get embeddings from a specific endpoint
async function getEmbeddings(endpoint, texts) {
  const response = await fetch(`${API_URL}/embed/${endpoint}`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      texts: texts,
      normalize: true
    })
  });

  if (!response.ok) {
    throw new Error(`Error: ${response.status}`);
  }

  return await response.json();
}

// Usage examples (top-level await requires an ES module)
try {
  // Multilingual embeddings
  const multilingualResult = await getEmbeddings('jina-v3', [
    'Hello world',
    'Hola mundo',
    'Ciao mondo'
  ]);
  console.log('Multilingual dimensions:', multilingualResult.dimensions);

  // Catalan embeddings
  const catalanResult = await getEmbeddings('roberta-ca', [
    'Bon dia',
    'Com estàs?'
  ]);
  console.log('Catalan dimensions:', catalanResult.dimensions);
} catch (error) {
  console.error('Error:', error);
}
```

### cURL Examples

```bash
# Multilingual with Jina v3 (startup model)
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina-v3" \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Hello", "Hola", "Bonjour"],
    "normalize": true
  }'

# Catalan with RoBERTa-ca
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/roberta-ca" \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Bon dia", "Com estàs?"],
    "normalize": true
  }'

# Spanish legal with RoBERTalex
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/robertalex" \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Artículo primero"],
    "normalize": true
  }'

# English legal with Legal-BERT
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/legal-bert" \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["This agreement is binding"],
    "normalize": true
  }'

# Spanish/English bilingual with Jina v2
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina" \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Texto en español", "Text in English"],
    "normalize": true
  }'
```

## 📋 Request/Response Schema

### Request Body

```json
{
  "texts": ["text1", "text2", "..."],
  "normalize": true,
  "max_length": null
}
```

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `texts` | array[string] | ✅ Yes | - | 1-50 texts to embed |
| `normalize` | boolean | No | `true` | L2-normalize embeddings |
| `max_length` | integer/null | No | `null` | Max tokens (model-specific limits) |

### Response Body

```json
{
  "embeddings": [[0.123, -0.456, ...], [0.789, -0.012, ...]],
  "model_used": "jina-v3",
  "dimensions": 1024,
  "num_texts": 2
}
```

## ⚡ Performance & Limits

- **Maximum texts per request**: 50
- **Startup model**: `jina-v3` loads at startup (fastest response)
- **On-demand models**: Load on first request (~30-60s first time)
- **Typical response time**: 100-300ms after models are loaded
- **Memory optimization**: Automatic cleanup for large batches
- **CORS enabled**: Works from any domain

## 🔧 Advanced Usage

### LangChain Integration

```python
from langchain.embeddings.base import Embeddings
from typing import List
import requests

class MultilingualEmbeddings(Embeddings):
    """LangChain integration for multilingual embeddings"""

    def __init__(self, endpoint: str = "jina-v3"):
        """
        Initialize with specific endpoint

        Args:
            endpoint: One of "jina-v3", "roberta-ca", "jina", "robertalex", "legal-bert"
        """
        self.api_url = f"https://aurasystems-spanish-embeddings-api.hf.space/embed/{endpoint}"
        self.endpoint = endpoint

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        response = requests.post(
            self.api_url,
            json={"texts": texts, "normalize": True}
        )
        response.raise_for_status()
        return response.json()["embeddings"]

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]

# Usage examples
multilingual_embeddings = MultilingualEmbeddings("jina-v3")
catalan_embeddings = MultilingualEmbeddings("roberta-ca")
spanish_legal_embeddings = MultilingualEmbeddings("robertalex")
```

### Semantic Search

```python
import numpy as np
import requests
from typing import List, Tuple

def semantic_search(query: str, documents: List[str],
                    endpoint: str = "jina-v3", top_k: int = 5) -> List[Tuple[int, float]]:
    """Semantic search using specific model endpoint"""
    # Embed the query and the documents in a single request (query first)
    response = requests.post(
        f"https://aurasystems-spanish-embeddings-api.hf.space/embed/{endpoint}",
        json={"texts": [query] + documents, "normalize": True}
    )
    response.raise_for_status()

    embeddings = np.array(response.json()["embeddings"])
    query_embedding = embeddings[0]
    doc_embeddings = embeddings[1:]

    # Embeddings are L2-normalized, so the dot product equals cosine similarity
    similarities = np.dot(doc_embeddings, query_embedding)
    top_indices = np.argsort(similarities)[::-1][:top_k]

    return [(int(idx), float(similarities[idx])) for idx in top_indices]

# Example: Multilingual search
documents = [
    "Python programming language",
    "Lenguaje de programación Python",
    "Llenguatge de programació Python",
    "Langage de programmation Python"
]

results = semantic_search("código en Python", documents, "jina-v3")
for idx, score in results:
    print(f"{score:.4f}: {documents[idx]}")
```

## 🚨 Error Handling

### HTTP Status Codes

| Code | Description |
|------|-------------|
| 200 | Success |
| 400 | Bad Request (validation error) |
| 422 | Unprocessable Entity (schema error) |
| 500 | Internal Server Error (model loading failed) |

### Common Errors

```python
import requests

# Handle errors properly
try:
    response = requests.post(
        "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina-v3",
        json={"texts": ["text"], "normalize": True}
    )
    response.raise_for_status()
    result = response.json()
except requests.exceptions.HTTPError as e:
    # The server answered with a 4xx/5xx status; the body describes the problem
    print(f"HTTP error: {e}")
    print(f"Response: {e.response.text}")
except requests.exceptions.RequestException as e:
    # Connection failures, timeouts, and other transport-level errors
    print(f"Request error: {e}")
```

## 📊 Model Status Check

```python
import requests

# Check which models are loaded
health = requests.get("https://aurasystems-spanish-embeddings-api.hf.space/health")
status = health.json()

print(f"API Status: {status['status']}")
print(f"Startup model loaded: {status['startup_model_loaded']}")
print(f"Available models: {status['available_models']}")
print(f"Models loaded: {status['models_count']}/5")

# Check endpoint status
for model, endpoint_status in status['endpoints'].items():
    print(f"{model}: {endpoint_status}")
```

## 🔒 Authentication & Rate Limits

- **Authentication**: None required (open API)
- **Rate limits**: Generous limits on Hugging Face Spaces
- **CORS**: Enabled for all origins
- **Usage**: Free for research and commercial use, subject to each model's license

## 🏗️ Architecture

### Endpoint-Per-Model Design

- **Startup model**: `jina-v3` loads at application startup for fastest response
- **On-demand loading**: Other models load when first requested
- **Memory optimization**: Progressive loading reduces startup time
- **Model caching**: Once loaded, models remain in memory for fast inference

### Technical Stack

- **FastAPI**: Modern async web framework
- **Transformers**: Hugging Face model library
- **PyTorch**: Deep learning backend
- **Docker**: Containerized deployment
- **Hugging Face Spaces**: Cloud hosting platform

## 📄 Model Licenses

- **Jina models**: Apache 2.0 (`jina-embeddings-v2-base-es`); CC BY-NC 4.0 (`jina-embeddings-v3`)
- **RoBERTa models**: MIT/Apache 2.0
- **Legal-BERT**: Apache 2.0

## 🤝 Support & Contributing

- **Issues**: [Hugging Face Discussions](https://huggingface.co/spaces/AuraSystems/spanish-embeddings-api/discussions)
- **Interactive Docs**: [FastAPI Swagger UI](https://aurasystems-spanish-embeddings-api.hf.space/docs)
- **Model Papers**: Check individual model pages on Hugging Face

---

Built with ❤️ using **FastAPI** and **Hugging Face Transformers**
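
## 📦 Batching Large Inputs

Each request accepts at most 50 texts, so a larger corpus has to be split on the client side. The sketch below is one possible way to do that with only the Python standard library; it is not part of the API itself. The helper names (`chunked`, `post_json`, `embed_in_batches`), the injectable `send` parameter, and the 120-second timeout (chosen to leave room for an on-demand model's first load) are assumptions made for illustration.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterator, List

API_URL = "https://aurasystems-spanish-embeddings-api.hf.space"
MAX_TEXTS_PER_REQUEST = 50  # limit stated under "Performance & Limits"


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def post_json(url: str, payload: Dict) -> Dict:
    """POST a JSON payload and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Assumed timeout: generous enough for a cold on-demand model
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))


def embed_in_batches(
    texts: List[str],
    endpoint: str = "jina-v3",
    normalize: bool = True,
    send: Callable[[str, Dict], Dict] = post_json,
) -> List[List[float]]:
    """Embed any number of texts, sending at most 50 per request.

    Results are concatenated in input order. `send` is a hypothetical
    hook that makes it possible to swap the HTTP call (for example in tests).
    """
    url = f"{API_URL}/embed/{endpoint}"
    embeddings: List[List[float]] = []
    for batch in chunked(texts, MAX_TEXTS_PER_REQUEST):
        result = send(url, {"texts": batch, "normalize": normalize})
        embeddings.extend(result["embeddings"])
    return embeddings
```

Calling `embed_in_batches(corpus, endpoint="robertalex")` would issue one request per 50 texts and return a single list aligned with `corpus`. Requests are sent sequentially, which keeps the sketch simple and avoids triggering several cold model loads at once.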