ReadMe with correct example usage.

#4
Files changed (1)
  1. context_relevancy_lora/README.md +115 -34
context_relevancy_lora/README.md CHANGED
@@ -6,7 +6,6 @@ base_model: ibm-granite/granite-3.3-8b-instruct
  library_name: peft
  library_name: transformers
  ---
-
  # LoRA Adapter for Context Relevancy
  Welcome to Granite Experiments!

@@ -20,7 +19,6 @@ Just a heads-up: Experiments are forever evolving, so we can't commit to ongoing
  This is a LoRA adapter for [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct) that is fine-tuned for the context relevancy task:

  Given (1) a document and (2) a multi-turn conversation between a user and an AI assistant, identify whether the document is relevant (including partially relevant) and useful for answering the last user question.
-
  While this adapter is general purpose, it is especially effective in RAG settings right after the retrieval step, where the adapter can be used to identify documents or passages that may mislead or harm the downstream generator model's response generation.

  - **Developer:** IBM Research
@@ -37,18 +35,27 @@ The classification output from the context relevancy model can be used in severa
  - Signal to human annotators working in RAG settings which documents are irrelevant/relevant to the current turn of the conversation they are reviewing. Identifying such documents helps reduce the [human annotator's high cognitive load](https://dl.acm.org/doi/10.1145/3706599.3719962) involved in manually reading and reviewing several documents, especially in long multi-turn conversations.


- **Model input**: The input to the model is a list of conversational turns and a list of documents, where each document is a dict containing the fields `title` and `text`. The turns in the conversation can alternate between the `user` and `assistant` roles, and the last turn is assumed to be from the `user`. For every document in the list of documents, the model converts that document and the conversation into a string using the `apply_chat_template` function.

- To prompt the LoRA adapter to determine context relevancy, a special context relevancy role is used to trigger this capability of the model. The role includes the keyword "context_relevance": `<|start_of_role|>context_relevance<|end_of_role|>`

- ~~~
  <|start_of_role|>context_relevance: Analyze the provided document in relation to the final user query from the conversation. Determine if the document contains information that could help answer the final user query. Output 'relevant' if the document contains substantial information directly useful for answering the final user query. Output 'partially relevant' if the document contains some related information that could partially help answer the query, or if you are uncertain about the relevance - err on the side of 'partially relevant' when in doubt. Output 'irrelevant' only if the document clearly contains no information that could help answer the final user query. When uncertain, choose 'partially relevant' rather than 'irrelevant'. Your output should be a JSON structure with the context relevance classification:
  ```json
  {
  "context_relevance": "YOUR_CONTEXT_RELEVANCE_CLASSIFICATION_HERE"
  }
  ```<|end_of_role|>
- ~~~

  **Model output**: When prompted with the above input, the model generates a json structure containing the context relevance output (irrelevant, partially relevant, relevant), e.g.

@@ -63,58 +70,132 @@ To prompt the LoRA adapter to determine context relevancy, a special context rel
  Use the code below to get started with the model. Before running the script, set the `LORA_NAME` parameter to the path of the directory where you downloaded the LoRA adapter. The download process is explained [here](https://huggingface.co/ibm-granite/granite-3.3-8b-rag-agent-lib#quickstart-example).
 
  ```python
- import torch
- from transformers import AutoTokenizer, AutoModelForCausalLM
- from peft import PeftModel
- from peft import PeftModelForCausalLM as lora_model

- device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')

- CONTEXT_RELEVANCY_PROMPT = "<|start_of_role|>context_relevance<|end_of_role|>"
  BASE_NAME = "ibm-granite/granite-3.3-8b-instruct"
  LORA_NAME = "PATH_TO_DOWNLOADED_DIRECTORY"

- tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side='left',trust_remote_code=True)
- model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME,device_map="auto")
  model_context_relevancy = PeftModel.from_pretrained(model_base, LORA_NAME)

  convo = [
  {
- "role": "user",
- "content": "Am I better off going to work for a FAANG?"
  },
  {
- "role": "assistant",
- "content": "I can't tell you much about working for a FAANG (Facebook, Amazon, Apple, Netflix, Google) company, but large companies offer resources such as the opportunity to learn from experienced people, and teams dedicated to support you. There isn't a single \"\"right\"\" place to work. FAANG companies tend to offer 6-figure salaries."
  },
  {
- "role": "user",
- "content": "which FAANG pays the most?"
  }
- ]
-

  documents = [
  {
  "title": "",
- "text": "\nEh, you hear this argument all the time, but it doesn't actually work out that way in the real world, though, because corporate pay structure is extremely malleable over time. If a company makes $100M one year, it will pay the investors what they're expecting, the low-level employees what they're willing to tolerate (which is often the minimum wage), and then the upper management whatever is left (i.e. whatever the company can afford to attract the best management). By bumping up the minimum wage, the main effect is that it forces companies to change their pay structures (which are currently ridiculous - the U.S. CEO-to-avg-worker pay is around 200:1; Japan and Germany are around 15:1, IIRC). Feel free to dig deeper into the numbers and the studies if you want further evidence, but even a cursory glance at our history (or the current situation in Australia) shows that the effects of a high minimum wage on both inflation and unemployment are largely overstated."
  },
  {
  "title": "",
- "text": "\nThe highest paid finance role is a hedge fund manager at a top fund - but that's like winning the lotto so here's the most pragmatic way to make a lot of money: * First 2-3 years out of college: Investment Banking Analyst * Next 2-3 years: Switch to the buyside (Private Equity) You'll easily top $400k by the time you're 26-27. If you're promoted to VP you are golden. Most get forced out after their associate stint and go to a top MBA program, after which you'd go back into PE or do the CFO route. Not sure w/o a degree, to be honest."
  }
  ]

- for document in documents:
- string = tokenizer.apply_chat_template(convo, documents=[document], tokenize=False,add_generation_prompt=False)
- inputs = string + CONTEXT_RELEVANCY_PROMPT
-
- inputT = tokenizer(inputs, return_tensors="pt")
-
- output = model_context_relevancy.generate(inputT["input_ids"].to(device), attention_mask=inputT["attention_mask"].to(device), max_new_tokens=3)
- output_text = tokenizer.decode(output[0])
- answer = output_text.split(CONTEXT_RELEVANCY_PROMPT)[1]
- print(answer)
  ```

  ## Training Details
 
context_relevancy_lora/README.md (after changes)

  library_name: peft
  library_name: transformers
  ---
  # LoRA Adapter for Context Relevancy
  Welcome to Granite Experiments!

  This is a LoRA adapter for [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct) that is fine-tuned for the context relevancy task:

  Given (1) a document and (2) a multi-turn conversation between a user and an AI assistant, identify whether the document is relevant (including partially relevant) and useful for answering the last user question.
  While this adapter is general purpose, it is especially effective in RAG settings right after the retrieval step, where the adapter can be used to identify documents or passages that may mislead or harm the downstream generator model's response generation.

  - **Developer:** IBM Research
 
  - Signal to human annotators working in RAG settings which documents are irrelevant/relevant to the current turn of the conversation they are reviewing. Identifying such documents helps reduce the [human annotator's high cognitive load](https://dl.acm.org/doi/10.1145/3706599.3719962) involved in manually reading and reviewing several documents, especially in long multi-turn conversations.


+ **Model input**: The input to the model consists of:
+ 1. A conversation formatted using the chat template
+ 2. The final user query extracted from the conversation
+ 3. A document to evaluate for relevance
+ 4. A special context relevancy invocation prompt

+ The model uses a specific format with separate roles for each component:
+ - Conversation: Applied via `tokenizer.apply_chat_template()`
+ - Final user query: `<|start_of_role|>final_user_query<|end_of_role|>{query}<|end_of_text|>`
+ - Document: `<|start_of_role|>document {"document_id": "1"}<|end_of_role|>{document_content}<|end_of_text|>`
+ - Context relevance prompt: See below

+ **Context Relevance Invocation Prompt**:
+ ```
  <|start_of_role|>context_relevance: Analyze the provided document in relation to the final user query from the conversation. Determine if the document contains information that could help answer the final user query. Output 'relevant' if the document contains substantial information directly useful for answering the final user query. Output 'partially relevant' if the document contains some related information that could partially help answer the query, or if you are uncertain about the relevance - err on the side of 'partially relevant' when in doubt. Output 'irrelevant' only if the document clearly contains no information that could help answer the final user query. When uncertain, choose 'partially relevant' rather than 'irrelevant'. Your output should be a JSON structure with the context relevance classification:
  ```json
  {
  "context_relevance": "YOUR_CONTEXT_RELEVANCE_CLASSIFICATION_HERE"
  }
  ```<|end_of_role|>
+ ```
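Concretely, the final model input is these pieces concatenated in order: the rendered conversation, the `final_user_query` role, the `document` role, and the invocation prompt. The snippet below is a minimal sketch of that concatenation using placeholder strings; `conversation_string` and `document_content` are illustrative stand-ins, and the quickstart example further down shows how to produce them with the tokenizer.

```python
# Minimal sketch of the assembled input (placeholder values; see the quickstart below).
CR_INVOCATION_PROMPT = "<|start_of_role|>context_relevance: ...<|end_of_role|>"  # full prompt text shown above

conversation_string = "<|start_of_role|>user<|end_of_role|>which FAANG pays the most?<|end_of_text|>\n"  # illustrative
final_user_query = "which FAANG pays the most?"
document_content = "Example document text."  # illustrative

final_query_role = f"<|start_of_role|>final_user_query<|end_of_role|>{final_user_query}<|end_of_text|>\n"
document_role = f"<|start_of_role|>document {{\"document_id\": \"1\"}}<|end_of_role|>\n{document_content}<|end_of_text|>\n"

input_text = conversation_string + final_query_role + document_role + CR_INVOCATION_PROMPT
print(input_text)
```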

  **Model output**: When prompted with the above input, the model generates a json structure containing the context relevance output (irrelevant, partially relevant, relevant), e.g.

  Use the code below to get started with the model. Before running the script, set the `LORA_NAME` parameter to the path of the directory where you downloaded the LoRA adapter. The download process is explained [here](https://huggingface.co/ibm-granite/granite-3.3-8b-rag-agent-lib#quickstart-example).
 
  ```python
+ import torch
+ import json
+ import re
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ from peft import PeftModel
+
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

+ # Define the context relevance prompt
+ CR_INSTRUCTION_TEXT = "Analyze the provided document in relation to the final user query from the conversation. Determine if the document contains information that could help answer the final user query. Output 'relevant' if the document contains substantial information directly useful for answering the final user query. Output 'partially relevant' if the document contains some related information that could partially help answer the query, or if you are uncertain about the relevance - err on the side of 'partially relevant' when in doubt. Output 'irrelevant' only if the document clearly contains no information that could help answer the final user query. When uncertain, choose 'partially relevant' rather than 'irrelevant'."
+ cr_json_object = {
+     "context_relevance": "YOUR_CONTEXT_RELEVANCE_CLASSIFICATION_HERE"
+ }
+ cr_json_str = json.dumps(cr_json_object, indent=4)
+ CR_JSON = "Your output should be a JSON structure with the context relevance classification:\n" + "```json\n" + cr_json_str + "\n```"
+ CR_INVOCATION_PROMPT = "<|start_of_role|>context_relevance: " + CR_INSTRUCTION_TEXT + " " + CR_JSON + "<|end_of_role|>"

  BASE_NAME = "ibm-granite/granite-3.3-8b-instruct"
  LORA_NAME = "PATH_TO_DOWNLOADED_DIRECTORY"

+ # Load tokenizer and models
+ tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side='left', trust_remote_code=True)
+ model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map="auto")
  model_context_relevancy = PeftModel.from_pretrained(model_base, LORA_NAME)

+ # Example conversation and documents
  convo = [
  {
+     "role": "user",
+     "content": "Am I better off going to work for a FAANG?"
  },
  {
+     "role": "assistant",
+     "content": "I can't tell you much about working for a FAANG (Facebook, Amazon, Apple, Netflix, Google) company, but large companies offer resources such as the opportunity to learn from experienced people, and teams dedicated to support you. There isn't a single \"\"right\"\" place to work. FAANG companies tend to offer 6-figure salaries."
  },
  {
+     "role": "user",
+     "content": "which FAANG pays the most?"
  }
+ ]

  documents = [
  {
  "title": "",
+     "text": "\nEh, you hear this argument all the time, but it doesn't actually work out that way in the real world, though, because corporate pay structure is extremely malleable over time. If a company makes $100M one year, it will pay the investors what they're expecting, the low-level employees what they're willing to tolerate (which is often the minimum wage), and then the upper management whatever is left (i.e. whatever the company can afford to attract the best management). By bumping up the minimum wage, the main effect is that it forces companies to change their pay structures (which are currently ridiculous - the U.S. CEO-to-avg-worker pay is around 200:1; Japan and Germany are around 15:1, IIRC). Feel free to dig deeper into the numbers and the studies if you want further evidence, but even a cursory glance at our history (or the current situation in Australia) shows that the effects of a high minimum wage on both inflation and unemployment are largely overstated."
  },
  {
  "title": "",
+     "text": "\nThe highest paid finance role is a hedge fund manager at a top fund - but that's like winning the lotto so here's the most pragmatic way to make a lot of money: * First 2-3 years out of college: Investment Banking Analyst * Next 2-3 years: Switch to the buyside (Private Equity) You'll easily top $400k by the time you're 26-27. If you're promoted to VP you are golden. Most get forced out after their associate stint and go to a top MBA program, after which you'd go back into PE or do the CFO route. Not sure w/o a degree, to be honest."
  }
  ]

+ def extract_and_format_json(raw_text):
+     """Extract JSON content from raw_text"""
+     match = re.search(r"```json\s*(.*?)\s*```", raw_text, re.DOTALL)
+     if not match:
+         raise ValueError("No valid JSON fenced by ```json ...``` was found.")
+
+     json_string = match.group(1)
+     # Remove invalid escape sequences
+     cleaned_json_string = re.sub(r'\\(?![\"\\/bfnrt]|u[0-9a-fA-F]{4})', '', json_string)
+
+     try:
+         parsed = json.loads(cleaned_json_string)
+     except json.JSONDecodeError as e:
+         raise ValueError(f"Invalid JSON format after cleaning: {e}")
+
+     return parsed
+
+ # Process each document
+ for i, document in enumerate(documents):
+     # Extract document content
+     document_content = f"{document['title']}\n\n{document['text']}".strip() if document['title'] else document['text']
+
+     # Extract final user query
+     final_user_query = None
+     for msg in reversed(convo):
+         if msg["role"] == "user":
+             final_user_query = msg["content"]
+             break
+
+     # Render the conversation without a system prompt: add an empty system message,
+     # apply the chat template, then strip the rendered system portion
+     conversation_with_system = [{"role": "system", "content": ""}] + convo
+     conversation_string = tokenizer.apply_chat_template(conversation_with_system, tokenize=False, add_generation_prompt=False)
+     # Remove the system prompt part
+     string_to_remove = tokenizer.apply_chat_template([conversation_with_system[0]], tokenize=False, add_generation_prompt=False)
+     conversation_string = conversation_string[len(string_to_remove):]
+
+     # Build the input format
+     final_query_role = f"<|start_of_role|>final_user_query<|end_of_role|>{final_user_query}<|end_of_text|>\n"
+     document_role = f"<|start_of_role|>document {{\"document_id\": \"1\"}}<|end_of_role|>\n{document_content}<|end_of_text|>\n"
+
+     # Construct the full input
+     input_text = conversation_string + final_query_role + document_role + CR_INVOCATION_PROMPT
+
+     # Tokenize and generate
+     inputs = tokenizer(input_text, return_tensors="pt")
+     model_device = next(model_context_relevancy.parameters()).device
+
+     output = model_context_relevancy.generate(
+         inputs["input_ids"].to(model_device),
+         attention_mask=inputs["attention_mask"].to(model_device),
+         max_new_tokens=50,
+         pad_token_id=tokenizer.eos_token_id,
+         do_sample=False  # Deterministic greedy decoding
+     )
+
+     # Decode and extract the generated part
+     raw_output_text = tokenizer.decode(output[0])
+     generated_part = raw_output_text.rsplit("<|end_of_role|>", 1)[-1]
+
+     # Extract the classification from JSON
+     try:
+         parsed_json = extract_and_format_json(generated_part)
+         classification = parsed_json["context_relevance"]
+         print(f"Document {i+1}: {classification}")
+     except ValueError as e:
+         print(f"Document {i+1}: Error parsing output - {e}")
+         # Fallback to text-based extraction
+         output_lower = generated_part.lower()
+         if "irrelevant" in output_lower and "partially" not in output_lower:
+             print(f"Document {i+1}: irrelevant (fallback)")
+         elif "partially relevant" in output_lower:
+             print(f"Document {i+1}: partially relevant (fallback)")
+         elif "relevant" in output_lower:
+             print(f"Document {i+1}: relevant (fallback)")
  ```
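Since the adapter is intended to flag documents that could mislead the downstream generator, one natural use of the printed labels in a RAG pipeline is to drop clearly irrelevant passages before generation. The helper below is a hypothetical sketch, not part of the adapter or any library API; it assumes you have collected `(document, classification)` pairs from the loop above.

```python
# Hypothetical helper (illustration only): keep documents whose classification
# suggests they can contribute to answering the final user query.
def filter_for_generation(scored_documents):
    """scored_documents: list of (document, classification) pairs."""
    keep = {"relevant", "partially relevant"}
    return [doc for doc, label in scored_documents if label in keep]

# Example with made-up classifications:
scored = [
    ({"title": "", "text": "minimum wage discussion"}, "irrelevant"),
    ({"title": "", "text": "finance compensation discussion"}, "relevant"),
]
print(len(filter_for_generation(scored)))  # -> 1
```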
  ## Training Details