
[BUG] GraphRAG integration issue #367

Closed · ronchengang opened this issue Oct 7, 2024 · 6 comments · Fixed by #374

Labels: bug (Something isn't working)

ronchengang (Contributor) commented Oct 7, 2024

Description

Integration with GraphRAG is an amazing piece of work!
However, it is not easy to make it run smoothly. To make it work on my Mac I had to change quite a few things, but I finally succeeded.
The main problem is GraphRAG's settings.yaml. This file is generated here:
libs/ktem/ktem/index/file/graph/pipelines.py@call_graphrag_index

def call_graphrag_index(self, input_path: str):
    # Construct the command
    command = [
        "python",
        "-m",
        "graphrag.index",
        "--root",
        input_path,
        "--reporter",
        "rich",
        "--init",
    ]

    ...  # (snip)

However, the generated file cannot be used directly, because some variables set in Kotaemon's .env are not reflected in settings.yaml, such as these three:

# settings for GraphRAG
GRAPHRAG_API_KEY=<YOUR_OPENAI_KEY>
GRAPHRAG_LLM_MODEL=gpt-4o-mini
GRAPHRAG_EMBEDDING_MODEL=text-embedding-3-small

What I did was to divide this step into three parts.

1. The first part is to use --init only, to generate a default settings.yaml file:

command = ["python", "-m", "graphrag.index",
           "--root", input_path, "--init"]
result = subprocess.run(command, capture_output=True, text=True)

After doing this, a default settings.yaml file will be generated.
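As a small defensive addition (my suggestion, not part of the original change), you can check the return code of the --init run before touching settings.yaml:

# `result` is the subprocess.CompletedProcess returned by subprocess.run above;
# returncode and stderr are its standard attributes.
if result.returncode != 0:
    raise RuntimeError(f"graphrag.index --init failed: {result.stderr}")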

2. The second part is to add the necessary GraphRAG environment variables to .env. There are many GraphRAG-related variables; I use only the minimum set:
GRAPHRAG_EMBEDDING_TYPE=openai_embedding 
GRAPHRAG_EMBEDDING_API_BASE=http://localhost:11434/v1
GRAPHRAG_EMBEDDING_API_KEY=empty_key
GRAPHRAG_EMBEDDING_MODEL=nomic-embed-text

GRAPHRAG_LLM_TYPE=openai_chat  
GRAPHRAG_LLM_API_BASE=http://127.0.0.1:11434/v1
GRAPHRAG_LLM_API_KEY=empty_key
GRAPHRAG_LLM_MODEL=llama3.1

Then add the following code to write these values into settings.yaml:

import os

import yaml
from dotenv import load_dotenv

load_dotenv()

# Read the YAML file
with open(input_path + '/settings.yaml', 'r') as file:
    data = yaml.safe_load(file)

# Update the values
data['llm']['api_key'] = os.getenv("GRAPHRAG_LLM_API_KEY")
data['llm']['type'] = os.getenv("GRAPHRAG_LLM_TYPE")
data['llm']['api_base'] = ...
data['llm']['model'] = ...

# Write the updated YAML back to the file
with open(input_path + '/settings.yaml', 'w') as file:
    yaml.dump(data, file)
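As an aside, a more maintainable variant of the patch above (a hypothetical refactor, not the code I actually used) could drive the updates from a single env-var-to-key mapping table:

import os

import yaml

# Hypothetical mapping of env vars to nested settings.yaml keys; extending the
# override set then only needs a new table entry.
ENV_TO_YAML = {
    "GRAPHRAG_LLM_API_KEY": ("llm", "api_key"),
    "GRAPHRAG_LLM_TYPE": ("llm", "type"),
    "GRAPHRAG_LLM_API_BASE": ("llm", "api_base"),
    "GRAPHRAG_LLM_MODEL": ("llm", "model"),
    "GRAPHRAG_EMBEDDING_API_KEY": ("embeddings", "llm", "api_key"),
    "GRAPHRAG_EMBEDDING_TYPE": ("embeddings", "llm", "type"),
    "GRAPHRAG_EMBEDDING_API_BASE": ("embeddings", "llm", "api_base"),
    "GRAPHRAG_EMBEDDING_MODEL": ("embeddings", "llm", "model"),
}

def apply_env_overrides(settings_path: str) -> None:
    with open(settings_path, "r") as file:
        data = yaml.safe_load(file)
    for env_var, keys in ENV_TO_YAML.items():
        value = os.getenv(env_var)
        if value is None:
            continue  # keep the generated default if the env var is unset
        node = data
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    with open(settings_path, "w") as file:
        yaml.dump(data, file)

# usage: apply_env_overrides(input_path + "/settings.yaml")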
3. Now we can run the index process. Of course, this time the --init parameter is removed:
command = [
    "python",
    "-m",
    "graphrag.index",
    "--root",
    input_path,
    "--reporter",
    "rich",
    # "--init",
]
4. I also found that the following part seems redundant, because it runs the command twice. I commented out the subprocess.run call and kept the subprocess.Popen method; this way the index output is streamed to the web page, and it seems to work well:
# result = subprocess.run(command, capture_output=True, text=True)
# print(result.stdout)
# command = command[:-1]

# Run the command and stream stdout
with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as process:
    if process.stdout:
        for line in process.stdout:
            yield Document(channel="debug", text=line)
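One hedged tweak (an assumption on my part, not something the original code does): merging stderr into the pipe means indexing errors also show up in the web debug panel instead of being lost:

# Variant of the streaming loop above with stderr interleaved into stdout.
with subprocess.Popen(
    command,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
) as process:
    if process.stdout:
        for line in process.stdout:
            yield Document(channel="debug", text=line)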
5. After doing all this, the GraphRAG index process runs smoothly.
6. The final libs/ktem/ktem/index/file/graph/pipelines.py@call_graphrag_index looks like this:
def call_graphrag_index(self, input_path: str):
    # Step 1: generate the default settings.yaml with --init
    command = ["python",
               "-m",
               "graphrag.index",
               "--root",
               input_path,
               "--init"]
    result = subprocess.run(command, capture_output=True, text=True)

    import os

    import yaml
    from dotenv import load_dotenv

    load_dotenv()

    # Read the YAML file
    with open(input_path + '/settings.yaml', 'r') as file:
        data = yaml.safe_load(file)

    # Update the values
    data['llm']['api_key'] = os.getenv("GRAPHRAG_LLM_API_KEY")
    data['llm']['type'] = os.getenv("GRAPHRAG_LLM_TYPE")
    data['llm']['api_base'] = os.getenv("GRAPHRAG_LLM_API_BASE")
    data['llm']['model'] = os.getenv("GRAPHRAG_LLM_MODEL")

    data['embeddings']['llm']['api_key'] = os.getenv(
        "GRAPHRAG_EMBEDDING_API_KEY")
    data['embeddings']['llm']['type'] = os.getenv(
        "GRAPHRAG_EMBEDDING_TYPE")
    data['embeddings']['llm']['api_base'] = os.getenv(
        "GRAPHRAG_EMBEDDING_API_BASE")
    data['embeddings']['llm']['model'] = os.getenv(
        "GRAPHRAG_EMBEDDING_MODEL")

    # Write the updated YAML back to the file
    with open(input_path + '/settings.yaml', 'w') as file:
        yaml.dump(data, file)

    # Step 3: run the actual index, with --init removed
    command = [
        "python",
        "-m",
        "graphrag.index",
        "--root",
        input_path,
        "--reporter",
        "rich",
        # "--init",
    ]

    # Run the command
    yield Document(
        channel="debug",
        text="[GraphRAG] Creating index... This can take a long time.",
    )
    # result = subprocess.run(command, capture_output=True, text=True)
    # print(result.stdout)
    # command = command[:-1]

    # Run the command and stream stdout
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as process:
        if process.stdout:
            for line in process.stdout:
                yield Document(channel="debug", text=line)
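Note that call_graphrag_index is a generator (it yields Document objects), so a caller has to iterate it for the subprocess to actually run. A minimal driver sketch, assuming `pipeline` is an instance of the class that defines this method:

# Hypothetical driver: exhausting the generator runs the index and prints
# each streamed debug line.
for doc in pipeline.call_graphrag_index("/path/to/graphrag/root"):
    print(doc.text, end="")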

BTW, here are all GraphRAG env vars for your reference.

https://microsoft.github.io/graphrag/posts/config/env_vars/

Reproduction steps

1. Go to 'file/graphrag collection'
2. Click on 'upload and index'
3. It errors; the GraphRAG index cannot be performed

Screenshots

No response

Logs

No response

Browsers

Chrome

OS

MacOS

Additional information

No response

ronchengang added the bug label Oct 7, 2024
joaoaugustogrobe commented Oct 8, 2024

Thanks a lot @ronchengang, the index with GraphRAG worked fine here - running on Mac, using Docker.

I'm now experiencing issues with the artifact path. When using GraphRAG Collection > Search All, it doesn't find anything; after changing it to Search in File(s) and selecting the specific document that was indexed, it throws some errors.

Looks like the chat pipeline is trying to retrieve the GraphRAG artifacts from this path:

'/app/ktem_app_data/user_data/files/graphrag/5b87ea6b-a5b7-4087-bd67-1a1ffa247f00/output/stats.json/artifacts/create_final_nodes.parquet'

which causes this error/stack:

kotaemon-1  | Thinking ...
kotaemon-1  | Retrievers [DocumentRetrievalPipeline(DS=<kotaemon.storages.docstores.lancedb.LanceDBDocumentStore object at 0xffff41440b20>, FSPath=PosixPath('/app/ktem_app_data/user_data/files/index_1'), Index=<class 'ktem.index.file.index.IndexTable'>, Source=<class 'ktem.index.file.index.Source'>, VS=<kotaemon.storages.vectorstores.chroma.ChromaVectorStore object at 0xffff41440f40>, get_extra_table=False, llm_scorer=LLMTrulensScoring(concurrent=True, normalize=10, prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0xffff29528970>, system_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0xffff29528a90>, top_k=3, user_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0xffff29528b80>), mmr=False, rerankers=[CohereReranking(cohere_api_key='qz....PO', model_name='rerank-multilingual-v2.0')], retrieval_mode='hybrid', top_k=10, user_id=1), GraphRAGRetrieverPipeline(DS=<theflow.base.unset_ object at 0xffffb0cab1f0>, FSPath=<theflow.base.unset_ object at 0xffffb0cab1f0>, Index=<class 'ktem.index.file.index.IndexTable'>, Source=<theflow.base.unset_ object at 0xffffb0cab1f0>, VS=<theflow.base.unset_ object at 0xffffb0cab1f0>, file_ids=['4a733f40-7cae-4dae-9079-acb2feb2f0c3'], user_id=<theflow.base.unset_ object at 0xffffb0cab1f0>)]
kotaemon-1  | searching in doc_ids []
kotaemon-1  | Traceback (most recent call last):
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/gradio/queueing.py", line 575, in process_events
kotaemon-1  |     response = await route_utils.call_process_api(
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/gradio/route_utils.py", line 276, in call_process_api
kotaemon-1  |     output = await app.get_blocks().process_api(
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 1923, in process_api
kotaemon-1  |     result = await self.call_function(
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 1520, in call_function
kotaemon-1  |     prediction = await utils.async_iteration(iterator)
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/gradio/utils.py", line 663, in async_iteration
kotaemon-1  |     return await iterator.__anext__()
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/gradio/utils.py", line 656, in __anext__
kotaemon-1  |     return await anyio.to_thread.run_sync(
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
kotaemon-1  |     return await get_async_backend().run_sync_in_worker_thread(
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2405, in run_sync_in_worker_thread
kotaemon-1  |     return await future
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 914, in run
kotaemon-1  |     result = context.run(func, *args)
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/gradio/utils.py", line 639, in run_sync_iterator_async
kotaemon-1  |     return next(iterator)
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/gradio/utils.py", line 801, in gen_wrapper
kotaemon-1  |     response = next(iterator)
kotaemon-1  |   File "/app/libs/ktem/ktem/pages/chat/__init__.py", line 787, in chat_fn
kotaemon-1  |     for response in pipeline.stream(chat_input, conversation_id, chat_history):
kotaemon-1  |   File "/app/libs/ktem/ktem/reasoning/simple.py", line 655, in stream
kotaemon-1  |     docs, infos = self.retrieve(message, history)
kotaemon-1  |   File "/app/libs/ktem/ktem/reasoning/simple.py", line 483, in retrieve
kotaemon-1  |     retriever_docs = retriever_node(text=query)
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/theflow/base.py", line 1097, in __call__
kotaemon-1  |     raise e from None
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/theflow/base.py", line 1088, in __call__
kotaemon-1  |     output = self.fl.exec(func, args, kwargs)
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/theflow/backends/base.py", line 151, in exec
kotaemon-1  |     return run(*args, **kwargs)
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/theflow/middleware.py", line 144, in __call__
kotaemon-1  |     raise e from None
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/theflow/middleware.py", line 141, in __call__
kotaemon-1  |     _output = self.next_call(*args, **kwargs)
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/theflow/middleware.py", line 117, in __call__
kotaemon-1  |     return self.next_call(*args, **kwargs)
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/theflow/base.py", line 1017, in _runx
kotaemon-1  |     return self.run(*args, **kwargs)
kotaemon-1  |   File "/app/libs/ktem/ktem/index/file/graph/pipelines.py", line 358, in run
kotaemon-1  |     context_builder = self._build_graph_search()
kotaemon-1  |   File "/app/libs/ktem/ktem/index/file/graph/pipelines.py", line 235, in _build_graph_search
kotaemon-1  |     entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/pandas/io/parquet.py", line 667, in read_parquet
kotaemon-1  |     return impl.read(
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/pandas/io/parquet.py", line 267, in read
kotaemon-1  |     path_or_handle, handles, filesystem = _get_path_or_handle(
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/pandas/io/parquet.py", line 140, in _get_path_or_handle
kotaemon-1  |     handles = get_handle(
kotaemon-1  |   File "/usr/local/lib/python3.10/site-packages/pandas/io/common.py", line 882, in get_handle
kotaemon-1  |     handle = open(handle, ioargs.mode)
kotaemon-1  | NotADirectoryError: [Errno 20] Not a directory: '/app/ktem_app_data/user_data/files/graphrag/5b87ea6b-a5b7-4087-bd67-1a1ffa247f00/output/stats.json/artifacts/create_final_nodes.parquet'

Maybe during the settings creation we're losing some config for the artifacts path?

Btw, just as a test/workaround, I moved my artifacts to match the expected path and it worked fine, proving that the index built with your changes is actually working.
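A plausible mechanism (my guess, based on the retriever code quoted in the next reply): the retriever reverse-sorts the children of output/ and treats the first entry as the latest timestamped run, so under the new flat layout a file like stats.json sorts ahead of everything and gets /artifacts appended:

from pathlib import Path

# Reproduction sketch of the suspected bug: with a flat output/ directory,
# reverse-sorting its entries picks stats.json (a file, not a run directory).
children = sorted(
    ["2024-10-08-103000", "create_final_nodes.parquet", "stats.json"],
    reverse=True,
)
broken = Path("output") / children[0] / "artifacts" / "create_final_nodes.parquet"
print(broken)  # output/stats.json/artifacts/... -> NotADirectoryError downstream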

ronchengang (Contributor, Author) commented Oct 8, 2024

Thanks @joaoaugustogrobe!

I forgot to mention that if you are using OpenAI, it should work without any modification, but people like me who use a private model through Ollama may hit the same issue.

Regarding the query problem you mentioned, I also encountered it. My approach is as follows:

Search for the following code snippet in libs/ktem/ktem/index/file/graph/pipelines.py

output_path = root_path / "output"
child_paths = sorted(
    list(output_path.iterdir()), key=lambda x: x.stem, reverse=True
)

# get the latest child path
assert child_paths, "GraphRAG index output not found"
latest_child_path = Path(child_paths[0]) / "artifacts"

INPUT_DIR = latest_child_path

and change it like this:

output_path = root_path / "output"
# child_paths = sorted(
#     list(output_path.iterdir()), key=lambda x: x.stem, reverse=True
# )

# get the latest child path
# assert child_paths, "GraphRAG index output not found"
# latest_child_path = Path(child_paths[0]) / "artifacts"

INPUT_DIR = output_path

This change is related to a setting in GraphRAG's settings.yaml. In older versions of GraphRAG, the setting looks like this:

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts" 

Note that the value of base_dir contains ${timestamp} and artifacts. The code I commented out above finds the latest timestamp directory under output and appends artifacts to build the path to the data files.

But in newer versions of GraphRAG, the setting has been changed to this:

storage:
  type: file # or blob
  base_dir: "output"

Here base_dir retains only output; ${timestamp} and artifacts are gone. That is why I commented out the related lines of code and just kept output_path.

So the query error happens because GraphRAG changed its output layout. I don't know in which version GraphRAG made this change, but it does cause the query problem in Kotaemon. It seems the author has not noticed this change yet; maybe I can create a PR to flag it.
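If you want the code to tolerate both layouts rather than hard-code one, here is a sketch under the assumption (described above) that new-style runs put the parquet files directly under output/:

from pathlib import Path

# Probe for parquet files to decide between the new flat layout and the old
# timestamped layout.
output_path = root_path / "output"
if any(output_path.glob("*.parquet")):
    # new GraphRAG: artifacts live directly under output/
    INPUT_DIR = output_path
else:
    # old GraphRAG: pick the latest timestamped run's artifacts directory
    child_paths = sorted(output_path.iterdir(), key=lambda x: x.stem, reverse=True)
    assert child_paths, "GraphRAG index output not found"
    INPUT_DIR = Path(child_paths[0]) / "artifacts"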

vip-china commented

@ronchengang I have already modified the output path, but create_final_nodes.parquet was not generated. Have you encountered this before?

ronchengang (Contributor, Author) commented

@vip-china Are you sure your GraphRAG index process completed successfully? I guess you didn't finish the index process. Take a look at what files are in your output directory and make sure the index is complete and the files are all there. Here are mine for your reference:

create_base_documents.parquet
create_base_entity_graph.parquet
create_base_extracted_entities.parquet
create_base_text_units.parquet
create_final_communities.parquet
create_final_community_reports.parquet
create_final_documents.parquet
create_final_entities.parquet
create_final_nodes.parquet
create_final_relationships.parquet
create_final_text_units.parquet
create_summarized_entities.parquet
indexing-engine.log
logs.json
stats.json

vip-china commented, quoting the reply above:
Okay, thank you very much. That's true. Sorry to bother you, but I have another question: can GraphRAG support local models? I serve my own model locally with vLLM through an OpenAI-compatible interface, but GraphRAG requires an API key to be configured in .env.example, which by default must be an OpenAI key. My key, however, is a custom one set when starting the local service. How can GraphRAG adapt to local models?

ronchengang (Contributor, Author) commented
@vip-china
I have seen others asking for the same feature: the ability to use self-hosted models and modify other configurations. I have these needs too, and I have found a way to achieve this. I have created a new PR to contribute my code:
#387
You may refer to this PR to implement similar changes on your side, or just wait for it to be merged into the main branch and pull the code to your machine.
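For reference, an .env along the lines of the Ollama example earlier in this thread should also work for a vLLM OpenAI-compatible endpoint; the port and model name here are assumptions (vLLM defaults to port 8000):

# Hypothetical .env for a local vLLM server exposing an OpenAI-compatible API
GRAPHRAG_LLM_TYPE=openai_chat
GRAPHRAG_LLM_API_BASE=http://localhost:8000/v1
GRAPHRAG_LLM_API_KEY=your-custom-key   # whatever key the server was started with
GRAPHRAG_LLM_MODEL=your-served-model-name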
