Performance issues with edge creation for larger dataset with 83K edges #1925

Closed
dianaoa22 opened this issue Jun 19, 2024 · 9 comments
Labels: bug (Something isn't working), Stale (Stale issues/PRs)

dianaoa22 commented Jun 19, 2024

Describe the bug
I am creating 83K vertices and 83K edges by parsing a JSON file with Java. Vertex creation takes about 1 minute; edge creation, however, takes around 1 hour or more.

How are you accessing AGE (Command line, driver, etc.)?

  • JDBC

What data setup do we need to do?

Steps:

1. The attached program reads a JSON file, builds a graph structure in memory from the JSON content, and then generates and executes Cypher queries to create that graph in a PostgreSQL database with the Apache AGE extension (a minimal sketch of the generated statements follows below).
2. It constructs the graph by recursively traversing the JSON structure and creating vertices and edges accordingly.
3. The code snippet is attached; the input JSON cannot be shared.
4. Optimized with batch execution, but no improvement was seen: it still takes about 1 hour or more, whereas the same load in Neo4j takes 14 minutes.
[Main.txt](https://github.com/user-attachments/files/15900495/Main.txt)
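For illustration, here is a minimal sketch of the kind of statements such a program would issue against AGE over JDBC. The graph name ms_graph and vertex label Item are taken from later comments; the edge label HAS_CHILD and the property values are hypothetical:

-- Load AGE and set the search path once per session.
LOAD 'age';
SET search_path = ag_catalog, "$user", public;

-- Vertex creation: one CREATE per vertex.
SELECT * FROM cypher('ms_graph', $$
    CREATE (:Item {id: '3bd6197582434a01b340b889b357f3a4', name: 'test'})
$$) AS (v agtype);

-- Edge creation: each statement must first MATCH both endpoints.
-- Without a usable index, the two MATCHes scan the vertex table,
-- which is typically where the per-edge cost comes from.
SELECT * FROM cypher('ms_graph', $$
    MATCH (a:Item {id: 'parent-id'}), (b:Item {id: 'child-id'})
    CREATE (a)-[:HAS_CHILD]->(b)
$$) AS (e agtype);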


**What configuration info is needed?**
- 

**What is the command that caused the error?**

**Expected behavior**
Execution time should be lower, at least close to Neo4j's.

**Environment (please complete the following information):**
- Java 21; PostgreSQL 16.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), 64-bit


dianaoa22 added the bug label on Jun 19, 2024
MironAtHome commented

Do you have indexes on the fields used to create the edge relationships?
I tried accessing the linked file to look it over, but apparently couldn't get to it.

dianaoa22 (author) commented

Code.txt
Yes, I have created a GIN index, since the MATCH query uses the id inside the properties:

CREATE INDEX load_item_gin_idx ON ms_graph."Item" USING gin (properties);

A sample vertex looks like this:

{
  "id": 844424930132046,
  "properties": {
    "id": "3bd6197582434a01b340b889b357f3a4",
    "label": "xxxx",
    "properties": "{path: '/c/e', name: 'test', s-id: '21977', description: 'test', seattr: '2000', state: 'enable', customer: '13'}"
  }
}
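For reference, the lookup this index is meant to speed up presumably has roughly this shape (the id value is copied from the sample vertex above; the exact query in Main.txt is an assumption):

SELECT * FROM cypher('ms_graph', $$
    MATCH (i:Item {id: '3bd6197582434a01b340b889b357f3a4'})
    RETURN i
$$) AS (i agtype);

AGE translates the property filter through its property-access operator rather than a containment test, so a GIN index over the whole properties column may not be usable for it; the expression index suggested in the next comment targets exactly that access path.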


MironAtHome commented Jul 8, 2024

Here is an index structure I found helpful:

CREATE INDEX [CONCURRENTLY] <index_name> ON "<schema/graph_name>"."<vertice_table_name>"
(ag_catalog.agtype_access_operator(properties, '"<property_name>"'::agtype));

Please replace <index_name>, <schema/graph_name>, <vertice_table_name>, and <property_name> with the applicable values. This way the index will assist in matching on a specific field in the property set, unless you need to match on all properties.
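Filled in with the names from this thread (the index name is arbitrary), that would be roughly:

CREATE INDEX load_item_id_idx ON ms_graph."Item"
    (ag_catalog.agtype_access_operator(properties, '"id"'::agtype));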
That is the first point.
Secondly, over the weekend I looked at a few tools that are fairly common across RDBMS systems and are used to incrementally improve performance by looking at which part of the query is slow and then seeing how to improve it (usually the answer is still an index, but we will see). It is not something that works immediately from a good educated guess based on experience, the way applying indexes does; it is more of a journey. I was hoping this systematic approach would be of interest to explore. If that sounds acceptable, please read on.
There are a few troubleshooting approaches I learned over the last weekend.
Please let me know if you are open to installing extensions that can expose detailed wait information about where the query spends its time, and to turning on some of the troubleshooting options.
This, this and this.
Feel free to add a comment here. Please don't just install these extensions; rather, check that they match your Postgres version and see whether there are (entirely possible) native extensions maintained by Postgres itself, or even built-in functions, that provide the same level of troubleshooting.
This first step is just to ensure proper tooling is in place to identify the slower parts of the query before attempting a fix.
These extensions will not work with Postgres in the cloud at all, so please read carefully and decide whether this is the right path to troubleshoot query performance in your case. Only if it is, please confirm, so that we can look at the next step (if you would like assistance).
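As a starting point that needs nothing beyond PostgreSQL itself, plain EXPLAIN ANALYZE on the wrapping statement, plus the contrib extension pg_stat_statements for aggregate per-statement timings, already go a long way. A sketch, reusing the hypothetical edge-creation statement from above:

-- Where does a single edge-creation statement spend its time?
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM cypher('ms_graph', $$
    MATCH (a:Item {id: 'parent-id'}), (b:Item {id: 'child-id'})
    CREATE (a)-[:HAS_CHILD]->(b)
$$) AS (e agtype);

-- Aggregate timings across the whole load. pg_stat_statements ships
-- with PostgreSQL (contrib) but must be in shared_preload_libraries.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

SELECT calls, total_exec_time, mean_exec_time, left(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;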

dianaoa22 (author) commented

This will take some time; I will let you know once I have the data. Thanks for the suggestions.

MironAtHome commented

No problem at all.
One thing I was looking for is a sample config.json file, as it would help me get to some test data faster.

dianaoa22 (author) commented

import json

# Original JSON data

data = {
    "root": {
        "child1": {
            "name": "test",
            "location": "test"
        },
        "child2": [
            {
                "child2-class": "classname",
                "child2-number": 1
            }
        ],
        "child3": [
            {
                "child4": [
                    {
                        "child4-pref": 1,
                        "child4-type": "type-schild5"
                    },
                    {
                        "child4-type": "type-schild5",
                        "child4-pref": 2
                    }
                ],
                "child5": [
                    {
                        "child5-number": 1,
                        "entryp": {
                            "econtrol": {
                                "ploc": "ploc-10001",
                                "state": "enable"
                            }
                        },
                        "inside": {
                            "mgmt": {
                                "state": "enable"
                            }
                        }
                    }
                ],
                "child3-type": "new-child53",
                "pref-number": 1
            },
            {
                "child3-type": "test-schild5-c",
                "pref-number": 2,
                "child4": [
                    {
                        "child4-pref": 1
                    },
                    {
                        "child4-pref": 2
                    }
                ],
                "child5": [
                    {
                        "entryp": {
                            "econtrol": {
                                "ploc": "ploc-10001",
                                "state": "enable"
                            }
                        },
                        "child5-number": 1,
                        "inside": {
                            "mgmt": {
                                "state": "enable"
                            }
                        }
                    }
                ]
            },
            {
                "child3-type": "new-child53",
                "pref-number": 3,
                "child4": [
                    {
                        "child4-type": "type-schild5",
                        "child4-pref": 1
                    },
                    {
                        "child4-type": "type-schild5",
                        "child4-pref": 2
                    }
                ],
                "child5": [
                    {
                        "inside": {
                            "mgmt": {
                                "state": "enable"
                            }
                        },
                        "child5-number": 1
                    }
                ]
            },
            {
                "child4": [
                    {
                        "child4-type": "type-schild5",
                        "child4-pref": 1
                    },
                    {
                        "child4-pref": 2,
                        "child4-type": "type-schild5"
                    }
                ],
                "child5": [
                    {
                        "inside": {
                            "mgmt": {
                                "state": "enable"
                            }
                        },
                        "child5-number": 1
                    }
                ],
                "child3-type": "new-child53",
                "pref-number": 4
            },
            {
                "child3-type": "new-child53",
                "pref-number": 5,
                "child4": [
                    {
                        "child4-pref": 1,
                        "child4-type": "type-schild5"
                    }
                ],
                "child5": [
                    {
                        "inside": {
                            "mgmt": {
                                "state": "enable"
                            },
                            "band": {
                                "band-no": "NQP-10001"
                            }
                        },
                        "child5-number": 1
                    }
                ]
            },
            {
                "child4": [
                    {
                        "child4-type": "type-schild5",
                        "child4-pref": 1
                    },
                    {
                        "child4-type": "type-schild5",
                        "child4-pref": 2
                    }
                ],
                "child5": [
                    {
                        "inside": {
                            "mgmt": {
                                "state": "enable"
                            }
                        },
                        "child5-number": 1,
                        "entryp": {
                            "econtrol": {
                                "state": "enable",
                                "ploc": "ploc-10001"
                            }
                        }
                    }
                ],
                "child3-type": "new-child53",
                "pref-number": 6
            },
            {
                "child3-type": "new-child53",
                "pref-number": 8,
                "child4": [
                    {
                        "child4-type": "type-schild5",
                        "child4-pref": 1
                    }
                ],
                "child5": [
                    {
                        "inside": {
                            "mgmt": {
                                "state": "enable"
                            },
                            "net": {
                                "test": "FRL-10001"
                            }
                        },
                        "child5-number": 1
                    }
                ]
            }
        ]
    }
}

def add_entries(data, num_entries):
    base_child3 = {
        "child3-type": "new-child53",
        "child4": [],
        "child5": [],
        "pref-number": None  # Will be set dynamically
    }

    base_child4 = {
        "child4-pref": None,  # Will be set dynamically
        "child4-type": "type-schild5"
    }

    base_child5 = {
        "child5-number": None,  # Will be set dynamically
        "entryp": {
            "econtrol": {
                "ploc": "ploc-10001",
                "state": "enable"
            }
        },
        "inside": {
            "mgmt": {
                "state": "enable"
            }
        }
    }

    # Generate new child3 entries
    new_child3s = []
    for i in range(1, num_entries + 1):
        child3 = base_child3.copy()
        child3['pref-number'] = i

        # Generate child4 entries for the child3
        child4_entries = []
        for j in range(1, 5):  # 4 child4 entries per child3 for example
            child4 = base_child4.copy()
            child4['child4-pref'] = j
            child4_entries.append(child4)
        child3['child4'] = child4_entries

        # Generate child5 entries for the child3
        child5_entries = []
        for j in range(1, 5):  # 4 child5 entries per child3 for example
            # Shallow copy: the nested dicts are shared between entries,
            # which is fine since the data is only dumped to JSON.
            child5 = base_child5.copy()
            child5['child5-number'] = j
            child5_entries.append(child5)
        child3['child5'] = child5_entries

        new_child3s.append(child3)

    # Add new child3s to the data
    data['root']['child3'].extend(new_child3s)

# Specify the number of entries to add
num_entries = 100000

# Add the entries
add_entries(data, num_entries)

# Print the total number of child3 entries to confirm
print(f"Total child3 entries: {len(data['root']['child3'])}")

# Optional: Save the updated data to a file
with open('big_data.json', 'w') as file:
    json.dump(data, file, indent=4)

print("Data added successfully and saved to big_data.json")

Run this Python program to generate a big dataset; I am unable to attach the JSON itself due to the 25 MB size restriction.

MironAtHome commented

Thank you

github-actions bot commented Sep 22, 2024

This issue is stale because it has been open 60 days with no activity. Remove the "Abondoned" label or comment, or this will be closed in 14 days.

github-actions bot added the Stale label on Sep 22, 2024

github-actions bot commented Oct 7, 2024

This issue was closed because it has been stale for a further 14 days with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Oct 7, 2024