285 ການອ່ານ

ການປ່ຽນໄຟລ໌ CSV ໃຫ້ເປັນກາຟດ້ວຍ LLMs: ຄູ່ມືຂັ້ນຕອນໂດຍຂັ້ນຕອນ

ໂດຍ Neo4j41m2024/10/29

ຍາວເກີນໄປ; ອ່ານ

ສຳຫຼວດໂດຍໃຊ້ LLMs ເພື່ອປ່ຽນໄຟລ໌ CSV ໃຫ້ເປັນໂຄງສ້າງກຣາຟ, ປັບປຸງການສ້າງແບບຈໍາລອງຂໍ້ມູນໃນ Neo4j ດ້ວຍວິທີການແບບຊ້ຳໆ, ທັນທີ.

featured image - ການປ່ຽນໄຟລ໌ CSV ໃຫ້ເປັນກາຟດ້ວຍ LLMs: ຄູ່ມືຂັ້ນຕອນໂດຍຂັ້ນຕອນ

LLMs ຄ່າໂດຍສານແນວໃດເມື່ອພະຍາຍາມສ້າງກຣາຟຈາກໄຟລ໌ CSV ແບບແປ?

ສ່ວນໃຫຍ່ຂອງວຽກຂອງຂ້ອຍແມ່ນການປັບປຸງປະສົບການຂອງຜູ້ໃຊ້ກັບ Neo4j. ເລື້ອຍໆ, ການເອົາຂໍ້ມູນເຂົ້າໄປໃນ Neo4j ແລະການສ້າງແບບຈໍາລອງຢ່າງມີປະສິດທິພາບແມ່ນສິ່ງທ້າທາຍທີ່ສໍາຄັນສໍາລັບຜູ້ໃຊ້, ໂດຍສະເພາະໃນຍຸກທໍາອິດ. ເຖິງແມ່ນວ່າຮູບແບບຂໍ້ມູນເບື້ອງຕົ້ນແມ່ນມີຄວາມສໍາຄັນແລະຕ້ອງການການໄຕ່ຕອງ, ມັນສາມາດ refactored ໄດ້ຢ່າງງ່າຍດາຍເພື່ອປັບປຸງການປະຕິບັດຍ້ອນວ່າຂະຫນາດຂໍ້ມູນຫຼືຈໍານວນຜູ້ໃຊ້ເພີ່ມຂຶ້ນ.

ດັ່ງນັ້ນ, ເປັນສິ່ງທ້າທາຍຕໍ່ຕົວຂ້ອຍເອງ, ຂ້ອຍຄິດວ່າຂ້ອຍຈະເບິ່ງວ່າ LLM ສາມາດຊ່ວຍຮູບແບບຂໍ້ມູນເບື້ອງຕົ້ນໄດ້. ຖ້າບໍ່ມີຫຍັງອີກ, ມັນຈະສະແດງໃຫ້ເຫັນວິທີ ການ ເຊື່ອມຕໍ່ແລະໃຫ້ຜູ້ໃຊ້ມີຜົນໄດ້ຮັບໄວທີ່ພວກເຂົາສາມາດສະແດງໃຫ້ຄົນອື່ນເຫັນ.

intuitively, ຂ້າພະເຈົ້າຮູ້ວ່າການສ້າງແບບຈໍາລອງຂໍ້ມູນແມ່ນຂະບວນການຊ້ໍາກັນ, ແລະບາງ LLMs ສາມາດຖືກລົບກວນໄດ້ງ່າຍໂດຍຂໍ້ມູນຈໍານວນຫລາຍ, ດັ່ງນັ້ນນີ້ສະເຫນີໂອກາດທີ່ດີທີ່ຈະໃຊ້ LangGraph ເພື່ອເຮັດວຽກເປັນວົງຈອນຜ່ານຂໍ້ມູນ.

ໃຫ້ເຂົ້າໄປໃນການກະຕຸ້ນເຕືອນທີ່ເຮັດໃຫ້ມັນເກີດຂຶ້ນ.

ພື້ນຖານການສ້າງແບບຈໍາລອງກາຟ

ຫຼັກສູດພື້ນຖານການສ້າງແບບຈໍາລອງຂໍ້ມູນ Graph ໃນ GraphAcademy ແນະນໍາທ່ານຜ່ານພື້ນຖານຂອງການສ້າງແບບຈໍາລອງຂໍ້ມູນໃນກາຟ, ແຕ່ເປັນການຜ່ານຄັ້ງທໍາອິດ, ຂ້ອຍໃຊ້ກົດລະບຽບຕໍ່ໄປນີ້:

ຄຳນາມກາຍເປັນປ້າຍກຳກັບ — ພວກມັນອະທິບາຍ ສິ່ງ ທີ່ node ເປັນຕົວແທນ.
ຄຳກິລິຍາກາຍເປັນປະເພດຄວາມສຳພັນ — ພວກມັນອະທິບາຍວິທີ ການ ເຊື່ອມຕໍ່ກັນ.
ທຸກສິ່ງທຸກຢ່າງກາຍເປັນຄຸນສົມບັດ (ໂດຍສະເພາະແມ່ນ adverbs) — ທ່ານມີຊື່ແລະອາດຈະຂັບລົດສີຂີ້ເຖົ່າ.

Verbs ຍັງສາມາດເປັນ nodes; ທ່ານອາດຈະດີໃຈທີ່ຮູ້ວ່າບຸກ ຄົນ ໃດຫນຶ່ງໄດ້ ສັ່ງ ສິນຄ້າ, ແຕ່ຮູບແບບພື້ນຖານນັ້ນບໍ່ໄດ້ອະນຸຍາດໃຫ້ທ່ານຮູ້ວ່າບ່ອນໃດແລະເວລາທີ່ຜະລິດຕະພັນຖືກສັ່ງ. ໃນກໍລະນີນີ້, ຄໍາສັ່ງຈະກາຍເປັນ node ໃຫມ່ໃນແບບຈໍາລອງ.

ຂ້າພະເຈົ້າແນ່ໃຈວ່ານີ້ສາມາດຖືກກັ່ນເຂົ້າໄປໃນການກະຕຸ້ນເຕືອນເພື່ອສ້າງວິທີການທີ່ບໍ່ມີການຍິງກັບການສ້າງແບບຈໍາລອງຂໍ້ມູນກາຟ.

ແນວທາງແບບຊ້ຳໆ

ຂ້າພະເຈົ້າໄດ້ພະຍາຍາມນີ້ໄລຍະສັ້ນໆສອງສາມເດືອນກ່ອນຫນ້ານີ້ແລະພົບວ່າຕົວແບບທີ່ຂ້ອຍກໍາລັງໃຊ້ໄດ້ກາຍເປັນສິ່ງລົບກວນໄດ້ງ່າຍເມື່ອຈັດການກັບ schemas ຂະຫນາດໃຫຍ່, ແລະການກະຕຸ້ນເຕືອນຂ້ອນຂ້າງໄວເຖິງຂອບເຂດຈໍາກັດ token ຂອງ LLM.

ຂ້ອຍຄິດວ່າຂ້ອຍຈະລອງວິທີຊ້ຳໆເທື່ອນີ້, ເອົາກະແຈເທື່ອລະອັນ. ນີ້ຄວນຊ່ວຍຫຼີກເວັ້ນການລົບກວນເພາະວ່າ LLM ພຽງແຕ່ຕ້ອງພິຈາລະນາຫນຶ່ງລາຍການຕໍ່ຄັ້ງ.

ວິທີການສຸດທ້າຍໄດ້ໃຊ້ຂັ້ນຕອນຕໍ່ໄປນີ້:

ໂຫລດໄຟລ໌ CSV ເຂົ້າໄປໃນກອບຂໍ້ມູນ Pandas.
ວິເຄາະແຕ່ລະຖັນໃນ CSV ແລະຕື່ມໃສ່ກັບແບບຂໍ້ມູນແບບວ່າງໆໂດຍອີງໃສ່ JSON Schema.
ກໍານົດແລະເພີ່ມ ID ທີ່ຂາດຫາຍໄປສໍາລັບແຕ່ລະຫນ່ວຍງານ.
ທົບທວນຄືນຮູບແບບຂໍ້ມູນສໍາລັບຄວາມຖືກຕ້ອງ.
ສ້າງຖະແຫຼງການ Cypher ເພື່ອນໍາເຂົ້າ nodes ແລະການພົວພັນ.
ສ້າງຂໍ້ຈໍາກັດທີ່ເປັນເອກະລັກທີ່ເນັ້ນໃສ່ໃບລາຍງານການນໍາເຂົ້າ.
ສ້າງຂໍ້ຈໍາກັດແລະດໍາເນີນການນໍາເຂົ້າ.

ຂໍ້ມູນ

ຂ້າພະເຈົ້າໄດ້ເບິ່ງຢ່າງໄວວາກ່ຽວກັບ Kaggle ສໍາລັບ ຊຸດຂໍ້ມູນທີ່ຫນ້າສົນໃຈ . ຊຸດຂໍ້ມູນທີ່ໂດດເດັ່ນແມ່ນ ເພງ Spotify ທີ່ມີກະແສຫຼາຍທີ່ສຸດ .

 import pandas as pd csv_file = '/Users/adam/projects/datamodeller/data/spotify/spotify-most-streamed-songs.csv' df = pd.read_csv(csv_file) df.head() track_name artist(s)_name artist_count released_year released_month released_day in_spotify_playlists in_spotify_charts streams in_apple_playlists … key mode danceability_% valence_% energy_% acousticness_% instrumentalness_% liveness_% speechiness_% cover_url 0 Seven (feat. Latto) (Explicit Ver.) Latto, Jung Kook 2 2023 7 14 553 147 141381703 43 … B Major 80 89 83 31 0 8 4 Not Found 1 LALA Myke Towers 1 2023 3 23 1474 48 133716286 48 … C# Major 71 61 74 7 0 10 4 https://i.scdn.co/image/ab67616d0000b2730656d5… 2 vampire Olivia Rodrigo 1 2023 6 30 1397 113 140003974 94 … F Major 51 32 53 17 0 31 6 https://i.scdn.co/image/ab67616d0000b273e85259… 3 Cruel Summer Taylor Swift 1 2019 8 23 7858 100 800840817 116 … A Major 55 58 72 11 0 11 15 https://i.scdn.co/image/ab67616d0000b273e787cf… 4 WHERE SHE GOES Bad Bunny 1 2023 5 18 3133 50 303236322 84 … A Minor 65 23 80 14 63 11 6 https://i.scdn.co/image/ab67616d0000b273ab5c9c…

5 ແຖວ × 25 ຖັນ

ມັນຂ້ອນຂ້າງງ່າຍດາຍ, ແຕ່ຂ້ອຍສາມາດເຫັນໄດ້ທັນທີວ່າຄວນຈະມີຄວາມສໍາພັນລະຫວ່າງເພງແລະນັກສິລະປິນ.

ຍັງມີສິ່ງທ້າທາຍດ້ານຄວາມສະອາດຂໍ້ມູນທີ່ຈະເອົາຊະນະໄດ້, ໃນແງ່ຂອງຊື່ຖັນ ແລະສິນລະປິນເປັນຄ່າທີ່ຂັ້ນດ້ວຍເຄື່ອງໝາຍຈຸດພາຍໃນຖັນສິນລະປິນ _name.

ການເລືອກ LLM

ຂ້າພະເຈົ້າກໍ່ຕ້ອງການໃຊ້ LLM ທ້ອງຖິ່ນສໍາລັບການນີ້, ແຕ່ຂ້າພະເຈົ້າພົບເຫັນໃນຕອນຕົ້ນວ່າ Llama 3 ຈະບໍ່ຕັດມັນ. ຖ້າມີຂໍ້ສົງໄສ, ກັບມາໃຊ້ OpenAI:

 from langchain_core.prompts import PromptTemplate from langchain_core.pydantic_v1 import BaseModel, Field from typing import List from langchain_core.output_parsers import JsonOutputParser from langchain_openai import ChatOpenAI llm = ChatOpenAI(model="gpt-4o")

ການສ້າງແບບຈໍາລອງຂໍ້ມູນ

ຂ້ອຍໃຊ້ຊຸດຄໍາແນະນໍາການສ້າງແບບຫຍໍ້ເພື່ອສ້າງແບບຈໍາລອງຂໍ້ມູນ. ຂ້ອຍຕ້ອງວິສະວະກອນການກະຕຸ້ນເຕືອນສອງສາມເທື່ອເພື່ອໃຫ້ໄດ້ຜົນຜະລິດທີ່ສອດຄ່ອງ.

ຕົວຢ່າງສູນການສັກຢາເຮັດວຽກຂ້ອນຂ້າງດີ, ແຕ່ຂ້ອຍພົບວ່າຜົນຜະລິດບໍ່ສອດຄ່ອງ. ການກໍານົດຜົນຜະລິດທີ່ມີໂຄງສ້າງເພື່ອຖືຜົນຜະລິດ JSON ຊ່ວຍແທ້ໆ:

 class JSONSchemaSpecification(BaseModel): notes: str = Field(description="Any notes or comments about the schema") jsonschema: str = Field(description="A JSON array of JSON schema specifications that describe the entities in the data model")

ຜົນຜະລິດຕົວຢ່າງການສັກຢາຈໍານວນຫນ້ອຍ

JSON ຕົວມັນເອງກໍ່ບໍ່ສອດຄ່ອງ, ດັ່ງນັ້ນຂ້າພະເຈົ້າໄດ້ສິ້ນສຸດການກໍານົດ schema ໂດຍອີງໃສ່ຊຸດຂໍ້ມູນການແນະນໍາຮູບເງົາ.

ຕົວຢ່າງຜົນຜະລິດ:

 example_output = [ dict( title="Person", type="object", description="Node", properties=[ dict(name="name", column_name="person_name", type="string", description="The name of the person", examples=["Tom Hanks"]), dict(name="date_of_birth", column_name="person_dob", type="date", description="The date of birth for the person", examples=["1987-06-05"]), dict(name="id", column_name="person_name, date_of_birth", type="string", description="The ID is a combination of name and date of birth to ensure uniqueness", examples=["tom-hanks-1987-06-05"]), ], ), dict( title="Director", type="object", description="Node", properties=[ dict(name="name", column_name="director_names", type="string", description="The name of the directors. Split values in column by a comma", examples=["Francis Ford Coppola"]), ], ), dict( title="Movie", type="object", description="Node", properties=[ dict(name="title", column_name="title", type="string", description="The title of the movie", examples=["Toy Story"]), dict(name="released", column_name="released", type="integer", description="The year the movie was released", examples=["1990"]), ], ), dict( title="ACTED_IN", type="object", description="Relationship", properties=[ dict(name="_from", column_name="od", type="string", description="Person found by the `id`. The ID is a combination of name and date of birth to ensure uniqueness", examples=["Person"]), dict(name="_to", column_name="title", type="string", description="The movie title", examples=["Movie"]), dict(name="roles", type="string", column_name="person_roles", description="The roles the person played in the movie", examples=["Woody"]), ], ), dict( title="DIRECTED", type="object", description="Relationship", properties=[ dict(name="_from", type="string", column_name="director_names", description="Director names are comma separated", examples=["Director"]), dict(name="_to", type="string", column_name="title", description="The label of the node this relationship ends at", examples=["Movie"]), ], ), ]

ຂ້າພະເຈົ້າຕ້ອງ deviate ຈາກ JSON Schema ທີ່ເຄັ່ງຄັດແລະເພີ່ມພາກສະຫນາມ column_name ກັບຜົນໄດ້ຮັບເພື່ອຊ່ວຍໃຫ້ LLM ສ້າງສະຄິບນໍາເຂົ້າ. ການສະຫນອງຕົວຢ່າງຂອງຄໍາອະທິບາຍຍັງຊ່ວຍໃນເລື່ອງນີ້, ຖ້າບໍ່ດັ່ງນັ້ນຄຸນສົມບັດທີ່ໃຊ້ໃນ MATCH clause ແມ່ນບໍ່ສອດຄ່ອງ.

ລະບົບຕ່ອງໂສ້

ນີ້ແມ່ນການເຕືອນສຸດທ້າຍ:

 model_prompt = PromptTemplate.from_template(""" You are an expert Graph Database administrator. Your task is to design a data model based on the information provided from an existing data source. You must decide where the following column fits in with the existing data model. Consider: * Does the column represent an entity, for example a Person, Place, or Movie? If so, this should be a node in its own right. * Does the column represent a relationship between two entities? If so, this should be a relationship between two nodes. * Does the column represent an attribute of an entity or relationship? If so, this should be a property of a node or relationship. * Does the column represent a shared attribute that could be interesting to query through to find similar nodes, for example a Genre? If so, this should be a node in its own right. ## Instructions for Nodes * Node labels are generally nouns, for example Person, Place, or Movie * Node titles should be in UpperCamelCase ## Instructions for Relationships * Relationshops are generally verbs, for example ACTED_IN, DIRECTED, or PURCHASED * Examples of good relationships are (:Person)-[:ACTED_IN]->(:Movie) or (:Person)-[:PURCHASED]->(:Product) * Relationships should be in UPPER_SNAKE_CASE * Provide any specific instructions for the field in the description. For example, does the field contain a list of comma separated values or a single value? ## Instructions for Properties * Relationships should be in lowerPascalCase * Prefer the shorter name where possible, for example "person_id" and "personId" should simply be "id" * If you are changing the property name from the original field name, mention the column name in the description * Do not include examples for integer or date fields * Always include instructions on data preparation for the field. Does it need to be cast as a string or split into multiple fields on a delimiting value? * Property keys should be letters only, no numbers or special characters. ## Important! Consider the examples provided. Does any data preparation need to be done to ensure the data is in the correct format? You must include any information about data preparation in the description. ## Example Output Here is an example of a good output: {example_output} ## New Data: Key: {key} Data Type: {type} Example Values: {examples} ## Existing Data Model Here is the existing data model: {existing_model} ## Keep Existing Data Model Apply your changes to the existing data model but never remove any existing definitions. """, partial_variables=dict(example_output=dumps(example_output))) model_chain = model_prompt | llm.with_structured_output(JSONSchemaSpecification)

ການປະຕິບັດລະບົບຕ່ອງໂສ້

ເພື່ອປັບປຸງແບບຈໍາລອງແບບຊໍ້າໆ, ຂ້າພະເຈົ້າໄດ້ເຮັດຊ້ໍາຄືນຄີໃນກອບຂໍ້ມູນແລະຜ່ານແຕ່ລະກະແຈ, ປະເພດຂໍ້ມູນຂອງມັນ, ແລະຫ້າຄ່າທໍາອິດທີ່ເປັນເອກະລັກເພື່ອເຕືອນ:

 from json_repair import dumps, loads existing_model = {} for i, key in enumerate(df): print("\n", i, key) print("----------------") try: res = try_chain(model_chain, dict( existing_model=dumps(existing_model), key=key, type=df[key].dtype, examples=dumps(df[key].unique()[:5].tolist()) )) print(res.notes) existing_model = loads(res.jsonschema) print([n['title'] for n in existing_model]) except Exception as e: print(e) pass existing_model

ຜົນຜະລິດ Console:

 0 track_name ---------------- Adding 'track_name' to an existing data model. This represents a music track entity. ['Track'] 1 artist(s)_name ---------------- Adding a new column 'artist(s)_name' to the existing data model. This column represents multiple artists associated with tracks and should be modeled as a new node 'Artist' and a relationship 'PERFORMED_BY' from 'Track' to 'Artist'. ['Track', 'Artist', 'PERFORMED_BY'] 2 artist_count ---------------- Added artist_count as a property of Track node. This property indicates the number of artists performing in the track. ['Track', 'Artist', 'PERFORMED_BY'] 3 released_year ---------------- Add the released_year column to the existing data model as a property of the Track node. ['Track', 'Artist', 'PERFORMED_BY'] 4 released_month ---------------- Adding the 'released_month' column to the existing data model, considering it as an attribute of the Track node. ['Track', 'Artist', 'PERFORMED_BY'] 5 released_day ---------------- Added a new property 'released_day' to the 'Track' node to capture the day of the month a track was released. ['Track', 'Artist', 'PERFORMED_BY'] 6 in_spotify_playlists ---------------- Adding the new column 'in_spotify_playlists' to the existing data model as a property of the 'Track' node. ['Track', 'Artist', 'PERFORMED_BY'] 7 in_spotify_charts ---------------- Adding the 'in_spotify_charts' column to the existing data model as a property of the Track node. ['Track', 'Artist', 'PERFORMED_BY'] 8 streams ---------------- Adding a new column 'streams' to the existing data model, representing the number of streams for a track. ['Track', 'Artist', 'PERFORMED_BY'] 9 in_apple_playlists ---------------- Adding new column 'in_apple_playlists' to the existing data model ['Track', 'Artist', 'PERFORMED_BY'] 10 in_apple_charts ---------------- Adding 'in_apple_charts' as a property to the 'Track' node, representing the number of times the track appeared in the Apple charts. ['Track', 'Artist', 'PERFORMED_BY'] 11 in_deezer_playlists ---------------- Add 'in_deezer_playlists' to the existing data model for a music track database. ['Track', 'Artist', 'PERFORMED_BY'] 12 in_deezer_charts ---------------- Adding a new property 'inDeezerCharts' to the existing 'Track' node to represent the number of times the track appeared in Deezer charts. ['Track', 'Artist', 'PERFORMED_BY'] 13 in_shazam_charts ---------------- Adding new data 'in_shazam_charts' to the existing data model. This appears to be an attribute of the 'Track' node, indicating the number of times a track appeared in the Shazam charts. ['Track', 'Artist', 'PERFORMED_BY'] 14 bpm ---------------- Added bpm column as a property to the Track node as it represents a characteristic of the track. ['Track', 'Artist', 'PERFORMED_BY'] 15 key ---------------- Adding the 'key' column to the existing data model. The 'key' represents the musical key of a track, which is a shared attribute that can be interesting to query through to find similar tracks. ['Track', 'Artist', 'PERFORMED_BY'] 16 mode ---------------- Adding 'mode' to the existing data model. It represents a musical characteristic of a track, which is best captured as an attribute of the Track node. ['Track', 'Artist', 'PERFORMED_BY'] 17 danceability_% ---------------- Added 'danceability_%' to the existing data model as a property of the Track node. The field represents the danceability percentage of the track. ['Track', 'Artist', 'PERFORMED_BY'] 18 valence_% ---------------- Adding the valence percentage column to the existing data model as a property of the Track node. ['Track', 'Artist', 'PERFORMED_BY'] 19 energy_% ---------------- Integration of the new column 'energy_%' into the existing data model. This column represents an attribute of the Track entity and should be added as a property of the Track node. ['Track', 'Artist', 'PERFORMED_BY'] 20 acousticness_% ---------------- Adding acousticness_% to the existing data model as a property of the Track node. ['Track', 'Artist', 'PERFORMED_BY'] 21 instrumentalness_% ---------------- Adding the new column 'instrumentalness_%' to the existing Track node as it represents an attribute of the Track entity. ['Track', 'Artist', 'PERFORMED_BY'] 22 liveness_% ---------------- Adding the new column 'liveness_%' to the existing data model as an attribute of the Track node ['Track', 'Artist', 'PERFORMED_BY'] 23 speechiness_% ---------------- Adding the new column 'speechiness_%' to the existing data model as a property of the 'Track' node. ['Track', 'Artist', 'PERFORMED_BY'] 24 cover_url ---------------- Adding a new property 'cover_url' to the existing 'Track' node. This property represents the URL of the track's cover image. ['Track', 'Artist', 'PERFORMED_BY']

ຫຼັງຈາກການປັບປຸງສອງສາມຄັ້ງຕໍ່ການກະຕຸ້ນເຕືອນເພື່ອຈັດການກັບກໍລະນີທີ່ໃຊ້, ຂ້ອຍໄດ້ສິ້ນສຸດລົງດ້ວຍຮູບແບບທີ່ຂ້ອຍພໍໃຈ. LLM ໄດ້ຈັດການເພື່ອກໍານົດວ່າຊຸດຂໍ້ມູນປະກອບດ້ວຍເພງ, ນັກສິລະປິນ, ແລະຄວາມສໍາພັນ PERFORMED_BY ເພື່ອເຊື່ອມຕໍ່ທັງສອງ:

 [ { "title": "Track", "type": "object", "description": "Node", "properties": [ { "name": "name", "column_name": "track_name", "type": "string", "description": "The name of the track", "examples": [ "Seven (feat. Latto) (Explicit Ver.)", "LALA", "vampire", "Cruel Summer", "WHERE SHE GOES", ], }, { "name": "artist_count", "column_name": "artist_count", "type": "integer", "description": "The number of artists performing in the track", "examples": [2, 1, 3, 8, 4], }, { "name": "released_year", "column_name": "released_year", "type": "integer", "description": "The year the track was released", "examples": [2023, 2019, 2022, 2013, 2014], }, { "name": "released_month", "column_name": "released_month", "type": "integer", "description": "The month the track was released", "examples": [7, 3, 6, 8, 5], }, { "name": "released_day", "column_name": "released_day", "type": "integer", "description": "The day of the month the track was released", "examples": [14, 23, 30, 18, 1], }, { "name": "inSpotifyPlaylists", "column_name": "in_spotify_playlists", "type": "integer", "description": "The number of Spotify playlists the track is in. Cast the value as an integer.", "examples": [553, 1474, 1397, 7858, 3133], }, { "name": "inSpotifyCharts", "column_name": "in_spotify_charts", "type": "integer", "description": "The number of times the track appeared in the Spotify charts. Cast the value as an integer.", "examples": [147, 48, 113, 100, 50], }, { "name": "streams", "column_name": "streams", "type": "array", "description": "The list of stream IDs for the track. Maintain the array format.", "examples": [ "141381703", "133716286", "140003974", "800840817", "303236322", ], }, { "name": "inApplePlaylists", "column_name": "in_apple_playlists", "type": "integer", "description": "The number of Apple playlists the track is in. Cast the value as an integer.", "examples": [43, 48, 94, 116, 84], }, { "name": "inAppleCharts", "column_name": "in_apple_charts", "type": "integer", "description": "The number of times the track appeared in the Apple charts. Cast the value as an integer.", "examples": [263, 126, 207, 133, 213], }, { "name": "inDeezerPlaylists", "column_name": "in_deezer_playlists", "type": "array", "description": "The list of Deezer playlist IDs the track is in. Maintain the array format.", "examples": ["45", "58", "91", "125", "87"], }, { "name": "inDeezerCharts", "column_name": "in_deezer_charts", "type": "integer", "description": "The number of times the track appeared in the Deezer charts. Cast the value as an integer.", "examples": [10, 14, 12, 15, 17], }, { "name": "inShazamCharts", "column_name": "in_shazam_charts", "type": "array", "description": "The list of Shazam chart IDs the track is in. Maintain the array format.", "examples": ["826", "382", "949", "548", "425"], }, { "name": "bpm", "column_name": "bpm", "type": "integer", "description": "The beats per minute of the track. Cast the value as an integer.", "examples": [125, 92, 138, 170, 144], }, { "name": "key", "column_name": "key", "type": "string", "description": "The musical key of the track. Cast the value as a string.", "examples": ["B", "C#", "F", "A", "D"], }, { "name": "mode", "column_name": "mode", "type": "string", "description": "The mode of the track (eg, Major, Minor). Cast the value as a string.", "examples": ["Major", "Minor"], }, { "name": "danceability", "column_name": "danceability_%", "type": "integer", "description": "The danceability percentage of the track. Cast the value as an integer.", "examples": [80, 71, 51, 55, 65], }, { "name": "valence", "column_name": "valence_%", "type": "integer", "description": "The valence percentage of the track. Cast the value as an integer.", "examples": [89, 61, 32, 58, 23], }, { "name": "energy", "column_name": "energy_%", "type": "integer", "description": "The energy percentage of the track. Cast the value as an integer.", "examples": [83, 74, 53, 72, 80], }, { "name": "acousticness", "column_name": "acousticness_%", "type": "integer", "description": "The acousticness percentage of the track. Cast the value as an integer.", "examples": [31, 7, 17, 11, 14], }, { "name": "instrumentalness", "column_name": "instrumentalness_%", "type": "integer", "description": "The instrumentalness percentage of the track. Cast the value as an integer.", "examples": [0, 63, 17, 2, 19], }, { "name": "liveness", "column_name": "liveness_%", "type": "integer", "description": "The liveness percentage of the track. Cast the value as an integer.", "examples": [8, 10, 31, 11, 28], }, { "name": "speechiness", "column_name": "speechiness_%", "type": "integer", "description": "The speechiness percentage of the track. Cast the value as an integer.", "examples": [4, 6, 15, 24, 3], }, { "name": "coverUrl", "column_name": "cover_url", "type": "string", "description": "The URL of the track's cover image. If the value is 'Not Found', it should be cast as an empty string.", "examples": [ "https://i.scdn.co/image/ab67616d0000b2730656d5ce813ca3cc4b677e05", "https://i.scdn.co/image/ab67616d0000b273e85259a1cae29a8d91f2093d", ], }, ], }, { "title": "Artist", "type": "object", "description": "Node", "properties": [ { "name": "name", "column_name": "artist(s)_name", "type": "string", "description": "The name of the artist. Split values in column by a comma", "examples": [ "Latto", "Jung Kook", "Myke Towers", "Olivia Rodrigo", "Taylor Swift", "Bad Bunny", ], } ], }, { "title": "PERFORMED_BY", "type": "object", "description": "Relationship", "properties": [ { "name": "_from", "type": "string", "description": "The label of the node this relationship starts at", "examples": ["Track"], }, { "name": "_to", "type": "string", "description": "The label of the node this relationship ends at", "examples": ["Artist"], }, ], }, ] [ { "title": "Track", "type": "object", "description": "Node", "properties": [ { "name": "name", "column_name": "track_name", "type": "string", "description": "The name of the track", "examples": [ "Seven (feat. Latto) (Explicit Ver.)", "LALA", "vampire", "Cruel Summer", "WHERE SHE GOES", ], }, { "name": "artist_count", "column_name": "artist_count", "type": "integer", "description": "The number of artists performing in the track", "examples": [2, 1, 3, 8, 4], }, { "name": "released_year", "column_name": "released_year", "type": "integer", "description": "The year the track was released", "examples": [2023, 2019, 2022, 2013, 2014], }, { "name": "released_month", "column_name": "released_month", "type": "integer", "description": "The month the track was released", "examples": [7, 3, 6, 8, 5], }, { "name": "released_day", "column_name": "released_day", "type": "integer", "description": "The day of the month the track was released", "examples": [14, 23, 30, 18, 1], }, { "name": "inSpotifyPlaylists", "column_name": "in_spotify_playlists", "type": "integer", "description": "The number of Spotify playlists the track is in. Cast the value as an integer.", "examples": [553, 1474, 1397, 7858, 3133], }, { "name": "inSpotifyCharts", "column_name": "in_spotify_charts", "type": "integer", "description": "The number of times the track appeared in the Spotify charts. Cast the value as an integer.", "examples": [147, 48, 113, 100, 50], }, { "name": "streams", "column_name": "streams", "type": "array", "description": "The list of stream IDs for the track. Maintain the array format.", "examples": [ "141381703", "133716286", "140003974", "800840817", "303236322", ], }, { "name": "inApplePlaylists", "column_name": "in_apple_playlists", "type": "integer", "description": "The number of Apple playlists the track is in. Cast the value as an integer.", "examples": [43, 48, 94, 116, 84], }, { "name": "inAppleCharts", "column_name": "in_apple_charts", "type": "integer", "description": "The number of times the track appeared in the Apple charts. Cast the value as an integer.", "examples": [263, 126, 207, 133, 213], }, { "name": "inDeezerPlaylists", "column_name": "in_deezer_playlists", "type": "array", "description": "The list of Deezer playlist IDs the track is in. Maintain the array format.", "examples": ["45", "58", "91", "125", "87"], }, { "name": "inDeezerCharts", "column_name": "in_deezer_charts", "type": "integer", "description": "The number of times the track appeared in the Deezer charts. Cast the value as an integer.", "examples": [10, 14, 12, 15, 17], }, { "name": "inShazamCharts", "column_name": "in_shazam_charts", "type": "array", "description": "The list of Shazam chart IDs the track is in. Maintain the array format.", "examples": ["826", "382", "949", "548", "425"], }, { "name": "bpm", "column_name": "bpm", "type": "integer", "description": "The beats per minute of the track. Cast the value as an integer.", "examples": [125, 92, 138, 170, 144], }, { "name": "key", "column_name": "key", "type": "string", "description": "The musical key of the track. Cast the value as a string.", "examples": ["B", "C#", "F", "A", "D"], }, { "name": "mode", "column_name": "mode", "type": "string", "description": "The mode of the track (eg, Major, Minor). Cast the value as a string.", "examples": ["Major", "Minor"], }, { "name": "danceability", "column_name": "danceability_%", "type": "integer", "description": "The danceability percentage of the track. Cast the value as an integer.", "examples": [80, 71, 51, 55, 65], }, { "name": "valence", "column_name": "valence_%", "type": "integer", "description": "The valence percentage of the track. Cast the value as an integer.", "examples": [89, 61, 32, 58, 23], }, { "name": "energy", "column_name": "energy_%", "type": "integer", "description": "The energy percentage of the track. Cast the value as an integer.", "examples": [83, 74, 53, 72, 80], }, { "name": "acousticness", "column_name": "acousticness_%", "type": "integer", "description": "The acousticness percentage of the track. Cast the value as an integer.", "examples": [31, 7, 17, 11, 14], }, { "name": "instrumentalness", "column_name": "instrumentalness_%", "type": "integer", "description": "The instrumentalness percentage of the track. Cast the value as an integer.", "examples": [0, 63, 17, 2, 19], }, { "name": "liveness", "column_name": "liveness_%", "type": "integer", "description": "The liveness percentage of the track. Cast the value as an integer.", "examples": [8, 10, 31, 11, 28], }, { "name": "speechiness", "column_name": "speechiness_%", "type": "integer", "description": "The speechiness percentage of the track. Cast the value as an integer.", "examples": [4, 6, 15, 24, 3], }, { "name": "coverUrl", "column_name": "cover_url", "type": "string", "description": "The URL of the track's cover image. If the value is 'Not Found', it should be cast as an empty string.", "examples": [ "https://i.scdn.co/image/ab67616d0000b2730656d5ce813ca3cc4b677e05", "https://i.scdn.co/image/ab67616d0000b273e85259a1cae29a8d91f2093d", ], }, ], }, { "title": "Artist", "type": "object", "description": "Node", "properties": [ { "name": "name", "column_name": "artist(s)_name", "type": "string", "description": "The name of the artist. Split values in column by a comma", "examples": [ "Latto", "Jung Kook", "Myke Towers", "Olivia Rodrigo", "Taylor Swift", "Bad Bunny", ], } ], }, { "title": "PERFORMED_BY", "type": "object", "description": "Relationship", "properties": [ { "name": "_from", "type": "string", "description": "The label of the node this relationship starts at", "examples": ["Track"], }, { "name": "_to", "type": "string", "description": "The label of the node this relationship ends at", "examples": ["Artist"], }, ], }, ]

ເພີ່ມຕົວລະບຸທີ່ເປັນເອກະລັກ

ຂ້າພະເຈົ້າສັງເກດເຫັນວ່າ schema ບໍ່ມີຕົວກໍານົດທີ່ເປັນເອກະລັກ, ແລະນີ້ອາດຈະກາຍເປັນບັນຫາໃນເວລາທີ່ມັນມາກັບການນໍາເຂົ້າການພົວພັນ. ມັນຢືນຢູ່ກັບເຫດຜົນທີ່ວ່າ ສິລະປິນທີ່ແຕກຕ່າງກັນຈະປ່ອຍເພງທີ່ມີຊື່ດຽວກັນ ແລະນັກສິລະປິນສອງຄົນອາດມີຊື່ດຽວກັນ.

ສໍາລັບເຫດຜົນນີ້, ມັນເປັນສິ່ງສໍາຄັນທີ່ຈະສ້າງຕົວລະບຸສໍາລັບ Tracks ເພື່ອໃຫ້ພວກເຂົາສາມາດແຍກຄວາມແຕກຕ່າງພາຍໃນຊຸດຂໍ້ມູນໃຫຍ່ກວ່າ:

 # Add primary key/unique identifiers uid_prompt = PromptTemplate.from_template(""" You are a graph database expert reviewing a single entity from a data model generated by a colleague. You want to ensure that all of the nodes imported into the database are unique. ## Example A schema contains Actors with a number of properties including name, date of birth. Two actors may have the same name then add a new compound property combining the name and date of birth. If combining values, include the instruction to convert the value to slug case. Call the new property 'id'. If you have identified a new property, add it to the list of properties leaving the rest intact. Include in the description the fields that are to be concatenated. ## Example Output Here is an example of a good output: {example_output} ## Current Entity Schema {entity} """, partial_variables=dict(example_output=dumps(example_output))) uid_chain = uid_prompt | llm.with_structured_output(JSONSchemaSpecification)

ຂັ້ນຕອນນີ້ແມ່ນຕ້ອງການພຽງແຕ່ສໍາລັບຂໍ້, ສະນັ້ນຂ້າພະເຈົ້າໄດ້ດຶງຂໍ້ມູນຈາກ schema, ແລ່ນຕ່ອງໂສ້ສໍາລັບແຕ່ລະຄົນແລະຫຼັງຈາກນັ້ນລວມການພົວພັນກັບຄໍານິຍາມທີ່ປັບປຸງ:

 # extract nodes and relationships nodes = [n for n in existing_model if "node" in n["description"].lower()] rels = [n for n in existing_model if "node" not in n["description"].lower()] # generate a unique id for nodes with_uids = [] for entity in nodes: res = uid_chain.invoke(dict(entity=dumps(entity))) json = loads(res.jsonschema) with_uids = with_uids + json if type(json) == list else with_uids + [json] # combine nodes and relationships with_uids = with_uids + rels

ການທົບທວນຄືນແບບຈໍາລອງຂໍ້ມູນ

ສໍາລັບສຸຂາພິບານ, ມັນເປັນມູນຄ່າການກວດສອບຮູບແບບສໍາລັບການເພີ່ມປະສິດທິພາບ. model_prompt ໄດ້ເຮັດວຽກທີ່ດີໃນການກໍານົດຄໍານາມແລະພະຍັນຊະນະ, ແຕ່ໃນຮູບແບບທີ່ສັບສົນກວ່າ.

ການຊ້ຳກັນຄັ້ງໜຶ່ງໄດ້ປະຕິບັດຖັນ *_playlists ແລະ _charts ເປັນ ID ແລະພະຍາຍາມສ້າງ Stream nodes ແລະ IN_PLAYLIST ຄວາມສຳພັນ. ຂ້າພະເຈົ້າຄິດວ່ານີ້ແມ່ນເນື່ອງມາຈາກການນັບຫຼາຍກວ່າ 1,000 ລວມທັງການຈັດຮູບແບບທີ່ມີເຄື່ອງຫມາຍຈຸດ (e.g., 1,001).

ຄວາມຄິດທີ່ດີ, ແຕ່ບາງທີອາດສະຫລາດເກີນໄປ. ແຕ່ນີ້ສະແດງໃຫ້ເຫັນເຖິງຄວາມສໍາຄັນຂອງການມີມະນຸດຢູ່ໃນ loop ທີ່ເຂົ້າໃຈໂຄງສ້າງຂໍ້ມູນ.

 # Add primary key/unique identifiers review_prompt = PromptTemplate.from_template(""" You are a graph database expert reviewing a data model generated by a colleague. Your task is to review the data model and ensure that it is fit for purpose. Check for: ## Check for nested objects Remember that Neo4j cannot store arrays of objects or nested objects. These must be converted into into separate nodes with relationships between them. You must include the new node and a reference to the relationship to the output schema. ## Check for Entities in properties If there is a property that represents an array of IDs, a new node should be created for that entity. You must include the new node and a reference to the relationship to the output schema. # Keep Instructions Ensure that the instructions for the nodes, relationships, and properties are clear and concise. You may improve them but the detail must not be removed in any circumstances. ## Current Entity Schema {entity} """) review_chain = review_prompt | llm.with_structured_output(JSONSchemaSpecification) review_nodes = [n for n in with_uids if "node" in n["description"].lower() ] review_rels = [n for n in with_uids if "node" not in n["description"].lower() ] reviewed = [] for entity in review_nodes: res = review_chain.invoke(dict(entity=dumps(entity))) json = loads(res.jsonschema) reviewed = reviewed + json # add relationships back in reviewed = reviewed + review_rels len(reviewed) reviewed = with_uids

ໃນສະຖານະການທີ່ແທ້ຈິງ, ຂ້າພະເຈົ້າຕ້ອງການທີ່ຈະດໍາເນີນການນີ້ສອງສາມເທື່ອເພື່ອປັບປຸງແບບຈໍາລອງຂໍ້ມູນ. ຂ້າພະເຈົ້າຈະວາງຂອບເຂດຈໍາກັດສູງສຸດ, ຫຼັງຈາກນັ້ນໃຫ້ເຮັດຊ້ໍາອີກເຖິງຈຸດນັ້ນຫຼືຈຸດປະສົງຂອງຕົວແບບຂໍ້ມູນບໍ່ປ່ຽນແປງອີກຕໍ່ໄປ.

ສ້າງລາຍງານການນໍາເຂົ້າ

ໂດຍຈຸດນີ້, schema ຄວນຈະເຂັ້ມແຂງພຽງພໍແລະປະກອບມີຂໍ້ມູນຫຼາຍເທົ່າທີ່ເປັນໄປໄດ້ເພື່ອໃຫ້ LLM ສ້າງຊຸດຂອງສະຄິບນໍາເຂົ້າ.

ສອດຄ່ອງກັບ ຄໍາແນະນໍາການນໍາເຂົ້າຂໍ້ມູນ Neo4j , ໄຟລ໌ຄວນໄດ້ຮັບການປະມວນຜົນຫຼາຍຄັ້ງ, ແຕ່ລະຄັ້ງທີ່ນໍາເຂົ້າຂໍ້ດຽວຫຼືການພົວພັນເພື່ອຫຼີກເວັ້ນການກະຕືລືລົ້ນການດໍາເນີນງານແລະການລັອກ.

 import_prompt = PromptTemplate.from_template(""" Based on the data model, write a Cypher statement to import the following data from a CSV file into Neo4j. Do not use LOAD CSV as this data will be imported using the Neo4j Python Driver, use UNWIND on the $rows parameter instead. You are writing a multi-pass import process, so concentrate on the entity mentioned. When importing data, you must use the following guidelines: * follow the instructions in the description when identifying primary keys. * Use the instructions in the description to determine the format of properties when a finding. * When combining fields into an ID, use the apoc.text.slug function to convert any text to slug case and toLower to convert the string to lowercase - apoc.text.slug(toLower(row.`name`)) * If you split a property, convert it to a string and use the trim function to remove any whitespace - trim(toString(row.`name`)) * When combining properties, wrap each property in the coalesce function so the property is not null if one of the values is not set - coalesce(row.`id`, '') + '--'+ coalsece(row.`title`) * Use the `column_name` field to map the CSV column to the property in the data model. * Wrap all column names from the CSV in backticks - for example row.`column_name`. * When you merge nodes, merge on the unique identifier and nothing else. All other properties should be set using `SET`. * Do not use apoc.periodic.iterate, the files will be batched in the application. Data Model: {data_model} Current Entity: {entity} """)

ລະບົບຕ່ອງໂສ້ນີ້ຕ້ອງການວັດຖຸຜົນຜະລິດທີ່ແຕກຕ່າງກັນກັບຂັ້ນຕອນທີ່ຜ່ານມາ. ໃນກໍລະນີນີ້, ສະມາຊິກ cypher ແມ່ນສໍາຄັນທີ່ສຸດ, ແຕ່ຂ້າພະເຈົ້າຍັງຕ້ອງການລວມເອົາລະຫັດ chain_of_thought ເພື່ອຊຸກຍູ້ Chain of Thought:

 class CypherOutputSpecification(BaseModel): chain_of_thought: str = Field(description="Any reasoning used to write the Cypher statement") cypher: str = Field(description="The Cypher statement to import the data") notes: Optional[str] = Field(description="Any notes or closing remarks about the Cypher statement") import_chain = import_prompt | llm.with_structured_output(CypherOutputSpecification)

ຫຼັງຈາກນັ້ນ, ຂະບວນການດຽວກັນນີ້ໃຊ້ກັບ iterate ໃນແຕ່ລະຄໍານິຍາມທີ່ຖືກທົບທວນຄືນແລະສ້າງ Cypher:

 import_cypher = [] for n in reviewed: print('\n\n------', n['title']) res = import_chain.invoke(dict( data_model=dumps(reviewed), entity=n )) import_cypher.append(( res.cypher )) print(res.cypher)

ຜົນຜະລິດ Console:

 ------ Track UNWIND $rows AS row MERGE (t:Track {id: apoc.text.slug(toLower(coalesce(row.`track_name`, '') + '-' + coalesce(row.`released_year`, '')))}) SET t.name = trim(toString(row.`track_name`)), t.artist_count = toInteger(row.`artist_count`), t.released_year = toInteger(row.`released_year`), t.released_month = toInteger(row.`released_month`), t.released_day = toInteger(row.`released_day`), t.inSpotifyPlaylists = toInteger(row.`in_spotify_playlists`), t.inSpotifyCharts = toInteger(row.`in_spotify_charts`), t.streams = row.`streams`, t.inApplePlaylists = toInteger(row.`in_apple_playlists`), t.inAppleCharts = toInteger(row.`in_apple_charts`), t.inDeezerPlaylists = row.`in_deezer_playlists`, t.inDeezerCharts = toInteger(row.`in_deezer_charts`), t.inShazamCharts = row.`in_shazam_charts`, t.bpm = toInteger(row.`bpm`), t.key = trim(toString(row.`key`)), t.mode = trim(toString(row.`mode`)), t.danceability = toInteger(row.`danceability_%`), t.valence = toInteger(row.`valence_%`), t.energy = toInteger(row.`energy_%`), t.acousticness = toInteger(row.`acousticness_%`), t.instrumentalness = toInteger(row.`instrumentalness_%`), t.liveness = toInteger(row.`liveness_%`), t.speechiness = toInteger(row.`speechiness_%`), t.coverUrl = CASE row.`cover_url` WHEN 'Not Found' THEN '' ELSE trim(toString(row.`cover_url`)) END ------ Artist UNWIND $rows AS row WITH row, split(row.`artist(s)_name`, ',') AS artistNames UNWIND artistNames AS artistName MERGE (a:Artist {id: apoc.text.slug(toLower(trim(artistName)))}) SET a.name = trim(artistName) ------ PERFORMED_BY UNWIND $rows AS row UNWIND split(row.`artist(s)_name`, ',') AS artist_name MERGE (t:Track {id: apoc.text.slug(toLower(row.`track_name`)) + '-' + trim(toString(row.`released_year`))}) MERGE (a:Artist {id: apoc.text.slug(toLower(trim(artist_name)))}) MERGE (t)-[:PERFORMED_BY]->(a)

ການກະຕຸ້ນເຕືອນນີ້ໄດ້ເອົາວິສະວະກໍາບາງຢ່າງເພື່ອບັນລຸຜົນໄດ້ຮັບທີ່ສອດຄ່ອງ:

ບາງຄັ້ງ Cypher ຈະປະກອບມີຄໍາຖະແຫຼງທີ່ MERGE ທີ່ມີຫຼາຍຊ່ອງຂໍ້ມູນທີ່ກໍານົດໄວ້, ເຊິ່ງເປັນສິ່ງທີ່ດີທີ່ສຸດ. ຖ້າຖັນໃດນຶ່ງເປັນ null, ການນຳເຂົ້າທັງໝົດຈະລົ້ມເຫລວ.
ບາງຄັ້ງ, ຜົນໄດ້ຮັບຈະປະກອບມີ apoc.period.iterate , ເຊິ່ງບໍ່ຕ້ອງການອີກຕໍ່ໄປ, ແລະຂ້ອຍຕ້ອງການລະຫັດທີ່ຂ້ອຍສາມາດດໍາເນີນການກັບ Python driver.
ຂ້ອຍຕ້ອງເວົ້າຄືນວ່າຊື່ຄໍລໍາທີ່ລະບຸຄວນຖືກນໍາໃຊ້ໃນເວລາສ້າງຄວາມສໍາພັນ.
LLM ພຽງແຕ່ຈະບໍ່ປະຕິບັດຕາມຄໍາແນະນໍາໃນເວລາທີ່ການນໍາໃຊ້ຕົວລະບຸທີ່ເປັນເອກະລັກໃນ nodes ໃນແຕ່ລະຈຸດສິ້ນສຸດຂອງການພົວພັນ, ດັ່ງນັ້ນມັນໄດ້ພະຍາຍາມສອງສາມຄັ້ງເພື່ອໃຫ້ມັນປະຕິບັດຕາມຄໍາແນະນໍາໃນຄໍາອະທິບາຍ. ມີບາງອັນກັບໄປມາລະຫວ່າງການເຕືອນນີ້ ແລະ model_prompt.
Backticks ແມ່ນຈໍາເປັນສໍາລັບຊື່ຖັນທີ່ປະກອບມີຕົວອັກສອນພິເສດ (ເຊັ່ນ: energy_%).

ມັນຍັງຈະເປັນປະໂຫຍດທີ່ຈະແບ່ງປັນນີ້ເປັນສອງ prompts - ຫນຶ່ງສໍາລັບ nodes ແລະຫນຶ່ງສໍາລັບຄວາມສໍາພັນ. ແຕ່ນັ້ນແມ່ນວຽກງານສຳລັບມື້ອື່ນ.

ສ້າງຂໍ້ຈໍາກັດທີ່ເປັນເອກະລັກ

ຕໍ່ໄປ, scripts ນໍາເຂົ້າສາມາດຖືກນໍາໃຊ້ເປັນພື້ນຖານເພື່ອສ້າງຂໍ້ຈໍາກັດທີ່ເປັນເອກະລັກໃນຖານຂໍ້ມູນ:

 constraint_prompt = PromptTemplate.from_template(""" You are an expert graph database administrator. Use the following Cypher statement to write a Cypher statement to create unique constraints on any properties used in a MERGE statement. The correct syntax for a unique constraint is: CREATE CONSTRAINT movie_title_id IF NOT EXISTS FOR (m:Movie) REQUIRE m.title IS UNIQUE; Cypher: {cypher} """) constraint_chain = constraint_prompt | llm.with_structured_output(CypherOutputSpecification) constraint_queries = [] for statement in import_cypher: res = constraint_chain.invoke(dict(cypher=statement)) statements = res.cypher.split(";") for cypher in statements: constraint_queries.append(cypher)

ຜົນຜະລິດ Console:

 CREATE CONSTRAINT track_id_unique IF NOT EXISTS FOR (t:Track) REQUIRE t.id IS UNIQUE CREATE CONSTRAINT stream_id IF NOT EXISTS FOR (s:Stream) REQUIRE s.id IS UNIQUE CREATE CONSTRAINT playlist_id IF NOT EXISTS FOR (p:Playlist) REQUIRE p.id IS UNIQUE CREATE CONSTRAINT chart_id IF NOT EXISTS FOR (c:Chart) REQUIRE c.id IS UNIQUE CREATE CONSTRAINT track_id_unique IF NOT EXISTS FOR (t:Track) REQUIRE t.id IS UNIQUE CREATE CONSTRAINT stream_id_unique IF NOT EXISTS FOR (s:Stream) REQUIRE s.id IS UNIQUE CREATE CONSTRAINT track_id_unique IF NOT EXISTS FOR (t:Track) REQUIRE t.id IS UNIQUE CREATE CONSTRAINT playlist_id_unique IF NOT EXISTS FOR (p:Playlist) REQUIRE p.id IS UNIQUE CREATE CONSTRAINT track_id_unique IF NOT EXISTS FOR (track:Track) REQUIRE track.id IS UNIQUE CREATE CONSTRAINT chart_id_unique IF NOT EXISTS FOR (chart:Chart) REQUIRE chart.id IS UNIQUE

ບາງຄັ້ງການເຕືອນນີ້ຈະສົ່ງຄືນຄໍາຖະແຫຼງການສໍາລັບດັດສະນີແລະຂໍ້ຈໍາກັດ, ເຮັດໃຫ້ເກີດການແບ່ງປັນຢູ່ໃນເຄິ່ງຈໍ້າສອງເມັດ.

ດໍາເນີນການນໍາເຂົ້າ

ດ້ວຍທຸກສິ່ງທຸກຢ່າງຢູ່ໃນສະຖານທີ່, ມັນແມ່ນເວລາທີ່ຈະປະຕິບັດຄໍາຖະແຫຼງທີ່ ethe Cypher:

 from os import getenv from neo4j import GraphDatabase driver = GraphDatabase.driver( getenv("NEO4J_URI"), auth=( getenv("NEO4J_USERNAME"), getenv("NEO4J_PASSWORD") ) ) with driver.session() as session: # truncate the db session.run("MATCH (n) DETACH DELETE n") # create constraints for q in constraint_queries: if q.strip() != "": session.run(q) # import the data for q in import_cypher: if q.strip() != "": res = session.run(q, rows=rows).consume() print(q) print(res.counters)

QA ໃນຊຸດຂໍ້ມູນ

ໂພສນີ້ຈະບໍ່ສົມບູນຖ້າບໍ່ມີ QA ໃນຊຸດຂໍ້ມູນໂດຍໃຊ້ GraphCypherQAChain:

 from langchain.chains import GraphCypherQAChain from langchain_community.graphs import Neo4jGraph graph = Neo4jGraph( url=getenv("NEO4J_URI"), username=getenv("NEO4J_USERNAME"), password=getenv("NEO4J_PASSWORD"), enhanced_schema=True ) qa = GraphCypherQAChain.from_llm( llm, graph=graph, allow_dangerous_requests=True, verbose=True )

ສິລະປິນທີ່ນິຍົມຫຼາຍທີ່ສຸດ

ໃຜເປັນນັກສິລະປິນທີ່ນິຍົມທີ່ສຸດໃນຖານຂໍ້ມູນ?

 qa.invoke({"query": "Who are the most popular artists?"}) > Entering new GraphCypherQAChain chain... Generated Cypher: cypher MATCH (:Track)-[:PERFORMED_BY]->(a:Artist) RETURN a.name, COUNT(*) AS popularity ORDER BY popularity DESC LIMIT 10 Full Context: [{'a.name': 'Bad Bunny', 'popularity': 40}, {'a.name': 'Taylor Swift', 'popularity': 38}, {'a.name': 'The Weeknd', 'popularity': 36}, {'a.name': 'SZA', 'popularity': 23}, {'a.name': 'Kendrick Lamar', 'popularity': 23}, {'a.name': 'Feid', 'popularity': 21}, {'a.name': 'Drake', 'popularity': 19}, {'a.name': 'Harry Styles', 'popularity': 17}, {'a.name': 'Peso Pluma', 'popularity': 16}, {'a.name': '21 Savage', 'popularity': 14}] > Finished chain. { "query": "Who are the most popular artists?", "result": "Bad Bunny, Taylor Swift, and The Weeknd are the most popular artists." }

LLM ເບິ່ງຄືວ່າຈະຕັດສິນຄວາມນິຍົມໃນແງ່ຂອງຈໍານວນຂອງເພງທີ່ນັກສິລະປິນໄດ້ຢູ່ແທນທີ່ຈະເປັນຈໍານວນການຖ່າຍທອດຂອງພວກເຂົາ.

Beats ຕໍ່ນາທີ

ເພງໃດມີ BPM ສູງສຸດ?

 qa.invoke({"query": "Which track has the highest BPM?"}) > Entering new GraphCypherQAChain chain... Generated Cypher: cypher MATCH (t:Track) RETURN t ORDER BY t.bpm DESC LIMIT 1 Full Context: [{'t': {'id': 'seven-feat-latto-explicit-ver--2023'}}] > Finished chain. { "query": "Which track has the highest BPM?", "result": "I don't know the answer." }

ການປັບປຸງ Cypher Generation Prompt

ໃນກໍລະນີນີ້, Cypher ເບິ່ງດີແລະຜົນໄດ້ຮັບທີ່ຖືກຕ້ອງຖືກລວມເຂົ້າໃນການເຕືອນແຕ່ gpt-4o ບໍ່ສາມາດຕີຄວາມຫມາຍຄໍາຕອບໄດ້. ເບິ່ງຄືວ່າ CYPHER_GENERATION_PROMPT ທີ່ຜ່ານໄປ GraphCypherQACchain ສາມາດເຮັດໄດ້ດ້ວຍຄໍາແນະນໍາເພີ່ມເຕີມເພື່ອເຮັດໃຫ້ຊື່ຖັນເປັນ verbise ຫຼາຍ.

ໃຊ້ຊື່ຖັນ verbose ໃນຄໍາຖະແຫຼງ Cypher ໂດຍໃຊ້ປ້າຍຊື່ແລະຊື່ຊັບສິນ. ຕົວຢ່າງ, ໃຊ້ 'person_name' ແທນ 'ຊື່'.

GraphCypherQACchain ດ້ວຍການເຕືອນແບບກຳນົດເອງ:

 CYPHER_GENERATION_TEMPLATE = """Task:Generate Cypher statement to query a graph database. Instructions: Use only the provided relationship types and properties in the schema. Do not use any other relationship types or properties that are not provided. Schema: {schema} Note: Do not include any explanations or apologies in your responses. Do not respond to any questions that might ask anything else than for you to construct a Cypher statement. Do not include any text except the generated Cypher statement. Always use verbose column names in the Cypher statement using the label and property names. For example, use 'person_name' instead of 'name'. Include data from the immediate network around the node in the result to provide extra context. For example, include the Movie release year, a list of actors and their roles, or the director of a movie. When ordering by a property, add an `IS NOT NULL` check to ensure that only nodes with that property are returned. Examples: Here are a few examples of generated Cypher statements for particular questions: # How many people acted in Top Gun? MATCH (m:Movie {{name:"Top Gun"}}) RETURN COUNT { (m)<-[:ACTED_IN]-() } AS numberOfActors The question is: {question}""" CYPHER_GENERATION_PROMPT = PromptTemplate( input_variables=["schema", "question"], template=CYPHER_GENERATION_TEMPLATE ) qa = GraphCypherQAChain.from_llm( llm, graph=graph, allow_dangerous_requests=True, verbose=True, cypher_prompt=CYPHER_GENERATION_PROMPT, )

ເພງປະຕິບັດໂດຍນັກສິລະປິນຫຼາຍທີ່ສຸດ

Graphs ດີເລີດໃນການສົ່ງຄືນການນັບຈໍານວນຂອງການພົວພັນໂດຍປະເພດແລະທິດທາງ.

 qa.invoke({"query": "Which tracks are performed by the most artists?"}) > Entering new GraphCypherQAChain chain... Generated Cypher: cypher MATCH (t:Track) WITH t, COUNT { (t)-[:PERFORMED_BY]->(:Artist) } as artist_count WHERE artist_count IS NOT NULL RETURN t.id AS track_id, t.name AS track_name, artist_count ORDER BY artist_count DESC Full Context: [{'track_id': 'los-del-espacio-2023', 'track_name': 'Los del Espacio', 'artist_count': 8}, {'track_id': 'se-le-ve-2021', 'track_name': 'Se Le Ve', 'artist_count': 8}, {'track_id': 'we-don-t-talk-about-bruno-2021', 'track_name': "We Don't Talk About Bruno", 'artist_count': 7}, {'track_id': 'cayï-ï-la-noche-feat-cruz-cafunï-ï-abhir-hathi-bejo-el-ima--2022', 'track_name': None, 'artist_count': 6}, {'track_id': 'jhoome-jo-pathaan-2022', 'track_name': 'Jhoome Jo Pathaan', 'artist_count': 6}, {'track_id': 'besharam-rang-from-pathaan--2022', 'track_name': None, 'artist_count': 6}, {'track_id': 'nobody-like-u-from-turning-red--2022', 'track_name': None, 'artist_count': 6}, {'track_id': 'ultra-solo-remix-2022', 'track_name': 'ULTRA SOLO REMIX', 'artist_count': 5}, {'track_id': 'angel-pt-1-feat-jimin-of-bts-jvke-muni-long--2023', 'track_name': None, 'artist_count': 5}, {'track_id': 'link-up-metro-boomin-don-toliver-wizkid-feat-beam-toian-spider-verse-remix-spider-man-across-the-spider-verse--2023', 'track_name': None, 'artist_count': 5}] > Finished chain. { "query": "Which tracks are performed by the most artists?", "result": "The tracks \"Los del Espacio\" and \"Se Le Ve\" are performed by the most artists, with each track having 8 artists." }

ສະຫຼຸບ

ການວິເຄາະ CSV ແລະການສ້າງແບບຈໍາລອງແມ່ນພາກສ່ວນທີ່ໃຊ້ເວລາຫຼາຍທີ່ສຸດ. ມັນສາມາດໃຊ້ເວລາຫຼາຍກວ່າຫ້ານາທີເພື່ອສ້າງ.

ຄ່າໃຊ້ຈ່າຍດ້ວຍຕົນເອງແມ່ນລາຄາຖືກ pretty. ໃນແປດຊົ່ວໂມງຂອງການທົດລອງ, ຂ້ອຍຕ້ອງສົ່ງຄໍາຮ້ອງຂໍຫຼາຍຮ້ອຍຄັ້ງແລະຂ້ອຍໄດ້ສິ້ນສຸດລົງເຖິງການໃຊ້ເງິນໂດລາຫຼືຫຼາຍກວ່ານັ້ນ.

ມີຫຼາຍສິ່ງທ້າທາຍທີ່ຈະມາຮອດຈຸດນີ້:

ການກະຕຸ້ນເຕືອນໄດ້ໃຊ້ເວລາຫຼາຍຄັ້ງເພື່ອໃຫ້ຖືກຕ້ອງ. ບັນຫານີ້ສາມາດແກ້ໄຂໄດ້ໂດຍການປັບຕົວແບບໃຫ້ລະອຽດ ຫຼືໃຫ້ຕົວຢ່າງຍິງໜ້ອຍໜຶ່ງ.
ຄໍາຕອບ JSON ຈາກ GPT-4o ສາມາດບໍ່ສອດຄ່ອງ. ຂ້ອຍໄດ້ຖືກແນະນໍາ json-repair , ເຊິ່ງດີກວ່າການພະຍາຍາມເຮັດໃຫ້ LLM ກວດສອບຜົນຜະລິດ JSON ຂອງຕົນເອງ.

ຂ້ອຍສາມາດເຫັນໄດ້ວ່າວິທີການນີ້ເຮັດວຽກໄດ້ດີໃນການປະຕິບັດ LangGraph ບ່ອນທີ່ການດໍາເນີນງານຖືກດໍາເນີນຕາມລໍາດັບ, ໃຫ້ LLM ສາມາດສ້າງແລະປັບປຸງຮູບແບບ. ຍ້ອນວ່າຮູບແບບໃຫມ່ຖືກປ່ອຍອອກມາ, ພວກເຂົາຍັງອາດຈະໄດ້ຮັບຜົນປະໂຫຍດຈາກການປັບຕົວ.

ສຶກສາເພີ່ມເຕີມ

ກວດເບິ່ງ Harnessing Large Language Models with Neo4j ສໍາລັບຂໍ້ມູນເພີ່ມເຕີມກ່ຽວກັບການປັບປຸງຂະບວນການສ້າງກາຟຄວາມຮູ້ກັບ LLMs. ອ່ານ ສ້າງ Neo4j GraphRAG Workflow ໂດຍໃຊ້ LangChain ແລະ LangGraph ສໍາລັບເພີ່ມເຕີມກ່ຽວກັບ LangGraph ແລະ Neo4j. ແລະເພື່ອຮຽນຮູ້ເພີ່ມເຕີມກ່ຽວກັບການປັບລະອຽດ, ໃຫ້ກວດເບິ່ງ Knowledge Graphs ແລະ LLMs: Fine-Tuning vs. Retrieval-augmented Generation .

ຮູບພາບຄຸນສົມບັດ: ຮູບແບບກາຟສະແດງໃຫ້ເຫັນເພງທີ່ມີຄວາມສໍາພັນ PERFORMED_BY ກັບສິນລະປິນ. ຮູບພາບໂດຍຜູ້ຂຽນ.

ເພື່ອສຶກສາເພີ່ມເຕີມກ່ຽວກັບຫົວຂໍ້ນີ້, ເຂົ້າຮ່ວມກັບພວກເຮົາທີ່ NODES 2024 ໃນວັນທີ 7 ພະຈິກ, ກອງປະຊຸມນັກພັດທະນາສະເໝືອນຟຣີຂອງພວກເຮົາກ່ຽວກັບແອັບອັດສະລິຍະ, ກຣາບຄວາມຮູ້ ແລະ AI. ລົງທະບຽນດຽວນີ້!