Brushing up on your Python Skills

The basics of this class are taught in Python, and one of the most neglected basics of ALP is preprocessing our texts.

Preprocessing for ALP is much broader than what computer and data scientists usually mean by the term. Philological conventions in printed and digital publications carry a great deal of information that must be parsed correctly before any computational manipulation or analysis.

In this notebook, we provide examples of messy texts in Akkadian and Egyptian. We will work through how to parse the texts and what information we lose when parsing them, brushing up on basic Python syntax and functions while we're at it.

Akkadian Example 1:

https://cdli.mpiwg-berlin.mpg.de/artifacts/225104

&P225104 = TIM 10, 134
#atf: use lexical
#Nippur 2N-T496; proverb; Alster proverbs
@tablet
@obverse
@column 1
1. dub-sar hu-ru
2. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne
3. dub-sar hu-ru
4. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne
@reverse
@column 1
1. igi-bi 3(disz) 3(asz) 6(disz)

Task 1:

How do we turn this raw text into a list of words?

akk1 = """&P225104 = TIM 10, 134
#atf: use lexical
#Nippur 2N-T496; proverb; Alster proverbs
@tablet
@obverse
@column 1
1. dub-sar hu-ru
2. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne
3. dub-sar hu-ru
4. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne
@reverse
@column 1
1. igi-bi 3(disz) 3(asz) 6(disz)
"""
akk1
'&P225104 = TIM 10, 134\n#atf: use lexical\n#Nippur 2N-T496; proverb; Alster proverbs\n@tablet\n@obverse\n@column 1\n1. dub-sar hu-ru\n2. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne\n3. dub-sar hu-ru\n4. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne\n@reverse\n@column 1\n1. igi-bi 3(disz) 3(asz) 6(disz)\n'
# split string to lines of texts
lines = akk1.split("\n")
lines
['&P225104 = TIM 10, 134',
 '#atf: use lexical',
 '#Nippur 2N-T496; proverb; Alster proverbs',
 '@tablet',
 '@obverse',
 '@column 1',
 '1. dub-sar hu-ru',
 '2. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne',
 '3. dub-sar hu-ru',
 '4. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne',
 '@reverse',
 '@column 1',
 '1. igi-bi 3(disz) 3(asz) 6(disz)',
 '']
# remove blanks

lines_full = []
for line in lines:
  if line != "":
    lines_full.append(line)

lines_full
['&P225104 = TIM 10, 134',
 '#atf: use lexical',
 '#Nippur 2N-T496; proverb; Alster proverbs',
 '@tablet',
 '@obverse',
 '@column 1',
 '1. dub-sar hu-ru',
 '2. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne',
 '3. dub-sar hu-ru',
 '4. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne',
 '@reverse',
 '@column 1',
 '1. igi-bi 3(disz) 3(asz) 6(disz)']
# keep only lines that begin with a number
# use regular expressions

import re

text_lines = []
for line in lines_full:
  if re.match(r"^\d", line) is not None:  # raw string, so \d reaches the regex engine unchanged
    text_lines.append(line)

text_lines
['1. dub-sar hu-ru',
 '2. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne',
 '3. dub-sar hu-ru',
 '4. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne',
 '1. igi-bi 3(disz) 3(asz) 6(disz)']
# separate lines into words

words_appended = []
words_extended = []
for line in text_lines:
  temp_words = line.split()
  words_appended.append(temp_words[1:]) # creates list of lists
  words_extended.extend(temp_words[1:]) # creates list

print(words_appended)
print("-------------------------------")
print(words_extended)
[['dub-sar', 'hu-ru'], ['a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne'], ['dub-sar', 'hu-ru'], ['a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne'], ['igi-bi', '3(disz)', '3(asz)', '6(disz)']]
-------------------------------
['dub-sar', 'hu-ru', 'a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne', 'dub-sar', 'hu-ru', 'a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne', 'igi-bi', '3(disz)', '3(asz)', '6(disz)']
# rewrite the code above as a function
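
One possible answer to the exercise, as a minimal sketch: the steps above (split into lines, keep the lines that start with a digit, drop the line number, collect the words) folded into a single function. The name `atf_to_words` and the shortened `sample` text are assumptions made for this illustration only.

```python
import re

def atf_to_words(atf_text):
    """Return a flat list of the words found on the numbered text lines."""
    words = []
    for line in atf_text.split("\n"):
        # keep only lines that begin with a digit (the text lines)
        if re.match(r"\d", line) is None:
            continue
        # the first token is the line number; everything after it is a word
        words.extend(line.split()[1:])
    return words

# shortened sample, for illustration only
sample = """&P225104 = TIM 10, 134
@obverse
1. dub-sar hu-ru
@reverse
1. igi-bi 3(disz) 3(asz) 6(disz)
"""
print(atf_to_words(sample))
```

Calling `atf_to_words(akk1)` on the full text should give the same flat list as `words_extended` above.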

What information did we lose when preprocessing the texts in this way?
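
One way to start answering this question is to collect everything the filter throws away. The sketch below is only an illustration on a shortened copy of the text; the variable names `dropped_lines` and `dropped_numbers` are made up for this example.

```python
import re

# shortened sample, for illustration only
sample = """&P225104 = TIM 10, 134
#atf: use lexical
@tablet
@obverse
@column 1
1. dub-sar hu-ru
@reverse
@column 1
1. igi-bi 3(disz) 3(asz) 6(disz)
"""

dropped_lines = []    # whole lines rejected by the "starts with a digit" filter
dropped_numbers = []  # line numbers cut off by temp_words[1:]
for line in sample.splitlines():
    if not line:
        continue
    if re.match(r"\d", line) is None:
        dropped_lines.append(line)
    else:
        dropped_numbers.append(line.split()[0])

print(dropped_lines)
print(dropped_numbers)
```

Both text lines in the sample are numbered 1. Once the surface lines are gone, nothing in the word list tells us which line 1 stood on the obverse and which on the reverse.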

Task 2:

Create a dictionary from the raw texts, of the following format:

{"pnum": ...
 "textID": ...
 "surface": [{
  "surfaceType": ...
  "columns": [{
    "columnNum": ...
    "text": [{
      "lineNum": ...
      "words": [..., ..., ...]
    }]
  }]
 }]}
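
The nesting of this format can be easier to see when it is written out as types. The sketch below is optional and not part of the task: the class names are invented for this illustration, and the field types (for example, `lineNum` as a string) are assumptions rather than requirements.

```python
from typing import List, Optional, TypedDict

class Line(TypedDict):
    lineNum: str            # kept as a string here (an assumption)
    words: List[str]

class Column(TypedDict):
    columnNum: Optional[int]  # None when the surface has no @column line
    text: List[Line]

class Surface(TypedDict):
    surfaceType: str
    columns: List[Column]

class Text(TypedDict):
    pnum: str
    textID: str
    surface: List[Surface]

# a hand-written instance with a single line, to show the nesting
example: Text = {
    "pnum": "P225104",
    "textID": "TIM 10, 134",
    "surface": [{
        "surfaceType": "obverse",
        "columns": [{
            "columnNum": 1,
            "text": [{"lineNum": "1", "words": ["dub-sar", "hu-ru"]}],
        }],
    }],
}
print(example["surface"][0]["columns"][0]["text"][0]["words"])
```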
# separate text into lines

lines = akk1.split("\n")

lines_full = []
for line in lines:
  if line != "":
    lines_full.append(line)

lines_full
['&P225104 = TIM 10, 134',
 '#atf: use lexical',
 '#Nippur 2N-T496; proverb; Alster proverbs',
 '@tablet',
 '@obverse',
 '@column 1',
 '1. dub-sar hu-ru',
 '2. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne',
 '3. dub-sar hu-ru',
 '4. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne',
 '@reverse',
 '@column 1',
 '1. igi-bi 3(disz) 3(asz) 6(disz)']
# store the pnum and textID in variables

text_ids = lines_full[0]
pnum, textID = text_ids.split("=", 1)  # split at the first "=" only, in case the publication name contains one

pnum = pnum.strip()[1:]
textID = textID.strip()
print(pnum)
print(textID)
P225104
TIM 10, 134
# create a dictionary for each surface (simple method, no regex)
# what do you do when you have a different type of inscribed object (e.g. cylinder, prism, bowl, slab)?

valid_surface_values = ["@obverse", "@reverse"]

surface_idx = []

for index, line in enumerate(lines_full):
  # what is dangerous in this line? the test is an exact match, so a surface line with anything extra (a trailing space, a flag) is silently skipped
  if line in valid_surface_values:
    surface_idx.append(index)
print(surface_idx)
[4, 10]
# create a dictionary for each surface (more complicated method, with regex)
# what do you do when you have a different type of inscribed object (e.g. cylinder, prism, bowl, slab)?

valid_surface_values = ["@obverse", "@reverse"]

pattern = r"^(?:" + "|".join([re.escape(value) for value in valid_surface_values]) + ")"

surface_idx = []

for index, line in enumerate(lines_full):
  if re.match(pattern, line) is not None:
    surface_idx.append(index)
print(surface_idx)
[4, 10]
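
On the question of other object types and surfaces, one hedged option is to widen the list of tags and capture whatever follows the tag name. The tag list below is an assumption for illustration and is not exhaustive; check it against the ATF documentation before relying on it. The function name `find_surfaces` and the test lines are also invented here.

```python
import re

# Assumed surface tags, not an exhaustive or authoritative list
SURFACE_TAGS = ["obverse", "reverse", "left", "right", "top",
                "bottom", "edge", "face", "surface", "seal"]
# \b stops "@reverse" from matching inside a longer word such as "@reversed"
SURFACE_RE = re.compile(r"^@(" + "|".join(SURFACE_TAGS) + r")\b(.*)$")

def find_surfaces(lines):
    """Return (index, surface name, trailing label or flag) for each surface line."""
    found = []
    for index, line in enumerate(lines):
        match = SURFACE_RE.match(line)
        if match:
            found.append((index, match.group(1), match.group(2).strip()))
    return found

# hypothetical lines, for illustration only
test_lines = ["@tablet", "@obverse", "1. dub-sar hu-ru",
              "@reverse?", "@face a", "@column 1"]
print(find_surfaces(test_lines))
```

Unlike the exact-match test, this keeps a surface line that carries a trailing flag or label, and it returns that extra text instead of discarding it.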
# use surface indices to create surface dictionaries
# fields needed: surfaceType; columnNum; lineNum; words
# surfaceType: taken from the surface lines located by their indices
# columnNum: first check whether a column line exists, then take the number after the word "column" (by regex or by splitting on spaces)
# lineNum: any line that begins with a number, plus any tags attached; is it better to store line numbers as integers or as strings?
# words: the rest of each text line after lineNum, tokenized on spaces

for index, id in enumerate(surface_idx):
    surfaceType = lines_full[id].replace('@', '')
    print(index, id)
    if index < len(surface_idx) - 1:
        end_of_surface = surface_idx[index+1]
    else:
        end_of_surface = len(lines_full)

    # Extract the text content for the current surface designation
    surface_content = lines_full[id+1:end_of_surface]

    # Print the surface type and its content
    print(f"Surface Type: {surfaceType}")
    # print("Content:")
    # print('\n'.join(surface_content))
    print('---')

    # Extract column number, line numbers, and words for each surface content
    for line in surface_content:
        columnNum = None
        lineNum = None
        words = []

        # Check if the line contains a column number
        if '@column' in line:
            parts = line.split()
            if len(parts) >= 2:
                try:
                    columnNum = int(parts[1])
                except ValueError:
                    pass
            print(f"Column Number: {columnNum}")
            print('---')
            continue  # Skip processing the line with @column

        # Check if the line contains a line number
        if '.' in line:
            # split at the first period only, so periods inside words survive
            parts = line.split('.', 1)
            if len(parts) >= 2:
                lineNum = parts[0].strip()

        # Tokenize the words in the line
        if lineNum:
            words = parts[1].strip().split()
        else:
            words = line.strip().split()

        # Print the extracted information for each line
        print(f"Line Number: {lineNum}")
        print(f"Words: {words}")
        print('---')
0 4
Surface Type: obverse
---
Column Number: 1
---
Line Number: 1
Words: ['dub-sar', 'hu-ru']
---
Line Number: 2
Words: ['a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne']
---
Line Number: 3
Words: ['dub-sar', 'hu-ru']
---
Line Number: 4
Words: ['a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne']
---
1 10
Surface Type: reverse
---
Column Number: 1
---
Line Number: 1
Words: ['igi-bi', '3(disz)', '3(asz)', '6(disz)']
---
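
The comments above ask whether line numbers should be integers or strings. The sketch below argues for strings under one assumption: that line labels can carry more than digits, such as a prime on a broken tablet (`1'`) or a letter (`1a`). The regular expression and the function name `parse_text_line` are illustrative choices, not the only way to do this.

```python
import re

# Assumed label shape: digits, an optional letter, optional primes, then a period
LINE_RE = re.compile(r"^(\d+[a-z]?'*)\.\s+(.*)$")

def parse_text_line(line):
    """Return {'lineNum': str, 'words': [...]} or None if this is not a text line."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return {"lineNum": match.group(1), "words": match.group(2).split()}

print(parse_text_line("1. dub-sar hu-ru"))
print(parse_text_line("1'. dub-sar hu-ru"))  # hypothetical primed line number
print(parse_text_line("@column 1"))          # not a text line
```

A label such as `1'` cannot be converted with `int()`, so storing the label as a string keeps it intact; an integer can still be derived later for sorting if needed.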
# Combine the surfaces and metadata into one dictionary

output = {
    "pnum": pnum,
    "textID": textID,
    "surface": []
}

for index, id in enumerate(surface_idx):
    surfaceType = lines_full[id].replace('@', '')
    surface = {
        "surfaceType": surfaceType,
        "columns": []
    }

    if index < len(surface_idx) - 1:
        end_of_surface = surface_idx[index+1]
    else:
        end_of_surface = len(lines_full)

    # Extract the text content for the current surface designation
    surface_content = lines_full[id+1:end_of_surface]

    # Extract column number, line numbers, and words for each surface content
    # A new column dictionary is started at every @column line, so a surface
    # with several columns no longer collapses into a single column.
    column = None
    for line in surface_content:
        lineNum = None
        words = []

        # Check if the line starts a new column
        if line.startswith('@column'):
            parts = line.split()
            columnNum = None
            if len(parts) >= 2:
                try:
                    columnNum = int(parts[1])
                except ValueError:
                    pass
            column = {"columnNum": columnNum, "text": []}
            surface["columns"].append(column)
            continue  # Skip processing the line with @column

        # Check if the line contains a line number
        if '.' in line:
            # split at the first period only, so periods inside words survive
            parts = line.split('.', 1)
            if len(parts) >= 2:
                lineNum = parts[0].strip()

        # Tokenize the words in the line
        if lineNum:
            words = parts[1].strip().split()
        else:
            words = line.strip().split()

        # Text that appears before any @column line goes into an unnumbered column
        if column is None:
            column = {"columnNum": None, "text": []}
            surface["columns"].append(column)

        # Add the line information to the current column
        line_info = {
            "lineNum": lineNum,
            "words": words
        }
        column["text"].append(line_info)

    # Add the surface to the output
    output["surface"].append(surface)

# Print the output in the specified dictionary format
print(output)
{'pnum': 'P225104', 'textID': 'TIM 10, 134', 'surface': [{'surfaceType': 'obverse', 'columns': [{'columnNum': 1, 'text': [{'lineNum': '1', 'words': ['dub-sar', 'hu-ru']}, {'lineNum': '2', 'words': ['a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne']}, {'lineNum': '3', 'words': ['dub-sar', 'hu-ru']}, {'lineNum': '4', 'words': ['a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e#-ne']}]}]}, {'surfaceType': 'reverse', 'columns': [{'columnNum': 1, 'text': [{'lineNum': '1', 'words': ['igi-bi', '3(disz)', '3(asz)', '6(disz)']}]}]}]}
# Save the output dictionary as a JSON file

import json
with open(f"{pnum}.json", "w") as json_file:
    json.dump(output, json_file, indent=4)
# rewrite the code above into a function
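
One possible answer to this exercise, as a sketch under stated assumptions rather than a complete ATF parser: it handles only `@obverse` and `@reverse`, ignores every line it does not recognise (including text that appears before the first surface), and keeps line numbers as strings. The function name `parse_atf` and the regular expressions are illustrative choices.

```python
import re

SURFACE_RE = re.compile(r"^@(obverse|reverse)\b")
COLUMN_RE = re.compile(r"^@column\s+(\d+)")
LINE_RE = re.compile(r"^(\d+[^.\s]*)\.\s+(.*)$")

def parse_atf(raw):
    """Turn one raw ATF text into the nested dictionary format of Task 2."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    pnum, text_id = lines[0].split("=", 1)
    output = {"pnum": pnum.strip().lstrip("&"),
              "textID": text_id.strip(),
              "surface": []}
    surface = None
    column = None
    for line in lines[1:]:
        match = SURFACE_RE.match(line)
        if match:
            surface = {"surfaceType": match.group(1), "columns": []}
            output["surface"].append(surface)
            column = None  # every surface starts without an open column
            continue
        match = COLUMN_RE.match(line)
        if match and surface is not None:
            column = {"columnNum": int(match.group(1)), "text": []}
            surface["columns"].append(column)
            continue
        match = LINE_RE.match(line)
        if match and surface is not None:
            if column is None:  # text line without a preceding @column
                column = {"columnNum": None, "text": []}
                surface["columns"].append(column)
            column["text"].append({"lineNum": match.group(1),
                                   "words": match.group(2).split()})
    return output

akk1 = """&P225104 = TIM 10, 134
#atf: use lexical
#Nippur 2N-T496; proverb; Alster proverbs
@tablet
@obverse
@column 1
1. dub-sar hu-ru
2. a-ga-asz-gi4-gi4-me!(|ME+ASZ|)-e-ne
@reverse
@column 1
1. igi-bi 3(disz) 3(asz) 6(disz)
"""
parsed = parse_atf(akk1)
print(parsed["pnum"], parsed["textID"])
print([s["surfaceType"] for s in parsed["surface"]])
```

The result can be passed to `json.dump` exactly as in the cell above.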

Egyptian Example 1:

eg2_csv = """,text,line,word,ref,frag,norm,unicode_word,unicode,lemma_id,cf,pos,sense
92,3Z5EM77HJFCOPKZDDZFEMI6KVY,5,7,3Z5EM77HJFCOPKZDDZFEMI6KVY.5.7,gꜣu̯.w,gꜣu̯.w,<g>V96</g>𓅱,"['<', 'g', '>', 'V', '9', '6', '<', '/', 'g', '>', '𓅱']",166210,gꜣu̯,VERB,eng sein; entbehren; (jmdn.) Not leiden lassen
151,4WVXFJZFLNAYHP3Y5O5SLWD7DA,2,2,4WVXFJZFLNAYHP3Y5O5SLWD7DA.2.2,nꜥw,nꜥw,𓈖𓂝𓅱<g>I14C</g>𓏤,"['𓈖', '𓂝', '𓅱', '<', 'g', '>', 'I', '1', '4', 'C', '<', '/', 'g', '>', '𓏤']",80510,Nꜥw,PROPN,Sich windender (Personifikation der Schlange)
153,4WVXFJZFLNAYHP3Y5O5SLWD7DA,2,5,4WVXFJZFLNAYHP3Y5O5SLWD7DA.2.5,nꜥw,nꜥw,𓈖𓂝𓅱<g>I14C</g>𓏤,"['𓈖', '𓂝', '𓅱', '<', 'g', '>', 'I', '1', '4', 'C', '<', '/', 'g', '>', '𓏤']",80510,Nꜥw,PROPN,Sich windender (Personifikation der Schlange)
200,67HZI45S3REA3LWVZOKJ6QJOIE,14,9,67HZI45S3REA3LWVZOKJ6QJOIE.14.9,nbi̯.n,nbi̯.n,𓈖𓎟𓃀<g>D107</g>𓈖,"['𓈖', '𓎟', '𓃀', '<', 'g', '>', 'D', '1', '0', '7', '<', '/', 'g', '>', '𓈖']",82520,nbi̯,VERB,schmelzen; gießen
204,67HZI45S3REA3LWVZOKJ6QJOIE,14,13,67HZI45S3REA3LWVZOKJ6QJOIE.14.13,nḏr.n,nḏr.n,𓈖𓇦𓂋<g>U19A</g>𓆱𓈖,"['𓈖', '𓇦', '𓂋', '<', 'g', '>', 'U', '1', '9', 'A', '<', '/', 'g', '>', '𓆱', '𓈖']",91630,nḏr,VERB,(Holz) bearbeiten; zimmern
206,67HZI45S3REA3LWVZOKJ6QJOIE,14,15,67HZI45S3REA3LWVZOKJ6QJOIE.14.15,b(w)n.wDU,bwn.wDU,𓃀𓈖𓏌𓅱<g>T86</g><g>T86</g>,"['𓃀', '𓈖', '𓏌', '𓅱', '<', 'g', '>', 'T', '8', '6', '<', '/', 'g', '>', '<', 'g', '>', 'T', '8', '6', '<', '/', 'g', '>']",55330,bwn,NOUN,Speerspitzen (des Fischspeeres)
"""
import pandas as pd
from io import StringIO

# Convert the string into a StringIO object
# This is only necessary because we presented the csv as a string instead of loading it as a file
csv_data = StringIO(eg2_csv)

# Read the data into a pandas DataFrame
df = pd.read_csv(csv_data)

# Display the DataFrame
df
Unnamed: 0 text line word ref frag norm unicode_word unicode lemma_id cf pos sense
0 92 3Z5EM77HJFCOPKZDDZFEMI6KVY 5 7 3Z5EM77HJFCOPKZDDZFEMI6KVY.5.7 gꜣu̯.w gꜣu̯.w <g>V96</g>𓅱 ['<', 'g', '>', 'V', '9', '6', '<', '/', 'g', ... 166210 gꜣu̯ VERB eng sein; entbehren; (jmdn.) Not leiden lassen
1 151 4WVXFJZFLNAYHP3Y5O5SLWD7DA 2 2 4WVXFJZFLNAYHP3Y5O5SLWD7DA.2.2 nꜥw nꜥw 𓈖𓂝𓅱<g>I14C</g>𓏤 ['𓈖', '𓂝', '𓅱', '<', 'g', '>', 'I', '1', '4', ... 80510 Nꜥw PROPN Sich windender (Personifikation der Schlange)
2 153 4WVXFJZFLNAYHP3Y5O5SLWD7DA 2 5 4WVXFJZFLNAYHP3Y5O5SLWD7DA.2.5 nꜥw nꜥw 𓈖𓂝𓅱<g>I14C</g>𓏤 ['𓈖', '𓂝', '𓅱', '<', 'g', '>', 'I', '1', '4', ... 80510 Nꜥw PROPN Sich windender (Personifikation der Schlange)
3 200 67HZI45S3REA3LWVZOKJ6QJOIE 14 9 67HZI45S3REA3LWVZOKJ6QJOIE.14.9 nbi̯.n nbi̯.n 𓈖𓎟𓃀<g>D107</g>𓈖 ['𓈖', '𓎟', '𓃀', '<', 'g', '>', 'D', '1', '0', ... 82520 nbi̯ VERB schmelzen; gießen
4 204 67HZI45S3REA3LWVZOKJ6QJOIE 14 13 67HZI45S3REA3LWVZOKJ6QJOIE.14.13 nḏr.n nḏr.n 𓈖𓇦𓂋<g>U19A</g>𓆱𓈖 ['𓈖', '𓇦', '𓂋', '<', 'g', '>', 'U', '1', '9', ... 91630 nḏr VERB (Holz) bearbeiten; zimmern
5 206 67HZI45S3REA3LWVZOKJ6QJOIE 14 15 67HZI45S3REA3LWVZOKJ6QJOIE.14.15 b(w)n.wDU bwn.wDU 𓃀𓈖𓏌𓅱<g>T86</g><g>T86</g> ['𓃀', '𓈖', '𓏌', '𓅱', '<', 'g', '>', 'T', '8', ... 55330 bwn NOUN Speerspitzen (des Fischspeeres)
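
A note on the `unicode` column before moving on: after loading from CSV, each cell in it is a string that merely looks like a list. The sketch below uses one cell copied from the data above and the standard library's `ast.literal_eval` to turn it back into a real list; the variable names are chosen for this illustration.

```python
import ast

# one cell of the 'unicode' column, copied from the CSV above
cell = "['𓈖', '𓂝', '𓅱', '<', 'g', '>', 'I', '1', '4', 'C', '<', '/', 'g', '>', '𓏤']"

print(type(cell).__name__)   # the cell is a str, not a list

chars = ast.literal_eval(cell)  # safely evaluates the list literal
print(type(chars).__name__, len(chars))
print(chars[3:14])
```

The resulting list has split the `<g>I14C</g>` tag into single characters, so the sign code `I14C` is no longer one unit. That is the problem the functions in the next cell address by working from `unicode_word` instead.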
def split_tags(text):
    parts = []  # List to collect output of the function
    while '<g>' in text and '</g>' in text:
        pre, rest = text.split('<g>', 1)  # splits at the first <g> found
        tag_content, post = rest.split('</g>', 1)  # splits the rest at the first </g> found

        # add each character before the tag as its own list item
        # (extend() on a string iterates over its characters)
        parts.extend(pre)

        # add the content of the <g></g> tag as a single item
        parts.append(tag_content)

        # continue with the text after the tag
        text = post

    # after the last tag, add the remaining characters one by one
    parts.extend(text)
    return parts

def process_text(text):
    if pd.isna(text): # deals with NaN
        return []
    else:
        return split_tags(text)
# apply functions to every row of the column 'unicode_word'
df['unicode_splitted'] = df['unicode_word'].apply(process_text)
# delete obsolete column
df.drop('unicode', axis=1, inplace=True)

# show modified dataframe
df
# export as *.csv file
#df.to_csv("EG-TLA-example.csv")
Unnamed: 0 text line word ref frag norm unicode_word lemma_id cf pos sense unicode_splitted
0 92 3Z5EM77HJFCOPKZDDZFEMI6KVY 5 7 3Z5EM77HJFCOPKZDDZFEMI6KVY.5.7 gꜣu̯.w gꜣu̯.w <g>V96</g>𓅱 166210 gꜣu̯ VERB eng sein; entbehren; (jmdn.) Not leiden lassen [V96, 𓅱]
1 151 4WVXFJZFLNAYHP3Y5O5SLWD7DA 2 2 4WVXFJZFLNAYHP3Y5O5SLWD7DA.2.2 nꜥw nꜥw 𓈖𓂝𓅱<g>I14C</g>𓏤 80510 Nꜥw PROPN Sich windender (Personifikation der Schlange) [𓈖, 𓂝, 𓅱, I14C, 𓏤]
2 153 4WVXFJZFLNAYHP3Y5O5SLWD7DA 2 5 4WVXFJZFLNAYHP3Y5O5SLWD7DA.2.5 nꜥw nꜥw 𓈖𓂝𓅱<g>I14C</g>𓏤 80510 Nꜥw PROPN Sich windender (Personifikation der Schlange) [𓈖, 𓂝, 𓅱, I14C, 𓏤]
3 200 67HZI45S3REA3LWVZOKJ6QJOIE 14 9 67HZI45S3REA3LWVZOKJ6QJOIE.14.9 nbi̯.n nbi̯.n 𓈖𓎟𓃀<g>D107</g>𓈖 82520 nbi̯ VERB schmelzen; gießen [𓈖, 𓎟, 𓃀, D107, 𓈖]
4 204 67HZI45S3REA3LWVZOKJ6QJOIE 14 13 67HZI45S3REA3LWVZOKJ6QJOIE.14.13 nḏr.n nḏr.n 𓈖𓇦𓂋<g>U19A</g>𓆱𓈖 91630 nḏr VERB (Holz) bearbeiten; zimmern [𓈖, 𓇦, 𓂋, U19A, 𓆱, 𓈖]
5 206 67HZI45S3REA3LWVZOKJ6QJOIE 14 15 67HZI45S3REA3LWVZOKJ6QJOIE.14.15 b(w)n.wDU bwn.wDU 𓃀𓈖𓏌𓅱<g>T86</g><g>T86</g> 55330 bwn NOUN Speerspitzen (des Fischspeeres) [𓃀, 𓈖, 𓏌, 𓅱, T86, T86]
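
The same splitting can also be sketched with a single regular expression. The variant below is an alternative for comparison, not a replacement for `split_tags`. It adds one assumption that the function above does not make: that a Unicode variation selector (U+FE00 to U+FE0F) following a sign should stay attached to that sign instead of becoming a list item of its own. The name `split_signs` is invented for this example.

```python
import re

# either a complete <g>...</g> tag (group 1) or any single character (group 2)
TOKEN_RE = re.compile(r"<g>(.*?)</g>|(.)", re.DOTALL)

def is_variation_selector(ch):
    return "\uFE00" <= ch <= "\uFE0F"

def split_signs(text):
    parts = []
    for match in TOKEN_RE.finditer(text):
        tag_content, ch = match.group(1), match.group(2)
        if tag_content is not None:
            parts.append(tag_content)   # sign code from inside a tag
        elif is_variation_selector(ch) and parts:
            parts[-1] += ch             # keep the selector with its sign
        else:
            parts.append(ch)            # ordinary single sign
    return parts

print(split_signs("𓈖𓂝𓅱<g>I14C</g>𓏤"))
print(split_signs("<g>T86</g><g>T86</g>"))
print(len(split_signs("𓅱\uFE00𓏤")))  # hypothetical sign + selector, then one more sign
```

To use it on the DataFrame, it could be applied to `unicode_word` in the same way as `process_text`, with the same check for missing values.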

Egyptian Example 2:

A sentence from the sarcophagus of the Napatan king Aspelta (c. 600-580 BCE), found in his pyramid in Nuri, Sudan (Nu. 8), https://collections.mfa.org/objects/145117

Get the context of the sentence from the Thesaurus Linguae Aegyptiae: https://thesaurus-linguae-aegyptiae.de/text/27KHHMEP4VHSDH737F2OFLKNSE/sentences

# This dictionary was created from the original JSON file

eg1 = {'publication_statement': {'credit_citation': 'Doris Topmann, Sentence ID 2CBOF5UQ7JGETCXG2CQKPCWDZM <https://github.com/thesaurus-linguae-aegyptiae/tla-raw-data/blob/v17/sentences/2CBOF5UQ7JGETCXG2CQKPCWDZM.json>, in: Thesaurus Linguae Aegyptiae: Raw Data <https://github.com/thesaurus-linguae-aegyptiae/tla-raw-data>, Corpus issue 17 (31 October 2022), ed. by Tonio Sebastian Richter & Daniel A. Werning on behalf of the Berlin-Brandenburgische Akademie der Wissenschaften and Hans-Werner Fischer-Elfert & Peter Dils on behalf of the Sächsische Akademie der Wissenschaften zu Leipzig (first published: 22 September 2023)', 'collection_editors': 'Tonio Sebastian Richter & Daniel A. Werning on behalf of the Berlin-Brandenburgische Akademie der Wissenschaften and Hans-Werner Fischer-Elfert & Peter Dils on behalf of the Sächsische Akademie der Wissenschaften zu Leipzig', 'data_engineers': {'input_software_BTS': ['Christoph Plutte', 'Jakob Höper'], 'database_curation': ['Simon D. Schweitzer'], 'data_transformation': ['Jakob Höper', 'R. Dominik Blöse', 'Daniel A. 
Werning']}, 'date_published_in_TLA': '2022-10-31', 'rawdata_first_published': '2023-09-22', 'corresponding_TLA_URL': 'https://thesaurus-linguae-aegyptiae.de/sentence/2CBOF5UQ7JGETCXG2CQKPCWDZM', 'license': 'Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) <https://creativecommons.org/licenses/by-sa/4.0/>'}, 'context': {'line': 'III', 'paragraph': None, 'pos': 7, 'textId': '27KHHMEP4VHSDH737F2OFLKNSE', 'textType': 'Text', 'variants': 1}, 'eclass': 'BTSSentence', 'glyphs': {'mdc_compact': None, 'unicode': None}, 'id': '2CBOF5UQ7JGETCXG2CQKPCWDZM', 'relations': {'contains': [{'eclass': 'BTSAnnotation', 'id': 'DYJEAXFKBJAXJPVLJGWREJZJ5M', 'ranges': [{'end': 'OKLGJLCEQFHU7HDRYUTYR352YA', 'start': '22TFIMS2CBBCFFCDSCAIT3HR3Y'}], 'type': 'ägyptologische Textsegmentierung'}], 'partOf': [{'eclass': 'BTSText', 'id': '27KHHMEP4VHSDH737F2OFLKNSE', 'name': 'Isis (HT 15, HT 14, HT 17)', 'type': 'Text'}]}, 'tokens': [{'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'PTCL', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'D35:N35', 'mdc_original': 'D35-N35', 'mdc_original_safe': None, 'mdc_tla': 'D35-N35', 'order': [1, 2], 'unicode': '𓂜𓈖'}, 'id': '22TFIMS2CBBCFFCDSCAIT3HR3Y', 'label': 'nn', 'lemma': {'POS': {'type': 'particle'}, 'id': '851961'}, 'transcription': {'mdc': 'nn', 'unicode': 'nn'}, 'translations': {'de': ['[Negationspartikel]']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': 'SC.act.ngem.nom.subj_Neg.nn', 'lingGloss': 'V\\tam.act', 'numeric': 210020}, 'glyphs': {'mdc_artificially_aligned': False, 'mdc_compact': 'W11-V28-A7', 'mdc_original': 'W11-V28-A7', 'mdc_original_safe': None, 'mdc_tla': 'W11-V28-A7', 'order': [2, 3, 4], 'unicode': '𓎼𓎛𓀉'}, 'id': 'IOLUGQXLCRGNLMTAPJ65LI7MHU', 'label': 'gḥ', 'lemma': {'POS': {'subtype': 'verb_3-lit', 'type': 'verb'}, 'id': '166480'}, 'transcription': {'mdc': 'gH', 'unicode': 'gḥ'}, 
'translations': {'de': ['matt sein']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': 'Noun.pl.stpr.3sgm', 'lingGloss': 'N.f:pl:stpr', 'numeric': 70154}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'D36:X1*F51B-Z2', 'mdc_original': 'D36-X1-F51B-Z2', 'mdc_original_safe': None, 'mdc_tla': 'D36-X1-F51B-Z2', 'order': [5, 6, 7, 8], 'unicode': '𓂝𓏏𓄹︀\U00013440𓏥'}, 'id': 'GUVBJUGCSVF5VN55PN6RYS4YLI', 'label': 'ꜥ,t.pl', 'lemma': {'POS': {'subtype': 'substantive_fem', 'type': 'substantive'}, 'id': '34550'}, 'transcription': {'mdc': 'a.t.PL', 'unicode': 'ꜥ.t.PL'}, 'translations': {'de': ['Glied; Körperteil']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': '-3sg.m', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': False, 'mdc_compact': 'I9', 'mdc_original': 'I9', 'mdc_original_safe': None, 'mdc_tla': 'I9', 'order': [9], 'unicode': '𓆑'}, 'id': 'GIHCJ27JXVAM7GDUYWGEPKBRB4', 'label': '=f', 'lemma': {'POS': {'subtype': 'personal_pronoun', 'type': 'pronoun'}, 'id': '10050'}, 'transcription': {'mdc': '=f', 'unicode': '=f'}, 'translations': {'de': ['[Suffix Pron. sg.3.m.]']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'dem.f.pl', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'M17-Q3:N35', 'mdc_original': 'M17-Q3-N35', 'mdc_original_safe': None, 'mdc_tla': 'M17-Q3-N35', 'order': [10, 11, 12], 'unicode': '𓇋𓊪𓈖'}, 'id': 'Z6HTGGPBPRDT3OZTZNXRF2GRDA', 'label': 'jp〈t〉n', 'lemma': {'POS': {'subtype': 'demonstrative_pronoun', 'type': 'pronoun'}, 'id': '850009'}, 'transcription': {'mdc': 'jp〈t〉n', 'unicode': 'jp〈t〉n'}, 'translations': {'de': ['diese [Dem.Pron. 
pl.f.]']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'TITL', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': False, 'mdc_compact': 'D4-Q1-A40', 'mdc_original': 'D4-Q1-A40', 'mdc_original_safe': None, 'mdc_tla': 'D4-Q1-A40', 'order': [13, 14, 15], 'unicode': '𓁹𓊨𓀭'}, 'id': 'UCFJWBLRKJG4NJWTWT22WDR2MU', 'label': 'Wsr,w', 'lemma': {'POS': {'subtype': 'title', 'type': 'epitheton_title'}, 'id': '49461'}, 'transcription': {'mdc': 'wsr.w', 'unicode': 'Wsr.w'}, 'translations': {'de': ['Osiris (Totentitel des Verstorbenen)']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'N', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'M23-X1:N35', 'mdc_original': 'M23-X1-N35', 'mdc_original_safe': None, 'mdc_tla': 'M23-X1-N35', 'order': [16, 17, 18], 'unicode': '𓇓𓏏𓈖'}, 'id': 'LI5FJI4ZUJEMPIKS5RQ5HHNBUE', 'label': 'nzw', 'lemma': {'POS': {'type': 'substantive'}, 'id': '88040'}, 'transcription': {'mdc': 'nzw', 'unicode': 'nzw'}, 'translations': {'de': ['König']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'ROYLN', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'V30:N17-N17', 'mdc_original': 'V30-N17-N17', 'mdc_original_safe': None, 'mdc_tla': 'V30-N17-N17', 'order': [19, 20, 21], 'unicode': '𓎟𓇿𓇿'}, 'id': 'ICADWHGbHkfdokpooG4eCy3Zfe8', 'label': 'nb-Tꜣ,du', 'lemma': {'POS': {'subtype': 'epith_king', 'type': 'epitheton_title'}, 'id': '400038'}, 'transcription': {'mdc': 'nb-tA.DU', 'unicode': 'nb-Tꜣ.DU'}, 'translations': {'de': ['Herr der Beiden Länder (Könige)']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'TITL', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'V30:D4-Aa1*X1:Y1', 
'mdc_original': 'V30-D4-Aa1-X1-Y1', 'mdc_original_safe': None, 'mdc_tla': 'V30-D4-Aa1-X1-Y1', 'order': [22, 23, 24, 25, 26], 'unicode': '𓎟𓁹𓐍𓏏𓏛'}, 'id': 'ICADWHT2O1dc30SXuRZUlquIDpM', 'label': 'nb-jr(,t)-(j)ḫ,t', 'lemma': {'POS': {'subtype': 'title', 'type': 'epitheton_title'}, 'id': '400354'}, 'transcription': {'mdc': 'nb-jr(.t)-(j)x.t', 'unicode': 'nb-jr(.t)-(j)ḫ.t'}, 'translations': {'de': ['Herr des Rituals']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'ROYLN', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': '<-M17-O34:Q3-E23-N17->', 'mdc_original': '<-M17-O34-Q3-E23-N17->', 'mdc_original_safe': None, 'mdc_tla': '<-M17-O34-Q3-E23-N17->', 'order': [18, 19, 20, 21, 22, 23], 'unicode': '𓍹\U0001343c𓇋𓊃𓊪𓃭𓇿\U0001343d𓍺'}, 'id': 'J3MLYALWVNAMDDG33VZ3RIEEUA', 'label': 'Jsplt', 'lemma': {'POS': {'subtype': 'kings_name', 'type': 'entity_name'}, 'id': '850103'}, 'transcription': {'mdc': 'jsplt', 'unicode': 'Jsplt'}, 'translations': {'de': ['Aspelta']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'N.m:sg', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'U5:D36-P8h', 'mdc_original': 'U5-D36-P8h', 'mdc_original_safe': None, 'mdc_tla': 'U5-D36-P8h', 'order': [25, 26, 27], 'unicode': '𓌷𓂝𓊤︂'}, 'id': 'OKLGJLCEQFHU7HDRYUTYR352YA', 'label': 'mꜣꜥ-ḫrw', 'lemma': {'POS': {'subtype': 'substantive_masc', 'type': 'substantive'}, 'id': '66750'}, 'transcription': {'mdc': 'mAa-xrw', 'unicode': 'mꜣꜥ-ḫrw'}, 'translations': {'de': ['Gerechtfertigter (der selige Tote)']}, 'type': 'word'}], 'transcription': {'mdc': 'nn gH a.t.PL=f jp〈t〉n wsr.w nzw nb-tA.DU nb-jr(.t)-(j)x.t jsplt mAa-xrw', 'unicode': 'nn gḥ ꜥ.t.PL=f jp〈t〉n Wsr.w nzw nb-Tꜣ.DU nb-jr(.t)-(j)ḫ.t Jsplt mꜣꜥ-ḫrw'}, 'translations': {'de': ['Diese seine Glieder werden nicht matt sein, (die des) Osiris Königs, des Herrn der Beiden 
Länder, des Herrn des Rituals, Aspelta, des Gerechtfertigten.']}, 'type': None, 'wordCount': 11, 'editors': {'author': 'Doris Topmann', 'contributors': None, 'created': '2020-12-23 12:24:26', 'type': None, 'updated': '2022-08-29 10:22:01'}}
print(eg1)
{'publication_statement': {'credit_citation': 'Doris Topmann, Sentence ID 2CBOF5UQ7JGETCXG2CQKPCWDZM <https://github.com/thesaurus-linguae-aegyptiae/tla-raw-data/blob/v17/sentences/2CBOF5UQ7JGETCXG2CQKPCWDZM.json>, in: Thesaurus Linguae Aegyptiae: Raw Data <https://github.com/thesaurus-linguae-aegyptiae/tla-raw-data>, Corpus issue 17 (31 October 2022), ed. by Tonio Sebastian Richter & Daniel A. Werning on behalf of the Berlin-Brandenburgische Akademie der Wissenschaften and Hans-Werner Fischer-Elfert & Peter Dils on behalf of the Sächsische Akademie der Wissenschaften zu Leipzig (first published: 22 September 2023)', 'collection_editors': 'Tonio Sebastian Richter & Daniel A. Werning on behalf of the Berlin-Brandenburgische Akademie der Wissenschaften and Hans-Werner Fischer-Elfert & Peter Dils on behalf of the Sächsische Akademie der Wissenschaften zu Leipzig', 'data_engineers': {'input_software_BTS': ['Christoph Plutte', 'Jakob Höper'], 'database_curation': ['Simon D. Schweitzer'], 'data_transformation': ['Jakob Höper', 'R. Dominik Blöse', 'Daniel A. 
Werning']}, 'date_published_in_TLA': '2022-10-31', 'rawdata_first_published': '2023-09-22', 'corresponding_TLA_URL': 'https://thesaurus-linguae-aegyptiae.de/sentence/2CBOF5UQ7JGETCXG2CQKPCWDZM', 'license': 'Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) <https://creativecommons.org/licenses/by-sa/4.0/>'}, 'context': {'line': 'III', 'paragraph': None, 'pos': 7, 'textId': '27KHHMEP4VHSDH737F2OFLKNSE', 'textType': 'Text', 'variants': 1}, 'eclass': 'BTSSentence', 'glyphs': {'mdc_compact': None, 'unicode': None}, 'id': '2CBOF5UQ7JGETCXG2CQKPCWDZM', 'relations': {'contains': [{'eclass': 'BTSAnnotation', 'id': 'DYJEAXFKBJAXJPVLJGWREJZJ5M', 'ranges': [{'end': 'OKLGJLCEQFHU7HDRYUTYR352YA', 'start': '22TFIMS2CBBCFFCDSCAIT3HR3Y'}], 'type': 'ägyptologische Textsegmentierung'}], 'partOf': [{'eclass': 'BTSText', 'id': '27KHHMEP4VHSDH737F2OFLKNSE', 'name': 'Isis (HT 15, HT 14, HT 17)', 'type': 'Text'}]}, 'tokens': [{'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'PTCL', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'D35:N35', 'mdc_original': 'D35-N35', 'mdc_original_safe': None, 'mdc_tla': 'D35-N35', 'order': [1, 2], 'unicode': '𓂜𓈖'}, 'id': '22TFIMS2CBBCFFCDSCAIT3HR3Y', 'label': 'nn', 'lemma': {'POS': {'type': 'particle'}, 'id': '851961'}, 'transcription': {'mdc': 'nn', 'unicode': 'nn'}, 'translations': {'de': ['[Negationspartikel]']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': 'SC.act.ngem.nom.subj_Neg.nn', 'lingGloss': 'V\\tam.act', 'numeric': 210020}, 'glyphs': {'mdc_artificially_aligned': False, 'mdc_compact': 'W11-V28-A7', 'mdc_original': 'W11-V28-A7', 'mdc_original_safe': None, 'mdc_tla': 'W11-V28-A7', 'order': [2, 3, 4], 'unicode': '𓎼𓎛𓀉'}, 'id': 'IOLUGQXLCRGNLMTAPJ65LI7MHU', 'label': 'gḥ', 'lemma': {'POS': {'subtype': 'verb_3-lit', 'type': 'verb'}, 'id': '166480'}, 'transcription': {'mdc': 'gH', 'unicode': 'gḥ'}, 
'translations': {'de': ['matt sein']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': 'Noun.pl.stpr.3sgm', 'lingGloss': 'N.f:pl:stpr', 'numeric': 70154}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'D36:X1*F51B-Z2', 'mdc_original': 'D36-X1-F51B-Z2', 'mdc_original_safe': None, 'mdc_tla': 'D36-X1-F51B-Z2', 'order': [5, 6, 7, 8], 'unicode': '𓂝𓏏𓄹︀\U00013440𓏥'}, 'id': 'GUVBJUGCSVF5VN55PN6RYS4YLI', 'label': 'ꜥ,t.pl', 'lemma': {'POS': {'subtype': 'substantive_fem', 'type': 'substantive'}, 'id': '34550'}, 'transcription': {'mdc': 'a.t.PL', 'unicode': 'ꜥ.t.PL'}, 'translations': {'de': ['Glied; Körperteil']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': '-3sg.m', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': False, 'mdc_compact': 'I9', 'mdc_original': 'I9', 'mdc_original_safe': None, 'mdc_tla': 'I9', 'order': [9], 'unicode': '𓆑'}, 'id': 'GIHCJ27JXVAM7GDUYWGEPKBRB4', 'label': '=f', 'lemma': {'POS': {'subtype': 'personal_pronoun', 'type': 'pronoun'}, 'id': '10050'}, 'transcription': {'mdc': '=f', 'unicode': '=f'}, 'translations': {'de': ['[Suffix Pron. sg.3.m.]']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'dem.f.pl', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'M17-Q3:N35', 'mdc_original': 'M17-Q3-N35', 'mdc_original_safe': None, 'mdc_tla': 'M17-Q3-N35', 'order': [10, 11, 12], 'unicode': '𓇋𓊪𓈖'}, 'id': 'Z6HTGGPBPRDT3OZTZNXRF2GRDA', 'label': 'jp〈t〉n', 'lemma': {'POS': {'subtype': 'demonstrative_pronoun', 'type': 'pronoun'}, 'id': '850009'}, 'transcription': {'mdc': 'jp〈t〉n', 'unicode': 'jp〈t〉n'}, 'translations': {'de': ['diese [Dem.Pron. 
pl.f.]']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'TITL', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': False, 'mdc_compact': 'D4-Q1-A40', 'mdc_original': 'D4-Q1-A40', 'mdc_original_safe': None, 'mdc_tla': 'D4-Q1-A40', 'order': [13, 14, 15], 'unicode': '𓁹𓊨𓀭'}, 'id': 'UCFJWBLRKJG4NJWTWT22WDR2MU', 'label': 'Wsr,w', 'lemma': {'POS': {'subtype': 'title', 'type': 'epitheton_title'}, 'id': '49461'}, 'transcription': {'mdc': 'wsr.w', 'unicode': 'Wsr.w'}, 'translations': {'de': ['Osiris (Totentitel des Verstorbenen)']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'N', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'M23-X1:N35', 'mdc_original': 'M23-X1-N35', 'mdc_original_safe': None, 'mdc_tla': 'M23-X1-N35', 'order': [16, 17, 18], 'unicode': '𓇓𓏏𓈖'}, 'id': 'LI5FJI4ZUJEMPIKS5RQ5HHNBUE', 'label': 'nzw', 'lemma': {'POS': {'type': 'substantive'}, 'id': '88040'}, 'transcription': {'mdc': 'nzw', 'unicode': 'nzw'}, 'translations': {'de': ['König']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'ROYLN', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'V30:N17-N17', 'mdc_original': 'V30-N17-N17', 'mdc_original_safe': None, 'mdc_tla': 'V30-N17-N17', 'order': [19, 20, 21], 'unicode': '𓎟𓇿𓇿'}, 'id': 'ICADWHGbHkfdokpooG4eCy3Zfe8', 'label': 'nb-Tꜣ,du', 'lemma': {'POS': {'subtype': 'epith_king', 'type': 'epitheton_title'}, 'id': '400038'}, 'transcription': {'mdc': 'nb-tA.DU', 'unicode': 'nb-Tꜣ.DU'}, 'translations': {'de': ['Herr der Beiden Länder (Könige)']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'TITL', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'V30:D4-Aa1*X1:Y1', 
'mdc_original': 'V30-D4-Aa1-X1-Y1', 'mdc_original_safe': None, 'mdc_tla': 'V30-D4-Aa1-X1-Y1', 'order': [22, 23, 24, 25, 26], 'unicode': '𓎟𓁹𓐍𓏏𓏛'}, 'id': 'ICADWHT2O1dc30SXuRZUlquIDpM', 'label': 'nb-jr(,t)-(j)ḫ,t', 'lemma': {'POS': {'subtype': 'title', 'type': 'epitheton_title'}, 'id': '400354'}, 'transcription': {'mdc': 'nb-jr(.t)-(j)x.t', 'unicode': 'nb-jr(.t)-(j)ḫ.t'}, 'translations': {'de': ['Herr des Rituals']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'ROYLN', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': '<-M17-O34:Q3-E23-N17->', 'mdc_original': '<-M17-O34-Q3-E23-N17->', 'mdc_original_safe': None, 'mdc_tla': '<-M17-O34-Q3-E23-N17->', 'order': [18, 19, 20, 21, 22, 23], 'unicode': '𓍹\U0001343c𓇋𓊃𓊪𓃭𓇿\U0001343d𓍺'}, 'id': 'J3MLYALWVNAMDDG33VZ3RIEEUA', 'label': 'Jsplt', 'lemma': {'POS': {'subtype': 'kings_name', 'type': 'entity_name'}, 'id': '850103'}, 'transcription': {'mdc': 'jsplt', 'unicode': 'Jsplt'}, 'translations': {'de': ['Aspelta']}, 'type': 'word'}, {'annoTypes': ['ägyptologische Textsegmentierung'], 'flexion': {'btsGloss': '(unspecified)', 'lingGloss': 'N.m:sg', 'numeric': 3}, 'glyphs': {'mdc_artificially_aligned': True, 'mdc_compact': 'U5:D36-P8h', 'mdc_original': 'U5-D36-P8h', 'mdc_original_safe': None, 'mdc_tla': 'U5-D36-P8h', 'order': [25, 26, 27], 'unicode': '𓌷𓂝𓊤︂'}, 'id': 'OKLGJLCEQFHU7HDRYUTYR352YA', 'label': 'mꜣꜥ-ḫrw', 'lemma': {'POS': {'subtype': 'substantive_masc', 'type': 'substantive'}, 'id': '66750'}, 'transcription': {'mdc': 'mAa-xrw', 'unicode': 'mꜣꜥ-ḫrw'}, 'translations': {'de': ['Gerechtfertigter (der selige Tote)']}, 'type': 'word'}], 'transcription': {'mdc': 'nn gH a.t.PL=f jp〈t〉n wsr.w nzw nb-tA.DU nb-jr(.t)-(j)x.t jsplt mAa-xrw', 'unicode': 'nn gḥ ꜥ.t.PL=f jp〈t〉n Wsr.w nzw nb-Tꜣ.DU nb-jr(.t)-(j)ḫ.t Jsplt mꜣꜥ-ḫrw'}, 'translations': {'de': ['Diese seine Glieder werden nicht matt sein, (die des) Osiris Königs, des Herrn der Beiden 
Länder, des Herrn des Rituals, Aspelta, des Gerechtfertigten.']}, 'type': None, 'wordCount': 11, 'editors': {'author': 'Doris Topmann', 'contributors': None, 'created': '2020-12-23 12:24:26', 'type': None, 'updated': '2022-08-29 10:22:01'}}
# parse the dictionary (json)

unicodeHiero = []
transcription = []
translLemma = []
posLemma = []
tokenID = []

for text_word in eg1["tokens"] :
    print(text_word["glyphs"]["unicode"], text_word["transcription"]["unicode"], text_word["translations"]["de"][0], text_word["lemma"]["POS"]["type"], text_word["id"] )
    tokenID.append(text_word["id"])
    unicodeHiero.append(text_word["glyphs"]["unicode"])
    translLemma.append(text_word["translations"]["de"][0])
    posLemma.append(text_word["lemma"]["POS"]["type"])

    # A leading "=" makes spreadsheet software such as MS Excel read the cell as a formula,
    # so replace it with the double oblique hyphen ⸗ (U+2E17)
    if text_word["transcription"]["unicode"].startswith("="):
        transcription.append(text_word["transcription"]["unicode"].replace("=", "⸗"))
    else:
        transcription.append(text_word["transcription"]["unicode"])
𓂜𓈖 nn [Negationspartikel] particle 22TFIMS2CBBCFFCDSCAIT3HR3Y
𓎼𓎛𓀉 gḥ matt sein verb IOLUGQXLCRGNLMTAPJ65LI7MHU
𓂝𓏏𓄹︀𓑀𓏥 ꜥ.t.PL Glied; Körperteil substantive GUVBJUGCSVF5VN55PN6RYS4YLI
𓆑 =f [Suffix Pron. sg.3.m.] pronoun GIHCJ27JXVAM7GDUYWGEPKBRB4
𓇋𓊪𓈖 jp〈t〉n diese [Dem.Pron. pl.f.] pronoun Z6HTGGPBPRDT3OZTZNXRF2GRDA
𓁹𓊨𓀭 Wsr.w Osiris (Totentitel des Verstorbenen) epitheton_title UCFJWBLRKJG4NJWTWT22WDR2MU
𓇓𓏏𓈖 nzw König substantive LI5FJI4ZUJEMPIKS5RQ5HHNBUE
𓎟𓇿𓇿 nb-Tꜣ.DU Herr der Beiden Länder (Könige) epitheton_title ICADWHGbHkfdokpooG4eCy3Zfe8
𓎟𓁹𓐍𓏏𓏛 nb-jr(.t)-(j)ḫ.t Herr des Rituals epitheton_title ICADWHT2O1dc30SXuRZUlquIDpM
𓍹𓐼𓇋𓊃𓊪𓃭𓇿𓐽𓍺 Jsplt Aspelta entity_name J3MLYALWVNAMDDG33VZ3RIEEUA
𓌷𓂝𓊤︂ mꜣꜥ-ḫrw Gerechtfertigter (der selige Tote) substantive OKLGJLCEQFHU7HDRYUTYR352YA
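
Not every token carries the same keys. In the sentence above, the lemma of nzw has a POS "type" but no "subtype", so indexing with ["subtype"] would raise a KeyError for that token. A minimal sketch of a safer lookup, assuming we only want the subtype where it exists; the helper name pos_subtype and the two abridged tokens are our own illustration, not part of the TLA data model:

```python
# Two tokens abridged from the sentence above; only the keys used here are kept.
tokens = [
    {"label": "nzw", "lemma": {"POS": {"type": "substantive"}, "id": "88040"}},
    {"label": "Jsplt", "lemma": {"POS": {"subtype": "kings_name", "type": "entity_name"}, "id": "850103"}},
]

def pos_subtype(token, default=None):
    """Return the lemma's POS subtype, or `default` if any level is missing."""
    # dict.get returns the fallback instead of raising KeyError
    return token.get("lemma", {}).get("POS", {}).get("subtype", default)

subtypes = [pos_subtype(token, default="(none)") for token in tokens]
print(subtypes)  # ['(none)', 'kings_name']
```

The same pattern works for any optional key in the JSON, at the price of hiding the difference between "key missing" and "key present but empty".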
# get the ID of this sentence

sentenceID = eg1["id"]
# create a dataframe and fill it

import pandas as pd

df_eg = pd.DataFrame({
    'unicode_hieroglyphs': unicodeHiero,
    'unicode_transcription': transcription,
    'lemma_translation': translLemma,
    'part-of-speech': posLemma,
    'tokenID' : tokenID
})

df_eg
unicode_hieroglyphs unicode_transcription lemma_translation part-of-speech tokenID
0 𓂜𓈖 nn [Negationspartikel] particle 22TFIMS2CBBCFFCDSCAIT3HR3Y
1 𓎼𓎛𓀉 gḥ matt sein verb IOLUGQXLCRGNLMTAPJ65LI7MHU
2 𓂝𓏏𓄹︀𓑀𓏥 ꜥ.t.PL Glied; Körperteil substantive GUVBJUGCSVF5VN55PN6RYS4YLI
3 𓆑 ⸗f [Suffix Pron. sg.3.m.] pronoun GIHCJ27JXVAM7GDUYWGEPKBRB4
4 𓇋𓊪𓈖 jp〈t〉n diese [Dem.Pron. pl.f.] pronoun Z6HTGGPBPRDT3OZTZNXRF2GRDA
5 𓁹𓊨𓀭 Wsr.w Osiris (Totentitel des Verstorbenen) epitheton_title UCFJWBLRKJG4NJWTWT22WDR2MU
6 𓇓𓏏𓈖 nzw König substantive LI5FJI4ZUJEMPIKS5RQ5HHNBUE
7 𓎟𓇿𓇿 nb-Tꜣ.DU Herr der Beiden Länder (Könige) epitheton_title ICADWHGbHkfdokpooG4eCy3Zfe8
8 𓎟𓁹𓐍𓏏𓏛 nb-jr(.t)-(j)ḫ.t Herr des Rituals epitheton_title ICADWHT2O1dc30SXuRZUlquIDpM
9 𓍹𓐼𓇋𓊃𓊪𓃭𓇿𓐽𓍺 Jsplt Aspelta entity_name J3MLYALWVNAMDDG33VZ3RIEEUA
10 𓌷𓂝𓊤︂ mꜣꜥ-ḫrw Gerechtfertigter (der selige Tote) substantive OKLGJLCEQFHU7HDRYUTYR352YA
# save as *.csv

fileName = "aspelta_TLA_Sentence_" + sentenceID + ".csv"
df_eg.to_csv(fileName)
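
Since the table contains hieroglyphs and transliteration characters, the text encoding of the file matters as much as the "=" sign. Some spreadsheet programs only recognise UTF-8 in a CSV file when it starts with a byte order mark (BOM). The sketch below uses only Python's standard csv module and the "utf-8-sig" codec to illustrate the round trip; the function names, the file name and the two abridged rows are assumptions made for this example. pandas' to_csv accepts an encoding argument as well, so the same codec can be tried there.

```python
import csv
import os
import tempfile

# Two rows abridged from the table above (hypothetical subset of columns).
rows = [
    {"unicode_transcription": "nn", "lemma_translation": "[Negationspartikel]"},
    {"unicode_transcription": "⸗f", "lemma_translation": "[Suffix Pron. sg.3.m.]"},
]

def write_rows(path, rows):
    """Write dict rows as CSV; utf-8-sig prepends a byte order mark (BOM)."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

def read_rows(path):
    """Read the CSV back; utf-8-sig strips the BOM again."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))

# Write to a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_tokens.csv")
    write_rows(path, rows)
    round_trip = read_rows(path)

print(round_trip == rows)
```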

Akkadian Example 2:

Consider the following Akkadian text:

http://www.achemenet.com//fr/item/?/sources-textuelles/textes-par-publication/Strassmaier_Cyrus/1665118

6 udu-nita2 ina šuII Iden-gi a-šú šá Id[

a-na 8 gín 4-tú kù-babbar i-na kù-babbar

šá i-di é [ o o o ] a-na é-babbar-ra

it-ta-din 5 udu-nita2 šá Ika-ṣir

a-šú šá Iden-mu a-na 7 gín 4-tú

kù-babbar šá muh-hi dul-lu Imu-mu

ú-šá-hi-su a-na lìb-bi sì-na

1 udu-nita2 a-na 1 gín 4-tú kù-babbar

ina šuII Idutu-ba-šá! [

1 udu-nita2 šá IDU-[

a-na 1? gín [

pap [13 udu-nita2-meš

iti du6 u4 [o-kam] mu sag nam-lugal-la

Iku-ra-áš lugal tin-tirki u kur-kur

How would you preprocess this text?

akk2 = """ 6 udu-nita<sub>2</sub> <i>ina</i> šu<sup>II</sup> <sup>Id</sup>en-gi a-<i>šú šá</i> <sup>Id</sup>[
 <i>a-na</i> 8 gín 4-<i>tú</i> kù-babbar <i>i-na</i> kù-babbar
 <i>šá</i> <i>i-di</i> é [ o o o ] <i>a-na</i> é-babbar-ra
 <i>it-ta-din</i> 5 udu-nita<sub>2</sub> <i>šá</i> <sup>I</sup><i>ka-ṣir</i>
 a-<i>šú šá</i> <sup>Id</sup>en-mu <i>a-na</i> 7 gín 4-<i>tú</i>
 kù-babbar <i>šá</i> <i>muh-hi</i> <i>dul-lu</i> <sup>I</sup>mu-mu
 <i>ú-šá-hi-su a-na</i> <i>lìb-bi</i> sì-<i>na</i>
 1 udu-nita<sub>2</sub> <i>a-na</i> 1 gín 4-<i>tú </i>kù-babbar
 <i>ina</i> šu<sup>II</sup> <sup>Id</sup>utu-ba-<i>šá</i><sup>!</sup> [
 1 udu-nita<sub>2</sub> <i>šá</i> <sup>I</sup>DU-[
 <i>a-na</i> 1<sup>?</sup> gín [
 pap [13 udu-nita<sub>2</sub>-meš
 iti du<sub>6</sub> u<sub>4</sub> [o-kam] mu sag nam-lugal-la
 <sup>I</sup><i>ku-ra-áš</i> lugal tin-tir<sup>ki</sup> <i>u</i> kur-kur """
akk2_lines = akk2.split('\n')
print(akk2_lines)
[' 6 udu-nita<sub>2</sub> <i>ina</i> šu<sup>II</sup> <sup>Id</sup>en-gi a-<i>šú šá</i> <sup>Id</sup>[', ' <i>a-na</i> 8 gín 4-<i>tú</i> kù-babbar <i>i-na</i> kù-babbar', ' <i>šá</i> <i>i-di</i> é [ o o o ] <i>a-na</i> é-babbar-ra', ' <i>it-ta-din</i> 5 udu-nita<sub>2</sub> <i>šá</i> <sup>I</sup><i>ka-ṣir</i>', ' a-<i>šú šá</i> <sup>Id</sup>en-mu <i>a-na</i> 7 gín 4-<i>tú</i>', ' kù-babbar <i>šá</i> <i>muh-hi</i> <i>dul-lu</i> <sup>I</sup>mu-mu', ' <i>ú-šá-hi-su a-na</i> <i>lìb-bi</i> sì-<i>na</i>', ' 1 udu-nita<sub>2</sub> <i>a-na</i> 1 gín 4-<i>tú </i>kù-babbar', ' <i>ina</i> šu<sup>II</sup> <sup>Id</sup>utu-ba-<i>šá</i><sup>!</sup> [', ' 1 udu-nita<sub>2</sub> <i>šá</i> <sup>I</sup>DU-[', ' <i>a-na</i> 1<sup>?</sup> gín [', ' pap [13 udu-nita<sub>2</sub>-meš', ' iti du<sub>6</sub> u<sub>4</sub> [o-kam] mu sag nam-lugal-la', ' <sup>I</sup><i>ku-ra-áš</i> lugal tin-tir<sup>ki</sup> <i>u</i> kur-kur ']
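
The quickest way to get rid of the markup is to delete the tags, and it is worth seeing what that costs. A minimal sketch, assuming the only tags in the text are <i>, <sub> and <sup>; the names TAG and strip_tags are ours:

```python
import re

# Matches the opening and closing forms of the three tags used in akk2.
TAG = re.compile(r"</?(?:i|sub|sup)>")

def strip_tags(line):
    """Remove <i>, <sub> and <sup> tags and trim surrounding whitespace."""
    return TAG.sub("", line).strip()

line = ' <i>ina</i> šu<sup>II</sup> <sup>Id</sup>utu-ba-<i>šá</i><sup>!</sup> ['
print(strip_tags(line))  # ina šuII Idutu-ba-šá! [
```

The result is the flattened reading shown above the code: the determinatives are glued to the names (Idutu), the dual marker is glued to šu, and we can no longer tell Akkadian (italic) from Sumerian logograms. This is why the next cell keeps the tags and interprets them instead.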
# First pass: label each sign as phonetic ("phon", italics), classifier
# ("class", superscript) or logogram ("log", everything else).
# Known limitation, visible in the output below: tags that span several words
# or sit next to other tags are not always caught, so some signs keep
# leftover markup or receive the wrong label.
for line_count, line in enumerate(akk2_lines, start=1):
    print("line:", line_count)
    words = line.split()

    for word_count, word in enumerate(words, start=1):
        print(word_count, end=" ")

        if word.startswith("<i>") and word.endswith("</i>"):
            # whole word in italics: every sign is phonetic
            for sign in word.split("-"):
                sign = sign.replace("<i>", "").replace("</i>", "")
                print([sign, {"sign_function": "phon"}])

        elif "-<i>" in word:
            # mixed word: italic signs are phonetic, the rest are logograms
            for sign in word.split("-"):
                if sign.startswith("<i>"):
                    print([sign.replace("<i>", ""), {"sign_function": "phon"}])
                else:
                    print([sign, {"sign_function": "log"}])

        elif word.endswith("</i>"):
            # word closes an italic span, e.g. the last word of a multi-word span
            print([word.replace("</i>", ""), {"sign_function": "phon"}])

        elif "</sup" in word:
            # superscript content is labelled as a classifier
            for elem in word.split("-"):
                if "</sup>" in elem:
                    # isolate the superscript by surrounding it with hyphens
                    elem = elem.replace("</sup>", "</class>-")
                    elem = elem.replace("<sup>", "-<class>")
                    for sign in elem.split("-"):
                        if sign.startswith("<class>") and sign.endswith("</class>"):
                            print([sign[7:-8], {"sign_function": "class"}])
                        else:
                            print([sign, {"sign_function": "log"}])
                else:
                    print([elem, {"sign_function": "log"}])

        else:
            # no markup: treat every sign as a logogram
            for sign in word.split("-"):
                print([sign, {"sign_function": "log"}])

    print("------------------")
line: 1
1 ['6', {'sign_function': 'log'}]
2 ['udu', {'sign_function': 'log'}]
['nita<sub>2</sub>', {'sign_function': 'log'}]
3 ['ina', {'sign_function': 'phon'}]
4 ['šu', {'sign_function': 'log'}]
['II', {'sign_function': 'class'}]
['', {'sign_function': 'log'}]
5 ['', {'sign_function': 'log'}]
['Id', {'sign_function': 'class'}]
['en', {'sign_function': 'log'}]
['gi', {'sign_function': 'log'}]
6 ['a', {'sign_function': 'log'}]
['šú', {'sign_function': 'phon'}]
7 ['šá', {'sign_function': 'phon'}]
8 ['', {'sign_function': 'log'}]
['Id', {'sign_function': 'class'}]
['[', {'sign_function': 'log'}]
------------------
line: 2
1 ['a', {'sign_function': 'phon'}]
['na', {'sign_function': 'phon'}]
2 ['8', {'sign_function': 'log'}]
3 ['gín', {'sign_function': 'log'}]
4 ['4', {'sign_function': 'log'}]
['tú</i>', {'sign_function': 'phon'}]
5 ['kù', {'sign_function': 'log'}]
['babbar', {'sign_function': 'log'}]
6 ['i', {'sign_function': 'phon'}]
['na', {'sign_function': 'phon'}]
7 ['kù', {'sign_function': 'log'}]
['babbar', {'sign_function': 'log'}]
------------------
line: 3
1 ['šá', {'sign_function': 'phon'}]
2 ['i', {'sign_function': 'phon'}]
['di', {'sign_function': 'phon'}]
3 ['é', {'sign_function': 'log'}]
4 ['[', {'sign_function': 'log'}]
5 ['o', {'sign_function': 'log'}]
6 ['o', {'sign_function': 'log'}]
7 ['o', {'sign_function': 'log'}]
8 [']', {'sign_function': 'log'}]
9 ['a', {'sign_function': 'phon'}]
['na', {'sign_function': 'phon'}]
10 ['é', {'sign_function': 'log'}]
['babbar', {'sign_function': 'log'}]
['ra', {'sign_function': 'log'}]
------------------
line: 4
1 ['it', {'sign_function': 'phon'}]
['ta', {'sign_function': 'phon'}]
['din', {'sign_function': 'phon'}]
2 ['5', {'sign_function': 'log'}]
3 ['udu', {'sign_function': 'log'}]
['nita<sub>2</sub>', {'sign_function': 'log'}]
4 ['šá', {'sign_function': 'phon'}]
5 ['<sup>I</sup><i>ka-ṣir', {'sign_function': 'phon'}]
------------------
line: 5
1 ['a', {'sign_function': 'log'}]
['šú', {'sign_function': 'phon'}]
2 ['šá', {'sign_function': 'phon'}]
3 ['', {'sign_function': 'log'}]
['Id', {'sign_function': 'class'}]
['en', {'sign_function': 'log'}]
['mu', {'sign_function': 'log'}]
4 ['a', {'sign_function': 'phon'}]
['na', {'sign_function': 'phon'}]
5 ['7', {'sign_function': 'log'}]
6 ['gín', {'sign_function': 'log'}]
7 ['4', {'sign_function': 'log'}]
['tú</i>', {'sign_function': 'phon'}]
------------------
line: 6
1 ['kù', {'sign_function': 'log'}]
['babbar', {'sign_function': 'log'}]
2 ['šá', {'sign_function': 'phon'}]
3 ['muh', {'sign_function': 'phon'}]
['hi', {'sign_function': 'phon'}]
4 ['dul', {'sign_function': 'phon'}]
['lu', {'sign_function': 'phon'}]
5 ['', {'sign_function': 'log'}]
['I', {'sign_function': 'class'}]
['mu', {'sign_function': 'log'}]
['mu', {'sign_function': 'log'}]
------------------
line: 7
1 ['<i>ú', {'sign_function': 'log'}]
['šá', {'sign_function': 'log'}]
['hi', {'sign_function': 'log'}]
['su', {'sign_function': 'log'}]
2 ['a-na', {'sign_function': 'phon'}]
3 ['lìb', {'sign_function': 'phon'}]
['bi', {'sign_function': 'phon'}]
4 ['sì', {'sign_function': 'log'}]
['na</i>', {'sign_function': 'phon'}]
------------------
line: 8
1 ['1', {'sign_function': 'log'}]
2 ['udu', {'sign_function': 'log'}]
['nita<sub>2</sub>', {'sign_function': 'log'}]
3 ['a', {'sign_function': 'phon'}]
['na', {'sign_function': 'phon'}]
4 ['1', {'sign_function': 'log'}]
5 ['gín', {'sign_function': 'log'}]
6 ['4', {'sign_function': 'log'}]
['tú', {'sign_function': 'phon'}]
7 ['</i>kù', {'sign_function': 'log'}]
['babbar', {'sign_function': 'log'}]
------------------
line: 9
1 ['ina', {'sign_function': 'phon'}]
2 ['šu', {'sign_function': 'log'}]
['II', {'sign_function': 'class'}]
['', {'sign_function': 'log'}]
3 ['<sup>Id</sup>utu', {'sign_function': 'log'}]
['ba', {'sign_function': 'log'}]
['šá</i><sup>!</sup>', {'sign_function': 'phon'}]
4 ['[', {'sign_function': 'log'}]
------------------
line: 10
1 ['1', {'sign_function': 'log'}]
2 ['udu', {'sign_function': 'log'}]
['nita<sub>2</sub>', {'sign_function': 'log'}]
3 ['šá', {'sign_function': 'phon'}]
4 ['', {'sign_function': 'log'}]
['I', {'sign_function': 'class'}]
['DU', {'sign_function': 'log'}]
['[', {'sign_function': 'log'}]
------------------
line: 11
1 ['a', {'sign_function': 'phon'}]
['na', {'sign_function': 'phon'}]
2 ['1', {'sign_function': 'log'}]
['?', {'sign_function': 'class'}]
['', {'sign_function': 'log'}]
3 ['gín', {'sign_function': 'log'}]
4 ['[', {'sign_function': 'log'}]
------------------
line: 12
1 ['pap', {'sign_function': 'log'}]
2 ['[13', {'sign_function': 'log'}]
3 ['udu', {'sign_function': 'log'}]
['nita<sub>2</sub>', {'sign_function': 'log'}]
['meš', {'sign_function': 'log'}]
------------------
line: 13
1 ['iti', {'sign_function': 'log'}]
2 ['du<sub>6</sub>', {'sign_function': 'log'}]
3 ['u<sub>4</sub>', {'sign_function': 'log'}]
4 ['[o', {'sign_function': 'log'}]
['kam]', {'sign_function': 'log'}]
5 ['mu', {'sign_function': 'log'}]
6 ['sag', {'sign_function': 'log'}]
7 ['nam', {'sign_function': 'log'}]
['lugal', {'sign_function': 'log'}]
['la', {'sign_function': 'log'}]
------------------
line: 14
1 ['<sup>I</sup><i>ku-ra-áš', {'sign_function': 'phon'}]
2 ['lugal', {'sign_function': 'log'}]
3 ['tin', {'sign_function': 'log'}]
['tir', {'sign_function': 'log'}]
['ki', {'sign_function': 'class'}]
['', {'sign_function': 'log'}]
4 ['u', {'sign_function': 'phon'}]
5 ['kur', {'sign_function': 'log'}]
['kur', {'sign_function': 'log'}]
------------------
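
The output above still contains leftovers such as 'tú</i>' or '<i>ú', because the loop looks at one whitespace-separated word at a time while a tag like <i> can span several words. One possible way around this is to let a parser keep track of which tags are currently open. The following is a minimal sketch using html.parser from the standard library, not a finished solution: the class SignTagger and the function tag_signs are our own names, the labels follow the same assumptions as the loop above (italic = phonetic, superscript = classifier, everything else = logogram), and editorial marks such as ! and ? in superscript are still labelled "class".

```python
from html.parser import HTMLParser

class SignTagger(HTMLParser):
    """Collect (sign, function) pairs while tracking which tags are open."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # tags opened but not yet closed
        self.signs = []       # result: list of (sign, function) tuples

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self.open_tags.remove(tag)

    def handle_data(self, data):
        if "sub" in self.open_tags:
            # a subscript index belongs to the preceding sign (nita + 2 -> nita2)
            if self.signs:
                sign, function = self.signs[-1]
                self.signs[-1] = (sign + data.strip(), function)
            return
        if "sup" in self.open_tags:
            function = "class"
        elif "i" in self.open_tags:
            function = "phon"
        else:
            function = "log"
        # hyphens and whitespace both separate signs
        for chunk in data.replace("-", " ").split():
            self.signs.append((chunk, function))

def tag_signs(line):
    """Return the signs of one line with their assumed function."""
    parser = SignTagger()
    parser.feed(line)
    parser.close()
    return parser.signs

print(tag_signs("sì-<i>na</i>"))  # [('sì', 'log'), ('na', 'phon')]
```

Note what this sketch gives up: the result is a flat list of signs, so word boundaries are lost, and brackets or the placeholder o for broken signs are still treated as logograms. Both would need their own handling before any analysis.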

This notebook was created by Eliese-Sophia Lincke and Shai Gordin in Fall 2024 for the course Ancient Language Processing, incorporating materials from Avital Romach. Code can be reused under a CC-BY 4.0 license.