Topic Summary

Posted by: Dhammañāṇa
« on: September 17, 2019, 03:25:26 PM »

...and the filename mode to utf-8, of course.

Meanwhile, all the (more or less) new files have been uploaded, including the new words from Bhante Varado's Glossary, although the source files have not been made yet.

Redirects from old to new filenames, where they have changed, still have to be made, along with surely countless corrections here and there, including a new template file for the dic section.

Whether the files can be found depends on the state of the indexing, which may take a while.
Posted by: Dhammañāṇa
« on: September 17, 2019, 11:24:52 AM »

My person has now turned "deaccent" off, and it will take some time until all pages affected by it (especially in the dictionary) are recreated with proper filenames (many might not be found in the meantime).

To make finding words in the dictionary easier via quick-search suggestions, my person adds "[dic]" at the beginning of the header and, in brackets, the spelling without diacritics. For example "[dic] ācariya (acariya)" or "[dic] āciṇṇakakamma (acinnakakamma)".
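
The header convention described here could also be produced by a small script. What follows is only a minimal Python sketch: the function names are hypothetical, and it removes every combining mark via Unicode decomposition, which is an assumption and may differ from the character table the wiki software itself uses.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Return the word with all combining marks removed (ā -> a, ṇ -> n)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def dic_header(word: str) -> str:
    """Build a header of the form '[dic] <word> (<word without diacritics>)'."""
    return f"[dic] {word} ({strip_diacritics(word)})"


if __name__ == "__main__":
    for entry in ("ācariya", "āciṇṇakakamma"):
        print(dic_header(entry))
```

For words that contain no diacritics at all, the bracketed part simply repeats the word; whether such a header should then omit the brackets is a choice left open here.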
Posted by: Dhammañāṇa
« on: May 19, 2019, 10:50:42 AM »

Good, so then Atma will follow that vision when preparing further over the next days.
Posted by: Moritz
« on: May 19, 2019, 10:14:42 AM »

The "easiest" way with the given, existing possibilities might be that my person changes the filename mode to utf-8, sets deaccent to 0 and adds systematic alternative spellings as page content. That would make pages searchable and match in quick search as well. Writing the header without diacritics would make the quick search work as well.

Linking the pages will then require the right spelling.

The work on the existing content would mostly be renaming and re-saving Khmer pages. As for the impact on Windows systems, my person is not informed for now.
That seems like the cleanest solution for a start.
Regarding the impact on Windows systems, I think there is no need to consider it, because the server, like most servers, is running some variant of Linux.
_/\_
Posted by: Dhammañāṇa
« on: May 19, 2019, 06:26:51 AM »

The "easiest" way with the given, existing possibilities might be that my person changes the filename mode to utf-8, sets deaccent to 0 and adds systematic alternative spellings as page content. That would make pages searchable and match in quick search as well. Writing the header without diacritics would make the quick search work as well.

Linking the pages will then require the right spelling.

The work on the existing content would mostly be renaming and re-saving Khmer pages. As for the impact on Windows systems, my person is not informed for now.
Posted by: Dhammañāṇa
« on: May 18, 2019, 10:39:54 AM »

Currently "url" encoding is used. Although it is safer for use with certain Windows programs, my person thinks that utf-8 would be better and easier, avoiding the long, non-human-readable filenames.

Note also the warning here:

Warning: Changing this option could cause unintended behaviour. By changing it you can make pages created under a previous setting inaccessible.

Please also note that storing UTF-8 filenames might not be possible with all file systems. Windows systems have been reported to not work with this setting.

Windows might possibly follow soon, since utf-8 is easier for many.

Yet at the moment relatively few pages would have to be re-saved, mainly the Khmer ones.
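
To illustrate the difference between the two filename modes discussed in this post, here is a small Python sketch. It only shows the general idea of percent-encoding versus keeping the UTF-8 characters; the function names are hypothetical and the wiki software's own encoding routine may differ in details.

```python
import unicodedata
from urllib.parse import quote, unquote


def url_style_filename(page_name: str) -> str:
    """Percent-encode all non-ASCII characters, giving a pure ASCII filename."""
    normalized = unicodedata.normalize("NFC", page_name)
    return quote(normalized, safe="") + ".txt"


def utf8_style_filename(page_name: str) -> str:
    """Keep the characters as they are; the file system must support UTF-8 names."""
    return unicodedata.normalize("NFC", page_name) + ".txt"


if __name__ == "__main__":
    name = "ṭhitibhagiyasamadhi"
    encoded = url_style_filename(name)
    print(encoded)                    # long, but safe on any file system
    print(utf8_style_filename(name))  # human readable
    # Decoding the percent-encoded form gives back the readable name.
    print(unquote(encoded) == utf8_style_filename(name))
```

Each character outside ASCII becomes several %XX groups in the "url" form, which is why page names in scripts such as Khmer grow so long.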
Posted by: Dhammañāṇa
« on: May 18, 2019, 10:33:20 AM »

Currently still preparing the PTS dictionary for ati's dictionary, which has some tens of thousands of words, Atma is thinking about how best to handle the page names. The software, as it is now, reduces certain characters with diacritics, those included in its Roman table, to standard Roman characters. Some are not reduced, because they are not included. As a result, which is fine when using the search tools, since one can now search with or without diacritics for these characters, there would be many double or triple names.

Vandami Bhante _/\_

I do not really understand the problem.
The dictionary now includes pages like for example http://accesstoinsight.eu/en/dictionary/ṭhiti-bhāgiya-samādhi , which has diacritics in the pagename.
Searching for the phrase ṭhiti-bhāgiya-samādhi yields 211 matches on 2 pages. But searching for thiti-bhagiya-samadhi (without diacritics) yields no matches at all.

Where does it happen that Roman characters with diacritics are stripped of diacritics?

There is currently no problem, because Atma took care to have no "same" names within the standard, but that will no longer be possible, or only with tricks like using aa instead of ā in page names. When uploading the "whole" set of words, this can be troublesome.

The filename is ṭhitibhagiyasamadhi.txt, or rather %E1%B9%ADhitibhagiyasamadhi.txt. Whenever a page is newly created, the program cuts the name down to that standard.
Since ṭ is not included in that standard, it can be distinguished from t. The cutting down is fine as long as there is no need for both a page a and a page ā. If that is needed, then it meets its limits. The quick search looks at both header and filename, which is great in the current situation, because in that way two different spellings are found. If filename = header = real spelling, one would currently need to type the word correctly to find words whose diacritics remain.

Not having tried it yet, but uploading files that do not follow the page-name standard, say ṭhitibhāgiyasamādhi.txt, would, as far as understood, make them invalid files that cause errors and are not displayed.

If one now, where possible, disabled the cutting off of diacritics for the Roman table as well, the search would no longer be as pleasing and would possibly need to be recoded to match both with and without diacritic characters.

This seems to be already the case here: One would have to enter any phrase exactly correct with diacritics to find matching results.
Could Bhante give an example of a search where this is not the case? Or have I misunderstood something?

Not in the case where page name and header differ, at least for quick search (as it searches both header and filename). Standard search, as far as known, looks for the text exactly as it is for now.


In any case, yes, it seems very useful, being able to search for Pali phrases without all the exact diacritics. But would also be good to have the option to search for exact matching diacritics. Not sure how easy or difficult it would be to implement.

Every idea and suggestion welcome here.

Atma thinks it will take another week or two until a choice regarding filenames has to be made in order to progress.
Not sure what choice in regard to filenames is needed? At the moment, it seems dictionary filenames/page names include diacritics etc., which I think makes sense. If one would strip diacritics from filenames then one might have some conflicting words which would be the same without diacritics.

So I think it would be best to have all diacritics included in the file names / page names. Still not sure where anything is stripped of diacritics. Is it the case that one would have a file name with diacritics, but the URL would be without diacritics? Would be good to see an example.

_/\_

My person thinks that it would be best if the filename = the right spelling and search engines, as well as programs that handle files, were adapted accordingly, but my person thinks that this might require a huge amount of programming work and would probably have much impact on things such as applications and plugins as well.

Another option would be setting the deaccent config to 0; currently it is 2 (remove diacritics in Latin, which matches some of the Pali diacritics as well). Using 0 now may make some current page names problematic and cause trouble.
Removing the whole set of Pali diacritics from what the responsible script cuts away would possibly be better. Yet the matter that the searches are not optimized for different spellings is a separate one and remains.

In the system currently used, Atma has used tricks like "aa", but for m with a dot above, for example, this is more problematic; m with a dot below would stay as it is, since it is not in the Latin block of diacritics.

While always finding ways to make the best use of what is available, in this case my person has no idea of the amount of programming effort and skill needed for a good solution, and would therefore not really ask for pursuing this or that better solution. He is just aware that it would possibly require huge and skilled work and would need a lot of sacrifices, concentration and devotion to this matter.
Posted by: Moritz
« on: May 18, 2019, 07:50:24 AM »

Currently still preparing the PTS dictionary for ati's dictionary, which has some tens of thousands of words, Atma is thinking about how best to handle the page names. The software, as it is now, reduces certain characters with diacritics, those included in its Roman table, to standard Roman characters. Some are not reduced, because they are not included. As a result, which is fine when using the search tools, since one can now search with or without diacritics for these characters, there would be many double or triple names.

Vandami Bhante _/\_

I do not really understand the problem.
The dictionary now includes pages like for example http://accesstoinsight.eu/en/dictionary/ṭhiti-bhāgiya-samādhi , which has diacritics in the pagename.
Searching for the phrase ṭhiti-bhāgiya-samādhi yields 211 matches on 2 pages. But searching for thiti-bhagiya-samadhi (without diacritics) yields no matches at all.

Where does it happen that Roman characters with diacritics are stripped of diacritics?

If one now, where possible, disabled the cutting off of diacritics for the Roman table as well, the search would no longer be as pleasing and would possibly need to be recoded to match both with and without diacritic characters.

This seems to be already the case here: One would have to enter any phrase exactly correct with diacritics to find matching results.
Could Bhante give an example of a search where this is not the case? Or have I misunderstood something?

In any case, yes, it seems very useful, being able to search for Pali phrases without all the exact diacritics. But would also be good to have the option to search for exact matching diacritics. Not sure how easy or difficult it would be to implement.
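
As a rough idea of what such a search option could look like, here is a minimal Python sketch. It is not how the wiki's search works; `fold` and `search_titles` are hypothetical names, and the matching is a plain substring test over a list of titles.

```python
import unicodedata


def fold(text: str) -> str:
    """Lower-case the text and drop all combining marks (ā -> a, ṭ -> t)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def search_titles(query: str, titles: list[str], exact: bool = False) -> list[str]:
    """Return titles containing the query.

    exact=False: diacritics are ignored on both sides.
    exact=True:  diacritics must match as typed.
    """
    if exact:
        needle = unicodedata.normalize("NFC", query)
        return [t for t in titles if needle in unicodedata.normalize("NFC", t)]
    needle = fold(query)
    return [t for t in titles if needle in fold(t)]


if __name__ == "__main__":
    titles = ["ṭhiti-bhāgiya-samādhi", "ācariya"]
    print(search_titles("thiti-bhagiya-samadhi", titles))
    print(search_titles("thiti-bhagiya-samadhi", titles, exact=True))
    print(search_titles("ṭhiti-bhāgiya-samādhi", titles, exact=True))
```

A real implementation would more likely store the folded form in the search index next to the original text, so that both kinds of query stay fast.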

Every idea and suggestion welcome here.

Atma thinks it will take another week or two until a choice regarding filenames has to be made in order to progress.
Not sure what choice in regard to filenames is needed? At the moment, it seems dictionary filenames/page names include diacritics etc., which I think makes sense. If one would strip diacritics from filenames then one might have some conflicting words which would be the same without diacritics.

So I think it would be best to have all diacritics included in the file names / page names. Still not sure where anything is stripped of diacritics. Is it the case that one would have a file name with diacritics, but the URL would be without diacritics? Would be good to see an example.
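
The conflict mentioned above can be checked before uploading a word list. The following Python sketch groups words by their spelling without diacritics and reports every group with more than one member; the function names and the sample words are only illustrative assumptions, not taken from the actual dictionary files.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word: str) -> str:
    """Return the word with all combining marks removed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_collisions(words: list[str]) -> dict[str, list[str]]:
    """Map each stripped spelling to the words sharing it, keeping only clashes."""
    groups: dict[str, list[str]] = defaultdict(list)
    for word in words:
        groups[strip_diacritics(word)].append(word)
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    sample = ["kama", "kāma", "mala", "mālā", "ācariya"]
    for stripped, members in find_collisions(sample).items():
        print(stripped, "<-", ", ".join(members))
```

Every group reported this way would end up as one and the same page name if diacritics were stripped from filenames.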

_/\_
Posted by: Dhammañāṇa
« on: May 15, 2019, 08:46:09 AM »

Currently still preparing the PTS dictionary for ati's dictionary, which has some tens of thousands of words, Atma is thinking about how best to handle the page names. The software, as it is now, reduces certain characters with diacritics, those included in its Roman table, to standard Roman characters. Some are not reduced, because they are not included. As a result, which is fine when using the search tools, since one can now search with or without diacritics for these characters, there would be many double or triple names.
If one now, where possible, disabled the cutting off of diacritics for the Roman table as well, the search would no longer be as pleasing and would possibly need to be recoded to match both with and without diacritic characters.

Every idea and suggestion welcome here.

Atma thinks it will take another week or two until a choice regarding filenames has to be made in order to progress.