Beyond My Mind

April 11, 2007

Genetic Algorithm in Adaptive Web Search

Filed under: Research — mahbub @ 3:36 am

I am not interested here in the web search performed by a web spider or crawler. Those sophisticated, super-fast automated programs may have many bells and whistles to boost their performance. What I want to do here is formalize the adaptive search technique that we often use on a day-to-day basis without even realizing how powerful it is. Relating it to a formal method will help you understand the approach behind adaptive search and thus allow you to use it more rigorously.

What is an adaptive web search? As I mentioned in one of my previous articles, it is a cute and smart technique for finding what you want using a web search engine such as Google. Let me quote the search technique I used to find literature for solving a problem, as described in my previous article.

Say I have a problem to solve that was assigned by some course teacher or my research adviser. I mark some keywords and Google for them. If I don’t find any relevant information I use combinations of those keywords or use alternative keywords adapted from the search results. Once I start getting some keywords that produce relevant results in Google, I pass them to Google Scholar. Sometimes I go to some other subject-specific search engines to search using those keywords.

I believe you can see the pattern used here. Although the procedure is described here for a literature search, it is equally applicable to any other search. The procedure is to start with some initial guess and, depending on the outcome, refine your guess after each query. Because you adapt to the outcome of each query, I call it an adaptive web search.

How is it related to genetic algorithms? Genetic algorithms have their roots in our civilization and in the evolutionary process of the whole world. We are all familiar with the phrase “survival of the fittest”. It means that nature does not favor unfit outcomes, so nature refines itself with each generation. Can you see the similarity with adaptive web search?

A genetic algorithm is a simulation technique that uses a formal approach to simulate the above situation and finally come up with an approximate solution to a problem. Wikipedia has two nice articles on genetic algorithms and genetic programming. Quoting from Wikipedia:

Genetic algorithms are implemented as a computer simulation in which a population of abstract representations (called chromosomes or the genotype or the genome) of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem evolves toward better solutions. Traditionally, solutions are represented in binary as strings of 0s and 1s, but other encodings are also possible. The evolution usually starts from a population of randomly generated individuals and happens in generations. In each generation, the fitness of every individual in the population is evaluated, multiple individuals are stochastically selected from the current population (based on their fitness), and modified (recombined and possibly mutated) to form a new population. The new population is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population. If the algorithm has terminated due to a maximum number of generations, a satisfactory solution may or may not have been reached.
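To make the quoted description concrete, here is a minimal sketch of a generational genetic algorithm over bit strings. The function name `evolve`, its parameters, and the toy "OneMax" fitness (count the 1 bits) are illustrative assumptions and are not taken from the Wikipedia article. The fitness function is assumed to return non-negative numbers.

```python
import random

def evolve(fitness, genome_len=20, pop_size=30, max_generations=200,
           target=None, mutation_rate=0.02, rng=None):
    """Minimal generational genetic algorithm over bit strings."""
    rng = rng or random.Random()
    # Start from a population of randomly generated individuals.
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(max_generations):
        # Terminate early once a satisfactory fitness level is reached.
        if target is not None and fitness(best) >= target:
            break
        # Fitness-proportional selection; +1 avoids all-zero weights.
        weights = [fitness(ind) + 1 for ind in population]
        next_population = []
        while len(next_population) < pop_size:
            mom, dad = rng.choices(population, weights=weights, k=2)
            cut = rng.randrange(1, genome_len)   # one-point crossover
            child = mom[:cut] + dad[cut:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_population.append(child)
        population = next_population
        # Remember the best individual seen so far.
        best = max(population + [best], key=fitness)
    return best

# Toy problem: fitness is the number of 1 bits in the string.
best = evolve(sum, target=20, rng=random.Random(0))
print(sum(best), "of 20 bits set")
```

As the quote warns, hitting the generation limit does not guarantee a satisfactory solution; the function simply returns the best individual it has seen.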

It should be clear by now that the adaptive web search we do is essentially an application of a genetic algorithm. So you should feel confident about this searching technique. Here is a summary of the technique again:

  • Start with some random or predefined initial guesses
  • Search for those keywords
  • Select “acceptable” results from your search results and mark down some keywords from it
  • Do this until the results are approximately what you were looking for
  • Stop if you have searched many times over and are still not getting good results
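The steps above can be sketched as a small loop. This is only an illustration: `adaptive_search`, the three callables it takes, and the in-memory toy "index" are hypothetical names invented for this sketch, and no real search engine API is used.

```python
def adaptive_search(initial_keywords, search, is_acceptable, extract_keywords,
                    enough=5, max_rounds=10):
    """Refine a keyword set round by round.

    search(keywords) -> list of results           (hypothetical callable)
    is_acceptable(result) -> bool                 (your own judgment)
    extract_keywords(result) -> iterable of str   (hypothetical callable)
    """
    keywords = set(initial_keywords)
    accepted = []
    for _ in range(max_rounds):              # stop after too many rounds
        results = search(keywords)
        good = [r for r in results if is_acceptable(r) and r not in accepted]
        accepted.extend(good)
        if len(accepted) >= enough:          # results are good enough
            break
        for result in good:                  # adapt the next query
            keywords.update(extract_keywords(result))
    return accepted, keywords

# Toy demonstration on an in-memory "index" instead of a real search engine.
DOCS = {
    "ga tutorial": {"genetic", "algorithm", "selection"},
    "selection methods": {"selection", "tournament", "fitness"},
    "fitness landscapes": {"fitness", "landscape"},
    "cooking pasta": {"pasta", "sauce"},
}

def toy_search(keywords):
    return sorted(title for title, words in DOCS.items() if words & keywords)

found, final_keywords = adaptive_search(
    ["genetic"], toy_search,
    is_acceptable=lambda title: title != "cooking pasta",
    extract_keywords=lambda title: DOCS[title],
    enough=3,
)
print(found)
```

Starting from the single keyword "genetic", each round picks up new keywords from the accepted results, so the third query reaches a document the first query could not.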

As adaptive search is a genetic algorithm, it also suffers from the shortcomings that genetic algorithms have. First of all, adaptive search depends on your initial guess, which may be so unrelated that your search results are really bad. Secondly, it depends on your judgment in accepting a result. If you expect results too soon and accept no result, you may end up unhappy. On the other hand, if you accept all results, you may end up doing too many searches with all unrelated results.

You may be surprised by the efficiency of this adaptive search technique. If I remember correctly, I have never had to go beyond three iterations to get some good results. What I often do is go through this process two or three times and gather some acceptable results. I start working with the results, collect knowledge related to my problem and branch out my findings. Some of my questions may be answered by this set of results, but some may not be. So I go searching again for solutions to the sub-problem at hand, using the same adaptive search technique I have discussed so far.

As long as you are aware of the pitfalls of adaptive search, this can be a powerful approach for web searching. So folks, what approach do you follow for a web search?

March 24, 2007

Details of Microsoft Office 2007 Bibliographic Format Compared to BibTex

Filed under: Research — mahbub @ 2:29 am

I am trying to decipher the Microsoft Office 2007 bibliographic format. As I mentioned in my previous post, I am writing a Microsoft Office 2007 bibliographic import-export module for JabRef, a bibliographic manager. In this post I will try to find a link between BibTex citation elements and Microsoft Office 2007 bibliographic elements, so that anyone can use it to create export-import modules. For the sake of discussion here, let me coin the term MSBib for the Microsoft Office 2007 Bibliographic Format.

I created an XML ‘sources’ file with all possible entries or source types in Microsoft Office 2007. The file can be downloaded from here. The data used to create this XML is random garbage rather than anything useful, but you can refer to it to see what needs to go where.

One important similarity I found between MSBib and BibTex is that they both ignore unknown entries. That's a neat feature, because one can use it to store information that the other format cannot represent, instead of scrapping it altogether.

Apart from this similarity, the two formats are very different. MSBib is quite new and supports newer entry types like film, case, patent, etc. Another important difference is that MSBib has fields that are specific to some source or entry types, whereas BibTex has a common set of field types applicable to anything. Correct me if I am wrong about BibTex here.

Entry or Source Types

There are 17 types of sources in MSBib, whereas BibTex has 16 entry types. Table 1 below maps MSBib source types to BibTex entry types one by one. Obviously, there may not be a one-to-one mapping for all of them. In such cases I will try to map them to a misc entry or some other similar type. Remember that this table, like every other table in this document, is subject to change as I learn more.

Table 1: Entry or source types

MSBib | BibTex | Comment
Book | book |
BookSection | inbook |
BookSection, field BibTex_Entry=booklet | booklet | Not sure
BookSection, field BibTex_Entry=incollection | incollection | Not sure
JournalArticle | article |
ArticleInAPeriodical | article, field msbib-source=ArticleInAPeriodical | Not sure
ConferenceProceedings | inproceedings |
ConferenceProceedings, field BibTex_Entry=conference | conference |
ConferenceProceedings, field BibTex_Entry=proceedings | proceedings | Not sure
ConferenceProceedings, field BibTex_Entry=collection | collection | Not sure
Report | techreport |
Report, field BibTex_Entry=manual | manual |
InternetSite | misc, field msbib-source=InternetSite |
DocumentFromInternetSite | misc, field msbib-source=DocumentFromInternetSite |
ElectronicSource | misc, field msbib-source=ElectronicSource |
Art | misc, field msbib-source=Art |
SoundRecording | misc, field msbib-source=SoundRecording |
Performance | misc, field msbib-source=Performance |
Film | misc, field msbib-source=Film |
Interview | misc, field msbib-source=Interview |
Patent | patent |
Case | misc, field msbib-source=Case |
Report, field BibTex_Entry=mastersthesis | mastersthesis |
Report, field BibTex_Entry=phdthesis | phdthesis |
Report, field BibTex_Entry=unpublished | unpublished |
Misc | misc |
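Table 1 could be expressed in code roughly as follows. This is a sketch in Python rather than the Java an actual JabRef module would need, and the names `MSBIB_TO_BIBTEX` and `to_bibtex_type` are hypothetical. It covers the MSBib to BibTex direction only.

```python
# MSBib source type -> BibTex entry type, following Table 1.
# Source types with no BibTex counterpart become "misc" and keep their
# original type in a custom "msbib-source" field.
MSBIB_TO_BIBTEX = {
    "Book": "book",
    "BookSection": "inbook",
    "JournalArticle": "article",
    "ConferenceProceedings": "inproceedings",
    "Report": "techreport",
    "Patent": "patent",
    "Misc": "misc",
}

def to_bibtex_type(source_type, bibtex_entry=None):
    """Return (entry_type, extra_fields) for an MSBib source.

    bibtex_entry is the value of the optional BibTex_Entry field, which
    Table 1 uses to round-trip types such as booklet or phdthesis.
    """
    if bibtex_entry:                       # exact round trip
        return bibtex_entry, {}
    if source_type in MSBIB_TO_BIBTEX:
        return MSBIB_TO_BIBTEX[source_type], {}
    if source_type == "ArticleInAPeriodical":
        return "article", {"msbib-source": source_type}
    return "misc", {"msbib-source": source_type}

print(to_bibtex_type("Film"))
print(to_bibtex_type("Report", "phdthesis"))
```

Keeping the original type in `msbib-source` relies on both formats ignoring unknown fields, so nothing is lost on the way back.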

As I mentioned earlier, fields in MSBib are entry specific, whereas in BibTex they are common to all entries. To compare them, I will present MSBib fields in a linear list, as in BibTex. First I will show the details of the MSBib fields, starting with the composite Author field and then discussing the rest. A comparison with BibTex is presented later.

Author Field
MSBib authors are composite structures. An author can be of the following types:

  1. Author
  2. BookAuthor
  3. Editor
  4. Translator
  5. ProducerName
  6. Composer
  7. Conductor
  8. Performer
  9. Writer
  10. Director
  11. Compiler
  12. Interviewer
  13. Interviewee
  14. Inventor
  15. Counsel

Each sub-type of Author contains either a NameList holding one or more Person elements, or a Corporate field holding a comma-separated list of corporate persons.

Each Person in an MSBib NameList, like a name in BibTex, has three parts: First, Last and Middle. In MSBib they are written as

<Person>
<Last>LastName</Last>
<First>FirstName</First>
<Middle>MiddleName</Middle>
</Person>

In BibTex, names are written as follows, with multiple names separated by the word "and":

LastName, FirstName MiddleName and LastName, FirstName MiddleName

In MSBib Corporate field the names are represented as:

LastName, FirstName MiddleName; LastName, FirstName MiddleName;
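A conversion from a BibTex name string to the MSBib Person structure might look like the sketch below. It assumes the standard BibTex convention of separating names with the word "and" and handles only the "Last, First Middle" form. The function name `bibtex_names_to_namelist` is hypothetical, and namespaces are left out for brevity.

```python
import xml.etree.ElementTree as ET

def bibtex_names_to_namelist(names):
    """Convert 'Last, First Middle and Last, First' into a NameList element."""
    name_list = ET.Element("NameList")
    for name in names.split(" and "):
        last, _, rest = name.partition(",")
        first, _, middle = rest.strip().partition(" ")
        person = ET.SubElement(name_list, "Person")
        ET.SubElement(person, "Last").text = last.strip()
        if first:
            ET.SubElement(person, "First").text = first
        if middle:
            ET.SubElement(person, "Middle").text = middle.strip()
    return name_list

xml_text = ET.tostring(
    bibtex_names_to_namelist("Murshed, Mahbub and Knuth, Donald Ervin"),
    encoding="unicode")
print(xml_text)
```

Names written as "First Last", or with particles such as "von", would need a fuller parser than this.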

Different Fields in MSBib
The fields presented first are common to all source types. The special fields for each source type are presented after that. A field marked with a star (*) is a recommended (by Microsoft Office 2007) field for that source.

Common fields present in all sources

  • Tag: Identifier for the source. Same as BibTex key. Most probably this is created from first three letters of the first name of the first author combined with last two digits of the publishing year. Example: Mah07
  • SourceType: One of the MSBib source types from Table 1. Example: Book
  • GUID: Global ID. This enables Word to determine which source is most recent, based on the value of the GUID, and to prompt whether the user wants Word to update the outdated source to maintain continuity between the master list and the current list. Example: {F3BEFB3B-FC0D-47AB-970A-F4003FF99F9F} (more)
  • LCID: Language ID. Use 0 for English. Example: 0
  • Author: A composite containing different author subtypes. Sub types of Author are source specific.
  • Title*: Title of the source. Example: Brief History of Time
  • Year*: Publication year. Example: 2004
  • ShortTitle: Short title of the source. Example: BHT
  • Comments: Free form text as comment on the source. Example: Comment is helpful to annotate a source.
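The guess about how the Tag is built can be written down as a one-line function. This only reproduces the guess stated in the Tag description above (three letters of the first name plus two digits of the year). The rule Word actually uses may differ, and the name `guess_tag` is hypothetical.

```python
def guess_tag(first_name, year):
    """Build a tag such as 'Mah07' from a first name and a publication year."""
    return first_name[:3].capitalize() + str(year)[-2:]

print(guess_tag("Mahbub", 2007))
```

A real exporter would also have to make sure the generated tags are unique within one file.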

Additional fields in Book

  • Composite Author:
    1. Author*: A NameList containing one or more author(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
    2. Editor: A NameList containing one or more editor(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    3. Translator: A NameList containing one or more translator(s) of this source. Example: Murshed, Laura; Jones, Dave
  • Pages: Page range referenced. Example: 23-45
  • Volume: Volume of the book. Example: 2
  • NumberVolumes: Total number of volumes the book has. Example: 5
  • Edition: Edition of the book. Example: 2
  • StandardNumber: ISBN/ISSN or some other standard number. Example: ISBN 226-392-34
  • Publisher*: Name of the publisher. Example: Springer-Verlag
  • City*: City published in. Example: San Francisco.
  • StateProvince: State published in. Example: California
  • CountryRegion: Country published in. Example: USA

Additional fields in BookSection

  • Composite Author:
    1. Author*: A NameList containing one or more author(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
    2. BookAuthor*: A NameList containing one or more author(s) of this book. This may not be the same as Author. Example: Murshed, Mahbub; Zakir, Tanjia
    3. Editor: A NameList containing one or more editor(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
  • BookTitle*: Title of the book, appears twice. Example: Brief History of Time
  • Pages*: Page range referenced. Example: 23-45
  • Volume: Volume of the book. Example: 2
  • NumberVolumes: Total number of volumes the book has. Example: 5
  • ChapterNumber: The chapter number of the book referenced. Example: 7
  • Edition: Edition of the book. Example: 2
  • StandardNumber: ISBN/ISSN or some other standard number. Example: ISBN 226-392-34
  • Publisher*: Name of the publisher. Example: Springer-Verlag
  • City*: City published in. Example: San Francisco.
  • StateProvince: State published in. Example: California
  • CountryRegion: Country published in. Example: USA

Additional fields in JournalArticle

  • Composite Author:
    1. Author*: A NameList containing one or more author(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
    2. Editor: A NameList containing one or more editor(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
  • JournalName*: Name of the journal this article appeared in. Example: Engineering Design
  • Pages*: Page range referenced. Example: 23-45
  • Volume: Volume of the journal. Example: 2
  • Issue: Issue number of current volume in which the article published. Example: 4
  • StandardNumber: DOI or some other standard number. Example: DOI 22639234
  • Publisher: Name of the publisher. Example: Springer-Verlag
  • City: City published in. Example: San Francisco.
  • Month: Month published in. Example: February.
  • Day: Day published in. Example: 19.

Additional fields in ArticleInAPeriodical

  • Composite Author:
    1. Author*: A NameList containing one or more author(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
    2. Editor: A NameList containing one or more editor(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
  • PeriodicalTitle*: Name of the periodical this article appeared in. Example: Mechanical Engineering
  • Pages*: Page range referenced. Example: 23-45
  • Edition: Edition of the book. Example: 2
  • Volume: Volume of the journal. Example: 2
  • Issue: Issue number of current volume in which the article published. Example: 4
  • StandardNumber: DOI or some other standard number. Example: DOI 22639234
  • Publisher: Name of the publisher. Example: Springer-Verlag
  • City: City published in. Example: San Francisco.
  • Month*: Month published in. Example: February.
  • Day*: Day published in. Example: 19.

Additional fields in ConferenceProceedings

  • Composite Author:
    1. Author*: A NameList containing one or more author(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
    2. Editor: A NameList containing one or more editor(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
  • ConferenceName*: Name of the conference this article appeared in. Example: Mechanical Engineering
  • Pages*: Page range referenced. Example: 23-45
  • Volume: Volume of the journal. Example: 2
  • StandardNumber: DOI or some other standard number. Example: DOI 22639234
  • Publisher*: Name of the publisher. Example: Springer-Verlag
  • City*: City published in. Example: San Francisco.

Additional fields in Report

  • Composite Author:
    1. Author*: A NameList containing one or more author(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
  • Department: Name of the department this report was prepared for. Example: Mechanical Engineering Department
  • Institution: Name of the institution this report was prepared for. Example: Arizona State University
  • ThesisType: Type of thesis. Example: phd, masters or technical
  • Pages: Page range referenced. Example: 23-45
  • StandardNumber: Some standard number. Example: ASU-PHD 22639234
  • Publisher*: Name of the publisher. Example: Arizona State University Press
  • City*: City published in. Example: Tempe.

Additional fields in InternetSite and DocumentFromInternetSite

  • Composite Author:
    1. Author*: A NameList containing one or more author(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
    2. Editor: A NameList containing one or more editor(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    3. ProducerName: A NameList containing one or more producer’s name(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
  • InternetSiteTitle*: Title of the internet site, a duplicate of the Title field that appeared in the common section. Example: Beyond My Mind
  • Month*: Month published in. Example: February.
  • Day*: Day published in. Example: 19.
  • YearAccessed*: Year in which the site was accessed for reference. Example: 2004
  • MonthAccessed*: Month in which the site was accessed for reference. Example: February.
  • DayAccessed*: Day in which the site was accessed for reference. Example: 19.
  • URL*: The website URL. Example: http://mahbub.wordpress.com.
  • ProductionCompany: The production company of the website. Example: wordpress.
  • Version: Version number of the website. Example: 1.3.
  • StandardNumber: Some standard number. Example: SITE-ID 22639234

Additional fields in ElectronicSource

  • Composite Author:
    1. Author*: A NameList containing one or more author(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
    2. Editor: A NameList containing one or more editor(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    3. ProducerName: A NameList containing one or more producer’s name(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    4. Translator: A NameList containing one or more translator(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
  • PublicationTitle: Title of the source, appears twice. Example: GNU C++ Source
  • Volume: Volume of the source. Example: 2
  • Medium: Medium of the source. Example: CD-ROM
  • Edition: Edition of the source. Example: 2
  • Month*: Month published in. Example: February.
  • Day*: Day published in. Example: 19.
  • ProductionCompany: Company that published the code. Example: FSF
  • Publisher: Publisher of the code. Example: GNU
  • City*: City published in. Example: San Francisco.
  • StateProvince*: State published in. Example: California
  • CountryRegion*: Country published in. Example: USA
  • StandardNumber: Some standard number. Example: SITE-ID 22639234

Additional fields in Art

  • Composite Author:
    1. Artist*: A NameList containing one or more artist(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
  • PublicationTitle*: Title of the art work, appears twice. Example: Monalisa
  • Institution*: Institution the art work belongs to. Example: Arizona State University
  • Publisher: Publisher of the art. Example: Art Publisher
  • Pages: Pages of the art. In my opinion this is incorrect. Example: 23-34
  • City*: City published in. Example: San Francisco.
  • StateProvince: State published in. Example: California
  • CountryRegion: Country published in. Example: USA

Additional fields in SoundRecording

  • Composite Author:
    1. Artist: A NameList containing one or more artist(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
    2. Composer*: A NameList containing one or more composer(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    3. Conductor*: A NameList containing one or more conductor(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    4. Performer*: A NameList containing one or more performer(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    5. ProducerName: A NameList containing one or more producer’s name(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
  • AlbumTitle: Title of the album, appears twice. Example: Best of Bob Dylan
  • ProductionCompany: Company that produced the album. Example: Golden records
  • City*: City published in. Example: San Francisco.
  • StateProvince*: State published in. Example: California
  • CountryRegion*: Country published in. Example: USA
  • Medium: Medium of the album. Example: CD-ROM
  • RecordingNumber: Some recording number. Example: 22639
  • StandardNumber: Some standard number. Example: RECORD 22639

Additional fields in Performance

  • Composite Author:
    1. Performer*: A NameList containing one or more performer(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
    2. Writer*: A NameList containing one or more writer(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    3. ProducerName: A NameList containing one or more producer’s name(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    4. Director: A NameList containing one or more director(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
  • Theater*: Theater where the performance took place. Example: Arizona State University Central Theater
  • ProductionCompany: Company that produced the performance. Example: Golden records
  • City*: City performed in. Example: San Francisco.
  • StateProvince*: State performed in. Example: California
  • CountryRegion*: Country performed in. Example: USA
  • Month*: Month performed in. Example: February.
  • Day*: Day performed in. Example: 19.
  • StandardNumber: Some standard number. Example: RECORD 22639

Additional fields in Film

  • Composite Author:
    1. Writer: A NameList containing one or more writer(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    2. Performer: A NameList containing one or more performer(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
    3. Director*: A NameList containing one or more director(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    4. ProducerName: A NameList containing one or more producer’s name(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
  • ProductionCompany: Company that produced the film. Example: Golden records
  • Distributor: Company that distributed the film. Example: Golden distributor
  • CountryRegion: Country performed in. Example: USA
  • Medium: Medium the record published in. Example: CD-ROM
  • StandardNumber: Some standard number. Example: RECORD 22639

Additional fields in Interview

  • Composite Author:
    1. Interviewee*: A NameList containing one or more interviewee(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    2. Interviewer*: A NameList containing one or more interviewer(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    3. Editor: A NameList containing one or more editor(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
    4. Translator: A NameList containing one or more translator(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    5. Compiler: A NameList containing one or more compiler(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
  • BroadcastTitle: Title of the interview, same as Title. Example: Interview of Dr. Knuth
  • Publisher: Company that published the interview. Example: Adventure publisher
  • Broadcaster: Company that broadcast the interview. Example: NBC
  • Station: Station that broadcast the interview. Example: WNBC
  • City: City performed in. Example: San Francisco.
  • StateProvince: State performed in. Example: California
  • CountryRegion: Country performed in. Example: USA
  • Month*: Month published in. Example: February.
  • Day*: Day published in. Example: 19.
  • StandardNumber: Some standard number. Example: RECORD 22639

Additional fields in Patent

  • Composite Author:
    1. Inventor*: A NameList containing one or more inventor(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    2. Editor: A NameList containing one or more editor(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    3. Translator: A NameList containing one or more translator(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
  • Type: Patent type. Example: Software
  • CountryRegion*: Country performed in. Example: USA
  • Month: Month published in. Example: February.
  • Day: Day published in. Example: 19.
  • PatentNumber*: Some standard patent number. Example: PATENT 22639

Additional fields in Case

  • Composite Author:
    1. Author: A NameList containing one or more author(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    2. Counsel: A NameList containing one or more counsel(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
  • Court*: Court the case appeared in. Example: Supreme Court
  • Reporter: Reporter that reported on the case. Example: Big Reporter Agency
  • Month*: Month appeared in. Example: February.
  • Day*: Day appeared in. Example: 19.
  • City: City appeared in. Example: San Francisco.
  • CaseNumber*: Some standard case number. Example: CASE 22639
  • AbbreviatedCaseNumber: Some standard abbreviated case number. Example: CASE 22639, for doing some illegal activity.

Additional fields in Misc

  • Composite Author:
    1. Author*: A NameList containing one or more author(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    2. Editor: A NameList containing one or more editor(s) of this source. Example: Murshed, Manjur; Ali, Liyakat
    3. Translator: A NameList containing one or more translator(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
    4. Compiler: A NameList containing one or more compiler(s) of this source. Example: Murshed, Mahbub; Zakir, Tanjia
  • PublicationTitle*: Title of the publication. Example: The Big Bang Theory
  • Publisher*: Name of the publisher. Example: Springer-Verlag
  • Pages: Page range referenced. Example: 23-45
  • Volume: Volume of the publication. Example: 2
  • Edition: Edition of the publication. Example: 2
  • Issue: Issue of the publication. Example: 2
  • Month*: Month appeared in. Example: February.
  • Day*: Day appeared in. Example: 19.
  • City*: City published in. Example: San Francisco.
  • StateProvince*: State published in. Example: California
  • CountryRegion*: Country published in. Example: USA
  • Medium: Medium published in. Example: CD-ROM
  • StandardNumber: ISBN/ISSN or some other standard number. Example: ISBN 226-392-34

Similar fields in BibTex
As I mentioned earlier in this post the comparison will be based on a linear mapping table as shown in Table 3. Table 2 contains the mapping of author fields.

Table 2: Author fields in MSBib and BibTex

MSBib | BibTex | Comment
Author | author |
BookAuthor | msbib-bookauthor | custom
Editor | editor |
Translator | msbib-translator | custom
ProducerName | msbib-producername | custom
Composer | msbib-composer | custom
Conductor | msbib-conductor | custom
Performer | msbib-performer | custom
Writer | msbib-writer | custom
Director | msbib-director | custom
Compiler | msbib-compiler | custom
Interviewer | msbib-interviewer | custom
Interviewee | msbib-interviewee | custom
Inventor | msbib-inventor | custom
Counsel | msbib-counsel | custom

Here goes the table with MSBib and BibTex fields.
Table 3: Fields in MSBib and BibTex

MSBib | BibTex | Comment
Tag | Database key or key |
SourceType | | Choose from Table 1
GUID | | Ignore
LCID | language | A map between language name to LCID may be required
Title | title |
Year | year |
ShortTitle | msbib-shorttitle | custom
Comments | note or annote |
Pages | pages |
Volume | volume |
NumberVolumes | msbib-numberofvolume | custom
Edition | edition |
StandardNumber | ISBN, ISSN, LCCN, mrnumber | Parse standard number to determine ISBN or ISSN
Publisher | publisher |
City, StateProvince, CountryRegion | address or location | Usually MSBib fields appear together.
BookTitle | booktitle |
ChapterNumber | chapter |
JournalName | journal |
Issue | number |
Month | month |
Day | msbib-day | custom
PeriodicalTitle | organization |
ConferenceName | organization |
Department | school |
Institution | institution |
ThesisType | type |
InternetSiteTitle | title | Approximate
YearAccessed, MonthAccessed, DayAccessed | msbib-accessed | Date accessed “month day, year” format in an additional field
URL | URL |
ProductionCompany | msbib-productioncompany | custom
PublicationTitle | title | Approximate
Medium | msbib-medium | custom
AlbumTitle | title | Approximate
RecordingNumber | msbib-recordingnumber | custom
Theater | msbib-theater | custom
Distributor | msbib-distributor | custom
BroadcastTitle | title | Approximate
Broadcaster | msbib-broadcaster | custom
Station | msbib-station | custom
Type | msbib-type | Patent type. custom
PatentNumber | msbib-patentnumber | custom
Court | msbib-court | custom
Reporter | msbib-reporter | Reporter for a case. custom
CaseNumber | msbib-casenumber | custom
AbbreviatedCaseNumber | msbib-abbreviatedcasenumber | custom
BibTex_Series | series | Common name of series of books.
BibTex_Abstract | abstract |
BibTex_KeyWords | keywords |
BibTex_CrossRef | crossref | Database key being cross referenced.
BibTex_HowPublished | howpublished |
BibTex_Affiliation | affiliation | Authors affiliation.
BibTex_Contents | contents | A table of contents.
BibTex_Copyright | copyright |
BibTex_Price | price |
BibTex_Size | size | Physical dimension of a work.
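A field converter built on Table 3 might start out like the sketch below. It covers only a handful of the rows, and the names `FIELD_MAP` and `convert_fields` are hypothetical. The input is assumed to be a flat dict of MSBib field names to string values, and the standard-number prefix matching is a guess at what "parse standard number" could mean in practice.

```python
import re

# A few direct field renames taken from Table 3 (not the full table).
FIELD_MAP = {
    "Title": "title", "Year": "year", "Pages": "pages", "Volume": "volume",
    "Edition": "edition", "Publisher": "publisher", "BookTitle": "booktitle",
    "ChapterNumber": "chapter", "JournalName": "journal", "Issue": "number",
    "Month": "month", "Day": "msbib-day", "URL": "url",
}

def convert_fields(msbib):
    """Map a flat dict of MSBib fields to a dict of BibTex fields."""
    bib = {FIELD_MAP[k]: v for k, v in msbib.items() if k in FIELD_MAP}

    # City, StateProvince and CountryRegion collapse into one address field.
    place = [msbib[k] for k in ("City", "StateProvince", "CountryRegion")
             if msbib.get(k)]
    if place:
        bib["address"] = ", ".join(place)

    # The access date goes into one custom field in "month day, year" format.
    if all(msbib.get(k) for k in ("MonthAccessed", "DayAccessed", "YearAccessed")):
        bib["msbib-accessed"] = "%s %s, %s" % (
            msbib["MonthAccessed"], msbib["DayAccessed"], msbib["YearAccessed"])

    # Guess the BibTex field from the prefix of the standard number.
    match = re.match(r"\s*(ISBN|ISSN|LCCN)\b[:\s]*(.+)",
                     msbib.get("StandardNumber", ""), re.I)
    if match:
        bib[match.group(1).lower()] = match.group(2).strip()
    return bib

example = convert_fields({
    "Title": "Brief History of Time", "City": "Tempe", "CountryRegion": "USA",
    "StandardNumber": "ISBN 226-392-34",
    "MonthAccessed": "February", "DayAccessed": "19", "YearAccessed": "2004",
})
print(example)
```

Standard numbers without a recognized prefix (for example a DOI) are simply dropped here; a real module would more likely keep them in a custom field.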

This article is more or less complete here. I hope to update this post as soon as I get more info about these formats. Thank you for reading it.

Ping
I love visitors. So let me ping important sources so that people come to know about this article. ;)

Some references that might be helpful:

  1. How to use Office 2007 bibliographic tool
  2. OpenXML Developer
  3. Blog of Brian Jones, the person behind the Office 2007 open XML
  4. ECMA Open XML Standard Elaborated Schemas (all documents)
  5. MSDN article showing how to work with Bibliography (updated March 23, 2007)

March 22, 2007

Deciphering Microsoft Office 2007 Bibliography Format

Filed under: Research — mahbub @ 12:56 am

I am about to write a module for JabRef, an open source bibliographic management program, to export bibliographic information for Microsoft Office 2007.

Some references that might be helpful:

  1. How to use Office 2007 bibliographic tool
  2. OpenXML Developer
  3. Blog of Brian Jones, the person behind the Office 2007 open XML
  4. ECMA Open XML Standard Elaborated Schemas (all documents)
  5. MSDN article showing how to work with Bibliography (updated March 23, 2007)

But after searching for a day, I could not find a single web page describing the exact or near exact format for bibliographic information in Microsoft Office 2007. So I started digging in myself.

I started adding some bibliographies in the Microsoft Office Bibliography Editor. The very first thing I noticed is that if you add some references and don’t use them in the document, they are not going to be saved. If you use one or more of them in your document, all of them will be saved in “C:\Documents and Settings\<USER>\Application Data\Microsoft\Bibliography\Sources.xml”. I opened the XML file and here’s what I got (figure 1).

Figure 1: Microsoft Office 2007 Bibliographic Database Format (image: content of the Microsoft Office 2007 bibliographic source XML)

Obviously I had only one bibliographic source in the “Sources.xml”. I was almost certain that Office would import a copy of this file without any problem. A copy of this file, with the information altered and the GUID and LCID deleted, just worked as an imported bibliography. But wait, where are my previous bibliographic sources?

So I tried to discover what happened and found that Office does NOT really import a bibliography into the “Sources.xml”; it only allows you to work on the currently opened XML file. All the bibliographic sources in the currently opened bibliographic XML file are displayed in the ‘master list’. You have to “copy” them into your ‘current list’ to work with them. If you want to merge information from an external XML file into your “C:\Documents and Settings\<USER>\Application Data\Microsoft\Bibliography\Sources.xml”, you have to open the external XML file, copy the information into your ‘current list’, open the “Sources.xml” again and then copy it back into the ‘master list’, which now points to “Sources.xml”.

I wanted to find out the least possible information required for the XML file to be recognized as a valid bibliographic source by Office 2007. The bare minimum is:

<sources xmlns="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"/>

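If you want to add information to this bare minimum XML, don’t use the “b:” prefix on the tags: the namespace is declared here as the default namespace, so the elements need no prefix.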
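As a minimal sketch, the bare-minimum file can also be generated programmatically. The Python below uses the standard library's ElementTree; the function name minimal_sources_xml is made up for this illustration, and only the root element name and the namespace URI come from the XML shown above.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the bare-minimum XML above.
BIB_NS = "http://schemas.openxmlformats.org/officeDocument/2006/bibliography"


def minimal_sources_xml() -> str:
    """Return the bare-minimum root element as a string, without a "b:" prefix."""
    # Registering the empty prefix makes ElementTree write xmlns="..." (a default
    # namespace) instead of an auto-generated prefix such as ns0.
    ET.register_namespace("", BIB_NS)
    root = ET.Element(f"{{{BIB_NS}}}sources")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(minimal_sources_xml())
```

Any child elements would be created the same way, with the namespace in braces in front of the tag name, so that they are also written without a prefix.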

From MSDN (update):

The Guid and LCID elements are optional, but you can provide values for them if you want. The Guid element value should be a valid GUID, which you can generate programmatically outside the Word object model. (See the Microsoft Visual Studio documentation or the Microsoft Windows documentation on MSDN for information about programmatically generating ID.) Word generates GUIDs when users add or edit a source. If you do not add a GUID to the XML and a user then edits a source, Word generates a GUID. This enables Word to determine which source is most recent, based on the value of the GUID, and to prompt whether the user wants Word to update the outdated source to maintain continuity between the master list and the current list.

The LCID specifies the language for the source. (See MSDN for valid language identification values.) Word uses the LCID to know how to display a cited source in a document’s bibliography. For example, one source may be written in French, one in English, and one in Japanese. From the LCID, Word determines how to display names (for example, Last, First for English), what punctuation to use (for example, using comma in one language and a semicolon in another), and what strings to use (for example, whether to use “et al” or another localized form).

Now that I have deciphered how bibliographic information can be presented in XML so that Office 2007 recognizes it as a bibliographic source, I can list all the bits and pieces that can go inside it. Please follow my next post on it.

March 19, 2007

Comparison of Bibliographic Search Engines

Filed under: Research — mahbub @ 1:23 am

Inspired by the responses to my post comparing free bibliographic tools, I planned on posting a few more articles about other research tools. Here I am presenting a comparison of bibliographic search engines. The next article will be on scientific document processing tools. This series of comparisons follows a general pattern: I start with how such tools are used, then set up a set of criteria to compare on, and then compare the different tools. I believe this is a reasonable way to present these topics.

A bibliographic search engine is a general purpose research tool that contains the research citations for one or more broad subject areas, and that people can reliably refer to for their own research. It is different from a “personal bibliographic tool” in the sense that a bibliographic search engine is developed for the commons: it must be reliable for any reference it provides and it must provide a way to export the data, but it does not need to present the reference in any specific ‘visual’ or ‘presentation’ format.

The way I search for scientific articles is pretty simple. Say I have a problem to solve that was assigned by a course teacher or my research supervisor. I mark some keywords and Google them. If I don’t find any relevant information, I use combinations of those keywords or alternative keywords adapted from the search results. Once I start getting keywords that produce relevant results in Google, I pass them to Google Scholar. Sometimes I go to other subject specific search engines and search using those keywords.

Expected Features from a Bibliographic Search Engine

A bibliographic search engine must be accessible from each and every operating system available. For this reason all such databases are platform independent, i.e. web based.

Providing a reliable collection of references is not an easy task. Obtaining a huge amount of publication information, plugging it into the system, providing the information to the user, and maintaining it periodically all take an enormous amount of work. The data collection process may not be free. Moreover, if a search engine wants to provide a soft copy of the reference along with the citation information, then the search engine provider has to pay for it. For this reason many of these search engines are not free.

I tried to compile a set of features expected from a bibliographic search engine. I believe the following requirements are complete and sufficient for general purpose research.

  • Areas: Number of mainstream subject areas covered. For example, medical, engineering, etc.
  • Search: Support for searching different data fields. For example, searching by title, author, abstract, etc.
  • Export: Export formats and automation
    • Format: Different formats supported. For example, BibTex, EndNote, RIS, etc.
    • Automation: Communicating through API calls or other ways from external applications
  • Document: A soft copy of the document
  • Cost: Cost of obtaining the service.

Comparison

Google appears twice in the table. This is not a typo. I think a general Google search and a Google Scholar search are quite different. In a few cases, research papers are freely available on the public web but Google Scholar may not recognize them. So I consider the two to be different search tools.

Table 1: Comparison matrix of bibliographic search engines.

Name | Subject Area | Search | Export Format | Export Automation | Electronic Copy | Cost
Google | General | General | None | No | No | Free
Google Scholar | General | Title, Author, Publisher, Date, Area | None | No | No | Free
CiteSeer | General | General | BibTeX | May be | Yes, with exceptions | Free
ScienceDirect | General | Title, Author, Journal Name, Volume, Issue, Page | RIS, ASCII | RefWorks | Yes | Paid
IEEE Xplore | Electrical eng., Computer Sci., Electronics | General, Advanced | RIS, ASCII | May be | Yes | Paid
ACM DL | Only ACM Articles, mainly Computer Sci. | General, Advanced | BibTex, End Note, ACM Ref | May be | Yes | Paid
ACM Guide | Computer Sci. | General, Advanced | BibTex, End Note, ACM Ref | May be | Yes | Paid
CSB | Computer Sci. | General, Advanced | BibTex | No | No | Free
DBLP | Computer Sci. | General | BibTex | No | No | Free
Net Bib | Computer Sci | General | BibTex | No | No | Free
PubMed | Biomedical and LifeScience | General, Advanced | Unkn. | May be | Yes | Free
Ingenta Connect | General | General | BibTex, End Note | May be | Yes | Partially Free#
Engineering Village* | Engineering | Unkn. | Unkn. | May be | Yes | Paid
ISI Web of Knowledge* | General | Unkn. | Unkn. | Yes | Yes | Paid
arXiv* | Physics, Math, Computer Sci., Biology | General, Advanced | None | No | Yes | Free

As with the previous article, I need your help to complete this table. I am requesting more information from you.

Mid Size Search Engines

Some new generation web based bibliographic tools have evolved that can act as search engines for a specific subject area as well as help a single person or a group of people by acting as a personal bibliographic tool. The main problems with this kind of mid sized search engine are that it may be poorly maintained and may not be very reliable. You can find the names and a comparison of some of these tools in my previous article.

This article is a rather simplified outline of the bibliographic search engines available. Depending on your specific subject area, you must find out for yourself which search engine suits you best. Your comments on more search engines, grammatical/spelling errors and writing style are most welcome.

March 4, 2007

LaTeX vs. Microsoft Word

Filed under: Research — mahbub @ 4:00 pm

LaTeX is definitely for power users: authors who want nice-looking documents and can handle the complexity of creating LaTeX documents. Word, on the other hand, is for users who don’t want to handle that complexity, or who are willing to accept a moderate-looking document in exchange for spending less time.

The current trend of development in this area is interesting. LyX, a tool for LaTeX document processing, is trying to mimic WYSIWYG-type behavior while still producing a nice-looking, portable document at the end. Microsoft Word, on the other hand, is trying to add support for complex features like cross-referencing, citation, etc.

This reminds me of Xena. Usually heroes are rough and tough as well as powerful, whereas the female counterparts of heroes are soft, beautiful, and smell nice. Xena was the hybrid who had the beauty of a female and the power of a male. LyX and Microsoft Word 2007 remind me of Xena every now and then.

Comparison of Free Bibliographic Managers

Filed under: Research — mahbub @ 1:26 am

Until recently, I never realized how scarce good tools for managing a personal bibliography database are. I was writing a paper and found that it is really difficult to manage hundreds of references and use them in a document. Besides my original research, I started researching this issue and found that no single tool handles all the required tasks. This post is the result of my search for a free tool that best serves this purpose.

Here is how a bibliographic manager works. An author creates a document and cites entries from a bibliographic database created earlier. After the author has finished writing the paper, he passes the document and the database through an application, and the application pulls all the references cited in the document from the database. This produces a final version of the document containing the text, graphics, etc. the author created, along with the references he cited, in a specific format. The entire process is shown in figure 1.

Figure 1: The bibliographic citation process
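The process in figure 1 can be sketched in a few lines of Python. This is only an illustration under assumptions: the [@key] citation marker, the numeric output style, and the placeholder database entries are all made up here and do not correspond to any particular tool.

```python
import re

# Hypothetical marker syntax for a citation inside the document: [@key]
CITE = re.compile(r"\[@([A-Za-z0-9_]+)\]")


def resolve_citations(text, database):
    """Replace [@key] markers with numbers and build the reference list.

    database maps a citation key to an already formatted reference string.
    References are numbered in order of first appearance in the text.
    """
    order = []

    def number(match):
        key = match.group(1)
        if key not in database:
            raise KeyError(f"cited key not in database: {key}")
        if key not in order:
            order.append(key)
        return f"[{order.index(key) + 1}]"

    body = CITE.sub(number, text)
    references = [f"[{i}] {database[key]}" for i, key in enumerate(order, 1)]
    return body, references


if __name__ == "__main__":
    # Placeholder entries, not real publications.
    db = {"a": "Author A. First placeholder title.",
          "b": "Author B. Second placeholder title."}
    body, refs = resolve_citations("See [@a] and [@b], then [@a] again.", db)
    print(body)
    print("\n".join(refs))
```

Only the entries actually cited end up in the reference list, which is the point of passing both the document and the database through the application.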

I will present a comparison matrix of the available bibliographic managers. First I will decompose the bibliographic management requirements into several sub-objectives, i.e. features. Then I will present a table showing the support for those features in the available tools. However, I have not used all the bibliographic managers out there, and it is not possible for a single person to use all of them. So I would request you to let me know about any tool you have used.

Expected Features in a Bibliographic Manager
There are six basic requirements expected from a bibliographic manager. Listed below is a functional decomposition of these requirements.

  • Search: Search all the available academic/non-academic databases.
  • Store: Store the reference and possibly a soft copy of the reference.
    • Viewer: View soft copy (doc, pdf, etc.).
  • Annotate: Keep notes on the reference.
    • Overall: Just a single note on the reference
    • Anywhere: A note anywhere in the document
  • Communicate: Import from and export to different formats.
    • Import (BibTex, End Note, XML, etc.)
    • Export (BibTex, End Note, XML, etc.)
  • Platform: Run on different platforms (Linux, Windows, etc.)
  • Presentation: Present the data to some standard formats.
    • Formats (MLA, APA, etc.)
    • Document (doc, pdf, html, etc.)

Some Issues about Bibliographic Tools
The main hurdle, in my opinion, is annotating the reference document. There are so many different and complex file formats out there that it is really difficult to add support for all of them. PDF, the most widely used format for file exchange, is also very difficult to handle. There is no good free application or library to handle PDF. I know about the PDF library iText and annotation tools like Jarnal and Multivalent, but even these are tough to incorporate into any system.

The next problem such software may face is to present the data according to some specific standard. There are so many different organizations and so many different formatting styles that it is really hard to add support for all of them. On the other hand some text processing systems, for example LaTeX, can produce a final document from a bibliographic database and source document, but others like Microsoft Word 2003 can not do it. Adding this support for all such systems is important but very difficult.

The platform issue limits the application’s ability to communicate between different systems.

The search capability has to deal with the differences in internet connection methods in different operating systems. Another issue is to provide search support for all different bibliographic databases in different research domains.

So it is easy to see why most of the free tools and many of the commercial tools do not support all the required functionalities.

Free Tools for Bibliographic Management
There are many different tools available for this purpose. They can be divided into three categories: Application, Web based, and Hybrid. A Google search on bibliographic tools returned this comprehensive survey on bibliographic tools. Its list includes even the smallest possible script to count the number of bibliographic entries in a database. The purpose of this writing is not to include every possible bibliographic tool but to compare the tools that are decent enough to do the tasks specified earlier. (update) Links to some other useful articles submitted by the readers:

Table 1: Comparison matrix of free bibliography management applications.

AppType | Tools | Search | Store | Annotate | Communicate | Platform | Presentation | License*
Application | JabRef | Medline, citeseer, IEEExplore | Pdf, ps | Single note | BibTex, RIS, MODS XML | All | HTML, RTF | Open Source
Application | Bibdesk** | Pubmed, Z39.50 | Yes | Single note | BibTex, RIS, MODS XML | Mac | HTML, RTF | Open Source
Application | PBib | No | No | No | BibTex, Endnote | All | HTML | Open Source, Copyrighted
Application | pybliographer | Pubmed | No | No | BibTex, Endnote | All | Unkn. | GPL
Application | HyperBIBTEX | Unkn. | Unkn. | Unkn. | Unkn. | Mac | Unkn. | Free, Copyrighted
Application | KBibTeX | Unkn. | Unkn. | Unkn. | Unkn. | Linux | Unkn. | GPL 2
Application | Bibus | yes | no | Single note | RIS, Refer, Medline | Windows, Linux, Mac | Integrates with Word and OOo | GPL 2
Web based | Aigaion | Unkn. | Yes | Yes | BibTex, RIS | All | HTML, RTF | GPL
Web based | bibsonomy | yes | Unkn. | Unkn. | BibTex, RIS | All | Unkn. | X
Web based | CiteULike | yes | Unkn. | Unkn. | Unkn. | All | Unkn. | X
Web based | EasyBib | yes | Unkn. | Unkn. | Unkn. | All | Unkn. | X
Web based | RefBase * | yes | yes | Shared/ Personal | BibTeX, RIS, MODS XML, COinS | All | ASCII, HTML, LaTeX, MarkDown, PDF, RTF, etc. | X
Web based | BibConverter! | yes | no | no | IEEEXplr, Eng Vil and ISI. | All | BibTex | X
Web based | WIKINDX$$ | PubMed | yes | yes | BibTeX, Endnote, RIS | All | Rtf, HTML + format editor | GPL
Hybrid | Zotero | yes | yes | Notes, Snapshots | BibTex, RIS, MODS, RDF, Refer, Bibex, COinS | All (firefox plugin) | RTF, HTML# | Open Source

This matrix is, however, neither complete nor perfect. I hope to update it if I find more information. I would appreciate your feedback here.

Commercial Tools for Bibliographic Management
There are more applications than you can imagine for solving this problem. Norman listed almost all known commercial packages on his website. He also has a Bibliographic Grid comparing almost all the commercial packages. This page, however, does not include Microsoft Word 2007. Microsoft recently added bibliographic support in Microsoft Word 2007. Although it is still at a primitive stage, its conformance with an open standard will allow people to come up with solutions in this area. This Microsoft Word 2007 team blog post on the bibliographic feature will help you learn more about it. Also don’t forget to see this document in msdn2.

I could not find a single resource on free bibliographic tools when I was searching for it. Even the free bibliographic tools do not show up with a moderate Google search. After I found JabRef I promptly started using it. Later I found other tools. I have not used all of them. But the comparison matrix will definitely help me hunt down others and choose which one fits best for me. Hope this helps any avid researcher out there.

April 27, 2006

CD Selector: A “run from cd” tool

Filed under: Research — mahbub @ 3:17 pm

In my research lab we develop programs that use the `A’ geometric kernel. As the projects are inherited from previous students, most of them are built around `A’ 7, an old version with several architectural problems. In many cases we wanted to run these programs off a CD or another removable device, as we are not allowed to install `A’ on other computers. So I created this program so that any kind of program can run from the CD, with the dlls located on the CD. I am pasting the user manual below.

This document describes how to create applications that rely on external dlls such as `A’. It should be mentioned that the approach works for any kind of dll based application.

Requirements

  1. The `A’ dlls need to be on the system path. This can be achieved by copying all the relevant dlls (for the debug version use the files from NT_DLLD, and for the release version use the NT_DLL files) into the Windows system directory or into the program directory, but copying in this manner is not good practice. `A’ usually relies on the PATH environment variable instead, but changing it requires administrative privileges and may not always be doable due to license restrictions. Alternatively, one can copy the dlls into a folder and create a BATCH file that updates the PATH variable to include the `A’ dll directory and then calls the program. The program will then “see” the `A’ directory in its path and will not complain.
  2. System files may not match between the development system and other systems. In particular, a program compiled with the DEBUG switch turned on will contain debug symbols and will look for the debug version of every dll, so it will fail on other computers that don’t have dlls with the correct debug symbols.
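The batch-file workaround from the first requirement can be sketched as follows. This is a Python illustration rather than an actual BATCH file, and the helper names build_env and run_with_dlls are made up for the example; the idea is only that the dll folder is added to PATH for the launched program, without touching the system-wide setting.

```python
import os
import subprocess


def build_env(dll_dir, base_env=None):
    """Return a copy of the environment with dll_dir put first in PATH."""
    env = dict(os.environ if base_env is None else base_env)
    old_path = env.get("PATH", "")
    # os.pathsep is ";" on Windows and ":" on Unix-like systems.
    env["PATH"] = dll_dir + (os.pathsep + old_path if old_path else "")
    return env


def run_with_dlls(program, dll_dir):
    """Start program so that it can find the dlls in dll_dir; return its exit code."""
    return subprocess.call([program], env=build_env(dll_dir))
```

Because only the child process receives the modified PATH, no administrative privileges are needed and nothing on the host computer is changed.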

Solution

After trying different options, this one seemed the most suitable. An application has been developed in Visual C++ that is itself compiled as a static program, so it does not depend on dlls. It reads a file named “CDSelector.inf” in the following format:

[COMMON]

PATH=dlls\a_dlls;%WINDIR%\system32;%WINDIR%

[Feature Tutor 2]

Tutor2\Tutor2.exe

[AppViewer]

AppViewer\AppViewer.exe

[NRep Output Generator]

OutGen\OUTGen.exe

[Shen's GDT System]

SOME=ShenTestbed

CODE=%SOME%\code

DATA=%SOME%\data

DOF=%SOME%\global_model

QHULL=%SOME%\local_model

SAT=%SOME%\sat

XLS=%SOME%\demo-xls

SomeSys\System.exe

The common portion is applied to all programs. The common portion sets some environment variables applicable for all applications.

Every other portion, starting with a name enclosed within a left [ and a right ], is a separate program. Each program can define as many variables as it wants, followed by a line containing the path to the program. So the CDSelector will read “CDSelector.inf” and list the programs as shown below:

Double click the program name, or select the name and click Run, to run the program. You can add a comment by putting “;” or “rem” at the start of a line.
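To make the file format concrete, here is a minimal Python sketch of a reader for it. It is not the CDSelector source (which is written in Visual C++). The rules it assumes are: a line containing “=” defines a variable, any other line inside a program section is the program path, and %NAME% refers to a variable defined earlier in the same section. Names that cannot be resolved, such as %WINDIR%, are left untouched.

```python
import re

VAR = re.compile(r"%([^%]+)%")


def expand(value, variables):
    """Replace %NAME% with its value; unknown names are left as they are."""
    return VAR.sub(lambda m: variables.get(m.group(1), m.group(0)), value)


def parse_inf(text):
    """Return (common_vars, programs) read from CDSelector.inf style text.

    programs maps each section name to {"vars": {...}, "path": str or None}.
    """
    common, programs, section = {}, {}, None
    for raw in text.splitlines():
        line = raw.strip()
        lowered = line.lower()
        if not line or line.startswith(";") or lowered == "rem" or lowered.startswith("rem "):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            if section != "COMMON":
                programs[section] = {"vars": {}, "path": None}
            continue
        if section is None:
            continue  # ignore anything before the first section
        target = common if section == "COMMON" else programs[section]["vars"]
        if "=" in line:
            name, value = line.split("=", 1)
            target[name.strip()] = expand(value.strip(), target)
        elif section != "COMMON":
            programs[section]["path"] = expand(line, target)
    return common, programs
```

A launcher could then merge the common variables and the program's own variables into the environment before starting the program at the given path.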

To solve the problem associated with mismatched system files, try running the program using this CDSelector and see what error message you get. If you get a crash, find the name of the file that caused the problem by clicking the “Details” button, and note it down. Copy that file from the development machine, usually from the Windows system directory, and place it inside the folder containing the application. This will hopefully solve the problem. Try running the program on different versions of Windows and on different computers.

To run from a CD you can place an “Autorun.inf” in the root of the CD containing:

[autorun]

open=CDSelector.exe

icon= CDSelector.exe,0

It can be downloaded here.

Powered by Qumana

April 22, 2006

A C++ Color class for using with opengl

Filed under: Uncategorized — mahbub @ 9:05 pm

I know this is nothing new. But when I was looking for something similar to the C# Color class, with tons of predefined color values, I found one which somewhat matched my requirements, but it was for MFC. So I took the predefined color codes from that class and added some functions for my own purposes. Code Project has two of them hosted. Here are the links:

http://www.codeproject.com/gdi/ccolor.asp

http://www.codeproject.com/bitmap/ccolor.asp

I will not paste the entire file here, just a portion. The files can be downloaded from here. Let me discuss some of the characteristics of the class.

The entire class is inside the Mahbub namespace. The Color class has four color components: red, green, blue and alpha. It has an option to store a name for the color as well, but the memory is never allocated if no name is specified. You can use a random number generator to generate a random color from it. I used the Mersenne Twister random number generator with it. I also used a Mathematics wrapper class with it. I will post the Mathematics wrapper class description later. You can download it from here or just replace it with the appropriate functions from math.h.

Here is an example usage of it:

Color a = Color::RED;

glColor(a);

If you find it useful or have any suggestion please don’t forget to let me know, udvranto (a) yahoo.

April 20, 2006

Programming language: language or translator?

Filed under: Uncategorized — mahbub @ 2:24 pm

This view of programming languages came to my mind today while I was attending a rather boring class on an interesting topic. I had spent the last two days doing C++ coding and was thinking about it in the class. What is a programming language?

When you write something in assembly language, what happens? The assembler translates it into machine code and it is executed on the machine. The same notion applies everywhere. So essentially every compiler is a translator, a mapping between two sets of instructions. I think YACC and other parser tools came out of this concept. Gotta read something on these.
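As a toy illustration of a translator being a mapping between two sets of instructions, here is a small Python sketch. The mnemonics and opcode values are entirely made up and do not belong to any real machine.

```python
# Hypothetical toy instruction set: mnemonic -> one-byte opcode.
OPCODES = {"LOAD": 0x01, "ADD": 0x02, "STORE": 0x03, "HALT": 0xFF}


def translate(source):
    """Translate lines of 'MNEMONIC [operand ...]' into toy machine code bytes."""
    out = bytearray()
    for line in source.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        out.append(OPCODES[parts[0].upper()])
        # Each operand is assumed to be a small integer that fits in one byte.
        out.extend(int(operand) for operand in parts[1:])
    return bytes(out)


if __name__ == "__main__":
    print(translate("LOAD 7\nADD 5\nSTORE 0\nHALT").hex())
```

A real compiler does far more work (parsing, checking, optimizing), but the last step still boils down to a mapping like this one.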

December 25, 2005

First day at Staten Island, NY

Filed under: Uncategorized — mahbub @ 10:02 pm

We spent the whole day on the plane journey. We reached La Guardia, NY at 8pm local time. It was raining. On our way to Tushi’s uncle’s home, I found New York City a lot like Dhaka: a densely packed city with a dull appearance. It will be raining tomorrow also. Let’s see where we can go.
